[1] “problematic genes removed: 228”

[1] “problematic genes removed: 228”

[1] “problematic genes removed: 203”

[1] “problematic genes removed: 6”

[1] “problematic genes removed: 5”

[1] “problematic genes removed: 462”

Methods

We obtained all somatic mutations from The Cancer Genome Atlas (TCGA), which covers 11,330 patients across 33 cancer types. Mutation annotation format (MAF) files were downloaded from Firehose (http://gdac.broadinstitute.org/). Average gene expression levels (microarray RMA, log2 ratio) for all genes in cancer cells were derived from 91 Cancer Cell Line Encyclopedia (CCLE) cell lines. Cancer genes were retrieved from the Cancer Gene Census (CGC; http://cancer.sanger.ac.uk/census, COSMIC v81). To facilitate the calculation of mutation rates, we excluded cancer genes that are affected by translocations, amplifications or large deletions. This left 262 cancer genes, of which 67 are annotated as oncogenes and 100 as tumor suppressors.

We calculated the raw mutation rate of each gene as the number of mutations per base pair of its coding sequence (CDS), normalized by sample size. As a background for comparison, we used the mutation rates of 16,617 non-cancer genes (excluding the 262 cancer genes and 2,077 genes lacking expression information). These genes were first divided into 20 bins according to expression level. For each bin, we computed an error bar marking the 25th percentile, the median and the 75th percentile of the mutation rate. We then fitted a baseline to the mutation rates of non-cancer genes using a generalized additive model (GAM) as provided through the R package ggplot2.

Because different genes have different background (random) mutation rates, we also used MutsigCV, which employs a sophisticated model to estimate gene-specific and patient-specific background mutation rates and to identify significantly mutated genes (reported as a Q-score; the smaller the Q-score, the higher the significance). For the MutsigCV calculation, the combined pan-cancer MAF file was used as input; all other input files and options were left at their defaults. Data analysis and visualization were conducted in R with the ggplot2 package.
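The raw mutation rate and the per-bin error bars described above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the original analysis script: the column names (`n_mut`, `cds_length`, `expression`) and the toy gene count are hypothetical, and the bins are assumed to contain equal numbers of genes, since the text does not say whether equal-count or equal-width bins were used.

```r
# Toy data only; column names are hypothetical, not from the original analysis.
set.seed(1)
n_samples <- 11330   # cohort size quoted in the text
n_bins    <- 20      # number of expression bins quoted in the text
n_genes   <- 200     # small toy gene count, chosen for illustration

genes <- data.frame(
  n_mut      = rpois(n_genes, lambda = 40),         # mutations across all samples
  cds_length = sample(600:9000, n_genes),           # CDS length in base pairs
  expression = rnorm(n_genes, mean = 7, sd = 2)     # average expression level
)

# Raw mutation rate: mutations per CDS base pair, normalized by sample size.
genes$mut_rate <- genes$n_mut / (genes$cds_length * n_samples)

# Assumption: equal-count bins. Rank genes by expression, then map ranks to 1..n_bins.
expr_rank <- rank(genes$expression, ties.method = "first")
genes$bin <- ((expr_rank - 1) * n_bins) %/% n_genes + 1

# Per-bin 25th percentile, median and 75th percentile of the mutation rate.
q <- t(sapply(split(genes$mut_rate, genes$bin), quantile,
              probs = c(0.25, 0.5, 0.75)))
bin_stats <- data.frame(
  bin        = as.integer(rownames(q)),
  expression = as.numeric(tapply(genes$expression, genes$bin, median)),
  q25        = q[, 1],
  median     = q[, 2],
  q75        = q[, 3]
)
print(head(bin_stats))
```

Each row of `bin_stats` corresponds to one error bar, positioned at the bin's median expression level. A smooth baseline could then be overlaid on the gene-level points with, for example, `geom_smooth(method = "gam")` in ggplot2; whether the original figure used that exact call is an assumption.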

Results

Fig. X compares the mutation rates of cancer genes (oncogenes and tumor suppressor genes) with those of the de novo pyrimidine biosynthesis genes (CAD, DHODH, CPS1 and UMPS) as a function of average expression level in cancer. Plotting against expression level lets us examine how mutation rate depends on expression. Both the raw mutation rate (number of mutations per base pair in a gene; A) and the statistical significance of a gene's mutations relative to the expected mutations (B) are shown for comparison. The median mutation rates (or significance) of the other genes (those not annotated as cancer genes) serve as a baseline. The raw mutation rates of CAD, DHODH, CPS1 and UMPS are close to the baseline (Fig. XA), and these mutation rates are statistically indistinguishable from the expected mutation rates according to MutsigCV (Q is nearly 1). By comparison, many oncogenes and tumor suppressor genes are significantly mutated (20% with Q < 0.1).

CGC (Cancer Gene Census):
[1] Futreal, P. A. et al. A census of human cancer genes. Nature Reviews Cancer 4, 177-183 (2004).

CCLE (Cancer Cell Line Encyclopedia):
[2] Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603-607 (2012).

MutsigCV:
[3] Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214-218 (2013).

TCGA cancer mutation database:
“The results here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.”