Tumor Genome Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST512
Cancer
  Cancer will affect 1 in 2 men and 1 in 3 women in the United States, and the number of new cases of cancer is set to nearly double by the year
  Cancer is a genetic disease caused by mutations in the DNA
  Clinically, tumors can look the same, but most differ genetically
Different Sequencing Approaches
  Capture-seq ($ )
    – Could focus on well-known mutations
  Exome-seq ($700-2K)
    – All the exons in genes; promoters and lncRNA genes?
  RNA-seq ($500-2K)
    – Expression and mutations together; miss anything?
  Whole genome sequencing ($3-4K)
    – Majority of mutations are non-coding, function unknown
    – Better at detecting structural changes (translocations, fusions)
    – Cost-vs-benefit balance
Two Major Cancer Genome Projects
  TCGA: The Cancer Genome Atlas (US)
    – > 30 cancer types and > 10K tumor samples
    – Primary tumors, fewer death events
    – Genome, transcriptome, DNA methylome, proteomics
    – Rigorous tumor sample QC, consistent profiling platform
  ICGC: International Cancer Genome Consortium (11 countries)
    – 20 cancer types × 500 tumor samples each
Tumor Gene Expression
  Microarrays or RNA-seq
  Data analysis?
  Differential expression between cancer and normal
  Cluster the tumor samples into sub-types
    – Consensus clustering: resample genes or tumors to get a robust clustering
  Predict patient outcome (survival or recurrence)
Break
Survival Analysis
  Do patients receiving the treatment live longer?
  Are smokers more likely to have cancer recurrence?
  Censored data: the value of a measurement or observation is only partially known
    – Some patients left the study
    – Study concluded
Survival Without Censoring
Survival With Censoring
Kaplan-Meier Curve
  More individuals in each group and better separation of the groups give a smaller (more significant) p-value
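To make the curve concrete, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. The function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration; in practice a survival library would normally be used. Patients censored at time t are still counted as at risk at t.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # everyone who leaves the risk set at t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical): five patients, two of them censored
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
print([(t, round(s, 3)) for t, s in kaplan_meier(times, events)])
# prints [(2, 0.8), (3, 0.6), (5, 0.3)]
```

Censored patients never cause a drop in the curve; they only shrink the risk set, which makes later drops larger.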
Log Rank Test
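The log-rank test compares, at every event time, the observed number of events in one group with the number expected if both groups shared the same hazard. The sketch below is a plain-Python illustration for two groups; the function name `logrank_test` and the toy inputs are assumptions, and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Events occurring exactly at time t, per group
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of d_a under the null hypothesis
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (hypothetical): group A fails early, group B fails late
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(round(chi2, 2), p < 0.05)
```

With more patients per group or wider separation, the observed-minus-expected gap grows relative to its variance, which is why the p-value improves.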
More Variables
  50-signature?
  Logistic regression:
    – Estimate odds ratio: ratio of odds
    – Linear combination of all the genes to separate outcome (0, 1)
  Cox Regression
    – Estimate hazard ratio: ratio of incidence rates
    – Models the effect of covariates on the hazard rate but leaves the baseline hazard rate unspecified
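For reference, the two models can be written in their standard textbook form. The notation is an assumption, not taken from the slides: x_1..x_p are the gene expression covariates, beta_j their coefficients, and h_0(t) the unspecified baseline hazard.

```latex
% Logistic regression: binary outcome y in {0, 1}
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_j ,
\qquad
\text{odds ratio per unit increase in } x_j = e^{\beta_j}

% Cox proportional hazards regression: time-to-event outcome
h(t \mid x) = h_0(t) \, \exp\Big( \sum_{j=1}^{p} \beta_j x_j \Big) ,
\qquad
\text{hazard ratio per unit increase in } x_j = e^{\beta_j}
```

Because h_0(t) cancels in the hazard ratio, the coefficients can be estimated without ever specifying the baseline hazard.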
Use Cox Regression to Separate Two Groups by Gene Signature
Caution About Gene Signature’s Predictive Power
Break
Mutations in the Tumor Genome
  Help us identify important genes for tumorigenesis and cancer progression
  Drivers – a.k.a. gatekeepers, mutations that cause and accelerate cancers
  Passengers – Accidental by-products and thwarted DNA-repair mechanisms
  Recurrent mutations on genes or pathways are likely drivers
High Throughput Driver Detection
  Differential gene expression
  Copy number aberration (CNA) or variation (CNV) using CGH, tiling or SNP arrays
Comparative genomic hybridization (CGH)
GISTIC
  G-score: combines the frequency of occurrence and the amplitude of the aberration
  Statistical significance evaluated by permutation
  FDR adjustment for multiple hypothesis testing
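The three steps above can be illustrated with a toy sketch. This is not the GISTIC implementation: the function names, the 0.1 amplitude threshold, and the within-sample shuffling scheme are all assumptions chosen to keep the example short. The G-score here is the sum of above-threshold amplitudes across samples (frequency times mean amplitude), the null comes from permuting locus positions within each sample, and q-values use the Benjamini-Hochberg procedure.

```python
import random


def g_scores(cn, threshold=0.1):
    """cn[sample][locus] holds log2 copy-number ratios.
    G-score per locus = sum of amplitudes that exceed the threshold."""
    n_loci = len(cn[0])
    return [sum(row[j] for row in cn if row[j] > threshold)
            for j in range(n_loci)]


def permutation_p_values(cn, n_perm=200, threshold=0.1, seed=0):
    """Empirical p-value per locus from a pooled permutation null."""
    rng = random.Random(seed)
    observed = g_scores(cn, threshold)
    null = []
    for _ in range(n_perm):
        # Shuffle locus positions independently within each sample
        shuffled = [rng.sample(row, len(row)) for row in cn]
        null.extend(g_scores(shuffled, threshold))
    # Add-one correction keeps p-values strictly positive
    return [(1 + sum(1 for g in null if g >= obs)) / (1 + len(null))
            for obs in observed]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted q-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

A hypothetical call would be `benjamini_hochberg(permutation_p_values(cn))`, where `cn` is a list of per-sample lists of log2 ratios; loci with small q-values are candidate recurrent amplifications.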
GATK
  FASTQ -> BAM
  BAM -> VCF
  Annotate
MAF and VCF Formats
  VCF (GWAS format) and MAF (TCGA format)
  Both can annotate somatic mutations and germline variants
  Tab-delimited text file
  VCF columns:
    – CHROM
    – POS
    – ID (SNP id, gene symbol, or ENTREZ gene id)
    – REF (reference sequence)
    – ALT (altered sequence)
    – QUAL (quality score)
    – FILTER (PASS, or the failed filters, e.g. “q10;s50”: quality below 10; fewer than 50% of samples have data here)
    – INFO (allele counts, total counts, number of samples with data, somatic or not, validated, etc.)
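As an illustration of the column layout, the sketch below splits one VCF data line into its eight fixed fields using only the standard library. The function name `parse_vcf_line` and the example record are assumptions for illustration; real pipelines would normally rely on a dedicated VCF library, and header lines (starting with `#`) and per-sample genotype columns are ignored here.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),  # several alternate alleles are allowed
        "QUAL": None if qual == "." else float(qual),
        # An empty list means the record passed (or filters were not applied)
        "FILTER": [] if flt in (".", "PASS") else flt.split(";"),
        "INFO": info_dict,
    }


# Illustrative record, not taken from the lecture data
record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
print(record["CHROM"], record["POS"], record["ALT"], record["INFO"]["DP"])
# prints 20 14370 ['A'] 14
```

INFO values are kept as strings because their types are declared in the VCF header, which this sketch does not read.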
Example of a Cancer Genome Mutations Profile
  Circos Plot: how messed up a cancer genome is
Total alterations affecting protein-coding genes in selected tumors
  Vogelstein et al, Science
Somatic Mutation Frequency in 3K Tumor-Normal Pairs
  Typical tumors: median 45 mutations / tumor
  More mutations in tumors of tissues facing the outside environment
Break
TS vs Oncogenes, GoF vs LoF
  Tumor suppressors (TS) vs oncogenes
  Gain of Function (GoF) or Loss of Function (LoF) mutations
    – Phenotypes
  How to tell?
    – From mutation patterns
    – From expression patterns
    – Functional studies
  Some genes can be both TS and oncogenes
Mutation Rate Heterogeneity
  Mutation rate correlated with replication timing, gene expression, and gene length
  Tumor evolution and selection
  Lawrence et al, Nature 2013
Recurrent Mutations
  Figure legend: known; novel with clear cancer association; novel
  Lawrence et al, Nature 2014
How Much Should We Sequence?
  Need ~200 patients for 20% mutation rate, ~550 patients for 10%, ~1200 patients for 5% mutation rate
  Most driver mutations have been found; there is a pressing need in basic cancer research to study their function
  Biggest surprise: mutations on chromatin regulators
    – > 50% new and strong cancer driver genes
    – Oncogenes: DNMT3A, IDH1
    – Tumor suppressors: MLL, ATRX, ARID1A, SNF5
    – Both: EZH2
  Sequencing metastasized or drug-resistant tumors might yield insights on tumor progression
Resources
  MSKCC CBioPortal
    – GUI for experimental biologists
  Broad FireHose
    – API for accessing processed TCGA data
  UCSC CGHub
    – API for accessing raw and processed cancer data
  Sanger COSMIC
    – Catalog of Somatic Mutations in Cancer
  Many also provide software tools
Summary
  Different sequencing approaches
  Gene expression, tumor sub-typing
  Survival analysis: KM vs Cox Regression
  Different mutation types and distributions
  Gain or loss of function mutations
  Tumor suppressor vs oncogenes
Acknowledgement
  Aleksandar Milosavljevic
  Kristin Sainani
  Linda Staub & Alexandros Gekenidis
  Yin Bun Cheung, Paul Yip
  John Pack
  Cheng Li
  Xujun Wang
  Peng Jiang