1
Cancer Sequencing
2
What is Cancer? Definitions
A class of diseases characterized by malignant growth of a group of cells
- Growth is uncontrolled
- Invasive and damaging
- Often able to metastasize
An instance of such a disease (a malignant tumor)
A disease of the genome
What is a tumor? I like Wikipedia's definition: cancer is a class of diseases, and the word also refers to an instance of the disease. Finally, since the early days of karyotyping (imaging chromosomes with dyes), it has been clear that cancer is a disease of the genome. Image: "Representative G-banded karyotype from a metastatic melanoma". There are more than 100 cancer types. Some appear only in specific organs, and some have a very distinctive phenotype at the tissue level. The essential thing to know about cancer is that several fundamental pathways govern the transformation of normal tissue into cancerous tissue.
3
What is Cancer? Definitions
Because cancer is a disease of the genome, it must be understood as an evolutionary process. This point is critical to this lecture. You may have heard that a disease like HIV is famously hard to treat because it changes so rapidly. The same holds in cancer: you are combating an invasive organism. This is beautifully summarized by one of the seminal papers in cancer research, "The Hallmarks of Cancer", written by Douglas Hanahan and Robert Weinberg (reference in corner). It is a review paper, but it marks an important transition in the field, when researchers began to think of cancer in terms of a set of fundamental changes to cellular and extracellular behavior, and in evolutionary terms.
4
Fundamental Changes in Cancer Cell Physiology
Exploitation of natural pathways for cellular growth
- Growth signals (e.g. TGF family)
- Angiogenesis
- Tissue invasion and metastasis
Evasion of anti-cancer control mechanisms
- Apoptosis (e.g. p53)
- Antigrowth signals (e.g. pRb)
- Cell senescence
Acceleration of cellular evolution via genome instability
- DNA repair
- DNA polymerase
Most, and possibly all, cancers must overcome fundamental anti-cancer and homeostatic mechanisms. A couple of pathways to discuss:
- Growth signals: exogenous signals, transmembrane signalling proteins, intracellular pathways. For example, a cell can become able to produce and export its own growth signals, causing a feedback loop. In one of the most famous cases, certain kinases (tyrosine kinases) that recognize growth factors outside the cell and propagate the signal inside the cell become constitutively active.
- Apoptosis: perhaps the most famous pathway. When cell integrity is sufficiently disrupted, a pathway is triggered that destroys the cell. It consists of sensors, frequently triggered by sufficient disorganization of DNA structure, and effectors. If this pathway is turned off, a cell can continue to thrive despite extremely abnormal behavior. Example: the p53 tumor suppressor protein.
- Anti-growth signals: similar. Unless both internal and exogenous signals are in the correct state, they prevent the cell cycle from continuing and can even prompt the cell to enter a quiescent state in which it can never divide again. This must be circumvented for a cell to become cancerous.
- Cell senescence: telomerase adds hexamers to chromosome ends.
- Metastasis: cell-cell adhesion molecules (CAMs). Cells also change integrins so that surface molecules look different, and they need extracellular proteases.
Hanahan and Weinberg. The hallmarks of cancer. Cell 100:
5
Many Paths Lead to Cancer Self-Sufficiency
- Stages can happen in different orders
- There are multiple regulatory pathways responsible for each physiological change; in some cases multiple genes must be mutated to achieve the change
- At each stage, individual cells can die or stop dividing (apoptosis, quiescence, immune response), which creates strong selective pressure in an evolutionary context
This explains why most cancers do not occur until later in life: it takes a certain amount of time before sufficient mutations, in the right pathways, can accumulate to cause cancer. When you do the math, it also shows why it is nearly inevitable that you will get cancer if you live long enough. It is also important to think about this in terms of genetic predisposition to cancer. If an essential pathway, like TP53, is already ineffective at birth, then you have seeded your entire body with a potential first step towards cancer. That is why genetic tests for genes like BRCA are so important: if you have the mutation, it drastically alters your risk of a certain cancer type. Hanahan, Douglas, and Ra Weinberg. The hallmarks of cancer. Cell 100:
6
Cancer Heterogeneity
Start out with a small group of cells that have started to transform into a benign tumor. It differentiates along several different paths, one of which is a cancer. There is explosive growth of cells, but it leads to many different subpopulations. Important: up until now, you have been considering sequencing an animal or an individual with a diploid genome. Suddenly that is not so simple. There are many different "individuals" in a cancer genome, which is part of the challenge. This is also important from a clinical perspective: when a patient gets chemotherapy or targeted drug therapy, some cells die but some do not. If the cancer recurs, those surviving cells will make up the next round of cancer cells.
For example, transforming growth factor-β is a cancer-ecosystem regulatory molecule. Other cellular and cytokine components of inflammatory lesions are potent and common modulators of the cancer-cell ecosystem. The interaction between cancer cells and their tissue habitats is reciprocal. Cancer cells can remodel tissue micro-environments and specialized niches to their competitive advantage. [...] invasion by inflammatory or endothelial cells, is modified by external factors. As well as the tissue site, the ecosystem for each cancer includes environmental, lifestyle and associated aetiological exposure of the patient. Genotoxic exposure (such as cigarette carcinogens or ultraviolet light), infection, and long-term dietary and exercise habits that affect calorie, hormone or inflammation levels can have a profound effect on the tissue micro-environments, as well as directly on cancer cells.
Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–13 (2012).
7
Why Sequence Cancer Genomes?
Better understand cancer biology Pathway information Types of mutations found in different cancers STOP FOR QUESTIONS HERE
8
Why Sequence Cancer Genomes?
Better understand cancer biology
- Pathway information
- Types of mutations found in different cancers
Cancer diagnosis
- Genetic signatures of cancer types will inform diagnosis
- Non-invasive means of detecting or confirming presence of cancer
Improve cancer therapies
- Targeted treatment of cancer subtypes
COSMIC Database, v48, July 2010: Samples 544809; Mutations 141212; Papers 10383; Whole Genomes 29
EGFR example for lung cancer: EGFR inhibitors are very effective for lung tumors with EGFR mutations and have no effect if EGFR is wild type.
Forbes et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research 39: D945-D950
9
Why Sequence Cancer Genomes?
Better understand cancer biology
- Pathway information
- Types of mutations found in different cancers
Cancer diagnosis
- Genetic signatures of cancer types will inform diagnosis
- Non-invasive means of detecting or confirming presence of cancer
Improve cancer therapies
- Targeted treatment of cancer subtypes
COSMIC Database, v71, Oct 2014: Samples Mutations Papers 20247 Whole Genomes 15047
There is now a whole-genome browser specifically for looking at samples for which the whole genome is available.
10
How Do We Sequence Cancer Genomes?
Fortunately, you guys had an introduction to sequencing yesterday
11
How Do We Sequence Cancer Genomes?
12
Read Mapping
13
Definition of Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L
How much coverage is enough?
Lander-Waterman model: assuming a uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
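The coverage definition and the Lander-Waterman estimate can be checked numerically. The sketch below assumes the usual Poisson form of the model, in which a base is left uncovered with probability e^(-C) and the expected number of gaps is roughly n e^(-C); the function names and the read length of 500 are illustrative assumptions, not taken from the slides.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / genome_len


def uncovered_fraction(c: float) -> float:
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-c)


def expected_gaps(n_reads: int, c: float) -> float:
    """Approximate expected number of gaps, n * e^(-C).

    Ignores the minimum overlap needed to join two reads.
    """
    return n_reads * math.exp(-c)


if __name__ == "__main__":
    # Hypothetical numbers: 1 Mb segment, 500 bp reads, 20,000 reads.
    c = coverage(20_000, 500, 1_000_000)
    print(c, uncovered_fraction(c), expected_gaps(20_000, c))
```

With these assumed numbers C is 10 and the expected gap count is a little under one per million nucleotides, in line with the figure quoted on the slide.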
14
Read Mapping BWA
15
Paired-End Read Mapping
[Figure: paired-end reads aligned to the reference. Physical Coverage: 4, Sequence Coverage: 2]
- Physical coverage refers to the genomic coverage including the unsequenced regions of each DNA fragment
- Sequence coverage refers to the genomic coverage counting only the sequenced part of each DNA fragment
Increased gap length between paired reads provides higher physical coverage without incurring increased costs for sequencing, which is useful for detecting certain types of mutations
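A small sketch of the two coverage definitions, assuming every fragment has the same length and both ends are sequenced to the same read length. The function names and the numbers in the example are hypothetical, chosen to reproduce the 4 versus 2 ratio shown in the figure.

```python
def sequence_coverage(n_pairs: int, read_len: int, genome_len: int) -> float:
    """Counts only the sequenced bases: two reads per fragment."""
    return n_pairs * 2 * read_len / genome_len


def physical_coverage(n_pairs: int, fragment_len: int, genome_len: int) -> float:
    """Counts the whole fragment, including the unsequenced middle."""
    return n_pairs * fragment_len / genome_len


if __name__ == "__main__":
    # Hypothetical library: 1,000 pairs of 100 bp reads from 400 bp fragments
    # over a 100 kb region.
    print(sequence_coverage(1_000, 100, 100_000))  # sequenced bases only
    print(physical_coverage(1_000, 400, 100_000))  # whole fragments
```

Doubling the fragment length in this sketch doubles the physical coverage while the sequence coverage, and therefore the sequencing cost, stays the same.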
16
Considerations for Cancer Sequencing
Factors that affect mutation signal
- Limited genetic material (lower depth)
- Mixture of tumor and normal tissue
- Cancer heterogeneity
Factors that introduce noise
- Formalin-fixed and paraffin-embedded samples
- Increased number of mutations and unusual genomic rearrangements
General consideration
- Each individual has many unique mutations that could be confused with cancer-causing mutations
So ideally, you would want a model that is geared towards overcoming these challenges and that takes into account the fact that you are trying to distinguish a given individual's healthy genome from the cancer genome.
17
Human Genome Variation
[Figure: types of human genome variation, illustrated on short sequences such as TGCTGAGA vs. TGCCGAGA: SNP, novel sequence insertion, mobile element or pseudogene insertion, inversion, translocation, tandem duplication, microdeletion, transposition, novel sequence at breakpoint, large deletion]
18
Variant Types
- Single Nucleotide Variants (SNVs)
- Small Insertions / Deletions (indels)
- Copy Number Variants (CNVs)
- Structural Variants (SVs)
- Novel Sequence
19
SNV Calling
Variant Types: Single Nucleotide Variants (SNVs), Small Insertions / Deletions (indels), Copy Number Variants (CNVs), Structural Variants (SVs), Novel Sequence
SNV vs. SNP: an SNV is any location where one or more genomes of interest differ from the reference. A SNP is usually defined as a single-base difference from the reference that recurs in the population with a frequency greater than 1%. A rare variant has a frequency below 1% but still recurs in the population. "Mutation" means the variant is very uncommon.
A Bayesian approach is the most general and common method of calling SNVs: MAQ, SOAPsnp, Genome Analysis Toolkit (GATK), SAMtools
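To make the Bayesian idea concrete, here is a minimal sketch of a diploid genotype posterior computed from read counts at one site. It is not the model used by MAQ, SOAPsnp, GATK, or SAMtools; the binomial likelihood, the flat per-read error rate, and the prior values are all simplifying assumptions chosen for illustration.

```python
import math


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=(0.998, 0.001, 0.001)):
    """Posterior over (ref/ref, ref/alt, alt/alt) given ref and alt read counts.

    Assumes each read independently shows the alt base with probability
    `error`, 0.5, or 1 - `error` under the three genotypes. The binomial
    coefficient is the same for every genotype, so it cancels.
    """
    p_alt = (error, 0.5, 1.0 - error)
    log_post = []
    for p, prior in zip(p_alt, priors):
        log_lik = n_alt * math.log(p) + n_ref * math.log(1.0 - p)
        log_post.append(math.log(prior) + log_lik)
    top = max(log_post)  # subtract the maximum before exponentiating
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for counts in [(20, 0), (10, 10), (0, 20)]:
        print(counts, genotype_posteriors(*counts))
```

Real callers also weight each read by its base and mapping quality; this sketch treats all reads as equally reliable.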
21
SNV Calling
A given human genome (germline) differs from the reference genome at millions of positions. A cancer genome differs from the healthy genome of its host at tens of thousands of positions at most, roughly two orders of magnitude fewer differences than germline versus reference.
How do we distinguish germline mutations from somatic mutations?
22
Somatic SNV calling
Normal Tissue / Tumor Tissue: compare the alignment results
Most naïve approach: use a standard SNV caller on both datasets. If a mutation is found in the tumor sample but not in the normal sample, call it somatic!
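The naïve comparison amounts to a set difference between two call sets. The sketch below assumes each call is represented as a `(chromosome, position, ref, alt)` tuple; that representation and the function name are illustrative choices, not part of any particular caller's output format.

```python
def naive_somatic_calls(tumor_calls, normal_calls):
    """Return variants called in the tumor but absent from the matched normal."""
    return sorted(set(tumor_calls) - set(normal_calls))


if __name__ == "__main__":
    # Hypothetical call sets for one patient.
    normal = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
    tumor = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"),
             ("chr17", 7578406, "C", "T")]
    print(naive_somatic_calls(tumor, normal))
```

A germline variant missed in the normal sample because of low depth would be reported as somatic here, which is one motivation for joint tumor-normal models.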
23
Somatic SNV calling JointSNVMix
probabilistic graphical models for joint tumor-normal SNV calling Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28, 907–13 (2012).
24
Short Indel Calling
[Figure: an insertion and a deletion shown relative to the reference]
For a very small deletion, we can use a Bayesian method similar to the one shown previously, adding the option of a deletion in either the reference or the genome of interest. However, as indels get larger (>10 bp, depending on read size), it becomes hard to map reads correctly to these locations, so a new approach is needed.
25
Short Indel Calling
Read mapping in practice
[Figure: reads aligned to the reference around an insertion and a deletion, showing the unmappable part of a read (just the read end) and an unmapped read (could not be aligned anywhere)]
26
Short Indel Calling – Discordant Reads Pairs
I) Insertion of length i: a read pair from a fragment of length l maps to the reference at distance l - i
II) Deletion of length d: a read pair from a fragment of length l maps to the reference at distance l + d
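The relationship between mapped distance and indel size suggests a simple test for discordant pairs. The sketch below assumes a known expected fragment length and a fixed tolerance for normal library variation; both parameters and the function name are hypothetical, and real tools estimate the fragment-length distribution from the data.

```python
def classify_pair(mapped_distance: int, expected_len: int, tolerance: int):
    """Classify a read pair by how far apart its ends map on the reference.

    A pair that maps farther apart than expected (l + d) suggests a deletion
    of size d in the sample; a pair that maps closer together (l - i)
    suggests an insertion of size i.
    """
    delta = mapped_distance - expected_len
    if delta > tolerance:
        return ("deletion", delta)
    if delta < -tolerance:
        return ("insertion", -delta)
    return ("concordant", 0)


if __name__ == "__main__":
    # Hypothetical library with 400 bp fragments and a 50 bp tolerance.
    for distance in (410, 650, 300):
        print(distance, classify_pair(distance, 400, 50))
```

An insertion longer than the fragment itself cannot be detected this way, since no pair can span it.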
27
Short Indel Calling – Split Read Mapping
[Figure: a deletion relative to the reference; read mapping in practice]
28
Short Indel Calling – Split Read Mapping
[Figure: a deletion relative to the reference; read mapping in practice]
Remap each end of the suspicious reads
29
Paired-end mapping can improve power to detect variants without need for more sequencing
The same sequence coverage means the same amount of sequencing is performed. However, increased physical coverage gives much more power to detect a mutation, because more read pairs span a given event. Take into consideration the variance in the read length as well as the cost of making mate-pair libraries. Modified from Meyerson et al. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October):
30
Copy Number Variants
[Figure: sample sequence A B C D C E F G H C I K, in which segment C appears three times, compared with Ref: A B C D E F G H I K]
CNVs are tricky because mapping is not disrupted. The category includes duplications.
31
Copy Number Variants
Depth of Coverage
[Figure: depth of coverage over segment C for the sample A B C D C E F G H C I K against Ref: A B C D E F G H I K]
- Coverage is variable, so long CNVs are needed for this to work
- Coverage bias can be modeled, but only to a degree
- Can't locate CNVs
Modified from Dalca and Brudno. Genome variation discovery with high-throughput sequencing data. Briefings in Bioinformatics 11, no. 1: 3-14
32
Copy Number Variants
Problems with DOC
- Very sensitive to stochastic variance in coverage
- Sensitive to coverage bias (e.g. GC content)
- Impossible to determine non-reference locations of CNVs
Graph methods using paired-end reads help overcome some of these problems
33
Copy Number Variants - CNAnorm
Overall steps in the CNAnorm method, a tool for detecting copy number changes in tumor samples
Data: sequence data from tumor and normal samples
Steps:
1. Count the number of reads in fixed windows across the genome
2. Calculate the ratio of reads in tumor vs. reads in normal for each window, correcting for sequence biases (e.g. GC)
3. Smooth the ratio signal across windows
4. Normalize the data
5. Estimate the amount of normal contamination in the tumor sample
6. Perform segmentation on the tumor data
Cancer is more complicated than normal CNV calling
Gusnanto, A., Wood, H. M., Pawitan, Y., Rabbitts, P. & Berri, S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics 28, 40–7 (2012).
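The first two steps listed above (window counting and the tumor/normal ratio) can be sketched in a few lines. This is not CNAnorm's implementation: it omits GC correction, smoothing, contamination estimation, and segmentation, and the function names and the library-size scaling are assumptions made for illustration.

```python
def window_counts(read_starts, window_size, n_windows):
    """Count reads whose start position falls in each fixed-size window."""
    counts = [0] * n_windows
    for pos in read_starts:
        idx = pos // window_size
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


def tumor_normal_ratio(tumor_counts, normal_counts):
    """Per-window tumor/normal ratio, scaled by each library's total reads.

    Windows with no normal reads get None rather than a division by zero.
    """
    t_total = sum(tumor_counts)
    n_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        if n == 0:
            ratios.append(None)
        else:
            ratios.append((t / t_total) / (n / n_total))
    return ratios


if __name__ == "__main__":
    # Hypothetical read start positions on one chromosome, 100 bp windows.
    tumor = window_counts([5, 20, 110, 120, 130, 150, 210, 250], 100, 3)
    normal = window_counts([10, 60, 115, 170, 220, 290], 100, 3)
    print(tumor, normal, tumor_normal_ratio(tumor, normal))
```

A window whose ratio sits well above the others is a candidate gain, and one well below is a candidate loss, subject to the later correction steps.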
34
Variant Types
- Single Nucleotide Variants (SNVs)
- Short Insertions / Deletions (indels)
- Copy Number Variants (CNVs)
- Structural Variants (SVs)
- Novel Sequence
Structural Rearrangement: Translocation, Inversion, Large Insertion / Deletion
Structural variants include large deletions and insertions. They account for the majority of the difference between any given individuals' genomes.
[Figure: rearranged sample sequence compared with Ref: A B C D E F G H I K]
35
Summary of Variant Types
How do we tell what is significant? Even if we have the cancer patient's normal genome and a low false-positive rate for calls, a large number of mutations have no effect and simply propagate alongside the functional mutations (those that have some effect on the cell). Meyerson et al. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October):
36
Passenger Mutations and Driver Mutations
Sequence Normal Driver or Passenger? Cancer X X Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–13 (2012).
37
Passenger Mutations and Driver Mutations
The bad news is that there are many-fold more passenger mutations than driver mutations. Stratton, Michael R, Peter J Campbell, and P Andrew Futreal. The cancer genome. Nature 458, no (April): doi: /nature07943
38
Passenger Mutations and Driver Mutations
Distinguishing features
- Presence in many tumors
- Predicted to have functional impact on the cell
- Conserved
- Not seen in healthy adults (rare)
- Predicted to affect protein structure
- In pathways known to be involved in cancer
Train a classifier using machine learning approaches
- CHASM: train a Random Forest classifier on a set of over 2000 known missense mutations involved in cancer, with a negative control set
- Other methods: SVMs (KinaseSVM), etc.
It is important to remember that, in the end, you must have biological validation in order to impress the cancer community.
Carter et al. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Research, no. 16:
39
Tracking the Evolution of Cancer
40
Models of Breast Cancer Progression
41
Models of Breast Cancer Progression
42
What we did Cancer phylogenetics
Lineage relationship of neoplastic lesions with cancers using somatic SNVs as lineage markers Order of genomic events and drivers Slide Courtesy of Arend Sidow
43
Samples P1 P2 P3 P4 P5 P6 Lymph Normal CCL FEA DCIS IDC
Side 1 Side 2 All samples are FFPE material Slide Courtesy of Arend Sidow
44
Samples
45
Patient 1 Evolution – SNVs
[Figure: reads from sequencing each patient sample (normal, early neoplasia (EN), EN with atypia (ENA), invasive ductal carcinoma (IDC)) aligned to the human genome reference GGATAGTGTCCATGGCAAA. Code: 0101]
46
Patient 1 Evolution – SNVs
Normal sample C Reads from sequencing patient sample C C Human genome reference Early neoplasia (EN) sample Multisample SNV Code Code: EN with atypia (ENA) sample 0101 Invasive ductal carcinoma (IDC) sample Normal EN ENA IDC
47
Patient 1 Evolution – SNVs
Normal sample Reads from sequencing patient sample Human genome reference Early neoplasia (EN) sample C C C Multisample SNV Code Code: EN with atypia (ENA) sample C C C 0101 Invasive ductal carcinoma (IDC) sample Normal EN ENA IDC
48
Patient 1 Evolution – SNVs
Code (Normal EN ENA IDC)   SUM
1000   89
0100   147
0010   102
0001   46
0011   755
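The multisample code is a presence/absence bit string with one digit per sample in the order Normal, EN, ENA, IDC. The sketch below shows one way such codes could be built and tallied; the input representation (a set of sample names per SNV) and the function names are assumptions for illustration.

```python
from collections import Counter

SAMPLES = ("Normal", "EN", "ENA", "IDC")


def presence_code(present_in):
    """Encode the set of samples carrying an SNV as a string such as '0011'."""
    return "".join("1" if sample in present_in else "0" for sample in SAMPLES)


def count_codes(snvs):
    """Tally how many SNVs share each presence code."""
    return Counter(presence_code(present_in) for present_in in snvs)


if __name__ == "__main__":
    # Hypothetical SNVs, each described by the samples in which it was detected.
    snvs = [{"ENA", "IDC"}, {"ENA", "IDC"}, {"EN"}, {"IDC"}]
    for code, n in sorted(count_codes(snvs).items()):
        print(code, n)
```

SNVs with code 0011 are shared by ENA and IDC but absent from EN, which is the kind of pattern used as a lineage marker.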
49
Patient 1 Evolution – SNVs
Code (Normal EN ENA IDC)   SUM
1000   89
0100   147
0010   102
0001   46
0011   755
[Venn diagram view of the same counts: Normal only 89, EN only 147, ENA only 102, IDC only 46, shared by ENA and IDC 755]
50
Patient 1 Evolution – SNVs
Code (Normal EN ENA IDC)   SUM
1000   89
0100   147
0010   102
0001   46
0011   755
[Venn diagram for P1 with the same counts: Normal only 89, EN only 147, ENA only 102, IDC only 46, shared by ENA and IDC 755]
51
Patient 6 Evolution - SNVs
Code (Lymph CCL_CL CCL FEA DCIS IDC)   SUM
010000   219
001000   305
000100   345
000010   978
000001   608
000101   61
000110   185
000111   510
01XXXX 010101
000011   3211
52
Somatic changes Lineage concepts
53
Cell Divisions in One Generation
~60 new point mutations ~40 cell divisions
54
“Germline” You D? Your dad
55
“Germline” D? You Your dad Mutations detected here ...
... but not here ... Your dad
56
Somatic (Tumor) Lineages
Sampled lesion
57
Somatic (Tumor) Lineages
Sampled lesion
58
Somatic (Tumor) Lineages
Sampled lesion 1 Sampled lesion 2
59
Patient 6 Evolution - SNVs
Use frequency to order events
60
Patient 6 Evolution Slide Courtesy of Arend Sidow
61
Patient 6 Evolution – Copy Number Changes
Slide Courtesy of Arend Sidow
62
Putting the germline SNPs to good use (no somatic SNVs for this!)
Aneuploidies
63
Heterozygous Positions
m G A T A T C p 50%
64
LOH (e.g., paternal chromosome)
0% Fraction of “lesser allele”
65
Chromosome Duplication
G A T A T C p A T C 66%
66
Chromosome Duplication
G A T A T C p A T C 33% Fraction of “lesser allele”
67
Lesser Allele Fraction Plots
[Plot: lesser allele fraction (y-axis) against the running number of germline het SNPs (x-axis, N ~ 1.7 million), by chromosome]
Plots are windows of 1000 SNPs, overlapping by 500
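A sliding-window lesser allele fraction over consecutive heterozygous SNPs could be computed along the following lines. How the original plots pool counts within a window is not stated, so summing the smaller allele count over the total reads in the window is an assumption, as are the function and parameter names.

```python
def windowed_laf(ref_counts, alt_counts, window=1000, step=500):
    """Lesser allele fraction in overlapping windows of consecutive het SNPs.

    For each window, sums the smaller of the two allele counts at every SNP
    and divides by the total read count in the window.
    """
    fractions = []
    for start in range(0, len(ref_counts) - window + 1, step):
        lesser = 0
        total = 0
        pairs = zip(ref_counts[start:start + window],
                    alt_counts[start:start + window])
        for r, a in pairs:
            lesser += min(r, a)
            total += r + a
        fractions.append(lesser / total if total else None)
    return fractions


if __name__ == "__main__":
    # Tiny hypothetical example with windows of 2 SNPs overlapping by 1.
    print(windowed_laf([10, 5, 0, 8], [10, 15, 20, 8], window=2, step=1))
```

Taking the minimum at each SNP pulls the estimate below 0.5 at low depth even in balanced regions, so absolute values from this sketch should be read with that bias in mind.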
68
Lesser Allele Fraction Plots
69
Zoom-In But ... What is the actual ploidy?
70
Absolute Coverage Pattern of LOH and Gain
Normal LOH Gain
71
Absolute Coverage in LOH vs Ploidy Gain
Lesser allele FRACTION ? Prevalent allele absolute coverage Gain Lesser allele absolute coverage LOH
72
Fractions with normal contribution
Alleles G / A: prevalent allele absolute coverage = 14, lesser allele absolute coverage = 7
Alleles G / A: prevalent allele absolute coverage = 7, lesser allele absolute coverage = 0
Our samples: up to 50% normal (non-tumor) tissue content
Lesser allele fraction = 7/21 = 0.33
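The effect of normal tissue on the lesser allele fraction can be worked through with a simple mixture model. The sketch assumes the normal cells carry one copy of each allele and that reads are drawn in proportion to copy number; the function name and the `purity` parameter (tumor cell fraction) are illustrative.

```python
def lesser_allele_fraction(tumor_copies, purity, normal_copies=(1, 1)):
    """Expected lesser allele fraction at a het SNP in a tumor/normal mixture.

    `tumor_copies` and `normal_copies` are (allele_1, allele_2) copy numbers;
    `purity` is the fraction of cells in the sample that are tumor cells.
    """
    a = purity * tumor_copies[0] + (1 - purity) * normal_copies[0]
    b = purity * tumor_copies[1] + (1 - purity) * normal_copies[1]
    return min(a, b) / (a + b)


if __name__ == "__main__":
    print(lesser_allele_fraction((2, 1), purity=1.0))  # gain, pure tumor
    print(lesser_allele_fraction((1, 0), purity=1.0))  # LOH, pure tumor
    print(lesser_allele_fraction((1, 0), purity=0.5))  # LOH, 50% normal
```

Under these assumptions a single-copy gain in pure tumor and an LOH diluted by 50% normal tissue both give a fraction of one third, which is why the fraction alone cannot separate them and absolute coverage is examined as well.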
73
Patient 6 Evolution – Copy Number Changes
Slide Courtesy of Arend Sidow
74
Patient 6 Evolution Slide Courtesy of Arend Sidow
75
Patient 2 Lymph Normal CCL FEA DCIS IDC
76
Patient 2 - normal
77
Patient 2 – CCL and DCIS 1q: 4N (3:1) 16p: 3N (2:1) X: 1N (LOH)
16q: 1N (LOH) 1q: 4N (3:1) X: 1N (LOH) 16p: 3N (2:1)
78
Patient 2 – IDC has same as CCL,DCIS
79
Patient 2 Aneuploidy Evolution
CCL,DCIS IDC 1q,16p 16q, X
80
Patient 2 - IDC
81
Patient 2 Aneuploidy Evolution
CCL,DCIS IDC IDC’ Major Crisis Involving all but 6 chromosomes, including 10 whole-chromosome LOHs No aneuploidies but ... 1q,16p 16q, X
82
Patient 2 – SNVs CCL: 894 DCIS: 884 IDC: 1276 Allele Freq in CCL 5 70
681 80 133 515
83
Patient 2 Evolution 1q,16p 16q, X 515 80 133 681 1p 2 4 5 8 9 10 11 13
14 15 17 19 21 515 80 133 1q,16p 16q, X 681
84
Patient Cancer Phylogeny Trees
Slide Courtesy of Arend Sidow
85
Mutational Profiles
86
Automated Inference of Multi-Sample Cancer Phylogenies
This type of data is the basis for the algorithm that I will present next. The method, which my student Victoria Popic named SMutH, standing for Somatic Mutation Hierarchies, takes as input a set of SNVs and their variant allele frequencies in a set of samples. It assumes a branched tree model of cell lineage evolution, where once a cell obtains a mutation, its daughter cells also have that mutation. SMutH then assumes that we sequence multiple samples of a tissue, and that each sample is a different composition of cell subpopulations, where all subpopulations derive from the same cell lineage tree. The goal is to reconstruct the underlying cell lineage tree and, to the extent possible, determine the timing of the somatic mutations.
[Figure: branched tree model with Sample 1, Sample 2, Sample 3]
Victoria Popic, Raheleh Salari
SMutH: Somatic Mutation Hierarchies
87
VAF profiles of SNVs across samples
The input to SMutH is a matrix of variant allele frequencies of somatic mutations across the deep-sequenced samples. As a first step, SNVs are grouped according to their presence and absence pattern in the samples, after setting some lower threshold for presence. At the top you see SNV allele frequency vectors. The first vector is an SNV that is present in all samples and is therefore germline. The second vector from the left is an SNV that is not present in lymph but is present in samples 1, 2, and 3. And so on. At the bottom you see the major represented groups. There is a group of SNVs present in samples 1, 2, and 3. There is another group present in samples 1, 3, and 4. There is another group present in samples 2 and 3. And finally there is a group of SNVs private to sample 4.
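The grouping step described here can be sketched as thresholding each VAF vector into a presence/absence pattern and bucketing SNVs by pattern. This is an illustration, not SMutH's code; the threshold value, the input layout (SNV id mapped to a list of VAFs, one per sample), and the function name are assumptions.

```python
from collections import defaultdict


def group_by_presence(vaf_vectors, threshold=0.05):
    """Group SNV ids by which samples carry them at or above `threshold`."""
    groups = defaultdict(list)
    for snv_id, vafs in vaf_vectors.items():
        pattern = "".join("1" if v >= threshold else "0" for v in vafs)
        groups[pattern].append(snv_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical VAFs in the order lymph, S1, S2, S3, S4.
    vafs = {
        "snv_a": [0.50, 0.48, 0.51, 0.49, 0.50],  # everywhere: germline
        "snv_b": [0.00, 0.30, 0.25, 0.20, 0.00],  # samples 1, 2, 3
        "snv_c": [0.01, 0.28, 0.22, 0.18, 0.02],  # samples 1, 2, 3
        "snv_d": [0.00, 0.00, 0.00, 0.00, 0.15],  # private to sample 4
    }
    print(group_by_presence(vafs))
```

The pattern with a 1 in every position, including lymph, corresponds to germline variants and would be set aside before tree building.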
88
VAF profiles of SNVs – Clustering
Next, SNVs within each group are clustered according to their VAF vector across samples. The reasoning is that clusters of SNVs that have similar allele frequencies across multiple samples, likely represent an ancestral cell containing all these SNVs and which proliferated to a subpopulation in each of the samples.
89
Cell-Lineage VAF Constraint
"Possibly mutations in u happened before those in v"
Then, SMutH forms a network where each of the clusters from the previous step is a node. Directed edges between two groups of mutations u and v represent a possible precedence relationship. That is, the mutations in u could possibly have happened before the mutations in v. For this to be true, any cell containing the mutations in v has to contain the mutations in u. That gives us the following VAF constraint: for every sample, the mean VAF of the mutations in u is at least as high as the mean VAF of the mutations in v. Because of possible errors in estimating VAFs, we allow for a small error epsilon.
Edge u → v: $\overline{\mathrm{VAF}}_s(u) \ge \overline{\mathrm{VAF}}_s(v) - \epsilon$ for every sample $s$
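The edge rule can be written as a per-sample comparison of cluster mean VAFs. The sketch below is an illustration of that constraint only, not SMutH's implementation; the cluster names, the epsilon value, and the function names are hypothetical.

```python
def can_precede(mean_vaf_u, mean_vaf_v, epsilon=0.05):
    """True if, in every sample, mean VAF of u is at least that of v minus epsilon."""
    return all(u >= v - epsilon for u, v in zip(mean_vaf_u, mean_vaf_v))


def build_network(cluster_vafs, epsilon=0.05):
    """List every directed edge (u, v) allowed by the VAF constraint."""
    edges = []
    for u, vaf_u in cluster_vafs.items():
        for v, vaf_v in cluster_vafs.items():
            if u != v and can_precede(vaf_u, vaf_v, epsilon):
                edges.append((u, v))
    return edges


if __name__ == "__main__":
    # Hypothetical cluster mean VAFs in two samples.
    clusters = {"A": [0.5, 0.4], "B": [0.3, 0.2], "C": [0.1, 0.6]}
    print(build_network(clusters))
```

In this example only A may precede B; cluster C is higher than A in the second sample and lower in the first, so no precedence edge connects them in either direction.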
90
Tree Construction
- Find all spanning trees that satisfy the VAF constraints (extension of the Gabow and Myers spanning tree search algorithm)
- Rank trees according to their agreement with VAFs
Once we have constructed a network in this way, we are ready to search for trees that explain the VAF matrix. A valid tree should satisfy the following property: for every node u and its children, if the children have nonzero VAFs in the same sample, they must represent distinct cell groups, all of which also contain the mutations in u. Therefore, the mean VAF of u in that sample has to be at least as high as the sum of the VAFs of u's children. We search for spanning trees in this network using the Gabow-Myers algorithm, which we modified so that it returns only trees that satisfy the VAF constraints. This speeds up the search, and even though the search is exponential, it works extremely fast for the problem sizes that we are currently interested in. The returned trees are ranked according to how well they fit the VAF constraints.
For each node $u$ and its children $C(u)$, in every sample $s$: $\mathrm{VAF}_s(u) \ge \sum_{c \in C(u)} \mathrm{VAF}_s(c)$
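The validity condition on a candidate tree can be checked directly. The sketch below covers only this check, not the modified Gabow-Myers spanning tree search; the tree representation (parent mapped to a list of children), the optional `epsilon` slack, and the function name are assumptions for illustration.

```python
def satisfies_sum_constraint(tree, mean_vaf, epsilon=0.0):
    """Check that each node's VAF covers the summed VAFs of its children.

    `tree` maps a parent node to its list of children; `mean_vaf` maps every
    node to its per-sample mean VAFs. The check is applied in every sample.
    """
    for parent, children in tree.items():
        for s in range(len(mean_vaf[parent])):
            child_sum = sum(mean_vaf[child][s] for child in children)
            if mean_vaf[parent][s] < child_sum - epsilon:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical clusters u, v, w with mean VAFs in two samples.
    vaf = {"u": [0.5, 0.4], "v": [0.3, 0.1], "w": [0.1, 0.2]}
    print(satisfies_sum_constraint({"u": ["v", "w"]}, vaf))
    vaf["w"] = [0.3, 0.2]
    print(satisfies_sum_constraint({"u": ["v", "w"]}, vaf))
```

In the second case v and w together would need more cells in the first sample than carry the mutations in u, so they cannot both be children of u.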
91
Simulation Results
To assess the performance of our method, we first constructed simulations of an expanding cell lineage with associated somatic mutations, and then sampled from the resulting cells with a multi-sample deep sequencing approach. I will not get into the details of the simulations, and we are currently working on a rigorous simulator for this type of problem. In simulations, SMutH performs reasonably well. For example, if VAFs of mutations are estimated with standard deviation of 0.1,
- Pred: pairs of nodes ordered correctly
- Branch: pairs of nodes correctly assigned to separate branches
- Shared edges: edges shared between true and reconstructed trees
92
Reconstruction of Lineage Trees in Recent Literature
ccRCC Study of Renal Carcinoma by Gerlinger et al. (2014)
HGSC Study of Ovarian Cancer by Bashashati et al. (2013)
We also ran our method on recently published multi-sample studies by others, a renal carcinoma study and an ovarian cancer study. In these studies, trees were constructed with a combination of general phylogenetic approaches, such as neighbor joining, and manual inspection. In each of the patients, we recovered trees that are very similar to the previous trees, except for a few cases of sample heterogeneity which we predicted. Upon manual inspection, we find that our trees agree better with the data than the published trees. I can provide more details offline for those who are interested.
93
Expanded Breast Cancer Lineage Trees
Finally, we ran SMutH on our six patients and created the beautiful trees that I display here. I will not go into details, other than to point to an interesting case of recurrent mutations. PIK3CA is a known recurrently mutated cancer gene. In our trees, this mutation breaks perfect phylogeny and recurs in multiple branches. Two specific mutations of PIK3CA recur, H1047R and H1047L. Our tree building is robust enough to tolerate such a low number of mutations that disobey perfect phylogeny, which can then be placed back in the tree so as to see all the branches where they recurred. Here I am showing their allele frequencies in each sample. PIK3CA H1047R, PIK3CA H1047L
94
Further Readings for the Curious
Fantastic Cancer Reviews
- Hanahan and Weinberg. The hallmarks of cancer. Cell 100:
- Hanahan and Weinberg. Hallmarks of cancer: the next generation. Cell 144, 646–74.
Reviews of Cancer Genomics
- Meyerson, Matthew, Stacey Gabriel, and Gad Getz. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October): doi: /nrg
- Yates, L. R. & Campbell, P. J. Evolution of the cancer genome. Nat. Rev. Genet. 13, 795–806 (2012).
Variant Calling
- Dalca, Adrian V, and Michael Brudno. Genome variation discovery with high-throughput sequencing data. Briefings in Bioinformatics 11, no. 1 (January):
- Medvedev, Paul, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, no. 11