Tumor Genomes
Compromised genome stability
Mutation and selection
Chromosomal aberrations
– Structural: translocations, inversions, fissions, fusions.
– Copy number changes: gain and loss of chromosome arms, segmental duplications/deletions.
Rearrangements in Tumors
Change gene structure, create novel fusion genes
Gleevec (Novartis 2001) targets the BCR-ABL fusion
End Sequence Profiling (ESP)
C. Collins and S. Volik (UCSF Cancer Center)
1) Pieces of tumor genome: clones ( kb).
2) Sequence ends of clones (500bp).
3) Map end sequences to human genome.
Each clone corresponds to a pair of end sequences (ES pair) (x,y).
Retain clones that correspond to a unique ES pair.
Valid ES pairs
l ≤ y – x ≤ L, where l (L) is the min (max) size of a clone.
Convergent orientation.
Invalid ES pairs
Putative rearrangement in tumor.
ES directions point toward breakpoints (a,b): l ≤ |x-a| + |y-b| ≤ L
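The valid/invalid distinction can be sketched in a few lines. This is a minimal illustration, not the ESP pipeline itself: the function name, the strand encoding ("+"/"-"), the assumption that both ends map to the same chromosome with x < y, and the example size bounds are all hypothetical.

```python
def is_valid_es_pair(x, y, x_strand, y_strand, min_len, max_len):
    """Return True if an ES pair satisfies the length and direction constraints.

    Assumes both ends map to one chromosome with x < y (illustrative only).
    Convergent orientation is encoded here as x on "+" and y on "-".
    """
    convergent = (x_strand == "+" and y_strand == "-")
    return convergent and min_len <= y - x <= max_len


# Hypothetical clone-size bounds, in base pairs.
MIN_LEN, MAX_LEN = 100_000, 250_000

print(is_valid_es_pair(1_000, 151_000, "+", "-", MIN_LEN, MAX_LEN))  # length and direction OK
print(is_valid_es_pair(1_000, 151_000, "+", "+", MIN_LEN, MAX_LEN))  # wrong orientation
print(is_valid_es_pair(1_000, 901_000, "+", "-", MIN_LEN, MAX_LEN))  # ends too far apart
```

Pairs that fail either test are the invalid ES pairs that may indicate a rearrangement.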
ESP Genome Reconstruction Problem
Human genome (known): blocks A, B, C, D, E.
Tumor genome (unknown): -C -D E A B, produced by an unknown sequence of rearrangements.
Map ES pairs (x1,y1), …, (x5,y5) to human genome; location of ES pairs in human genome is known.
Reconstruct tumor genome.
ESP Genome Reconstruction: Comparative Genomics
Tumor (-C -D E A B) compared against human.
ESP Genome Reconstruction: Comparative Genomics
Tumor (-C -D E A B) compared against human, with ES pairs (x1,y1), (x2,y2), (x3,y3), (x4,y4) marked.
2D Representation of ESP Data
ESP plot: human genome on both axes. Each point is an ES pair, e.g. (x1,y1), (x2,y2), (x3,y3), (x4,y4).
Can we reconstruct the tumor genome from the positions of the ES pairs?
ESP Plot → Tumor Genome
Figure: ESP plot (human genome on the axes) and the reconstructed tumor genome.
Real data are noisy and incomplete!
Valid ES pairs satisfy length/direction constraints: l ≤ y – x ≤ L
Invalid ES pairs indicate rearrangements or experimental errors.
Computational Approach
1. Use known genome rearrangement mechanisms.
2. Find simplest explanation for ESP data, given these mechanisms.
3. Motivation: genome rearrangement studies in evolution/phylogeny.
Figure: inversion and translocation between positions s and t, human vs. tumor.
ESP Sorting Problem
G = [0,M], unichromosomal genome.
Inversion (reversal) $\rho_{s,t}$: $\rho_{s,t}(x) = x$ if $x < s$ or $x > t$; $\rho_{s,t}(x) = t - (x - s)$ otherwise.
Given: ES pairs $(x_1, y_1), \ldots, (x_n, y_n)$.
Find: minimum number of reversals $\rho_{s_1,t_1}, \ldots, \rho_{s_n,t_n}$ such that if $\rho = \rho_{s_1,t_1} \cdots \rho_{s_n,t_n}$, then $(\rho x_1, \rho y_1), \ldots, (\rho x_n, \rho y_n)$ are valid ES pairs.
$G' = \rho G$
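The reversal map on coordinates is simple enough to write out directly. The sketch below is illustrative only: the function names are hypothetical, coordinates are treated as plain numbers on [0, M], and the list of reversals is assumed to be applied left to right (the slide does not state the order of composition).

```python
def reversal(s, t, x):
    """Image of coordinate x under the reversal of the interval [s, t]."""
    if x < s or x > t:
        return x            # positions outside [s, t] are unchanged
    return t - (x - s)      # positions inside [s, t] are mirrored


def apply_reversals(reversals, x):
    """Apply a list of (s, t) reversals to x, left to right (assumed order)."""
    for s, t in reversals:
        x = reversal(s, t, x)
    return x


print(reversal(2, 5, 3))                     # inside the interval: mirrored
print(reversal(2, 5, 1))                     # outside the interval: unchanged
print(apply_reversals([(2, 5), (2, 5)], 3))  # a reversal is its own inverse
```

A candidate sequence of reversals can then be checked by mapping every ES pair through it and testing the length constraint l ≤ y – x ≤ L on the images.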
Filtering Experimental Noise
Rearrangement: cluster of invalid pairs.
Chimeric clone: isolated invalid pair.
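One way to separate clusters from isolated invalid pairs is single-linkage clustering of the points (x, y). The sketch below is an assumption-laden illustration, not the procedure used in the ESP work: the function names, the per-coordinate distance test, and the `radius` and `min_size` parameters are all hypothetical.

```python
def cluster_invalid_pairs(pairs, radius):
    """Single-linkage clusters of (x, y) points.

    Two points are linked when both coordinates differ by at most `radius`.
    Returns a sorted list of clusters, each a sorted list of indices into `pairs`.
    """
    unvisited = set(range(len(pairs)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if abs(pairs[i][0] - pairs[j][0]) <= radius
                    and abs(pairs[i][1] - pairs[j][1]) <= radius]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return sorted(clusters)


def split_clusters(pairs, radius, min_size=2):
    """Return (clusters kept, indices of isolated pairs to discard)."""
    clusters = cluster_invalid_pairs(pairs, radius)
    kept = [c for c in clusters if len(c) >= min_size]
    isolated = [i for c in clusters if len(c) < min_size for i in c]
    return kept, isolated


# Hypothetical invalid ES pairs: three near one breakpoint, one isolated.
pairs = [(100, 900), (110, 905), (120, 890), (5000, 200)]
kept, isolated = split_clusters(pairs, radius=50)
print(kept, isolated)
```

Under this reading, the kept clusters are candidate rearrangements and the isolated pairs are treated as chimeric clones.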
Sparse Data Assumptions
1. Each cluster results from a single inversion.
2. Each clone contains at most one breakpoint.
Figure: ES pairs (x1,y1), (x2,y2), (x3,y3) shown in human and tumor coordinates.
ESP Genome Reconstruction: Discrete Approximation
1) Remove isolated invalid pairs (x,y).
2) Define segments from clusters.
3) ES orientations define links between segment ends.
Figure: clusters (x1,y1), (x2,y2), (x3,y3) on the human genome, with segment ends s and t.
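Step 2 can be illustrated by cutting the human genome at the breakpoint coordinates implied by the clusters. This is a minimal sketch under the assumption that each cluster has already been reduced to breakpoint positions on a single chromosome [0, M]; the function name and the example numbers are hypothetical.

```python
def segments_from_breakpoints(genome_length, breakpoints):
    """Split [0, genome_length] into consecutive segments at the breakpoints.

    Duplicate breakpoints and positions outside the open interval
    (0, genome_length) are ignored.
    """
    interior = {b for b in breakpoints if 0 < b < genome_length}
    cuts = sorted({0, genome_length} | interior)
    return list(zip(cuts[:-1], cuts[1:]))


# Hypothetical genome of length 100 with breakpoints at 30 and 70.
print(segments_from_breakpoints(100, [30, 70, 30]))
```

Each resulting segment has two ends; the ES orientations in step 3 determine which segment ends are linked in the tumor.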
ESP Graph
Edges:
1. Human genome segments
2. ES pairs
Paths in graph are tumor genome architectures.
Human genome → tumor genome: minimal sequence of translocations and inversions.
ESP Graph → Tumor Permutation and Breakpoint Graph
Breakpoint graph of the tumor permutation $\pi$:
Black edges: adjacent elements of $\pi$.
Gray edges: adjacent elements of the identity permutation $i$.
Key parameter: black-gray cycles.
Theorem: the minimum number of reversals $d(\pi)$ needed to transform $\pi$ to the identity permutation $i$ satisfies $d(\pi) \ge n + 1 - c(\pi)$, where $c(\pi)$ = number of gray-black cycles.
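The cycle count in the bound can be computed with the standard doubling construction for signed permutations. The sketch below assumes that encoding (element +k becomes the pair 2k-1, 2k; element -k becomes 2k, 2k-1; the sequence is framed by 0 and 2n+1); the function name is hypothetical, and the result is only the lower bound n + 1 - c, not the exact reversal distance.

```python
def reversal_lower_bound(perm):
    """Lower bound n + 1 - c on the reversal distance of a signed permutation.

    perm is a list of nonzero integers whose absolute values are 1..n.
    """
    n = len(perm)
    seq = [0]
    for v in perm:
        a, b = 2 * abs(v) - 1, 2 * abs(v)
        seq.extend((a, b) if v > 0 else (b, a))
    seq.append(2 * n + 1)

    black, gray = {}, {}
    for i in range(0, 2 * n + 2, 2):
        u, w = seq[i], seq[i + 1]
        black[u], black[w] = w, u    # adjacent elements of the permutation
        gray[i], gray[i + 1] = i + 1, i    # adjacent elements of the identity

    seen, cycles = set(), 0
    for start in range(2 * n + 2):
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:         # alternate black and gray edges
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray[w]
    return n + 1 - cycles


print(reversal_lower_bound([1, 2, 3]))          # identity permutation
print(reversal_lower_bound([-3, -4, 5, 1, 2]))  # -C -D E A B with A=1, ..., E=5
```

The second call uses the block order -C -D E A B that appears earlier in the deck, with the labeling A=1 through E=5 chosen here for illustration.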
MCF7 Breast Cancer Cell Line Low-resolution chromosome painting suggests complex architecture. Many translocations, inversions.
MCF7 Genome
Human chromosomes → MCF7 chromosomes: 5 inversions, 15 translocations.
Raphael, et al. Bioinformatics
3. Rearrangement/duplication mechanisms Does ESP suggest mechanisms that scramble tumor genomes?
Another look at MCF7
ES pairs: valid (black); 737 invalid, of which 489 isolated (red) and 248 form 70 clusters (blue).
33/70 clusters. Total length: 31Mb
Structure of Duplications in Tumors?
Mechanisms not well understood.
Duplicated segments may co-localize (Guan et al. Nat. Gen. 1994).
Figure: human genome vs. tumor genome.
Analyzing Duplications
Human vs. tumor: after a duplication, the arrangement of the duplicated material in the tumor is ambiguous (????).
Co-duplication: additional ES pair resolves duplication.
Duples and Boundary Elements
Call this configuration a duple with boundary elements v and w.
Figure: human vs. tumor after a duplication, with ES at u, v, w.
Duplications in ESP graph
Duple boundary elements v, w are vertices in ESP graph.
Path between boundary elements resolves duple.
Duplication Complications
These configurations are frequent in MCF7 data.
Resolving Duplications as Paths
Path between boundary elements resolves duple.
Multiple paths may exist between duple boundary elements.
Many Paths in MCF7!
Tumor Amplisomes (Maurer, et al. 1987; Wahl, 1989…)
Other terms: episome, amplicon, double-minute.
Duplication by Amplisome Gives single model for all duplications
Amplisome Reconstruction Problem
Assume
1. Tumor genome sequence is known.
2. Insertions are independent,
– i.e. no insertions within insertions
Approach
1. Identify duplicated sequences A_1, …, A_m
2. Amplisome is shortest common superstring of A_1, …, A_m
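Shortest common superstring is hard to solve exactly in general, so the sketch below swaps in the well-known greedy overlap-merging heuristic to illustrate the idea. It is an approximation, not the method used on the slide; the function names and the toy strings are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    unique = sorted(set(strings))
    # Drop strings already contained in another string.
    items = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items)
                 if idx not in (best_i, best_j)] + [merged]
    return items[0] if items else ""


# Toy duplicated sequences over block labels.
amplisome = greedy_superstring(["ABC", "BCD", "CDE"])
print(amplisome)
```

The greedy result always contains every input string, but it is not guaranteed to be the shortest such string.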
ESP Amplisome Reconstruction Problem
Assume
1. Insertions are independent,
– i.e. no insertions within insertions
Approach
1. Identify duples with boundary elements (v_1, w_1), …, (v_m, w_m)
2. Amplisome is shortest path in ESP graph containing subpaths v_1…w_1, v_2…w_2, …, v_m…w_m
Reconstructed MCF7 amplisome
33 clusters. Total length: 31Mb
Amplisome model explains 24/33 invalid clusters.
Raphael and Pevzner. Bioinformatics 2004.
Duplicated Translocation
Cluster of 20 ES pairs. One clone sequenced.
Clone size for the sequenced clone with ES pair (x_1, y_1) and breakpoint (a,b): (a – x_1) + (b – y_1)
Breakpoint (a,b) in one clone suggests sizes (a – x_i) + (b – y_i) for other clones in cluster.
Experimental sizes agreed with inferred sizes.
All clones share same breakpoint.
Duplication of region occurs after translocation.
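The size inference is a one-line computation per clone. The sketch below is illustrative: the function name and the coordinates are hypothetical, and absolute values are used so that it matches the constraint l ≤ |x-a| + |y-b| ≤ L stated for invalid ES pairs.

```python
def inferred_clone_sizes(breakpoint, es_pairs):
    """Clone sizes implied by a shared breakpoint (a, b) for each ES pair (x, y)."""
    a, b = breakpoint
    return [abs(a - x) + abs(b - y) for x, y in es_pairs]


# Hypothetical breakpoint and ES pairs, coordinates in kb.
sizes = inferred_clone_sizes((1000, 5000), [(900, 4950), (800, 4900)])
print(sizes)
```

Comparing these inferred sizes with experimentally measured clone sizes is the consistency check described on the slide.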
Clone Sequencing (Joint work with Jan-Fang Cheng, LBNL)
Draft sequencing of 29 clones.
50 rearrangement breakpoints.
Some clones have complex internal organization.
Figure: three clones from MCF7 with indicated lengths (including 117kb). Colors and labels indicate chromosome of origin.
4. Combining ESP with other genome data
Array Comparative Genomic Hybridization (aCGH)
Joint work with Z. Yakhini, D. Lipson (Agilent and Technion)
CGH Analysis
Copy number profile: copy number vs. genome coordinate.
Segmentation: divide genome into segments of equal copy number.
Numerous methods (e.g. clustering, Hidden Markov Model, Bayesian, etc.)
No information about:
– Structural rearrangements (inversions, translocations)
– Locations of duplicated material in tumor genome.
ESP!
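As a toy illustration of what a segmentation produces, the sketch below groups consecutive probes with the same copy number into segments. It assumes the copy numbers are already denoised integer calls, which real aCGH data are not; the methods named on the slide (clustering, HMM, Bayesian) exist precisely to handle that noise. The function name and the data are hypothetical.

```python
from itertools import groupby


def segment_copy_numbers(profile):
    """Group a sorted list of (coordinate, copy_number) probes into runs.

    Returns (first_coordinate, last_coordinate, copy_number) per segment.
    """
    segments = []
    for copy_number, run in groupby(profile, key=lambda probe: probe[1]):
        run = list(run)
        segments.append((run[0][0], run[-1][0], copy_number))
    return segments


# Hypothetical probes: (genome coordinate, integer copy number call).
profile = [(0, 2), (10, 2), (20, 5), (30, 5), (40, 3)]
segments = segment_copy_numbers(profile)
print(segments)
```

The boundaries between consecutive segments are the CGH breakpoints; the segmentation alone says nothing about how the copies are arranged in the tumor genome.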
CGH Segmentation
How are the copies of segments linked in the tumor genome???
ES pairs link segments.
Figure: copy number vs. genome coordinate.
ESP + CGH
ES near segment boundaries.
Figure: copy number vs. genome coordinate, with a CGH breakpoint and an ESP breakpoint marked.
ESP and CGH Breakpoints BT474 MCF7 ESP breakpoints CGH breakpoints 33 (P = 5.4 x ) (P = 1.2 x ) 730 ESP breakpoints CGH breakpoints /39 clusters 8/33 clusters
Microdeletion in BT474
ES pair ≈ 600kb apart; valid ES < 250kb.
“Interesting” genes in this region.
Figure: copy number along the region.
Combining ESP and CGH
ES pairs link segments.
Copy number balance at each segment boundary: 5 = 3 + 2
Figure: copy number vs. genome coordinate.
Combining ESP and CGH
CGH copy number not exact: 3 ≤ f(e) ≤ 5, 1 ≤ f(e) ≤ 3, 1 ≤ f(e) ≤ 4
What genome architecture is “most consistent” with ESP and CGH data?
Combining ESP and CGH
Build graph
1. Edge for each CGH segment.
2. Edge for each ES pair consistent with segments.
3. Range of copy number values for each CGH edge, e.g. 3 ≤ f(e) ≤ 5, 1 ≤ f(e) ≤ 3, 1 ≤ f(e) ≤ 4
Network Flow Problem
Minimum Cost Circulation with Capacity Constraints (Sequencing by Hybridization, Sequence Assembly)
$\min \sum_e \alpha(e) f(e)$
Subject to:
$l(e) \le f(e) \le u(e) \quad \forall e$
$\sum_{(u,v)} f((u,v)) = \sum_{(v,w)} f((v,w)) \quad \forall v$ (flow in = flow out at each vertex)
Flow constraints:
CGH edge: l(e) and u(e) from CGH
ESP edge: l(e) = 1, u(e) = 1
Costs: $\alpha(e)$ = 0 if e is an ESP or CGH edge; 1 if e is incident to source/sink.
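To make the constraints concrete, the sketch below checks a candidate flow against the capacity bounds and flow conservation, then reports its cost. It is a verifier, not a solver: finding the minimum-cost circulation would need a min-cost flow or LP routine. The function name, the edge encoding, and the tiny example graph are all hypothetical.

```python
import math


def check_circulation(edges, flow):
    """Validate a flow and return its total cost.

    edges: dict name -> (tail, head, lower, upper, cost)
    flow:  dict name -> flow value f(e)
    Raises ValueError if a capacity bound or flow conservation is violated.
    """
    balance = {}
    for name, (tail, head, lower, upper, _cost) in edges.items():
        f = flow[name]
        if not lower <= f <= upper:
            raise ValueError(f"capacity violated on edge {name}")
        balance[tail] = balance.get(tail, 0) - f   # flow out of tail
        balance[head] = balance.get(head, 0) + f   # flow into head
    unbalanced = sorted(v for v, b in balance.items() if b != 0)
    if unbalanced:
        raise ValueError(f"flow in != flow out at {unbalanced}")
    return sum(edges[name][4] * flow[name] for name in edges)


# Hypothetical graph: one CGH edge, one ESP edge, two source/sink edges.
edges = {
    "cgh_ab": ("a", "b", 3, 5, 0),         # bounds from CGH, cost 0
    "esp_ba": ("b", "a", 1, 1, 0),         # ESP edge: l = 1, u = 1, cost 0
    "src_a":  ("s", "a", 0, math.inf, 1),  # incident to source/sink, cost 1
    "b_sink": ("b", "s", 0, math.inf, 1),  # incident to source/sink, cost 1
}
flow = {"cgh_ab": 3, "esp_ba": 1, "src_a": 2, "b_sink": 2}
total = check_circulation(edges, flow)
print(total)
```

In this toy instance the CGH edge needs at least three copies but the single ESP edge supplies only one, so two units must be routed through the source/sink at a cost of 1 each.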
Network Flow Results
Unsatisfied flow (through source/sink) marks putative locations of missing ESP data.
Prioritize further sequencing.
Targeted ESP by screening library with CGH probes.
Network Flow Results
Flow values → edge weights
Paths of high-weight edges: amplicon structures
Identify amplified translocations
– 14 in MCF7
– 5 in BT474