Genome Rearrangements in Evolution and Cancer. Guillaume Bourque, Genome Institute of Singapore. HKU-Pasteur Research Centre - Hong Kong, August 28th, 2009
2 Outline Genome Rearrangements in Evolution [ ??? ] Cancer genomics
3 Genome rearrangements in evolution 1999
4 High hopes Explain the physical clustering of gene families (regulation, editing or retention). Understand whether even longer linkage associations were preserved by chance or by selection (developmental or functional). Resolve the mammalian phylogeny using genomic segment exchanges as characters. Discover molecular fossils of precipitous genomic events. Identify genetic determinants of reproductive isolation, adaptation, survival and species formation. O’Brien et al, Science 1999
5 Comparing 2 sequences GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC Need to reverse complement
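The two sequences on this slide share their first and last bases and differ only in the middle, where one is the reverse complement of the other. A minimal Python sketch of that check follows. The helper names are hypothetical, and trimming the shared prefix and suffix before comparing is an assumption that only suits a single clean inversion such as this one.

```python
# Illustrative sketch; function names are hypothetical, not from the talk.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def differing_middles(a, b):
    """Strip the shared prefix and shared suffix; return what is left of each."""
    p = shared_prefix_len(a, b)
    s = shared_prefix_len(a[p:][::-1], b[p:][::-1])
    return a[p:len(a) - s], b[p:len(b) - s]

# The two sequences shown on the slide.
seq_1 = "GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC"
seq_2 = "GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC"

mid_1, mid_2 = differing_middles(seq_1, seq_2)
is_inversion = reverse_complement(mid_1) == mid_2
print(mid_1)
print(mid_2)
print(is_inversion)
```

With real genomes the breakpoints are rarely this clean, so an alignment tool would replace the prefix/suffix trimming.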
6 If you have 3 sequences… Pairwise comparisons: Seq_1 vs Seq_2, Seq_1 vs Seq_3, Seq_2 vs Seq_3 [Figure labels: Seq_1, Seq_2, Seq_3]
7 Rearrangement Phylogeny [Figure labels: Seq_1, Seq_2, Seq_3, A; Inversion Block 2; Inversion Block 4]
8 Synteny blocks
9 Genome rearrangements Reversal Translocation Fusion Fission
10 Algorithms for sorting genomes Polynomial algorithm for computing the rearrangement distance and the most parsimonious scenario between 2 unichromosomal genomes (Hannenhalli and Pevzner 1995). [Figure: example] Further developed for multi-chromosomal genomes (Tesler 2002) and multiple genomes (Bourque and Pevzner 2002).
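To make "rearrangement distance" concrete for the unichromosomal case, here is a small Python sketch that applies signed reversals and finds the minimum number needed to sort a signed permutation. It is explicitly not the Hannenhalli-Pevzner polynomial algorithm: it is a brute-force breadth-first search, exponential in genome size and suitable only for toy inputs. The function names are hypothetical.

```python
from collections import deque

def apply_reversal(perm, i, j):
    """Reverse perm[i..j] (inclusive) and flip the sign of every element in it."""
    perm = tuple(perm)
    segment = tuple(-x for x in reversed(perm[i:j + 1]))
    return perm[:i] + segment + perm[j + 1:]

def reversal_distance(perm):
    """Minimum number of signed reversals turning perm into (1, 2, ..., n).

    Brute-force BFS over all signed permutations; NOT the polynomial
    Hannenhalli-Pevzner algorithm. Use only for very small n.
    """
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, dist = queue.popleft()
        if current == target:
            return dist
        for i in range(len(current)):
            for j in range(i, len(current)):
                nxt = apply_reversal(current, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached: every signed permutation can be sorted

print(apply_reversal((1, 2, 3, 4), 1, 2))
print(reversal_distance((-3, -2, -1)))
print(reversal_distance((2, 1)))
```

The last call illustrates why signs matter: swapping two adjacent blocks while keeping their orientation cannot be done with a single reversal.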
11 Chromosome X: two-way similarities (PatternHunter), synteny blocks (GRIMM-Synteny), rearrangement scenario (MGR)
12 History of Chromosome X
13 Mammalian phylogeny Murphy et al, Science, 2005 [Figure labels: cow, pig, dog, cat, rat, mouse, human]
14 X chromosome evolution
15 Overview of the Results Nearly 20% of chromosome breakpoint regions were reused. Gene density is higher in evolutionary breakpoint regions. Segmental duplications populate the majority of primate-specific breakpoints.
16 Human Chromosome 11
17 Debate on ancestral reconstructions
19 Recovering true ancestral events Analyses of genome rearrangements are typically evaluated on: –Quality of the ancestral reconstructions –Ability to recover the correct topology –Total number of rearrangements in the scenario recovered (parsimony). We decided to focus on the accuracy of the rearrangements recovered. Start by measuring accuracy using simulations and then apply the approach to real data sets. Why? –Look for events that could have been involved in speciation –Look at sequence features associated with these events (e.g. repeats, genes, etc.) –Gain mechanistic insights into genome rearrangements
20 EMRAE :: Efficient Method to Recover Ancestral Events Relies on adjacencies conserved in a significant fraction of the genomes. Combines conserved adjacencies (and nearly conserved adjacencies) to predict rearrangement events. Applicable to uni- and multi-chromosomal genomes. Currently models inversions, translocations, fusions, fissions and transpositions, but is also amenable to insertions and deletions. Achieves high specificity with comparable sensitivity.
21 Conserved adjacencies Define an adjacency a(c_i, c_{i+1}) as an ordered pair of integers (c_i, c_{i+1}), or its inverse (-c_{i+1}, -c_i), found in a given genome. For a given edge e of the phylogeny, which splits the genomes into two sets S_A and S_B, if the adjacency a is found in every genome of S_A but not in any genome of S_B, we say that a is a conserved adjacency of S_A.
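The definition above can be sketched in a few lines of Python. This is an illustration only, not the EMRAE implementation: genomes are assumed to be single linear chromosomes given as lists of signed integers, the function names are hypothetical, and the toy genomes are made up.

```python
def canonical(a, b):
    """(a, b) and its inverse (-b, -a) are the same adjacency; pick one form."""
    return max((a, b), (-b, -a))

def adjacencies(genome):
    """Set of adjacencies in one linear, signed genome (list of ints)."""
    return {canonical(a, b) for a, b in zip(genome, genome[1:])}

def conserved_adjacencies(set_a, set_b):
    """Adjacencies present in every genome of set_a and in no genome of set_b.

    set_a must be non-empty; set_b may be empty.
    """
    in_all_a = set.intersection(*(adjacencies(g) for g in set_a))
    in_any_b = set().union(*(adjacencies(g) for g in set_b))
    return in_all_a - in_any_b

# Toy genomes (hypothetical): the two sides of one tree edge.
S_A = [[1, 2, 3, 4], [1, 2, -4, -3]]
S_B = [[1, -2, 3, 4]]
result = conserved_adjacencies(S_A, S_B)
print(result)
```

Here the adjacency (3, 4) appears in both genomes of S_A, once as its inverse (-4, -3), but also in S_B, so only (1, 2) is conserved for S_A.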
22 Conserved adjacencies :: example
23 Simulation results Higher specificity
24 Mammalian rearrangement events (reversals, translocations, transpositions, fusions/fissions) Predicted 1109 events at a 10 Kb resolution: 831 reversals, 237 transpositions, 15 translocations, 26 fusions/fissions
26 Human-chimp-specific reversal
27 Human-specific breakpoints are enriched in SDs Human-specific breakpoint regions are significantly enriched in SDs as compared to size-matched random regions (p-value < 0.001). Indeed, 93.2% of the human-specific breakpoint regions (69 out of 74) contain SDs. This is true for only approximately 60% of size-matched random regions.
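As a rough sanity check of these numbers, the Python sketch below recomputes the 93.2% figure and asks how likely 69 or more of 74 regions would contain SDs if each region independently had a 60% chance. This binomial calculation is an assumption made for illustration: the slide does not say which test produced the reported p-value, and "approximately 60%" is treated here as exactly 0.60.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

observed = 69        # human-specific breakpoint regions containing SDs
total = 74           # all human-specific breakpoint regions
background = 0.60    # assumed exact value for "approximately 60%"

fraction_with_sd = observed / total
p_value = binomial_upper_tail(observed, total, background)

print(f"{100 * fraction_with_sd:.1f}% of regions contain SDs")
print(f"binomial upper-tail probability: {p_value:.2e}")
```

Under these assumptions the tail probability falls well below the 0.001 threshold quoted on the slide.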
28 Homologous matching pairs of SDs are enriched in human-specific breakpoints Taking the 74 human-specific breakpoints identified in this study, we observed 100 pairs of regions with matching pairs of SDs instead of an average of 25 pairs observed in the random simulated data sets.
29 Primate reversals are associated with SDs The average percent identity of the SDs that are associated with reversals correlates with the relative age of these events. This helps confirm the direct link between SDs and many rearrangement events.
30 Extension from primate-specific reversals to all the predicted mammalian reversals. We used BLAST to detect homology between breakpoints of the predicted reversals. Many reversals are flanked by regions of high sequence identity (BLAST score >1000). If not SDs, what?
31 Homology flanking mammalian reversals We found that 58%, 29%, 24%, 42%, 47% and 20% of the human, chimp, rhesus, rat, mouse and dog reversals, respectively, are supported by regions with BLAST scores greater than […]. What is the source of this homology? Is it expected? We restricted our analysis to the reversals with breakpoints defined within 100 Kb and assessed the overlap between these regions of homology and repeats. We annotated each reversal to a particular repeat family when the overlap between the homologous segment identified and a repeat instance was greater than 50%, and compared the results to matched simulated data sets.
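The 50% overlap rule can be sketched as a simple interval calculation. In the Python example below, the function names, the coordinates and the choice of denominator are assumptions: the slide does not say whether the 50% refers to the homologous segment or to the repeat instance, so the fraction of the homologous segment covered by the repeat is used here.

```python
def overlap_fraction(segment, repeat):
    """Fraction of `segment` covered by `repeat`.

    Both are half-open (start, end) intervals on the same chromosome.
    Assumption: the denominator is the length of the homologous segment.
    """
    seg_start, seg_end = segment
    rep_start, rep_end = repeat
    overlap = max(0, min(seg_end, rep_end) - max(seg_start, rep_start))
    return overlap / (seg_end - seg_start)

def annotate(segment, repeats, threshold=0.5):
    """Repeat families whose instances cover more than `threshold` of segment.

    `repeats` is a list of (start, end, family) tuples.
    """
    return sorted({family for start, end, family in repeats
                   if overlap_fraction(segment, (start, end)) > threshold})

# Hypothetical coordinates for one homologous segment and two repeats.
segment = (100, 200)
repeats = [(90, 160, "L1"), (180, 300, "Alu")]
families = annotate(segment, repeats)
print(families)
```

The same counts on matched simulated data sets would then give the background against which over-representation of a family is judged.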
32 Overrepresentation of paired L1 repeats
33 Outline Genome Rearrangements in Evolution [ ??? ] Cancer genomics
34 Sequencing Revolution Sanger sequencing (1970s) Next-generation sequencing (2007-now): Illumina, SOLiD
35 Data Explosion Sequencing is no longer the rate-limiting step. This year, we expect: –2X increase in CPU –2X increase in memory –10X increase in sequencing (estimate from Illumina and SOLiD) or even 100X increase (Helicos, Complete Genomics, etc.). Informatics challenges that we face now will only grow…
37 Paradigm Shift Things that are out: –Storing all primary data (images) –“All versus all” types of analysis –Single large repository (NCBI) –Careless data management (duplicated files, extra transfer steps, etc.). Things that are in: –Clusters and high-performance storage –Cloud computing –Careful data management & planning –Bioinformaticians & IT engineers (even for relatively small labs)
38 Sequencing Human Genomes 1000 Genomes Project $$$ The Human Genome $$$$$$ Your Genome $ (?)
39 New opportunities… In the study of: Evolution, Populations, Cancer
40 Outline Genome Rearrangements in Evolution [ ??? ] Cancer genomics
41 Gene Identification Signature Ng, et al., Nature Methods, 2005
42 PET technology [Figure labels: cDNA, PET, Cancer Cell, Human Genome]
43 Highly rearranged cancer genome Provided by Nalla Palanisamy, GIS
44 Impact of rearrangements on PETs [Figure labels: Inversion, Deletion, Translocation; Cancer, Normal]
45 GIS-PET MCF-7 Transcriptome 584,624 cDNA equivalents 135,757 Unique PETs One location (tag1) Unmappable (tag0) 92,928 PETs (69%) 33,097 PETs (24%) 9,732 PETs (7%) Multi-location
46 Sequence-based clustering All unmappable PETs (tag0). Cluster based on sequence similarity. Align:
---GGAGCCGCGGCCGCC ACGATCCCAC-AGCCTC
----GAGCCGCGGCCGCC---AAGAACGATACCAC-AGCCTC
ATTGGAGCTGCGGCCGC ACGATCCCAC-AGCCTC
--TGGAGCCGCGGCCGCCGA-----ACGATCCCAC-AGCCTC
GCGGCGGCCGCC---AAGAACGATCCCAC-AGCCCC
----GAGCCGCGGCCGCCG---AGCACGATCCCACTAGCCTC
Extract consensus (5’ 3’): ATTGGAGCCGCGGCCGCCGA AGAACGATCCCACAGCCTC
Map to human genome (5’ 3’)
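The "extract consensus" step can be illustrated with a column-wise majority vote over aligned tags. The Python sketch below is an assumption about how such a consensus might be computed, not the pipeline used in the talk. The rows are a made-up toy alignment, because the column positions of the slide's alignment did not survive extraction.

```python
from collections import Counter

def consensus(aligned_rows):
    """Column-wise majority base over equal-length aligned rows.

    Gap characters ('-') are ignored; a column made only of gaps is skipped.
    Ties are broken in favour of the base seen first in the column.
    """
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    bases = []
    for column in zip(*aligned_rows):
        counts = Counter(base for base in column if base != "-")
        if counts:
            bases.append(counts.most_common(1)[0][0])
    return "".join(bases)

# Toy alignment (hypothetical), padded with '-' like the rows on the slide.
rows = [
    "--GGAGCC",
    "ATGGAGCT",
    "-TGGAGCC",
]
tag_consensus = consensus(rows)
print(tag_consensus)
```

The consensus tag, rather than any single read, is what would then be mapped back to the human genome.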
47 Largest unmappable cluster: 77 unique PETs, 339 total PETs. 5’ tag: BCAS4 (20q13); 3’ tag: BCAS3 (17q23) …
48 BCAS4-3 fusion transcript
49 Fusion transcript discovery pipeline Ruan et al. Genome Res, 2007
50 Genomic PET (gPET): Genomic DNA fragmentation; PET library construction & sequencing; PET sequences mapping to reference genome; PET mapping span [Figure: mapping-span distribution with a 1 Kb peak and a 10 Kb peak]
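Once both tags of a genomic PET are mapped, the chromosome, strand and span of the pair hint at the kind of rearrangement it crosses. The Python sketch below is a simplified illustration under stated assumptions: the `PairedTag` type, the category labels, the convention that both tags of a concordant PET map to the same strand, and the `max_span` cutoff are all hypothetical. A single discordant PET is only a candidate, so real pipelines require clusters of supporting PETs.

```python
from dataclasses import dataclass

@dataclass
class PairedTag:
    """Mapped positions of the 5' and 3' tags of one PET (hypothetical type)."""
    chrom5: str
    pos5: int
    strand5: str
    chrom3: str
    pos3: int
    strand3: str

def classify(pet, max_span=15_000):
    """Label a mapped PET by the simplest rearrangement consistent with it.

    max_span is an assumed cutoff chosen above the 10 Kb library peak.
    """
    if pet.chrom5 != pet.chrom3:
        return "translocation"
    if pet.strand5 != pet.strand3:
        return "inversion"
    if abs(pet.pos3 - pet.pos5) > max_span:
        return "deletion"
    return "concordant"

examples = [
    PairedTag("chr20", 1_000, "+", "chr17", 5_000, "+"),
    PairedTag("chr1", 1_000, "+", "chr1", 9_000, "-"),
    PairedTag("chr1", 1_000, "+", "chr1", 500_000, "+"),
    PairedTag("chr1", 1_000, "+", "chr1", 10_500, "+"),
]
labels = [classify(pet) for pet in examples]
print(labels)
```

The checks are ordered so that the most disruptive explanation is tested first; a PET is labelled concordant only if it passes all three.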
51 Putting everything together… Inputs: GIS-PET & gPET, Tag0s (5’ 3’), aCGH, Exon Array. Annotation sources: Mitelman (342 entries), Fragile sites (118 entries), ChimerDB (848 entries), Sanger (428 entries). Steps: annotate, prioritize. Output: High-resolution map of aberrations in cancer
52 Acknowledgments From my group: –Zhao Hao, Chi Ho Lin, Johni Masli (NUS) –Galih Kunarso, Justin Jeyakani –Woo Xing Yi, Kelson Zawack With the help of: –Yijun Ruan, Yao Fei, Axel Hillmer, Chia-Lin Wei –Charlie Lee, Pramila Ariyaratne, Ken Sung –Ed Liu –Jian Ma (UCSC), Pavel Pevzner and Glenn Tesler (UCSD) –GIS and A*STAR for financial support