Genome Paleontology: Discoveries from complete genomes Steven L. Salzberg The Institute for Genomic Research (TIGR) and Johns Hopkins University.

1 Genome Paleontology: Discoveries from complete genomes Steven L. Salzberg The Institute for Genomic Research (TIGR) and Johns Hopkins University

2 What is genome paleontology? Compare genomes to uncover: the history of species; genome transformations; recent mutations such as SNPs; evolution. © 2003 Steven L. Salzberg

3 Outline (time permitting)
An algorithm for rapid large-scale alignment
    A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Res 27:11 (1999), 2369-76.
    MUMmer 2: Delcher et al., NAR, 2002.
Alignments and analyses of bacterial genomes
    J.A. Eisen, J.F. Heidelberg, O. White, and S.L. Salzberg. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biology 1:6 (2000), 1-9.
Large-scale genome duplications: plant and human
    The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 (2000), 796-815.
    J.C. Venter et al. The sequence of the human genome. Science 291 (2001), 1304-1351.
Lateral gene transfer between humans and bacteria
    S.L. Salzberg, O. White, J. Peterson, and J.A. Eisen. Microbial genes in the human genome: lateral transfer or gene loss? Science 292 (2001), 1903-1906.

4 Genomes completed and published by TIGR and our collaborators, 1995-present
Organism                      Reference
Arabidopsis thaliana          Lin et al., Nature 402: 761-8 (1999)
Archaeoglobus fulgidus        Klenk et al., Nature 390: 364-370 (1997)
Bacillus anthracis Ames       Read et al., Nature 423: 81-86 (2003)
Bacillus anthracis Florida    Read et al., Science 296: 2028-33 (2002)
Borrelia burgdorferi          Fraser et al., Nature 390: 580-586 (1997)
Brucella suis                 Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus        Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae          Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum           Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae          Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum            Eisen et al., PNAS 99: 9509-9514 (2002)
Coxiella burnetii RSA 493     Seshadri et al., PNAS 100: 5455-60 (2003)
Deinococcus radiodurans       White et al., Science 286 (1999)
Enterococcus faecalis         Paulsen et al., Science 299: 2071-2074 (2003)
Haemophilus influenzae        Fleischmann et al., Science 269 (1995)
Helicobacter pylori           Tomb et al., Nature 388: 539-547 (1997)
Methanococcus jannaschii      Bult et al., Science 273: 1058-1073 (1996)
Mycobacterium tuberculosis    Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium         Fraser et al., Science 270: 397-403 (1995)
Neisseria meningitidis        Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10    Wing et al., Science 300: 1566-1569 (2003)
Plasmodium falciparum         Gardner et al., Nature 419: 531-534 (2002)
Plasmodium yoelii             Carlton et al., Nature 419: 512-519 (2002)
Porphyromonas gingivalis      Nelson et al., J. Bact., in revision
Pseudomonas putida            Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis         Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae      Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae      Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus   Arnold et al., Virology 15: 252-66 (2000)
Thermotoga maritima           Nelson et al., Nature 399: 323-329 (1999)
Treponema pallidum            Fraser et al., Science 281: 375-388 (1998)
Vibrio cholerae               Heidelberg et al., Nature 406 (2000)

5 Genomes in progress or recently completed: Fibrobacter succinogenes, Prevotella intermedia, Pseudomonas fluorescens, Silicibacter pomeroyi DSS-3, Streptococcus agalactiae A909, Streptococcus gordonii, Streptococcus mitis, Streptococcus pneumoniae 670, Acidobacterium capsulatum, Bacillus anthracis A01055, Bacillus anthracis A0402, Bacillus anthracis Ames 0581, Burkholderia thailandensis, Campylobacter coli RM2228, Campylobacter upsaliensis RM3195, Clostridium perfringens SM101, Epulopiscium fishelonii, Hyphomonas neptunium, Listeria monocytogenes F6854, Listeria monocytogenes H7858, Mycoplasma arthritidis, Mycoplasma capricolum, Myxococcus xanthus, Prevotella ruminicola, Pyrococcus furiosus, Verrucomicrobium spinosum, Actinomyces naeslundii, Bacillus anthracis A0071, Bacillus anthracis Kruger B, Erwinia chrysanthemi, Gemmata obscuriglobus, Mycobacterium tuberculosis, Ruminococcus albus, Streptococcus sobrinus, Aspergillus fumigatus, Brugia malayi, Coccidioides immitis, Cryptococcus neoformans, Entamoeba histolytica, Oryza sativa chromosomes 3 & 10, Plasmodium vivax, Schistosoma mansoni, Solanum spp., Tetrahymena thermophila, Toxoplasma gondii, Theileria parva, Trichomonas vaginalis, Trypanosoma brucei, Trypanosoma cruzi, Acidithiobacillus ferrooxidans, Bacillus anthracis Kruger B, Burkholderia mallei, Clostridium perfringens ATCC13124, Dehalococcoides ethenogenes, Desulfovibrio vulgaris, Ehrlichia chaffeensis, Ehrlichia sennetsu, Geobacter sulfurreducens, Listeria monocytogenes, Methylococcus capsulatus, Mycobacterium avium 104, Mycobacterium smegmatis, Pseudomonas syringae, Staphylococcus aureus, Staphylococcus epidermidis, Treponema denticola, Wolbachia sp., Anaplasma phagocytophila, Bacillus cereus 10987, Bacteroides forsythus, Brucella ovis, Baumannia cicadellinicola, Campylobacter jejuni, Carboxydothermus hydrogenoformans, Colwellia sp. 34H, Dichelobacter nodosus

6 Genome-Scale Sequence Alignment. Efficiently compute alignments between entire genomes and chromosomes, for example: two strains of B. anthracis, each 5.1 Mb (< 30 CPU seconds); two chromosomes of A. thaliana, each 20-30 Mb (< 5 minutes); two human chromosomes, 100+ Mb each (< 30 minutes).

7 MUMmer alignments. MUMs: Maximal Unique Matches. The algorithm finds ALL matches, then strings them together and aligns the gaps. It is built on suffix trees: very fast alignment of long DNA sequences, with linear time and space requirements. Software at: http://www.tigr.org/software/mummer/
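
The "string them together" step can be illustrated with a small sketch. One standard way to pick a consistent set of matches is a longest-increasing-subsequence (LIS) computation over the match positions; the slide does not say that MUMmer does exactly this, so treat the code below as an illustration under that assumption. The function name `chain_mums` and the `(pos_a, pos_b, length)` tuple layout are hypothetical.

```python
from bisect import bisect_left


def chain_mums(mums):
    """Return the largest set of MUMs whose positions increase in both genomes.

    Each MUM is a hypothetical (pos_a, pos_b, length) tuple. Lengths are
    carried along but not used, and overlaps between neighbours are ignored.
    """
    ordered = sorted(mums)            # order by position in genome A
    tails = []                        # tails[k]: smallest pos_b ending a chain of k+1 MUMs
    tail_idx = []                     # index in `ordered` of that chain's last MUM
    prev = [None] * len(ordered)      # back-pointers used to recover the chain
    for i, (_, pos_b, _) in enumerate(ordered):
        k = bisect_left(tails, pos_b)
        if k == len(tails):
            tails.append(pos_b)
            tail_idx.append(i)
        else:
            tails[k] = pos_b
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None

    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


example = [(0, 0, 5), (10, 50, 4), (20, 12, 6), (30, 25, 3), (40, 60, 7)]
print(chain_mums(example))
# [(0, 0, 5), (20, 12, 6), (30, 25, 3), (40, 60, 7)]
```

The MUM at `(10, 50)` is dropped because keeping it would force the two later matches out of order. The regions between consecutive chained MUMs are the gaps that still need to be aligned.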

8 Suffix Trees. A trie: a tree with edges labelled by strings. Each leaf represents a sequence: the labels on the path to it from the root. A suffix tree holds all suffixes of a set of sequences. The suffix tree for sequences A and B contains |A| + |B| leaf nodes and can be constructed in O(|A| + |B|) time!
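
To make the data structure concrete, here is a minimal sketch of an uncompressed suffix trie built from nested dictionaries. It is not the linear-time construction the slide refers to: this version takes quadratic time and space and is only meant to show that each suffix ends at its own leaf. The function names are hypothetical, and the example string is the one from the MUMmer 2 slide.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts.

    `text` should end with a terminator such as '$' that occurs nowhere else,
    so that no suffix is a prefix of another and each one ends at a leaf.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count nodes with no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(trie, pattern):
    """Return True if `pattern` occurs somewhere in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("atgtgtgtc$")
print(count_leaves(trie))        # 10, one leaf per suffix
print(contains(trie, "gtgtc"))   # True
print(contains(trie, "atgtcc"))  # False
```

A real suffix tree compresses chains of single-child nodes into one edge labelled by a substring, which is what brings the size down to linear in the input.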

9 Maximal Unique Matches (MUMs). Sequences in genomes A and B that: occur exactly once in A and in B; are not contained in any larger matching sequence. [Figure: a match between A and B, labelled "Occurs only here" and "Mismatch at both ends"]
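
The definition above can be checked directly with a brute-force search. The sketch below is not the suffix-tree method from the talk; it tests every pair of positions and is only practical for short strings. The name `find_mums` and the `min_len` parameter are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text`, including overlapping ones."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not maximal if the match can be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a string.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            # Unique: the matched string occurs exactly once in each input.
            seq = a[i:i + length]
            if count_occurrences(a, seq) == 1 and count_occurrences(b, seq) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GGACGTACCA", "TTACGTACTT"))
# [(2, 2, 6)]
```

In this example the shared string ACGTAC is the only MUM. The shorter match TAC at position 5 of the first string and position 1 of the second is maximal but is rejected because TAC occurs twice in the second string.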

10 MUMmer 2 streaming algorithm. [Figure: suffix tree for the string atgtgtgtc$, with leaves numbered 1-10, and a streaming string ...atgtcc... matched against it at positions i and i+1]

11 MUMmer results: M. tuberculosis CDC1551 vs. H37Rv. [Table with rows and columns labelled A, C, G, T; row values as extracted: A 661649, C 4881169, G 1648944, T 1115961. Figure label: a MUM]

12 Helicobacter pylori strain 26695 vs. J99

13 V. cholerae vs. E. coli (forward)

14 V. cholerae vs. E. coli (reverse)

15 V. cholerae vs. E. coli (both strands)

16 Duplication and Gene Loss?

17 V. cholerae vs. itself

18 S. pyogenes vs. itself

19 Symmetric Inversions Model. [Figure: a common ancestor of A and B with genes numbered 1-32. Lineage A (A1, A2, A3) and lineage B (B1, B2, B3) each undergo successive inversions, some around the replication origin and some around the terminus, which reorder the genes.]
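
The effect of one symmetric inversion can be sketched in a few lines. The model below is a simplification with assumed conventions: the genome is a circular list of gene labels, the origin sits between the last and first elements, and strand flips of the inverted genes are ignored. The function names are hypothetical.

```python
def invert_around_origin(genome, half_width):
    """Reverse the circular segment reaching `half_width` genes either side of the origin."""
    n = len(genome)
    if not 0 <= 2 * half_width <= n:
        raise ValueError("half_width must be between 0 and len(genome) // 2")
    if half_width == 0:
        return list(genome)
    genome = list(genome)
    # Segment that spans the origin: tail of the list followed by its head.
    segment = genome[-half_width:] + genome[:half_width]
    segment.reverse()
    middle = genome[half_width:n - half_width]
    return segment[half_width:] + middle + segment[:half_width]


def dot_plot_points(genome_a, genome_b):
    """Return (position in A, position in B) for every shared gene."""
    position_in_b = {gene: j for j, gene in enumerate(genome_b)}
    return [(i, position_in_b[gene])
            for i, gene in enumerate(genome_a) if gene in position_in_b]


ancestor = list(range(1, 9))
derived = invert_around_origin(ancestor, 2)
print(derived)  # [8, 7, 3, 4, 5, 6, 2, 1]
print(dot_plot_points(ancestor, derived))
```

Genes outside the inverted segment stay on the main diagonal of the dot plot, while genes inside it move from position p to position n - 1 - p and so fall on the anti-diagonal. Repeated inversions of different widths populate both diagonals, which is the X-shaped pattern seen in the bacterial comparisons.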

20 M. leprae vs. M. tuberculosis

21 The “X-files” paper

22 Arabidopsis genome paleontology: compare all chromosomes to each other. Diorama by B.E. Dahlgren, © The Field Museum, Chicago

23 The hunt for genome-scale duplications
S. cerevisiae? 16% duplicated (Seoighe & Wolfe, 1999)
Maize? 10 chromosomes vs. 5 in some related grasses; segmental allotetraploid? (Gaut & Doebley, 1997)
Drosophila melanogaster: no duplications
Vertebrates: much speculation but little evidence (Skrabanek & Wolfe, 1998)
Arabidopsis thaliana: yes!

24 First discovery: large-scale duplication between chromosomes 2 and 4 (Lin et al., 1999). [Figure: chr. 2 vs. chr. 4]

25 Tandem duplications. [Figure: chr. 1]

26 Over 60% of the genome is covered by duplicated regions. Centromeres cover much of the rest. Strikingly, only about 1/3 of the genes in each block remain as duplicates.

27 No triplications! 19-24 large-scale duplications; >60% of the genome duplicated. If the duplications had occurred separately over time, triplications would be highly likely. The duplications therefore likely happened as one event (on an evolutionary time scale). Conclusion: whole-genome duplication.

28 Warning: Salzberg’s speculation follows. Start with 4 ancestral chromosomes. [Figure: two sets of chromosomes labelled I, III, IV, V]

29-53 [Animation frames with no slide text: the chromosome diagram is rearranged step by step, chromosome II appears, and the final frame shows chromosomes I, II, III, IV, V]

54 Warning: data quality control. Until December 2000, Arabidopsis data in GenBank were all BAC-based. Errors included: BACs on the wrong chromosome; BACs entered twice (sequenced twice) with different IDs, different annotation, and slightly different sequence. For duplication analysis, these errors would prove disastrous. Many of these errors are still in GenBank: old BACs are not automatically deleted.

55 Human Genome analysis
Used Celera’s assembly and annotation
26,588 genes, ordered along each of 24 chromosomes
MUMmer 2.0 used to align whole chromosomes
Nothing found in DNA-level alignments
Proteome alignments used instead
Recently re-computed using the latest human genome annotation (Ensembl)

56 Human whole-genome alignment
Create 24 “mini-proteomes” by concatenating all proteins on each chromosome
Use MUMmer to align each mini-proteome to the complete proteome (9,675,713 amino acids)
Search for conserved clusters of proteins
Confirmed the analysis by looking at all-vs-all Blast hits
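
A minimal sketch of the mini-proteome idea follows. It concatenates the proteins of one chromosome in order and keeps the start offset of each, so that a match position in the concatenated string can be mapped back to a gene. The separator character, the function names, and the `(gene_id, sequence)` input layout are all assumptions; the slide does not describe how the concatenation was actually done.

```python
from bisect import bisect_right


def build_mini_proteome(proteins, separator="*"):
    """Concatenate (gene_id, sequence) pairs given in chromosomal order.

    Returns the concatenated string, the start offset of each protein,
    and the gene ids in the same order.
    """
    pieces, starts, ids = [], [], []
    offset = 0
    for gene_id, seq in proteins:
        starts.append(offset)
        ids.append(gene_id)
        pieces.append(seq)
        offset += len(seq) + len(separator)
    return separator.join(pieces), starts, ids


def gene_at(position, starts, ids):
    """Map a position in the mini-proteome back to the gene that covers it.

    A position that falls on a separator is reported as the preceding gene.
    """
    return ids[bisect_right(starts, position) - 1]


chromosome = [("g1", "MKV"), ("g2", "MAAL"), ("g3", "MT")]
mini, starts, ids = build_mini_proteome(chromosome)
print(mini)                     # MKV*MAAL*MT
print(gene_at(5, starts, ids))  # g2
```

With this mapping, a run of matches that hit neighbouring genes on one chromosome and neighbouring genes on another can be reported as a candidate conserved cluster.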

57 What we’re looking for. Not looking for: tandem duplications; domain hits (very common, often give highly significant Blast hits).

58 Summary results
1077 duplicated blocks
10,310 “gene pairs”
    “pair” = 2 genes that match between two blocks
296 blocks with 3-4 gene pairs
781 blocks with 5 or more gene pairs
3522 distinct genes, many duplicated more than once
Large block: 33 genes on chr 2 and chr 14
    spans 63 Mbp on chr 14, over 70% of chr 14’s length
    spread over 97 genes on chr 2 and 332 genes on chr 14
    includes two of four known Hox clusters, an ancient duplication
Large block: 64 genes on chr 18 and chr 20
    previously undiscovered
Shuffled data: 370 gene pairs (3.6% false positive rate)
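
The totals on this slide are internally consistent, which a short check makes explicit. The sketch assumes that the false positive rate is simply the number of gene pairs found in the shuffled data divided by the number found in the real data; the slide does not spell out the formula, and the function name is hypothetical.

```python
def false_positive_rate(shuffled_pairs, observed_pairs):
    """Fraction of observed gene pairs expected by chance, estimated from shuffled data."""
    return shuffled_pairs / observed_pairs


blocks_3_to_4_pairs = 296
blocks_5_or_more_pairs = 781
total_blocks = 1077

# The two block-size classes account for every duplicated block.
assert blocks_3_to_4_pairs + blocks_5_or_more_pairs == total_blocks

rate = false_positive_rate(370, 10310)
print(f"{rate:.1%}")  # 3.6%
```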

60 Duplications in Human Chromosome 2

61 Human-mouse genome mapping
Close evolutionary distance permits DNA-level alignments
Protein similarity is even greater than DNA similarity
MUMmer quickly aligns each mouse mini-proteome to its human counterparts
Blast finds most (not all) of the same matches (and is far slower)
77% (566/731) of Mouse16 genes are found in syntenic regions of human
2.5% (18/731) of Mouse16 genes are unique to mouse, not found in human

62 Mouse chr 16 maps to human chromosomes 3, 8, 12, 16, 21, and 22

63 Have bacteria transferred their genes directly into the human genome? “Startling” discovery, Feb. 2001: 223 bacterial genes were laterally transferred into a vertebrate ancestor of humans (from the Nature human genome paper)

64 Horizontal (Lateral) Gene Transfer

65 Vertical Inheritance

66 Horizontal Gene Transfer ???

67 Horizontal gene transfer in Arabidopsis thaliana chr 2 (Lin et al., Nature, 1999)
135 genes are most closely related to cyanobacterial genes and thus were likely transferred from the chloroplast to the nucleus
Very recent transfer of a > 250 kb section of the mitochondrial genome
Many additional older mitochondrial → nuclear gene transfers

68 Examples of Horizontal Transfers: antibiotic resistance genes on plasmids; pathogenicity islands; toxin resistance genes on plasmids; Agrobacterium Ti plasmid; viruses and viroids; organelle-to-nucleus transfers.

69 Mechanisms of Horizontal Transfer: plasmid exchange (prokaryotes); mating/conjugation (prokaryotes); viruses and viroids; organelle-to-nucleus exchange (eukaryotes); scavenging from the environment; passive absorption; fusion of cells.

70 Nature human genome paper (2001): evidence for transfer?
Evidence: genes match bacteria but do not match non-vertebrate eukaryotes
Or, genes really are in non-vertebrates but have a stronger match to bacteria, as measured by BLAST E-value
113 of the 223 genes are found in a broad spectrum of prokaryotic species
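
The criterion described on this slide can be written as a small decision rule, which also shows why the result depends on how many eukaryotic genomes are searched. This is a sketch of the logic as stated on the slide, not the pipeline used in either paper; the function name, the input layout, and the E-values in the example are made up for illustration.

```python
def is_transfer_candidate(bacterial_evalue, eukaryote_evalues):
    """Apply the best-hit rule to one vertebrate gene.

    bacterial_evalue: best BLAST E-value against bacteria, or None if no hit.
    eukaryote_evalues: dict mapping each non-vertebrate eukaryote genome to its
    best E-value for this gene; genomes with no hit are simply absent.
    A lower E-value means a stronger match.
    """
    if bacterial_evalue is None:
        return False
    if not eukaryote_evalues:
        return True  # matches bacteria, no non-vertebrate eukaryote hit at all
    return bacterial_evalue < min(eukaryote_evalues.values())


# Hypothetical gene: strong bacterial hit, weak hit in one eukaryote.
print(is_transfer_candidate(1e-50, {"fly": 1e-20}))                     # True
# Same gene after a genome with a stronger hit is added to the search.
print(is_transfer_candidate(1e-50, {"fly": 1e-20, "new_genome": 1e-80}))  # False
```

Because a gene stops being a candidate as soon as any additional eukaryote shows an equal or stronger match, the candidate count can only stay the same or fall as more genomes are added.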

71 Alternative explanations
Gene loss from a small sample of non-vertebrate eukaryotes
Only 4 non-vertebrates used for analysis: fruit fly, nematode, yeast, and mustard weed (Arabidopsis)
Large and diverse set of prokaryotes (over 30 organisms, including extremophiles) used as well
Rapid divergence in non-vertebrate eukaryotes (evolutionary rate variation)
Still-incomplete genomes (e.g., D. melanogaster)
Erroneous annotation/gene finding
Contamination

72 Re-analysis: number of “transfers” decreases with the number of genomes analyzed

73 Evolutionary Rate Variation

74 Trees Don’t Support Transfer

75 Birney et al., Nature special issue on the human genome: “The unfinished human genomic DNA may contain contamination, particularly from bacteria but also from other sources... If the predicted gene matches a bacterial gene more closely than any vertebrate gene then it will almost always be a contaminant.”

76 Were genes really transferred? NO
Our re-analysis finds just 41 genes (Ensembl) or 46 (Celera) with best hits to bacteria, not 223
All of these could be explained by alternative mechanisms
More genomes will likely eliminate these remaining candidates
At least 3 have already been found in Drosophila, 10 more in other species
Great care is needed in order to make assertions of transfer from bacteria to humans
Implications would be significant; e.g., GMOs
Even more care is needed when working with unfinished data
Nature erratum to human genome paper, August 2001: “We agree.”

77 Acknowledgements
MUMmer: Arthur Delcher, Jeremy Peterson, Rob Fleischmann, Owen White, Simon Kasif, Jonathan Allen, Sam Angiuoli, Adam Phillippy
X alignments: Jonathan Eisen, Owen White, John Heidelberg
Arabidopsis duplications:
    TIGR: Maria Ermolaeva, Owen White, Jonathan Eisen, Xiaoying Lin, Samir Kaul
    AGI collaborators: Klaus Mayer and all his MIPS colleagues, Mike Bevan
Human duplications: Mark Yandell, Mark Adams, Mani Subramanian, Craig Venter (all formerly Celera), Ron Wides (Bar-Ilan University), Art Delcher
Lateral transfer: Jonathan Eisen, Owen White, Jeremy Peterson
Funding support: National Institutes of Health (NHGRI, NLM); National Science Foundation (CISE, BIO)

