New Methods for Comparative genomics

Slides:



Advertisements
Similar presentations
Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes.
Advertisements

1 Orthologs: Two genes, each from a different species, that descended from a single common ancestral gene Paralogs: Two or more genes, often thought of.
BLAST Sequence alignment, E-value & Extreme value distribution.
THE EVOLUTIONARY HISTORY OF BIODIVERSITY
Summer Bioinformatics Workshop 2008 Comparative Genomics and Phylogenetics Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State.
Types of homology BLAST
PHYLOGENY AND SYSTEMATICS
BIOE 109 Summer 2009 Lecture 4- Part I Mutation and genetic variation.
Bioinformatics and Phylogenetic Analysis
"Nothing in biology makes sense except in the light of evolution" Theodosius Dobzhansky.
Alternative splicing and evolution Daniel Jeffares.
Genomics and bioinformatics summary 1. Gene finding: computer searches, cDNAs, ESTs, 2.Microarrays 3.Use BLAST to find homologous sequences 4.Multiple.
Lecture 12 Splicing and gene prediction in eukaryotes
EVOLUTIONARY AND COMPUTATIONAL GENOMICS Shin-Han Shiu Plant Biology / CMB / EEBB / Genetics / QBMI.
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
Sequencing a genome and Basic Sequence Alignment
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Comparative Genomics of the Eukaryotes
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Computational Biology, Part D Phylogenetic Trees Ramamoorthi Ravi/Robert F. Murphy Copyright  2000, All rights reserved.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Sequencing a genome and Basic Sequence Alignment
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Construction of Substitution Matrices
Chapter 24: Molecular and Genomic Evolution CHAPTER 24 Molecular and Genomic Evolution.
1 GMOD Meeting, Spring 2005 Peili Zhang, FlyBase - Harvard Comparative Genome Annotation of Drosophila pseudoobscura and Its Implementation in chado.
Bioinformatic Tools for Comparative Genomics of Vectors Comparative Genomics.
Using blast to study gene evolution – an example.
Cédric Notredame (08/12/2015) Molecular Evolution Cédric Notredame.
The iPlant Collaborative Vision Enable life science researchers and educators to use and extend cyberinfrastructure.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Copyright OpenHelix. No use or reproduction without express written consent1.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
What is BLAST? Basic BLAST search What is BLAST?
Supplementary Fig. 1 Supplementary Figure 1. Distributions of (A) exon and (B) intron lengths in O. sativa and A. thaliana genes. Green bars are used for.
Gene_identifier color_no gtm1_mouse 2 gtm2_mouse 2 >fasta_format_description_line >GTM1_HUMAN GLUTATHIONE S-TRANSFERASE MU 1 (GSTM1-1) PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLPYLIDGAHKI.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
What is BLAST? Basic BLAST search What is BLAST?
CS515: Bioinformatic Algorithms
Bioinformatics Overview
Sequence similarity, BLAST alignments & multiple sequence alignments
Introduction to Bioinformatics Resources for DNA Barcoding
Supplementary Fig. 1 Supplementary Figure 1. Distributions of (A) exon and (B) intron lengths in O. sativa and A. thaliana genes. Green bars are used.
Basics of BLAST Basic BLAST Search - What is BLAST?
Basics of Comparative Genomics
Fig. 1. (a) Relationship of codon adaptation index (CAI) and the number of substitutions per fourfold-degenerate site (dS) between D. melanogaster.
Lesson Overview 17.4 Molecular Evolution.
Comparative Genomics.
Gil McVean Department of Statistics, Oxford
Genome Projects Maps Human Genome Mapping Human Genome Sequencing
Genome Center of Wisconsin, UW-Madison
Bioinformatics and BLAST
What are the Patterns Of Nucleotide Substitution Within Coding and
Fig Figure 21.1 What genomic information makes a human or chimpanzee?
Every living organism inherits a blueprint for life from its parents.
Evolution of eukaryote genomes
Secreted Fringe-like Signaling Molecules May Be Glycosyltransferases
Sequence alignment, Part 2
Comparative Genomics.
Basic Local Alignment Search Tool (BLAST)
Basics of Comparative Genomics
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Michael S.Y. Lee, Julien Soubrier, Gregory D. Edgecombe 
Presentation transcript:

New Methods for Comparative genomics -Mark Yandell HHMI/BDGP

New Methods for Comparative genomics -Mark Yandell HHMI/BDGP CGL: a software library for comparative genomics Explore recent history (5-70 myr years) Explore ancient history (70-1000 myr years)

CGL: A software library for comparative genomics Part I: CGL: A software library for comparative genomics

>264 MSFNNALSGVNAAQKDLNVTANNIANVNTTGFKESRAEFADVYANSIFVNAKTQVGNGVA TGAVAQQFHQGALQFTNNALDLSIQGNGFFVTSDGLTNLDRTFTRAGAFKLNENSYMVNN QGNYLQGYEINTDGTPKAVSINATKPIQIPDRAGEPKMTELVEASFNLSIESKTKPTSPA AFDPTNSATFAHSTSVTIYDSLGAPHVITKYFVRHEDPAAPGTPLTPGVKMTFTSGKLDP TLTVPVDPIKTVALGTTAGIINNGADPTQTLEIRLGDVTQYSSPFNVTKLTQDGATVGNL TKVEITPDGIVSATYSNATTLKVAMVALAKFANSQGLTQVGDTSWRQSLLSGDALPGTPN SGTLGSIKSSALEQSNVDLTSQLVNLITAQRNFQANSRSLEVNSSLQQTILQI >264 ATGAAAGTTAGTTTTGAAAGAATAATTCCAAGTGAAAAAAGCTCTTTCCGCACACTGCAT AATAACTCTCCTATTTCTGAATTTAAATGGGAGTATCATTATCATCCGGAAATAGAACTG GTATGTGTAATTTCGGGAAGTGGCACACGGCATGTAGGCTACCATAAAAGCAATTATACA AACGGAGATCTTGTGTTAATAGGTTCAAACATTCCACATTCCGGATTTGGACTGAATTCT GTTGATCCGCATGAAGAAATAGTACTTCAGTTCAGGGAAGAGATTTTGCATTTTCCACAA CAGGAAGTTGAAACAAGAGCCGTGAAAGATCTACTGGAACGCTCTAAATATGGTATTCTG TATAGTACAGCTACAAAAAAGCTGCTCATGCCGAAACTAAAAAAGCTTCTGGAATCCGAA GGCTACAAAAGATACTTACTACTTCTGGAGATTCTCTTCGAACTTTCTTTGTGCGAGGAA TATGAATTGTTGAACAAAGAAATTATGCCTTATACCATAATCTCTAAAAATAAAACAAGA CTGGAAAATATCTTTACCTATGTGGAACATCATTACGATAAGGAAATAAATATAGAGGAT GTTGCAAAGCTGGCTAATCTTACTCTTCCTGCATTTTGTAATTTTTTTAAAAAAGCAACA CAGATTACCTTTACAGAATTTGTCAACCGTTACCGTATTAATAAAGCCTGCCTTCTGATG ACTCAGGATAAAACAATATCCGAATGCAGCTACAGTTGTGGCTTTAACAATGTTACTTAT TTCAACAGAATGTTTAAAAAATATACCAATAAAACGCCATCAGAATTT

Gene structure Sequence similarity ? ? ? …DEW RQWKMDKQV WDA ER …DDW RGWKMDKQV WDMER ? Gene structure Sequence similarity

Aligning two sequences implicitly aligns two genes gacgagtgg gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER …DDW RGWKMDKQV WDMER gt ag gt ag tga agaaatttcg atgcgcgcgcgcgcgcg gatgattgg gatgggtggcgcgggtggaaaatggacaaacaggtg tgggacatggagcga

Using BLASTP alignments to make sense of genomic annotations Annotation A gacgagtgg gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga ? agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER …DDW RGWKMDKQV WDMER gt ag gt ag tga agaaatttcg atgcgcgcgcgcgcgcg gatgattgg gatgggtggcgcgggtggaaaatggacaaacaggtg tgggacatggagcga Annotation B

Using genomic annotations to make sense of BLASTP alignments Score = 42.0 bits (97), Expect(2) = 2e-25 Identities = 23/64 (35%), Positives = 40/64 (61%), Gaps = 2/64 (3%) Frame = -3 Query: 1 MFQNDVSSPRELQLMAAKVEKELGPVDILVNNASLMPMTSTP-SLKSDEIDTILQLNL-G 59 + + DVS+ E++ M KV + GP+DIL+NNA ++ T P + +E D ++ +NL G Sbjct: 29 VVKADVSNREEVREMVKKVIDKFGPIDILINNAGILGKTKDPLEVTDEEWDRVISVNLKG 88 Score = 37.7 bits (86), Expect(2) = 2e-25 Identities = 17/43 (39%), Positives = 29/43 (66%) Frame = -1 Query: 60 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDI 92 +TGA G+GRAI++ELAK+G ++ IN SGAE+ K+ +++ Sbjct: 89 ITGASRGIGRAIAIELAKRGVNVV---INYSGAEEEAKKTEEL 128

Using genomic annotations to make sense of BLASTP alignments Score = 42.0 bits (97), Expect(2) = 2e-25 Identities = 23/64 (35%), Positives = 40/64 (61%), Gaps = 2/64 (3%) Frame = -3 Query: 1 MFQNDVSSPRELQLMAAKVEKELGPVDILVNNASLMPMTSTP-SLKSDEIDTILQLNL-G 59 + + DVS+ E++ M KV + GP+DIL+NNA ++ T P + +E D ++ +NL G Sbjct: 29 VVKADVSNREEVREMVKKVIDKFGPIDILINNAGILGKTKDPLEVTDEEWDRVISVNLKG 88 Score = 37.7 bits (86), Expect(2) = 2e-25 Identities = 17/43 (39%), Positives = 29/43 (66%) Frame = -1 Query: 60 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDI 92 +TGA G+GRAI++ELAK+G ++ IN SGAE+ K+ +++ Sbjct: 89 ITGASRGIGRAIAIELAKRGVNVV---INYSGAEEEAKKTEEL 128 splice junction splice junction splice junction splice junction splice junction splice junction

? Annotation A Some un-annotated genome …DEW gacgagtgg gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga ? agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER Some un-annotated genome

Using genomic annotations to make sense of TBLASTX alignments Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / +1 Query: 461 VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA 282 +TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: 45613 LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: 281 ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT 165 A VAVG * V+RVMDS+KPAVSTT Sbjct: 45793 AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT 45909

Using genomic annotations to make sense of TBLASTX alignments Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / +1 Query: 461 VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA 282 +TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: 45613 LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: 281 ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT 165 A VAVG * V+RVMDS+KPAVSTT Sbjct: 45793 AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT 45909 1st Coding Exon 5’-UTR 1st Intron 2nd Coding Exon

Using genomic annotations to make sense of TBLASTX alignments Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / +1 Query: 461 VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA 282 +TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: 45613 LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: 281 ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT* 165 A VAVG * V+RVMDS+KPAVSTT* Sbjct: 45793 AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT* 45909 1st Coding Exon 5’-UTR ? 1st Intron 2nd Coding Exon ?

Using TBLASTN alignments to identify orthologus exons and introns in an un-annotated genome Score = 53.5 bits (127), Expect(2) = 3e-10 Identities = 29/58 (50%), Positives = 30/58 (51%) Frame = -2 Query: 1 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVMVIT 58 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEV+V++ Sbjct: 2097 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVAVVS 2270 splice junction 1st coding exon orthologus intron begins here (2252) Score = 105 bits (261), Expect(2) = 3e-10 Identities = 52/52 (100%), Positives = 52/52 (100%) Frame = -1 Query: 49 SIAGEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 108 +++ EVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK Sbjct: 3991 NVMPEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 4146 splice junction 2nd coding exon orthologus intron ends here (4003)

Using TBLASTN alignments to identify orthologus exons and introns in an un-annotated genome Score = 53.5 bits (127), Expect(2) = 3e-10 Identities = 29/58 (50%), Positives = 30/58 (51%) Frame = -2 Query: 1 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVMVIT 58 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEV+V++ Sbjct: 2097 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVAVVS 2270 splice junction 1st coding exon ‘1st coding exon’ orthologus intron begins here (2252) Score = 105 bits (261), Expect(2) = 3e-10 Identities = 52/52 (100%), Positives = 52/52 (100%) Frame = -1 Query: 49 SIAGEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 108 +++ EVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK Sbjct: 3991 NVMPEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 4146 splice junction 2nd coding exon '2nd coding exon’ orthologus intron ends here (4003)

Exploring recent history with CGL Part II Exploring recent history with CGL D. melanogaster Intron length distribution

time (millions of years) D. virillis D. pseudoobscura D. ananassae D. yakuba D. simulans D. melanogaster 62.9 54.9 44.2 12.8 5.4 time (millions of years) Tamura et al (2003) Temporal Patterns of Fruit Fly (Drosophila Evolution Revealed by Mutation Clocks ; MBE

Compare intron lengths agtttaaagc Query: 2 gacgagtgg gaccagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga 54 || || ||| || ||| ||||||||||||||||||||| ||||||||||||||| Sbjct: 29 gatgattgg gatgggtgg---gggtggaaaatggacaaacaggtg tgggacatggagcga 84 agaaatttcg atgcgcgcgcgcgcgcg atgatatattatatatatcg DDWRGW-MDKQVWDMER DEWRQWKMDKQVWDAER Query: 2 Sbjct: 29 18 44 D+WR WKMDKQVWDMER How do intron lengths change over time? Compare intron lengths Contig or read MDSR DKQV MESR atggatagtagc gataagcagaag atggaaagtagc gataaacaaaag annotated inferred Li Li’ Annotation

Compare intron lengths agtttaaagc atgatatattatatatatcg Query: 2 gacgagtgg gaccagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga 54 || || ||| || ||| ||||||||||||||||||||| ||||||||||||||| Sbjct: 29 gatgattgg gatgggtgg---gggtggaaaatggacaaacaggtg tgggacatggagcga 84 agaaatttcg atgcgcgcgcgcgcgcg Compare intron lengths Contig or read MDSR DKQV MESR atggatagtagc gataagcagaag atggaaagtagc gataaacaaaag annotated inferred Li Li’ Annotation Log10 length D. simulans Log10 length D. melanogaster

Why do intron lengths change over time? D. pseudoobscura intron has transposon D. melanogaster intron has transposon Both introns have the same transposon D. pseudoobscura intron length D. melanogaster intron length

Does selection on the protein influence intron lengths ? [(Li +Lj) - |Li–Lj| ]/Li+Lj Annotation gataagcagaag atggatagtagc Ka Li annotated MDSR DKQV Lj MESR DKQV inferred atggaaagtagc gataaacaaaag Contig or read

melanogaster & simulans melanogaster & yakuba 5 million yrs 13 million yrs 55 million yrs 250 million yrs melanogaster & simulans melanogaster & yakuba melanogaster & pseudoobsura melanogaster & mosquito A new evolutionary clock?

A new evolutionary clock [(Li +Lj) - |Li–Lj| ]/Li+Lj D. simulans 5.4 myr D. yakuba 12.8 myr D. pseudoobscura 54.9 myr D. Virilis 62.9 myr D.ananassae 44.2 myr

P(6) vs. dpse << 0.001 P(6) vs. dyak = 0.001 Region appears To be un-duplicated In dpse; might be duplicated in dyak P(6) vs. dpse << 0.001 P(6) vs. dyak = 0.001

Are the phenoma we are seeing in th flies common to other orgs? Explain the experiment What does this tell us? AA: recripcol best hits seems to work. Do it al the time but how do we know it workks: these data provide some independent confirnation of recip-best hits approachs. A. most paralogs much older than 70 million years– current details as to genes and gene families established much, much more than 70 million years ago. B. Paralogs diverging from one-another at same rate in both genomes– on average

Exploring ancient history with CGL Part III Exploring ancient history with CGL C. elegans C. intestinalis H. sapiens M. musculus A.gambiae D. melanogaster 1000 time (myr) 250 70

protein similarities similarities in the intron-exon structures of genes

MIKEVFRPDKFGMDL MILEVFRPDKFGMDL MIKDVFRPDKFGIDL

Fraction of identical amino acids The evolutionary trajectory of protein similarities Best HSP, reciprocal best blastp hits Human vs. mouse Dmel vs. agam Human vs. ciona Human vs. plant 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Fraction of identical amino acids

Fraction of identical amino acids The evolutionary trajectory of protein similarities Asymptotic distribution? 70 million years 250 million years 600 million years 3 billion years 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Fraction of identical amino acids time

Fraction of identical amino acids D. melanogaster vs. A. gambiae H. sapiens vs. C. intestinalis 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Fraction of identical amino acids C. intestinalis H. sapiens M. musculus A.gambiae D. melanogaster C. elegans

Fraction of identical amino acids ? 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Fraction of identical amino acids

Organism A Organism B BLASTP All proteins All proteins Every reciprocal best hit For every HSP Global similarity between the two organisms’ proteomes

H ‘standard’ C. elegans C. intestinalis H. sapiens M. musculus A.gambiae D. melanogaster time (myr) 1000 250 70 C. elegans C. intestinalis H 100 H. sapiens 100 M. musculus A.gambiae 100 D. melanogaster

MIKEVFRPDKFGMDL MILEVFRPDKFGMDL MIKDVFRPDKFGIDL

I = * Organism A Organism B BLASTP All proteins All proteins Every reciprocal best hit XXXXXXXXXXXXXX For every HSP qintron-intron I = pquery intron * psbjct intron Global similarity between the two organisms’ intron-exon structures

Good trees can be made from both measures; these measures provide a rapid and robust means to calculate phylogenetic trees using the combined data from many annotated genomes simultaneously. C. elegans C. elegans C. intestinalis C. intestinalis 100 100 H. sapiens H. sapiens 100 100 M. musculus M. musculus A. gambiae A. gambiae 100 100 D. melanogaster D. melanogaster H I Trees based on I provide an independent check on trees made from amino-acid similarities; note that deep nodes are better resolved in the I tree.

Like intron lengths, gene-structures seem to evolve independently of protein Sequence– hold H constant and you still get the same I tree A. gambiae D. melanogaster A. gambiae D. melanogaster C. elegans M. musculus M. musculus C. elegans H. sapiens H. sapiens C. intestinalis C. intestinalis I H = 1.5 ~ 50% similarity

New Methods for Comparative genomics Part I: CGL: a software library for comparative genomics Genome annotations make possible any new kinds of genome analyses We have a developed a (soon to be) publicly available software library called CGL; ‘seagull’ that that greatly facilitates such analyses. Part II: Using CGL to explore recent history (1-70 myr years) Intron length can be used as an evolutionary clock. Seems to evolve independently of protein sequence. Can be used to confirm and calibrate existing protein clock approaches. Can be used to identify recent gene duplication events. Can be used to strengthen and extend with many ‘standard’ protein based approaches, e.g. quartet analysis. Part II: Using CGL to explore ancient history (70-1000 myr years) Proteins may saturate completely after ~ 1 billion years or so. Resolving deep evolutionary questions w/ proteins means computing ‘consensus’ trees; we offer one approach to this problem using ‘H’. Like intron length, the intron-exon structures of genes contains a phylogenetic signal; the ‘I’ vs. ‘H’ trees This signal seems to evolve independently of protein sequence, and perhaps more slowly. Large scale comparative genomics has much to tell us about the evolution of both genes and organisms.

Acknowledgements Chris Mungall Simon Prochnik Chris Smith Josh Kaminker George Hartzell Suzi Lewis Gerald M. Rubin