Download presentation
1
New Methods for Comparative genomics
-Mark Yandell HHMI/BDGP
2
New Methods for Comparative genomics
-Mark Yandell HHMI/BDGP CGL: a software library for comparative genomics Explore recent history (5-70 myr years) Explore ancient history ( myr years)
3
CGL: A software library for comparative genomics
Part I: CGL: A software library for comparative genomics
4
>264 MSFNNALSGVNAAQKDLNVTANNIANVNTTGFKESRAEFADVYANSIFVNAKTQVGNGVA TGAVAQQFHQGALQFTNNALDLSIQGNGFFVTSDGLTNLDRTFTRAGAFKLNENSYMVNN QGNYLQGYEINTDGTPKAVSINATKPIQIPDRAGEPKMTELVEASFNLSIESKTKPTSPA AFDPTNSATFAHSTSVTIYDSLGAPHVITKYFVRHEDPAAPGTPLTPGVKMTFTSGKLDP TLTVPVDPIKTVALGTTAGIINNGADPTQTLEIRLGDVTQYSSPFNVTKLTQDGATVGNL TKVEITPDGIVSATYSNATTLKVAMVALAKFANSQGLTQVGDTSWRQSLLSGDALPGTPN SGTLGSIKSSALEQSNVDLTSQLVNLITAQRNFQANSRSLEVNSSLQQTILQI >264 ATGAAAGTTAGTTTTGAAAGAATAATTCCAAGTGAAAAAAGCTCTTTCCGCACACTGCAT AATAACTCTCCTATTTCTGAATTTAAATGGGAGTATCATTATCATCCGGAAATAGAACTG GTATGTGTAATTTCGGGAAGTGGCACACGGCATGTAGGCTACCATAAAAGCAATTATACA AACGGAGATCTTGTGTTAATAGGTTCAAACATTCCACATTCCGGATTTGGACTGAATTCT GTTGATCCGCATGAAGAAATAGTACTTCAGTTCAGGGAAGAGATTTTGCATTTTCCACAA CAGGAAGTTGAAACAAGAGCCGTGAAAGATCTACTGGAACGCTCTAAATATGGTATTCTG TATAGTACAGCTACAAAAAAGCTGCTCATGCCGAAACTAAAAAAGCTTCTGGAATCCGAA GGCTACAAAAGATACTTACTACTTCTGGAGATTCTCTTCGAACTTTCTTTGTGCGAGGAA TATGAATTGTTGAACAAAGAAATTATGCCTTATACCATAATCTCTAAAAATAAAACAAGA CTGGAAAATATCTTTACCTATGTGGAACATCATTACGATAAGGAAATAAATATAGAGGAT GTTGCAAAGCTGGCTAATCTTACTCTTCCTGCATTTTGTAATTTTTTTAAAAAAGCAACA CAGATTACCTTTACAGAATTTGTCAACCGTTACCGTATTAATAAAGCCTGCCTTCTGATG ACTCAGGATAAAACAATATCCGAATGCAGCTACAGTTGTGGCTTTAACAATGTTACTTAT TTCAACAGAATGTTTAAAAAATATACCAATAAAACGCCATCAGAATTT
6
Gene structure Sequence similarity ? ? ? …DEW RQWKMDKQV WDA ER
…DDW RGWKMDKQV WDMER ? Gene structure Sequence similarity
7
Aligning two sequences implicitly aligns two genes
gacgagtgg gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER …DDW RGWKMDKQV WDMER gt ag gt ag tga agaaatttcg atgcgcgcgcgcgcgcg gatgattgg gatgggtggcgcgggtggaaaatggacaaacaggtg tgggacatggagcga
8
Using BLASTP alignments to make sense of genomic annotations
Annotation A gacgagtgg gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga ? agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER …DDW RGWKMDKQV WDMER gt ag gt ag tga agaaatttcg atgcgcgcgcgcgcgcg gatgattgg gatgggtggcgcgggtggaaaatggacaaacaggtg tgggacatggagcga Annotation B
9
Using genomic annotations to make sense of BLASTP alignments
Score = 42.0 bits (97), Expect(2) = 2e-25 Identities = 23/64 (35%), Positives = 40/64 (61%), Gaps = 2/64 (3%) Frame = -3 Query: 1 MFQNDVSSPRELQLMAAKVEKELGPVDILVNNASLMPMTSTP-SLKSDEIDTILQLNL-G DVS+ E++ M KV + GP+DIL+NNA ++ T P + +E D ++ +NL G Sbjct: 29 VVKADVSNREEVREMVKKVIDKFGPIDILINNAGILGKTKDPLEVTDEEWDRVISVNLKG Score = 37.7 bits (86), Expect(2) = 2e-25 Identities = 17/43 (39%), Positives = 29/43 (66%) Frame = -1 Query: 60 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDI TGA G+GRAI++ELAK+G IN SGAE+ K+ +++ Sbjct: 89 ITGASRGIGRAIAIELAKRGVNVV---INYSGAEEEAKKTEEL 128
10
Using genomic annotations to make sense of BLASTP alignments
Score = 42.0 bits (97), Expect(2) = 2e-25 Identities = 23/64 (35%), Positives = 40/64 (61%), Gaps = 2/64 (3%) Frame = -3 Query: 1 MFQNDVSSPRELQLMAAKVEKELGPVDILVNNASLMPMTSTP-SLKSDEIDTILQLNL-G DVS+ E++ M KV + GP+DIL+NNA ++ T P + +E D ++ +NL G Sbjct: 29 VVKADVSNREEVREMVKKVIDKFGPIDILINNAGILGKTKDPLEVTDEEWDRVISVNLKG Score = 37.7 bits (86), Expect(2) = 2e-25 Identities = 17/43 (39%), Positives = 29/43 (66%) Frame = -1 Query: 60 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDI TGA G+GRAI++ELAK+G IN SGAE+ K+ +++ Sbjct: 89 ITGASRGIGRAIAIELAKRGVNVV---INYSGAEEEAKKTEEL 128 splice junction splice junction splice junction splice junction splice junction splice junction
11
? Annotation A Some un-annotated genome …DEW gacgagtgg
gacgagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga ? agtttaaagc atgatatatatatatatatcg gt ag gt ag taa …DEW RQWKMDKQV WDAER Some un-annotated genome
12
Using genomic annotations to make sense of TBLASTX alignments
Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / Query: VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT A VAVG * V+RVMDS+KPAVSTT Sbjct: AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT 45909
13
Using genomic annotations to make sense of TBLASTX alignments
Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / Query: VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT A VAVG * V+RVMDS+KPAVSTT Sbjct: AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT 45909 1st Coding Exon 5’-UTR 1st Intron 2nd Coding Exon
14
Using genomic annotations to make sense of TBLASTX alignments
Score = 112 bits (240), Expect(2) = 2e-25 Identities = 58/99 (58%), Positives = 61/99 (61%) Frame = -3 / Query: VTSATLPLTSFSLSPSANR*SNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA TSATLPLTSFSL P ANR*S RC MASR RG SRK + FRLK ARVVSSLR AC A Sbjct: LTSATLPLTSFSLRPRANR*SRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792 Query: ACVAVGVAEAVGPVTVT*PAGTAVVVRVMDSRKPAVSTT* A VAVG * V+RVMDS+KPAVSTT* Sbjct: AWVAVGAVGVAEDEGTV*AGPAGAVMRVMDSMKPAVSTT* 45909 1st Coding Exon 5’-UTR ? 1st Intron 2nd Coding Exon ?
15
Using TBLASTN alignments to identify orthologus exons and introns
in an un-annotated genome Score = 53.5 bits (127), Expect(2) = 3e-10 Identities = 29/58 (50%), Positives = 30/58 (51%) Frame = -2 Query: 1 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVMVIT MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEV+V++ Sbjct: 2097 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVAVVS 2270 splice junction 1st coding exon orthologus intron begins here (2252) Score = 105 bits (261), Expect(2) = 3e-10 Identities = 52/52 (100%), Positives = 52/52 (100%) Frame = -1 Query: 49 SIAGEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK EVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK Sbjct: 3991 NVMPEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 4146 splice junction 2nd coding exon orthologus intron ends here (4003)
16
Using TBLASTN alignments to identify orthologus exons and introns
in an un-annotated genome Score = 53.5 bits (127), Expect(2) = 3e-10 Identities = 29/58 (50%), Positives = 30/58 (51%) Frame = -2 Query: 1 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVMVIT MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEV+V++ Sbjct: 2097 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVAVVS 2270 splice junction 1st coding exon ‘1st coding exon’ orthologus intron begins here (2252) Score = 105 bits (261), Expect(2) = 3e-10 Identities = 52/52 (100%), Positives = 52/52 (100%) Frame = -1 Query: 49 SIAGEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK EVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK Sbjct: 3991 NVMPEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 4146 splice junction 2nd coding exon '2nd coding exon’ orthologus intron ends here (4003)
17
Exploring recent history with CGL
Part II Exploring recent history with CGL D. melanogaster Intron length distribution
18
time (millions of years)
D. virillis D. pseudoobscura D. ananassae D. yakuba D. simulans D. melanogaster 62.9 54.9 44.2 12.8 5.4 time (millions of years) Tamura et al (2003) Temporal Patterns of Fruit Fly (Drosophila Evolution Revealed by Mutation Clocks ; MBE
19
Compare intron lengths
agtttaaagc Query: 2 gacgagtgg gaccagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga 54 || || ||| || ||| ||||||||||||||||||||| ||||||||||||||| Sbjct: 29 gatgattgg gatgggtgg---gggtggaaaatggacaaacaggtg tgggacatggagcga 84 agaaatttcg atgcgcgcgcgcgcgcg atgatatattatatatatcg DDWRGW-MDKQVWDMER DEWRQWKMDKQVWDAER Query: 2 Sbjct: 29 18 44 D+WR WKMDKQVWDMER How do intron lengths change over time? Compare intron lengths Contig or read MDSR DKQV MESR atggatagtagc gataagcagaag atggaaagtagc gataaacaaaag annotated inferred Li Li’ Annotation
20
Compare intron lengths
agtttaaagc atgatatattatatatatcg Query: 2 gacgagtgg gaccagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga 54 || || ||| || ||| ||||||||||||||||||||| ||||||||||||||| Sbjct: 29 gatgattgg gatgggtgg---gggtggaaaatggacaaacaggtg tgggacatggagcga 84 agaaatttcg atgcgcgcgcgcgcgcg Compare intron lengths Contig or read MDSR DKQV MESR atggatagtagc gataagcagaag atggaaagtagc gataaacaaaag annotated inferred Li Li’ Annotation Log10 length D. simulans Log10 length D. melanogaster
21
Why do intron lengths change over time?
D. pseudoobscura intron has transposon D. melanogaster intron has transposon Both introns have the same transposon D. pseudoobscura intron length D. melanogaster intron length
22
Does selection on the protein influence intron lengths ?
[(Li +Lj) - |Li–Lj| ]/Li+Lj Annotation gataagcagaag atggatagtagc Ka Li annotated MDSR DKQV Lj MESR DKQV inferred atggaaagtagc gataaacaaaag Contig or read
23
melanogaster & simulans melanogaster & yakuba
5 million yrs 13 million yrs 55 million yrs 250 million yrs melanogaster & simulans melanogaster & yakuba melanogaster & pseudoobsura melanogaster & mosquito A new evolutionary clock?
24
A new evolutionary clock
[(Li +Lj) - |Li–Lj| ]/Li+Lj D. simulans 5.4 myr D. yakuba myr D. pseudoobscura 54.9 myr D. Virilis myr D.ananassae myr
25
P(6) vs. dpse << 0.001 P(6) vs. dyak = 0.001 Region appears
To be un-duplicated In dpse; might be duplicated in dyak P(6) vs. dpse << 0.001 P(6) vs. dyak = 0.001
26
Are the phenoma we are seeing in th flies common to other orgs?
Explain the experiment What does this tell us? AA: recripcol best hits seems to work. Do it al the time but how do we know it workks: these data provide some independent confirnation of recip-best hits approachs. A. most paralogs much older than 70 million years– current details as to genes and gene families established much, much more than 70 million years ago. B. Paralogs diverging from one-another at same rate in both genomes– on average
27
Exploring ancient history with CGL
Part III Exploring ancient history with CGL C. elegans C. intestinalis H. sapiens M. musculus A.gambiae D. melanogaster 1000 time (myr) 250 70
28
protein similarities similarities in the intron-exon structures of genes
29
MIKEVFRPDKFGMDL MILEVFRPDKFGMDL MIKDVFRPDKFGIDL
30
Fraction of identical amino acids
The evolutionary trajectory of protein similarities Best HSP, reciprocal best blastp hits Human vs. mouse Dmel vs. agam Human vs. ciona Human vs. plant Fraction of identical amino acids
31
Fraction of identical amino acids
The evolutionary trajectory of protein similarities Asymptotic distribution? 70 million years 250 million years 600 million years 3 billion years Fraction of identical amino acids time
32
Fraction of identical amino acids
D. melanogaster vs. A. gambiae H. sapiens vs. C. intestinalis Fraction of identical amino acids C. intestinalis H. sapiens M. musculus A.gambiae D. melanogaster C. elegans
33
Fraction of identical amino acids
? Fraction of identical amino acids
34
Organism A Organism B BLASTP All proteins All proteins Every reciprocal best hit For every HSP Global similarity between the two organisms’ proteomes
35
H ‘standard’ C. elegans C. intestinalis H. sapiens M. musculus
A.gambiae D. melanogaster time (myr) 1000 250 70 C. elegans C. intestinalis H 100 H. sapiens 100 M. musculus A.gambiae 100 D. melanogaster
36
MIKEVFRPDKFGMDL MILEVFRPDKFGMDL MIKDVFRPDKFGIDL
37
I = * Organism A Organism B BLASTP All proteins All proteins
Every reciprocal best hit XXXXXXXXXXXXXX For every HSP qintron-intron I = pquery intron * psbjct intron Global similarity between the two organisms’ intron-exon structures
38
Good trees can be made from both measures; these measures provide a rapid
and robust means to calculate phylogenetic trees using the combined data from many annotated genomes simultaneously. C. elegans C. elegans C. intestinalis C. intestinalis 100 100 H. sapiens H. sapiens 100 100 M. musculus M. musculus A. gambiae A. gambiae 100 100 D. melanogaster D. melanogaster H I Trees based on I provide an independent check on trees made from amino-acid similarities; note that deep nodes are better resolved in the I tree.
39
Like intron lengths, gene-structures seem to evolve independently of protein
Sequence– hold H constant and you still get the same I tree A. gambiae D. melanogaster A. gambiae D. melanogaster C. elegans M. musculus M. musculus C. elegans H. sapiens H. sapiens C. intestinalis C. intestinalis I H = 1.5 ~ 50% similarity
40
New Methods for Comparative genomics
Part I: CGL: a software library for comparative genomics Genome annotations make possible any new kinds of genome analyses We have a developed a (soon to be) publicly available software library called CGL; ‘seagull’ that that greatly facilitates such analyses. Part II: Using CGL to explore recent history (1-70 myr years) Intron length can be used as an evolutionary clock. Seems to evolve independently of protein sequence. Can be used to confirm and calibrate existing protein clock approaches. Can be used to identify recent gene duplication events. Can be used to strengthen and extend with many ‘standard’ protein based approaches, e.g. quartet analysis. Part II: Using CGL to explore ancient history ( myr years) Proteins may saturate completely after ~ 1 billion years or so. Resolving deep evolutionary questions w/ proteins means computing ‘consensus’ trees; we offer one approach to this problem using ‘H’. Like intron length, the intron-exon structures of genes contains a phylogenetic signal; the ‘I’ vs. ‘H’ trees This signal seems to evolve independently of protein sequence, and perhaps more slowly. Large scale comparative genomics has much to tell us about the evolution of both genes and organisms.
41
Acknowledgements Chris Mungall Simon Prochnik Chris Smith
Josh Kaminker George Hartzell Suzi Lewis Gerald M. Rubin
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.