Published by Randolph Greene. Modified over 9 years ago.
1
Comparative genomics for pathogen/vector annotation
Manolis Kellis
MIT Computer Science and Artificial Intelligence Lab (CSAIL)
Broad Institute of MIT and Harvard for Genomics in Medicine
2
The age of comparative genomics
[Figure: 32 mammals, 12 flies, 9 yeasts, 8 Candida; species labels: human, mouse, rat, chimp, dog; group labels: pathogens, vectors]
3
Resolving power in mammals, flies, fungi
10 mammals
–Neutral: 2.57 subs/site (opp: 0.62, 32 sps: 4.87)
–Coding: 1.16 subs/site
–Detect: 6-mer at FP 10^-6
12 flies
–Neutral: 4.13 subs/site
–Coding: 1.65 subs/site
–Detect: 6-mer at 10^-11
17 yeasts (8 Candida, 9 Yeasts)
–Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
–Coding: 7.91 subs/site
–Detect: 3-mer at 10^-21
[Figure: phylogenetic trees with scale bars 0.3 sub/site, 0.1 sub/site, 0.8 sub/site; fungal tree labels: Post-duplication, Pre-dup, Diploid, Haploid, P]
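A rough way to see how the "Detect" figures scale with branch length is sketched below. It assumes, and this is an assumption not stated on the slide, that substitutions at each site follow a Poisson process with mean equal to the total neutral branch length, so a k-mer stays fully unchanged by chance with probability exp(-k * T). The function name is illustrative.

```python
import math

def chance_conservation(k, neutral_branch_length):
    """Probability that all k sites are unchanged across the tree by chance.

    Assumption: substitutions per site are Poisson-distributed with mean equal
    to the total neutral branch length, and sites evolve independently.
    """
    return math.exp(-k * neutral_branch_length)

# Neutral branch lengths and k-mer sizes quoted on the slide.
for label, k, t in [("mammals", 6, 2.57), ("flies", 6, 4.13), ("fungi", 3, 15.5)]:
    p = chance_conservation(k, t)
    print(f"{label}: {k}-mer, T={t} subs/site, log10(p) = {math.log10(p):.1f}")
```

Under this simple model the three values fall within about an order of magnitude of the thresholds quoted on the slide; the slide does not say how its own figures were derived.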
4
Extensive conservation of synteny
Global mapping of orthologous segments
Nucleotide-level alignments span complete genomes
Study properties / patterns of nucleotide conservation
[Figure panels: Mammals, Flies, Candida]
5
Comparative Genomics 101: Conservation → Function
Conserved elements are typically functional (and vice versa)
–For example: exons are deeply conserved to mouse, chicken, fish
Some conserved elements are still uncharacterized
–How do we make sense of them?
–How do we distinguish each type of functional element?
Answer: evolutionary signatures (Comp. Genomics 201)
–Tell me how you evolve, I’ll tell you who you are
–Patterns of change → selective pressures → specific function
6
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
–Genome Duplication
–Emergence of new functions
7
Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
[Figure: conservation track ('*' under conserved columns, column positions not recoverable); label: Splice]
Protein-coding genes have specific evolutionary constraints
–Gaps are multiples of three (preserve amino acid translation)
–Mutations are largely 3-periodic (silent codon substitutions)
–Specific triplets exchanged more frequently (conservative substs.)
–Conservation boundaries are sharp (pinpoint individual splicing signals)
Encode as ‘evolutionary signatures’
–Computational test for each of them
–Combine and score systematically
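Two of the signatures listed on this slide (gap lengths that are multiples of three, and 3-periodic mutations) can be turned into simple computational tests. The sketch below is illustrative only: the function names and the toy sequences are assumptions, not the tests used in the talk.

```python
import re

def gap_lengths(aligned_seq):
    """Lengths of every run of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(alignment):
    """Fraction of gaps whose length is a multiple of three.

    `alignment` maps species name -> aligned sequence (hypothetical input format).
    Returns None when the alignment has no gaps.
    """
    lengths = [n for seq in alignment.values() for n in gap_lengths(seq)]
    if not lengths:
        return None
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

def mismatches_by_codon_position(ref, other, frame=0):
    """Count mismatches at codon positions 1, 2, 3 of the reference.

    Positions are counted on the ungapped reference; columns where either
    sequence has a gap are not counted as mismatches.
    """
    counts = [0, 0, 0]
    pos = 0  # index along the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[(pos - frame) % 3] += 1
        pos += 1
    return counts

toy = {"sp1": "ATG---GCT", "sp2": "ATGAAAGCT", "sp3": "ATGA--GCT"}
print(frame_preserving_fraction(toy))
print(mismatches_by_codon_position("ATGGCTGAAACC", "ATGGCCGAGACA"))
```

In a coding region one would expect the first value to be close to 1 and the mismatch counts to concentrate at the third codon position; a real test would also need a null distribution from non-coding alignments.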
8
Putting it all together: probabilistic framework
Hidden Markov Models (HMMs)
–Generative model; learn emission and transition probabilities
–Easy to train, hard to integrate long-range signals
Conditional Random Fields (CRFs)
–Discriminative dual of HMMs; learn weights on features
–Easy to integrate diverse signals; gradient ascent for training
Systematically annotate all protein-coding genes
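To make the HMM half of this slide concrete, here is a minimal two-state Viterbi decoder. It is a sketch, not the annotation system described in the talk: the states, the observation symbols ('c' for a codon-like substitution pattern at a column, 'n' for anything else) and all probabilities are invented toy values.

```python
import math

# Toy parameters (assumptions for illustration only).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"c": 0.8, "n": 0.2},
        "noncoding": {"c": 0.2, "n": 0.8}}

def viterbi(obs, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Most probable state path for a non-empty observation string (log space)."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this step.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

print(viterbi("nnnccccccnnn"))
```

The sticky transition probabilities mean a single 'c' column is not enough to call a coding segment, while a run of them is. A CRF would replace the emission and transition probabilities with learned feature weights, which is what makes it easier to combine diverse signals.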
9
Revisiting fly genome annotation
Large-scale re-annotation of the fly genome
–New genes and exons, dubious genes and exons
–Adjust gene boundaries: start codon, frame, splice site, seq errors
–Reveal unusual gene structures: stop read-through, di-cistronics, editing
Towards a revised genome annotation
–Curation: FlyBase integrates prediction with cDNA, protein, literature
–Experimentation: BDGP large-scale functional validation
[Figure: species D. melanog., D. simulans, D. erecta, D. persimilis; 10,845 fully confirmed; 579 fully rejected; 2,499 not aligned; novel exons: 1,454 exons (~800 genes), +668 exons in 443 genes (…)]
10
Example 2: Novel multi-exon gene
1,454 novel exons outside known genes
–60% cluster in 300 new multi-exon genes
–40% are isolated high-confidence exons
11
Example 3: Dubious single-exon gene
Classification approach: Yes / No answer
–Closely related species: both genes and intergenic aligned
–Show very different patterns of mutation
Comparative analysis provides negative evidence
–Alignment is unambiguous, orthologous, spans entire gene
–Sequence shows mutations and indels in every species
Weak or missing experimental evidence
–100 of these independently rejected by FlyBase
–These are missing from systematic clone collections
–Only 34 (6%) have assigned names (vs. 36% of all fly genes)
12
Example 4: Start codon adjustment
Codon substitution patterns suggest new start in 200 genes
–Score each substitution using Codon Substitution Matrix (CSM)
[Figure: CG6664/FBtr0100439, annotated start codon vs. conserved start codon (ATG); legend: poor CSM score = atypical substitution, high CSM score = protein-like substitution]
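The idea of scoring each codon substitution can be illustrated with a crude stand-in for a Codon Substitution Matrix. The real CSM is presumably learned from data and its values are not given on the slide; the sketch below simply rewards synonymous changes and penalizes amino-acid changes and stop codons, using the standard genetic code. All score values and function names are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def toy_substitution_score(codon_a, codon_b):
    """Toy score: 0 identical, +1 synonymous, -1 amino-acid change, -3 if a stop is involved."""
    if codon_a == codon_b:
        return 0
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    if "*" in (aa_a, aa_b):
        return -3
    return 1 if aa_a == aa_b else -1

def toy_csm_score(seq_a, seq_b, frame=0):
    """Sum toy scores over aligned codons of two ungapped, uppercase DNA sequences."""
    total = 0
    for i in range(frame, min(len(seq_a), len(seq_b)) - 2, 3):
        total += toy_substitution_score(seq_a[i:i + 3], seq_b[i:i + 3])
    return total

print(toy_csm_score("ATGGCTGAA", "ATGGCCGAG"))           # in-frame, silent changes
print(toy_csm_score("ATGGCTGAA", "ATGGCCGAG", frame=1))  # same changes read out of frame
```

With a score of this kind, one could compare the region upstream of an annotated ATG against the region downstream of a candidate ATG: protein-like substitutions should begin where translation actually starts.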
13
Unusual genes 1: Stop codon read-through
Method #1 (single exons)
–112 events, 95 extending known genes
–Manual curation: 82
–Enriched in neuronal function
Method #2 (after splicing)
–256 events, looser cutoff, large overlap, needs manual curation
–Enriched in transcription factors
[Figure: protein-coding conservation → stop codon read-through → continued protein-coding conservation → 2nd stop codon → no more conservation]
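A read-through candidate is the stretch between the annotated stop codon and the next in-frame stop. The sketch below only locates that stretch in a single sequence; deciding whether it shows continued protein-coding conservation would need the multi-species alignment. Function names and the toy sequence are illustrative assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def first_inframe_stop(seq, start=0):
    """Index of the first stop codon in the reading frame beginning at `start`, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None

def readthrough_region(seq, cds_start=0):
    """Return (first_stop, second_stop, sequence_between) or None if either stop is missing."""
    stop1 = first_inframe_stop(seq, cds_start)
    if stop1 is None:
        return None
    stop2 = first_inframe_stop(seq, stop1 + 3)
    if stop2 is None:
        return None
    return stop1, stop2, seq[stop1 + 3:stop2]

# Toy transcript: ATG AAA TAG GGC CCA TAA GG
print(readthrough_region("ATGAAATAGGGCCCATAAGG"))
```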
14
BDGP experimental validation: initial results
189 novel exons tested (in & out of genes)
–inverse PCR reaction + sequencing
–Recover new genes + alternative splice forms
Results: 178 validated (94%)
–Novel exons inside known genes: 41/43 (95%)
–Novel exons outside known genes: 137/146 (94%)
  Some cDNA overlap: 8/8 (100%)
  no cDNA, some EST: 23/26 (88%)
  no cDNA, no EST: 106/112 (95%)
[Figure labels: novel gene, known gene]
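The validation counts on this slide are internally consistent, which a short check makes explicit. The dictionary keys are just labels chosen for this sketch; the counts are the ones quoted on the slide.

```python
# (validated, tested) pairs as quoted on the slide.
inside_known = (41, 43)
outside_known = {
    "some cDNA overlap": (8, 8),
    "no cDNA, some EST": (23, 26),
    "no cDNA, no EST": (106, 112),
}

def percent(validated, tested):
    """Validation rate rounded to the nearest whole percent."""
    return round(100 * validated / tested)

outside_validated = sum(v for v, _ in outside_known.values())
outside_tested = sum(t for _, t in outside_known.values())
total_validated = inside_known[0] + outside_validated
total_tested = inside_known[1] + outside_tested

print("outside known genes:", outside_validated, "/", outside_tested,
      f"({percent(outside_validated, outside_tested)}%)")
print("overall:", total_validated, "/", total_tested,
      f"({percent(total_validated, total_tested)}%)")
for label, (v, t) in outside_known.items():
    print(f"{label}: {v}/{t} ({percent(v, t)}%)")
```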
15
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the fly genome
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
Part 3. Genome evolution
–Phylogenomics
16
The regulatory code
Multiple levels of regulation
–Temporal and spatial regulation, disease, development
–Chromatin, pre- / post-transcriptional, splicing, translational
Combinatorial coding of individual motifs
–The core: a relatively small number of regulatory motifs
–Regions: diverse motif combinations specify diverse functions
Regulatory motifs
–Summarize information across thousands of sites
–Distinguish: regulatory motifs vs. motif instances
–Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
[Figure labels: Enhancer regions, Promoter motifs, 5’-UTR, 3’-UTR, Splicing signals, Motifs at RNA level]
17
Regulatory motif discovery
Study known motifs → Derive conservation rules → Discover novel motifs
18
Known motifs are preferentially conserved
dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
[Figure: conservation track ('*' under conserved columns, column positions not recoverable); label: engrailed]
In multi-species alignments: known motifs → conservation islands
–Conserved biology: conserved regulatory code, same words are functional
–Preferential conservation: stand out from surrounding nucleotides
–Good signal for identifying individual instances of known motifs
Not sufficient for motif discovery:
–Conservation not limited to exact binding site → additional bases would be found
–Weakly constrained positions can diverge → real motifs will be missed
–How do we discover motifs de novo?
Use basic property of regulatory motifs
Evaluate genome-wide conservation over thousands of instances
19
Known motifs are frequently conserved
Across the fly genome, the engrailed motif (TAATTA):
–appears 8599 times
–is conserved 1534 times
Conservation rate: 17.8%
Statistical significance
–5 flies: conservation rate of random control motifs: 2.8%
–Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)
Motif Conservation Score (MCS)
[Figure: engrailed instances across D. mel., D. yakub., D. erecta, D. pseud.]
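The quantities on this slide (conservation rate, enrichment over control motifs, and a score in standard deviations) can be sketched with a normal approximation to the binomial. This is one plausible reading of "binomial score", not a confirmed definition of the MCS; the slide does not specify how the control rate or the score were computed, so values from this sketch need not match the reported ones.

```python
import math

def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total

def enrichment(conserved, total, control_rate):
    """Fold enrichment of the motif's conservation rate over the control rate."""
    return conservation_rate(conserved, total) / control_rate

def binomial_z_score(conserved, total, control_rate):
    """Standard deviations above the count expected if instances were conserved
    at the control rate (normal approximation to the binomial)."""
    expected = total * control_rate
    stdev = math.sqrt(total * control_rate * (1 - control_rate))
    return (conserved - expected) / stdev

# Counts quoted on the slide for the engrailed motif.
print(f"conservation rate: {conservation_rate(1534, 8599):.1%}")
```

The design point is that the score grows with the number of instances as well as with the conservation rate, so a motif with thousands of modestly conserved sites can outscore one with a handful of perfectly conserved sites.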
20
Systematically evaluate candidate patterns
All potential motifs → Evaluate MCS → Collapse motif variants
Enumerate
–Length between 6 and 15 nt, allow central gap
–11 letter alphabet (A C G T, 2-fold codes, N)
Score
–Compute binomial score (conserved vs. total)
–Select MCS > 6.0 → specificity 97%
Collapsing
–Sequence similarity
–Overlapping occurrences
168 motifs in promoter regions
196 motifs in 3’-UTR regions
[Figure: example pattern GTC AGT R R Y gap S W]
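The 11-letter alphabet mentioned in the enumeration step is the four bases, the six 2-fold IUPAC degeneracy codes, and N. A minimal sketch of matching such a consensus against a sequence follows; the function names are illustrative, a central gap is represented here simply as a run of N, and this is not the enumeration code used in the talk.

```python
# Four bases, six 2-fold IUPAC codes, and N: 11 letters in total.
ALPHABET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

def matches(consensus, site):
    """True if `site` (plain ACGT) is compatible with the degenerate `consensus`."""
    return len(consensus) == len(site) and all(
        base in ALPHABET[code] for code, base in zip(consensus, site)
    )

def find_instances(consensus, sequence):
    """Start positions (0-based) of every match of `consensus` in `sequence`."""
    k = len(consensus)
    return [i for i in range(len(sequence) - k + 1)
            if matches(consensus, sequence[i:i + k])]

print(find_instances("TAATTA", "CTCTAATTAGCG"))
print(find_instances("WATTRATTK", "CCAATTGATTTCC"))
```

Counting how many of the positions returned by `find_instances` are also conserved in the aligned species gives the conserved-versus-total counts that the scoring step needs.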
21
Results in the fly genome: Promoter motifs
(The last column holds the two expression-enrichment values, promoters then enhancers, run together as in the source.)
#   Consensus    MCS   Matches to known             Expression enrichment (promoters, enhancers)
1   CTAATTAAA    65.6  engrailed (en)               25.42
2   TTKCAATTAA   57.3  reversed-polarity (repo)     5.84.2
3   WATTRATTK    54.9  araucan (ara)                11.72.6
4   AAATTTATGCK  54.4  paired (prd)                 4.516.5
5   GCAATAAA     51    ventral veins lacking (vvl)  13.20.3
6   DTAATTTRYNR  46.7  Ultrabithorax (Ubx)          163.3
7   TGATTAAT     45.7  apterous (ap)                7.11.7
8   YMATTAAAA    43.1  abdominal A (abd-A)          72.2
9   AAACNNGTT    41.2                               20.14.3
10  RATTKAATT    40                                 3.90.7
11  GCACGTGT     39.5  fushi tarazu (ftz)           17.9
12  AACASCTG     38.8  broad-Z3 (br-Z3)             10.7
13  AATTRMATTA   38.2                               19.51.2
14  TATGCWAAT    37.8                               5.82
15  TAATTATG     37.5  Antennapedia (Antp)          14.15.4
16  CATNAATCA    36.9                               1.81.7
17  TTACATAA     36.9                               5.4
18  RTAAATCAA    36.3                               3.22.8
19  AATKNMATTT   36                                 3.60
20  ATGTCAAHT    35.6                               2.44.6
21  ATAAAYAAA    35.5                               57.2-0.5
22  YYAATCAAA    33.9                               5.30.6
23  WTTTTATG     33.8  Abdominal B (Abd-B)          6.36
24  TTTYMATTA    33.6  extradenticle (exd)          6.71.7
25  TGTMAATA     33.2                               8.91.6
26  TAAYGAG      33.1                               4.72.7
27  AAAKTGA      32.9                               7.60.3
28  AAANNAAA     32.9                               449.70.8
29  RTAAWTTAT    32.9  gooseberry-neuro (gsb-n)     110.8
30  TTATTTAYR    32.9  Deformed (Dfd)               30.7
22
Global views of post-transcriptional regulation
Functional roles of 106 motifs in 3’-UTRs
(a) 60 likely involved in mRNA regulation
–AATAAA: Poly-A signal
–6 AT-rich elements: mRNA stability / degradation
–24 TGTA-rich elements: mRNA localization (PUF)
–29 other, potential target of RNA-binding proteins
(b) 46 likely micro-RNA targets
–Specifically match distal 8 bp of 22-mer miRNA
–Match 114 known microRNA genes
–Enable discovery of 144 novel microRNA genes
–6 of 12 tested using RT-PCR and confirmed
–Estimate extent of miRNA control: 20% of human genes are miRNA targets
[Figure labels: Motif length; protein-coding gene, 3’-UTR, cleaved; microRNA gene, miRNA; 22-mer miRNA, 8-mer motif]
23
Results in the fly: 50 novel microRNA genes
24
Regulatory motif discovery in the human
Systematic discovery of regulatory motifs in the human
Frequently occurring, strongly conserved short regulatory signals
174 promoter motifs
–70 match known TF motifs
–115 expression enrichment
–60 show positional bias
106 motifs in 3’-UTR
–Strand specific
–mRNA localization and stability
–8-mers are miRNA-associated
microRNA regulation
–Discovered 8-mers: ATATGCAA
–114 known + 144 new miRNA genes
–Target ~20% of human 3’-UTRs
[Figure: gene model with TSS, ATG, Stop, 3’-UTR]
25
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
26
Evolutionary history of all genes in 17 fungi
Each branch
–Mean and stdev
–Num genes
–Gains, losses
Features
–Few events!
–Gain vs. loss
–Acceleration
–Churning
Applications
–Recover WGD
–Pathogenicity
–Mating evolution
–Codon capture
–Evol. parallels
[Figure labels: yeast, candida]
27
Gene duplication and loss in context of syntenic alignments
Synteny spans 100 million years!
[Figure: … tile 34 of 288 …; species: C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. tropicalis, C. parapsilosis, C. guilliermondii, C. lusitaniae, D. hansenii; annotations: lineage specific genes, species specific genes, inserted segment]
28
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
–Genome Duplication
–Emergence of new functions
29
Resolving power in mammals, flies, fungi
10 mammals
–Neutral: 2.57 subs/site (opp: 0.62, 32 sps: 4.87)
–Coding: 1.16 subs/site
–Detect: 6-mer at FP 10^-6
12 flies
–Neutral: 4.13 subs/site
–Coding: 1.65 subs/site
–Detect: 6-mer at 10^-11
17 yeasts (8 Candida, 9 Yeasts)
–Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
–Coding: 7.91 subs/site
–Detect: 3-mer at 10^-21
[Figure: phylogenetic trees with scale bars 0.3 sub/site, 0.1 sub/site, 0.8 sub/site; fungal tree labels: Post-duplication, Pre-dup, Diploid, Haploid, P]
30
Rules of thumb for comparative genome sequencing
Total branch length: >4 subs/site
–Genome annotation: new genes, exons, unusual
–Regulatory motif discovery, miRNAs, enhancers
Max pair-wise branch length: <1 subs/site
–Conservation of function, nucleotide alignment quality
Conserved gene order: synteny
–Global alignment quality
Sequencing depth
–One or two genomes: >8X
–Remaining genomes: >3X, if syntenic relative exists
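The rules of thumb above can be written down as a small checklist function. This is a sketch under stated assumptions: the function name and input format are hypothetical, and the handling of the case with no syntenic relative (requiring the deeper threshold for every genome) is my reading, not something the slide states.

```python
def check_sequencing_plan(total_branch_length, max_pairwise_branch_length,
                          depths, syntenic_relative=True):
    """Return a list of rule-of-thumb violations (empty list means none).

    total_branch_length, max_pairwise_branch_length: in substitutions per site.
    depths: sequencing coverage (X) for each genome in the plan.
    """
    problems = []
    if total_branch_length <= 4:
        problems.append("total branch length should exceed 4 subs/site")
    if max_pairwise_branch_length >= 1:
        problems.append("max pairwise branch length should stay below 1 subs/site")
    ranked = sorted(depths, reverse=True)
    if not ranked or ranked[0] <= 8:
        problems.append("at least one genome should be sequenced deeper than 8X")
    # Assumption: without a syntenic relative, low coverage is not acceptable.
    floor = 3 if syntenic_relative else 8
    if any(d <= floor for d in ranked[1:]):
        problems.append(f"remaining genomes should exceed {floor}X")
    return problems

# Hypothetical plan: 4.13 subs/site total, one deep genome, two at 4X.
print(check_sequencing_plan(4.13, 0.8, [10, 4, 4]))
```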