Comparative genomics for pathogen/vector annotation
Manolis Kellis
MIT Computer Science and Artificial Intelligence Lab (CSAIL)
Broad Institute of MIT and Harvard for Genomics in Medicine
The age of comparative genomics
[Figure labels: 32 mammals (human, mouse, rat, chimp, dog); 12 flies; 9 yeasts; 8 Candida; pathogens; vectors]
Resolving power in mammals, flies, fungi
–Neutral: 2.57 subs/site (opp: sps: 4.87); Coding: 1.16 subs/site; Detect: 6-mer at FP
–Neutral: 4.13 subs/site; Coding: 1.65 subs/site; Detect: 6-mer at
–Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5); Coding: 7.91 subs/site; Detect: 3-mer at
[Figure: phylogenetic trees for mammals, 12 flies, 17 yeasts (8 Candida, 9 Yeasts); labels: Post-duplication, Pre-dup, Diploid, Haploid, P; scale bars in sub/site (0.1 and 0.8 legible)]
Extensive conservation of synteny
Global mapping of orthologous segments
Nucleotide-level alignments span complete genomes
Study properties / patterns of nucleotide conservation
[Figure panels: Mammals; Flies; Candida]
Comparative Genomics 101: Conservation → Function
Conserved elements are typically functional (and vice versa)
–For example: exons are deeply conserved to mouse, chicken, fish
Some conserved elements are still uncharacterized
–How do we make sense of them?
–How do we distinguish each type of functional element?
Answer: evolutionary signatures (Comp. Genomics 201)
–Tell me how you evolve, I’ll tell you who you are
–Patterns of change → selective pressures → specific function
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
–Genome Duplication
–Emergence of new functions
Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
[Figure label: Splice]
Protein-coding genes have specific evolutionary constraints
–Gaps are multiples of three (preserve amino acid translation)
–Mutations are largely 3-periodic (silent codon substitutions)
–Specific triplets exchanged more frequently (conservative substs.)
–Conservation boundaries are sharp (pinpoint individual splicing signals)
Encode as ‘evolutionary signatures’
–Computational test for each of them
–Combine and score systematically
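The first two signatures above (frame-preserving gaps, 3-periodic mutations) can be sketched as simple tests on aligned sequences. This is a minimal illustration, not the scoring used in the talk; the function names and the `frame` parameter are assumptions introduced here.

```python
def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one aligned sequence."""
    lengths, run = [], 0
    for ch in aligned_seq:
        if ch == "-":
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def frame_preserving_fraction(aligned_seqs):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    gaps = [g for seq in aligned_seqs for g in gap_lengths(seq)]
    if not gaps:
        return None
    return sum(g % 3 == 0 for g in gaps) / len(gaps)


def mismatch_by_codon_position(ref, other, frame=0):
    """Count mismatches at codon positions 1, 2, 3 of the reference.

    `frame` is the offset of the first codon in the ungapped reference
    (an assumed input; a real test would try all three frames).
    """
    counts = [0, 0, 0]
    pos = 0  # position in the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[(pos - frame) % 3] += 1
        pos += 1
    return counts
```

In a coding region one would expect the gap fraction to be close to 1 and mismatches to pile up at the third codon position; in non-coding sequence neither pattern should hold.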
Putting it all together: probabilistic framework
Hidden Markov Models (HMMs)
–Generative model, learn emission, transition probabilities
–Easy to train, hard to integrate long-range signals
Conditional Random Fields (CRFs)
–Discriminative dual of HMMs, learn weights on features
–Easy to integrate diverse signals, gradient ascent for training
Systematically annotate all protein-coding genes
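To make the HMM half of this slide concrete, here is a minimal Viterbi decoder for a two-state model (coding `C` vs. non-coding `N`). The states, the observation symbols (`p` for a protein-like substitution, `a` for an atypical one) and all probabilities are toy assumptions chosen for illustration, not parameters from the talk.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (log-space Viterbi)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state to arrive in s from.
            best = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            current[s] = (scores[-1][best] + math.log(trans_p[best][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = best
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy parameters (assumed): states tend to persist, and coding columns
# mostly show protein-like substitutions.
STATES = ["C", "N"]
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {"C": {"p": 0.9, "a": 0.1}, "N": {"p": 0.2, "a": 0.8}}

print("".join(viterbi("aaapppppaaa", STATES, START, TRANS, EMIT)))
```

A CRF would replace the emission and transition probabilities with learned weights on arbitrary features of the alignment, which is what makes diverse signals easier to integrate.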
Revisiting fly genome annotation
Large-scale re-annotation of the fly genome
–New genes and exons, dubious genes and exons
–Adjust gene boundaries: start codon, frame, splice site, seq errors
–Reveal unusual gene structures: stop read-through, di-cistronics, editing
Results
–10,845 fully confirmed (…)
–579 fully rejected
–2,499 not aligned
–Novel exons: 1,454 exons (~800 genes); +668 exons in 443 genes
Towards a revised genome annotation
–Curation: FlyBase integrates prediction with cDNA, protein, literature
–Experimentation: BDGP large-scale functional validation
[Figure: alignment tracks for D. melanog., D. simulans, D. erecta, D. persimilis]
Example 2: Novel multi-exon gene
1,454 novel exons outside known genes
–60% cluster in 300 new multi-exon genes
–40% are isolated high-confidence exons
Example 3: Dubious single-exon gene
Classification approach: Yes / No answer
–Closely related species: both genes and intergenic aligned
–Show very different patterns of mutation
Comparative analysis provides negative evidence
–Alignment is unambiguous, orthologous, spans entire gene
–Sequence shows mutations and indels in every species
Weak or missing experimental evidence
–100 of these independently rejected by FlyBase
–These are missing from systematic clone collections
–Only 34 (6%) have assigned names (vs. 36% of all fly genes)
Example 4: Start codon adjustment
Codon substitution patterns suggest new start in 200 genes
–Score each substitution using Codon Substitution Matrix (CSM)
–High CSM score: protein-like substitution
–Poor CSM score: atypical substitution
[Figure: CG6664/FBtr alignment; labels: annotated start codon, conserved start codon, ATG]
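A rough sketch of how CSM scoring could separate the region upstream of the conserved start from the region downstream of it. The matrix below is a hand-made toy (the values are hypothetical, not the CSM from the talk), and `aligned_codons` and `csm_score` are names introduced here for illustration.

```python
# Toy log-odds scores (hypothetical): positive = protein-like substitution,
# negative = atypical for coding sequence. Unlisted pairs score 0.
TOY_CSM = {
    ("GAA", "GAG"): 2.0,   # Glu -> Glu, synonymous
    ("CTG", "TTG"): 1.5,   # Leu -> Leu, synonymous
    ("GAA", "TAA"): -4.0,  # Glu -> stop
    ("CTG", "CCG"): -1.0,  # Leu -> Pro
}


def aligned_codons(ref, other, start, end):
    """In-frame codon pairs between start and end, skipping gapped codons."""
    pairs = []
    for i in range(start, end - 2, 3):
        a, b = ref[i:i + 3], other[i:i + 3]
        if "-" not in a and "-" not in b:
            pairs.append((a, b))
    return pairs


def csm_score(codon_pairs, csm, default=0.0):
    """Sum the substitution scores over a list of aligned codon pairs."""
    return sum(csm.get(pair, default) for pair in codon_pairs)


ref = "GAACTGGAACTG"    # toy reference sequence
other = "TAACCGGAGTTG"  # toy informant species
upstream = csm_score(aligned_codons(ref, other, 0, 6), TOY_CSM)
downstream = csm_score(aligned_codons(ref, other, 6, 12), TOY_CSM)
print(upstream, downstream)
```

A poor score between the annotated ATG and a downstream in-frame ATG, followed by a high score after it, is the kind of pattern that would suggest moving the start codon.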
Unusual genes 1: Stop codon read-through
Method #1 (single exons)
–112 events, 95 extending known genes
–Manual curation: 82
–Enriched in neuronal function
Method #2 (after splicing)
–256 events, looser cutoff, large overlap, needs manual curation
–Enriched in transcription factors
[Figure: protein-coding conservation up to the stop codon; continued protein-coding conservation (stop codon read-through) up to the 2nd stop codon; no more conservation after it]
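The candidate read-through region is the stretch between the annotated stop and the next in-frame stop. The sketch below only locates that region; testing it for continued protein-coding conservation is the actual evidence and is not shown. Function names are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_in_frame_stop(seq, start):
    """Index of the first stop codon at or after `start`, stepping by codons."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def readthrough_region(seq, first_stop):
    """Sequence between the annotated stop and the 2nd in-frame stop.

    Returns None when no second in-frame stop is found in `seq`.
    """
    second_stop = next_in_frame_stop(seq, first_stop + 3)
    if second_stop is None:
        return None
    return seq[first_stop + 3:second_stop]


seq = "ATGGCCTAGGGATTTTGAAAA"  # toy transcript: ATG GCC TAG GGA TTT TGA AAA
first_stop = next_in_frame_stop(seq, 0)
print(first_stop, readthrough_region(seq, first_stop))
```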
BDGP experimental validation: initial results
189 novel exons tested (in & out of genes)
–Inverse PCR reaction + sequencing
–Recover new genes + alternative splice forms
Results: 178 validated (94%)
–Novel exons inside known genes: 41/43 (95%)
–Novel exons outside known genes: 137/146 (94%)
  Some cDNA overlap: 8/8 (100%)
  No cDNA, some EST: 23/26 (88%)
  No cDNA, no EST: 106/112 (95%)
[Figure labels: novel gene; known gene]
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the fly genome
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
Part 3. Genome evolution
–Phylogenomics
The regulatory code
Multiple levels of regulation
–Temporal and spatial regulation, disease, development
–Chromatin, pre- / post-transcriptional, splicing, translational
Combinatorial coding of individual motifs
–The core: a relatively small number of regulatory motifs
–Regions: diverse motif combinations specify diverse functions
Regulatory motifs
–Summarize information across thousands of sites (distinguish: regulatory motifs vs. motif instances)
–Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
[Figure labels: Enhancer regions; Promoter motifs; 5’-UTR; 3’-UTR; Splicing signals; Motifs at RNA level]
Regulatory motif discovery
Study known motifs → Derive conservation rules → Discover novel motifs
Known motifs are preferentially conserved
dmel AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAC
dsim AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAC
dyak AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAG
dere AATGGTTTGC CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT TTTC AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
*** ** * * ********** ** *
[Figure label: engrailed]
In multi-species alignments: known motifs → conservation islands
–Conserved biology: conserved regulatory code, same words are functional
–Preferential conservation: stand out from surrounding nucleotides
–Good signal for identifying individual instances of known motifs
Not sufficient for motif discovery:
–Conservation not limited to exact binding site → additional bases would be found
–Weakly constrained positions can diverge → real motifs will be missed
–How do we discover motifs de novo?
Use basic property of regulatory motifs
–Evaluate genome-wide conservation over thousands of instances
Known motifs are frequently conserved
Across the fly genome, the engrailed motif (TAATTA):
–appears 8599 times
–is conserved 1534 times
–Conservation rate: 17.8%
Statistical significance
–5 flies: conservation rate of random control motifs: 2.8%
–Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)
→ Motif Conservation Score (MCS)
[Figure: engrailed instances across D. mel., D. yakub., D. erecta, D. pseud.]
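The conservation rate and a binomial-style score can be computed from the counts on this slide. The z-score below uses a plain normal approximation to the binomial, which is an assumption; the slide does not say exactly how its score was computed. With these inputs the simple formulas reproduce the 17.8% rate but not the 6.8-fold and 35 stdev figures, which presumably rest on a different control set or counting scheme.

```python
import math


def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total


def binomial_z_score(conserved, total, control_rate):
    """Standard deviations above the count expected at the control rate.

    Normal approximation to the binomial (an assumed stand-in for the MCS).
    """
    expected = total * control_rate
    stdev = math.sqrt(total * control_rate * (1.0 - control_rate))
    return (conserved - expected) / stdev


rate = conservation_rate(1534, 8599)         # engrailed counts from the slide
z = binomial_z_score(1534, 8599, 0.028)      # 2.8% control rate from the slide
print(f"rate={rate:.3f}  enrichment={rate / 0.028:.2f}-fold  z={z:.1f}")
```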
Systematically evaluate candidate patterns
All potential motifs → Evaluate MCS → Collapse motif variants
Enumerate
–Length between 6 and 15 nt, allow central gap
–11 letter alphabet (A C G T, 2-fold codes, N)
Score
–Compute binomial score (conserved vs. total)
–Select MCS > 6.0 (specificity 97%)
Collapsing
–Sequence similarity
–Overlapping occurrences
Results: 168 motifs in promoter regions, 196 motifs in 3’-UTR regions
[Figure: example pattern with labels GTC, AGT, R, R, Y, gap, S, W]
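The 11-letter alphabet is the four bases, the six two-fold IUPAC codes, and N. A minimal sketch of matching a degenerate candidate against a sequence follows; it covers only the matching step, not enumeration, scoring or collapsing, and the function names are assumptions.

```python
# The 11-letter alphabet: 4 bases + 6 two-fold degenerate codes + N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "N": "ACGT",
}


def matches(motif, site):
    """True if every base of `site` is allowed by the motif letter at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[letter] for letter, base in zip(motif, site)
    )


def find_instances(motif, seq):
    """Start positions (0-based) of all motif instances on the given strand."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if matches(motif, seq[i:i + k])]


print(find_instances("TAATTA", "CTCTAATTAGCG"))
print(find_instances("AAACNNGTT", "AAACGTGTTAAACCCGTT"))
```

A central gap is represented here simply as a run of N; counting conserved versus total instances over these positions gives the inputs to the binomial score.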
Results in the fly genome: Promoter motifs
Consensus      MCS    Matches to known             Expression enrichment (Promoters / Enhancers)
CTAATTAAA      65.6   engrailed (en)
TTKCAATTAA     57.3   reversed-polarity (repo)
WATTRATTK      54.9   araucan (ara)
AAATTTATGCK    54.4   paired (prd)
GCAATAAA       51     ventral veins lacking (vvl)
DTAATTTRYNR    46.7   Ultrabithorax (Ubx)
TGATTAAT       45.7   apterous (ap)
YMATTAAAA      43.1   abdominal A (abd-A)          72.2
AAACNNGTT
RATTKAATT
GCACGTGT       39.5   fushi tarazu (ftz)
AACASCTG       38.8   broad-Z3 (br-Z3)
AATTRMATTA
TATGCWAAT
TAATTATG       37.5   Antennapedia (Antp)
CATNAATCA
TTACATAA
RTAAATCAA
AATKNMATTT
ATGTCAAHT
ATAAAYAAA
YYAATCAAA
WTTTTATG       33.8   Abdominal B (Abd-B)
TTTYMATTA      33.6   extradenticle (exd)
TGTMAATA
TAAYGAG
AAAKTGA
AAANNAAA
RTAAWTTAT      32.9   gooseberry-neuro (gsb-n)
TTATTTAYR      32.9   Deformed (Dfd)               30.7
Global views of post-transcriptional regulation
Functional roles of 106 motifs in 3’-UTRs
(a) 60 likely involved in mRNA regulation
–AATAAA: Poly-A signal
–6 AT-rich elements: mRNA stability / degradation
–24 TGTA-rich elements: mRNA localization (PUF)
–29 other, potential target of RNA-binding proteins
(b) 46 likely micro-RNA targets
–Specifically match distal 8 bp of 22-mer miRNA (8-mer motif)
–Match 114 known microRNA genes
–Enable discovery of 144 novel microRNA genes: 6 of 12 tested using RT-PCR and confirmed
–Estimate extent of miRNA control: 20% of human genes are miRNA targets
[Figure labels: Motif length; protein-coding gene; 3’-UTR; cleaved; miRNA; microRNA gene]
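A 3’-UTR 8-mer is associated with a miRNA when its reverse complement matches one end of the mature miRNA. The sketch below checks both ends and reports which one matched, since the slide does not define "distal" in sequence coordinates. The function names are assumptions, and the let-7a sequence and its CTACCTCA site are example input, not data from the slides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def motif_end_match(motif, mirna):
    """Report which end of the miRNA the motif is complementary to, if any."""
    target = reverse_complement(motif.upper())
    seq = mirna.upper().replace("U", "T")  # compare in the DNA alphabet
    k = len(target)
    if seq[:k] == target:
        return "5' end"
    if seq[-k:] == target:
        return "3' end"
    return None


# Example input: let-7a mature sequence (22 nt) and a candidate 3'-UTR 8-mer.
print(motif_end_match("CTACCTCA", "UGAGGUAGUAGGUUGUAUAGUU"))
```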
Results in the fly: 50 novel microRNA genes
Systematic discovery of regulatory motifs in the human
Frequently occurring, strongly conserved short regulatory signals
Regulatory motif discovery in the human
–174 promoter motifs: 70 match known TF motifs, 115 expression enrichment, 60 show positional bias
–106 motifs in 3’-UTR: strand specific, 8-mers are miRNA-associated, mRNA localization and stability
microRNA regulation
–Discovered 8-mers (e.g. ATATGCAA) match 114 known miRNA genes and point to new miRNA genes
–Target ~20% of human 3’-UTRs
[Figure labels: TSS; ATG; Stop; 3’-UTR]
Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
Evolutionary history of all genes in 17 fungi
Each branch
–Mean and stdev
–Num genes
–Gains, losses
Features
–Few events!
–Gain vs. loss
–Acceleration
–Churning
Applications
–Recover WGD
–Pathogenicity
–Mating evolution
–Codon capture
–Evol. parallels
[Figure labels: yeast; candida]
Gene duplication and loss in context of syntenic alignments
Synteny spans 100 million years!
[Figure: syntenic alignment, … tile 34 of 288 …; tracks: C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. parapsilosis, D. hansenii, C. tropicalis, C. guilliermondii, C. lusitaniae; annotations: lineage specific genes, inserted segment, species specific genes]
Rules of thumb for comparative genome sequencing
Total branch length: >4 subs/site
–Genome annotation: new genes, exons, unusual gene structures
–Regulatory motif discovery, miRNAs, enhancers
Max pair-wise branch length: <1 subs/site
–Conservation of function, nucleotide alignment quality
Conserved gene order: synteny
–Global alignment quality
Sequencing depth
–One or two genomes: >8X
–Remaining genomes: >3X, if syntenic relative exists
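The two branch-length rules can be checked mechanically for a proposed set of species. The sketch below assumes the tree is given as a child → (parent, branch length) mapping in subs/site; the representation, the function names and the toy tree are illustrative assumptions, and the thresholds are the ones on this slide.

```python
from itertools import combinations


def dists_to_ancestors(node, parent):
    """Distance from `node` to itself and to each of its ancestors."""
    out, dist = {node: 0.0}, 0.0
    while node in parent:
        node, length = parent[node]
        dist += length
        out[node] = dist
    return out


def pairwise_distance(a, b, parent):
    """Path length between two nodes through their closest common ancestor."""
    da, db = dists_to_ancestors(a, parent), dists_to_ancestors(b, parent)
    return min(da[x] + db[x] for x in da if x in db)


def check_design(leaves, parent, min_total=4.0, max_pair=1.0):
    """Apply the two rules of thumb: total > 4 and max pair-wise < 1 subs/site."""
    total = sum(length for _, length in parent.values())
    worst = max(pairwise_distance(a, b, parent) for a, b in combinations(leaves, 2))
    return {
        "total": total,
        "max_pairwise": worst,
        "total_ok": total > min_total,
        "pairwise_ok": worst < max_pair,
    }


# Toy tree (hypothetical branch lengths): ((A:0.3, B:0.4)X:0.5, C:0.9)root
toy = {"A": ("X", 0.3), "B": ("X", 0.4), "X": ("root", 0.5), "C": ("root", 0.9)}
print(check_design(["A", "B", "C"], toy))
```

The two rules pull in opposite directions: adding species raises the total branch length, while any single distant species can push a pair-wise distance past the alignment-quality limit.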