Comparative genomics for pathogen/vector annotation. Manolis Kellis, CSAIL (MIT Computer Science and Artificial Intelligence Lab), Broad Institute of MIT and Harvard.

Presentation transcript:

Comparative genomics for pathogen/vector annotation Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and Harvard for Genomics in Medicine

The age of comparative genomics
[Figure: sequenced clades: 32 mammals (human, mouse, rat, chimp, dog), 12 flies, 9 yeasts, 8 Candida; pathogens, vectors]

Resolving power in mammals, flies, fungi
–Neutral: 2.57 subs/site (opp: sps: 4.87); Coding: 1.16 subs/site; Detect: 6-mer at FP
–Neutral: 4.13 subs/site; Coding: 1.65 subs/site; Detect: 6-mer at
–Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5); Coding: 7.91 subs/site; Detect: 3-mer at
[Figure: phylogenies of mammals, 12 flies, and 17 yeasts (8 Candida, 9 Yeasts; post-duplication, pre-dup; diploid, haploid; branches marked P); scale bars in sub/site, including 0.1 and 0.8]

Extensive conservation of synteny
–Global mapping of orthologous segments
–Nucleotide-level alignments span complete genomes
–Study properties / patterns of nucleotide conservation
[Figure panels: Mammals, Flies, Candida]

Comparative Genomics 101: Conservation → Function
Conserved elements are typically functional (and vice versa)
–For example: exons are deeply conserved to mouse, chicken, fish
Some conserved elements are still uncharacterized
–How do we make sense of them?
–How do we distinguish each type of functional element?
Answer: evolutionary signatures (Comp. Genomics 201)
–Tell me how you evolve, I’ll tell you who you are
–Patterns of change → selective pressures → specific function

Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics
–Genome Duplication
–Emergence of new functions

Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
Protein-coding genes have specific evolutionary constraints
–Gaps are multiples of three (preserve amino acid translation)
–Mutations are largely 3-periodic (silent codon substitutions)
–Specific triplets exchanged more frequently (conservative substitutions)
–Conservation boundaries are sharp (pinpoint individual splicing signals)
Encode as ‘evolutionary signatures’
–Computational test for each of them
–Combine and score systematically
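As a rough illustration of the first two signatures on this slide (frame-preserving gaps and 3-periodic mutations), the sketch below tests a toy alignment. The function names and toy sequences are assumptions made for this example, not the talk's actual tests, and the phase is counted from the first reference base rather than from a known reading frame.

```python
from itertools import groupby


def gap_lengths(aligned_seq):
    """Lengths of every run of '-' in one aligned sequence."""
    return [len(list(run)) for ch, run in groupby(aligned_seq) if ch == "-"]


def frame_preserving_fraction(alignment):
    """Fraction of gap runs whose length is a multiple of three.

    Returns None when the alignment contains no gaps at all.
    """
    lengths = [n for seq in alignment for n in gap_lengths(seq)]
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


def mismatches_by_phase(ref, other):
    """Count mismatches by position modulo 3 along the reference sequence."""
    counts = [0, 0, 0]
    ref_pos = 0
    for a, b in zip(ref, other):
        if a == "-":
            continue  # a gap in the reference does not advance its position
        if b != "-" and a != b:
            counts[ref_pos % 3] += 1
        ref_pos += 1
    return counts


# Hypothetical toy data: mismatches fall only on every third base.
toy_ref = "ATGGCTAAAGGT"
toy_other = "ATGGCCAAGGGA"
print(mismatches_by_phase(toy_ref, toy_other))
```

A strongly skewed phase count, together with a high fraction of gaps that are multiples of three, is the kind of pattern the slide associates with protein-coding regions; a real test would also need a null model for non-coding sequence.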

Putting it all together: probabilistic framework
Hidden Markov Models (HMMs)
–Generative model, learn emission, transition probabilities
–Easy to train, hard to integrate long-range signals
Conditional Random Fields (CRFs)
–Discriminative dual of HMMs, learn weights on features
–Easy to integrate diverse signals, gradient ascent for training
→ Systematically annotate all protein-coding genes
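To make the HMM half of this slide concrete, here is a minimal Viterbi decoder over a two-state model. The state names, the observation symbols ("S" for a silent-looking column, "R" for a random-looking one) and every probability are hypothetical values chosen for illustration; this is not the gene finder described in the talk, and it does not cover the CRF variant.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for a discrete HMM, computed in log space."""
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters, for illustration only.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"S": 0.8, "R": 0.2},
    "noncoding": {"S": 0.2, "R": 0.8},
}

print(viterbi("RRRSSSSRRR", STATES, START, TRANS, EMIT))
```

With these toy numbers the run of four "S" columns is long enough to pay for two state switches, so the decoder labels it as coding and the flanks as noncoding.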

Revisiting fly genome annotation
Large-scale re-annotation of the fly genome
–New genes and exons, dubious genes and exons
–Adjust gene boundaries: start codon, frame, splice site, sequencing errors
–Reveal unusual gene structures: stop read-through, di-cistronics, editing
Towards a revised genome annotation
–Curation: FlyBase integrates prediction with cDNA, protein, literature
–Experimentation: BDGP large-scale functional validation
[Figure: species D. melanog., D. simulans, D. erecta, D. persimilis (…); counts: 10,845 fully confirmed; 579 fully rejected; 2,499 not aligned; novel exons: 1,454 exons (~800 genes), +668 exons in 443 genes]

Example 2: Novel multi-exon gene
1,454 novel exons outside known genes
–60% cluster in 300 new multi-exon genes
–40% are isolated high-confidence exons

Example 3: Dubious single-exon gene
Classification approach: Yes / No answer
–Closely related species: both genes and intergenic aligned
–Show very different patterns of mutation
Comparative analysis provides negative evidence
–Alignment is unambiguous, orthologous, spans entire gene
–Sequence shows mutations and indels in every species
Weak or missing experimental evidence
–100 of these independently rejected by FlyBase
–These are missing from systematic clone collections
–Only 34 (6%) have assigned names (vs. 36% of all fly genes)

Example 4: Start codon adjustment
Codon substitution patterns suggest new start in 200 genes
–Score each substitution using Codon Substitution Matrix (CSM)
–Poor CSM score: atypical substitution
–High CSM score: protein-like substitution
[Figure: CG6664/FBtr, annotated start codon vs. conserved start codon (ATG)]
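The slide does not give the entries of the Codon Substitution Matrix, so the sketch below substitutes a much cruder stand-in: it classifies each aligned codon pair as synonymous or non-synonymous using the standard genetic code. The function name and the toy sequences are assumptions for this example; a real CSM would assign a graded score to every codon pair rather than a two-way label.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitutions(ref, other):
    """Count (synonymous, non-synonymous) codon changes between two
    in-frame, gap-free aligned sequences. Codons containing characters
    other than A, C, G, T are skipped.
    """
    synonymous = nonsynonymous = 0
    for i in range(0, min(len(ref), len(other)) - 2, 3):
        a, b = ref[i:i + 3], other[i:i + 3]
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy pair: GCT->GCC keeps alanine, AAA->AGA changes K to R.
print(classify_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

Applied in a sliding fashion upstream and downstream of an annotated ATG, a count like this gives a rough sense of where protein-like substitution patterns begin, which is the idea behind the start codon adjustment.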

Unusual genes 1: Stop codon read-through
Method #1 (single exons)
–112 events, 95 extending known genes → Manual curation: 82
–Enriched in neuronal function
Method #2 (after splicing)
–256 events, looser cutoff, large overlap, needs manual curation
–Enriched in transcription factors
[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation]

BDGP experimental validation: initial results
189 novel exons tested (in & out of genes)
–Inverse PCR reaction + sequencing
–Recover new genes + alternative splice forms
Results: 178 validated (94%)
–Novel exons inside known genes: 41/43 (95%)
–Novel exons outside known genes: 137/146 (94%)
  –Some cDNA overlap: 8/8 (100%)
  –No cDNA, some EST: 23/26 (88%)
  –No cDNA, no EST: 106/112 (95%)
[Figure labels: novel gene, known gene]

Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the fly genome
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
Part 3. Genome evolution
–Phylogenomics

The regulatory code
Multiple levels of regulation
–Temporal and spatial regulation, disease, development
–Chromatin, pre- / post-transcriptional, splicing, translational
Combinatorial coding of individual motifs
–The core: a relatively small number of regulatory motifs
–Regions: diverse motif combinations specify diverse functions
Regulatory motifs
–Summarize information across thousands of sites
–Distinguish: regulatory motifs vs. motif instances
–Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
[Figure labels: Enhancer regions, 5’-UTR, Promoter motifs, 3’-UTR, Splicing signals, Motifs at RNA level]

Regulatory motif discovery
–Study known motifs
–Derive conservation rules
–Discover novel motifs

Known motifs are preferentially conserved
dmel AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAC
dsim AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAC
dyak AATGATTTGC CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC AAGTCAG
dere AATGGTTTGC CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT TTTC AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
[Highlighted motif: engrailed]
In multi-species alignments: known motifs → conservation islands
–Conserved biology: conserved regulatory code, same words are functional
–Preferential conservation: stand out from surrounding nucleotides
–Good signal for identifying individual instances of known motifs
Not sufficient for motif discovery:
–Conservation not limited to exact binding site → additional bases would be found
–Weakly constrained positions can diverge → real motifs will be missed
–How do we discover motifs de novo? → Use basic property of regulatory motifs → Evaluate genome-wide conservation over thousands of instances

Known motifs are frequently conserved
Across the fly genome, the engrailed motif (TAATTA):
–appears 8599 times
–is conserved 1534 times
–Conservation rate: 17.8%
Statistical significance
–5 flies: conservation rate of random control motifs: 2.8%
–Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)
→ Motif Conservation Score (MCS)
[Figure: engrailed instances across D. mel., D. yakub., D. erecta, D. pseud.]
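One common way to express such an enrichment in standard deviations is the normal approximation to the binomial, sketched below with the counts from this slide. Treating the MCS as exactly this z-score, and the function names used here, are assumptions; the slide does not spell out the procedure behind its 35 stdev figure, so this sketch is not expected to reproduce that number.

```python
import math


def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total


def binomial_z_score(conserved, total, background_rate):
    """Standard deviations by which the conserved count exceeds the count
    expected if each instance were conserved at the background rate.
    """
    expected = total * background_rate
    stdev = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / stdev


# Counts quoted on the slide for the engrailed motif.
rate = conservation_rate(1534, 8599)
print(f"conservation rate: {rate:.1%}")
print(f"z-score vs. 2.8% background: {binomial_z_score(1534, 8599, 0.028):.1f}")
```

The same two functions can be applied to any candidate pattern once its total and conserved instance counts are known, which is what makes the score usable for ranking motifs genome-wide.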

Systematically evaluate candidate patterns
All potential motifs → Evaluate MCS → Collapse motif variants
Enumerate
–Length between 6 and 15 nt, allow central gap
–11 letter alphabet (A C G T, 2-fold codes, N)
Score
–Compute binomial score (conserved vs. total)
–Select MCS > 6.0 → specificity 97%
Collapsing
–Sequence similarity
–Overlapping occurrences
Results: 196 motifs in 3’-UTR regions, 168 motifs in promoter regions
[Figure: motif template GTC AGT R R Y gap S W]
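The 11-letter alphabet mentioned on this slide corresponds to the four bases, the six two-fold IUPAC codes, and N. The sketch below shows how a degenerate consensus over that alphabet can be matched against a sequence; the function names and the toy sequence are assumptions for illustration, and only the forward strand is scanned.

```python
# Four bases, six two-fold degenerate codes, and N: 11 letters in total.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}


def matches_at(consensus, seq, start):
    """True if every consensus letter allows the base at that position."""
    return all(
        seq[start + offset] in IUPAC[letter]
        for offset, letter in enumerate(consensus)
    )


def find_instances(consensus, seq):
    """Start positions (0-based) of all forward-strand matches."""
    last_start = len(seq) - len(consensus)
    return [i for i in range(last_start + 1) if matches_at(consensus, seq, i)]


# Hypothetical toy sequence: the R position accepts both A and G.
print(find_instances("TAATTR", "CCTAATTAGGTAATTGAA"))
```

Counting how many of these instances are also conserved across species gives the "conserved vs. total" numbers that the scoring step needs.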

Results in the fly genome: Promoter motifs
#   Consensus     MCS    Matches to known             Expression enrichment (promoters / enhancers)
1   CTAATTAAA     65.6   engrailed (en)
    TTKCAATTAA    57.3   reversed-polarity (repo)
    WATTRATTK     54.9   araucan (ara)
    AAATTTATGCK   54.4   paired (prd)
    GCAATAAA      51     ventral veins lacking (vvl)
    DTAATTTRYNR   46.7   Ultrabithorax (Ubx)
    TGATTAAT      45.7   apterous (ap)
    YMATTAAAA     43.1   abdominal A (abd-A)          72.2
9   AAACNNGTT
    RATTKAATT
    GCACGTGT      39.5   fushi tarazu (ftz)
    AACASCTG      38.8   broad-Z3 (br-Z3)
    AATTRMATTA
    TATGCWAAT
    TAATTATG      37.5   Antennapedia (Antp)
    CATNAATCA
    TTACATAA
    RTAAATCAA
    AATKNMATTT
    ATGTCAAHT
    ATAAAYAAA
    YYAATCAAA
    WTTTTATG      33.8   Abdominal B (Abd-B)
    TTTYMATTA     33.6   extradenticle (exd)
    TGTMAATA
    TAAYGAG
    AAAKTGA
    AAANNAAA
    RTAAWTTAT     32.9   gooseberry-neuro (gsb-n)
    TTATTTAYR     32.9   Deformed (Dfd)               30.7

Functional roles of 106 motifs in 3’-UTRs
(a) 60 likely involved in mRNA regulation
–AATAAA: Poly-A signal
–6 AT-rich elements: mRNA stability / degradation
–24 TGTA-rich elements: mRNA localization (PUF)
–29 other, potential targets of RNA-binding proteins
(b) 46 likely microRNA targets
–Match 114 known microRNA genes
–Enable discovery of 144 novel microRNA genes
–Estimate extent of miRNA control → 20% of human genes are miRNA targets
–Specifically match distal 8 bp of 22-mer miRNA
–6 of 12 tested using RT-PCR and confirmed
Global views of post-transcriptional regulation
[Figure labels: motif length; protein-coding gene, 3’-UTR, miRNA, microRNA gene, cleaved; 22-mer miRNA, 8-mer motif]

Results in the fly: 50 novel microRNA genes

Regulatory motif discovery in the human
Systematic discovery of regulatory motifs in the human: frequently occurring, strongly conserved short regulatory signals
174 promoter motifs
–70 match known TF motifs
–115 expression enrichment
–60 show positional bias
106 motifs in 3’-UTR
–Strand specific
–8-mers are miRNA-associated
–mRNA localization and stability
microRNA regulation
–ATATGCAA: discovered 8-mers
–114 known, new miRNA genes
–Target ~20% of human 3’-UTRs
[Figure labels: TSS, ATG, Stop, 3’-UTR]

Overview
Part 1. Genome interpretation
–Evolutionary signatures of genes
–Revisiting the human and fly genomes
–Unusual gene structures
Part 2. Gene regulation
–Regulatory motif discovery
–microRNA regulation
–Enhancer identification
Part 3. Genome evolution
–Phylogenomics

Evolutionary history of all genes in 17 fungi
Each branch
–Mean and stdev
–Num genes
–Gains, losses
Features
–Few events!
–Gain vs. loss
–Acceleration
–Churning
Applications
–Recover WGD
–Pathogenicity
–Mating evolution
–Codon capture
–Evol. parallels
[Figure labels: yeast, candida]

Gene duplication and loss in context of syntenic alignments
Synteny spans 100 million years!
[Figure: … tile 34 of 288 …; species: C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. parapsilosis, D. hansenii, C. tropicalis, C. guilliermondii, C. lusitaniae; annotations: lineage specific genes, inserted segment, species specific genes]



Rules of thumb for comparative genome sequencing
Total branch length: >4 subs/site
–Genome annotation: new genes, exons, unusual
–Regulatory motif discovery, miRNAs, enhancers
Max pair-wise branch length: <1 subs/site
–Conservation of function, nucleotide alignment quality
Conserved gene order: synteny
–Global alignment quality
Sequencing depth
–One or two genomes: >8X
–Remaining genomes: >3X, if syntenic relative exists
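The two branch-length rules above can be checked mechanically for a candidate species tree. The sketch below assumes a simple nested-tuple tree representation and interprets "max pair-wise branch length" as the longest leaf-to-leaf path; both the representation and that interpretation are assumptions for this example, and the toy tree is hypothetical.

```python
def tree_stats(node):
    """Return (total branch length, longest leaf-to-leaf path, deepest leaf).

    A node is (branch_length, [children]); a leaf has an empty child list.
    The deepest-leaf value includes the node's own branch length.
    """
    length, children = node
    if not children:
        return length, 0.0, length
    total = length
    longest_path = 0.0
    depths = []
    for child in children:
        child_total, child_path, child_depth = tree_stats(child)
        total += child_total
        longest_path = max(longest_path, child_path)
        depths.append(child_depth)
    depths.sort(reverse=True)
    if len(depths) >= 2:
        # Longest path that passes through this node.
        longest_path = max(longest_path, depths[0] + depths[1])
    return total, longest_path, length + depths[0]


def meets_rules_of_thumb(tree, min_total=4.0, max_pairwise=1.0):
    """Apply the two branch-length thresholds quoted on the slide."""
    total, longest_path, _ = tree_stats(tree)
    return {"total_ok": total > min_total, "pairwise_ok": longest_path < max_pairwise}


# Hypothetical three-species tree with branch lengths in subs/site.
toy_tree = (0.0, [(0.2, [(0.1, []), (0.15, [])]), (0.3, [])])
print(tree_stats(toy_tree)[:2])
print(meets_rules_of_thumb(toy_tree))
```

The toy tree passes the pairwise rule but falls short of the total branch length, which is the situation where adding more species at intermediate distances helps.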