Comparative genomics of 24 mammals. Manolis Kellis, MIT Computer Science & Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard.

Presentation transcript:

Sequencing the mammalian phylogeny
#    Species        Center     Covg
H1   Human          Done       Full
H2   Chimp          Done       Full
H3   Rhesus         Done       Full
H4   Mouse          Done       Full
H5   Rat            Done       Full
H6   Dog            Done       Full
H7   Cow            Done       Full
1    Elephant       Broad      1.94x
2    Armadillo      Broad      1.98x
3    Tenrec         Broad      1.90x
4    Rabbit         Broad      1.95x
5    Guinea Pig     Broad      1.92x
6    Hedgehog       Broad      1.86x
7    Shrew          Broad      1.92x
8    Microbat       Broad      1.84x
9    Tree Shrew     Broad      1.89x
10   Squirrel       Broad      1.90x
11   Bushbaby       Broad      1.87x
12   Pika           Broad      1.92x
13   Mouse Lemur    Broad      1.93x
14   Horse          Broad      5.36x
15   Cat            Agencourt  1.87x
16   Dolphin        Baylor     2.59x
17   Hyrax          Baylor     2.19x
18   Kangaroo Rat   Baylor     1.85x
19   Megabat        Baylor     ~2x
20   Alpaca         WashU      2.34x
21   Tarsier        WashU      1.88x
22   Sloth          WashU      2.10x
23   Pangolin       x          x
24   Flying lemur   x          x
Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma. Broad, Baylor, WashU, Arachne, UCSC

Comparative genomics of mammalian species
Goal 1: Discover regions of increased selection
– Detect functional elements by their increased conservation
– More genomes: detect smaller elements, subtle selection
Goal 2: Discover different classes of functional elements
– Patterns of change distinguish different types of functional elements
– Specific function → Selective pressures → Patterns of mutation/insertion/deletion
Develop evolutionary signatures characteristic of each function

Protein-coding genes Mike Lin

Evolutionary signatures for protein-coding genes
Same conservation levels, distinct patterns of divergence
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
Figure labels: non-synonymous substitutions; synonymous codon substitutions; frame-shifting gaps; gaps that are multiples of 3
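The first signature on this slide (gap lengths that preserve the reading frame) can be illustrated with a minimal sketch. It is not the scoring method used in the talk; the function names and the toy alignments are hypothetical, and the sketch simply pools all gap runs in an alignment and reports the fraction whose length is a multiple of three.

```python
import re

def gap_lengths(aligned_seq: str) -> list[int]:
    """Return the length of every maximal run of '-' in one aligned row."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(alignment: list[str]) -> float:
    """Fraction of gap runs (pooled over all rows) whose length is a multiple of 3.

    Assumption: every gap run counts equally; leading/trailing gaps, which
    may only reflect missing sequence, are not treated specially here.
    """
    lengths = [n for row in alignment for n in gap_lengths(row)]
    if not lengths:
        return 1.0  # no gaps at all: nothing breaks the frame
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

# Toy alignments (hypothetical data, for illustration only).
coding_like = ["ATGGCTAAAGGT", "ATG---AAAGGT", "ATGGCT------"]
noncoding_like = ["ATGGCTAAAGGT", "ATG-CTAAAGGT", "ATGGC--AAGGT"]

print(frame_preserving_fraction(coding_like))
print(frame_preserving_fraction(noncoding_like))
```

A region whose gaps are almost all multiples of three is behaving like coding sequence; a mix of 1- and 2-base gaps points to frame-shifting changes, which coding sequence rarely tolerates.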

Protein-coding evolution vs nucleotide conservation
Evolutionary signatures specific to each function
– Distinguish protein-coding from non-coding conservation
– Genome-wide run (CSF only): 81% sensitivity, 91% precision
– Incorporating additional signatures: RFC, single-species…
Figure panels: protein-coding exons; highly conserved non-coding elements

Many new genes confirmed by chromatin domains
– Several hundred new exons, many in clusters
– Example: MM14qC3
– Supported by chromatin signatures (Guttman et al)
Figure labels: missed exon; alt. spliced exon
Mikkelsen et al

Genome-wide curation / experimental follow-up
Novel candidate genes and exons
– Experimental cDNA sequencing and validation
– Curation of gene structures integrating evidence
Revising existing annotations
– Identify dubious genes with non-protein-like evolution
– Refine boundaries and exon sets of existing genes
– Curation: evaluate evidence supporting that annotation
Unusual gene structures
– Evolutionary evidence in absence of primary signals
– Reveal new and unusual biological mechanisms
G PI: Tim Hubbard, Sanger Center. HAVANA curators, experimental validation.

Unusual protein-coding events Mike Lin

When primary sequence signals are ignored
Typical gene (MEF2A): evolutionary signal stops at the stop codon.
Unusual gene (GPX2): protein-coding signal continues past the stop. GPX2 is a known selenoprotein! Additional candidates found.

Translational read-through in neuronal proteins
New mechanism of post-transcriptional control
– Conserved in both mammals (~5 candidates) and flies (~150 candidates)
– Strongly enriched for neurotransmitters and brain-expressed proteins
– Read-through stop codon (& surrounding) shows increased conservation
Many questions remain
– Role of editing? Cryptic splice sites? RNA secondary structure?
Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation
Novel candidate: OPRL1 neurotransmitter
Lin et al, Genome Research 2007

Measuring excess constraint within protein-coding exons
Typical protein-coding exon (numerous mutations, at each column)
Excess-conservation exon: conserved above and beyond the call of duty → Likely to have additional functions, overlapping selective pressures

Searching for excess-constraint coding sequence
(1) Build a model for expected substitution counts
– Syn. subs. correlate with degeneracy & CpG
– Distribution for each ancestral codon
(2) Score windows for depletion in syn. subst.
– Z-score: P(obs. subst | expected for each codon)
(3) Top candidate exons with excess constraint
– PCPB2: derived from ancestral transposon
– Hox B5 gene start: 52 AA before 1 syn. subst.
– C6orf111: predicted ORF on chr. 6
– EIF4G2: overlaps spliced EvoFold prediction
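The window-scoring step can be sketched as follows. The slide does not give the exact statistic, so this is one plausible reading under a stated assumption: each codon in the window independently shows a synonymous substitution with its own expected probability, and the window is scored by a normal-approximation Z-score of the observed count. The function name and the example probabilities are hypothetical.

```python
import math

def depletion_z(observed: int, expected_probs: list[float]) -> float:
    """Z-score of the observed synonymous-substitution count in a window.

    expected_probs[i] is the modelled probability that codon i shows a
    synonymous substitution (assumed independent across codons).
    Strongly negative values indicate depletion, i.e. excess constraint.
    """
    mean = sum(expected_probs)
    variance = sum(p * (1.0 - p) for p in expected_probs)
    if variance == 0.0:
        raise ValueError("window has no variance under the model")
    return (observed - mean) / math.sqrt(variance)

# Hypothetical window of four codons, each expected to change half the time.
print(depletion_z(0, [0.5, 0.5, 0.5, 0.5]))  # no substitutions observed
```

In practice the per-codon probabilities would come from step (1), where the expected counts depend on codon degeneracy and CpG context.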

Examples: Top candidate exons showing increased selection
– HoxB5: 52 amino acids before the first synonymous substitution. Overlaps highly conserved RNA secondary structure
– C6orf111: Predicted ORF, protein-coding, extremely conserved
– EIF4G2: Several consecutive exons, conserved RNA structure

microRNA genes Alex Stark Pouya Kheradpour

Evolutionary signatures for microRNA genes
(1) Conservation profile
Combine with 10 other features → 4,500-fold enrichment

Novel miRNAs validated by sequencing reads
In fly genome: 101 hairpins above 0.95 cutoff
– 60 of 74 (81%) known Rfam miRNAs rediscovered
– + 24 novel, expression-validated by 454 & Solexa (Bartel/Hannon)
– + 17 additional candidates show diverse evidence of function
In mammals: combine experimental & evolutionary info
– Rely on reads for discovery, use evolutionary signal to study function
Figure labels: 348 reads; 16 reads (Ruby, Bartel, Lai)
Stark et al, Genome Research (GR) 2007; Ruby et al, GR 2007

Surprise 1: microRNA & microRNA* function
Both hairpin arms of a microRNA can be functional
– High scores, abundant processing, conserved targets
– Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
Figure label: Drosophila Hox
Stark et al, Genome Research 2007

Surprise 2: microRNA-anti-sense function
– A single miRNA locus transcribed from both strands
– The two transcripts show distinct expression domains (mutually exclusive)
– Both processed to mature miRNAs: mir-iab-4, miR-iab-4AS (anti-sense)
Figure labels: sense; anti-sense; highly conserved Hox targets
Stark et al, Genes&Development 2007

miR-iab-4AS leads to homeotic transformations
– Mis-expression of mir-iab-4S & AS: halteres → wings homeotic transformation
– Stronger phenotype for AS miRNA
– Sense/anti-sense pairs as general building blocks for miRNA regulation
– 10 sense/anti-sense miRNAs in mouse
Figure labels: WT, sense, antisense; haltere → wing; sensory bristles → wing w/ bristles. Note: C, D, E same magnification
Stark et al, Genes&Development 2007

Function of miRNA* arms and anti-sense miRNAs Denser Hox miRNA targeting network

Measuring selection Michele Clamp Manuel Garber Xiaohui Xie

Detecting Purifying Selection (ω)
Neutral sequence vs. constrained sequence
Estimating intensity of constraint (ω):
– Probabilistic evolutionary model
– Maximum Likelihood (ML) estimation of ω
  - sitewise (evaluate every k-long window)
  - windows-based (increased power)
– Reports ω and its log odds score (LODS)
– Theoretical p-value (LODS distributes χ² with df = 1)
Manuel Garber, Michele Clamp, Xiaohui Xie
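The theoretical p-value mentioned on this slide can be sketched with the standard library alone, since the upper tail of a χ² distribution with one degree of freedom has the closed form erfc(√(x/2)). Whether the reported LODS is itself the χ²-distributed statistic, or needs to be doubled first (as in a likelihood-ratio test on natural-log likelihoods), is not stated on the slide, so the function below takes the test statistic directly; its name is hypothetical.

```python
import math

def chi2_df1_pvalue(statistic: float) -> float:
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom.

    Assumption: `statistic` is already on the chi-square scale. If the LODS is
    a natural-log likelihood ratio, the usual test statistic is 2 * LODS.
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))

# 3.8415 is the familiar 5% critical value for df = 1.
print(chi2_df1_pvalue(3.841458820694124))
```

Because every window in the genome is tested, such p-values would still need a multiple-testing correction, which is consistent with the FDR cutoffs used on the following slides.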

Detecting other constraint signatures (π)
Example: repeated C → G transversion. Has happened at least 4 times. Very unlikely given neutral model.
Goal: Identify sites with unlikely substitution pattern.
Approach: Probabilistic method to detect a stationary distribution that is different from background.
Solution: Implement ML estimator (π) of this vector
– Provides a Position Weight Matrix for any given k-mer in the genome
– Scores every base in the genome (LODS)
Manuel Garber, Michele Clamp, Xiaohui Xie

Estimation of genome-wide constraint
Across entire genome: 5% under selection. Same as for Human-Mouse. What’s different?
Pilot ENCODE regions (1%): 9.4% conserved, 5.7% above FDR cutoff
Genome-wide: 10.5% conserved, 6% above FDR cutoff
Manuel Garber, Michele Clamp, Xiaohui Xie

More mammals: We can actually tell which 5% it is!
Figure panels: constraint calculated over a 12-mer and over a 50-mer, each comparing 4 mammals with 21 mammals at a 5% FDR cutoff; one panel is annotated ">40% FDR".
Michele Clamp

Individual conserved elements match known TF sites
Binding site resolution, even without known motif model
Example: TNNC1 (Troponin C)
Figure tracks (5’ end marked): promoter alignment; constraint score; known TF binding sites (TATA, SP-1, CEF-2, CEF1)
Michele Clamp

Binding sites for known regulators Pouya Kheradpour Alex Stark

Computing Branch Length Score (BLS)
Example: CTCF, BLS = 2.23 sps (78%)
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
3. Missing motif in dense species tree
Figure labels: mutations; missing short branches; movement
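A minimal sketch of a branch length score follows, assuming the score is the total branch length of the smallest subtree connecting the species in which the motif was found, optionally reported as a fraction of the whole tree (as in "2.23 sps (78%)"). The tree, its branch lengths, and all names below are hypothetical; the tolerance for motif movement and degeneracy described on the slide is assumed to have been applied upstream, when deciding which species count as having the motif.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    length: float = 0.0                 # branch leading to this node (subs/site)
    children: list = field(default_factory=list)

def total_length(node: Node) -> float:
    """Sum of all branch lengths in the tree rooted at `node`."""
    return node.length + sum(total_length(c) for c in node.children)

def leaf_names(node: Node) -> set:
    if not node.children:
        return {node.name}
    return set().union(*(leaf_names(c) for c in node.children))

def branch_length_score(root: Node, species_with_motif) -> float:
    """Branch length of the minimal subtree connecting the given species."""
    hits = set(species_with_motif) & leaf_names(root)
    n_hits = len(hits)

    def walk(node):
        if node.children:
            below, score = 0, 0.0
            for child in node.children:
                b, s = walk(child)
                below, score = below + b, score + s
        else:
            below, score = (1 if node.name in hits else 0), 0.0
        # A branch is on a connecting path only if hit species lie on both sides.
        if 0 < below < n_hits:
            score += node.length
        return below, score

    return walk(root)[1]

# Hypothetical five-species tree.
tree = Node("root", 0.0, [
    Node("primates", 0.2, [Node("human", 0.1), Node("chimp", 0.1)]),
    Node("rodents", 0.4, [Node("mouse", 0.3), Node("rat", 0.3)]),
    Node("dog", 0.5),
])
bls = branch_length_score(tree, {"human", "mouse", "dog"})
print(round(bls, 3), round(bls / total_length(tree), 3))
```

Skipping chimp and rat costs only their own short branches, which is why a missing motif in a densely sampled part of the tree lowers the score only slightly.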

Branch Length Score → Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Many species are needed to confidently predict instances
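The confidence calculation can be sketched as below, assuming the control counts have already been scaled to one control motif's worth of instances (for example, averaged over several shuffles). The function name and the example scores are hypothetical, and clamping negative values to zero is a choice made here, not something stated on the slide.

```python
def confidence(real_bls: list[float], control_bls: list[float], threshold: float) -> float:
    """Fraction of real instances at BLS >= threshold not explained by controls.

    Equals 1 - FDR, where FDR is estimated as (control count) / (real count).
    Assumption: control_bls is normalised to a single control motif.
    """
    real = sum(b >= threshold for b in real_bls)
    noise = sum(b >= threshold for b in control_bls)
    if real == 0:
        return 0.0  # no instances at this threshold: nothing to be confident in
    return max(0.0, 1.0 - noise / real)

# Hypothetical BLS values for a motif and its shuffled control.
real = [0.1, 0.5, 0.9, 0.9]
control = [0.1, 0.2, 0.9, 0.3]
for t in (0.0, 0.5):
    print(t, round(confidence(real, control, t), 3))
```

Raising the BLS threshold typically removes control instances faster than real ones, so confidence climbs with BLS; that in turn is why many species (more total branch length) are needed to reach high confidence.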

Performance on vertebrate Transfac motifs
1. Most motifs have instances that reach 90% confidence with 18 mammals
2. Substantial increase in the number of instances compared to only human, mouse, rat and dog
Figure: median number of instances (at fixed confidence), with annotated increases of 2.5x, 3.5x and 6.5x

Intersection with CTCF ChIP-Seq regions
ChIP-Seq and ChIP-Chip technologies allow for identifying binding sites of a motif experimentally
1. Conserved CTCF motif instances highly enriched in ChIP-Seq sites
2. High enrichment does not require low sensitivity
3. Many motif instances are verified
Figure annotations: ≥ 50% of regions with a motif; 50% motifs verified; 50% confidence
ChIP data from Barski, et al., Cell (2007)

Enrichment also found for other factors
We can accurately identify targets for many factors
Data sets: Barski, et al., Cell (2007); Odom, et al., Nature Genetics (2007); Lim, et al., Molecular Cell (2007); Robertson, et al., Nature Methods (2006); Wei, et al., Cell (2006); Zeller, et al., PNAS (2006); Lin, et al., PLoS Genetics (2007)

Enrichment increases in conserved bound regions
1. ChIP bound regions may not be conserved
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
Human: Barski, et al., Cell (2007). Mouse: Bernstein, unpublished

Enrichment increases in conserved bound regions
1. ChIP bound regions may not be conserved
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
4. Trend persists for other factors where we have multi-species ChIP data
Human: Barski, et al., Cell (2007). Mouse: Bernstein, unpublished. Odom, et al., Nature Genetics (2007)

Motif discovery Pouya Kheradpour Alex Stark

Using confidence for motif discovery
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)

Motif discovery pipeline
1. Enumerate motif seeds: six non-degenerate characters with variable size gap in the middle
2. Score seed motifs: use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs: use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing
4. Cluster to remove redundancy: using sequence similarity
Figure: seed GTC [gap] AGT, expanded with IUPAC characters R, R, Y, S, W
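Step 1 and the IUPAC matching needed by step 3 can be sketched as follows. The slide does not say how the six fixed characters are split around the gap or how large the gap may be; the 3 + gap + 3 layout is inferred from the "GTC gap AGT" figure, and the maximum gap size is an arbitrary parameter here. All function names are hypothetical, and the scoring, hill-climbing, and clustering steps are not shown.

```python
import re
from itertools import product

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def enumerate_seeds(max_gap: int):
    """Yield seeds of the form XXX + N*gap + XXX for gap = 0..max_gap."""
    triplets = ["".join(t) for t in product("ACGT", repeat=3)]
    for left in triplets:
        for right in triplets:
            for gap in range(max_gap + 1):
                yield left + "N" * gap + right

def count_instances(motif: str, sequence: str) -> int:
    """Count (possibly overlapping) matches of an IUPAC motif on one strand."""
    body = "".join(IUPAC[ch] for ch in motif)
    # The lookahead lets matches that overlap each be counted once.
    return len(re.findall("(?=" + body + ")", sequence))

n_seeds = sum(1 for _ in enumerate_seeds(max_gap=2))
print(n_seeds)                                      # 4**6 seeds per gap size
print(count_instances("GTCNNAGT", "AAGTCTTAGTCC"))  # toy sequence
```

In a fuller pipeline each seed's instances would be scored for conservation against its shuffled controls before any expansion; this sketch covers only enumeration and matching.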

Motif discovery in enhancer regions
Collaboration with Ren, White, Posakony labs
– Predict novel enhancer / promoter / insulator elements
– Identify motifs associated with these regions
– Validate predicted regions for in vivo function
Initial results in human genome
– Motif combinations predictive of enhancer regions (5X)
Heintzman et al, Bing Ren’s lab

Motif discovery in 3’UTRs 1.Perform motif discovery by ranking 7-mers in 3’UTRs by the highest confidence they reach with 100 instances.

Summary
Measuring increased selection
– Scaling of branch lengths: ω
– Non-random stationary distribution: π
– Increased resolution: individual binding sites
Protein-coding genes
– Distinct evolutionary signatures
– Novel genes, revised genes
– Unusual structures: read-through, increased selection
microRNAs
– Function of miRNA/miRNA* and sense/anti-sense pairs
– Dense miRNA targeting network for Hox cluster
Regulatory motifs
– Measure increased selection, derive confidence score
– High sensitivity / high specificity for known motifs
– Use enumeration/confidence metric for motif discovery

Acknowledgements
Alex Stark, Pouya Kheradpour, Mike Lin, Matt Rasmussen, Michele Clamp, Xiaohui Xie, Kerstin Lindblad-Toh, Manuel Garber, Sante Gnerre, David Jaffe, Issao Fujiwara, Federica Di Palma, Arachne Assembly Team, Broad Sequencing Platform, Eric Lander
Institutions: MIT Computer Science and AI Lab; Broad Institute of MIT and Harvard
Sequencing: Baylor, WashU, Agencourt
miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS: Natascha Bushati, Steve Cohen, Julius, Greg Hannon
Funding: NHGRI