Comparative genomics of 24 mammals. Manolis Kellis, MIT Computer Science & Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard.

Sequencing the mammalian phylogeny

#   Species        Center     Covg
H1  Human          Done       Full
H2  Chimp          Done       Full
H3  Rhesus         Done       Full
H4  Mouse          Done       Full
H5  Rat            Done       Full
H6  Dog            Done       Full
H7  Cow            Done       Full
1   Elephant       Broad      1.94x
2   Armadillo      Broad      1.98x
3   Tenrec         Broad      1.90x
4   Rabbit         Broad      1.95x
5   Guinea Pig     Broad      1.92x
6   Hedgehog       Broad      1.86x
7   Shrew          Broad      1.92x
8   Microbat       Broad      1.84x
9   Tree Shrew     Broad      1.89x
10  Squirrel       Broad      1.90x
11  Bushbaby       Broad      1.87x
12  Pika           Broad      1.92x
13  Mouse Lemur    Broad      1.93x
14  Horse          Broad      5.36x
15  Cat            Agencourt  1.87x
16  Dolphin        Baylor     2.59x
17  Hyrax          Baylor     2.19x
18  Kangaroo Rat   Baylor     1.85x
19  Megabat        Baylor     ~2x
20  Alpaca         WashU      2.34x
21  Tarsier        WashU      1.88x
22  Sloth          WashU      2.10x
23  Pangolin       x          x
24  Flying lemur   x          x

Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma. Broad, Baylor, WashU, Arachne, UCSC.

Comparative genomics of mammalian species
Goal 1: Discover regions of increased selection
– Detect functional elements by their increased conservation
– More genomes: detect smaller elements, subtle selection
Goal 2: Discover different classes of functional elements
– Patterns of change distinguish different types of functional elements
– Specific function → selective pressures → patterns of mutation/insertion/deletion
Develop evolutionary signatures characteristic of each function.

Protein-coding genes Mike Lin

Evolutionary signatures for protein-coding genes
Same conservation levels, distinct patterns of divergence:
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets are exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
[Figure labels: non-synonymous substitutions; synonymous codon substitutions; frame-shifting gaps; gaps that are multiples of 3]
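Two of the signatures above (gap lengths that are multiples of three, and substitutions concentrated at one codon position) can be checked directly on a pairwise alignment. The following is a minimal sketch, not the method used in the talk; the function names and the toy sequences are hypothetical, and it assumes the reference sequence starts in frame 0.

```python
import re

def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seqs):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

def substitution_periodicity(ref, other):
    """Count mismatches at codon positions 1, 2 and 3 of the reference.

    Assumes `ref` begins in frame 0; columns where `ref` has a gap are skipped,
    and columns where `other` has a gap are not counted as substitutions.
    """
    counts = [0, 0, 0]
    pos = 0  # position within the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[pos % 3] += 1
        pos += 1
    return counts

# Toy example: all three differences fall on the third codon position.
print(substitution_periodicity("ATGGCCAAATTT", "ATGGCTAAGTTC"))
```

A coding region would be expected to show a high frame-preserving fraction and a mismatch count skewed toward the third codon position, while conserved non-coding sequence would not.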

Protein-coding evolution vs nucleotide conservation
Evolutionary signatures specific to each function:
– Distinguish protein-coding from non-coding conservation
– Genome-wide run (CSF only): 81% sensitivity, 91% precision
– Incorporating additional signatures: RFC, single-species…
[Figure panels: protein-coding exons; highly conserved non-coding elements]

Many new genes confirmed by chromatin domains
Several hundred new exons, many in clusters. Example: MM14qC3, supported by chromatin signatures (Guttman et al; Mikkelsen et al).
[Figure labels: missed exon; alternatively spliced exon]

Genome-wide curation / experimental follow-up
Novel candidate genes and exons
– Experimental cDNA sequencing and validation
– Curation of gene structures integrating evidence
Revising existing annotations
– Identify dubious genes with non-protein-like evolution
– Refine boundaries and exon sets of existing genes
– Curation: evaluate evidence supporting that annotation
Unusual gene structures
– Evolutionary evidence in absence of primary signals
– Reveal new and unusual biological mechanisms
PI: Tim Hubbard, Sanger Center. HAVANA curators, experimental validation.

Unusual protein-coding events Mike Lin

When primary sequence signals are ignored
Typical gene (MEF2A): the evolutionary signal stops at the stop codon.
Unusual gene (GPX2): the protein-coding signal continues past the stop codon. GPX2 is a known selenoprotein! Additional candidates found.

Translational read-through in neuronal proteins
New mechanism of post-transcriptional control.
– Conserved in both mammals (~5 candidates) and flies (~150 candidates)
– Strongly enriched for neurotransmitters and brain-expressed proteins
– The read-through stop codon (and its surroundings) shows increased conservation
Many questions remain
– Role of editing? Cryptic splice sites? RNA secondary structure?
[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation. Novel candidate: OPRL1 neurotransmitter]
Lin et al, Genome Research 2007

Measuring excess constraint within protein-coding exons
Typical protein-coding exon: numerous mutations, at each column.
Excess-conservation exon: conserved above and beyond the call of duty → likely to have additional functions, overlapping selective pressures.

Searching for excess-constraint coding sequence
(1) Build a model for expected substitution counts. Synonymous substitutions correlate with degeneracy and CpG; a distribution is built for each ancestral codon.
(2) Score windows for depletion in synonymous substitutions. Z-score: P(observed substitutions | expected for each codon).
(3) Top candidate exons with excess constraint:
– PCPB2: derived from ancestral transposon
– Hox B5 gene start: 52 AA before the first synonymous substitution
– C6orf111: predicted ORF on chr. 6
– EIF4G2: overlaps spliced EvoFold prediction
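The window-scoring step can be illustrated with a simple Z-score for depletion of synonymous substitutions. This is a sketch under loudly stated assumptions: each codon is treated as an independent Bernoulli trial with a model-supplied probability of carrying a synonymous substitution, which is a simplification and not necessarily the model behind the slide. The function name and inputs are hypothetical.

```python
import math

def window_z_scores(observed, expected, window):
    """Z-score of observed vs. expected synonymous substitutions per window.

    observed: list of 0/1 flags, one per codon (1 = synonymous substitution seen)
    expected: list of probabilities, one per codon, from some substitution model
              (assumption: codons are independent Bernoulli trials)
    window:   number of codons per sliding window
    Strongly negative scores indicate depletion, i.e. excess constraint.
    """
    scores = []
    for start in range(len(observed) - window + 1):
        obs = sum(observed[start:start + window])
        probs = expected[start:start + window]
        mean = sum(probs)
        var = sum(p * (1.0 - p) for p in probs)
        scores.append((obs - mean) / math.sqrt(var) if var > 0 else 0.0)
    return scores

# Four codons, each expected to change half the time, none observed to change.
print(window_z_scores([0, 0, 0, 0], [0.5, 0.5, 0.5, 0.5], window=4))
```

In practice the per-codon expectations would come from the degeneracy- and CpG-aware model described in step (1), and window scores would be ranked to pick candidates as in step (3).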

Examples: Top candidate exons showing increased selection
– HoxB5: 52 amino acids before the first synonymous substitution. Overlaps highly conserved RNA secondary structure.
– C6orf11: predicted ORF, protein-coding, extremely conserved.
– EIF4G2: several consecutive exons, conserved RNA structure.

microRNA genes Alex Stark Pouya Kheradpour

Evolutionary signatures for microRNA genes
(1) Conservation profile
Combine with 10 other features → 4,500-fold enrichment

Novel miRNAs validated by sequencing reads
[Figure: read counts for two examples, 348 reads and 16 reads; data from Ruby, Bartel, Lai]
In the fly genome: 101 hairpins above the 0.95 cutoff
– 60 of 74 (81%) known Rfam miRNAs rediscovered
– 24 novel, expression-validated by 454 and Solexa (Bartel/Hannon)
– 17 additional candidates show diverse evidence of function
In mammals: combine experimental and evolutionary information. Rely on reads for discovery; use the evolutionary signal to study function.
Stark et al, Genome Research (GR) 2007. Ruby et al, GR 2007.

Surprise 1: microRNA & microRNA* function
Both hairpin arms of a microRNA can be functional
– High scores, abundant processing, conserved targets
– Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
[Figure: Drosophila Hox]
Stark et al, Genome Research 2007

Surprise 2: microRNA anti-sense function
– A single miRNA locus is transcribed from both strands
– The two transcripts show distinct (mutually exclusive) expression domains
– Both are processed to mature miRNAs: mir-iab-4 (sense) and miR-iab-4AS (anti-sense)
– Highly conserved Hox targets
Stark et al, Genes & Development 2007

miR-iab-4AS leads to homeotic transformations
– Mis-expression of mir-iab-4S and AS: halteres → wings homeotic transformation
– Stronger phenotype for the AS miRNA
– Sense/anti-sense pairs as general building blocks for miRNA regulation
– 10 sense/anti-sense miRNAs in mouse
[Figure panels: WT, sense, antisense; haltere → wing; haltere sensory bristles → wing with bristles. Note: C, D, E same magnification]
Stark et al, Genes & Development 2007

Function of miRNA* arms and anti-sense miRNAs Denser Hox miRNA targeting network

Measuring selection Michele Clamp Manuel Garber Xiaohui Xie

Detecting Purifying Selection (ω)
[Figure panels: neutral sequence; constrained sequence]
Estimating the intensity of constraint (ω):
– Probabilistic evolutionary model
– Maximum Likelihood (ML) estimation of ω, either sitewise or windows-based (evaluate every k-long window; increased power)
– Reports ω and its log odds score (LODS)
– Theoretical p-value (LODS is distributed as χ² with df = 1)
Manuel Garber, Michele Clamp, Xiaohui Xie
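The theoretical p-value mentioned above can be computed in closed form, because the upper tail of a χ² distribution with one degree of freedom equals erfc(sqrt(x/2)). The sketch below assumes the test statistic is twice the difference in natural-log likelihoods between the fitted-ω model and the neutral model (the usual likelihood-ratio convention); the slide does not state the exact scaling of LODS, so that scaling and the function names are assumptions.

```python
import math

def chi2_df1_sf(x):
    """Upper-tail probability P(X > x) for a chi-square variable with df = 1."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

def constraint_p_value(lnl_fitted, lnl_neutral):
    """P-value for constraint from two natural-log likelihoods.

    Assumption: statistic = 2 * (lnL under fitted omega - lnL under neutral omega = 1).
    """
    statistic = 2.0 * (lnl_fitted - lnl_neutral)
    return chi2_df1_sf(statistic)

# A statistic of about 3.84 corresponds to p of about 0.05 at df = 1.
print(chi2_df1_sf(3.841458820694124))
```

Using the standard library's `math.erfc` avoids a dependency on a statistics package for this one-degree-of-freedom case.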

Detecting other constraint signatures (π)
[Figure: alignment with per-site values 0, 0, 0.8, 0.5, 0.6, 3.2, 0, 0. Highlighted site: repeated C → G transversion, which has happened at least 4 times; very unlikely given the neutral model.]
Goal: identify sites with an unlikely substitution pattern.
Approach: probabilistic method to detect a stationary distribution that is different from background.
Solution: implement an ML estimator of this vector (π). It provides a Position Weight Matrix for any given k-mer in the genome and scores every base in the genome (LODS).
Manuel Garber, Michele Clamp, Xiaohui Xie
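Once a per-site stationary distribution is available, scoring a k-mer against it works like a standard position weight matrix log-odds score. The sketch below shows only that generic scoring step; it does not implement the phylogenetic ML estimator of π from the talk, and the uniform background, the example matrix, and the function name are illustrative assumptions.

```python
import math

# Assumed uniform background; a real analysis would estimate this from the genome.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pwm_log_odds(kmer, pwm, background=BACKGROUND):
    """Sum of log2(pi_i[base] / background[base]) over the positions of a k-mer.

    pwm: list of dicts, one per position, mapping base -> estimated probability.
    """
    if len(kmer) != len(pwm):
        raise ValueError("k-mer length must match the number of PWM positions")
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(kmer, pwm)
    )

# Hypothetical two-position matrix: position 1 prefers A, position 2 prefers G.
toy_pwm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]
print(pwm_log_odds("AG", toy_pwm))
```

Positive scores mean the k-mer fits the site-specific distribution better than background; negative scores mean the opposite.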

Estimation of genome-wide constraint
Pilot ENCODE regions (1%): 9.4% conserved, 5.7% above FDR cutoff.
Genome-wide: 10.5% conserved, 6% above FDR cutoff.
Across the entire genome: 5% under selection. Same as for Human-Mouse. What's different?
Manuel Garber, Michele Clamp, Xiaohui Xie

More mammals: We can actually tell which 5% it is!
[Figure: constraint at 5% FDR for 4 mammals vs 21 mammals, calculated over a 12-mer and over a 50-mer; one panel is labeled >40% FDR.]
Michele Clamp

Individual conserved elements match known TF sites
Binding site resolution, even without a known motif model.
Example: TNNC1 (Troponin C).
[Figure tracks: promoter alignment; constraint score; known TF binding sites (TATA, SP-1, CEF-2, CEF1)]
Michele Clamp

Binding sites for known regulators Pouya Kheradpour Alex Stark

Computing Branch Length Score (BLS)
Example: CTCF, BLS = 2.23 sps (78%).
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within a window (up to hundreds of nucleotides)
3. Missing motif in a dense species tree
[Figure labels: mutations; missing short branches; movement]
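One way to picture a branch length score is as the total length of the smallest subtree connecting all species in which the motif instance is found. The sketch below computes that quantity on a toy tree; the tree, its branch lengths, and the function names are hypothetical, and the tolerance for motif movement and degeneracy listed above is not modeled here.

```python
def branch_length_score(parent, species_with_motif):
    """Total branch length of the minimal subtree spanning the given leaves.

    parent: dict mapping node -> (parent_node, branch_length); the root has no entry.
    An edge belongs to the spanning subtree when some, but not all, of the
    selected leaves lie below it (edges above their common ancestor are excluded).
    """
    selected = set(species_with_motif)
    below = {}  # edge (named by its child node) -> number of selected leaves under it
    for leaf in selected:
        node = leaf
        while node in parent:
            below[node] = below.get(node, 0) + 1
            node = parent[node][0]
    k = len(selected)
    return sum(parent[node][1] for node, count in below.items() if count < k)

def total_tree_length(parent):
    """Sum of all branch lengths, for expressing a BLS as a fraction of the tree."""
    return sum(length for _, length in parent.values())

# Hypothetical toy tree: ((human:0.1, mouse:0.4):0.1, dog:0.3)
toy_tree = {
    "human": ("n1", 0.1),
    "mouse": ("n1", 0.4),
    "n1": ("root", 0.1),
    "dog": ("root", 0.3),
}
bls = branch_length_score(toy_tree, ["human", "mouse", "dog"])
print(bls, bls / total_tree_length(toy_tree))
```

Dividing by the total tree length gives a percentage comparable in spirit to the "78%" quoted for the CTCF example, though the exact normalization used in the talk is not stated.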

Branch Length Score → Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute the Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Many species are needed to confidently predict instances
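The confidence calculation described in the steps above can be sketched as one minus the ratio of control to real instances at or above a BLS threshold. This is a minimal illustration with hypothetical names and toy numbers; how the shuffled controls are generated and how thresholds are chosen are left out, and averaging over the number of control motifs is an assumption.

```python
def confidence_at(threshold, real_bls, control_bls, n_controls=1):
    """Confidence = 1 - (expected control instances / real instances) at BLS >= threshold.

    real_bls:    BLS values of instances of the real motif
    control_bls: BLS values of instances pooled over all shuffled control motifs
    n_controls:  number of control motifs pooled (assumption: counts are averaged)
    """
    real = sum(1 for score in real_bls if score >= threshold)
    if real == 0:
        return 0.0
    noise = sum(1 for score in control_bls if score >= threshold) / n_controls
    return max(0.0, 1.0 - noise / real)

# Toy numbers: three real instances and one control instance reach BLS >= 0.5.
real = [0.1, 0.5, 0.9, 0.9]
control = [0.1, 0.2, 0.9]
print(confidence_at(0.5, real, control))
```

Raising the threshold typically trades fewer instances for higher confidence, which is why denser species trees (more achievable BLS values) help.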

Performance on vertebrate Transfac motifs
1. Most motifs have instances reaching 90% confidence with 18 mammals
2. Substantial increase in the number of instances compared to using only human, mouse, rat and dog
[Figure: median number of instances at fixed confidence; increases of 2.5x, 3.5x and 6.5x]

Intersection with CTCF ChIP-Seq regions
ChIP-Seq and ChIP-Chip technologies allow binding sites of a factor to be identified experimentally.
1. Conserved CTCF motif instances are highly enriched in ChIP-Seq sites
2. High enrichment does not require low sensitivity
3. Many motif instances are verified
[Figure annotations: ≥ 50% of regions with a motif; 50% of motifs verified; 50% confidence]
ChIP data from Barski, et al., Cell (2007)

Enrichment also found for other factors
We can accurately identify targets for many factors.
Data from: Barski, et al., Cell (2007); Odom, et al., Nature Genetics (2007); Lim, et al., Molecular Cell (2007); Robertson, et al., Nature Methods (2006); Wei, et al., Cell (2006); Zeller, et al., PNAS (2006); Lin, et al., PLoS Genetics (2007).

Enrichment increases in conserved bound regions
1. ChIP bound regions may not be conserved
2. For CTCF we also have binding data in mouse
3. Enrichment in the intersection is dramatically higher
4. The trend persists for other factors where we have multi-species ChIP data
Human: Barski, et al., Cell (2007). Mouse: Bernstein, unpublished. Odom, et al., Nature Genetics (2007).

Motif discovery Pouya Kheradpour Alex Stark

Using confidence for motif discovery
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute the Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)

Motif discovery pipeline
1. Enumerate motif seeds: six non-degenerate characters with a variable-size gap in the middle
2. Score seed motifs: use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs: use the expanded nucleotide IUPAC alphabet to fill unspecified bases around the seed using hill climbing
4. Cluster to remove redundancy: using sequence similarity
[Figure: seed GTC-gap-AGT, expanded with IUPAC characters R, R, Y, S, W]
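The seed enumeration in step 1 can be sketched as all pairs of non-degenerate 3-mers separated by a run of unspecified positions. In this illustration the maximum gap size is a hypothetical parameter (the slide only says the gap size is variable), "N" marks an unspecified base, and the scoring, expansion, and clustering steps are not shown.

```python
from itertools import product

def enumerate_seeds(max_gap):
    """Yield seeds of the form XXX + N*gap + XXX for every gap in 0..max_gap.

    max_gap is an assumed parameter; each seed has six specified bases.
    """
    for gap in range(max_gap + 1):
        for chars in product("ACGT", repeat=6):
            yield "".join(chars[:3]) + "N" * gap + "".join(chars[3:])

def count_matches(seed, sequence):
    """Count positions in `sequence` where the seed matches ('N' matches any base)."""
    k = len(seed)
    return sum(
        all(s == "N" or s == c for s, c in zip(seed, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    )

seeds = list(enumerate_seeds(max_gap=2))
print(len(seeds))                                # 4**6 seeds per gap size
print(count_matches("GTCNNAGT", "AAGTCTTAGTGG"))
```

With 4^6 = 4096 seeds per gap size, exhaustive enumeration stays cheap, which is what makes a rank-then-expand strategy practical.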

Motif discovery in enhancer regions
Collaboration with the Ren, White and Posakony labs
– Predict novel enhancer / promoter / insulator elements
– Identify motifs associated with these regions
– Validate predicted regions for in vivo function
Initial results in the human genome
– Motif combinations predictive of enhancer regions (5X)
Heintzman et al, Bing Ren's lab

Motif discovery in 3'UTRs
1. Perform motif discovery by ranking 7-mers in 3'UTRs by the highest confidence they reach with 100 instances.

Summary
Measuring increased selection
– Scaling of branch lengths: ω
– Non-random stationary distribution: π
– Increased resolution: individual binding sites
Protein-coding genes
– Distinct evolutionary signatures
– Novel genes, revised genes
– Unusual structures: read-through, increased selection
microRNAs
– Function of miRNA/miRNA* and sense/anti-sense pairs
– Dense miRNA targeting network for Hox cluster
Regulatory motifs
– Measure increased selection, derive confidence score
– High sensitivity / high specificity for known motifs
– Use enumeration/confidence metric for motif discovery

Acknowledgements
MIT Computer Science and AI Lab; Broad Institute of MIT and Harvard
Alex Stark, Pouya Kheradpour, Mike Lin, Matt Rasmussen, Michele Clamp, Xiaohui Xie, Kerstin Lindblad-Toh, Manuel Garber
Sante Gnerre, David Jaffe, Issao Fujiwara, Federica Di Palma, Arachne Assembly Team, Broad Sequencing Platform, Eric Lander
Sequencing: Baylor, WashU, Agencourt. Funding: NHGRI
miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS: Natascha Bushati, Steve Cohen, Julius, Greg Hannon
