BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG


Conservation Scores. BNFO 602/691 Biological Sequence Analysis. Mark Reimers, VIPBG

Conservation and Function: what kinds of DNA regions get conserved? Core coding regions are usually conserved across hundreds of millions of years (Myr). Active sites of enzymes and crucial structural elements of proteins are highly conserved. Untranslated regions of genes are conserved over tens but not over hundreds of Myr. Some regulatory regions evolve ‘quickly’, over a time scale of tens of Myr.

Conservation and Function: what kinds of DNA regions get conserved? Many splice sites and splice regulators are conserved between mouse and human. Most promoters (70%) are conserved between mouse and human. The majority (~70%) of enhancers are not conserved, but a significant minority are highly conserved.

Approaches to Scoring Conservation. Base-wise: PhyloP, GERP. Small regions: PhastCons. Small regions, tracking substitution bias: SiPhy. Regulatory conservation within exons may be detected by any of these methods. Key regulatory regions are harder to see.

DEMO: UCSC Alignment & Conservation Tracks. GAPDH overall.

Genomic Alignment. Alignment is crucial (and not trivial). Conservation scores are computed on aligned genomes only. Alignments of 46 placental mammals to the human genome are available in MultiZ format at UCSC; a subset of primate alignments is also available. Common problems with alignments: common alignment algorithms may misplace ambiguous bases, leading to artifactual gaps, and inversions are often badly handled. Issue: incomplete alignments are not reflected in the scores of any current algorithm.

Alignment Issues. When studying protein-coding regions, substitutions are the most common change. Most genome evolution, however, happens through insertions or deletions. The alignable portion of the human and chimp genomes is 97% identical, but only 91% of the genome is alignable. Regions may acquire regulatory function in some lineages but have no function in most.
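The two percentages on this slide measure different things: identity is computed only over aligned columns, while the alignable fraction is computed over all reference bases. A minimal Python sketch of that distinction follows; the function name and the toy sequences are illustrative assumptions, not taken from the lecture. With the slide's figures, 0.97 × 0.91 ≈ 0.88 of reference bases would be both aligned and identical.

```python
def alignment_summary(ref_row: str, other_row: str) -> dict:
    """Summarise a pairwise alignment given as two equal-length gapped rows.

    '-' marks a gap. Only columns where both rows carry A/C/G/T count as aligned.
    """
    if len(ref_row) != len(other_row):
        raise ValueError("aligned rows must have equal length")
    ref_bases = aligned = identical = 0
    for r, o in zip(ref_row.upper(), other_row.upper()):
        if r == "-":
            continue  # insertion in the other species: not a reference base
        ref_bases += 1
        if r in "ACGT" and o in "ACGT":
            aligned += 1
            identical += (r == o)
    return {
        "alignable_fraction": aligned / ref_bases if ref_bases else 0.0,
        "identity_in_aligned": identical / aligned if aligned else 0.0,
        "identity_over_reference": identical / ref_bases if ref_bases else 0.0,
    }


# Toy example (hypothetical sequences): a 2-base deletion and one mismatch.
summary = alignment_summary("ACGGTGCAGT", "AC--TGCTGT")
print(summary)
```

Reporting only `identity_in_aligned` hides the unaligned bases, which is why the two numbers on the slide need to be read together.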

UCSC Alignment Symbols. Single line ‘-’: no bases in the aligned species; may reflect an insertion in the human genome or a deletion in the aligning species. Double line ‘=’: the aligning species has unalignable bases in the gap region; many mutations or independent indels in between the aligned blocks in both species. Pale yellow coloring: the aligning species has Ns in the gap region; sequencing problems in the aligning species.

Conservation Across Mammals Differs from Conservation Across Primates. Many regions conserved across mammals are also conserved across primates; a few appear not to be. Some regions appear to be conserved (insofar as can be measured) in primates but not across all mammals. What is the diagonal? Are these regions conserved?

Genomic Evolutionary Rate Profiling (GERP) Measures Base Conservation. Estimates the mean number of substitutions in each aligned genome to estimate the neutral evolution rate. The original score is “rejected substitutions”: the number of substitutions expected under ‘neutrality’ minus the number of substitutions observed at each aligned position. Newer scores are based on a maximum-likelihood (ML) fit of the substitution rate at each base. Positive scores (fewer substitutions than expected) indicate that a site is under evolutionary constraint. Negative scores may be weak evidence of accelerated rates of evolution.
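The "rejected substitutions" arithmetic can be sketched in a few lines of Python. This is a simplified illustration, not the GERP implementation: it assumes a star tree, so the neutral expectation is just the sum of per-species branch lengths for species with a base at the column, and it takes the observed substitution count as an input rather than estimating it. The species names and branch lengths are made up.

```python
def rejected_substitutions(branch_lengths: dict, column: dict,
                           observed_subs: float) -> float:
    """Expected neutral substitutions minus observed substitutions at one column.

    branch_lengths: neutral substitutions per site for each species (assumed star tree).
    column: species -> aligned character; only A/C/G/T count as present.
    observed_subs: substitutions inferred at this column (supplied by the caller).
    """
    present = [sp for sp, base in column.items() if base.upper() in "ACGT"]
    expected = sum(branch_lengths[sp] for sp in present)
    return expected - observed_subs


# Hypothetical neutral branch lengths (substitutions per site).
neutral = {"human": 0.1, "mouse": 0.4, "dog": 0.3}

# Dog is gapped here, so only human + mouse contribute to the expectation.
rs_constrained = rejected_substitutions(
    neutral, {"human": "A", "mouse": "A", "dog": "-"}, observed_subs=0.0)
rs_fast = rejected_substitutions(
    neutral, {"human": "A", "mouse": "G", "dog": "C"}, observed_subs=2.0)
print(rs_constrained, rs_fast)  # positive = constraint, negative = faster than neutral
```

Because the expectation shrinks when species are missing, a column with few aligned species cannot reach a high score however invariant it is.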

PhyloP Assigns Conservation P-values. Estimates the mean number of substitutions in each aligned genome to estimate the neutral evolution rate; the rate is estimated from non-coding data (conservative). Compares the probability of the observed substitutions under the hypothesis of a neutral evolutionary rate. Scores reflect either conservation (positive scores) or accelerated evolution, which may reflect selection (negative scores). The magnitude of the score is defined as -log10(P), where P is the p-value for a test of the number of substitutions following the (uniform) neutral rate inferred from all sites in the alignment; the sign gives the direction of the departure. NB: PhyloP may also refer to a suite of tools.
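The signed -log10(P) convention can be illustrated with a small Python sketch. Note the substitution being made here: a Poisson tail test on a substitution count stands in for PhyloP's phylogenetic tests, purely to show how a p-value becomes a signed score. The function names are assumptions for illustration.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam); returns 0.0 for k < 0."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def signed_log_p_score(observed: int, expected: float) -> float:
    """Signed -log10(P): positive for fewer substitutions than expected
    (conservation), negative for more than expected (acceleration).

    Poisson stand-in for the real test, for illustration only.
    """
    p_low = poisson_cdf(observed, expected)                 # P(X <= observed)
    p_high = 1.0 - poisson_cdf(observed - 1, expected)      # P(X >= observed)
    p_high = max(p_high, 1e-300)                            # guard against rounding to 0
    if p_low <= p_high:
        return -math.log10(p_low)
    return math.log10(p_high)


print(signed_log_p_score(0, 3.0))   # no substitutions where 3 were expected: positive
print(signed_log_p_score(10, 3.0))  # 10 substitutions where 3 were expected: negative
```

A score of 2 in either direction therefore corresponds to P = 0.01 under the neutral model, which makes thresholds easy to interpret.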

PhastCons Fits a Hidden Markov Model. PhastCons fits an HMM with states ‘conserved’ and ‘not conserved’. Neutral substitution rates are estimated from data as for PhyloP. Tunable parameter μ represents the inverse of the expected length of ‘conserved’ regions. Parameter ν (together with μ) sets the proportion of conserved regions. Interpret the ψ’s as phylogenetic branch lengths (substitution probabilities). A symmetric matrix of neutral substitutions is estimated from data. Siepel A et al. Genome Res. 2005;15:1034-1050
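A two-state HMM of this kind can be sketched in Python with a scaled forward-backward pass. This is a toy under stated assumptions, not PhastCons itself: the per-column likelihoods under each state are supplied by the caller instead of being computed from a phylogenetic model, and `mu` and `nu` are the code's names for the slide's two tunable parameters (the conserved-to-non-conserved and non-conserved-to-conserved transition probabilities). Under this parameterisation the expected conserved-element length is 1/mu and the stationary conserved fraction is nu/(mu + nu).

```python
def conserved_posteriors(lik_cons, lik_neut, mu, nu):
    """Posterior probability of the 'conserved' state at each column.

    lik_cons / lik_neut: per-column likelihoods under each state (toy inputs).
    mu: P(conserved -> non-conserved); nu: P(non-conserved -> conserved).
    """
    n = len(lik_cons)
    if n == 0:
        return []
    gamma = nu / (mu + nu)                      # stationary conserved fraction
    trans = [[1 - mu, mu], [nu, 1 - nu]]        # state 0 = conserved, 1 = not
    emit = list(zip(lik_cons, lik_neut))

    # Forward pass, rescaled at each step to avoid underflow.
    prev = [gamma * emit[0][0], (1 - gamma) * emit[0][1]]
    total = sum(prev)
    prev = [p / total for p in prev]
    fwd = [prev]
    for t in range(1, n):
        cur = [emit[t][j] * sum(prev[i] * trans[i][j] for i in range(2))
               for j in range(2)]
        total = sum(cur)
        prev = [c / total for c in cur]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = [sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(2))
               for i in range(2)]
        total = sum(nxt)
        bwd[t] = [x / total for x in nxt]

    # Per-position normalisation cancels the scaling constants.
    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1]) for f, b in zip(fwd, bwd)]


# Toy input: the middle five columns look much more likely under 'conserved'.
lik_c = [0.1] * 5 + [0.9] * 5 + [0.1] * 5
lik_n = [0.5] * 15
post = conserved_posteriors(lik_c, lik_n, mu=0.1, nu=0.05)
print([round(p, 2) for p in post])
```

Raising `mu` shortens the expected conserved run and lets the posterior change more abruptly; raising `nu` increases the prior weight on the conserved state everywhere.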

PhastCons Fits a Hidden Markov Model. The scaling parameter ρ (0 ≤ ρ ≤ 1) represents the average rate of substitution in conserved regions relative to the average rate in non-conserved regions, and is estimated from data. Originally developed to detect moderate-sized sequences such as non-coding RNA. Can be adapted to shorter sequences, but is not as powerful there.
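The effect of the scaling parameter can be seen by computing the likelihood of one alignment column with and without scaled branch lengths. The Python sketch below makes two simplifying assumptions that are not from the lecture: a star phylogeny, and the Jukes-Cantor substitution model as a stand-in for whatever model is actually fitted. The branch lengths and the value 0.3 for ρ are arbitrary.

```python
import math


def jc_same(t: float) -> float:
    """Jukes-Cantor probability that a base is unchanged after branch length t."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def jc_diff(t: float) -> float:
    """Jukes-Cantor probability of ending at one specific different base."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)


def column_likelihood(column: str, branch_lengths, scale: float = 1.0) -> float:
    """Likelihood of an ungapped column on a star tree, summing over the root base.

    scale = 1.0 gives the neutral model; scale = rho (< 1) shortens every branch,
    which is how a slower 'conserved' state is represented in this sketch.
    """
    total = 0.0
    for root in "ACGT":
        p = 0.25  # uniform prior on the root base
        for base, t in zip(column.upper(), branch_lengths):
            p *= jc_same(scale * t) if base == root else jc_diff(scale * t)
        total += p
    return total


branches = [0.2, 0.3, 0.4, 0.5]   # hypothetical neutral branch lengths
rho = 0.3                         # hypothetical scaling factor

for col in ("AAAA", "ACGT"):
    neutral = column_likelihood(col, branches)
    conserved = column_likelihood(col, branches, scale=rho)
    print(col, "conserved/neutral likelihood ratio:", conserved / neutral)
```

An invariant column is more likely under the scaled (conserved) tree, while a highly variable column is more likely under the neutral tree; likelihood pairs of this kind are what a two-state HMM consumes as emissions.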

SiPhy. SiPhy models the pattern of substitutions, rather than just the rate as most other methods do. Biased substitutions (e.g. a conserved lysine: AAA <-> AAG only) will be identified as constrained. Some TFBS have similar degeneracy in evolution. This is a more refined approach than rate models, but requires a fairly deep (or wide) phylogeny. SiPhy uses a Bayesian approach and, like PhastCons, needs two parameters: the fraction of sequence conserved, and the typical length of a conserved region. Might want to spell out some details of algorithm where it differs from PhastCons … i.e. in alternatives to neutral
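The idea of scoring the pattern rather than the rate can be illustrated with a deliberately crude Python sketch. It is not the SiPhy algorithm: it ignores the phylogeny entirely, treats species as independent draws, and simply compares the observed base composition of a column with a background composition (a log-likelihood ratio using the column's own maximum-likelihood frequencies). The function name and the uniform background are assumptions.

```python
import math
from collections import Counter


def pattern_log_odds(column: str, background=None) -> float:
    """Log-likelihood ratio of a column's base composition versus background.

    Toy measure: sum over bases of count * ln(observed_freq / background_freq).
    Gaps and Ns are ignored. Higher values mean a more restricted base pattern.
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    n = len(bases)
    counts = Counter(bases)
    return sum(c * math.log((c / n) / background[b]) for b, c in counts.items())


# A column that toggles only between A and G (as at the third position of a
# lysine codon) changes often, yet its pattern is clearly restricted.
for col in ("AAAAAAAA", "AGAGAGAG", "ACGTACGT"):
    print(col, round(pattern_log_odds(col), 3))
```

A pure rate-based count would treat the A/G column as fast-evolving, whereas the composition score places it between the invariant and the unconstrained columns, which is the behaviour the slide describes for biased substitutions.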

SiPhy Applied to Mammalian Genomes. Figure: identification of four NRSF-binding sites in NPAS4. K Lindblad-Toh et al. Nature (2011)

Comparison of Methods. PhyloP, PhastCons, and GERP give fairly similar results over deep phylogenies (e.g. vertebrates), but differ substantially over bushes (e.g. primates). SiPhy is more sensitive over moderately deep phylogenies (e.g. mammals), but cannot be implemented for primates because of insufficient substitutions.

Issues With Conservation Scores. Most scores are misleading about gaps in alignments: they don’t distinguish between contig gaps (incomplete genomes) and inserted or deleted regions. This information is often available but inconvenient to use. Each model was devised with a particular kind of conservation in mind, and may not be adaptable to all kinds. Broken sequences (e.g. ZNF TFBS) are not captured well by any current method.
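One pragmatic response to the gap problem is to carry alignment coverage alongside the scores and mask positions where too few species contribute a base. The Python sketch below assumes a toy encoding that is not from the lecture: each column is a string with one character per species, '-' for an indel gap and 'N' for missing sequence, and `min_species` is an arbitrary threshold.

```python
def gap_breakdown(column: str) -> dict:
    """Count real bases, indel gaps ('-') and missing data ('N') in one column."""
    col = column.upper()
    return {
        "bases": sum(1 for c in col if c in "ACGT"),
        "indel_gaps": col.count("-"),
        "missing": col.count("N"),
    }


def mask_low_coverage(scores, columns, min_species=3):
    """Replace a score with None when fewer than min_species have a real base."""
    if len(scores) != len(columns):
        raise ValueError("scores and columns must have the same length")
    masked = []
    for score, column in zip(scores, columns):
        informative = gap_breakdown(column)["bases"]
        masked.append(score if informative >= min_species else None)
    return masked


# Hypothetical scores and columns for three positions in a four-species alignment.
scores = [1.2, 0.4, 2.0]
columns = ["AAAA", "A-NN", "AAA-"]
print(mask_low_coverage(scores, columns))
print(gap_breakdown("A-NN"))
```

Keeping the indel and missing-data counts separate preserves exactly the distinction the slide says most scores discard.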