Evolution (1st lecture). Finding Elements in DNA Conserved by Evolution: Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes.

Evolution (1st lecture)

Finding Elements in DNA Conserved by Evolution
- Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes (Gregory Cooper et al.)
- Identification and Characterization of Multi-Species Conserved Sequences (Elliott Margulies et al.)
Presented by Penka Markova

Finding Elements in DNA Conserved by Evolution
- Premise: highly conserved sequences are more likely to reflect regions under active selection, due to the presence of one or more elements that confer biological function
- Involves comparative analysis and requires multiple alignments

Outline
- Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes
  - Overview
  - Data
  - Global Patterns of Nucleotide Substitution
  - Rates of Transitions and Transversions in the Rodents
  - Rates of Neutral Point Substitution
  - Rates of Microinsertion and Microdeletion
  - Global Identification of Constrained Elements
  - Regional Variability of Evolutionary Parameters
- Identification and Characterization of Multi-Species Conserved Sequences
  - Overview
  - Data
  - Binomial, Parsimony and Intersecting Methods
  - Stats
  - Characteristics of the detected MCSs, conclusions

1st Paper: Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes
Gregory Cooper, Michael Brudno, Eric Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow

Overview
Goal: comparative analysis of the rat, mouse, and human genomes, in order to
- gain insight into basic mechanisms of nucleotide evolution
- facilitate the discovery of elements in the genome that play a functional role in human biology (by leveraging the fact that functional DNA is constrained by purifying selection)
Summary:
- Analyzes the rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor
- Evidence for a shift in the mutational spectrum between the mouse and rat lineages (increase of CG content in the rat genome)
- Support for the idea that rates of evolution are influenced by local genomic or cell-biological context
- No correlation between rates of point substitution and rates of microindels (the influences that affect these processes are distinct)
- Identifies the regions in the human genome that are evolving slowly (likely to include functional elements important to human biology)

Data
- 3 complete mammalian genome sequences: human, rat, mouse (new: the rat genome)
- Multi-aligned with MLAGAN
- 2 datasets
  1. All sites that are confidently aligned among all 3 sequences (most included positions originated prior to the last common ancestor)
  2. "Rodent-specific neutral sites": only sites present in the rodents (heavily enriched for neutrally evolving sites)

Global Patterns of Nucleotide Substitution
- Global shift in the mutation spectra between mouse and rat
- Rat has 0.35 percentage points more CG than mouse (41.61% vs 41.26%), a statistically highly significant difference
- CpG dinucleotides: 0.92% in the mouse, 1.06% in the rat (the rest show smaller differences)
- The consistent bias toward elevated CG in the rat genome does not appear to be confined to particular types of transitions or transversions (quantitative analysis of Dataset 1: 117 million positions with a single difference in either rodent)
- The causative factors for the shift, selective or otherwise, remain to be elucidated
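The two composition statistics quoted on this slide (overall CG content and CpG dinucleotide frequency) can be computed along the following lines. This is a minimal sketch, not the authors' pipeline: the function names and the toy sequence are made up for illustration, and ambiguous bases are simply ignored when computing CG content.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def cpg_fraction(seq):
    """Fraction of adjacent base pairs (dinucleotides) that read CG."""
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs <= 0:
        return 0.0
    hits = sum(1 for i in range(pairs) if seq[i:i + 2] == "CG")
    return hits / pairs


# Hypothetical toy sequence, not taken from either genome.
toy = "ACGTCGGGAT"
print(gc_fraction(toy), cpg_fraction(toy))
```

On genome-scale input the same counts would be accumulated chromosome by chromosome rather than on one in-memory string.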

Rates of Transitions and Transversions in the Rodents
- Transitions are approximately fourfold more likely than any transversion
- Useful for molecular evolutionary studies: most methods of phylogenetic inference model point substitutions as stationary Markov processes and require user-specified substitution parameters
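To make the transition/transversion distinction concrete, the sketch below classifies each mismatch in a pairwise alignment. A transition stays within the purines (A, G) or within the pyrimidines (C, T); any other mismatch is a transversion. The function names and the two toy sequences are illustrative assumptions, not data from the paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a, b):
    """Return 'transition', 'transversion', or None for one aligned base pair."""
    a, b = a.upper(), b.upper()
    if a == b or a not in BASES or b not in BASES:
        return None  # identical bases, gaps, and ambiguity codes are skipped
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind:
            counts[kind] += 1
    return counts


# Hypothetical aligned fragments.
print(ts_tv_counts("ACGTACGTAC", "GCGTATGTCC"))
```

Each base has one possible transition and two possible transversions, so a per-type rate comparison such as the one on the slide is not the same number as the raw ratio of the two counts.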

Rates of Neutral Point Substitution
- Point substitution events in rodent-specific neutral sites (Dataset 2)
- Neutral rate for the evolutionary tree relating the 3 species
- Relative branch lengths of the tree: based on Dataset 1 positions without a gap in any sequence
- Normalized so that the rat branch has unit length
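One simple way to obtain relative branch lengths for a three-species tree is from pairwise distances, since an unrooted three-leaf tree has exactly one branch per species. The sketch below uses that textbook distance identity and then rescales so the rat branch equals 1, as on the slide. It is an assumption that this resembles the paper's estimation procedure; the distances in the example are made-up numbers.

```python
def three_taxon_branches(d_hm, d_hr, d_mr):
    """Branch lengths of the unrooted human/mouse/rat tree from pairwise distances.

    d_hm, d_hr, d_mr are the human-mouse, human-rat, and mouse-rat distances.
    """
    return {
        "human": (d_hm + d_hr - d_mr) / 2,
        "mouse": (d_hm + d_mr - d_hr) / 2,
        "rat": (d_hr + d_mr - d_hm) / 2,
    }


def normalize_to(branches, reference="rat"):
    """Rescale all branches so that the reference branch has length 1."""
    ref = branches[reference]
    return {name: length / ref for name, length in branches.items()}


# Hypothetical distances chosen only to give round numbers.
branches = three_taxon_branches(d_hm=4, d_hr=5, d_mr=3)
print(normalize_to(branches))
```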

Rates of Microinsertion and Microdeletion
- Definition: lesions no larger than 10 bp
- Dataset 1
- Gaps of size 11 bp or less
- Rapid decline in the relative number of indel events as size increases
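The size distribution of microindels can be tabulated by counting runs of gap characters in one row of an alignment. The following is a rough sketch with hypothetical names; the size cutoff is left as a parameter, and the toy alignment row is invented for illustration.

```python
import re
from collections import Counter


def gap_length_counts(aligned_seq, max_len=10):
    """Count runs of '-' in one alignment row, keeping runs of length <= max_len."""
    lengths = (len(m.group()) for m in re.finditer(r"-+", aligned_seq))
    return Counter(n for n in lengths if n <= max_len)


# Hypothetical alignment row with gap runs of length 4, 1, and 2.
row = "AC----CAGT-CCA--CCA"
print(sorted(gap_length_counts(row).items()))
```

Whether a given gap reflects an insertion in one lineage or a deletion in the other cannot be decided from two sequences alone; that assignment needs a third (outgroup) sequence.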

Global Identification of Constrained Elements
- Annotated all regions in the human genome that are evolving, on average, significantly more slowly than the neutral rate
- Sequences that function in organismal biology tend to be under purifying selection and thus manifest themselves as slowly evolving regions
- 210,923 constrained elements (>51 bp)

Global Identification of Constrained Elements

Regional Variability of Evolutionary Parameters
- Substantially stable microevolutionary pressures (modest-to-strong correlations between rates of microdeletion [A, B])
- Local evolutionary pressures appear to influence point substitutions and microindels differently (variation in the rate of microinsertion/microdeletion does not correlate well with point substitution)
- Local genomic context influences the rate of point substitution regardless of the type of site (correlation between the neutral rate and the rate of substitution [B])
- CG content correlates with rates of point substitution
- Sliding-window analysis along rat chromosome 1, window width 2 Mb
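A sliding-window comparison of two event rates can be sketched as follows: compute each rate per window, then correlate the two series. The per-position 0/1 event tracks, the tiny window size, and the function names are all illustrative assumptions; the analysis on the slide used 2-Mb windows along a chromosome.

```python
from math import sqrt


def window_rates(events, window, step):
    """Event rate in each window of a per-position 0/1 track."""
    return [sum(events[i:i + window]) / window
            for i in range(0, len(events) - window + 1, step)]


def pearson(x, y):
    """Pearson correlation of two equal-length series (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


# Hypothetical tracks: 1 marks a substitution (or indel) at that position.
substitutions = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
indels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
sub_rates = window_rates(substitutions, window=4, step=2)
indel_rates = window_rates(indels, window=4, step=2)
print(pearson(sub_rates, indel_rates))
```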

2nd Paper: Identification and Characterization of Multi-Species Conserved Sequences
Elliott Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, Eric Green

Overview
Goals
- Identify highly conserved DNA regions, in particular "Multi-species Conserved Sequences" (MCSs), in a robust fashion
  - useful in comparative sequence analysis aiming to elucidate genome function
- Evaluate the relative contribution of different species' sequences to identifying genomic regions of interest
  - one of the criteria considered in choosing additional species for whole-genome sequencing
Summary of results
- Proposes 2 strategies for MCS identification (binomial, parsimony)
  - they detect virtually all known actively conserved sequences (coding sequence), but very little neutrally evolving sequence (ancestral repeats)
- Analysis of the features of detected MCSs
- Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome

Data
- Sequences of human and 11 non-human vertebrates
  - 2 primates (chimpanzee, baboon), 2 carnivores (cat, dog), 2 artiodactyls (cow, pig), 2 rodents (mouse, rat), 1 bird (chicken), 2 fish (fugu, tetraodon)
  - Orthologous to a 1.8-Mb region on human chromosome 7q31
- Multi-aligned
  - Human-referenced pairwise alignment
  - RepeatMasker, blastz
- Systematically annotated for known coding exons, UTRs, and ARs (ancestral repeats)

Algorithms: Binomial, Parsimony, Intersecting
The methods take into account
- Phylogenetic diversity of the aligned species' sequences
- The varying neutral substitution rate
- The characteristics of the available genomic multi-sequence alignment, especially sparse alignments
Requirements
- Sufficiently large branch length of the phylogenetic tree (non-functional regions should be sufficiently diverged)
- Greater total branch length for smaller functional elements (compared to the length required for identification of larger functional elements)
- A good multi-alignment is crucial

Algorithms: Binomial
Binomial-Based Method for MCS Detection
- Calculates the conservation score based on the probability of detecting the observed amount of conservation between the human and each other species' sequence, assuming the neutral substitution rate
- The neutral substitution rate is calculated from fourfold degenerate positions (the third base of codons for which any base will encode the same amino acid)
- Normalizes for phylogenetic biases by averaging
- The final conservation score is calculated from overlapping 25-base windows
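The neutral estimate from fourfold degenerate positions can be illustrated as follows. In the standard genetic code, the third base is free to vary in the eight codon families that begin with CT, GT, TC, CC, AC, GC, CG, or GG. The sketch assumes two in-frame, gap-aligned coding sequences and only counts codons whose first two bases agree in both species; that filtering rule, the function name, and the example sequences are assumptions for illustration, not the paper's exact procedure.

```python
# First two bases of the eight fourfold degenerate codon families
# (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly) in the standard genetic code.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def neutral_match_probability(human_cds, other_cds):
    """Fraction of fourfold degenerate third positions that are identical.

    Both inputs must be aligned and in frame. Returns None if no usable site exists.
    """
    sites = matches = 0
    usable = min(len(human_cds), len(other_cds))
    for i in range(0, usable - 2, 3):
        h = human_cds[i:i + 3].upper()
        o = other_cds[i:i + 3].upper()
        if (h[:2] == o[:2] and h[:2] in FOURFOLD_PREFIXES
                and h[2] in "ACGT" and o[2] in "ACGT"):
            sites += 1
            matches += h[2] == o[2]
    return matches / sites if sites else None


# Hypothetical aligned coding fragments: Ala, Leu, Met, Gly codons.
print(neutral_match_probability("GCTCTGATGGGA", "GCCCTGATGGGT"))
```

The returned fraction plays the role of a per-species probability that a neutrally evolving human base is still matched in the other species.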

Algorithms: Binomial
Notation
- N: number of aligned bases in the 25-base window of the human-species j alignment
- K: number of perfect matches
- p_j: neutral substitution probability, i.e. the probability that a given base in the human sequence has been conserved in species j, assuming the neutral substitution rate between human and species j
- K/N: baseline conservation level
- C(j): cumulative binomial probability of observing at least K matches in N bases
Algorithm
1) Within all windows of 25 bases, for each species j:
   CGGCTAAG…ACTGACTGGGT
   CGACTGAG…ACTGACTGGGT
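From the definitions on this slide, C(j) is the upper tail of a Binomial(N, p_j) distribution: the sum over k from K to N of C(N, k) · p_j^k · (1 − p_j)^(N − k). A minimal sketch of that tail probability is shown below. How the per-species score s_j is then derived from C(j) is not spelled out in the transcript, so the code stops at the probability itself; the values of K and p_j in the example are made up.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k matches in n bases."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical window: 23 of 25 aligned bases match, with an assumed
# neutral match probability of 0.6 for this species.
print(binomial_tail(23, 25, 0.6))
```

A small tail probability means the window is more conserved than neutral evolution would readily explain.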

Algorithms: Binomial
Algorithm (continued)
2) "Phylogenetically average" the individual species' scores s_j to obtain the final conservation score for the window
3) The final score assigned to position i is [formula not preserved in the transcript]
4) For a given threshold t, position i is predicted to be part of an MCS if [condition not preserved in the transcript]

Algorithms: Binomial
Binomial-Based Method: Conclusion
- Conservation scores below zero represent alignable regions that are less conserved than expected; scores above zero represent regions that are more conserved than expected
- Minimum MCS length is 25 bases
- Sequence conservation detected with more diverged species (with higher neutral substitution rates) is weighted more heavily
- Measures conservation with respect to one reference sequence only

Algorithms: Parsimony
Parsimony-Based Method
- The amount of conservation within each column of the alignment is measured using a phylogenetic parsimony score P(i)
- P(i) reflects the minimal number of substitutions needed along the branches of an established phylogenetic tree to account for the observed bases at the leaves of the tree
- Based on P(i), the method calculates a score under a continuous-time Markov model of neutral evolution, measuring the "surprise" of observing a parsimony score of P(i) or smaller
- Requires a phylogenetic tree and a model of neutral substitution
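For a binary tree with unit substitution costs, the minimal number of substitutions for one alignment column can be computed with Fitch's algorithm. The slides do not name the algorithm used, so treating P(i) as a Fitch score is an assumption here. The tree topology (nested 2-tuples of leaf names) and the example columns are likewise hypothetical.

```python
def fitch(tree, bases):
    """Return (candidate state set, substitution count) for a subtree.

    tree is a leaf name or a 2-tuple (left, right); bases maps leaf name -> base.
    """
    if isinstance(tree, str):
        return {bases[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, bases)
    right_states, right_cost = fitch(right, bases)
    common = left_states & right_states
    if common:
        # Children can agree on a state: no extra substitution needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is charged at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, bases):
    """Minimum number of substitutions explaining the bases at the leaves."""
    return fitch(tree, bases)[1]


# Hypothetical topology and alignment column.
tree = ((("human", "chimp"), ("mouse", "rat")), "chicken")
column = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G", "chicken": "A"}
print(parsimony_score(tree, column))
```

A fully conserved column scores 0, and the score grows as more independent changes are needed to explain the column.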

Algorithms: Parsimony
Algorithm
1) Calculate the parsimony score P(i) for the i-th position
   - P(i) = the minimum number of substitutions, performed along the branches of the tree, needed to explain the bases observed at the leaves of the tree
   - Note that P(i) is a tight lower bound on the number of substitutions that actually occurred at position i during evolution
2.0) Define a model of neutral evolution
   - Based on the phylogenetic tree T relating the species under study and a neutral substitution rate matrix Q
   - ℓ(e) denotes the length of branch e, and r the root of the tree
   - Transition probability matrix along a branch (u,v): $M_{(u,v)} = e^{\ell(u,v)\,Q}$
   - Background base distribution π
   - This model generates a set of random but related bases at the leaves of the tree by simulating evolution.
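For a general rate matrix Q the branch transition matrix requires a matrix exponential. For illustration only, the sketch below assumes the Jukes-Cantor (JC69) model, where the exponential has a closed form and the background distribution π is uniform. The choice of JC69 and the function name are assumptions; the slides do not say which Q was used. Branch length t is measured in expected substitutions per site.

```python
from math import exp

BASES = "ACGT"


def jc69_transition_matrix(t):
    """Transition probabilities along a branch of length t under JC69.

    Returns a nested dict m[a][b] = P(base b at the child | base a at the parent).
    """
    decay = exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability the base is unchanged
    diff = 0.25 - 0.25 * decay   # probability of each specific other base
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}


m = jc69_transition_matrix(0.3)
print(m["A"]["A"], m["A"]["G"])
```

At t = 0 the matrix is the identity, and for very long branches every entry approaches 0.25, the uniform background frequency.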

Algorithms: Parsimony
2) Define the score assigned to position i based on the 25-base window [formula not preserved in the transcript]
   - Z(r) is the random variable describing the parsimony score of the bases of the subtree rooted at r
   - Pr[Z(r) ≤ P(j)] is the probability that the parsimony score of the bases at the leaves of T generated by the model defined above is at most P(j)
   - It is calculated using a dynamic programming algorithm proceeding from the leaves of T to its root
   - If this probability is small, the position is unlikely to have been generated under neutral evolution
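The probability Pr[Z(r) ≤ P] is computed exactly by dynamic programming in the method described above. As a simpler stand-in, the sketch below estimates the same quantity for a single column by Monte Carlo simulation: it generates bases down the tree under a neutral model and counts how often the simulated parsimony score is at most the observed one. The Jukes-Cantor model, the Fitch scoring, the tree encoding (4-tuples of left child, left branch length, right child, right branch length), and all branch lengths are assumptions made for illustration.

```python
import random
from math import exp

BASES = "ACGT"


def evolve(base, t, rng):
    """Sample the base at the end of a branch of length t under JC69."""
    p_same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return base
    return rng.choice([b for b in BASES if b != base])


def simulate(node, base, rng, leaves):
    """Fill `leaves` with bases generated below `node`, which carries `base`."""
    if isinstance(node, str):
        leaves[node] = base
        return
    left, t_left, right, t_right = node
    simulate(left, evolve(base, t_left, rng), rng, leaves)
    simulate(right, evolve(base, t_right, rng), rng, leaves)


def fitch(node, leaves):
    """Return (candidate state set, parsimony score) for the subtree at node."""
    if isinstance(node, str):
        return {leaves[node]}, 0
    left, _, right, _ = node
    ls, lc = fitch(left, leaves)
    rs, rc = fitch(right, leaves)
    common = ls & rs
    return (common, lc + rc) if common else (ls | rs, lc + rc + 1)


def prob_score_at_most(tree, observed, trials=2000, seed=0):
    """Monte Carlo estimate of Pr[parsimony score <= observed] under neutrality."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        leaves = {}
        simulate(tree, rng.choice(BASES), rng, leaves)  # root drawn from uniform pi
        hits += fitch(tree, leaves)[1] <= observed
    return hits / trials


# Hypothetical four-species tree with made-up branch lengths.
tree = (("human", 0.1, "chimp", 0.1), 0.2, ("mouse", 0.3, "rat", 0.3), 0.2)
print(prob_score_at_most(tree, observed=0))
```

Longer branches make a fully conserved column less likely under neutrality, which is why conservation across distant species is more informative.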

Algorithms: Parsimony
3) The final score assigned to position i is [formula not preserved in the transcript]
4) For a given threshold t, position i is predicted to be part of an MCS if [condition not preserved in the transcript]
Parsimony-Based Method: Conclusion
- Requires a phylogenetic tree and a model of neutral substitution
- Produces higher scores for conservation across large phylogenetic distances

Algorithms: Binomial, Parsimony, Intersecting
Intersecting Method
- Intersects the results from the Binomial and Parsimony methods
- MCSs can be shorter than 25 bp
Observations
- All three methods are biased towards the identification of sequences that are conserved in most species (as opposed to only a subset of species)
- The conservation score threshold was selected such that 5% of the human sequence from the analyzed region falls within an MCS (5% of the human genome is considered to be under active selection)
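The threshold choice and the intersecting method can be sketched as below: pick the score cutoff that keeps a target fraction of positions, call positions for each method, and intersect the two sets. The sketch assumes that higher scores mean more conservation and that a position is called when its score is at or above the cutoff; the exact calling rule is not preserved in the transcript. The score lists are made up, and a 10% fraction is used only because the toy input has 20 positions.

```python
def threshold_for_fraction(scores, fraction=0.05):
    """Score cutoff t such that roughly `fraction` of positions score >= t."""
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[keep - 1]


def call_positions(scores, t):
    """Indices of positions whose score is at or above the cutoff t."""
    return {i for i, s in enumerate(scores) if s >= t}


# Hypothetical per-position scores from the two methods (20 positions).
binomial_scores = [0, 1, 9, 8, 1, 0, 2, 1, 0, 3, 1, 0, 2, 1, 0, 1, 2, 0, 1, 0]
parsimony_scores = [1, 0, 7, 2, 9, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

binomial_calls = call_positions(
    binomial_scores, threshold_for_fraction(binomial_scores, 0.10))
parsimony_calls = call_positions(
    parsimony_scores, threshold_for_fraction(parsimony_scores, 0.10))

# Intersecting method: keep only positions called by both methods.
print(sorted(binomial_calls & parsimony_calls))
```

Because the intersection can trim the ends of a called region, the resulting elements may be shorter than the 25-base window used by the individual methods.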

Concordance of the binomial- and parsimony-based methods for MCS detection

Results: discrimination of different types of sequence using conservation scores

Results
General features of detected MCSs
- Detected virtually all known actively conserved sequences (coding sequence), but very little neutrally evolving sequence (ancestral repeats)
- The majority of sequences conserved across multiple vertebrate species have no known function (70% of MCSs reside in non-coding regions)
- Uniqueness of the MCSs in the human genome
Correlating MCSs with Functional Elements
- MCSs correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements

Results: characteristics of the detected MCSs

Positions of MCSs relative to other annotated genomic features (representative region)

Results
Contribution of different species' sequences to the detection of MCSs
- Rodent sequences detect the greatest number of MCS bases and the largest amount of non-coding sequence
- Chicken sequence has considerably higher specificity, with the largest amount of coding MCS bases
- MCSs detected with fish sequences almost exclusively contain coding sequence
- Non-human primate sequences are not useful with the applied methods
- None of the individual species' sequences alone came close to identifying all the reference MCS bases
- Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome

Ability of individual & combinations of species’ sequences to detect MCSs

The end

A(u) is the random variable representing the base generated by this random process at node u.