BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG


1 Conservation Scores
BNFO 602/691 Biological Sequence Analysis
Mark Reimers, VIPBG

2 Conservation and Function: what kinds of DNA regions get conserved?
Core coding regions are usually conserved across hundreds of millions of years (Myr).
Active sites of enzymes and crucial structural elements of proteins are highly conserved.
Untranslated regions of genes are conserved over tens but not over hundreds of Myr.
Some regulatory regions evolve ‘quickly’, over a time scale of tens of Myr.

3 Conservation and Function: what kinds of DNA regions get conserved?
Many splice sites and splice regulators are conserved between mouse and human.
Most promoters (70%) are conserved between mouse and human.
The majority (~70%) of enhancers are not conserved, but a significant minority are highly conserved.

4 Approaches to Scoring Conservation
Base-wise: PhyloP, GERP
Small regions: PhastCons
Small regions, tracking bias: SiPhy
Regulatory conservation within exons may be detected by any of these methods.
Key regulatory regions are harder to see.

5 DEMO: UCSC Alignment & Conservation Tracks
GAPDH overall

6 Genomic Alignment
Alignment is crucial (and not trivial).
Common alignment algorithms may misplace ambiguous bases, leading to artifactual gaps.
Inversions are often badly handled.
Issue: incomplete alignments are not reflected in the scores of any current algorithm.
Conservation scores are computed on aligned genomes only.
Alignments of 46 placental mammals to the human genome are available in MultiZ format at UCSC.
A subset of primate alignments is also available.
Common problems with alignments

7 Alignment Issues
When studying protein-coding regions, substitutions are the most common change.
Most genome evolution happens through insertions or deletions.
The alignable portion of the human and chimp genomes is 97% identical.
Only 91% of the genome is alignable.
Regions may acquire regulatory function in some lineages but have no function in most.
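The two figures on this slide measure different things: identity is computed only over aligned columns, while the alignable fraction counts how much of the reference has an aligned partner at all. A minimal sketch of that distinction, assuming a toy pairwise alignment given as two equal-length strings with '-' for gaps (the function name and the example sequences are hypothetical, not taken from the UCSC data):

```python
def alignment_summary(ref: str, other: str) -> tuple[float, float]:
    """Return (alignable_fraction, identity_within_aligned) for two aligned rows.

    Both rows must have equal length; '-' marks a gap. The alignable fraction
    is measured against the number of reference bases.
    """
    if len(ref) != len(other):
        raise ValueError("aligned rows must have equal length")
    ref_bases = aligned = matches = 0
    for r, o in zip(ref.upper(), other.upper()):
        if r == "-":
            continue  # insertion in the other species: not a reference base
        ref_bases += 1
        if o == "-":
            continue  # reference base with no aligned partner
        aligned += 1
        matches += r == o
    if ref_bases == 0 or aligned == 0:
        raise ValueError("no aligned reference bases")
    return aligned / ref_bases, matches / aligned


# Hypothetical toy alignment: 10 reference bases, 8 aligned, 7 identical.
alignable, identity = alignment_summary("ACGTACGTAC", "ACGTTC--AC")
print(f"alignable: {alignable:.1%}, identity within aligned: {identity:.1%}")
```

Reporting only the identity figure hides the bases that could not be aligned, which is why both numbers appear on the slide.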

8 UCSC Alignment Symbols
Single line ‘-’: no bases in the aligned species. This may reflect an insertion in the human genome or a deletion in the aligning species.
Double line ‘=’: the aligning species has unalignable bases in the gap region. This reflects many mutations or independent indels between the aligned blocks in both species.
Pale yellow coloring: the aligning species has Ns in the gap region, indicating sequencing problems in the aligning species.

9 Conservation Across Mammals Differs from Conservation Across Primates
Many regions conserved across mammals are also conserved across primates; a few appear not to be.
Some regions appear to be conserved (insofar as can be measured) in primates but not across all mammals.
What is the diagonal? Are these regions conserved?

10 Genomic Evolutionary Rate Profiling (GERP) Measures Base Conservation
Estimates the mean number of substitutions in each aligned genome to obtain the neutral evolution rate.
The original score is “rejected substitutions”: the number of substitutions expected under ‘neutrality’ minus the number of substitutions observed at each aligned position.
Newer scores are based on a maximum-likelihood (ML) fit of the substitution rate at each base.
Positive scores (fewer substitutions than expected) indicate that a site is under evolutionary constraint.
Negative scores may be weak evidence of accelerated rates of evolution.
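The “rejected substitutions” definition is simple arithmetic, and a short sketch makes the sign convention concrete. The function names and all numbers below are hypothetical illustrations, not output of the GERP software, and the sketch does not cover how GERP estimates the expected and observed counts from a tree:

```python
def rejected_substitutions(expected: float, observed: float) -> float:
    """Substitutions expected under neutrality minus substitutions observed."""
    return expected - observed


def interpret(rs: float) -> str:
    """Read the sign of a rejected-substitutions score as the slide describes."""
    if rs > 0:
        return "fewer substitutions than expected (possible constraint)"
    if rs < 0:
        return "more substitutions than expected (possible acceleration)"
    return "as expected under neutrality"


# Hypothetical values: a neutral expectation of 4 substitutions per site
# across the tree, and three sites with different observed counts.
neutral_expectation = 4.0
observed_by_site = [0.0, 3.5, 6.0]
scores = [rejected_substitutions(neutral_expectation, obs) for obs in observed_by_site]
for obs, rs in zip(observed_by_site, scores):
    print(f"observed={obs:.1f}  RS={rs:+.1f}  {interpret(rs)}")
```

Note that the largest possible positive score is bounded by the neutral expectation itself, so a shallow alignment limits how strongly any site can score as constrained.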

11 PhyloP Assigns Conservation P-values
Estimates the mean number of substitutions in each aligned genome to obtain the neutral evolution rate; estimating it from non-coding data is conservative.
Computes the probability of the observed substitutions under the hypothesis of a neutral evolutionary rate.
Scores reflect either conservation (positive scores: slower than neutral) or selection for change (negative scores: faster than neutral).
The score magnitude is –log10(P), where P is the p-value for a test of the number of substitutions against the (uniform) neutral rate inferred from all sites in the alignment; the sign gives the direction of the departure.
NB: PhyloP may also refer to a suite of tools.
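To make the signed –log10(P) convention concrete, here is a minimal sketch that assumes the substitution count at a site is Poisson-distributed around the neutral expectation. That Poisson assumption and the function names are illustrative only; this is not the test that the phyloP program actually performs on a phylogeny.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def signed_log_p(observed: int, expected: float) -> float:
    """Toy conservation score: magnitude is -log10(p), sign is the direction.

    Positive: fewer substitutions than the neutral expectation (conservation).
    Negative: more substitutions than the neutral expectation (acceleration).
    """
    p_fewer = poisson_cdf(observed, expected)  # P(X <= observed)
    p_more = 1.0 - poisson_cdf(observed - 1, expected) if observed > 0 else 1.0  # P(X >= observed)
    if p_fewer <= p_more:
        return -math.log10(p_fewer)
    return math.log10(p_more)


# Hypothetical neutral expectation of 5 substitutions at a site.
for observed in (0, 5, 15):
    print(observed, round(signed_log_p(observed, 5.0), 2))
```

As with rejected substitutions, the best achievable positive score is capped by the neutral expectation: with zero observed substitutions the toy score equals expected / ln(10).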

12 PhastCons Fits a Hidden Markov Model
PhastCons fits an HMM with states ‘conserved’ and ‘not conserved’.
Neutral substitution rates are estimated from data, as for PhyloP.
Tunable parameter μ represents the inverse of the expected length of ‘conserved’ regions.
Parameter ν sets the proportion of conserved regions.
Interpret the ψ’s as phylogenetic branch lengths (substitution probabilities).
A symmetric matrix of neutral substitutions is estimated from data.
Siepel A et al. Genome Res. 2005;15:

13 PhastCons Fits a Hidden Markov Model
Scaling parameter ρ (0 ≤ ρ ≤ 1) represents the average rate of substitution in conserved regions relative to the average rate in non-conserved regions, and is estimated from data.
Originally developed to detect moderate-sized sequences such as non-coding RNA.
Can be adapted to shorter sequences, but is not as powerful there.
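The two-state idea can be sketched with a small forward-backward computation. This is a simplified illustration under stated assumptions, not the PhastCons implementation: each alignment column is reduced to a hypothetical substitution count with Poisson emissions (rate `neutral_rate` when not conserved, `rho * neutral_rate` when conserved), whereas PhastCons scores whole alignment columns under phylogenetic models. Here `mu` is the probability of leaving the conserved state and `nu` the probability of entering it, so conserved runs have expected length 1/mu and stationary coverage nu/(mu + nu).

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)


def conserved_posteriors(counts, neutral_rate, rho, mu, nu):
    """Posterior probability of the 'conserved' state at each column.

    States: 0 = conserved, 1 = not conserved.
    """
    trans = [[1 - mu, mu], [nu, 1 - nu]]
    start = [nu / (mu + nu), mu / (mu + nu)]  # stationary distribution
    emit = [[poisson_pmf(k, rho * neutral_rate), poisson_pmf(k, neutral_rate)]
            for k in counts]
    n = len(counts)

    # Forward pass, rescaled at every column to avoid underflow.
    fwd, prev = [], None
    for t in range(n):
        if t == 0:
            row = [start[s] * emit[0][s] for s in range(2)]
        else:
            row = [emit[t][s] * sum(prev[r] * trans[r][s] for r in range(2))
                   for s in range(2)]
        total = sum(row)
        prev = [x / total for x in row]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in range(2))
               for s in range(2)]
        total = sum(row)
        bwd[t] = [x / total for x in row]

    # Combine and normalise per column.
    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1]) for f, b in zip(fwd, bwd)]


# Hypothetical substitution counts per column: a quiet stretch in the middle.
counts = [3, 4, 2, 0, 0, 0, 0, 0, 3, 5]
post = conserved_posteriors(counts, neutral_rate=3.0, rho=0.3, mu=0.1, nu=0.05)
print([round(p, 2) for p in post])
```

Because the posterior at each column borrows strength from its neighbours, a single column with no substitutions scores much lower than the same column inside a run, which is why the approach suits moderate-sized elements better than very short ones.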

14 SiPhy
SiPhy models the pattern of substitutions, rather than just the rate, as most other methods do.
Biased substitutions (e.g. conserved lysine: AAA <-> AAG only) will be identified as constrained.
Some TFBS have similar degeneracy in evolution.
This is a more refined approach than rate models, but requires a fairly deep (or wide) phylogeny.
SiPhy uses a Bayesian approach and, like PhastCons, needs two parameters: the fraction of sequence conserved and the typical length of a conserved region.
Might want to spell out some details of algorithm where it differs from PhastCons … i.e. in alternatives to neutral
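The difference between rate and pattern can be seen in a toy comparison of two alignment columns. This sketch is not the SiPhy algorithm: the two summaries are crude stand-ins (species differing from the majority base as a rate-like measure, number of distinct bases as a pattern-like measure), and the columns are invented for illustration.

```python
from collections import Counter


def column_summary(column: str) -> dict:
    """Summarise one alignment column (one base per species)."""
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("column has no A/C/G/T bases")
    counts = Counter(bases)
    majority_count = counts.most_common(1)[0][1]
    return {
        "n_differing": len(bases) - majority_count,  # rate-like: how much change
        "distinct_bases": len(counts),               # pattern-like: which changes
    }


# Hypothetical third codon positions across eight species.
lysine_like = column_summary("AAAAGGGG")    # only A <-> G is tolerated
unconstrained = column_summary("AAAAGCTG")  # any base appears

print(lysine_like)
print(unconstrained)
```

Both columns show the same amount of change by the rate-like measure, but only the first is restricted to two bases; a pattern-aware model can treat that restriction as evidence of constraint, while a pure rate model cannot separate the two.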

15 SiPhy Applied to Mammalian Genomes
Identification of four NRSF-binding sites in NPAS4. K Lindblad-Toh et al. Nature (2011)

16 Comparison of Methods
PhyloP, PhastCons, and GERP give fairly similar results over deep phylogenies (e.g. vertebrates).
They differ substantially over bushes (e.g. primates).
SiPhy is more sensitive over moderately deep phylogenies (e.g. mammals).
SiPhy cannot be implemented for primates because there are too few substitutions.

17 Issues With Conservation Scores
Most scores are misleading about gaps in alignments: they don’t distinguish between contig gaps (incomplete genomes) and inserted or deleted regions.
This information is often available but inconvenient to use.
Each model was devised with a particular kind of conservation in mind, and may not be adaptable to all kinds.
Broken sequences (e.g. ZNF TFBS) are not captured well by any current method.

