Indel rates and probabilistic alignments Gerton Lunter Budapest, June 2008.

Slides:



Advertisements
Similar presentations
Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Probabilistic Verification of Discrete Event Systems using Acceptance Sampling Håkan L. S. YounesReid G. Simmons Carnegie Mellon University.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
A Hidden Markov Model for Progressive Multiple Alignment Ari Löytynoja and Michel C. Milinkovitch Appeared in BioInformatics, Vol 19, no.12, 2003 Presented.
1. Estimation ESTIMATION.
SNU BioIntelligence Lab. ( 1 Ch 5. Profile HMMs for sequence families Biological sequence analysis: Probabilistic models of proteins.
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Heuristic alignment algorithms and cost matrices
Part 4 b Forward-Backward Algorithm & Viterbi Algorithm CSE717, SPRING 2008 CUBS, Univ at Buffalo.
CS273a Lecture 8, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Efficient Estimation of Emission Probabilities in profile HMM By Virpi Ahola et al Reviewed By Alok Datar.
1 Bayesian inference of genome structure and application to base composition variation Nick Smith and Paul Fearnhead, University of Lancaster.
Molecular Clocks, Base Substitutions, & Phylogenetic Distances.
Genotyping of James Watson’s genome from Low-coverage Sequencing Data Sanjiv Dinakar and Yözen Hernández.
Molecular Evolution, Part 2 Everything you didn’t want to know… and more! Everything you didn’t want to know… and more!
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignment - III Chitta Baral. Scoring Model When comparing sequences –Looking for evidence that they have diverged from a common ancestor by.
Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
INFERENTIAL STATISTICS – Samples are only estimates of the population – Sample statistics will be slightly off from the true values of its population’s.
Incomplete Graphical Models Nan Hu. Outline Motivation K-means clustering Coordinate Descending algorithm Density estimation EM on unconditional mixture.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Hidden Markov Models for Sequence Analysis 4
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Random Sampling, Point Estimation and Maximum Likelihood.
Cryptic Variation in the Human mutation rate Alan Hodgkinson Adam Eyre-Walker, Manolis Ladoukakis.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
High-resolution computational models of genome binding events Yuan (Alan) Qi Joint work with Gifford and Young labs Dana-Farber Cancer Institute Jan 2007.
Construction of Substitution Matrices
Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
MAT 4830 Mathematical Modeling
Pairwise Sequence Analysis-III
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Comp. Genomics Recitation 8 Phylogeny. Outline Phylogeny: Distance based Probabilistic Parsimony.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
California Pacific Medical Center
Organization of statistical research. The role of Biostatisticians Biostatisticians play essential roles in designing studies, analyzing data and.
Construction of Substitution matrices
ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon.
Modelling evolution Gil McVean Department of Statistics TC A G.
Definition of the Hidden Markov Model A Seminar Speech Recognition presentation A Seminar Speech Recognition presentation October 24 th 2002 Pieter Bas.
Håkan L. S. YounesDavid J. Musliner Carnegie Mellon UniversityHoneywell Laboratories Probabilistic Plan Verification through Acceptance Sampling.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
The accuracy of averages We learned how to make inference from the sample to the population: Counting the percentages. Here we begin to learn how to make.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Presented by Samuel Chapman. Pyrosequencing-Intro The core idea behind pyrosequencing is that it utilizes the process of complementary DNA extension on.
Chapter 8 Confidence Interval Estimation Statistics For Managers 5 th Edition.
Generalization Performance of Exchange Monte Carlo Method for Normal Mixture Models Kenji Nagata, Sumio Watanabe Tokyo Institute of Technology.
Chapter 9: Correlational Research
Gene expression from RNA-Seq
Ch3: Model Building through Regression
Ab initio gene prediction
Confidence Intervals Chapter 10 Section 1.
Finding regulatory modules
When You See (This), You Think (That)
Volume 26, Issue 5, Pages (March 2016)
Discussion Section Week 9
Presentation transcript:

Indel rates and probabilistic alignments Gerton Lunter Budapest, June 2008

Alignment accuracy Simulation: Jukes-Cantor model Subs/indel rate = 7.5 Aligned with Viterbi + true model  Observed FPF

Neutral model for indels CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Neutral model for indels Look at inter-gap segments Pr( length = L ) ? CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Neutral model for indels CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1 Look at inter-gap segments Pr( length = L ) ? Def: p i = Pr( column i+1 survived | column i survived) Assumption: indels are independent of each other

Neutral model for indels CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1 Look at inter-gap segments Pr( length = L )  p i p i+1... p i+L-2 Def: p i = Pr( column i+1 survived | column i survived) Assumption: indels are independent of each other Assumption: indels occur uniformly across the genome

Neutral model for indels CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1 Look at inter-gap segments Pr( length = L )  p L Def: p i = Pr( column i+1 survived | column i survived) Assumption: indels are independent of each other Assumption: indels occur uniformly across the genome Prediction: Inter-gap distances follow a geometric distribution

Inter-gap distances in alignments Inter-gap distance (nucleotides) Weighted regression: R 2 > Log 10 counts Transposable elements +

Inter-gap distances in alignments (simulation)

Biases in alignments A: gap wander (Holmes & Durbin, JCB ) B,C: gap attraction D: gap annihilation

Biases in alignments

Influence of alignment parameters De-tuning of parameters away from “truth” does not improve alignments Accuracy of parameters (within ~ factor 2) does not hurt alignments much

Influence of model accuracy Improved model (for mammalian genomic DNA) : Better modelling of indel length distribution Substitution model & indel rates depend on local GC content Additional variation in local substitution rate Parameters: BlastZ alignments of human and mouse

Influence of model accuracy Simulation: –20 GC categories –10 substitution rate categories –100 sequences each = sequences –Each ~800 nt, + 2x100 flanking sequence

Summary so far Alignments are biased –Accuracy depends on position relative to gap –Fewer gaps than indels Alignments can be quite inaccurate –For 0.5 subs/site, indels/site: accuracy = 65%, false positives = 15% Choice of parameters does not matter much Choice of MODEL does not matter much…

Alignments: Best scoring path A C C G T T C A C A A T G G A T A T C A T C T G C A G T (Needleman-Wunsch, Smith-Waterman, Viterbi)

Alignments: Posterior probabilities A C C G T T C A C A A T G G A T A T C A T C T G C A G T (Durbin, Eddy, Krogh, Mitchison 1998)

Posterior probabilities

Posteriors: Good predictors of accuracy

Posterior decoding: better than Max Likelihood

Posteriors & estimating indel rates The inter-gap histogram slope estimates the indel rate, and is not affected by gap attraction….. but is influenced by gap annihilation… …leading to lower ‘asymptotic accuracy’… …which cannot be observed – but posteriors can be… …and they are identical in the mean:

Indel rate estimators Density: Alignment gaps per site Inter-gap: Slope of inter-gap histogram BW: Baum-Welch parameter estimate Prob: Inter-gap histogram with posterior probability correction

Human-mouse indel rate estimates Indel rate

Simulations: inferences are accurate Indel rate

Second summary Alignments are biased, and have errors Posterior accurately predicts local alignment quality Posterior decoding improves alignments, reduces biases With posterior decoding: modelling of indel lengths and sequence content improves alignments Indel rates (human-mouse) % higher than apparent from alignments

Neutral indel model: Whole genome Inter-gap distance (nucleotides) Log 10 counts Transposable elements:Whole genome:

Estimating fraction of sequence under purifying selection Model:● Genome is mixture of “conserved” and “neutral” sequence ● “Conserved” sequence accepts no indel mutations ● “Neutral” sequence accepts any indel mutation ● Indels are point events (no spatial extent) Account for “neutral overhang”: Correction depends on level of clustering of conserved sequence: –Low clustering: conserved segment is flanked by neutral overhang neutral contribution = 2 x average neutral distance between indels –High clustering: indels “sample” neutral sequence neutral contribution = 1 x average neutral distance between indels Lower bound: ~79 Mb, or ~2.6 % Upper bound: ~100 Mb, or ~3.25 %

How much of our genome is under purifying selection? 2.56 – 3.25% indel-conserved ( Mb) + + : Divergence (subs/site) Mb 5%

Inferences are not biased by divergence Inferred from data: Simulation (100 Mb conserved)

Conclusions Alignment is an inference problem; don’t ignore the uncertainties! Posterior decoding (heuristic) can be better than Viterbi (exact) Indel rates are high. Useful for identifying functional regions, since indels can be more disruptive of function than substitutions. Up to 10% of our genome may be functional, and a large proportion is rapidly turning over.