Gil McVean Department of Statistics, Oxford


Modelling genomes
Gil McVean, Department of Statistics, Oxford

Why would we want to model a genome?
To identify genes
  Protein-coding
  RNA
  Small RNAs
To identify regulatory elements
  Transcription factor binding sites
  Enhancers
To classify genome content
  Repeat DNA
  Unique sequence
To understand the processes that shape genomes
  Mutation
  Recombination
  Duplication
  Rearrangement
  Natural selection

A rather simple model for a protein-coding gene
STATES: Non-coding DNA, Start codon, Codon, TERM (the figure also carries the labels s and t)
EMISSIONS:
  Non-coding DNA: T 30%, C 20%, A 30%, G 20%
  Start codon: ATG 100%
  Codon: AAA, AAC, AAG, …, TTT, each with probability 1/61
  TERM: TAA 30%, TAG 40%, TGA 30%
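The model on this slide can be read as a generator of sequences. The sketch below samples from it using the emission tables given on the slide. The slide does not define s and t, so they are treated here as transition probabilities with invented values (s: chance per base of leaving non-coding DNA, t: chance per codon of moving to TERM). The rule that at least one codon follows the start codon is also an assumption.

```python
import random

# Emission tables as given on the slide.
NONCODING = {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}
STOPS = {"TAA": 0.3, "TAG": 0.4, "TGA": 0.3}
# The 61 sense codons, each emitted with probability 1/61.
SENSE = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"
         if a + b + c not in STOPS]

def draw(rng, table):
    """Draw one key from a {symbol: probability} table."""
    symbols = list(table)
    return rng.choices(symbols, weights=[table[s] for s in symbols])[0]

def sample_gene(rng, s=0.05, t=0.02):
    """Sample (non-coding prefix, coding sequence) from the toy gene model.

    ASSUMPTION: s is the per-base probability of leaving the non-coding
    state for the start codon and t is the per-codon probability of moving
    to TERM. The slide only shows the labels, not their meaning or values.
    """
    prefix = []
    while True:
        prefix.append(draw(rng, NONCODING))   # at least one non-coding base
        if rng.random() < s:
            break
    codons = ["ATG"]                          # start codon, emitted with certainty
    while True:
        codons.append(rng.choice(SENSE))      # at least one sense codon (assumed)
        if rng.random() < t:
            break
    codons.append(draw(rng, STOPS))           # TERM state emits a stop codon
    return "".join(prefix), "".join(codons)

rng = random.Random(1)
prefix, coding = sample_gene(rng)
print(len(prefix), coding[:12], coding[-3:])
```

Every sampled coding sequence starts with ATG, has a length divisible by three and ends at the first in-frame stop codon, which is all that this simple model knows about genes.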

A ‘genome’ model is like any other statistical model
Define model
Explore properties
Estimate parameters from data
Test goodness-of-fit
Refine model

Hidden Markov Models in bioinformatics
The model of a gene just described can be thought of as a hidden Markov model (HMM)
The underlying states evolve in a Markov fashion, but we observe features (the DNA sequence) emitted by those states
You will remember that there are lots of nice computational properties of hidden Markov models that we can use for inference
  Finding a most likely sequence of states
  Calculating posterior probabilities of a given state at a given position
There are also various algorithms we can use to estimate parameters of HMMs (e.g. ML estimation by EM)
How would you use the model of a gene to find new genes?
How well do you think it would do?

Making useful HMMs in bioinformatics
To be useful, HMMs for genes have to incorporate many features
  Regulatory sequences
  Intron-splicing features
  Correlations and biases in amino acid and base composition
A REALLY important feature to capture is their evolution
  Important parts of genes and genomes evolve slower due to constraint

Searching for homology
If we compare human and chimpanzee sequences they are approximately 98.8% identical at the DNA level.
It is ‘easy’ to identify which parts of the genome in humans correspond to which parts in chimps
If we compare human with, say, mouse, we can see some parts that are similar, and other parts where there is only vague or even no obvious similarity.
When measuring evolution, we need to identify regions that are homologous
  Homology means similarity by descent
Traditionally, the problem of identifying homology has been intrinsically linked to the problem of alignment

Alignment of PfEMP1 proteins from P. falciparum

The simplest problem: aligning two sequences
Suppose we have just two protein sequences that we want to align
In evolution, three types of event can happen
  Mutation to new amino acids
  Insertion of new amino acids
  Deletion of amino acids
We want to work out which amino acids in the two sequences are homologous – i.e. related to each other through shared ancestry
Unaligned:
  WAKIS
  WEEKS
Aligned:
  W-AKIS
  WEEK-S
What do the ‘-’s really mean?

How can we construct an alignment algorithm?
What we want to do is to look at every possible alignment and choose the one that is ‘best’
What we have to do is to find an efficient algorithm that can search every possible alignment and that has an objective measure as to what ‘best’ means
A natural approach is to make a model of alignments, parameterise it and find the alignment that maximises the likelihood
Although the problem sounds hard we can solve it using a hidden Markov model structure

How does it work?
Suppose residues Xi and Yj are aligned to each other
Three things could happen next
  (A) The next two residues in each sequence could also align
  (B) A gap could be introduced in sequence X
  (C) A gap could be introduced in sequence Y
We can parameterise the probabilities of each event
          (A)          (B)          (C)
  X:   Xi Xi+1      Xi -         Xi Xi+1
  Y:   Yj Yj+1      Yj Yj+1      Yj -

The full algorithm
We need to consider similar transitions for the cases when residue Xi is aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi
We need to specify various probabilities
  The probability of inserting a gap
  The probability of extending a gap
  The probability of finishing the alignment
  The probability of observing an aligned pair of residues (20x20)
  The probability of observing a residue aligned to a gap (20)
Once specified we can use the Viterbi and Forward/Backward algorithms to identify ML alignments, sample from the posterior or calculate posterior probabilities
[Figure: the two gap states, Xi-a…Xi over Yj…- and Xi…- over Yj-a…Yj]
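One common way to write down the transition probabilities listed above is the three-state pair HMM below. The slide does not give the matrix, so the symbols are assumptions: M is the state with two residues aligned, X and Y are the two gap states, δ is the probability of inserting a gap, ε of extending one and τ of finishing the alignment. Direct moves between the two gap states are set to zero here, which is a modelling choice rather than something the slide states.

```latex
\[
\begin{array}{c|cccc}
\text{from}\backslash\text{to} & M & X & Y & \text{End} \\ \hline
M & 1 - 2\delta - \tau      & \delta      & \delta      & \tau \\
X & 1 - \varepsilon - \tau  & \varepsilon & 0           & \tau \\
Y & 1 - \varepsilon - \tau  & 0           & \varepsilon & \tau
\end{array}
\]
```

Each row sums to one. Together with the 20x20 pair emissions and the 20 gap emissions, this fixes the whole model.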

The forward algorithm
Emission probabilities = ek(Xi+1)
Transition probabilities = qij
In alignment the state space is two-dimensional (residue i aligned to residue j)
[Figure: states H, H and D feeding into position Xi+1]
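A minimal sketch of the forward recursion for an ordinary one-dimensional HMM, using the slide's notation e_k(x) for emissions and q_ij for transitions. The two states, their names and every probability are invented for illustration. The alignment version applies the same idea over a two-dimensional grid of (i, j) cells.

```python
from itertools import product

STATES = ("AT-rich", "GC-rich")             # hypothetical toy states
START = {"AT-rich": 0.5, "GC-rich": 0.5}    # initial state probabilities
Q = {                                       # transition probabilities q_ij
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.2, "GC-rich": 0.8},
}
E = {                                       # emission probabilities e_k(x)
    "AT-rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC-rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def forward(seq):
    """Return P(seq), summed over all state paths, via the forward recursion."""
    # f[k] = P(x_1..x_i, state at position i is k)
    f = {k: START[k] * E[k][seq[0]] for k in STATES}
    for x in seq[1:]:
        f = {j: E[j][x] * sum(f[i] * Q[i][j] for i in STATES) for j in STATES}
    return sum(f.values())

def brute_force(seq):
    """Same quantity by listing every state path (exponential cost)."""
    total = 0.0
    for path in product(STATES, repeat=len(seq)):
        p = START[path[0]] * E[path[0]][seq[0]]
        for prev, cur, x in zip(path, path[1:], seq[1:]):
            p *= Q[prev][cur] * E[cur][x]
        total += p
    return total

print(forward("GGCATTA"), brute_force("GGCATTA"))
```

The two numbers agree up to floating-point rounding, but the recursion needs work proportional to the sequence length times the square of the number of states, while the enumeration doubles with every extra base.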

The Viterbi algorithm
A traceback matrix is used to keep track of the best partial alignments
[Figure: trellis of states H and D at positions Xi-1, Xi and Xi+1]
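A minimal sketch of Viterbi with traceback for an ordinary one-dimensional HMM. The toy states and probabilities are invented for illustration and are not from the slides. The list of back-pointers plays the role of the traceback matrix: for each position and state it records which previous state gave the best partial path.

```python
import math

STATES = ("AT-rich", "GC-rich")             # hypothetical toy states
START = {"AT-rich": 0.5, "GC-rich": 0.5}
Q = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.2, "GC-rich": 0.8},
}
E = {
    "AT-rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC-rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def path_logprob(seq, path):
    """Joint log probability of a sequence and one particular state path."""
    lp = math.log(START[path[0]] * E[path[0]][seq[0]])
    for prev, cur, x in zip(path, path[1:], seq[1:]):
        lp += math.log(Q[prev][cur]) + math.log(E[cur][x])
    return lp

def viterbi(seq):
    """Return (log probability of the best path, the best path)."""
    v = {k: math.log(START[k] * E[k][seq[0]]) for k in STATES}
    back = []                # back[i][j]: best predecessor of state j at step i + 1
    for x in seq[1:]:
        ptr, new_v = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda i: v[i] + math.log(Q[i][j]))
            ptr[j] = best
            new_v[j] = v[best] + math.log(Q[best][j]) + math.log(E[j][x])
        back.append(ptr)
        v = new_v
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):   # follow the back-pointers from the end
        path.append(ptr[path[-1]])
    return v[last], path[::-1]

score, path = viterbi("GGCGATTAT")
print(round(score, 3), path)
```

Working in log space avoids underflow on long sequences, which matters for genome-scale data.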

An example
Suppose the gap opening and extension parameters are 0.2 and 0.5 respectively.
There is an 80% chance of observing a match, a (20/19)% chance of observing any given mismatch (that is, 0.2/19) and a 5% chance of observing each unaligned amino acid
(We can ignore termination for the moment)
The BEST alignments are given below, each of which has log likelihood of -16.84, or 31% of the total likelihood (lnlk = -15.67).
In many real situations, the best alignment represents only a fraction of the total likelihood
  W-AKIS      WA-KIS
  WEEK-S      WEEK-S
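The log likelihood quoted for these alignments can be checked by hand. The sketch below scores one given alignment column by column. The transition structure is an assumption, since the slide gives only the parameter values: the alignment starts as if from the match state, a gap opens with probability 0.2 in either sequence (so match to match is 0.6), a gap closes with probability 0.5, and one gap state cannot follow the other directly. The function only scores the alignment it is given and does not search for the best one.

```python
import math

GAP_OPEN, GAP_EXTEND = 0.2, 0.5    # values from the slide
P_MATCH = 0.8                      # identical aligned pair
P_MISMATCH = 0.2 / 19              # any one specific mismatch
P_UNALIGNED = 0.05                 # residue opposite a gap

# ASSUMED transition structure: M = aligned pair, X = gap in bottom row,
# Y = gap in top row; no direct X <-> Y moves; termination ignored.
TRANS = {
    ("M", "M"): 1 - 2 * GAP_OPEN, ("M", "X"): GAP_OPEN, ("M", "Y"): GAP_OPEN,
    ("X", "X"): GAP_EXTEND, ("X", "M"): 1 - GAP_EXTEND,
    ("Y", "Y"): GAP_EXTEND, ("Y", "M"): 1 - GAP_EXTEND,
}

def alignment_loglik(top, bottom):
    """Log likelihood of one gapped alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    state, total = "M", 0.0        # assumed: start behaves like the match state
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("column with two gaps")
        if b == "-":
            nxt, emit = "X", P_UNALIGNED
        elif a == "-":
            nxt, emit = "Y", P_UNALIGNED
        else:
            nxt, emit = "M", (P_MATCH if a == b else P_MISMATCH)
        if (state, nxt) not in TRANS:
            raise ValueError("transition not allowed in this sketch")
        total += math.log(TRANS[(state, nxt)]) + math.log(emit)
        state = nxt
    return total

print(round(alignment_loglik("W-AKIS", "WEEK-S"), 2))   # -16.84
print(round(alignment_loglik("WA-KIS", "WEEK-S"), 2))   # -16.84
```

Each alignment has three matches, one mismatch and two unaligned residues, so the emissions contribute 3 ln 0.8 + ln(0.2/19) + 2 ln 0.05 = -11.21 and the six transitions contribute ln(0.6 x 0.2 x 0.5 x 0.6 x 0.2 x 0.5) = -5.63.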

Posterior decoding
Using the forward-backward algorithm we can calculate the posterior probability that any residue is aligned to any other, or that a given residue is in a gap state
[Figure: two grids of posterior alignment probabilities, X1 to X5 against Y1 to Y5; the second is conditional on X2-Y3]
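In symbols, the posterior probability that Xi is aligned to Yj combines the forward and backward quantities for the match state. The notation is an assumption in the style of standard textbook treatments, not taken from the slides: f^M(i, j) is the probability of the first i residues of X and the first j residues of Y with Xi aligned to Yj, b^M(i, j) is the probability of the remaining residues given that pairing, and P(X, Y) is the total likelihood from the forward algorithm.

```latex
\[
P(X_i \text{ aligned to } Y_j \mid X, Y)
  = \frac{f^{M}(i, j)\, b^{M}(i, j)}{P(X, Y)}
\]
```

Replacing the match state by a gap state in the numerator gives the posterior probability that a residue sits opposite a gap.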

Extending the method
Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh, 1982) were not explicitly defined as hidden Markov models
  Finite-state automata (FSA)
There have been many extensions to the original idea
  Local alignment
  Repeat alignment
  Protein family identification
  Gene finding
  Multiple alignment
The alignment algorithm is very much a workhorse of bioinformatics, as an alignment is needed for almost all subsequent analyses (e.g. phylogenetic tree reconstruction, population genetic inference)
However, relying on a single alignment is not always a great idea

Doing away with alignment
For most problems, the alignment is not of primary interest
The natural thing to do is to integrate over alignments (as in the FB algorithm) to estimate parameters of interest
The key problem is that there is no computationally efficient algorithm for statistical multiple alignment.
All widely-used methods use heuristic approaches

Gene conversion and var gene diversity in P. falciparum
Multiple alignment methods typically assume the sequences are related to each other through an evolutionary tree
For the case of multi-gene families, this may not be the case, because gene conversion between copies can lead to mosaic structures
If we wish to learn about the processes of conversion, a natural approach is to model the mosaicism
In the case of var genes, the sequences are so diverged that we also need to consider the problem of alignment

Mosaic alignment
We could model the (n+1)th sequence as a mosaic of the previous n
We can calculate the likelihood of observing a given sequence by summing over all possible mosaic structures and their alignment
We can also identify the most likely mosaic structure and calculate the expected number of recombination events
Repeating the procedure for all sequences provides a way of assessing the importance of mosaicism within the family

Extensive mosaicism within the var family