HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.

Presentation transcript:

HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan

2 Relationship Between DNA, RNA And Proteins DNA (CCTGAGCCAACTATTGATGAA) -> transcription -> mRNA (CCUGAGCCAACUAUUGAUGAA) -> translation -> Protein (PEPTIDEPEPTIDE)
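
The DNA-to-protein flow on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the partial codon table are assumptions introduced here, and the table covers just the codons that occur in the slide's example rather than the full genetic code. For the 21 nucleotides shown, the sketch yields seven residues, PEPTIDE.

```python
# Minimal sketch: transcription (DNA -> mRNA) and translation (mRNA -> protein).
# CODON_TABLE is deliberately partial (an assumption for this example): it holds
# only the codons found in the slide's sequence, not all 64 codons.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna: str) -> str:
    """Return the mRNA carrying the same sequence as the DNA strand (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon using the partial table above."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing incomplete codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```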

3 Protein Structure Primary Structure of Proteins The primary structure of peptides and proteins refers to the linear number and order of the amino acids present.

4 Protein Structure Secondary Structure Alpha Helix Beta Sheet Protein secondary structure refers to regular, repeated patterns of folding of the protein backbone. How a protein folds is largely dictated by the primary sequence of amino acids.

5 Multiple Alignment Process Process of aligning three or more sequences with each other Generalization of the algorithm to align two sequences Local multiple alignment uses Sum of pairs scoring scheme
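
As a rough illustration of sum-of-pairs scoring, the sketch below scores every pair of rows in every column of a small alignment and adds the results. The match, mismatch and gap values, the rule that a gap paired with a gap scores 0, and the three-row alignment are all assumptions made for this example; the slides do not specify a scoring scheme.

```python
from itertools import combinations

# Hypothetical scoring scheme (not taken from the slides).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols; a gap-gap pair is assumed to score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Add pair_score over every pair of rows in every column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


rows = ["ACG", "A-G", "ATG"]  # made-up alignment of three sequences
print(sum_of_pairs(rows))     # 3 + (-5) + 3 = 1
```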

6 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs

7 Markov Chains Sunny Rain Cloudy State transition matrix : The probability of the weather given the previous day's weather. Initial Distribution : Defining the probability of the system being in each of the states at time 0. States : Three states - sunny, cloudy, rainy.
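
The three components named on this slide (states, initial distribution, state transition matrix) are enough to compute the probability of any weather sequence. The sketch below assumes made-up numbers for the initial distribution and the transition matrix, since the slide text gives none.

```python
# Hypothetical parameters for the three-state weather chain (not from the slides).
INITIAL = {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.25}
TRANSITION = {  # TRANSITION[yesterday][today]
    "sunny":  {"sunny": 0.5,  "cloudy": 0.375, "rainy": 0.125},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,   "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25,  "rainy": 0.5},
}


def sequence_probability(states):
    """P(s1, ..., sn) = P(s1) * product over t of P(s_t | s_{t-1})."""
    prob = INITIAL[states[0]]
    for previous, current in zip(states, states[1:]):
        prob *= TRANSITION[previous][current]
    return prob


# 0.5 (start sunny) * 0.5 (sunny -> sunny) * 0.125 (sunny -> rainy)
print(sequence_probability(["sunny", "sunny", "rainy"]))  # 0.03125
```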

8 Hidden Markov Models Hidden states: the (TRUE) states of a system that may be described by a Markov process (e.g., the weather). Observable states: the states of the process that are 'visible' (e.g., seaweed dampness).

9 Components Of HMM Output matrix : containing the probability of observing a particular observable state given that the hidden model is in a particular hidden state. Initial Distribution : contains the probability of the (hidden) model being in a particular hidden state at time t = 1. State transition matrix : holding the probability of a hidden state given the previous hidden state.

10 Example-HMM Scoring a Sequence with an HMM: The probability of ACCY along this path is 0.4 * 0.3 * 0.46 * 0.6 * 0.97 * 0.5 * 0.015 * 0.73 * 0.01 * 1 = 1.76 x 10^-6. Figure labels: Transition Prob., Output Prob.
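
The arithmetic on this slide can be checked directly. The sketch below multiplies the ten factors exactly as listed; which of them are transition probabilities and which are output probabilities depends on the figure and is not assumed here. The log-space sum is an addition made for this example, showing the usual way to avoid underflow on longer sequences.

```python
import math

# The ten factors copied from the slide, in the order given.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)

# Same quantity as a sum of logs, which stays numerically stable for long paths.
log_probability = sum(math.log(f) for f in factors)

print(f"{path_probability:.3g}")  # 1.76e-06
```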

11 Problems With HMM Scoring Problem: Given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
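
The scoring problem is commonly solved with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below is a minimal version under stated assumptions: a two-state toy model loosely based on the weather and seaweed example from the earlier slides, with all probabilities invented for illustration.

```python
# Hypothetical two-state HMM (all numbers are assumptions, not from the slides).
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.5, "rainy": 0.5}
TRANSITION = {"sunny": {"sunny": 0.7, "rainy": 0.3},
              "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMISSION = {"sunny": {"dry": 0.9, "soggy": 0.1},
            "rainy": {"dry": 0.2, "soggy": 0.8}}


def forward(observations):
    """Return P(observations | model), summed over every hidden state path."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMISSION[s][obs] * sum(alpha[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(forward(["dry", "soggy"]))  # approximately 0.1915
```

The cost grows linearly with the sequence length and quadratically with the number of states, whereas the number of paths grows exponentially.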

12 Problems With HMM Alignment Problem: Given a sequence, what is the optimal state sequence that the HMM would use to generate it?
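
The alignment problem is usually solved with the Viterbi algorithm, a dynamic program that keeps only the best path into each state at every step. The sketch below assumes the same kind of toy model as above: two hidden weather states and two seaweed observations, with invented probabilities.

```python
# Hypothetical two-state HMM (all numbers are assumptions, not from the slides).
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.5, "rainy": 0.5}
TRANSITION = {"sunny": {"sunny": 0.7, "rainy": 0.3},
              "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMISSION = {"sunny": {"dry": 0.9, "soggy": 0.1},
            "rainy": {"dry": 0.2, "soggy": 0.8}}


def viterbi(observations):
    """Return (most probable hidden state path, probability of that path)."""
    # delta[s] = probability of the best path that ends in state s
    delta = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[p] * TRANSITION[p][s])
            pointers[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANSITION[best_prev][s] * EMISSION[s][obs]
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state to recover the path.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


path, prob = viterbi(["dry", "soggy"])
print(path, prob)  # ['sunny', 'rainy'] with probability of about 0.108
```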

13 Problems With HMM Training Problem: Given a large amount of data, how can we estimate the structure and the parameters of the HMM that best account for the data?

14 HMMs in Biology Gene finding and prediction Protein-Profile Analysis Secondary Structure prediction Advantages Limitations

15 Finding genes in DNA sequence This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.

16 What is a (protein-coding) gene? DNA (CCTGAGCCAACTATTGATGAA) -> transcription -> mRNA (CCUGAGCCAACUAUUGAUGAA) -> translation -> Protein (PEPTIDEPEPTIDE)

17 In more detail (color ~state) (Removed) (Left)

18 Gene Finding HMMs Our Objective: –To find the coding and non-coding regions of an unlabeled string of DNA nucleotides Our Motivation: –Assist in the annotation of genomic data produced by genome sequencing methods –Gain insight into the mechanisms involved in transcription, splicing and other processes

19 Why HMMs Classification: Classifying observations within a sequence Order: A DNA sequence is a set of ordered observations Grammar : Our grammatical structure (and the beginnings of our architecture) is right here: Success measure: # of complete exons correctly labeled Training data: Available from various genome annotation projects

20 HMMs for gene finding An HMM for unspliced genes. x: non-coding DNA; c: coding state. Training: Expectation Maximization (EM). Parsing: Viterbi algorithm.

21 Genefinders- a comparison Sn = Sensitivity Sp = Specificity Ac = Approximate Correlation ME = Missing Exons WE = Wrong Exons GENSCAN Performance Data,

22 Protein Profile HMMs Motivation –Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein. Use Profile Similarity What is a Profile? –Protein families of related sequences and structures –Same function –Clear evolutionary relationship –Patterns of conservation: some positions are more conserved than others

23 An Overview Aligned Sequences; Build a Profile HMM (Training); Database search; Multiple alignments (Viterbi); Query against Profile HMM database (Forward)

24 Building – from an existing alignment An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases. Figure labels: Transition probabilities, Output Probabilities, insertion. Alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
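
Building from an existing alignment amounts to counting. The sketch below estimates the output probabilities of one column as plain relative frequencies. Two assumptions apply: the five rows are a reading of the slide's flattened alignment with the lost gap characters restored, and no pseudocounts are used, although a practical profile HMM would normally add them.

```python
from collections import Counter

# Assumed reconstruction of the slide's alignment (gaps restored as '-').
alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]


def column_frequencies(rows, column):
    """Relative frequency of each base in one column, ignoring gaps."""
    residues = [row[column] for row in rows if row[column] != "-"]
    counts = Counter(residues)
    return {base: n / len(residues) for base, n in counts.items()}


print(column_frequencies(alignment, 0))  # {'A': 0.8, 'T': 0.2}
```

Under this reading, the first three columns each give 0.8 to their most common base, which agrees with the 0.8 factors used when the consensus sequence is scored.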

25 Building – Final Topology Matching states, insertion states, deletion states. No. of matching states = average sequence length in the family. PFAM Database - of Protein families (

26 Database Searching Given an HMM, M, for a sequence family, find all members of the family in a database. LL score: LL(x) = log P(x|M). (The LL score is length dependent; it must be normalized, or a Z-score used instead.)

27 Query a new sequence Suppose I have a query protein sequence and want to know which family it belongs to. There can be many paths leading to the generation of this sequence. Need to find all these paths and sum the probabilities. Consensus sequence (ACAC - - ATC): P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x 0.8x1 x 0.8 = 4.7 x 10^-2
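
The consensus-path product on this slide can be reproduced as below, together with the log-likelihood form used for database scoring. The factors are copied from the slide in order. Dividing the log-likelihood by the sequence length is only one simple way to reduce length dependence and is an assumption of this sketch, not a method stated in the slides.

```python
import math

# Factors copied from the slide, in the order given.
factors = [0.8, 1, 0.8, 1, 0.8, 0.6, 0.4, 0.6, 1, 1, 0.8, 1, 0.8]

probability = math.prod(factors)      # probability of ACACATC along the consensus path
log_likelihood = math.log(probability)

# Assumed normalization: average log-likelihood per residue of the 7-letter query.
per_residue = log_likelihood / len("ACACATC")

print(f"{probability:.2g}")  # 0.047
```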

28 Multiple Alignments Try every possible path through the model that would produce the target sequences –Keep the best one and its probability. –Output : Sequence of match, insert and delete states Viterbi alg. Dynamic Programming

29 Building – unaligned sequences Baum-Welch Expectation-maximization method –Start with a model whose length matches the average length of the sequences and with random output and transition probabilities. –Align all the sequences to the model. –Use the alignment to alter the output and transition probabilities. –Repeat. Continue until the model stops changing. By-product: it produces a multiple alignment.

30 PHMM Example An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.

31 Prediction of Protein Secondary structures Prediction of secondary structures is needed for the prediction of protein function. Analyze the amino-acid sequences of proteins Learn secondary structures –helix, sheet and turn Predict the secondary structures of sequences

32 Advantages Characterize an entire family of sequences. Position-dependent character distributions and position-dependent insertion and deletion gap penalties. Built on a formal probabilistic basis Can make libraries of hundreds of profile HMMs and apply them on a large scale (whole genome)

33 Limitations Markov Chains: Probabilities of states are supposed to be independent. P(y) must be independent of P(x), and vice versa. This usually isn't true. P(x) … P(y)

34 Limitations - contd Standard Machine Learning Problems Watch out for local maxima –Model may not converge to a truly optimal parameter set for a given training set Avoid over-fitting –You’re only as good as your training set –More training is not always good

35 CONCLUSION For links & slides –