Fa07CSE182-L8 HMMs, Gene Finding. Fa07CSE182-L8 Midterm 1 In class next Tuesday Syllabus: L1-L8 –Please review HW, questions, and other notes –Bring one.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Hidden Markov Model.
1 Hidden Markov Model Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.
Ab initio gene prediction Genome 559, Winter 2011.
Hidden Markov Models.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Hidden Markov Models Modified from:
Hidden Markov Models CBB 231 / COMPSCI 261. An HMM is a following: An HMM is a stochastic machine M=(Q, , P t, P e ) consisting of the following: a finite.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Hidden Markov Models Ellen Walker Bioinformatics Hiram College, 2008.
Profiles for Sequences
Hidden Markov Models Theory By Johan Walters (SR 2003)
Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer.
Hidden Markov Models (HMMs) Steven Salzberg CMSC 828H, Univ. of Maryland Fall 2010.
. Hidden Markov Model Lecture #6. 2 Reminder: Finite State Markov Chain An integer time stochastic process, consisting of a domain D of m states {1,…,m}
درس بیوانفورماتیک December 2013 مدل ‌ مخفی مارکوف و تعمیم ‌ های آن به نام خدا.
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Hidden Markov Models Pairwise Alignments. Hidden Markov Models Finite state automata with multiple states as a convenient description of complex dynamic.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
Gene Finding (DNA signals) Genome Sequencing and assembly
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
CSE182-L10 Gene Finding.
CSE182-L12 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Elze de Groot1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
CSE182-L7 Protein Sequence Analysis using HMMs, Gene Finding.
CSE182-L8 Gene Finding. Project EST clustering and assembly Given a collection of EST (3’/5’) sequences, your goal is to cluster all ESTs from the same.
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
CSE182-L10 MS Spec Applications + Gene Finding + Projects.
Hidden Markov Model Continues …. Finite State Markov Chain A discrete time stochastic process, consisting of a domain D of m states {1,…,m} and 1.An m.
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
Introduction to Profile Hidden Markov Models
CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Hidden Markov Models for Sequence Analysis 4
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models Yves Moreau Katholieke Universiteit Leuven.
Hidden Markov Models Usman Roshan CS 675 Machine Learning.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
10/29/20151 Gene Finding Project (Cont.) Charles Yan.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
CSE182-L9 Modeling Protein domains using HMMs. Profiles Revisited Note that profiles are a powerful way of capturing domain information Pr(sequence x|
10-07CSE182 CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
PGM 2003/04 Tirgul 2 Hidden Markov Models. Introduction Hidden Markov Models (HMM) are one of the most common form of probabilistic graphical models,
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
CSE182-L9 Gene Finding (DNA signals) Genome Sequencing and assembly.
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Markov Chains and Hidden Markov Model.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
(H)MMs in gene prediction and similarity searches.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
1 Hidden Markov Model Xiaole Shirley Liu STAT115, STAT215.
bacteria and eukaryotes
Genome Annotation (protein coding genes)
CSE182-L12 Gene Finding.
Eukaryotic Gene Finding
Ab initio gene prediction
Hidden Markov Models (HMMs)
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
Presentation transcript:

Fa07CSE182-L8 HMMs, Gene Finding

Fa07CSE182-L8 Midterm 1 In class next Tuesday Syllabus: L1-L8 –Please review HW, questions, and other notes –Bring one sheet of written formulae Most of the questions will test concepts, so hopefully, you don’t have to remember too much. Final will be a take-home exam We will discuss class projects next time.

Fa07CSE182-L8 QUIZ! Question: your ‘friend’ likes to gamble. He tosses a coin: TAILS, he gives you a dollar. HEADS, you give him a dollar. Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. Can you say what fraction of the times he loads the coin? Suppose you know that in the loaded coin Pr(H)=0.8. Also, that he switches coins with 30% probability.

Fa07CSE182-L8 Modeling the coin problem as an HMM Q: Given the HMM, can you predict the hidden states? Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF Pr(H)=0.8Pr(H)=0.5 Pr(F->L)=0.3 Pr(L->F)=0.3 Pr(F->F)=0.7 Pr(L->L)=0.7 L F

Fa07CSE182-L8 HMM? It is an automaton with different ‘states’ (Ex: L,F) We start from a specific start state Each state of the automaton generates a symbol from an alphabet ({‘H’,’T’}), according to a distribution. After generating, we move (transition) to a new state. While the output (generated symbols) are visible, the states themselves are hidden. Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF LF

Fa07CSE182-L8 Profile HMMs Many natural problems can be expressed as an HMM. Ex: protein sequence profiles Recall that a profile is created by looking at position specific distributions

Fa07CSE182-L8 The generative model Think of each column in the alignment as generating a distribution. For each column, build a node that outputs a residue with the appropriate distribution Pr[F]=0.71 Pr[Y]=0.14

Fa07CSE182-L8 A simple Profile HMM Connect nodes for each column into a chain. This chain generates sequences based on the profile distribution. What is the probability of generating FKVVGQVILD? In this representation –Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S] What is the difference with Profiles?

Fa07CSE182-L8 Profile HMMs can handle gaps The match states are the same as on the previous page. Insertion and deletion states help introduce gaps. A sequence may be generated using different paths.

Fa07CSE182-L8 Example The HMM M models a family of 3 sequences. Question: Is sequence S=ALIL part of the family? Solution: Compute Pr(S|M), the probability that the HMM M generated S. If Pr(S|M) is ‘High’, S belongs to the family. If it is ‘Low’, S does not belong. How can we compute Pr(S=ALIL|M)? A L - L A I V L A I - L

Fa07CSE182-L8 Example Probability [ALIL|M]? Note that multiple paths can generate this sequence. –P1=M 1 I 1 M 2 M 3 –P2=M 1 M 2 I 2 M 3 We ask a simpler question. Suppose a specific path was given. –Compute the joint probability Pr(S,P|M) A L - L A I V L A I - L

Fa07CSE182-L8 Joint probability computation Joint probability of seeing a sequence S, and path P –Pr[S,P|M] = Pr[S|P,M] Pr[P|M] –Pr[ALIL AND M 1 I 1 M 2 M 3 | M] = Pr[ALIL| M 1 I 1 M 2 M 3,M] Pr[M 1 I 1 M 2 M 3 | M] = ?

Fa07CSE182-L8 Formally The emitted sequence is S=S 1 S 2 …S m The path traversed is P 1 P 2 P 3.. e j (s) = emission probability of symbol s in state P j Transition probability T[P j,P k ] : Probability of transitioning from state P j to state P k. Pr(P,S|M) = e P1 (S 1 ) T[P 1,P 2 ] e P2 (S 2 ) …… Let P[k,n] = Pr(P 1 …P k,S 1 …S n |M) Then, P[k,n] = P[k-1,n-1] T[P k-1,P k ] e Pk (S n ) What is Pr(S|M)?

Fa07CSE182-L8 Two solutions An unknown (hidden) path is traversed to produce (emit) the sequence S. The probability that M emits S can be either –The sum over the joint probabilities over all paths. Pr(S|M) = ∑ P Pr(S,P|M) –OR, it is the probability of the most likely path Pr(S|M) = max P Pr(S,P|M) Both are appropriate ways to model, and have similar algorithms to solve them.

Fa07CSE182-L8 Viterbi Algorithm for HMM Let P max (i,j|M) be the probability of the most likely solution that emits S 1 …S i, and ends in state j (is it sufficient to compute this?) P max (i,j|M) = max k P max (i-1,k) T[k,j] e j (S i ) (Viterbi) P sum (i,j|M) = ∑ k ( P sum (i-1,k) T[k,j] ) e j (S i ) A L - L A I V L A I - L

Fa07CSE182-L8 Profile HMM membership We can use the Viterbi/Sum algorithm to compute the probability that the sequence belongs to the family. Backtracking can be used to get the path, which allows us to give an alignment A L - L A I V L A I - L Path: M 1 M 2 I 2 M 3 A L I L

Fa07CSE182-L8 Summary To search for remote homologs (family members) of a protein sequence, we need to have alignments of “disparate protein sequences”. HMMs allow us to model position specific gap penalties, and allow for automated training to get a good alignment. Patterns/Profiles/HMMs allow us to represent families and foucs on key residues Each has its advantages and disadvantages, and needs special algorithms to query efficiently.

Fa07CSE182-L8 Protein Domain databases A number of databases capture proteins (domains) using various representations Each domain is also associated with structure/function information, parsed from the literature. Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function 3D HMM

Fa07CSE182-L8 Sequence databases We talked about DNA and protein sequence databases. Next, we will talk about the connection between DNA and proteins –Proteins are the cellular machinery –DNA is the inherited material, and must contain the code for generating proteins –Solution of the genetic code was one of the great intellectual achievements of the last century

Fa07CSE182-L8 Gene Finding What is a Gene?

Fa07CSE182-L8 Gene We define a gene as a location on the genome that codes for proteins. The genic information is used to manufacture proteins through transcription, and translation. There is a unique mapping from triplets to amino-acids

Fa07CSE182-L8 Eukaryotic gene structure

Fa07CSE182-L8 Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon.

Fa07CSE182-L8 Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon. Given a DNA sequence, how many ways can you translate it?

Fa07CSE182-L8 Gene Features ATG 5’ UTR intron exon 3’ UTR Acceptor Donor splice site Transcription start Translation start

Fa07CSE182-L8 End of L8

Fa07CSE182-L8 Gene identification Eukaryotic gene definitions: –Location that codes for a protein –The transcript sequence(s) that encodes the protein –The protein sequence(s) Suppose you want to know all of the genes in an organism. This was a major problem in the 70s. PhDs, and careers were spent isolating a single gene sequence. All of that changed with better reagents and the development of high throughput methods like EST sequencing

Fa07CSE182-L8 Computational Gene Finding Given Genomic DNA, identify all the coordinates of the gene TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode) ATG 5’ UTR intron exon 3’ UTR Acceptor Donor splice site Transcription start Translation start

Fa07CSE182-L8 Gene Finding: The 1st generation Given genomic DNA, does it contain a gene (or not)? Key idea: The distributions of nucleotides is different in coding (translated exons) and non- coding regions. Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

Fa07CSE182-L8 Coding versus non-coding You are given a collection of exons, and a collection of intergenic sequence. Count the number of occurrences of ATGATG in Introns and Exons. –Suppose 1% of the hexamers in Exons are ATGATG –Only 0.01% of the hexamers in Intergenic are ATGATG How can you use this idea to find genes?

Fa07CSE182-L8 Generalizing AAAAAA AAAAAC AAAAAG AAAAAT IE Compute a frequency count for all hexamers. Exons, Intergenic and the sequence X are all vectors in a multi-dimensional space Use this to decide whether a sequence X is exonic/intergenic X 5 Frequencies (X10 -5 )

Fa07CSE182-L8 A geometric approach (2 hexamers) Plot the following vectors – E= [10, 20] – I = [10, 5] – V 3 = [6, 10] – V 4 = [9, 15] Is V 3 more like E or more like I? E I V3V3

Fa07CSE182-L8 Choosing between Intergenic and Exonic V’ = V/||V|| All vectors have the same length (lie on the unit circle) Next, compute the angle to E, and I. Choose the feature that is ‘closer’ (smaller angle. E I V3V3

Fa07CSE182-L8 Coding versus non-coding signals Fickett and Tung (1992) compared various measures Measures that preserve the triplet frame are the most successful. Genscan uses a 5th order Markov Model

Fa07CSE182-L8 5th order markov chain Pr EXON [AAAAAACGAGAC..] =T[AAAAA,A] T[AAAAA,C] T[AAAAC,G] T[AAACG,A]…… = (20/78) (50/78)………. AAAAAA 20 1 AAAAAC AAAAAG 5 30 AAAAAT 3.. AAAAA A G C AAAAG AAAAC EXON

Fa07CSE182-L8 Scoring for coding regions The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. and intron. In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions

Fa07CSE182-L8 Coding differential for 380 genes

Fa07CSE182-L8 Coding region can be detected Coding Plot the coding score using a sliding window of fixed length. The (large) exons will show up reliably. Not enough to predict gene boundaries reliably

Fa07CSE182-L8 Other Signals GT ATG AG Coding Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise. When combined they can be effective

Fa07CSE182-L8 Combining Signals We can compute the following: –E-score[i,j] –I-score[i,j] –D-score[i] –I-score[i] –Goal is to find coordinates that maximize the total score ij

Fa07CSE182-L8 The second generation of Gene finding Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure. It was not easy to train on many parameters. Guigo & Bursett test revealed that accuracy was still very low. Problem with multiple genes in a genomic region

Fa07CSE182-L8 Combining signals using D.P. An HMM is the best way to model and optimize the combination of signals Here, we will use a simpler approach which is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

Fa07CSE182-L8 Hidden states & gene structure Identifying a gene is equivalent to labeling each nucleotide as E/I/intergenic etc. These ‘labels’ are the hidden states For simplicity, consider only two states E and I IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII i1i1 i2i2 i3i3 i4i4

Fa07CSE182-L8 Gene finding reformulated Given a labeling L, we can score it as I-score[0..i 1 ] + E-score[i 1..i 2 ] + D-score[i 2 +1] + I-score[i i 3 -1] + A-score[i 3 -1] + E-score[i 3..i 4 ] + ……. Goal is to compute a labeling with maximum score. IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII i1i1 i2i2 i3i3 i4i4

Fa07CSE182-L8 Optimum labeling using D.P. (Viterbi) Define V E (i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled E Define V I (i) = Best score of a labeling of the prefix 1..i such that the i-th position is labeled I Why is it enough to compute V E (i) & V I (i) ?

Fa07CSE182-L8 Optimum parse of the gene j i ji

Fa07CSE182-L8 Generalizing Note that we deal with two states, and consider all paths that move between the two states. E I i

Fa07CSE182-L8 Generalizing We did not deal with the boundary cases in the recurrence. Instead of labeling with two states, we can label with multiple states, –E init, E fin, E mid, –I, I G (intergenic) E init I E fin E mid IGIG Note: all links are not shown here

Fa07CSE182-L8 An HMM for Gene structure

Fa07CSE182-L8 Generalized HMMs, and other refinements A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described In standard HMMs, there is an exponential distribution on the duration of time spent in a state. This is violated by many states of the gene structure HMM. Solution is to model these using generalized HMMs.

Fa07CSE182-L8 Length distributions of Introns & Exons

Fa07CSE182-L8 Generalized HMM for gene finding Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

Fa07CSE182-L8 Forward algorithm for gene finding ji qkqk Emission Prob.: Probability that you emitted X i..X j in state q k (given by the 5th order markov model) Forward Prob: Probability that you emitted i symbols and ended up in state q k Duration Prob.: Probability that you stayed in state q k for j-i+1 steps

Fa07CSE182-L8 HMMs and Gene finding Generalized HMMs are an attractive model for computational gene finding –Allow incorporation of various signals –Quality of gene finding depends upon quality of signals.

Fa07CSE182-L8 DNA Signals Coding versus non-coding Splice Signals Translation start

Fa07CSE182-L8 Splice signals GT is a Donor signal, and AG is the acceptor signal GTAG

Fa07CSE182-L8 PWMs Fixed length for the splice signal. Each position is generated independently according to a distribution Figure shows data from > 1200 donor sites AAGGTGAGTCCGGTAAGTGAGGTGAGGTAGGTAAGG

Fa07CSE182-L8 MDD PWMs do not capture correlations between positions Many position pairs in the Donor signal are correlated

Fa07CSE182-L8 Choose the position which has the highest correlation score. Split sequences into two: those which have the consensus at position I, and the remaining. Recurse until

Fa07CSE182-L8 MDD for Donor sites

Fa07CSE182-L8 De novo Gene prediction: Sumary Various signals distinguish coding regions from non-coding HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals. Further improvement may come from improved signal detection

Fa07CSE182-L8 How many genes do we have? Nature Science

Fa07CSE182-L8 Alternative splicing

Fa07CSE182-L8 Comparative methods Gene prediction is harder with alternative splicing. One approach might be to use comparative methods to detect genes Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.

Fa07CSE182-L8 Comparative gene finding tools Procrustes/Sim4: mRNA vs. genomic Genewise: proteins versus genomic CEM: genomic versus genomic Twinscan: Combines comparative and de novo approach.

Fa07CSE182-L8 Databases RefSeq and other databases maintain sequences of full-length transcripts. We can query using sequence.