Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.

Outline. Gene finding using HMMs. Adding trees to HMMs: phyloHMM, N-SCAN. BLAST + gene finding: SGP2. Examples.

Markov Sequence Models. Key: distinguish coding from non-coding statistics. Popular models: 6-mers (5th-order Markov model); homogeneous or non-homogeneous (reading-frame specific). Not sensitive enough for eukaryotic genes: exons are too short, and splice junctions are poorly detected.
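The coding versus non-coding comparison above can be sketched as a log-odds score between two 5th-order Markov models. This is a minimal, homogeneous (not frame-specific) illustration; the function names, the add-one smoothing, and the restriction to the alphabet ACGT are assumptions made for the sketch, not part of the recitation.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5  # 5th-order model: each base depends on the previous 5 (6-mer counts)

def train(seqs, order=ORDER, pseudo=1.0):
    """Estimate P(base | previous `order` bases) with pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudo
        model[ctx] = {b: (counts[ctx][b] + pseudo) / total for b in "ACGT"}
    return model

def log_odds(seq, coding, noncoding, order=ORDER):
    """Sum of log P_coding / P_noncoding over positions; > 0 favours coding."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding[ctx][base] / noncoding[ctx][base])
    return score
```

A window whose score is positive looks more like the coding training set. Short exons contribute few terms to the sum, which is one way to see why this signal alone is weak for eukaryotic genes.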

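Length Distribution (CG © Ron Shamir). Simple HMMs can only encode geometric length distributions. The length of each exon (intron) is geometric: a state with self-transition probability p emits a run of length k with probability p^(k-1)(1-p). [Figure: two-state HMM with exon and intron states and transition probabilities p, q, 1-p, 1-q.]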
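To make the geometric claim concrete, here is a small sketch that computes and samples the run length of a single HMM state. The name p_stay for the self-transition probability is an assumption of the sketch.

```python
import random

def geometric_pmf(k, p_stay):
    """P(run length == k) for a state with self-transition probability p_stay."""
    return p_stay ** (k - 1) * (1 - p_stay)

def sample_run_length(p_stay, rng):
    """Stay in the state with probability p_stay after each emitted character."""
    length = 1
    while rng.random() < p_stay:
        length += 1
    return length
```

The probabilities decrease monotonically in k, so the most likely run length is always 1, and the mean is 1 / (1 - p_stay). No choice of p_stay produces a distribution that peaks at a typical exon length.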

Exon Length Distribution. The length distribution of introns is approximately geometric. For exons it is not, because exon length is also constrained by splicing itself. Too short (under 50 bp): the spliceosomes have no room. Too long (over 300 bp): the ends have trouble finding each other. As usual, there are exceptions. A different model is needed for exons.

Generalized HMM (Burge & Karlin, J. Mol. Biol.). Instead of a single character, each state emits a sequence whose length follows some distribution.

Generalized HMM (Burge & Karlin, J. Mol. Biol.). Overview: hidden Markov states q_1, …, q_n. State q_i has output length distribution f_i. The output of each state can have a separate probabilistic model (weight matrix model, HMM, …). Initial state probability distribution π. State transition probabilities T_ij.
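The components listed above can be read as a generative procedure: pick a state, draw a length from its length distribution, emit that many characters from its output model, then move to the next state. The sketch below follows that reading with a toy two-state model; the state names, the uniform 50 to 300 exon length, and the i.i.d. emitters are illustrative assumptions, not GenScan's parameters.

```python
import random

def sample_ghmm(states, init, trans, n_segments, rng):
    """states[name] = (length_sampler, emitter). Returns (segments, sequence)."""
    names = list(init)
    state = rng.choices(names, weights=[init[s] for s in names])[0]
    segments, chunks = [], []
    for _ in range(n_segments):
        length_sampler, emitter = states[state]
        d = length_sampler(rng)            # draw the segment length from f_i
        chunks.append(emitter(d, rng))     # emit d characters from the state's model
        segments.append((state, d))
        nxt = list(trans[state])
        state = rng.choices(nxt, weights=[trans[state][s] for s in nxt])[0]
    return segments, "".join(chunks)

def exon_length(rng):
    return rng.randint(50, 300)            # toy non-geometric length distribution

def intron_length(rng):
    n = 1
    while rng.random() < 0.99:             # geometric length, as in a plain HMM
        n += 1
    return n

def iid_emitter(probs):
    bases, weights = zip(*probs.items())
    return lambda d, rng: "".join(rng.choices(bases, weights=weights, k=d))

STATES = {
    "exon": (exon_length, iid_emitter({"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2})),
    "intron": (intron_length, iid_emitter({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})),
}
INIT = {"intron": 1.0}
TRANS = {"intron": {"exon": 1.0}, "exon": {"intron": 1.0}}
```

Calling sample_ghmm(STATES, INIT, TRANS, 6, random.Random(0)) returns six alternating intron and exon segments together with the concatenated sequence.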

GenScan Model (Burge & Karlin, JMB 97). [Figure: GenScan state diagram.]

GenScan model. States = functional units on a gene. The allowed transitions ensure that the order is biologically consistent. Because an intron may cut a codon, one must keep track of the reading frame, hence the three intron phases: phase I_0, introns between codons; phase I_1, introns that start after the 1st base of a codon; phase I_2, introns that start after the 2nd base.
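The phase bookkeeping can be illustrated with a few lines of arithmetic: the phase of an intron is the number of coding bases seen so far, modulo 3. The function name and the input convention (lengths of the coding portions of consecutive exons, starting at the start codon) are assumptions of this sketch.

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon except the last one."""
    phases, total = [], 0
    for length in coding_exon_lengths[:-1]:
        total += length
        phases.append(total % 3)  # 0: between codons, 1: after 1st base, 2: after 2nd
    return phases
```

For example, with coding exon lengths 100, 50, and 30, the first intron interrupts a codon after its 1st base (100 mod 3 = 1) and the second falls between codons (150 mod 3 = 0).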

Phylogenetic HMMs. Due to Siepel and Haussler. A simple gene-finding HMM looks at a single Markov process along the sequence: each position depends on the previous position. If we incorporate sequences from multiple organisms, we can look at another process along the tree: each position depends on its ancestor.

Phylogenetic HMMs. A simple HMM can be thought of as a machine that generates a sequence: every state emits a single character, with a multinomial distribution at every state. A phyloHMM generates an MSA: every state emits a single MSA column, with a phylogenetic model at every state.

Phylogenetic HMMs

Phylogenetic models in phyloHMM. Each model defines a stochastic process of substitution. Every position is independent. The following process occurs: a character is assigned to the root; character substitutions occur along the branches, based on a substitution matrix and on the branch lengths; the characters at the leaves of the tree correspond to the MSA column.
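The generative process just described can be sketched by sampling one alignment column down a tree. The sketch swaps in the Jukes-Cantor model with uniform background frequencies as the simplest possible substitution model; that choice, the tree encoding (a leaf is a name, an internal node is a list of (child, branch length) pairs), and all function names are assumptions for illustration.

```python
import math
import random

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor P(b at child | a at parent) for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def evolve(base, t, rng):
    """Sample the character at the end of a branch of length t."""
    return rng.choices(BASES, weights=[jc_prob(base, b, t) for b in BASES])[0]

def sample_column(tree, rng, base=None):
    """Return {leaf name: character} for one MSA column."""
    if base is None:
        base = rng.choice(BASES)           # root character from a uniform background
    if isinstance(tree, str):              # leaf: report the character
        return {tree: base}
    column = {}
    for child, t in tree:                  # internal node: evolve along each branch
        column.update(sample_column(child, rng, evolve(base, t, rng)))
    return column

# Hypothetical three-species tree, used only as an example.
TREE = [("human", 0.1), ([("mouse", 0.3), ("rat", 0.3)], 0.2)]
```

Shorter branches, or a slower rate matrix, give columns in which the leaves agree more often, which is how a state for conserved sequence differs from a state for neutral sequence.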

Phylogenetic models in phyloHMM. Different models for different states. Different substitution rates: e.g., in exons we see fewer substitutions. Different patterns of substitution: e.g., third-position bias in coding sequences. Different tree topologies: e.g., following recombination.

Formally, a phyloHMM consists of: S, the set of states; Ψ, the phylogenetic models (instead of E in a standard HMM); A, the state transitions; b, the initial probabilities.

Formally, each phylogenetic model consists of: Q, the substitution rate matrix (e.g., derived from PAM); π, the background frequencies; τ, the phylogenetic tree; β, the branch lengths.

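Formally. P(X_i | ψ_i): the probability of a column X_i being emitted by the model ψ_i. It can be computed efficiently by Felsenstein's “pruning algorithm” (recitation 6). The joint probability of a path in the HMM and an alignment X is the product, along the path, of the initial probability, the transition probabilities, and the column probabilities, exactly as in a standard HMM. Viterbi, forward-backward, etc. work as usual.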
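As a reminder of how the column probability is computed, here is a compact sketch of the pruning recursion. It again assumes the Jukes-Cantor model with uniform background frequencies rather than a general rate matrix Q, treats a gap character as missing data, and uses the same hypothetical tree encoding (leaf = name, internal node = list of (child, branch length) pairs).

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor P(b at child | a at parent) for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(tree, column):
    """Return {base: P(observed leaves below this node | node carries base)}."""
    if isinstance(tree, str):
        obs = column[tree]
        # Assumption: a gap '-' is treated as missing data (likelihood 1 for any base).
        return {b: 1.0 if obs == "-" or obs == b else 0.0 for b in BASES}
    like = {b: 1.0 for b in BASES}
    for child, t in tree:
        child_like = partial(child, column)
        for a in BASES:
            like[a] *= sum(jc_prob(a, b, t) * child_like[b] for b in BASES)
    return like

def column_likelihood(tree, column, background=None):
    """P(column | tree, model): sum over root characters weighted by background."""
    background = background or {b: 0.25 for b in BASES}
    root = partial(tree, column)
    return sum(background[b] * root[b] for b in BASES)
```

Each node is visited once and does a constant amount of work per pair of characters, so the cost is linear in the number of species instead of exponential in the number of internal nodes.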

Simple phylo-gene-finder. [Figure: state diagram; surviving labels: non-coding, 3rd position.] If the parameters are known, Viterbi can be used to find the most probable path, i.e., a segmentation into coding and non-coding regions.
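The segmentation step can be sketched with a standard log-space Viterbi that takes, for each alignment column, the log probability of that column under each state's model. The argument names and the dictionary-based layout are assumptions of the sketch; how the per-column values are obtained (for example by the pruning algorithm) is left outside it.

```python
import math

def viterbi(log_emit, log_init, log_trans):
    """log_emit[i][s] = log P(column i | state s). Returns the most probable path."""
    states = list(log_init)
    score = {s: log_init[s] + log_emit[0][s] for s in states}
    back = []
    for col in log_emit[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            # Best predecessor of state s at this column.
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + col[s]
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With two states, one for coding and one for non-coding columns, the returned path is the segmentation: maximal runs of the coding state are the predicted coding regions.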

Phylo-gene-finder is a good idea. Use of phylogeny is important: it imposes structure on the substitutions, and it weights different pairs differently based on their evolutionary distance.

N-SCAN. Another phylogeny-HMM gene finder. A GHMM that emits MSA columns. Annotates one sequence at a time: the target sequence. Distinguishes between the target sequence (T) and the other, informative sequences (Is), which may contain gaps. States correspond to sequence types in the target sequence.

N-SCAN. Uses a Bayesian network instead of a simple evolutionary model. Accounts for: 5’ UTRs; conserved non-coding sequence, which is highly conserved but has no “coding” features.

Transforming the phylogenetic tree into a BN

SGP-2. Drawback of the approaches described so far: they require a meaningful alignment. This is impossible if one of the genomes is not yet finished. An alignment is not necessarily “correct”.

SGP-2. A framework that works on two genomes. Idea: use BLAST to identify which positions are more or less conserved, and feed the BLAST scores into the gene-finding HMM. The BLAST results serve to modify the scores of the exons.

SGP-2

SGP-2 is an extension of GENEID. GENEID (single genome): 5th-order HMM. Exons are scored based on HMM scores plus some additional features. A spliced alignment is used to assemble genes from exons.

Scoring exons. Ideally, we would like to score exons based on their alignment score. Here a BLAST variant is used: the extent of the highest-scoring HSP overlapping the exon serves as a proxy for its “conservation”. Problem: if the genome is not assembled, different parts of an exon will have HSPs with different parts of the genome. Solution: consider the HSPs covering different parts of the exon and unify them into a single exon score. Final exon score: [formula missing in transcript].
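One plausible way to unify several HSPs into a single exon score is sketched below. It is not claimed to be SGP-2's actual formula, which did not survive in this transcript. The sketch projects all HSPs onto the query, keeps the best per-base score density at each position, and adds a weighted sum of that track over the exon to the single-genome score. The function names, the half-open (start, end, score) tuples, and the weight w are hypothetical.

```python
def project_hsps(hsps, length):
    """Per-position best HSP score density; hsps are (start, end, score), end exclusive."""
    track = [0.0] * length
    for start, end, score in hsps:
        density = score / (end - start)
        for i in range(max(start, 0), min(end, length)):
            track[i] = max(track[i], density)   # keep the best-scoring HSP per position
    return track

def exon_score(hmm_score, exon, track, w=0.2):
    """Hypothetical combination: single-genome score plus weighted conservation."""
    start, end = exon
    return hmm_score + w * sum(track[start:end])
```

Because each position takes its value from whichever HSP covers it best, an exon whose two halves hit different contigs of the other genome still receives credit for both halves.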

BACH1

OLIG2

PPM1A

Summary. Different approaches for gene finding. Adding phylogeny generally helps. But what about genes or exons that are specific to humans? Ape genomes are (almost) unavailable and too similar. Phylogenetic help is almost essential in more difficult problems: motif finding (promoter analysis); ultraconserved regions with no evident function.