Multiple Species Gene Finding using Gibbs Sampling Sourav Chatterji Lior Pachter University of California, Berkeley.

Presentation transcript:

Multiple Species Comparative Gene Finding (with Alignment): McAuliffe et al. (2004), Siepel et al. (2004)

Multiple Species Comparative Gene Finding (without Alignment)

Gibbs Sampling for Biological Sequence Analysis
Introduced by Lawrence et al.
  Motif Detection
Extensions
  Multiple Motifs in a Sequence
  Multiple Types of Motifs
Applications
  Alignment
  Linkage Analysis

Gibbs Sampling
Aim: To sample from the joint distribution $p(x_1, x_2, \ldots, x_n)$ when it is easy to sample from the conditional distributions $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ but not from the joint distribution.
Method: Iteratively sample $x_i^t$ from the conditional distribution $p(x_i \mid x_1^t, \ldots, x_{i-1}^t, x_{i+1}^{t-1}, \ldots, x_n^{t-1})$.
Theorem: For discrete distributions, the distribution of $(x_1^t, x_2^t, \ldots, x_n^t)$ converges to $p(x_1, x_2, \ldots, x_n)$.
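The method above can be illustrated with a minimal sketch in Python. The joint distribution `JOINT` over two binary variables is a made-up example, not something from the slides; the sampler only ever draws from the one-variable conditionals, and the empirical frequencies of the visited states should approach the joint probabilities.

```python
import random
from collections import Counter

# Hypothetical joint distribution p(x1, x2) over two binary variables.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_conditional(i, state, rng):
    """Draw x_i from p(x_i | other variable), read off the joint table."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        weights.append(JOINT[tuple(candidate)])
    # random.choices normalizes the weights, which yields the conditional.
    return rng.choices((0, 1), weights=weights)[0]

def gibbs(num_sweeps, burn_in=1000, seed=0):
    """Run the sampler and return empirical state frequencies after burn-in."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = Counter()
    for sweep in range(num_sweeps + burn_in):
        for i in range(2):  # update x1 using old x2, then x2 using new x1
            state[i] = sample_conditional(i, state, rng)
        if sweep >= burn_in:
            counts[tuple(state)] += 1
    return {key: counts[key] / num_sweeps for key in JOINT}

estimate = gibbs(50_000)
for key in sorted(JOINT):
    print(key, JOINT[key], round(estimate[key], 3))
```

The burn-in length and the number of sweeps are arbitrary choices for this toy case; in a real application they would need to be checked against convergence diagnostics.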

tt ss Connection to HMMs Z1Z1 Y1Y1 Z2Z2 YmYm ZmZm Y2Y2 ss ss tt tt   t = output probabilities   s = transition probabilities  Difficult to sample from P(  Z | Y)  Easy to sample  from P(  | Z,Y)  Easy to sample Z from P(Z | ,Y)

Gibbs Sampling for Gene Finding

Initial Predictions

Gibbs Sampling for Gene Finding: Sample $Z_1$ from $P(Z_1 \mid Z_{[-1]}, Y)$

Gibbs Sampling for Gene Finding: Sample $Z_2$ from $P(Z_2 \mid Z_{[-2]}, Y)$

Additional Details
Issues in the Gibbs Sampling Method
  Gibbs sampling assumes the sequences are generated independently by an HMM; the method needs to be generalized to a tree topology.
  Learn parameters from a subset of sequences roughly equidistant from each other: human, mouse, dog and cow.
  Things get messy when there are multiple genes; multiple sets of parameters need to be handled.
Make use of an approximate alignment
Boost scores using a phyloHMM model

Results
2060 exons predicted
Exon level Sensitivity: 23.2%
Exon level Specificity: 46.7%
28.5% of predicted exons partially overlap with true exons.
Nucleotide level Sensitivity: 42.8%
Nucleotide level Specificity: 82.1%

Results
Nucleotide level results are much better than exon level results
  Need for better splice site models, probably multiple-species splice site models.
Low Sensitivity
  Is it the alignment?

Analysis of results (novel genes)
Statistics of transcripts overlapping with novel VEGA genes
223 exons predicted
Exon level Sensitivity: 24.8% (78 of 315 true exons are predicted correctly)
Exon level Specificity: 35.0% (78 of the 223 predicted exons are correct)
Additionally, 24.7% of predicted exons partially overlap with the true exons.
Nucleotide level Sensitivity: 56.6%
Nucleotide level Specificity: 62.9%
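The exon level figures for the novel genes can be reproduced from the counts given on the slide. The short sketch below uses a helper name, `exon_level_metrics`, that is purely illustrative. Note that "specificity" is used here as the slide uses it: the fraction of predicted exons that are correct.

```python
def exon_level_metrics(correct, true_total, predicted_total):
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = correctly predicted exons / true exons
    specificity = correctly predicted exons / predicted exons
    """
    sensitivity = correct / true_total
    specificity = correct / predicted_total
    return sensitivity, specificity

# Counts from the slide: 78 correct exons, 315 true exons, 223 predicted exons.
sn, sp = exon_level_metrics(correct=78, true_total=315, predicted_total=223)
print(f"Exon level Sensitivity: {sn:.1%}")  # 78 / 315
print(f"Exon level Specificity: {sp:.1%}")  # 78 / 223
```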