Gibbs Sampler in Local Multiple Alignment Review by 온 정 헌.

Topics
1. The Gibbs sampler algorithm in multiple sequence alignment (how the mechanism works) (Lawrence et al., Science 1993; J. Liu et al., JASA 1995)
2. A brief review of the Bayesian missing data problem and the Gibbs sampler (the background against which it was developed)

Aim of Local MSA
Functionally important sequences are usually conserved (conserved sequences).
Locate relatively short patterns shared by otherwise dissimilar sequences.
Examples:
1. Finding regulatory motifs in the upstream sequences of regulons
2. Finding interaction motifs through alignment of protein sequences

Local MSA methods
EM (MEME)
Gibbs sampler (AlignACE, Gibbs Motif Sampler)
HMM (HMMER)
MACAW (ftp://ncbi.nlm.nih.gov/pub/macaw)

Gibbs Sampler Algorithm In Practice (Predictive Update version of Gibbs Sampler)

Problem Description
Given a set of N sequences S_1, …, S_N of lengths n_k (k = 1, …, N).
Identify a single pattern of fixed width W within each of the N input sequences.
A = {a_k} (k = 1, …, N): the set of starting positions for the common pattern within each sequence; a_k = 1, …, n_k − W + 1.
Objective: to find the "best," defined as the most probable, common pattern.

Algorithm - Initialization (1)
Choose random starting positions {a_k} within the various sequences.
A = {a_k} (k = 1, …, N): the set of starting positions for the common pattern within each sequence; a_k = 1, …, n_k − W + 1.

Algorithm - Predictive Update (2)
One of the N sequences, z, is chosen either at random or in a specified order. The pattern description q_ij and the background frequencies q_0j are then calculated, excluding z.

q_ij (i = 1, …, W; j = A, T, G, C):
q_ij = (c_ij + b_j) / (N − 1 + B)
c_ij = the count of base j at motif position i
b_j = residue-dependent "pseudocounts"
B = the sum of the b_j
q_0j: calculated analogously, with counts taken over all non-motif positions.
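As a rough, minimal Python sketch of this profile calculation (not the published implementation; 0-based start positions and a pseudocount of 0.5 per base are assumed choices):

```python
import numpy as np

BASES = "ATGC"

def profile_and_background(seqs, starts, W, z, b=None):
    """Compute the motif profile q[i, j] and background q0[j], excluding
    sequence z (the held-out sequence of the predictive update step).

    seqs   : list of N DNA strings
    starts : current motif start positions a_k (0-based), one per sequence
    W      : motif width
    z      : index of the held-out sequence
    b      : pseudocounts per base (an assumed choice, e.g. 0.5 each)
    """
    if b is None:
        b = {base: 0.5 for base in BASES}
    B = sum(b.values())
    keep = [k for k in range(len(seqs)) if k != z]

    # q_ij = (c_ij + b_j) / (N - 1 + B), where c_ij counts base j at motif position i
    q = np.zeros((W, len(BASES)))
    for i in range(W):
        for j, base in enumerate(BASES):
            c_ij = sum(seqs[k][starts[k] + i] == base for k in keep)
            q[i, j] = (c_ij + b[base]) / (len(keep) + B)

    # q_0j is computed analogously, with counts taken over all non-motif positions
    bg_counts = np.array([b[base] for base in BASES], dtype=float)
    n_bg = 0
    for k in keep:
        for pos, ch in enumerate(seqs[k]):
            if ch in BASES and not (starts[k] <= pos < starts[k] + W):
                bg_counts[BASES.index(ch)] += 1
                n_bg += 1
    q0 = bg_counts / (n_bg + B)
    return q, q0
```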

XXXXXXXXXXXXX  Z XXXXXXMXXXX XXXXMXXXXXXX XXXXXXMXXX XXXXXXXXMX XXMXXXXXXX AGGAGCAAGA ACATCCAAGT TCATGATAGT TGTAATGTCA AATGTGGTCA N=6, W=10 q 1A = 3/5, q 2G = 2/5, …( 편의상 pseudocount 제외 ) q 1G = 0 ( 문제, pseudocount 꼭 필요 )

Resulting parameters
Q (4 × W):
q_1A q_2A … q_WA
q_1T q_2T … q_WT
q_1G q_2G … q_WG
q_1C q_2C … q_WC
Q_0 (4 × 1): q_0A, q_0T, q_0G, q_0C
A = {a_1, a_2, …, a_N}

Algorithm - Sampling step (3)
Every possible segment of width W within sequence z is considered.
The probability Q_x (B_x) of generating each segment x according to q_ij (q_0j) is calculated.
The weight A_x = Q_x / B_x is assigned to segment x, and with each segment so weighted, a random one is selected. Its position then becomes the new a_z.
Steps (2) and (3) are then iterated.

XXXXXXXXXXXX  Z XXXXXMXXXX XXMXXXXXXX XXXXXXMXXX XXXXXXXXMX XXMXXXXXXX ATGGCTAAGCCATTAATCGC q 3G X q 4G X q 5C X q 6T X q 7A X q 8A X q 9G X q 10C X q 11C X q 12A q 0G x q 0G x q 0C x q 0T x q 0A x q 0A x q 0G x q 0C x q 0C x q 0A A X = Qx/Bx = Select a set of a k ’s that maximizes the product of these ratios, or F F = Σ 1≤i≤W Σ j ∈ {A,T,G,C} c i,j log(q ij /q 0j )

Interim review: the Bayesian missing data problem and the Gibbs Sampler

Simulation
Given a joint pdf f(x, y, z), how can we compute E[X]?
In principle, f(x) = ∫∫ f(x, y, z) dz dy and E[X] = ∫ x f(x) dx.
But what if f(x) = ∫∫ f(x, y, z) dz dy is hard to compute? Simulation (sampling): generate x_1, x_2, x_3, …, x_1000 and approximate E[X] ≈ (1/m) Σ x_i.
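A toy illustration of this Monte Carlo idea, with an assumed trivariate normal standing in for f(x, y, z) so that the exact answer E[X] = 1 is known:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy joint density f(x, y, z): a trivariate normal with mean (1, 0, 0),
# so the exact answer is E[X] = 1.
m = 1000
samples = rng.multivariate_normal([1.0, 0.0, 0.0], np.eye(3), size=m)

# Instead of integrating out y and z analytically, approximate
# E[X] by (1/m) * sum of the sampled x_i.
print(samples[:, 0].mean())   # close to 1
```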

MCMC
High-dimensional joint densities are completely characterized by lower-dimensional conditional densities; equivalently, a big/hard problem can be broken down into an interrelated series of smaller/easier problems.
P(x_1, x_2, …, x_n) = P(x_n | x_{n−1}, …, x_1) P(x_{n−1} | x_{n−2}, …, x_1) … P(x_2 | x_1) P(x_1)
                    = P(x_n | x_{n−1}) P(x_{n−1} | x_{n−2}) … P(x_2 | x_1) P(x_1)   (Markov chain)
Methods such as the Metropolis-Hastings algorithm and the Gibbs sampler solve the problem by constructing a Markov chain: simulate a Markov chain that converges in distribution to the posterior distribution.

Gibbs Sampler (two-component)
If we want to sample from (X, Y):
Choose Y_0, set t = 0;
Generate X_t ~ f(x | y_t);
Generate Y_{t+1} ~ f(y | x_t);
Set t = t + 1 and iterate.
(Y_0, X_0), (Y_1, X_1), (Y_2, X_2), …, (Y_k, X_k), … → π(Y, X), the invariant (stationary) distribution.
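A minimal sketch of this two-component scheme for an assumed toy target (not from the original slides): a standard bivariate normal with correlation rho, whose conditionals are both normal:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    """Two-component Gibbs sampler for a standard bivariate normal with
    correlation rho: alternately draw X_t ~ f(x | y_t) and Y_{t+1} ~ f(y | x_t).
    Both conditionals are N(rho * other, 1 - rho**2)."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0            # arbitrary starting point (Y_0, X_0)
    chain = []
    for _ in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # X_t ~ f(x | y_t)
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # Y_{t+1} ~ f(y | x_t)
        chain.append((x, y))
    return np.array(chain)

# After a burn-in, the draws behave like samples from the invariant
# distribution: their empirical correlation approaches rho.
draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[1000:].T)[0, 1])
```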

The two-dimensional case (x_1, x_2)

Gibbs Sampler

A brief aside: Bayes' theorem. Posterior ∝ Likelihood × Prior

Bayesian missing data problem
θ: parameter of interest.
X = {x_1, …, x_n}: a set of complete i.i.d. observations from a density that depends on θ: π(X | θ).
π(θ | X) = Π_{i=1,…,n} π(x_i | θ) π(θ) / π(X)
In practical situations, x_i may not be completely observed. Assuming the unobserved values are missing completely at random, let X = (Y, Z), x_i = (y_i, z_i), i = 1, …, n, where y_i is the observed part and z_i the missing part.
π(θ | Y) = ∫ π(θ | Y, Z) π(Z | Y) dZ → imputation

Multiple values Z^(1), …, Z^(m) are drawn from π(Z | Y) to form m complete data sets. With these imputed data sets and the ergodicity theorem,
π(θ | Y) ≈ (1/m) {π(θ | Y, Z^(1)) + … + π(θ | Y, Z^(m))}.
But in most applied problems it is impossible to draw Z from π(Z | Y) directly.

Tanner and Wong's data augmentation (DA), which applies the Gibbs sampler to draw multiple θ's and multiple Z's jointly from π(θ, Z | Y), copes with this problem by evolving a Markov chain: by iterating between drawing θ from π(θ | Y, Z) and drawing Z from π(Z | θ, Y), DA constructs a Markov chain whose equilibrium distribution is π(θ, Z | Y).
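A toy sketch of DA on an assumed conjugate coin-flip model (Beta(1,1) prior, a few flips missing at random), purely to illustrate the alternation between π(θ | Y, Z) and π(Z | θ, Y); in this toy case exact sampling would also be possible:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy missing-data problem: coin with unknown theta, Beta(1, 1) prior;
# y = observed flips, z = 3 flips whose outcomes are missing at random.
y = np.array([1, 0, 1, 1, 0, 1, 1])
n_missing, n_iter = 3, 5000

theta, thetas = 0.5, []
z = rng.binomial(1, theta, size=n_missing)      # arbitrary starting Z
for _ in range(n_iter):
    # Draw theta from pi(theta | Y, Z): Beta(1 + successes, 1 + failures)
    succ = y.sum() + z.sum()
    theta = rng.beta(1 + succ, 1 + len(y) + n_missing - succ)
    # Draw Z from pi(Z | theta, Y): independent Bernoulli(theta)
    z = rng.binomial(1, theta, size=n_missing)
    thetas.append(theta)

# The chain's equilibrium distribution is pi(theta, Z | Y); the theta draws
# (after burn-in) approximate pi(theta | Y).
print(np.mean(thetas[500:]))
```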

Collapsed Gibbs Sampler (J. Liu)
Consider sampling from π(θ | D), θ = (θ_1, θ_2, θ_3).
Original Gibbs sampler vs. collapsed Gibbs sampler (J. Liu): the collapsed version is easier to compute.

Back to multiple sequence alignment: how does it relate to the Bayesian missing data problem…?

Resulting parameters, revisited
Q (4 × W):
q_1A q_2A … q_WA
q_1T q_2T … q_WT
q_1G q_2G … q_WG
q_1C q_2C … q_WC
Q (with Q_0) is the parameter of interest.
Q_0 (4 × 1): q_0A, q_0T, q_0G, q_0C
A = {a_1, a_2, …, a_N}: the missing data!
B = the given sequences: the observed data!

Revisiting: by iterating between drawing Q from π(Q | A, B) and drawing A from π(A | Q, B), DA constructs a Markov chain whose equilibrium distribution is π(Q, A | B). The collapsed Gibbs sampler instead works directly with π(A | B).

Original Gibbs sampler algorithm
Step 0. Choose an arbitrary starting point A_0 = (a_{1,0}, a_{2,0}, …, a_{N,0}, Q_0).
Step 1. Generate A_{t+1} = (a_{1,t+1}, a_{2,t+1}, …, a_{N,t+1}, Q_{t+1}) as follows:
Generate a_{1,t+1} ~ π(a_1 | a_{2,t}, …, a_{N,t}, Q_t, B);
Generate a_{2,t+1} ~ π(a_2 | a_{1,t+1}, a_{3,t}, …, a_{N,t}, Q_t, B);
…
Generate a_{N,t+1} ~ π(a_N | a_{1,t+1}, a_{2,t+1}, …, a_{N−1,t+1}, Q_t, B);
Generate Q_{t+1} ~ π(Q | a_{1,t+1}, a_{2,t+1}, …, a_{N,t+1}, B).
Step 2. Set t = t + 1 and go to step 1.

Collapsed Gibbs sampler
Step 0. Choose an arbitrary starting point A_0 = (a_{1,0}, a_{2,0}, …, a_{N,0}).
Step 1. Generate A_{t+1} = (a_{1,t+1}, a_{2,t+1}, …, a_{N,t+1}) as follows:
Generate a_{1,t+1} ~ π(a_1 | a_{2,t}, …, a_{N,t}, B);
Generate a_{2,t+1} ~ π(a_2 | a_{1,t+1}, a_{3,t}, …, a_{N,t}, B);
…
Generate a_{N,t+1} ~ π(a_N | a_{1,t+1}, a_{2,t+1}, …, a_{N−1,t+1}, B);
here Q has been integrated out; if a profile estimate is needed, Q_{t+1} ~ π(Q | a_{1,t+1}, a_{2,t+1}, …, a_{N,t+1}, B) can still be drawn.
Step 2. Set t = t + 1 and go to step 1.

Predictive Update Version?
Predictive distribution π(A | B), with A_[-k] = {a_1, a_2, …, a_{k−1}, a_{k+1}, …, a_N} and B the sequences.
Computing π(a_k = x | A_[-k], B) (an integration over Q) gives
π(a_k = x | A_[-k], B) ∝ Π_{1≤i≤W} (q_ij / q_0j),
where j is the base at position i of the candidate segment starting at x. This is why we calculated A_x.
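Putting the pieces together, a minimal sketch of the resulting predictive update loop; it relies on the hypothetical profile_and_background() and sample_new_position() helpers from the earlier sketches and is not the published implementation:

```python
import numpy as np

def gibbs_motif_sampler(seqs, W, n_iter=500, seed=0):
    """Predictive-update (collapsed) Gibbs sampler for motif start positions:
    Q is never sampled explicitly; each a_k is redrawn from pi(a_k | A[-k], B)."""
    rng = np.random.default_rng(seed)
    # (1) Initialization: random starting positions a_k
    starts = [int(rng.integers(0, len(s) - W + 1)) for s in seqs]
    for _ in range(n_iter):
        for z in range(len(seqs)):
            # (2) Predictive update: profile from the other N-1 sequences
            q, q0 = profile_and_background(seqs, starts, W, z)
            # (3) Sampling step: redraw a_z with weights A_x = Q_x / B_x
            starts[z] = sample_new_position(seqs[z], q, q0, W, rng)
    return starts

# Hypothetical usage:
# starts = gibbs_motif_sampler(list_of_dna_strings, W=10)
```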

Phase shift
The sampler can become trapped in an alignment that is shifted relative to the best one. To avoid this situation, after every Mth iteration, for example, one may compare the current set of a_k with the sets shifted left and right by up to a certain number of letters. Probability ratios are calculated for all of these configurations, and a random selection is made among them with the corresponding weights.
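A rough sketch of such a phase-shift move, with a hypothetical score() callable standing in for the probability ratios (e.g. the F statistic above):

```python
import numpy as np

def phase_shift(seqs, starts, W, max_shift, rng, score):
    """Every Mth iteration: compare the current set of a_k with the same set
    shifted left/right by up to max_shift letters, and pick one configuration
    at random with weight proportional to its score. `score` is a hypothetical
    callable, score(seqs, starts, W) -> positive weight."""
    candidates, weights = [], []
    for d in range(-max_shift, max_shift + 1):
        shifted = [a + d for a in starts]
        # keep only shifts for which every window still fits in its sequence
        if all(0 <= a <= len(s) - W for a, s in zip(shifted, seqs)):
            candidates.append(shifted)
            weights.append(score(seqs, shifted, W))
    probs = np.array(weights, dtype=float)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```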

Convergence behavior

Features included in AlignACE
-Automatic detection of variable pattern widths
-Multiple motif instances per input sequence
-Both strands are now considered
-The near-optimum sampling method was improved
-The model for base background frequencies was fixed to the background nucleotide frequencies of the genome being considered

Summary
Lawrence, Liu, Neuwald: the collapsed Gibbs sampler algorithm in multiple sequence alignment (1993, 1994, 1995).
More recently: many motif search programs based on the Gibbs sampler (Gibbs Motif Sampler, AlignACE, …).
Dempster: the EM algorithm, 1977.
Tanner & Wong: posterior computation via data augmentation (using the Gibbs sampler), 1987, JASA.