Gibbs sampling.

Slides:



Advertisements
Similar presentations
Motivating Markov Chain Monte Carlo for Multiple Target Tracking
Advertisements

Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
Hidden Markov Model in Biological Sequence Analysis – Part 2
Bayesian Estimation in MARK
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Gibbs Sampling Qianji Zheng Oct. 5th, 2010.
EM algorithm and applications. Relative Entropy Let p,q be two probability distributions on the same sample space. The relative entropy between p and.
CHAPTER 16 MARKOV CHAIN MONTE CARLO
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
. PGM: Tirgul 8 Markov Chains. Stochastic Sampling  In previous class, we examined methods that use independent samples to estimate P(X = x |e ) Problem:
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.
Class 3: Estimating Scoring Rules for Sequence Alignment.
Probabilistic methods for phylogenetic trees (Part 2)
Finding Regulatory Motifs in DNA Sequences
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Hidden Markov Models for Sequence Analysis 4
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Finding Scientific topics August , Topic Modeling 1.A document as a probabilistic mixture of topics. 2.A topic as a probability distribution.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Sampling Approaches to Pattern Extraction
Construction of Substitution Matrices
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Gibbs Sampler in Local Multiple Alignment Review by 온 정 헌.
Sample variance and sample error We learned recently how to determine the sample variance using the sample mean. How do we translate this to an unbiased.
Gibbs sampling for motif finding Yves Moreau. 2 Overview Markov Chain Monte Carlo Gibbs sampling Motif finding in cis-regulatory DNA Biclustering microarray.
Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
Lecture #9: Introduction to Markov Chain Monte Carlo, part 3
CS246 Latent Dirichlet Analysis. LSI  LSI uses SVD to find the best rank-K approximation  The result is difficult to interpret especially with negative.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
1 Learning P-maps Param. Learning Graphical Models – Carlos Guestrin Carnegie Mellon University September 24 th, 2008 Readings: K&F: 3.3, 3.4, 16.1,
1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.
Transcription factor binding motifs (part II) 10/22/07.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
CS498-EA Reasoning in AI Lecture #19 Professor: Eyal Amir Fall Semester 2011.
Oliver Schulte Machine Learning 726
Oliver Schulte Machine Learning 726
MCMC Output & Metropolis-Hastings Algorithm Part I
A Very Basic Gibbs Sampler for Motif Detection
Intro to Sampling Methods
Model Inference and Averaging
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Bayes Net Learning: Bayesian Approaches
Learning Sequence Motif Models Using Expectation Maximization (EM)
Omiros Papaspiliopoulos and Gareth O. Roberts
CS 4/527: Artificial Intelligence
Transcription factor binding motifs
Bayesian inference Presented by Amir Hadadi
Markov chain monte carlo
Remember that our objective is for some density f(y|) for observations where y and  are vectors of data and parameters,  being sampled from a prior.
CAP 5636 – Advanced Artificial Intelligence
Markov Networks.
Latent Dirichlet Analysis
Hidden Markov Models Part 2: Algorithms
Bayesian Models in Machine Learning
Learning Sequence Motif Models Using Gibbs Sampling
CS 188: Artificial Intelligence
(Regulatory-) Motif Finding
Junghoo “John” Cho UCLA
Approximate Inference by Sampling
Opinionated Lessons #39 MCMC and Gibbs Sampling in Statistics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Markov Networks.
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Presentation transcript:

Gibbs sampling

The Motif Finding Problem Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc Find the motif in each of the individual sequences

The Motif Finding Problem If starting positions s=(s1, s2,… st) are given, finding consensus is easy because we can simply construct (and evaluate) the profile to find the motif. But… the starting positions s are usually not given. How can we find the “best” profile matrix? Gibbs sampling Expectation-Maximization algorithm

Notations Set of symbols: Sequences: S = {S1, S2, …, SN} Starting positions of motifs: A = {a1, a2, …, aN} Motif model ( ) : qij = P(symbol at the i-th position = j) Background model: pj = P(symbol = j) Count of symbols in each column: cij= count of symbol, j, in the i-th column in the aligned region

Motif Finding Problem Problem: find starting positions and model parameters simultaneously to maximize the posterior probability: This is equivalent to maximizing the likelihood by Bayes’ Theorem, assuming uniform prior distribution:

Equivalent Scoring Function Maximize the log-odds ratio:

Sampling and optimization To maximize a function, f(x): Brute force method: try all possible x Sample method: sample x from probability distribution: p(x) ~ f(x) Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax

Gibbs Sampling Idea: a joint distribution may be hard to sample from, but it may be easy to sample from the conditional distributions where all variables are fixed except one To sample from p(x1, x2, …xn), let each state of the Markov chain represent (x1, x2, …xn), the probability of moving to a state (x1, x2, …xn) is: p(xi |x1, …xi-1,xi+1,…xn). It is also called Markov Chain Monte Carlo (MCMC) method.

Gibbs Sampling

Gibbs Sampling in Motif Finding Randomly initialize A0; Repeat: (1) randomly choose a sequence z from S; A* = At \ az; compute θt = estimator of θ given S and A*; (2) sample az according to P(az = x), which is proportional to Qx/Px; update At+1 = A* union x; Select At that maximizes F; Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model

Estimator of θ Given an alignment A, i.e. the starting positions of motifs, θ can be estimated by its MLE with smoothing (e.g. Dirichlet prior with parameter bj):

The Motif Finding Problem Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc Find the motif in each of the individual sequences

Gene Regulation Transcription factor binding site, or motif instances TF Gene 1 CACGTGT CACGTGA CAAGTGA CAGGTGA Gene 2 Gene 3 Gene 4 Transcription factor binding site, or motif instances

Evolutionary Conservation CACGTGACC CACGTGAAC CACGTGAAC

Overview of TGS Colored lines: regulatory regions of genes How did the motifs evolve? How to find the ancestral motif instances? Colored lines: regulatory regions of genes Colored boxes: motif instances

How to find the ancestral motif instances? 1 2 3 4 5 6 7 8 9 A .036 .892 .036 .036 .036 .036 .892 .036 .036 C .892 .036 .892 .036 .036 .036 .036 .75 .75 G .036 .036 .036 .892 .036 .892 .036 .036 .036 T .036 .036 .036 .036 .892 .036 .036 .178 .178 Ancestral motif profile: CACGTGAAC CACGTGACC

How did the motifs evolve? Background substitution matrix A C G T A 0.8515 0.0278 0.0775 0.0432 C 0.0464 0.8026 0.0344 0.1167 G 0.1167 0.0350 0.8023 0.0460 T 0.0429 0.0785 0.0264 0.8522 Motif substitution matrix A C G T A 0.9802 0.0066 0.0066 0.0066 C 0.0120 0.9640 0.0120 0.0120 G 0.0120 0.0120 0.9640 0.0120 T 0.0066 0.0066 0.0066 0.9802

Evolution of motifs Distant species 250 million years

Overview of Gibbs Sampler Implementation Overview of Gibbs Sampler Iteratively sample from conditional distribution when other parameters are fixed. draw ~ In order to draw:

Implementation Parameters Ancestral motif weight matrix at the root Background distribution (multinomial) Probability that a gene in the i-th species will contain the motif Motif width Background substitution matrix for the i-th branch Motif substitution matrix for the i-th branch

Implementation Prior distribution Beta(1,1) Poisson distribution

Implementation Initialization Parameters are sampled using prior distributions; Motif instances in current species are sampled from sequences directly for each current species; Motif instances in ancestral species are randomly assigned with one of its immediate child motif instances.

Implementation Motif instance updating Updating motif instances in ancestral species Updating motif instances in current species

Ancestral Motif Weight Matrix Implementation Updating motif instance in ancestral species Ancestral Motif Weight Matrix 1 2 3 4 5 6 7 8 9 A .036 .892 .036 .036 .036 .036 .892 .036 .036 C .892 .036 .892 .036 .036 .036 .036 .75 .75 G .036 .036 .036 .892 .036 .892 .036 .036 .036 T .036 .036 .036 .036 .892 .036 .036 .178 .178 M11 M12 C A 2th position A: 0.932… C: 0.067 G: 8.4e-6 T: 2.5e-4 M11 M12 CCCGTGACC CACGTGAAC

Updated ancestral motif instance Implementation Updating motif instances for current species Updated ancestral motif instance CACTTGAAC M11 M12 …CACACCACGTGAGCTT... …CACATCACGTGAACTT…

Multiple Species? ? CAGGTGATC CACGTGAAC CACGTGAAC CACGTGAAC CACGTGATC