Transcription factor binding motifs (part I) 10/17/07.

Steps of gene transcription [Figure: activator, TATA box, TFIID, Pol II] The term “transcription factor” (TF) usually means an activator or repressor.

Understand Regulation Which TFs are involved in the regulation? Does a TF enhance or repress gene expression? Which genes are regulated by this TF? Are there binding partners or competitors for the TF? Why does disease arise when a TF goes wrong?

Sequence specificity of TF binding

Motif representation Consensus: GCGAA PWM Alignment matrix

Motif representation Consensus: GCGAA PWM frequency matrix

Motif representation Consensus: GCGAA PWM Logo
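The three representations above can be derived from one set of aligned binding sites. The following is a minimal sketch, not the lecture's own code: the example sites are made up to agree with the consensus GCGAA, and the function names and the pseudocount value are illustrative assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"

def alignment_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must share one length"
    return [Counter(s[j] for s in sites) for j in range(width)]

def frequency_matrix(counts, pseudocount=0.5):
    """Turn counts into per-position base frequencies (each position sums to 1).

    The pseudocount (an assumed value) keeps unseen bases from getting
    probability zero.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        matrix.append({b: (column[b] + pseudocount) / total for b in ALPHABET})
    return matrix

def consensus(counts):
    """Most frequent base at each position."""
    return "".join(max(ALPHABET, key=lambda b: column[b]) for column in counts)

# Hypothetical aligned sites, invented for illustration.
sites = ["GCGAA", "GCGAA", "GCCAA", "GAGAA", "GCGTA"]
counts = alignment_matrix(sites)
pwm = frequency_matrix(counts)
print(consensus(counts))
```

A sequence logo is a graphical rendering of the same frequency matrix, so it needs no separate data structure.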

Objectives of motif finding Known motif mapping –Given a known motif, find all the matches over a query sequence. De novo motif discovery –Both motif patterns and match positions are unknown – much harder

Known Motif Mapping The matching score for a new sequence x is given by the log-likelihood ratio $S(x) = \log \frac{P(x \mid \theta_m)}{P(x \mid \theta_0)}$, where $\theta_m$ denotes the entries of the frequency matrix and $\theta_0$ is the background model: $p_0(A), \ldots, p_0(T)$, or a third-order Markov model (see next slide). Calculate the matching score for all genomic sequences. Motif sites correspond to the highest scores.
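Known motif mapping can be sketched as a sliding-window scan. This is a minimal illustration that assumes the matching score is a log-odds ratio between the motif frequency matrix and a zeroth-order background; the three-column matrix and the helper names are hypothetical.

```python
import math

def log_odds_score(segment, pwm, background):
    """Sum over positions of log(motif frequency / background frequency)."""
    return sum(math.log(pwm[j][base] / background[base])
               for j, base in enumerate(segment))

def scan(sequence, pwm, background):
    """Score every window of motif width; return (start, score) pairs."""
    w = len(pwm)
    return [(i, log_odds_score(sequence[i:i + w], pwm, background))
            for i in range(len(sequence) - w + 1)]

# Hypothetical 3-column frequency matrix favoring the word ACG.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {b: 0.25 for b in "ACGT"}  # assumed uniform background

hits = scan("TTACGTT", pwm, background)
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 3))
```

In practice both strands would be scanned and a score threshold chosen; both are left out here to keep the sketch short.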

Third-order Markov model The probability of generating a new base depends on the previous three bases (3rd-order Markov dependency): $p(x_i \mid x_{i-3}, x_{i-2}, x_{i-1})$
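One way to estimate such a background model is to count, for every three-base context, which base follows it. The sketch below assumes sequences contain only A, C, G and T; the function names, the pseudocount, and the uniform fallback for unseen contexts are illustrative choices.

```python
from collections import defaultdict
import math

def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate p(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in "ACGT"})
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {b: c / total for b, c in following.items()}
    return model

def log_prob(seq, model, order=3):
    """Log-probability of seq[order:] given its first `order` bases.

    Contexts never seen in training fall back to a uniform distribution
    (an assumption made for this sketch).
    """
    uniform = {b: 0.25 for b in "ACGT"}
    return sum(math.log(model.get(seq[i - order:i], uniform)[seq[i]])
               for i in range(order, len(seq)))

model = train_markov(["ACGTACGTACGT"])
print(model["ACG"])
```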

De novo motif discovery Statistical approach –Identify sequence patterns that occur more frequently than expected by chance. –Target regions: promoter regions of co-regulated genes; promoter regions of differentially expressed genes; experimentally identified TF binding sites –Very common Biophysical approach –Calculate protein-DNA binding affinities from first principles. –See Roider et al. for an example.

Methods PWM modeling –MEME, GMS, AlignACE, BioProspector Word enumeration –YMF, MDscan Use negative control –REDUCE, Motif Regressor Comparative genomics –MCS, CompareProspector, PhyloCon ChIP-chip (will discuss later)

The challenges no motif sites

The challenges multiple motif sites

The challenges variable relative positions

The challenges variable sequence pattern ATCCG ATTCG

MEME (Bailey and Elkan 1994) Input –A set of sequences: Y = {Y_i} –For a fixed length w, partition Y into overlapping w-mers: X = {X_i} –An alphabet: A = {a_j} = {A,C,G,T} Mixture Model –$\theta_m$: motif model –$\theta_0$: background model (0th- or 3rd-order Markov)

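Missing data: Z = {Z_i}, where Z_i = 1 if X_i is a motif site and 0 otherwise. With $\lambda$ the fraction of w-mers that are motif sites, the log-likelihood is $\log L = \sum_i \left[ Z_i \log\left(\lambda P(X_i \mid \theta_m)\right) + (1 - Z_i) \log\left((1-\lambda) P(X_i \mid \theta_0)\right) \right]$. Select $\lambda$ and $\theta$ to maximize the log-likelihood, but how?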

Expectation-Maximization (EM) Iteratively update hidden states and parameter values. Commonly used in bioinformatics research. E-step: –Under the current estimates of $\theta$ and $\lambda$ and the observed data, evaluate the expected value of the log-likelihood over the values of the missing data Z.

Expectation-Maximization (EM) M-step: –Update the parameters $\theta$ and $\lambda$ so that the expected log-likelihood is maximized. Iterate the E- and M-steps until convergence.
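The E- and M-steps can be sketched for a two-component mixture over w-mers. This is a simplified illustration rather than MEME itself: it assumes a fixed zeroth-order background, treats overlapping w-mers as independent, and uses hypothetical names (`em_step`, `seed_matrix`) and made-up sequences.

```python
import math

ALPHABET = "ACGT"

def wmers(sequences, w):
    """All overlapping w-mers from all sequences."""
    return [s[i:i + w] for s in sequences for i in range(len(s) - w + 1)]

def seed_matrix(word, match=0.7):
    """Initial motif guess: `match` on the seed base, the rest shared equally."""
    rest = (1 - match) / 3
    return [{b: (match if b == c else rest) for b in ALPHABET} for c in word]

def prob_motif(x, theta):
    return math.prod(theta[j][b] for j, b in enumerate(x))

def prob_background(x, p0):
    return math.prod(p0[b] for b in x)

def log_likelihood(X, theta, lam, p0):
    """Observed-data log-likelihood of the mixture."""
    return sum(math.log(lam * prob_motif(x, theta)
                        + (1 - lam) * prob_background(x, p0)) for x in X)

def em_step(X, theta, lam, p0, pseudocount=0.1):
    # E-step: posterior probability that each w-mer is a motif site.
    z = []
    for x in X:
        pm = lam * prob_motif(x, theta)
        pb = (1 - lam) * prob_background(x, p0)
        z.append(pm / (pm + pb))
    # M-step: re-estimate the motif matrix from z-weighted base counts,
    # and the mixing fraction as the mean of z.
    new_theta = []
    for j in range(len(theta)):
        column = {b: pseudocount for b in ALPHABET}
        for x, zi in zip(X, z):
            column[x[j]] += zi
        total = sum(column.values())
        new_theta.append({b: c / total for b, c in column.items()})
    return new_theta, sum(z) / len(z), z

# Made-up sequences, each containing the word TGACTC.
seqs = ["AAATGACTCAAA", "CCTGACTCGGCC", "GGGTTGACTCTT"]
X = wmers(seqs, 6)
p0 = {b: 0.25 for b in ALPHABET}
theta, lam = seed_matrix("TGACTC"), 0.1
for _ in range(10):
    theta, lam, z = em_step(X, theta, lam, p0)
print(round(lam, 3))
```

Because the result depends on the starting matrix, a practical run would repeat this from several seeds and keep the fit with the highest likelihood.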

Issues with the EM algorithm Can get trapped in a local maximum of the likelihood. Results depend on the initial guess. Often need to do multiple runs starting from different initial guesses, then pick the best one.

Gibbs sampling Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. It is applicable when the joint distribution is not known explicitly, but the conditional distribution of each variable is known. The sequence of samples forms a Markov chain. As the iteration number goes to infinity, the distribution of the samples approaches the underlying joint distribution.

Key differences between EM and Gibbs sampling
EM | Gibbs sampling
Maximum likelihood | Posterior
Deterministic | Stochastic
Frequentist | Bayesian
Initialize seed for $\theta$ | Initialize prior for $\theta$

Gibbs Motif Sampler (Lawrence et al. 1993; Liu et al. 1995) Assume each sequence contains one motif site, but the site positions and the motif frequency matrix $\theta$ are unknown.

Gibbs Motif Sampler Take out one sequence, with its site, from the current motif. [Figure: motif matrix rebuilt without the site from sequence 1]

Gibbs Motif Sampler Score each possible segment of this sequence. [Figure: sequence 1, segment (2-7), score 3]

Gibbs Motif Sampler Sample a new segment to put the sequence back. [Figure: motif matrix modified with the new site from sequence 1]

Advantages of Gibbs sampling Stochastic sampling permits the algorithm to escape from local optima. More robust than deterministic updating as in EM. Fast.
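The take-out, score, and sample-back cycle of the Gibbs motif sampler can be sketched as follows. This is a bare-bones illustration under stated assumptions: exactly one site per sequence, a uniform background, a fixed pseudocount, and at least two input sequences. The function names and example sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Frequency matrix from the current sites, with pseudocounts."""
    matrix = []
    for j in range(len(sites[0])):
        column = {b: pseudocount for b in ALPHABET}
        for s in sites:
            column[s[j]] += 1
        total = sum(column.values())
        matrix.append({b: c / total for b, c in column.items()})
    return matrix

def segment_weight(segment, matrix, background):
    """Likelihood ratio of a segment under the motif vs. the background."""
    weight = 1.0
    for j, b in enumerate(segment):
        weight *= matrix[j][b] / background[b]
    return weight

def gibbs_motif_sampler(sequences, w, iterations=200, seed=0):
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}  # assumed uniform
    # Start from a random site position in every sequence.
    positions = [rng.randrange(len(s) - w + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Take out sequence i and rebuild the motif from the others.
            others = [s[p:p + w]
                      for k, (s, p) in enumerate(zip(sequences, positions))
                      if k != i]
            matrix = build_matrix(others)
            # Score each possible segment of sequence i.
            starts = list(range(len(seq) - w + 1))
            weights = [segment_weight(seq[a:a + w], matrix, background)
                       for a in starts]
            # Sample a new segment in proportion to its weight.
            positions[i] = rng.choices(starts, weights=weights)[0]
    return positions

# Made-up sequences, each containing the word TGACTC.
seqs = ["ACGTTGACTCAGT", "TTGACTCCCGATA", "GGCATTGACTCAA"]
print(gibbs_motif_sampler(seqs, 6))
```

Sampling the new position instead of always taking the best-scoring one is what lets the sampler leave a poor configuration.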

Transcription level changes in glucose vs galactose (Roth 1998)

MDscan (Liu et al. 2002) Basic idea –True targets are likely to be more differentially expressed than other genes. Procedure: –Rank genes according to p-values, gene expression levels, etc. –Search TF motif from highest ranking targets first (high signal / background ratio) –Refine candidate motifs with all targets

Similarity defined by m-match
For a given w-mer and any other random w-mer, count the positions at which the two match. m-matches for the 8-mer TGTAACGT:
TGTAACGT  matched 8
AGTAACGT  matched 7
TGCAACAT  matched 6
TGACACGG  matched 5
AATAACAG  matched 4
Pick a reasonable m to call two w-mers similar.
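Counting matched positions is a position-by-position comparison. The sketch below reproduces the counts listed for TGTAACGT; the function names are hypothetical.

```python
def count_matches(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    assert len(a) == len(b), "w-mers must have the same length"
    return sum(x == y for x, y in zip(a, b))

def m_matches(seed, candidates, m):
    """Candidates that agree with the seed at m or more positions."""
    return [c for c in candidates if count_matches(seed, c) >= m]

seed = "TGTAACGT"
words = ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]
for word in words:
    print(word, count_matches(seed, word))
```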

MDscan Algorithm: Finding candidate motifs [Figure: seed 1 and its m-matches; sequences ranked by significance of differential gene expression]

MDscan Algorithm: Finding candidate motifs [Figure: seed 2 and its m-matches; sequences ranked by significance of differential gene expression]

MDscan Algorithm: Scoring candidate motifs Maximum a posteriori (MAP) score function: prefers conserved motifs with many sites that are not often seen in the genome background. Keep the best candidate motifs. Motif signal: abundant positions, conserved, specific (unlikely in genome background).

MDscan Algorithm: Update motifs with remaining sequences [Figure: seed 1 and its m-matches; sequences ranked by significance of differential gene expression]

MDscan Algorithm: Refine the motifs [Figure: seed 1 and its m-matches; sequences ranked by significance of differential gene expression]

MDscan Algorithm Check sequences with a high signal/background ratio first, as they are more likely to yield the correct motif. Algorithm summary: –Seed with w-mers in the top sequences, find m-matches to make a matrix –Keep good motifs to be updated by the remaining sequences –Refine motifs by removing bad sites Can check motifs of any width very fast –Only consider existing w-mers, finite dataset –Seed in top sequences: O(n^2) –Update motifs with all sequences: O(n)

Word enumeration YMF (Sinha and Tompa 2002) Search over ALL possible w-mers. For each w-mer, calculate a z-score measuring whether it is over-represented in the selected sequences vs. the background. Rank the words by z-score and select the top ones. Advantage: global optimum. Drawback: computational time grows exponentially with w, so it can only be used to search for short motifs (6- to 10-mers).
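Word enumeration with a z-score can be sketched as below. This is a simplified stand-in, not YMF's actual statistic: it assumes a binomial approximation that ignores word overlaps and estimates each word's background probability from a separate set of background sequences. The function names, pseudocount, and example sequences are made up.

```python
from collections import Counter
from itertools import product
import math

def count_wmers(sequences, w):
    """Occurrences of every w-mer across all sequences."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def z_scores(target, background, w, pseudocount=1.0):
    """z-score of each of the 4**w words: (observed - expected) / sd."""
    observed = count_wmers(target, w)
    bg = count_wmers(background, w)
    n_target = sum(observed.values())
    n_bg = sum(bg.values())
    n_words = 4 ** w
    scores = {}
    for letters in product("ACGT", repeat=w):
        word = "".join(letters)
        # Background probability of the word, smoothed with a pseudocount.
        p = (bg[word] + pseudocount) / (n_bg + pseudocount * n_words)
        expected = n_target * p
        sd = math.sqrt(n_target * p * (1 - p))
        scores[word] = (observed[word] - expected) / sd
    return scores

# Made-up target and background sequences.
target = ["ACGACGACG", "TTACGTT"]
background = ["ATATATATAT", "GCGCGCGC", "TTTTAAAA"]
scores = z_scores(target, background, 3)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

The loop over `product("ACGT", repeat=w)` visits 4^w words, which is the exponential cost that limits this approach to short motifs.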