Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
CS5263 Bioinformatics Probabilistic modeling approaches for motif finding.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays.
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
Gene Regulation and Microarrays. Overview A. Gene Expression and Regulation B. Measuring Gene Expression: Microarrays C. Finding Regulatory Motifs.
Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.
Transcription factor binding motifs (part I) 10/17/07.
DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,
Regulatory motif discovery 6.095/ Computational Biology: Genomes, Networks, Evolution Lecture 10Oct 12, 2005.
CS262 Lecture 18, Win07, Batzoglou Sequence Logos Height of each letter proportional to its frequency Height of all letters proportional to information.
Setting Up a Replica Exchange Approach to Motif Discovery in DNA Jeffrey Goett Advisor: Professor Sengupta.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Computational Biology, Part 2 Representing and Finding Sequence Features using Consensus Sequences Robert F. Murphy Copyright  All rights reserved.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Fuzzy K means.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Learning HMM parameters Sushmita Roy BMI/CS 576 Oct 21 st, 2014.
Finding Regulatory Motifs in DNA Sequences
Gene Regulation and Microarrays …after which we come back to multiple alignments for finding regulatory motifs.
CS262 Lecture 17, Win07, Batzoglou Gene Regulation and Microarrays.
Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr.
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois,
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path Ilya N. Shindyalov, Philip E. Bourne.
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
CS 6293 Advanced Topics: Current Bioinformatics Motif finding.
Learning Regulatory Networks that Represent Regulator States and Roles Keith Noto and Mark Craven K. Noto and M. Craven, Learning Regulatory.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Outline More exhaustive search algorithms Today: Motif finding
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
1 Finding Regulatory Motifs. 2 Copyright notice Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
CS 5263 & CS 4593 Bioinformatics Motif finding. What is a (biological) motif? A motif is a recurring fragment, theme or pattern Sequence motif: a sequence.
Gibbs sampling for motif finding Yves Moreau. 2 Overview Markov Chain Monte Carlo Gibbs sampling Motif finding in cis-regulatory DNA Biclustering microarray.
Flat clustering approaches
Local Multiple Sequence Alignment Sequence Motifs
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Intro to Probabilistic Models PSSMs Computational Genomics, Lecture 6b Partially based on slides by Metsada Pasmanik-Chor.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
CS5263 Bioinformatics Lecture 11 Motif finding. HW2 2(C) Click to find out K and lambda.
CS 5263 Bioinformatics Motif finding. What is a (biological) motif? A motif is a recurring fragment, theme or pattern Sequence motif: a sequence pattern.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
CS5263 Bioinformatics Lecture 19 Motif finding. (Sequence) motif finding Given a set of sequences Goal: find sequence motifs that appear in all or the.
Lecture 20 Practical issues in motif finding Final project
Hidden Markov Models BMI/CS 576
A Very Basic Gibbs Sampler for Motif Detection
Gibbs sampling.
Learning Sequence Motif Models Using Expectation Maximization (EM)
CS 5263 & CS 4233 Bioinformatics Motif finding.
(Regulatory-) Motif Finding
Motif finding in groups of related sequences
Presentation transcript:

Motif Finding

Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA

Regulation of Genes Gene RNA polymerase Transcription Factor (Protein) Regulatory Element DNA

Regulation of Genes Gene RNA polymerase Transcription Factor Regulatory Element DNA New protein

Microarrays A 2D array of DNA sequences from thousands of genes Each spot has many copies of same gene Allow mRNAs from a sample to hybridize Measure number of hybridizations per spot

Finding Regulatory Motifs Tiny Multiple Local Alignments of Many Sequences

Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......

Characteristics of Regulatory Motifs Tiny Highly Variable ~Constant Size –Because a constant-size transcription factor binds Often repeated Low-complexity-ish

Problem Definition Probabilistic Motif: M ij ; 1  i  W 1  j  4 M ij = Prob[ letter j, pos i ] Find best M, and positions p 1,…, p N in sequences Combinatorial Motif M: m 1 …m W Some of the m i ’s blank Find M that occurs in all s i with  k differences Given a collection of promoter sequences s 1,…, s N of genes with common expression

Essentially a Multiple Local Alignment Find “best” multiple local alignment Alignment score defined differently in probabilistic/combinatorial cases......

Algorithms Probabilistic 1.Expectation Maximization: MEME 2.Gibbs Sampling: AlignACE, BioProspector Combinatorial CONSENSUS, TEIRESIAS, SP-STAR, others

Discrete Approaches to Motif Finding

Discrete Formulations Given sequences S = {x 1, …, x n } A motif W is a consensus string w 1 …w K Find motif W * with “best” match to x 1, …, x n Definition of “best”: d(W, x i ) = min hamming dist. between W and a word in x i d(W, S) =  i d(W, x i )

Approaches Exhaustive Searches CONSENSUS MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWER

Exhaustive Searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4 K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4 K ) (where N =  i |x i |) Advantage: Finds provably best motif W Disadvantage: Time

Exhaustive Searches 2. Sample-driven algorithm: For W = a K-long word in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W * Running time: O( K N 2 ) Advantage: Time Disadvantage:If: True motif does not occur in data, and True motif is “weak” Then, random motif may score better than any instance of true motif

CONSENSUS (1) Algorithm: Cycle 1: For each word W in S For each word W’ in S Create alignment (gap free) of W, W’ Keep the C 1 best alignments, A 1, …, A C1 ACGGTTG,CGAACTT,GGGCTCT … ACGCCTG,AGAACTA,GGGGTGT …

CONSENSUS (2) Algorithm (cont’d): Cycle t: For each word W in S For each alignment A j from cycle t-1 Create alignment (gap free) of W, A j Keep the C l best alignments A 1, …, A ct ACGGTTG,CGAACTT,GGGCTCT … ACGCCTG,AGAACTA,GGGGTGT … ……… ACGGCTC,AGATCTT,GGCGTCT …

CONSENSUS (3) C 1, …, C n are user-defined heuristic constants Running time: O(N 2 ) + O(N C 1 ) + O(N C 2 ) + … + O(N C n ) = O( N 2 + NC total ) Where C total =  i C i, typically O(nC), where C is a big constant

MULTIPROFILER Extended sample-driven approach Given a K-long word W, define: N a (W) = words W’ in S s.t. d(W,W’)  a Idea: Assume W is occurrence of true motif W * Will use N a (W) to correct “errors” in W

MULTIPROFILER (2) Assume W differs from true motif W * in at most L positions Define: A wordlet G of W is a L-long pattern with blanks, differing from W Example: K = 7; L = 3 W = ACGTTGA G = --A--CG

MULTIPROFILER (2) Algorithm: For each W in S: For L = 1 to L max 1.Find all “strong” L-long wordlets G in N a (W) 2.Modify W by the wordlet G -> W’ 3.Compute d(W’, S) Report W * = argmin d(W’, S) Step 1 above: Smaller motif-finding problem; Use exhaustive search

Expectation Maximization in Motif Finding

Expectation Maximization (1) The MM algorithm, part of MEME package uses Expectation Maximization Algorithm (sketch): 1.Given genomic sequences find all K-long words 2.Assume each word is motif or background 3.Find likeliest Motif Model Background Model classification of words into either Motif or Background

Expectation Maximization (2) Given sequences x 1, …, x N, Find all k-long words X 1,…, X n Define motif model: M = (M 1,…, M K ) M i = (M i1,…, M i4 )(assume {A, C, G, T}) where M ij = Prob[ motif position i is letter j ] Define background model: B = B 1, …, B 4 B i = Prob[ letter j in background sequence ]

Expectation Maximization (3) Define Z i1 = { 1, if X i is motif; 0, otherwise } Z i2 = { 0, if X i is motif; 1, otherwise } Given a word X i = x[1]…x[k], P[ X i, Z i1 =1 ] = M 1x[1] …M kx[k] P[ X i, Z i2 =1 ] = (1 - ) B x[1] …B x[K] Let 1 = ; 2 = (1- )

Expectation Maximization (4) Define: Parameter space  = (M,B) Objective: Maximize log likelihood of model:

Expectation Maximization (5) Maximize expected likelihood, in iteration of two steps: Expectation: Find expected value of log likelihood: Maximization: Maximize expected value over ,

E-step Expectation Maximization (6): E-step Expectation: Find expected value of log likelihood: where expected values of Z can be computed as follows:

M-step Expectation Maximization (7): M-step Maximization: Maximize expected value over  and independently For, this is easy:

M-step Expectation Maximization (8): M-step For  = (M, B), define c jk = E[ # times letter k appears in motif position j] c 0k = E[ # times letter k appears in background] It easily follows: to not allow any 0’s, add pseudocounts

Initial Parameters Matter! Consider the following “artificial” example: x 1, …, x N contain: –2 K patterns A…A, A…AT,……, T…T –2 K patterns C…C, C…CG,……, G…G –D << 2 K occurrences of K-mer ACTG…ACTG Some local maxima:  ½; B = ½C, ½G; M i = ½A, ½T, i = 1,…, K  D/2 k+1 ; B = ¼A,¼C,¼G,¼T; M 1 = 100% A, M 2 = 100% C, M 3 = 100% T, etc.

Overview of EM Algorithm 1.Initialize parameters  = (M, B), : –Try different values of from N -1/2 upto 1/(2K) 2.Repeat: a.Expectation b.Maximization 3.Until change in  = (M, B), falls below  4.Report results for several “good”

Conclusion One iteration running time: O(NK) –Usually need < N iterations for convergence, and < N starting points. –Overall complexity: unclear – typically O(N 2 K) - O(N 3 K) EM is a local optimization method Initial parameters matter MEME: Bailey and Elkan, ISMB 1994.