Lecture 5 Motif discovery. Signals in DNA Genes Promoter regions Binding sites for regulatory proteins (transcription factors, enhancer modules, motifs)

Slides:



Advertisements
Similar presentations
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Advertisements

Random Projection Approach to Motif Finding Adapted from RandomProjections.ppt.
Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Identification of a Novel cis-Regulatory Element Involved in the Heat Shock Response in Caenorhabditis elegans Using Microarray Gene Expression and Computational.
Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Transcription factor binding motifs (part I) 10/17/07.
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Introduction to Bioinformatics Algorithms Randomized Algorithms and Motif Finding.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Setting Up a Replica Exchange Approach to Motif Discovery in DNA Jeffrey Goett Advisor: Professor Sengupta.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Finding Regulatory Motifs in DNA Sequences
Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr.
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Sequencing a genome and Basic Sequence Alignment
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Introduction to Bioinformatics Algorithms Randomized Algorithms and Motif Finding.
Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20.
Learning Structure in Bayes Nets (Typically also learn CPTs here) Given the set of random variables (features), the space of all possible networks.
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Identification of Regulatory Binding Sites Using Minimum Spanning Trees Pacific Symposium on Biocomputing, pp , 2003 Reporter: Chu-Ting Tseng Advisor:
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Sampling Approaches to Pattern Extraction
Sequencing a genome and Basic Sequence Alignment
Outline More exhaustive search algorithms Today: Motif finding
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Markov Chain Monte Carlo and Gibbs Sampling Vasileios Hatzivassiloglou University of Texas at Dallas.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Guiding motif discovery by iterative pattern refinement Zhiping Wang Advisor: Sun Kim, Mehmet Dalkilic School of Informatics, Indiana University.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Randomized Algorithms and Motif Finding.
COMP3456 – Adapted from textbook slideswww.bioalgorithms.info Copyright warning.
Finding Regulatory Motifs in DNA Sequences
1 Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas.
Randomized Algorithms Chapter 12 Jason Eric Johnson Presentation #3 CS Bioinformatics.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Sequence Alignment.
Step 3: Tools Database Searching
HW4: sites that look like transcription start sites Nucleotide histogram Background frequency Count matrix for translation start sites (-10 to 10) Frequency.
Motif identification with Gibbs Sampler Xuhua Xia
Randomized Algorithms for Motif Finding [1] Ch 12.2.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
On the Ability of Graph Coloring Heuristics to Find Substructures in Social Networks David Chalupa By, Tejaswini Nallagatla.
 Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems n Introduction.
A Very Basic Gibbs Sampler for Motif Detection
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Presentation transcript:

Lecture 5 Motif discovery

Signals in DNA Genes Promoter regions Binding sites for regulatory proteins (transcription factors, enhancer modules, motifs) DNA gene 1 gene 2 gene 3 RNA Protein transcription translation ? enhancer module promoter 2

Gene expression and RNA-seq Idea: gene 5’3’ protein gene X

Time series expression profiling It is possible to make a series of RNA-seq experiments to obtain a time series expression profile for each gene. Cluster similarly behaving genes. 4

Analysis of clustered genes Similarly expressing genes may share a common transcription factor located upstream of the gene sequence. Extract those sequences from the clustered genes and search for a common motif sequence. We concentrate now on some randomized algorithms to discover motifs. 5

An alternative: ChIP-seq 6 Idea: One known protein at a time Much more accurate prediction of the region where the binding takes place than with the RNA-seq / time series clustering gene 5’3’ protein gene X

Randomized motif finding approach Motif Finding problem: Given a list of t sequences each of length n, find the “best” pattern of length l that appears in each of the t sequences. Previously: Motif Finding problem has been studied using a branch and bound approach in Algorithms for Bioinformatics course, period I. Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif. The following slides are modified from the ones provided along the book Jones & Pevzner: An Introduction to Bionformatics Algorithms. The MIT Press, 2004.

Profiles revisited Let s=(s 1,...,s t ) be a set of starting positions for l-mers in our t sequences. The substrings corresponding to these starting positions will form: - t x l alignment matrix and - 4 x l profile matrix P. A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8

Prob(a|P) is defined as the probability that an l-mer a was created by the Profile P. If a is very similar to the consensus string of P then Prob(a|P) will be high If a is very different, then Prob(a|P) will be low. Scoring strings with a profile

Scoring strings with a profile (cont’d) Given a profile: P = A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8 Prob(aaacct|P) = ??? The probability of the consensus string:

Scoring strings with a profile (cont’d) Given a profile: P = A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8 Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = The probability of the consensus string:

Scoring strings with a profile (cont’d) Given a profile: P = A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8 Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = The probability of the consensus string: Probability of a different string:

Most probable l-mer Define the most probable l-mer of a sequence as an l-mer in that sequence which has the highest probability of being created from the profile P. A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8 P = Given a sequence ctataaaccttacatc, find its most probable l-mer under P.

Most probable l-mer (cont’d) l-merCalculationsProb(a|P) ctataaaccttacat1/8 x 1/8 x 3/8 x 0 x 1/8 x 00 ctataaaccttacat1/2 x 7/8 x 0 x 0 x 1/8 x 00 ctataaaccttacat1/2 x 1/8 x 3/8 x 0 x 1/8 x 00 ctataaaccttacat1/8 x 7/8 x 3/8 x 0 x 3/8 x 00 ctataaaccttacat1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/ ctataaaccttacat1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/ ctataaaccttacat1/2 x 0 x 1/2 x 0 1/4 x 00 ctataaaccttacat1/8 x 0 x 0 x 0 x 0 x 1/8 x 00 ctataaaccttacat1/8 x 1/8 x 0 x 0 x 3/8 x 00 ctataaaccttacat1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8.0004

Most probable l-mers in many sequences Find the most probable l-mer in each of the sequences under P. ctataaacgttacatc atagcgattcgactg cagcccagaaccct cggtataccttacatc tgcattcaatagctta tatcctttccactcac ctccaaatcctttaca ggtcatcctttatcct A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8 P=P=

Most probable l-mers in many sequences (cont’d) ctataaacgttacatc atagcgattcgactg cagcccagaaccct cggtgaaccttacatc tgcattcaatagctta tgtcctgtccactcac ctccaaatcctttaca ggtctacctttatcct Most probable l-mers form a new profile 1aaacgt 2atagcg 3aaccct 4gaacct 5atagct 6gacctg 7atcctt 8tacctt A5/8 4/8000 C00 6/84/80 T1/83/800 6/8 G2/800 1/82/8

Comparing new and old profiles Red – frequency increased, Blue – frequency descreased 1aaacgt 2atagcg 3aaccct 4gaacct 5atagct 6gacctg 7atcctt 8tacctt A5/8 4/8000 C00 6/84/80 T1/83/800 6/8 G2/800 1/82/8 A1/27/83/801/80 C 01/25/83/80 T1/8 001/47/8 G1/401/83/81/41/8

Greedy profile motif search Use most probable l-mers to adjust start positions until we reach a “best” profile; this is the motif. 1) Select random starting positions. 2) Create a profile P from the substrings at these starting positions. 3) Find the most probable l-mer a under P in each sequence and change the starting position to the starting position of a. 4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.

Greedy profile motif search - analysis Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif. It is unlikely that the random starting positions will lead us to the correct solution at all. In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.

Gibbs sampling Greedy profile motif search is probably not the best way to find motifs. However, we can improve the algorithm by introducing Gibbs sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one. Gibbs sampling proceeds more slowly and chooses new l-mers at random increasing the odds that it will converge to the correct solution.

How Gibbs sampling works 1) Randomly choose starting positions s = (s 1,...,s t ) and form the set of l-mers associated with these starting positions. 2) Randomly choose one of the t sequences. 3) Create a profile P from the other t -1 sequences. 4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P. 5) Choose a new starting position for the removed sequence at random based on the probabilities calculated in step 4. 6) Repeat steps 2-5 until there is no improvement

Gibbs sampling: an example Input: t = 5 sequences, motif length l = 8 1. GTAAACAATATTTATAGC 2. AAAATTTACCTCGCAAGG 3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA 5. TACTTAACACCCTGTCAA

Gibbs sampling: an example 1) Randomly choose starting positions, s=(s 1,s 2,s 3,s 4,s 5 ) in the 5 sequences: s 1 =7 GTAAACAATATTTATAGC s 2 =11 AAAATTTACCTTAGAAGG s 3 =9 CCGTACTGTCAAGCGTGG s 4 =4 TGAGTAAACGACGTCCCA s 5 =1 TACTTAACACCCTGTCAA

Gibbs sampling: an example 2) Choose one of the sequences at random: Sequence 2: AAAATTTACCTTAGAAGG s 1 =7 GTAAACAATATTTATAGC s 2 =11 AAAATTTACCTTAGAAGG s 3 =9 CCGTACTGTCAAGCGTGG s 4 =4 TGAGTAAACGACGTCCCA s 5 =1 TACTTAACACCCTGTCAA

Gibbs sampling: an example 2) Choose one of the sequences at random: Sequence 2: AAAATTTACCTTAGAAGG s 1 =7 GTAAACAATATTTATAGC s 3 =9 CCGTACTGTCAAGCGTGG s 4 =4 TGAGTAAACGACGTCCCA s 5 =1 TACTTAACACCCTGTCAA

Gibbs sampling: an example 3) Create profile P from l-mers in remaining 4 sequences: 1AATATTTA 3TCAAGCGT 4GTAAACGA 5TACTTAAC A1/42/4 3/41/4 2/4 C01/4 002/401/4 T2/41/4 2/41/4 G /40 Consensus String TAAATCGA

Gibbs sampling: an example 4) Calculate the Prob(a|P) for every possible 8-mer in the removed sequence: l-mers Prob(a|P) AAAATTTACCTTAGAAGG AAAATTTACCTTAGAAGG AAAATTTACCTTAGAAGG AAAATTTACCTTAGAAGG0 0 0

Gibbs sampling: an example 5) Create a distribution of probabilities of l-mers Prob(a|P), and randomly select a new starting position based on this distribution. Starting Position 1: Prob( AAAATTTA | P ) = / = 6 Starting Position 2: Prob( AAATTTAC | P ) = / = 1 Starting Position 8: Prob( ACCTTAGA | P ) = / = 1.5 a) To create this distribution, divide each probability Prob(a|P) by the lowest probability: Ratio = 6 : 1 : 1.5

Turning ratios into probabilities Prob(Selecting Starting Position 1): 6/( )= Prob(Selecting Starting Position 2): 1/( )= Prob(Selecting Starting Position 8): 1.5/( )=0.176 b) Define probabilities of starting positions according to computed ratios c) Select the start position according to computed ratios: Pos 1Pos 2Pos 8 01 Rand([0,1))

Gibbs sampling: an example Assume we select the substring with the highest probability – then we are left with the following new substrings and starting positions. s 1 =7 GTAAACAATATTTATAGC s 2 =1 AAAATTTACCTCGCAAGG s 3 =9 CCGTACTGTCAAGCGTGG s 4 =5 TGAGTAATCGACGTCCCA s 5 =1 TACTTCACACCCTGTCAA

Gibbs sampling: an example 6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.

Gibbs sampler in practice Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs. Needs to be run with many randomly chosen seeds to achieve good results.

Another randomized approach Random projection algorithm is a different way to solve the Motif Finding problem. Guiding principle: Some instances of a motif agree on a subset of positions. However, it is unclear how to find these “non-mutated” positions. To bypass the effect of mutations within a motif, we randomly select a subset of positions in the pattern creating a projection of the pattern. Search for that projection in a hope that the selected positions are not affected by mutations in most instances of the motif.

Projections Choose k positions in string of length l. Concatenate nucleotides at chosen k positions to form a k-mer. This can be viewed as a projection of l-dimensional space onto k- dimensional subspace. ATGGCATTCAGATTC TGCTGAT l = 15 k = 7 Projection Projection = (2, 4, 5, 7, 11, 12, 13)

Random projections algorithm Select k out of l positions uniformly at random. For each l-mer in input sequences, hash into bucket based on letters at k selected positions. Recover motif from enriched bucket that contain many l- tuples covering many input sequences Bucket TGCT TGCACCT Input sequence: …TCAATGCACCTAT...

Random projections algorithm (cont’d) Some projections will fail to detect motifs but if we try many of them the probability that one of the buckets fills in is increasing. In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good” ATGCGTC...ccATCCGACca......ttATGAGGCtc......ctATAAGTCgc......tcATGTGACac... (7,2) motif

Example l = 7 (motif size), k = 4 (projection size) Choose projection (1,2,5,7) GCTC...TAGACATCCGACTTGCCTTACTAC... Buckets ATGC ATCCGAC GCCTTAC

Motif refinement How do we recover the motif from the sequences in the enriched buckets? k nucleotides are from hash value of the bucket. Use information in other l-k positions as starting point for local refinement scheme, e.g. Gibbs sampler. Local refinement algorithm ATGCGAC Candidate motif ATGC ATCCGAC ATGAGGC ATAAGTC ATGCGAC

Synergy between random projection and Gibbs sampler Random projection is a procedure for finding good starting points: every enriched bucket is a potential starting point. Feeding these starting points into existing algorithms (like Gibbs sampler) provides good local search in vicinity of every starting point. These algorithms work particularly well for “good” starting points.

Building profiles from buckets A C G T Profile P Gibbs sampler Refined profile P* ATCCGAC ATGAGGC ATAAGTC ATGTGAC ATGC

Random projection algorithm: a single iteration Choose a random k-projection and create a perfect hash function h() corresponding to it. Hash each l-mer x in input sequence into bucket labeled by h(x) From each enriched bucket (e.g., a bucket with a representative from each of the t sequences), form profile P and perform Gibbs sampler motif refinement Candidate motif is best found by selecting the best motif among refinements of all enriched buckets.

Choosing projection size Projection size k - choose k small enough so that several motif instances hash to the same bucket. - choose k large enough to avoid contamination by spurious l-mers: 4 k >> t (n - l + 1)

How many iterations? Planted bucket : bucket with hash value h(M), where M is the motif. Choose m = number of iterations, such that Prob(planted bucket contains at least s sequences in at least one of m iterations) =0.95 Probability is readily computable since iterations form a sequence of independent Bernoulli trials (More details in the exercises)