Computing the exact p-value for structured motif Zhang Jing (Tsinghua University and university of waterloo) Co-authors: Xi Chen, Ming Li.

Slides:



Advertisements
Similar presentations
Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome ECS289A.
Advertisements

. Markov Chains. 2 Dependencies along the genome In previous classes we assumed every letter in a sequence is sampled randomly from some distribution.
Fast Algorithms For Hierarchical Range Histogram Constructions
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.
A Novel Knowledge Based Method to Predicting Transcription Factor Targets
The multi-layered organization of information in living systems
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Ka-Lok Ng Dept. of Bioinformatics Asia University
. Hidden Markov Model Lecture #6. 2 Reminder: Finite State Markov Chain An integer time stochastic process, consisting of a domain D of m states {1,…,m}
Markov Chains Lecture #5
Bioinformatics Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 3 Finding Motifs Aleppo University Faculty of technical engineering.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Motif Finding in Transcription Factor Binding Sites Jian-Bien Chen ( 陳建儐 )
Transcription factor binding motifs (part I) 10/17/07.
DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,
1 Integration of Background Modeling and Object Tracking Yu-Ting Chen, Chu-Song Chen, Yi-Ping Hung IEEE ICME, 2006.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Finding Regulatory Motifs in DNA Sequences
. Class 5: Hidden Markov Models. Sequence Models u So far we examined several probabilistic model sequence models u These model, however, assumed that.
Assigning Numbers to the Arrows Parameterizing a Gene Regulation Network by using Accurate Expression Kinetics.
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.
Genetic Regulatory Network Inference Russell Schwartz Department of Biological Sciences Carnegie Mellon University.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Closest String with Wildcards ( CSW ) Parameterized Complexity Analysis for the Closest String with Wildcards ( CSW ) Problem Danny Hermelin Liat Rozenberg.
Motif search Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models Usman Roshan CS 675 Machine Learning.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Outline More exhaustive search algorithms Today: Motif finding
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
PreDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Department.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Searching for structured motifs in the upstream regions of hsp70 genes in Tetrahymena termophila. Roberto Marangoni^, Antonietta La Terza*, Nadia Pisanti^,
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
Algorithms in Bioinformatics: A Practical Introduction
Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
August 30, 2004STDBM 2004 at Toronto Extracting Mobility Statistics from Indexed Spatio-Temporal Datasets Yoshiharu Ishikawa Yuichi Tsukamoto Hiroyuki.
Local Multiple Sequence Alignment Sequence Motifs
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Motif Search and RNA Structure Prediction Lesson 9.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
Transcription factor binding motifs (part II) 10/22/07.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
P-value Calculating Problem Ph. D. Thesis by Jing Zhang Presented by Chao Wang.
Projects
Hidden Markov Models BMI/CS 576
A Very Basic Gibbs Sampler for Motif Detection
Bioinformatics tools to identify structured motifs in the upstream regions of stress-response-involved genes in Tetrahymena thermophila Antonietta La Terza*,
Learning Sequence Motif Models Using Expectation Maximization (EM)
Hidden Markov Models Part 2: Algorithms
Nora Pierstorff Dept. of Genetics University of Cologne
Hidden Markov Model Lecture #6
Presentation transcript:

Computing the exact p-value for structured motif Zhang Jing (Tsinghua University and university of waterloo) Co-authors: Xi Chen, Ming Li

Outline Background Model and Problem Description Previous works and Our results Algorithms Conclusion

Biology Background transcription factor (TF) transcription factor binding site (TFBS) DNA TF RNA TFBS Problem: given a DNA sequence and a group of TF candidates, which one regulates the DNA transcription ? Transcription DNA → RNA

Model of Transcription Factor One TF binds to a certain pattern of DNA clip (motif). We usually use binding cites to describe TF Word model GACCGCTTTGTCAAC GCTGCAGGTGTTCTC GCAGCAGGTGTTCCC CCCACAGCTGGGATC Matrix Model A 0/82/8 3/8 3/8 0/8 7/8 C 6/84/8 2/8 1/8 6/8 1/8 G 3/82/8 0/8 4/8 1/8 0/8 T 0/80/8 3/8 0/8 0/8 0/8 TFBS1 TFBS2TFBS3 TF

Model of Transcription Factor(cont.) A mixed model: structured motif introduced by Marsan and Sagot(2000): …… conserved spacers conserved spacers spacers conserved Consist of alternate conserved regions (boxes) and spacers (‘N’) CCGNN…N CGGAGGNN…N

Two box structured motif In our paper, we consider two box structured motifs GAL4 is a typical two box structured motif: "CGGNNNNNNNNNNNCCG”

Models of DNA sequence DNA sequence From biological view, it comes from two parts –background generated by random variables {X 1, X 2,…, X n }, which is a 1-order Markov Chain with transition matrix and stationary probability –binding sites …ATTCTAGCAAGCCTTAATTATCCAACTAAATCAGACCAGG…

Method: Hypothesis test Our hypothesis is: the given motif m comes from background (generated by a 1-order Markov Chain R) Our observation is : m appears on DNA sequence for k times If Pr(m appears on R for at least k times) is very small, the hypothesis must be wrong

Problem Description (P-value calculation) Input: a structured motif, an integer k > 0, a 1-order Markov Chain of length n with transition matrix T and stationary probability u Output: Pr( motif appears on R for at least k times)

Previous works Exact algorithm: –The non-overlapped (Helden et. al.[1]) Approximation algorithms : –Marsan et. al. [2] Robin S. et al. [3] [1]Van Helden, et. al.J. Rios, A.F. and Collado-Vides,J. Discovering and Regulatory elements in non-coding sequences by analysis of spaced dyads.Nucl. Acids Res [2]Marsan, L., and Sagot, M.-F Algorithms for extracting structured motifs using a suf. x tree with an applicationto promoter and regulatory site consensus identi. cation. J. Comp. Biol. 7, 345–362. [3] Robin, S., Daudin, J.-J., Richard, H., Sagot, M.-F. and Schbath, S. (2002). Oc-currence probability of structured motifs in random sequences. J. Comp. Biol

Our Contributions We give the first non-trivial algorithm to calculate the exact probability value of two-boxes structure motif The way to we do decomposition and dynamic programming is totally new and may be helpful to similar probability calculation.

Algorithms: main idea Transformation and Decomposition to the target probability Do Dynamic Programming to calculate the terms in decomposition

Example Structured motif: CNNAA Hit times: at least once Sequence model: 1-order Markov Chain Sequence length: 9

The probabilities for intersections of three events with minimum index 2 Transformation CA CA CA markov region R, |R|=9 E1E1 E2E2 E4E4 CA CA E3E3 E5E5 The probabilities for intersections of two events with minimum index 1 P(a,b) denotes the sum of all the probabilities for intersections of a events with minimum index b P(2,1) P(3,2) A A A A A

Terms in Dynamic Programming Structured prefix CAA Motif m CAA A CAAA A The suffix of s is m 2 m2m2 m1m1 The middle is covered by spacers and several (overlapped) m 2 The prefix of s is m 1 CAA

Terms in Dynamic Programming Recall the definition of P(a,b) and structured prefix –P(a,b) denotes the sum of all the probabilities for intersections of a events with minimum index b –structured prefix: three constraints Terms in Dynamic Programming –P(a,b) and –I(a,b,z) = the sum of all the probabilities for intersections of a events with minimum index b and structured prefix z is the prefix of subregion R[b,n] a :from 0 to 5 b :from 5 to 1 z: arbitrary order

Dynamic Programming (conti.) Main idea: calculate P(a,b) and I(a,b,z) interactively Key idea: decompose P(a,b) and I(a,b,z) according to the second minimum index(smi) of the events in the sum For example, P(2, 1) (two events, minimum index is 1) CAA CAA CAA smi<start of m 2 smi>start of m

Recurrence Formula For P(2,1) P(2,1) = Pr(‘CNNAA’)*P(1,6)+ Pr(‘C’)*I(1,2,’CNAAA’)+ Pr(‘CN’)*I(1,3,’CAAAA’) The decomposition to I(a,b,z) is quite similar CAA 11 CAA smi>=4(start of m 2 ) CAA smi<4

Time complexity Key: estimate the size of the structured motif set We can prove that Total time complexity: –n is the length of the DNA sequence CAAA

Experiment Results In SCPD, transcription factor GAL4 is reported to bind to 7 genes. We extract upstream sequence, of length 1000 bp, for these 7 genes The p-value and consumed time are shown in Table 1

Conclusion In this paper, we present a non-trivial and efficient algorithm to calculate the probability of the occurrence of a structured motif m One problem is that that is still an exponential time algorithm in worst case. Finding a polynomial time algorithm or proving that it is NP-hard are two main directions in the future work.

Thank you!