A Statistical Method for Finding Transcription Factor Binding Sites
Authors: Saurabh Sinha and Martin Tompa
Presenter: Christopher Schlosberg
CS598ss


Regulation of Gene Expression

Difficulties of Motif Finding
- Regulatory sequences do not follow the same orientation as the coding sequence or as each other.
- Multiple binding sites might exist for each regulated gene.
- The binding sites of a single factor vary widely, and the variations are not well understood.

Previous & Proposed Methods for Finding Motifs
- Previous methods:
  - Find longer, general motifs
  - Use local search algorithms (Gibbs sampling, Expectation Maximization, greedy algorithms)
- Proposed method:
  - A transcription factor binding site (TFBS) is short enough for enumerative methods
  - Enumerative statistical methods guarantee global optimality and remain affordable

Proposed Method Highlights
- Allows variations in the binding site instances of a given transcription factor
- Allows motifs to include "spacers"
- Allows overlapping occurrences (in both orientations), which lead to complex dependencies
- Bases the statistical significance of a motif s on the frequencies of shorter (more frequent) oligonucleotides
- Uses a Markov chain to model the background genomic distribution
- Uses a z-score to measure statistical significance
- Allows multiple binding sites

Characteristics of a Motif
- Any single TFBS has significant variation
- Many motifs have spacers of 1-11 bp
- Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)
- Less often, variation occurs between a pair of complementary bases
- Indels are uncommon

Proposed Motif Definition
- A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
  - A, C, G, T (DNA bases), R (purine), Y (pyrimidine), S (strong), W (weak), N (spacer)
- The TF database (SCPD) confirms this model of variation
  - Of 50 binding site consensus sequences, 31 fit exactly (62%)
  - Another 10 fit if slight variations are allowed
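To make the alphabet concrete, here is a minimal Python sketch of how a degenerate motif could be expanded into the concrete DNA strings it stands for. The base sets for R, Y, S, W follow the standard IUPAC nucleotide codes; the names `IUPAC`, `expand_motif`, and `matches` are illustrative and do not come from the paper.

```python
from itertools import product

# Degenerate symbols from the slide's alphabet mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def expand_motif(motif):
    """Return every concrete DNA string that the degenerate motif stands for."""
    choices = [IUPAC[ch] for ch in motif]
    return ["".join(p) for p in product(*choices)]

def matches(motif, site):
    """True if `site` has the motif's length and each base is allowed there."""
    return len(site) == len(motif) and all(
        base in IUPAC[sym] for sym, base in zip(motif, site)
    )
```

For example, `expand_motif("ARC")` yields the two strings AAC and AGC, while a motif with several N spacers expands to four times as many strings per spacer.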

Measure of Statistical Significance
- Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3' end at the gene's translation start site.
- The model must measure from the input sequences:
  - The absolute number of occurrences N_s of motif s
  - The background genomic distribution
- X is a set of random DNA sequences, equal in number and lengths to the input sequences
  - Generated by a Markov chain of order m
  - Transition probabilities are determined by the (m+1)-mer frequencies in the full complement of upstream sequences (each 800 bp in length)
  - The background model uses m = 3
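As a rough illustration of the background model, the sketch below estimates order-m transition probabilities from (m+1)-mer counts. It assumes plain maximum-likelihood estimates with no smoothing or pseudocounts, which the slides do not specify; the function name `transition_probs` is hypothetical.

```python
from collections import Counter, defaultdict

def transition_probs(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - m):
            counts[seq[i:i + m + 1]] += 1

    # Total count of each m-base context, summed over the base that follows it.
    totals = defaultdict(int)
    for kmer, c in counts.items():
        totals[kmer[:m]] += c

    # Key is the (m+1)-mer: its first m bases are the context, the last is the next base.
    return {kmer: c / totals[kmer[:m]] for kmer, c in counts.items()}
```

With `m=3`, the key "ACGT" holds the estimated probability that T follows the context ACG.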

z-score
- X_s: random variable, the number of occurrences of motif s in X
- E(X_s): expectation; σ(X_s): standard deviation
- z_s: number of standard deviations by which the observed value N_s exceeds the expectation
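Read literally, these three definitions combine into the expression below. It is reconstructed from the bullet text, since the slide's own typeset formula is not part of the transcript.

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```

A large positive z_s means the motif occurs far more often in the input sequences than the background model predicts.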

Implications
- Possibility of overlap of a motif with itself (in either orientation)
- Previous study of pattern autocorrelation
- Generalized computation of the SD, treating a motif as a finite set of strings
- Higher-order Markov chains
- Spacers handled at no extra computational cost
- Handles motifs in either orientation

Algorithm
- Enumerate over each input sequence
- Tabulate the number N_s of occurrences of each motif in either orientation
- Compute the expectation and SD for each motif such that N_s > 0
- Calculate the z-score
- Rank motifs by z-score
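The steps above can be sketched in Python under several simplifying assumptions: only exact k-mers are enumerated (no R, Y, S, W, or spacers), a palindromic word is counted once per position, and the expectation and standard deviation are supplied by the caller as the hypothetical callables `expectation` and `std_dev`. This is an outline of the control flow, not the authors' implementation.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers_both_strands(sequences, k):
    """Tabulate N_s for every exact k-mer, reading both orientations."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            counts[word] += 1
            rc = reverse_complement(word)
            # Assumption: a palindromic word is counted once per position.
            if rc != word:
                counts[rc] += 1
    return counts

def rank_by_z_score(counts, expectation, std_dev):
    """Rank motifs with N_s > 0 by z = (N_s - E) / SD, highest first."""
    scored = []
    for motif, n_s in counts.items():
        if n_s > 0:
            sd = std_dev(motif)
            if sd > 0:  # skip degenerate cases to avoid division by zero
                scored.append((motif, (n_s - expectation(motif)) / sd))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice `expectation` and `std_dev` would be computed from the Markov background model; the slides that follow are devoted to that calculation.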

Algorithm Analysis
- For a single motif, the complexity is O(c^2 k^2)
  - k: number of nonspacer characters in the motif
  - c: number of instantiations of R, Y, S, W in the motif
- Only modest values of k
- Linear dependence on genome size
- The variance calculation can be trimmed as an optimization

Number of Occurrences
- Convert motif s into a multiset W
- Add the reverse complement of each string in W
- Motif s occurs at a position in X iff some string in W occurs at the same position
- X_s: total number of occurrences (in X) of the members of W
- Handling palindromes
- W_i: a member of W
- |W| = T
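A small Python sketch of this construction follows. It expands a motif, adds reverse complements, and counts the positions where some member of W occurs. One deliberate simplification: the sketch removes duplicate strings, so a palindromic member appears once, whereas the slide treats W as a multiset and its palindrome handling is not shown. The names `build_w` and `count_occurrences` are illustrative.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def build_w(motif):
    """Expand the motif, then add reverse complements not already present."""
    forward = ["".join(p) for p in product(*(IUPAC[c] for c in motif))]
    w = list(forward)
    seen = set(forward)
    for word in forward:
        rc = word.translate(COMPLEMENT)[::-1]
        if rc not in seen:  # assumption: keep each distinct string once
            seen.add(rc)
            w.append(rc)
    return w

def count_occurrences(w, seq):
    """Count positions in seq where some member of W occurs."""
    members = set(w)
    k = len(w[0])
    return sum(seq[i:i + k] in members for i in range(len(seq) - k + 1))
```

For the motif ARC, W contains AAC, AGC and their reverse complements GTT, GCT, so the sequence AACGCT holds two occurrences: one in each orientation.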

Number of Occurrences (cont'd)

Expectation
- Linearity of expectation
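With W = {W_1, ..., W_T}, linearity of expectation can be written as below. The symbol X_{W_i}, for the number of occurrences of the single string W_i in X, is notation assumed here for illustration and may differ from the slide's.

```latex
X_s = \sum_{i=1}^{T} X_{W_i}
\qquad\Longrightarrow\qquad
E(X_s) = \sum_{i=1}^{T} E\left(X_{W_i}\right)
```

The identity holds even when occurrences overlap or are otherwise dependent, which is why the expectation needs no correction terms.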

Variance
- B term
- C term

C Term
- A term

A Term

Overlapping Concatenation
- CW (like W) is potentially a multiset
- One-to-one correspondence

C Term Simplification

A Term Revisited

S i1 S i2 Term & Approximation
- Kleffe and Borodovsky (1992) approximation

B Term

B Term (cont'd)

Summary

Higher Order Markov Models
- Variance calculations remain the same except for the S i1 S i2 term
- Experiments use m = 3

Experimental Results & Future Considerations
- 17 coregulated sets of genes
  - Each with a known TF and a known binding site consensus
- In 9 experiments, the known consensus was one of the 3 highest-scoring motifs
- Future topics:
  - Non-centered spacers
  - Enumeration loop optimization
  - Filtering repeats

Question
- E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason why σ(X_s) is harder to compute.