A Statistical Method for Finding Transcription Factor Binding Sites. Authors: Saurabh Sinha and Martin Tompa. Presenter: Christopher Schlosberg, CS598ss.


Regulation of Gene Expression

Difficulties of Motif Finding
- Regulatory sequences do not follow the same orientation as the coding sequence or as each other
- Multiple binding sites may exist for each regulated gene
- The binding sites of a single factor vary widely, and this variation is not well understood

Previous & Proposed Methods for Finding Motifs
- Previous methods:
  - Find longer, more general motifs
  - Use local search algorithms (Gibbs sampling, expectation maximization, greedy algorithms)
- Proposed method:
  - A TFBS is short enough for enumerative methods
  - Enumerative statistical methods guarantee global optimality and remain computationally affordable

Proposed Method Highlights
- Allows variation among the binding site instances of a given transcription factor
- Allows motifs to include "spacers"
- Allows overlapping occurrences (in both orientations), which leads to complex dependencies
- Statistical significance of a motif s is based on the frequencies of shorter (more frequent) oligonucleotides
- Uses a Markov chain to model the background genomic distribution
- Uses a z-score to measure statistical significance
- Allows multiple binding sites

Characteristics of a Motif
- Any single TFBS shows significant variation
- Many motifs contain spacers of 1 to 11 bp
- Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)
- Variation occurs less often between a pair of complementary bases
- Indels are uncommon

Proposed Motif Definition
- A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
- A, C, G, T (DNA bases), R (purine), Y (pyrimidine), S (strong), W (weak), N (spacer)
- The TF database SCPD confirms this model of variation
  - Of 50 binding site consensus sequences, 31 fit exactly (62%)
  - Another 10 fit if slight variations are allowed
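To make the motif alphabet concrete, here is a minimal Python sketch that expands a motif into the concrete DNA strings it matches. The base sets for R, Y, S and W follow the standard IUPAC nucleotide codes; the names `SYMBOLS`, `expand_motif` and `matches` are illustrative and do not come from the paper or the slides.

```python
from itertools import product

# Bases allowed by each motif character. N is the spacer and matches any base.
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer
}

def expand_motif(motif):
    """Return every concrete DNA string that the motif matches."""
    choices = [SYMBOLS[ch] for ch in motif]
    return ["".join(p) for p in product(*choices)]

def matches(motif, site):
    """True if the site has the motif's length and every base is allowed."""
    return len(motif) == len(site) and all(
        base in SYMBOLS[ch] for ch, base in zip(motif, site)
    )

print(expand_motif("CRY"))   # the four strings C[AG][CT]
print(matches("ANT", "ACT"))
```

Note that each spacer multiplies the number of expanded strings by four, so full expansion is only practical for short motifs or when spacers are matched position by position as in `matches`.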

Measure of Statistical Significance
- Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3' end at the gene's translation start site
- The model must measure from the input sequences:
  - The absolute number of occurrences N_s of motif s
  - The background genomic distribution
- X is a set of random DNA sequences, equal in number and lengths to the input sequences
  - Generated by a Markov chain of order m
  - Transition probabilities are determined by (m+1)-mer frequencies in the full complement of 6000+ S. cerevisiae upstream sequences (each 800 bp long)
  - The background model uses m = 3
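The background model can be sketched as follows: count every (m+1)-mer in the background sequences and normalize by the count of its m-base prefix. This is a minimal illustration, not the authors' code; `train_markov` is a hypothetical name, and the sketch assumes plain maximum-likelihood estimates with no pseudocounts and sequences that contain only A, C, G and T.

```python
from collections import Counter, defaultdict

def train_markov(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts.

    Returns a dict mapping each observed (m+1)-mer to the probability of
    its last base given its first m bases.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - m):
            counts[seq[i:i + m + 1]] += 1

    # Total count of each m-base context, summed over its observed extensions.
    context_totals = defaultdict(int)
    for kmer, n in counts.items():
        context_totals[kmer[:m]] += n

    return {kmer: n / context_totals[kmer[:m]] for kmer, n in counts.items()}

probs = train_markov(["ACGTACGA"], m=3)
print(probs["ACGT"], probs["ACGA"])  # the context ACG is followed by T once and A once
```

An (m+1)-mer that never appears in the training data is simply absent from the result, so a real implementation would need to decide how to treat unseen contexts.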

z-score
- X_s: random variable giving the number of occurrences of motif s in X
- E(X_s): expectation of X_s; σ(X_s): standard deviation of X_s
- z_s: number of standard deviations by which the observed value N_s exceeds the expectation
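Written as a formula, the definition of z_s on this slide reads as follows (the slide's own equation did not survive in the transcript, so this is reconstructed from the verbal definition):

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```

For example, a motif observed N_s = 12 times with E(X_s) = 4 and σ(X_s) = 2 would have z_s = 4.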

Implications
- A motif may overlap with itself (in either orientation)
- Previous study of pattern autocorrelation
- Generalized computation of the standard deviation, treating the motif as a finite set of strings
- Higher-order Markov chains
- Spacers are handled at no extra computational cost
- Handles motifs in either orientation

Algorithm
- Enumerates over each input sequence
- Tabulates the number N_s of occurrences of each motif in either direction
- Computes the expectation and standard deviation for each motif such that N_s > 0
- Calculates the z-score
- Ranks motifs by z-score
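The steps above can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: `expectation` and `std_dev` are hypothetical callables standing in for the paper's E(X_s) and σ(X_s) computations, `std_dev` is assumed to return a positive value, and an occurrence is counted once per strand on which it matches.

```python
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def count_occurrences(motif, sequences):
    """Count motif matches at every position, on either strand, overlaps allowed."""
    k = len(motif)
    total = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            for strand in (window, reverse_complement(window)):
                if all(b in SYMBOLS[ch] for ch, b in zip(motif, strand)):
                    total += 1
    return total

def rank_motifs(motifs, sequences, expectation, std_dev):
    """Return (z, motif, N_s) tuples, highest z-score first."""
    scored = []
    for s in motifs:
        n_s = count_occurrences(s, sequences)
        if n_s == 0:
            continue  # only motifs that actually occur are scored
        z = (n_s - expectation(s)) / std_dev(s)
        scored.append((z, s, n_s))
    return sorted(scored, reverse=True)

# Toy run with placeholder statistics (assumed values, not real estimates).
ranking = rank_motifs(["ACG", "TTT"], ["ACGTACGT"],
                      expectation=lambda s: 1.0, std_dev=lambda s: 1.5)
print(ranking)
```

In the toy run, TTT never occurs and is skipped, while ACG is found on both strands of the input.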

Algorithm Analysis
- For a single motif, the complexity is O(c^2 k^2)
  - k: number of nonspacer characters in the motif
  - c: number of instantiations of R, Y, S, W in the motif
- k takes only modest values
- Linear dependence on genome size
- The variance calculation can be trimmed as an optimization

Number of Occurrences
- Convert motif s into a multiset W
- Add the reverse complement of each string in W
- Motif s occurs at a position in X iff some string in W occurs at the same position
- X_s: number of occurrences in X, summed over the members of W
- Handling palindromes
- W_i: a member of W
- |W| = T
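A minimal Python sketch of this construction is given below. The names `build_W` and `occurrences` are illustrative. How the paper treats palindromes is not recoverable from the slide, so the sketch makes an assumption: a string equal to its own reverse complement is kept in W with multiplicity 2, which counts each of its occurrences once per strand.

```python
from collections import Counter
from itertools import product

SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_W(motif):
    """Multiset of the motif's concrete strings plus their reverse complements."""
    forward = ["".join(p) for p in product(*(SYMBOLS[ch] for ch in motif))]
    return Counter(forward) + Counter(reverse_complement(w) for w in forward)

def count_overlapping(word, seq):
    """Number of (possibly overlapping) positions where word occurs in seq."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

def occurrences(motif, sequences):
    """Sum of occurrences over the members of W, respecting multiplicity."""
    W = build_W(motif)
    return sum(mult * count_overlapping(w, seq)
               for w, mult in W.items() for seq in sequences)

print(build_W("ACG"))                      # ACG and its reverse complement CGT
print(occurrences("ACG", ["ACGTACGT"]))
```

Using a `Counter` keeps the multiset explicit, so T = |W| is `sum(build_W(motif).values())`.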

Number of Occurrences (cont'd)

Expectation
- Linearity of expectation
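The slide's equation did not survive in the transcript. Using the definitions from the "Number of Occurrences" slide, and writing X_i (a symbol assumed here) for the number of occurrences of W_i in X, linearity of expectation gives:

```latex
E(X_s) = E\!\left(\sum_{i=1}^{T} X_i\right) = \sum_{i=1}^{T} E(X_i)
```

The identity holds regardless of any dependence between the X_i, which is why overlapping occurrences cause no difficulty at this step.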

Variance
- B term
- C term
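The equations for this slide are missing from the transcript. As a hedged reconstruction, writing X_i (a symbol assumed here) for the number of occurrences of W_i in X, the standard expansion of the variance of a sum is:

```latex
\sigma^2(X_s) = E(X_s^2) - E(X_s)^2, \qquad
E(X_s^2) = \sum_{i=1}^{T} E(X_i^2) + \sum_{i \neq j} E(X_i X_j)
```

The slide's B and C labels presumably name the two sums on the right, but which label belongs to which sum cannot be recovered from the transcript.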

C Term
- A term

A Term

Overlapping Concatenation
- CW (like W) is potentially a multiset
- One-to-one correspondence

C Term Simplification

A Term Revisited

S_i1 S_i2 Term & Approximation
- Kleffe and Borodovsky (1992) approximation

B Term

B Term (cont'd)

Summary

Higher Order Markov Models
- Variance calculations remain the same except for the S_i1 S_i2 term
- Experiments use m = 3

Experimental Results & Future Considerations
- 17 coregulated sets of genes
- Each with a known TF and a known binding site consensus
- In 9 experiments, the known consensus was one of the 3 highest-scoring motifs
- Future topics:
  - Non-centered spacers
  - Enumeration loop optimization
  - Filtering repeats

Question
- E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason for this complication.
