Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois,


Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois, Urbana-Champaign

Transcriptional Regulation [Figure labels: gene, binding site ACAGTGA, transcription factor, protein]

Binding sites and motifs: Transcription factor binding sites in a gene’s neighborhood are the fundamental units of the regulatory network. Transcription factor binding is specific, hence binding sites are similar to each other, but variability is often seen. A motif is the common sequence pattern among the binding sites of a transcription factor.

Motif models: consensus string, e.g., ACGWGT; Position Weight Matrix (PWM).

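Position Weight Matrix. Binding sites: ACCCGTT, ACCGGTT, ACAGGAT, ACCGGTT, ACATGAT. [Figure: the PWM built from these sites, with one row per base (A, C, G, T) and one column per position]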
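The matrix values from this slide are not preserved in the transcript, but the five binding sites are. A minimal sketch of how a frequency PWM could be derived from them follows; the function name `build_pwm` and the optional `pseudocount` parameter are illustrative assumptions, not part of the talk.

```python
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Return a PWM as a list of {base: frequency} dicts, one per column.

    Hypothetical helper: column k holds the relative frequency of each
    base at position k across the aligned sites.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for k in range(width):
        counts = Counter(site[k] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

# The five binding sites listed on the slide.
sites = ["ACCCGTT", "ACCGGTT", "ACAGGAT", "ACCGGTT", "ACATGAT"]
pwm = build_pwm(sites)
for k, column in enumerate(pwm):
    print(k, column)
```

Columns 0, 1, 4 and 6 are fully conserved in these sites, while columns 2, 3 and 5 carry the variability that a consensus string would hide.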

Databases of PWMs: Transfac has on the order of hundreds of PWMs for human. Jaspar is a smaller, perhaps better curated database of PWMs. Organism-specific databases are coming up frequently. PWMs in databases are often derived from experimentally validated binding sites.

Bioinformatics of PWMs: The PWM is a popular motif model, i.e., several motif-finding algorithms attempt to find PWMs from sequences. Gibbs sampling: one of the earliest; tries to sample PWMs with high “relative entropy”. MEME: another early algorithm; uses expectation maximization to find PWMs that best “model the sequences”. Many more algorithms exist to find PWMs from a set of sequences.

Problem: counting motifs. Given a DNA sequence and a consensus motif (say “ACGWGT”), count the motif in the sequence: this has a trivial solution. What if the motif is a Position Weight Matrix (PWM)? Why hasn’t this problem been looked at? Because previous algorithms used different scores of PWMs: how “sharp” they are, how well they explain data, etc.

Counting matches to a PWM: a possibility. For each site $s$ in the sequence, compute $\Pr(s \mid W)$. If $\Pr(s \mid W)$ exceeds some threshold, call $s$ a site. Count the number of sites in the sequence. Drawback: no distinction is made between strong and weak sites, as long as they are above the threshold; this binary scheme is not realistic.
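The threshold scheme described above can be sketched in a few lines. The two-column PWM, the sequence, and the threshold values here are made up for illustration, and $\Pr(s \mid W)$ is taken to be the usual product of per-column probabilities.

```python
def site_probability(site, pwm):
    """Pr(site | W): product of the per-column probabilities of each base."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p

def count_sites_above_threshold(sequence, pwm, threshold):
    """Count windows whose Pr(s | W) exceeds the threshold (binary scheme)."""
    width = len(pwm)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if site_probability(sequence[i:i + width], pwm) > threshold
    )

# Toy two-column PWM (illustrative values only).
toy_pwm = [
    {"A": 0.80, "C": 0.10, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
]
strict = count_sites_above_threshold("ACGACAT", toy_pwm, 0.5)
lenient = count_sites_above_threshold("ACGACAT", toy_pwm, 0.05)
print(strict, lenient)
```

With the strict threshold only the two exact `AC` windows count; lowering it also admits the weaker `AT` window, which is then counted exactly as much as a strong site. That is the lack of gradation the slide objects to.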

A wish-list (for the score): The score should consider both strong and weak occurrences of the motif. The score should assign appropriate weights to strong and weak occurrences. The score should be aware that there may also be sites of other known motifs in the sequence. The list goes on: the score should be efficiently computable, the score should be differentiable, the score should …

The “w-score”: Defined by a probabilistic model of sequence generation. Given one or more motifs and a background distribution, the model defines a probability space on sequences. It is a simple (zeroth-order) Hidden Markov Model (HMM).

Probabilistic Model: toy example. Given two motifs $W_1$, $W_2$, a “background” motif $W_b$, and a sequence length $L$. Transition probability: $\Pr(W_i \to W_j) = p_j$. Emission probability: when in state $W_i$, emit a substring $s$ chosen with probability $\Pr(s \mid W_i)$. Stop when the length of the emitted sequence is $L$. This is a stochastic process generating sequences of length $L$. [Figure: states $W_1$, $W_2$, $W_b$]

A “path” through the HMM [Figure: two possible paths, $T_1$ and $T_2$, each a sequence of states drawn from $W_1$, $W_2$, and $W_b$]

Likelihood of sequence & paths: A path of the HMM defines the locations of motif matches. For a sequence $S$ and a path $T$, the joint probability $\Pr(S, T)$ is easy to compute. The conditional probability of a path $T$, given the data $S$, is $\Pr(T \mid S) = \Pr(S, T) / \sum_{T'} \Pr(S, T')$. Strong matches make the probability higher; paths with weak matches have lower conditional probabilities.
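To make the path probabilities concrete, the following brute-force sketch enumerates every path that tiles a short sequence exactly, computes its joint probability with the sequence, and normalizes. It assumes the zeroth-order model described earlier, with a background "motif" of width 1; the names `MOTIFS`, `TRANS`, `enumerate_paths` and `path_posteriors`, and all numeric values, are illustrative assumptions. Enumeration is exponential in sequence length, so this is only useful for tiny examples.

```python
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical toy model: one motif of width 2 plus a width-1 background.
MOTIFS = {
    "W1": [
        {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
        {"A": 0.05, "C": 0.90, "G": 0.03, "T": 0.02},
    ],
    "Wb": [UNIFORM],
}
TRANS = {"W1": 0.1, "Wb": 0.9}  # p_j: probability of entering state W_j

def enumerate_paths(seq, motifs, trans, start=0):
    """Yield (path, Pr(S, T)) for every path covering seq[start:] exactly.

    A path is a tuple of (state_name, start_position) pairs.
    """
    if start == len(seq):
        yield (), 1.0
        return
    for name, pwm in motifs.items():
        end = start + len(pwm)
        if end > len(seq):
            continue  # this state would overshoot the sequence length
        p = trans[name]
        for column, base in zip(pwm, seq[start:end]):
            p *= column[base]
        for rest, p_rest in enumerate_paths(seq, motifs, trans, end):
            yield ((name, start),) + rest, p * p_rest

def path_posteriors(seq, motifs, trans):
    """Return {path: Pr(T | S)} by normalizing the joint probabilities."""
    paths = list(enumerate_paths(seq, motifs, trans))
    total = sum(p for _, p in paths)  # Pr(S), summed over all paths
    return {path: p / total for path, p in paths}

posteriors = path_posteriors("AC", MOTIFS, TRANS)
for path, prob in posteriors.items():
    print(path, round(prob, 4))
```

For the sequence `AC` there are two paths: one places the motif at position 0, the other explains both bases as background. The strong match gives the motif path the larger posterior even though its transition probability is small.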

The “w-score”: Let the number of occurrences of a motif (say $W_1$) in path $T$ be $n_{W_1}(T)$. Compute $\Phi_{W_1}(S) = \sum_T \Pr(T \mid S)\, n_{W_1}(T)$. In words: an average of the motif count over paths, with weights equal to the probability of $T$ given $S$.

The “w-score” (Cont’d): The score depends on both the number and the quality of matches to the motif. Every substring is a potential binding site, and paths placing the motif there will contribute to the count. $\Pr(T \mid S)$ depends on the match strength of all motifs, not just the one being counted.
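The posterior-weighted count does not require enumerating paths. One way to compute it, sketched here under the assumption of the zeroth-order model described earlier with a width-1 background, is a forward-backward pass over sequence positions. This is an illustration of the idea rather than the talk's implementation; the names `w_score`, `MOTIFS` and `TRANS` and all numeric values are assumptions, and no rescaling is done, so long sequences would underflow.

```python
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MOTIFS = {  # hypothetical toy model
    "W1": [
        {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
        {"A": 0.05, "C": 0.90, "G": 0.03, "T": 0.02},
    ],
    "Wb": [UNIFORM],
}
TRANS = {"W1": 0.1, "Wb": 0.9}

def emission(substring, pwm):
    """Pr(substring | W) as a product of per-column probabilities."""
    p = 1.0
    for column, base in zip(pwm, substring):
        p *= column[base]
    return p

def w_score(seq, motifs, trans, target):
    """Expected number of occurrences of `target`, averaged over Pr(T | S)."""
    n = len(seq)
    # fwd[i]: total probability of all partial paths emitting exactly seq[:i]
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        for name, pwm in motifs.items():
            w = len(pwm)
            if w <= i:
                fwd[i] += fwd[i - w] * trans[name] * emission(seq[i - w:i], pwm)
    # bwd[i]: total probability of all partial paths emitting exactly seq[i:]
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        for name, pwm in motifs.items():
            w = len(pwm)
            if i + w <= n:
                bwd[i] += trans[name] * emission(seq[i:i + w], pwm) * bwd[i + w]
    # Sum, over start positions, the posterior that `target` begins there.
    pwm = motifs[target]
    w = len(pwm)
    expected = 0.0
    for i in range(n - w + 1):
        expected += fwd[i] * trans[target] * emission(seq[i:i + w], pwm) * bwd[i + w]
    return expected / fwd[n]  # fwd[n] equals Pr(S)

print(w_score("AC", MOTIFS, TRANS, "W1"))
print(w_score("ACGTACAC", MOTIFS, TRANS, "W1"))
```

Each candidate position contributes a fractional amount between 0 and 1, so strong and weak matches are weighted by how well they compete with the alternative explanations of the same bases, including other motifs in the model.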

The wish-list (again): The score should consider both strong and weak occurrences of the motif. The score should assign appropriate weights to strong and weak occurrences. The score should be aware that there may also be sites of other known motifs in the sequence. An exciting new feature of this motif score.

Computational pros and cons: The w-score computation takes time that depends on the sequence length $L$ and the motif length $l_m$; this is relatively expensive. The w-score can be differentiated with respect to all of the PWM parameters with the same time complexity, an important feature for search algorithms.

Using the “w-score” in discriminative motif finding

Discriminative motif finding: Suppose we have a set of co-regulated genes, i.e., we believe they have binding sites of the same transcription factor (in their regulatory control regions). Traditionally, motif finding tries to find these binding sites based on over-representation, conservation, etc. Often we also know a set of genes that should NOT have binding sites of that transcription factor. Examples: ChIP-on-chip, in situ hybridization pictures of the Drosophila embryo, etc.

Problem formulation: Given two sets of sequences $S^+$ and $S^-$, find a motif that has many occurrences in $S^+$ and few occurrences in $S^-$. Maximize the difference in the average counts of the motif in the two sets. Let $\Phi_W(S)$ = count of a motif $W$ in sequence $S$. Maximize: $\frac{1}{|S^+|} \sum_{S \in S^+} \Phi_W(S) - \frac{1}{|S^-|} \sum_{S \in S^-} \Phi_W(S)$.

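Optimization problem: Find the motif $W$ that maximizes the difference between the average count of $W$ in the sequences of $S^+$ and its average count in the sequences of $S^-$.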
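The objective is a difference of two averages, which can be sketched independently of how the count is computed. In the sketch below the count function is passed in as a parameter; the exact-substring counter used in the example is only a stand-in for the w-score, and all names and sequences are hypothetical.

```python
def discriminative_objective(positives, negatives, count):
    """Average motif count in the positive set minus that in the negative set.

    `count` is any function mapping a sequence to a (possibly fractional)
    motif count; in the talk's setting it would be the w-score.
    """
    mean_pos = sum(count(s) for s in positives) / len(positives)
    mean_neg = sum(count(s) for s in negatives) / len(negatives)
    return mean_pos - mean_neg

# Stand-in count: exact, non-overlapping occurrences of a fixed word.
def count_acg(sequence):
    return sequence.count("ACG")

positives = ["ACGACG", "TTACGT"]
negatives = ["TTTTTT", "GACGTT"]
value = discriminative_objective(positives, negatives, count_acg)
print(value)
```

Because only averages enter the objective, the two sets do not need to be the same size.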

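Derivatives of objective function: Let $W_{k\alpha}$ be the PWM entry for base $\alpha$ in column $k$. We can efficiently compute the partial derivative $\partial \Phi_W(S) / \partial W_{k\alpha}$ of the motif count $\Phi_W(S)$, so we can efficiently differentiate our objective function.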

Algorithm. Search space: a set of n = 20 substrings of sequences in $S^+$ (called the “site-set”). Objective function: construct a PWM $W$ from the site-set and compute its score. The length of the sites is user-defined.

Algorithm [Figure: sequences in $S^+$ with the current site-set $C$ marked]

Algorithm: Replace one site with any site from a sequence in $S^+$. Pick a replacement that improves the objective function.

Algorithm. Current solution (site-set): $C$. Candidate new solution: $C'$. There are many possibilities for $C'$ (every substring of every sequence in $S^+$ is a possible replacement). Evaluating the objective function on each candidate $C'$ is too slow. Use derivative information!

Algorithm: Estimate the objective function value for each candidate $C'$ using partial derivatives and a first-order approximation. Examine each candidate in decreasing order of estimated score. If a candidate $C'$ is found with a greater score than $C$, choose it.
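The search step above can be sketched generically: rank candidates by a first-order (gradient-based) estimate, then compute the expensive exact score in that order and stop at the first improvement. In this sketch a PWM is flattened to a plain vector of numbers and a toy quadratic stands in for the w-score objective; every name and value is an illustrative assumption.

```python
def first_improving_candidate(current, candidates, objective, gradient):
    """Return (candidate, exact_score) for the first candidate that beats
    the current score, examining candidates in decreasing order of their
    first-order estimate; return None if no candidate improves."""
    base = objective(current)
    grad = gradient(current)

    def estimate(candidate):
        # f(W') ~ f(W) + grad f(W) . (W' - W)
        return base + sum(
            g * (c - w) for g, c, w in zip(grad, candidate, current)
        )

    for candidate in sorted(candidates, key=estimate, reverse=True):
        exact = objective(candidate)  # the expensive evaluation
        if exact > base:
            return candidate, exact
    return None

# Toy stand-in objective with a known gradient (not the w-score).
def toy_objective(w):
    return -((w[0] - 1.0) ** 2) - (w[1] - 2.0) ** 2

def toy_gradient(w):
    return (-2.0 * (w[0] - 1.0), -2.0 * (w[1] - 2.0))

current = (0.0, 0.0)
candidates = [(-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
result = first_improving_candidate(current, candidates, toy_objective, toy_gradient)
print(result)
```

The estimate only orders the candidates; acceptance is always decided by the exact score, so a poor estimate costs extra evaluations rather than producing a wrong move.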

Algorithm illustration [Figure: candidates ordered by estimated score; accurate scores of 11, 10, and 13 are shown against a current score of 12]

Algorithm properties: The objective function has many desirable properties, but is an expensive operation. Derivative computation has the same time complexity, and is used to guide the search. The algorithm avoids local optima by searching in a discretized PWM space. It performs significantly better and/or faster than Gibbs sampling and Conjugate Gradients, for this particular score.

Discriminative PWM Search (DIPS): Software available. Can easily handle data sets of ~100 sequences. Can find multiple motifs iteratively, but without masking: find a PWM, then include it in the model as a known PWM, find another PWM, and so on.

Performance tests: Tested on synthetic data. Compared to a traditional motif finder as well as two discriminative motif finders. Superior performance in the presence of “distractor” motifs: it really helps to be able to count a motif in the presence of other known motifs.

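Tests on Drosophila Enhancers [Figure: protein concentration along the head-to-tail axis of the embryo; BICOID (activator)]

Tests on Drosophila Enhancers [Figure: protein concentration along the head-to-tail axis of the embryo; CAUDAL (activator)]

Tests on Drosophila Enhancers [Figure: protein concentration along the head-to-tail axis of the embryo; KRUPPEL (repressor)]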

DIPS runs: $S^+$ = promoters of genes expressed in anterior half of embryo; $S^-$ = promoters of genes expressed in posterior half of embryo. Top motif: Bicoid! [Figure: protein concentration along the head-to-tail axis of the embryo; BICOID (activator)]

DIPS runs: $S^+$ = promoters of genes expressed in posterior half of embryo; $S^-$ = promoters of genes expressed in anterior half of embryo. Top motif: Caudal! [Figure: protein concentration along the head-to-tail axis of the embryo; CAUDAL (activator)]

DIPS runs: $S^+$ = promoters of genes expressed around the middle 20% of embryo; $S^-$ = promoters of genes expressed in middle 20% of embryo. Top motif: Kruppel! [Figure: protein concentration along the head-to-tail axis of the embryo]

Summary of results

Social regulation in honey bee: The transition from nursing in the hive to foraging for food is age related, but also regulated by the needs of the colony. 32 genes were demonstrated to be significantly differentially expressed in brains of nurses and foragers (21 active in foragers only, 11 active in nurses only). DIPS was run on 2 Kbp promoters of these social behavior-related genes.

Results on honey bee genes

Conclusion: Discriminative motif finding is increasingly becoming a necessary analysis. Motif finding in the presence of other known motifs is also becoming relevant. We presented a search algorithm that maximizes any objective function of the motif counts in the sequences (as long as it is differentiable). Several extensions and variations are possible.

Acknowledgements: Eric Siggia, Eran Segal; Yoseph Barash (“LearnPSSM”); Andrew Smith (“DME”).

Reference: ISMB 2006 (Brazil); Bioinformatics journal.