Evaluation of count scores for weight matrix motifs
Project Presentation for CS598SS
Hong Cheng and Qiaozhu Mei

Problem Background
– Understand the mechanism of gene regulation and predict regulatory behavior.
– This requires a quantitative measure of the strength with which a TF binding site occurs in a gene sequence.
– Such a measure can serve as an important feature in studies of gene regulation.

Project Background (cont.)
– There are no standard criteria for such a measure, but we expect a good measure to model:
 – The quality of a binding site.
 – The occurrence of a binding site in the sequence.
– There are many possible choices for such a measure, but which one is better?

Project Overview
– Project goal
– Possible scoring measures
– Evaluating a score: constraint analysis
– Experiments and results
– Current status and future work

Goal of the project: three steps
– Formalize the problem of count scores for weight matrix motifs and propose an evaluation mechanism.
– Evaluate the existing scoring methods for weight matrix motifs.
– Either recommend a good motif counting method or propose a new score that improves on existing ones.

Possible scoring measures
– Simple counting: match or no match.
– Likelihood sum: the likelihood that the data at a site is generated by the motif, summed over all possible sites.
– Model-based scores.
– Free energy.
– Normalization of existing scores.

Simple counting
– Simple counting (match or no match) does not work for fuzzy motifs.
– Variation: count a motif occurrence at a subsequence s if P(s|w) is above a threshold.
Likelihood sum score
– A soft version of simple counting.
– Ad hoc: it has no sound probabilistic interpretation.
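To make the two window-based scores concrete, here is a minimal Python sketch (not from the slides; the PWM representation as per-position probability dictionaries, the uniform 0.25 background, the threshold, and the toy motif are all illustrative assumptions):

def site_likelihood_ratio(site, w, bg=0.25):
    # P(s|w) / P(s|background) for one candidate site s
    ratio = 1.0
    for pos, base in enumerate(site):
        ratio *= w[pos].get(base, 0.0) / bg
    return ratio

def simple_count(seq, w, threshold=10.0):
    # hard count: windows whose likelihood ratio clears a threshold
    l = len(w)
    return sum(1 for i in range(len(seq) - l + 1)
               if site_likelihood_ratio(seq[i:i+l], w) >= threshold)

def likelihood_sum(seq, w):
    # soft count: sum the likelihood ratio over all windows
    l = len(w)
    return sum(site_likelihood_ratio(seq[i:i+l], w)
               for i in range(len(seq) - l + 1))

# toy PWM of length 3, sharp at every position
w = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
print(simple_count("AAATTTAAA", w), likelihood_sum("AAATTTAAA", w))

The threshold variant recovers hard counting; dropping the threshold and summing gives the soft likelihood-sum score criticized above.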

Model-Based Scores
– Consider the sequence to be generated by a model involving a set of motifs.
– HMM: Stubb (Sinha et al. 2003) takes the count of a motif to be the expected (average) number of times the motif is planted in the sequence.
– Two options:
 – Fixed transition probabilities.
 – Transition probabilities fitted by unsupervised learning.
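Stubb computes this expected count within a full HMM; the sketch below is only a simplified stand-in (an assumption, not Stubb's algorithm) that treats each window independently with prior probability p of being a motif occurrence and sums the posteriors, reusing the toy PWM representation from the earlier sketch:

def expected_count(seq, w, p=0.01, bg=0.25):
    # sum over windows of the posterior probability that the window
    # was emitted by the motif rather than the background
    l = len(w)
    total = 0.0
    for i in range(len(seq) - l + 1):
        site = seq[i:i+l]
        lik_motif, lik_bg = 1.0, 1.0
        for pos, base in enumerate(site):
            lik_motif *= w[pos].get(base, 0.0)
            lik_bg *= bg
        total += p * lik_motif / (p * lik_motif + (1 - p) * lik_bg)
    return total

w = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
print(expected_count("AAATTTAAA", w, p=0.001))
print(expected_count("AAATTTAAA", w, p=0.01))   # larger p, larger count

Note that this score is monotone in p, which is exactly the motif probability constraint introduced later.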

Other Possible Scores
Free energy:
– Θ: a set of motifs M together with model parameters; Θ_b: model parameters with the background only.
– F(S, Θ) = log( Pr(S | Θ) / Pr(S | Θ_b) ).
– This models the score of a sequence against a set of motifs; it cannot give the score of a specific motif unless the computation is run with that motif alone.
Normalization of existing scores:
– Estimate P(C >= x) instead of the number of sequences.
– Use well-known normalization methods to normalize the actual counts: min-max; z-score ( Z = (N - E) / S ).
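The two normalizations named on the slide are standard; a small sketch (the raw counts below are made-up placeholders, not experiment output):

import statistics

def z_score(count, counts):
    # Z = (N - E) / S, with E and S estimated from a sample of counts
    return (count - statistics.mean(counts)) / statistics.stdev(counts)

def min_max(count, counts):
    # rescale a count into [0, 1] relative to the observed range
    lo, hi = min(counts), max(counts)
    return (count - lo) / (hi - lo)

raw = [3.0, 7.0, 4.0, 12.0, 5.0]
print([round(z_score(c, raw), 2) for c in raw])
print([round(min_max(c, raw), 2) for c in raw])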

Question: what makes a good score for a weight matrix motif?

Evaluation of scores
Empirical evaluation with lab experiments:
– Compare the score against lab experiments (e.g., ChIP-chip studies) to gauge its effectiveness.
– Problem: lab experiment data are not easy to get, and performance may vary across species (so the evaluation may be biased).
Analytical evaluation: heuristic constraints.

Analytical Evaluation with Heuristic Constraints
– There are many heuristic constraints that we expect a good score to satisfy.
– The effectiveness of a score can be inferred from how well it satisfies the constraints.
– Whether a score satisfies a constraint can be studied analytically or with experiments on random data.
– Combined with empirical evaluation, constraint analysis can tell us why one score is better than others, and help us define a new score.

Heuristic Constraints (I)
Formalization:
– Motif PWMs: w, M.
– Sequences: S; possible binding sites: s.
– Score of the n-th run: C_n(S, w).
Motif Quality Constraint (focus on the quality of sites):
– Captures the contribution of a motif w of length l at a site.
– For one motif w, two sequences S1 and S2, and a site position [i, i+l-1], suppose S1[i, i+l-1] != S2[i, i+l-1] and all other positions are the same.
– If I(S1[i, i+l-1]) <= I(S2[i, i+l-1]), then C(S1, w) <= C(S2, w).
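The slides do not pin down the site-quality function I(.); one plausible instantiation (an assumption) is the log-likelihood of the site under the PWM, which makes the constraint easy to check on toy data:

import math

def site_quality(site, w, eps=1e-9):
    # log P(site | w), with a small epsilon to avoid log(0)
    return sum(math.log(w[pos].get(base, 0.0) + eps)
               for pos, base in enumerate(site))

w = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
s1, s2 = "ACATTT", "AAATTT"   # differ only within the site [0, 2]
assert site_quality(s1[0:3], w) <= site_quality(s2[0:3], w)
# the constraint then demands C(s1, w) <= C(s2, w) for a good score C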

Heuristic Constraints (II)
Motif Length Constraint:
– For two motifs w1 and w2 with length(w1) = length(w2) + 1, where for every position i <= length(w2) the multinomial vector w1(i) = w2(i):
– compute the scores of w1 and w2 on one sequence S independently; then C(S, w1) <= C(S, w2).
Motif Sharpness Constraint:
– For two motifs w1 and w2 with length(w1) = length(w2), where for every position i, w1(i, j) <= w2(i, j) for j = 0, 1, 2 and w1(i, 3) >= w2(i, 3) (i.e., w1 is sharper than w2):
– compute the scores of w1 and w2 on a large number of sequences independently; then E[C(w1)] <= E[C(w2)].

Heuristic Constraints (III)
Motif Probability Constraint:
– For one motif w and one sequence S, compute the score C(S, w) twice, giving a higher probability to w in the second run (e.g., the transition probability or prior probability in the HMM).
– If p_1 < p_2, then C_1(S, w) <= C_2(S, w).
Motif Competition Constraint:
– For two motifs w1 and w2 and one sequence S, first compute the score for w1 alone, then recompute taking the co-occurrences of w1 and w2 into account.
– C_1(S, w1) >= C_2(S, w1).

Heuristic Constraints (IV)
Deterministic Constraint:
– For one motif w and one sequence S, if we compute the score of w twice with no parameter changes, then C_1(S, w) = C_2(S, w).
Upper Bound Constraint:
– For an existing set of motifs M and a sequence S, if we add a new motif w_n and recompute the scores for M and w_n, the total score may grow, but it cannot exceed an upper bound (e.g., the length of S).
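Both constraints are mechanical to test given any scoring function count(seq, motif); the stub below just counts exact matches of the motif's consensus string, purely for illustration:

def consensus_count(seq, w):
    # toy score: non-overlapping exact matches of the consensus string
    consensus = "".join(max(col, key=col.get) for col in w)
    return float(seq.count(consensus))

def check_deterministic(count, seq, w):
    return count(seq, w) == count(seq, w)

def check_upper_bound(count, seq, motifs):
    return sum(count(seq, w) for w in motifs) <= len(seq)

w = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
print(check_deterministic(consensus_count, "AAATTTAAA", w))
print(check_upper_bound(consensus_count, "AAATTTAAA", [w]))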

A summary of constraints
– The heuristic constraints let us analyze the effectiveness of a score without running experiments.
– If experiments show that one score is better than others, the heuristic constraints can indicate why it is better.
– It is difficult to find a closed set of constraints.
– Some constraints are closely related (perhaps not orthogonal, though not redundant).

Experiment Design
Regular (comparing distributions):
– Methods:
 – Stubb with learned p.
 – Simple count.
– Data:
 – Real motifs, real sequence data.
 – Real motifs, randomly generated very long sequences (say, 10k~100k).
 – Random motifs (long, short, fuzzy, and sharp combinations), random long sequences.

Experiment Design
Stubb with fixed prior probability:
– Vary the prior probability p: …, 0.001, …, 0.01, …
– Data:
 – Real motifs, randomly generated long sequences.
 – Random motifs (long, short, fuzzy, and sharp combinations), randomly generated long sequences.
– Examine the score distributions.

Experiment Design: Constraints
Motif length:
– Randomly generated motifs (uniform, varying length), randomly generated long sequences.
– Randomly generated motifs (uniform, varying length), real sequences.
Motif sharpness:
– Randomly generated motifs (varying sharpness, equal length), randomly generated long sequences (100k).
(A data-generation sketch follows.)
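One way such test data could be generated (the Dirichlet-style sampling and the alpha values are illustrative assumptions, not the project's actual generator): one Dirichlet draw per column gives a motif column, with small alpha yielding sharp columns and large alpha near-uniform ones.

import random

BASES = "ACGT"

def random_column(alpha):
    # normalized gamma draws = a Dirichlet(alpha, ..., alpha) sample
    draws = [random.gammavariate(alpha, 1.0) for _ in BASES]
    total = sum(draws)
    return {b: d / total for b, d in zip(BASES, draws)}

def random_motif(length, alpha):
    return [random_column(alpha) for _ in range(length)]

def random_sequence(n):
    return "".join(random.choice(BASES) for _ in range(n))

sharp = random_motif(10, alpha=0.1)   # mass concentrated on one base
fuzzy = random_motif(10, alpha=10.0)  # close to uniform
seq = random_sequence(100000)         # a "random long sequence (100k)"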

Experiment Design: Constraints
Motif competition:
– Real motifs, real or random sequence data.
– Several runs: 1st run with motif M1 only; 2nd run with M1 and M2; 3rd run with M1, M2, and M3; and so on.
– Plot the distribution of M1's score across the runs.

Experiment Design (cont.)
Deterministic constraint:
– Real motifs, real sequences; run several times and plot the distributions for motif 1 to see whether they change much.
Normalization:
– Z-score only; min-max only; P(C >= N) only; P(C >= N) + z-score; P(C >= N) + min-max.

Experiment Result (1)
– Stubb on real sequences against real motifs.
– Simple count on real sequences against real motifs.
Four motifs:
– Bicoid: length 11, medium sharpness.
– Kruppel: length 9, medium sharpness.
– Gt: length 12, a bit sharper.
– Hkb: length 7, sharpest; every row has one non-zero count and three zeros.

Experiment Result (1): Stubb

Experiment Result (1): Simple Count

Experiment Result (1): Normalization, P(x >= N)

Experiment Result (1): Normalization, z-score on motif score

Experiment Result (2)
– Stubb on random sequences against random motifs.
– Simple count on random sequences against random motifs.
Four motifs:
– Long_fuzzy: length 20, uniform.
– Long_sharp: length 20, sharp.
– Short_fuzzy: length 5, uniform.
– Short_sharp: length 5, sharp.

Experiment Result (2): Stubb

Experiment Result (2): Simple Count

Experiment Result (3)
– Stubb with fixed prior probability, varying p up to ~0.05.
– Four real motifs: Bicoid, Kruppel, Hkb, Gt.
– Four random motifs: Long_fuzzy, Long_sharp, Short_sharp, Short_fuzzy.

Experiment Result (3): Bicoid

Experiment Result (3): Hkb

Experiment Result (3): Long_fuzzy

Experiment Result (3): Short_sharp

Experiment Result (4): Motif Length Constraint
– Tested this heuristic with Stubb and simple count.
– Generated 10 random motifs, uniform, with length varying from 1 to 10.

Experiment Result (4): Stubb

Experiment Result (4): Simple Count

Experiment Result (5): Motif Sharpness Constraint
– Tested this heuristic with Stubb and simple count.
– Generated 10 random motifs of length 10 with varying sharpness.

Experiment Result (5): Stubb

Experiment Result (5): Simple Count

Experiment Result (6): Motif Competition
– Tested this constraint with Stubb and simple count.
– 1st run: Bicoid only; 2nd run: Bicoid and five other motifs; 3rd run: Bicoid and nine other motifs.
– Monitored the Bicoid score across runs.

Experiment Result (6): Stubb

Experiment Result (6): Simple Count

Summary

Constraint             | Stubb      | Stubb_FixedP | Likelihood Sum
Probability Constraint | N/A        | Yes          | N/A
Motif Length           | Yes        | Yes          | Yes
Motif Sharpness        | Yes        | Yes          | No
Motif Competition      | Not clear  | Not clear    | Not clear
Deterministic          | No         | Yes          | Yes
Upper Bound            | Yes        | Yes          | No
Site Quality           | To be done | To be done   | To be done

Future Work
– Finish the constraint tests.
– Evaluate more scores (e.g., free energy).
– Define and formalize more constraints.
– Compare against ChIP-chip experiment results to study the effectiveness of the scores and its relation to the constraints.