Motif search and discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Motif discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
1 One Tailed Tests Here we study the hypothesis test for the mean of a population when the alternative hypothesis is an inequality.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
RESEARCH METHODOLOGY & STATISTICS LECTURE 6: THE NORMAL DISTRIBUTION AND CONFIDENCE INTERVALS MSc(Addictions) Addictions Department.
STAT 135 LAB 14 TA: Dongmei Li. Hypothesis Testing Are the results of experimental data due to just random chance? Significance tests try to discover.
Using Statistics to Analyze your Results
Bioinformatics Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
1 Analysis of Variance This technique is designed to test the null hypothesis that three or more group means are equal.
Heuristic alignment algorithms and cost matrices
A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science,
Similar Sequence Similar Function Charles Yan Spring 2006.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Bioinformatics Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
ETM 607 – Random Number and Random Variates
Traceback and local alignment Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
Multiple testing correction
Chapter 3: The Fundamentals: Algorithms, the Integers, and Matrices
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Confidence Intervals Chapter 7.
Today’s lesson Confidence intervals for the expected value of a random variable. Determining the sample size needed to have a specified probability of.
Slide Slide 1 Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing 8-3 Testing a Claim about a Proportion 8-4 Testing a Claim About.
Independent Samples t-Test (or 2-Sample t-Test)
Copyright © 2010, 2007, 2004 Pearson Education, Inc. 1.. Section 11-2 Goodness of Fit.
Motif search Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Slide 1 © 2002 McGraw-Hill Australia, PPTs t/a Introductory Mathematics & Statistics for Business 4e by John S. Croucher 1 n Learning Objectives –Identify.
Inferring phylogenetic trees: Maximum likelihood methods Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Comp. Genomics Recitation 3 The statistics of database searching.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 1): Two-tail Tests & Confidence Intervals Fall, 2008.
Slide Slide 1 Section 8-6 Testing a Claim About a Standard Deviation or Variance.
Testing Hypothesis That Data Fit a Given Probability Distribution Problem: We have a sample of size n. Determine if the data fits a probability distribution.
Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.
Significance in protein analysis
Unit 8 Section 8-3 – Day : P-Value Method for Hypothesis Testing  Instead of giving an α value, some statistical situations might alternatively.
Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.
Aron, Aron, & Coups, Statistics for the Behavioral and Social Sciences: A Brief Course (3e), © 2005 Prentice Hall Chapter 6 Hypothesis Tests with Means.
Working on exercises (a few notes first)‏. Comments Sometimes you want to make a comment in the Python code, to remind you what’s going on. Python ignores.
Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.
Section 3.5 Revised ©2012 |
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
Practice Page 128 –#6.7 –#6.8 Practice Page 128 –#6.7 =.0668 = test scores are normally distributed –#6.8 a =.0832 b =.2912 c =.4778.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
HW4: sites that look like transcription start sites Nucleotide histogram Background frequency Count matrix for translation start sites (-10 to 10) Frequency.
Assessing the significance of (data mining) results Data D, an algorithm A Beautiful result A (D) But: what does it mean? How to determine whether the.
The Chi Square Equation Statistics in Biology. Background The chi square (χ 2 ) test is a statistical test to compare observed results with theoretical.
A rectangular array of numeric or algebraic quantities subject to mathematical operations. The regular formation of elements into columns and rows.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Stat Lecture 7 - Normal Distribution
Pairwise sequence comparison
Inferring phylogenetic trees: Distance methods
For loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Testing a Claim About a Mean:  Not Known
Theoretical Normal Curve
Transcription factor binding motifs
Sequence comparison: Significance of similarity scores
Sequence comparison: Traceback and local alignment
Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Sequence comparison: Multiple testing correction
Sequence comparison: Dynamic programming
Empirical Rule MM3D3.
Sequence comparison: Local alignment
Sequence comparison: Traceback
Sequence comparison: Multiple testing correction
Sequence comparison: Significance of similarity scores
Practice #7.7 #7.8 #7.9. Practice #7.7 #7.8 #7.9.
False discovery rate estimation
Transcription factor binding motifs
Presentation transcript:

Motif search and discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Outline One-minute response Revision Motifs – Computing p-values for motif occurrences Python

One-minute responses I understood 90% of the lecture, and the concepts were interesting. I am getting familiar with the Python coding. Everything was clear in Python and in the class. The comprehension of Python is 50%. Liked the video. Understood the Python part. This section is interesting, but I didn’t figure out the Python problem. I appreciated the way of presenting the Python part. More Python code, please! I think we need more on programming to understand it better. Can you please explain the motif again?

Motif Compared to surrounding sequences, motifs experience fewer mutations. Why? – Because a mutation inside a motif reduces the chance that the organism will survive. What is an example of the function of a motif? – Binding site, phosphorylation site, structural motif. Why do we want to find motifs? – To understand the function of the sequence, and to identify distant homologs.

Revision A C G T TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG – 1.00 – 3.32 – = -6.72

Searching human chromosome 21 with the CTCF motif

Significance of scores Motif scanning algorithm TTGACCAGCAGGGGGCGCCG 6.30 Low score = not a motif occurrence High score = motif occurrence How high is high enough? A C G T

Two way to assess significance 1.Empirical – Randomly generate data according to the null hypothesis. – Use the resulting score distribution to estimate p- values. 2.Exact – Mathematically calculate all possible scores – Use the resulting score distribution to estimate p- values.

CTCF empirical null distribution

Computing a p-value The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)

Poor precision in the tail

Converting scores to p-values Linearly rescale the matrix values to the range [0,100] and integerize. A C G T A C G T

Converting scores to p-values Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative. A C G T A C G T

Converting scores to p-values Find the largest value. Divide 100 by that value. Multiply through by the result. All entries are now between 0 and 100. A C G T / 7 = A C G T

Converting scores to p-values Round to the nearest integer. A C G T A C G T

Converting scores to p-values Say that your motif has N rows. Create a matrix that has N rows and 100N columns. The entry in row i, column j is the number of different sequences of length i that can have a score of j. A C G T … 400

Converting scores to p-values For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1. A C G T …

Converting scores to p-values For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to the x+zth column of the matrix. A C G T …

Converting scores to p-values For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to the x+zth column of the matrix. What values will go in row 2? – 10+67, 10+39, 10+71, 10+43, 60+67, …, These 16 values correspond to all 16 strings of length 2. A C G T …

Converting scores to p-values In the end, the bottom row contains the scores for all possible sequences of length N. Use these scores to compute a p-value. A C G T …

Sample problem #1 Given: – a positive integer N, and – a file containing a DNA motif represented as a PSSM of length n. Return: – a version of the motif in which the values are integers in the range [0, N]. A C G T