Motif search Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Outline
One-minute response
Revision
Motifs
– Definition and motivation
– Representation as a PSSM
– Scanning for motif occurrences
Python

One-minute responses
I would appreciate a little revision on the two types of tests.
Class was not too fast today.
The new method was fine.
I understood about 50% of the Python part.
The revision was very helpful.
Need more explanation by chalk.
Today was clear.
I am OK with stats but struggling with the Python.
We need to work more on Python.
Can you explain how to control the FDR, the j* and that table of p-values?
I would like to know more about Python execution.

Other comments
We need extra Python tutorials with the tutors.
Tutors should write an explanation of each program so that we understand what it does.
Emile is complaining because our code looks like yours (no main function, not modular). What is the best way to code?
– Emile’s comments will not affect your marks; they are stylistic suggestions only.

Revision
You have searched a database of 1000 proteins with a single query sequence. What p-value threshold should you use if you want to apply Bonferroni correction and achieve 99% confidence?
– 0.01 / 1000 = 0.00001
You have searched a database of 5000 proteins, and you observed a top-scoring p-value of 0.00002. What E-value does this correspond to?
– 0.00002 * 5000 = 0.1
How do you decide whether to control the family-wise error rate or the false discovery rate?
– If the conclusion or follow-up experiment involves a single result, then control the family-wise error rate; otherwise, control the false discovery rate.
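The two calculations on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and the p-value 0.00002 is the value implied by the slide's E-value of 0.1 for a database of 5000 proteins.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / num_tests


def e_value(p_value, num_tests):
    """Expected number of scores at least this good, given num_tests comparisons."""
    return p_value * num_tests


# 99% confidence corresponds to alpha = 0.01; the database holds 1000 proteins.
print(bonferroni_threshold(0.01, 1000))

# A top-scoring p-value of 0.00002 in a database of 5000 proteins.
print(e_value(0.00002, 5000))
```

Both functions are one-line formulas; writing them as functions just makes it easy to try other database sizes.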

Revision How many p-values in this list achieve an FDR of 10%? j = index α = 0.1 m =
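The notation on this slide (index j, level α, number of tests m) suggests the Benjamini-Hochberg procedure, so the sketch below assumes that is the method meant. It sorts the p-values, compares the j-th smallest to (j / m) * α, and reports j*, the largest index that passes. The list of p-values is made up for illustration; it is not the list from the slide.

```python
def count_significant_bh(p_values, alpha):
    """Return j*, the number of p-values accepted at FDR level alpha.

    Assumes the Benjamini-Hochberg step-up rule: j* is the largest j such
    that the j-th smallest p-value is at most (j / m) * alpha.
    """
    m = len(p_values)
    j_star = 0
    for j, p in enumerate(sorted(p_values), start=1):
        if p <= (j / m) * alpha:
            j_star = j  # keep going: a later index may still pass
    return j_star


# Hypothetical p-values, not the ones from the slide.
example = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.5, 0.7, 0.8, 0.9, 0.95]
print(count_significant_bh(example, alpha=0.1))
```

Note that the loop does not stop at the first failure: every p-value ranked at or below j* is accepted, even if some of them individually exceed their own threshold.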

Motif
Set of similar substrings within a family of diverged sequences.
[Figure: a motif within a long DNA or protein sequence]

Protein motifs
Protein binding site
Phosphorylation site
Structural motif
HAHU V.LSPADKTN..VKAAWGKVG.AHAGE YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR M.LTDAEKKE..VTALWGKAA.GHGEE YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK V.LSAADKTN..VKGVFSKIG.GHAEE YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU VHLTPEEKSA..VTALWGKVN.VDEVG G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR VHLSGGEKSA..VTNLWGKVN.INELG G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK VHWTAEEKQL..ITGLWGKVNvAD.CG A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU G.LSDGEWQL..VLNVWGKVE.ADIPG HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR G.LSDGEWQL..VLKVWGKVE.GDLPG HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY..PDIQNKFSQaFKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL GVLTDVQVAL..VKSSFEEFN.ANIPK N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB M.L.DQQTIN..IIKATVPVLkEHGVT ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

Transcription from DNA to RNA

Transcription factor binding A transcription factor is a protein that affects transcription by binding to DNA. Transcription factor binding sites

Why identify motifs?
In proteins
– Identify functionally important regions of a protein family
– Find similarities to known proteins
In DNA
– Discover how genes are regulated

Representing motifs as PSSMs
AAGTGT
TAATGT
AATTGT
AATTGA
ATCTGT
AATTGT
TGTTGT
AAATGA
TTTTGT
Convert these nine 6-letter sequences into a PSSM. Use uniform background probabilities (A=0.25, C=0.25, G=0.25, T=0.25) and a pseudocount weight of 1.
(Matrix rows: A, C, G, T.)
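One way to carry out this conversion in Python is sketched below. It assumes the usual recipe: count each letter at each position, add the pseudocount weight times the background probability, divide by the number of sequences plus the pseudocount weight, and take the base-2 log of the ratio to the background. The formula is not spelled out on the slide, so it is an assumption here, and the function name is made up for illustration.

```python
import math

SEQUENCES = ["AAGTGT", "TAATGT", "AATTGT", "AATTGA", "ATCTGT",
             "AATTGT", "TGTTGT", "AAATGA", "TTTTGT"]
ALPHABET = "ACGT"
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pssm(sequences, background, pseudocount_weight=1.0):
    """Return a dict mapping each letter to a list of log-odds scores, one per position."""
    width = len(sequences[0])
    num_seqs = len(sequences)
    pssm = {letter: [] for letter in ALPHABET}
    for pos in range(width):
        column = [seq[pos] for seq in sequences]
        for letter in ALPHABET:
            count = column.count(letter)
            # Assumed formula: pseudocount-smoothed frequency.
            freq = (count + pseudocount_weight * background[letter]) / (
                num_seqs + pseudocount_weight)
            # Log-odds of the motif frequency against the background.
            pssm[letter].append(math.log2(freq / background[letter]))
    return pssm


pssm = build_pssm(SEQUENCES, BACKGROUND)
for letter in ALPHABET:
    print(letter, " ".join(f"{value:6.2f}" for value in pssm[letter]))
```

The pseudocount keeps letters that never occur at a position from getting a score of minus infinity; they get a large negative score instead.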

Scanning for motif occurrences
Given:
– a long DNA sequence, and
  TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
– a DNA motif represented as a PSSM
Find:
– occurrences of the motif in the sequence

Scanning for motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
First window, TAATGT: 0.38 + 1.32 - 0.15 + 1.89 + 1.89 + 1.54 = 6.87

Scanning for motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
Second window, AATGTT: 1.32 + 1.32 + 1.07 - 3.32 - 3.32 + 1.54 = -1.39
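The scan itself is a sliding window: at every start position, add up the PSSM entry for each letter in the window. The sketch below rebuilds the PSSM from per-position letter counts of the nine example sequences, assuming a log-odds formula with uniform background and a pseudocount weight of 1. The helper names are made up for illustration, and positions are 0-based here, which is a choice rather than something stated on the slides.

```python
import math

# Letter counts at each of the six positions of the nine example sequences.
COUNTS = {
    "A": [6, 6, 2, 0, 0, 2],
    "C": [0, 0, 1, 0, 0, 0],
    "G": [0, 1, 1, 0, 9, 0],
    "T": [3, 2, 5, 9, 0, 7],
}


def log_odds(count, num_seqs=9, background=0.25, weight=1.0):
    """Assumed PSSM entry: log2 of the smoothed frequency over the background."""
    freq = (count + weight * background) / (num_seqs + weight)
    return math.log2(freq / background)


PSSM = {letter: [log_odds(c) for c in column] for letter, column in COUNTS.items()}
SEQUENCE = "TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG"


def score_window(pssm, window):
    """Sum the PSSM entries selected by the letters of one window."""
    return sum(pssm[letter][i] for i, letter in enumerate(window))


def scan(pssm, sequence):
    """Return (start, window, score) for every window of motif width."""
    width = len(pssm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, window, score_window(pssm, window)))
    return hits


for start, window, score in scan(PSSM, SEQUENCE)[:5]:
    print(start, window, round(score, 2))
```

Small differences in the second decimal place compared with hand calculations come from rounding each matrix entry before adding.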

CTCF One of the most important transcription factors in human cells. Responsible both for turning genes on and for maintaining 3D structure of the DNA.

Searching human chromosome 21 with the CTCF motif

Significance of scores
Motif scanning algorithm: TTGACCAGCAGGGGGCGCCG scores 6.30
Low score = not a motif occurrence
High score = motif occurrence
How high is high enough?

Two ways to assess significance
1. Empirical
– Randomly generate data according to the null hypothesis.
– Use the resulting score distribution to estimate p-values.
2. Exact
– Mathematically calculate all possible scores.
– Use the resulting score distribution to compute p-values.
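The empirical approach can be sketched as follows. The null hypothesis is assumed here to be random DNA with independent, equally likely bases; the slides do not fix a particular null model. The PSSM is the six-position example motif, rebuilt from letter counts under an assumed log-odds formula, and all function names are made up for illustration.

```python
import math
import random

# Letter counts at each position of the nine example motif sequences.
COUNTS = {
    "A": [6, 6, 2, 0, 0, 2],
    "C": [0, 0, 1, 0, 0, 0],
    "G": [0, 1, 1, 0, 9, 0],
    "T": [3, 2, 5, 9, 0, 7],
}
WIDTH = 6


def log_odds(count, num_seqs=9, background=0.25, weight=1.0):
    """Assumed PSSM entry: log2 of the smoothed frequency over the background."""
    freq = (count + weight * background) / (num_seqs + weight)
    return math.log2(freq / background)


PSSM = {letter: [log_odds(c) for c in column] for letter, column in COUNTS.items()}


def score(window):
    """Score one window of motif width against the PSSM."""
    return sum(PSSM[letter][i] for i, letter in enumerate(window))


def empirical_p_value(observed, num_samples=10000, seed=0):
    """Fraction of random windows scoring at least as high as the observed score."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    at_least = 0
    for _ in range(num_samples):
        window = "".join(rng.choice("ACGT") for _ in range(WIDTH))
        if score(window) >= observed:
            at_least += 1
    return at_least / num_samples


print(empirical_p_value(score("TAATGT")))
```

With num_samples random windows, the estimate can only move in steps of 1 / num_samples, so very small p-values are estimated poorly. That limitation is the motivation for the exact calculation.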

CTCF empirical null distribution

Poor precision in the tail

Converting scores to p-values
Linearly rescale the matrix values to the range [0,100] and integerize.

Converting scores to p-values
Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative.

Converting scores to p-values
Find the largest value. Divide 100 by that value. Multiply through by the result. All entries are now between 0 and 100.
In the example, the largest value is 7, so the scaling factor is 100 / 7.

Converting scores to p-values
Round to the nearest integer.
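The three rescaling steps (shift, scale, round) fit in one short function. This is a sketch under the assumption that the matrix is stored as a list of rows of floats; the example values are made up, chosen so that the largest shifted value is 7.

```python
def integerize(matrix, top=100):
    """Linearly rescale all entries to the range [0, top] and round to integers."""
    # Step 1: subtract the smallest value so that every entry is non-negative.
    smallest = min(min(row) for row in matrix)
    shifted = [[value - smallest for value in row] for row in matrix]

    # Step 2: multiply by top / largest so that every entry is at most top.
    largest = max(max(row) for row in shifted)
    if largest == 0:
        # All entries were equal; nothing to scale.
        return [[0 for _ in row] for row in shifted]
    scale = top / largest

    # Step 3: round to the nearest integer (Python rounds exact halves to even).
    return [[round(value * scale) for value in row] for row in shifted]


# Hypothetical 2 x 2 matrix; after shifting, the largest entry is 7.
example = [[-3.0, 1.0], [4.0, 0.5]]
print(integerize(example))
```

Rounding loses a little precision, which is the price paid for having a finite set of possible scores to enumerate.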

Converting scores to p-values
Say that your motif has N columns (positions). Create a matrix that has N rows and 100N + 1 columns, one column for each integer score from 0 to 100N. The entry in row i, column j is the number of different sequences of length i that have a score of j.

Converting scores to p-values
For each value in the first column of your motif, add 1 to the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1.

Converting scores to p-values
For each value x in the second column of your motif, consider each value y in column z of the first row of the matrix. Add y to column x+z of the second row of the matrix.

Converting scores to p-values
For each value x in the second column of your motif, consider each value y in column z of the first row of the matrix. Add y to column x+z of the second row of the matrix.
What values will go in row 2?
– 10+67, 10+39, 10+71, 10+43, 60+67, …
These 16 sums are the scores of all 16 strings of length 2.

Converting scores to p-values
In the end, the bottom row contains, for each possible score, the number of sequences of length N that achieve it. Use these counts to compute a p-value.
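The row-by-row procedure above can be sketched in Python as follows. The integerized motif is assumed to be stored as a list of positions, each holding the four integer scores for A, C, G and T. Only the current row of the table is kept, since each row depends only on the one before it. The example motif reuses the values 10, 60 and 67, 39, 71, 43 from the slide; the remaining two values (0 and 100) are made up for illustration.

```python
def score_counts(int_motif, top=100):
    """Return a list where entry j is the number of sequences with total score j.

    int_motif[i] holds the integer scores (0..top) of A, C, G, T at position i.
    """
    max_score = top * len(int_motif)

    # Row 1: one sequence of length 1 for each value at the first position.
    counts = [0] * (max_score + 1)
    for x in int_motif[0]:
        counts[x] += 1

    # Each later row extends every shorter sequence by one more letter.
    for position in int_motif[1:]:
        new_counts = [0] * (max_score + 1)
        for z, y in enumerate(counts):
            if y == 0:
                continue  # no sequences reach score z, nothing to extend
            for x in position:
                new_counts[x + z] += y
        counts = new_counts
    return counts


# Two-position example: 10 and 60 are from the slide, 0 and 100 are hypothetical.
example = [[10, 60, 0, 100], [67, 39, 71, 43]]
counts = score_counts(example)
print([score for score, n in enumerate(counts) if n > 0])
```

The work grows with N times the number of score columns rather than with 4 to the power N, which is what makes the exact calculation practical for long motifs.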

Computing a p-value The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)
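With integer scores, the "area under the curve" becomes a sum over the tail of the score counts. The sketch below assumes a uniform background, so that every sequence of length N is equally likely under the null hypothesis, and it counts scores greater than or equal to the observed score, which is the usual convention for discrete scores. The counts are made up for illustration.

```python
def p_value_from_counts(counts, observed_score):
    """Fraction of all sequences whose score is at least observed_score.

    counts[j] is the number of sequences with integer score j.
    Assumes every sequence is equally likely under the null hypothesis.
    """
    total = sum(counts)
    at_least = sum(counts[observed_score:])
    return at_least / total


# Hypothetical counts for scores 0..4, covering 16 sequences in total.
example_counts = [1, 3, 6, 4, 2]
print(p_value_from_counts(example_counts, 3))
```

With a non-uniform background, each entry would hold a total probability rather than a count, and the same tail sum would apply.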

Sample problem #1
Given:
– a file containing a length-n DNA sequence, and
– a file containing a DNA motif represented as a PSSM of length n.
Return:
– the score of the motif versus the sequence

Sample problem #2
Given:
– a file containing a DNA sequence, and
– a file containing a DNA motif represented as a PSSM.
Return:
– For each position that scores greater than 0, print the position, the score and the matching sequence

Sample problem #3 Modify the previous program to print the same results, but in sorted order, with the greatest score first.
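The sorting step on its own can be sketched as below, assuming the previous program has collected its results as (position, score, sequence) tuples. That layout is an assumption, the function name is made up, and the second and third hits are hypothetical values used only to show the ordering.

```python
def sort_hits(hits):
    """Return the hits ordered by score, greatest first."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# (position, score, matching sequence); only the 6.87 hit comes from the slides.
hits = [
    (14, 2.10, "GGTTTT"),
    (0, 6.87, "TAATGT"),
    (33, 0.45, "GAGAAT"),
]

for position, score, sequence in sort_hits(hits):
    print(position, score, sequence)
```

Using sorted with a key function leaves the original list untouched, so the unsorted results are still available afterwards.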