Motif search and discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Outline One-minute response Revision Motifs – Computing p-values for motif occurrences Python
One-minute responses I understood 90% of the lecture, and the concepts were interesting. I am getting familiar with the Python coding. Everything was clear in Python and in the class. The comprehension of Python is 50%. Liked the video. Understood the Python part. This section is interesting, but I didn’t figure out the Python problem. I appreciated the way of presenting the Python part. More Python code, please! I think we need more on programming to understand it better. Can you please explain the motif again?
Motif Compared to surrounding sequences, motifs experience fewer mutations. Why? – Because a mutation inside a motif reduces the chance that the organism will survive. What is an example of the function of a motif? – Binding site, phosphorylation site, structural motif. Why do we want to find motifs? – To understand the function of the sequence, and to identify distant homologs.
Revision A C G T TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG – 1.00 – 3.32 – = -6.72
Searching human chromosome 21 with the CTCF motif
Significance of scores Motif scanning algorithm TTGACCAGCAGGGGGCGCCG 6.30 Low score = not a motif occurrence High score = motif occurrence How high is high enough? A C G T
Two way to assess significance 1.Empirical – Randomly generate data according to the null hypothesis. – Use the resulting score distribution to estimate p- values. 2.Exact – Mathematically calculate all possible scores – Use the resulting score distribution to estimate p- values.
CTCF empirical null distribution
Computing a p-value The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)
Poor precision in the tail
Converting scores to p-values Linearly rescale the matrix values to the range [0,100] and integerize. A C G T A C G T
Converting scores to p-values Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative. A C G T A C G T
Converting scores to p-values Find the largest value. Divide 100 by that value. Multiply through by the result. All entries are now between 0 and 100. A C G T / 7 = A C G T
Converting scores to p-values Round to the nearest integer. A C G T A C G T
Converting scores to p-values Say that your motif has N rows. Create a matrix that has N rows and 100N columns. The entry in row i, column j is the number of different sequences of length i that can have a score of j. A C G T … 400
Converting scores to p-values For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1. A C G T …
Converting scores to p-values For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to the x+zth column of the matrix. A C G T …
Converting scores to p-values For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to the x+zth column of the matrix. What values will go in row 2? – 10+67, 10+39, 10+71, 10+43, 60+67, …, These 16 values correspond to all 16 strings of length 2. A C G T …
Converting scores to p-values In the end, the bottom row contains the scores for all possible sequences of length N. Use these scores to compute a p-value. A C G T …
Sample problem #1 Given: – a positive integer N, and – a file containing a DNA motif represented as a PSSM of length n. Return: – a version of the motif in which the values are integers in the range [0, N]. A C G T