Position-specific scoring matrices: decreasing complexity through information analysis

Training set including sequences from two Nostocs:

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

Not every column is equally well conserved: some columns seem to be more informative than others about what a binding site looks like. We might increase the performance of our PSSM if we can filter out columns that don't have "enough information".
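A minimal sketch (assuming the training-set sequences are already aligned and of equal length; function and variable names are illustrative, not from the slides) of how per-column nucleotide frequencies could be tabulated for the PSSM:

```python
from collections import Counter

# Two of the aligned training-set sequences above (the full set would be used in practice).
sequences = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
]

def column_frequencies(seqs):
    """Relative frequency of A, C, G and T in each alignment column."""
    freqs = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column)
        freqs.append({base: counts[base] / total for base in "ACGT"})
    return freqs

freqs = column_frequencies(sequences)
print(freqs[0])  # first column: {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0}
```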
Position-specific scoring matrices: decreasing complexity through information analysis

Uncertainty of column $c$: $H_c = -\sum_{i=1}^{M} p_{ic} \log_2(p_{ic})$

Confusing!!! (Hence the digression into information theory that follows.)
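A worked sketch of this column uncertainty (illustrative code; terms with a zero frequency are dropped, following the convention that 0·log2(0) = 0). It could be applied directly to the per-column frequencies tabulated in the previous sketch:

```python
import math

def column_uncertainty(column_freqs):
    """H_c = -sum_i p_ic * log2(p_ic), in bits, for one column of frequencies."""
    return sum(-p * math.log2(p) for p in column_freqs.values() if p > 0)

# A perfectly conserved column carries no uncertainty ...
print(column_uncertainty({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))  # 0.0
# ... while a completely uninformative column carries the maximum, 2 bits.
print(column_uncertainty({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # 2.0
```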
Digression on information theory: uncertainty when all outcomes are equally probable

Pretend we have a machine that spits out an infinitely long string of nucleotides, and that each nucleotide is equally likely to occur: GATGACTC…

How uncertain are we about the outcome before we see each new character produced by the machine? Intuitively, this uncertainty will depend on how many possibilities exist.
Digression on information theory: quantifying uncertainty when outcomes are equally probable

One way to quantify uncertainty is to ask: "What is the minimum number of questions required to remove all ambiguity about the outcome?" If the possibilities are A, G, C, or T, how many yes/no questions do we need to ask?
Digression on information theory: quantifying uncertainty when outcomes are equally probable

[Decision tree over the alphabet A, G, C, T: each yes/no question splits the remaining possibilities in half.]

$H = \log_2(M)$, where $M = 4$ is the alphabet size. The number of decisions needed equals the height of the decision tree; with $M = 4$ we are uncertain by $\log_2(4) = 2$ bits before each new symbol is produced by our machine.
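A small illustrative check (the helper below is hypothetical, not from the slides) that two yes/no questions are enough to identify any one of four equally likely nucleotides:

```python
import math

M = 4                    # alphabet size
print(math.log2(M))      # 2.0 bits of uncertainty per symbol

def two_questions(symbol):
    """Identify a nucleotide from exactly two yes/no answers (a depth-2 decision tree)."""
    is_purine = symbol in "AG"   # question 1: is it a purine (A or G)?
    is_a_or_c = symbol in "AC"   # question 2: is it A or C?
    return {(True, True): "A", (True, False): "G",
            (False, True): "C", (False, False): "T"}[(is_purine, is_a_or_c)]

assert all(two_questions(base) == base for base in "ACGT")
```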
Digression on information theory: uncertainty when all outcomes are equally probable

After we have received a new symbol from our machine we are less uncertain. Intuitively, becoming less uncertain means we have gained information:

Information = uncertainty before - uncertainty after, i.e. $\text{Information} = H_{before} - H_{after}$

Note that only in the special case where no uncertainty remains afterwards ($H_{after} = 0$) does the information equal $H_{before}$. In the real world this never happens, because of noise in the system!
Digression on information theory

Fine, but where did we get $H = -\sum_{i=1}^{M} P_i \log_2(P_i)$? This form is necessary when outcomes are not equally probable!
Digression on information theory: uncertainty with unequal probabilities

Now our machine produces a string of symbols, but some are more likely to occur than others: $P_A = 0.6$, $P_G = 0.1$, $P_C = 0.1$, $P_T = 0.2$.
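A sketch of such a machine, simulated with weighted random sampling (the probabilities are those on the slide; the function name, seed, and string length are illustrative choices):

```python
import random

probabilities = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

def biased_machine(n, seed=0):
    """Emit n nucleotides with the unequal probabilities above."""
    rng = random.Random(seed)
    return "".join(rng.choices(list(probabilities), weights=probabilities.values(), k=n))

print(biased_machine(20))  # a 20-symbol string dominated by A's
```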
Digression on information theory: uncertainty with unequal probabilities

Now our machine produces a string of symbols, but we know that some are more likely to occur than others: AAATAAGTC…

Now how uncertain are we about the outcome before we see each new character? Are we more or less surprised when we see an "A" or a "C"?
Digression on information theory: uncertainty with unequal probabilities

Do you agree that we are less surprised to see an "A" than we are to see a "G"? Do you think that the output of our new machine is more or less uncertain than that of the equiprobable machine?
Digression on information theory: what about when outcomes are not equally probable?

$\log_2 M = -\log_2 M^{-1} = -\log_2(1/M) = -\log_2(P)$, where $P = 1/M$ is the probability of a symbol appearing.
Digression on information theory: what about when outcomes are not equally probable?

$P_A = 0.6$, $P_G = 0.1$, $P_C = 0.1$, $P_T = 0.2$, with $M = 4$.

Remember that the probabilities of all possible symbols must sum to 1: $\sum_{i=1}^{M} P_i = 1$.
Digression on information theory: how surprised are we to see a given symbol?

$u_i = -\log_2(P_i)$, where $P_i$ is the probability of the $i$-th symbol; $u_i$ is therefore called the surprisal for symbol $i$.

$u_A = -\log_2(0.6) \approx 0.7$
$u_G = -\log_2(0.1) \approx 3.3$
$u_C = -\log_2(0.1) \approx 3.3$
$u_T = -\log_2(0.2) \approx 2.3$
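A quick check of these surprisal values (illustrative code, not from the slides):

```python
import math

probabilities = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

def surprisal(p):
    """u_i = -log2(P_i), in bits."""
    return -math.log2(p)

for base, p in probabilities.items():
    print(f"u_{base} = {surprisal(p):.1f} bits")
# u_A = 0.7, u_G = 3.3, u_C = 3.3, u_T = 2.3
```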
Digression on information theory: what does the surprisal for a symbol have to do with uncertainty?

$u_i = -\log_2(P_i)$ (the surprisal)

Uncertainty is the average surprisal over the infinite string of symbols produced by our machine.
Digression on information theory

Let's first imagine that our machine produces only a finite string of $N$ symbols, so that $N = \sum_{i=1}^{M} N_i$, where $N_i$ is the number of times the $i$-th symbol occurred in the string.

For example, for the string "AAGTAACGA": $N = 9$, with $N_A = 5$, $N_G = 2$, $N_C = 1$, $N_T = 1$.
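A quick tally of the $N_i$ values for this example string (illustrative):

```python
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)              # N_i for each symbol
N = sum(counts.values())         # N = sum over i of N_i
print(N, dict(counts))           # 9 {'A': 5, 'G': 2, 'T': 1, 'C': 1}
```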
Digression on information theory

For every $N_i$ there is a corresponding surprisal $u_i$, so the average surprisal over the $N$ symbols is

$\frac{1}{N}\sum_{i=1}^{M} N_i u_i = \sum_{i=1}^{M} \frac{N_i}{N} u_i$
Digression on information theory

$\sum_{i=1}^{M} \frac{N_i}{N} u_i = \sum_{i=1}^{M} P_i u_i$

Remember that $P_i$ is simply the probability of generating the $i$-th symbol (as the string grows, $N_i/N$ approaches $P_i$). But wait, we also already defined $u_i$!
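A numerical sketch of this step: for a long string drawn from the biased machine, $N_i/N$ approaches $P_i$, so the empirical average surprisal approaches $\sum_i P_i u_i$ (the string length and seed below are arbitrary, illustrative choices):

```python
import math
import random
from collections import Counter

probabilities = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

def average_surprisal(s):
    """Empirical average surprisal: sum_i (N_i / N) * u_i, with u_i = -log2(P_i)."""
    counts = Counter(s)
    return sum((n / len(s)) * -math.log2(probabilities[base]) for base, n in counts.items())

rng = random.Random(0)
long_string = "".join(rng.choices(list(probabilities), weights=probabilities.values(), k=100_000))

print(average_surprisal(long_string))                          # close to ...
print(sum(-p * math.log2(p) for p in probabilities.values()))  # ... about 1.57 bits
```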
Digression on information theory

Since $u_i = -\log_2(P_i)$, we have

$H = \sum_{i=1}^{M} P_i u_i = -\sum_{i=1}^{M} P_i \log_2(P_i)$

Congratulations! This is Claude Shannon's famous formula defining uncertainty when the probability of each symbol is unequal.
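A minimal sketch of Shannon's formula in code (the function name is illustrative):

```python
import math

def shannon_uncertainty(probabilities):
    """H = -sum_i P_i * log2(P_i), in bits; zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(shannon_uncertainty([0.6, 0.1, 0.1, 0.2]))      # about 1.57 bits (the biased machine)
print(shannon_uncertainty([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (equiprobable: the maximum)
```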
Digression on information theory

Uncertainty is largest when all symbols are equally probable! How does $H$ reduce when the symbols are equiprobable, i.e. $P_i = 1/M$?

$H_{eq} = -\sum_{i=1}^{M} \frac{1}{M}\log_2\frac{1}{M} = -M \cdot \frac{1}{M}\log_2\frac{1}{M} = -\log_2\frac{1}{M} = \log_2 M$
Digression on information theory: uncertainty when $M = 2$

$H = -\sum_{i=1}^{M} P_i \log_2(P_i)$

Uncertainty is largest when all symbols are equally probable!
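A small numerical sketch of the two-symbol case (illustrative code), showing that the uncertainty peaks at 1 bit when both symbols are equally probable:

```python
import math

def binary_uncertainty(p):
    """H for a two-symbol alphabet with probabilities p and 1 - p."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"P = {p:.1f}  H = {binary_uncertainty(p):.3f} bits")
# H rises from 0 to its maximum of 1 bit at P = 0.5, then falls back to 0.
```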
Digression on information theory

OK, but how much information is present in each column?

$R = H_{before} - H_{after} = \log_2 M - \left(-\sum_{i=1}^{M} P_i \log_2(P_i)\right)$

Here "before" and "after" refer to before and after we examine the contents of a column: before, every symbol is assumed equally likely ($H_{before} = \log_2 M = 2$ bits for DNA); after, the uncertainty is the Shannon uncertainty of the symbol frequencies observed in that column.
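A sketch of the per-column information content for the training-set alignment from the first slide (illustrative: only two sequences are shown for brevity, and no small-sample or background-frequency correction is applied):

```python
import math
from collections import Counter

def column_information(seqs, alphabet="ACGT"):
    """R = log2(M) - H_after for each column of an aligned set of sequences."""
    h_before = math.log2(len(alphabet))
    info = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column)
        h_after = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        info.append(h_before - h_after)
    return info

seqs = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
]
R = column_information(seqs)
print([round(r, 2) for r in R[:10]])  # 2.0 where the two sequences agree, 1.0 where they differ
```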
Digression on information theory

Sequence logos graphically display how much information is present in each column: http://weblogo.berkeley.edu/