1 Position-specific scoring matrices: decrease complexity through information analysis
Training set including sequences from two Nostocs:
71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA
Not every column is equally well conserved; some columns seem to be more informative than others about what a binding site looks like. We might increase the performance of our PSSM if we can filter out columns that don't have "enough information", as sketched below.
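A minimal sketch of how the per-column counts behind a PSSM could be tallied from this alignment. The function name is illustrative, and only two of the eleven sequences are listed to keep the block short:

```python
from collections import Counter

# Two of the aligned training-set sequences, for brevity; in practice all
# eleven sequences from this slide would be listed.
training_set = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
]

def count_matrix(seqs, alphabet="ACGT"):
    """Per-column base counts: the raw material of a position-specific scoring matrix."""
    assert len({len(s) for s in seqs}) == 1, "sequences must be aligned to the same length"
    columns = []
    for col in zip(*seqs):                      # walk across alignment columns
        counts = Counter(col)
        columns.append({b: counts.get(b, 0) for b in alphabet})
    return columns

counts = count_matrix(training_set)
print(counts[0])   # e.g. {'A': 0, 'C': 2, 'G': 0, 'T': 0} for the first column
```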

2 Position-specific scoring matrices: decrease complexity through information analysis
Uncertainty for column c: H_c = -Σ_i [p_ic · log2(p_ic)]
Confusing!!!

3 Digression on information theory: uncertainty when all outcomes are equally probable
Pretend we have a machine that spits out an infinitely long string of nucleotides, and that each nucleotide is EQUALLY LIKELY to occur: GATGACTC…
How uncertain are we about the outcome BEFORE we see each new character produced by the machine? Intuitively, this uncertainty will depend on how many possibilities exist.

4 Digression on information theory: quantifying uncertainty when outcomes are equally probable
If the possibilities are A, G, C, or T, one way to quantify uncertainty is to ask: "What is the minimum number of yes/no questions required to remove all ambiguity about the outcome?"

5 Digression on information theory: quantifying uncertainty when outcomes are equally probable
[Decision tree over the alphabet A, G, C, T: each yes/no question narrows down the possibilities.]
With alphabet size M, H = log2(M); the number of decisions needed is the height of the decision tree. With M = 4 we are uncertain by log2(4) = 2 bits before each new symbol is produced by our machine.
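As an illustration of the decision-tree idea, here is a small hypothetical sketch that identifies a symbol by repeatedly halving the alphabet and counts the yes/no questions used; the count matches log2(M) when M is a power of two:

```python
import math

def questions_needed(alphabet, target):
    """Identify `target` by repeatedly asking 'is it in the left half?'
    and count the yes/no questions (assumes len(alphabet) is a power of 2)."""
    candidates = list(alphabet)
    questions = 0
    while len(candidates) > 1:
        half = len(candidates) // 2
        questions += 1
        candidates = candidates[:half] if target in candidates[:half] else candidates[half:]
    return questions

print(questions_needed("AGCT", "C"), math.log2(4))   # 2 questions, log2(4) = 2.0 bits
```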

6 Digression on information theory: uncertainty when all outcomes are equally probable
After we have received a new symbol from our machine we are less uncertain. Intuitively, when we become less uncertain, it means we have gained information.
Information = uncertainty before - uncertainty after, i.e. Information = H_before - H_after
Note that only in the special case where no uncertainty remains afterwards (H_after = 0) does information equal H_before. In the real world this never happens, because of noise in the system!

7 Digression on information theory: necessary when outcomes are not equally probable!
Fine, but where did we get H = -Σ_{i=1..M} P_i·log2(P_i)?

8 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but some are more likely to occur than others:
P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2

9 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others: AAATAAGTC…
Now how uncertain are we about the outcome BEFORE we see each new character? Are we more or less surprised when we see an "A" or a "C"?

10 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others.
Do you agree that we are less surprised to see an "A" than we are to see a "G"? Do you think that the output of our new machine is more or less uncertain?

11 Digression on information theory: what about when outcomes are not equally probable?
log2(M) = -log2(M^-1) = -log2(1/M) = -log2(P), where P = 1/M is the probability of a symbol appearing.

12 Digression on information theory: what about when outcomes are not equally probable?
P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2 (M = 4)
Remember that the probabilities of all possible symbols must sum to 1: Σ_{i=1..M} P_i = 1

13 Digression on information theory: how surprised are we to see a given symbol?
u_i = -log2(P_i)   (where P_i = probability of the i-th symbol)
u_A = -log2(0.6) ≈ 0.7   u_G = -log2(0.1) ≈ 3.3   u_C = -log2(0.1) ≈ 3.3   u_T = -log2(0.2) ≈ 2.3
u_i is therefore called the surprisal for symbol i.
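A one-line check of these surprisal values, using the probabilities from the earlier slides (the variable names are just for illustration):

```python
import math

probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

# Surprisal u_i = -log2(P_i): the rarer the symbol, the more surprising it is.
surprisal = {base: -math.log2(p) for base, p in probs.items()}
print({b: round(u, 1) for b, u in surprisal.items()})
# {'A': 0.7, 'G': 3.3, 'C': 3.3, 'T': 2.3}
```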

14 Digression on information theory: what does the surprisal for a symbol have to do with uncertainty?
u_i = -log2(P_i)   (the "surprisal")
Uncertainty is the average surprisal over the infinite string of symbols produced by our machine.

15 Digression on information theory
Let's first imagine that our machine only produces a finite string of N symbols, so that
N = Σ_{i=1..M} N_i
where N_i is the number of times the i-th symbol occurred in the string of length N.
For example, for the string "AAGTAACGA": N_A = 5, N_G = 2, N_C = 1, N_T = 1, and N = 9.
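The counts for the example string on this slide can be tallied directly (a minimal sketch):

```python
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)          # N_i for each symbol
N = sum(counts.values())     # N = sum over i of N_i
print(dict(counts), N)       # {'A': 5, 'G': 2, 'T': 1, 'C': 1} 9
```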

16 Digression on information theory
For every N_i there is a corresponding surprisal u_i; therefore the average surprisal for N symbols will be:
(Σ_{i=1..M} N_i·u_i) / (Σ_{i=1..M} N_i) = (Σ_{i=1..M} N_i·u_i) / N = Σ_{i=1..M} (N_i/N)·u_i

17 Digression on information theory
For every N_i there is a corresponding surprisal u_i; therefore the average surprisal for N symbols will be:
Σ_{i=1..M} (N_i/N)·u_i = Σ_{i=1..M} P_i·u_i
Remember that P_i is simply the probability of generating the i-th symbol (for a long string, N_i/N approaches P_i)! But wait, we also already defined u_i!
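A quick numerical check of this identity, reusing the example string "AAGTAACGA" from slide 15 (the variable names are illustrative):

```python
import math
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)
N = len(s)

# Empirical probabilities P_i = N_i / N and surprisals u_i = -log2(P_i)
probs = {b: n / N for b, n in counts.items()}
surprisal = {b: -math.log2(p) for b, p in probs.items()}

# The average surprisal computed both ways shown on this slide:
avg1 = sum(counts[b] * surprisal[b] for b in counts) / N   # (sum of N_i * u_i) / N
avg2 = sum(probs[b] * surprisal[b] for b in probs)         # sum of P_i * u_i
print(round(avg1, 3), round(avg2, 3))   # identical values (about 1.658 bits here)
```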

18 Digression on information theory
Since u_i = -log2(P_i), we therefore have:
H = Σ_{i=1..M} P_i·u_i = -Σ_{i=1..M} P_i·log2(P_i)
Congratulations! This is Claude Shannon's famous formula defining uncertainty when the probability of each symbol is unequal.
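A small sketch applying the formula to the unequal machine from the earlier slides (the function name is illustrative):

```python
import math

def shannon_uncertainty(probs):
    """H = -sum of P_i * log2(P_i) over all symbols with nonzero probability."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# The unequal machine from the earlier slides:
print(round(shannon_uncertainty({"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}), 2))  # ~1.57 bits
# For comparison, the equiprobable machine from the start of the digression:
print(shannon_uncertainty({"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}))        # 2.0 bits
```

The skewed machine comes out at about 1.57 bits, less than the 2 bits of the equiprobable machine, which is the point made on the next slide.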

19 Digression on information theory: uncertainty is largest when all symbols are equally probable!
How does H reduce if we assume equiprobable symbols?
H_eq = -Σ_{i=1..M} (1/M)·log2(1/M) = -M·(1/M)·log2(1/M) = -log2(1/M) = log2(M)
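A quick numerical check that the sum collapses to log2(M) (a sketch; the chosen values of M are arbitrary):

```python
import math

# Check that H_eq = -sum_{i=1..M} (1/M) * log2(1/M) collapses to log2(M):
for M in (2, 4, 8, 20):
    h_eq = -sum((1 / M) * math.log2(1 / M) for _ in range(M))
    print(M, round(h_eq, 3), round(math.log2(M), 3))   # the last two values match
```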

20 Digression on information theory: uncertainty when M = 2
H = -Σ_{i=1..M} P_i·log2(P_i)
Uncertainty is largest when all symbols are equally probable!
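A small sketch of the M = 2 case (the function name is hypothetical); the printed values peak at 1 bit when the two symbols are equally probable:

```python
import math

def binary_uncertainty(p):
    """H for a two-symbol alphabet with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_uncertainty(p), 3))   # peaks at 1.0 bit when p = 0.5
```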

21 Digression on information theory: OK, but how much information is present in each column?
Information (R) = H_before - H_after = log2(M) - (-Σ_{i=1..M} P_i·log2(P_i))
Here "before" and "after" refer to before and after we examined the contents of a column.
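Putting the pieces together for the PSSM use case, here is a minimal sketch that scores each alignment column by R = log2(4) - H_after. The function name and the toy alignment are illustrative, and the small-sample correction used by some logo tools is deliberately omitted:

```python
import math
from collections import Counter

def column_information(seqs, alphabet="ACGT"):
    """R = log2(M) - H_after for every column of an aligned set of sequences."""
    M = len(alphabet)
    info = []
    for col in zip(*seqs):
        counts = Counter(col)
        total = len(col)
        h_after = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(math.log2(M) - h_after)
    return info

# Tiny illustrative alignment (three aligned strings, not the real training set):
example = ["ACGT", "ACGA", "ACTT"]
print([round(r, 2) for r in column_information(example)])   # [2.0, 2.0, 1.08, 1.08]
```

Columns scoring near 2 bits are well conserved; columns scoring near 0 bits carry little information and could be filtered out, which is the "decrease complexity" idea from the first slide.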

22 Digression on information theory
Sequence logos graphically display how much information is present in each column: http://weblogo.berkeley.edu/

