1 Position-specific scoring matrices: decrease complexity through information analysis
Training set including sequences from two Nostocs:
71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA
Not every column is equally well conserved; some columns seem to be more informative than others about what a binding site looks like. We might increase the performance of our PSSM if we can filter out columns that don't have "enough information", as sketched below.
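A minimal sketch of how the per-column counts behind a PSSM could be tallied from this alignment. The function name is illustrative, and only two of the eleven sequences are listed to keep the block short:

```python
from collections import Counter

# Two of the aligned training-set sequences, for brevity; in practice all
# eleven sequences from this slide would be listed.
training_set = [
    "CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",  # 71-devB
    "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA",  # Np-devB
]

def count_matrix(seqs, alphabet="ACGT"):
    """Per-column base counts: the raw material of a position-specific scoring matrix."""
    assert len({len(s) for s in seqs}) == 1, "sequences must be aligned to the same length"
    columns = []
    for col in zip(*seqs):                      # walk across alignment columns
        counts = Counter(col)
        columns.append({b: counts.get(b, 0) for b in alphabet})
    return columns

counts = count_matrix(training_set)
print(counts[0])   # e.g. {'A': 0, 'C': 2, 'G': 0, 'T': 0} for the first column
```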

2 Position-specific scoring matrices: decrease complexity through information analysis
Uncertainty for column c: H_c = -Σ_i [p_ic · log2(p_ic)]
Confusing!!!

3 Digression on information theory: uncertainty when all outcomes are equally probable
Pretend we have a machine that spits out an infinitely long string of nucleotides, and that each nucleotide is EQUALLY LIKELY to occur: GATGACTC…
How uncertain are we about the outcome BEFORE we see each new character produced by the machine? Intuitively, this uncertainty will depend on how many possibilities exist.

4 Digression on information theory: quantifying uncertainty when outcomes are equally probable
If the possibilities are A, G, C, or T, one way to quantify uncertainty is to ask: "What is the minimum number of yes/no questions required to remove all ambiguity about the outcome?"

5 Digression on information theory: quantifying uncertainty when outcomes are equally probable
[Decision tree over the alphabet A, G, C, T: each yes/no question narrows down the possibilities.]
With alphabet size M, H = log2(M); the number of decisions needed is the height of the decision tree. With M = 4 we are uncertain by log2(4) = 2 bits before each new symbol is produced by our machine.
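As an illustration of the decision-tree idea, here is a small hypothetical sketch that identifies a symbol by repeatedly halving the alphabet and counts the yes/no questions used; the count matches log2(M) when M is a power of two:

```python
import math

def questions_needed(alphabet, target):
    """Identify `target` by repeatedly asking 'is it in the left half?'
    and count the yes/no questions (assumes len(alphabet) is a power of 2)."""
    candidates = list(alphabet)
    questions = 0
    while len(candidates) > 1:
        half = len(candidates) // 2
        questions += 1
        candidates = candidates[:half] if target in candidates[:half] else candidates[half:]
    return questions

print(questions_needed("AGCT", "C"), math.log2(4))   # 2 questions, log2(4) = 2.0 bits
```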

6 Digression on information theory: uncertainty when all outcomes are equally probable
After we have received a new symbol from our machine we are less uncertain. Intuitively, when we become less uncertain, it means we have gained information.
Information = uncertainty before - uncertainty after, i.e. Information = H_before - H_after
Note that only in the special case where no uncertainty remains afterwards (H_after = 0) does information equal H_before. In the real world this never happens, because of noise in the system!

7 Digression on information theory: necessary when outcomes are not equally probable!
Fine, but where did we get H = -Σ_{i=1..M} P_i·log2(P_i)?

8 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but some are more likely to occur than others:
P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2

9 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others: AAATAAGTC…
Now how uncertain are we about the outcome BEFORE we see each new character? Are we more or less surprised when we see an "A" or a "C"?

10 Digression on information theory: uncertainty with unequal probabilities
Now our machine produces a string of symbols, but we know that some are more likely to occur than others.
Do you agree that we are less surprised to see an "A" than we are to see a "G"? Do you think that the output of our new machine is more or less uncertain?

11 Digression on information theory: what about when outcomes are not equally probable?
log2(M) = -log2(M^-1) = -log2(1/M) = -log2(P), where P = 1/M is the probability of a symbol appearing.

12 Digression on information theory: what about when outcomes are not equally probable?
P_A = 0.6, P_G = 0.1, P_C = 0.1, P_T = 0.2 (M = 4)
Remember that the probabilities of all possible symbols must sum to 1: Σ_{i=1..M} P_i = 1

13 Digression on information theory: how surprised are we to see a given symbol?
u_i = -log2(P_i)   (where P_i = probability of the i-th symbol)
u_A = -log2(0.6) ≈ 0.7   u_G = -log2(0.1) ≈ 3.3   u_C = -log2(0.1) ≈ 3.3   u_T = -log2(0.2) ≈ 2.3
u_i is therefore called the surprisal for symbol i.
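A one-line check of these surprisal values, using the probabilities from the earlier slides (the variable names are just for illustration):

```python
import math

probs = {"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}

# Surprisal u_i = -log2(P_i): the rarer the symbol, the more surprising it is.
surprisal = {base: -math.log2(p) for base, p in probs.items()}
print({b: round(u, 1) for b, u in surprisal.items()})
# {'A': 0.7, 'G': 3.3, 'C': 3.3, 'T': 2.3}
```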

14 Digression on information theory: what does the surprisal for a symbol have to do with uncertainty?
u_i = -log2(P_i)   (the "surprisal")
Uncertainty is the average surprisal over the infinite string of symbols produced by our machine.

15 Digression on information theory
Let's first imagine that our machine only produces a finite string of N symbols, so that
N = Σ_{i=1..M} N_i
where N_i is the number of times the i-th symbol occurred in the string of length N.
For example, for the string "AAGTAACGA": N_A = 5, N_G = 2, N_C = 1, N_T = 1, and N = 9.
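The counts for the example string on this slide can be tallied directly (a minimal sketch):

```python
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)          # N_i for each symbol
N = sum(counts.values())     # N = sum over i of N_i
print(dict(counts), N)       # {'A': 5, 'G': 2, 'T': 1, 'C': 1} 9
```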

16 Digression on information theory
For every N_i there is a corresponding surprisal u_i; therefore the average surprisal for N symbols will be:
(Σ_{i=1..M} N_i·u_i) / (Σ_{i=1..M} N_i) = (Σ_{i=1..M} N_i·u_i) / N = Σ_{i=1..M} (N_i/N)·u_i

17 Digression on information theory
For every N_i there is a corresponding surprisal u_i; therefore the average surprisal for N symbols will be:
Σ_{i=1..M} (N_i/N)·u_i = Σ_{i=1..M} P_i·u_i
Remember that P_i is simply the probability of generating the i-th symbol (for a long string, N_i/N approaches P_i)! But wait, we also already defined u_i!
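A quick numerical check of this identity, reusing the example string "AAGTAACGA" from slide 15 (the variable names are illustrative):

```python
import math
from collections import Counter

s = "AAGTAACGA"
counts = Counter(s)
N = len(s)

# Empirical probabilities P_i = N_i / N and surprisals u_i = -log2(P_i)
probs = {b: n / N for b, n in counts.items()}
surprisal = {b: -math.log2(p) for b, p in probs.items()}

# The average surprisal computed both ways shown on this slide:
avg1 = sum(counts[b] * surprisal[b] for b in counts) / N   # (sum of N_i * u_i) / N
avg2 = sum(probs[b] * surprisal[b] for b in probs)         # sum of P_i * u_i
print(round(avg1, 3), round(avg2, 3))   # identical values (about 1.658 bits here)
```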

18 Digression on information theory
Since u_i = -log2(P_i), we therefore have:
H = Σ_{i=1..M} P_i·u_i = -Σ_{i=1..M} P_i·log2(P_i)
Congratulations! This is Claude Shannon's famous formula defining uncertainty when the probability of each symbol is unequal.
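A small sketch applying the formula to the unequal machine from the earlier slides (the function name is illustrative):

```python
import math

def shannon_uncertainty(probs):
    """H = -sum of P_i * log2(P_i) over all symbols with nonzero probability."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# The unequal machine from the earlier slides:
print(round(shannon_uncertainty({"A": 0.6, "G": 0.1, "C": 0.1, "T": 0.2}), 2))  # ~1.57 bits
# For comparison, the equiprobable machine from the start of the digression:
print(shannon_uncertainty({"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}))        # 2.0 bits
```

The skewed machine comes out at about 1.57 bits, less than the 2 bits of the equiprobable machine, which is the point made on the next slide.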

19 Digression on information theory: uncertainty is largest when all symbols are equally probable!
How does H reduce if we assume equiprobable symbols?
H_eq = -Σ_{i=1..M} (1/M)·log2(1/M) = -M·(1/M)·log2(1/M) = -log2(1/M) = log2(M)
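A quick numerical check that the sum collapses to log2(M) (a sketch; the chosen values of M are arbitrary):

```python
import math

# Check that H_eq = -sum_{i=1..M} (1/M) * log2(1/M) collapses to log2(M):
for M in (2, 4, 8, 20):
    h_eq = -sum((1 / M) * math.log2(1 / M) for _ in range(M))
    print(M, round(h_eq, 3), round(math.log2(M), 3))   # the last two values match
```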

20 Digression on information theory: uncertainty when M = 2
H = -Σ_{i=1..M} P_i·log2(P_i)
Uncertainty is largest when all symbols are equally probable!
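A small sketch of the M = 2 case (the function name is hypothetical); the printed values peak at 1 bit when the two symbols are equally probable:

```python
import math

def binary_uncertainty(p):
    """H for a two-symbol alphabet with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_uncertainty(p), 3))   # peaks at 1.0 bit when p = 0.5
```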

21 Digression on information theory: OK, but how much information is present in each column?
Information (R) = H_before - H_after = log2(M) - (-Σ_{i=1..M} P_i·log2(P_i))
Here "before" and "after" refer to before and after we examined the contents of a column.
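Putting the pieces together for the PSSM use case, here is a minimal sketch that scores each alignment column by R = log2(4) - H_after. The function name and the toy alignment are illustrative, and the small-sample correction used by some logo tools is deliberately omitted:

```python
import math
from collections import Counter

def column_information(seqs, alphabet="ACGT"):
    """R = log2(M) - H_after for every column of an aligned set of sequences."""
    M = len(alphabet)
    info = []
    for col in zip(*seqs):
        counts = Counter(col)
        total = len(col)
        h_after = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(math.log2(M) - h_after)
    return info

# Tiny illustrative alignment (three aligned strings, not the real training set):
example = ["ACGT", "ACGA", "ACTT"]
print([round(r, 2) for r in column_information(example)])   # [2.0, 2.0, 1.08, 1.08]
```

Columns scoring near 2 bits are well conserved; columns scoring near 0 bits carry little information and could be filtered out, which is the "decrease complexity" idea from the first slide.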

22 Digression on information theory
Sequence logos graphically display how much information is present in each column: http://weblogo.berkeley.edu/

