
Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.


1 Entropy, Information contents & Logo plots By Thomas Nordahl Petersen

2 Biological information

GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
TTTACTAACGACACATTGAAGAAATCACTTTG
GATACGCTTACCGTTATCCAGAGCTACAGCGC
TACTAATATGTAATACTTCAGCTCCCCTTAAT
ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
TAACATAACTTATTTACATAGTGCCATTGAAC
GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
TTTTCCCGACCATCAAGACAGGTGATTTATCA
TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
AACGTACTTTAATATTTATAGTACTTCATTCG
AACATGCTATTTTTCATACAGCAACCTCACAT
CTGCACTCATCATTAGATTAGAGGAACATGGA
TACTTTTCTTTATCTAAGCAGCTAACTCAACT
ATCAACATGCTATTGAACTAGAGATCCACCTA
TAACTAACATGACTTTAACAGGGCTAATTTAC
AGTACTAACTAATTAACTTAGAACATTAACAT
GATCACCGTCACATTTATTAGAATTTCAAACG
CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
CTCTATGACCAATAAAAACAGACTGTACTTTC
AAATGGTATTATTTATAACAGTTGAACATTTC
ATAAATATGCGATCAATATAGACCGTTGATAT
ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
ATTTATTTCCTTATAATACAGACACGGTTACA
TCGCAATTAATTTTCTAATAGTTTTTCATTTT
GACCATCTTTCTTTTCCCCAGTGCTAAACACG
AACCTTCTTTCTCATTCGTAGATTACTGTTGC
AATTACTAACAGCTGTAATAGCCGACAAATTT
CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
AAACTTCCATTTCTTACATAGATCATCGCCAT
TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
CGATTTACTATTTCCATTTAGACGTTGTTCAA
AATTTACTAACAATACTTCAGTTTATAATGGA
TCCTATACTAACAATTTGTAGTTCATAAATAA

Multiple alignment of acceptor sites from 268 yeast DNA sequences
- What is the biological signal around the site?
- What are the important positions?
- How can it be visualized?

[Figure: sequence logo, a logo plot with information content; diagram labels: Exon, Intron, Exon]

3 Entropy - Definition
The entropy of a random variable is a measure of its uncertainty.
In thermodynamics: $\Delta G = \Delta H - T\Delta S$
- The entropy S of a system is its degree of disorder

4 Entropy - Definition
Entropy of a distribution of amino acids
- The Shannon entropy: $H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
H(p) is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$
Multiple alignment of 3 sequences:
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder

5 Entropy - example
$H(p) = -\sum_a p_a \log_2(p_a)$
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: H(p) = -[1*log2(1)] = 0
Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = log2(3) ≈ 1.58
Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] ≈ 0.92
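
The per-position entropies in this example can be checked with a few lines of Python. This is a minimal sketch that assumes an ungapped alignment given as equal-length strings; the function name `shannon_entropy` is chosen here for illustration and does not come from the slides.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy in bits of the residues observed in one alignment column."""
    total = len(column)
    counts = Counter(column)
    # Sum the -p*log2(p) terms over the residues that actually occur.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# The three-sequence alignment from the slide.
alignment = ["ALR", "AVR", "AIK"]

# Transpose the rows (sequences) into columns (positions).
columns = ["".join(col) for col in zip(*alignment)]
entropies = [shannon_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H = {h:.2f} bits")
```

The fully conserved first position gives zero entropy, while the second position, with three different residues, gives the highest value of the three.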

6 Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution $p_a$ compared to a background distribution $q_a$, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.

Background amino acid frequencies (%):
Ala (A) 7.82   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.87
Arg (R) 5.32   Glu (E) 6.60   Lys (K) 5.93   Thr (T) 5.46
Asn (N) 4.20   Gly (G) 6.94   Met (M) 2.37   Trp (W) 1.16
Asp (D) 5.30   His (H) 2.27   Phe (F) 4.01   Tyr (Y) 3.07
Cys (C) 1.56   Ile (I) 5.90   Pro (P) 4.85   Val (V) 6.71
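
As an illustration, the relative entropy can be computed against the background frequencies listed on this slide. The sketch below makes two assumptions that are not in the slides: the listed percentages are renormalized so they sum to exactly 1, and residues with $p_a = 0$ are taken to contribute nothing to the sum. The names `kl_divergence`, `BACKGROUND_PERCENT` and the example distributions are illustrative.

```python
import math

# Background frequencies in percent, copied from the slide.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Assumption: renormalize so the background sums to exactly 1.
total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / total for aa, value in BACKGROUND_PERCENT.items()}

def kl_divergence(p, background):
    """D(p||q) in bits; residues with p_a = 0 contribute 0 by convention."""
    return sum(pa * math.log2(pa / background[a]) for a, pa in p.items() if pa > 0)

# Hypothetical columns: fully conserved Trp, fully conserved Leu,
# and a mixed column with two thirds Arg and one third Lys.
d_trp = kl_divergence({"W": 1.0}, q)
d_leu = kl_divergence({"L": 1.0}, q)
d_mixed = kl_divergence({"R": 2 / 3, "K": 1 / 3}, q)

print(f"conserved W: {d_trp:.2f} bits")
print(f"conserved L: {d_leu:.2f} bits")
print(f"2/3 R, 1/3 K: {d_mixed:.2f} bits")
```

With a non-uniform background, a conserved rare residue such as Trp scores higher than a conserved common residue such as Leu, because it deviates more from what the background predicts.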

7 Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
The information content is often used as a measure of the degree of conservation:
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is a uniform background, where all amino acids have the same background frequency: $q_a = 1/20$

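8 Information content
$$
\begin{aligned}
I &= \sum_a p_a \log_2\!\left(\frac{p_a}{1/20}\right) \\
  &= \sum_a p_a \left[\log_2 p_a - \log_2(1/20)\right] \\
  &= -H(p) - \sum_a p_a \log_2(1/20) \\
  &= -H(p) + \sum_a p_a \log_2(20) \\
  &= -H(p) + \log_2(20) \qquad \text{(since } \textstyle\sum_a p_a = 1\text{)} \\
  &= -H(p) + 4.32
\end{aligned}
$$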

9 Information content
$I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: I = [1*log2(1)] + 4.32 = 4.32
Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 ≈ 2.74
Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 ≈ 3.40
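
The identity $I = \log_2(20) - H(p)$ for a uniform background can be checked numerically on the same three-sequence example. This is a minimal sketch, assuming an ungapped alignment and observed frequencies without any low-count correction; all function names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column):
    """Observed residue frequencies in one alignment column."""
    n = len(column)
    return {res: count / n for res, count in Counter(column).items()}

def information_content(column):
    """I = log2(20) + sum_a p_a*log2(p_a), i.e. log2(20) minus the entropy."""
    p = column_frequencies(column)
    return math.log2(len(AMINO_ACIDS)) + sum(pa * math.log2(pa) for pa in p.values())

def kl_to_uniform(column):
    """D(p||q) with the uniform background q_a = 1/20."""
    q = 1.0 / len(AMINO_ACIDS)
    p = column_frequencies(column)
    return sum(pa * math.log2(pa / q) for pa in p.values())

alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]
info = [information_content(col) for col in columns]

for pos, (col, i) in enumerate(zip(columns, info), start=1):
    print(f"Pos{pos}: I = {i:.2f} bits, D(p||uniform) = {kl_to_uniform(col):.2f} bits")
```

Both functions return the same value for every column, which is the content of the derivation: with a uniform background, the relative entropy reduces to a constant minus the Shannon entropy.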

10 Logo plots - HowTo
Starting from the multiple alignment of yeast acceptor sites shown on slide 2:
- Count nucleotides at each position
- Convert to frequencies
- Frequency-logo
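
The counting and frequency steps can be sketched in Python as follows. Only the first five sequences of the acceptor-site alignment are used to keep the example short, so the resulting frequencies are illustrative and not those of the full 268-sequence data set. The function names are chosen here and are not part of any logo program.

```python
from collections import Counter

ALPHABET = "ACGT"

# First five sequences of the acceptor-site alignment (a small subset).
sequences = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
    "GATACGCTTACCGTTATCCAGAGCTACAGCGC",
    "TACTAATATGTAATACTTCAGCTCCCCTTAAT",
]

def count_matrix(seqs):
    """One Counter of nucleotide counts per alignment position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i] for s in seqs) for i in range(length)]

def frequency_matrix(counts, n_sequences):
    """Convert counts to frequencies; nucleotides that never occur get 0.0."""
    return [{base: col[base] / n_sequences for base in ALPHABET} for col in counts]

counts = count_matrix(sequences)
freqs = frequency_matrix(counts, len(sequences))

# Print positions 18 to 22 (1-based), around the conserved AG.
for pos in range(17, 22):
    row = "  ".join(f"{base}={freqs[pos][base]:.1f}" for base in ALPHABET)
    print(f"position {pos + 1:2d}: {row}")
```

In this subset, positions 20 and 21 are occupied only by A and G respectively, which is the kind of signal a frequency logo makes visible at a glance.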

11 Logo plots - Information Content
A logo plot is a visualization of a multiple alignment.
Sequence-logo: calculate the information content
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits.
- The total height at a position is the information content, measured in bits.
- The height of each letter is proportional to the frequency of that letter.
[Figure annotations: "~0.5 each", "Completely conserved"]
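
The two height rules can be turned into numbers directly: the stack height is the information content, and each letter takes a share of it equal to its frequency. The sketch below assumes a DNA alphabet of four letters and applies no small-sample correction, which logo programs may add; `logo_heights` and the example frequencies are illustrative.

```python
import math

def logo_heights(freqs, alphabet_size=4):
    """Return (stack height in bits, [(letter, letter height)]) for one position.

    freqs maps each letter to its frequency at this position.
    Letters are sorted so that the tallest one comes last (top of the stack).
    """
    info = math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs.values() if p > 0
    )
    letters = sorted(
        ((base, p * info) for base, p in freqs.items() if p > 0),
        key=lambda item: item[1],
    )
    return info, letters

# Hypothetical positions: fully conserved, two letters at 50%, and uniform.
conserved = logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0})
two_letters = logo_heights({"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0})
uniform = logo_heights({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})

for name, (info, letters) in [("conserved", conserved),
                              ("two letters", two_letters),
                              ("uniform", uniform)]:
    print(name, round(info, 2), [(b, round(h, 2)) for b, h in letters])
```

A fully conserved position reaches the 2-bit maximum, while a position where all four nucleotides are equally frequent has zero height and disappears from the logo.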

12 Programs to make a Logo plot
WebLogo
- Requires a multiple alignment as input
- Protein or DNA sequences
- More output formats
Blast2Logo
- Requires a FASTA file as input
- Only protein sequences
- Runs PSI-BLAST and makes a table of frequencies
- PDF logo plot

13 WebLogo - http://weblogo.berkeley.edu/

14 WebLogo - http://weblogo.berkeley.edu/

15 Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
- Find homologous sequences. How? BLAST or PSI-BLAST
- Download the sequences
- Make a multiple alignment with ClustalW or another program
- Or use the Blast2Logo program

16 Multiple alignment programs

17 Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

18 Important positions
Important positions in proteins are conserved positions => high information content.
Conserved for a reason:
- Functionally important positions: catalytic residues
- Structurally important positions: maintain the correct fold of the protein

19 Blast2logo
Runs iterative BLAST, i.e. PSI-BLAST, searching for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
1. Iteration - use the Blosum62 scoring matrix
2. Iteration - make a profile of the sequences found in iteration 1
3. Iteration - make a profile of the sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted.
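
The slides do not say how Blast2logo corrects for low counts or weights similar sequences. As one possible illustration of the idea, the sketch below swaps in two generic techniques: Henikoff-style position-based sequence weights, and a simple pseudocount that mixes the weighted frequencies with a background distribution. The uniform background, the `beta` parameter and all function names are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(seqs):
    """Henikoff-style weights: sequences sharing residues with many others get less weight."""
    weights = [0.0] * len(seqs)
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        counts = Counter(column)
        n_types = len(counts)  # number of different residues in this column
        for j, residue in enumerate(column):
            weights[j] += 1.0 / (n_types * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]  # normalized to sum to 1

def weighted_profile(seqs, weights, background, beta=1.0):
    """Weighted frequencies per position, mixed with the background as a pseudocount.

    Assumption: the number of sequences stands in for the effective
    number of observations; beta sets the strength of the pseudocount.
    """
    n = len(seqs)
    profile = []
    for i in range(len(seqs[0])):
        observed = {a: 0.0 for a in background}
        for seq, w in zip(seqs, weights):
            observed[seq[i]] += w
        profile.append(
            {a: (n * observed[a] + beta * background[a]) / (n + beta) for a in background}
        )
    return profile

# Hypothetical alignment: two identical sequences and one divergent sequence.
seqs = ["ALR", "ALR", "AIK"]
uniform_background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

weights = position_based_weights(seqs)
profile = weighted_profile(seqs, weights, uniform_background)

print([round(w, 3) for w in weights])
print(round(profile[1]["L"], 3), round(profile[1]["I"], 3))
```

The two identical sequences share their weight, so the divergent sequence counts for more than one third of the profile, and the pseudocount gives every amino acid a small non-zero frequency so that no logarithm of zero appears later.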

20 Important positions - counting

21 Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

22 Exercise
1. Calculate nucleotide frequencies from a multiple alignment of human donor sites
2. Calculate entropy and information content
3. Draw (by hand) a logo plot
4. Use 2 logo plot programs
5. Learn to interpret logo & frequency plots
6. Active site residues & structural residues

