Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.



Multiple alignment of acceptor sites from 268 yeast DNA sequences:
GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
TTTACTAACGACACATTGAAGAAATCACTTTG
GATACGCTTACCGTTATCCAGAGCTACAGCGC
TACTAATATGTAATACTTCAGCTCCCCTTAAT
ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
TAACATAACTTATTTACATAGTGCCATTGAAC
GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
TTTTCCCGACCATCAAGACAGGTGATTTATCA
TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
AACGTACTTTAATATTTATAGTACTTCATTCG
AACATGCTATTTTTCATACAGCAACCTCACAT
CTGCACTCATCATTAGATTAGAGGAACATGGA
TACTTTTCTTTATCTAAGCAGCTAACTCAACT
ATCAACATGCTATTGAACTAGAGATCCACCTA
TAACTAACATGACTTTAACAGGGCTAATTTAC
AGTACTAACTAATTAACTTAGAACATTAACAT
GATCACCGTCACATTTATTAGAATTTCAAACG
CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
CTCTATGACCAATAAAAACAGACTGTACTTTC
AAATGGTATTATTTATAACAGTTGAACATTTC
ATAAATATGCGATCAATATAGACCGTTGATAT
ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
ATTTATTTCCTTATAATACAGACACGGTTACA
TCGCAATTAATTTTCTAATAGTTTTTCATTTT
GACCATCTTTCTTTTCCCCAGTGCTAAACACG
AACCTTCTTTCTCATTCGTAGATTACTGTTGC
AATTACTAACAGCTGTAATAGCCGACAAATTT
CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
AAACTTCCATTTCTTACATAGATCATCGCCAT
TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
CGATTTACTATTTCCATTTAGACGTTGTTCAA
AATTTACTAACAATACTTCAGTTTATAATGGA
TCCTATACTAACAATTTGTAGTTCATAAATAA
- What is the biological signal around the site?
- What are the important positions?
- How can it be visualized?
Figure: Sequence-logo (Logo plot with Information Content) of the biological information; diagram labels: Exon, Intron, Exon.

Entropy - Definition
The entropy of a random variable is a measure of its uncertainty.
In thermodynamics: $\Delta G = \Delta H - T \Delta S$
- The entropy $S$ of a system is its degree of disorder.

Entropy - Definition
Entropy of a distribution of amino acids
- The Shannon entropy: $H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
- $H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$
Multiple alignment of 3 sequences:
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder

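Entropy - example
$H(p) = -\sum_a p_a \log_2(p_a)$
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $H(p) = -[1 \cdot \log_2(1)] = 0$
Pos2: $H(p) = -[\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] = \log_2(3) \approx 1.58$
Pos3: $H(p) = -[\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] \approx 0.92$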
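The per-position entropies of the three-sequence example can be checked with a few lines of Python. This is only a minimal sketch: the function name `column_entropy` and the list-of-strings representation of the alignment are illustrative choices, not part of the slides.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a "-0.0" result.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# The three-sequence example alignment (Seq1, Seq2, Seq3).
alignment = ["ALR", "AVR", "AIK"]

# zip(*alignment) turns the rows (sequences) into columns (positions).
entropies = [column_entropy(col) for col in zip(*alignment)]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.2f} bits")
```

A fully conserved column gives 0 bits, and a column with three different residues gives log2(3) bits, the maximum possible for three sequences.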

Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution $p_a$ compared to a background distribution $q_a$, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a / q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt (values in percent):
Ala (A) 7.82    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.87
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.46
Asn (N) 4.20    Gly (G) 6.94    Met (M) 2.37    Trp (W) 1.16
Asp (D) 5.30    His (H) 2.27    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.90    Pro (P) 4.85    Val (V) 6.71
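As a rough illustration of the relative entropy, the sketch below evaluates D(p||q) for a single alignment column against the background values listed above. It assumes those values are percentages and renormalises them because the rounded numbers do not sum to exactly 100. The two test columns ("WWW" and "LLL") are hypothetical and only show that a conserved rare residue scores higher than a conserved common one.

```python
import math
from collections import Counter

# Background frequencies (assumed to be percentages) copied from the slide.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Renormalise so that the background distribution q sums to 1.
_total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def relative_entropy(column, background):
    """Kullback-Leibler distance D(p||q) in bits for one alignment column."""
    counts = Counter(column)
    n = len(column)
    # Residues that are absent from the column contribute 0 to the sum.
    return sum((c / n) * math.log2((c / n) / background[aa])
               for aa, c in counts.items())

# Hypothetical columns: fully conserved Trp (rare) versus fully conserved Leu (common).
d_trp = relative_entropy("WWW", q)
d_leu = relative_entropy("LLL", q)
print(f"D(conserved W) = {d_trp:.2f} bits, D(conserved L) = {d_leu:.2f} bits")
```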

Information content
$D(p\|q) = \sum_a p_a \log_2(p_a / q_a)$
Often the Information content is used as a measure of the degree of conservation:
$I = \sum_a p_a \log_2(p_a / q_a)$
A special case is when all amino acids have the same background frequency: $q_a = 1/20$

Information content
$I = \sum_a p_a \log_2\left(\frac{p_a}{1/20}\right)$
$= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
$= -H(p) - \sum_a p_a \log_2(1/20)$
$= -H(p) + \sum_a p_a \log_2(20)$
$= -H(p) + \log_2(20)$ (since $\sum_a p_a = 1$)
$= \log_2(20) - H(p) \approx 4.32 - H(p)$

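Information content
$I = \log_2(20) - H(p) = \log_2(20) + \sum_a p_a \log_2 p_a$
The Information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $I = \log_2(20) + [1 \cdot \log_2(1)] \approx 4.32$
Pos2: $I = \log_2(20) + [\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] \approx 2.74$
Pos3: $I = \log_2(20) + [\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] \approx 3.40$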
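The information content for a uniform background can be computed as log2(20) minus the entropy of the column. The sketch below applies this to the three-sequence example. The function name `information_content` and the `alphabet_size` parameter are illustrative assumptions.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Information content in bits, assuming a uniform background q_a = 1/alphabet_size."""
    counts = Counter(column)
    n = len(column)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# The three-sequence example alignment (Seq1, Seq2, Seq3).
alignment = ["ALR", "AVR", "AIK"]
info = [information_content(col) for col in zip(*alignment)]

for pos, value in enumerate(info, start=1):
    print(f"Pos{pos}: I = {value:.2f} bits")
```

The fully conserved first position reaches the maximum of log2(20), about 4.32 bits. The other two positions score lower because their entropy is above zero.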

Logo plots - HowTo
Starting from the multiple alignment of yeast acceptor sites shown earlier:
1. Count nucleotides at each position
2. Convert to frequencies
3. Draw the Frequency-logo
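The counting and frequency steps might look like the following sketch. It uses only the first five sequences of the acceptor-site alignment to stay short. The helper names `position_counts` and `position_frequencies` are made up for this example.

```python
from collections import Counter

# First five sequences of the yeast acceptor-site alignment (all of length 32).
sequences = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
    "GATACGCTTACCGTTATCCAGAGCTACAGCGC",
    "TACTAATATGTAATACTTCAGCTCCCCTTAAT",
]

def position_counts(seqs, alphabet="ACGT"):
    """Count each nucleotide at each alignment position (one dict per column)."""
    matrix = []
    for column in zip(*seqs):
        counts = Counter(column)
        matrix.append({base: counts[base] for base in alphabet})
    return matrix

def position_frequencies(count_matrix):
    """Convert per-position counts into per-position frequencies."""
    freqs = []
    for column in count_matrix:
        total = sum(column.values())
        freqs.append({base: n / total for base, n in column.items()})
    return freqs

counts = position_counts(sequences)
freqs = position_frequencies(counts)

# Positions 20 and 21 (1-based) hold the conserved AG of the acceptor site.
print(freqs[19], freqs[20])
```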

Logo plots - Information Content
Sequence-logo: calculate the Information Content
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits.
- The total height at a position is the 'Information Content', measured in bits.
- The height of a letter is proportional to the frequency of that letter.
- A Logo plot is a visualization of a multiple alignment.
Figure annotations: "~0.5 each"; "Completely conserved".
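The letter heights in a DNA sequence logo follow directly from the two rules above: the stack height is the information content, and each letter takes a share of it proportional to its frequency. The sketch below is a simplified illustration with hypothetical frequency columns. It applies no small-sample correction, and the function name `logo_heights` is an assumption.

```python
import math

def logo_heights(freqs, alphabet_size=4):
    """Return (letter heights in bits, total information content) for one position."""
    # Letters with frequency 0 contribute nothing to the entropy.
    entropy = sum(p * math.log2(1 / p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy
    heights = {base: p * info for base, p in freqs.items()}
    return heights, info

# Hypothetical positions: fully conserved, two letters at 0.5 each, and uniform.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_way = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, column in [("conserved", conserved), ("two_way", two_way), ("uniform", uniform)]:
    heights, info = logo_heights(column)
    print(name, f"I = {info:.2f} bits", heights)
```

A fully conserved position gives a single letter of 2 bits, while a uniform position has zero height and disappears from the logo.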

Programs to make a Logo plot
WebLogo
- Requires a multiple alignment as input
- Protein or DNA sequences
- More output formats
Blast2Logo
- Requires a fasta file as input
- Only protein sequences
- Runs PSI-blast and makes a table of frequencies
- pdf logo plot

WebLogo -

WebLogo -

Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
1. Find homologous sequences - how? Blast or PsiBlast
2. Download the sequences
3. Make a multiple alignment - ClustalW or others
Or use the Blast2Logo program.

Multiple alignment programs

Blast2logo -

Important positions
Important positions in proteins are conserved positions => high Information Content.
Conserved for a reason:
- Functionally important positions: catalytic residues
- Structurally important positions: maintain the correct fold of the protein

Blast2logo
Runs iterative blast, i.e. Psi-Blast: searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
Iteration 1 - use the Blosum62 scoring matrix
Iteration 2 - make a profile of the sequences found in iteration 1
Iteration 3 - make a profile of the sequences found in iteration 2
Iteration 4 - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted.
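One common way to correct for low counts is to add pseudocounts before converting counts to frequencies. The sketch below uses simple add-one smoothing as an assumption. It is not claimed to be the scheme Blast2logo or Psi-Blast actually uses, and it leaves out sequence weighting entirely. The function name and the `pseudocount` parameter are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequencies_with_pseudocounts(column, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Estimate residue frequencies for one column, adding a pseudocount to every residue.

    Assumption: plain add-one (Laplace) smoothing, chosen here only for illustration.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {aa: (counts[aa] + pseudocount) / total for aa in alphabet}

# Hypothetical column from a shallow alignment of three sequences.
smoothed = frequencies_with_pseudocounts("RRK")

# Unobserved residues get a small non-zero frequency instead of exactly 0.
print(f"R: {smoothed['R']:.3f}, K: {smoothed['K']:.3f}, A: {smoothed['A']:.3f}")
```

With only a few sequences, raw frequencies of 0 or 1 overstate conservation. The pseudocount pulls the estimate towards a flat distribution, and its influence shrinks as more sequences are added.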

Important positions - counting

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Exercise
1. Calculate nucleotide frequencies from a multiple alignment of human donor sites
2. Calculate Entropy and Information content
3. Draw (by hand) a Logo plot
4. Use 2 Logo plot programs
5. Learn to interpret Logo & frequency plots
6. Active site residues & structural residues