Entropy, Information contents & Logo plots By Thomas Nordahl Petersen


Biological information
Multiple alignment of acceptor sites from 268 yeast DNA sequences (37 shown):
GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
TTTACTAACGACACATTGAAGAAATCACTTTG
GATACGCTTACCGTTATCCAGAGCTACAGCGC
TACTAATATGTAATACTTCAGCTCCCCTTAAT
ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
TAACATAACTTATTTACATAGTGCCATTGAAC
GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
TTTTCCCGACCATCAAGACAGGTGATTTATCA
TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
AACGTACTTTAATATTTATAGTACTTCATTCG
AACATGCTATTTTTCATACAGCAACCTCACAT
CTGCACTCATCATTAGATTAGAGGAACATGGA
TACTTTTCTTTATCTAAGCAGCTAACTCAACT
ATCAACATGCTATTGAACTAGAGATCCACCTA
TAACTAACATGACTTTAACAGGGCTAATTTAC
AGTACTAACTAATTAACTTAGAACATTAACAT
GATCACCGTCACATTTATTAGAATTTCAAACG
CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
CTCTATGACCAATAAAAACAGACTGTACTTTC
AAATGGTATTATTTATAACAGTTGAACATTTC
ATAAATATGCGATCAATATAGACCGTTGATAT
ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
ATTTATTTCCTTATAATACAGACACGGTTACA
TCGCAATTAATTTTCTAATAGTTTTTCATTTT
GACCATCTTTCTTTTCCCCAGTGCTAAACACG
AACCTTCTTTCTCATTCGTAGATTACTGTTGC
AATTACTAACAGCTGTAATAGCCGACAAATTT
CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
AAACTTCCATTTCTTACATAGATCATCGCCAT
TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
CGATTTACTATTTCCATTTAGACGTTGTTCAA
AATTTACTAACAATACTTCAGTTTATAATGGA
TCCTATACTAACAATTTGTAGTTCATAAATAA
Diagram labels: Exon, Intron, Exon
What is the biological signal around the site?
What are the important positions?
How can it be visualized? With a Logo plot based on Information Content (sequence logo).

Entropy - Definition
The entropy of a random variable is a measure of its uncertainty.
In thermodynamics, $G = H - TS$, where the entropy $S$ of a system is its degree of disorder.

Entropy - Definition
Entropy of a distribution of amino acids.
The Shannon entropy: $H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
$H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$.
Multiple alignment of 3 sequences:
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder

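Entropy - example
$H(p) = -\sum_a p_a \log_2(p_a)$
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $H(p) = -[1 \cdot \log_2(1)] = 0$
Pos2: $H(p) = -[\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] = 1.58$
Pos3: $H(p) = -[\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] = 0.92$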
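The per-position calculation can be checked with a few lines of Python. This is a minimal sketch, assuming the alignment has no gaps and that observed letter counts are used directly as probabilities; the function name `shannon_entropy` is chosen here for illustration.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy in bits of the letters observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Sum the -p*log2(p) terms; letters that never occur contribute nothing.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# The three-sequence alignment from the slide.
alignment = ["ALR", "AVR", "AIK"]

# Transpose the alignment so that each string is one column (position).
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]

for pos, col in enumerate(columns, start=1):
    print(f"Pos{pos}: H = {shannon_entropy(col):.2f} bits")
```

Rounded to two decimals, the three positions give 0, 1.58 and 0.92 bits, matching the slide.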

Relative Entropy
The Kullback-Leibler distance D.
How different is an amino acid distribution $p_a$ from a background distribution $q_a$, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt (values in percent):
Ala (A) 7.82    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.87
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.46
Asn (N) 4.20    Gly (G) 6.94    Met (M) 2.37    Trp (W) 1.16
Asp (D) 5.30    His (H) 2.27    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.90    Pro (P) 4.85    Val (V) 6.71
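As an illustration, the relative entropy against this background can be computed as below. This is a sketch under two assumptions: the listed values are percentages (they sum to 99.90, so they are renormalized to sum to 1), and terms with $p_a = 0$ are taken to contribute zero. The names `BACKGROUND` and `relative_entropy` are illustrative.

```python
import math

# Background values copied from the slide, interpreted as percentages.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Renormalize so that the background is a proper probability distribution.
_total = sum(BACKGROUND_PERCENT.values())
BACKGROUND = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def relative_entropy(p, q=BACKGROUND):
    """Kullback-Leibler distance D(p||q) in bits; p maps letters to probabilities."""
    return sum(pa * math.log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# A position fully conserved as the rare Trp is more surprising than one
# fully conserved as the common Leu.
print(relative_entropy({"W": 1.0}) > relative_entropy({"L": 1.0}))
```

With a non-uniform background, conservation of a rare amino acid therefore scores higher than conservation of a common one.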

Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Often the Information content is used as a measure of the degree of conservation:
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is when all amino acids have the same background frequency: $q_a = 1/20$.

Information content - amino acids
$I = \sum_a p_a \log_2(p_a/(1/20))$
$= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
$= -H(p) - \sum_a p_a \log_2(1/20)$
$= -H(p) + \sum_a p_a \log_2(20)$
$= -H(p) + \log_2(20)$
$= -H(p) + 4.32$
$= \sum_a p_a \log_2 p_a + 4.32$

Information content
$I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
General formula: $I = \sum_a p_a \log_2 p_a + \log_2(N)$, where N is the number of letters in the alphabet.
The Information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: $I = [1 \cdot \log_2(1)] + 4.32 = 4.32$
Pos2: $I = [\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] + 4.32 = 2.74$
Pos3: $I = [\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] + 4.32 = 3.40$
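The general formula translates directly into code. The sketch below assumes a uniform background, a gap-free column, and raw counts used as probabilities; `information_content` and its `alphabet_size` parameter are names introduced here for illustration.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Information content in bits: log2(N) minus the Shannon entropy of the column."""
    counts = Counter(column)
    total = len(column)
    entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# Columns of the three-sequence protein alignment from the slide.
for pos, col in enumerate(["AAA", "LVI", "RRK"], start=1):
    print(f"Pos{pos}: I = {information_content(col):.2f} bits")

# For DNA the alphabet has 4 letters, so a conserved position reaches 2 bits.
print(information_content("GGGG", alphabet_size=4))
```

Passing `alphabet_size=4` gives the DNA version of the same measure, with a maximum of 2 bits.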

Logo plots - HowTo
Start from the multiple alignment of yeast acceptor sites listed on the "Biological information" slide.
Count nucleotides at each position.
Convert to frequencies.
Plot the frequencies as a frequency logo.
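The counting and frequency steps can be sketched as follows. For brevity the example uses only the first four sequences of the acceptor-site alignment; the function name `column_frequencies` is illustrative, and the sketch assumes equal-length, gap-free sequences.

```python
from collections import Counter

def column_frequencies(sequences, alphabet="ACGT"):
    """Return one dict per alignment position mapping each letter to its frequency."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences in the alignment must have the same length")
    frequencies = []
    for i in range(length):
        # Count the letters seen at position i, then divide by the number of sequences.
        counts = Counter(seq[i] for seq in sequences)
        frequencies.append({base: counts[base] / len(sequences) for base in alphabet})
    return frequencies

# First four sequences of the acceptor-site alignment.
sites = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
    "GATACGCTTACCGTTATCCAGAGCTACAGCGC",
]

freqs = column_frequencies(sites)
# Positions 20 and 21 (1-based) hold the conserved AG of the acceptor site.
print(freqs[19], freqs[20])
```

In this subset the A at position 20 and the G at position 21 both have frequency 1, which is the conserved AG at the end of the intron.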

Logo plots - Information Content
Calculate the Information Content: $I = \sum_a p_a \log_2 p_a + \log_2(4)$. The maximal value is 2 bits.
Sequence-logo (figure annotations: "Completely conserved", "~0.5 each")
The total height at a position is the 'Information Content' measured in bits.
The height of a letter is proportional to the frequency of that letter.
A Logo plot is a visualization of a multiple alignment.
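A small sketch of how letter heights follow from these rules: the stack height is the information content, and each letter takes a share of it equal to its frequency. The function name `letter_heights` is illustrative, and a uniform background over the alphabet is assumed.

```python
import math

def letter_heights(freqs, alphabet_size=4):
    """Height in bits of each letter at one position of a sequence logo."""
    # Information content of the position; zero-frequency letters are skipped.
    info = math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs.values() if p > 0
    )
    # Each letter gets a share of the stack proportional to its frequency.
    return {letter: p * info for letter, p in freqs.items()}

conserved = letter_heights({"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0})
two_way = letter_heights({"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0})
uniform = letter_heights({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
print(conserved["G"], two_way["A"], uniform["A"])
```

A fully conserved position yields a single letter 2 bits tall, a position split evenly between two letters yields a 1-bit stack, and a position with all four letters at equal frequency has zero height.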

Logo Plot – DNA Splice sites (28-Dec-18)
Figure labels: Acceptor site, Donor site, Exon

BLAT genome Browser "Details"
Figure label: Exon
Correct splice site?

BLAT genome Browser "Details"
Donor site (first bar) and Acceptor site (second bar):
exon...G | GT...intron...AG | exon...
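One way to check a reported splice site against this pattern is to test whether the intron starts with GT and ends with AG. The sketch below is a hypothetical helper, not part of BLAT; it only covers the canonical GT...AG pattern shown on the slide, and a minority of real introns use other dinucleotides.

```python
def has_canonical_splice_sites(intron):
    """True if the intron sequence starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    # An intron needs at least the two dinucleotides to be checked.
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Made-up intron sequences for illustration only.
print(has_canonical_splice_sites("gtaagtttttctaacag"))
print(has_canonical_splice_sites("atcctttttctaacac"))
```

The first made-up sequence matches the pattern and the second does not.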

Programs to make a Logo plot
WebLogo
- Requires a multiple alignment as input
- Protein or DNA sequences
- More output formats
Blast2Logo
- Requires a fasta file as input
- Only protein sequences
- Runs PSI-blast and makes a table of frequencies
- Output: pdf logo plot

WebLogo - http://weblogo.berkeley.edu/


Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
- Find homologous sequences. How? Blast or PsiBlast.
- Download the sequences.
- Make a multiple alignment with ClustalW, Mafft or others.
- Or use the Blast2Logo program.

Multiple alignment programs

Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

Important positions
Important positions in proteins are conserved positions => high Information Content.
Conserved for a reason:
- Functionally important positions: catalytic residues
- Structurally important positions: maintain the correct fold of the protein

Blast2logo
Runs iterative Blast, i.e. Psi-Blast.
Searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
Iteration 1 - use the Blosum62 scoring matrix
Iteration 2 - make a profile of the sequences found in iteration 1
Iteration 3 - make a profile of the sequences found in iteration 2
Iteration 4 - calculate amino acid frequencies at each position in the query sequence
Correct for low counts, and weight the sequences so that very similar sequences are down-weighted.
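The slide does not say how Blast2logo corrects for low counts or weights sequences. As a stand-in illustration only, the sketch below uses two common textbook choices: Henikoff-style position-based sequence weights and a uniform pseudocount. All names and the `pseudocount` value are assumptions, and gaps are not handled.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(alignment):
    """Henikoff-style weights: sequences sharing residues with many others get less weight."""
    weights = [0.0] * len(alignment)
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        counts = Counter(column)
        k = len(counts)  # number of distinct residues in this column
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]  # normalized to sum to 1

def weighted_frequencies(alignment, column_index, pseudocount=1.0):
    """Weighted residue frequencies at one column, smoothed with a uniform pseudocount."""
    n = len(alignment)
    weighted = {aa: 0.0 for aa in AMINO_ACIDS}
    for w, seq in zip(position_based_weights(alignment), alignment):
        weighted[seq[column_index]] += w * n  # rescale so the weights sum to n
    denom = n + pseudocount
    return {aa: (weighted[aa] + pseudocount / len(AMINO_ACIDS)) / denom
            for aa in AMINO_ACIDS}

# Two identical sequences and one divergent sequence (made-up example).
alignment = ["ALR", "ALR", "AIK"]
weights = position_based_weights(alignment)
freqs = weighted_frequencies(alignment, column_index=1)
print(weights, freqs["L"], freqs["I"])
```

In this toy alignment the two identical sequences share their weight, so the divergent third sequence counts for more than either of them, and the pseudocount keeps unseen amino acids from having a frequency of exactly zero.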

Psi-Blast - Iterative Blast
An iterative process to search for remote homologs.
Captures and uses evolutionarily conserved information.
The scoring matrix is refined by use of a gap-free multiple alignment.
Flow diagram labels: Input sequence, Sequence database, Blast, E < threshold, 4 iterations, PSSM, Multiple alignment
PSSM: Position Specific Scoring Matrix

Important positions - counting

Blast2logo
Important amino acids: G24, D25 & S26; G89, N91 & D92
Important amino acids: D209 & H212

Blast2logo (Db=nr.70)
Important amino acids: G24, D25 & S26; D209 & H212

Exercise
- Calculate nucleotide frequencies from a multiple alignment of human donor sites
- Calculate Entropy and Information content
- Draw (by hand) a Logo plot
- Learn to interpret Logo & frequency plots