Biotech 4490 Bioinformatics I Fall 2006 J.C. Salerno 1 Biological Information
The basics:
DNA (deoxyribonucleic acid) stores information. It codes for more DNA and for RNA (ribonucleic acid), which is the intermediate between long-term storage in the nucleus and proteins, which do most of the work in living cells.
Alphabets and translation
DNA and RNA use four-letter alphabets (ACGT or ACGU); base pairing (A-T and G-C) in the DNA double helix is the key to replication, and in the DNA-RNA duplex (where U pairs with A) it is the key to transcription.
Proteins have a basic 20-letter alphabet corresponding to the amino acids.
Since strands of DNA, RNA, and polypeptide are linear, unbranched polymers, they can be treated as character strings.
Alphabets and translation
Transcription of DNA to RNA is a simple 1:1 read: a strand of DNA produces its complement, with U in place of T.
Translation of RNA to protein amino acid sequence is more complex.
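The 1:1 read described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the real enzyme: the names `DNA_TO_RNA` and `transcribe` are made up for this example, and strand orientation (5' versus 3') is ignored, so the RNA is simply the position-by-position complement of the template.

```python
# Hypothetical helper names; strand orientation (5'/3') is deliberately ignored.
# Each DNA base on the template pairs with exactly one RNA base (A pairs with U).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Return the RNA string complementary to a DNA template, one base per base."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


rna = transcribe("TACGGT")
print(rna)  # AUGCCA
```

Because the mapping is one base to one base, the RNA always has the same length as the template and no information is lost at this step.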
Alphabets and translation
One base alone could only code for 4 different AA.
Two bases together could code for 4x4=16 different AA: close, but no cigar.
Three bases could code for 4x4x4=64 different AA; we only need 21 codes, for the 20 AA used in proteins and a stop signal.
Alphabets and translation
In translation, groups of three bases (codons) are translated into amino acids.
Since there are 64 (4x4x4) codons, most AAs have multiple codons (serine has 6!).
We say that the genetic code is degenerate. This isn't a comment on its character.
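To make the degeneracy concrete, here is a small Python sketch that builds the standard genetic code and counts codons per amino acid. The names (`CODON_TABLE`, `translate`, `degeneracy`) are illustrative, the table is written with DNA letters (T rather than U), and `*` stands for a stop signal. It assumes the standard code only; some organisms and organelles use variants.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
# '*' marks a stop codon. DNA letters are used (T in place of U).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon; trailing partial codons are dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Number of codons that map to each amino acid (or to stop).
degeneracy = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))           # 64
print(degeneracy["S"])            # 6
print(translate("ATGTCATGGTAA"))  # MSW*
```

Only methionine (M) and tryptophan (W) have a single codon in this table; every other amino acid has at least two.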
Alphabets and translation
One consequence of the degeneracy of the genetic code is that you can translate nucleic acid sequences to AA sequences, but you can't reverse translate to a unique nucleic acid sequence.
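One way to see why reverse translation is not unique is to count how many nucleic acid sequences encode a given peptide: it is the product of the codon counts of its residues. The sketch below assumes the standard genetic code and uses a hypothetical function name, `count_back_translations`.

```python
from collections import Counter
from math import prod

# Standard genetic code as a 64-letter string (one letter per codon, '*' = stop).
# Counting letters in this string gives the number of codons per amino acid.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS_PER_AA = Counter(AMINO_ACIDS)


def count_back_translations(peptide: str) -> int:
    """Number of distinct coding sequences that translate to this peptide.

    Letters that are not in the standard code contribute 0, so the result is 0.
    """
    return prod(CODONS_PER_AA[aa] for aa in peptide)


print(count_back_translations("MW"))    # 1
print(count_back_translations("MSW"))   # 6
print(count_back_translations("MSLR"))  # 216
```

Only peptides built entirely from M and W can be reverse translated uniquely; a single serine already gives six candidates.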
Information content
How much information can you put into a character string?
The computer age has provided the current generation of students with valuable intuition in this area.
If I can put 10,000 songs on one iPod, how many songs can I put on two iPods?
Information content
In general, we expect the amount of information to increase linearly with the amount of space available to store it: songs with iPods, phone numbers with pages in the phone book, digital photos with memory cards.
Information content
More precisely, we express information content in terms of bits (or bytes) of information.
The information content of a string of binary characters is just the number of characters: any seven-character binary string carries 7 bits, and the string 10 carries 2 bits (no shave or haircut).
This assumes 1 and 0 are equally likely.
Information content
It should be obvious that the information content of a number is independent of how we express it: 999 should have the same significance written in binary as it does in base 10.
Information content
In general, if the characters in the alphabet are equally probable, we can express the information content of a character string as $N \log_2 M$, where N is the number of characters in a sequence and M is the number of letters in the alphabet.
For binary strings, there are only two characters, so $N \log_2 M = N$.
Information content
For nucleic acids, M = 4 (ACGT), so $N \log_2 M = 2N$.
For proteins, M = 20 (ACDEFGHIKLMNPQRSTVWY), so $N \log_2 M \approx 4.3N$.
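The formula is easy to check numerically. The sketch below uses a hypothetical function name, `information_bits`, and assumes, as the slides do, that every letter of the alphabet is equally probable.

```python
from math import log2


def information_bits(n_chars: int, alphabet_size: int) -> float:
    """Information content N * log2(M) of a string, assuming equiprobable letters."""
    return n_chars * log2(alphabet_size)


print(information_bits(7, 2))              # 7.0  (binary string of length 7)
print(information_bits(1, 4))              # 2.0  (one nucleotide)
print(round(information_bits(1, 20), 2))   # 4.32 (one amino acid)
```

The per-residue value of about 4.32 bits is what the slides round to 4.3.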
Information content
A protein sequence has more than twice the information content of a nucleic acid sequence of the same length.
But since it takes 3 bases to code for a single AA, a protein sequence has only about 0.7 times the information content of the DNA sequence that originally coded for it (about 4.3 bits per residue against 6 bits per codon).
Information content
Suppose we translate a 15 base pair sequence into a five AA sequence.
The information content of the nucleic acid sequence is just 2N = 30 bits.
The information content of the protein sequence is $5 \log_2 20$ (this is an upper bound assuming all AAs equally probable), or about 21.6 bits.
Almost 8 1/2 bits are lost to degeneracy.
Information and Entropy
Entropy is a measure of the number of ways a system can exist.
Example: the oversimplified 2 state molecule

    ______ B
    ______ A

The molecule has two states, A and B. In a large ensemble (sample) of molecules the populations of the states are $N_a$ and $N_b$.
Information and Entropy
The oversimplified 2 state molecule

    ______ B
    ______ A

If a photon with energy $h\nu$ can induce transitions between the states, the energy difference between them is just $\Delta = h\nu$, and at temperature T the population ratio $N_b/N_a$ is $e^{-\Delta/kT}$, where k is the Boltzmann constant.
Information and Entropy
The oversimplified 2 state molecule: multiplicity

    ______ B
    ______ A

Now suppose that A consists of n substates and B of m substates. The ratio of the populations of any substate of B to any substate of A is $e^{-\Delta/kT}$, where $\Delta$ is the energy difference between the states, so the ratio of the populations of all the B states to all the A states is just $(m/n)\,e^{-\Delta/kT}$.

Information and Entropy
The oversimplified 2 state molecule: free energy and entropy
We can rearrange the expression $(m/n)\,e^{-\Delta/kT}$ using simple algebra to obtain the equivalent expression $e^{-(\Delta + kT\ln(n/m))/kT}$.
In the exponent, the term $\Delta + kT\ln(n/m)$ has units of energy and is a free energy. Free energies in general determine equilibria.
$k\ln(m/n)$ is an entropy term representing the difference in entropy between B and A ($\Delta S = S_b - S_a$), so the free energy is $\Delta - T\Delta S$.
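The rearrangement can be checked numerically. The following Python sketch computes the population ratio two ways: directly as a multiplicity-weighted Boltzmann factor, and from the free energy (energy gap minus T times the entropy difference). The function and parameter names (`population_ratio`, `free_energy_gap`, `n_a`, `n_b`) are assumptions made for this example; `n_a` and `n_b` are the numbers of substates of A and B.

```python
from math import exp, log

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def population_ratio(delta: float, temperature: float, n_a: int = 1, n_b: int = 1) -> float:
    """N_b / N_a for an energy gap delta (J): multiplicity ratio times Boltzmann factor."""
    return (n_b / n_a) * exp(-delta / (K_B * temperature))


def free_energy_gap(delta: float, temperature: float, n_a: int = 1, n_b: int = 1) -> float:
    """Free energy difference delta - T*dS, with dS = S_b - S_a = k ln(n_b / n_a)."""
    delta_s = K_B * log(n_b / n_a)
    return delta - temperature * delta_s


def ratio_from_free_energy(delta: float, temperature: float, n_a: int = 1, n_b: int = 1) -> float:
    """Same ratio, obtained as exp(-free energy / kT)."""
    return exp(-free_energy_gap(delta, temperature, n_a, n_b) / (K_B * temperature))


T = 300.0
gap = K_B * T  # an energy gap of exactly one kT, chosen for illustration
print(population_ratio(gap, T, n_a=2, n_b=6))
print(ratio_from_free_energy(gap, T, n_a=2, n_b=6))
```

Both calls give the same value up to floating-point rounding, three times the single-state Boltzmann factor, because B has three times as many substates as A in this example.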
Information and Entropy
Question: What has entropy got to do with information?
Answer: Everything, because entropy is just a measure of the number of possible states. The entropy of a state is just the natural logarithm of the number of ways that state can exist (in units of the Boltzmann constant k). (That's why it's related to the degree of order: there are more ways of making a mess than of keeping things neat.)
Information and Entropy
Half a century ago Claude Shannon's seminal work on information theory showed that the information content in a message could be expressed as a function we call the Shannon entropy.
The basic idea is that the information content is the difference between the ln of the number of ways the message might read before we see it and the ln of the number of ways it might read after we read it. (Shannon was interested in errors as well as perfect reads.)
Other people had similar ideas (e.g., Norbert Wiener, who coined the term cybernetics), but Shannon got the details right.
Information and Entropy
The information content (in bits) of a string of N characters with M 'letters' in the alphabet is $N \log_2 M$ if characters are equally probable.
More generally, information content can be written in terms of probabilities as $-\sum \log_2 P_i$, summed over the characters of the string, which looks worse than it is.
Suppose that in an organism the CG content is 60%. The $P_i$ are 0.3 for C and G and 0.2 for A and T. Each C or G contributes $-\log_2(0.3) \approx 1.74$ bits, and each A or T contributes $-\log_2(0.2) \approx 2.32$ bits.
The average information per position is $-\sum_i P_i \log_2 P_i \approx 1.97$ bits, summed over the four letters, slightly less than the 2 bits of the equiprobable case.
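The CG-content example can be reproduced with a short Python sketch. The function names (`shannon_entropy`, `sequence_information`) and the dictionary layout are assumptions made for illustration; the probabilities are the ones given above.

```python
from math import log2


def shannon_entropy(probabilities: dict) -> float:
    """Average information per position, -sum(P_i * log2(P_i)), in bits."""
    return -sum(p * log2(p) for p in probabilities.values() if p > 0)


def sequence_information(seq: str, probabilities: dict) -> float:
    """Total information of a string, -sum(log2(P)) over its characters, in bits."""
    return -sum(log2(probabilities[ch]) for ch in seq)


gc60 = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}         # 60% CG content
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # equiprobable bases

print(round(shannon_entropy(gc60), 3))  # 1.971
print(shannon_entropy(uniform))         # 2.0
print(round(sequence_information("GATC", gc60), 2))
```

The uniform case recovers the 2 bits per base of $N \log_2 M$; any bias in base composition lowers the average below 2 bits.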