A short tutorial on DNA structure and functions
The DNA double helix
Nucleotides: the bricks of a ladder structure
The DNA helix is assembled from four structural units called nucleotides. Each nucleotide has three building blocks:
* Base: a purine (adenine, A; guanine, G) or a pyrimidine (cytosine, C; thymine, T)
* Phosphate group
* Sugar ring
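The double helix
The two strands are written in a four-letter alphabet {A,C,G,T} and are held together by hydrogen bonds.
[Figure: the double helix, with letter sequences along the two strands and hydrogen bonds between them]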
The central dogma of molecular biology
The DNA molecule contains the information for coding all the proteins (more than 100,000 in humans).
DNA expression (the number and type of proteins produced) differs from cell to cell.
The DNA molecule is divided into regions that code for proteins (genes) and non-coding regions.
[Figure label: chromatin]
The “words” in genes are triplets called codons
Each codon in the gene corresponds to an amino acid in the corresponding protein (the figure labels serine as an example).
The message encoding is not as simple as that! Strong degeneracy:
* 4^3 = 64 different triplets
* 20 different amino acids
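The counting argument behind the degeneracy can be checked with a minimal Python sketch. The helper names `all_codons` and `split_into_codons` are illustrative and not part of the slides, and the reading frame is assumed to start at the first base.

```python
from itertools import product

BASES = "ACGT"  # the four-letter DNA alphabet used in the slides


def all_codons():
    """Return every possible triplet over the four-letter alphabet."""
    return ["".join(triplet) for triplet in product(BASES, repeat=3)]


def split_into_codons(seq):
    """Cut a sequence into consecutive, non-overlapping triplets.

    Assumption: the reading frame starts at the first base; trailing
    bases that do not fill a whole triplet are dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = all_codons()
print(len(codons))       # number of distinct triplets, 4**3
print(len(codons) / 20)  # average number of triplets per amino acid
```

With 64 triplets available for only 20 amino acids, several codons must map to the same amino acid, which is the degeneracy the slide refers to.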
The C paradox (“complete genome size” paradox)
The size of complete genomes (the total length of DNA on all chromosomes) does not seem to be correlated with the phenotypic complexity of the species:
* Homo sapiens: C ~ 3.4 x 10^9 bp
* Pinus resinosa: C ~ 6.8 x 10^10 bp
* Amoeba dubia: C ~ 6.7 x 10^11 bp
Coding to non-coding ratio
The coding to non-coding ratio generally decreases from simpler to higher organisms.
Is it all junk?
The higher the organism is on the evolutionary scale, the greater the amount of “junk” DNA. Possible functions:
* Regulatory sequences: instructions concerning transcription and expression
* Specific protein binding sites
* Interplay between DNA sequence and structure
An example: promoters
Transcription factors bind to specific sequences upstream of the gene (promoters). Then the RNA polymerase “recognises” where to start transcription.
Figure caption: Regulation of gene expression through promoters and transcription factors. Adjacent to genes are promoters, DNA regions that control how much RNA is transcribed from that gene. (1) The enzyme RNA polymerase II initiates transcription at a specific site in the promoter. (2) Certain 'basal' transcription factors control the binding of RNA polymerase to this site. Other proteins that bind to specific short DNA stretches in the promoter and basal factors work together to activate or inhibit transcription. Drugs such as alcohol may modify the activity of those factors. (3) Here, alcohol is arbitrarily assumed to increase the activity of an activating transcription factor (ATF), resulting in (4) increased RNA synthesis from the alcohol-responsive gene. (5) Genes lacking this particular ATF binding site would not respond to alcohol.
The role of non-coding DNA … is still an open problem
Analysis of DNA sequences with:
* Statistical physics
* Tools from computational linguistics
* Techniques of information theory
Statistical analysis of DNA sequences
A short account of how physicists abuse biology
Restatement of the problem
A long sequence of symbols drawn from a four-letter alphabet {A,T,C,G} is given. How is it organised? Is there a language?
The space of investigation is at least two-dimensional:
* Functional bipartition of the sequence into coding and non-coding regions (x axis)
* Position of the organism on the evolutionary scale (y axis)
Methodological prerequisite
When searching for rules that hold in the biological world, physicists must switch to the biological definition of law.
A law in biology is a rule that describes a given phenomenon with the least number of exceptions.
Long-range correlations
Long-range correlations in a sequence are the first signature of some non-trivial self-organisation (e.g. languages), as opposed to a sequence generated by a process characterised by local order.
Scale-free “fractal” process => power-law decay of correlations.
Counter-example: a Markov process of order n_0. The sequence produced by such a process has a characteristic length n_0 in its organisation, and correlations decay exponentially as exp(-n/n_0).
Long-range correlations: methods
* Spectral analysis of DNA sequences
* Walk representations of DNA sequences
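Mapping rules
Map a nucleotide sequence $\{n_i\}$ ($n_i \in \{A,C,T,G\}$, $i = 1, 2, \dots, L$) onto a numerical sequence $\{u_i\}$. Two binary rules are used:
* Purine-pyrimidine: $u_i$ distinguishes purines (A, G) from pyrimidines (C, T)
* Weak-strong: $u_i$ distinguishes weakly bonded bases (A, T) from strongly bonded bases (C, G)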
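The two mapping rules can be sketched in Python as lookup tables. The numerical values +1 and -1, and which class receives which sign, are assumptions made for illustration; the slide text does not preserve the values actually used.

```python
# Assumed convention: one class maps to -1, the other to +1.
# The sign assignment is arbitrary and is NOT taken from the slides.
PURINE_PYRIMIDINE = {"A": -1, "G": -1, "C": +1, "T": +1}
WEAK_STRONG = {"A": -1, "T": -1, "C": +1, "G": +1}


def map_sequence(seq, rule):
    """Map a nucleotide string onto a list of numbers u_i using a rule table."""
    return [rule[base] for base in seq.upper()]


example = "ACGTTA"
print(map_sequence(example, PURINE_PYRIMIDINE))
print(map_sequence(example, WEAK_STRONG))
```

Because the same sequence gives different numerical series under different rules, any statistic computed from $\{u_i\}$ depends on the rule chosen.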
Spectral analysis
Divide the sequence of L nucleotides into K = [L/N] non-overlapping sub-sequences of size N, and compute the power spectrum of each one,
$S(f) = |q_f|^2$,
where $q_f$ is the Discrete Fourier Transform of the set $\{u_i\}$.
Finally, average $S(f)$ over the K sub-sequences. If a sequence has long-range power-law correlations, then
$S(f) \sim f^{-\beta}$.
Hence, a log-log plot of $S(f)$ vs $f$ is a straight line with slope $-\beta$.
[Figure: averaged power spectrum]
* Low-frequency portion: spurious contributions for f below order 1/N
* Medium-frequency portion: power-law behaviour
* High-frequency portion: peaks at f = 1/3 and f = 1/9 bp^-1
The measured distributions of $\beta$ are found to be peaked on non-zero values in non-coding regions only (purine-pyrimidine mapping rule, N = 512).
Practical use? This result may be useful in gene-finding algorithms based on the different correlation properties of coding and non-coding regions.
S.V. Buldyrev et al., PRE 51, 5084 (1995); S.M. Ossadnik et al., Biophys. J. 67, 64 (1994).
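The windowed spectral analysis described above can be sketched as follows. This is a minimal illustration and not the code of the cited papers: the function names are hypothetical, the DFT is written out directly with the standard library (quadratic cost, fine for short windows), and no normalisation of $S(f)$ is applied because the slides do not specify one.

```python
import cmath


def power_spectrum(u):
    """Return |q_f|^2 for frequencies f = k/N, k = 1 .. N//2.

    q_f is the discrete Fourier transform of the window u; the k = 0
    term (the mean) is skipped.
    """
    n = len(u)
    spectrum = []
    for k in range(1, n // 2 + 1):
        q = sum(u[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
        spectrum.append(abs(q) ** 2)
    return spectrum


def averaged_spectrum(u, window):
    """Average the power spectrum over K = len(u) // window non-overlapping windows."""
    k_windows = len(u) // window
    if k_windows == 0:
        raise ValueError("sequence shorter than one window")
    total = [0.0] * (window // 2)
    for w in range(k_windows):
        chunk = u[w * window:(w + 1) * window]
        for i, value in enumerate(power_spectrum(chunk)):
            total[i] += value
    return [t / k_windows for t in total]


# A strictly alternating series has all its power at f = 1/2.
print(averaged_spectrum([1, -1] * 8, window=8))
```

An estimate of $\beta$ would then come from a straight-line fit to log S(f) against log f over the medium-frequency range, which is left out of this sketch.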
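Walk representation
Mapping rule => displacement: each symbol $u_i$ is a step, and the displacement after l nucleotides is
$y(l) = \sum_{i=1}^{l} u_i$.
[Figure: displacement y(l) versus nucleotide distance l]
One then studies the root mean square fluctuation F(l) about the average of the displacement,
$F^2(l) = \overline{[\Delta y(l)]^2} - \overline{\Delta y(l)}^{\,2}$,
where $\Delta y(l) = y(l_0 + l) - y(l_0)$ and the bars indicate averages over $l_0$.
The mean square fluctuation is related to the autocorrelation function C(l) in the following way:
$F^2(l) = \sum_{i=1}^{l} \sum_{j=1}^{l} C(j - i)$.
There are three possible behaviours:
* Random base sequence => C(l) = 0 on average (except C(0) = 1). Hence $F(l) \sim l^{1/2}$ (normal random walk).
* Local correlations up to a distance L => $C(l) \sim \exp(-l/L)$. However, asymptotically (l >> L), $F(l) \sim l^{1/2}$.
* If there is no characteristic length (“infinite-length” correlations), one gets power laws of the form $F(l) \sim l^{\alpha}$ with $\alpha \neq 1/2$ (anomalous random walk).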
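The walk and its fluctuation can be computed in a few lines of Python. This is a sketch under the assumption that the mapped series takes the values +1 and -1; the function names are illustrative and not taken from the slides.

```python
def dna_walk(u):
    """Cumulative displacement y(l) = u_1 + ... + u_l, with y(0) = 0."""
    y = [0]
    for step in u:
        y.append(y[-1] + step)
    return y


def fluctuation(y, l):
    """Root mean square fluctuation F(l) of the displacement.

    The averages run over every admissible starting point l0 of a
    stretch of l nucleotides.
    """
    deltas = [y[l0 + l] - y[l0] for l0 in range(len(y) - l)]
    mean = sum(deltas) / len(deltas)
    mean_sq = sum(d * d for d in deltas) / len(deltas)
    # Guard against tiny negative values caused by rounding.
    return max(mean_sq - mean * mean, 0.0) ** 0.5


walk = dna_walk([1, -1, 1, 1, -1, -1, 1, -1])
print([fluctuation(walk, l) for l in (1, 2, 4)])
```

Plotting log F(l) against log l and reading off the slope gives the exponent that separates the normal case (1/2) from the anomalous one.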
Techniques from statistical linguistics
* Zipf plots
* Shannon entropy and redundancy
* Mutual information
Zipf plots
Conventional Zipf analysis:
* take all words in a text written in a given language
* measure their frequency of occurrence (f)
* construct the rank vector R by ordering the words from the most frequent to the least, assigning ranks R = 1, 2, ...
The Zipf plot is f(R). It is found that in several natural languages
$f(R) \sim R^{-z}$
with z close to 1.
n-tuple Zipf analysis
What if the words of the language are not known? One can fix a word length n and perform Zipf analysis of a sequence of N symbols by taking into account all N-n+1 overlapping words of length n.
Application of n-tuple Zipf analysis to natural and formal languages gives encouraging results. Yet the success of such analysis provides just a necessary condition for non-Markovian structure in a language.
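The n-tuple procedure amounts to counting overlapping words and ranking them. A minimal Python sketch follows; the function name `ntuple_zipf` is illustrative and not taken from the cited paper.

```python
from collections import Counter


def ntuple_zipf(seq, n):
    """Rank all overlapping n-tuples of seq by relative frequency.

    Returns a list of (rank, word, frequency) tuples, from the most
    frequent word (rank 1) to the least frequent.
    """
    total = len(seq) - n + 1  # number of overlapping words
    if total <= 0:
        raise ValueError("sequence shorter than the word length")
    counts = Counter(seq[i:i + n] for i in range(total))
    return [
        (rank, word, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


for rank, word, freq in ntuple_zipf("ATATATGCATATCG", 2):
    print(rank, word, round(freq, 3))
```

A Zipf plot is then the frequency column against the rank column on log-log axes; ties in frequency are ranked in an arbitrary but deterministic order here.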
n-tuple Zipf analysis of English
R.N. Mantegna et al, PRE 52, 2939 (1995)
English text of about 10^6 words from an encyclopedia. Alphabet of 32 characters: 26 English letters and 6 punctuation symbols, including the “blank” character.
The fit gives an exponent that is, to a good extent, independent of n.
n-tuple Zipf analysis of Unix binary code
R.N. Mantegna et al, PRE 52, 2939 (1995)
Compiled version of the Unix operating system (9 x 10^6 characters). The alphabet is binary.
The fit gives an exponent that is, to a good extent, independent of n.
n-tuple Zipf analysis still shows power-law behaviour, but the exponent z differs from that of conventional Zipf analysis. The difference may be due to the vocabulary enlargement of the n-tuple analysis:
* Conventional Zipf scheme: only real words
* n-tuple scheme: all n-long combinations
n-tuple Zipf analysis of human DNA sequence
R.N. Mantegna et al, PRE 52, 2939 (1995)
Primarily non-coding human DNA sequence (2 x 10^5 bases) from GenBank (coding = 1.5%).
The fit gives an exponent that is, to a good extent, independent of n.
Extensive n-tuple Zipf analysis of coding and non-coding sequences
R.N. Mantegna et al, PRE 52, 2939 (1995)
A statistically significant difference is found in the functional form of the n-tuple Zipf plots of coding and non-coding sequences in most organisms:
* Non-coding => power-law fit
* Coding => logarithmic fit
Entropy and Redundancy
Shannon, 1948: the foundations of information theory and the concept of entropy of an information source.
In order to introduce Shannon entropy, let us give a mathematical definition of surprise!
A mathematical formulation of surprise
Suppose that the surprise one feels upon learning that an event E has occurred depends only on the probability p(E), $0 < p \le 1$. It is natural to agree on four basic axioms for the function S(p):
* S(1) = 0
* S(p) is a strictly decreasing function of p
* S(p) is a continuous function of p
* S(pq) = S(p) + S(q), $0 < p, q \le 1$
Theorem: $S(p) = -C \log_2 p$, where C is an arbitrary positive constant.
If we take C = 1, then surprise is measured in bits.
Suppose an observable X = (x_1, x_2, …, x_N) is associated with a probability distribution P = (p_1, p_2, …, p_N). Then, the average amount of surprise felt upon learning the actual value of X will be
$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$.
H(X) can also be regarded as the average amount of information received when X is observed, and hence “carried” by X.
Entropy of an information source
Suppose a source of information emits symbols drawn from an alphabet with l characters. The source is characterised by a probability distribution P = (p_1, p_2, …, p_l). It is natural to define the entropy of the source (in bits) as
$H(\mathrm{source}) = -\sum_{i=1}^{l} p_i \log_2 p_i$.
H(source) measures the information content of the source associated with the probability distribution P.
Maximum entropy of an information source
The maximum entropy of P (maximum degree of uncertainty) is attained when all symbols are equally likely, $p_i = 1/l$ for every i. In this case one has
$H_{\max} = \log_2 l$.
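Surprise and source entropy are short enough to write down directly. The sketch below assumes C = 1, so that all values are in bits; the function names are illustrative.

```python
from math import log2


def surprise(p):
    """Surprise, in bits, of an event of probability p (taking C = 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -log2(p)


def shannon_entropy(probs):
    """Average surprise of a distribution, in bits.

    Symbols with zero probability contribute nothing to the sum.
    """
    return sum(p * surprise(p) for p in probs if p > 0)


uniform = [0.25] * 4           # four equally likely symbols
skewed = [0.7, 0.1, 0.1, 0.1]  # hypothetical biased source
print(shannon_entropy(uniform), shannon_entropy(skewed))
```

For a four-letter alphabet the uniform distribution reaches the maximum of log2(4) = 2 bits, and any bias in the symbol frequencies lowers the entropy below that value.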
Block (n-gram) entropy
Let us define the probability to observe a given string of length n, $p^{(n)}(s_1, \dots, s_n)$. The n-gram entropy is then defined as
$H_n = -\sum_{s_1, \dots, s_n} p^{(n)}(s_1, \dots, s_n) \log_2 p^{(n)}(s_1, \dots, s_n)$.
The largest-disorder (maximum information) case corresponds to $p^{(n)} = 1/l^n$ for every string, which gives a linear n-gram entropy, $H_n = n \log_2 l$.
Important result: deviations from randomness (presence of constraints) in a given sequence can be revealed in the form of a non-linear n-gram entropy.
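Estimating n-gram entropies from data
* Start with a sequence of length L
* Slide a window of length n by steps of 1 symbol (base)
* Store all the L - n + 1 encountered words
* Calculate $H_n \approx -\sum_i \frac{n_i}{L-n+1} \log_2 \frac{n_i}{L-n+1}$, where $n_i$ is the number of occurrences of the i-th word
Of course, in order to achieve statistical significance, one must make sure that $l^n \ll L$, i.e. that the number of possible words is much smaller than the number of sampled words.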
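The sliding-window recipe above translates almost line by line into Python. This is a minimal sketch with a hypothetical function name; it uses the plain frequency estimate and applies no finite-sample correction.

```python
from collections import Counter
from math import log2


def ngram_entropy(seq, n):
    """Estimate the n-gram entropy (in bits) from word frequencies.

    All len(seq) - n + 1 overlapping words of length n are counted and
    their relative frequencies are used as probabilities.
    """
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence shorter than the word length")
    counts = Counter(seq[i:i + n] for i in range(total))
    return -sum((c / total) * log2(c / total) for c in counts.values())


sequence = "ACGTACGTTTGACCA"
for n in (1, 2, 3):
    print(n, ngram_entropy(sequence, n))
```

On a short sequence such as this one the estimates for larger n are biased low, because most of the possible words are never observed; this is the statistical-significance caveat in practice.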
n-gram entropy measures in human chromosome 1
Two-symbol alphabet (purines = {A,G}, pyrimidines = {C,T}). Maximum n-gram entropy (in bits): S_max = n log2 2 = n.
[Figure: n-gram entropy for coding and non-coding intergenic regions]
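Redundancy
The flexibility of a language can be quantified in terms of its redundancy, i.e. the limiting relative deviation from the random case:
$R = \lim_{n \to \infty} \left( 1 - \frac{H_n}{n \log_2 l} \right)$,
where $H_n$ is the n-gram entropy and $l$ the number of characters in the alphabet.
The greater the redundancy, the farther the symbolic structure under study is from the random case, i.e. the more constraints (grammar) are present.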
Redundancy of DNA sequences
R.N. Mantegna et al, PRE 52, 2939 (1995)
[Figure: redundancy of non-coding regions and coding regions]
Correlations part II: mutual information
The traditional methods for studying correlations are not invariant under a change of the mapping rule. Moreover, correlation functions measure linear correlations by definition.
The mutual information generalises the measure of correlations between two symbols (bases) separated by a distance k in the sequence.
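The thermodynamic analogy
Two subsystems X and Y form the joint system [X, Y], whose parts may be statistically dependent or statistically independent.
[Figure: subsystems X and Y and the joint system [X, Y]]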
The mutual information can be defined as the distance in “entropy space” between statistical independence and dependence:
$I(X,Y) = S[X] + S[Y] - S[X,Y]$,
where each entropy has the form $S = -k_B \sum_i p_i \ln p_i$. If $k_B$ is replaced by $1/\ln 2$ then I(X,Y) quantifies the amount of mutual information between X and Y in bits.
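Mutual information in DNA sequences
Take X and Y to be the nucleotides $n_j$ and $n_i$ ($n_i, n_j \in \{A,C,T,G\}$) found at two positions a distance k apart in the sequence.
$P_{ij}(k)$: joint probability of finding nucleotides $n_i$ and $n_j$ spaced by k in the sequence.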
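A minimal Python sketch of the mutual information at distance k is given below. It uses the standard form $I(k) = \sum_{ij} P_{ij}(k) \log_2 [P_{ij}(k) / (P_i P_j)]$; estimating the marginals $P_i$ and $P_j$ from the same set of pairs is an assumption of this sketch, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(seq, k):
    """Mutual information (in bits) between symbols a distance k apart.

    Joint and marginal probabilities are all estimated from the
    len(seq) - k pairs (seq[i], seq[i + k]).
    """
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    if not pairs:
        raise ValueError("sequence too short for this distance")
    total = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / total
        mi += p_ab * log2(p_ab / ((first[a] / total) * (second[b] / total)))
    return mi


sequence = "ATGGCCATGGCGATGACC"
print([round(mutual_information(sequence, k), 3) for k in (1, 2, 3)])
```

Because only symbol frequencies enter the calculation, the result does not depend on any numerical mapping rule, which is the invariance property that motivates using mutual information.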
Mutual information in human DNA sequences
I. Grosse et al, PRE 61, 5624 (2000)
Mutual information of human coding and non-coding sequences from GenBank, calculated by averaging over fragments of length 500 bp.
The mutual information clearly distinguishes between coding and non-coding regions.
[Figure annotation: period 3 bp]