The DNA double helix: a short tutorial on DNA structure and functions
Nucleotides: the bricks of a ladder structure
The DNA helix is assembled from four structural units called nucleotides. Each nucleotide has three building blocks:
* a base, either a purine (adenine (A), guanine (G)) or a pyrimidine (cytosine (C), thymine (T))
* a phosphate group
* a sugar ring
The double helix
[Figure: the two paired strands of the helix, e.g. A T C G A T …]
Four-letter alphabet {A,C,G,T}. The two strands are held together by hydrogen bonds between complementary bases (A with T, C with G).
The central dogma of molecular biology
The DNA molecule contains the information coding for all the proteins (more than 100,000 in humans). DNA expression (the number and type of proteins produced) differs from cell to cell.
The DNA molecule is divided into regions that code for proteins (genes) and non-coding regions. [Figure: packaging of DNA into chromatin.]
The “words” in genes are triplets called codons
Each codon on the gene corresponds to an amino acid on the corresponding protein (e.g. a codon for serine). The message encoding is not as simple as that: it is strongly degenerate, since 4^3 = 64 different triplets code for only 20 different amino acids.
The C paradox (“complete genome size” paradox)
The size of complete genomes (C = total length of DNA on all chromosomes) does not seem to be correlated with the phenotypic complexity of the species:
* Homo sapiens: C ~ 3.4 x 10^9 bp
* Pinus resinosa: C ~ 6.8 x 10^10 bp
* Amoeba dubia: C ~ 6.7 x 10^11 bp
Coding to non-coding ratio
The coding-to-non-coding ratio generally decreases from simpler to higher organisms.
Is it all junk?
The higher the organism is on the evolutionary scale, the greater the amount of “junk” DNA. Is it all junk? Non-coding DNA is known to contain:
* regulatory sequences: instructions concerning transcription and expression
* specific protein binding sites
* an interplay between DNA sequence and structure
An example: promoters
Transcription factors bind to specific sequences upstream of the gene (promoters); the RNA polymerase then “recognises” where to start transcription.
[Figure: regulation of gene expression through promoters and transcription factors. Adjacent to genes are promoters, DNA regions that control how much RNA is transcribed from the gene. (1) The enzyme RNA polymerase II initiates transcription at a specific site in the promoter. (2) Certain “basal” transcription factors control the binding of RNA polymerase to this site; other proteins, which bind to specific short DNA stretches in the promoter, work together with the basal factors to activate or inhibit transcription. Drugs such as alcohol may modify the activity of those factors. (3) Here, alcohol is arbitrarily assumed to increase the activity of an activating transcription factor (ATF), resulting in (4) increased RNA synthesis from the alcohol-responsive gene. (5) Genes lacking this particular ATF binding site would not respond to alcohol.]
The role of non-coding DNA
… is still an open problem, attacked with:
* tools from computational linguistics
* techniques of information theory
* analysis of DNA sequences with statistical physics
Statistical analysis of DNA sequences
A short account of how physicists abuse biology
Restatement of the problem
A long sequence of symbols drawn from a four-letter alphabet {A,T,C,G} is given. How is it organised? Is there a language? The space of investigation is at least two-dimensional:
* functional bipartition of the sequence into coding and non-coding regions (x axis)
* position of the organism on the evolutionary scale (y axis)
Methodological prerequisite
When searching for rules that hold in the biological world, physicists must switch to the biological definition of law: a law in biology is a rule that describes a given phenomenon with the fewest exceptions.
Long-range correlations
Long-range correlations in a sequence are the first signature of some non-trivial self-organisation (e.g. languages), as opposed to a sequence generated by a process characterised by local order. A scale-free, “fractal” process gives a power-law decay of correlations. Counter-example: a Markov process of order n0. There exists a characteristic length n0 in the organisation of the sequence produced by such a process, and correlations decay exponentially as C(n) ~ exp(-n/n0).
Long-range correlations: methods
* spectral analysis of DNA sequences
* walk representations of DNA sequences
Mapping rules
Map a nucleotide sequence {n_i} (n_i in {A,C,T,G}, i = 1, 2, …, L) onto a numerical sequence {u_i}. Two common binary rules are purine-pyrimidine and weak-strong, sketched below.
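As an illustration (Python, not from the original slides), the two rules and the mapping step might look like this; the sign conventions are a free choice:

# Two binary mapping rules; the literature uses both sign conventions.
PURINE_PYRIMIDINE = {"A": +1, "G": +1, "C": -1, "T": -1}  # purines +1, pyrimidines -1
WEAK_STRONG = {"A": +1, "T": +1, "C": -1, "G": -1}        # weak (2 H-bonds) +1, strong (3) -1

def map_sequence(seq, rule):
    """Map a nucleotide string {n_i} onto a numerical sequence {u_i}."""
    return [rule[n] for n in seq.upper()]

u = map_sequence("ATCGGATC", PURINE_PYRIMIDINE)  # [1, -1, -1, 1, 1, 1, -1, -1]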
Spectral analysis
Divide the sequence of L nucleotides into K = [L/N] non-overlapping sub-sequences of size N, and compute the power spectrum S(f) = |q_f|^2 of each one, where q_f is the discrete Fourier transform of the set {u_i}.
Finally, average S(f) over the K subsequences. If a sequence has long-range power-law correlations, then S(f) ~ f^(-β). Hence, a log-log plot of S(f) vs f is a straight line with slope -β.
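A minimal sketch of the whole procedure; the window size and the fitting band are illustrative choices:

import numpy as np

def average_power_spectrum(u, N=512):
    """Average the power spectrum over the K = L // N non-overlapping
    subsequences of length N."""
    u = np.asarray(u, dtype=float)
    K = len(u) // N
    spectra = [np.abs(np.fft.rfft(u[k*N:(k+1)*N]))**2 / N for k in range(K)]
    S = np.mean(spectra, axis=0)
    f = np.fft.rfftfreq(N, d=1.0)  # frequency in cycles per base pair
    return f[1:], S[1:]            # drop the f = 0 component

def spectral_exponent(f, S, fmin=0.004, fmax=0.1):
    """Fit log S(f) vs log f over a medium-frequency band; the slope
    estimates -beta in S(f) ~ f**(-beta)."""
    sel = (f >= fmin) & (f <= fmax)
    slope, _ = np.polyfit(np.log(f[sel]), np.log(S[sel]), 1)
    return -slope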
The averaged spectrum has three portions:
* low-frequency portion: spurious contributions for f of order 1/N and below
* medium-frequency portion: power-law behaviour
* high-frequency portion: peaks at f = 1/3 and f = 1/9 bp^-1
The measured distributions of the exponent β are found to be peaked at non-zero values in non-coding regions only (purine-pyrimidine mapping rule, N = 512). Practical use? This result may be useful in gene-finding algorithms based on the different correlation properties of coding and non-coding regions. S.V. Buldyrev et al., PRE 51, 5084 (1995); S.M. Ossadnik et al., Biophys. J. 67, 64 (1994).
Walk representation
The mapping rule turns the sequence into a walk: the displacement after l steps is y(l) = sum_{i=1}^{l} u_i, plotted as a function of the nucleotide distance l.
One then studies the root mean square fluctuation F(l) of the displacement about its average,
F(l) = sqrt( <Δy(l)^2> - <Δy(l)>^2 ), with Δy(l) = y(l0 + l) - y(l0),
where the bars <…> indicate averages over all starting positions l0.
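A minimal sketch of the walk and of F(l); the set of lags is a free choice:

import numpy as np

def dna_walk(u):
    """Displacement y(l): cumulative sum of the mapped sequence {u_i}."""
    return np.cumsum(u)

def rms_fluctuation(u, lags):
    """F(l) = sqrt(<dy^2> - <dy>^2), dy = y(l0+l) - y(l0), averaged
    over all starting positions l0."""
    y = dna_walk(u)
    F = []
    for l in lags:
        dy = y[l:] - y[:-l]
        F.append(np.sqrt(np.mean(dy**2) - np.mean(dy)**2))
    return np.array(F)

# For an uncorrelated sequence F(l) ~ l**0.5; a different exponent
# signals long-range correlations (see the three cases below).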
The mean square fluctuation is related to the autocorrelation function C(m) = <u_i u_{i+m}> in the following way (for a zero-mean mapped sequence):
F^2(l) = sum_{i,j=1}^{l} C(|i-j|) = l C(0) + 2 sum_{m=1}^{l-1} (l - m) C(m).
There are three possible behaviours:
* Random base sequence: C(l) = 0 on average (except C(0) = 1). Hence F(l) ~ l^(1/2) (normal random walk).
* Local correlations up to a distance L: F(l) deviates from the random-walk law at short distances; however, asymptotically (l >> L) one recovers F(l) ~ l^(1/2).
* If there is no characteristic length (“infinite-length” correlations) one gets power laws of the form F(l) ~ l^α with α ≠ 1/2 (anomalous random walk).
Techniques from statistical linguistics
* Zipf plots
* Shannon entropy and redundancy
* Mutual information
Zipf plots
Conventional Zipf analysis: take all the words in a text written in a given language and measure their frequency of occurrence f. Construct the rank vector R by ordering the words from the most frequent to the least, assigning ranks R = 1, 2, .... The Zipf plot is f(R). In several natural languages it is found that f(R) ~ R^(-z) with z close to 1.
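A minimal sketch of the rank-frequency construction, assuming whitespace-separated words:

from collections import Counter

def zipf(words):
    """Rank-frequency pairs: rank R = 1 for the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(range(1, len(freqs) + 1)), freqs

# A log-log plot of f(R) vs R is close to a straight line of slope -z.
ranks, freqs = zipf("to be or not to be that is the question".split())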
n-tuple Zipf analysis
What if the words of the language are not known? One can fix a word length n and perform the Zipf analysis of a sequence of N symbols by taking into account all N - n + 1 (overlapping) words, as sketched below. The application of n-tuple Zipf analysis to natural and formal languages gives encouraging results. Yet, the success of such an analysis provides just a necessary condition for non-Markovian structure in a language.
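The n-tuple variant, reusing the zipf helper sketched above:

def ntuple_zipf(seq, n):
    """Zipf analysis when the words are unknown: the vocabulary is
    the set of all N - n + 1 overlapping n-tuples."""
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return zipf(words)

ranks, freqs = ntuple_zipf("ATCGGATCATCG", 3)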
n-tuple Zipf analysis of English (R.N. Mantegna et al., PRE 52, 2939 (1995))
English text of about 10^6 words from an encyclopedia; alphabet of 32 characters: the 26 English letters and 6 punctuation symbols, including the “blank” character. The fitted exponent z is to a good extent independent of n.
n-tuple Zipf analysis of Unix binary code (R.N. Mantegna et al., PRE 52, 2939 (1995))
Compiled version of the Unix operating system (9 x 10^6 characters); the alphabet is binary. The fitted exponent z is to a good extent independent of n.
The n-tuple Zipf analysis still shows power-law behaviour, but the exponent z differs from that of conventional Zipf analysis. The difference may be due to the vocabulary enlargement of the n-tuple analysis: the conventional Zipf scheme counts only real words, while the n-tuple scheme counts all n-long combinations.
n-tuple Zipf analysis of a human DNA sequence (R.N. Mantegna et al., PRE 52, 2939 (1995))
Primarily non-coding human DNA sequence (2 x 10^5 bases) from GenBank (coding = 1.5%). The fitted exponent z is to a good extent independent of n.
Extensive n-tuple Zipf analysis of coding and non-coding sequences (R.N. Mantegna et al., PRE 52, 2939 (1995))
A statistically significant difference is found in the functional form of the n-tuple Zipf plots of coding and non-coding sequences in most organisms:
* non-coding => power-law fit
* coding => logarithmic fit
Entropy and redundancy
Shannon, 1948: the foundations of information theory and the concept of the entropy of an information source. In order to introduce the Shannon entropy, let us give a mathematical definition of surprise!
A mathematical formulation of surprise
Suppose that the surprise one feels upon learning that an event E has occurred depends only on the probability p(E), 0 < p <= 1. It is natural to agree on four basic axioms for the function S(p):
* S(1) = 0
* S(p) is a strictly decreasing function of p
* S(p) is a continuous function of p
* S(pq) = S(p) + S(q), 0 < p, q < 1
Theorem: the only functions satisfying these axioms are S(p) = -C log(p), with the constant C > 0 fixing the units.
If we take S(p) = log2(1/p) (i.e. C = 1 with base-2 logarithms), then surprise is measured in bits. Suppose an observable X = (x1, x2, …, xN) is associated with a probability distribution P = (p1, p2, …, pN). Then the average amount of surprise felt upon learning the actual value of X will be
H(X) = sum_i p_i log2(1/p_i).
H(X) can also be regarded as the average amount of information received when X is observed, and hence “carried” by X.
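A minimal sketch of surprise and of its average (the entropy), in bits:

import math

def surprise(p):
    """S(p) = log2(1/p), in bits."""
    return math.log2(1.0 / p)

def entropy(P):
    """H(X) = sum_i p_i * log2(1/p_i): the average surprise, in bits."""
    return sum(p * math.log2(1.0 / p) for p in P if p > 0)

entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits: the maximum for 4 outcomes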
Entropy of an information source
Suppose a source of information emits symbols drawn from an alphabet with l characters, and is characterised by a probability distribution P = (p1, p2, …, pl). It is natural to define the entropy of the source (in bits) as
H(source) = - sum_{i=1}^{l} p_i log2(p_i).
H(source) measures the information content of the source associated with the probability distribution P.
Maximum entropy of an information source
The maximum entropy of P (maximum degree of uncertainty) is attained when p_i = 1/l for all i. In this case one has H_max = log2(l).
Block (n-gram) entropy
Let us define p_n(w), the probability to observe a given string w of length n. The n-gram entropy is then defined as
H_n = - sum_w p_n(w) log2( p_n(w) ),
where the sum runs over all strings of length n.
Important result: the maximum-disorder (maximum information) case corresponds to a linear n-gram entropy, H_n = n log2(l). Deviations from randomness (the presence of constraints) in a given sequence can be revealed in the form of a non-linear n-gram entropy.
Estimating n-gram entropies from data
* Start with a sequence of length L.
* Slide a window of length n by steps of 1 symbol (base).
* Store all the L - n + 1 encountered words.
* Calculate H_n = - sum_i f_i log2(f_i) with f_i = n_i / (L - n + 1), where n_i is the number of occurrences of the i-th word.
Of course, in order to achieve statistical significance, one must make sure that L - n + 1 >> l^n. A sketch follows.
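A minimal sketch of this estimator:

from collections import Counter
import math

def ngram_entropy(seq, n):
    """Estimate H_n by sliding a length-n window one symbol at a time
    and counting all L - n + 1 encountered words."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Statistical significance requires total >> (alphabet size)**n;
# otherwise H_n is systematically underestimated.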
n-gram entropy measures in human chromosome 1
Two-symbol alphabet (purines = {A,G}, pyrimidines = {C,T}); the maximum n-gram entropy (in bits) is S_max = n log2(2) = n. [Figure: H_n vs n for coding, non-coding, and intergenic regions.]
Redundancy
The flexibility of a language can be quantified in terms of its redundancy, i.e. the limiting relative deviation of the n-gram entropy from the random case:
R = lim_{n -> inf} ( 1 - H_n / (n log2(l)) ).
The greater the redundancy, the farther the symbolic structure under study is from the random case, i.e. the more constraints (grammar) are present.
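A sketch of the finite-n estimate R_n, reusing the ngram_entropy helper (and the math import) from the sketch above:

def redundancy(seq, n, alphabet_size):
    """R_n = 1 - H_n / (n * log2(alphabet_size)): relative deviation
    from the maximally random case at word length n."""
    return 1.0 - ngram_entropy(seq, n) / (n * math.log2(alphabet_size))

redundancy("ATCGGATCATCGATTA", 2, 4)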
Redundancy of DNA sequences (R.N. Mantegna et al., PRE 52, 2939 (1995))
[Figure: redundancy of non-coding vs coding regions.]
Correlations, part II: mutual information
The traditional methods for studying correlations are not invariant under a change of the mapping rule; moreover, correlation functions measure linear correlations by definition. The mutual information generalises the measure of correlations between two symbols (bases) separated by a distance k in the sequence.
The thermodynamical analogy
Consider two systems X and Y and the joint system [X, Y]. Under statistical independence the entropies add, H(X,Y) = H(X) + H(Y); statistical dependence lowers the joint entropy, H(X,Y) < H(X) + H(Y).
The mutual information can be defined as the distance in “entropy space” between statistical independence and dependence:
I(X,Y) = H(X) + H(Y) - H(X,Y).
If k_B is replaced by 1/ln(2) (i.e. thermodynamic entropies are expressed with base-2 logarithms), then I(X,Y) quantifies the amount of mutual information between X and Y in bits.
Mutual information in DNA sequences
Take X and Y to be the nucleotides n_i, n_j in {A,C,T,G} found at two positions separated by a distance k, and let P_ij(k) be the joint probability of finding nucleotides n_i and n_j spaced by k in the sequence. Then
I(k) = sum_{i,j} P_ij(k) log2( P_ij(k) / (P_i P_j) ),
where P_i and P_j are the single-nucleotide probabilities.
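A minimal sketch of I(k) for a nucleotide string:

from collections import Counter
import math

def mutual_information(seq, k):
    """I(k) = sum_ij P_ij(k) log2( P_ij(k) / (P_i P_j) ) for pairs of
    bases separated by k positions."""
    total = len(seq) - k
    pij = Counter((seq[i], seq[i + k]) for i in range(total))
    pi = Counter(seq[:total])  # marginal of the left base of each pair
    pj = Counter(seq[k:])      # marginal of the right base of each pair
    return sum((c / total) * math.log2((c / total) /
               ((pi[a] / total) * (pj[b] / total)))
               for (a, b), c in pij.items())

# In coding DNA, I(k) oscillates with period 3 (the codon structure);
# this signature is absent in non-coding DNA.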
Mutual information in human DNA sequences (I. Grosse et al., PRE 61, 5624 (2000))
Mutual information of human coding and non-coding sequences from GenBank, calculated by averaging over fragments of length 500 bp. The mutual information clearly distinguishes between coding and non-coding regions: coding sequences show a marked period of 3 bp.