A short tutorial on DNA structure and functions


The DNA double helix

Nucleotides: the bricks of a ladder structure
The DNA helix is assembled from four structural units called nucleotides. Each nucleotide has 3 building blocks:
* Base: a purine, Adenine (A) or Guanine (G), or a pyrimidine, Cytosine (C) or Thymine (T)
* Sugar ring
* Phosphate group

The double helix
[Figure: two paired strands, labelled A T C G A T ... G T A C, joined by hydrogen bonds]
Four-letter alphabet {A,C,G,T}

The central dogma of molecular biology
The DNA molecule contains the information for coding all the proteins (more than 100,000 in humans).
The DNA expression (number and type of proteins produced) is different in different cells.

The DNA molecule is divided into regions that code for proteins (GENES) and non-coding regions.
[Figure label: chromatin]

The “words” in genes are triplets called codons
Each codon on the gene corresponds to an amino acid on the corresponding protein (example in the figure: Serine).
The message encoding is not as simple as that! There is strong degeneracy: $4^3 = 64$ different triplets, but only 20 different amino acids.
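The counting argument on this slide can be checked in a few lines. This is only a sketch: the variable names are illustrative, and the set of serine codons is taken from the standard genetic code (written in the DNA alphabet), which the slide itself does not list.

```python
from itertools import product

ALPHABET = "ACGT"

# Every possible triplet over the four-letter alphabet: 4**3 = 64 codons.
codons = ["".join(triplet) for triplet in product(ALPHABET, repeat=3)]

# Assumption for illustration: the six serine codons of the standard genetic code.
SERINE_CODONS = {"TCT", "TCC", "TCA", "TCG", "AGT", "AGC"}

n_codons = len(codons)
n_amino_acids = 20

print(n_codons)                   # number of distinct triplets
print(n_codons / n_amino_acids)   # average number of codons per amino acid
print(len(SERINE_CODONS))         # serine alone is encoded by several codons
```

Since there are more codons than amino acids, several codons must map to the same amino acid, which is what "degeneracy" refers to here.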

The C paradox (“complete genome size” paradox)
The size of complete genomes (= total length of DNA on all chromosomes) does not seem to be correlated with the phenotypic complexity of the species:
* Homo sapiens: $C \sim 3.4 \times 10^{9}$ bp
* Pinus resinosa: $C \sim 6.8 \times 10^{10}$ bp
* Amoeba dubia: $C \sim 6.7 \times 10^{11}$ bp

Coding to non-coding ratio
The coding to non-coding ratio generally decreases from simpler to higher organisms.

Is it all junk?
The higher the organism is on the evolutionary scale, the greater the amount of “junk” DNA. Non-coding DNA may nevertheless contain:
* Regulatory sequences: instructions concerning transcription and expression
* Specific protein binding sites
* Interplay between DNA sequence and structure

An example: promoters
Transcription factors bind to specific sequences upstream of the gene (promoters). The RNA polymerase then “recognises” where to start transcription.
Figure caption: Regulation of gene expression through promoters and transcription factors. Adjacent to genes are promoters, DNA regions that control how much RNA is transcribed from that gene. (1) The enzyme RNA polymerase II initiates transcription at a specific site in the promoter. (2) Certain 'basal' transcription factors control the binding of RNA polymerase to this site. Other proteins, which bind to specific short DNA stretches in the promoter, and the basal factors work together to activate or inhibit transcription. Drugs such as alcohol may modify the activity of those factors. (3) Here, alcohol is arbitrarily assumed to increase the activity of an activating transcription factor (ATF), resulting in (4) increased RNA synthesis from the alcohol-responsive gene. (5) Genes lacking this particular ATF binding site would not respond to alcohol.

The role of non-coding DNA is still an open problem
Analysis of DNA sequences with:
* Statistical physics
* Tools from computational linguistics
* Techniques of information theory

Statistical analysis of DNA sequences
A short account of how physicists abuse biology

Restatement of the problem
A long sequence is given of symbols drawn from a four-letter alphabet {A,T,C,G}. How is it organised? Is there a language?
The space of investigation is at least two-dimensional:
* Functional bipartition of the sequence into coding and non-coding regions (x axis)
* Position of the organism on the evolutionary scale (y axis)

Methodological pre-requisite
When searching for rules that hold in the biological world, physicists must switch to the biological definition of law.
A law in biology is a rule that holds for describing a given phenomenon with the least number of exceptions.

Long-range correlations
Long-range correlations in a sequence are the first signature of some non-trivial self-organisation (e.g. languages), as opposed to a sequence generated by a process characterised by local order.
Scale-free “fractal” process => power-law decay of correlations.
Counter-example: Markov process of order $n_0$. There exists a characteristic length $n_0$ in the organisation of the sequence produced by such a process. Correlations decay exponentially as $\exp(-n/n_0)$.

Long-range correlations: methods
* Spectral analysis of DNA sequences
* Walk representations of DNA sequences

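Mapping rules
Map a nucleotide sequence $\{n_i\}$ ($n_i \in \{A,C,T,G\}$, $i = 1, 2, \ldots, L$) onto a numerical sequence $\{u_i\}$. Two common rules split the four bases into two classes:
* Purine-pyrimidine: {A,G} versus {C,T}
* Weak-strong: {A,T} versus {C,G}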
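The two mapping rules can be written as small functions. This is a sketch under an assumption the slide does not confirm: each class is mapped to +1 or -1, and the choice of which class gets +1 is arbitrary. The function names are illustrative.

```python
PURINES = set("AG")   # the pyrimidines are C and T
STRONG = set("CG")    # the weak bases are A and T


def map_purine_pyrimidine(sequence):
    """Assumed convention: purine -> +1, pyrimidine -> -1."""
    return [1 if base in PURINES else -1 for base in sequence]


def map_weak_strong(sequence):
    """Assumed convention: strong (C, G) -> +1, weak (A, T) -> -1."""
    return [1 if base in STRONG else -1 for base in sequence]


print(map_purine_pyrimidine("ATCG"))
print(map_weak_strong("ATCG"))
```

Either numerical sequence can then be fed to the spectral or walk analyses; results generally depend on which rule is used.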

Spectral analysis
Divide the sequence of L nucleotides into $K = [L/N]$ non-overlapping sub-sequences of size N, and compute the power spectrum of each one,
$S(f) = |q_f|^2$,
where $q_f$ is the Discrete Fourier Transform of the set $\{u_i\}$.

Finally, average $S(f)$ over the K subsequences.
If a sequence has long-range power-law correlations, then
$S(f) \sim f^{-b}$.
Hence, a log-log plot of $S(f)$ vs $f$ is a straight line with slope $-b$.
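The windowed averaging procedure can be sketched in plain Python. The assumptions here are not from the slides: the spectrum is taken as the unnormalised $|q_f|^2$ at frequencies $f = k/N$ for $k = 1, \ldots, N/2$, the DFT is computed directly (slow but dependency-free), and the toy input is an artificial period-3 sequence rather than real DNA.

```python
import cmath


def power_spectrum(u):
    """Return |q_f|**2 for f = k/N, k = 1..N//2, where q_f is the DFT of u."""
    n = len(u)
    spectrum = []
    for k in range(1, n // 2 + 1):
        q = sum(u[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
        spectrum.append(abs(q) ** 2)
    return spectrum


def averaged_spectrum(u, window):
    """Average the power spectrum over the K = len(u) // window non-overlapping windows."""
    k_windows = len(u) // window
    spectra = [
        power_spectrum(u[i * window:(i + 1) * window]) for i in range(k_windows)
    ]
    return [sum(column) / k_windows for column in zip(*spectra)]


# Toy numerical sequence with an exact period of 3 symbols.
u = [1, -1, -1] * 60
window = 36
s = averaged_spectrum(u, window)

# Locate the strongest frequency component; index i corresponds to f = (i + 1) / window.
peak_index = max(range(len(s)), key=s.__getitem__)
peak_frequency = (peak_index + 1) / window
print(peak_frequency)
```

For real sequences one would fit a straight line to log S(f) against log f in the medium-frequency range to estimate the exponent b.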

[Figure: averaged power spectrum]
* Low-frequency portion: spurious contributions for f below order 1/N
* Medium-frequency portion: power-law behaviour
* High-frequency portion: peaks at $f = 1/3$ and $f = 1/9$ bp$^{-1}$

The measured distributions are found to be peaked on non-zero values of b in non-coding regions only (purine-pyrimidine mapping rule, N = 512).
Practical use? This result may be useful in gene-finding algorithms based on the different correlation properties of coding and non-coding regions.
S.V. Buldyrev et al., PRE, 51, 5084 (1995); S.M. Ossadnik et al., Biophys. J., 67, 64 (1994).

Walk representation
Mapping rule => displacement $y(l) = \sum_{i=1}^{l} u_i$
[Figure: displacement y(l) versus nucleotide distance l]

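One then studies the root mean square fluctuations F(l) about the average of the displacement,
$F^2(l) = \overline{[\Delta y(l)]^2} - \overline{\Delta y(l)}^{\,2}$,
where $\Delta y(l) = y(l_0 + l) - y(l_0)$ and the bars indicate averages over $l_0$.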

The mean square fluctuation is related to the autocorrelation function $C(l)$ in the following way:
$F^2(l) = \sum_{i=1}^{l} \sum_{j=1}^{l} C(j - i)$

There are three possible behaviours
* Random base sequence => $C(l) = 0$ on average (except $C(0) = 1$). Hence $F(l) \sim l^{1/2}$ (normal random walk).
* Local correlations up to a distance L => $C(l) \sim \exp(-l/L)$. However, asymptotically ($l \gg L$), $F(l) \sim l^{1/2}$.
* If there is no characteristic length (“infinite-length” correlations), one gets power laws of the form $F(l) \sim l^{\alpha}$ with $\alpha \neq 1/2$ (anomalous random walk).
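The walk analysis can be sketched as follows. The definition used is the usual one, assumed here because the slide equations are not preserved: $y(l)$ is the cumulative sum of the $u_i$, and $F(l)$ is the standard deviation of $y(l_0 + l) - y(l_0)$ over all starting points $l_0$. The random ±1 input stands in for an uncorrelated sequence; function names are illustrative.

```python
import random


def displacement(u):
    """Cumulative walk: y[0] = 0 and y[l] = u[0] + ... + u[l - 1]."""
    y = [0]
    for step in u:
        y.append(y[-1] + step)
    return y


def fluctuation(y, l):
    """Root mean square fluctuation of y(l0 + l) - y(l0), averaged over l0."""
    deltas = [y[l0 + l] - y[l0] for l0 in range(len(y) - l)]
    mean = sum(deltas) / len(deltas)
    mean_square = sum(d * d for d in deltas) / len(deltas)
    # Guard against tiny negative values caused by rounding.
    return max(0.0, mean_square - mean * mean) ** 0.5


random.seed(0)
u = [random.choice((-1, 1)) for _ in range(20000)]
y = displacement(u)

for l in (4, 16, 64):
    # For an uncorrelated walk F(l) is expected to grow roughly like l ** 0.5.
    print(l, fluctuation(y, l))
```

An exponent estimated from the slope of log F(l) against log l that differs clearly from 1/2 would point to the anomalous case described above.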

Techniques from statistical linguistics
* Zipf plots
* Shannon entropy and redundancy
* Mutual information

Zipf plots
Conventional Zipf analysis:
* Take all words in a text written in a given language
* Measure their frequency of occurrence (f)
* Construct the rank vector R by ordering the words from the most frequent to the least, assigning ranks R = 1, 2, ...
The Zipf plot is $f(R)$. It is found that in several natural languages
$f(R) \sim R^{-z}$
with z close to 1.

n-tuple Zipf analysis
What if the words of the language are not known? One can fix a word length n and perform Zipf analysis of a sequence of symbols of length N by taking into account all $N - n + 1$ words.
Application of n-tuple Zipf analysis to natural and formal languages gives encouraging results. Yet, the success of such analysis provides just a necessary condition for non-Markovian structure in a language.
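The n-tuple procedure amounts to sliding a window of length n over the sequence, counting the words, and sorting by frequency. A minimal sketch, with an illustrative function name and a toy input string:

```python
from collections import Counter


def ntuple_zipf(sequence, n):
    """Return (word, frequency) pairs ordered from rank 1 (most frequent) downwards."""
    total = len(sequence) - n + 1  # number of overlapping n-tuples
    counts = Counter(sequence[i:i + n] for i in range(total))
    return [(word, count / total) for word, count in counts.most_common()]


ranked = ntuple_zipf("ATATATGCATATAT", 2)
for rank, (word, frequency) in enumerate(ranked, start=1):
    print(rank, word, frequency)
```

Plotting frequency against rank on log-log axes gives the Zipf plot; a straight line would indicate power-law behaviour.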

n-tuple Zipf analysis of English
R.N. Mantegna et al, PRE 52, 2939 (1995)
English text of about $10^6$ words from an encyclopedia. Alphabet of 32 characters: 26 English letters and 6 punctuation symbols, including the “blank” character.
The fit gives a power law whose exponent z is, to a good extent, independent of n.

n-tuple Zipf analysis of Unix binary code
R.N. Mantegna et al, PRE 52, 2939 (1995)
Compiled version of the Unix operating system ($9 \times 10^6$ characters). The alphabet is binary.
The fit gives a power law whose exponent z is, to a good extent, independent of n.

n-tuple Zipf analysis still shows power-law behaviour, but the exponent z is different from conventional Zipf analysis
The difference observed in the exponent z may be due to the vocabulary enlargement of the n-tuple analysis:
* Conventional Zipf scheme: only REAL WORDS
* n-tuple scheme: all n-long combinations

n-tuple Zipf analysis of human DNA sequence
R.N. Mantegna et al, PRE 52, 2939 (1995)
Primarily non-coding human DNA sequence ($2 \times 10^5$ bases) from GenBank (coding = 1.5 %).
The fit gives a power law whose exponent z is, to a good extent, independent of n.

Extensive n-tuple Zipf analysis of coding and non-coding sequences
R.N. Mantegna et al, PRE 52, 2939 (1995)
A statistically significant difference is found in the functional form of the n-tuple Zipf plots of coding and non-coding sequences in most organisms:
* Non-coding => power-law fit
* Coding => logarithmic fit

Entropy and Redundancy
Shannon, 1948: the foundations of information theory. The concept of entropy of an information source.
In order to introduce Shannon entropy, let us give a mathematical definition of surprise!

A mathematical formulation of surprise
Suppose that the surprise one feels upon learning that an event E has occurred depends only on the probability p(E), 0 < p < 1. It is natural to agree on four basic axioms for the function S(p):
* $S(1) = 0$
* $S(p)$ is a strictly decreasing function of p
* $S(p)$ is a continuous function of p
* $S(pq) = S(p) + S(q)$, $0 < p, q < 1$
Theorem: if S(p) satisfies these axioms, then $S(p) = -C \log_2 p$ for some constant $C > 0$.

If we take $C = 1$, then surprise is measured in bits.
Suppose an observable $X = (x_1, x_2, \ldots, x_N)$ is associated with a probability distribution $P = (p_1, p_2, \ldots, p_N)$. Then, the average amount of surprise felt upon learning the actual value of X will be
$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$
H(X) can also be regarded as the average amount of information received when X is observed, and hence “carried” by X.

Entropy of an information source
Suppose a source of information emits symbols drawn from an alphabet with l characters. The source is characterised by a probability distribution $P = (p_1, p_2, \ldots, p_l)$. It is natural to define the entropy of the source (in bits) as
$H(\text{source}) = -\sum_{i=1}^{l} p_i \log_2 p_i$
H(source) measures the information content of the source associated with the probability distribution P.

Maximum entropy of an information source
The maximum entropy of P (maximum degree of uncertainty) is attained when all symbols are equally likely, $p_i = 1/l$ for every i. In this case one has
$H_{max} = \log_2 l$
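The definitions of surprise, entropy and maximum entropy can be checked numerically. The sketch below assumes the standard forms, surprise $-\log_2 p$ and entropy $-\sum_i p_i \log_2 p_i$, both in bits; the function names are illustrative.

```python
from math import log2


def surprise(p):
    """Surprise in bits of an event with probability p (0 < p <= 1)."""
    return -log2(p)


def entropy(probabilities):
    """Average surprise in bits; zero-probability outcomes contribute nothing."""
    return sum(p * surprise(p) for p in probabilities if p > 0)


def max_entropy(alphabet_size):
    """Entropy of the uniform distribution over alphabet_size symbols."""
    return log2(alphabet_size)


# Additivity axiom: independent events with p = 0.5 and q = 0.25.
print(surprise(0.5 * 0.25), surprise(0.5) + surprise(0.25))

# A uniform four-letter source reaches the maximum; a biased one falls below it.
print(entropy([0.25] * 4), max_entropy(4))
print(entropy([0.7, 0.1, 0.1, 0.1]))
```

For a four-letter alphabet such as {A,C,G,T} the maximum is 2 bits per symbol.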

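Block (n-gram) entropy
Let us define $p_n(s_1, \ldots, s_n)$ as the probability to observe a given string $(s_1, \ldots, s_n)$ of length n. The n-gram entropy is then defined as
$H_n = -\sum_{s_1, \ldots, s_n} p_n(s_1, \ldots, s_n) \log_2 p_n(s_1, \ldots, s_n)$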

The largest-disorder (maximum information) case corresponds to all strings of length n being equally likely, $p_n = l^{-n}$, which gives $H_n = n \log_2 l$.
Important result: the maximum-disorder case corresponds to linear n-gram entropy. Deviations from randomness (presence of constraints) in a given sequence can be revealed in the form of non-linear n-gram entropy.

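Estimating n-gram entropies from data
* Start with a sequence of length L
* Slide a window of length n by steps of 1 symbol (base)
* Store all the $L - n + 1$ encountered words
* Calculate $H_n = -\sum_i \frac{n_i}{L - n + 1} \log_2 \frac{n_i}{L - n + 1}$, where $n_i$ is the number of occurrences of the i-th word
Of course, in order to achieve statistical significance, one must make sure that the number of possible words is much smaller than the number of observed words, $l^n \ll L$.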
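The sliding-window estimate can be sketched as below. The estimator assumed here is the plain plug-in one, with each word probability replaced by its observed frequency $n_i / (L - n + 1)$; the function name and the toy periodic sequence are illustrative.

```python
from collections import Counter
from math import log2


def ngram_entropy(sequence, n):
    """Plug-in estimate of the n-gram entropy in bits from overlapping words."""
    total = len(sequence) - n + 1  # number of windows of length n
    counts = Counter(sequence[i:i + n] for i in range(total))
    return -sum((c / total) * log2(c / total) for c in counts.values())


# A strictly periodic toy sequence: each base is fully determined by the previous one.
sequence = "ACGT" * 50

for n in (1, 2, 3):
    # Compare the estimate with the maximum-disorder value n * log2(4).
    print(n, ngram_entropy(sequence, n), n * log2(4))
```

In this toy case the single-symbol entropy is maximal while the entropies for n > 1 stay close to 2 bits instead of growing linearly, which is the kind of deviation the slides attribute to constraints in the sequence.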

n-gram entropy measures in human chromosome 1
Two-symbol alphabet (purines = {A,G}, pyrimidines = {C,T}). Maximum n-gram entropy (in bits): $S_{max} = n \log_2 2 = n$
[Figure: n-gram entropy for coding and for non-coding intergenic regions]

Redundancy
The flexibility of a language can be quantified in terms of its redundancy, i.e. the limit relative deviation from the random case:
$R = \lim_{n \to \infty} \left( 1 - \frac{H_n}{n \log_2 l} \right)$
The greater the redundancy, the farther the symbolic structure under study is from the random case, i.e. the more constraints (grammar) are present.

Redundancy of DNA sequences
R.N. Mantegna et al, PRE 52, 2939 (1995)
[Figure: redundancy of non-coding regions and of coding regions]

Correlations part II: mutual information
The traditional methods for studying correlations are not invariant under change of the mapping rule. Moreover, correlation functions measure linear correlations by definition.
The mutual information generalises the measure of correlations between two symbols (bases) separated by a distance k in the sequence.

The thermodynamic analogy
Two subsystems X and Y form a joint system [X, Y].
* Statistical independence: $S(X,Y) = S(X) + S(Y)$
* Statistical dependence: $S(X,Y) < S(X) + S(Y)$

The mutual information can be defined as the distance in “entropy space” between statistical independence and dependence:
$I(X,Y) = S(X) + S(Y) - S(X,Y) = k_B \sum_{i,j} P_{ij} \ln \frac{P_{ij}}{P_i P_j}$
If $k_B$ is replaced by $1/\ln 2$, then I(X,Y) quantifies the amount of mutual information between X and Y in bits.

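Mutual information in DNA sequences
Take Y = $n_i$ and X = $n_j$, two nucleotides separated by a distance k in the sequence, with $n_i, n_j \in \{A,C,T,G\}$.
$P_{ij}(k)$: joint probability of finding nucleotides $n_i$ and $n_j$ spaced by k in the sequence.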
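A direct estimate of the mutual information at distance k can be sketched as follows. It assumes the form $I(k) = \sum_{i,j} P_{ij}(k) \log_2 \frac{P_{ij}(k)}{P_i P_j}$ in bits, with all probabilities replaced by observed frequencies and the marginals taken from the same set of pairs; that estimator choice, the function name and the toy sequences are illustrative and not taken from the slides.

```python
from collections import Counter
from math import log2


def mutual_information(sequence, k):
    """Estimate I(k) in bits between symbols separated by k positions."""
    pairs = [(sequence[i], sequence[i + k]) for i in range(len(sequence) - k)]
    total = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)    # marginal counts of the first symbol
    second = Counter(b for _, b in pairs)   # marginal counts of the second symbol
    info = 0.0
    for (a, b), count in joint.items():
        # P_ij / (P_i * P_j) written in terms of raw counts.
        ratio = count * total / (first[a] * second[b])
        info += (count / total) * log2(ratio)
    return info


# In a period-3 toy sequence, the base at distance 3 is fully determined.
print(mutual_information("ACG" * 100, 3))

# A constant sequence carries no information at any distance.
print(mutual_information("A" * 50, 5))
```

Because the estimate works on symbols directly, it does not depend on any numerical mapping rule, which is the advantage the slides point out over correlation functions.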

Mutual information in human DNA sequences
I. Grosse et al, PRE 61, 5624 (2000)
Mutual information of human coding and non-coding sequences from GenBank, calculated by averaging over fragments of length 500 bp.
The mutual information clearly distinguishes between coding and non-coding regions.
[Figure label: period 3 bp]