Basic Molecular Biology Many slides by Omkar Deshpande.

Slides:



Advertisements
Similar presentations
11.1 Genes are made of DNA.
Advertisements

Disease-causing bacteria (smooth colonies) Harmless bacteria (rough colonies) Heat-killed, disease- causing bacteria (smooth colonies) Control (no growth)
Basic Molecular Biology for CS374 Scientific Method: The widely held philosophy that a theory can never be proved, only disproved, and that all attempts.
Basic Molecular Biology Many slides by Omkar Deshpande.
Basic Molecular Biology for CS262 Omkar Deshpande.
Canadian Bioinformatics Workshops
Molecular Biology Chapter 10.
Connection Between Alignment and HMMs. A state model for alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Basic Biology for CS262 OMKAR DESHPANDE (TA) Overview Structures of biomolecules How does DNA function? What is a gene? How are genes regulated?
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
DNA Sequencing and Assembly
LECTURE 5: DNA, RNA & PROTEINS
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
DNA Sequencing.
DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.
Conditional Random Fields 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2.
DNA Sequencing. CS262 Lecture 9, Win06, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
CS273a Lecture 1, Autumn 10, Batzoglou DNA Sequencing.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
Goals of the Human Genome Project determine the entire sequence of human DNA identify all the genes in human DNA store this information in databases improve.
Informatics for next-generation sequence analysis – SNP calling Gabor T. Marth Boston College Biology Department PSB 2008 January
Transcription and Translation
1. Important Features a. DNA contains genetic template" for proteins.
FROM GENE TO PROTEIN: TRANSCRIPTION & RNA PROCESSING Chapter 17.
From DNA to Proteins Lesson 1. Lesson Objectives State the central dogma of molecular biology. Describe the structure of RNA, and identify the three main.
Elements of Molecular Biology All living things are made of cells All living things are made of cells Prokaryote, Eukaryote Prokaryote, Eukaryote.
Lesson Overview 13.1 RNA.
Protein Synthesis: Transcription
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
Chapter 10 Table of Contents Section 1 Discovery of DNA
Replication, Transcription and Translation
FROM DNA TO PROTEIN Transcription – Translation We will use:
Gene Expression and Gene Regulation. The Link between Genes and Proteins At the beginning of the 20 th century, Garrod proposed: – Genetic disorders such.
RNA and Protein Synthesis
FROM DNA TO PROTEIN Transcription – Translation. I. Overview Although DNA and the genes on it are responsible for inheritance, the day to day operations.
11.1 Genes are made of DNA. Griffith Experiment Viral DNA Background Virus – a package of nucleic DNA wrapped in a protein shell that must use a host.
1 TRANSCRIPTION AND TRANSLATION. 2 Central Dogma of Gene Expression.
Protein Synthesis 6C transcription & translation.
12-3 RNA and Protein Synthesis
Basic Molecular Biology Many slides by Omkar Deshpande.
Gene expression. The information encoded in a gene is converted into a protein  The genetic information is made available to the cell Phases of gene.
Chapter 17 From Gene to Protein. 2 DNA contains the genes that make us who we are. The characteristics we have are the result of the proteins our cells.
Chapter 12 DNA, RNA, Gene function, Gene regulation, and Biotechnology.
DNA Structure Chapter 10.
RNA, transcription & translation Unit 1 – Human Cells.
CHAPTER 13 RNA and Protein Synthesis. Differences between DNA and RNA  Sugar = Deoxyribose  Double stranded  Bases  Cytosine  Guanine  Adenine 
DNA Sequencing.
Replication, Transcription and Translation. Griffith’s Experiment.
RNA and Gene Expression BIO 224 Intro to Molecular and Cell Biology.
Transcription and Translation. Central Dogma of Molecular Biology  The flow of information in the cell starts at DNA, which replicates to form more DNA.
PROTEIN SYNTHESIS DECEMBER 13, 2010 CAPE BIOLOGY UNIT 1 MRS. HAUGHTON.
Higher Human Biology Unit 1 Human Cells KEY AREA 3: Gene Expression.
Gene Expression DNA, RNA, and Protein Synthesis. Gene Expression Genes contain messages that determine traits. The process of expressing those genes includes.
12-3 RNA and Protein Synthesis Page 300. A. Introduction 1. Chromosomes are a threadlike structure of nucleic acids and protein found in the nucleus of.
Chapter – 10 Part II Molecular Biology of the Gene - Genetic Transcription and Translation.
From DNA to Proteins Lesson 1.
Human Cells Gene Expression
Transcription and Translation
Transcription and Translation
Chapter 14.
Transcription and Translation
Central Dogma Central Dogma categorized by: DNA Replication Transcription Translation From that, we find the flow of.
Transcription and Translation
12-3 RNA and Protein Synthesis
Basic Molecular Biology
LECTURE 5: DNA, RNA & PROTEINS
Presentation transcript:

Basic Molecular Biology Many slides by Omkar Deshpande

Overview Structures of biomolecules Central Dogma of Molecular Biology Overview of this course Genome Sequencing

Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Watson and Crick

Macromolecule (Polymer) Monomer DNADeoxyribonucleotides (dNTP) RNARibonucleotides (NTP) Protein or PolypeptideAmino Acid

Nucleic acids (DNA and RNA) Form the genetic material of all living organisms. Found mainly in the nucleus of a cell (hence “nucleic”) Contain phosphoric acid as a component (hence “acid”) They are made up of nucleotides.

Nucleotides Phosphate Group Sugar Nitrogenous Base Phosphate Group Sugar Nitrogenous Base

T C A C T G G C G A G T C A G C DNA A = T G = C

The gene and the genome Genome = The entire DNA sequence within the nucleus. The information in the genome is used for protein synthesis A gene is a length of DNA that codes for a (single) protein.

How big are genomes? OrganismGenome Size (Bases)Estimated Genes Human (Homo sapiens)3 billion20,000 Laboratory mouse (M. musculus) 2.6 billion20,000 Mustard weed (A. thaliana)100 million18,000 Roundworm (C. elegans)97 million16,000 Fruit fly (D. melanogaster)137 million12,000 Yeast (S. cerevisiae)12.1 million5,000 Bacterium (E. coli) 4.6 million3,200 Human immunodeficiency virus (HIV) 97009

Repeats The DNA is full of repetitive elements (those that occur over & over & over) There are several type of repeats, including SINEs & LINEs (Short & Long Interspersed Elements) (1 million just ALUs) and low complexity elements. Their function is poorly understood, but they make problems more difficult.

Central dogma DNA tRNA rRNA snRNA mRNA transcription translation POLYPEPTIDE ZOOM IN

Transcription The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus.

T C A C T G G C G A G T C A G C G A G U C A G C DNARNA A = T G = C T  U

More complexity The RNA message is sometimes “edited”. Exons are nucleotide segments whose codons will be expressed. Introns are intervening segments (genetic gibberish) that are snipped out. Exons are spliced together to form mRNA.

Splicing frgjjthissentencehjfmkcontainsjunkelm thissentencecontainsjunk

Key player: RNA polymerase It is the enzyme that brings about transcription by going down the line, pairing mRNA nucleotides with their DNA counterparts.

Promoters Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation. The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated. 5’ Promoter 3’

Promoters Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation. The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated. 5’ Promoter 3’

Transcription – key steps Initiation Elongation Termination + DNA RNA DNA

Transcription – key steps Initiation Elongation Termination DNA

Transcription – key steps Initiation Elongation Termination DNA

Transcription – key steps Initiation Elongation Termination DNA

Transcription – key steps Initiation Elongation Termination + DNA RNA DNA

Genes can be switched on/off In an adult multicellular organism, there is a wide variety of cell types seen in the adult. eg, muscle, nerve and blood cells. The different cell types contain the same DNA though. This differentiation arises because different cell types express different genes. Promoters are one type of gene regulators

Transcription (recap) The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus. Its destination is a molecular workbench in the cytoplasm, a structure called a ribosome.

Translation How do I interpret the information carried by mRNA to the Ribosome? Think of the sequence as a sequence of “triplets”. Think of AUGCCGGGAGUAUAG as AUG- CCG-GGA-GUA-UAG. Each triplet (codon) maps to an amino acid.

The Genetic Code f : codon amino acid 1968 Nobel Prize in medicine – Nirenberg and Khorana Important – The genetic code is universal! It is also redundant / degenerate.

The Genetic Code

Composed of a chain of amino acids. R | H 2 N--C--COOH | H Proteins 20 possible groups

R R | | H 2 N--C--COOH H 2 N--C--COOH | | H H Proteins

Dipeptide R O R | II | H 2 N--C--C--NH--C--COOH | | H H This is a peptide bond

Protein structure Linear sequence of amino acids folds to form a complex 3-D structure. The structure of a protein is intimately connected to its function. The 3-D shape of proteins gives them their working ability – the ability to bind with other molecules.

Our course (2417) DNA rRNA snRNA mRNA transcription translation POLYPEPTIDE Part 1, DNA: Assembly, Evolution, Alignment Part 2, Genes: Prediction, Regulation

DNA Sequencing Some slides shamelessly stolen from Serafim Batzoglou

DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

Which representative of the species? Which human? Answer one: Answer two: it doesn’t matter Polymorphism rate: number of letter changes between two different members of a species Humans: ~1/1,000 – 1/10,000 Other organisms have much higher polymorphism rates

Why humans are so similar A small population that interbred reduced the genetic variation Out of Africa ~ 40,000 years ago Out of Africa

Migration of human variation

Migration of human variation

Migration of human variation

DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~500 letters at a time

DNA sequencing – vectors + = DNA Shake DNA fragments Vector Circular genome (bacterium, plasmid) Known location (restriction site)

DNA sequencing – gel electrophoresis 1. Start at primer(restriction site) 2. Grow DNA chain 3. Include dideoxynucleoside (modified a, c, g, t) 4. Stops reaction at all possible points 5. Separate products with length, using gel electrophoresis

Electrophoresis diagrams

Challenging to read answer

Reading an electropherogram 1. Filtering 2. Smoothening 3. Correction for length compressions 4. A method for calling the letters – PHRED PHRED – PHil’s Read EDitor (by Phil Green) Based on dynamic programming Several better methods exist, but labs are reluctant to change

Output of PHRED: a read A read: nucleotides A C G A A T C A G …A …21 Quality scores: -10*log 10 Prob(Error) Reads can be obtained from leftmost, rightmost ends of the insert Double-barreled sequencing: Both leftmost & rightmost ends are sequenced

Method to sequence longer regions cut many times at random (Shotgun) genomic segment Get two reads from each segment ~500 bp

Cover region with ~7-fold redundancy (7X) Overlap reads and extend to reconstruct the original genomic region reads Reconstructing The Sequence

Definition of Coverage Length of genomic segment:L Number of reads: n Length of each read:l Definition: Coverage C = n l / L How much coverage is enough? Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C

Challenges with Fragment Assembly Sequencing errors ~1-2% of bases are wrong Repeats Computation: ~ O( N 2 ) where N = # reads false overlap due to repeat

Repeats Bacterial genomes:5% Mammals:50% Repeat types: Low-Complexity DNA (e.g. ATATATATACATA…) Microsatellite repeats (a 1 …a k ) N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons SINE (Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 10 6 copies LINE (Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies LTR retroposons (Long Terminal Repeats (~700 bp) at each end) cousins of HIV Gene Families genes duplicate & then diverge (paralogs) Recent duplications ~100,000-long, very similar copies

Hierarchical Sequencing

Hierarchical Sequencing Strategy 1. Obtain a large collection of BAC clones 2. Map them onto the genome (Physical Mapping) 3. Select a minimum tiling path 4. Sequence each clone in the path with shotgun 5. Assemble 6. Put everything together a BAC clone map genome

Methods of physical mapping Goal: Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence Methods: Hybridization Digestion

1. Hybridization Short words, the probes, attach to complementary words 1. Construct many probes 2. Treat each BAC with all probes 3. Record which ones attach to it 4. Same words attaching to BACS X, Y  overlap p1p1 pnpn

2.Digestion Restriction enzymes cut DNA where specific words appear 1. Cut each clone separately with an enzyme 2. Run fragments on a gel and measure length 3. Clones C a, C b have fragments of length { l i, l j, l k }  overlap Double digestion: Cut with enzyme A, enzyme B, then enzymes A + B

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing cut many times at random genome forward-reverse paired reads plasmids (2 – 10 Kbp) cosmids (40 Kbp) known dist ~500 bp

History of DNA Sequencing Avery: Proposes DNA as ‘Genetic Material’ Watson & Crick: Double Helix Structure of DNA Holley: Sequences Yeast tRNA Ala Miescher: Discovers DNA Wu: Sequences Cohesive End DNA Sanger: Dideoxy Chain Termination Gilbert: Chemical Degradation Messing: M13 Cloning Hood et al.: Partial Automation Cycle Sequencing Improved Sequencing Enzymes Improved Fluorescent Detection Schemes 1986 Next Generation Sequencing Improved enzymes and chemistry New image processing Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) ,000 25,000 1, ,000 50,000,000 Efficiency (bp/person/year) 15, ,000,000,

Read length and throughput read length bases per machine run 10 bp1,000 bp100 bp 100 Mb 10 Mb 1Mb 1Gb ABI capillary sequencer 454 pyrosequencer ( Mb in bp reads) Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in bp reads) NGS Slides courtesy of Gabor Marth

Church, 2005 Sequencing chemistries DNA base extension DNA ligation

Massively parallel sequencing Church, 2005

Features of NGS data Short sequence reads – bp: 454 (Roche) –35-120bp Solexa(Illumina), SOLiD(AB) Huge amount of sequence per run –Gigabases per run Huge number of reads per run –Up to billions Higher error (compared with Sanger) –Different error profile

Current and future application areas De novo genome sequencing Short-read sequencing will be (at least) an alternative to microarrays for: DNA-protein interaction analysis (CHiP-Seq) novel transcript discovery quantification of gene expression epigenetic analysis (methylation profiling) Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery DEL SNP reference genome

What can we use them for? SANGER454SolexaAB SOLiD De novo assembly Mammal (3*10 9 ) Bacteria, Yeast BacteriaBacteria? SNP Discovery Yes 90% of human Larger eventsYes Transcript profiling (rare) NoMaybeYes

Computer scientists vs Biologists Nothing is ever completely true or false in Biology. Everything is either true or false in computer science.

Next Gen: Raw Data Machine Readouts are different Read length, accuracy, and error profiles are variable. All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improves

Current and future application areas De novo genome sequencing Short-read sequencing will be (at least) an alternative to microarrays for: DNA-protein interaction analysis (CHiP-Seq) novel transcript discovery quantification of gene expression epigenetic analysis (methylation profiling) Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery DEL SNP reference genome

Fundamental informatics challenges 1. Interpreting machine readouts – base calling, base error estimation 2. Data visualization 3. Data storage & management

Informatics challenges (cont’d) 4. SNP and short INDEL, and structural variation discovery 6. Data storage & management 5. De novo Assembly

3’3’ 5’ N N N T G z z z 3’3’ 5’ N N N G A z z z 3’3’ 5’ N N N A T z z z 2-base, 4-color: 16 probe combinations ●4 dyes to encode 16 2-base combinations ●Detect a single color indicates 4 combinations & eliminates 12 ●Each color reflects position, not the base call ●Each base is interrogated by two probes ●Dual interrogation eases discrimination –errors (random or systematic) vs. SNPs (true polymorphisms) ACGT A C G T 2 nd Base 1 st Base AB SOLiD System dibase sequencing

The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of two bases is known. ACGT A C G T 2 nd Base 1 st Base AA AC AC AA AG AT AA AG AG CC CA CA CC CT CG CC CT CT GG GT GT GG GA GC GG GA GA TT TG TG TT TC TA TT TC TC A A C A A G C C T C C C A C C T A A G A G G T G G A T T C T T T G T T C G G A G Possible Sequences Converting colors into letters

A C G G T C G T C G T G T G C G T No change A C G G T C G C C G T G T G C G T SNP A C G G T C G T C G T G T G C G T Measurement error SOLiD error checking code

Comparison of the technologies SANGER454SolexaAB SOLiD OutputSequenceFlowgramSequenceColors Read Length Error rate2%3% (indels)1%4% or 0.06% Mb per run Cost per Mb$1000$50$0.15$0.05 Paired?YesSort ofYes (<1k)Yes (<10k)