Basic Molecular Biology Many slides by Omkar Deshpande.

2 Basic Molecular Biology Many slides by Omkar Deshpande

3 Overview: Structures of biomolecules; Central Dogma of Molecular Biology; Overview of this course; Genome Sequencing

4 Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

6 Watson and Crick

8 Macromolecule (Polymer) | Monomer
DNA | Deoxyribonucleotides (dNTP)
RNA | Ribonucleotides (NTP)
Protein or Polypeptide | Amino Acid

9 Nucleic acids (DNA and RNA): Form the genetic material of all living organisms. Found mainly in the nucleus of a cell (hence “nucleic”). Contain phosphoric acid as a component (hence “acid”). They are made up of nucleotides.

10 Nucleotides: each consists of a phosphate group, a sugar, and a nitrogenous base.

11 T C A C T G G C G A G T C A G C DNA A = T G = C
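The pairing rule on this slide (A with T, G with C) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the input strand GAGTCAGC is taken from the letters shown on the slide.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner strand (same left-to-right order)."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


example = "GAGTCAGC"
print(complement(example))          # CTCAGTCG
print(reverse_complement(example))  # GCTGACTC
```

The reversed form matters in practice because the two strands run in opposite directions, so sequence files usually store the partner strand as a reverse complement.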

12 The gene and the genome. Genome = The entire DNA sequence within the nucleus. The information in the genome is used for protein synthesis. A gene is a length of DNA that codes for a (single) protein.

13 How big are genomes?
Organism | Genome Size (Bases) | Estimated Genes
Human (Homo sapiens) | 3 billion | 20,000
Laboratory mouse (M. musculus) | 2.6 billion | 20,000
Mustard weed (A. thaliana) | 100 million | 18,000
Roundworm (C. elegans) | 97 million | 16,000
Fruit fly (D. melanogaster) | 137 million | 12,000
Yeast (S. cerevisiae) | 12.1 million | 5,000
Bacterium (E. coli) | 4.6 million | 3,200
Human immunodeficiency virus (HIV) | 9,700 | 9

14 Repeats. The DNA is full of repetitive elements (those that occur over & over & over). There are several types of repeats, including SINEs & LINEs (Short & Long Interspersed Elements; about 1 million copies of Alu alone) and low-complexity elements. Their function is poorly understood, but they make computational problems on DNA sequences more difficult.

15 Central dogma: DNA → (transcription) → RNA (mRNA, tRNA, rRNA, snRNA); mRNA → (translation) → polypeptide

16 Transcription The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus.

17 DNA: T C A C T G G C G A G T C A G C; RNA: G A G U C A G C. Base pairing: A = T, G = C; in RNA, T → U
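As a rough sketch of what this slide shows, transcription can be modeled as copying the coding strand with T replaced by U, or equivalently as pairing RNA bases against the template strand. The function names are hypothetical, and the strand GAGTCAGC is read off the slide.

```python
# RNA bases that pair with each DNA template base (U replaces T in RNA).
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(coding_strand: str) -> str:
    """mRNA has the same sequence as the coding strand, with T written as U."""
    return coding_strand.replace("T", "U")


def transcribe_from_template(template_strand: str) -> str:
    """Build the mRNA by pairing against the template, base by base.

    The template is assumed to be written in the same left-to-right
    alignment as the coding strand (that is, 3' to 5').
    """
    return "".join(RNA_PAIR[base] for base in template_strand)


print(transcribe("GAGTCAGC"))                # GAGUCAGC
print(transcribe_from_template("CTCAGTCG"))  # GAGUCAGC
```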

18 More complexity The RNA message is sometimes “edited”. Exons are nucleotide segments whose codons will be expressed. Introns are intervening segments (genetic gibberish) that are snipped out. Exons are spliced together to form mRNA.

19 Splicing: frgjjthissentencehjfmkcontainsjunkelm → thissentencecontainsjunk
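The toy sentence on this slide can be spliced in code. This is a minimal sketch that assumes the exon boundaries are already known; the coordinates below were chosen by hand for this string and the function name is made up.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices (start, end) in order, dropping everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
# Hypothetical exon coordinates (0-based, end-exclusive) for the toy string.
exons = [(5, 17), (22, 34)]

print(splice(pre_mrna, exons))  # thissentencecontainsjunk
```

Finding those boundaries in real DNA is the hard part; the joining step itself is trivial.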

20 Key player: RNA polymerase It is the enzyme that brings about transcription by going down the line, pairing mRNA nucleotides with their DNA counterparts.

21 Promoters Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation. The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated. 5’ Promoter 3’

23 Transcription – key steps Initiation Elongation Termination + DNA RNA DNA

28 Genes can be switched on/off. An adult multicellular organism contains a wide variety of cell types, e.g., muscle, nerve, and blood cells, yet the different cell types contain the same DNA. This differentiation arises because different cell types express different genes. Promoters are one type of gene regulator.

29 Transcription (recap) The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus. Its destination is a molecular workbench in the cytoplasm, a structure called a ribosome.

30 Translation. How do I interpret the information carried by mRNA to the ribosome? Think of the sequence as a sequence of “triplets”. Think of AUGCCGGGAGUAUAG as AUG-CCG-GGA-GUA-UAG. Each triplet (codon) maps to an amino acid.

31 The Genetic Code. f : codon → amino acid. 1968 Nobel Prize in medicine – Nirenberg and Khorana. Important – The genetic code is universal! It is also redundant / degenerate.
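The mapping f : codon → amino acid can be written as a lookup table. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, `*` for stop, codons ordered with bases U, C, A, G) and translates the example mRNA from the translation slide. The function name and the choice to stop at the first stop codon are assumptions made for this illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCCGGGAGUAUAG"))  # MPGV (Met-Pro-Gly-Val, then stop)
```

The table also shows the redundancy mentioned on the slide: 64 codons map to only 20 amino acids plus stop, so for example CCU, CCC, CCA, and CCG all give proline.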

32 The Genetic Code

33 Proteins: composed of a chain of amino acids. General amino acid structure: H2N-CH(R)-COOH, with 20 possible R groups.

34 Proteins: two amino acids side by side, H2N-CH(R)-COOH and H2N-CH(R)-COOH

35 Dipeptide: H2N-CH(R)-CO-NH-CH(R)-COOH. The CO-NH link is a peptide bond.

36 Protein structure Linear sequence of amino acids folds to form a complex 3-D structure. The structure of a protein is intimately connected to its function. The 3-D shape of proteins gives them their working ability – the ability to bind with other molecules.

37 Our course (2417): DNA → (transcription) → RNA (mRNA, rRNA, snRNA); mRNA → (translation) → polypeptide. Part 1, DNA: Assembly, Evolution, Alignment. Part 2, Genes: Prediction, Regulation. Part 3, Interactions.

38 DNA Sequencing Some slides shamelessly stolen from Serafim Batzoglou

39 DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

40 Which representative of the species? Which human? Answer one: Answer two: it doesn’t matter. Polymorphism rate: number of letter changes between two different members of a species. Humans: ~1/1,000 – 1/10,000. Other organisms have much higher polymorphism rates.
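The polymorphism rate defined here can be sketched as the fraction of positions that differ between two individuals. This assumes the two sequences are already aligned and of equal length, which is a simplification; the function name and the second sequence are invented for illustration.

```python
def polymorphism_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


individual_a = "ACGTGACTGAGGACCGTG"
individual_b = "ACGTGACTGAGGACCGTA"  # hypothetical: differs only at the last letter
print(f"{polymorphism_rate(individual_a, individual_b):.3f}")  # 1 difference in 18 letters
```

At the human rate quoted on the slide, two 18-letter stretches would almost always be identical; differences only show up over thousands of letters.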

41 Why humans are so similar: a small population that interbred reduced the genetic variation. Out of Africa ~40,000 years ago.

42 Migration of human variation http://info.med.yale.edu/genetics/kkidd/point.html

45 DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~500 letters at a time

46 DNA sequencing – vectors [Figure: DNA is shaken into fragments; DNA fragment + vector = vector carrying the insert. Vector: circular genome (bacterium, plasmid), cut at a known location (restriction site)]

47 Different types of vectors
Vector | Size of insert (bp)
Plasmid | 2,000-10,000 (can control the size)
Cosmid | 40,000
BAC (Bacterial Artificial Chromosome) | 70,000-300,000
YAC (Yeast Artificial Chromosome) | > 300,000 (not used much recently)

48 DNA sequencing – gel electrophoresis 1. Start at primer (restriction site) 2. Grow DNA chain 3. Include dideoxynucleotides (modified A, C, G, T) 4. The reaction stops at all possible points 5. Separate products by length, using gel electrophoresis

49 Electrophoresis diagrams

50 Challenging to read answer

53 Reading an electropherogram 1. Filtering 2. Smoothing 3. Correction for length compressions 4. A method for calling the letters – PHRED. PHRED – PHil’s Read EDitor (by Phil Green). Based on dynamic programming. Several better methods exist, but labs are reluctant to change.

54 Output of PHRED: a read. A read: 500-700 nucleotides, e.g. A C G A A T C A G … A with quality scores 16 18 21 23 25 15 28 30 32 … 21. Quality scores: -10 * log10 Prob(Error). Reads can be obtained from the leftmost and rightmost ends of the insert. Double-barreled sequencing: both leftmost & rightmost ends are sequenced.
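The quality-score formula on this slide, Q = -10 * log10 Prob(Error), is easy to invert. The sketch below converts in both directions; the function names are chosen for this example, and the scores are the ones shown on the slide.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse of the formula above: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Quality scores from the slide's example read.
for q in [16, 18, 21, 23, 25, 15, 28, 30, 32]:
    print(f"Q{q}: P(error) = {error_probability(q):.4f}")
```

As a rule of thumb, Q20 corresponds to a 1% chance that the base call is wrong and Q30 to 0.1%; every 10 points is another factor of ten.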

55 Read length and throughput [Figure: read length (10 bp to 1,000 bp) versus bases per machine run (1 Mb to 1 Gb)]. ABI capillary sequencer; 454 pyrosequencer (20-100 Mb in 100-250 bp reads); Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads). NGS slides courtesy of Gabor Marth

56 Church, 2005 Sequencing chemistries DNA base extension DNA ligation

57 Massively parallel sequencing Church, 2005

58 Features of NGS data
Short sequence reads
– 100-200 bp: 454 (Roche)
– 35-70 bp: Solexa (Illumina), SOLiD (AB)
Huge amount of sequence per run
– Up to gigabases per run
Huge number of reads per run
– Up to 100’s of millions
Higher error (compared with Sanger)
– Different error profile

59 Next Gen: Raw Data. Machine readouts are different. Read length, accuracy, and error profiles are variable. All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improve.

60 AB SOLiD System dibase sequencing
2-base, 4-color: 16 probe combinations
[Figure: three example probes, N N N T G z z z, N N N G A z z z, and N N N A T z z z, each drawn between 3’ and 5’ ends]
●4 dyes encode 16 2-base combinations
●Detecting a single color indicates 4 combinations & eliminates 12
●Each color reflects position, not the base call
●Each base is interrogated by two probes
●Dual interrogation eases discrimination between errors (random or systematic) and SNPs (true polymorphisms)
[Figure: color matrix indexed by 1st base (A, C, G, T) and 2nd base (A, C, G, T), with colors 0-3]

61 Converting colors into letters
The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of the two bases is known.
[Figure: color matrix indexed by 1st base (A, C, G, T) and 2nd base (A, C, G, T), with colors 0-3]
Color readout: 1 00 1 23022
Candidate dibases per position:
AA AC AC AA AG AT AA AG AG
CC CA CA CC CT CG CC CT CT
GG GT GT GG GA GC GG GA GA
TT TG TG TT TC TA TT TC TC
4 possible sequences:
A A C A A G C C T C
C C A C C T A A G A
G G T G G A T T C T
T T G T T C G G A G
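Color-space decoding can be sketched as follows. The sketch assumes the usual two-bit encoding in which A, C, G, T are numbered 0 to 3 and the color of a dibase is the XOR of the two numbers, so that identical bases (AA, CC, GG, TT) share one color. That numbering is an assumption of this example, not something the slide spells out. Encoding the slide's first candidate sequence and then decoding from each possible first base reproduces the four candidate sequences listed above.

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode(sequence: str) -> list[int]:
    """One color per adjacent pair of bases (assumed XOR scheme)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode(first_base: str, colors: list[int]) -> str:
    """Rebuild the bases from the colors, given the first base."""
    sequence = [first_base]
    for color in colors:
        sequence.append(BASES[CODE[sequence[-1]] ^ color])
    return "".join(sequence)


colors = encode("AACAAGCCTC")
for first_base in BASES:
    print(first_base, decode(first_base, colors))
```

Without a known first base, all four decodings are equally consistent with the colors, which is why one known base is needed to anchor the read.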

62 SOLiD error checking code
No change: A C G G T C G T C G T G T G C G T
SNP: A C G G T C G C C G T G T G C G T
Measurement error: A C G G T C G T C G T G T G C G T
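One way to see why dibase encoding helps separate SNPs from measurement errors: a single changed base alters two adjacent colors, while a single misread color alters only one. The sketch below is a simplified classifier under the assumed XOR color scheme (A=0, C=1, G=2, T=3); the function names and labels are hypothetical, and real pipelines apply more checks than this.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence: str) -> list[int]:
    """One color per adjacent pair of bases (assumed XOR scheme)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(sequence, sequence[1:])]


def classify(ref_colors: list[int], read_colors: list[int]) -> str:
    """Label a read by how its colors differ from the reference colors."""
    diffs = [i for i, (r, o) in enumerate(zip(ref_colors, read_colors)) if r != o]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"  # an isolated color change cannot be a SNP
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        # A real SNP keeps the flanking bases, so the XOR of the two colors is unchanged.
        if ref_colors[i] ^ ref_colors[i + 1] == read_colors[i] ^ read_colors[i + 1]:
            return "candidate SNP"
    return "other"


reference = encode("ACGGTCGTCGTGTGCGT")
snp_read = encode("ACGGTCGCCGTGTGCGT")  # the slide's SNP row: one T changed to C
error_read = list(reference)
error_read[6] = (error_read[6] + 1) % 4  # hypothetical single misread color

print(classify(reference, reference))   # no change
print(classify(reference, snp_read))    # candidate SNP
print(classify(reference, error_read))  # measurement error
```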

63 Current and future application areas. De novo genome sequencing. Short-read sequencing will be (at least) an alternative to microarrays for: DNA-protein interaction analysis (ChIP-Seq), novel transcript discovery, quantification of gene expression, epigenetic analysis (methylation profiling). Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery. [Figure labels: DEL, SNP, reference genome]

64 Fundamental informatics challenges 1. Interpreting machine readouts – base calling, base error estimation 2. Data visualization 3. Data storage & management

65 Informatics challenges (cont’d) 4. SNP, short INDEL, and structural variation discovery 5. De novo assembly 6. Data storage & management

66 Comparison of the technologies
 | Sanger | 454 | Solexa | AB SOLiD
Output | Sequence | Flowgram | Sequence | Colors
Read Length | 500-700 | 250-500 | 35-70 | 35-50
Error rate | 2% | 3% (indels) | 1% | 4% or 0.06%
Mb per run | 0.8 | 20 | 1000 | 4000
Cost per Mb | $1000 | $50 | $0.35 | $0.10
Paired? | Yes | Sort of | Yes (<1k) | Yes (<10k)
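To get a feel for these numbers, the sketch below works out how many machine runs and how much money it would take to read 3,000 Mb (roughly one human genome, using the 3 billion bases quoted earlier in the deck) exactly once. Single (1x) coverage is an assumption made only to keep the arithmetic simple; real projects sequence each base many times over. The function names are made up for this example.

```python
import math

GENOME_MB = 3000  # about 3 billion bases, read once (1x coverage is an assumption)

# "Mb per run" and "Cost per Mb" rows from the comparison slide.
TECHNOLOGIES = {
    "Sanger":   {"mb_per_run": 0.8,  "cost_per_mb": 1000.0},
    "454":      {"mb_per_run": 20,   "cost_per_mb": 50.0},
    "Solexa":   {"mb_per_run": 1000, "cost_per_mb": 0.35},
    "AB SOLiD": {"mb_per_run": 4000, "cost_per_mb": 0.10},
}


def runs_needed(target_mb: float, mb_per_run: float) -> int:
    """Whole machine runs required to produce target_mb of sequence."""
    return math.ceil(target_mb / mb_per_run)


def sequencing_cost(target_mb: float, cost_per_mb: float) -> float:
    """Total cost in dollars at the quoted per-megabase price."""
    return target_mb * cost_per_mb


for name, spec in TECHNOLOGIES.items():
    runs = runs_needed(GENOME_MB, spec["mb_per_run"])
    cost = sequencing_cost(GENOME_MB, spec["cost_per_mb"])
    print(f"{name:9s} runs: {runs:5d}  cost: ${cost:,.0f}")
```

The gap spans about four orders of magnitude in cost, which is the main point of the comparison.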

67 What can we use them for?
Columns: Sanger | 454 | Solexa | AB SOLiD
De novo assembly: Mammal (3*10^9) | Bacteria, Yeast | Bacteria | Bacteria?
SNP Discovery: Yes; 90% of human; Larger events; Yes
Transcript profiling (rare): No; Maybe; Yes

68 Computer scientists vs Biologists Nothing is ever completely true or false in Biology. Everything is either true or false in computer science.

