CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology.

2 CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

3 Course description A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint. Prerequisite: –Programming experience –Strong background in algorithms and data structures –Basic understanding of statistics and probability –Appetite to learn some biology For other information, check course website

4 Why bioinformatics The advance of experimental technology has resulted in a huge amount of data –The human genome is “finished” –Even if it were, that’s only the beginning… The bottleneck is how to integrate and analyze the data –Noisy –Diverse

5 Growth of GenBank vs Moore’s law

6 Genome annotations Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

7 What is bioinformatics National Institutes of Health (NIH): –Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

8 What is bioinformatics National Center for Biotechnology Information (NCBI): –the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

9 Bioinformatics draws on: Biology, Molecular Biology, Medicine, Physics, Computer Science, Informatics, Mathematics, Statistics, Chemistry

10 Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)

11 Biologists vs computer scientists (almost) Everything is true or false in computer science (almost) Nothing is ever true or false in Biology

12 Biologists vs computer scientists Biologists seek to understand the complicated, messy natural world Computer scientists strive to build their own clean and organized virtual world

13 Biologists vs computer scientists Computer scientists are obsessed with being the first to invent or prove something Biologists are obsessed with being the first to discover something

14 Some examples of central role of CS in bioinformatics

15 1. Genome sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT Genome: 3x10^9 nucleotides. Each sequenced piece: ~500 nucleotides

16 1. Genome sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10^9 nucleotides Computational Fragment Assembly: introduced ~1980. 1995: assemble up to 1,000,000 long DNA pieces. 2000: assemble whole human genome. A big puzzle: ~60 million pieces
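To make the "big puzzle" concrete, here is a minimal sketch of greedy overlap-based assembly. It is a toy illustration only, not the method used by real assemblers: the function names, the `min_len` threshold, and the error-free reads are all assumptions made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # no overlaps left: return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three overlapping pieces cut from the first 22 letters shown on the slide.
pieces = ["AGTAGCACAG", "ACAGACTACG", "TACGACGAGA"]
contigs = greedy_assemble(pieces)
```

The pairwise scan is quadratic in the number of reads, which is one reason this naive approach would not scale to tens of millions of pieces.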

17 Where are the genes? 2. Gene Finding In humans: ~22,000 genes ~1.5% of human DNA

18 2. Gene Finding Gene structure, 5’ to 3’: Start codon ATG, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Stop codon TAG/TGA/TAA. Splice sites mark the exon/intron boundaries. Hidden Markov Models (well studied for many years in speech recognition)
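As a much simpler stand-in for the Hidden Markov Model approach named on this slide, the sketch below scans one strand for stretches that begin with ATG and end at the first in-frame stop codon. It ignores introns and splice sites entirely; `find_orfs` and `min_codons` are hypothetical names chosen for this illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches on the given strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGGCTTGTTAGGG"
hits = find_orfs(example)
```

A real gene finder would also search the reverse-complement strand and model exon/intron structure, which is where probabilistic models such as HMMs come in.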

19 3. Protein Folding The amino-acid sequence of a protein determines the 3D fold The 3D fold of a protein determines its function Can we predict the 3D fold of a protein given its amino-acid sequence? –Holy grail of computational biology: a 40-year-old problem –Molecular dynamics, computational geometry, machine learning

20 4. Sequence Comparison: Alignment
Input sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence alignment: introduced ~1970. BLAST: 1990, one of the most cited papers in history. Still a very active area of research. BLAST compares a query against a database (DB), using efficient string matching algorithms and fast database index techniques.
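The idea of scoring an alignment with gaps can be sketched with the classic dynamic-programming recurrence for global alignment (Needleman-Wunsch style). This is not BLAST, which uses heuristics for speed. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


score = global_alignment_score("ACGT", "AGT")
```

Only two rows of the table are kept, so memory is linear in the length of `b`; recovering the alignment itself (not just its score) would require keeping the full table or a traceback.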

21 “…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).” Lipman & Pearson, 1985 Database size today: 10^12 (a 2-million-fold increase). BLAST search: 1.5 minutes

22 5. Microarray data analysis Example: Clinical prediction of Leukemia type 2 types of leukemia –Acute lymphoid (ALL) –Acute myeloid (AML) Different treatments & outcomes Predict type before treatment? Bone marrow samples: ALL vs AML Measure amount of each gene

23 Some goals of biology for the next 50 years List all molecular parts that build an organism –Genes, proteins, other functional parts Understand the function of each part Understand how parts interact physically and functionally Study how function has evolved across all species Find genetic defects that cause diseases Design drugs rationally Sequence the genome of every human, use it for personalized medicine Bioinformatics is an essential component for all the goals above

24 A short introduction to molecular biology

25 Life Two main categories: –Prokaryotes (e.g. bacteria) Unicellular No nucleus –Eukaryotes (e.g. fungi, plant, animal) Unicellular or multicellular Has nucleus

26 Prokaryote vs Eukaryote A eukaryote has many membrane-bounded compartments inside the cell –Different biological processes occur at different cellular locations

27 Organism, Organ, Cell Organism Organ

28 Chemical contents of cell Water Macromolecules (polymers) - “strings” made by linking monomers from a specified set (alphabet) –Protein –DNA –RNA –… Small molecules –Sugar –Ions (Na+, K+, Ca2+, Cl-, …) –Hormone –…

29 DNA DNA: forms the genetic material of all living organisms –Can be replicated and passed to descendants –Contains information to produce proteins To computer scientists, DNA is a string made from alphabet {A, C, G, T} –e.g. ACAGAACGTAGTGCCGTGAGCG Each letter is a nucleotide Length varies from hundreds to billions
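Treating DNA as a string over {A, C, G, T} makes basic statistics a few lines of code. The sketch below validates the alphabet, counts nucleotides, and computes the G+C fraction; the function names are hypothetical and chosen for this example.

```python
from collections import Counter


def nucleotide_counts(dna: str) -> dict[str, int]:
    """Count each of A, C, G, T; reject any other character."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA string, found: {sorted(unexpected)}")
    counts = Counter(dna)
    return {base: counts[base] for base in "ACGT"}


def gc_content(dna: str) -> float:
    """Fraction of letters that are G or C (0.0 for an empty string)."""
    counts = nucleotide_counts(dna)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


counts = nucleotide_counts("ACAGAACGTAGTGCCGTGAGCG")  # the example string from the slide
```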

30 RNA Historically thought to be mainly an information carrier –DNA => RNA => Protein –Very important new roles have been found recently To computer scientists, RNA is a string made from alphabet {A, C, G, U} –e.g. ACAGAACGUAGUGCCGUGAGCG Each letter is a nucleotide Length varies from tens to thousands

31 Protein Protein: the actual “worker” for almost all processes in the cell –Enzymes: speed up reactions –Signaling: information transduction –Structural support –Production of other macromolecules –Transport To computer scientists, protein is a string made from an alphabet of 20 letters –E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP Each letter is called an amino acid Length varies from tens to thousands

32 DNA/RNA zoom-in Commonly referred to as Nucleic Acid DNA: Deoxyribonucleic acid RNA: Ribonucleic acid Found mainly in the nucleus of a cell (hence “nucleic”) Contain phosphoric acid as a component (hence “acid”) They are made up of a string of nucleotides

33 Nucleotides A nucleotide has 3 components –Sugar ring (ribose in RNA, deoxyribose in DNA) –Phosphoric acid –Nitrogen base Adenine (A) Guanine (G) Cytosine (C) Thymine (T) in DNA and Uracil (U) in RNA

34 Units of RNA: ribo-nucleotide A ribonucleotide has 3 components –Sugar - Ribose –Phosphate group –Nitrogen base Adenine (A) Guanine (G) Cytosine (C) Uracil (U)

35 Units of DNA: deoxy-ribo-nucleotide A deoxyribonucleotide has 3 components –Sugar – Deoxy-ribose –Phosphate group –Nitrogen base Adenine (A) Guanine (G) Cytosine (C) Thymine (T)

36 Polymerization: Nucleotides => nucleic acids (Figure: a chain of repeating Phosphate, Sugar, Nitrogen Base units)

37 DNA 5’-AGCGACTG-3’, or simply AGCGACTG (Figure: bases A, G, C, G, A, C, T, G attached to a phosphate-sugar backbone; sugar carbons numbered 1 to 5; free phosphate at the 5 prime end, with the 3 prime end at the other) Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.

38 RNA 5’-AGUGACUG-3’, or simply AGUGACUG (Figure: bases A, G, U, G, A, C, U, G on a phosphate-sugar backbone; free phosphate at the 5 prime end, with the 3 prime end at the other) Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.

39 DNA usually exists in pairs. Base-pair: A = T, G = C Forward (+) strand: 5’-AGCGACTG-3’ Backward (-) strand: 3’-TCGCTGAC-5’ One strand is said to be reverse-complementary to the other

40 DNA double helix G-C pair is stronger than A-T pair

41 Reverse-complementary sequences 5’-ACGTTACAGTA-3’ The reverse complement is: 3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’ Or simply written as TACTGTAACGT
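The reverse complement is a two-step string operation: complement each base, then reverse the result so it again reads 5’ to 3’. A minimal sketch (the function name is an assumption for this example):

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base (A<->T, C<->G), then reverse to read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


result = reverse_complement("ACGTTACAGTA")  # the example from the slide
```

Applying the function twice returns the original string, which is a convenient sanity check.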

42 Orientation of the double helix Double helix is anti-parallel –5’ end of one strand pairs with 3’ end of the other –5’ to 3’ motion in one strand is 3’ to 5’ in the other Double helix has no orientation –Biology has no “forward” and “reverse” strand –Relative to any single strand, there is a “reverse complement” or “reverse strand” –Information can be encoded by either strand or both strands 5’TTTTACAGGACCATG 3’ 3’AAAATGTCCTGGTAC 5’

43 RNA RNAs are normally single-stranded Form complex structure by self-base-pairing A=U, C=G Can also form RNA-DNA and RNA-RNA double strands. –A=T/U, C=G

44 Protein zoom-in Protein is the actual “worker” for almost all processes in the cell A string built from 20 kinds of chars –E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH Each letter is called an amino acid Generic chemical form of amino acid (amino group H2N, side chain R, carboxyl group COOH):
     R
     |
H2N--C--COOH
     |
     H

45 20 amino acids, only differ at side chains –Each can be expressed by three letters –Or a single letter: A-Y, except B, J, O, U, X, Z –Alanine = Ala = A –Histidine = His = H Units of Protein: Amino acid

46 Amino acids => peptide
Two amino acids:
     R                  R
     |                  |
H2N--C--COOH       H2N--C--COOH
     |                  |
     H                  H
Joined by a peptide bond (CO--NH):
     R          R
     |          |
H2N--C--CO--NH--C--COOH
     |          |
     H          H

47 Protein Has orientations Usually recorded from N-terminal to C-terminal Peptide vs protein: basically the same thing Conventions –Peptide is shorter (< 50aa), while protein is longer –Peptide refers to the sequence, while protein has 2D/3D structure (Figure: a chain of side chains R running from H2N at the N-terminal to COOH at the C-terminal)

48 Protein structure Linear sequence of amino acids folds to form a complex 3-D structure. The structure of a protein is intimately connected to its function.

49 Genome and chromosome Genome: the complete DNA sequence in the cell of an organism –May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes Chromosome: a single large DNA molecule in the cell –May be circular or linear –Contains genes as well as “junk DNA” –Highly packed!

50 Formation of chromosome

51 50,000 times shorter than extended DNA The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun
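The "nearly 70 round trips" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes about 0.34 nm per base pair, a diploid cell carrying two copies of the ~3x10^9-nucleotide genome, and roughly 10^13 cells per adult. The cell count in particular is an assumption made for this example (estimates vary), not a number from the slides.

```python
BASE_PAIRS_PER_DIPLOID_CELL = 2 * 3e9  # two copies of a ~3 billion bp genome
METERS_PER_BASE_PAIR = 0.34e-9         # ~0.34 nm rise per base pair
CELLS_PER_ADULT = 1e13                 # ASSUMPTION: rough order of magnitude
EARTH_SUN_DISTANCE_M = 1.496e11        # 1 astronomical unit in meters


def dna_length_per_cell_m() -> float:
    """Length of the fully extended DNA in one diploid cell, in meters."""
    return BASE_PAIRS_PER_DIPLOID_CELL * METERS_PER_BASE_PAIR


def round_trips_to_sun(cells: float = CELLS_PER_ADULT) -> float:
    """Total extended DNA length in the body, in Earth-Sun round trips."""
    total_m = dna_length_per_cell_m() * cells
    return total_m / (2 * EARTH_SUN_DISTANCE_M)


trips = round_trips_to_sun()
```

Under these assumptions each cell holds about 2 m of DNA and the total comes out in the high 60s of round trips, consistent with the slide; a different cell-count estimate would scale the answer proportionally.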

52 Gene Gene: unit of heredity in living organisms –A segment of DNA with information to make a protein or a functional RNA

53 Some statistics
Organism           Chromosomes   Bases         Genes
Human              46            3 billion     20k-25k
Dog                78            2.4 billion   ~20k
Corn               20            2.5 billion   50-60k
Yeast              16            20 million    ~7k
E. coli            1             4 million     ~4k
Marbled lungfish   ?             130 billion   ?

55 Human genome 46 chromosomes: 22 pairs + X + Y 1 from mother, 1 from father Female: X + X Male: X + Y

56 Human genome Every cell contains the same genomic information –Except sperm and eggs, which only contain half of the genome Otherwise your children would have 46 + 46 chromosomes …

57 Cell division: mitosis A cell duplicates its genome and divides into two identical cells These cells build up different parts of your body

58 Cell division: meiosis A reproductive cell divides into four cells, each containing only half of the genome –Diploid => haploid Two haploid cells (sperm + egg) form a zygote –Which will then develop into a multi-cellular organism by mitosis

59 Central dogma of molecular biology DNA replication is critical in both mitosis and meiosis

60 DNA Replication The process of copying a double-stranded DNA molecule –Semi-conservative 5’-ACATGATAA-3’ 3’-TGTACTATT-5’ => 5’-ACATGATAA-3’ 3’-TGTACTATT-5’
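"Semi-conservative" means each daughter molecule keeps one parental strand and pairs it with one newly synthesized strand. A minimal sketch, assuming the top strand is written 5’ to 3’ and the bottom strand 3’ to 5’ as on the slide (function names are hypothetical):

```python
_PAIR = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base complement, keeping the written order."""
    return strand.upper().translate(_PAIR)


def replicate(top: str, bottom: str):
    """Return two daughter duplexes as (top, bottom) pairs.

    Each daughter keeps one parental strand; the other strand is newly
    synthesized as the complement of the parental one.
    """
    new_bottom = complement(top)   # built against the parental top strand
    new_top = complement(bottom)   # built against the parental bottom strand
    return (top, new_bottom), (new_top, bottom)


daughter1, daughter2 = replicate("ACATGATAA", "TGTACTATT")
```

With error-free copying both daughters are identical to the parent duplex, which is what the slide's before/after picture shows.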

61 Mutation: changes in DNA base-pairs Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity (Figure: nucleotide triphosphate (dNTP), with three phosphate groups)

62 Central dogma of molecular biology

63 Transcription The process by which a DNA sequence is copied to produce a complementary RNA –Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein –Called non-coding RNA if the RNA does not carry instructions on how to make a protein –Only consider mRNA for now Similar to replication, but –Only one strand is copied

64 Transcription Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’ Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’ mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’ Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA. DNA-RNA pair: A=U, C=G, T=A, G=C
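The relationship between the two DNA strands and the mRNA can be expressed in two equivalent ways: replace T with U in the coding strand, or base-pair against the template strand. A minimal sketch, with hypothetical function names and the template written 3’ to 5’ as on the slide:

```python
_DNA_TO_RNA_PAIR = str.maketrans("ACGT", "UGCA")  # A=U, C=G, G=C, T=A


def transcribe_from_coding(coding_5to3: str) -> str:
    """mRNA has the coding strand's sequence with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_from_template(template_3to5: str) -> str:
    """Pair each template base with its RNA partner.

    The template is given 3'->5', so the result reads 5'->3'.
    """
    return template_3to5.upper().translate(_DNA_TO_RNA_PAIR)


mrna_a = transcribe_from_coding("ACGTAGACGTATAGAGCCTAG")
mrna_b = transcribe_from_template("TGCATCTGCATATCTCGGATC")
```

Both routes give the same mRNA for the slide's example, which is a useful consistency check on the strand labels.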

65 Translation The process of making proteins from mRNA A gene uniquely encodes a protein There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein How many nucleotides are required to encode an amino acid in order to ensure correct translation? –4^1 = 4 (too few) –4^2 = 16 (too few) –4^3 = 64 (enough) The actual genetic code used by the cell is a triplet. –Each triplet is called a codon
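The counting argument generalizes: find the smallest word length n such that alphabet_size^n covers the number of symbols to encode. A small sketch (the function name is an assumption for this example):

```python
def min_codon_length(alphabet_size: int = 4, symbols_needed: int = 20) -> int:
    """Smallest n with alphabet_size ** n >= symbols_needed."""
    n = 1
    while alphabet_size ** n < symbols_needed:
        n += 1
    return n


codon_length = min_codon_length()  # 4 bases, 20 amino acids
```

With 64 possible triplets and only 20 amino acids, several codons can map to the same amino acid, and some can serve as stop signals.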

66 The Genetic Code (Figure: codon table, indexed by the letters of each codon, including the third letter)

67 Translation The sequence of codons is translated to a sequence of amino acids Gene: -GCT TGT TTA CGA ATT- mRNA: -GCU UGU UUA CGA AUU- Peptide: - Ala - Cys - Leu - Arg - Ile - Start codon: AUG –Also codes for Met Stop codons: UGA, UAA, UAG
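Codon-by-codon translation can be sketched with a lookup table. The table below is the standard genetic code written compactly (bases ordered U, C, A, G, with `*` for stop). For simplicity the function reads from the first letter rather than searching for a start codon; that simplification and the function name are assumptions for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    mrna = mrna.upper().replace(" ", "")
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("GCU UGU UUA CGA AUU")  # the mRNA from the slide
```

The result uses one-letter amino acid codes, so Ala-Cys-Leu-Arg-Ile appears as the string ACLRI.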

68 Translation Transfer RNA (tRNA) – a different type of RNA. –Freely float in the cell. –Every amino acid has its own type of tRNA that binds to it alone. Anti-codon – codon binding crucial. mRNA tRNA-Leu Nascent peptide tRNA-Pro Anti-codon

69 Transcriptional regulation (Figure: gene, promoter, transcription starting site, RNA polymerase, transcription factor) RNA polymerase binds to a certain location on the promoter to initiate transcription Transcription factors bind to specific sequences on the promoter to regulate transcription –Recruit RNA polymerase: induce –Block RNA polymerase: repress –Multiple transcription factors may coordinate Will talk more in later lectures

70 Splicing Pre-mRNA needs to be “edited” to form mature mRNA (Figure: a gene with its promoter and transcription starting site is transcribed to pre-mRNA containing 5’ UTR, exons, introns, and 3’ UTR; splicing removes the introns to give mature mRNA, in which the open reading frame (ORF) runs from the start codon to the stop codon) Will talk more in later lectures.

71 Summary DNA: a string made from {A, C, G, T} –Forms the basis of genes –Has 5’ and 3’ –Normally forms double-strand by reverse complement RNA: a string made from {A, C, G, U} –mRNA: messenger RNA –tRNA: transfer RNA –Other types of RNA: rRNA, miRNA, etc. –Has 5’ and 3’ –Normally single-stranded. But can form secondary structure Protein: made from 20 kinds of amino acids –Actual worker in the cell –Has N-terminal and C-terminal –Sequence uniquely determined by its gene via the use of codons –Sequence determines structure, structure determines function Central dogma: DNA transcribes to RNA, RNA translates to Protein –Both steps are regulated

72 Experimental techniques to manipulate DNA

73 DNA synthesis Creating DNA synthetically in a laboratory Chemical synthesis –Chemical reactions –Arbitrary sequences –Maximum length 160-200 Cloning: make copies based on a DNA template –Biological reactions –Requires template –Many copies of a long DNA in a short time

74 Some terms Denature: a DNA double-strand is separated into two strands –By raising temperature Renature: the process by which two denatured DNA strands re-form a double-strand –By cooling down slowly Hybridization: two heterogeneous DNAs form a double-stranded DNA –May have mismatches –The rationale behind many molecular biological techniques including DNA microarray

75 in vitro DNA Cloning Polymerase chain reaction (PCR) (Figure: the double strand is denatured; primers (< 30 bases) bind to each strand; DNA polymerase extends the primers using dNTPs)
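Because each PCR cycle can at best double the number of template molecules, the copy count grows geometrically. The sketch below is an idealized model; the `efficiency` parameter (the fraction of molecules copied per cycle) and the function name are assumptions for this example.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy count after a number of PCR cycles.

    efficiency = 1.0 means perfect doubling every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


copies = pcr_copies(1, 30)  # one molecule, 30 perfect cycles
```

Thirty perfect cycles turn a single molecule into about a billion copies, which is why PCR yields many copies in a short time.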

76 in vivo DNA Cloning Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA bacterial DNA

77 DNA sequencing technology Read out the letters from a DNA sequence Chain-termination method (Sanger method): 1977, Frederick Sanger Example: GTGAGGCGCTGC

78 DNA sequencing: Basic idea PCR primer extension Primer (extended toward its 3’ end): 5’-TTACAGGTCCATACTA-> Template: 3’-AATGTCCAGGTATGATACATAGG-5’ We need to supply A, C, G, T for the synthesis to continue Besides A, C, G, T, we add some A*, C*, G*, and T* –Very similar to ACGT in all aspects, except that –The extension will stop if used
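The logic of chain termination can be simulated in a few lines: every position in the growing strand yields a fragment that stops there with a labelled terminator, and sorting the fragments by length recovers the sequence. This is an idealized sketch that assumes one terminated fragment is observed for every position; the function names are hypothetical.

```python
def chain_termination_fragments(new_strand: str) -> dict[str, list[int]]:
    """For each base X, the lengths of fragments that end in the terminator X*."""
    lengths: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand.upper(), start=1):
        lengths[base].append(position)
    return lengths


def read_sequence(fragment_lengths: dict[str, list[int]]) -> str:
    """Order fragments by length and read off the terminating base of each."""
    labelled = [
        (length, base)
        for base, lengths in fragment_lengths.items()
        for length in lengths
    ]
    return "".join(base for _, base in sorted(labelled))


fragments = chain_termination_fragments("GTGAGGCGCTGC")
recovered = read_sequence(fragments)
```

Sorting by length plays the role of the size separation step, and reading the label of each successive fragment corresponds to base calling.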

79 DNA sequencing, cont

81 Base calling

82 Sequencing speed Current methods can directly sequence only relatively short (<1000bp long) DNA fragments in a single reaction Automated DNA-sequencing instruments (using gel-filled capillaries) can sequence up to 384 DNA samples in a single batch (run) in up to 24 runs a day: ~ 3,000,000 bases per day
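The throughput figure can be unpacked with simple arithmetic. The sketch below uses only the numbers on the slide; the "implied average read length" is a derived quantity computed here for illustration, not a figure stated in the lecture.

```python
SAMPLES_PER_RUN = 384
RUNS_PER_DAY = 24
MAX_READ_LENGTH = 1000  # slide: fragments under 1000 bp per reaction


def reads_per_day() -> int:
    """Number of sequencing reactions one instrument completes per day."""
    return SAMPLES_PER_RUN * RUNS_PER_DAY


def implied_read_length(bases_per_day: float = 3_000_000) -> float:
    """Average usable bases per read needed to reach the stated daily output."""
    return bases_per_day / reads_per_day()


upper_bound_bases = reads_per_day() * MAX_READ_LENGTH
average_read = implied_read_length()
```

At this rate, a single instrument would need on the order of a thousand days just to read 3 billion bases once, before accounting for the redundancy that assembly requires.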

83 Advances in DNA sequencing 1969: three years to sequence 115nt DNA 1979: three years to sequence ~1650nt 1989: one week to sequence ~1650nt 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt 2000: Human Genome - working draft sequence, 3 billion bases 2004: 454 Life Sciences invented the first new-generation sequencer

84 The bioinformatics landmark Completion of human genome sequencing is a success made possible by –Advancement in sequencing technology –Speed of computation –Algorithm development in bioinformatics HGP (Human Genome Project) strategy –Hierarchical sequencing –Estimated 15 years (1990 – 2005), completed in 13 years –$3 billion Celera strategy –Whole-genome shotgun sequencing –Three years (1998-2001) –$300 million

85 Prior to year 2007 Over 300 genomes have been sequenced ~10^11 - 10^12 nt

86 Year 2007 Genomes of three individual humans were sequenced –James Watson –Craig Venter –Yang Huanming Cost for sequencing Watson’s genome –$3 million, 2 months –Compared to $3 billion, 13 years for HGP These were achieved without the new-generation sequencing technology! June 3 2010: “Illumina Drops Personal Genome Sequencing Price to Below $20,000”

87 Sequencing speed has been tremendously improved High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species What’s next?

88 Continue to sequence more species? Genome 10K project More individuals? 1000 Genomes Project What to do with those sequences?

89 Coming next: biological sequence analysis

