Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

CS 5263 Bioinformatics Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

Outline
  Administrivia
  What is bioinformatics
  Why bioinformatics
  Course overview
  Short introduction to molecular biology

Survey form
  Your name
  Academic preparation
  Interests
  These help me better design lectures and assignments

Course Info
  Instructor: Jianhua Ruan
  Office: S.B. 4.01.48
  Phone:
  Office hours: W 3-4pm or by appointment
  Web:

Course description
  A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.
  Prerequisites:
    Programming experience
    Some knowledge of algorithms and data structures
    Basic understanding of statistics and probability
    Appetite to learn some biology

Textbooks
  An Introduction to Bioinformatics Algorithms, by Jones and Pevzner
  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by Durbin, Eddy, Krogh and Mitchison
  Additional resources
    Papers
    Handouts
    See course website

Grading
  Attendance: 10%
    At most 2 classes missed without affecting grade, unless pre-approved by the instructor
  Homework: 50%
    About 5 assignments
    Combination of theoretical and programming exercises
    Possibly presenting papers
    No exams
    No late submission accepted
    Read the collaboration policy!
  Final project and presentation: 40%

Why bioinformatics
  The advance of experimental technology has resulted in a huge amount of data
    The human genome is “finished”
    Even if it were, that’s only the beginning…
  The bottleneck is how to integrate and analyze the data
    Noisy
    Diverse

Growth of GenBank vs Moore’s law

Genome annotations Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

What is bioinformatics
National Institutes of Health (NIH): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

National Center for Biotechnology Information (NCBI): the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

Wikipedia: Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.

Course objectives
  Learn the basics of sequence analysis and other computational biology algorithms
  Become familiar with the research topics in bioinformatics
  Be able to
    Read / criticize bioinformatics research articles
    Identify subareas that best suit your background
    Communicate and exchange ideas with (computational) biologists

What you will learn?
  Basic concepts in molecular biology and genetics
  Algorithms to address selected problems in bioinformatics
    Dynamic programming, string algorithms, graph algorithms
    Statistical learning algorithms: HMM, EM, Gibbs sampling
    Data mining: clustering / classification
  Applications to real data

What you will not learn?
  Designing / performing biological experiments (duh!)
  Programming (in Perl, etc.)
  Building bioinformatics software tools (GUI, database, Web, …)
  Using existing tools / databases (well, not exactly true)

Covered topics
  Biology (1 week)
  Sequence analysis (8 weeks)
    Sequence alignment: pairwise, multiple, global, local, optimal, heuristic
    String matching
    Motif finding
    Gene prediction
    RNA structure prediction
    Phylogenetic tree
  Functional genomics (5 weeks)
    Microarray data analysis
    Biological networks

Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists
(almost) Everything is true or false in computer science (almost) Nothing is ever true or false in Biology

Biologists seek to understand the complicated, messy natural world Computer scientists strive to build their own clean and organized virtual world

Computer scientists are obsessed with being the first to invent or prove something Biologists are obsessed with being the first to discover something

Some examples of central role of CS in bioinformatics

1. Genome sequencing
  AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
  3x10^9 nucleotides
  ~500 nucleotides

1. Genome sequencing
  AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
  3x10^9 nucleotides
  A big puzzle: ~60 million pieces
  Computational fragment assembly
    Introduced ~1980
    1995: assemble up to 1,000,000 long DNA pieces
    2000: assemble whole human genome

2. Gene Finding Where are the genes? In humans: ~22,000 genes
~1.5% of human DNA

2. Gene Finding
  Gene structure (5’ to 3’): start codon ATG, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon TAG/TGA/TAA; splice sites at the exon-intron boundaries
  Predicting genes means giving coordinates for the exon boundaries. The first kind of information that prediction algorithms use is the regular structure of a gene. Every gene starts with an ATG codon, and then exons alternate with introns; at the exon-intron boundaries, the splice sites, there are short words that are approximately preserved.
  Hidden Markov Models (well studied for many years in speech recognition)
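
The regular structure described on this slide (an ATG start, then triplets up to a TAG/TGA/TAA stop) can be illustrated with a much simpler scan than a Hidden Markov Model. The sketch below is not the HMM approach the slide names: it ignores introns, splice sites, and the reverse strand, and the function name `find_orfs` and the `min_codons` parameter are hypothetical, chosen only for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs (0-based, end exclusive) of stretches
    that begin with ATG and run in-frame to the first stop codon.

    Assumption: intron-free DNA, forward strand only.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until a stop codon appears.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                # Count codons before the stop (the ATG included).
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

print(find_orfs("CCATGAAATAGGG"))
```

Real gene finders add statistical models of exon, intron, and splice-site composition on top of this kind of structural constraint.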

3. Protein Folding
  The amino-acid sequence of a protein determines the 3D fold
  The 3D fold of a protein determines its function
  Can we predict the 3D fold of a protein given its amino-acid sequence?
    Holy grail of compbio: a 40-year-old problem
    Molecular dynamics, computational geometry, machine learning

4. Sequence Comparison: Alignment
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  Aligned:
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
   || |||||||  ||||x|  || |||  |||||
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  Sequence alignment
    Introduced ~1970
    BLAST: 1990, one of the most cited papers in history
    Still a very active area of research
  BLAST (query vs. DB)
    Efficient string matching algorithms
    Fast database index techniques
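
As a rough illustration of how an alignment like the one on this slide can be computed, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). The scoring values (+1 match, -1 mismatch, -1 per gap position) and the function name `global_align` are assumptions made for this example; the slide does not specify a scoring scheme, and BLAST itself uses a faster heuristic rather than this exhaustive method.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

s, x, y = global_align("ACGT", "AGT")
print(s)
print(x)
print(y)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths, which is why database search relies on heuristics and indexing.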

Lipman & Pearson, 1985
  …, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
  Database size today: 10^12 (a 2-million-fold increase)
  BLAST search: 1.5 minutes

5. Microarray analysis
  Clinical prediction of leukemia type
    2 types: acute lymphoid (ALL) and acute myeloid (AML)
    Different treatments & outcomes
    Predict type before treatment?
  Bone marrow samples: ALL vs AML
  Measure amount of each gene

Some goals of biology for the next 50 years
  List all molecular parts that build an organism
    Genes, proteins, other functional parts
  Understand the function of each part
  Understand how parts interact physically and functionally
  Study how function has evolved across all species
  Find genetic defects that cause diseases
  Design drugs rationally
  Sequence the genome of every human, use it for personalized medicine
  Bioinformatics is an essential component for all the goals above

A short introduction to molecular biology

Life
  Two categories:
    Prokaryotes (e.g. bacteria)
      Unicellular
      No nucleus
    Eukaryotes (e.g. fungi, plants, animals)
      Unicellular or multicellular
      Have a nucleus

Prokaryote vs Eukaryote
Eukaryotes have many membrane-bound compartments inside the cell. Different biological processes occur at different cellular locations.

Organism, Organ, Cell Organism Organ

Chemical contents of cell
  Water
  Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)
    Protein
    DNA
    RNA
    …
  Small molecules
    Sugar
    Ions (Na+, K+, Ca2+, Cl-, …)
    Hormone

DNA
  DNA: forms the genetic material of all living organisms
    Can be replicated and passed to descendants
    Contains information to produce proteins
  To computer scientists, DNA is a string made from the alphabet {A, C, G, T}
    e.g. ACAGAACGTAGTGCCGTGAGCG
    Each letter is a nucleotide
    Length varies from hundreds to billions

RNA
  Historically thought to be an information carrier only
    DNA => RNA => Protein
    New roles have been found for them
  To computer scientists, RNA is a string made from the alphabet {A, C, G, U}
    e.g. ACAGAACGUAGUGCCGUGAGCG
    Each letter is a nucleotide
    Length varies from tens to thousands

Protein
  Protein: the actual “worker” for almost all processes in the cell
    Enzymes: speed up reactions
    Signaling: information transduction
    Structural support
    Production of other macromolecules
    Transport
  To computer scientists, a protein is a string made from 20 kinds of characters
    E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
    Each letter is called an amino acid
    Length varies from tens to thousands

DNA/RNA zoom-in
  Commonly referred to as nucleic acids
    DNA: deoxyribonucleic acid
    RNA: ribonucleic acid
  Found mainly in the nucleus of a cell (hence “nucleic”)
  Contain phosphoric acid as a component (hence “acid”)
  They are made up of a string of nucleotides

Nucleotides
  A nucleotide has 3 components
    Sugar ring (ribose in RNA, deoxyribose in DNA)
    Phosphoric acid
    Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T) or Uracil (U)

Monomers of RNA: ribo-nucleotide
  A ribonucleotide has 3 components
    Sugar: ribose
    Phosphate group
    Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Uracil (U)

Monomers of DNA: deoxy-ribo-nucleotide
  A deoxyribonucleotide has 3 components
    Sugar: deoxyribose
    Phosphate group
    Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T)

Polymerization: Nucleotides => nucleic acids
Phosphate Sugar Nitrogen Base Phosphate Sugar Nitrogen Base Phosphate Sugar Nitrogen Base

DNA
  5’-AGCGACTG-3’, or simply AGCGACTG
  Often recorded from 5’ (5 prime) to 3’ (3 prime), which is the direction of many biological processes, e.g. DNA replication, transcription, etc.
  Figure labels: free phosphate at the 5’ end; base, phosphate, and sugar of each nucleotide; sugar carbons numbered 1 to 5

RNA
  5’-AGUGACUG-3’, or simply AGUGACUG
  Often recorded from 5’ (5 prime) to 3’ (3 prime), which is the direction of many biological processes, e.g. translation.
  Figure labels: free phosphate at the 5’ end

DNA usually exists in pairs.
  Base-pair: A = T, G = C
  Forward (+) strand:  5’-AGCGACTG-3’
  Backward (-) strand: 3’-TCGCTGAC-5’
  One strand is said to be reverse-complementary to the other

DNA double helix G-C pair is stronger than A-T pair

Reverse-complementary sequences
  5’-ACGTTACAGTA-3’
  The reverse complement is:
  3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’
  Or simply written as TACTGTAACGT
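
The two steps on this slide (complement each base, then reverse so the result reads 5’ to 3’) can be sketched in a few lines of Python. The function name `reverse_complement` is a hypothetical helper for this illustration, and the sketch assumes the input contains only A, C, G, T (upper or lower case).

```python
# Map each base to its pairing partner: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # Complement every base, then reverse so the result is again 5' to 3'.
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT, as on the slide
```

Applying the function twice returns the original sequence, which is a handy sanity check.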

Orientation of the double helix
  Double helix is anti-parallel
    5’ end of each strand pairs with 3’ end of the other
    5’ to 3’ motion in one strand is 3’ to 5’ in the other
  Double helix has no orientation
    Biology has no “forward” and “reverse” strand
    Relative to any single strand, there is a “reverse complement” or “reverse strand”
    Information can be encoded by either strand or both strands
  5’-TTTTACAGGACCATG-3’
  3’-AAAATGTCCTGGTAC-5’

RNA
  RNAs are normally single-stranded
  Form complex structures by self-base-pairing: A=U, C=G
  Can also form RNA-DNA and RNA-RNA double strands: A=T/U, C=G

Protein zoom-in
  Protein is the actual “worker” for almost all processes in the cell
  A string built from 20 letters
    E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
    Each letter is called an amino acid
  Generic chemical form of an amino acid:
           R              (side chain)
           |
      H2N--C--COOH
           |
           H
      H2N: amino group; COOH: carboxyl group

Amino acid
  20 amino acids, which differ only in their side chains
  Each can be expressed by three letters
  Or a single letter: A-Y, except B, J, O, U, X, Z
  Alanine = Ala = A
  Histidine = His = H

Amino acids => peptide
         R                R
         |                |
    H2N--C--COOH     H2N--C--COOH
         |                |
         H                H
    =>
         R          R
         |          |
    H2N--C--CO--NH--C--COOH
         |          |
         H          H
    The CO--NH link is the peptide bond

Protein
  Has orientations: H2N … COOH (N-terminal to C-terminal)
  Usually recorded from N-terminal to C-terminal
  Peptide vs protein: basically the same thing
  Conventions
    Peptide is shorter (< 50aa), while protein is longer
    Peptide refers to the sequence, while protein has 2D/3D structure

Protein structure Linear sequence of amino acids folds to form a complex 3-D structure. The structure of a protein is intimately connected to its function.

Genome and chromosome
  Genome: the complete DNA sequences in the cell of an organism
    May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes
  Chromosome: a single large DNA molecule in the cell
    May be circular or linear
    Contains genes as well as “junk DNA”
    Highly packed!

Formation of chromosome

Formation of chromosome
50,000 times shorter than extended DNA The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun

Gene Gene: unit of heredity in living organisms
A segment of DNA with information to make a protein

Some statistics
                    Chromosomes   Bases         Genes
  Human             46            3 billion     20k-25k
  Dog               78            2.4 billion   ~20k
  Corn              20            2.5 billion   50-60k
  Yeast             16            20 million    ~7k
  E. coli           1             4 million     ~4k
  Marbled lungfish  ?             130 billion

Human genome 46 chromosomes: 22 pairs + X + Y
1 from mother, 1 from father Female: X + X Male: X + Y

Human genome Every cell contains the same genomic information
Except sperms and eggs, which only contain half of the genome Otherwise your children would have chromosomes

Cell division: mitosis
A cell duplicates its genome and divides into two identical cells These cells build up different parts of your body

Cell division: meiosis
  A reproductive cell divides into four cells, each containing only half of the genome
    Diploid => haploid
  Two haploid cells (sperm + egg) form a zygote
    Which will then develop into a multi-cellular organism by mitosis

Central dogma of molecular biology
DNA replication is critical in both mitosis and meiosis

DNA Replication
  The process of copying a double-stranded DNA molecule
  Semi-conservative
  5’-ACATGATAA-3’  =>  5’-ACATGATAA-3’  +  5’-ACATGATAA-3’
  3’-TGTACTATT-5’      3’-TGTACTATT-5’     3’-TGTACTATT-5’

Mutation: changes in DNA base-pairs
Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity

Central dogma of molecular biology

Transcription
  The process by which a DNA sequence is copied to produce a complementary RNA
    Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
    Called non-coding RNA if the RNA does not carry instructions on how to make a protein
    Only consider mRNA for now
  Similar to replication, but only one strand is copied

Transcription
  DNA-RNA pair: A=U, C=G, T=A, G=C
  Coding strand:   5’-ACGTAGACGTATAGAGCCTAG-3’  (where genetic information is stored)
  Template strand: 3’-TGCATCTGCATATCTCGGATC-5’  (for making mRNA)
  mRNA:            5’-ACGUAGACGUAUAGAGCCUAG-3’
  Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
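
The relationship between the three sequences on this slide can be checked with a short Python sketch. Both function names are hypothetical helpers introduced for illustration. The sketch assumes the coding strand is written 5’ to 3’ and the template strand 3’ to 5’, exactly as on the slide, so no reversal is needed.

```python
# DNA-to-RNA pairing from the slide: A=U, C=G, T=A, G=C.
DNA_TO_RNA_PAIR = str.maketrans("ACGT", "UGCA")

def transcribe_from_coding(coding_5to3):
    """mRNA equals the coding strand with every T replaced by U."""
    return coding_5to3.upper().replace("T", "U")

def transcribe_from_template(template_3to5):
    """mRNA is the base-by-base complement of the template strand.

    Assumes the template is given 3' to 5', so the result reads 5' to 3'.
    """
    return template_3to5.upper().translate(DNA_TO_RNA_PAIR)

coding = "ACGTAGACGTATAGAGCCTAG"
template = "TGCATCTGCATATCTCGGATC"
print(transcribe_from_coding(coding))
print(transcribe_from_template(template))
```

Both calls produce the same mRNA, which is the point the slide makes: the cell reads the template strand, and the product matches the coding strand apart from T versus U.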

Translation
  The process of making proteins from mRNA
  A gene uniquely encodes a protein
  There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein
  How many nucleotides are required to encode an amino acid in order to ensure correct translation?
    4^1 = 4 (too few)
    4^2 = 16 (too few)
    4^3 = 64 (enough for 20 amino acids)
  The actual genetic code used by the cell is a triplet code
    Each triplet is called a codon
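
The counting argument on this slide amounts to finding the smallest word length k with 4^k >= 20. A tiny Python sketch, with a hypothetical function name and parameter names chosen for illustration:

```python
def min_codon_length(alphabet_size=4, symbols_needed=20):
    """Smallest k such that alphabet_size ** k >= symbols_needed."""
    k = 1
    while alphabet_size ** k < symbols_needed:
        k += 1
    return k

print(min_codon_length())  # 3
```

With k = 3 there are 64 possible codons for 20 amino acids, so several codons can map to the same amino acid.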

The Genetic Code Third letter

Translation
  The sequence of codons is translated to a sequence of amino acids
    Gene:    -GCT TGT TTA CGA ATT-
    mRNA:    -GCU UGU UUA CGA AUU-
    Peptide: -Ala-Cys-Leu-Arg-Ile-
  Start codon: AUG
    Also codes for Met
  Stop codons: UGA, UAA, UAG
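
A minimal Python sketch of the codon-by-codon lookup described on this slide. It builds the standard genetic code from a compact 64-letter string (codons enumerated with bases in the order U, C, A, G; `*` marks a stop codon) and uses one-letter amino acid codes. The function name `translate` is hypothetical, and as a simplifying assumption the sketch starts reading at the first base rather than searching for the AUG start codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order (UUU, UUC, UUA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate an mRNA string into one-letter amino acid codes.

    Reads triplets from position 0 and stops at the first stop codon.
    Spaces are ignored so the slide's spaced-out codons can be pasted in.
    """
    mrna = mrna.upper().replace(" ", "")
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon: UGA, UAA, or UAG
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate("GCU UGU UUA CGA AUU"))  # ACLRI = Ala-Cys-Leu-Arg-Ile
```

Because 64 codons map to 20 amino acids plus stop, the table is many-to-one: a protein sequence cannot be uniquely mapped back to its mRNA.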

Translation
  Transfer RNA (tRNA): a different type of RNA
    Freely floats in the cell
    Every amino acid has its own type of tRNA that binds to it alone
  Anti-codon to codon binding is crucial
  Figure labels: tRNA-Pro, tRNA-Leu, anti-codon, nascent peptide, mRNA

Transcriptional regulation
  RNA polymerase binds to a certain location on the promoter to initiate transcription
  Transcription factors bind to specific sequences on the promoter to regulate transcription
    Recruit RNA polymerase: induce
    Block RNA polymerase: repress
    Multiple transcription factors may coordinate
  Will talk more in later lectures
  Figure labels: transcription factor, RNA polymerase, transcription starting site, promoter, gene

Splicing
  Pre-mRNA needs to be “edited” to form mature mRNA
  Will talk more in later lectures.
  Figure labels: promoter, transcription starting site, gene; transcription produces pre-mRNA (5’ UTR, exon, intron, exon, intron, exon, 3’ UTR); splicing produces mature mRNA with an open reading frame (ORF) between the start codon and the stop codon

Summary
  DNA: a string made from {A, C, G, T}
    Forms the basis of genes
    Has 5’ and 3’ ends
    Normally forms a double strand by reverse complement
  RNA: a string made from {A, C, G, U}
    mRNA: messenger RNA
    tRNA: transfer RNA
    Other types of RNA: rRNA, miRNA, etc.
    Normally single-stranded, but can form secondary structure
  Protein: made from 20 kinds of amino acids
    Actual worker in the cell
    Has N-terminal and C-terminal
    Sequence uniquely determined by its gene via the use of codons
    Sequence determines structure, structure determines function
  Central dogma: DNA transcribes to RNA, RNA translates to protein
    Both steps are regulated

Experimental techniques to manipulate DNA

DNA synthesis
  Creating DNA synthetically in a laboratory
  Chemical synthesis
    Chemical reactions
    Arbitrary sequences
    Maximum length
  Cloning: make copies based on a DNA template
    Biological reactions
    Requires template
    Many copies of a long DNA in a short time

in vivo DNA Cloning Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA bacterial DNA

in vitro DNA Cloning
  Polymerase chain reaction (PCR)
  Figure labels: denature; primer (< 30 bases); DNA polymerase; dNTP; strand ends marked 5’

Some terms
  Denature: a DNA double-strand is separated into two strands
    By raising temperature
  Renature: the process by which two denatured DNA strands re-form a double-strand
    By cooling down slowly
  Hybridization: two heterogeneous DNAs form a double-stranded DNA
    May have mismatches
    The rationale behind many molecular biological techniques including DNA microarray

DNA sequencing technology
Read out the letters from a DNA sequence 1974, Frederick Sanger GTGAGGCGCTGC

DNA sequencing: Basic idea
  PCR primer extension
    5’-TTACAGGTCCATACTA =>
    3’-AATGTCCAGGTATGATACATAGG-5’
  We need to supply A, C, G, T for the synthesis to continue
  Besides A, C, G, T, we add some A*, C*, G*, and T*
    Very similar to A, C, G, T in all aspects, except that the extension will stop if one is used

DNA sequencing, cont

Advances in DNA sequencing
  1969: three years to sequence 115nt DNA
  1979: three years to sequence ~1650nt
  1989: one week to sequence ~1650nt
  1995: Haemophilus genome sequenced at TIGR - 1,830,138nt
  2000: Human Genome - working draft sequence, 3 billion bases
  2003: (near) completion of human genome

The bioinformatics landmark
  Completion of human genome sequencing is a success shared by
    Advancement in sequencing technology
    Speed of computation
    Algorithm development in bioinformatics
  HGP (Human Genome Project) strategy
    Hierarchical sequencing
    Estimated 15 years (1990 – 2005), completed in 13 years
    $3 billion
  Celera strategy
    Whole-genome shotgun sequencing
    Three years ( )
    $300 million

Now Over 300 genomes have been sequenced ~ nt

2007
  Genomes of three individual humans were sequenced
    James Watson
    Craig Venter
    TBN Chinese
  Cost for sequencing Watson’s genome
    $3 million, 2 months
    Compared to $3 billion, 13 years for HGP

What’s next?
  Sequencing speed has been tremendously improved
  High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species

Continue to sequence more species? More individuals?
What to do with those sequences?

Coming next: biological sequence analysis
