Welcome to CS262: Computational Genomics
  Instructor: Serafim Batzoglou
  TA: Eugene Fratkin
  Tuesday & Thursday, 2:45-4:00, Skilling Auditorium
Goals of this course
  Introduction to Computational Biology & Genomics
    Basic concepts and scientific questions
    Why does it matter?
    Basic biology for computer scientists
    In-depth coverage of algorithmic techniques
    Current active areas of research
  Useful algorithms
    Dynamic programming
    String algorithms
    HMMs and other graphical models for sequence analysis
Topics in CS262
  Part 1: Basic Algorithms
    Sequence Alignment & Dynamic Programming
    Hidden Markov models, Context Free Grammars, Conditional Random Fields
  Part 2: Topics in computational genomics and areas of active research
    DNA sequencing
    Comparative genomics
    Genes: finding genes, gene regulation
    Proteins, families, and evolution
    Networks of protein interactions
Course responsibilities
  Homeworks
    4 challenging problem sets, 4-5 problems per problem set
    Due at the beginning of class
    Up to 3 late days (24-hour periods) for the quarter
    Collaboration allowed; please give credit
      Teams of 2 or 3 students
      Individual writeups
      If working individually (no team), the worst problem score in each problem set is dropped
  (Optional) Scribing
    Due one week after the lecture, except by special permission
    Scribing grade replaces the 2 lowest problem scores across all problem sets
    First come, first served; use the staff list to sign up
Reading material
  Books
    “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
      Chapters 1-4, 6, 7-8, 9-10
    “Algorithms on strings, trees, and sequences” by Gusfield
      Chapters 5-7, 11-12, 13, 14, 17
  Papers
  Lecture notes
Birth of Molecular Biology
  [Figure: structure of DNA, labeled with phosphate group, sugar, and nitrogenous base (A, C, G, T); two photos captioned "Physicist" and "Ornithologist"]
DNA and RNA
  Base pairing in DNA: A - T, G - C
  In RNA, U takes the place of T
  [Figure: a DNA strand G A G T C A G C and its RNA copy G A G U C A G C]
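DNA
  DNA is written 5’ to 3’ by convention
  AGACC = GGTCT: the two strings name the same double-stranded fragment, since each is the reverse complement of the other
    5’ AGACC 3’
    3’ TCTGG 5’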
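The AGACC = GGTCT identity can be checked with a few lines of Python. This is a minimal sketch; the function name `reverse_complement` is an illustrative choice, not something from the course materials.

```python
# Watson-Crick pairing for DNA bases (A-T, G-C).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand of `seq`, itself written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original string, which is why either strand can be used to name the fragment.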
Chromosomes
  [Figure labels: DNA; histones H1, H2A, H2B, H3, H4; ~146 bp; nucleosome; chromatin; centromere; telomere]
  In humans:
    2 x 22 autosomes
    X, Y sex chromosomes
The Genetic Dogma
  Double-stranded DNA
    5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
    3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
  (transcription)
  Single-stranded RNA
    AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
  (translation)
  protein
DNA to RNA to Protein to Cell
  DNA, ~3x10^9 nucleotides long in humans
  Contains ~22,000 genes
  Steps: transcription (DNA to messenger RNA, e.g. G A G U C A G C), translation (RNA to protein), folding
Gene Transcription
  5’ G A T T A C A ... 3’
  3’ C T A A T G T ... 5’
Gene Transcription
  5’ G A T T A C A ... 3’
  3’ C T A A T G T ... 5’
  The promoter lies upstream of a gene
  Transcription factors recognize transcription factor binding sites and bind to them, forming a complex
  RNA polymerase binds the complex
Gene Transcription
  5’ G A T T A C A ... 3’
  3’ C T A A T G T ... 5’
  The two strands are separated
Gene Transcription
  5’ G A T T A C A ... 3’
  3’ C T A A T G T ... 5’
  An RNA copy of the 5’ → 3’ sequence is created from the 3’ → 5’ template
  RNA: G A U U A C A
Gene Transcription
  pre-mRNA: 5’ G A U U A C A ... 3’
  DNA:      5’ G A T T A C A ... 3’
            3’ C T A A T G T ... 5’
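The transcription slides can be mirrored in code: pair each base of the 3’ → 5’ template with its RNA partner and the result matches the 5’ → 3’ strand with U in place of T. This is a sketch only; `transcribe` and `RNA_PAIR` are hypothetical names, and real transcription also involves promoters, polymerase, and processing that are not modeled here.

```python
# RNA base that pairs with each DNA template base (A pairs with U in RNA).
RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_3to5: str) -> str:
    """Build the RNA (5' to 3') from a DNA template written 3' to 5'."""
    return "".join(RNA_PAIR[base] for base in template_3to5.upper())

coding = "GATTACA"    # 5' to 3' strand from the slides
template = "CTAATGT"  # 3' to 5' template strand from the slides

rna = transcribe(template)
print(rna)  # GAUUACA
# The RNA equals the coding strand with T replaced by U.
print(rna == coding.replace("T", "U"))  # True
```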
RNA Processing
  [Figure: a pre-mRNA with exons and introns is processed into an mRNA with a 5’ cap, 5’ UTR, 3’ UTR, and poly(A) tail]
Gene Structure
  [Figure: gene layout from 5’ to 3’: promoter, 5’ UTR, exons and introns, 3’ UTR; regions marked coding and non-coding]
How many?
  Genes: ~22,000 in the human genome
  Exons per gene: ~8 on average (max: 148)
  Nucleotides per exon: 170 on average (max: 12k)
  Nucleotides per intron: 5,500 on average (max: 500k)
  Nucleotides per gene: 45k on average (max: 2.2M)
Proteins
  Composed of a chain of amino acids.
       R
       |
  H2N--C--COOH
       |
       H
  20 possible groups (R): Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
Proteins
       R              R
       |              |
  H2N--C--COOH   H2N--C--COOH
       |              |
       H              H
Dipeptide
       R  O      R
       |  ||     |
  H2N--C--C--NH--C--COOH
       |         |
       H         H
  The C--NH link is a peptide bond
Protein structure
  A linear sequence of amino acids folds to form a complex 3-D structure.
  The structure of a protein is intimately connected to its function.
Translation
  The ribosome synthesizes a protein by reading the mRNA in triplets (codons).
  Each codon is translated to an amino acid.
  [Figure: ribosome on the mRNA, with P site and A site]
The Genetic Code
  Rows: first base. Columns: second base. Rightmost column: third base.

        U               C               A               G
  U     UUU Phe         UCU Ser         UAU Tyr         UGU Cys         U
        UUC Phe         UCC Ser         UAC Tyr         UGC Cys         C
        UUA Leu         UCA Ser         UAA STOP        UGA STOP        A
        UUG Leu         UCG Ser         UAG STOP        UGG Trp         G
  C     CUU Leu         CCU Pro         CAU His         CGU Arg         U
        CUC Leu         CCC Pro         CAC His         CGC Arg         C
        CUA Leu         CCA Pro         CAA Gln         CGA Arg         A
        CUG Leu         CCG Pro         CAG Gln         CGG Arg         G
  A     AUU Ile         ACU Thr         AAU Asn         AGU Ser         U
        AUC Ile         ACC Thr         AAC Asn         AGC Ser         C
        AUA Ile         ACA Thr         AAA Lys         AGA Arg         A
        AUG Met/START   ACG Thr         AAG Lys         AGG Arg         G
  G     GUU Val         GCU Ala         GAU Asp         GGU Gly         U
        GUC Val         GCC Ala         GAC Asp         GGC Gly         C
        GUA Val         GCA Ala         GAA Glu         GGA Gly         A
        GUG Val         GCG Ala         GAG Glu         GGG Gly         G

  Key: Phe = Phenylalanine, Leu = Leucine, Ser = Serine, Tyr = Tyrosine, Cys = Cysteine, Trp = Tryptophan, Pro = Proline, His = Histidine, Gln = Glutamine, Arg = Arginine, Ile = Isoleucine, Met = Methionine, Thr = Threonine, Asn = Asparagine, Lys = Lysine, Val = Valine, Ala = Alanine, Asp = Aspartic acid, Glu = Glutamic acid, Gly = Glycine
Translation (tRNA)
  [Figure: a tRNA carrying Tryptophan, with anticodon C C A]
Translation
  5’ ... A U U A U G G C C U G G A C U U G A ... 3’
  Labels: UTR (before the start codon); start codon A U G (Met); G C C (Ala); U G G (Trp); A C U (Thr)
Translation
  5’ ... A U U A U G G C C U G G A C U U G A ... 3’
Translation
  5’ ... A U U A U G G C C U G G A C U U G A ... 3’
  [Figure labels: Met, Ala (chain so far); Trp (next amino acid)]
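The translation example on these slides can be reproduced with a short Python sketch. The codon table is the standard genetic code shown earlier, packed into one string in U, C, A, G order; the names `CODON_TABLE`, `THREE_LETTER`, and `translate` are illustrative. The sketch assumes translation starts at the first AUG and ends at the first stop codon, which ignores everything a real ribosome does to pick a start site.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for all 64 codons, ordered by first, second, and
# third base in U, C, A, G order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def translate(mrna: str) -> list[str]:
    """Translate from the first AUG up to (not including) the first stop."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(THREE_LETTER[aa])
    return protein

print(translate("AUUAUGGCCUGGACUUGA"))  # ['Met', 'Ala', 'Trp', 'Thr']
```

The leading A U U is skipped because it lies before the start codon, and the final U G A is a stop codon, so it contributes no amino acid.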
Errors?
  What if the transcription / translation machinery makes mistakes?
  What is the effect of mutations in coding regions?
Reading Frames
  G C U U G U U U A C G A A U U A G
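A sequence can be split into codons in three ways, depending on where reading begins. The sketch below lists the three forward reading frames of the slide's sequence; `reading_frames` is a hypothetical helper name, and trailing bases that do not fill a codon are dropped.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Return the codons of the three forward reading frames of `seq`."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]

seq = "GCUUGUUUACGAAUUAG"
for offset, codons in enumerate(reading_frames(seq)):
    print(offset, " ".join(codons))
# 0 GCU UGU UUA CGA AUU
# 1 CUU GUU UAC GAA UUA
# 2 UUG UUU ACG AAU UAG
```

Frame 0 is the one used on the mutation slides (Ala Cys Leu Arg Ile); frame 2 happens to end in the stop codon UAG.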
Synonymous Mutation
  Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
  Mutated:  GCU UGU UUG CGA AUU AG    Ala Cys Leu Arg Ile    (A → G in the third codon)
Missense Mutation
  Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
  Mutated:  GCU UGG UUA CGA AUU AG    Ala Trp Leu Arg Ile    (U → G in the second codon)
Nonsense Mutation
  Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
  Mutated:  GCU UGA UUA CGA AUU AG    Ala STOP               (U → A in the second codon)
Frameshift
  Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
  Mutated:  GCU UGU UAC GAA UUA G     Ala Cys Tyr Glu Leu    (one U deleted)
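The four mutation types above can be told apart mechanically by translating the original and mutated sequences and comparing the results. The following is a rough sketch under stated assumptions: both sequences are read in frame 0 from their first base, any length change that is not a multiple of 3 is called a frameshift, and in-frame insertions or deletions are not treated separately. `classify` and `translate` are hypothetical names.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO)}

def translate(rna: str) -> str:
    """Translate frame 0 into one-letter codes, stopping after a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

def classify(original: str, mutated: str) -> str:
    """Label a mutation as frameshift, synonymous, nonsense, or missense."""
    if (len(original) - len(mutated)) % 3 != 0:
        return "frameshift"
    before, after = translate(original), translate(mutated)
    if before == after:
        return "synonymous"
    # A new stop codon that ends the protein early or replaces an amino acid.
    if after.endswith("*") and (not before.endswith("*") or len(after) < len(before)):
        return "nonsense"
    return "missense"

original = "GCUUGUUUACGAAUUAG"
print(classify(original, "GCUUGUUUGCGAAUUAG"))  # synonymous
print(classify(original, "GCUUGGUUACGAAUUAG"))  # missense
print(classify(original, "GCUUGAUUACGAAUUAG"))  # nonsense
print(classify(original, "GCUUGUUACGAAUUAG"))   # frameshift
```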
Noncoding RNA
  RNA: 5’ G A U U A C A ... 3’
  DNA: 5’ G A T T A C A ... 3’
       3’ C T A A T G T ... 5’
Genetics in the 20th Century
21st Century
  [Figure: raw DNA sequence data]
    AGTAGCACAGACTACGACGAGA
    CGATCGTGCGAGCGACGGCGTA
    GTGTGCTGTACTGTCGTGTGTG
    TGTACTCTCCTCTCTCTAGTCT
    ACGTGCTGTATGCGTTAGTGTC
    GTCGTCTAGTAGTCGCGATGCT
    CTGATGTTAGAGGATGCACGAT
    GCTGCTGCTACTAGCGTGCTGC
    TGCGATGTAGCTGTCGTACGTG
    TAGTGTGCTGTAAGTCGAGTGT
    AGCTGGCGATGTATCGTGGT
Computational Biology
  Organize & analyze massive amounts of biological data
  Enable biologists to use data
  Form testable hypotheses
  Discover new biology
  [Figure: raw DNA sequence data]
Some Topics in CS262
  1. Sequencing
    Genome: 3x10^9 nucleotides
    Reads: ~500 nucleotides each
    [Figure: reads AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT]
Some Topics in CS262
  1. Sequencing
    Genome: 3x10^9 nucleotides
    [Figure: reads AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT]
    Computational Fragment Assembly
      Introduced ~ : assemble up to 1,000,000 long DNA pieces
      2000: assemble whole human genome
    A big puzzle: ~60 million pieces
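The numbers on the sequencing slides imply how many times each position of the genome is read on average. The arithmetic below is a back-of-the-envelope sketch; it assumes the ~60 million pieces are reads of about 500 nucleotides each, which the slides suggest but do not state outright.

```python
genome_length = 3_000_000_000  # 3x10^9 nucleotides (from the slides)
read_length = 500              # ~500 nucleotides per read (from the slides)
num_reads = 60_000_000         # ~60 million pieces (from the slides)

# Average coverage: total sequenced bases divided by genome length.
coverage = num_reads * read_length / genome_length
print(f"average coverage: {coverage:.0f}x")  # average coverage: 10x

# Reads needed to tile the genome exactly once, with no overlap at all.
min_reads = genome_length // read_length
print(f"reads for 1x coverage: {min_reads:,}")  # reads for 1x coverage: 6,000,000
```

Overlap between reads is what lets an assembler order the pieces, so the redundancy is the point of the puzzle rather than waste.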
Complete genomes today
  More than 300 complete genomes have been sequenced
2. Gene Finding
  Where are the genes?
  In humans:
    ~22,000 genes
    ~1.5% of human DNA
  [Figure: sequence signals highlighted within a gene: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]
3. Molecular Evolution
Evolution at the DNA level
  [Figure: mutations passed on to the next generation, labeled "OK", "X", "X", and "Still OK?"]
4. Sequence Comparison
  Sequence conservation implies function
  Sequence comparison is key to
    Finding genes
    Determining function
    Uncovering the evolutionary processes
Sequence Comparison: Alignment
  Unaligned:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  Aligned (| = match, x = mismatch, - = gap):
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     || |||||||  ||||x|  ||x|||  |||||
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  Sequence Alignment
    Introduced ~1970
    BLAST: 1990, among the most cited papers in history
    Still a very active area of research
  [Figure: BLAST searches a DB with a query]
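Alignments like the one above can be computed by dynamic programming. The sketch below is a basic global alignment in the style of Needleman and Wunsch, with assumed scores of +1 for a match, -1 for a mismatch, and -1 per gap position; these values are illustrative choices, not the course's. It is not how BLAST works (BLAST is a heuristic local search), and it will not necessarily reproduce the slide's exact alignment.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

s, x, y = global_align("AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                       "TAGCTATCACGACCGCGGTCGATTTGCCCGAC")
print(s)
print(x)
print(y)
```

The table has (n+1) x (m+1) cells and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.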
5. RNA Structure
  Given:
    AGCAGAGUGG …   (an unfolded RNA sequence)
    AGCACAGUGA …   (+ aligned homologs)
    ACUAGACAGG …
    CGCCGAGUCG …
    AGCAGUGUGG …
  Predict: which nucleotides base pair?
  [Figure labels: helix (stem), hairpin loop, bulge loop, internal loop, multi-branch loop]
6. Protein networks
  Fresh research area
  Construct networks from multiple data sources
  Navigate networks
  Compare networks across organisms
Computer Scientists vs Biologists
Computer scientists vs Biologists
  Nothing is ever true or false in Biology
  Everything is true or false in computer science
Computer scientists vs Biologists
  Biologists strive to understand the complicated, messy natural world
  Computer scientists seek to build their own clean and organized virtual worlds
Computer scientists vs Biologists
  Biologists are obsessed with being the first to discover something
  Computer scientists are obsessed with being the first to invent or prove something
Computer scientists vs Biologists
  Biologists are comfortable with the idea that all data have errors
  Computer scientists are not
Computer scientists vs Biologists
  Computer scientists get high-paying jobs after graduation
  Biologists typically have to complete one or more 5-year post-docs...
Computer Science is to Biology what Mathematics is to Physics