Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday 12:50-2:05 Skilling Auditorium
Goals of this course Introduction to Computational Biology & Genomics Basic concepts and scientific questions Why does it matter? Basic biology for computer scientists In-depth coverage of algorithmic techniques Current active areas of research Useful algorithms Dynamic programming String algorithms HMMs and other graphical models for sequence analysis
Topics in CS262 Part 1: Basic Algorithms Sequence Alignment & Dynamic Programming Hidden Markov models, Context Free Grammars, Conditional Random Fields Part 2: Topics in computational genomics and areas of active research DNA sequencing Comparative genomics Genes: finding genes, gene regulation Proteins, families, and evolution Networks of protein interactions
Course responsibilities Homeworks 4 challenging problem sets, 4-5 problems/pset Due at beginning of class Up to 3 late days (24-hr periods) for the quarter Collaboration allowed – please give credit Teams of 2 or 3 students Individual writeups If individual (no team) then drop score of worst problem per problem set (Optional) Scribing Due one week after the lecture, except special permission Scribing grade replaces 2 lowest problems from all problem sets First-come first-serve, staff list to sign up
Reading material Books “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison Chapters 1-4, 6, 7-8, 9-10 “Algorithms on strings, trees, and sequences” by Gusfield Chapters 5-7, 11-12, 13, 14, 17 Papers Lecture notes
Birth of Molecular Biology DNA Phosphate Group Sugar Nitrogenous Base A, C, G, T Physicist Ornithologist
G A G U C A G C DNA RNA A - T G - C T U
DNA DNA is written 5’ to 3’ by convention AGACC = GGTCT 3’ 5’ 3’
Chromosomes H1DNA H2A, H2B, H3, H4 ~146bp telomere centromere nucleosome chromatin In humans: 2x22 autosomes X, Y sex chromosomes
The Genetic Dogma 3’ 5’ 3’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA (transcription) (translation) Single-stranded RNA protein Double-stranded DNA
DNA to RNA to Protein to Cell DNA, ~3x10 9 long in humans Contains ~ 22,000 genes G A G U C A G C messenger-RNA transcriptiontranslationfolding
Genetics in the 20 th Century
21 st Century AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT AGTAGGACAGACTACGACGAGACGAT CGTGCGAGCGACGGCGTAGTGTGCTG TACTGTCGTGTGTGTGTACTCTCCTC TCTCTAGTCTACGTGCTGTATGCGTT AGTGTCGTCGTCTAGTAGTCGCGATG CTCTGATGTTAGAGGATGCACGATGC TGCTGCTACTAGCGTGCTGCTGCGAT GTAGCTGTCGTACGTGTAGTGTGCTG TAAGTCGAGTGTAGCTGGCGATGTAT CGTGGT
Computational Biology Organize & analyze massive amounts of biological data Enable biologists to use data Form testable hypotheses Discover new biology AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT
Some Topics in CS Sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides ~500 nucleotides
Some Topics in CS Sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides Computational Fragment Assembly Introduced ~ : assemble up to 1,000,000 long DNA pieces 2000: assemble whole human genome A big puzzle ~60 million pieces
Complete genomes today More than 300 complete genomes have been sequenced
Where are the genes? 2. Gene Finding In humans: ~22,000 genes ~1.5% of human DNA
atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag
3. Molecular Evolution
Evolution at the DNA level OK X X Still OK? next generation
4. Sequence Comparison Sequence conservation implies function Sequence comparison is key to Finding genes Determining function Uncovering the evolutionary processes
Sequence Comparison—Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- | | | | | | | | | | | | | x | | | | | | | | | | | TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Sequence Alignment Introduced ~1970 BLAST: 1990, most cited paper in history Still very active area of research query DB BLAST
5. RNA Structure Predict: Given: AGCAGAGUGG … an unfolded RNA sequence AGCACAGUGA … + aligned homologs ACUAGACAGG … CGCCGAGUCG … AGCAGUGUGG … bulge loop helix (stem) hairpin loop internal loop multi- branch loop
6. Protein networks Fresh research area Construct networks from multiple data sources Navigate networks Compare networks across organisms
Sequence Alignment
Complete DNA Sequences More than 300 complete genomes have been sequenced
Evolution
Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication
Evolutionary Rates OK X X Still OK? next generation
Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering evolutionary events
Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, N in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC
What is a good alignment? AGGCTAGTT, AGCGAAGTTT AGGCTAGTT- 6 matches, 3 mismatches, 1 gap AGCGAAGTTT AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps AG-CGAAGTTT AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gaps AG-CG-AAGTTT
Scoring Function Sequence edits: AGGCCTC MutationsAGGACTC InsertionsAGGGCCTC DeletionsAGG. CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches) m - (# mismatches) s – (#gaps) d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”
How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: >> 2 N (exercise)
Alignment is additive Observation: The score of aligningx 1 ……x M y 1 ……y N is additive Say thatx 1 …x i x i+1 …x M aligns to y 1 …y j y j+1 …y N The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Dynamic Programming There are only a polynomial number of subproblems Align x 1 …x i to y 1 …y j Original problem is one of the subproblems Align x 1 …x M to y 1 …y N Each subproblem is easily solved from smaller subproblems We will show next Dynamic Programming!!!Then, we can apply Dynamic Programming!!! Let F(i, j) = optimal score of aligning x 1 ……x i y 1 ……y j F is the DP “Matrix” or “Table” “Memoization”