Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Lecture outline Sequence alignment Gaps and scoring matrices
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication.
CS273a Lecture 8, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion.
Introduction to Bioinformatics Spring 2008 Yana Kortsarts, Computer Science Department Bob Morris, Biology Department.
Sequencing and Sequence Alignment
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Using Bioinformatics to Make the Bio- Math Connection The Confessions of a Biology Teacher.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
Bioinformatics and Phylogenetic Analysis
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
CS273a Lecture 10, Aut 08, Batzoglou Multiple Sequence Alignment.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
CS273a Lecture 9/10, Aut 10, Batzoglou Multiple Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: George Asimenos Andreas Sundquist Tuesday&Thursday 2:45-4:00 Skilling Auditorium.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS374 Algorithms in Biology. Overview Administrivia Molecular Biology and Computation  DNA, proteins, cells, evolution  Some examples of.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
CS5263 Bioinformatics Lecture 1: Introduction Outline Administravia What is bioinformatics Why bioinformatics Topics in bioinformatics What you will.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
BNFO 235 Lecture 5 Usman Roshan. What we have done to date Basic Perl –Data types: numbers, strings, arrays, and hashes –Control structures: If-else,
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TA: Eugene Fratkin Tuesday&Thursday 2:45-4:00 Skilling Auditorium.
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Short Primer on Comparative Genomics Today: Special guest lecture 12pm, Alway M108 Comparative genomics of animals and plants Adam Siepel Assistant Professor.
Incorporating Bioinformatics in an Algorithms Course Lawrence D’Antonio Ramapo College of New Jersey.
Recap Don’t forget to – pick a paper and – me See the schedule to see what’s taken –
Welcome to CS374 Algorithms in Biology
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
1 Bio + Informatics AAACTGCTGACCGGTAACTGAGGCCTGCCTGCAATTGCTTAACTTGGC An Overview پرتال پرتال بيوانفورماتيك ايرانيان.
CSE 6406: Bioinformatics Algorithms. Course Outline
An Introduction to Bioinformatics 2. Comparing biological sequences: sequence alignment.
Intelligent systems in bioinformatics Introduction to the course.
CSCI 6900/4900 Special Topics in Computer Science Automata and Formal Grammars for Bioinformatics Bioinformatics problems sequence comparison pattern/structure.
CHAPTER 12 STUDY GUIDE MATER LAKES ACADEMY MR. R. VAZQUEZ BIOLOGY
Introduction to Bioinformatics Biostatistics & Medical Informatics 576 Computer Sciences 576 Fall 2008 Colin Dewey Dept. of Biostatistics & Medical Informatics.
Construction of Substitution Matrices
Minimum Edit Distance Definition of Minimum Edit Distance.
Overview of Bioinformatics 1 Module Denis Manley..
AdvancedBioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2002 Mark Craven Dept. of Biostatistics & Medical Informatics.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Oswald Avery Canadian biologist ( ) Discovered DNA in 1944 with a team of scientists.
Doug Raiford Phage class: introduction to sequence databases.
Chapter 12 Remember! Chargaff’s rules The relative amounts of adenine and thymine are the same in DNA The relative amounts of cytosine and guanine are.
Sequence Similarity.
LECTURE PRESENTATIONS For CAMPBELL BIOLOGY, NINTH EDITION Jane B. Reece, Lisa A. Urry, Michael L. Cain, Steven A. Wasserman, Peter V. Minorsky, Robert.
CISC667, S07, Lec25, Liao1 CISC 467/667 Intro to Bioinformatics (Spring 2007) Review Session.
Definition of Minimum Edit Distance
Chapter 12 Molecular Genetics.
DNA and RNA Chapter 12.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Pairwise sequence Alignment.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Chapter 12 Molecular Genetics.
Presentation transcript:

Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday 12:50-2:05 Skilling Auditorium

Goals of this course Introduction to Computational Biology & Genomics  Basic concepts and scientific questions  Why does it matter?  Basic biology for computer scientists  In-depth coverage of algorithmic techniques  Current active areas of research Useful algorithms  Dynamic programming  String algorithms  HMMs and other graphical models for sequence analysis

Topics in CS262 Part 1: Basic Algorithms  Sequence Alignment & Dynamic Programming  Hidden Markov models, Context Free Grammars, Conditional Random Fields Part 2: Topics in computational genomics and areas of active research  DNA sequencing  Comparative genomics  Genes: finding genes, gene regulation  Proteins, families, and evolution  Networks of protein interactions

Course responsibilities Homeworks  4 challenging problem sets, 4-5 problems/pset Due at beginning of class Up to 3 late days (24-hr periods) for the quarter  Collaboration allowed – please give credit Teams of 2 or 3 students Individual writeups If individual (no team) then drop score of worst problem per problem set (Optional) Scribing  Due one week after the lecture, except special permission  Scribing grade replaces 2 lowest problems from all problem sets First-come first-serve, staff list to sign up

Reading material Books  “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison Chapters 1-4, 6, 7-8, 9-10  “Algorithms on strings, trees, and sequences” by Gusfield Chapters 5-7, 11-12, 13, 14, 17 Papers Lecture notes

Birth of Molecular Biology DNA Phosphate Group Sugar Nitrogenous Base A, C, G, T Physicist Ornithologist

G A G U C A G C DNA RNA A - T G - C T  U

DNA DNA is written 5’ to 3’ by convention AGACC = GGTCT 3’ 5’ 3’

Chromosomes H1DNA H2A, H2B, H3, H4 ~146bp telomere centromere nucleosome chromatin In humans: 2x22 autosomes X, Y sex chromosomes

The Genetic Dogma 3’ 5’ 3’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA (transcription) (translation) Single-stranded RNA protein Double-stranded DNA

DNA to RNA to Protein to Cell DNA, ~3x10 9 long in humans Contains ~ 22,000 genes G A G U C A G C messenger-RNA transcriptiontranslationfolding

Genetics in the 20 th Century

21 st Century AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT AGTAGGACAGACTACGACGAGACGAT CGTGCGAGCGACGGCGTAGTGTGCTG TACTGTCGTGTGTGTGTACTCTCCTC TCTCTAGTCTACGTGCTGTATGCGTT AGTGTCGTCGTCTAGTAGTCGCGATG CTCTGATGTTAGAGGATGCACGATGC TGCTGCTACTAGCGTGCTGCTGCGAT GTAGCTGTCGTACGTGTAGTGTGCTG TAAGTCGAGTGTAGCTGGCGATGTAT CGTGGT

Computational Biology Organize & analyze massive amounts of biological data  Enable biologists to use data  Form testable hypotheses  Discover new biology AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT

Some Topics in CS Sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides ~500 nucleotides

Some Topics in CS Sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT 3x10 9 nucleotides Computational Fragment Assembly Introduced ~ : assemble up to 1,000,000 long DNA pieces 2000: assemble whole human genome A big puzzle ~60 million pieces

Complete genomes today More than 300 complete genomes have been sequenced

Where are the genes? 2. Gene Finding In humans: ~22,000 genes ~1.5% of human DNA

atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag

3. Molecular Evolution

Evolution at the DNA level OK X X Still OK? next generation

4. Sequence Comparison Sequence conservation implies function Sequence comparison is key to Finding genes Determining function Uncovering the evolutionary processes

Sequence Comparison—Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- | | | | | | | | | | | | | x | | | | | | | | | | | TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Sequence Alignment Introduced ~1970 BLAST: 1990, most cited paper in history Still very active area of research query DB BLAST

5. RNA Structure Predict: Given: AGCAGAGUGG … an unfolded RNA sequence AGCACAGUGA … + aligned homologs ACUAGACAGG … CGCCGAGUCG … AGCAGUGUGG … bulge loop helix (stem) hairpin loop internal loop multi- branch loop

6. Protein networks Fresh research area Construct networks from multiple data sources Navigate networks Compare networks across organisms

Sequence Alignment

Complete DNA Sequences More than 300 complete genomes have been sequenced

Evolution

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Evolutionary Rates OK X X Still OK? next generation

Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering evolutionary events

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, N in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment? AGGCTAGTT, AGCGAAGTTT AGGCTAGTT- 6 matches, 3 mismatches, 1 gap AGCGAAGTTT AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps AG-CGAAGTTT AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gaps AG-CG-AAGTTT

Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: >> 2 N (exercise)

Alignment is additive Observation: The score of aligningx 1 ……x M y 1 ……y N is additive Say thatx 1 …x i x i+1 …x M aligns to y 1 …y j y j+1 …y N The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming There are only a polynomial number of subproblems  Align x 1 …x i to y 1 …y j Original problem is one of the subproblems  Align x 1 …x M to y 1 …y N Each subproblem is easily solved from smaller subproblems  We will show next Dynamic Programming!!!Then, we can apply Dynamic Programming!!! Let F(i, j) = optimal score of aligning x 1 ……x i y 1 ……y j F is the DP “Matrix” or “Table” “Memoization”