Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &

Slides:



Advertisements
Similar presentations
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Advertisements

Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Sequence Alignment Tutorial #2
Lecture outline Sequence alignment Gaps and scoring matrices
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Sequence Alignment.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
Aligning Alignments Soni Mukherjee 11/11/04. Pairwise Alignment Given two sequences, find their optimal alignment Score = (#matches) * m - (#mismatches)
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Alignment II Dynamic Programming
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Deoxyribonucleic acid (DNA) Biometrics CPSC 4600 Biometrics and Cryptography.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
Introduction to Profile Hidden Markov Models
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Minimum Edit Distance Definition of Minimum Edit Distance.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
1 Выравнивание двух последовательностей. 2 AGC A A A C
Sequence Similarity.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Definition of Minimum Edit Distance
Lecture 5: Local Sequence Alignment Algorithms
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Presentation transcript:

Welcome to CS262!

Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics & applications In-depth coverage of Computational Genomics  Algorithms for sequence analysis  Current applications, trends, and open problems Coverage of useful algorithms  Hidden Markov models  Dynamic Programming  String algorithms  Applications of AI techniques

Topics in CS262 Part 1: In-depth coverage of basic computational methods for analysis of biological sequences  Sequence Alignment & Dynamic Programming  Hidden Markov models These methods are used heavily in most genomics applications:  DNA sequencing  Comparison of DNA and proteins across organisms  Discovery of genes, promoters, regulatory sites

Topics in CS262 Part 2: Topics in computational genomics, more algorithms, and areas of active research  DNA sequencing & assembly: reading a complete genome such as the human DNA  Gene finding: marking genes on the DNA sequence  Large-scale comparative genomics: comparing whole genomes from multiple organisms  Microarrays & regulation: understanding the regulatory code, and potential disease-causing genes  RNA structure: predicting the folding of RNA  Phylogeny and evolution: quantifying the evolution of biological sequences

Course responsibilities Homeworks[80%]  4 challenging problem sets, 4-5 problems/pset  Collaboration allowed – please give credit Final[20%]  Takehome, 1 day  Collaboration not allowed  Basic questions – much easier than homeworks ScribingOptional  Due one week after the lecture, except special permission  Scribing grade replaces 2 lowest problems from homeworks

Reading material Books  “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchinson Chapters 1-4, 6, (7-8), (9-10)  “Algorithms on strings, trees, and sequences” by Gusfield Chapters (5-7), 11-12, (13), 14, (17) Papers Lecture notes

Topic 1. Sequence Alignment

Biology in One Slide: 2 Paradigms Molecular Paradigm Evolution Paradigm

Complete DNA Sequences nearly 200 complete genomes have been sequenced

Evolution

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Evolutionary Rates OK X X Still OK? next generation

Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering the evolutionary forces

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, N in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG.CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: O( 2 M+N )

Alignment is additive Observation: The score of aligningx 1 ……x M y 1 ……y N is additive Say thatx 1 …x i x i+1 …x M aligns to y 1 …y j y j+1 …y N The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming We will now describe a dynamic programming algorithm Suppose we wish to align x 1 ……x M y 1 ……y N Let F(i,j) = optimal score of aligning x 1 ……x i y 1 ……y j

Dynamic Programming (cont’d) Notice three possible cases: 1.x i aligns to y j x 1 ……x i-1 x i y 1 ……y j-1 y j 2.x i aligns to a gap x 1 ……x i-1 x i y 1 ……y j - 3.y j aligns to a gap x 1 ……x i - y 1 ……y j-1 y j m, if x i = y j F(i,j) = F(i-1, j-1) + -s, if not F(i,j) = F(i-1, j) - d F(i,j) = F(i, j-1) - d

Dynamic Programming (cont’d) How do we know which case is correct? Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal Then, F(i-1, j-1) + s(x i, y j ) F(i, j) = max F(i-1, j) – d F( i, j-1) – d Where s(x i, y j ) = m, if x i = y j ;-s, if not

AGTA A10 -2 T 0010 A-3 02 F(i,j) i = Example x = AGTAm = 1 y = ATAs = -1 d = -1 j = Optimal Alignment: F(4,3) = 2 AGTA A - TA

The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For eachj = 1……N F(i-1,j-1) + s(x i, y j ) [case 1] F(i, j) = max F(i-1, j) – d [case 2] F(i, j-1) – d [case 3] DIAG, if [case 1] Ptr(i,j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG Then, we don’t want to penalize gaps in the ends

Different types of overlaps

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y 1 ……………………………… y N

The local alignment problem Given two strings x = x 1 ……x M, y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g.x = aaaacccccgggg y = cccgggaaccaacc

Why local alignment Genes are shuffled between genomes Portions of proteins (domains) are often conserved

Cross-species genome similarity 98% of genes are conserved between any two mammals >70% average similarity in protein sequence hum_a : 57331/ mus_a : 78560/ rat_a : / fug_a : 36008/68174 hum_a : 57381/ mus_a : 78610/ rat_a : / fug_a : 36058/68174 hum_a : 57431/ mus_a : 78659/ rat_a : / fug_a : 36084/68174 hum_a : 57481/ mus_a : 78708/ rat_a : / fug_a : 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) 2.If we want all local alignments scoring > t ??For all i, j find F(i, j) > t, and trace back Complicated by overlapping local alignments

Scoring the gaps more accurately Current model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1)  (n)

Convex gap dynamic programming Initialization:same Iteration: F(i-1, j-1) + s(x i, y j ) F(i, j) = maxmax k=0…i-1 F(k,j) –  (i-k) max k=0…j-1 F(i,k) –  (j-k) Termination: same Running Time: O(N 2 M)(assume N>M) Space:O(NM)

Compromise: affine gaps  (n) = d + (n – 1)  e | | gap gap open extend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i aligns to a gap after y j if H(i, j): score if y j aligns to a gap after x i V(i, j) = best score of alignment x 1 …x i to y 1 …y j d e  (n)

Needleman-Wunsch with affine gaps Why do we need two matrices? x i aligns to y j x 1 ……x i-1 x i x i+1 y 1 ……y j-1 y j - 2.x i aligns to a gap x 1 ……x i-1 x i x i+1 y 1 ……y j …- - Add -d Add -e

Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(x i, y j ) V(i, j – 1) – d G(i, j) = max G(i, j – 1) – e V(i – 1, j) – d H(i, j) = max H(i – 1, j) – e Termination: similar

To generalize a little… … think of how you would compute optimal alignment with this gap function ….in time O(MN)  (n)

Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) M ) xixi Then,|implies | i – j | < k(N) yj yj We can align x and y more efficiently: Time, Space: O(N  k(N)) << O(N 2 )

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same Easy to extend to the affine gap case x 1 ………………………… x M y 1 ………………………… y N k(N)