Sequence Alignment Lecture 2, Thursday April 3, 2003.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Sequence Alignment.
Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs of Words, Patterns 3.Systems.
Linear-Space Alignment. Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
Sequence Alignment Cont’d. Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
CS273a Lecture 10, Aut 08, Batzoglou Multiple Sequence Alignment.
Time Warping Hidden Markov Models Lecture 2, Thursday April 3, 2003.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Index-based search of single sequences Omkar Mate CS 374 Stanford University.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
CS 6293 Advanced Topics: Current Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment.
Sequence Alignment Cont’d. Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Heuristic Approaches for Sequence Alignments
CS262 Lecture 4, Win07, Batzoglou Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
. Sequence Alignment II Lecture #3 This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then by Shlomo Moran. Background.
Sequence Alignment Cont’d. CS262 Lecture 4, Win06, Batzoglou Indexing-based local alignment (BLAST- Basic Local Alignment Search Tool) 1.SEED Construct.
Short Primer on Comparative Genomics Today: Special guest lecture 12pm, Alway M108 Comparative genomics of animals and plants Adam Siepel Assistant Professor.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No. 1, January/February 2002 Presented by Jitimon Keinduangjun.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Algorithms (BLAST)
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Construction of Substitution matrices
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.
CS 5263 Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
CS 6293 Advanced Topics: Translational Bioinformatics
Homology Search Tools Kun-Mao Chao (趙坤茂)
Pairwise sequence Alignment.
Basic Local Alignment Search Tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
CSE 5290: Algorithms for Bioinformatics Fall 2009
Presentation transcript:

Sequence Alignment Lecture 2, Thursday April 3, 2003

Review of Last Lecture Lecture 2, Thursday April 3, 2003

Lecture 3, Tuesday April 8, 2003 Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation mature mRNA protein a gene

Lecture 3, Tuesday April 8, 2003 The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

Lecture 3, Tuesday April 8, 2003 Affine Gaps  (n) = d + (n – 1)  e |gap openextend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i, or y j, aligns to a gap d e  (n)

Lecture 3, Tuesday April 8, 2003 Needleman-Wunsch with affine gaps Initialization:F(i, 0) = d + (i – 1)  e F(0, j) = d + (j – 1)  e Iteration: F(i – 1, j – 1) + s(x i, y j ) F(i, j) = max G(i – 1, j – 1) + s(x i, y j ) F(i – 1, j) – d F(i, j – 1) – d G(i, j) = max G(i, j – 1) – e G(i – 1, j) – e Termination: same

Lecture 3, Tuesday April 8, 2003 Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same Easy to extend to the affine gap case x 1 ………………………… x M y 1 ………………………… y N k(N)

Lecture 3, Tuesday April 8, 2003 Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*

Lecture 3, Tuesday April 8, 2003 Today The Four-Russian Speedup Heuristic Local Aligners –BLAST –BLAT –PatternHunter Time Warping

The Four-Russian Algorithm A useful speedup of Dynamic Programming

Lecture 3, Tuesday April 8, 2003 Main Observation Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings x l...l’, y r…r’ Definition: A t-block is a t  t square of the DP matrix Idea: Divide matrix in t-blocks, Precompute t-blocks Speedup: O(t) A B C D xlxl x l’ yryr y r’ t

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Main structure of the algorithm: Divide N  N DP matrix into K  K log 2 N- blocks that overlap by 1 column & 1 row For i = 1……K For j = 1……K Compute D i,j as a function of A i,j, B i,j, C i,j, x[l i …l’ i ], y[r j …r’ j ] Time: O(N 2 / log 2 N) times the cost of step 4 t t t

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Another observation: ( Assume m = 0, s = 1, d = 1 ) Lemma. Two adjacent cells of F(.,.) differ by at most 1 Gusfield’s book covers case where m = 0, called the edit distance (p. 216): minimum # of substitutions + gaps to transform one string to another

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Proof of Lemma: 1.Same row: a.F(i, j) – F(i – 1, j)  +1 At worst, one more gap:x 1 ……x i-1 x i y 1 ……y j – a.F(i, j) – F(i – 1, j)  -1 F(i, j)F(i – 1, j – 1) F(i, j) – F(i – 1, j – 1) x 1 ……x i-1 x i x 1 ……x i-1 – y 1 ……y a-1 y a y a+1 …y j y 1 ……y a-1 y a y a+1 …y j  -1 x 1 ……x i-1 x i x 1 ……x i-1 y 1 ……y a-1 – y a …y j y 1 ……y a-1 y a …y j +1 2.Same column: similar argument

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Proof of Lemma: 3.Same diagonal: a.F(i, j) – F(i – 1, j – 1)  +1 At worst, one additional mismatch in F(i, j) b.F(i, j) – F(i – 1, j – 1)  - 1 F(i, j)F(i – 1, j – 1)F(i, j) – F(i – 1, j – 1) x 1 ……x i-1 x i x 1 ……x i-1 | y 1 ……y i-1 y j y 1 ……y j-1  -1 x 1 ……x i-1 x i x 1 ……x i-1 y 1 ……y a-1 – y a …y j y 1 ……y a-1 y a …y j +1

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Definition: The offset vector is a t-long vector of values from {-2, -1, 0, 1, 2}, where the first entry is 0 If we know the value at A, and the top row, left column offset vectors, and x l ……x l’, y r ……y r’, Then we can find D A B C D xlxl x l’ yryr y r’ t

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Example: x = AACT y = CACT AAC T C A C T

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Example: x = AACT y = CACT AAC T C A C T

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Definition: The offset function of a t-block is a function that for any given offset vectors of top row, left column, and x l ……x l’, y r ……y r’, produces offset vectors of bottom row, right column A B C D xlxl x l’ yryr y r’ t

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm We can pre-compute the offset function: 5 2(t-1) possible input offset vectors 4 2t possible strings x l ……x l’, y r ……y r’ Therefore 5 2(t-1)  4 2t values to pre-compute We can keep all these values in a table, and look up in linear time, or in O(1) time if we assume constant-lookup RAM for log-sized inputs

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm Four-Russians Algorithm: (Arlazarov, Dinic, Kronrod, Faradzev) 1.Cover the DP table with t-blocks 2.Initialize values F(.,.) in first row & col 3.Row-by-row, use offset values at leftmost col and top row of each block, to find offset values at rightmost col and bottom row 4.Let Q = total of offsets at row N F(N, N) = Q + F(N, 0)

Lecture 3, Tuesday April 8, 2003 The Four-Russian Algorithm t t t

Heuristic Local Aligners BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT, PatternHunter, ……

Lecture 3, Tuesday April 8, 2003 State of biological databases Sequenced Genomes: Human 3  10 9 Yeast1.2  10 7 Mouse2.7  10 9  12 different strains Rat2.6  10 9 Neurospora 4  more fungi within next year Fugu fish 3.3  10 8 Tetraodon 3  10 8 ~250 bacteria/viruses Mosquito 2.8  10 8 Next 2 years: Drosophila 1.2  10 8 Dog, Chimpanzee, Chicken Worm 1.0  sea squirts  1.6  10 8 Current rate of sequencing: Rice 1.0  big labs  3  10 9 bp /year/lab Arabidopsis 1.2  s small labs

Lecture 3, Tuesday April 8, 2003 State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences

Lecture 3, Tuesday April 8, 2003 Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? - How fast does it evolve? Assume we try Smith-Waterman: The entire genomic database Our new gene

Lecture 3, Tuesday April 8, 2003 Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? -Potential genes - Other biological characteristics Assume we try Smith-Waterman: The entire genomic database Our newly sequenced mammal 3 

Lecture 3, Tuesday April 8, 2003 BLAST (Basic Local Alignment Search Tool) Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB

Lecture 3, Tuesday April 8, 2003 BLAST  Original Version Dictionary: All words of length k (~11) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… query DB query scan

Lecture 3, Tuesday April 8, 2003 BLAST  Original Version A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC

Lecture 3, Tuesday April 8, 2003 Gapped BLAST A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Added features: Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT

Lecture 3, Tuesday April 8, 2003 Gapped BLAST A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Added features: Pairs of words can initiate alignment Nearby alignments are merged Extensions with gaps until score < T below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT

Lecture 3, Tuesday April 8, 2003 Variants of BLAST MEGABLAST: Optimized to align very similar sequences Works best when k = 4i  16 Linear gap penalty PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated WU-BLAST: (Wash U BLAST) Optimized, added features BlastZ Combines BLAST/PatternHunter methodology

Lecture 3, Tuesday April 8, 2003 Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plusgi| |gb|AC | Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: tacacccagattacaccccga Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: tacacccagattacaccccga >gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plusgi| |gb|AC | Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 3891 tacacccagattacaccccga 3911

Lecture 3, Tuesday April 8, 2003 Example Query: Human atoh enhancer, 179 letters[1.5 min] Result: 57 blast hits 1. gi| |gb|AF |AF Homo sapiens ATOH1 enhanc e-95 gi| |gb|AF |AF gi| |gb|AC | Mus musculus Strain C57BL6/J ch e-68gi| |gb|AC |264 3.gi| |gb|AF |AF Mus musculus Atoh1 enhanc e-66gi| |gb|AF |AF gi| |gb|AF | Gallus gallus CATH1 (CATH1) gene e-12gi| |gb|AF |78 5.gi| |emb|AL | Zebrafish DNA sequence from clo e-05gi| |emb|AL |54 6.gi| |gb|AC | Oryza sativa chromosome 10 BAC O gi| |gb|AC |44 7.gi| |ref|NM_ | Mus musculus suppressor of Ty gi| |ref|NM_ |42 8.gi| |gb|BC | Mus musculus, Similar to suppres gi| |gb|BC |42 gi| |gb|AF |AF218258gi| |gb|AF |AF Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

Lecture 3, Tuesday April 8, 2003 BLAT: Blast-Like Alignment Tool Differences with BLAST: 1.Dictionary of the DB(BLAST: query) 2.Initiate alignment on any # of matching hits (BLAST: 1 or 2 hits) 3.More aggressive stitching-together of alignments 4.Special code to align mature mRNA to DNA Result: very efficient for: Aligning many seqs to a DB Aligning two genomes

Lecture 3, Tuesday April 8, 2003 PatternHunter Main features: Non-consecutive position words Highly optimized 5 hits 3 hits 7 hits Consecutive PositionsNon-Consecutive Positions 6 hits On a 70% conserved region: Consecutive Non-consecutive Expected # hits: Prob[at least one hit]:

Lecture 3, Tuesday April 8, 2003 Advantage of Non-Consecutive Words 11 positions 10 positions