CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Presentation transcript:


Main Idea
Genomic regions of interest contain islands of similarity, such as genes.
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, and others

Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) L.I.S.
3. Restricted DP

Methods to CHAIN Local Alignments
Sparse Dynamic Programming, O(N log N)

The Problem: Find a Chain of Local Alignments
(x, y) → (x', y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight

Sparse Dynamic Programming: L.I.S.
Let input be w: w_1, …, w_n
INITIALIZATION:
  L: 1-indexed array, L[1] ← w_1
  B: 0-indexed array of backpointers; B[0] = 0, B[1] ← 1
  P: array used for traceback; P[1] ← 0
  // L[j]: smallest last element w_i of a j-long increasing subsequence seen so far
ALGORITHM
  for i = 2 to n {
    Find j such that L[j] < w[i] ≤ L[j+1]   // binary search; treat L[0] as −∞ and the entry past the end of L as +∞
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
  }
That's it!!!
Running time?
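
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the names `tails`, `tail_idx`, and `prev` are hypothetical stand-ins for the slide's L, B, and P, and the standard-library `bisect_left` does the "find j" step by binary search, which is what gives O(n log n) overall.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last element of an increasing run of length j+1 (slide's L)
    tail_idx = []          # tail_idx[j]: index in w of tails[j] (slide's B)
    prev = [-1] * len(w)   # prev[i]: index of the element chained before w[i] (slide's P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)      # first slot whose tail is >= value
        if j == len(tails):
            tails.append(value)            # value extends the longest run so far
            tail_idx.append(i)
        else:
            tails[j] = value               # value is a smaller tail for length j+1
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1
    # Trace back from the tail of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        out.append(w[i])
        i = prev[i]
    return out[::-1]
```

Each element costs one binary search plus constant work, so the loop runs in O(n log n) time and O(n) space.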

Sparse LCS expressed as LIS
Create a sequence w
Every matching point (i, j) is inserted into w as follows:
  For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
The 11 example points are inserted in the order given
a = (y, x), b = (y', x') can be chained iff
  a is before b in w, and
  y < y'

Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w's elements as ordered lexicographically, where (y, x) < (y', x') if y < y'
Claim: An increasing subsequence of w is a common subsequence of x and y
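
To make the reduction concrete, here is a small Python sketch that builds w from the match points of two strings and runs the LIS step on the row index. The function name `sparse_lcs` and its internals are illustrative assumptions; the construction (columns left to right, rows in decreasing order within a column) follows the slide, and the decreasing row order is what prevents two matches from the same column entering one increasing subsequence.

```python
from bisect import bisect_left

def sparse_lcs(x, y):
    """Longest common subsequence of x and y via LIS over match points."""
    # Rows of y where each character occurs, in increasing order.
    rows_of = {}
    for i, ch in enumerate(y):
        rows_of.setdefault(ch, []).append(i)
    # w: for each column j of x, the matching rows i in decreasing order.
    w = [(i, j) for j, ch in enumerate(x) for i in reversed(rows_of.get(ch, []))]
    tails, tail_pos = [], []        # smallest final row per length, and its position in w
    prev = [-1] * len(w)
    for t, (i, _j) in enumerate(w):
        p = bisect_left(tails, i)   # strictly increasing in the row index
        if p == len(tails):
            tails.append(i)
            tail_pos.append(t)
        else:
            tails[p] = i
            tail_pos[p] = t
        prev[t] = tail_pos[p - 1] if p > 0 else -1
    # Trace back and read the matched characters off y.
    out = []
    t = tail_pos[-1] if tail_pos else -1
    while t != -1:
        out.append(y[w[t][0]])
        t = prev[t]
    return "".join(reversed(out))
```

The running time is O((n + r) log r) for r match points, so the approach pays off when matches are sparse, as with long k-mer seeds rather than single characters.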

Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L =
  1. (4,2)
  2. (3,3)
  3. (3,3) (10,5)
  4. (2,5) (10,5)
  5. (2,5) (8,6)
  6. (1,6) (8,6)
  7. (1,6) (3,7)
  8. (1,6) (3,7) (4,8)
  9. (1,6) (3,7) (4,8) (7,9)
  10. (1,6) (3,7) (4,8) (5,9)
  11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11,

Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
  L is sorted by l_j: smallest (top) to largest (bottom) value
  L is implemented as a balanced binary tree

Sparse DP for rectangle chaining
Main idea:
  Sweep through x-coordinates
  To the right of b, anything chainable to a is chainable to b
  Therefore, if V(b) > V(a), rectangle a is "useless"; remove it
  In L, keep rectangles j sorted with increasing l_j-coordinates → sorted with increasing V(j) score

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i
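
The sweep can be sketched in Python as below. Several things here are assumptions of the sketch rather than part of the lecture: the tuple layout `(x_lo, x_hi, h, l, weight)`, the strict chaining condition (a precedes b only if a ends before b starts in both x and y), and the use of plain sorted lists with `bisect` in place of a balanced binary tree. List insertion and deletion are O(N) in the worst case, so this version illustrates the logic but does not guarantee the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """Best-weight chain of rectangles by a left-to-right sweep.

    rects: list of (x_lo, x_hi, h, l, weight) with x_lo < x_hi and h < l,
    where h and l are the top and bottom y-coordinates (y grows downward).
    Returns (best_score, indices of the chosen rectangles in chain order).
    """
    n = len(rects)
    if n == 0:
        return 0, []
    events = []
    for i, (x_lo, x_hi, _h, _l, _w) in enumerate(rects):
        events.append((x_lo, 0, i))   # 0 = left end; sorts before right ends at equal x
        events.append((x_hi, 1, i))   # 1 = right end
    events.sort()
    V = [0] * n
    prev = [-1] * n
    ls, vs, ids = [], [], []          # the list L: l values, scores, rectangle ids
    for _x, kind, i in events:
        _x_lo, _x_hi, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)    # ls[:p] holds the entries with l_j < h_i
            if p > 0:
                V[i] = weight + vs[p - 1]
                prev[i] = ids[p - 1]
            else:
                V[i] = weight
        else:
            p = bisect_right(ls, l)   # ls[:p] holds the entries with l_k <= l_i
            if p == 0 or V[i] > vs[p - 1]:
                start = bisect_left(ls, l)
                end = start
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1          # dominated: l_j >= l_i and V(j) <= V(i)
                ls[start:end] = [l]
                vs[start:end] = [V[i]]
                ids[start:end] = [i]
    best = max(range(n), key=V.__getitem__)
    chain = []
    i = best
    while i != -1:
        chain.append(i)
        i = prev[i]
    return V[best], chain[::-1]
```

Because both the l values and the scores stay strictly increasing in L, the predecessor found by binary search is always the best rectangle that can be chained before the current one.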

Example
[Figure: rectangles on x and y axes with labels 1: 5, 2: 6, 3: 3, 4: 4, 5:]

Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   Searching L takes log N
   Inserting to L takes log N
   All deletions are consecutive, so log N per deletion
   Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

Examples
[Figure: Human Genome Browser, panels A, B, C]

Whole-genome alignment: Rat, Mouse, Human

Next 2 years: 20+ mammals, and many other animals, will be sequenced and aligned

Protein Classification

PDB Growth
[Figure: new PDB structures]

Protein classification
Number of protein sequences grows exponentially
Number of solved structures grows exponentially
Number of new folds identified is very small (and close to constant)
Protein classification can
  Generate overview of structure types
  Detect similarities (evolutionary relationships) between protein sequences
SCOP release 1.67
[Table with columns Class, # folds, # superfamilies, # families; only the values shown survive]
  All alpha proteins
  All beta proteins
  Alpha and beta proteins (a/b)
  Alpha and beta proteins (a+b)
  Multi-domain proteins: 40 55
  Membrane & cell surface
  Small proteins
  Total
Morten Nielsen, CBS, BioCentrum, DTU

Protein structure classification
[Figure: hierarchy from protein world to protein fold, protein superfamily, and protein family]
Morten Nielsen, CBS, BioCentrum, DTU

Structure Classification Databases
SCOP
  Manual classification (A. Murzin)
  scop.berkeley.edu
CATH
  Semi-manual classification (C. Orengo)
FSSP
  Automatic classification (L. Holm)
Morten Nielsen, CBS, BioCentrum, DTU

Protein Classification
Given a new protein, can we place it in its "correct" position within an existing protein hierarchy?
Methods
  BLAST / PSI-BLAST
  Profile HMMs
  Supervised machine learning methods
[Figure: hierarchy of fold, superfamily, family, and proteins, with a new protein marked "?"]

PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position-specific matrix M
   Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1-4 until convergence
[Figure: profile M]
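
Step 3 can be illustrated with a small Python sketch of position-based sequence weights in the style of Henikoff & Henikoff (1994), followed by weighted column frequencies. This is a simplification under stated assumptions: the alignment is ungapped, the function names are hypothetical, and the flat pseudocount is a placeholder. PSI-BLAST itself uses more elaborate pseudocounts and converts frequencies to log-odds scores.

```python
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    In each column with k distinct residues, a sequence whose residue
    occurs c times in that column receives 1 / (k * c).
    """
    weights = [0.0] * len(alignment)
    for col in zip(*alignment):
        counts = Counter(col)
        k = len(counts)
        for s, residue in enumerate(col):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [wt / total for wt in weights]

def weighted_profile(alignment, alphabet, pseudocount=1.0):
    """Per-column residue frequencies using the weights above."""
    n = len(alignment)
    weights = henikoff_weights(alignment)
    profile = []
    for col in zip(*alignment):
        # Effective counts sum to n, so the pseudocount keeps its usual scale.
        eff = {a: pseudocount for a in alphabet}
        for wt, residue in zip(weights, col):
            eff[residue] += n * wt
        total = sum(eff.values())
        profile.append({a: eff[a] / total for a in alphabet})
    return profile
```

With the alignment `["AA", "AA", "CA"]`, the two identical sequences share their influence, so the third sequence contributes more to the first column than a plain count would give it.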

Profiles & Profile HMMs
PSI-BLAST builds a profile
Profile HMMs: more elaborate versions of a profile
  Intuitively, a profile that models gaps

Profile HMMs
Each M state has a position-specific pre-computed substitution table
Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
Each D state also has position-specific gap penalties
  In principle, D-D transitions can also be customized per position
[Figure: profile HMM for protein family F with states BEGIN, M_1 … M_m, I_0 … I_m, D_1 … D_m, END]

Profile HMMs
  transition between match states: α_{M(i)M(i+1)}
  transitions between match and insert states: α_{M(i)I(i)}, α_{I(i)M(i+1)}
  transition within insert state: α_{I(i)I(i)}
  transitions between match and delete states: α_{M(i)D(i+1)}, α_{D(i)M(i+1)}
  transition within delete state: α_{D(i)D(i+1)}
  emission of amino acid b at a state S: ε_S(b)

Profile HMMs
  transition probabilities ~ frequency of a transition in alignment
  emission probabilities ~ frequency of an emission in alignment
  pseudocounts are usually introduced
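
As a rough illustration of counting with pseudocounts, the Python sketch below estimates match-state emissions and match/delete transitions from a multiple alignment. It assumes every alignment column is a match column (so there are no insert states), treats `-` as a deletion, and uses a flat add-one style pseudocount; the function name and the return layout are hypothetical choices for this example.

```python
from collections import Counter

def estimate_profile(alignment, alphabet, pseudocount=1.0):
    """Estimate emissions and M/D transitions from an alignment.

    alignment: equal-length strings over `alphabet` plus '-' for gaps.
    Returns (emissions, transitions): emissions[j][a] is the probability
    of residue a in column j; transitions[j][(s, t)] is the probability
    of moving from state s in column j to state t in column j+1,
    with s and t each either 'M' or 'D'.
    """
    m = len(alignment[0])
    emissions = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    transitions = []
    for j in range(m - 1):
        counts = Counter()
        for seq in alignment:
            src = "D" if seq[j] == "-" else "M"
            dst = "D" if seq[j + 1] == "-" else "M"
            counts[(src, dst)] += 1
        probs = {}
        for src in "MD":
            total = counts[(src, "M")] + counts[(src, "D")] + 2 * pseudocount
            for dst in "MD":
                probs[(src, dst)] = (counts[(src, dst)] + pseudocount) / total
        transitions.append(probs)
    return emissions, transitions
```

The pseudocount keeps unseen residues and unseen transitions at nonzero probability, which matters because a single zero would give any sequence that uses it a score of minus infinity.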

Alignment of a protein to a profile HMM
To align sequence x_1 … x_n to a profile HMM:
We will find the most likely alignment with the Viterbi DP algorithm
Define
  V_j^M(i): score of best alignment of x_1 … x_i to the HMM ending in x_i being emitted from M_j
  V_j^I(i): score of best alignment of x_1 … x_i to the HMM ending in x_i being emitted from I_j
  V_j^D(i): score of best alignment of x_1 … x_i to the HMM ending in D_j (x_i is the last character emitted before D_j)
Denote by q_a the frequency of amino acid a in a 'random' protein
You can fill in the details! (or, read Durbin et al. Chapter 5)
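
One way to fill in the details is the standard log-odds Viterbi recurrence for profile HMMs, as presented in Durbin et al. The sketch below writes it with α for transitions, ε for emissions, and q for background frequencies. The I-to-D and D-to-I terms are included for completeness and can be dropped if the model has no such transitions; starting from V_0^M(0) = 0 at BEGIN is an assumption of this sketch.

```latex
\begin{align*}
V_j^M(i) &= \log\frac{\varepsilon_{M_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
  V_{j-1}^M(i-1) + \log \alpha_{M(j-1)M(j)} \\
  V_{j-1}^I(i-1) + \log \alpha_{I(j-1)M(j)} \\
  V_{j-1}^D(i-1) + \log \alpha_{D(j-1)M(j)}
\end{cases} \\[4pt]
V_j^I(i) &= \log\frac{\varepsilon_{I_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
  V_j^M(i-1) + \log \alpha_{M(j)I(j)} \\
  V_j^I(i-1) + \log \alpha_{I(j)I(j)} \\
  V_j^D(i-1) + \log \alpha_{D(j)I(j)}
\end{cases} \\[4pt]
V_j^D(i) &= \max
\begin{cases}
  V_{j-1}^M(i) + \log \alpha_{M(j-1)D(j)} \\
  V_{j-1}^I(i) + \log \alpha_{I(j-1)D(j)} \\
  V_{j-1}^D(i) + \log \alpha_{D(j-1)D(j)}
\end{cases}
\end{align*}
```

Match and insert states consume a character, so they look back to i-1; delete states emit nothing, so they stay at i and only advance j.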

How to build a profile HMM

Resources on the web
HMMer: a free profile HMM software
SAM: another free profile HMM software
PFAM: database of alignments and HMMs for protein families and domains
SCOP: a structural classification of proteins

Classification with Profile HMMs
[Figure: hierarchy of fold, superfamily, and family, with a new protein marked "?"]

Classification with Profile HMMs
How generative models work
  Training examples (sequences known to be members of the family): positive
  Tuning parameters with a priori knowledge
  Model assigns a probability to any given protein sequence
  Idea: sequences from the family (hopefully) yield a higher probability than sequences outside the family
Log-likelihood ratio as score
  L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}
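
A toy version of this score in Python is sketched below. To keep it short, the family model H_1 is an ungapped position-specific emission table standing in for a profile HMM, and the null model H_0 is a background residue distribution; both choices, along with the function name and the `prior_family` parameter, are assumptions of this example. A real profile HMM would obtain P(X | H_1) by summing over paths with the forward algorithm.

```python
from math import log

def log_odds(seq, emissions, background, prior_family=0.5):
    """Log posterior odds that seq belongs to the family.

    emissions: one dict per position mapping residue -> probability (H_1).
    background: dict mapping residue -> probability (H_0).
    prior_family: prior probability P(H_1); P(H_0) is its complement.
    """
    if len(seq) != len(emissions):
        raise ValueError("this ungapped sketch needs len(seq) == len(emissions)")
    score = log(prior_family / (1.0 - prior_family))    # log P(H_1) / P(H_0)
    for ch, column in zip(seq, emissions):
        score += log(column[ch] / background[ch])       # per-position log-odds
    return score
```

A positive score means the family model explains the sequence better than the background does, after accounting for the prior.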

Generative Models

Discriminative Models: SVM
Decision rule: red if v^T x > 0
If x_1 … x_n are training examples, sign(Σ_i λ_i x_i^T x) "decides" where x falls
Train λ_i to achieve best margin
Large margin for |v| < 1 corresponds to margin of 1 for small |v|
[Figure: separating hyperplane with normal vector v and its margin]
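
The decision rule can be written as a few lines of Python. This sketch assumes the trained coefficients are already available and carry the sign of each training example's class; the names `decide`, `dot`, and `coefficients` are illustrative. The `kernel` argument defaults to the plain dot product and can be swapped for any other kernel function.

```python
def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def decide(x, support_vectors, coefficients, kernel=dot):
    """Return +1 or -1 according to the sign of the weighted kernel sum."""
    total = sum(c * kernel(sv, x) for c, sv in zip(coefficients, support_vectors))
    return 1 if total > 0 else -1
```

For example, with support vectors (1, 1) and (-1, -1) and coefficients 0.5 and -0.5, the implied normal vector is v = (1, 1), so points with a positive coordinate sum are classified as +1.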

k-mer based SVMs for protein classification
For given word size k, and mismatch tolerance l, define
  K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches
Define normalized kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
SVM can be learned by supplying this kernel function
Example: let k = 3, l = 1
  X = ABACARDI
  Y = ABRADABI
  K(X, Y) = 4
  K'(X, Y) = 4 / sqrt(7 * 7) = 4/7
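
The counting in this example can be reproduced with the short Python sketch below. It adopts one reading of the slide's definition, namely counting distinct unordered pairs of k-mers (one from each sequence) that differ in at most l positions, because that reading yields the slide's values of 4 and 7. It is not the feature-space mismatch kernel of Leslie et al., and the function names are hypothetical.

```python
from math import sqrt

def kmers(s, k):
    """Set of distinct k-long words in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def hamming(u, v):
    """Number of positions where equal-length words u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def pair_count_kernel(x, y, k=3, l=1):
    """Count distinct k-mer pairs (one from x, one from y) within l mismatches."""
    pairs = {
        frozenset((u, v))
        for u in kmers(x, k)
        for v in kmers(y, k)
        if hamming(u, v) <= l
    }
    return len(pairs)

def normalized_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return pair_count_kernel(x, y, k, l) / sqrt(
        pair_count_kernel(x, x, k, l) * pair_count_kernel(y, y, k, l)
    )
```

For X = ABACARDI and Y = ABRADABI with k = 3 and l = 1, the matching pairs are (ABA, ABR), (ABA, ADA), (ABA, ABI), and (ACA, ADA), which gives K(X, Y) = 4.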

SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X

Benchmarks