Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.



Which Program Should One Use?
Most researchers use methods that detect local similarities:
- Smith-Waterman (the gold standard)
- FASTA
- BLAST
FASTA and BLAST are heuristics: they do not find every possible alignment of the query with a database sequence. They are used because they run faster than Smith-Waterman.

Heuristic Database Search Methods
Smith-Waterman dynamic programming is too compute- and time-intensive for searching big databases
- e.g., UniProt, July 2004: 1.5M sequences
Most popular: the BLAST family (Altschul et al. 1990, 1997) and the FASTA family (Lipman and Pearson 1985)
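To make the cost concrete, here is a minimal sketch of Smith-Waterman scoring. The match, mismatch and gap values are illustrative assumptions (a real protein search would use a substitution matrix such as BLOSUM62 and usually affine gap penalties). Every cell of a len(q) x len(d) table is filled, which is what makes running it against each sequence in a large database expensive.

```python
def smith_waterman_score(q, d, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between q and d.

    Scoring values are toy assumptions; gaps use a linear penalty.
    Only two rows of the DP table are kept in memory.
    """
    prev = [0] * (len(d) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        curr = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            s = match if q[i - 1] == d[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The work grows with len(q) * len(d) per database sequence, which is the motivation for the heuristics that follow.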

BLAST – Basic Local Alignment Search Tool
Basic idea:
- Identify short, very similar segment pairs, then extend them into local alignments
Critical issues:
- For every database sequence d that is significantly similar to the query q, at least one segment pair should be found
- Fewer segment pairs means faster computation

Definitions
- Maximal Segment Pair (MSP_qd): the pair of equal-length segments with the highest score among all ungapped local alignments between q and d
- High-Scoring Segment Pair (HSP): a segment pair whose score cannot be increased by shortening or extending it
- Word: a segment of fixed length w
- Word pair: a pair of segments, each of length w
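The MSP definition can be illustrated with a brute-force sketch: an ungapped alignment lies on a single diagonal, so the best segment pair is the best-scoring run over all diagonals. The function name `msp_score` and the +1/-1 scoring function are assumptions for illustration only.

```python
def toy_score(a, b):
    """Hypothetical residue score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def msp_score(q, d, score=toy_score):
    """Score of the maximal segment pair (best ungapped local alignment)."""
    best = 0
    # Diagonal k holds all residue pairs (i, j) with j - i == k.
    for k in range(-(len(q) - 1), len(d)):
        running = 0
        i = max(0, -k)
        j = i + k
        while i < len(q) and j < len(d):
            # Restart the segment whenever the running score drops below 0.
            running = max(0, running + score(q[i], d[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best
```

This exhaustive scan visits every residue pair, so it is no faster than dynamic programming; BLAST avoids it by looking only near word hits.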

Reformulating the Problem
- Identify those database sequences d for which the score of MSP_qd is over a threshold V
- A segment pair scoring at least V contains, with high probability, a word pair scoring at least T
- Identify word pairs scoring at least T, extend them to high-scoring segment pairs, and check whether the score is over V

Finding Hits and HSPs
Hit: a word pair scoring at least T
- Preprocess q
  - Find all words o_T (of length w) that can score at least T against a word in q
  - Save them in an easy-to-use data structure
- Find the hits
  - Search d for all occurrences (o_d) of the words o_T
- Extend (heuristically) to high-scoring segment pairs
- Perform dynamic programming around HSPs scoring over a certain threshold, which allows gaps to be introduced
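The preprocessing step can be sketched as follows. This is a naive enumeration under stated assumptions: `toy_score` (+5 identity, -2 otherwise) stands in for a real substitution matrix, and the function names are hypothetical. For each length-w word of q it tries every possible word and keeps those scoring at least T.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical residue score; a real search would use e.g. BLOSUM62."""
    return 5 if a == b else -2


def word_score(u, v, score=toy_score):
    """Score of two equal-length words compared position by position."""
    return sum(score(a, b) for a, b in zip(u, v))


def neighborhood(q, w, T, score=toy_score, alphabet=AMINO_ACIDS):
    """Map each word scoring >= T against a word of q to its q positions."""
    words = {}
    for i in range(len(q) - w + 1):
        q_word = q[i:i + w]
        for letters in product(alphabet, repeat=w):
            candidate = "".join(letters)
            if word_score(q_word, candidate, score) >= T:
                words.setdefault(candidate, []).append(i)
    return words
```

Raising T shrinks the word set (fewer hits, faster search); lowering it makes the search more sensitive at the cost of more hits to extend.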

Pre-processing q
Aim:
- Allow rapid identification of all words o_T in d, together with the locations of the corresponding words in q, so that hits can be extended into HSPs
Possibility: a table of 20^w entries (one per possible word of length w over the 20 amino acids)
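One possible realization of such a table is a dictionary keyed by word, which stores only the words that actually occur instead of all 20^w slots. The sketch below indexes the exact words of q for brevity (neighborhood words would be inserted the same way); `build_word_index` and `find_hits` are hypothetical names.

```python
from collections import defaultdict


def build_word_index(q, w):
    """Map every length-w word of q to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(q) - w + 1):
        index[q[i:i + w]].append(i)
    return index


def find_hits(index, d, w):
    """Scan d once and return (i, j) pairs: word at q[i] hits word at d[j]."""
    hits = []
    for j in range(len(d) - w + 1):
        # .get avoids creating empty entries for words absent from q.
        for i in index.get(d[j:j + w], ()):
            hits.append((i, j))
    return hits


# A full direct-address table for w = 3 would need 20 ** 3 = 8000 entries.
full_table_size = 20 ** 3
```

Each word of d costs one lookup, so scanning a database sequence is linear in its length.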


Finding HSPs
- For each word in d (starting at position j) that hits a word in q (starting at position i), record the hit indexed by its diagonal (j - i)
- Hits close together on the same diagonal are joined before extension to HSPs
Extending to an HSP:
- Ideally: move to the ends of the sequences in both directions
- Heuristic: if the score falls "far below" the best seen so far, stop the extension
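The diagonal bookkeeping and the drop-off heuristic might look like the sketch below. It is a simplified illustration, not BLAST's actual implementation: the parameter `x_drop` (how far "far below" is), the +1/-1 scoring and the function names are assumptions.

```python
from collections import defaultdict


def toy_score(a, b):
    """Hypothetical residue score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def hits_by_diagonal(hits):
    """Group (i, j) hits by their diagonal j - i."""
    groups = defaultdict(list)
    for i, j in hits:
        groups[j - i].append((i, j))
    return groups


def _extend(q, d, i, j, step, x_drop, score):
    """Walk along the diagonal from (i, j); return (best gain, residues kept)."""
    best, run, kept, n = 0, 0, 0, 0
    while 0 <= i < len(q) and 0 <= j < len(d):
        run += score(q[i], d[j])
        n += 1
        if run > best:
            best, kept = run, n
        if best - run > x_drop:  # fallen too far below the best seen so far
            break
        i += step
        j += step
    return best, kept


def extend_hit(q, d, i, j, w, x_drop, score=toy_score):
    """Ungapped extension of the word hit q[i:i+w] / d[j:j+w].

    Returns (score, q_start, q_end) with q_end exclusive.
    """
    core = sum(score(q[i + k], d[j + k]) for k in range(w))
    right, n_right = _extend(q, d, i + w, j + w, +1, x_drop, score)
    left, n_left = _extend(q, d, i - 1, j - 1, -1, x_drop, score)
    return core + left + right, i - n_left, i + w + n_right
```

For example, `extend_hit("GGACDEFGHKK", "TTACDEFGHPP", 2, 2, 3, x_drop=2)` grows the word ACD into the segment ACDEFGH.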

Dynamic Programming Around HSPs
- DP is time-consuming and needs to be constrained
- Starting from an identified HSP, find a "seed pair"
- Perform "forward" and "backward" DP from the seed pair (independently)
- Stop DP if the score falls T below the best score S' seen so far

Significance of alignments
Suppose an alignment reveals an intriguing similarity between two sequences. Is the similarity significant? Or could it have arisen by chance?

Significance of alignment
If the score of the observed alignment is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance.

How to Generate the Random Sequences?
- Global alignment: randomize one of the sequences many times, realign each result to the second (fixed) sequence, and collect the distribution of the resulting scores
- Local alignment: use the results returned from the entire database as the population against which to measure the statistics
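The global-alignment procedure can be sketched as a small shuffle test. The scoring values, the number of shuffles `n` and the fixed `seed` are illustrative assumptions; `global_score` is a plain Needleman-Wunsch score with a linear gap penalty.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (toy scoring values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def shuffled_scores(a, b, n=100, seed=0):
    """Scores of n random permutations of a, each realigned to the fixed b."""
    rng = random.Random(seed)  # seeded so the control population is repeatable
    letters = list(a)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)  # keeps the composition of a, destroys its order
        scores.append(global_score("".join(letters), b))
    return scores
```

Shuffling preserves length and residue composition, so the control scores reflect what composition alone would produce.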

Statistical parameters
Z-score
- A measure of how unusual the original match is
- A Z-score of 0 means the observed similarity is no better than the average of the control population
- The higher the Z-score, the greater the probability that the match did not arise by chance
- Rule of thumb: a Z-score ≥ 5 is considered significant
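As a worked illustration, the Z-score is the observed score's distance from the mean of the control population, measured in standard deviations: Z = (S - mean) / sd. The sketch below assumes the control scores are a sample (so it uses the sample standard deviation); the function name is hypothetical.

```python
from statistics import mean, stdev


def z_score(observed, control_scores):
    """Standard deviations by which `observed` exceeds the control mean."""
    mu = mean(control_scores)
    sigma = stdev(control_scores)  # sample standard deviation, needs >= 2 values
    if sigma == 0:
        raise ValueError("control scores have no spread; Z-score is undefined")
    return (observed - mu) / sigma
```

The control scores would typically come from alignments against randomized sequences or from the rest of the database.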

Statistical parameters
P = the probability that the alignment is no better than random
- P ≤ [...]: exact match
- P in range [...]: sequences very nearly identical
- P in range [...]: closely related sequences, homology certain
- P in range [...]: usually distant relatives
- P > [...]: match probably insignificant

Statistical parameters
E-value
- The expected number of sequences that give the same Z-score or better if the database is probed with a random sequence
- Found by multiplying the value of P by the size of the database probed
- Note that E, but not P, depends on the size of the database

Statistical parameters
Interpreting E-values
- E ≤ 0.02: sequences probably homologous
- E between 0.02 and 1: homology cannot be ruled out
- E > 1: a match this good would be expected just by chance
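The E-value arithmetic and the interpretation bands can be written out as a short sketch. The function names are hypothetical; the thresholds 0.02 and 1 are the ones given on the slide, and E is computed as P times the database size as described for the E-value.

```python
def e_value(p, database_size):
    """Expected number of chance matches: P times the database size."""
    return p * database_size


def interpret_e(e):
    """Apply the rule-of-thumb bands for E-values."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology cannot be ruled out"
    return "a match this good would be expected just by chance"
```

For instance, P = 1e-8 against a database of 1.5 million sequences gives E of about 0.015, which falls in the "probably homologous" band; the same P against a database a hundred times larger would not.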

Rules and thinking
Percent of identical residues in the optimal alignment:
- Over 45%: very similar structures, and a common or at least related function
- Over 25%: a similar general folding pattern
- A lower degree of sequence similarity cannot rule out homology

Rules and thinking
- 18%-25%: the twilight zone, where the suggestion of homology is tantalizing but dangerous
- Absence of significant similarity does not imply that the sequences are not homologous; they could be distantly related (twilight zone or beyond)
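These percent-identity rules can be applied mechanically, as in the sketch below. Two assumptions are worth flagging: percent identity is computed over aligned columns that contain no gap (other denominators are also in use), and the inputs are two already-aligned strings with "-" marking gaps. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def rule_of_thumb(pid):
    """Map a percent identity onto the bands described on the slides."""
    if pid > 45:
        return "very similar structures, common or related function"
    if pid > 25:
        return "similar general folding pattern"
    if pid >= 18:
        return "twilight zone"
    return "no significant similarity; homology cannot be ruled out"
```

The bands are guidelines only: a low percent identity does not by itself rule out homology.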