Practical algorithms in Sequence Alignment. Sushmita Roy, BMI/CS 576, Sep 16th, 2014.



Key concepts
Assessing the significance of an alignment
– The extreme value distribution gives an analytical form for computing the significance of a score
Heuristic algorithms
– BLAST algorithm

Readings from book Chapter 2 – Section 2.5 – Section 2.7

RECAP: Four issues in sequence alignment Type Algorithm Score Significance

Classical approach to assessing the significance of a score
Develop a “background” distribution of alignment scores
Assess the probability of observing our score S under this background distribution

Thought experiment for generating the background distribution
Suppose we assume
– Sequence lengths m and n
– A particular substitution matrix and amino-acid frequencies
Then we generate random sequences of lengths m and n and find the best alignment of each pair
This gives us a distribution over alignment scores for random pairs of sequences
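The thought experiment can be tried out directly. The following is a minimal sketch under simplifying assumptions that are not in the slides: uniform random DNA instead of amino-acid frequencies, a +1/-1 match/mismatch score instead of a substitution matrix, and ungapped local alignments only. The function names are hypothetical.

```python
import random

def best_ungapped_local_score(x, y, match=1, mismatch=-1):
    """Best score over all ungapped local alignments (segments of diagonals)."""
    best = 0
    # Diagonal d holds the cell pairs (i, j) with j - i == d.
    for d in range(-(len(x) - 1), len(y)):
        i, j = max(0, -d), max(0, d)
        running = 0
        while i < len(x) and j < len(y):
            step = match if x[i] == y[j] else mismatch
            running = max(0, running + step)  # restart when the score drops below 0
            best = max(best, running)
            i += 1
            j += 1
    return best

def background_scores(m, n, trials, alphabet="ACGT", seed=0):
    """Best scores for `trials` random sequence pairs of lengths m and n."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        x = rng.choices(alphabet, k=m)
        y = rng.choices(alphabet, k=n)
        scores.append(best_ungapped_local_score(x, y))
    return scores

if __name__ == "__main__":
    scores = background_scores(m=100, n=100, trials=200)
    print("min", min(scores), "max", max(scores))
```

A histogram of the returned scores gives an empirical version of the background distribution for this simplified scoring scheme.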

The extreme value distribution
Because we pick the best alignment, we want to know what the distribution of maximum scores looks like for alignments against a random set of sequences
This background distribution is given by an extreme value distribution (EVD)
We need the probability of observing a score of S or higher by random chance, that is, the upper tail of the distribution: P(x > S) = 1 - CDF(S)

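Assessing the significance of alignment scores
It can be shown that the distribution of optimal scores is centered at
$U = \frac{\ln(Kmn)}{\lambda}$
– K, λ estimated from the substitution matrix
– Strictly, U is the location parameter (the mode) of the EVD rather than its mean
Probability of observing a score greater than S
$P(x > S) = 1 - \exp\left(-e^{-\lambda (S - U)}\right)$
Substituting U into the equation
$P(x > S) = 1 - \exp\left(-Kmn\, e^{-\lambda S}\right)$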
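As a rough illustration, the tail probability can be computed from the standard Karlin-Altschul form of the EVD, with location U = ln(Kmn)/λ and P(x > S) = 1 - exp(-e^{-λ(S - U)}). The sketch below evaluates it both ways to show that substituting U gives the same number. The values of K and λ are placeholders chosen for illustration, not values from the lecture.

```python
import math

def evd_location(K, lam, m, n):
    """Location U of the extreme value distribution for sequence lengths m and n."""
    return math.log(K * m * n) / lam

def p_value(S, K, lam, m, n):
    """P(x > S) written in terms of U."""
    U = evd_location(K, lam, m, n)
    return -math.expm1(-math.exp(-lam * (S - U)))  # 1 - exp(-e^{-lam (S - U)})

def p_value_direct(S, K, lam, m, n):
    """Same probability after substituting U: 1 - exp(-K m n e^{-lam S})."""
    return -math.expm1(-K * m * n * math.exp(-lam * S))

if __name__ == "__main__":
    # Placeholder parameters, assumed only for this example.
    K, lam, m, n = 0.13, 0.32, 300, 300
    for S in (30, 40, 50):
        print(S, p_value(S, K, lam, m, n), p_value_direct(S, K, lam, m, n))
```

`math.expm1` is used so that very small probabilities are not lost to rounding when the exponent is close to zero.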

Need to speed up sequence alignment
Consider the task of searching the RefSeq collection of sequences for matches to a query sequence:
– The most recent release of the DB contains 32,504,738 proteins
– This entails roughly 33,000,000 × (300 × 300) ≈ 3 × 10^12 matrix operations (assuming a query sequence of length 300 and an average sequence length of 300)
O(mn) per comparison is too slow for large databases with high query traffic
We will look at a heuristic algorithm to speed up the search process
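The size of this search can be checked with a few lines of arithmetic. The throughput figure below is a made-up assumption used only to turn the cell count into a rough time; it is not from the lecture.

```python
def dp_cells(n_sequences, query_len, avg_db_len):
    """Dynamic-programming matrix cells filled in one full database search."""
    return n_sequences * query_len * avg_db_len

if __name__ == "__main__":
    cells = dp_cells(32_504_738, 300, 300)
    cells_per_second = 1e9  # hypothetical throughput, assumed for illustration
    minutes = cells / cells_per_second / 60
    print(f"{cells:.2e} cells, roughly {minutes:.0f} minutes per query")
```

Even under an optimistic assumed throughput, a single exhaustive query takes far too long for a service that handles many queries.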

Heuristic alignment
Heuristic algorithm: a problem-solving method that is not guaranteed to find the optimal solution, but that is efficient and finds good solutions
Heuristic methods provide fast approximations to dynamic programming
– FASTA [Pearson & Lipman, 1988]
– BLAST [Altschul et al., 1990; Altschul et al., Nucleic Acids Research 1997]

BLAST: Basic Local Alignment Search Tool
Altschul et al., 1990
– Cited >48,000 times!
Key heuristics in BLAST
– A good alignment is made up of short stretches of matches: seeds
– Extend seeds to make longer alignments
Key tradeoff made: sensitivity vs. speed
Uses EVD theory for the scores of random sequences
Works for both protein and DNA sequences
– Only the scores differ

BLAST continued
Two parameters control how BLAST searches the database
– w: the length of the words used to seed the alignment
– T: the minimum score a word pair must reach to be considered in the alignment

Key steps of the BLAST algorithm
For each query sequence
1. Compile a list of high-scoring words of score at least T
– First generate the words in the query sequence
– Then find words that match a query sequence word with score at least T
– This allows for inexact matches
2. Scan the database for hits of these words
– Relies on indexing performed as pre-processing
3. Extend hits

Determining query words
Given:
– query sequence: QLNFSAGW
– word length w = 2 (the usual default for protein is w = 3)
– word score threshold T = 9
Step 1: determine all words of length w in the query sequence
QL LN NF FS SA AG GW
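Step 1 is a sliding window over the query. A minimal sketch, with a hypothetical function name:

```python
def query_words(seq, w):
    """All overlapping words of length w in seq, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

if __name__ == "__main__":
    print(query_words("QLNFSAGW", 2))
```

A query of length n yields n - w + 1 words, so the example query of length 8 gives seven words for w = 2.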

Determining query words
Step 2: determine all words that score at least T when compared to a word in the query sequence, using the amino-acid substitution matrix

Words from query sequence    Words with score ≥ T = 9
QL                           QL = 9
LN                           LN = 10
NF                           NF = 12, NY = 9
…
SA                           none
…

Words in the second column that differ from the query word (such as NY) are additional words potentially in the database
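Step 2 can be sketched by brute force: score every possible word of length w against the query word and keep those that reach T. The sketch assumes BLOSUM62 as the substitution matrix; the slide does not name the matrix, although the scores it shows are consistent with BLOSUM62. To keep the example short, only the rows for N, F, S, and A are embedded, so only query words made of those letters can be scored. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Excerpt of BLOSUM62: one row per query letter, columns in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "N": [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3],
    "F": [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
}

def score(query_letter, other_letter):
    """Substitution score; query_letter must be one of the embedded rows."""
    return BLOSUM62_ROWS[query_letter][AMINO_ACIDS.index(other_letter)]

def neighborhood(word, T):
    """Map every word of the same length scoring at least T to its score."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        s = sum(score(q, c) for q, c in zip(word, candidate))
        if s >= T:
            hits["".join(candidate)] = s
    return hits

if __name__ == "__main__":
    print(neighborhood("NF", 9))
    print(neighborhood("SA", 9))
```

Enumerating all 20^w candidates is fine for w = 2 or 3, which is one reason short words are practical for protein searches.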

Scanning the database
Search the database for all occurrences of query words
Approach:
– index database sequences into a table of words (pre-compute this)
– index query words into a table (at query time)
Example (from the figure):
– query sequence: QLNFSAGW
– word table entries: NP NS NT NW NY
– database sequences containing NY: MFNYT, STNYD…
– database sequences containing NP: NPGAT, TSQRPNP…
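A word index of this kind can be sketched with a dictionary that maps each word to the places it occurs. This is a simplified illustration with hypothetical function names, not how BLAST stores its index.

```python
from collections import defaultdict

def build_index(database, w):
    """Map each word of length w to a list of (sequence id, position) pairs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index

def scan(index, words):
    """Look up each word; words absent from the database are left out."""
    return {word: index[word] for word in words if word in index}

if __name__ == "__main__":
    database = ["MFNYT", "STNYD", "NPGAT", "TSQRPNP"]
    index = build_index(database, w=2)  # done once, as pre-processing
    print(scan(index, ["NP", "NS", "NT", "NW", "NY"]))
```

Building the index costs one pass over the database, after which each word lookup takes constant expected time.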

Extending a hit
Extending a word hit to a larger alignment is straightforward
Terminate the extension when the score of the current alignment falls a certain distance below the best score found for shorter extensions
Query sequence: Q L N F S A
DB sequence:    R L N Y S W
Total score: 17
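The termination rule can be sketched as an ungapped extension in both directions that stops once the running score drops more than a fixed amount below the best score seen so far. The example uses a simple +2/-3 match/mismatch score on made-up DNA strings, not the substitution scores from the slide. The name `x_drop` for the allowed drop and the function names are hypothetical.

```python
def simple_score(a, b):
    """Illustrative scoring, assumed for this sketch: +2 match, -3 mismatch."""
    return 2 if a == b else -3

def extend_hit(query, subject, q_pos, s_pos, w, score, x_drop):
    """Extend a word hit without gaps; return (score, query start, query end)."""
    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    def extend(i, j, step):
        best = running = best_len = length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            running += score(query[i], subject[j])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # fell too far below the best shorter extension
            i += step
            j += step
        return best, best_len

    right, right_len = extend(q_pos + w, s_pos + w, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    # The end index is exclusive, so query[start:end] is the aligned segment.
    return seed + left + right, q_pos - left_len, q_pos + w + right_len

if __name__ == "__main__":
    total, start, end = extend_hit("AAACGTTT", "CCACGTGG", 3, 3, 2,
                                   simple_score, x_drop=4)
    print(total, "AAACGTTT"[start:end])
```

Each direction keeps only the best-scoring prefix of its extension, so trailing mismatches examined before the cutoff do not lower the reported score.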

How to choose w and T?
Tradeoff between running time and sensitivity
T
– small T: greater sensitivity, more hits to expand
– large T: lower sensitivity, fewer hits to expand
w
– larger w: fewer query word seeds, lower time for extending, but more possible words (20^w for amino acids)

Updates to BLAST (Altschul et al., 1997)
Two-hit method
– Lower the threshold T, but require two word hits to be on the same diagonal and no more than A characters apart
Ability to handle gaps
Ability to handle a position-specific score matrix, created from the alignments generated in iteration i and used in iteration i+1

Two-hit method
Figure from Altschul et al.
Hits with T >= 13 (15 hits)
Hits with T >= 11 (22 hits)
Only the two hits highlighted in the figure are considered, as they satisfy the two-hit method
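The two-hit criterion can be sketched by grouping word hits by diagonal (database position minus query position) and keeping neighboring hits that do not overlap and start at most A positions apart. Measuring the distance between hit start positions is an assumption of this sketch, as are the function name and the example hits.

```python
from collections import defaultdict

def two_hit_pairs(hits, w, A):
    """Pairs of non-overlapping hits on one diagonal, at most A apart.

    hits: iterable of (query position, database position) word hits.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    pairs = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for first, second in zip(q_positions, q_positions[1:]):
            if w <= second - first <= A:
                pairs.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return pairs

if __name__ == "__main__":
    hits = [(2, 10), (9, 17), (4, 30), (50, 58)]
    print(two_hit_pairs(hits, w=3, A=40))
```

Only hits that pass this filter would go on to the more expensive extension step, which is how a lower T can be afforded.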

Summary of BLAST
T: don’t consider seeds with score < T
Don’t extend hits when the score falls below a specified threshold
Pre-processing of the database or query helps to improve the running time

FASTA
Starts with exact seed matches instead of inexact matches that satisfy a threshold
Extends seeds (similar to BLAST)
Joins high-scoring seeds, allowing for gaps
Re-aligns high-scoring matches using dynamic programming

Different versions of BLAST programs

Program    Query             Database
BLASTP     Protein           Protein
BLASTN     DNA               DNA
BLASTX     Translated DNA    Protein
TBLASTN    Protein           Translated DNA
TBLASTX    Translated DNA    Translated DNA

Sequence databases
Web portals/Knowledge bases
– NCBI
– EBI
– Sanger
– Each of these centers links to hundreds of databases
Nucleotide sequences
– GenBank
– EMBL-EBI Nucleotide Sequence Database
– Comprise ~8% of the total database (Nucleic Acids Research 2006 Database edition)
Protein sequences
– UniProtKB

Using BLAST
We will BLAST a DNA sequence against the NCBI nucleotide database
We will select
– =70545&to=72150&report=fasta

Using BLAST Choose the database Enter the query sequence

Using BLAST
The sequence corresponds to the human HBB (hemoglobin) gene, but we will select the mouse DB
Use Megablast (large word size)

Interpreting results
E-value: assesses the significance of a score
Related to the P-value, but gives the expected number of alignments with this score or higher
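The relation between the two quantities can be made concrete with the standard Karlin-Altschul expressions, E = K m n e^{-λS} and P = 1 - e^{-E}. The sketch below uses placeholder values for K and λ and treats m and n as the query length and the total database length; these choices are assumptions for illustration, and the numbers reported by a real BLAST run depend on its own parameters and length corrections.

```python
import math

def e_value(S, K, lam, m, n):
    """Expected number of chance alignments scoring at least S."""
    return K * m * n * math.exp(-lam * S)

def p_from_e(E):
    """Probability of at least one chance alignment, given E expected."""
    return -math.expm1(-E)  # 1 - exp(-E)

if __name__ == "__main__":
    K, lam = 0.13, 0.32        # placeholder parameters
    m, n = 300, 10_000_000     # assumed query length and database length
    for S in (40, 60, 80):
        E = e_value(S, K, lam, m, n)
        print(S, E, p_from_e(E))
```

For small E the two numbers are nearly identical, while for large E the P-value saturates at 1 and the E-value keeps growing, which is why the E-value is the more informative quantity for weak hits.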