Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.

Slides:



Advertisements
Similar presentations
Techniques for Protein Sequence Alignment and Database Searching
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Rationale for searching sequence databases
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
We continue where we stopped last week: FASTA – BLAST
Overview of sequence database searching techniques and multiple alignment May 1, 2001 Quiz on May 3-Dynamic programming- Needleman-Wunsch method Learning.
1 BLAST – A heuristic algorithm Anjali Tiwari Pannaben Patel Pushkala Venkataraman.
Sequence Analysis Tools
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Heuristic Approaches for Sequence Alignments
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.
Database Searching BLAST and FastA.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Sequence Alignment and Database Searching.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Rationale for searching sequence databases June 25, 2003 Writing projects due July 11 Learning objectives- FASTA and BLAST programs. Psi-Blast Workshop-Use.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Lecture 7 CS5661 Heuristic PSA “Words” to describe dot-matrix analysis Approaches –FASTA –BLAST Searching databases for sequence similarities –PSA –Alternative.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
Sequence Alignment.
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Heuristic Alignment Algorithms Hongchao Li Jan
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Sequence database searching – Homology searching Dynamic Programming (DP) too slow for repeated database searches. Therefore fast heuristic methods: FASTA.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Blast Basic Local Alignment Search Tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Techniques for Protein Sequence Alignment and Database Searching
Identifying templates for protein modeling:
Homology Search Tools Kun-Mao Chao (趙坤茂)
Fast Sequence Alignments
Techniques for Protein Sequence Alignment and Database Searching
Pairwise Sequence Alignment
Lecture #7: FASTA & LFASTA
Basic Local Alignment Search Tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava

Protein Sequence Alignment and Database Searching Alignment of Two Sequences (Pair-wise Alignment) – The Scoring Schemes or Weight Matrices – Techniques of Alignments – DOTPLOT Multiple Sequence Alignment (Alignment of > 2 Sequences) –Extending Dynamic Programming to more sequences –Progressive Alignment (Tree or Hierarchical Methods) –Iterative Techniques Stochastic Algorithms (SA, GA, HMM) Non Stochastic Algorithms Database Scanning – FASTA, BLAST, PSIBLAST, ISS Alignment of Whole Genomes – MUMmer (Maximal Unique Match)

Database scanning Basic principles of Database searching – Search query sequence against all sequence in database – Calculate score and select top sequences – Dynamic programming is best Approximation Algorithms FASTA Fast sequence search Based on dotplot Identify identical words (k-tuples) Search significant diagonals Use PAM 250 for further refinement Dynamic programming for narrow region

WHAT IS A HEURISTIC METHOD? Definition: A heuristic method is a method that uses “rules of thumb” to give an answer that, while it is not necessarily exact, is fast and is based on sound reasoning. When comparing two sequences, exact methods such as dynamic programming is possible. When comparing a sequence against a database, dynamic programming consumes far too much computer time and space. Heuristic methods are needed. The two most widely used heuristic methods in sequencing are FASTA and BLAST. The time is close to linear in time and space (rather than quadratic as with dynamic programming.)

The FASTA algorithm Four steps: 1)Identify regions of similarity: Using the ktup parameter which specifies # consecutive identities required in a match 10 best diagonal regions found based on #matches and distance between matches 2)Rescore regions and identify best initial regions PAM250 or other scoring matrix used for rescoring the 10 diagonal regions identified in step 1 to allow for conservative replacements and runs of identities shorter than ktup For each the best diagonal regions, identify “initial region” that is best scoring subregion

The FASTA algorithm 3)Optimally join initial regions with scores > T  Given: location of initial regions, scores, gap penalty  Calculate an optimal alignment of initial regions as a combination of compatible regions with maximal score  Use resulting score to rank the library sequences  Selectivity degradation limited by using initial regions that score greater than some threshold T 4)Align the highest scoring library sequences using modification of global and local alignment algorithms  Considers all possible alignments of the query and library sequence that falls within a band centered around the highest scoring initial region

The Four- Step FASTA Process

Overview of the fasta algorithm FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length usually used is 6. 1 seq1 seq2

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix. These scores are the init1 scores 2 seq1 seq2

3 FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non- overlapping regions may be joined. seq1 seq2

The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region, at the end of this step, is saved as the initn score. seq1 seq2 3 - cont

FastA uses dynamic programming (Smith-Waterman algorithm ) over a narrow band of high scoring diagonals between the query sequence and the search set sequence, to produce an alignment with a new score. 4 seq1 seq2 Opt_score

Overview of the fasta algorithm Search for regions with exact word matches keep 10 highest scoring regions and re-score them using a scoring matrix : Init1 Join diagonals by introducing gaps : Initn Apply Smith-Waterman algorithm to achieve best alignment : Opt Correct for sequence lengths : Z-score Evaluate significance of Z_scores: E values

Database scanning Approximation Algorithms BLAST Heuristic method to find the highest scoring Locally optimal alignments Allow multiple hits to the same sequence Based on statistics of ungapped sequence alignments The statistics allow the probability of obtaining an ungapped alignment MSP - Maximal Segment Pair above cut-off All world (k > 3) score grater than T Extend the score both side Use dynamic programming for narrow region

BLAST – Basic Local Alignment Search Tool Find the highest scoring locally optimal alignments between a query sequence and a database. Very fast algorithm Can be used to search extremely large databases (uses a pre-indexed database which contributes to its great speed) Sufficiently sensitive and selective for most purposes Robust – the default parameters can usually be used

BLAST Algorithm, Step 1 For a given word length w (usually 3 for proteins) and a given score matrix: Create a list of all words (w-mers) that can can score >T when compared to w-mers from the query. P D G 13 P Q A 12 P Q N 12 etc. Below Threshold (T=13) Query SequenceL N K C K T P Q G Q R L V N Q P Q G 18 P E G 15 P R G 14 P K G 14 P N G 13 Neighborhood Words Word P M G 13

BLAST Algorithm, Step 2 Each neighborhood word gives all positions in the database where it is found (hit list). P D G 13 P Q G 18 P E G 15 P R G 14 P K G 14 P N G 13 P M G 13 PMG Database

BLAST Algorithm, Step 3 The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues. Residues will be added until the incremental score drops below a threshold.

BLAST-Basic Local Alignment Search Tool Capable of searching all the available major sequence databases Run on nr database at NCBI web site Developed by Samuel Karlin and Stevan Altschul Method uses substitution scoring matrices A substitution scoring matrix is a scoring method used in the alignment of one residue or nucleotide against another First scoring matrix was used in the comparison of protein sequences in evolutionary terms by Late Margret Dayhoff and coworkers Matrices –Dayhoff, MDM, or PAM, BLOSUM etc. Basic BLAST program does not allow gaps in its alignments Gapped BLAST and PSI-BLAST

Input Query DNA SequenceAmino Acid Sequence Blastptblastnblastnblastxtblastx Compares Against Protein Sequence Database Compares Against translated Nucleotide Sequence Database Compares Against Nucleotide Sequence Database Compares Against Protein Sequence Database Compares Against translated nucleotide Sequence Database An Overview of BLAST

Database Scanning or Fold Recognition Concept of PSIBLAST –Perform the BLAST search (gap handling) –GeneImprove the sensivity of BLAST –rate the position-specific score matrix –Use PSSM for next round of search Intermediate Sequence Search –Search query against protein database –Generate multiple alignment or profile –Use profile to search against PDB

Comparison of Whole Genomes MUMmer (Salzberg group, 1999, 2002) –Pair-wise sequence alignment of genomes –Assume that sequences are closely related –Allow to detect repeats, inverse repeats, SNP –Domain inserted/deleted –Identify the exact matches How it works –Identify the maximal unique match (MUM) in two genomes –As two genome are similar so larger MUM will be there –Sort the matches found in MUM and extract longest set of possible matches that occurs in same order (Ordered MUM) –Suffix tree was used to identify MUM –Close the gaps by SNPs, large inserts –Align region between MUMs by Smith- Waterman

Thanks