Bioinformatics for biomedicine Sequence search: BLAST, FASTA Lecture 2, 2006-09-26 Per Kraulis

Bioinformatics for biomedicine Sequence search: BLAST, FASTA Lecture 2, Per Kraulis

Previous lecture: Databases General issues –Data model –Quality –Policies Updates, corrections EMBL, GenBank, Ensembl UniProt Access: EBI, NCBI (Entrez)

Course design
1. What is bioinformatics? Basic databases and tools
2. Sequence searches: BLAST, FASTA
3. Multiple alignments, phylogenetic trees
4. Protein domains and 3D structure
5. Seminar: Sequence analysis of a favourite gene
6. Gene expression data, methods of analysis
7. Gene and protein annotation, Gene Ontology, pathways
8. Seminar: Further analysis of a favourite gene

Sequence searches Two tasks: 1)Compare two sequences: How similar? 2)Search for similar sequences

How to do it? Computer program Algorithm –Appropriate –Correct –Speed Database –Content

Consider! Sensitivity –Are correct hits found? Specificity –Are false hits avoided? Statistics: Significant match? Biological judgement –“Strange” features in sequences –Are assumptions OK?

What is an algorithm? “Procedure for accomplishing some task” –Set of well-defined instructions Cookbook recipe –Produce result from initial data Input data set -> output data set All software implements algorithms

Example: substring search
Text: PRESELVISLEY
Pattern: ELVIS
Naïve algorithm: 11 operations
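The naïve search can be sketched in Python as below. This is an illustration, not the lecturer's own code: the function name naive_search is hypothetical, and what counts as one "operation" is a convention (here, one character comparison), so the tally it reports need not equal the figure on the slide.

```python
def naive_search(text, pattern):
    """Slide the pattern along the text one position at a time.

    Returns (0-based start of the first match or -1,
             number of character comparisons performed).
    """
    comparisons = 0
    for start in range(len(text) - len(pattern) + 1):
        matched = True
        for offset, wanted in enumerate(pattern):
            comparisons += 1
            if text[start + offset] != wanted:
                matched = False
                break  # mismatch: shift the pattern by one and retry
        if matched:
            return start, comparisons
    return -1, comparisons


position, comparisons = naive_search("PRESELVISLEY", "ELVIS")
print(position, comparisons)
```

In the worst case every start position is compared against the full pattern, which is why the cost grows with both text and pattern length.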

Substring search with lookup
Text: PRESELVISLEY
Index lookup table:
P 1
R 2
E 3,5,11
S 4,9
L 6,10
V 7
I 8
Y 12
Pattern: ELVIS
Improved algorithm: 6 operations
But: preprocessing required
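A minimal sketch of the lookup idea, assuming a plain dictionary as the index: the table is built once (the preprocessing step), after which only the positions where the first letter of the pattern occurs need to be tried. The names build_index and indexed_search are hypothetical.

```python
from collections import defaultdict


def build_index(text):
    """Preprocessing: map each residue to its 1-based positions in the text."""
    index = defaultdict(list)
    for position, residue in enumerate(text, start=1):
        index[residue].append(position)
    return dict(index)


def indexed_search(text, pattern, index):
    """Try only the positions where the first pattern letter occurs.

    Returns the 1-based start of the first match, or -1 if there is none.
    """
    for start in index.get(pattern[0], []):
        if text[start - 1:start - 1 + len(pattern)] == pattern:
            return start
    return -1


text = "PRESELVISLEY"
index = build_index(text)
print(index["E"])                            # positions of E in the text
print(indexed_search(text, "ELVIS", index))  # 1-based start of the match
```

The index costs time and memory up front, but it can be reused for every later query against the same text, which is the trade-off database search programs exploit.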

Algorithm properties Execution time –Number of operations to produce result Storage –Amount of memory required Result –Exact: Guaranteed correct –Approximate: Reasonably good

Analysis of algorithms, 1 Larger input data set: what happens to –Execution time? –Storage? Examples: –Longer query sequence –Larger database –More sequences in multiple alignment

Analysis of algorithms, 2 “Complexity” of an algorithm –Behaviour with larger input data sets –Time (speed) and storage (memory) Big-O notation: general behaviour –O(c) constant –O(log(n)) logarithmic –O(n) linear –O(n^2) quadratic –O(2^n) exponential

[Figure: execution time t versus input size n, with curves for O(c), O(n) and O(2^n)]

Example: O(n) Compute mol weight MW of protein Table: MW(aa) for each amino acid residue For each residue in protein, add MW –MW(Met) + MW(Ala) + … + MW(Ser) Add MW for water O(n) for protein size
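The calculation can be sketched as a single pass over the sequence. The mass table below holds approximate average residue masses in daltons, rounded to two decimals; treat the exact figures as illustrative, and the name molecular_weight as hypothetical.

```python
# Approximate average residue masses (Da), rounded; illustrative values.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water is added for the free N- and C-termini


def molecular_weight(sequence):
    """Sum one table lookup per residue, then add water: O(n) in length."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


print(round(molecular_weight("MAS"), 2))
```

Doubling the length of the protein doubles the number of lookups and additions, which is what linear behaviour means in practice.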

Example: O(2^n) Given mol weight MW for a protein, compute all possible sequences that might fit Table: MW(aa) for each amino acid residue Produce all permutations up to MW –MAAAA, MAAAG, MAAAS, MAAAT, … Naïve implementation: O(2^n) for MW
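A small sketch of the exhaustive approach, under two assumptions made only to keep it readable: a reduced three-letter alphabet with approximate masses, and a hypothetical tolerance parameter for deciding that a mass "fits". Every partial sequence is extended by every residue, so the number of candidates multiplies at each step.

```python
# Reduced alphabet with approximate residue masses (Da); illustrative only.
MASSES = {"G": 57.05, "A": 71.08, "S": 87.08}
WATER = 18.02


def sequences_matching(target, tolerance=0.5, prefix="", mass=WATER):
    """Yield every sequence whose mass lies within tolerance of the target.

    Each call branches once per residue type, so the search tree grows
    exponentially with the number of residues that fit under the target.
    """
    if abs(mass - target) <= tolerance:
        yield prefix
    for residue, residue_mass in MASSES.items():
        if mass + residue_mass <= target + tolerance:
            yield from sequences_matching(
                target, tolerance, prefix + residue, mass + residue_mass
            )


print(sorted(sequences_matching(146.15)))
```

With the full 20-letter alphabet and a realistic protein mass, the same loop would have to visit an astronomically large number of candidates, which is the point of the example.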

Heuristic algorithms Less-than-perfect –Reasonably good solution in decent time Why? –Faster than rigorous algorithm –May be the only practical approach Specific to the task –Reasonable or likely cases Rule-of-thumb Use biological knowledge

Sequence comparison Sequences related by evolution –Common ancestor –Modified over time –Biologically relevant changes Single-residue mutations Deletion/insertion of segments Sequences may be related by evolution, although we cannot detect it

Alignment Corresponding segments of sequences Identical residues Conserved residues Gaps for deletion/insertion
PRESELVISLEY
||| ||.||| |
PREPELIISL-Y

Local vs. global alignment Global alignment: entire sequences Local alignment: segments of sequences Local alignment often the most relevant –Depends on biological assumptions

Alignment matrix, 1
  PRESELVISLEY
E   X X     X
L      X   X
V       X
I        X
S    X    X
Mark identical residues Find longest diagonal stretch Local alignment O(m*n)
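The matrix procedure can be sketched as follows: every cell is visited once, and a running count of identical residues is carried along each diagonal. The name longest_diagonal is hypothetical, and only identities are scored here.

```python
def longest_diagonal(seq_a, seq_b):
    """Find the longest run of identical residues along any diagonal.

    Returns (length, start_in_a, start_in_b) with 0-based starts.
    Every (i, j) cell is visited once, hence O(m*n) time.
    """
    best = (0, 0, 0)
    previous = [0] * (len(seq_b) + 1)  # run lengths for the row above
    for i, residue_a in enumerate(seq_a, start=1):
        current = [0] * (len(seq_b) + 1)
        for j, residue_b in enumerate(seq_b, start=1):
            if residue_a == residue_b:
                # Extend the run that ended one step up the diagonal.
                current[j] = previous[j - 1] + 1
                if current[j] > best[0]:
                    best = (current[j], i - current[j], j - current[j])
        previous = current
    return best


print(longest_diagonal("PRESELVISLEY", "ELVIS"))
```

Keeping only the previous row means memory grows with one sequence length rather than with the full matrix.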

Alignment matrix, 2
  PRESEIVISLEY
E   X X     X
L          X
V       X
I      X X
S    X    X

  PRESEIVISLEY
E   X X     X
L      . . X
V       X
I      X X .
S    X    X
Mark similar residues –Substitution probability Find longest diagonal stretch –Above some score limit Local alignment –High Scoring Pair, HSP

Substitution matrix The probability of mutation X -> Y –M(i,j) where i and j range over all amino acid residues –Transformed into log-odds for computation Common matrices –PAM250 (Dayhoff et al) Based on closely similar proteins –BLOSUM62 (Henikoff et al) Based on conserved regions Considered best for distantly related proteins
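The log-odds transformation can be illustrated with a short sketch. The frequencies used are made-up numbers, not values from PAM or BLOSUM, and the scale factor and rounding are assumptions chosen for illustration; log_odds is a hypothetical name.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=2.0):
    """Score = scale * log2(observed pair frequency / chance expectation).

    Positive: the pair is seen more often than chance (likely conserved).
    Negative: the pair is seen less often than chance.
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds(0.125, 0.25, 0.25))      # pair seen twice as often as chance
print(log_odds(0.015625, 0.25, 0.25))   # pair seen four times less often
```

Working with logarithms lets an alignment score be computed by adding per-position scores instead of multiplying probabilities.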

Gap penalties To model deletion/insertion –Segment of gene deleted or inserted Gap open –Start a gap: should be tough Gap extension –Continue a gap: should be easier
PRESELVISLEY
||| ||.||| |
PREPELIISL-Y
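One common way to express "opening is tough, extending is easier" is an affine penalty. The sketch below assumes the convention that the first gap position costs gap_open and each further position costs gap_extend; other programs count slightly differently, and the default values here are hypothetical.

```python
import re


def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine penalty: a large cost to open, a small cost per extra position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned_row, gap_open=10, gap_extend=1):
    """Sum the penalties of all gap runs ('-' characters) in one aligned row."""
    return sum(
        gap_penalty(len(run), gap_open, gap_extend)
        for run in re.findall(r"-+", aligned_row)
    )


print(total_gap_penalty("PREPELIISL-Y"))  # one gap of length 1
print(gap_penalty(5))                     # one long gap
print(5 * gap_penalty(1))                 # five separate single gaps
```

A single long gap is far cheaper than many short ones, which matches the biological picture of one deletion or insertion event affecting a whole segment.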

Alignment/search algorithms Needleman-Wunsch Smith-Waterman FASTA BLAST

Needleman-Wunsch, 1970 Global alignment Rigorous algorithm –Dynamic programming –Simple to implement Slow; not used for search http://bioweb.pasteur.fr/seqanal/interfaces/needle.html
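A compact sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch. It assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix and affine gaps; the parameter values are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):          # leading gaps in a
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("PRESELVISLEY", "PREPELIISLY")
print(total)
print(row_a)
print(row_b)
```

The full matrix is filled for every query-subject pair, so time and memory scale with the product of the two lengths; this is why the method is too slow for database search.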

Smith-Waterman, 1981 Local alignment Rigorous algorithm –Dynamic programming –Fairly simple to implement Precise, sensitive alignments Slow; not used for search SSEARCH in the FASTA package http://pir.georgetown.edu/pirwww/search/pairwise.shtml
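Local alignment differs from the global case in two ways that a sketch makes visible: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner. The scoring values are hypothetical, and a linear gap penalty is assumed.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,                                 # start a new local alignment
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("PRESELVISLEY", "ELVIS"))
```

The flanking residues that do not match are simply left out of the result, which is what makes the alignment local.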

FASTA, Lipman & Pearson 1985 Local alignment Heuristic algorithm –Table lookup, “words” of length ktup Higher ktup: faster but less sensitive Protein: ktup=2 Nucleotide: ktup=6 –Extension of hits into alignments Faster than Smith-Waterman Useful for searches
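The word-lookup stage can be sketched as below. This covers only the first step (finding shared words and the diagonals they fall on), not the rescoring and extension that the real FASTA program performs; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def word_index(sequence, ktup):
    """Map every word of length ktup to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - ktup + 1):
        index[sequence[pos:pos + ktup]].append(pos)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Count shared words per diagonal (subject position minus query position).

    Many hits on one diagonal suggest a region worth extending into an
    alignment; diagonals with few hits can be ignored.
    """
    index = word_index(subject, ktup)
    diagonals = Counter()
    for q_pos in range(len(query) - ktup + 1):
        word = query[q_pos:q_pos + ktup]
        for s_pos in index.get(word, []):
            diagonals[s_pos - q_pos] += 1
    return diagonals


print(diagonal_hits("ELVIS", "PRESELVISLEY", ktup=2).most_common(1))
```

A larger ktup produces fewer, more specific words, so there is less to look up but short or diverged matches are missed, which is the speed versus sensitivity trade-off named on the slide.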

FASTA statistics Fairly sophisticated statistics –But still fallible E-value (expectation value) –The number of hits with this score expected, if query were a random sequence –Values should be low Below almost certainly significant to 0.1 probably significant 0.1 to 10 may be significant 10 and above probably rubbish

FASTA example Example search –Query: UniProt P04049 (RAF1_HUMAN) –Standard parameters, fasta3, UniProt –Kept only 24 hours at EBI – bin/sumtab?tool=fasta&jobid=fasta http:// bin/sumtab?tool=fasta&jobid=fasta

BLAST, Altschul et al 1990 Basic Local Alignment Search Tool Heuristic algorithm –Basically similar ideas as FASTA –Did not originally allow gaps –BLAST2 allows gaps About 50 times faster than Smith-Waterman Faster than FASTA, less sensitive E-value statistics: same idea as FASTA

BLAST example Example search –Query: UniProt P04049 (RAF1_HUMAN) –Standard parameters, human proteins – – BLASTQ2

BLAST and short nucleotides Default BLAST parameters are for genes and proteins Oligonucleotides require other parameters for meaningful results special linkhttp://

BLAST and low complexity regions Some proteins contain “low complexity” regions, e.g. S, T, Q in long peptides Spurious high significance –Does not make biological sense Filter out such regions –BLAST uses SEG algorithm –Regions masked out; replaced by “XXXX” –May go wrong; check results!
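The idea of masking can be illustrated with a simple sliding-window entropy filter. This is not the SEG algorithm itself, only a rough stand-in for the principle; the window size and threshold are hypothetical values chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits per residue; low values mean few distinct residues."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 12))
```

Because whole windows are masked, residues next to a repetitive stretch can be masked as well, which is one reason the slide advises checking the results.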

Variants of search programs
Query | Database | Program | Comment
Protein | Protein | blastp, fastp |
Nucleotide | Nucleotide | blastn, fastn | Use only if nucleotide comparison is really wanted
Nucleotide | Protein | blastx, fastx3 | Translate query to protein; 6-frame
Protein | Nucleotide | tblastn, tfastx3 | Translate DB on the fly; 6-frame
Nucleotide | Nucleotide | tblastx | Translate both query and DB (gene-oriented); 2*6-frame
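The "6-frame" translation used by the translating programs can be sketched as follows, assuming the standard genetic code and an input containing only A, C, G and T. The function names are hypothetical, and this is not how any particular BLAST or FASTA program implements it internally.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each nucleotide sequence yields six candidate protein sequences, so translating both query and database multiplies the number of comparisons accordingly.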