Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window,

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

BLAST Sequence alignment, E-value & Extreme value distribution.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
Heuristic alignment algorithms and cost matrices
BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.
Overview of sequence database searching techniques and multiple alignment May 1, 2001 Quiz on May 3-Dynamic programming- Needleman-Wunsch method Learning.
Introduction to bioinformatics
Similar Sequence Similar Function Charles Yan Spring 2006.
BLAST.
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
1 Lesson 3 Aligning sequences and searching databases.
Sequence alignment, E-value & Extreme value distribution
BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.
Making Sense of DNA and protein sequence analysis tools (course #2) Dave Baumler Genome Center of Wisconsin,
Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||
Public Resources (II) – Analysis tools  Web-based analysis tools – easy to use, but often with less customization options.  Stand-alone analysis tools.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
An Introduction to Bioinformatics
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Public Resources for Bioinformatics Databases : how to find relevant information. Analysis Tools.
BLAST : Basic local alignment search tool B L A S T !
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
Identifying the ortholog of TNF (Tumor necrosis factor) in mosquito genomes Pet Projects:
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
BLAST Basic Local Alignment Search Tool (Altschul et al. 1990)
NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.
Construction of Substitution Matrices
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.
Generic substitution matrix based sequence comparison Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W T V A. Total:
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Step 3: Tools Database Searching
What is BLAST? Basic BLAST search What is BLAST?
Stand-alone tools 2. 1.Download the zip file to the GMS6014 folder. 2.Unzip the files to a folder named “clustalx”. 3.Edit the MDM2_isoforms_5.fasta file.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
What is BLAST? Basic BLAST search What is BLAST?
Blast Basic Local Alignment Search Tool
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Identifying templates for protein modeling:
Genome Center of Wisconsin, UW-Madison
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window, navigate to the C:\GMS6014\blast folder. 3.At the prompt “C:\GMS6014\blast >” type the command “formatdb –i dbs\Dm.P –p T” -- format the dataset for the program. 4.Compose the query sequence save as “3TNF.txt” in the “blast\query\” folder. 5.Initiated the search by typing “blastall –p blastp –d dbs\Dm.P –i query\4_MMD2.fasta –o out\Mdm2_DmP.html –T T”

What’s in a command? formatdb –i dbs\Dm.P –p T Program – format database for search. Feed me the input file name Tell me is it a protein sequence file? For more info, refer to the “user manual” file in the blast\doc folder.

Blast outputoutput

How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P P Seq_B: M P P W I

How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P P Seq_B: M P P W I

Scoring matrix –BLOSUM 62

Judging the match using “Scoring Matrix” Q: which one is a better match to the query ? Query: M A T W L Seq_B: M P P W I Score: Total: 16 Total: 7 -4 Query: M A T W L Seq_A: M A T P P Score: 545-3

Scoring matrix Does it make sense? How was it generated?

BLOSUM - Blocks Substitution Matrices Block: very well conserved region of a protein family. – perform the same (similar) function. ASLDEFL SALEDFL ASLDDYL ASIDEFY ASIDEFY … Score(a1/a2) = observed frequency of a1/a2 2* log2 predicated frequency of a1/a2 AA: 6 AS: 3 SS: 0

BLOSUM - Blocks Substitution Matrices Block: very well conserved region of a protein family. – perform the same (similar) function. ASLDEFL ASLEDFL ASLDDYL SALEEFL ASLDDYL SALEEFL … observed frequency of L / I > predicated frequency of L / I i.e: 0.1*0.1 = 0.01i.e: 0.03 Score (L/I) > 0 Substitution of L / I is common in conserved sequences

BLOSUM - Blocks Substitution Matrices Block: very well conserved region of a protein family. – perform the same (similar) function. ASLDEFL ASLEDFL ASLDDYL SALEEFL ASLDDYL SALEEFL … observed frequency of L / K < predicated frequency of L / K i.e: 0.1*0.1 = 0.01i.e: Score (L/K) < 0 Substitution of L / K is rare in conserved sequences

BLOSUM - Blocks Substitution Matrices -- Clustering threshold BLOSUM 90– Blocks with >= 90% identity are counted as one to compute the substitution score BLOSUM 30– Blocks with >= 30 % identity are counted as one to compute the substitution score BLOSUM 62

BLOSUM - Blocks Substitution Matrices -- Clustering threshold BLOSUM 90 BLOSUM 30 BLOSUM 62 ASLDEFL SALEEFL ASLDDYL SALEEFL TAIQNYV ATVNQFI … ASLDEFL SALEEFL ASLDDYL SALEEFL TAIQNYV ATVNQFI … SALEEFL* TAIQNYV ATVNQFI …

Comparison of Blosum matrixes A R N D C Q E G H I L K M F L Blosum 90 L Blosum 62 L Blosum 30

Q: Which substitution matrix will you use to identify a distant ortholog ? a.) Blosum 40 c.) Blosum 90 b.) Blosum 60

Blast outputoutput BLAST – Basic Local Alignment Search Tool

Global vs. Local Alignment Global alignmentGlobal alignment: from start to end Local alignmentLocal alignment: best matches on any segment of participating sequence. Before calculating the similarity score, we need an alignment --

Practice : try local alignment and global alignment of the same pair of sequence Start two new browser tabs with the alignment server. Open the test sequence file, copy seq1 and seq2 to the respective windows in the alignment web page. Select Blosum62 as the scoring matrix on both pages. Run one set for Local alignment, the other for global alignment.

Global Alignment seq1 & seq2 Score: -57 at (seq1)[1..90] : (seq2)[1..92] >seq1 MA-----STVTSCLEPTEVFMDLWPEDHSNWQELSPLEPSDPLNPPTPPRAAPSPVVPST :. ::.. :. :.... :.. : : :. : :.. >seq2 MSHGIQMSTIKKRRSTDEEVFCLPIKGREIYEILVKIYQIENYNMECAPPAGASSVSVGA >seq1 EDYGGDFDFRVGFVEAGTAKSVTCTYSPVLNKVYC. :. :.: ::... >seq2 TEAEPTEVFMDLWPED---HSNWQELSPLEPSDHM 14 identical matches

Local Alignment seq1 & seq2 with BLOSUM 62 Score: 156 at (seq1)[10..36] : (seq2)[64..90] 10 EPTEVFMDLWPEDHSNWQELSPLEPSD ||||||||||||||||||||||||||| 64 EPTEVFMDLWPEDHSNWQELSPLEPSD 27 identical matches

Finding the best alignment = Get the highest score The consideration on whether to open/extend a gap is weighed by its effect on the total score of the alignment. Optimization - Dynamic programming

Global vs. Local Alignment Q: Can a Global alignment produce a Local alignment ? Can a Local alignment produce a Global alignment ?

Global vs. Local Alignment Global alignment: from start to end Local alignment: best matches on any segment of participating sequence. complete sequence of orthologs incomplete sequence protein family with varying length identifying motifs shared by non- orthologs

Factors determining a local alignment Local Alignment -- The highest score - stop the alignment extension if it is not profitable Scoring matrix Gap penalty

Effect of matrices on Local Alignment Column 1,3,5, align seq1 and seq 2 with “blosum80” Observe: effect of matrices on the outcome of local alignment Column 2 and 4, align seq1 and seq 2 with “blosum35”

Effect of matrices on Local Alignment Score:Score: 206 at (seq1)[10..38] : (seq2)[64..92] 10 EPTEVFMDLWPEDHSNWQELSPLEPSDPL |||||||||||||||||||||||||||. 64 EPTEVFMDLWPEDHSNWQELSPLEPSDHM Score:Score: 156 at (seq1)[10..36] : (seq2)[64..90] 10 EPTEVFMDLWPEDHSNWQELSPLEPSD ||||||||||||||||||||||||||| 64 EPTEVFMDLWPEDHSNWQELSPLEPSD Blosum 35: P / H: -1 L/M: 3 Blosum 62: P / H: -2 L/M: 2

Introducing a gap Q: M A T W L I. A: M A - W T V. Scr: 45 -?11 3 Scr: Q: M A T W L I. A: M A W T V A. Total: 5 Total = 22 - ? Blosum 62: Gap openning: -6 ~ -15 Gap Extension: -2 ~ -6

Effect of gap penalty on Local Alignment Column 1,3,5, align seq1 and seq2 with “gap=15, ext=3,” Practice : effect of gap penalty on local alignment Column 2 and 4, align seq1 and seq2 with “gap=5, ext=1” Set matrix to “blosum62”

Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | ||||||||||||||||||||||||||| 53 ASSVSVGATEA-EPTEVFMDLWPEDHSNWQELSPLEPSD Score:Score: 156 at (seq1)[10..36] : (seq2)[64..90] 10 EPTEVFMDLWPEDHSNWQELSPLEPSD ||||||||||||||||||||||||||| 64 EPTEVFMDLWPEDHSNWQELSPLEPSD Blosum 62 Gap: -15 Ex: -3 Gap: -5 Ex: - 1

BLAST – Basic Local Alignment Search Tool It is based on local alignment, -- highest score is the only priority in terms of finding alignment match. -- determined by scoring matrix, gap penalty It is optimized for searching large data set instead of finding the best alignment for two sequences

BLAST – Basic Local Alignment Search Tool 1.A high similarity core (2- 4aa) 2.Often without gap Query: M A T W L I. Word : M A T A T W T W L W L I 1.For each word, find matches with Score > T. 2.Extend the match as long as profitable. - High Scoring segment Pair (best local alignment) 3.Find the P and E value for HSP(s) with Score > cut off*. * Cut off value can be automatically calculated based on E

BLAST – Basic Local Alignment Search Tool The P and E value for HSP(s) : based on the total score (S) of the identified “best” local alignment. P (S) : the probability that two random sequence, one the length of the query and the other the entire length of the database, could achieve the score S. E (S) : The expectation of observing a score >= S in the target database. For a given database, there is a one to one correspondence between S and E(s) -- choosing E determines cut off score

BLAST – Basic Local Alignment Search Tool BLASTN BLASTP TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. BLASTX compares a nucleotide query sequence translated in all reading frames against a protein sequence database TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.

BLAST – Advanced options : all adjustable in stand alone BLAST -F Filter query sequence [String] default = T -M Matrix [String] default = BLOSUM62 -G Cost to open gap [Integer] default = 5 for nucleotides 11 proteins -E Cost to extend gap [Integer] default = 2 nucleotides 1 proteins -q Penalty for nucleotide mismatch [Integer] default = -3 -r reward for nucleotide match [Integer] default = 1 -e expect value [Real] default = 10 -W wordsize [Integer] default = 11 nucleotides 3 proteins -T Produce HTML output [T/F] default = F …..

Smith-Waterman vs. Heuristic approach Smith-Waterman Algorithm Finding the best alignment based on complete route map BLAST, FASTA Try to find the best alignment based on experience/knowledge Search result Difference ? - For real good matches, almost no difference - For marginal similarity and exceptional cases, the difference may matter.

Overview of homology search strategy 1.) Where should I search? NCBI Has pretty much every thing that has been available for some time Genome projects Has the updated information (DNA sequence as well as analysis result)

Overview of homology search strategy 2.) Which sequence should I use as the query? Protein cDNA Genomic

Overview of homology search strategy 2.) Which sequence should I use as the query? cDNA (BlastN) Protein (TblastN)

Overview of homology search strategy 2.) Which sequence should I use as the query? Protein v.s cDNA Searching at the protein level is much more sensitive query: S A L query: TCT GCA TTG target: S A L target: AGC GCT CTA Protein: ~ 5% Nucleotide: ~ 25% Base level identity Protein: 100% Nucleotide: 33%

Overview of homology search strategy 2.) Which sequence should I use as the query? If you want to identify similar feature at the DNA level. Be Cautious with genomic sequence initiated search Low complexity region repeats

Overview of homology search strategy 3.) Which program to use? 1.Smith-Waterman vs. Blast. 2.Different flavors of BLAST

Overview of homology search strategy 4.) Which data set should I search? Protein sequence (known and predicted) blastP, Smith_Waterman Genomic sequence TblastN EST TblastN Predicted genes TblastN

Overview of homology search strategy 5.) How to optimize the search ? Scoring matrices Gap penalty Expectation / cut off Example

Overview of homology search strategy 6.) How do I judge the significance of the match ? P-value, E -value Alignment Structural / Function information

Overview of homology search strategy 7.) How do I retrieve related information about the hit(s) ? NCBI is relatively easy The scope of information collection can be enlarged by searching (linking) multiple databases (links). exampleexample Genome projects often have their own interface and logistics (ie. Ensemble, wormbase, MGI, etc. )Ensemble

Overview of homology search strategy 8.) How to align (compare) my query and the hits ? Global alignment Local alignment ClustalW/ClustalX

Blast outputoutput

Specific patterns 1.DNA pattern – Transcription factor binding site. 2.Short protein pattern – enzyme recognition sites. 3.Protein motif/signature.

Binary patterns for protein and DNA Caspase recognition site: [EDQN] X [^RKH] D [ASP] Examples: Observe: Search for potential caspase recognition sites with BaGua

Searching for binary (string) patterns Seq: A G G G C T C A T G A C A G R C W G A C A G T G R C W G A C A G T G R C W G A C A G T Positive match