Database searching with BLAST

Presentation transcript:

Database searching with BLAST. Outline of today’s lecture: Transfer of information; Database searching with sequences; Sequence alignment; Scoring matrices; Significance of alignments; BLAST (method, parameters, output). Celia van Gelder, CMBI, UMC Radboud, September 2013

Transfer of information The main topic of this course is the transfer of information from a well-known system to a “new” system (sequence). In the protein world that leads to two questions: From which protein can I transfer information? How do I transfer which information from where to where? Today’s answer is BLAST…

BLAST - Searching with sequences LAST WEEK: Searching with words (Google like) Query = word(s) Tool used: (MRS-Search, Entrez, SRS, …) TODAY: Searching with sequences Query = sequence Tool used: BLAST (MRS, NCBI, ..)

Database Searching with a query sequence Purpose: To identify similarities between Your query sequence (with unknown structure and function) and Database sequences (with elucidated structures and function) If we identify similarity we can transfer information!

Transfer of information to corresponding residues Your sequence: DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN is phosphorylated on one of the two serines. Which one? What is your approach?

Transfer of information to corresponding residues BLAST finds two database hits that are annotated to have a phosphorylated serine.
DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN
DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN
AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN
“This serine is phosphorylated in a known protein from the database, so in my protein the corresponding serine is likely to be phosphorylated too.”

Database searching concept The query sequence is compared (aligned) with every sequence in the database. High-scoring database sequences are assumed to be evolutionarily related to the query sequence. If sequences are related by divergence from a common ancestor, they are said to be homologous.

Sequence Alignment [Figure: two sequences, A and B, aligned to each other; a gap = insertion or deletion (indel).]

Sequence alignment is easy: You only need three things: A computer program that produces all possible alignments, and A computer program that gives each alignment a score, and, the simplest, A computer program that selects the highest scoring alignment from the very large number you tried.

Scoring/Substitution Matrix A scoring scheme for the quality of an alignment. It contains scores for every possible amino acid substitution in a sequence alignment. For protein/protein comparisons we need a 20 x 20 matrix with scores for pairs of residues. The cell at position (X, Y) contains the score for the substitution/mutation amino acid X -> amino acid Y.

Scores Positive score if corresponding amino acid residues in the two aligned sequences are identical or similar. This is a likely change. Negative score if corresponding amino acid residues are not similar. This is an unlikely change. The scores are numbers that you can add up.

Amino Acid substitutions, some thoughts: Not all 20x20 possible mutations occur equally often. Residues mutate more easily to similar ones (e.g. Leucine and Isoleucine). Residues at the surface mutate more easily. Aromatics mutate preferably into aromatics. The core tends to be hydrophobic; cysteines are dangerous at the surface. Cysteines in sulfur bridges (S-S) seldom mutate. Some amino acids have similar codons (for example TTT & TTC for Phe, TTA & TTG for Leu). Etc.

PAM250 Matrix (Dayhoff Matrix)

Scoring example The score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
scores:      1 12 12  6  2  5 -1  2  6  1  0   => score = 46
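The summation in this example can be sketched in a few lines of Python. The pair scores below are copied from the slide and cover only the residue pairs that occur in this one alignment; the names `PAIR_SCORES`, `pair_score` and `alignment_score` are made up for illustration, and a real tool would look the values up in a full 20 x 20 matrix.

```python
# Pair scores taken from the worked example on the slide.
# Assumption: the matrix is symmetric, so (a, b) and (b, a) share one entry.
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Return the substitution score for residues a and b."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the per-position scores of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # 46, as on the slide
```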

Scoring matrix, continued When you use bioinformatics tools (BLAST, CLUSTAL, etc.) the scoring matrix is often a parameter that you can choose. Two widely used matrices (often the default in the packages): PAM250 (Dayhoff et al.), based on closely similar proteins. BLOSUM62 (Henikoff et al.), based on conserved regions and considered best for distantly related proteins.

Dayhoff Matrix (1) The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously. Then they counted all mutations (and non-mutations) and calculated the mutation frequencies. With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).

Dayhoff Matrix (2) Given the frequency of Leu and Val in my sequences, and the frequency of mutations, do I see more mutations of V -> L than I would expect by chance alone? Score of mutation A -> B = log (observed A -> B mutations / expected A -> B mutations). This is called a log odds score and can be negative, zero, or positive. Zero means no information, no contribution to the score of the alignment. When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
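As a small numerical illustration of the log odds idea, the sketch below turns an observed and an expected frequency into a score. The function name `log_odds`, the choice of base 10 and the example frequencies are assumptions made for this illustration; they are not values from the Dayhoff data.

```python
import math


def log_odds(observed, expected, base=10):
    """Log of the ratio between observed and chance-expected frequency.

    Positive: the substitution is seen more often than chance predicts.
    Zero:     exactly as often as chance (no information).
    Negative: less often than chance.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(observed / expected, base)


# Hypothetical frequencies, chosen only to show the three possible signs.
print(log_odds(0.02, 0.01))  # observed twice as often as expected: positive
print(log_odds(0.01, 0.01))  # as expected by chance: zero
print(log_odds(0.01, 0.04))  # rarer than expected: negative
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios, which is why the total alignment score is a simple sum.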

Dayhoff Matrix (3) This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 accepted point mutation per 100 residues. Matrices for greater evolutionary distances are generated by repeatedly multiplying the underlying PAM 1 mutation probability matrix by itself and converting the result to log odds. PAM250: 2.5 mutations per residue; equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times); is the default in many analysis packages.
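The extrapolation from PAM 1 to larger distances can be illustrated with a toy example. The sketch below assumes a made-up two-letter alphabet in which each residue has a 1% chance of changing per PAM step; `TOY_PAM1`, `mat_mul` and `mat_power` are hypothetical names, and the real calculation uses the 20 x 20 mutation probability matrix before converting to log odds.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, exponent):
    """Multiply m by itself `exponent` times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1": rows are the original residue, columns the residue after
# one step. Each row sums to 1 because every residue ends up as something.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the diagonal entries shrink towards the background frequencies, which mirrors the slide's point that few positions remain unchanged at PAM250.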

BLOSUM Matrix Limit of Dayhoff matrix: Matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal… An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences.

BLOSUM Matrix (2) The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database. The BLOCKS database utilizes the concept of blocks (un-gapped amino acid patterns) that act as signatures of a family of proteins. Substitution frequencies for all pairs of amino acids were then calculated, and these were used to calculate a log odds BLOSUM matrix. Different matrices are obtained by varying the identity threshold. For example, BLOSUM80 was derived using blocks of 80% identity.

Which Matrix to use? Close relationships: low PAM, high BLOSUM. Distant relationships: high PAM, low BLOSUM. Often used defaults are PAM250 and BLOSUM62.
More conserved -> More variable
BLOSUM 80 -> BLOSUM 62 -> BLOSUM 45
PAM 20 -> PAM 120 -> PAM 250

Significance of alignment (1) When is an alignment statistically significant? In other words: How much different is the alignment score found from scores obtained by aligning a random sequence to the query sequence? Or: What is the probability that an alignment with this score could have arisen by chance?

Significance of alignment (2) Database size = 200 x 10^6 amino acids
peptide     #hits
A           10 x 10^6
AP          500 x 10^3
IAP         25000
LIAP        1250
WLIAP       62.5
KWLIAP      3.1
KWLIAPY     0.16
KWLIAPYS    0.008
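The hit counts on this slide are consistent with a simple model in which all 20 amino acids are equally likely at every position, so each extra residue divides the expected number of chance hits by 20. The sketch below reproduces the numbers under that assumption; the uniform-frequency model and the name `expected_hits` are part of the illustration, not something the slide states.

```python
DATABASE_SIZE = 200 * 10**6  # amino acids, as given on the slide
ALPHABET_SIZE = 20           # assumption: 20 equally frequent amino acids


def expected_hits(peptide, database_size=DATABASE_SIZE):
    """Expected number of exact chance matches of `peptide` in the database."""
    return database_size / ALPHABET_SIZE ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP",
                "KWLIAP", "KWLIAPY", "KWLIAPYS"]:
    print(f"{peptide:<10}{expected_hits(peptide):g}")
```

An eight-residue exact match is already expected less than once per hundred searches of a database this size, which is the intuition behind judging alignments by how likely they are to occur by chance.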

Sequence similarity search Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence? Input: Query sequence Output: List of sequences that are similar to the query sequence

BLAST BLAST – Basic Local Alignment Search Tool BLAST finds the highest scoring locally optimal alignments between a query sequence and all database sequences. Very fast algorithm Can be used to search extremely large databases Sufficiently sensitive and selective for most purposes Robust – the default parameters can usually be used

Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include discovering new genes or proteins discovering variants of genes or proteins exploring protein structure and function Etc. It is all about transfer of information!

BLAST – Algorithm Step 1: Read/understand user query sequence. Step 2: Use hashing technology to select several thousand likely candidates. Step 3: Do a real alignment between the query sequence and those likely candidates. N.B. ‘Real alignment’ is a main topic of this course. Step 4: Present result to user: list of sequences that match query sequence & their alignments.

BLAST Algorithm, Step 2 The program first looks for a series of short, highly similar fragments. It then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.
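The extension step can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual implementation: it assumes ungapped extension, a +1/-1 match/mismatch score instead of a substitution matrix, and a stop rule that ends the extension once the running score has fallen more than `drop_off` below the best score seen. All function and parameter names are hypothetical.

```python
def extend(query, subject, q_pos, s_pos, step, drop_off, match=1, mismatch=-1):
    """Extend in one direction (step = +1 or -1) without gaps.

    Returns (best score gained, number of residues kept).
    """
    score = best = 0
    length = best_len = 0
    q, s = q_pos, s_pos
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop_off:
            break  # score fell too far below the best: stop extending
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, word_len, drop_off=2):
    """Extend a seed match in both directions; return (score, query segment)."""
    seed_score = sum(1 if query[q_start + i] == subject[s_start + i] else -1
                     for i in range(word_len))
    right_gain, right_len = extend(query, subject, q_start + word_len,
                                   s_start + word_len, +1, drop_off)
    left_gain, left_len = extend(query, subject, q_start - 1,
                                 s_start - 1, -1, drop_off)
    segment = query[q_start - left_len:q_start + word_len + right_len]
    return seed_score + left_gain + right_gain, segment


# Seed "CGT" found at position 3 in both (made-up) sequences.
print(extend_seed("AAACGTACGTTT", "CCACGTACGTGG", 3, 3, 3))
```

The seed grows to the left and right only as long as doing so keeps the score near its maximum, so mismatching flanks are trimmed off the reported segment.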

Basic BLAST Algorithms
Program    Query             Database          Reading-frame comparisons
BLASTP     protein           protein           1
BLASTN     DNA               DNA               1
BLASTX     translated DNA    protein           6
TBLASTN    protein           translated DNA    6
TBLASTX    translated DNA    translated DNA    36

DNA potentially encodes six proteins
Forward reading frames: 5’ CAT CAA ..., 5’ ATC AAC ..., 5’ TCA ACT ...
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reverse reading frames: 5’ GTG GGT ..., 5’ TGG GTA ..., 5’ GGG TAG ...
Slide from Bioinformatics and Functional Genomics by Jonathan Pevsner. Copyright © 2009
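The six reading frames of the slide's example sequence can be listed with a short sketch. It only splits the two strands into codons and does not translate them into amino acids; the names `reverse_complement`, `codons` and `six_frames` are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Example sequence from the slide (top strand, written 5' to 3').
SLIDE_DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def codons(dna, offset):
    """Split dna into complete triplets, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Three frames on the given strand, then three on the reverse strand."""
    strands = (dna, reverse_complement(dna))
    return [codons(strand, offset) for strand in strands for offset in range(3)]


for frame in six_frames(SLIDE_DNA):
    print(" ".join(frame[:2]), "...")
```

The first two codons of each frame match the triplets shown on the slide, and this is why a translated search such as BLASTX compares six frames and TBLASTX compares 6 x 6 = 36 frame pairs.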

Position Specific Iterated BLAST PSI-BLAST is a rather permissive alignment tool that can find more distantly related sequences than FASTA or BLAST. In many cases it is much more sensitive to weak but biologically relevant sequence similarities.

Steps in running BLAST Entering your query sequence (cut-and-paste) Select the database(s) you want to search And, optionally: Choose output parameters Choose alignment parameters (scoring matrix, filters,….)

BLAST Input - FASTA format
>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
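For readers who want to handle FASTA input programmatically, here is a minimal parsing sketch. It assumes the common convention that a record starts with a `>` header line whose first word is the sequence name, followed by one or more sequence lines; `parse_fasta` is a hypothetical helper, and established libraries offer more robust parsers.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, comment, sequence) tuples."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = """>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFCSLISEDWVV
TAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAV
"""
for name, comment, sequence in parse_fasta(example):
    print(name, "|", comment, "|", len(sequence), "residues")
```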

BLAST Output [Annotated screenshot of a BLAST result list.] Click here to go to the corresponding swissprot entry. A low E-value indicates that a match is unlikely to have arisen by chance. A high score indicates a likely relationship. Click here to study the alignment in detail. Look here first!!

BLAST Output But remember: Low scores with high E-values suggest that matches have arisen by chance. And: Mathematical significance ≠ biological significance!

Alignment Significance in BLAST P value (probability) A p value is a way of representing the significance of an alignment. The closer to zero, the greater the confidence that the hit is significant. 0<p<1

Alignment Significance in BLAST E value (expect value) The expect value E is the number of alignments with scores greater than or equal to the current score S that are expected to occur by chance in a database search. E.g. an E value of 5 assigned to a hit indicates that in a database of the current size one might expect to see 5 matches with a similar score simply by chance. Rule of thumb: An E value of 10^-6 or better (i.e. smaller) normally means that things are OK.
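The P value and the E value are two views of the same statistic. Under the usual assumption that chance hits follow a Poisson distribution (a standard result that the slides do not spell out), the probability of seeing at least one such hit is P = 1 - e^(-E). The sketch below, with the hypothetical helper `p_value_from_e_value`, shows that the two are nearly identical for small E.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson statistics."""
    if e_value < 0:
        raise ValueError("E value cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (5, 1, 0.01, 1e-6):
    print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")
```

For E well below 1 the two numbers are practically interchangeable, while for large E the P value saturates near 1 and the E value remains the more informative of the two.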

BLAST result: easy

BLAST result: less easy

BLAST result: very difficult

BLAST parameter: Low complexity filter Many sequences contain repeats or stretches that consist predominantly of one type of amino acid. We call these low-complexity regions. Examples: Many nuclear proteins have a poly-asparagine tail (polyN). Huntington's disease: polyglutamine (polyQ) repeat. Membrane proteins often consist mainly of hydrophobic amino acids. Many binding proteins have proline-rich stretches, for example PPPPPPL/R.

BLAST - Low complexity filter Low complexity regions (such as NNNNNNNN) influence your BLAST output. Use the low complexity filter to adapt your BLAST query sequence: filter OFF keeps the NNNNNNNN stretch in the query, filter ON hides it. The choice depends on your research question!
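The effect of such a filter on a query can be imitated with a toy masking function. This sketch only replaces runs of one repeated residue with `X`; it is a stand-in for illustration and not the algorithm BLAST itself uses for low-complexity filtering. The name `mask_homopolymers`, the run length of 6 and the example sequence are assumptions.

```python
import re


def mask_homopolymers(sequence, min_run=6, mask_char="X"):
    """Replace every run of >= min_run identical residues with mask_char."""
    # (.) captures one residue, \1{n,} requires at least n further copies.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


query = "MKTNNNNNNNNLLA"            # made-up sequence with a polyN stretch
print(mask_homopolymers(query))     # filter ON:  MKTXXXXXXXXLLA
print(query)                        # filter OFF: unchanged
```

Masked positions no longer seed or score matches, so hits driven purely by the repeat disappear; whether that is desirable depends on whether the repeat itself is the subject of the search.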

Low complexity motifs visible

Things we discussed today: Why we want to do database searches (transfer of information!); Alignment & scoring methods; Significance of alignments; BLAST: principle of the method; BLAST output, in particular the E-value; BLAST input parameters, in particular the low complexity filter. Let's BLAST!!