1
Database searching with BLAST
Outline of today’s lecture: transfer of information; database searching with sequences; sequence alignment; scoring matrices; significance of alignments; BLAST (method, parameters, output).
Celia van Gelder, CMBI, UMC Radboud, September 2013
2
Transfer of information
The main topic of this course is transfer of information from a well-known to a “new” system (sequence). In the protein world that leads to the questions: From which protein can I transfer information? How do I transfer what information from where to where? Today’s answer is BLAST…
3
BLAST - Searching with sequences
LAST WEEK: Searching with words (Google-like). Query = word(s). Tools used: MRS-Search, Entrez, SRS, … TODAY: Searching with sequences. Query = sequence. Tool used: BLAST (MRS, NCBI, ..)
4
Database Searching with a query sequence
Purpose: to identify similarities between your query sequence (with unknown structure and function) and database sequences (with elucidated structure and function). If we identify similarity we can transfer information!
5
Transfer of information to corresponding residues
Your sequence: DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN is phosphorylated on one of the two serines. Which one? What is your approach?
6
Transfer of information to corresponding residues
BLAST finds two database hits that are annotated to have a phosphorylated serine.
DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN
DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN
AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN
“This serine is phosphorylated in a known protein from the database, so in my protein the corresponding serine is likely to be phosphorylated too.”
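The reasoning on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say which serine in the database hits carries the annotation, so the sketch assumes the annotated serine is the one that lines up with a serine in every sequence. The function name `conserved_residue_positions` is hypothetical.

```python
# Aligned sequences copied from the slide; '-' marks a gap.
query = "DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN"
hits = [
    "DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN",
    "AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN",
]

def conserved_residue_positions(query, hits, residue="S"):
    """Return 1-based positions in the ungapped query where `residue`
    is aligned to the same residue in every hit."""
    positions = []
    query_pos = 0  # position in the query, not counting gaps
    for column, q in enumerate(query):
        if q == "-":
            continue
        query_pos += 1
        if q == residue and all(h[column] == residue for h in hits):
            positions.append(query_pos)
    return positions

serines = conserved_residue_positions(query, hits)
print(serines)  # [24]: only the serine in ...ENASEE... is aligned to serines
```

The first serine of the query (position 11) is aligned to T and L in the hits, so under the stated assumption only the second serine can inherit the annotation.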
7
Database searching concept
The query sequence is compared (aligned) with every sequence in the database. High-scoring database sequences are assumed to be evolutionarily related to the query sequence. If sequences are related by divergence from a common ancestor, they are said to be homologous.
8
Sequence Alignment
Figure: alignment of sequences A and B. gap = insertion or deletion (indel)
9
Sequence alignment is easy:
You only need three things: (1) a computer program that produces all possible alignments; (2) a computer program that gives each alignment a score; and (3), the simplest, a computer program that selects the highest-scoring alignment from the very large number you tried.
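To get a feeling for "the very large number", the following sketch counts how many alignments exist for two sequences of given lengths. It assumes one particular counting convention (every column is either a residue pair, a gap in the first sequence, or a gap in the second, and different column orders count as different alignments); other conventions give different totals. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of ways to align sequences of lengths m and n, where each
    column is a residue pair, a gap in the first or a gap in the second."""
    if m == 0 or n == 0:
        return 1  # only one way left: pair the remaining residues with gaps
    return (count_alignments(m - 1, n - 1)   # align two residues
            + count_alignments(m - 1, n)     # gap in the second sequence
            + count_alignments(m, n - 1))    # gap in the first sequence

for length in (2, 5, 11):
    print(length, count_alignments(length, length))
```

Under this convention two 11-residue peptides already have about 45 million possible alignments, which is why practical programs do not enumerate them one by one.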
10
Scoring/Substitution Matrix
A scoring scheme for the quality of an alignment. It contains scores for every possible amino acid substitution in a sequence alignment. For protein/protein comparisons we need a 20 x 20 matrix with scores for pairs of residues. The cell at position (X, Y) contains the score for the substitution/mutation amino acid X -> amino acid Y.
11
Scores
A positive score is given if the corresponding amino acid residues in the two aligned sequences are identical or similar: this is a likely change. A negative score is given if the corresponding residues are not similar: this is an unlikely change. The scores are numbers that you can add up.
12
Amino Acid substitutions, some thoughts
Not all 20 x 20 possible mutations occur equally often. Residues mutate more easily to similar ones (e.g. leucine and isoleucine). Residues at the surface mutate more easily. Aromatics mutate preferably into aromatics. The core tends to be hydrophobic. Cysteines are dangerous at the surface. Cysteines in sulfur bridges (S-S) seldom mutate. Some amino acids have similar codons (for example TTT & TTC for Phe, TTA & TTG for Leu). Etc.
13
PAM250 Matrix (Dayhoff Matrix)
14
Scoring example
The score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
pair score:  1 12 12  6  2  5 -1  2  6  1  0   => score = 46
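The worked example above can be reproduced with a short Python sketch. The pair scores are the ones shown on the slide for this example; the dictionary is deliberately incomplete (a real matrix has an entry for every pair), and the function names are hypothetical.

```python
# Pair scores read off the slide's worked example; only the pairs needed
# for this example are listed, so this is not a complete matrix.
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is treated as symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]

def alignment_score(seq1, seq2):
    """Sum the scores of all aligned residue pairs (gaps are not handled)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

total = alignment_score("TCCPSIVARSN", "SCCPSISARNT")
print(total)  # 46, matching the slide
```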
15
Scoring matrix, continued
When you use bioinformatics tools (BLAST, CLUSTAL, etc.) the scoring matrix is often a parameter that you can choose. Two widely used matrices (often the default in the packages): PAM250 (Dayhoff et al.), based on closely similar proteins; BLOSUM62 (Henikoff et al.), based on conserved regions and considered best for distantly related proteins.
16
Dayhoff Matrix (1)
The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously. They then counted all mutations (and non-mutations) and calculated the mutation frequencies. With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).
17
Dayhoff Matrix (2)
Given the frequency of Leu and Val in my sequences, and the frequency of mutations, do I see more mutations of V -> L than I would expect by chance alone? Score of mutation A -> B = log (observed A -> B mutations / expected A -> B mutations). This is called a log-odds score and can be negative, zero, or positive. Zero means no information, that is, no contribution to the score of the alignment. When using a log-odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
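A minimal sketch of the log-odds idea follows. Two things are assumptions, not taken from the slide: the logarithm base (base 10 here) and the chance model (the expected pair frequency is taken as the product of the two residue frequencies). The frequencies are hypothetical numbers chosen for illustration.

```python
import math

def log_odds(observed_pair_freq, freq_a, freq_b):
    """log10(observed / expected), where expected = freq_a * freq_b is the
    pair frequency if the two residues were paired by chance alone."""
    expected = freq_a * freq_b
    return math.log10(observed_pair_freq / expected)

# Hypothetical frequencies, for illustration only.
freq_val, freq_leu = 0.07, 0.09
expected_pair = freq_val * freq_leu

more_than_chance = log_odds(2 * expected_pair, freq_val, freq_leu)    # positive
as_chance = log_odds(expected_pair, freq_val, freq_leu)               # zero
less_than_chance = log_odds(0.5 * expected_pair, freq_val, freq_leu)  # negative
print(more_than_chance, as_chance, less_than_chance)
```

Because the scores are logarithms of ratios, adding them along an alignment corresponds to multiplying the underlying odds, which is why the pair scores can simply be summed.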
18
Dayhoff Matrix (3)
This log-odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues. The underlying PAM 1 mutation probability matrix may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself; the log-odds scores are then computed from the result. PAM250 corresponds to 2.5 mutations per residue. This is equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times). PAM250 is the default in many analysis packages.
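The repeated multiplication can be illustrated with a toy example. The sketch below uses a hypothetical 3-letter alphabet with made-up probabilities, not the real 20 x 20 Dayhoff data, so the numbers it prints say nothing about real PAM250 values; it only shows the mechanics of going from 1 PAM to 250 PAM.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, steps):
    """Multiply matrix m by itself until it represents `steps` PAM units."""
    result = m
    for _ in range(steps - 1):
        result = mat_mul(result, m)
    return result

# Toy 3-letter alphabet (hypothetical): each residue stays the same with
# probability 0.99 and changes to each of the other two with 0.005,
# i.e. 1 accepted change per 100 residues.
toy_pam1 = [[0.99, 0.005, 0.005],
            [0.005, 0.99, 0.005],
            [0.005, 0.005, 0.99]]

toy_pam250 = mat_power(toy_pam1, 250)
print([round(x, 3) for x in toy_pam250[0]])
```

Each row still sums to 1 after 250 steps, but the probability of finding the original residue has dropped sharply, even though it remains the single most likely outcome.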
19
BLOSUM Matrix
Limit of the Dayhoff matrix: matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal… An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences.
20
BLOSUM Matrix (2)
The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database. The BLOCKS database uses the concept of blocks (ungapped amino acid patterns) that act as signatures of a family of proteins. Substitution frequencies for all pairs of amino acids were calculated from these blocks and used to compute a log-odds BLOSUM matrix. Different matrices are obtained by varying the identity threshold. For example, BLOSUM80 was derived using blocks of 80% identity.
21
Which Matrix to use?
Close relationships: low PAM, high BLOSUM. Distant relationships: high PAM, low BLOSUM. Often used defaults are PAM250 and BLOSUM62.
More conserved    BLOSUM 80    BLOSUM 62    BLOSUM 45    More variable
                  PAM 20       PAM 120      PAM 250
22
Significance of alignment (1)
When is an alignment statistically significant? In other words: how different is the alignment score found from the scores obtained by aligning a random sequence to the query sequence? Or: what is the probability that an alignment with this score could have arisen by chance?
23
Significance of alignment (2)
Database size = 200 x 10^6 amino acids
peptide     #hits
A           10 x 10^6
AP          500 x 10^3
IAP         25 x 10^3
LIAP        1250
WLIAP       62.5
KWLIAP      3.1
KWLIAPY     0.16
KWLIAPYS    0.008
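The numbers in this table follow from a simple chance model, sketched below. The model is an assumption that fits the slide's figures: all 20 amino acids are equally frequent, positions are independent, and edge effects at sequence ends are ignored. The function name `expected_hits` is hypothetical.

```python
DATABASE_SIZE = 200 * 10**6   # amino acids, as on the slide
N_AMINO_ACIDS = 20

def expected_hits(peptide, database_size=DATABASE_SIZE):
    """Expected number of exact chance matches, assuming all 20 amino acids
    are equally frequent and positions are independent."""
    return database_size / N_AMINO_ACIDS ** len(peptide)

for peptide in ("A", "AP", "IAP", "LIAP", "WLIAP",
                "KWLIAP", "KWLIAPY", "KWLIAPYS"):
    print(f"{peptide:<9} {expected_hits(peptide):g}")
```

Every extra residue divides the expected count by 20, so an exact match of eight residues is already unlikely to occur by chance in a database of this size.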
24
Sequence similarity search
Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence? Input: query sequence. Output: list of sequences that are similar to the query sequence.
25
BLAST – Basic Local Alignment Search Tool
BLAST finds the highest-scoring locally optimal alignments between a query sequence and all database sequences. It is a very fast algorithm; it can be used to search extremely large databases; it is sufficiently sensitive and selective for most purposes; and it is robust: the default parameters can usually be used.
26
Why use BLAST?
BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include discovering new genes or proteins, discovering variants of genes or proteins, exploring protein structure and function, etc. It is all about transfer of information!
27
BLAST – Algorithm
Step 1: Read/understand user query sequence.
Step 2: Use hashing technology to select several thousand likely candidates.
Step 3: Do a real alignment between the query sequence and those likely candidates. N.B. ‘Real alignment’ is a main topic of this course.
Step 4: Present result to user: list of sequences that match the query sequence and their alignments.
28
BLAST Algorithm, Step 2
The program first looks for series of short, highly similar fragments; it then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.
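The seed-and-extend idea can be sketched as follows. This is a strongly simplified illustration, not BLAST itself: seeds are exact 3-letter words found through a dictionary (hash table), extension is ungapped, and scoring is a hypothetical +1 for a match and -1 for a mismatch instead of a substitution matrix. All function and parameter names are made up for the example.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact word shared by both
    sequences, using a dictionary (hash table) of the query's words."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds

def extend(query, subject, i, j, word_size=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions; stop a direction when
    the running score falls more than `drop` below the best seen so far."""
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + word_size, j + word_size, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    start_q, start_s = i - left_len, j - left_len
    length = left_len + word_size + right_len
    score = word_size * match + left_score + right_score
    return (score,
            query[start_q:start_q + length],
            subject[start_s:start_s + length])

query, subject = "TCCPSIVARSN", "SCCPSISARNT"
seeds = find_seeds(query, subject)
print(seeds)
print(extend(query, subject, *seeds[0]))
```

The extension keeps the best-scoring stretch it has seen, so trailing mismatches that only lower the score are trimmed from the reported segment.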
29
Basic BLAST Algorithms
Program    Query            Database         Comparisons
BLASTP     protein          protein          1
BLASTN     DNA              DNA              1
BLASTX     translated DNA   protein          6
TBLASTN    protein          translated DNA   6
TBLASTX    translated DNA   translated DNA   36
30
DNA potentially encodes six proteins
Forward reading frames (top strand):
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reverse reading frames (bottom strand, read 5’ to 3’):
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Slide from Bioinformatics and Functional Genomics by Jonathan Pevsner. Copyright © 2009
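The six reading frames shown on this slide can be generated with a short Python sketch. It only splits the two strands into codons; translating the codons into amino acids would need a codon table, which is left out here. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def codons(dna, offset):
    """Split dna into complete codons, starting at `offset` (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def six_frames(dna):
    """Codon lists for the three forward and three reverse reading frames."""
    reverse = reverse_complement(dna)
    return [codons(strand, offset)
            for strand in (dna, reverse)
            for offset in range(3)]

# Top strand copied from the slide.
top_strand = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(top_strand)
for frame in frames:
    print(" ".join(frame[:2]))
```

The first two codons of each frame match the ones printed on the slide, which is why BLASTX and TBLASTN need 6 comparisons per sequence pair and TBLASTX needs 6 x 6 = 36.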
31
Position Specific Iterated BLAST
PSI-BLAST is a rather permissive alignment tool and it can find more distantly related sequences than FASTA or BLAST. In many cases it is much more sensitive to weak but biologically relevant sequence similarities.
32
Steps in running BLAST
Enter your query sequence (cut-and-paste). Select the database(s) you want to search. And, optionally: choose output parameters; choose alignment parameters (scoring matrix, filters, …).
33
BLAST Input - FASTA format
>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
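To make the format concrete, here is a minimal FASTA reader in Python. It is a sketch for illustration (the function name `parse_fasta` is hypothetical) and skips validation that a production parser would do; the example record uses the header and the first 80 residues of the sequence above, wrapped over two lines as FASTA files often are.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.
    The header is the text after '>'; sequence lines are joined together."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAV
PGTWPWQVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEI
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

The first word of the header line is the sequence name; everything after it is free-text comment.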
34
BLAST Output
Annotations on the BLAST output: “Click here to go to the corresponding Swiss-Prot entry.” “A low E-value indicates that a match is unlikely to have arisen by chance.” “A high score indicates a likely relationship.” “Click here to study the alignment in detail.” “Look here first!!”
35
BLAST Output
Low scores with high E-values suggest that matches have arisen by chance. But remember: mathematical significance ≠ biological significance!
36
Alignment Significance in BLAST P value (probability)
A p value is a way of representing the significance of an alignment. The closer to zero, the greater the confidence that the hit is significant. 0<p<1
37
Alignment Significance in BLAST E value (expect value)
The expect value E is the number of alignments with scores greater than or equal to the current score S that are expected to occur by chance in a database search. For example, an E value of 5 assigned to a hit indicates that in a database of the current size one might expect to see 5 matches with a similar score simply by chance. Rule of thumb: an E value of 10^-6 or lower normally means that things are OK.
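The rule of thumb can be applied to a hit list as in the sketch below. The hit names and E values are hypothetical. The conversion from E to P assumes that chance hits follow a Poisson distribution, which gives P = 1 - exp(-E); this relation is not stated on the slides.

```python
import math

E_VALUE_CUTOFF = 1e-6   # rule of thumb from the text

def p_from_e(e_value):
    """Probability of at least one chance hit with this score or better,
    assuming chance hits follow a Poisson distribution: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

def confident_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep (name, e_value) pairs at or below the cutoff, best first."""
    return sorted((h for h in hits if h[1] <= cutoff), key=lambda h: h[1])

# Hypothetical hit list, for illustration only.
hits = [("hit_A", 3e-42), ("hit_B", 5.0), ("hit_C", 2e-7), ("hit_D", 0.01)]
kept = confident_hits(hits)
print(kept)
print(p_from_e(5.0), p_from_e(2e-7))
```

For small values E and P are practically the same number, while for large E the P value saturates near 1; this is one reason BLAST reports E rather than P.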
38
BLAST result: easy
39
BLAST result: less easy
40
BLAST result: very difficult
41
BLAST parameter: Low complexity filter
Many sequences contain repeats or stretches that consist predominantly of one type of amino acid. We call these low-complexity regions. Examples: many nuclear proteins have a poly-asparagine tail (polyN); Huntington’s disease involves a polyglutamine (polyQ) repeat; membrane proteins often consist of mainly hydrophobic amino acids; many binding proteins have proline-rich stretches, for example PPPPPPL/R.
42
BLAST - Low complexity filter
Low-complexity regions (for example NNNNNNNN) influence your BLAST output. Use the low-complexity filter to adapt your BLAST query sequence: with the filter OFF the region NNNNNNNN is searched as it is; with the filter ON it is masked. The choice depends on your research question!
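What masking does to a query can be shown with a deliberately simple sketch. It is not the algorithm BLAST uses (BLAST relies on dedicated programs such as SEG for proteins and DUST for nucleotides, which measure compositional complexity); here only runs of one repeated residue are replaced by X. The sequence and the function name are hypothetical.

```python
import re

def mask_homopolymer_runs(sequence, min_run=6, mask_char="X"):
    """Replace every run of at least `min_run` identical residues with
    `mask_char`, leaving the rest of the sequence unchanged."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)

# Hypothetical query with a poly-asparagine stretch.
unfiltered = "MKTAYNNNNNNNNLQERT"
filtered = mask_homopolymer_runs(unfiltered)
print(unfiltered)  # filter OFF: sequence is searched as it is
print(filtered)    # filter ON: MKTAYXXXXXXXXLQERT
```

Masked positions cannot seed or extend an alignment, so hits that are driven only by the repeat disappear from the output; whether that is desirable depends on the research question.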
43
Low complexity motifs visible
44
Things we discussed today
Why we want to do database searches: transfer of information! Alignment and scoring methods. Significance of alignments. BLAST: principle of the method. BLAST output, in particular the E-value. BLAST input parameters, in particular the low-complexity filter. Let’s BLAST!!