
BLAST, PSI-BLAST and position-specific scoring matrices
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington

Outline
– Responses from last class
– Revision
– BLAST
– PSI-BLAST
– Position-specific scoring matrices (PSSMs)
– Python

One-minute responses
– Please explain the null and alternative hypothesis again.
– Liked giving examples on the statistical concepts.
– Sometimes the class is boring because you are using only the projector.
– For Python, we learn more by practicing than just looking at your code.
– Python session was good, but too fast.
– More Python examples, please.
– The Python is difficult because it is different from what we learned before.
– The problem is how to use sys in Python.
– I hope you give lots of examples for the sys command.
– Please be available for consultation over the weekend on the assignment.
– Does BLAST use p-values to decide which alignments to consider?

Revision
What is a distribution?
– A mathematical function whose values sum to 1.
If you roll a single die many times and make a histogram of the resulting values, what kind of distribution will you observe?
– Uniform
If you compare a protein sequence to many randomly shuffled protein sequences and make a histogram of the resulting scores, what kind of distribution will you observe?
– Extreme value distribution
What is the definition of “null hypothesis”?
– A statistical model of the situation that we are not interested in.
What is the opposite of the null hypothesis?
– The alternative hypothesis.
What is the name of the estimated probability of observing the data, assuming that it was generated according to the null hypothesis?
– p-value
How do you decide what p-value threshold to use?
– Consider the costs associated with making a mistake.

Significance of scores
[Figure: a sequence alignment algorithm compares the two sequences below and returns a score of 45.]
HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT
LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE
Low score = unrelated. High score = homologs. How high is high enough?

Database searching
[Figure: a sequence comparison algorithm compares the query against a sequence database and returns the targets ranked by score.]

How long does DP take?
[Figure: dynamic programming matrix, with the target sequence of length m along one axis and the query sequence of length n along the other.]
– There are nm entries in the matrix.
– Each entry requires a constant number c of operations.
– The total number of required operations is approximately nmc.
– We say that the algorithm is “order nm” or “O(nm).”

How long does DP take?
– Say that your query is 200 amino acids long.
– You are searching a database that contains a million proteins.
– If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries.
– If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 operations.
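The arithmetic on this slide can be checked in a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are ours, and the numbers are the ones given above.

```python
# Cost of filling every DP matrix when searching a whole database.
query_length = 200          # amino acids in the query
num_targets = 1_000_000     # proteins in the database
avg_target_length = 200     # average amino acids per target
ops_per_cell = 10           # operations needed to fill one DP entry

cells = query_length * avg_target_length * num_targets
operations = cells * ops_per_cell

print(f"{cells:.0e} DP entries, {operations:.0e} operations")
```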

BLAST
– DP is O(nm); BLAST is O(m).
– Fundamental innovation: employ a data structure to index the query sequence.
– The data structure allows you to look up entries in a table in O(1) time.
– Example: does my length-n sequence contain the subsequence “GTR”?
  Naive method: scan the sequence, O(n).
  Improved method: hash table or search tree lookup, O(1).
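A minimal sketch of the indexing idea, using a Python dictionary as the hash table. The function name `build_word_index`, the word length of 3 and the example query are assumptions made for illustration; this is not BLAST's actual data structure.

```python
from collections import defaultdict

def build_word_index(query, k=3):
    """Map each length-k word in the query to the 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)

query = "MKGTRAGTRW"             # made-up query sequence
index = build_word_index(query)

# Building the index costs O(n) once; afterwards each lookup is a
# single hash-table probe instead of a scan of the whole sequence.
print("GTR" in index)            # does the query contain "GTR"?
print(index.get("GTR"))          # and if so, where?
```

The index is built once per query and then reused for every word of every target in the database, which is where the saving comes from.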

BLAST
[Figure, built up over several slides: a dot plot with the query sequence on one axis and the target sequence on the other, next to a list of the words in the query and similar words.]
– For each word in the target, ask: “Does this target word appear in the query word list?”
– If it does, the lookup also says where: “Yes, at position 34 in the query sequence.” The hit is marked with an x in the plot.
– Repeating this for the whole target gives a scattering of hits across the plot.
– When two hits are on the same diagonal and close to each other, try to connect them.
– Assign a score to each hit.
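The hit-finding and diagonal steps can be sketched as follows. This is a simplified illustration, not BLAST itself: it uses exact word matches only (no similar words), and the function names, the `max_gap` parameter and the two example sequences are all hypothetical.

```python
from collections import defaultdict

def find_hits(query, target, k=3):
    """Return (query_pos, target_pos) for every length-k word shared exactly."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits

def close_pairs_on_diagonal(hits, k=3, max_gap=10):
    """Find consecutive, non-overlapping hits on one diagonal at most max_gap apart."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)      # hits with equal j - i share a diagonal
    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for a, b in zip(positions, positions[1:]):
            if k <= b - a <= max_gap:
                pairs.append((diagonal, a, b))
    return pairs

query = "AAGTRCCWYHDD"     # made-up sequences
target = "PPAGTRLLWYHEE"
hits = find_hits(query, target)
print(hits)
print(close_pairs_on_diagonal(hits))
```

Each returned triple is a diagonal together with the query positions of two hits that are candidates for being connected into a longer alignment.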

BLAST
– “The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.”
– The initial word threshold T is the most important parameter.
– Low T = high sensitivity, long compute.
– High T = low sensitivity, quick compute.
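To see why T trades sensitivity against compute time, the sketch below lists every word whose score against a query word is at least T. The four-letter alphabet and the match/mismatch scores are toy assumptions chosen to keep the output small; a real search would use all 20 amino acids and a matrix such as BLOSUM62.

```python
from itertools import product

ALPHABET = "ACDE"                      # toy alphabet, not the real 20 amino acids

def score(a, b):
    """Hypothetical substitution scores: +4 for a match, -1 for a mismatch."""
    return 4 if a == b else -1

def neighborhood(word, threshold):
    """All words of the same length that score at least `threshold` against `word`."""
    similar = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            similar.append("".join(candidate))
    return similar

for T in (12, 7, 2):
    print(T, len(neighborhood("ACD", T)))
```

Lowering T lengthens the word list, so more target words produce hits (higher sensitivity) and more hits have to be followed up (longer compute).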

When does BLAST fail?
– BLAST works by joining together short regions of high similarity.
– Therefore, BLAST will fail to detect long regions of low similarity.
Example (the two sequences match only at every fourth position):
ERDCRVSSFRVKENFDKARFAGTWYAMAKKDPEGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDT
E   R   F   E   K   A   Y   K   E   L   I   F   E   M   A   V   N   V   M   F
ECEIRQFLFIQRESARKEACATGTYREKKMDPELIVLVIWICPQFEQLEMRAMWIHAKJEVIUENAQCVIYTMQEPFCII

Summary of BLAST Dynamic programming is O(nm), where n is the length of the query and m is the size of the database. BLAST is O(m). BLAST produces an index of the query sequence that allows fast matching to the database. Relative to Smith-Waterman, BLAST can produce false negatives; i.e., homologs that BLAST fails to detect.

BLAST
[Figure: BLAST takes a query and a sequence database and returns homologs.]

Position-specific iterated BLAST
[Figure: BLAST searches the sequence database with the query and returns homologs; the homologs feed a statistical model of the protein family, the position-specific scoring matrix (PSSM).]

Position-specific scoring matrix
– A PSSM is an n by m matrix, where n is the size of the alphabet and m is the length of the sequence.
– The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.
[Table: one row per amino acid (A R N D C Q E G H I L K M F P S T W Y V), one column per position in the query sequence. In the example, “K” at position 3 gets a score of 2.]

Position-specific scoring matrix
This PSSM assigns the sequence NMFWAFGH a score of 12, the sum of the entries for each letter at its position.
[Table: the same PSSM, one row per amino acid and one column per position.]

What score does this PSSM assign to KRPGHFLA?
– Answer: 6
[Table: the same PSSM, one row per amino acid and one column per position.]
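Scoring a sequence with a PSSM is a sum of one lookup per position. The sketch below uses a small made-up PSSM over three letters and three positions, because the values of the matrix on the slide are not reproduced here; `score_sequence` and `toy_pssm` are hypothetical names.

```python
def score_sequence(pssm, sequence):
    """Sum the PSSM entry for each letter of `sequence` at its position."""
    length = len(next(iter(pssm.values())))
    if len(sequence) != length:
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(pssm[letter][position] for position, letter in enumerate(sequence))

# Rows are letters, columns are positions (illustrative values only).
toy_pssm = {
    "A": [2, -1, 0],
    "K": [-1, 3, 2],
    "G": [0, -2, 4],
}

print(score_sequence(toy_pssm, "AKG"))   # 2 + 3 + 4
print(score_sequence(toy_pssm, "GAK"))   # 0 + (-1) + 2
```

Note that “K” at the third position (index 2) gets a score of 2 in this toy matrix as well, mirroring the example on the slide.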

How PSI-BLAST makes PSSMs

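Position-specific iterated BLAST
[Figure: BLAST searches the sequence database with the query, producing a multiple alignment. The step from multiple alignment to PSSM is marked with a question mark.]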

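Creating a PSSM from 1 sequence
– Start from the 20 by 20 BLOSUM62 matrix and the query sequence, for example RNRGQFGH.
– For each position in the query, take the BLOSUM62 row for the residue at that position (for example, the row for R) and use it as the corresponding column of the PSSM.
– The result is a 20 by L matrix, where L is the length of the query.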
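A minimal sketch of this construction, assuming the PSSM is stored as a dictionary of rows. To stay short it uses a three-letter substitution matrix with illustrative values rather than the full 20 by 20 BLOSUM62; `TOY_MATRIX` and `pssm_from_query` are hypothetical names.

```python
# Symmetric toy substitution matrix over three letters (illustrative values,
# standing in for the full 20 by 20 BLOSUM62 matrix).
TOY_MATRIX = {
    "R": {"R": 5, "N": 0, "G": -2},
    "N": {"R": 0, "N": 6, "G": 0},
    "G": {"R": -2, "N": 0, "G": 6},
}

def pssm_from_query(query, matrix):
    """Build a PSSM whose column j is the substitution-matrix row of query[j]."""
    return {letter: [matrix[residue][letter] for residue in query]
            for letter in sorted(matrix)}

pssm = pssm_from_query("RNRG", TOY_MATRIX)
for letter, row in pssm.items():
    print(letter, row)
```

With a single sequence, scoring a target against this PSSM gives the same result as scoring it against the query with the substitution matrix directly; the PSSM only becomes more informative once more sequences are added.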

Creating a PSSM from multiple sequences
Discard columns that contain gaps in the query.
For each column C
– Compute relative sequence weights
– Compute PSSM entries, taking into account
  – Observed residues in this column
  – Sequence weights
  – Substitution matrix

Discard query gap columns
Before:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA
After:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
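This step can be sketched in a few lines, assuming the alignment is a list of equal-length strings and that the first row is the query (the slide does not say which row is the query, so that is our assumption). The function name is hypothetical.

```python
def discard_query_gap_columns(alignment, query_index=0, gap="-"):
    """Drop every column in which the query row has a gap character."""
    query = alignment[query_index]
    keep = [col for col, letter in enumerate(query) if letter != gap]
    return ["".join(row[col] for col in keep) for row in alignment]

alignment = [
    "EEFG----SVDGLVNNA",   # assumed to be the query
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]

for row in discard_query_gap_columns(alignment):
    print(row)
```

Gaps in the other rows are kept; only columns where the query itself has a gap are removed, so the result has exactly as many columns as the query has residues.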

Compute sequence weights
– Low weights are assigned to redundant sequences.
– High weights are assigned to unique sequences.
Sequence        Weight
EEFGSVDGLVNNA   1.2
QKYGRLDVMINNA   1.2
RRLGTLNVLVNNA   0.8
GGIGPVDLLVNNA   0.8
KALGGFNVIVNNA   1.1
ARFGKIDTLIPNA   0.9
FEPEGMWGLVNNA   1.1
AQLKTVDVLINGA   1.3
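The slide does not say how the weights are computed. One widely used scheme with the stated behavior is position-based weighting (Henikoff and Henikoff), sketched below as an assumption: in each column, a sequence receives 1 / (k × n), where k is the number of distinct residues in the column and n is the number of sequences sharing its residue. It is not claimed to reproduce the weights shown above, and the rescaling to an average weight of 1 is our choice.

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, rescaled so the average weight is 1."""
    num_sequences = len(alignment)
    weights = [0.0] * num_sequences
    for col in range(len(alignment[0])):
        column = [row[col] for row in alignment]
        counts = Counter(column)
        distinct = len(counts)
        for s, residue in enumerate(column):
            # Residues shared by many sequences contribute little to each.
            weights[s] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w * num_sequences / total for w in weights]

# Two identical (redundant) sequences and one unique sequence.
print(position_based_weights(["AAA", "AAA", "CCC"]))
```

The two redundant sequences share their weight, while the unique sequence receives twice as much as either of them.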

Compute PSSM entries
[Figure: the weighted alignment below is combined with the BLOSUM62 matrix to produce the PSSM.]
Sequence        Weight
EEFGSVDGLVNNA   1.2
QKYGRLDVMINNA   1.2
RRLGTLNVLVNNA   0.8
GGIGPVDLLVNNA   0.8
KALGGFNVIVNNA   1.1
ARFGKIDTLIPNA   0.9
FEPEGMWGLVNNA   1.1
AQLKTVDVLINGA   1.3

Position-specific iterated BLAST
[Figure: the complete cycle of query, BLAST search of the sequence database, multiple alignment and PSSM.]

Summary of PSI-BLAST
– PSI-BLAST builds a model of the query sequence and its close homologs.
– Instead of comparing a target sequence to the query, each target is compared to the model.
– The PSI-BLAST model is called a position-specific scoring matrix (PSSM).
– The PSSM can be constructed from a collection of targets aligned to the query sequence.
– PSI-BLAST is more accurate than BLAST, in particular at detecting distant homologs.

Sample problem #1
Given:
– a file containing a sequence of amino acids
Return:
– the amino acid counts

./compute-counts.py seq1.txt
Read 68 amino acids from seq1.txt.
A 5
C 2
D 3
E 1
F 6
G 0
H 0
I 2
K 2
L 8
M 1
N 5
P 7
Q 1
R 1
S 2
T 5
V 6
W 3
Y 8
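One possible solution, offered as a sketch rather than the reference answer. It assumes the file contains only the sequence (no header line), possibly split over several lines, and it uses `sys.argv` to read the file name from the command line.

```python
#!/usr/bin/env python
import sys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 one-letter codes, alphabetical

def count_amino_acids(sequence):
    """Return a dict mapping each of the 20 amino acids to its count."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for letter in sequence:
        if letter in counts:
            counts[letter] += 1
    return counts

def main(filename):
    with open(filename) as handle:
        # Remove newlines and spaces so the sequence may span several lines.
        sequence = "".join(handle.read().split()).upper()
    counts = count_amino_acids(sequence)
    print(f"Read {sum(counts.values())} amino acids from {filename}.")
    for aa in AMINO_ACIDS:
        print(aa, counts[aa])

if __name__ == "__main__":
    # sys.argv[0] is the script name; sys.argv[1] is the first argument.
    if len(sys.argv) != 2:
        print("usage: compute-counts.py <sequence-file>", file=sys.stderr)
    else:
        main(sys.argv[1])
```

Initializing every count to zero is what makes amino acids that never occur, such as G and H in the example output, still appear with a count of 0.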

Sample problem #2
Given:
– a pseudocount weight
– a file containing amino acid frequencies
– a file containing a sequence of amino acids
Return:
– the summed amino acid counts and pseudocounts
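A sketch of one way to approach this problem. Two things are assumed because the slide does not specify them: each line of the frequency file has the form `<letter> <frequency>`, and the pseudocount for an amino acid is the weight multiplied by its frequency. The script name and function names are hypothetical.

```python
import sys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parse_frequencies(lines):
    """Parse lines of the assumed form '<letter> <frequency>' into a dict."""
    frequencies = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            frequencies[fields[0]] = float(fields[1])
    return frequencies

def counts_plus_pseudocounts(sequence, frequencies, weight):
    """Observed count of each amino acid plus weight * its frequency."""
    return {aa: sequence.count(aa) + weight * frequencies.get(aa, 0.0)
            for aa in AMINO_ACIDS}

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("usage: add-pseudocounts.py <weight> <frequency-file> <sequence-file>",
              file=sys.stderr)
    else:
        weight = float(sys.argv[1])          # command-line arguments are strings
        with open(sys.argv[2]) as handle:
            frequencies = parse_frequencies(handle)
        with open(sys.argv[3]) as handle:
            sequence = "".join(handle.read().split()).upper()
        summed = counts_plus_pseudocounts(sequence, frequencies, weight)
        for aa in AMINO_ACIDS:
            print(aa, summed[aa])
```

With a positive weight, amino acids that never appear in the sequence still receive a small nonzero value, which is the purpose of pseudocounts.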

Sample problem #3
Given:
– a pseudocount weight
– a file containing amino acid frequencies
– a file containing a sequence of amino acids
Return:
– the normalized summed amino acid counts and pseudocounts
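The only new step relative to the previous problem is normalization: dividing each summed value by the total so that the results sum to 1. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def normalize(values):
    """Divide every value by the total so that the results sum to 1."""
    total = sum(values.values())
    if total == 0:
        raise ValueError("cannot normalize: the values sum to zero")
    return {key: value / total for key, value in values.items()}

# Made-up summed counts and pseudocounts for three amino acids.
summed = {"A": 3.0, "C": 1.5, "D": 0.5}
probabilities = normalize(summed)
print(probabilities)
```

After normalization the values form a distribution in the sense used in the revision slide: a set of values that sum to 1.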