#11 - MSAs; PSSMs & Psi-BLAST

Presentation transcript:

BCB 444/544, Lecture 12 (9/17/07): Multiple Sequence Alignment (MSA), PSSMs & Psi-BLAST
#12_Sept17
BCB 444/544 F07 ISU Dobbs, #11 - MSAs; PSSMs & Psi-BLAST

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST
  Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Hidden Markov Models
  Chp 6 - pp 79-84
  Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
  http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
Sun Sept 16 - Study Guide for Exam 1 was posted
Mon Sept 17 - Answers to HW#2 will be posted ~ Noon
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
  Lectures 2-12 (thru Mon Sept 17)
  Labs 1-4
  HW2
  All assigned reading: Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
  Scoring Function
  Exhaustive Algorithms
  Heuristic Algorithms
  Practical Issues

Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Overview
What is a multiple sequence alignment (MSA)?
Where/why do we need MSA?
What is a good MSA?
Algorithms to compute an MSA

Multiple Sequence Alignment
Generalize pairwise alignment of sequences to include > 2 homologous sequences.
Analyzing more than 2 sequences gives us much more information:
  Which amino acids are required? Correlated?
  Evolutionary/phylogenetic relationships
Similar to the PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity".

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters such that:
  the resulting sequences have the same length
  no column contains only gaps
Example (three candidate alignments of the same three sequences):
  ATT-GC    AT-TGC    AT-T-GC
  ATTTGC    ATTTGC    ATTT-GC
  ATTTG     ATTTG-    ATTT-G-
  NO        YES       NO
The first candidate fails because the rows differ in length; the third fails because its fifth column contains only gaps.
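The two conditions in the definition can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_valid_msa` and the dictionary keys are illustrative choices, and the three candidates are the ones shown on the slide.

```python
def is_valid_msa(rows, gap="-"):
    """Return True if the gapped rows satisfy both conditions of the definition."""
    if not rows:
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(row) for row in rows}) != 1:
        return False
    # Condition 2: no column consists only of gap characters.
    return all(any(ch != gap for ch in col) for col in zip(*rows))


candidates = {
    "first": ["ATT-GC", "ATTTGC", "ATTTG"],
    "second": ["AT-TGC", "ATTTGC", "ATTTG-"],
    "third": ["AT-T-GC", "ATTT-GC", "ATTT-G-"],
}
results = {name: is_valid_msa(rows) for name, rows in candidates.items()}
print(results)
```

Only the second candidate passes, which agrees with the NO / YES / NO labels on the slide.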

Displaying MSAs: using CLUSTAL W
Residue colors:
  RED: AVFPMILW (small)
  BLUE: DE (acidic, negative chg)
  MAGENTA: RHK (basic, positive chg)
  GREEN: STYHCNGQ (hydroxyl + amine + basic)
Conservation symbols:
  *  entirely conserved column
  :  all residues have ~ same size AND hydropathy
  .  all residues have ~ same size OR hydropathy

What is a Consensus Sequence?
A single sequence that represents the most common residue of each column in an MSA.
Example:
  FGGHL-GF
  F-GHLPGF
  FGGHP-FG
Steiner consensus sequence: given sequences $s_1, \dots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$.
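A simple majority-rule consensus (the first definition above, not the Steiner consensus) can be sketched as follows. This is an illustration under two assumptions that the slide does not state: the gap character is counted like any other symbol, and ties would be broken by first occurrence in the column.

```python
from collections import Counter


def consensus(rows):
    """Most common symbol in each column; gaps are counted as ordinary symbols."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


msa = ["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]
print(consensus(msa))
```

For the three rows on the slide no column is tied, so the tie-breaking assumption does not affect the result.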

Applications of MSA
Building phylogenetic trees
Finding conserved patterns, e.g.:
  Regulatory motifs (TF binding sites)
  Splice sites
  Protein domains
Identifying and characterizing protein families
  Find out which protein domains have the same function
Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
What was the series of events that led to the current species?
[Figure: sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent.
TATA box = transcriptional promoter element

Goal: Characterize Protein Families
Which parts of globin sequences are most highly conserved?

Databases of Multiple Alignments
Pfam (Protein Domain Families database)
  Contains alignments and HMMs of protein families
InterPro
  Integrates: Prosite, Prints, ProDom, Pfam, and SMART
BLOCKS
  Segments of highly conserved multiple alignments
Hovergen (Homologous Vertebrate Genes Database)
COGs (Clusters of Orthologous Groups)
BaliBASE (Benchmark alignments database)

Scoring an Alignment
Goal: align homologous positions.
But: without knowledge of the phylogenetic tree, this is very hard (sometimes impossible) to achieve!
[Figure: sequences NYLS, NFLS]

Scoring an Alignment
In practice, simple scoring functions are used. Usually, columns are scored independently, i.e.
  $S(m) = G + \sum_i S(m_i)$
where $m_i$ is the ith column of alignment $m$ and $G$ is the gap penalty.
[Figure: example alignment with residues A, F, P, G, Q, I, K, Y, D, W and gaps]

Sum of Pairs (SP) Score
SP = sum of the scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix.
Compute for each column $m_i$:
  $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^l$ is the residue of sequence $l$ in column $m_i$ and $s$ is a PAM or BLOSUM score.
[Figure: example alignment column containing F, I and a gap]

How to Score Gaps in MSAs?
Want to align gaps with each other over all sequences.
A gap in a pairwise alignment that “matches” a gap in another pairwise alignment should cost less than introducing a totally new gap.
  It is possible that a new gap could be made to “match” an older one by adjusting the older pairwise alignment.
Change the gap penalty near conserved domains of various kinds (e.g. secondary structure elements, hydrophobic regions).

Example: SP Score
Alignment m:
  F-G
  F-G
  FYD
Scoring: BLOSUM 60 (as labeled on the slide), gap penalty -8, s(-,-) = 0.
[Table: score matrix over F, Y, G, D; extracted values 5 -2 -1 7 1 -5 4 -3, cell positions not recoverable. The entries used below are s(F,F) = 5, s(G,G) = 4, s(G,D) = -3.]
S(m) = S(m1) + S(m2) + S(m3)
     = 3 s(F,F) + 2 s(-,Y) + s(-,-) + s(G,G) + 2 s(G,D)
     = 15 - 16 + 0 + 4 - 6
     = -3
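The worked example can be reproduced with a short sum-of-pairs routine. This is a sketch rather than the lecture's own code: the function names are illustrative, and the score dictionary holds only the three matrix entries that the slide's calculation actually uses.

```python
from itertools import combinations


def pair_score(a, b, scores, gap_penalty):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0                      # s(-,-) = 0, as on the slide
    if a == "-" or b == "-":
        return gap_penalty            # residue aligned with a gap
    return scores[(a, b)] if (a, b) in scores else scores[(b, a)]


def sp_score(rows, scores, gap_penalty):
    """Sum over columns of the scores of all pairs of rows in that column."""
    total = 0
    for col in zip(*rows):
        total += sum(pair_score(a, b, scores, gap_penalty)
                     for a, b in combinations(col, 2))
    return total


scores = {("F", "F"): 5, ("G", "G"): 4, ("G", "D"): -3}
m = ["F-G", "F-G", "FYD"]
print(sp_score(m, scores, gap_penalty=-8))
```

The three columns contribute 15, -16 and -2, giving the slide's total of -3.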

Overcoming problems with SP scoring
Use weights to incorporate evolution in sum of pairs scoring:
  Some pairwise alignments are more important than others, e.g., it is more important to have a good alignment between mouse & human sequences than between mouse & bird.
  Assign different weights to different pairwise alignments.
  Weight decreases with evolutionary distance.

How to Compute a Multiple Alignment?
Algorithms for MSA:
Multidimensional dynamic programming
  Optimal global alignment (time & space intensive!!!)
Progressive alignments (Star alignment, ClustalW)
  Match closely-related sequences first using a guide tree
Iterative methods
Combined local alignments (Dialign)
Multiple re-building attempts to find best alignment
Partial order alignment (POA)
Local alignments
Profiles, Blocks, Patterns

Dynamic Programming for MSA
As with pairwise alignments, multiple sequence alignments can be computed by dynamic programming.
[Figure: 2D and 3D dynamic programming matrices F]

Generalized Needleman-Wunsch Algorithm (3D)
Given 3 sequences x, y, and z, the main iteration loop is:
$$
F(i,j,k) = \max
\begin{cases}
F(i-1, j-1, k-1) + S(x_i, y_j, z_k) \\
F(i-1, j-1, k) + S(x_i, y_j, -) \\
F(i-1, j, k-1) + S(x_i, -, z_k) \\
F(i-1, j, k) + S(x_i, -, -) \\
F(i, j-1, k-1) + S(-, y_j, z_k) \\
F(i, j-1, k) + S(-, y_j, -) \\
F(i, j, k-1) + S(-, -, z_k)
\end{cases}
$$
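The seven-way recurrence can be sketched directly as a three-dimensional table fill. This is a minimal illustration, not the lecture's implementation: the column score S is assumed to be a sum-of-pairs score with made-up match, mismatch and gap values, and only the optimal score is returned (no traceback).

```python
from itertools import product


def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; the scoring values are illustrative."""
    def s(p, q):
        if p == "-" and q == "-":
            return 0
        if p == "-" or q == "-":
            return gap
        return match if p == q else mismatch
    return s(a, b) + s(a, c) + s(b, c)


def align3_score(x, y, z):
    """Optimal global score for three sequences using a 3D table F."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # Seven moves: each sequence either consumes a residue (1) or takes a gap (0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    # Lexicographic order guarantees every predecessor cell is filled first.
    for i, j, k in product(range(nx + 1), range(ny + 1), range(nz + 1)):
        if i == j == k == 0:
            continue
        best = neg_inf
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = sp_column(x[pi] if di else "-",
                            y[pj] if dj else "-",
                            z[pk] if dk else "-")
            best = max(best, F[pi][pj][pk] + col)
        F[i][j][k] = best
    return F[nx][ny][nz]


print(align3_score("ATC", "ATG", "TCG"))
```

The table has (n+1)^3 cells and each cell looks at up to seven neighbors, which is the source of the cost discussed on the following slides.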

What Happens to Computational Complexity?
Given k sequences of length n:
  Space for matrix: $O(n^k)$
  Neighbors/cell: $2^k - 1$
  Time to compute SP score: $O(k^2)$
  Overall runtime: $O(k^2 2^k n^k)$
Ouch!!!

What's so bad about those exponents? An example: Running Time of DP
Overall runtime: $O(k^2 2^k n^k)$
  # sequences   running time
  2             1 second
  3             2 minutes
  4             5 hours
  5             3 weeks
  6             9 years
Sequences: globins (150 aa)
But: there are fast heuristics.
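To get a feel for how quickly the bound grows, the expression inside the O() can be evaluated for n = 150. This is only a rough sketch: constant factors are ignored, so it shows relative growth and is not expected to reproduce the timings in the table, which depend on the machine and implementation used.

```python
def dp_operations(k, n):
    """Value of k^2 * 2^k * n^k, the term inside the O() bound (constants ignored)."""
    return k**2 * 2**k * n**k


n = 150  # globin-sized sequences, as on the slide
base = dp_operations(2, n)
for k in range(2, 7):
    # Growth relative to the pairwise (k = 2) case.
    print(k, dp_operations(k, n) / base)
```

Each additional sequence multiplies the estimate by a factor of several hundred, which is why exact multidimensional DP is limited to a handful of sequences.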

Progressive Alignment
Heuristic procedure:
  Align the most similar sequences first
  Add sequences progressively
  Often: use a guide tree to determine the order of alignments
Examples:
  Star alignment
  ClustalW
[Figure: multiple alignment built by adding sequences 1, 2, 3, 4]

Guide Tree
Binary tree
Leaves correspond to sequences
Internal nodes represent alignments
Root corresponds to the final MSA
[Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes align ATC with ATG and TCG with TCC; the root holds the MSA with rows ATC-, ATG-, -TCG, -TCC]

Star Alignment - will skip for now, come back to this on Wed
Star alignment will NOT be covered on Exam 1

Chp 6 - Profiles & Hidden Markov Models
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
  Position Specific Scoring Matrices (PSSMs)
  PSI-BLAST
  Profiles
  Markov Model & Hidden Markov Model

Position Specific Iterated BLAST
Intuition: substitution matrices should be specific to a particular site, e.g. penalize alanine→glycine more in a helix.
Basic idea:
  Use BLAST with high stringency to get a set of closely related sequences
  Align those sequences to create a new substitution matrix for each position
  Then use that matrix (iteratively) to find additional sequences

[Figure: PSI-BLAST diagram with labels Query, PSSM, Multiple alignment, Sequence database, BLAST]

PSI-BLAST pseudocode
  Convert query to PSSM
  do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
  }
  Print current set of homologs

PSI-BLAST pseudocode
  Convert query to PSSM
  do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
  }
  Print current set of homologs
Callout: PSSM = position-specific scoring matrix

PSI-BLAST pseudocode
  Convert query to PSSM
  do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
  }
  Print current set of homologs
Callout: This step requires a user-defined threshold
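The control flow of the pseudocode can be written as a small driver loop. This is a sketch of the loop structure only: `search` and `build_pssm` are hypothetical callables supplied by the caller (they stand in for the BLAST search and PSSM construction and are not real BLAST APIs), and the iteration cap of 5 is an assumption.

```python
def psi_blast(query, search, build_pssm, max_iterations=5):
    """Iterate search and PSSM rebuilding until no new homologs appear.

    `search(pssm)` returns an iterable of hit identifiers and
    `build_pssm(query, homologs)` returns a new model; both are hypothetical.
    """
    homologs = set()
    pssm = build_pssm(query, homologs)      # Convert query to PSSM
    for _ in range(max_iterations):
        hits = set(search(pssm))            # BLAST database with PSSM
        new = hits - homologs
        if not new:                         # Stop if no new homologs are found
            break
        homologs |= new                     # Add new homologs to PSSM
        pssm = build_pssm(query, homologs)
    return homologs                         # Current set of homologs


# Toy stand-ins: a "PSSM" is just the set of sequences that built it, and
# each member of the model can detect the neighbors listed for it.
toy_neighbors = {"q": {"a"}, "a": {"b"}, "b": set()}


def toy_build(query, homologs):
    return frozenset({query} | homologs)


def toy_search(model):
    return set().union(*(toy_neighbors[name] for name in model))


found = psi_blast("q", toy_search, toy_build)
print(sorted(found))
```

In the toy run, "b" is invisible to the query alone and is only found after "a" has been added to the model, which mirrors how iteration reaches more distant homologs.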

Position-specific scoring matrix - PSSM
[Table: PSSM with one row per residue (A R N D C Q E G H I L K M F P S T W Y V); cell values not recoverable from the transcript]
A PSSM is an n by m matrix, where n is the size of the alphabet and m is the length of the sequence.
The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.

Position-specific scoring matrix
[Table: PSSM with one row per residue (A R N D C Q E G H I L K M F P S T W Y V); cell values not recoverable from the transcript]
A PSSM is an n by m matrix, where n is the size of the alphabet and m is the length of the sequence.
The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.
Example: “K” at position 3 gets a score of 2.

Position-specific scoring matrix
[Table: same PSSM as on the previous slide]
This PSSM assigns the sequence NMFWAFGH a score of:
  0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12

Position-specific scoring matrix
[Table: same PSSM as on the previous slide]
What score does this PSSM assign to KRPGHFLA?
  2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6
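Scoring a sequence against a PSSM is a per-position lookup followed by a sum. The sketch below stores the matrix as one dictionary per position and fills in only the entries quoted in the two worked sums (NMFWAFGH and KRPGHFLA); all other cells of the slide's matrix are omitted here, not zero. The names `PSSM` and `pssm_score` are illustrative.

```python
# One dict per position (1..8); only entries quoted in the worked examples.
PSSM = [
    {"N": 0, "K": 2},
    {"M": -2, "R": 0},
    {"F": -3, "P": -2},
    {"W": -2, "G": 6},
    {"A": -1, "H": 0},
    {"F": 6},
    {"G": 6, "L": -4},
    {"H": 8, "A": -2},
]


def pssm_score(seq, pssm):
    """Sum the position-specific score of each residue in seq."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM positions")
    return sum(column[res] for res, column in zip(seq, pssm))


print(pssm_score("NMFWAFGH", PSSM))
print(pssm_score("KRPGHFLA", PSSM))
```

Because the scores are additive across positions, a full-length hit can be scored in a single pass over the sequence.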

Position-specific iterated BLAST
[Figure: PSI-BLAST diagram with labels Query, PSSM, Multiple alignment, Sequence database, BLAST; one step is marked "?"]

Creating a PSSM from 1 sequence
Query: RNRGQFGH (length L)
For each query position, take the BLOSUM62 (20 by 20) scores for the residue at that position (e.g. R) and use them as that position's column of the PSSM.
Result: a 20 by L PSSM.
[Table: resulting PSSM with one row per residue (A R N D C Q E G H I L K M F P S T W Y V); cell values not recoverable from the transcript]
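Building a PSSM from a single query amounts to copying one substitution-matrix column per query position. The sketch below is an illustration under stated assumptions: it uses a three-letter alphabet (R, N, K) with the corresponding BLOSUM62 entries instead of the full 20 by 20 matrix, and only the first three residues of the slide's query.

```python
# BLOSUM62 entries for the residues R, N and K only (a deliberately tiny subset).
SUBMAT = {
    ("R", "R"): 5, ("N", "N"): 6, ("K", "K"): 5,
    ("R", "N"): 0, ("R", "K"): 2, ("N", "K"): 0,
}


def sub(a, b):
    """Symmetric lookup in the substitution matrix."""
    return SUBMAT[(a, b)] if (a, b) in SUBMAT else SUBMAT[(b, a)]


def pssm_from_query(query, alphabet):
    """Row per alphabet letter, column per query position."""
    return {res: [sub(res, q) for q in query] for res in alphabet}


pssm = pssm_from_query("RNR", "RNK")   # first three residues of RNRGQFGH
for res, row in pssm.items():
    print(res, row)
```

With this construction "K" at position 3 scores 2, consistent with the example on the earlier slide.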

Position-specific iterated BLAST
[Figure: PSI-BLAST diagram with labels Query, PSSM, Multiple alignment, Sequence database, BLAST; one step is marked "?"]

Creating a PSSM from multiple sequences
Discard columns that contain gaps in the query
For each column C:
  Compute relative sequence weights
  Compute PSSM entries, taking into account:
    Observed residues in this column
    Sequence weights
    Substitution matrix

Discard query gap columns
Before:
  EEFG----SVDGLVNNA
  QKYG----RLDVMINNA
  RRLG----TLNVLVNNA
  GGIG----PVD-LVNNA
  KALG----GFNVIVNNA
  ARFG----KID-LIPNA
  FEPEGPEKGMWGLVNNA
  AQLK----TVDVLINGA
After:
  EEFGSVDGLVNNA
  QKYGRLDVMINNA
  RRLGTLNVLVNNA
  GGIGPVD-LVNNA
  KALGGFNVIVNNA
  ARFGKID-LIPNA
  FEPEGMWGLVNNA
  AQLKTVDVLINGA
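The column-discarding step can be sketched as a filter on column indices. Treating the first row as the query is an assumption made for this illustration; the function name and parameters are likewise illustrative. Three of the slide's rows are used as input.

```python
def drop_query_gap_columns(rows, query_index=0, gap="-"):
    """Keep only the columns in which the query row has a residue."""
    query = rows[query_index]
    keep = [i for i, ch in enumerate(query) if ch != gap]
    return ["".join(row[i] for i in keep) for row in rows]


alignment = [
    "EEFG----SVDGLVNNA",   # assumed to be the query
    "GGIG----PVD-LVNNA",
    "FEPEGPEKGMWGLVNNA",
]
trimmed = drop_query_gap_columns(alignment)
for row in trimmed:
    print(row)
```

Gaps in the other rows are kept as long as the query has a residue in that column, as in the second row.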

Compute sequence weights
Low weights are assigned to redundant sequences
High weights are assigned to unique sequences
  EEFGSVDGLVNNA  1.2
  QKYGRLDVMINNA  1.2
  RRLGTLNVLVNNA  0.8
  GGIGPVDLLVNNA  0.8
  KALGGFNVIVNNA  1.1
  ARFGKIDTLIPNA  0.9
  FEPEGMWGLVNNA  1.1
  AQLKTVDVLINGA  1.3
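The slide does not say how the weights are computed. One commonly used scheme that has the stated property (redundant sequences get low weights, unique ones get high weights) is position-based weighting in the style of Henikoff & Henikoff; the sketch below implements that scheme as an assumption and is not expected to reproduce the numbers in the table, which may use a different method or normalization.

```python
from collections import Counter


def position_based_weights(rows):
    """Position-based sequence weights, normalized to sum to 1.

    In each column a sequence receives 1 / (k * n), where k is the number of
    distinct symbols in the column and n is how many sequences share its symbol.
    """
    weights = [0.0] * len(rows)
    for col in zip(*rows):
        counts = Counter(col)
        k = len(counts)
        for idx, res in enumerate(col):
            weights[idx] += 1.0 / (k * counts[res])
    total = sum(weights)
    return [w / total for w in weights]


toy = ["AA", "AA", "CC"]   # two identical sequences and one unique sequence
print(position_based_weights(toy))
```

In the toy input the two identical sequences share their weight, while the unique sequence receives as much weight as the pair combined.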

Compute PSSM entries (simplified version)
Observed residues + background frequencies = PSSM column
Observed residues (one alignment column): E Q R G K A F
Background frequencies (these are usually derived from a large sequence database):
  A 0.085  C 0.019  D 0.054  E 0.065  F 0.040
  G 0.072  H 0.023  I 0.058  K 0.056  L 0.096
  M 0.024  P 0.053  Q 0.042  R 0.054  S 0.072
  T 0.063  V 0.073  W 0.016  Y 0.034

Log-odds score
Estimate the probability of observing each residue
Divide by the background probability of observing the same residue
Take the log so that scores will be additive

Log-odds score
Estimate the probability of observing each residue
Divide by the background probability of observing the same residue
Take the log so that scores will be additive
For an observed residue “A”:
  $\text{score}(A) = \log \dfrac{P(A \mid \text{foreground model})}{P(A \mid \text{background model})}$
Numerator: the residue was generated by the foreground model (i.e., the PSSM).
Denominator: the residue was generated by the background model (i.e., randomly selected).
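The three steps can be sketched for a single alignment column. Several details here are assumptions that the slides do not specify: base-2 logarithms, a pseudocount that smooths the foreground estimate toward the background (so unobserved residues do not give log 0), and a hypothetical two-letter alphabet to keep the example small.

```python
import math


def log_odds(p_foreground, p_background):
    """Log-odds score in bits (base 2 is an assumption; the slide says only 'log')."""
    return math.log2(p_foreground / p_background)


def column_log_odds(weighted_counts, background, pseudocount=1.0):
    """Score every residue in `background` for one alignment column."""
    total = sum(weighted_counts.values()) + pseudocount
    scores = {}
    for residue, bg in background.items():
        # Step 1: foreground estimate from weighted counts plus pseudocounts.
        p = (weighted_counts.get(residue, 0.0) + pseudocount * bg) / total
        # Steps 2 and 3: divide by the background probability, take the log.
        scores[residue] = log_odds(p, bg)
    return scores


toy_background = {"A": 0.5, "B": 0.5}          # hypothetical alphabet
toy_scores = column_log_odds({"A": 3.0}, toy_background)
print(toy_scores)
```

Residues seen more often than the background predicts get positive scores and the rest get negative scores; a column with no observations scores every residue as 0.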

Why (not) PSI-BLAST
Weights sequences according to the observed diversity specific to the family of interest.
Advantage: if the sequences used to construct Position Specific Scoring Matrices (PSSMs) are all homologous, sensitivity at a given specificity improves significantly.
Disadvantage: if any non-homologous sequences are included in the PSSMs, the PSSMs are “corrupted.” They then "pull in" additional non-homologous sequences and become worse than generic scoring matrices.

How to use PSI-BLAST
Set initial thresholds high
Inspect each iteration's result for suspicious sequences
Do several iterations (~5), or until no new sequences are found
Even if only looking for a small set of sequences, make the initial search very broad:
  First, use NR (large, inclusive database) with up to 5 iterations to set the PSSM
  Then use that PSSM to search in the restricted domain

PSI-BLAST caveats
Goal: increased ability to find distant homologs
Cost? Additional care is needed to prevent non-homologous sequences from being included in the PSSM calculation
When in doubt, leave it out!
Examine sequences with moderate similarity carefully
Be particularly cautious about matches to sequences with highly biased amino acid content

PSI-BLAST example
Query is human NF-Kappa-B sequence

First iteration

Second iteration

Summary
Dynamic programming is O(NM)
BLAST is O(M)
BLAST produces an index of the query sequence that allows fast matching to the database
Target database is pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
PSI-BLAST iterates BLAST, adding new homologs at each iteration