Protein Structural Classification
Structural classification databases: SCOP, CATH, FSSP
Sequence pairwise comparison: Smith-Waterman, BLAST, PSI-BLAST, rank propagation, SAM-T98
Discriminative classification: SVM-pairwise, mismatch kernel, EMOTIF kernel, I-Site kernel, semi-supervised kernel
SCOP
[Figure: SCOP hierarchy (Fold, Superfamily, Family) with the positive and negative training and test sets marked]
Family: sequence identity > 30%, or functions and structures are very similar
Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
Common fold: same major secondary structures in the same arrangement with the same topological connections
CATH
Hierarchy levels: Class, Architecture, Topology, Homologous superfamily, Sequence family
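Local alignment: Smith-Waterman algorithm
For two strings x and y, a local alignment with gaps pairs a substring of x with a substring of y, allowing insertions and deletions.
The score of an alignment is the sum of the substitution scores of the aligned residue pairs minus the gap penalties.
The Smith-Waterman score is the maximum score over all local alignments of x and y.
Thanks to Jean Philippe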
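The dynamic programming behind the Smith-Waterman score can be sketched as follows. This is a minimal illustration, not the slide's exact formulation: the match/mismatch scores and the linear gap penalty are assumed values chosen for the example, whereas real protein searches would typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings x and y.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new alignment here
                H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                H[i - 1][j] + gap,    # gap in y
                H[i][j - 1] + gap,    # gap in x
            )
            best = max(best, H[i][j])
    return best
```

The zero in the recurrence is what makes the alignment local: a poorly scoring prefix is simply dropped rather than carried along.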
BLAST: a heuristic algorithm for matching DNA/protein sequences
Idea: true matches are likely to contain a short stretch of identity
Build a list of 'neighborhood words' of the query sequence
Search the database with the list; whenever there is a match, do a 'hit extension', stopping at the maximum-scoring extension
Altschul, Madden, Schaffer, Zhang et al., 1997
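The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: the word list contains only the exact words of the query (no scored neighborhood), the match/mismatch scores and the `x_drop` cutoff are assumed values, and the function names are hypothetical.

```python
from collections import defaultdict

def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index

def extend_hit(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Ungapped extension of a word hit, first to the right, then to the left.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen. Returns
    (score, query_start, subject_start, length) of the best segment.
    """
    score = best = w * match
    right = k = 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:
            break
    score = best  # continue from the best right-extended segment
    left = k = 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    return best, qi - left, si - left, w + left + right

def blast_like_search(query, subject, w=3):
    """Return the best-scoring extended hit, or None if no word matches."""
    qindex = index_words(query, w)
    best_hit = None
    for si in range(len(subject) - w + 1):
        for qi in qindex.get(subject[si:si + w], []):
            hit = extend_hit(query, subject, qi, si, w)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit
```

For example, `blast_like_search("MKVLAAGIT", "PPPKVLAAGWW")` finds the shared segment `KVLAAG`, starting at query position 1 and subject position 3.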
PSI-BLAST: Position-Specific Iterated BLAST
Only double hits that fall within a certain range of each other are extended.
A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions.
PSI-BLAST can take a PSSM (position-specific scoring matrix) as input to search the database.
Altschul, Madden, Schaffer, Zhang et al., 1997
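To make the PSSM concrete, here is a minimal sketch of building a log-odds matrix from a small ungapped alignment and sliding it along a sequence. It is an illustration under simplifying assumptions, not PSI-BLAST's actual procedure: the uniform background, the flat pseudocount, and the function names are all hypothetical choices, and sequence weighting is omitted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned sequences.

    Assumes a uniform background distribution over the 20 amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for seq in aligned_seqs:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        # log2 of (observed frequency / background frequency) per residue
        pssm.append({a: math.log2(counts[a] / total / background)
                     for a in AMINO_ACIDS})
    return pssm

def best_window(pssm, seq):
    """Slide the PSSM along seq and return (best score, start position)."""
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        score = sum(pssm[k][seq[start + k]] for k in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Unlike a fixed substitution matrix, the score of a residue here depends on the column it is matched against, which is what lets each iteration pick up more distant homologs.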
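Local and Global Consistency
Affinity matrix: $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ for $i \neq j$, and $W_{ii} = 0$
Normalize: $S = D^{-1/2} W D^{-1/2}$, where D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$
Iterate: $F(t+1) = \alpha S F(t) + (1 - \alpha) Y$, with $0 < \alpha < 1$ and Y the initial label matrix
F* is the limit of the sequence {F(t)}; each point is assigned the class with the largest entry in its row of F*
Zhou, Bousquet, Lal, Weston, and Schölkopf, 2003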
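The iteration can be sketched in plain Python for a small point set. This is a minimal illustration assuming a Gaussian affinity with zero diagonal, the symmetric normalization S = D^(-1/2) W D^(-1/2), and the update F(t+1) = alpha S F(t) + (1 - alpha) Y; the function name and the default values of `sigma`, `alpha`, and `n_iter` are assumptions made for the example.

```python
import math

def consistency_labels(points, labels, n_classes, sigma=1.0, alpha=0.9, n_iter=200):
    """Propagate labels over a point set; labels[i] is a class index or None."""
    n = len(points)
    # Gaussian affinity matrix with zero diagonal
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Symmetric normalization by the row sums (the diagonal of D)
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Y holds the known labels as one-hot rows; unlabeled rows are all zero
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)]
             for i in range(n)]
    # Assign each point the class with the largest score in its row
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]
```

With one labeled point in each of two well-separated clusters, the remaining points in each cluster inherit the label of their cluster's labeled point.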
Rank propagation
Protein similarity network:
Graph nodes: protein sequences in the database
Directed edges: an exponential function of the PSI-BLAST E-value (destination node as query)
Activation value at each node: the similarity to the query sequence
Exploits the structure of the protein similarity network
Weston, Elisseeff, Zhou, Leslie and Noble, 2004
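A rough sketch of activation spreading over such a network is given below. It is an illustration under stated assumptions rather than the published algorithm: `weights[j][i]` is a precomputed edge weight from node j to node i, `edge_weight` shows one hypothetical way to derive it from an E-value, and the normalization of incoming weights is added here only so that the iteration stays bounded.

```python
import math

def edge_weight(e_value, sigma=100.0):
    """Hypothetical edge weight: decays exponentially with the E-value."""
    return math.exp(-e_value / sigma)

def rank_propagation(weights, query, alpha=0.9, n_iter=200):
    """Spread activation from the query node through the similarity network.

    Returns one activation value per node; higher means more similar
    to the query, directly or through intermediate sequences.
    """
    n = len(weights)
    # Normalize incoming weights of each node to sum to 1 (assumption)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        col_sum = sum(weights[j][i] for j in range(n) if j != i)
        if col_sum > 0:
            for j in range(n):
                if j != i:
                    K[j][i] = weights[j][i] / col_sum
    y = [0.0] * n
    y[query] = 1.0
    for _ in range(n_iter):
        new_y = []
        for i in range(n):
            if i == query:
                new_y.append(1.0)  # the query keeps full activation
                continue
            # activation arriving from the other non-query nodes
            spread = sum(K[j][i] * y[j]
                         for j in range(n) if j != query and j != i)
            new_y.append(K[query][i] + alpha * spread)
        y = new_y
    return y
```

The point of the example is that a node with no direct edge from the query can still receive activation through an intermediate homolog, which a single pairwise search would miss.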
SAM-T98
First iteration: search the NR database with the query sequence using WU-BLASTP and build an alignment of the homologs found.
Iterations 2-4: use the alignment from the previous iteration to find more homologs with WU-BLASTP, and update the alignment with the new homologs.
Build an HMM from the final alignment.
The HMM of the query sequence is used to search the database; alternatively, the query sequence can be searched against a database of HMMs.
Karplus, Barrett and Hughey, 1999
To do it in a discriminative manner with SVM…
Fisher Kernel
An HMM (or more than one) is built for each family.
The kernel function is derived from the Fisher score of each sequence given an HMM H1: $U_x = \nabla_\theta \log P(x \mid H_1, \theta)$
Jaakkola, Diekhans and Haussler, 2000
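The mechanics of a Fisher kernel can be illustrated with a much simpler generative model than an HMM. The sketch below swaps in an i.i.d. residue model with softmax-parameterized frequencies `theta` (an assumption made purely to keep the example short), takes the gradient of the log-likelihood as the Fisher score, and compares two sequences with a Gaussian kernel on their scores. The function names and `sigma` are hypothetical.

```python
import math

def fisher_score(seq, theta):
    """Gradient of log P(seq | theta) with respect to the softmax logits.

    For an i.i.d. residue model this is (observed count - expected count)
    for each residue; theta maps residue -> probability.
    """
    n = len(seq)
    return [seq.count(a) - n * p for a, p in theta.items()]

def fisher_kernel(x, y, theta, sigma=1.0):
    """Gaussian kernel on the Fisher scores of sequences x and y."""
    ux, uy = fisher_score(x, theta), fisher_score(y, theta)
    d2 = sum((a - b) ** 2 for a, b in zip(ux, uy))
    return math.exp(-d2 / (2 * sigma ** 2))
```

Because this stand-in model ignores residue order, two sequences with the same composition get a kernel value of 1; an HMM-based Fisher score would distinguish them through its position-dependent parameters.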