1
Protein Classification
2
PDB Growth New PDB structures
3
Only a few folds are found in nature
4
Protein classification
Number of protein sequences grows exponentially
Number of solved structures grows exponentially
Number of new folds identified very small (and close to constant)
Protein classification can:
  Generate overview of structure types
  Detect similarities (evolutionary relationships) between protein sequences
  Help predict 3D structure of new protein sequences
Morten Nielsen, CBS, BioCentrum, DTU

SCOP release 1.69: classification of 25,973 protein structures in PDB

Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845
5
Protein structure classification
[Figure: protein world > protein fold > protein superfamily > protein family]
6
Structure Classification Databases
SCOP: manual classification (A. Murzin), scop.berkeley.edu
CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath
FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html
7
Major classes in SCOP
All α proteins
All β proteins
α and β proteins (α/β)
α and β proteins (α+β)
Multi-domain proteins
Membrane and cell surface proteins
Small proteins
Coiled coil proteins
8
All α: Hemoglobin (1bab)
9
All β: Immunoglobulin (8fab)
10
α/β: Triosephosphate isomerase (1hti)
11
α+β: Lysozyme (1jsf)
12
Families
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
The same protein may be found in several species
[Figure: hierarchy Fold > Superfamily > Family > Proteins]
13
Superfamilies
Proteins which are (remotely) evolutionarily related
Sequence similarity low
Share function
Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
14
Folds
>~50% of secondary structure elements arranged in the same order in sequence and in 3D
Proteins sharing a fold need not be evolutionarily related
15
Protein Classification
Given a new protein sequence, can we place it in its “correct” position within an existing protein hierarchy?
Methods:
  BLAST / PSI-BLAST
  Profile HMMs
  Supervised machine learning methods
[Figure: a new protein to be placed in the Fold > Superfamily > Family > Proteins hierarchy]
16
BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time: O(MN)
However, orders of magnitude faster than Smith-Waterman
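The two steps of the main idea can be sketched in a few lines. This is a simplified illustration, not the actual BLAST implementation: it seeds on exact word matches only and extends while characters are identical, whereas real BLAST also seeds on high-scoring neighboring words and extends with a scoring matrix. The function names and the word size `w=3` are assumptions made for this example.

```python
from collections import defaultdict

def build_word_index(query, w=3):
    """Step 1: map every length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_seeds(query, subject, w=3):
    """Step 2a: return (query_pos, subject_pos) for every exact word match."""
    index = build_word_index(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds

def extend_exact(query, subject, i, j, w=3):
    """Step 2b (simplified): grow a seed left and right while residues match."""
    q_start, s_start = i, j
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = i + w, j + w
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start

# Toy sequences, made up for illustration.
QUERY = "MKVLAAGIV"
SUBJECT = "TTKVLAAGQQ"
seeds = find_seeds(QUERY, SUBJECT)
hit = extend_exact(QUERY, SUBJECT, *seeds[0])
```

The dictionary lookup is what makes the search fast in practice: only database positions that share a word with the query trigger an extension.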
17
PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position-specific matrix M, a profile
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
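The iteration can be illustrated with a toy version. This sketch is not PSI-BLAST itself: it assumes all sequences have the same length and are already aligned without gaps, uses a uniform background and a fixed score threshold instead of E-values, and omits sequence weighting. All names (`build_profile`, `iterate_profile_search`, the threshold value) are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Step 3: log-odds profile, one column per position (uniform background)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for pos in range(len(aligned[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for seq in aligned:
            counts[seq[pos]] += 1.0
        total = sum(counts.values())
        profile.append({a: math.log((counts[a] / total) / background) for a in ALPHABET})
    return profile

def score(profile, seq):
    """Sum of position-specific log-odds scores for an ungapped sequence."""
    return sum(col[a] for col, a in zip(profile, seq))

def search(profile, database, threshold):
    """Steps 2 and 4: keep database sequences scoring at or above the threshold."""
    return [s for s in database if score(profile, s) >= threshold]

def iterate_profile_search(query, database, threshold, max_iter=10):
    """Step 5: rebuild the profile from the hits until the hit list stops changing."""
    hits = []
    for _ in range(max_iter):
        profile = build_profile([query] + hits)
        new_hits = search(profile, database, threshold)
        if new_hits == hits:
            break
        hits = new_hits
    return hits

# Toy data, made up for illustration.
DATABASE = ["ACDE", "ACDF", "ACGF", "WWWW"]
first_round = search(build_profile(["ACDE"]), DATABASE, threshold=1.5)
final_hits = iterate_profile_search("ACDE", DATABASE, threshold=1.5)
```

In this toy run the more distant sequence `ACGF` is missed in the first round and only picked up after the profile has absorbed `ACDF`, which is the effect the iteration is meant to produce.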
18
A profile
19
Profile HMMs
Each M (match) state has a position-specific pre-computed substitution table
Each I (insert) and D (delete) state has position-specific gap penalties
Profile HMM is a generative model: the sequence X that is aligned to H is thought of as “generated by” H
Therefore, H parametrizes a conditional distribution P(X | H)
[Figure: protein profile HMM]
20
Classification with Profile HMMs
[Figure: a new protein to be placed in the Fold > Superfamily > Family hierarchy]
21
Classification with Profile HMMs
How generative models work:
Training examples (sequences known to be members of the family) are used to build the model
The model assigns a probability to any given protein sequence
Sequences from that family yield a higher probability than sequences from outside the family
Log-likelihood ratio as score:

$$L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$
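A minimal sketch of the log-likelihood ratio score follows. It assumes a heavily simplified family model with match states only (no insert or delete states), so P(X | H_1) is just a product of position-specific emission probabilities, and a background model H_0 with one emission distribution for all positions. The two-letter alphabet, the probabilities, and the function names are invented for illustration.

```python
import math

def log_likelihood_family(seq, emissions):
    """log P(X | H1) for a match-state-only model: one emission table per position."""
    return sum(math.log(col[a]) for col, a in zip(emissions, seq))

def log_likelihood_null(seq, background):
    """log P(X | H0): every position drawn from the same background distribution."""
    return sum(math.log(background[a]) for a in seq)

def log_odds(seq, emissions, background, prior_family=0.5):
    """L(X) = log [P(X|H1) P(H1)] - log [P(X|H0) P(H0)]."""
    log_h1 = log_likelihood_family(seq, emissions) + math.log(prior_family)
    log_h0 = log_likelihood_null(seq, background) + math.log(1.0 - prior_family)
    return log_h1 - log_h0

# Toy two-letter alphabet, made up for illustration.
FAMILY = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
BACKGROUND = {"A": 0.5, "B": 0.5}
member_score = log_odds("AB", FAMILY, BACKGROUND)      # log(0.81 / 0.25)
non_member_score = log_odds("BA", FAMILY, BACKGROUND)  # log(0.01 / 0.25)
```

A positive score means the family model explains the sequence better than the null model. With a full profile HMM, `log_likelihood_family` would be replaced by the forward algorithm.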
22
Generative Models
27
Discriminative Methods
Instead of modeling the process that generates the data, directly discriminate between classes
A more direct route to the goal
Works better when the generative model is not accurate
28
Discriminative Models -- SVM
Decision rule: red if $v^T x > 0$
If $x_1, \dots, x_n$ are the training examples, $\mathrm{sign}\left(\sum_i \lambda_i x_i^T x\right)$ “decides” where $x$ falls
Train the $\lambda_i$ to achieve the best margin
[Figure labels: margin; large margin for |v| < 1; margin of 1 for small |v|]
29
Discriminative protein classification
Jaakkola, Diekhans, Haussler, ISMB 1999
Define the discriminant function to be

$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$

We decide $X \in$ family $H$ whenever $L(X) > 0$
For now, let’s just assume $K(\cdot,\cdot)$ is a similarity function
Then, we want to train the $\lambda_i$ so that this classifier makes as few mistakes as possible on new data
Similarly to SVMs, train the $\lambda_i$ so that the margin is largest, for $0 \le \lambda_i \le 1$
30
Discriminative protein classification
Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, and $L(X_i) \le -1$ otherwise
This is not always possible; softer constraints are obtained with the following objective function

$$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \,(2 - L(X_i)) + \sum_{X_j \in H_0} \lambda_j \,(2 + L(X_j))$$

Training: for $X_i \in H_1$, try to “make” $L(X_i) = 1$:

$$\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}, \quad \text{with minimum 0 and maximum 1}$$

Similarly, for $X_i \in H_0$, try to “make” $L(X_i) = -1$
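The update rule can be sketched as a coordinate-wise sweep over the training examples. This is an illustration under assumptions, not the authors' code: labels are encoded as y = +1 for H_1 and y = -1 for H_0, so both cases of the update collapse into one formula; the kernel matrix comes from a made-up one-dimensional Gaussian similarity; and a fixed number of sweeps stands in for a real convergence test.

```python
import numpy as np

def train_lambdas(K, y, sweeps=50):
    """Coordinate-wise updates: push y_i * L(X_i) toward 1, clipping lambda_i to [0, 1]."""
    n = len(y)
    lam = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # L(X_i) = sum over H1 of lam*K minus sum over H0 of lam*K
            L_i = float(np.dot(lam * y, K[i]))
            target = (1.0 - y[i] * L_i + lam[i] * K[i, i]) / K[i, i]
            lam[i] = min(1.0, max(0.0, target))
    return lam

def discriminant(K_query, y, lam):
    """L(X) for a query, given its kernel values against all training examples."""
    return float(np.dot(lam * y, K_query))

# Toy data: two positives near 0, two negatives near 2 (made up for illustration).
points = np.array([0.0, 0.1, 2.0, 2.1])
labels = np.array([1.0, 1.0, -1.0, -1.0])
K = np.exp(-(points[:, None] - points[None, :]) ** 2)

lam = train_lambdas(K, labels)
train_scores = np.array([discriminant(K[i], labels, lam) for i in range(len(points))])
```

For a positive example the formula reduces to the update shown on the slide; for a negative example the sign of L(X_i) flips, which drives L(X_i) toward -1.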
31
The Fisher Kernel
The function K(X, Y) compares two sequences
Acts effectively as an inner product in a (non-Euclidean) space
Called a “kernel”
Has to be positive definite: for any $X_1, \dots, X_n$, the matrix $K$ with $K_{ij} = K(X_i, X_j)$ is such that for any $x \in \mathbb{R}^n$, $x \ne 0$: $x^T K x > 0$
The choice of this function is important
Consider $P(X \mid H_1, \theta)$ and its sufficient statistics: the expected number of times X takes each transition/emission
32
The Fisher Kernel
Let $\theta$ be the vector of parameters of the HMM (probabilities on each arrow and emission)
Fisher score: $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
Quantifies how each parameter contributes to generating X
For two different sequences X and Y, we can compare $U_X$ and $U_Y$:

$$D_F^2(X, Y) = \frac{1}{2\sigma^2}\,\lvert U_X - U_Y \rvert^2$$

$\sigma$ is just a scaling parameter
Given this distance function, K(X, Y) is defined as a similarity measure: $K(X, Y) = \exp(-D_F^2(X, Y))$
Set $\sigma$ so that the average distance of training sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1
33
The Fisher Kernel
To train a classifier for a given family $H_1$:
1. Build the profile HMM, $H_1$
2. $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
3. $D_F^2(X, Y) = \frac{1}{2\sigma^2}\,\lvert U_X - U_Y \rvert^2$ (distance)
4. $K(X, Y) = \exp(-D_F^2(X, Y))$ (akin to dot product)
5. $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
6. Iteratively adjust $\lambda$ to optimize $J(\lambda) = \sum_{X_i \in H_1} \lambda_i \,(2 - L(X_i)) + \sum_{X_j \in H_0} \lambda_j \,(2 + L(X_j))$
To classify a query X:
Compute $U_X$
Compute $K(X, X_i)$ for all training examples $X_i$ with $\lambda_i \ne 0$ (few)
Decide based on whether $L(X) > 0$
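Steps 3 and 4 can be sketched once the Fisher score vectors are available. The sketch below takes the score vectors as given arrays (computing them from a profile HMM requires the forward-backward algorithm and is not shown), and it assumes one particular reading of the scaling rule: σ is chosen so that the mean of the scaled squared distance between positive and negative training examples equals 1. The array values and function names are made up.

```python
import numpy as np

def squared_distances(U_a, U_b):
    """Pairwise |U_x - U_y|^2 between the rows of two score matrices."""
    diff = U_a[:, None, :] - U_b[None, :, :]
    return np.sum(diff ** 2, axis=2)

def choose_sigma(U_pos, U_neg):
    """Pick sigma so that the mean scaled squared distance, positives to negatives, is 1."""
    mean_sq = squared_distances(U_pos, U_neg).mean()
    return float(np.sqrt(mean_sq / 2.0))

def fisher_kernel(U_a, U_b, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    return np.exp(-squared_distances(U_a, U_b) / (2.0 * sigma ** 2))

# Toy Fisher score vectors, one row per sequence (made up for illustration).
U_pos = np.array([[0.0, 0.0], [0.0, 1.0]])
U_neg = np.array([[3.0, 0.0], [3.0, 1.0]])

sigma = choose_sigma(U_pos, U_neg)
K_pos_pos = fisher_kernel(U_pos, U_pos, sigma)
K_pos_neg = fisher_kernel(U_pos, U_neg, sigma)
```

The resulting matrices can be fed directly into any kernel-based trainer that accepts a precomputed kernel.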
34
The Fisher Kernel
If a given superfamily has more than one profile model,

$$L_{\max}(X) = \max_i L_i(X) = \max_i \left( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \right)$$
35
O. Jangmin Benchmarks
36
Other methods
WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  The PDB90 database was queried with each positive training example, and E-values were recorded
  BLAST:SCOP-only
  BLAST:SCOP+SAM-T98-homologs
  Scores were combined by the maximum method
SAM-T98 method
  Null model: reverse sequence model
  Same data and same set of models as in the SVM-Fisher method
  Combined with the maximum method
37
Results
Metric: the rate of false positives (RFP)
RFP for a positive test sequence: the fraction of negative test sequences that score as good as or better than the positive sequence
Results for the G-protein family of the nucleotide triphosphate hydrolases SCOP superfamily
Tests the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds
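The RFP definition translates directly into code. This sketch assumes that a higher score means a better match (for E-values the comparison would be reversed); the function names and the scores are made up for illustration.

```python
def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as good as or better than the positive."""
    as_good_or_better = sum(1 for s in negative_scores if s >= positive_score)
    return as_good_or_better / len(negative_scores)

def rfp_per_positive(positive_scores, negative_scores):
    """RFP for every positive test sequence."""
    return [rate_of_false_positives(p, negative_scores) for p in positive_scores]

# Toy scores, made up for illustration.
negatives = [0.5, 1.0, 2.0, 3.0]
rfps = rfp_per_positive([2.0, 3.5, 0.0], negatives)
```

An RFP of 0 means no negative sequence outscores the positive one; an RFP of 1 means every negative sequence does.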
39
QUESTION Running time of Fisher kernel SVM on query X?
40
k-mer based SVMs
Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights
The Fisher kernel $K(X, Y) = \exp\left(-\frac{1}{2\sigma^2}\,\lvert U_X - U_Y \rvert^2\right)$ requires expensive profile alignment: computing $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ takes $O(|X|\,|H_1|)$
Instead, the new kernel K(X, Y) just “counts up” k-mers with mismatches in common between X and Y, which is $O(|X|)$ in practice
Off-the-shelf SVM software used
41
k-mer based SVMs
For a given word size k and mismatch tolerance l, define
K(X, Y) = # distinct k-long word pairs with ≤ l mismatches
Define the normalized kernel $K'(X, Y) = K(X, Y) / \sqrt{K(X, X)\,K(Y, Y)}$
An SVM can be learned by supplying this kernel function
Example: let k = 3, l = 1
X = ABACARDI
Y = ABRADABI
K(X, Y) = 4
K'(X, Y) = 4/sqrt(7*7) = 4/7
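The worked example can be reproduced with a direct pair count. This sketch follows the slide's simplified definition, counting distinct unordered pairs of k-long words that differ in at most l positions; it is a brute-force illustration and not the mismatch-kernel implementation of Leslie et al., which uses feature-space inner products and a more efficient data structure. Function names are hypothetical.

```python
from math import sqrt

def kmers(seq, k):
    """Set of distinct k-long words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(1 for p, q in zip(a, b) if p != q)

def pair_count_kernel(x, y, k=3, l=1):
    """Count distinct unordered word pairs (one from x, one from y) with <= l mismatches."""
    pairs = set()
    for a in kmers(x, k):
        for b in kmers(y, k):
            if mismatches(a, b) <= l:
                pairs.add(frozenset((a, b)))
    return len(pairs)

def normalized_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))."""
    return pair_count_kernel(x, y, k, l) / sqrt(
        pair_count_kernel(x, x, k, l) * pair_count_kernel(y, y, k, l)
    )

X = "ABACARDI"
Y = "ABRADABI"
k_xy = pair_count_kernel(X, Y)
k_xx = pair_count_kernel(X, X)
k_yy = pair_count_kernel(Y, Y)
k_norm = normalized_kernel(X, Y)
```

The four matching pairs between X and Y are (ABA, ABR), (ABA, ADA), (ABA, ABI) and (ACA, ADA). Each self-comparison counts the six identical words plus one near-identical pair, giving 7.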
42
SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X
43
Benchmarks
44
Semi-Supervised Methods GENERATIVE SUPERVISED METHODS
45
Semi-Supervised Methods DISCRIMINATIVE SUPERVISED METHODS
46
Semi-Supervised Methods
UNSUPERVISED METHODS
Mixture of Centers
Data generated by a fixed set of centers (how many?)
56
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
57
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
SVMs and other discriminative methods may make significant mistakes due to lack of data
61
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
Attempt to “contract” the distances within each cluster while keeping intercluster distances larger
64
Cluster Kernels
1. Neighborhood
   1. For each X, run PSI-BLAST to get similar sequences, Nbd(X)
   2. Define $\Phi_{nbd}(X) = \frac{1}{|Nbd(X)|} \sum_{X' \in Nbd(X)} \Phi_{original}(X')$
      “Counts of all k-mers matching with at most 1 difference in all sequences that are similar to X”
   3. $K_{nbd}(X, Y) = \frac{1}{|Nbd(X)|\,|Nbd(Y)|} \sum_{X' \in Nbd(X)} \sum_{Y' \in Nbd(Y)} K(X', Y')$
2. Bagged mismatch
   1. Run k-means clustering n times, giving p = 1,…,n assignments $c_p(X)$
   2. For every X and Y, count up the fraction of times they are bagged together: $K_{bag}(X, Y) = \frac{1}{n} \sum_p \mathbf{1}(c_p(X) = c_p(Y))$
   3. Combine the “bag fraction” with the original comparison K(.,.): $K_{new}(X, Y) = K_{bag}(X, Y)\,K(X, Y)$
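Both cluster kernels can be sketched on top of a precomputed base kernel matrix. The sketch makes several assumptions: the neighborhoods are given as lists of row indices (in the slides they come from PSI-BLAST, which is not run here), the cluster assignments are given as an array with one row per clustering run (in the slides they come from repeated k-means), and all numbers are made up.

```python
import numpy as np

def neighborhood_kernel(K, neighborhoods):
    """K_nbd(X, Y): average of the base kernel over Nbd(X) x Nbd(Y)."""
    n = len(neighborhoods)
    K_nbd = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            block = K[np.ix_(neighborhoods[a], neighborhoods[b])]
            K_nbd[a, b] = block.mean()  # sum divided by |Nbd(X)| * |Nbd(Y)|
    return K_nbd

def bagged_kernel(K, assignments):
    """K_new = K_bag * K, where K_bag is the fraction of runs that co-cluster X and Y."""
    assignments = np.asarray(assignments)
    same_cluster = assignments[:, :, None] == assignments[:, None, :]
    K_bag = same_cluster.mean(axis=0)
    return K_bag * K

# Toy base kernel for three sequences (made up for illustration).
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])

# Hypothetical neighborhoods: sequence 0 is similar to itself and sequence 1.
K_nbd = neighborhood_kernel(K, [[0, 1], [1], [2]])

# Two hypothetical clustering runs: cluster labels per sequence.
K_new = bagged_kernel(K, [[0, 0, 1], [0, 1, 1]])
```

Sequences that never share a cluster get a combined kernel value of 0, regardless of their base similarity.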
65
Benchmarks