Application of latent semantic analysis to protein remote homology detection Wu Dongyin 4/13/2015
Outline
- Abstract
- Related work on remote homology detection
- LSA
- LSA-based SVM and data set
- Result and discussion
- Conclusion
ABSTRACT
Motivation: Remote homology detection, the classification of proteins into functional and structural classes given their amino acid sequences, is a central problem in computational biology.
Results:
- Discriminative methods such as SVMs are among the most effective, but explicit feature sets are usually large, noisy data may be introduced, and this leads to the peaking phenomenon.
- We introduce LSA, an efficient feature-extraction technique from natural language processing.
- The LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms; its performance is comparable with complex kernel methods such as SVM-LA and better than other sequence-based methods.
Related Work on Remote Homology Detection
Because structure is more conserved than sequence, detecting very subtle sequence similarities, i.e. remote homology, is important. Most methods can detect homology at high levels of similarity, while remote homologs are often difficult to separate from pairs of proteins that share similarity owing to chance (the 'twilight zone').
Main approaches:
- Pairwise sequence comparison algorithms (dynamic programming): BLAST, FASTA, PSI-BLAST, etc.
- Generative models of protein families: HMMs, etc.
- Discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.
The success of an SVM classification method depends on the choice of the feature set used to describe each protein. Most of these research efforts focus on finding useful representations of protein sequence data for SVM training, using either explicit feature-vector representations or kernel functions.
LSA
Latent semantic analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to both the documents and the terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.
LSA: the bag-of-words model
Given N documents containing M distinct words in total, build the M x N word-document matrix W. The classic illustrative corpus (documents c1-c5 and m1-m4):

term       c1 c2 c3 c4 c5 m1 m2 m3 m4
human       1  .  .  1  .  .  .  .  .
interface   1  .  1  .  .  .  .  .  .
computer    1  1  .  .  .  .  .  .  .
user        .  1  1  .  1  .  .  .  .
system      .  1  1  2  .  .  .  .  .
response    .  1  .  .  1  .  .  .  .
time        .  1  .  .  1  .  .  .  .
EPS         .  .  1  1  .  .  .  .  .
survey      .  1  .  .  .  .  .  .  1
tree        .  .  .  .  .  1  1  1  .
graph       .  .  .  .  .  .  1  1  1
minor       .  .  .  .  .  .  .  1  1

(This representation does not recognize synonymous or related words, and its dimensions are too large.)
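The word-document matrix above can be built mechanically from tokenized documents. Below is a minimal sketch (not the paper's code); the toy documents and their contents are stand-ins chosen for illustration, not data from the paper.

```python
from collections import Counter

# Toy corpus: each "document" is a list of words (hypothetical stand-ins
# for the classic c1-c5 titles used on the slide).
docs = {
    "c1": ["human", "interface", "computer"],
    "c2": ["survey", "user", "computer", "system", "response", "time"],
    "c3": ["eps", "user", "interface", "system"],
    "c4": ["system", "human", "system", "eps"],
    "c5": ["user", "response", "time"],
}

def word_document_matrix(docs):
    """Return (terms, doc_ids, W) where W[i][j] counts term i in doc j."""
    terms = sorted({w for words in docs.values() for w in words})
    doc_ids = sorted(docs)
    counts = {d: Counter(docs[d]) for d in doc_ids}
    W = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, W

terms, doc_ids, W = word_document_matrix(docs)
```

Note how "system" gets a count of 2 in c4: the matrix records raw occurrence counts, not mere presence.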
LSA: singular value decomposition
The word-document matrix W (M x N) is decomposed and truncated to rank K:

    W ≈ U S V^T,  with U (M x K), S (K x K), V^T (K x N)
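The truncated SVD above is a one-liner with numpy. A minimal sketch, assuming numpy is available; the matrix values here are arbitrary toy counts, not the slide's data.

```python
import numpy as np

# W is an M x N word-document count matrix (toy values for illustration).
W = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 2, 0],
], dtype=float)

def truncated_svd(W, K):
    """Rank-K SVD: W ~= U_K @ S_K @ Vt_K with shapes (MxK)(KxK)(KxN)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :K], np.diag(s[:K]), Vt[:K, :]

U, S, Vt = truncated_svd(W, K=2)
W_hat = U @ S @ Vt   # best rank-2 approximation of W (Eckart-Young)
```

Keeping only the top K singular values is what removes noise: small singular directions, which mostly encode idiosyncratic word usage, are discarded.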
LSA for protein sequences
The same word-document matrix construction is applied to proteins: each protein sequence plays the role of a document, and sequence-derived building blocks play the role of words.
For a new document (sequence) that is not in the training set, it would in principle be necessary to add the unseen document (sequence) to the original training set and recompute the LSA model. Instead, the new vector t can be approximated as

    t = d U

where d is the raw word-count vector of the new document, analogous to a column of the matrix W.
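This "folding-in" step avoids recomputing the SVD for every query. A minimal sketch, assuming numpy; the training matrix and the unseen document's counts are hypothetical.

```python
import numpy as np

# Training word-document matrix (M terms x N documents; toy counts).
W = np.array([
    [1, 0, 1],
    [0, 2, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

K = 2
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_K = U[:, :K]   # M x K left-singular vectors from the training SVD

def fold_in(d, U_K):
    """Project a new raw count vector d (length M) into the K-dim
    latent space without recomputing the SVD: t = d @ U_K."""
    return np.asarray(d, dtype=float) @ U_K

# Raw counts of the unseen document over the same M-term vocabulary.
t = fold_in([1, 0, 1, 0], U_K)
```

The resulting K-dimensional vector t lives in the same latent space as the training documents, so it can be fed directly to the SVM.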
LSA-based SVM and Data Set
- Structural Classification of Proteins (SCOP) 1.53, with sequences from the ASTRAL database: 54 families, 4352 distinct sequences.
- Remote homology is simulated by holding out all members of a target SCOP 1.53 family from a given superfamily.
Three basic building blocks of proteins are used as words:
- N-grams: N = 3, giving 20^3 = 8000 words.
- Patterns: over the alphabet Σ ∪ {'.'}, where Σ is the set of the 20 amino acids and '.' matches any amino acid; χ² selection keeps 8000 patterns.
- Motifs: limited, highly conserved regions of proteins; 3231 motifs.
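Extracting the N-gram words (N = 3) from a sequence is the simplest of the three vocabularies. A minimal sketch; the sequence "MKVLAT" is a made-up example, not from the data set.

```python
def ngrams(seq, n=3):
    """All overlapping length-n substrings of an amino-acid sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# A short hypothetical amino-acid sequence.
words = ngrams("MKVLAT")
# -> ['MKV', 'KVL', 'VLA', 'LAT']

# With a 20-letter amino-acid alphabet, the 3-gram vocabulary has
# 20**3 = 8000 possible words, matching the slide.
```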
Result and Discussion
Two measures are used to evaluate the experimental results:
- Receiver operating characteristic (ROC) scores.
- Median rate of false positives (M-RFP) scores: the fraction of negative test sequences that score as high as or higher than the median score of the positive sequences.
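Both measures can be computed directly from the classifier scores. A minimal sketch using only the standard library (not the paper's evaluation code); the score lists are invented toy values.

```python
import statistics

def roc_score(pos, neg):
    """Area under the ROC curve: the probability that a random positive
    scores above a random negative (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def m_rfp(pos, neg):
    """Median rate of false positives: fraction of negatives scoring
    as high as or higher than the median positive score."""
    med = statistics.median(pos)
    return sum(n >= med for n in neg) / len(neg)

# Hypothetical SVM scores for positive and negative test sequences.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
```

Higher ROC scores are better (1.0 is perfect separation), while lower M-RFP scores are better (0.0 means no negative reaches the median positive score).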
Result and Discussion
When families fall in the upper-left area of a plot, the method labeled on the y-axis outperforms the method labeled on the x-axis for that family.
Result and Discussion: experimental settings
The SCOP hierarchy (fold > superfamily > family) defines three levels of difficulty. In all settings there are 3033 negative training and 1137 negative test sequences.
1. Family level: 20 positive training, 13 positive test sequences.
2. Superfamily level: 88 positive training, 33 positive test sequences.
3. Fold level: 61 positive training, 33 positive test sequences.
Result and Discussion
Computational efficiency
In terms of computational cost, the LSA-based methods are better than SVM-pairwise and SVM-LA, but worse than the corresponding methods without LSA and than PSI-BLAST.

Method            Vectorization step  Optimization step
SVM-pairwise      O(n^2 l^2)          O(n^3)
SVM-LA            O(n^2 l^2)          O(n^2 p)
SVM-Ngram         O(nml)              O(n^2 m)
SVM-Pattern       O(nml)              O(n^2 m)
SVM-Motif         O(nml)              O(n^2 m)
SVM-Ngram-LSA     O(nmt)              O(n^2 R)
SVM-Pattern-LSA   O(nmt)              O(n^2 R)
SVM-Motif-LSA     O(nmt)              O(n^2 R)

Notation: n is the number of training examples; l the length of the longest training sequence; m the total number of words; t = min(m, n); R the dimension of the latent semantic representation. The optimization step is O(n^2 p), where p is the length of the feature vector: p = n in SVM-pairwise, p = m in the methods without LSA, and p = R in the LSA methods.
CONCLUSION
In this paper, the LSA model from natural language processing is successfully applied to protein remote homology detection, and improved performance is obtained in comparison with the basic formalisms. Each document is represented as a linear combination of hidden abstract concepts, which arise automatically from the SVD mechanism. LSA defines a transformation between high-dimensional discrete entities (the vocabulary) and a low-dimensional continuous vector space S, the R-dimensional space spanned by the columns of U, leading to noise removal and an efficient representation of the protein sequence. As a result, the LSA model achieves better performance than the methods without LSA.