Published by Aleesha Williamson. Modified over 8 years ago.
1
Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding Xu Linhe 14S051069
2
Content: Introduction, Methods, Results, Discussion, Conclusion
3
Introduction Detecting remote evolutionary relationships among proteins is still a difficult problem in bioinformatics. Because a protein sequence is composed of letters, much like the words in a text, the authors approach the problem with techniques from natural language processing (NLP). They present an algorithm called “ProtEmbed” that learns an embedding of protein domain sequences into a semantic space. They use multitask learning to build the model; the auxiliary information consists of class labels from SCOP and structural similarity scores.
4
Methods Define a feature representation: p_t is a protein from a database such as SCOP, and p’ is a query protein.
5
Methods Define a feature representation: each feature is a PSI-BLAST E-value, which is easy to compute.
6
Methods Construct a training set of tuples R = (q, p+, p-): The tuples are collected by running PSI-BLAST against a database. Given a query protein q, any protein with an E-value < 0.1 is taken as p+, and p- is picked at random.
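The tuple construction described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_triplets`, the dictionary layout of the E-values, and the choice to draw p- from every database protein other than q and the current p+ are assumptions made for the example.

```python
import random


def build_triplets(evalues, threshold=0.1, seed=0):
    """Build (q, p+, p-) tuples from precomputed E-values.

    evalues maps a query id to {database protein id: E-value}.
    (Hypothetical layout; the slides do not specify one.)
    """
    rng = random.Random(seed)
    triplets = []
    for q, hits in evalues.items():
        # p+: any protein whose E-value against q is below the threshold.
        positives = [p for p, e in hits.items() if p != q and e < threshold]
        for p_pos in positives:
            # Assumption: p- is drawn uniformly from the remaining proteins.
            candidates = [p for p in hits if p != q and p != p_pos]
            if candidates:
                triplets.append((q, p_pos, rng.choice(candidates)))
    return triplets


example = {"q1": {"a": 1e-5, "b": 3.0, "c": 0.5}}
triplets = build_triplets(example)
```

Because p- is sampled at random, it may occasionally be a true homolog; the slide does not say whether such cases are filtered out.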
7
Methods Embedding: W is an n × l matrix that reduces the dimension from l to n, embedding the protein in a low-dimensional space.
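As a sketch of the embedding step, assuming the embedding is the plain linear map f(p) = W·φ(p), where φ(p) is the length-l feature vector of E-values from the earlier slides. The name `embed` and the toy numbers are illustrative only.

```python
def embed(W, phi):
    """Map a length-l feature vector to n dimensions: f(p) = W * phi(p).

    W is a list of n rows, each of length l.
    """
    if any(len(row) != len(phi) for row in W):
        raise ValueError("every row of W must have the same length as phi")
    return [sum(w * x for w, x in zip(row, phi)) for row in W]


# Toy example with n = 2 and l = 3.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
phi = [0.5, 2.0, 1.0]
f = embed(W, phi)
```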
8
Methods Build the matrix W: Distance between two proteins
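The distance formula did not survive in this transcript. The sketch below assumes an L1 (Manhattan) distance between embedded vectors, which is one common choice and not confirmed by the slide; it also shows how such a distance would be used to rank a database against a query. The function names are hypothetical.

```python
def l1_distance(u, v):
    """L1 distance between two embedded proteins (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def rank_by_distance(query_vec, database):
    """Return database ids sorted from nearest to farthest.

    database maps a protein id to its embedded vector.
    """
    return sorted(database, key=lambda pid: l1_distance(query_vec, database[pid]))


db = {"a": [3.0, 0.0], "b": [0.5, 0.5], "c": [1.0, 1.0]}
ranking = rank_by_distance([0.0, 0.0], db)
```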
9
Methods Build the matrix W: W is obtained by minimizing a loss over the training tuples. We initialize W randomly from a normal distribution with mean zero and standard deviation one.
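The objective itself is missing from this transcript. As a hedged sketch, the code below initializes W from N(0, 1) as the slide states, and then assumes a margin ranking (hinge) loss, max(0, 1 + d(q, p+) - d(q, p-)), with an L1 distance and a plain stochastic subgradient step. The loss form, the margin of 1, and all function names are assumptions for illustration.

```python
import random


def init_matrix(n, l, seed=0):
    """n x l matrix with entries drawn from a normal distribution N(0, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]


def embed(W, phi):
    return [sum(w * x for w, x in zip(row, phi)) for row in W]


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def sign(x):
    return (x > 0) - (x < 0)


def triplet_loss(W, phi_q, phi_pos, phi_neg, margin=1.0):
    """Assumed hinge loss: p+ should be closer to q than p- by the margin."""
    fq, fp, fn = embed(W, phi_q), embed(W, phi_pos), embed(W, phi_neg)
    return max(0.0, margin + l1(fq, fp) - l1(fq, fn))


def sgd_step(W, phi_q, phi_pos, phi_neg, lr=0.05, margin=1.0):
    """One subgradient step on a single tuple; returns a new matrix."""
    if triplet_loss(W, phi_q, phi_pos, phi_neg, margin) <= 0.0:
        return [row[:] for row in W]  # tuple already satisfied, no update
    fq, fp, fn = embed(W, phi_q), embed(W, phi_pos), embed(W, phi_neg)
    new_W = []
    for k, row in enumerate(W):
        s_pos = sign(fq[k] - fp[k])  # subgradient of |f(q)_k - f(p+)_k|
        s_neg = sign(fq[k] - fn[k])  # subgradient of |f(q)_k - f(p-)_k|
        new_W.append([
            w - lr * (s_pos * (xq - xp) - s_neg * (xq - xn))
            for w, xq, xp, xn in zip(row, phi_q, phi_pos, phi_neg)
        ])
    return new_W


W0 = [[1.0, 0.0]]
q, pos, neg = [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]
loss_before = triplet_loss(W0, q, pos, neg)
W1 = sgd_step(W0, q, pos, neg, lr=0.1)
loss_after = triplet_loss(W1, q, pos, neg)
```

In the toy example the positive protein starts farther from the query than the negative one, so the step shrinks that gap and the loss decreases.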
10
Methods Adding information about protein structure: (1) category labels for a given protein (from SCOP); (2) similarity scores between pairs of proteins (SCOP category relationships or MAMMOTH similarities).
11
Methods Class-based data: each protein has a class label, and each fold or superfamily is represented by a vector.
12
Methods Class-based data:
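The class-based formulas are not preserved in this transcript. One plausible reading, offered only as an assumption, is that each fold or superfamily has its own vector in the embedding space and that a protein should lie closer to its own class vector than to any other. The sketch below expresses that idea as a hinge loss with an L1 distance; the names, the margin, and the loss form are all hypothetical.

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def class_hinge_loss(f_p, class_vectors, true_label, margin=1.0):
    """Penalize every wrong class that is not at least `margin` farther
    from the embedded protein than the true class (assumed loss form)."""
    d_true = l1(f_p, class_vectors[true_label])
    return sum(
        max(0.0, margin + d_true - l1(f_p, vec))
        for label, vec in class_vectors.items()
        if label != true_label
    )


def nearest_class(f_p, class_vectors):
    """Predict the class whose vector is closest to the embedded protein."""
    return min(class_vectors, key=lambda label: l1(f_p, class_vectors[label]))


classes = {"fold_a": [0.0, 1.0], "fold_b": [3.0, 0.0]}
f_p = [0.0, 0.0]
loss_correct = class_hinge_loss(f_p, classes, "fold_a")
loss_wrong = class_hinge_loss(f_p, classes, "fold_b")
predicted = nearest_class(f_p, classes)
```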
13
Methods Ranking-based data: The tuples are created by running MAMMOTH (the cutoff for choosing p+ is 2.0); the rest of the procedure is the same as before.
14
Methods Data sets: (1) Labeled data: proteins from the SCOP v1.59 protein database; training set: 7329 sequences, test set: 97 sequences. (2) Unlabeled data: 115,644 sequences from ADDA.
15
Methods Comparison methods: (1) PSI-BLAST version 2.2.8; (2) HHSearch version 1.5.0 which is considered a leading method for remote homology detection; (3) RankProp.
16
Results We then used PSI-BLAST, Rankprop, HHSearch and PROTEMBED to rank a collection of 7329 SCOP domain sequences with respect to each of 97 test domains. To provide a rich database in which to perform the search, we augmented the SCOP data set with 115,644 single-domain sequences from the ADDA domain database.
17
Results Before training our embedding, we ran a series of cross-validation experiments within the training set to select hyperparameters. (1) For the embedding trained with PSI-BLAST, a learning rate of 0.05 and an embedding dimension of 250; (2) for the embedding trained with HHSearch, a learning rate of 0.02 and an embedding dimension of 100.
18
Results 1. ProtEmbed (trained using HHSearch); 2. ProtEmbed (trained using HHSearch); 3. HHSearch; 4. RankProp; 5. PSI-BLAST. Adding structure information improves performance; SCOP rank > SCOP class.
19
Results Calibration of ProtEmbed scores: To measure the calibration of the scores among queries, we sorted all of the scores from all 97 test queries into a single list.
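The pooling step described here can be sketched as below. The sketch assumes each score is a distance in the embedding space, so smaller means more similar, and that each hit carries a true/false homolog label; the data layout and the summary statistic (true positives found before the k-th false positive) are illustrative choices, not taken from the slides.

```python
def pooled_ranking(per_query_scores):
    """Merge the scored hits of all queries into one list sorted by score.

    per_query_scores maps a query id to a list of (score, is_true_homolog).
    Assumption: a smaller score means a closer match.
    """
    pooled = [
        (score, query, is_pos)
        for query, hits in per_query_scores.items()
        for score, is_pos in hits
    ]
    pooled.sort(key=lambda item: item[0])
    return pooled


def true_positives_before_fp(pooled, max_false_positives):
    """Count true positives ranked ahead of the k-th false positive."""
    tp = fp = 0
    for _, _, is_pos in pooled:
        if is_pos:
            tp += 1
        else:
            fp += 1
            if fp >= max_false_positives:
                break
    return tp


scores = {
    "q1": [(0.1, True), (0.9, False)],
    "q2": [(0.2, True), (0.5, False), (0.7, True)],
}
pooled = pooled_ranking(scores)
```

If the scores are well calibrated across queries, true homologs from different queries should still cluster at the top of the pooled list.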
20
Results
21
Visualizing the results of a query
22
Discussion The PROTEMBED algorithm learns its embedding on domain sequences rather than full-length protein sequences; it is nevertheless possible to process a multi-domain query sequence using PROTEMBED.
23
Conclusion We can use “Semantic Embedding” to address the problem of “detecting remote evolutionary relationships among proteins”; adding structure information helps improve detection performance.
24
Thank you!