Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding Xu Linhe 14S051069.

Presentation transcript:

Contents: Introduction, Methods, Results, Discussion, Conclusion

Introduction Detecting remote evolutionary relationships among proteins is still a difficult problem in bioinformatics. Because a protein sequence consists of letters, much like the words in a text, the authors approach the problem with techniques from natural language processing (NLP). They present an algorithm called “ProtEmbed” that learns an embedding of protein domain sequences into a semantic space. They use multitask learning to build the model, and the auxiliary information they use is “class labels” from SCOP and “structural similarity scores”.

Methods Define a feature representation: p_t is a protein from a database such as SCOP, and p’ is a query protein.

Methods Define a feature representation: The score between p’ and p_t is the PSI-BLAST E-value, which is easy to calculate.
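The feature representation on this slide can be illustrated with a minimal sketch, assuming each query p’ is described by one E-value per database protein p_t. The function name `feature_vector`, the toy identifiers, and the fallback value `NO_HIT_EVALUE` for proteins the search did not report are hypothetical and not taken from the slides.

```python
from typing import Dict, List

# Assumption: value used when the search reports no hit for a database protein.
NO_HIT_EVALUE = 10.0


def feature_vector(hits: Dict[str, float], database_ids: List[str]) -> List[float]:
    """Build one entry per database protein p_t for a query p'.

    `hits` maps database protein ids to the E-value reported for the query.
    Proteins without a reported hit receive NO_HIT_EVALUE.
    """
    return [hits.get(pid, NO_HIT_EVALUE) for pid in database_ids]


# Toy example with made-up identifiers and E-values.
database_ids = ["d1", "d2", "d3", "d4"]
hits = {"d1": 1e-30, "d3": 0.05}
phi = feature_vector(hits, database_ids)
print(phi)
```

The vector has as many entries as the database has proteins, which is why a later step projects it into a much smaller space.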

Methods Construct a training set of tuples R(q, p+, p-): The tuples are collected by running PSI-BLAST against a database. Given a query protein q, any protein with an E-value < 0.1 is treated as a positive example p+, and the negative example p- is picked at random.
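A rough sketch of the tuple construction described on this slide is given here. The 0.1 E-value threshold comes from the slide; excluding the query and its positives from the random negative pool is an assumption, and `sample_triplets` and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_triplets(
    evalues: Dict[str, Dict[str, float]],
    all_ids: List[str],
    threshold: float = 0.1,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Return (q, p_plus, p_minus) tuples.

    `evalues[q][p]` is the E-value of protein p for query q.
    p_plus is any protein with E-value below `threshold`;
    p_minus is drawn at random from the remaining proteins (an assumption).
    """
    rng = random.Random(seed)
    triplets = []
    for q, hits in evalues.items():
        positives = [p for p, e in hits.items() if e < threshold and p != q]
        candidates = [p for p in all_ids if p != q and p not in positives]
        if not candidates:
            continue  # no negative example available for this query
        for p_plus in positives:
            triplets.append((q, p_plus, rng.choice(candidates)))
    return triplets


# Toy example with made-up identifiers and E-values.
evalues = {"q1": {"a": 1e-5, "b": 0.5, "c": 3.0}}
all_ids = ["q1", "a", "b", "c"]
triplets = sample_triplets(evalues, all_ids)
print(triplets)
```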

Methods Embedding: W is an n × l matrix. Multiplying the feature vector by W reduces the dimension from l to n and embeds the protein into a low-dimensional space.
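The projection can be sketched as a plain matrix-vector product, assuming the embedding is linear in the feature vector. The helper `embed` and the small hand-written matrix are illustrative only; in practice l would be the database size and n the embedding dimension.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def embed(W: Matrix, phi: Vector) -> Vector:
    """Project an l-dimensional feature vector to n dimensions with an n x l matrix."""
    return [sum(w * x for w, x in zip(row, phi)) for row in W]


# Toy example: n = 2, l = 4.
W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
phi = [1.0, 2.0, 3.0, 4.0]
z = embed(W, phi)
print(z)  # [7.0, -2.0]
```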

Methods Build the matrix W: Distance between two proteins

Methods Build the matrix W: min We initialize the matrix W randomly using a normal distribution with mean zero and standard deviation one.
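The objective itself did not survive in this transcript, so the following is only a sketch under stated assumptions: an L1 distance between embedded proteins and a margin ranking loss that asks q to be closer to p+ than to p- by a margin of 1, minimized with stochastic gradient descent. Only the random initialization (normal distribution, mean zero, standard deviation one) is taken from the slide; every function name here is hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def init_matrix(n: int, l: int, seed: int = 0) -> Matrix:
    """Draw every entry of the n x l matrix from N(0, 1), as stated on the slide."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def l1_distance(W: Matrix, a: Vector, b: Vector) -> float:
    """Assumed distance: L1 norm of W(a - b)."""
    diff = [x - y for x, y in zip(a, b)]
    return sum(abs(dot(row, diff)) for row in W)


def ranking_loss(W: Matrix, q: Vector, pos: Vector, neg: Vector, margin: float = 1.0) -> float:
    """Assumed margin ranking loss: zero once q is closer to pos than to neg by `margin`."""
    return max(0.0, margin + l1_distance(W, q, pos) - l1_distance(W, q, neg))


def sgd_step(W: Matrix, q: Vector, pos: Vector, neg: Vector,
             lr: float = 0.05, margin: float = 1.0) -> float:
    """Update W in place with one subgradient step; return the loss before the update."""
    loss = ranking_loss(W, q, pos, neg, margin)
    if loss > 0.0:
        d_pos = [x - y for x, y in zip(q, pos)]
        d_neg = [x - y for x, y in zip(q, neg)]
        for row in W:
            s_pos = sign(dot(row, d_pos))
            s_neg = sign(dot(row, d_neg))
            for j in range(len(row)):
                row[j] -= lr * (s_pos * d_pos[j] - s_neg * d_neg[j])
    return loss


# Toy run on made-up feature vectors (l = 3, n = 2).
W = init_matrix(2, 3)
q, pos, neg = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.5, 0.0]
before = sgd_step(W, q, pos, neg)
after = ranking_loss(W, q, pos, neg)
print(before, after)
```

The step size 0.05 matches the learning rate reported later for the PSI-BLAST setting; the margin value is an illustrative choice.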

Methods Adding information about protein structure: (1) category labels for a given protein (from SCOP); (2) similarity scores between pairs of proteins (SCOP category relationships or MAMMOTH similarities).

Methods Class-based data: is the class label, is a vector representing the fold or superfamily

Methods Class-based data:
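The formula for this slide is not preserved in the transcript. One plausible reading, given purely as an assumption, is that each fold or superfamily has its own vector in the embedding space and that a protein's embedding should lie closer to the vector of its own class than to that of another class. The sketch below encodes that reading with an L1 distance and a margin of 1; `class_ranking_loss` and the toy labels are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def l1(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def class_ranking_loss(
    z: Vector,
    class_vectors: Dict[str, Vector],
    label: str,
    other: str,
    margin: float = 1.0,
) -> float:
    """Assumed loss: the embedded protein z should be closer to its own class
    vector than to the vector of another class by at least `margin`."""
    return max(0.0, margin + l1(z, class_vectors[label]) - l1(z, class_vectors[other]))


# Toy example with made-up class vectors in a 2-dimensional embedding space.
class_vectors = {"fold_a": [0.5, 0.0], "fold_b": [3.0, 0.0]}
z = [0.0, 0.0]
loss_correct = class_ranking_loss(z, class_vectors, "fold_a", "fold_b")
loss_wrong = class_ranking_loss(z, class_vectors, "fold_b", "fold_a")
print(loss_correct, loss_wrong)  # 0.0 3.5
```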

Methods Ranking-based data: The tuples are created by running MAMMOTH (the cutoff for choosing p+ is 2.0); the rest is the same as before.

Methods Data sets: (1) Labeled data: We used proteins from the SCOP v1.59 protein database. Training set: 7329; test set: 97. (2) Unlabeled data: The data is from ADDA.

Methods Comparison methods: (1) PSI-BLAST version 2.2.8; (2) HHSearch version which is considered a leading method for remote homology detection; (3) RankProp.

Results We then used PSI-BLAST, RankProp, HHSearch and PROTEMBED to rank a collection of 7329 SCOP domain sequences with respect to each of 97 test domains. To provide a rich database in which to perform the search, we augmented the SCOP data set with 115,644 single-domain sequences from the ADDA domain database.
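Ranking a database with respect to a query can be sketched as sorting the database proteins by their distance to the query in the embedding space. The L1 distance is an assumption, and `rank_database` and the toy embeddings are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_database(query: Vector, database: Dict[str, Vector]) -> List[str]:
    """Return database ids ordered from nearest to farthest from the query embedding."""
    return sorted(database, key=lambda pid: l1_distance(query, database[pid]))


# Toy example with made-up 2-dimensional embeddings.
database = {"a": [1.0, 1.0], "b": [0.1, 0.0], "c": [-2.0, 0.5]}
ranking = rank_database([0.0, 0.0], database)
print(ranking)  # ['b', 'a', 'c']
```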

Results Before training our embedding, we ran a series of cross-validation experiments within the training set to select hyperparameters. The selected values were: (1) for PSI-BLAST, a learning rate of 0.05 and an embedding dimension of 250; (2) for HHSearch, a learning rate of 0.02 and an embedding dimension of 100.

Results 1. ProtEmbed (trained using HHSearch); 2. ProtEmbed (trained using HHSearch); 3. HHSearch; 4. RankProp; 5. PSI-BLAST. Adding structure information improves performance; SCOP rank > SCOP class.

Results Calibration of ProtEmbed scores: To measure the calibration of the scores among queries, we sorted all of the scores from all 97 test queries into a single list.
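The pooling step can be sketched as follows, assuming smaller scores mean closer matches and that each hit carries a flag saying whether it is a true homolog. Counting true positives ranked above a given number of false positives is one possible summary, chosen here for illustration; the function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# (target id, score, is true homolog)
Hit = Tuple[str, float, bool]


def pool_and_sort(results: Dict[str, List[Hit]]) -> List[Tuple[float, str, str, bool]]:
    """Merge the hits of all queries into one list sorted by score (best first)."""
    pooled = [
        (score, query, target, is_homolog)
        for query, hits in results.items()
        for target, score, is_homolog in hits
    ]
    pooled.sort(key=lambda r: r[0])
    return pooled


def true_positives_before_fp(pooled: List[Tuple[float, str, str, bool]], max_fp: int) -> int:
    """Count true positives ranked above the max_fp-th false positive."""
    tp = fp = 0
    for _, _, _, is_homolog in pooled:
        if is_homolog:
            tp += 1
        else:
            fp += 1
            if fp >= max_fp:
                break
    return tp


# Toy example with made-up scores for two queries.
results = {
    "q1": [("a", 0.2, True), ("b", 1.5, False)],
    "q2": [("c", 0.1, True), ("d", 0.9, True), ("e", 0.4, False)],
}
pooled = pool_and_sort(results)
print([r[2] for r in pooled])               # ['c', 'a', 'e', 'd', 'b']
print(true_positives_before_fp(pooled, 1))  # 2
```

If scores are well calibrated across queries, true homologs from every query should concentrate near the top of the pooled list.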

Results

Visualizing the results of a query

Discussion The PROTEMBED algorithm learns its embedding on domain sequences rather than full-length protein sequences. It is nevertheless possible to process a multi-domain query sequence using PROTEMBED.

Conclusion We can use “semantic embedding” to address the problem of detecting remote evolutionary relationships among proteins. Adding structure information helps to improve detection performance.

Thank you!