A Deep Learning Technical Paper Recommender System By Janelle Blankenburg
Outline
Introduction: Problem Description; Major Objectives
Methodology: Semantic Analysis (Data Preprocessing, Model Training); Network Generation (Three Network Versions); Recommendation Mechanism (Finding and Ranking Similar Papers); Experimental Evaluation (Dataset, Evaluation Metrics)
Conclusion
1 min
Introduction
Problem Description Technical paper recommendation General process Identify keywords related to research interests Input keywords into online textual search service Google Scholar, CiteSeer, arXiv Modify keyword list and repeat 1 min
Problem Description Cont. Simple keyword search is not sufficient Collaborative Filtering (CF) methods [1] Like-minded users Content-Based Filtering (CBF) methods [2] Previous “purchases” Content-Based Filtering issues: Use only title and abstract of paper Use word counts as basis for similarity models 2 min
Major Objectives Verify that using content from the body of a paper can result in better recommendations than using only the title and abstract Develop a novel recommender approach which utilizes deep learning and network science fundamentals Use semantic analysis of text instead of word counts Consists of meaningful relations between recommended papers Visualize relations through generated networks 1 min
Methodology
Semantic Analysis Simple word counts are not sufficient Use semantic analysis to extract meaning from full text of paper Compare content across various papers to get similarity
Data Pre-Processing Given source code of a set of papers written in LaTeX [3] Parse the source code to extract the following: Title Abstract Other sections via “\section” 1 min
Data Pre-Processing Separate other sections into the following categories: Introduction (trivial) Related Work/Background (if both exist, combine) Conclusion/Future Work Methodology (remaining sections assumed to be methodology) 1 min
Data Pre-Processing Final categories: Title, Abstract, Introduction, Related Work, Methodology, Conclusion Pre-processing through gensim [4] Phrase extraction “machine” and “learning” ⇒ “machine_learning” Output: List of keywords for each paper Text cleaning Tokenize text into words, remove punctuation, lowercase letters, etc. Output: List of words in sequential order from each category of each paper 2 min
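The cleaning and phrase-extraction steps can be sketched in a few lines. This is a pure-Python stand-in for the gensim utilities the slides reference (gensim's `simple_preprocess` and `Phrases` do the real work); the bigram set here is a hypothetical example:

```python
import re

def clean_tokens(text):
    """Tokenize, lowercase, and strip punctuation (crude stand-in for gensim's simple_preprocess)."""
    return re.findall(r"[a-z]+", text.lower())

def merge_phrases(tokens, known_bigrams):
    """Join adjacent tokens that form a known phrase, e.g. 'machine' + 'learning' -> 'machine_learning'."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = clean_tokens("Deep Machine Learning, for paper recommendation.")
print(merge_phrases(tokens, {("machine", "learning")}))
# ['deep', 'machine_learning', 'for', 'paper', 'recommendation']
```

In the actual pipeline, gensim learns which bigrams to merge from corpus statistics rather than taking a fixed set.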
Semantic Analysis Analyze aspects of the text through two approaches: Word2vec [5] Doc2vec [6] Should be around 15 minutes here!!!
Word2Vec Natural Language Processing (NLP) tool Computes numerical vector representations of words Allows us to use numerical metrics to perform similarity comparisons between sets of words Example: king − man + woman ≈ queen 2 min http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
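The analogy above is just vector arithmetic plus a nearest-neighbor search under cosine similarity. A toy sketch with hypothetical 3-dimensional embeddings (real word2vec vectors are learned and typically 100-300 dimensional):

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king - man + woman ...
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# ... is closest (excluding the query words themselves) to "queen"
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```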
Doc2Vec Also used for NLP Extension of word2vec Computes numerical vector representations of documents instead of words Documents can be: Short 140 character tweet Single paragraph such as an abstract A full article http://gensim.narkive.com/RavqZorK/gensim-4914-graphic-representations-of-word2vec-and-doc2vec 1 min
Semantic Analysis Word2vec model: 1 model Input: List of keywords from each paper in dataset Doc2vec models: 7 models Title Abstract Introduction Related Work Methodology Conclusion Full text - all 6 categories concatenated together Input: List of words from respective sections for each paper in dataset 1 min
Network Generation Want to verify that including content from body produces better recommendations Given pre-processed data and trained models Generate three networks using different combinations of models Title and abstract All 7 category-based models All 8 models: 7 category-based models plus the word2vec keyword model 1 min
Network Generation Iterate through dataset, comparing papers pairwise Use trained models to generate similarity scores between pairs of papers Cosine similarity Use scores to create edges in the three networks If score > threshold Create edge between pair of papers First network: score = average scores from title and abstract models Second network: score = average scores from 7 category-based models Third network: score = average scores from all 8 models 2 min
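The edge-creation step above can be sketched as follows. The per-model similarity scores would come from the trained word2vec/doc2vec models; here they are stood in by a hypothetical precomputed table:

```python
from itertools import combinations

def build_network(papers, pair_score, threshold):
    """Create an edge between every pair of papers whose averaged
    model similarity score exceeds the threshold."""
    edges = {p: set() for p in papers}
    for a, b in combinations(papers, 2):
        if pair_score(a, b) > threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Hypothetical per-model cosine scores (one entry per model) for each pair.
scores = {("p1", "p2"): [0.9, 0.8], ("p1", "p3"): [0.2, 0.3], ("p2", "p3"): [0.7, 0.75]}

def avg_score(a, b):
    """Average the scores across all models, as each network version does."""
    s = scores.get((a, b)) or scores.get((b, a))
    return sum(s) / len(s)

net = build_network(["p1", "p2", "p3"], avg_score, threshold=0.5)
print(net)  # p1-p2 and p2-p3 clear the threshold; p1-p3 does not
```

Each of the three network versions differs only in which models contribute to `avg_score` (2, 7, or 8 of them).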
Network Generation After all pairs have been checked, network generation is complete Result is three different network representations of similarities between papers in dataset 1 min
Recommendation Mechanism Goal: Return top m papers from dataset most similar to input paper Given: Input paper from the user Trained similarity models Generated networks 1 min
Recommendation Mechanism Use gensim [4] to obtain the most similar paper to the input paper This paper becomes the top recommendation Gather friends of this paper as candidate recommendations If there are not at least m candidates, gather friends of friends Continue gathering layers of friends until there are at least m candidates 1 min
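The layer-by-layer candidate gathering is a breadth-first expansion from the top recommendation; a minimal sketch, assuming the network is an adjacency mapping like the one produced during network generation:

```python
def gather_candidates(network, top_paper, m):
    """Collect friends, then friends-of-friends, one full layer at a
    time, until at least m candidate papers have been gathered."""
    candidates, frontier, seen = [], [top_paper], {top_paper}
    while len(candidates) < m and frontier:
        next_frontier = []
        for paper in frontier:
            for friend in network.get(paper, ()):
                if friend not in seen:
                    seen.add(friend)
                    candidates.append(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return candidates

# Hypothetical toy network: "a" is the top recommendation.
net = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
print(gather_candidates(net, "a", 3))  # ['b', 'c', 'd']
```

Note that a full layer is always added, so the result may contain more than m candidates; the ranking step that follows trims it to the top m.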
Recommendation Mechanism 2 min
Recommendation Mechanism Get similarity scores for candidate papers Similarity scoring same as pairwise process used to generate network edges: First network: score = average scores from title and abstract models Second network: score = average scores from 7 category-based models Third network: score = average scores from all 8 models Order candidate papers based on highest similarity Return top m papers as the list of recommendations to user 2 min
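The final ranking step is a sort over the candidates by their averaged similarity to the input paper. A sketch, with the averaged model scores stood in by a hypothetical lookup table:

```python
def recommend(candidates, score_fn, query, m):
    """Rank candidate papers by averaged similarity to the query paper
    and return the top m as the recommendation list."""
    ranked = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:m]

# Hypothetical averaged model scores against the input paper.
sims = {"p2": 0.85, "p3": 0.60, "p4": 0.72}
top = recommend(["p2", "p3", "p4"], lambda q, p: sims[p], "input_paper", 2)
print(top)  # ['p2', 'p4']
```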
Recommendation Mechanism 1 min
Experimental Evaluation Aim to test approach in small, preliminary experiment Manually verify logical recommendations Serves as proof of concept Full-scale experiment (time permitting) Can perform further numerical analysis of recommender system Unable to verify recommendations manually 2 min
Dataset Preliminary experiment: Source code from 100 papers from arXiv 10 subareas within computer science Computer vision Robotics Machine learning Graphics Networking Computer security Operating systems Parallel computing Compiler theory Software engineering 10 papers per subarea 1 min
Dataset Full-scale experiment: Specialized dataset: Hep-th portion of arXiv from 1992-2003 29,000 papers General dataset: Crawl arXiv to get papers from all major areas physics, mathematics, computer science, quantitative finance, electrical engineering 50,000 papers, 10,000 per area 1 min
Evaluation Metrics Want to determine which of the three generated networks provides the best recommendations Preliminary experiment: Manually examine ranking from each network Recommendations should be in same subarea Compare underlying scores for top m papers from each network More similar papers should produce better scores Full-scale experiment: Average scores for top m papers Average number of recommendations in same area 2 min
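The same-subarea metric for the full-scale experiment reduces to a simple hit rate; a minimal sketch, assuming each paper's subarea label is known from its arXiv category:

```python
def same_area_rate(recommendations, areas, query_area):
    """Fraction of recommended papers whose subarea matches the query paper's."""
    hits = sum(1 for p in recommendations if areas[p] == query_area)
    return hits / len(recommendations)

# Hypothetical labels for three recommended papers.
areas = {"p1": "robotics", "p2": "robotics", "p3": "graphics"}
print(same_area_rate(["p1", "p2", "p3"], areas, "robotics"))  # 2 of 3 match
```

Averaging this rate (and the underlying similarity scores) over many query papers gives the per-network numbers the three versions are compared on.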
Conclusion
Conclusion Want to verify that utilizing content from body of paper generates better recommendations than only using the abstract and title Proposed a novel technical paper recommender system Utilizes the full content of the paper Combines deep learning methods with network science foundations Generalizable and consists of meaningful relations between papers Connections between papers can be easily visualized Uses semantic analysis to generate similarity models instead of relying on word counts 1 min
References [1] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “Grouplens: an open architecture for collaborative filtering of netnews,” in Proceedings of the 1994 ACM conference on Computer supported cooperative work, pp. 175–186, ACM, 1994. [2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender systems handbook, pp. 1–35, Springer, 2011. [3] L. Lamport, LaTeX: a document preparation system: user’s guide and reference manual. Addison-Wesley, 1994. [4] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, (Valletta, Malta), pp. 45–50, ELRA, May 2010. http://is.muni.cz/publication/884893/en. [5] T. Mikolov, K. Chen, G. Corrado, J. Dean, I. Sutskever, and G. Zweig, “word2vec,” 2014. [6] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.
Questions?