A Deep Learning Technical Paper Recommender System




1 A Deep Learning Technical Paper Recommender System
By Janelle Blankenburg

2 Outline
Introduction
  Problem Description
  Major Objectives
Methodology
  Semantic Analysis
  Data Pre-Processing
  Model Training
  Network Generation
    Three Network Versions
  Recommendation Mechanism
    Finding and Ranking Similar Papers
Experimental Evaluation
  Dataset
  Evaluation Metrics
Conclusion

3 Introduction

4 Problem Description
Technical paper recommendation: the general process is to
  Identify keywords related to research interests
  Input the keywords into an online textual search service (Google Scholar, CiteSeer, arXiv)
  Modify the keyword list and repeat

5 Problem Description Cont.
Simple keyword search is not sufficient
Collaborative Filtering (CF) methods [1]: based on like-minded users
Content-Based Filtering (CBF) methods [2]: based on previous “purchases”
Content-Based Filtering issues:
  Use only the title and abstract of a paper
  Use word counts as the basis for similarity models

6 Major Objectives
Verify that using content from the body of a paper can result in better recommendations than using only the title and abstract
Develop a novel recommender approach which utilizes deep learning and network science fundamentals
  Uses semantic analysis of text instead of word counts
  Consists of meaningful relations between recommended papers
  Visualizes relations through generated networks

7 Methodology

8 Semantic Analysis
Simple word counts are not sufficient
Use semantic analysis to extract meaning from the full text of a paper
Compare content across papers to obtain similarity

9 Data Pre-Processing
Given the source code of a set of papers written in LaTeX [3]
Parse the source code to extract the following:
  Title
  Abstract
  Other sections, via “\section”

10 Data Pre-Processing
Separate the other sections into the following categories:
  Introduction (trivial to identify)
  Related Work/Background (if both exist, combine them)
  Conclusion/Future Work
  Methodology (remaining sections are assumed to be methodology)

11 Data Pre-Processing
Final categories: Title, Abstract, Introduction, Related Work, Methodology, Conclusion
Pre-processing through gensim [4]:
  Phrase extraction: “machine” and “learning” ⇒ “machine_learning”
    Output: list of keywords for each paper
  Text cleaning: tokenize text into words, remove punctuation, lowercase letters, etc.
    Output: list of words in sequential order from each category of each paper
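The two pre-processing steps above can be sketched in plain Python. Here `clean_text` and `merge_phrases` are illustrative stand-ins for what gensim's `simple_preprocess` and `Phrases` models do; the regex and the hard-coded bigram set are assumptions for the example, not the talk's actual configuration.

```python
import re

def clean_text(raw):
    """Tokenize, lowercase, and strip punctuation (a simplified
    stand-in for gensim's text cleaning)."""
    return re.findall(r"[a-z_]+", raw.lower())

def merge_phrases(tokens, known_bigrams):
    """Join known bigrams such as ('machine', 'learning') into
    'machine_learning', mimicking gensim's Phrases model."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

abstract = "Machine learning methods for paper recommendation."
tokens = clean_text(abstract)
print(merge_phrases(tokens, {("machine", "learning")}))
# ['machine_learning', 'methods', 'for', 'paper', 'recommendation']
```

In the real pipeline, gensim learns which bigrams to merge from corpus statistics rather than taking a fixed set.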

12 Semantic Analysis
Analyze aspects of the text through two approaches:
  Word2vec [5]
  Doc2vec [6]

13 Word2Vec
Natural Language Processing (NLP) tool
Computes numerical vector representations of words
Allows us to use numerical metrics to perform similarity comparisons between sets of words
Example: king − man + woman ≈ queen
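The analogy can be checked numerically with cosine similarity. The 3-dimensional vectors below are hand-made toys chosen to make the example work; real word2vec embeddings are learned from text and are typically 100–300 dimensional.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings, hand-picked for illustration only.
vec = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "paper": [0.1, 0.9, 0.2],
}

# king - man + woman, then find the nearest remaining word.
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```

gensim exposes the same operation directly as `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`.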

14 Doc2Vec
Also used for NLP; an extension of word2vec
Computes numerical vector representations of documents instead of words
Documents can be:
  A short 140-character tweet
  A single paragraph, such as an abstract
  A full article

15 Semantic Analysis
Word2vec model: 1 model
  Input: list of keywords from each paper in the dataset
Doc2vec models: 7 models
  Title, Abstract, Introduction, Related Work, Methodology, Conclusion, and Full text (all 6 categories concatenated together)
  Input: list of words from the respective section of each paper in the dataset

16 Network Generation
Want to verify that including content from the body produces better recommendations
Given the pre-processed data and trained models, generate three networks using different combinations of models:
  Title and abstract models
  All 7 category-based models
  All 8 models: the 7 category-based models plus the keyword model

17 Network Generation
Iterate through the dataset, comparing papers pairwise
Use the trained models to generate similarity scores between pairs of papers (cosine similarity)
Use the scores to create edges in the three networks: if score > threshold, create an edge between the pair of papers
  First network: score = average of the scores from the title and abstract models
  Second network: score = average of the scores from the 7 category-based models
  Third network: score = average of the scores from all 8 models
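The edge-creation step above can be sketched as follows. The similarity functions, the stand-in scores, and the threshold value of 0.7 are all assumptions for illustration; the talk does not specify the threshold, and in the real system the scores come from the trained gensim models.

```python
from itertools import combinations

THRESHOLD = 0.7  # assumed cutoff; the talk does not give a value

def average_score(paper_a, paper_b, similarity_fns):
    """Average the per-model similarity scores for one pair of papers.
    similarity_fns stands in for the trained models of one network."""
    scores = [f(paper_a, paper_b) for f in similarity_fns]
    return sum(scores) / len(scores)

def build_network(papers, similarity_fns, threshold=THRESHOLD):
    """Create an edge between every pair whose averaged score passes
    the threshold; the network is stored as an adjacency dict."""
    edges = {p: set() for p in papers}
    for a, b in combinations(papers, 2):
        if average_score(a, b, similarity_fns) > threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Hand-made pairwise scores standing in for real model output.
fake_scores = {frozenset({"p1", "p2"}): 0.9,
               frozenset({"p1", "p3"}): 0.2,
               frozenset({"p2", "p3"}): 0.8}
sim = lambda a, b: fake_scores[frozenset({a, b})]
net = build_network(["p1", "p2", "p3"], [sim])
print(net["p1"])  # {'p2'}
```

Passing a different list of `similarity_fns` (title/abstract only, 7 models, or all 8) yields the three network versions.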

18 Network Generation
After all pairs have been checked, network generation is complete
The result is three different network representations of the similarities between papers in the dataset

19 Recommendation Mechanism
Goal: return the top m papers from the dataset most similar to the input paper
Given:
  Input paper from the user
  Trained similarity models
  Generated networks

20 Recommendation Mechanism
Use gensim [4] to obtain the paper most similar to the input paper; this paper becomes the top recommendation
Gather the friends of this paper as candidate recommendations
If there are not at least m candidates, gather friends of friends
Continue gathering layers of friends until there are at least m candidates
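The layer-by-layer expansion above is a breadth-first search over the generated network. This is a minimal sketch: the adjacency dict and function names are illustrative, and within a layer it may collect slightly more than m papers, matching the slide's "at least m candidates".

```python
from collections import deque

def gather_candidates(network, seed, m):
    """Expand outward from the top match, layer by layer (friends,
    then friends of friends), until at least m candidates are found."""
    candidates = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(candidates) < m:
        next_frontier = deque()
        for paper in frontier:
            for neighbor in sorted(network[paper]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    candidates.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return candidates

# Tiny hand-made adjacency dict standing in for a generated network.
network = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
print(gather_candidates(network, "a", 3))  # ['b', 'c', 'd']
```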

21 Recommendation Mechanism

22 Recommendation Mechanism
Get similarity scores for the candidate papers
Similarity scoring is the same as the pairwise process used to generate the network edges:
  First network: score = average of the scores from the title and abstract models
  Second network: score = average of the scores from the 7 category-based models
  Third network: score = average of the scores from all 8 models
Order the candidate papers by highest similarity
Return the top m papers as the list of recommendations to the user
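The final scoring-and-ranking step can be sketched as below. The `recommend` function and the stand-in score table are illustrative assumptions; in the real system the per-model scores come from the trained gensim models for whichever network is being evaluated.

```python
def recommend(candidates, input_paper, similarity_fns, m):
    """Score each candidate against the input paper by averaging the
    per-model similarity scores (the same scoring used for network
    edges), then return the m highest-scoring candidates."""
    scored = []
    for c in candidates:
        avg = sum(f(input_paper, c) for f in similarity_fns) / len(similarity_fns)
        scored.append((avg, c))
    scored.sort(reverse=True)  # highest averaged similarity first
    return [c for _, c in scored[:m]]

# Stand-in similarity function instead of trained models.
fake = {"p2": 0.9, "p3": 0.4, "p4": 0.7}
sim = lambda query, c: fake[c]
print(recommend(["p2", "p3", "p4"], "query", [sim], 2))  # ['p2', 'p4']
```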

23 Recommendation Mechanism

24 Experimental Evaluation
Aim to test the approach in a small, preliminary experiment
  Manually verify that recommendations are logical
  Serves as a proof of concept
Full-scale experiment (time permitting):
  Can perform further numerical analysis of the recommender system
  Unable to verify recommendations manually

25 Dataset
Preliminary experiment: source code from 100 papers from arXiv
10 subareas within computer science, 10 papers per subarea:
  Computer vision, Robotics, Machine learning, Graphics, Networking, Computer security, Operating systems, Parallel computing, Compiler theory, Software engineering

26 Dataset
Full-scale experiment:
  Specialized dataset: hep-th portion of arXiv, about 29,000 papers
  General dataset: crawl arXiv to get papers from all major areas (physics, mathematics, computer science, quantitative finance, electrical engineering); 50,000 papers, 10,000 per area

27 Evaluation Metrics
Want to determine which of the three generated networks provides the best recommendations
Preliminary experiment:
  Manually examine the ranking from each network; recommendations should be in the same subarea as the input paper
  Compare the underlying scores for the top m papers from each network; more similar papers should produce better scores
Full-scale experiment:
  Average scores for the top m papers
  Average number of recommendations in the same area
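The two full-scale metrics can be computed as below. The function and argument names are illustrative, not from the talk, and the score and subarea tables are hand-made stand-ins.

```python
def evaluate(recommendations, scores, input_subarea, subarea_of):
    """Compute the two slide metrics: mean similarity score of the
    top-m list, and the count of recommendations sharing the input
    paper's subarea."""
    avg_score = sum(scores[p] for p in recommendations) / len(recommendations)
    same_area = sum(1 for p in recommendations
                    if subarea_of[p] == input_subarea)
    return avg_score, same_area

# Hand-made scores and subarea labels for illustration.
avg, same = evaluate(["p2", "p4"],
                     {"p2": 0.9, "p4": 0.7},
                     "robotics",
                     {"p2": "robotics", "p4": "graphics"})
print(round(avg, 3), same)  # 0.8 1
```

Averaging these quantities over many input papers gives the per-network numbers used to compare the three generated networks.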

28 Conclusion

29 Conclusion
Want to verify that utilizing content from the body of a paper generates better recommendations than using only the title and abstract
Proposed a novel technical paper recommender system that:
  Utilizes the full content of the paper
  Combines deep learning methods with network science foundations
  Is generalizable and consists of meaningful relations between papers
  Allows connections between papers to be easily visualized
  Uses semantic analysis to generate similarity models instead of relying on word counts

30 References
[1] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews,” in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186, ACM, 1994.
[2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender Systems Handbook, pp. 1–35, Springer, 2011.
[3] L. Lamport, LaTeX: A Document Preparation System: User’s Guide and Reference Manual. Addison-Wesley, 1994.
[4] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50, ELRA, May 2010.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean, I. Sutskever, and G. Zweig, “word2vec,” 2014.
[6] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.

31 Questions?

