1
A Deep Learning Technical Paper Recommender System
By Janelle Blankenburg
2
Outline
Introduction: Problem Description, Major Objectives
Methodology: Semantic Analysis, Data Preprocessing, Model Training, Network Generation, Three Network Versions, Recommendation Mechanism, Finding and Ranking Similar Papers
Experimental Evaluation: Dataset, Evaluation Metrics
Conclusion 1 min
3
Introduction
4
Problem Description Technical paper recommendation: the general process
1. Identify keywords related to research interests
2. Input keywords into an online textual search service (Google Scholar, CiteSeer, arXiv)
3. Modify the keyword list and repeat 1 min
5
Problem Description Cont.
Simple keyword search is not sufficient
Collaborative Filtering (CF) methods [1]: recommend based on like-minded users
Content-Based Filtering (CBF) methods [2]: recommend based on previous “purchases”
Content-Based Filtering issues:
Use only the title and abstract of a paper
Use word counts as the basis for similarity models 2 min
6
Major Objectives Verify that using content from the body of a paper can result in better recommendations than using only the title and abstract
Develop a novel recommender approach which utilizes deep learning and network science fundamentals:
Uses semantic analysis of text instead of word counts
Consists of meaningful relations between recommended papers
Visualizes relations through generated networks 1 min
7
Methodology
8
Semantic Analysis Simple word counts are not sufficient
Use semantic analysis to extract meaning from the full text of a paper
Compare content across various papers to get similarity
9
Data Pre-Processing Given the source code of a set of papers written in LaTeX [3]
Parse the source code to extract the following: Title, Abstract, Other sections via “\section” 1 min
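The “\section” parsing step can be sketched with a regular expression. The snippet below is a minimal stdlib illustration (the LaTeX source and section bodies are made up, and a real parser would also pull out the title and abstract):

```python
import re

# Hypothetical minimal LaTeX source for one paper in the dataset.
latex_src = r"""
\section{Introduction}
Intro text.
\section{Related Work}
Prior art.
\section{Conclusion}
Closing text.
"""

def extract_sections(src):
    """Split LaTeX source on section commands; return {title: body}."""
    parts = re.split(r"\\section\{([^}]*)\}", src)
    # parts alternates: [preamble, title1, body1, title2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}

sections = extract_sections(latex_src)
print(sorted(sections))  # section titles found via "\section"
```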
10
Data Pre-Processing Separate the other sections into the following categories:
Introduction (trivial to identify)
Related Work/Background (if both exist, combine)
Conclusion/Future Work
Methodology (remaining sections assumed to be methodology) 1 min
11
Data Pre-Processing Final categories:
Title, Abstract, Introduction, Related Work, Methodology, Conclusion
Pre-processing through gensim [4]:
Phrase extraction: “machine” and “learning” ⇒ “machine_learning”; output: list of keywords for each paper
Text cleaning: tokenize text into words, remove punctuation, lowercase letters, etc.; output: list of words in sequential order from each category of each paper 2 min
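A rough stdlib sketch of the two pre-processing steps (the deck’s actual pipeline uses gensim’s tokenizer and Phrases model; the sample text and the `min_count` threshold here are invented for illustration):

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, strip punctuation, and tokenize into words in order."""
    return re.findall(r"[a-z]+", text.lower())

def join_phrases(tokens, min_count=2):
    """Join adjacent word pairs seen at least min_count times,
    e.g. 'machine' + 'learning' -> 'machine_learning'."""
    pairs = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pairs[(tokens[i], tokens[i + 1])] >= min_count:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = clean("Machine learning, and machine learning again!")
print(join_phrases(toks))  # frequent bigrams become single tokens
```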
12
Semantic Analysis Analyze aspects of the text through two approaches:
Word2vec [5] Doc2vec [6] Should be around 15 minutes here!!!
13
Word2Vec Natural Language Processing (NLP) tool
Computes numerical vector representations of words
Allows us to use numerical metrics to perform similarity comparisons between sets of words
Example: king − man + woman = queen 2 min
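The analogy works because word vectors can be compared numerically, typically with cosine similarity. A toy illustration with hand-picked 3-d vectors (real word2vec embeddings are learned from text and have hundreds of dimensions):

```python
import math

# Toy embeddings chosen by hand so the analogy holds; entirely illustrative.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "apple": [0.1, 0.0, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, component-wise.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # nearest remaining word to the analogy vector
```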
14
Doc2Vec Also used for NLP; an extension of word2vec
Computes numerical vector representations of documents instead of words
Documents can be: a short 140-character tweet, a single paragraph such as an abstract, or a full article 1 min
15
Semantic Analysis Word2vec model: 1 model
Input: list of keywords from each paper in the dataset
Doc2vec models: 7 models
Title, Abstract, Introduction, Related Work, Methodology, Conclusion, Full text (all 6 categories concatenated together)
Input: list of words from the respective sections for each paper in the dataset 1 min
16
Network Generation Want to verify that including content from the body produces better recommendations
Given pre-processed data and trained models, generate three networks using different combinations of models:
1. Title and abstract models
2. All 7 category-based models
3. All 8 models: the 7 category-based models plus the keyword model 1 min
17
Network Generation Iterate through the dataset, comparing papers pairwise
Use trained models to generate similarity scores between pairs of papers (cosine similarity)
Use scores to create edges in the three networks: if score > threshold, create an edge between the pair of papers
First network: score = average of scores from the title and abstract models
Second network: score = average of scores from the 7 category-based models
Third network: score = average of scores from all 8 models 2 min
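The pairwise edge-creation step might look like the following sketch, assuming the averaged similarity scores have already been computed; the paper names, scores, and the 0.5 threshold are all hypothetical:

```python
from itertools import combinations

# Hypothetical per-pair similarity scores, already averaged over the
# chosen model combination (title+abstract, 7 categories, or all 8).
scores = {
    ("paper_a", "paper_b"): 0.82,
    ("paper_a", "paper_c"): 0.31,
    ("paper_b", "paper_c"): 0.67,
}

def build_network(papers, score, threshold=0.5):
    """Add an undirected edge whenever the averaged score passes threshold."""
    adj = {p: set() for p in papers}
    for u, v in combinations(papers, 2):
        if score.get((u, v), score.get((v, u), 0.0)) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj

net = build_network(["paper_a", "paper_b", "paper_c"], scores)
print(sorted(net["paper_a"]))  # neighbors above the threshold
```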
18
Network Generation After all pairs have been checked, network generation is complete Result is three different network representations of similarities between papers in dataset 1 min
19
Recommendation Mechanism
Goal: Return top m papers from dataset most similar to input paper Given: Input paper from the user Trained similarity models Generated networks 1 min
20
Recommendation Mechanism
Use gensim [4] to obtain the paper most similar to the input paper; this paper becomes the top recommendation
Gather friends (network neighbors) of this paper as candidate recommendations
If there are fewer than m candidates, gather friends of friends
Continue gathering layers of friends until there are at least m candidates 1 min
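The layered gathering of friends is essentially a breadth-first expansion outward from the top recommendation. A sketch over a toy adjacency map (node names and edges are invented):

```python
def gather_candidates(adj, top_paper, m):
    """Collect friends, then friends of friends, and so on, of the top
    recommendation until at least m candidate papers are found."""
    seen = {top_paper}
    candidates = []
    frontier = [top_paper]
    while frontier and len(candidates) < m:
        # Next layer: all unseen neighbors of the current frontier.
        layer = sorted({n for p in frontier for n in adj[p]} - seen)
        seen.update(layer)
        candidates.extend(layer)
        frontier = layer
    return candidates

# Toy similarity network: "t" is the top recommendation.
adj = {"t": {"a", "b"}, "a": {"t", "c"}, "b": {"t"}, "c": {"a"}}
print(gather_candidates(adj, "t", 3))  # friends, then a friend of a friend
```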
21
Recommendation Mechanism
2 min
22
Recommendation Mechanism
Get similarity scores for candidate papers Similarity scoring same as pairwise process used to generate network edges: First network: score = average scores from title and abstract models Second network: score = average scores from 7 category-based models Third network: score = average scores from all 8 models Order candidate papers based on highest similarity Return top m papers as the list of recommendations to user 2 min
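Ranking and truncating the candidate list is then a simple sort on the averaged scores; a sketch with made-up papers and scores:

```python
def recommend(candidates, score, m):
    """Sort candidate papers by averaged similarity score, highest first,
    and return the top m as the recommendation list."""
    ranked = sorted(candidates, key=lambda p: score[p], reverse=True)
    return ranked[:m]

# Hypothetical averaged model scores for the gathered candidates.
scores = {"p1": 0.42, "p2": 0.88, "p3": 0.75, "p4": 0.60}
print(recommend(list(scores), scores, 3))  # top 3 by score
```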
23
Recommendation Mechanism
1 min
24
Experimental Evaluation
Aim to test approach in small, preliminary experiment Manually verify logical recommendations Serves as proof of concept Full-scale experiment (time permitting) Can perform further numerical analysis of recommender system Unable to verify recommendations manually 2 min
25
Dataset Preliminary experiment: Source code from 100 papers from arXiv
10 subareas within computer science Computer vision Robotics Machine learning Graphics Networking Computer security Operating systems Parallel computing Compiler theory Software engineering 10 papers per subarea 1 min
26
Dataset Full-scale experiment:
Specialized dataset: the hep-th portion of arXiv, 29,000 papers
General dataset: crawl arXiv to get papers from all major areas (physics, mathematics, computer science, quantitative finance, electrical engineering); 50,000 papers, 10,000 per area 1 min
27
Evaluation Metrics Want to determine which of the three generated networks provides the best recommendations
Preliminary experiment:
Manually examine the ranking from each network; recommendations should be in the same subarea
Compare the underlying scores for the top m papers from each network; more similar papers should produce better scores
Full-scale experiment:
Average scores for the top m papers
Average number of recommendations in the same area 2 min
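The two full-scale metrics can be sketched as follows (the recommended papers, their scores, and their subareas are invented for illustration):

```python
def eval_metrics(recs, scores, areas, query_area):
    """Average similarity of the top-m recommendations, and the fraction
    that fall in the same subarea as the query paper."""
    avg_score = sum(scores[p] for p in recs) / len(recs)
    same_area = sum(areas[p] == query_area for p in recs) / len(recs)
    return avg_score, same_area

recs = ["p2", "p3", "p4"]
scores = {"p2": 0.88, "p3": 0.75, "p4": 0.60}
areas = {"p2": "robotics", "p3": "robotics", "p4": "graphics"}
print(eval_metrics(recs, scores, areas, "robotics"))
```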
28
Conclusion
29
Conclusion Want to verify that utilizing content from the body of a paper generates better recommendations than only using the abstract and title
Proposed a novel technical paper recommender system:
Utilizes the full content of the paper
Combines deep learning methods with network science foundations
Generalizable, and consists of meaningful relations between papers
Connections between papers can be easily visualized
Uses semantic analysis to generate similarity models instead of relying on word counts 1 min
30
References
[1] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews,” in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186, ACM, 1994.
[2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender Systems Handbook, pp. 1–35, Springer, 2011.
[3] L. Lamport, LaTeX: A Document Preparation System: User’s Guide and Reference Manual. Addison-Wesley, 1994.
[4] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50, ELRA, May 2010.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean, L. Sutskever, and G. Zweig, “word2vec,” 2014.
[6] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.
31
Questions?