A Deep Learning Technical Paper Recommender System

A Deep Learning Technical Paper Recommender System By Janelle Blankenburg

Outline
- Introduction
  - Problem Description
  - Major Objectives
- Methodology
  - Semantic Analysis: Data Preprocessing, Model Training
  - Network Generation: Three Network Versions
  - Recommendation Mechanism: Finding and Ranking Similar Papers
  - Experimental Evaluation: Dataset, Evaluation Metrics
- Conclusion

Introduction

Problem Description
- Technical paper recommendation
- General process:
  - Identify keywords related to research interests
  - Input keywords into an online textual search service (Google Scholar, CiteSeer, arXiv)
  - Modify the keyword list and repeat

Problem Description (cont.)
- Simple keyword search is not sufficient
- Collaborative Filtering (CF) methods [1]: based on like-minded users
- Content-Based Filtering (CBF) methods [2]: based on previous “purchases”
- Content-Based Filtering issues:
  - Use only the title and abstract of a paper
  - Use word counts as the basis for similarity models

Major Objectives
- Verify that using content from the body of a paper can result in better recommendations than using only the title and abstract
- Develop a novel recommender approach that uses deep learning and network science fundamentals
  - Uses semantic analysis of text instead of word counts
  - Consists of meaningful relations between recommended papers
  - Visualizes relations through generated networks

Methodology

Semantic Analysis
- Simple word counts are not sufficient
- Use semantic analysis to extract meaning from the full text of a paper
- Compare content across papers to obtain similarity

Data Pre-Processing
- Given: source code of a set of papers written in LaTeX [3]
- Parse the source code to extract the following:
  - Title
  - Abstract
  - Other sections, via “\section”
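The parsing step could look roughly like the sketch below. It assumes the title is given with \title{...}, the abstract with an abstract environment, and that only top-level \section commands matter. The function name parse_latex and the toy source are hypothetical, not taken from the presentation.

```python
import re

def parse_latex(source):
    """Extract the title, abstract, and top-level sections from LaTeX source."""
    title = re.search(r"\\title\{([^{}]*)\}", source)
    abstract = re.search(r"\\begin\{abstract\}(.*?)\\end\{abstract\}", source, re.DOTALL)
    # Splitting on \section{...} with one capture group yields
    # [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"\\section\*?\{([^{}]*)\}", source)
    sections = []
    for name, body in zip(parts[1::2], parts[2::2]):
        body = body.split(r"\end{document}")[0]  # drop the document trailer
        sections.append((name.strip(), body.strip()))
    return {
        "title": title.group(1).strip() if title else "",
        "abstract": abstract.group(1).strip() if abstract else "",
        "sections": sections,  # list of (section name, section text)
    }

# Hypothetical toy input, for illustration only.
example = r"""
\documentclass{article}
\title{A Toy Paper}
\begin{document}
\maketitle
\begin{abstract}
We study toys.
\end{abstract}
\section{Introduction}
Toys matter.
\section{Our Approach}
We build one.
\section{Conclusion}
It works.
\end{document}
"""
parsed = parse_latex(example)
```

Real arXiv sources are messier (macros, \input files, nested braces in titles), so a regular-expression pass like this would only be a starting point.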

Data Pre-Processing
- Separate the other sections into the following categories:
  - Introduction (trivial to identify)
  - Related Work/Background (if both exist, combine them)
  - Conclusion/Future Work
  - Methodology (all remaining sections are assumed to be methodology)
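One way to express this categorization is a keyword match on the section name with methodology as the fallback. The matching rules and the names categorize and group_sections below are assumptions for illustration; the presentation does not say how section names were matched.

```python
def categorize(section_name):
    """Map a section name to one of the four body categories."""
    name = section_name.lower()
    if "introduction" in name:
        return "introduction"
    if "related" in name or "background" in name:
        return "related_work"
    if "conclusion" in name or "future work" in name:
        return "conclusion"
    return "methodology"  # remaining sections are assumed to be methodology

def group_sections(sections):
    """Combine section texts that fall into the same category."""
    grouped = {"introduction": [], "related_work": [], "methodology": [], "conclusion": []}
    for name, body in sections:
        grouped[categorize(name)].append(body)
    return {category: "\n".join(bodies) for category, bodies in grouped.items()}

# Hypothetical section list, for illustration only.
sections = [
    ("Introduction", "a"),
    ("Background", "b"),
    ("Related Work", "c"),
    ("Our Model", "d"),
    ("Experiments", "e"),
    ("Conclusion and Future Work", "f"),
]
grouped = group_sections(sections)
```

Here "Background" and "Related Work" are combined into one category, and "Our Model" and "Experiments" both fall through to methodology.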

Data Pre-Processing
- Final categories: Title, Abstract, Introduction, Related Work, Methodology, Conclusion
- Pre-processing through gensim [4]:
  - Phrase extraction
    - “machine” and “learning” ⇒ “machine_learning”
    - Output: list of keywords for each paper
  - Text cleaning
    - Tokenize text into words, remove punctuation, lowercase letters, etc.
    - Output: list of words in sequential order from each category of each paper
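The presentation performs both steps with gensim. The sketch below is a standard-library stand-in, not gensim's implementation: it merges adjacent word pairs by raw co-occurrence count only, which is a simplification, and the names clean and merge_phrases and the min_count value are hypothetical.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def merge_phrases(docs, min_count=2):
    """Join adjacent word pairs that occur at least min_count times in the corpus."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}
    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

# Hypothetical two-sentence corpus, for illustration only.
docs = [clean("Machine learning is fun."), clean("We use machine learning!")]
merged = merge_phrases(docs)
```

In this toy corpus only the pair ("machine", "learning") occurs twice, so it is the only pair joined into "machine_learning".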

Semantic Analysis
- Analyze aspects of the text through two approaches:
  - Word2vec [5]
  - Doc2vec [6]

Word2Vec
- Natural Language Processing (NLP) tool
- Computes numerical vector representations of words
- Allows us to use numerical metrics to perform similarity comparisons between sets of words
- Example: king − man + woman ≈ queen
- Source: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
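The analogy can be made concrete with a few hand-made vectors. The three-dimensional values below are invented for illustration and are not output from a trained word2vec model. The sketch adds and subtracts the vectors, then picks the remaining word whose vector has the highest cosine similarity to the result.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vectors (hypothetical values, not learned embeddings).
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(positive_a, negative, positive_b):
    """Return the word closest to positive_a - negative + positive_b."""
    target = [
        a - n + b
        for a, n, b in zip(vectors[positive_a], vectors[negative], vectors[positive_b])
    ]
    query_words = {positive_a, negative, positive_b}
    candidates = [word for word in vectors if word not in query_words]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

result = analogy("king", "man", "woman")
```

With these toy values the closest remaining word is "queen". Excluding the three query words from the candidates is a common convention for analogy queries.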

Doc2Vec
- Also used for NLP
- Extension of word2vec
- Computes numerical vector representations of documents instead of words
- Documents can be:
  - A short 140-character tweet
  - A single paragraph, such as an abstract
  - A full article
- Source: http://gensim.narkive.com/RavqZorK/gensim-4914-graphic-representations-of-word2vec-and-doc2vec

Semantic Analysis
- Word2vec model: 1 model
  - Input: list of keywords from each paper in the dataset
- Doc2vec models: 7 models
  - Title, Abstract, Introduction, Related Work, Methodology, Conclusion
  - Full text: all 6 categories concatenated together
  - Input: list of words from the respective section of each paper in the dataset

Network Generation
- Want to verify that including content from the body produces better recommendations
- Given: pre-processed data and trained models
- Generate three networks using different combinations of models:
  - Title and abstract models
  - All 7 category-based models
  - All 8 models: the 7 category-based models plus the keyword model

Network Generation
- Iterate through the dataset, comparing papers pairwise
- Use the trained models to generate similarity scores between pairs of papers (cosine similarity)
- Use the scores to create edges in the three networks:
  - If score > threshold, create an edge between the pair of papers
  - First network: score = average of the scores from the title and abstract models
  - Second network: score = average of the scores from the 7 category-based models
  - Third network: score = average of the scores from all 8 models
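The pairwise edge construction can be sketched as follows. It assumes each paper is already represented by one vector per model. The data layout, the function names, the toy vectors, and the threshold value of 0.8 are all assumptions, since the presentation does not state them.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_score(paper_a, paper_b, model_names):
    """Average the per-model cosine similarities for one pair of papers."""
    return sum(cosine(paper_a[m], paper_b[m]) for m in model_names) / len(model_names)

def build_network(papers, model_names, threshold):
    """Return an adjacency map with an edge wherever the averaged score exceeds the threshold."""
    edges = {paper_id: set() for paper_id in papers}
    for a, b in combinations(papers, 2):
        if average_score(papers[a], papers[b], model_names) > threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Hypothetical vectors: paper id -> model name -> document vector.
papers = {
    "p1": {"title": [1.0, 0.0], "abstract": [1.0, 0.1]},
    "p2": {"title": [0.9, 0.1], "abstract": [0.8, 0.2]},
    "p3": {"title": [0.0, 1.0], "abstract": [0.1, 1.0]},
}
network = build_network(papers, ["title", "abstract"], threshold=0.8)
```

Passing a different list of model names (the 7 category-based models, or all 8) would produce the second and third networks from the same function.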

Network Generation
- After all pairs have been checked, network generation is complete
- The result is three different network representations of the similarities between papers in the dataset

Recommendation Mechanism
- Goal: return the top m papers from the dataset most similar to the input paper
- Given:
  - Input paper from the user
  - Trained similarity models
  - Generated networks

Recommendation Mechanism
- Use gensim [4] to obtain the paper in the dataset most similar to the input paper
  - This paper becomes the top recommendation
- Gather the friends (network neighbors) of this paper as candidate recommendations
  - If there are fewer than m candidates, gather friends of friends
  - Continue gathering layers of friends until there are at least m candidates
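Gathering layers of friends amounts to a breadth-first expansion from the top recommendation. The sketch below assumes the network is an adjacency map from paper id to a set of neighbor ids. The function name gather_candidates and the toy network are hypothetical, and whether the starting paper itself counts toward m is not stated in the slides (here it does not).

```python
def gather_candidates(network, start, m):
    """Collect whole layers of neighbors around start until at least m candidates exist."""
    seen = {start}
    frontier = [start]
    candidates = []
    while frontier and len(candidates) < m:
        next_layer = []
        for node in frontier:
            for friend in sorted(network.get(node, ())):
                if friend not in seen:
                    seen.add(friend)
                    next_layer.append(friend)
        candidates.extend(next_layer)  # add the complete layer, even if it overshoots m
        frontier = next_layer
    return candidates

# Hypothetical network: a is linked to b and c, b to d, d to e.
network = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a"},
    "d": {"b", "e"},
    "e": {"d"},
}
two = gather_candidates(network, "a", 2)
three = gather_candidates(network, "a", 3)
everything = gather_candidates(network, "a", 10)
```

The loop also stops when a layer comes back empty, so a small connected component can yield fewer than m candidates.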

Recommendation Mechanism
- Get similarity scores for the candidate papers
- Similarity scoring is the same as the pairwise process used to generate network edges:
  - First network: score = average of the scores from the title and abstract models
  - Second network: score = average of the scores from the 7 category-based models
  - Third network: score = average of the scores from all 8 models
- Order the candidate papers by highest similarity
- Return the top m papers as the list of recommendations to the user
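The final ranking step can be sketched as scoring each candidate against the input paper and keeping the m best. As before, the per-model vector layout, the function names, and the toy values are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(input_vectors, candidates, model_names, m):
    """Score candidates by averaged cosine similarity and return the top m as (id, score)."""
    scored = []
    for paper_id, vectors in candidates.items():
        score = sum(
            cosine(input_vectors[name], vectors[name]) for name in model_names
        ) / len(model_names)
        scored.append((paper_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)  # highest similarity first
    return scored[:m]

# Hypothetical input paper and candidate vectors.
input_vectors = {"title": [1.0, 0.0], "abstract": [1.0, 0.0]}
candidates = {
    "p1": {"title": [1.0, 0.0], "abstract": [1.0, 0.0]},
    "p2": {"title": [1.0, 1.0], "abstract": [1.0, 0.0]},
    "p3": {"title": [0.0, 1.0], "abstract": [0.0, 1.0]},
}
top = rank_candidates(input_vectors, candidates, ["title", "abstract"], m=2)
```

In this toy case p1 matches the input exactly, p2 matches on the abstract only in part of the title direction, and p3 is orthogonal, so the top two are p1 then p2.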

Experimental Evaluation
- Aim to test the approach in a small, preliminary experiment
  - Manually verify that the recommendations are logical
  - Serves as a proof of concept
- Full-scale experiment (time permitting)
  - Can perform further numerical analysis of the recommender system
  - Unable to verify recommendations manually

Dataset
- Preliminary experiment: source code of 100 papers from arXiv
- 10 subareas within computer science, 10 papers per subarea:
  - Computer vision
  - Robotics
  - Machine learning
  - Graphics
  - Networking
  - Computer security
  - Operating systems
  - Parallel computing
  - Compiler theory
  - Software engineering

Dataset
- Full-scale experiment:
  - Specialized dataset: hep-th portion of arXiv from 1992-2003, 29,000 papers
  - General dataset: crawl arXiv to get papers from all major areas (physics, mathematics, computer science, quantitative finance, electrical engineering), 50,000 papers, 10,000 per area

Evaluation Metrics
- Want to determine which of the three generated networks provides the best recommendations
- Preliminary experiment:
  - Manually examine the ranking from each network
    - Recommendations should be in the same subarea as the input paper
  - Compare the underlying scores for the top m papers from each network
    - More similar papers should produce better scores
- Full-scale experiment:
  - Average scores for the top m papers
  - Average number of recommendations in the same area
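The two full-scale metrics can be sketched as simple averages over a set of queries. The data layout, a list of (query area, recommendations) pairs with each recommendation given as (paper id, score), and all names and values below are assumptions for illustration.

```python
from statistics import mean

def average_score(recommendations):
    """Mean similarity score of the recommended papers for one query."""
    return mean(score for _, score in recommendations)

def same_area_count(recommendations, area_of, query_area):
    """Number of recommended papers that share the query paper's area."""
    return sum(1 for paper_id, _ in recommendations if area_of[paper_id] == query_area)

def evaluate(results, area_of):
    """Average both metrics over all queries run against one network."""
    return {
        "avg_score": mean(average_score(recs) for _, recs in results),
        "avg_same_area": mean(
            same_area_count(recs, area_of, area) for area, recs in results
        ),
    }

# Hypothetical areas and top-2 recommendation lists for two queries.
area_of = {"p1": "robotics", "p2": "robotics", "p3": "graphics", "p4": "graphics"}
results = [
    ("robotics", [("p1", 0.9), ("p3", 0.5)]),
    ("graphics", [("p3", 0.8), ("p4", 0.6)]),
]
summary = evaluate(results, area_of)
```

Running evaluate once per network would give three summaries that can be compared directly.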

Conclusion

Conclusion
- Want to verify that using content from the body of a paper generates better recommendations than using only the abstract and title
- Proposed a novel technical paper recommender system that:
  - Uses the full content of the paper
  - Combines deep learning methods with network science foundations
  - Is generalizable and consists of meaningful relations between papers
  - Allows connections between papers to be easily visualized
  - Uses semantic analysis to generate similarity models instead of relying on word counts

References
[1] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews,” in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186, ACM, 1994.
[2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender Systems Handbook, pp. 1–35, Springer, 2011.
[3] L. Lamport, LaTeX: A Document Preparation System: User’s Guide and Reference Manual. Addison-Wesley, 1994.
[4] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, (Valletta, Malta), pp. 45–50, ELRA, May 2010. http://is.muni.cz/publication/884893/en.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean, I. Sutskever, and G. Zweig, “word2vec,” 2014.
[6] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.

Questions?