Resource Recommendation for AAN

Similar presentations
Information Retrieval in Practice

Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Recommender systems Ram Akella November 26 th 2008.
Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..
Chapter 5: Information Retrieval and Web Search
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Chapter 6: Information Retrieval and Web Search
Unsupervised Learning of Visual Sense Models for Polysemous Words Kate Saenko Trevor Darrell Deepak.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Concept-Based Analysis of Scientific Literature Chen-Tse Tsai, Gourab Kundu, Dan Roth UIUC.
Collaborative Deep Learning for Recommender Systems
SIMS 202, Marti Hearst Content Analysis Prof. Marti Hearst SIMS 202, Lecture 15.
Sparse Coding: A Deep Learning using Unlabeled Data for High - Level Representation Dr.G.M.Nasira R. Vidya R. P. Jaia Priyankka.
Information Retrieval in Practice
Fill-in-The-Blank Using Sum Product Network
Distributed Representations for Natural Language Processing
Topic Modeling for Short Texts with Auxiliary Word Embeddings
Dimensionality Reduction and Principle Components Analysis
Queensland University of Technology
Recommendation in Scholarly Big Data
Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Deep Learning for Bacteria Event Identification
Search Engine Architecture
Text Mining CSC 600: Data Mining Class 20.
DATA MINING © Prentice Hall.
An Additive Latent Feature Model
A Deep Learning Technical Paper Recommender System
Intro to NLP and Deep Learning
Intro to NLP and Deep Learning
Zhe Ye Word2vec Tutorial Zhe Ye
Intelligent Information System Lab
Natural Language Processing of Knee MRI Reports
Clustering tweets and webpages
Efficient Estimation of Word Representation in Vector Space
Multi-Dimensional Data Visualization
Final Presentation: Neural Network Doc Summarization
Exploring Matching Networks for Cross Language Information Retrieval
Matching Words with Pictures
Word Embedding Word2Vec.
Topic models for corpora and for graphs
TutorialBank: Using a Manually-Collected Corpus for Prerequisite Chains, Survey Extraction and Resource Recommendation Alexander R. Fabbri, Irene Li, Prawat.
prerequisite chain learning and the introduction of LectureBank
Panagiotis G. Ipeirotis Luis Gravano
Prepared by: Mahmoud Rafeek Al-Farra
Unsupervised Pretraining for Semantic Parsing
Text Mining CSC 576: Data Mining.
Topic Models in Text Processing
Text Categorization Berlin Chen 2003 Reference:
Using Multilingual Neural Re-ranking Models for Low Resource Target Languages in Cross-lingual Document Detection Using Multilingual Neural Re-ranking.
Dennis Zhao,1 Dragomir Radev PhD1 LILY Lab
Relevance and Reinforcement in Interactive Browsing
Deep Learning for the Soft Cutoff Problem
EVA 2.50 Software and Input Values Our paper illustrates the application of a neural network using the Windows version of C#. The training and tests.
From Unstructured Text to StructureD Data
Topic: Semantic Text Mining
Modeling IDS using hybrid intelligent systems
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Pose Estimation in hockey videos using convolutional neural networks
GhostLink: Latent Network Inference for Influence-aware Recommendation
Vector Representation of Text
Implementation Plan system integration required for each iteration
Presentation transcript:

Resource Recommendation for AAN
Robert Tung, Alexander R. Fabbri, Irene Li
Advisor: Professor Dragomir Radev

Outline

Outline
Problem Description
Quick Background
Methods
Results
Future Research
Appendix

Problem Description

Problem Description
User inputs the title and abstract of a new project
Related previous literature often uses simple queries
Recommend resources from the AAN corpus
Related previous literature largely uses papers

The user of this work may input the title and abstract of some new project, and the work in this paper will recommend resources from the AAN corpus to help the user prepare for said project. There currently exists extensive literature in the realm of recommending scientific papers, such as Tang and McCalla's paper "On the Pedagogically Guided Paper Recommendation for an Evolving Web-Based Learning System", in which they base a system on users' learned behaviors and the pedagogical functions of papers, among other factors. However, much less literature exists regarding resources such as those in AAN, including tutorials and corpora, among the other types listed later in this paper. Additionally, recommending scientific papers specifically may make the recommendations largely inaccessible to a beginner in the relevant topics. For this reason, we focus more generally on recommending resources (e.g., tutorials, links, presentations, etc.).

Quick Background

Quick Background (cont'd.)
Pedagogical Value of Resources (Corpus, Lecture, Library, NACLO, Paper, Resource, Survey, Tutorial)
Pre-requisite Chains
Reading List Generation
LDA
Doc2Vec

Sheng et al. determine the pedagogical value of resources in "An Investigation into the Pedagogical Features of Documents". A separate but closely related AAN project deals with pre-requisite chains, which seek a graph representation of the topics in a corpus and the dependency relationships between them. Pre-requisite chains are not used in this project's results, but they are closely related to determining the topic structure; see Pan et al.'s "Prerequisite Relation Learning for Concepts in MOOCs". For reading list generation, Jardine does this for technical papers with ThemedPageRank, and Gordon et al. generate a concept graph and then traverse it to recommend resources for each topic node. LDA, or Latent Dirichlet Allocation, is an NLP technique that allows documents to be explained by multiple unobserved topics. It assumes that each document is a mixture of these topics and that the document is generated probabilistically from them. Doc2Vec is an extension of Word2Vec that creates vector representations of arbitrarily sized documents.
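
As a small illustration of the generative assumption behind LDA described above, the sketch below samples a single word from a toy two-topic model. The topic names, vocabulary, and probabilities are invented for illustration and are not taken from any AAN model.

```python
# Toy sketch of the LDA generative assumption: a document is a mixture of
# topics, and each topic is a distribution over words. All values below are
# made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Each topic is a probability distribution over the vocabulary.
vocab = ["parsing", "embedding", "corpus", "neural", "retrieval"]
topics = {
    "syntax":        np.array([0.60, 0.05, 0.20, 0.10, 0.05]),
    "deep_learning": np.array([0.05, 0.40, 0.10, 0.40, 0.05]),
}

# Each document is a mixture over topics (its topic distribution).
doc_topic_mixture = {"syntax": 0.3, "deep_learning": 0.7}

# Generating one word: pick a topic from the mixture, then a word from that topic.
topic_names = list(topics)
topic_probs = [doc_topic_mixture[t] for t in topic_names]
chosen_topic = rng.choice(topic_names, p=topic_probs)
word = rng.choice(vocab, p=topics[chosen_topic])
print(chosen_topic, word)
```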

Methods

Methods: Baseline Implementation with LDA and Doc2Vec
LDA: ran unsupervised with 60 topics
Doc2Vec: ran for 10 epochs
Used each to recommend resources for 10 random papers
From a corpus of ~1,500 resources
LDA classifies the topic and recommends from that topic
Doc2Vec finds the most similar documents
5 annotators score the results

We began by constructing a baseline implementation of resource recommendation using each of LDA and Doc2Vec on its own. For this part, we tested on a set of 1,480 files with a vocabulary size of 191,446 and a token count of 9,134,452. For each file, tokens were stripped of stop words and stemmed. Additionally, any word with an overall count of less than five across the entire corpus was dropped. LDA was applied in an unsupervised manner and used 60 topics over 300 iterations, as displayed in Figure 2. Our Doc2Vec model used a hidden dimension size of 300, a window size of 10, and a constant learning rate of 0.025. The model was trained for 10 epochs [AAN].
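
The sketch below shows how this baseline setup could be reproduced with gensim (4.x-style API) and NLTK. The function names and the way the raw resource texts are supplied are assumptions; only the hyperparameters (minimum word count of five, 60 LDA topics over 300 iterations, 300-dimensional Doc2Vec vectors, window size 10, learning rate 0.025, 10 epochs) follow the numbers above.

```python
# Sketch only: mirrors the baseline preprocessing and models described above.
# How the raw resource texts are loaded is left to the caller and is assumed.
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    # Strip stop words, lowercase, and stem, as in the baseline preprocessing.
    return [stemmer.stem(tok) for tok in remove_stopwords(text.lower()).split()]

def build_baseline_models(raw_texts):
    """Train the LDA and Doc2Vec baselines on a list of plain-text resources."""
    docs = [tokenize(t) for t in raw_texts]

    # LDA: drop words occurring fewer than 5 times, fit 60 topics, 300 iterations.
    dictionary = corpora.Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=1.0)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(bow, num_topics=60, id2word=dictionary, iterations=300)

    # Doc2Vec: 300-dim vectors, window 10, constant learning rate 0.025, 10 epochs.
    tagged = [TaggedDocument(d, [i]) for i, d in enumerate(docs)]
    d2v = Doc2Vec(vector_size=300, window=10, alpha=0.025, min_alpha=0.025,
                  min_count=5, epochs=10)
    d2v.build_vocab(tagged)
    d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
    return dictionary, lda, d2v

def lda_dominant_topic(query_text, dictionary, lda):
    """LDA-style recommendation starts from the query's dominant topic."""
    bow = dictionary.doc2bow(tokenize(query_text))
    return max(lda.get_document_topics(bow), key=lambda t: t[1])[0]

def doc2vec_recommend(query_text, d2v, topn=5):
    """Doc2Vec-style recommendation: return the most similar resources."""
    vec = d2v.infer_vector(tokenize(query_text))
    return d2v.dv.most_similar([vec], topn=topn)
```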

Methods: Deep Learning Approach
Built a network for each project/resource pair
Uses the topic embedding of each, the document embedding of the title and text of each, and similarity scores
Assumes the document and topic embedding models are constant
Neural network on the next slide...

Neural Network Architecture (diagram): for each pair of documents, the inputs are the LDA topic distributions of Doc 1 and Doc 2; the similarity scores (LDA cosine similarity and Doc2Vec cosine similarity); and the Doc2Vec embeddings of the title of Doc 1, the abstract of Doc 1, the title of Doc 2, and the text of Doc 2. These inputs are concatenated and passed through a hidden layer, and the output is a score of whether resource 2 is helpful for resource 1.

Note that this neural network takes in features regarding a pair of resources and returns a score representing whether resource 2 is helpful for resource/project 1. Note also that each time the network is trained and tested, it assumes a constant LDA model, Doc2Vec model, and corpus from which these models are created. Changing these models would inherently change the embeddings and similarity representations, and thus require retraining the network as well. The inputs the neural network takes in are: the topic embedding of each resource, an array of similarity scores from different metrics, and the embeddings of the title and text of each resource. All of the dense layers use tanh as the activation function. For our training and testing, we use LDA for the topic embeddings, Doc2Vec for the document embeddings, and the cosine similarities of the LDA and Doc2Vec representations for the similarity scores. However, note that the neural network code in this project can be reused with any topic embedding metric, any similarity scores, and any embeddings of the titles and texts of the resources, by just changing the preprocessing code.
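
Below is a minimal sketch of such a pairwise scoring network in PyTorch. The framework, the hidden layer size (128), and the tanh applied to the output are assumptions for illustration; only the input sizes (60 LDA topics, 300-dimensional Doc2Vec vectors, two similarity scores) and the tanh activation on the dense layers come from the description above.

```python
# Sketch of a pairwise scoring network: concatenate topic distributions,
# similarity scores, and Doc2Vec embeddings, then score the pair.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, n_topics=60, emb_dim=300, n_sims=2, hidden=128):
        super().__init__()
        # Two topic distributions, the similarity scores, and four Doc2Vec
        # embeddings (title/abstract of doc 1, title/text of doc 2).
        in_dim = 2 * n_topics + n_sims + 4 * emb_dim
        self.hidden = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, topics1, topics2, sims, title1, abstract1, title2, text2):
        x = torch.cat([topics1, topics2, sims, title1, abstract1, title2, text2], dim=-1)
        h = torch.tanh(self.hidden(x))
        # tanh on the output keeps the score in [-1, 1], matching the +1/-1
        # labels used for evaluation; this detail is an assumption.
        return torch.tanh(self.out(h))

# One project/resource pair with random stand-in features, just to show shapes.
scorer = PairScorer()
features = [torch.rand(1, 60), torch.rand(1, 60), torch.rand(1, 2)] + [torch.rand(1, 300)] * 4
print(scorer(*features))
```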

Results

Results: Baseline Implementation with LDA and Doc2Vec
Doc2Vec on the left, LDA on the right; the resources appear to be clustered well.

Results: Recommendation with Baseline Implementation
Both can be improved, but LDA is better overall (0.45 avg. vs. 0.34)
LDA is better on cases 5 and 6 (well-defined topics)
Doc2Vec is better on cases 2 and 8 (mix of topics)

Results: Deep Learning Approach
Tested on a human-annotated corpus
Used annotations from 10 projects, as before
Weighted score based on human annotations (1 positive, -1 negative); 0 for all other pairs
Evaluated based on whether the output had the same sign as the human annotation
73.79% accuracy with 5 epochs, 74.76% accuracy with 10 epochs
Much better than the baseline accuracy of 51.46%
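
A minimal sketch of the sign-based evaluation described above, assuming it is computed only over the human-annotated pairs; the function name and the example arrays are illustrative, not the project's actual data.

```python
# A prediction counts as correct when it has the same sign as the +1/-1 label.
import numpy as np

def sign_accuracy(predictions, annotations):
    """Fraction of annotated pairs where the predicted score matches the label's sign."""
    predictions = np.asarray(predictions)
    annotations = np.asarray(annotations)
    return float(np.mean(np.sign(predictions) == np.sign(annotations)))

print(sign_accuracy([0.8, -0.3, 0.1, -0.9], [1, -1, -1, -1]))  # -> 0.75
```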

Future Research

Future Research
Other topic and document embeddings
Network architecture
Other approaches
Other evaluation metrics

A future project could determine whether LDA and Doc2Vec are indeed the best choices for topic and document embeddings; pre-requisite chains may also be used. The network may improve performance with a different structure. Other approaches may be taken, such as query modeling. Other evaluation metrics may more accurately fit the problem.

Appendix

Appendix
All code for this project can be found at: https://github.com/IreneZihuiLi/aan_rec
All data sets are extracted from the AAN corpus.

Thanks!
Professor Dragomir Radev
Alex Fabbri
Irene Li
Everyone in the LILY Lab
Professor John Wettlaufer