Resource Recommendation for AAN

Similar presentations
Information Retrieval in Practice

Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Recommender systems Ram Akella November 26 th 2008.
Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..
Chapter 5: Information Retrieval and Web Search
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Chapter 6: Information Retrieval and Web Search
Unsupervised Learning of Visual Sense Models for Polysemous Words Kate Saenko Trevor Darrell Deepak.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Concept-Based Analysis of Scientific Literature Chen-Tse Tsai, Gourab Kundu, Dan Roth UIUC.
Collaborative Deep Learning for Recommender Systems
SIMS 202, Marti Hearst Content Analysis Prof. Marti Hearst SIMS 202, Lecture 15.
Sparse Coding: A Deep Learning using Unlabeled Data for High - Level Representation Dr.G.M.Nasira R. Vidya R. P. Jaia Priyankka.
Information Retrieval in Practice
Fill-in-The-Blank Using Sum Product Network
Distributed Representations for Natural Language Processing
Topic Modeling for Short Texts with Auxiliary Word Embeddings
Dimensionality Reduction and Principle Components Analysis
Queensland University of Technology
Recommendation in Scholarly Big Data
Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Deep Learning for Bacteria Event Identification
Search Engine Architecture
Text Mining CSC 600: Data Mining Class 20.
DATA MINING © Prentice Hall.
An Additive Latent Feature Model
A Deep Learning Technical Paper Recommender System
Intro to NLP and Deep Learning
Intro to NLP and Deep Learning
Zhe Ye Word2vec Tutorial Zhe Ye
Intelligent Information System Lab
Natural Language Processing of Knee MRI Reports
Clustering tweets and webpages
Efficient Estimation of Word Representation in Vector Space
Multi-Dimensional Data Visualization
Final Presentation: Neural Network Doc Summarization
Exploring Matching Networks for Cross Language Information Retrieval
Matching Words with Pictures
Word Embedding Word2Vec.
Topic models for corpora and for graphs
TutorialBank: Using a Manually-Collected Corpus for Prerequisite Chains, Survey Extraction and Resource Recommendation Alexander R. Fabbri, Irene Li, Prawat.
prerequisite chain learning and the introduction of LectureBank
Panagiotis G. Ipeirotis Luis Gravano
Prepared by: Mahmoud Rafeek Al-Farra
Unsupervised Pretraining for Semantic Parsing
Text Mining CSC 576: Data Mining.
Topic Models in Text Processing
Text Categorization Berlin Chen 2003 Reference:
Using Multilingual Neural Re-ranking Models for Low Resource Target Languages in Cross-lingual Document Detection Using Multilingual Neural Re-ranking.
Dennis Zhao,1 Dragomir Radev PhD1 LILY Lab
Relevance and Reinforcement in Interactive Browsing
Deep Learning for the Soft Cutoff Problem
EVA 2.50 Software and Input Values Our paper illustrates the application of a neural network using the Windows version of C#. The training and tests.
From Unstructured Text to StructureD Data
Topic: Semantic Text Mining
Modeling IDS using hybrid intelligent systems
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Pose Estimation in hockey videos using convolutional neural networks
GhostLink: Latent Network Inference for Influence-aware Recommendation
Vector Representation of Text
Implementation Plan system integration required for each iteration
Presentation transcript:

Resource Recommendation for AAN
Robert Tung, Alexander R. Fabbri, Irene Li
Advisor: Professor Dragomir Radev

Outline

Outline
Problem Description
Quick Background
Methods
Results
Future Research
Appendix

Problem Description

Problem Description
User inputs the title and abstract of a new project
Related previous literature often uses simple queries
Recommend resources from the AAN corpus
Related previous literature largely uses papers

The user of this work may input the title and abstract of some new project, and the work in this paper will recommend resources from the AAN corpus to help the user prepare for said project. There currently exists extensive literature in the realm of recommending scientific papers, such as Tang and McCalla's paper "On the Pedagogically Guided Paper Recommendation for an Evolving Web-Based Learning System", in which they base a system on users' learned behaviors and the pedagogical functions of papers, among other factors. However, much less literature exists regarding resources such as those in AAN, including tutorials and corpora, among the other types listed later in this paper. Additionally, recommending scientific papers specifically may make the recommendations largely inaccessible to a beginner in the relevant topics. For this reason, we focus more generally on recommending resources (e.g., tutorials, links, presentations, etc.).

Quick Background

Quick Background (cont'd.)
Pedagogical Value of Resources (Corpus, Lecture, Library, NACLO, Paper, Resource, Survey, Tutorial)
Pre-requisite Chains
Reading List Generation
LDA
Doc2Vec

Sheng et al. determine the pedagogical value of resources in "An Investigation into the Pedagogical Features of Documents". A separate but closely related AAN project deals with pre-requisite chains, which seek a graph representation of the topics in a corpus and the dependency relationships between them. Pre-requisite chains are not used in this project's results, but they are closely related to determining the topic structure; see Pan et al.'s "Prerequisite Relation Learning for Concepts in MOOCs". For reading list generation, Jardine does this for technical papers with ThemedPageRank, and Gordon et al. generate a concept graph and then traverse it to recommend resources for each topic node. LDA, or Latent Dirichlet Allocation, is an NLP technique that allows documents to be explained by multiple unobserved topics. It assumes that each document is a mixture of these topics and that the document is generated probabilistically from them. Doc2Vec is an extension of Word2Vec that creates vector representations of arbitrarily sized documents.
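
As a small illustration of the generative assumption behind LDA described above, the sketch below samples a single word from a toy two-topic model. The topic names, vocabulary, and probabilities are invented for illustration and are not taken from any AAN model.

```python
# Toy sketch of the LDA generative assumption: a document is a mixture of
# topics, and each topic is a distribution over words. All values below are
# made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Each topic is a probability distribution over the vocabulary.
vocab = ["parsing", "embedding", "corpus", "neural", "retrieval"]
topics = {
    "syntax":        np.array([0.60, 0.05, 0.20, 0.10, 0.05]),
    "deep_learning": np.array([0.05, 0.40, 0.10, 0.40, 0.05]),
}

# Each document is a mixture over topics (its topic distribution).
doc_topic_mixture = {"syntax": 0.3, "deep_learning": 0.7}

# Generating one word: pick a topic from the mixture, then a word from that topic.
topic_names = list(topics)
topic_probs = [doc_topic_mixture[t] for t in topic_names]
chosen_topic = rng.choice(topic_names, p=topic_probs)
word = rng.choice(vocab, p=topics[chosen_topic])
print(chosen_topic, word)
```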

Methods

Methods: Baseline Implementation with LDA and Doc2Vec
LDA: ran unsupervised with 60 topics
Doc2Vec: ran for 10 epochs
Used each to recommend resources for 10 random papers
From a corpus of ~1,500 resources
LDA classifies the topic and recommends from that topic
Doc2Vec finds the most similar documents
5 annotators score the results

We began by constructing a baseline implementation of resource recommendation using each of LDA and Doc2Vec on its own. For this part, we tested on a set of 1,480 files with a vocabulary size of 191,446 and a token count of 9,134,452. For each file, tokens were stripped of stop words and stemmed. Additionally, any word with an overall count of less than five across the entire corpus was dropped. LDA was applied in an unsupervised manner and used 60 topics over 300 iterations, as displayed in Figure 2. Our Doc2Vec model used a hidden dimension size of 300, a window size of 10, and a constant learning rate of 0.025. The model was trained for 10 epochs [AAN].
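
The sketch below shows how this baseline setup could be reproduced with gensim (4.x-style API) and NLTK. The function names and the way the raw resource texts are supplied are assumptions; only the hyperparameters (minimum word count of five, 60 LDA topics over 300 iterations, 300-dimensional Doc2Vec vectors, window size 10, learning rate 0.025, 10 epochs) follow the numbers above.

```python
# Sketch only: mirrors the baseline preprocessing and models described above.
# How the raw resource texts are loaded is left to the caller and is assumed.
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    # Strip stop words, lowercase, and stem, as in the baseline preprocessing.
    return [stemmer.stem(tok) for tok in remove_stopwords(text.lower()).split()]

def build_baseline_models(raw_texts):
    """Train the LDA and Doc2Vec baselines on a list of plain-text resources."""
    docs = [tokenize(t) for t in raw_texts]

    # LDA: drop words occurring fewer than 5 times, fit 60 topics, 300 iterations.
    dictionary = corpora.Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=1.0)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(bow, num_topics=60, id2word=dictionary, iterations=300)

    # Doc2Vec: 300-dim vectors, window 10, constant learning rate 0.025, 10 epochs.
    tagged = [TaggedDocument(d, [i]) for i, d in enumerate(docs)]
    d2v = Doc2Vec(vector_size=300, window=10, alpha=0.025, min_alpha=0.025,
                  min_count=5, epochs=10)
    d2v.build_vocab(tagged)
    d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
    return dictionary, lda, d2v

def lda_dominant_topic(query_text, dictionary, lda):
    """LDA-style recommendation starts from the query's dominant topic."""
    bow = dictionary.doc2bow(tokenize(query_text))
    return max(lda.get_document_topics(bow), key=lambda t: t[1])[0]

def doc2vec_recommend(query_text, d2v, topn=5):
    """Doc2Vec-style recommendation: return the most similar resources."""
    vec = d2v.infer_vector(tokenize(query_text))
    return d2v.dv.most_similar([vec], topn=topn)
```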

Methods: Deep Learning Approach
Built a network for each project/resource pair
Uses the topic embedding of each, the document embedding of the title and text of each, and similarity scores
Assumes the document and topic embedding models are constant
Neural network on the next slide...

Neural Network Architecture (diagram): for each pair of documents, the inputs are the LDA topic distributions of Doc 1 and Doc 2; the similarity scores (LDA cosine similarity and Doc2Vec cosine similarity); and the Doc2Vec embeddings of the title of Doc 1, the abstract of Doc 1, the title of Doc 2, and the text of Doc 2. These inputs are concatenated and passed through a hidden layer, and the output is a score of whether resource 2 is helpful for resource 1.

Note that this neural network takes in features regarding a pair of resources and returns a score representing whether resource 2 is helpful for resource/project 1. Note also that each time the network is trained and tested, it assumes a constant LDA model, Doc2Vec model, and corpus from which these models are created. Changing these models would inherently change the embeddings and similarity representations, and thus require retraining the network as well. The inputs the neural network takes in are: the topic embedding of each resource, an array of similarity scores from different metrics, and the embeddings of the title and text of each resource. All of the dense layers use tanh as the activation function. For our training and testing, we use LDA for the topic embeddings, Doc2Vec for the document embeddings, and the cosine similarities of the LDA and Doc2Vec representations for the similarity scores. However, note that the neural network code in this project can be reused with any topic embedding metric, any similarity scores, and any embeddings of the titles and texts of the resources, by just changing the preprocessing code.
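
Below is a minimal sketch of such a pairwise scoring network in PyTorch. The framework, the hidden layer size (128), and the tanh applied to the output are assumptions for illustration; only the input sizes (60 LDA topics, 300-dimensional Doc2Vec vectors, two similarity scores) and the tanh activation on the dense layers come from the description above.

```python
# Sketch of a pairwise scoring network: concatenate topic distributions,
# similarity scores, and Doc2Vec embeddings, then score the pair.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, n_topics=60, emb_dim=300, n_sims=2, hidden=128):
        super().__init__()
        # Two topic distributions, the similarity scores, and four Doc2Vec
        # embeddings (title/abstract of doc 1, title/text of doc 2).
        in_dim = 2 * n_topics + n_sims + 4 * emb_dim
        self.hidden = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, topics1, topics2, sims, title1, abstract1, title2, text2):
        x = torch.cat([topics1, topics2, sims, title1, abstract1, title2, text2], dim=-1)
        h = torch.tanh(self.hidden(x))
        # tanh on the output keeps the score in [-1, 1], matching the +1/-1
        # labels used for evaluation; this detail is an assumption.
        return torch.tanh(self.out(h))

# One project/resource pair with random stand-in features, just to show shapes.
scorer = PairScorer()
features = [torch.rand(1, 60), torch.rand(1, 60), torch.rand(1, 2)] + [torch.rand(1, 300)] * 4
print(scorer(*features))
```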

Results

Results: Baseline Implementation with LDA and Doc2Vec
Doc2Vec on the left, LDA on the right; the resources appear to be clustered well.

Results: Recommendation with Baseline Implementation
Both can be improved, but LDA is better overall (0.45 avg. vs. 0.34)
LDA is better on cases 5 and 6 (well-defined topics)
Doc2Vec is better on cases 2 and 8 (mix of topics)

Results: Deep Learning Approach
Tested on a human-annotated corpus
Used annotations from 10 projects, as before
Weighted score based on human annotations (1 positive, -1 negative); 0 for all other pairs
Evaluated based on whether the output had the same sign as the human annotation
73.79% accuracy with 5 epochs, 74.76% accuracy with 10 epochs
Much better than the baseline accuracy of 51.46%
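
A minimal sketch of the sign-based evaluation described above, assuming it is computed only over the human-annotated pairs; the function name and the example arrays are illustrative, not the project's actual data.

```python
# A prediction counts as correct when it has the same sign as the +1/-1 label.
import numpy as np

def sign_accuracy(predictions, annotations):
    """Fraction of annotated pairs where the predicted score matches the label's sign."""
    predictions = np.asarray(predictions)
    annotations = np.asarray(annotations)
    return float(np.mean(np.sign(predictions) == np.sign(annotations)))

print(sign_accuracy([0.8, -0.3, 0.1, -0.9], [1, -1, -1, -1]))  # -> 0.75
```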

Future Research

Future Research
Other topic and document embeddings
Network architecture
Other approaches
Other evaluation metrics

A future project could determine whether LDA and Doc2Vec are indeed the best choices for topic and document embeddings; pre-requisite chains may also be used. The network may improve performance with a different structure. Other approaches may be taken, such as query modeling. Other evaluation metrics may more accurately fit the problem.

Appendix

Appendix
All code for this project can be found at: https://github.com/IreneZihuiLi/aan_rec
All data sets are extracted from the AAN corpus.

Thanks!
Professor Dragomir Radev
Alex Fabbri
Irene Li
Everyone in the LILY Lab
Professor John Wettlaufer