Automatic Collection “Recruiter”
Shuang Song



Project Goal
Given a collection, automatically suggest other items to add to the collection.
- Design a process to achieve the task
- Apply different filtering algorithms
- Evaluate the results

The Process
- Tokenization and frequency counting
- New items extraction
- New items filtering and ranking
[Diagram labels: Query Terms, Filter, Collection, External Source, Query Results, Training Sets, New Items]
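
The first step above (tokenization and frequency counting) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the stop-word list, the `tokenize` and `top_query_terms` helpers, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which words were dropped.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_query_terms(documents, k=10):
    """Count term frequencies over the whole collection and return the k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts.most_common(k)

# Made-up two-document collection, for illustration only.
collection = [
    "Collaborative filtering of ratings",
    "A filtering algorithm for ratings and recommendations",
]
query_terms = top_query_terms(collection, k=2)
```

The resulting (term, count) pairs have the same shape as the query terms listed for the second experiment, and would be submitted to the external source to pull candidate items.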

Filtering Algorithms
Latent Semantic Analysis (LSA)
- Pre-processing, no stemming
- SVD over term-by-document matrix
- Pseudo-document representation of new items
Gzip Compression Algorithms
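
A small sketch of the LSA steps listed above, assuming NumPy is available. It uses the standard fold-in formula for pseudo-documents, d_hat = d^T U_k S_k^{-1}; the toy matrix, the choice k = 2, and the `fold_in` helper are assumptions for illustration, and no term weighting is applied here.

```python
import numpy as np

# Toy term-by-document matrix A (rows = terms, columns = collection documents).
# The counts are made up for illustration.
A = np.array([
    [2.0, 1.0, 0.0],   # "filtering"
    [1.0, 2.0, 0.0],   # "ratings"
    [0.0, 0.0, 3.0],   # "geometry"
])

k = 2  # number of latent dimensions kept (an assumed value)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]

def fold_in(term_vector):
    """Project a term vector into the k-dimensional LSA space (d^T U_k S_k^{-1})."""
    return (term_vector @ U_k) / s_k

# Existing documents in LSA space are the rows of V_k.
doc_vectors = Vt[:k, :].T

# A new item, expressed over the same three terms, becomes a pseudo-document.
new_item = np.array([1.0, 1.0, 0.0])
pseudo_doc = fold_in(new_item)
```

Folding in a column of A reproduces that document's own row of V_k, which is a quick sanity check that new items land in the same space as the collection.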

Relevance Measure - LSA
[Diagram labels: LSA Feature Space, Collection Signature Vector, Pseudo-document Vector, V*, V]
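
The slide only shows a diagram, so the exact formula is not recoverable. One plausible reading, sketched here under assumptions, is a cosine similarity between a collection signature vector and each pseudo-document vector. Taking the signature to be the centroid of the collection's document vectors is an assumption of this example, as are the `signature` and `cosine` helpers and the sample vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def signature(doc_vectors):
    """Assumed collection signature: the centroid of the document vectors."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]

# Made-up 2-dimensional LSA vectors, for illustration only.
collection_vectors = [[0.9, 0.1], [0.8, 0.3]]
sig = signature(collection_vectors)

on_topic = cosine(sig, [0.7, 0.2])    # pseudo-document pointing the same way
off_topic = cosine(sig, [0.0, 1.0])   # pseudo-document nearly orthogonal
```

Ranking new items by this score and keeping those above a cutoff gives the filtering and ranking step of the process.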

Relevance Measure - gzip
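
The transcript does not preserve the gzip formula. A common compression-based measure is the normalized compression distance (NCD), used here as a stand-in; it is not necessarily the measure used in this project. The helper names `csize`, `ncd`, and `gzip_relevance`, and the sample strings, are hypothetical.

```python
import gzip

def csize(text):
    """Size in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x, y):
    """Normalized compression distance: small when y repeats material from x."""
    cx, cy, cxy = csize(x), csize(y), csize(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def gzip_relevance(collection_text, item_text):
    """Assumed relevance score: higher when the item compresses well with the collection."""
    return 1.0 - ncd(collection_text, item_text)

# Made-up texts, for illustration only.
collection_text = (
    "collaborative filtering predicts user ratings "
    "from the ratings of similar users"
)
unrelated_text = "3141592653589793238462643383279502884197169399375105820974944592"

same_score = gzip_relevance(collection_text, collection_text)
unrelated_score = gzip_relevance(collection_text, unrelated_text)
```

Because the compressor only rewards repeated byte sequences, this kind of score reflects surface repetition rather than word association, which matches the comparison drawn later in the deck.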

First Experiment – Math Forum Collection
19 courseware items in the collection
10 items in the experiment set
- First 5 from Math Forum
- The other 5 from other collections in

First Experiment Result

Second Experiment – Collaborative Filtering Collection
12 papers in the collection
11 items in the experiment set
- First 10 from Citeseer
  Query terms submitted: (information 284) (algorithm 250) (ratings 217) (filtering 159) (system 197) (query 149) (reputation 114) (reviewer 109) (collaborative 106) (recommendations 98)
- Last one is the paper we read in class: “An Algorithm for Automated Rating of Reviewers”

Second Experiment Result

Second Experiment – User Study
6 people in my research lab participated in this study
- 3 of them with IR background
- 3 of them without IR background
They were asked to rate the 11 items in the experiment set according to their degree of relevance to the given collection

Second Experiment Result – Human Rating

Second Experiment Result – Another View

Document ID   LSA   gzip   Group with IR background   Group without IR background
1             M     M      L                          L
2             H     H      H                          H
3             L     L      L                          M
4             H     L      H                          M
5             L     H      H                          H
6             H     M      H                          M
7             M     L      H                          H
8             H     L      H                          H
9             L     M      L                          L
10            M     H      H                          H
11            L     H      H                          M

Second Experiment Result – comparison of w/o SVD and w/o weightings

Second Experiment – Correlation with human rating

Second Experiment – precision and recall (cutoff: R_LSA > 0.5 and R_gzip > 0.2)

Second Experiment – precision and recall (cutoff: R_LSA > 0.4 and R_gzip > 0.17)
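
For reference, precision and recall at a relevance cutoff can be computed as in the sketch below. The scores and the set of relevant items are invented for illustration and are not the experiment's data; `precision_recall` is a hypothetical helper, and it is assumed here that each algorithm is thresholded on its own score.

```python
def precision_recall(scores, relevant_ids, cutoff):
    """Treat items scoring above the cutoff as retrieved; return (precision, recall)."""
    retrieved = {doc_id for doc_id, score in scores.items() if score > cutoff}
    hits = retrieved & relevant_ids
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Made-up relevance scores keyed by document ID, and made-up human judgments.
example_scores = {1: 0.45, 2: 0.80, 3: 0.10, 4: 0.60, 5: 0.30}
relevant = {1, 2, 5}

strict_p, strict_r = precision_recall(example_scores, relevant, cutoff=0.5)
loose_p, loose_r = precision_recall(example_scores, relevant, cutoff=0.4)
```

In this toy data, lowering the cutoff from 0.5 to 0.4 admits one more relevant item, so recall rises; whether precision rises or falls depends on what else crosses the threshold.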

Comparison of Two Filtering Algorithms
- Gzip works well when the input documents are just abstracts, while LSA works for both abstracts and full text
- LSA captures word association patterns and statistical importance; gzip scans for repetition only
- LSA is more computationally demanding, while gzip is simple
- Effectiveness

To Do List And Future Work
- Accurate and trustworthy evaluation from an expert (collection owner?)
- Extract full text and abstracts from Citeseer automatically