Automatic Collection “Recruiter”
Shuang Song
Project Goal
- Given a collection, automatically suggest other items to add to the collection
- Design a process to achieve the task
- Apply different filtering algorithms
- Evaluate the result
The Process
- Tokenization and frequency counting
- New item extraction
- New item filtering and ranking
(Flow diagram with components: Collection, Query Terms, External Source, Query Results, Training Sets, Filter, New Items)
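The first step of the process, tokenization and frequency counting over the collection to obtain candidate query terms, can be sketched as below. The function name, stopword list, and sample texts are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
import re

def top_query_terms(docs, k=10, stopwords=frozenset({"the", "a", "of", "and", "to", "in", "for"})):
    """Tokenize the collection's documents and return the k most frequent terms
    (term, count) -- these become the query terms sent to the external source."""
    counts = Counter()
    for text in docs:
        for tok in re.findall(r"[a-z]+", text.lower()):
            if tok not in stopwords:
                counts[tok] += 1
    return counts.most_common(k)

collection = [
    "collaborative filtering system ratings",
    "filtering algorithm for ratings and recommendations",
]
print(top_query_terms(collection, k=3))
```

The term counts on the Second Experiment slide (e.g. information 284, algorithm 250) are exactly this kind of frequency-ranked list.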
Filtering Algorithms
- Latent Semantic Analysis (LSA)
  - Pre-processing, no stemming
  - SVD over the term-by-document matrix
  - Pseudo-document representation of new items
- Gzip compression algorithms
Relevance Measure - LSA
(Figure: collection signature vector V* and pseudo-document vector V in the LSA feature space)
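The LSA pipeline described above (SVD over the term-by-document matrix, folding a new item in as a pseudo-document, then comparing it to the collection's signature vector) can be sketched as follows. The slides do not define the signature vector precisely; using the centroid of the collection's document coordinates, and cosine similarity as the relevance score, are assumptions.

```python
import numpy as np

def lsa_relevance(A, q, k=2):
    """Relevance of a new item to a collection via LSA.

    A: term-by-document matrix of the collection (terms x docs)
    q: term-frequency vector of the new item (length = number of terms)
    k: number of latent dimensions kept after SVD
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    # Fold the new item in as a pseudo-document: q* = q^T U_k S_k^{-1}
    q_star = (q @ Uk) / sk
    # Collection signature (assumption): centroid of the documents' LSA coordinates
    sig = Vt[:k].mean(axis=1)
    denom = np.linalg.norm(q_star) * np.linalg.norm(sig)
    # Cosine similarity between pseudo-document and signature vector
    return float(q_star @ sig / denom) if denom > 0 else 0.0

A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.]])   # toy 3-terms x 3-docs matrix
q = np.array([1., 1., 1.])     # term frequencies of a candidate item
print(lsa_relevance(A, q))
```

The cosine score is invariant to the sign ambiguity of SVD, since flipping a left singular vector flips the matching right singular vector as well.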
Relevance Measure - gzip
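The slides do not give the gzip relevance formula, so the sketch below is one plausible reconstruction in the spirit of compression-based similarity (cf. normalized compression distance): the fraction of the item's compressed size that is saved when the item is compressed together with the collection. The function name and exact formula are illustrative assumptions.

```python
import gzip

def gzip_relevance(collection_text, item_text):
    """Compression-based relevance: how much smaller does the new item become
    once gzip can reuse byte patterns from the collection? Higher scores mean
    more shared vocabulary and phrasing (gzip scans for repetition only)."""
    c = lambda s: len(gzip.compress(s.encode("utf-8")))
    c_coll = c(collection_text)
    c_item = c(item_text)
    c_both = c(collection_text + " " + item_text)
    # Relative savings from compressing the item alongside the collection
    return (c_coll + c_item - c_both) / c_item

coll = "collaborative filtering recommends items by aggregating user ratings " * 5
print(gzip_relevance(coll, "collaborative filtering aggregates user ratings"))
print(gzip_relevance(coll, "quantum chromodynamics lattice gauge theory"))
```

Note that gzip's back-reference window is 32 KB, which is one reason this measure behaves better on short inputs such as abstracts, as the comparison slide later observes.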
First Experiment – Math Forum Collection
- 19 courseware items in the collection
- 10 items in the experiment set
  - First 5 from the Math Forum
  - The other 5 from other collections
First Experiment Result
Second Experiment – Collaborative Filtering Collection
- 12 papers in the collection
- 11 items in the experiment set
  - First 10 from Citeseer
  - Query terms submitted (with frequencies): information (284), algorithm (250), ratings (217), filtering (159), system (197), query (149), reputation (114), reviewer (109), collaborative (106), recommendations (98)
  - Last one is the paper we read in class: “An Algorithm for Automated Rating of Reviewers”
Second Experiment Result
Second Experiment – User Study
- 6 people in my research lab participated in this study
  - 3 of them with an IR background
  - 3 of them without an IR background
- They were asked to rate the 11 items in the experiment set according to their degree of relevance to the given collection
Second Experiment Result – Human Rating
Second Experiment Result – Another View

Document ID | LSA | gzip | Group with IR background | Group without IR background
     1      |  M  |  M   |  L  |  L
     2      |  H  |  H   |  H  |  H
     3      |  L  |  L   |  L  |  M
     4      |  H  |  L   |  H  |  M
     5      |  L  |  H   |  H  |  H
     6      |  H  |  M   |  H  |  M
     7      |  M  |  L   |  H  |  H
     8      |  H  |  L   |  H  |  H
     9      |  L  |  M   |  L  |  L
    10      |  M  |  H   |  H  |  H
    11      |  L  |  H   |  H  |  M

(H = high, M = medium, L = low relevance)
Second Experiment Result – Comparison without SVD and without term weightings
Second Experiment – Correlation with human rating
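The slides do not say which correlation coefficient was used against the human ratings; Spearman's rank correlation is a natural choice for ordinal H/M/L judgments. A minimal sketch under that assumption (ties are broken by position here; a tie-aware implementation such as scipy.stats.spearmanr averages tied ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    Simplified -- tied values get distinct ranks by position rather than
    averaged ranks, which matters for coarse H/M/L scales."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx = ranks(np.asarray(x, dtype=float))
    ry = ranks(np.asarray(y, dtype=float))
    return float(np.corrcoef(rx, ry)[0, 1])

# Illustrative: algorithm scores vs. human ratings mapped H=3, M=2, L=1
print(spearman([0.9, 0.4, 0.7, 0.2], [3, 1, 2, 1]))
```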
Second Experiment – Precision and Recall (cutoff: R_LSA > 0.5 and R_gzip > 0.2)
Second Experiment – Precision and Recall (cutoff: R_LSA > 0.4 and R_gzip > 0.17)
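Precision and recall at a score cutoff can be computed as below. Only the cutoff style (e.g. R_LSA > 0.5) follows the slides; the scores and the human-judged relevant set here are illustrative.

```python
def precision_recall(scores, relevant, cutoff):
    """Treat items scoring above `cutoff` as recruited into the collection,
    then compare against the set of items humans judged relevant."""
    retrieved = {i for i, s in scores.items() if s > cutoff}
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

scores = {1: 0.62, 2: 0.55, 3: 0.31, 4: 0.48, 5: 0.71}  # e.g. R_LSA per item
relevant = {1, 2, 4}                                     # human-judged relevant
print(precision_recall(scores, relevant, cutoff=0.5))
```

Lowering the cutoff (as between the two slides above) can only grow the retrieved set, so recall never decreases while precision may drop.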
Comparison of Two Filtering Algorithms
- Gzip works well when the input documents are just abstracts, while LSA works for both abstracts and full text
- LSA captures word-association patterns and statistical importance; gzip only scans for repetition
- LSA is more computationally demanding, while gzip is simple
(Figure: effectiveness comparison)
To-Do List and Future Work
- Accurate and trustworthy evaluation from experts (collection owners?)
- Extract full text and abstracts from Citeseer automatically