Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz

Personalized Information Filtering
Identify user-desired documents from a document stream
Two families of filtering approaches
– Collaborative Filtering (CF)
– Content-Based Filtering (CBF)
Applications: news feeder, email spam filter, etc.
[Diagram: news, blogs, emails, … enter the Filtering System, which outputs the passed documents]

Semi-Structured Documents
Increasingly prevalent over the Internet
– Emails, news, movies, tweets, etc.
Plenty of metadata available

Definitions
Facet: a metadata field
– Date, Topic, Location, Director, Genre, etc.
Facet-Value Pair (FVP): a metadata field assigned a particular value
– Topic: Royal wedding
– Date: 04-29-2011
– Location: London, UK
Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
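
A facet-value pair maps naturally onto a (facet, value) tuple. The sketch below is only an illustration of that idea: the `Document` class, its field names, and the sample text are assumptions, not something defined in the talk.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

# A facet-value pair (FVP) is modelled as a (facet, value) tuple.
FVP = Tuple[str, str]

@dataclass
class Document:
    """Hypothetical container for a semi-structured document."""
    text: str
    fvps: Set[FVP] = field(default_factory=set)

    def has(self, facet: str, value: str) -> bool:
        """Return True if the document carries the given facet-value pair."""
        return (facet, value) in self.fvps

# The three example FVPs from the definitions, attached to one document.
doc = Document(
    text="Coverage of the royal wedding ceremony.",
    fvps={
        ("Topic", "Royal wedding"),
        ("Date", "04-29-2011"),
        ("Location", "London, UK"),
    },
)
```

Storing FVPs as a set keeps membership checks cheap, which is the operation a filter needs most often.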

Motivation
Existing filtering approaches learn user interests from users' relevance judgments of documents
Users may have prior knowledge of which facet-value pairs are relevant
– English-only readers → Language: English
– Social network analysts → Company: Facebook

Can we exploit users' prior knowledge of facet-value pairs for filtering?

A New User Interaction Mechanism: Faceted Feedback
[Diagram: the Filtering System shows the user FVP candidates (Lang: …, Topic: …, Date: …); the user returns the relevant FVPs (Topic: …, Lang: …)]

Research Questions
Question 1
– How to select facet-value pair candidates?
Question 2
– How to learn user profiles based on faceted feedback?

Q1: Possible Methods
Feature selection methods for text classification
– E.g., Mutual Information, Chi-Square measure, etc.
– Usually a large number of labeled documents available
Query expansion methods for retrieval
– E.g., TF-IDF score on pseudo-relevant documents
– No labeled documents available

FVP Selection: Our Approach
In a filtering task
– A large number of unlabeled documents
– Possibly a small number of labeled documents
We rank facet-value pairs by [scoring formula not preserved; its terms are computed over the pseudo-relevant (positively classified) documents and the user-labeled relevant documents]
Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
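
The exact ranking formula is not shown here, so the sketch below only illustrates the stated intuition. It assumes a smoothed TF-IDF-style score (coverage among relevant documents times log-rarity in the corpus); the function name `rank_fvps`, the smoothing, and the toy data are all assumptions and should not be read as the authors' scoring rule.

```python
import math
from collections import Counter

def rank_fvps(relevant_docs, corpus_docs, top_k=10):
    """Rank facet-value pairs by an assumed TF-IDF-style score.

    relevant_docs: list of FVP sets from pseudo-relevant or user-labeled relevant docs
    corpus_docs:   list of FVP sets for the whole corpus
    """
    rel_freq = Counter(fvp for doc in relevant_docs for fvp in doc)
    corpus_freq = Counter(fvp for doc in corpus_docs for fvp in doc)
    n_rel, n_all = len(relevant_docs), len(corpus_docs)

    scores = {}
    for fvp, count in rel_freq.items():
        coverage = count / n_rel  # frequent among relevant docs
        # Add-one smoothing keeps the ratio finite for unseen FVPs.
        rarity = math.log((n_all + 1) / (corpus_freq[fvp] + 1))  # rare in the corpus
        scores[fvp] = coverage * rarity
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy corpus: every document is a set of (facet, value) pairs.
corpus = [
    {("Language", "English"), ("Company", "Facebook")},
    {("Language", "English"), ("Company", "Facebook")},
    {("Language", "English"), ("Company", "Google")},
    {("Language", "English"), ("Topic", "Sports")},
    {("Language", "French"), ("Topic", "Sports")},
    {("Language", "English"), ("Topic", "Politics")},
]
relevant = corpus[:2]
ranked = rank_fvps(relevant, corpus)
```

In this toy data both FVPs cover every relevant document, but Company: Facebook is much rarer in the corpus than Language: English, so it ranks first.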

Research Questions
Question 1
– How to select facet-value pair candidates?
Question 2
– How to learn user profiles based on faceted feedback?

Content-Based Filtering (CBF)
Treated as a binary text classification task
User profile: a feature vector that represents a user's information needs (interests/preferences)
Given the user profile θ, a document is classified as relevant or not according to: [formula not preserved; its annotations: document vector, document label]
The core of CBF is learning the user profile!
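
As a rough illustration of classifying a document from a profile vector θ, the sketch below assumes a logistic model over the document vector. That model choice, the 0.5 threshold, and the sample vectors are assumptions; the decision rule used in the talk is not recoverable from this transcript.

```python
import math

def relevance_probability(theta, x):
    """P(y = 1 | x, theta) under a logistic model (an assumption of this sketch)."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))

def is_relevant(theta, x, threshold=0.5):
    """Deliver the document if its estimated relevance probability reaches the threshold."""
    return relevance_probability(theta, x) >= threshold

theta = [1.5, -2.0, 0.5]  # hypothetical user profile
doc_a = [1.0, 0.0, 1.0]   # inner product with theta is 2.0
doc_b = [0.0, 1.0, 1.0]   # inner product with theta is -1.5
```

With this profile `doc_a` would be delivered and `doc_b` rejected; learning θ is the part the rest of the talk addresses.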

Q2: Possible Methods
Simple methods
– Boolean strategy (AND, OR)
– Feature selection
– Pseudo relevant document
Sophisticated methods
– Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
– Generalized Expectation Criteria (Druck et al. 08)
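
The Boolean strategy is not spelled out, so the following is one plausible reading: deliver a document when it carries all (AND) or any (OR) of the facet-value pairs the user selected. The function names and sample data are hypothetical.

```python
def bool_and(doc_fvps, selected_fvps):
    """AND rule: pass a document only if it has every selected FVP."""
    return set(selected_fvps) <= set(doc_fvps)

def bool_or(doc_fvps, selected_fvps):
    """OR rule: pass a document if it has at least one selected FVP."""
    return bool(set(selected_fvps) & set(doc_fvps))

doc = {("Language", "English"), ("Company", "Facebook")}
selected = [("Language", "English"), ("Company", "Google")]
```

Here `bool_and(doc, selected)` rejects the document because Company: Google is missing, while `bool_or(doc, selected)` passes it.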

Our Approach
The assumption
– A user selects a feature because it is highly correlated with the document label (R/NR)
Generalized Constraint Model (GCM)

Correlation Decomposition
Sufficiency
– The probability of a document being relevant given that the feature has occurred: P(R+ | f=1)
– P(R+ | f=1) = 1: sufficient features
– E.g., Company: Facebook for social network analysts
Necessity
– The probability of the feature having occurred given that a document is relevant: P(f=1 | R+)
– P(f=1 | R+) = 1: necessary features
– E.g., Language: English for English-only readers
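
To make the two quantities concrete, the sketch below computes empirical sufficiency and necessity by counting over a handful of labeled documents. The helper names and the toy data are assumptions for illustration; the talk estimates these quantities from the model rather than by simple counting.

```python
def sufficiency(docs, feature):
    """Empirical P(R+ | f=1): share of documents with the feature that are relevant."""
    covered = [rel for fvps, rel in docs if feature in fvps]
    return sum(covered) / len(covered) if covered else 0.0

def necessity(docs, feature):
    """Empirical P(f=1 | R+): share of relevant documents that have the feature."""
    has_feature = [feature in fvps for fvps, rel in docs if rel]
    return sum(has_feature) / len(has_feature) if has_feature else 0.0

# Each entry is (set of FVP strings, relevance label).
docs = [
    ({"Language: English", "Company: Facebook"}, True),
    ({"Language: English", "Company: Facebook"}, True),
    ({"Language: English"}, True),
    ({"Language: English"}, False),
    ({"Language: French"}, False),
]
```

In this toy set Company: Facebook is sufficient (every document carrying it is relevant) but not necessary, while Language: English is necessary (every relevant document carries it) but not sufficient.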

Examples: Highly-Correlated Features
[Venn diagram: within the whole corpus, the relevant set R+ and the sets of documents covered by f1=1, f2=1, and f3=1]
1) f1 is a sufficient feature since P(R+ | f1=1) = 1
2) f2 is a necessary feature since P(f2=1 | R+) = 1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

Estimating Sufficiency
[Equation not preserved. Its annotations: the document label; the feature; the set of documents covered by feature f; the user profile vector; the estimate of the label of document d_i]
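
The annotations (the set of documents covered by feature f, the user profile vector, the estimated label of each document) suggest that sufficiency is estimated by averaging the model's predicted relevance over the documents that contain the feature. The sketch below follows that reading; the logistic `predict` function and all names are assumptions rather than the authors' estimator.

```python
import math

def predict(theta, x):
    """Estimated probability that a document with vector x is relevant (logistic model assumed)."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def estimated_sufficiency(theta, docs, feature_index):
    """Average predicted relevance over the documents covered by the feature."""
    covered = [x for x in docs if x[feature_index] == 1]
    if not covered:
        return 0.0
    return sum(predict(theta, x) for x in covered) / len(covered)

# Binary document vectors; column 0 is the feature of interest.
docs = [[1, 0], [1, 1], [0, 1]]
```

Because the estimate depends on θ, it can act as a constraint on the profile: a profile that gives low predicted relevance to documents carrying a user-selected feature yields a low estimated sufficiency.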

Estimating Necessity
[Equation not preserved. Its annotations: feature sufficiency; Bayes' theorem; prior distribution]
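
The annotations point to Bayes' theorem as the link between necessity and sufficiency: P(f=1 | R+) = P(R+ | f=1) · P(f=1) / P(R+). The identity is standard; how the prior terms are estimated in the talk is not preserved, so the function below simply takes them as inputs. Its name and arguments are hypothetical.

```python
def necessity_from_sufficiency(suff, p_feature, p_relevant):
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+)."""
    if p_relevant <= 0:
        raise ValueError("p_relevant must be positive")
    return suff * p_feature / p_relevant

# Example: a feature with sufficiency 0.75 that occurs in 80% of documents,
# in a collection where 60% of documents are relevant.
nec = necessity_from_sufficiency(0.75, 0.8, 0.6)
```

With these numbers the necessity comes out at 1, meaning every relevant document carries the feature.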

Reference Distributions
Our assumption
– A user selects a feature because it has high sufficiency and/or high necessity
Reference distributions: two Bernoulli distributions
– The sufficiency/necessity of a user-selected feature should be close to the reference distribution
– KL-divergence as the similarity measure
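
For two Bernoulli distributions the KL-divergence has a simple closed form. The sketch below computes KL(reference || estimate); the direction of the divergence, the clipping constant, and the example reference value of 0.9 are assumptions, since the talk does not state them here.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

reference = 0.9  # hypothetical reference parameter for a user-selected feature
penalty_far = bernoulli_kl(reference, 0.5)   # estimate far from the reference
penalty_near = bernoulli_kl(reference, 0.8)  # estimate close to the reference
```

The penalty shrinks as the estimated sufficiency or necessity approaches the reference value and is zero when they coincide.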

User Profile Learning
The unified loss function to combine two types of feedback: [equation not preserved; its three terms are annotated as user-labeled documents, necessary features, and sufficient features]
T_s, T_n: reference distributions
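
The loss itself is not recoverable from the transcript. The sketch below shows one form consistent with its three annotated terms: a log loss over user-labeled documents plus KL penalties that pull the estimated sufficiency and necessity of selected features toward the reference parameters T_s and T_n. The additive form, the weights, and the default values are all assumptions.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def combined_loss(doc_probs, doc_labels, suff_estimates, nec_estimates,
                  t_s=0.9, t_n=0.9, weight_s=1.0, weight_n=1.0):
    """Log loss on labeled docs plus KL penalties for selected features (assumed form).

    doc_probs:      predicted relevance probabilities of the user-labeled documents
    doc_labels:     their true labels (True = relevant)
    suff_estimates: estimated sufficiency of each feature marked sufficient
    nec_estimates:  estimated necessity of each feature marked necessary
    """
    eps = 1e-12
    log_loss = -sum(
        math.log(max(p, eps)) if y else math.log(max(1 - p, eps))
        for p, y in zip(doc_probs, doc_labels)
    )
    suff_penalty = sum(bernoulli_kl(t_s, s) for s in suff_estimates)
    nec_penalty = sum(bernoulli_kl(t_n, n) for n in nec_estimates)
    return log_loss + weight_s * suff_penalty + weight_n * nec_penalty

base = combined_loss([0.8], [True], [0.9], [0.9])      # estimates match the references
violated = combined_loss([0.8], [True], [0.5], [0.9])  # sufficiency far from T_s
```

When no documents are labeled, the first term vanishes and the feature penalties alone shape the profile, which is the cold-start case the talk emphasizes.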

User Interaction Mechanisms
Two mechanisms
– Mechanism 1: ask users to select the features they think are relevant
– Mechanism 2: ask users to mark separately which features they think are sufficient and which are necessary

Outline
Introduction
Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
Experiments
– Settings
– Results
Summary

Data Sets
Two data sets from the TREC filtering track
– TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
  Metadata field: MeSH (Medical Subject Headings)
– TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
  Metadata fields: Topic, Industry, Region
Split each topic set into two equal-size subsets
– One for parameter tuning, the other for testing

Faceted Feedback Collection
Recruit subjects on Mechanical Turk
– Five subjects per topic
– Performance is averaged over the subjects
For each topic, we show subjects
– The topic description (information need)
– A group of facet-value pair candidates

Evaluation Metrics
Precision (macro)
Recall (macro)
T11U = 2 * N_rd - N_nd
– N_rd: the number of relevant docs delivered
– N_nd: the number of non-relevant docs delivered
T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU)
– MinNU = -0.5
– MaxU: the maximum possible utility (T11U)
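
The utility measures can be computed directly from delivery counts. The sketch below assumes the TREC-11 scaled utility, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU), with MaxU taken as twice the number of relevant documents for the topic (the utility of delivering every relevant document and nothing else). Function and parameter names are hypothetical, and the topic is assumed to have at least one relevant document.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: T11U = 2 * N_rd - N_nd."""
    return 2 * n_rel_delivered - n_nonrel_delivered

def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1], assuming MaxU = 2 * n_rel_total and n_rel_total > 0."""
    max_u = 2 * n_rel_total
    normalised = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    # Clamp very poor runs at MinNU before rescaling.
    return (max(normalised, min_nu) - min_nu) / (1 - min_nu)

# Example topic with 20 relevant documents: 10 relevant and 4 non-relevant delivered.
example_utility = t11u(10, 4)
example_scaled = t11su(10, 4, 20)
```

Under this scaling a system that delivers nothing scores 1/3, a perfect system scores 1, and any run at or below MinNU scores 0.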

Results 1: With/Without Faceted Feedback (FF)
[Figure: filtering performance with and without faceted feedback; x-axis: # relevant docs initially known]
Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

Results 2: Different Learning Algorithms
[Table comparing our approach with existing approaches; values not preserved]
– BOOL(A), BOOL(O): Boolean strategy
– FS: feature selection based on FF
– Pseudo-D/Q: pseudo relevant doc/query
– Prior: logistic regression with Bayesian prior
– GEC: generalized expectation criteria

Summary
Faceted feedback is useful for filtering, especially in cold-start scenarios
The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
In future work, we will evaluate our methods on data sets where faceted features are more important
– Movie, music, product, etc.

Questions?
Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz
lanbo@soe.ucsc.edu, yiz@soe.ucsc.edu, xingqianli@gmail.com
