CMPE 493 INTRODUCTION TO INFORMATION RETRIEVAL: PERSONALIZED QUERY EXPANSION FOR THE WEB. Chirita, P.A., Firan, C.S., and Nejdl, W. SIGIR 2007, pp. 7-14. Bahtiyar Kaba.

Presentation transcript:

CMPE 493 INTRODUCTION TO INFORMATION RETRIEVAL: PERSONALIZED QUERY EXPANSION FOR THE WEB. Chirita, P.A., Firan, C.S., and Nejdl, W. SIGIR 2007, pp. 7-14. Bahtiyar Kaba

Introduction Aim: improve search output by expanding the query, exploiting the user's PIR (Personal Information Repository). Why? – Inherent ambiguity of short queries. Ex: "language ambiguity" => a computer scientist and a linguistics scientist are probably searching for different things. So, help users formulate a better query by expansion, e.g. "language ambiguity in computing". The latter term is found by investigating the user's desktop (PIR). Studies show 80% of users prefer personalized output for their searches.

What will we use? – The personal collection of all documents: text documents, e-mails, cached Web pages, etc. Personalizing this way has two advantages: – Better description of the user's interests, since there is a large amount of information. – Privacy: "profile" information is extracted and exploited locally; there is no need to track the URLs clicked or queries issued.

Algorithms Local desktop query context: – Determine expansion terms from those personal documents matching the query best. – Keyword-, expression-, and summary-based techniques. Global desktop collection: – Investigate expansions based on co-occurrence metrics and external thesauri across the entire personal repository. Before the details of these, a glance at previous work.

Previous Work Two IR research areas: Search Personalization and Automatic Query Expansion. Many algorithms exist for both domains, but not as many for combining them. Personalized search: ranking search results according to user profiles (e.g., by means of past history). Query Expansion: derive a better formulation of the query to enhance retrieval, based on exploiting social or collection-specific characteristics.

Personalized search Two major components: – User Profiles: generated from features of visited pages, e.g., topic preference vectors -> Topic-Sensitive PageRank. Advantage of being easy to obtain and process, but may not suffice for a good understanding of the user's interests, and raises concerns about privacy. – The Personalization Algorithm itself: topic-oriented PageRank. Compute PageRank vectors accordingly, then bias the results according to these vectors and search-term similarity.

Query Expansion Relevance Feedback: – Useful information for the expansion terms can be extracted from the relevant documents returned. – Extract such keywords based on term frequency, document frequency, or summarization of top-ranked documents. Co-occurrence: – Terms highly co-occurring together were shown to increase precision; assess term relationship levels. Thesaurus: – Expand the query with new terms having close meanings. – These can be extracted from a large thesaurus, e.g., WordNet.

Query Expansion with PIR We have a rich personal collection, but the data is very unstructured in format, content, etc. So, we analyze the PIR at various granularity levels, from term frequency within desktop documents to global co-occurrence statistics. Then an empirical analysis of the algorithms is presented.

Local Desktop Analysis Similar to the relevance feedback method for query expansion, but this time we use the best PIR hits. Investigate at 3 granularity levels: – Term and document frequency: advantage of being fast to compute, since the statistics can be precomputed offline. Independently associate a score with each term based on the two statistics.

Local Desktop Analysis Term Frequency: – Use the actual frequency information and the position where the term first appears. – TermScore = [1/2 + 1/2 · (nrWords − pos)/nrWords] · log(1 + TF) – Position information is used since more informative terms tend to appear earlier in the document.
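A minimal sketch of this scoring formula in Python (the function name is mine, not from the paper's code):

```python
import math

def term_score(tf, pos, nr_words):
    """TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF).
    Earlier-appearing terms (small pos) receive a positional boost."""
    return (0.5 + 0.5 * (nr_words - pos) / nr_words) * math.log(1 + tf)

# Example: a term occurring 5 times, first seen at word 10 of a 200-word document
print(term_score(tf=5, pos=10, nr_words=200))
```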

Local Desktop Analysis Document frequency – Given the set of top-k relevant documents, generate snippets focused on the original search request, then order terms by their DF scores. – Focusing on the query is necessary since DF scores are calculated over the entire PIR. TFxIDF weighting may not be good for local desktop analysis, since a term with high DF on the desktop may be rare on the web. – Ex: "PageRank" may have a high DF in an IR scientist's PIR (and thus a low TFxIDF score locally) while it discriminates well on the web.

Local Desktop Analysis Lexical Dispersion Hypothesis: an expression's lexical dispersion can be used to identify key concepts. Compound expressions matching {adjective? noun+} are generated offline and used for query expansion at runtime. Further improvements are possible by ordering them according to lexical dispersion.
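A rough illustration of extracting {adjective? noun+} compounds, assuming NLTK with its default tokenizer and POS-tagger resources installed; the function and the multi-word filter are my own simplifications, not the paper's code:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources

def lexical_compounds(text):
    """Collect {adjective? noun+} sequences: an optional adjective (JJ*)
    followed by one or more nouns (NN*)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    compounds, i = [], 0
    while i < len(tagged):
        start = i
        if tagged[i][1].startswith('JJ'):      # optional leading adjective
            i += 1
        nouns = 0
        while i < len(tagged) and tagged[i][1].startswith('NN'):
            i += 1
            nouns += 1
        if nouns >= 1 and i - start >= 2:      # keep multi-word expressions only
            compounds.append(' '.join(w for w, _ in tagged[start:i]))
        if i == start:                         # no match here; advance
            i += 1
    return compounds

print(lexical_compounds("Lexical dispersion identifies key concepts in large document collections."))
```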

Local Desktop Analysis Summarization: – The set of relevant desktop documents is identified, – then a summary containing their most important sentences is generated as output. – The most comprehensive output, but not efficient, as it cannot be computed offline. – Rank sentences according to their salience scores, computed as follows:

Local Desktop Analysis Summarization: – SalienceScore = SW²/TW + PS + TQ²/NQ – SW: the number of significant terms in the sentence (TW: total words). A term is significant if its TF is above a threshold ms: ms = 7 − 0.1·(25 − NS) if NS < 25; 7 if 25 ≤ NS ≤ 40; 7 + 0.1·(NS − 40) if NS > 40 (NS = number of sentences in the document). – PS: position score, (Avg(NS) − SentenceIndex) / Avg(NS)². Scaled this way, short documents are not affected, as they do not have summaries at the beginning. – The final term balances towards the original query: the more query terms a sentence contains (TQ), the more related it is (NQ: query length).
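A sketch of the sentence-salience computation under the definitions above (variable names SW, TW, PS, TQ, NQ follow the slide; the function signatures are my assumption):

```python
def significance_threshold(ns):
    """TF threshold ms as a function of NS, the number of sentences
    in the document (reconstructed piecewise form)."""
    if ns < 25:
        return 7 - 0.1 * (25 - ns)
    if ns <= 40:
        return 7
    return 7 + 0.1 * (ns - 40)

def salience_score(sentence, query, tf, ns, avg_ns, sentence_index):
    """SalienceScore = SW^2/TW + PS + TQ^2/NQ for one sentence.
    sentence, query: lists of words; tf: term -> frequency in the document;
    ns: sentences in this document; avg_ns: average over the collection."""
    ms = significance_threshold(ns)
    sw = sum(1 for w in sentence if tf.get(w, 0) > ms)  # significant words
    tw = len(sentence)                                  # total words
    ps = (avg_ns - sentence_index) / avg_ns ** 2        # position score
    tq = sum(1 for w in set(query) if w in sentence)    # query terms present
    nq = len(query)
    return sw ** 2 / tw + ps + tq ** 2 / nq
```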

Global Desktop Analysis Previous techniques were based on the documents relevant to the query. Now we rely on information across the user's entire PIR. We have two techniques: – Co-occurrence statistics – Thesaurus-based expansion

Global Desktop Analysis For each term, we compute the terms co-occurring most frequently with it in our PIR collection, then use this information at runtime to expand our queries.

Global Desktop Analysis Algorithm:
Off-line computation:
1: Filter potential keywords k with DF ∈ [10, ..., 20% · N]
2: For each keyword k_i
3: For each keyword k_j
4: Compute SC_{k_i,k_j}, the similarity coefficient of (k_i, k_j)
On-line computation:
1: Let S be the set of keywords potentially similar to an input expression E.
2: For each keyword k of E:
3: S ← S ∪ TSC(k), where TSC(k) contains the Top-K terms most similar to k
4: For each term t of S:
5a: Let Score(t) ← ∏_{k ∈ E} SC_{t,k}
5b: Let Score(t) ← #DesktopHits(E|t)
6: Select the Top-K terms of S with the highest scores.

Global Desktop Analysis We have each term's correlated terms calculated offline. At runtime we need to calculate the correlation of every candidate term with the entire query. Two approaches: – The product of the correlations between the term and all query keywords. – The number of documents in which the proposed term co-occurs with the entire query. Similarity coefficients are calculated using: – Cosine similarity (correlation coefficient) – Mutual information – Likelihood ratio
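A small sketch of the first on-line scoring variant; the cosine-style coefficient shown here (DF(a∧b)/√(DF(a)·DF(b))) is a common definition and my assumption, since the slide does not spell it out:

```python
import math

def cosine_sc(df_ab, df_a, df_b):
    """Cosine-style similarity coefficient from document frequencies:
    DF(a AND b) / sqrt(DF(a) * DF(b)). Assumed definition."""
    return df_ab / math.sqrt(df_a * df_b)

def product_score(term, query_terms, sc):
    """Score a candidate expansion term as the product of its similarity
    coefficients with every query keyword (first on-line approach)."""
    score = 1.0
    for k in query_terms:
        score *= sc.get((term, k), 0.0)  # sc: (term, keyword) -> coefficient
    return score
```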

Global Desktop Analysis Thesaurus-Based Expansion: – Identify the set of terms related to the query terms (using thesaurus information), then calculate the co-occurrence level of each possible expansion (i.e., the original search query plus the new term). Select the ones with the highest frequency.

Thesaurus-Based Expansion
1: For each keyword k of an input query Q:
2: Select the following sets of related terms:
2a: Syn: all synonyms
2b: Sub: all sub-concepts residing one level below k
2c: Super: all super-concepts residing one level above k
3: For each set S_i of the above-mentioned sets:
4: For each term t of S_i:
5: Search the PIR with (Q|t), i.e., the original query expanded with t
6: Let H be the number of hits of the above search (i.e., the co-occurrence level of t with Q)
7: Return the Top-K terms as ordered by their H values.
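A hedged sketch of step 2 using NLTK's WordNet interface (assuming the WordNet corpus is installed; lemma and synset handling is simplified):

```python
from nltk.corpus import wordnet as wn

def related_terms(keyword):
    """Collect synonyms (Syn), one-level sub-concepts (Sub, hyponyms) and
    one-level super-concepts (Super, hypernyms) of a keyword."""
    syn, sub, sup = set(), set(), set()
    for s in wn.synsets(keyword):
        syn.update(l.replace('_', ' ') for l in s.lemma_names())
        for h in s.hyponyms():
            sub.update(l.replace('_', ' ') for l in h.lemma_names())
        for h in s.hypernyms():
            sup.update(l.replace('_', ' ') for l in h.lemma_names())
    syn.discard(keyword)
    return syn, sub, sup

# Each candidate t would then be scored by searching the PIR with (Q|t)
# and counting hits H, keeping the Top-K terms ordered by H.
```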

Experiments 18 subjects indexed their content along their selected paths: e-mails, documents, web cache. Types of queries: – Random log query, hitting 10 documents in the PIR. – Self-selected specific query, which the subject thinks has one meaning. – Self-selected ambiguous query, which the subject thinks has more than one meaning. We set the number of expansion terms to 4.

Experiments Measure – Discounted Cumulative Gain: DCG(i) = G(1) if i = 1; DCG(i−1) + G(i)/log(i) otherwise. This gives more weight to highly ranked documents and incorporates graded relevance levels.
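A direct transcription of this recurrence (the logarithm base is not given on the slide; base 2 is the customary choice and assumed here):

```python
import math

def dcg(gains):
    """DCG(1) = G(1); DCG(i) = DCG(i-1) + G(i)/log2(i) for i > 1."""
    total = gains[0]
    for i, g in enumerate(gains[1:], start=2):
        total += g / math.log2(i)
    return total

# Example: graded relevance of the top 4 results
print(dcg([2, 0, 1, 1]))
```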

Experiments Labels used in the following results tables: – Google: actual Google results – TF, DF: term and document frequency, as above – LC, LC[O]: regular and optimized lexical compounds – SS: sentence selection (summarization) – TC[CS], TC[MI], TC[LR]: term co-occurrence statistics with cosine similarity, mutual information, and likelihood ratio, respectively – WN[SYN], WN[SUB], WN[SUP]: WordNet-based thesaurus expansion with synonyms, sub-concepts, and super-concepts, respectively

Results for log queries

Results for selected queries

Results For log queries the best performance is achieved with TF, LC[O], and TC[LR]. We get good results with the simple keyword- and expression-oriented techniques (TF, LC[O]), whereas the more complicated ones do not show significant improvements. For unambiguous self-selected queries we do not see much improvement, but for ambiguous ones there is a clear benefit. For clear (unambiguous) queries, decreasing the number of expansion terms can bring further improvements: the idea behind adaptive algorithms.

Adaptivity An optimal personalized query expansion algorithm should adapt itself to the initial query. How should we measure this, i.e., how much personal data should be fed into the search? Query Length: – the number of words in the user query; not a reliable indicator, since both short and long queries can be complicated. Query Scope: – IDF of the entire query: log(#documents in collection / #hits for query). Performs well for collections focused on a single topic. Query Clarity: – Measures the divergence between the language model of the query and the language model of the collection (PIR): – Σ_w P(w | Query) · log(P(w | Query) / P(w)), where w is a word in the query, P(w | Query) is the probability of the word in the query, and P(w) its probability in the entire collection.
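The two adaptivity measures, written out as a minimal sketch (the dict-based language models are my assumption):

```python
import math

def query_scope(n_docs_in_collection, n_hits_for_query):
    """Scope = log(#documents in collection / #hits for the query)."""
    return math.log(n_docs_in_collection / n_hits_for_query)

def query_clarity(query_lm, collection_lm):
    """Clarity = sum_w P(w|Query) * log(P(w|Query) / P(w)): the divergence
    between the query and collection language models."""
    return sum(p * math.log(p / collection_lm[w])
               for w, p in query_lm.items() if p > 0)
```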

Calculate "scope" for the PIR and "clarity" for the Web. We use LC[O] (best performance in the previous experiments), plus TF and WN[SYN], which produced good first and second expansion terms. Tailor the number of expansion terms to each query as a function of its scope within the PIR and its clarity on the Web. The scores for the combinations of scope and clarity levels are as follows:

Clarity Levels

Experiments A similar approach was taken as in the previous experiments. For top log queries, an improvement over Google and even over the static methods (expansion term number = 4). For random log queries, again better results than Google, but behind the static methods; we may need a better selection of the number of expansion terms. For self-selected queries: – A clear improvement for ambiguous queries. – A slight performance increase for clear queries. The results suggest adaptivity is a further step for research in web search personalization.

Conclusion Five techniques for determining expansion terms generated from personal documents. Empirical analysis shows a 51.28% improvement. Further work on adapting the expansion process to the query brings an additional improvement of 8.47%.

Further Work Investigations on how to optimally select the number of expansion terms. Other query expansion approaches, e.g., Latent Semantic Analysis.

Thank you…