Clustering User Queries of a Search Engine Ji-Rong Wen, Jian-Yun Nie & Hong-Jiang Zhang.

Smarter searches Search engines were moving beyond simple keyword matching. The big idea was to “understand” users' queries and then suggest similar queries. The significance of these “similar queries”: other users have already asked them and received correct answers.

Two assumptions 1. If users click on the same documents after issuing different queries, then the queries are similar. 2. If a set of documents is often selected for a set of queries, then the terms in the documents are related to the terms in the queries. Key point: using keywords alone, such similar queries would have been split across multiple clusters.

The aims The Encarta editors were seeking to improve the encyclopaedia so that users could locate information more precisely. In particular: 1. If Encarta does not provide sufficient information for an FAQ, then improve the entries. 2. If an FAQ is emerging as a “hot topic”, then check the result set and provide direct links. This paper is about helping out with issue 2.

Raw material User logs for searches against the online Encarta encyclopaedia. “Session” means a query session rather than a user session: session := queryText [clickedDocument]* The Encarta titles were carefully crafted, so the assumption is that user clicks were based on relevance.
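The session grammar above can be modelled as a small record. This is a minimal sketch only: the tab-separated log layout, the `QuerySession` class, and the `parse_session` helper are assumptions made for illustration, not the format of the actual Encarta logs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuerySession:
    """One query session: the query text plus zero or more clicked documents."""
    query_text: str
    clicked_documents: List[str] = field(default_factory=list)


def parse_session(line: str) -> QuerySession:
    # Hypothetical log layout: query text first, then clicked document IDs,
    # all separated by tabs. Empty fields are ignored.
    parts = line.rstrip("\n").split("\t")
    clicks = [p for p in parts[1:] if p]
    return QuerySession(query_text=parts[0], clicked_documents=clicks)
```

A session with no clicks is still valid under the grammar, which is why `clicked_documents` defaults to an empty list.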

Clustering principles 1. Using query contents. If two queries contain the same or similar terms, they denote the same or similar information needs. This is more useful for longer queries. 2. Using document clicks. If two queries lead to the selection of the same documents, then they are similar. Both principles were used.

Clustering algorithm requirements 1. No manual configuration of the clusters. 2. Filter out queries with low frequencies. 3. Fast. 4. Incremental. The authors selected DBSCAN and incremental DBSCAN, but supplied their own similarity function.
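To show how a density-based algorithm can take a custom similarity function, here is a minimal, non-incremental DBSCAN sketch in plain Python. It is not the authors' implementation: the conversion `distance = 1 - similarity`, the parameter names `eps` and `min_pts`, and the `word_overlap` helper are assumptions for illustration. Queries that end up labelled as noise are the ones without enough similar neighbours, which loosely matches the requirement to filter out rare queries.

```python
from typing import Callable, List, Sequence

NOISE = -1


def word_overlap(p: str, q: str) -> float:
    """Illustrative similarity: shared words over the larger word count."""
    a, b = set(p.split()), set(q.split())
    return len(a & b) / max(len(a), len(b)) if a and b else 0.0


def dbscan(items: Sequence[str],
           similarity: Callable[[str, str], float],
           eps: float,
           min_pts: int) -> List[int]:
    """Return one cluster label per item; NOISE (-1) marks unclustered items."""
    n = len(items)
    labels: List = [None] * n

    def neighbors(i: int) -> List[int]:
        # Distance is assumed to be 1 - similarity; a point neighbours itself.
        return [j for j in range(n)
                if 1.0 - similarity(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels
```

Note that the number of clusters is never specified up front; it falls out of `eps` and `min_pts`, which is what makes this family of algorithms fit the "no manual configuration of the clusters" requirement.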

Similarity Based on Query Contents

Plus refinements: If phrases can be identified, they can be treated as single terms in the calculations. This was easy in this case, as Encarta supplied a dictionary of phrases. There were plans to include syntactic analysis to identify noun phrases.
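The formula for the content-based measure did not survive in this transcript. The sketch below assumes a keyword-overlap ratio with the same shape as the document-click formula on the later slide (shared items divided by the larger count); the function names, the whitespace tokenizer, and the naive phrase merging are illustrative assumptions, and stop-word removal is left out.

```python
from typing import Iterable, Set


def tokenize(query: str, phrases: Iterable[str] = ()) -> Set[str]:
    """Split a query into terms, joining known phrases into single terms."""
    text = query.lower()
    for phrase in phrases:
        # Naive substring replacement; a real phrase dictionary lookup
        # would need to respect word boundaries.
        text = text.replace(phrase.lower(), phrase.lower().replace(" ", "_"))
    return set(text.split())


def keyword_similarity(p: str, q: str, phrases: Iterable[str] = ()) -> float:
    """Assumed measure: common keywords / max(keywords in p, keywords in q)."""
    phrases = list(phrases)
    kp, kq = tokenize(p, phrases), tokenize(q, phrases)
    if not kp or not kq:
        return 0.0
    return len(kp & kq) / max(len(kp), len(kq))
```

With a phrase dictionary, "world war ii" counts as one term rather than three, so a short query that consists only of the phrase is not over-rewarded for matching three separate words.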

Similarity Based on Query Contents Similarity based on edit distance: the number of insertions, deletions, and/or replacements needed to unify two queries. Found to be useful for long and complex queries in preliminary tests. It is unclear whether this was implemented. The authors also mentioned the possibility of using WordNet synonyms.
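A word-level edit distance can be computed with the standard dynamic-programming recurrence. In this sketch, treating whole words as the edit unit and normalising by the longer query's word count are assumptions; the slide gives only the definition of the distance.

```python
def word_edit_distance(p: str, q: str) -> int:
    """Minimum number of word insertions, deletions, or replacements."""
    a, b = p.lower().split(), q.lower().split()
    prev = list(range(len(b) + 1))       # distance from empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def edit_similarity(p: str, q: str) -> float:
    """Assumed normalisation: 1 - distance / word count of the longer query."""
    longest = max(len(p.split()), len(q.split()))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(p, q) / longest
```

Unlike a bag-of-words overlap, this measure is sensitive to word order, which is one reason it may suit longer, sentence-like queries.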

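Similarity Based on User Feedback Single documents: similarity_doc(p, q) = RD(p, q) / Max(rd(p), rd(q)), where rd(p) is the number of documents clicked for query p and RD(p, q) is the number of clicked documents that p and q have in common.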
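The single-document formula translates directly into set operations. This is a minimal sketch that treats the clicked documents of each query as a set of IDs; the function name and the choice to ignore click counts are assumptions.

```python
from typing import Iterable


def click_similarity(docs_p: Iterable[str], docs_q: Iterable[str]) -> float:
    """RD(p, q) / max(rd(p), rd(q)) over sets of clicked document IDs."""
    dp, dq = set(docs_p), set(docs_q)
    if not dp or not dq:
        return 0.0                      # a query with no clicks matches nothing
    return len(dp & dq) / max(len(dp), len(dq))
```

Two queries with no words in common can still score highly here, which is the case the keyword-based measures miss.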

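Similarity Based on User Feedback Encarta documents are hierarchical: a concept taxonomy. The lower the common branch, the higher the similarity. S(d_i, d_j) = (L(F(d_i, d_j)) - 1) / L_Total, where F(d_i, d_j) is the lowest common parent of the two documents, L(x) is the level of node x, and L_Total is the total number of levels.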
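One way to evaluate the hierarchy formula is to store each document as its path from the root of the taxonomy. In this sketch the path representation, the convention that the root is at level 1, and the example category names are all assumptions; the denominator is used exactly as written on the slide.

```python
from typing import Sequence


def hierarchy_similarity(path_i: Sequence[str],
                         path_j: Sequence[str],
                         total_levels: int) -> float:
    """(L(F(d_i, d_j)) - 1) / L_Total, with paths listed from the root down."""
    common = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        common += 1                     # depth of the shared prefix
    if common == 0:
        return 0.0                      # no shared root at all
    # The shared prefix ends at the lowest common node, whose level
    # (root = 1) equals the prefix length.
    return (common - 1) / total_levels
```

Documents that share only the root score 0, and the score grows as the shared branch gets deeper, matching the rule that a lower common branch means higher similarity.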

Outcomes The authors stated the need for more empirical results, but were happy with their progress. However, no actual results were presented. Their approach was certainly successful in detecting similarities missed by other approaches.
