- A variety of automatic and semi-automatic query suggestion techniques have been developed
  - The goal is to improve effectiveness by matching related or similar terms
  - Semi-automatic techniques require user interaction to select the best suggested terms
- Query expansion is a related technique
  - It offers alternative queries, usually with more terms
Query Suggestion
- Approaches are usually based on an analysis of term co-occurrence
  - Either in the entire document collection, a large collection of queries, or the top-ranked documents in a result list
  - Query-based stemming is also a suggestion technique
- Automatic suggestion based on a general thesaurus is not effective
  - It does not take context into account, e.g., “aquarium” is a good suggestion for “tank” in the query “tropical fish tank”, but not in “armor for tanks”
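Term Association Measures
- Dice’s Coefficient
  $\frac{2 \cdot n_{ab}}{n_a + n_b} \stackrel{rank}{=} \frac{n_{ab}}{n_a + n_b}$
  where $\stackrel{rank}{=}$ stands for rank equivalent, $n_a$ and $n_b$ are the numbers of documents containing words a and b, and $n_{ab}$ is the number containing both
- Mutual Information Measure (MIM)
  $\log \frac{P(a, b)}{P(a) P(b)} \stackrel{rank}{=} \frac{n_{ab}}{n_a \cdot n_b}$
  where N is the number of documents in a collection, $P(a) = n_a / N$, $P(b) = n_b / N$, $P(a, b) = n_{ab} / N$
  - Measures the extent to which the words occur independently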
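As a rough illustration of how document counts turn into association scores, the sketch below computes Dice's coefficient, 2·n_ab / (n_a + n_b), and MIM, log(P(a, b) / (P(a)·P(b))), at the document level. The toy collection and the function names are made up for this example and are not from the slides.

```python
import math

# Toy collection (hypothetical): each document is reduced to its set of words.
docs = [
    {"tropical", "fish", "aquarium"},
    {"tropical", "fish", "tank"},
    {"tropical", "storm"},
    {"fish", "soup"},
    {"army", "tank"},
]

def doc_counts(a, b, docs):
    """Return (n_a, n_b, n_ab): documents containing a, b, and both words."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    return n_a, n_b, n_ab

def dice(n_a, n_b, n_ab):
    """Dice's coefficient: 2 * n_ab / (n_a + n_b)."""
    return 2 * n_ab / (n_a + n_b)

def mim(n_a, n_b, n_ab, n_docs):
    """MIM: log of P(a, b) / (P(a) * P(b)); undefined when n_ab is 0."""
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    return math.log(p_ab / (p_a * p_b))

n_a, n_b, n_ab = doc_counts("tropical", "fish", docs)
dice_score = dice(n_a, n_b, n_ab)
mim_score = mim(n_a, n_b, n_ab, len(docs))
```

Here “tropical” and “fish” each occur in 3 of the 5 documents and together in 2, so Dice is 4/6 and MIM is log(10/9). A positive MIM means the pair co-occurs more often than independence would predict.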
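Term Association Measures
- Mutual Information Measure (MIM) favors low-frequency terms
- Expected Mutual Information Measure (EMIM) addresses this problem by weighting MIM with P(a, b)
  $P(a, b) \cdot \log \frac{P(a, b)}{P(a) P(b)} \stackrel{rank}{=} n_{ab} \cdot \log\left(N \cdot \frac{n_{ab}}{n_a \cdot n_b}\right)$
  - This is actually only the one part of the full EMIM that focuses on word co-occurrence
  - EMIM, however, favors high-frequency terms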
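The opposite frequency biases of MIM and EMIM can be seen with a small numeric sketch. It uses the rank-equivalent forms n_ab / (n_a·n_b) for MIM and n_ab·log(N·n_ab / (n_a·n_b)) for EMIM. All counts below are invented for illustration: a word a, a rare word that appears once (together with a), and a common word that often appears with a.

```python
import math

N = 1_000_000   # documents in the collection (hypothetical)
n_a = 1_000     # documents containing word a (hypothetical)

def mim_rank(n_a, n_b, n_ab):
    """Rank-equivalent MIM: n_ab / (n_a * n_b)."""
    return n_ab / (n_a * n_b)

def emim_rank(n_a, n_b, n_ab, n_docs):
    """Rank-equivalent EMIM: n_ab * log(N * n_ab / (n_a * n_b))."""
    return n_ab * math.log(n_docs * n_ab / (n_a * n_b))

# (n_b, n_ab) for two candidate words
rare = (1, 1)           # occurs once, in a document that also contains a
common = (10_000, 500)  # occurs often, 500 times together with a

mim_rare = mim_rank(n_a, *rare)
mim_common = mim_rank(n_a, *common)
emim_rare = emim_rank(n_a, *rare, N)
emim_common = emim_rank(n_a, *common, N)
```

With these counts MIM scores the rare word higher (0.001 against 0.00005), while EMIM scores the common word far higher (about 500·log 50 against log 1000), which matches the biases stated on the slide.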
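Term Association Measures
- Pearson’s Chi-squared (χ²) measure
  - Compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent
  - Normalizes this comparison by the expected number
  - This is also a limited form focused on word co-occurrence
  $\frac{\left(n_{ab} - N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}\right)^2}{N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}} \stackrel{rank}{=} \frac{\left(n_{ab} - \frac{1}{N} \cdot n_a \cdot n_b\right)^2}{n_a \cdot n_b}$
  where $N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}$ is the expected number of co-occurrences if the words occur independently
  - Favors low-frequency terms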
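A short worked sketch of the χ² comparison: compute the expected number of co-occurrences under independence, n_a·n_b / N, then square the difference from the observed count and divide by the expected number. The counts are invented for illustration.

```python
def expected_cooccurrences(n_a, n_b, n_docs):
    """Expected co-occurrences if a and b were independent: N * P(a) * P(b)."""
    return n_a * n_b / n_docs

def chi_squared(n_a, n_b, n_ab, n_docs):
    """(observed - expected)^2 / expected, the limited co-occurrence form."""
    expected = expected_cooccurrences(n_a, n_b, n_docs)
    return (n_ab - expected) ** 2 / expected

# Hypothetical counts: 100 documents, a in 10, b in 20.
expected = expected_cooccurrences(10, 20, 100)   # 2.0
associated = chi_squared(10, 20, 8, 100)         # 8 observed co-occurrences
independent = chi_squared(10, 20, 2, 100)        # exactly the expected number
```

When the observed count equals the expected count the score is 0; with 8 observed co-occurrences against 2 expected it is 36 / 2 = 18.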
Association Measure Summary
Association Measure Example
Most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
- MIM and χ² produce identical rankings and favor low-frequency words
- The other measures (EMIM and Dice) return more general words than MIM and χ²
Association Measure Example
Most strongly associated words for “fish”, a high-frequency term, in a collection of TREC news stories.
- The top-ranked words for MIM and χ² are similar
Association Measure Example
Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
- Still favor low-frequency terms
- Most stable and reliable regardless of the window size
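Counting co-occurrences in windows instead of whole documents can be sketched as follows. This assumes non-overlapping windows of 5 tokens, each treated like a tiny document; the slides do not say how the windows were formed, so that choice, the sample text, and the function name are all assumptions.

```python
def window_counts(tokens, a, b, size=5):
    """Return (n_a, n_b, n_ab, n_windows) over non-overlapping windows."""
    # Split the token stream into consecutive windows of `size` tokens.
    windows = [set(tokens[i:i + size]) for i in range(0, len(tokens), size)]
    n_a = sum(1 for w in windows if a in w)
    n_b = sum(1 for w in windows if b in w)
    n_ab = sum(1 for w in windows if a in w and b in w)
    return n_a, n_b, n_ab, len(windows)

# Hypothetical text, 12 tokens, so 3 windows (the last one is shorter).
text = "tropical fish live in warm water and fish tanks need clean water"
counts = window_counts(text.split(), "fish", "water")
```

The resulting counts plug into the same association measures as document-level counts, with the number of windows taking the place of N.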
Association Measures
- The words associated with “tropical” and “fish” individually are of little use for expanding the query “tropical fish”
- Expansion based on the whole query takes context into account
  - e.g., using Dice with the term “tropical fish” gives the following highly associated words: goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet
- This is impractical for all possible queries, so other approaches are used to achieve this effect
Other Approaches
- Pseudo-relevance feedback
  - Expansion terms are based on the top retrieved documents for the initial query
- Context vectors
  - Represent words by the words that co-occur with them
  - e.g., the top 35 most strongly associated words for “aquarium” (using Dice’s coefficient)
  - Rank words for a query by ranking context vectors
- Challenges (computational and accuracy) arise from the huge size and variable quality of the collections
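A minimal sketch of the pseudo-relevance feedback idea: treat the top k documents of the initial ranking as if they were relevant and pick expansion terms from them. For simplicity this version scores candidate terms by raw frequency, which is an assumption; practical systems typically weight terms more carefully. The ranked list and function name are hypothetical.

```python
from collections import Counter

def prf_expansion_terms(query, ranked_docs, k=2, n_terms=1):
    """Pick the most frequent non-query terms from the top-k documents."""
    query_terms = set(query.split())
    counts = Counter()
    for doc in ranked_docs[:k]:  # only the top-k documents are used
        counts.update(t for t in doc.split() if t not in query_terms)
    return [term for term, _ in counts.most_common(n_terms)]

# Hypothetical result list for the initial query, best match first.
ranked = [
    "tropical fish aquarium supplies aquarium",
    "tropical fish aquarium care",
    "fish soup recipe",
]
best = prf_expansion_terms("tropical fish", ranked, k=2, n_terms=1)
top3 = prf_expansion_terms("tropical fish", ranked, k=2, n_terms=3)
```

Because only the top 2 documents are consulted, words from the third document (such as “recipe”) never become expansion candidates.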
Other Approaches
- Query logs
  - Best source of information about queries and related terms
    - Short pieces of text and click data
  - e.g., most frequent words in queries containing “tropical fish” from the MSN log: stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies
  - Query suggestion is based on finding similar queries
    - Group queries based on click data
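One simple way to group queries by click data is to treat two queries as similar when users clicked the same result for both. The sketch below does exactly that; the log entries, URLs, and function name are hypothetical, and real systems would also need to handle noise and sparsity in the clicks.

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked URL) pairs.
click_log = [
    ("tropical fish", "fishstore.example/tropical"),
    ("tropical fish stores", "fishstore.example/tropical"),
    ("freshwater aquarium fish", "fishstore.example/tropical"),
    ("tropical fish", "pics.example/fish"),
    ("tropical fish pictures", "pics.example/fish"),
    ("fish soup recipe", "cooking.example/soup"),
]

def suggest_from_clicks(query, click_log):
    """Return other queries that share at least one clicked URL with `query`."""
    urls_by_query = defaultdict(set)
    queries_by_url = defaultdict(set)
    for q, url in click_log:
        urls_by_query[q].add(url)
        queries_by_url[url].add(q)
    related = set()
    for url in urls_by_query.get(query, set()):
        related |= queries_by_url[url]
    related.discard(query)  # do not suggest the query itself
    return sorted(related)

suggestions = suggest_from_clicks("tropical fish", click_log)
```

Note that “freshwater aquarium fish” is suggested even though it shares only one word with the original query, because the grouping relies on clicks and not on the query text.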
Query Expansion
- Search engines suggest expanded or alternative queries in response to a query Q
  - Using some form of thesaurus to perform global analysis
  - For each term t in Q, Q is expanded with synonyms and related words of t from the thesaurus
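The per-term expansion step can be sketched in a few lines. The thesaurus here is a hypothetical one-entry dictionary, and the function name is made up for this example.

```python
def expand_query(query, thesaurus):
    """Append the thesaurus entries of every query term to the query."""
    terms = query.split()
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:  # avoid adding duplicates
                expanded.append(related)
    return " ".join(expanded)

# Hypothetical thesaurus: term -> synonyms and related words.
thesaurus = {"tank": ["aquarium"]}

fish_query = expand_query("tropical fish tank", thesaurus)
army_query = expand_query("army tank", thesaurus)
```

Since each term is looked up on its own, “aquarium” is added to both queries, including the one about armor. This is the lack of context that makes a general thesaurus a weak source of suggestions.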
Query Expansion
- Methods for building a thesaurus for query expansion
  1. Use of a controlled vocabulary maintained by human editors, such as the Library of Congress Subject Headings (LCSH), e.g., the LCSH of “American Revolutionary War” is United States -- History -- Revolution,
  2. An automatically derived thesaurus, constructed using word co-occurrence statistics over a collection of documents
  3. Query reformulations based on query log mining, exploring the manual query reformulations of other users to make suggestions to a user
- Thesaurus-based query expansion does not require any user input to increase recall
Query Expansion
- Automatic thesaurus generation using word co-occurrence
  - A simple approach is based on term-term similarities
  - Start with a term-document matrix A, where each cell $A_{t,d}$ is a weighted count $w_{t,d}$ for term t and document d
  - Calculate $C = AA^T$, in which $C_{u,v}$ is a similarity score between terms u and v; the larger the score, the more similar the terms
  - An example of a derived thesaurus with good/bad suggestions
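The product C = AAᵀ is small enough to write out by hand. The sketch below uses plain Python lists and an invented 3-term by 3-document matrix with binary weights; the term list and function names are assumptions for illustration.

```python
terms = ["aquarium", "fish", "tank"]

# Hypothetical term-document matrix A: rows are terms, columns are documents,
# and each cell is the (here binary) weight of the term in the document.
A = [
    [1, 1, 0],  # aquarium
    [1, 1, 1],  # fish
    [0, 1, 1],  # tank
]

def term_similarity(A):
    """Compute C = A * A^T: C[u][v] is the dot product of rows u and v."""
    return [[sum(x * y for x, y in zip(row_u, row_v)) for row_v in A]
            for row_u in A]

def most_similar(term, terms, C):
    """Return the other term with the highest similarity score to `term`."""
    u = terms.index(term)
    candidates = [(C[u][v], terms[v]) for v in range(len(terms)) if v != u]
    return max(candidates)[1]

C = term_similarity(A)
suggestion = most_similar("aquarium", terms, C)
```

C is symmetric, and its diagonal holds each term's similarity to itself. The scores are unnormalized dot products, so a term that appears in many documents tends to score highly against everything; normalizing the rows of A first is one common way to dampen that effect.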
Query Expansion
- The quality of term association is typically a problem in an automatically generated thesaurus
  - Term ambiguity easily introduces irrelevant, statistically correlated terms, e.g., “Apple” can be expanded to “Apple red fruit computer”
  - Suffers from false positives (FP) and false negatives (FN)
- A manually built thesaurus is costly to produce and update
- Query expansion often increases recall, but may also significantly decrease precision, especially when the query contains ambiguous terms
  - e.g., expanding “interest rate” to “interest rate fascinate evaluate” is unlikely to be useful