Concept-Based Feature Generation and Selection for Information Retrieval
Ofer Egozi, Evgeniy Gabrilovich and Shaul Markovitch
Problems with BoW
- Requires the query text to appear literally in the document
- Especially bad with short queries, e.g. "mobile"
- Expanding queries with a thesaurus often makes things worse
- Web directories aren't structured well enough to help
- Why not use the ESA we saw last week for IR too?
A reminder about Explicit Semantic Analysis
- Map input text to all Wikipedia articles ("concepts")
- Order by relevance: TF-IDF of each word within an article
- Gives a word-to-article matrix of TF-IDF values
- For a query, look up and combine the vectors of its words
- But ESA sometimes returns harmful concepts. For "law enforcement, dogs":
  ‒ Contract
  ‒ Dog fighting
  ‒ Police dog
  ‒ Law enforcement in Australia
  ‒ Breed-specific legislation
  ‒ Cruelty to animals
  ‒ Seattle Police Department
  ‒ Louisiana
  ‒ Air Force Security Forces
  ‒ Royal Canadian Mounted Police
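The ESA interpreter above can be sketched in a few lines. This is a toy illustration with a made-up three-article "Wikipedia" (the article texts and names are invented for the example, not real data): each word maps to a TF-IDF-weighted vector over articles, and a query vector is the sum of its words' vectors.

```python
import math
from collections import defaultdict

# Toy "Wikipedia": concept name -> article text (invented for illustration).
articles = {
    "Police dog": "dog police law enforcement dog training",
    "Dog fighting": "dog fighting cruelty dog",
    "Contract": "law agreement parties law",
}

def build_interpreter(articles):
    """Build the word -> {concept: tf-idf} inverted index."""
    n = len(articles)
    tf = {c: defaultdict(int) for c in articles}
    df = defaultdict(int)
    for c, text in articles.items():
        words = text.split()
        for w in words:
            tf[c][w] += 1
        for w in set(words):
            df[w] += 1
    index = defaultdict(dict)
    for c in articles:
        for w, f in tf[c].items():
            index[w][c] = f * math.log(n / df[w])
    return index

def interpret(query, index):
    # Query vector = sum of the concept vectors of its words,
    # returned as (concept, weight) pairs, strongest first.
    vec = defaultdict(float)
    for w in query.split():
        for c, weight in index.get(w, {}).items():
            vec[c] += weight
    return sorted(vec.items(), key=lambda kv: -kv[1])

index = build_interpreter(articles)
```

On this toy corpus, `interpret("law enforcement dog", index)` ranks "Police dog" first, but weaker matches like "Contract" still appear, mirroring the noisy concepts on the slide.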
MORAG
- Combines the BoW and ESA techniques
- Pre-processing: index each document by both BoW and ESA
- Index both the whole document and 50-word passages
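The passage-indexing step can be sketched as a word-window splitter. The 50-word passage size comes from the slide; the overlap (stride) is an assumption for illustration, not from the paper.

```python
def split_passages(text, size=50, stride=25):
    """Split a document into fixed-size word windows.

    `size=50` matches the slide; `stride` (the overlap) is our
    assumption. A final window is added so the tail is covered.
    """
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    starts = list(range(0, len(words) - size + 1, stride))
    if starts[-1] + size < len(words):
        starts.append(len(words) - size)  # cover the trailing words
    return [" ".join(words[s:s + size]) for s in starts]
```

Each passage would then be indexed by BoW and ESA exactly like a whole document.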
MORAG (2)
Why rank by both documents and passages?
- A few relevant sentences can often determine a document's relevance to a query
- Different parts of a Wikipedia article can focus on different topics
- But the relevance of the whole article is still important
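One simple way to act on both points is score fusion: mix the whole-document concept similarity with the best passage's similarity, then interpolate with the BoW score. This is a hypothetical sketch; the weights `alpha` and `beta` are illustrative, not the paper's actual combination.

```python
def morag_score(bow_score, esa_doc_sim, esa_passage_sims,
                alpha=0.5, beta=0.5):
    # Hypothetical fusion (alpha/beta are illustrative assumptions):
    # a strong passage can lift a document, but the whole-document
    # concept similarity still contributes.
    best_passage = max(esa_passage_sims, default=0.0)
    concept = beta * esa_doc_sim + (1 - beta) * best_passage
    return alpha * bow_score + (1 - alpha) * concept
```

With `alpha = beta = 0.5`, a document whose best passage matches strongly scores higher than one with only a weak overall match, while the BoW score keeps lexical evidence in play.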
Feature Selection
- Not all ESA features are good, so why not use only the best ones?
- Choose the k best and k worst documents by BoW ranking
- For each feature, measure how well it separates the best set from the worst set (entropy-based)
- Use the features that separate the two sets best
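The separation measure can be sketched as an information-gain computation: for each concept feature, try every threshold on its weight and see how much the split reduces the entropy of the best-vs-worst labels. This is our illustrative reading of the entropy criterion, not the paper's exact formula.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) count pair, in bits."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p, q = pos / total, neg / total
    return -(p * math.log2(p) + q * math.log2(q))

def separation_score(weights_best, weights_worst):
    # Best information gain over all thresholds on this concept's
    # weight: 1.0 bit means a threshold perfectly splits the BoW-best
    # docs from the BoW-worst docs; 0.0 means it tells us nothing.
    pairs = sorted([(w, 1) for w in weights_best] +
                   [(w, 0) for w in weights_worst])
    pos, neg = len(weights_best), len(weights_worst)
    base = entropy(pos, neg)
    best_gain = 0.0
    left_pos = left_neg = 0
    for i, (_, label) in enumerate(pairs[:-1]):
        if label == 1:
            left_pos += 1
        else:
            left_neg += 1
        n_left, n_right = i + 1, len(pairs) - i - 1
        split = (n_left * entropy(left_pos, left_neg) +
                 n_right * entropy(pos - left_pos, neg - left_neg)) / len(pairs)
        best_gain = max(best_gain, base - split)
    return best_gain
```

Concepts would then be ranked by this score, keeping only the top ones (25 in the reported optimal setting).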
Results
- TREC-8: 528k news stories, 50 topics, human relevance judgements
- Significantly better when optimal parameters are used with concept selection
- No better without concept selection, which shows the problems with vanilla ESA
- Robust: can be layered on top of any of the existing baseline methods
Optimising concept-ranking parameters
- How many features to use? 25
- What value of k for the top-k and bottom-k sets in the ranking algorithm? 20%
- These choices could be dataset-specific
Other Thoughts
- Combining BoW with ESA doesn't help much when one is significantly better than the other
- Definitely a lot more work/computation: are the improvements worth it? Only ~5% better than the best baseline
- The dataset has "only" 50 topics; would this work for general searching?
Questions/Comments? Related Journal Article (detailed, if you’re really keen): Egozi, Ofer, Shaul Markovitch, and Evgeniy Gabrilovich. "Concept-based information retrieval using explicit semantic analysis." ACM Transactions on Information Systems (TOIS) 29.2 (2011): 8.