1
Concept-Based Feature Generation and Selection for Information Retrieval
Ofer Egozi, Evgeniy Gabrilovich and Shaul Markovitch
2
Problems with BoW
‒ Needs the query text to appear in the document
‒ Especially bad with short queries, e.g. "mobile"
‒ Using a thesaurus often makes things worse
‒ Web directories aren't structured well enough to help
Why not use the ESA we saw last week for IR too?
3
A reminder about Explicit Semantic Analysis
‒ Map text input to all Wikipedia articles
‒ Order by relevance: TF-IDF per word in an article
‒ Matrix of word to article with TF-IDF values
‒ For a query, find the concept vector for each word
But ESA sometimes gives harmful results. Top concepts for "law enforcement, dogs":
‒ Contract
‒ Dog fighting
‒ Police dog
‒ Law enforcement in Australia
‒ Breed-specific legislation
‒ Cruelty to animals
‒ Seattle police department
‒ Louisiana
‒ Air force security forces
‒ Royal Canadian mounted police
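As a rough illustration of the concept-vector step just described, here is a minimal Python sketch. It assumes a precomputed word-to-article TF-IDF mapping (called word_concept_tfidf here, a hypothetical name); it is not the authors' implementation.

```python
from collections import defaultdict

def esa_vector(text, word_concept_tfidf):
    """Map a text to a weighted vector over Wikipedia concepts (articles)."""
    concept_weights = defaultdict(float)
    for word in text.lower().split():
        # Each word contributes its TF-IDF weight for every article it appears in.
        for concept, weight in word_concept_tfidf.get(word, {}).items():
            concept_weights[concept] += weight
    # Return concepts ordered by relevance, as in the example list above.
    return sorted(concept_weights.items(), key=lambda kv: -kv[1])

# e.g. esa_vector("law enforcement dogs", word_concept_tfidf)[:10]
# might return useful concepts like "Police dog", but also noisy ones like "Contract".
```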
4
MORAG
‒ Combine BoW and ESA techniques.
‒ Pre-processing: index each document by both BoW and ESA.
‒ Index the whole document and 50-word passages.
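A hedged sketch of this pre-processing step, building on the esa_vector sketch above; bow_index and esa_index are hypothetical index objects with an add(id, content) method, not part of the paper's code.

```python
PASSAGE_LEN = 50  # the 50-word passages mentioned on the slide

def passages(words, length=PASSAGE_LEN):
    """Split a document into consecutive fixed-length word windows."""
    for start in range(0, len(words), length):
        yield words[start:start + length]

def index_document(doc_id, text, bow_index, esa_index, word_concept_tfidf):
    words = text.lower().split()
    # Index the whole document under both representations.
    bow_index.add(doc_id, words)
    esa_index.add(doc_id, esa_vector(text, word_concept_tfidf))
    # Also index each 50-word passage, so a few relevant sentences can still match.
    for i, passage in enumerate(passages(words)):
        pid = f"{doc_id}#p{i}"
        bow_index.add(pid, passage)
        esa_index.add(pid, esa_vector(" ".join(passage), word_concept_tfidf))
```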
5
MORAG (2)
6
Why rank by both documents and passages?
‒ Often a few relevant sentences can determine a document's relevance to a query.
‒ Different parts of a Wikipedia article can focus on different topics.
‒ But the relevance of the whole article is still important.
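One plausible way to combine the two levels of evidence is to interpolate the whole-document score with the best passage score, as in the sketch below; the interpolation weight is an assumption for illustration, not a value from the paper.

```python
def combined_score(doc_score, passage_scores, alpha=0.5):
    """Blend the whole-document score with the best passage score."""
    best_passage = max(passage_scores) if passage_scores else 0.0
    return alpha * doc_score + (1 - alpha) * best_passage
```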
7
Feature Selection
‒ Not all ESA features are good, so why not use only the best ones?
‒ Choose the k best and k worst documents by BoW ranking.
‒ For each feature, measure how well it separates these best and worst sets (entropy-based).
‒ Use the features that best separate the two sets.
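A minimal sketch of this selection idea: rank documents by BoW, treat the top k as pseudo-relevant and the bottom k as pseudo-non-relevant, and keep the concepts that best tell the two sets apart. For simplicity the separation score here is a difference of mean weights rather than the entropy-based measure on the slide; the function and variable names are hypothetical.

```python
def select_concepts(query_concepts, bow_ranking, esa_vectors, k, num_features):
    """Keep the query concepts that best separate top- and bottom-ranked docs."""
    top_docs = bow_ranking[:k]       # k best documents by BoW score
    bottom_docs = bow_ranking[-k:]   # k worst documents by BoW score

    def separation(concept):
        # How differently does this concept weight the two sets on average?
        top = sum(esa_vectors[d].get(concept, 0.0) for d in top_docs) / k
        bottom = sum(esa_vectors[d].get(concept, 0.0) for d in bottom_docs) / k
        return abs(top - bottom)

    ranked = sorted(query_concepts, key=separation, reverse=True)
    return ranked[:num_features]     # e.g. 25 features, per the later slide
```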
8
Results
‒ TREC-8: 528k news stories, 50 topics, human relevance judgments.
‒ Significantly better when optimal parameters are used with concept selection.
‒ No better without concept selection, which shows the problems with vanilla ESA.
‒ Robust: can be used to improve any of the existing baseline methods.
9
Optimising concept ranking parameters
‒ How many features to use? 25
‒ What value of k for the top-k and bottom-k in the ranking algorithm? 20%
‒ Could be dataset-specific.
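For reference, the tuned values from this slide could be held in a small settings dict like the one below; the dict and its key names are just an illustrative way to record them.

```python
MORAG_PARAMS = {
    "num_concept_features": 25,   # concepts kept per query after selection
    "top_bottom_fraction": 0.20,  # top/bottom 20% of BoW results used as k
}
```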
10
Other Thoughts
‒ Combining BoW with ESA isn't great when one is significantly better than the other.
‒ Definitely a lot more work/computation. Are the improvements worth it? About 5% better than the best baseline.
‒ Dataset has "only" 50 topics – could this work for general search?
11
Questions/Comments? Related Journal Article (detailed, if you’re really keen): Egozi, Ofer, Shaul Markovitch, and Evgeniy Gabrilovich. "Concept-based information retrieval using explicit semantic analysis." ACM Transactions on Information Systems (TOIS) 29.2 (2011): 8.