Concept-Based Feature Generation and Selection for Information Retrieval OFER EGOZI, EVGENIY GABRILOVICH AND SHAUL MARKOVITCH.


1 Concept-Based Feature Generation and Selection for Information Retrieval OFER EGOZI, EVGENIY GABRILOVICH AND SHAUL MARKOVITCH

2 Problems with BoW
‒ Needs query text to appear in document
‒ Especially bad with short queries, e.g. “mobile”
‒ Using a thesaurus often makes things worse
‒ Web directories aren’t structured well enough to help
‒ Why not use the ESA we saw last week for IR too?

3 A reminder about Explicit Semantic Analysis
‒ Map text input to all Wikipedia articles
‒ Order by relevance: TF-IDF per word in an article
‒ Matrix of word to article with TF-IDF value
‒ For a query, find vectors for each word
‒ But ESA sometimes gives harmful results, e.g. for “law enforcement, dogs”:
    ‒ Contract
    ‒ Dog fighting
    ‒ Police dog
    ‒ Law enforcement in Australia
    ‒ Breed-specific legislation
    ‒ Cruelty to animals
    ‒ Seattle Police Department
    ‒ Louisiana
    ‒ Air Force Security Forces
    ‒ Royal Canadian Mounted Police
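As a rough illustration of the ESA steps on this slide, here is a minimal Python sketch. The three "articles" are a made-up toy stand-in for Wikipedia, and the names `build_esa_index` and `esa_vector` are hypothetical, not from the paper. The toy data happens to rank "Contract" first for a dog-related query, which is the kind of harmful concept the slide warns about.

```python
import math
from collections import Counter

# Toy stand-in for Wikipedia (assumption): concept title -> article text.
CONCEPTS = {
    "Police dog": "police dog trained to assist police and law enforcement",
    "Contract": "contract law agreement enforcement of a contract by law under contract law",
    "Dog fighting": "dog fighting is a blood sport between two dogs",
}

def build_esa_index(concepts):
    """Return {word: {concept: tf-idf weight}} for the given articles."""
    n = len(concepts)
    term_counts = {c: Counter(text.lower().split()) for c, text in concepts.items()}
    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter(w for counts in term_counts.values() for w in counts)
    index = {}
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(n / doc_freq[word])
            index.setdefault(word, {})[concept] = tf * idf
    return index

def esa_vector(text, index):
    """Sum the concept vectors of each word, then order concepts by weight."""
    scores = Counter()
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            scores[concept] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_esa_index(CONCEPTS)
ranking = esa_vector("law enforcement dog", index)
for concept, weight in ranking:
    print(f"{concept}: {weight:.3f}")
```

A real ESA index would also normalise the vectors and prune weak word-to-concept links; both are left out here to keep the sketch short.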

4 MORAG
‒ Combine BoW and ESA techniques
‒ Pre-processing: index each document by both BoW and ESA
‒ Index the whole document and 50-word passages
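The pre-processing step might look roughly like the sketch below. It assumes non-overlapping 50-word windows (the slide only gives the passage length) and takes the ESA mapping as a caller-supplied function `to_concepts`; both that parameter and the function names are hypothetical.

```python
from collections import Counter

def split_passages(text, size=50):
    """Cut a document into consecutive passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def index_document(doc_id, text, to_concepts, size=50):
    """Index one document by BoW and by ESA concepts (whole doc + passages)."""
    passages = split_passages(text, size)
    return {
        "id": doc_id,
        "bow": Counter(text.lower().split()),        # keyword index entry
        "doc_concepts": to_concepts(text),            # ESA vector, whole document
        "passage_concepts": [to_concepts(p) for p in passages],  # one per passage
    }

# Usage with a dummy concept mapper that just records the passage length.
text = " ".join(f"w{i}" for i in range(120))
entry = index_document("d1", text, to_concepts=lambda t: {"n_words": len(t.split())})
print(len(entry["passage_concepts"]))
```

Passing the concept mapper in as a function keeps the indexing logic independent of how the ESA vectors are actually built.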

5 MORAG (2)

6 Why rank by both documents and passages?
‒ Often a few relevant sentences can determine a document’s relevance to a query
‒ Different parts of a Wikipedia article can focus on different topics
‒ But the relevance of the whole article is still important
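One way to picture ranking by both is the sketch below. It assumes the concept score is the whole-document similarity plus the best passage similarity, and that the final score is a weighted mix with the BoW score. The weight `w`, the function names and the exact combination are assumptions for illustration; the slides do not spell out the formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {concept: weight} vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def concept_score(query_vec, doc_vec, passage_vecs):
    """Whole-document similarity plus the single best passage similarity."""
    best_passage = max((cosine(query_vec, p) for p in passage_vecs), default=0.0)
    return cosine(query_vec, doc_vec) + best_passage

def combined_score(bow_score, c_score, w=0.5):
    """Weighted mix of concept-based and BoW scores (w is hypothetical)."""
    return w * c_score + (1 - w) * bow_score

query = {"Police dog": 1.0}
doc = {"Police dog": 1.0, "Louisiana": 1.0}
passages = [{"Louisiana": 1.0}, {"Police dog": 2.0}]
score = concept_score(query, doc, passages)
print(round(score, 4))
```

Here the second passage matches the query perfectly even though the document as a whole is only a partial match, which is the situation the slide describes.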

7 Feature Selection
‒ Not all ESA features are good, so why not use only the best ones?
‒ Choose the k-best and k-worst documents by BoW rankings
‒ For each feature, find how well it separates these best and worst sets (entropy)
‒ Features to use are those that separate the sets best, i.e. with the highest information gain (largest entropy reduction)
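The selection step can be sketched as follows. It reads "entropy" on this slide as entropy reduction (information gain) when the best and worst sets are split on whether a concept is present, which is an interpretation rather than something the slide states. The `threshold` parameter and the function names are hypothetical.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` best and `neg` worst docs."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(feature, best_docs, worst_docs, threshold=0.0):
    """Entropy reduction from splitting docs on feature weight > threshold."""
    n = len(best_docs) + len(worst_docs)
    if n == 0:
        return 0.0
    best_above = sum(1 for d in best_docs if d.get(feature, 0.0) > threshold)
    worst_above = sum(1 for d in worst_docs if d.get(feature, 0.0) > threshold)
    best_below = len(best_docs) - best_above
    worst_below = len(worst_docs) - worst_above
    before = entropy(len(best_docs), len(worst_docs))
    after = ((best_above + worst_above) / n) * entropy(best_above, worst_above) \
        + ((best_below + worst_below) / n) * entropy(best_below, worst_below)
    return before - after

def select_features(candidates, best_docs, worst_docs, n_features):
    """Keep the n_features concepts that best separate the two sets."""
    ranked = sorted(candidates,
                    key=lambda f: information_gain(f, best_docs, worst_docs),
                    reverse=True)
    return ranked[:n_features]

# Toy concept vectors for the k-best and k-worst BoW results (made up).
best = [{"Police dog": 0.9}, {"Police dog": 0.7, "Contract": 0.1}]
worst = [{"Contract": 0.8}, {"Contract": 0.6}]
chosen = select_features(["Contract", "Police dog", "Louisiana"], best, worst, 1)
print(chosen)
```

Note that pure information gain also rewards concepts that mark the worst set; a fuller version might additionally require a selected concept to be more common in the best set.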

8 Results
‒ TREC-8: 528k news stories, 50 topics, human-based classification
‒ Significantly better when optimal parameters are used with concept selection
‒ No better without concept selection, which shows the problems with vanilla ESA
‒ Robust, as it can be used to improve any of the existing methods below

9 Optimising concept ranking parameters
‒ How many features to use? 25
‒ What value of k for top-k and bottom-k in the ranking algorithm? 20%
‒ Could be dataset specific

10 Other Thoughts
‒ Combining BoW with ESA isn’t great when one is significantly better than the other
‒ Definitely a lot more work/computation
‒ Are the improvements worth it? 5% better than the best baseline
‒ Dataset has “only” 50 topics: could this work on general searching?

11 Questions/Comments?
Related journal article (detailed, if you’re really keen): Egozi, Ofer, Shaul Markovitch, and Evgeniy Gabrilovich. "Concept-based information retrieval using explicit semantic analysis." ACM Transactions on Information Systems (TOIS) 29.2 (2011): 8.

