Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones.


1 Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web
  Rayid Ghani, Accenture Technology Labs, USA
  Rosie Jones, Carnegie Mellon University, USA
  Dunja Mladenic, J. Stefan Institute, Slovenia

2 Motivation
  Need a collection of documents matching a particular concept
  Search on the web, modify query, analyze documents, modify query, …
  Repetitive, time-consuming, requires reasonable familiarity with the concept

3 Task
  Given:
    1 document in the target concept
    1 other document (negative example)
    Access to a Web search engine
  Create a corpus of the target concept quickly, with no human effort

4 Algorithm (diagram; component labels: Seed Docs, Query Generator, WWW, Filter/Classifier)

5 (diagram; component labels: Web, Word Statistics, Initial Docs, Build Query, Filter, Relevant, Non-Relevant, Learning)

6 Query Generation
  Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones
  A query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
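As an illustration of the query format on this slide, here is a minimal Python sketch that assembles a query string from inclusion and exclusion terms. The function name `build_query` is hypothetical, and the `+`/`-` prefixes simply follow the slide's example; the exact syntax a given search engine accepts may differ.

```python
def build_query(inclusion_terms, exclusion_terms):
    """Join m inclusion terms and n exclusion terms into one query string.

    Inclusion terms are prefixed with '+', exclusion terms with '-',
    mirroring the '+intelligence +web -military' example on the slide.
    """
    included = [f"+{term}" for term in inclusion_terms]
    excluded = [f"-{term}" for term in exclusion_terms]
    return " ".join(included + excluded)


print(build_query(["intelligence", "web"], ["military"]))
```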

7 Query Term Selection Methods
  Uniform (UN): select k words randomly from the current vocabulary
  Term-Frequency (TF): select the top k words ranked according to their frequency
  Probabilistic TF (PTF): select k words with probability proportional to their frequency
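The three selection methods on this slide could be sketched as follows. This is a rough illustration rather than the authors' implementation: the function names are hypothetical, word counts are assumed to be held in a `collections.Counter` with positive counts, and sampling without replacement in PTF is an assumption the slide does not state.

```python
import random
from collections import Counter


def select_uniform(counts, k, rng):
    """UN: pick k distinct words uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts, k):
    """TF: pick the k most frequent words."""
    return [word for word, _ in counts.most_common(k)]


def select_ptf(counts, k, rng):
    """PTF: draw k distinct words, each draw proportional to frequency.

    Assumption: words are drawn without replacement so a query never
    repeats a term.
    """
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        weights = [pool[w] for w in words]
        pick = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


counts = Counter("dan je lep dan je dan".split())
rng = random.Random(0)
print(select_tf(counts, 2))
print(select_uniform(counts, 2, rng))
print(select_ptf(counts, 2, rng))
```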

8 Query Term Selection Methods
  RTFIDF: select the top k words according to their RTFIDF scores
  Odds-Ratio (OR): select the top k words according to their odds-ratio scores
  Probabilistic OR (POR): select k words with probability proportional to their odds-ratio scores
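The slides do not give the scoring formulas. As a sketch of the Odds-Ratio method only, the code below uses a commonly cited log odds-ratio definition, log[P(w|rel)(1 - P(w|non)) / ((1 - P(w|rel)) P(w|non))], estimated from document frequencies. The smoothing constant and the function names are assumptions made for illustration, not details taken from the talk.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Score each word by a smoothed log odds ratio.

    Each document is a list of tokens. Probabilities are estimated from
    document frequencies; `smoothing` (an assumed value) keeps them
    strictly between 0 and 1 so the logarithm is always defined.
    """
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    df_rel = Counter(w for doc in relevant_docs for w in set(doc))
    df_non = Counter(w for doc in nonrelevant_docs for w in set(doc))
    scores = {}
    for word in set(df_rel) | set(df_non):
        p_rel = (df_rel[word] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (df_non[word] + smoothing) / (n_non + 2 * smoothing)
        scores[word] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def select_top_k(scores, k):
    """OR: the k highest-scoring words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


relevant = [["slovenija", "je"], ["je", "lepa"]]
nonrelevant = [["the", "is"], ["is", "nice"]]
scores = odds_ratio_scores(relevant, nonrelevant)
print(select_top_k(scores, 1))
```

Words typical of the relevant set receive positive scores and are candidates for inclusion terms; reversing the roles of the two sets gives candidates for exclusion terms.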

9 Query Parameters
  4 parameters:
    Inclusion term-selection method
    Exclusion term-selection method
    Inclusion length
    Exclusion length
  Example: Odds-Ratio, RTFIDF, 3, 6

10 Experimental Setup
  Language: Slovenian
  Initial documents: 1 web page in Slovenian, 1 in English
  Search engine: Altavista

11 Evaluation
  Goal: collect as many relevant documents as possible while minimizing the cost
  Cost:
    Total number of documents retrieved from the Web
    Number of distinct queries issued to the search engine
  Evaluation measures:
    Percentage of retrieved documents that are relevant
    Number of relevant documents retrieved per unique query
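The two evaluation measures can be computed directly from a log of retrieved documents. The sketch below assumes a hypothetical representation: a list of booleans marking whether each retrieved document was judged relevant, plus a count of unique queries issued.

```python
def evaluation_measures(retrieved_is_relevant, num_unique_queries):
    """Return (percent relevant, relevant documents per unique query).

    retrieved_is_relevant: one boolean per document retrieved from the Web.
    num_unique_queries: number of distinct queries sent to the search engine.
    """
    total = len(retrieved_is_relevant)
    relevant = sum(retrieved_is_relevant)
    percent_relevant = 100.0 * relevant / total if total else 0.0
    per_query = relevant / num_unique_queries if num_unique_queries else 0.0
    return percent_relevant, per_query


# Four documents retrieved by two unique queries, three judged relevant.
print(evaluation_measures([True, True, False, True], 2))  # (75.0, 1.5)
```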

12 Fixed Query Parameters
  Fix query lengths and vary term-selection methods
  Fix term-selection methods and vary query lengths
  Results (Ghani et al., SIGIR 2001):
    Odds-Ratio works well overall
    Long queries are precise but have low recall

13 Why Online Learning?
  Different term-selection methods excel with different query lengths
  The best combination of methods and lengths may change as different parts of the Web/feature space are explored

14 Learning Methods
  Memory-Less (ML) Learning: ignore all history and use only the current performance
  Long-Term Memory (LT) Learning: use all of the previous history
    Additive update rule
    Multiplicative update rule
  Fading Memory (FM) Learning: use all of the history, but with a decay function over time
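The slide names the learning methods without giving their update formulas. One plausible reading, offered purely as an assumption, is that each query-parameter setting carries a weight that is updated from a per-query reward (for example, the fraction of retrieved documents that were relevant). The function names, the learning rate `eta`, and the `decay` factor below are all hypothetical.

```python
def memoryless_update(weight, reward):
    """ML: ignore history; the weight is just the latest reward."""
    return reward


def long_term_additive(weight, reward):
    """LT, additive rule: accumulate every reward seen so far."""
    return weight + reward


def long_term_multiplicative(weight, reward, eta=0.5):
    """LT, multiplicative rule: scale the weight up by the reward.

    eta is an assumed learning rate.
    """
    return weight * (1 + eta * reward)


def fading_memory(weight, reward, decay=0.5):
    """FM: keep all history but discount older rewards geometrically.

    decay is an assumed discount factor between 0 and 1.
    """
    return decay * weight + reward


def run(update, rewards, initial):
    """Apply an update rule over a sequence of rewards."""
    weight = initial
    for reward in rewards:
        weight = update(weight, reward)
    return weight


rewards = [1.0, 0.0, 1.0]
print(run(memoryless_update, rewards, 0.0))         # 1.0
print(run(long_term_additive, rewards, 0.0))        # 2.0
print(run(long_term_multiplicative, rewards, 1.0))  # 2.25
print(run(fading_memory, rewards, 0.0))             # 1.25
```

Under this reading, the weights would then steer which term-selection methods and query lengths are tried next, with higher-weighted settings chosen more often.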

15 Results (figure; series labels: LTM, Memory-Less)

16 Results

17 Further Experiments
  Other languages: similar results with Croatian, Czech and Tagalog
  Keywords: similar results when initializing with keywords instead of documents
  Comparison to Altavista’s “More Like This”: better performance than Altavista’s feature

18 Conclusions
  Successfully built corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines
  Online learning is useful in adapting to different parts of the Web space
  System and corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder

19 Algorithm
  1. Initialization
  2. Generate query terms from relevant and non-relevant documents
  3. Retrieve a document using the query from Step 2
  4. Use the language filter to add the new document to the relevant or non-relevant set of documents
  5. Update frequencies and scores
  6. Return to Step 2
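The six steps above can be sketched as a simple loop. Everything concrete here is a stand-in chosen for illustration: `TOY_WEB` replaces a real search engine, `toy_filter` replaces the language filter, and `make_query` uses a deliberately tiny one-term-in, one-term-out query rather than any of the term-selection methods from the talk.

```python
from collections import Counter


def top_unique_word(docs, other_docs):
    """Most frequent word in docs that never occurs in other_docs, or None."""
    other_vocab = {w for d in other_docs for w in d.split()}
    counts = Counter(w for d in docs for w in d.split() if w not in other_vocab)
    return counts.most_common(1)[0][0] if counts else None


def make_query(relevant, nonrelevant):
    """Step 2: one inclusion term and one exclusion term."""
    include = top_unique_word(relevant, nonrelevant)
    exclude = top_unique_word(nonrelevant, relevant)
    return ([include] if include else []), ([exclude] if exclude else [])


def build_corpus(seed_relevant, seed_nonrelevant, search, is_relevant, max_queries=10):
    relevant, nonrelevant = [seed_relevant], [seed_nonrelevant]  # Step 1
    seen = {seed_relevant, seed_nonrelevant}
    for _ in range(max_queries):                                 # Step 6: loop
        # Steps 2 and 5: statistics are recomputed from the current sets.
        include, exclude = make_query(relevant, nonrelevant)
        doc = search(include, exclude, seen)                     # Step 3
        if doc is None:
            break
        seen.add(doc)
        # Step 4: the filter decides which set the new document joins.
        (relevant if is_relevant(doc) else nonrelevant).append(doc)
    return relevant, nonrelevant


# Hypothetical stand-ins for the Web and the language filter.
TOY_WEB = ["dober dan slovenija", "good day everyone", "lep dan danes", "nice day today"]


def toy_search(include, exclude, seen):
    """Return the first unseen document matching the query, or None."""
    for doc in TOY_WEB:
        words = set(doc.split())
        if (doc not in seen and all(t in words for t in include)
                and not any(t in words for t in exclude)):
            return doc
    return None


def toy_filter(doc):
    """Pretend language filter: a document is relevant if it contains 'dan'."""
    return "dan" in doc.split()


relevant, nonrelevant = build_corpus("dan je lep", "the day is nice", toy_search, toy_filter)
print(relevant)
print(nonrelevant)
```

Recomputing the word statistics from scratch on every iteration keeps the sketch short; an incremental update, as Step 5 suggests, would avoid rescanning all documents.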

20 1. Initialize
  Given documents in the target and non-target languages
  Calculate various statistics over the words in each set

21 3. Language Filter
  TextCat (van Noord’s implementation), trained on a handful of documents
  Manually evaluated by sampling 100 Slovenian documents; found to be 99% accurate
  Contains models for 60 languages

