
Presentation on theme: "TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR."— Presentation transcript:

1 TREC 2009 Review Lanbo Zhang

2 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

3 67 Participating Groups

4 The new dataset: ClueWeb09 1 billion web pages in 10 languages, half in English Crawled by CMU in January and February 2009 5 TB (compressed), 25 TB (uncompressed) Subset B – 50 million English pages – Includes all Wikipedia pages The original dataset and the Indri index of subset B are available on our lab machines

5 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

6 Web Track Two tasks – Adhoc Retrieval Task – Diversity Task Return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the returned list.

7 Web Track Topic type 1: ambiguous

8 Web Track Topic type 2: faceted

9 Web Track Results of adhoc task

10 Web Track Results of diversity task

11 Waterloo at Web track Two runs – Top 10000 docs from the entire collection – Top 10000 docs from the Wikipedia set Wikipedia docs used as pseudo relevance feedback Machine learning methods re-rank the top 20000 docs and return the top 1000 Diversity task – A Naïve Bayes classifier re-ranks the top 20000 to exclude duplicates

12 MSRA at Web track Mining subtopics for a query from – Anchor texts – Search-result clusters – Sites of the search results Search-result diversification – A greedy algorithm iteratively selects the next best document
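The slide does not spell out MSRA's greedy selection step; a common instantiation of the idea is MMR-style re-ranking, which trades a document's relevance against its similarity to documents already selected. A minimal stdlib-only sketch (the trade-off weight `lam` and the sparse term-weight representation are illustrative assumptions, not MSRA's actual method):

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_diversify(docs, rel, k, lam=0.7):
    """Iteratively pick the document with the best trade-off between
    relevance and novelty w.r.t. already-selected documents.
    docs: {doc_id: term-weight dict}; rel: {doc_id: relevance score}."""
    selected = []
    candidates = set(docs)
    while candidates and len(selected) < k:
        def mmr(d):
            # Redundancy = similarity to the most similar selected doc.
            redundancy = max((cosine(docs[d], docs[s]) for s in selected),
                             default=0.0)
            return lam * rel[d] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected page loses to a less relevant but novel page, which is the behavior the diversity task rewards.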

13 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

14 Relevance Feedback Track Tasks – Phase 1: find a set of 5 documents that are good for relevance feedback. – Phase 2: develop an RF algorithm that performs retrieval based on the relevance judgments of those 5 docs.
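The track leaves the Phase 2 algorithm to the participants; a classic baseline is Rocchio feedback, which shifts the query vector toward judged-relevant documents and away from judged-nonrelevant ones. A minimal sketch over sparse term-weight dicts (the alpha/beta/gamma values are the textbook defaults, not any team's tuned settings):

```python
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio feedback. All vectors are {term: weight} dicts;
    rel_docs / nonrel_docs are lists of judged document vectors."""
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for t, w in doc.items():
                # Average contribution of the judged set, signed by coef.
                new_q[t] = new_q.get(t, 0.0) + coef * w / len(docs)
    # Drop terms whose net weight became non-positive.
    return {t: w for t, w in new_q.items() if w > 0}
```

The expanded query is then run as an ordinary retrieval query.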

15 Results of RF track: Phase 1

16 Results of RF track: Phase 2

17 UCSC at RF track Phase 1: document selection – Clustering top-ranked documents – Transductive Experimental Design (TED) Phase 2: RF algorithm – Combining different document representations Title, anchor, heading, document – Incorporating term position information Phrase match, text window match – Incorporating document similarities to labeled docs
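One simple way to combine the document representations listed above (title, anchor, heading, body) is a weighted sum of per-field term evidence; the field names and weights below are purely illustrative, not UCSC's actual values:

```python
def fielded_score(query_terms, doc_fields, field_weights):
    """Weighted per-field term-frequency evidence for one document.
    doc_fields: {field: {term: tf}}; field_weights: {field: weight}."""
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            score += weight * doc_fields.get(field, {}).get(term, 0)
    return score
```

A title match with weight 2.0 then counts twice as much as a body match, which is the usual motivation for fielded retrieval.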

18 UMass at RF track A supervised method to estimate the weights of expanded terms for RF Training collection: wt10g Term features given a query: – Term frequency in FB docs and the entire collection – Co-occurrence with query terms – Term proximity to query terms – Document frequency

19 UMass at RF track Model: Boosting

20 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

21 Entity Track Task – Given an input entity, find the related entities Return 100 related entities and their homepages

22 Results of Entity track

23 Purdue at Entity track Entity Extraction – Hierarchical Relevance Model – Three levels of relevance: document, passage, entity

24 Purdue at Entity track Homepage Finding for Entities – Logistic Regression model

25 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

26 Blog Track Tasks – Faceted Blog Distillation – Top Stories Identification Collection: Blogs08 – Crawled between 01/14/2008 and 02/10/2009 – 1.3 million unique blogs

27 Blog Track Task 1: Faceted Blog Distillation – Given a topic and the faceted restriction, find the relevant blogs. – Facets Opinionated vs. Factual Personal vs. Official In-depth vs. Shallow – Topic example

28 Blog Track Task 2: Top Stories Identification – Given a date, find the hottest news headlines for that day and select the relevant and diverse blog posts for those headlines – News headlines from New York Times used – Topic example

29 Results of Blog track Faceted Blog Distillation

30 Results of Blog track Top Stories Identification – Find the hottest news headlines – Identify the related blog posts

31 BUPT at Blog track Faceted Blog Distillation – Scoring function: the title section of a topic plus automatically selected terms from the DESC and NARR sections Phrase match – Facets analysis Opinionated vs. Factual: a sentiment analysis model Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer) In-depth vs. Shallow: post length – Linear combination of the above two parts
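The final step linearly combines the topical relevance score with the facet-analysis score; a minimal sketch (the mixing weight is an assumption for illustration, not the value BUPT tuned):

```python
def rank_blogs(scored_blogs, w=0.8):
    """Rank blogs by a linear combination of topical score and facet
    confidence. scored_blogs: list of (blog_id, topical, facet_conf)."""
    return sorted(scored_blogs,
                  key=lambda b: w * b[1] + (1 - w) * b[2],
                  reverse=True)
```

With a high `w`, topical relevance dominates and the facet score only breaks near-ties, which matches the spirit of facet filtering on top of ad hoc retrieval.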

32 Univ. of Glasgow at Blog track Top Stories Identification – The model: – Incorporating the following days – Using Wikipedia to enrich news headline terms and keep the top 10 terms for each headline

33 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

34 Legal Track Tasks – Interactive task (Enron email collection) Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and to judge the relevance of sample docs – Batch task (IIT CDIP 1.0) Retrieval with relevance evidence (RF)

35 Results of Legal track

36 Waterloo at Legal track Interactive task – Phase 1: interactive search and judging To find a large and diverse set of training examples – Phase 2: interactive learning To find more potentially relevant documents Batch task – Run three spam filters on every document: an on-line logistic regression filter, a Naïve Bayes spam filter, and an on-line version of the BM25 RF method
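An on-line logistic regression filter of this kind can be sketched as a single stochastic-gradient pass over labeled examples; the feature representation and learning rate below are illustrative, not Waterloo's actual configuration:

```python
import math

def train_online_lr(stream, lr=0.1):
    """One SGD update per (features, label) example, in arrival order.
    features: {name: value}; label: 1 (relevant) or 0 (not)."""
    w = {}
    for features, label in stream:
        z = sum(w.get(f, 0.0) * v for f, v in features.items())
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
        g = label - p                        # gradient of log-loss
        for f, v in features.items():
            w[f] = w.get(f, 0.0) + lr * g * v
    return w

def predict(w, features):
    """Probability that a document is relevant under weights w."""
    z = sum(w.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Because the model updates one example at a time, it can score every document in a large collection in a single streaming pass, which is what makes this family of filters attractive for the batch task.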

37 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

38 Million Query Track Tasks – Adhoc retrieval for 40000 queries – Predict query types Query intent: Precision-oriented vs. Recall-oriented Query difficulty: Hard vs. Easy Precision-oriented – Navigational: Find a specific URL or web page. – Closed: Find a short, unambiguous answer to a specific question. – Resource: Locate a web-based resource or download. Recall-oriented – Open: Answer an open-ended question, or find all available information about a topic. – Advice: Find advice or ideas regarding a general question or problem. – List: Find a list of results that will help satisfy an open-ended goal.

39 Results of Million Query track Precision vs. Recall Hard vs. Easy

40 Northeastern Univ. at MQ track Query-specific learning to rank – Learn different ranking functions for queries in different classes Using SVM to classify queries – Training data: MQ 2008 dataset Features – Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, Language Models (Laplace, Dirichlet, JM). – Field features: title, heading, anchor text, and URL – Web graph features
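Among the document features listed above, BM25 is fully standard; for reference, the per-term score with the usual parameter defaults (k1=1.2, b=0.75):

```python
import math

def bm25(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """BM25 score of one query term in one document.
    tf: term freq in the doc; df: doc freq of the term; N: collection
    size; dl: doc length; avgdl: average doc length."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The score grows sublinearly with term frequency (saturating via k1) and shrinks as the term becomes more common in the collection, which is why it appears alongside raw TF and IDF as a separate feature.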

41 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

42 Chemical IR Track Tasks – Technical Survey Task Retrieve documents in response to each topic given by chemical patent experts – Prior Art Search Task Find relevant patents with respect to a set of 1000 existing patents

43 Results of Chemical track

44 Geneva at Chemical track Document Representation: – Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references Exploiting Citation Networks – Query expansion using chemical annotations Filtering based on IPC codes Re-ranking based on claims
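The IPC-code filtering step can be read as keeping only retrieved patents that share an IPC class with the query patent; a minimal sketch with illustrative field names (the slide does not give the exact rule):

```python
def filter_by_ipc(ranked_patents, query_ipc_codes):
    """Keep retrieved patents sharing at least one IPC class with the
    query patent. ranked_patents: list of {"id": ..., "ipc": [...]}."""
    wanted = set(query_ipc_codes)
    return [p for p in ranked_patents if wanted & set(p["ipc"])]
```

For prior-art search this acts as a cheap precision filter: a retrieved patent in a completely different IPC class is unlikely to be citable prior art.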

