
Presentation on theme: "TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR."— Presentation transcript:

1 TREC 2009 Review Lanbo Zhang

2 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

3 67 Participating Groups

4 The new dataset: ClueWeb09 1 billion web pages in 10 languages, half in English Crawled by CMU in January and February 2009 5 TB (compressed), 25 TB (uncompressed) Subset B – 50 million English pages – Includes all Wikipedia pages The original dataset and the Indri index of subset B are available on our lab machines

5 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

6 Web Track Two tasks – Adhoc Retrieval Task – Diversity Task Return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the returned list.

7 Web Track Topic type 1: ambiguous

8 Web Track Topic type 2: faceted

9 Web Track Results of adhoc task

10 Web Track Results of diversity task

11 Waterloo at Web track Two runs – Top 10000 docs from the entire collection – Top 10000 docs from the Wikipedia set Wikipedia docs used as pseudo relevance feedback Machine learning methods re-rank the top 20000 docs and return the top 1000 Diversity task – A Naïve Bayes classifier re-ranks the top 20000 to exclude duplicates

12 MSRA at Web track Mining subtopics for a query from – Anchor texts – Search-result clusters – Sites of the search results Search-result diversification – A greedy algorithm iteratively selects the next best document
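The slide does not spell out MSRA's greedy selection step; a common instantiation of the idea is MMR-style re-ranking, which trades a document's relevance against its similarity to documents already selected. A minimal stdlib-only sketch (the trade-off weight `lam` and the sparse term-weight representation are illustrative assumptions, not MSRA's actual method):

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_diversify(docs, rel, k, lam=0.7):
    """Iteratively pick the document with the best trade-off between
    relevance and novelty w.r.t. already-selected documents.
    docs: {doc_id: term-weight dict}; rel: {doc_id: relevance score}."""
    selected = []
    candidates = set(docs)
    while candidates and len(selected) < k:
        def mmr(d):
            # Redundancy = similarity to the most similar selected doc.
            redundancy = max((cosine(docs[d], docs[s]) for s in selected),
                             default=0.0)
            return lam * rel[d] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected page loses to a less relevant but novel page, which is the behavior the diversity task rewards.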

13 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

14 Relevance Feedback Track Tasks – Phase 1: find a set of 5 documents that are good for relevance feedback. – Phase 2: develop an RF algorithm that performs retrieval based on the relevance judgments of those 5 docs.
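The track leaves the Phase 2 algorithm to the participants; a classic baseline is Rocchio feedback, which shifts the query vector toward judged-relevant documents and away from judged-nonrelevant ones. A minimal sketch over sparse term-weight dicts (the alpha/beta/gamma values are the textbook defaults, not any team's tuned settings):

```python
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio feedback. All vectors are {term: weight} dicts;
    rel_docs / nonrel_docs are lists of judged document vectors."""
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for t, w in doc.items():
                # Average contribution of the judged set, signed by coef.
                new_q[t] = new_q.get(t, 0.0) + coef * w / len(docs)
    # Drop terms whose net weight became non-positive.
    return {t: w for t, w in new_q.items() if w > 0}
```

The expanded query is then run as an ordinary retrieval query.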

15 Results of RF track: Phase 1

16 Results of RF track: Phase 2

17 UCSC at RF track Phase 1: document selection – Clustering top-ranked documents – Transductive Experimental Design (TED) Phase 2: RF algorithm – Combining different document representations Title, anchor, heading, document – Incorporating term position information Phrase match, text window match – Incorporating document similarities to labeled docs
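One simple way to combine the document representations listed above (title, anchor, heading, body) is a weighted sum of per-field term evidence; the field names and weights below are purely illustrative, not UCSC's actual values:

```python
def fielded_score(query_terms, doc_fields, field_weights):
    """Weighted per-field term-frequency evidence for one document.
    doc_fields: {field: {term: tf}}; field_weights: {field: weight}."""
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            score += weight * doc_fields.get(field, {}).get(term, 0)
    return score
```

A title match with weight 2.0 then counts twice as much as a body match, which is the usual motivation for fielded retrieval.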

18 UMass at RF track A supervised method to estimate the weights of expanded terms for RF Training collection: wt10g Term features given a query: – Term frequency in FB docs and the entire collection – Co-occurrence with query terms – Term proximity to query terms – Document frequency

19 UMass at RF track Model: Boosting

20 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

21 Entity Track Task – Given an input entity, find the related entities Return 100 related entities and their homepages

22 Results of Entity track

23 Purdue at Entity track Entity Extraction – Hierarchical Relevance Model – Three levels of relevance: document, passage, entity

24 Purdue at Entity track Homepage Finding for Entities – Logistic Regression model

25 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

26 Blog Track Tasks – Faceted Blog Distillation – Top Stories Identification Collection: Blogs08 – Crawled between 01/14/2008 and 02/10/2009 – 1.3 million unique blogs

27 Blog Track Task 1: Faceted Blog Distillation – Given a topic and the faceted restriction, find the relevant blogs. – Facets Opinionated vs. Factual Personal vs. Official In-depth vs. Shallow – Topic example

28 Blog Track Task 2: Top Stories Identification – Given a date, find the hottest news headlines for that day and select the relevant and diverse blog posts for those headlines – News headlines from New York Times used – Topic example

29 Results of Blog track Faceted Blog Distillation

30 Results of Blog track Top Stories Identification – Find the hottest news headlines – Identify the related blog posts

31 BUPT at Blog track Faceted Blog Distillation – Scoring function: the title section of a topic plus automatically selected terms from the DESC and NARR sections Phrase match – Facets analysis Opinionated vs. Factual: a sentiment analysis model Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer) In-depth vs. Shallow: post length – Linear combination of the above two parts
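The final step linearly combines the topical relevance score with the facet-analysis score; a minimal sketch (the mixing weight is an assumption for illustration, not the value BUPT tuned):

```python
def rank_blogs(scored_blogs, w=0.8):
    """Rank blogs by a linear combination of topical score and facet
    confidence. scored_blogs: list of (blog_id, topical, facet_conf)."""
    return sorted(scored_blogs,
                  key=lambda b: w * b[1] + (1 - w) * b[2],
                  reverse=True)
```

With a high `w`, topical relevance dominates and the facet score only breaks near-ties, which matches the spirit of facet filtering on top of ad hoc retrieval.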

32 Univ. of Glasgow at Blog track Top Stories Identification – The model: – Incorporating the following days – Using Wikipedia to enrich news headline terms and keep the top 10 terms for each headline

33 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

34 Legal Track Tasks – Interactive task (Enron email collection) Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and to judge the relevance of sample docs – Batch task (IIT CDIP 1.0) Retrieval with relevance evidence (RF)

35 Results of Legal track

36 Waterloo at Legal track Interactive task – Phase 1: interactive search and judging To find a large and diverse set of training examples – Phase 2: interactive learning To find more potentially relevant documents Batch task – Run three spam filters on every document: an on-line logistic regression filter, a Naïve Bayes spam filter, and an on-line version of the BM25 RF method
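An on-line logistic regression filter of this kind can be sketched as a single stochastic-gradient pass over labeled examples; the feature representation and learning rate below are illustrative, not Waterloo's actual configuration:

```python
import math

def train_online_lr(stream, lr=0.1):
    """One SGD update per (features, label) example, in arrival order.
    features: {name: value}; label: 1 (relevant) or 0 (not)."""
    w = {}
    for features, label in stream:
        z = sum(w.get(f, 0.0) * v for f, v in features.items())
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
        g = label - p                        # gradient of log-loss
        for f, v in features.items():
            w[f] = w.get(f, 0.0) + lr * g * v
    return w

def predict(w, features):
    """Probability that a document is relevant under weights w."""
    z = sum(w.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Because the model updates one example at a time, it can score every document in a large collection in a single streaming pass, which is what makes this family of filters attractive for the batch task.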

37 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

38 Million Query Track Tasks – Adhoc retrieval for 40000 queries – Predict query types Query intent: Precision-oriented vs. Recall-oriented Query difficulty: Hard vs. Easy Precision-oriented – Navigational: Find a specific URL or web page. – Closed: Find a short, unambiguous answer to a specific question. – Resource: Locate a web-based resource or download. Recall-oriented – Open: Answer an open-ended question, or find all available information about a topic. – Advice: Find advice or ideas regarding a general question or problem. – List: Find a list of results that will help satisfy an open-ended goal.

39 Results of Million Query track Precision vs. Recall Hard vs. Easy

40 Northeastern Univ. at MQ track Query-specific learning to rank – Learn different ranking functions for queries in different classes Using SVM to classify queries – Training data: MQ 2008 dataset Features – Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, Language Models (Laplace, Dirichlet, JM). – Field features: title, heading, anchor text, and URL – Web graph features
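Among the document features listed above, BM25 is fully standard; for reference, the per-term score with the usual parameter defaults (k1=1.2, b=0.75):

```python
import math

def bm25(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """BM25 score of one query term in one document.
    tf: term freq in the doc; df: doc freq of the term; N: collection
    size; dl: doc length; avgdl: average doc length."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The score grows sublinearly with term frequency (saturating via k1) and shrinks as the term becomes more common in the collection, which is why it appears alongside raw TF and IDF as a separate feature.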

41 Tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR track

42 Chemical IR Track Tasks – Technical Survey Task Retrieve documents in response to each topic given by chemical patent experts – Prior Art Search Task Find relevant patents with respect to a set of 1000 existing patents

43 Results of Chemical track

44 Geneva at Chemical track Document Representation: – Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references Exploiting Citation Networks – Query expansion using chemical annotations Filtering based on IPC codes Re-ranking based on claims
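The IPC-code filtering step can be read as keeping only retrieved patents that share an IPC class with the query patent; a minimal sketch with illustrative field names (the slide does not give the exact rule):

```python
def filter_by_ipc(ranked_patents, query_ipc_codes):
    """Keep retrieved patents sharing at least one IPC class with the
    query patent. ranked_patents: list of {"id": ..., "ipc": [...]}."""
    wanted = set(query_ipc_codes)
    return [p for p in ranked_patents if wanted & set(p["ipc"])]
```

For prior-art search this acts as a cheap precision filter: a retrieved patent in a completely different IPC class is unlikely to be citable prior art.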

