TREC 2009 Review
Lanbo Zhang
7 tracks
– Web track
– Relevance Feedback track (RF)
– Entity track
– Blog track
– Legal track
– Million Query track (MQ)
– Chemical IR track
67 Participating Groups
The new dataset: ClueWeb09
– 1 billion web pages in 10 languages; about half are in English
– Crawled by CMU in January and February 2009
– 5 TB compressed, 25 TB uncompressed
– Subset B: 50 million English pages, including all Wikipedia pages
– The original dataset and the Indri index of subset B are available on our lab machines
Web Track
Two tasks
– Adhoc Retrieval Task
– Diversity Task: return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the returned list
Web Track Topic type 1: ambiguous
Web Track Topic type 2: faceted
Web Track Results of adhoc task
Web Track Results of diversity task
Waterloo at Web track
Two runs
– Top 10,000 docs from the entire collection
– Top 10,000 docs from the Wikipedia subset
Wikipedia docs used as pseudo-relevance feedback
Machine learning methods re-rank the top 20,000 docs and return the top 1,000
Diversity task
– A Naïve Bayes classifier re-ranks the top 20,000 docs to exclude duplicates
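The pseudo-relevance-feedback step can be illustrated with a minimal sketch. This is not Waterloo's actual system; it only shows the general idea of expanding a query with frequent terms from top-ranked Wikipedia pages. The scoring-free term selection, the number of feedback docs, and the number of expansion terms are all assumptions.

```python
from collections import Counter

def expand_query(query_terms, wikipedia_hits, n_fb_docs=10, n_exp_terms=20):
    """Toy pseudo-relevance feedback: treat the top-ranked Wikipedia pages as
    relevant and add their most frequent terms to the query.
    `wikipedia_hits` is a ranked list of token lists."""
    counts = Counter()
    for doc in wikipedia_hits[:n_fb_docs]:
        counts.update(doc)
    # Drop terms already in the query, keep the most frequent remaining ones.
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:n_exp_terms]
    return list(query_terms) + expansion

# The expanded query would then be re-issued against the full ClueWeb09 index.
print(expand_query(["obama", "family", "tree"],
                   [["obama", "genealogy", "ancestry", "family"],
                    ["ancestry", "kenya", "dunham", "genealogy"]]))
```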
MSRA at Web track
Mining subtopics for a query from
– Anchor texts
– Search-result clusters
– Sites of the search results
Search results diversification
– A greedy algorithm iteratively selects the next best document
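A greedy diversification loop of this kind is often written as maximal-marginal-relevance-style selection. The sketch below is a generic version under that assumption, not MSRA's exact objective: `relevance` and `similarity` are placeholder functions and the trade-off weight is made up.

```python
def greedy_diversify(candidates, relevance, similarity, k=10, lam=0.5):
    """Iteratively pick the document that best trades off query relevance
    against redundancy with the documents already selected (MMR-style)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def marginal_gain(doc):
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance(doc) - (1 - lam) * redundancy
        best = max(remaining, key=marginal_gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```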
Relevance Feedback Track
Tasks
– Phase 1: find a set of 5 documents that are good for relevance feedback
– Phase 2: develop an RF algorithm that performs retrieval based on the relevance judgments of those 5 docs
Results of RF track: Phase 1
Results of RF track: Phase 2
UCSC at RF track
Phase 1: document selection
– Clustering top-ranked documents
– Transductive Experimental Design (TED)
Phase 2: RF algorithm
– Combining different document representations: title, anchor, heading, document
– Incorporating term position information: phrase match, text-window match
– Incorporating document similarities to labeled docs
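The clustering half of Phase 1 could look roughly like the sketch below: cluster the top-ranked documents and take one representative per cluster as a small but diverse feedback set. This is only an illustration (TED is not shown), and the use of TF-IDF vectors with k-means is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def pick_feedback_docs(top_docs, n_feedback=5):
    """Cluster the top-ranked documents and return the index of the document
    closest to each cluster centroid, giving a diverse 5-doc feedback set."""
    X = TfidfVectorizer(stop_words="english").fit_transform(top_docs)
    km = KMeans(n_clusters=n_feedback, n_init=10, random_state=0).fit(X)
    picks = []
    for c in range(n_feedback):
        members = np.where(km.labels_ == c)[0]
        # Representative = cluster member closest to the centroid.
        dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
        picks.append(int(members[dists.argmin()]))
    return picks  # indices into top_docs
```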
UMass at RF track
A supervised method to estimate the weights of expanded terms for RF
Training collection: wt10g
Term features given a query:
– Term frequency in the feedback docs and in the entire collection
– Co-occurrence with query terms
– Term proximity to query terms
– Document frequency
UMass at RF track
Model: boosting
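A boosted regression model over the term features listed above could be sketched as follows. This is not UMass's implementation: scikit-learn's GradientBoostingRegressor stands in for their boosting method, and the toy feature rows and target weights are invented placeholders for what would come from training queries on wt10g.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row describes one candidate expansion term for some training query:
# [tf in feedback docs, tf in collection, co-occurrence with query terms,
#  proximity to query terms, document frequency]
X_train = np.array([[12,  340, 0.8, 0.6,  45],
                    [ 3, 9000, 0.1, 0.2, 800],
                    [ 7,  120, 0.5, 0.7,  30]], dtype=float)
# Target: how much weight the term should receive (e.g. derived from retrieval
# gain on the training collection) -- the values here are made up.
y_train = np.array([0.9, 0.05, 0.6])

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# At query time, score each candidate expansion term and keep the best ones.
candidate_features = np.array([[10, 250, 0.7, 0.5, 60]], dtype=float)
print(model.predict(candidate_features))
```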
Entity Track
Task
– Given an input entity, find the related entities
– Return 100 related entities and their homepages
Results of Entity track
Purdue at Entity track
Entity Extraction
– Hierarchical Relevance Model
– Three levels of relevance: document, passage, entity
Purdue at Entity track
Homepage Finding for Entities
– Logistic Regression model
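The homepage-finding step can be illustrated with a small logistic-regression sketch. The features below (entity name in the URL or title, URL depth, looks-like-a-site-root) are hypothetical stand-ins; the run's actual feature set is not described here, and the toy training pairs are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def url_features(entity, url, title):
    """Hypothetical features for deciding whether `url` is the homepage of `entity`."""
    return [
        int(any(w in url.lower() for w in entity.lower().split())),  # entity word in URL
        int(entity.lower() in title.lower()),                        # entity name in title
        url.rstrip("/").count("/") - 2,                               # path depth
        int(url.rstrip("/").count("/") <= 2),                         # looks like a site root
    ]

# Toy training data: (entity, url, title, is_homepage)
train = [
    ("Apache Lucene", "https://lucene.apache.org/", "Apache Lucene", 1),
    ("Apache Lucene", "https://en.wikipedia.org/wiki/Apache_Lucene", "Apache Lucene - Wikipedia", 0),
]
X = np.array([url_features(e, u, t) for e, u, t, _ in train], dtype=float)
y = np.array([label for *_, label in train])
clf = LogisticRegression().fit(X, y)

# Rank candidate URLs for an entity by homepage probability.
candidate = url_features("Apache Lucene", "https://lucene.apache.org/core/", "Lucene Core")
print(clf.predict_proba(np.array([candidate], dtype=float))[:, 1])
```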
Blog Track
Tasks
– Faceted Blog Distillation
– Top Stories Identification
Collection: Blogs08
– Crawled between 01/14/2008 and 02/10/2009
– 1.3 million unique blogs
Blog Track
Task 1: Faceted Blog Distillation
– Given a topic and a facet restriction, find the relevant blogs
– Facets: Opinionated vs. Factual; Personal vs. Official; In-depth vs. Shallow
– Topic example
Blog Track
Task 2: Top Stories Identification
– Given a date, find the hottest news headlines for that day, then select relevant and diverse blog posts for those headlines
– News headlines are drawn from the New York Times
– Topic example
Results of Blog track: Faceted Blog Distillation
Results of Blog track: Top Stories Identification
– Finding the hottest news headlines
– Identifying the related blog posts
BUPT at Blog track
Faceted Blog Distillation
– Scoring function: the title section of a topic plus automatically selected terms from the DESC and NARR sections; phrase match
– Facet analysis
   Opinionated vs. Factual: a sentiment analysis model
   Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer)
   In-depth vs. Shallow: post length
– Final score: a linear combination of the above two parts
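The final linear combination can be written as a one-line scoring function, roughly as below. The facet sub-scores are placeholders (a real run would plug in the sentiment model, organization-entity counts, and post-length statistics), and the combination weight is an assumption.

```python
def blog_score(topic_relevance, facet_scores, facet, alpha=0.7):
    """Combine topical relevance with the score of the requested facet.
    `facet_scores` maps facet names ('opinionated', 'personal', 'in-depth')
    to values in [0, 1] produced by the facet classifiers."""
    return alpha * topic_relevance + (1 - alpha) * facet_scores[facet]

# Example: a blog that matches the topic well and looks opinionated.
print(blog_score(0.82, {"opinionated": 0.9, "personal": 0.3, "in-depth": 0.6},
                 facet="opinionated"))
```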
Univ. of Glasgow at Blog track
Top Stories Identification
– The ranking model (equation shown on the slide)
– Incorporating the following days
– Using Wikipedia to enrich news headline terms, keeping the top 10 terms for each headline
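Since the slide's ranking formula is not reproduced here, the sketch below only illustrates the general shape of such a model: score a headline for a given date by how often its (Wikipedia-enriched) top terms appear in blog posts from that day and the following days. The term counting, window length, and decay are assumptions, not Glasgow's actual formula.

```python
from collections import Counter
from datetime import date, timedelta

def headline_score(headline_terms, posts_by_day, day, window=3, decay=0.5):
    """Generic newsworthiness score: count how often the headline's enriched
    terms occur in blog posts on `day` and the next `window - 1` days,
    discounting later days. `posts_by_day` maps a date to a list of token lists."""
    score = 0.0
    for offset in range(window):
        d = day + timedelta(days=offset)
        counts = Counter(t for post in posts_by_day.get(d, []) for t in post)
        score += (decay ** offset) * sum(counts[t] for t in headline_terms)
    return score

posts = {date(2008, 2, 5): [["super", "tuesday", "primaries", "obama"]],
         date(2008, 2, 6): [["super", "tuesday", "results"]]}
print(headline_score(["super", "tuesday", "primaries"], posts, date(2008, 2, 5)))
```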
Legal Track
Tasks
– Interactive task (Enron email collection): retrieval with topic authorities involved; participants can ask the topic authorities to clarify topics and to judge the relevance of sample docs
– Batch task (IIT CDIP 1.0): retrieval with relevance evidence (RF)
Results of Legal track
Waterloo at Legal track
Interactive task
– Phase 1: interactive search and judging, to find a large and diverse set of training examples
– Phase 2: interactive learning, to find more potentially relevant documents
Batch task
– Run three spam filters on every document: an online logistic regression filter, a Naïve Bayes spam filter, and an online version of the BM25 RF method
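An online logistic-regression filter of the kind mentioned above can be sketched as a single-pass update over documents as judgments arrive. This is a generic text filter, not Waterloo's implementation; the hashed word features and learning rate are assumptions.

```python
import math
import re

class OnlineLogisticFilter:
    """Minimal online logistic regression over hashed word features.
    Documents arrive one at a time; the model is updated after each judgment."""
    def __init__(self, n_buckets=2**18, lr=0.1):
        self.w = [0.0] * n_buckets
        self.n = n_buckets
        self.lr = lr

    def _features(self, text):
        return [hash(tok) % self.n for tok in re.findall(r"[a-z0-9]+", text.lower())]

    def score(self, text):
        z = sum(self.w[i] for i in self._features(text))
        return 1.0 / (1.0 + math.exp(-z))   # probability of relevance

    def update(self, text, label):          # label: 1 = relevant, 0 = not
        p = self.score(text)
        for i in self._features(text):
            self.w[i] += self.lr * (label - p)

flt = OnlineLogisticFilter()
flt.update("fraud settlement agreement attached", 1)
flt.update("lunch menu for friday", 0)
print(flt.score("draft settlement agreement"))
```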
Million Query Track
Tasks
– Adhoc retrieval for 40,000 queries
– Predict query types
   Query intent: precision-oriented vs. recall-oriented
   Query difficulty: hard vs. easy
Precision-oriented
– Navigational: find a specific URL or web page
– Closed: find a short, unambiguous answer to a specific question
– Resource: locate a web-based resource or download
Recall-oriented
– Open: answer an open-ended question, or find all available information about a topic
– Advice: find advice or ideas regarding a general question or problem
– List: find a list of results that will help satisfy an open-ended goal
Results of Million Query track
– Precision-oriented vs. Recall-oriented
– Hard vs. Easy
Northeastern Univ. at MQ track
Query-specific learning to rank
– Learn different ranking functions for queries in different classes
– SVM used to classify queries; training data: the MQ 2008 dataset
Features
– Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, language models (Laplace, Dirichlet, JM)
– Field features: title, heading, anchor text, and URL
– Web graph features
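The query-classification step could be sketched with a linear SVM over simple query-text features. The toy queries, labels, and TF-IDF features below are illustrative only; the actual run trained on the MQ 2008 data with its own feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training queries labeled precision-oriented (P) or recall-oriented (R).
queries = ["facebook login page", "irs form 1040 download",
           "ideas for small garden design", "history of the roman empire"]
labels = ["P", "P", "R", "R"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(queries, labels)

# Each predicted class would then be routed to its own ranking function.
print(clf.predict(["download firefox installer", "advice on learning piano"]))
```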
Chemical IR Track
Tasks
– Technical Survey Task: retrieve documents in response to each topic given by chemical patent experts
– Prior Art Search Task: find relevant patents with respect to a set of 1,000 existing patents
Results of Chemical track
Geneva at Chemical track
Document representation
– Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references
Exploiting citation networks
– Query expansion using chemical annotations
Filtering based on IPC codes
Re-ranking based on claims
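The IPC-code filtering step can be illustrated as below: keep candidate patents that share an IPC class prefix with the query patent, then rank the survivors by their retrieval score. The 4-character prefix and the keep-vs-boost choice are assumptions, not the Geneva configuration.

```python
def ipc_filter(query_ipc_codes, candidates, prefix_len=4):
    """Keep candidates sharing at least one IPC class prefix (e.g. 'C07D')
    with the query patent. `candidates` is a list of (patent_id, ipc_codes, score)."""
    query_prefixes = {code[:prefix_len] for code in query_ipc_codes}
    kept = [(pid, score) for pid, codes, score in candidates
            if any(code[:prefix_len] in query_prefixes for code in codes)]
    return sorted(kept, key=lambda x: x[1], reverse=True)

print(ipc_filter(["C07D213/30"],
                 [("US123", ["C07D213/64"], 2.1),
                  ("US456", ["A61K9/20"], 3.0)]))
```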