1 Entity Ranking Using Wikipedia as a Pivot (CIKM '10). Rianne Kaptein, Pavel Serdyukov, Arjen de Vries, Jaap Kamps. 2010/12/14, Yu-wen Hsu
2 Outline: Introduction; From Wikipedia Entities to Web Entities and Back; Entity Ranking on Wikipedia; Entity Ranking on the Web; Conclusion
3 Introduction: Entity ranking is the task of finding documents representing entities of a correct type that are relevant to a query. The aim is to present a ranked list of entities directly, rather than a list of web pages with relevant but also potentially redundant information about these entities.
4 Entity ranking differs from document retrieval on at least three points: i) returned documents have to represent an entity; ii) this entity should belong to a specified entity type; iii) to create a diverse result list, an entity should only be returned once.
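The three points above can be illustrated as a filter over an ordinary document ranking. This is only a minimal sketch, not the authors' implementation; the `Result` record, its fields, and the function `to_entity_ranking` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Result:
    """One retrieved document (hypothetical record for illustration)."""
    doc_id: str
    entity: Optional[str]       # entity the document represents, or None
    entity_type: Optional[str]  # type of that entity, or None
    score: float                # retrieval score, higher is better


def to_entity_ranking(results: List[Result], target_type: str) -> List[str]:
    """Turn a document ranking into an entity ranking."""
    seen = set()
    ranked = []
    for r in sorted(results, key=lambda r: r.score, reverse=True):
        if r.entity is None:              # i) must represent an entity
            continue
        if r.entity_type != target_type:  # ii) must have the requested type
            continue
        if r.entity in seen:              # iii) return each entity only once
            continue
        seen.add(r.entity)
        ranked.append(r.entity)
    return ranked
```

For example, two documents about the same entity contribute a single entry to the output, placed at the rank of the higher-scoring document.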
5 Main Goal: to rank Web entities. 1. Associate target entity types with the query. 2. Rank Wikipedia pages according to their similarity with the query and target entity types. 3. Find web entities corresponding to the Wikipedia entities.
6 Using Wikipedia as a pivot. Entities: Wikipedia pages. The name of the entity: the title of the page. The content of the page: the representation of the entity. Each Wikipedia page is assigned to a number of categories: topical, type, and administrative categories.
7 From Wikipedia Entities to Web Entities and Back. From Web to Wikipedia: Do these repositories provide enough clues to find the corresponding entities on the Web? Do they contain enough entities to cover the complete range of entities needed to satisfy all kinds of information needs?
8 From Wikipedia to Web: use external links.
9 Entity Ranking on Wikipedia * Entity Types. Entity type assignment: exploit the existing Wikipedia categorization of documents. Pseudo-relevance feedback: from the top retrieved documents we extract the categories that are most frequently assigned. We take the top 10 results and look at the 2 most frequently occurring categories belonging to these documents.
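The pseudo-relevance feedback step above can be sketched as a simple frequency count. This is a minimal illustration under stated assumptions: the function name `assign_target_categories`, the `doc_categories` mapping, and the alphabetical tie-breaking are hypothetical choices, while the defaults of 10 results and 2 categories come from the slide.

```python
from collections import Counter
from typing import Dict, List, Sequence


def assign_target_categories(
    ranked_doc_ids: Sequence[str],
    doc_categories: Dict[str, Sequence[str]],
    top_k: int = 10,
    num_categories: int = 2,
) -> List[str]:
    """Pick the categories most frequently assigned to the top results."""
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_k]:
        # Count each category at most once per document.
        counts.update(set(doc_categories.get(doc_id, ())))
    # Sort by descending frequency; break ties alphabetically (an assumption)
    # so that the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ordered[:num_categories]]
```

Documents without any category simply contribute nothing to the counts, so the function returns fewer than `num_categories` entries when the top results carry too few categories.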
10 * Entity Types - Scoring Entities. Estimate background probabilities: smooth the probabilities of a term occurring in a category name with the background collection. Notation used: the name of the category, the category, the query terms, the document, and the entire Wikipedia document collection.
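One common way to write this kind of smoothing is linear interpolation (Jelinek-Mercer) between the category estimate and the background collection. The formula below is a sketch under that assumption and is not taken from the slide; the symbols $t$ (a term), $C$ (a category), $D$ (the entire Wikipedia document collection), and the weight $\lambda$ are illustrative names.

```latex
P_{\text{smooth}}(t \mid C) \;=\; \lambda \, P(t \mid C) \;+\; (1 - \lambda)\, P(t \mid D),
\qquad 0 \le \lambda \le 1
```

With $\lambda < 1$, every term that occurs in the collection receives a non-zero probability under each category, which keeps later comparisons between categories well defined.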
11 Similarity between two categories; the entity type score for a document in relation to a query topic; score normalization.
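The three ingredients named above can be sketched as follows. This is an illustration under loudly stated assumptions, not the authors' formulas: it assumes KL-divergence between smoothed category term distributions as the category similarity, takes the best-matching category pair as the entity type score, and uses min-max scaling for normalization. All function names and the weight `lam` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Model = Dict[str, float]


def smoothed_model(category_terms: Sequence[str], background: Model,
                   lam: float = 0.9) -> Model:
    """Interpolate category term frequencies with a background model.

    Assumes `background` covers the vocabulary with non-zero probabilities.
    """
    counts = Counter(category_terms)
    total = sum(counts.values())
    return {
        t: lam * (counts[t] / total if total else 0.0) + (1 - lam) * p_bg
        for t, p_bg in background.items()
    }


def kl_divergence(p: Model, q: Model) -> float:
    """KL(p || q); 0 for identical models, larger for dissimilar ones."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)


def entity_type_score(doc_models: List[Model],
                      target_models: List[Model]) -> float:
    """Negated smallest divergence over all pairs, so higher is better."""
    return -min(kl_divergence(ct, cd)
                for ct in target_models for cd in doc_models)


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so they can be combined with other scores."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}
```

Normalizing before combining matters because a divergence-based score and a retrieval score generally live on different scales.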
12 Entity Ranking on Wikipedia * Experimental Setup. Data sets: INEX: specific entity types, e.g. countries, national parks. TREC: people, organizations, products. Advantage: clear, few options, could be easily selected. Disadvantage: cover a small part of all possible entity ranking queries. More specific entity types were manually assigned.
13 Rerank the top 2,500 results of the baseline. Entity types: manually assigned (author) or automatically assigned (PRF). Evaluation: 2009 TREC: P10; INEX: P10 and MAP. INEX: a set consisting of 79 topics, and the INEX 2009 topics, consisting of a selection of 55 topics. Only the so-called 'primary' pages are counted.
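The evaluation measures named above, P10 (precision at rank 10) and MAP (mean average precision), have standard definitions that can be sketched as follows. The function names and the input layout are assumptions made for illustration; each ranking is assumed to contain no duplicates, and the `relevant` set would hold only the pages that count as relevant (for example, the 'primary' pages).

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str],
                   k: int = 10) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant result."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(
    runs: List[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Average of the per-topic average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

P10 only looks at the first ten results, while MAP rewards placing all relevant entities early in the whole ranking, which is why the two can disagree on which run is better.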
15 Entity Ranking on the Web. We have three approaches for finding web pages associated with Wikipedia pages. 1. External links: take the pages linked from the External links section of the Wikipedia page. 2. Anchor text: use the Wikipedia page title as the query and retrieve pages from the anchor text index. 3. Combined: not all Wikipedia pages have external links, and not all external links of Wikipedia pages are part of the ClueWeb collection; when fewer than 3 web pages are found, we fill up the results to 3 pages using the top pages retrieved using anchor text.
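The combined approach can be sketched as a fallback rule: keep the external links that exist in the collection, then top up from the anchor text ranking. This is a minimal sketch, not the authors' code; the function name `combined_pages`, its parameters, and the example URLs are hypothetical, and only the threshold of 3 pages comes from the slide.

```python
from typing import Iterable, List, Sequence, Set


def combined_pages(
    external_links: Iterable[str],
    collection_urls: Set[str],
    anchor_text_ranking: Sequence[str],
    min_pages: int = 3,
) -> List[str]:
    """Combine external links with anchor text retrieval results."""
    # Keep external links that are part of the collection, dropping
    # duplicates while preserving their original order.
    pages = [url for url in dict.fromkeys(external_links)
             if url in collection_urls]
    # Fill up with the top anchor text results until min_pages is reached.
    for url in anchor_text_ranking:
        if len(pages) >= min_pages:
            break
        if url not in pages:
            pages.append(url)
    return pages
```

A Wikipedia page without any usable external link therefore falls back entirely to the anchor text results, while a page with three or more usable external links is left unchanged.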
17 Conclusion: Our experiments show that our Wikipedia-as-a-pivot approach outperforms a baseline of full-text search. Both external links on Wikipedia pages and searching an anchor text index of the web are effective approaches to find homepages for entities represented by Wikipedia pages.