EntityRank: Searching Entities Directly and Holistically
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang
Computer Science Department, University of Illinois at Urbana-Champaign
VLDB ’07, September 23-28, 2007, Vienna, Austria
Presented by Sangkeun Lee, IDS Lab., Seoul National University
Copyright 2007 by CEBT
Motivating Scenario: Customer service phone number of Amazon?
Search on Amazon?
Search on Google?
Many similar cases:
– The email of Luis Gravano?
– Which professors work on databases at UIUC?
– The papers and presentations of ICDE 2007?
– Due date of SIGMOD 2008?
– Sale price of “Canon PowerShot A400”?
– “Hamlet” books available at bookstores?
Often, we are looking for data entities (e.g., emails, dates, prices), not pages.
What you search is not what you want
Entity Search Problem: Traditional Search vs. Entity Search [Figure; labels: Keywords, Entities, Results, Support]
Entity Search Problem
Challenge: How to rank entities? Why is this a novel problem?
Core Challenges
– Contextual: pattern (phrase, unordered window "uw", ordered window "ow") and proximity
– Holistic: aggregated occurrences
– Uncertainty: extraction confidence probability
– Associative: distinguish true associations from accidental ones
– Discriminative: entity instances matched on more popular pages should receive higher scores than entity instances from less popular pages
A novel problem: solve all of these together, probabilistically
Impression Model
Recognition Layer: Local Assessment
Given a document d, how do we assess whether a particular tuple t = ⟨e_1, …, e_m⟩ matches the query q = α(E_1, …, E_m, k_1, …, k_l) = α(γ)?
Two orthogonal factors:
– Extraction uncertainty
– Association context
  – Boolean pattern qualification: doc, phrase, uw, ow
  – Probabilistic proximity quantification
* s: the span length, i.e., the shortest window that covers the entire occurrence
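The two factors above can be illustrated with a small sketch. It is not the paper's exact formula: the `Occurrence` class, the `proximity_prob` decay model and its `decay` parameter, and the choice to keep the best-scoring occurrence per document are all assumptions made for illustration. It only shows how extraction confidence, a Boolean pattern check, and a span-based proximity probability might be combined into one local score.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Occurrence:
    """One co-occurrence of a candidate tuple with the query keywords in a document."""
    extraction_confs: Sequence[float]  # confidence of each extracted entity instance
    matches_pattern: bool              # Boolean qualification (doc, phrase, uw, ow)
    span: int                          # shortest window (in tokens) covering the occurrence


def proximity_prob(span: int, decay: float = 0.1) -> float:
    """Hypothetical proximity model: the probability shrinks as the span widens."""
    return 1.0 / (1.0 + decay * span)


def occurrence_score(occ: Occurrence) -> float:
    """Extraction uncertainty times association context for a single occurrence."""
    if not occ.matches_pattern:
        return 0.0  # the Boolean pattern acts as a hard filter
    conf = 1.0
    for c in occ.extraction_confs:
        conf *= c  # assumes the extractions are independent
    return conf * proximity_prob(occ.span)


def local_score(occurrences: Sequence[Occurrence]) -> float:
    """Score of a tuple within one document: its best-scoring occurrence (an assumption)."""
    return max((occurrence_score(o) for o in occurrences), default=0.0)
```

With this sketch, a tightly spanned occurrence with a confident extraction dominates a loosely spanned, low-confidence one, and an occurrence that fails the pattern contributes nothing.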
Recognition Layer: Local Assessment (requirements: Contextual, Uncertain, Holistic, Discriminative, Associative) [Figure: input occurrences L1 and L2, with extraction conf = 1.0 and extraction conf = 0.3, and the resulting output]
Access Layer: Global Aggregation (requirements: Contextual, Uncertain, Holistic, Discriminative, Associative; Holistic and Discriminative highlighted) [Figure: example input and output]
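A minimal sketch of what global aggregation could look like, assuming each document already has a local score for the tuple and a popularity weight (for example a PageRank-style value). The function name `global_score` and the weighted-sum form are assumptions for illustration, not a confirmed formula from the slides. Summing over documents covers the holistic requirement; weighting by popularity covers the discriminative one.

```python
from typing import Mapping


def global_score(local_scores: Mapping[str, float],
                 page_weight: Mapping[str, float]) -> float:
    """Sum per-document local scores, each weighted by the popularity of its page.

    local_scores: document id -> local score of the tuple in that document
    page_weight:  document id -> popularity weight (missing documents count as 0)
    """
    return sum(page_weight.get(doc, 0.0) * score
               for doc, score in local_scores.items())
```

Under this sketch, the same local evidence counts for more when it comes from a popular page, and a tuple seen on many pages accumulates a higher score than one seen once.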
Validation Layer: Hypothesis Testing
Accidental association: e.g., an entity instance can appear very frequently with the keywords “Luis” and “Gravano”, yet the association is only accidental if that instance appears on many pages in general.
Validate that the association is not accidental.
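One way to test whether an association is accidental is to compare the observed co-occurrence probability with the probability expected if entity and keywords appeared independently. The sketch below uses a standard G-test (log-likelihood ratio) statistic for a single yes/no event. Treating this as the validation step is an assumption for illustration, and the names `g_test_score`, `p_observed`, and `p_random` are hypothetical.

```python
import math


def g_test_score(p_observed: float, p_random: float) -> float:
    """G-test statistic comparing an observed probability against a random baseline.

    p_observed: probability of the association as measured on the corpus
    p_random:   probability expected if entity and keywords were independent
    """
    if not 0.0 < p_random < 1.0:
        raise ValueError("p_random must be strictly between 0 and 1")
    if not 0.0 <= p_observed <= 1.0:
        raise ValueError("p_observed must be between 0 and 1")

    def term(o: float, r: float) -> float:
        # By convention, 0 * log(0 / r) is taken as 0.
        return 0.0 if o == 0.0 else o * math.log(o / r)

    return 2.0 * (term(p_observed, p_random)
                  + term(1.0 - p_observed, 1.0 - p_random))
```

The statistic is zero when the observed probability equals the baseline and grows as the two diverge. It is two-sided, so a ranking function would also need to check that `p_observed` exceeds `p_random` before rewarding a tuple.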
EntityRank: The Scoring Function (components: Local Recognition, Global Aggregation, Validation)
Comparison
Methods compared: EntityRank; Naïve approach; Local only; Global only; Combine L by simple summation; L+G without hypothesis testing
Metric: % of satisfied queries at each rank
Query Type I: Phone for top-30 Fortune 500 companies
Query Type II: Email for 51 of 88 SIGMOD07 PC members
Corpus: general crawl of the Web (Aug. 2006), around 2 TB with 93M pages
Entities: Phone (8.8M distinct instances), Email (4.6M distinct instances)
System: a cluster of 34 machines
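The metric can be made concrete with a short sketch. It assumes a query counts as "satisfied at rank k" when its first correct answer appears at position k or better; that reading, the function name `satisfied_at_rank`, and the input layout are assumptions, and the query ids in the usage note are made up.

```python
from typing import Mapping, Optional


def satisfied_at_rank(first_correct_rank: Mapping[str, Optional[int]], k: int) -> float:
    """Percentage of queries whose first correct answer is ranked within the top k.

    first_correct_rank: query id -> 1-based rank of the first correct answer,
                        or None if no correct answer was returned
    """
    if not first_correct_rank:
        return 0.0
    hits = sum(1 for rank in first_correct_rank.values()
               if rank is not None and rank <= k)
    return 100.0 * hits / len(first_correct_rank)
```

For example, with four hypothetical queries whose first correct answers sit at ranks 1, 3, 2, and not at all, the curve would read 25% at rank 1, 50% at rank 2, and 75% at rank 3.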
Conclusions
– Formulate the entity search problem
– Study and define the characteristics and requirements of entity search
– Propose the Impression Model and the EntityRank framework for ranking entities
– Implement a prototype with real Web