Published by Paul Gibson. Modified over 9 years ago.
1 EntityRank: Searching Entities Directly and Holistically. Tao Cheng. Joint work with: Xifeng Yan, Kevin Chang. VLDB 2007, Vienna, Austria
2 Motivating Scenario: Customer service phone number of Amazon?
3 Search on Amazon?
4 Search on Google?
5 Many, many similar cases: The email of Luis Gravano? What profs are doing databases at UIUC? The papers and presentations of ICDE 2007? Due date of SIGMOD 2008? Sale price of “Canon PowerShot A400”? “Hamlet” books available at bookstores? Often, we are looking for data entities (e.g. emails, dates, prices), not pages.
6 What you search is not what you want.
7 From pages to entities. Traditional Search vs. Entity Search. [Figure labels: Keywords, Entities, Results, Support]
8 Concretely, what do we mean by Entity Search? Online Demo. Yellowpage: Comprehensive corpus. Special Thanks: ~100M Pages from Stanford WebBase
9 Entity Search Problem. Given: entity collection $E = \{E_1, \ldots, E_N\}$ over document collection $D = \{d_1, \ldots, d_n\}$. Input: query $q(\langle E_1, \ldots, E_m \rangle) = \alpha(E_1, \ldots, E_m, k_1, \ldots, k_l)$, where $\alpha$ is a tuple pattern, $E_i \in E$, and $k_j$ is a keyword, e.g. ow(David DeWitt #phone #email). Output: ranked list of tuples $t = \langle e_1, \ldots, e_m \rangle$, where $e_i \in E_i$, sorted by Score(q(t)), the query score of t. Informally: Input: keywords and entities (optionally with a pattern), e.g. Amazon Customer Service #phone. Output: ranked entity tuples (example scores: 0.90, 0.80, 0.60, …)
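A minimal sketch of how the informal query form could be split into keywords and entity types. The function name `parse_query` is hypothetical, and the sketch assumes entity types are exactly the tokens prefixed with `#`; pattern operators such as `ow(...)` are not handled.

```python
def parse_query(query: str):
    """Split a query into plain keywords and #-prefixed entity types.

    Assumption: any token starting with '#' names an entity type;
    everything else is treated as a keyword.
    """
    keywords, entity_types = [], []
    for token in query.split():
        if token.startswith("#") and len(token) > 1:
            entity_types.append(token[1:])  # drop the leading '#'
        else:
            keywords.append(token)
    return keywords, entity_types


keywords, entity_types = parse_query("Amazon Customer Service #phone")
print(keywords, entity_types)
```

The result of a search would then be a ranked list of tuples with one value per requested entity type.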
10 Challenge: How to rank entities? Why is this a novel problem?
11 Characteristics I: Contextual - Utilize Entities’ Surrounding Context. [Figure labels: Content, Context]
12 Characteristics II: Uncertain - Extractions are not “perfect”
13 Characteristics III: Holistic - Much evidence from multiple sources
14 Characteristics IV: Discriminative - Web Pages are of Varying Quality
15 Characteristics V: Associative - Tell True Associations from Accidental. Example: Finding Prof. Luis Gravano’s Email. Observation: info@acm.org appears very frequently with keywords “Luis”, “Gravano”. However, such association is only accidental, as info@acm.org appears on many pages.
16 EntityRank: The Impression Model. [Figure: a Tireless Observer and three layers. Access Layer: Global Aggregation. Recognition Layer: Local Assessment. Validation Layer: Hypothesis Testing. Example scores: 0.90, 0.80, 0.60, …]
17 Recognition Layer: Local Assessment. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. Input: L1, L2; Extraction Conf = 1.0, Extraction Conf = 0.3. Output: [not preserved]
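A rough sketch of what a local assessment could look like for one document, combining the two ideas named on the slide: extraction confidence (Uncertain) and how tightly the keywords and entity occur together (Contextual). The linear `span_weight` decay and the 20-token window are assumptions made for illustration, not the scoring used in EntityRank.

```python
from math import prod


def span_weight(positions, window=20):
    """Hypothetical context weight: 1.0 for a zero-width span,
    decaying linearly to 0.0 once the span reaches `window` tokens."""
    span = max(positions) - min(positions)
    return max(0.0, 1.0 - span / window)


def local_score(occurrences):
    """Score one document by its best occurrence.

    `occurrences` is a list of (positions, confidences) pairs, where
    positions are token offsets of the matched keywords and entities and
    confidences are the extraction confidences of the entities involved.
    """
    return max(
        (prod(confs) * span_weight(pos) for pos, confs in occurrences),
        default=0.0,
    )


# Two candidate occurrences in the same document: one tight, one spread out.
tight_and_loose = [([3, 8, 9, 13], [1.0]), ([3, 8, 9, 78], [1.0])]
score_confident = local_score(tight_and_loose)

# A single occurrence whose entity was extracted with low confidence.
score_uncertain = local_score([([5, 11, 12, 18], [0.3])])
print(score_confident, score_uncertain)
```

Taking the maximum over occurrences is one possible design choice; summing or averaging would weigh repeated mentions differently.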
18 Access Layer: Global Aggregation. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative (highlighted: Holistic, Discriminative). Input: [not preserved]. Output: [not preserved]
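A minimal sketch of global aggregation under two assumptions: each document contributes a local score per entity (Holistic), and each document carries a quality weight such as a normalized link-based score (Discriminative). The names `aggregate`, `local_scores`, and `doc_weight`, and all numbers, are made up for illustration.

```python
from collections import defaultdict


def aggregate(local_scores, doc_weight):
    """Sum local scores over all documents, weighting each document
    by its (assumed) quality weight. Unknown documents get weight 0."""
    totals = defaultdict(float)
    for doc, per_entity in local_scores.items():
        weight = doc_weight.get(doc, 0.0)
        for entity, score in per_entity.items():
            totals[entity] += weight * score
    return dict(totals)


# Illustrative inputs only.
local_scores = {
    "d3": {"800-202-7575": 0.4},
    "d7": {"800-202-7575": 0.5, "800-322-9266": 0.2},
}
doc_weight = {"d3": 0.5, "d7": 0.5}

global_scores = aggregate(local_scores, doc_weight)
ranked = sorted(global_scores, key=global_scores.get, reverse=True)
print(ranked)
```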
19 Validation Layer: Hypothesis Testing. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. Input: Collection E over D. Output: Virtual Collection E’ over D’. [Figure label: randomize]
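The slide contrasts the real collection with a randomized virtual one. One way to turn that contrast into a score is a G-test-style statistic comparing the observed co-occurrence probability with the probability expected if keywords and entities were scattered independently. Whether this matches the exact statistic used in EntityRank is an assumption; the function names and numbers below are illustrative.

```python
from math import log


def expected_random(doc_freqs, num_docs):
    """Probability that all items land in the same document if each
    item were placed independently (the randomized collection)."""
    p = 1.0
    for df in doc_freqs:
        p *= df / num_docs
    return p


def g_score(p_obs, p_rand):
    """G-test-style statistic for observed vs. random co-occurrence.

    Requires 0 < p_rand < 1. Returns 0.0 when the observed probability
    equals the random expectation, and grows as they diverge.
    """
    def term(observed, expected):
        return 0.0 if observed == 0 else observed * log(observed / expected)

    return 2.0 * (term(p_obs, p_rand) + term(1 - p_obs, 1 - p_rand))


# Same observed co-occurrence rate, but the second entity is so common
# that random placement already explains much of it (cf. info@acm.org).
rare_entity = g_score(0.01, 0.0001)
common_entity = g_score(0.01, 0.005)
print(rare_entity > common_entity)
```

Note that this statistic is symmetric in direction: a practical ranker would also check that the observed probability exceeds the random one before rewarding the association.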
20 EntityRank: The Scoring Function. [Formula components: Local Recognition, Global Aggregation, Validation]
21 Sort-merge Join Query Processing. Inverted lists (Doc: Posting).
Amazon: d1: 8, 25; d3: 5; d6: 10; d7: 3; d9: 7, 33
Customer: d3: 11; d5: 66; d7: 8, 24
Service: d3: 12; d7: 9; d8: 44
#phone: d2: (42, 851-0400, 0.8); d3: (18, 800-202-7575, 1.0); d7: (13, 800-202-7575, 1.0), (78, 800-322-9266, 1.0)
[Figure labels: Aggregation, Hypothesis Test, Result. Scores shown, in order: 800-202-7575: 0.5; 800-322-9266: 0.2; 800-202-7575: 0.6; 800-322-9266: 0.1; 800-202-7575: 0.4]
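A small sketch of the sort-merge join over the posting lists on this slide. The assignment of each list to Amazon, Customer, Service, and #phone follows one reading of the figure and is an assumption, as is the function name `merge_join`. Each list is sorted by document id, and the join keeps only documents present in every list.

```python
def merge_join(posting_lists):
    """Intersect posting lists that are sorted by document id.

    Each list holds (doc_id, postings) pairs. Returns a list of
    (doc_id, [postings from each list]) for documents in all lists.
    """
    cursors = [0] * len(posting_lists)
    results = []
    while all(c < len(pl) for c, pl in zip(cursors, posting_lists)):
        current = [pl[c][0] for c, pl in zip(cursors, posting_lists)]
        target = max(current)
        if all(doc == target for doc in current):
            # Every cursor points at the same document: emit and advance all.
            results.append(
                (target, [pl[c][1] for c, pl in zip(cursors, posting_lists)])
            )
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors that lag behind the largest doc id.
            cursors = [c + 1 if doc < target else c
                       for c, doc in zip(cursors, current)]
    return results


amazon = [(1, [8, 25]), (3, [5]), (6, [10]), (7, [3]), (9, [7, 33])]
customer = [(3, [11]), (5, [66]), (7, [8, 24])]
service = [(3, [12]), (7, [9]), (8, [44])]
phone = [
    (2, [(42, "851-0400", 0.8)]),
    (3, [(18, "800-202-7575", 1.0)]),
    (7, [(13, "800-202-7575", 1.0), (78, "800-322-9266", 1.0)]),
]

matches = merge_join([amazon, customer, service, phone])
matched_docs = [doc for doc, _ in matches]
print(matched_docs)
```

Only the documents that survive the join need local scoring, which is what makes a single pass over the sorted lists attractive.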
22 Experiment Setup. Corpus: general crawl of the Web (Aug. 2006), around 2 TB with 93M pages. Entities: Phone (8.8M distinct instances), Email (4.6M distinct instances). System: a cluster of 34 machines.
23 Comparing EntityRank to the Following Different Approaches. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. Approaches: Naïve (N), Local (L), Global (G), Combine (C), Without (W), EntityRank (ER).
24 Example Query Results
25 Comparison. Approaches: EntityRank; Naïve approach; Local only; Global only; Combine L by simple summation; L+G without hypothesis testing. Metric: % satisfied queries at #rank. Query Type I: Phone for Top-30 Fortune 500 Companies. Query Type II: Email for 51 of 88 SIGMOD07 PC.
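A short sketch of how the "% satisfied queries at rank" metric could be computed, assuming a query counts as satisfied at rank k when its first correct answer appears within the top k results. That definition and the name `satisfied_at` are assumptions; the ranks below are made up.

```python
def satisfied_at(ranks, k):
    """Percentage of queries whose first correct answer is ranked
    within the top k. A rank of None means no correct answer was found."""
    if not ranks:
        return 0.0
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


# Rank of the first correct answer for four hypothetical queries.
first_correct_ranks = [1, 3, None, 2]
pct_at_2 = satisfied_at(first_correct_ranks, 2)
print(pct_at_2)
```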
26 Conclusions: Formulate the entity search problem. Study and define the characteristics of entity search. Conceptual Impression Model and concrete EntityRank framework for ranking entities. An online prototype with a real Web corpus.
27 Thanks! Questions?