Beyond Pages: Supporting Efficient, Scalable Entity Search with Dual-Inversion Index
Tao Cheng and Kevin Chang
Computer Science Department, University of Illinois at Urbana-Champaign
Users in Frustration: Customer service phone number of Amazon? Search on Amazon? Search on a search engine?
Even More Frustration: Professors in the area of data mining? cs.uiuc.edu, cs.uiuc.edu/research, cs.uiuc.edu/research/data, …; cs.stanford.edu, cs.stanford.edu/research, cs.stanford.edu/research/faculty, …
Many, many such cases: The email of Kevin Chang? The papers and presentations of ICDE 2010? Database conferences and their due dates in 2010? Sale price of “Canon PowerShot A400”? Often, we are looking for data entities (e.g., emails, dates, prices), not pages. Indeed, according to a recent survey, 52.9% of queries directly target structured entities [DE Bulletin’09].
[DE Bulletin’09]: R. Kumar and A. Tomkins, “A Characterization of Online Search Behavior”
Recent Trends: WQA. Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002). Example: “Who is CEO of Dell?” Keywords: “CEO Dell”; parse top-k results; answer: Michael Dell
Recent Trends: WIE. Web Information Extraction (WIE) with specialized information extractors (Marius 2006, Cafarella 2005, Etzioni 2004)
Pattern: “X is CEO of Y”
Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
……
Recent Trends: TAS. Typed-Annotated Search (TAS) (Cheng 2007, Cafarella 2007, Chakrabarti 2006). Example: “Inventor of television?” Finding person names near keywords “invent” and “television” …… Ranked Entity List
From Pages to Data Entities
Traditional Search: Keywords
Entity Search: Keywords & Entity Type
Results
Support
Concretely, what do we mean by Entity Search? Online demo: 3TB corpus of 150M pages, 16-machine cluster, 24 entity types
Entity Search Problem Abstraction
Given: An entity collection over a document collection
Input: A query q made of a tuple pattern, entity types, and keywords, e.g. ow(David DeWitt #phone # )
Output: Ranked list of tuples t sorted by Score(q(t)), the query score of t
Input: Keywords & Entity Type (optionally with a pattern), e.g., Amazon Customer Service #phone
Output: Ranked entity instances, ordered by Score(e), where e is an entity instance
Common Requirements across the Trends
Context Matching (in document): Match the target type (say, #location) by keywords (e.g., “louvre museum”) that appear in its surrounding context, in certain desired patterns.
Global Aggregation (across documents): Match an entity (say, #location = Paris) as many times as it appears across numerous pages.
Computation Challenges
Expensive Context Matching (Join): Need to perform proximity matching in documents, beyond simple containment checking.
Extensive Global Aggregation (G): Need to perform corpus-scale aggregation, a layer that is non-existent in online page retrieval.
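The two requirements can be illustrated with a minimal Python sketch over a toy corpus. Everything here is an assumption made for illustration: the pre-annotated token lists, the fixed proximity window, and scoring an entity by the number of matching pages are not taken from the slides.

```python
from collections import defaultdict

# Toy corpus (made-up data): plain tokens are strings, entity occurrences
# are pre-annotated as (entity_type, value) tuples.
DOCS = {
    "d1": ["visit", "the", "louvre", "museum", "in", ("#location", "Paris")],
    "d2": ["louvre", "museum", ("#location", "Paris"), "tickets"],
    "d3": ["museum", "guide", "for", ("#location", "London")],
}

def context_matches(tokens, keywords, etype, window=5):
    """Context matching: yield entity values of `etype` that have every
    keyword within `window` token positions (a simple proximity pattern)."""
    kw_pos = {k: [i for i, t in enumerate(tokens) if t == k] for k in keywords}
    for i, tok in enumerate(tokens):
        if isinstance(tok, tuple) and tok[0] == etype:
            if all(any(abs(i - p) <= window for p in kw_pos[k]) for k in keywords):
                yield tok[1]

def entity_search(docs, keywords, etype, window=5):
    """Global aggregation: count matches per entity across all documents
    and rank by that count (an assumed, deliberately simple score)."""
    counts = defaultdict(int)
    for tokens in docs.values():
        for value in context_matches(tokens, keywords, etype, window):
            counts[value] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `entity_search(DOCS, ["louvre", "museum"], "#location")` scans every document, which is exactly the cost the index designs that follow try to avoid.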
Traditional Page-Retrieval-Based Approach: Who is the CEO of Dell? Keywords: “CEO Dell”; analyze top-k results; answer: Michael Dell. Limitations: only top-k documents; many random seeks.
Our Proposal: Entity-aware Indexing
Inspired by the success of the inverted index in enabling efficient IR for searching documents.
However, the traditional inverted index is only aware of keywords and documents.
How can we make the index entity-aware?
Our proposal: Dual-Inversion Index
Principle I: Document-inverted Index
Principle II: Entity-inverted Index
Entity-as-keyword: Document-inverted Index (figure labels: keyword, pos, doc id)
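A minimal sketch of the entity-as-keyword idea, assuming an entity type such as #phone simply gets its own postings list next to ordinary keywords, with postings ordered by doc id. The function names, the posting layout `(doc_id, pos, value)`, and the toy data are hypothetical, not the authors' implementation.

```python
from collections import defaultdict

# Made-up documents: strings are keywords, tuples are annotated entities.
DOCS = {
    1: ["amazon", "service", ("#phone", "555-0100")],
    2: ["amazon", ("#phone", "555-0100"), "fax", "or", "call", "us",
        ("#phone", "555-0199")],
}

def build_d_inverted(docs):
    """Index keywords and entity types alike: term -> [(doc_id, pos, value)].
    `value` is the entity instance for entity types, None for keywords."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, tok in enumerate(docs[doc_id]):
            if isinstance(tok, tuple):
                etype, value = tok
                index[etype].append((doc_id, pos, value))
            else:
                index[tok].append((doc_id, pos, None))
    return index

def query_d_inverted(index, keyword, etype, window=3):
    """Join keyword postings with entity-type postings on doc id, keep
    pairs within `window` positions, then count matches per entity."""
    kw_by_doc = defaultdict(list)
    for doc_id, pos, _ in index.get(keyword, []):
        kw_by_doc[doc_id].append(pos)
    counts = defaultdict(int)
    for doc_id, pos, value in index.get(etype, []):
        if any(abs(pos - p) <= window for p in kw_by_doc.get(doc_id, [])):
            counts[value] += 1
    return dict(counts)
```

Here the join is keyed on doc id, so the per-entity counts only become final after every document has been processed.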
Document Space Partitioning Node 10 Node 1
Distributed Query Processing over D-inverted Index (figure labels: Node 1 … Node 10; Local; Global; Join; Aggregation; Ranking; results, scores)
Entity-as-document: Entity-inverted Index (figure labels: keyword, pos, entity id, entity pos)
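A minimal sketch of the entity-as-document idea, assuming that for one entity type each keyword maps to postings ordered by entity, so all evidence for an entity sits together. The posting layout `(entity_value, doc_id, keyword_pos, entity_pos)`, the function names, and the toy data are hypothetical.

```python
from collections import defaultdict

# Made-up documents: strings are keywords, tuples are annotated entities.
DOCS = {
    1: ["amazon", "service", ("#phone", "555-0100")],
    2: ["amazon", ("#phone", "555-0100"), "fax", "or", "call", "us",
        ("#phone", "555-0199")],
}

def build_e_inverted(docs, etype):
    """For one entity type: keyword -> postings sorted by entity value.
    Each posting pairs a keyword occurrence with an entity occurrence
    found in the same document."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = docs[doc_id]
        ents = [(p, t[1]) for p, t in enumerate(tokens)
                if isinstance(t, tuple) and t[0] == etype]
        for kw_pos, tok in enumerate(tokens):
            if isinstance(tok, tuple):
                continue
            for ent_pos, value in ents:
                index[tok].append((value, doc_id, kw_pos, ent_pos))
    for postings in index.values():
        postings.sort()
    return index

def query_e_inverted(index, keyword, window=3):
    """Scan one postings list; because it is grouped by entity, the
    per-entity count is complete as soon as that entity's run ends."""
    counts = defaultdict(int)
    seen = set()  # count each entity occurrence at most once
    for value, doc_id, kw_pos, ent_pos in index.get(keyword, []):
        key = (value, doc_id, ent_pos)
        if abs(kw_pos - ent_pos) <= window and key not in seen:
            seen.add(key)
            counts[value] += 1
    return dict(counts)
```

The trade-off in this sketch is space: every keyword and entity pair that co-occurs in a document produces a posting, so this index can be much larger than a document-inverted one.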
Entity Space Partitioning: Node 1, Node 9
Distributed Query Processing over E-inverted Index (figure labels: Node 1 … Node 9; Local; Global; Join; Aggregation; Ranking; results, scores)
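The global step under the two partitioning schemes can be contrasted with a small sketch. It rests on an assumption the figures only suggest: with document-space partitioning an entity's evidence is spread over nodes, so the coordinator must sum partial scores before ranking, while with entity-space partitioning each entity lives on one node, so the coordinator only merges already-ranked local results. Function names and inputs are hypothetical.

```python
import heapq
from collections import Counter

def global_merge_doc_partitioned(partials, k):
    """partials: one {entity: partial_score} dict per node.
    Scores for the same entity must be summed before ranking."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds scores entity by entity
    return heapq.nlargest(k, total.items(), key=lambda kv: (kv[1], kv[0]))

def global_merge_entity_partitioned(local_topk_lists, k):
    """local_topk_lists: one [(entity, final_score), ...] list per node.
    No entity appears on two nodes, so a top-k merge is enough."""
    merged = [item for lst in local_topk_lists for item in lst]
    return heapq.nlargest(k, merged, key=lambda kv: (kv[1], kv[0]))
```

In the second case each node can ship only its local top-k, which keeps network transfer small in this sketch; in the first case a node cannot safely truncate, because an entity with a low local score may still rank high globally.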
Experiment Setup
Corpus: General crawl of the Web (Aug. 2007), around 3TB with 150M pages.
Entities: 24 diverse entity types.
Concrete applications (benchmark queries):
Yellowpage: # , #phone, #state, #location, #zipcode
CSAcademia: #university, #professor, #research, # , #phone
Metrics Used for Evaluation to Measure Throughput & Response Time
Local Processing Time: overall local processing time; max local processing time
Transfer Time: overall transfer time; max transfer time
Global Processing Time
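A small sketch of how these metrics could be computed from per-node measurements. It assumes "overall" means the sum across nodes (total work, relevant to throughput) and "max" means the slowest node (relevant to response time); the response-time estimate that adds the three stages is likewise an assumption, and the function name and inputs are hypothetical.

```python
def summarize_timings(local_ms, transfer_ms, global_ms):
    """local_ms / transfer_ms: one measurement per node, in milliseconds.
    global_ms: time spent at the coordinating node."""
    return {
        "overall_local": sum(local_ms),        # total local work
        "max_local": max(local_ms),            # slowest node
        "overall_transfer": sum(transfer_ms),  # total bytes-on-wire time
        "max_transfer": max(transfer_ms),      # slowest transfer
        "global": global_ms,
        # Assumed estimate: stages run one after another, and each
        # parallel stage finishes when its slowest node finishes.
        "response_estimate": max(local_ms) + max(transfer_ms) + global_ms,
    }
```

For example, three nodes with local times of 10, 30, and 20 ms have an overall local time of 60 ms but a max local time of 30 ms.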
Local Processing Time Comparison
Network Transfer Comparison
Global Processing Time Comparison
Overall Time/Space Summary: Generally, ~2 to 4 orders of magnitude speedup, with reasonable space overhead
Dual-Inversion Index: The two types of indexes can co-exist and complement each other
Indexing Configuration
Entity Type Level Configuration: Create an E-Inverted Index only for popular, space-efficient entities; use the D-Inverted Index for less popular, space-expensive entities.
Keyword Level Configuration: Only create an E-Inverted Index for (keyword, entity type) pairs when they are related, e.g., queried together often in the query log.
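The two configuration levels can be sketched as simple decision rules. The statistics, thresholds, and names below are all hypothetical; the slides state only the qualitative criteria (popularity, space cost, and how often a pair is queried).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class EntityTypeStats:
    name: str
    query_share: float  # assumed: fraction of logged queries using this type
    e_index_gb: float   # assumed: estimated E-inverted index size

def choose_index(stats, min_share=0.05, max_gb=50.0):
    """Entity-type level: E-inverted only when the type is both popular
    and cheap to index; otherwise fall back to D-inverted.
    Both thresholds are made-up illustration values."""
    if stats.query_share >= min_share and stats.e_index_gb <= max_gb:
        return "E-inverted"
    return "D-inverted"

def frequent_pairs(query_log, min_count=2):
    """Keyword level: keep only (keyword, entity_type) pairs that occur
    at least `min_count` times in the query log."""
    counts = Counter(query_log)
    return {pair for pair, c in counts.items() if c >= min_count}
```

A query whose type or pair has no E-inverted index would then be answered from the D-inverted index, which is how the two indexes complement each other in this sketch.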
Conclusion
Identify essential computation requirements for entity search.
Dual-inversion indexing and partition schemes for efficient and scalable query processing: Document-inverted index, Entity-inverted index.
Verify over a large-scale corpus with real applications.
Thanks for coming! Questions?
Top-k Convergence
References of Related Work
Index Design
Junghoo Cho and Sridhar Rajagopalan. A fast regular expression indexing engine. In ICDE.
Hugh E. Williams, Justin Zobel, and Dirk Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4):573–594.
Xiaohui Long and Torsten Suel. Three-level caching for efficient query processing in large web search engines. In WWW.
Michael Cafarella and Oren Etzioni. A search engine for large-corpus language applications. In WWW.
Question Answering
S. Abney, M. Collins, and A. Singhal. Answer extraction. In ANLP.
E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In EMNLP.
Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. In WWW.
Jimmy J. Lin and Boris Katz. Question answering from the web using knowledge annotation and knowledge mining techniques. In CIKM.
Search Interface
Query I: Amazon Customer Service Phone (screenshot columns: Results, # of Supporting Pages, Representative Supporting Pages)
Query II: Professors in Data Mining
Query III: University of California Locations