Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang
WSDM 2010, New York, USA
Web Info Extraction, Typed Entity Search, Web-based Q/A
In most cases, what we really want is not the pages themselves, but the information units inside them.
Specialized Information Extractors
Web Information Extraction (WIE) (Marius 2006, Cafarella 2005, Etzioni 2004)
Pattern: “X is CEO of Y”
Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
……
Limitations: focuses on simple patterns; lacks interactivity.
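To make the pattern-based extraction on this slide concrete, here is a minimal sketch that applies the single pattern “X is CEO of Y” to a few sentences and collects (company, CEO) pairs. The regular expression, the function name `extract_company_ceo`, and the sample sentences are illustrative assumptions, not part of any of the cited systems.

```python
import re

# Hypothetical regex for the slide's "X is CEO of Y" pattern.
# X: one or more capitalized tokens (periods allowed, e.g. "S.").
# Y: a single capitalized token.
CEO_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w.]*(?: [A-Z][\w.]*)*) is CEO of (?P<company>[A-Z]\w*)"
)


def extract_company_ceo(sentences):
    """Return (company, ceo) pairs for every match of the fixed pattern."""
    pairs = []
    for sentence in sentences:
        for match in CEO_PATTERN.finditer(sentence):
            pairs.append((match.group("company"), match.group("person")))
    return pairs


# Made-up sentences built from the names on the slide.
sentences = [
    "Eric Schmidt is CEO of Google.",
    "S. Palmisano is CEO of IBM.",
    "Google hired a new engineer.",
]
pairs = extract_company_ceo(sentences)
for company, ceo in pairs:
    print(company, ceo)
```

A fixed pattern like this also illustrates the limitation named on the slide: any sentence phrased differently is missed, and the user cannot change the pattern at query time.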
Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002)
Question: Who is CEO of Dell?
Keywords: “CEO Dell”
Parse top-k results
Answer: Michael Dell
Limitation: relies only on the top-k pages to retrieve the answer.
Typed-Entity Search (TES) (Cheng 2007, Cafarella 2007, Chakrabarti 2006)
Amazon Phone ……
Ranked Entity List
But … where is CEO?
Limitations: limited number of data types; lack of flexibility.
Data-oriented Content Query System
Web Info Extraction, Typed Entity Search, Web-based QA
Requirements:
1. Extensible data types
2. Flexible contextual patterns
3. Customizable scoring
Data-oriented Content Query System
Input: CQL (Content Query Language)
Output
Entity Search, Web QA
What we need: Relational Model
Person, Organization, Location, Number
What we need: Relational Model
Find the population of China
FROM #number
WHERE pattern(…)
GROUP BY #number
ORDER BY conf()
Matching text:
China has a population of 1.3 billion
China with its population of 1.3 billion people
China is established in
Shanghai is the largest city with 15 million inhabitants in China
Ranked results:
1. 1.3 billion
2. 15 million
…
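The query on this slide can be read as: find #number instances that match a contextual pattern, group identical values, and order the groups by confidence. The sketch below imitates that reading in plain Python. It is not CQL and not the authors' implementation: the `NUMBER` regex stands in for the #number data type, a same-sentence keyword check stands in for `pattern(…)`, and the count of supporting sentences stands in for `conf()`. All three are assumptions.

```python
import re
from collections import Counter

# Assumed stand-in for the #number data type: digits with an optional
# decimal part and an optional magnitude word.
NUMBER = re.compile(r"\d+(?:\.\d+)?(?: (?:million|billion))?")


def run_query(sentences, keyword):
    """Group the numbers found near `keyword` and rank them by support.

    Assumed stand-in for pattern(...): the keyword and the number occur in
    the same sentence. Assumed stand-in for conf(): the number of
    supporting sentences.
    """
    support = Counter()
    for sentence in sentences:
        if keyword not in sentence:
            continue
        for value in NUMBER.findall(sentence):
            support[value] += 1          # GROUP BY #number
    return support.most_common()         # ORDER BY conf(), highest first


# The complete sentences shown on the slide.
sentences = [
    "China has a population of 1.3 billion",
    "China with its population of 1.3 billion people",
    "Shanghai is the largest city with 15 million inhabitants in China",
]
ranked = run_query(sentences, "China")
for value, count in ranked:
    print(value, count)
```

Under these assumptions, "1.3 billion" is supported by two sentences and "15 million" by one, so the grouping step is what lets the redundant mentions push the better answer to the top.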
What we need: Relational Model
Number, Location, Person
Population, Phone, Price, Capital, Headquarter, Professor, CEO, President
Table view of Number: population, price, phone
INPUT: SELECT … FROM … WHERE …
Parsing Layer
Index Selection Module
Execution Tree
Index Layer
Data Type Repository: Data Type Definition
OUTPUT
Index Design: Special Inverted Index, Contextual Index, Join Index
Query Optimization: Graph Coverage Problem
Experimental Result
Speed improvement: 6-10 times
Space overhead: around 2 times the original corpus size
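The slide names the index structures without describing them, so the following is only a sketch of the general idea of indexing data-type instances alongside keywords. It builds an ordinary positional inverted index for tokens plus a separate posting list for a "#number" type, then answers a proximity lookup from the two lists. The layout, the function names `build_index` and `numbers_near`, and the window-based proximity test are assumptions, not the paper's Special Inverted Index, Contextual Index, or Join Index.

```python
import re
from collections import defaultdict

# Assumed recognizer for "#number" tokens (plain or decimal digits).
NUMBER = re.compile(r"^\d+(?:\.\d+)?$")


def build_index(docs):
    """Build keyword postings and data-type postings in a single pass."""
    keyword_index = defaultdict(list)  # token -> [(doc_id, position)]
    type_index = defaultdict(list)     # type name -> [(doc_id, position, value)]
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            keyword_index[token].append((doc_id, pos))
            if NUMBER.match(token):
                type_index["#number"].append((doc_id, pos, token))
    return keyword_index, type_index


def numbers_near(keyword_index, type_index, keyword, window):
    """Return (doc_id, value) for numbers within `window` tokens of keyword."""
    keyword_postings = keyword_index.get(keyword, [])
    hits = []
    for doc_id, pos, value in type_index.get("#number", []):
        if any(d == doc_id and abs(p - pos) <= window
               for d, p in keyword_postings):
            hits.append((doc_id, value))
    return hits


docs = [
    "China has a population of 1.3 billion",
    "Shanghai has 15 million inhabitants",
]
keyword_index, type_index = build_index(docs)
hits = numbers_near(keyword_index, type_index, "population", window=3)
print(hits)
```

Storing type instances as their own posting list means a query never has to rescan document text to find numbers, which is one plausible way an index could trade extra space for query speed.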
Data-oriented Content Query System
Web Info Extraction, Typed Entity Search, Web-based Q/A