A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data
Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton
Computer Sciences Department, University of Wisconsin-Madison
33rd International Conference on Very Large Data Bases, September , University of Vienna, Austria
Summarized by Chulki Lee, IDS Lab., Seoul National University
Presented by Chulki Lee, IDS Lab., Seoul National University
I want to know how cold Madison gets during winter
After searching with keywords and browsing… I found a wiki page, “Madison, Wisconsin”
There is a temperature table in the page… but how can I query over it?
In fact… it’s the structure implicit in unstructured documents!
Copyright 2007 by CEBT
Goal: Exploit the Structure!
– Improve keyword search and browsing
– Structured queries
  – “Which universities are in places that are very cold in winter?”
  – Queries that keyword search and browsing are not good at answering
Challenges
– Manage this evolving structure incrementally
– How to enable users to query using as much structure as is currently known?
Proposal: A Relational Workbench
– A way to store an expanding set of documents and attributes
– Tools to incrementally process the data
– A way to exploit structure in queries
Advantages
– Data always available for querying
– Supports incremental data processing
– Can pose increasingly sophisticated queries over time
– Exploits the strengths of an RDBMS
Relational Workbench
– Data Storage
– Data Processing
– Case Study: Wikipedia
Data Storage – Wide Table
– Problems: the structure keeps evolving and is heterogeneous, so there is no good schema
– So use a single table, with one document per row
– New attribute? Create a new column in the table
  – No overhead for storing nulls with interpreted storage or a column-oriented database
DocTitle | DocContent | official flower
Madison, Wisconsin | “Madison is the capital of …” | null
Seattle, Washington | “Seattle is the largest city in …” | Dahlia
Data Storage – Wide Table (more)
– Allow attributes to have internal structure
DocTitle | DocContent | official flower | headquarter(city, company)
Madison, Wisconsin | “Madison is the capital of …” | null | [(Madison, Raven Software), (Madison, Human Head Studios), …]
Seattle, Washington | “Seattle is the largest city in …” | Dahlia | [(Seattle, Starbucks), (Seattle, Amazon.com), …]
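The wide-table idea can be tried out in any relational engine. The following is a minimal sketch using Python's built-in sqlite3 module, not the storage layer described in the talk: the table name `wide` and the helper `add_attribute` are hypothetical, and SQLite does not provide the interpreted or column-oriented storage the slide relies on for cheap nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per document; start with only the columns known at load time.
conn.execute("CREATE TABLE wide (DocTitle TEXT PRIMARY KEY, DocContent TEXT)")
conn.executemany(
    "INSERT INTO wide VALUES (?, ?)",
    [
        ("Madison, Wisconsin", "Madison is the capital of ..."),
        ("Seattle, Washington", "Seattle is the largest city in ..."),
    ],
)

def add_attribute(conn, name):
    """Add a column for a newly discovered attribute (hypothetical helper)."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(wide)")}
    if name not in existing:
        # Identifiers cannot be bound as parameters, so quote the name by hand.
        quoted = '"' + name.replace('"', '""') + '"'
        conn.execute(f"ALTER TABLE wide ADD COLUMN {quoted} TEXT")

# A new attribute shows up: add the column, then fill it where it is known.
add_attribute(conn, "official flower")
conn.execute(
    'UPDATE wide SET "official flower" = ? WHERE DocTitle = ?',
    ("Dahlia", "Seattle, Washington"),
)

rows = conn.execute(
    'SELECT DocTitle, "official flower" FROM wide ORDER BY DocTitle'
).fetchall()
print(rows)
```

Documents without the attribute simply keep a null in the new column, which matches the Madison row in the slide's table.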
Data Storage – Mapping Table
– Store mappings for different attributes that correspond to the same real-world concept
– Query evaluation:
  – Look up the mapping table
  – Rewrite the query to include matching attributes
host id | host name | mappings
a6 | temp (℉) | { a6 = a7 * 9 / 5 + 32 }
a7 | temperature (℃) | { a7 = 5 / 9 * (a6 – 32) }
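One way to picture query evaluation with a mapping table is a lookup that falls back to an equivalent attribute when the requested one is null. This is only an illustrative sketch: the `MAPPINGS` dictionary, the `lookup` function, and the sample rows are assumptions, and the Celsius-to-Fahrenheit formula used for a6 is the standard inverse of the a7 mapping.

```python
# Hypothetical in-memory form of the mapping table: for each attribute,
# an equivalent attribute and the conversion from that attribute's value.
MAPPINGS = {
    "a6": ("a7", lambda c: c * 9 / 5 + 32),    # temp (F) from temperature (C)
    "a7": ("a6", lambda f: 5 / 9 * (f - 32)),  # temperature (C) from temp (F)
}

def lookup(row, attr):
    """Return row[attr]; if it is null, derive it from a mapped attribute."""
    value = row.get(attr)
    if value is not None:
        return value
    if attr in MAPPINGS:
        other, convert = MAPPINGS[attr]
        if row.get(other) is not None:
            return convert(row[other])
    return None

def below(rows, attr, threshold):
    """Titles of rows whose (possibly derived) attr is below threshold."""
    result = []
    for row in rows:
        value = lookup(row, attr)
        if value is not None and value < threshold:
            result.append(row["DocTitle"])
    return result

# Made-up rows: each document stores the temperature under only one attribute.
rows = [
    {"DocTitle": "doc A", "a6": 14.0, "a7": None},
    {"DocTitle": "doc B", "a6": None, "a7": -10.0},
    {"DocTitle": "doc C", "a6": None, "a7": 5.0},
]
freezing = below(rows, "a6", 32)
print(freezing)
```

A query written against a6 alone would see only doc A; consulting the mapping lets it also reach documents that recorded the value in Celsius.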
Data Processing
– The workbench does not decide how to process data
– It provides three basic operators:
  – Extract: “address” => “city”, “state”, “zip code”
  – Integrate: “address” = “sent-to”
  – Cluster
– DBAs decide which operators to use and set the parameters
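As a rough illustration of the Extract example on this slide, the sketch below splits an address string into city, state, and zip code with a regular expression. The pattern, the function name `extract_address`, and the sample address are assumptions made for the example; the talk does not say how the operator is implemented.

```python
import re

# Assumed address shape: "<city>, <two-letter state> <five-digit zip>".
ADDRESS_RE = re.compile(
    r"^\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)

def extract_address(address):
    """Derive three new attributes from one address value.

    Values that do not fit the assumed shape yield nulls, so the new
    columns simply stay empty for that document.
    """
    match = ADDRESS_RE.match(address)
    if match is None:
        return {"city": None, "state": None, "zip code": None}
    return {
        "city": match.group("city").strip(),
        "state": match.group("state"),
        "zip code": match.group("zip"),
    }

parsed = extract_address("Madison, WI 53706")
print(parsed)
```

Each key of the returned dictionary would become a new column, populated only for the rows where extraction succeeded.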
Data Processing (more)
– Operator interaction example: improving clustering via iteration
1. Extract section names from Wikipedia pages
  – A page without a section name cannot be clustered properly!
2. Cluster pages based on section names
3. Extract and cluster pages based on the other attributes
Case Study: Wikipedia (1/2)
– Dataset: American cities, universities, male tennis players
– Stage 1: Initial Loading
  – 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate
  – Can do keyword search over PageText immediately!
– Stage 2: Extracting Sections
  – 1,253 new columns: an explosion of new columns!
  – Integrate found many aliases
    – 350 of all attributes belonged to 1 of 14 attribute groups
Case Study: Wikipedia (2/2)
– Stage 3: Attribute Clustering
  – Grouped together attributes that were either both null or both non-null in a row
  – Views on clusters
    – ~25 ms for each cluster
    – 44 sec for the wide table
– Stage 4: Extracting Wiki Tables
  – We can extract temperature_wiki!
City | Month | Low_F | …
Madison, Wisconsin | 1 | 6 | …
Seattle, Washington | 1 | 36 | …
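The clustering criterion in Stage 3 can be approximated very simply by grouping attributes whose null/non-null pattern is identical across all rows. The sketch below does exactly that; it is a simplification for illustration (exact pattern matching, with made-up attribute names and rows), not necessarily the similarity measure used in the case study.

```python
from collections import defaultdict

def cluster_by_null_pattern(rows, attributes):
    """Group attributes that are null and non-null in exactly the same rows."""
    groups = defaultdict(list)
    for attr in attributes:
        # Signature: one boolean per row, True where the attribute has a value.
        signature = tuple(row.get(attr) is not None for row in rows)
        groups[signature].append(attr)
    return sorted(groups.values())

# Made-up rows: two city pages and one tennis-player page.
rows = [
    {"population": 1, "mayor": "x", "coach": None, "titles": None},
    {"population": 2, "mayor": "y", "coach": None, "titles": None},
    {"population": None, "mayor": None, "coach": "z", "titles": 3},
]
clusters = cluster_by_null_pattern(
    rows, ["population", "mayor", "coach", "titles"]
)
print(clusters)
```

Each resulting group is a candidate for a view over the wide table, which is what makes the per-cluster queries so much cheaper than scanning every column.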
Now we can query!
SELECT AVG(Low_F) FROM temperature_wiki AS T WHERE T.city = 'Madison, Wisconsin' AND (T.Month = 1 OR T.Month = 2 OR T.Month = 12);
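In SQL, AND binds more tightly than OR, so how the month conditions are grouped changes which rows are averaged. The sketch below runs both forms of the winter query in SQLite through Python's sqlite3 module; the temperature values are made up purely to show the difference and are not taken from Wikipedia.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperature_wiki (City TEXT, Month INTEGER, Low_F REAL)"
)
# Made-up values, chosen only to make the two results easy to tell apart.
conn.executemany(
    "INSERT INTO temperature_wiki VALUES (?, ?, ?)",
    [
        ("Madison, Wisconsin", 1, 10),
        ("Madison, Wisconsin", 2, 14),
        ("Madison, Wisconsin", 7, 60),
        ("Madison, Wisconsin", 12, 18),
        ("Seattle, Washington", 1, 36),
        ("Seattle, Washington", 2, 40),
        ("Seattle, Washington", 12, 30),
    ],
)

# Month conditions grouped: only Madison's winter months are averaged.
grouped = conn.execute(
    "SELECT AVG(Low_F) FROM temperature_wiki AS T "
    "WHERE T.City = 'Madison, Wisconsin' "
    "AND (T.Month = 1 OR T.Month = 2 OR T.Month = 12)"
).fetchone()[0]

# No parentheses: February and December rows of every city are included.
ungrouped = conn.execute(
    "SELECT AVG(Low_F) FROM temperature_wiki AS T "
    "WHERE T.City = 'Madison, Wisconsin' "
    "AND T.Month = 1 OR T.Month = 2 OR T.Month = 12"
).fetchone()[0]

print(grouped, ungrouped)
```

With these sample rows the grouped query averages three Madison rows, while the ungrouped one also pulls in Seattle's February and December lows.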
Conclusion
– Relational workbench to incrementally extract and query structure from unstructured data
  – Wide table
  – Mapping table
  – Operators
– Many problems ahead!