Published by Janice Atkinson. Modified over 9 years ago.
1
Answering Table Queries on the Web using Column Keywords
Rakesh Pimplikar, IBM Research
Sunita Sarawagi, IIT Bombay
2
Table Query Example
User query (column keywords): Name of Explorers | Nationality | Areas Explored
Answer table:
Name of Explorers | Nationality | Areas Explored
Vasco da Gama | Portuguese | Sea route to India
Abel Tasman | Dutch | Oceania
Christopher Columbus | | Caribbean
...
3
Types of Structured Queries
- Entities
  Query: Mountains in North America
  Answer: Mount McKinley, Mount Saint Elias, Mount Lucania, ...
- Relationship between two entities
  Query: Pain Killers | Side Effects
  Answer:
  Pain Killers | Side Effects
  aspirin | asthma
  ibuprofen | asthma, upset stomach
  naproxen sodium | upset stomach
  ...
- Entities with values of attributes
  Query: Name of Explorers | Nationality | Areas Explored
  Answer:
  Name of Explorers | Nationality | Areas Explored
  Vasco da Gama | Portuguese | Sea route to India
  Abel Tasman | Dutch | Oceania
  Christopher Columbus | | Caribbean
  ...
4
Our Data Source
Tables on the Web
- Example sources: elizabethan-era.org.uk, wikipedia.org, vaughns-1-pagers.com
- Richer sources of structured knowledge than free-format text
5
Challenges
- Limited column-specific information: the query is a set of keywords; web tables have headers.
- The designated HTML table header tag is not always used (80%).
- Many tables have no headers (18%).
- Header text is often uninformative.
- The context of a table can be helpful, but it does not give column-specific information and it is often noisy.
Example tables shown on the slide:
Table without headers:
Afghanistan | Kabul
Albania | Tirana
Algeria | Algiers
Andorra | Andorra la Vella
...
Table with informative headers:
Name | Nationality | Main areas explored
Abel Tasman | Dutch | Oceania
Vasco da Gama | Portuguese | Sea route to India
Alexander Mackenzie | British | Canada
...
Table titled "Forest reserves":
ID | Name | Area
7 | Shakespeare Hills | 2236
9 | Plains Creek | 880
13 | Welcome Swamp | 168
...
Table with context, for the user query "Nobel Prize Winners":
Context: The present list contains winners under the country/countries that are stated by the Nobel Prize committee on its website.
Year | Name | Subject
1902 | Ronald Ross | Medicine
1907 | Rudyard Kipling | Literature
...
6
System Architecture of WWT
(Figure: system architecture diagram.)
7
The Column Mapping Task
User query: Name of Explorers | Nationality | Areas Explored
An index probe returns the relevant tables (Web Tables 1 to 3):
Name | Nationality | Main areas explored
Abel Tasman | Dutch | Oceania
Vasco da Gama | Portuguese | Sea route to India
Alexander Mackenzie | British | Canada
...
Exploration | Who (explorer)
(Chronological order)
Sea route to India | Vasco da Gama
Caribbean | Christopher Columbus
Oceania | Abel Tasman
...
Forest reserves
ID | Name | Area
7 | Shakespeare Hills | 2236
9 | Plains Creek | 880
13 | Welcome Swamp | 168
...
Context snippets shown alongside the tables:
- This article lists the explorations in history. For the documentary 'Explorations, powered by Duracell', see Explorations (TV)
- List of explorers - Wikipedia, the free encyclopedia
- Other Formal Reserves 1.3 Forest Reserves under the Forestry Act 1920
- All areas will be available for mineral exploration and mining
8
The Column Mapping Task
User query: Name of Explorers | Nationality | Areas Explored
Pipeline: Web Tables 1 to 3, then Map Columns, then Consolidation, then Answer Table.
Answer table:
Name of Explorers | Nationality | Areas Explored
Vasco da Gama | Portuguese | Sea route to India
Abel Tasman | Dutch | Oceania
Christopher Columbus | | Caribbean
Alexander Mackenzie | British | Canada
...
(Figure: the same three web tables and context snippets as on the previous slide, with their columns mapped to the query columns.)
9
A Baseline Approach
Inputs: user query Q with columns Q1, Q2, Q3; table t with headers h1, h2, h3, h4 and context C.
For each table t:
Step 1: Is t relevant? Test IR_Sim(Q, C) + λ · IR_Sim(Q, h) > Threshold.
Step 2: If yes, map columns of t to columns in Q, using edge weight = IR_Sim(Q_i, h_j).
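The two steps of the baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ir_sim` is a stand-in cosine similarity over word counts, the values of `lam` and `threshold` are arbitrary, and the matching is brute force and assumes the table has at least as many headers as the query has columns.

```python
import math
from collections import Counter
from itertools import permutations


def ir_sim(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (assumed stand-in for IR_Sim)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
        math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def is_relevant(query_cols, context, headers, lam=1.0, threshold=0.3):
    """Step 1: IR_Sim(Q, C) + lam * IR_Sim(Q, h) > threshold (lam, threshold are illustrative)."""
    q = " ".join(query_cols)
    h = " ".join(headers)
    return ir_sim(q, context) + lam * ir_sim(q, h) > threshold


def map_columns(query_cols, headers):
    """Step 2: maximum-weight matching with edge weight IR_Sim(Q_i, h_j).

    Brute force over assignments; assumes len(headers) >= len(query_cols).
    Returns {query column index: header index}, dropping zero-weight pairs.
    """
    w = [[ir_sim(q, h) for h in headers] for q in query_cols]
    best = max(permutations(range(len(headers)), len(query_cols)),
               key=lambda p: sum(w[i][j] for i, j in enumerate(p)))
    return {i: j for i, j in enumerate(best) if w[i][j] > 0}


query = ["Name of Explorers", "Nationality", "Areas Explored"]
explorer_headers = ["Name", "Nationality", "Main areas explored"]
if is_relevant(query, "List of explorers", explorer_headers):
    print(map_columns(query, explorer_headers))
```

With these toy settings the explorers table passes the relevance test, while a table with context "Forest reserves" and headers ID, Name, Area does not.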
10
Limitations of Baseline
- Cannot match tables with poor/missing headers. E.g.
  Name | Nationality | Main areas explored
  Abel Tasman | Dutch | Oceania
  ...
- Exploit content overlap with related tables. How?
11
Graphical Model Approach
Create a node for every column (nodes N1 to N9 over the three web tables).
User query: Name of Explorers | Nationality | Areas Explored
Name | Nationality | Main areas explored
Abel Tasman | Dutch | Oceania
Vasco da Gama | Portuguese | Sea route to India
Alexander Mackenzie | British | Canada
...
Exploration | Who (explorer) | Century
(Chronological order)
Sea route to India | Vasco da Gama | 15th/16th
Caribbean | Christopher Columbus | 15th/16th
Oceania | Abel Tasman | 17th
...
Forest reserves
ID | Name | Area
7 | Shakespeare Hills | 2236
9 | Plains Creek | 880
13 | Welcome Swamp | 168
...
12
Graphical Model Approach
User query: Name of Explorers | Nationality | Areas Explored
Possible labels for every node:
1. Name of explorers
2. Nationality
3. Areas Explored
(Figure: the three web tables with column nodes N1 to N9; some nodes are annotated with labels 1, 2 and 3.)
13
Graphical Model Approach
User query: Name of Explorers | Nationality | Areas Explored
Possible labels for every node:
1. Name of explorers
2. Nationality
3. Areas Explored
4. NA (Not Assigned)
5. NR (Not Relevant)
(Figure: the three web tables with column nodes N1 to N9; nodes are annotated with labels 1, 2, 3, NA and NR.)
14
Graphical Model Approach
User query: Name of Explorers | Nationality | Areas Explored
Edges
- Complete bipartite graph between nodes of two tables
- Edge weights: content overlap between column contents and headers
- Maximum bipartite matching
(Figure: the two explorer tables with column nodes N1 to N6; edge weights shown: 0.6, 0.2, 0.1, 0.7, 0, 0, 0, 0, 0.1.)
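The edge construction can be illustrated with a small sketch. It makes several assumptions that are not from the slides: overlap is measured as Jaccard similarity over cell values only (headers are ignored), the matching is found by brute force, and the first table has no more columns than the second. The cell values are the ones shown in the two explorer tables.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard overlap between two collections of cell values (assumed overlap measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_columns(table_a, table_b):
    """Maximum-weight bipartite matching between the columns of two tables.

    Each table is a list of columns; each column is a list of cell strings.
    Assumes len(table_a) <= len(table_b). Returns {a_index: (b_index, weight)},
    keeping only edges with positive weight.
    """
    w = [[jaccard(ca, cb) for cb in table_b] for ca in table_a]
    best = max(permutations(range(len(table_b)), len(table_a)),
               key=lambda p: sum(w[i][j] for i, j in enumerate(p)))
    return {i: (j, w[i][j]) for i, j in enumerate(best) if w[i][j] > 0}


table_1 = [
    ["Abel Tasman", "Vasco da Gama", "Alexander Mackenzie"],   # Name
    ["Dutch", "Portuguese", "British"],                        # Nationality
    ["Oceania", "Sea route to India", "Canada"],               # Main areas explored
]
table_2 = [
    ["Sea route to India", "Caribbean", "Oceania"],            # Exploration
    ["Vasco da Gama", "Christopher Columbus", "Abel Tasman"],  # Who (explorer)
    ["15th/16th", "15th/16th", "17th"],                        # Century
]
edges = match_columns(table_1, table_2)
print(edges)
```

Here the Name column links to "Who (explorer)" and "Main areas explored" links to "Exploration", while Nationality has no overlapping partner and gets no edge.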
15
Graphical Model Approach
User query: Name of Explorers | Nationality | Areas Explored
Edge Potentials
- Large weights: same label
- Soft constraint
(Figure: the three web tables with column nodes N1 to N9; edge weights shown: 0.6, 0.7, 0.3, 0.4, 0.1.)
16
Node Potentials
- Score expressing the affinity of a table column c_i to a query column Q_j
- Baseline approach: IR similarity between Q_j and the header of c_i
17
Limitations of Baseline Similarity
- Generic IR similarity is not a good fit for the typical roles of context + headers:
  - Context: "topic" of a table
  - Header: label of a column
- New model based on a two-part segmentation of query words over context and header.
Example 1. User query: Nobel Prize Winners
Context: The present list contains laureates under the country/countries that are stated by the Nobel Prize committee on its website.
Year | Winners | Subject
1902 | Ronald Ross | Medicine
1907 | Rudyard Kipling | Literature
...
Example 2. User query: Mountains in North America
Context: This article presents a comprehensive list of peaks in North America, highlighting some of the important features.
Mountain Peak | Region | Elevation
Mount McKinley | Alaska | 6194 m
Mount Logan | Yukon | 5956 m
...
18
Limitations of Baseline Similarity
- Matches with other parts of the table are ignored:
  - Frequent words in a column
  - Multi-row headers: split headers match to the union of words, whereas sub-headers match only to one header. How to detect which of the two?
- Take soft-max over matches over context, body, and other headers of the table on one part of the segmented query.
Example 1. User query: Black metal bands
Band name | Country | Genre
Aarcon | Germany | Black Metal
Act of God | Russia | Melodic Black
Adragard | Italy | Black Metal
...
Example 2. User query: Name of Explorers | Nationality | Areas Explored
Name | Nationality | Main areas explored
Abel Tasman | Dutch | Oceania
...
Exploration | Who (explorer)
(Chronological order)
Sea route to India | Vasco da Gama
...
19
Segmented Similarity
Similarity score between a table column c_i and a query column Q_j
(Figure: user query with keywords "Nobel Prize Winners" and "Year"; the query column Q_j and the table column c_i are marked.)
20
Segmented Similarity
Similarity score between a table column c_i and a query column Q_j
- Maximum soft-max score over matches of different segments of the query with different parts of a table
(Figure: user query "Nobel Prize Winners", "Year"; the segments "Nobel Prize" and "Winners" of Q_j are matched against the Title, Context and Header Rows of the table containing column c_i.)
21
Segmented Similarity
Step 1: Segment query column keywords into two parts (prefix, suffix)
Step 2: Similarity scores between each part and different sections of the table
Step 3: Soft-max over sections of the table where each part matches
Table sections (soft-max over matches over all sections):
- Title
- Context
- Header text (current header row)
- Other headers in the same row
- Other header rows in the same column
- Frequent body contents
(Figure: user query "Nobel Prize Winners", "Year", split into prefix and suffix and matched against the sections of the table containing column c_i.)
22
Segmented Similarity
Step 1: Segment query column keywords into two parts (prefix, suffix)
Step 2: Similarity scores between each part and different sections of the table
Step 3: Soft-max over sections of the table where each part matches
(Figure: user query "Nobel Prize Winners", "Year", split into prefix and suffix; soft-max over matches over all sections: Title, Context, Header Rows, Current Header Row, for column c_i.)
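The three steps can be sketched in code under loudly stated assumptions. The slides do not give the exact formulas, so everything below is hypothetical: similarity is the fraction of a part's words found in a section, the soft-max is a Boltzmann-weighted average with temperature `tau`, the suffix is matched against the column header, the prefix against the other sections, and the two part scores are combined by a product.

```python
import math
import re


def words(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def coverage(part, text):
    """Fraction of the words in `part` that also occur in `text` (assumed similarity)."""
    if not part:
        return 0.0
    vocab = set(words(text))
    return sum(w in vocab for w in part) / len(part)


def soft_max(scores, tau=5.0):
    """Boltzmann-weighted average of scores; approaches max(scores) as tau grows."""
    if not scores:
        return 0.0
    weights = [math.exp(tau * s) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def segmented_similarity(query, header, sections, tau=5.0):
    """Best score over all two-part (prefix, suffix) segmentations of the query."""
    q = words(query)
    best = 0.0
    for k in range(len(q)):  # Step 1: the suffix q[k:] is never empty
        prefix, suffix = q[:k], q[k:]
        header_score = coverage(suffix, header)          # Step 2: suffix vs header
        prefix_score = 1.0 if not prefix else soft_max(  # Step 3: soft-max over sections
            [coverage(prefix, s) for s in sections], tau)
        best = max(best, prefix_score * header_score)    # product is an assumed combination
    return best


context = ("The present list contains laureates under the country/countries "
           "that are stated by the Nobel Prize committee on its website.")
headers = ["Year", "Winners", "Subject"]
scores = {
    h: segmented_similarity(
        "Nobel Prize Winners", h,
        [context, " ".join(o for o in headers if o != h)])
    for h in headers
}
print(max(scores, key=scores.get))
```

In this toy run "Nobel Prize" is explained by the context and "Winners" by the header, so the Winners column scores highest even though no single section contains the whole query.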
23
Hard Constraints
- MUTEX Constraint: At most one column in a table can be mapped to a query column.
- ALL-IRR Constraint: If one column in a table is assigned the label NR, then all columns of the table must be assigned NR.
- MUST-MATCH Constraint: Every relevant table must contain the first query column.
- MIN-MATCH Constraint: Every relevant table must contain at least m out of q query columns. m = 2 if q >= 2.
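The four constraints can be checked for one table's labeling with a short sketch. The encoding is an assumption made for illustration: query columns are integers 0 to q-1 (0 being the first query column), and the strings "NA" and "NR" stand for the two special labels. The slide only states m = 2 for q >= 2; using m = 1 for a single-column query is an assumption.

```python
NA, NR = "NA", "NR"


def satisfies_hard_constraints(labels, q):
    """Check one table's column labels against the four hard constraints.

    labels: one label per table column, each in {0, ..., q-1, NA, NR}.
    q: number of query columns.
    """
    # ALL-IRR: if any column is NR, every column must be NR.
    if NR in labels:
        return all(label == NR for label in labels)

    mapped = [label for label in labels if label != NA]

    # MUTEX: no query column is assigned to more than one table column.
    if len(mapped) != len(set(mapped)):
        return False

    # MUST-MATCH: a relevant table must contain the first query column.
    if 0 not in mapped:
        return False

    # MIN-MATCH: at least m of the q query columns (m = 1 for q = 1 is assumed).
    m = 2 if q >= 2 else 1
    return len(mapped) >= m


print(satisfies_hard_constraints([0, 1, 2], q=3))
print(satisfies_hard_constraints([0, 0, NA], q=3))
```

A table labeled entirely NR passes trivially, since the remaining three constraints apply only to relevant tables.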
24
Labeling the Graphical Model
Final goal: Jointly assign one of |Q|+2 labels to each column to
- maximize the sum of node and edge potentials
- satisfy the hard constraints
This problem is NP-hard.
(Figure: graph over column nodes N1 to N9.)
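For intuition, the objective can be solved exactly on a tiny instance by exhaustive search, which is only feasible because the example has six nodes. This is a sketch, not one of the inference algorithms used in the talk. The potentials are made-up numbers, and one modelling choice is assumed: an edge contributes its weight only when both endpoints take the same query-column label.

```python
from itertools import product

NA, NR = "NA", "NR"


def table_ok(labels, q):
    """Hard constraints for one table (ALL-IRR, MUTEX, MUST-MATCH, MIN-MATCH)."""
    if NR in labels:
        return all(label == NR for label in labels)
    mapped = [label for label in labels if label != NA]
    m = 2 if q >= 2 else 1  # m = 1 for q = 1 is an assumption
    return len(mapped) == len(set(mapped)) and 0 in mapped and len(mapped) >= m


def best_labeling(tables, node_pot, edges, q):
    """Exhaustively maximise node + edge potentials subject to the hard constraints.

    tables: list of lists of node ids; node_pot: {node: {label: score}};
    edges: {(u, v): weight}. Missing node potentials default to 0.
    """
    nodes = [n for t in tables for n in t]
    labels = list(range(q)) + [NA, NR]  # |Q| + 2 labels per node
    best, best_score = None, float("-inf")
    for assignment in product(labels, repeat=len(nodes)):
        lab = dict(zip(nodes, assignment))
        if not all(table_ok([lab[n] for n in t], q) for t in tables):
            continue
        score = sum(node_pot.get(n, {}).get(lab[n], 0.0) for n in nodes)
        score += sum(w for (u, v), w in edges.items()
                     if lab[u] == lab[v] and lab[u] not in (NA, NR))
        if score > best_score:
            best, best_score = lab, score
    return best, best_score


# Hypothetical instance: table A has informative headers, table B does not.
tables = [["N1", "N2", "N3"], ["N4", "N5", "N6"]]
node_pot = {"N1": {0: 0.6}, "N2": {1: 1.0}, "N3": {2: 0.8}, "N6": {NA: 0.1}}
edges = {("N1", "N5"): 0.5, ("N3", "N4"): 0.5}  # content-overlap edges
labeling, score = best_labeling(tables, node_pot, edges, q=3)
print(labeling, round(score, 2))
```

Table B has no header evidence at all, yet its columns N5 and N4 inherit labels from table A through the overlap edges, which is the effect the edge potentials are meant to provide.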
25
Inference Algorithms
- Collective Inference: Table-Centric
  - Edge potentials are used to modify node potentials
  - Optimal table-level inference
- Collective Inference: Edge-Centric
  - Edge potentials are given central importance.
  - Existing inference algorithms: Belief Propagation, TRWS, MPLP, α-Expansion, etc.
  - Modified α-Expansion works best in our case.
26
Experimental Setup
- Workload: 59 multi-column queries, mostly collected from the Amazon Mechanical Turk (AMT) service [Cafarella et al, 2009]
- Data source: 25 million tables from a web crawl of 500 million pages
- Ground truth: manual labeling for 1906 web tables
27
Column Mapping Methods
- Baseline
- NbrText: Baseline augmented with similarity scores from neighboring columns
- PMI [Cafarella et al, 2009]: Baseline augmented with corpus-wide co-occurrence score of column contents and a label
- WWT: Our graphical model based approach with table-centric collective inference
28
Column Mapping Methods Comparison
Overall error: WWT 30.3%, Baseline & PMI 34.7%, NbrText 34.2%.
(Figure: error comparison chart for the four methods.)
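As a quick arithmetic check, the relative error reduction implied by these overall error figures can be computed directly. The helper name `relative_reduction` is introduced here for illustration only; the input percentages are the ones quoted on the slide.

```python
def relative_reduction(baseline_error, new_error):
    """Relative reduction in error: (baseline - new) / baseline."""
    return (baseline_error - new_error) / baseline_error


overall_error = {"WWT": 30.3, "Baseline": 34.7, "PMI": 34.7, "NbrText": 34.2}

for method in ("Baseline", "NbrText"):
    reduction = relative_reduction(overall_error[method], overall_error["WWT"])
    print(f"WWT vs {method}: {100 * reduction:.1f}% relative reduction")
```

Against the Baseline figure this gives (34.7 - 30.3) / 34.7, a little under 13%, which appears consistent with the roughly 12% reduction quoted in the summary.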
29
Running Time
- Actual column mapping takes less than half a second.
- Time for table and index reads can be improved with a better machine configuration.
(Figure: running time breakdown.)
30
Summary
- Presented a graphical model approach for answering table queries
- A novel method to find similarity using a two-part query segmentation model
- Robust mechanism for exploiting content and header overlap across table columns
- Different algorithms for inference in the graphical model
- 12% reduction in error relative to a baseline method
Future Work
- Exploiting newer corpus-wide co-occurrence statistics
- Alternative structured sources such as ontologies
- Enhance the search experience via faceted search and user feedback.
31
Thank you.
32
Related Work
- Query-By-Example Paradigm [2]
  - Extracting tables from lists on the web
  - Web tables are not considered
- Halevy et al [3, 4]
  - Highlight the potential of web tables as a source of structured information
  - Collecting offline information like attribute columns, attribute associations, etc.
- OCTOPUS [1]
  - Multiple user interactions are necessary
  - PMI score for relevance ranking is not effective in our case.
- Schema Matching [5, 6]
  - Managing complex alignment between the large number of schema elements in two databases
  - Web tables are noisy, unlike database tables.
33
References
1. M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. PVLDB, 2(1):1090–1101, 2009.
2. R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. PVLDB, 2(1):289–300, 2009.
3. M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
4. M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the relational web. In WebDB, 2008.
5. A. Doan and A. Y. Halevy. Semantic integration research in the database community: A brief survey. The AI Magazine, 26(1):83–94, 2005.
6. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.
34
Segmented Similarity
- Overall error reduction from 33.3% to 30.3%
- Reduction is more than 10% in 8 cases.