WebIQ: Learning from the Web to Match Deep-Web Query Interfaces
Wensheng Wu
Database & Information Systems Group, University of Illinois, Urbana
Joint work with AnHai Doan & Clement Yu
ICDE, April 2006
Search Problems on the Deep Web
Find round-trip flights from Chicago to New York under $500.
[Figure: query interfaces of united.com, airtravel.com, and delta.com]
Solution: Build Data Integration Systems
Find round-trip flights from Chicago to New York under $500: pose the query once over a global query interface that mediates united.com, airtravel.com, delta.com, etc.
In effect, comparison-shopping systems "on steroids".
Current State of Affairs
Very active in both the research community & industry.
Research:
– multidisciplinary efforts: Database, Web, KDD & AI
– 10+ research groups in the US, Asia & Europe
– focus areas: source discovery, schema matching & integration, query processing, data extraction
Industry:
– Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, ...
Key Task: Schema Matching
[Figure: example query interfaces illustrating a 1-1 match and a complex match between attributes]
Schema Matching is Ubiquitous!
A fundamental problem in numerous applications:
– data integration
– data warehousing
– peer data management
– ontology merging
– view integration
– personal information management
Schema matching across Web sources:
– 30+ papers generated in the past few years
– Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], ...
Schema Matching is Still Very Difficult
Matchers must rely on properties of attributes, e.g., labels & instances.
Often there is little in common between matching attributes.
Many attributes do not even have instances!
[Figure: interface fragments showing a 1-1 match and a complex match]
Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
28.1% to 74.6% of attributes have no instances, depending on the domain.
Extremely challenging to match these attributes:
– e.g., does departure city match from city or departure date?
Also difficult to match attributes with dissimilar instances:
– e.g., airline (with instances that are American carriers) vs. carrier (with European ones)
Our Solution: Exploit the Web
Discover instances from the Web:
– e.g., Chicago, New York, etc. for departure city & from city
Borrow instances from other attributes & validate them via the Web:
– e.g., check with the Web whether Air Canada is an instance of carrier
Key Idea: Question Answering from AI
Search the Web via search engines, e.g., Google... but search engines do not understand natural-language questions.
Idea: form extraction queries as sentences to be completed, and "trick" the search engine into completing them with instances.
Example extraction query for the attribute label departure city: "departure cities such as"
Extraction patterns (L = attribute label, NP1, ..., NPn = noun phrases to be extracted):
– Ls such as NP1, ..., NPn
– such Ls as NP1, ..., NPn
– NP1, ..., NPn, and other Ls
– Ls including NP1, ..., NPn
(a sketch of instantiating these patterns follows)
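Below is a minimal sketch, not the paper's implementation, of how such extraction queries could be instantiated from an attribute label; the naive pluralization helper is an assumption added for illustration.

```python
def pluralize(label: str) -> str:
    """Very rough English pluralization of a label ("city" -> "cities")."""
    return label[:-1] + "ies" if label.endswith("y") else label + "s"

def extraction_queries(label: str) -> list[str]:
    """Instantiate the extraction patterns; the search engine 'completes'
    each query with noun phrases, which become instance candidates."""
    ls = pluralize(label)
    return [
        f'"{ls} such as"',    # Ls such as NP1, ..., NPn
        f'"such {ls} as"',    # such Ls as NP1, ..., NPn
        f'"and other {ls}"',  # NP1, ..., NPn, and other Ls
        f'"{ls} including"',  # Ls including NP1, ..., NPn
    ]

print(extraction_queries("departure city"))
# ['"departure cities such as"', '"such departure cities as"',
#  '"and other departure cities"', '"departure cities including"']
```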
Key Idea: Question Answering from AI (cont.)
Search Google & obtain snippets, e.g.:
"other departure cities such as Boston, Chicago and LAX available ..."
Here the extraction query is the fixed prefix, and its completion supplies the candidates.
Extract instance candidates from the snippets: Boston, Chicago, LAX
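A hedged sketch of this extraction step: locate the extraction phrase in a snippet and split the enumeration that follows it. Real snippets are far messier; the clause-boundary markers used here are ad-hoc assumptions for this example.

```python
import re

def extract_candidates(snippet: str, phrase: str) -> list[str]:
    """Return the noun phrases that 'complete' the extraction phrase."""
    m = re.search(re.escape(phrase) + r"\s+(.*)", snippet, re.IGNORECASE)
    if not m:
        return []
    # Cut the enumeration at the first ad-hoc clause boundary.
    tail = re.split(r"[.;]|\s+available", m.group(1))[0]
    parts = re.split(r",\s*|\s+and\s+", tail)
    return [p.strip() for p in parts if p.strip()]

snippet = "other departure cities such as Boston, Chicago and LAX available ..."
print(extract_candidates(snippet, "departure cities such as"))
# ['Boston', 'Chicago', 'LAX']
```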
But Not Every Candidate is a True Instance
Reason 1: extraction queries may not be perfect.
Reason 2: Web content is inherently noisy.
Example:
– attribute: city
– extraction query: "and other cities"
– extracted candidate: 150
Hence we need to perform instance verification.
Instance Verification: Outlier Detection
Goal: remove statistical outliers among the candidates.
Step 1: Pre-processing
– recognize the type of the instances via pattern matching & an 80% rule
– types: numeric & string
– discard all candidates not of the determined type
– e.g., most instance candidates for city are strings, so remove 150
Step 2: Type-specific detection
– perform discordance tests with test statistics such as:
– # of words: abnormal if a person name has more than 5 words
– % of numeric characters: a US zip code contains only digits
(a sketch of both steps follows)
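Here is a minimal sketch of both steps under simplifying assumptions: the type recognizer only distinguishes numeric from string, and the only discordance test implemented is the word-count statistic from the slide.

```python
def is_numeric(c: str) -> bool:
    return c.replace(".", "", 1).isdigit()

def dominant_type(cands: list[str]) -> str | None:
    """80% rule: adopt a type only if >= 80% of the candidates have it."""
    numeric = sum(map(is_numeric, cands))
    if numeric >= 0.8 * len(cands):
        return "numeric"
    if len(cands) - numeric >= 0.8 * len(cands):
        return "string"
    return None

def remove_outliers(cands: list[str], max_words: int = 5) -> list[str]:
    t = dominant_type(cands)
    kept = []
    for c in cands:
        if t is not None and (t == "numeric") != is_numeric(c):
            continue  # Step 1: not of the determined type, discard
        if t == "string" and len(c.split()) > max_words:
            continue  # Step 2: discordance test on # of words
        kept.append(c)
    return kept

print(remove_outliers(["Boston", "Chicago", "LAX", "New York", "150"]))
# ['Boston', 'Chicago', 'LAX', 'New York']  ('150' fails the type test)
```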
Instance Verification: Web Validation
Goal: further, semantic-level validation.
Idea: exploit co-occurrence statistics of the label & its instances, e.g.:
– "Make: Honda; Model: Accord"
– "a variety of makes such as Honda, Mitsubishi"
Form validation queries V + x (label L, candidate x) using validation patterns:
– L x (e.g., "make Honda")
– Ls such as x (e.g., "makes such as Honda")
– such Ls as x
– x and other Ls
– Ls including x
(a sketch follows)
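A short sketch of forming these validation queries, mirroring the extraction-query code above; the pluralization helper remains the same rough assumption.

```python
def validation_queries(label: str, x: str) -> list[str]:
    """Instantiate the validation patterns V + x for a label and candidate."""
    ls = label[:-1] + "ies" if label.endswith("y") else label + "s"
    return [
        f'"{label} {x}"',         # L x
        f'"{ls} such as {x}"',    # Ls such as x
        f'"such {ls} as {x}"',    # such Ls as x
        f'"{x} and other {ls}"',  # x and other Ls
        f'"{ls} including {x}"',  # Ls including x
    ]

print(validation_queries("make", "Honda"))
# ['"make Honda"', '"makes such as Honda"', '"such makes as Honda"',
#  '"Honda and other makes"', '"makes including Honda"']
```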
Instance Verification: Web Validation (cont.)
A possible measure: NumHits(V + x)
– e.g., NumHits("cities such as Los Angeles") = 26M
Potential problem: bias towards popular instances.
Instead, use the pointwise mutual information
PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x))
Example:
– V = "cities such as"; candidates: California, Los Angeles
– NumHits(V + California) = 29
– PMI(V, Los Angeles) ≈ 3000 * PMI(V, California)
(a sketch follows)
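A sketch of the PMI score, assuming a num_hits(query) function backed by a search engine; here it is stubbed with canned counts, of which only the two V + x counts come from the slide and the rest are made-up placeholders.

```python
HITS = {
    "cities such as": 50_000_000,              # placeholder
    "Los Angeles": 80_000_000,                 # placeholder
    "California": 120_000_000,                 # placeholder
    "cities such as Los Angeles": 26_000_000,  # from the slide
    "cities such as California": 29,           # from the slide
}

def num_hits(query: str) -> int:
    return HITS.get(query, 0)  # stand-in for a real hit-count lookup

def pmi(v: str, x: str) -> float:
    """PMI(V, x) = NumHits(V + x) / (NumHits(V) * NumHits(x))."""
    denom = num_hits(v) * num_hits(x)
    return num_hits(f"{v} {x}") / denom if denom else 0.0

v = "cities such as"
print(pmi(v, "Los Angeles") > pmi(v, "California"))
# True: normalizing by each term's own popularity removes the raw-hit
# bias toward hugely popular non-instances like "California"
```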
Validate Instances Borrowed from Other Attributes
Method 1: discover k more instances from the Web, then check whether the borrowed one (Aer Lingus for Airline) is among them.
– problem: very likely Aer Lingus is not among the discovered instances
Method 2: compare the borrowed candidate's validation score with that of a known instance.
– problem: the score for Aer Lingus may be much lower; how to decide?
Key observation: also compare against the scores of known non-instances,
– e.g., Economy (with respect to Airline)
Train a Validation-Based Instance Classifier
A Naive Bayes classifier with validation-based features.
Validation scores, with validation phrases V1 = "Airlines such as" and V2 = "Airline":

Example     | M1 | M2  | +/-
Air Canada  | .5 | .3  | +
American    | .8 | .1  | +
Economy     | .4 | .03 | -
First Class | .2 | .05 | -
Delta       | .6 | .3  | +
United      | .9 | .4  | +
Jan         | .1 | .06 | -
1           | .3 | .09 | -

Features binarized with thresholds t1 = .45, t2 = .075:

Example | f1 | f2 | +/-
Delta   | 1  | 1  | +
United  | 1  | 1  | +
Jan     | 0  | 0  | -
1       | 0  | 1  | -

Model: P(C|X) ~ P(C) P(X|C), with P(+) = P(-) = 1/2, P(f1=1|+) = 3/4, P(f1=1|-) = 1/4, ...
(a sketch follows)
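A minimal sketch of such a classifier; add-one (Laplace) smoothing is assumed, since with it the four training rows above reproduce the slide's P(f1=1|+) = 3/4 and P(f1=1|-) = 1/4.

```python
T1, T2 = 0.45, 0.075  # binarization thresholds from the slide

TRAIN = [  # featurized training examples (the slide's second table)
    ("Delta", (1, 1), "+"), ("United", (1, 1), "+"),
    ("Jan", (0, 0), "-"), ("1", (0, 1), "-"),
]

def cond_prob(i: int, value: int, label: str) -> float:
    """Laplace-smoothed P(f_i = value | label)."""
    rows = [f for _, f, y in TRAIN if y == label]
    p_one = (sum(f[i] for f in rows) + 1) / (len(rows) + 2)
    return p_one if value else 1 - p_one

def classify(m1: float, m2: float) -> str:
    f = (int(m1 > T1), int(m2 > T2))
    scores = {}
    for y in ("+", "-"):
        p = 0.5  # P(+) = P(-) = 1/2, as on the slide
        for i in (0, 1):
            p *= cond_prob(i, f[i], y)
        scores[y] = p
    return max(scores, key=scores.get)

# Borrowed candidate Air Canada, with validation scores M1=.5, M2=.3:
print(classify(0.5, 0.3))  # '+': accepted as an instance of Airline
```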
Validate Instances via the Deep Web
Handle attributes that are difficult to validate via the surface Web, e.g., from: submit the candidate through the source's own query interface and check whether any results come back.
Disadvantage: ambiguity when no results are found.
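A hedged sketch of this deep-web validation step; the URL, form field name, and "no results" marker below are all hypothetical, and a real implementation would have to be tailored to each source's interface.

```python
import urllib.parse
import urllib.request

def validate_via_form(base_url: str, field: str, candidate: str,
                      no_result_marker: str) -> bool:
    """Submit the candidate through a source's search form and report
    whether the result page indicates that matches were found."""
    query = urllib.parse.urlencode({field: candidate})
    with urllib.request.urlopen(f"{base_url}?{query}", timeout=10) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    return no_result_marker not in page  # results found => likely an instance

# Hypothetical usage:
# validate_via_form("https://example.com/search", "from",
#                   "Chicago", "no flights found")
```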
Architecture of the Assisted Matching System
Pipeline: source interfaces → instance acquisition → source interfaces with augmented instances → interface matcher → attribute matches
Empirical Evaluation
Five domains:

Domain      | # schemas | # attributes per schema | % attributes with no instances | avg. schema depth
Airfare     | 20 | 10.7 | 28.1 | 3.6
Automobile  | 20 |  5.1 | 38.6 | 2.4
Book        | 20 |  5.4 | 74.6 | 2.3
Job         | 20 |  4.6 | 30.0 | 2.1
Real Estate | 20 |  6.5 | 32.2 | 2.7

Experiments:
– baseline: IceQ [Wu et al., SIGMOD-04]
– IceQ with Web assistance
Performance metrics: precision (P), recall (R), & F1 = 2PR/(P+R)
(a sketch of the metrics follows)
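The metrics as defined on the slide, computed here from sets of predicted and gold attribute matches in a small self-contained sketch (the example matches are illustrative).

```python
def prf1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 = 2PR/(P+R) over match sets."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

pred = {("departure city", "from"), ("airline", "carrier")}
gold = {("departure city", "from"), ("airline", "carrier"),
        ("adults", "passengers")}
print(prf1(pred, gold))  # (1.0, 0.666..., 0.8)
```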
Matching Accuracy
Web assistance boosts accuracy (F1) from 89.5% to 97.5%.
Overhead Analysis
Reasonable overhead: 6 to 11 minutes across domains.
Conclusion
Search problems on the Deep Web are increasingly crucial!
– a novel QA-based approach to learning attribute instances
– incorporation into a state-of-the-art matching system
– extensive evaluation over varied real-world domains
More details: search for Wensheng Wu on Google.