Published by Elwin Bishop. Modified over 9 years ago.
1
Organizing Structured Web Sources by Query Schemas: A Clustering Approach
Bin He
Joint work with: Tao Tao, Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign
2
MetaQuerier
Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: MetaQuerier, Query, Result, The Deep Web]
3
MetaQuerier: System architecture
[Figure: system architecture diagram with the following labels]
- Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching
- Front-end (Query Execution): Query Translation, Source Selection, Result Compilation
- Deep Web Repository: Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces
- Other labels: Find Web databases, Query Web databases, The Deep Web
4
In MetaQuerier, source organization means clustering query interfaces into implicit domains
[Figure: example domains Airfares, Books, Automobiles]
5
What are the representative features of query interfaces?
Interface Extraction [SIGMOD 2004]: from Query Interface to Query Schema
- [Author; {contain}; text]
- [Title; {contain}; text]
- …
- [Format; {=}; {hardcopy, paperback, …}]
- …
Is the query schema the feature we are looking for?
6
Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: plot for the Airfares, Hotels, and Movies domains; axis labels: Number of observations, Attributes Index]
- Each domain contains a dominant range of attributes, distinctive from other domains
- Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies
Source organization becomes the clustering of query schemas
7
Query schemas can be viewed as categorical data
Query schemas as transactions:
- S1: {author, title, subject, ISBN}
- S2: {author, title, category, publisher}
- S3: {make, model, price, zip code}
- S4: {manufacturer, model, price}
- S5: {from, to, departure date, return date, number of passengers}
- S6: {departure city, arrival city, number of adults, number of children}
- ……
Thus, we can apply algorithms for clustering categorical data
8
Clustering categorical data: Objective function
Clustering needs an objective function to evaluate the quality of clusters
Existing objective functions:
- Likelihood [1998] (model-based clustering)
- Context Linkage [ROCK 2000]
- Entropy [COOLCAT 2002]
In this paper, we propose a new objective function: Model-Differentiation
9
Model-Differentiation: A new objective function for model-based clustering
Assumption of model-based clustering:
- Each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
What is a good clustering result? (our observation)
- Data in different clusters are very dissimilar
- Models of different clusters are very dissimilar
- A new objective function: maximize the dissimilarity of models
To realize it, we need to answer three questions:
- How to model the data?
- How to estimate the model, given data?
- How to measure the dissimilarity of models?
10
Modeling: Multinomial distribution
- Each attribute is an independent event
- A schema is generated by a series of samplings from a model M
Model M (vocabulary with probabilities): author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
A schema: {title, author, ISBN}
- title: P3
- author: P1
- ISBN: P4
Probability: P1 * P3 * P4
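A minimal Python sketch of the probability computation on this slide. The slide gives no numeric values for P1 to P7, so the numbers below are made up for illustration, and the function name `schema_probability` is hypothetical. As on the slide, the probability is just the product of the attribute probabilities.

```python
from math import prod

# Hypothetical attribute probabilities for a model M. The slide names the
# symbols P1..P7 only; these values are invented and sum to 1.
model = {
    "author": 0.25,     # P1
    "publisher": 0.10,  # P2
    "title": 0.25,      # P3
    "ISBN": 0.20,       # P4
    "city": 0.05,       # P5
    "price": 0.10,      # P6
    "model": 0.05,      # P7
}

def schema_probability(schema, model):
    """Product of the model probabilities of the attributes in `schema`.

    Attributes missing from the model get probability 0.
    """
    return prod(model.get(attr, 0.0) for attr in schema)

# The schema from the slide: {title, author, ISBN} -> P1 * P3 * P4
p = schema_probability({"title", "author", "ISBN"}, model)
```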
11
Model estimation: Given a set of data, how to estimate its model?
Maximum likelihood estimation
S1 = {title, author, ISBN}, S2 = {author, ISBN, publisher}
S3 = {author, title, price}, S4 = {author, title, price}
Vocabulary: author, title, ISBN, price, publisher

             author  title  ISBN  price  publisher  total
Count        4       3      2     2      1          12
Probability  0.33    0.25   0.17  0.17   0.08       1.0
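The estimate on this slide can be reproduced with a short Python sketch: count how often each attribute occurs across the schemas and divide by the total number of attribute occurrences. The function name `estimate_model` is an assumption made for this illustration.

```python
from collections import Counter

def estimate_model(schemas):
    """Maximum likelihood estimate: relative frequency of each attribute."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}

# The four schemas from the slide
schemas = [
    {"title", "author", "ISBN"},       # S1
    {"author", "ISBN", "publisher"},   # S2
    {"author", "title", "price"},      # S3
    {"author", "title", "price"},      # S4
]

model = estimate_model(schemas)
for attr, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{attr}: {prob:.2f}")
```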
12
Measuring the dissimilarity of models: Statistical hypothesis testing
Multinomial distributions can be tested directly with the χ² test
S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
1. Combining S1 and S2: merged model M and remaining model M3
2. Combining S1 and S3: merged model M and remaining model M2
3. Combining S2 and S3: merged model M and remaining model M1
[Figure: for each case, attribute/probability tables of the two models]
This inspires a hierarchical agglomerative clustering (HAC) algorithm
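One way to make the three cases concrete is sketched below in Python. The slide does not spell out the exact test statistic, so this uses the standard Pearson χ² statistic for a 2 × k table of attribute counts as an assumption; the function names are hypothetical. For each candidate merge, the merged pair is compared against the remaining schema, and a larger statistic means more dissimilar models.

```python
from collections import Counter

def attribute_counts(schemas):
    """Attribute occurrence counts over a cluster of schemas."""
    return Counter(attr for schema in schemas for attr in schema)

def chi_square_statistic(cluster_a, cluster_b):
    """Pearson chi-square statistic for the 2 x k table of attribute counts.

    Assumes both clusters are non-empty. This is the textbook test of
    homogeneity, not necessarily the exact formulation used in the talk.
    """
    a, b = attribute_counts(cluster_a), attribute_counts(cluster_b)
    n_a, n_b = sum(a.values()), sum(b.values())
    n = n_a + n_b
    stat = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, row_total in ((a[attr], n_a), (b[attr], n_b)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "price"}
S3 = {"make", "model", "price"}

# Dissimilarity between the merged model and the remaining model, per case
scores = {
    ("S1", "S2"): chi_square_statistic([S1, S2], [S3]),
    ("S1", "S3"): chi_square_statistic([S1, S3], [S2]),
    ("S2", "S3"): chi_square_statistic([S2, S3], [S1]),
}
best_merge = max(scores, key=scores.get)
```

On this toy input the statistic is largest when S1 and S2 are merged, which matches the intuition that the two book schemas belong together. With so few observations the χ² approximation is not reliable, so the numbers only illustrate the mechanics.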
13
Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
- S1: with anchor attributes
- S1 and S2 should be in the same domain and thus pre-clustered
- How to decide whether an S is “distinguishable”?
[Figure labels: S1, S2, Sup(S1), Distinguishable, any Si, Sj in Sup(S1)]
14
Post-classification: Handling “loners”
- Loners: too small for the χ² test after pre-clustering
[Figure: pipeline labels: Pre-clustering, Separate, Model clustering, Naïve Bayesian]
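A minimal Python sketch of how a loner could be assigned to an existing cluster with a naive Bayes classifier over attribute frequencies. The slide names only the technique, so the details here are assumptions: the function name `classify_loner`, the add-one (Laplace) smoothing, and the cluster prior proportional to cluster size. The example clusters reuse schemas from the earlier transactions slide.

```python
import math
from collections import Counter

def classify_loner(loner, clusters, alpha=1.0):
    """Return the label of the cluster with the highest naive Bayes score.

    `clusters` maps a label to a list of schemas (sets of attributes).
    `alpha` is an assumed smoothing constant so unseen attributes do not
    produce log(0).
    """
    vocab = {a for schemas in clusters.values() for s in schemas for a in s}
    vocab |= set(loner)
    n_schemas = sum(len(schemas) for schemas in clusters.values())
    best_label, best_score = None, -math.inf
    for label, schemas in clusters.items():
        counts = Counter(a for s in schemas for a in s)
        total = sum(counts.values())
        # log prior + sum of log smoothed attribute probabilities
        score = math.log(len(schemas) / n_schemas)
        for attr in loner:
            score += math.log((counts[attr] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

clusters = {
    "Books": [
        {"author", "title", "subject", "ISBN"},
        {"author", "title", "category", "publisher"},
    ],
    "Automobiles": [
        {"make", "model", "price", "zip code"},
        {"manufacturer", "model", "price"},
    ],
}

label = classify_loner({"title", "publisher", "price"}, clusters)
```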
15
Experiments
Data
Questions to answer:
- Can schema clustering effectively organize Web databases?
- Can it build a domain hierarchy correctly?
16
We also try existing objective functions
Three existing objective functions:
- Likelihood: maximize likelihood
- Entropy: maximize entropy
- Context Linkage: minimize cross links
To be fair, we keep pre-clustering and post-classification, and change only the clustering step to use the different measures
17
Effectiveness of Clustering
- 8 domains, 8 clusters
- Most Web databases are clustered correctly
Quantitative analysis: Conditional Entropy (the smaller, the better)
- Model-Differentiation: 0.32
- Likelihood: 0.42
- Entropy: 0.38
- Context Linkage: 0.61
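The slide does not define conditional entropy, so the Python sketch below uses one common definition as an assumption: the entropy of the true domain given the assigned cluster, weighted by cluster size and measured in bits. The function name and the toy labels are hypothetical, and the reported values above may have been computed with a different variant.

```python
import math
from collections import Counter

def conditional_entropy(domains, clusters):
    """H(domain | cluster) in bits for parallel lists of labels.

    0 means every cluster contains a single domain; larger values mean
    more mixing of domains within clusters.
    """
    n = len(domains)
    by_cluster = {}
    for domain, cluster in zip(domains, clusters):
        by_cluster.setdefault(cluster, Counter())[domain] += 1
    h = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        for k in counts.values():
            p = k / size
            h -= (size / n) * p * math.log2(p)
    return h

true_domains = ["Books", "Books", "Movies", "Movies"]
pure = conditional_entropy(true_domains, [0, 0, 1, 1])   # each cluster is one domain
mixed = conditional_entropy(true_domains, [0, 0, 0, 0])  # one cluster mixes both domains
```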
18
To build a domain hierarchy
- After reaching 8 clusters, continue running the HAC algorithm to merge them
- The result is consistent with common sense: close concepts are merged first
19
Conclusions
- Cluster Web databases using their query schemas
- First work on clustering Web databases, not pages
- Query schemas are good representatives
- Essentially a problem of clustering categorical data
- A new objective function: Model-Differentiation
- Realized by statistical hypothesis testing
- Derives a different similarity measure for HAC
20
Thank You!