Organizing Structured Web Sources by Query Schemas: A Clustering Approach
Bin He
Joint work with: Tao Tao, Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign

MetaQuerier 2 Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]

MetaQuerier 3 MetaQuerier: System architecture
- Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching
- Front-end (Query Execution): Query Translation, Source Selection, Result Compilation
- Deep Web Repository: Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces
- User tasks: Find Web databases, Query Web databases
- Underlying data: The Deep Web

MetaQuerier 4 In MetaQuerier, source organization means clustering query interfaces into implicit domains, for example Airfares, Books, and Automobiles

MetaQuerier 5 What are the representative features of query interfaces?
- Interface extraction [SIGMOD 2004] turns a query interface into a query schema, for example:
  [Author; {contain}; text]
  [Title; {contain}; text]
  …
  [Format; {=}; {hardcopy, paperback, …}]
  …
- Is the query schema the feature we are looking for?

MetaQuerier 6 Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: number of observations by attribute index, for the Airfares, Hotels, and Movies domains]
- Each domain contains a dominant range of attributes, distinct from other domains
- Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies
- Source organization therefore becomes the clustering of query schemas

MetaQuerier 7 Query schemas can be viewed as categorical data
Query schemas as transactions:
- S1: {author, title, subject, ISBN}
- S2: {author, title, category, publisher}
- S3: {make, model, price, zip code}
- S4: {manufacturer, model, price}
- S5: {from, to, departure date, return date, number of passengers}
- S6: {departure city, arrival city, number of adults, number of children}
- ……
Thus, we can apply algorithms for clustering categorical data
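The transaction view of schemas S1 to S6 can be sketched in a few lines of Python. This is only an illustration: the helper names `shared_attributes` and `to_binary_vector` are hypothetical and do not come from the talk.

```python
# Each query schema is a transaction: an unordered set of attribute names.
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date", "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults", "number of children"},
}

# The vocabulary is the union of all attributes seen in any schema.
vocabulary = sorted(set().union(*schemas.values()))


def shared_attributes(a, b):
    """Attributes that two schemas (given by name) have in common."""
    return schemas[a] & schemas[b]


def to_binary_vector(schema):
    """Encode a schema as a 0/1 vector over the vocabulary (categorical data)."""
    return [1 if attr in schema else 0 for attr in vocabulary]


print(shared_attributes("S1", "S2"))  # the two book schemas overlap
print(shared_attributes("S1", "S3"))  # a book schema and a car schema do not
```

Representing schemas as sets keeps attribute order irrelevant, which matches the transaction view on the slide.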

MetaQuerier 8 Clustering categorical data: Objective function
Clustering needs an objective function to evaluate the quality of clusters
Existing objective functions:
- Likelihood [1998] (model-based clustering)
- Context Linkage [ROCK 2000]
- Entropy [COOLCAT 2002]
In this paper, we propose a new objective function:
- Model-Differentiation

MetaQuerier 9 Model-Differentiation: A new objective function for model-based clustering
Assumption of model-based clustering: each cluster C_i has a generative model M_i that generates its data with probabilistic behavior
What is a good clustering result? (our observation)
- Data in different clusters are very dissimilar
- Therefore, models of different clusters are very dissimilar
- This gives a new objective function: maximize the dissimilarity of models
To realize it, we need to answer three questions:
- How to model the data?
- How to estimate the model, given data?
- How to measure the dissimilarity of models?

MetaQuerier 10 Modeling: Multinomial distribution
- Each attribute is an independent event
- A schema is generated by a series of samplings from the model M
- Vocabulary of model M: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
- Example: the schema {title, author, ISBN} draws title (P3), author (P1), and ISBN (P4), so its probability is P1 * P3 * P4
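As a minimal sketch of this generative view, the probability of a schema can be computed as the product of its attribute probabilities, as on the slide. The numeric values below are hypothetical and chosen only for illustration; the talk does not give values for P1 to P7.

```python
import math

# Hypothetical attribute probabilities for one model M (they sum to 1).
model = {
    "author": 0.30,     # P1
    "publisher": 0.10,  # P2
    "title": 0.25,      # P3
    "ISBN": 0.15,       # P4
    "city": 0.05,       # P5
    "price": 0.10,      # P6
    "model": 0.05,      # P7
}


def schema_probability(schema, model):
    """Product of attribute probabilities, treating each attribute as an independent draw."""
    return math.prod(model[attr] for attr in schema)


# Schema {title, author, ISBN} has probability P1 * P3 * P4.
p = schema_probability({"title", "author", "ISBN"}, model)
print(p)
```

The sketch follows the slide in using the plain product and omits any multinomial coefficient for the ordering of draws.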

MetaQuerier 11 Model estimation: Given a set of data, how to estimate its model?
Maximum likelihood estimation
- S1 = {title, author, ISBN}
- S2 = {author, ISBN, publisher}
- S3 = {author, title, price}
- S4 = {author, title, price}
Vocabulary: author, title, ISBN, price, publisher

              author  title  ISBN  price  publisher  total
Count         4       3      2     2      1          12
Probability   0.33    0.25   0.17  0.17   0.08       1.0
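The maximum likelihood estimate here is just each attribute's count divided by the total number of attribute occurrences. A short sketch that reproduces the slide's numbers from S1 to S4 (the function name `estimate_model` is an illustrative choice, not from the talk):

```python
from collections import Counter

schemas = [
    {"title", "author", "ISBN"},      # S1
    {"author", "ISBN", "publisher"},  # S2
    {"author", "title", "price"},     # S3
    {"author", "title", "price"},     # S4
]


def estimate_model(schemas):
    """Maximum likelihood estimate: relative frequency of each attribute."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: count / total for attr, count in counts.items()}


model = estimate_model(schemas)
for attr in ["author", "title", "ISBN", "price", "publisher"]:
    print(attr, round(model[attr], 2))
```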

MetaQuerier 12 Measuring the dissimilarity of models: Statistical hypothesis testing
Multinomial distributions can be tested directly with the χ² test
Example: S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
1. Combining S1 and S2: [tables of attribute probabilities for the merged model M and for M3]
2. Combining S1 and S3: [tables of attribute probabilities for the merged model M and for M2]
3. Combining S2 and S3: [tables of attribute probabilities for the merged model M and for M1]
This inspires a hierarchical agglomerative clustering (HAC) algorithm
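One way to make the test concrete is the standard Pearson χ² homogeneity statistic on the attribute counts of two clusters: a small value means the two clusters look like samples from the same multinomial model. The sketch below is an assumption about how such a comparison could be coded, not the exact statistic or merge rule used in the talk, and the three single-schema clusters are far too small for a valid test; they only show the mechanics.

```python
from collections import Counter
from itertools import combinations


def attribute_counts(cluster):
    """Count attribute occurrences over all schemas in a cluster."""
    return Counter(attr for schema in cluster for attr in schema)


def chi_square(cluster_a, cluster_b):
    """Pearson chi-square statistic for a 2 x k table of attribute counts."""
    a, b = attribute_counts(cluster_a), attribute_counts(cluster_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


clusters = {
    "S1": [{"title", "author", "ISBN"}],
    "S2": [{"author", "ISBN", "price"}],
    "S3": [{"make", "model", "price"}],
}

# Score every pair; an agglomerative step would merge the least dissimilar pair.
scores = {
    (x, y): chi_square(clusters[x], clusters[y])
    for x, y in combinations(clusters, 2)
}
best_pair = min(scores, key=scores.get)
print(scores, best_pair)
```

Repeating the score-and-merge step until the desired number of clusters remains gives a basic HAC loop.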

MetaQuerier 13 Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
- S1: a schema with anchor attributes; S1 and S2 should be in the same domain and are thus pre-clustered
- How to decide whether a schema S is “distinguishable”?
[Figure labels: S1, S2, Distinguishable, Sup(S1), any S_i, S_j in Sup(S1)]

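MetaQuerier 14 Post-classification: Handling “loners”
- Loners: clusters that are still too small for the χ² test after pre-clustering
- Loners are separated out before model clustering and afterwards assigned to the resulting clusters by a naïve Bayesian classifier
[Figure labels: Pre-clustering, Separate, Model clustering, Naïve Bayesian]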
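A naïve Bayesian assignment of a loner can be sketched as follows: fit a multinomial model per cluster, then pick the cluster with the highest prior times likelihood. The add-one smoothing (`alpha`), the cluster-size prior, and the names `fit_cluster_model` and `classify_loner` are assumptions made for this illustration; the talk does not specify them.

```python
import math
from collections import Counter


def fit_cluster_model(cluster, vocabulary, alpha=1.0):
    """Smoothed attribute probabilities for one cluster (alpha is an assumed smoothing constant)."""
    counts = Counter(attr for schema in cluster for attr in schema)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {attr: (counts[attr] + alpha) / total for attr in vocabulary}


def classify_loner(loner, clusters, alpha=1.0):
    """Return the name of the cluster with the highest log posterior for the loner."""
    vocabulary = set(loner).union(*(s for c in clusters.values() for s in c))
    total_schemas = sum(len(c) for c in clusters.values())
    best_name, best_score = None, -math.inf
    for name, cluster in clusters.items():
        model = fit_cluster_model(cluster, vocabulary, alpha)
        score = math.log(len(cluster) / total_schemas)          # prior
        score += sum(math.log(model[attr]) for attr in loner)   # likelihood
        if score > best_score:
            best_name, best_score = name, score
    return best_name


clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"}, {"manufacturer", "model", "price"}],
}

print(classify_loner({"title", "price", "ISBN"}, clusters))
print(classify_loner({"model", "zip code"}, clusters))
```

Smoothing matters here because a loner may contain attributes that a cluster has never seen, which would otherwise give a zero probability.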

MetaQuerier 15 Experiments
Data
Questions to answer:
- Can schema clustering effectively organize Web databases?
- Can it build a domain hierarchy correctly?

MetaQuerier 16 We also try existing objective functions
Three existing objective functions:
- Likelihood: maximize likelihood
- Entropy: maximize entropy
- Context Linkage: minimize cross links
To be fair, we keep pre-clustering and post-classification, and change only the clustering step to use each measure

MetaQuerier 17 Effectiveness of clustering
- 8 domains, 8 clusters
- Most Web databases are clustered correctly
- Quantitative analysis: conditional entropy (the smaller, the better)
  Model-Differentiation: 0.32; Likelihood: 0.42; Entropy: 0.38; Context Linkage: 0.61
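The slide does not spell out the formula, so the following is a sketch under an assumption: conditional entropy is taken in its standard form, H(domain | cluster), the size-weighted average of the entropy of true domains inside each cluster, with base-2 logarithms. A perfect clustering scores 0, and more mixing gives a larger value.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(true_domains, cluster_ids):
    """H(domain | cluster): weighted average of per-cluster domain entropy (log base 2 assumed)."""
    by_cluster = defaultdict(list)
    for domain, cluster in zip(true_domains, cluster_ids):
        by_cluster[cluster].append(domain)
    n = len(true_domains)
    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * entropy
    return total


domains = ["Books", "Books", "Movies", "Movies"]
print(conditional_entropy(domains, [0, 0, 1, 1]))  # clusters match domains
print(conditional_entropy(domains, [0, 0, 0, 0]))  # everything lumped together
```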

MetaQuerier 18 To build a domain hierarchy
- After reaching 8 clusters, continue running the HAC algorithm to merge them
- The result is consistent with common sense: close concepts are merged first

MetaQuerier 19 Conclusions
Cluster Web databases using their query schemas
- First work on clustering Web databases, not pages
- Query schemas are good representatives
- Essentially a problem of clustering categorical data
A new objective function: Model-Differentiation
- Realized by statistical hypothesis testing
- Derives a different similarity measure for HAC

MetaQuerier 20 Thank You!
