Organizing Structured Web Sources by Query Schemas: A Clustering Approach Bin He Joint work with: Tao Tao, Kevin Chen-Chuan Chang Univ. Illinois at Urbana-Champaign.

Presentation transcript:

Organizing Structured Web Sources by Query Schemas: A Clustering Approach Bin He Joint work with: Tao Tao, Kevin Chen-Chuan Chang Univ. Illinois at Urbana-Champaign

MetaQuerier 2 Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: MetaQuerier, Query, Result, The Deep Web]

3. MetaQuerier: System architecture
[Figure: system architecture diagram with the following labels]
- Back-end: Semantics Discovery (Database Crawler, Interface Extraction, Source Organization, Schema Matching)
- Front-end: Query Execution (Source Selection, Query Translation, Result Compilation)
- Deep Web Repository (Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces)
- The Deep Web
- Find Web databases; Query Web databases

4. In MetaQuerier, source organization clusters query interfaces into implicit domains
[Figure: example domains Airfares, Books, Automobiles]

5. What is the representative feature of query interfaces?
Interface Extraction [SIGMOD 2004] turns a Query Interface into a Query Schema:
[Author; {contain}; text]
[Title; {contain}; text]
…
[Format; {=}; {hardcopy, paperback, …}]
…
Is the query schema the feature we are looking for?

6. Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: plot with domain labels Airfares, Hotels, Movies and axis labels Number of observations, Attributes Index]
- Each domain contains a dominant range of attributes, distinct from those of other domains.
- Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies.
Source organization thus becomes the clustering of query schemas.

7. Query schemas can be viewed as categorical data
Query schemas as transactions:
S1: {author, title, subject, ISBN}
S2: {author, title, category, publisher}
S3: {make, model, price, zip code}
S4: {manufacturer, model, price}
S5: {from, to, departure date, return date, number of passengers}
S6: {departure city, arrival city, number of adults, number of children}
……
Thus, we can apply algorithms for clustering categorical data.
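
As a minimal sketch of the transaction view, the six example schemas can be held as Python sets of attribute names. The variable names and the `attribute_frequencies` helper are illustrative assumptions, not something taken from the slides.

```python
from collections import Counter

# The six example schemas from the slide, each treated as a transaction
# (an unordered set of attribute names).
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date", "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults", "number of children"},
}


def attribute_frequencies(transactions):
    """Count in how many schemas each attribute appears (hypothetical helper)."""
    counts = Counter()
    for attrs in transactions.values():
        counts.update(attrs)
    return counts


freq = attribute_frequencies(schemas)
# Attributes shared by more than one schema hint at a common domain.
shared = sorted(attr for attr, n in freq.items() if n > 1)
print(shared)
```

In this toy data only the Books pair (S1, S2) and the Automobiles pair (S3, S4) share attributes; S5 and S6 share none, which shows why exact attribute overlap alone can be too weak a signal.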

8. Clustering categorical data: Objective function
Clustering needs an objective function to evaluate the quality of clusters.
Existing objective functions:
- Likelihood [1998] (model-based clustering)
- Context Linkage [ROCK 2000]
- Entropy [COOLCAT 2002]
In this paper, we propose a new objective function:
- Model-Differentiation

9. Model-Differentiation: a new objective function for model-based clustering
Assumption of model-based clustering: each cluster Ci has a generative model Mi that generates its data probabilistically.
What is a good clustering result? (our observation)
- Data in different clusters are very dissimilar,
- so the models of different clusters are very dissimilar,
- which suggests a new objective function: maximize the dissimilarity of models.
To realize it, we need to answer three questions:
- How to model the data?
- How to estimate the model, given data?
- How to measure the dissimilarity of models?

10. Modeling: Multinomial distribution
- Each attribute is an independent event.
- A schema is generated by a series of samplings from the model M.
Vocabulary of model M: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
A schema: {title, author, ISBN}, with title drawn with probability P3, author with P1, and ISBN with P4.
Probability: P1 * P3 * P4
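
The product rule on this slide can be sketched in a few lines of Python. The probability values below are invented purely for illustration (only the vocabulary comes from the slide), and the function names are assumptions.

```python
import math

# Hypothetical model M: attribute -> probability. The numbers are made up;
# they only need to sum to 1 over the vocabulary.
model = {
    "author": 0.25, "publisher": 0.10, "title": 0.25, "ISBN": 0.15,
    "city": 0.05, "price": 0.10, "model": 0.10,
}


def schema_probability(schema, model):
    """Product of the attribute probabilities, following the slide's
    assumption that each attribute is an independent event."""
    return math.prod(model[attr] for attr in schema)


def schema_log_probability(schema, model):
    """Same quantity in log space, which avoids underflow for long schemas."""
    return sum(math.log(model[attr]) for attr in schema)


schema = {"title", "author", "ISBN"}
p = schema_probability(schema, model)          # P1 * P3 * P4
log_p = schema_log_probability(schema, model)
print(p, log_p)
```

Working in log space is a common design choice when many schemas are scored, because products of small probabilities shrink quickly.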

11. Model estimation: Given a set of data, how to estimate its model?
Maximum likelihood estimation
S1 = {title, author, ISBN}, S2 = {author, ISBN, publisher}
S3 = {author, title, price}, S4 = {author, title, price}
Vocabulary: author, title, ISBN, price, publisher
[Table: columns author | title | ISBN | price | publisher | total]
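
A minimal sketch of the estimate for the four schemas on this slide, assuming the maximum likelihood estimate is the relative frequency of each attribute among all attribute occurrences. The function name `estimate_model` is hypothetical, and the numbers are computed here from S1 to S4 rather than copied from the slide's table.

```python
from collections import Counter

schemas = [
    {"title", "author", "ISBN"},       # S1
    {"author", "ISBN", "publisher"},   # S2
    {"author", "title", "price"},      # S3
    {"author", "title", "price"},      # S4
]


def estimate_model(schemas):
    """Relative-frequency estimate: count(attribute) / total occurrences."""
    counts = Counter()
    for schema in schemas:
        counts.update(schema)
    total = sum(counts.values())
    probs = {attr: n / total for attr, n in counts.items()}
    return counts, total, probs


counts, total, probs = estimate_model(schemas)
for attr in ["author", "title", "ISBN", "price", "publisher"]:
    print(f"{attr}: {counts[attr]}/{total}")
```

Under this assumption the counts are author 4, title 3, ISBN 2, price 2, publisher 1, out of 12 occurrences in total.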

12. Measuring the dissimilarity of models: Statistical hypothesis testing
Multinomial distributions can be tested directly with the χ² test.
S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
1. Combining S1 and S2: [tables of attributes (Attrs) and probabilities (Pro) for models M and M3]
2. Combining S1 and S3: [tables of attributes (Attrs) and probabilities (Pro) for models M and M2]
3. Combining S2 and S3: [tables of attributes (Attrs) and probabilities (Pro) for models M and M1]
This inspires a hierarchical agglomerative clustering (HAC) algorithm.
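
The transcript does not preserve how the test is turned into a similarity measure, so the following is only a sketch under an assumption: it computes the textbook Pearson χ² homogeneity statistic on a 2 x k table of attribute counts for two clusters. All function names are hypothetical, and no p-value is computed.

```python
from collections import Counter


def attribute_counts(cluster):
    """Total occurrences of each attribute over the schemas in a cluster."""
    counts = Counter()
    for schema in cluster:
        counts.update(schema)
    return counts


def chi_square_statistic(cluster_a, cluster_b):
    """Pearson chi-square statistic and degrees of freedom for the 2 x k
    contingency table of attribute counts (assumed form of the test)."""
    ca, cb = attribute_counts(cluster_a), attribute_counts(cluster_b)
    vocab = sorted(set(ca) | set(cb))
    row_a, row_b = sum(ca.values()), sum(cb.values())
    grand = row_a + row_b
    stat = 0.0
    for attr in vocab:
        col = ca[attr] + cb[attr]
        for observed, row in ((ca[attr], row_a), (cb[attr], row_b)):
            expected = row * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(vocab) - 1


S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "price"}
S3 = {"make", "model", "price"}

stat_12, dof_12 = chi_square_statistic([S1], [S2])
stat_13, dof_13 = chi_square_statistic([S1], [S3])
stat_23, dof_23 = chi_square_statistic([S2], [S3])
print(stat_12, stat_13, stat_23)
```

A smaller statistic means the two attribute distributions look more alike, so S1 and S2 are the closest pair here. Single-schema clusters are far too small for a meaningful test; they are used only to keep the arithmetic easy to follow.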

13. Hypothesis testing needs sufficient observations: pre-clustering to form small clusters
- S1: with anchor attributes.
- S1 and S2 should be in the same domain and are thus pre-clustered.
- How to decide whether a schema S is "distinguishable"?
[Figure labels: S1, S2, Distinguishable, Sup(S1), any Si, Sj in Sup(S1)]

14. Post-classification: Handling "loners"
Loners: clusters too small for the χ² test after pre-clustering.
[Figure labels, in order: Pre-clustering, Separate, Model clustering, Naïve Bayesian]
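
The slide names only "Naïve Bayesian" for this step, so the following is a generic sketch of assigning a loner schema to one of the final clusters with a multinomial Naïve Bayes classifier. Laplace smoothing, ignoring attributes outside the vocabulary, and all names and toy data are assumptions made here.

```python
import math
from collections import Counter


def train(clusters, smoothing=1.0):
    """clusters: dict of cluster name -> list of schemas (sets of attributes).
    Returns a log prior and smoothed log attribute probabilities per cluster."""
    vocab = {a for schemas in clusters.values() for s in schemas for a in s}
    n_schemas = sum(len(schemas) for schemas in clusters.values())
    models = {}
    for name, schemas in clusters.items():
        counts = Counter(a for s in schemas for a in s)
        total = sum(counts.values()) + smoothing * len(vocab)
        models[name] = {
            "log_prior": math.log(len(schemas) / n_schemas),
            "log_prob": {a: math.log((counts[a] + smoothing) / total) for a in vocab},
        }
    return models


def classify(schema, models):
    """Return the cluster with the highest posterior score for the schema.
    Attributes never seen in training are skipped (a simplifying assumption)."""
    def score(m):
        known = (a for a in schema if a in m["log_prob"])
        return m["log_prior"] + sum(m["log_prob"][a] for a in known)
    return max(models, key=lambda name: score(models[name]))


# Toy clusters and loners, invented for illustration.
clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"}, {"manufacturer", "model", "price"}],
}
models = train(clusters)
loner_a = classify({"title", "ISBN", "category"}, models)
loner_b = classify({"model", "zip code"}, models)
print(loner_a, loner_b)
```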

15. Experiments
- Data
- Questions to answer:
  - Can schema clustering effectively organize Web databases?
  - Can it build a domain hierarchy correctly?

16. We also try existing objective functions
Three existing objective functions:
- Likelihood: maximize likelihood
- Entropy: maximize entropy
- Context Linkage: minimize cross links
To be fair, we keep pre-clustering and post-classification and change only the clustering step, which uses a different measure in each case.

17. Effectiveness of Clustering
- 8 domains, 8 clusters
- Most Web databases are clustered correctly.
- Quantitative analysis: conditional entropy (the smaller, the better)
  Model-Differentiation: 0.32; Likelihood: 0.42; Entropy: 0.38; Context Linkage: 0.61
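
For readers unfamiliar with the metric, here is a sketch of one standard definition of conditional entropy, H(domain | cluster): the size-weighted average of the entropy of true domain labels inside each cluster. The exact formula and logarithm base used in the experiments are not given in the transcript, so base 2 and the function name are assumptions, and the labels below are toy data, not the reported results.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(true_domains, cluster_ids):
    """H(domain | cluster) in bits; 0.0 means every cluster is pure."""
    if len(true_domains) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    by_cluster = defaultdict(list)
    for domain, cluster in zip(true_domains, cluster_ids):
        by_cluster[cluster].append(domain)
    n = len(true_domains)
    total = 0.0
    for members in by_cluster.values():
        weight = len(members) / n
        entropy = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            entropy -= p * math.log2(p)
        total += weight * entropy
    return total


domains = ["Books", "Books", "Movies", "Movies"]
pure = conditional_entropy(domains, [0, 0, 1, 1])    # each cluster holds one domain
mixed = conditional_entropy(domains, [0, 0, 0, 0])   # one cluster mixes both domains
print(pure, mixed)
```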

18. To build a domain hierarchy
- After reaching 8 clusters, continue running the HAC algorithm to merge them.
- The result is consistent with common sense: close concepts are merged first.
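
The idea of continuing the merges until one root remains can be sketched as a greedy HAC loop that records the merge order. The dissimilarity used here, one minus the Jaccard overlap of the clusters' attribute sets, is a simple stand-in chosen for brevity and is not the Model-Differentiation measure; the clusters and names are invented toy data.

```python
def jaccard_distance(cluster_a, cluster_b):
    """Stand-in dissimilarity: 1 - Jaccard overlap of the attribute sets."""
    attrs_a = set().union(*cluster_a)
    attrs_b = set().union(*cluster_b)
    return 1.0 - len(attrs_a & attrs_b) / len(attrs_a | attrs_b)


def build_hierarchy(clusters, distance=jaccard_distance):
    """Greedy HAC: repeatedly merge the closest pair of clusters and
    return the list of merges in the order they happened."""
    active = {name: list(schemas) for name, schemas in clusters.items()}
    merges = []
    while len(active) > 1:
        names = sorted(active)
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: distance(active[pair[0]], active[pair[1]]),
        )
        active[f"({a}+{b})"] = active.pop(a) + active.pop(b)
        merges.append((a, b))
    return merges


# Toy final clusters, one schema each, invented for illustration.
clusters = {
    "Books": [{"author", "title", "ISBN"}],
    "Movies": [{"title", "director", "MPAA rating"}],
    "Airfares": [{"from", "to", "departure date", "number of adults"}],
    "Hotels": [{"city", "check-in date", "number of adults"}],
}
merges = build_hierarchy(clusters)
print(merges)
```

In this toy run Books and Movies merge first, then Airfares and Hotels, and the two groups join last, which mirrors the slide's point that close concepts are merged first.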

19. Conclusions
Cluster Web databases using their query schemas:
- First work on clustering Web databases, not pages
- Query schemas are good representatives
- Essentially a problem of clustering categorical data
A new objective function: Model-Differentiation
- Realized by statistical hypothesis testing
- Derives a different similarity measure for HAC

20. Thank You!