Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration. Fangjiao Jiang, Renmin University of China. Joint work with Weiyi Meng & Xiaofeng Meng.

1 Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration. Fangjiao Jiang, Renmin University of China. Joint work with Weiyi Meng & Xiaofeng Meng.

Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration (DASFAA2009) 2 The previous Web: things are just on the surface

3 The current Web: getting “deeper”. A great deal of information is hidden behind query forms. Deep = not accessible through search engines.

4 Why is it important? More than 10 million distinct forms.

5 Why is it important? Up to 5,000 billion dynamic result pages.

6 A Key Component: Query translation, from the integrated query interface to the Web database query interfaces. Challenges: large scale, heterogeneity, autonomy.

7 The Problem: Selectivity Estimation for Exclusive Query Translation

8 Example (figure not preserved in the transcript)

9 Related work & the Challenge. A prominent solution for selectivity estimation is histograms [Piatetsky+, Poosala+, Ioannidis+], considered here for categorical attributes and for infinite-value attributes. Another solution is random sampling [Goodman+, Haas+, Olken+, Vitter+, Dasgupta+]. Challenge: selectivity estimation for infinite-value attributes.
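For orientation, the histogram idea is easy to state for a categorical attribute: keep one counter per distinct value and divide by the number of records. The sketch below is only an illustration of that baseline, not the method of the talk; the function name `categorical_selectivity` and the sample `formats` column are hypothetical. For an infinite-value attribute (a free-text field, for example) the set of distinct values is open-ended, which is the challenge the slide names.

```python
from collections import Counter


def categorical_selectivity(values, target):
    """Return the fraction of records whose attribute equals `target`.

    `values` is the full column of a categorical attribute; the Counter
    acts as a one-bucket-per-value histogram.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return counts[target] / total if total else 0.0


# Hypothetical column of a categorical attribute.
formats = ["hardcover", "paperback", "paperback", "ebook", "paperback"]
print(categorical_selectivity(formats, "paperback"))
```

This baseline presumes the whole column can be scanned, which a Web database behind a query form does not allow; that is where sampling and probing come in.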

10 Selectivity Estimation for Exclusive Query Translation

11 Two Observations. (1) Different attribute pairs exhibit different degrees of correlation (figure labels: weakest, weaker, strongest). (2) The word frequencies of the values of an infinite-value attribute usually follow a Zipf-like distribution.
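One common way to check whether word frequencies look Zipf-like is to sort them by rank and fit a straight line in log-log space; a slope near -1 corresponds to the classic Zipf law. The sketch below assumes that check and is not taken from the talk; `rank_frequency` and `loglog_slope` are hypothetical helper names.

```python
import math
from collections import Counter


def rank_frequency(values):
    """Split attribute values into words and return frequencies by rank."""
    words = [w for v in values for w in v.lower().split()]
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; rank 1 is the most frequent word.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic frequencies that follow f(r) = 1000 / r exactly.
ideal = [1000 / r for r in range(1, 51)]
print(round(loglog_slope(ideal), 6))
```

On real attribute values the slope will only be approximately linear, which is why the slide says "Zipf-like" rather than Zipf.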

12 Selectivity Estimation for Exclusive Query Translation. Two stages: attribute correlation calculation for a domain, and selectivity estimation for a Web database. Steps: correlation-based sampling, word frequency probing, Zipf equation calculation, selectivity estimation.

13 Selectivity Estimation Challenges. 1. Attribute correlation calculation: find the least correlated attribute; discover the word rank. 2. Zipf equation calculation: calculate the parameters of the Zipf equation; estimate selectivity.

14 Attribute Correlation Calculation

15 Goal of attribute correlation calculation: (1) random sample, (2) word rank.
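The transcript does not say which correlation measure the authors use. As a stand-in, the sketch below scores attribute pairs with mutual information over discrete values and picks the candidate attribute that shares the least information with the target; querying on such an attribute should yield target values that are closer to a random sample. The names `mutual_information` and `least_correlated`, and the toy records, are assumptions for illustration. For a free-text target one would compare word distributions rather than whole values.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete columns."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts.
        mi += (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
    return mi


def least_correlated(records, target, candidates):
    """Return the candidate attribute with the lowest MI against `target`."""
    target_col = [r[target] for r in records]
    return min(
        candidates,
        key=lambda a: mutual_information(target_col, [r[a] for r in records]),
    )


# Toy records: subject is fully determined by publisher, independent of format.
records = [
    {"subject": "db", "publisher": "p1", "format": "hard"},
    {"subject": "db", "publisher": "p1", "format": "soft"},
    {"subject": "ai", "publisher": "p2", "format": "hard"},
    {"subject": "ai", "publisher": "p2", "format": "soft"},
]
print(least_correlated(records, "subject", ["publisher", "format"]))
```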

16 Discussion on word rank: word rank should be computed for each attribute.
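A per-attribute word rank can be sketched as follows, assuming the sample is a list of records with text-valued attributes and that a word's frequency is the number of sampled records containing it. The function name `word_ranks` and the sample data are hypothetical, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter


def word_ranks(sample, attributes):
    """Map each attribute to {word: rank}, rank 1 being the most frequent word."""
    ranks = {}
    for attr in attributes:
        counts = Counter()
        for record in sample:
            # Count each word once per record (record-level frequency).
            counts.update(set(record[attr].lower().split()))
        ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        ranks[attr] = {word: i for i, (word, _) in enumerate(ordered, start=1)}
    return ranks


# Hypothetical sample drawn from a book database.
sample = [
    {"title": "java programming", "author": "smith"},
    {"title": "java basics", "author": "jones"},
    {"title": "database systems", "author": "smith"},
]
ranks = word_ranks(sample, ["title", "author"])
print(ranks["title"]["java"], ranks["author"]["jones"])
```

Keeping one rank table per attribute matters because the same word can be common in one attribute and rare in another.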

17 Zipf Equation Calculation

18 Zipf equation calculation: the Zipf equation
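The slide's equation is not reproduced in the transcript. As a reading aid only: the three parameters named on slide 20 (P, p and E) match the generalized Zipf (Zipf-Mandelbrot) law, so the form below is an assumption, not a quotation. Here $f(r)$ is the frequency of the word at rank $r$, $r_w$ is the rank of word $w$, and $N$ is the number of records in the database; these symbols are introduced for illustration.

```latex
f(r) = P\,(r + p)^{-E}
\quad\Longleftrightarrow\quad
\log f(r) = \log P - E\,\log(r + p),
\qquad
\mathrm{sel}(w) \approx \frac{f(r_w)}{N}
```

With $p = 0$ and $E = 1$ this reduces to the classic Zipf law $f(r) = P / r$. The selectivity expression holds if $f$ counts the records that contain the word.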

19 The parameters of the Zipf equation
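How the authors compute the parameters is not preserved in the transcript. One simple possibility, assuming the generalized form f(r) = P (r + p)^(-E) and a handful of (rank, frequency) pairs obtained by probing the database with sample words, is a grid search over the shift p combined with a least-squares line fit in log space. All names below (`fit_zipf`, `estimate_frequency`, the probe values) are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_zipf(points, shifts):
    """Fit f(r) = P * (r + p) ** (-E) to (rank, frequency) points.

    For every candidate shift p, log f is linear in log(r + p); the shift
    with the smallest squared error wins.
    """
    best = None
    ys = [math.log(f) for _, f in points]
    for p in shifts:
        xs = [math.log(r + p) for r, _ in points]
        slope, intercept, sse = fit_line(xs, ys)
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), p, -slope)
    _, P, p, E = best
    return P, p, E


def estimate_frequency(rank, P, p, E):
    """Predicted frequency of the word at the given rank."""
    return P * (rank + p) ** (-E)


# Synthetic probe results generated from P=5000, p=2.0, E=1.1.
probes = [(r, 5000 * (r + 2.0) ** -1.1) for r in (1, 5, 10, 50, 100, 500)]
P, p, E = fit_zipf(probes, shifts=[i * 0.5 for i in range(11)])
print(round(P, 3), p, round(E, 3))
```

Dividing `estimate_frequency(rank, P, p, E)` by the database size would then give a selectivity estimate for a word that was never probed, provided its rank is known from the sample.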

20 Discussion on P, p and E

21 Experiments

22 Data Sets & Evaluation Method (sections: data sets; evaluation method)

23 Experimental Results: the average precision of the selectivity estimations is high.

24 Summary

25 Contributions. (1) Identify the selectivity estimation problem for infinite-value attributes in exclusive query translation. (2) Propose a correlation-based sampling approach to obtain a sample that is as random as possible. (3) Propose a Zipf-based selectivity estimation method. (4) Verify the accuracy of the approach.

26 Thanks (Q&A)