Quality-driven Integration of Heterogeneous Information Systems, by Felix Naumann, et al. (VLDB 1999). Presented by Heasoo Hwang, 17 Feb 2006.

Introduction
- Motivation / observation
  - The main criterion of users who select sources by hand is NOT just response time, BUT the expected quality of the data:
    - The sources have varying information quality
    - Results become outdated quickly
    - Many experimental techniques are intrinsically imprecise
- Contribution
  - Integration of classical query planning with the assessment and consideration of information quality (IQ)

Correctness and Completeness
- For a given user query (UQ) against the mediator schema:
  - "Correct plan": a combination of QCAs that is semantically contained in the UQ; correct plans compute only correct results
  - "Complete answer" to a UQ w.r.t. the given QCAs: the union over the answers of all correct plans
- Problem: there are too many correct plans!

Example (1/2)
- Global tables: sequence and gene
- User query: the sequence of a specific gene
- From the QCAs, the mediator detects that
  - S5 and two other sources can be used for the gene part
  - S1, S2, and S3 can be used for the sequence part
- This yields 9 correct plans
- Question: do we have to execute all 9 correct plans?

Example (2/2)
- Assume IQ scores are available:
  - Sequence data on S1: S1 copies infrequently from other sites, sometimes introducing parsing errors
  - Sequence data on S3: highly up-to-date, but few annotations are provided
- Reducing the number of correct plans to execute: we may consider 3 correct plans instead of 9
  - Case 1: if the user is particularly interested in complete annotation, we conclude that plans using S3 are not very promising
  - Case 2: if highly up-to-date data is required, S1 can probably be ignored
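The 9 correct plans arise as the cross product of the three gene sources and the three sequence sources. A minimal sketch of one way to arrive at 3 of the 9 plans (the slide names only S5 for the gene part, so G2 and G3 are placeholders, and applying both exclusions at once is one possible reading):

```python
from itertools import product

# Gene part: S5 plus two unnamed sources (G2 and G3 are placeholders).
gene_sources = ["S5", "G2", "G3"]
# Sequence part: S1, S2, S3.
sequence_sources = ["S1", "S2", "S3"]

# Every pairing of a gene source with a sequence source is a correct plan.
correct_plans = list(product(gene_sources, sequence_sources))
print(len(correct_plans))  # 9

# Case 1: complete annotation matters -> plans using S3 are unpromising.
# Case 2: up-to-date data is required -> plans using S1 can be ignored.
# Applying both leaves only the 3 plans that use S2 for the sequence part.
pruned = [(g, s) for g, s in correct_plans if s not in {"S1", "S3"}]
print(len(pruned))  # 3
```

The point of the example survives either reading: IQ scores let the mediator discard most correct plans before executing any of them.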

"Completeness of Integrated Information Sources" by Felix Naumann, et al. (Information Systems 2004)
- Implicit assumption of most information integration projects: "The mediator should always compute the complete answer"
- In many cases, this assumption is wrong:
  - Computing the complete answer is not always necessary; for example, a meta-search engine does not need to download all hits from all the search engines it uses, since taking the top ten hits usually suffices
  - Computing the complete answer may be too expensive, or it may take too long
- Their alternative assumption: "The most complete response to the user is the best, given some cost limits"

IQ classification
- Source-specific criteria: determine the overall quality of a data source (e.g., reputation)
- QCA-specific criteria: determine the quality of the specific queries that are computable by a source (e.g., response time)
- Attribute-specific criteria: assess the quality of a source in terms of its ability to provide the attributes of a specific user query (e.g., the completeness of the annotation attribute on a source)
- Depending on the application domain and the structure of the available sources, the classification may vary
- Problem: assigning IQ scores objectively is difficult, since some IQ criteria are highly subjective (e.g., reputation) → use user profiles, i.e., sets of IQ scores for all subjective criteria

Source-specific criteria
- Ease of understanding: user ranking
- Reputation: user ranking
- Reliability: ranking of the experimental method (intrinsic error rate)
- Timeliness: average age of the data

QCA-specific criteria
- Availability: percentage of time the source is accessible
- Price: monetary price of a query
- Representational consistency: wrapper workload (e.g., a wrapper with a relational export schema is always consistent with the global schema)
- Response time: average waiting time for a response
- Accuracy: percentage of objects with errors, usually produced during data input
- Relevancy: percentage of real-world objects represented; usually highly user-dependent

Attribute-specific criteria
- Completeness: fullness of the relation in each attribute (horizontal fitness), e.g., an attribute with 90% null values
- Amount: number of unwanted attributes (vertical fitness)
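Taken together, the QCA- and attribute-specific criteria above form an eight-component IQ vector per QCA, which matches the eight weights (1/8 each) used later in plan ranking. A minimal sketch with assumed field names and invented example scores:

```python
from dataclasses import dataclass, fields

@dataclass
class IQVector:
    """Eight QCA- and attribute-specific IQ criteria (field names assumed)."""
    availability: float        # % of time the source is accessible
    price: float               # monetary price of a query
    repr_consistency: float    # representational consistency (wrapper workload)
    response_time: float       # average waiting time for a response
    accuracy: float            # error-related score of the represented objects
    relevancy: float           # % of real-world objects represented
    completeness: float        # fullness of each attribute
    amount: float              # number of unwanted attributes

# Example scores are invented for illustration only.
s1_seq = IQVector(availability=0.95, price=0.0, repr_consistency=1.0,
                  response_time=2.5, accuracy=0.92, relevancy=0.8,
                  completeness=0.6, amount=3.0)
print(len(fields(s1_seq)))  # 8
```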

Algorithm (three phases)
- Input: the user query; the sources with their QCAs and IQ scores
- Phase 1: source selection with source-specific criteria → best sources
- Phase 2: planning with the QCAs → all correct plans
- Phase 3: plan selection with QCA- and attribute-specific criteria → best plans

Phase 1: Source selection
- Goal: use the source-specific IQ criteria to weed out sources that are qualitatively not as good as others ("non-good" sources); non-good sources are completely disregarded in further planning
- Method: Data Envelopment Analysis (DEA), developed by Charnes et al.
  - A general method to classify a population of observations
  - Avoids the problems of scaling and weighting
- Exception: do not remove a source S with low IQ if
  - S is the only source providing a certain attribute of the global schema, or
  - S exclusively provides certain extensions of an attribute
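DEA classifies each source as efficient or inefficient by solving a linear program per source. As a rough stand-in that captures the weed-out idea and the exception above, the sketch below drops a source only if some other source dominates it on every source-specific criterion, while always keeping exclusive providers:

```python
def weed_out(sources, exclusive):
    """Drop sources dominated on every criterion, unless they are exclusive.

    sources: dict name -> tuple of source-specific scores (higher is better).
    exclusive: set of names that alone provide some attribute or extension.
    This Pareto filter only approximates DEA, which solves an LP per source.
    """
    def dominated_by(a, b):  # b is at least as good everywhere, better somewhere
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return {
        name: score
        for name, score in sources.items()
        if name in exclusive
        or not any(dominated_by(score, other)
                   for o, other in sources.items() if o != name)
    }

# Invented scores: (reliability, timeliness). S2 is dominated by S1 on both.
sources = {"S1": (0.9, 0.3), "S2": (0.5, 0.2), "S3": (0.4, 0.9)}
kept = weed_out(sources, exclusive=set())
print(sorted(kept))  # ['S1', 'S3']
```

Passing `exclusive={"S2"}` would keep S2 despite its low scores, mirroring the slide's rule about sole providers.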

Phase 2: Plan creation
- Input: the UQ with the user's weightings for each attribute
- Output: plans, each possibly producing a different set of correct tuples for the UQ

Phase 3: Plan selection
- Goal: qualitatively rank the plans of the previous phase, and restrict plan execution to meet stop conditions
  - Stop condition 1: execute some best percentage of the plans
  - Stop condition 2: execute as many plans as necessary to meet certain cost or quality criteria
- Three steps:
  a) QCA quality: determine the IQ scores of the QCAs
  b) Plan quality: the tree-structured quality model aggregates these scores along tree paths, yielding an overall score at the root of the tree that forms the score of the entire plan
  c) Plan ranking: rank all plans by the IQ score of each plan

3a) QCA quality: determine IQ vectors for the QCAs
- The general IQ vector for QCAs
- The IQ vectors for the QCAs participating in the six correct plans

3b) Plan quality
- The six plans have aggregated IQ vectors
- IQ vectors are merged in join nodes (e.g., the IQ vector for an inner-join node)
- Up to this point, the scores are neither scaled nor weighted, which makes comparing or ranking the plans impossible
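The paper defines a merge function per criterion for join nodes; the sketch below uses plausible stand-ins (assumed, not the paper's exact definitions): availabilities multiply since both inputs must answer, price and response time add, and percentage-valued criteria are merged pessimistically with min.

```python
def merge_join(left, right):
    """Merge the IQ vectors of a join node's two children (illustrative)."""
    return {
        "availability": left["availability"] * right["availability"],
        "price": left["price"] + right["price"],
        "response_time": left["response_time"] + right["response_time"],
        "accuracy": min(left["accuracy"], right["accuracy"]),
        "completeness": min(left["completeness"], right["completeness"]),
    }

# Invented per-QCA scores for the gene and sequence parts of one plan.
gene = {"availability": 0.9, "price": 1.0, "response_time": 2.0,
        "accuracy": 0.95, "completeness": 0.7}
seq = {"availability": 0.8, "price": 0.5, "response_time": 3.0,
       "accuracy": 0.9, "completeness": 0.9}
plan_iq = merge_join(gene, seq)  # aggregated vector at the join node
print(plan_iq)
```

Aggregating bottom-up this way yields one raw IQ vector per plan at the root, which is exactly what phase 3c then scales and weights.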

3c) Plan ranking
- Method: the Simple Additive Weighting (SAW) method
- Scaling
  - Positive criteria: availability, accuracy, relevancy, completeness
  - Negative criteria: price, representational consistency, response time, amount
- Computing the weighted sum
  - Needs a user-specific weight vector that reflects the importance of the individual criteria to the user; stored in the user profile
  - The IQ scores of the plans shown are obtained with the indifferent weight vector (each weight is 1/8)
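A sketch of SAW ranking under an indifferent weight vector. The scaling rules (value/max for positive criteria, min/value for negative criteria) are a common SAW variant, assumed here rather than quoted from the paper, and the toy example uses only four of the eight criteria:

```python
def saw_rank(plans, positive, negative, weights):
    """Scale each criterion across plans, then rank plans by weighted sum."""
    ranked = {}
    for name, raw in plans.items():
        total = 0.0
        for c in positive:  # higher raw value is better
            mx = max(p[c] for p in plans.values())
            total += weights[c] * (raw[c] / mx if mx else 0.0)
        for c in negative:  # lower raw value is better
            mn = min(p[c] for p in plans.values())
            total += weights[c] * (mn / raw[c] if raw[c] else 1.0)
        ranked[name] = total
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

# Two toy plans; the indifferent weight vector gives every criterion
# the same weight (1/4 here since only four criteria are used).
plans = {
    "P1": {"availability": 0.9, "completeness": 0.8,
           "price": 2.0, "response_time": 1.0},
    "P2": {"availability": 0.6, "completeness": 0.9,
           "price": 1.0, "response_time": 2.0},
}
weights = {c: 0.25 for c in
           ("availability", "completeness", "price", "response_time")}
ranking = saw_rank(plans, ["availability", "completeness"],
                   ["price", "response_time"], weights)
print(ranking[0][0])  # P1 scores higher and would be executed first
```

Swapping in a user profile's weight vector instead of the indifferent one is what makes the final ranking user-specific.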