Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University.

Presentation transcript:

Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University Divesh Srivastava, AT&T Labs-Research

9/11/2006 Amélie Marian - Rutgers University
Slide 2: Motivating Example
[Figure: tables of the Sales, Provisioning, Inventory, and Ordering applications, with fields drawn from TN, BAN, CustName, ORN, PON, SubPON, and CircuitID]
Legend:
  TN: Telephone Number
  ORN: Order Number
  BAN: Billing Account Number
  PON: Provisioning Order Number
  SubPON: Related PON
Question: What is the Circuit ID associated with a Telephone Number that appears in SALES?

Slide 3: Motivations
• Data applications with overlapping features
  - Data integration
  - Web sources
• Data quality issues (duplicate, null, default values, data inconsistencies)
  - Data-entry problems
  - Data integration problems

Slide 4: Contributions
• Multiple Join Path (MJP) framework
  - Quantifies answer quality
  - Takes corroborating evidence into account
  - Agglomerative scoring of answers
• Answer computation techniques
  - Designed for MJP scoring methodologies
  - Several output options (top-k, top-few)
• Experimental evaluation on real data
  - VIP integration platform
  - Quality of answers
  - Efficiency of our techniques

Slide 5: Outline
• Multiple Join Path Framework
  - Problem Definition
• Our Approach
  - Scoring Answers
  - Computing Answers
• Experimental Evaluation
• Related Work

Slide 6: Multiple Join Path Framework: Problem Definition
• Query of the form: "Given X=a, find the value of Y"
• Examples:
  - Given a telephone number of a customer, find the ID of the circuit to which the telephone line is attached. (One answer expected.)
  - Given a circuit ID, find the names of customers whose telephones are attached to that circuit. (Possibly several answers.)

Slide 7: Schema Graph
• Directed acyclic graph
• Nodes are field names
• Intra-application edge
  - Links fields in the same application
• Inter-application edge
  - Links fields across applications
All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications.

Slide 8: Data Graph
• Given a specific value of the source node X, what are the values of the sink node Y?
• Considers all join paths from X to Y in the schema graph
[Figure: example data graph; some paths are dead ends (marked X), e.g. no corresponding SALES.BAN]
Example: two paths lead to answer c1
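As a rough illustration of "considers all join paths from X to Y", the sketch below enumerates every source-to-sink path in a small directed acyclic schema graph. The function name `join_paths` and the edges of the `schema` dictionary are hypothetical; the field names echo the motivating example, but the actual schema graph is not recoverable from the transcript.

```python
from typing import Dict, List

def join_paths(schema: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from source to sink in a DAG given as an adjacency map."""
    if source == sink:
        return [[sink]]
    paths: List[List[str]] = []
    for nxt in schema.get(source, []):
        # Recursion terminates because the schema graph is assumed acyclic.
        for tail in join_paths(schema, nxt, sink):
            paths.append([source] + tail)
    return paths

# Hypothetical schema graph: nodes are APPLICATION.FIELD names, edges are made up.
schema = {
    "SALES.TN": ["ORDERING.TN", "INVENTORY.TN"],      # inter-application edges
    "ORDERING.TN": ["ORDERING.ORN"],                  # intra-application edge
    "ORDERING.ORN": ["PROVISIONING.ORN"],             # inter-application edge
    "PROVISIONING.ORN": ["PROVISIONING.PON"],         # intra-application edge
    "PROVISIONING.PON": ["INVENTORY.PON"],            # inter-application edge
    "INVENTORY.PON": ["INVENTORY.CircuitID"],         # intra-application edge
    "INVENTORY.TN": ["INVENTORY.CircuitID"],          # intra-application edge
}

paths = join_paths(schema, "SALES.TN", "INVENTORY.CircuitID")
for p in paths:
    print(" -> ".join(p))
```

In this toy graph a telephone number can reach a circuit ID either directly through Inventory or through the longer Ordering and Provisioning route, which is the kind of redundancy the framework exploits.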

Slide 9: Scoring Answers
• Which are the correct values?
  - Unclean data
  - No a priori knowledge
• Technique to score data edges
  - What is the probability that the association between the fields linked by the edge is correct?
• Probabilistic interpretation of data edge scores to score full join paths
  - Edge score aggregation
  - Independent of the length of the path

Slide 10: Scoring Data Edges
• Rely on functional dependencies (we are considering fields that are keys)
• Data edge scores model the error in the data
• Intra-application edge
  - Fields A and B within the same application
  - Scored for A -> B (and symmetrically for B -> A), where the b_i are the values instantiated from querying the application with value a
  - [Equations: intra-application edge scores for A -> B and B -> A; not recoverable from the transcript]
• Inter-application edge
  - Score equals 1, unless approximate matching is used

Slide 11: Scoring Data Paths
• A single data path is scored using a simple sequential composition of its data edge probabilities
  - Example: path X -> a -> b -> Y, pathScore = 0.5*0.8*0.6 = 0.24
• Data paths leading to the same answer are scored using parallel composition (independence assumption)
  - [Figure: data graph over X, a, b, c, Y in which a second path reaches the same answer; the combined pathScore expression contains the term (0.24*0.2), the rest is missing from the transcript]
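The two compositions can be sketched as follows. Sequential composition is the product given on the slide. For parallel composition the transcript only preserves the fragment "(0.24*0.2)", so the formula used here, 1 - (1 - p1)(1 - p2), which equals p1 + p2 - p1*p2 for two independent paths, is an assumption consistent with that fragment and with the stated independence assumption, not a confirmed reading of the slide. The function names are hypothetical.

```python
from math import prod

def sequential(edge_scores):
    """Score of one data path: product of its data edge probabilities."""
    return prod(edge_scores)

def parallel(path_scores):
    """Combined score of independent paths reaching the same answer.

    ASSUMPTION: probability that at least one path is correct,
    1 - prod(1 - p_i). The slide's exact formula is not in the transcript.
    """
    return 1.0 - prod(1.0 - p for p in path_scores)

single = sequential([0.5, 0.8, 0.6])   # approximately 0.24, as on the slide
combined = parallel([single, 0.2])     # 0.24 + 0.2 - 0.24*0.2 under the assumption
print(round(single, 3), round(combined, 3))
```

Under this reading a corroborating path can only raise an answer's score, never lower it, which matches the later remark that answer scores can always be increased by new information.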

Slide 12: Identifying Answers
• Only interested in best answers
• Standard top-k techniques do not apply
  - Answer scores can always be increased by new information
  - We keep score range information
  - Return top answers when identified; they may not have complete scores
• Two return strategies
  - Top-k
  - Top-few (weaker stop condition)
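One way to read "keep score range information" and "return top answers when identified" is a stop test on lower and upper score bounds. The sketch below is a minimal illustration under that assumption: each answer carries a (lower, upper) range, and the top-k set is declared identified once the k-th best lower bound is at least the best upper bound of every other answer, including answers not yet seen. The function name, the `unseen_upper` parameter, and the sample ranges are hypothetical. The top-few variant is not sketched, since the transcript only says its stop condition is weaker.

```python
from typing import Dict, List, Optional, Tuple

def top_k_if_identified(
    ranges: Dict[str, Tuple[float, float]],
    k: int,
    unseen_upper: float = 0.0,
) -> Optional[List[str]]:
    """Return the top-k answers if their membership is already certain, else None.

    ranges maps each answer to (lower, upper) bounds on its final score;
    unseen_upper bounds the score of any answer not discovered yet.
    """
    if len(ranges) < k:
        return None
    ranked = sorted(ranges, key=lambda a: ranges[a][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    kth_lower = ranges[top[-1]][0]
    best_rival = max([ranges[a][1] for a in rest] + [unseen_upper])
    return top if kth_lower >= best_rival else None

# Hypothetical score ranges for three candidate answers.
ranges = {"c1": (0.6, 0.9), "c2": (0.1, 0.5), "c3": (0.05, 0.3)}
print(top_k_if_identified(ranges, k=1))   # c1 is safe: 0.6 >= 0.5
print(top_k_if_identified(ranges, k=2))   # not yet: c3 could still overtake c2
```

Note that the returned answers may still have open score ranges; only their membership in the top-k is settled.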

Slide 13: Computing Answers
• Take advantage of early pruning
  - Only interested in best answers
• Incremental data graph computation
  - Probes to each application
  - Cost model is the number of probes
• Standard graph searching techniques (DFS, BFS) do not take advantage of score information
• We propose a technique based on the notion of maximum benefit

Slide 14: Maximum Benefit
• Benefit computation of a path uses two components
  - Known scores of the explored data edges
  - Best way to augment an answer's scores
• Uses residual benefit of unexplored schema edges
• Our strategy makes choices that aim at maximizing this benefit metric
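The slides do not give the benefit formula, so the following is only a simplified sketch of benefit-ordered probing. It assumes benefit = (product of known edge scores on the partial path) * (an optimistic bound on what the unexplored edges can still contribute), and it always probes the partial path with the highest benefit. It returns the single best path and ignores parallel composition across paths. All names (`best_first`, `residual`, the toy graph) are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(next node, edge score)]

def best_first(graph: Graph, residual: Dict[str, float],
               source: str, sinks: Set[str]) -> Tuple[List[str], float, int]:
    """Expand partial paths in decreasing order of assumed benefit.

    Returns (best path, its score, number of probes issued).
    """
    # Heap entries are (-benefit, known score, path); heapq is a min-heap.
    heap = [(-residual.get(source, 1.0), 1.0, [source])]
    probes = 0
    while heap:
        _, score, path = heapq.heappop(heap)
        node = path[-1]
        if node in sinks:
            return path, score, probes
        probes += 1  # one probe to the application that holds `node`
        for nxt, edge in graph.get(node, []):
            known = score * edge
            benefit = known * residual.get(nxt, 1.0)  # optimistic completion
            heapq.heappush(heap, (-benefit, known, path + [nxt]))
    return [], 0.0, probes

# Toy data graph; edge scores chosen to echo the 0.5*0.8*0.6 example.
graph = {
    "X": [("a", 0.5), ("c", 0.2)],
    "a": [("b", 0.8)],
    "b": [("Y", 0.6)],
    "c": [("Y", 1.0)],
}
path, score, probes = best_first(graph, residual={}, source="X", sinks={"Y"})
print(path, round(score, 2), probes)
```

With every residual bound left at 1.0, the branch through c is never probed, because its benefit (0.2) stays below the score of the completed path through a and b. That is the kind of probe saving that DFS or BFS cannot guarantee.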

Slide 15: VIP Experimental Platform
• Integration platform developed at AT&T
• 30 legacy systems
• Real data
• Developed as a platform for resolving disputes between applications that are due to data inconsistencies
• Front-end web interface

Slide 16: VIP Queries
• Random sample of 150 user queries
• Analysis shows that queries can be classified according to the number of answers they retrieve:
  - noAnswer (nA): 56 queries
  - anyAnswer (aA): 94 queries
    · oneLarge (oL): 47 queries
    · manyLarge (mL): 4 queries
    · manySmall (mS): 8 queries
  - heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

Slide 17: VIP Schema Graph
[Figure: VIP schema graph, annotated with paths leading to an answer / paths leading to the top-1 answer (94 queries)]
Not considering all paths may lead to missing top-1 answers.

Slide 18: Number of Parallel Paths Contributing to the Top-1 Answer
[Figure: number of parallel paths contributing to the top-1 answer]
Average of 10 parallel paths per answer, 2.5 significant.

Slide 19: Cost of Execution
[Figure: cost of execution]

Slide 20: Related Work
• Keyword search in DBMS (BANKS, DBXplorer, DISCOVER, ObjectRank)
  - Query is a set of keywords
  - Top-k query model
  - DB as data graph
  - Do not agglomerate scores
• Top-k query evaluation (TA, MPro, Upper)
  - Consider tuples as an entity
  - Wait for exact answer (except for NRA)
  - Do not agglomerate scores
• Probabilistic ranking of DB results
  - Queries not selective, large answer set
We take corroborative evidence into account to rank query results.

Slide 21: Conclusion
• Multiple Join Path Framework
  - Uses corroborating evidence to identify high-quality results
  - Looks at all paths in the schema graph
• Scoring mechanism
  - Probabilistic interpretation
  - Takes schema information into account
• Techniques to compute answers
  - Take into account agglomerative scoring
  - Top-k and top-few