Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University.


1 Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University Divesh Srivastava, AT&T Labs-Research

2 Motivating Example
9/11/2006 · Amélie Marian - Rutgers University
Applications and fields (labels as they appear in the slide figure): Sales TN BAN TN BAN CustName ORN PON Provisioning CustName PON SubPON Inventory PON TN CircuitID Ordering ORN TN
TN: Telephone Number
ORN: Order Number
BAN: Billing Account Number
PON: Provisioning Order Number
SubPON: Related PON
What is the Circuit ID associated with a Telephone Number that appears in SALES?

3 Motivations
• Data applications with overlapping features
  - Data integration
  - Web sources
• Data quality issues (duplicate, null, default values, data inconsistencies)
  - Data-entry problems
  - Data integration problems

4 Contributions
• Multiple Join Path (MJP) framework
  - Quantifies answer quality
  - Takes corroborating evidence into account
  - Agglomerative scoring of answers
• Answer computation techniques
  - Designed for MJP scoring methodologies
  - Several output options (top-k, top-few)
• Experimental evaluation on real data
  - VIP integration platform
  - Quality of answers
  - Efficiency of our techniques

5 Outline
• Multiple Join Path Framework
  - Problem Definition
• Our Approach
  - Scoring Answers
  - Computing Answers
• Experimental Evaluation
• Related Work

6 Multiple Join Path Framework: Problem Definition
• Query of the form: "Given X=a, find the value of Y"
• Examples:
  - Given a telephone number of a customer, find the ID of the circuit to which the telephone line is attached. (One answer expected.)
  - Given a circuit ID, find the names of the customers whose telephones are attached to that circuit. (Possibly several answers.)

7 Schema Graph
• Directed acyclic graph
• Nodes are field names
• Intra-application edge: links fields in the same application
• Inter-application edge: links fields across applications
All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications.

8 Data Graph
• Given a specific value of the source node X, what are the values of the sink node Y?
• Considers all join paths from X to Y in the schema graph
Figure annotations: some paths are crossed out as dead ends (for example, no corresponding SALES.BAN). Example: two paths lead to answer c1.
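As a rough illustration of "all join paths from X to Y", the sketch below enumerates every source-to-sink path in a small directed acyclic graph. The graph, its node names (`A`, `B`, `C`), and the function `all_join_paths` are hypothetical and are not taken from the slides; the real schema graph links application fields such as those in the motivating example.

```python
from typing import Dict, List


def all_join_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from source to sink in a directed acyclic graph."""
    if source == sink:
        return [[sink]]
    paths: List[List[str]] = []
    for nxt in graph.get(source, []):
        # Extend each path found from the successor with the current node.
        for tail in all_join_paths(graph, nxt, sink):
            paths.append([source] + tail)
    return paths


# Hypothetical schema graph: X is the source field, Y the sink field.
schema_graph = {
    "X": ["A", "C"],
    "A": ["B"],
    "B": ["Y"],
    "C": ["Y"],
}

join_paths = all_join_paths(schema_graph, "X", "Y")
for path in join_paths:
    print(" -> ".join(path))
```

Because the schema graph is acyclic, the recursion terminates without a visited set. On a large graph the number of paths can grow quickly, which is one reason to explore paths incrementally rather than enumerate them all up front.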

9 Scoring Answers
• Which are the correct values?
  - Unclean data
  - No a priori knowledge
• Technique to score data edges
  - What is the probability that the association between the fields linked by the edge is correct?
• Probabilistic interpretation of data edge scores to score full join paths
  - Edge score aggregation
  - Independent of the length of the path

10 Scoring Data Edges
• Rely on functional dependencies (we are considering fields that are keys)
• Data edge scores model the error in the data
• Intra-application edge: for fields A and B within the same application, the edge is scored for A → B (and symmetrically for B → A), where the b_i are the values instantiated from querying the application with value a
  [Score formulas for A → B and B → A not preserved in the transcript]
• Inter-application edge: score equals 1, unless approximate matching is used

11 Scoring Data Paths
• A single data path is scored using a simple sequential composition of its data edge probabilities
  - Example: X → a → b → Y with edge scores 0.5, 0.8, 0.6: pathScore = 0.5 * 0.8 * 0.6 = 0.24
• Data paths leading to the same answer are scored using parallel composition (independence assumption)
  - Example: adding a second path to Y through c (edge scores 0.4 and 0.5, path score 0.2): pathScore = 0.24 + 0.2 - (0.24 * 0.2) = 0.392
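The two compositions can be reproduced with a few lines of Python. This is a minimal sketch assuming, as the slide does, that edges and paths are independent; the function names `sequential_score` and `parallel_score` are illustrative. The parallel rule is written as 1 - Π(1 - s_i), which for two paths equals s1 + s2 - s1*s2.

```python
from math import prod
from typing import Iterable


def sequential_score(edge_scores: Iterable[float]) -> float:
    """Score of one data path: product of its edge probabilities."""
    return prod(edge_scores)


def parallel_score(path_scores: Iterable[float]) -> float:
    """Score of an answer reached by several independent paths.

    The answer is wrong only if every supporting path is wrong.
    """
    all_wrong = 1.0
    for score in path_scores:
        all_wrong *= 1.0 - score
    return 1.0 - all_wrong


first_path = sequential_score([0.5, 0.8, 0.6])        # path through a and b
second_path = sequential_score([0.4, 0.5])            # path through c
combined = parallel_score([first_path, second_path])  # both paths reach Y

print(round(first_path, 3), round(second_path, 3), round(combined, 3))
```

With the slide's numbers the three values come out to 0.24, 0.2, and 0.392. Adding a path can only raise an answer's score, which is why corroborating evidence from several join paths strengthens an answer.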

12 Identifying Answers
• Only interested in best answers
• Standard top-k techniques do not apply
  - Answer scores can always be increased by new information
  - We keep score range information
  - Return top answers when identified; they may not have complete scores
• Two return strategies
  - Top-k
  - Top-few (weaker stop condition)
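One way to read "keep score range information" is to track a lower and an upper bound per candidate answer and stop once the k best lower bounds dominate every competing upper bound. The sketch below illustrates that stop test only; the function `identified_top_k`, the `unseen_upper` parameter, and the sample bounds are assumptions, not the authors' algorithm, and the top-few condition is not modeled.

```python
from typing import Dict, List, Optional, Tuple


def identified_top_k(
    bounds: Dict[str, Tuple[float, float]],
    k: int,
    unseen_upper: float = 0.0,
) -> Optional[List[str]]:
    """Return the top-k answers if they are already certain, else None.

    bounds maps each answer to (lower, upper) score bounds.
    unseen_upper bounds the score of any answer not discovered yet.
    """
    if len(bounds) < k:
        return None
    # Rank candidates by their guaranteed (lower-bound) score.
    ranked = sorted(bounds, key=lambda answer: bounds[answer][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    kth_lower = bounds[top[-1]][0]
    best_rival = max([bounds[answer][1] for answer in rest] + [unseen_upper])
    return top if kth_lower >= best_rival else None


# Hypothetical score ranges after some probing.
early = {"c1": (0.39, 0.60), "c2": (0.10, 0.30), "c3": (0.05, 0.45)}
later = {"c1": (0.39, 0.60), "c2": (0.10, 0.30), "c3": (0.05, 0.35)}

print(identified_top_k(early, k=1))  # c3 could still overtake c1
print(identified_top_k(later, k=1))  # c1 is now certain
```

Note that the returned answer still carries a range rather than an exact score, which matches the slide's remark that top answers may be returned without complete scores.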

13 Computing Answers
• Take advantage of early pruning
  - Only interested in best answers
• Incremental data graph computation
  - Probes to each application
  - Cost model is the number of probes
• Standard graph searching techniques (DFS, BFS) do not take advantage of score information
• We propose a technique based on the notion of maximum benefit

14 Maximum Benefit
• Benefit computation of a path uses two components
  - Known scores of the explored data edges
  - Best way to augment an answer's scores
• Uses residual benefit of unexplored schema edges
• Our strategy makes choices that aim at maximizing this benefit metric
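The slides do not give the benefit formula, so the following is only a sketch of the general idea: a best-first exploration that always probes the partial path whose known score, multiplied by an optimistic bound on what remains, is largest. The names `best_first_answers`, `probe`, `residual`, and `is_sink`, the toy data graph, and the choice of multiplying the two components are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, float]  # (next value, data edge score)


def best_first_answers(
    start: str,
    probe: Callable[[str], List[Edge]],
    residual: Callable[[str], float],
    is_sink: Callable[[str], bool],
    max_probes: int,
) -> Dict[str, float]:
    """Explore the data graph, probing the most promising partial path first."""
    # Heap entries: (-optimistic benefit, value, known path score).
    heap: List[Tuple[float, str, float]] = [(-residual(start), start, 1.0)]
    answers: Dict[str, float] = {}
    probes = 0
    while heap and probes < max_probes:
        _, value, known = heapq.heappop(heap)
        probes += 1  # cost model: one probe per expanded value
        for nxt, edge_score in probe(value):
            path_score = known * edge_score  # sequential composition
            if is_sink(nxt):
                # Parallel composition with paths already found for this answer.
                previous = answers.get(nxt, 0.0)
                answers[nxt] = 1.0 - (1.0 - previous) * (1.0 - path_score)
            else:
                benefit = path_score * residual(nxt)
                heapq.heappush(heap, (-benefit, nxt, path_score))
    return answers


# Hypothetical data graph; values starting with "y" are sink values.
data_graph: Dict[str, List[Edge]] = {
    "x": [("a", 0.5), ("c", 0.4)],
    "a": [("b", 0.8)],
    "b": [("y1", 0.6)],
    "c": [("y1", 0.5)],
}

result = best_first_answers(
    start="x",
    probe=lambda value: data_graph.get(value, []),
    residual=lambda value: 1.0,  # assumed bound: remaining edges score at most 1
    is_sink=lambda value: value.startswith("y"),
    max_probes=10,
)
print(result)
```

With a tighter `residual` bound derived from the unexplored schema edges, the same loop would favor paths that can still raise an answer's score the most, and `max_probes` caps the cost in the probe-count cost model.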

15 VIP Experimental Platform
• Integration platform developed at AT&T
• 30 legacy systems
• Real data
• Developed as a platform for resolving disputes between applications that are due to data inconsistencies
• Front-end web interface

16 VIP Queries
• Random sample of 150 user queries
• Analysis shows that queries can be classified according to the number of answers they retrieve:
  - noAnswer (nA): 56 queries
  - anyAnswer (aA): 94 queries
    · oneLarge (oL): 47 queries
    · manyLarge (mL): 4 queries
    · manySmall (mS): 8 queries
  - heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

17 VIP Schema Graph
Figure: paths leading to an answer / paths leading to the top-1 answer (94 queries)
Not considering all paths may lead to missing top-1 answers

18 Number of Parallel Paths Contributing to the Top-1 Answer
Average of 10 parallel paths per answer, 2.5 significant

19 Cost of Execution

20 Related Work
• Keyword search in DBMS (BANKS, DBXplorer, DISCOVER, ObjectRank)
  - Query is a set of keywords
  - Top-k query model
  - DB as data graph
  - Do not agglomerate scores
• Top-k query evaluation (TA, MPro, Upper)
  - Consider tuples as an entity
  - Wait for exact answer (except for NRA)
  - Do not agglomerate scores
• Probabilistic ranking of DB results
  - Queries not selective, large answer set
We take corroborative evidence into account to rank query results.

21 Conclusion
• Multiple Join Path Framework
  - Uses corroborating evidence to identify high-quality results
  - Looks at all paths in the schema graph
• Scoring mechanism
  - Probabilistic interpretation
  - Takes schema information into account
• Techniques to compute answers
  - Take into account agglomerative scoring
  - Top-k and top-few

