Circumventing Data Quality Problems Using Multiple Join Paths
Yannis Kotidis, Athens University of Economics and Business
Amélie Marian, Rutgers University
Divesh Srivastava, AT&T Labs-Research

9/11/2006, Amélie Marian - Rutgers University

Motivating Example
[Figure: schema diagram of the Sales, Ordering, Provisioning, and Inventory applications, linking the fields TN, BAN, CustName, ORN, PON, SubPON, and CircuitID]
TN: Telephone Number
ORN: Order Number
BAN: Billing Account Number
PON: Provisioning Order Number
SubPON: Related PON
What is the Circuit ID associated with a Telephone Number that appears in SALES?

Motivations
Data applications with overlapping features
  Data integration
  Web sources
Data quality issues (duplicate, null, default values, data inconsistencies)
  Data-entry problems
  Data integration problems

Contributions
Multiple Join Path (MJP) framework
  Quantifies answer quality
  Takes corroborating evidence into account
  Agglomerative scoring of answers
Answer computation techniques
  Designed for MJP scoring methodologies
  Several output options (top-k, top-few)
Experimental evaluation on real data
  VIP integration platform
  Quality of answers
  Efficiency of our techniques

Outline
Multiple Join Path Framework
  Problem Definition
  Our Approach
    Scoring Answers
    Computing Answers
Experimental Evaluation
Related Work

Multiple Join Path Framework: Problem Definition
Query of the form: “Given X=a, find the value of Y”
Examples:
  Given a telephone number of a customer, find the ID of the circuit to which the telephone line is attached.
    One answer expected
  Given a circuit ID, find the names of customers whose telephones are attached to that circuit.
    Possibly several answers

Schema Graph
Directed acyclic graph
Nodes are field names
Intra-application edge
  Links fields in the same application
Inter-application edge
  Links fields across applications
All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications

Data Graph
Given a specific value of the source node X, what are the values of the sink node Y?
Considers all join paths from X to Y in the schema graph
[Figure: example data graph; paths that dead-end are marked with an X, e.g. no corresponding SALES.BAN]
Example: two paths lead to answer c1

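The idea of considering all join paths from X to Y can be sketched as a path enumeration over a directed acyclic graph. This is only an illustration: the graph `SCHEMA` and its node names A, B, C are hypothetical and do not reproduce the schema in the slides, and `join_paths` is a helper name introduced here.

```python
from typing import Dict, List

def join_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from source to sink in a DAG, each as a list of nodes."""
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in graph.get(source, []):
        # Extend each path found from the neighbor with the current node.
        for tail in join_paths(graph, nxt, sink):
            paths.append([source] + tail)
    return paths

# Hypothetical schema graph: field -> fields reachable through one join edge.
SCHEMA = {"X": ["A", "C"], "A": ["B"], "B": ["Y"], "C": ["Y"]}

paths = join_paths(SCHEMA, "X", "Y")
for p in paths:
    print(" -> ".join(p))
```

Because the graph is acyclic, the recursion terminates without needing a visited set. In a data graph the same traversal would run over field values instead of field names, and some branches would dead-end when a lookup returns nothing.
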
Scoring Answers
Which are the correct values?
  Unclean data
  No a priori knowledge
Technique to score data edges
  What is the probability that the association between the fields linked by the edge is correct?
Probabilistic interpretation of data edge scores to score full join paths
  Edge score aggregation
  Independent of the length of the path

Scoring Data Edges
Rely on functional dependencies (we are considering fields that are keys)
Data edge scores model the error in the data
Intra-application edge
  Fields A and B within the same application
  Scored for A -> B (and symmetrically for B -> A), where the b_i are the values instantiated from querying the application with value a
  [Formulas for the A -> B and B -> A edge scores are not preserved in this transcript]
Inter-application edge
  Score equals 1, unless approximate matching is used

Scoring Data Paths
A single data path is scored using a simple sequential composition of its data edge probabilities
  Example: path X -> a -> b -> Y with edge scores 0.5, 0.8, 0.6
  pathScore = 0.5 * 0.8 * 0.6 = 0.24
Data paths leading to the same answer are scored using parallel composition (independence assumption)
  Example: a second path to Y through node c, with edge scores 0.4 and 0.5 (path score 0.2)
  pathScore = 0.24 + 0.2 - (0.24 * 0.2) = 0.392

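The two composition rules can be checked with a few lines of code. The sketch below reuses the numbers from the slide; the function names `sequential` and `parallel` are introduced here for illustration, and extending the two-path formula p1 + p2 - p1*p2 to any number of paths as 1 - prod(1 - p_i) is an assumption that follows from the stated independence assumption.

```python
from typing import Iterable

def sequential(edge_scores: Iterable[float]) -> float:
    """Score of one data path: product of its edge probabilities."""
    score = 1.0
    for p in edge_scores:
        score *= p
    return score

def parallel(path_scores: Iterable[float]) -> float:
    """Score of an answer reached by several independent paths.

    For two paths this equals p1 + p2 - p1 * p2.
    """
    all_wrong = 1.0
    for p in path_scores:
        all_wrong *= (1.0 - p)
    return 1.0 - all_wrong

path1 = sequential([0.5, 0.8, 0.6])   # 0.24, up to floating-point rounding
path2 = sequential([0.4, 0.5])        # 0.2
answer = parallel([path1, path2])     # 0.392, up to floating-point rounding
print(round(path1, 3), round(path2, 3), round(answer, 3))
```

Note that adding a path can only raise an answer's score under parallel composition, which is why corroborating evidence from several join paths increases confidence in an answer.
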
Identifying Answers
Only interested in best answers
Standard top-k techniques do not apply
  Answer scores can always be increased by new information
  We keep score range information
  Return top answers when identified, may not have complete scores
Two return strategies
  Top-k
  Top-few (weaker stop condition)

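Keeping score ranges makes it possible to return the top answers before their scores are complete. The sketch below shows one plausible stop test for top-k, not the authors' exact condition: each answer carries a (lower, upper) score range, and the top k are returned once the k-th best lower bound is at least as large as every other upper bound. The function name, the `unseen_upper` parameter (a bound on the score of answers not yet discovered), and the sample ranges are all assumptions made for illustration. The top-few stop condition is not specified in the slides and is not modeled here.

```python
from typing import Dict, List, Optional, Tuple

def top_k_if_identified(
    ranges: Dict[str, Tuple[float, float]],
    k: int,
    unseen_upper: float = 0.0,
) -> Optional[List[str]]:
    """Return the top-k answers if they are already identified, else None.

    ranges maps each answer to its (lower, upper) score bounds.
    """
    ranked = sorted(ranges, key=lambda a: ranges[a][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    if len(top) < k:
        return None
    kth_lower = ranges[top[-1]][0]
    # No other answer, seen or unseen, may still overtake the k-th answer.
    best_rival = max([ranges[a][1] for a in rest] + [unseen_upper])
    return top if kth_lower >= best_rival else None

# Hypothetical ranges: c3 could still overtake c1, so nothing is returned yet.
early = top_k_if_identified({"c1": (0.392, 0.5), "c2": (0.1, 0.3), "c3": (0.05, 0.45)}, k=1)
# After more probes tighten c3's range, c1 is identified as top-1.
later = top_k_if_identified({"c1": (0.392, 0.5), "c2": (0.1, 0.3), "c3": (0.05, 0.2)}, k=1)
print(early, later)
```
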
Computing Answers
Take advantage of early pruning
  Only interested in best answers
Incremental data graph computation
  Probes to each application
  Cost model is the number of probes
Standard graph searching techniques (DFS, BFS) do not take advantage of score information
We propose a technique based on the notion of maximum benefit

Maximum Benefit
Benefit computation of a path uses two components
  Known scores of the explored data edges
  Best way to augment an answer’s scores
    Uses residual benefit of unexplored schema edges
Our strategy makes choices that aim at maximizing this benefit metric

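The slides do not give the benefit formula, so the following is only a rough sketch of how such a metric could combine the two components. It assumes, hypothetically, that each candidate probe is described by the answer's current score, the product of the already-explored edge scores on the partial path (`prefix`), and an optimistic bound on the product of the unexplored schema edges (`residual`). The gain is the increase the answer would see under parallel composition if the path completed at that bound. All names and numbers are illustrative.

```python
from typing import List, Tuple

def potential_gain(current: float, prefix: float, residual: float) -> float:
    """Increase in an answer's score if the partial path completes at its best."""
    best_path = prefix * residual
    # Parallel composition of the current score with the optimistic path score.
    return (current + best_path - current * best_path) - current

# Candidate probe: (name, current answer score, explored prefix score, residual bound)
Candidate = Tuple[str, float, float, float]

def pick_next(frontier: List[Candidate]) -> str:
    """Choose the probe whose completion could raise an answer's score the most."""
    return max(frontier, key=lambda c: potential_gain(c[1], c[2], c[3]))[0]

frontier = [
    ("probe_a", 0.24, 0.4, 0.5),  # optimistic path score 0.2
    ("probe_b", 0.0, 0.9, 0.3),   # optimistic path score 0.27
]
choice = pick_next(frontier)
print(choice)
```

Unlike plain DFS or BFS, a greedy choice of this kind spends probes where the scores say they matter, which fits a cost model that counts probes.
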
VIP Experimental Platform
Integration platform developed at AT&T
  30 legacy systems
  Real data
Developed as a platform for resolving disputes between applications that are due to data inconsistencies
Front-end web interface

VIP Queries
Random sample of 150 user queries.
Analysis shows that queries can be classified according to the number of answers they retrieve:
  noAnswer (nA): 56 queries
  anyAnswer (aA): 94 queries
  oneLarge (oL): 47 queries
  manyLarge (mL): 4 queries
  manySmall (mS): 8 queries
  heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

VIP Schema Graph
[Figure: paths leading to an answer / paths leading to the top-1 answer (94 queries)]
Not considering all paths may lead to missing top-1 answers

Number of Parallel Paths Contributing to the Top-1 Answer
Average of 10 parallel paths per answer, 2.5 significant

Cost of Execution
[Figure: cost of execution; not preserved in this transcript]

Related Work
Keyword search in DBMS (BANKS, DBXplorer, DISCOVER, ObjectRank)
  Query is a set of keywords
  Top-k query model
  DB as data graph
  Do not agglomerate scores
Top-k query evaluation (TA, MPro, Upper)
  Consider tuples as an entity
  Wait for exact answer (except for NRA)
  Do not agglomerate scores
Probabilistic ranking of DB results
  Queries not selective, large answer set
We take corroborative evidence into account to rank query results

Conclusion
Multiple Join Path Framework
  Uses corroborating evidence to identify high-quality results
  Looks at all paths in the schema graph
Scoring mechanism
  Probabilistic interpretation
  Takes schema information into account
Techniques to compute answers
  Take into account agglomerative scoring
  Top-k and top-few