Published by April Reeves. Modified over 9 years ago.
1
Sequential Dependencies
Flip Korn, AT&T; Lukasz Golab, AT&T; Howard Karloff, AT&T; Avishek Saha, University of Utah; Divesh Srivastava, AT&T
2
Data Quality
Cost to business: $600 billion
Problems prevalent in measurement data
– Equipment failure
– Calibration/systemic errors
– Configuration errors
– Management errors
Our goal: detection, not cleaning
3
Data Quality
Principled approach: integrity constraints
– Assert semantics
– Deviations = quality issues
– Allow approx for real-world data
– Condition tableau discovery
Domain often gives rise to semantics
– missing, extraneous, out-of-order
4
Sequential Dependency Definition
Sequential dependency $X \to_g Y$
– $(y_i - y_{i-1}) \in g$, y's sorted w.r.t. x-values
– extension of functional dependency, $g = (0, \infty)$: $X \to Y : \forall t_1, t_2,\ t_1[X] < t_2[X] \Rightarrow t_1[Y] < t_2[Y]$
Example #1: date $\to_{[20, \infty)}$ price
– Prices increasing by at least 20 units
Example #2: poll# $\to_{[4,6]}$ time
– Consecutive polls within 4-6 mins
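The definition above can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes the data are (x, y) pairs, and the names `gap_in` and `violations`, the open/closed flags, and the sample readings are all made up for illustration.

```python
import math

def gap_in(gap, lo, hi, lo_open=False, hi_open=False):
    """True if gap lies in the interval g with endpoints lo and hi."""
    above = gap > lo if lo_open else gap >= lo
    below = gap < hi if hi_open else gap <= hi
    return above and below

def violations(rows, lo, hi, lo_open=False, hi_open=False):
    """Sort (x, y) rows by x and return consecutive y pairs whose gap is outside g."""
    ys = [y for _, y in sorted(rows, key=lambda r: r[0])]
    return [(a, b) for a, b in zip(ys, ys[1:])
            if not gap_in(b - a, lo, hi, lo_open, hi_open)]

# Hypothetical (poll#, time in minutes) readings checked against g = [4, 6].
polls = [(1, 0), (2, 5), (3, 9), (4, 20)]
bad_polls = violations(polls, 4, 6)

# Hypothetical (date, price) readings checked against g = [20, inf).
prices = [(1, 100), (2, 125), (3, 130)]
bad_prices = violations(prices, 20, math.inf, hi_open=True)
```

With these inputs the only poll violation is the jump from 9 to 20 (a gap of 11), and the only price violation is the step from 125 to 130 (a gap of 5).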
5
Approx Sequential Dependency
[Figure; labels: Start, End] Confidence = 67%, $g = (0, \infty)$
6
Conditional Sequential Dependencies
Confidence ≥ 80%
Intervals: [1,6], [2,11], [7,12]
7
Conditional Sequential Dependencies
Example #1: $g = (0, \infty)$
8
Conditional Sequential Dependencies
Example #2: $g = [9,11]$, $g = [20, \infty)$
9
Contributions
Introduce sequential dependencies (SDs)
– algorithm for computing confidence
Tableau Discovery for CSDs
– problem definition
– fast approximation algorithm
10
Approx Sequential Dependency
Confidence: $(N - \mathrm{OPS})/N$, where OPS = minimum number of insertions + deletions
– Edit distance
Ex: with [4,6]
– del 12, ins 15, ins 20, del 31, so OPS = 4 and conf = 4/8
Doesn’t overpenalize for rare drops
– Eg, with [5,5]
Penalize large gaps
– Eg, [3,5] with gap of 6 vs. 1000
11
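Approx Sequential Dependency
How to compute OPS for $g = [G_1, G_2]$?
$\mathrm{dcost}(d)$ = #ins (or $\infty$) to end in $d$
– Eg, [4,6]: dcost(6) = 1, dcost(7) = $\infty$, dcost(8) = 2
– $\lceil d/G_2 \rceil$ when $\lceil (d+1)/G_1 \rceil > \lceil d/G_2 \rceil$; else $\infty$
Let $T(i)$ := OPS made to $a_1, \ldots, a_i$ with $a_i$ kept
Suppose $T(1), T(2), \ldots, T(i-1)$ already computed
$T(i) = \min_j \{\, T(j) + (i-1-j) + [\mathrm{dcost}(a_i - a_j) - 1] \,\}$
– $O\left(\frac{G_2}{G_2 - G_1} N \log N\right)$ algorithm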
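A minimal sketch of the recurrence above, assuming integer values and 0 < G1 ≤ G2. It is the plain quadratic form of the dynamic program, not the faster algorithm the slide mentions. The base case (a_i kept as the first element, after deleting everything before it) and the final step (deleting everything after the last kept element) are assumptions filled in for illustration, as are the function names and the sample sequence. The feasibility test in `dcost` is written as ceil(d/G2)·G1 ≤ d, which agrees with the slide's three examples for [4,6].

```python
import math

def dcost(d, g1, g2):
    """Fewest steps, each of size in [g1, g2], that add up to d; inf if impossible."""
    if d <= 0:
        return math.inf
    k = -(-d // g2)                      # ceil(d / g2) for positive integers
    return k if k * g1 <= d else math.inf

def min_ops(a, g1, g2):
    """Minimum insertions + deletions so that all consecutive gaps lie in [g1, g2]."""
    n = len(a)
    if n == 0:
        return 0
    T = [0] * n                          # T[i]: cost for prefix a[0..i] with a[i] kept
    for i in range(n):
        best = i                         # assumed base case: delete everything before a[i]
        for j in range(i):
            steps = dcost(a[i] - a[j], g1, g2)
            if steps != math.inf:
                # delete a[j+1..i-1], insert steps - 1 elements between a[j] and a[i]
                best = min(best, T[j] + (i - 1 - j) + (steps - 1))
        T[i] = best
    # assumed final step: delete everything after the last kept element
    return min(T[i] + (n - 1 - i) for i in range(n))

def confidence(a, g1, g2):
    return (len(a) - min_ops(a, g1, g2)) / len(a)

seq = [0, 5, 10, 12, 15]                 # made-up sequence: deleting 12 repairs it
ops = min_ops(seq, 4, 6)                 # 1
conf = confidence(seq, 4, 6)             # 0.8
```

In the made-up sequence a single deletion suffices, so the confidence is (5 − 1)/5. A sequence such as [0, 5, 15] also costs one operation under [4,6], this time an insertion that splits the gap of 10.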
12
Tableau Discovery
Assume underlying SD given
– Data often suggest ordering semantics
Good tableau = small set of intervals
– Each interval satisfies confidence threshold
– Union satisfies support threshold
Find maximal time intervals [i,j] s.t.
– Confidence satisfied in [i,j]
Can we do better than testing all [i,j]’s?
13
Tableau Discovery: Candidates
Relax constraint: confidence $\ge \hat{c}/(1+\varepsilon)$
For any interval $I$, there exists $J$ s.t. (a) $I \subseteq J$ and (b) $|J| \le (1+\varepsilon)|I|$
Hence $\mathrm{conf}(J) \ge \mathrm{conf}(I)/(1+\varepsilon)$
14
Tableau Discovery: Candidates
Test just enough intervals:
(a) lengths $1, (1+\delta), (1+\delta)^2, \ldots$
(b) starting points $\delta, \delta(1+\delta), \delta(1+\delta)^2, \ldots$
15
Tableau Discovery: Candidates
Processing cost:
– Intervals at level $h$ have length $(1+\delta)^h$
– $N/(\delta(1+\delta)^h)$ intervals at level $h$
– $\log_{1+\delta} N$ total levels
– sum of lengths $= O((N/\delta)\log_{1+\delta} N) = O((N/\delta^2) \lg N)$
Improvement:
– Interval lengths in $[A, 2A]$ start at $\delta A, 2\delta A, 3\delta A, \ldots$
– Prefix property
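The candidate set described above can be enumerated directly. This is a rough sketch under one assumption: starting points at level h are taken to be multiples of δ(1+δ)^h, which is the reading that matches the N/(δ(1+δ)^h) count per level. The function name and the (level, start, length) representation are illustrative, and the "Improvement" variant is not implemented.

```python
def candidate_intervals(n, delta):
    """List (level, start, length) candidates over positions [0, n]."""
    out = []
    h = 0
    while (1 + delta) ** h <= n:
        length = (1 + delta) ** h        # level h has length (1 + delta)^h
        step = delta * length            # assumed spacing between starting points
        k = 0
        while k * step + length <= n:    # keep the interval inside [0, n]
            out.append((h, k * step, length))
            k += 1
        h += 1
    return out

cands = candidate_intervals(100, 1.0)
num_candidates = len(cands)              # 197
num_all = 100 * 101 // 2                 # 5050 intervals [i, j] with 1 <= i <= j <= 100
```

With N = 100 and δ = 1 the levels have lengths 1, 2, 4, ..., 64 and contribute 100 + 50 + 25 + 12 + 6 + 3 + 1 = 197 candidates, against 5050 intervals when every [i, j] is tested.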
16
Tableau Discovery: Assembly
Optimal solution in quadratic time
Greedy partial set cover
Can implement in linear time
Constant performance ratio
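A simple sketch of greedy partial set cover over candidate intervals, assuming every candidate already meets the confidence threshold. This is the naive version that rescans all candidates each round, not the linear-time implementation the slide refers to. The function name, the inclusive (i, j) representation, and the choice of 12 positions are assumptions; the interval endpoints are borrowed from the Conditional Sequential Dependencies slide purely as sample input.

```python
import math

def greedy_cover(intervals, n, support):
    """Pick intervals until at least support * n of the positions 1..n are covered."""
    need = math.ceil(support * n)        # number of positions that must be covered
    covered, chosen = set(), []
    remaining = list(intervals)
    while len(covered) < need and remaining:
        # Choose the interval that adds the most not-yet-covered positions.
        best = max(remaining,
                   key=lambda iv: len(set(range(iv[0], iv[1] + 1)) - covered))
        gain = set(range(best[0], best[1] + 1)) - covered
        if not gain:
            break                        # no remaining interval adds coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, len(covered)

candidates = [(1, 6), (2, 11), (7, 12)]
half, half_cov = greedy_cover(candidates, 12, 0.5)   # [(2, 11)], 10 positions
full, full_cov = greedy_cover(candidates, 12, 1.0)   # all three intervals, 12 positions
```

At support 0.5 a single interval is enough. At support 1.0 the greedy choice starts with [2,11] and therefore ends up with three intervals, although [1,6] and [7,12] alone would cover everything, which illustrates why the result is an approximation.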
17
Summary of Results
Tableau almost identical at small δ
Significant speedup at small δ
“Inflating” ĉ to (1+δ)ĉ works well
18
Experiments: Sample Tableau
Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05
19
Experiments: Tableau Size
[Figure panels: Gaps in [0,∞); Gaps in [0,5]]
DowJones data: support 0.5
20
Experiments: Scalability
[Figure panels: Gaps in [0,∞); Gaps in [4,6]]
Network data: support 0.5, conf 0.99
WeatherDates: support 0.5, conf 0.9
21
Case Study: Polled Data
conf ≥ 0.995, support ≥ 0.5, δ = 0.05
22
Case Study: Stock Data
conf ≥ 0.995, support ≥ 0.5, δ = 0.05
[Figure: Dow Jones 2-week moving average; axis ticks at $10^4$, $10^3$, $10^2$]
23
Conclusions
Constraint-driven approach
– Define, discover, detect
Use whatever semantics available
– Domain knowledge, expectation, etc.
Model errors carefully
– Confidence measure
Tableaux useful for summary
24
The End
25
Background
Functional Dependency
– $X \to Y : \forall t_1, t_2,\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$
Example
– title $\to$ salary
– What happens when data merged?
26
Background
ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50
title $\to$ salary

ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35
[title,company] $\to$ salary?
– 100% support, 80% confidence

Hold Tableau (60% support, 100% confidence)
Company|Title|Salary
ATT|*|*

Fail Tableau (40% support, 50% confidence)
Company|Title|Salary
IBM|*|*
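The support and confidence figures on this slide can be reproduced from the merged table. The sketch below assumes confidence means the largest fraction of matching rows that can be kept so that the dependency holds, and support means the fraction of all rows matched by a tableau pattern; that reading agrees with the numbers on the slide. The helper names `fd_confidence` and `pattern_stats` are illustrative.

```python
from collections import Counter, defaultdict

COLUMNS = ("ssn", "name", "company", "title", "salary")
DATA = [
    ("123", "alice", "ATT", "manager", 50),
    ("456", "bob", "ATT", "sales", 40),
    ("789", "cathy", "ATT", "manager", 50),
    ("012", "david", "IBM", "engineer", 30),
    ("345", "emily", "IBM", "engineer", 35),
]
ROWS = [dict(zip(COLUMNS, r)) for r in DATA]

def fd_confidence(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[c] for c in lhs)][r[rhs]] += 1
    # Within each group of equal lhs values, keep the most common rhs value.
    return sum(max(c.values()) for c in groups.values()) / len(rows)

def pattern_stats(rows, company, lhs, rhs):
    """(support, confidence) of the tableau row (company, *, *)."""
    match = [r for r in rows if r["company"] == company]
    return len(match) / len(rows), fd_confidence(match, lhs, rhs)

LHS = ["title", "company"]
overall = fd_confidence(ROWS, LHS, "salary")         # 0.8
att = pattern_stats(ROWS, "ATT", LHS, "salary")      # (0.6, 1.0)
ibm = pattern_stats(ROWS, "IBM", LHS, "salary")      # (0.4, 0.5)
```

The two IBM engineers have different salaries, so one of them must be dropped: that costs one row out of five overall (80%) and one row out of two within the IBM pattern (50%).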
27
CFD Results
Given FD, discover tableau:
– min tableau size
– subj. to support and confidence constraints
Hardness:
– global conf: inapproximable
– local conf: NP-hard, fast approx algo