Sequential Dependencies Flip Korn, AT&T Lukasz Golab, AT&T Howard Karloff, AT&T Avishek Saha, University of Utah Divesh Srivastava, AT&T.

1 Sequential Dependencies Flip Korn, AT&T Lukasz Golab, AT&T Howard Karloff, AT&T Avishek Saha, University of Utah Divesh Srivastava, AT&T

2 Data Quality Cost to business: $600 billion Problems prevalent in measurement data – Equipment failure – Calibration/systemic errors – Configuration errors – Management errors Our goal: detection, not cleaning

3 Data Quality Principled approach: integrity constraints – Assert semantics – Deviations = quality issues – Allow approx for real-world data – Condition → tableau discovery Domain often gives rise to semantics – missing, extraneous, out-of-order

4 Sequential Dependency Definition Sequential dependency X →_g Y – (y_i – y_{i-1}) ∈ g, y's sorted w.r.t. x-values – extension of functional dependency, g = (0, ∞) X → Y : ∀ t1, t2: t1[X] < t2[X] ⇒ t1[Y] < t2[Y] Example #1: date →_[20, ∞) price – Prices increasing by at least 20 units Example #2: poll# →_[4,6] time – Consecutive polls within 4-6 mins
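
The definition above can be checked directly on a small table. The following is a minimal sketch, not the authors' code: the function name `gap_violations`, the `(x, y)` tuple layout, and the sample data are all assumptions made for illustration.

```python
import math

def gap_violations(rows, lo, hi, lo_open=False, hi_open=False):
    """Return (x_prev, x_cur, gap) for each consecutive pair whose y-gap is outside g.

    rows is an iterable of (x, y) pairs; g is the interval from lo to hi,
    closed by default, with optional open endpoints.
    """
    ordered = sorted(rows, key=lambda r: r[0])  # sort y's w.r.t. x-values
    bad = []
    for (x_prev, y_prev), (x_cur, y_cur) in zip(ordered, ordered[1:]):
        gap = y_cur - y_prev
        above = gap > lo if lo_open else gap >= lo
        below = gap < hi if hi_open else gap <= hi
        if not (above and below):
            bad.append((x_prev, x_cur, gap))
    return bad

# Hypothetical poll data: (poll#, time in minutes), checked against g = [4, 6].
polls = [(1, 0), (2, 5), (3, 9), (4, 20), (5, 25)]
print(gap_violations(polls, 4, 6))  # [(3, 4, 11)]

# Hypothetical prices: (date, price), checked against g = [20, inf).
prices = [(1, 100), (2, 120), (3, 150)]
print(gap_violations(prices, 20, math.inf, hi_open=True))  # []
```

An empty result means the SD holds exactly; any returned triple marks a gap outside g.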

5 Approx Sequential Dependency Start End Confidence = 67% g = (0, ∞)

6 Conditional Sequential Dependencies Confidence ≥ 80% [1,6] [2,11] [7,12]

7 Conditional Sequential Dependencies Example #1: g = (0, ∞)

8 Conditional Sequential Dependencies Example #2: g = [9,11] g = [20, ∞)

9 Contributions Introduce sequential dependencies (SDs) – algorithm for computing confidence Tableau Discovery for CSDs – problem definition – fast approximation algorithm

10 Approx Sequential Dependency Confidence: (N-OPS)/N, OPS = min ins+del – Edit distance Ex: with [4,6] – del 12, ins 15, ins 20, del 31 ⇒ conf = 4/8 Doesn't overpenalize for rare drops – Eg, with [5,5] Penalize large gaps – Eg, [3,5] with gap of 6 vs. 1000

11 Approx Sequential Dependency How to compute OPS for g=[G1,G2]? dcost(d) = # of gaps in g (or ∞) needed to span a difference of d – Eg, [4,6]: dcost(6) = 1, dcost(7) = ∞, dcost(8) = 2 – ⌈d/G2⌉ when G1·⌈d/G2⌉ ≤ d; else ∞ Let T(i) := OPS made to the prefix a_1, …, a_i with a_i kept Suppose T(1), T(2), …, T(i-1) already computed T(i) = min_j { T(j) + (i-1-j) + [dcost(a_i - a_j) - 1] } – O(G2/(G2-G1) · N log N) algorithm
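
The recurrence on this slide can be sketched as a plain quadratic-time dynamic program. This is an illustrative sketch only, not the faster algorithm the slide cites: the names `dcost`, `min_ops` and `confidence` and the sample sequence are assumptions, and the code assumes a closed gap interval with 0 ≤ G1 ≤ G2 and G2 > 0.

```python
import math

def dcost(d, g1, g2):
    """Number of gaps in [g1, g2] needed to span a difference d, or inf if impossible."""
    if d < 0:
        return math.inf
    k = max(1, math.ceil(d / g2))  # fewest gaps whose maximum total reaches d
    return k if k * g1 <= d else math.inf

def min_ops(a, g1, g2):
    """Minimum insertions plus deletions so that all consecutive gaps lie in [g1, g2]."""
    n = len(a)
    if n == 0:
        return 0
    T = [0] * n  # T[i]: min ops on a[0..i] given that a[i] is kept
    for i in range(n):
        best = i  # option: delete every element before a[i]
        for j in range(i):
            steps = dcost(a[i] - a[j], g1, g2)
            if steps < math.inf:
                # keep a[j], delete the i-1-j elements in between, insert steps-1 values
                best = min(best, T[j] + (i - 1 - j) + (steps - 1))
        T[i] = best
    # the last kept element can be any a[i]; everything after it is deleted
    return min(T[i] + (n - 1 - i) for i in range(n))

def confidence(a, g1, g2):
    return (len(a) - min_ops(a, g1, g2)) / len(a)

# Hypothetical sequence: deleting 12 leaves gaps 5, 6, 6, all inside [4, 6].
seq = [0, 5, 11, 12, 17]
print(min_ops(seq, 4, 6), confidence(seq, 4, 6))  # 1 0.8
```

The double loop costs O(N²) time, which is enough to check small examples by hand.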

12 Tableau Discovery Assume underlying SD given – Data often suggest ordering semantics Good tableau = small set of intervals – Each interval satisfies confidence threshold – Union satisfies support threshold Find maximal time intervals [i,j] s.t. – Confidence satisfied in [i,j] Can we do better than testing all [i,j]’s?

13 Tableau Discovery: Candidates Relax constraint: confidence ≥ ĉ/(1+ε) For any interval I, exists J s.t. (a) I ⊆ J and (b) |J| ≤ (1+ε)|I| ⇒ conf(J) ≥ conf(I)/(1+ε)

14 Tableau Discovery: Candidates Test just enough intervals: (a) lengths 1, (1+δ), (1+δ)², … (b) starting points δ, δ(1+δ), δ(1+δ)², …

15 Tableau Discovery: Candidates Processing cost: – Intervals at level h have length (1+δ)^h – N/(δ(1+δ)^h) intervals at level h – log_(1+δ) N total levels – ⇒ sum of lengths = O((N/δ) log_(1+δ) N) = O((N/δ²) lg N) Improvement: – Interval lengths in [A,2A] start at δA,2δA,3δA,… – Prefix property
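
The geometric family of candidate intervals can be sketched as follows. This is a rough illustration under assumptions: positions are integers 0..N-1, lengths and spacings are rounded down to at least 1, δ > 0, and the name `candidate_intervals` is hypothetical. The rounding means the sketch does not reproduce the exact guarantees stated on the slides.

```python
def candidate_intervals(n, delta):
    """Return sorted half-open intervals (start, end) over positions 0..n-1.

    Level h uses length about (1+delta)**h and start points spaced about
    delta*(1+delta)**h apart, as in the candidate-generation slides.
    """
    found = set()
    length = 1.0
    while length <= n:
        size = max(1, int(length))          # interval length at this level
        step = max(1, int(delta * length))  # spacing between start points
        for start in range(0, n, step):
            found.add((start, min(n, start + size)))  # clip at the right edge
        length *= 1 + delta
    return sorted(found)

cands = candidate_intervals(100, 0.5)
all_pairs = 100 * 101 // 2  # number of intervals an exhaustive search would test
print(len(cands), "candidates instead of", all_pairs)
```

Only these candidates need a confidence test, rather than every [i,j] pair.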

16 Tableau Discovery: Assembly Optimal solution in quadratic time Greedy partial set cover Can implement in linear time Constant performance ratio
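
Greedy partial set cover over candidate intervals can be sketched as below. This is a simple illustration, not the linear-time implementation the slide mentions: the function name `greedy_tableau`, the half-open interval encoding, and the sample intervals are assumptions.

```python
import math

def greedy_tableau(intervals, n, support):
    """Pick intervals greedily until at least support*n positions are covered.

    intervals are half-open (start, end) pairs that already meet the
    confidence threshold; n is the number of positions.
    """
    need = math.ceil(support * n)
    covered = set()
    chosen = []
    remaining = list(intervals)
    while len(covered) < need and remaining:
        # choose the interval that covers the most not-yet-covered positions
        best = max(remaining, key=lambda iv: len(set(range(iv[0], iv[1])) - covered))
        gain = set(range(best[0], best[1])) - covered
        if not gain:
            break  # nothing left can increase coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, len(covered) / n

# Hypothetical qualifying intervals over 10 positions, support threshold 0.5.
tableau, coverage = greedy_tableau([(0, 4), (2, 6), (5, 9), (8, 10)], 10, 0.5)
print(tableau, coverage)  # [(0, 4), (5, 9)] 0.8
```

The loop stops as soon as the support threshold is met, so the tableau stays small.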

17 Summary of Results Tableau almost identical at small δ Significant speedup at small δ “Inflating” ĉ to (1+δ)ĉ works well

18 Experiments: Sample Tableau Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05

19 Experiments: Tableau Size Gaps in [0,∞) Gaps in [0,5] DowJones data: support ≥ 0.5

20 Experiments: Scalability Gaps in [0,∞) Gaps in [4,6] Network data: support ≥ 0.5, conf ≥ 0.99 WeatherDates: support ≥ 0.5, conf ≥ 0.9

21 Case Study: Polled Data conf ≥ 0.995, support ≥ 0.5, δ=0.05

22 Case Study: Stock Data conf ≥ 0.995, support ≥ 0.5, δ=0.05 Dow Jones 2-week moving average

23 Conclusions Constraint-driven approach – Define, discover, detect Use whatever semantics available – Domain knowledge, expectation, etc. Model errors carefully – Confidence measure Tableaux useful for summary

24 The End

25 Background Functional Dependency – X → Y : ∀ t1, t2: t1[X] = t2[X] ⇒ t1[Y] = t2[Y] Example – title → salary – What happens when data merged?

26 Background
ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50
title → salary
ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35
[title,company] → salary? – 100% support, 80% confidence
Hold Tableau
title|company|salary
*|ATT|*
60% support, 100% confidence
Fail Tableau
title|company|salary
*|IBM|*
40% support, 50% confidence
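
The support and confidence figures on this slide can be reproduced with a short script. This is a sketch under one assumption: confidence is taken as the largest fraction of rows that can be kept without violating the dependency, which here means keeping the most common salary in each [title, company] group. The names `fd_confidence` and `support` are hypothetical.

```python
from collections import Counter, defaultdict

ROWS = [
    {"ssn": "123", "name": "alice", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "456", "name": "bob", "company": "ATT", "title": "sales", "salary": 40},
    {"ssn": "789", "name": "cathy", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "012", "name": "david", "company": "IBM", "title": "engineer", "salary": 30},
    {"ssn": "345", "name": "emily", "company": "IBM", "title": "engineer", "salary": 35},
]

def fd_confidence(rows, lhs, rhs):
    """Fraction of rows kept when each lhs group keeps only its most common rhs value."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)

def support(subset, rows):
    return len(subset) / len(rows)

att = [r for r in ROWS if r["company"] == "ATT"]
ibm = [r for r in ROWS if r["company"] == "IBM"]
lhs = ["title", "company"]

print(fd_confidence(ROWS, lhs, "salary"))                     # 0.8
print(support(att, ROWS), fd_confidence(att, lhs, "salary"))  # 0.6 1.0
print(support(ibm, ROWS), fd_confidence(ibm, lhs, "salary"))  # 0.4 0.5
```

The ATT pattern passes a high confidence threshold and so belongs in the hold tableau, while the IBM pattern does not.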

27 CFD Results Given FD, discover tableau: – min tableau size – subj. to support and confidence constraints Hardness: – global conf: inapproximable – local conf: NP-hard, fast approx algo

