1
Geometric Approach
Geometric Interpretation:
Each node holds a statistics vector
Coloring the vector space:
Grey: function > threshold
White: function <= threshold
Goal: determine the color of the global data vector (the average).
12/29/2017
2
Bounding the Convex Hull
Observation: the average is in the convex hull
If the convex hull is monochromatic, then so is the average
But the convex hull may become large
3
Drift Vectors
Periodically calculate an estimate vector: the current global average
Each node maintains a drift vector: the estimate vector offset by the change in the local statistics vector since the last time the estimate vector was calculated
The global average statistics vector is also the average of the drift vectors
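The averaging property of drift vectors can be checked numerically. The sketch below assumes the usual formulation in which a node's drift vector is the estimate vector plus the local change since the last synchronization; the data values and function names are hypothetical.

```python
def mean(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def drift_vector(estimate, last_synced, current):
    """Estimate vector shifted by the local change since the last sync."""
    return [e + (c - s) for e, s, c in zip(estimate, last_synced, current)]


# Hypothetical local statistics vectors at the last synchronization.
synced = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
estimate = mean(synced)  # the estimate vector known to every node

# Hypothetical local statistics vectors some time later.
current = [[1.5, 2.0], [2.0, 1.0], [4.0, 3.0]]

drifts = [drift_vector(estimate, s, c) for s, c in zip(synced, current)]

global_average = mean(current)
average_of_drifts = mean(drifts)
```

Here `global_average` and `average_of_drifts` coincide, although no node ever sees another node's current vector.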
4
The Bounding Theorem [SIGMOD’06]
A reference point is known to all nodes
Each vertex (node) constructs a sphere whose diameter runs from the reference point to that vertex
Theorem: the convex hull is bounded by the union of the spheres
Local constraints!
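A quick empirical check of the bounding theorem, not a proof: sample random points in the convex hull of a reference point and a few drift vectors, and confirm that each sample falls inside at least one sphere. The sketch assumes each sphere has the segment from the reference point to a drift vector as its diameter; all names and data are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def in_sphere(z, x, y, eps=1e-12):
    """True if z lies in the closed ball whose diameter is the segment xy."""
    return dot(sub(x, z), sub(y, z)) <= eps


def random_convex_combination(points, rng):
    """Random point in the convex hull of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)]


rng = random.Random(0)
reference = [0.0, 0.0, 0.0]
drifts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

samples = [random_convex_combination([reference] + drifts, rng)
           for _ in range(10000)]

# Every sampled hull point should be covered by at least one sphere.
all_covered = all(any(in_sphere(z, reference, u) for u in drifts)
                  for z in samples)
```

Each node only needs the shared reference point and its own drift vector to build its sphere, which is what turns the theorem into a local constraint.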
5
Proofs of the bounding theorem:
SIGMOD’06: induction on the dimension.
Micha Sharir: induction on the number of points.
Yuri Rabinovich: uses the following observation: z is not in the sphere supported by x, y iff <x-z, y-z> > 0.
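The observation used in the last proof follows from a short expansion. As a sketch, assume "the sphere supported by x, y" means the ball with center c = (x+y)/2 and radius r = ||x-y||/2, that is, the ball with the segment xy as its diameter:

```latex
\begin{aligned}
\|z - c\|^2 - r^2
  &= \left\| z - \tfrac{x+y}{2} \right\|^2 - \left\| \tfrac{x-y}{2} \right\|^2 \\
  &= \|z\|^2 - \langle z,\, x+y \rangle
     + \tfrac{1}{4}\|x+y\|^2 - \tfrac{1}{4}\|x-y\|^2 \\
  &= \|z\|^2 - \langle z, x \rangle - \langle z, y \rangle + \langle x, y \rangle \\
  &= \langle x - z,\; y - z \rangle .
\end{aligned}
```

So z lies outside the ball exactly when the inner product is positive, which is the stated condition.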
6
Basic Algorithm
An initial estimate vector is calculated
Nodes check the color of their drift spheres
The segment from the estimate vector to the drift vector is the diameter of the drift sphere
If any sphere is not monochromatic, the node triggers re-calculation of the estimate vector
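The local check can be sketched for one simple monitored function. The example below assumes f(v) = ||v|| (the Euclidean norm), chosen only because its range over a ball has a closed form; the function names, data, and threshold are hypothetical.

```python
import math


def ball_from_diameter(p, q):
    """Center and radius of the ball whose diameter is the segment pq."""
    center = [(a + b) / 2 for a, b in zip(p, q)]
    radius = math.dist(p, q) / 2
    return center, radius


def norm_range_on_ball(center, radius):
    """Minimum and maximum of the Euclidean norm over the ball."""
    c = math.hypot(*center)
    return max(0.0, c - radius), c + radius


def sphere_is_monochromatic(estimate, drift, threshold):
    """True if f = norm is entirely > threshold or entirely <= threshold."""
    center, radius = ball_from_diameter(estimate, drift)
    lo, hi = norm_range_on_ball(center, radius)
    return lo > threshold or hi <= threshold


def nodes_needing_sync(estimate, drifts, threshold):
    """Indices of nodes whose drift sphere crosses the threshold surface."""
    return [i for i, u in enumerate(drifts)
            if not sphere_is_monochromatic(estimate, u, threshold)]


estimate = [1.0, 1.0]
drifts = [[1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
violators = nodes_needing_sync(estimate, drifts, threshold=2.0)
```

With these values only the third node's sphere straddles the threshold, so only that node would trigger a re-calculation of the estimate vector.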
7
Reuters Corpus (RCV1-v2)
800,000+ news stories
Aug Aug
Corporate/Industrial tagging
n=10: 10 nodes, random data distribution
8
Trade-off: Accuracy vs. Performance
Inefficiency: arises when the value of the function at the average is close to the threshold
Performance can be enhanced at the cost of a less accurate result:
Set an error margin around the threshold value
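One way to read the error margin is as a relaxed threshold: while the estimate is below the threshold, nodes tolerate values up to threshold + margin, and while it is above, values down to threshold - margin. The sketch below assumes that interpretation and assumes the range of the function over a node's drift sphere is already known; all names are hypothetical.

```python
def effective_threshold(threshold, margin, estimate_is_above):
    """Shift the threshold away from the side the estimate is currently on."""
    return threshold - margin if estimate_is_above else threshold + margin


def local_check_passes(f_range, threshold, margin, estimate_is_above):
    """f_range is (min, max) of the function over the node's drift sphere."""
    lo, hi = f_range
    relaxed = effective_threshold(threshold, margin, estimate_is_above)
    return lo > relaxed if estimate_is_above else hi <= relaxed


# Hypothetical node: f ranges over (1.7, 2.1) while the estimate is below 2.0.
strict = local_check_passes((1.7, 2.1), 2.0, 0.0, estimate_is_above=False)
relaxed = local_check_passes((1.7, 2.1), 2.0, 0.2, estimate_is_above=False)
```

With no margin the node must communicate; with a margin of 0.2 it stays silent, and the reported result may be off by at most that margin.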
9
Performance Analysis
10
Performance Analysis (cntd.)
Replace dist(…,f,r) with D_global
11
Balancing
Calculating the average globally is costly
It is often possible to average only some of the data vectors.
12
Shape Sensitivity [PODS’08]
Fitting cover to data
Fitting cover to threshold surface
Specific function classes
13
Fitting Cover to Data (using the covariance matrix)
14
Fitting Cover to Threshold Surface -- Reference Vector Selection
15
Distance Fields
Skeleton, Medial Axis
16
Results – Shape Sensitivity
17
Prediction-Based Geometric Monitoring [SIGMOD’12]
[Figure: drift vectors ΔV1 to ΔV5 compared with prediction deviation vectors ΔVp1 to ΔVp5 around the reference vector e^p; the trajectory v(t) and the region f(v(t)) > T]
Instead of drift vectors, which expressed the change of the local vectors since the last contact with the coordinating source, we now have prediction deviation vectors, which denote how accurate the predictions provided by the adopted estimators are. As long as the local predictors remain good, the convex hull formed by the prediction deviation vectors will be tighter and the local constraints monitored at each site will be stricter. Moreover, notice that, together with the prediction deviations, the common reference vector (e^p) changes position, following the predicted movement of v(t).
Stricter local constraints if local predictions remain accurate
Keeping up with v(t) movement
18
Let the nodes communicate only when “something happens”
Local Constraints
Safe Zones!
"Send me your current measurements!"
"Tell me only if your measurement is larger than 50!"
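The difference between the two requests can be sketched by counting messages. The readings below are hypothetical, and the limit of 50 is taken from the slide; each reading sent to the coordinator counts as one message.

```python
def messages_with_polling(readings_per_node):
    """Every node sends every measurement to the coordinator."""
    return sum(len(readings) for readings in readings_per_node)


def messages_with_safe_zone(readings_per_node, limit=50):
    """A node speaks up only when a measurement leaves its safe zone."""
    return sum(1 for readings in readings_per_node
               for value in readings if value > limit)


# Hypothetical measurements for three nodes over three time steps.
readings = [[45, 10, 44], [20, 43, 58], [15, 17, 30]]

polling_cost = messages_with_polling(readings)
safe_zone_cost = messages_with_safe_zone(readings)
```

In this toy run polling costs nine messages, while the local constraint costs one, sent only when "something happens".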
19
These Safe Zones save more communication!
Local Distributions
Reasonable to assume future data will behave similarly…
[Figure: sample data values at the nodes]
20
Optimal Safe Zones
1. Legal / Safe
2. Large: Minimize Communication
21
Example: Air quality monitoring
What are the optimal Safe Zones…?
22
The Optimization Problem
Is this Convex? Is this Linear? How many constraints are these? BAD NEWS: This problem is NP-hard.
23
The Optimization Problem
Step 3: Use non-convex optimization toolboxes (e.g. Matlab’s “fmincon”). These toolboxes use sophisticated gradient-based algorithms and often return close-to-optimal results.
24
Data Set
What the data looks like
25
Ratio Queries
Example of triangular Safe Zones
26
Improvement over convex-hull cover method
5,000 hours
Up to 200 nodes were involved in the experiment.
The average improvement was by a factor of 17.5.
Why do we improve so much?
27
Higher Dimensions
28
Chi-Square Monitoring (5D)
Examples of axis aligned boxes as Safe Zones
29
Improvement over GM
1,000 hours, 90 nodes
The improvement over the Geometric Method (GM) gets more substantial in higher dimensions.
30
Safe Zones - Example
31
Biclique: Non-Convex Safe Zones
Safe Zone Algorithm (for 2 nodes):
Take the data points
Build a bipartite graph (how?)
Find the maximal biclique
These are your Safe Zones!
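One possible answer to the "how?" is sketched below for scalar data. It assumes an edge rule the slide does not state: a point from node 1 is connected to a point from node 2 when their average stays at or below the threshold. The search is brute force and only suitable for tiny inputs; the data, names, and the choice of edge count as the size measure are all hypothetical.

```python
from itertools import combinations


def is_safe(a, b, threshold):
    # Assumed edge rule: the pair is safe if its average does not exceed
    # the threshold.
    return (a + b) / 2 <= threshold


def best_biclique(left, right, threshold):
    """Brute-force search for the biclique with the most edges."""
    best = (set(), set())
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right-side points that are safe with every chosen left point.
            common = {b for b in right
                      if all(is_safe(a, b, threshold) for a in subset)}
            if len(subset) * len(common) > len(best[0]) * len(best[1]):
                best = (set(subset), common)
    return best


left = [10, 20, 70, 90]        # hypothetical data points seen at node 1
right = [15, 30, 40, 45, 95]   # hypothetical data points seen at node 2

zone_1, zone_2 = best_biclique(left, right, threshold=50)
```

Because every point of `zone_1` is safe with every point of `zone_2`, each node can stay silent while its data remains in its own zone. Nothing forces the zones to be convex sets.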
32
Conclusions
Local filtering for large-scale distributed data systems
Saving in communication is unlimited
Bounded only by the aggregate over system lifetime
Saving bandwidth, central resources, power.
Not necessary to sacrifice precision and latency
Less communication, more privacy