Published by Andrea Fisher. Modified over 9 years ago.
1
Yanlei Diao, University of Massachusetts Amherst
Capturing Data Uncertainty in High-Volume Stream Processing
Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton†, Thanh Tran, Michael Zink
University of Massachusetts, Amherst
† University of California, Berkeley
2
Uncertain Data Streams
Uncertain data streams
  Environmental monitoring sensor networks
  Radio Frequency Identification (RFID) networks
  GPS systems
  Radar sensor networks
Data: incomplete, imprecise, misleading
Results: unknown quality
3
Scope of Our Problem
Data modeled as continuous random variables
  Many types of sensor data. More examples later…
High-volume data streams
  In contrast to probabilistic databases
An end-to-end solution
  Uncertainty of raw data
  Uncertainty of query processing results
4
Object Tracking and Monitoring
Mobile RFID readers
  Handheld, robot-mounted
Incomplete, noisy data
  Environmental factors
  Orientation of reading
Not directly queryable
  Raw data:
  Data needed for querying: e.g., precise object locations
5
Fire Monitoring Application
Display of solid merchandise shall not exceed 200 pounds per square foot of shelf area.
  SELECT RSTREAM(area(R.(x,y,z)^p), sum(R.weight))
  FROM R [PARTITION BY R.tag_id ROW 1]
  GROUP BY area(R.(x,y,z)^p)
  HAVING sum(R.weight) > 200
Stream schema: (time, tag_id, (x,y,z)^p)
What is the quality of the alert returned by this query?
6
Fire Monitoring Application
Alert when a flammable object is exposed to a high temperature.
  SELECT RSTREAM(R.tag_id, R.(x,y,z)^p, T.temp^p)
  FROM RFIDStream [RANGE 3 seconds] as R, TempStream [RANGE 3 seconds] as T
  WHERE object_type(R.tag_id) = 'flammable' and T.temp^p > 60°C
    and location_equals(R.(x,y,z)^p, T.(x,y,z))
What is the quality of the alert returned by this query?
Stream schemas: (time, (x,y,z), temp^p) and (time, tag_id, (x,y,z)^p)
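One way to attach a quality measure to the temperature predicate in this query is to report the probability that it holds. The sketch below assumes, purely for illustration, that the uncertain temperature is Gaussian with a known mean and standard deviation; the slide does not state a distribution, and the function name `prob_exceeds` is hypothetical.

```python
import math

def prob_exceeds(mean: float, std: float, threshold: float) -> float:
    """Return P(T > threshold) for T ~ Normal(mean, std**2).

    Assumption: the uncertain temperature attribute is Gaussian.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = (threshold - mean) / std
    # Survival function of the standard normal via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

if __name__ == "__main__":
    # A reading with mean 62°C and standard deviation 2°C, tested against 60°C.
    print(f"P(temp > 60) = {prob_exceeds(62.0, 2.0, 60.0):.3f}")
```

An alert could then carry this probability with it, or be suppressed when the probability falls below a chosen confidence level.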
7
Severe Weather Monitoring
[Figure: processing pipeline with stages Sensing, Transformation & Averaging, wireless transmission, Merging, Detection/Prediction, Task Generation]
8
High-Volume, Uncertain Raw Data
High volume: 1.66 million data items, 205 Mb/sec per radar
Uncertainty:
  Environmental noise
  Device noise
  Transmit frequency
  System clock
  Positioner
  Antenna
[Figure: raw pulse data produced by Sensing; axes are pulses (time) and gates (distance)]
9
Averaged Moment Data
[Figure: Sensing followed by Transformation & Averaging; raw data over pulses (time) and gates (distance) becomes moment data: velocity, reflectivity, …]
10
Averaged Moment Data
[Figure: Sensing followed by Transformation & Averaging; raw data over pulses (time) and gates (distance) becomes moment data: velocity, reflectivity, …]
Uncertainty: what is the effect of averaging over uncertain data?
11
Merged Data
[Figure: pipeline with stages Sensing, Transformation & Averaging, wireless transmission, Merging, Detection/Prediction]
Uncertainty: uneven distribution of data density
What is the quality of the detection result?
12
May 8, 2007
Series of low-level circulations.
NWS Tornado Warnings: 7:16pm, 7:39pm, 8:29pm
[Photos © KSWO TV and © Patrick Marsh; time labels 7:21pm, 8:15pm, 9:54pm, 11:00pm]
13
Effect of Averaging of Uncertain Data
Averaging size   Moment data size (MB)   Detection running time (sec)   Reported tornados   False negatives
40               41.49                   27                             3.75                0
60               27.68                   23                             1.5                 2.25
80               20.79                   21                             0.5                 3.25
100              16.65                   21                             0.25                3.75
500              3.42                    20                             0                   3.75
1000             1.76                    20                             0                   3.75
Results from a 38-second trace at 8:10 pm on May 8, 2007. Averaging size 40 is used to represent detection results on fine-grained data.
14
Challenges
Raw data is inherently incomplete and noisy
Raw data is not directly queryable
  RFID: ; Radar: ;
High-volume raw data streams
  RFID: hundreds of readings per second per reader
  Radar: 1.66 million data items per second per radar
Sophisticated query processing
15
System Overview
[Figure: operator diagram with nodes T1, T2, T3, A1, A2, A3, A4, J1; labels: tuples w. lineage, archived tuples, confidence region, mean, variance, bounds]
16
Data Capture and Transformation
Transform raw streams into tuple streams with quantified uncertainty -- compute p(X|O):
  Output: continuous random variables X, hidden
  Input: random variables O, observed
Existing work
  Statistical machine learning
  Sensor stream cleaning and processing
Our goal: choose appropriate statistical models, optimize for high-volume streams
17
RFID Streams: Modeling
A generative model characterizes how data is generated -- p(X,O)
  X: true object location (x,y,z)
  O: boolean for RFID readings
How state of the world changes
  Object movement, reader motion
How sensing generates data from the state of the world
Reference: Probabilistic Inference over RFID Streams in Mobile Environments. T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. ICDE 2009.
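The two ingredients named on this slide, how the state changes and how sensing generates data from the state, suggest a state-space factorization of p(X,O). The formulas below are a sketch under assumptions the slide does not state: that the hidden state is first-order Markov and that each reading depends only on the current state. The time index t and horizon T are notation introduced here for illustration.

```latex
\begin{align}
p(X_{1:T}, O_{1:T}) &= p(X_1)\,\prod_{t=2}^{T} p(X_t \mid X_{t-1})\,\prod_{t=1}^{T} p(O_t \mid X_t) \\
p(X_{1:T} \mid O_{1:T}) &= \frac{p(X_{1:T}, O_{1:T})}{\int p(x_{1:T}, O_{1:T})\,dx_{1:T}}
\end{align}
```

The first product plays the role of the motion model and the second the sensing model. The second line is the conditional p(X|O) that inference has to compute, and its normalizing integral is what makes exact computation expensive for continuous X.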
18
RFID Streams: Inference
Probabilistic inference over streams -- p(X|O)
Sampling-based inference
Key to performance: using a small number of samples
              Standard sampling-based inference     Our optimizations
Accuracy      0.6 - 0.8 foot                        0.1 - 0.5 foot
Performance   0.1 reading/sec for 20 objects        > 1000 readings/sec for 20,000 objects
7 orders of magnitude improvement!
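To make "sampling-based inference" concrete, the following is a minimal sketch of a standard bootstrap particle filter on a one-dimensional toy problem. It is not the optimized technique the slide reports. The sensor model `detection_prob`, the motion noise, the reader trajectory, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random

def detection_prob(obj_x, reader_x, scale=2.0):
    """Hypothetical sensor model: read probability decays with distance (feet)."""
    return math.exp(-((obj_x - reader_x) / scale) ** 2)

def particle_filter_step(particles, reader_x, detected, rng, motion_std=0.1):
    """One predict / weight / resample step of a bootstrap particle filter."""
    # Predict: the object may drift slightly between readings.
    moved = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: likelihood of the boolean reading O given each candidate location X.
    weights = []
    for p in moved:
        pd = detection_prob(p, reader_x)
        weights.append(pd if detected else 1.0 - pd)
    if sum(weights) == 0.0:
        return moved  # degenerate case: keep the predicted particles
    # Resample in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))

def run_demo(seed=0, n_particles=1000, true_x=7.0):
    """Simulate a reader sweeping past one tag and return the location estimate."""
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, 10.0) for _ in range(n_particles)]
    for step in range(48):
        reader_x = 0.25 * step  # mobile reader sweeps from 0 to 11.75
        detected = rng.random() < detection_prob(true_x, reader_x)
        particles = particle_filter_step(particles, reader_x, detected, rng)
    return sum(particles) / len(particles)

if __name__ == "__main__":
    print(f"estimated location: {run_demo():.2f} (true location 7.0)")
```

The per-reading cost grows linearly with `n_particles`, which is why the number of samples is the main performance lever in this kind of inference.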
19
Radar Streams: Modeling
Again, a generative model p(X,O)?
  O: raw pulse data
  X: velocity, reflectivity, …
Highly complex sensing process
  Environmental noise, device noise, transmit frequency, system clock, positioner, antenna…
Extremely high volume, 1.66 million data items/sec
[Figure: raw data over pulses (time) and gates (distance)]
20
Radar Streams: Model Fitting
Make output data X observable -- p(X)
  Deterministic heuristic algorithm for O-X transformation
Fit a known model directly
  Moving Average (MA) model for p(X1, …, Xn)
Key to performance: model fitting at stream speed
  Identify sequences obeying MA at 1.66 million items/sec
[Figure: sequence X1 … X7 with noise terms E1 … E7]
21
Initial Result of MA Fitting
[Figure: MA sequence length vs. distance from radar, for MA(5)]
Dynamically decide MA sequences for averaging
Efficiently compute distribution of averaging over MA sequences
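As a rough illustration of what "distribution of averaging over MA sequences" involves, the sketch below computes the variance of the mean of n consecutive values of an MA(q) process X_t = mu + sum_j theta[j] * eps_{t-j}, with theta[0] = 1 and uncorrelated noise of variance noise_var. This is the textbook formula, not necessarily the computation used in the system; the function names are hypothetical. If the noise is additionally assumed Gaussian, the average is Gaussian with mean mu and this variance.

```python
def ma_autocovariance(theta, noise_var, lag):
    """Autocovariance at `lag` of X_t = mu + sum_j theta[j] * eps_{t-j}."""
    q = len(theta) - 1
    lag = abs(lag)
    if lag > q:
        return 0.0  # an MA(q) process is uncorrelated beyond lag q
    return noise_var * sum(theta[j] * theta[j + lag] for j in range(q - lag + 1))

def variance_of_average(theta, noise_var, n):
    """Variance of the mean of n consecutive values of the MA process."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = n * ma_autocovariance(theta, noise_var, 0)
    for k in range(1, n):
        # Each lag k appears (n - k) times on each side of the diagonal.
        total += 2 * (n - k) * ma_autocovariance(theta, noise_var, k)
    return total / n ** 2

if __name__ == "__main__":
    # White noise (MA(0)): the variance shrinks as noise_var / n.
    print(variance_of_average([1.0], 4.0, 8))
    # MA(1) with theta_1 = 1: positive correlation slows the shrinkage.
    print(variance_of_average([1.0, 1.0], 1.0, 2))
```

Because autocovariances vanish beyond lag q, the loop could stop at min(n - 1, q), which keeps the cost per averaged sequence small for short MA orders.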
22
Relational Processing under Uncertainty
A relational paradigm for data processing after initial data capture and transformation
  Support , , Aggregation
  Compute a distribution for each result, modeled as a continuous random variable
Integral-based approach [Cheng et al., SIGMOD 2003]
  Exact, but too slow for stream processing
Sampling-based approach [Ge & Zdonik, ICDE 2008]
  Speed-accuracy tradeoff?
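The speed-accuracy question for sampling can be illustrated with a small Monte Carlo sketch of a sum aggregate. It assumes independent Gaussian attributes so that the exact answer is known in closed form for comparison; the tuple values and the function name `sample_sum` are made up for illustration and do not come from the cited work.

```python
import random
import statistics

def sample_sum(params, n_samples, rng):
    """Draw samples of the sum of independent Normal(mean, std) attributes."""
    return [sum(rng.gauss(m, s) for m, s in params) for _ in range(n_samples)]

rng = random.Random(42)
# Hypothetical (mean, std) of an uncertain weight attribute for three tuples.
weights = [(50.0, 5.0), (80.0, 8.0), (60.0, 6.0)]

samples = sample_sum(weights, 20000, rng)
est_mean = statistics.fmean(samples)
est_std = statistics.stdev(samples)

# Closed form for independent Gaussians: means add, variances add.
exact_mean = sum(m for m, _ in weights)
exact_std = sum(s * s for _, s in weights) ** 0.5

if __name__ == "__main__":
    print(f"sampled: mean={est_mean:.2f}, std={est_std:.2f}")
    print(f"exact:   mean={exact_mean:.2f}, std={exact_std:.2f}")
```

The sampling error of the mean shrinks roughly as 1/sqrt(n_samples), so halving the error costs about four times the work.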
23
Research Issues
Techniques for exact derivation that are natural for continuous random variables
Approximation
  Achieving speed-accuracy tradeoff more effectively
Correlated intermediate results
  When do they occur with , , Aggregation?
Optimizations: avoid intermediate pdfs
Complex function
Lineage
…
24
Much Work Lies Ahead…
Your comments are welcome.
[Photo © KSWO TV]
25
RFID Streams: Speed vs. Accuracy
26
[Figure: MA sequence length vs. distance from radar, for MA(5) and MA(20)]
Dynamically decide MA sequences for averaging
Performance tradeoff
27
Aggregation: Speed vs. Accuracy
Algorithm     Throughput   Variance Distance [0,1]
Histogram     3382         0.083
CF (exact)    466          0
CF (approx)   10593        0.012