Cleaning and Querying Noisy Sensors
Eiman Elnahrawy and Badri Nath, Rutgers University
WSNA, September 2003
This work was supported in part by NSF grant ANI and DARPA under contract number N.

I can't rely on this sensor data anymore. It has too many problems!
– Noise
– Bias
– Missing information
– Hmm, is this a malicious sensor? Something strange, or has the sensor gone bad?

Outline
– Motivation
– General Framework
– Cleaning Noise
– Querying Noisy Sensors Statistically
– Preliminary Evaluations
– Challenges and Future Work
– Conclusion

Motivation
– "Measurements" are subject to many sources of error.
– Systematic errors -> bias (calibration) [Bychkovskiy03].
– Random errors (noise): external, uncontrollable environmental effects, hardware, inaccuracy/imprecision.
– Current technology: cheap, noisy sensors that vary in tolerance, precision, and accuracy.
– The industry focus is on even cheaper sensors -> noisier ones, with noise varying with the cost of the sensor.

So What? Uncertainty
– Interest is generally in queries over a set of noisy sensors:
  – Predicate/range queries
  – Aggregates (SUM, MIN)
  – Others
– Accumulation of error seriously affects decision-making and triggers:
  – False positives/negatives
  – Misleading answers
  – It may cost you money

Problem Definition
– Research has focused on homogeneous sensors, in-network aggregation, query languages, and optimization.
– The primitives now work fairly well; why not move on to more complex data-quality problems?
– If the collected data or query results are erroneous or misleading, why would we need such networks?
– Given any query and some user-defined confidence metrics, how do we answer the query "efficiently" over noisy sensors? What is the effect of noise on queries?

Is this a new problem?
Traditional databases:
– Data entry, transactional activity
– Clean data: no noise
– Supervised, off-line cleaning
Sensors:
– Streaming data
– Decision-making in real time
– Online cleaning and query processing
– Many resource constraints

General Framework: Two Steps
– Online cleaning. Inputs: noisy data + error models + prior knowledge. Output: uncertainty models ("clean" data).
– Queries are evaluated on the clean data (the uncertainty models).
[Architecture diagram: noisy observations from sensors, error models, and prior knowledge feed the Cleaning Module, which produces uncertainty models (posteriors); the Query Processing Module evaluates the user query against the posteriors and returns the query answer.]

– Observation: the noisy reading from the sensor.
– Prior knowledge: a random variable; the distribution of the true reading. Sources: facts, learning, using less noisy sensors as priors for noisier ones, domain experts, or a dynamic (parametric) model.
– Error model: a random variable characterizing the noise; any appropriate distribution, e.g., Gaussian. Heterogeneity -> a model per sensor type, or even per individual sensor.
– Uncertainty model (the unknown truth): a random variable with a distribution that we would like to estimate.
[Diagram: the Cleaning Module combines noisy observations, error models, and prior knowledge into uncertainty models (posteriors).]

Cleaning a Single Sensor
– Fusion using Bayes' rule: posterior = (likelihood x prior) / evidence.
– Single-attribute sensors. Example: a Gaussian prior (μ_s, σ_s²) and a Gaussian error (0, δ²) yield a Gaussian posterior (the uncertainty model).
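A minimal sketch of this conjugate update (the closed form is standard Gaussian-Gaussian Bayes; the function and variable names are illustrative, not from the paper):

import math

def clean_single(obs, prior_mean, prior_var, noise_var):
    # Gaussian-Gaussian Bayes update: returns the posterior (mean, variance).
    #   obs        -- noisy sensor reading
    #   prior_mean -- mu_s, prior mean of the true reading
    #   prior_var  -- sigma_s^2, prior variance
    #   noise_var  -- delta^2, variance of the zero-mean Gaussian error model
    k = prior_var / (prior_var + noise_var)   # gain: how much to trust the reading
    post_mean = prior_mean + k * (obs - prior_mean)
    post_var = (1.0 - k) * prior_var          # = prior_var*noise_var/(prior_var + noise_var)
    return post_mean, post_var

# Example: prior N(1000, 100), error N(0, 25), observation 1012
mean, var = clean_single(1012.0, 1000.0, 100.0, 25.0)
print(mean, math.sqrt(var))   # posterior mean 1009.6, posterior std ~4.47

The posterior mean shrinks the raw reading toward the prior; the noisier the sensor relative to the prior, the stronger the shrinkage.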

Cleaning Multi-Attribute Sensors
– Example: a Gaussian prior (μ_s, Σ_s) and a Gaussian error (0, Σ) yield a Gaussian posterior (the uncertainty model).
– The terms Σ_s[Σ_s + Σ]^{-1} and Σ^T will be computed off-line.
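A sketch of the corresponding multivariate update (the gain-matrix form below is the standard multivariate Gaussian posterior; reading the slide's off-line term as this precomputable factor is my assumption):

import numpy as np

def clean_multi(obs, prior_mean, prior_cov, noise_cov):
    # Multivariate Gaussian-Gaussian update: returns posterior (mean, covariance).
    K = prior_cov @ np.linalg.inv(prior_cov + noise_cov)   # precomputable off-line
    post_mean = prior_mean + K @ (obs - prior_mean)
    post_cov = prior_cov - K @ prior_cov
    return post_mean, post_cov

# Example: two correlated attributes (illustrative numbers)
mu_s  = np.array([20.0, 55.0])
Sig_s = np.array([[4.0, 1.5], [1.5, 9.0]])   # prior covariance
Sig_e = np.diag([1.0, 2.0])                  # error-model covariance
print(clean_multi(np.array([22.0, 50.0]), mu_s, Sig_s, Sig_e))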

Classification of Queries
– Single Source Queries (SSQ): what is the reading of sensor x?
– Set Non-Aggregate Queries (SNAQ): which sensors have at least a c% chance of satisfying a given predicate?
– Summary Aggregate Queries (SAQ): over those sensors with at least a c% chance of satisfying a given predicate, what is the value of a given aggregate (SUM, AVG, COUNT)?
– Exemplary Aggregate Queries (EAQ): as above, for MIN, MAX, etc.
[Diagram: the Query Processing Module evaluates the user query against the uncertainty models (posteriors) and returns the query answer.]

Single Source Queries
– Approach 1: output the expected value of the probability distribution.
– Approach 2: output a p% confidence interval [μ_s − ε, μ_s + ε] using Chebyshev's inequality; p is user-defined, with a default, e.g., 95%.
– Multi-attribute: first compute the marginal pdf of each attribute, then proceed as above.
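A sketch of the Chebyshev interval (the inequality P(|X − μ| ≥ kσ) ≤ 1/k² gives k = 1/√(1 − p) for coverage at least p; the function name is illustrative):

import math

def chebyshev_interval(mean, var, p=0.95):
    # Distribution-free interval with coverage >= p: solve 1/k^2 = 1 - p for k.
    k = 1.0 / math.sqrt(1.0 - p)
    eps = k * math.sqrt(var)
    return mean - eps, mean + eps

print(chebyshev_interval(1009.6, 20.0))   # (989.6, 1029.6) for the earlier posterior

Because Chebyshev makes no distributional assumption, the interval is conservative; a known Gaussian posterior would permit a much tighter one.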

Set Non-Aggregate Queries
– Output: sensor id and confidence p_i.
– Confidence = the probability of satisfying the given predicate (range R); a sensor qualifies if p_i = ∫_R p_si(t) dt is at least the user-defined confidence. The qualifying sensors {s_i} form S_R, the eligible set.
– If the readings themselves are required, compute them using the SSQ algorithms.
– Multi-attribute: compute S_R over a region rather than a single interval.
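A sketch of building the eligible set for a range predicate, assuming Gaussian posteriors so that the integral over R reduces to a cdf difference (names and numbers are illustrative):

import math

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def eligible_set(posteriors, lo, hi, c=0.9):
    # posteriors: dict of sensor_id -> (posterior mean, posterior variance)
    # Returns [(sensor_id, p_i)] with p_i = P(lo <= X_i <= hi) >= c.
    result = []
    for sid, (mean, var) in posteriors.items():
        std = math.sqrt(var)
        p_i = normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)
        if p_i >= c:
            result.append((sid, p_i))
    return result

posteriors = {1: (1009.6, 20.0), 2: (2003.1, 20.0), 3: (995.4, 20.0)}
print(eligible_set(posteriors, 990.0, 1030.0, c=0.9))   # sensor 1 qualifies; 3 falls just short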

Summary Aggregate Queries
– SUM: compute the sum of independent continuous random variables, Z = sum(s_1, s_2, ..., s_m).
– Perform a convolution of two sensors, then repeatedly fold in one more sensor from the eligible set (S_R).
– Output the expected value or a p% confidence interval of the overall sum.
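A sketch of the repeated convolution on discretized pdfs (the grid discretization is my implementation choice, not from the paper; means add directly, so only the zero-centered parts need convolving):

import numpy as np

def sum_pdf(pdfs, step):
    # Pdf of the sum of independent r.v.s by repeated discrete convolution.
    # pdfs: list of 1-D arrays, each a pdf sampled on a grid with spacing `step`.
    acc = pdfs[0]
    for pdf in pdfs[1:]:
        acc = np.convolve(acc, pdf) * step   # fold in one more sensor
    return acc

# Example: three zero-centered Gaussian posteriors sampled on [-50, 50)
step = 0.1
x = np.arange(-50.0, 50.0, step)
gauss = lambda v: np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
z = sum_pdf([gauss(20.0), gauss(20.0), gauss(20.0)], step)
print(z.sum() * step)   # ~1.0: the result is still a valid pdf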

Summary Aggregate Queries (cont.)
– COUNT: output |S_R| over the given predicate.
– AVG: output SUM/COUNT.
– Multi-attribute: compute S_R, marginalize over the aggregated attribute, then proceed as above.

Exemplary Aggregate Queries
– MIN: compute the minimum of independent continuous random variables, Z = min(s_1, s_2, ..., s_m).
– Output the expected value or a p% confidence interval.
– Other order statistics (MAX, Top-K, Min-K, median) are handled in a similar manner.
– Multi-attribute: analogous.
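For the minimum of independent r.v.s, F_min(z) = 1 − Π_i (1 − F_i(z)). A sketch that turns this into an expected value by numerically integrating the survival function (Gaussian posteriors and the truncation to [lo, hi] are my assumptions):

import math

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def expected_min(posteriors, lo, hi, n=10000):
    # E[min] via E[Z] = lo + integral over [lo, hi] of P(Z > z) dz,
    # where P(Z > z) = prod_i P(X_i > z); valid when the pdfs are
    # negligible outside [lo, hi].
    step = (hi - lo) / n
    total = 0.0
    for j in range(n):
        z = lo + (j + 0.5) * step
        surv = 1.0
        for mean, var in posteriors:          # P(all sensors exceed z)
            surv *= 1.0 - normal_cdf(z, mean, math.sqrt(var))
        total += surv * step
    return lo + total

print(expected_min([(1009.6, 20.0), (995.4, 20.0)], 900.0, 1100.0))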

Tradeoffs: "Sensors" vs. "Database"
Sensor level:
– Storage cost
– Communication cost (sending priors)
– Processing cost (computing posteriors)
– Advantage: point estimates allow in-network aggregation with error bounds
Database level:
– Zero cost at the sensors, assuming free processing and storage at the database
– The communication cost of sending priors is saved
– Exact query answers
– Disadvantage: no distributed query processing

Evaluations (Synthetic Data)
– "Unknown" true readings: 1000 sensors, drawn at random from 5 clusters; Gaussian with μ = 1000, 2000, 3000, 4000, 5000 and δ² = 100.
– Noisy (raw) data: added random Gaussian noise with μ = 0 at different noise levels.
– Posteriors (Bayesian data): the prior is the distribution of the cluster that generated the reading.
– Predicates: 500 random range queries at each noise level; the error was averaged.

Single source queries
– Metric: MSE.
– Cleaning reduces uncertainty and yields far smaller errors.
– The error is scaled down by a factor of δ_p² / (δ_p² + δ_n²), where δ_p² is the prior variance and δ_n² the noise variance.

Set non-aggregate queries (prior δ = 10)
– Metrics: precision and recall.
– Recall: the fraction of relevant objects that are retrieved.
– Precision: the fraction of retrieved objects that are relevant.
– High recall and high precision (low false negatives and false positives, respectively) are better.
– Cleaning maintained high recall and precision at the different confidence levels: 95% versus 70% for the noisy readings.

Summary aggregate queries (prior δ = 10)
– Metric: absolute error.
– More accurate priors yield smaller errors.
– SUM: the noisy readings caused four times the error.
– COUNT: an error of 2, versus 14 for the noisy data.

Challenges and Future Work
– Prototype and more evaluations on real data.
– We have just scratched the surface: other estimation techniques; other uncertainty problems (outliers, missing data, etc.); other queries; the effect of noise on queries.
– "Efficient" distributed query processing.

Challenges and Future Work (cont.)
Given a query and specific quality requirements (confidence, number of false positives/negatives), what should we do if the confidence cannot be satisfied?
– Sensors are not homogeneous.
– Change the sampling method at run time.
– Turn on "specific" sensors at run time.
– Routing.
– Up-to-date metadata about sensors' resources and characteristics.
– Cost and query optimization.

Conclusion
– Taking noise into consideration is important.
– Single-sensor fusion and statistical queries work well.
– Many open problems and future-work directions remain.

Thank You