HY436: Mobile Computing and Wireless Networks Data sanitization Tutorial: November 7, 2005 Elias Raftopoulos Ploumidis Manolis Prof. Maria Papadopouli.

Presentation transcript:

HY436: Mobile Computing and Wireless Networks Data sanitization Tutorial: November 7, 2005 Elias Raftopoulos Ploumidis Manolis Prof. Maria Papadopouli Assistant Professor Department of Computer Science University of North Carolina at Chapel Hill

Data Analysis: Discovery of Missing Values; Data Treatment; Outlier Detection; Outlier Removal [Optional]; Data Normalization [Optional]; Statistical Analysis.

Why Data Preprocessing? Data in the real world is dirty: incomplete, noisy, and inconsistent. No quality data, no quality statistical processing. Quality decisions must be based on quality data.

Data Cleaning Tasks: Handle missing values, which may be due to sensor malfunction, random disturbances, or the network protocol (e.g., UDP). Identify outliers and smooth out noisy data.

Recover Missing Values Linear Interpolation
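The slide's formula did not survive in the transcript, so the following is only a minimal sketch of linear interpolation, assuming evenly spaced samples with missing entries marked as `None`. The function name `interpolate_missing` and the choice to leave leading or trailing gaps unfilled are assumptions made for illustration.

```python
def interpolate_missing(values):
    """Fill None entries that lie between two known samples.

    Assumes evenly spaced samples. Gaps at the start or end of the
    series have only one known neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of consecutive known samples and fill the gap.
    for left, right in zip(known, known[1:]):
        rise = filled[right] - filled[left]
        span = right - left
        for i in range(left + 1, right):
            filled[i] = filled[left] + rise * (i - left) / span
    return filled


series = [10.0, None, None, 16.0, None, 20.0, None]
print(interpolate_missing(series))
```

Each missing point is placed on the straight line joining its nearest known neighbours, so the fill depends only on local values.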

Recover Missing Values: Moving Average. A simple moving average is the unweighted mean of the previous n data points in the time series. A weighted moving average is a weighted mean of the previous n data points in the time series. A weighted moving average is more responsive to recent movements than a simple moving average. An exponentially weighted moving average (EWMA or just EMA) is an exponentially weighted mean of previous data points. The parameter of an EWMA can be expressed as a proportional percentage: for example, in a 10% EWMA, each time period is assigned a weight that is 90% of the weight assigned to the next (more recent) time period.
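The three averages described above can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's own code: the function names are hypothetical, the weighted average takes its weights oldest-to-newest, and the EWMA is seeded with the first observation. With `alpha = 0.1` the EWMA corresponds to the "10% EWMA" in the text, where each older period carries 90% of the weight of the next more recent one.

```python
def simple_moving_average(series, n):
    """Unweighted mean of the last n points."""
    window = series[-n:]
    return sum(window) / len(window)


def weighted_moving_average(series, weights):
    """Weighted mean of the last len(weights) points.

    weights are ordered oldest to newest, so weights[-1] applies to
    the most recent point.
    """
    if len(series) < len(weights):
        raise ValueError("series is shorter than the weight vector")
    window = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


def ewma(series, alpha):
    """Recursive EWMA: s_t = alpha * x_t + (1 - alpha) * s_(t-1).

    Seeding with the first observation is an assumption.
    """
    smoothed = series[0]
    for x in series[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed


history = [2.0, 4.0, 6.0, 8.0]
print(simple_moving_average(history, 2))
print(weighted_moving_average(history, [1, 3]))
print(ewma(history, 0.5))
```

To recover a missing sample, any of these values computed over the points preceding the gap can serve as the estimate.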

Recover Missing Values: Moving Average (cont’d). Symmetric Linear Filters: Moving Average.
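The filter formula from this slide is not in the transcript. A common form of a symmetric moving-average filter, assumed here, averages the q points on each side of time t together with the point itself, giving a window of 2q + 1 samples. The name `centered_moving_average` and the handling of the edges (left as `None`) are illustrative choices.

```python
def centered_moving_average(series, q):
    """Symmetric filter: mean of series[t-q .. t+q] for each interior t.

    The first and last q positions lack a full window and stay None.
    """
    n = len(series)
    smoothed = [None] * n
    for t in range(q, n - q):
        window = series[t - q : t + q + 1]
        smoothed[t] = sum(window) / (2 * q + 1)
    return smoothed


print(centered_moving_average([1.0, 2.0, 6.0, 4.0, 2.0], q=1))
```

Unlike the trailing averages, this filter uses points on both sides of t, so it can only be applied once the later samples are available.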

What are outliers in the data? An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. It is left to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Outliers: An outlier is a data point that comes from a distribution different (in location, scale, or distributional form) from the bulk of the data. In the real world, outliers have a range of causes, such as operator blunders, equipment failures, day-to-day effects, batch-to-batch differences, anomalous input conditions, and warm-up effects.
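The slides leave the definition of "abnormal" to the analyst and do not name a detection rule. As one possible illustration, the sketch below uses the widely used 1.5 × IQR fence rule, which is an assumption and not something taken from the tutorial. The function name `iqr_outliers` and the multiplier default are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The quartiles characterize the 'normal' bulk of the data; points
    far outside that range are flagged for the analyst to review.
    """
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

Flagged values are candidates only; whether to remove them remains a judgment call.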

Scatter Plot: Outlier. The scatter plot here reveals a basic linear relationship between X and Y for most of the data, and a single outlier (at X = 375).

Symmetric Histogram with Outlier: A symmetric distribution is one in which the two "halves" of the histogram appear as mirror images of one another. The above example is symmetric with the exception of outlying data near Y = 4.5.

Normalization: Normalization is a process of scaling the numbers in a data set to improve the accuracy of the subsequent numeric computations. Most statistical tests and intervals are based on the assumption of normality. This leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make the normality assumption. Most real data sets are in fact not approximately normal. An appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. This increases the applicability and usefulness of statistical techniques based on the normality assumption.

Box-Cox Transformation: The Box-Cox transformation is a particularly useful family of transformations.
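The defining formula is missing from the transcript. In its standard one-parameter form, the Box-Cox transformation maps a positive value y to (y^λ − 1)/λ when λ ≠ 0 and to ln(y) when λ = 0. The sketch below implements that standard form; the function name `box_cox` is chosen for illustration.

```python
import math


def box_cox(y, lam):
    """Standard one-parameter Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        # Limiting case of (y**lam - 1) / lam as lam approaches 0.
        return math.log(y)
    return (y ** lam - 1) / lam


print(box_cox(4.0, 0.5))
print(box_cox(math.e, 0))
```

With λ = 1 the transform only shifts the data by one, so the shape of the distribution is unchanged; other values of λ compress or stretch the tails.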

Measuring Normality: Given a particular transformation such as the Box-Cox transformation defined above, it is helpful to define a measure of the normality of the resulting transformation. One measure is to compute the correlation coefficient of a normal probability plot. The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot (the more linear the probability plot, the better a normal distribution fits the data). The Box-Cox normality plot is a plot of these correlation coefficients for various values of the parameter λ. The value of λ corresponding to the maximum correlation on the plot is then the optimal choice for λ.
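The selection procedure described above can be sketched as follows: transform the data for each candidate λ, compute the correlation between the sorted transformed values and theoretical normal quantiles, and keep the λ with the highest correlation. Several details are assumptions: the plotting positions (i − 0.5)/n are one common convention among several, and the names `normal_ppcc` and `best_lambda` are hypothetical.

```python
import math
from statistics import NormalDist


def box_cox(y, lam):
    """Standard Box-Cox transform for positive y."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_ppcc(data):
    """Correlation between sorted data and theoretical normal quantiles."""
    ordered = sorted(data)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n are an assumed convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return pearson(theoretical, ordered)


def best_lambda(data, candidates):
    """Candidate lambda whose transformed data looks most normal."""
    return max(candidates, key=lambda lam: normal_ppcc([box_cox(y, lam) for y in data]))


# Synthetic right-skewed sample: exponentials of normal quantiles.
sample = [math.exp(NormalDist().inv_cdf((i - 0.5) / 20)) for i in range(1, 21)]
print(best_lambda(sample, [-1.0, -0.5, 0.0, 0.5, 1.0]))
```

Plotting `normal_ppcc` against the candidate values of λ gives the Box-Cox normality plot; the grid of candidates controls how finely the optimum is resolved.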

Measuring Normality (cont’d): The histogram in the upper left-hand corner shows a data set that has significant right skewness and so does not follow a normal distribution. The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at λ = -0.3. The histogram of the data after applying the Box-Cox transformation with λ = -0.3 shows a data set for which the normality assumption is reasonable. This is verified with a normal probability plot of the transformed data.

Normal Probability Plot: The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality. The normal probability plot is a special case of the probability plot.

Normal Probability Plot (cont’d)

CDF Plot: Plot of the empirical cumulative distribution function.
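As a brief illustration, the empirical cumulative distribution function at a point x is the fraction of observations less than or equal to x. The sketch below builds it as a callable; the name `make_ecdf` is hypothetical.

```python
from bisect import bisect_right


def make_ecdf(data):
    """Return F where F(x) is the fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


F = make_ecdf([3, 1, 2, 2])
print(F(2))
```

Evaluating `F` at each sorted observation gives the step heights needed to draw the CDF plot.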