Normalizing and Redistributing Variables Chapter 7 of Data Preparation for Data Mining Markus Koskela.

Introduction
All variables are assumed to have a numerical representation. Two topics are covered:
– Normalizing the range of a variable
– Normalizing the distribution of a variable (redistribution)

Part I: Normalizing variables
Variable normalization takes values that span one range and represents them in another range. The standard method is to normalize variables to [0,1]. Normalization may introduce distortions or biases into the data, so the properties and possible weaknesses of the chosen method must be understood. Depending on the modeling tool, normalizing variable ranges can be beneficial or even required.

Linear scaling transform
The first task in normalizing is to determine the minimum and maximum values of each variable. The simplest way to normalize values is then the linear scaling transform:
$$y = \frac{x - \min\{x_1, \dots, x_N\}}{\max\{x_1, \dots, x_N\} - \min\{x_1, \dots, x_N\}}$$
The transform introduces no distortion to the variable distribution and gives a one-to-one relationship between the original and normalized values.
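The linear scaling transform can be sketched in a few lines of Python. The function names and the sample values below are illustrative assumptions, not part of the original slides.

```python
def fit_linear_scaling(sample):
    """Find the minimum and maximum of the sample (the 'first task')."""
    lo, hi = min(sample), max(sample)
    if lo == hi:
        raise ValueError("variable is constant; its range is zero")
    return lo, hi


def linear_scale(x, lo, hi):
    """Map x linearly so that lo -> 0 and hi -> 1."""
    return (x - lo) / (hi - lo)


# Hypothetical sample values, for illustration only.
sample = [12.0, 15.0, 20.0, 32.0]
lo, hi = fit_linear_scaling(sample)
scaled = [linear_scale(x, lo, hi) for x in sample]

# A value that was not in the sample can fall outside [0, 1].
later_value = linear_scale(36.0, lo, hi)
```

Because the limits come from the sample, a later value such as 36.0 normalizes to a number above 1, which is the out-of-range situation discussed next.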

Out-of-range values
In data preparation, the data used is only a sample of the population. It is therefore not certain that the actual minimum and maximum values of the variable have been discovered when normalizing the ranges. Values that turn up later in the mining process and fall outside the limits discovered in the sample are called out-of-range values.

Dealing with out-of-range values
After range normalization, all variables should be in the range [0,1]. Out-of-range values, however, normalize to values such as -0.2 or 1.1, which can cause unwanted behavior.
Solution 1. Ignore that the range has been exceeded. Most modeling tools have at least some capacity to handle numbers outside the normalized range. Does this affect the quality of the model?

Dealing with out-of-range values
Solution 2. Ignore the out-of-range instances. This is used in many commercial modeling tools. One problem is that reducing the number of instances reduces the confidence that the sample represents the population. Another, potentially more severe, problem is that this approach introduces bias. Out-of-range values occur with a certain pattern, so ignoring these instances removes samples according to that pattern and distorts the sample.

Dealing with out-of-range values
Solution 3. Clip the out-of-range values. If the value is greater than 1, assign 1 to it; if it is less than 0, assign 0. This approach assumes that out-of-range values are somehow equivalent to the range limit values. The information content at the limits is therefore distorted by projecting multiple values onto a single value. This has the same problem with bias as Solution 2.
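The three solutions can be compared side by side in a minimal Python sketch. The function name and the policy labels "ignore", "drop" and "clip" are assumptions made for this example.

```python
def handle_out_of_range(y, policy="clip"):
    """Apply one of the three policies to an already normalized value y.

    Returns the (possibly modified) value, or None when the instance
    is to be discarded.
    """
    if 0.0 <= y <= 1.0:
        return y                      # in range: nothing to do
    if policy == "ignore":            # Solution 1: pass the value through
        return y
    if policy == "drop":              # Solution 2: discard the instance
        return None
    if policy == "clip":              # Solution 3: project onto the limit
        return min(1.0, max(0.0, y))
    raise ValueError(f"unknown policy: {policy!r}")


values = [-0.2, 0.4, 1.1, 1.5]
clipped = [handle_out_of_range(v, "clip") for v in values]
kept = [v for v in values if handle_out_of_range(v, "drop") is not None]
```

Clipping maps both 1.1 and 1.5 to 1.0, which shows how several distinct values collapse onto the limit, while dropping leaves only one of the four instances.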

Making room for out-of-range values
The linear scaling transform provides an undistorted normalization but suffers from out-of-range values. It should therefore be modified to accommodate values that fall outside the sample range. Most of the population lies inside the range, so for these values the normalization should remain linear. The solution is to reserve part of the normalized range for out-of-range values. The amount of reserved space depends on the confidence level of the sample:
– 98% confidence: the linear part is [0.01, 0.99]
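A short Python sketch of reserving space at both ends of the normalized range follows. It assumes that the reserved space is split equally between the two ends, so that a confidence level c leaves a margin of (1 - c) / 2 on each side; this reproduces the 98% example but is otherwise an assumption, as is the function name.

```python
def linear_scale_with_headroom(x, lo, hi, confidence=0.98):
    """Map [lo, hi] linearly onto [margin, 1 - margin].

    Assumption: the reserved space (1 - confidence) is divided
    equally between the lower and upper ends of the range.
    """
    margin = (1.0 - confidence) / 2.0
    return margin + (x - lo) / (hi - lo) * (1.0 - 2.0 * margin)


# With limits 12 and 32 found in a hypothetical sample:
low_end = linear_scale_with_headroom(12.0, 12.0, 32.0)   # about 0.01
middle = linear_scale_with_headroom(22.0, 12.0, 32.0)    # about 0.5
high_end = linear_scale_with_headroom(32.0, 12.0, 32.0)  # about 0.99
```

This only compresses the linear part. Values outside the sample limits still need a separate mapping into the reserved margins.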

Squashing the out-of-range values
The problem now is to fit the out-of-range values into the space left for them. The greater the difference between a value and the range limit, the less likely such a value is to be found. The transformation should therefore be such that, as the distance from the range limit grows, each further increase towards one (or decrease towards zero) becomes smaller. One possibility is to use functions of the form y = 1/x and attach them to the ends of the linear part.
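One way to attach 1/x-shaped tails to the linear part is sketched below in Python. The slides do not give the exact tail function, so the particular form used here, margin / (1 + d) with d the distance beyond the limit measured in units of the sample range, is an assumption chosen so that the pieces meet at the range limits.

```python
def squash_scale(x, lo, hi, margin=0.01):
    """Linear inside [lo, hi], with 1/x-shaped tails outside.

    margin is the space reserved at each end (0.01 in the 98% example).
    The tails are continuous with the linear part at lo and hi, but the
    slope is not matched there.
    """
    span = hi - lo
    if span <= 0:
        raise ValueError("hi must be greater than lo")
    if x > hi:
        d = (x - hi) / span           # distance above the upper limit
        return 1.0 - margin / (1.0 + d)
    if x < lo:
        d = (lo - x) / span           # distance below the lower limit
        return margin / (1.0 + d)
    return margin + (x - lo) / span * (1.0 - 2.0 * margin)


# Values further and further above the limit approach 1 ever more slowly.
tail = [squash_scale(x, 12.0, 32.0) for x in (32.0, 52.0, 72.0, 1e9)]
```

Each out-of-range value still gets its own normalized value, and equal steps away from the limit produce ever smaller steps towards 1 or 0.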

Softmax scaling
Carrying out the normalization in pieces is tedious, so a single function with the same properties would be useful. Softmax scaling provides this. The extent of the linear part can be controlled by one parameter. The space assigned to out-of-range values can be controlled by the level of uncertainty in the sample. Nonidentical values always have different normalized values.

The logistic function
Softmax scaling is based on the logistic function:
$$y = \frac{1}{1 + e^{-x}}$$
where y is the normalized value and x is the original value. The logistic function transforms the original range $[-\infty, \infty]$ to [0,1] and has an approximately linear part in the middle of the transform. Due to the finite word length of computers, very large positive and negative numbers are not mapped to unique normalized values.
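The logistic function and the effect of finite word length can be illustrated with a small Python sketch. The branch on the sign of x is an implementation choice made here to avoid floating-point overflow in the exponential; it is not part of the slides.

```python
import math


def logistic(x):
    """Logistic function y = 1 / (1 + e^(-x)).

    The two branches are mathematically identical; splitting them
    keeps the argument of exp() non-positive so it cannot overflow.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


centre = logistic(0.0)

# In double precision, sufficiently large inputs become indistinguishable:
saturated_high = (logistic(100.0), logistic(200.0))
saturated_low = (logistic(-800.0), logistic(-900.0))
```

Here 100 and 200 both map to exactly 1.0, and -800 and -900 both map to exactly 0.0, so distinct original values lose their distinct normalized values at the extremes.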

Modifying the linear part of the logistic function range
The values of the variable must be transformed before applying the logistic function in order to get the desired response. This is achieved with the transform
$$x' = \frac{x - \bar{x}}{\lambda \, (\sigma / 2\pi)}$$
where $\bar{x}$ is the mean of x, $\sigma$ is the standard deviation, and $\lambda$ is the size of the desired linear response. The linear part of the curve is described in terms of how many standard deviations of a normally distributed variable are to have a linear response.
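Putting the two steps together gives a sketch of softmax scaling in Python. It reads the transform as x' = (x - mean) / (lambda * (sigma / (2 * pi))) followed by the logistic function. The use of the population standard deviation, the value lam=2.0 and the sample values are assumptions made for illustration.

```python
import math
import statistics


def logistic(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax_scale(x, mean, std, lam=2.0):
    """Softmax scaling: centre and rescale x, then apply the logistic.

    lam is the size of the desired linear response, expressed in
    standard deviations (the value 2.0 is only an example).
    """
    x_t = (x - mean) / (lam * (std / (2.0 * math.pi)))
    return logistic(x_t)


# Hypothetical sample; mean and standard deviation are estimated from it.
sample = [12.0, 15.0, 20.0, 32.0]
mean = statistics.fmean(sample)
std = statistics.pstdev(sample)      # assumption: population std. dev.

scaled = [softmax_scale(x, mean, std) for x in sample]
beyond = softmax_scale(40.0, mean, std)   # outside the sample range
```

A value at the mean maps to 0.5, and a value beyond the sample maximum still lands inside [0,1] without any separate out-of-range handling.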

Part II: Redistributing variable values
(Linear) range normalization does not alter the distribution of the variables. The existing distribution may itself cause problems or difficulties for the modeling tools:
– Outlying values
– Outlying clusters
Many modeling tools assume that the distributions are normal (or uniform). Varying densities in the distribution may cause difficulties.

Adjusting distributions
The easiest way to adjust a distribution is to “spread” high-density areas until the mean density is reached.
– Results in a uniform distribution
– Can only be fully performed if none of the instance values is duplicated
Every point in the distribution is displaced by a particular distance in a particular direction. The movement required for different points can be illustrated in a displacement graph.
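A rank-based transform is one simple way to spread values to an even density; it is used here as an illustrative stand-in, since the slides do not specify the procedure. The function name, the choice to give tied values their average rank, and the sample values are assumptions.

```python
def redistribute_uniform(values):
    """Replace each value by its rank, rescaled to [0, 1].

    Distinct values end up evenly spaced. Duplicated values share the
    average of their ranks, so the result is then not fully uniform.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            out[order[k]] = avg_rank / (n - 1)
        i = j + 1
    return out


# Hypothetical skewed sample with one outlying value.
values = [1.0, 2.0, 4.0, 100.0, 8.0]
uniform = redistribute_uniform(values)

# Displacement of each point relative to plain linear scaling.
lo, hi = min(values), max(values)
displacement = [u - (v - lo) / (hi - lo) for u, v in zip(uniform, values)]
```

Plotting `displacement` against the linearly scaled values would give a displacement graph of the kind mentioned above. With duplicates, as in `[1.0, 1.0, 3.0]`, the tied values receive the same output and the spacing is no longer even.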

Modified distributions
What changes if the distribution of a variable is adjusted?
– Median values move closer to the point 0.5
– Quartile ranges move closer to their appropriate locations in a uniform distribution
– “Skewness” decreases
– May cause distortions, e.g. with monotonic variables