Data Mining Lecture 5


Course Syllabus
Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4 – Assignment 1)
Data Analysis Techniques (Week 5)
– Statistical Background
– Trends / Outliers / Normalizations
– Principal Component Analysis
– Discretization Techniques
Case Study 2: Working with and exploring the discretization infrastructure of the Retail Banking Data Mart (Week 5 – Assignment 2)
Lecture Talk: Searching/Matching Engine

The importance of Statistics
Why do we need descriptive summaries?
– Motivation: to better understand the data (central tendency, variation and spread)
– Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
– Numerical dimensions correspond to sorted intervals
  Data dispersion: analyzed with multiple granularities of precision
  Boxplot or quantile analysis on sorted intervals
– Dispersion analysis on computed measures
  Folding measures into numerical dimensions
  Boxplot or quantile analysis on the transformed cube

Remember Stats Facts
Min:
– What is the big-O cost of finding the min of an n-sized list?
Max:
– What is the minimum number of comparisons needed to find the max of an n-sized list?
Range:
– Max − Min
– What about finding min and max simultaneously?
Value Types:
– Cardinal value -> how many, counting numbers
– Nominal value -> names and identifies something
– Ordinal value -> order of things, rank, position
– Continuous value -> real number
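The questions on this slide have standard answers: a single pass finds the min (or the max) in O(n) time with n − 1 comparisons, and processing the list in pairs finds both with about 3n/2 comparisons instead of 2(n − 1). The sketch below illustrates the pairwise idea; the function name `min_max_pairwise` and the comparison counter are illustrative assumptions, not part of the lecture.

```python
def min_max_pairwise(values):
    """Return (minimum, maximum, comparisons) using the pairwise method."""
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    comparisons = 0
    if n % 2 == 1:
        # Odd length: the first element seeds both min and max for free.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: one comparison orders the first pair.
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    # Each remaining pair costs 3 comparisons: one inside the pair,
    # one against the current min, one against the current max.
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        comparisons += 3
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons


print(min_max_pairwise([7, 2, 9, 4, 1, 8]))
```

For six values the pairwise scan uses 7 comparisons, whereas two separate scans would use 10.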

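Remember Stats Facts
Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
– Weighted arithmetic mean: \( \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \)
– Trimmed mean: chopping extreme values
Median: a holistic measure
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):
\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]
  where \(L_1\) is the lower boundary of the median interval, \((\sum \text{freq})_l\) is the sum of the frequencies of all intervals below it, \(\text{freq}_{\text{median}}\) is the frequency of the median interval, and width is the interval width
Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: \( \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \)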
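As a rough illustration of these measures, the sketch below uses Python's `statistics` module plus two small helpers. The salary list (in thousands), the helper names `weighted_mean` and `trimmed_mean`, and the 10% trimming fraction are all assumptions chosen for the example.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Drop `fraction` of the values from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical salaries in thousands; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("modes:", statistics.multimode(salaries))  # two modes: bimodal
print("10% trimmed mean:", trimmed_mean(salaries, 0.1))
```

The trimmed mean sits closer to the median than the plain mean does, because the extreme value 110 is chopped off before averaging.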

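Transformation
Min-max normalization: to [new_minA, new_maxA]
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu}{\sigma} \]
– Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
– where j is the smallest integer such that max(|v'|) < 1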
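The three normalizations can be sketched as small functions. The function names are illustrative, the z-score call reuses the $73,600 income value as an assumption, and the decimal-scaling input is made-up data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


print(round(min_max(73600, 12000, 98000), 3))  # income example from the slide
print(z_score(73600, 54000, 16000))            # assumes the same income value
print(decimal_scaling([-986, 917]))            # hypothetical values
```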

The importance of Mean and Median: figuring out the shape of the distribution
– Symmetric (unimodal): mean = median = mode
– Positively skewed: mean > median > mode
– Negatively skewed: mean < median < mode

Measuring dispersion of data
Quantiles: points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets.
– The 1000-quantiles are called permillages --> Pr
– The 100-quantiles are called percentiles --> P
– The 20-quantiles are called vigintiles --> V
– The 12-quantiles are called duo-deciles --> Dd
– The 10-quantiles are called deciles --> D
– The 9-quantiles are called noniles (common in educational testing) --> NO
– The 5-quantiles are called quintiles --> QU
– The 4-quantiles are called quartiles --> Q
– The 3-quantiles are called tertiles or terciles --> T
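A small sketch of q-quantiles using `statistics.quantiles` (Python 3.8+). Note that q-quantiles give q − 1 boundary values. The wrapper name `cut_points` and the data are illustrative, and other interpolation conventions can yield slightly different boundaries.

```python
import statistics


def cut_points(values, q):
    """Return the q - 1 boundaries that split the data into q groups."""
    return statistics.quantiles(values, n=q)


data = list(range(1, 12))  # hypothetical data: 1, 2, ..., 11

quartiles = cut_points(data, 4)   # 3 boundaries
deciles = cut_points(data, 10)    # 9 boundaries

print("quartiles:", quartiles)
print("deciles:", deciles)
```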

Measuring dispersion of data
k-th standardized moment: \( \mu_k / \sigma^k \), where \( \mu_k = E[(X - \mu)^k] \) is the k-th moment about the mean
– The first standardized moment is zero, because the first moment about the mean is zero
– The second standardized moment is one, because the second moment about the mean is equal to the variance (the square of the standard deviation)
– The third standardized moment is the skewness (seen before)
– The fourth standardized moment is the kurtosis (an estimate of the peak structure)
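The standardized moments can be checked numerically. This is a minimal sketch that uses population (divide-by-n) moments; the function name and the data are assumptions made for illustration.

```python
def standardized_moment(values, k):
    """k-th moment about the mean divided by sigma**k (population form)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    central_k = sum((x - mean) ** k for x in values) / n
    return central_k / variance ** (k / 2)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with a longer right tail

for k in (1, 2, 3, 4):
    print(k, standardized_moment(data, k))
```

The first two values come out as 0 and 1 for any non-constant data, and the positive third moment reflects the longer right tail.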

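Measuring dispersion of data
Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 − Q1
– Five-number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation):
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]
– Standard deviation s (or σ) is the square root of the variance s² (or σ²)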
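The five-number summary and the 1.5 × IQR rule can be sketched as follows. The quartiles come from `statistics.quantiles` with its default method, which is one of several conventions, and the data and function names are illustrative assumptions.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(iqr_outliers(salaries))
print(statistics.variance(salaries), statistics.stdev(salaries))  # sample s^2, s
```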

Measuring dispersion of data
What is the difference between the sample variance and the standard (population) variance?
What is the use of N − 1 in the formula for s? Does it make sense?
– Bessel's correction, or degrees of freedom
– The sample error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S that it misclassifies
– The true error of a hypothesis is the probability that it will misclassify a single randomly drawn instance from the distribution D
– Estimation bias (difference between true error and sample error)
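One way to see why N − 1 makes sense is to enumerate every possible sample from a tiny population and average the two variance estimates. The sketch below does that for samples of size 2 drawn with replacement; the population and the helper name `variance` with its `ddof` parameter are illustrative assumptions.

```python
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


population = [1, 2, 3]                        # hypothetical population
true_variance = variance(population, ddof=0)  # divide by N

# Every ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)

print("population variance:", true_variance)
print("average of divide-by-n estimates:", avg_biased)
print("average of divide-by-(n-1) estimates:", avg_unbiased)
```

Dividing by n underestimates the population variance on average, while dividing by n − 1 matches it.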

Measuring dispersion of data

Chebyshev Inequality
Let X be a random variable with expected value μ and finite variance σ². Then for any real number k > 0,
\[ \Pr(|X - \mu| \ge k\sigma) \le \frac{1}{k^2} \]
Only the cases k > 1 provide useful information. Equivalently, at least \(1 - 1/k^2\) of the values lie within k standard deviations of the mean:
– At least 50% of the values are within √2 standard deviations from the mean.
– At least 75% of the values are within 2 standard deviations from the mean.
– At least 8/9 (about 88.9%) of the values are within 3 standard deviations from the mean.
– At least 15/16 (93.75%) of the values are within 4 standard deviations from the mean.
– At least 24/25 (96%) of the values are within 5 standard deviations from the mean.
– At least 35/36 (about 97.2%) of the values are within 6 standard deviations from the mean.
– At least 48/49 (about 98.0%) of the values are within 7 standard deviations from the mean.
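The guaranteed fractions follow directly from 1 − 1/k². A minimal sketch, with an illustrative function name:

```python
import math


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the bound is vacuous, so clamp it at zero.
    return max(0.0, 1 - 1 / k ** 2)


for k in (math.sqrt(2), 2, 3, 4, 5, 6, 7):
    print(f"k = {k:.3f}: at least {chebyshev_lower_bound(k):.4f}")
```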

Measuring dispersion of data
The normal (distribution) curve (μ: mean, σ: standard deviation)
– From μ − σ to μ + σ: contains about 68% of the measurements
– From μ − 2σ to μ + 2σ: contains about 95% of the measurements
– From μ − 3σ to μ + 3σ: contains about 99.7% of the measurements
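For a normal distribution, the fraction within k standard deviations of the mean equals erf(k/√2), which reproduces the 68/95/99.7 figures. The function name below is an illustrative assumption.

```python
import math


def normal_within(k):
    """Probability that a normal variable falls within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))
```

These exact normal figures are much tighter than the Chebyshev guarantees, which have to hold for any distribution with finite variance.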

Boxplot Analysis
Five-number summary of a distribution:
– Minimum, Q1, M, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to Minimum and Maximum

Histogram Analysis
Graph displays of basic statistical class descriptions
– Frequency histograms
  A univariate graphical method
  Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i% of the data are below or equal to the value x_i
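The slide does not say how f_i is computed. One common convention, assumed here, is f_i = (i − 0.5)/n for the i-th smallest value. The sketch returns the (f_i, x_i) pairs that a quantile plot would draw; the function name and the data are illustrative.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


for f, x in quantile_points([30, 10, 20, 50, 40]):
    print(f"{f:.2f} -> {x}")
```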

Scatter plot
– Provides a first look at bivariate data to see clusters of points, outliers, etc.
– Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Loess Curve
– Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
– A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Covariance
The covariance between two real-valued random variables X and Y, with expected values \(E[X] = \mu\) and \(E[Y] = \nu\), is defined as
\[ \operatorname{Cov}(X, Y) = E[(X - \mu)(Y - \nu)] \]
Understanding the correlation relationship

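Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product moment coefficient)
\[ r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]
– where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, σA and σB are the respective standard deviations of A and B (computed with n − 1), and Σ(a_i b_i) is the sum of the AB cross-product
– If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
– rA,B = 0: no linear correlation (this alone does not imply independence); rA,B < 0: negatively correlated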
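A minimal sketch of sample covariance and Pearson's r, using n − 1 throughout. The function names and the data are assumptions. The last example has a correlation of zero even though y is completely determined by x, which shows that r only measures linear association.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance scaled by the two sample standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


a = [1, 2, 3, 4, 5]
print(pearson_r(a, [2, 4, 6, 8, 10]))   # perfectly positive
print(pearson_r(a, [10, 8, 6, 4, 2]))   # perfectly negative

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]                  # y depends on x, but not linearly
print(pearson_r(x, y))
```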

Correlation Analysis (Categorical Data)
χ² (chi-square) test (we will return to this later)
\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
– The larger the χ² value, the more likely the variables are related
– The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
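The χ² statistic for a contingency table can be sketched as below, with each expected count computed as row total × column total / grand total. The function name and the 2×2 counts are hypothetical, and turning the statistic into a p-value (which needs the degrees of freedom) is left out.

```python
def chi_square(observed):
    """Chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical counts for two binary attributes.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))

# A table whose rows are proportional gives a statistic of zero.
print(chi_square([[10, 20], [20, 40]]))
```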

Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Steps
– Normalize input data: each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data vector is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing "significance" or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
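To make the steps concrete, here is a minimal two-dimensional sketch that needs only the standard library: it centers the data, builds the 2×2 covariance matrix, and uses the closed-form eigen-decomposition of a symmetric 2×2 matrix. It is not a general PCA routine. The function name `pca_2d` and the data are assumptions, and range normalization is skipped because both toy attributes already share a scale.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, components, scores on the first component)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + root, half_trace - root

    # Angle of the axis of greatest variance, and the orthogonal axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))

    # Coordinates along the first component (the reduced 1-D data).
    scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
    return (lam1, lam2), (pc1, pc2), scores


points = [(1, 1), (2, 2), (3, 3), (4, 4)]  # hypothetical, perfectly collinear
(lam1, lam2), (pc1, pc2), scores = pca_2d(points)
print("variance explained by PC1:", lam1 / (lam1 + lam2))
print("PC1 direction:", pc1)
```

Because the toy points lie on a line, the second component carries no variance and can be dropped without losing information.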

Principal Component Analysis
[Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2]

Data Reduction Method (2): Histograms
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth)
Clustering: partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
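The two partitioning rules can be sketched as follows. The function names and the price list are illustrative assumptions, and each bucket is then summarized by its average as the slide suggests.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    buckets = [[] for _ in range(k)]
    if hi == lo:
        buckets[0].extend(values)  # all values identical: one bucket
        return buckets
    width = (hi - lo) / k
    for v in values:
        # The maximum value would land at index k, so clamp it into the last bucket.
        index = min(int((v - lo) / width), k - 1)
        buckets[index].append(v)
    return buckets


def equal_frequency_bins(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical prices

width_bins = equal_width_bins(prices, 3)
depth_bins = equal_frequency_bins(prices, 3)
depth_means = [sum(b) / len(b) for b in depth_bins]

print(width_bins)
print(depth_bins)
print(depth_means)
```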

Sampling: Cluster or Stratified Sampling
[Figure: raw data compared with a cluster/stratified sample]

Discretization
Discretization:
– Divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical attributes
– Reduce data size by discretization
– Prepare for further analysis

Discretization and Concept Hierarchy
Discretization
– Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
– Interval labels can then be used to replace actual data values
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
Concept hierarchy formation
– Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
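Replacing numeric ages with higher-level concepts can be sketched with a sorted list of cut points and `bisect`. The cut points 35 and 60 are hypothetical; the slide names the concepts but not their boundaries.

```python
from bisect import bisect_right

# Hypothetical boundaries: below 35 is young, 35 to 59 is middle-aged, 60 and up is senior.
AGE_CUTS = [35, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]


def age_concept(age):
    """Map a numeric age to its interval label."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [23, 35, 47, 59, 60, 71]
print([age_concept(a) for a in ages])
```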

Week 5-End
Assignment 1 (please share your ideas with your group)
– Choose a dataset freely; my advice: asets0405.html
– Evaluate every attribute and get descriptive statistics: find mean, median, max, range, min, histogram, quartile, percentile
– Determine a missing value strategy, an erroneous value strategy, and an inconsistent value strategy
– You can freely use any statistics tool, but my advice is to use the open source Weka

Week 5-End
Read
– Course Text Book, Chapter 2
– Supplementary Book: "Machine Learning", Tom Mitchell, Chapter 5