David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:

Data Mining (and machine learning)
DM Lecture 3: Basic Statistics for data miners

Overview of My Lectures
All at:
25/9: Overview of DM (and of these 8 lectures)
02/10: Data Cleaning - usually a necessary first step for large amounts of data
09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
NO THURSDAY LECTURE OCTOBER 23rd
30/10: Cluster Analysis and Clustering - simple algs that tell you much about the data
NO THURSDAY LECTURE November 6th
13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
20/11: Regression - the simplest algorithm for predicting data/class values
27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

Today you will see
The most important theorem in science

Statistical Data Mining
Definitions
- Population, Sample, Statistic
Simple Statistics
- Mean, Mode, Median
- Range, Variance, Standard Deviation
Probability Distributions
- Normal distribution

Fundamental Statistics Definitions
A Population is the total collection of all items/individuals/events under consideration.
A Sample is that part of a population which has been observed or selected for analysis.
E.g. all students is a population. Students at HWU is a sample; this class is a sample, etc.
A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean).
The reason for doing this is almost always to estimate (i.e. make a good guess at) that characteristic in the population.

E.g. this class is a sample from the population of students at HWU (it can also be considered as a sample of other populations: like what?).
One statistic of this sample is your mean weight. Suppose that is 65 kg, i.e. this is the sample mean. Is 65 kg a good estimate for the mean weight of the population?
Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population?

Some Simple Statistics
The Mean (average) is the sum of the values in a sample divided by the number of values.
The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest).
The Mode is the value that appears most frequently in a sample.
The Range is the difference between the smallest and largest values in a sample.
The Variance is a measure of the dispersion of the values in a sample: how closely the observations cluster around the mean of the sample.
The Standard Deviation is the square root of the variance of a sample.
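The six definitions above can be tried out with a minimal Python sketch. The function name `simple_statistics` and the weights are made up for illustration, and the variance here divides by n (the standard library's `pvariance`), which is one of two common conventions.

```python
import statistics

def simple_statistics(values):
    """Return the six simple statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),          # sum of values / number of values
        "median": statistics.median(values),      # midpoint after ordering
        "mode": statistics.mode(values),          # most frequent value
        "range": max(values) - min(values),       # largest minus smallest
        "variance": statistics.pvariance(values), # mean squared deviation (divides by n)
        "std": statistics.pstdev(values),         # square root of that variance
    }

# Hypothetical weights in kg, chosen so the mean matches the 65 kg example.
weights = [60, 62, 65, 65, 70, 68]
stats = simple_statistics(weights)
for name, value in stats.items():
    print(f"{name}: {value}")
```

For these made-up weights the mean, median and mode all come out at 65, and the range is 10.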

Standard Deviation and other 'moments'
The m-th moment about the mean (μ) of a sample is:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^m$$
where n is the number of items in the sample.
The first moment (m = 1) is 0!
The second moment (m = 2) is the variance (and the square root of the variance is the standard deviation).
The third moment can be used in tests for skewness.
The fourth moment can be used in tests for kurtosis.
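As a quick check of the first two claims, here is a small Python sketch. The helper name `central_moment` and the sample values are illustrative assumptions, not part of the original slides.

```python
def central_moment(values, m):
    """m-th moment about the mean: (1/n) * sum((x_i - mu) ** m)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** m for x in values) / n

# Hypothetical sample with mean 5.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

first = central_moment(sample, 1)   # deviations from the mean cancel out
second = central_moment(sample, 2)  # this is the variance
std = second ** 0.5                 # standard deviation

print(first, second, std)
```

For this sample the first moment is 0, the second moment (variance) is 4 and the standard deviation is 2.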

Distributions / Histograms
A Normal (aka Gaussian) distribution (image from Mathworld)

Distributions / Histograms
Uniform distributions. Every possible value tends to be equally likely.

Probability Distributions
If a population is expected to match a standard probability distribution then a wealth of statistical knowledge and results can be brought to bear on its analysis.
Many standard statistical techniques are based on the assumption that the underlying distribution of a population is Normal (Gaussian).
Statistical tests have been developed to determine whether a sampled population is normally distributed.

An important aside …
This is the standard deviation of a sample:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$
This is slightly different, called the sample standard deviation:
$$\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2}$$
Std is the square root of the sum of squared deviations from the mean divided by n.
Sample Std is the square root of the same sum divided by n - 1.
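The difference between the two is easy to see numerically. The sketch below is illustrative only: the function names `std_n` and `std_n_minus_1` and the sample are made up, and the results are compared against Python's `statistics.pstdev` (divides by n) and `statistics.stdev` (divides by n - 1).

```python
import math
import statistics

def std_n(values):
    """Square root of the sum of squared deviations divided by n."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def std_n_minus_1(values):
    """Square root of the same sum divided by n - 1 (the 'sample std')."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / (n - 1))

# Hypothetical sample.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(std_n(sample), statistics.pstdev(sample))
print(std_n_minus_1(sample), statistics.stdev(sample))
```

The n - 1 version is always slightly larger, and the gap shrinks as the sample grows.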

A closer look at the normal distribution
This is the ND with mean mu and std sigma

More than just a pretty bell shape
Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12.
Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std.
So, we can say with some confidence, for example, that 99.7% of the population lies between 1.44 and 2.16 (the mean plus or minus three stds).
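The arithmetic behind that interval can be sketched in a few lines of Python. The function names are illustrative; the coverage figure uses the standard identity that the fraction of a Normal population within k stds of the mean is erf(k / sqrt(2)).

```python
import math

def normal_interval(mean, std, k):
    """Interval covering k standard deviations either side of the mean."""
    return mean - k * std, mean + k * std

def coverage(k):
    """Fraction of a Normal population within k stds of its mean."""
    return math.erf(k / math.sqrt(2))

low, high = normal_interval(1.8, 0.12, 3)
print(f"{low:.2f} to {high:.2f} covers {coverage(3):.1%}")
```

With the sample values from the slide, three stds is 0.36, which gives the interval from 1.44 to 2.16 and a coverage of about 99.7%.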

Date        Sales     Returns   Net income
23rd Nov    £25,609   £1,003    £24,…
…th Nov     £26,202   £1,601    £24,…
…th Nov     £28,936   £1,178    £25,758

The Central Limit Theorem
Sir Francis Galton (Natural Inheritance, 1889) described the Central Limit Theorem as:
“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.”
(from the Wikipedia article)

(from the Wikipedia article)
The more tosses of the coin in each experiment, the closer the distribution of the number of heads is to a Normal distribution.
Same with: the distribution of the sum of two dice; distributions of heights, weights, hours watching TV, etc.

The Central Limit Theorem is this:
As more and more samples are taken from a population the distribution of the sample means conforms to a normal distribution.
The average of the samples more and more closely approximates the average of the entire population.
A very powerful and useful theorem.
The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it and to test for divergence from it due to skewness and kurtosis.
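A small simulation can make the theorem concrete. This is only a sketch under assumptions of my own choosing: the population is uniform on [0, 1) (clearly not Normal), each sample holds 30 values, 2000 samples are drawn, and the seed is fixed so the run is repeatable. The name `sample_means` is illustrative.

```python
import random
import statistics

def sample_means(draw, sample_size, num_samples):
    """Mean of each of num_samples samples, every sample holding sample_size draws."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(rng.random, sample_size=30, num_samples=2000)

centre = statistics.fmean(means)   # should sit near the population mean, 0.5
spread = statistics.pstdev(means)  # should sit near sqrt(1/12) / sqrt(30)
# For a Normal distribution roughly 95% of values lie within 2 stds of the mean.
within_two = sum(abs(m - centre) <= 2 * spread for m in means) / len(means)

print(f"mean of sample means: {centre:.3f}")
print(f"std of sample means:  {spread:.4f}")
print(f"fraction within 2 std: {within_two:.3f}")
```

Even though individual values are uniformly spread, the sample means cluster around 0.5 in a bell shape, and their spread shrinks as the sample size grows.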

The CLT helps us make the guesses reasonable rather than crazy.
Assuming normal dist, the stats of a sample tell us lots about the stats of the population.
Remember, MUCH of science relies on making guesses about populations.
And, assuming normal dist helps us detect errors and outliers: how?

Testing for Normality: the χ² goodness-of-fit test
This is the classic test of whether a data sample is normally distributed or not.
We first group our data into k classes so that we can form a frequency distribution (the number of data items in each class).
We calculate the mean and standard deviation of our sample and define a normal distribution based on these values.
We now need to see if the number of data items in each of our classes matches the number predicted by the normal distribution.

The normal distribution - with mean mu and std sigma
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
This tells you how to calculate the probability density (relative frequency) for any value x.
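The Normal density is short enough to write out directly. The sketch below uses the standard formula for the Normal probability density; the function name `normal_pdf` is an illustrative choice.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and std sigma at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at the mean and is symmetric around it.
print(normal_pdf(1.8, 1.8, 0.12))
print(normal_pdf(1.68, 1.8, 0.12), normal_pdf(1.92, 1.8, 0.12))
```

Note that the value returned is a density, so it can exceed 1 when sigma is small; probabilities come from the area under the curve over a range of x.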

The goodness-of-fit test simply measures the difference between the bars and the curve: adding up the squared difference for each bar, each divided by the count the curve predicts for that bar.
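The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not a full test: the class boundaries are placed at fixed multiples of the sample std (my choice), the function names are made up, and the final comparison of the statistic against a χ² critical value is left out.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    """Probability that a Normal(mu, sigma) value is <= x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def chi_square_normal_fit(values, z_edges=(-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)):
    """Sum over classes of (observed - expected)^2 / expected."""
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    # Class boundaries: open-ended tails plus cuts at mu + z * sigma.
    bounds = [-math.inf] + [mu + z * sigma for z in z_edges] + [math.inf]
    chi2 = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(1 for v in values if lo < v <= hi)  # height of the bar
        expected = n * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))  # the curve
        chi2 += (observed - expected) ** 2 / expected
    return chi2

rng = random.Random(1)  # fixed seed so the run is repeatable
normal_data = [rng.gauss(1.8, 0.12) for _ in range(1000)]
uniform_data = [rng.uniform(0.0, 1.0) for _ in range(1000)]

chi2_normal = chi_square_normal_fit(normal_data)
chi2_uniform = chi_square_normal_fit(uniform_data)
print(chi2_normal, chi2_uniform)
```

A small statistic means the bars track the curve closely; the uniformly distributed data should produce a much larger value than the Normal data.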

We can also test for skewness and kurtosis, using higher order moments.
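One common way to turn the third and fourth moments into skewness and kurtosis measures is to standardise them by powers of the second moment. The sketch below uses that convention, which is an assumption on my part (other definitions exist); the function names and samples are illustrative.

```python
def central_moment(values, m):
    """m-th moment about the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** m for x in values) / n

def skewness(values):
    """Third moment divided by variance^1.5; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth moment divided by variance^2; about 3 for Normal data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical, evenly spread
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]  # hypothetical, long tail to the right

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed))
```

A positive skewness indicates a longer tail towards large values, a negative one a longer tail towards small values.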

The take-home lesson (for those new to statistics)
Your data contains 100 values for x, and you have good reason to believe that x is normally distributed.
Thanks to the Central Limit Theorem, you can:
- Make a lot of good estimates about the statistics of the population
- Find outliers and spot other problems in the data
It's better to test for Normality though, and also test for skewness and kurtosis, so that you can say: “probably around 0.3% of people use their mobile for >8 hrs per day, although the sample is somewhat skewed to the left so this may be an underestimate …”
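One simple way to find outliers under the Normal assumption is to flag values more than three stds from the mean, since only about 0.3% of a Normal population lies that far out. The sketch below is illustrative: the function name, the threshold default and the daily-usage data are all made up.

```python
import statistics

def find_outliers(values, k=3.0):
    """Values lying more than k sample stds away from the sample mean."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

# Hypothetical hours of mobile use per day: 30 ordinary values and one suspect entry.
hours = [2.0 + 0.1 * (i % 5) for i in range(30)] + [14.0]

outliers = find_outliers(hours)
print(outliers)
```

A flagged value is not necessarily wrong; it is a prompt to check for a data-entry error or a genuinely unusual case. Note also that an extreme value inflates the std itself, so in very small samples this rule may fail to flag anything.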

Next week: an actual Data Mining Algorithm!