David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: Data Mining (and machine learning) DM Lecture 3: Basic Statistics for data miners
Overview of My Lectures
All at:
- 25/9: Overview of DM (and of these 8 lectures)
- 02/10: Data Cleaning - usually a necessary first step for large amounts of data
- 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
- 16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
- NO THURSDAY LECTURE OCTOBER 23rd
- 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
- NO THURSDAY LECTURE November 6th
- 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
- 20/11: Regression - the simplest algorithm for predicting data/class values
- 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future
Today you will see: the most important theorem in science
Statistical Data Mining
- Definitions: Population, Sample, Statistic
- Simple Statistics: Mean, Mode, Median; Range, Variance, Standard Deviation
- Probability Distributions: Normal distribution
Fundamental Statistics Definitions
- A Population is the total collection of all items/individuals/events under consideration
- A Sample is that part of a population which has been observed or selected for analysis
- E.g. all students form a population; students at HWU are a sample of it; this class is a sample, etc.
- A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean)
- The reason for computing a statistic is almost always to estimate (i.e. make a good guess about) that characteristic in the population
E.g. this class is a sample from the population of students at HWU (it can also be considered as a sample of other populations - like what?)
- One statistic of this sample is your mean weight. Suppose that is 65 kg, i.e. this is the sample mean. Is 65 kg a good estimate for the mean weight of the population?
- Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population?
Some Simple Statistics
- The Mean (average) is the sum of the values in a sample divided by the number of values
- The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest)
- The Mode is the value that appears most frequently in a sample
- The Range is the difference between the smallest and largest values in a sample
- The Variance is a measure of the dispersion of the values in a sample: how closely the observations cluster around the mean of the sample
- The Standard Deviation is the square root of the variance of a sample
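The definitions above can be tried out with a minimal sketch using Python's standard `statistics` module. The list `weights` is a made-up sample for illustration (it is not data from the lecture), and the variance here divides by n, matching the "moment" definition of variance.

```python
import statistics

# Hypothetical sample of weights in kg (illustrative values, not from the lecture).
weights = [58, 61, 61, 65, 67, 70, 73]

mean = statistics.mean(weights)            # sum of the values / number of values
median = statistics.median(weights)        # middle value once the sample is sorted
mode = statistics.mode(weights)            # most frequent value
value_range = max(weights) - min(weights)  # largest value minus smallest value
variance = statistics.pvariance(weights)   # mean squared deviation from the mean (divides by n)
std_dev = statistics.pstdev(weights)       # square root of that variance

print(mean, median, mode, value_range, variance, std_dev)
```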
Standard Deviation and other 'moments'
The m-th moment about the mean (μ) of a sample is:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^m$$
where n is the number of items in the sample and x_i is the i-th item.
- The first moment (m = 1) is 0!
- The second moment (m = 2) is the variance (and the square root of the variance is the standard deviation)
- The third moment can be used in tests for skewness
- The fourth moment can be used in tests for kurtosis
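As a rough sketch of the moments described on this slide, the function below computes the m-th moment about the mean of a small made-up sample. The standardised skewness and kurtosis at the end (dividing by powers of the second moment) follow a common convention that is an assumption here; the slide itself only says the third and fourth moments are used in such tests.

```python
def central_moment(values, m):
    """m-th moment about the mean: (1/n) * sum((x - mean) ** m)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** m for x in values) / n

# Hypothetical sample (illustrative values only); its mean is 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m1 = central_moment(data, 1)  # first moment: deviations from the mean cancel out
m2 = central_moment(data, 2)  # second moment: the variance
m3 = central_moment(data, 3)  # third moment: sign shows the direction of skew
m4 = central_moment(data, 4)  # fourth moment: sensitive to heavy tails

# Assumed convention: standardise by powers of the variance.
skewness = m3 / m2 ** 1.5     # 0 for a perfectly symmetric sample
kurtosis = m4 / m2 ** 2       # a Normal distribution has the value 3

print(m1, m2, skewness, kurtosis)
```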
Distributions / Histograms
[Figure: a Normal (aka Gaussian) distribution; image from MathWorld]
Distributions / Histograms
Uniform distributions: every possible value tends to be equally likely
Probability Distributions
- If a population is expected to match a standard probability distribution then a wealth of statistical knowledge and results can be brought to bear on its analysis
- Many standard statistical techniques are based on the assumption that the underlying distribution of a population is Normal (Gaussian)
- Statistical tests have been developed to determine whether a sampled population is normally distributed
An important aside …
- This is the standard deviation of a sample. Std is the square root of:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$$
- This is slightly different, called the sample standard deviation. Sample Std is the square root of:
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2$$
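The difference between the two quantities on this slide can be sketched as follows, assuming the usual convention that the first divides the summed squared deviations by n and the second by n - 1. The data are made up for illustration; Python's `statistics.pstdev` and `statistics.stdev` are used only as a cross-check.

```python
import math
import statistics

def std_divide_by_n(values):
    """Standard deviation of a sample: divide the squared deviations by n."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def std_divide_by_n_minus_1(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / (n - 1))

# Hypothetical sample (illustrative values only).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

std = std_divide_by_n(data)                 # matches statistics.pstdev(data)
sample_std = std_divide_by_n_minus_1(data)  # matches statistics.stdev(data)

print(std, sample_std)
```

The n - 1 version is always slightly larger, and the gap shrinks as the sample grows.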
A closer look at the normal distribution
This is the Normal distribution (ND) with mean μ and std σ
More than just a pretty bell shape
- Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12
- Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std
- So, we can say with some confidence, for example, that 99.7% of the population lies between 1.44 and 2.16 (i.e. within 3 stds of the mean: 1.8 ± 3 × 0.12)
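The arithmetic behind the 99.7% statement can be checked with a short sketch, assuming (as the slide does) that the population is Normal and that the sample mean and std stand in for the population values. `statistics.NormalDist` is part of the Python standard library (3.8+).

```python
from statistics import NormalDist

sample_mean = 1.8   # from the slide
sample_std = 0.12   # from the slide

# Interval covering 3 standard deviations either side of the mean.
low = sample_mean - 3 * sample_std
high = sample_mean + 3 * sample_std

# Fraction of a Normal population expected to fall inside that interval.
dist = NormalDist(mu=sample_mean, sigma=sample_std)
coverage = dist.cdf(high) - dist.cdf(low)

print(round(low, 2), round(high, 2), round(coverage, 4))
```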
Date        Sales     Returns   Net income
23rd Nov    £25,609   £1,003    £24,
th Nov      £26,202   £1,601    £24,
th Nov      £28,936   £1,178    £25,758
The Central Limit Theorem
Sir Francis Galton (Natural Inheritance, 1889) described the Central Limit Theorem as:
“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.”
(from the Wikipedia article)
[Figure from the Wikipedia article]
The more tosses of the coin in each experiment, the closer the distribution of the number of heads is to a Normal distribution.
The same holds for:
- the distribution of the sum of two dice
- distributions of heights, weights, hours watching TV, etc.
The Central Limit Theorem is this:
- As more and more samples are taken from a population, the distribution of the sample means conforms to a normal distribution
- The average of the samples more and more closely approximates the average of the entire population
A very powerful and useful theorem
The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it, and to test for divergence from it due to skewness and kurtosis
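A small simulation can make the theorem concrete. This is only a sketch under assumed settings: the population is uniform on [0, 1) (clearly not bell-shaped), and `SAMPLE_SIZE`, `NUM_SAMPLES` and the random seed are arbitrary choices rather than anything from the lecture.

```python
import random
from statistics import mean, pstdev

random.seed(0)        # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30      # values per sample (assumed)
NUM_SAMPLES = 2000    # number of samples drawn (assumed)

# Draw many samples from a flat (uniform) population and record each sample mean.
sample_means = [
    mean([random.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = mean(sample_means)   # should sit close to the population mean, 0.5
std_of_means = pstdev(sample_means)  # spread of the sample means

# For a Normal distribution, roughly 95% of values lie within 2 stds of the mean.
within_2_std = sum(
    1 for m in sample_means if abs(m - mean_of_means) <= 2 * std_of_means
) / NUM_SAMPLES

print(mean_of_means, std_of_means, within_2_std)
```

Even though the individual values are uniformly distributed, the sample means cluster around 0.5 in a bell shape, and about 95% of them fall within two standard deviations of their average.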
The CLT helps us make the guesses reasonable rather than crazy.
- Assuming a normal distribution, the stats of a sample tell us lots about the stats of the population
- Remember, MUCH of science relies on making guesses about populations
- And, assuming a normal distribution helps us detect errors and outliers - how?
Testing for Normality: the χ² goodness-of-fit test
- This is the classic test of whether a data sample is normally distributed or not
- We first group our data into k classes so that we can form a frequency distribution (the number of data items in each class)
- We calculate the mean and standard deviation of our sample and define a normal distribution based on these values
- We now need to see if the number of data items in each of our classes matches the number predicted by the normal distribution
The normal distribution - with mean μ and std σ
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
This tells you how to calculate the probability (frequency) for any value x
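The density of the normal distribution is easy to evaluate directly. The sketch below assumes the standard Gaussian density formula and cross-checks it against `statistics.NormalDist.pdf` from the Python standard library; the example mean and std (1.8 and 0.12) are reused from the earlier slide purely for illustration. Strictly, the value returned is a density: the probability of falling in a narrow class is approximately the density times the class width.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

peak_standard = normal_pdf(0.0, 0.0, 1.0)   # height of the standard bell curve at its centre
density_at_1_9 = normal_pdf(1.9, 1.8, 0.12)  # density at x = 1.9 for mean 1.8, std 0.12

# Approximate probability of a value landing in a class of width 0.02 centred on 1.9.
approx_prob = density_at_1_9 * 0.02

print(peak_standard, density_at_1_9, approx_prob)
```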
The goodness-of-fit test simply measures the difference between the bars (the observed class counts) and the curve (the counts the normal distribution predicts), adding up the squared difference for each bar, each divided by the predicted count.
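The procedure described on the last few slides can be sketched as below: group the data into classes, fit a normal distribution from the sample mean and std, and add up (observed - expected)² / expected over the classes. The function name `chi_square_normal_fit`, the class boundaries in `edges`, and the two generated data sets are all illustrative assumptions, not part of the lecture. The sketch stops at the statistic; deciding significance would mean comparing it with a χ² critical value (commonly with k - 3 degrees of freedom when the mean and std are both estimated from the data).

```python
import bisect
import random
from statistics import NormalDist, mean, pstdev

def chi_square_normal_fit(values, edges):
    """Chi-square statistic for how well `values` match a fitted normal curve.

    `edges` are sorted class boundaries; the classes are
    (-inf, e0], (e0, e1], ..., (e_last, +inf).
    """
    n = len(values)
    fitted = NormalDist(mean(values), pstdev(values))

    # Observed count in each class (the "bars").
    observed = [0] * (len(edges) + 1)
    for v in values:
        observed[bisect.bisect_left(edges, v)] += 1

    # Expected count in each class under the fitted normal (the "curve").
    cumulative = [0.0] + [fitted.cdf(e) for e in edges] + [1.0]
    expected = [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]

    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

random.seed(42)
edges = [30, 40, 45, 50, 55, 60, 70]  # assumed boundaries, giving k = 8 classes

normal_data = [random.gauss(50, 10) for _ in range(5000)]
uniform_data = [random.uniform(20, 80) for _ in range(5000)]

stat_normal = chi_square_normal_fit(normal_data, edges)    # small: bars follow the curve
stat_uniform = chi_square_normal_fit(uniform_data, edges)  # large: bars are too flat

print(stat_normal, stat_uniform)
```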
We can also test for skewness and kurtosis, using higher order moments
The take-home lesson (for those new to statistics)
Your data contains 100 values for x, and you have good reason to believe that x is normally distributed. Thanks to the Central Limit Theorem, you can:
- Make a lot of good estimates about the statistics of the population
- Find outliers and spot other problems in the data
It's better to test for Normality though, and also test for skewness and kurtosis, so that you can say: “probably around 0.3% of people use their mobile for >8 hrs per day, although the sample is somewhat skewed to the left so this may be an underestimate …”
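One possible way to "find outliers" under the normality assumption is to flag values lying more than 3 standard deviations from the mean, since only about 0.3% of a Normal population is expected out there. The sketch below is one such rule of thumb, not necessarily the approach intended in the lecture; the function name `three_sigma_outliers` and the generated data (99 plausible values plus one deliberately implausible entry) are illustrative assumptions.

```python
import random
from statistics import mean, stdev

def three_sigma_outliers(values):
    """Return the values lying more than 3 sample stds from the sample mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > 3 * sigma]

random.seed(1)

# 99 hypothetical measurements around 1.8 with std 0.12, plus one suspicious entry.
measurements = [random.gauss(1.8, 0.12) for _ in range(99)]
measurements.append(3.5)  # deliberately implausible value, e.g. a data-entry error

outliers = three_sigma_outliers(measurements)
print(outliers)
```

Note that an extreme value inflates the std it is judged against, so with very small samples this rule can miss outliers entirely.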
Next week: an actual Data Mining Algorithm!