Chapter 5 The Lure of Statistics: Data Mining Using Familiar Tools Note: Included in this Slide Set is a subset of Chapter 5 material and additional material.

Presentation transcript:

Chapter 5 The Lure of Statistics: Data Mining Using Familiar Tools Note: Included in this Slide Set is a subset of Chapter 5 material and additional material from the instructor.

2 Why a Manager (or You) Needs to Know Some Basics about Statistics
- To know how to properly present information
- To know how to draw conclusions about populations based on sample information
- To know how to improve processes
- To know how to obtain reliable forecasts

3 Statistics vs Data Mining
- For statisticians, data mining has a negative connotation: searching for data to support preconceived ideas
- Statistics don’t lie, but liars use statistics!
- Statistics developed as a discipline to help scientists make sense of observations and experiments, hence the scientific method
- For statisticians, the problem has often been too little data; data mining (DM) is faced with too much data
- Many of the techniques and algorithms are shared by statisticians and data miners

4 Some Definitions
- Population (universe) is the collection of things under consideration
- Sample is a portion of the population selected for analysis
- Statistic is a summary measure computed to describe a characteristic of the sample

5 Some Definitions*
- Mean (average) is the sum of the values divided by the number of values
- Median is the midpoint of the values (50% above, 50% below) after they have been ordered from smallest to largest, or from largest to smallest
- Mode is the value that appears most frequently among all the values observed
- Range is the difference between the largest and smallest observations in the sample
* Layman’s definitions
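
The four definitions above can be checked with a few lines of Python. This is only a sketch: the `scores` list is invented for illustration and is not data from the slides.

```python
import statistics

# Hypothetical sample of scores, invented for illustration only.
scores = [72, 85, 85, 90, 64, 78, 85, 91, 70]

mean = statistics.mean(scores)            # sum of the values / number of values
median = statistics.median(scores)        # middle value once the data are sorted
mode = statistics.mode(scores)            # most frequently occurring value
value_range = max(scores) - min(scores)   # largest minus smallest observation

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", value_range)
```

With an even number of values, `statistics.median` averages the two middle values, which matches the "50% above, 50% below" description.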

6 Population and Sample
- Population: use parameters to summarize features
- Sample: use statistics to summarize features
- Inference on the population is drawn from the sample

7 Occam’s Razor – “KISS”
- William of Occam was a Franciscan monk who lived before modern statistics, the Renaissance, and the printing press
- Influential philosopher, theologian, and professor with a very simple idea:
  – Latin: Entia non sunt multiplicanda sine necessitate (literally, “entities should not be multiplied without necessity”)
  – English: The simpler explanation is the preferable one, or “Keep it simple, stupid!”

8 The Null Hypothesis
- The null hypothesis (NH) assumes that differences among observations are due simply to chance
- Example: Bush vs Kerry, where the poll’s margin of error is about 3% to 4%
- A layperson asks, “Are these percentages different?”
- A statistician asks, “If these two values were really the same, how likely is a difference this large to arise by chance?”

9 Skepticism
- Skepticism is good for both statisticians and data miners
- The goal for both is to demonstrate results that work, which means rejecting the null hypothesis
- The less the results rely on chance, the better

10 P-Values and Q-Values
- The null hypothesis can be quantified
- The p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis is true; it is not the probability that the null hypothesis itself is true
- When the null hypothesis is true, nothing is really happening; differences are due to chance
- Confidence, the complement of the p-value, is called the q-value here: if the p-value is 5%, the q-value (confidence) is 95%
- Example: Bush/Kerry… p-value of 60% or 5%
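
To make the p-value concrete, here is a minimal sketch of a one-sample z-test on a poll result. The poll share (52%) and sample size (1,000) are assumptions invented for illustration, and the test relies on the normal approximation to the binomial. The null hypothesis is that the candidate's true share is 50%, so any lead is due to chance.

```python
import math

def one_sample_proportion_test(observed_share, sample_size, null_share=0.5):
    """Return (z, two-sided p-value) for a poll share against a null share.

    Uses the normal approximation, which is reasonable for large samples.
    """
    # Standard error of the share if the null hypothesis is true.
    standard_error = math.sqrt(null_share * (1 - null_share) / sample_size)
    z = (observed_share - null_share) / standard_error
    # Two-sided tail area of the standard normal distribution beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical poll: 52% of 1,000 respondents favor one candidate.
z, p_value = one_sample_proportion_test(0.52, 1000)
confidence = 1 - p_value  # the "q-value" in the sense used on this slide

print(f"z = {z:.2f}, p-value = {p_value:.3f}, confidence = {confidence:.3f}")
```

A large p-value means a lead of this size would be unsurprising if the race were actually tied; a small one means chance alone is an unlikely explanation.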

11 Data Visualization
- Discrete data, such as products, channels, regions, and descriptions, are the main focus of data mining
- Histogram: bars show the number of times each value occurs

12 Data Visualization
- Histograms describe a single moment in time
- Data mining is often concerned with what is happening over time
- Time series analysis: choosing an appropriate time frame over which to consider the data

13 Standardized Values
- Time series charts are useful but limited: they cannot tell whether the changes over time are expected or unexpected
- We could look at a segment of the data, say a day at a time, and ask: “Is it possible that the differences seen on each day are strictly due to chance?” (the null hypothesis)
- Answer: calculate the p-value for each day
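
One way to sketch the "p-value for each day" idea is to standardize each daily count (subtract the mean, divide by the standard deviation) and convert the result to a tail probability. The `daily_counts` values are invented for illustration, and the calculation assumes the counts are roughly normally distributed.

```python
import math
import statistics

# Hypothetical daily counts, invented for illustration; day 6 looks unusual.
daily_counts = [100, 102, 98, 101, 99, 130, 100]

mean = statistics.mean(daily_counts)
stdev = statistics.stdev(daily_counts)  # sample standard deviation

# Standardized value (z-score): distance from the mean in standard deviations.
z_scores = [(count - mean) / stdev for count in daily_counts]

# Two-sided p-value under a normal assumption: the chance of a value at least
# this far from the mean if the day differs from the others only by chance.
p_values = [math.erfc(abs(z) / math.sqrt(2)) for z in z_scores]

for day, (count, z, p) in enumerate(zip(daily_counts, z_scores, p_values), start=1):
    flag = "unexpected" if p < 0.05 else "expected"
    print(f"day {day}: count={count}, z={z:+.2f}, p={p:.3f} ({flag})")
```

The 5% cutoff is a conventional choice, not something the slides prescribe.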

14 Central Limit Theorem
- As more and more samples are taken from a population, the distribution of the sample averages approaches the normal distribution, even when the population itself is not normally distributed
- The average of the sample averages comes arbitrarily close to the average of the entire population
- A normal distribution is described by its mean (average count) and its standard deviation (how tightly values cluster around the mean)
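
A small simulation can illustrate the theorem. This sketch assumes a deliberately skewed, made-up population (exponentially distributed values) and arbitrary choices for the sample size and number of samples.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (not normal at all).
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# Draw many samples and record the average of each one.
sample_size = 50
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(2_000)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# The spread of the sample averages should be close to pop_sd / sqrt(n).
expected_sd = pop_sd / math.sqrt(sample_size)

print(f"population mean {pop_mean:.3f}, mean of sample means {mean_of_means:.3f}")
print(f"sd of sample means {sd_of_means:.3f}, expected about {expected_sd:.3f}")
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the population is skewed.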

15 Different Shapes of Distributions

16 Variance and Standard Deviation
- Variance is a measure of the dispersion of a sample: the average squared distance of the observations from the mean
- Standard deviation, the square root of the variance, measures how closely the observed values cluster around the mean, in the same units as the data
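
The calculation can be written out step by step. The `values` list below is invented for illustration. The sketch shows both the population form (divide by n) and the sample form (divide by n - 1), since the slide does not say which one it intends.

```python
import math
import statistics

# Hypothetical observations, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)
squared_deviations = [(x - mean) ** 2 for x in values]

# Population variance divides by n; sample variance divides by n - 1.
population_variance = sum(squared_deviations) / len(values)
sample_variance = sum(squared_deviations) / (len(values) - 1)

# Standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(values))
assert math.isclose(sample_variance, statistics.variance(values))

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```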

17 Example: Sample Scores/Grades
1. Sort the data from highest to lowest and assign grades
2. Find the mean, median, mode, and standard deviation
3. Create a histogram for the grades
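
The three steps of this exercise might be sketched as follows. The scores from the original slide are not available, so the `scores` list and the 90/80/70/60 grade cutoffs are assumptions made only to illustrate the procedure.

```python
import statistics
from collections import Counter

# Hypothetical scores; the slide's actual data set is not reproduced here.
scores = [88, 92, 75, 64, 81, 97, 58, 75, 83, 70]

def letter_grade(score):
    """Assign a letter grade using assumed cutoffs of 90/80/70/60."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

# Step 1: sort from highest to lowest and assign grades.
ranked = sorted(scores, reverse=True)
grades = [letter_grade(score) for score in ranked]

# Step 2: summary statistics for the scores.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),
}

# Step 3: a text histogram of how often each grade occurs.
grade_counts = Counter(grades)
for letter in "ABCDF":
    print(f"{letter} | {'#' * grade_counts[letter]}")
print(summary)
```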

18 Using MS Excel…

19 Using MS Excel…

20 End of Chapter 5