Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan.



Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 2 Frequency Distributions and Scales

Characteristics of Distributions
Shape, Central Tendency, Variability
[Figure panels: Different Central Tendency; Different Variability]

This Lesson
Examine measures of central tendency
  Mode (Nominal)
  Median (Ordinal)
  Mean (Numerical)
Examine measures of variability (dispersion)
  Entropy (Nominal)
  Variance (Numerical), Standard Deviation
Standard scores (z-score)

Centrality/Variability Measures and Scales

The Mode (Mo; Hebrew: השכיח)
The mode of a variable is the value that is most frequent: Mo = argmax f(x)
For a categorical variable: the category that appeared most often
For grouped data: the midpoint of the most frequent interval
  Under the assumption that values are evenly distributed within the interval
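A minimal Python sketch of Mo = argmax f(x) for ungrouped data. The function name `modes` and the sample values are illustrative assumptions, not part of the slides. It returns every value that reaches the highest frequency, so a multi-modal sample yields more than one result.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(values)          # f(x) for each distinct x
    top = max(counts.values())        # the highest frequency
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical samples, for illustration only.
print(modes(["red", "blue", "red", "green"]))  # ['red']
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
```

Because only counts are compared, this works on nominal data; an empty input raises ValueError in this sketch.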

Finding the Mode: Example 1
The collection of values that a variable X took during the measurement
Mo = ? Depends on the grouping

Finding the Mode: Example 2
The mode of a grouped frequency distribution depends on the grouping
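The dependence on grouping can be sketched in Python. Everything here is an assumption made for illustration: the data are invented, the helper `grouped_mode` is hypothetical, and intervals are taken as half-open [low + k*width, low + (k+1)*width) rather than the real limits used in the lecture.

```python
from collections import Counter

def grouped_mode(values, low, width):
    """Midpoint of the most frequent interval of the given width."""
    # Index k of the interval each value falls into.
    bins = Counter((v - low) // width for v in values)
    k = max(bins, key=bins.get)       # most frequent interval
    return low + k * width + width / 2

data = [1, 2, 2, 3, 7, 8, 8, 9, 9, 10]   # hypothetical sample

print(grouped_mode(data, low=0, width=5))  # 7.5
print(grouped_mode(data, low=0, width=2))  # 9.0
```

The same sample gives a different mode under each interval width, which is the instability the slide points at.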

The Median (Mdn; Hebrew: החציון)
The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall
Requires ordering: only ordinal and numerical scales
Examples:
  0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11
  12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18)
  5, 7, 8, 8, 8, 8 ==> Mdn = ?
    One method: halfway between the first and second 8, Mdn = 8
    Another (for real limits): use linear interpolation as we did in intervals, Mdn = 7.75
      7.75 = 7.5 + (¼ * 1.0)
      7.5: the lower real limit of 8, between 7 and 8
      ¼: 1 of the four 8's is needed to reach 50% of the scores
      1.0: width of the interval containing the 8's (real limits)
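Both methods for the 5, 7, 8, 8, 8, 8 example can be sketched in Python. The function names are illustrative assumptions, and the interpolated version assumes scores measured in whole units, so that each score v has real limits v - 0.5 and v + 0.5.

```python
from collections import Counter

def median_simple(values):
    """Middle score, or the midpoint of the two middle scores."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_interpolated(values, unit=1.0):
    """Median by linear interpolation inside the real limits of a score."""
    counts = Counter(values)
    target = len(values) / 2          # scores that must lie below the median
    below = 0                         # scores below the current value
    for v in sorted(counts):
        f = counts[v]
        if below + f >= target:
            lower = v - unit / 2      # lower real limit of v
            return lower + (target - below) / f * unit
        below += f

print(median_simple([0, 8, 8, 11, 15, 16, 20]))   # 11
print(median_simple([12, 14, 15, 18, 19, 20]))    # 16.5
print(median_simple([5, 7, 8, 8, 8, 8]))          # 8.0
print(median_interpolated([5, 7, 8, 8, 8, 8]))    # 7.75
```

In the last call, two scores lie below 7.5 and three are needed, so one quarter of the unit-wide interval holding the four 8's is added to 7.5.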

The Arithmetic Mean (Hebrew: ממוצע חשבוני)
Arithmetic mean (mean, for short): the sum of all scores divided by the number of scores
"Average" is colloquial: it is not precisely defined when used, so we avoid the term

Properties of Central Tendency Measures
Mo:
  Relatively unstable between samples
  Problematic in grouped distributions
  There can be more than one: distributions that have more than one mode are sometimes called multi-modal
  For uniform distributions, all values are possible modes
  Typically used only on nominal data

Properties of Central Tendency Measures
Mean:
  Responsive to the exact value of each score
  Only interval and ratio scales
  Takes the total of the scores into account: does not ignore any value
  Sum of deviations from the mean is always zero
    Because of this, sensitive to outliers: the presence or absence of scores at extreme values
  Stable between samples, and the basis for many other statistical measures
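Two of these properties can be checked in a short Python sketch: the deviations from the mean sum to zero, and a single extreme score moves the mean far more than the median. The scores are invented for illustration.

```python
import statistics

def mean(values):
    """Arithmetic mean: sum of the scores divided by their number."""
    return sum(values) / len(values)

scores = [2, 4, 4, 5, 10]            # hypothetical sample, mean 5.0
m = mean(scores)

deviations = [x - m for x in scores]
print(sum(deviations))               # 0.0

# Add one extreme score and compare the two measures.
with_outlier = scores + [100]
print(mean(scores), mean(with_outlier))
print(statistics.median(scores), statistics.median(with_outlier))
```

Here the mean rises from 5.0 to more than 20 while the median only moves from 4 to 4.5.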

Properties of Central Tendency Measures
Median:
  Robust to extreme values
  Only cares about ordering, not the magnitude of intervals
  Often used with skewed distributions
[Figure: Mo, Mdn, and Mean marked on a distribution]

Properties of Central Tendency Measures
Contrasting Mode, Median, Mean
[Figure: Mo, Mdn, and Mean marked on a distribution]


Dispersion and Variability
Mode, Median, Mean: only give central tendencies
We need to measure the spread of the distribution
[Figure: Mo, Mdn, and Mean marked on a distribution]

Dispersion as Ranges
Range: max(X) - min(X)
Semi-Interquartile Range: half the width of the range that contains the middle 50% of the scores
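A hedged Python sketch of both range measures. It relies on `statistics.quantiles` from the standard library (Python 3.8+) with its default method; other quartile conventions exist and can give slightly different values on small samples. The function names and the data are illustrative assumptions.

```python
import statistics

def value_range(values):
    """Range: max(X) - min(X)."""
    return max(values) - min(values)

def semi_interquartile_range(values):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # three cut points
    return (q3 - q1) / 2

data = [1, 2, 3, 4, 5, 6, 7]         # hypothetical sample
print(value_range(data))             # 6
print(semi_interquartile_range(data))
```

The range reacts to a single extreme score, whereas the semi-interquartile range only looks at the middle half of the scores.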

Dispersion as Deviation
Look at dispersion as a function of the central tendency (mean)
We know that the sum of deviations from the mean is zero
But what if we look at the sum of absolute deviations?
  A smaller sum indicates more clustering of the distribution around the mean
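As a small Python illustration, two invented samples with the same mean give very different sums of absolute deviations. The function name is an assumption made for this sketch.

```python
def sum_abs_deviations(values):
    """Sum of |x - mean| over all scores."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values)

tight = [4, 5, 5, 6]      # hypothetical, mean 5, clustered
spread = [1, 3, 7, 9]     # hypothetical, mean 5, dispersed

print(sum_abs_deviations(tight))    # 2.0
print(sum_abs_deviations(spread))   # 12.0
```

Note that the raw sum also grows with the number of scores, so comparing samples of different sizes would need a division by N.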

Variance
Statisticians prefer a different way of removing the sign of the deviations than taking absolute values: squaring them
Sum of squares: shorthand for the sum of squared deviations from the mean
Normalizing the sum of squares for the size of the sample gives the variance of the distribution
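A minimal Python sketch of the two steps, assuming that "normalizing" means dividing the sum of squares by the number of scores N (the population form). The function names and the data are illustrative.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def variance(values):
    """Sum of squares divided by the number of scores (assumed: N)."""
    return sum_of_squares(values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample, mean 5
print(sum_of_squares(data))          # 32.0
print(variance(data))                # 4.0
```

Squaring weights large deviations more heavily than absolute values do, and the result is expressed in squared units of the original scores.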

Standard Deviation (std.)
Square root of the variance
Robust to sampling variation: does not change very much with new samples of the population
Perhaps the most common measure of dispersion
Std is defined for the population; the standard error for a sample is a bit different
  We ignore this for now and return to it later
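The standard library's `statistics` module offers both a population form (`pstdev`, dividing by N) and a sample form (`stdev`, dividing by N - 1). Whether this N versus N - 1 distinction is the one the lecture returns to later is an assumption here; the sketch only shows that the two forms differ. The data are invented.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample, mean 5

# Population form: square root of (sum of squares / N).
print(statistics.pstdev(data))       # 2.0

# Sample form: square root of (sum of squares / (N - 1)), slightly larger.
print(statistics.stdev(data))
```

Unlike the variance, the standard deviation is in the same units as the scores themselves.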

Standard Scores
Mean, median, etc. are robust to constant translations
  Adding V to each value is the same as adding V to the central tendency measures
We may also need to compare distributions that differ in range
For instance, what's better:
  Score of 50, when mean is 60
  Score of 60, when mean is
Can compute z-scores of the raw scores

z Scores
Key idea: express all values in units of standard deviation
This allows comparison of values from different distributions
  But only if the shapes of the distributions are similar
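A sketch of the usual z-score computation, z = (x - mean) / std, in Python. The function names, the two exams, and their numbers are all invented for illustration, and the population standard deviation is assumed.

```python
def z_score(x, mean, std):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / std

def standardize(values):
    """z-score of every value, using the population standard deviation."""
    n = len(values)
    m = sum(values) / n
    std = (sum((x - m) ** 2 for x in values) / n) ** 0.5
    return [(x - m) / std for x in values]

# Hypothetical exams: the same raw score of 70 in two distributions.
print(z_score(70, mean=60, std=10))   # 1.0
print(z_score(70, mean=65, std=2))    # 2.5

print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

The second score is further above its mean in standard-deviation units even though the raw gap is smaller. A sample in which all values are equal has std 0 and raises ZeroDivisionError in this sketch.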

Measuring Dispersion in Nominal Scales
Entropy: H = - (sum over all values X of r_X * log2(r_X))
  where r_X is the relative frequency (rel f) of the value X
Entropy of 0 means that all values X are the same: rel f = 1.0 for some value X
Entropy grows positive as values become more dispersed
  e.g., entropy of 1 means all scores are split evenly between two values
Entropy is maximal when r_X = 1/N for all values X, where N is the number of distinct values, i.e., a uniform distribution
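A Python sketch of entropy over relative frequencies. It assumes logarithms to base 2, which is consistent with the statement that an even split between two values gives an entropy of 1. The function name and the samples are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy of the relative frequencies of the values, in bits."""
    n = len(values)
    # r = c / n is the relative frequency of one distinct value.
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

print(entropy(["a", "a", "a", "a"]))   # 0.0
print(entropy(["a", "b", "a", "b"]))   # 1.0
print(entropy(["a", "b", "c", "d"]))   # 2.0
```

Only counts are used, so no ordering or arithmetic on the values is needed, which is what makes the measure usable on nominal scales.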

Normalizing Entropy
Can normalize by dividing by the maximal entropy given N
  This allows comparing the entropy of distributions of different sizes
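With base-2 logarithms, the entropy of a uniform distribution over N values is log2(N), so normalizing can be sketched as below. Treating N as the number of distinct observed values by default, and the `num_values` parameter for passing a known N, are assumptions of this example.

```python
from collections import Counter
from math import log2

def normalized_entropy(values, num_values=None):
    """Entropy divided by log2(N), giving a result between 0 and 1."""
    counts = Counter(values)
    n = len(values)
    big_n = num_values if num_values is not None else len(counts)
    if big_n < 2:
        return 0.0                    # a single possible value has no dispersion
    h = sum(-(c / n) * log2(c / n) for c in counts.values())
    return h / log2(big_n)

print(normalized_entropy(["a", "b", "a", "b"]))   # 1.0
print(normalized_entropy(["a", "b", "c", "d"]))   # 1.0
print(normalized_entropy(["a", "a", "a", "b"]))
```

Both uniform samples reach 1.0 after normalization even though their raw entropies are 1 and 2, while the uneven sample falls below 1.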