Exploratory Data Analysis (Descriptive Statistics)


Exploratory Data Analysis (Descriptive Statistics) Martina Litschmannová martina.litschmannova@vsb.cz K210

Statistics has two major branches: descriptive statistics and inferential statistics.

Statistics Descriptive Statistics Gives numerical and graphic procedures to summarize a collection of data in a clear and understandable way. Inferential Statistics Provides procedures to draw inferences about a population from a sample.

Populations vs. Sample A population includes each element from the set of observations that can be made. A sample consists only of observations drawn from the population. (Diagram labels: population, sampling, sample, Exploratory Data Analysis, Inferential Statistics.)

Variable A variable has two defining characteristics: A variable is an attribute that describes a person, place, thing, or idea. The value of the variable can "vary" from one entity to another.

Types of Variables Qualitative variable (categorical): nominal variable (has equivalent variants) or ordinal variable (variants can be sorted). Quantitative variable (numerical).

Exploratory data analysis Statistical tools that help examine data in order to describe their main features. Basic strategy Examine variables one by one, then look at the relationships among the different variables. Start with graphs, then add numerical summaries of specific aspects of the data.

Exploratory data analysis - One variable Graphical displays Qualitative/categorical data: bar chart, pie chart, etc. Quantitative data: histogram, boxplot, etc. Summary statistics Qualitative/categorical: frequency tables. Quantitative: mean, median, standard deviation, range, etc.

EDA - qualitative variable

Summary of categorical variables Numerically: tables with total counts and percents, mode. Graphically: bar graphs, pie charts. A bar graph is nearly always preferable to a pie chart, because it is easier to compare bar heights than slices of a pie.

Statistical characteristics We summarize categorical data using a table. Note that percentages are often called relative frequencies.
Frequency table (or summary table):
Class x_i | Absolute frequency n_i | Relative frequency p_i
x_1 | n_1 | p_1 = n_1/n
x_2 | n_2 | p_2 = n_2/n
… | … | …
x_k | n_k | p_k = n_k/n
Total | n_1 + n_2 + … + n_k = n | 1
Mode: the variant that occurs most frequently.

Statistical characteristics Frequency table
Sex | Absolute frequency | Relative frequency [%]
Male | 457 | 58.2
Female | 328 | 41.8
Total | 785 | 100.0
Mode = Male
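The frequency table above can be reproduced in a few lines. This is a minimal sketch using only the counts given on the slide; the variable names are illustrative, not part of the original deck.

```python
# Absolute frequencies taken from the slide's table.
counts = {"Male": 457, "Female": 328}

# Total number of observations n.
n = sum(counts.values())

# Relative frequencies in percent, rounded to one decimal place.
relative = {variant: round(100 * c / n, 1) for variant, c in counts.items()}

# The mode is the variant with the highest absolute frequency.
mode = max(counts, key=counts.get)

print(n, relative, mode)
```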

Graphical Methods of Presenting Qualitative Variables Bar chart is a standard graph, where variants of the variable are represented on one axis and variable frequencies on the other axis. Individual values of the frequency are then displayed as bars (boxes, vectors, squared logs, cones, etc.)

Attention! A bar chart is made up of columns plotted on a graph. The columns are positioned over a label that represents a categorical variable. The height of the column indicates the size of the group defined by the column label. Attention! With three-dimensional shapes, we subjectively notice the volume rather than the height of the shape!

Graphical Methods of Presenting Qualitative Variables A pie chart represents the relative frequencies of individual variants of a variable. Frequencies are presented as proportions in a sector of a circle.

Blood type | Rh+ | Rh- | Total
O | 38 | 7 | 45
A | 34 | 6 | 40
B | 9 | 2 | 11
AB | 3 | 1 | 4
Total | 84 | 16 | 100

One-way table analysis in Excel

Statgraphics v. 5.0 Manual: http://people.duke.edu/~rnau/sgwin5.pdf

One-way table analysis in Statgraphics

EDA - quantitative variable

Quantitative variables Numerical summary: mean, median, quartiles, range, standard deviation, … Graphical summary: histogram, box plot, …

Quantitative measures When you compare two or more data sets, focus on four features: center, spread, shape, and unusual features.

Measures of Central Tendency Mean To find the mean of a set of observations, add their values and divide by the number of observations. Mean of a population: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$. Mean of a sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.

Mean example The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the average age change? If so, what is the new average age?
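One way to work the mean example is to recover the total from the mean, swap the two ages, and divide again. The sketch below does exactly that; the variable names are illustrative.

```python
n = 20          # people in the room
old_mean = 25   # average age before the swap

# The mean times the count gives the sum of all ages.
total = n * old_mean

# A 28-year-old leaves and a 30-year-old enters: the sum grows by 2.
new_total = total - 28 + 30

# The head count is unchanged, so divide by the same n.
new_mean = new_total / n
print(new_mean)
```

The mean rises by 2/20 = 0.1 years, because every observation contributes to the sum.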

Measures of Central Tendency Median The median is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger. Also called the 50th percentile or 2nd quartile. To compute a median: order the observations. If the number of observations is odd, the median is the center observation. If the number of observations is even, the median is the average of the two center observations. Notation for the median (population and sample): $x_{0.5}$.

Median example The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change? If so, what is the new median age? The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change?
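Unlike the mean, the median depends on which observations sit in the middle, so the answer differs between the two rooms. The sketch below uses hypothetical age lists (they are not from the slides) chosen only so that each room has a median of 25 and contains a 28-year-old.

```python
import statistics

def swap_28_for_30(ages):
    """Return a copy of ages with one 28 replaced by 30."""
    result = list(ages)
    result.remove(28)
    result.append(30)
    return result

# Hypothetical room of 20 people: the two center values are 22 and 28.
room_20 = [20] * 9 + [22, 28] + [31] * 9
# Hypothetical room of 21 people: the single center value is 25.
room_21 = [20] * 10 + [25, 28] + [31] * 9

before_20 = statistics.median(room_20)
after_20 = statistics.median(swap_28_for_30(room_20))
before_21 = statistics.median(room_21)
after_21 = statistics.median(swap_28_for_30(room_21))

print(before_20, after_20)  # even count: the median can change
print(before_21, after_21)  # odd count: the median stays at the center value
```

With 20 people the 28-year-old may be one of the two center observations, so the median can move (here it does); with 21 people the center observation is the person aged 25, and replacing a larger value with another larger value leaves it untouched.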

Mean vs. median When the histogram is symmetric, the mean and median are similar. The mean and median differ when the histogram is skewed. Skewed to the right: the mean is larger than the median. Skewed to the left: the mean is smaller than the median.

Mean vs. median Extreme example Income in a small town of 6 people: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000. The mean is about $31,830 and the median is $32,000. Bill Gates moves to town: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000. The mean is $5,741,571 and the median is $35,000. The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data.
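The figures in the income example can be checked directly. This is a small sketch using Python's standard `statistics` module and the incomes listed on the slide.

```python
import statistics

# Incomes of the six original residents, from the slide.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

# One extreme income is added to the town.
incomes_after = incomes + [40_000_000]

mean_after = statistics.mean(incomes_after)
median_after = statistics.median(incomes_after)

print(round(mean_before), median_before)
print(round(mean_after), median_after)
```

The mean jumps by a factor of roughly 180, while the median moves only from the average of the two center incomes to the single center income.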

Effect of Changing Units How are measures of central tendency affected when we change units (minutes to hours, feet to meters, etc.)? If you add a constant to every value, the mean and median increase by the same constant. If you multiply every value by a constant, the mean and median will also be multiplied by that constant.

Effect of Changing Units - example The average annual temperature in Prague is 10 °C. What is the average annual temperature in Prague in degrees Fahrenheit? $F = \frac{9C}{5} + 32$
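Because the conversion multiplies by 9/5 and adds 32, the mean can be converted directly. The sketch below applies the formula to the slide's 10 °C and then confirms the rule on three hypothetical temperatures (invented for illustration) whose mean is 10 °C.

```python
import statistics

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit: F = 9C/5 + 32."""
    return 9 * celsius / 5 + 32

# Converting the mean itself.
mean_f = c_to_f(10)

# Hypothetical temperatures with a mean of 10 degrees Celsius.
sample_c = [8, 10, 12]
mean_of_converted = statistics.mean(c_to_f(t) for t in sample_c)

print(mean_f, mean_of_converted)
```

Both routes give the same value, which is the point of the rule: converting every observation and then averaging is equivalent to converting the average.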

Is a central measure enough? A warm, stable climate greatly affects some individuals' health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person's health requires a stable climate, in which city would you recommend they live?

Measures of spread Range: the difference between the largest and smallest values in a set of values. Inter-quartile range: $IQR = x_{0.75} - x_{0.25}$. The lower quartile $x_{0.25}$ is the "middle" value in the first half of the rank-ordered data set. The upper quartile $x_{0.75}$ is the "middle" value in the second half of the rank-ordered data set.

Measures of spread Variance In a population, variance is the average squared deviation from the population mean, as defined by the following formula: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$. Sample variance is defined by a slightly different formula, and uses a slightly different notation: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$. Standard deviation The standard deviation looks at how far observations are from their mean. It is the square root of the variance. Population: $\sigma$. Sample: $s$.

Measures of spread - example A population consists of four observations: {1, 3, 5, 7}. What is the variance? A simple random sample consists of four observations: {1, 3, 5, 7}. Based on these sample observations, what is the best estimate of the standard deviation of the population?
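The two questions differ only in the divisor: N for a population, n - 1 for a sample. The sketch below computes both from the definitions; variable names are illustrative.

```python
import math

data = [1, 3, 5, 7]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in data)

# Treating the data as a whole population: divide by N.
population_variance = ss / n

# Treating the data as a sample: divide by n - 1, then take the root.
sample_variance = ss / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance, sample_sd)
```

The same four numbers give a variance of 5 as a population but about 6.67 as a sample, so the estimated standard deviation is about 2.58.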

Effect of Changing Units How are measures of spread affected when we change units (minutes to hours, feet to meters, etc.)? If you add a constant to every value, the distance between values does not change. As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same. Suppose you multiply every value by a constant. This has the effect of multiplying the range, interquartile range (IQR), and standard deviation by the absolute value of that constant. It has an even greater effect on the variance: it multiplies the variance by the square of the constant.

Effect of Changing Units - example The variance of the annual temperature in Prague is 0.25 (°C)². What is the variance of the annual temperature in Prague in square degrees Fahrenheit? $F = \frac{9C}{5} + 32$
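For the variance, the added 32 has no effect and the factor 9/5 enters squared. A minimal sketch of that calculation, with illustrative variable names:

```python
variance_c = 0.25   # variance in squared degrees Celsius, from the slide
scale = 9 / 5       # multiplicative constant in F = 9C/5 + 32

# Adding 32 shifts every value equally and leaves the variance unchanged;
# multiplying by 9/5 multiplies the variance by (9/5)^2.
variance_f = scale ** 2 * variance_c

# The standard deviation scales by the constant itself, not its square.
sd_f = scale * variance_c ** 0.5

print(variance_f, sd_f)
```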

Measures of position Percentiles Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles. Quartiles (lower quartile, median, upper quartile) Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 4 equal parts are called quartiles. Standard Scores (z-scores) A z-score indicates how many standard deviations an element is from the mean. A standard score can be calculated from the formula: $z = \frac{x - \mu}{\sigma}$

How to interpret z-score?
z-score < 0 … an element less than the mean.
z-score > 0 … an element greater than the mean.
z-score = 0 … an element equal to the mean.
z-score = 1 … an element that is 1 standard deviation greater than the mean; z-score = 2, 2 standard deviations greater than the mean; etc.
z-score = -1 … an element that is 1 standard deviation less than the mean; z-score = -2, 2 standard deviations less than the mean; etc.
If the number of elements in the set is large and the distribution is approximately bell-shaped, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
|z-score| > 3 … the element is an outlier.

z-score - Example A national achievement test is administered annually to 3rd graders. The test has a mean score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on the test?
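Solving the z-score formula for x gives x = mean + z · standard deviation. The sketch below applies this to Jane's score; the function name is illustrative.

```python
def score_from_z(z, mean, sd):
    """Invert z = (x - mean) / sd to recover the raw score x."""
    return mean + z * sd

jane_score = score_from_z(1.20, mean=100, sd=15)
print(jane_score)

# Going the other way recovers the original z-score.
z_check = (jane_score - 100) / 15
```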

Graphical Methods of Presenting Quantitative Variables Histograms - made up of columns plotted on a graph. There is no space between adjacent columns. The columns are positioned over a label that represents a quantitative variable. The column label can be a single value or a range of values. The height of the column indicates the size of the group defined by the column label.

Histograms Where did the bins come from? They were chosen rather arbitrarily. Does choosing other bins change the picture? Yes! And sometimes dramatically. What do we do about this? Some pretty smart people have come up with some "optimal" bin widths and we will rely on their suggestions. Optimal number of bins: $k = 1 + 3.3 \log_{10} n$ (Sturges' rule)
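Sturges' rule is easy to evaluate for a given sample size. The sketch below assumes the logarithm is base 10 (which is what the coefficient 3.3 implies) and rounds to the nearest integer; the rounding convention and the function name are assumptions, not something the slide specifies.

```python
import math

def sturges_bins(n):
    """Suggested number of histogram bins: k = 1 + 3.3 * log10(n), rounded."""
    if n < 1:
        raise ValueError("n must be a positive sample size")
    return round(1 + 3.3 * math.log10(n))

# A few sample sizes; 785 is the size of the sample in the sex frequency table.
for size in (30, 100, 785):
    print(size, sturges_bins(size))
```

The suggested bin count grows slowly: multiplying the sample size by ten adds only about three bins.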

Histogram The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution you can see the important features.

Histograms We will describe the features of the distribution that the histogram is displaying with four characteristics: shape, center, spread, and unusual features.

Histograms Shape Symmetry - when it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other.

Histograms Shape Number of peaks. Distributions with one clear peak are called unimodal. Distributions with two clear peaks are called bimodal. When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.

Histograms Shape Skewness - when they are displayed graphically, some distributions have many more observations on one side of the graph than the other. Distributions with most of their observations on the left (toward lower values) are said to be skewed right. Distributions with most of their observations on the right (toward higher values) are said to be skewed left.

Histograms Shape Skewness: a measure of asymmetry. Sample skewness: $G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
G > 0 … skewed right
G < 0 … skewed left
G = 0 … symmetric
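The sample skewness can be computed straight from its definition. The sketch below reads the slide's formula as G = n / ((n-1)(n-2)) times the sum of cubed standardized deviations, with s the sample standard deviation (divisor n - 1); the data sets are hypothetical and serve only to show the sign of G.

```python
import math

def sample_skewness(values):
    """Sample skewness G = n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    # Sample standard deviation, using the n - 1 divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 3, 5, 7]          # mirror image around 4
right_tail = [1, 2, 3, 4, 10]     # one large value stretches the right tail
left_tail = [1, 7, 8, 9, 10]      # one small value stretches the left tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
print(sample_skewness(left_tail))
```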

Histograms Shape Uniform - when the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution.

Histograms Center Graphically, the center of a distribution is located at the median of the distribution. (Figure panels: $\bar{x} < x_{0.5}$, $\bar{x} = x_{0.5}$, $\bar{x} > x_{0.5}$.)

Histograms Spread The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller. Kurtosis: a measure of how peaked the distribution is. Sample (excess) kurtosis: $g_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3$
$g_2 > 0$ … big kurtosis (less spread)
$g_2 < 0$ … small kurtosis (more spread)

Histograms Unusual Features Gaps. Gaps refer to areas of a distribution where there are no observations. Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers. How can we identify outliers? |z-score| > 3 … the element is an outlier. Rule of thumb: $x_i < x_{0.25} - 1.5\,IQR \;\lor\; x_i > x_{0.75} + 1.5\,IQR$. An extreme value is often considered to be an outlier if it is at least 1.5 interquartile ranges below the lower quartile, or at least 1.5 interquartile ranges above the upper quartile.
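The 1.5 IQR rule of thumb can be sketched as follows. Quartiles are computed as the median of the lower and upper halves of the sorted data, following the definition given earlier in the deck; excluding the middle observation when the count is odd is an assumption of this sketch, and other software may use slightly different quartile conventions. The data are the incomes from the mean vs. median example.

```python
import statistics

def quartiles(values):
    """Lower and upper quartile as medians of the two halves of the sorted data."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("at least two observations are required")
    half = len(data) // 2
    lower = data[:half]
    upper = data[len(data) - half:]   # skips the middle value when n is odd
    return statistics.median(lower), statistics.median(upper)

def iqr_outliers(values):
    """Values more than 1.5 IQR below the lower or above the upper quartile."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000, 40_000_000]
print(quartiles(incomes))
print(iqr_outliers(incomes))
```

Only the $40,000,000 income falls outside the fences, which matches the intuition from the earlier example.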


Box and whiskers plot A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" (hence the name), which goes from the lower quartile (Q1) to the upper quartile (Q3). Within the box, a vertical line is drawn at Q2, the median of the data set. Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set (the smallest value not below Q1 - 1.5 IQR), and the back whisker goes from Q3 to the largest non-outlier (the largest value not above Q3 + 1.5 IQR). If the data set includes one or more outliers, they are plotted separately as points on the chart.

How to interpret a box plot? Range IQR Shape of distribution

Quantitative variable analysis in Excel

Quantitative variable analysis in Statgraphics