Univariate EDA (Exploratory Data Analysis)


EDA
John Tukey (1970s)
data –two components: smooth + rough (patterned behaviour + random variation)
resistant measures/displays
–little influenced by changes in a small proportion of the total number of cases
–resistant to the effects of outliers
–emphasizes smooth over rough components
concepts apply to statistics and to graphical methods

Tree-ring dates (AD): dendrochronology dates. What do they mean? It usually helps to sort the data…

Stem-and-Leaf Diagram
11|62
12|39,39,40,41,41,43,55,71
original values preserved: no rounding, no loss of information…

can simplify in various ways…
11|6
12|44444467
–‘leaves’ rounded to nearest decade
–‘stem’ based on centuries

116|2
117|
118|
119|
120|
121|
122|
123|99
124|0113
125|5
126|
127|1
‘stem’ based on decades…

the decade-based stems highlight the existence of gaps in the distribution of dates, and groups of dates…
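As a rough illustration, the decade-based display can be rebuilt in base R. This is only a sketch: the dates are read off the full-value stem-and-leaf display, and the names `all_stems` and `display` are made up for the example.

```r
# Tree-ring dates (AD) as listed in the full-value stem-and-leaf display
dates <- c(1162, 1239, 1239, 1240, 1241, 1241, 1243, 1255, 1271)

stems  <- dates %/% 10   # decade, e.g. 1239 -> 123
leaves <- dates %% 10    # final digit, e.g. 1239 -> 9

# Keep every decade between the earliest and latest date, even empty ones,
# so that gaps in the distribution stay visible
all_stems <- seq(min(stems), max(stems))

display <- vapply(
  all_stems,
  function(s) paste0(s, "|", paste(leaves[stems == s], collapse = "")),
  character(1)
)
cat(display, sep = "\n")
```

The empty stems between 117 and 122 are what make the isolated early date stand out from the main group.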

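R: stem()
vu <- round(runif(25, 0, 50), 0); stem(vu)    # 25 uniform random values between 0 and 50
vn <- round(rnorm(25, 25, 10), 0); stem(vn)   # 25 normal random values, mean 25, sd 10
stem(vn, scale=2)                             # same data, plot roughly twice as long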

Back-to-back stem-and-leaf plot: rim diameter data (cm), unit 1 vs. unit 2
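Base R's stem() does not draw back-to-back displays, so the sketch below builds one by hand. The rim diameters are hypothetical values invented for the example (the slide's data are not in the transcript), and `leaves_for` is an illustrative helper, not a library function.

```r
# Hypothetical rim diameters (cm); NOT the data from the slide
unit1 <- c(12, 14, 15, 21, 23, 23, 30)
unit2 <- c(9, 11, 18, 22, 25, 31, 34)

both  <- c(unit1, unit2)
stems <- seq(min(both) %/% 10, max(both) %/% 10)   # shared stems (tens)

# Leaves (units digit) of x that belong to stem s, pasted into one string
leaves_for <- function(x, s, decreasing = FALSE) {
  paste(sort(x[x %/% 10 == s] %% 10, decreasing = decreasing), collapse = "")
}

# Left-hand leaves are written outwards from the stem, hence decreasing order
left  <- vapply(stems, function(s) leaves_for(unit1, s, decreasing = TRUE), character(1))
right <- vapply(stems, function(s) leaves_for(unit2, s), character(1))

width   <- max(nchar(left))
display <- paste0(sprintf(paste0("%", width, "s"), left), " |", stems, "| ", right)
cat(display, sep = "\n")
```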

percentiles
useful for constructing various kinds of EDA graphics
don’t confuse percentile with percent or proportion
Note: frequency = count; relative frequency = percent or proportion
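The frequency / relative frequency distinction can be sketched in R with table() and prop.table(). The values are the small test batch used in the R quantile example elsewhere in these slides; the variable names are illustrative.

```r
x <- c(1, 3, 5, 7, 9, 9, 14)

frequency  <- table(x)               # counts of each distinct value
proportion <- prop.table(frequency)  # relative frequency as a proportion
percent    <- 100 * proportion       # relative frequency as a percent

print(frequency)
print(round(percent, 1))
```

None of these is a percentile: a percentile is a value of x, not a share of the cases.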

percentiles
“the pth percentile of a distribution: number such that approximately p percent of the values in the distribution are equal or less than that number…”
can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…

percentiles
sort the data so that $x_1$ is the smallest value, and $x_n$ is the largest (where n = total number of cases)
$x_i$ is the $p_i$th percentile of a dataset of n members, where: $p_i = 100(i - 0.5)/n$

for the sorted data 1, 3, 5, 7, 9, 9, 14 (n = 7):
$p_1 = 100(1 - 0.5)/7 = 7.1$
$p_2 = 100(2 - 0.5)/7 = 21.4$
$p_3 = 100(3 - 0.5)/7 = 35.7$
$p_4 = 100(4 - 0.5)/7 = 50$
etc…

50th percentile: $i = (7 \times 50)/100 + 0.5 = 4$, $x_i = 7$
25th percentile: $i = (7 \times 25)/100 + 0.5 = 2.25$, $3 < x_i < 5$

if $i$ is not an integer, then…
$k$ = integer part of $i$; $f$ = fractional part of $i$
$x_{int}$ = interpolated value of $x$
$x_{int} = (1 - f)\,x_k + f\,x_{k+1}$
25th percentile: $i = 2.25$, so $k = 2$ and $f = 0.25$
$x_{int} = (1 - 0.25) \times 3 + 0.25 \times 5 = 3.5$

use R!!
test <- c(1, 3, 5, 7, 9, 9, 14)
quantile(test, 0.25, type=5)   # type 5 interpolates at position n*p + 0.5; returns 3.5
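The hand calculation can also be written out as a small function and checked against quantile(). This is a sketch, assuming the position rule i = n*p/100 + 0.5 and linear interpolation between neighbouring values; the function name `percentile` and the clamping of i to the observed range are choices made for this example.

```r
percentile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5      # position of the pth percentile in the sorted data
  i <- min(max(i, 1), n)      # assumption: clamp to the observed range
  k <- floor(i)               # integer part of i
  f <- i - k                  # fractional part of i
  if (f == 0) return(x[k])    # an existing value: no interpolation needed
  (1 - f) * x[k] + f * x[k + 1]
}

test <- c(1, 3, 5, 7, 9, 9, 14)
percentile(test, 50)                      # 7
percentile(test, 25)                      # 3.5
unname(quantile(test, 0.25, type = 5))    # 3.5
```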

[Figure: “boxplot” showing the 25th, 50th and 75th percentiles, the lower and upper hinges, the interquartile range (midspread), and the inner fence (1.5 x midspread)]
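The quantities behind a boxplot can be computed directly. A minimal sketch, reusing the small test batch from the quantile example: fivenum() returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum), and the hinges may differ slightly from interpolated 25th and 75th percentiles. Variable names are illustrative.

```r
x <- c(1, 3, 5, 7, 9, 9, 14)

five        <- fivenum(x)     # min, lower hinge, median, upper hinge, max
lower_hinge <- five[2]
upper_hinge <- five[4]
midspread   <- upper_hinge - lower_hinge

# Inner fences sit 1.5 midspreads beyond each hinge
inner_fence <- c(lower_hinge - 1.5 * midspread,
                 upper_hinge + 1.5 * midspread)

# Cases beyond the inner fences are candidates for plotting as outliers
outside <- x[x < inner_fence[1] | x > inner_fence[2]]

print(five)
print(inner_fence)
print(outside)
```

boxplot(x) draws the same summary; here no value lies beyond the fences, so the whiskers would run to the minimum and maximum.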

Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]

Cleveland, W. S. (1985) The Elements of Graphing Data.

Histograms
divide a continuous variable into intervals called ‘bins’
count the number of cases within each bin
use bars to reflect counts
intervals on the horizontal axis, counts on the vertical axis
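The counting step can be sketched in R with cut() and table(). The data are the small test batch from the quantile example, and the bin boundaries (width 5, starting at 0) are an arbitrary choice for illustration.

```r
x <- c(1, 3, 5, 7, 9, 9, 14)

breaks <- seq(0, 15, by = 5)                     # bin boundaries: 0, 5, 10, 15
bins   <- cut(x, breaks = breaks, right = FALSE) # bins [0,5), [5,10), [10,15)

counts  <- table(bins)                # number of cases within each bin
percent <- 100 * counts / length(x)   # same bars, expressed as percent

print(counts)
print(round(percent, 1))
```

hist(x, breaks = breaks, right = FALSE) would draw bars with these heights.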

[Figure: Histogram with ‘bins’ on the horizontal axis; vertical axis shown as counts or as percent]

Histograms
useful for illustrating the shape of the distribution of a batch of numbers
may be helpful for identifying modes and modal behaviour

[Figure: histogram with peaks labelled “mode”, “mode?” and “mode!”]
the distribution is clearly bimodal; may be multimodal…

important variables in histogram construction: bin width; bin starting point
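A quick way to see the effect of both choices is to recount the same batch under different settings. A sketch only: `count_bins` is a hypothetical helper written for this example, and the data are the small test batch from the quantile example.

```r
x <- c(1, 3, 5, 7, 9, 9, 14)

# Count cases per bin for a given starting point and bin width
count_bins <- function(x, start, width) {
  breaks <- seq(start, max(x) + width, by = width)   # last break lies above max(x)
  as.vector(table(cut(x, breaks = breaks, right = FALSE)))
}

a <- count_bins(x, start = 0,  width = 5)   # 2 4 1
b <- count_bins(x, start = -2, width = 5)   # 1 3 2 1
c2 <- count_bins(x, start = 0, width = 2)   # 1 1 1 1 2 0 0 1

print(a); print(b); print(c2)
```

Shifting the start by two units, or narrowing the bins, changes the apparent shape even though the data are identical.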

boundaries of ‘bins’…
bins: 1-2; 2-3; …
where to count a value like ‘2.0’?
Shennan: really means …; …; is this OK??
George Cowgill: construct ‘bins’ of whole multiples of “minimum meaningful measurement units” (“mmmus”)
2.0 = >1.95 and <2.05
mmmu = …; …; …; …
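The boundary problem can be seen with cut(). This is a sketch under stated assumptions: the measurement unit of 0.1 is assumed for illustration (the slide's own value is not in the transcript), and placing boundaries half a unit below each whole value is one reading of “2.0 = >1.95 and <2.05”.

```r
value <- 2.0

# Bins 1-2 and 2-3: the shared boundary has to belong to one bin or the other
bin_right_closed <- as.integer(cut(value, breaks = c(1, 2, 3)))                 # (1,2], (2,3]
bin_left_closed  <- as.integer(cut(value, breaks = c(1, 2, 3), right = FALSE))  # [1,2), [2,3)

# Assumption: values are recorded to the nearest 0.1, so boundaries placed
# half a unit away can never coincide with a recorded value
mmmu          <- 0.1
offset_breaks <- c(1, 2, 3) - mmmu / 2          # 0.95, 1.95, 2.95
bin_offset    <- as.integer(cut(value, breaks = offset_breaks))

print(c(bin_right_closed, bin_left_closed, bin_offset))
```

With whole-number boundaries the same observation lands in bin 1 or bin 2 depending on an arbitrary convention; with offset boundaries the assignment no longer depends on it.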

observed value 2.0…

smoothing histograms
may want to accentuate the ‘smooth’ in a data distribution…
calculate “running averages” on bin counts
level of smoothing is arbitrary…
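A running average on bin counts might be sketched as follows. The counts are hypothetical, the window of three bins is an arbitrary choice (as the slide notes), and `running_mean` is an illustrative helper built on stats::filter().

```r
# Hypothetical bin counts, for illustration only
counts <- c(2, 5, 1, 6, 9, 4, 7, 2)

# Centred running mean over `window` adjacent bins; the end bins have no
# complete window and come back as NA
running_mean <- function(counts, window = 3) {
  as.numeric(stats::filter(counts, rep(1 / window, window), sides = 2))
}

smooth3 <- running_mean(counts, window = 3)
print(round(smooth3, 2))
```

A wider window gives a smoother curve but blurs genuine modes, which is why the level of smoothing has to be reported along with the result.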

histogram / barchart variations: 3d; stacked; dual; frequency polygon; kernel density methods

dual barchart

Site 1 Site 2

‘mirror’ barchart

stacked barchart

3d barchart

frequency polygon

kernel density model

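controlling kernel density plots…
# XX: numeric vector holding the data
hd <- density(XX)                 # kernel density estimate
hh <- hist(XX, plot=FALSE)        # histogram computed but not drawn
maxD <- max(hd$y)                 # highest point of the density curve
maxH <- max(hh$density)           # tallest bar on the density scale
Y <- c(0, max(c(maxD, maxH)))     # y-axis range that fits both
hist(XX, freq=FALSE, ylim=Y)      # draw the histogram on the density scale
lines(hd)                         # overlay the density curve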

Dot Plot [R: dotchart()]

Dot Histogram [R: stripchart(), method = "stack"]

line plot (categories: cooking/service, service, ritual)

pie chart (slices: 20%, 19%, 18%, 21%, 22%)

Cumulative Percent Graph [panels: percent; cumulative percent]

Cumulative Percent Graph
some useful statistical measures (ordinal or ratio scale)
can be misleading when used with nominal data
good for comparing data sets
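Cumulative percents are a running total of the percent in each class, which only makes sense when the classes have a natural order. A minimal sketch, with hypothetical counts for four ordered size classes at two sites (not data from the slides); `cum_pct` is an illustrative helper.

```r
# Hypothetical counts per ordered size class; NOT data from the slides
site1 <- c(small = 5,  medium = 10, large = 25, xlarge = 10)
site2 <- c(small = 20, medium = 15, large = 10, xlarge = 5)

# Percent in each class, accumulated from the first class to the last
cum_pct <- function(counts) cumsum(100 * counts / sum(counts))

cp1 <- cum_pct(site1)   # 10, 30, 80, 100
cp2 <- cum_pct(site2)   # 40, 70, 90, 100

print(rbind(site1 = cp1, site2 = cp2))
```

Plotting both rows against the class order gives two curves that both end at 100, so batches of different sizes can be compared directly. With nominal categories the curve would change shape whenever the categories were reordered.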