Exploratory Data Analysis I


Exploratory Data Analysis I. STAT 101, 1/25/12. Topics: One Categorical Variable; Two Categorical Variables; One Quantitative Variable: Center. Sections 2.1, 2.2. Professor Kari Lock Morgan, Duke University

Announcements Textbooks are here! My office hours (Old Chemistry 216): Wednesday 3-5 pm, Friday 1-3 pm. Lecture slides, assignments, labs, etc. will be posted at http://stat.duke.edu/courses/Spring12/sta101.2/ Complete lecture slides will be posted after each class.

The Big Picture [Diagram labels: Population, Sampling, Sample, Statistical Inference, Exploratory Data Analysis]

Class Survey Data Data from both STAT 101 classes and STAT 10

Data In order to make sense of this data, we need ways to summarize and visualize it. Summarizing and visualizing variables and relationships between two variables is often known as exploratory data analysis (also known as descriptive statistics). The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).

One Categorical Variable Display the number or proportion of cases that fall in each category. Example: “What is your favorite day of the week?”

Frequency Table A frequency table shows the number of cases that fall in each category: Monday Tuesday Wednesday Thursday Friday Saturday Sunday 1 6 12 106 71 R: table(fav_day)

Proportion The sample proportion of students in each category is (number of cases in the category) / (total number of cases)

Proportion Monday Tuesday Wednesday Thursday Friday Saturday Sunday 1 6 12 106 71 The sample proportion of students in this class who prefer Friday is 106/209 ≈ 0.51. Proportion and percent can be used interchangeably: 0.51 or 51%

Relative Frequency Table A relative frequency table shows the proportion of cases that fall in each category. All the numbers in a relative frequency table sum to 1. Monday Tuesday Wednesday Thursday Friday Saturday Sunday 0.005 0.029 0.057 0.507 0.340 R: round(table(fav_day)/209,3)
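As a minimal sketch of how the frequency and relative frequency tables fit together in R: the responses below are made up for illustration (they are not the class survey data), and the names counts and props are assumptions of this example.

```r
# Hypothetical responses, made up for illustration (not the class survey data)
fav_day <- c("Friday", "Saturday", "Friday", "Sunday", "Friday",
             "Saturday", "Thursday", "Friday")

# Frequency table: number of cases in each category
counts <- table(fav_day)

# Relative frequency table: divide each count by the number of cases
props <- round(counts / length(fav_day), 3)

print(counts)
print(props)
```

Dividing by length(fav_day) rather than a typed-in sample size keeps the proportions correct if the data change.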

Bar Chart/Plot/Graph In a barplot, the height of the bar corresponds to the number of cases falling in each category R: barplot(table(fav_day))

Pie Chart In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category R: pie(table(fav_day))

Summary: One Categorical Variable Summary Statistics: proportion, frequency table, relative frequency table. Visualization: barplot, pie chart.

Two Categorical Variables Look at the relationship between two categorical variables, for example relationship status and gender.

Two-Way Table
                           Female   Male   Total
In a Relationship              42     18      60
Single                         98     45     143
It’s Complicated / Other       11      1      12
Total                         151     64     215
It doesn’t matter which variable is displayed in the rows and which in the columns. R: table(gender, relationship)

Two-Way Table
                           Female   Male   Total
In a Relationship              42     18      60
Single                         98     45     143
It’s Complicated / Other       11      1      12
Total                         151     64     215
What proportion of females in intro stat are in a relationship?
(a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215

Two-Way Table
                           Female   Male   Total
In a Relationship              42     18      60
Single                         98     45     143
It’s Complicated / Other       11      1      12
Total                         151     64     215
What proportion of intro stat students in a relationship are female?
(a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215

Two-Way Table CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!

Two-Way Table
                           Female   Male   Total
In a Relationship              42     18      60
Single                         98     45     143
It’s Complicated / Other       11      1      12
Total                         151     64     215
What proportion of intro stat students are in a relationship and female?
(a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215
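The three questions about the two-way table differ only in the denominator. A short sketch of that arithmetic in R, using the counts from the table; the object names (tab, p_rel_given_female, p_female_given_rel, p_both) are illustrative and not from the slides.

```r
# Counts copied from the two-way table on the slides
tab <- as.table(matrix(c(42, 18,
                         98, 45,
                         11,  1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(
                         relationship = c("In a Relationship", "Single",
                                          "It's Complicated / Other"),
                         gender = c("Female", "Male"))))

# Show the table with its row and column totals
print(addmargins(tab))

cell <- tab["In a Relationship", "Female"]

# Proportion of females who are in a relationship: divide by the Female column total
p_rel_given_female <- cell / sum(tab[, "Female"])

# Proportion of people in a relationship who are female: divide by the row total
p_female_given_rel <- cell / sum(tab["In a Relationship", ])

# Proportion of all students who are both female and in a relationship
p_both <- cell / sum(tab)

round(c(p_rel_given_female, p_female_given_rel, p_both), 3)
```

Building the table from the counts means the totals are computed rather than typed, which guards against picking the wrong denominator.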

Side-by-Side Bar Chart The height of each bar is the number in the corresponding cell of the two-way table. R: colors = c("pink", "blue"); barplot(table(gender, relationship), beside=TRUE, col=colors, legend=TRUE)

Side-by-Side Bar Chart R: colors = c("red", "green", "blue"); barplot(table(relationship, gender), beside=TRUE, col=colors, legend=TRUE)

Segmented Bar Chart A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of side-by-side. R: barplot(table(relationship, gender), legend=TRUE, col=c("red", "green", "blue"))

Mosaic Plot The width of each column corresponds to the proportion of cases in that column category, and each column is divided into colored segments according to the proportions of the row variable within that column category. R: mosaicplot(table(Music, Gender), col=c("pink", "blue"))

Mosaic Plot R: colors = c("red", "green", "blue"); mosaicplot(table(gender, relationship), col=colors, legend=TRUE, cex.axis=.7, main="")

Mosaic Plot This tells us…
(a) Most people who are in favor of the new housing model are in (or plan to be in) a selected living group
(b) Most people who are in (or plan to be in) a selected living group are in favor of the new housing model
(c) Both (a) and (b)
(d) Neither (a) nor (b)

Difference in Proportions A difference in proportions is the difference between the proportions of one categorical variable (e.g. the proportion for whom “it’s complicated”) calculated for different levels of the other categorical variable (e.g. gender).

Two-Way Table
                           Female   Male   Total
In a Relationship              42     18      60
Single                         98     45     143
It’s Complicated / Other       11      1      12
Total                         151     64     215
What is the difference in proportions, 11/151 - 1/64?
(a) 0.833 (b) 0.066 (c) -0.003 (d) 0.057 (e) 0.047
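The difference 11/151 - 1/64 can be checked directly. A small sketch in R, with the counts taken from the two-way table; the vector names complicated, totals, and diff_in_props are assumptions of this example.

```r
# "It's Complicated / Other" counts and column totals from the two-way table
complicated <- c(Female = 11, Male = 1)
totals      <- c(Female = 151, Male = 64)

# Proportion within each gender for whom "it's complicated"
p <- complicated / totals

# Difference in proportions: female minus male
diff_in_props <- unname(p["Female"] - p["Male"])
round(diff_in_props, 3)
```

The order of subtraction is a choice; swapping it only flips the sign.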

Summary: Two Categorical Variables Summary Statistics: two-way table, difference in proportions. Visualization: side-by-side bar chart, segmented bar chart, mosaic plot.

Kidney Stones
              Success   Failure
Treatment A       273        77
Treatment B       289        61
Which treatment is better at removing kidney stones?
(a) Treatment A (b) Treatment B
C. R. Charig, D. R. Webb, S. R. Payne, J. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882

Kidney Stones
Small Stones
              Success   Failure
Treatment A        81         6
Treatment B       234        36
Which treatment is better at removing small kidney stones?
(a) Treatment A (b) Treatment B

Kidney Stones
Large Stones
              Success   Failure
Treatment A       192        71
Treatment B        55        25
Which treatment is better at removing large kidney stones?
(a) Treatment A (b) Treatment B

Kidney Stones Treatment A is more effective for both small and large kidney stones, but the data show Treatment B to be more effective overall! How is this possible!?!?

Kidney Stones
ALL STONES
              Success   Failure
Treatment A       273        77
Treatment B       289        61
Small Stones
              Success   Failure
Treatment A        81         6
Treatment B       234        36
Large Stones
              Success   Failure
Treatment A       192        71
Treatment B        55        25

Kidney Stones Treatment A is used more often on large stones, which are harder to treat. This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered.
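The reversal can be reproduced from the counts on the kidney stone slides. A sketch in R; the data frame layout and the names stones, overall, and rate are assumptions of this example, not part of the lecture.

```r
# Success and failure counts from the kidney stone slides
stones <- data.frame(
  size      = c("Small", "Small", "Large", "Large"),
  treatment = c("A", "B", "A", "B"),
  success   = c(81, 234, 192, 55),
  failure   = c(6, 36, 71, 25)
)

# Success rate within each stone size: Treatment A is higher in both groups
stones$rate <- stones$success / (stones$success + stones$failure)
print(stones)

# Pool over stone size: sum the counts for each treatment, then recompute the rate
overall <- aggregate(cbind(success, failure) ~ treatment, data = stones, FUN = sum)
overall$rate <- overall$success / (overall$success + overall$failure)
print(overall)
```

The pooled rates favor Treatment B because most of Treatment A's cases are large stones, where success is less likely under either treatment.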

Small Stones
               Treatment A   Treatment B
Successful     81 (93%)      234 (87%)
Unsuccessful   6             36
Slope = # successful / # unsuccessful = odds

Large Stones
               Treatment A   Treatment B
Successful     192 (73%)     55 (69%)
Unsuccessful   71            25
Slope = # successful / # unsuccessful = odds

Combined
               Treatment A   Treatment B
Successful     81+192=273    289
Unsuccessful   6+71=77       61

Combined
               Treatment A   Treatment B
Successful     273 (78%)     289 (83%)
Unsuccessful   77            61

One Quantitative Variable We’ll look at how to analyze a quantitative variable such as: times checking Facebook per day; average hours of sleep per night; average hours of exercise per week; GPA; average hours spent on extracurricular activities per week; number of piercings.

Dotplot In a dotplot, each case is represented by a dot, and dots are stacked. It is an easy way to see each case. [Figure: dotplot of average number of times checking Facebook per day]

Histogram The height of each bar corresponds to the number of cases within that range of the variable. R: hist(exercise)

Histogram Although they look similar, a histogram is not the same as a bar plot. A bar plot is for categorical data, and the x-axis has no numeric scale. A histogram is for quantitative data, and the x-axis is numeric. For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed. For a quantitative variable, the number of bars in a histogram is up to you (or the software you use), and the appearance can differ with different numbers of bars.
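To illustrate how the choice of bars changes a histogram, here is a small sketch in R. The exercise values are hypothetical (made up for this example, not the survey data), and the bin edges are arbitrary choices.

```r
# Hypothetical hours of exercise per week, made up for illustration
exercise <- c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 10, 12)

# plot = FALSE returns the bin counts without drawing the histogram
wide   <- hist(exercise, breaks = seq(0, 12, by = 4), plot = FALSE)
narrow <- hist(exercise, breaks = seq(0, 12, by = 2), plot = FALSE)

# Same data, different numbers of bars
wide$counts
narrow$counts
```

Dropping plot = FALSE draws the two histograms, which makes the difference in appearance visible.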

Shape [Figure: histograms labeled Symmetric, Right-Skewed (long right tail), and Left-Skewed]

Notation The sample size, the number of cases in the sample, is denoted by n. We often let x or y stand for any variable, and x_1, x_2, …, x_n represent the n values of the variable x. Example: x = average hours of sleep; x_1 = 5, x_2 = 9, x_3 = 7, x_4 = 7, …

Mean The sample mean is the average, and is computed by adding up all the numbers and dividing by the number of cases: (x_1 + x_2 + … + x_n) / n. R: mean()

Median The sample median is the middle value when the data is ordered. If there are an even number of values, the median is the average of the two middle values. The sample median is denoted as m. R: median()

Outliers An outlier is a value that is notably different from the other values. [Figure: hours spent on extracurricular activities per week]

Resistance Statistics are resistant if they are not heavily affected by outliers. The median is resistant, the mean is not.
Average hours of extracurricular activities per week:
                   Mean   Median
With Outlier       59.6   6
Without Outlier    9.2
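A small sketch of resistance in R. The first four values are the sleep hours from the notation example; the fifth value (70) is a made-up outlier, imagined here as a mistyped 7.0, and is not from the survey.

```r
# First four sleep values from the notation example
sleep <- c(5, 9, 7, 7)

# Add one hypothetical outlier (e.g. 7.0 mistyped as 70)
with_outlier <- c(sleep, 70)

# Compare mean and median with and without the outlier
c(mean = mean(sleep), median = median(sleep))
c(mean = mean(with_outlier), median = median(with_outlier))
```

The single extreme value pulls the mean well above every ordinary observation, while the median does not move.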

Outliers When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake. If not, you have to decide whether the outlier is part of your population of interest or not. Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.

Groups You will be put into groups of 4 or 5 based on common lab time and similar interests. These groups will be used for discussion and group activities in class, and will be your groups for the final project at the end of the course. They could be a natural study group for outside of class as well, but only if you want them to be.

To Do Homework 1 (due Monday) Buy a clicker! (clicker grading starts on Monday)