Introduction to Data Science Lecture 6 Exploratory Data Analysis

Presentation transcript:

Introduction to Data Science
Lecture 6: Exploratory Data Analysis
CS 194, Spring 2014
Michael Franklin
Dan Bruckner, Evan Sparks, Shivaram Venkataraman

Outline for this Evening
Class lecture
  Exploratory Data Analysis
  Hypothesis Testing
Exercise: EDA and HT in Python (Evan: tutorial and lab)
  Next week: we’ll play with “R”
Review of exercise
Time for project group discussions

Topics Today and Next Time
Exploratory Data Analysis
  Data Diagnosis
  Graphical/Visual Methods
  Data Transformation
Confirmatory Data Analysis
  Statistical Hypothesis Testing
  Graphical Inference

Descriptive vs. Inferential
Descriptive: e.g., the mean; describes the data you have but can't be generalized beyond it
  We’ll talk about Exploratory Data Analysis
Inferential: e.g., the t-test; enables inferences about the population beyond our data
  These are the techniques we’ll leverage for Machine Learning and Prediction

Examples of Business Questions
Simple (descriptive) stats
  “Who are the most profitable customers?”
Hypothesis testing
  “Is there a difference in value to the company of these customers?”
Segmentation/classification
  “What are the common characteristics of these customers?”
Prediction
  “Will this new customer become a profitable customer? If so, how profitable?”
Adapted from Provost and Fawcett, “Data Science for Business”

Applying Techniques
Which models/techniques to use depends on the problem context, the data, and the underlying assumptions.
  e.g., classification problem with a binary outcome? -> logistic regression, Naïve Bayes, …
  e.g., classification problem but no labels? -> perhaps use K-means clustering

Exploratory Data Analysis (John Tukey, 1977)
Based on insights developed at Bell Labs in the 60’s
Techniques for visualizing and summarizing data
What can the data tell us? (in contrast to “confirmatory” data analysis)
Introduced many basic techniques: 5-number summary, box plots, stem-and-leaf diagrams, …
5-number summary:
  extremes (min and max)
  median and quartiles
  More robust to skewed and long-tailed distributions than the mean and standard deviation
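A minimal sketch of the 5-number summary in Python, using only the standard library. The data values and the function name five_number_summary are made up for illustration, and the "inclusive" quartile method is one of several common conventions, so other tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points;
    # method="inclusive" interpolates between the observed min and max.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

# Hypothetical, long-tailed data: one extreme value among small ones.
incomes = [20, 22, 25, 27, 30, 31, 35, 40, 1000]

summary = five_number_summary(incomes)
print("min, Q1, median, Q3, max:", summary)
print("mean:", statistics.mean(incomes))
```

The single extreme value pulls the mean above the third quartile, while the median and quartiles stay with the bulk of the data.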

The Trouble with Summary Stats

Looking at Data

Data Presentation Dashboard

Data Presentation Data Art

Chart types: single variable
  Dot plot
  Jitter plot
  Box plot
  Histogram
  Kernel density estimate
  Cumulative distribution function
(Note: examples use qplot from R's ggplot2 package)
Chart examples from Jeff Hammerbacher’s 2012 CS194 class

Chart types Dot plot

Chart types Jitter plot

Chart types Box plot

Chart types Box plot

Chart types Histogram

Chart types Kernel density estimate

Chart types: Histogram and Kernel Density Estimates
Histogram
  Proper selection of bin width is important
  Outliers should be discarded
KDE
  Kernel function: Box, Epanechnikov, Gaussian
  Kernel bandwidth
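To make the roles of bin width and kernel bandwidth concrete, here is a small standard-library sketch. The helper names (histogram_counts, kde_gaussian) and the data are hypothetical, and only the Gaussian kernel is shown; a box or Epanechnikov kernel would replace the exponential term.

```python
import math

def histogram_counts(values, bin_width, start=0.0):
    """Count values into bins [start + k*w, start + (k+1)*w); keys are bin indices k."""
    counts = {}
    for v in values:
        k = math.floor((v - start) / bin_width)
        counts[k] = counts.get(k, 0) + 1
    return counts

def kde_gaussian(values, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes one bell curve centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8]  # made-up measurements

print(histogram_counts(data, bin_width=1.0))  # narrow bins show more detail
print(histogram_counts(data, bin_width=2.5))  # wide bins hide the gap near 3-4

density = kde_gaussian(data, bandwidth=0.5)
print(density(2.0), density(4.0))
```

A larger bandwidth plays the same role as a wider bin: the estimate gets smoother and loses detail.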

Chart types Cumulative distribution function
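The empirical cumulative distribution function behind this chart type can be sketched in a few lines: for each x it reports the fraction of observations less than or equal to x. The function name empirical_cdf and the sample values are assumptions for illustration.

```python
import bisect

def empirical_cdf(values):
    """Return a function x -> fraction of observations <= x."""
    data = sorted(values)
    n = len(data)

    def cdf(x):
        # bisect_right gives the number of sorted observations <= x.
        return bisect.bisect_right(data, x) / n

    return cdf

cdf = empirical_cdf([3, 1, 2, 2])
for x in (0, 1, 2, 3):
    print(x, cdf(x))
```

Unlike a histogram, this needs no bin width or bandwidth choice.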

Chart types: two variables
  Scatter plot
  Pairs plot
  Line plot
  Log-log plot
  Cut-and-stack plot

Chart types Scatter plot

Chart types Line plot

Chart types Log-log plot

Chart types Coxcomb plot Attributed to Florence Nightingale

Chart types Treemap

Chart types Heatmap

Chart types Gapminder

The Need for Models
“All models are wrong, but some models are useful.” George Box
Data represents the traces of real-world processes.
Two sources of randomness and uncertainty:
  1) those underlying the processes themselves
  2) those associated with the data collection methods
To simplify the traces into something more comprehensible you need mathematical models or functions of the data -> statistical estimators

More on Models
N is the size of the population
n is the sample size (a subset of the population)
Getting the subset (i.e., sampling) can introduce "bias", leading to incorrect conclusions
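A rough simulation of how the way a sample is drawn can bias an estimate. Everything here is hypothetical: the population is just the integers 0..9999, and the "biased" sampler can only reach the top half, standing in for something like a survey that only reaches the heaviest spenders.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of size N.
population = list(range(10_000))
N = len(population)
n = 500  # sample size

# Simple random sample: every member is equally likely to be chosen.
random_sample = random.sample(population, n)

# Biased sample: only members in the top half of the population are reachable.
reachable = [x for x in population if x >= N // 2]
biased_sample = random.sample(reachable, n)

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print("population mean:", population_mean)
print("random sample mean:", random_mean)
print("biased sample mean:", biased_mean)
```

Both samples have the same size n; only the selection mechanism differs, and a larger n would not repair the biased estimate.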

Probability Distributions
Natural processes tend to generate measurements whose empirical shape can be approximated by mathematical functions with a few parameters that can be estimated from the data.
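As an illustration of "a few parameters estimated from the data", the sketch below simulates measurements and fits a normal distribution to them with the standard library's NormalDist. The choice of a normal model and the "true" values (mean 170, standard deviation 10) are assumptions made for the example.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated measurements from a process we pretend not to know.
heights = [random.gauss(170, 10) for _ in range(2000)]

# Estimate the two parameters (mean and standard deviation) from the data.
fitted = NormalDist.from_samples(heights)
print("estimated mean:", fitted.mean)
print("estimated stdev:", fitted.stdev)

# Once fitted, the model answers questions the raw data only hints at,
# e.g. the share of values within one standard deviation of the mean.
share_within_one_sd = (
    fitted.cdf(fitted.mean + fitted.stdev) - fitted.cdf(fitted.mean - fitted.stdev)
)
print("share within one sd:", share_within_one_sd)
```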

Note on ML Algos vs. Stat Models
Techniques and underlying concepts in common
Difference in goals/use:
  ML algos: goal is to predict or classify with high accuracy; basis of many data products
  Models: get at the underlying generative process
“Black box” vs. “white box”
Dealing with uncertainty (at the heart of stats)
Distributions vs. non-parametric approaches

More on Hypothesis Testing
The Null Hypothesis is given the benefit of the doubt (e.g., innocent until proven guilty).
The Alternative Hypothesis directly contradicts the Null Hypothesis.
Step 1: State the hypotheses.
Step 2: Set the criteria for a decision.
Step 3: Compute the test statistic.
Step 4: Make a decision.
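The four steps can be walked through with a one-sample, two-tailed z-test. This is only a sketch under stated assumptions: the numbers (null mean 100, known population standard deviation 15, a sample of 36 with mean 106) are invented, and a z-test is chosen because it needs nothing beyond the standard library's normal distribution.

```python
from statistics import NormalDist

# Step 1: state the hypotheses.
#   H0: the population mean is mu0.  H1: the population mean is not mu0.
mu0 = 100.0
sigma = 15.0  # assumed known population standard deviation

# Step 2: set the criteria for a decision.
alpha = 0.05  # two-tailed significance level

# Step 3: compute the test statistic from the (hypothetical) sample.
sample_mean = 106.0
n = 36
z = (sample_mean - mu0) / (sigma / n ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: make a decision.
reject_null = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```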

p Value
A p value is the probability of obtaining a sample outcome at least as extreme as the one observed, given that the value stated in the null hypothesis is true.
In many cases: when the p value is less than 5% (p < .05), we reject the null hypothesis
Note this means that when the null hypothesis is true, we incorrectly reject it about 1 out of 20 times
Do “green jelly beans cause acne”? (see XKCD)
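The "1 out of 20" remark can be checked by simulation: run many experiments in which the null hypothesis really is true and count how often p < .05 anyway. The sketch assumes a two-tailed z-test on samples of 30 standard-normal draws; the helper name z_test_p_value and the trial count are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the sketch is repeatable
ALPHA = 0.05
std_normal = NormalDist()

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-tailed p value for H0: population mean == mu0, with sigma known."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))

# Every sample is drawn from N(0, 1), so H0 (mean == 0) is true each time.
trials = 2000
false_rejections = sum(
    z_test_p_value([random.gauss(0, 1) for _ in range(30)]) < ALPHA
    for _ in range(trials)
)
false_positive_rate = false_rejections / trials
print("share of true nulls rejected:", false_positive_rate)

# Testing 20 independent true nulls (one per jelly bean colour, say):
# probability that at least one comes out "significant" by chance.
p_at_least_one = 1 - (1 - ALPHA) ** 20
print("P(at least one false positive in 20 tests):", round(p_at_least_one, 3))
```

The second number is why running many tests and reporting only the significant one is misleading.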

From G.J. Privitera, “Statistics for the Behavioral Sciences”

Two-tailed Significance
From G.J. Privitera, “Statistics for the Behavioral Sciences”
When the p value is less than 5% (p < .05), we reject the null hypothesis

Hypothesis Testing
From G.J. Privitera, “Statistics for the Behavioral Sciences”

Are Two Sets of Data Really Different?
Null Hypothesis: the differences we see are due to “chance”
For small sample sizes: use the t-test
We’ll do this next in the lab.
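A hand-rolled sketch of the two-sample t-test, so the arithmetic is visible. It assumes the pooled (equal-variance) form of the test, uses made-up measurements, and compares against the standard two-tailed critical value for 10 degrees of freedom at the 5% level (2.228) rather than computing a p value, because the standard library has no t distribution. In practice a library routine such as scipy.stats.ttest_ind would normally be used instead.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a two-sample, equal-variance t-test."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    df = na + nb - 2
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, df

# Hypothetical small samples, six measurements per group.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.2, 4.8, 4.4, 4.9, 4.1, 4.6]

t, df = pooled_t_statistic(group_a, group_b)

# Two-tailed critical value of the t distribution for df = 10, alpha = 0.05.
T_CRITICAL_DF10 = 2.228
reject_null = abs(t) > T_CRITICAL_DF10
print(f"t = {t:.2f}, df = {df}, reject H0: {reject_null}")
```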

Some Notes on the Class
3/17: Intro to Supervised Learning
HW2 coming out tomorrow night
  Due after Spring Break, but do it before!
FINAL PROJECTS
  Group size = 3
  What’s expected: find data, build a COOL Data Product, integration & viz or good reason why not
  Schedule:
    Groups formed
    1-2 page proposal DUE 3/11, midnight
    Midway review meeting with Prof or GSIs in the following 1-2 weeks
    Final presentation (posters and/or lightning talks)
    Final report