CHAPTER 2: Statistical Inference, Exploratory Data Analysis and Data Science Process (CSE4/587, Spring 2014)


Chapter 1
- Lots of discussion of the authors' context.
- What is the skill set needed for a data-computer scientist?
- Read chapter 1 and add "coding" to that skill set. See code.org.

Introduction
- Data represents the traces of real-world processes.
  - Which traces we collect depends on the sampling methods.
- Two sources of randomness and uncertainty:
  - The process that generates the data is random.
  - The sampling process itself is random.
- Your mindset should be "statistical thinking in the age of big data".
  - Combine the statistical approach with big data.
- Our goal for this chapter: understand the statistical process of dealing with data, and practice it using R.

Uncertainty and Randomness
- Probability theory offers a mathematical model for uncertainty and randomness.
- A world/process is defined by one or more variables. The model of the world is defined by a function: Model == f(w) or f(x, y, z) (a multivariate function).
- The function is unknown, so the model is unclear, at least initially. Typically our task is to come up with the model, given the data.
- Uncertainty is due to lack of knowledge: this week's weather prediction (e.g. 90% confident).
- Randomness is due to lack of predictability: which of the faces 1-6 comes up when rolling a die.
- Both can be expressed by probability theory.
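The die example can be sketched in a few lines of Python. This is only an illustration, assuming a fair six-sided die and a fixed random seed; the function name `roll_frequencies` is made up for this sketch. Each individual roll is unpredictable, yet the long-run frequencies settle near the probability 1/6 that the model assigns to each face.

```python
import random
from collections import Counter


def roll_frequencies(n_rolls, seed=0):
    """Simulate n_rolls of a fair die and return the observed frequency of each face."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


freqs = roll_frequencies(60_000)
for face, freq in freqs.items():
    # Each observed frequency should be close to the model value 1/6.
    print(face, round(freq, 3), "model:", round(1 / 6, 3))
```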

Statistical Inference
- World → collect data → capture the understanding/meaning of the data through models or functions → statistical estimators for predicting things about the world.
- Statistical inference is the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
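As a rough sketch of this pipeline, the code below treats the "world" as a process that produces 0/1 outcomes with a hidden probability, collects a sample, and uses the sample proportion as the estimator. The hidden value `TRUE_P`, the sample size, and the function name are assumptions made for illustration; the interval uses the usual normal approximation.

```python
import math
import random


def estimate_proportion(observations):
    """Return the sample proportion and an approximate 95% interval for it."""
    n = len(observations)
    p_hat = sum(observations) / n                  # the estimator
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation standard error
    return p_hat, (p_hat - 1.96 * std_err, p_hat + 1.96 * std_err)


rng = random.Random(7)
TRUE_P = 0.3  # hypothetical "world" parameter; unknown to the analyst in practice

# Collect data from the stochastic process.
observations = [1 if rng.random() < TRUE_P else 0 for _ in range(5_000)]

# Infer something about the world from the data.
p_hat, (ci_low, ci_high) = estimate_proportion(observations)
print("estimate:", round(p_hat, 3), "interval:", (round(ci_low, 3), round(ci_high, 3)))
```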

Population and Sample
- The population is the complete set of traces/data points.
  - For example, the US population is 314 million and the world population is 7 billion.
  - All voters, all things.
- A sample is a subset of the complete set (or population). How we select the sample introduces biases into the data.
- In one example, households drawn from the US population of 314 million form the (monthly) sample.
- Population → mathematical model → sample
- (My) big-data approach for the world population: a k-ary tree (MR) of 1 billion (of the order of 7 billion): I basically forced the big-data solution/did not sample …etc.

Population and Sample (contd.)
- Example: emails sent by people in the CSE dept. in a year.
- Method 1: 1/10 of all emails over the year, randomly chosen.
- Method 2: 1/10 of the people, randomly chosen; all their emails over the year.
- Both are reasonable sample selection methods for analysis.
- However, the estimated pdfs (probability distribution functions) of the emails sent by a person will differ between the two samples.
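A small simulation can make the difference between the two methods concrete. Everything here is hypothetical: the 200 people, their yearly email counts, and the seed are invented for illustration. Under Method 1 each person appears to send roughly a tenth of their true volume, while under Method 2 the sampled people show their full volume, so the per-person distributions do not match.

```python
import random
from collections import Counter
from statistics import mean

rng = random.Random(42)

# Hypothetical department: 200 people, each with a made-up yearly email count.
emails_per_person = {pid: rng.randint(50, 2000) for pid in range(200)}

# One list entry per email, labelled with its sender.
all_emails = [pid for pid, n in emails_per_person.items() for _ in range(n)]

# Method 1: randomly choose 1/10 of all emails.
sample_emails = rng.sample(all_emails, len(all_emails) // 10)
counts_m1 = Counter(sample_emails)  # emails observed per sender in the sample

# Method 2: randomly choose 1/10 of the people and keep all their emails.
sample_people = rng.sample(sorted(emails_per_person), len(emails_per_person) // 10)
counts_m2 = {pid: emails_per_person[pid] for pid in sample_people}

mean_m1 = mean(counts_m1.values())
mean_m2 = mean(counts_m2.values())
print("Method 1, mean emails per observed person:", round(mean_m1, 1))
print("Method 2, mean emails per observed person:", round(mean_m2, 1))
```

Neither sample is wrong; they simply answer different questions, which is why the sampling method has to be stated alongside any estimate.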

Big Data vs statistical inference
- Sample size N
- For statistical inference, N < All.
- For big data, N == All.
- For some atypical big data analysis, N == 1: a world model through the eyes of a prolific Twitter user.

New Kinds of Data
- Traditional: numerical, categorical, or binary
- Text: emails, tweets, NY Times articles (ch. 4, 7)
- Records: user-level data, time-stamped event data, JSON-formatted log files (ch. 6, 8)
- Geo-based location data (ch. 2)
- Network data (ch. 10) (How do you sample and preserve network structure?)
- Sensor data (covered in Lin and Dyer's)
- Images

Big-data context
- For inference purposes, you don't need all the data.
- At Google (the originator of big-data algorithms), people sample all the time.
- However, if you want to render, you cannot sample.
- For some DNA-based searches, you cannot sample.
- If we draw conclusions from samples of Twitter data, we cannot extend them beyond the population that uses Twitter. This is what is happening now... be aware of biases.
- Another example is the tweets before and after Hurricane Sandy.
- Yelp example.

Modeling

Probability Distributions
- Normal, uniform, Cauchy, t-, F-, Chi-square, exponential, Weibull, lognormal, ...
- They are known as continuous density functions.
- A random variable x can be assumed to have a probability distribution p(x) if p maps each value of x to a non-negative real number.
- For a probability density function, integrating the function to find the area under the curve gives 1, which allows areas under the curve to be interpreted as probabilities.
- Further topics: joint distributions, conditional distributions, ...
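The "area under the curve is 1" property can be checked numerically. The sketch below uses the standard normal density as one example and a simple trapezoid rule; the integration limits and step count are arbitrary choices for illustration, and the helper names are made up.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def trapezoid(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# The tails beyond +/-10 standard deviations are negligible, so this is close to the full area.
total_area = trapezoid(normal_pdf, -10.0, 10.0)

# Area over an interval is the probability that x falls in that interval.
prob_within_1sd = trapezoid(normal_pdf, -1.0, 1.0)

print("total area:", round(total_area, 6))
print("P(-1 <= x <= 1):", round(prob_within_1sd, 4))
```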

Fitting a Model
- Fitting a model means estimating the parameters of the model: which distribution, and what are the values of min, max, mean, stddev, etc.
- Don't worry: R has built-in optimization algorithms that readily offer all these functionalities.
- It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods…
- Example: y = β1 + β2 ∗ x; once the model is fitted, β1 and β2 are replaced by their estimated numeric values.
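To make the linear example concrete, here is a minimal sketch that fits y = β1 + β2 ∗ x by ordinary least squares, which coincides with the maximum likelihood estimate when the noise is assumed to be Gaussian. The "true" parameter values, the noise level, and the synthetic data are all invented for illustration; in R the same fit would normally be done with a built-in routine rather than by hand.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates (beta1, beta2) for the model y = beta1 + beta2 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta2 = sxy / sxx               # slope
    beta1 = y_bar - beta2 * x_bar   # intercept
    return beta1, beta2


rng = random.Random(1)
TRUE_BETA1, TRUE_BETA2 = 2.0, 0.5  # hypothetical values used only to generate data

xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [TRUE_BETA1 + TRUE_BETA2 * x + rng.gauss(0, 1) for x in xs]  # add Gaussian noise

beta1_hat, beta2_hat = fit_line(xs, ys)
print("fitted model: y =", round(beta1_hat, 2), "+", round(beta2_hat, 2), "* x")
```

The fitted values should land near the generating values but not exactly on them, since the estimate depends on the particular random sample.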

Exploratory Data Analysis (EDA)
- Traditionally: histograms.
- EDA is the prototype phase of ML and other sophisticated approaches; see Figure 2.2.
- Basic tools of EDA are plots, graphs, and summary stats.
- It is a method for "systematically" going through data: plotting distributions, plotting time series, looking at pairwise relationships using scatter plots, generating summary stats (e.g. mean, min, max, upper and lower quartiles), and identifying outliers.
- Gain intuition and understand the data.
- EDA is done to understand big data before using expensive big-data methodology.
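The summary statistics listed above can be computed with Python's standard library, as in the sketch below. The data values are made up, and the 1.5 × IQR rule used to flag outliers is one common convention rather than something the slides prescribe.

```python
from statistics import mean, quantiles


def summarize(values):
    """Return basic summary statistics plus values flagged by the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical measurements with one suspicious value.
data = [12, 15, 14, 10, 13, 16, 11, 14, 13, 95]
summary = summarize(data)
for name, value in summary.items():
    print(name, value)
```

Comparing the mean with the median is often the quickest hint that an extreme value is pulling the distribution, which is the kind of intuition EDA is meant to build.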

Summary
- An excellent tool supporting EDA is R.
- We will go through the demo discussed in the text. This will be part of your Project 1.
- Something to do this weekend: work on the example discussed in the text about NY Times ad clicks.