Data Science Process Chapter 2 Rich's Training 11/13/2018.

Chapters 1 and 2: Data Science. Read Chapter 1 to get a perspective on “big data”. Chapter 2: statistical thinking in the age of big data. You build models to understand the data and to extract meaning and information from it: this is statistical inference.

Introduction: Data represents the traces of real-world processes. Which traces we collect depends on the sampling method. You build models to understand the data and to extract meaning and information from it: this is statistical inference. There are two sources of randomness and uncertainty: the process that generates the data is random, and the sampling process itself is random. Your mindset should be “statistical thinking in the age of big data”: combine the statistical approach with big data. Our goal for this chapter: understand the statistical process of dealing with data, and practice it using R.

Uncertainty and Randomness: Probability theory offers a mathematical model for uncertainty and randomness. A world or process is defined by one or more variables. The model of the world is defined by a function: Model == f(w) or f(x,y,z) (a multivariate function). The function is unknown and the model is unclear, at least initially. Typically our task is to come up with the model, given the data. Uncertainty is due to lack of knowledge, e.g. this week’s weather prediction (90% confident). Randomness is due to lack of predictability, e.g. which of the faces 1-6 comes up when rolling a die. Both can be expressed by probability theory.
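The die example can be made concrete with a short simulation. This is a minimal sketch in Python rather than R; the function name `roll_frequencies`, the seed, and the number of rolls are illustrative assumptions, not part of the slides. It shows that although each individual roll is unpredictable, the relative frequency of each face settles near 1/6, which is the sense in which probability theory describes randomness.

```python
import random
from collections import Counter

def roll_frequencies(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times; return the relative frequency of each face."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

freqs = roll_frequencies(60_000)
for face, freq in freqs.items():
    # Each frequency should be close to 1/6 (about 0.167)
    print(face, round(freq, 3))
```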

Statistical modeling: A model is built by asking questions. What are the basic variables? What are the underlying processes? What influences what? What causes what? What do you want to know?

Statistical Inference: World → collect data → capture the understanding/meaning of the data through models or functions → statistical estimators for predicting things about the world. Statistical inference is the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.

Population and Sample: The population is the complete set of traces/data points, for example the US population of 314 million or the world population of 7 billion; all voters, all things. A sample is a subset of the complete set (the population); how we select the sample introduces biases into the data. See an example at http://www.sca.isr.umich.edu/: out of the 314 million US population, 250,000 households form the sample (monthly). Population mathematical model → sample. (My) big-data approach for the world population: a k-ary tree of 1 billion (of the order of 7 billion); I basically forced the big-data solution and did not sample …etc.

Population and Sample (contd.): Example: emails sent by people in the CSE dept. in a year. Method 1: 1/10 of all emails over the year, randomly chosen. Method 2: 1/10 of people, randomly chosen, with all their email over the year. Both are reasonable sample selection methods for analysis. However, the estimated pdfs (probability distribution functions) of the emails sent by a person will be different for the two samples.
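The difference between the two sampling methods can be seen in a small simulation. This is a rough sketch in Python under invented assumptions: a hypothetical department of 500 people whose yearly email counts follow a skewed (exponential) distribution with a mean of about 200. None of these numbers come from the slides. Method 1 sees most people but only a fraction of each person's email; Method 2 sees few people but their complete counts, so the emails-per-person distribution estimated from each sample differs.

```python
import random
from collections import Counter
from statistics import mean

rng = random.Random(42)  # fixed seed (an assumption) for repeatability

# Hypothetical population: person id -> number of emails sent in the year
emails_per_person = {pid: int(rng.expovariate(1 / 200)) + 1 for pid in range(500)}

# One record per email, labelled with the id of its sender
all_emails = [pid for pid, n in emails_per_person.items() for _ in range(n)]

# Method 1: 1/10 of all emails, chosen at random
sample_emails = rng.sample(all_emails, len(all_emails) // 10)
counts_m1 = Counter(sample_emails)  # emails per person as seen in this sample

# Method 2: 1/10 of people, chosen at random, with all of their email
sample_people = rng.sample(sorted(emails_per_person), len(emails_per_person) // 10)
counts_m2 = {pid: emails_per_person[pid] for pid in sample_people}

print("Method 1:", len(counts_m1), "people seen, mean emails/person",
      round(mean(counts_m1.values()), 1))
print("Method 2:", len(counts_m2), "people seen, mean emails/person",
      round(mean(counts_m2.values()), 1))
```

Under Method 1 the per-person counts are roughly a tenth of the true counts and people who send little email can be missed entirely, so the raw counts would need rescaling before they estimate anything about an individual sender.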

Big Data vs statistical inference: Sample size N. For statistical inference, N < All. For big data, N == All. For some atypical big data analysis, N == 1: a model of the world through the eyes of a prolific Twitter user.

Big-data context: Analysis for inference purposes can use sampled data; you don’t need all the data. However, if you want to render an image, you cannot sample. For some DNA-based searches you cannot sample either. If we draw conclusions from samples of Twitter data, we cannot extend them beyond the population that uses Twitter. And this is what is happening now… be aware of biases. Another example is the tweets before and after Hurricane Sandy. Yelp example.

Modeling: A model is an abstraction of a real-world process. Let’s say we have a data set with two columns, x and y, and y depends on x; we could write it as y = β1 + β2∗x (a linear relationship). How do we build a model? Probability distribution functions (pdfs) are the building blocks of statistical models. Look at Figure 2-1 for various probability distributions…

Probability Distributions: Normal, uniform, Cauchy, t-, F-, chi-square, exponential, Weibull, lognormal, … These are known as continuous density functions. Any random variable x or y can be assumed to have a probability distribution p(x) if p maps each value of x to a non-negative real number. For a probability density function, integrating the function to find the area under the curve gives 1, which allows areas under the curve to be interpreted as probabilities. Further topics: joint distributions, conditional distributions.
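The "area under the curve is 1" property can be checked numerically. The following is a minimal Python sketch, assuming a standard normal density and a simple trapezoid rule; the helper names `normal_pdf` and `trapezoid_area` and the integration limits are illustrative choices. It also shows how the area over a sub-interval is read as a probability.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def trapezoid_area(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Area over a range wide enough to hold essentially all of the mass
area = trapezoid_area(normal_pdf, -8.0, 8.0)

# Area over [-1, 1]: the probability of falling within one standard deviation
prob_within_1sd = trapezoid_area(normal_pdf, -1.0, 1.0)

print(round(area, 6), round(prob_within_1sd, 4))
```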

Fitting a Model: Fitting a model means estimating the parameters of the model: what distribution, and what are the values of min, max, mean, stddev, etc. Don’t worry: R has built-in algorithms that readily offer all these functionalities. Fitting involves algorithms such as maximum likelihood estimation (MLE) and optimization methods… Example: y = β1 + β2∗x → y = 7.2 + 4.5*x.
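As an illustration of what "fitting" means for the linear example, here is a small Python sketch (the slides use R, so this is only an analogue). It generates synthetic data from the line y = 7.2 + 4.5*x with added Gaussian noise, then recovers the two parameters by ordinary least squares, which coincides with the maximum likelihood estimate of the coefficients when the noise is Gaussian. The data, noise level, seed, and the helper `fit_line` are assumptions made for the example.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Synthetic data: the slide's example line plus Gaussian noise with stddev 1
xs = [i / 10 for i in range(200)]
ys = [7.2 + 4.5 * x + rng.gauss(0, 1.0) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares estimates (beta1, beta2) for y = beta1 + beta2 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta2 = sxy / sxx               # slope
    beta1 = y_bar - beta2 * x_bar   # intercept
    return beta1, beta2

beta1, beta2 = fit_line(xs, ys)
print(f"y = {beta1:.2f} + {beta2:.2f}*x")
```

The estimates land close to, but not exactly on, 7.2 and 4.5, because the fit sees only a noisy sample of the underlying process.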

Exploratory Data Analysis (EDA): Traditionally: histograms. EDA is the prototype phase of machine learning and other sophisticated approaches; see Figure 2.2. The basic tools of EDA are plots, graphs, and summary statistics. It is a method for “systematically” going through data: plotting distributions, plotting time series, looking at pairwise relationships using scatter plots, generating summary statistics (e.g. mean, min, max, upper and lower quartiles), and identifying outliers. The aim is to gain intuition and understand the data. EDA is done to understand big data before using expensive big-data methodology.
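The summary statistics listed above can be computed in a few lines. This Python sketch uses only the standard library on made-up data (200 values around 50 plus two planted extreme values). The 1.5 × IQR fence used to flag outliers is a common convention chosen here for illustration; the slides do not prescribe it.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed (an assumption) for repeatability

# Made-up measurements, plus two planted extreme values
data = [rng.gauss(50, 5) for _ in range(200)] + [120.0, -30.0]

# Lower quartile, median, and upper quartile
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Flag points beyond 1.5 * IQR from the quartiles (a common rule of thumb)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sorted(v for v in data if v < lower_fence or v > upper_fence)

summary = {
    "n": len(data),
    "min": min(data),
    "lower quartile": q1,
    "median": median,
    "mean": statistics.mean(data),
    "upper quartile": q3,
    "max": max(data),
    "stddev": statistics.stdev(data),
}

for name, value in summary.items():
    print(f"{name:>15}: {value:.2f}")
print("outliers:", [round(v, 1) for v in outliers])
```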

The Data Science Process: Raw data collected → data is processed → data is cleaned → exploratory data analysis → machine learning algorithms; statistical models → build data products. Communication, visualization, report findings → make decisions.

Summary: An excellent tool supporting EDA is R. We will go through the demo discussed in the text. Something to do this weekend: read Chapter 2; work on the sample R code in the chapter, though it is not in your domain; find some data sets from your work, import them into R, analyze them, and share the results with your team. You collect, or can collect, a lot of data through the existing channels you have. Can you re-purpose the data using modern approaches to gain insights into your business?

Data Collection in Automobiles: Large volumes of data are being collected by the increasing number of sensors that are being added to modern automobiles. Traditionally this data is used for diagnostic purposes. How else can you use this data? How about predictive analytics? For example, can you predict the failure of a part based on the historical data and the on-board data collected? On-board diagnostics (OBDI) is a big thing in the auto domain. How can we do this?

Oil Price Prediction