Is Statistics=Data Science

Presentation on theme: "Is Statistics=Data Science"— Presentation transcript:

1 Is Statistics=Data Science
The big data issue Nairanjana Dasgupta

2

3 What determines big data: The 5 V’s
Volume: considered too large for regular software
Variety: often a mix of many different data types
Velocity: extreme speed at which data is generated
Variability: inconsistency of the data set
Veracity: how reliable is this data?

4 How big is big? By big we mean its volume is such that it is hard to analyze on a single computer. That in itself shouldn’t be problematic, but requiring specialized machines to analyze it has added to the myth and enigma of big data. The problem with big data, at least as I see it, is that some very pertinent statistical questions are bypassed when dealing with it.

5 Some statistical thoughts?
Is the big data a sample or a population? If it is really a population, then analysis means constructing summary statistics. This is bulky but not too difficult. If it is a sample, what was the sampling frame? If no population was considered when collecting this data, it is definitely not a representative sample. So, should one really do inference on BIG data? If one is allowed to do inference, wouldn’t the sheer size of the data give us so much power that we can pretty much reach any conclusion we test for?
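The point about power can be made concrete with a small calculation. The sketch below assumes a one-sample z-test of "mean = 0" with known standard deviation, and an observed shift of 0.01 standard deviations; both the test and the numbers are illustrative choices, not taken from the slides. The function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test of H0: mean = 0,
    when the observed sample mean equals `effect`."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > z) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# A practically negligible shift: 0.01 standard deviations (made-up value).
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  p-value = {two_sided_p_value(0.01, 1.0, n):.3g}")
```

The same tiny shift is far from significant at n = 100 but overwhelmingly "significant" once n reaches the millions, which is why significance alone says little when the data set is huge.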

6 Structure of data Generally, most data sets are rectangular in nature, with p variables and n observations collected. In big data we often have one of three situations: many more predictors than observations (the big p problem); many more (orders of magnitude more) observations than predictors (the big n problem); or both n and p are big and fluid, as they are constantly updated and amassed.

7 The Variety and Velocity Piece
Generally, opportunistic data is a mix of categorical, discrete, ordinal and continuous variables, and combinations of these as well. So if we treat it as multivariate data, we have to think about how to proceed. While not trivial, this can be surmounted without too much difficulty. The big issue is that the data is often being amassed (I am intentionally not using the word "collected") at a faster rate than it can actually be analyzed and summarized.
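One common way to keep summaries current while data keeps arriving is to update them one observation at a time instead of re-reading everything. The sketch below uses Welford's online algorithm for the mean and sample variance; it is offered as an illustration of the summarizing problem, not as something the slides prescribe, and the class name `RunningSummary` is hypothetical.

```python
class RunningSummary:
    """Welford's online algorithm: mean and sample variance in one pass,
    without storing the observations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (divides by n - 1); undefined for fewer than 2 points.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

# Made-up stream of values arriving one at a time.
summary = RunningSummary()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    summary.update(value)
print(summary.n, summary.mean, summary.variance)
```

This only addresses the summarizing step for a single numeric variable; it does nothing about the sampling and veracity questions raised elsewhere in the talk.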

8 Variability and Veracity
This type of data is extremely variable and there is no systematic model in place to capture the components of variability. Modeling is very hard when you have no idea about the sources of variability in these types of data sets. Veracity: is it measuring what we think it is? How truthful is this data? Just because it is big, is it really good? O’Donoghue and Herbert: “Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth.” (context of medical data)

9 Visualization of big data
Often called dashboards. Really a collection of well-known, age-old graphs that most of you can do in Excel! It is really just summary data in pretty colors. Don’t be fooled by these fancy terms.

10 Example of a Dashboard.

11 Prediction versus Inference
As the whole question of whether it is a sample or a population is itself muddy, let us leave inference out for now and focus on analysis. A common analysis method associated with opportunistic data is predictive analysis.

12 Predictive Analytics and big data
Encompasses: prediction models, machine learning and data mining, for prediction of the unknown using the history of the past. If we are predicting, are we inferring? I will assume it is okay to do that. Exploits patterns found in historical data and allows assessment of the risk associated with a particular set of conditions. Credit scoring has used predictive analytics for a long time. However, there, at least in the past, sampling was done in order to perform inference.

13 Techniques used in Predictive Analytics or supervised learning
Analytical Methods (modeled by humans): Regression techniques, Logistic regression, Time series models, Survival or duration analysis, Classification and Discrimination, Regression Trees, etc.
Machine Learning Methods (done by machines: no model): Neural networks, Multilayer perceptron, Radial basis functions, Support vector machines, Naïve Bayes, k-nearest neighbors, Geospatial predictive modeling, etc.

14

15 Supervised Learning The idea is learning from a known data set to predict the unknown. Essentially, we know the class labels ahead of time. What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes, so that if we have a new observation with its features we can correctly classify it. Machine Learning uses this idea and so it is very popular now.
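A minimal sketch of this idea, assuming a nearest-centroid rule (one of many possible rules, chosen here only because it is short): learn one mean vector per known class, then assign a new observation to the class with the closest mean. The data and the function names `fit_centroids` and `classify` are made up for illustration.

```python
from collections import defaultdict

def fit_centroids(features, labels):
    """Learn one mean vector (centroid) per class from labeled data."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(centroids, x):
    """The rule: assign x to the class whose centroid is closest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Made-up training data: two features, class labels known ahead of time.
train_x = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(train_x, train_y)
print(classify(centroids, [1.0, 1.0]))  # new observation near the "A" group
print(classify(centroids, [2.9, 3.1]))  # new observation near the "B" group
```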

16 Steps
Discriminant Analysis: Selection of features. Model fitting. Model validation using prediction of known classes.
Machine Learning: Feature selection is done by the computer. No model, but the computer determines the functions of the predictors used. Model is validated based on prediction of known classes.

17 Feature Selection Find which of the observed variables can actually distinguish between the classes of interest. This is variable selection.
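One simple way to screen variables, sketched here under the assumption of exactly two classes and numeric features, is to score each variable by how far apart the class means are relative to their standard error (an absolute Welch-type t-statistic) and rank by that score. This is only one of many selection criteria; the data, the feature names `x1` and `x2`, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def separation_score(values_a, values_b):
    """Absolute difference in class means divided by its standard error."""
    se = math.sqrt(variance(values_a) / len(values_a)
                   + variance(values_b) / len(values_b))
    return abs(mean(values_a) - mean(values_b)) / se

def rank_features(rows, labels, names):
    """Rank features from most to least separating (two classes only)."""
    class_a, class_b = sorted(set(labels))
    scores = {}
    for j, name in enumerate(names):
        values_a = [r[j] for r, y in zip(rows, labels) if y == class_a]
        values_b = [r[j] for r, y in zip(rows, labels) if y == class_b]
        scores[name] = separation_score(values_a, values_b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up data: x1 differs between the classes, x2 does not.
rows = [[1.0, 5.1], [1.2, 4.8], [0.9, 5.3], [3.1, 5.0], [2.9, 5.2], [3.0, 4.9]]
labels = ["A", "A", "A", "B", "B", "B"]

ranking = rank_features(rows, labels, ["x1", "x2"])
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Scoring variables one at a time ignores how they work in combination, so a ranking like this is a starting point rather than a final selection.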

18 MODEL FITTING Commonly used in Stats: LDA, QDA, K Nearest Neighbor, Logistic Regression

19 Without models we can use Machine Learning methods
Neural networks, Naïve Bayes, Support Vector machines, Perceptron, Decision Trees, Random Forests

20

21 Validation See how well the classifiers classify the observations into the different classes. The most commonly used method is leave-one-out cross-validation, though a test data set (holdout sample) and resubstitution are still used.
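To illustrate why the choice of validation scheme matters, the sketch below compares classifying the training points themselves against leave-one-out cross-validation. It assumes a 1-nearest-neighbor classifier and a tiny made-up one-dimensional data set with two overlapping points; the function names are hypothetical.

```python
def nearest_label(train, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, x))
    return min(train, key=lambda pair: sq_dist(pair[0]))[1]

def resubstitution_accuracy(data):
    """Classify each point using a training set that still contains it."""
    hits = sum(nearest_label(data, x) == y for x, y in data)
    return hits / len(data)

def loocv_accuracy(data):
    """Leave-one-out: classify each point using all the other points."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += nearest_label(rest, x) == y
    return hits / len(data)

# Made-up data: (features, class label); the classes overlap around 2.0.
data = [
    ((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"), ((2.1,), "A"),
    ((2.0,), "B"), ((2.4,), "B"), ((2.6,), "B"), ((2.8,), "B"),
]

print("resubstitution:", resubstitution_accuracy(data))
print("leave-one-out: ", loocv_accuracy(data))
```

With 1-nearest-neighbor, every training point is its own nearest neighbor, so the first estimate is perfect by construction, while leave-one-out exposes the two points in the overlap region.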

22 Recap of Part 4 The sticky problem is whether the data we have is a sample or a population. Inference is tough, as it is hard to figure out which population we are making inferences about. Predictive analytics is often associated with big data. At the end of the day, machines are faster and more efficient but cannot create interpretative models (not yet). We still don’t know if big data is good data; it depends upon who is collecting it and for what purpose.

23 Myth of Big Data There is no myth; it is just unwieldy, unstructured, under-designed data that is already being amassed. It still has to be good data for us to make good analyses and predictions. At the end of the day, to make inferences on data (big or small) we need it to be representative.

