Is Statistics = Data Science?

Is Statistics = Data Science? The big data issue
Nairanjana Dasgupta

What determines big data: the 5 V's
Volume: considered too large for regular software.
Variety: often a mix of many different data types.
Velocity: the extreme speed at which the data is generated.
Variability: inconsistency within the data set.
Veracity: how reliable the data is.

How big is big?
By "big" we mean the volume is such that it is hard to analyze the data on a single computer. That in itself shouldn't be problematic, but requiring specialized machines for the analysis has added to the myth and enigma of big data. The problem with big data, at least as I see it, is that some very pertinent statistical questions are bypassed when dealing with it.

Some statistical thoughts
Is big data a sample or a population? If it is really a population, then analysis means constructing summary statistics; this is bulky but not too difficult. If it is a sample, what was the sampling frame? If no population was considered when the data were collected, it is definitely not a representative sample. So, should one really do inference on BIG data? And if one does, wouldn't the sheer size of the data give us so much power that we could reach practically any decision we test for?
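To make the power point concrete, here is a minimal sketch (assuming numpy and scipy are available; the effect size of 0.01 standard deviations is purely illustrative) showing that with millions of observations a practically negligible difference in means still produces a vanishingly small p-value:

```python
# Sketch: with very large n, even a trivially small effect is "statistically significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 5_000_000
a = rng.normal(loc=0.00, scale=1.0, size=n)   # group A
b = rng.normal(loc=0.01, scale=1.0, size=n)   # group B: mean shifted by 0.01 sd

t, p = stats.ttest_ind(a, b)
print(f"difference in means: {b.mean() - a.mean():.4f}")
print(f"t = {t:.1f}, p = {p:.2e}")            # p is essentially zero despite a negligible effect
```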

Structure of data
Most data sets are rectangular, with p variables measured on n observations. In big data we often have many more predictors than observations (the big p problem), or many more (orders of magnitude more) observations than predictors (the big n problem), or both n and p are big and fluid, because the data are constantly updated and amassed.

The Variety and Velocity Piece
Opportunistic data is generally a mix of categorical, discrete, ordinal and continuous variables, so if we treat it as multivariate data we have to think about how to proceed. While not trivial, this can be surmounted without too much difficulty. The bigger issue is that the data is often being amassed (I am deliberately not saying "collected") at a faster rate than it can actually be analyzed and summarized.
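As a small illustration of handling mixed variable types before any multivariate analysis, here is a sketch assuming pandas is available; the column names and categories are made up for illustration:

```python
# Sketch: turning a mix of categorical, ordinal, discrete and continuous columns into a numeric matrix.
import pandas as pd

df = pd.DataFrame({
    "region": ["NW", "SE", "NW", "MW"],        # categorical (nominal)
    "rating": [1, 3, 2, 3],                     # ordinal (already numeric codes)
    "visits": [12, 0, 7, 3],                    # discrete count
    "spend":  [130.5, 0.0, 77.2, 15.9],         # continuous
})

# One-hot encode the nominal column; keep the numeric columns as they are.
X = pd.get_dummies(df, columns=["region"])
print(X)
```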

Variability and Veracity
This type of data is extremely variable, and there is no systematic model in place to capture the components of variability. Modeling is very hard when you have no idea about the sources of variability in the data. Veracity: is the data measuring what we think it is? How truthful is it? Just because it is big, is it really good? As O'Donoghue and Herbert put it (in the context of medical data): "Big data very often means 'dirty data', and the fraction of data inaccuracies increases with data volume growth."

Visualization of big data
Often called dashboards, these are really a collection of well-known, age-old graphs that most of you can produce in Excel. It is really just summary data in pretty colors. Don't be fooled by the fancy terms.

Example of a Dashboard.

Prediction versus Inference
Since the question of whether the data is a sample or a population is itself muddy, let us leave inference aside for now and focus on analysis. A common analysis approach associated with opportunistic data is predictive analytics.

Predictive Analytics and big data
Predictive analytics encompasses prediction models, machine learning and data mining: predicting the unknown using the history of the past. If we are predicting, are we inferring? I will assume it is okay to do so. It exploits patterns found in historical data and allows assessment of the risk associated with a particular set of conditions. Credit scoring has used predictive analytics for a long time, although there, at least in the past, sampling was done so that inference could be performed.

Techniques used in Predictive Analytics or supervised learning
Analytical methods (modeled by humans): regression techniques, logistic regression, time series models, survival or duration analysis, classification and discrimination, regression trees.
Machine learning methods (done by machines, no explicit model): neural networks (multilayer perceptron, radial basis functions), support vector machines, naïve Bayes, k-nearest neighbors, geospatial predictive modeling.

Supervised Learning
The idea is to learn from a known data set in order to predict the unknown. Essentially, we know the class labels ahead of time. What we need is a RULE, built from features in the data, that DISCRIMINATES effectively between the classes, so that when a new observation arrives with its features we can classify it correctly. Machine learning uses this idea, which is why it is so popular now.
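A minimal end-to-end sketch of this idea, assuming scikit-learn is installed and using its bundled iris data purely as a stand-in for "known data":

```python
# Sketch: learn a classification rule from labeled data, then apply it to new observations.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rule = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # the "RULE"
print("accuracy on new observations:", rule.score(X_test, y_test))
```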

Steps
Discriminant analysis: selection of features; model fitting; model validation using prediction of known classes.
Machine learning: feature selection is done by the computer; there is no model, but the computer determines the functions of the predictors used; the model is validated based on prediction of known classes.

Feature Selection
Find which of the observed variables can actually distinguish between the classes of interest. In statistical terms, this is variable selection.
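A small sketch of univariate feature selection, assuming scikit-learn and the same illustrative iris data as above:

```python
# Sketch: keep only the features that best separate the classes (univariate F-test scores).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print("F scores per feature:", selector.scores_.round(1))
print("selected feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)   # data reduced to the 2 most discriminating features
```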

Model Fitting
Classifiers commonly used in statistics: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors, logistic regression.

Without models we can use machine learning methods
Neural networks, naïve Bayes, support vector machines, the perceptron, decision trees, random forests.
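To make the previous two slides concrete, here is a sketch (scikit-learn assumed, iris data again as a stand-in) that fits both the classical statistical classifiers and the machine-learning ones and compares their apparent accuracy:

```python
# Sketch: fit the statistical classifiers (LDA, QDA, k-NN, logistic regression) and the
# machine-learning ones (naive Bayes, SVM, decision tree, random forest, neural network)
# on the same illustrative data and compare their apparent (training) accuracy.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Neural network": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    print(f"{name:20s} {model.fit(X, y).score(X, y):.3f}")
```

Accuracy computed on the training data like this is optimistic, which is exactly why the validation step on the next slide matters.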

Validation
See how well the classifiers classify observations into the different classes. The most commonly used method is leave-one-out cross-validation, though a test data set (holdout sample) and resubstitution are still used.
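A sketch of leave-one-out cross-validation, in the same assumed scikit-learn setting:

```python
# Sketch: leave-one-out cross-validation of a classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())   # each observation is held out once and predicted
```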

Recap of Part 4
The sticky problem is whether the data we have is a sample or a population. Inference is tough, as it is hard to figure out what population we are inferring about. Predictive analytics is often associated with big data. At the end of the day, machines are faster and more efficient, but they cannot (yet) create interpretable models. We still don't know whether big data is good data; that depends on who is collecting it and for what purpose.

Myth of Big Data
There is no myth: it is just unwieldy, unstructured, under-designed data that is already being amassed. It still has to be good data for us to produce good analyses and predictions. At the end of the day, to make inferences from data (big or small), we need it to be representative.