Introduction Marco Puts



From primary to secondary data

What is this all about? We see a big shift from primary to secondary data.

Primary observation (surveys and census):
- The data are collected to fit the research purpose, and are therefore sometimes less flexible for other topics or uses.
- No demanding data cleaning, processing, or filtering is required before statistical analysis.
- Response rates are in continuous decline, and there is pressure to be cost-effective and to reduce the reporting burden.

Secondary sources, administrative:
- Collected by governmental departments; by law available to Statistics Netherlands (SN), which has access to about 1000 variables through admin sources.
- Higher demand on making the data fit for the analysis purpose (superfluous data, errors).
- Flexibility in the use of the data source; traditional estimation techniques apply.

Secondary sources, Big Data:
- Data cleaning, filtering, and processing are very demanding.
- Data mining is needed to obtain the relevant data that describe the phenomenon.
- Estimation techniques different from those known until now have yet to be developed.

What is Big Data?

According to some methodologists, it is just another way of making estimates: non-probability sampling, and so on. Some people still talk about the three V's (volume, velocity, variety); this is, however, a very technical definition of Big Data. Or maybe Big Data is a needle in a haystack: there is so much data that we have to find the right elements in this big heap of data.

Is Big Data cause or effect?

How big is big? The right question is rather: how big should big be? What is big today will be intermediate or even small tomorrow, and this should not influence the definition of Big Data. A better question to ask when talking about the definition of Big Data is: is Big Data cause or effect?

The Signal and the Data

To explain this further, we make a detour into information theory. Claude Shannon came up with an excellent model of information, signals, and how noise deteriorates a signal. Shannon defined the amount of information in a dataset as the number of bits necessary to store the data; this is an information measure. As one can imagine, the number of bits can be calculated in two different ways, which correspond to lossless and lossy compression.
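As a rough illustration (my sketch, not from the presentation), the Shannon entropy gives the lower bound on the number of bits per symbol needed to store a dataset losslessly:

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy in bits per symbol: a lower bound on the average
    number of bits needed to store each symbol losslessly."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = "AAAABBBCCD"
print(entropy_bits(data))              # ~1.85 bits per symbol
print(entropy_bits(data) * len(data))  # ~18.5 bits for the whole string
```

Lossy compression goes below this bound by discarding the part of the data we do not consider informative.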

It is the lossy compression that is interesting for the definition of Big Data from an official statistics perspective. When we compare survey data and administrative data from an information-theoretic point of view, we see that the number of bits we receive through surveys is really low: we only gather what we need to fulfil our information need, and we can do this quite well. Due to stratification, we query fewer units of a certain kind, which means that we compress at the source. For administrative data, we already gather more: we gather data for all the units and aggregate them, so compression is done at the statistical side. For Big Data, each single data element is not very informative, because it is not generated to answer our information need at all. To get good quality statistics, much more data needs to be gathered. So Big Data is more an effect than a cause.

The Signal and the Data

But what is noise? We have to find the right definition of noise. In this bistable image (by Oleg Shuplyak), one can see the head of an old, bearded man or some people at a lake. When you see the people at the lake, the percept of the head is inhibited; it is seen as noise. And the other way around: when you see the old man, the people at the lake are inhibited and thus interpreted as noise.

The Signal and the Data

What is noise?

data = information + noise

Noise is orthogonal to what we see as information: noise is the part of the data that is not relevant. So the art of dealing with Big Data is segregating the information and the noise out of the original data.
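A toy illustration (my sketch, not from the slides): in a least-squares fit, the fitted part and the residual are orthogonal by construction, so the data splits exactly as data = information + noise:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
data = 2.0 * t + rng.normal(scale=0.3, size=t.size)  # trend plus noise

# Least-squares fit of a line: the projection of the data onto the
# model space is the "information" part.
X = np.column_stack([np.ones_like(t), t])
coef, *_ = np.linalg.lstsq(X, data, rcond=None)
information = X @ coef
noise = data - information

# The residual is orthogonal to the fitted signal (up to rounding).
print(information @ noise)  # ~0
```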

The Signal and the Data

This filtering process should be implemented in a fully automated way: since the data consist of very many elements, it is impossible to do anything by hand. The process separates the signal from the data, and the only human interaction concerns process parameters. The process is controlled by changing these parameters on the basis of quality indicators. Filtering processes can be:
- deleting data elements that are unnecessary;
- digital filtering: removing noise from a signal (see the sketch below).
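As a hedged sketch of the second kind of filtering (a plain moving-average low-pass filter, my choice; the slides do not name a specific filter):

```python
import numpy as np

def moving_average(x, window=25):
    """Simple low-pass filter: each output point is the mean of the
    surrounding `window` input points, which suppresses fast noise."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 500)
noisy = np.sin(t) + rng.normal(scale=0.5, size=t.size)  # the gray dots
smooth = moving_average(noisy)                          # filtered signal
```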

Results of the filter

The latter is depicted here. The original data (shown as gray dots) are really noisy; to find the interesting part of this data set, the data are filtered.

Signals, Signals Everywhere

Now I will show you some signals that you can find in the world. The first example concerns AIS data. AIS, the automatic identification system, is a system in which vessels (maritime as well as inland waterway) share their location via a radio signal. This data set was gathered for the Netherlands, and one can see some peculiar properties of the signal. Question: what are the peculiar properties?

Signals, Signals Everywhere

Another example is mobile phone data. Here we transformed the data set into a time series in which one can see the daytime population of the Netherlands.

Signals, Signals Everywhere: GDP and traffic

Yet another example is a signal based on traffic loop data and GDP.

Signals, Signals Everywhere

And, finally, our example that shows the relationship between the general sentiment on social media and consumer confidence.

Signals, Signals Everywhere

Signals also appear in official statistics. They are:
- discrete;
- low-pass filtered (most of the time);
- cyclic (seasonal effects);
- …

A sketch of such a series follows below.
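As a hedged illustration (a synthetic monthly series; both the series and the simple decomposition are my own choices, not from the slides), a discrete, cyclic signal with a seasonal component:

```python
import numpy as np

months = np.arange(120)                       # ten years of monthly data
trend = 0.05 * months
seasonal = 1.5 * np.sin(2 * np.pi * months / 12)
noise = np.random.default_rng(3).normal(scale=0.4, size=months.size)
series = trend + seasonal + noise

# Estimate the seasonal component: remove a rough linear trend, then
# average each calendar month over the years.
detrended = series - np.poly1d(np.polyfit(months, series, 1))(months)
seasonal_est = np.array([detrended[m::12].mean() for m in range(12)])
```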

Signals, Signals Everywhere

For all these signals it holds that they obey Shannon's ideas about information, and thus we can view each and every statistical process as a filtering process.

Big Data Processing

In a Big Data process we globally see the following steps, which take Big Data down to microdata:
1. Transform and reduce
2. Missing values & outliers
3. Dimensionality reduction

Processing Big Data: Transform and Reduce

Get rid of all data that is not interesting for the process.
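A hypothetical sketch of this step (the file name and column names are illustrative, not from the presentation), keeping only the fields and records the statistical process needs:

```python
import pandas as pd

# Hypothetical raw road-sensor records.
raw = pd.read_csv("sensor_records.csv")

# Keep only the columns of interest, drop impossible records, and
# normalise the timestamp; everything else is discarded here.
reduced = (
    raw[["sensor_id", "timestamp", "vehicle_count"]]
    .query("vehicle_count >= 0")
    .assign(timestamp=lambda d: pd.to_datetime(d["timestamp"]))
)
```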

Processing Big Data: Missing Values & Outliers

Techniques for dealing with missing values and outliers include:
- Kalman filter
- particle filter
- linear regression
- neural networks
- …
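As one hedged illustration of the first item (a textbook one-dimensional random-walk Kalman filter, not the exact model used in the presentation), missing observations can be bridged by the prediction step:

```python
import numpy as np

def kalman_1d(obs, q=1e-3, r=0.5):
    """One-dimensional random-walk Kalman filter. `obs` may contain
    np.nan for missing values; those steps use the prediction only."""
    x, p = 0.0, 1.0          # state estimate and its variance
    out = np.empty(len(obs))
    for i, z in enumerate(obs):
        p += q               # predict: state unchanged, uncertainty grows
        if not np.isnan(z):  # update only when a measurement exists
            k = p / (p + r)  # Kalman gain
            x += k * (z - x)
            p *= 1 - k
        out[i] = x
    return out
```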

Processing Big Data: Dimensionality Reduction

In dimensionality reduction we try to find the intrinsic dimensionality of the data. Suppose you have a very high-dimensional dataset with lots of columns. In many cases the data in the different columns are not completely uncorrelated, and it is quite normal that such a very high-dimensional problem can be reduced to a couple of dimensions. In the example here, we have a two-dimensional data set that can be described as a one-dimensional dataset.
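A minimal sketch of that two-dimensional example (using PCA via the singular value decomposition; the slide itself does not name a technique):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two strongly correlated columns: the cloud is essentially a line.
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=300)])

# PCA via the SVD of the centered data.
centered = data - data.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
print(s**2 / np.sum(s**2))  # first component carries ~99% of the variance

one_d = centered @ vt[0]    # the one-dimensional description
```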

From processing to data science

Machine learning applied to Big Data: the process described here is not that different from the processes proposed by others. Here we see a machine learning approach to Big Data which also identifies the steps proposed in this presentation. Because its authors describe a complete research cycle, some additional steps are introduced.