1
Introduction Marco Puts
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
From primary to secondary data
Surveys and census · Admin sources · Big Data
What is this all about? We see a big shift from primary to secondary data.
Primary observation: the data are collected to fit the research purpose and are therefore sometimes less flexible for other topics or uses. No demanding data cleaning, processing, or filtering is required before statistical analysis. However, response rates are in continuous decline, and there is pressure to be cost-effective and to reduce the reporting burden.
Secondary sources, admin: collected by governmental departments and by law available to SN; around 1000 variables come in through admin sources. There is a higher demand on making the data fit for the analysis purpose (superfluous data, errors), but more flexibility in the use of the data source, and traditional estimation techniques still apply.
Secondary sources, Big Data: data cleaning, filtering, and processing are very demanding; data mining is needed to obtain the relevant data to describe the phenomenon; estimation techniques different from those known until now still have to be developed.
(Timeline on the slide: 2000 BC, 0 BC/AD, 20th century, 21st century.)
3
What is Big Data? (V, V, V)
According to some methodologists it is just another way of estimation: non-probability sampling, etc. Some people still talk about the three V's, but that is a very technical definition of Big Data. Or maybe it is a needle in a haystack: there is so much data, and we have to find the right elements in this big heap of data.
4
Is Big Data cause or effect?
What is Big Data? Is Big Data cause or effect? How big is big? The right question to ask is more like: how big should big be? What is big today will be intermediate or even small tomorrow, and this should not influence the definition of Big Data. A better question to ask when talking about the definition of Big Data is: is Big Data cause or effect?
5
The Signal and the Data. To explain this further, we make a detour into information theory. Claude Shannon came up with an excellent model of information, signals, and how noise deteriorates a signal. Shannon defined the amount of information in a dataset as the number of bits necessary to store the data; this is an information measure. As one can imagine, the number of bits can be calculated in two different ways, which can be described as lossy and lossless compression.
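To make the bits-as-information idea concrete, here is a minimal Python sketch (not from the slides; the toy dataset is made up) that estimates the Shannon entropy of a dataset, i.e. the average number of bits per element needed to store it losslessly, and compares the raw size with what a standard lossless compressor achieves.

```python
import math
import zlib
from collections import Counter

def shannon_entropy_bits(values):
    """Average number of bits per element needed to store `values`
    losslessly, estimated from the empirical Shannon entropy."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy, highly skewed dataset: low entropy, so few bits per element.
data = [1] * 900 + [2] * 50 + [3] * 50
print(f"entropy: {shannon_entropy_bits(data):.2f} bits per element")

# Lossless compression of the same data as bytes, for comparison.
raw = bytes(data)
print(f"raw: {len(raw)} bytes, compressed: {len(zlib.compress(raw))} bytes")
```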
6
It is lossy compression that is interesting for the definition of Big Data from an official statistics perspective. When we compare survey data and administrative data from an information-theoretic point of view, we see that the number of bits we receive from surveys is really low. We only gather what we need to fulfil our information need, and we can do this quite well: due to stratification, we query fewer units of a certain kind, which means that we compress at the source. For administrative data, we already gather more data; in fact, we gather data for all the units and aggregate them, so compression is done at the statistical side. For Big Data, each single data element is not very informative, because it was not generated to answer our information need at all. To get good quality statistics, much more data needs to be gathered. So Big Data is more an effect than a cause.
7
The Signal and the Data. But what is noise?
However, we have to find the right definition of noise. In this bistable image (by Oleg Shuplyak), we can see either the head of an old, bearded man or some people at a lake. When you see the people at the lake, the percept of the head is inhibited; it is seen as noise. The other way around, when you see the old man, the people at the lake are inhibited and thus interpreted as noise.
8
The Signal and the Data What is noise? data = information + noise
Noise is orthogonal to what we see as information; noise is that part of the data that is not relevant. So the art of dealing with Big Data is separating the information and the noise out of the original data. Noise is that part of the data that is not relevant!
9
The Signal and the Data. This filtering process should be implemented in a fully automated way: since the data consist of a lot of data elements, it is impossible to do anything by hand. The process separates the signal from the data, and the only human interaction concerns process parameters. The process is controlled by changing these parameters based on quality indicators. Filtering processes can be:
Deleting data elements that are unnecessary
Digital filtering: removing noise from a signal (a minimal sketch of the latter follows below)
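As a hedged illustration of the second kind of filtering, here is a minimal Python sketch (not from the slides): a slow sine wave stands in for the information, random noise is added, and a simple moving-average low-pass filter recovers the signal. The window length plays the role of a process parameter that would be tuned via quality indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 3 * t)                  # the information
data = signal + 0.5 * rng.standard_normal(t.size)   # information + noise

window = 25                                          # process parameter
kernel = np.ones(window) / window
filtered = np.convolve(data, kernel, mode="same")    # low-pass filter

# The filtered series should sit much closer to the true signal.
print("RMSE noisy   :", np.sqrt(np.mean((data - signal) ** 2)))
print("RMSE filtered:", np.sqrt(np.mean((filtered - signal) ** 2)))
```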
10
Results of the filter. The latter is depicted here: the original data (shown as gray dots) are really noisy, and to find the interesting part of this data set, the data are filtered.
11
Signals Signals Everywhere
For now, I will show you some signals that you can find in the world. The first example is about AIS data. AIS, the Automatic Identification System, is a system in which vessels (maritime as well as inland waterway) share their location over a radio signal. This data set is gathered for the Netherlands, and one can see some peculiar properties of the signal. Question: what are the peculiar properties?
12
Signals Signals Everywhere
Another one is mobile phone data. Here, we transformed the data set into a time series in which one can see the daytime population in the Netherlands.
13
Signals Signals Everywhere
- GDP
- Traffic
Yet another one is a signal based on the traffic loop data and the GDP.
14
Signals Signals Everywhere
And, finally, our example that shows the relationship between the general sentiment on social media and consumer confidence.
15
Signals Signals Everywhere
Official Statistics:
Discrete signals (often)
Low-pass filtered
Cyclic (seasonal effects)
…
Signals are also found in official statistics. They are discrete, most of the time low-pass filtered, and cyclic (a toy example follows below).
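As a toy example (purely illustrative, not one of the series shown in the slides), the following Python snippet generates a monthly indicator with exactly these properties: discrete (one value per month), cyclic (a yearly seasonal pattern), and dominated by slow components.

```python
import numpy as np

months = np.arange(120)                          # 10 years, monthly -> discrete
trend = 100 + 0.3 * months                       # slow, low-frequency component
seasonal = 5 * np.sin(2 * np.pi * months / 12)   # yearly cycle (seasonal effect)
noise = np.random.default_rng(1).normal(0, 1, months.size)
indicator = trend + seasonal + noise
print(indicator[:12].round(1))                   # first year of the series
```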
16
Signals Signals Everywhere
For all signals, it holds that they obey Shannon's ideas about information and, thus, that we can view each and every statistical process as a filtering process.
17
Big Data Processing
(Diagram: Big Data → Transform and Reduce → Missing Values & Outliers → Dimensionality Reduction → Microdata)
In a Big Data process, we globally see the following steps:
Transform and reduce
Missing values and outliers
Dimensionality reduction
18
Processing Big Data: Transform and Reduce. Get rid of all data that is not interesting for the process (a hypothetical sketch follows below).
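A hypothetical sketch of this step in Python with pandas (the AIS-like column names and thresholds are invented for illustration): keep only the columns and rows needed for the statistic, and drop everything else as early as possible.

```python
import pandas as pd

# Invented AIS-like micro-records; only position and speed matter here.
raw = pd.DataFrame({
    "mmsi": [1, 1, 2, 3],
    "lat": [52.1, 52.2, 51.9, 53.0],
    "lon": [4.3, 4.4, 4.1, 5.2],
    "speed": [0.0, 12.5, 8.0, 0.1],
    "free_text": ["...", "...", "...", "..."],  # not interesting for the process
})

# Reduce: drop columns we never use and rows of vessels lying still.
reduced = raw.drop(columns=["free_text"]).query("speed > 0.5")
print(reduced)
```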
19
Missing Values & outliers
Processing Big Data: Missing Values & Outliers
Kalman filter
Particle filter
Linear regression
Neural networks
…
(A minimal Kalman-filter sketch follows below.)
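Of the techniques listed above, the Kalman filter is perhaps the easiest to sketch. The following minimal Python example (a one-dimensional random-walk model with invented variances, not the production setup) shows how it smooths a noisy series and interpolates over missing observations.

```python
import numpy as np

def kalman_smooth(observations, process_var=1e-3, obs_var=0.5):
    """Minimal 1-D Kalman filter with a random-walk state model.
    NaN observations are treated as missing: the update step is skipped,
    so the filter interpolates over gaps."""
    x, p = 0.0, 1.0                      # initial state estimate and variance
    estimates = []
    for z in observations:
        p = p + process_var              # predict: uncertainty grows
        if not np.isnan(z):              # update only when observed
            k = p / (p + obs_var)        # Kalman gain
            x = x + k * (z - x)
            p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

obs = np.array([1.0, 1.2, np.nan, np.nan, 1.6, 1.5, np.nan, 1.8])
print(kalman_smooth(obs))
```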
20
Dimensionality Reduction
Processing Big Data: Dimensionality Reduction. In dimensionality reduction, we try to find the intrinsic dimensionality of the data. Suppose you have a very high-dimensional dataset with lots of columns. In many cases, the data in the different columns are not completely uncorrelated, and it is quite normal that such a high-dimensional problem can be reduced to a couple of dimensions. In the example here, we have a two-dimensional data set which can be described as a one-dimensional dataset.
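The two-dimensional to one-dimensional example can be reproduced with principal component analysis. Here is a minimal Python sketch (toy data, not the data set on the slide): two strongly correlated columns are reduced to a single component that captures almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(200)
data = np.column_stack([x, 2 * x + 0.1 * rng.standard_normal(200)])  # correlated columns

centered = data - data.mean(axis=0)
# Singular value decomposition gives the principal directions.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("variance explained per component:", np.round(explained, 3))

# Project onto the first principal component: the 1-D representation.
one_d = centered @ vt[0]
print("reduced shape:", one_d.shape)
```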
21
From processing to data science
Machine Learning applied to Big Data. The process described here is not that different from processes proposed by others. Here we see a machine learning approach for Big Data, which also identifies the steps proposed in this presentation. Because they describe a complete research cycle, some other steps are introduced.