Unlocking the Full Potential of Big Data Lilli Japec, Frauke Kreuter JOS anniversary June 2015.


1 facebook.com/statisticssweden @SCB_nyheter Unlocking the Full Potential of Big Data Lilli Japec, Frauke Kreuter JOS anniversary June 2015

2 The report is available at https://www.aapor.org

3 Task Force Members: Lilli Japec, Co-Chair, Statistics Sweden; Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB; Marcus Berg, Stockholm University; Paul Biemer, RTI International; Paul Decker, Mathematica Policy Research; Cliff Lampe, School of Information at the University of Michigan; Julia Lane, American Institutes for Research; Cathy O’Neil, Johnson Research Labs; Abe Usher, HumanGeo Group

4 AAPOR (American Association for Public Opinion Research)  a professional organization dedicated to advancing the study of “public opinion,” broadly defined, to include attitudes, norms, values, and behaviors  promotes best practices and transparency  works to educate its members as well as policy makers, the media, and the public at large to help them make better use of surveys and survey findings, and to inform them about new developments in the field  other task force reports available on https://www.aapor.org

5 Outline of our presentations  What is Big Data?  Paradigm shift  Big Data activities in different organizations  Skills required  Big Data process and data quality

6 UNTIL RECENTLY three main data sources

7 Administrative Data Survey Data Experiments

8 NOW

9 US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

10 Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.

11 Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).

12 Big Data http://www.rosebt.com/blog/data-veracity

13 Hope that found/organic data Can replace or augment expensive data collections More (= better) data for decision making Information available in (nearly) real time

14 New paradigm  New business model  Federal agencies no longer major players  New analytical model  Outliers  Fine-grained analysis  New units of analysis  New sets of skills  Computer scientists  Citizen scientists  Different cost structure Source: Julia Lane

15 Eurostat  Big Data Action Plan and Roadmap  Pilots exploring the potential of selected big data sources  The project will also include activities on:  Methodological frameworks,  Quality frameworks,  Metadata frameworks,  IT infrastructures,  Communication,  Legal frameworks,  Ethical frameworks,  Skills and training, and  Experience sharing.

16 UNECE and Big Data  The “Sandbox” provides a computing environment to load Big Data sets and tools  Consumer price indices – experimenting with the computation of price indexes  Mobile telephone data – statistics on tourism and daily commuting  Smart meters – statistics on power consumption using data collected from smart meter readings  Traffic loops – traffic statistics using data from traffic loops  Social media – using Twitter data to analyze sentiment and tourism flows  Job portals – computing statistics on job vacancies  Web scraping – testing methods for automatically collecting data from web sources

17 UNECE Big Data Inventory

18 Statistics Netherlands: Roadmap BIG DATA Two focus projects: the use of traffic loop data for transportation statistics; the use of mobile phone data for daytime population and tourism statistics. Six other projects: the use of internet data for price statistics; investigating the use of bank and credit card transactions; the use of social media data for detecting trends in social cohesion; the use of internet data for encoding enterprise purchases and sales; investigating the use of smartcards of public transport for statistics; and the use of internet data for statistics about job vacancies. Source: Pieter Vlag, Statistics Netherlands

19 Examples from Statistics Sweden  Scanner data to improve the Household Budget Survey  Job vacancy statistics by scraping the web  Evaluating the use of AIS (Automatic Identification System) data. Cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa). Research funding from the Swedish Innovation Agency (Vinnova).

20 One day data Source: Moström and Justesen, Statistics Sweden

21 SKILLS What tasks are required to get there?

22 We have to do this jointly … Research Questions (Examples: behavior of interest, such as migration, political participation, job searches); Data Generating Process (Examples: geolocated social media + survey + administrative data); Data Curation/Storage (Example: Hadoop Distributed File System); Data Analysis (Example: Hadoop MapReduce; High Frequency Data); Data Output/Access (Example: map visualization / privacy)

23 Source: Abe Usher

24 Big words … What is big data? What is Hadoop File System? (HDFS) What is Hadoop MapReduce? (MR) How do you link surveys with big data? Source: Abe Usher
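The MapReduce question on this slide can be made concrete with a small sketch. The code below is plain Python, not Hadoop: it only imitates the three phases (map, shuffle, reduce) in a single process to count words. The function names and the sample `posts` are hypothetical and chosen for illustration; a real Hadoop job would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample data standing in for social media posts.
posts = ["looking for a job", "new job today", "moving for a new job"]
counts = reduce_phase(shuffle_phase(map_phase(posts)))
print(counts["job"])
```

Because each mapper looks at one record at a time and each reducer at one key at a time, the same logic can be spread over many machines, which is the property that makes the model useful for very large data sets.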

25 System Administrator: Storage systems (MySQL, HBase, Spark); Cloud computing: Amazon Web Services (AWS), Google Compute Engine; Hadoop ecosystem. Computer scientist: Data preparation; MapReduce algorithms; Python/R programming; Hadoop ecosystem. Source: Abe Usher

26 RESEARCH What do we know about the data generating process?

27 Veracity Who? What? Why? Who is missing? Who is counted repeatedly? What is not said / measured ... and why?

28 But (at least) one more V http://www.rosebt.com/blog/data-veracity

29 Errors in Big Data: An Illustration. Terrorist Detector: Suppose 1 in 1,000,000 people are terrorists. The Big Data Terrorist Detector is 99.9% accurate. The detector says your friend, Jack, is a terrorist. What are the odds that Jack is really a terrorist? Source: Paul Biemer

30 Errors in Big Data: An Illustration. Terrorist Detector: Suppose 1 in 1,000,000 people are terrorists. The Big Data Terrorist Detector is 99.9% accurate. The detector says your friend, Jack, is a terrorist. What are the odds that Jack is really a terrorist? Answer: about 1 in 1,000, i.e., 99.9% of the terrorist detections will be false! Source: Paul Biemer
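The answer on this slide follows from Bayes' rule, and a short calculation makes the arithmetic explicit. The sketch below assumes that "99.9% accurate" means the detector flags 99.9% of actual terrorists (sensitivity) and clears 99.9% of everyone else (specificity); the slide does not spell this out, so that reading and the function name are assumptions.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a flagged person is a true positive (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumption: "99.9% accurate" applies to both sensitivity and specificity.
ppv = positive_predictive_value(
    prevalence=1 / 1_000_000,
    sensitivity=0.999,
    specificity=0.999,
)
print(f"P(terrorist | flagged) = {ppv:.6f}")
print(f"about 1 in {round(1 / ppv):,}")
```

Under these assumptions the chance that Jack is really a terrorist is a little under 0.1%, in line with the slide's "1 in 1000": because true cases are so rare, the false positives among the other 999,999 people swamp the true detections.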

31 Big Data Process Map: Generate (Source 1, Source 2, Source K); ETL (Extract, Transform (Cleanse), Load (Store)); Analyze (Filter/Reduction (Sampling), Computation/Analysis (Visualization)). Source: Paul Biemer

32 Big Data Process Map: Generation (Source 1, Source 2, Source K); ETL (Extract, Transform (Cleanse), Load (Store)); Analyze (Filter/Reduction (Sampling), Computation/Analysis (Visualization)). Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or non-representative) sources; meta-data that are lacking, absent, or erroneous. Source: Paul Biemer

33 Big Data Process Map: Generation (Source 1, Source 2, Source K); ETL (Extract, Transform (Cleanse), Load (Store)); Analyze (Filter/Reduction (Sampling), Computation/Analysis (Visualization)). Errors include: specification error (including errors in meta-data), matching error, coding error, editing error, data munging errors, and data integration errors. Source: Paul Biemer

34 Big Data Process Map: Generation (Source 1, Source 2, Source K); ETL (Extract, Transform (Cleanse), Load (Store)); Analyze (Filter/Reduction (Sampling), Computation/Analysis (Visualization)). Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data. Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors. Source: Paul Biemer

35 Big Data Process Map: Generation (Source 1, Source 2, Source K); ETL (Extract, Transform (Cleanse), Load (Store)); Analyze (Filter/Reduction (Sampling), Computation/Analysis (Visualization)). Errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors. Source: Paul Biemer

36 POTENTIAL

37 We have to do this jointly … Research Questions (Examples: behavior of interest, such as migration, political participation, job searches): Any field; Data Generating Process (Examples: geolocated social media + survey + administrative data): Social Science & Psychology, Humanities, Econ, Business; Data Curation/Storage (Example: Hadoop Distributed File System): Math & Computer Science, Applied Statistics; Data Analysis (Example: Hadoop MapReduce; High Frequency Data): Economics, Social Sciences, Business, Math&Comp; Data Output/Access (Example: map visualization / privacy): Psychology, Law, Math&Comp, Business

38 ... and think about the legal framework

