Reproducible Research: the Need for Data Access (in a Big Data Age) SEM, Paris, July, 22nd 2015 Stefan Bender (Deutsche Bundesbank)


1 Reproducible Research: the Need for Data Access (in a Big Data Age) SEM, Paris, July, 22nd 2015 Stefan Bender (Deutsche Bundesbank)

2 Aim of the Presentation
My background: giving the research community access to highly sensitive data.
The future: the research community's use of found data will increase.
• Nature of found data (big data)
• Data generating process
• Paradigm shift in research
• Access to found data
7/22/15 Bender: SEM 2015 Page 2

3 Literature
AAPOR Big Data Task Force (2015): AAPOR Report on Big Data. February 12, 2015. Lilli Japec, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O’Neil, Abe Usher.
Lane, Julia; Stodden, Victoria; Bender, Stefan; Nissenbaum, Helen (eds.) (2014): Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge: Cambridge University Press.

5 US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

6 Initial Claims for Unemployment Insurance (seasonally adjusted), U.S. DOL; Prediction, University of Michigan Social Media Job Loss Index.

7 (Common) Definition of Big Data http://www.rosebt.com/blog/data-veracity

8 Types of Big Data Sources
• Social media data
• Personal data (e.g. data from tracking devices)
• Sensor data
• Transactional data
• Administrative data
Examples: cell phone usage, web scraping, search queries, sensor and scanner data.

9 Characteristics of Big Data I
Secondary data: collected for some non-research purpose and then reused by researchers.
The amount of control a researcher has, and the potential inferential power, vary between the different types of big data sources.

10 Characteristics of Big Data II
[Diagram: made data vs. found data (Taylor 2013); designed data vs. organic data (Groves 2011); data sources shown: surveys, administrative data, Big Data]

11 Surveys vs. Admin Data vs. Big Data
• All of these are complementary data sources, not competing ones.
• The approaches differ, and that is an advantage.

12 Data Generating Process I
• Big Data are often selective, incomplete and erroneous.
• Big Data are typically aggregated from disparate sources at various points in time and integrated to form data sets.
• Thus, using Big Data in statistically valid ways is increasingly challenging.

13 Data Generating Process II
• The volume of the data cannot compensate for any other deficiency in the data.
• Big Data hubris fails to recognize that “... quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability ...” (Lazer et al. 2014:2)
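
The point that volume cannot compensate for other deficiencies can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the presentation: the population, the trait distribution, and the inclusion rule are hypothetical assumptions chosen only for illustration. It compares a small simple random sample with a much larger "found" sample whose chance of being captured depends on the value being measured.

```python
import random
import statistics


def compare_samples(seed=0, population_size=100_000, srs_size=500):
    """Compare a small random sample with a large, selective 'found' sample.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Hypothetical population: one numeric trait, roughly normal around 50.
    population = [rng.gauss(50, 10) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Designed data: a simple random sample of modest size.
    srs = rng.sample(population, srs_size)

    # Found data: units with higher values are more likely to be captured
    # (assumed inclusion rule, clipped to the range [0, 1]).
    found = [
        x for x in population
        if rng.random() < min(1.0, max(0.0, (x - 20) / 60))
    ]

    return {
        "true_mean": true_mean,
        "srs_n": len(srs),
        "srs_error": abs(statistics.fmean(srs) - true_mean),
        "found_n": len(found),
        "found_error": abs(statistics.fmean(found) - true_mean),
    }


if __name__ == "__main__":
    for key, value in compare_samples().items():
        print(key, round(value, 3))
```

Under these assumptions the found sample is far larger than the random sample, yet its mean is systematically off, because the selection mechanism rather than the sample size drives the error.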

14 Big Data Total Error (BDTE)
In-depth knowledge is needed of the data generating mechanism, the data processing infrastructure and the approaches used to create a specific data set or the estimates derived from it. The concept builds on “Total Survey Error (TSE)”.
Main issue: how the errors could affect inference.

15 Big Data Process Map (AAPOR 2015: 21)

16 But (at least) one more V http://www.rosebt.com/blog/data-veracity

17 Reproducibility
• Many platforms that produce statistics with Big Data change their algorithms (algorithm dynamics), which makes results ambiguous for any kind of long-term study.
• Strong need for (documentation) standards.
• Strong need for a definition of reproducibility.
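
One way to picture what a documentation standard might record is a small provenance record stored alongside each published statistic. This is only a sketch under stated assumptions: the field names, the function names `provenance_record` and `differences`, and the example index name are hypothetical and do not come from the presentation or from any existing standard. It fingerprints the input data and notes the algorithm version, so that a later change in the algorithm becomes visible when two records are compared.

```python
import hashlib
import json


def provenance_record(raw_data: bytes, algorithm: str, version: str, parameters: dict) -> dict:
    """Build a hypothetical provenance record for one computed statistic."""
    return {
        # Fingerprint of the exact input data that was used.
        "data_sha256": hashlib.sha256(raw_data).hexdigest(),
        # Which algorithm produced the statistic, and in which version.
        "algorithm": algorithm,
        "algorithm_version": version,
        "parameters": parameters,
    }


def differences(old: dict, new: dict) -> list:
    """Return the sorted names of the fields that differ between two records."""
    return sorted(key for key in old.keys() | new.keys() if old.get(key) != new.get(key))


if __name__ == "__main__":
    data = b"week,claims\n1,100\n2,120\n"
    first = provenance_record(data, "example_index", "1.0", {"window": 7})
    second = provenance_record(data, "example_index", "1.1", {"window": 7})
    print(json.dumps(first, indent=2))
    print("changed fields:", differences(first, second))
```

Comparing records in this way does not make a result reproducible by itself, but it shows whether two values of a time series were produced under the same documented conditions.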

18 Big Data is not just data
Imprecise description of a rich and complicated set of
• characteristics,
• practices,
• techniques,
• ethical issues, and
• outcomes
all associated with data.

19 Paradigm shift I
Changes in
• the measurement of human behavior,
• the nature of the new types of data,
• their availability,
• how they are collected and mixed with other data sources, and
• how they are disseminated.

20 Paradigm shift II
The classic statistical paradigm (Groves 2011):
• formulate a hypothesis,
• identify a population frame,
• design a survey and a sampling technique,
• analyze the results.
The new paradigm:
• digitally capture, semantically reconcile, aggregate, and correlate data.

21 Cost-Benefit: Big Data
“The mining of personal data can help increase welfare, lower search costs, and reduce economic inefficiencies; at the same time, it can be source of losses, economic inequalities, and power imbalances between those who hold the data and those whose data is controlled.” (Acquisti 2014, p. 98)

22 Data Access in Research Data Centers (RDCs)
Data producers:
• Survey studies
• Official statistics
• Big Data
Research Data Center
Data users

23 Modes of Data Access
Off-Site Access: Email, encrypted (Scientific Use File); Remote Execution (near future)
On-Site Access: Guest Stay
Factually anonymous / Weakly anonymous (= confidential)
Output control

24 Tasks of an RDC
An RDC offers access to (highly sensitive) micro data for non-commercial research. It
• generates micro data (linking data),
• offers an advisory service on data selection and data access (handling, potential, scope and validity of data),
• provides data access and data protection,
• documents data and methodological aspects of data.

25 Factsheet on the RDSC
• The RDSC started in 2014 as part of the Statistics Department of the Bundesbank.
• It continues tasks formerly performed by the Research Center:
  • Screening project applications
  • Granting access to micro data
  • Performing output control
• 120 active projects, 10 employees (increasing), 12 workplaces for guest researchers

26 Bundesbank Data Treasures
❙ Banks: Monthly balance sheets statistics (BISTA); External position of banks; Quarterly borrowers statistics; MFI interest rate statistics (MIR)
❙ Banking supervision: Prudential information system (BAKIS); Large credit micro data base (MiMiK) [special restrictions for access]
❙ Securities: Securities Holdings Statistics
❙ Enterprises: Microdatabase Direct Investment (MiDi); Statistics on International Trade in Services (SITS); Corporate balance sheets (Ustan)
❙ Households: German Panel on Household Finances (PHF) [available as a Scientific Use File]

27 Conclusion: The New Oil Data Quality Generating Process? (Greenwood et al. 2014)

28 Extract, Transform, Load Reproducibility

30 Data Protection, Access, Ownership Trust, Data and Ethics

31 Conclusion
• Research is about answering questions.
• Start by utilizing all of the information that is available, including surveys, admin data and Big Data.
• Fantastic possibilities: we can take the best of all worlds: Big Data, surveys and admin data.
• There is a need for public-private partnerships to blend data and to ensure data access and reproducibility (while meeting privacy requirements).

32 Contact information: www.bundesbank.de/fdsz fdsz@bundesbank.de

33 (Some) International developments
• UMETRICS: Universities: Measuring the Impacts of Research on Innovation, Competitiveness, and Science
• IRIS is designed to transform UMETRICS into a permanent national resource by creating a secure professional data platform for the research community and university administrators
• Modernizing Federal Statistics: Census Innovation Measurement Initiative
• Triangle Census Research Network
• Administrative Data Research Network (UK)

34 The Research Data Center (FDZ-BA)
• FDZ-BA: Research Data Center of the German Federal Employment Agency (BA)
• Located at the Institute for Employment Research (IAB) in Nuremberg, Germany
• Established in 2004
• Facilitates access to survey and administrative labor market data for non-commercial empirical research
• Jörg Heining is now head of the FDZ.

35 Remote Access Centers of FDZ Additional Sites: UK Data Archive, Colchester, UK and Princeton University

36 Data Available at FDZ - Overview External/Open Data

37 New Data Developments (selection)
• Patent data
• Geocoded data
• Commercial business data (Bureau van Dijk, BvD): Combined BvD-IEB data

38 That's what it looks like (1) Location: 20th floor of the Trianon Tower in Frankfurt (near the main railway station)

39 That's what it looks like (2)

