Reproducible computational social science
Allen Lee, Center for Behavior, Institutions, and the Environment
https://cbie.asu.edu



Computational Social Science
– Wicked collective action problems
– Innovation -> Problems -> Innovation
– Mitigate transaction costs for information transfer

Methodologies
– Case study analysis
– Controlled experiments
– Computational modeling
– Integrative data analysis / natural experiments

Case Study Analysis
seshatdatabank.info: “Our goal is to test rival social scientific hypotheses with historical and archaeological data … treating history as a predictive, analytic science.”

SES Library
– Descriptions of social-ecological systems from around the world
– Embeds mathematical models of specific social-ecological dynamics, tied to specific cases where relevant, via xppaut

Controlled Behavioral Experiments
– Web-based experiments: Mechanical Turk, oTree, nodeGame, vcweb
– Desktop experiments: zTree, CoNG, foraging, irrigation
– Diversity in software platforms is valuable but also presents challenges
– General issues summarized in Experimental platforms for behavioral experiments on social-ecological systems (Janssen, Lee, Waring, 2014)

Computational Modeling
– Extrapolate potential future scenarios for complex systems with many interacting actors
– Computational modeling makes the processes underlying complex phenomena explicit, sharable, and reproducible
– Assumptions are laid bare, and alternative assumptions / parameterizations can be explored via sensitivity analysis
– George Box: “All models are wrong, but some are useful”
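A minimal sketch of the parameter exploration described above, assuming a toy logistic-growth model as a stand-in; the `grow` function and the parameter grid are illustrative, not from the talk:

```python
# Toy sensitivity analysis: run a simple model under alternative
# parameterizations and compare outcomes. The "model" here is a
# logistic map standing in for a richer agent-based model.

def grow(r, x0=0.1, steps=50):
    """Hypothetical model: iterate the logistic map for a growth rate r."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

def sweep(params):
    """Run the model once per parameter value; return {param: outcome}."""
    return {r: grow(r) for r in params}

# Explore alternative parameterizations side by side.
results = sweep([1.5, 2.0, 2.5, 3.0])
```

The same pattern scales to any model whose assumptions are exposed as explicit parameters: swap `grow` for the real simulation and compare the resulting outcome dictionary across runs.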

Multiple Methods
– Convergent validity: multiple methods complement each other, e.g., experiments, case study analysis, formal modeling (Poteete et al., 2010)

Reproducibility
– Victoria Stodden: how do we know inference is reliable, and why should we believe “Big Data” findings?
– Need new standards for conducting “Data and Computational Science” and communicating results: sound workflows, sharing specifications, guides to good practice
– Distinguishing between empirical, statistical, and computational reproducibility

Replicable Research Workflows
– Planning, organizing, and documenting your research protocols
– Developing code for data analysis or experiments
– Running your analyses (generating visualizations) or conducting experiments (generating data)
– Presenting / publishing findings
– Cleaning and documenting your code and data
– Archival and documentation with contextual metadata that preserves provenance
https://osf.io is a good example of a full-stack system
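One way to sketch the "contextual metadata that preserves provenance" step: record the run context (environment, inputs, parameters) alongside the results. The field names below are illustrative, not a formal metadata standard:

```python
# Sketch of capturing coarse-grain provenance for one analysis run:
# who ran what, on which inputs, in which environment.
import hashlib
import json
import platform
from datetime import datetime, timezone

def file_sha256(path):
    """Checksum an input file so the exact bytes used are on record."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance_record(script, input_paths, parameters):
    """Bundle run context into a dict that can be archived as JSON."""
    return {
        "script": script,
        "parameters": parameters,
        "inputs": {p: file_sha256(p) for p in input_paths},
        "python": platform.python_version(),
        "platform": platform.platform(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Writing `json.dumps(provenance_record(...))` next to each output file gives later readers the minimum context needed to rerun or audit the analysis.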

Archiving data
Vines TH et al. (2013) Current Biology, DOI: /j.cub

CoMSES Net
– Computational Model Library for archiving model code; next generation in active development and planning stages
– Provides a suite of microservices for transparency and reproducibility in computational modeling

The MIRACLE project: Cyberinfrastructure for visualizing model outputs
Dawn Parker, Michael Barton, Terence Dawson, Tatiana Filatova, Xiongbing Jin, Allen Lee, Ju-Sung Lee, Lorenzo Milazzo, Calvin Pritchard, J. Gary Polhill, Kirsten Robinson, and Alexey Voinov

Background and motivation
– Growing interest in analyzing highly detailed “big data”
– Concurrent development of a new generation of simulation models, including ABMs, which themselves produce “big data” as outputs
– Need for tools and methods to analyze and compare these two data sources

Motivation
– Sharing model code is great, but there are large barriers to entry to getting someone else’s model running (Collberg et al., 2015)
– Sharing model output data can accomplish many of the goals of code sharing
– It also lets other researchers explore new parameter spaces, or use different algorithms
– Sharing of analysis algorithms may jump-start development of complex-systems-specific output analysis methods

Objectives
– Collect, extend, and share methods for statistical analysis and visualization of output from computational agent-based models of coupled human and natural systems (ABM-CHANS)
– Provide interactive visualization and analysis of archived model output data for ABM-CHANS models

Objectives, cont.
– Conduct meta-analyses of our own projects, and invite the ABM-CHANS community to conduct further meta-analyses using the new tools
– Apply the statistical analysis algorithms we develop to empirical datasets to validate their applicability to large-scale data from complex social systems

Metadata for ABM output data
Goals:
– User needs to understand the data (what’s inside the files, what are the relationships between the files, project and owners, …)
– User needs to know how the data were generated (input data, analysis scripts, parameters, computer environment, workflows that chain several scripts, …)
Two types of metadata:
– Metadata that describe the current state of data (data structure, file and data table content → fine-grain metadata)
– Metadata that describe the provenance of data (how the data were generated → coarse-grain metadata)

Capturing metadata
Goal: automated metadata extraction with minimum user input
Fine-grain metadata:
– Automatically extract metadata from files (CSV columns, ArcGIS Shapefile metadata and attribute table columns, etc.)
Coarse-grain metadata:
– A workflow describes how a script could produce a certain file type, while provenance describes how script A produced file B
– Provenance can be automatically captured when the user runs scripts and workflows using the MIRACLE system (computer environment, user name, application name, process, input files and parameters, output files)
– Workflows can be constructed based on captured provenance
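The fine-grain case can be sketched in a few lines: pull column names and inferred column types out of a CSV output file with no user input. This is an illustration of the idea, not MIRACLE's actual extractor, and the type inference is deliberately minimal:

```python
# Sketch of automated fine-grain metadata extraction from a CSV
# model-output file: column names plus crudely inferred column types.
import csv

def infer_type(values):
    """Classify a column as int, float, or text by trial conversion."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"

def csv_metadata(path):
    """Return {column_name: inferred_type} for one CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = list(zip(*reader))  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}
```

A real system would add file sizes, row counts, and format-specific readers (e.g., for Shapefile attribute tables), but the shape of the problem is the same: derive descriptive metadata mechanically from the artifact itself.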

MIRACLE platform use cases
Within a research group:
– Efficiently share and discuss new model results
– Let group members explore new parameter spaces
– Create accessible archives for publications
Across groups:
– Provide prototypes to new researchers, or those looking for new analysis methods
– Provide examples for teaching and labs
– Facilitate additional “after-market” research and publication

MIRACLE project goals
– Develop, share, test, and compare new statistical methods appropriate for analysis of complex systems data
– Improve communication and assessment within the modeling community
– Reduce barriers to entry for use of models
– Improve the ability of policy makers and stakeholders to understand and interact with model output

CoMSES Net: Catalog
– Track the state of archival
– Provide collective-action tools to incentivize model sharing

CoMSES Net: Catalog

CoMSES Net Future Goals
– Provide a one-stop shop for computational modeling
– Containerized execution with bundled dependencies
– Integration with Jupyter and CyVerse, and with modeling platforms like RePast and NetLogo
– Re-parameterizable data analysis and exploration via the MIRACLE project
– Bibliometric tracking
– Collective action tools to incentivize prosocial behavior among scientists


Guide to good practice
– Learn to use a source control system (git, mercurial, SVN)
– Use it with discipline:
  – commit early, commit often
  – write meaningful log messages
  – create tags and releases at important checkpoints during the research process
– List versioned dependencies (e.g., packrat, Maven/Gradle, pip)
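One concrete way to connect version control to an analysis, sketched here as an illustration: record the exact commit hash next to each batch of results, so every output can be traced back to the code that produced it. The function below is a generic helper, not part of any tool mentioned in the talk:

```python
# Sketch: tie an analysis run to the exact code version by recording
# the current git commit hash. Degrades gracefully when git or a
# repository is unavailable.
import subprocess

def current_commit():
    """Return the HEAD commit hash, or None if git/repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
```

Stamping this hash into output filenames or provenance metadata makes "which version of the code made this figure?" answerable long after the fact.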

Guide to good practice
– Plan for reproducibility
– Use version control efficiently
– Archive everything: data, code, and contextual / provenance metadata
– Prefer open, durable formats (plaintext, CSV, open file formats)
– Use cloud backups
– Automate where possible
– Learn the basics of “software carpentry”

Guides to good practice

Computational Social Science

Comments / Questions?