National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research Scientist
NSF ACI Data Program
  Kenton McHenry, $10,519
  Bill Michener, $21,194
  Golam Choudhury, $10,085
  Reagan Moore, $8,300
  Steven Ruggles, $7,993
  Margaret Hedstrom, $8,000
  Alex Szalay, $7,603, Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond
  Michael Levine, $4,902, The Data Exacell
  Xiaohui Carol Song, $3,409, Integrating Geospatial Capabilities into HUBzero
CIF21 DIBBs: Brown Dog NSF ACI $10,519,716 PI: Kenton McHenry, Ph.D. Co-PI: Jong Lee, Ph.D. Co-PI: Barbara Minsker, Ph.D. Co-PI: Praveen Kumar, Ph.D. Co-PI: Michael Dietze, Ph.D.
The Problem
The Scientific Method: Question Hypothesis Testing Procedure Analysis Result
When the procedure is executed, one obtains the same result every time!
The majority of science today involves procedures that include software and digital data. Both have relatively short lifespans!
The Problem Large collections of un-curated and/or unstructured digital data (“long-tail” data) Many file formats No metadata No useful filenames No useful directory structure No textual contents
What is needed (from the data side) Means of deciphering the bytes that make up digital data so that one can retrieve its contents Data Structures (e.g. images, 3D points, sound waves, strings, fields, matrices, etc…) Means of indexing data contents so that large collections of data can be searched and desired data found An ability to compare data
What is needed (from the data side) The file format specifications describing how contents are represented within the file’s bytes, the software used to create and view the data, and the execution environment (platform, operating system, libraries, other software, etc…). The existence of metadata describing the data (possibly as simple as useful file/directory names), in order to search/index data.
Additional Considerations
  software is also a factor in this (i.e. the data side)
  obsolete operating systems and platforms
  storage requirements (e.g. storing a working environment in a virtual machine)
  software that is no longer available
  software licensing
  the existence of many file formats (even for the same kind of data)
  lack of standards for data formats or enforceability of standards
  large, complex file format specifications
  unavailable format specifications (either lost or proprietary)
  the ease and reward of creating data versus the burden of curation (e.g. organizing and providing metadata for files)
  different metadata standards
  assuring the long-term availability of preserved software and data
  assuring the archive preserving the software and data exists over a reasonably long period of time
  assuring the archival tools needed to index, find, access, view, retrieve, and utilize the software and data within the archive exist over a reasonably long period of time (being software themselves)
  a growing recognition of the need for academic reward, and perhaps education, surrounding the costly products of software development and data creation
  the necessity for science to build on the work of others and have software and data reused (possibly in ways not remotely considered by the creator and crossing into other disciplines)
  the need for computation during the analysis of data collections
  a means of efficiently and reliably transferring large amounts of data
What Brown Dog Addresses Accessing Data Contents with a Lack of Standards and Many File Formats Discovering and Finding Data with a Lack of Curation while also Considering the Need to Preserve Software and Provide Credit for Software Development Creating Tools for Accessing Data while Addressing Archival Tool Sustainability
Sustainable Software Cyberinfrastructure
Knowing our history:
  NCSA Telnet, 1986: Gaige Paulsen, Tim Krauskopf, Aaron Contorer
  Mosaic, 1993: Marc Andreessen, Eric Bina
    Netscape, Internet Explorer, Firefox, Chrome (84% of browser traffic)
  httpd (and CGI), 1993: Robert McCool
    Apache (64% of all webservers)
All built to access supercomputing resources
Though they still serve this purpose, none will be remembered for that!
Sustainable Software Cyberinfrastructure Knowing our history: Funded to meet scientific need(s) Broad appeal (i.e. the general public) Free (e.g. open source) Broad public appeal to sustain and drive scientific software post funding
The Domain Name System (DNS)
  Originally written by Paul Mockapetris in 1983
  Distributed database to translate domain names (i.e. strings) into IP addresses (i.e. 4 bytes)
  13 logical root servers (A-M), 359 instances worldwide
  Internet Corporation for Assigned Names and Numbers (ICANN)
  Essential part of the modern internet!
  Used constantly by all yet largely invisible
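As a small illustration of the translation described above, the sketch below uses Python's standard `ipaddress` module to show that a dotted IPv4 address is a 4-byte value. The address 192.0.2.1 is a reserved documentation address chosen here for illustration, and the actual name lookup is left as a comment because it needs network access.

```python
import ipaddress


def ipv4_to_bytes(dotted: str) -> bytes:
    """Return the 4-byte packed form of a dotted-quad IPv4 address."""
    return ipaddress.IPv4Address(dotted).packed


# 192.0.2.1 comes from a range reserved for documentation (illustrative only).
packed = ipv4_to_bytes("192.0.2.1")
print(len(packed), list(packed))

# A real name-to-address lookup goes through the system resolver, e.g.:
#   import socket
#   address = socket.gethostbyname("example.org")
```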
Data Access Proxy (DAP)
  A highly extensible and distributed service for carrying out file format conversions
  Move towards an internet/world that is agnostic to file formats
  Aid in accessing a file's contents independent of how it is represented on disk
Data Tilling Service (DTS)
  An extensible and distributed service for the extraction of new data or metadata from a file's contents
  Provide means to query and/or relate collections of data without metadata
Data Conversion: A transformation on digital data that largely preserves the entirety of the data. Largely reversible.
Data Extraction: A transformation on digital data which creates new, often higher level, data from the contents of the given data (e.g. tags, signatures). Not reversible.
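The distinction between conversion and extraction can be sketched with a toy example. This is not Brown Dog code: the function names, the CSV sample, and the choice of a SHA-256 digest as a "signature" are all assumptions made for illustration. The conversion round-trips exactly for this simple input, while the extraction result cannot be turned back into the original file.

```python
import csv
import hashlib
import io
import json


def csv_to_json(text: str) -> str:
    """Conversion: re-encode the same rows in a different format."""
    rows = list(csv.reader(io.StringIO(text)))
    return json.dumps(rows)


def json_to_csv(text: str) -> str:
    """Inverse conversion: recover the CSV form from the JSON form."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(json.loads(text))
    return out.getvalue()


def extract(text: str) -> dict:
    """Extraction: derive new, higher-level data (a signature and tags)."""
    rows = list(csv.reader(io.StringIO(text)))
    return {
        "signature": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "tags": rows[0],             # column names reused as tags
        "row_count": len(rows) - 1,  # data rows, excluding the header
    }


original = "species,count\noak,12\nmaple,7\n"
round_trip = json_to_csv(csv_to_json(original))  # conversion is reversible here
derived = extract(original)                      # extraction is one-way
print(round_trip == original, derived["tags"], derived["row_count"])
```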
Brown Dog Data Transformation Services The Data Access Proxy (DAP) File in, File out The Data Tilling Service File in, JSON out JSON can contain metadata, tags, signatures, links to derived data products, etc…
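To make "File in, JSON out" concrete, here is a sketch of what a DTS-style result might look like and how a client could read it. The JSON keys (`metadata`, `tags`, `signatures`, `derived`), the values, and the `dts.example.org` link are hypothetical; the slides only say that the JSON can contain metadata, tags, signatures, and links to derived data products.

```python
import json

# Hypothetical DTS response body. All keys and values are illustrative assumptions.
response_text = """
{
  "metadata": {"width": 1024, "height": 768},
  "tags": ["map", "handwriting"],
  "signatures": [{"name": "example-histogram", "values": [0.2, 0.5, 0.3]}],
  "derived": ["https://dts.example.org/files/123/preview.png"]
}
"""


def summarize(document: dict) -> dict:
    """Count the kinds of information present in a DTS-style result."""
    return {
        "metadata_keys": sorted(document.get("metadata", {})),
        "tag_count": len(document.get("tags", [])),
        "signature_names": [s["name"] for s in document.get("signatures", [])],
        "derived_links": len(document.get("derived", [])),
    }


summary = summarize(json.loads(response_text))
print(summary)
```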
Brown Dog Data Transformation Services Services!!! Provide a programmable interface (e.g. REST) Client applications build on top of these services Back with computational resources Place to preserve/reuse software/tools
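A client application built on a REST interface might prepare its requests along the following lines. This is a minimal sketch under stated assumptions: the host names, the `/convert/<format>` and `/extract` routes, and the `file` payload field are hypothetical and are not taken from the Brown Dog services. The requests are built but not sent.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical service locations and routes; the source does not give real ones.
DAP_BASE = "https://dap.example.org"
DTS_BASE = "https://dts.example.org"


def _post_json(url: str, payload: dict, accept: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )


def dap_convert_request(file_url: str, output_format: str) -> urllib.request.Request:
    """File in, file out: ask the DAP to convert file_url to output_format."""
    route = f"{DAP_BASE}/convert/{urllib.parse.quote(output_format, safe='')}"
    return _post_json(route, {"file": file_url}, accept="*/*")


def dts_extract_request(file_url: str) -> urllib.request.Request:
    """File in, JSON out: ask the DTS for metadata, tags, and signatures."""
    return _post_json(f"{DTS_BASE}/extract", {"file": file_url}, accept="application/json")


convert = dap_convert_request("https://data.example.org/site1.xlsx", "csv")
extract = dts_extract_request("https://data.example.org/scan042.tiff")
print(convert.get_method(), convert.full_url)
print(extract.get_method(), extract.full_url)
# A client would send these with urllib.request.urlopen(convert); omitted here
# because the hosts above are placeholders.
```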
Brown Dog Use Cases Addressed specifically here: Biology Ecology Civil and Environmental Engineering Social Science Towards all science Early User Workshop!!
Ecosystems and Climate Change The Predictive Ecosystem Analyzer (PEcAn) Models: Ecosystem Demography (ED) SIPNET DALEC Data: Biofuel Ecophysiological Trait and Yield Database (BETY) Forest Inventory and Analysis (FIA) North American Regional Reanalysis (NARR) North American Carbon Program (NACP) Food and Agriculture Organization (FAO) … Lots of conversions taking place!!!
Ecosystems and Climate Change MODIS (Multi-spectral) Lidar Palsar (Radar) Aviris (Airborne Infrared Spectrometer) Landsat (Images)
Ecosystems and Climate Change
Settlement Vegetation data
Born Physical: Paper, Microfiche, Alphanumeric/Color coded on vellum sheets
Born Digital: PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS
Ad hoc formats: Spreadsheets, Databases, Services, R Data, Matlab Data
Document, Image, Spatial, Tabular, Weather, 3D, Archive, Database, Filesystem, …
[Diagram, repeated once per example, relating the DAP and DTS to: Data Collection, Native Byte Encoding, Data Structures, Derived Data/Metadata, Applications]
General (Usable Data): Data Collection: URL, File System, …; Native Byte Encoding: File Formats, Data Bases, Websites, Documents; Data Structures: Arrays, Strings, Images, Videos, Audio, 3D Models, …; Derived Data/Metadata: Tags, Signatures; Applications: Search, Relate, View, Process
Data Collection: Weather Data; Native Byte Encoding: Various Formats; Data Structures: Tabular; Derived Data/Metadata: Intermediary Analysis Results; Applications: Climate Modeling
Data Collection: Handwritten Settlement Vegetation Data; Native Byte Encoding: Various Image Formats; Data Structures: Image; Derived Data/Metadata: Text, Number Values; Applications: Climate Modeling
Data Collection: MODIS Satellite Data; Native Byte Encoding: Various Image Formats; Data Structures: Image; Derived Data/Metadata: Land Cover/Usage/…; Applications: Climate Modeling
Data Collection: Lidar Data; Native Byte Encoding: LAS; Data Structures: Depth; Derived Data/Metadata: Floodplains; Applications: Flood Plain Analysis
Data Collection: Lidar Data; Native Byte Encoding: LAS; Data Structures: Depth, Polygons; Derived Data/Metadata: Floodplains, Depth Distribution; Applications: Flood Plain Analysis
Data Collection: Lidar Data; Native Byte Encoding: LAS; Data Structures: Depth, Plot; Derived Data/Metadata: River cross-sections, Maturity; Applications: Flood Plain Analysis
Data Collection: Lidar Data; Native Byte Encoding: LAS; Data Structures: Depth, Polygons; Derived Data/Metadata: Stream detection, Sinuosity; Applications: Flood Plain Analysis
Data Collection: Architecture/Design Images; Native Byte Encoding: Various Image Formats; Data Structures: Image; Derived Data/Metadata: Measure of Aesthetic Appeal; Applications: Green Infrastructure Design
Data Collection: Architecture/Landscape Models; Native Byte Encoding: Various 3D Formats; Data Structures: 3D Model; Derived Data/Metadata: Synthetic Images; Applications: Green Infrastructure Design
Data Collection: Photographs; Native Byte Encoding: Various Image Formats; Data Structures: Image; Derived Data/Metadata: 3D Model; Applications: Green Infrastructure Design
Data Collection: Groupscope; Native Byte Encoding: Various Video Formats; Data Structures: Video; Derived Data/Metadata: People Locations/Interactions; Applications: Large Dynamic Group Behavior
Brown Dog
The Data Access Proxy (Demo) Kenton McHenry The Data Tilling Service (Demo) Luigi Marini
Technology
K. McHenry, R. Kooper, P. Bajcsy, "Towards a Universal, Quantifiable, and Scalable File Format Converter", The IEEE International Conference on eScience
M. Ondrejcek, K. McHenry, P. Bajcsy, "The Conversion Software Registry", Microsoft eScience Workshop in San Francisco, CA
K. McHenry, M. Ondrejcek, L. Marini, R. Kooper, P. Bajcsy, "Towards a Universal Viewer for Digital Content", International Conference on Computer Science, Executable Paper Workshop
K. McHenry, R. Kooper, L. Marini, M. Ondrejcek, "The ISDA Tools: Preserving 3D Digital Content", The Preservation of Complex Objects Symposia
K. McHenry, R. Kooper, M. Ondrejcek, L. Marini, P. Bajcsy, "A Mosaic of Software", The IEEE International Conference on eScience
L. Marini, P. Bajcsy, S. Padhy, A. Vandecreme, R. Kooper, B. Long, M. Ondrejcek, P. Saba, D. Bonnie, J. Chalfoun, K. McHenry, "Versus: A Framework for General Content-Based Comparisons", IEEE eScience
L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, "Digitization and Search: A Non-Traditional Use of HPC", IEEE eScience Workshop on Extending High Performance Computing Beyond its Traditional User Communities
L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, "A Framework to Access Hand-written Information within Large Digitized Paper Collections", IEEE eScience
L. Diesendruck, R. Kooper, L. Marini, K. McHenry, "Using Lucene to Index and Search the Digitized 1940 US Census", XSEDE (Best Paper Award and Best Science & Engineering Track Paper Award)
Brown Dog: Data Access Proxy (DAP)
Brown Dog: Data Tilling Service (DTS)
Goals Support Make list of supported formats as long and as relevant as possible Make list of extractors/signatures as long and as relevant as possible Performance Increase tasks per hour Backed by hardware (e.g. XSEDE, Amazon EC2, Azure, …) Minimize failures per hour
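The two performance goals can be tracked with a simple rate calculation. The sketch below is an assumption about how such numbers might be computed from a task log; the `TaskRecord` type and its fields are hypothetical and do not describe Brown Dog's actual monitoring.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TaskRecord:
    """One finished task in a hypothetical service log."""
    name: str
    succeeded: bool


def rates_per_hour(records: Iterable[TaskRecord], window_seconds: float) -> Tuple[float, float]:
    """Return (successful tasks per hour, failures per hour) over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    records = list(records)
    hours = window_seconds / 3600.0
    completed = sum(1 for r in records if r.succeeded)
    failed = len(records) - completed
    return completed / hours, failed / hours


log = [
    TaskRecord("xlsx-to-csv", True),
    TaskRecord("tiff-to-png", True),
    TaskRecord("las-to-obj", False),
    TaskRecord("pdf-to-txt", True),
]
tasks_rate, failure_rate = rates_per_hour(log, window_seconds=1800)
print(tasks_rate, failure_rate)
```

Here three successes and one failure in a half-hour window give 6.0 tasks per hour and 2.0 failures per hour; the first goal is to raise the former and the second is to lower the latter.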
Software
DAP & DTS REST Services
JavaScript bookmarklets (for DAP & DTS)
Browser plugin (e.g. Firefox)
Linux module
Linux file manager (e.g. GNOME Files)
Cross-platform client to:
  Provide access to uncurated/unstructured collections
  Help users curate uncurated/unstructured collections
  Leverage other DataNet efforts for the rest of the curation workflow
Medici Polyglot Versus Daffodil