Presentation is loading. Please wait.

Presentation is loading. Please wait.

National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research.

Similar presentations


Presentation on theme: "National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research."— Presentation transcript:

1 National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research Scientist

2 Kenton McHenry $10,519,716 2013-2018 Bill Michener $21,194,548 2009-2014 Golam Choudhury $10,085,120 2009-2014 Reagan Moore $8,300,992 2011-2016 Steven Ruggles $7,993,266 2011-2016 Margaret Hedstrom $8,000,000 2011-2016 Alex Szalay $7,603,723 2013-2018 Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond Michael Levine $4,902,601 2013-2018 The Data Exacell Xiaohui Carol Song $3,409,029 2013-2018 Integrating Geospatial Capabilities into HUBzero NSF ACI Data Program

3 CIF21 DIBBs: Brown Dog NSF ACI $10,519,716 PI: Kenton McHenry, Ph.D. Co-PI: Jong Lee, Ph.D. Co-PI: Barbara Minsker, Ph.D. Co-PI: Praveen Kumar, Ph.D. Co-PI: Michael Dietze, Ph.D.

4

5

6 The Problem The Scientific Method: Question Hypothesis Testing Procedure Analysis Result When procedure is executed one obtains the same result every time! The majority of science today involves procedures which include software and digital data. Both have relatively short lifespans!

7 The Problem Large collections of un-curated and/or unstructured digital data (“long-tail” data) Many file formats No metadata No useful filenames No useful directory structure No textual contents

8 What is needed (from the data side) Means of deciphering the bytes that make up digital data so that one can retrieve its contents Data Structures (e.g. images, 3D points, sound waves, strings, fields, matrices, etc…) Means of indexing data contents so that large collections of data can be searched and desired data found An ability to compare data

9 What is needed (from the data side) The file format specifications describing how contents are represented within the file’s bytes, the software used to create and view the data, and the execution environment (platform, operating system, libraries, other software, etc…). The existence of metadata describing the data (possibly as simple as useful file/directory names), in order to search/index data.

10 software is also a factor in this (i.e. the data side), obsolete operating systems and platforms, storage requirements (e.g. storing a working environment in a virtual machine), software that is no longer available, software licensing, the existence of many file formats (even for the same kind of data), lack of standards for data formats or enforceability of standards, large complex file format specifications, Additional Considerations

11 unavailable format specifications (either lost or proprietary), the ease and reward of creating data versus the burden of curation (e.g. organizing and providing metadata for files), different metadata standards, assuring the long term availability of preserved software and data, assuring the archive preserving the software and data exists over a reasonably long period of time, Additional Considerations

12 assuring the archival tools needed to index, find, access, view, retrieve, and utilize the software and data within the archive exists over a reasonably long period of time (being software itself). Additional Considerations

13 a growing notion towards the need of academic reward, and perhaps education, surrounding the costly products of software development and data creation the necessity for science to build off of the work of others and have software and data reused (possibly in ways not remotely considered by the creator and crossing into other disciplines) need for computation during the analysis of data collections means of efficiently and reliably transferring large amounts of data

14 What Brown Dog Addresses Accessing Data Contents with a Lack of Standards and Many File Formats Discovering and Finding Data with a Lack of Curation while also Considering the Need to Preserve Software and Provide Credit for Software Development Creating Tools for Accessing Data while Addressing Archival Tool Sustainability

15 What Brown Dog Addresses Accessing Data Contents with a Lack of Standards and Many File Formats Discovering and Finding Data with a Lack of Curation while also Considering the Need to Preserve Software and Provide Credit for Software Development Creating Tools for Accessing Data while Addressing Archival Tool Sustainability

16 Sustainable Software Cyberinfrastructure Knowing our history: NCSA Telnet, 1986 Gaige Paulsen, Tim Krauskopf, Aaron Contorer Mosaic, 1993 Marc Andreessen, Eric Bina Netscape, Internet Explorer, Firefox, Chrome (84% of browser traffic) httpd (and CGI), 1993 Robert McCool Apache (64% of all webservers) All built to access supercomputing resources Though they still serve this purpose none will be remembered for that!

17 Sustainable Software Cyberinfrastructure Knowing our history: Funded to meet scientific need(s) Broad appeal (i.e. the general public) Free (e.g. open source) Broad public appeal to sustain and drive scientific software post funding

18 The Domain Name Service (DNS) Originally written by Paul Mockapetris in 1983 Distributed database to translate domain names (i.e. strings) into IP addresses (i.e. 4 bytes) 13 logical root servers (A-M), 359 instances worldwide Internet Corporation for Assigned Names and Numbers (ICANN) Essential part of the modern internet! Used constantly by all yet largely invisible

19 Data Access Proxy (DAP) A highly extensible and distributed service for carrying out file format conversions Move towards an internet/world that is agnostic to file formats Aid in accessing a files contents independent of how it is represented on disk Data Tilling Service (DTS) An extensible and distributed service for the extraction of new data or metadata from a file’s contents Provide means to query and/or relate collections of data without metadata Data Conversion: A transformation on digital data that largely preserves the entirety of the data. Largely reversible. Data Extraction: A transformation on digital data which creates new, often higher level, data from the contents of the given data (e.g. tags, signatures). Not reversible.

20 Brown Dog Data Transformation Services The Data Access Proxy (DAP) http://dap.ncsa.illinous.edu/conversion/:output/:file File in, File out The Data Tilling Service http://dts.ncsa.illinois.edu/extraction/:domain/:file File in, JSON out JSON can contain metadata, tags, signatures, links to derived data products, etc…

21 Brown Dog Data Transformation Services Services!!! Provide a programmable interface (e.g. REST) Client applications build on top of these services Back with computational resources Place to preserve/reuse software/tools

22 Brown Dog Use Cases Addressed specifically here: Biology Ecology Civil and Environmental Engineering Social Science Towards all science Early User Workshop!!

23 Ecosystems and Climate Change The Predictive Ecosystem Analyzer (PEcAn) Models: Ecosystem Demography (ED) SIPNET DALEC Data: Biofuel Ecophysiological Trait and Yield Database (BETY) Forest Inventory and Analysis (FIA) North American Regional Reanalysis (NARR) North American Carbon Program (NACP) Food and Agriculture Organization (FAO) … Lots of conversions taking place!!!

24 Ecosystems and Climate Change MODIS (Multi-spectral) Lidar Palsar (Radar) Aviris (Airborne Infrared Spectrometer) Landsat (Images)

25 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document

26 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image

27 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial

28 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular

29 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather

30 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather 3D

31 Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather 3D Archive, Database, Filesystem, …

32 DAP Native Byte Encoding File Formats, Data Bases, Websites, Documents Data Structures Arrays, Strings, Images, Videos, Audio, 3D Models, … Derived Data/ Metadata Tags, Signatures Applications Search, Relate, View, Process Data Collection URL, File System, … DTS Usable Data DAP Native Byte Encoding Various Formats Data Structures Tabular Derived Data/ Metadata Intermediary Analysis Results Applications Climate Modeling Data Collection Weather Data DTS DAP Native Byte Encoding Various Image Formats Data Structures Image Derived Data/ Metadata Text, Number Values Applications Climate Modeling Data Collection Handwritten Settlement Vegetation Data DTS DAP Native Byte Encoding Various Image Formats Data Structures Image Derived Data/ Metadata Land Cover/Usage/ … Applications Climate Modeling Data Collection MODIS Satellite Data DTS DAP Native Byte Encoding LAS Data Structures Depth Derived Data/ Metadata Floodplains Applications Flood Plain Analysis Data Collection Lidar Data DTS DAP Native Byte Encoding LAS Data Structures Depth, Polyglons Derived Data/ Metadata Floodplains, Depth Distribution Applications Flood Plain Analysis Data Collection Lidar Data DTS DAP Native Byte Encoding LAS Data Structures Depth, Plot Derived Data/ Metadata River cross- sections, Maturity Applications Flood Plain Analysis Data Collection Lidar Data DTS DAP Native Byte Encoding LAS Data Structures Depth, Polygons Derived Data/ Metadata Stream detection, Sinuosity Applications Flood Plain Analysis Data Collection Lidar Data DTS DAP Native Byte Encoding Various Image Formats Data Structures Image Derived Data/ Metadata Measure of Aesthetic Appeal Applications Green Infrastructure Design Data Collection Architecture/Design Images DTS DAP Native Byte Encoding Various 3D Formats Data Structures 3D Model Derived Data/ Metadata Synthetic Images Applications Green Infrastructure Design Data Collection Architecture/Landscap e Models DTS DAP Native Byte Encoding Various Image Formats Data Structures Image Derived Data/ Metadata 3D Model Applications Green Infrastructure Design Data Collection Photographs DTS DAP Native Byte Encoding Various Video Formats Data Structures Video Derived Data/ Metadata People Locations/ Interactions Applications Large Dynamic Group Behavior Data Collection Groupscope DTS

33 Brown Dog

34

35

36 The Data Access Proxy (Demo) Kenton McHenry The Data Tilling Service (Demo) Luigi Marini

37 Technology K. McHenry, R. Kooper, P. Bajcsy, “Towards a Universal, Quantiable, and Scalable File Format Converter", The IEEE International Conference on eScience, 2009. M. Ondrejcek, K. McHenry, P. Bajcsy, “The Conversion Software Registry", Microsoft eScience Workshop in San Francisco, CA, 2010. K. McHenry, M. Ondrejcek, L. Marini, R. Kooper, P. Bajcsy, “Towards a Universal Viewer for Digital Content", International Conference on Computer Science, Executable Paper Workshop, 2011. K. McHenry, R. Kooper, L. Marini, M. Ondrejcek, “The ISDA Tools: Preserving 3D Digital Content", The Preservation of Complex Objects Symposia, 2011. K. McHenry, R. Kooper, M. Ondrejcek, L. Marini, P. Bajcsy, “A Mosaic of Software", The IEEE International Conference on eScience, 2011. L. Marini, P. Bajcsy, S. Padhy, A. Vandecreme, R. Kooper, B. Long, M. Ondrejcek, P. Saba, D. Bonnie, J. Chalfoun, K. McHenry, “Versus: A Framework for General Content-Based Comparisons", IEEE eScience, 2012. L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “Digitization and Search: A Non- Traditional Use of HPC", IEEE eScience Workshop on Extending High Performance Computing Beyond its Traditional User Communities, 2012. L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “A Framework to Access Hand- written Information within Large Digitized Paper Collections", IEEE eScience, 2012. L. Diesendruck, R. Kooper, L. Marini, K. McHenry, “Using Lucene to Index and Search the Digitized 1940 US Census", XSEDE, 2013. (Best Paper Award and Best Science & Engineering Track Paper Award)

38 Brown Dog: Data Access Proxy (DAP)

39

40 Brown Dog: Data Tilling Service (DTS)

41

42

43 Goals Support Make list of supported formats as long and as relevant as possible Make list of extractors/signatures as long and as relevant as possible Performance Increase tasks per hour Backed by hardware (e.g. XSEDE, Amazon EC2, Azure, …) Minimize failures per hour

44 Software DAP & DTS REST Services Javascript bookmarklets (for DAP & DTS) Browser plugin (e.g. Firefox) Linux module Linux file manager (e.g. GNOME Files) Cross platform client to: Provide access to uncurated/unstructured collections Help users curate uncurated/unstructured collections Leverage other DataNet effort for rest of curation workflow

45 https://dap.ncsa.illinois.edu/conversion/https://dts.ncsa.illinois.edu/extraction/ http://browndog.ncsa.illinois.edu Medici https://opensource.ncsa.illinois.edu/stash/projects/MMDB/ Polyglot https://opensource.ncsa.illinois.edu/stash/scm/pol/polyglot.git Versus https://opensource.ncsa.illinois.edu/stash/projects/VS/ Daffodil https://opensource.ncsa.illinois.edu/stash/scm/dfdl/daffodil.git


Download ppt "National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research."

Similar presentations


Ads by Google