National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research.

Presentation transcript:

National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research Scientist

NSF ACI Data Program: Kenton McHenry $10,519, Bill Michener $21,194, Golam Choudhury $10,085, Reagan Moore $8,300, Steven Ruggles $7,993, Margaret Hedstrom $8,000, Alex Szalay $7,603, Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond; Michael Levine $4,902, The Data Exacell; Xiaohui Carol Song $3,409, Integrating Geospatial Capabilities into HUBzero.

CIF21 DIBBs: Brown Dog. NSF ACI $10,519,716. PI: Kenton McHenry, Ph.D. Co-PIs: Jong Lee, Ph.D.; Barbara Minsker, Ph.D.; Praveen Kumar, Ph.D.; Michael Dietze, Ph.D.

The Problem. The Scientific Method: Question Hypothesis Testing Procedure Analysis Result. When the procedure is executed, one obtains the same result every time! The majority of science today involves procedures that include software and digital data. Both have relatively short lifespans!

The Problem. Large collections of uncurated and/or unstructured digital data (“long-tail” data): many file formats, no metadata, no useful filenames, no useful directory structure, no textual contents.

What is needed (from the data side): a means of deciphering the bytes that make up digital data so that one can retrieve its contents, i.e. data structures (e.g. images, 3D points, sound waves, strings, fields, matrices, etc.); a means of indexing data contents so that large collections of data can be searched and the desired data found; an ability to compare data.
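To make "deciphering the bytes" concrete, here is a minimal sketch in Python. It assumes a made-up toy layout (the `PTS1` magic string, the header fields, and both function names are hypothetical and do not come from the slides) and shows how raw bytes become a usable data structure, in this case a list of 3D points, only once the layout is known.

```python
import struct

# Assumed toy layout (not a real file format): a 4-byte magic string "PTS1",
# a little-endian uint32 point count, then count * 3 float32 values (x, y, z).
MAGIC = b"PTS1"


def encode_points(points):
    """Serialize a list of (x, y, z) tuples into the toy byte layout."""
    header = MAGIC + struct.pack("<I", len(points))
    body = b"".join(struct.pack("<3f", *p) for p in points)
    return header + body


def decode_points(raw):
    """Recover the list of 3D points from raw bytes, or raise ValueError."""
    if raw[:4] != MAGIC:
        # Without a known specification the bytes cannot be interpreted.
        raise ValueError("unknown format: cannot decipher bytes")
    (count,) = struct.unpack_from("<I", raw, 4)
    # Each point occupies 12 bytes and the header occupies 8 bytes.
    return [struct.unpack_from("<3f", raw, 8 + 12 * i) for i in range(count)]


raw = encode_points([(0.0, 1.0, 2.0), (3.5, 4.5, 5.5)])
points = decode_points(raw)
```

The same bytes with an unrecognized header raise an error, which is the situation the slide describes: the contents are there, but without the format knowledge they are not retrievable.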

What is needed (from the data side): the file format specifications describing how contents are represented within the file’s bytes, the software used to create and view the data, and the execution environment (platform, operating system, libraries, other software, etc.); the existence of metadata describing the data (possibly as simple as useful file/directory names), in order to search and index the data.

Additional Considerations: software is also a factor in this (i.e. the data side); obsolete operating systems and platforms; storage requirements (e.g. storing a working environment in a virtual machine); software that is no longer available; software licensing; the existence of many file formats (even for the same kind of data); lack of standards for data formats or enforceability of standards; large, complex file format specifications.

Additional Considerations: unavailable format specifications (either lost or proprietary); the ease and reward of creating data versus the burden of curation (e.g. organizing and providing metadata for files); different metadata standards; assuring the long-term availability of preserved software and data; assuring that the archive preserving the software and data exists over a reasonably long period of time.

Additional Considerations: assuring that the archival tools needed to index, find, access, view, retrieve, and utilize the software and data within the archive (being software themselves) exist over a reasonably long period of time.

A growing recognition of the need for academic reward, and perhaps education, surrounding the costly products of software development and data creation; the necessity for science to build on the work of others and have software and data reused (possibly in ways not remotely considered by the creator and crossing into other disciplines); the need for computation during the analysis of data collections; a means of efficiently and reliably transferring large amounts of data.

What Brown Dog Addresses: accessing data contents with a lack of standards and many file formats; discovering and finding data with a lack of curation, while also considering the need to preserve software and provide credit for software development; creating tools for accessing data while addressing archival tool sustainability.

Sustainable Software Cyberinfrastructure. Knowing our history: NCSA Telnet, 1986 (Gaige Paulsen, Tim Krauskopf, Aaron Contorer); Mosaic, 1993 (Marc Andreessen, Eric Bina), followed by Netscape, Internet Explorer, Firefox, Chrome (84% of browser traffic); httpd (and CGI), 1993 (Robert McCool), followed by Apache (64% of all webservers). All were built to access supercomputing resources. Though they still serve this purpose, none will be remembered for that!

Sustainable Software Cyberinfrastructure. Knowing our history: funded to meet scientific need(s); broad appeal (i.e. the general public); free (e.g. open source). Broad public appeal to sustain and drive scientific software post funding.

The Domain Name System (DNS). Originally written by Paul Mockapetris in 1983. A distributed database that translates domain names (i.e. strings) into IP addresses (i.e. 4 bytes for IPv4). 13 logical root servers (A-M), 359 instances worldwide. Internet Corporation for Assigned Names and Numbers (ICANN). An essential part of the modern internet! Used constantly by all, yet largely invisible.
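As a small illustration of the "strings in, 4 bytes out" translation, the sketch below uses an in-memory table in place of the real distributed database. The table, the name `www.example.org`, and the address `192.0.2.1` are placeholders from ranges reserved for documentation, and `resolve` is a hypothetical helper, not part of any DNS library.

```python
import ipaddress

# Hypothetical in-memory table standing in for the distributed database.
# The name and the address are documentation-only placeholders.
TOY_ZONE = {"www.example.org": "192.0.2.1"}


def resolve(name):
    """Translate a domain name (a string) into the 4 bytes of an IPv4 address."""
    # Domain names are case-insensitive, so normalize before the lookup.
    dotted = TOY_ZONE[name.lower()]
    return ipaddress.IPv4Address(dotted).packed


addr = resolve("WWW.example.org")
```

A real lookup would go through the operating system's resolver (for example `socket.gethostbyname`) and the network, which is exactly the part users never see.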

Data Access Proxy (DAP): a highly extensible and distributed service for carrying out file format conversions. Move towards an internet/world that is agnostic to file formats. Aid in accessing a file’s contents independent of how it is represented on disk.
Data Tilling Service (DTS): an extensible and distributed service for the extraction of new data or metadata from a file’s contents. Provide a means to query and/or relate collections of data without metadata.
Data Conversion: a transformation on digital data that largely preserves the entirety of the data. Largely reversible.
Data Extraction: a transformation on digital data which creates new, often higher level, data from the contents of the given data (e.g. tags, signatures). Not reversible.
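The difference between conversion and extraction can be sketched with a few lines of Python. This is only an illustration under assumed inputs: the sample CSV text and the three function names are invented here and have nothing to do with how DAP or DTS are implemented.

```python
import csv
import hashlib
import io
import json


def csv_to_json(text):
    """Conversion: same table, different encoding."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows)


def json_to_csv(text):
    """Inverse conversion, showing that the first one was reversible."""
    rows = json.loads(text)  # assumes at least one row
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def extract(text):
    """Extraction: derive new, higher-level data (tags and a signature)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        "tags": sorted(rows[0].keys()),
        "rows": len(rows),
        "signature": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


original = "site,temp_c\nA,21.5\nB,19.0\n"
round_trip = json_to_csv(csv_to_json(original))
derived = extract(original)
```

Here `round_trip` reproduces the original text, while nothing in `derived` would let one rebuild the table, which matches the reversible versus not reversible distinction above.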

Brown Dog Data Transformation Services. The Data Access Proxy (DAP): file in, file out. The Data Tilling Service (DTS): file in, JSON out. The JSON can contain metadata, tags, signatures, links to derived data products, etc.
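A rough sketch of the "file in, JSON out" idea follows, written as a tiny plug-in registry so that new extractors can be added without touching the driver. Everything named here (`EXTRACTORS`, `extractor`, `run_extractors`, and the JSON keys) is hypothetical; the slides do not describe the actual DTS interfaces or output schema.

```python
import hashlib
import json
import os
import tempfile

EXTRACTORS = []  # hypothetical plug-in registry, filled by the decorator below


def extractor(func):
    """Register a function that maps (path, bytes) to a dict fragment."""
    EXTRACTORS.append(func)
    return func


@extractor
def basic_metadata(path, data):
    return {"metadata": {"name": os.path.basename(path),
                         "size_bytes": len(data)}}


@extractor
def content_signature(path, data):
    return {"signatures": {"sha256": hashlib.sha256(data).hexdigest()}}


def run_extractors(path):
    """File in, JSON out: merge the output of every registered extractor."""
    with open(path, "rb") as fh:
        data = fh.read()
    result = {}
    for func in EXTRACTORS:
        result.update(func(path, data))
    return json.dumps(result, sort_keys=True)


# Usage: write a small temporary file, run the extractors, then clean up.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
doc = json.loads(run_extractors(tmp.name))
os.remove(tmp.name)
```

The registry is one possible way to keep the list of extractors open-ended; a distributed service would presumably dispatch the same kind of work to remote workers instead of calling local functions.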

Brown Dog Data Transformation Services. Services!!! Provide a programmable interface (e.g. REST). Client applications build on top of these services. Backed with computational resources. A place to preserve/reuse software/tools.
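To show what building a client on top of a REST-style service might look like, the sketch below only prepares a request and never sends it. The host `dap.example.org`, the `/convert/<format>` path, and the `file` form field are assumptions made for illustration; the slides do not give the real endpoints.

```python
import urllib.parse
import urllib.request

# Hypothetical base URL and path layout, chosen only for this example.
BASE_URL = "https://dap.example.org"


def build_conversion_request(file_url, target_format):
    """Prepare (but do not send) a request to convert file_url to target_format."""
    endpoint = f"{BASE_URL}/convert/{urllib.parse.quote(target_format, safe='')}"
    # The source file is passed by URL in a form-encoded body (an assumption).
    body = urllib.parse.urlencode({"file": file_url}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Accept": "application/octet-stream"},
    )


req = build_conversion_request("https://data.example.org/scan.tiff", "png")
```

Passing `req` to `urllib.request.urlopen` would perform the call against a real service. Keeping request construction separate from sending makes the client easy to test offline.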

Brown Dog Use Cases. Addressed specifically here: Biology, Ecology, Civil and Environmental Engineering, Social Science. Towards all science. Early User Workshop!!

Ecosystems and Climate Change. The Predictive Ecosystem Analyzer (PEcAn). Models: Ecosystem Demography (ED), SIPNET, DALEC. Data: Biofuel Ecophysiological Trait and Yield Database (BETY), Forest Inventory and Analysis (FIA), North American Regional Reanalysis (NARR), North American Carbon Program (NACP), Food and Agriculture Organization (FAO), … Lots of conversions taking place!!!

Ecosystems and Climate Change: MODIS (multi-spectral), Lidar, PALSAR (radar), AVIRIS (Airborne Visible/Infrared Imaging Spectrometer), Landsat (images).

Ecosystems and Climate Change. Settlement Vegetation data. Born physical: paper, microfiche, alphanumeric/color coded on vellum sheets. Born digital: PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS. Ad hoc formats: spreadsheets, databases, services, R data, Matlab data. Grouped as: Document, Image, Spatial, Tabular, Weather, 3D, Archive, Database, Filesystem, …

DAP and DTS (Usable Data)
Data Collection | Native Byte Encoding | Data Structures | Derived Data/Metadata | Applications
URL, File System, … | File Formats, Databases, Websites, Documents | Arrays, Strings, Images, Videos, Audio, 3D Models, … | Tags, Signatures | Search, Relate, View, Process
Weather Data | Various Formats | Tabular | Intermediary Analysis Results | Climate Modeling
Handwritten Settlement Vegetation Data | Various Image Formats | Image | Text, Number Values | Climate Modeling
MODIS Satellite Data | Various Image Formats | Image | Land Cover/Usage/… | Climate Modeling
Lidar Data | LAS | Depth | Floodplains | Flood Plain Analysis
Lidar Data | LAS | Depth, Polygons | Floodplains, Depth Distribution | Flood Plain Analysis
Lidar Data | LAS | Depth, Plot | River cross-sections, Maturity | Flood Plain Analysis
Lidar Data | LAS | Depth, Polygons | Stream detection, Sinuosity | Flood Plain Analysis
Architecture/Design Images | Various Image Formats | Image | Measure of Aesthetic Appeal | Green Infrastructure Design
Architecture/Landscape Models | Various 3D Formats | 3D Model | Synthetic Images | Green Infrastructure Design
Photographs | Various Image Formats | Image | 3D Model | Green Infrastructure Design
Groupscope | Various Video Formats | Video | People Locations/Interactions | Large Dynamic Group Behavior

Brown Dog

The Data Access Proxy (Demo): Kenton McHenry. The Data Tilling Service (Demo): Luigi Marini.

Technology
K. McHenry, R. Kooper, P. Bajcsy, “Towards a Universal, Quantifiable, and Scalable File Format Converter”, The IEEE International Conference on eScience.
M. Ondrejcek, K. McHenry, P. Bajcsy, “The Conversion Software Registry”, Microsoft eScience Workshop in San Francisco, CA.
K. McHenry, M. Ondrejcek, L. Marini, R. Kooper, P. Bajcsy, “Towards a Universal Viewer for Digital Content”, International Conference on Computer Science, Executable Paper Workshop.
K. McHenry, R. Kooper, L. Marini, M. Ondrejcek, “The ISDA Tools: Preserving 3D Digital Content”, The Preservation of Complex Objects Symposia.
K. McHenry, R. Kooper, M. Ondrejcek, L. Marini, P. Bajcsy, “A Mosaic of Software”, The IEEE International Conference on eScience.
L. Marini, P. Bajcsy, S. Padhy, A. Vandecreme, R. Kooper, B. Long, M. Ondrejcek, P. Saba, D. Bonnie, J. Chalfoun, K. McHenry, “Versus: A Framework for General Content-Based Comparisons”, IEEE eScience.
L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “Digitization and Search: A Non-Traditional Use of HPC”, IEEE eScience Workshop on Extending High Performance Computing Beyond its Traditional User Communities.
L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal, K. McHenry, “A Framework to Access Hand-written Information within Large Digitized Paper Collections”, IEEE eScience.
L. Diesendruck, R. Kooper, L. Marini, K. McHenry, “Using Lucene to Index and Search the Digitized 1940 US Census”, XSEDE (Best Paper Award and Best Science & Engineering Track Paper Award).

Brown Dog: Data Access Proxy (DAP)

Brown Dog: Data Tilling Service (DTS)

Goals. Support: make the list of supported formats as long and as relevant as possible; make the list of extractors/signatures as long and as relevant as possible. Performance: increase tasks per hour; backed by hardware (e.g. XSEDE, Amazon EC2, Azure, …); minimize failures per hour.

Software: DAP & DTS REST services; JavaScript bookmarklets (for DAP & DTS); browser plugin (e.g. Firefox); Linux module; Linux file manager (e.g. GNOME Files); cross-platform client to provide access to uncurated/unstructured collections, help users curate uncurated/unstructured collections, and leverage other DataNet efforts for the rest of the curation workflow.

Medici, Polyglot, Versus, Daffodil