National Center for Supercomputing Applications University of Illinois at Urbana–Champaign Brown Dog: An Elastic Data Cyberinfrastrure for Autocuration.

Slides:



Advertisements
Similar presentations
The North American Carbon Program Google Earth Collection Peter C. Griffith, NACP Coordinator; Lisa E. Wilcox; Amy L. Morrell, NACP Web Group Organization:
Advertisements

C van Ingen, D Agarwal, M Goode, J Gupchup, J Hunt, R Leonardson, M Rodriguez, N Li Berkeley Water Center John Hopkins University Lawrence Berkeley Laboratory.
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
ARCS Data Analysis Software An overview of the ARCS software management plan Michael Aivazis California Institute of Technology ARCS Baseline Review March.
1 Using Scalable and Secure Web Technologies to Design Global Format Registry Muluwork Geremew, Sangchul Song and Joseph JaJa Institute for Advanced Computer.
Robust Tools for Archiving and Preserving Digital Data Joseph JaJa, Mike Smorul, and Mike McGann Institute for Advanced Computer Studies Department of.
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
Mike Smorul Saurabh Channan Digital Preservation and Archiving at the Institute for Advanced Computer Studies University of Maryland, College Park.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
Sai Deng, Metadata Catalog Librarian, Wichita State University Libraries Tse-Min Wang, Graduate Student in CS, Wichita State University Digital Imaging.
Digitization Workflow Management System for Massive Digitization Projects Bibliotheca Alexandrina November 19, 2006 The 2 nd International Conference on.
Virtual Geophysics Laboratory (VGL) VGL v1.2 NeCTAR Project Close R.Fraser, T.Rankine, J.Vote, L.Wyborn, B.Evans, R.Woodcock, C.Kemp July 2013 CSIRO |
HDF 1 NCSA HDF XML Activities Robert E. McGrath Mike Folk National Center for Supercomputing Applications.
Vegetation Plot Management: A National Plots Database Demo Funding: National Science Foundation (DBI ) John Harris - NCEAS Robert K. Peet - University.
OnTimeMeasure Integration with Gush Prasad Calyam, Ph.D. (PI) Tony Zhu (Software Programmer) Alex Berryman (REU Student) GEC10 Selected.
EARTH SCIENCE MARKUP LANGUAGE “Define Once Use Anywhere” INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
LARGE SCALE DEPLOYMENT OF DAP AND DTS Rob Kooper Jay Alemeda Volodymyr Kindratenko.
Metadata Creation with the Earth System Modeling Framework Ryan O’Kuinghttons – NESII/CIRES/NOAA Kathy Saint – NESII/CSG July 22, 2014.
DM_PPT_NP_v01 SESIP_0715_AJ HDF Product Designer Aleksandar Jelenak, H. Joe Lee, Ted Habermann Gerd Heber, John Readey, Joel Plutchak The HDF Group HDF.
National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog An Overview Kenton McHenry, Ph.D. Senior Research.
C. Mattmann 1, C. Goodale 1, J. Kim 2, D.E. Waliser 1,2, D. Crichton 1, A. Hart 1, P. Zimdars 1 and Peter Lean* The International Workshop on CORDEX-East.
Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Dynamic Virtual Observatories James Myers, Luigi Marini, Rob.
Peter Bajcsy, Rob Kooper, Luigi Marini, Barbara Minsker and Jim Myers National Center for Supercomputing Applications (NCSA) University of Illinois at.
IPlant cyberifrastructure to support ecological modeling Presented at the Species Distribution Modeling Group at the American Museum of Natural History.
HDF Converting between HDF4 and HDF5 MuQun Yang, Robert E. McGrath, Mike Folk National Center for Supercomputing Applications University of Illinois,
EARTH SCIENCE MARKUP LANGUAGE Why do you need it? How can it help you? INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign The Conversion Software Registry Michal Ondrejcek, Kenton McHenry,
Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.
Integrated Grid workflow for mesoscale weather modeling and visualization Zhizhin, M., A. Polyakov, D. Medvedev, A. Poyda, S. Berezin Space Research Institute.
1 Bridging the gap between the paper past and digital future.
1 Geospatial and Business Intelligence Jean-Sébastien Turcotte Executive VP San Francisco - April 2007 Streamlining web mapping applications.
1 UNOG Library Digitization and Microform Unit (DMU) – December 2009.
National Center for Supercomputing Applications University of Illinois at Urbana–Champaign NCSA Brown Dog PaaS for SaaS for PaaS Rob Kooper Senior Research.
The New Zealand Institute for Plant & Food Research Limited Use of Cloud computing in impact assessment of climate change Kwang Soo Kim and Doug MacKenzie.
Cyberinfrastructure to promote Model - Data Integration Robert Cook, Yaxing Wei, and Suresh S. Vannan Oak Ridge National Laboratory Presented at the Model-Data.
Powered by Microsoft Azure, PointMatter Is a Flexible Solution to Move and Share Data between Business Groups and IT MICROSOFT AZURE ISV PROFILE: LOGICMATTER.
Taming the Big Data in Computational Chemistry #euroCRIS2015 Barcelona 9-11-XI-2015 Carles Bo ICIQ (BIST) -
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California, Berkeley Paul Watry Richard Marciano.
AHM04: Sep 2004 Nottingham CCLRC e-Science Centre eMinerals: Environment from the Molecular Level Managing simulation data Lisa Blanshard e- Science Data.
PEcAn The Predictive Ecosystem Analyzer. Motivation Synthesize heterogeneous data Bridge gap between conceptual and computational models Summarize what.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
The Virtual Observatory and Ecological Informatics System (VOEIS): Using RESTful architecture and an extensible data model to provide a unique data management.
Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.
National Center for Supercomputing Applications University of Illinois at Urbana–Champaign Great Lakes to Gulf Virtual Observatory Marcus Slavenas September.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
1 RIC 2009 Symbolic Nuclear Analysis Package - SNAP version 1.0: Features and Applications Chester Gingrich RES/DSA/CDB 3/12/09.
Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics.
Data Browsing/Mining/Metadata
Delivery of Science Components to NRCS Business Applications
DocFusion 365 Intelligent Template Designer and Document Generation Engine on Azure Enables Your Team to Increase Productivity MICROSOFT AZURE APP BUILDER.
MIRACLE Cloud-based reproducible data analysis and visualization for outputs of agent-based models Xiongbing Jin, Kirsten Robinson, Allen Lee, Gary Polhill,
ISDA + OpenStack Rob Kooper.
Joslynn Lee – Data Science Educator
Status and Challenges: January 2017
LAND COVER CLASSIFICATION WITH THE IMPACT TOOL
Data Sharing We all need data
Fernando Aguilar, IFCA-CSIC
NCSA Brown Dog Early User Workshop
Leveraging BI in SharePoint with PowerPivot and Power View
Joseph JaJa, Mike Smorul, and Sangchul Song
VI-SEEM Data Repository
Ecosystem Demography model version 2 (ED2)
Brown Dog Data Collection Native Byte Encoding Data Structures
VI-SEEM Data Repository
Google Sky.
DEEDS A Platform for Sharing Data, Computing & Scientific Workflows
Gordon Erlebacher Florida State University
DIBBs Brown Dog BDFiddle
Presentation transcript:

National Center for Supercomputing Applications University of Illinois at Urbana–Champaign Brown Dog: An Elastic Data Cyberinfrastrure for Autocuration and Digital Preservation Jay Alameda National Center for Supercomputing Applications University of Illinois at Urbana-Channpaign

Tabular Data Gap Filling Climate Modeling Lidar Flood Plain Analysis River Depth Distribution River Maturity Stream Detection and Sinuosity Satellite/Aerial Photos Land Cover/Usage Water Detection (e.g. Lakes, Retaining Ponds) Green Infrastructure Hyperspectral Radar Photos 3D Reconstruction 3D Data Human Preference Modeling Video People Detection/Tracking Large Dynamic Group Behavior Bee Detection/Tracking Bee Colony Behavior Underwater Photos Color Correction Image Stitching Mapping Event Detection Species Detection/Counting Reef Changes Food Supply Structural Defects Hazard Modeling Microscopy Images Pollen Detection/Classification Paleoclimate Evolution Root Tip Tracking Phenomics Materials Development Cell Tracking Tissue Classification Renal Failure Loss of Organ Function Feedlot Tracking Disease Detection Historic Maps River Meander Coastline Changes Documents NLP Sentiment Analysis Regions in Conflict Handwritten Documents Pre-Digital Datasets Databases Web Sites Publications Simulations Critical Zone Studies Social Science Ecology Biology

Acknowledgements The Brown Dog team & coauthors: Smruti Padhy, Jay Alameda, Rui Liu, Edgar Black, Liana Diesendruck, Mike Dietze, Greg Jansen, Praveen Kumar, Rob Kooper, Jong Lee, Richard Marciano, Luigi Marini, Dave Mattson, Barbara Minsker, Chris Navarro, Marcus Slavenas, William Sullivan, Jason Votava, Inna Zharnitsky, Kenton McHenry This material is based upon work supported by the National Science Foundation under Grant Number NSF ACI : “CIF21 DIBBs: Brown Dog” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The Problem Large collections of unstructured and/or un-curated data Many data types and file formats Variety of existing software Short life span of digital data and software Hinders reproducibility of scientific results

An Example: Ecosystems and Climate Change M. Dietze, K. McHenry, A. Desai, “Model-data Synthesis and Forecasting Across the Upper Midwest: Partitioning Uncertainty and Environmental Heterogeneity in Ecosystem Carbon,” NSF DBI , M. Dietze, K. McHenry, A. Desai, “ABI Development: The PEcAn Project - A Community Platform for Ecological Forecasting,” NSF DBI , Towards regional-scale high resolution estimates of plant life and carbon storage Scientific workflow and data assimilation system connecting a variety of models within the Ecology community to a variety of data sources Grown to 52 developers over the past 3 years NCSA / U. Illinois, BU, Brookhaven National Lab, University of Wisconsin, University of Notre Dame, Utah State, Columbia University, Pacific Northwest National Laboratory, DuPont Pioneer, Exeter College, UK, U. Arizona, Dartmouth College

Ecosystems and Climate Change Models: Ecosystem Demography (ED) SIPNET DALEC … Data: Biofuel Ecophysiological Trait and Yield Database (BETY) Forest Inventory and Analysis (FIA) North American Regional Reanalysis (NARR) North American Carbon Program (NACP) Food and Agriculture Organization (FAO) …

Ecosystems and Climate Change Data with Unstructured Aspects: MODIS (Multi-spectral) Lidar Palsar (Radar) Aviris (Airborne Infrared Spectrometer) Landsat (Images) Published results (e.g. tables, figures, plots) Manually done to ingest into BETY

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather 3D

Settlement Vegetation data Born Physical Paper, Microfiche, Alphanumeric/Color coded on vellum sheets Born Digital PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS Ad hoc formats: Spreadsheets Databases Services R Data Matlab Data Ecosystems and Climate Change Document Image Spatial Tabular Weather 3D Archive, Database, Filesystem, …

What we need A system/framework that Enables access to data contents irrespective of file formats Extracts metadata from data content and does automatic curation Uses existing conversion/extraction/data analysis tools Is extensible – easily add new tools Is dynamically scalable Is easy to use

CIF21 DIBBs: Brown Dog PI: Kenton McHenry, Ph.D. Co-PI: Jong Lee, Ph.D. Co-PI: Barbara Minsker, Ph.D. Co-PI: Praveen Kumar, Ph.D. Co-PI: Michael Dietze, Ph.D.

Brown Dog – A framework for autocuration Data Access Proxy (DAP) File format conversions Example – png to pdf Data Tilling Service (DTS) Extraction of metadata, signatures or derived products from a file’s content Example – Face extraction, text extraction using OCR, table from pdf, previews Tools Catalog (TC) Allows to add new conversion/extraction tools to the DAP/DTS Elasticity Module (EM) Scales Up/Down DAP/DTS DAP DTS TC EM

#Application name (Version) #File types supported (e.g. document, depth, image, …) #Comma separated list of supported input formats #Comma separated list of supported output formats Describe #Call external application and/or carry out conversion … Convert File Data Access Proxy (Data format conversion) REST API Largely Reversible Software Servers 3 rd party software, library, external service Wrapper Scripts (Converters) Adding Converters to Software Server within DAP DAP DTS TC EM

Example DAP DTS TC EM

extractors.connect_message_bus(extractorName=extractorName, messageType=messageType, rabbitmqURL=rabbitmqURL, rabbitmqExchange=rabbitmqExchange, processFileFunction=process_file, checkMessageFunction=check_message) Connect def process_file(parameters): global extractorName inputfile=parameters['inputfile'] # call actual program result = subprocess.check_output(['wc', inputfile], stderr=subprocess.STDOUT) (lines, words, characters, filename) = result.split() Return Metadata Work on File extractors.upload_file_metadata(mdata=metadata, parameters=parameters) Data Tilling Service (Metadata Extraction) REST API Extractors Use any existing tool Python library - pyClowder Creating a Python extractor using pyClowder for DTS DAP DTS TC EM

Example DAP DTS TC EM

The Brown Dog Services Architecture Web Browser Custom Clients Software Server Polyglot Steward Load balancer Jobs (Mongo DB) Software Server HTTP/HTTPS Data Access Proxy … … Extractor (Java/ Python) Extractor (Java/ Python) Extractor (Java/ Python) Extractor (Java/ Python) Clowder WebApp (Scala/Play) Load balancer Data/ Metadata (MongoDB) Data/ Metadata (MongoDB) Data Tilling Service … … Event Bus (RabbitMQ) DAP DTS TC EM

Tools Catalog DAP DTS TC EM

Elasticity Automatically scales up/down DAP/DTS based on the user demands Leverages cloud computing IaaS Supports a variety virtual machine/container frameworks Leverages HPC resources to batch execute jobs in long queues Focuses on DTS extractors and DAP Software Servers DAP DTS TC EM

Elasticity Module Architecture DAP DTS TC EM

Elasticity – Performance Evaluation Tested with Open CV (Computer Vision) extractors faces, eyes, profiles and closeups Tested with OCR extractor with around 1200 test images Processing time is reduced by 70% and 80% if started with suspended VMs DAP DTS TC EM

The UMD Digital Curation Innovation Center Case Study: Archival Processing using CI-BER Test-bed Modern/ Legacy Documents formats PDF Modern/ Legacy Image formats TIFF DAP DTS TC EM Processing Time (ms)

Summary Huge diversity in data and analysis Programmable Interface – various client applications Automatically scales up/down Place to preserve/reuse software/tools Integrable with scientific workflow system Resuable modules

Brown Dog Services- Software Components, Cloud/HPC Resources Polyglot Versus Daffodil Project website:

Thank You Questions

More Information Project website: Brown Dog Clients: wnDog.mp4 wnDog.mp4