Deriving and Managing Data Products in an Environmental Observation and Forecasting System
Laura Bright and David Maier, Portland State University
Introduction: Large-scale scientific workflows are common in many domains. Data-intensive tasks generate large volumes of data products (datasets, images, animations), and data products may be inputs to subsequent tasks. (2/16/2019)
Motivation: CORIE, an Environmental Observation and Forecasting System for the Columbia River Estuary. A single forecast run generates over 5 GB of data. The existing workflow consists of Perl, C, and FORTRAN programs, making it difficult to modify and track tasks and data products.
Segment of CORIE Forecast Workflow (diagram): start.pl, ELCIRC, *_salt.63 / *_temp.63 / *_vert.63 …, master_process.pl, do_isolines.pl, do_transects.pl, compute_plumevol.c, plumevol*.dat, do_plumevol.pl, plot_plumevol.pl
Challenges: (1) Creation of data products: tasks are time- and data-intensive; competition for limited resources; opportunities for concurrent execution. (2) Management of data products: products are large (100s of MB); tracking metadata and lineage (how a data product was generated).
Contributions: Experiences implementing a data product management system; managing data products and tasks (lineage tracking, versioning); scheduling challenges and opportunities; prototype implementation and evaluation.
Outline: Introduction; CORIE Environmental Observation and Forecasting System; Implementation using Thetus; Scheduling; Related Work and Conclusions.
CORIE Overview: Measure and simulate physical properties of the Columbia River Estuary (e.g., salinity, temperature). Forecast simulations (daily) predict near-term conditions: 5 GB, 30,000 files per run. Hindcasts (as needed) are extended simulations or calibration runs: 20 GB, 10,000 files. Total of 8 TB of online storage.
Example: Isolines
Example: Transects
Execution Environment: Dedicated storage and processors; use all available capacity. Variety of runs, e.g., simulations, data product generation, calibration runs; different runs may compete for resources. The existing implementation runs sequentially on a single processor.
Our Goals: Speed up workflows via concurrency: execute independent tasks on a dedicated grid (set of processing nodes), seamlessly adding processor nodes. Improve the ease of adding and modifying data products and tasks: lineage and metadata tracking.
Outline: Introduction; CORIE Environmental Observation and Forecasting System; Implementation using Thetus; Scheduling; Related Work and Conclusions.
Thetus Overview: We used Thetus™ commercial software for non-text scientific data management: storing and querying data files and metadata, and automatically launching tasks when conditions are met. Using commercial software enabled rapid deployment of an experimental system.
Thetus Terminology: Data file. Property: a metadata attribute associated with data files or descriptions. Description: a set of property-value pairs. Profile: shares properties between a set of files; may launch one or more tasks on a file. Every entity has a unique ID.
Thetus Architecture (diagram)
Our Thetus Deployment: Modified existing CORIE tasks to execute as Thetus tasks, enabling concurrent execution of independent tasks at separate nodes. Use Thetus storage facilities for executable programs as well as data products; maintain default versions; store data locally at nodes.
Our Thetus Deployment (diagram): input files are published to the Thetus Publisher, whose data stores hold data products & executables; task server nodes download inputs & executables and publish data products back.
Tasks in our Deployment: Generation tasks generate derived data products. Management tasks automatically maintain executables and metadata (updating versions, metadata extraction).
Executing a Generation Task (example): the file plumevol.dat matches profile plumevol_profile, which launches the generation task plot_plumevol with input plumevol.dat and output plumevol.gif.
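The profile-triggered launch above can be sketched in a few lines of Python. This is a minimal illustration of the behavior described on this slide, not the Thetus API: `Profile`, `publish`, and the suffix-based filter are all invented for the example.

```python
# Minimal sketch of profile-triggered task launch (illustrative only --
# Profile, publish, and the suffix filter are invented, not the Thetus API).

class Profile:
    def __init__(self, name, match_suffix, tasks):
        self.name = name
        self.match_suffix = match_suffix  # simple stand-in for a property filter
        self.tasks = tasks                # callables launched on a match

def publish(filename, profiles):
    """Publish a file; launch the tasks of every profile whose filter matches."""
    outputs = []
    for profile in profiles:
        if filename.endswith(profile.match_suffix):
            for task in profile.tasks:
                outputs.append(task(filename))
    return outputs

def plot_plumevol(input_file):
    # Stand-in for the real plotting task: names the derived data product.
    return input_file.replace(".dat", ".gif")

plumevol_profile = Profile("plumevol_profile", "plumevol.dat", [plot_plumevol])
print(publish("plumevol.dat", [plumevol_profile]))  # -> ['plumevol.gif']
```

Publishing a file that matches no profile launches nothing, mirroring the condition-driven launching described on the Thetus Overview slide.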
Storing Executables: Easily add and modify tasks; old versions remain stored, so older data products can be regenerated. Easily add task server nodes: executables are downloaded to nodes as needed. Associate data products with the actual programs that generated them.
Accessing Current Versions: We store all versions of executables for historical purposes. How to identify the current version? A management task tracks the current version of each file, so there is no need to explicitly use an ID.
Accessing Current Versions (example): the file prog.pl (ID: 123) matches profile Set_Default_Profile, which launches the management task Set_Default; it sets the property Default_ID: 123 on the description for prog.pl.
Storing Data at Task Server Nodes: Many tasks share common inputs, so local data stores can reduce data transfer overhead, but we must ensure the correct version is used. Solution: store file IDs locally; if the local ID matches the default, there is no need to download the file.
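A minimal sketch of this ID-based cache check follows. The function names, the `download` callback, and the file name are illustrative; in the real deployment the node would compare against the default ID tracked by the Thetus Publisher.

```python
# Sketch of version-checked local caching (assumed names; a node checks the
# Publisher's current default ID before transferring a file).

local_ids = {}    # filename -> ID of the locally cached copy
local_data = {}   # filename -> cached contents

def fetch(filename, default_id, download):
    """Return file contents, downloading only when the cached ID is stale."""
    if local_ids.get(filename) == default_id:
        return local_data[filename]      # IDs match: skip the transfer
    contents = download(filename)        # stale or missing: fetch a fresh copy
    local_ids[filename] = default_id
    local_data[filename] = contents
    return contents

transfers = []
def download(name):
    # Stand-in for a transfer from the Publisher's data store.
    transfers.append(name)
    return f"contents of {name}"

fetch("run_salt.63", default_id=123, download=download)  # downloads once
fetch("run_salt.63", default_id=123, download=download)  # cache hit, no transfer
print(len(transfers))  # -> 1
```

When the default ID changes (a new version is published), the stale check fails and the file is re-downloaded exactly once.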
Outline: Introduction; CORIE Environmental Observation and Forecasting System; Implementation using Thetus; Scheduling; Related Work and Conclusions.
Scheduling Issues: Task splitting; data-aware scheduling; workflow-aware scheduling.
Task Splitting: Modified tasks that iterate over multiple files to instead process a single file. This enables concurrent execution of a task on different files at separate nodes, with minimal changes to existing code.
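The splitting step can be illustrated with a toy sketch; `do_isolines` here is a stand-in for the loop body of the original script, and the file names are illustrative.

```python
# Toy sketch of task splitting: one iterate-over-files task becomes one
# single-file task per input, so the pieces can run concurrently.

def do_isolines(path):
    # Stand-in for the per-file work inside the original Perl loop.
    return f"isolines({path})"

def split(task, inputs):
    """Replace a task that loops over files with independent per-file tasks."""
    return [(task, path) for path in inputs]

subtasks = split(do_isolines, ["1_salt.63", "1_temp.63", "1_vert.63"])
# Each (task, input) pair can now be dispatched to any available node.
results = [task(path) for task, path in subtasks]
print(results)  # -> ['isolines(1_salt.63)', 'isolines(1_temp.63)', 'isolines(1_vert.63)']
```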
Data-Aware Scheduling: Many tasks process the same large files. Assign tasks based on the location of their input files to reduce data transfer overhead.
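One simple way to realize this policy is a greedy placement that prefers nodes already holding a task's input. This is an illustrative sketch, not the deployed scheduler; the names and the least-loaded tie-breaking rule are assumptions.

```python
# Sketch of data-aware placement: send each task to a node that already
# stores its input file, falling back to the least-loaded node.

def assign(tasks, node_files, nodes):
    """tasks: list of (task_name, input_file); node_files: node -> set of files."""
    load = {n: 0 for n in nodes}
    placement = {}
    for name, infile in tasks:
        # Prefer nodes that already hold the input (no transfer needed).
        candidates = [n for n in nodes if infile in node_files[n]] or nodes
        node = min(candidates, key=lambda n: load[n])
        placement[name] = node
        load[node] += 1
    return placement

nodes = ["A", "B"]
node_files = {"A": {"salt.63"}, "B": {"temp.63"}}
tasks = [("iso_salt", "salt.63"), ("iso_temp", "temp.63"), ("tr_salt", "salt.63")]
print(assign(tasks, node_files, nodes))
# iso_salt and tr_salt land on A (which holds salt.63); iso_temp lands on B
```

Tasks sharing a large input end up co-located, which is exactly the property the manual assignment in the Results section exploits.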
Workflow-Aware Scheduling: Consider both currently ready and future workflow tasks. Example: four tasks and two nodes; Tasks 1, 2, and 3 are ready at time 0, Task 4 at time 1.
Workflow-Aware Scheduling (continued): Suboptimal: assign tasks to Nodes A and B as they become ready. Improved: assign Tasks 1, 2, and 3 to Node A and Task 4 to Node B.
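The example can be made concrete with a small makespan calculation. The numbers are assumed for illustration: Tasks 1-3 share a large input stored on Node A, Task 4's input is on Node B, every task runs 1 time unit, and moving an input to the other node costs 2 units.

```python
# Sketch of the four-task, two-node example with assumed durations and a
# transfer penalty; it shows why reserving Node B for Task 4 can win.

def makespan(assignment, ready, runtime, transfer):
    """assignment: task -> node; returns the overall finish time."""
    free = {"A": 0, "B": 0}
    for task in sorted(assignment, key=lambda t: ready[t]):
        node = assignment[task]
        start = max(free[node], ready[task])
        free[node] = start + transfer.get((task, node), 0) + runtime[task]
    return max(free.values())

ready = {"T1": 0, "T2": 0, "T3": 0, "T4": 1}
runtime = {t: 1 for t in ready}
# Transfer cost applies when a task's input is not already on that node.
transfer = {("T1", "B"): 2, ("T2", "B"): 2, ("T3", "B"): 2, ("T4", "A"): 2}

greedy = {"T1": "A", "T2": "B", "T3": "A", "T4": "B"}     # assign as ready
improved = {"T1": "A", "T2": "A", "T3": "A", "T4": "B"}   # keep shared data on A
print(makespan(greedy, ready, runtime, transfer))    # -> 4
print(makespan(improved, ready, runtime, transfer))  # -> 3
```

The greedy schedule pays to ship the shared input to Node B, then makes Task 4 wait behind it; looking ahead to Task 4 avoids both costs.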
Results: Current implementation uses 3 nodes, running do_transects and do_isolines. do_transects: 4 input files (three of 334 MB, one of 655 MB). do_isolines: 11 input files (three of 334 MB, one of 655 MB, seven of 23 MB). Many tasks have shared inputs. Takes 19-20 min on a single node.
Data Transfer and Execution Times (figure)
Details: Split into 15 tasks, 1 per file. Compared random assignments against a manual data-aware and workflow-aware assignment: tasks that operate on the same files execute at the same node, and long-running tasks are divided evenly among nodes.
Effects of Data-Aware and Workflow-Aware Scheduling: Random assignments: ~800 sec (over 13 min). Data- and workflow-aware: ~600 sec (under 10 min).
Outline: Introduction; CORIE Environmental Observation and Forecasting System; Implementation using Thetus; Scheduling; Related Work and Conclusions.
Related Work: Grid computing (Globus, Condor, JOSH): job scheduling, replica management. Scientific workflows: Chimera, Zoo, GridDB, Kepler. Lineage tracking: PASOA, ESSW.
Conclusions: Executing scientific workflows on dedicated nodes presents new challenges. Storing both data products and executables facilitates data maintenance and lineage tracking. Data-aware and workflow-aware scheduling improves task execution.
Future Work: Automatic data- and workflow-aware scheduling, using statistics from previous executions and system monitoring. Task sets: group related tasks into a workflow. Production planning: predefine workflows for future execution.
Preview of things to come… Manual scheduling (implementation); automated scheduling (simulation).
Acknowledgments: Thetus Corporation (http://www.thetuscorp.com), the CORIE team, and many others…