
Building the Trident Scientific Workflow Workbench for Data Management in the Cloud Roger Barga, MSR Yogesh Simmhan, Ed Lazowska, Alex Szalay, and Catharine van Ingen

Demonstrate that a commercial workflow management system can be used to implement scientific workflows
Offer this system as an open source accelerator
Write once, deploy and run anywhere...
- Abstract parallelism (HPC and many-core)
- Automatic provenance capture, for both workflows and results
- Costing model for estimating the resources required
- Integrated data storage and access, in particular cloud computing
- Reproducible research
Develop this in the context of real eScience applications
- Make sure we solve a real problem for actual project(s)
And this is where things started to get interesting...

Role of workflow in data-intensive eScience
Explore architectural patterns / best practices
- Scalability
- Fault tolerance
- Provenance
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research

Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Features:
- Allow human interaction with instruments
- Create 'on demand' visualizations of ocean processes
- Store data for long-term time-series studies
- Deployed instruments will change regularly, as will the analysis
- Facilitate automated, routine "survey campaigns"
- Support automated event detection and reaction
- Users can access through the web (or custom client software)
- Best effort for most workflows is acceptable

One of the largest visible light telescopes
- 4 unit telescopes acting as one
- 1 Gigapixel per telescope
- Surveys entire visible universe once per week
- Catalog solar system, moving objects/asteroids
- ps1sc.org: UHawaii, Johns Hopkins, …

30 TB of processed data/year; ~1 PB of raw data
5 billion objects; 100 million detections/week
- Updated every week
SQL Server 2008 for storing detections
- Distributed over spatially partitioned databases
- Replicated for fault tolerance
Windows 2008 HPC Cluster
- Schedules workflows, monitors the system

[Diagram: Pan-STARRS data layout. CSV batches from the IPP shared data store pass through load/merge stages (Load Merge 1-6, levels L1 and L2) into sixteen slice databases (S1-S16) spread across eight slices, each kept as a hot and a warm copy, and are exposed through the main distributed view.]

[Diagram: The Pan-STARRS Science Cloud. The telescope's Image Processing Pipeline (IPP) emits CSV files; a Load Workflow ingests them into Load DBs, a Merge Workflow folds them into cold and warm slice databases, and a Flip Workflow promotes slices to hot copies behind the Distributed View. Astronomers (data consumers) query the view through the CASJobs query service and MyDB. Behind the cloud, data valet workflows (validation, exception notification, slice fault recovery) run on admin and load-merge machines; user-facing services run on production machines. Data flows in one direction, except for error recovery.]

Workflow is just a member of the orchestra

Workflow carries out the data loading and merging.
Features:
- Support scheduling of workflows for nightly load and merge
- Offer only controlled (protected) access to the workflow system
- Workflows are tested, hardened and seldom change; not a unit of reuse or knowledge sharing
- Fault tolerance: ensure recovery and cleanup from faults
- Assign clean-up workflows to undo state changes (see the sketch after this list)
- Provenance as a record of state changes (system management)
- Performance monitoring and logging for diagnostics
- Must "play well" in a distributed system
- Provide ground truth for the state of the system
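The "clean-up workflow" bullet amounts to a compensation pattern: every state-changing workflow is paired with a workflow that can undo its effects, and both outcomes are recorded as provenance. The sketch below illustrates that pairing only; all names (IValetWorkflow, CompensatingRunner, LoadWorkflow, CleanupLoadWorkflow, the logging delegate) are hypothetical, not Trident's actual API.

```csharp
// Minimal compensation sketch: a state-changing workflow is registered together
// with a clean-up workflow that can undo its state changes on a fault.
// All type and method names here are hypothetical; Trident's real API differs.
using System;

public interface IValetWorkflow
{
    void Run(string targetDatabase);
}

public static class CompensatingRunner
{
    // Run a state-changing workflow; if it faults, run its clean-up workflow so the
    // target database returns to a known-good state, and record every step as
    // provenance (the "record of state changes" from the slide).
    public static void Run(IValetWorkflow work, IValetWorkflow cleanup,
                           string targetDatabase, Action<string> logProvenance)
    {
        try
        {
            logProvenance($"START {work.GetType().Name} on {targetDatabase}");
            work.Run(targetDatabase);
            logProvenance($"COMMIT {work.GetType().Name} on {targetDatabase}");
        }
        catch (Exception ex)
        {
            logProvenance($"FAULT {work.GetType().Name}: {ex.Message}");
            cleanup.Run(targetDatabase);   // undo partial state changes
            logProvenance($"CLEANUP {cleanup.GetType().Name} on {targetDatabase}");
            throw;                         // surface the fault to the scheduler
        }
    }
}
```

Running the real load workflow would then look like CompensatingRunner.Run(new LoadWorkflow(), new CleanupLoadWorkflow(), "LoadDB_42", log), where the clean-up workflow drops or re-registers the partially loaded LoadDB (database and class names here are illustrative).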

Scientific data from sensors and instruments
- Time series, spatially distributed
Need to be ingested before use
- Go from Level 1 to Level 2 data
Potentially large, continuous stream of data
A variety of end users (consumers) of this data
Workflows shepherd raw bits from instruments to usable data in databases in the Cloud

[Diagram: data product lifecycle. Producers upload new data; data valets and curators accept, reject, or fix it; publishers publish the accepted data products; consumers download and query them and feed data corrections back into the pipeline.]

[Diagram: reference architecture. Valet workflows and user workflows run on shared compute resources against a shared queryable data store (split into data valet and user queryable stores) plus user storage, with configuration management, health and performance monitoring, and separate operator, data valet, and user interfaces; arrows distinguish data flow from control flow.]

[Diagram: the two data valet workflows.
Load workflow: Start → sanity check of network files, manifest, checksum → create and register an empty LoadDB from a template → for each CSV file in the batch: validate CSV file and table schema, BULK LOAD the CSV file into its table, perform CSV file/table validation → perform LoadDB/batch validation → determine the affine slice cold DB for the CSV batch → End. On a load fault: launch recovery operations, notify the admin.
Merge workflow: Start → determine "merge worthy" Load DBs and slice cold DBs → for each partition in the slice cold DB: switch OUT the slice partition to temp, UNION ALL over slice and load DBs into temp (filtered on the partition bound), post-partition load validation, switch IN temp to the slice partition → slice column recalculations and updates → post-slice load validation → End. On a merge fault: launch recovery operations, notify the admin.]
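Of the load steps above, "BULK LOAD CSV File into Table" plus its validation map naturally onto SQL Server's BULK INSERT statement. Below is a minimal sketch under assumed names (the connection string, table name, and the expected row count from the batch manifest are illustrative); it is not the Pan-STARRS implementation.

```csharp
// Sketch of one fine-grained load activity: bulk-load a single CSV file into a
// LoadDB table and validate the row count. Illustrative only; connection string,
// table and file names, and the validation rule are assumptions.
using System;
using System.Data.SqlClient;

public static class CsvBulkLoader
{
    public static void LoadCsv(string connectionString, string tableName,
                               string csvPath, long expectedRows)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // BULK INSERT reads the CSV directly on the database server, so the
            // file must sit on a share the server can reach.
            var bulk = new SqlCommand(
                $"BULK INSERT {tableName} FROM '{csvPath}' " +
                "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK)",
                conn);
            bulk.CommandTimeout = 0;   // large detection batches can take a while
            bulk.ExecuteNonQuery();

            // Post-load validation: compare loaded rows against the batch manifest.
            var count = new SqlCommand($"SELECT COUNT_BIG(*) FROM {tableName}", conn);
            long actualRows = (long)count.ExecuteScalar();
            if (actualRows != expectedRows)
                throw new InvalidOperationException(
                    $"Load validation failed for {csvPath}: expected {expectedRows} rows, found {actualRows}.");
        }
    }
}
```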

Monitor the state of the system
Data-centric and process-centric views
- What is the load/merge state of each database in the system?
- What are the active workflows in the system?
Drill down into actions performed:
- On a particular database to date
- By a particular workflow

Need a way to monitor the state of the system (databases & workflows)
Need a way to recover from error states
- Database states are modeled as a state transition diagram
- Workflows cause transitions from one state to another (see the sketch below)
- Provenance forms an intelligent system log
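One way to read "database states are modeled as a state transition diagram" is as an explicit state machine that only workflows may advance, with every transition appended to the provenance log. The sketch below assumes state names drawn from the load/merge pipeline; the SliceDatabase class, its transition table, and the logging delegate are illustrative, not the actual Pan-STARRS model.

```csharp
// Sketch: database states as an explicit state machine. A workflow may only move a
// database along an allowed edge, and every transition is appended to the provenance
// log, which is what lets provenance act as an "intelligent system log".
using System;
using System.Collections.Generic;

public enum DbState { Empty, Loading, Loaded, Merging, Merged, Hot, Faulted }

public sealed class SliceDatabase
{
    private static readonly Dictionary<DbState, DbState[]> Allowed = new Dictionary<DbState, DbState[]>
    {
        { DbState.Empty,   new[] { DbState.Loading } },
        { DbState.Loading, new[] { DbState.Loaded,  DbState.Faulted } },
        { DbState.Loaded,  new[] { DbState.Merging } },
        { DbState.Merging, new[] { DbState.Merged,  DbState.Faulted } },
        { DbState.Merged,  new[] { DbState.Hot } },
        { DbState.Faulted, new[] { DbState.Loading, DbState.Merging } },  // recovery re-runs the step
        { DbState.Hot,     new DbState[0] },
    };

    public string Name { get; }
    public DbState State { get; private set; } = DbState.Empty;

    public SliceDatabase(string name) { Name = name; }

    // Called by a workflow when it changes this database's state.
    public void Transition(DbState next, string workflowId, Action<string> provenanceLog)
    {
        if (Array.IndexOf(Allowed[State], next) < 0)
            throw new InvalidOperationException($"{Name}: illegal transition {State} -> {next}");
        provenanceLog($"{DateTime.UtcNow:o} {workflowId} moved {Name}: {State} -> {next}");
        State = next;
    }
}
```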

Faults are just another state
- Pan-STARRS aims to support 2 degrees of failure
- Up to 2 replicas out of 3 can fail and still be recovered

Provenance logs need to identify the type and location of failure
- Verification of fault paths
- Attribution of failure to human error, infrastructure failure, or data error
- Global view of system state during a fault

Fine-grained workflow activities
- Activity does one task
- Eases failure recovery
Capture inputs and outputs from workflow/activity (see the sketch below)
Relational/XML model for storing provenance
- Generic model supports complex .NET types
Identify stateful data in parameters
- Build a relational view on the data states
Domain-specific view
- Encodes semantic knowledge in the view query
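The "generic model supports complex .NET types" bullet suggests that arbitrary activity parameters are serialized into the provenance store. Below is a minimal sketch, assuming XML serialization via DataContractSerializer and one flat relational row per parameter; the record shape is illustrative and not Trident's actual provenance schema.

```csharp
// Sketch: capture an activity's parameters as provenance. Each parameter value,
// including complex .NET types, is serialized to XML so a single relational table
// can hold inputs and outputs generically. Record shape is illustrative only.
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

public sealed class ProvenanceRecord
{
    public Guid WorkflowInstanceId;
    public string ActivityName;
    public string ParameterName;
    public string Direction;      // "Input" or "Output"
    public string ClrType;
    public string ValueXml;       // serialized parameter value
    public DateTime CapturedAtUtc;
}

public static class ProvenanceCapture
{
    public static ProvenanceRecord Capture(Guid workflowInstanceId, string activityName,
                                           string parameterName, string direction, object value)
    {
        return new ProvenanceRecord
        {
            WorkflowInstanceId = workflowInstanceId,
            ActivityName = activityName,
            ParameterName = parameterName,
            Direction = direction,
            ClrType = value?.GetType().AssemblyQualifiedName,
            ValueXml = ToXml(value),
            CapturedAtUtc = DateTime.UtcNow
        };
    }

    // DataContractSerializer handles most serializable .NET types, which is one way a
    // "generic model" can store complex parameters as XML.
    private static string ToXml(object value)
    {
        if (value == null) return null;
        var serializer = new DataContractSerializer(value.GetType());
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, value);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```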

Fault recovery depends on provenance
- Missing provenance can leave the system unstable after a fault
Provenance collection is synchronous
Provenance events are published using reliable (durable) messaging (see the sketch below)
- Guarantees that the event will eventually be delivered
Provenance is reliably persisted
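The slides do not name the messaging technology behind "reliable (durable) messaging". As a sketch only, the example below assumes MSMQ (System.Messaging), a durable store-and-forward queue available on Windows; the queue path and event format are illustrative.

```csharp
// Sketch: publish a provenance event through a durable (store-and-forward) queue so
// the event survives process and machine restarts and is eventually delivered to the
// provenance store. MSMQ is assumed here purely for illustration.
using System.Messaging;

public static class ProvenanceEventPublisher
{
    private const string QueuePath = @".\Private$\TridentProvenance";  // assumed queue name

    public static void Publish(string provenanceEventXml)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath);

        using (var queue = new MessageQueue(QueuePath))
        using (var message = new Message(provenanceEventXml))
        {
            // Recoverable messages are written to disk by the queue manager, so a crash
            // between "activity finished" and "provenance persisted" cannot lose the event.
            message.Recoverable = true;
            queue.Send(message);
        }
    }
}
```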

Trident Logical Architecture
[Diagram: the Trident workbench (visualization, design, workflow packages) hosts scientific workflows on top of Windows Workflow Foundation and the Trident runtime services (service & activity registry, provenance, fault tolerance, WinHPC scheduling, monitoring service, runtime). Companion tools include the workflow monitor, administration console, workflow launcher, and a community/archiving web portal. A data access layer with a database-agnostic data object model spans SQL Server, SSDS cloud DB, S3, ….]

Role of workflow in data-intensive eScience: the data valet
Explore architectural patterns / best practices: scalability, fault tolerance and provenance implemented through workflow patterns
Reference architecture, to handle data from creation/capture to curated reference data, and serve as a platform for research: the GrayWulf reference architecture

Data Acquisition: Field sensor deployments and operations; field campaigns measuring site properties.
Data Assembly: "Raw" data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
Discovery and Browsing: "Raw" data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
Science Exploration: "Science variables" and data summaries for hypothesis testing and early exploration. Like discovery and browsing, but variables are computed via gap filling, unit conversions, or simple equations.
Domain Specific Analyses: "Science variables" combined with models, other specialized code, or statistics for deep science understanding.
Scientific Output: Scientific results via packages such as MatLab or R. Special rendering packages such as ArcGIS.
Archive: Data and analysis methodology stored for data reuse, or repeating an analysis.