Weaving Abstractions into Workflows


Weaving Abstractions into Workflows
Peter Bui, University of Notre Dame

Hello, my name is Peter Bui and I am a member of the Cooperative Computing Lab at the University of Notre Dame. Today, I will be talking about Weaver, a workflow compiler we have developed to enable users to program workflows and abstractions.

Programming Distributed Applications

- Distributed computing is hard: resource management, task scheduling, programming interface.
- Abstractions: structured ways of combining small executables into parallel graphs that can be scaled up to large sizes (e.g., Map-Reduce, All-Pairs).
- What if we want to use multiple abstractions? What if an abstraction is not available?

As we all know, programming distributed applications is hard. We have to deal with issues such as resource management and task scheduling, and we have to provide a convenient programming interface to our systems. Much of the past work at the Cooperative Computing Lab has dealt with abstractions, that is, structured ways of combining small executables into parallel graphs that can be scaled up. These abstractions are useful because the end user concentrates only on the logical domain problem and does not have to deal with the distributed aspects. (Introduce biometrics experiment.)

Workflow DAGs

- Workflows organize execution in terms of a directed acyclic graph (e.g., Makeflow, DAGMan).
- Pipeline abstractions together using the graph.
- Implement abstractions as nodes and links in the graph.

One way to address these shortcomings is to use abstractions in conjunction with a workflow system such as Makeflow or DAGMan. Using these workflow systems, we can not only pipeline abstractions together but also implement an abstraction as nodes and links in the graph. This solves two problems: how to combine abstractions, and how to use an abstraction even when the underlying distributed system does not directly support it.

Biometrics Experiment (Makeflow)

234437.bit: /bxgrid/fileid/234437 convert_iris_to_template
	./convert_iris_to_template /bxgrid/fileid/234437 234437.bit
234438.bit: /bxgrid/fileid/234438 convert_iris_to_template
	./convert_iris_to_template /bxgrid/fileid/234438 234438.bit
234439.bit: /bxgrid/fileid/234439 convert_iris_to_template
	./convert_iris_to_template /bxgrid/fileid/234439 234439.bit
ap_00_00.txt: 234437.bit 234437.bit compare_iris_templates
	./compare_iris_templates 234437.bit 234437.bit
ap_00_01.txt: 234437.bit 234438.bit compare_iris_templates
	./compare_iris_templates 234437.bit 234438.bit
ap_00_02.txt: 234437.bit 234439.bit compare_iris_templates
	./compare_iris_templates 234437.bit 234439.bit
ap_01_00.txt: 234438.bit 234437.bit compare_iris_templates
	./compare_iris_templates 234438.bit 234437.bit
ap_01_01.txt: 234438.bit 234438.bit compare_iris_templates
	./compare_iris_templates 234438.bit 234438.bit
ap_01_02.txt: 234438.bit 234439.bit compare_iris_templates
	./compare_iris_templates 234438.bit 234439.bit
...

Here is the Makeflow DAG specification for the biometrics experiment: one rule converts each iris image into a template, and one rule compares each pair of templates.

Biometrics Experiment (Makeflow), continued

(Same DAG specification as the previous slide.)

Manually constructing DAGs is a tedious, error-prone process. Sophisticated workflows require large DAGs. DAGs are too low-level: they are the assembly language of distributed computing.
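To make the tedium concrete, here is a minimal sketch of how the repetitive rules above could be generated from a file list instead of written by hand. The file IDs and executable names come from the slide; the helper itself is illustrative and is not part of Weaver.

```python
# Illustrative generator for the all-pairs Makeflow rules shown above.
# Each rule has the form "target: sources\n\tcommand", following the
# Makeflow specification on the previous slide.

def makeflow_rules(file_ids):
    rules = []
    # One conversion rule per input iris image.
    for fid in file_ids:
        rules.append(
            "%s.bit: /bxgrid/fileid/%s convert_iris_to_template\n"
            "\t./convert_iris_to_template /bxgrid/fileid/%s %s.bit"
            % (fid, fid, fid, fid))
    # One comparison rule per ordered pair of templates.
    for i, a in enumerate(file_ids):
        for j, b in enumerate(file_ids):
            rules.append(
                "ap_%02d_%02d.txt: %s.bit %s.bit compare_iris_templates\n"
                "\t./compare_iris_templates %s.bit %s.bit"
                % (i, j, a, b, a, b))
    return "\n".join(rules)

print(makeflow_rules(["234437", "234438", "234439"]))
```

With three inputs this already emits twelve rules; the quadratic growth of the comparison stage is exactly why hand-written DAGs do not scale.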

Biometrics Experiment (Weaver)

db   = SQLDataSet('db', 'biometrics', 'irises')
nefs = Query(db, db.c.state == 'Enrolled',
             Or(db.c.color == 'Blue',
                db.c.color == 'Green'))

convert = SimpleFunction('convert_iris_to_template')
compare = SimpleFunction('compare_iris_templates')

bits = Map(convert, nefs)
AllPairs(compare, bits, bits, output = 'matrix.txt',
         use_native = True)

This is an example of a workflow with two abstractions. It shows the use of a database to get input data, demonstrates the Query ORM function, and uses the native implementation of an abstraction if one is available.

Weaver

High-level framework that allows users to integrate abstractions into their workflows.

Unique features:
- Built on top of the Python programming language.
- Compiles workflows for multiple workflow managers (Makeflow, DAGMan).
- Constructs generic versions of abstractions as DAGs.

To address these issues, we have created Weaver, a high-level framework that allows users to integrate distributed computing abstractions into their workflows. Other systems such as Swift also facilitate constructing distributed workflows, but Weaver is unique in the following ways. First, it is built on top of Python, which provides a few advantages: users are often already familiar with the language, it is easy to deploy, and there is plenty of third-party software. Second, Weaver is workflow-manager agnostic: it creates the appropriate sandbox environment for both Makeflow and DAGMan. Third, it can generate abstractions as nodes in the DAG, allowing users to take advantage of an abstraction even if the execution environment does not natively support it.

Software Stack

So how does Weaver fit in with the systems we already have? Let's look at the Weaver software stack. At the bottom layer are the execution engines, such as SGE, Condor, and WorkQueue; these provide the mechanisms for resource management and task execution. The second layer consists of workflow managers, specifically DAG-based workflow engines such as Makeflow and DAGMan. Rather than scheduling tasks directly on the various execution engines, Weaver generates a DAG that can be used by a workflow manager to schedule tasks. This layer of separation allows Weaver scripts to be executed in a variety of different environments.

Programming Model

- Datasets: any iterable collection of Python objects; simple ORM interface through the Query function.
- Functions: binary executables or Python functions.
- Abstractions: describe how functions are applied to datasets; include Map, MapReduce, All-Pairs, and Wavefront.

So how do you use Weaver? Weaver provides a simplified programming model that consists of datasets, functions, and abstractions. Datasets are basically any iterable set of Python objects; we provide mechanisms for specifying datasets on filesystems and in databases. Functions are used to specify executables or Python functions. To organize functions into specific patterns of computation, we have the concept of abstractions. To create a workflow, users specify their datasets and functions and combine them by calling a series of abstractions.
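The semantics of these abstractions can be sketched in plain Python. This is not Weaver's implementation, only an illustration of what Map and All-Pairs mean; the callables here are toy stand-ins for the real executables.

```python
# Illustrative semantics of two Weaver abstractions over a plain
# Python dataset. In Weaver these would be compiled into DAG rules
# rather than executed directly.

def map_abstraction(func, dataset):
    # Apply func to every item, yielding one output per input.
    return [func(item) for item in dataset]

def all_pairs(func, set_a, set_b):
    # Apply func to every (a, b) combination, yielding a result matrix.
    return [[func(a, b) for b in set_b] for a in set_a]

# Toy stand-ins for convert_iris_to_template / compare_iris_templates:
convert = lambda iris: iris.upper()
compare = lambda t1, t2: t1 == t2

bits   = map_abstraction(convert, ["a", "b", "a"])
matrix = all_pairs(compare, bits, bits)
```

The biometrics workflow on the earlier slide has exactly this shape: a Map to build templates, then an All-Pairs over the Map's outputs.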

Compiler

After a specification is created, the user processes it with the Weaver compiler. The compiler takes the Python script and outputs a sandbox folder containing a DAG workflow specification along with symlinks to or copies of the input executables and data. To actually execute the workflow, the user simply runs the generated DAG with the appropriate workflow manager.
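As a rough picture of the sandbox step, the following sketch writes a generated DAG into a directory and symlinks the executables beside it. This is a hypothetical illustration of the idea, not Weaver's actual code; the function name and layout are assumptions.

```python
import os

# Hypothetical sketch of a compiler's sandbox output: a directory with
# the DAG specification plus symlinks to the input executables, so the
# workflow manager can be pointed at one self-contained folder.

def emit_sandbox(dag_text, executables, sandbox):
    os.makedirs(sandbox, exist_ok=True)
    # Write the generated DAG specification.
    with open(os.path.join(sandbox, "Makeflow"), "w") as f:
        f.write(dag_text)
    # Symlink each executable into the sandbox.
    for exe in executables:
        link = os.path.join(sandbox, os.path.basename(exe))
        if not os.path.exists(link):
            os.symlink(os.path.abspath(exe), link)
    return sandbox
```

The workflow manager then runs against the sandbox directory, which keeps the generated DAG and its dependencies together and reproducible.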

Map-Reduce

def wc_mapper(key, value):
    for w in value.split():
        print '%s\t%d' % (w, 1)

def wc_reducer(key, values):
    print '%s\t%d' % (key, sum(map(int, values)))

MapReduce(mapper  = wc_mapper,
          reducer = wc_reducer,
          input   = Glob('weaver/*.py'),
          output  = 'wc.txt')

Here is the Weaver implementation of the canonical word count application. The goal of this program is to compute how many times each word appears in a set of documents. In this case, we are taking the word count of the Weaver source code.
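The mapper/reducer logic can be checked without a cluster by simulating the shuffle step in process. This driver is illustrative only (Weaver would compile the MapReduce call into DAG rules instead); the mapper and reducer here yield and return pairs rather than printing tab-separated lines.

```python
from collections import defaultdict

# Minimal in-process driver for the word-count mapper/reducer above.
# The shuffle phase is simulated with a dictionary of key -> values.

def wc_mapper(key, value):
    for w in value.split():
        yield (w, 1)

def wc_reducer(key, values):
    return (key, sum(values))

def run_mapreduce(mapper, reducer, inputs):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):   # map phase
            groups[k].append(v)           # simulated shuffle
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce(wc_mapper, wc_reducer,
                       [("doc1", "to be or not to be")])
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```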

Molecular Dynamics Analysis

ANALYSIS_ROOT = '/afs/crc.nd.edu/user/h/hfeng/Public/DHFRTS_project/Analysis/'
residue_file  = os.path.join(ANALYSIS_ROOT, 'correlationCode', 'residueList')
residue_list  = [s.strip() for s in open(residue_file)]
residue_tars  = []

def make_residue_archiver(r):
    archiver = Function('tar_residue.py')
    archiver.output_string  = lambda i:    '_'.join(i.split()) + '.tar.gz'
    archiver.command_string = lambda i, o: './tar_residue.py ' + r
    return archiver

for r in residue_list:
    f = make_residue_archiver(r)
    t = Run(f, '', output = f.output_string(r))
    residue_tars.append(t[0])

comparer = SimpleFunction('dihedral_mutent.sh', out_suffix = 'tar')
comparer.add_functions(Glob(os.path.join(ANALYSIS_ROOT, 'correlationCode', '*py')))
merger   = SimpleFunction('tar_merge.sh', out_suffix = 'tar')

AllPairs(comparer, residue_tars, residue_tars,
         output = 'results.tar', merge_func = merger)

Here is a larger Weaver workflow that serves as the foundation for a molecular dynamics analysis framework: each residue is archived, every pair of residue archives is compared with All-Pairs, and the results are merged into a single output.

Conclusion

Weaver is...
- A workflow compiler and framework: DAGs are the assembly language, abstractions are the SIMD instructions.
- A Python library for distributed computing.
- A prototyping tool for abstraction builders.

Status:
- Work in progress, but in a usable and working state.
- Used for biometrics experiments.
- Foundation for a molecular dynamics analysis framework.

Source code: http://bitbucket.org/pbui/weaver

Multiple views of Weaver.