Weaving Abstractions into Workflows Hello, my name is Peter Bui and I am a member of the Cooperative Computing Lab at the University of Notre Dame. Today, I will be talking about Weaver, a workflow compiler we have developed to enable users to program workflows and abstractions. Peter Bui University of Notre Dame
Programming Distributed Applications Distributed computing is hard Resource management, task scheduling, programming interface Abstractions Structured way of combining small executables into parallel graphs that can be scaled up to large sizes Map-Reduce All-Pairs What if we want to use multiple abstractions? What if the abstraction is not available? As we all know, programming distributed applications is hard. We have to deal with issues such as resource management, task scheduling and provide a convenient programming interface to our systems. A lot of the work that we have done at the Cooperative Computing Lab in the past has dealt with Abstractions. That is structured ways of combining small executables into parallel graphs that can be scaled up. These abstractions are useful because the end user only concentrates on the logical domain problems and doesn't have to deal with the distributed aspects. Introduce biometrics experiment.
Workflow DAGs Workflows Organize execution in terms of directed acyclic graph Makeflow DAGMan Pipeline abstractions using graph Implement abstractions as nodes and links in a graph One way to solve these shortcomings with abstractions is to use them in conjunction with a workflow system such as Makeflow or DAGMan. Using these workflow systems we can not only pipeline abstractions together, but we can also implement the abstraction as nodes and links in the graph. This solves the problem of how do we combine abstractions and how do we enable the use of abstractions even if the underlying distributed system does not directly support it.
Biometrics Experiment (Makeflow) 234437.bit: /bxgrid/fileid/234437 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234437 234437.bit 234438.bit: /bxgrid/fileid/234438 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234438 234438.bit 234439.bit: /bxgrid/fileid/234439 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234439 234439.bit ap_00_00.txt: 234437.bit 234437.bit compare_iris_templates ./compare_iris_templates 234437.bit 234437.bit ap_00_01.txt: 234437.bit 234438.bit compare_iris_templates ./compare_iris_templates 234437.bit 234438.bit ap_00_02.txt: 234437.bit 234439.bit compare_iris_templates ./compare_iris_templates 234437.bit 234439.bit ap_01_00.txt: 234438.bit 234437.bit compare_iris_templates ./compare_iris_templates 234438.bit 234437.bit ap_01_01.txt: 234438.bit 234438.bit compare_iris_templates ./compare_iris_templates 234438.bit 234438.bit ap_01_02.txt: 234438.bit 234439.bit compare_iris_templates ./compare_iris_templates 234438.bit 234439.bit ... Here is the Makeflow DAG specification for the biometrics experiment.
Biometrics Experiment (Makeflow) 234437.bit: /bxgrid/fileid/234437 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234437 234437.bit 234438.bit: /bxgrid/fileid/234438 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234438 234438.bit 234439.bit: /bxgrid/fileid/234439 convert_iris_to_template ./convert_iris_to_template /bxgrid/fileid/234439 234439.bit ap_00_00.txt: 234437.bit 234437.bit compare_iris_templates ./compare_iris_templates 234437.bit 234437.bit ap_00_01.txt: 234437.bit 234438.bit compare_iris_templates ./compare_iris_templates 234437.bit 234438.bit ap_00_02.txt: 234437.bit 234439.bit compare_iris_templates ./compare_iris_templates 234437.bit 234439.bit ap_01_00.txt: 234438.bit 234437.bit compare_iris_templates ./compare_iris_templates 234438.bit 234437.bit ap_01_01.txt: 234438.bit 234438.bit compare_iris_templates ./compare_iris_templates 234438.bit 234438.bit ap_01_02.txt: 234438.bit 234439.bit compare_iris_templates ./compare_iris_templates 234438.bit 234439.bit ... Manually constructing DAGS is a tedious, error-prone process. Sophisticated workflows require LARGE DAGs. DAGs are too low-level. DAGs are the assembly language of distributed computing. Here is the Makeflow DAG specification for the biometrics experiment.
Biometrics Experiment (Weaver) db = SQLDataSet('db', 'biometrics', 'irises') nefs = Query(db, db.c.state == 'Enrolled', Or(db.c.color == 'Blue', db.c.color == 'Green')) convert = SimpleFunction('convert_iris_to_template') compare = SimpleFunction('compare_iris_templates') bits = Map(convert, nefs) AllPairs(compare, bits, bits, output = 'matrix.txt', use_native = True) This an example of a workflow with two abstractions. Shows the use of a database to get input data. Demonstrates use of Query ORM function. Use native function if available.
Weaver High-level framework that allows users to integrate abstractions into their workflows. Unique Features Built on top of Python programming language. Compiles workflows for multiple workflow managers (Makeflow, DAGMan). Construct generic versions of abstractions as DAGs. To address these issues, we have created Weaver, a high-level framework that allows users to integrate distributed computing abstractions into their workflows. There have been other systems such as Swift that also facilitate constructing distributed workflows, but Weaver is unique in the following ways: 1. It is built on top of Python, which provides a few advantages. 1. User familiar with scripting and language. 2. Easy to deploy. 3. Lots of third-party software. 2. Weaver is workflow manager agnostic, meaning it creates the appropriate sandbox environment for both Makeflow and DAGman. 3. It can generate abstractions as nodes in the DAG, allow users to take advantage of an abstraction even if the execution environment doesn't natively support it.
Software Stack So how does Weaver fit in with the systems we already have? Let's look at the Weaver software stack. At the bottom layer are the execution engines such as SGE, Condor, and WorkQueue. These provide the mechanisms for resource management and task execution. The second layer consist of workflow managers, specifically DAG-based workflow engines such as Makeflow and DAGMan. Rather than scheduling tasks directly to the various execution engines, Weaver generates a DAG that can be used by a workflow manager to schedule tasks. This layer of separation allows for Weaver scripts to be executed in a variety of different environments.
Programming Model Datasets Any iterable collection of Python objects. Simple ORM interface through Query function. Functions Binary executables or Python Functions. Abstractions Describe how functions are applied to dataset. Includes: Map, MapReduce, All-Pairs, Wavefront So how do you use Weaver? Weaver provides a simplified programming model that consists of datasets, functions, and abstractions. Datasets are basically any iterable set of Python objects. We provide mechanisms for specify datasets on filesystems and in databases. Functions are used to specify executables or Python functions. To organize functions into specific patterns of computation, we have the concept of abstractions. To create a workflow, users specify their dataset and functions and combine them by calling a series of abstractions.
Compiler After a specification is created, the user processes it using the Weaver compiler. The compiler will take the Python script and output a sandbox folder containing a DAG workflow specification along with symlinks or copies of the input executables and data. To actually execute the workflow, the user simply uses the generate DAG with the appropriate workflow manager.
Map-Reduce def wc_mapper(key, value): for w in value.split(): print '%s\t%d' % (w, 1) def wc_reducer(key, values): print '%s\t%d' % (key, sum(map(int, values)) MapReduce( mapper = wc_mapper, reducer = wc_reducer, input = Glob('weaver/*.py'), output = 'wc.txt') Here is the Weaver implementation of the canonical word count application. The goal of this program is to compute how many times each word in a set of documents appears. In this case, we are taking the work count of the Weaver source code.
Molecular Dynamics Analysis ANALYSIS_ROOT = '/afs/crc.nd.edu/user/h/hfeng/Public/DHFRTS_project/Analysis/' residue_file = os.path.join(ANALYSIS_ROOT, 'correlationCode', 'residueList') residue_list = [s.strip() for s in open(residue_file)] residue_tars = [] def make_residue_archiver(r): archiver = Function('tar_residue.py') archiver.output_string = lambda i: '_'.join(i.split()) + '.tar.gz' archiver.command_string = lambda i, o: './tar_residue.py ' + r return archiver for r in residue_list: f = make_residue_archiver(r) t = Run(f, '', output = f.output_string(r)) residue_tars.append(t[0]) comparer = SimpleFunction('dihedral_mutent.sh', out_suffix = 'tar') comparer.add_functions(Glob(os.path.join(ANALYSIS_ROOT, 'correlationCode', '*py'))) merger = SimpleFunction('tar_merge.sh', out_suffix = 'tar') AllPairs(comparer, residue_tars, residue_tars, output = 'results.tar', merge_func = merger) Here is the Weaver implementation of the canonical word count application. The goal of this program is to compute how many times each word in a set of documents appears. In this case, we are taking the work count of the Weaver source code.
Conclusion Weaver is... Status Workflow compiler and framework. DAGs = Assembly Language Abstractions = SIMD instructions Python library for distributed computing. Prototyping tool for abstraction builders. Status Work in progress, but usable and working state. Used for biometrics experiments. Foundation for molecular dynamics analysis framework. Source code: http://bitbucket.org/pbui/weaver Multiple views of Weaver.