1
Design of an Active Storage Cluster File System for DAG Workflows
Patrick Donnelly and Douglas Thain
University of Notre Dame
DISCS-2013, November 18, 2013
2
Task-based Workflow Engines

A Makeflow file describing a split/simulate/join DAG:

    part1 part2 part3: input.data split.py
        ./split.py input.data

    out1: part1 mysim.exe
        ./mysim.exe part1 > out1

    out2: part2 mysim.exe
        ./mysim.exe part2 > out2

    out3: part3 mysim.exe
        ./mysim.exe part3 > out3

    result: out1 out2 out3 join.py
        ./join.py out1 out2 out3 > result

Works on: Work Queue, SGE, Condor, Local.
Systems similar to Makeflow: Pegasus, Condor's DAGMan, Dryad.
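Running such a file is a single command: for example, makeflow -T wq w.makeflow executes the DAG through Work Queue, while -T condor, -T sge, or the default local driver select the other batch systems.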
3
Today’s (DAG-structured) Workflows
4
Big Data is “hard”

Scenario A: a master node distributes a 1 TB dataset directly to each worker.
Scenario B: the workflow management system (WMS) dispatches tasks to workers on a cloud and/or grid, and the workers access the dataset through a distributed file system (DFS).
5
Data Size Increases? Turn to DFS

Distributing task dependencies over the network becomes costly as data size increases, and many datasets are too large to be distributed from a master node at all. What is used in practice? NFS, AFS, Ceph, PVFS, GPFS: generic POSIX-compliant cluster file systems. Two problems follow:

Problem: data-locality is hard for the workflow to achieve.
  Contributing factor: the file systems offer no interface for locating the storage nodes that hold a file.
Problem: parallel applications accessing the same data are a denial-of-service waiting to happen (the herd effect).
  Contributing factor: maintaining POSIX semantics.
6
Other Options? Specialized Workflow Abstractions
7
Map-Reduce

Source: developers.google.com
8
Distributed File Systems -- Specialized

Hadoop Distributed File System: a specialized cluster file system for executing Map-Reduce. Its client API (from hdfs.h) looks like this:

    #include "hdfs.h"

    hdfsFS fs = hdfsConnect("default", 0);  /* connect to the configured NameNode */
    hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
    tSize nwritten = hdfsWrite(fs, writeFile, "hello", 6);  /* 6 bytes: "hello" plus its NUL */
    hdfsCloseFile(fs, writeFile);
9
HDFS architecture diagram. Source: hadoop.apache.org
10
Task-based Workflows on Hadoop?

Whole-file access by a single task is inefficient.
Hadoop job execution is not built for single tasks.
Only single-file dependencies can be expressed.
11
DAG Execution on Hadoop

Makeflow reads w.makeflow and submits each task through its batch-job interface (Work Queue, Hadoop, ...). On Hadoop, a task becomes a map-only job handed to the NameNode and executed on a DataNode:

    Job 1234
      Map:       ./split.py input.data
      Map Input: input.data
      Reduce:    (none)
12
Hadoop Job Throughput

Source: Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids, Albrecht et al., SWEET 2012.
13
Summary: Running Workflows on Large Datasets is Hard

Today, users have two solutions:

Use a generic POSIX distributed file system.
  - Problem: data-locality is hard for workflow managers to achieve.
  - Problem: parallel applications accessing data are a denial-of-service waiting to happen.

Use a specialized file system that executes a specific workflow abstraction.
  - Problem: users must rewrite applications to fit the workflow pattern (abstraction).
  - Problem: task-based workflows execute inefficiently.
14
Observations on DAG-Structured Workflows

1. Scientific workflows often re-use large datasets across multiple workflows.
2. Metadata interactions occur at task start and end.
3. Tasks consume whole files.
15
Cluster File System Overview

We have designed Confuga, an active storage cluster file system built for running task-based workflows.

Distinguishing features:
- Data-locality-aware scheduling with multiple dependencies per task.
- Drop-in replacement for other compute engines.
- Consistency maintained at task boundaries.
16
Confuga: An Active Storage Cluster File System

A single metadata server (MDS) fronts multiple storage nodes (S1, S2, S3). The MDS runs two services.

Replica Manager (file granularity):
    F1: S1, S2
    F2: S1, S3
    F3: S2

Namespace Manager (a regular directory hierarchy whose files point to file identifiers):
    /
    |__ readme.txt --> F1
    |__ users/
        |__ patrick/
            |__ blast.db --> F2
            |__ blast    --> F3
17
Replica Manager

Files are indexed using content-addressable storage: a file identifier such as F1 is the SHA1 checksum of the file's contents (e.g. SHA1: abcdef123456789, with a replica on s2.cluster.nd.edu). The Replica Manager's tasks:
- Ensure sufficient replication of files, restriping the cluster as necessary.
- Garbage-collect extra, unneeded replicas.

Replica map on the MDS:
    F1: S1, S2
    F2: S1, S3
    F3: S2
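To make the indexing concrete, here is a minimal Python sketch of content addressing as the slide describes it: a file's identifier is the SHA1 digest of its bytes, and the replica map keys on that identifier. The function and the in-memory dict are illustrative assumptions, not Confuga's actual code.

    import hashlib

    def file_id(path):
        # Content address: the SHA1 digest of the file's bytes, so
        # identical files share one identifier and one set of replicas.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    # Hypothetical replica map kept by the Replica Manager:
    # identifier -> storage nodes currently holding a copy.
    replicas = {"abcdef123456789": {"s1.cluster.nd.edu", "s2.cluster.nd.edu"}}

One nicety of this scheme is that replication and garbage collection become idempotent checks: a stored replica either matches its identifier or it does not.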
18
Namespace Manager

Maintains a mirror file system layout on the head node; regular files hold file identifiers (checksums). The namespace is the global synchronization point for file system updates.

    /
    |__ readme.txt --> F1
    |__ users/
        |__ patrick/
            |__ blast.db --> F2
            |__ blast    --> F3
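A companion sketch of the lookup this implies, assuming the namespace mirror lives under a directory on the head node and each regular file simply contains a checksum (the root path here is hypothetical):

    import os

    def resolve(namespace_root, path):
        # A namespace entry is an ordinary file whose contents are the
        # file identifier, e.g. "/users/patrick/blast.db" resolves to F2.
        entry = os.path.join(namespace_root, path.lstrip("/"))
        with open(entry) as f:
            return f.read().strip()

    # fid = resolve("/confuga/root", "/users/patrick/blast.db")
    # replicas[fid] then names the storage nodes to schedule against.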
19
Job Scheduler

A job description names the command along with its inputs and outputs in the global namespace:

    command: "blast blast.db > out",
    inputs:  { blast: "/users/patrick/blast", blast.db: "/users/patrick/blast.db" },
    outputs: { out: "/users/patrick/out" }

With replicas placed as F1, F2 on S1; F1, F3 on S2; F2 on S3, execution proceeds:

Step 1: the client submits the job to the head node.
Step 2: the head node copies F3 to S3, which already holds the other dependency, F2.
Step 3: the job is dispatched to S3 as task T1.
Step 4: S3 executes T1.
Step 5: S3 returns the result of T1.
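The steps above leave open how the target node is chosen. A plausible minimal heuristic, given here as an assumption rather than the paper's actual policy, is to dispatch to the storage node already holding the most of the task's inputs, so that the fewest dependencies need copying:

    def pick_node(task_inputs, replicas, nodes):
        # Score each node by how many of the task's input replicas
        # it already holds; dispatch to the best-scoring node.
        def local_count(node):
            return sum(1 for fid in task_inputs if node in replicas.get(fid, ()))
        return max(nodes, key=local_count)

    replicas = {"F1": {"S1", "S2"}, "F2": {"S1", "S3"}, "F3": {"S2"}}
    print(pick_node(["F1", "F2"], replicas, ["S1", "S2", "S3"]))  # -> S1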
20
Job Scheduler: Task Namespace

Execution is context-free, and commits are atomic and side-effect-free.

Task 1:
    command: "blast blast.db > out"
    namespace:
        blast.db: F2
        blast:    F3
        out:      OUTPUT

On the storage node, the sandbox binds each name to a replica in the store:

    drwxr-x--- 2 user users 4K 8:00 .
    lrwxrwxrwx 1 user users 49 8:00 blast.db -> ../store/F2
    lrwxrwxrwx 1 user users 49 8:00 blast -> ../store/F3
    -rw-r----- 1 user users  0 8:00 out
    $ ./blast blast.db > out

Task 1 result:
    exit status: 0
    namespace:
        out: F_out

The new output replica F_out then joins the files already stored on the executing node.
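A minimal sketch of how such a sandbox could be materialized on the storage node: each input in the task namespace becomes a symlink into the shared store, the command runs with the sandbox as its working directory, and any files it leaves behind are committed as new replicas on exit. The layout and helper names are illustrative assumptions:

    import os, subprocess, tempfile

    def run_task(command, inputs, store):
        # inputs maps task-namespace names to file identifiers,
        # e.g. {"blast.db": "F2", "blast": "F3"}.
        sandbox = tempfile.mkdtemp(prefix="task-")
        for name, fid in inputs.items():
            # Bind each namespace name to its local replica in the store.
            os.symlink(os.path.join(store, fid), os.path.join(sandbox, name))
        # Context-free execution: the task sees only its own namespace.
        status = subprocess.call(command, shell=True, cwd=sandbox)
        return status, sandbox  # files created here are committed on exit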
21
Exploiting DAG-structured Workflow Semantics

POSIX: every open, write, and close is a round trip from the user machine to the storage node.
AFS (commit-on-close): opens and writes are serviced locally; changes reach the storage node in a flush at close.
22
read-on-exec / commit-on-exit

Confuga goes a step further: a task's opens, reads, writes, and closes are all serviced locally on the storage node. Inputs are bound once when the task begins executing (read-on-exec) and outputs are committed once when it exits (commit-on-exit). This:
- eliminates inter-task synchronization requirements, and
- batches metadata operations.
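As a back-of-the-envelope model of why the batching matters (the counts are illustrative, not measurements from the paper): under POSIX semantics every operation is a synchronous round trip to the file server, while under read-on-exec/commit-on-exit the head node sees one metadata event per output file per task.

    def head_node_round_trips(n_writes, model):
        # POSIX: the open, every write, and the close all reach the server.
        if model == "posix":
            return n_writes + 2
        # commit-on-exit: a single commit when the task exits.
        return 1

    print(head_node_round_trips(1000, "posix"))           # 1002
    print(head_node_round_trips(1000, "commit-on-exit"))  # 1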
23
Why Confuga?

Integrates cleanly with current DAG workflows:
- the task namespace encapsulates data dependencies, and
- writes and reads resolve at task boundaries.
A global namespace allows data sharing and workflow checkpointing.
A task can express multiple dependencies.
Unnecessary metadata interactions are minimized.
24
Feature Comparison

The comparison matrix sets workflow-on-DFS, Hadoop, and Confuga against five criteria: data-locality, metadata scaling, large-file support, whether the application must be written as an abstraction, and support for task-based workflows.
25
Implementation: Confuga Using Chirp

Why Chirp?
- Most of Confuga can be implemented with a (slightly modified) remote file server.
- It interoperates with existing distributed computation tools.

In the Chirp architecture, a user application reaches the Chirp server over network RPC, either through libchirp directly or via a FUSE mount; the server enforces ACL policy over its local file system.
26
Extending Chirp

Standard Chirp: Chirp RPC with quota and ACL enforcement over the local file system; clients connect through libchirp or FUSE.
Confuga storage node: standard Chirp extended with job execution.
Confuga head node: standard Chirp extended with the job scheduler, namespace manager, and replica manager; together these export the Confuga file system, which clients again reach through libchirp or FUSE.
27
Concluding Thoughts

Smart adaptation to workflow semantics allows the file system to reduce metadata operations and to minimize cluster synchronization steps. Because the task namespace is explicit in the job description, the file system can schedule tasks near multiple dependencies.
28
Questions?

Patrick Donnelly (PDONNEL3@ND.EDU)
Douglas Thain (DTHAIN@ND.EDU)

Have a challenging distributed systems problem? Visit our lab at http://www.nd.edu/~ccl/ !

Source code: http://www.github.com/cooperative-computing-lab/cctools
29
Why content-addressable storage?
30
Batch Job Submission: Integrating Chirp with Makeflow

Today Makeflow runs against the local file system (./exe, ./data/a.db, ./data/b.db) and drives SGE, Condor, Work Queue, and others through submit, put, and get operations. Running against Chirp instead, Makeflow issues stat and submit calls to the Chirp server, which enforces its ACL policy. This requires:
- Makeflow to abstract access to the workflow namespace, and
- Chirp to support a job submission interface.
31
Rearchitecting Chirp's "multi" Interface

"multi" volume management in Chirp today:
- Files are striped round-robin across a static set of nodes.
- There is no replication.
- Locating a file requires a traversal of the namespace.
- Access is not provided by Chirp itself; clients go through the multi library.

Layout: a Chirp "head node" holds ./volume/hosts and a namespace such as ./volume/root/users/pdonnel3/blast --> S1:/abcd, while the data ./abcd itself lives on the Chirp storage server S1.
32
Changes to Chirp

Two services within the Confuga back-end file system:
- Replica Manager
- Namespace Manager
33
Publications

- Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop. IEEE CloudCom 2010.
- Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. SWEET at ACM SIGMOD 2012.
- Fine-Grained Access Control in the Chirp Distributed File System. IEEE CCGrid 2012.
34
Map-Reduce

The HDFS architecture is influenced by Map-Reduce:
- Block-oriented; no whole-file access.
- No multiple-file access.

Source: yahoo.com