DataLab: Active Storage for Data-Driven Scientific Computing
Brandon Rich and Douglas Thain, University of Notre Dame

Many data-intensive scientific computing applications are constrained not by the number of CPU cycles available, but by the ability of the I/O system to deliver data. To serve such applications, we present DataLab, an active storage solution in which a cluster of conventional machines is used primarily for its aggregate I/O capacity. Each node, called an active storage unit (ASU), is equipped with a local disk and processing capability. Large data sets are partitioned across the distributed storage units. Small programs are then dispatched to the location of the data that they wish to process, rather than vice versa.

Fundamental Operations
- Apply F on A into B, C, D
- Select F from B into D
- Compare F on X and Y into Z

Example Application in Biometrics
1. Convert 58,000 iris images from TIFF to BMP.
2. Select all images with a particular artifact.
3. Reduce all of those into a feature space.
4. Compare all features against each other to produce a matrix.
5. Retrieve the matrix of values from the system.

Why Active Storage?
- After deploying data once, you never have to move it again.
- Processing data is a matter of transmitting the function, not moving the data.

Useful Abstractions
- Create typed sets and populate them with data files.
- Define functions to act on that data.
- Apply, select, or compare (see figure at right).
- Output can be another data set or a report of results.

Architecture and Fault Tolerance
- Jobs are transaction-oriented, with fault tolerance at both the job and set level.
- We log and report errors on each individual execution.
- Our underlying execution model re-attempts failed executions.
- We maintain state information during job startup so that interrupted jobs may be resumed (see figure below).

What Comes Next?
- Better ways to accommodate adding or removing hosts from the pool.
- Data duplication to avoid data loss and to facilitate runtime job optimization.

Failure Recovery
Job startup has three phases:
1. Generate an execution plan for each host and record the job in the database.
2. Distribute the execution batch to each host; "begin" the host's work without starting it.
3. Commit the execution on each host.
Jobs that fail during any phase can be resumed by restarting the client.

Scalability and Speedup
TIFF-to-BMP image conversion of 58,000 images over n hosts shows superlinear scalability on a fast cluster and sublinear scalability on workstations.

Presented at High Performance Desktop Computing, 23 June 2008. This work was supported by National Science Foundation Grants CCF and CNS. Web address:
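The poster does not show DataLab's actual interface, but the three fundamental operations can be illustrated with a minimal sketch. All names here (`apply`, `select`, `compare`, and the example functions) are hypothetical, chosen only to mirror the Apply/Select/Compare semantics described above:

```python
# Hypothetical sketch of DataLab's three fundamental operations,
# modeled as pure functions over in-memory sets. In DataLab itself,
# the function would be shipped to the partitioned data, not vice versa.

def apply(f, a):
    """Apply: run f on every item of set A, producing a new set."""
    return [f(x) for x in a]

def select(pred, b):
    """Select: keep only the items of set B that satisfy a predicate."""
    return [x for x in b if pred(x)]

def compare(f, xs, ys):
    """Compare: evaluate f on every pair drawn from X and Y,
    producing a matrix of results."""
    return [[f(x, y) for y in ys] for x in xs]
```

Under this sketch, the biometrics pipeline above would read roughly as `bmps = apply(tiff_to_bmp, tiffs)`, `artifacts = select(has_artifact, bmps)`, `features = apply(extract_features, artifacts)`, and `matrix = compare(similarity, features, features)`, where the stage functions are placeholders for the real image-processing codes.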
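The three-phase, resumable job startup can also be sketched in miniature. This is not DataLab's implementation; it is an illustrative model, with all class and method names hypothetical, of how recording each phase in durable state lets a restarted client skip work that already finished:

```python
# Hypothetical sketch of transaction-oriented job startup with resume.
# State is persisted after each phase (plan, distribute, commit), so
# rerunning start() after a crash continues from the last recorded phase.
import json
import os

class Job:
    def __init__(self, statefile):
        self.statefile = statefile
        self.state = self._load() or {"phase": None, "hosts": {}}

    def _load(self):
        if os.path.exists(self.statefile):
            with open(self.statefile) as f:
                return json.load(f)
        return None

    def _save(self):
        with open(self.statefile, "w") as f:
            json.dump(self.state, f)

    # Placeholder phase actions; the real system talks to remote hosts.
    def plan(self, host):
        return {"batch": "work-for-" + host}

    def distribute(self, host, batch):
        pass

    def commit(self, host):
        pass

    def start(self, hosts):
        if self.state["phase"] is None:
            # Phase 1: plan per host and record the job durably.
            self.state["hosts"] = {h: self.plan(h) for h in hosts}
            self.state["phase"] = "planned"
            self._save()
        if self.state["phase"] == "planned":
            # Phase 2: distribute each host's batch without starting it.
            for h, batch in self.state["hosts"].items():
                self.distribute(h, batch)
            self.state["phase"] = "distributed"
            self._save()
        if self.state["phase"] == "distributed":
            # Phase 3: commit, which actually begins execution.
            for h in self.state["hosts"]:
                self.commit(h)
            self.state["phase"] = "committed"
            self._save()
```

A client that dies between any two saves can simply construct the `Job` again from the same state file and call `start()`; the completed phases are detected and skipped, which is the behavior the "resumed by restarting the client" claim describes.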