Active Storage Processing in Parallel File Systems
Jarek Nieplocha, Evan Felix, Juan Piernas-Canovas
SDM CENTER
2 Active Storage in Parallel Filesystems
Active Storage exploits the old concept of moving computation to the data source.
It avoids data movement across the network in a parallel machine by allowing applications to use compute resources on the I/O nodes of the cluster for data processing.
[Figure: two panels, "Traditional Approach" and "Active Storage". Each shows compute nodes (P) connected through the network to the file system (FS) on the I/O nodes, with the operation Y=foo(X). One panel additionally shows x and Y moving between the compute nodes and the I/O nodes.]
3 Example
BLAS DSCAL on disk: Y = α · Y
Experiment
  Traditional: The input file is read from the file system, and the output file is written to the same file system. The input file has 120,586,240 doubles.
  Active Storage: Each server receives the factor α, reads the array of doubles from its local disk, and stores the resulting array on the same disk. Each server processes 120,586,240/N doubles, where N is the number of servers.
Speedup is attributed to avoiding data movement between client and servers.
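The division of work in the Active Storage variant can be sketched as follows. This is only an illustration, not the code used in the experiment: the function names are hypothetical, each "server" is modelled as an in-memory chunk rather than a file on a local disk, and the chunks are assumed to be contiguous (for an element-wise operation such as DSCAL the actual file layout does not affect the result).

```python
from array import array

def partition(total, n_servers):
    """Split `total` elements into n_servers contiguous (start, stop) ranges."""
    base, extra = divmod(total, n_servers)
    ranges = []
    start = 0
    for i in range(n_servers):
        # The first `extra` servers take one additional element.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def dscal_local(alpha, chunk):
    """Scale one server's chunk in place: y[i] = alpha * y[i]."""
    for i in range(len(chunk)):
        chunk[i] *= alpha
    return chunk

def dscal_active(alpha, data, n_servers):
    """Hypothetical driver: every server scales only the chunk it holds."""
    chunks = [array("d", data[a:b]) for a, b in partition(len(data), n_servers)]
    for chunk in chunks:
        dscal_local(alpha, chunk)  # no data leaves the "server"
    return chunks
```

Only the scalar factor travels to the servers; the array itself is never sent to the client, which is the source of the speedup the slide describes.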
4 Related Work
The Active Disk/Storage concept was introduced a decade ago to use processing resources "near" the disk:
  On the disk controller.
  On processors connected to disks.
  Goal: reduce network bandwidth/latency limitations.
References
  DiskOS Stream Based model (ASPLOS '98: Acharya, Uysal, Saltz)
  Active Storage For Large-Scale Data Mining and Multimedia (VLDB '98: Riedel, Gibson, Faloutsos)
Research proved the Active Disk idea interesting, but difficult to take advantage of in practice:
  Processors in disk controllers are not designed for the purpose.
  Vendors have not been providing SDKs.
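5 Lustre Architecture
[Figure: Lustre architecture diagram. Components: Client, OST, MDS, connected through the NAL. Scale labels: O(10), O(1000), O(10000). Link labels: "Directory Metadata & concurrency"; "File IO & Locking"; "Recovery, File Status, File Creation".]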
6 Lustre Client
[Figure: client stack. Application (user space), LLITE, LOV, OSC, NAL.]
Application I/O requests enter from user space.
The LLITE module implements the Linux VFS layer.
LOV stripes the object and targets I/O to the correct object.
The client OSC packages up the request for transmission over the NAL.
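How a striping layer can direct an I/O request to the right object is sketched below. The sketch assumes plain round-robin (RAID-0 style) striping with a fixed stripe size; `locate`, `stripe_size`, and `stripe_count` are illustrative names, not the LOV interface.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object index, byte offset inside that object).

    Assumes round-robin striping: stripe 0 goes to object 0, stripe 1 to
    object 1, and so on, wrapping around after `stripe_count` objects.
    """
    stripe_no = offset // stripe_size          # which stripe of the file
    obj_index = stripe_no % stripe_count       # which object holds that stripe
    # Full stripes already stored in this object, plus the position in the stripe.
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset
```

For example, with a 1 MiB stripe size and 4 objects, file offset 4 MiB + 5 falls in the second stripe held by object 0, at byte 1 MiB + 5 of that object.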
7 Lustre Object Storage Server
[Figure: server stack. NAL, OST, OBDfilter, ext3.]
Requests arrive from the Portals NAL.
The Object Storage Target directs each request to the appropriate lower-level OBD.
OBDfilter presents ext3 as an Object Based Disk.
8 Current Implementation of Active Storage
[Figure: server stack. NAL, OST, ASOBD, OBDfilter, ext3, with ASDEV connecting ASOBD to a Processing Component in user space.]
The extra module passes data through until told to pipe data elsewhere.
Data is sent to a user space process through a Unix character device file.
Processed data is written back to disk.
Pattern: 1W->2W
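The role of the user space processing component can be illustrated with a small read-process-write loop. This is a sketch under assumptions: the device is represented by any binary file-like object, the name `pump` and the block size are hypothetical, and the real component's interface to the character device is not described on the slide.

```python
import io

def pump(src, dst, process, block_size=4096):
    """Read blocks from `src`, apply `process`, and write the result to `dst`.

    `src` stands in for the character device exposed to user space;
    `dst` stands in for the return path back to disk.
    Returns the number of input bytes consumed.
    """
    total = 0
    while True:
        block = src.read(block_size)
        if not block:              # empty read means end of stream
            break
        dst.write(process(block))
        total += len(block)
    return total
```

A usage example with in-memory streams: `pump(io.BytesIO(b"abc"), out, bytes.upper)` leaves `b"ABC"` in `out`. The processing function in such a loop has to tolerate arbitrary block boundaries, since it never sees the whole stream at once.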
9 Active Storage Application: High Throughput Proteomics
9.4 Tesla High Throughput Mass Spectrometer
  1 experiment per hour
  5000 spectra per experiment
  4 MByte per spectrum
  Per instrument: 20 GBytes per hour, 480 GBytes per day
Application Problem
  Given two float inputs, a target mass and a tolerance, find all the possible protein sequences that would fit into the specified range.
Active Storage Solution
  Each OST receives its part of the float pair sent by the client and stores the resulting processing output in its Lustre OBD (object-based disk).
Next generation technology will increase data rates x200.
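The per-instrument data rates follow directly from the three figures on the slide. The short calculation below reproduces them and applies the x200 factor; it assumes decimal units (1 GByte = 1000 MByte), and the function names are illustrative.

```python
SPECTRA_PER_EXPERIMENT = 5000
MBYTES_PER_SPECTRUM = 4
EXPERIMENTS_PER_HOUR = 1

def gbytes_per_hour(scale=1):
    """Data produced by one instrument per hour, in GBytes (1 GB = 1000 MB)."""
    mbytes = (SPECTRA_PER_EXPERIMENT * MBYTES_PER_SPECTRUM
              * EXPERIMENTS_PER_HOUR * scale)
    return mbytes / 1000

def gbytes_per_day(scale=1):
    """Data produced by one instrument per 24-hour day, in GBytes."""
    return gbytes_per_hour(scale) * 24

print(gbytes_per_hour(), gbytes_per_day())        # current instrument
print(gbytes_per_hour(200), gbytes_per_day(200))  # with the x200 increase
```

With the x200 factor, one instrument would produce 4000 GBytes per hour, or 96,000 GBytes per day.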
10 SC'2004 StorCloud Most Innovative Use Award
Sustained 4 GB/s Active Storage write processing
320 TB Lustre GB disks
40 Lustre OSS's running Active Storage
  4 Logical Disks (160 OST's)
  2 Xeon Processors
1 MDS
1 Client creating files
[Figure: a client system and the Lustre MDS connected over a Gigabit network to Lustre OSS 0 through Lustre OSS 39, each hosting Lustre OSTs.]
11 Real Time Visualization
12 Active Storage Processing Patterns
Pattern   Description
1W->2W    Data will be written to the original raw file. A new file will be created that will receive the data after it has been sent out to a processing component.
1W->1W    Data will be processed, then written to the original file.
1R->1W    Data that was previously stored on the OBD can be re-processed into a new file.
1W->0     Data will be written to the original file, and also passed out to a processing component. There is no return path for data; the processing component will do 'something' with the data.
1R->0     Data that was previously stored on the OBD is read and sent to a processing component. There is no return path.
1W->#W    Data is read from one file and processed, but there may be many files that are output from the processing component.
#W->1W    There are many inputs from various files being written as outputs from the processing component.
1R->1R    Data is read from a file on disk, sent to a processing component, then the output is sent to the reading process.
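Three of the write-side patterns can be expressed as small data-flow functions. This is an illustrative sketch, not the kernel module's logic: files are modelled as in-memory streams, the processing component as a plain Python callable, and all function names are hypothetical.

```python
import io

def write_1w_2w(data, raw_file, new_file, process):
    """1W->2W: store the raw data, and store the processed data in a new file."""
    raw_file.write(data)
    new_file.write(process(data))

def write_1w_1w(data, raw_file, process):
    """1W->1W: only the processed data reaches the original file."""
    raw_file.write(process(data))

def write_1w_0(data, raw_file, process):
    """1W->0: store the raw data and hand a copy to the processing component."""
    raw_file.write(data)
    process(data)  # no return path: any result is discarded here

# Usage with in-memory "files" and an upper-casing "processing component".
raw, new = io.BytesIO(), io.BytesIO()
write_1w_2w(b"spectrum", raw, new, bytes.upper)
```

After the call, `raw` holds the unmodified bytes and `new` holds the processed bytes, matching the 1W->2W row of the table.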
13 Status and Future Work
Status
  Proof-of-concept 1W->2W code works now
  Difficult to administer and use (2 people)
  Memory copies between user and kernel space
Future
  Implement other processing patterns
  Optimize performance by eliminating memory copies
  Implement Active Storage for PVFS
  Support different striping in files
  HDF, NetCDF, Database Stored Procedure Calls …