SciDAC SDM Center All Hands Meeting, October 5-7, 2005, Northwestern University
Parallel I/O Middleware Optimizations and Future Directions
PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham

2 Outline
Progress and accomplishments – Wei-keng Liao
– Parallel netCDF
– Client-side file caching in MPI-IO
– Data-type I/O for non-contiguous file access in PVFS
Future research directions – Alok Choudhary
– I/O middleware
– Autonomic and active storage systems

3 Parallel NetCDF
NetCDF defines:
– A set of APIs for file access
– A machine-independent file format
Parallel netCDF work:
– New APIs for parallel access (a write example follows)
– Maintaining the same file format
Tasks:
– Built on top of MPI for portability and high performance
– Support C and Fortran interfaces
– Support external data representations
[Figure: P0-P3 accessing the parallel file system through serial netCDF vs. through Parallel netCDF]
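The following is a minimal sketch of a collective write through PnetCDF's high-level C API, to make the design above concrete. The file name, variable name, and array sizes are illustrative assumptions, not taken from the slides; error checking is omitted for brevity.

```c
/* Minimal PnetCDF sketch: each process writes its own slab of a shared
 * 2-D integer variable with the high-level (netCDF-like) API.
 * Compile against PnetCDF, e.g. mpicc demo.c -lpnetcdf. */
#include <mpi.h>
#include <pnetcdf.h>

#define NY 4   /* rows owned by each process (illustrative) */
#define NX 8   /* columns (illustrative)                    */

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid[2], varid;
    MPI_Offset start[2], count[2];
    int buf[NY][NX];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < NY; i++)
        for (int j = 0; j < NX; j++)
            buf[i][j] = rank;               /* fill with something recognizable */

    /* Collective create: every process opens the same file through MPI-IO. */
    ncmpi_create(MPI_COMM_WORLD, "demo.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);

    /* Define a global (NY*nprocs) x NX array, partitioned row-wise. */
    ncmpi_def_dim(ncid, "y", (MPI_Offset)NY * nprocs, &dimid[0]);
    ncmpi_def_dim(ncid, "x", NX, &dimid[1]);
    ncmpi_def_var(ncid, "var", NC_INT, 2, dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank writes its own row block collectively. */
    start[0] = (MPI_Offset)rank * NY;  start[1] = 0;
    count[0] = NY;                     count[1] = NX;
    ncmpi_put_vara_int_all(ncid, varid, start, count, &buf[0][0]);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```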

4 PnetCDF Current Status
The current version was released on July 27, 2005
Supported platforms:
– Linux clusters, IBM SP, SGI Origin, Cray X, NEC SX
Two sets of parallel APIs are completed:
– High-level APIs (mimicking the serial netCDF APIs)
– Flexible APIs (extended to utilize MPI derived datatypes; see the sketch below)
Fully supported in both C and Fortran
Support for large files (> 4 GB)
Test suites:
– Self-test codes ported from the Unidata netCDF package to validate against single-process results
– Parallel test codes for both sets of APIs
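To illustrate the flexible API set named above, here is a hedged sketch in which the in-memory buffer is described by an MPI derived datatype (a strided interior region with one ghost column on each side), so it can be written without the application packing it first. The ghost-cell layout, names, and sizes are illustrative assumptions.

```c
/* Sketch of PnetCDF's "flexible" API: the memory layout is described by an
 * MPI derived datatype, so a strided buffer (interior region surrounded by
 * ghost columns) is written without packing. Names and sizes are illustrative. */
#include <mpi.h>
#include <pnetcdf.h>

#define NY 4
#define NX 8

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid[2], varid;
    int local[NY][NX + 2];                      /* NX interior + 2 ghost columns */
    MPI_Offset start[2], count[2];
    MPI_Datatype interior;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < NY; i++)
        for (int j = 0; j < NX + 2; j++)
            local[i][j] = rank;

    ncmpi_create(MPI_COMM_WORLD, "flexible.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "y", (MPI_Offset)NY * nprocs, &dimid[0]);
    ncmpi_def_dim(ncid, "x", NX, &dimid[1]);
    ncmpi_def_var(ncid, "var", NC_INT, 2, dimid, &varid);
    ncmpi_enddef(ncid);

    /* NY blocks of NX ints with stride NX+2 ints: skips the ghost columns. */
    MPI_Type_vector(NY, NX, NX + 2, MPI_INT, &interior);
    MPI_Type_commit(&interior);

    start[0] = (MPI_Offset)rank * NY;  start[1] = 0;
    count[0] = NY;                     count[1] = NX;

    /* One element of the derived type, starting at the first interior cell. */
    ncmpi_put_vara_all(ncid, varid, start, count, &local[0][1], 1, interior);

    MPI_Type_free(&interior);
    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```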

5 Illustrative PnetCDF Users
FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
ACTM – atmospheric chemical transport model, LLNL
WRF-ROMS – regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
ASPECT – data understanding infrastructure, ORNL
pVTK – parallel visualization toolkit, ORNL
PETSc – portable, extensible toolkit for scientific computation, ANL
PRISM – PRogram for Integrated Earth System Modeling, users from C&C Research Laboratories, NEC Europe Ltd.
ESMF – Earth System Modeling Framework, National Center for Atmospheric Research
More …

6 PnetCDF Future Work
Non-blocking I/O APIs
Performance improvement for data-type conversion
– Type conversion while packing non-contiguous buffers
Extending PnetCDF for newer applications, e.g., data analysis and mining
Collaboration with application users

7 File Caching in MPI-IO
[Figure: I/O software stack – Applications, Parallel netCDF, MPI-IO, PVFS, storage devices]

8 File Caching for Parallel Apps
Why file caching?
– Improves performance for repeated file access
– Enables a write-behind strategy
  - Accumulates multiple small writes to better utilize network bandwidth (a generic sketch follows this slide)
  - May balance the workload for irregular I/O patterns
  - Useful for checkpointing
– Enables data pre-fetching
  - Useful for read-only applications (parallel data mining, visualization)
Why not just use traditional caching strategies?
– Each client performs caching independently → cache incoherence
– I/O servers are in charge of cache coherence control → potential I/O serialization
– Inadequate for parallel environments where application clients frequently read/write shared files
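As a generic illustration of the write-behind idea above (not the MPI-IO library's internal code), the sketch below stages small sequential writes in a local buffer and flushes them as one larger request. The buffer size and helper names are illustrative assumptions.

```c
/* Generic write-behind sketch: accumulate small sequential writes in a
 * staging buffer and flush them with a single larger request. Illustrative
 * only; not the actual caching implementation described on the slides. */
#include <string.h>
#include <unistd.h>

#define WB_CAPACITY (4 * 1024 * 1024)    /* illustrative: 4 MB staging buffer */

struct wb_buffer {
    int    fd;                 /* destination file descriptor       */
    off_t  file_offset;        /* where the buffered region starts  */
    size_t used;               /* bytes currently staged            */
    char   data[WB_CAPACITY];
};

/* Flush whatever has been accumulated with one system call. */
static void wb_flush(struct wb_buffer *wb)
{
    if (wb->used > 0) {
        pwrite(wb->fd, wb->data, wb->used, wb->file_offset);
        wb->file_offset += (off_t)wb->used;
        wb->used = 0;
    }
}

/* Stage one small sequential write; flush first if it would not fit. */
static void wb_write(struct wb_buffer *wb, const void *buf, size_t len)
{
    if (len >= WB_CAPACITY) {          /* too large to stage: write through */
        wb_flush(wb);
        pwrite(wb->fd, buf, len, wb->file_offset);
        wb->file_offset += (off_t)len;
        return;
    }
    if (wb->used + len > WB_CAPACITY)
        wb_flush(wb);
    memcpy(wb->data + wb->used, buf, len);
    wb->used += len;
}
```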

9 Caching Sub-system in MPI-IO
Application-aware file caching:
– A user-level implementation in the MPI-IO library
– MPI communicators define the subsets of processes operating on a shared file (see the sketch below)
– Processes cooperate with each other to perform caching
– Data cached in one client can be directly accessed by another
– Moves cache coherence control from servers to clients
– Distributed coherence control (less overhead)
– Supports both collective and independent I/O
[Figure: client processors with local cache buffers form a global cache pool, connected to the I/O servers through the network interconnect]
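As a hedged illustration of the communicator-scoped design, the sketch below shows only the application-visible side, using standard MPI-IO calls: whichever communicator is passed to MPI_File_open defines the group of processes that would cooperate on that file's cache. The even/odd split and file names are illustrative assumptions; the caching itself is internal to the library.

```c
/* Application-visible side of communicator-scoped caching: the set of
 * processes cooperating on a shared file is the communicator given to
 * MPI_File_open. Nothing here is specific to the Northwestern library. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm subcomm;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative: even and odd ranks work on different shared files, so
     * each group forms its own cooperative cache domain. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);

    const char *fname = (rank % 2 == 0) ? "even.dat" : "odd.dat";
    MPI_File_open(subcomm, fname, MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    /* ... collective or independent I/O on fh; any client-side caching is
     * coordinated only among the processes of `subcomm` ... */

    MPI_File_close(&fh);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
```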

10 Design
Cache metadata:
– File-block-based granularity
– Cyclically stored across all processes (a mapping sketch follows)
Global cache pool:
– Comprises the local memory of all processes
– A single copy of file data avoids coherence issues
Two implementations:
– Using an I/O thread (POSIX thread)
– Using the MPI remote-memory-access (RMA) facility
[Figure: the file is logically partitioned into blocks 0-11; the status metadata of the blocks is distributed cyclically over processes P0-P3, and each process contributes local memory pages to the global cache pool]
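A tiny sketch of the cyclic metadata placement just described: file offsets map to fixed-size blocks, and block i's metadata lives on process i mod nprocs. The block size and helper names are illustrative, not the implementation's actual constants.

```c
/* Cyclic cache-metadata placement sketch (illustrative constants). */
#include <mpi.h>

#define CACHE_BLOCK_SIZE (512 * 1024)   /* illustrative: 512 KB file blocks */

/* Which file block does this byte offset fall into? */
static MPI_Offset block_of(MPI_Offset file_offset)
{
    return file_offset / CACHE_BLOCK_SIZE;
}

/* Which process holds the cache metadata for that block? (cyclic layout) */
static int metadata_owner(MPI_Offset block_id, int nprocs)
{
    return (int)(block_id % nprocs);
}
```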

11 Example Read Operation
[Figure: a read first performs a lookup in the distributed metadata; a block that is already cached (e.g., block 3) is served from the page of the process caching it, while a block not yet cached is locked, read from the file system into a local page (e.g., page 2), recorded in the metadata, and then unlocked]

12 Future Work
Data pre-fetching:
– Instructional (through MPI info) and non-instructional (based on sequential access)
Collective write-behind for data check-pointing
Stand-alone distributed lock sub-system:
– Using the MPI-2 remote-memory-access facility
Design new MPI file hints for caching
Application I/O pattern study:
– Structured/unstructured AMR

13 Data-type I/O in PVFS
[Figure: I/O software stack – Applications, Parallel netCDF, MPI-IO, PVFS, storage devices]

14 Non-contiguous I/O
Four types:
– Contiguous both in memory and in file
– Contiguous in memory, non-contiguous in file
– Non-contiguous in memory, contiguous in file
– Non-contiguous both in memory and in file
Each segment is an I/O request of (offset, length); an MPI-IO file-view sketch follows
[Figure: memory and file layouts for each of the four cases]
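For the "contiguous in memory, non-contiguous in file" case, a hedged sketch using standard MPI-IO shows how a derived-datatype file view lets one collective call express all the (offset, length) segments at once. The sizes, interleaving pattern, and file name are illustrative assumptions.

```c
/* Non-contiguous-in-file access via an MPI file view: each rank owns every
 * nprocs-th block of the file, yet issues a single collective write. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;
    enum { BLOCKLEN = 64, NBLOCKS = 128 };     /* illustrative sizes (ints) */
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(NBLOCKS * BLOCKLEN * sizeof(int));   /* contiguous memory */
    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++) buf[i] = rank;

    /* In the file, each rank owns every nprocs-th block of BLOCKLEN ints:
     * NBLOCKS blocks, stride nprocs*BLOCKLEN ints -> non-contiguous in file. */
    MPI_Type_vector(NBLOCKS, BLOCKLEN, nprocs * BLOCKLEN, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "noncontig.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* The displacement shifts each rank to its first block. */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCKLEN * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* One collective call covers all NBLOCKS file segments. */
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCKLEN, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```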

15 Implementations
POSIX I/O:
– One call per (offset, length), as in the loop sketched after this slide
– Generates a large number of I/O requests
Data sieving:
– A single (offset, length) covering multiple segments
– Accesses unused data and introduces consistency-control overhead
List I/O:
– A single call handles multiple non-contiguous accesses
– Passes multiple (offset, length) pairs across the network
[Figure: an application's per-segment I/O requests handled by the client-side file system, vs. a list I/O request passed from the client-side to the server-side file system across the network]
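For contrast, a short sketch of the POSIX-style approach listed first above: each (offset, length) segment becomes its own system call, which is exactly the request explosion that data sieving, list I/O, and data-type I/O aim to avoid. The segment structure and helper are illustrative assumptions.

```c
/* POSIX-style non-contiguous write: one pwrite per (offset, length) segment. */
#include <fcntl.h>
#include <unistd.h>

struct segment { off_t offset; size_t length; };

/* Write `nseg` non-contiguous file segments from one contiguous buffer. */
static void posix_noncontig_write(int fd, const char *buf,
                                  const struct segment *seg, int nseg)
{
    size_t consumed = 0;
    for (int i = 0; i < nseg; i++) {
        /* Each segment is a separate request traveling client -> server. */
        pwrite(fd, buf + consumed, seg[i].length, seg[i].offset);
        consumed += seg[i].length;
    }
}
```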

16 Data-type I/O
A single request all the way to the servers
Abandons the offset-length pair representation:
– Borrows the MPI datatype concept to describe non-contiguous access patterns
– New file system data types
– New file system interfaces
An implementation in PVFS:
– Both client and server sides
[Figure: a datatype I/O request travels as a single request from the application process through the PVFS client to the PVFS server over the network]

17 Summary of Accomplishments
High-level I/O:
– Parallel netCDF
Low-level I/O:
– MPI-IO file caching
Parallel file system:
– Data-type I/O in PVFS
[Figure: Parallel netCDF / MPI-IO / PVFS stack]

18 Future Research

19 Typical Components in I/O Systems
Based on many current applications
High-level:
– E.g., NetCDF, HDF, ABC
– Applications use these
Mid-level:
– E.g., MPI-IO
– Performance experience
Low-level:
– E.g., file systems
– Critical for the performance of the layers above
More access information is lost as more components are used
[Figure: end-to-end path from Applications through Parallel netCDF/HDF5, MPI-IO, and the client-side file system on the compute node, across the network to the I/O servers; end-to-end performance is critical]

20 [Slide annotating the I/O stack – Applications, Parallel netCDF/HDF5/..., MPI-IO, client-side file system, network, I/O servers – with the information and capabilities at each layer:]
– Collectives, independents; I/O hints: access style (read_once, write_mostly, sequential, random, …), collective buffering, chunking, striping (a hint-passing sketch follows)
– Open mode (O_RDONLY, O_WRONLY, O_SYNC), file status, locking, flushing, cache invalidation
– Machine dependent: data shipping, sparse access, double buffering
– Access based on: file blocks, objects; scheduling, aggregation
– Read-ahead, write-behind, metadata management, file striping, security, redundancy
– Save attributes along with data, external data types (byte alignment), data structures (flexible dimensionality), hierarchical data model
– Access patterns: shared files, individual files, data partitioning, check-pointing, data structures, inter-data relationships
– Application-aware caching, pre-fetching, file grouping, "vector of bytes", flexible caching control, object-based data alignment, memory-file layout mapping, more control over hardware
– Shared file descriptors, group locks, flexible locking control, scalable metadata management, zero-copying, QoS
– Active storage: data filtering, object-based/hierarchical storage management, indexing, mining, power management
– Caching, fault tolerance, read-ahead, write-behind, I/O load balance, wide-area and heterogeneous file system support, thread safety
– Graph-based data model
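To make the hint mechanism named above concrete, here is a hedged sketch of passing hints through an MPI_Info object at file-open time. Only reserved hint keys from the MPI standard are used; the values are illustrative and an MPI-IO implementation is free to ignore them.

```c
/* Passing MPI-IO hints down the stack via an MPI_Info object. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_mostly,sequential");
    MPI_Info_set(info, "cb_buffer_size", "16777216");   /* collective buffering */
    MPI_Info_set(info, "striping_factor", "8");         /* file striping width  */

    MPI_File_open(MPI_COMM_WORLD, "hints.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```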

21 Goal: Decouple "What" from "How" and Be Proactive
[Figure: applications (App1-App4) with diverse characteristics (streaming, small/large, regular/irregular, local/remote configurations) sit above an I/O software optimization layer that understands the applications and applies caching, collective I/O, reorganization, load balancing, and fault tolerance across the underlying components (FS, DM, Datasets, HSS). Currently the user is burdened by ineffective interfaces and non-communicating layers; the targets are speed, bandwidth, latency, and QoS]

22 Component Design for I/O
Application-aware:
– Capture the application's file access information
– Relationships between files, objects, and users
Environment-aware:
– Network (reliability, security), storage devices (active disks)
Context-aware:
– Binding data attributes to files, indexing for fast search
High-performance I/O needs support from:
– Languages + compilers
– I/O libraries
– File systems
– Storage devices

23 Component Interface Design
Informative:
– Should deliver access/storage information top-down and bottom-up
Flexibility:
– Should describe arbitrary data distributions in memory buffers, files, and storage devices
Functionality:
– Asynchronous operations, read-ahead, write-behind, replication
– Provides the ability for additional innovation
Object-based I/O:
– For hardware control (I/O co-processors, active disks, object-based file systems, etc.)

24 Future Work in MPI-IO
Investigate interface extensions
Client-side caching sub-system:
– Implementations for various I/O strategies: buffering, pre-fetching, replication, migration
– Adaptive caching mechanisms and algorithms for optimizing different access patterns
Distributed mutual-exclusion locking sub-system:
– Shared resources, such as files and memory
– Pipeline locking (overlap lock waiting time with I/O)
Work with HDF5 and parallel netCDF:
– Design I/O strategies for metadata and data
  - Metadata: small, overlapping, repeated, strong consistency requirement
  - Array data: large, less frequent updates

25 Future Work in Parallel File Systems
File caching (focus on parallel apps)
File versioning:
– An alternative to file locking
– Reliability and availability aspects:
  - Guarantees atomicity in the presence of client or I/O system failure
  - Can enable efficient RAID-type schemes in PFS (because of atomicity)
Dynamic rebalancing of I/O
File list lock:
– Locks on multiple regions in a single request

26 Active Storage System (reconfigurable system)
[Figure: testbed of an ML310 host and four ML310 boards connected through a switch to the external network]
Xilinx XC2VP30, Virtex-II Pro family:
– 30,816 logic cells (3,424 CLBs)
– 2 PPC405 embedded cores
– 2,448 Kb of block RAM (BRAM)
– 136 dedicated 18x18 multiplier blocks
Software:
– Data mining
– Encryption
– Functions and runtime libraries
– Linux micro-kernel

27 MineBench - data mining benchmark suite