VLab: A Cyberinfrastructure for Parameter Sampling Computations Suited for Materials Science Calculations
Cesar R. S. da Silva (1), Pedro R. C. da Silveira (1)
(1) Minnesota Supercomputing Institute, University of Minnesota
Work sponsored by NSF grant ITR and MSI
The VLab
"A cyberinfrastructure designed to facilitate and enable the execution of extensive calculations that can be broken into several decoupled tasks."
Typically parameter-sweep applications, such as:
- Weather and climate
- Oil exploration
- Stress tests of investment strategies
- Seismology
- Geodynamics
- Everybody's favorite: calculation of thermal properties of materials at high pressures and temperatures
VLab has three main roles
1 - Science Enabler
Empowering users to manage extensive workflows:
- Automatic workflow management
- Ease of use
- Collaborative support
- Diversity of tools for data analysis, visualization, etc.
Aggregating the throughput of scattered resources to cope with huge workloads:
- Distributed computations
- Fault tolerance
- Optimal scheduling
… three main roles
2 - Community Facility
- Available to the entire planetary materials community
- Provides a set of tools of common interest
3 - Virtual Organization
- Globally accessible through the WWW
- Strong collaborative support:
  - Shared access to projects
  - Collaborative data analysis with a synchronous view of data
  - Works in combination with teleconferencing software
- Allows geographically distributed groups to work on the same project
However, VLab is not:
1 - A program or software distribution
You can download the sources and create your own VLab, but you gain no advantage by doing so.
2 - A tool to calculate thermal properties of materials
This is just one VLab application. New applications can be developed as users show interest and willingness to participate.
The VLab
- Composed of a set of tools, made available to each other as Web Services distributed throughout the Internet.
Currently available tools include:
- Quantum ESPRESSO package tools
- Input preparation for pwscf, phonon, workflows, etc.
- Data analysis and visualization tools (VTK/OpenGL)
- Workflow management and monitoring tools
- Automatic generation of task input and recollection of output
- User interface consolidated through an easy-to-use portal
- and many more to come …
VLab Workflows
Typical VLab workflows, like the high-T C_ij calculation, involve iterations through the following steps:
1) Prepare inputs for tasks and generate execution packages containing the required files.
2) Dispatch the execution packages to compute nodes for execution.
3) Gather results for analysis and, if needed, iterate steps 1-3.
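The three steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: the names `Package`, `prepare_packages`, `dispatch`, `gather`, and `run_workflow` are hypothetical and are not VLab's actual interfaces, and `dispatch` is a local stand-in for execution on a remote compute node.

```python
from dataclasses import dataclass


@dataclass
class Package:
    """Hypothetical execution package: a task id plus the files it needs."""
    task_id: int
    files: dict  # file name -> file contents


def prepare_packages(parameters):
    """Step 1: build one execution package per parameter value."""
    return [
        Package(task_id=i, files={"input.in": f"pressure = {p}\n"})
        for i, p in enumerate(parameters)
    ]


def dispatch(package):
    """Step 2: stand-in for remote execution; returns a fake result record."""
    return {"task_id": package.task_id, "n_files": len(package.files), "ok": True}


def gather(results):
    """Step 3: collect results and report which tasks need another pass."""
    return [r["task_id"] for r in results if not r["ok"]]


def run_workflow(parameters, max_iterations=3):
    """Iterate steps 1-3 until every task succeeds or the limit is reached."""
    pending = list(parameters)
    completed = []
    for _ in range(max_iterations):
        packages = prepare_packages(pending)
        results = [dispatch(p) for p in packages]
        completed.extend(r for r in results if r["ok"])
        failed_ids = gather(results)
        if not failed_ids:
            break
        # Task ids index into the current pending list, so resubmit only those.
        pending = [pending[i] for i in failed_ids]
    return completed
```

In a real deployment the iteration in step 3 would more likely be driven by the analysis of the results than by task failure alone; the loop structure stays the same.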
Leverages computing capabilities of distributed resources (TeraGrid, OSG, scattered resources, other grids)
- Automatic task distribution and data recollection
- Exploits workflow-level parallelism to increase performance
- Optimal scheduling is an open field
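Workflow-level parallelism means that decoupled tasks can run at the same time and their results can be collected as they finish. A minimal sketch using Python's standard `concurrent.futures` module follows; `run_task` is a hypothetical placeholder for one independent task, and a thread pool on one machine stands in for distributed compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_task(task_id):
    """Placeholder for one decoupled task; returns (task id, result)."""
    return task_id, task_id ** 2


def run_all(task_ids, max_workers=4):
    """Submit every task at once and recollect results as they complete."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_task, t) for t in task_ids]
        for future in as_completed(futures):
            task_id, value = future.result()
            results[task_id] = value
    return results
```

Because the tasks share no state, the order in which they complete does not matter; results are keyed by task id.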
VLab - A Distributed System Approach
- Distributed components are replicated for:
  - Redundancy
  - Performance
  - Flexibility
- No central component to fail and bring everything down!
- Flexible scheduling for:
  - Cost
  - Turnaround time
  - Job throughput
  - Workload balance
  - System throughput
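One way to picture flexible scheduling is a scheduler that ranks candidate resources by whichever criterion the user selects. The sketch below is an assumption made for illustration, not VLab's scheduler: the `Resource` fields, the three policy names, and `pick_resource` are all hypothetical, and only three of the criteria listed above are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one compute resource."""
    name: str
    cost_per_hour: float
    est_wait_hours: float
    est_run_hours: float
    queued_tasks: int


def cost(resource):
    """Estimated charge for running one task on this resource."""
    return resource.cost_per_hour * resource.est_run_hours


def turnaround(resource):
    """Estimated time from submission to completion."""
    return resource.est_wait_hours + resource.est_run_hours


# Each policy maps a resource to a score; lower is better.
POLICIES = {
    "cost": cost,
    "turnaround": turnaround,
    "balance": lambda resource: resource.queued_tasks,
}


def pick_resource(resources, policy):
    """Return the resource with the lowest score under the chosen policy."""
    return min(resources, key=POLICIES[policy])
```

Keeping each policy as a plain scoring function makes it easy to add further criteria without touching the selection logic.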
VLab - What already works
- Automatic task distribution and data recollection
- Shared access to project monitoring tools and data
- Non-collaborative data analysis and 2D graphs
- High PT properties workflow and its sub-workflows
The High PT application completes successfully, generating a number of thermodynamic variables from a single input, with no user intervention during execution.
VLab - What has to be done
- Fault tolerance
  - Registry based
  - Redundant registry and metadata DB for data persistence
  - Full journaling of critical transactions for data (metadata) integrity
- Dynamic composition of Web Services
  - Will facilitate development of new applications
- Volumetric (3D) data visualization
  - Has to be rewritten from scratch
- Collaborative data analysis and visualization
  - Has an inconsistent UI
  - Erratic behavior with 2 or more simultaneous users
  - Support for synchronous view of data not yet implemented
… What has to be done
- Methodological improvements
  - Real-space symmetry operations in ESPRESSO -> reciprocal space
  - Numerical instability with Wentzcovitch VCS-MD -> (PR?)
  - Constant g-space cut-off in VCS-MD in ESPRESSO -> (?)
  - Fitting procedure in the High PT data analysis tool: the tool currently in use has a serious flaw.
VLab in Action
Live demo at the 2nd VLab Workshop 07: Calculation of High P,T Thermodynamic Properties
- Cubic MgO, 2-atom cell
- Static + lattice dynamics calculation
- {P_n} x {q_i} sampling
- Shows distributed computing capabilities
- Shows the ability to integrate visualization and data analysis tools
Visit the VLab web site:
VLab Service Oriented Architecture
On the Web: usage-oriented view of the VLab SOA => tree-like structure in 4 layers:
1) User interface (portal)
2) Workflow control and monitoring (Project Executor / Interaction)
3) Task dispatching / interaction, task data retrieval, auxiliary services
4) Heavy computations and visualization resources
C_ij Workflow
Left: Extensive high-T C_ij. Right: Detailed view of C_ij and phonon.
Scheduling => Fundamental importance for performance
The usual approach:
- Use agents that interact with the broker
- Problem: agents are not stateless!
  - More complicated to develop
  - Persistence must be guaranteed
The VLab approach:
- Use an independent WS to monitor workload
- Persistence of data is provided by a local DB
- Compute WS and Workload Monitor are stateless!
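The idea of a stateless workload monitor backed by a local DB can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table layout, `report_load`, and `least_loaded` are hypothetical, and SQLite simply stands in for whatever local database a deployment uses. The point is that each call opens the database, does its work, and closes it, so the service itself holds nothing in memory between requests.

```python
import sqlite3
from contextlib import closing

SCHEMA = """
CREATE TABLE IF NOT EXISTS workload (
    node    TEXT PRIMARY KEY,
    running INTEGER NOT NULL
)
"""


def _connect(db_path):
    """Open the local DB and make sure the workload table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def report_load(db_path, node, running):
    """Record how many tasks a compute node is currently running."""
    with closing(_connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO workload (node, running) VALUES (?, ?)",
                (node, running),
            )


def least_loaded(db_path):
    """Return the name of the node with the fewest running tasks, or None."""
    with closing(_connect(db_path)) as conn:
        row = conn.execute(
            "SELECT node FROM workload ORDER BY running ASC, node ASC LIMIT 1"
        ).fetchone()
    return row[0] if row else None
```

Because all state lives in the database file, the monitor process can be restarted or replaced without losing the workload picture.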
VLab - Not Just a Client/Server
The client/server approach:
- The portal and the supporting modules have access to a large central multi-processor system
- Can work as a facilitator but lacks other important features found in VLab:
  - No flexibility of scheduling
  - No redundancy => poor availability
  - No choice of cost (usually high)
Fault Tolerance
Only Project Executor sessions and a few user and project interaction sessions are required to be persistent. Therefore, a simple approach to fault tolerance (FT) is possible:
- Reactive: we have not identified any need for proactive FT.
- Registry based: persistent sessions are registered and must periodically inform the registry that they are "alive".
- Redundant registry and metadata DB for data persistence
- Full journaling (data and metadata) of critical transactions for data and metadata integrity
This guarantees that the state of any persistent session can be restored in case of failure.
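The registry-based scheme can be illustrated with a small heartbeat registry: each persistent session periodically reports that it is alive, and any session that stays silent longer than a timeout is flagged for recovery. This is a minimal sketch under stated assumptions; the `Registry` class, its method names, and the timeout value are hypothetical, and the redundancy and journaling parts of the scheme are not modeled.

```python
import time


class Registry:
    """Hypothetical in-memory registry of persistent sessions."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before a session is presumed dead
        self.clock = clock      # injectable clock, which makes testing easy
        self.last_seen = {}     # session id -> time of the last heartbeat

    def heartbeat(self, session_id):
        """Register a session, or refresh its 'alive' timestamp."""
        self.last_seen[session_id] = self.clock()

    def dead_sessions(self):
        """Return ids of sessions whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            sid for sid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A reactive recovery step would then restore each flagged session from its journaled state; that part depends on the journaling design and is left out here.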
VLab requirements
- Workflow management => facilitator/enabler
- Support for distributed computations
- Ease of use
- Support for collaboration
- Flexibility (update/add tools, new features)
- Fault tolerance
- Diversity of tools: analysis, visualization, data reduction, storage, etc.
Compute Performance x Throughput
Leveraging concurrent computing for features and performance
- High-performance parallel computing
- High-throughput distributed processing
The red line is the predicted optimal performance for up to 16 independent 4-way parallel tasks running concurrently (HTC job).
Basic Problem
Demand for extensive parameter sampling
- Typical high (P,T) study (e.g. thermal properties): {P_n} x {q_i} => ~10^2 jobs
- Large high (P,T) study (C_ij(P,T)): {P_n} x { i } x {q_j} => ~ jobs
- Future studies, extension to alloys (sampling over configurations): {{x_m}_l} x {P_n} x { i } x {q_j} => ~10^5 jobs
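The job counts above come from taking the Cartesian product of the sampled parameter sets, so each added sampling axis multiplies the number of jobs. A short sketch with `itertools.product` makes the arithmetic concrete; the grid sizes (10 pressures, 10 q-points) and the pressure values are assumed for illustration and are not taken from the slides.

```python
from itertools import product


def enumerate_jobs(*axes):
    """Return one job (a tuple of parameter values) per point of the sampling grid."""
    return list(product(*axes))


# Assumed grids: 10 pressures {P_n} and 10 q-points {q_i}.
pressures = [10.0 * n for n in range(10)]
q_points = list(range(10))

jobs = enumerate_jobs(pressures, q_points)
print(len(jobs))  # 100, i.e. on the order of 10^2 jobs
```

Passing a third or fourth axis to `enumerate_jobs` shows how quickly the count grows, which is why manual bookkeeping stops being practical.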
Basic Problem (cont. …)
Jobs to prepare, submit, monitor, and analyze results
- Manual work is prone to human error => unmanageable!
- First principles => sheer number ( ) of operations (today) => well over in 3-5 years
Fundamental Requirements
Enable users to manage these extensive workflows:
- Automatic workflow management
- Ease of use, collaborative support, diversity of tools, flexibility
Aggregate throughput to cope with huge workloads:
- Distributed computations, fault tolerance, optimal scheduling
The Big Challenge of Performance
- MPP systems are not very cost-effective for this class of problems
- FFT and matrix transposition: limited scalability or low performance per processor
Examples of Operational Procedure 1 - C_ij Workflow Input Preparation
Examples of Operational Procedure 2 - Consolidated view of the distributed workflow