VLab: A Cyberinfrastructure for Parameter Sampling Computations Suited for Materials Science Calculations. Cesar R. S. da Silva, Pedro R. C. da Silveira.

Presentation transcript:

VLab: A Cyberinfrastructure for Parameter Sampling Computations Suited for Materials Science Calculations
Cesar R. S. da Silva (1), Pedro R. C. da Silveira (1)
(1) Minnesota Supercomputing Institute, University of Minnesota
Work sponsored by NSF grant ITR and MSI

The VLab
"A cyberinfrastructure designed to facilitate and enable the execution of extensive calculations that can be broken into several decoupled tasks."
Typically parameter-sweeping applications, such as:
- Weather and climate
- Oil search
- Stress tests of investment strategies
- Seismology
- Geodynamics
- Everybody's favorite: calculation of thermal properties of materials at high pressures and temperatures

VLab has three main roles
1 - Science Enabler
Empowering users to manage extensive workflows:
- Automatic workflow management
- Ease of use
- Collaborative support
- Diversity of tools for data analysis, visualization, etc.
Aggregating the throughput of scattered resources to cope with huge workloads:
- Distributed computations
- Fault tolerance
- Optimal scheduling

… three main roles
2 - Community Facility
- Available to the entire planetary materials community
- Provides a set of tools of common interest
3 - Virtual Organization
- Globally accessible through the WWW
- Strong collaborative support:
  - Shared access to projects
  - Collaborative data analysis with synchronous view of data
  - Works in combination with teleconference software
- Allows geographically distributed groups to work on the same project

However, VLab is not:
1 - A program or software distribution
You can download the sources and create your own VLab, but you gain no advantage by doing so.
2 - A tool to calculate thermal properties of materials
This is just one VLab application. New applications can be developed as users show interest and willingness to participate.

The VLab
Composed of a set of tools, made available to each other as Web Services distributed throughout the Internet. Currently available tools include:
- Quantum ESPRESSO package tools
- Input preparation for pwscf, phonon, workflows, etc.
- Data analysis and visualization tools (VTK/OpenGL)
- Workflow management and monitoring tools
- Automatic generation of task input and recollection of output
- User interface consolidated through an easy-to-use portal
- and many more to come …

The VLab

VLab Workflows
Typical VLab workflows, like the High-T C_ij calculation, involve iterations through the following steps:
1) Prepare inputs for tasks, and generate execution packages containing the required files.
2) Dispatch the execution packages to compute nodes for execution.
3) Gather results for analysis and eventually iterate steps 1-3.
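The three workflow steps can be sketched in Python as below. This is only an illustration, not VLab's implementation: the function names, the dictionary used as an "execution package", and the toy energy formula standing in for a real pwscf run are all assumptions, and a local thread pool stands in for remote compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_packages(parameters):
    """Step 1: build one execution package (here just a dict) per parameter value."""
    return [{"task_id": i, "pressure": p} for i, p in enumerate(parameters)]


def run_task(package):
    """Hypothetical stand-in for a compute node; a real task would run a solver."""
    p = package["pressure"]
    # Toy formula, used only so that each task returns some number.
    return {"task_id": package["task_id"], "pressure": p, "energy": -1.0 + 0.01 * p * p}


def dispatch(packages, max_workers=4):
    """Step 2: run the decoupled tasks concurrently and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, packages))


def gather(results):
    """Step 3: collect the results into a table keyed by the sampled parameter."""
    return {r["pressure"]: r["energy"] for r in results}


def run_workflow(parameters, iterations=1):
    """Iterate steps 1-3; a real workflow would refine the inputs between passes."""
    table = {}
    for _ in range(iterations):
        table.update(gather(dispatch(prepare_packages(parameters))))
    return table


if __name__ == "__main__":
    print(run_workflow([0, 10, 20, 30]))
```

Because the tasks are decoupled, the dispatch step needs no communication between tasks, which is what makes this class of workflow easy to spread over scattered resources.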

Leverages the computing capabilities of distributed resources (TeraGrid, OSG, scattered resources, other grids)
- Automatic task distribution and data recollection
- Exploits workflow-level parallelism to increase performance
- Optimal scheduling is an open field

VLab - A Distributed System Approach
- Distributed components are replicated for:
  - Redundancy
  - Performance
  - Flexibility
- No central component to fail and bring everything down!
- Flexible scheduling for:
  - Cost
  - Turnaround time
  - Job throughput
  - Workload balance
  - System throughput

VLab - What already works
- Automatic task distribution and data recollection
- Shared access to project monitoring tools and data
- Non-collaborative data analysis and 2D graphs
- High PT properties workflow and its sub-workflows
The High PT application completes successfully, generating a number of thermodynamic variables from a single input, with no user intervention during execution.

VLab - What has to be done
- Fault tolerance
  - Registry based
  - Redundant registry and metadata DB for data persistence
  - Full journaling of critical transactions for data (metadata) integrity
- Dynamic composition of Web Services
  - Will facilitate development of new applications
- Volumetric (3D) data visualization
  - Has to be rewritten from scratch
- Collaborative data analysis and visualization
  - Has an inconsistent UI
  - Erratic behavior with 2 or more simultaneous users
  - Support for synchronous view of data not yet implemented

… What has to be done
- Methodological improvements
  - Real-space symmetry operations in ESPRESSO -> reciprocal space
  - Numerical instability with Wentzcovitch VCS-MD -> (PR?)
  - Constant g-space cut-off in VCS-MD in ESPRESSO -> (?)
  - Fitting procedure in the High PT data analysis tool. The tool currently in use has a serious flaw.

VLab in Action
Live demo at 2nd VLab Workshop 07:
- Calculation of high P,T thermodynamic properties
- Cubic MgO
- 2-atom cell
- Static + lattice dynamics calculation
- {P_n} x {q_i} sampling
- Shows distributed computing capabilities
- Ability to integrate visualization and data analysis tools
Visit the VLab web site:

VLab Service Oriented Architecture
On the Web: usage-oriented view of the VLab SOA => tree-like structure in 4 layers:
1) User interface (Portal)
2) Workflow control and monitoring (Project Executor / Interaction)
3) Task dispatching / interaction, task data retrieval, auxiliary services
4) Heavy computations and visualization resources layer

C_ij Workflow
Left: extensive High-T C_ij. Right: detailed view of C_ij and phonon.

Scheduling => Fundamental importance for performance
The usual approach:
- Use agents that interact with the broker
- Problem: agents are not stateless!
  - More complicated to develop
  - Persistence must be guaranteed
The VLab approach:
- Use an independent WS to monitor workload
- Persistence of data is provided by a local DB
- Compute WS and Workload Monitor are stateless!
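A minimal sketch of the stateless idea, assuming a SQLite database plays the role of the local DB. The table layout, the function names, and the "fewest queued tasks" rule are illustrative assumptions, not VLab's actual scheduler; the point is only that every call reads and writes the database and no service keeps state in memory between calls.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the local DB that holds all persistent workload data."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS workload ("
        "service TEXT PRIMARY KEY, queued INTEGER NOT NULL)"
    )
    return db


def report_load(db, service, queued):
    """Workload monitor: record the latest queue length of a compute service."""
    db.execute(
        "INSERT OR REPLACE INTO workload (service, queued) VALUES (?, ?)",
        (service, queued),
    )
    db.commit()


def least_loaded(db):
    """Scheduler query: pick the service with the fewest queued tasks."""
    row = db.execute(
        "SELECT service FROM workload ORDER BY queued ASC, service ASC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    db = open_db()
    report_load(db, "compute-a", 5)  # hypothetical service names
    report_load(db, "compute-b", 2)
    print(least_loaded(db))
```

Since the functions hold no state of their own, any replica of the monitor can be restarted or swapped without losing workload information, as long as the database survives.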

VLab - Not Just a Client/Server
The Client/Server approach:
- The portal and the supporting modules have access to a large central multi-processor system.
- Can work as a facilitator but lacks other important features found in VLab:
  - No flexibility of scheduling
  - No redundancy => poor availability
  - No choice for cost (usually high)

Fault Tolerance
Only Project Executor sessions and a few user and project interaction sessions are required to be persistent. Therefore, a simple approach to Fault Tolerance (FT) is possible:
- Reactive: we have not identified any need for proactive FT.
- Registry based: persistent sessions are registered and must periodically inform the registry about their "alive" state.
- Redundant registry and metadata DB for data persistence.
- Full journaling of critical transactions for data and metadata integrity. This guarantees that the state of any persistent session can be restored in case of failure.
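The registry-based, reactive scheme can be illustrated with a small heartbeat registry. This is a sketch under assumptions: the class name, the timeout value, and the injectable clock are hypothetical, and redundancy and journaling are left out entirely.

```python
import time


class Registry:
    """Tracks when each persistent session last reported that it is alive."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before a session is presumed dead
        self.clock = clock      # injectable so the behavior can be tested
        self.last_seen = {}

    def heartbeat(self, session_id):
        """Called periodically by a session to report its 'alive' state."""
        self.last_seen[session_id] = self.clock()

    def dead_sessions(self):
        """Reactive check: list sessions whose last heartbeat is too old."""
        now = self.clock()
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)


if __name__ == "__main__":
    registry = Registry(timeout=30.0)
    registry.heartbeat("project-executor-1")  # hypothetical session name
    print(registry.dead_sessions())
```

A recovery step would then restore each session reported by `dead_sessions()` from the journaled state; that part depends on details the slides do not give.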

VLab requirements
- Workflow management => facilitator/enabler
- Support for distributed computations
- Ease of use
- Support for collaboration
- Flexibility (update/add tools, new features)
- Fault tolerance
- Diversity of tools: analysis, visualization, data reduction, storage, etc.

Compute Performance x Throughput
Leveraging concurrent computing for features and performance:
- High-performance parallel computing
- High-throughput distributed processing
The red line is the predicted optimal performance for up to 16 independent 4-way parallel tasks running concurrently (HTC job).

Basic Problem
Demand for extensive parameter sampling:
- Typical high (P,T) study (e.g., thermal properties): {P_n} x {q_i} => ~10^2 jobs
- Large high (P,T) study (C_ij(P,T)): {P_n} x {ε_i} x {q_j} => ~ jobs
- Future studies, extension to alloys (sampling over configurations): {{x_m}_l} x {P_n} x {ε_i} x {q_j} => ~10^5 jobs
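The job counts come from taking the Cartesian product of the sampling axes. The sketch below shows the arithmetic for the simplest case; the grid sizes (10 pressures, 10 q-points) are hypothetical values chosen only to reproduce the ~10^2 order of magnitude.

```python
from itertools import product


def enumerate_jobs(*axes):
    """Return one tuple per job: the Cartesian product of all sampling axes."""
    return list(product(*axes))


# Hypothetical grids: 10 pressures {P_n} and 10 phonon wave vectors {q_i}.
pressures = [10.0 * n for n in range(10)]
q_points = list(range(10))

jobs = enumerate_jobs(pressures, q_points)
print(len(jobs))  # 100 jobs, i.e. ~10^2
```

Each additional axis multiplies the count by its length, which is why adding further sampling axes to the same grids drives the number of jobs up so quickly.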

Basic Problem (cont. …)
- Jobs to prepare, submit, monitor, and analyze results
- Manual work is prone to human errors => Unmanageable!!!
- First principles => Sheer number ( ) of operations (today) => Well over in 3-5 years
Fundamental Requirements
Enable users to manage these extensive workflows:
- Automatic workflow management
- Ease of use, collaborative support, diversity of tools, flexibility
Aggregate throughput to cope with huge workloads:
- Distributed computations, fault tolerance, optimal scheduling

The Big Challenge of Performance
- MPP systems are not very cost-effective for this class of problems
- FFT and matrix transposition: limited scalability or low performance per processor

Examples of Operational Procedure 1 - C_ij Workflow Input Preparation

Examples of Operational Procedure 2 - Consolidated view of the distributed workflow