High-Performance Grid Computing and Research Networking

High-Performance Grid Computing and Research Networking
How to Use the Cluster?
Presented by David Villegas
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Acknowledgements
The content of many of the slides in these lecture notes has been adopted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, henric@hawaii.edu

Is MPI enough?
- MPI submits the jobs using rsh/ssh
- There is no control of who runs what!
- For multiple users in the cluster, we want to have privileges, authentication, fair-share...

Introducing Batch Schedulers
A job scheduler provides more features to control job execution:
- Interfaces to define workflows and/or job dependencies
- Automatic submission of executions
- Interfaces to monitor the executions
- Priorities and/or queues to control the execution order of unrelated jobs

Batch Schedulers
Most production clusters are managed via a batch scheduler:
- You ask the batch scheduler to give you X nodes for Y hours to run program Z
- At some point, the program will be started
- Later on you can look at the program output
This is really different from what you’re used to, and honestly is sort of painful:
- No interactive execution
Necessary because:
- Since most applications are in this for high performance, they’d better be alone on their compute nodes
- There are not enough compute nodes for everybody at all times

Scheduling criteria
- Job priority
- Compute resource availability
- License key if job is using licensed software
- Execution time allocated to user
- Number of simultaneous jobs allowed for a user
- Estimated execution time
- Elapsed execution time
- Availability of peripheral devices
- Occurrence of prescribed events
- …

The case of GCB
- Rocks allows us to install different job schedulers: SGE, PBS, LSF, Condor…
- Currently we have SGE installed.
- Sun Grid Engine is an open source DRM (Distributed Resource Manager) sponsored by Sun Microsystems and CollabNet.
- It can be downloaded from http://gridengine.sunsource.net

Our Cluster
- You have (or soon will get) an account on the cluster
- Question: once I am logged in, what do I do?
- Clusters are always organized as:
  - A front end node
    - To compile code (and do minimal testing)
    - To submit jobs
  - Compute nodes
    - To run the code
    - You don’t ssh to these directly
    - In our case they are dual-proc Pentiums

How to use SGE as a user?
You need to learn how to do three basic things:
- Check the status of the platform
- Submit a job
- Check on job status
All can be done from the command line:
- Read the man pages
- Google “SGE commands”
Checking on platform and job status:
- qhost: information about nodes
- qstat -f: information about queues
- qstat -F [resource]: detailed information about resources
- qstat: lists pending/running/done jobs
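The status commands on this slide can be collected into one small helper. This is only a sketch: the function name `sge_overview` is hypothetical, and the guard simply reports when the SGE client tools are not on the PATH (for example, when you are not logged in to the front end node).

```shell
#!/bin/bash
# Hypothetical helper: print an overview of the cluster using the
# commands from the slide, or a hint when SGE is not available.
sge_overview() {
    if ! command -v qstat >/dev/null 2>&1; then
        echo "SGE client commands not found; log in to the front end node first"
        return 1
    fi
    qhost       # information about nodes
    qstat -f    # information about queues
    qstat       # job list
}
```

Calling `sge_overview` on the front end node should print the three reports one after another.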

How to use SGE as a user? (contd.)
Submitting and controlling jobs:
- qsub: we can pass the path to a binary or a script
- qsub -b yes: submits a binary
- qsub -q queue_list: specifies the queue(s) to which the job will be sent
- qsub -pe parallel-env n: submits a parallel job that requests n slots
- qdel: attempts to terminate a range of jobs
But for those of you who don’t like the command line…
- qmon: be sure that you are forwarding X11 and that you have an X server on your client machine!
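A dry-run wrapper can help when trying out these options. The sketch below assumes a hypothetical helper named `submit` that only prints the `qsub` command line unless `RUN=1` is set; the queue name `all.q`, the parallel environment `mpich`, and the script names are placeholders, not taken from the cluster configuration.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper: prints the qsub command line,
# and only really submits when RUN=1 is set in the environment.
submit() {
    if [ "${RUN:-0}" = "1" ]; then
        qsub "$@"
    else
        echo "qsub $*"
    fi
}

submit -b yes /bin/hostname       # submit a binary
submit -q all.q myjob.sh          # send a script to a specific queue (placeholder names)
submit -pe mpich 4 mpi_job.sh     # parallel job requesting 4 slots (placeholder names)
```

Once the printed command lines look right, running the same script with `RUN=1` submits the jobs for real.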

How to use SGE as a user? (contd.)
But sending a single command is not very interesting…
Submitting scripts:
- Scripts can submit many jobs
- We can pass options to SGE and consult environment variables
Example:
  #$ -cwd                 Use the current directory as work directory
  #$ -j y                 Join errors and output in the same file
  #$ -N get_date          Give a name to the job
  #$ -o output.$JOB_ID    Use a given file for output
Environment variables:
  $JOB_ID: The job number assigned by the scheduler to your job
  $JOBDIR: The directory your job is currently running in
  $USER: The username of the person currently running the job
  $JOB_NAME: The job name specified by the -N option
  $QUEUE: Current running queue
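Put together, the directives above could form a complete job script along these lines. This is a minimal sketch: the script body and the fallback values are assumptions added so that the script also runs outside SGE, where `JOB_ID` and `JOB_NAME` are not set.

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -N get_date
#$ -o output.$JOB_ID

# SGE sets JOB_ID and JOB_NAME for a submitted job; the defaults below
# are only there so the script can be tested by running it directly.
job_id="${JOB_ID:-none}"
job_name="${JOB_NAME:-get_date}"

echo "Job $job_name ($job_id) ran on $(uname -n) at $(date)"
```

Saved as, say, `get_date.sh` (a hypothetical file name), it would be submitted with `qsub get_date.sh`; the lines starting with `#$` are read by SGE and ignored by the shell.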

How to use SGE as a user? (contd.)
Submitting parallel jobs:
- We can define parallel environments to execute this kind of job.
- Parallel environments define startup procedures, maximum number of slots, users allowed to submit parallel jobs…
- Examples: mpich, lam…
- SGE allows "Tight Integration" with MPICH by intercepting the calls MPICH makes to run your job on other machines, and replacing those calls with SGE calls so that it may better monitor and manage your parallel jobs. (Source: http://rc.usf.edu/sge/submit.php)
- It is also possible to integrate other MPI flavors with SGE
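A parallel job script might look like the sketch below, which writes the script to a file rather than running it. Several details are assumptions: the parallel environment is taken to be named `mpich`, the program name `./my_mpi_program` is a placeholder, and the machine file location `$TMPDIR/machines` follows the usual SGE MPICH integration templates and may differ on a given cluster.

```shell
#!/bin/bash
# Write a sample parallel job script (all names below are placeholders).
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -j y
#$ -N mpi_test
#$ -pe mpich 4

# NSLOTS holds the number of slots granted by SGE for this job.
# The machine file path is an assumption based on the common MPICH templates.
mpirun -np "$NSLOTS" -machinefile "$TMPDIR/machines" ./my_mpi_program
EOF

echo "Wrote mpi_job.sh; submit it with: qsub mpi_job.sh"
```

Requesting the slots with `-pe` inside the script keeps the slot count and the `mpirun` invocation in one place.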

How to use SGE as an admin?
Scheduler configuration:
- These values are found in /opt/gridengine/default/common/sched_configuration
- These values can only be altered using qconf or qmon
  algorithm                   default
  schedule_interval           0:0:15
  maxujobs                    0
  queue_sort_method           load
  job_load_adjustments        np_load_avg=0.50
  load_adjustment_decay_time  0:7:30
  load_formula                np_load_avg
  schedd_job_info             true
  flush_submit_sec            0
  flush_finish_sec            0
  params                      none
  reprioritize_interval       0:0:0
  halftime                    168
  usage_weight_list           cpu=1,mem=0,io=0
  compensation_factor         5
  …
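To look up a single scheduler parameter, the configuration listing can be filtered by name. The helper below is a sketch with a hypothetical name, `sched_param`; it reads `name value` lines on standard input, which is the layout shown on the slide.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one scheduler parameter,
# reading "name value" lines from standard input.
sched_param() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Example with two lines copied from the slide:
printf 'algorithm default\nschedule_interval 0:0:15\n' | sched_param schedule_interval
```

On the cluster, the input would normally come from `qconf -ssconf`, which prints the current scheduler configuration, for example `qconf -ssconf | sched_param schedule_interval`.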

How to use SGE as an admin? (contd.)
Queue configuration:
- Queues are created with qmon or qconf
- qconf -shgrpl: show all host groups
- qconf -ahgrp group: add a new host group
- qconf -shgrp group: show details for one group
- qconf -sq queue: show a queue configuration
- qconf -Aq file: create a queue from a file
We’ll output a queue configuration to a file and modify it.
Exercise: create a short/test job queue. Which are the best policies for this kind of queue?
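One way to approach the exercise is to dump an existing queue, rewrite a few fields, and load the result as a new queue. The sketch below shows only one possible policy, not the answer to the exercise: the helper name `make_short_queue`, the queue names `all.q` and `short.q`, and the ten-minute `h_rt` (hard run-time) limit are all assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical filter: turn a queue configuration read from standard input
# into one for a short-job queue by renaming it and capping the run time.
make_short_queue() {
    sed -e 's/^qname[[:space:]].*/qname short.q/' \
        -e 's/^h_rt[[:space:]].*/h_rt 0:10:0/'
}

# Intended use on the front end node (not executed here):
#   qconf -sq all.q | make_short_queue > short.q.conf
#   qconf -Aq short.q.conf
```

Other fields listed in `man queue_conf`, such as `slots` or `priority`, could be adjusted in the same way.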

Queue parameters
- qname
- hostlist
- seq_no
- load_thresholds
- suspend_thresholds
- nsuspend
- suspend_interval
- priority
- min_cpu_interval
- processors
- qtype
- ckpt_list
- pe_list
- rerun
- slots
- tmpdir
- shell
- …
For the rest, type man queue_conf

But, are local schedulers enough?
Schedulers allow us to manage jobs in one or more clusters, but there are still some limitations for more complex systems:
- Centralized job scheduling
- Computing nodes are in the same location
- Homogeneous software

Next step: GRID computing
GRID computing allows us to make distant, heterogeneous clusters work together:
- Coordinate multiple resources (discovery, access, allocation, monitoring)
- Allow user authorization to provide secure access to resources
- Provide open standards to improve interoperability
- Give local control to organizations

How do we put everything together?

Wrapping up: What do we have?

In a nutshell…

Sending a job to SGE using GRAM
- Create a personal certificate with grid-cert-request
- Have it signed by the local CA
- Create a proxy with grid-proxy-init
- Submit the job with globus-job-run localhost/jobmanager-sge /bin/hostname
This is a very simple example that uses pre-WS GRAM services.
Globus still gives us more advanced features:
- File staging
- RSL (Resource Specification Language) and JSDL (Job Submission Description Language)
- Access across organization boundaries
- …
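The same job can also be described in RSL. The sketch below builds a small RSL string and prints the command that would submit it; nothing is sent to the grid. The helper name `make_rsl` is hypothetical, and the `globusrun -o -r` invocation is given as an assumption about the pre-WS GRAM client, so check it against the local Globus installation.

```shell
#!/bin/bash
# Steps before submitting (from the slide, run once per user or session):
#   grid-cert-request    # create a personal certificate request
#   grid-proxy-init      # create a short-lived proxy from the signed certificate

# Hypothetical helper: build an RSL string for an executable and a process count.
make_rsl() {
    printf '%s\n' "&(executable=$1)(count=${2:-1})"
}

contact="localhost/jobmanager-sge"
rsl=$(make_rsl /bin/hostname 2)

# Print the submission command instead of running it.
echo "globusrun -o -r $contact '$rsl'"
```

Keeping the RSL in a variable makes it easy to add further attributes later without changing the submission line.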

Some examples
- Hurricane mitigation
- Metascheduling and job flow management

Conclusion
There is still a lot to explore!