1 High-Performance Grid Computing and Research Networking Presented by David Villegas Instructor: S. Masoud Sadjadi sadjadi At cs Dot fiu Dot edu How to Use the Cluster?

2 Acknowledgements The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!  Henri Casanova, Principles of High Performance Computing

3 Is MPI enough? MPI launches jobs using rsh/ssh There is no control over who runs what! With multiple users on the cluster, we want privileges, authentication, fair-share scheduling...

4 Introducing Batch Schedulers A job scheduler provides more features to control job execution:  Interfaces to define workflows and/or job dependencies  Automatic submission of jobs  Interfaces to monitor executions  Priorities and/or queues to control the execution order of unrelated jobs

5 Batch Schedulers Most production clusters are managed via a batch scheduler:  You ask the batch scheduler to give you X nodes for Y hours to run program Z  At some point, the program will be started.  Later on you can look at the program output This is really different from what you’re used to, and honestly is sort of painful  No interactive execution Necessary because:  Since most applications are in this for high performance, they’d better be alone on their compute nodes  There are not enough compute nodes for everybody at all times

6 Scheduling criteria  Job priority  Compute resource availability  License keys, if the job uses licensed software  Execution time allocated to the user  Number of simultaneous jobs allowed per user  Estimated execution time  Elapsed execution time  Availability of peripheral devices  Occurrence of prescribed events  …

7 The case of GCB Rocks allows us to install different job schedulers: SGE, PBS, LSF, Condor… Currently we have SGE installed. Sun Grid Engine is an open source DRM (Distributed Resource Manager) sponsored by Sun Microsystems and CollabNet. It can be downloaded from

8 Our Cluster You have (or soon will get) an account on the cluster Question: once I am logged in, what do I do? Clusters are always organized as  A front-end node To compile code (and do minimal testing) To submit jobs  Compute nodes To run the code You don’t ssh to these directly In our case they are dual-processor Pentiums

9 How to use SGE as a user? You need to learn how to do three basic things  Check the status of the platform  Submit a job  Check on job status All can be done from the command line  Read the man pages  Google “SGE commands” Checking on platform and job status  qhost Information about nodes  qstat -f Information about queues  qstat -F [resource] Detailed information about resources  qstat Lists pending/running/finished jobs
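A typical status-checking session might look like this (a sketch assuming SGE’s command-line tools are on your PATH; the resource name mem_free is just one example of a queryable resource):

```shell
# Show all execution hosts with their load and memory
qhost

# Show full queue status, including which jobs run where
qstat -f

# Show detailed availability of one resource (here: free memory) per queue
qstat -F mem_free

# List your own pending and running jobs
qstat -u "$USER"
```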

10 How to use SGE as a user? (contd.) Submitting and controlling jobs  qsub We can pass the path to a binary or a script qsub -b yes Submits a binary qsub -q queue Specifies which queue the job will be sent to qsub -pe parallel-env n Requests a parallel environment with n slots  qdel Attempts to terminate one or more jobs But for those of you who don’t like the command line…  qmon Be sure that you are forwarding X11 and that you have an X server on your client machine!
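A minimal submit-check-delete cycle using these commands might look like this (a sketch; the queue name short.q and job ID 42 are illustrative):

```shell
# Submit a binary directly, no wrapper script needed
qsub -b yes /bin/hostname

# Submit a script to a specific queue
qsub -q short.q myjob.sh

# Check job status, then remove a job by its ID
qstat
qdel 42
```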

11 How to use SGE as a user? (contd.) But sending a single command is not very interesting…  Submitting scripts Scripts can submit many jobs We can pass options to SGE and consult environment variables. Example:  #$ -cwd Use the current directory as work directory  #$ -j y Join errors and output in the same file  #$ -N get_date Give a name to the job  #$ -o output.$JOB_ID Use a given file for output  $JOB_ID: The job number assigned by the scheduler to your job  $JOBDIR: The directory your job is currently running in  $USER: The username of the person currently running the job  $JOB_NAME: The job name specified by the -N option  $QUEUE: The queue the job is currently running in
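Putting these directives together, a complete submission script might look like this (a minimal sketch; run outside of SGE, the $JOB_* and $QUEUE variables are simply empty, since the #$ lines are plain comments to the shell):

```shell
#!/bin/sh
#$ -cwd               # use the current directory as work directory
#$ -j y               # merge stderr into stdout
#$ -N get_date        # name the job "get_date"
#$ -o output.$JOB_ID  # stdout goes to output.<job id>

# The script body is ordinary shell; SGE fills in the environment variables
echo "Job $JOB_NAME running on queue $QUEUE as $USER"
date
```

Submitted with `qsub get_date.sh`, the scheduler reads the #$ lines as options, exactly as if they had been passed on the qsub command line.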

12 How to use SGE as a user? (contd.)  Submitting parallel jobs We can define parallel environments to execute this kind of job. Parallel environments define startup procedures, maximum number of slots, users allowed to submit parallel jobs… Examples: mpich, lam… SGE allows “Tight Integration” with MPICH by intercepting the calls MPICH makes to run your job on other machines and replacing them with SGE calls, so that SGE can better monitor and manage your parallel jobs. It is also possible to integrate other MPI flavors with SGE
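A parallel submission script might then look like this (a sketch; the parallel environment name mpich, the slot count, and the program name are assumptions about the local configuration, while $NSLOTS and $TMPDIR/machines are the variables SGE provides under MPICH tight integration):

```shell
#!/bin/sh
#$ -cwd
#$ -N mpi_job
#$ -pe mpich 8        # request 8 slots in the "mpich" parallel environment

# Under tight integration, SGE generates the machine file for MPICH
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_mpi_program
```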

13 How to use SGE as an admin? Scheduler configuration. These values are found in /opt/gridengine/default/common/sched_configuration They can only be altered using qconf or qmon  algorithm default  schedule_interval 0:0:15  maxujobs 0  queue_sort_method load  job_load_adjustments np_load_avg=0.50  load_adjustment_decay_time 0:7:30  load_formula np_load_avg  schedd_job_info true  flush_submit_sec 0  flush_finish_sec 0  params none  reprioritize_interval 0:0:0  halftime 168  usage_weight_list cpu=1,mem=0,io=0  compensation_factor 5  …

14 How to use SGE as an admin? (contd.) Queue configuration  Queues are created with qmon or qconf qconf -shgrpl Show all host groups qconf -ahgrp group Add a new host group qconf -shgrp group Show details for one group qconf -sq queue Show a queue’s configuration qconf -Aq file Create a queue from a file  We’ll output a queue configuration to a file and modify it.  Exercise: create a short/test job queue. What are the best policies for this kind of queue?
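The dump-edit-create workflow described above can be sketched as follows (a sketch; the queue names all.q and short.q and the 10-minute runtime limit are illustrative choices for a short/test queue):

```shell
# Dump an existing queue's configuration to use as a template
qconf -sq all.q > short.q.conf

# Edit the template: rename the queue and cap the hard runtime limit
sed -i -e 's/^qname.*/qname short.q/' \
       -e 's/^h_rt.*/h_rt 0:10:0/' short.q.conf

# Create the new queue from the modified file
qconf -Aq short.q.conf
```

A short wall-clock limit plus a high sequence number is a common policy for test queues: quick jobs get through fast, and the queue is only picked when others are full.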

15 Queue parameters  qname  hostlist  seq_no  load_thresholds  suspend_thresholds  nsuspend  suspend_interval  priority  min_cpu_interval  processors  qtype  ckpt_list  pe_list  rerun  slots  tmpdir  shell  … For the rest, type man queue_conf

16 But, are local schedulers enough? Schedulers allow us to manage jobs in one or more clusters, but there are still some limitations for more complex systems:  Centralized job scheduling  Computing nodes are in the same location  Homogeneous software

17 Next step: GRID computing GRID computing allows us to make distant, heterogeneous clusters work together.  Coordinate multiple resources (discovery, access, allocation, monitoring)  Allow user authorization to provide secure access to resources  Provide open standards to improve interoperability  Give local control to organizations

18 How do we put everything together?

19 Wrapping up: What do we have?

20 In a nutshell…

21 Sending a job to SGE using GRAM Create a personal certificate with grid-cert-request Have it signed by the local CA Create a proxy with grid-proxy-init Submit it with globus-job-run localhost/jobmanager-sge /bin/hostname This is a very simple example that uses pre-WS GRAM services Globus still gives us more advanced features:  File staging  RSL (Resource Specification Language) and JSDL (Job Submission Description Language)  Access across organization boundaries  …
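The four steps above can be written out as a terminal session (a sketch assuming a working Globus Toolkit installation; the jobmanager-sge contact string matches the slide):

```shell
# 1. Generate a certificate request, then have it signed by the local CA
grid-cert-request

# 2. Create a short-lived proxy credential from the signed certificate
grid-proxy-init

# 3. Submit a job to SGE through the pre-WS GRAM jobmanager
globus-job-run localhost/jobmanager-sge /bin/hostname
```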

22 Some examples Hurricane mitigation Metascheduling and job flow management

23 Conclusion There is still a lot to explore!