Clusters at IIT KANPUR - 1 Brajesh Pande Computer Centre IIT Kanpur.


Agenda
– Grid Definitions and Classifications
– Elements of Clusters
– Clusters at IITK
– Resource Management and Usage (Grid Engine)

Grid / Cluster Computing
A grid is a collection of resources used to perform a task. Users view it as one large, powerful distributed system with a few points of access, and usually treat it as a single computational resource. A resource manager takes up the task of submitting jobs to the grid; the user does not need to know where a job is actually run.

Grid Classification
Cluster Grids
– A set of hosts working together with a single point of access
– Single owner, single site, single organization
Campus Grids
– Shared heterogeneous nodes within a geographical boundary, usually an educational or corporate campus
– Multiple owners, single site, single organization
Global Grids
– Collections of campus grids that cross organizational boundaries; users can access resources far beyond what is available within their own organization (at a cost)
– Multiple owners, multiple sites, multiple organizations

Clusters: the Management View
– A set of SMP / non-SMP machines (nodes / blades)
– Connected nodes (high-speed interconnect)
– OS deployment tools
– Maintenance and monitoring systems
– Parallel computing environments and compilers
– Parallel file systems
– Libraries and software packages
– Resource management systems (schedulers)

Clusters: the Management View (architecture diagram)
– Compute machine: service processor (IPMI / SNMP), OS and file systems, compilers, parallel environment, and a local resource-manager / job-scheduler agent
– Main machine: service processor, global resource manager and job scheduler, global monitoring and maintenance system, and deployment tools
– All nodes are connected over the interconnect

Building Blocks
– CPUs: AMD Opteron, Intel Xeon
– OS: Linux in various flavors (Red Hat EL/AS)
– Interconnect: Myrinet, Infiniband, Gigabit
– Maintenance and monitoring: Sun Control Station, CMU tool, NAGIOS
– Parallel computing environments: PVM, MPI, LAM
– Parallel file systems: Lustre, GFS, XFS
– Libraries, software, compilers: NAG, PGI, g77, GAUSSIAN, CHARM, etc.
– Resource managers: Sun Grid Engine, PBS Pro

IITK Clusters
– 32-node Intel-based cluster from HP
– 98-node dual-CPU AMD Opteron cluster (Sun)
– 48-node dual-CPU Opteron cluster (HP)
– PARAM, IBM SP, and home-grown Beowulf clusters
– All run the Linux OS and have MPI
– Different compilers, software, and leveraging technologies
– The domain of application is scientific research
– GARUDA grid with CDAC

Grid Engine (Resource Manager)
The grid engine delivers computational power according to the enterprise resource policies of the organization. Policies are usually targeted towards maximizing throughput and utilization. The grid engine examines resource requests against these policies, then allocates and delivers resources so as to optimize usage. The software provides users with the means to submit tasks to the grid for transparent distribution of the associated workload.

Grid Engine (Resource Manager)
– Users can submit batch jobs, interactive jobs, and parallel jobs to the grid
– Supports dynamic scheduling, accounting, and checkpointing
– Provides tools for monitoring and controlling jobs
– Accepts jobs from the outside world; jobs are users' requests for computer resources
– Puts jobs in a holding area until they can be run
– Sends jobs from the holding area to an execution host
– Manages running jobs
– Logs a record of job execution when jobs finish
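The submit–hold–execute–log cycle above can be driven from the command line with qsub, qstat, and qdel. The sketch below is a minimal illustration, assuming the standard qsub confirmation message; the script name myjob.sh is a placeholder.

```shell
#!/bin/sh
# Minimal sketch of a submit-and-track wrapper around Grid Engine commands.
# Assumes qsub's usual confirmation format:
#   Your job <id> ("<name>") has been submitted

# Extract the numeric job id (third field) from a qsub confirmation line.
job_id_from_qsub() {
    echo "$1" | awk '{print $3}'
}

if command -v qsub >/dev/null 2>&1; then
    out=`qsub myjob.sh`            # myjob.sh is a placeholder batch script
    id=`job_id_from_qsub "$out"`
    qstat -j "$id"                 # details while the job is pending/running
    # qdel "$id"                   # would remove the job from the system
else
    # Without Grid Engine installed, demonstrate only the id extraction.
    job_id_from_qsub 'Your job 4711 ("Sleeper") has been submitted'
fi
```

The same id-extraction trick is handy in any shell wrapper that needs to wait on or cancel a job it just submitted.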

Grid Engine Components
The engine has three main components: hosts, daemons, and queues.
Hosts
– Master, execution, administration, and submit hosts
Daemons
– sge_qmaster, sge_schedd, sge_execd
– sge_schedd decides the queue and priority for each job
– sge_qmaster maintains information on hosts, queues, permissions, and system loads
– sge_execd is responsible for running jobs and reporting their status to the master

Grid Engine Components
Queues
– A queue is a container class for jobs and is allowed to span more than one host
– A queue has attributes (such as a parallel environment, the amount of free tmp space, or the number of software licenses associated with it)
– Jobs that require certain attributes are automatically dispatched to queues that can satisfy them
– A cluster is a collection of hosts
– A host has attributes, including the number of slots / processors specified for computation (slots need not be the same as processors)
– Hosts are associated with queues
Grid Engine provides commands and interfaces to manipulate and configure the queues, the hosts, and their associated attributes.
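The hosts, queues, and slots described above can be inspected with a few standard Grid Engine commands. A hedged sketch (output will vary by site, and the fallback branch simply reports when Grid Engine is absent):

```shell
#!/bin/sh
# Sketch: listing the building blocks of the cluster as Grid Engine sees them.
if command -v qconf >/dev/null 2>&1; then
    qconf -sel     # show the execution host list
    qconf -sql     # show the cluster queue list
    qhost          # per-host architecture, load, memory, and processor counts
else
    echo "Grid Engine not available on this machine"
fi
```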

Queues of the Sun Cluster
Currently we have four queues on the cluster:
– default: the queue to which any job is submitted by default by the qsub command (10 nodes, 100 slots; batch, interactive, and parallel)
– sequential: the queue one would choose for running sequential jobs (14 nodes, 28 slots; batch only)
– parallel: the queue one would choose for running parallel jobs through MPICH (47 nodes, 94 slots; parallel)
– reserved: a queue reserved for users who have paid partly for the procurement of the cluster resources (21 nodes, 42 slots; batch, interactive, and parallel)
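Choosing one of the queues above is done with qsub -q, and qstat -g c summarizes used and free slots per cluster queue. A sketch, assuming the seq.q queue name used in the later submission examples and a placeholder script job.sh:

```shell
#!/bin/sh
# Sketch: checking queue occupancy before directing a job at a specific queue.
if command -v qstat >/dev/null 2>&1; then
    qstat -g c             # cluster queue summary: slots used / available
    qsub -q seq.q job.sh   # job.sh is a placeholder batch script
else
    echo "Grid Engine not available on this machine"
fi
```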

Some Queue Manipulation Commands
– qrsh: run an interactive job through the grid (rsh-style)
– qsh: start an interactive X-windows session on an execution host
– qtcsh: a tcsh shell that can transparently execute commands on the grid
– qlogin: open an interactive login session on an execution host
– qacct: report accounting data for finished jobs
– qdel: delete (cancel) a job
– qmod: modify the state of queues and jobs (suspend, enable, etc.)
– qsub: submit a batch job
– qconf: configure queues, hosts, users, and policies
– qstat: show the status of queues and jobs
Some useful qconf options:
– qconf -ah / -as: add an administration host / a submit host
– qconf -sul: show the list of user access lists
– qconf -mconf: modify the global cluster configuration
– qconf -mp: modify a parallel environment

Important output of some queue manipulation commands

qconf -sq reserved.q
    qname            reserved.q
    hostlist         host1 host2
    seq_no           3
    load_thresholds  np_load_average=8.5
    pe_list          mpichpar
    slots            2
    user_lists       reservedusers
    shell            /bin/csh
    prolog           /tmp/myscript
    epilog           /tmp/cleartmp
    s_cpu

qconf -spe mpichpar
    pe_name          mpichpar
    slots            400
    user_lists       NONE
    xuser_lists      NONE
    start_proc_args  /home/sgeadmin/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args   /home/sgeadmin/mpi/stopmpi.sh
    allocation_rule  $round_robin

Seeing user groups with special privileges

qconf -su reservedusers
    name  reservedusers
    type  ACL
    dharmkv,pssundar,vivkumar,pravir,vankates,\
    bhanesh,aprataps,samrahul,mkv,janurag,santo,ramji
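Membership of such an access list is maintained with qconf as well. A sketch; the user name newuser is a placeholder, and the commands require Grid Engine manager privileges:

```shell
#!/bin/sh
# Sketch: maintaining the reservedusers access list with qconf.
if command -v qconf >/dev/null 2>&1; then
    qconf -au newuser reservedusers   # add newuser to the ACL
    qconf -su reservedusers           # show the ACL to confirm the change
    qconf -du newuser reservedusers   # delete newuser from the ACL again
else
    echo "qconf not available on this machine"
fi
```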

Submitting A Sequential Job

#!/bin/sh
#$ -N Sleeper
#$ -S /bin/sh
/bin/echo "Sleeping now at: `date` and `hostname`"
time=60
if [ $# -ge 1 ]; then
    time=$1
fi
sleep $time
echo "Now it is: `date`"

Sample submissions:
qsub -q
qsub -l tmpfree=5G -q seq.q
qsub -cwd -o /dev/null -e /dev/null myjob.sh
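Many copies of a sequential job like the one above can also be submitted as a single task array with qsub -t; each task sees its own index in $SGE_TASK_ID. A sketch, with task.sh as a placeholder name; without Grid Engine it mimics one task locally by setting the variable itself:

```shell
#!/bin/sh
# Sketch: an SGE task array runs the same script once per index.
cat > task.sh <<'EOF'
#!/bin/sh
#$ -N ArrayDemo
#$ -S /bin/sh
echo "task $SGE_TASK_ID running on `hostname`"
EOF
chmod +x task.sh

if command -v qsub >/dev/null 2>&1; then
    qsub -t 1-5 task.sh       # submits tasks 1..5 as one array job
else
    # Mimic a single task locally: SGE would set SGE_TASK_ID for us.
    SGE_TASK_ID=3 sh task.sh
fi
```

Task arrays keep the queue tidy: one job entry covers all tasks, and qdel on that one job id removes every remaining task.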

Submitting A Parallel Job

#!/bin/csh
#$ -N MPI_Job
#$ -pe mpich* 2-20
#$ -v MPIR_HOME=/opt/mpichdefault
echo "Got $NSLOTS slots."
# enables $TMPDIR/rsh to catch rsh calls if available
set path=($TMPDIR $path)
$MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile \
    $TMPDIR/machines -nolocal /somepath/a.out

Sample submission:
qsub -pe mpichpar 10 -q par.q