
1 Introduction to Supercomputing at ARSC Kate Hedstrom, Arctic Region Supercomputing Center (ARSC) Jan, 2004

2 Topics
Introduction to Supercomputers at ARSC
–Computers
Accounts
–Getting an account
–Kerberos
–Getting help
Architectures of parallel computers
–Programming models
Running Jobs
–Compilers
–Storage
–Interactive and batch

3 Introduction to ARSC Supercomputers
They're all parallel computers
Three classes:
–Shared Memory
–Distributed Memory
–Distributed & Shared Memory

4 Cray X1: klondike
128 MSPs, 4 MSPs/node
4 Vector CPUs/MSP, 800 MHz
512 GB Total
21 TB Disk
1600 GFLOPS peak
NAC required

5 Cray SX-6: rime
8 NEC Vector CPUs, 500 MHz
64 GB of shared memory
1 TB RAID-5 Disk
64 GFLOPS peak
Only one in the USA
On loan from Cray
Non-NAC

6 Cray SV1ex: chilkoot
32 Vector CPUs, 500 MHz
32 GB shared memory
2 TB Disk
64 GFLOPS peak
NAC required

7 Cray T3E: yukon
272 CPUs, 450 MHz
256 MB per processor
69.6 GB total distributed memory
230 GFLOPS peak
NAC required

8 IBM Power4: iceberg
2 nodes of 32 p690+s, 1.7 GHz (2 cabinets), 256 GB each
92 nodes of 8 p655+s, 1.5 GHz (6 cabinets)
6 nodes of 8 p655s, 1.1 GHz (1 cabinet)
16 GB Mem/Node
22 TB Disk
5000 GFLOPS
NAC required

9 IBM Regatta: iceflyer
8-way, 16 GB front end coming soon
GHz Power4 CPUs in:
–24-way SMP node
–7-way interactive node
–1 test node
–32-way SMP node soon
256 GB Memory
217 GFLOPS
Non-NAC

10 IBM SP Power3: icehawk
50 4-way SMP nodes => 200 CPUs, 375 MHz
2 GB Memory/Node
36 GB Disk/Node
264 GFLOPS peak for 176 CPUs (max per job)
Leaving soon
NAC required

11 Storing Files
Robotic tape silos
Two Sun storage servers:
Nanook
–Non-NAC systems
Seawolf
–NAC systems

12 Accounts, Logging In
Getting an Account/Project
Doing a NAC
Logging in with Kerberos

13 Getting an Account/Project
Academic
–Applicant for resources is a PI: full-time faculty or staff research person
–Non-commercial work, must reside in the USA
–PI may add users to their project
DoD Applicant
Commercial, Federal, State
–Contact the User Services Director, Barbara Horner-Miller
–Academic guidelines apply

14 Doing a National Agency Check (NAC)
Required for HPCMO resources only
–Not required for workstations, the Cray SX-6, or the IBM Regatta
Not a security clearance
–But there are detailed questions covering the last 5-7 years
Electronic Personnel Security Questionnaire (EPSQ)
–Windows-only software
Fill out the EPSQ cover sheet
Fingerprinting, proof of citizenship (passport, visa, etc.)

15 Logging in with Kerberos
On non-ARSC systems, download the Kerberos 5 client
Used with SecureID
–Uses a PIN to generate a key at login time
Login requires user name, pass phrase, & key
–Don't share your PIN or SecureID with anyone
Foreign Nationals or others with problems:
–Contact ARSC to use ssh to connect to the ARSC gateway
–Still need Kerberos & SecureID after connecting

16 SecureID

17 From an ARSC System
Enter username
Enter for principal (press Return to accept the default)
Enter pass phrase
Enter SecureID passcode
From that system: ssh iceflyer
–ssh handles the X11 handshaking

18 From Your System
Get the Kerberos clients installed
Get a ticket: kinit
See your tickets: klist
Log into an ARSC system:
–krlogin -l username iceflyer
–ssh -l username iceflyer
–ktelnet -l username iceflyer

19 Rime and Rimegate
Log into rimegate as usual, with your rimegate username (arscxxx):
–ssh -l arscksh rimegate
Compile on rimegate (sxf90, sxc++)
Log into rime from rimegate:
–ssh rime
Rimegate $HOME is /rimegate/users/username on rime

20 Supercomputer Architectures
They're all parallel computers
Three classes:
–Shared Memory
–Distributed Memory
–Distributed & Shared Memory

21 Shared Memory Architecture Cray SV1, SX-6, IBM Regatta

22 Distributed Memory Architecture Cray T3E

23 Cluster Architecture
IBM iceberg, icehawk, Cray X1
Scalable, distributed, shared-memory parallel processor

24 Programming Models
Vector Processing
–compiler detection or manual directives
Threaded Processing (SMP)
–OpenMP, Pthreads, Java threads
–shared memory only
Distributed Processing (MPP)
–message passing with MPI
–shared or distributed memory

25 Vector Programming
Vector CPUs are specialized for array/matrix operations
–64-element (SV1, X1) or 256-element (SX-6) vector registers
–Operations proceed in assembly-line fashion
–High memory-to-CPU bandwidth: less CPU time wasted waiting for data from memory
–Once loaded, produces one result per clock cycle
Compiler does a lot of the work

26 Vector Programming
Codes will run without modification
Cray compilers automatically detect loops which are safe to vectorize
Request a listing file to find out what vectorized
Programmer can assist the compiler:
–Directives and pragmas can force vectorization
–Eliminate conditions which inhibit vectorization (e.g., subroutine calls and data dependencies in loops)
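To make slide 26 concrete, here is a minimal Fortran sketch (not from the original slides; the routine and array names are invented) of a loop the Cray compilers will vectorize, plus the kind of directive a programmer can add. !DIR$ IVDEP asserts that the loop has no hidden data dependencies, so the compiler is free to vectorize it.

! Minimal sketch (illustrative only): every iteration of this loop is
! independent, so the compiler can overlap iterations in the vector pipes.
subroutine scale_add(n, a, x, y)
   implicit none
   integer, intent(in) :: n
   real,    intent(in) :: a, x(n)
   real, intent(inout) :: y(n)
   integer :: i
   ! Cray directive asserting no loop-carried data dependencies
!DIR$ IVDEP
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
end subroutine scale_add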

27 Threaded Programming on Shared-Memory Systems
OpenMP
–Directives/pragmas added to serial programs
–A portable standard implemented on Cray (one node), SGI, IBM (one node), etc.
Other Threaded Paradigms
–Java Threads
–Pthreads

28 OpenMP Fortran Example
!$omp parallel do
do n = 1,10000
   A(n) = x * B(n) + c
end do
___________________________________________________
On 2 CPUs, this directive divides the work as follows:
CPU 1:
do n = 1,5000
   A(n) = x * B(n) + c
end do
CPU 2:
do n = 5001,10000
   A(n) = x * B(n) + c
end do

29 OpenMP C Example
#pragma omp parallel for
for (n = 0; n < 10000; n++)
   A[n] = x * B[n] + c;
___________________________________________________
On 2 CPUs, this pragma divides the work as follows:
CPU 1:
for (n = 0; n < 5000; n++)
   A[n] = x * B[n] + c;
CPU 2:
for (n = 5000; n < 10000; n++)
   A[n] = x * B[n] + c;

30 Threads Dynamically Appear and Disappear Number set by Environment
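As a hedged illustration of slide 30 (this program is not from the presentation), the short OpenMP Fortran code below creates a team of threads only inside the parallel region and reports how many there are; that count comes from the OMP_NUM_THREADS environment variable, not from the source.

program hello_threads
   use omp_lib            ! OpenMP run-time library routines
   implicit none
   integer :: tid, nthreads
!$omp parallel private(tid)
   ! Threads exist only inside this parallel region; the thread count
   ! is set by the environment, not by the program.
   tid = omp_get_thread_num()
   if (tid == 0) then
      nthreads = omp_get_num_threads()
      print *, 'Running with', nthreads, 'threads'
   end if
   print *, 'Hello from thread', tid
!$omp end parallel
end program hello_threads

On the IBM systems a program like this would be built with a thread-safe compiler and -qsmp=omp as described on slide 40, and run after setting OMP_NUM_THREADS to the desired number of threads.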

31 Distributed Processing
Concept:
1) Divide the problem explicitly
2) CPUs perform tasks concurrently
3) Recombine results
4) All processors may or may not be doing the same thing
(Figure: Branimir Gjetvaj)

32 Distributed Processing
Data needed by a given CPU must be stored in the memory associated with that CPU
Performed on distributed- or shared-memory computers
Multiple copies of the code are running
Messages/data are passed between CPUs
Multi-level: can be combined with vector and/or OpenMP

33 Distributed Processing using MPI (Fortran)
Initialization:
call mpi_init(ierror)
call mpi_comm_size(MPI_COMM_WORLD, npes, ierror)
call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierror)
Simple send/receive:
! Processor 0 sends individual messages to the others
if (my_rank == 0) then
   do dest = 1, npes-1
      call mpi_send(x, max_size, MPI_REAL, dest, 0, comm, ierr)
   end do
else
   call mpi_recv(x, max_size, MPI_REAL, 0, 0, comm, status, ierr)
end if

34 Distributed Processing using MPI (C)
Initialization:
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
Simple send/receive:
/* Processor 0 sends individual messages to the others */
if (my_rank == 0) {
   for (dest = 1; dest < npes; dest++) {
      MPI_Send(x, max_size, MPI_FLOAT, dest, 0, comm);
   }
} else {
   MPI_Recv(x, max_size, MPI_FLOAT, 0, 0, comm, &status);
}

35 Number of Processes Constant Number set by Environment
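For comparison, here is a minimal, self-contained MPI Fortran program (invented for this transcript, not the presenter's code). Every process runs the same executable, and the process count npes is fixed when the job is launched, for example by the batch system or the MPI launch command, and never changes during the run.

program hello_mpi
   implicit none
   include 'mpif.h'       ! MPI constants (MPI_COMM_WORLD, etc.)
   integer :: ierr, my_rank, npes

   call mpi_init(ierr)
   ! Every process executes this same program; the number of processes
   ! was fixed at launch time and stays constant.
   call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
   print *, 'Hello from process', my_rank, 'of', npes
   call mpi_finalize(ierr)
end program hello_mpi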

36 Message Passing Activity Example
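The slide above was a diagram of message traffic between processes. As a stand-in sketch (an illustration, not a reconstruction of that figure), the Fortran program below shows the simplest such activity: ranks 0 and 1 exchange one value with mpi_send/mpi_recv. It assumes the job is started with at least two processes.

program exchange
   implicit none
   include 'mpif.h'
   integer :: ierr, my_rank, status(MPI_STATUS_SIZE)
   real    :: outgoing, incoming

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
   outgoing = real(my_rank)

   ! Rank 0 sends first and then receives; rank 1 receives first and
   ! then sends, so the calls pair up and cannot deadlock.
   if (my_rank == 0) then
      call mpi_send(outgoing, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(incoming, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, status, ierr)
   else if (my_rank == 1) then
      call mpi_recv(incoming, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
      call mpi_send(outgoing, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, ierr)
   end if
   call mpi_finalize(ierr)
end program exchange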

37 Cluster Programming
Shared-memory methods between processors on one node:
–OpenMP, threads, or MPI
Distributed-memory methods between processors on multiple nodes:
–MPI
Mixed mode:
–MPI distributes work to the nodes, OpenMP within a node
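A hedged sketch of the mixed mode described above (names and structure invented for this transcript): MPI gives each process, typically one per node, its own block of the data, and an OpenMP directive splits that block's loop across the CPUs of the node. It assumes the usual MPI initialization from slide 33 and that the block bounds have already been computed.

! Hybrid sketch (illustrative names only): each MPI process owns one
! block of the arrays, and OpenMP threads divide that block's loop
! among the CPUs of the node the process is running on.
subroutine hybrid_update(my_first, my_last, a, x, y)
   implicit none
   integer, intent(in) :: my_first, my_last   ! this process's block bounds
   real,    intent(in) :: a, x(*)
   real, intent(inout) :: y(*)
   integer :: i
!$omp parallel do
   do i = my_first, my_last
      y(i) = y(i) + a*x(i)
   end do
!$omp end parallel do
end subroutine hybrid_update

Launched with one MPI process per node and OMP_NUM_THREADS set to the number of CPUs in a node, this uses MPI between nodes and OpenMP threads within each node.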

38 Programming Environments
Compilers
File Systems
Running jobs
–Interactive
–Batch
See the individual machine documentation

39 Cray Compilers
SV1, T3E
–f90, cc, CC
X1
–ftn, cc, CC
SX-6 front end (rimegate)
–sxf90, sxc++
SX-6 (rime)
–f90, cc, c++
No extra flags needed for MPI or OpenMP

40 IBM Compilers
Serial
–xlf, xlf90, xlf95, xlc, xlC
OpenMP
–Add -qsmp=omp and use the _r extension for thread-safe libraries, e.g. xlf_r
MPI
–mpxlf, mpxlf90, mpxlf95, mpcc, mpCC
It might be best to always use the _r extension (e.g. mpxlf90_r)

41 File Systems
Local storage
–$HOME
–/tmp, /wrktmp, or /wrkdir -> $WRKDIR
–/scratch -> $SCRATCH
Permanent storage
–$ARCHIVE
Quotas
–quota -v on the Crays
–qcheck on the IBMs

42 Running a Job
Get files from $ARCHIVE to the system's disk
Keep source in $HOME, but run in $WRKDIR
Use $SCRATCH for local-to-node temporary files; clean up before the job ends
Put results out to $ARCHIVE
$WRKDIR is purged

43 Iceflyer Filesystems
Smallish $HOME
Larger /wrkdir/username
$ARCHIVE for long-term storage, especially larger files
qcheck to check quotas

44 SX-6 Filesystems
Separate from the rest of the ARSC systems
Rimegate has /home and /scratch
Rime mounts them as /rimegate/home and /rimegate/scratch
Rime has its own home, /tmp, /atmp, etc.

45 Interactive
Work is done on the command line
Limits exist on resources (time, number of CPUs, memory)
Good for debugging
Larger jobs must be submitted to the batch system

46 Batch Schedulers
Cray: NQS
–Commands: qsub, qstat, qdel
IBM: LoadLeveler
–Commands: llclass, llq, llsubmit, llcancel, llmap, xloadl

47 NQS Script (rime)
batch            # job queue class
/bin/ksh         # which shell
                 # stdout and stderr together
100 MW
30:00            # time requested h:m:s
8                # 8 cpus
                 # required last command
# beginning of shell script
cd $QSUB_WORKDIR # cd to submission directory
export F_PROGINF=DETAIL
export OMP_NUM_THREADS=8
./my_job

48 NQS Commands
qstat to find out job status, list of queues
qsub to submit a job
qdel to delete a job from the queue

49 LoadLeveler Script (iceflyer)
#!/bin/ksh
# @ total_tasks = 4
# @ node_usage = shared
# @ wall_clock_limit = 1:00:00
# @ job_type = parallel
# @ output = out.$(jobid)
# @ error = err.$(jobid)
# @ class = large
# @ notification = error
# @ queue
poe ./my_job

50 LoadLeveler Commands
llclass to find the list of classes
llq to see the list of jobs in the queue
llsubmit to submit a job
llcancel to delete a job from the queue
llmap is a local program to see the load on the machine
xloadl is an X11 interface to LoadLeveler

51 Getting Help
Consultants and Specialists are here to serve YOU

52 Homework
Make sure you can log into:
–iceflyer
–rimegate
–rime
Ask the consultants for help if necessary