Research Computing @ CU Boulder
Additional Topics in Research Computing
Shelley Knuth (shelley.knuth@colorado.edu)
Peter Ruprecht (peter.ruprecht@colorado.edu)
www.rc.colorado.edu
Outline
Brief overview of Janus
Update on the replacement for Janus
Alternate compute environment: condo
PetaLibrary storage
General supercomputing information
Queues and job submission
Keeping the system happy
Using and installing software
Compiling and submission examples
Time for questions, feedback, and one-on-one
What does Research Computing do?
We manage:
Shared large-scale compute resources
Large-scale storage
High-speed network without firewalls (Science DMZ)
Software and tools
We provide:
Consulting support for building scientific workflows on the RC platform
Training
Data management support in collaboration with the Libraries
Research Computing: Contact
rc-help@colorado.edu: main contact email; goes into our ticket system
http://www.rc.colorado.edu: main web site
http://data.colorado.edu: data management resource
Mailing lists (announcements and collaboration): https://lists.rc.colorado.edu/mailman/listinfo/rc-announce
Twitter: @CUBoulderRC and @cu_data
Facebook: https://www.facebook.com/CuBoulderResearchComputing
Meetup: http://www.meetup.com/University-of-Colorado-Computational-Science-and-Engineering/
JANUS SUPERCOMPUTER AND OTHER COMPUTE RESOURCES
What Is a Supercomputer?
CU's supercomputer is one large computer made up of many smaller computers and processors.
Each individual computer is called a node, and each node has processors/cores that carry out the instructions of the computer.
In a supercomputer, all of these computers talk to each other over a communications network; on Janus, that network is InfiniBand.
Why Use a Supercomputer?
Supercomputers let you solve problems that are too complex for a desktop machine.
A calculation that might take hours, days, weeks, months, or years on a desktop might take only minutes, hours, days, or weeks on a supercomputer.
Supercomputers are also useful for problems that require large amounts of memory.
Computers and Cars - Analogy (figure-only slides)
Hardware - Janus Supercomputer
1368 compute nodes (Dell C6100), 16,428 total cores
No battery backup of the compute nodes
Fully non-blocking QDR InfiniBand network
960 TB of usable Lustre-based scratch storage, 16-20 GB/s maximum throughput
Different Node Types
Login nodes: this is where you are when you log in to login.rc.colorado.edu. No heavy computation, interactive jobs, or long-running processes; use them for script or code editing, minor compiling, and job submission.
Compile nodes: compiling code and job submission.
Compute/batch nodes: this is where jobs submitted through the scheduler run; intended for heavy computation.
Additional Compute Resources
2 Graphics Processing Unit (GPU) nodes: visualization of data; exploring GPUs for computing.
4 high-memory nodes ("himem"): 1 TB of memory, 60-80 cores per node.
16 blades for long-running jobs ("crestone"): 2-week walltimes allowed; 96 GB of memory (4 times more than a Janus node).
Replacing Janus
Janus goes off warranty in July, but we expect to run it (or part of it) for another 9-12 months after that.
We have applied for a National Science Foundation grant for a replacement cluster:
About 3x the FLOPS performance of Janus
Approximately 350 compute nodes, each with 24 cores and 128 GB RAM
Approximately 20 nodes with GPU coprocessors
Approximately 20 nodes with Xeon Phi processors
Scratch storage suited to a range of applications
A decision from NSF is expected by early September; installation would begin early in 2016.
"Condo" Cluster
CU provides infrastructure such as network, data center space, and scratch storage.
Partners buy nodes that fit into the infrastructure.
Partners get priority access on their own nodes.
Other condo partners can backfill onto nodes not currently in use.
Node options will be updated with current hardware each year.
A given generation of nodes will run for 5 years.
Little or no customization of OS and hardware options.
Condo Node Options - 2015
Compute node, $6,875:
2x 12-core 2.5 GHz Intel "Haswell" processors
128 GB RAM
1x 1 TB 7200 RPM hard drive
10 gigabit/s Ethernet
Potential uses: general-purpose high-performance computation that doesn't require low-latency parallel communication between nodes
GPU node, $12,509:
2x 12-core 2.5 GHz Intel "Haswell" processors
128 GB RAM
1x 1 TB 7200 RPM hard drive
10 gigabit/s Ethernet
2x NVIDIA K20M GPU coprocessors
Potential uses: molecular dynamics, image processing
Node Options - cont'd
High-memory node, $34,511:
4x 12-core 3.0 GHz Intel "Ivy Bridge" processors
1024 GB RAM
10x 1 TB 7200 RPM hard drives in a high-performance RAID configuration
10 gigabit/s Ethernet
Potential uses: genetics/genomics or finite-element applications
Notes:
Dell is providing bulk discount pricing even for incremental purchases
5-year warranty
Network infrastructure is in place to support over 100 nodes at 10 Gbps
Condo Scheduling Overview
All jobs are run through a batch/queue system (Slurm).
Interactive jobs on compute nodes are allowed, but they must be initiated through the scheduler.
Each partner group has its own high-priority queue for jobs that will run on the nodes it has purchased.
High-priority jobs are guaranteed to start running within 24 hours (unless other high-priority jobs are already running) and can run for a maximum wall time of 14 days.
All partners also have access to a low-priority queue that can run on any condo nodes not already in use by their owners.
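As a brief sketch of how these queues are used in practice (the QOS names and job script below are placeholders; partners should substitute the names assigned to their group):

  # Submit to your group's high-priority QOS (placeholder name)
  sbatch --qos=<your-condo-qos> job.sh

  # Submit to the shared low-priority backfill QOS (placeholder name)
  sbatch --qos=<condo-low-qos> job.sh

  # Check the state of your jobs
  squeue -u $USER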
RMACC Conference
Rocky Mountain Advanced Computing Consortium
August 11-13, 2015, Wolf Law Building, CU-Boulder campus
https://www.rmacc.org/HPCSymposium
Hands-on and lecture tutorials about HPC, data, and technical topics
Educational and student-focused panels
Poster session for students
DATA STORAGE AND TRANSFER
Storage Spaces
Home directories (/home/user1234): not high performance; not for direct computational output. 2 GB quota; snapshotted frequently.
Project spaces (/projects/user1234): not high performance; can be used to store or share programs, input files, and perhaps small data files. 250 GB quota; snapshotted regularly.
Lustre parallel scratch filesystem (/lustre/janus_scratch/user1234): no hard quotas; no backups. Files created more than 180 days in the past may be purged at any time.
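A quick way to check how much of each space you are using (this assumes your RC username matches $USER on the system; the quotas themselves are enforced by RC, and these commands only report usage):

  du -sh /home/$USER                  # home directory usage (2 GB quota)
  du -sh /projects/$USER              # project space usage (250 GB quota)
  ls /lustre/janus_scratch/$USER      # scratch contents (no quota; 180-day purge)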
Research Data Storage: PetaLibrary
Long-term, large-scale storage option, offered as "Active" or "Archive" storage.
Primary storage system for the condo cluster.
Modest charge to use.
We provide expertise and services around this storage: data management and consulting.
PetaLibrary: Cost to Researchers
Additional free options:
Snapshots: on-disk incremental backups
Sharing through Globus Online
Keeping Lustre Happy
Janus Lustre is tuned for large parallel I/O operations.
Creating, reading, writing, or removing many small files simultaneously can cause performance problems.
Don't put more than 10,000 files in a single directory.
Avoid "ls -l" in a large directory.
Avoid shell wildcard expansions (*) in large directories.
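For example, in a directory containing many files, metadata-light commands keep Lustre happier (a sketch; lfs is the standard Lustre client tool, and the path is a placeholder):

  # Plain ls avoids the per-file stat calls that ls -l triggers
  ls /lustre/janus_scratch/$USER/run_output

  # Use lfs find rather than a shell wildcard to locate files in a large directory
  lfs find /lustre/janus_scratch/$USER/run_output -name "*.dat" -mtime +30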
Data Sharing and Transfers
Globus tools: Globus Online and GridFTP (https://www.globus.org). Easier external access without a CU-RC account, especially for external collaborators. Endpoint: colorado#gridftp
SSH: scp, sftp, rsync. Adequate for smaller transfers. Connections to RC resources are limited by UCD bandwidth.
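For smaller transfers over SSH, commands like the following work from your local machine (usernames, file names, and paths are placeholders):

  # Copy a single file to your RC home directory
  scp results.tar.gz username@login.rc.colorado.edu:/home/username/

  # Synchronize a local directory into project space, resuming partial transfers
  rsync -av --partial ./my_data/ username@login.rc.colorado.edu:/projects/username/my_data/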
ACCESSING AND UTILIZING RC RESOURCES
Initial Steps to Use RC Systems
Apply for an RC account at https://portals.rc.colorado.edu/account/request: read the Terms and Conditions, choose "University of Colorado - Denver", and use your UCD email address.
Get a One-Time Password device.
Apply for a computing allocation: initially you will be added to UCD's existing umbrella allocation; subsequently you will need a brief proposal for additional time.
Logging In
SSH to login.rc.colorado.edu using your RC username and One-Time Password.
From a login node, you can SSH to a compile node (janus-compile1, 2, 3, or 4) to build your programs.
All RC nodes run Red Hat Enterprise Linux 6.
Job Scheduling
On a supercomputer, jobs are scheduled rather than run instantly at the command line.
Several different scheduling software packages are in common use; licensing, performance, and functionality considerations led us to choose Slurm.
SLURM = Simple Linux Utility for Resource Management: open source and increasingly popular at other sites.
More at https://www.rc.colorado.edu/support/examples/slurmtestjob
Running Jobs
Do NOT compute on login or compile nodes.
Interactive jobs: work interactively at the command line of a compute node. Commands: salloc --qos=janus-debug or sinteractive --qos=janus-debug
Batch jobs: submit a job that will be executed when resources are available. Create a text file describing the job, then submit it to a queue: sbatch --qos=<queue> <jobscript>
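A minimal interactive session using standard Slurm commands and the janus-debug QOS shown above (the resource requests are illustrative):

  # Request 1 node / 2 tasks for 30 minutes
  salloc --qos=janus-debug --nodes=1 --ntasks=2 --time=00:30:00
  # Inside the allocation, run a command on the allocated compute node
  srun hostname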
Quality of Service = Queues
janus-debug: debugging only, no production work; maximum wall time 1 hour (the maximum amount of time your job is allowed to run); 2 jobs per user.
janus: default; maximum wall time of 24 hours.
janus-long: maximum wall time of 168 hours; limited number of nodes.
Also: himem, crestone, gpu.
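Selecting a QOS at submission time and checking what is queued there (the job script name is a placeholder):

  # Submit to the long queue when a job needs more than 24 hours
  sbatch --qos=janus-long job.sh

  # List jobs in a given QOS, or just your own jobs
  squeue --qos=janus-long
  squeue -u $USER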
Parallel Computation
Special software is required to run a calculation across more than one processor core or more than one node. Sometimes the compiler can automatically parallelize an algorithm, and frequently special libraries and functions are used to abstract the parallelism.
OpenMP: for multiprocessing across cores in a single node; most compilers can auto-parallelize with OpenMP.
MPI (Message Passing Interface): can parallelize across multiple nodes; libraries/functions are compiled into the executable; several implementations exist (OpenMPI, MVAPICH, MPICH, ...); usually requires a fast interconnect (like InfiniBand).
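Compile-line sketches for the two models (these assume a GCC-style compiler and an MPI wrapper provided by a loaded module; program names are placeholders):

  # OpenMP: one multi-threaded process on a single node
  gcc -fopenmp -o omp_prog omp_prog.c
  export OMP_NUM_THREADS=12
  ./omp_prog

  # MPI: many processes, possibly across multiple nodes
  mpicc -o mpi_prog mpi_prog.c
  mpirun -np 24 ./mpi_prog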
Software
Common software is available to all users via the "module" system.
To find out what software is available, type: module avail
To set up your environment to use a software package, type: module load <package>/<version>
You can install your own software, but you are responsible for supporting it; we are happy to assist.
EXAMPLES
Login and Modules example
Log in:
ssh username@login.rc.colorado.edu
ssh janus-compile2
List and load modules:
module list
module avail
module load openmpi/openmpi-1.8.0_intel-13.0.0
module load slurm
Compile a parallel program:
mpicc -o hello hello.c
Login and Modules example (Matlab)
Log in:
ssh username@login.rc.colorado.edu
List and load modules:
module list
module avail matlab
module load matlab/matlab-2014b
module load slurm
Submit Batch Job example
Batch script (slurmSub.sh):
#!/bin/bash
#SBATCH -N 2                       # Number of requested nodes
#SBATCH --ntasks-per-node=12       # Number of cores per node
#SBATCH --time=1:00:00             # Max walltime
#SBATCH --job-name=SLURMDemo       # Job submission name
#SBATCH --output=SLURMDemo.out     # Output file name
###SBATCH -A <allocation>          # Allocation
###SBATCH --mail-type=end          # Send email on completion
###SBATCH --mail-user=<email>      # Email address

mpirun ./hello

Submit the job:
sbatch --qos janus-debug slurmSub.sh
Check job status and output:
squeue -q janus-debug
cat SLURMDemo.out
Batch Job example (Matlab)
Contents of the scripts:
matlab_tutorial_general.sh: wrapper script that loads the Slurm settings, changes to the appropriate directory, and calls the Matlab .m file.
matlab_tutorial_general_code.m: the Matlab program itself.
Run the Matlab program as a batch job:
sbatch matlab_tutorial_general.sh
Check job status and output:
squeue -q janus-debug
cat SERIAL.out
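The wrapper script itself is not reproduced in these slides; a minimal sketch of what such a script might contain, assuming the module and file names used elsewhere in this presentation, is:

  #!/bin/bash
  #SBATCH --qos=janus-debug            # debug queue for short test runs
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --time=00:10:00
  #SBATCH --job-name=SERIAL
  #SBATCH --output=SERIAL.out          # matches the "cat SERIAL.out" check above

  cd $SLURM_SUBMIT_DIR                 # run from the directory the job was submitted from
  module load matlab/matlab-2014b
  matlab -nodesktop -nosplash -r "matlab_tutorial_general_code; exit"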
Example to work on
Take the wrapper script matlab_tutorial_general.sh and alter it to do the following:
Rename your output file to whatever you'd like
Email yourself the output; do it at the beginning of the job run
Run on 12 cores
Run on 2 nodes
Rename your job to whatever you'd like
Then submit the batch script.
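One possible set of #SBATCH directives that satisfies the exercise (job name, output file, and email address are placeholders; whether Slurm mails the output itself or only a start notification depends on site configuration):

  #SBATCH --nodes=2                    # run on 2 nodes
  #SBATCH --ntasks-per-node=12         # 12 cores per node
  #SBATCH --job-name=<your-job-name>
  #SBATCH --output=<your-output-file>.out
  #SBATCH --mail-type=begin            # notify at the beginning of the job run
  #SBATCH --mail-user=<your-email>

Then submit with: sbatch --qos=janus-debug matlab_tutorial_general.sh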
TRAINING, CONSULTING AND PARTNERSHIPS
Training
Weekly tutorials on computational science and engineering topics
Meetup group: http://www.meetup.com/University-of-Colorado-Computational-Science-and-Engineering
All materials are online: http://researchcomputing.github.io
Various boot camps and tutorials
Consulting
Support in building software
Workflow efficiency
Parallel performance debugging and profiling
Data management, in collaboration with the Libraries: see Tobin Magle (tobin.magle@ucdenver.edu)
Getting started with XSEDE
Thank you! Questions?
Slides available at http://researchcomputing.github.io