JGI/NERSC New Hardware Training Kirsten Fagnan, Seung-Jin Sul January 10, 2013

Overview
- New hardware structure (# of nodes, cores, cores per socket)
- Exclusive use of a node – what does that mean?
- Running serial (single-core) jobs on the exclusive nodes: Python, TaskFarmerMQ
- Hands-on testing/work

Genepool Components
- 450 SGI commodity nodes: 8 slots, 48 GB of memory
- 222 Appro commodity nodes (the new hardware): 16 physical cores, 120 GB of memory
- High-memory nodes (500 GB and 1000 GB) and one 2 TB node
- 8-slot x4170 nodes: high-priority nodes

New Commodity Node Layout
- 120 GB of memory
- 16 physical cores (2 sockets, NUMA)
- 16 virtual cores (hyperthreading)
- 1.8 TB of local disk

New High Memory Node Layout
- 5 nodes with 500 GB of memory, plus nodes with 1000 GB (why not 512 and 1024?)
- 32 physical cores (4 sockets, NUMA)
- 32 virtual cores (hyperthreading)
- 3.6 TB of local disk

NUMA – Non-Uniform Memory Access. There is a memory hierarchy within each node (each socket has its own attached memory), so each thread will not have uniform access time to different blocks of memory.
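As a quick illustration, the numactl utility (assumed here to be available on the compute nodes) can show a node's NUMA domains and pin a run to a single socket; a.out stands in for any executable:

# Show how many NUMA domains the node has, which CPUs belong to each,
# and how much memory is attached to each domain.
numactl --hardware

# Run a.out with its CPUs and its memory allocations both confined to
# NUMA domain 0, avoiding slower accesses to the other socket's memory.
numactl --cpunodebind=0 --membind=0 ./a.out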

Hyperthreading: 16 physical cores + 16 virtual cores means that you can run applications with up to 32 threads. We have done some experiments with hyperthreading on and off and did not see any negative effects, but very few codes showed appreciable speed-up.
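For example (a sketch reusing the batch directives shown later in these slides; my_threaded_app is a placeholder for your own threaded code), a whole-node job could be timed with 16 and then 32 threads to see whether hyperthreading helps:

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB
#$ -pe pe_slots 16
#$ -N hyperthread_test
#$ -cwd
# First use only the 16 physical cores, then oversubscribe onto the
# 16 virtual cores as well, and compare the timings.
export OMP_NUM_THREADS=16
time ./my_threaded_app
export OMP_NUM_THREADS=32
time ./my_threaded_app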

How are the old and new systems connected?

NERSC Machine Room

How do I access the new nodes? Users still specify the following parameters:
- Wallclock limit (-l h_rt=HH:MM:SS)
- # of cores/nodes (-pe ...)
- Amount of memory per core (-l ram.c=16G)

The new hardware has 120 GB of memory; if you request more than 48 GB of memory, your job will be routed to the new hardware.

Example: whole-node job, can run up to 16 MPI tasks or with 16 threads:

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB
#$ -pe pe_slots 16
#$ -N whole_node_serial_test

Example: multi-node MPI job, requesting 4 whole nodes, can run up to 16*4 MPI tasks:

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB
#$ -pe pe_1 4
#$ -N whole_node_mpi_test

What about run time? There are 50 commodity nodes that can run long jobs (more than 12 hours), and all the high-memory nodes can run long jobs. The remaining nodes can run jobs with up to a 12-hour wallclock.
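As a sketch (assuming the longer wallclock request alone is enough for the scheduler to place the job on the long-running nodes; my_long_job is a placeholder), a job needing more than 12 hours would simply ask for more time:

#!/bin/bash
#$ -l h_rt=48:00:00,ram.c=100GB   # more than 12 hours: only the long-run nodes can take this
#$ -pe pe_slots 16
#$ -N long_job_test
#$ -cwd
./my_long_job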

Exclusive use of the node: I/O from this node will only be done by your job, so you don't need to share the 1 Gb Ethernet link with anyone else. You get 16 physical cores and 16 virtual cores (you can test the benefit of hyperthreading with your code), and you can use up to 120 GB of memory (more on the high-memory nodes).

Want to take advantage of all 16 cores, but how? (Diagram: Task 1, Task 2, Task 3, ..., Task 15, Task 16 mapped onto the 16 cores of one node.)

Running 16 serial tasks - Python
You can use Python's mpi4py module to launch multiple serial jobs. Below is a sample Python script, 'mwrapper.py':

#!/usr/bin/env python
from mpi4py import MPI
from subprocess import call
import sys

exctbl = sys.argv[1]                  # the serial executable to run, passed as an argument
comm = MPI.COMM_WORLD
rank = comm.Get_rank()                # each MPI rank handles one serial task
myDir = "dir" + str(rank).zfill(2)    # one working directory per rank: dir00, dir01, ...
# run the executable in the rank's directory, writing its output to "outfile"
# (the "> outfile" redirection is assumed here)
cmd = "cd " + myDir + " ; " + exctbl + " > outfile"
sts = call(cmd, shell=True)
comm.Barrier()                        # wait for all ranks to finish before exiting

Running 16 serial tasks - Python
Below is a batch script that uses it for a serial program, a.out:

#!/bin/bash -l
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd
module load python
module load openmpi
aprun -n 16 mwrapper.py a.out

Running 16 serial tasks - TaskfarmerMQ
(Diagram: the user's task list is read by the client, $ tfmq-client -i task.lst, which publishes task_1, task_2, ..., task_t to RabbitMQ; worker processes tfmq-worker_1, tfmq-worker_2, ..., tfmq-worker_n pull tasks from the queue and fork() the commands. The example tasks are blastall commands listed together with their output directory, output file, and status fields.)
Workers can be added at any time and reused.

TaskfarmerMQ Client/Worker Usage

tfmq-client -i task_list [-q user_specified_queue_name] [-w reuse_workers]
  -i, --tf: user task list file
  -q, --tq: user-specified queue name (NOTE: if you set your queue name with this option, you SHOULD set the same queue name when you start the worker using -q/--tq)
  -w, --reuse: worker termination option. If set to "0" (the default), all workers are terminated after completion; if set to "1", all workers stay running for other tasks.

tfmq-worker [-q user_specified_queue_name]
  The -q/--tq option sets the user-defined queue name. If you set a queue name when running tfmq-client, you SHOULD set the same name when you run the worker.

Example with a user-defined queue name:
$ tfmq-client -i task1.lst -q mytaskqueuename1
$ tfmq-worker -q mytaskqueuename1
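Putting the options together (a minimal sketch using only the flags listed above; the list and queue names are arbitrary), a client started with -w 1 keeps its workers alive so a second task list can be fed to the same queue:

# Start a client with worker reuse enabled (-w 1) on a named queue
$ tfmq-client -i task1.lst -q myreusequeue -w 1
# Start two workers on the same queue; they pull tasks as they arrive
$ tfmq-worker -q myreusequeue &
$ tfmq-worker -q myreusequeue &
# Because reuse is enabled, the same workers can also serve a second list
$ tfmq-client -i task2.lst -q myreusequeue -w 1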

TaskfarmerMQ Task List Example
Task list format (colon-separated fields): command : output directory : output file(s) : status

blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input1.fna -m 8 -o ./out-blastn/test1.m8.bout:./out-blastn:test1.m8.bout:0
blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input2.fna -m 8 -o ./out-blastn/test2.m8.bout:./out-blastn:test1.m8.bout,test2.m8.bout:0
blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input3.fna -m 8 -o ./out-blastn/test3.m8.bout:./out-blastn:test4.m8.bout:0
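Because each line is just a command followed by the colon-separated bookkeeping fields, a list in this format can be generated with a small shell loop; this is only a sketch (the shortened blastall options and the input/output names are placeholders following the pattern above):

#!/bin/bash
# Write one task per input file: command : output dir : output file : status
for i in 1 2 3
do
  echo "blastall -p blastn -d ./data/db.fa -e 1e-10 -m 8 -i ./data/input${i}.fna -o ./out-blastn/test${i}.m8.bout:./out-blastn:test${i}.m8.bout:0"
done > task.lst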

TaskfarmerMQ Task Examples
1. A case where I have a list of tasks that each require 1 core and 7680 MB of memory
   Step 1: fire up a client with the name of the queue that I want:
   tfmq-client -i task7680.lst -q my7680MBqueue
2. A case where I have a list of tasks that each require 1 core and 15 GB of memory
3. A case where I have a list of tasks that each require 1 core and 30 GB of memory

Example 1 – my task list is full of jobs that need 7.5 GB (7680 MB) of memory and 1 core each. To run these on Genepool, create a batch script, in this case called submit_16workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd
for i in {1..16}
do
  tfmq-worker -q my7680MBqueue &
done
wait

Submit the job:
genepool01:$ qsub submit_16workers.q

Note: we only specify memory, slots, and runtime to route our jobs!

Example 1 (continued) – the tfmq-client can be run on a gpint node, or included in the batch script itself as shown below. Either way, the name of the queue given to the client and to the workers needs to be the same.

Example 1 (continued) – batch script with the client and the workers together. There are 16 cores on a node, so I can have 16 workers:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd
## Running on the gpint ##
tfmq-client -i task1.lst -q my7680MBqueue
for i in {1..16}
do
  tfmq-worker -q my7680MBqueue &
done
wait

Submit the job:
genepool01:$ qsub submit_16workers.q

TaskfarmerMQ Task Examples
1. A case where I have a list of tasks that each require 1 core and 7.5 GB of memory
2. A case where I have a list of tasks that each require 1 core and 15 GB of memory
   Step 1: fire up a client with the name of the queue that I want:
   tfmq-client -i task15.lst -q my15GBqueue
3. A case where I have a list of tasks that each require 1 core and 30 GB of memory

Example 2 – my task list is full of jobs that need 15 GB of memory and 1 core each; to run these on Genepool I can only use 8 cores per node. In this case we have only 8 workers (120 GB / 15 GB = 8). Create a batch script, in this case called submit_8workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_8
#$ -l ram.c=15G
#$ -cwd
for i in {1..8}
do
  tfmq-worker -q my15GBqueue &
done
wait

Submit the job:
genepool01:$ qsub -t 1-10 submit_8workers.q

TaskfarmerMQ Task Examples
1. A case where I have a list of tasks that each require 1 core and 7.5 GB of memory
2. A case where I have a list of tasks that each require 1 core and 15 GB of memory
3. A case where I have a list of tasks that each require 1 core and 30 GB of memory
   Step 1: fire up a client with the name of the queue that I want:
   tfmq-client -i task30.lst -q my30GBqueue

Example 3 – my task list is full of jobs that need 30 GB of memory and 1 core each, so I can only use 4 cores per node. Create a batch script, in this case called submit_4workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -pe pe_slots 4
#$ -l ram.c=30G
#$ -cwd
for i in {1..4}
do
  tfmq-worker -q my30GBqueue &
done
wait

Submit the job:
genepool01:$ qsub -t 1-10 submit_4workers.q

Example 3 (continued) – you can also run with task arrays to increase the number of workers available to a particular queue: the qsub -t 1-10 above starts ten copies of this job, each contributing 4 workers to my30GBqueue.

Summary
- The JGI now has access to almost 2x the computing power that was available before the break.
- To access the new hardware, just request between 48 GB and 240 GB of memory and your jobs will be routed to those nodes.
- To keep jobs scheduling efficiently for all users, we are scheduling the new nodes a whole node at a time. This also makes it easier for users to debug workflows and should enable jobs to complete more consistently.
- There are tools available (Python/mpi4py, TaskFarmerMQ) that enable users with serial jobs to take advantage of the new hardware.
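As a recap, a minimal whole-node batch script on the new hardware only needs the memory, slot, and runtime requests that trigger the routing (my_program is a placeholder for your application):

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB   # requesting more than 48 GB routes the job to the new nodes
#$ -pe pe_slots 16                # all 16 physical cores: exclusive use of the node
#$ -N new_hardware_test
#$ -cwd
./my_program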

Hands-on section