EnginFrame, LSF, and ECLIPSE Overview
Bruce Pullig, Solution Architect (bpullig@slb.com)

Why use LSF?

Without LSF:
- It is difficult to determine which cores in the cluster are in use.
  - This can change before you can submit your job.
  - Choosing incorrectly can cause inefficiency or job failure.
- No queuing:
  - You must wait until ALL resources are available before submitting your job.
  - Low-importance jobs can't be scheduled for non-working hours.

With LSF:
- Jobs run when resources (including memory, cores, and licenses) are available.
- Jobs can be scheduled for off-hours.
- Jobs can be run efficiently.
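As a concrete illustration (a minimal sketch only; the job name and wrapper scripts are hypothetical, and the bsub options follow standard LSF usage), a submission like the following is dispatched only when the requested slots and resources are actually free, and otherwise waits in the queue:

  # Submit a 4-core job; LSF dispatches it only when 4 job slots (and any
  # requested memory or licenses) are free, otherwise it waits in the queue.
  bsub -q ParallelWork -n 4 -J MY_SIM -o MY_SIM.%J.out ./run_my_simulation.sh

  # Schedule a low-importance job for off-hours by targeting the night queue.
  bsub -q night -o cleanup.%J.out ./nightly_cleanup.sh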

4-core parallel job: efficient use of hardware

8-core parallel job: efficient use of hardware

4 serial jobs: efficient use of hardware… but

4 serial jobs and 1 very inefficient 8-core parallel job

Dividing the cluster "virtually" helps prevent inefficiency. The partitions can overlap if configured carefully.
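One common way to express this kind of virtual division (a sketch only; the group names and membership are assumptions, although the ho* host names appear later in this deck) is with host groups in lsb.hosts, which queues can then be limited to:

  # lsb.hosts (sketch): host groups that queues can reference
  Begin HostGroup
  GROUP_NAME       GROUP_MEMBER
  parallel_nodes   (ho02 ho03 ho04 ho05)
  serial_nodes     (ho10 ho11 ho12 ho13)
  End HostGroup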

Typical LSF installation:
- Installed on shared storage, which is then available to all nodes.
- A script under /etc/rc3.d on each node is installed as a service and started automatically upon reboot.

Default queues:
- normal: for normal low-priority jobs, running only if hosts are lightly loaded.
- owners: for owners of some machines.
- priority: jobs submitted to this queue are scheduled as urgent jobs and can preempt jobs in lower-priority queues. (Preemption is incompatible with ECLIPSE.)
- short: for short jobs that would not take much CPU time; killed if they run more than 15 minutes.
- idle: run only if the machine is idle and very lightly loaded.
- license: for licensed packages; scheduled to run with moderate priority.
- night: for large heavy-duty jobs, running during off-hours and weekends; scheduled with higher priority.
- chkpnt_rerun_queue: incompatible with ECLIPSE.
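To see how these queues are configured on a given cluster, and to target one explicitly at submission time (the script name below is hypothetical):

  bqueues                       # list all queues with priority, limits, and job counts
  bqueues -l night              # full definition of one queue
  bsub -q night ./big_job.sh    # explicitly submit to the night queue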

Recommended queues:
- ParallelWork: normal priority queue, limited to Parallel nodes.
- SerialWork: normal priority queue, limited to Serial nodes.
- MR_Serial: low priority queue for MR jobs, limited to Serial nodes.
- MR_Parallel: low priority queue for MR jobs, limited to Parallel nodes.
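A hedged sketch of what the corresponding lsb.queues stanzas might look like (the priorities, host group names, and descriptions are assumptions, not taken from this deck):

  # lsb.queues (sketch): two of the recommended queues
  Begin Queue
  QUEUE_NAME   = ParallelWork
  PRIORITY     = 40
  HOSTS        = parallel_nodes
  DESCRIPTION  = Normal priority queue, limited to Parallel nodes
  End Queue

  Begin Queue
  QUEUE_NAME   = MR_Serial
  PRIORITY     = 20
  HOSTS        = serial_nodes
  DESCRIPTION  = Low priority queue for MR jobs, limited to Serial nodes
  End Queue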

License Requirements
- Simulations require one Black Oil (eclipse) license per job.
- Parallel jobs also require one Parallel (parallel) license per core.
- Other feature licenses may be required depending upon the features used.
- If there are insufficient licenses available, LSF will queue the job until ALL resources, including licenses, are available.
- ECLIPSE and Parallel licenses are checked by default, but all other features must be identified in the user's .DATA file.
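In practice this shows up as a resource requirement on the job: the bjobs -l output later in this deck shows rusage[eclipse=1:compositional=1]. A submission that also reserves per-core parallel licenses might look like the sketch below (the "parallel" resource name and the counts are assumptions about how the site's license resources are defined):

  # Reserve one eclipse license plus eight parallel licenses for an 8-core job.
  # Resource names must match the shared license resources defined by the site.
  bsub -q ParallelWork -n 8 -R "rusage[eclipse=1:parallel=8]" ./run_my_simulation.sh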

Data Storage

LSF

Monitoring compute nodes: bhosts

  HOST_NAME   STATUS   JL/U   MAX   NJOBS   RUN   SSUSP   USUSP   RSV
  ho02        ok       -      6     0       0     0       0       0
  ho03        ok       -      6     0       0     0       0       0
  *********************SNIP************************
  ho16        ok       -      6     0       0     0       0       0

Monitoring queues/jobs: bjobs -a -u all

  1290   msaxena   DONE   Eclipse   holsf01   ho11     *e_THP5020   Mar 22 17:38
  1291   dzajac    DONE   Eclipse   holsf01   ho10     *PERM_HM_2   Mar 22 17:46
         msaxena   DONE   Eclipse   holsf01   ho10     *PERM_HM_2   Mar 22 17:52
         pulligb   RUN    Eclipse   holsf01   6*ho45   PULLIG9      Mar 23 08:43
                                              6*ho12 6*ho13 6*ho14 6*ho15 2*ho02
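A small convenience on top of bhosts, summing the MAX and NJOBS columns to get a quick slot count (a sketch that relies only on the standard bhosts column layout shown above):

  # Total job slots in use vs. total slots across all compute nodes
  bhosts | awk 'NR > 1 { max += $4; used += $5 } END { print used "/" max " job slots in use" }'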

LSF

To obtain more info for troubleshooting: bjobs -l <jobID>

  Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>, Queue <Eclipse>,
      Command <cd /data/PTCI/Eclipse/dzajac ; ./TIMEDEP_E300_2.175224>
  Thu Mar 22 17:52:24: Submitted from host <holsf01>, CWD </data/PTCI/Eclipse/dzajac>,
      Requested Resources <select[type==any] rusage[eclipse=1:compositional=1]>;
  Thu Mar 22 17:52:28: Started on <ho10>, Execution Home </home/dzajac>,
      Execution CWD </data/PTCI/Eclipse/dzajac>;
  Thu Mar 22 17:53:57: Done successfully. The CPU time used is 61.5 seconds.

  SCHEDULING PARAMETERS:
             r15s   r1m   r15m   ut   pg   io   ls   it   tmp   swp   mem
  loadSched     -     -      -    -    -    -    -    -     -     -     -
  loadStop      -     -      -    -    -    -    -    -     -     -     -

  EXTERNAL MESSAGES:
  MSG_ID   FROM     POST_TIME      MESSAGE          ATTACHMENT
  0        dzajac   Mar 22 17:52   EF_SPOOLER_URI   Y

The Job Name is the datafile name; the Execution CWD is the location of the datafile and of the relevant files for debugging.
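A few related commands that are often useful alongside bjobs -l when a job misbehaves (job ID 1292 reused from the example above):

  bjobs -l 1292    # full submission details, resource requests, and status
  bpeek 1292       # peek at a job's stdout/stderr while it is still running
  bhist -l 1292    # full event history, including any pending or suspend reasons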

LSF

Jobs completed over 60 minutes ago: bhist -a -u <username>

Killing jobs: bkill <jobid>

  -bash-3.2$ bkill 1293
  Job <1293> is being terminated

Queue setup:

  vi /lsftop/conf/lsbatch/<CLUSTER_NAME>/configdir/lsb.queues
  badmin reconfig
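A sketch of the same operations in slightly more detail (the username and queue name are taken from examples elsewhere in this deck; note that bkill with job ID 0 kills all of that user's jobs):

  bhist -a -u dzajac    # include jobs that finished more than an hour ago
  bkill 1293            # kill a single job
  bkill -u dzajac 0     # kill ALL jobs belonging to one user (use with care)

  # After editing lsb.queues, reload the configuration and verify the queue:
  badmin reconfig
  bqueues -l ParallelWork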

LSF Infiniband Issues
- If you suspect Infiniband issues, ping another node on the ib0 device. The IP range is x.x.x.x.
- If the other system doesn't respond, check the OpenSM service on the head node. Restart it if necessary by running: service opensmd restart
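A minimal check sequence (the peer address is a placeholder for another node's ib0 address in the x.x.x.x range mentioned above):

  # Ping a peer over the InfiniBand interface; replace <peer-ib0-address>
  # with the ib0 address of another compute node.
  ping -c 3 -I ib0 <peer-ib0-address>

  # If there is no response, check and (if needed) restart OpenSM on the head node.
  service opensmd status
  service opensmd restart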

EnginFrame Access via: http://HEADNODE:8080

Submitting Simulation Job

Monitoring Simulation Job

Cluster Info

My Jobs

All Jobs

ECLIPSE

Launching a simulation from the command line:

  eclrun -s holsf01 -q <QUEUENAME> <application> <DATASET>

Example:

  eclrun -s holsf01 -q ParallelWork eclipse BIG_RESERVOIR

<application> will be: eclipse, e300, or frontsim.

Modifying benchmarks to run on x cores: change the PARALLEL section in the .DATA file.

  eclipse:  PARALLEL 32 'DISTRIBUTED' /
  E300:     PARALLEL 36 /
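For example, submitting several datasets in a row with eclrun (the dataset names are hypothetical; the syntax follows the template above):

  # Submit a batch of datasets to the ParallelWork queue; eclrun hands each
  # one to LSF, which queues it until cores and licenses are free.
  for ds in CASE_A CASE_B CASE_C; do
      eclrun -s holsf01 -q ParallelWork eclipse "$ds"
  done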

ECLIPSE Troubleshooting

Check the following for errors:
- <SIMULATION_NAME>.OUT
- <SIMULATION_NAME>.PRT
- <SIMULATION_NAME>.LOG
- <SIMULATION_NAME>.ECLRUN

The simulation name is the datafile name without .DATA.

Run either a benchmark or the sample datafiles (/ecl/benchmarks/) to rule out dataset issues.
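A quick way to sweep all four files for error messages (a sketch; SIM is the datafile name without .DATA, and the name used here is hypothetical):

  SIM=BIG_RESERVOIR          # hypothetical simulation name (datafile minus .DATA)
  for ext in OUT PRT LOG ECLRUN; do
      echo "=== ${SIM}.${ext} ==="
      grep -in "error" "${SIM}.${ext}"
  done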

ECLIPSE

Run the benchmarks for various configurations from the command line. The files are run from the following structure:
- ECLIPSE E100 Parallel: /ecl/benchmarks/parallel/data
- ECLIPSE E100 Serial: /ecl/benchmarks/serial/e100
- ECLIPSE E300 Serial: /ecl/benchmarks/serial/e300
- ECLIPSE E300 Parallel: /ecl/benchmarks/2MMbenchmark/E300

Schlumberger sample datasets are also included in the file structure:
- ECLIPSE Sample: /ecl/benchmarks/sample

The commands for running the benchmarks are below:
- ECLIPSE E100 Parallel benchmarks: eclrun -s Houston -q eclipse -u <user_name> eclipse ONEM
- ECLIPSE E100 Serial benchmarks: eclrun -s Houston -q eclipse -u <user_name> eclipse E100
- ECLIPSE E300 Parallel benchmarks: eclrun -s Houston -q eclipse -u <user_name> e300 MMx
- ECLIPSE E300 Serial benchmarks: eclrun -s Houston -q eclipse -u <user_name> e300 E300

If you submit a ticket with Schlumberger, they will ask for the *.OUT, *.DATA, *.LOG, and *.PRT files.
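When opening a ticket, it can help to bundle those files up front (a sketch; the simulation name is hypothetical):

  # Collect the files Schlumberger support typically asks for into one archive.
  SIM=BIG_RESERVOIR
  tar czf ${SIM}_support.tar.gz ${SIM}.OUT ${SIM}.DATA ${SIM}.LOG ${SIM}.PRT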