CommLab PC Cluster (Ubuntu OS version)
PC Cluster Manager: Sandy Ardianto (sandyardianto@gmail.com)
Outline
- Architecture
- Specification
- How to use
- Registration
- Connect using Putty
- Upload & Download files
- Sending jobs to slaves
- Torque PBS
- Torque Features
- PBS file example
- PBS command
- Python Example
- Matlab Example
- References
Architecture
- Master: public IP 140.113.211.20, local IP 192.168.1.2
- Slave01: 192.168.1.3
- Slave02: 192.168.1.4
- …
- Slave16: 192.168.1.18
- NAS*: 192.168.1.35
*NAS: Network Attached Storage
Specification
- Node groups: Master, Slave01-14, Slave15-16
- CPUs: i7-2600 @ 3.4 GHz (8 cores); Xeon E5620 @ 2.4 GHz (16 cores)
- Memory: 16 GB; 4 GB
- The /home folder of the master and all slaves is synchronized through the NAS.
- All operating systems have been changed from CentOS 5.6 to Ubuntu 14.04.
How to use the CommLab PC cluster
Registration
Contact the cluster manager (sandyardianto@gmail.com) with the following details:
- <name>
- <username> (e.g. sardianto for Sandy Ardianto)
- <password>
- <advisor>
- <e-mail>
Connect using SSH
Connect to the master (public IP 140.113.211.20) with an SSH client such as Putty: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Upload & Download files
- Filezilla: https://filezilla-project.org/download.php?type=client
- Default location: /home/<username>
Sending Jobs to Slaves
Torque PBS (Portable Batch System)
Terascale Open-source Resource and QUEue Manager (TORQUE) is a distributed resource manager that provides control over batch jobs and distributed compute nodes.
Torque Features (1/2)
- Fault tolerance
  - Additional failure conditions checked and handled
  - Node health check script support
- Scheduling interface
  - Extended query interface providing the scheduler with additional and more accurate information
  - Extended control interface allowing the scheduler increased control over job behavior and attributes
  - Allows the collection of statistics for completed jobs
Source: https://en.wikipedia.org/wiki/TORQUE
Torque Features (2/2)
- Scalability
  - Significantly improved server to MOM communication model
  - Ability to handle larger clusters (over 15 TF/2,500 processors)
  - Ability to handle larger jobs (over 2000 processors)
  - Ability to support larger server messages
- Usability
  - Extensive logging additions
  - More human-readable logging (i.e. no more 'error 15038 on command 42')
Source: https://en.wikipedia.org/wiki/TORQUE
PBS File Example
Annotated fields of the job file:
- Job name
- Error output file
- Output string on the terminal
- Queue name (batch, batch1-batch16)
- Compute unit
- ppn: processors per node
- Assign a specific slave: nodes=slaveXX (XX = 01-16)
Some useful variables:
- $PBS_JOBID: the job identifier
- $PBS_JOBNAME: the job name
- $PBS_O_WORKDIR: the absolute path of the directory from which the qsub command was sent
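The fields listed above can be put together as in the following minimal sketch, which generates the text of a job file. The function name `render_pbs_script`, its parameters, and the `.err`/`.out` file names are hypothetical and chosen only for illustration; the `#PBS` options used (`-N`, `-e`, `-o`, `-q`, `-l nodes=...:ppn=...`) are standard Torque directives. The real pbs.sh on the cluster may look different.

```python
def render_pbs_script(job_name, command, queue="batch", nodes="1", ppn=1):
    """Return the text of a Torque/PBS job script (illustrative sketch).

    nodes may be a node count such as "1" or a specific host such as "slave03".
    """
    lines = [
        "#!/bin/bash",
        "#PBS -N {0}".format(job_name),                   # job name
        "#PBS -e {0}.err".format(job_name),               # error output file (assumed name)
        "#PBS -o {0}.out".format(job_name),               # standard output file (assumed name)
        "#PBS -q {0}".format(queue),                      # queue name
        "#PBS -l nodes={0}:ppn={1}".format(nodes, ppn),   # node(s) and processors per node
        "",
        'cd "$PBS_O_WORKDIR"',                            # directory where qsub was invoked
        'echo "Job $PBS_JOBID ($PBS_JOBNAME) started"',
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: run hello.py on slave03 with two processors.
    print(render_pbs_script("job_name", "python hello.py", nodes="slave03", ppn=2))
```

Saving the printed text as a `.sh` file gives something that can be passed to qsub; the format strings avoid f-strings so the sketch also runs on the older Python versions shipped with Ubuntu 14.04.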
Check which computers are available to use
Open http://140.113.211.20/ganglia in a browser.
PBS Command
- Send a job: qsub <filename.sh>
- Show job status: qstat
- Run a job: qrun <job ID>
- Stop a job: qdel <job ID>
Status codes shown by qstat:
- Q: Queued
- R: Running
- E: Exiting
- C: Completed
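To check job states from a script rather than by eye, the output of qstat can be parsed. The sketch below is only an illustration: `parse_qstat` and `SAMPLE` are hypothetical names, and the sample text imitates the usual six-column default layout of Torque's qstat rather than output captured from this cluster. In Torque's documentation the letter E stands for a job that is exiting after having run.

```python
# Meaning of the one-letter codes in the S column of qstat.
JOB_STATES = {"Q": "queued", "R": "running", "E": "exiting", "C": "completed"}

# Illustrative sample only; real output depends on the Torque version.
SAMPLE = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3.master                  job_name         sardianto       0        Q batch
4.master                  job_name         sardianto       00:00:01 R batch1
"""


def parse_qstat(text):
    """Parse default qstat output into a list of dictionaries, one per job."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows have six columns with a single capital letter as the state;
        # this skips the header row and the row of dashes.
        if len(fields) != 6 or len(fields[4]) != 1 or not fields[4].isupper():
            continue
        job_id, name, user, time_use, state, queue = fields
        jobs.append({
            "id": job_id,
            "name": name,
            "user": user,
            "time_use": time_use,
            "state": state,
            "state_text": JOB_STATES.get(state, "other"),
            "queue": queue,
        })
    return jobs


if __name__ == "__main__":
    for job in parse_qstat(SAMPLE):
        print("{0}: {1}".format(job["id"], job["state_text"]))
```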
Python - Hello World Example
Files available at http://140.113.211.20:
- pbs.sh
- hello.py
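The contents of hello.py are not reproduced on this slide. As a hypothetical stand-in for trying out the workflow, a script along the following lines would do; printing the host name is an addition made here so that the job log shows which node ran the job.

```python
import socket


def greeting():
    """Build the message, including the name of the machine running the script."""
    return "Hello World from {0}".format(socket.gethostname())


if __name__ == "__main__":
    print(greeting())
```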
Running Hello World
- qsub pbs.sh: send the job to the master
- qrun <job ID>: run the job
- qstat: check the job status
- cat 3.master-job_name.log: print the job log
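The same qsub, qrun, qstat sequence can be scripted. The sketch below assumes the three commands are on the PATH of the master and relies on qsub printing the new job identifier on standard output; `submit_and_run` and its `run` parameter are hypothetical names, with `run` made replaceable so the logic can be tried without a cluster.

```python
import subprocess


def _run(args):
    """Run a command and return its standard output as text."""
    return subprocess.check_output(args).decode()


def submit_and_run(script, run=_run):
    """Submit a PBS script, start it with qrun, and return (job_id, qstat output)."""
    job_id = run(["qsub", script]).strip()  # qsub prints the job identifier
    run(["qrun", job_id])                   # start the queued job
    return job_id, run(["qstat"])           # current status of the jobs


if __name__ == "__main__":
    job_id, status = submit_and_run("pbs.sh")
    print("Submitted {0}".format(job_id))
    print(status)
```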
Matlab Example
Files available at http://140.113.211.20:
- pbs_matlab.sh
- mtest.m
Running Matlab Example (1/2)
- qsub pbs_matlab.sh: send the job to the master
- qrun <job ID>: run the job
- qstat: check the job status
Running Matlab Example (2/2)
- head -20 24.master-job_name.log: show the first 20 lines of the log
Any problems or questions?
Contact me: sandyardianto@gmail.com