Nanco: a large HPC cluster for RBNI (Russell Berrie Nanotechnology Institute)
Anne Weill-Zrahia, Technion Computer Center, October 2008
Resources needed for applications arising from nanotechnology:
- Large memory – terabytes
- High floating-point computing speed – teraflops
- High data throughput – state of the art …
SMP architecture: [diagram – multiple processors (P) sharing a single memory]
Cluster architecture: [diagram – nodes linked by an interconnection network]
Why not a cluster?
- A single SMP system is easier to purchase and maintain
- Ease of programming on SMP systems
Why a cluster?
- Scalability
- Total available physical RAM
- Reduced cost
But …
- Having an application that exploits the parallel capabilities
- Studying the application or applications that will run on the cluster
Things to include in the design (property of code → essential component):
- CPU bound → fast computing unit
- Memory bound → large memory, fast access
- Global flow of data in the parallel application → fast interconnect
Our choices (property of code → essential component → choice):
- Computationally intensive, floating point → fast computing unit → 64-bit dual-core Opteron, Rev. F
- Large matrices → large memory, fast access → 8 GB/node
- Finite element, spectral codes → fast interconnect → InfiniBand DDR (20 Gb/s, low latency)
Other requirements:
- Space, power and cooling constraints, strength of floors
- Software configuration:
  1. Operating system
  2. Compilers and application development tools
  3. Load balancing and job scheduling
  4. System management tools
Configuration: [diagram – nodes of processors (P) and memory (M) connected through an InfiniBand switch]
Before finalizing our choice … one should check, on a similar system:
- Single-processor peak performance
- InfiniBand interconnect performance
- SMP behaviour
- Behaviour of non-commercial parallel applications
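A simple way to probe the interconnect on such a system is a ping-pong micro-benchmark between two MPI ranks. The sketch below is illustrative only; the message size, repetition count and reported metrics are our own choices, not values from the slides.

```c
/* Minimal MPI ping-pong sketch: round-trip latency and bandwidth
 * between ranks 0 and 1. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nreps  = 1000;
    const int nbytes = 1 << 20;             /* 1 MB messages (illustrative) */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < nreps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / nreps;      /* average round-trip time */
        printf("avg round trip: %g us, bandwidth: %g MB/s\n",
               rtt * 1e6, 2.0 * nbytes / rtt / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```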
Parallel application issues:
- Execution time
- Parallel speedup: S_p = T_1 / T_p
- Scalability
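For reference, speedup and the related parallel efficiency on p processes can be written as:

```latex
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}
```

As a purely illustrative example (not a measurement from the slides): if a job takes T_1 = 1200 s on one process and T_16 = 100 s on 16 processes, then S_16 = 12 and E_16 = 12/16 = 0.75.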
Benchmark design:
- Must give a good estimate of the performance of your application
- Acceptance test – should match all its components
Comparison of performance (LAPACK program, N=9000): Carmel reached 487 Mflops; Nanco was faster by a ratio of 7.8!
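A single-node LAPACK benchmark of this kind can be as simple as timing a dense solve. The sketch below, using LAPACKE's dgesv and the approximate (2/3)N^3 flop count for an LU solve, is an assumption about the shape of the test, not the actual benchmark code; link against LAPACKE/BLAS as provided by the installation.

```c
/* Sketch of a single-node LAPACK benchmark: time a dense N x N solve
 * and report an approximate Mflop rate. */
#include <lapacke.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const lapack_int n = 9000;               /* matrix order used in the test */
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (size_t i = 0; i < (size_t)n * n; i++) a[i] = rand() / (double)RAND_MAX;
    for (size_t i = 0; i < (size_t)n; i++)     b[i] = rand() / (double)RAND_MAX;

    clock_t t0 = clock();                     /* CPU time; fine for a serial run */
    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, a, n, ipiv, b, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* LU factorization + solve is roughly (2/3) n^3 floating-point operations */
    double flops = 2.0 / 3.0 * (double)n * n * n;
    printf("info=%d, time=%.1f s, ~%.0f Mflops\n", (int)info, secs, flops / secs / 1e6);

    free(a); free(b); free(ipiv);
    return 0;
}
```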
Execution time of a Monte Carlo parallel code (MPI): Nanco 4389 s (~1 hr) vs. Carmel ~6 hrs!
What did work:
- Running MPI code interactively
- Running a serial job through the queue
- Compiling C code with MPI
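The C-with-MPI compile test amounts to building and launching something like the following minimal program (a generic sketch, not the site's actual test case):

```c
/* Minimal MPI smoke test: compile with mpicc, run interactively or
 * through the queue, and check that every rank reports in. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```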
What did not work:
- Compiling F90 or C++ code with MPI
- Running MPI code through the queue
- Queues do not do accounting per CPU
Parallel performance results:
- Theoretical peak: 2.1 Tflops
- Nanco performance on HPL: 0.58 Tflops
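Taken together, the measured HPL result corresponds to roughly 28% of theoretical peak:

```latex
\frac{R_{\mathrm{HPL}}}{R_{\mathrm{peak}}} = \frac{0.58\ \text{Tflops}}{2.1\ \text{Tflops}} \approx 0.28
```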
Comparison with Sun Benchmark
Execution time – comparison of compilers
Performance with different optimizations
Conclusions from the acceptance tests:
- The new gcc (gcc4) is faster than PathScale for some applications
- MPI collective communication functions are implemented differently in various MPI versions
- Disk access times are crucial – use attached storage when possible
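Such differences between MPI versions can be exposed with a small collective micro-benchmark. The sketch below times repeated MPI_Allreduce calls; the vector length and repetition count are arbitrary illustrative choices.

```c
/* Sketch of a collective-communication micro-benchmark: time repeated
 * MPI_Allreduce calls to compare MPI implementations. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 18;               /* 256K doubles per reduction */
    const int nreps = 100;
    double *in  = calloc(count, sizeof *in);
    double *out = calloc(count, sizeof *out);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < nreps; i++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Allreduce time: %g ms\n", (t1 - t0) / nreps * 1e3);

    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```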
Scheduling decisions:
- Assessing priorities between user groups
- Assessing the parallel efficiency of different job types (MPI, serial, OpenMP, commercial software) and designing special queues for them
- Avoiding starvation by giving weight to the urgency parameter (see the sketch below)
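One way to picture such a weighting (a sketch only, not the scheduler's actual formula) is a linear combination of a fair-share term, a waiting-time term and the urgency parameter; raising the weight on the urgency term, which can be made to grow as a job waits, is what prevents starvation:

```latex
\text{priority} = w_{\text{share}} \cdot f_{\text{group}} + w_{\text{wait}} \cdot t_{\text{wait}} + w_{\text{urgency}} \cdot u_{\text{job}}
```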
Observations during production mode:
- Assessing users' understanding of the machine – supporting them in writing scripts and parallelizing efficiently
- Lack of visualization tools – writing a script to show the current usage of the cluster
Utilization of cluster
Utilization of Nanco, September 2008
Nanco jobs by type
Conclusion:
- Correct benchmark design is crucial for testing the capabilities of the proposed architecture
- Acceptance tests give leverage to negotiate with vendors and give insight into future choices
- Only after several weeks of running the cluster at full capacity can we make informed decisions on its management