Published by Corey French. Modified over 8 years ago.
1
Willkommen Welcome Bienvenue How we work with users in a small environment HPC@Empa Patrik Burkhalter
2
How we work with users in a small environment Patrik Burkhalter System administrator HPC cluster at Empa At Empa since 2012 Linux system admin before Empa (mainly web, db and app servers)
3
Situation at Empa Agenda Situation at Empa Cluster support User support Enforcement
4
Situation at Empa At the moment, we have 2 clusters at Empa: Ipazia, the cluster we have had since 2006, and Hypatia, the new cluster we built this year. The computing nodes from the old cluster will be detached from Ipazia and connected to Hypatia step by step.
5
Situation at Empa Ipazia the Empa HPC cluster 102 nodes (Dell) Built with the help of Partec and CSCS Parastation cluster middleware from Partec Torque resource manager Maui scheduler Infiniband DDR interconnect Lustre file systems
6
Situation at Empa Ipazia hardware Front end node PowerEdge 2900 2 * Intel(R) Xeon(R) CPU 5140 @ 2.33GHz (4 cores) 4GB RAM 1TB shared /home Computing nodes Node 1...30: deactivated, old 4 core pizza boxes Node 31...46: PowerEdge M605 2 * Quad-Core AMD Opteron(tm) Processor 2356 32GB RAM Node 47…102: PowerEdge M610 2 * Intel(R) Xeon(R) CPU E5540 @ 2.53GHz 24GB RAM
7
Situation at Empa Hypatia the new Empa cluster Built from scratch by Empa 32 nodes in 2 Dell M1000e chassis Torque resource manager Maui scheduler Infiniband FDR interconnect Lustre file systems Know-how completely @Empa (we have support for the SAN units) Well documented In production. Nodes from Ipazia will be migrated to Hypatia soon.
8
Situation at Empa Hypatia hardware Front end node PowerEdge R620 2 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (16*2 cores, hyper threading) 32GB RAM Computing nodes PowerEdge M620 2 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (16 cores) 64GB RAM
9
Situation at Empa pbstop on Ipazia
10
Situation at Empa pbstop on Hypatia (new cluster)
11
Situation at Empa Lustre storage available to both clusters 25TB for backed-up data (/project) 35TB speed-optimized space (/scratch), due to the number of disks
12
Situation at Empa We changed our support model this year from external support to in-house support. Why did we do this? We felt confident that it is possible We can save money on the service contracts We can now fix (almost) everything by ourselves We can provide better user support, because we have a deeper understanding How did we minimize the risk of breaking the cluster? We built a new cluster and left the running cluster alone A lot of users are using the new cluster already We can migrate the nodes to the new cluster once its stability is proven
13
Situation at Empa Ipazia Pizza nodes removed 2 new chassis 1 new front end 1 new SSD storage
14
Situation at Empa Support team Daniele Passerone (5% FTE) Carlo Pignedoli (5% FTE) Patrik Burkhalter (50% FTE)
15
Cluster Support Agenda Situation at Empa Cluster support User support Enforcement
16
Cluster Support Support we provide Introduction to basic Linux usage Connecting to the system using an SSH client Linux basic commands File system hierarchy Introduction of new users to the cluster Planning of future jobs Reservation of nodes for users Installation, compilation and testing of new software GNU and Intel compilers, MPI (openmpi/mvapich2), OpenFOAM, Abaqus Any software requested by the users System updates Hardware, OS Software updates Acquiring and installing new hardware New nodes GPU node Replacing failed hardware
17
Cluster Support Documentation of the cluster architecture
18
Cluster Support Documentation of the cluster usage
19
Cluster Support Lustre file system maintenance and extension At the moment, we are migrating our Lustre file systems workspc and storage to project and scratch while the file systems are online 1 completely new file system named project using new hardware SSD metadata target (MDT)
20
Cluster Support Lustre file system maintenance and extension 1 new file system scratch built out of the file systems workspc and storage We deactivate one OST per fs from the old file systems We use `lfs find` to find the files having stripes on the deactivated OSTs We copy the files to a new location on the same fs Finally, we move them back to their original location
21
Cluster Support The OST gets disabled temporarily on the ionode This makes sure the OST will stay readable lctl dl | grep ' osc ' lctl --device deactivate
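The device number passed to `lctl --device` has to be read from the `lctl dl` listing. A minimal sketch of that lookup follows. It assumes the usual `lctl dl` column layout (device number, state, type, name, ...), and `osc_devno` is a hypothetical helper name, not part of Lustre.

```shell
# osc_devno OSTNAME
# Reads `lctl dl` output on stdin and prints the device number of the
# first osc device whose name starts with OSTNAME.
# Assumption: column 1 is the device number, column 3 the device type,
# column 4 the device name.
osc_devno() {
    awk -v ost="$1" '$3 == "osc" && index($4, ost) == 1 { print $1; exit }'
}

# Intended use on the ionode (not executed here):
#   devno=$(lctl dl | osc_devno storage-OST0003)
#   lctl --device "$devno" deactivate
```

Printing the number first and checking it by hand before deactivating is the safer habit, since the listing differs between servers and clients.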
22
Cluster Support Migration for files with an access time > 14 days Copies quickly but is kind of dirty

    TMPDIR="/mnt/storage/tmp"
    for i in $(lfs find --obd storage-OST0003 --atime +14 /mnt/storage); do
        DIR=$(dirname "$i")
        FILE=$(basename "$i")
        TMPPATH="$TMPDIR/$FILE"
        SRCPATH="$DIR/$FILE"
        # testing above values, continue to next entry if one test fails
        echo -en "$SRCPATH: "
        cp -p "$SRCPATH" "$TMPPATH" || exit 1
        mv "$TMPPATH" "$SRCPATH" || exit 1
        echo done
    done
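The loop above splits the file list on whitespace and reuses the bare file name in the temporary directory, so two files with the same name in different directories would collide. A slightly more defensive sketch of the same copy-and-move-back idea is shown below. `restripe_by_copy` is a hypothetical helper name, and it still assumes that no file name contains a newline.

```shell
# restripe_by_copy TMPDIR
# Reads file paths (one per line) on stdin. Each file is copied to a
# uniquely named temporary file in TMPDIR, and the copy is then moved
# back over the original path.
# Assumption: TMPDIR is on the same file system as the files.
restripe_by_copy() {
    local tmpdir src tmp
    tmpdir=$1
    while IFS= read -r src; do
        [ -f "$src" ] || continue              # skip anything that is not a regular file
        tmp=$(mktemp "$tmpdir/restripe.XXXXXX") || return 1
        cp -p -- "$src" "$tmp" || { rm -f -- "$tmp"; return 1; }
        mv -- "$tmp" "$src" || return 1
        printf '%s: done\n' "$src"
    done
}

# Intended use (not executed here):
#   lfs find --obd storage-OST0003 --atime +14 /mnt/storage \
#       | restripe_by_copy /mnt/storage/tmp
```

Like the original loop, this does not detect files that are modified while they are being copied, so it only suits files that have not been used for some time.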
23
Cluster Support Migration for newer files Checks if a file was changed during the migration process Does not check if a file is open on another node Therefore we only touch users who have no jobs and no running processes on the front end node lfs find --obd storage-OST0003 /mnt/storage/pbu | lfs_migrate -y
24
Cluster Support After the migration, the OST gets deactivated permanently lctl conf_param storage-OST0003.osc.active=0
25
Cluster Support Situation after the migration
26
Cluster Support Problems we experience during the migration A lot of small files are hard to migrate The user tends to “hoard” data
27
Cluster Support We also provide several shell environments for the users to ease cluster usage. We are using the Modules environment (http://modules.sourceforge.net/) A module can be loaded with the command: `module load / ` The module sets the user environment variables as defined in the module We provide modules for each self-compiled app and library This is particularly handy for users who like to compile their own software We started to use this approach this year
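As an illustration of what loading a module amounts to for the user's shell, here is a minimal sketch of the typical effect: a directory is prepended to a path-like environment variable. This is not how the Modules package is implemented. `prepend_path` is a hypothetical helper, and the ffmpeg directory under /share/apps is an assumed example path.

```shell
# prepend_path VAR DIR
# Prepends DIR to the colon-separated variable VAR unless it is already
# listed, then exports VAR.
prepend_path() {
    local var dir current
    var=$1
    dir=$2
    eval "current=\${$var-}"                   # read the variable named by $var
    case ":$current:" in
        *":$dir:"*) ;;                         # already present, nothing to do
        *) export "$var=$dir${current:+:$current}" ;;
    esac
}

# Example (assumed path):
#   prepend_path PATH /share/apps/ffmpeg/bin
```

The real `module` command also records what it changed, so that `module unload` can undo it, which this sketch does not attempt.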
28
Cluster Support Modules on Ipazia
29
Cluster Support Modules on Hypatia New modules are installed on user request
30
Cluster Support Example output of a module A simple module for ffmpeg We are trying to get rid of LD_LIBRARY_PATH and use RPATH instead This makes sure that a compiled binary uses the proper libraries independently of the user environment The module concept was new to our users but was well accepted
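One way to embed an RPATH at link time is the `-Wl,-rpath,DIR` option of GCC-style compiler drivers. The sketch below only builds the flag string. `rpath_flags` is a hypothetical helper, and the library directory in the usage comment is an assumed example under /share/libs.

```shell
# rpath_flags DIR...
# Prints, for every library directory given, the -L flag used to find the
# library at link time and the -Wl,-rpath flag that records the same
# directory in the binary for run time.
rpath_flags() {
    local dir flags
    flags=""
    for dir in "$@"; do
        flags="$flags${flags:+ }-L$dir -Wl,-rpath,$dir"
    done
    printf '%s\n' "$flags"
}

# Intended use (not executed here, library name is an assumption):
#   gcc -o app app.c $(rpath_flags /share/libs/foo/lib) -lfoo
#   readelf -d app | grep -i -E 'rpath|runpath'   # verify the embedded path
```

With the path recorded in the binary, the program finds its libraries even when the user has not loaded the matching module.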
31
User Support Agenda Situation at Empa Cluster support User support Enforcement
32
Situation at Empa Users from Empa and Eawag ~120 users 40 active users in the last 30 days last | awk '{print $1}' | sort | uniq | wc -l
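The pipeline above also counts the `reboot` pseudo-user, the blank line and the trailing `wtmp begins ...` line that `last` prints. A small sketch that filters those out is shown below. `count_login_users` is a hypothetical helper name, and the time window still depends on how far back the wtmp file reaches.

```shell
# count_login_users
# Reads `last` output on stdin and prints the number of distinct login
# names, ignoring blank lines, "reboot" entries and the "wtmp begins" footer.
count_login_users() {
    awk 'NF && $1 != "reboot" && $1 != "wtmp" { seen[$1] = 1 }
         END { n = 0; for (u in seen) n++; print n }'
}

# Intended use (not executed here):
#   last | count_login_users
```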
33
User Support The typical vendor-to-customer situation does not work at Empa We cannot provide a Service Level Agreement (SLA) We can only provide support on a best-effort basis No support during the night or on weekends Unplanned downtime can happen
34
User Support Typical IT user support does not work We cannot offer out-of-the-box solutions We don’t like to “just solve the problem now” We often don’t know the solution right away
35
User Support Treating the user as a partner works best for us The user gets treated as an equal. “If you think your users are idiots, only idiots will use it.” Linus Torvalds
36
User as a Partner The user has a strong scientific know how and sometimes just uses the software The engineer has a strong know how about clusters, but this means: A request by a scientist has to be reduced to the point at which the engineer is able to understand it The problem gets fixed by the engineer The solution gets communicated to the scientist in detail, until the scientist understands the particular situation It gets tested by the user It is important that each side understands the issue, otherwise potential optimization of the system gets lost.
37
User as a Partner If a user is experienced, tasks get delegated to the user. This could be: Compilation of apps and libraries Testing of a new package Problem analysis The solution always gets deployed by root to make sure all standards are fulfilled. If it is in the repository of our Linux distribution, it gets installed using the package manager If it is too old or not available, it gets compiled and installed in /share/apps or /share/libs There are modules provided to set the user environment module load / Our software gets compiled on a computing node and installed on the shared file system
38
User as a Partner Example, Abaqus A Finite Element Method (FEM) software used by the mechanical systems engineering department of Empa. The users have a strong background in mechanical engineering The users are using Abaqus on Windows to engineer parts We made a wrapper to simplify the job submission
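The wrapper itself is not shown on the slides. As a rough sketch of what such a wrapper could look like, the function below generates a Torque job script for an Abaqus input file. Everything specific in it is an assumption: the helper name `make_abaqus_job`, the module name `abaqus`, the single-node layout and the default of 16 cores.

```shell
# make_abaqus_job INPUT [CORES]
# Prints a Torque job script that runs Abaqus on INPUT using CORES cores
# of a single node. The job name is the input file name without ".inp".
make_abaqus_job() {
    local input cores job
    input=$1
    cores=${2:-16}
    job=$(basename "$input" .inp)
    cat <<EOF
#!/bin/bash
#PBS -N $job
#PBS -l nodes=1:ppn=$cores
#PBS -j oe
cd "\$PBS_O_WORKDIR"
module load abaqus
abaqus job=$job input=$input cpus=$cores interactive
EOF
}

# Intended use (not executed here):
#   make_abaqus_job bracket.inp 16 | qsub
```

Users who work with Abaqus on Windows then only need to know the input file and the number of cores, and the scheduler details stay hidden in the wrapper.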
39
Enforcement Agenda Situation at Empa Cluster support User support Enforcement
40
At the moment, we only do enforcement of: Obviously - the root password is not given to the users Disk quotas are in place (size and inodes) Maui scheduling configuration Optimization is planned for Hypatia, the new cluster
41
Enforcement login screen provides some information to make the user aware of the cluster situation
42
Thanks for listening Any questions, thoughts?