WORKER NODE
Alfonso Pardo
EPIKH School, System Admin Tutorial
Beijing, 2010 August 30th – 2010 September 3rd
OUTLINE
– OVERVIEW
– INSTALLATION & CONFIGURATION
– TESTING
– FIREWALL SETUP
– TROUBLESHOOTING
OVERVIEW
The Worker Node is the service where the jobs run. Its main functions are:
– execute the jobs
– report the status of the jobs to the Computing Element
It can run several kinds of batch system client:
– Torque
– LSF
– SGE
– Condor
TORQUE client
The Torque client consists of a single daemon:
– pbs_mom, which places the job into execution. It is also responsible for returning the job's output to the user.
Worker Node installation & configuration using YAIM
WHAT KIND OF WN?
There are several metapackages to choose from:
– glite_WN: "generic" Worker Node.
– glite_WN_noafs: like glite_WN but without AFS.
– glite_WN_LSF: LSF Worker Node. IMPORTANT: provided for consistency; it does not install the LSF software, but it applies some fixes via ig_configure_node.
– glite_WN_LSF_noafs: like glite_WN_LSF but without AFS.
– glite_WN_torque: Torque Worker Node.
– glite_WN_torque_noafs: like glite_WN_torque but without AFS.
Repository settings
Define the list of repositories:
    REPOS="lcg-ca dag ig gilda glite-wn_torque"
Download and store the repo files:
    for name in $REPOS; do
        wget it.cnaf.infn.it/mrepo/repos/sl5/x86_64/$name.repo -O /etc/yum.repos.d/$name.repo
    done
    wget -O /etc/yum.repos.d/gilda.repo http://grid018.ct.infn.it/mrepo/repos/gilda.repo
    wget -O /etc/yum.repos.d/jpackage.repo http://grid-it.cnaf.infn.it/mrepo/repos/jpackage.repo
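Since wget -O can leave an empty file behind when a download fails, it may be worth confirming that every repo file really arrived before running yum. The following is only a minimal sketch: the helper name missing_repos and its arguments are hypothetical and not part of gLite or YAIM.

```shell
# List the repo names whose .repo file is missing or empty in the
# given directory (hypothetical helper, for illustration only).
missing_repos() {
    dir=$1
    shift
    for name in "$@"; do
        # -s is true only if the file exists and is non-empty
        [ -s "$dir/$name.repo" ] || echo "$name"
    done
}

# Example usage on a worker node (assumes the REPOS value used above):
#   REPOS="lcg-ca dag ig gilda glite-wn_torque"
#   missing_repos /etc/yum.repos.d $REPOS
```

An empty result means all the listed repo files are present and non-empty; any name printed should be downloaded again.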
INSTALLATION
    yum install jdk java sun-compat
    yum install lcg-CA
    yum install lcg-WN
    yum install glite-TORQUE_utils
    yum install glite-TORQUE_client
Gilda rpms:
    yum install gilda_utils gilda_applications
Customize ig-site-info.def
Copy the users and groups example files to /opt/glite/yaim/etc/gilda/:
    cp /opt/glite/yaim/examples/ig-groups.conf /opt/glite/yaim/etc/gilda/
    cp /opt/glite/yaim/examples/ig-users.conf /opt/glite/yaim/etc/gilda/
Append the gilda users and groups definitions to ig-users.conf and ig-groups.conf in /opt/glite/yaim/etc/gilda/:
    cat /opt/glite/yaim/etc/gilda/gilda_ig-users.conf >> /opt/glite/yaim/etc/gilda/ig-users.conf
    cat /opt/glite/yaim/etc/gilda/gilda_ig-groups.conf >> /opt/glite/yaim/etc/gilda/ig-groups.conf
Copy the ig-site-info.def template file provided by ig_yaim into the gilda directory and customize it:
    cp /opt/glite/yaim/examples/siteinfo/ig-site-info.def /opt/glite/yaim/etc/gilda/
Open the copied file in /opt/glite/yaim/etc/gilda/ with a text editor and set the following values according to your grid environment:
    CE_HOST=
    TORQUE_SERVER=$CE_HOST
    WN_LIST=/opt/glite/yaim/etc/gilda/wn-list.conf
The file specified in WN_LIST must contain the hostnames of all your WNs.
WARNING: it is important to set this file up before running the configure command.
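A quick way to catch a forgotten entry is to check that a given hostname is present in the WN list before configuring. This is a rough sketch only: the function name wn_in_list is hypothetical, and the file is assumed to hold one hostname per line.

```shell
# Return success if the given hostname appears as a whole line in the
# WN list file (assumed format: one hostname per line).
wn_in_list() {
    list_file=$1
    host=$2
    # -F fixed string, -x whole-line match, -q quiet
    grep -Fxq "$host" "$list_file"
}

# Example usage on a worker node:
#   wn_in_list /opt/glite/yaim/etc/gilda/wn-list.conf "$(hostname -f)" \
#       || echo "this host is missing from WN_LIST"
```

The whole-line match avoids false positives when one hostname is a prefix of another (wn1 versus wn10).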
    GROUPS_CONF=/opt/glite/yaim/etc/gilda/ig-groups.conf
    USERS_CONF=/opt/glite/yaim/etc/gilda/ig-users.conf
    JAVA_LOCATION="/usr/bin/java/jdk"
    JOB_MANAGER=lcgpbs
    BATCH_BIN_DIR=/usr/bin
    BATCH_VERSION=torque
    VOS="gilda"
    ALL_VOMS="gilda"
    QUEUES="short long infinite"
    SHORT_GROUP_ENABLE=$VOS
    LONG_GROUP_ENABLE=$VOS
    INFINITE_GROUP_ENABLE=$VOS
To configure an additional queue for a single VO:
    QUEUES="short long infinite gilda"
    SHORT_GROUP_ENABLE=$VOS
    LONG_GROUP_ENABLE=$VOS
    INFINITE_GROUP_ENABLE=$VOS
    GILDA_GROUP_ENABLE="gilda"
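Each queue named in QUEUES is paired with a variable called <QUEUE>_GROUP_ENABLE, so adding a queue and forgetting its variable is an easy mistake. The sketch below checks that pairing; the function name check_queue_groups is hypothetical, and it assumes the variable name is simply the upper-cased queue name followed by _GROUP_ENABLE, as in the values above.

```shell
# For every queue named in QUEUES, check that the matching
# <QUEUE>_GROUP_ENABLE variable is set and non-empty.
# Prints the names of the missing variables; returns non-zero if any.
check_queue_groups() {
    status=0
    for q in $QUEUES; do
        # Upper-case the queue name (short -> SHORT); any other character
        # is mapped to _ so that the eval below only sees a plain name.
        var=$(printf '%s' "$q" | tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9_' '_')_GROUP_ENABLE
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "$var"
            status=1
        fi
    done
    return $status
}

# Demo with the single-VO queue layout shown above.
QUEUES="short long infinite gilda"
VOS="gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
GILDA_GROUP_ENABLE="gilda"
check_queue_groups && echo "every queue has a group list"
```

In practice the variables would come from sourcing the customized site-info file rather than from inline assignments.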
WN Torque CONFIGURATION
Now we can configure the node:
    /opt/glite/yaim/bin/ig_yaim -n glite-WN -n glite-TORQUE_client -n glite-TORQUE_utils
Worker Node testing
Verify that pbs_mom is running and that the node state is free:
    root]# /etc/init.d/pbs_mom status
    pbs_mom (pid 3692) is running...
    root]# pbsnodes -a
    wn.localdomain
         state = free
         np = 2
         properties = lcgpro
         ntype = cluster
         status = arch=linux,uname=Linux wn.localdomain EL.cern 1 Tue Oct 4 16:45:05 CEST 2005 i686,sessions= ,3584,nsessions=6,nusers=1,idletime=1569,totmem=254024kb,availmem=69852kb,physmem=254024kb,ncpus=1,loadave=0.30,rectime=
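With more than a handful of nodes, reading the full pbsnodes listing by eye gets tedious. As an illustration, the state of each node can be pulled out with a small awk filter. The function name node_states is hypothetical, and the sketch assumes the layout shown above: node names start in column 0 and attribute lines are indented.

```shell
# Print "<node> <state>" for each node found in `pbsnodes -a` style
# output read from stdin.
node_states() {
    awk '
        /^[^ \t]/         { node = $1 }        # unindented line: node name
        /^[ \t]+state = / { print node, $3 }   # indented "state = ..." line
    '
}

# Demo on sample text shaped like the listing above.
node_states <<'EOF'
wn.localdomain
     state = free
     np = 2
     properties = lcgpro
     ntype = cluster
EOF
```

On a live system the input would come from a pipe, for example pbsnodes -a | node_states, and any node not reported as free deserves a closer look.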
First of all, check that a generic user on the WN can ssh to the CE without typing a password:
    root] su - gilda001
    gilda001] ssh ce
    gilda001]
The same test has to succeed between the WNs in order to run MPI jobs:
    gilda001] ssh wn1
    gilda001]
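The manual ssh test can also be run non-interactively, which is handy when several hosts have to be checked. A minimal sketch, using the standard OpenSSH options BatchMode and ConnectTimeout; the function name and the SSH override variable are hypothetical additions for illustration.

```shell
# Return success only if ssh to the host works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
# SSH can be overridden (e.g. for a dry run); it defaults to ssh.
can_ssh_without_password() {
    ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage as the pool user (hostnames ce and wn1 as in the test above):
#   for h in ce wn1; do
#       can_ssh_without_password "$h" || echo "non-interactive ssh to $h failed"
#   done
```

A failure here does not say why the login failed (missing host entry, unknown host key, network problem), only that a job could not log in unattended.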
FIREWALL setup
/etc/sysconfig/iptables
    *filter
    :INPUT ACCEPT [0:0]
    :FORWARD ACCEPT [0:0]
    :OUTPUT ACCEPT [0:0]
    :RH-Firewall-1-INPUT - [0:0]
    -A INPUT -j RH-Firewall-1-INPUT
    -A FORWARD -j RH-Firewall-1-INPUT
    -A RH-Firewall-1-INPUT -i lo -j ACCEPT
    -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -s --dport 22 -j ACCEPT
    -A RH-Firewall-1-INPUT -p all -s -j ACCEPT
    -A RH-Firewall-1-INPUT -p tcp -m tcp --syn -j REJECT
    -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
    COMMIT
IPTABLES STARTUP
    /sbin/chkconfig iptables on
    /etc/init.d/iptables start
Troubleshooting
    root]# su - gilda001
    gilda001] ssh ce
    password:
If a password is requested, this WN's hostname is probably not in /etc/ssh/shosts.equiv, or its ssh keys were not created and stored in /etc/ssh/ssh_known_hosts on the CE.
Solution (to run on the CE): ensure that the WN is in the pbs node list using:
    root]# pbsnodes -a
And then:
    root]# /opt/edg/sbin/edg-pbs-shostsequiv
    root]# /opt/edg/sbin/edg-pbs-known-hosts
    root]# pbsnodes -a
    wn.localdomain
         state = down
         np = 2
         properties = lcgpro
         ntype = cluster
Solution:
    root]# /etc/init.d/pbs_mom restart