E-science grid facility for Europe and Latin America

COMPUTING ELEMENT
GIUSEPPE PLATANIA
INFN Catania
30 June - 4 July, 2008

Catania (Italy), Joint EELA/EGEEIII Tutorial for Trainers

OUTLINE
– OVERVIEW
– INSTALLATION & CONFIGURATION
– TESTING
– FIREWALL SETUP
– TROUBLESHOOTING

OVERVIEW
The Computing Element (CE) is the central service of a site. Its main functions are:
– managing jobs (job submission, job control)
– reporting the status of the jobs to the WMS
– publishing all site information (site location, queues, CPU status, and so on) via LDAP (site BDII service)
It can run on top of several kinds of batch system:
– Torque + MAUI
– LSF
– SGE
– Condor

TORQUE + MAUI
The Torque server consists of:
– pbs_server, which provides the basic batch services such as receiving and creating batch jobs.
The Torque client consists of:
– pbs_mom, which places the job into execution. It is also responsible for returning the job's output to the user.
The MAUI system consists of:
– job_scheduler, which holds the site's policies and uses them to choose which job must be executed.

Site BDII**
– By default it is installed on the CE
– It collects all the site GRISes* (for example SE, RB, LFC, etc.)
– The name of the service is bdii
– The list of GRISes you want to publish is in: /opt/glite/etc/gip/site-urls.conf
– Log file: /opt/bdii/var/bdii.log
* GRIS = Grid Resource Information Service
** BDII = Berkeley Database Information Index

Computing Element installation & configuration using YAIM

WHAT KIND OF CE?
There are several metapackages to choose from:
– ig_CE: LCG ComputingElement without batch system packages.
– ig_CE_LSF: LCG ComputingElement with LSF. IMPORTANT: provided for consistency; it does not install LSF, but it applies some fixes via ig_configure_node.
– ig_CE_torque: LCG ComputingElement with Torque+MAUI.

HOW TO GET A HOST CERTIFICATE
Host certificate for the CE:
– Please request it from your RA
Install the host certificate (hostcert.pem and hostkey.pem) in /etc/grid-security:
– mkdir /etc/grid-security
– chmod 644 hostcert.pem
– chmod 400 hostkey.pem

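The installation steps above can be wrapped in a small helper. This is only a sketch: the function name install_hostcert and its arguments are hypothetical, and it assumes both .pem files sit together in one source directory.

```shell
#!/bin/sh
# install_hostcert SRC_DIR [DEST_DIR]
# Hypothetical helper: copies hostcert.pem and hostkey.pem from SRC_DIR
# into DEST_DIR (default /etc/grid-security) and applies the permissions
# listed on the slide (644 for the certificate, 400 for the key).
install_hostcert() {
    src_dir=$1
    dest_dir=${2:-/etc/grid-security}
    mkdir -p "$dest_dir" || return 1
    cp "$src_dir/hostcert.pem" "$src_dir/hostkey.pem" "$dest_dir/" || return 1
    chmod 644 "$dest_dir/hostcert.pem" || return 1
    chmod 400 "$dest_dir/hostkey.pem"
}
```

Run as root on the CE, for example: install_hostcert /root/certs
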
Repository settings
REPOS="ca dag ig jpackage gilda glite-lcg_ce_torque glite-bdii"
Download and store the repo files:
for name in $REPOS; do wget -O /etc/yum.repos.d/$name.repo; done

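After the download loop it can be worth checking that every repo file really arrived, since a failed wget -O may leave an empty file behind. A minimal sketch; the function name missing_repo_files is hypothetical:

```shell
#!/bin/sh
# missing_repo_files REPO_DIR NAME...
# Hypothetical helper: prints every NAME for which REPO_DIR/NAME.repo
# is missing or empty.
missing_repo_files() {
    repo_dir=$1
    shift
    for name in "$@"; do
        [ -s "$repo_dir/$name.repo" ] || echo "$name"
    done
}
```

Usage: missing_repo_files /etc/yum.repos.d $REPOS (no output means all files are in place).
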
INSTALLATION
yum install jdk java sun-compat
yum install lcg-CA
yum install ig_CE_torque
If it is also the site BDII collector:
yum install ig_BDII
GILDA rpms:
yum install gilda_utils

Customize ig-site-info.def
Copy the ig-site-info.def template file provided by ig_yaim into the gilda directory and customize it:
cp /opt/glite/yaim/examples/siteinfo/ig-site-info.def /opt/glite/yaim/etc/gilda/
Open the /opt/glite/yaim/etc/gilda/ file using a text editor and set the following values according to your grid environment:
CE_HOST=
BATCH_SERVER=$CE_HOST

Customize ig-site-info.def
WN_LIST=/opt/glite/yaim/etc/gilda/wn-list.conf
The file specified in WN_LIST must contain the hostnames of all your WNs.
WARNING: it is important to set it up before running the configure command.

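A quick sanity check of the WN list before running the configure command might look like the sketch below. It assumes the file holds one hostname per line; the function name check_wn_list is hypothetical.

```shell
#!/bin/sh
# check_wn_list FILE
# Hypothetical helper: fails if FILE is unreadable or lists a hostname
# twice; otherwise prints the number of non-blank lines (worker nodes).
check_wn_list() {
    file=$1
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Ignore blank lines, then look for hostnames that appear more than once.
    dups=$(grep -v '^[[:space:]]*$' "$file" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate entries: $dups" >&2
        return 1
    fi
    # grep -c prints the count; it exits non-zero when the list is empty.
    grep -cv '^[[:space:]]*$' "$file"
}
```

Usage: check_wn_list /opt/glite/yaim/etc/gilda/wn-list.conf
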
Customize ig-site-info.def
Copy the users and groups example files to /opt/glite/yaim/etc/gilda/:
cp /opt/glite/yaim/examples/ig-groups.conf /opt/glite/yaim/etc/gilda/
cp /opt/glite/yaim/examples/ig-users.conf /opt/glite/yaim/etc/gilda/
Append the gilda users and groups definitions to ig-users.conf and ig-groups.conf:
cat /opt/glite/yaim/etc/gilda/gilda_ig-users.conf >> /opt/glite/yaim/etc/gilda/ig-users.conf
cat /opt/glite/yaim/etc/gilda/gilda_ig-groups.conf >> /opt/glite/yaim/etc/gilda/ig-groups.conf

Customize ig-site-info.def
GROUPS_CONF=/opt/glite/yaim/etc/gilda/ig-groups.conf
USERS_CONF=/opt/glite/yaim/etc/gilda/ig-users.conf
JAVA_LOCATION="/usr/java/j2sdk1.4.2_12"
SITE_NAME=GILDA
SITE_LOC="Catania, ITALY"
SITE_LAT=37.5
SITE_LONG=
SITE_WEB=""
SITE_TIER="GILDA Testbed"

Customize ig-site-info.def
JOB_MANAGER=lcgpbs
CE_BATCH_SYS=pbs
BATCH_BIN_DIR=/usr/bin
BATCH_VERSION=torque
CE_CPU_MODEL=Opteron
CE_CPU_VENDOR=AMD
CE_CPU_SPEED=3000
CE_OS="Scientific Linux"
CE_OS_RELEASE=4.5
CE_OS_VERSION="SL"
CE_MINPHYSMEM=2048
CE_MINVIRTMEM=4096
CE_SMPSIZE=2
CE_SI00=1000
CE_SF00=1200
CE_OUTBOUNDIP=TRUE
CE_INBOUNDIP=TRUE

Customize ig-site-info.def
DPM_HOST="dpm_hostname"
SE_LIST="$DPM_HOST"
SITE_BDII_HOST=$CE_HOST
BDII_REGIONS="CE SE"
BDII_CE_URL="ldap://$CE_HOST:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"
VOS="gilda"
ALL_VOMS="gilda"

Customize ig-site-info.def
QUEUES="short long infinite"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
To configure a queue for a single VO:
QUEUES="short long infinite gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
GILDA_GROUP_ENABLE="gilda"

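Every queue named in QUEUES needs a matching <QUEUE>_GROUP_ENABLE variable, as in the examples above. The sketch below checks this by sourcing the file in a subshell. It assumes queue names are plain words that map to variable names by upper-casing; the function name check_queue_groups is hypothetical.

```shell
#!/bin/sh
# check_queue_groups SITE_INFO_FILE
# Hypothetical helper: sources the site-info file in a subshell and fails
# if any queue listed in QUEUES lacks a non-empty <QUEUE>_GROUP_ENABLE.
# Pass an absolute path so that "." does not search PATH.
check_queue_groups() {
    siteinfo=$1
    (
        . "$siteinfo" || exit 1
        status=0
        for q in $QUEUES; do
            # short -> SHORT_GROUP_ENABLE, gilda -> GILDA_GROUP_ENABLE, ...
            var="$(echo "$q" | tr '[:lower:]' '[:upper:]')_GROUP_ENABLE"
            eval "val=\${$var:-}"
            if [ -z "$val" ]; then
                echo "$var is not set for queue $q" >&2
                status=1
            fi
        done
        exit $status
    )
}
```

Running it before ig_yaim catches a queue that was added to QUEUES without its group variable.
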
CE Torque CONFIGURATION
Now we can configure the node:
/opt/glite/yaim/bin/ig_yaim -c -s /opt/glite/yaim/etc/gilda/ -n ig_CE_torque -n BDII_site

Computing Element testing

Testing
Check that the local GRIS and the site BDII are running on the CE and are publishing the right information (CPU, site name and so on):
ldapsearch -x -h -p -b mds-vo-name=resource,o=grid
ldapsearch -x -h -p -b mds-vo-name=,o=grid

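The two queries differ only in host, port and search base, so they can share one wrapper. This is a sketch: the function name query_gris is hypothetical, and host and port are left as arguments to be filled in with your site's values.

```shell
#!/bin/sh
# query_gris HOST PORT [BASE]
# Hypothetical wrapper around the ldapsearch call shown on the slide.
# BASE defaults to the resource-level search base.
query_gris() {
    host=$1
    port=$2
    base=${3:-mds-vo-name=resource,o=grid}
    ldapsearch -x -h "$host" -p "$port" -b "$base"
}
```

For the site BDII, pass the site-level base as the third argument, i.e. mds-vo-name= followed by your site name and ,o=grid.
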
Testing
Become a gilda user:
# su - gilda001
Create a file named test.sh containing:
#!/bin/sh
sleep 20 # (it's useful to see the job status)
hostname
Save it and make it executable:
chmod 700 test.sh

Testing
gilda001]$ qsub -q short test.sh
gilda001]$ qstat -a
ce.localdomain:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
wn.localdo gilda001 short test.sh :15 R --

Testing
gilda001]$ qstat -a
gilda001]$
The job has finished and we can list the output files:
gilda001]$ ls
test.sh.e3 test.sh.o3
And show them:
gilda001]$ cat test.sh.e3 (error file)
gilda001]$
gilda001]$ cat test.sh.o3 (output file)
wn.localdomain

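Instead of re-running qstat by hand, the wait can be scripted. The sketch below assumes that qstat exits with a non-zero status once the server no longer knows the job, which is the usual Torque behaviour when completed jobs are not kept. The function name wait_for_job is hypothetical.

```shell
#!/bin/sh
# wait_for_job JOB_ID [INTERVAL_SECONDS]
# Hypothetical helper: polls qstat until it stops recognising JOB_ID.
# Assumption: qstat returns non-zero for a job that has left the queue.
wait_for_job() {
    job_id=$1
    interval=${2:-5}
    while qstat "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
}
```

Typical use, capturing the identifier printed by qsub: job_id=$(qsub -q short test.sh); wait_for_job "$job_id"; ls
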
Testing
Log on to the UI:
hostname -> glite-tutor.ct.infn.it
Username -> catania
Password -> GridCAT
Grid passphrase -> CATANIA

Testing
plt]$ voms-proxy-init --voms gilda
plt]$ globus-job-run grid006.ct.infn.it:2119/jobmanager-lcgpbs -q short /bin/hostname
wn.localdomain
plt]$ edg-job-submit -r grid006.ct.infn.it:2119/jobmanager-lcgpbs-short hostname.jdl
Selected Virtual Organisation name (from proxy certificate extension): gilda
Connecting to host glite-rb.ct.infn.it, port 7772
Logging to host glite-rb.ct.infn.it, port 9002
********************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is:
-
********************************************************************************

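The slides do not show the contents of hostname.jdl. A minimal JDL for this test could be generated as sketched below; the output file names hostname.out and hostname.err, and the function name write_hostname_jdl, are assumptions made for illustration.

```shell
#!/bin/sh
# write_hostname_jdl FILE
# Hypothetical helper: writes a minimal JDL that runs /bin/hostname and
# retrieves its standard output and error (file names are assumptions).
write_hostname_jdl() {
    cat > "$1" <<'EOF'
Executable = "/bin/hostname";
StdOutput = "hostname.out";
StdError = "hostname.err";
OutputSandbox = {"hostname.out", "hostname.err"};
EOF
}
```

Usage: write_hostname_jdl hostname.jdl, then submit it with the edg-job-submit command shown above.
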
FIREWALL SETUP

/etc/sysconfig/iptables (1/2)
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport maui -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_mom -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_resmom -j ACCEPT

/etc/sysconfig/iptables (2/2)
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 3878:3879 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 1020:1023 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 20000: -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32768: -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 32768: -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --syn -j REJECT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT

IPTABLES STARTUP
/sbin/chkconfig iptables on
/etc/init.d/iptables start

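To confirm that a given port or service name has an ACCEPT rule in the rules file, a simple text check is often enough. This sketch only matches single-port rules written in the form used above (--dport PORT -j ACCEPT) and does not understand port ranges; the function name port_is_open is hypothetical.

```shell
#!/bin/sh
# port_is_open RULES_FILE PORT
# Hypothetical helper: succeeds if RULES_FILE contains an append rule
# that accepts exactly PORT (a number or a service name such as maui).
# Limitation: ranges such as 3878:3879 are not interpreted.
port_is_open() {
    rules_file=$1
    port=$2
    grep -Eq -- "^-A .*--dport $port -j ACCEPT" "$rules_file"
}
```

Usage: port_is_open /etc/sysconfig/iptables 22 && echo "ssh is allowed"
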
Troubleshooting

Troubleshooting
plt]$ globus-job-run :2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)
Solution: check that the globus-gatekeeper daemon is up and running on the CE.

plt]$ globus-job-run :2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because authentication failed: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)
Solution: the GILDA CA rpm is probably not installed on the CE.

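The two cases above can be kept as a small lookup for use in test scripts. This is only a sketch covering the two error codes discussed on this slide; the function name gram_hint is hypothetical.

```shell
#!/bin/sh
# gram_hint ERROR_CODE
# Hypothetical helper: maps the two GRAM error codes from this slide to
# the suggested check. Any other code returns a non-zero status.
gram_hint() {
    case $1 in
        12) echo "check that the globus-gatekeeper daemon is running on the CE" ;;
        7)  echo "check that the GILDA CA rpm is installed on the CE" ;;
        *)  echo "no hint recorded for GRAM error code $1"; return 1 ;;
    esac
}
```

Usage: gram_hint 12
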
Troubleshooting
plt]$ edg-gridftp-ls gsiftp:// /
error the server sent an error response: LCMAPS credential mapping NOT successful
Solution: check the VO mapping on the CE in
/opt/edg/etc/lcmaps/gridmapfile
/opt/edg/etc/lcmaps/groupmapfile

Troubleshooting
The CE is publishing wrong information, such as:
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 0
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 0
GlueCEStateWaitingJobs: 4444
Run the script
/opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
and check whether it reports any errors. Often it does not work because the batch system is down or in a locked state. In this case restart the torque service:
/etc/init.d/pbs_server restart

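The symptom shown above (4444 waiting jobs) is easy to detect automatically in the LDIF returned by a query to the CE. A minimal sketch that reads the LDIF on standard input; the function name has_placeholder_waiting_jobs is hypothetical.

```shell
#!/bin/sh
# has_placeholder_waiting_jobs
# Hypothetical helper: reads LDIF on stdin and succeeds if any entry
# publishes GlueCEStateWaitingJobs: 4444, the value shown on the slide
# as a sign that the dynamic scheduler information is not updated.
has_placeholder_waiting_jobs() {
    grep -q '^GlueCEStateWaitingJobs: 4444$'
}
```

Pipe the output of your ldapsearch query into it, for example: ldapsearch ... | has_placeholder_waiting_jobs && echo "dynamic information is stale"
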
Troubleshooting
If a query to the site BDII does not show the information about a site, look at the bdii log file /opt/bdii/var/bdii.log
For example:
GILDA: ldap_bind: Can't contact LDAP server
Check that:
– bdii is up and running (ps aux | grep bdii)
– the resource URL is in the list file /opt/glite/etc/gip/site-urls.conf
– the firewall is set up correctly

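The second check in the list can be scripted as a plain fixed-string search, which avoids making assumptions about the exact layout of the list file. The function name url_listed is hypothetical.

```shell
#!/bin/sh
# url_listed LIST_FILE URL
# Hypothetical helper: succeeds if URL appears verbatim in LIST_FILE.
# -F treats the URL as a fixed string, so "." and "=" are not patterns.
url_listed() {
    list_file=$1
    url=$2
    grep -Fq -- "$url" "$list_file"
}
```

Usage: url_listed /opt/glite/etc/gip/site-urls.conf "$BDII_CE_URL" || echo "CE URL is not published"
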