
McFarm: first attempt to build a practical, large scale distributed HEP computing cluster using Globus technology
Anand Balasubramanian, Karthik Gopalratnam, Amruth Dattatreya, Mark Sosebee, Drew Meyer, Tomasz Wlodek, Jae Yu
University of the Great State of Texas

UTA D0 Monte Carlo Farms
UTA operates two independent Linux MC farms: HEP and CSE.
HEP farm: 25 dual-Pentium machines.
CSE farm: 10 single 866 MHz Pentium machines.
The control software (job submission, load balancing, archiving, bookkeeping, job execution control, etc.) was developed entirely at UTA.
We plan to add several new farms from various institutions.

Main server (job manager): can read and write to all other nodes; contains the executables and the job archive.
Execution node (the worker): mounts its home directory on the main server; can read and write to the file server disk.
File server: mounts /home on the main server; its disk stores min-bias and generator files and is readable and writable by everybody.
Both the CSE and HEP farms share the same layout; they differ only in the number of nodes involved and in the software which exports completed jobs to their final destination.
The layout is flexible enough to allow for farm expansion when new nodes are available.
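For illustration only, here is a minimal Python sketch of how a worker could verify the shared-filesystem layout described above before accepting work; the mount points and the check itself are invented examples, not part of McFarm.

```python
import os
import sys

# Hypothetical mount points; the real paths depend on how a given farm
# exports the main server's /home and the file server disk.
REQUIRED_MOUNTS = [
    "/home",          # home area exported by the main server (job manager)
    "/data/minbias",  # file server disk with min-bias and generator files
]

def missing_mounts(paths):
    """Return the subset of paths that are not currently mounted."""
    return [p for p in paths if not os.path.ismount(p)]

if __name__ == "__main__":
    missing = missing_mounts(REQUIRED_MOUNTS)
    if missing:
        print("Worker not ready, missing mounts: " + ", ".join(missing))
        sys.exit(1)
    print("All required mounts present; this node can accept jobs.")
```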

Several experimental groups in D0 have expressed interest in our software and plan to install it on their farms: LTU, Boston, Dubna, Oklahoma, Brazil, Tata Institute (India), Manchester.
The Oklahoma farm is already running; LTU has installed McFarm but is not yet producing data. The Tata Institute is installing McFarm. We hope that others will follow.

How can you install the McFarm software?
WWW page: http://www-hep.uta.edu/~d0race/McFarm/McFarm.html
There you will find a collection of notes and scripts for installation of the farm server, file server, worker nodes, gather servers, etc.
You will also find additional information there: how to install Linux, Globus, SAM, etc.
The software is available for download, but read the documentation first!

How do we intend to control MC production?
When the number of farms is small (<2), it is possible to log on to each one of them and submit/kill jobs. But when one has to manage a large number of farms (= 2 or more), this becomes impractical.
We need a central site which will control the production at distributed locations. This is where Globus comes into the picture…
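As an illustration of what "submit via Globus" could look like from the central site, here is a minimal sketch wrapping the standard Globus Toolkit command-line client; the gatekeeper contact strings and the remote McFarm command are made-up placeholders.

```python
import subprocess

# Hypothetical gatekeeper contacts for the participating farms.
FARMS = {
    "uta_hep": "hep-farm.uta.edu/jobmanager",
    "ou":      "farm.ou.example/jobmanager",
}

def submit_remote(farm, remote_command, args):
    """Run a command on a remote farm's gatekeeper with globus-job-run.

    Assumes a valid grid proxy (grid-proxy-init) already exists and that
    the remote account has the McFarm tools in its default path.
    """
    contact = FARMS[farm]
    cmd = ["globus-job-run", contact, remote_command] + list(args)
    return subprocess.call(cmd)

# Example: ask the UTA HEP farm to queue a (hypothetical) MC request file.
# submit_remote("uta_hep", "mcfarm_submit", ["request_12345.txt"])
```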

Future of job submission and bookkeeping (diagram): a single MC production server submits and controls jobs at the participating production farms, which can be anywhere in the world, via Globus tools; it publishes production status to a WWW server for the user and interfaces to SAM. Only one machine takes care of job submission and monitoring for all farms!

What do we need to run McFarm at a dozen farms? We need three things:
The ability to submit (and delete) jobs on a remote farm.
The ability to monitor production progress (the bookkeeper).
The ability to obtain fast information about the number of jobs waiting, running, errored, and done on a given farm, and the number of active nodes on that farm (the intelligence provider, or McView).

The plan (diagram): a Submitter (Globus, GridFTP, Condor-G), a Bookkeeper, and McView (MDS), tied together by GEM.

Future of the bookkeeper
At present it stores run information in plain text files in a private format. In the long run this is not the way to go.
I intend to split the bookkeeper into two packages: one collects the run information and stores it in MySQL; the second does the WWW publishing.
I will design a good way of storing run information in a MySQL database.
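A hedged sketch of what the split might look like: one piece writes run records into MySQL, and the publishing piece only reads the table. The table layout, column names, and use of the MySQL-python (MySQLdb) module are illustrative assumptions, not the actual bookkeeper format.

```python
import MySQLdb  # assumes the MySQL-python module and a "mcfarm" database exist

SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id      INT PRIMARY KEY,
    farm        VARCHAR(32),
    request     VARCHAR(64),
    events_done INT,
    status      VARCHAR(16),   -- e.g. queued / running / done / error
    updated     TIMESTAMP
)
"""

def record_run(conn, run_id, farm, request, events_done, status):
    """Collector half of the bookkeeper: store or update one run record."""
    cur = conn.cursor()
    cur.execute(
        "REPLACE INTO runs (run_id, farm, request, events_done, status, updated)"
        " VALUES (%s, %s, %s, %s, %s, NOW())",
        (run_id, farm, request, events_done, status))
    conn.commit()

def publish_runs(conn):
    """Publishing half: read the table and emit a simple status listing."""
    cur = conn.cursor()
    cur.execute("SELECT run_id, farm, status, events_done FROM runs")
    for run_id, farm, status, events in cur.fetchall():
        print("run %s on %s: %s (%s events)" % (run_id, farm, status, events))

# conn = MySQLdb.connect(host="localhost", user="mcfarm", passwd="...", db="mcfarm")
# conn.cursor().execute(SCHEMA)
```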

McView data flow (diagram): on the farm server, the McFarm jobstat program feeds the McFarm monitor daemon and its information provider, which publish into the farm's MDS server; McView on the control station queries that information through its own MDS server.
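For illustration, a minimal sketch of the McView side querying MDS through the standard GT2 command-line client; the object class published by the (hypothetical) McFarm information provider is invented for this example.

```python
import subprocess

def query_farm_status(mds_host, port=2135):
    """Query a farm's MDS server with grid-info-search (GT2, LDAP-based).

    Returns the raw LDIF output; parsing the McFarm-specific attributes
    (waiting/running/errored/done jobs, active nodes) is left to McView.
    """
    cmd = [
        "grid-info-search",
        "-h", mds_host,
        "-p", str(port),
        # Hypothetical object class published by the McFarm info provider.
        "(objectclass=McFarmStatus)",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout

# print(query_farm_status("hep-farm.uta.edu"))
```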

GEM – the future of the future: Globus Enabled McFarm
So far the three components (submitter, bookkeeper and McView) are meant to be used in a semi-automatic way by an operator. The operator does the scheduling, based on what he sees in the McView table, and submits jobs to a given farm.
GEM is going to connect those three packages and make everything automatic. The user will only have to submit a run, and the rest will be left to GEM.
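A rough sketch of the kind of automatic scheduling loop GEM could run on top of the three components above; the callback names standing in for the submitter, bookkeeper, and McView interfaces are placeholders, not the real GEM API.

```python
import time

def pick_farm(farm_status):
    """Choose the least loaded farm: fewest waiting jobs per active node."""
    def load(farm):
        s = farm_status[farm]
        return float(s["waiting"]) / max(s["active_nodes"], 1)
    return min(farm_status, key=load)

def gem_loop(pending_requests, get_status, submit, record):
    """Placeholder GEM main loop: schedule each request automatically.

    get_status()      -> dict of per-farm job counts, as provided by McView
    submit(farm, req) -> hand the request to the submitter for that farm
    record(farm, req) -> tell the bookkeeper where the request went
    """
    while pending_requests:
        status = get_status()
        request = pending_requests.pop(0)
        farm = pick_farm(status)
        submit(farm, request)
        record(farm, request)
        time.sleep(60)  # re-check farm status before the next request
```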

Life is hard: known problems
Sites tend to have various versions of the executables, so bookkeeping of "which site has which version" is needed. (A potential nightmare.)
Possible solution (?): send the executables with the job.
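One way the "send the executables with the job" idea could be prototyped is staging a tarball to the remote farm over GridFTP before submission; the host, paths, and tarball name below are placeholders.

```python
import subprocess

def stage_executables(tarball, farm_gridftp_host, remote_dir):
    """Copy a tarball of executables to the remote farm over GridFTP.

    Uses the standard globus-url-copy client; assumes a valid grid proxy
    and a GridFTP server running on the farm's file server.
    """
    src = "file://" + tarball                               # local absolute path
    dst = "gsiftp://%s%s/" % (farm_gridftp_host, remote_dir)
    return subprocess.call(["globus-url-copy", src, dst])

# stage_executables("/tmp/mcfarm_executables.tar.gz",
#                   "hep-farm.uta.edu", "/data/staging")
```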

Related problems: not all minitars work properly; we had a hard time installing the most recent ones at OU.
We will need (sooner or later) a centralized system for automatic distribution and installation of minitars.

Another related problem: various places might have different kernels. This has not hit us yet, but it will…

Problems: various farms have different environment initialization. (For example: to run the "setup" command you have to source a different script at different sites!)
Solution: we lump the setup commands into a "setup_farm" script which hides the differences. Not an ideal solution!
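A hedged sketch of the setup_farm idea in spirit: map the site name to whichever site-specific setup script has to be sourced, so callers see a uniform entry point. The script paths and site names are invented examples, not the real setup_farm contents.

```python
import subprocess

# Hypothetical per-site environment setup scripts hidden behind setup_farm.
SITE_SETUP = {
    "uta": "/usr/local/d0/setup_d0.sh",
    "ou":  "/opt/d0/etc/setup_env.sh",
    "ltu": "/home/d0/setup.sh",
}

def run_with_site_env(site, command):
    """Run a McFarm command after sourcing that site's setup script.

    Callers never need to know which script a given farm requires.
    """
    setup = SITE_SETUP[site]
    shell_cmd = "source %s && %s" % (setup, command)
    return subprocess.call(["/bin/bash", "-c", shell_cmd])
```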

Problems: clocks are not synchronized at different farms! Result: proxy authentication can fail! Solution: use common time servers!
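A small sketch of a pre-flight clock check against a common time server, to catch the proxy-failure case early; the ntpdate client and the server name are assumptions about what is available at a site.

```python
import re
import subprocess

def clock_offset_seconds(timeserver="pool.ntp.org"):
    """Query an NTP server with 'ntpdate -q' and return the offset in seconds.

    Assumes the ntpdate client is installed; returns None if the query fails.
    """
    out = subprocess.run(["ntpdate", "-q", timeserver],
                         capture_output=True, text=True)
    match = re.search(r"offset ([-0-9.]+)", out.stdout)
    return float(match.group(1)) if match else None

# Proxy checks can fail if clocks drift too far apart, so warn early:
# offset = clock_offset_seconds()
# if offset is not None and abs(offset) > 60:
#     print("Clock skew too large; fix NTP before using Globus!")
```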

Problems: MDS information providers need the root password to be installed. A possible maintenance nightmare! Solution: anybody know one?
Related: it would be good if the "superserver" farm had the right to restart Globus daemons on the "slave" farms. (I doubt farm administrators will like this idea…)

Problems: as mc_runjob evolves it constantly needs to be patched. This patching should be made more automatic!

Conclusion:
We do have a cluster of 3 McFarm sites (2 at UTA and 1 at OU) running and controlled via Globus. The first step towards the Grid has been made!
We expect to have LTU ready soon (a few days, maybe even hours…); Tata should follow soon after.
More institutions are invited to join…