Integration of Physics Computing on GRID
S. Hou, T.L. Hsieh, P.K. Teng, Academia Sinica
4 March 2009


Introduction
Integration of computing via GRID raises issues of system privacy and integrity, CA access on a common OS, and common services for larger user pools.
High-Performance Computing for High-Energy Physics: late binding from local clusters toward the GRID.
PacCAF, the Pacific CDF Analysis Farm: a GRID distributed computing model.
The Common Analysis Farm for physics and e-science: uses GRID CA and the Parrot service.

High-Performance Computing (HPC) for High-Energy Physics (HEP)
1. HEP users run serial, I/O-intensive computing jobs
2. coding in F77, C, C++; flexible on OS and hardware
3. explosive growth in CPU and I/O usage: submit jobs to the GRID connecting clusters
The CDF experiment users are surfing the wave of the GRID; this is our discussion case for integration.

HEP computing: the evolution
The user access and computing model evolves with hardware and network technology.
Hardware architecture:
1. in the old days of DEC and VM: terminal login, mainframe computing
2. workstations (Sun, DEC, HP, SGI, ...): xterm login, local cluster computing
3. Linux Pentium clusters: xterm and web access
4. GRID and integration of clusters: xterm and web access via CA
Network access:
1. telnet, ftp (transparent): many users, site-manager control
2. ssh, scp (encrypted): few users, site-manager control
3. Kerberos authentication: hundreds of users, DB control
4. CA trust: thousands of users, VO control
HEP user computing has evolved from mainframes and Unix workstations to the GRID.

CDF computing patterns
Computing for the Tevatron CDF experiment is an example of a wide variety of user patterns.
Home institutes:
1. desktops plus small clusters with limited system support
2. Fortran/C/C++ coding, PAW/ROOT graphical analysis
3. Unix-based serial programming, interactive/batch jobs with limited system support for CPU and storage
4. network connection to the center for data and DB
Experiment site:
1. data acquisition at the petabyte scale, needing Computer Center support
2. data production, mass storage, a customized computing model, DB management of the data
3. distribution of data and DB to hundreds of users
Users are flexible and willing to learn; users are never satisfied.

Distributed computing for CDF
1. computing was on an on-site mainframe
2. evolved into Intel-based Linux PC clusters: the Central Analysis Farm (CAF)
3. distributed computing with Kerberos login: decentralized CAFs in Asia, Europe and the US
4. a late-binding solution on the GRID: a Condor-G based Glide-CAF, our model for integration

The CAF portal for distributed computing
CAF, the CDF Analysis Farm, was launched in 2002 on dedicated Linux clusters using the FBSNG batch system. It is a portal through which users access everything: xterm login to software, data handling, batch and monitoring.
[Diagram: CAF headnode (Submitter, Monitor, Condor, HTTPd), SAM data handling, Enstore tape library, Pentium workers, CAFmon]

Use of Condor in CAF (local batch)
CAF moved to Condor in 2004, departing from traditional dedicated resources.
Schedd: manages user jobs
Negotiator: assigns nodes to jobs
Collector: gathers information from the other daemons
Startd: manages jobs on a worker node (WN)
[Diagram: the CAF Submitter with Schedd, Negotiator and Collector on the headnode; Startd and CafExe running the user jobs on the worker nodes]
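As a minimal sketch (not the actual CAF configuration), this split of daemon roles maps onto standard Condor configuration files; the hostname is hypothetical:

  # condor_config.local on the CAF headnode (submit host and central manager)
  CONDOR_HOST = caf-head.example.org              # hypothetical hostname
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

  # condor_config.local on each worker node
  CONDOR_HOST = caf-head.example.org
  DAEMON_LIST = MASTER, STARTD                    # the Startd advertises the WN and runs the jobs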

What users do on CAF
The user desktop is installed with the CAF client commands.
1. the user prepares a working directory
2. prepares a run script (the steps to be executed on a WN)
3. archives the full directory with the "tar" command; the tarball is sent to the WN with $ CAF_submit tarball
The worker nodes execute the user's run script:
1. I/O is done with a Kerberos ticket; output files may be "scp"-ed elsewhere
2. when a WN section finishes, a tar of the working scratch area is sent back to the submit node and a notification is sent to the user
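A minimal sketch of this workflow from the user desktop, assuming the CAF client is installed; the analysis program and the CAF_submit options are illustrative, not the documented interface:

  $ mkdir myjob && cd myjob                # 1. prepare the working directory
  $ cat > run.sh << 'EOF'                  # 2. the run script executed on the WN
  #!/bin/sh
  ./myanalysis input.dat > output.log      #    hypothetical analysis program
  EOF
  $ chmod +x run.sh
  $ cd .. && tar czf myjob.tgz myjob       # 3. archive the directory into a tarball
  $ CAF_submit myjob.tgz                   #    send the tarball to the farm (options omitted)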

Interactive monitoring by user
[Diagram: the CAF monitor reaches the user jobs on the WN through the CafExe wrapper via Condor CoD, outgoing connections only]
The CAF client on the user desktop provides command-line tools that reach the WN via Condor CoD:
$ list jobs
$ pstree of the processes in a section
$ list the WN working directory
$ tail files on the WN
$ debug a process
One cannot predict how users will mess up; interactive monitoring is everyone's demand.

Web monitoring of all jobs
A CAFmon account on the CAF headnode fetches the job status into an HTTP web display served by the CAF HTTPd.

Interactive Web monitoring
Displaying each running section and its processes is very powerful for debugging system and software problems.

Integration via GRID: connecting private clusters
1. do not grow/merge the clusters
2. create a dispatch center
3. add gatekeepers at the affiliated GRID sites; with CA authentication, send jobs over the GRID
The late-binding solution: Condor-G on Globus.
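A minimal sketch of how the dispatch center can send a job to an affiliated gatekeeper with Condor-G; the GRAM endpoint, proxy path and file names are hypothetical:

  # gridjob.sub - a Condor-G submit description (illustrative)
  universe             = grid
  grid_resource        = gt2 gatekeeper.example.org/jobmanager-condor
  executable           = run.sh
  transfer_input_files = myjob.tgz
  x509userproxy        = /tmp/x509up_u1000     # CA-authenticated grid proxy
  output = job.out
  error  = job.err
  log    = job.log
  queue

Submitted with $ condor_submit gridjob.sub; Condor-G then speaks GRAM to the remote gatekeeper.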

CAF integration via Condor Glide-in
We can no longer demand dedicated resources: computing moves to integration and sharing of GRID sites. The CAF hunts for CPU by gliding in to GRID pools.
[Diagram: the CAF Submitter, Schedd and Collector with the Glidekeeper on the headnode; glideins submitted through Globus into the batch queues of the GRID pools run the user jobs; the HTTPd serves the tarballs]
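A minimal sketch of the glide-in idea under the same assumptions: the payload sent through Globus is not the user job itself but a script that starts a Condor startd, which joins the CAF pool and then pulls user jobs:

  #!/bin/sh
  # glidein.sh - started on a GRID worker node by the Glidekeeper (illustrative)
  export CONDOR_CONFIG=$PWD/glidein_condor_config
  cat > "$CONDOR_CONFIG" << 'EOF'
  CONDOR_HOST = caf-head.example.org       # hypothetical CAF collector
  DAEMON_LIST = MASTER, STARTD
  STARTD_NOCLAIM_SHUTDOWN = 1200           # give the slot back if no user job arrives
  EOF
  ./condor_master -f                       # binaries unpacked from the tarball served by the HTTPd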

The Pacific CAF (PacCAF) for CDF
Peak performance: 1k jobs.
Joint GRID sites:
- IPAS_OSG, IPAS_CS, TW-LCG2, NCHC
- JP-TsuKUBA
- KR-KISTI-HEP

Integration of computing facilities is feasible thanks to:
1. a common OS: Linux on Intel-based PC clusters
2. GNU-based compilers: g77, gcc
3. the GRID connection, which provides
- hardware integrity
- common service software and development
- security via CA access
- user groups via VO management

Migrating to GRID
Physics computing at Academia Sinica was upgraded to Condor and GRID services and merged into two large clusters for serial/parallel tasks.
[Diagram: OSG and LCG gatekeepers, the IPAS and HEP Condor schedulers, and the IPAS-login/HEP-login interfaces; 400 CPU nodes with 100 TBytes of storage, 600 CPU nodes in blade systems, Gbit connectivity to abroad]

Common Analysis Facility
CAF becomes the Common Analysis Facility: integration at Academia Sinica on a common platform using 1. the Linux OS and 2. the GCC language toolchain.
- Nuclear/HEP demands custom software and data handling
- Complex-systems/Bio-physics demands parallel computing
[Diagram: Glide-CAF headnode (Submitter, Monitor, HTTPd, Glidekeeper), dedicated file servers (Cdf-disk, Bio-disk) and software areas (Cdf-soft, bio-soft), dedicated user interfaces, and separate Serial and Parallel Condor pools]
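A minimal sketch of how the two Condor batches differ at submission time, using standard Condor universes; the executables and counts are illustrative:

  # serial (Nuclear/HEP) job
  universe   = vanilla
  executable = analysis.sh
  queue 100

  # parallel (complex-systems/bio-physics) job, e.g. an MPI run
  universe      = parallel
  executable    = mpi_wrapper.sh
  machine_count = 16
  queue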

How to build an integrated CAF
1. CAF center: build the headnode
2. Joint cluster: build the GRID gatekeeper (optional HTTP proxy, GCB)
3. User interface: build the Globus client, /usr/local/bin/caf_submit, and the software distribution
[Diagram: Glide-CAF headnode (Submitter, Monitor, HTTPd, Glidekeeper), GRID site with Globus gatekeeper, GCB broker and HTTP proxy, and the user interface with its own software area]
The load is on the system managers for switching to GRID tools.
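As a rough illustration only (not the real CAF client), /usr/local/bin/caf_submit on a user interface could be little more than a wrapper that writes a grid-universe submit file and calls condor_submit; caf_wrapper.sh and the gatekeeper address are hypothetical:

  #!/bin/sh
  # caf_submit <tarball> - illustrative wrapper, not the actual CAF client
  TARBALL=$1
  cat > /tmp/caf_job.$$ << EOF
  universe             = grid
  grid_resource        = gt2 gatekeeper.example.org/jobmanager-condor
  executable           = caf_wrapper.sh
  transfer_input_files = $TARBALL
  output = caf.out
  error  = caf.err
  log    = caf.log
  queue
  EOF
  condor_submit /tmp/caf_job.$$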

Users surfing the integrated CAF
Users:
- replace your accustomed batch submission with $ caf_submit
- enjoy (or be scared by) the surges of running jobs
- watch the Web monitoring, kill broken jobs, clear network clogging
Buckle up for sudden surges of CPU and network load.

Prototype CAF service
CAF for physics computing (two Condor batches):
1. Serial computing for HEP, Nuclear and experiment users; demand: intensive network and data I/O
- 2 gatekeepers (for OSG and LCG), OSG to NCHC
- 1U workers with 2/4/8/16 cores, 400 CPUs in total
- 3 user-interface flavors
- 200 TB of storage on 11 servers
2. Parallel computing for nano-science and complex systems; demand: a local, fast backplane network to the CPU nodes
- 7 blade crates, 640 CPUs
- 3 user-interface flavors
The CAF for integration is a mature technology. We are seeking expansion through collaboration with NCHC GRID, AS CITI and CC nationwide in Taiwan and abroad.

Web-based job submission
To make it even easier for users, an NCHC Web submission interface is being developed:
1. use Web access as the user interface
2. provide templates of GRID jobs: upload/download and clicking on the GRID

Scaling issues
All systems have a limit.
1. Network:
- a single hard disk sustains ~40 MB/s
- a gigabit port (~125 MB/s) supports about 2 such disk-speed streams
- an effective file-system load is a few tens of jobs
2. System:
- Condor tolerates a few thousand sections
- HTTP tolerates a few thousand queries
Polite user pattern: submit slowly, a few hundred jobs at once, and watch the return network and file-server load. Prevent CPUs from waiting for file transfers.

Summary
1. on demand for HPC, we built a distributed CAF for data processing of the CDF experiment
2. a late binding on the GRID integrates the Pacific CAFs into one service
3. the prototype Common Analysis Farm is constructed for physics computing, serving I/O-intensive serial jobs and CPU-intensive parallel computing