
1 Our Work at CERN Gang CHEN, Yaodong CHENG Computing center, IHEP November 2, 2004

2 Outline
Conclusion of CHEP04
–Computing fabrics
New developments of CASTOR
Storage Resource Manager
Grid File System
–GGF GFS-WG, GFAL
ELFms
–Quattor, LEMON, LEAF
Others:
–AFS, wireless network, Oracle, Condor, SLC, InDiCo, Lyon visit, CERN Open Day (Oct. 16)

3 CHEP04

4 Conclusion of CHEP04
CHEP04 ran from Sept. 26 to Oct. 1
Plenary conference every morning
Seven parallel sessions each afternoon:
–Online computing
–Event processing
–Core software
–Distributed computing services
–Distributed computing systems and experiences
–Computing fabrics
–Wide area networking
Documents: www.chep2004.org
Our presentations (one talk per person): two on Sep. 27 and one on Sep. 30

5 Computing fabrics
–Computing nodes, disk servers, tape servers and network bandwidth at different HEP institutes
–Fabrics at Tier0, Tier1 and Tier2
–Installation, configuration, maintenance and management of large Linux farms
–Grid software installation
–Monitoring of computing fabrics
–OS choice: move to RHES3/Scientific Linux
–Storage observations

6 Storage stack
[Diagram: the storage stack by layer. Expose to WAN: SRM, StoRM, SRB, gfarm; Expose to LAN: NFS v2/v3, Lustre, GoogleFS, Chimera, PNFS, dCache, PVFS, CASTOR, SRB, gfarm, StoRM; Local network: 1Gb Ethernet, 10Gb Ethernet, Infiniband; File systems: GPFS, XFS, ext2/3, SAN FS; Disk organisation: HW RAID 5, HW RAID 1, SW RAID 5, SW RAID 0; Disks: FibreChannel/SATA SAN, EIDE/SATA in a box, SATA array direct connect, iSCSI; Tape store: JASMine, dCache/TSM, HPSS, CASTOR, ENSTORE]

7 Storage observations
–CASTOR and dCache are in full growth, with growing numbers of adopters outside the development sites
–SRM supports all major managers
–SRB at Belle (KEK)
–Sites are not always going for the largest disks (the capacity driver); some already choose smaller disks for performance, a key issue for LHC
–Cluster file system comparisons: software-based solutions allow hardware reuse

8 Architecture Choice
–64 bits are coming soon and HEP is not really ready for it
–Infiniband for HPC: low latency, high bandwidth (>700MB/s for CASTOR/RFIO)
–Balance of CPU to disk resources
–Security issues: which servers are exposed to users or the WAN?
–High-performance data access and computing support: Gfarm file system (Japan)

9 New CASTOR developments

10 CASTOR Current Status
Usage at CERN:
–370 disk servers, 50 stagers (disk pool managers)
–90 tape drives, more than 3PB in total
–Development team of 5, operations team of 4
Associated problems:
–Management is more and more difficult
–Performance
–Scalability
–I/O request scheduling
–Optimal use of resources

11 Challenge for CASTOR
LHC is a big challenge: a single stager should scale up to handle peak rates of 500 to 1000 requests per second
Expected system configuration:
–4PB of disk cache, 10PB stored on tape per year
–Tens of millions of disk-resident files
–Peak rate of 4GB/s from online
–10,000 disks, 150 tape drives
–Increasing numbers of small files
The current CASTOR stager cannot do this

12 Vision
With clusters of hundreds of disk and tape servers, automated storage management faces more and more of the same problems as CPU cluster management:
–(Storage) resource management
–(Storage) resource sharing
–(Storage) request scheduling
–Configuration
–Monitoring
The stager is the main gateway to all resources managed by CASTOR
Vision: a Storage Resource Sharing Facility

13 Ideas behind the new stager
Pluggable framework rather than total solution:
–True request scheduling: third-party schedulers, e.g. Maui or LSF
–Policy attributes: externalize the policy engines governing resource matchmaking; move toward full-fledged policy languages such as GUILE
Restricted access to storage resources:
–All requests are scheduled
–No random rfiod eating up resources behind the back of the scheduling system
Database-centric architecture:
–Stateless components: all transactions and locking are provided by the DB system
–Allows components to be stopped and restarted easily
–Facilitates development and debugging
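
To make the database-centric, stateless-component idea concrete, here is a minimal sketch of a worker that claims one pending request inside a database transaction, so that all locking lives in the DB and the component itself can be stopped and restarted at any time. The table and column names (requests, id, status) and the use of the MySQL C API are assumptions for illustration, not the actual CASTOR schema.

```c
/* Illustrative only: claim one pending request from a hypothetical
 * "requests" table.  Row locking is done by the database, so this
 * component keeps no state of its own.
 * Build (roughly): gcc worker.c -lmysqlclient
 */
#include <stdio.h>
#include <mysql/mysql.h>

int claim_one_request(MYSQL *db)
{
    MYSQL_RES *res;
    MYSQL_ROW row;
    char sql[128];

    if (mysql_query(db, "START TRANSACTION"))
        return -1;

    /* Row-level lock provided by the DB, not by the component */
    if (mysql_query(db,
        "SELECT id FROM requests WHERE status = 'PENDING' "
        "ORDER BY id LIMIT 1 FOR UPDATE"))
        goto rollback;

    res = mysql_store_result(db);
    row = res ? mysql_fetch_row(res) : NULL;
    if (!row) {                       /* nothing pending */
        if (res) mysql_free_result(res);
        mysql_query(db, "ROLLBACK");
        return 0;
    }

    snprintf(sql, sizeof(sql),
             "UPDATE requests SET status = 'SCHEDULED' WHERE id = %s", row[0]);
    mysql_free_result(res);

    if (mysql_query(db, sql))
        goto rollback;

    mysql_query(db, "COMMIT");
    return 1;                         /* one request claimed */

rollback:
    mysql_query(db, "ROLLBACK");
    return -1;
}
```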

14 New Stager Architecture
[Diagram: the application talks to the Request Handler through the RFIO/stage API; requests are stored in the request repository and file catalogue (Oracle or MySQL); the Master Stager is notified via UDP and schedules jobs through LSF or Maui; rfiod (the disk mover) serves data from the disk cache over TCP, with control also over TCP; migration and recall use the CASTOR tape archive components (VDQM, VMGR, RTCOPY)]

15 Architecture: request handling and scheduling
[Diagram: a typical file request (e.g. read /castor/cern.ch/user/c/castor/TastyTrees, DN=castor) enters the RequestHandler thread pool, is authenticated against the fabric authentication service (e.g. a Kerberos V server) and stored in the request repository (Oracle or MySQL); the Scheduler applies scheduling policies (e.g. user "castor" has priority) using the disk server load and the catalogue (is the file staged?), and the Job Dispatcher runs the request on a disk server such as pub003d]
Request registration must keep up with high request-rate peaks; request scheduling only has to keep up with average request rates (see the sketch below).
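
A simple way to picture that separation is a bounded producer/consumer queue: many handler threads register requests quickly, while a single scheduler thread drains them at its own pace. The queue size and the integer stand-in for a request record are assumptions of this sketch, not part of the CASTOR design.

```c
/* Sketch: decouple request registration (fast, bursty) from
 * scheduling (average rate) with a bounded queue. */
#include <pthread.h>

#define QSIZE 1024

static int queue[QSIZE];            /* stand-in for request records */
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Called by request-handler threads: must return quickly even at peaks */
void register_request(int req)
{
    pthread_mutex_lock(&lock);
    while (count == QSIZE)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = req;
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Called by the scheduler thread: only has to keep up with the average */
int next_request(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int req = queue[head];
    head = (head + 1) % QSIZE;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return req;
}
```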

16 Security
–Implementing strong authentication (encryption is not planned for the moment)
–Developed a plugin system based on the GSSAPI so as to use the GSI and KRB5 mechanisms (a GSSAPI sketch follows below), with KRB4 supported for backward compatibility
–Modifying various CASTOR components to integrate the security layer
–Impact on the configuration of machines (need for service keys, etc.)
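
For readers unfamiliar with the GSSAPI, the sketch below shows the client-side context-establishment loop that such a plugin would wrap, independent of whether the underlying mechanism is GSI or Kerberos 5. The service principal name and the omitted token transport are placeholders; this is not CASTOR code.

```c
/* Sketch of client-side GSSAPI context establishment.  The token
 * exchange with the server (send out_tok, receive the reply into
 * in_tok) is omitted.  Link (roughly): gcc client.c -lgssapi_krb5 */
#include <string.h>
#include <gssapi/gssapi.h>

int establish_context(void)
{
    OM_uint32 maj, min;
    gss_name_t target = GSS_C_NO_NAME;
    gss_ctx_id_t ctx = GSS_C_NO_CONTEXT;
    gss_buffer_desc name_buf, in_tok = GSS_C_EMPTY_BUFFER, out_tok;

    /* Placeholder service principal for the CASTOR daemon host */
    name_buf.value  = "castor@server.example.org";
    name_buf.length = strlen(name_buf.value);

    maj = gss_import_name(&min, &name_buf,
                          GSS_C_NT_HOSTBASED_SERVICE, &target);
    if (GSS_ERROR(maj))
        return -1;

    do {
        maj = gss_init_sec_context(&min,
                                   GSS_C_NO_CREDENTIAL,  /* default creds */
                                   &ctx, target,
                                   GSS_C_NO_OID,         /* default mech  */
                                   GSS_C_MUTUAL_FLAG, 0,
                                   GSS_C_NO_CHANNEL_BINDINGS,
                                   &in_tok, NULL, &out_tok, NULL, NULL);

        if (out_tok.length > 0) {
            /* ... send out_tok.value to the server here ... */
            gss_release_buffer(&min, &out_tok);
        }
        /* ... and read the server's reply into in_tok before looping */
    } while (maj == GSS_S_CONTINUE_NEEDED);

    gss_release_name(&min, &target);
    return GSS_ERROR(maj) ? -1 : 0;
}
```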

17 CASTOR GUI Client
–A prototype was developed by LIU Aigui on the Kylix 3 platform
–If possible, it will be made downloadable from the CASTOR web site
–Many problems still exist and it needs to be optimized
–Functionality and performance tests are necessary

18 Storage Resource Manager

19 Introduction to SRM
SRMs are middleware components that manage shared storage resources on the Grid and provide:
●Uniform access to heterogeneous storage
●Protocol negotiation
●Dynamic transfer URL allocation
●Access to permanent and temporary types of storage
●Advanced space and file reservation
●Reliable transfer services
Storage resource managers include:
●DRM: disk resource managers
●TRM: tape resource managers
●HRM: hierarchical resource managers

20 SRM Collaboration
–Jefferson Lab: Bryan Hess, Andy Kowalski, Chip Watson
–Fermilab: Don Petravick, Timur Perelmutov
–LBNL: Arie Shoshani, Alex Sim, Junmin Gu
–EU DataGrid WP2: Peter Kunszt, Heinz Stockinger, Kurt Stockinger, Erwin Laure
–EU DataGrid WP5: Jean-Philippe Baud, Stefano Occhetti, Jens Jensen, Emil Knezo, Owen Synge

21 SRM versions
Two SRM interface specifications:
–SRM v1.1 provides data access/transfer and implicit space reservation
–SRM v2.1 adds explicit space reservation, namespace discovery and manipulation, and access permission manipulation
Fermilab SRM implements the SRM v1.1 specification; SRM v2.1 is expected by the end of 2004
Reference: http://sdm.lbl.gov/srm-wg

22 High Level View of SRM
[Diagram: users/applications go through Grid middleware and a client to SRM interfaces sitting in front of Enstore, JASMine, dCache and CASTOR]

23 Role of SRM on the Grid
[Diagram: data flow between CERN Tier 0 (SRM/CASTOR), FNAL Tier 1 (SRM/dCache with Enstore), a Tier 2 centre (SRM/CASTOR), the Replica Manager/Replica Catalog, and SRM clients]
1.Data creation
2.SRM-PUT into the Tier 0 cache
3.Register in the Replica Catalog (via RRS)
4.SRM-COPY Tier 0 to Tier 1 (network transfer of data)
5.SRM-GET
6.GridFTP ERET (pull mode); archive files in Enstore, stage files on request
7.SRM-COPY Tier 1 to Tier 2 (network transfer)
8.SRM-PUT
9.GridFTP ESTO (push mode)
10.SRM-GET: users retrieve data for analysis through an SRM client

24 Main Advantages of using SRM
–Provides smooth synchronization between shared resources
–Eliminates unnecessary burden from the client
–Insulates clients from storage system failures
–Transparently deals with network failures
–Enhances the efficiency of the grid by sharing files and eliminating unnecessary file transfers
–Provides a "streaming model" to the client

25 Grid File System

26 Introduction
–Grids can hold many hundreds of petabytes of data, a very large percentage of which is stored in files
–A standard mechanism to describe and organize file-based data is essential for facilitating access to this large amount of data
–GGF GFS-WG
–GFAL: Grid File Access Library

27 GGF GFS-WG
Global Grid Forum, Grid File System Working Group
Two goals (two documents):
–File System Directory Services: manages the namespace for files, access control, and metadata management
–Architecture for Grid File System Services: provides the functionality of a virtual file system in a grid environment, facilitates federation and sharing of virtualized data, and uses the File System Directory Services and standard access protocols
Both documents will be submitted at GGF13 and GGF14 (2005)

28 GFS view
Transparent access to dispersed file data in a Grid:
–POSIX I/O APIs
–Applications can access the Gfarm file system without any modification, as if it were mounted at /gfs
–Automatic and transparent replica selection for fault tolerance and avoidance of access concentration
[Diagram: a virtual directory tree under /gfs is mapped by file system metadata onto file replicas (file1, file2, file3, file4) held at different sites (ggf, CN, aist, gtrc), with file replica creation]

29 GFAL
Grid File Access Library
Grid storage interactions today require using several existing software components:
–The replica catalog services, to locate valid replicas of files
–The SRM software, to ensure that files exist on disk or that space is allocated on disk for new files
GFAL hides these interactions and presents a POSIX interface for the I/O operations
The currently supported protocols are: file for local access, dcap (dCache access protocol) and rfio (CASTOR access protocol)

30 Compile and Link
The function names are obtained by prepending gfal_ to the POSIX names, for example gfal_open, gfal_read, gfal_close… The argument lists and the values returned by the functions are identical.
–The header file gfal_api.h needs to be included in the application source code
–Link with libGFAL.so
–Security libraries: libcgsi_plugin_gsoap_2.3, libglobus_gssapi_gsi_gcc32dbg and libglobus_gss_assist_gcc32dbg are used internally
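
Putting the points above together, a minimal GFAL client might look like the sketch below. The grid file path and the library/include paths on the compile line are placeholders, only calls named on this slide (gfal_open, gfal_read, gfal_close) are used, and error handling is kept to a minimum.

```c
/* Minimal GFAL usage sketch.
 * Compile roughly as:
 *   gcc gfal_demo.c -o gfal_demo -I$GFAL_LOCATION/include \
 *       -L$GFAL_LOCATION/lib -lGFAL
 * (paths are placeholders for wherever GFAL is installed)
 */
#include <fcntl.h>
#include <stdio.h>
#include "gfal_api.h"   /* declares gfal_open, gfal_read, gfal_close, ... */

int main(void)
{
    char buf[4096];
    int fd, n;
    long total = 0;

    /* Same argument list and return value as POSIX open(2);
     * the path below is a placeholder grid file name */
    fd = gfal_open("/grid/myvo/somedir/somefile", O_RDONLY, 0);
    if (fd < 0) {
        perror("gfal_open");
        return 1;
    }

    /* Same semantics as read(2) */
    while ((n = gfal_read(fd, buf, sizeof(buf))) > 0)
        total += n;

    gfal_close(fd);
    printf("read %ld bytes\n", total);
    return 0;
}
```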

31 Basic Design
[Diagram: physics applications issue POSIX I/O calls (open, read, …) through the GFAL VFS layer, which combines a replica catalog client, an SRM client, local file I/O, root I/O, rfio I/O and dCap I/O; these in turn talk to the RC services, SRM services, RFIO services, dCap services and MSS services, covering both local disk and wide-area access]

32 File system implementation
Two options have been considered to offer a file system view, i.e. a way to run standard applications without modifying the source and without re-linking:
–The Pluggable File System (PFS), built on top of "Bypass" and developed by the University of Wisconsin
–The Linux Userland File System (LUFS)
File system view: /grid/{vo}/…
CASTORfs, based on LUFS:
–I developed it
–Available, but with low efficiency

33 Extremely Large Fabric management system

34 ELFms
ELFms: Extremely Large Fabric management system
Subsystems:
–QUATTOR: system installation and configuration tool suite
–LEMON: monitoring framework
–LEAF: hardware and state management

35 Deployment at CERN
ELFms manages and controls most of the nodes in the CERN CC:
–~2100 nodes out of ~2400, to be scaled up to >8000 in 2006-08 (LHC)
–Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)
–Heterogeneous hardware (CPU, memory, HD size, …)
–Linux (RH) and Solaris (9)

36 Quattor
Quattor takes care of the configuration, installation and management of fabric nodes
A Configuration Database holds the 'desired state' of all fabric elements:
–Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, …)
–Cluster (name and type, batch system, load-balancing info, …)
–Defined in templates arranged in hierarchies; common properties are set only once
Autonomous management agents run on each node for:
–Base installation
–Service (re-)configuration
–Software installation and management
Quattor was developed in the scope of EU DataGrid; development and maintenance are now coordinated by CERN/IT

37 Quattor Architecture
Configuration management:
–Configuration Database
–Configuration access and caching
–Graphical and command-line interfaces
Node and cluster management:
–Automated node installation
–Node configuration management
–Software distribution and management

38 LEMON
Monitoring sensors and agent:
–Large number of metrics (~10 sensors implementing 150 metrics)
–Plug-in architecture: new sensors and metrics can easily be added
–Asynchronous push/pull protocol between sensors and agent
–Available for Linux and Solaris
Repository:
–Data insertion via TCP or UDP (a push sketch follows below)
–Data retrieval via SOAP
–Backend implementations for text files and Oracle SQL
–Keeps current and historical samples; no aging-out of data, but archiving to TSM and CASTOR
Correlation engines and 'self-healing' fault recovery:
–Allow plug-in correlations that access collected metrics and external information (e.g. quattor CDB, LSF), and can also launch configured recovery actions
–E.g. average number of users on LXPLUS, total number of active LCG batch nodes
–E.g. cleaning up /tmp if occupancy > x%, restarting daemon D if dead, …
LEMON is an EDG development now maintained by CERN/IT
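
The slides only say that samples are pushed to the repository over TCP or UDP; they do not give the wire format. The sketch below therefore shows a generic UDP push of one sample, with the repository host, port and the one-line sample format all being assumptions for illustration.

```c
/* Illustrative UDP push of one monitoring sample to a repository.
 * Host name, port and sample format are assumptions of this sketch,
 * not the actual LEMON protocol. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int push_sample(const char *host, int port, const char *sample)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;

    struct hostent *he = gethostbyname(host);
    if (!he) {
        close(s);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    ssize_t n = sendto(s, sample, strlen(sample), 0,
                       (struct sockaddr *)&addr, sizeof(addr));
    close(s);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    /* hypothetical sample: node, metric id, timestamp, value */
    return push_sample("lemon-repository.example.org", 12409,
                       "lxb0001 4101 1099382400 0.73\n");
}
```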

39 LEMON Architecture LEMON stands for “LHC Era Monitoring”

40 LEAF (LHC Era Automated Fabric)
Collection of workflows for automated node hardware and state management
HMS (Hardware Management System):
–E.g. installation, moves, vendor calls, retirement
–Automatically requests installs, retirements etc. from technicians
–GUI to locate equipment physically
SMS (State Management System):
–Automated handling of high-level configuration steps, e.g. reconfigure, reboot, reallocate nodes
–Extensible framework: plug-ins for site-specific operations are possible
–Issues all necessary (re)configuration commands on top of quattor CDB and NCM
HMS and SMS interface to Quattor and LEMON for setting and getting node information respectively

41 LEAF screenshot

42 Other Activities
AFS:
–AFS documentation download
–AFS DB server configuration
Wireless network deployment
Oracle license for LCG
Condor deployment at some HEP institutes
SLC: Scientific Linux CERN version
Lyon visit (Oct. 27, CHEN Gang)
CERN Open Day (Oct. 16)

43

44 Thank you!!


Download ppt "Our Work at CERN Gang CHEN, Yaodong CHENG Computing center, IHEP November 2, 2004."

Similar presentations


Ads by Google