1 Report on CHEP 2007 Raja Nandakumar
2 Synopsis
Two classes of talks and posters
➟ Computer hardware
  ▓ Dominated by cooling / power consumption
  ▓ Mostly in the plenary sessions
➟ Software
  ▓ Grid job workload management systems
    Job submission by the experiments
    Site job handling, monitoring
    Grid operations (Monte Carlo production, glexec, interoperability, …)
    Data integrity checking, …
  ▓ Storage systems
    Primarily concerning dCache and DPM
    Distributed storage systems
Parallel session : Grid middleware and tools
3 Computing hardware
Power requirements of LHC computing
➟ Important for running costs
  ▓ ~330 W to provision for 100 W of electronics (see the breakdown and quick check below)
➟ Some sites running with air or water cooled racks
Power budget per 100 W of useful electronics:
  Electronics               100 W
  Server fans                13 W
  Voltage regulation         22 W
  Case power supply          48 W
  Room power distribution     4 W
  UPS                        18 W
  Room cooling              125 W
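The slide's headline figure can be verified directly: the components in the table sum to 330 W, i.e. an overhead factor of roughly 3.3 on the useful 100 W of electronics. A minimal check, using only the values quoted above:

    # Power overhead per 100 W of useful electronics (values from the slide above).
    budget_w = {
        "Electronics": 100,
        "Server fans": 13,
        "Voltage regulation": 22,
        "Case power supply": 48,
        "Room power distribution": 4,
        "UPS": 18,
        "Room cooling": 125,
    }

    total_w = sum(budget_w.values())                   # 330 W provisioned in total
    overhead = total_w / budget_w["Electronics"]       # ~3.3 W supplied per useful watt
    print(f"Total: {total_w} W, overhead factor: {overhead:.1f}")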
4 High performance and multi-core computing
Core frequencies ~ 2-4 GHz, will not change significantly
Power
➟ 1,000,000 cores at 25 W / core = 25 MW
  ▓ Just for the CPUs
➟ Have to bring core power down by multiple orders of magnitude
  ▓ Reduces chip frequency, complexity, capability
Memory bandwidth
➟ As we add cores to a chip, it is increasingly difficult to provide sufficient memory bandwidth
➟ Application tuning to manage memory bandwidth becomes critical
Network and I/O bandwidth, data integrity, reliability
➟ A petascale computer will have petabytes of memory
➟ Current single file servers achieve 2-4 GB/s
  ▓ 70+ hours to checkpoint 1 petabyte (see the quick check below)
➟ I/O management is a major challenge
Memory cost
➟ Can't expect to maintain current memory / core numbers at petascale
  ▓ 2 GB/core for ATLAS / CMS
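The headline numbers on this slide follow from simple arithmetic. A quick check, assuming the checkpoint is written through a single file server at the 4 GB/s upper figure quoted above:

    # Back-of-envelope checks for the petascale numbers quoted on the slide.
    cores = 1_000_000
    watts_per_core = 25
    total_power_mw = cores * watts_per_core / 1e6      # 25 MW, CPUs only

    checkpoint_bytes = 1e15                             # 1 PB of memory to dump
    server_rate = 4e9                                   # 4 GB/s, best single-server rate quoted
    hours = checkpoint_bytes / server_rate / 3600       # ~69 hours, i.e. "70+ hours"
    print(f"{total_power_mw:.0f} MW, checkpoint ~{hours:.0f} h")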
5 Grid job submission
Most new developments were on pilot-agent based grid systems
➟ Implement job scheduling based on the "pull" scheduling paradigm
➟ The only method for grid job submission in LHCb
  ▓ DIRAC (> 3 years experience)
  ▓ Ganga is the user analysis front end
➟ Also used in Alice (and Panda and Magic)
  ▓ AliEn, since 2001
➟ Used for production, user analysis and data management in LHCb & Alice
➟ New developments for others
  ▓ Panda : Atlas, Charmm (central server based on Apache)
  ▓ GlideIn : Atlas, CMS, CDF (based on Condor)
  ▓ Used for production and analysis
➟ Very successful implementations
  ▓ Real-time view of the local environment
  ▓ Pilot agents can have some intelligence built into the system, useful for heterogeneous computing environments
  ▓ Recently : Panda to be used for all Atlas production
One talk on distributed batch systems
6 Pilot agents
Pilot agents submitted on demand
➟ Reserve the resource for immediate use
  ▓ Allows checking of the environment before job scheduling
  ▓ Network traffic is bidirectional, but every connection is initiated by the pilot
  ▓ Only unidirectional (outbound) connectivity needed from the worker node
➟ Terminates gracefully if no work is available
➟ Also called GlideIn-s
LCG jobs are essentially pilot jobs for the experiment (a minimal sketch of the pull loop follows below)
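The pull paradigm described above can be illustrated with a very small sketch. This is not DIRAC, PanDA or AliEn code; the task-queue URL and the JSON fields are hypothetical placeholders, and the point is only the shape of the loop: report the environment, ask for matching work, run it, or exit gracefully.

    # Hedged sketch of a pull-style pilot, with a hypothetical HTTP task queue.
    import json
    import subprocess
    import urllib.request

    TASK_QUEUE = "https://example.org/taskqueue/match"   # hypothetical central match-maker

    def environment_report():
        """Describe the worker node so the broker can pick a suitable payload."""
        import os, platform
        return {"host": platform.node(), "cpus": os.cpu_count()}

    def run_pilot():
        # The pilot initiates every connection: only outbound connectivity is needed.
        req = urllib.request.Request(
            TASK_QUEUE,
            data=json.dumps(environment_report()).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)                 # e.g. {"command": [...]} or null
        if not job:
            return                                # no work available: terminate gracefully
        subprocess.run(job["command"], check=False)   # late binding: payload chosen at run time

    if __name__ == "__main__":
        run_pilot()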
7 DIRAC WMS
8 Panda WMS
9 Alice (AliEn / MonALISA) History plot of running jobs
10 LHCb (Dirac) Max running jobs snapshot
11 Glexec
A thin layer to change Unix domain credentials based on grid identity and attribute information
Different modes of operation
➟ With or without setuid
  ▓ Ability to change the user id of the final job
Enable VO to
➟ Internally manage job scheduling and prioritisation
➟ Late binding of user jobs to pilots (sketched below)
In production at Fermilab
➟ Code ready and tested, awaiting full audit
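The late-binding idea is easy to illustrate: once a pilot has matched a user payload, it hands the payload to a credential-switching wrapper, which runs it under the local identity mapped from the user's grid proxy. The sketch below is conceptual only; the wrapper path and the GLEXEC_CLIENT_CERT variable name are assumptions, not a verified glexec interface, so consult the site's glexec documentation before relying on them.

    # Conceptual sketch of late binding via a credential-switching wrapper.
    # The wrapper path and environment variable below are assumptions, not a
    # verified glexec interface.
    import os
    import subprocess

    def run_payload_as_user(payload_cmd, user_proxy_path,
                            wrapper="/usr/sbin/glexec"):        # assumed install path
        env = dict(os.environ)
        env["GLEXEC_CLIENT_CERT"] = user_proxy_path   # assumed variable: payload owner's proxy
        # In setuid mode the wrapper switches to the mapped local account before
        # executing the payload; without setuid it can still authorise and log the mapping.
        return subprocess.run([wrapper] + payload_cmd, env=env).returncode

    # Example (hypothetical): a pilot that has matched a user job might call
    # run_payload_as_user(["/bin/sh", "run_user_job.sh"], "/tmp/x509up_u12345")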
12 LSF Universus
[Architecture diagram: a central web portal / job scheduler (LSF MultiCluster) dispatching work to remote clusters and desktops running local LSF, PBS, SGE and CCE schedulers.]
13 LSF Universus
Commercial extension of LSF
➟ Interface to multiple clusters
➟ Centralised scheduler, but sites retain local control
➟ LSF daemons installed on head nodes of remote clusters
➟ Kerberos for user, host and service authentication
➟ scp for file transfer
Currently deployed in
➟ Sandia National Labs, to link OpenPBS, PBS Pro and LSF clusters
➟ Singapore national grid, to link PBS Pro, LSF and N1GE clusters
➟ Distributed European Infrastructure for Supercomputing Applications (DEISA)
14 Grid interoperability
Many different grids
➟ WLCG, Nordugrid, Teragrid, …
➟ Experiments span the various grids
Short term solutions have to be ad hoc
➟ Maintain parallel infrastructures by the user, site or both
For the medium term, set up adaptors and translators
In the long term, adopt common standards and interfaces
➟ Important in security, information, CE, SE
➟ Most grids use the X509 standard
➟ Multiple "common" standards …
➟ GIN (Grid Interoperability Now) group working on some of this
Interfaces in use (EGEE / OSG / ARC):
  Security                  : GSI/VOMS (all)
  Storage control protocol  : SRM (all)
  Storage transfer protocol : GridFTP (all)
  Schema                    : GLUE v1.2 (EGEE), GLUE v1 (OSG), ARC schema (ARC)
  Service discovery         : LDAP/BDII (EGEE), LDAP/GIIS (OSG, ARC)
  Job submission            : GRAM (EGEE, OSG), GridFTP (ARC)
15 Distributed storage
GridPP organised into 4 regional Tier-2s in the UK
Currently a job follows data into a site
➟ Consider disk at one site as close to CPU at another site
  ▓ E.g. disk at Edinburgh vs CPU at Glasgow
➟ Pool resources for efficiency and ease of use
➟ Jobs need to access storage directly from the worker node
16 RTT between Glasgow and Edinburgh ~ 12 ms
Custom rfio client
➟ Normal : one call / read (its RTT cost is estimated below)
➟ Readbuf : fills an internal buffer to service requests
➟ Readahead : reads ahead till EOF
➟ Streaming : separate streams for control & data
Tests using a single DPM server
Atlas expects ~ 10 MiB/s / job
Better performance with a dedicated light path
Ultimately a single DPM instance to span the Glasgow and Edinburgh sites
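The cost of the round trip time to the naive client can be estimated directly: each synchronous read of block size B takes roughly RTT + B/bandwidth, which is exactly what the read-ahead and buffering modes amortise. A rough sketch; the link speed, block size and prefetch size are illustrative assumptions, only the RTT comes from the slide:

    # Rough model of remote reads over a WAN link; numbers are illustrative assumptions.
    rtt = 0.012            # round trip time in seconds (value quoted above)
    bandwidth = 1e9 / 8    # assumed 1 Gbit/s link, in bytes/s
    block = 64 * 1024      # assumed bytes fetched per rfio call

    # "Normal" client: one synchronous call per read pays the RTT every time.
    per_call = rtt + block / bandwidth
    naive_rate = block / per_call / 2**20                 # ~5 MiB/s

    # Read-ahead / buffering: the RTT is paid once per large prefetch instead.
    prefetch = 32 * 2**20                                 # assumed 32 MiB prefetch
    ahead_rate = prefetch / (rtt + prefetch / bandwidth) / 2**20   # >100 MiB/s

    print(f"one call per read: {naive_rate:.1f} MiB/s, read-ahead: {ahead_rate:.1f} MiB/s")

Under these assumptions the naive client falls well short of the ~10 MiB/s per job that Atlas expects, while read-ahead comfortably exceeds it, which is the motivation for the custom client modes listed above.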
17 Data integrity
Large number of components performing data management in an experiment
Two approaches to checking data integrity
➟ Automatic agents continuously performing checks
➟ Checks in response to special events
Different catalogues in LHCb : Bookkeeping, LFC, SE
Issues seen (a minimal check for the registration problems is sketched below) :
➟ Zero size files
➟ Missing replica information
➟ Wrong SAPath
➟ Wrong SE host
➟ Wrong protocol
  ▓ sfn, rfio, bbftp, …
➟ Mistakes in file registration
  ▓ Blank spaces in the SURL path
  ▓ Carriage returns
  ▓ Presence of a port number in the SURL path
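Most of the registration problems listed above are mechanical and can be screened for by a small agent. The sketch below only shows the per-replica sanity checks; the catalogue queries themselves (Bookkeeping, LFC, SE) are experiment specific and left abstract, and the function names are hypothetical.

    # Sketch of the per-replica checks an integrity agent could apply.
    from urllib.parse import urlparse

    KNOWN_PROTOCOLS = {"srm"}     # registered SURLs should be srm://...; sfn/rfio/bbftp are access protocols

    def surl_problems(surl, expected_se_host=None):
        """Return a list of problems found in a registered SURL string."""
        problems = []
        if surl != surl.strip() or " " in surl or "\r" in surl or "\n" in surl:
            problems.append("blank space or carriage return in SURL")
        parsed = urlparse(surl)
        if parsed.scheme not in KNOWN_PROTOCOLS:
            problems.append(f"wrong protocol: {parsed.scheme}")
        if parsed.port is not None:
            problems.append("port number present in SURL")
        if expected_se_host and parsed.hostname != expected_se_host:
            problems.append(f"wrong SE host: {parsed.hostname}")
        return problems

    def check_replica(size, surl, expected_se_host=None):
        """Combine catalogue metadata checks with the SURL sanity checks."""
        if not surl:
            return ["missing replica information"]
        issues = surl_problems(surl, expected_se_host)
        if size == 0:
            issues.append("zero size file")
        return issues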
18 Summary
Many experiments have embraced the grid
Many interesting challenges ahead
➟ Hardware
  ▓ Reduce the power consumed by CPUs
  ▓ Applications need to manage with less RAM
➟ Software
  ▓ Grid interoperability
  ▓ Security with generic pilots / glexec
  ▓ Distributed grid network
And many opportunities
➟ To test solutions to the above issues
➟ Stress test the grid infrastructure
  ▓ Get ready for data taking
  ▓ Apply the lessons in other fields
    Biomed, …
➟ Note : 1 fully digitised film = 4 PB and needs 1.25 GB/s to play