Slide 1: Grid3: an Application Grid Laboratory for Science
Rob Gardner, University of Chicago, on behalf of the Grid3 project
CHEP '04, Interlaken, September 28, 2004
Slide 2: Grid2003: an application grid laboratory
Virtual data grid laboratory; virtual data research
End-to-end HENP applications
CERN LHC: US ATLAS testbeds and data challenges
CERN LHC: US CMS testbeds and data challenges
(Diagram: these threads converge in Grid3.)
Slide 3: Grid3 at a Glance
Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT): GRAM, GridFTP, MDS, RLS, VDS
…equipped with VO and multi-VO security, monitoring, and operations services
…allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG)
US ATLAS: GriPhyN VDS execution on LCG sites
US CMS: storage element interoperability (SRM/dCache)
Delivering the US LHC Data Challenges
Slide 4: Grid3 Design
Simple approach. Sites consist of:
a computing element (CE)
a storage element (SE)
information and monitoring services
At the VO and multi-VO level:
VO information services
operations (iGOC)
Minimal use of grid-wide systems: no centralized workload manager, replica or data management catalogs, or command line interface; higher-level services are provided by the individual VOs
(Diagram: multiple sites, each with a CE and SE, connected to VO services and the iGOC.)
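To make the layering above concrete, here is a minimal sketch in Python of the site/VO structure described on this slide. It is not Grid3 code; the class names and fields are illustrative assumptions.

```python
# Minimal sketch of the Grid3 site/VO layering (illustrative only, not Grid3 code).
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    compute_element: str                 # e.g. a GRAM gatekeeper contact string
    storage_element: str                 # e.g. a GridFTP endpoint
    info_services: list[str] = field(default_factory=list)  # site-level info/monitoring

@dataclass
class VirtualOrganization:
    name: str
    info_service: str                    # VO-level information service
    sites: list[Site] = field(default_factory=list)          # sites this VO can use

# Note what is deliberately absent: no grid-wide workload manager or replica
# catalog is modeled, because in the Grid3 design those higher-level services
# are provided by each VO, not by the grid itself.
```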
Slide 5: Site Services and Installation
Goal: install and configure with minimal human intervention
Uses Pacman and distributed software "caches": %pacman -get iVDGL:Grid3
Registers the site with VO- and Grid3-level services (GIIS registration, information providers, Grid3 schema, log management)
Sets up accounts, application install areas ($app) and working directories ($tmp)
Roughly a four-hour install and validation per site
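A minimal sketch of driving the install and validation steps above. "pacman" is the iVDGL Pacman package manager named on the slide (not the Arch Linux tool); the install directory and the validation job are assumptions, not Grid3's actual procedure.

```python
# Sketch only: wrap the Pacman-based Grid3 install and a basic validation step.
import subprocess, sys

def install_grid3(install_dir="/opt/grid3"):
    # Fetch the Grid3 cache (VDT plus Grid3 configuration) into install_dir.
    subprocess.run(["pacman", "-get", "iVDGL:Grid3"], cwd=install_dir, check=True)

def validate_site(gatekeeper="localhost/jobmanager-fork"):
    # Hypothetical smoke test: run a trivial job through the local gatekeeper.
    result = subprocess.run(["globus-job-run", gatekeeper, "/bin/true"])
    return result.returncode == 0

if __name__ == "__main__":
    install_grid3()
    sys.exit(0 if validate_site() else 1)
```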
Slide 6: Multi-VO Security Model
DOEGrids Certificate Authority; PPDG or iVDGL Registration Authority
Authorization service: VOMS
Each Grid3 site generates a Globus grid-mapfile with an authenticated SOAP query to each VO service
Site-specific adjustments or mappings
Group accounts associate VOs with jobs
(Diagram: VOMS servers for the iVDGL, US ATLAS, US CMS, LSC, SDSS and BTeV VOs feed each site's Grid3 grid-mapfile.)
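The grid-mapfile generation can be pictured as below: a minimal sketch assuming the per-VO member lists have already been retrieved (on Grid3 this was an authenticated SOAP query to each VOMS server; the helper here is not the VOMS client API, and the DN and account names are hypothetical).

```python
# Sketch of assembling a Globus grid-mapfile from per-VO member lists and
# mapping each VO to a local group account (names and DNs are hypothetical).
def write_gridmap(vo_members, group_accounts, path="grid-mapfile"):
    """vo_members: {vo: [subject DN, ...]}; group_accounts: {vo: unix account}."""
    lines = []
    for vo, dns in vo_members.items():
        account = group_accounts[vo]           # whole VO maps to one group account
        for dn in dns:
            lines.append(f'"{dn}" {account}')  # standard grid-mapfile line format
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Example usage with a hypothetical DN:
write_gridmap(
    {"usatlas": ["/DC=org/DC=doegrids/OU=People/CN=Example User 12345"]},
    {"usatlas": "usatlas1"},
)
```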
Slide 7: iVDGL Operations Center (iGOC)
Co-located with the Abilene NOC (Indianapolis)
Hosts/manages multi-VO services:
top-level Ganglia and GIIS collectors
MonALISA web server and archival service
VOMS servers for iVDGL, BTeV, SDSS
Site Catalog service, Pacman caches
Trouble ticket systems: phone (24 hr), web- and email-based collection and reporting
Investigation and resolution of grid middleware problems, at the level of 30 contacts per week
Weekly operations meetings for troubleshooting
Slide 8: Grid3: a snapshot of sites, September 2004
30 sites, multi-VO shared resources
~3000 CPUs (shared)
(Map of sites with service monitoring.)
Slide 9: Grid3 Monitoring Framework
cf. M. Mambelli, B. Kim et al., #490
Slide 10: Monitors
Jobs by VO (ACDC)
Job queues (MonALISA)
Data I/O (MonALISA)
Metrics (MDViewer)
Slide 11: Use of Grid3, led by US LHC
7 scientific applications and 3 CS demonstrators
A third HEP experiment and two biology experiments also participated
Over 100 users authorized to run on Grid3
Application execution performed by dedicated individuals
Typically only a few users ran the applications from a particular experiment
Slide 12: US CMS Data Challenge DC04
(Plot: events produced vs. day; CMS-dedicated resources in red, opportunistic non-CMS use of Grid3 in blue.)
cf. A. Fanfani, #497
Slide 13: Ramp-up of ATLAS DC2
(Plot: CPU-days per day, mid-July through September 10.)
cf. R. Gardner et al., #503
Slide 14: Shared infrastructure, last 6 months
(Plot: CPU usage over time, showing CMS DC04 and ATLAS DC2 activity through September 10.)
Slide 15: ATLAS DC2 production on Grid3: a joint activity with LCG and NorduGrid
(Plot: total number of validated jobs per day; G. Poulard, 9/21/04.)
cf. L. Goossens, #501 and O. Smirnova, #499
Slide 16: Typical job distribution on Grid3 (G. Poulard, 9/21/04)
Slide 17: Beyond LHC applications…
Astrophysics and astronomy:
LIGO/LSC: blind search for continuous gravitational waves
SDSS: maxBcg, a cluster-finding package
Biochemical:
SnB: bio-molecular program, analyzes X-ray diffraction data to find molecular structures
GADU/Gnare: genome analysis, compares protein sequences
Computer science (supporting Ph.D. research):
adaptive data placement and scheduling algorithms
mechanisms for policy information expression, use, and monitoring
Slide 18: Astrophysics: Sloan Digital Sky Survey
Image stripes of the sky from telescope data; analyses: galaxy cluster finding, redshift analysis, weak lensing effects
Analyze weighted images
Increase sensitivity by two orders of magnitude with object detection and measurement code
Workflow: replicate sky-segment data to Grid3 sites; average, analyze, send output to Fermilab (see the sketch below)
44,000 jobs, 30% complete
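The sketch below spells out the three workflow stages named on the slide (replicate, analyze, return output) as plain Python functions. It is illustrative only: the hostnames, paths, and the use of globus-url-copy/globus-job-run as the transfer and execution tools are assumptions, not the actual SDSS production scripts.

```python
# Illustrative per-segment workflow: replicate sky-segment data to a Grid3
# site, run the analysis there, and send the output back to Fermilab.
# All endpoints and paths below are hypothetical.
import subprocess

def replicate(segment_id, site):
    # Stage one sky-segment file to the chosen Grid3 site via GridFTP.
    subprocess.run(["globus-url-copy",
                    f"gsiftp://sdss-store.example.org/segments/{segment_id}.fits",
                    f"gsiftp://{site}/scratch/{segment_id}.fits"], check=True)

def analyze(segment_id, site):
    # Run the averaging / object-detection step remotely (placeholder command).
    subprocess.run(["globus-job-run", site, "/bin/echo", f"analyze {segment_id}"],
                   check=True)

def return_output(segment_id, site):
    # Send the results back to Fermilab.
    subprocess.run(["globus-url-copy",
                    f"gsiftp://{site}/scratch/{segment_id}.out.fits",
                    f"gsiftp://fnal.example.org/sdss/results/{segment_id}.fits"],
                   check=True)

def process_segment(segment_id, site="grid3-site.example.edu"):
    replicate(segment_id, site)
    analyze(segment_id, site)
    return_output(segment_id, site)
```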
Slide 19: SDSS Job Statistics on Grid3
Time period: May 1 - Sept. 1, 2004
Total number of jobs: 71,949
Total CPU time: 774 CPU-days
Average job runtime: 0.26 hr
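As a quick consistency check (a back-of-the-envelope calculation, not from the slide), the quoted average follows from the totals:

```python
# 774 CPU-days spread over 71,949 jobs gives the quoted ~0.26 hr per job.
cpu_days, jobs = 774, 71949
print(f"{cpu_days * 24 / jobs:.2f} hr per job")   # -> 0.26 hr per job
```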
Slide 20: Structural Biology
SnB is a computer program based on Shake-and-Bake:
A dual-space direct-methods procedure for determining molecular crystal structures from X-ray diffraction data.
Difficult molecular structures with as many as 2,000 unique non-H atoms have been solved in routine fashion.
SnB has been routinely applied to jump-start the solution of large proteins, increasing the number of selenium atoms determined in Se-Met molecules from dozens to several hundred.
SnB is expected to play a vital role in the study of ribosomes and large macromolecular assemblies containing many different protein molecules and hundreds of heavy-atom sites.
Slide 21: Genomic Searches and Analysis
Searches for and finds new genomes in public databases (e.g. NCBI)
Each genome is composed of ~4k genes
Each gene needs to be processed and characterized; each gene is handled by a separate process (see the sketch below)
Results saved for future use; also: BLAST protein sequences
GADU: 250 processors; 3M sequences identified: bacterial, viral, vertebrate, mammal
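A minimal sketch of the "one process per gene" fan-out mentioned above, using local multiprocessing as a stand-in for grid job submission; characterize() is a placeholder for the real analysis (e.g. a BLAST run), not GADU code.

```python
# Sketch: fan out one worker process per gene and collect results for reuse.
from multiprocessing import Pool

def characterize(gene_record):
    # Placeholder analysis: process and characterize one gene.
    gene_id, sequence = gene_record
    return gene_id, len(sequence)        # trivial stand-in metric

def process_genome(genes, workers=8):
    # genes: iterable of (gene_id, sequence) pairs, roughly 4k per genome.
    with Pool(workers) as pool:
        results = pool.map(characterize, genes)
    return dict(results)                 # saved for future use

if __name__ == "__main__":
    demo = [(f"gene_{i}", "ACGT" * 10) for i in range(100)]
    print(len(process_genome(demo)), "genes characterized")
```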
Slide 22: Lessons (1)
Human interactions in grid building are costly
Keeping site requirements light led to heavy loads on gatekeeper hosts
A diverse set of sites made exchanging job requirements difficult
Single points of failure rarely caused trouble; expiring certificate revocation lists did so twice
Configuration problems: Pacman helped, but enormous amounts of time were still spent diagnosing problems
Slide 23: Lessons (2)
Software updates were either relatively easy or extremely painful
Authorization: simple in Grid3, but coarse-grained
Troubleshooting: efficiency for submitted jobs was not as high as we'd like; a complex system with many failure modes and points of failure
Need fine-grained monitoring tools, with improvements at both the service level and the user level
Slide 24: Operations Experience
iGOC and the US ATLAS Tier1 (BNL) developed an operations response model in support of DC2
Tier1 center: core services, an "on-call" person always available, response protocol developed
iGOC: coordinates problem resolution for Tier1 "off hours"; trouble handling for non-ATLAS Grid3 sites
Problems resolved at weekly iVDGL operations meetings
~600 trouble tickets (generic); ~20 ATLAS DC2 specific
Extensive use of email lists
Slide 25: Not major problems
Bringing sites into single-purpose grids
Simple computational grids for highly portable applications
Specific workflows as defined by today's JDL and/or DAG approaches
Centralized, project-managed grids, up to a particular scale (yet to be seen)
Slide 26: Major problems: two perspectives
Site and service-provider perspective:
maintaining multiple "logical" grids with a given resource
maintaining robustness; long-term management; dynamic reconfiguration; platforms
complex resource-sharing policies (department, university, projects, collaborations) and user roles
Application-developer perspective:
the challenge of building integrated distributed systems
end-to-end debugging of jobs, understanding faults
common workload and data management systems developed separately for each VO
Slide 27: Grid3 is evolving into OSG
Main features/enhancements:
Storage Resource Management
improved authorization service
added data management capabilities
improved monitoring and information services
service challenges and interoperability with other Grids
Timeline:
the current Grid3 remains stable through 2004
service development continues on the Grid3dev platform
cf. R. Pordes, #192
Slide 28: Conclusions
Grid3 taught us many lessons about how to deploy and run a production grid
Breakthrough in the demonstrated use of "opportunistic" resources enabled by grid technologies
Grid3 will be a critical resource for continued data challenges through 2004, and an environment in which to learn how to operate and upgrade large-scale production grids
Grid3 is evolving to OSG with enhanced capabilities
Slide 29: Acknowledgements
R. Pordes (Grid3 co-coordinator) and the rest of the Grid3 team, which did all the work!
Site administrators, VO service administrators, application developers, developers and contributors, the iGOC team, and the project teams