
1 Grid3: an Application Grid Laboratory for Science. Rob Gardner, University of Chicago, on behalf of the Grid3 project. CHEP '04, Interlaken, September 28, 2004

2 Grid2003: an application grid laboratory
- Virtual data grid laboratory; virtual data research
- End-to-end HENP applications
- CERN LHC: US ATLAS testbeds and data challenges
- CERN LHC: US CMS testbeds and data challenges
- Grid3

3 Grid3 at a Glance
- Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT): GRAM, GridFTP, MDS, RLS, VDS
- Equipped with VO and multi-VO security, monitoring, and operations services
- Allows federation with other grids where possible, e.g. the CERN LHC Computing Grid (LCG): US ATLAS runs GriPhyN VDS execution on LCG sites; US CMS has storage element interoperability (SRM/dCache)
- Delivering the US LHC data challenges

4 Grid3 Design
A simple approach (see the site sketch below):
- Sites consist of a compute element (CE), a storage element (SE), and information and monitoring services
- VO-level and multi-VO services: VO information services and operations (iGOC)
- Minimal use of grid-wide systems: no centralized workload manager, replica or data-management catalogs, or command-line interface; higher-level services are provided by the individual VOs
(Diagram: sites with CE and SE, connected to the VOs and the iGOC)
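To make the site/VO split concrete, here is a minimal sketch of the per-site information a VO might keep. All field, site, and host names are hypothetical illustrations, not part of the Grid3 software; the point is structural, namely that catalogs and workload management live with the VO rather than with a central grid service.

```python
# Minimal sketch (hypothetical field and site names) of per-site information a VO might keep.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Grid3Site:
    name: str                  # site name registered with the Grid3 index
    gatekeeper: str            # GRAM contact string for the compute element (CE)
    gridftp: str               # GridFTP endpoint for the storage element (SE)
    app_dir: str = "$app"      # shared application install area
    tmp_dir: str = "$tmp"      # per-job working/scratch area
    supported_vos: List[str] = field(default_factory=list)

# Each VO maintains its own list of usable sites; there is no grid-wide catalog.
example_site = Grid3Site(
    name="UC_Example",
    gatekeeper="gate.example.edu/jobmanager-condor",
    gridftp="gsiftp://gate.example.edu",
    supported_vos=["usatlas", "uscms", "sdss"],
)
```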

5 Site Services and Installation
- Goal: install and configure with minimal human intervention
- Uses Pacman and distributed software "caches": % pacman -get iVDGL:Grid3
- Registers the site with VO- and Grid3-level services (GIIS registration, information providers, Grid3 schema, log management)
- Sets up accounts, application install areas ($app) and working directories ($tmp)
- Roughly a four-hour install and validation (see the wrapper sketch below)
(Diagram: a Grid3 site's compute element and storage built on the VDT)
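As an illustration of the intended hands-off flow, a small wrapper could chain the install and a basic validation. This is a sketch only: the pacman command is the one shown on the slide, the gatekeeper and GridFTP endpoints are hypothetical, and it assumes the standard Globus client tools are available on the PATH.

```python
# Sketch of an unattended install-and-validate wrapper (illustrative only).
import subprocess

GATEKEEPER = "gate.example.edu/jobmanager-condor"      # hypothetical CE contact
GRIDFTP = "gsiftp://gate.example.edu/tmp/grid3-test"   # hypothetical SE target

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Pull the Grid3 stack from the iVDGL Pacman cache (command from the slide).
run(["pacman", "-get", "iVDGL:Grid3"])

# 2. Smoke-test the compute element: run a trivial job through GRAM.
run(["globus-job-run", GATEKEEPER, "/bin/hostname"])

# 3. Smoke-test the storage element: copy a small file to the GridFTP server.
run(["globus-url-copy", "file:///etc/hostname", GRIDFTP])
```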

6 Multi-VO Security Model
- DOEGrids Certificate Authority; PPDG or iVDGL Registration Authority
- Authorization service: VOMS
- Each Grid3 site generates a Globus grid-map file with an authenticated SOAP query to each VO service, plus site-specific adjustments or mappings (see the sketch below)
- Group accounts associate VOs with jobs
(Diagram: VOMS servers for US ATLAS, US CMS, iVDGL, LSC, SDSS, and BTeV feeding each site's Grid3 grid-map file)
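The end product of that query is an ordinary Globus grid-map file mapping certificate DNs to local accounts. The sketch below assembles one from per-VO member lists; the DNs, VO names, and account names are made-up examples, and in Grid3 the lists come from the authenticated VOMS query rather than a hard-coded dictionary.

```python
# Sketch: build a site grid-map file from per-VO member lists.
# DNs and account names below are hypothetical examples.
vo_members = {
    "usatlas": ['"/DC=org/DC=doegrids/OU=People/CN=Alice Example 12345"'],
    "uscms":   ['"/DC=org/DC=doegrids/OU=People/CN=Bob Example 67890"'],
}

# Group accounts: every member of a VO maps to that VO's shared account,
# so jobs are attributed to the VO rather than to individuals.
group_account = {"usatlas": "usatlas1", "uscms": "uscms1"}

with open("grid-mapfile", "w") as f:
    for vo, dns in vo_members.items():
        for dn in dns:
            f.write(f"{dn} {group_account[vo]}\n")
```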

7 iVDGL Operations Center (iGOC)
- Co-located with the Abilene NOC (Indianapolis)
- Hosts and manages multi-VO services: top-level Ganglia and GIIS collectors; MonALISA web server and archival service; VOMS servers for iVDGL, BTeV, and SDSS; the site catalog service and Pacman caches
- Trouble-ticket systems: phone (24 hr), web- and email-based collection and reporting; investigation and resolution of grid middleware problems at the level of 30 contacts per week
- Weekly operations meetings for troubleshooting

8 Grid3: a snapshot of sites, September 2004
30 sites with multi-VO shared resources, ~3000 CPUs (shared). (Map of sites with service monitoring)

9 Grid3 Monitoring Framework (cf. M. Mambelli, B. Kim et al., #490)

10 Monitors: jobs by VO (ACDC), job queues (MonALISA), data I/O (MonALISA), metrics (MDViewer)

11 Use of Grid3, led by US LHC
- 7 scientific applications and 3 CS demonstrators; a third HEP experiment and two biology experiments also participated
- Over 100 users authorized to run on Grid3
- Application execution was performed by dedicated individuals, typically a few users per experiment

12 US CMS Data Challenge DC04: events produced vs. day, on CMS-dedicated resources (red) and on non-CMS Grid3 resources used opportunistically (blue) (cf. A. Fanfani, #497)

13 ATLAS DC2 ramp-up from mid-July to September 10 (y-axis: CPU-days) (cf. R. Gardner et al., #503)

14 Shared infrastructure over the last 6 months: CPU usage by CMS DC04 and ATLAS DC2, through September 10

15 ATLAS DC2 production on Grid3: a joint activity with LCG and NorduGrid. Plot: total number of validated jobs per day (G. Poulard, 9/21/04; cf. L. Goossens, #501 and O. Smirnova, #499)

16 Typical job distribution on Grid3 (G. Poulard, 9/21/04)

17 Beyond LHC applications…
- Astrophysics and astronomy: LIGO/LSC, a blind search for continuous gravitational waves; SDSS, the maxBcg cluster-finding package
- Biochemistry: SnB, a bio-molecular program that analyzes X-ray diffraction data to find molecular structures; GADU/Gnare, genome analysis that compares protein sequences
- Computer science: supporting Ph.D. research on adaptive data placement and scheduling algorithms, and on mechanisms for expressing, using, and monitoring policy information

18 Astrophysics: Sloan Digital Sky Survey
- Image stripes of the sky from telescope data sources: galaxy cluster finding; redshift analysis and weak-lensing effects
- Analyze weighted images, increasing sensitivity by two orders of magnitude with object detection and measurement code
- Workflow (see the sketch below): replicate sky-segment data to Grid3 sites; average and analyze; send output to Fermilab
- 44,000 jobs, 30% complete
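Read as a per-segment pipeline, the workflow is simply replicate, analyze, ship. The sketch below is illustrative only; the segment IDs, site names, and function bodies are placeholders, not the SDSS production code.

```python
# Illustrative per-segment pipeline for the workflow above (placeholders only).
SEGMENTS = ["segment_0001", "segment_0002"]           # hypothetical sky segments
SITES = ["site-a.example.edu", "site-b.example.edu"]  # hypothetical Grid3 sites

def replicate(segment, site):
    """Stage the sky-segment images to the site's storage element."""
    print(f"replicate {segment} -> {site}")

def analyze(segment, site):
    """Average the weighted images and run object detection/measurement."""
    print(f"analyze {segment} at {site}")
    return f"{segment}.catalog"

def send_to_fermilab(catalog):
    """Return the output catalog to Fermilab for collection."""
    print(f"send {catalog} to Fermilab")

for i, segment in enumerate(SEGMENTS):
    site = SITES[i % len(SITES)]        # simple round-robin placement
    replicate(segment, site)
    send_to_fermilab(analyze(segment, site))
```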

19 SDSS Job Statistics on Grid3
- Time period: May 1 to September 1, 2004
- Total number of jobs: 71,949
- Total CPU time: 774 CPU-days
- Average job runtime: 0.26 hr
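As a quick consistency check on these figures: 774 CPU-days is about 18,576 CPU-hours, and 18,576 / 71,949 ≈ 0.26 hr per job, matching the quoted average runtime.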

20 Structural Biology
SnB is a computer program based on Shake-and-Bake:
- A dual-space direct-methods procedure for determining molecular crystal structures from X-ray diffraction data
- Difficult molecular structures with as many as 2000 unique non-H atoms have been solved routinely
- SnB is routinely applied to jump-start the solution of large proteins, increasing the number of selenium atoms determined in Se-Met molecules from dozens to several hundred
- SnB is expected to play a vital role in the study of ribosomes and large macromolecular assemblies containing many different protein molecules and hundreds of heavy-atom sites

21 Genomic Searches and Analysis
- Searches for and finds new genomes on public databases (e.g. NCBI)
- Each genome is composed of ~4k genes; each gene needs to be processed and characterized, and each gene is handled by a separate process (see the sketch below)
- Results are saved for future use; protein sequences are also run through BLAST
- GADU: 250 processors, 3M sequences identified (bacterial, viral, vertebrate, mammal)
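A minimal sketch of the one-process-per-gene fan-out with result caching described above. The function names, cache layout, and placeholder characterization step are assumptions for illustration; this is not the GADU/Gnare code.

```python
# Sketch: one task per gene, with results cached so they are never recomputed.
import os, json
from concurrent.futures import ProcessPoolExecutor

CACHE_DIR = "results"            # hypothetical local cache of characterizations

def characterize(gene_id, sequence):
    """Placeholder for per-gene processing (e.g. a BLAST search of the sequence)."""
    return {"gene": gene_id, "length": len(sequence)}

def process_gene(item):
    gene_id, sequence = item
    out = os.path.join(CACHE_DIR, f"{gene_id}.json")
    if os.path.exists(out):                 # result saved for future use
        return out
    with open(out, "w") as f:
        json.dump(characterize(gene_id, sequence), f)
    return out

if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    genome = {f"gene{i:04d}": "ACGT" * 10 for i in range(4000)}  # ~4k genes
    with ProcessPoolExecutor() as pool:     # fan each gene out to a worker process
        list(pool.map(process_gene, genome.items()))
```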

22 Lessons (1)
- Human interactions in grid building are costly
- Keeping site requirements light led to heavy loads on gatekeeper hosts
- The diverse set of sites made exchanging job requirements difficult
- Single points of failure rarely caused trouble; expiring certificate revocation lists did, twice
- Configuration problems: Pacman helped, but we still spent enormous amounts of time diagnosing problems

23 Lessons (2)
- Software updates were either relatively easy or extremely painful
- Authorization: simple in Grid3, but coarse-grained
- Troubleshooting: efficiency for submitted jobs was not as high as we would like; this is a complex system with many failure modes and points of failure, so we need fine-grained monitoring tools and improvements at both the service level and the user level

24 Operations Experience
- The iGOC and the US ATLAS Tier1 (BNL) developed an operations response model in support of DC2
- Tier1 center: core services, an "on-call" person always available, and a developed response protocol
- iGOC: coordinates problem resolution for the Tier1 during "off hours" and handles trouble for non-ATLAS Grid3 sites
- Problems resolved at weekly iVDGL operations meetings: ~600 generic trouble tickets, ~20 specific to ATLAS DC2
- Extensive use of email lists

25 Not major problems
- Bringing sites into single-purpose grids
- Simple computational grids for highly portable applications
- Specific workflows as defined by today's JDL and/or DAG approaches
- Centralized, project-managed grids up to a particular scale (beyond that, yet to be seen)

26 Major problems: two perspectives
- Site and service-provider perspective: maintaining multiple "logical" grids with a given resource; maintaining robustness; long-term management; dynamic reconfiguration; platforms; complex resource-sharing policies (department, university, project, collaboration) and user roles
- Application-developer perspective: the challenge of building integrated distributed systems; end-to-end debugging of jobs and understanding faults; common workload and data management systems developed separately for each VO

27 Grid3 is evolving into OSG
- Main features and enhancements: storage resource management; improved authorization service; added data-management capabilities; improved monitoring and information services; service challenges and interoperability with other grids
- Timeline: the current Grid3 remains stable through 2004; service development continues on a Grid3dev platform
(cf. R. Pordes, #192)

28 Conclusions
- Grid3 taught us many lessons about how to deploy and run a production grid
- It demonstrated a breakthrough in the use of "opportunistic" resources enabled by grid technologies
- Grid3 will be a critical resource for continued data challenges through 2004, and an environment in which to learn how to operate and upgrade large-scale production grids
- Grid3 is evolving into OSG with enhanced capabilities

29 Acknowledgements
R. Pordes (Grid3 co-coordinator) and the rest of the Grid3 team, which did all the work: site administrators, VO service administrators, application developers, developers and contributors, the iGOC team, and the project teams.

