Published by Berenice Ward. Modified over 9 years ago.
1
The Role of Scientific Middleware in the Future of HEP Computing
Miron Livny, Computer Sciences Department, University of Wisconsin-Madison (miron@cs.wisc.edu)
2
www.cs.wisc.edu/condor This talk focuses on the how rather than on the what
3
Can we do it? Does the scientific community (scientists from different disciplines and the funding agencies) have the know-how, the resources and the will to develop, maintain, document, evolve and support a common (and shared) suite of production-quality middleware that meets the needs and expectations of the HEP community?
4
Why is Grid Middleware Such a Challenge? (the road from CHEP2000 to CHEP2003) My CHEP 03 talk
5
Because developing good software is not easy and Distributed Computing is a hard problem! (do not try it at home!) My CHEP 03 talk
6
Since CHEP 03 we …
› Delivered distributed (grid) computing and data management functionality to HEP applications via Grid-3, LCG-1, NorduGrid, SAM Grid, …
› Secured funding for a middleware effort as part of EGEE, the OMII activity and the continuation of the PPDG work and the NMI-GRIDS center
7
Observations
› The Virtual Data Toolkit (VDT) provides a common foundation for most HEP deployments
› Recognition of the value and price of interdisciplinary collaboration
› No “silver bullets”
› Closer coordination in the development of interfaces (e.g. SRM) and end-to-end software stacks (e.g. gLite)
› Efforts to leverage, harmonize and/or share build and test infrastructure and capabilities
› Closer relationships with other scientific domains (sharing of capabilities and requirements)
› Sharing and coordination of experiences and requirements in Authentication, Authorization and Accounting areas
8
VDT Growth (timeline figure; labelled milestones)
› VDT 1.0: Globus 2.0b, Condor 6.3.1
› VDT 1.1.3, 1.1.4 & 1.1.5: pre-SC 2002
› VDT 1.1.7: switch to Globus 2.2
› VDT 1.1.8: first real use by LCG
› VDT 1.1.11: Grid2003
9
The Build Process (diagram; labels in the order they appear)
› Sources (CVS)
› Patching
› GPT src bundles
› NMI Build & Test Condor pool (~40 computers): build, test, package
› VDT build
› Contributors (VDS, etc.): build
› Outputs: Pacman cache, RPMs, binaries
› Test
Slide notes: “Hope to use NMI processes soon”; “Note patches”
10
Tools in the VDT 1.2.0
› Condor Group: Condor/Condor-G, Fault Tolerant Shell, ClassAds
› Globus Alliance: Job submission (GRAM), Information service (MDS), Data transfer (GridFTP), Replica Location (RLS)
› EDG & LCG: Make Gridmap, Certificate Revocation List Updater, Glue Schema/Info prov.
› ISI & UC: Chimera & Pegasus
› NCSA: MyProxy, GSI OpenSSH, UberFTP
› LBL: PyGlobus, Netlogger
› Caltech: MonALISA
› VDT: VDT System Profiler, Configuration software
› Others: KX509 (U. Mich.), DRM 1.2, Java, FBSng job manager
Highlighted on this slide: components built by NMI
11
Tools in the VDT 1.2.0 (same component list as on the previous slide). Highlighted on this slide: components built by contributors
12
Tools in the VDT 1.2.0 (same component list as on the previous two slides). Highlighted on this slide: components built by VDT
13
(unique) Challenges
› Name branding and distribution of credit and recognition
› Building and maintaining teams of middleware developers (this includes support, documentation, testing, …)
› Longevity and stability of funding
14
The Condor Experience (86-04)
15
Yearly Condor usage at UW-CS (chart; vertical axis from 2,000,000 to 10,000,000)
17
Slides used by UWCS with permission of Micron Technology, Inc.
Condor at Micron: Micron’s Global Grid
› 8000+ processors in 11 “pools”
› Linux, Solaris, Windows
› <50th Top500 rank
› 3+ TeraFLOPS
› Centralized governance
› Distributed management
› 16+ applications
› Self developed
18
Condor at Micron: The Chief Officer value proposition
■ Info Week 2004 IT Survey includes Grid questions!
  › Makes our CIO look good by letting him answer yes
  › Micron’s 2003 rank: 23rd
■ Without Condor we only get about 25% of PC value today
  › Didn’t tell our CFO a $1000 PC really costs $4000!
  › Doubling utilization to 50% doubles CFO’s return on capital
  › Micron’s goal: 66% monthly average utilization
■ Providing a personal supercomputer to every engineer
  › CTO appreciates the cool factor
  › CTO really “gets it” when his engineers say: I don’t know how I would have done that without the Grid
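The “$1000 PC really costs $4000” remark follows from dividing the purchase price by the fraction of the machine that does useful work. A minimal sketch of that reasoning, assuming cost scales inversely with utilization (the function name `effective_cost` is illustrative and not from the slides):

```python
def effective_cost(purchase_price, utilization):
    """Price paid per fully used machine when only `utilization` of capacity is used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return purchase_price / utilization


price = 1000  # dollars per PC, figure from the slide

# 25% is the slide's "today" figure, 50% the doubled case, 66% the stated goal.
for utilization in (0.25, 0.50, 0.66):
    cost = effective_cost(price, utilization)
    print(f"{utilization:.0%} utilization: ${cost:,.0f} per fully used PC")
```

At 25% utilization the effective cost is four times the purchase price; doubling utilization to 50% halves it, which is the sense in which return on capital doubles.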
19
Condor at Micron: Example Value
› 73,606 job hours / 24 / 30 ≈ 102.2, i.e. about 103 Solaris boxes
› 102.2306 boxes × $10,000/box ≈ $1,022,306
› And that’s just for one application, not considering decreased development time, increased uptime, etc.
Chances are if you have Micron memory in your PC, it was processed by Condor!
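The arithmetic on this slide can be checked with a short sketch. It assumes, as the division by 24 and 30 suggests, that the 73,606 job hours were accumulated over one 30-day month; the function name `equivalent_boxes` is illustrative and not from the slides.

```python
import math


def equivalent_boxes(job_hours, hours_per_day=24, days=30):
    """Dedicated machines, running non-stop for the period, that would supply job_hours."""
    return job_hours / (hours_per_day * days)


job_hours = 73606       # figure from the slide
cost_per_box = 10_000   # dollars per Solaris box, figure from the slide

boxes = equivalent_boxes(job_hours)
whole_boxes = math.ceil(boxes)        # machines cannot be bought in fractions
value = round(boxes * cost_per_box)   # the slide's dollar figure uses the unrounded count

print(f"{boxes:.2f} boxes (rounded up to {whole_boxes}), worth about ${value:,}")
```

The slide's “103” is the box count rounded up, while its dollar figure corresponds to the unrounded 102.23 boxes; multiplying the rounded count by $10,000 would give $1,030,000 instead.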
20
Condor at Oracle
Condor is used within Oracle's Automated Integration Management Environment (AIME) to perform automated build and regression testing of multiple components for Oracle's flagship Database Server product. Each day, nearly 1,000 developers make contributions to the code base of Oracle Database Server. Compilation of these software modules alone would take over 11 hours on a capable workstation. But in addition to building, AIME must control repository labelling/tagging, configuration publishing, and last but certainly not least, regression testing. Oracle is very serious about the stability and correctness of its products. Therefore, the AIME daily regression test suite currently covers 90,000 testable items divided into over 700 test packages. The entire process must complete within 12 hours to keep development moving forward. About five years ago, Oracle selected Condor as the resource manager underneath AIME because they liked the maturity of Condor's core components. In total, ~3000 CPUs at Oracle are managed by Condor today.
21
Condor at Noregon
Noregon has entered into a partnership with Targacept Inc. to develop a system to efficiently perform molecular dynamics simulations. Targacept is a privately held pharmaceutical company located in Winston-Salem's Triad Research Park whose efforts are focused on creating drug therapies for neurological, psychiatric, and gastrointestinal diseases. … Using the Condor® grid middleware, Noregon is designing and implementing an ensemble Car-Parrinello simulation tool for Targacept that will allow a simulation to be distributed across a large grid of inexpensive Windows® PCs. Simulations can be completed in a fraction of the time without the use of high performance (expensive) hardware.
>At 10:14 AM 7/15/2004 -0400, xxx wrote:
>Dr. Livny:
>I wanted to update you on our progress with our grid computing
>project. We have about 300 nodes deployed presently with the ability to
>deploy up to 6,000 total nodes whenever we are ready. The project has
>been getting attention in the local press and has gained the full support
>of the public school system and generated a lot of excitement in the
>business community.
22
In other words, I believe that the answer is – “Yes, we can do it!”
23
Should we do it? Should the scientific community invest the resources needed to build and maintain the infrastructure for developing, maintaining and supporting production-quality middleware, or should we expect commercial entities to sell us the middleware and services we need?
24
Innovation
Access to production-quality middleware enables experimentation with new (out of the box) ideas and paradigms, with “real” users, “real” applications and “realistic” scales (time, distribution, number of resources, number of files, number of jobs, … ).
25
Trillium: a meeting point of two sciences (diagram labels: Physics, Computer Science)
26
The CS Perspective
› Application needs are instrumental in the formulation of new frameworks and Information Technologies (IT)
› Scientific applications are an excellent indicator of future IT trends
› The physics community is at the leading edge of IT
› Experimentation (quantitative evaluation) is fundamental to the scientific process
› Requires robust software materialization of new technology
› Requires an engaged community of consumers
› Multidisciplinary teams hold the key to advances in IT
› Collaboration across CS disciplines and projects (intra-CS)
› Collaboration with domain scientists
27
The Scientific Method
› Deployment of end-to-end capabilities
› Advance the computational and/or data management capabilities of a community
› Based on coordinated design and implementation
› Teams of domain and computer scientists
› May span multiple CS areas
› Mission focused
› From design to deployment
28
Impact on Computer Science
Grid-2003 has a profound impact on the members of the CS team because it
› encourages and facilitates collaboration across CS groups,
› requires work in closely coordinated CS-Physics teams,
› provides an engaged user community,
› enables real-life deployment and experimentation,
› facilitates a structured feedback process,
› exposes the constraints of “real-life” fabric and
› provides a long-term and evolving operational horizon
29
Some (non-traditional) challenges
› Packaging, distribution and deployment of middleware and application software
› Troubleshooting of the deployed software stack
› Monitoring of hardware and software components
› Multi-VO allocation policies for processing and storage resources
30
Education and Training
Involvement in the development, maintenance and support of production-quality middleware provides an environment in which students learn “hands on” the dos and don’ts of software engineering and experience the life-cycle of software.
31
A view on Education “The Condor project provides unique and irreplaceable educational opportunities that in turn are leveraged by commercial operations like Micron's. Also, the opportunity to hire graduates skilled in the tools and methods of HPC are unparalleled through efforts like the Condor project.” From a recent NSF review.
32
A view on Training “In terms of training, the organization of this effort (the Condor project) relies on students in CS who are working in the area of distributed computing related topics. They get the exposure to the research, but it comes with the price of participating in the production software engineering activities. The result is a well-rounded training experience and graduates whose skills are in much demand.” From a recent NSF review
33
Yes, we can and should do it!