OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY

Deployment, Deployment, Deployment
March 2002
Randy Burris
Center for Computational Sciences
Oak Ridge National Laboratory

Overview of this presentation
- Our goal: let scientists (our customers) do science without worrying about their computer environment
- Our clientele:
  - Four disciplines (climate, astrophysics, genomics and proteomics, high-energy physics)
  - National labs and universities
  - Using resources all over the country
  - Residing all over the place
- We must deploy the result (“Deploy or die”)

Well, OK. But… deploy what?
- Where are the commonalities in our space?
  - Security and trust – nonexistent to extreme
  - Network connectivity – dialup to OC12
  - File sizes – bytes to terabytes
  - File location – local unit to partitions around the world
  - Visualization – static to dynamic real-time
  - And so on.
- We can’t do it all.
- So exactly what are we going to deploy? And how should we proceed?

Achieving successful deployment
- For each of the 4 projects, follow these basic steps:
  1. Define target environment(s)
  2. Characterize successful deployment (in each)
  3. Prototype in a close-to-production environment
  4. Deploy in production
- In parallel with the above:
  - Produce documentation at every step
  - Develop tools for support staff
- Start now.

Step 1: Define target environment(s)
- We cannot support all combinations.
  - Security – {DCE, Kerberos, PKI, gss}, firewalls, …
  - Compute resource – MPP, cluster, workstation, …
  - User platform – MPP, cluster, Unix/Linux, Windows, …
  - Storage
    - Storage resource – HPSS, PVFS, … ?
    - User API for access to data
      - NetCDF, HDF5, both, something else?
      - HRM, pftp, GridFTP, hsi, …
  - Network
    - WAN – GigE/jumbo, FastE, OC12, OC3, ESnet, hops, …
    - LAN – GigE, FastE, iSCSI, FibreChannel, …
  - Visualization – CAVE, workstations, Palm Pilots, …
- We will have to choose.

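To make "we cannot support all combinations" concrete, here is a minimal sketch that counts the environments implied by the options named on the Step 1 slide. The dictionary keys and function names are illustrative assumptions, not part of any project tooling. The lists are truncated exactly where the slide says "…", and entries that are not discrete choices (firewalls, ESnet, hops, "both, something else?") are left out, so the real number can only be larger.

```python
from itertools import product
from math import prod

# Options copied from the Step 1 slide. Each list stops where the slide
# says "…", so this is a lower bound on the real space.
TARGET_OPTIONS = {
    "security": ["DCE", "Kerberos", "PKI", "gss"],
    "compute": ["MPP", "cluster", "workstation"],
    "user_platform": ["MPP", "cluster", "Unix/Linux", "Windows"],
    "storage": ["HPSS", "PVFS"],
    "data_format": ["NetCDF", "HDF5"],
    "transfer_tool": ["HRM", "pftp", "GridFTP", "hsi"],
    "wan": ["GigE/jumbo", "FastE", "OC12", "OC3"],
    "lan": ["GigE", "FastE", "iSCSI", "FibreChannel"],
    "visualization": ["CAVE", "workstation", "Palm Pilot"],
}


def count_combinations(options):
    """Number of distinct environments if every option can pair with every other."""
    return prod(len(choices) for choices in options.values())


def enumerate_combinations(options):
    """Yield each candidate environment as a dict, one value per category."""
    keys = list(options)
    for values in product(*(options[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    print(count_combinations(TARGET_OPTIONS))  # 36864
```

Even with the truncated lists the count is 36,864 environments, which is the practical argument behind "We will have to choose": fixing one value in a category divides the space by the number of choices in that category.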
Step 2: Characterize successful deployment
A. Correct operation in the security environment
B. Optimized performance in the target network environment
C. Rugged infrastructure
D. Unobtrusive infrastructure
E. Thorough documentation for users and support staff

Step 2: Characterize A: Security
- I believe we must define the environment into which we intend to deploy.
  - Starting now
  - Because it will take a long time and will almost certainly require development.
- Questions to which we need answers:
  - Are we concerned with DOE sites or DOE+NSF+…?
  - Are there circumstances where clear-text passwords are OK? Where no security is OK?
  - Must we support authentication in PKI, GSI, DCE and/or Kerberos?
  - Will all of our infrastructure work with firewalls at one or both ends of a transfer?
    - Whose firewalls, what filtering parameters, …

Step 2: Characterize B: Network
- On what network are the end nodes?
- What is our target environment – ESnet, ESnet+Internet2, Grid, www, …
- What throughput is needed for effective science?

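One way to approach "what throughput is needed" is simple arithmetic on file size and link rate. The sketch below is illustrative only: the function names and the `efficiency` parameter are assumptions, and the link rates are nominal line rates (standard values, not taken from the slides). Real transfers see lower throughput because of protocol overhead, host limits, and shared links.

```python
# Nominal line rates in bits per second. Standard published values; achieved
# throughput on a real path will be lower.
LINK_RATES_BPS = {
    "dialup (56k modem)": 56_000,
    "FastE": 100_000_000,
    "OC3": 155_520_000,
    "OC12": 622_080_000,
    "GigE": 1_000_000_000,
}


def transfer_seconds(size_bytes, rate_bps, efficiency=1.0):
    """Time to move size_bytes over a link, given the fraction of line rate achieved."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (rate_bps * efficiency)


def required_rate_bps(size_bytes, deadline_seconds):
    """Sustained rate needed to move size_bytes within deadline_seconds."""
    return size_bytes * 8 / deadline_seconds


if __name__ == "__main__":
    one_terabyte = 10**12  # decimal terabyte, in bytes
    for name, rate in LINK_RATES_BPS.items():
        hours = transfer_seconds(one_terabyte, rate) / 3600
        print(f"{name:>20}: {hours:12.1f} hours per terabyte")
```

At full OC12 line rate a decimal terabyte takes roughly 3.6 hours, while the same file over a 56k dialup link would take more than four years. Working backward with `required_rate_bps` from how quickly scientists need their data gives a throughput target for each candidate environment.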
Step 2: Characterize C: Rugged
- Must not crash (of course)
- Must be in service when needed
- Must be secure
- Must have a support plan (which does not require an army of support people)
- Must have a trouble-resolution mechanism and resources
- Must survive normal maintenance
  - System software patches and upgrades
  - Equipment upgrades

Step 2: Characterize D: Unobtrusive
- User should need minimal knowledge
  - The deeper the infrastructure, the less the user should need to know
- User should be protected from mistakes
  - Try not to let the user screw things up
  - Documentation and real-time warnings
  - Effective defaults

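As a rough illustration of "effective defaults" and "real-time warnings", the sketch below shows a user-facing options object where every field has a usable default, fixable mistakes are corrected with a warning, and a destructive action needs an explicit opt-in. Everything here is hypothetical: `TransferOptions`, `check_options`, `MAX_STREAMS`, and the default values are assumptions made for illustration and do not describe any existing tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransferOptions:
    """Hypothetical user-facing options; every field has a usable default."""
    parallel_streams: int = 4   # assumed default, not from the slides
    overwrite: bool = False     # safest choice: never clobber data by default


MAX_STREAMS = 16  # assumed site limit, for illustration only


def check_options(opts, destination_exists):
    """Return (safe_options, warnings); raise only when no safe fix exists."""
    warnings = []
    streams = opts.parallel_streams
    if streams < 1:
        warnings.append(f"parallel_streams={streams} is invalid; using 1")
        streams = 1
    elif streams > MAX_STREAMS:
        warnings.append(
            f"parallel_streams={streams} exceeds {MAX_STREAMS}; using {MAX_STREAMS}"
        )
        streams = MAX_STREAMS
    if destination_exists and not opts.overwrite:
        # Overwriting is destructive, so it needs an explicit opt-in.
        raise FileExistsError("destination exists; set overwrite=True to replace it")
    return replace(opts, parallel_streams=streams), warnings


if __name__ == "__main__":
    safe, notes = check_options(TransferOptions(parallel_streams=64), False)
    print(safe.parallel_streams, notes)
```

The design choice is that a user who supplies nothing still gets a working transfer, a user who supplies a bad value is told immediately what was changed, and only an irreversible action stops the request.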
Step 2: Characterize E: Documentation
- White papers to inform larger community
- For users: how-to-use documents
- For system-admin staff:
  - How to install, debug, maintain, troubleshoot
- For user-support staff:
  - How to troubleshoot
  - Tuning knobs
- For programmers:
  - Overview documents to give context
  - Correct interface documents
- Correct documentation for all appropriate platforms

Step 3: Prototype in close-to-production environment
- Example of deployment approach on Probe:
  - Deploy early prototypes in Oak Ridge and NERSC
    - Use Probe, Probe HPSS, Production HPSS and supercomputers
    - Use (and require) documented code and procedures
  - As development progresses, evaluate and address deployment issues such as security, network performance, system-admin documentation
  - As prototype becomes more robust, migrate more functions to Oak Ridge and NERSC production environments
    - Continue to evaluate and address deployment issues that now include user and user-support documentation
  - Iterate as necessary
- When this sequence is done, you’re in production.

Overview of ORNL Architecture, March 2002
[Diagram. Labels shown: Probe; Production; Stingray RS/6000 S80; Marlin RS/6000 H70; STK Library (two); 220 GB SCSI RAID; 360 GB Sun FibreChannel RAID; 360 GB FibreChannel RAID; 600 GB SCSI JBOD; Origin 2000 Reality Monster; IBM and Compaq Supercomputers and 64-node Linux cluster; Gigabit Ethernet (jumbo frames); Production HPSS; Probe HPSS; Disk Cache (two); CAVE; Other Probe Nodes; External ESnet Router]

Example: How Terascale Supernova Initiative could be prototyped
[Diagram. Labels shown: Probe; Production; Stingray RS/6000 S80; Marlin RS/6000 H70; Origin 2000 Reality Monster; External ESnet Router; IBM and Compaq Supercomputers; Production HPSS; Probe HPSS; CAVE; Other Probe Nodes. Roles annotated on the diagram: Bulk storage; Data reduction, pre-viz manipulation; Rendering]

We should start right away:
- Select initial, intermediate and ultimate target environments
  - Including supported applications, platforms, security and target network
  - Describe in a white paper
- Seek common elements in supported applications
  - Develop a deployment plan for common elements
  - Write white paper describing deployment plan
    - Specify our approach to deploying support for those elements
    - Identifying un-met requirements, and how to remedy
    - Describing approach to ruggedness and unobtrusiveness
- Address non-common elements in supported applications
  - Seek to minimize their impact
  - Specify our approach to deploying support for those elements
  - Develop deployment plans and describe them
  - Write white paper describing deployment plan

DISCUSSION?

Serious questions for early resolution
- What is the role of HPSS?
  - HPSS will never be pervasive – it is expensive.
  - Treat HPSS sites as primary repositories?
- Which file transfer protocol(s) do we support?
  - GridFTP, pftp, hsi

Probe – “Place to be”
Overview of ORNL Probe Cell, February 2002
[Diagram. Labels shown: Probe; Production; Stingray RS/6000 S80; Marlin RS/6000 H70; STK Silo (two); 200 GB SCSI RAID Disks; Sun E250; Compaq DS GB Sun FibreChannel Disks; 360 GB STK FibreChannel Disks; FibreChannel Switch; GSN Switch; GSN Bridge; Origin 2000 Reality Monster; RS/6000 B80; External ESnet Router; To NERSC Probe; Sun Ultra 10; IBM and Compaq Supercomputers; 3494 Library; RS/ P-170; Sun E450; IBM F50; SGI Origin 200; Gigabit Ethernet; Intel Dual P-III Linux]

Backup slide

Technology on hand and available
- Software
  - HPSS (unlimited instantiations) and HPSS development license
  - HDF5, NetCDF
  - R, ggobi
  - gcc suite
  - C on Solaris, AIX, IRIX and Tru64
  - Fortran on AIX
  - Oracle 8i and DB2 (current developer’s editions) on AIX
  - Globus 2.0/AIX and Solaris
  - HRM
  - Inter-HPSS hsi application
  - OPNET modeling product
- MPI/IO testbed
- 18 nodes – IBM/AIX, Sun/Solaris, SGI/IRIX, Compaq/Tru64
- GRID nodes (Sun/Solaris, IBM/AIX, possibly Linux)
- ESnet III OC12 externally, GigE jumbo and Fast Ethernet internally
- Web100 and NET100 participation