Building a Cluster Support Service Implementation of the SCS Program UC Computing Services Conference Gary Jung SCS Project Manager


1 Building a Cluster Support Service
Implementation of the SCS Program
UC Computing Services Conference
Gary Jung, SCS Project Manager
http://scs.lbl.gov/
August 8, 2005

2 UCCSC – August 8, 2005
Agenda
SCS Program Overview
Implementation
Areas for collaboration

3 Background
The 1990s – Computing at the Desktop
o The “Gap” between desktops and NERSC
2001 - MRC Working Group
o Large institutional system originally envisioned by working group
o Users not ready to share large system
o Recommendation to support Linux clusters
December 2002 - SCS Program approved
o $1.3M four-year program started January 2003
o Ten strategic science projects are selected
o IT Division provides support for Linux clusters
Goals
o Enable our scientists to use and take advantage of computing
o HPC that works. Avoid security incidents, lost time and expensive mistakes
o More effective science

4 Strategy
Planning
o Formal project mgmt methods
o Steering Committee
Support Methodology
o Use proven technical approaches that enable us to quickly provide production capability
o Adopt standards to facilitate scaling support to several clusters
Staffing
o Need to develop a core of expertise to address changes in technology (e.g. 64-bit Linux, kernel hacking, cluster mgmt, Myrinet, MPI, schedulers)
Costs
o Drive down support costs

5 Support Methodology
Balance Choice vs. Standardization
User has choice over the components that are important to them (e.g. CPU, memory, interconnect)
We standardize on the aspects that allow us to scale support and reduce costs
Leading, but not bleeding. No exotic stuff (e.g. no Lustre yet)
On the other hand, tightly coupled, parallel systems are a must to push the paradigm shift
Remember that the goal is a production system
The real trick is in the integration: making the correct choices so that the components all work together and perform well

6 Support Methodology
The Standard
Hardware - ia32 or AMD64
Interconnect – GigE, Myrinet, or InfiniBand
Operating system - Red Hat Enterprise Linux or CentOS
LBNL Warewulf Cluster Toolkit (http://warewulf-cluster.org)
MPI implementation - LAM-MPI
Scheduler - Sun Grid Engine, Torque
Monitoring – Nagios, Ganglia (http://metacluster.lbl.gov)
Cybersecurity – Host-based measures, PIX Firewall, OTP, specific user policies
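One way to picture how a fixed standard helps scale support is to treat it as a checklist that a proposed cluster is compared against. The sketch below is illustrative only and is not part of the SCS tooling: the dictionary keys, the `nonstandard_components` function, and the sample proposal are all hypothetical names; only the allowed values are taken from the slide.

```python
# Allowed choices transcribed from "The Standard" slide (stored lowercase).
# The key names are hypothetical labels chosen for this sketch.
STANDARD = {
    "hardware": {"ia32", "amd64"},
    "interconnect": {"gige", "myrinet", "infiniband"},
    "os": {"red hat enterprise linux", "centos"},
    "toolkit": {"warewulf"},
    "mpi": {"lam-mpi"},
    "scheduler": {"sun grid engine", "torque"},
}


def nonstandard_components(spec):
    """Return the components of `spec` that are missing or outside the standard."""
    issues = {}
    for component, allowed in STANDARD.items():
        chosen = spec.get(component)
        # Compare case-insensitively so "CentOS" and "Centos" both match.
        if chosen is None or chosen.lower() not in allowed:
            issues[component] = chosen
    return issues


# Hypothetical proposal: the user picks hardware and interconnect freely
# within the standard, but asks for a scheduler that is not on the list.
proposed = {
    "hardware": "AMD64",
    "interconnect": "Myrinet",
    "os": "CentOS",
    "toolkit": "Warewulf",
    "mpi": "LAM-MPI",
    "scheduler": "PBS Pro",
}
print(nonstandard_components(proposed))  # {'scheduler': 'PBS Pro'}
```

A check like this only flags deviations; whether a deviation is acceptable would still be a judgment call for the support team.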

7 Staffing
Need team with specialized skills to meet technical expertise requirements
Limited funding, tight timeline
Team roles – Division of responsibilities
o Project mgmt, facilities planning
o Technology and procurement
o Cluster architect, OS, kernel, MPI expert
o Scheduler expert
o Cluster installation and support
1.6 FTE total - 10 SCS clusters, 295 nodes
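The staffing figures on this slide imply some simple support ratios. The short calculation below works them out from the three numbers given (1.6 FTE, 10 clusters, 295 nodes); the function name `support_ratios` and the dictionary keys are assumptions made for this sketch, and the ratios themselves are derived here rather than stated in the presentation.

```python
def support_ratios(fte, clusters, nodes):
    """Derive per-FTE and per-cluster ratios from the slide's totals."""
    return {
        "clusters_per_fte": clusters / fte,
        "nodes_per_fte": nodes / fte,
        "nodes_per_cluster": nodes / clusters,
    }


# Figures from the slide: 1.6 FTE total, 10 SCS clusters, 295 nodes.
ratios = support_ratios(fte=1.6, clusters=10, nodes=295)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

With these inputs, each FTE covers roughly 6 clusters and about 184 nodes, and the average cluster has 29.5 nodes.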

8 Costs
Driving Down Costs
Standardization of components critical
In-house integration reduces hardware costs and facilitates standards
Leverage relations with open source community
Outsourcing of various pieces - wiring, seismic
Long term planning for electrical infrastructure saves on cost
Develop lower cost staff - college interns
Competitive bid procurement
Benchmarking costs - other National labs, private industry

9 Success Factors
Adherence to standards
Effective Steering Committee
Initial funding key to get started
Prominent scientists were our customers
o Funding, visibility, ROI
Talented, motivated staff
Creativity allowed, but the focus is on production use
Transparent costing model

10 Collaboration
What do we have from this?
Methodology for cluster support
New Consulting Offerings
o Cluster architecture
o Procurement specification
o Facilities planning
Development of cluster support business
o Effort/cost analysis
o Recharge model
LBNL Warewulf software
o GPL, 20,000 downloads

11 Collaboration
Challenges
Larger systems
Scalability issues - e.g. parallel filesystems, power and cooling issues
Moving up the technology curve - InfiniBand, PCI-E
Assessing integration risks – Will it work? How will it perform?
Harder problems to debug
Getting scientists to share a system
New services - User facilities, application support
Computer room space
Funding and funding models
Driving down costs further
Charting path forward

