The LBNL Perceus Cluster Infrastructure: Next Generation Cluster Provisioning and Management
Internet2 Fall Conference, October 10, 2007
Gary Jung, SCS Project Manager

Outline: Perceus Infrastructure
What is this talk about?
Cluster Support at LBNL
Support Methodology (previous)
The New Infrastructure
Support Methodology (new)
What's next?
Upcoming Challenges

Cluster Support at LBNL: The SCS Program
Research projects purchase their own Linux clusters
IT provides comprehensive cluster support
  o Pre-purchase consulting
  o Procurement assistance
  o Cluster integration and installation
  o Ongoing systems administration and cyber security
  o Computer room space with networking and cooling
30 clusters in production (over 2,500 processors)
Huge success! Fifth year of operation
Examples of recent clusters include:
  o Molecular Foundry 296 -> 432 PE InfiniBand (IB) cluster (Oct 2007)
  o Earth Sciences 388 -> 536 PE IB cluster (Feb 2007)
  o Department of Homeland Security 80 PE IB cluster (Dec 2006)

Support Methodology (previous): Key points
Standard software stack facilitates systems administration
  o Minimizes skill-set requirements for support
  o Expertise transferable between clusters
Warewulf cluster toolkit allows us to easily scale up clusters
  o Allows all nodes to be configured from the master node
  o Runs nodes "stateless" (disks not required)
  o Incremental effort to manage more nodes
Minimal infrastructure to keep recharge costs down
  o Most clusters behind a firewall
  o One-time password authentication
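The following is a minimal Python sketch, not the actual Warewulf tooling or its data model (node names and the registry are invented), of why master-based, stateless provisioning keeps the per-node effort nearly constant: adding a compute node amounts to one registry entry pointing its MAC address at a shared boot image.

```python
# Toy illustration only: the master holds one shared image, and each
# diskless node is just a record mapping its MAC address to that image.
node_image_registry = {}  # MAC address -> name of shared boot image


def register_node(mac, image="compute-node-default"):
    """Adding a node is one record; the node itself stays stateless."""
    node_image_registry[mac] = image


def image_for(mac):
    """What the master would hand a node at network-boot time."""
    return node_image_registry.get(mac, "compute-node-default")


# Scaling from 8 to 800 nodes only grows this table, not the admin effort.
for i in range(4):
    register_node(f"00:1e:c9:00:00:{i:02x}")
print(image_for("00:1e:c9:00:00:02"))
```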

SCS Infrastructure (previous)

Support Issues (previous): What are the issues that we need to solve?
Each cluster has its own
  o Software: applications, libraries, compilers, scheduler, etc.
  o Hardware: interconnect, storage, management infrastructure
  o Not standard enough!
Provisioning new clusters is time-consuming
  o Each new system is built from scratch
  o Significant work is required to get user applications running, and then performing well, on each new cluster
Major cluster upgrades are also time-consuming. The typical process is to:
  o Partition off some part of the cluster
  o Install the new software stack on the new partition
  o Integrate and test user applications on the new stack
  o Migrate compute nodes and users to the new environment
  o The overall process usually takes several weeks or months

Support Issues (previous)
The proliferation of clusters has led to users needing accounts on more than one cluster
Giving users access to multiple systems is a manual process; there is no centralized management of user accounts
  o Requires managing multiple accounts
  o Users have to copy data from one cluster to another to run
  o Results in replicated user environments (e.g., redundant data sets and binaries)
  o Minor software differences may require users to recompile
  o Installation of layered software is repeated to a varying degree
Sharing resources between clusters is not possible
  o Idle compute resources can't easily be leveraged
  o As mentioned, software inconsistencies would mean a lot of work for users
  o Grid computing might help with sharing, but not with the support effort

How can we improve?
We've scaled the support effort to manage a standalone cluster very well. Now, how do we scale the support effort in another dimension to manage a very large number of clusters?

New Infrastructure
Infiscale Perceus: next-generation Warewulf software
  o Large-scale provisioning of nodes
LBNL Perceus Cluster Infrastructure
  o Flatten the network to connect everything together
  o Use the Infiscale Perceus software to allow sharing of certain de facto resources
  o The Perceus Appliance is essentially a "SuperMaster" node
  o Single master job scheduler system
  o Global home filesystem
  o Everything else becomes a diskless cluster node: interactive nodes, compute nodes, and storage nodes
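A toy Python model of the flattened layout described above, with all names invented for illustration: one provisioning master, one scheduler, and one global home filesystem, with each "cluster" reduced to a named group of diskless nodes.

```python
from dataclasses import dataclass, field


@dataclass
class FlattenedInfrastructure:
    """Sketch of the new layout: one master, one scheduler, one /home."""
    global_home: str = "/global/home"          # shared by every node
    scheduler: str = "master-scheduler"        # single submission point
    node_groups: dict = field(default_factory=dict)  # "cluster" -> node list

    def add_cluster(self, name, nodes):
        # A new cluster is just another group of diskless nodes; it inherits
        # home directories, user accounts, and the shared software stack.
        self.node_groups[name] = nodes


infra = FlattenedInfrastructure()
infra.add_cluster("foundry", [f"foundry-n{i:03d}" for i in range(8)])
infra.add_cluster("earthsci", [f"es-n{i:03d}" for i in range(8)])
print(sorted(infra.node_groups))  # users see clusters; admins see node groups
```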

New Infrastructure

Support Methodology (new): Perceus Appliance (SuperMaster)
Almost everything is booted from the Perceus Appliance
Two master nodes provide high availability
Cluster configurations are stored in the Perceus Berkeley DB (BDB) database on a shared filesystem
VNFS capsules
  o Support the use and sharing of multiple different cluster images
  o Provide versioning of cluster images to facilitate upgrades and testing
Facilitates provisioning new clusters. Instead of building each new cluster and software stack from scratch, we do the following:
  o Users test their applications on our test cluster nodes
  o Purchase and install the new cluster hardware
  o Configure and test; new systems use the existing shared software stack, home directories, and user accounts already set up
  o This saves several days of effort; as a result, new systems are provisioned much faster
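As a rough illustration of the "configurations in a database on a shared filesystem" idea, the sketch below keeps a node-to-capsule assignment in a simple key/value store. The real Perceus schema and commands are not shown; the capsule names, versions, and path are invented.

```python
import dbm
import json

# Hypothetical capsule registry; in practice this file would live on the
# shared /global filesystem so both master nodes see the same state.
DB_PATH = "node-config.db"


def assign_capsule(node, capsule, version):
    """Point a node at a specific VNFS capsule version."""
    with dbm.open(DB_PATH, "c") as db:
        db[node] = json.dumps({"capsule": capsule, "version": version})


def capsule_for(node):
    with dbm.open(DB_PATH, "r") as db:
        return json.loads(db[node])


# Upgrades become a version flip: test nodes move to the new capsule first,
# and rolling back is just reassigning the previous version.
assign_capsule("foundry-n001", "rhel-ib-compute", "2007.10")
print(capsule_for("foundry-n001"))
```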

Support Methodology (new)
Global Filesystem
A shared global filesystem minimizes the copying of data between systems
Users compile once on an interactive node and can submit jobs to the clusters accessible to them from one place
Software installed on the global filesystem can be shared by all
  o Minimizes the need to install software on every cluster
  o A uniform path to software facilitates running code across all clusters
  o Sharing compilers is more cost-effective
User Environment Modules
Allow different versions and builds of software to coexist on the clusters
Users can dynamically load a specific version of compiler, MPI, and application software at run time
Facilitates finding the "right" combination of software to build older applications
Allows us to upgrade software without breaking applications
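The sketch below illustrates, in Python, what an environment-modules "load" amounts to conceptually: prepending version-specific directories to the search paths so multiple builds can coexist under one install prefix. It is not the real Modules package that the clusters would actually use, and the /global/software layout is assumed for illustration.

```python
import os

SOFTWARE_ROOT = "/global/software"   # assumed shared install prefix


def load_module(name, version):
    """Toy 'module load': make one specific build visible to this shell."""
    prefix = f"{SOFTWARE_ROOT}/{name}/{version}"
    os.environ["PATH"] = f"{prefix}/bin:" + os.environ.get("PATH", "")
    os.environ["LD_LIBRARY_PATH"] = (
        f"{prefix}/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
    )


# A user hunting for the "right" combination for an older code might try:
load_module("gcc", "3.4.6")
load_module("openmpi", "1.2.4")
print(os.environ["PATH"].split(":")[:2])
```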

Support Methodology (new)
Master Scheduler
A single master scheduler is set up to schedule for all clusters
Allows users to submit jobs to any accessible cluster
Features in Open MPI will allow us to run jobs spanning multiple clusters across different fabrics
Facilitates monitoring and accounting of usage
Shared Interactive Nodes
Users log in to interactive nodes, compile their programs, and submit jobs to the scheduler from these nodes
Interactive nodes include support for graphical interactive sessions
More interactive nodes can be added as needed
Dedicated Resources
Clusters can still be dedicated as before
Users see what they need
  o Default to their own cluster (but can have access to others)
  o Always see their "global" home directory
  o The user still sees a cluster; we see groups of nodes
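A hypothetical sketch of the routing policy implied above: users submit from an interactive node, default to their own cluster, and may target any other cluster (node group) they have been granted access to. The access map and policy here are invented, not the actual scheduler configuration.

```python
# Invented access map: user -> list of clusters (node groups), first is default.
ACCESS = {
    "alice": ["foundry"],
    "bob":   ["earthsci", "lrc"],
}


def route_job(user, requested=None):
    """Pick the cluster a job should be scheduled on."""
    clusters = ACCESS.get(user, [])
    if not clusters:
        raise PermissionError(f"{user} has no cluster access")
    if requested is None:
        return clusters[0]                # default: the user's own cluster
    if requested not in clusters:
        raise PermissionError(f"{user} may not submit to {requested}")
    return requested


print(route_job("bob"))          # earthsci (bob's default)
print(route_job("bob", "lrc"))   # explicit choice of another accessible cluster
```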

What's next?
Shared Laboratory Research Cluster (LRC)
Integrating a large shared resource makes more sense now
Users can use either the LRC or their own cluster
The goal will be better utilization of resources to meet demand
Cycle Sharing
Opportunity to make use of spare cycles
Now more a political challenge than a technical one
UC Berkeley
A separate institution, also managed by the University of California
Recent high-profile joint projects now encourage close collaboration
Proximity to LBNL facilitates the use of SCS services
  o One cluster in production; three more pending
  o Perceus will be used to set up a similar infrastructure
  o A future goal may be to connect both infrastructures together and run clusters over the WAN

Upcoming Challenges
Global Filesystem
Currently used for home directories only
We can see that this is a good path
Need to select the appropriate technology to scale up
  o HPC parallel filesystems are usually complex, require client software, and are expensive
  o Conventional enterprise technology will work for now
Can it run over the WAN to UC Berkeley?
Scheduler and Accounting
Much more complex than scheduling for small groups
Need a method for tracking user allocations in CPU hours
Many solutions; not clear which is the best
Security and Networking
Clusters are more tightly integrated; we need to evaluate current practices
What are the security implications of extending our Perceus infrastructure to the UC Berkeley campus?
How do we technically do it?
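One plausible shape for the CPU-hour accounting mentioned above (the job records, projects, and numbers are invented): charge each job as cores times wall-clock hours and compare project totals against their allocations.

```python
# Invented job records: (project, cores, wall-clock hours)
jobs = [
    ("foundry", 64, 12.0),
    ("foundry", 128, 3.5),
    ("earthsci", 256, 8.0),
]
allocations = {"foundry": 50_000, "earthsci": 100_000}  # CPU hours per project

# Charge = cores x wall-clock hours, summed per project.
usage = {}
for project, cores, hours in jobs:
    usage[project] = usage.get(project, 0.0) + cores * hours

for project, alloc in allocations.items():
    used = usage.get(project, 0.0)
    print(f"{project}: {used:.1f} of {alloc} CPU hours ({100 * used / alloc:.2f}%)")
```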

Conclusion: Summary
The next evolutionary step in providing cluster support
Simplifies provisioning new clusters
Reduces the support effort and keeps our technical support staff sane
Mostly transparent to users, but becomes a great help when they start using more than one cluster
Leveraging multiple resources will benefit researchers