High Performance Linux Clusters Guru Session, Usenix, Boston June 30, 2004 Greg Bruno, SDSC

Overview of San Diego Supercomputer Center Founded in 1985  Non-military access to supercomputers Over 400 employees Mission: Innovate, develop, and deploy technology to advance science Recognized as an international leader in:  Grid and Cluster Computing  Data Management  High Performance Computing  Networking  Visualization Primarily funded by NSF

My Background  NCR: Helped to build the world’s largest database computers; saw the transition from proprietary parallel systems to clusters  HPVM: Helped build Windows clusters  Now: Rocks - Helping to build Linux-based clusters

Why Clusters?

Moore’s Law

Cluster Pioneers In the mid-1990s, Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question: Can You Build a High Performance Machine From Commodity Components?

The Answer is: Yes Source: Dave Pierce, SIO

The Answer is: Yes

Types of Clusters High Availability  Generally small (less than 8 nodes) Visualization High Performance  Computational tools for scientific computing  Large database machines

High Availability Cluster Composed of redundant components and multiple communication paths

Visualization Cluster Each node in the cluster drives a display

High Performance Cluster Constructed with many compute nodes and often a high- performance interconnect

Cluster Hardware Components

Cluster Processors Pentium/Athlon Opteron Itanium

Processors: x86 Most prevalent processor used in commodity clustering Fastest integer processor on the planet:  3.4 GHz Pentium 4, SPEC2000int: 1705

Processors: x86 Capable floating point performance  #5 machine on Top500 list built with Pentium 4 processors

Processors: Opteron Newest 64-bit processor Excellent integer performance  SPEC2000int: 1655 Good floating point performance  SPEC2000fp: 1691  #10 machine on Top500

Processors: Itanium First systems released June 2001 Decent integer performance  SPEC2000int: 1404 Fastest floating-point performance on the planet  SPEC2000fp: 2161 Impressive Linpack efficiency: 86%

Processors Summary  A table comparing Processor, GHz, SPECint, SPECfp, and Price for the Pentium 4 EE, Athlon FX, Opteron, two Itanium parts, and Power (table values not preserved)

But What Do You Really Build? Itanium: Dell PowerEdge 3250  Two 1.4 GHz CPUs (1.5 MB cache), 11.2 Gflops peak, 2 GB memory, 36 GB disk: $7,700  Two 1.5 GHz CPUs (6 MB cache) push the system cost to ~$17,…  1.4 GHz vs. 1.5 GHz: ~7% slower, while the faster system costs ~130% more

Opteron: IBM eServer 325  Two 2.0 GHz Opteron 246, 8 Gflops peak, 2 GB memory, 36 GB disk: $4,539  Two 2.4 GHz CPUs: $5,…  2.0 GHz vs. 2.4 GHz: ~17% slower, ~25% cheaper

Pentium 4 Xeon: HP DL140  Two 3.06 GHz CPUs, 12 Gflops peak, 2 GB memory, 80 GB disk: $2,815  Two 3.2 GHz CPUs: $3,…  3.06 GHz vs. 3.2 GHz: ~4% slower, ~20% cheaper

If You Had $100,000 To Spend On A Compute Farm  A table comparing, for Pentium 4 3 GHz, Opteron, and Itanium 1.4 GHz systems: number of boxes, peak GFlops, aggregate SPEC2000fp, and aggregate SPEC2000int (table values not preserved)

What People Are Buying Gartner study Servers shipped in 1Q04  Itanium: 6,281  Opteron: 31,184 Opteron shipped 5x more servers than Itanium

What Are People Buying Gartner study Servers shipped in 1Q04  Itanium: 6,281  Opteron: 31,184  Pentium: 1,000,000 Pentium shipped 30x more than Opteron

Interconnects

Ethernet  Most prevalent on clusters Low-latency interconnects  Myrinet  Infiniband  Quadrics  Ammasso

Why Low-Latency Interconnects? Performance  Lower latency  Higher bandwidth Accomplished through OS-bypass

How Low-Latency Interconnects Work  Decrease latency for a packet by reducing the number of memory copies per packet

Bisection Bandwidth Definition: If split system in half, what is the maximum amount of data that can pass between each half? Assuming 1 Gb/s links:  Bisection bandwidth = 1 Gb/s

Bisection Bandwidth Assuming 1 Gb/s links:  Bisection bandwidth = 2 Gb/s

Bisection Bandwidth Definition: Full bisection bandwidth is a network topology that can support N/2 simultaneous communication streams. That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed.
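
For example, under this definition a 64-node network built from 1 Gb/s links has full bisection bandwidth only if any 32 nodes can stream to the other 32 simultaneously, i.e. 32 Gb/s can cross the cut; two 32-node halves joined by a single 1 Gb/s uplink provide only 1/32 of that.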

Large Networks  When you run out of ports on a single switch, you must add another network stage  In the example above: assuming 1 Gb/s links, uplinks from stage-1 switches to stage-2 switches must carry at least 6 Gb/s

Large Networks With low-port count switches, need many switches on large systems in order to maintain full bisection bandwidth  128-node system with 32-port switches requires 12 switches and 256 total cables
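
That switch count follows from a two-stage arrangement (a reconstruction of the arithmetic, not spelled out on the slide): give each 32-port leaf switch 16 node ports and 16 uplink ports, so 128 nodes need 8 leaf switches and produce 128 uplinks; those uplinks fill 4 more 32-port spine switches. 8 + 4 = 12 switches, and 128 node cables + 128 uplink cables = 256 cables.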

Myrinet Long-time interconnect vendor  Delivering products since 1995 Deliver single 128-port full bisection bandwidth switch MPI Performance:  Latency: 6.7 us  Bandwidth: 245 MB/s  Cost/port (based on 64-port configuration): $1000 Switch + NIC + cable

Myrinet  Recently announced a 256-port switch  Available August 2004

Myrinet #5 System on Top500 list System sustains 64% of peak performance  But smaller Myrinet-connected systems hit 70-75% of peak

Quadrics QsNetII E-series  Released at the end of May 2004 Deliver 128-port standalone switches MPI Performance:  Latency: 3 us  Bandwidth: 900 MB/s  Cost/port (based on 64-port configuration): $1800 Switch + NIC + cable

Quadrics #2 on Top500 list Sustains 86% of peak  Other Quadrics-connected systems on Top500 list sustain 70-75% of peak

Infiniband Newest cluster interconnect Currently shipping 32-port switches and 192-port switches MPI Performance:  Latency: 6.8 us  Bandwidth: 840 MB/s  Estimated cost/port (based on 64-port configuration): $ Switch + NIC + cable

Ethernet Latency: 80 us Bandwidth: 100 MB/s Top500 list has ethernet-based systems sustaining between 35-59% of peak

Ethernet  With Myrinet, we would have sustained ~1 Tflop, at a cost of ~$130,000 for the interconnect (roughly 1/3 the cost of the system)  What we did: 128 nodes and a $13,000 Ethernet network ($101/port; $28/port with our latest Gigabit Ethernet switch)  Sustained 48% of peak

Rockstar Topology 24-port switches Not a symmetric network  Best case - 4:1 bisection bandwidth  Worst case - 8:1  Average - 5.3:1

Low-Latency Ethernet  Brings OS-bypass to Ethernet  Projected performance:  Latency: less than 20 us  Bandwidth: 100 MB/s  Potentially could merge the management and high-performance networks  Vendor: Ammasso

Application Benefits

Storage

Local Storage Exported to compute nodes via NFS
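
As a rough sketch of what “exported via NFS” looks like in practice (the paths, network range, and hostname below are illustrative, not taken from the slides):

  # On the frontend: export a local directory to the private cluster network
  echo '/export/home 10.1.0.0/255.255.0.0(rw,async)' >> /etc/exports
  exportfs -a    # re-export everything listed in /etc/exports

  # On a compute node: mount the shared directory
  mount frontend-0:/export/home /home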

Network Attached Storage A NAS box is an embedded NFS appliance

Storage Area Network Provides a disk block interface over a network (Fibre Channel or Ethernet) Moves the shared disks out of the servers and onto the network Still requires a central service to coordinate file system operations

Parallel Virtual File System PVFS version 1 has no fault tolerance PVFS version 2 (in beta) has fault tolerance mechanisms

Lustre Open Source “Object-based” storage  Files become objects, not blocks

Cluster Software

Cluster Software Stack Linux Kernel/Environment  RedHat, SuSE, Debian, etc.

Cluster Software Stack HPC Device Drivers  Interconnect driver (e.g., Myrinet, Infiniband, Quadrics)  Storage drivers (e.g., PVFS)

Cluster Software Stack Job Scheduling and Launching  Sun Grid Engine (SGE)  Portable Batch System (PBS)  Load Sharing Facility (LSF)
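
A minimal Sun Grid Engine job script, as a sketch (the parallel environment name “mpi”, the slot count, and the binary are assumptions; site configurations differ):

  #!/bin/bash
  #$ -N mpi-test          # job name
  #$ -cwd                 # run from the submission directory
  #$ -pe mpi 16           # request 16 slots from an "mpi" parallel environment
  mpirun -np $NSLOTS ./my_mpi_app

Submit it with ‘qsub mpi-test.sh’ and watch it with ‘qstat’.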

Cluster Software Stack Cluster Software Management  E.g., Rocks, OSCAR, Scyld

Cluster Software Stack Cluster State Management and Monitoring  Monitoring: Ganglia, Clumon, Nagios, Tripwire, Big Brother  Management: Node naming and configuration (e.g., DHCP)
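
For the monitoring piece, a quick way to see what Ganglia is collecting (assuming gmond’s default port of 8649; the node name is illustrative):

  # Dump the XML metric stream from a node's gmond daemon
  telnet compute-0-0 8649

Rocks also ships the Ganglia web front end on the frontend node for browsing the same data.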

Cluster Software Stack Message Passing and Communication Layer  E.g., Sockets, MPICH, PVM
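
Typical MPICH usage on this stack looks roughly like the following (program name and machine file are illustrative):

  mpicc -o hello hello.c                      # compile and link against MPI
  mpirun -np 8 -machinefile machines ./hello  # launch 8 ranks on the listed hosts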

Cluster Software Stack Parallel Code / Web Farm / Grid / Computer Lab  Locally developed code

Cluster Software Stack Questions:  How to deploy this stack across every machine in the cluster?  How to keep this stack consistent across every machine?

Software Deployment  Known methods:  Manual approach  “Add-on” method: bring up a frontend, then add cluster packages (OpenMosix, OSCAR, Warewulf)  Integrated: cluster packages are added at frontend installation time (Rocks, Scyld)

Rocks

Primary Goal Make clusters easy Target audience: Scientists who want a capable computational resource in their own lab

Philosophy  It’s not fun to “care and feed” a system  All compute nodes are 100% automatically installed  Critical for scaling  Essential to track software updates  RHEL 3.0 has issued 232 source RPM updates since Oct 21, roughly 1 updated SRPM per day  Run on heterogeneous, standard high-volume components  Use the components that offer the best price/performance!

More Philosophy  Use installation as the common mechanism to manage a cluster  Everyone installs a system: on initial bring-up, when replacing a dead node, when adding new nodes  Rocks also uses installation to keep software consistent  If you catch yourself wondering whether a node’s software is up-to-date, reinstall! In 10 minutes, all doubt is erased  Rocks doesn’t attempt to incrementally update software
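
As an illustration of the “just reinstall” workflow, something like the following was used on Rocks clusters of that era; the exact script path and helper command are recalled from Rocks documentation and should be treated as assumptions:

  # Force a compute node to kickstart itself (assumed script path)
  ssh compute-0-0 /boot/kickstart/cluster-kickstart

  # Or use the frontend-side helper (assumed command name)
  shoot-node compute-0-0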

Rocks Cluster Distribution  Fully-automated, cluster-aware distribution  Cluster on a CD set  Software packages: full Red Hat Linux distribution (Red Hat Enterprise Linux 3.0 rebuilt from source), de-facto standard cluster packages, Rocks packages, Rocks community packages  System configuration: configures the services in the packages

Rocks Hardware Architecture

Minimum Components X86, Opteron, IA64 server Local Hard Drive Power Ethernet OS on all nodes (not SSI)

Optional Components  Myrinet high-performance network (Infiniband support in Nov 2004)  Network-addressable power distribution unit  Keyboard/video/mouse network not required: it is non-commodity, it raises the question of how you manage your management network, and crash carts have a lower TCO

Storage NFS  The frontend exports all home directories Parallel Virtual File System version 1  System nodes can be targeted as Compute + PVFS or strictly PVFS nodes

Minimum Hardware Requirements Frontend:  2 ethernet connections  18 GB disk drive  512 MB memory Compute:  1 ethernet connection  18 GB disk drive  512 MB memory Power Ethernet switches

Cluster Software Stack

Rocks ‘Rolls’ Rolls are containers for software packages and the configuration scripts for the packages Rolls dissect a monolithic distribution

Rolls: User-Customizable Frontends Rolls are added by the Red Hat installer  Software is added and configured at initial installation time Benefit: apply security patches during initial installation  This method is more secure than the add-on method

Red Hat Installer Modified to Accept Rolls

Approach
Install a frontend:
1. Insert Rocks Base CD
2. Insert Roll CDs (optional components)
3. Answer 7 screens of configuration data
4. Drink coffee (takes about 30 minutes to install)
Install compute nodes:
1. Login to frontend
2. Execute insert-ethers
3. Boot compute node with Rocks Base CD (or PXE)
4. Insert-ethers discovers nodes
5. Goto step 3
Add user accounts. Start computing.
Optional Rolls: Condor, Grid (based on NMI R4), Intel (compilers), Java, SCE (developed in Thailand), Sun Grid Engine, PBS (developed in Norway), Area51 (security monitoring tools)

Login to Frontend  Create ssh public/private key  You are asked for a ‘passphrase’  These keys are used to securely log in to compute nodes without having to enter a password each time  Execute ‘insert-ethers’  This utility listens for new compute nodes
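
The login step looks roughly like this (Rocks of that era prompted for the key at first login; running the standard OpenSSH tool by hand has the same effect):

  # On the frontend: generate an ssh keypair, entering a passphrase when prompted
  ssh-keygen -t rsa

  # Then start listening for new compute nodes
  insert-ethers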

Insert-ethers Used to integrate “appliances” into the cluster

Boot a Compute Node in Installation Mode  Instruct the node to network boot  Network boot forces the compute node to run the PXE protocol (Preboot eXecution Environment)  Can also use the Rocks Base CD  If there is no CD and no PXE-enabled NIC, you can use a boot floppy built from ‘Etherboot’

Insert-ethers Discovers the Node

Insert-ethers Status

eKV Ethernet Keyboard and Video Monitor your compute node installation over the ethernet network  No KVM required! Execute: ‘ssh compute-0-0’

Node Info Stored In A MySQL Database If you know SQL, you can execute some powerful commands
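
For example, from the frontend you can poke at the database with the mysql client; the database name “cluster” follows the Rocks documentation of that era, while the table and column names below are purely illustrative, so inspect the real schema first:

  mysql cluster -e 'SHOW TABLES'                  # list the Rocks tables
  mysql cluster -e 'SELECT Name, MAC FROM nodes'  # illustrative query against an assumed nodes table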

Cluster Database

Kickstart Red Hat’s Kickstart  Monolithic flat ASCII file  No macro language  Requires forking based on site information and node type. Rocks XML Kickstart  Decompose a kickstart file into nodes and a graph Graph specifies OO framework Each node specifies a service and its configuration  Macros and SQL for site configuration  Driven from web cgi script

Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
  <description>
    Enable SSH
  </description>
  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <package>&ssh;-askpass</package>
  <post>
cat &gt; /etc/ssh/ssh_config &lt;&lt; 'EOF'
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
EOF
chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh
  </post>
</kickstart>

Sample Graph File Default Graph for NPACI Rocks. …

Kickstart framework

Appliances  Laptop / Desktop  Appliances are the final classes, i.e., the node types  Desktop IsA standalone  Laptop IsA standalone + pcmcia  Code re-use is good

Architecture Differences  Conditional inheritance: annotate edges with target architectures  If i386: Base IsA grub  If ia64: Base IsA elilo  One graph, many CPUs: heterogeneity is easy  Not for SSI or imaging

Installation Timeline

Status

But Are Rocks Clusters High Performance Systems? Rocks Clusters on June 2004 Top500 list:

What We Proposed To Sun Let’s build a Top500 machine … … from the ground up … … in 2 hours … … in the Sun booth at Supercomputing ‘03

Rockstar Cluster (SC’03)  Demonstrate that we are now in the age of “personal supercomputing”  Highlight the abilities of Rocks and SGE  Top500 list: #201 - November 2003, #413 - June 2004  Hardware: 129 Intel Xeon servers (1 frontend node, 128 compute nodes), Gigabit Ethernet network for $13,000 (US): 9 24-port switches, 8 4-gigabit trunk uplinks