High Performance Computing Center, ACI-REF Virtual Residency, August 7-13, 2016


How to Design a Cluster
Dana Brunson
Asst. VP for Research Cyberinfrastructure
Director, High Performance Computing Center
Adjunct Assoc. Professor, CS & Math Depts.
Oklahoma State University

"It depends." -- Henry Neeman

What is a Cluster?
"… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is... is freedom." – Captain Jack Sparrow, "Pirates of the Caribbean"
Credit: Henry Neeman

What a Cluster is…
A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network (or interconnect for short). It also needs software that allows the nodes to communicate over the interconnect.
But what a cluster is … is all of these components working together as if they're one big computer... a super computer.
Credit: Henry Neeman

An Actual Cluster
– Interconnect
– Nodes
Also named Boomer, in service
Credit: Henry Neeman

Considerations
– Your budget
– Power, space, and cooling
– Your researchers' workload
– Your funding cycle
– Your staff
– What else?

OSU's considerations
– Budget: ~$1.3M
– Building out new power & cooling
– Workload is a mix of single-core, shared-memory, and parallel jobs up to ~512 cores
– Funding cycle is one big purchase every 4-ish years (thanks, NSF!)
– Staff = 1 dedicated person for combined sysadmin, user support, and application installation

Components overview
– Compute nodes (standard, large memory, accelerated)
– Storage (slow & fast)
– Interconnects (slow & fast)
– Login nodes
– Management nodes
– Other? (Data transfer node, web interfaces, etc.)

Compute nodes
"Standard" compute nodes:
– Processor type
– RAM (speed, # of channels to fill)
– As cheap as possible while still making users happy
Special compute nodes:
– Large memory
– Accelerators
Choice depends on your users' needs and the sweet spot in pricing.
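The note about RAM speed and the number of channels to fill matters because memory bandwidth drops when channels are populated unevenly. A minimal sketch of enumerating balanced configurations, assuming a two-socket node with 4 memory channels per socket (typical of the Xeon era this deck discusses) and illustrative DIMM sizes:

```python
# Sketch: balanced memory configurations for a two-socket node.
# Socket count, channels per socket, and DIMM sizes are assumptions
# for illustration; check your actual CPU's memory topology.
SOCKETS = 2
CHANNELS_PER_SOCKET = 4
DIMM_SIZES_GB = [8, 16, 32]  # commonly quoted DIMM capacities

def balanced_configs(dimms_per_channel=(1, 2)):
    """Yield (total_GB, description) for configurations that populate
    every channel equally, keeping memory running at full speed."""
    for size in DIMM_SIZES_GB:
        for per_channel in dimms_per_channel:
            n_dimms = SOCKETS * CHANNELS_PER_SOCKET * per_channel
            yield n_dimms * size, f"{n_dimms} x {size} GB"

for total, desc in sorted(balanced_configs()):
    print(f"{total:4d} GB  ({desc})")
```

An unbalanced option a vendor quotes (say, 12 DIMMs across 16 channel slots) may look cheaper per GB but costs bandwidth.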

OSU's latest basic outline: Compute nodes
Compute nodes (depends on the sweet spot of pricing):
– Standard: processor 2620 or better, GB RAM
– Large memory: one node with 1 TB RAM, 4x nodes with 256 GB
– GPU nodes (specs will come from the researchers wanting them)
– We already have Xeon Phi from a recent purchase

Storage
– Scratch: Do you need a parallel filesystem?
  – Needs staff and/or great support (expensive)
  – Size depends on workload and purge policy
– Home: small, simple, and often backed up
– Work: big, not too slow
– Archive: PetaStore (what do others do?)

OSU's latest basic outline: Storage
– Home: ~20 TB storage appliance x3 (we want this redundant-ish)
– Scratch: 100 TB appliance with an 800 number on it (unclear if we can get it this small)
– Work: 1 PB (servers full of disks, NFS, RAID6, cheap and simple)
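For the "servers full of disks, RAID6" work filesystem, usable capacity is easy to sanity-check: RAID6 spends two disks' worth of space per array on parity. A quick sketch, with disk size, disks per server, and server count as made-up illustrative numbers (not OSU's actual configuration):

```python
# Sketch: usable capacity of RAID6 NFS servers full of disks.
# All counts below are illustrative assumptions.
def raid6_usable_tb(disks, disk_tb):
    # RAID6 reserves two disks' worth of capacity for parity per array.
    return (disks - 2) * disk_tb

servers = 9
disks_per_server = 36
disk_tb = 4
total = servers * raid6_usable_tb(disks_per_server, disk_tb)
print(f"~{total} TB usable")  # ~1224 TB, roughly the 1 PB target
```

Running numbers like this before talking to vendors makes it obvious how many chassis the 1 PB target implies.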

Interconnects
– Gigabit Ethernet: the workhorse
– InfiniBand/Omni-Path: depends on workload (oversubscription can save money if your workload doesn't have many large parallel jobs)
– IPMI network for out-of-band management: worth the small expense
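The oversubscription point is a simple ratio: how many downlink (node-facing) ports share each uplink to the core. A sketch, using an illustrative 36-port leaf switch split 24 down / 12 up (the port counts are assumptions, not a specific product's):

```python
# Sketch: fabric oversubscription ratio for a leaf switch.
# Port counts are illustrative; 24 node ports sharing 12 uplinks
# gives 2:1 oversubscription.
def oversubscription(down_ports, up_ports):
    return down_ports / up_ports

print(oversubscription(24, 12))  # 2.0
```

Higher ratios mean fewer core switches and cables (cheaper), at the cost of bisection bandwidth for jobs that span switches, which is exactly why it's safe when most parallel jobs fit in one switch.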

OSU's latest basic outline: Interconnects
– InfiniBand/Omni-Path:
  – as cheap as we can get it
  – highly oversubscribed; all our parallel jobs fit within a single switch
– GigE: top of rack, uplinked to central 10G
– IPMI

Login & Management
Login nodes:
– Get enough to handle all your users
– >1 can give high availability
– Round-robin DNS
Management nodes:
– Much diversity in how this is done
– Where you run all the cluster-wide services
– Depends on size of cluster and services needed
Very small clusters often have a single server for both login and management.
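Round-robin DNS here just means publishing multiple A records for one login hostname so successive lookups rotate across the nodes. A small sketch of the behavior (hostnames and IPs are hypothetical):

```python
# Sketch: how round-robin DNS spreads users across login nodes.
# In the zone file this would be multiple A records for one name, e.g.
#   login  IN A 10.0.0.11
#   login  IN A 10.0.0.12
# The nameserver hands the addresses out in rotating order.
from itertools import cycle

login_nodes = ["10.0.0.11", "10.0.0.12"]  # hypothetical addresses
resolver = cycle(login_nodes)

# Four successive "lookups" alternate between the two nodes.
print([next(resolver) for _ in range(4)])
```

Note this is load spreading, not failover: if a login node dies, DNS keeps handing out its address until you remove the record.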

OSU's latest basic outline: Management
– 2x login nodes (same processor as compute)
– 3x management nodes
– We also want vendor installation and support (remember: 1 FTE dedicated to the technical stuff)

Other optional bits
– Data transfer node (see ESnet for specs)
– Web interfaces/science gateways, etc.
– What else?
OSU: yes to a DTN; we already have sufficient webby stuff (aka a virtual server pool).

Strategy
– We gave this list to any vendor that wanted to talk to us.
– We told them the budget.
– The resulting quotes were very informative; go over them carefully!
– We have plenty of time because we're waiting for the new power and cooling to be installed.
– We will go out for bid.
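Going over quotes carefully is easier if you normalize them to common units first. A sketch of the idea, with entirely made-up vendor numbers for illustration:

```python
# Sketch: normalize vendor quotes to comparable per-unit prices.
# The prices, core counts, and RAM totals below are fabricated
# examples, not real quotes.
quotes = [
    {"vendor": "A", "price": 1_250_000, "cores": 5000, "ram_tb": 40},
    {"vendor": "B", "price": 1_300_000, "cores": 5600, "ram_tb": 35},
]

for q in quotes:
    per_core = q["price"] / q["cores"]
    per_tb = q["price"] / q["ram_tb"]
    print(f'{q["vendor"]}: ${per_core:.0f}/core, ${per_tb:,.0f}/TB RAM')
```

Per-core and per-TB numbers won't decide the purchase by themselves (support, warranty, and interconnect differ too), but they quickly expose which quote is padded where.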

Watch out for:
– Enterprise vendors have a completely different mindset: uptime, redundancy, etc. The unit of redundancy on a cluster is the compute node.
– Get references from someone who bought a similarly sized cluster from that vendor and who has a similarly sized staff.
– Compare notes with as many people as possible.

Be warned:
No matter what you do, the minute you send out the PO you'll think of something you should've done differently. And that's okay; we all feel that way.

Topics from Etherpad
– Advertising specs: casual or RFP?
– Scheduler & policies: what meets your researchers' needs?
– Provisioning management system: Rocks, xCAT, OpenHPC, Razor, Puppet, vendor-supplied, etc.
– Replacement schedule: what's your funding cycle like?

Topics from Etherpad (continued)
– Single system vs. two? We keep our old system around a while to ease the transition.
– Liquid cooling? $$$$
– Voltage: depends on how much and what you have available; UPSes > 100 kW are usually (only?) 3-phase 480 V.
– Rack standards: space available? Density of cooling available?

Acquisition start to finish (was part 2 plan)
– Get money
– Spec system based on what's needed
– Get bids (informally or formally)
– Buy
– Don't sign acceptance before you test everything with at least some of your workload

Thanks! Questions?
Dana Brunson
OSU High Performance Computing Center