Headline in Arial Bold 30pt HPC User Forum, April 2008 John Hesterberg HPC OS Directions and Requirements.

Slides:



Advertisements
Similar presentations
Technology Drivers Traditional HPC application drivers – OS noise, resource monitoring and management, memory footprint – Complexity of resources to be.
Advertisements

WHAT IS AN OPERATING SYSTEM? An interface between users and hardware - an environment "architecture ” Allows convenient usage; hides the tedious stuff.
Introduction CSCI 444/544 Operating Systems Fall 2008.
Virtualization in HPC Minesh Joshi CSC 469 Dr. Box Feb 1, 2012.
© 2003 IBM Corporation IBM Systems and Technology Group Operating System Attributes for High Performance Computing Ken Rozendal Distinguished Engineer.
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
1 In VINI Veritas: Realistic and Controlled Network Experimentation Jennifer Rexford with Andy Bavier, Nick Feamster, Mark Huang, and Larry Peterson
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
CS533 Concepts of Operating Systems Class 14 Virtualization.
1: Operating Systems Overview
OPERATING SYSTEM OVERVIEW
Understanding Operating Systems 1 Overview Introduction Operating System Components Machine Hardware Types of Operating Systems Brief History of Operating.
1 Computer Science, University of Warwick Architecture Classifications A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures.
Operating Systems: Principles and Practice
Design and Implementation of a Single System Image Operating System for High Performance Computing on Clusters Christine MORIN PARIS project-team, IRISA/INRIA.
Copyright © 2010 Platform Computing Corporation. All Rights Reserved.1 The CERN Cloud Computing Project William Lu, Ph.D. Platform Computing.
Distributed Systems Early Examples. Projects NOW – a Network Of Workstations University of California, Berkely Terminated about 1997 after demonstrating.
Computer System Architectures Computer System Software
Dual Stack Virtualization: Consolidating HPC and commodity workloads in the cloud Brian Kocoloski, Jiannan Ouyang, Jack Lange University of Pittsburgh.
SSI-OSCAR A Single System Image for OSCAR Clusters Geoffroy Vallée INRIA – PARIS project team COSET-1 June 26th, 2004.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems.
Microkernels, virtualization, exokernels Tutorial 1 – CSC469.
Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?
LOGO OPERATING SYSTEM Dalia AL-Dabbagh
 What is an operating system? What is an operating system?  Where does the OS fit in? Where does the OS fit in?  Services provided by an OS Services.
Operating System Review September 10, 2012Introduction to Computer Security ©2004 Matt Bishop Slide #1-1.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Chapter 6 Operating System Support. This chapter describes how middleware is supported by the operating system facilities at the nodes of a distributed.
LiNK: An Operating System Architecture for Network Processors Steve Muir, Jonathan Smith Princeton University, University of Pennsylvania
Principles of Scalable HPC System Design March 6, 2012 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Performance Concepts Mark A. Magumba. Introduction Research done on 1058 correspondents in 2006 found that 75% OF them would not return to a website that.
◦ What is an Operating System? What is an Operating System? ◦ Operating System Objectives Operating System Objectives ◦ Services Provided by the Operating.
Programming Models & Runtime Systems Breakout Report MICS PI Meeting, June 27, 2002.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
High Performance Computing on Virtualized Environments Ganesh Thiagarajan Fall 2014 Instructor: Yuzhe(Richard) Tang Syracuse University.
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
N. GSU Slide 1 Chapter 05 Clustered Systems for Massive Parallelism N. Xiong Georgia State University.
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,
Multi-stack System Software Jack Lange Assistant Professor University of Pittsburgh.
Server Virtualization
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
PARALLEL COMPUTING overview What is Parallel Computing? Traditionally, software has been written for serial computation: To be run on a single computer.
Supporting Multi-Processors Bernard Wong February 17, 2003.
Distributed Information Systems. Motivation ● To understand the problems that Web services try to solve it is helpful to understand how distributed information.
Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015.
1: Operating Systems Overview 1 Jerry Breecher Fall, 2004 CLARK UNIVERSITY CS215 OPERATING SYSTEMS OVERVIEW.
The Role of Virtualization in Exascale Production Systems Jack Lange Assistant Professor University of Pittsburgh.
Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.
Breakout Group: Debugging David E. Skinner and Wolfgang E. Nagel IESP Workshop 3, October, Tsukuba, Japan.
Operating Systems: Internals and Design Principles
Full and Para Virtualization
Outline Why this subject? What is High Performance Computing?
Tackling I/O Issues 1 David Race 16 March 2010.
Background Computer System Architectures Computer System Software.
Page 1 2P13 Week 1. Page 2 Page 3 Page 4 Page 5.
Next Generation of Apache Hadoop MapReduce Owen
Building PetaScale Applications and Tools on the TeraGrid Workshop December 11-12, 2007 Scott Lathrop and Sergiu Sanielevici.
Chapter 16 Client/Server Computing Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E William.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
INTRODUCTION TO HIGH PERFORMANCE COMPUTING AND TERMINOLOGY.
Open Source Virtualization Andrey Meganov RHCA, RHCX Consultant / VDEL
Introduction to Operating Systems Concepts
Introduction to Operating Systems
Operating System & Application Software
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
Parallel Programming By J. H. Wang May 2, 2017.
Introduction to Operating Systems
Department of Computer Science University of California, Santa Barbara
Department of Computer Science University of California, Santa Barbara
Lecture Topics: 11/1 Hand back midterms
Presentation transcript:

Headline in Arial Bold 30pt HPC User Forum, April 2008 John Hesterberg HPC OS Directions and Requirements

04/11/08 Processor-rich environments Today's max SGI system sizes: –1000s of cores per ccNUMA Single System Image (SSI)‏ > 4096 cores limited by Linux scalability –1000s of small SSI nodes per system cores limited by cost Mutli-core is another layer of NUMA! In one SSI: –threads in a core –cores in a cpu/socket –sockets on a numa (or bus) node –numa nodes in a SSI latency/bandwidth between nodes –SSIs in a system

04/11/08 Processor-rich environments Job placement, scheduling continue to be critical, both within SSI and across SSIs –batch managers need to see topology –placement policies based on applications within a node: –processor/cache/node affinity –cpusets –isolated cpus across nodes: –synchronized scheduling

04/11/08 Resiliancy requirements all hardware is not created equal. :)‏... but all hardware will fail tolerate as many failures as possible –memory in particular –diskless clusters eliminate one more failure component –this is an ongoing (and endless) effort –tight integration between software and hardware helps as systems scale, node failures become a certainty –application based checkpoint/restart? –OS supported checkpoint/restart?

04/11/08 The Power of Linux Disruptive OS development process –>1000 developers, 186 known companies (2.6.24)‏ –That's just the kernel!!! Code reviews and signoffs –Hard and controversial changes get reviewed and done right Open Source Flexibility –Can be customized for HPC hardware –Can be on the cutting edge of research Productized and supported Same OS on laptop, desktop, clusters with 1000s of nodes, SSIs with 1000s of cores No vendor lock-in.

04/11/08 Virtualization Microkernel approaches (e.g. Xen)‏ –hides hardware details :-( –how much memory do you want to waste? –how much cpu do you want to waste? –what about IO performance? KVM more interesting for some problems –First domain runs directly on hardware –Additional domains for compatibility. Rebooting –Can be better for clusters with dedicated nodes...compare boot times to job runtimes –No added runtime overhead –Provides clean image –Multiple boot images on disk or on server –Integrate into batch manager!

04/11/08 Green Computing hardware is helping and hurting... –laptop capabilities moving up –more and larger systems consuming more power HPC challenge: save power w/o sacrificing performance –tickless OS –cpu frequency management –management of idle cpus, nodes, and systems

04/11/08 Slide 8

04/11/08 Slide 9 Time Barrier Complete System Overhead Wasted Cycles Wasted Cycles System Overhead Compute Cycles Wasted Cycles Wasted Cycles Wasted Cycles Wasted Cycles Node 1 Node 2 Node 3 Unsynchronized OS Noise → Wasted Cycles System Overhead System Overhead Node 1 System Overhead Node 2 System Overhead Node 3 Synchronized OS Noise → Faster Results Process on: OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc). Management interface will allow users to select what gets synchronized Huge performance boost on larger scales systems OS Noise Synchronization‏

04/11/08 Workflow Characteristics Multi-Discipline Data-Intensive Mixed or Uncertain Workloads Interactivity Rapid development cycles Workflow Characteristics Price-performance key Little data sharing Predictable Workloads Non-interactive Standard Modes Mature Apps Capability & Conceptual Computing Capacity & Compliant Computing Focus on Innovation Focus on Efficiency Slide 10 Servers: One Size Doesn’t Fit All!

04/11/08 User Productivity Advantages of Large Global Shared Memory Freedom to arbitrarily scale problem size without decomposition or other rework Minimal penalty to fetch off-node data Ability to freely exploit Open MP and/or MPI in any combination or scale. Simplified code development and prototyping platform Freedom to experiment without the hindrance of cluster paradigms Unified parallel C translator in development Greatly simplified load balancing Simple to direct a task to any processor as all data is accessible