© 2008 IBM Corporation Blue Heron Project. IBM Rochester: Tom Budnik, Amanda Peters; Condor: Greg Thain. With contributions from: IBM Rochester: Mark Megerian, Sam Miller, Brant Knudson and Mike Mundy; Other IBMers: Patrick Carey, Abbas Farazdel, Maria Iordache and Alex Zekulin; UW-Madison Condor: Dr. Miron Livny. April 30, 2008

© 2008 IBM Corporation 2 Agenda  What is the Blue Heron Project?  Condor and IBM Blue Gene Collaboration  Introduction to Blue Gene/P  What applications fit the Blue Heron model?  How does Blue Heron work?  Information Sources  Condor on BG/P demo (Greg Thain)

© 2008 IBM Corporation 3 What is the Blue Heron Project?
Paths toward a general-purpose machine in the Blue Gene environment: HTC for serial and pleasantly parallel apps, HPC (MPI) for highly scalable message-passing apps. *** NEW *** Available 5/16/08: Blue Heron = Blue Gene/P HTC and Condor.
Blue Heron provides a complete, integrated solution that gives users a simple, flexible mechanism for submitting single-node jobs.
- Blue Gene looks like a "cluster" from an app's point of view
- Blue Gene supports a hybrid application environment: classic HPC (MPI) apps and now HTC apps

© 2008 IBM Corporation 4 Condor and Blue Gene Collaboration
- Both the IBM and Condor teams are engaged in adapting code to bring the Condor and Blue Gene technologies together
- Previous activities (BG/L): prototype/research Condor running HTC workloads
- Current activities (BG/P), the Blue Heron Project: partner in the design of HTC services; Condor supports HTC workloads using static partitions
- Future collaboration (BG/P and BG/Q): Condor supports dynamic machine partitioning; Condor supports HPC (MPI) jobs; I/O node exploitation with Condor; persistent memory support (data affinity scheduling); petascale environment issues

© 2008 IBM Corporation 5 Introduction to Blue Gene: Technology Roadmap. Blue Gene/L (PPC 700 MHz, scalable to 596+ TF) to Blue Gene/P (PPC 850 MHz, scalable to 3+ PF) to Blue Gene/Q. BG/P is the 2nd generation of the Blue Gene family.

© 2008 IBM Corporation 6 Introduction to Blue Gene/P: packaging hierarchy
- Chip: quad-core PowerPC system-on-chip, 4 processors, 13.6 GF/s, 8 MB EDRAM
- Compute Card: 1 chip plus 20 DRAMs, 13.6 GF/s, 2 or 4 GB DDR2
- Node Card: 32 compute cards plus up to 2 I/O cards, 435 GF/s, 64 or 128 GB
- Rack: 32 node cards, up to 64x10 GigE I/O links, 14 TF/s, 2 or 4 TB
- Cabled rack system: up to 256 racks, up to 3.56 PF/s, 512 or 1024 TB
Leadership performance in a space-saving, processor-dense, power-efficient package. High reliability: designed for less than 1 failure per rack per year (7 days MTBF for 72 racks). Easy administration using the powerful web-based Blue Gene Navigator. Ultrascale capacity machine ("cluster buster"): run 4,096 HTC jobs on a single rack. The system scales from 1 to 256 racks: 3.56 PF/s peak.

© 2008 IBM Corporation 7 What applications fit the Blue Heron model?
Master/worker paradigm: many "pleasantly parallel" apps on BG/P use a compute node as the "master node", as in this MPI skeleton:

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        // send work to other nodes and collect results
    } else {
        // do real work
    }

Advantage of the Blue Heron (HTC) solution: move the "master node" from a Blue Gene compute node to the Front-End Node (FEN). This is a better solution for the following reasons:
- Application resiliency: in the MPI model a single node failure kills the entire app for the partition. In HTC mode only the job running on the failed node ends; the other single-node jobs continue to run on the partition.
- The FEN has more memory, better performance, and more functionality than a single compute node.
- Code that runs on the compute nodes is much cleaner, since it contains only the work to be performed and leaves the coordination to a script or scheduler (no MPI needed).
- The coordinator can be a Perl script, Python, a compiled program, or anything that runs on Linux (see the sketch after this slide).
- The coordinator can interact directly with DB2 or MySQL, either to get the inputs for the application or to store the results. This can eliminate the need to create a flat-file input for the app or to generate the results in an output file.
Example: American Monte Carlo (options pricing). Reference: en.wikipedia.org/wiki/Monte_Carlo_methods_in_finance
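
In contrast with the rank-0 master above, the coordinator in the HTC model can live entirely on the Front-End Node. The sketch below is a minimal, hypothetical Python coordinator, not taken from the presentation: the my_worker binary, the BIOLOGY pool name, and the synthetic input list are illustrative stand-ins; only the submit client flags (-pool, -exe, -args) come from the slides. A real coordinator would more likely pull inputs from and push results to a database, or hand the fan-out to a scheduler such as Condor.

    #!/usr/bin/env python
    # Minimal sketch of a Front-End Node coordinator for Blue Heron HTC jobs.
    # Hypothetical pieces: the "my_worker" binary, the "BIOLOGY" pool name,
    # and the synthetic input list. The submit client acts as a lightweight
    # proxy for each compute-node job, so one local process per job is fine.
    import subprocess

    POOL = "BIOLOGY"
    EXE = "./my_worker"
    inputs = ["input_%03d" % i for i in range(128)]  # stand-in work list

    # Fan out: launch one single-node job per input via the submit client.
    procs = {}
    for arg in inputs:
        cmd = ["submit", "-pool", POOL, "-exe", EXE, "-args", arg]
        procs[arg] = subprocess.Popen(cmd, stdout=subprocess.PIPE)

    # Collect: each proxy's stdout is the corresponding job's stdout.
    results = {}
    for arg, proc in procs.items():
        out, _ = proc.communicate()
        if proc.returncode == 0:
            results[arg] = out.strip()
        else:
            # Only this task is lost; the other single-node jobs are unaffected.
            print("task %s failed with rc=%d" % (arg, proc.returncode))
    print("collected %d results" % len(results))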

© 2008 IBM Corporation 8 How does Blue Heron work? "Software architecture viewpoint". Design goals: lightweight, extreme scalability, flexible scalability, high throughput (fast).

© 2008 IBM Corporation 9 How does Blue Heron work? "End-user perspective". Submitting jobs (typically from the FEN):
- "submit" client: acts as a shadow or proxy for the real job running on the compute node; very lightweight
- Submit jobs to a location or a pool
  - Pool ID: a scheduler alias for a collection of partitions available to run a job on
  - Location: the resource where the job will execute, in the form of a processor or wildcard location
- Example #1 (submit to location): submit -location "R00-M0-N00-J05-C00" -exe hello_world
- Example #2 (submit to pool): submit -pool BIOLOGY -exe hello_world
- Job scheduler example: submit jobs using Condor ("condor_submit"); a generic sketch follows below
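
For the Condor path, jobs are described to condor_submit in a submit description file. The snippet below is a generic, minimal sketch rather than anything from the presentation: the executable name is hypothetical, the vanilla universe is only a common placeholder, and any Blue Gene/P-specific universe or machine attributes would come from the site's Condor configuration.

    # Generic Condor submit description (a sketch; not BG/P-specific).
    # "hello_world" is a hypothetical single-node executable.
    universe   = vanilla
    executable = hello_world
    output     = hello_world.$(Process).out
    error      = hello_world.$(Process).err
    log        = hello_world.log
    # Queue 64 independent single-node jobs.
    queue 64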

© 2008 IBM Corporation 10 Navigator  Viewing active HTC jobs running on Blue Gene partitions (blocks)

© 2008 IBM Corporation 11 Navigator  Viewing HTC job history on Blue Gene

© 2008 IBM Corporation 12 Information Sources: Official Blue Gene website; Blue Gene Redbooks and Redpapers (for the latest list, search for "Blue Gene"); IBM Journal of Research and Development: researchweb.watson.ibm.com/journal/rd/521/team.html; IBM Research site; TOP500 list; Green500 list

© 2008 IBM Corporation 13 Condor using HTC on BG/P Demo: Rosetta++ with MySQL
- Rosetta++ is a protein structure prediction application
- It is very well suited to HTC, since it runs many simulations of the same protein using different random-number seeds
- The run that produces the lowest-energy model among those attempted is the "solution"
- Rosetta++ had already been shown to work on Blue Gene by David Baker's lab; our goal was to show that it runs well in HTC mode
- Very few actual code changes were required:
  - Compiled for Blue Gene, but using the single-node version (no MPI)
  - Changed a few places that did file output to use stdout instead, since that made it easier for the submitting script to associate each task with its results
  - Created a simple database front end, using both DB2 and MySQL, to hold the proteins and the seeds
  - A Perl script reads inputs from the database, submits each task to Condor, and processes the results back into the database (see the sketch below)
- Demonstrates HTC mode using Condor, with perfect linear scaling and no MPI
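
The slide above describes the demo's control flow only at a high level, so here is a minimal, hypothetical sketch of that database-driven loop. None of the names below come from the presentation: the demo used a Perl script with DB2/MySQL, while this sketch uses Python with sqlite3 purely to stay self-contained; the table layout and the run_one_seed() stub are invented, and the actual job launch (condor_submit or the submit client, as shown earlier) is elided behind a placeholder.

    # Sketch of a database-driven HTC coordinator, loosely modeled on the
    # Rosetta++ demo workflow. All table and column names are hypothetical;
    # sqlite3 stands in for the DB2/MySQL front end used in the demo.
    import sqlite3

    def run_one_seed(protein, seed):
        """Placeholder for launching one single-node job (via condor_submit or
        the submit client) and returning the energy of the resulting model.
        It returns a dummy value here so the sketch runs end to end."""
        return float(seed % 100) / 10.0  # not a real simulation

    db = sqlite3.connect("rosetta_demo.db")
    cur = db.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS tasks (protein TEXT, seed INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS results (protein TEXT, seed INTEGER, energy REAL)")

    # Demo data only: seed a few (protein, seed) tasks if the table is empty.
    if cur.execute("SELECT COUNT(*) FROM tasks").fetchone()[0] == 0:
        cur.executemany("INSERT INTO tasks (protein, seed) VALUES (?, ?)",
                        [("1abc", s) for s in range(10)])
        db.commit()

    # Run every task and record its energy.
    for protein, seed in cur.execute("SELECT protein, seed FROM tasks").fetchall():
        energy = run_one_seed(protein, seed)
        cur.execute("INSERT INTO results (protein, seed, energy) VALUES (?, ?, ?)",
                    (protein, seed, energy))
    db.commit()

    # The lowest-energy model per protein is reported as the "solution".
    for protein, best in cur.execute("SELECT protein, MIN(energy) FROM results GROUP BY protein"):
        print("%s: best energy %.2f" % (protein, best))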

© 2008 IBM Corporation 14 Questions?

© 2008 IBM Corporation 15 Backup Slides

© 2008 IBM Corporation 16 What are the Blue Gene System Components? Blue Gene rack(s) (hardware/software); host system: Service Node and Front End (login) Nodes with SuSE SLES 10, the HPC software stack, file servers, storage subsystem, XLF/C compilers, and DB2; 3rd-party Ethernet switch.

© 2008 IBM Corporation 17 Blue Gene Integrated Networks
- Torus: compute nodes only; direct access by the app; DMA
- Collective: compute and I/O nodes attached; 16 routes allow multiple network configurations to be formed; contains an ALU for collective-operation offload; direct access by the app
- Barrier: compute and I/O nodes; low-latency barrier across the system (< 1 usec for 72 racks); used to synchronize time bases; direct access by the app
- 10Gb functional Ethernet: I/O nodes only
- 1Gb private control Ethernet: provides JTAG, I2C, etc. access to the hardware; accessible only from the Service Node
- Clock network: single clock source for all racks

© 2008 IBM Corporation 18 Blue Gene is the most power-, space-, and cooling-efficient supercomputer (published specs per peak performance; comparison chart of platforms including IBM BG/P).

© 2008 IBM Corporation 19 Blue Gene is orders of magnitude more reliable than other platforms. Results of a survey conducted by Argonne National Lab on 10 clusters ranging from 1.2 to 365 TFlops (peak), excluding the storage subsystem, management nodes, SAN network equipment, and software outages. The BG/P figure of <1 is estimated, based on reliability improvements implemented in BG/P compared to BG/L.

© 2008 IBM Corporation 20 Blue Gene Software Hierarchical Organization
- Compute nodes are dedicated to running the user application, and almost nothing else, via the simple Compute Node Kernel (CNK)
- I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination
- The service node performs system management services (e.g., heartbeating, error monitoring), transparent to application software

© 2008 IBM Corporation 21 BG/P Job Modes Allow Flexible Use of Compute Node Resources
- Quad mode (also called Virtual Node mode): all 4 cores run 1 process each; no threading; each process gets 1/4 of the node memory; MPI/HTC programming model
- Dual mode: 2 cores run 1 process each; each process may spawn 1 thread on a core not used by the other process; each process gets 1/2 of the node memory; MPI/OpenMP/HTC programming model
- SMP mode: 1 core runs 1 process; the process may spawn threads on each of the other cores; the process gets the full node memory; MPI/OpenMP/HTC programming model
For example, on a 4 GB compute node this works out to roughly 1 GB per process in quad mode, 2 GB in dual mode, and 4 GB in SMP mode; quad mode is also what yields the 4,096 single-node HTC jobs per 1,024-node rack mentioned earlier.

© 2008 IBM Corporation 22 Why and for What is Blue Gene Used?
- Improve understanding: significantly larger-scale, more complex, and higher-resolution models; new science applications
- Multiscale and multiphysics: from atoms to mega-structures; coupled applications
- Shorter time to solution: answers in minutes instead of months
Application areas: physics and materials science, molecular dynamics, environment and climate modeling, life sciences (sequencing, in-silico trials, drug discovery), biological modeling and brain science, computational fluid dynamics, financial modeling, streaming data analysis, geophysical data processing, and upstream petroleum.

© 2008 IBM Corporation 23 Many Computational Science Modeling and Simulation Algorithms and Numerical Methods are Massively Parallel

© 2008 IBM Corporation 24 What applications fit the Blue Heron model? HTC application identification
- A wide range of applications can run in HTC mode
- Many applications that run on Blue Gene today are "embarrassingly (pleasantly) parallel" or "independently parallel": they don't exploit the torus for MPI communication and just want a large number of small tasks, with a coordinator of results
- Solution statement: a high-throughput computing (HTC) application is one in which the same basic calculation must be performed over many independent input data elements and the results collected. Because each calculation is independent, it is extremely easy to spread the calculations out over multiple cluster nodes. For this reason, high-throughput applications are sometimes called "embarrassingly parallel." HTC applications occur much more frequently than one might think, showing up in areas such as parameter studies, search applications, data analytics, and what-if calculations.
- Identifying an HTC application: there are a number of questions you can use to determine whether your specific computing problem fits into the category of a high-throughput application:
  - Do you need to run many instances of the same application with different arguments or parameters?
  - Do you need to run the same application many times with different input files?
  - Do you have an application that can select subsets of the input data and whose results can be combined by a simple merge process, such as concatenating them, placing them into a single database, or adding them together?
- If the answer to any of these questions is "yes," then it is quite likely that you have an HTC application. Source: Grid.org

© 2008 IBM Corporation 25 How does Blue Heron work? Key features:
- Provides a job submit command that is simple, lightweight, and extremely fast
- Job state is integrated into the Control System database, so administrators know which nodes have jobs and which are idle
- Provides stdin/stdout/stderr on a per-job basis
- Enables individual jobs to be signaled or killed
- Maintains a user ID on a per-job basis (allows multiple users per partition)
- Blue Gene Navigator shows HTC jobs (active or in history) with job exit status and runtime stats
- Designed for easy integration with job schedulers (e.g., Condor, LoadLeveler, SIMPLE)

© 2008 IBM Corporation 26 submit command
Usage: ./submit [options] or ./submit [options] binary [arg1 arg2 ... argn]
Job options:
  [-]-exe          executable to run
  [-]-args         "arg1 arg2 ... argn"; arguments must be enclosed in double quotes
  [-]-env          add an environment variable for the job
  [-]-exp_env      export an environment variable to the job's environment
  [-]-env_all      add all current environment variables to the job's environment
  [-]-cwd          the job's current working directory
  [-]-timeout      number of seconds before the job is killed
  [-]-strace       run the job under system-call tracing
Resource options:
  [-]-mode         the job mode
  [-]-location     compute core location to run the job
  [-]-pool         compute node pool ID to run the job
General options:
  [-]-port         listen port of the submit mux to connect to (default 10246)
  [-]-trace        tracing level (default 6)
  [-]-enable_tty_reporting  disable the default line buffering of stdin, stdout, and stderr when input (stdin) or output (stdout/stderr) is not a tty
  [-]-raise        if a job dies with a signal, submit will raise this signal
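
As a hypothetical usage example combining several of these options (the pool name, mode value, executable, argument, and timeout shown are illustrative, not taken from the presentation), a single-node job could be submitted along these lines:

    # Run ./my_app on one compute node from the BIOLOGY pool in SMP mode,
    # passing one argument and killing the job if it is still running after an hour.
    ./submit -pool BIOLOGY -mode SMP -timeout 3600 -exe ./my_app -args "input.dat"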