IBM Systems and Technology Group © 2007 IBM Corporation
High Throughput Computing on Blue Gene
IBM Rochester: Amanda Peters, Tom Budnik
With contributions from:
IBM Rochester: Mike Mundy, Greg Stewart, Pat McCarthy
IBM Watson Research: Alan King, Jim Sexton
UW-Madison Condor: Greg Thain, Miron Livny, Todd Tannenbaum

IBM Systems and Technology Group © 2007 IBM Corporation 2
Agenda
 Blue Gene Architecture Overview
 High Throughput Computing (HTC) on Blue Gene
 Condor and IBM Blue Gene Collaboration
 Exploratory Application Case Studies for Blue Gene HTC
 Questions and Web resource links

IBM Systems and Technology Group © 2007 IBM Corporation 3
Blue Gene/L Overview
Packaging hierarchy:
 Chip: 2 processors, 2.8/5.6 GF/s
 Compute node: 2 chips, 5.6/11.2 GF/s, 1.0 GB
 Node card: 32 chips (16 compute cards, 0-2 I/O cards), 90/180 GF/s, 16 GB
 Rack: 32 node cards, 1,024 chips, 2.8/5.6 TF/s, 512 GB
 System: 64 racks, 65,536 chips, 180/360 TF/s, 32 TB
Scalable from 1 rack to 64 racks
 Rack has 2,048 processors with 512 MB or 1 GB DRAM per node
 Blue Gene has 5 independent networks (Torus, Collective, Control (JTAG), Global barrier, and Functional 1 Gb Ethernet)
November 2006 Top500 List
 2 in Top10 (#1 and #3)
 9 in Top30
 16 in Top100
 27 overall in Top150

IBM Systems and Technology Group © 2007 IBM Corporation 4
Blue Gene System Architecture
[System diagram: psets (Pset 0 through Pset 1023), each an I/O node running Linux with ciod and an fs client serving compute nodes C-Node 0 through C-Node 63 running CNK and the app; compute nodes interconnected by the torus and collective networks; I/O nodes, front-end nodes, file servers, and the service node on the functional Gigabit Ethernet; the service node (Control System, DB2, resource scheduler, system console) reaching the hardware through the control Gigabit Ethernet, the IDo chips, I2C, and the JTAG network.]

IBM Systems and Technology Group © 2007 IBM Corporation 5
HPC vs. HTC Comparison
 High Performance Computing (HPC) Model
– Parallel, tightly coupled applications: Single Instruction, Multiple Data (SIMD) architecture
– Programming model: typically MPI
– Apps need tremendous amount of computational power over short time period
 High Throughput Computing (HTC) Model
– Large number of independent tasks: Multiple Instruction, Multiple Data (MIMD) architecture
– Programming model: non-MPI
– Apps need large amount of computational power over long time period
– Traditionally run on large clusters
 HTC and HPC modes co-exist on Blue Gene
– Determined when resource pool (partition) is allocated

IBM Systems and Technology Group © 2007 IBM Corporation 6
Why Blue Gene for HTC?
 High processing capacity with minimal floor space
– High compute node density: 2,048 processors in one Blue Gene rack
– Scalability from 1 to 64 racks (2,048 to 131,072 processors)
 Resource consolidation
– Multiple HTC and HPC workloads on a single system
– Optimal use of compute resources
 Low power consumption
– #1 on Green MFlops/Watt
– Twice the performance per watt of a high-frequency microprocessor
 Low cooling requirements enable extreme scale-up
 Centralized system management
– Blue Gene Navigator

IBM Systems and Technology Group © 2007 IBM Corporation 7

IBM Systems and Technology Group © 2007 IBM Corporation 8
Generic HTC Flow on Blue Gene
 One or more dispatcher programs are started on the front end/service node
– The dispatcher manages the HTC work request queue
 A pool (partition) of compute nodes is booted on Blue Gene
– Every compute node has a launcher program started on it that connects back to the designated HTC dispatcher
– New pools of resources can be added dynamically as workload increases
 External work requests are routed to the HTC dispatcher queue
– Single or multiple work requests from each source
 The HTC dispatcher finds an available HTC client and forwards the work request
 The HTC client runs the executable on the compute node
– The launcher program on each compute node handles the work request sent to it by the dispatcher. When the work request completes, the launcher program is reloaded and the client is ready to handle another work request.
 Executable exit status is reported back to the dispatcher
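The dispatcher side of this flow can be pictured with a short sketch. The following is a hypothetical, stripped-down dispatcher loop in plain POSIX C; it is not the Blue Gene control-system or Condor code, only the shape of "hand one work request to each idle launcher and collect its exit status". The port number and the hard-coded work list are invented for illustration.

    /* Hypothetical dispatcher sketch: serve queued work requests to idle
     * launchers over TCP and collect their exit statuses. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define DISPATCH_PORT 7010            /* invented dispatcher port */

    int main(void)
    {
        const char *queue[] = { "/bin/hostname", "/bin/date", "/bin/uptime" };
        size_t next = 0, total = sizeof queue / sizeof queue[0];

        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(DISPATCH_PORT);
        if (bind(lsock, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(lsock, 128) < 0)
            return 1;

        while (next < total) {
            /* In this simplified model a launcher connects whenever it is
             * idle; the connection itself is the "I am free" signal. */
            int c = accept(lsock, NULL, NULL);
            if (c < 0)
                continue;

            /* Forward the next work request (here just an executable path,
             * sent NUL-terminated). */
            write(c, queue[next], strlen(queue[next]) + 1);

            /* Collect the exit status the launcher reports back
             * (a raw int, fine for a same-host sketch). */
            int status = -1;
            read(c, &status, sizeof status);
            printf("task %zu -> exit status %d\n", next, status);

            close(c);
            next++;
        }
        close(lsock);
        return 0;
    }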

IBM Systems and Technology Group © 2007 IBM Corporation 9
Generic HTC Flow on Blue Gene
HTC activates one launcher thread on each node; the thread restarts when exec() terminates or fails.
Node launcher pseudocode: { w = read(fd); exec(w); }
[Diagram: a stream of work requests ("work-rqst1", "w2", ... "w7") arriving at the dispatcher, which farms them out (w1-w7) across the nodes of a Blue Gene HTC partition.]
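To make the launcher pseudocode concrete, here is a minimal, self-contained sketch that pairs with the dispatcher sketch on the previous slide: connect, read one work request, run it, report the exit status, repeat. On CNK the real launcher exec()s the work request directly, as in the pseudocode above, and is reloaded by the control system when it terminates; fork()/waitpid() is used here only so the sketch runs as one ordinary Linux process. The dispatcher address and port are placeholders.

    /* Launcher sketch for an ordinary Linux box (not the CNK launcher). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in disp = { 0 };
            disp.sin_family = AF_INET;
            disp.sin_port   = htons(7010);                    /* placeholder */
            inet_pton(AF_INET, "127.0.0.1", &disp.sin_addr);  /* placeholder */
            if (connect(fd, (struct sockaddr *)&disp, sizeof disp) < 0) {
                close(fd);
                break;                          /* dispatcher gone: stop */
            }

            char w[4096] = { 0 };               /* the work request */
            if (read(fd, w, sizeof w - 1) <= 0) {
                close(fd);
                break;
            }

            int status = -1;
            pid_t pid = fork();
            if (pid == 0) {
                execl(w, w, (char *)NULL);      /* run the executable */
                _exit(127);                     /* exec failed */
            }
            waitpid(pid, &status, 0);

            write(fd, &status, sizeof status);  /* report exit status back */
            close(fd);
        }
        return 0;
    }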

IBM Systems and Technology Group © 2007 IBM Corporation 10
Node Resiliency for HTC
 In HPC mode a single failing node in a partition (pool of compute nodes) causes termination of all nodes in the partition
– Expected behavior for parallel MPI-type apps, but unacceptable for HTC apps
– An HTC mode partition handles this situation
 In HTC mode Blue Gene can recover from soft node failures
– For example, parity errors
– If the failure is not related to network hardware, a software reboot will recover the node
  Other nodes in the partition are unaffected and continue to run jobs
  The job on the failed node is terminated and must be resubmitted by the dispatcher
– If the partition is started in HTC mode, the Control System will poll at regular intervals looking for nodes in the reset state
  Nodes in the reset state will be rebooted and the launcher restarted on them
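The reset-state polling can be sketched as pseudologic only: the two helper functions below are stubs standing in for control-system services (the real Blue Gene interfaces are not part of these slides), and the node count, the "faulted" node number, and the interval are invented.

    /* Pseudologic sketch of polling an HTC partition for nodes in reset. */
    #include <stdio.h>
    #include <unistd.h>

    enum node_state { NODE_RUNNING, NODE_RESET };

    static enum node_state query_node_state(int node)        /* stub */
    {
        return (node == 17) ? NODE_RESET : NODE_RUNNING;     /* pretend node 17 hit a parity error */
    }

    static void reboot_node_and_restart_launcher(int node)   /* stub */
    {
        printf("soft-rebooting node %d and restarting its launcher\n", node);
    }

    int main(void)
    {
        const int num_nodes = 64;                /* one pool, for example */
        for (int pass = 0; pass < 3; pass++) {   /* a few polling passes */
            for (int n = 0; n < num_nodes; n++) {
                /* A node hit by a soft error shows up in the reset state:
                 * reboot it and restart its launcher.  The job it was running
                 * is lost and must be resubmitted by the dispatcher; all other
                 * nodes in the partition keep running untouched. */
                if (query_node_state(n) == NODE_RESET)
                    reboot_node_and_restart_launcher(n);
            }
            sleep(5);                            /* polling interval */
        }
        return 0;
    }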

IBM Systems and Technology Group © 2007 IBM Corporation 11
Condor and IBM Blue Gene Collaboration
 Both the IBM and Condor teams are engaged in adapting code to bring the Condor and Blue Gene technologies together
 Initial collaboration (Blue Gene/L)
– Prototype/research Condor running HTC workloads on Blue Gene/L
  Condor-developed dispatcher/launcher running HTC jobs
  Prototype work for Condor being performed on the Rochester On-Demand Center Blue Gene system
 Mid-term collaboration (Blue Gene/L)
– Condor supports HPC workloads along with HTC workloads on Blue Gene/L
 Long-term collaboration (next-generation Blue Gene)
– I/O node exploitation with Condor
– Partner in design of HTC services for next-generation Blue Gene: standardized launcher, boot/allocation services, job submission/tracking via database, etc.
– Study ways to automatically switch between HTC/HPC workloads on a partition
– Data persistence (persisting data in memory across executables) and data affinity scheduling
– Petascale environment issues

12
Condor Architecture
[Diagram: a Submit Machine running Submit, the Schedd, and a Shadow per job; an Execute Machine running the Startd, which spawns a Starter that runs the Job; a Central Manager running the Collector and Negotiator.]

13
Condor with Blue Gene/L
[Diagram: the same Condor architecture, with the Startd, Starter, Dispatcher, and mpirun running on a Blue Gene I/O node; the Dispatcher feeds Launchers on the Blue Gene compute nodes, each of which runs a Job; the Submit Machine (Submit, Schedd, Shadow) and Central Manager (Collector, Negotiator) are unchanged.]

IBM Systems and Technology Group © 2007 IBM Corporation 14
Exploratory Application Case Studies for Blue Gene HTC
 Case Study #1: Financial overnight risk calculation for a trading portfolio
– Large number of calculations to be completed by market opening
– Algorithm is Monte Carlo simulation: easy to distribute and robust to resource failure (fewer simulations just gives a less accurate result)
– Grid middleware bundles tasks into relatively long-running jobs (45 minutes)
– Limiting resource is the number of CPUs
– In some cases power density (kW/sq foot) is critical
 Case Study #2: Molecular docking code for virtual drug screening
– Docking simulation algorithm for screening large databases of potential drugs against targets
– Large number of independent calculations to determine the minimization energy between the target and each potential candidate, and subsequently find the strongest leads
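To make the "fewer simulations just gives a less accurate result" property concrete, here is a toy Monte Carlo worker (it estimates pi, not portfolio risk): each independent task draws its own batch of samples and prints a partial estimate, and the submitter simply averages whatever estimates come back, so a lost task shrinks the sample count without breaking the aggregate.

    /* Toy Monte Carlo worker: one independent task in an HTC-style batch. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        long samples  = (argc > 1) ? atol(argv[1]) : 1000000;
        unsigned seed = (argc > 2) ? (unsigned)atol(argv[2]) : 1;  /* per-task seed */
        srand(seed);

        long hits = 0;
        for (long i = 0; i < samples; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;                 /* point fell inside the unit circle */
        }
        /* Print this task's partial estimate; the submitter averages them. */
        printf("%f\n", 4.0 * (double)hits / (double)samples);
        return 0;
    }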

IBM Systems and Technology Group © 2007 IBM Corporation 15
Exploratory Application Case Studies for Blue Gene HTC
 Experience results:
– Demonstrated scalable task dispatch to thousands of processors
– Successfully verified the multiple-dispatcher architecture
– Discovered the optimal ratio of dispatcher to partition (pool) size is 1:64 or less
  Latencies increase as the ratio grows above this level, possibly due to launcher contention for socket resources as scaling increases (still investigating in this area)
  May depend on task duration and arrival rates
– Running in HTC mode changes the I/O patterns
  Typical MPI programs read and write to the file system with small buffer sizes; HTC requires loading the full executable into memory and sending it to the compute node
– The launcher is cached on the I/O node but not the executable
  Experiments with delaying dispatch proportional to executable size, for effective task distribution across partitions, were successful
– Due to I/O node to compute node bandwidth, a low compute node to I/O node ratio is desirable for the fastest throughput

IBM Systems and Technology Group © 2007 IBM Corporation 16 Questions?    Web resources:

IBM Systems and Technology Group © 2007 IBM Corporation 17 Backup Slides

IBM Systems and Technology Group © 2007 IBM Corporation 18
Blue Gene Software Stack
 Compute Node: Compute Node Kernel, run-time, MPI, application
 I/O Node: Linux, file system, debuggers, GNU tools, CIOD
 Front-end Node: Linux, XL compilers, mpirun front-end, debuggers
 Service Node: Linux, Proxy, MMCS, resource scheduler, CIODB, mpirun back-end, DB2 & firmware, Navigator

IBM Systems and Technology Group © 2007 IBM Corporation 19
[Diagram: a Client connected to the Dispatcher, which contains a task submission thread, a work queue, a result queue, and a task verification thread, and drives the I/O nodes and compute nodes of a BG partition.]

IBM Systems and Technology Group © 2007 IBM Corporation 20
[Task lifecycle diagram, Submitter / Dispatcher / Launcher:
1. Submitter submits task N to the work queue
2. Dispatcher reads task N
3. Launcher is booted and connects to the Dispatcher
4. Dispatcher dispatches task N
5. Launcher starts task N
6. Task N exits
7. Launcher is rebooted, reconnects to the Dispatcher, and sends the task N status
8. Dispatcher writes the task N status
9. Submitter reads the task N status off the results queue]

IBM Systems and Technology Group © 2007 IBM Corporation 21 Node Resiliency