Resource Management Issues in Large-Scale Cluster. Yutaka Ishikawa, Computer Science Department / Information Technology Center, The University of Tokyo

Presentation transcript:

Resource Management Issues in Large-Scale Cluster
Yutaka Ishikawa
Computer Science Department / Information Technology Center, The University of Tokyo
2007/11/21, The University of Tokyo

Outline
- Jittering
- Memory Affinity
- Power Management
- Bottleneck Resource Management

Issues: Jittering Problem
- The execution of a parallel application is disturbed by system processes running independently on each node. This delays global operations such as allreduce (see the measurement sketch below).
References:
- Terry Jones, William Tuel, Brian Maskell, "Improving the Scalability of Parallel Jobs by Adding Parallel Awareness to the Operating System," SC2003.
- Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," SC2003.
[Figure: processes #0-#3 on each node, interrupted at uncoordinated times]
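To make the effect of uncoordinated daemon activity concrete, here is a minimal C/MPI sketch (not the authors' code) that times repeated allreduce operations; on a jittery cluster, a few iterations stand out as much slower than the rest.

```c
/* Illustrative sketch (not the authors' code): time repeated MPI_Allreduce
 * calls so that occasional slow iterations, caused by daemon activity on
 * any one node, become visible as outliers. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = 1.0, out;
    for (int i = 0; i < 1000; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("iter %d: %.1f us\n", i, (t1 - t0) * 1e6);
    }
    MPI_Finalize();
    return 0;
}
```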

Jittering Problem: Our Approach
- Clusters usually have two types of networks:
  - Network for computing (e.g., Myrinet, InfiniBand)
  - Network for management (e.g., Gigabit Ethernet)
- The management network is used to deliver a global clock:
  - The local interval timer is turned off.
  - A broadcast packet is sent from the global clock generator.
- Gang scheduling is employed for all system and application processes.
[Figure: computing network and management network, with the global clock generator attached to the management network]

Jittering Problem: Preliminary Experience
- The management network is used to deliver the global clock.
- The interval timer is turned off.
- On each arrival of the special broadcast packet, the tick counter is updated (the kernel code has been modified; the sketch below illustrates the idea in user space).
- No cluster daemons, such as a batch scheduler or an information daemon, are running, but system daemons are running.
Experimental environment:
- CPU: AMD Opteron
- Memory: 2 GB
- Network: Myri-10G; BCM5721 Gigabit Ethernet
- Number of hosts: 16
- Kernel: modified Linux x86_64
- MPI: mpich-mx
- MX version:
- Daemons: syslog, portmap, sshd, sysstat, netfs, nfslock, autofs, acpid, mx, ypbind, rpcgssd, rpcidmapd, network
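The actual mechanism modifies the kernel so that the tick counter advances on each global-clock broadcast instead of on the local interval timer. As a hedged illustration only, here is a user-space C analogue that listens for broadcast datagrams on the management network and advances a counter; the port number and packet format are assumptions, not details from the slides.

```c
/* User-space analogue of the global-clock idea (the slide describes a
 * kernel modification; this sketch only illustrates the principle).
 * Port 9999 and the packet format are assumptions, not from the slides. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9999);          /* assumed broadcast port */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    uint64_t tick = 0;
    char buf[64];
    for (;;) {
        /* Each broadcast from the global clock generator advances the
         * node-local tick counter, so all nodes tick in lockstep. */
        if (recv(sock, buf, sizeof(buf), 0) > 0)
            tick++;
        if (tick % 100 == 0)
            printf("tick %llu\n", (unsigned long long)tick);
    }
    return 0;
}
```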

Preliminary Global Clock Experience
[Figure: NAS Parallel Benchmark MG, elapsed time in seconds for 20 executions sorted in ascending order; '+' marks runs without the global clock, 'x' marks runs with the global clock]

Preliminary Global Clock Experience
[Figure: NAS Parallel Benchmark FT, elapsed time in seconds; '+' without the global clock, 'x' with the global clock]

Preliminary Global Clock Experience
[Figure: NAS Parallel Benchmark CG, elapsed time in seconds; '+' without the global clock, 'x' with the global clock]

What kinds of heavy daemons run in a cluster?
Batch job system (Torque as an example):
- Every 1 second, the daemon runs for about 50 microseconds.
- Every 45 seconds, the daemon runs for about 8 milliseconds.
Monitoring system:
- Not yet measured.
Simple formulation of the worst-case overhead:
  Overhead = Σ_i MIN(TI_i, TR_i × N) / TI_i
  where N is the number of nodes, TI_i is the interval of daemon i, and TR_i is the running time of daemon i.
For a 1000-node cluster: 0.00005 × 1000 / 1 + 0.008 × 1000 / 45 ≈ 22.8 %.
The worst case might never happen! (A small calculation of this figure is sketched below.)
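The 22.8 % figure can be reproduced directly from the formula; here is a small C check under the slide's assumptions (1000 nodes and the two Torque activities listed above).

```c
/* Worked example of the worst-case-overhead formula on the slide,
 * using the Torque figures: 50 us every 1 s and 8 ms every 45 s
 * on a 1000-node cluster. */
#include <stdio.h>

static double min(double a, double b) { return a < b ? a : b; }

int main(void)
{
    const int N = 1000;                     /* number of nodes */
    const double TI[] = { 1.0, 45.0 };      /* daemon intervals (s) */
    const double TR[] = { 50e-6, 8e-3 };    /* daemon running times (s) */

    double overhead = 0.0;
    for (int i = 0; i < 2; i++)
        overhead += min(TI[i], TR[i] * N) / TI[i];

    printf("worst-case overhead: %.1f %%\n", overhead * 100.0);  /* ~22.8 % */
    return 0;
}
```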

Issues on NUMA
Memory affinity in NUMA:
- CPU to memory
- Network (NIC) to memory
An example of network and memory:
[Figure: two nodes, each with two dual-core CPUs, two memory banks, NFP 3600 / NFP 3050 chipsets, and a NIC; the memory bank on the NIC side is "near" for communication, the other bank is "far"]

Memory Location and Communication
[Figure: measured communication performance for different placements of processes (P), memory (M), chipsets (C), and NICs (N); the result depends on the BIOS settings]
- Communication performance depends on data location.
- Data is also accessed by the CPU, so the location of data should be determined based on both the CPU and network locations.
- Is a dynamic data migration mechanism needed? (A static placement sketch follows.)
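One way to act on this observation is to place communication buffers, and the process that uses them, on the NUMA node closest to the NIC. The sketch below uses libnuma purely for illustration; the NIC-side node number is an assumption, and the slides do not prescribe this particular API (compile with -lnuma).

```c
/* Minimal sketch (assumes libnuma is available) of pinning a communication
 * buffer onto the NUMA node that is "near" the NIC, as motivated by the
 * slide.  The node number is an assumption for illustration. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int nic_node = 0;                       /* assumed: NIC attached to node 0 */
    size_t size = 64 * 1024 * 1024;

    /* Allocate the send/receive buffer on the NIC-side node ... */
    void *buf = numa_alloc_onnode(size, nic_node);
    /* ... and run the process on the same node so CPU and network
     * both access near memory. */
    numa_run_on_node(nic_node);

    /* communication would use buf here */

    numa_free(buf, size);
    return 0;
}
```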

Power Management: Power Consumption Issue
- A 100 Tflops cluster machine has 1666 nodes.
- At 80 % machine resource utilization, 332 nodes are idle:
  - About 66 kW is wasted by the idle nodes, roughly 55 K$ (6.6 million yen) per year. This is an underestimate, because the measured node has a small memory and no network switches are included.
  - About 10.6 kW is wasted even though those nodes are powered off, roughly 9 K$ (1.1 million yen) per year (see the check below).
Power consumption of a single node (Supermicro AS-2021-M-UR+V, Opteron 2347 × 2 (Barcelona, 1.9 GHz, 60.8 Gflops), 4 GB memory, InfiniBand HCA × 2, Fedora Core 7; measured with a Fluke 105B digital ammeter):
- HPL running (not optimized): 2.92 A
- Idle (1.9 GHz): 2.44 A
- Idle (1.0 GHz): 2.02 A
- Suspended: 1.61 A
- No power but power cable plugged in (BMC running): 0.32 A
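The following is a back-of-the-envelope check of these figures in C. It assumes roughly 100 V mains (consistent with 332 nodes × 0.32 A ≈ 10.6 kW) and an electricity price chosen to land near the quoted 55 K$/year; both assumptions are mine, not the slides'.

```c
/* Back-of-the-envelope check of the slide's numbers.  Assumes ~100 V mains
 * (Japan), which is consistent with 0.32 A * 332 nodes ~= 10.6 kW; the
 * electricity price is an assumption chosen to reproduce ~55 K$/year. */
#include <stdio.h>

int main(void)
{
    const double volts = 100.0;            /* assumed mains voltage */
    const int idle_nodes = 332;            /* 20 % of 1666 nodes */
    const double idle_amps = 2.02;         /* idle at 1.0 GHz, from the table */
    const double off_amps = 0.32;          /* powered off, BMC still running */
    const double price_per_kwh = 0.095;    /* assumed US$/kWh */

    double idle_kw = idle_nodes * idle_amps * volts / 1000.0;   /* ~66 kW   */
    double off_kw  = idle_nodes * off_amps  * volts / 1000.0;   /* ~10.6 kW */

    printf("idle waste: %.1f kW, ~%.0f K$/year\n",
           idle_kw, idle_kw * 8760.0 * price_per_kwh / 1000.0);
    printf("off  waste: %.1f kW, ~%.0f K$/year\n",
           off_kw, off_kw * 8760.0 * price_per_kwh / 1000.0);
    return 0;
}
```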

Power Management: Cooperating with the Batch Job System
- Idle machines are turned off.
- When those machines are needed again, they are turned on through the BMC using the IPMI (Intelligent Platform Management Interface) protocol.
- However, we still lose about 300 mA per idle machine even after it is turned off.
- A quick shutdown/restart and synchronization mechanism is needed.
[Figure: timeline in which the batch job system turns a node off when it becomes idle while JOB1/JOB2 run, turns it back on when JOB3 is submitted, and dispatches JOB3 once the node is back in service]
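As a minimal sketch of how a scheduler could drive node power state over IPMI, the following uses the standard ipmitool CLI through system(); the host name and credentials are placeholders, not values from the slides.

```c
/* Sketch of how a batch scheduler could power idle nodes off and on through
 * each node's BMC.  Uses the ipmitool CLI via system(); host names and
 * credentials are placeholders, not from the slides. */
#include <stdio.h>
#include <stdlib.h>

static int node_power(const char *bmc_host, const char *action)
{
    char cmd[256];
    /* "chassis power on|off|status" are standard ipmitool subcommands. */
    snprintf(cmd, sizeof(cmd),
             "ipmitool -I lanplus -H %s -U admin -P secret chassis power %s",
             bmc_host, action);
    return system(cmd);
}

int main(void)
{
    node_power("node042-bmc", "off");   /* node became idle: turn it off   */
    /* ... later, when a job is dispatched to it ... */
    node_power("node042-bmc", "on");    /* turn it back on before dispatch */
    return 0;
}
```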

Bottleneck Resource Management
What are bottleneck resources?
- A cluster machine has many internal resources, while some shared resources are limited.
- When the whole cluster accesses such a resource, overloading or congestion happens.
Examples:
- Internet: the nodes' aggregate bandwidth (10 GB/sec × N) funnels into a single 10 GB/sec external link. We have been focusing on bottleneck links in GridMPI.
- Global file system: from the file system's viewpoint, N file operations are performed independently, where N is the number of nodes.

Summary
We have presented issues in large-scale clusters:
- Jittering
- Memory affinity
- Power management
- Bottleneck resource management