KAIST Computer Architecture Lab.
The Effect of Multi-core on HPC Applications in Virtualized Systems
Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin Kwon¹, Young-ri Choi², and Jaehyuk Huh¹
¹ KAIST (Korea Advanced Institute of Science and Technology)
² KISTI (Korea Institute of Science and Technology Information)

Outline
– Virtualization for HPC
– Virtualization on Multi-core
– Virtualization for HPC on Multi-core
– Methodology
– PARSEC (shared memory model)
– NPB (MPI model)
– Conclusion

Benefits of Virtualization
– Improve system utilization by consolidation
– Support for multiple types of OSes on a system
– Fault isolation
– Flexible resource management
– Cloud computing
(Diagram: VMs running Windows, Linux, and Solaris on a Virtual Machine Monitor over the hardware; with cloud computing, several such hosts form a cloud)

Virtualization for HPC
Benefits of virtualization
– Improve system utilization by consolidation
– Support for multiple types of OSes on a system
– Fault isolation
– Flexible resource management
– Cloud computing
HPC is performance-sensitive, and therefore resource-sensitive
Virtualization can help HPC workloads

Virtualization on Multi-core
– More VMs on a physical machine
– More complex memory hierarchy (NUCA, NUMA)
(Diagram: many VMs, one per core, sharing caches and memory)

Challenges
– VM management cost (scheduling, memory, communication, I/O multiplexing in the Virtual Machine Monitor)
– Semantic gaps (vCPU scheduling, NUMA)

Virtualization for HPC on Multi-core
– Virtualization may help HPC
– Virtualization on multi-core may have some overheads
– For servers, improving system utilization is a key factor; for HPC, performance is the key factor
How much overhead is there? Where does it come from?

Machines
Single-socket system
– 12-core AMD processor
– Uniform memory access latency
– Two 6MB L3 caches, each shared by 6 cores
Dual-socket system
– Two 4-core Intel processors
– Non-uniform memory access latency
– Two 8MB L3 caches, each shared by 4 cores
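
A quick way to see the difference between the two memory topologies is to query the OS. The sketch below is not from the slides; it is a minimal illustration assuming Linux with libnuma installed, and it prints how CPUs map to NUMA nodes, which is uniform on a UMA machine and split across two nodes on the dual-socket system.

```c
/* Sketch: report the NUMA topology seen by Linux/libnuma.
 * Build with: gcc topo.c -lnuma
 * Assumption: libnuma is available; the printed counts are machine-dependent. */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        printf("No NUMA support: memory access latency is uniform (UMA)\n");
        return 0;
    }
    int nodes = numa_max_node() + 1;          /* e.g., 2 on a dual-socket box */
    int cpus  = numa_num_configured_cpus();   /* e.g., 8 or 12 */
    printf("%d NUMA node(s), %d CPU(s)\n", nodes, cpus);
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %2d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```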

Workloads
PARSEC (shared memory model)
– Input: native
– On one machine (single and dual socket)
– Fixed: one VM; varied: 1, 4, or 8 vCPUs
NAS Parallel Benchmark (MPI model)
– Input: class C
– On two dual-socket machines connected by a 1Gb Ethernet switch
– Fixed: 16 vCPUs in total; varied: 2 to 16 VMs

PARSEC – Single Socket
– Single socket: no NUMA effect
– Very low virtualization overheads: 2~4%
(Plot: execution times normalized to native runs)

PARSEC – Single Socket, Pinned
– Single socket + pin each vCPU to a pCPU
– Reduces semantic gaps by preventing vCPU migration
– vCPU migration has a negligible effect: results are similar to the unpinned case
(Plot: execution times normalized to native runs)
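
Pinning here happens at the hypervisor level, mapping each vCPU to a fixed pCPU (in Xen this is typically done through the toolstack's vcpu-pin command). As a host-level analogy only, here is a minimal sketch assuming Linux and that a vCPU is backed by an ordinary host thread (as in KVM/QEMU): it pins the calling thread to one CPU so the scheduler can no longer migrate it.

```c
/* Sketch: pin the calling thread to a single CPU with sched_setaffinity.
 * Analogy only: the study pins Xen vCPUs to pCPUs in the VMM, not guest threads. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 = calling thread; returns 0 on success */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void) {
    if (pin_to_cpu(3) != 0) {   /* CPU index 3 is arbitrary */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running only on CPU 3\n");
    return 0;
}
```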

PARSEC – Dual Socket
– Dual socket, unpinned vCPUs
– The NUMA effect (a semantic gap) causes a significant increase in overheads: 16~37%
(Plot: execution times normalized to native runs)

PARSEC – Dual Socket, Pinned
– Dual socket, pinned vCPUs
– Pinning may also reduce the NUMA effect
– Reduced overheads with 1 and 4 vCPUs
(Plot: execution times normalized to native runs)

Xen and NUMA Machines
Memory allocation policy
– Allocate up to a 4GB chunk on one socket
Scheduling policy
– Pin the VM to the socket where its memory was allocated; nothing more
Pinning 1~4 vCPUs on the socket where memory is allocated is possible, but impossible with 8 vCPUs (each socket has only 4 cores)
(Diagram: four VMs, each with its memory placed on one of the two sockets)
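
The policy above amounts to keeping a VM's memory and its execution on the same socket. Below is a minimal user-level sketch of that idea, assuming Linux with libnuma; this is ordinary process code for illustration, not Xen's actual allocator or scheduler.

```c
/* Sketch: keep memory and execution on the same NUMA node with libnuma.
 * Illustrates the "allocate on one socket, pin to it" idea at process level. */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    int node = 0;                               /* target socket, illustrative */
    size_t size = 64UL << 20;                   /* 64 MB, a stand-in for VM memory */

    void *mem = numa_alloc_onnode(size, node);  /* back the buffer with node-0 pages */
    if (!mem) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    numa_run_on_node(node);                     /* restrict execution to node 0's cores */

    memset(mem, 0, size);                       /* touch the pages so they are faulted in locally */
    numa_free(mem, size);
    return 0;
}
```

With 8 vCPUs the VM needs cores on both sockets, so this same-socket discipline cannot be kept; that is what motivates the schemes on the next slide.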

Mitigating NUMA Effects
Range pinning
– Pin the vCPUs of a VM to one socket
– Works only if the number of vCPUs is no larger than the number of cores on a socket
– Range-pinned (best): the VM's memory is on the same socket
– Range-pinned (worst): the VM's memory is on the other socket
NUMA-first scheduler (sketched in code below)
– If there is an idle core on the socket where the VM's memory is allocated, pick it
– If not, pick any core in the machine
– Not all vCPUs are active all the time (synchronization or I/O), so the preferred socket often has an idle core
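
The NUMA-first decision rule is simple enough to state in a few lines. The sketch below is a toy illustration: core_idle, NODE_OF, and vcpu_t are hypothetical stand-ins, not the Xen scheduler's real data structures or interfaces.

```c
/* Sketch of the NUMA-first placement rule described above, on a toy machine. */
#include <stdio.h>

#define NCORES 8
#define CORES_PER_NODE 4
#define NODE_OF(core) ((core) / CORES_PER_NODE)

static int core_idle[NCORES] = {0, 0, 1, 0, 1, 1, 0, 1};  /* toy machine state */

typedef struct { int id; int home_node; } vcpu_t;

/* Pick a physical core for a vCPU that has just become runnable:
 * prefer an idle core on the socket holding the VM's memory,
 * otherwise fall back to any idle core in the machine. */
static int numa_first_pick(const vcpu_t *v) {
    for (int c = 0; c < NCORES; c++)
        if (core_idle[c] && NODE_OF(c) == v->home_node)
            return c;
    for (int c = 0; c < NCORES; c++)
        if (core_idle[c])
            return c;
    return -1;  /* no idle core: leave the decision to the normal scheduler */
}

int main(void) {
    vcpu_t v = { .id = 0, .home_node = 0 };
    printf("vCPU %d placed on core %d\n", v.id, numa_first_pick(&v));
    return 0;
}
```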

Range Pinning
– For the 4-vCPU case
– Range-pinned (best) ≈ pinned
(Plot: execution times normalized to native runs)

NUMA-first Scheduler
– For the 8-vCPU case
– Significant improvement from the NUMA-first scheduler
(Plot: execution times normalized to native runs)

VM Granularity for the MPI Model
Fine-grained VMs
– Few processes in a VM
– Small VM: few vCPUs, little memory
– Fault isolation among processes in different VMs
– Many VMs on a machine
– MPI communication mostly goes through the VMM
Coarse-grained VMs
– Many processes in a VM
– Large VM: many vCPUs, much memory
– A single failure point for all processes in a VM
– Few VMs on a machine
– MPI communication mostly stays within a VM

NPB – VM Granularity
– The total work is the same for every granularity
– 2 VMs: each VM has 8 vCPUs and 8 MPI processes
– 16 VMs: each VM has 1 vCPU and 1 MPI process
– Overheads range from 11% to 54%
(Plot: execution times normalized to native runs)

NPB – VM Granularity
Fine-grained VMs have significant overheads (avg. 54%)
– MPI communication mostly goes through the VMM; worst in CG, which has a high communication ratio
– Small memory per VM
– VM management costs in the VMM
Coarse-grained VMs have much lower overheads (avg. 11%)
– Still dual socket, but less overhead than the shared memory model: the bottleneck moves to communication
– MPI communication stays largely within a VM
– Large memory per VM

Conclusion
Questions on virtualization for HPC on multi-core systems
– How much overhead is there?
– Where does it come from?
For the shared memory model
– Without NUMA: little overhead
– With NUMA: large overheads from semantic gaps
For the MPI model
– Less NUMA effect: communication becomes the dominant factor
– Fine-grained VMs have large overheads: communication mostly goes through the VMM; small memory per VM; VM management cost
Future work
– NUMA-aware VMM scheduler
– Optimizing communication among VMs within a machine

Thank you!

Backup slides

PARSEC CPU Usage
– Environment: native Linux, with only 8 cores enabled (8-thread mode)
– Sample CPU usage every second, then average the samples
– For all workloads, usage is below 800% (the fully parallel maximum), so NUMA-first can work
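
For reference, the per-second sampling described above can be reproduced with standard Linux facilities. The sketch below is illustrative rather than the exact scripts behind the slide; it assumes the usual /proc/stat field order, and it reports aggregate utilization (0~100%), whereas the slide's 800% figure sums per-core utilization over 8 cores as tools like top do.

```c
/* Sketch: sample aggregate CPU utilization from /proc/stat once per second
 * and average the samples. Assumes the conventional field order:
 * user nice system idle iowait ... */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void read_cpu(unsigned long long *busy, unsigned long long *total) {
    unsigned long long u, n, s, i, w;
    FILE *f = fopen("/proc/stat", "r");
    if (!f || fscanf(f, "cpu %llu %llu %llu %llu %llu", &u, &n, &s, &i, &w) != 5) {
        perror("/proc/stat");
        exit(1);
    }
    fclose(f);
    *busy  = u + n + s;
    *total = u + n + s + i + w;
}

int main(void) {
    const int samples = 10;                 /* sampling window is illustrative */
    unsigned long long b0, t0, b1, t1;
    double sum = 0.0;

    read_cpu(&b0, &t0);
    for (int k = 0; k < samples; k++) {
        sleep(1);
        read_cpu(&b1, &t1);
        sum += 100.0 * (double)(b1 - b0) / (double)(t1 - t0);  /* last second's utilization */
        b0 = b1;
        t0 = t1;
    }
    printf("average CPU utilization: %.1f%%\n", sum / samples);
    return 0;
}
```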