Published by Nathaniel Manning. Modified over 9 years ago.
1
Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang, zhangxiang@ncic.ac.cn
2
Introduction
Virtualization is now one of the enabling technologies of cloud computing.
Many HPC providers now offer their systems as platforms for cloud/utility computing. These HPC-on-demand offerings include:
– Penguin's POD
– IBM's Computing On Demand service
– R Systems' dedicated hosting service
– Amazon’s EC2
3
Introduction: Virtualizing HPC clouds?
Pros:
– Good manageability
– Proactive fault tolerance
– Performance isolation
– Online system maintenance
Cons:
– Performance gap: virtualized environments lack low-latency interconnects, which are important to tightly coupled MPI applications
VMM-bypass I/O has been proposed to address this concern.
4
Introduction: VMM-bypass I/O Virtualization
The Xen split device driver model is used only to set up the necessary user access points.
Data communication in the critical path bypasses both the guest OS and the VMM.
Figure: VMM-Bypass I/O (courtesy [7])
5
Introduction: InfiniBand Overview
InfiniBand is a popular high-speed interconnect:
– OS-bypass/RDMA
– Latency: ~1 us
– Bandwidth: 3300 MB/s
About 41.4% of Top500 systems now use InfiniBand as the primary interconnect.
Figure: Interconnect Family / Systems, June 2010. Source: http://www.top500.org
6
Introduction: InfiniBand Scalability Problem
Reliable Connection (RC)
– Queue Pair (QP): each QP consists of a Send Queue (SQ) and a Receive Queue (RQ)
– QPs require memory
– Conns/Process: (N-1)×C
Shared Receive Queue (SRQ)
eXtensible Reliable Connection (XRC)
– XRC domain and SRQ-based addressing
– Conns/Process: (N-1)
N: node count; C: cores per node
7
Problem Statement
Does a scalability gap exist between native and virtualized environments?
– C_V: cores per VM

Environment | Transport | QPs per Process | QPs per Node
Native      | RC        | (N-1)×C         | (N-1)×C^2
Native      | XRC       | (N-1)           | (N-1)×C
VM          | RC        | (N-1)×C         | (N-1)×C^2
VM          | XRC       | (N-1)×(C/C_V)   | (N-1)×(C^2/C_V)

A scalability gap exists!
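The formulas in the table can be evaluated for a concrete cluster with the minimal sketch below. The function names and the 8-node, 16-core example cluster are illustrative assumptions and do not come from the slides. The sketch also assumes a fully connected job and that C is divisible by C_V.

```python
def qps_per_process(nodes, cores_per_node, transport, cores_per_vm=None):
    """QPs one MPI process needs in a fully connected job.

    cores_per_vm=None models a native (non-virtualized) node.
    Assumes cores_per_node is divisible by cores_per_vm.
    """
    remote_nodes = nodes - 1
    if transport == "RC":
        # One QP per remote process, virtualized or not: (N-1) x C
        return remote_nodes * cores_per_node
    if transport == "XRC":
        if cores_per_vm is None:
            # Native XRC: one QP per remote node: (N-1)
            return remote_nodes
        # Unmodified XRC inside VMs: one QP per remote VM: (N-1) x (C / C_V)
        return remote_nodes * (cores_per_node // cores_per_vm)
    raise ValueError(f"unknown transport: {transport!r}")


def qps_per_node(nodes, cores_per_node, transport, cores_per_vm=None):
    """QPs held by all processes of one physical node (per-process count x C)."""
    return cores_per_node * qps_per_process(
        nodes, cores_per_node, transport, cores_per_vm
    )


if __name__ == "__main__":
    N, C = 8, 16  # hypothetical cluster size, chosen only for illustration
    for cv in (None, 8, 4, 2, 1):
        label = "native" if cv is None else f"C_V={cv}"
        print(label,
              qps_per_process(N, C, "XRC", cv),
              qps_per_node(N, C, "XRC", cv))
```

With XRC, the ratio between the virtualized and native per-process counts is C/C_V, so the gap is largest when every core runs in its own VM.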
8
Presentation Outline
– Introduction
– Problem Statement
– Proposed Design
– Evaluation
– Conclusions and Future Work
9
Proposed Design: VM-proof XRC design
The design goal is to eliminate the scalability gap:
– Conns/Process: (N-1)×(C/C_V) → (N-1)
10
Proposed Design: Design Challenges
VM-proof sharing of the XRC domain
– A single XRC domain must be shared among different VMs within a physical node
VM-proof connection management
– With a single XRC connection, P1 can send data to all the processes on another physical node (P5~P8), regardless of which VMs those processes reside in
11
Proposed Design: Implementation
VM-proof sharing of the XRC domain (XRCD)
– An XRCD is shared by opening the same XRCD file
– Guest domains and the IDD have dedicated, non-shared filesystems
– Pseudo XRCD file and real XRCD file
VM-proof connection management (CM)
– Traditionally, the IP address/hostname was used to identify a node
– The LID of the HCA is used instead
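The effect of identifying a node by LID instead of hostname can be illustrated with the small sketch below. It is not the MVAPICH connection manager. The `Peer` type, the `group_ranks` helper, and the example ranks, hostnames, and LID value are all hypothetical. It assumes that all VMs on one physical node communicate through the same HCA port and therefore report the same LID.

```python
from collections import defaultdict
from typing import NamedTuple


class Peer(NamedTuple):
    rank: int      # MPI rank
    hostname: str  # hostname of the VM (or native node) hosting the rank
    lid: int       # LID of the HCA port the rank communicates through


def group_ranks(peers, key):
    """Group ranks by a node identifier; one XRC connection is needed per group."""
    groups = defaultdict(list)
    for peer in peers:
        groups[key(peer)].append(peer.rank)
    return dict(groups)


# Hypothetical remote physical node: one HCA port (LID 7), two VMs, two ranks each.
REMOTE_PEERS = [
    Peer(4, "vm-a", 7), Peer(5, "vm-a", 7),
    Peer(6, "vm-b", 7), Peer(7, "vm-b", 7),
]

# Keyed by hostname, every VM looks like a separate node: two connections.
by_hostname = group_ranks(REMOTE_PEERS, key=lambda p: p.hostname)

# Keyed by LID, the VMs collapse into one physical node: one connection.
by_lid = group_ranks(REMOTE_PEERS, key=lambda p: p.lid)
```

In this example, keying by hostname yields two groups while keying by LID yields one, which is the per-node connection count the VM-proof design aims for.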
12
Proposed Design: Discussions
Safe XRCD sharing
– Unauthorized applications from other VMs may share the XRCD
  The IDD can guarantee the isolation of XRCD sharing
– Isolation between VMs running different MPI jobs
  By using different XRCD files, different jobs (or VMs) can share different XRCDs and run without interfering with each other
XRC migration
– Main challenge: an XRC connection is a process-to-node communication channel
– Future work
14
Evaluation: Platform
Cluster configuration:
– 128-core InfiniBand cluster
– Quad-socket, quad-core Barcelona 1.9 GHz
– Mellanox DDR ConnectX HCA, 24-port MT47396 InfiniScale-III switch
Implementation:
– Xen 3.4 with Linux 2.6.18.8
– OpenFabrics Enterprise Distribution (OFED) 1.4.2
– MVAPICH-1.1.0
15
Evaluation: Microbenchmark
The bandwidth results are nearly the same.
Virtualized IB performs ~0.1 us worse when using the BlueFlame mechanism.
– BlueFlame involves a memory copy of the data being sent to the HCA's BlueFlame page
Explanation: in the virtualized case, memory copy operations include interactions between the guest domain and the IDD.
Figure panels: IB verbs latency using doorbell; IB verbs latency using BlueFlame; MPI latency using BlueFlame
16
Evaluation: VM-proof XRC Evaluation
Configurations:
– Native-XRC: native environment running XRC-based MVAPICH
– VM-XRC (C_V=n): VM-based environment running unmodified XRC-based MVAPICH; the parameter C_V denotes the number of cores per VM
– VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
17
Evaluation: Memory Usage
Fully connected cluster with 16 cores/node
– The X-axis denotes the process count
– ~12 KB of memory for each QP
16x less memory usage
– 64K processes consume 13 GB/node with the VM-XRC (C_V=1) configuration
– The VM-proof XRC design reduces the memory usage to only 800 MB/node
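The 13 GB and 800 MB figures can be approximately reproduced from the per-node QP counts, as in the rough sketch below. It assumes that "64K" means 65,536 processes, that "12 KB" means 12 × 1024 bytes, and that QP memory is the only cost counted. The function and constant names are illustrative.

```python
QP_BYTES = 12 * 1024  # ~12 KB per QP (assumed to mean 12 * 1024 bytes)


def qp_memory_per_node(processes, cores_per_node, cores_per_vm=None):
    """Bytes of QP memory per physical node in a fully connected XRC job.

    cores_per_vm=None models one connection per remote node per process,
    i.e. (N-1) x C QPs per node. Otherwise unmodified XRC in VMs is
    modeled as (N-1) x C^2 / C_V QPs per node.
    """
    nodes = processes // cores_per_node
    remote_nodes = nodes - 1
    if cores_per_vm is None:
        qps = remote_nodes * cores_per_node
    else:
        qps = remote_nodes * cores_per_node ** 2 // cores_per_vm
    return qps * QP_BYTES


vm_xrc = qp_memory_per_node(64 * 1024, 16, cores_per_vm=1)
vm_proof = qp_memory_per_node(64 * 1024, 16)

print(f"VM-XRC (C_V=1): {vm_xrc / 1e9:.1f} GB/node")
print(f"VM-proof XRC:   {vm_proof / 1e6:.0f} MB/node")
print(f"ratio:          {vm_xrc / vm_proof:.0f}x")
```

Under these assumptions the two configurations differ by exactly C/C_V = 16, which matches the 16x reduction stated on the slide.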
18
Evaluation: MPI Alltoall Evaluation
– A total of 32 processes
– 10%~25% improvement for messages < 256 B (figure annotation: VM-proof XRC)
19
Evaluation: Application Benchmarks
VM-proof XRC performs nearly the same as Native-XRC
– Except for BT and EP
Both are better than VM-XRC
Little variation across different C_V values
– C_V=8 is an exception: NUMA-aware memory allocation is not guaranteed
20
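Evaluation: Application Benchmarks (Cont’d)

Benchmark | Configuration  | Comm. Peers | Avg. QPs/Process | Max QPs/Process | Avg. QPs/Node
FT        | VM-XRC (C_V=1) | 127         |                  |                 | 2032
          | VM-XRC (C_V=2) |             | 63.4             | 65              | 1014
          | VM-XRC (C_V=4) |             | 31.1             | 32              | 498
          | VM-XRC (C_V=8) |             | 15.1             | 16              | 242
          | VM-proof XRC   |             | 8                | 8               | 128
          | Native-XRC     |             | 7                | 7               | 112
IS        | VM-XRC (C_V=1) | 127         |                  |                 | 2032
          | VM-XRC (C_V=2) |             | 63.7             | 65              | 1019
          | VM-XRC (C_V=4) |             | 31.7             | 33              | 507
          | VM-XRC (C_V=8) |             | 15.8             | 18              | 253
          | VM-proof XRC   |             | 8.6              | 12              | 138
          | Native-XRC     |             | 7.6              | 11              | 122

FT: ~15.9x fewer connections per node with VM-proof XRC than with VM-XRC (C_V=1)
IS: ~14.7x fewer connections per node with VM-proof XRC than with VM-XRC (C_V=1)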
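The two reduction factors quoted on this slide follow from the Avg. QPs/Node values, as the short check below shows. The dictionary layout and names are illustrative. The numbers are the ones reported on the slide for VM-XRC with one core per VM and for VM-proof XRC.

```python
# Avg. QPs/Node as reported on the slide.
AVG_QPS_PER_NODE = {
    "FT": {"VM-XRC (C_V=1)": 2032, "VM-proof XRC": 128},
    "IS": {"VM-XRC (C_V=1)": 2032, "VM-proof XRC": 138},
}


def reduction_factor(benchmark):
    """How many times fewer QPs per node VM-proof XRC uses than VM-XRC (C_V=1)."""
    row = AVG_QPS_PER_NODE[benchmark]
    return row["VM-XRC (C_V=1)"] / row["VM-proof XRC"]


for name in AVG_QPS_PER_NODE:
    print(f"{name}: ~{reduction_factor(name):.1f}x fewer connections")
```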
21
Conclusion and Future Work
The VM-proof XRC design brings together two technologies:
– VMM-bypass I/O virtualization
– eXtensible Reliable Connection in modern high-speed interconnection networks (InfiniBand)
With our VM-proof XRC design, virtualized environments achieve the same raw performance and scalability as native, non-virtualized environments
– ~16x scalability improvement is seen in 16-core/node clusters
Future work
– Evaluations on different platforms at increased scale
– Add VM migration support to our VM-proof XRC design
– Extend our work to the new SR-IOV-enabled ConnectX-2 HCAs
22
Questions? {leo, zghuo, zhangpanyong, md}@ncic.ac.cn
23
Backup Slides
24
OS-bypass of InfiniBand OpenIB Gen2 stack