
1 Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng {leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang (zhangxiang@ncic.ac.cn)

2 Introduction
Virtualization is now one of the enabling technologies of cloud computing.
Many HPC providers now use their systems as platforms for cloud/utility computing. These "HPC on Demand" offerings include:
– Penguin's POD
– IBM's Computing On Demand service
– R Systems' dedicated hosting service
– Amazon's EC2

3 Introduction: Virtualizing HPC Clouds?
Pros:
– good manageability
– proactive fault tolerance
– performance isolation
– online system maintenance
Cons:
– performance gap: virtualized clusters lack the low-latency interconnects that tightly-coupled MPI applications depend on
VMM-bypass I/O has been proposed to address this concern.

4 Introduction: VMM-bypass I/O Virtualization
– The Xen split device driver model is used only to set up the necessary user access points.
– Data communication in the critical path bypasses both the guest OS and the VMM.
[Figure: VMM-Bypass I/O (courtesy [7])]

5 Introduction: InfiniBand Overview
InfiniBand is a popular high-speed interconnect:
– OS-bypass/RDMA
– Latency: ~1 us
– Bandwidth: 3300 MB/s
~41.4% of Top500 systems now use InfiniBand as the primary interconnect.
[Figure: Interconnect Family / Systems, June 2010 (source: http://www.top500.org)]

6 Introduction: InfiniBand Scalability Problem
Reliable Connection (RC):
– one Queue Pair (QP) per connection; each QP consists of an SQ and an RQ
– QPs require memory
– Connections per process: (N-1)×C
Shared Receive Queue (SRQ)
eXtensible Reliable Connection (XRC):
– XRC domain and SRQ-based addressing
– Connections per process: (N-1)
(N: node count, C: cores per node)
[Figure: per-connection RQ vs. shared SRQ diagrams (SRQ5–SRQ8)]

7 Problem Statement
Does a scalability gap exist between native and virtualized environments?
– C_V: cores per VM

          Transport   QPs per Process    QPs per Node
  Native  RC          (N-1)×C            (N-1)×C^2
          XRC         (N-1)              (N-1)×C
  VM      RC          (N-1)×C            (N-1)×C^2
          XRC         (N-1)×(C/C_V)      (N-1)×(C^2/C_V)

A scalability gap exists!
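The gap in the table can be made concrete with a quick calculation. The following is a minimal sketch in C (not from the paper); the values N = 4096 nodes, C = 16 cores per node, and C_V = 1 core per VM are illustrative assumptions chosen only for this example.

```c
#include <stdio.h>

/* Illustrative values only: N nodes, C cores per node, C_V cores per VM. */
#define N    4096   /* nodes in the cluster              */
#define C    16     /* cores (processes) per node        */
#define C_V  1      /* cores per VM (VM-XRC worst case)  */

int main(void)
{
    long native_rc_proc  = (long)(N - 1) * C;          /* RC, native or VM: QPs per process */
    long native_xrc_proc = (long)(N - 1);              /* XRC, native: QPs per process      */
    long vm_xrc_proc     = (long)(N - 1) * (C / C_V);  /* XRC, VM: one QP per peer VM       */

    printf("RC  (native/VM)  : %ld QPs per process\n", native_rc_proc);
    printf("XRC (native)     : %ld QPs per process\n", native_xrc_proc);
    printf("XRC (VM, C_V=%d) : %ld QPs per process -> gap factor %.1fx vs. native XRC\n",
           C_V, vm_xrc_proc, (double)vm_xrc_proc / native_xrc_proc);
    return 0;
}
```

With these values the gap factor is C/C_V = 16: virtualized XRC needs sixteen times as many QPs per process as native XRC, which is exactly what the VM-proof design below removes.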

8 Presentation Outline
– Introduction
– Problem Statement
– Proposed Design
– Evaluation
– Conclusions and Future Work

9 Proposed Design: VM-proof XRC Design
The design goal is to eliminate the scalability gap:
– Connections per process: (N-1)×(C/C_V) → (N-1)

10 Proposed Design: Design Challenges
VM-proof sharing of the XRC domain:
– A single XRC domain must be shared among different VMs within a physical node.
VM-proof connection management:
– With a single XRC connection, P1 is able to send data to all the processes on another physical node (P5~P8), no matter which VMs those processes reside in.

11 Proposed Design: Implementation
VM-proof sharing of the XRCD:
– The XRCD is shared by opening the same XRCD file.
– Guest domains and the IDD have dedicated, non-shared filesystems, so a pseudo XRCD file (guest side) is paired with the real XRCD file (IDD side).
VM-proof connection management:
– Traditionally, an IP address/hostname was used to identify a node.
– The LID of the HCA is used instead (see the sketch below).
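The paper's implementation targets the legacy OFED 1.4.2 XRC interface inside a Xen split-driver stack; the code below is not that implementation. It is a minimal sketch, using the current upstream libibverbs API (ibv_open_xrcd, ibv_query_port), of the two ingredients named above: an XRC domain keyed to a shared file, and the port LID as the node identifier. The path /tmp/xrcd_file and port number 1 are placeholder assumptions; build with -libverbs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    /* Node identity: query the port and use its LID instead of IP/hostname. */
    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port)) { fprintf(stderr, "query_port failed\n"); return 1; }
    printf("local LID = 0x%x\n", port.lid);

    /* XRC domain tied to a file: every process that opens the same file
     * (once that file is made visible to it) shares one XRC domain.
     * "/tmp/xrcd_file" is a placeholder path for this sketch.            */
    int fd = open("/tmp/xrcd_file", O_RDWR | O_CREAT, 0666);
    if (fd < 0) { perror("open xrcd file"); return 1; }

    struct ibv_xrcd_init_attr attr = {
        .comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS,
        .fd        = fd,
        .oflags    = O_CREAT,
    };
    struct ibv_xrcd *xrcd = ibv_open_xrcd(ctx, &attr);
    if (!xrcd) { fprintf(stderr, "ibv_open_xrcd failed\n"); return 1; }
    printf("XRC domain opened via shared file descriptor\n");

    ibv_close_xrcd(xrcd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

In the paper's setting the extra step is making the IDD's real XRCD file reachable from each guest (the pseudo-file mapping on the slide); the verbs call pattern itself stays the same.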

12 Proposed Design: Discussions
Safe XRCD sharing:
– Unauthorized applications from other VMs may try to share the XRCD; isolation of the XRCD sharing can be guaranteed by the IDD.
Isolation between VMs running different MPI jobs:
– By using different XRCD files, different jobs (or VMs) can use separate XRCDs and run without interfering with each other.
XRC migration:
– Main challenge: an XRC connection is a process-to-node communication channel. Left as future work.

13 Presentation Outline
– Introduction
– Problem Statement
– Proposed Design
– Evaluation
– Conclusions and Future Work

14 Evaluation: Platform
Cluster configuration:
– 128-core InfiniBand cluster
– Quad-socket, quad-core Barcelona 1.9 GHz nodes
– Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale-III switch
Implementation:
– Xen 3.4 with Linux 2.6.18.8
– OpenFabrics Enterprise Distribution (OFED) 1.4.2
– MVAPICH-1.1.0

15 Evaluation: Microbenchmark
The bandwidth results are nearly the same.
Virtualized IB performs ~0.1 us worse when using the blueframe mechanism:
– caused by the memory copy of the send data to the HCA's blueframe page
– explanation: memory copy operations in the virtualized case involve interactions between the guest domain and the IDD
[Figures: IB verbs latency using doorbell; IB verbs latency using blueframe; MPI latency using blueframe]

16 Evaluation: VM-proof XRC Evaluation
Configurations:
– Native-XRC: native environment running XRC-based MVAPICH.
– VM-XRC (C_V=n): VM-based environment running unmodified XRC-based MVAPICH. The parameter C_V denotes the number of cores per VM.
– VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design.

17 Evaluation: Memory Usage
16 cores/node cluster, fully connected:
– The X-axis denotes the process count.
– ~12 KB of memory per QP.
~16x less memory usage:
– 64K processes would consume ~13 GB/node with the VM-XRC (C_V=1) configuration.
– The VM-proof XRC design reduces the memory usage to only ~800 MB/node (see the sketch below).
[Figure: QP memory per node vs. process count; lower is better]
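As a sanity check of these figures, here is a minimal back-of-the-envelope calculation in C. The ~12 KB/QP cost and the fully connected 64K-process, 16-core-per-node setup are taken from the slide; the QP-count formulas are the ones from slide 7, and the unit conversion (1 GB = 10^6 KB, 1 MB = 10^3 KB) is an assumption of this sketch.

```c
#include <stdio.h>

int main(void)
{
    const long   procs     = 64 * 1024;      /* 64K processes, fully connected */
    const long   cores     = 16;             /* C: cores (processes) per node  */
    const long   nodes     = procs / cores;  /* N = 4096 nodes                 */
    const double kb_per_qp = 12.0;           /* ~12 KB per QP                  */

    /* VM-XRC with C_V = 1: (N-1) * C^2 / C_V QPs per node */
    double vm_xrc_qps  = (double)(nodes - 1) * cores * cores;
    /* VM-proof XRC: (N-1) * C QPs per node (same as native XRC) */
    double vmproof_qps = (double)(nodes - 1) * cores;

    /* Prints roughly 12.6 GB and 786 MB, in line with the ~13 GB and ~800 MB on the slide. */
    printf("VM-XRC (C_V=1) : ~%.1f GB/node\n", vm_xrc_qps  * kb_per_qp / 1e6);
    printf("VM-proof XRC   : ~%.0f MB/node\n", vmproof_qps * kb_per_qp / 1e3);
    return 0;
}
```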

18 Evaluation: MPI Alltoall Evaluation
– A total of 32 processes.
– 10%~25% improvement for messages < 256 B.
[Figure: MPI Alltoall results, VM-proof XRC vs. VM-XRC; lower is better]

19 Evaluation: Application Benchmarks
VM-proof XRC performs nearly the same as Native-XRC:
– except for BT and EP
Both are better than VM-XRC:
– little variation across different C_V values
– C_V=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
[Figure: application benchmark results]

20 Evaluation: Application Benchmarks (Cont'd)

  Benchmark  Configuration    Avg. QPs/Process  Max QPs/Process  Avg. QPs/Node
  FT         VM-XRC (Cv=1)    127               127              2032
             VM-XRC (Cv=2)    63.4              65               1014
             VM-XRC (Cv=4)    31.1              32               498
             VM-XRC (Cv=8)    15.1              16               242
             VM-proof XRC     8                 8                128
             Native-XRC       7                 7                112
  IS         VM-XRC (Cv=1)    127               127              2032
             VM-XRC (Cv=2)    63.7              65               1019
             VM-XRC (Cv=4)    31.7              33               507
             VM-XRC (Cv=8)    15.8              18               253
             VM-proof XRC     8.6               12               138
             Native-XRC       7.6               11               122

VM-proof XRC vs. VM-XRC (Cv=1): ~15.9x fewer connections for FT, ~14.7x fewer for IS.

21 Conclusion and Future Work
The VM-proof XRC design converges two technologies:
– VMM-bypass I/O virtualization
– eXtensible Reliable Connection in modern high-speed interconnection networks (InfiniBand)
With our VM-proof XRC design, virtualized clusters achieve the same raw performance and scalability as the native, non-virtualized environment:
– ~16x scalability improvement is seen on 16-core/node clusters
Future work:
– evaluations on different platforms at larger scale
– add VM migration support to our VM-proof XRC design
– extend our work to the new SR-IOV-enabled ConnectX-2 HCAs

22 Questions? {leo, zghuo, zhangpanyong, md}@ncic.ac.cn

23 Backup Slides

24 OS-bypass of InfiniBand
[Figure: OpenIB Gen2 stack]

