Slide 1
An overview of Infiniband
Reykjavik, June 24th 2008
Reykjavik University, Dept. of Computer Science
Center for Analysis and Design of Intelligent Agents (cadia.ru.is)
Eric Nivel, eric@ru.is
Slide 2
Motivation
HPC: large data sets, large numbers of messages, n-to-n connectivity. Hence the requirements:
- Lowest possible latencies: physical layer, HCA, switch, protocol overhead (CPU load, memory operations)
- High bandwidth
- Scalability
- Low cost
Slide 3
Available technologies
- Quadrics Elan 4: very low latencies (0.75x), 10 Gbps, high cost (2x). MPI, RDMA.
- Myrinet 10G: very low latencies (2x), 10 Gbps, low cost (1x). MPI, TCP/UDP/IP.
- Infiniband: very low latencies (1x), 10 to 60 Gbps, low cost (1x). MPI, RDMA, uDAPL, SDP, SRP, TCP/UDP/IP.
- Ethernet 10 Gbps: high latencies (10x), 10 Gbps, high cost (2x and more). MPI, TCP/UDP/IP.
- Ethernet 1 Gbps: very high latencies (100x), 1 Gbps, very low cost (0.1x). TCP/UDP/IP.
Quadrics and IB target HPC; Myrinet and 10G Ethernet target data centers. Quadrics performs slightly better than IB for roughly twice the cost.
Slide 4
Characteristics
- Physical layer: links come in 1x (2.5 Gbps), 4x (10 Gbps) and 12x (30 Gbps) widths, at SDR, DDR or (soon) QDR data rates. Example: 4x DDR, full duplex: signaling rate 2 x 20 Gbps, data rate 2 x 16 Gbps.
- Range (4x): ~15 m (copper), ~150 m (fiber).
- Low latencies: switch ~2 ns, HCA ~10 µs.
- HCAs (can be LOM): PCIe 8x or HTX slots.
- Switched fabric; typical topology: fat tree.
- Queue pairs: the HCA works from send/receive descriptors queued in RAM.
- Cost: 4x DDR single-port HCA: 478€; 24-port 4x DDR switch: 3000€; 288-port switch: 86000€.
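A quick check on the 4x DDR numbers above (assuming the standard 8b/10b link encoding of SDR/DDR InfiniBand, which the slide does not state explicitly): each lane signals at 5 Gbps in DDR mode, so 4 lanes give 4 x 5 = 20 Gbps of signaling per direction; 8b/10b carries 8 data bits per 10 signal bits, so the usable data rate is 20 x 8/10 = 16 Gbps per direction, i.e. 2 x 16 Gbps full duplex.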
Slide 5
Performance (4x SDR, unidirectional)
[Latency plot: time (µs, 0 to 10) vs. message size (4 bytes to 4 KB) for MPI, RDMA write and RDMA read]
[Bandwidth plot: bandwidth (MB/s, 0 to 1000) vs. message size (16 bytes to 1 MB) for MPI and RDMA write]
Slide 6
Protocol stack
- RDMA: Remote Direct Memory Access
- MPI: Message Passing Interface
- SDP: Sockets Direct Protocol
- SRP: SCSI over RDMA
- uDAPL: user Direct Access Programming Library
Two paths lead from the application to the HCA: socket applications go either through SDP (RDMA semantics, kernel bypass) or through the regular TCP/IP socket, TCP/IP stack and interface driver (IP over IB); RDMA, MPI, uDAPL and SRP bypass the kernel and drive the HCA directly from user space.
Link: www.openfabrics.org
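To make the SDP entry concrete, here is a minimal sketch of an ordinary Berkeley-socket client in C. Nothing in it is InfiniBand-specific; the point of SDP is that unmodified socket code like this can be carried over the IB fabric with RDMA semantics instead of the kernel TCP/IP stack (on OFED systems this is typically done transparently, e.g. via a preload library, which is an assumption about deployment rather than something shown on the slide). The address 192.168.0.10 and port 5000 are placeholders.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Plain Berkeley-socket client; nothing here is IB-specific. With SDP the
 * same code can run over the IB fabric instead of the kernel TCP/IP stack.
 * The peer address and port below are placeholder values. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.168.0.10", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over sockets\n";
    send(fd, msg, sizeof(msg) - 1, 0);   /* ordinary socket send */
    close(fd);
    return 0;
}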
Slide 7
MPI: channel semantics
Send/receive operations, two-sided:
- The receiver pushes a receive descriptor (RD) onto a FIFO queue: where to put the incoming data in memory.
- The sender pushes a send descriptor (SD): where to find the outgoing data.
- When the message arrives, the RD is used to perform the memory write.
- Completion is notified through Completion Queues (CQ).
Recent implementations (MPI 2.0) run over RDMA; see http://nowlab.cse.ohio-state.edu
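A minimal sketch of the two-sided channel semantics in MPI (any MPI implementation will do; the value 42 and tag 0 are arbitrary): the receiver's MPI_Recv supplies the receive descriptor (where to place the data), the sender's MPI_Send supplies the send descriptor (where to find it), and the library handles matching and completion.

#include <mpi.h>
#include <stdio.h>

/* Two-sided send/receive: run with two ranks, e.g. "mpirun -np 2 ./a.out". */
int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 42;
        /* Send descriptor: where the outgoing data lives (&msg), its type and count. */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive descriptor: where to place the incoming data. */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}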
Slide 8
RDMA: memory semantics
Read/write operations, one-sided:
- Read: push a descriptor saying where to read and where to write the result.
- Write: push a descriptor saying where to get the data and where to write it.
- Buffer registration (pinned-down memory).
- Zero-copy: incoming data is placed directly in the registered buffer.
- Notification through CQs.
- Atomic operations: Compare & Swap, Fetch & Add.
Standard TCP receive path:
1. The NIC interrupts the CPU.
2. The CPU switches context to a TCP thread.
3. The CPU uses DMA to place the data in I/O memory, then resumes the interrupted user thread.
4. DMA interrupts the CPU for stack processing.
5. The CPU switches context to process the completed stack and associate the data with the receiving application.
6. The CPU uses DMA to copy the data from I/O to user memory.
RDMA write (kernel bypass):
1. The HCA receives data containing the destination address.
2. The HCA writes the data directly into user memory.
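As a sketch of what the one-sided write looks like at the verbs level (libibverbs), assuming a queue pair qp that is already connected, a local buffer registered with ibv_reg_mr, and the peer's remote_addr and rkey already exchanged out of band; the helper name post_rdma_write is made up for illustration:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA write on an already-connected queue pair.
 * qp, mr and the peer's (remote_addr, rkey) must have been set up beforehand. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* local registered buffer */
    sge.length = len;
    sge.lkey   = mr->lkey;               /* local key from ibv_reg_mr */

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: the peer posts no descriptor */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion in the CQ */
    wr.wr.rdma.remote_addr = remote_addr;        /* where to write in the peer's registered memory */
    wr.wr.rdma.rkey        = rkey;               /* peer's remote key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* the HCA moves the data, no CPU copy */
}

Completion would then be picked up by polling the send CQ with ibv_poll_cq and checking that the work completion status is IBV_WC_SUCCESS.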
Slide 9
Conclusion
- IB/RDMA is well suited to HPC at a reasonable cost.
- MPI implementations such as MVAPICH have now reached good performance levels.
- Compatibility with Berkeley-style sockets: don't expect too much on InfiniHost III; ConnectX offloads TCP/IP protocol processing to hardware.