1  An overview of InfiniBand
Reykjavik, June 24th 2008
Reykjavik University, Dept. of Computer Science
Center for Analysis and Design of Intelligent Agents, cadia.ru.is
Eric Nivel, eric@ru.is

2  Motivation
HPC: large data sets, large numbers of messages, n-to-n connectivity
- Latencies as low as possible: physical layer, HCA, switch, protocol overhead (CPU load, memory operations)
- High bandwidth
- Scalability
- Low cost

3  Available technologies
- Quadrics Elan 4: very low latencies (0.75x), 10 Gbps, high cost (2x), MPI, RDMA
- Myrinet 10G: very low latencies (2x), 10 Gbps, low cost (1x), MPI, TCP/UDP/IP
- InfiniBand: very low latencies (1x), 10 to 60 Gbps, low cost (1x), MPI, RDMA, uDAPL, SDP, SRP, TCP/UDP/IP
- Ethernet 10 Gbps: high latencies (10x), 10 Gbps, high cost (2x and more), MPI, TCP/UDP/IP
- Ethernet 1 Gbps: very high latencies (100x), 1 Gbps, very low cost (0.1x), TCP/UDP/IP
Quadrics and IB target HPC; Myrinet and 10G Ethernet target data centers. Quadrics performs slightly better than IB for roughly twice the cost.

4  Characteristics
Physical layer:
- Links: 1x (2.5 Gbps), 4x (10 Gbps), 12x (30 Gbps)
- Data rates: SDR, DDR, QDR (soon). Example: 4x DDR, full duplex: signaling rate 2 x 20 Gbps, data rate 2 x 16 Gbps
- Range (4x): ~15 m (copper), ~150 m (fiber)
- Low latencies: switch ~2 ns, HCA ~10 µs
HCAs (can be LOM): PCIe 8x, HTX slots
Switched fabric; topology: fat tree
[Diagram: switched fabric connecting HCAs; Queue Pairs on each HCA hold descriptors for send/receive operations pointing to buffers in RAM; Tx/Rx switching in the fabric]
Cost:
- 4x DDR single-port HCA: 478 €
- 24-port (288-port) 4x DDR switch: 3,000 € (86,000 €)
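A quick check on the 4x DDR figures above, assuming the 8b/10b line encoding used on SDR/DDR links (8 data bits carried in every 10 signaled bits):

    signaling rate, one direction: 4 lanes x 5 Gbps/lane (2.5 Gbaud, double data rate) = 20 Gbps
    data rate, one direction:      20 Gbps x 8/10                                      = 16 Gbps
    full duplex:                   2 x 20 Gbps signaling, 2 x 16 Gbps data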

5  Performance
4x SDR, unidirectional:
[Latency plot: time (µs) vs. message size (4 B to 4 KB) for MPI, RDMA write and RDMA read]
[Bandwidth plot: bandwidth (MB/s) vs. message size (16 B to 1 MB) for MPI and RDMA write]

6  Protocol stack
Acronyms:
- RDMA: Remote Direct Memory Access
- MPI: Message Passing Interface
- SDP: Sockets Direct Protocol
- SRP: SCSI over RDMA
- uDAPL: user Direct Access Programming Library
[Stack diagram: the application sits in user space; through the socket API it can go either over the kernel TCP/IP stack and interface driver (IP over IB) or over SDP; RDMA, MPI, uDAPL and SRP use RDMA semantics and bypass the kernel, talking to the HCA directly]
Link: www.openfabrics.org
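To make the sockets-compatibility point concrete, below is a plain Berkeley-sockets client in C; nothing in it is InfiniBand-specific, and the peer address and port are hypothetical. The idea of SDP and IP over IB is that unmodified code like this can run over the IB fabric; under OFED this is typically done by preloading the SDP library (libsdp), which is an assumption about the deployment rather than anything shown on the slide.

/* Minimal sketch: an ordinary Berkeley-sockets client. Nothing below is
   InfiniBand-specific; running it over SDP or IP over IB is a deployment
   choice (assumed, not part of this code). Peer address and port are
   hypothetical placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);           /* ordinary stream socket */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                        /* hypothetical port   */
    inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);  /* hypothetical server */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over sockets (TCP, IP over IB, or SDP)";
    if (write(fd, msg, sizeof msg) < 0)                 /* same API on every transport */
        perror("write");

    close(fd);
    return 0;
}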

7  MPI – channel semantics
- Send/receive operations, two-sided
- The receiver pushes (FIFO) a receive descriptor (RD): where to put the incoming data in memory
- The sender pushes a send descriptor (SD): where to find the outgoing data
- When the message arrives, the RD is used to perform the memory write; notification in Completion Queues (CQs)
- Recent implementations (MPI 2.0) run over RDMA, see http://nowlab.cse.ohio-state.edu
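For illustration, a minimal two-sided send/receive in C: a sketch of the channel semantics as exposed by MPI, assuming an MPI implementation (e.g. MVAPICH) and a launch on at least two ranks; buffer size and contents are arbitrary.

/* Two-sided (channel) semantics: rank 1 posts a receive describing where
   incoming data must land, rank 0 posts a send describing where the
   outgoing data is. The library turns these into send/receive descriptors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024];

    if (rank == 0) {
        for (int i = 0; i < 1024; i++) buf[i] = (double)i;
        /* Sender side: describe the outgoing buffer. */
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver side: describe where the incoming data must be placed. */
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received, buf[42] = %f\n", buf[42]);
    }

    MPI_Finalize();
    return 0;
}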

8  RDMA – memory semantics
- Read/write operations, one-sided
- Read: push a descriptor: where to read, where to write the result
- Write: where to get the data, where to write it
- Buffer registration (pinned-down memory)
- Zero-copy: direct placement of incoming data in the registered buffer
- Notification in CQs
- Atomic operations: Compare & Swap, Fetch & Add
Standard TCP receive (diagram: NIC, CPU, north bridge and memory over PCI):
1. NIC interrupts the CPU
2. CPU switches context to a TCP thread
3. CPU uses DMA to place the data in I/O memory; CPU resumes the interrupted user thread
4. DMA interrupts the CPU for stack processing
5. CPU switches context to process the completed stack and associate the data with the receiving application
6. CPU uses DMA to copy the data from I/O to user memory
RDMA write, kernel bypass (diagram: HCA, CPU, north bridge and memory over PCIe):
1. HCA receives data containing the destination address
2. HCA writes the data into user memory
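For illustration, a sketch of posting a one-sided RDMA write using the libibverbs API; the slides do not prescribe a particular API, so treat the library choice as an assumption. Queue pair creation, connection setup, and the out-of-band exchange of the peer's remote address and rkey are omitted: pd and qp are assumed to be an already-connected protection domain and queue pair.

/* Sketch of RDMA write (memory semantics) with libibverbs.
   Assumes pd/qp are set up and connected, and remote_addr/rkey were
   advertised by the peer out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       uint64_t remote_addr, uint32_t rkey)
{
    const size_t len = 4096;
    char *buf = malloc(len);
    if (!buf) return -1;
    memset(buf, 'x', len);

    /* Buffer registration: pins the memory and yields local/remote keys. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { free(buf); return -1; }

    /* Scatter/gather element: where to get the data locally. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    /* Work request (the descriptor): where to write it on the remote node. */
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a CQ notification */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Post the descriptor; completion appears in the send CQ and would be
       drained with ibv_poll_cq(). */
    int rc = ibv_post_send(qp, &wr, &bad_wr);

    ibv_dereg_mr(mr);   /* in real code, only after the completion is polled */
    free(buf);
    return rc;
}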

9  Conclusion
- IB/RDMA is well suited for HPC at reasonable cost
- MPI implementations such as MVAPICH have now reached good performance levels
- Compatibility with Berkeley-style sockets: don't expect too much on InfiniHost III; ConnectX offloads the TCP/IP protocol to hardware (CPU offload)

