
Slide 1: Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects
Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas
Computer Architecture and VLSI Systems Laboratory (CARV), Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), Greece

Slide 2: Motivation
- Typically, clusters today use multiple interconnects:
  - Interprocess communication (IPC): Myrinet, InfiniBand, etc.
  - I/O: Fibre Channel, SCSI
  - Fast LAN: 10 GigE
- However, this increases system and management cost
- Can we use a single interconnect for all types of traffic? Which one?
- High network speeds: 10-40 Gbit/s

Slide 3: Trends and Constraints
- Most interconnects use a similar physical layer, but differ in:
  - Protocol semantics and the guarantees they provide
  - Protocol implementation on the NIC and in the network core
- Higher-layer protocols (e.g., TCP/IP, NFS) are independent of the interconnect technology
- 10+ Gbit/s Ethernet is particularly attractive, but...
  - It is typically associated with higher overheads
  - It requires more support at the edge due to its simpler network core

Slide 4: This Work
- How well can a protocol do over 10-40 GigE?
- Scale throughput efficiently over multiple links
- Analyze protocol overhead at the host CPU
- Propose and evaluate optimizations for reducing host CPU overhead
  - Implemented without hardware support

Slide 5: Outline
- Motivation
- Protocol design over Ethernet
- Experimental results
- Conclusions and future work

Slide 6: Standard Protocol Processing
- Sources of overhead (annotated in the sketch below):
  - System call to issue the operation
  - Memory copies at sender and receiver
  - Protocol packet processing
  - Interrupt notification for freeing the send-side buffer and for packet arrival
  - Extensive device accesses
  - Context switch from the interrupt to the receive thread for packet processing
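As an illustration only (not the authors' protocol code), the following C sketch marks where each of these overheads is paid on a conventional kernel socket path; the use of plain TCP sockets is an assumption made for the example.

/* Illustrative sketch: a plain sockets send/receive pair, annotated with
 * where the overheads listed above are incurred. Assumes a connected
 * TCP socket descriptor fd on each side. */
#include <unistd.h>
#include <sys/socket.h>

ssize_t plain_send(int fd, const void *buf, size_t len)
{
    /* 1. System call to issue the operation (user/kernel crossing).    */
    /* 2. The kernel copies buf into socket buffers (send-side copy).   */
    /* 3. Protocol packet processing (segmentation, headers, checksum). */
    /* 4. Device accesses to post descriptors; later a TX-complete      */
    /*    interrupt frees the send-side buffer.                         */
    return send(fd, buf, len, 0);
}

ssize_t plain_recv(int fd, void *buf, size_t len)
{
    /* An RX interrupt signals packet arrival, then a context switch to  */
    /* the receive thread for protocol processing, and a copy from kernel */
    /* buffers into buf (receive-side copy) before this call returns.    */
    return recv(fd, buf, len, 0);
}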

Slide 7: Our Base Protocol
- Improves on MultiEdge [IPDPS'07]:
  - Support for multiple links with different schedulers
  - Hardware coalescing for send- and receive-side interrupts
  - Software coalescing in the interrupt handler (sketched below)
- Still requires:
  - System calls
  - One copy on the send side and one on the receive side
  - A context switch in the receive path
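The following is a minimal sketch of software interrupt coalescing, assuming hypothetical helpers nic_next_packet() and deliver_packet(); it is not the MultiEdge implementation, only an illustration of draining many packets per interrupt.

/* Software interrupt coalescing sketch: one interrupt drains as many
 * completed packets as are available, up to a budget, instead of one
 * packet per interrupt. The helpers below are hypothetical placeholders,
 * not real driver APIs. */
#define COALESCE_BUDGET 64

struct packet;                        /* opaque packet descriptor           */
struct packet *nic_next_packet(void); /* hypothetical: NULL if ring empty   */
void deliver_packet(struct packet *); /* hypothetical: hand to the protocol */

void rx_interrupt_handler(void)
{
    struct packet *pkt;
    int handled = 0;

    /* Drain the receive ring: amortizes one interrupt over many packets. */
    while (handled < COALESCE_BUDGET && (pkt = nic_next_packet()) != NULL) {
        deliver_packet(pkt);
        handled++;
    }
    /* If the budget was exhausted, more packets remain; the next interrupt
     * (or a re-armed handler) picks them up. */
}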

Slide 8: Evaluation Methodology
- Research questions:
  - How does the protocol scale with the number of links?
  - What are the important overheads at 10 Gbit/s?
  - What is the impact of link scheduling?
- Setup: two nodes connected back-to-back
  - Dual-CPU (Opteron 244)
  - 1-8 links of 1 Gbit/s (Intel)
  - 1 link of 10 Gbit/s (Myricom)
- We focus on:
  - Throughput: end-to-end, as reported by the benchmarks
  - Detailed CPU breakdowns: extensive kernel instrumentation
  - Packet-level statistics: flow control, out-of-order delivery

Slide 9: Throughput Scalability: One Way

Slide 10: What If...
- What if we could avoid certain overheads?
  - Interrupts: use polling instead
  - Data copying: remove the copies from the send and receive paths
- We examine two more protocol configurations:
  - Poll: realistic, but consumes one CPU (see the sketch below)
  - NoCopy: artificial, as data are not actually delivered
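A minimal sketch of the Poll configuration, reusing the same hypothetical helpers as the coalescing sketch above: a dedicated thread spins on the receive ring, which removes interrupt and context-switch costs but keeps one CPU busy.

#include <stdatomic.h>

struct packet;                        /* opaque packet descriptor          */
struct packet *nic_next_packet(void); /* hypothetical, as in slide 7       */
void deliver_packet(struct packet *); /* hypothetical, as in slide 7       */

static atomic_bool stop_polling;      /* set by the protocol on shutdown   */

void *poll_rx_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_polling)) {
        struct packet *pkt = nic_next_packet();
        if (pkt)
            deliver_packet(pkt);      /* no interrupt, no wakeup latency   */
        /* else: busy-wait; this thread monopolizes one CPU core           */
    }
    return 0;
}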

Slide 11: Poll Results

Slide 12: NoCopy Results

Slide 13: Memory Throughput
- Copy performance is tied to memory throughput
- Maximum memory throughput (NUMA with Linux support):
  - Read: 20 Gbit/s
  - Write: 15 Gbit/s
- Maximum copy throughput: 8 Gbit/s per CPU accessing local memory (a small measurement sketch follows)
- Overall, multiple links approach memory throughput, so copies will become even more important in the future
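The per-CPU copy figure can be reproduced approximately with a simple memcpy microbenchmark like the sketch below (not the authors' instrumentation); the buffer size and iteration count are arbitrary choices, and the reported number depends on the platform.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (64UL * 1024 * 1024)   /* 64 MB, larger than the caches */
#define ITERATIONS 32

int main(void)
{
    char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    if (!src || !dst)
        return 1;
    memset(src, 0xA5, BUF_BYTES);        /* touch pages before timing */
    memset(dst, 0x5A, BUF_BYTES);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbits = (double)BUF_BYTES * ITERATIONS * 8 / 1e9 / secs;
    printf("copy throughput: %.2f Gbit/s\n", gbits);

    free(src);
    free(dst);
    return 0;
}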

Slide 14: Packet Scheduling for Multiple Links
- We evaluated three packet schedulers (sketched below):
  - Static round robin (SRR): suitable for identical links
  - Weighted static round robin (WSRR): assigns packets proportionally to link throughput, but does not consider link load
  - Weighted dynamic (WD): assigns packets proportionally to link throughput and also considers link load
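A hedged sketch of the three schedulers in C; the data structure and function names are illustrative assumptions, not the protocol's actual code. Each function returns the index of the link that should carry the next packet.

#include <stddef.h>

struct link {
    unsigned weight;   /* proportional to nominal link throughput        */
    unsigned queued;   /* bytes (or packets) currently queued, i.e. load */
};

/* Static round robin (SRR): cycle over identical links. */
size_t srr_pick(size_t nlinks)
{
    static size_t next;
    return next++ % nlinks;
}

/* Weighted static round robin (WSRR): links get transmission slots in
 * proportion to their weight, ignoring current load. */
size_t wsrr_pick(const struct link *links, size_t nlinks)
{
    static size_t slot;
    size_t total = 0, i;
    for (i = 0; i < nlinks; i++)
        total += links[i].weight;
    size_t s = slot++ % total;
    for (i = 0; i < nlinks; i++) {
        if (s < links[i].weight)
            return i;
        s -= links[i].weight;
    }
    return 0;          /* not reached */
}

/* Weighted dynamic (WD): prefer the link with the least queued work
 * relative to its weight, so both link speed and current load count. */
size_t wd_pick(const struct link *links, size_t nlinks)
{
    size_t best = 0, i;
    for (i = 1; i < nlinks; i++) {
        /* compare queued/weight ratios without division:
         * queued[i] * weight[best] < queued[best] * weight[i] */
        if (links[i].queued * links[best].weight <
            links[best].queued * links[i].weight)
            best = i;
    }
    return best;
}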

Slide 15: Multi-link Scheduler Results
- Setup: 4 x 1 Gbit/s + 1 x 10 Gbit/s links
- NoCopy + Poll configuration

Slide 16: Lessons Learned
- Multiple links introduce overheads:
  - The base protocol scales up to 4 x 1 Gbit/s links
  - Removing interrupts allows scaling to 6 x 1 Gbit/s links
  - Beyond 6 Gbit/s, copying becomes dominant
  - Removing copies allows scaling to 8-10 Gbit/s
- The weighted dynamic scheduler performs best: about 10% better than the simpler alternative (WSRR)

Slide 17: Future Work
1) Eliminate even the single remaining copy
   - Use page remapping without hardware support
2) More efficient interrupt coalescing
   - Share the interrupt handler among multiple NICs
3) Distribute the protocol over multiple cores
   - Possibly dedicate cores to network processing

Slide 18: Related Work
- User-level communication systems and protocols (Myrinet, InfiniBand, etc.)
  - Break the kernel abstraction and require hardware support
  - Not successful with commercial applications and I/O
- iWARP
  - Requires hardware support
  - Ongoing work and efforts
- TCP/IP optimizations and offload
  - Complex and expensive
  - Important for WAN setups rather than for datacenters

Slide 19: Thank You! Questions?
Contact: Stavros Passas, stabat@ics.forth.gr

