Exploiting Spatial Parallelism in Ethernet-based Cluster Interconnects
Stavros Passas, George Kotsis, Sven Karlsson, and Angelos Bilas
Computer Architecture and VLSI Systems (CARV) Laboratory
Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Greece
Motivation
Clusters today typically use multiple interconnects:
- Interprocess communication (IPC): Myrinet, InfiniBand, etc.
- I/O: Fibre Channel, SCSI
- Fast LAN: 10 GigE
However, this increases system and management cost.
Can a single interconnect, at network speeds of 10-40 Gbit/s, carry all types of traffic? If so, which one?
Trends and Constraints
Most interconnects use a similar physical layer, but differ in:
- The protocol semantics and guarantees they provide
- The protocol implementation on the NIC and in the network core
Higher-layer protocols (e.g., TCP/IP, NFS) are independent of the interconnect technology.
10+ Gbit/s Ethernet is particularly attractive, but:
- It is typically associated with higher overheads
- It requires more support at the edge, due to its simpler network core
This Work
How well can a protocol perform over 10-40 GigE?
- Scale throughput efficiently over multiple links
- Analyze protocol overhead at the host CPU
- Propose and evaluate optimizations that reduce host CPU overhead
All implemented without hardware support.
Outline
- Motivation
- Protocol design over Ethernet
- Experimental results
- Conclusions and future work
Standard Protocol Processing
Sources of overhead (annotated in the sketch below):
- System call to issue each operation
- Memory copies at the sender and the receiver
- Protocol packet processing
- Interrupt notifications for freeing send-side buffers and for packet arrival
- Extensive device accesses
- Context switch from the interrupt handler to the receive thread for packet processing
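As a rough illustration, the sketch below annotates an ordinary kernel-socket send with the overhead sources listed above. The wrapper function and the mapping in the comments are illustrative, not the paper's protocol code; only the send() call itself is a real POSIX API.

```c
/* Annotated standard send path: each comment maps a step of kernel
 * protocol processing to an overhead source from the slide above. */
#include <sys/types.h>
#include <sys/socket.h>

ssize_t send_message(int sock, const void *buf, size_t len)
{
    /* 1. System call: user-to-kernel crossing to issue the operation. */
    ssize_t n = send(sock, buf, len, 0);
    /* Inside the kernel:
     * 2. The payload is copied from the user buffer into socket
     *    buffers (the send-side memory copy).
     * 3. Protocol packet processing: segmentation, headers, checksums.
     * 4. Device accesses: the driver fills descriptors and rings the
     *    NIC doorbell over uncached MMIO.
     * 5. Later, a TX-completion interrupt lets the kernel free the
     *    send-side buffer.  On the receiver, an RX interrupt plus a
     *    context switch to the receive thread precede the final copy
     *    into the application buffer. */
    return n;
}
```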
Our Base Protocol
Improves on MultiEdge [IPDPS'07]:
- Supports multiple links with different schedulers
- Hardware coalescing of send- and receive-side interrupts
- Software coalescing in the interrupt handler (sketched below)
It still requires:
- System calls
- One copy on the send side and one on the receive side
- A context switch in the receive path
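Software interrupt coalescing in the handler might look like the NAPI-style sketch below: one hardware interrupt triggers processing of a whole batch of packets. All nic_* and protocol_* functions here are hypothetical placeholders, not the paper's actual driver interface.

```c
/* Sketch of software interrupt coalescing in the receive path.
 * Assumption: the NIC lets us mask/unmask its interrupt and poll
 * its receive ring; the helpers below are hypothetical. */
#include <stdbool.h>

#define RX_BUDGET 64   /* max packets handled per handler invocation */

extern void nic_mask_irq(int link);             /* hypothetical */
extern void nic_unmask_irq(int link);           /* hypothetical */
extern bool nic_rx_pending(int link);           /* hypothetical */
extern void protocol_process_packet(int link);  /* hypothetical */

/* One interrupt amortizes its cost over up to RX_BUDGET packets. */
void rx_interrupt_handler(int link)
{
    nic_mask_irq(link);                 /* suppress further interrupts */
    int done = 0;
    while (done < RX_BUDGET && nic_rx_pending(link)) {
        protocol_process_packet(link);  /* header checks, reassembly */
        done++;
    }
    nic_unmask_irq(link);               /* re-arm: a new interrupt fires
                                           only for packets arriving later */
}
```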
Evaluation Methodology
Research questions:
- How does the protocol scale with the number of links?
- What are the important overheads at 10 Gbit/s?
- What is the impact of link scheduling?
Setup: two nodes connected back-to-back
- Dual-CPU (Opteron 244)
- 1-8 links of 1 Gbit/s (Intel)
- 1 link of 10 Gbit/s (Myricom)
We focus on:
- Throughput: end-to-end, as reported by the benchmarks
- Detailed CPU breakdowns: extensive kernel instrumentation
- Packet-level statistics: flow control, out-of-order arrivals
Throughput Scalability: One Way
What If We Could Avoid Certain Overheads?
- Interrupts: use polling instead
- Data copying: remove the copies from the send and receive paths
We examine two additional protocol configurations:
- Poll: realistic, but consumes one CPU (sketched below)
- NoCopy: artificial, since data are not actually delivered
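The Poll configuration could be structured as below: a dedicated thread spins over the receive rings instead of sleeping until an interrupt, trading one fully occupied CPU for the removal of interrupt and context-switch costs. The helpers and NLINKS value are hypothetical, as in the previous sketch.

```c
/* Sketch of the Poll configuration's receive loop. */
#include <stdbool.h>
#include <stddef.h>

extern bool nic_rx_pending(int link);           /* hypothetical */
extern void protocol_process_packet(int link);  /* hypothetical */

#define NLINKS 4   /* e.g., 4 x 1 Gbit/s links */

/* pthread-compatible entry point; spins forever. */
void *poll_thread(void *arg)
{
    (void)arg;
    for (int link = 0; ; link = (link + 1) % NLINKS) {
        /* Interrupts stay masked: new packets are discovered by
         * reading descriptor status directly from the rings. */
        while (nic_rx_pending(link))
            protocol_process_packet(link);
    }
    return NULL;  /* not reached */
}
```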
Poll Results
NoCopy Results
Memory Throughput
Copy performance is bounded by memory throughput.
Maximum memory throughput (NUMA, with Linux support):
- Read: 20 Gbit/s
- Write: 15 Gbit/s
Maximum copy throughput: 8 Gbit/s per CPU accessing local memory (a copy both reads and writes every byte, so it consumes roughly twice its rate in memory bandwidth).
Overall, multiple links approach the memory throughput, so copies will become increasingly important in the future.
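A minimal user-space microbenchmark along these lines can measure the per-CPU copy bound. This is a sketch, not the paper's instrumentation; the buffer size and iteration count are arbitrary choices, and the reported figure depends on the machine.

```c
/* memcpy bandwidth microbenchmark: times repeated large copies and
 * reports Gbit/s, which bounds what a one-copy protocol can deliver
 * per CPU. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB, larger than the caches */
#define ITERS    32

int main(void)
{
    char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
    if (!src || !dst)
        return 1;
    memset(src, 1, BUF_SIZE);            /* fault the pages in */
    memset(dst, 0, BUF_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbits = (double)BUF_SIZE * ITERS * 8 / 1e9 / secs;
    printf("copy throughput: %.1f Gbit/s\n", gbits);
    return 0;
}
```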
Packet Scheduling for Multiple Links
We evaluated three packet schedulers (the core decision rule is sketched below):
- Static round robin (SRR): suitable for identical links
- Weighted static round robin (WSRR): assigns packets proportionally to link throughput, but does not consider link load
- Weighted dynamic (WD): assigns packets proportionally to link throughput and considers link load
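The sketch below shows one plausible form of the weighted dynamic decision, assuming the protocol tracks each link's nominal capacity and its queued backlog. The names and structure are illustrative, not the MultiEdge implementation: SRR would simply rotate across links ignoring both fields, and WSRR would use capacity as a static weight while ignoring the backlog.

```c
/* Weighted dynamic (WD) link selection: pick the link whose backlog
 * drains soonest, i.e. minimize queued_bytes / capacity.  This makes
 * the choice proportional to link throughput while reacting to load. */
#include <stddef.h>

struct link_state {
    double capacity_gbps;   /* nominal link throughput, e.g. 1 or 10 */
    size_t queued_bytes;    /* bytes scheduled but not yet sent */
};

int wd_pick_link(const struct link_state *links, int nlinks)
{
    int best = 0;
    double best_drain = (double)links[0].queued_bytes / links[0].capacity_gbps;
    for (int i = 1; i < nlinks; i++) {
        double drain = (double)links[i].queued_bytes / links[i].capacity_gbps;
        if (drain < best_drain) {
            best_drain = drain;
            best = i;
        }
    }
    return best;
}
```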
Multi-link Scheduler Results
Setup: 4 x 1 Gbit/s + 1 x 10 Gbit/s links, NoCopy + Poll configuration
Lessons Learned
Multiple links introduce overheads:
- The base protocol scales up to 4 x 1 Gbit/s links
- Removing interrupts allows scaling to 6 x 1 Gbit/s links
- Beyond 6 Gbit/s, copying becomes dominant
- Removing copies allows scaling to 8-10 Gbit/s
The weighted dynamic scheduler performs best: about 10% better than the simpler alternative (WSRR).
Future Work
1. Eliminate even the single remaining copy, using page remapping without hardware support
2. More efficient interrupt coalescing, by sharing the interrupt handler among multiple NICs
3. Distribute the protocol over multiple cores, possibly dedicating cores to network processing
Related Work
User-level communication systems and protocols (Myrinet, InfiniBand, etc.):
- Break the kernel abstraction and require hardware support
- Have not been successful with commercial applications and I/O
iWARP: requires hardware support.
Ongoing work: TCP/IP optimizations and offload
- Complex and expensive
- Important for WAN setups rather than datacenters
Thank You! Questions?
Contact: Stavros Passas, stabat@ics.forth.gr