Slide 1: TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect
Heiner Litz, University of Heidelberg
Slide 2: Motivation
Future trends:
- More cores, 2-fold increase per year [Asanovic, 2006]
- More nodes, 200,000+ nodes for exascale systems [Exascale Rep.]
Consequence:
- Exploit fine-grained parallelism
- Improve serialization/synchronization
Requirement:
- Low-latency communication
Slide 3: Motivation
Latency lags bandwidth [Patterson, 2004]: memory vs. network
- Bandwidth: memory 10 GB/s vs. network 5 GB/s (2x gap)
- Latency: memory 50 ns vs. network 1 us (20x gap)
Slide 4: State of the Art
[Chart positioning interconnect technologies along scalability and latency axes: Infiniband and Ethernet (clusters), QuickPath, SW DSM, HyperTransport, Larrabee, Tilera (SMPs); TCCluster targets both high scalability and low latency]
Slide 5: Observation
Today's CPUs already represent complete cluster nodes:
- Processor cores
- Switch
- Links
Slide 6: Approach
Use the host interface as the interconnect: the Tightly Coupled Cluster (TCCluster).
Slide 7: Background
Coherent HyperTransport (cHT):
- Shared-memory SMPs
- Cache-coherency overhead
- Max. 8 endpoints
- Table-based routing (by nodeID)
Non-coherent HyperTransport (ncHT):
- Subset of cHT
- I/O devices, Southbridge, ...
- PCI-like protocol
- "Unlimited" number of devices
- Interval routing (by memory address)
Slide 8: Approach
- Processors pretend to be I/O devices
- Partitioned global address space
- Communicate via PIO writes to MMIO (see the sketch below)
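A minimal user-space sketch of the communication primitive: a plain store into a memory-mapped window that the fabric routes to a remote node. The device path /dev/tccluster, the region size, and the idea that a driver exposes the window via mmap() are illustrative assumptions, not details from the slides.

```c
/* Hypothetical sketch: writing into a remotely writable MMIO window.
 * Assumes a driver (here /dev/tccluster) exposes the remote node's
 * writable region via mmap(); names and sizes are illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1UL << 20)   /* assumed 1 MiB remotely writable window */

int main(void)
{
    int fd = open("/dev/tccluster", O_RDWR);  /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* Map the remote node's writable region into our address space. */
    volatile uint64_t *remote = mmap(NULL, REGION_SIZE,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
    if (remote == MAP_FAILED) { perror("mmap"); return 1; }

    /* A plain store becomes a posted ncHT write routed to the remote box. */
    remote[0] = 0xdeadbeefULL;

    /* Serialize so buffered writes are flushed onto the link. */
    __sync_synchronize();

    munmap((void *)remote, REGION_SIZE);
    close(fd);
    return 0;
}
```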
Slide 9: Routing
[Routing diagram]
Slide 10: Programming Model
Remote Store programming model (RSM):
- Each process has local private memory
- Each process exports remotely writable regions
- Sending by storing to remote locations
- Receiving by reading from local memory
- Synchronization through serializing instructions
- No support for bulk transfers (DMA)
- No support for remote reads
- Emphasis on locality and low-latency reads
A minimal send/receive sketch follows this list.
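The sketch below illustrates the remote-store pattern under the assumptions above: the sender stores a payload and then a flag into the peer's remotely writable region; the receiver polls the flag in its own local memory. The struct layout and helper names are hypothetical; only the store/fence/poll pattern is taken from the slide.

```c
/* Remote Store programming model sketch (hypothetical helper names).
 * remote_region: our window into the peer's remotely writable memory.
 * local_region:  our own remotely writable memory, written by the peer. */
#include <stdint.h>
#include <emmintrin.h>   /* _mm_sfence, _mm_pause */

struct msg {
    volatile uint64_t payload;
    volatile uint64_t flag;      /* 0 = empty, 1 = message present */
};

/* Send: plain stores into the peer's memory, then serializing fences
 * so the buffered writes are pushed onto the ncHT link in order. */
static void rsm_send(struct msg *remote_region, uint64_t value)
{
    remote_region->payload = value;
    _mm_sfence();                /* order payload before flag */
    remote_region->flag = 1;
    _mm_sfence();
}

/* Receive: poll the flag in local memory; reads never cross the link. */
static uint64_t rsm_recv(struct msg *local_region)
{
    while (local_region->flag == 0)
        _mm_pause();             /* spin politely */
    uint64_t v = local_region->payload;
    local_region->flag = 0;      /* mark the slot free again */
    return v;
}
```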
Slide 11: Implementation
- 2x two-socket quad-core Shanghai Tyan boxes
- [Diagram: BOX 0 and BOX 1, each with node 0 and node 1, southbridge (SB) and HTX slot; the boxes are coupled through their HTX slots by an ncHT link (16 bit @ 3.6 Gbit); shared Reset/PWR]
Slide 12: Implementation
[Figure of the prototype hardware]
Slide 13: Implementation
Software-based approach.
Firmware (Coreboot / LinuxBIOS):
- Link de-enumeration
- Force links into non-coherent mode
- Set link frequency & electrical parameters
Driver (Linux-based):
- Topology & routing
- Manages remotely writable regions (a driver sketch follows)
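A hedged fragment of what the driver's mmap path could look like: it maps the local remotely writable region into user space uncached, since remote posted writes bypass the cache hierarchy (see the limitations slide). The physical base address, window size, and function name are assumptions for illustration, not the actual driver code.

```c
/* Hypothetical fragment of the driver's mmap handler: expose the local
 * remotely writable region to user space as uncacheable memory.
 * Base address and size are illustrative assumptions. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>

#define RW_REGION_PHYS 0x140000000ULL  /* assumed physical base (5 GB) */
#define RW_REGION_SIZE (1UL << 30)     /* assumed 1 GB window */

static int tcc_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > RW_REGION_SIZE)
        return -EINVAL;

    /* UC mapping: remote posted writes land here without coherency,
     * so local reads must bypass the cache to see fresh data. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           RW_REGION_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations tcc_fops = {
    .owner = THIS_MODULE,
    .mmap  = tcc_mmap,
};
```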
Slide 14: Memory Layout
[Memory map, mirrored on BOX 0 and BOX 1: 0-4 GB local DRAM of node 0 and node 1 (WB, with a DRAM hole); between 4 GB and 6 GB the MMIO window to the peer box (WC) and the local remotely writable region (RW mem, UC), with the two regions arranged oppositely on the two boxes]
A write-combining mapping sketch for the outgoing window follows.
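Complementing the UC receive mapping above, a short kernel-side sketch of how the outgoing MMIO window could be mapped write-combining, so consecutive small stores are merged into larger posted writes on the link. Base address, size, and function name are again assumptions.

```c
#include <linux/io.h>

#define MMIO_WINDOW_PHYS 0x100000000ULL  /* assumed base of the peer window (4 GB) */
#define MMIO_WINDOW_SIZE (1UL << 30)     /* assumed 1 GB window */

/* Map the window to the peer box as write-combining: stores collect in
 * the WC buffers and are emitted as larger posted writes on the link. */
static void __iomem *tcc_map_tx_window(void)
{
    return ioremap_wc(MMIO_WINDOW_PHYS, MMIO_WINDOW_SIZE);
}
```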
Slide 15: Bandwidth – HT800 (16 bit)
[Bandwidth plot] Single-thread message rate: 142 million messages/s (a measurement sketch follows)
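A generic outline of how a single-thread message-rate figure of this kind is typically measured: issue back-to-back small stores into the mapped window and divide by elapsed time. This is not the authors' benchmark; the window pointer follows the earlier sketches and the TSC-frequency handling is simplified.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_sfence */

/* Single-thread message rate: back-to-back 8-byte stores into the
 * write-combining window 'win' (assumed >= 32 KB, see earlier sketch). */
static double message_rate(volatile uint64_t *win, uint64_t iters,
                           double tsc_hz)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        win[i & 4095] = i;       /* each store is one small message */
    _mm_sfence();                /* drain WC buffers before stopping the clock */
    uint64_t end = __rdtsc();
    return (double)iters * tsc_hz / (double)(end - start);
}
```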
Slide 16: Latency – HT800 (16 bit)
[Latency plot] 227 ns software-to-software half round trip (a ping-pong sketch follows)
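A sketch of how a software-to-software half-round-trip latency can be measured in the remote-store model: the two nodes ping-pong a sequence number through each other's writable regions, and half the per-iteration time is reported. Pointer and parameter names follow the earlier hypothetical sketches.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_sfence, _mm_pause */

/* Ping-pong latency: 'tx' is our window into the peer's memory,
 * 'rx' is our own remotely writable memory written by the peer.
 * Run with is_initiator=1 on one node and 0 on the other. */
static double half_roundtrip_ns(volatile uint64_t *tx, volatile uint64_t *rx,
                                uint64_t iters, double tsc_ns, int is_initiator)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 1; i <= iters; i++) {
        if (is_initiator) {
            *tx = i; _mm_sfence();          /* ping */
            while (*rx != i) _mm_pause();   /* wait for pong */
        } else {
            while (*rx != i) _mm_pause();   /* wait for ping */
            *tx = i; _mm_sfence();          /* pong */
        }
    }
    uint64_t end = __rdtsc();
    /* One iteration is a full round trip; half of it is the one-way latency. */
    return (double)(end - start) * tsc_ns / (2.0 * (double)iters);
}
```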
Slide 17: Conclusion
- Introduced a novel tightly coupled interconnect
- "Virtually" moved the NIC into the CPU
- Order-of-magnitude latency improvement
- Scalable
Next steps:
- MPI over RSM support
- Own mainboard with multiple links
Slide 18: References
[Asanovic, 2006] Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Technical Report, 2006.
[Exascale Rep.] ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems.
[Patterson, 2004] Patterson, D. Latency Lags Bandwidth. Communications of the ACM, vol. 47, no. 10, pp. 71-75, October 2004.
Slide 19: Routing
Traditional system: all nodes have the same view of memory.
[Diagram: DRAM address ranges x00-x0F, x10-x1F, x20-x2F, x30-x3F and an I/O range x50-x5F; CPUs connected by cHT links, I/O devices attached via ncHT]
Slide 20: Routing
Our approach: each CPU has its own view of memory, with one coherent node 0 and four I/O links.
[Diagram: DRAM ranges x00-x0F, x10-x1F, x20-x2F, x30-x3F reached over ncHT links]
Slide 21: Routing in the Opteron Fabric
The type of an HT packet (posted, non-posted, cHT, ncHT) is determined by the SRQ based on:
- MTRRs
- GART
- Top-of-memory register
- I/O and DRAM range registers
Routing is determined by the northbridge based on:
- Routing table registers
- MMIO base/limit registers
- Coherent link traffic distribution register
A hedged sketch of inspecting these registers follows.
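For illustration, a sketch that dumps the northbridge MMIO base/limit register pairs from Linux. On Opteron systems the address-map function of node 0's northbridge appears as PCI device 00:18.1; the 0x80-0xB8 offsets for the eight MMIO base/limit pairs follow my reading of the AMD BKDG and should be treated as assumptions to verify.

```c
/* Hedged sketch: dump the northbridge MMIO base/limit register pairs
 * that implement interval routing. Offsets are assumed (verify against
 * the BKDG for your CPU family); requires root for full config access. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:18.1/config", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 8; i++) {            /* 8 MMIO base/limit pairs */
        uint32_t base = 0, limit = 0;
        if (pread(fd, &base,  4, 0x80 + 8 * i) != 4 ||
            pread(fd, &limit, 4, 0x84 + 8 * i) != 4)
            break;
        printf("MMIO pair %d: base=0x%08x limit=0x%08x\n", i, base, limit);
    }
    close(fd);
    return 0;
}
```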
Slide 22: Transaction Example
1. Core 0 performs a write to an I/O address; it is forwarded to the X-Bar via the SRQ.
2. The X-Bar forwards it to the I/O bridge, which converts it into a posted write.
3. The X-Bar forwards it to the I/O link.
4. On the receiving node, the X-Bar forwards it to the I/O bridge, which converts it into a coherent sized write (sizedWr).
5. The X-Bar forwards it to the memory controller.
Slide 23: Topology and Addressing
[Diagram: grid of 36 nodes, numbered 1-36]
Example interval routing tables for two nodes (an interval routing sketch follows below):
- 1-12 -> Top; 13-15 -> Left; 17-18 -> Right; 19-24 -> Down
- 1-30 -> Top; 31-34 -> Left; 36 -> Right; null -> Down
Limited possibilities, as the Opteron only supports 8 address range registers.
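A minimal sketch of interval routing as described above: each node holds a small table mapping contiguous destination intervals to output links, which is what the limited set of address range registers effectively encodes. Type names are illustrative; the example table uses the first interval set from the slide.

```c
#include <stdio.h>

/* Interval routing sketch: destinations are mapped to output links by
 * contiguous ranges, mirroring the Opteron's address range registers
 * (max. 8 entries per node). */
enum link { LINK_TOP, LINK_LEFT, LINK_RIGHT, LINK_DOWN, LINK_NONE };

struct route_entry {
    int first, last;   /* inclusive destination interval */
    enum link out;     /* output link for that interval  */
};

/* Example table from the slide: 1-12 Top, 13-15 Left, 17-18 Right, 19-24 Down. */
static const struct route_entry example_table[] = {
    {  1, 12, LINK_TOP   },
    { 13, 15, LINK_LEFT  },
    { 17, 18, LINK_RIGHT },
    { 19, 24, LINK_DOWN  },
};

static enum link route(const struct route_entry *t, int n, int dest)
{
    for (int i = 0; i < n; i++)
        if (dest >= t[i].first && dest <= t[i].last)
            return t[i].out;
    return LINK_NONE;  /* local delivery or unreachable */
}

int main(void)
{
    printf("dest 7  -> link %d\n", route(example_table, 4, 7));   /* Top  */
    printf("dest 20 -> link %d\n", route(example_table, 4, 20));  /* Down */
    return 0;
}
```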
Slide 24: Limitations
- Communication is PIO only: no DMA, no offloading
- No congestion management, no hardware barriers, no multicast, limited QoS, etc.
- Synchronous system: all Opterons require the same clock, so no COTS boxes
- Security issues: nodes can write directly to physical memory on any node
- Posted writes to remote memory do not have the coherency bit set, so no local caching possible?
Slide 25: How Does It Work? Minimalistic Linux Kernel (MinLin)
- 100 MB, runs in a ramdisk
- Boots over Ethernet or FILO
- Mounts home directories over SSH
- PCI subsystem, to access NB configuration
- Multicore/multiprocessor supported
- No hard disk, VGA, keyboard, ...
- No module support, no device drivers
- No swapping/paging