L2 to Off-Chip Memory Interconnects for CMPs Presented by Allen Lee CS258 Spring 2008 May 14, 2008.

Motivation
- In modern many-core systems, there is significant asymmetry between the number of cores and the number of memory access points
  - Tilera’s multiprocessor has 64 cores and only 4 memory controllers
- PARSEC benchmarks suggest that off-chip memory traffic increases with the number of cores for CMPs
- We explore mechanisms to lower latency and power consumption for the processor-memory interconnect

Tilera Tile64 x5

Tilera Tile64
- Five physical mesh networks
  - UDN, IDN, SDN, TDN, MDN
- TDN and MDN are used for handling memory traffic
- Memory requests transit TDN
  - Large store requests, small load requests
- Memory responses transit MDN
  - Large load responses, small store responses
  - Includes cache-to-cache transfers and off-chip transfers

Tapered Fat-Tree
- Good for many-to-few connectivity
  - Fewer hops: shorter latency
  - Fewer routers: less power, less area
- Root nodes directly connect to memory controller
- Replace MDN mesh network with two tapered fat-tree networks
  - One for routing requests up
  - One for routing responses down
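The "fewer hops" claim can be illustrated with a rough router-count comparison. This is a minimal sketch, not the presentation's own analysis: the controller attach points on the 8x8 mesh, the tree radix of 4, and the function names are all assumptions made for illustration.

```python
from itertools import product
from math import ceil

def mesh_avg_routers(side, attach_points):
    """Average number of routers traversed from each tile to its nearest
    controller attach point, assuming minimal (Manhattan) routing."""
    total = 0
    for x, y in product(range(side), repeat=2):
        dist = min(abs(x - ax) + abs(y - ay) for ax, ay in attach_points)
        total += dist + 1  # +1 counts the source tile's own router
    return total / (side * side)

def tree_router_levels(num_tiles, num_roots, radix):
    """Router levels between a tile and its root when each root serves
    num_tiles / num_roots tiles through routers that merge `radix` inputs."""
    n = ceil(num_tiles / num_roots)
    levels = 0
    while n > 1:
        n = ceil(n / radix)
        levels += 1
    return levels

# Hypothetical placement: two controllers on the top edge, two on the bottom.
ATTACH = [(2, 0), (5, 0), (2, 7), (5, 7)]

mesh_hops = mesh_avg_routers(8, ATTACH)
tree_hops = tree_router_levels(num_tiles=64, num_roots=4, radix=4)
print(f"mesh: {mesh_hops:.2f} routers on average, tree: {tree_hops} routers")
```

Under these assumptions every tile reaches a root through the same small number of router levels, whereas the mesh average grows with distance from the controllers.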

Tile64 with Tapered Fat Tree

Memory Model
- Directory-based cache coherence
- Directory cache at every node
- Off-chip directory controller
- Tile-to-tile requests and responses transit the TDN
- Off-chip memory requests and responses transit the MDN

TDN and MDN Traffic for L2 Read Misses

Synthetic Benchmarks
- Statistical simulation
  - Model benchmarks from PARSEC suite
  - Based on off-chip traffic for 64-byte cache-line for 64 cores

Benchmark       Lines off-chip/cycle   Loads   Stores
streamcluster   0.0266                 99%     1%
canneal         0.0189                 70%     30%
blackscholes    9.38e-5                20%     80%
x264            0.0025                 70%     30%

[Figure: benchmarks placed by working set size (small to large) and sharing (more to less)]
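One way to read "statistical simulation" is as a random traffic generator driven by the rates in the table. The sketch below is an assumption about how such a generator could look, not the presenter's simulator: it treats each rate as the chip-wide aggregate, draws at most one cache line per cycle (a Bernoulli arrival), and picks the requesting core uniformly at random. The names `BENCHMARKS`, `generate_trace`, and `offchip_bytes_per_cycle` are hypothetical.

```python
import random

LINE_BYTES = 64  # cache-line size from the slide

# (lines off-chip per cycle, fraction of loads), copied from the table
BENCHMARKS = {
    "streamcluster": (0.0266, 0.99),
    "canneal": (0.0189, 0.70),
    "blackscholes": (9.38e-5, 0.20),
    "x264": (0.0025, 0.70),
}

def generate_trace(name, cycles, num_cores=64, seed=0):
    """Return a list of (cycle, core, kind) off-chip requests.

    Assumption: the per-cycle rate is the aggregate over all cores, and
    at most one line goes off-chip in any cycle."""
    rate, load_frac = BENCHMARKS[name]
    rng = random.Random(seed)
    trace = []
    for cycle in range(cycles):
        if rng.random() < rate:
            kind = "load" if rng.random() < load_frac else "store"
            core = rng.randrange(num_cores)
            trace.append((cycle, core, kind))
    return trace

def offchip_bytes_per_cycle(name):
    """Average off-chip demand implied by the table, in bytes per cycle."""
    rate, _ = BENCHMARKS[name]
    return rate * LINE_BYTES

trace = generate_trace("streamcluster", cycles=100_000)
print(len(trace), "requests;", offchip_bytes_per_cycle("streamcluster"), "bytes/cycle")
```

A trace like this could then be fed to a network model of either topology to compare latency under identical load.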

Breakdown of Average Latency
- Latency of memory-intensive applications is dominated by queuing delay
- Benchmarks with little off-chip traffic save on transit time

Power Modeling
- Orion power simulator for on-chip routers from Princeton University
- Models switching power as sum of
  - Buffer power
  - Crossbar power
  - Arbitration power
- Specify parameters
  - Activity factor, number of input and output ports, virtual channels, size of input buffer, etc.
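The "sum of three components" structure can be sketched with the standard dynamic-power relation P = a * C * Vdd^2 * f applied per component. This is only an illustration of the decomposition and is not Orion's model or API: the way capacitance scales with ports and flit width, the per-unit capacitance arguments, and every name below are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RouterConfig:
    inputs: int
    outputs: int
    flit_bits: int
    buffer_depth: int  # flits per input buffer (hypothetical parameter)
    activity: float    # activity factor between 0 and 1

def dynamic_power(cap_farads, vdd, freq_hz, activity):
    """Switching power of a capacitance toggled with the given activity."""
    return activity * cap_farads * vdd ** 2 * freq_hz

def router_power(cfg, vdd, freq_hz, cap_buffer_bit, cap_crosspoint_bit, cap_arbiter_req):
    """Toy estimate: total = buffer + crossbar + arbitration.

    The capacitance scaling is a placeholder chosen for illustration;
    a real simulator derives it from circuit-level models."""
    buffer_cap = cfg.inputs * cfg.buffer_depth * cfg.flit_bits * cap_buffer_bit
    crossbar_cap = cfg.inputs * cfg.outputs * cfg.flit_bits * cap_crosspoint_bit
    arbiter_cap = cfg.inputs * cfg.outputs * cap_arbiter_req
    parts = {
        "buffer": dynamic_power(buffer_cap, vdd, freq_hz, cfg.activity),
        "crossbar": dynamic_power(crossbar_cap, vdd, freq_hz, cfg.activity),
        "arbitration": dynamic_power(arbiter_cap, vdd, freq_hz, cfg.activity),
    }
    parts["total"] = parts["buffer"] + parts["crossbar"] + parts["arbitration"]
    return parts

# A 5-in, 5-out, 32-bit router at 1.0 V and 750 MHz;
# the 1 fF capacitances are placeholders, not measured values.
cfg = RouterConfig(inputs=5, outputs=5, flit_bits=32, buffer_depth=4, activity=0.1)
print(router_power(cfg, vdd=1.0, freq_hz=750e6,
                   cap_buffer_bit=1e-15, cap_crosspoint_bit=1e-15, cap_arbiter_req=1e-15))
```

Because each term is linear in the activity factor, lower traffic through a router lowers its estimated switching power proportionally in this sketch.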

Tilera MDN Routers

Number   Inputs   Outputs   Width
4        3        3         32 bits
24       4        4         32 bits
36       5        5         32 bits
4        4        1         32 bits in, 64 bits out
4        1        4         64 bits in, 32 bits out

Tree Routers
Router  Number  Inputs  Outputs  Width
16 422432 884444 448118 32 in 64 out 64 in 32 out

Parameters
- 100 nm CMOS process
- VDD = 1.0 V
- Clock frequency = 750 MHz
- 32-bit flit width

Conclusion
- Physical design of the tapered fat-tree is more difficult
- The TFT topology can reduce memory latency and power dissipation for many-core systems
