Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
Presented by: Sarah Lynn Bird
Scalar Operand Networks
"A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph"
Two components:
- Physical interconnection network
- Operation-operand matching system
Example Scalar Operand Networks
- Register file of a conventional processor
- Raw microprocessor
Design Issues
- Delay scalability
  - Intra-component delay
  - Inter-component delay
  - Managing latency
- Bandwidth scalability
- Deadlock and starvation
- Efficient operation-operand matching
- Handling exceptional events
Operation-Operand Matching
5-tuple of costs: <SO, SL, NHL, RL, RO>
- SO, send occupancy: the number of cycles the sending ALU wastes in sending an operand
- SL, send latency: the number of cycles of delay the message incurs on the send side of the network
- NHL, network hop latency: the number of cycles of delay per network hop
- RL, receive latency: the number of cycles of delay between the arrival of the final input operand and the issue of the consuming instruction
- RO, receive occupancy: the number of cycles the receiving ALU wastes in employing a remote value
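To make the 5-tuple concrete, here is a minimal Python sketch that treats the end-to-end cost of delivering one operand over h hops as the additive sum SO + SL + h*NHL + RL + RO, as the component definitions suggest. The two example tuples are purely illustrative, not measured values from the paper.

# A minimal sketch of the 5-tuple cost model. Components are assumed to
# add: total = SO + SL + hops*NHL + RL + RO. Tuples below are illustrative.

from typing import NamedTuple

class FiveTuple(NamedTuple):
    so: int   # send occupancy: ALU cycles wasted on the send side
    sl: int   # send latency: delay getting the message into the network
    nhl: int  # network hop latency: delay per hop
    rl: int   # receive latency: last-operand arrival to instruction issue
    ro: int   # receive occupancy: ALU cycles wasted employing the value

def operand_cost(t: FiveTuple, hops: int) -> int:
    """End-to-end cycles to move one operand across 'hops' hops,
    assuming no contention."""
    return t.so + t.sl + hops * t.nhl + t.rl + t.ro

tight = FiveTuple(so=0, sl=0, nhl=1, rl=2, ro=0)  # Raw-like; hypothetical
loose = FiveTuple(so=3, sl=2, nhl=1, rl=1, ro=7)  # message-passing-like; hypothetical

for h in (1, 4, 8):
    print(h, operand_cost(tight, h), operand_cost(loose, h))

Note that the distance-dependent term (h*NHL) is identical for both tuples; the gap comes entirely from the fixed per-operand overheads, which is exactly what the occupancy experiments later probe.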
Raw Design
- 16 cores (tiles) on a chip
- Per tile: 8-stage in-order single-issue pipeline, 4-stage pipelined FPU, 32KB data cache, 32KB instruction cache
- 2 static networks: point-to-point operand transport, with instructions from a 64KB cache
- 2 dynamic networks: memory traffic, interrupts, user-level messages
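Since the 16 tiles form a 4x4 mesh, a tiny helper can bound in-network hop latency. Assuming dimension-ordered routing (so hops equal Manhattan distance) and a row-major tile numbering, both of which are illustrative assumptions rather than details from the paper:

# Hop count between tiles on a 4x4 mesh, assuming XY routing so that
# hops = Manhattan distance. Row-major tile ids 0..15 are an assumption.

def hops(src: int, dst: int, width: int = 4) -> int:
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

# Worst case is corner to corner: 6 hops, so with NHL = 1 an operand
# pays at most 6 cycles of hop latency on the 16-tile chip.
assert hops(0, 15) == 6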
Experiments
- Beetle: a cycle-accurate simulator
  - Models the actual scalar operand network and a parameterized scalar operand network without contention
  - Data cache misses modeled correctly; instruction cache misses assumed not to occur
- Memory model: the compiler maps memory to tiles; each location has one home site
- Benchmarks: from Spec92, Spec95, and the Raw benchmark suite; dense matrix codes plus one secure hash algorithm
Benchmark Scaling

Benchmark speedups relative to the one-tile run:

Benchmark        2      4      8     16     32     64
cholesky       1.622  3.234  5.995  9.185 11.898 12.934
vpenta         1.714  3.112  6.093 12.132 24.172 44.872
mxm            1.933  3.731  6.207  8.900 14.836 20.472
fpppp-kernel   1.511  3.336  5.724  6.143  5.988  6.536
sha            1.123  1.955  1.976  2.321  2.536  2.523
swim           1.601  2.624  4.691  8.301 17.090 28.889
jacobi         1.430  2.757  4.953  9.304 15.881 22.756
life           1.807  3.365  6.436 12.049 21.081 36.095
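The scaling trends are easier to read as parallel efficiency (speedup divided by tile count). The short script below uses only the numbers in the table above:

# Parallel efficiency computed from the table above; the data is
# transcribed verbatim, nothing here goes beyond the table itself.

speedups = {
    "cholesky":     [1.622, 3.234, 5.995,  9.185, 11.898, 12.934],
    "vpenta":       [1.714, 3.112, 6.093, 12.132, 24.172, 44.872],
    "mxm":          [1.933, 3.731, 6.207,  8.900, 14.836, 20.472],
    "fpppp-kernel": [1.511, 3.336, 5.724,  6.143,  5.988,  6.536],
    "sha":          [1.123, 1.955, 1.976,  2.321,  2.536,  2.523],
    "swim":         [1.601, 2.624, 4.691,  8.301, 17.090, 28.889],
    "jacobi":       [1.430, 2.757, 4.953,  9.304, 15.881, 22.756],
    "life":         [1.807, 3.365, 6.436, 12.049, 21.081, 36.095],
}
tiles = [2, 4, 8, 16, 32, 64]

for name, s in speedups.items():
    eff = [sp / t for sp, t in zip(s, tiles)]
    print(f"{name:13s}" + "".join(f"{e:7.2f}" for e in eff))

vpenta holds roughly 70% efficiency at 64 tiles while sha falls below 4%, matching the split between the dense-matrix codes and the serial hash.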
Effect of Send & Receive Occupancy
- 64 tiles, parameterized network without contention
- Varying send occupancy: <n, 1, 1, 1, 0>
- Varying receive occupancy: <0, 1, 1, 1, n>
Effect of Send or Receive Latencies
- Applications with coarser-grained parallelism are less sensitive to send/receive latencies
- Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies (a toy model of why follows below)
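One intuition for this result, sketched as a toy model that is entirely my own and not from the paper: occupancy consumes ALU issue slots on every communication event and cannot be hidden, while latency can largely be overlapped with independent work by the scheduler. Assuming a fixed fraction of latency is overlapped:

# A toy model (not the paper's) of occupancy vs. latency sensitivity.
# Occupancy adds to every communication; only the un-overlapped fraction
# of latency reaches the critical path. All parameters are hypothetical.

def run_time(ops, comms, occupancy, latency, overlap=0.8):
    occ_cost = comms * occupancy                  # burns issue slots outright
    lat_cost = comms * latency * (1.0 - overlap)  # mostly hidden by scheduling
    return ops + occ_cost + lat_cost

base = run_time(1000, 200, occupancy=0, latency=1)
for n in (1, 2, 4, 8):
    slow_occ = run_time(1000, 200, occupancy=n, latency=1) / base
    slow_lat = run_time(1000, 200, occupancy=0, latency=n) / base
    print(f"n={n}: occupancy slowdown {slow_occ:.2f}x, latency slowdown {slow_lat:.2f}x")

Under these assumed knobs, an occupancy of 8 cycles slows the run by roughly 2.5x while the same latency slows it by under 1.3x, which is the qualitative shape of the paper's result.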
Other Experiments
- Removing contention
- Increasing hop latency
- Comparing with other networks
Conclusions
- Many difficult issues arise in designing scalar operand networks
- Send and receive occupancies have the biggest impact on performance
- Network contention, multicast, and send/receive latencies have a smaller impact