Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal

1 Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal Presented By: Sarah Lynn Bird

2 Scalar Operand Networks
“A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph.” Two components: the physical interconnection network and the operation-operand matching system.

3 Example Scalar Operand Networks
Register file; Raw microprocessor

4 Design Issues
Delay scalability: intra-component delay, inter-component delay, and managing latency
Bandwidth scalability
Deadlock and starvation
Efficient operation-operand matching
Handling exceptional events

5 Operation-Operand Matching
Costs are summarized as a 5-tuple <SO, SL, NHL, RL, RO>:
SO (send occupancy): the number of cycles the ALU wastes in sending
SL (send latency): the number of cycles of delay for the message on the send side of the network
NHL (network hop latency): the number of cycles of delay per network hop
RL (receive latency): the number of cycles of delay between when the final input arrives and when the instruction is consumed
RO (receive occupancy): the number of cycles the ALU wastes before employing a remote value
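The 5-tuple components above add up to an end-to-end operand transport cost. A minimal sketch (the function name and the example tuple values are illustrative, not from the paper; the tuples mirror the parameterized forms used later in the experiments):

```python
def operand_transport_cost(so, sl, nhl, rl, ro, hops):
    """Cycles from when the sender could issue an operand until the
    receiver can use it, under the <SO, SL, NHL, RL, RO> cost model.
    Per-hop latency is paid once for each of the `hops` network hops."""
    return so + sl + nhl * hops + rl + ro

# Example: the <0, 1, 1, 1, 0> tuple over 3 hops costs
# 0 + 1 + 1*3 + 1 + 0 = 5 cycles.
print(operand_transport_cost(0, 1, 1, 1, 0, hops=3))  # → 5
```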

6 Raw Design
16 cores on a chip, each with:
An 8-stage in-order single-issue pipeline
A 4-stage pipelined FPU
A 32KB data cache and a 32KB instruction cache
2 static networks: point-to-point operand transport, instructions from a 64KB cache
2 dynamic networks: memory traffic, interrupts, user-level messages

7 Experiments
Beetle, a cycle-accurate simulator, run in two configurations:
The actual scalar operand network
A parameterized scalar operand network without contention
Data cache misses are modeled correctly; instruction cache misses are assumed not to occur
Memory model: the compiler maps memory to tiles; each location has one home site
Benchmarks: from Spec92, Spec95, and the Raw benchmark suite (dense matrix codes plus the Secure Hash Algorithm)

8 Benchmark Scaling
Benchmark speedups on many tiles relative to the speed of the benchmark on one tile:

Benchmark         2      4      8     16     32     64
cholesky       1.622  3.234  5.995  9.185 11.898 12.934
vpenta         1.714  3.112  6.093 12.132 24.172 44.872
mxm            1.933  3.731  6.207  8.900 14.836 20.472
fpppp-kernel   1.511  3.336  5.724  6.143  5.988  6.536
sha            1.123  1.955  1.976  2.321  2.536  2.523
swim           1.601  2.624  4.691  8.301 17.090 28.889
jacobi         1.430  2.757  4.953  9.304 15.881 22.756
life           1.807  3.365  6.436 12.049 21.081 36.095

9 Effect of Send & Receive Occupancy
64 tiles, parameterized network without contention, with cost tuples <n, 1, 1, 1, 0> (varying send occupancy) and <0, 1, 1, 1, n> (varying receive occupancy)

10 Effect of Send or Receive Latencies
Applications with coarser-grain parallelism are less sensitive to send/receive latencies. Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies.

11 Other Experiments
Removing contention
Increasing hop latency
Comparing with other networks

12 Conclusions
Designing scalar operand networks raises many difficult issues
Send and receive occupancies have the biggest impact on performance
Network contention, multicast, and send/receive latencies have a smaller impact

