Published by Dora Stevens. Modified over 9 years ago.
1
Amalgam: a Reconfigurable Processor for Future Fabrication Processes
Nicholas P. Carter
University of Illinois at Urbana-Champaign
2
Performance = f(architecture, implementation)
[Figure: eight 1-D IDCT blocks laid out along a time axis, shown alongside instruction sequences made up of LD, ADD, MUL, and ST operations]
3
Efficient Implementation
Everything you give up in clock rate you have to make back in architectural efficiency
Wire delay is the big limiting factor in system architectures today
–Wires get slower relative to transistors as fabrication processes improve
Programmable processors are moving to deeper pipelines
–It is not good enough to just prevent wires from making reconfigurable logic slower
4
Amalgam
[Figure: block diagram with components labeled DRAM, Cache (Multi-Banked), Network, PCluster, and RCluster]
5
Reconfigurable Cluster Design
4 register banks
–8 registers/bank
4 reconfigurable logic segments
–8 rows x 32 LBs per segment
Array control unit (ACU)
Network interface
Counter-clockwise flow of computation through cluster
[Figure: cluster layout with four segments alternating with register banks, plus the ACU and network interface]
6
Reconfigurable Clock Rates
7
Unpipelined Critical Path
Latches in logic blocks are the only resource for pipelining
Vertical and horizontal wires carry data between logic blocks
–Wires have heavy loads, making them slower than their length would indicate
Effect on clock rate varies significantly with fabrication process
[Figure: critical path elements, in order: LB, FF, HWIRE, VWIRE, Bank, VWIRE, HWIRE, LB, FF]
8
Supporting Pipelining
Goal: make logic block delay the limiting factor on clock rate
Add configurable latches at each wire intersection
–Problem: different paths may have different latencies
Add retiming buffers at logic block inputs/outputs
Add network queues to reduce synchronization overhead
9
Pipelined Critical Path
Delay of individual wires < logic block delay in all processes studied
Add configurable pipeline latches at junctions between wires
Pipeline latches also added on carry chains within rows
[Figure: critical path elements, in order: LB, FF, HWIRE, VWIRE, Bank, VWIRE, HWIRE, FF, LB, FF]
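The effect of latching each wire junction can be illustrated with a small sketch. It assumes a simplified timing model in which the unpipelined cycle time is the sum of every element on the critical path, while the fully pipelined cycle time is set by the slowest single element. All delay values below are hypothetical placeholders chosen so that each wire is faster than the logic block; they are not figures from the talk, and latch overhead is ignored.

```python
# Hypothetical per-element delays in picoseconds (illustrative only,
# not measurements from the presentation).
CRITICAL_PATH_PS = [
    ("FF", 50),
    ("HWIRE", 300),
    ("VWIRE", 300),
    ("Bank", 200),
    ("VWIRE", 300),
    ("HWIRE", 300),
    ("LB", 400),
]


def unpipelined_cycle_ps(path):
    """Without latches on the wires, one cycle must cover the whole path."""
    return sum(delay for _, delay in path)


def pipelined_cycle_ps(path):
    """With a latch after every element, the slowest element sets the cycle.

    Latch setup and clock-to-output overhead is ignored in this sketch.
    """
    return max(delay for _, delay in path)


if __name__ == "__main__":
    slow = unpipelined_cycle_ps(CRITICAL_PATH_PS)
    fast = pipelined_cycle_ps(CRITICAL_PATH_PS)
    print(f"unpipelined cycle: {slow} ps, pipelined cycle: {fast} ps")
```

Under these assumed numbers the logic block is the limiting stage once the wires are latched, which is the stated goal of the pipelined design.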
10
Retiming Buffers
5-deep chain of latches added to each logic block input
–Similar structure added to LB output
Can “borrow” up to two cycles of additional delay from adjacent input
Total pipeline register overhead = 17%
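One way to picture what the retiming buffers are for: when operands reach a logic block over paths with different numbers of pipeline latches, the earlier arrivals must be delayed until the latest one shows up. The sketch below computes that padding per input. The function names are hypothetical, the depth of 5 and borrow limit of 2 come from the slide, and the borrowing rule is simplified: the code only checks that no input needs more than depth plus borrow cycles and does not model what the adjacent input gives up.

```python
def retiming_delays(arrival_cycles, depth=5, max_borrow=2):
    """Return the extra delay (in cycles) each input needs so all inputs align.

    arrival_cycles: cycle at which each operand reaches the logic block.
    depth: latches in each input's retiming chain (5 on the slide).
    max_borrow: extra cycles an input may borrow from its neighbour
                (2 on the slide); neighbour bookkeeping is not modelled.
    """
    latest = max(arrival_cycles)
    delays = [latest - arrival for arrival in arrival_cycles]
    for index, delay in enumerate(delays):
        if delay > depth + max_borrow:
            raise ValueError(
                f"input {index} needs {delay} cycles of delay, "
                f"more than the {depth + max_borrow} available"
            )
    return delays


def borrowed_cycles(delays, depth=5):
    """Cycles each input would have to borrow beyond its own chain."""
    return [max(0, delay - depth) for delay in delays]


if __name__ == "__main__":
    delays = retiming_delays([3, 9, 7])
    print("delays:", delays, "borrowed:", borrowed_cycles(delays))
```

In this example the first input arrives six cycles before the latest one, so it would use its full 5-deep chain and borrow one further cycle.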
11
Register Queues
[Figure: two diagrams of register files communicating over the network. The diagram labeled "Original Architecture" shows WRITE R8, Val1 and WRITE R8, Val2 together with a sync. message. The second diagram places a register queue in front of each register file and shows WRITE R8, Val1, WRITE R8, Val2, and EMPTY R8, with no sync. message.]
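The difference between the two diagrams can be sketched in software. This is only an illustration of the general idea that a queue in front of a register lets a sender issue back-to-back writes without waiting for the receiver. The class names, the queue capacity, and the return-value conventions are all assumptions, and the exact semantics of the EMPTY operation in Amalgam are not given in the slides.

```python
from collections import deque


class PlainRegister:
    """A register with no queue: a second write overwrites the first,
    so sender and receiver would need explicit synchronization."""

    def __init__(self):
        self.value = None

    def write(self, value):
        self.value = value

    def read(self):
        return self.value


class RegisterQueue:
    """A register fronted by a FIFO (hypothetical capacity of 4 entries)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._items = deque()

    def write(self, value):
        """Enqueue a value; return False if the sender would have to stall."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(value)
        return True

    def read(self):
        """Dequeue the oldest value; return None if the reader would stall."""
        if not self._items:
            return None
        return self._items.popleft()


if __name__ == "__main__":
    r8 = RegisterQueue()
    r8.write("Val1")
    r8.write("Val2")
    print(r8.read(), r8.read())
```

With the plain register, the receiver must consume Val1 before Val2 is written; with the queue, both writes can be in flight and are delivered in order.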
12
Implementing Pipelined Apps.
Logical vs. physical pipelining
–Logical: program-visible, uses array and registers
–Physical: only visible to ACU, uses pipeline registers on wires, retiming buffers
Take advantage of decoupling provided by queues
Applications use same reconfigurable logic configurations in different fab. processes
–Only FSM in ACU changes
–Applications to portability, managing intra-die variation
13
Experimental Methodology
Programs simulated using Amalsim
–Set each cluster’s clock rate independently
Benchmarks: IDCT, Rijndael, DNA comparison
–Fine-grained version of each benchmark does one computation
–Medium-grained version performs four independent computations
Programmable cluster clock rates based on ITRS
–Limit stages to 7 FO4 delays, slightly more aggressive than ITRS
Logic block latencies, wire lengths taken from circuit-level design of reconf. cluster in 180nm CMOS
–Convert logic block delay to FO4, scale by FO4 delay of each fabrication process
–Scale wire length based on fabrication process, simulate wire delay in SPICE
–Pipeline such that reconf. cluster cycle time is determined by logic block delay
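The FO4 scaling steps above amount to simple arithmetic, sketched here. The function names are hypothetical, and the FO4 delays and logic block delay in the example are placeholder values, not numbers from the talk. Only the 7 FO4 stage limit is taken from the slide.

```python
def delay_in_fo4(delay_ps, fo4_ps_reference):
    """Express a delay measured in the reference process as a count of FO4 delays."""
    return delay_ps / fo4_ps_reference


def scaled_delay_ps(delay_fo4, fo4_ps_target):
    """Scale an FO4-normalized delay to a target fabrication process."""
    return delay_fo4 * fo4_ps_target


def programmable_clock_ghz(fo4_ps, fo4_per_stage=7):
    """Clock rate of a programmable cluster limited to fo4_per_stage FO4 per stage."""
    period_ps = fo4_per_stage * fo4_ps
    return 1000.0 / period_ps  # 1000 ps per ns, so 1/ns = GHz


if __name__ == "__main__":
    # Placeholder values: FO4 = 60 ps in the reference process,
    # FO4 = 30 ps in a hypothetical future process, logic block = 600 ps.
    lb_fo4 = delay_in_fo4(600.0, 60.0)
    lb_future_ps = scaled_delay_ps(lb_fo4, 30.0)
    print(f"logic block: {lb_fo4} FO4, {lb_future_ps} ps in target process")
    print(f"programmable clock: {programmable_clock_ghz(30.0):.2f} GHz")
```

Normalizing to FO4 keeps the logic block delay expressed relative to transistor speed, so only the per-process FO4 delay has to change from one fabrication process to the next.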
14
Pipelined Clock Rates
15
Fine-Grained Benchmark Perf.
Reconfigurable version maintains about 20% perf. improvement over programmable in all fab. processes
Pipelining gives only a small benefit
Majority of speedup comes from reduction in memory references
16
Medium-Grain Benchmark Perf.
Pipelined architecture sees 2.6x perf. improvement over programmable
Unpipelined architecture sees only minor improvement over programmable
–Greater parallelism means more ability to tolerate memory delays
17
Limit Studies
Hypothesis: reduced memory operations account for much of the benefit for small tasks
–Study limit where memory latency = 1
–Also test theory that streaming benchmarks have enough parallelism to cover latency
Understand how much clock rate of reconfigurable unit affects performance
–Model reconfigurable unit at same clock rate as programmable clusters
–Completely unreasonable for unpipelined
–Might be an indicator of what industry could do with pipelined
18
Unpipelined Fine-Grained
Removing memory latencies makes programmable performance similar to reconfigurable
Latency of reconfig. clusters has a large impact on performance: there is no parallelism to cover latency
19
Pipelined Fine-Grained
Results similar to unpipelined
–Benefit still mostly from memory reduction
20
Unpipelined Medium-Grain
Eliminating memory latencies substantially helps programmable
Latency of reconf. logic is an even bigger problem
–Programmable clusters can exploit parallelism through pipelines
21
Pipelined Medium-Grain
Impact of memory system on reconfigurable performance very small
Less benefit from increasing reconfigurable cluster clock rate
–With even small amounts of parallelism, throughput becomes more important than latency
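The throughput-versus-latency point can be illustrated with textbook pipeline arithmetic. The sketch assumes an idealized reconfigurable datapath that is either unpipelined (one long cycle per computation) or split into equal stages that accept a new independent computation every cycle. The stage count is a hypothetical parameter and latch overhead is ignored. The ratio computed here compares pipelined against unpipelined reconfigurable logic, so it is not the 2.6x figure reported against the programmable clusters.

```python
def unpipelined_time(num_tasks, stages, stage_time=1.0):
    """Each task occupies the whole datapath for stages * stage_time."""
    return num_tasks * stages * stage_time


def pipelined_time(num_tasks, stages, stage_time=1.0):
    """First result after `stages` cycles, then one more result per cycle."""
    return (stages + num_tasks - 1) * stage_time


def pipelining_speedup(num_tasks, stages):
    """Idealized speedup of the pipelined datapath over the unpipelined one."""
    return unpipelined_time(num_tasks, stages) / pipelined_time(num_tasks, stages)


if __name__ == "__main__":
    # One computation (fine-grained) vs. four independent ones (medium-grained),
    # on a hypothetical 5-stage datapath.
    for tasks in (1, 4):
        print(tasks, "task(s):", pipelining_speedup(tasks, stages=5))
```

With a single computation the pipeline offers no gain in this model, since latency is unchanged. With four independent computations the stages overlap, which is why throughput starts to matter more than latency.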
22
Future Directions
ASIC-like performance with programmable systems
–ASICs typically get 100x better performance per unit area than microprocessors
Application-specific memory systems in a programmable chip
–Transform memory references into communication
–Create natural division of programs into regular and irregular blocks
23
Conclusion
Reconfigurable computing must provide both speedup from custom logic and high clock rates to succeed
Amalgam does this by limiting and tolerating wire delay at multiple levels
–Clustered architecture
–Segmented reconfigurable unit
–Pipeline wire delays
Result: 2.6x speedup over 8-way CMP in current and future fabrication processes