1
NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†
† Dept. of Electrical Engineering, Princeton University
‡ Dept. of Electrical and Computer Engineering, Queen’s University
2
Outline
- Temporal logic folding
- Background on NRAMs
- Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
- NanoMap: design optimization flow
- Experimental results
- Conclusions
3
Temporal Logic Folding
- Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
- Example functions:
  i = abc’
  l = (i’ + e’ + f’)h’
  OUT = d’g’ + l
[Figure: LUT 1, LUT 2 and LUT 3 with a MEM block, illustrating the functions above]
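The folding idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the NATURE hardware model: the names `make_lut`, `CFG_I`, `CFG_L`, `CFG_OUT` and `folded_eval` are hypothetical, and the assignment of one function per cycle is an assumption based on the data dependencies (i feeds l, l feeds OUT).

```python
from itertools import product

def make_lut(func, n_inputs):
    """Return the truth table (configuration bits) of an n-input Boolean function."""
    return {bits: func(*bits) for bits in product((0, 1), repeat=n_inputs)}

# Configurations for the three slide functions; x' is written as (1 - x).
CFG_I = make_lut(lambda a, b, c: a & b & (1 - c), 3)
CFG_L = make_lut(lambda i, e, f, h: ((1 - i) | (1 - e) | (1 - f)) & (1 - h), 4)
CFG_OUT = make_lut(lambda d, g, l: ((1 - d) & (1 - g)) | l, 3)

def folded_eval(a, b, c, d, e, f, g, h):
    """Evaluate OUT on ONE physical LUT that is reconfigured every cycle."""
    lut = CFG_I                    # cycle 1: load the configuration for i
    ff_i = lut[(a, b, c)]          # intermediate result held in a flip-flop
    lut = CFG_L                    # cycle 2: same LUT, new configuration
    ff_l = lut[(ff_i, e, f, h)]
    lut = CFG_OUT                  # cycle 3: same LUT again
    return lut[(d, g, ff_l)]
```

The three configurations play the role of the reconfiguration bits stored in memory; the single `lut` variable stands for the one resource that is reused across cycles.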
4
Overview of NATURE
[Figure labels: NATURE; CMOS fabrication compatible; NRAM-based; run-time reconfiguration; temporal logic folding; design flexibility; logic density]
- Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
- Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
- Area-delay trade-off flexibility
- More than an order of magnitude increase in logic density
- More than an order of magnitude reduction in area-time product
- Comparisons assume NRAMs and CMOS logic implemented in the same technology
- Non-volatility: useful in low-power and secure processing
5
Overview of NATURE (Contd.)
- Challenges in nano-circuits/architectures
- Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
- Lack of a mature fabrication process
- Fabrication defects and run-time failures (between 1% and 10%)
- Regular, reconfigurable architectures, such as an FPGA, are favored: they facilitate fabrication and provide fault tolerance through reconfiguration
- NATURE: fabricatable using a CMOS-compatible fabrication process
6
Non-volatile nanotube random-access memory (NRAM)
- NRAM™ by Nantero
- Whether the nanotube is mechanically bent or not determines the bistable on/off states
- A same-polarity or opposite-polarity voltage is applied to change the state
- CMOS-compatible fabrication process
- 10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
Source: http://www.nantero.com/nram.html
7
NRAMs
- Properties of NRAMs: non-volatile; similar speed to SRAM; similar density to DRAM; chemically and mechanically stable
- NATURE is not tied to NRAMs; alternatives: phase-change RAM, magnetoresistive RAM, ferroelectric RAM
8
Architecture of NATURE
- Island-style logic blocks (LBs) connected by various levels of interconnects
- An LB contains a super macroblock (SMB) and a local switch matrix
9
Architecture of a Super Macroblock (SMB)
- n1 macroblocks (MBs) comprise an SMB; here n1 = 4
10
Architecture of a Macroblock (MB)
- n2 logic elements (LEs) comprise an MB; here n2 = 4
11
Logic Element (Basic Configuration)
- An LE implements a computation and contains an m-input look-up table (LUT) and l flip-flops
- The input to each flip-flop is selected between the LUT output and a primary input
12
Folding Levels
- Logic folding at different levels of granularity provides flexibility to perform area-delay trade-offs
- Level-p folding: LEs are reconfigured after the execution of p LUT computations
- Reconfiguration time: 160 ps
- A larger folding level typically decreases delay and increases area
[Figure: (a) level-1 folding; (b) level-2 folding]
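The delay side of the trade-off can be made concrete with a rough estimate. This is a sketch under stated assumptions, not the delay model used by the authors: a folding cycle is taken to be p LUT delays plus one 160 ps reconfiguration, interconnect delay is ignored, and the function name `folding_estimate` and the LUT delay passed to it are hypothetical.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide

def folding_estimate(logic_depth, p, lut_delay_ps):
    """Rough delay estimate for level-p folding.

    Assumes one folding cycle = p LUT delays + one reconfiguration,
    and ignores interconnect delay (an assumption, not from the slides).
    """
    stages = math.ceil(logic_depth / p)          # folding stages needed
    cycle_ps = p * lut_delay_ps + RECONFIG_PS    # delay of one folding cycle
    return stages, cycle_ps, stages * cycle_ps
```

For example, with an assumed LUT delay of 500 ps and a logic depth of 8, level-1 folding needs 8 stages while level-4 folding needs 2; the larger level pays the reconfiguration overhead fewer times, which is why it tends to be faster, at the price of more LUTs being resident at once.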
13
Design Optimization Flow: NanoMap
- Optimize and implement a design on NATURE
- Integrate temporal logic folding
- Choose a proper folding level
- Use the force-directed scheduling (FDS) technique to balance resource usage across folding cycles
- Input design specified in register-transfer level (RTL) and/or gate-level VHDL
14
Motivational Example
- Different planes should have the same number of folding stages to guarantee global synchronization
- Key issue: how to achieve the optimization objective, that is, choose an appropriate folding level and assign the logic to folding stages
[Figure labels: level-1 register; level-2 register; plane; logic in plane; plane cycle; folding stage; folding cycle]
15
Motivational Example (Contd.)
- Example optimization objective: minimize circuit delay under an area constraint of 32 LEs
- Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
[Figure annotations: 50 LUTs; 14 flip-flops; 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; plane depth 9]
16
Iterative Design Flow
- Start with an initial guess for the folding level and iteratively refine it
- A large folding level gives better circuit delay, but at a large area cost
- Initial #folding stages: [equation not preserved]
- Initial folding level: [equation not preserved]
- Partition RTL modules into a series of connected LUT clusters whose logic depth is at most equal to the folding level
- Partitioning significantly speeds up the mapping procedure
17
Iterative Design Flow (Contd.)
- Cluster size should be smaller than the area constraint
- Level-5 folding: 34 LUTs > 32 LUTs
[Figure panels: level-5 folding; level-4 folding]
18
Solution for the Example
- Three folding stages using level-4 folding
- 32 LEs required for mapping the RTL circuit; area constraint satisfied
- Circuit delay = 3 × folding cycle delay
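The numbers in this example can be reproduced with a small sketch of the guess-and-refine loop. The two formulas in `initial_guess` are assumptions chosen because they match the slide figures (50 LUTs, 32 LEs, plane depth 9, level 5 rejected, level 4 accepted with three stages); the original equations are not preserved in this transcript. `CLUSTER_LUTS` is an illustrative stand-in: 34 comes from the level-5 slide, and 32 for level 4 is assumed from the "32 LEs required" statement.

```python
import math

def initial_guess(total_luts, available_les, plane_depth):
    """Assumed starting point: fewest stages that could fit, then the level they imply."""
    stages = math.ceil(total_luts / available_les)
    level = math.ceil(plane_depth / stages)
    return stages, level

def refine_level(level, plane_depth, area_luts, largest_cluster):
    """Lower the folding level until the largest LUT cluster fits the area constraint."""
    while level > 1 and largest_cluster(level) > area_luts:
        level -= 1
    return level, math.ceil(plane_depth / level)

# Largest LUT cluster per folding level (illustrative values, see lead-in).
CLUSTER_LUTS = {5: 34, 4: 32}

stages0, level0 = initial_guess(total_luts=50, available_les=32, plane_depth=9)
level, stages = refine_level(level0, 9, 32, lambda lvl: CLUSTER_LUTS[lvl])
```

Under these assumptions the initial guess is two stages at level 5, and refinement settles on level 4 with three folding stages, matching the solution slide.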
19
NanoMap: Flow Diagram
[Flow diagram flattened in extraction; the step numbers 1 to 16 cannot be matched to individual boxes]
- Phases: logic mapping; temporal clustering; temporal placement; routing
- Inputs: input network; module library; user constraint; optimization objective
- Steps: circuit parameter search; folding level computation; RTL module partition; delay estimation; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; final placement using modified VPR placer; final routing using VPR router
- Decisions (yes/no): Perform logic folding? Satisfy area constraints? Placement routable? Refine placement? Satisfy delay constraints?
- Output: reconfiguration bits
20
Force-Directed Scheduling
- Perform FDS on RTL modules partitioned into LUTs/LUT clusters
- Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
- Model resource usage as a force: F = Kx
- K: distribution graphs (DGs) that describe the probability of resource usage
- Aim of FDS: minimize force, indicating minimum increase in resource usage
- LE usage depends on LUT computations and register storage operations: two DGs needed
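The force model F = Kx can be illustrated with the classic self-force computation from force-directed scheduling. This is a simplified sketch, not NanoMap's implementation: it builds only the LUT-computation DG (the register-storage DG is omitted), ignores predecessor/successor forces, and assumes each node is equally likely to land in any stage of its time frame. All function names and the example frames are hypothetical.

```python
def distribution_graph(frames, n_stages):
    """DG[s] = expected number of LUT computations in folding stage s.

    frames maps node -> (earliest_stage, latest_stage), both inclusive.
    """
    dg = [0.0] * n_stages
    for asap, alap in frames.values():
        prob = 1.0 / (alap - asap + 1)      # uniform over the time frame
        for s in range(asap, alap + 1):
            dg[s] += prob
    return dg

def self_force(frames, dg, node, stage):
    """F = sum of K*x: K is the DG value, x the change in the node's probability."""
    asap, alap = frames[node]
    prob = 1.0 / (alap - asap + 1)
    return sum(dg[s] * ((1.0 if s == stage else 0.0) - prob)
               for s in range(asap, alap + 1))

def best_stage(frames, dg, node):
    """Pick the stage with the smallest force, i.e. the least crowded choice."""
    asap, alap = frames[node]
    return min(range(asap, alap + 1),
               key=lambda s: self_force(frames, dg, node, s))

# Tiny example: only "v" has freedom; stage 1 is already more crowded.
frames = {"u": (0, 0), "v": (0, 1), "w": (1, 1), "x": (1, 1)}
dg = distribution_graph(frames, 2)
```

In this example the DG is 1.5 for stage 0 and 2.5 for stage 1, so placing `v` in stage 0 yields a negative force and is preferred, which is the balancing behaviour the slide describes.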
21
Temporal Clustering
- For each folding stage, a constructive algorithm assigns LUTs to LEs and packs LEs into MBs and SMBs
- The unpacked LUT with the maximal number of inputs is selected as the initial seed
- New LUTs with high attraction to the seed are selected and assigned to the SMB
- Attraction depends on timing criticality and input pin sharing
- Attractions are considered across all the folding cycles
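The seed-and-attract packing described above might look like the following for a single folding stage. This is a hedged sketch: the attraction formula, its weight `alpha`, the `crit` field and the function names are all hypothetical, and the cross-folding-cycle part of the attraction is left out.

```python
def attraction(lut, cluster_inputs, alpha=0.5, max_inputs=4):
    """Hypothetical attraction: weighted timing criticality plus shared input pins."""
    shared = len(lut["inputs"] & cluster_inputs)
    return alpha * lut["crit"] + (1 - alpha) * shared / max_inputs

def cluster_stage(luts, capacity):
    """Greedily pack the LUTs of one folding stage into SMBs of `capacity` LEs."""
    unpacked = dict(luts)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (name order breaks ties).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]["inputs"]))
        nets = set(unpacked.pop(seed)["inputs"])
        members = [seed]
        while unpacked and len(members) < capacity:
            # Add the LUT most attracted to what is already in the cluster.
            best = max(sorted(unpacked),
                       key=lambda n: attraction(unpacked[n], nets))
            nets |= unpacked.pop(best)["inputs"]
            members.append(best)
        clusters.append(members)
    return clusters

example = {
    "A": {"inputs": {"a", "b", "c", "d"}, "crit": 0.2},
    "B": {"inputs": {"a", "b", "e"}, "crit": 0.1},
    "C": {"inputs": {"x", "y"}, "crit": 0.1},
    "D": {"inputs": {"x", "y", "z"}, "crit": 0.3},
}
packed = cluster_stage(example, capacity=2)
```

With a capacity of two, `A` seeds the first cluster and pulls in `B` because they share two input nets; `D` then seeds the second cluster and attracts `C` for the same reason.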
22
Placement and Routing
- VPR (U. Toronto) modified to perform placement and support temporal logic folding
- Simulated annealing approach
- Cost function computed across the folding stages
- Routing with the VPR router is performed hierarchically, considering direct-link, length-1, length-4 and global interconnects
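A toy version of annealing with a cost summed over folding stages is sketched below. It is not the modified VPR placer: the half-perimeter wirelength cost, the swap move, the cooling schedule and every name here are assumptions made for illustration. The key point it shows is that one physical SMB position is shared by all stages, so a move is judged by its effect on every stage's nets at once.

```python
import math
import random

def hpwl(net, pos):
    """Half-perimeter bounding-box wirelength of one net."""
    xs = [pos[b][0] for b in net]
    ys = [pos[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_cost(nets_per_stage, pos):
    """Total cost: wirelength summed over the nets of every folding stage."""
    return sum(hpwl(net, pos) for nets in nets_per_stage for net in nets)

def anneal(nets_per_stage, pos, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Swap-based simulated annealing; returns the best placement seen."""
    rng = random.Random(seed)
    pos = dict(pos)
    cost = placement_cost(nets_per_stage, pos)
    best_pos, best_cost = dict(pos), cost
    blocks = sorted(pos)
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(blocks, 2)
        pos[a], pos[b] = pos[b], pos[a]              # propose a swap
        new_cost = placement_cost(nets_per_stage, pos)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                          # accept the move
            if cost < best_cost:
                best_pos, best_cost = dict(pos), cost
        else:
            pos[a], pos[b] = pos[b], pos[a]          # reject: undo the swap
        temp *= alpha                                # cool down
    return best_pos, best_cost

# Four SMBs on a grid; stage 0 has one net, stage 1 has two.
start = {"s0": (0, 0), "s1": (3, 0), "s2": (0, 3), "s3": (3, 3)}
stage_nets = [[("s0", "s3")], [("s0", "s3"), ("s1", "s2")]]
best_pos, best_cost = anneal(stage_nets, start)
```

The starting placement puts both connected pairs on diagonals, so any improvement found by the annealer comes from moving connected SMBs next to each other in a way that helps both stages.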
23
Experimental Setup
- Instance of the architecture: 4 MBs in an SMB; 4 LEs in an MB; each LE contains a 4-input LUT and 2 flip-flops
- Impact of fixing k at 16 vs. allowing a high enough k, to show design trade-offs
- Results based on 100 nm technology parameters for implementing CMOS logic and NRAMs
24
Experimental Results (Contd.)
[Figure: plot data not preserved]
25
Improvement under AT optimization for RTL Benchmarks

            Reduction in #LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
k enough    14.8X               16.2X                    11.0X                    31.8%
k = 16      9.2X                9.3X                     7.8X                     19.4%

- LE utilization around 100%
- The need for a deep interconnect hierarchy is reduced by 50% for level-1 folding vs. no folding, which indicates that trading interconnect area for NRAM area is advantageous
26
Experimental Results (Contd.)
- Flexibility in choosing the best folding level and performing area-delay trade-offs
- Mapping results for typical optimizations, using the Paulin benchmark as an example

Typical optimizations
Columns: Opt. obj.; Area const. (#LEs); Delay const. (ns); Folding level
[Table flattened in extraction; some cells not preserved and column boundaries are uncertain]
Case 1: AT, No, 1
Case 2: Delay, No
Case 3: Area, No, 274
Case 4: Delay, 210, No, 3
27
Conclusions
- NATURE: a new high-performance run-time reconfigurable architecture
- NanoMap: an integrated design optimization flow for NATURE
- Introducing NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
- Can be very useful for cost-conscious embedded systems and for improving future FPGAs
- Non-volatility: helpful in secure and low-power processing