Give qualifications of instructors: DAP

ECE 636 Reconfigurable Computing, Lecture 14: Power Reduction Techniques for FPGAs

Overview
FPGAs are generally considered power hungry compared to their ASIC and processor counterparts
  Mostly due to unused interconnect
Recent area of extensive research
Device techniques
  Voltage scaling
  Sleep mode
Software techniques
  Reduced switching
  Reduced capacitance

Dynamic Power
Dynamic power is required to charge and discharge load capacitances when transistors switch.
One cycle involves a rising and a falling output.
On the rising output, charge Q = C * VDD is required.
On the falling output, the charge is dumped to GND.
[Figure: short-circuit current and charge/discharge current paths. Courtesy: Harris]

Short-circuit power is less than 10% of dynamic power.

FPGA Static Power Consumption
Junction leakage
Gate oxide leakage
Subthreshold leakage

FPGA Static Power Consumption
Junction leakage
  Small fraction of total leakage
Subthreshold leakage
  When Vgs < Vt there is still some source-drain current
  Increases exponentially as Vt decreases
  Decreases exponentially as Vgs decreases
Gate oxide leakage
  Increases exponentially as Vgs increases
[Figure: leakage technology trend. Courtesy: Nowak]
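The exponential behaviour of the source-drain current that flows when Vgs < Vt can be illustrated with the textbook first-order subthreshold model, I = I0 * exp((Vgs - Vt) / (n * vT)). This is a sketch only: the model is not taken from the lecture, and the values of `i0`, `n`, and the thermal voltage are illustrative assumptions.

```python
import math

THERMAL_VOLTAGE = 0.026  # volts, roughly room temperature (assumed value)

def subthreshold_current(vgs, vt, i0=1e-7, n=1.5):
    """First-order subthreshold current in amps.

    i0 (the current at Vgs == Vt) and n (the slope factor) are
    illustrative assumptions, not values from the lecture.
    """
    return i0 * math.exp((vgs - vt) / (n * THERMAL_VOLTAGE))

# Leakage of an "off" transistor (Vgs = 0) for two threshold voltages.
low_vt = subthreshold_current(vgs=0.0, vt=0.2)
high_vt = subthreshold_current(vgs=0.0, vt=0.4)
print(f"leakage ratio, Vt = 0.2 V vs. Vt = 0.4 V: {low_vt / high_vt:.0f}x")
```

Under these assumptions, raising Vt or pushing Vgs lower both cut leakage sharply, which is why high-Vt transistors and supply cut-off are attractive for static power.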

FPGA Power Reduction Goals
Dynamic power goals
  Reduce Vdd along non-critical paths
  Low-swing signalling
  Use CAD approaches to limit long, high-toggle paths
  Pdynamic = 0.5 * C * Vdd^2 * f
Static power goals
  Cut off Vdd for unused transistors
  Use high-Vt transistors for SRAM cells
  Various other voltage biasing techniques
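A short numeric sketch of the dynamic power formula shows why lowering Vdd is so effective: power scales with the square of the supply voltage. The capacitance, voltage, and frequency values are assumptions chosen only for illustration.

```python
def dynamic_power(c_farads, vdd_volts, freq_hz):
    """Pdynamic = 0.5 * C * Vdd^2 * f, as given in the lecture."""
    return 0.5 * c_farads * vdd_volts ** 2 * freq_hz

# Assumed numbers: 1 nF of switched capacitance toggling at 100 MHz.
nominal = dynamic_power(1e-9, 1.5, 100e6)
reduced = dynamic_power(1e-9, 1.2, 100e6)
print(f"nominal: {nominal * 1e3:.1f} mW, reduced Vdd: {reduced * 1e3:.1f} mW")
print(f"saving: {100 * (1 - reduced / nominal):.0f}%")
```

With these inputs a 20% drop in Vdd removes 36% of the dynamic power, since (0.8)^2 = 0.64.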

Traditional Routing Switch
[Figure: traditional routing switch with level-restoring buffer. Courtesy: Anderson]

Proposed Switch Designs: Anderson
Based on 3 observations:
  Routing switch inputs are tolerant to weak-1 signals (level-restoring buffers).
  There is considerable slack in FPGA designs, so many switches can be slowed down.
  Most routing switches feed other routing switches, so switches can be allowed to produce weak-1 logic signals.

“Basic” Switch Design
Mode operation:
  high-speed: MNX and MPX ON
  low-power: MNX ON, MPX OFF
  sleep: MNX OFF, MPX OFF
[Figure: basic switch circuit with node VVD]

High-Speed Mode
MNX and MPX ON
VVD = VDD
Output swing: rail-to-rail

Low-Power Mode
MNX ON, MPX OFF
VVD = VDD - VTH
Output swing: GND to (VDD - VTH)

Sleep Mode
MNX OFF, MPX OFF
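The three operating modes can be summarized as a small lookup, sketched below. The MNX/MPX states and the VVD values for high-speed and low-power modes come from the slides; the names `Mode`, `MODE_TABLE`, and `virtual_vdd` are hypothetical, and the slides do not give a VVD value for sleep mode, so the sketch returns `None` there.

```python
from enum import Enum

class Mode(Enum):
    HIGH_SPEED = "high-speed"
    LOW_POWER = "low-power"
    SLEEP = "sleep"

# (MNX on?, MPX on?) for each mode, from the mode operation table.
MODE_TABLE = {
    Mode.HIGH_SPEED: (True, True),
    Mode.LOW_POWER: (True, False),
    Mode.SLEEP: (False, False),
}

def virtual_vdd(mode, vdd, vth):
    """Return VVD for a mode, or None in sleep mode (not given in the slides)."""
    mnx_on, mpx_on = MODE_TABLE[mode]
    if mpx_on:
        return vdd          # high-speed: VVD = VDD, rail-to-rail swing
    if mnx_on:
        return vdd - vth    # low-power: VVD = VDD - VTH, reduced swing
    return None             # sleep: both transistors off

# Assumed example supply and threshold voltages.
for mode in Mode:
    print(mode.value, virtual_vdd(mode, vdd=1.2, vth=0.3))
```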

Leakage Power Results: Anderson
[Bar chart (Basic switch design): % leakage power reduction vs. high-speed mode. Categories: LP mode, sleep mode, LP mode (+unused fanout), LP mode (+used fanout), traditional switch. Data labels: 60.8, 39.7, 38.7, 36, 0.3.]

Region Constrained Placement
Rather than just focusing on routing, consider constraining logic
Most circuits exhibit locality
Gayasen: FPGA’2004

Region Constrained Placement
Several issues to consider
Size of sleep transistor
  Too large: increases leakage and area
  Too small: affects logic performance
Size of region
  Too large: possibly unused resources, complicates placement
  Too small: sleep transistors take up too much room

Experimental Flow: RCP
Different region sizes considered for flow
Area constraints for portions of design determined by hand
May encourage designers to create granular designs

Power Savings: RCP
Note the significant reduction in leakage power savings as region size increases
Bottom curve primarily due to luck

Performance Limitation: RCP
Performance limited by use of regions
Nearly 10% clock frequency reduction for many designs

Low-swing Signalling
Techniques we have examined so far tinker with the supply voltage
Also possible to modify wire signalling to reduce voltage swing
  Most of an FPGA is made up of interconnect
  Approach targets dynamic power consumption
George and Rabaey: 1997

Low-swing Signalling
Interconnect swing is at 0.8 V while the rest of the circuit operates at 1.5 V
Cascode circuitry used at the sink to overcome slow speed issues
50% energy savings at a cost of 25% delay

Alternate approach: Modifying FPGA CAD
FPGA architecture modifications impact all designs, even those that don’t care about power
Can placement and routing be modified to consider dynamic power?
  Need to know which signals are high toggle
  Attempt to minimize length of high-toggle wires
  Minimize impact on performance and area
Techniques fit well into our previous work on placement and routing
Lamoureux and Wilton

Modifying FPGA CAD Placement
Previous cost metrics for annealing considered bounding-box wire length and timing costs
Include an additional term which considers signal switching activity

FPGA Placement for Power
Previous cost metrics for annealing considered bounding-box wire length and timing costs
Include an additional term which considers signal switching activity
Post-route energy reduced by 3.0%
Power decreased by 7% but delay increased by 4%
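One way an activity term could enter the annealer's cost function is sketched below. The weighting scheme and the names `Net`, `placement_cost`, `timing_weight`, and `power_weight` are assumptions for illustration, not the formulation from the lecture; a real annealer would also normalize each term.

```python
from dataclasses import dataclass

@dataclass
class Net:
    bb_width: int       # bounding-box width in tiles
    bb_height: int      # bounding-box height in tiles
    activity: float     # estimated toggles per clock cycle
    timing_cost: float  # criticality-weighted delay estimate

def placement_cost(nets, timing_weight=0.5, power_weight=0.5):
    """Wire length + timing + activity-weighted wire length (simplified)."""
    wire = sum(n.bb_width + n.bb_height for n in nets)
    timing = sum(n.timing_cost for n in nets)
    power = sum((n.bb_width + n.bb_height) * n.activity for n in nets)
    return wire + timing_weight * timing + power_weight * power

# Two placements with the same total wire length and timing cost:
# in A the high-activity net is short, in B it is long.
placement_a = [Net(2, 2, 0.9, 1.0), Net(6, 6, 0.1, 1.0)]
placement_b = [Net(6, 6, 0.9, 1.0), Net(2, 2, 0.1, 1.0)]
cost_a = placement_cost(placement_a)
cost_b = placement_cost(placement_b)
print(cost_a, cost_b)
```

Because only the activity term differs, the annealer would prefer placement A, which keeps the high-toggle net short.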

FPGA Routing Modifications for Power
Original routing cost function takes congestion b(n) and delay(n) into account
Augment with a factor that takes net activity into account
Minimize length of the most active nets, even in the presence of congestion
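A hedged sketch of how net activity might be blended into a per-node routing cost follows. The specific blend, the `capacitance` input, and the parameter names are assumptions; the slides only state that congestion b(n), delay(n), and net activity are taken into account.

```python
def node_cost(delay, congestion, capacitance, criticality, activity):
    """Illustrative cost of using routing node n for one net.

    criticality in [0, 1]: timing criticality of the connection.
    activity in [0, 1]: normalized switching activity of the net.
    For active nets the non-timing part favors low-capacitance nodes;
    for quiet nets it falls back to the congestion cost b(n).
    """
    non_timing = activity * capacitance + (1 - activity) * congestion
    return criticality * delay + (1 - criticality) * non_timing

# The same high-capacitance node is more expensive for a high-activity net.
busy_net = node_cost(delay=1.0, congestion=1.0, capacitance=3.0,
                     criticality=0.2, activity=0.9)
quiet_net = node_cost(delay=1.0, congestion=1.0, capacitance=3.0,
                      criticality=0.2, activity=0.1)
print(busy_net, quiet_net)
```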

FPGA Routing for Power Results
Potential benefits somewhat limited by placement
Note that most nets have low activity
Power is decreased by 6% but delay is increased by 4%
Energy savings of about 3%

FPGA Embedded Memory Blocks
Embedded memory blocks (EMBs) are important parts of FPGAs
Consume roughly 14% of Altera Stratix II dynamic power*
  Increasing in recent designs
* Stratix II Low Power Applications Note, 2005

Embedded Memory Block Port Internal View
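[Block diagram of one EMB port: Clk and Clk Enable produce MClk, which clocks the address, write data, write enable, and read latch paths; other blocks shown are row decode, RAM cell with BIT lines, bit line pre-charge, column mux, write buffers, sense amps, and read data.]
Reducing clocking saves dynamic power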

Power Optimization #1
Convert EMB read enable/write enable signals to associated read/write clock enable signals
Limitations
  Each port has a read or write enable control signal
  Embedded memory block has a read enable input
[Figure: before and after views of an EMB port showing Wren/Rden, the write/read enable inputs, the write/read clock enable inputs, Vcc tie-offs, read/write addresses, data, and clock.]

Implementation
Conversion mode
  Ties off R/W enable to RAM clock enables
  Doesn’t make the transform if a CE is already present on the port
Combining mode
  AND user RAM clock enables with derived R/W clock enables
  Could impact performance
[Figure: Write Enable AND User-defined Write Clk Enable -> Combined Write Clk Enable]
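The effect of the two modes can be modeled with plain booleans, as in the sketch below. It reflects one reading of the slides (the enable signal is moved onto the clock enable so the RAM is clocked only when accessed); the function names and the example trace are hypothetical.

```python
def combined_clock_enable(rw_enable, user_clock_enable):
    """Combining mode: AND the user clock enable with the R/W enable."""
    return rw_enable and user_clock_enable

def clocked_cycles(clock_enables):
    """Count the cycles in which the RAM port actually receives a clock."""
    return sum(1 for ce in clock_enables if ce)

# Assumed write-enable activity over five cycles.
write_enable_trace = [True, False, False, True, False]

# Before: the clock enable is always asserted, so every cycle is clocked.
before = clocked_cycles([True] * len(write_enable_trace))
# After conversion: the write enable drives the clock enable.
after = clocked_cycles(write_enable_trace)
print(before, after)
```

In this trace the port is clocked in 2 cycles instead of 5, while writes still happen in exactly the same cycles.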

FPGA RAM Processing
[Flow diagram: FIFO, shift register, and RAM specification -> create logical memory (logical RAMs/logic) -> logical-to-physical RAM processing (RAM blocks/logic) -> memory/logic placement (placed memory)]
FIFOs and shift registers are converted into logical RAMs
Logical RAMs are mapped to RAM blocks

Mapping RAM to EMBs
Implementation choice can impact design area, performance, and power
Some mappings may require multiple EMBs
[Figure: a user-defined (logical) memory, 4K deep x 4 wide (16K bits), and candidate physical (EMB) memories: four M4K blocks of 4K bits each, and a 512K MRAM]

Memory Organization
Each EMB can be configured to have a different depth and width (e.g. Stratix II M4K)
  All configurations hold 4K bits
  Examples: 4K words deep x 1 bit wide, 512 words deep x 8 bits wide, 128 words deep x 32 bits wide
Slightly lower power consumption for wider EMB configurations (not including routing)

Area and Delay Optimal Mapping
Configure each EMB to be as deep as possible
  Number of address bits on each EMB is the same as on the logical memory
Area and performance efficient: no external logic needed
Power inefficient: all EMBs must be active during each logical RAM access
Vertical slicing example: a logical memory 4K words deep and 4 bits wide (Addr[0:11], Data[0:3]) maps to four EMBs, each 4K words deep and 1 bit wide; all 4 EMBs are active during an access

Alternative Mapping
Configure each EMB to have the width of the logical RAM (e.g. 1K x 4)
  Allows shutdown of some RAMs each cycle
  But adds some logic
Saves RAM power, adds combinational logic and register power
Horizontal slicing example (more power efficient): a logical memory 4K words deep and 4 bits wide maps to four EMBs, each 1K deep x 4 wide; Addr[0:9] indexes within an EMB, an address decoder on Addr[10:11] selects the EMB, and only 1 EMB is active during an access
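The difference between the two mappings comes down to how the address is used, as the sketch below shows for the 4K x 4 example. The function names are hypothetical; integer division and modulo stand in for the Addr[10:11] decoder and the Addr[0:9] local address.

```python
def vertical_access(addr, num_embs=4):
    """Vertical slicing: each EMB holds one bit of every word,
    so all EMBs are active on every access."""
    return list(range(num_embs))

def horizontal_access(addr, emb_depth=1024):
    """Horizontal slicing: upper address bits pick one EMB,
    lower address bits index within it."""
    emb_index = addr // emb_depth   # Addr[10:11] for a 4K-deep memory
    local_addr = addr % emb_depth   # Addr[0:9]
    return emb_index, local_addr

for addr in (0, 1023, 1024, 4095):
    active = vertical_access(addr)
    emb, local = horizontal_access(addr)
    print(f"addr {addr}: vertical uses EMBs {active}, "
          f"horizontal uses EMB {emb} at local address {local}")
```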

RAM Slicing - Example
Power reduction available with different slicing
[Chart: dynamic power (mW) of a 4K x 32 memory vs. maximum EMB depth (128, 256, 512, 1k, 2k, 4k). Multiplexer power increases toward one end of the depth axis and EMB power toward the other, with a "best range" in between.]
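The trade-off in the 4K x 32 example can be enumerated with a small sketch, assuming each EMB holds exactly 4K bits as stated earlier in the lecture. The function name and tuple layout are hypothetical, and the sketch counts blocks only; it does not estimate power.

```python
EMB_BITS = 4096  # each EMB holds 4K bits, per the lecture

def slicing_options(logical_depth, logical_width, max_depths):
    """For each maximum EMB depth, return a tuple of
    (depth, emb_width, active_embs_per_access, mux_inputs, total_embs)."""
    rows = []
    for depth in max_depths:
        emb_width = EMB_BITS // depth
        active = -(-logical_width // emb_width)    # ceiling division
        mux_inputs = -(-logical_depth // depth)    # banks feeding the output mux
        rows.append((depth, emb_width, active, mux_inputs, active * mux_inputs))
    return rows

options = slicing_options(4096, 32, [128, 256, 512, 1024, 2048, 4096])
for row in options:
    print(row)
```

The total EMB count stays at 32 in every case. Deeper configurations activate more EMBs per access, while shallower ones need a wider output multiplexer, which is consistent with the opposing power trends the chart labels.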

Power Optimization #2: Power-aware RAM Partitioning
[Flow diagram with stages: FIFO and shift register input, create logical memory, power-aware physical RAM processing (with power library), insert decode and mux logic, memory/logic placement, completed placement]
Algorithm considers possible logical-to-physical RAM mappings

Experimental Approach
40 designs evaluated
Quartus 5.1
  Mapped to smallest possible device and target max frequency
Simulation with test vectors
Power analysis with PowerPlay

Memory Power
21.0% average reduction for all techniques (9.7% with convert/combine)

Overall Core Dynamic Power
6.8% average power reduction for all techniques (2.6% with convert/combine)
[Chart: % dynamic power reduction per design; series: enable convert/combine, and enable convert/combine + mem partition]

Design Performance
1.0% average performance loss for all techniques (0.1% for enable convert/combine)
[Chart: average design clock frequency, % frequency improvement per design; series: enable convert/combine, and enable convert/combine + mem partition]

Results Summary
Almost 7% core dynamic power reduction across all designs
Some designs benefit more than others
Minimal clock frequency hit for most designs

                        Enable convert   Enable convert/combine   Enable convert/combine + mem partition
Core dynamic power      -1.8%            -2.6%                    -6.8%
Memory dynamic power    -6.3%            -9.7%                    -21.0%
Max clk freq            -0.1%            -0.2%                    -1.0%
LUT count               0.0%             0.1%                     0.7%

Impact of Multiple Embedded Memory Blocks
Rerun 40 designs but only allow one type of target EMB for each mapping
All designs targeted to Stratix II EP2S180
Significant power impact for most designs versus EP2S180 target with no restrictions

                      M512      M4K      M-RAM
Designs completed     23        38       4
Core dynamic power    40.4%     6.6%     47.3%
Memory power          279.5%    33.3%    754.0%
Max clk freq.         -2.2%     0.6%     -1.0%
LUT count             0.4%      -0.5%    0.0%

Summary
Key to reducing RAM power is keeping clocks disabled
  Movement of read/write enables to clock enables limits dynamic activity
  Power-aware RAM partitioner attempts to select the power-optimal mapping, combined with the clock enable enhancement
Overall
  About 21% average memory power reduction (10% with enable convert/combine)
  About 7% average dynamic power reduction (3% with enable convert/combine)
  Diversity of EMBs reduces power by 33%

Summary
FPGA power consumption is under consideration at numerous levels: architecture, circuit, CAD, and physical
FPGA companies are just now embracing power-aware CAD; power-aware architectures are on the way
Many circuit-level techniques are still possible
RTL CAD synthesis techniques provide a promising area for exploration
