1
CS252 Project Presentation
Optimizing the Leon Soft Core
Marghoob Mohiyuddin, Zhangxi Tan, Alex Elium
Dept. of EECS, University of California, Berkeley
2
CS252 Spring 2007
Project Outline
- Goal: reduce the size of Leon on FPGAs
- Our motivation for using Leon: RAMP research (emulation of multiprocessors)
- Analysis: LUT breakdown
- Optimizations:
  - Circuit level
  - Architectural level
3
Leon Overview
- 32-bit SPARC V8 compliant processor
- 7-stage pipeline, in-order
- Separate L1 instruction and data caches
- Configurable cache size, associativity, replacement policy
- Optional Memory Management Unit
- AMBA bus interface to memory and peripherals
- Supports symmetric multiprocessing
- Open source (Gaisler Research)
4
Area analysis
- Configuration
  - MMU: combined I/D TLB, 2 entries only
  - Integer MUL/DIV enabled
  - Cache: direct-mapped I and D caches
- Variables
  - DSU: debug support unit
  - Target clock
    - 20 MHz: easy to achieve
    - 200 MHz: over-constrained
5
Resource breakdown

                   Relaxed constraint (20 MHz)               Over-constrained (200 MHz)
                   no DSU               with DSU             no DSU               with DSU
Block               LUT   Reg   LUT %    LUT   Reg   LUT %    LUT   Reg   LUT %    LUT   Reg   LUT %
Integer Pipeline   2684   907  52.48%   3435   971  57.15%   3085  1057  54.21%   4034  1224  58.56%
MUL                 200   135   3.91%    200   135   3.33%    226   116   3.97%    224   108   3.25%
DIV                 391    81   7.65%    383    81   6.37%    388   119   6.82%    431   123   6.26%
MMU                 446   276   8.72%    495   318   8.23%    451   283   7.92%    481   332   6.98%
ICache              320    89   6.26%    320    89   5.32%    354    90   6.22%    379   105   5.50%
DCache              943   285  18.44%   1058   289  17.60%   1057   292  18.57%   1227   304  17.81%
Bus IF              130     8   2.54%    120     8   2.00%    130     7   2.28%    113    10   1.64%
Total              5114  1781           6011  1891           5691  1964           6889  2206
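The "LUT %" figures in the resource breakdown can be reproduced from the raw LUT counts. The sketch below does so for the 20 MHz, no-DSU configuration; the dictionary and function names are illustrative and not taken from the slides.

```python
# LUT counts per block for the 20 MHz, no-DSU configuration
# (values copied from the resource breakdown).
luts_20mhz_no_dsu = {
    "Integer Pipeline": 2684,
    "MUL": 200,
    "DIV": 391,
    "MMU": 446,
    "ICache": 320,
    "DCache": 943,
    "Bus IF": 130,
}


def lut_shares(luts_by_block):
    """Return each block's share of the total LUT count, in percent."""
    total = sum(luts_by_block.values())
    return {name: 100.0 * count / total for name, count in luts_by_block.items()}


total_luts = sum(luts_20mhz_no_dsu.values())
shares = lut_shares(luts_20mhz_no_dsu)

for name, pct in shares.items():
    print(f"{name:17s}{pct:6.2f}%")
```

The same function applies unchanged to the other three configurations once their LUT columns are filled in.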
6
Why it's BIG
- Debugging support
  - More MUXes
  - One additional pipeline stage
  - Useful for RAMP emulation / bootstrapping
- The integer unit (IU) accounts for over 50% of the LUTs
  - Barrel shifter
  - Pipeline control (forwarding)
7
Circuit Level Optimizations
- Store LRU bits in Block RAMs instead of flip-flops
  - Also saves LUTs
- One-hot encoding for signals
  - Synthesis tool does a good job of one-hot encoding for many signals (e.g., state encoding)
  - Applied this to the cache output
  - Instead of data(set), we can use data(0) or data(1) or data(2) or data(3)
  - Useful only for multiway caches
  - LUT savings: ~100 LUTs
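As a rough behavioural model of the one-hot cache output idea, the sketch below compares an index-based select, data(set), with a one-hot select that gates each way with its own hit bit and ORs the results. It assumes a 4-way cache with 32-bit words and a one-hot hit vector; the function names and sample values are hypothetical, and the actual HDL is not shown in the slides.

```python
WORD_MASK = 0xFFFFFFFF  # assume 32-bit cache words


def select_indexed(data, set_index):
    """Binary-encoded select: data(set), which needs a full multiplexer."""
    return data[set_index]


def select_one_hot(data, hit):
    """One-hot select: AND each way with its hit bit, then OR the ways together."""
    result = 0
    for word, way_hit in zip(data, hit):
        gate = WORD_MASK if way_hit else 0
        result |= word & gate
    return result


# Hypothetical contents of the four ways for one cache line.
ways = [0x11111111, 0x22222222, 0x33333333, 0x44444444]

# With exactly one hit bit set, both forms return the same word.
for way in range(4):
    hit = [i == way for i in range(4)]
    assert select_one_hot(ways, hit) == select_indexed(ways, way)
```

The AND-OR form only pays off when the hit signals are already available in one-hot form, which is consistent with the note that it is useful only for multiway caches.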
8
Circuit Level Optimizations
- Use fast-carry chain logic
  - Provided 30% savings in LUT usage for TLB entries
- Multipliers for barrel shifter
  - A shift by b can be computed as a multiplication by a power of two (2^b for a left shift)
  - Savings of ~100 LUTs
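The multiplier-as-shifter idea can be checked with a few lines of integer arithmetic. The sketch below assumes 32-bit unsigned operands and a multiplier that produces a full 64-bit product; how the shifter was actually mapped onto the FPGA multipliers is not described in the slides, so the right-shift formulation here is one possible approach, not necessarily the authors'.

```python
MASK32 = 0xFFFFFFFF


def shift_left_via_multiply(x, b):
    """Left shift by b: low 32 bits of the product x * 2^b."""
    return (x * (1 << b)) & MASK32


def shift_right_via_multiply(x, b):
    """Logical right shift by b: high 32 bits of the product x * 2^(32 - b).

    Note: for b == 0 the factor is 2^32, which needs a 33-bit operand;
    real hardware would have to treat that case separately.
    """
    return (x * (1 << (32 - b))) >> 32


samples = [0x00000001, 0x80000000, 0xDEADBEEF, 0xFFFFFFFF]
for x in samples:
    for b in range(32):
        assert shift_left_via_multiply(x, b) == (x << b) & MASK32
        assert shift_right_via_multiply(x, b) == x >> b
```

In both cases the variable shift amount becomes a one-hot power-of-two operand, so the wide multiplexer tree of a barrel shifter is replaced by a decoder plus a multiplier.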
9
LUTs for Integer Mul / Div
- 2195 / 18429* LUTs for the entire two-core system (12%)
- 11.5% of a Leon3 core
*(Xilinx ISE)
11
Didn't your mother teach you to share?
- Savings of ~350 LUTs for the prototype
  - Only the multiplier is shared
  - Only two cores
- As more cores share a unit, its 10% overhead could become 5%, 2.5%, 1%, ...
- Even more for a MAC
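The "10% could become 5%, 2.5%, 1%" progression is simple amortization: one shared unit divided across the cores that use it. The sketch below is a back-of-the-envelope model that ignores arbitration logic and contention (an assumption, since the slides do not quantify either); the function name is illustrative.

```python
def shared_overhead_pct(unit_pct, cores):
    """Per-core cost of one shared unit, as a percentage of a core's LUTs."""
    return unit_pct / cores


# Share of integer MUL/DIV in the two-core prototype (figures from the slides).
mul_div_share = 100.0 * 2195 / 18429

# A unit costing roughly 10% of one core, shared by 1, 2, 4 and 10 cores.
per_core = {n: shared_overhead_pct(10.0, n) for n in (1, 2, 4, 10)}

for n, pct in per_core.items():
    print(f"{n:2d} cores sharing: {pct:.1f}% per core")
```

Real savings would be somewhat lower once the multiplexing and arbitration needed to share the unit are counted.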
12
Operand MUXes:
- 32-bit, 7-to-1 MUX
- 32-bit, 5-to-1 MUX
13
Operand MUXes
- 313 LUTs + 64 MUX each
14
Integer Pipeline Changes
- Remove all forwarding
  - Single thread: just stall
  - Fine-grain multithreading could boost performance
- LUTs saved: 27-37%
- Maximum frequency improvement: 20%

                Standard                       Improved
                LUTs  Regs     LUTs  Regs      LUTs  Regs     LUTs  Regs
Frequency       47.4 MHz       49.3 MHz        73.3 MHz       77.7 MHz
                2683   907     2637   907      1953   684     1943   684
Frequency       108.5 MHz      114.6 MHz       127.6 MHz      136.1 MHz
                3053   960     3013  1048      1928   715     2035   768

                Standard w. DSU                Improved w. DSU
                LUTs  Regs     LUTs  Regs      LUTs  Regs     LUTs  Regs
Frequency       55.9 MHz       52.7 MHz        84.7 MHz       85.2 MHz
                3477   971     3320   971      2469   744     2457   744
Frequency       113.9 MHz      123 MHz         125 MHz        144.9 MHz
                4202  1049     4198  1147      2610   778     2938  1371
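The 27-37% range quoted for LUT savings matches the no-DSU figures when the first Standard column is compared against the first Improved column in each row. That pairing is an assumption made here for illustration; the slides do not say which columns were compared.

```python
def saving_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# (standard LUTs, improved LUTs) for the no-DSU integer pipeline,
# taken from the first column of each design in the two rows.
low_freq_row = (2683, 1953)
high_freq_row = (3053, 1928)

low_saving = saving_pct(*low_freq_row)
high_saving = saving_pct(*high_freq_row)

print(f"LUT saving, lower-frequency row:  {low_saving:.1f}%")
print(f"LUT saving, higher-frequency row: {high_saving:.1f}%")
```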
15
Conclusions
- CAD tools already perform many optimizations
  - Remove unused logic
  - Infer technology-dependent logic from HDL source, e.g., fast carry chain logic
  - Optimize logic globally
16
Conclusions
- Optimization is possible
- Higher levels yield (much) greater savings
  - Circuit level: 200-300 LUTs
  - Architectural level: 1000+ LUTs
  - Sharing: ~700 LUTs per core
- Total: 35-40% savings