System-on-a-Chip Platform Tuning for Embedded Systems. Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, UC Riverside.


1 System-on-a-Chip Platform Tuning for Embedded Systems. Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside. Also with the Center for Embedded Computer Systems at UC Irvine (http://www.cs.ucr.edu/~vahid). This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend.

2 How Much is Enough?

3 How Much is Enough? Perhaps a bit small.

4 How Much is Enough? Reasonably sized.

5 How Much is Enough? Probably plenty big.

6 How Much is Enough? More than typically necessary.

7 How Much is Enough? Very few people could use this.

8 How Much is Enough for an IC? 1993: ~1 million logic transistors. Perhaps a bit small. [Figure: IC package and IC]

9 How Much is Enough for an IC? 1996: ~5-8 million logic transistors. Reasonably sized.

10 How Much is Enough for an IC? 1999: ~10-50 million logic transistors. Probably plenty big.

11 How Much is Enough for an IC? 2002: ~100-200 million logic transistors. More than typically necessary.

12 How Much is Enough for an IC? 2008: >1 BILLION logic transistors (1993: 1 M). Perhaps very few people could design this. Point of diminishing returns: 8-bit uC: ~15K; 32-bit ARM: ~30K; MPEG dcd: ~1M; 100M good enough for audio/video/etc.? Other examples: fast cars (> 100 mph), high res digital cameras (> 4M), disk space, even IC performance.

13 Very Few Companies Can Design High-End ICs. Designer productivity growing at slower rate. 1981: 100 designer months, ~$1M. 2002: 30,000 designer months, ~$300M. [Figure: design productivity gap, plotting IC capacity (logic transistors per chip, in millions) and productivity (K transistors per staff-month) by year, 1981 to 2009. Source: ITRS'99]

14 Meanwhile, ICs Themselves are Costlier. And take longer to fabricate, while market windows are shrinking. Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron.

Tech (micron)   0.8       0.35      0.18      0.13
NRE             $40k      $100k     $350k     $1,000k
Turnaround      42 days   49 days   56 days   76 days
Market          $3.5B     $6B       $12B      $18B

Source: DAC'01 panel on embedded programmable logic
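
The volume argument on this slide can be made concrete by spreading the one-time NRE cost over the units sold. The sketch below uses the NRE figures from the slide; the production volumes and the helper name `nre_per_unit` are illustrative assumptions, not data from the talk.

```python
# NRE figures by process technology (micron), taken from the slide, in dollars.
NRE_BY_TECH_MICRONS = {
    0.8: 40_000,
    0.35: 100_000,
    0.18: 350_000,
    0.13: 1_000_000,
}


def nre_per_unit(nre_dollars: float, volume: int) -> float:
    """Spread a one-time NRE cost evenly over the number of units sold."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_dollars / volume


if __name__ == "__main__":
    # Hypothetical production volumes, chosen only for illustration.
    for volume in (10_000, 100_000, 1_000_000):
        for tech, nre in NRE_BY_TECH_MICRONS.items():
            share = nre_per_unit(nre, volume)
            print(f"{tech} micron, {volume:>9,} units: ${share:.2f} NRE per unit")
```

At low volume the NRE share per unit grows quickly with each newer process, which is one way to read the claim that few ASIC designs justify fabrication in 0.13 micron.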

15 Summarizing So Far... Transistors are less scarce: ICs are big enough, fast enough. ICs take more time and money to design and fabricate, while market windows are shrinking. Buy pre-fabricated system-level ICs: platforms. [Figure label: Designers]

16 Trend Towards Pre-Fabricated Platforms: ASSPs. ASSP: application specific standard product, a domain-specific pre-fabricated IC, e.g., digital camera IC. ASIC: application specific IC. ASSP revenue > ASIC. ASSP design starts > ASIC (a design start is a unique IC design; it ignores quantity of same IC). ASIC design starts decreasing, due to strong benefits of using pre-fabricated devices. Source: Gartner/Dataquest September'01

17 A Sample Pre-Fabricated Platform. [Figure: pre-fabricated platform IC containing uP, L1 cache, L2 cache, DSP, JPEG dcd, peripherals, FPGA] Must be programmable for use in variety of products; ideally also configurable. Means high volume: platform designer's investment pays off, cost per IC is reasonable. Use additional (readily available) transistors for high configurability. Our research focus: design and use of highly configurable platforms.

18 Commercial Highly-Configurable Platform Type: Single-Chip Microprocessor/FPGA Platforms. Triscend E5: based on 8-bit 8051 CISC core; 10 Dhrystone MIPS at 40MHz; 60 kbytes on-chip RAM; up to 40K logic gates; cost only about $4 (in volume). [Figure: Triscend E5 chip, showing configurable logic, 8051 processor plus other peripherals, memory]

19 Single-Chip Microprocessor/FPGA Platforms. Atmel FPSLIC (Field-Programmable System-Level IC): based on AVR 8-bit RISC core; 20 Dhrystone MIPS; 5k-40k configurable logic gates; on-chip RAM (20-36Kb) and EEPROM; $5-$10. Courtesy of Atmel

20 Single-Chip Microprocessor/FPGA Platforms. Triscend A7 chip: based on ARM7 32-bit RISC processor; 54 Dhrystone MIPS at 60 MHz; up to 40k logic gates; on-chip cache and RAM; $10-$20 in volume. Courtesy of Triscend

21 Single-Chip Microprocessor/FPGA Platforms. Altera's Excalibur EPXA 10: ARM (922T) hard core; ~200 Dhrystone MIPS at ~200 MHz; devices range from ~200k to ~2 million programmable logic gates. Source: www.altera.com

22 Single-Chip Microprocessor/FPGA Platforms. Xilinx Virtex II Pro: PowerPC based; 420 Dhrystone MIPS at 300 MHz; 1 to 4 PowerPCs; 4 to 16 gigabit transceivers; 12 to 216 multipliers; 3,000 to 50,000 logic cells; 200k to 4M bits RAM; 204 to 852 I/O; $100-$500 (>25,000 units). [Figure: config. logic, PowerPCs, up to 16 serial transceivers at 622 Mbps to 3.125 Gbps] Courtesy of Xilinx
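
One rough way to compare the processor cores on these platforms is Dhrystone MIPS per MHz. The sketch below uses only the figures quoted on the slides; the Atmel FPSLIC is left out because its slide gives no clock rate, and the Excalibur values are the approximate ones quoted (~200 at ~200 MHz). The dictionary layout and function name are illustrative assumptions.

```python
# (Dhrystone MIPS, clock in MHz) as quoted on the slides.
PLATFORMS = {
    "Triscend E5 (8051)": (10, 40),
    "Triscend A7 (ARM7)": (54, 60),
    "Altera Excalibur (ARM922T)": (200, 200),  # both values approximate on the slide
    "Xilinx Virtex II Pro (PowerPC)": (420, 300),
}


def dmips_per_mhz(dmips: float, mhz: float) -> float:
    """Normalize Dhrystone MIPS by clock frequency."""
    if mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dmips / mhz


if __name__ == "__main__":
    for name, (dmips, mhz) in PLATFORMS.items():
        print(f"{name}: {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")
```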

23 Single-Chip Microprocessor/FPGA Platforms. Why wouldn't future microprocessor chips include some amount of on-chip FPGA?

24 Single-Chip Microprocessor/FPGA Platforms. Lots of silicon area taken up by configurable logic. As discussed earlier, less of an issue every year. Smaller area doesn't necessarily mean higher yield (lower costs) any more: previously could pack more die onto a wafer, but die are becoming pad (pin) limited in nanoscale technologies. Configurable logic typically used for peripherals, glue logic, etc. We have investigated another use...

25 Software Improvements using On-Chip Configurable Logic. Partitioned software critical loops onto on-chip FPGA for several benchmarks. Performed physical measurements on Triscend A7 and E5 devices. Work done by Greg Stitt, Brian Grattan, Shawn Nematbaktsh at UCR. [Figure: Triscend A7 development board with A7 IC]
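
The gain from moving critical loops to on-chip logic is bounded by how much of the run time those loops account for. The sketch below applies Amdahl's law, which is a standard model and not something the slide itself cites; the loop fraction and loop speedup values are hypothetical.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the loops get faster.

    loop_fraction: share of original run time spent in the moved loops (0..1).
    loop_speedup:  how many times faster those loops run in hardware.
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if loop_speedup <= 0:
        raise ValueError("loop_speedup must be positive")
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


if __name__ == "__main__":
    # Hypothetical values: loops take 80% of run time and run 10x faster in FPGA.
    print(f"overall speedup: {overall_speedup(0.8, 10.0):.2f}x")
```

Under this model, even an arbitrarily fast hardware loop cannot push the overall speedup past 1 / (1 - loop_fraction), which is why partitioning targets the loops that dominate execution time.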

26 Software Improvements using On-Chip Configurable Logic. Extensive simulated results for 8051 and MIPS (physical measurement very time consuming), for Powerstone (PS), MediaBench (MB) and Netbench (NB). Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg).
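
Since energy is power multiplied by time, a speedup and an energy saving together imply a change in average power. The sketch below works that out for the averages quoted on this slide, under the assumption (not stated in the talk) that both averages describe the same runs; averages over benchmarks do not strictly compose this way, so treat the result as a rough reading only.

```python
def relative_power(speedup: float, energy_savings: float) -> float:
    """Average power of the new system relative to the old one.

    Uses Energy = Power * Time:
        E_new / E_old = 1 - energy_savings
        T_new / T_old = 1 / speedup
        P_new / P_old = (E_new / E_old) / (T_new / T_old)
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0.0 <= energy_savings < 1.0:
        raise ValueError("energy_savings must be in [0, 1)")
    relative_energy = 1.0 - energy_savings
    relative_time = 1.0 / speedup
    return relative_energy / relative_time


if __name__ == "__main__":
    # Averages quoted on the slide: speedup of 3.2, energy savings of 34%.
    print(f"implied average power ratio: {relative_power(3.2, 0.34):.2f}")
```

A ratio above 1 would mean the partitioned system draws more power while running but finishes early enough to use less energy overall.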

27 Speedup Gained with Relatively Few Gates. Created several partitioned versions of each benchmark. Most speedup gained with first 20,000 gates; diminishing returns after that. Surprisingly few gates. References: Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002; Stitt and Vahid, IEEE Design and Test, Dec. 2002; J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

28 Other Types of Configurability. Microprocessor (other researchers): VLIW configurations, voltage scaling. Memory hierarchy: our focus is to build a highly-configurable cache that can be tuned to a particular program. Work by Chaunjun Zhang, along with Walid Najjar, at UCR.

29 Cache Contributes Much to Performance and Power. Well-known for performance. Energy: ARM920T: caches consume nearly half of total power (Segars 01); M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99). [Figures: ARM920T (source: Segars ISSCC'01); processor, L1 cache, mem]

30 Associativity Plays a Big Role. Reduces miss rate, thus improving performance. Impact on power and energy? (Energy = Power * Time)

31 Associativity is Costly. Associativity improves hit rate, but at the cost of more power per access. Are the power savings from reduced misses outweighed by the increased power per hit? [Figures: energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only); energy per access for 8 Kbyte cache]

32 Associativity and Energy. Best performing cache is not always lowest energy. [Figure annotation: significantly poorer energy]
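
The trade-off between energy per access and miss rate can be sketched with a simple two-term model: every access pays the cache's access energy, and every miss additionally pays an off-chip energy penalty. All numbers below (energies in nanojoules, miss rates, access count) are hypothetical and chosen only to show that either organization can win; none of them come from the talk's measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    name: str
    energy_per_access_nj: float  # dynamic energy of one cache access
    miss_rate: float             # fraction of accesses that miss


def total_energy_nj(config: CacheConfig, accesses: int, miss_energy_nj: float) -> float:
    """Total energy = accesses * access energy + misses * miss penalty energy."""
    misses = accesses * config.miss_rate
    return accesses * config.energy_per_access_nj + misses * miss_energy_nj


if __name__ == "__main__":
    accesses = 1_000_000
    miss_energy_nj = 20.0  # hypothetical cost of going off chip

    # Program A: few conflicts, so the cheaper direct-mapped cache wins.
    # Program B: many conflicts, so the four-way cache wins despite costlier hits.
    programs = {
        "A": (CacheConfig("direct mapped", 0.5, 0.02), CacheConfig("four-way", 1.0, 0.01)),
        "B": (CacheConfig("direct mapped", 0.5, 0.10), CacheConfig("four-way", 1.0, 0.01)),
    }
    for label, configs in programs.items():
        for cfg in configs:
            energy = total_energy_nj(cfg, accesses, miss_energy_nj)
            print(f"program {label}, {cfg.name}: {energy / 1e6:.2f} mJ")
```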

33 So What's the Best Cache? Looking at popular embedded processors, there's obviously no standard cache. Dilemma: direct mapped gives good performance and energy for most programs; four-way gives good performance for all programs, but at cost of higher power per access for all programs. Do we design for the average case or the worst case?

34 Solution to the Dilemma. Configurable cache: can be configured as four way, two way, or one way; ways can be concatenated. Furthermore, ways can even be shut down to decrease total size. [Figure: memory and cache shown as four-way, now two-way, now one-way (direct mapped cache)]
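
A toy software model can illustrate the two knobs on this slide. It is a behavioral sketch, not the hardware design from the talk: it assumes four banks of 64 lines with 32-byte lines (8 Kbytes in total), LRU replacement, and hypothetical names such as `ConfigurableCache` and `active_banks`. Way concatenation keeps all banks active while lowering associativity; way shutdown lowers `active_banks`.

```python
class ConfigurableCache:
    """Toy LRU cache model with way concatenation and way shutdown."""

    def __init__(self, associativity, active_banks=4, sets_per_bank=64, line_bytes=32):
        if active_banks not in (1, 2, 4):
            raise ValueError("active_banks must be 1, 2, or 4")
        if associativity not in (1, 2, 4) or associativity > active_banks:
            raise ValueError("associativity must be 1, 2, or 4 and at most active_banks")
        self.associativity = associativity
        self.line_bytes = line_bytes
        # Concatenating banks trades ways for sets; total capacity is unchanged.
        self.num_sets = sets_per_bank * active_banks // associativity
        # Each set holds tags ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    @property
    def size_bytes(self):
        return self.num_sets * self.associativity * self.line_bytes

    def access(self, address):
        """Look up one byte address; return True on a hit, False on a miss."""
        block = address // self.line_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.remove(tag)
            lines.append(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) == self.associativity:
            lines.pop(0)  # evict the least recently used line
        lines.append(tag)
        return False


def count_misses(trace, associativity, active_banks=4):
    cache = ConfigurableCache(associativity, active_banks)
    for address in trace:
        cache.access(address)
    return cache.misses


if __name__ == "__main__":
    # Two addresses 8 Kbytes apart collide in the direct-mapped configuration.
    trace = [0, 8192] * 100
    for ways in (1, 2, 4):
        print(f"{ways}-way, 8 KB: {count_misses(trace, ways)} misses")
    print(f"2-way with two banks shut down: {count_misses(trace, 2, active_banks=2)} misses")
```

A program with this kind of conflict pattern favors a higher associativity, while a program without it could run with the direct-mapped configuration and its lower energy per access.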

35 Configurable Cache Design: Way Concatenation. Small area and performance overhead. [Figure: cache circuit with tag part and data array (bitline, column mux, sense amps, mux driver, 6x64, data output, critical path marked), plus a configuration circuit that takes a11, a12, reg0 and reg1 and produces c0 to c3. Address fields: tag address a31 to a13, then a12 and a11, index a10 to a5, line offset a4 to a0.]
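
The address fields in the figure suggest how concatenation changes the lookup. The sketch below is one interpretation, not the circuit itself: it assumes a 5-bit line offset (a4 to a0), a 6-bit base index (a10 to a5), and that a11 joins the index in two-way mode while both a11 and a12 join it in one-way mode. The function name `split_address` is hypothetical.

```python
OFFSET_BITS = 5       # a4..a0: byte within a 32-byte line
BASE_INDEX_BITS = 6   # a10..a5: selects one of 64 lines per way

# Extra address bits folded into the index when ways are concatenated
# (assumed mapping: a11 for two-way, a11 and a12 for one-way).
EXTRA_INDEX_BITS = {4: 0, 2: 1, 1: 2}


def split_address(address: int, associativity: int):
    """Split a byte address into (tag, index, offset) for a given configuration."""
    if associativity not in EXTRA_INDEX_BITS:
        raise ValueError("associativity must be 1, 2, or 4")
    index_bits = BASE_INDEX_BITS + EXTRA_INDEX_BITS[associativity]
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << index_bits) - 1)
    tag = address >> (OFFSET_BITS + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    address = 0x1A2B  # arbitrary example address
    for ways in (4, 2, 1):
        tag, index, offset = split_address(address, ways)
        print(f"{ways}-way: tag={tag} index={index} offset={offset}")
```

In this reading, lowering the associativity moves address bits from the tag comparison into the index, so fewer ways need to be read on each access.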

36 Configurable Cache Experiments. Configurable cache with both way concatenation and way shutdown is superior on every benchmark. Considered Powerstone, MediaBench, and Spec2000. Tuning the cache to the program is important. Work submitted to High-Performance Computer Architectures 2003, Zhang, Vahid and Najjar. [Figure note: 100% = 4-way conventional cache]

37 Conclusions. Trend is away from semi-custom IC fabrication: big enough; other pressures encourage buying pre-fabricated platforms. Platforms must be highly configurable, to be useful for a variety of applications, and hence mass produced. We have discussed: software speedup/energy benefits of on-chip configurable logic (3x speedups with only ~10,000 gates); creating a highly-configurable cache architecture (40% energy savings compared to conventional cache). Current/future work (collaborators: Walid Najjar UCR, Nik Dutt UCI): automatically partitioning software loops to configurable logic, with several approaches (platform-assisted, and dynamically on-chip), work being done by Roman Lysecky, Susan Cotterell, Greg Stitt, and Shawn Nematbaktsh at UCR; automatically tuning a configurable cache, Ann Gordon-Ross at UCR.

