1
Adaptive System on a Chip (aSoC) for Low-Power Signal Processing
Andrew Laffely, Jian Liang, Prashant Jain, Ning Weng, Wayne Burleson, Russell Tessier
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst
{alaffely, jliang, pjain, nweng, burleson, tessier}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2
Overview
  Motivation
  Video Processing
  Architecture
  Dynamic Power Management
    Core, Interconnect, and Clock
3
Problem
Wireless video processing requires:
  High throughput
  Low power
  Flexibility
4
System on a Chip Solutions
  Take advantage of parallelism
  Potentially improved performance
  Allow use and reuse of existing integrated components
If:
  The application can be partitioned
  The appropriate architecture is used
5
Proposed Architecture: aSoC
High throughput
  Heterogeneous processor elements: use the right tool for the job
  Fast and predictable interconnect
Flexible
  Runtime reconfiguration of cores and interconnect
Power consumption
  Implement power-saving features in both cores and interconnect
  Use reconfiguration to dynamically control power consumption
6
aSoC: adaptive System on a Chip
  Tiled SoC architecture
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]
7
aSoC: adaptive System on a Chip
  Tiled SoC architecture
  Supports the use of independently developed heterogeneous cores
    Pick and place the cores that best perform the given application
      Increase performance
      Save power
    Cores may be any number of tiles in size
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]
8
aSoC: adaptive System on a Chip
  Tiled SoC architecture
  Supports the use of independently developed heterogeneous cores
  Connected with an interconnect mesh
    Restricted to near-neighbor communications
    Creates pipeline
    Decreases cycle time
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]
9
aSoC: adaptive System on a Chip
  Tiled SoC architecture
  Supports the use of independently developed heterogeneous cores
  Connected with a fixed interconnect mesh
  Using a communication interface (CI) to manage data
    Network port (Coreport) for each core
    Each CI uses a memory and FSM to repetitively process a predefined schedule of communications
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation; the CI crossbar is labeled]
10
Stream Control
Instruction memory
  Holds the predetermined schedule of communications
PC
  Selects and synchronizes the communications
Decoder
  Sets crossbar
Controller
  Sets PC
  Interprets incoming configuration commands
Crossbar
  Any input to any set of outputs
[Figure: communication interface with North, South, East, West, and Core inputs and outputs, Decoder/Controller, PC, Instruction Memory, and Local Config.]
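The components listed on this slide can be read as a small cyclic state machine. The following is a minimal sketch of that reading, not the aSoC hardware itself: the class name `CommInterface`, the dictionary encoding of a schedule entry, and the port names as strings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CommInterface:
    """Toy model of one tile's communication interface (CI).

    schedule plays the role of the instruction memory: one entry per cycle,
    each mapping an input port to the list of output ports it drives.
    The encoding is hypothetical, chosen only for this sketch.
    """
    schedule: list
    pc: int = 0

    def step(self, inputs):
        """Route this cycle's inputs through the crossbar, then advance the PC."""
        routes = self.schedule[self.pc]          # "decoder" reads the current entry
        outputs = {}
        for src, dsts in routes.items():
            if src in inputs:
                for dst in dsts:                 # any input to any set of outputs
                    outputs[dst] = inputs[src]
        # The schedule repeats, so the PC wraps around at the end.
        self.pc = (self.pc + 1) % len(self.schedule)
        return outputs


ci = CommInterface(schedule=[{"core": ["east"]}, {"west": ["east", "core"]}])
first = ci.step({"core": 7})    # cycle 0: core output forwarded east
second = ci.step({"west": 3})   # cycle 1: west input fanned out to east and core
```

After the second step the PC is back at 0, which mirrors the repetitive processing of a predefined schedule described for the CI.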
11
Example: Communication Stream
A given application requires periodic communications from Core A to Core C
aSoC uses a prescheduled communication STREAM
  Core A places the data in a dedicated STREAM between the two tiles
  Core C pulls the data from that STREAM
  The tile-to-tile communication uses 3 cycles
[Figure: Stream A-D across tiles Core C, Core B, and Core A]
12
Example: Stream
[Figure: tiles C, B, and A; step 1: Core to East]
13
Example: Stream
[Figure: Stream A-D across tiles C, B, and A; step 2: West to East]
14
Example: Stream
[Figure: tiles C, B, and A; step 3: West to Core]
15
Example: Stream
[Figure: Stream A-D across tiles C, B, and A; step 1: Core to East; step 2: West to East; step 3: West to Core; then Loop Back]
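The three labeled steps can be traced in a few lines of code. This is a sketch under stated assumptions: the tile adjacency (A's east port feeding B's west port, and B's east port feeding C's west port) is inferred from the step labels, and the names `LINKS`, `SCHEDULE`, and `trace_stream` are hypothetical.

```python
# Assumed adjacency, inferred from the step labels on the slides.
LINKS = {("A", "east"): ("B", "west"), ("B", "east"): ("C", "west")}

# One scheduled hop per cycle: (cycle, tile, input port, output port).
SCHEDULE = [
    (1, "A", "core", "east"),   # Core to East
    (2, "B", "west", "east"),   # West to East
    (3, "C", "west", "core"),   # West to Core
]


def trace_stream():
    """Follow one data word from Core A to Core C, one hop per cycle."""
    location = ("A", "core")
    trace = []
    for cycle, tile, src, dst in SCHEDULE:
        if location != (tile, src):
            raise ValueError(f"cycle {cycle}: data is at {location}, not {(tile, src)}")
        # Crossing a link lands on the neighbor's input port; a route to the
        # local core port stays on the same tile.
        location = LINKS.get((tile, dst), (tile, dst))
        trace.append((cycle, location))
    return trace


for cycle, where in trace_stream():
    print(cycle, where)
```

The word arrives at Core C's port at the end of the third cycle, consistent with the 3-cycle transfer stated on the earlier slide; the schedule then loops back to the first entry.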
16
Static Scheduled Communications
  Creates system scalability by “eliminating” network congestion
  Many interconnect segments, managed with time division multiplexing, provide high bandwidth
  Improves SoC performance by up to a factor of 8
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]
17
Power Consumption?
  Provide reconfiguration methods for cores and CI
  Develop programmable clocking systems at each tile
18
Power Aware Core
Custom motion estimation core
Choose search method:
  Full search: 960-600 mW (depending on bit width and pel sub-sampling)
  Spiral search: 76 mW
  Three step search: 25 mW
Data taken with Synopsys Power Compiler at the RTL level
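One way to use these figures is to pick a search method against a power budget. The sketch below is an illustration only: the selection policy, the function name `pick_search`, the use of 600 mW as the minimum full-search figure, and the ordering of spiral search ahead of three step search in preference are assumptions, not something the slide states.

```python
# Power figures from the slide, in mW. Full search spans 960-600 mW; the lower
# bound is used here as the cheapest full-search configuration (an assumption).
# The list is ordered from most to least preferred, which is also an assumption.
SEARCH_POWER_MW = [
    ("full", 600.0),
    ("spiral", 76.0),
    ("three_step", 25.0),
]


def pick_search(budget_mw):
    """Return the first search method whose power fits the budget, or None."""
    for name, power_mw in SEARCH_POWER_MW:
        if power_mw <= budget_mw:
            return name
    return None


print(pick_search(100.0))   # spiral
print(pick_search(30.0))    # three_step
```

A budget below 25 mW returns `None`, which a caller could treat as a signal to turn the core off.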
19
aSoC Support
  Multiple streams in and out through dedicated coreports
    Easy to manage on both sides of the port
  Schedule configuration streams in with the data
    Stream A: Input Frame
    Stream B: Configuration (choose search mode and size)
    Stream C: Motion Vectors
[Figure: Motion Estimation Core with coreports in1, in2, out1, and out2 carrying Streams A, B, and C]
20
Reconfigurable Interconnect
[Figure: P-frame datapath (Input Frame, ME, MC, subtract and add nodes, DCT) and I-frame datapath (Input Frame, DCT)]
21
aSoC Support
  ME, MC, and summation lumped into one double core
[Figure: DCT tile and Motion Estimation & Compensation double core]
22
aSoC Support: P-Frame
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A) and Difference Frame (Stream B)]
23
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A), Difference Frame (Stream B), and Configuration Streams (C & D)]
24
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A), Difference Frame (Stream B), and Configuration (Stream C); the PC selects between Schedule 1 and Schedule 2]
25
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A), Difference Frame (Stream B), and Configuration (Stream C); the PC selects between Schedule 1 and Schedule 2]
26
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A) and Configuration (Stream D); the PC selects between Schedule 1 and Schedule 2]
27
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation tiles with Input Frame (Stream A’) and Configuration (Stream D); the PC selects between Schedule 1 and Schedule 2]
28
aSoC Support: I-Frame
[Figure: DCT tile with Input Frame (Stream A’); Motion Estimation & Compensation core OFF]
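The schedule-change sequence on the preceding slides can be sketched as a CI that stores two schedules and switches between them when a configuration command arrives. This is a rough illustration: the class `ScheduledCI`, the schedule names `p_frame` and `i_frame`, and the route strings are hypothetical, and the real mechanism by which the controller sets the PC may differ.

```python
class ScheduledCI:
    """Toy CI holding several schedules; a configuration command picks one."""

    def __init__(self, schedules):
        # schedules: name -> list of per-cycle route labels (hypothetical encoding)
        self.schedules = schedules
        self.active = next(iter(schedules))   # start on the first schedule
        self.pc = 0

    def configure(self, name):
        """Model a configuration stream command: jump the PC to another schedule."""
        if name not in self.schedules:
            raise KeyError(name)
        self.active = name
        self.pc = 0

    def step(self):
        """Return the route for this cycle and advance the PC within the schedule."""
        schedule = self.schedules[self.active]
        route = schedule[self.pc]
        self.pc = (self.pc + 1) % len(schedule)
        return route


ci = ScheduledCI({
    "p_frame": ["Stream A: input frame to ME/MC", "Stream B: difference frame to DCT"],
    "i_frame": ["Stream A': input frame to DCT"],
})
ci.step()                 # P-frame schedule, first entry
ci.configure("i_frame")   # configuration command selects Schedule 2
ci.step()                 # I-frame schedule bypasses the ME/MC core
```

Keeping both schedules resident means the switch costs only a PC update, which fits the slides' point that configuration travels as ordinary streams.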
29
Operating Frequency?
Interconnect: synchronized H-tree clock distribution
Core frequencies depend on critical path
  Tile provides clock reference
  Coreport provides asynchronous boundary
Dynamic core configuration requires dynamic clock configuration
  aSoC clock reference provides multiples of interconnect clock (… 4x, 2x, 1x, 0.5x, 0.25x, …)
  Configured through the tile controller
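Choosing a core clock from the power-of-two multiples of the interconnect clock can be sketched as follows. The function name `pick_multiplier`, the policy of taking the slowest multiple that still meets a required frequency, and the exponent bounds of -2 to 2 (0.25x to 4x) are assumptions for illustration; the slide leaves the range open on both ends.

```python
def pick_multiplier(required_hz, interconnect_hz, min_exp=-2, max_exp=2):
    """Return the smallest power-of-two multiple of the interconnect clock
    that meets required_hz, or None if none in the assumed range does.

    min_exp and max_exp are hypothetical bounds (0.25x to 4x by default).
    """
    for exp in range(min_exp, max_exp + 1):
        multiplier = 2.0 ** exp
        if multiplier * interconnect_hz >= required_hz:
            return multiplier
    return None


# A core needing 30 MHz next to a 100 MHz interconnect can run at 0.5x (50 MHz).
print(pick_multiplier(30e6, 100e6))   # 0.5
```

Running a core at the slowest adequate multiple is one way to realize the slow, low-current operation discussed later in the deck.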
30
Mixed vs. Fixed Core Frequencies
  Cores not designed with clock gating
  Core power from Synopsys RTL simulation
  Interconnect power from SPICE
  Assumes 10-cycle schedule, 4 pixels/word
31
Current Density and Clocking
Red: fixed worst-case clocking
  Short spikes of high current
Green: optimal independent clocking
  Slow and low
Optimal clocking eliminates current spikes (improved battery life)
[Figure: current versus time from Process Start to Deadline for ME: Full Search, ME: Spiral, ME: Three Step Search, and DCT]
32
Configuration Overhead
Configuration adds up to 2 streams per tile
  Only 2 required for data
Total BW = 5 × T × N
  5 streams/(cycle, tile)
  T tiles
  N cycles in schedule
Single tile can support up to 50 different streams in a 10-cycle schedule
[Figure: DCT tile with Transform Frame (Stream D), Input Frame (Stream B), and Configuration Streams]
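The bandwidth formula is simple enough to check directly. A short sketch, with the function name `total_stream_slots` chosen for illustration:

```python
STREAMS_PER_TILE_PER_CYCLE = 5   # from the slide: 5 streams/(cycle, tile)


def total_stream_slots(tiles, cycles):
    """Total BW = 5 x T x N, counted in stream slots per schedule period."""
    return STREAMS_PER_TILE_PER_CYCLE * tiles * cycles


# One tile with a 10-cycle schedule: 5 * 1 * 10 = 50 stream slots.
print(total_stream_slots(tiles=1, cycles=10))   # 50
```

With T = 1 and N = 10 this reproduces the slide's figure of up to 50 streams per tile, of which a data path such as the DCT example needs only 2, leaving ample room for the up to 2 configuration streams.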
33
Configuration Power Overhead
Configuration streams used infrequently
  Once per macroblock or once per frame
Architecture disables unused streams
  Data valid bit already used for flow control
Only 4-9% of interconnect power is due to configuration streams
34
Conclusion
aSoC supports dynamic power management with reconfiguration of:
  Cores
  Interconnect
  Clocks
Low configuration overhead in both:
  Communication bandwidth
  Power
35
Future Work
  Add reconfigurable voltage supplies at each tile
  Finish test chip
  Import larger applications
36
Questions
37
aSoC: adaptive System on a Chip
[Figure: tile array with cores DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation; labels: Cores, Interconnect, Interface, Tile]
38
Example: Stream
[Figure: Stream A-D across tiles C, B, and A]
39
Partitioning
  Automated partitioning is a nontrivial problem
  For small signal processing systems, user-defined partitioning may be possible
Key: Perfectly partitioning the system may not be possible
  How can the SoC mitigate the penalty?