Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems
Murali Jayapala, Francisco Barat, Pieter Op de Beeck, Tom Vander Aa, Geert Deconinck (ESAT/ACCA, K.U.Leuven, Belgium)
Francky Catthoor, Henk Corporaal (IMEC, Leuven, Belgium)

ESAT/ACCA 2 Overview
- Context: introduction to the problem
- Motivation for L0 buffer organization and status
- Distributed L0 buffer organization
- Instruction memory exploration
  - Software and compiler transformations
- Conclusions

ESAT/ACCA 3 Context: Low Energy Embedded Systems
Low power embedded systems
- Battery operated (low energy): 10-50 MOPS/mW
- Small
- Low cost
- Flexible
Multimedia applications
- Video, audio, wireless
- High performance: 10-100 GOPS, real-time constraints

ESAT/ACCA 4 Context: Embedded Systems, Programmable Processor Based
Power breakdown in embedded processors
- 43% of power in on-chip memory: StrongARM SA110 (a 160MHz 32b 0.5W CMOS ARM processor)
- 40% of power in internal memory: C6x, Texas Instruments Inc.
- 25-30% of power in instruction memory
To address the data memory issues: Data Transfer and Storage Exploration (DTSE) methodology, F. Catthoor et al.

ESAT/ACCA 5 Related Work
Significant power consumption in the instruction memory hierarchy: Core, L1 cache (on-chip), Main Memory (off-chip)
Compression (code size reduction)
- L. Benini et al., “Selective Instruction Compression for Memory Energy Reduction...”, ISLPED 1999
- P. Centoducatte et al., “Compressed Code Execution on DSP Architectures”, ISSS 1999
- T. Ishihara et al., “A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors”, DATE 2000
Software transformations
- N. D. Zervas et al., “A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications”, ICECS 2001
- S. Parameswaran et al., “I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency”, ICCAD 2001

ESAT/ACCA 6 Overview
- Context: introduction to the problem
- Motivation for L0 buffer organization and status
- Distributed L0 buffer organization
- Instruction memory exploration
  - Software and compiler transformations
- Conclusions

ESAT/ACCA 7 Application Domain: Multimedia Characteristics (1)
- Static instruction count (IC static) versus dynamic instruction count (IC dynamic)
- High locality
[Figure: instruction count, IC static versus IC dynamic; labels on the slide: < 1%, 0%, 100%, 2%, 0%]

ESAT/ACCA 8 Application Domain: Multimedia Characteristics (2)
[Figure: normalized static instruction count versus normalized dynamic instruction count]
- Within a program, a few basic blocks or instructions take up most of the execution time (IC dynamic)
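
The locality claim above can be checked on a basic-block profile. The following is a minimal sketch, assuming a profile given as per-block execution counts and block sizes; the function name and the numbers are hypothetical and are not taken from the presentation.

```python
def static_fraction_for_coverage(exec_counts, sizes, coverage=0.9):
    """Fraction of the static code needed to cover `coverage` of the
    dynamic instruction count, taking the hottest basic blocks first."""
    blocks = sorted(zip(exec_counts, sizes), key=lambda b: b[0], reverse=True)
    total_dynamic = sum(count * size for count, size in blocks)
    total_static = sum(sizes)
    covered = 0
    static = 0
    for count, size in blocks:
        if covered >= coverage * total_dynamic:
            break
        covered += count * size  # dynamic instructions contributed by this block
        static += size           # static instructions occupied by this block
    return static / total_static


# Hypothetical profile: one small hot loop body and three large cold blocks.
exec_counts = [1000, 1, 1, 1]
sizes = [10, 100, 200, 190]
fraction = static_fraction_for_coverage(exec_counts, sizes)
print(f"{fraction:.0%} of the static code covers 90% of the dynamic count")
```

With this made-up profile a single 10-instruction block is enough, which is the kind of skew that motivates a small buffer next to the L1 cache.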

ESAT/ACCA 9 Motivation for additional small memory
Memory hierarchy: Core, L1 cache (on-chip), Main Memory (off-chip)
- Application domain: high locality in a few basic blocks
- The size of the basic blocks with high locality is still large
- If the L1 cache (on-chip) is made small:
  - Performance degrades: capacity (compulsory) misses
  - System power increases: off-chip memory / bus activity increases
- Therefore a small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance

ESAT/ACCA 10 Related Work (Microarchitecture): Cache Design
Memory hierarchy: Core, additional small cache, L1 cache (on-chip), Main Memory (off-chip)
N. Jouppi et al., “Improving direct-mapped cache performance by addition of a small fully-associative cache and prefetch buffers”, ISCA 1990
- Aim: to reduce miss penalty cycles
- Techniques: miss caching, victim caching, stream buffers

ESAT/ACCA 11 Related Work (Microarchitecture): Cache Design
Memory hierarchy: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip)
J. D. Bunda et al., “Instruction-Processing Optimization Techniques for VLSI Microprocessors”, PhD thesis, 1993
- Aim: to reduce instruction cache energy
- L0 buffer: cache block buffer (1 cache block + 1 tag)
- Limitations: block thrashing
J. Kin et al., “Filtering memory references to increase energy efficiency”, IEEE Trans. on Computers, 2000
- Aim: to reduce instruction cache energy
- L0 buffer: filter cache
  - Small regular cache (< 1KB)
  - L0 access (hit) latency: 1 cycle
  - L1 access (hit) latency: 2 cycles
- Limitations:
  - Energy is reduced at the expense of performance
  - 256 Byte: 58% power reduction with 21% performance degradation

ESAT/ACCA 12 Related Work (Architecture): Software controlled L0 buffers
Memory hierarchy: Core, L0 Buffer with LC, L1 cache (on-chip), Main Memory (off-chip)
- R.S. Bajwa et al., “Instruction Buffering to Reduce Power in Processors for Signal Processing”, IEEE Trans. VLSI Systems, vol. 5, no. 4, 1997
- L. H. Lee et al. (M-CORE), “Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops”, ISLPED 1999
Characteristics
- L0 buffer: buffer (< 1KB) + local controller (LC); no tags
- L0 / L1 access latency: 1 cycle
- Used only for specific program segments (innermost loops)
- Software control: special instructions (lbon, sbb) map program segments to the L0 buffer
[Figure: datapath, L1 and L0 in three phases: normal operation, filling, and L0 buffer operation; transitions labeled initiation, execution, termination]
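
The three phases of a software-controlled, tagless loop buffer can be sketched as a small state machine. This is only an illustration under stated assumptions: reaching `loop_start` stands in for the special initiation instruction, the first iteration fills the buffer, and leaving the address range terminates buffer operation. The names `State` and `fetch_sources` are hypothetical and do not come from the cited papers.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"    # instructions come from the L1 cache
    FILLING = "filling"  # first loop iteration: fetch from L1 and copy into L0
    L0 = "l0"            # later iterations: fetch from the L0 buffer


def fetch_sources(pc_trace, loop_start, loop_end):
    """Return, for each fetched PC, which memory served it ("L1" or "L0")."""
    state = State.NORMAL
    sources = []
    for pc in pc_trace:
        in_loop = loop_start <= pc <= loop_end
        if state is State.NORMAL:
            if pc == loop_start:          # initiation (assumed trigger)
                state = State.FILLING
        elif not in_loop:                 # termination: jump out of the range
            state = State.NORMAL
        elif state is State.FILLING and pc == loop_start:
            state = State.L0              # second iteration: buffer is filled
        sources.append("L0" if state is State.L0 else "L1")
    return sources


# Hypothetical trace: a 3-instruction loop at addresses 2..4, executed 3 times.
trace = [0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 6]
sources = fetch_sources(trace, loop_start=2, loop_end=4)
print(sources.count("L0"), "of", len(sources), "fetches served by L0")
# prints: 6 of 13 fetches served by L0
```

Because there are no tags, the sketch never checks for a hit; it relies entirely on the controller state, which is why only straight-line innermost loops fit this scheme.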

ESAT/ACCA 13 Related Work (Architecture): Software controlled L0 buffers
Assumed architecture
- MIPS 4000 ISA
- Single issue processor
- L1 cache: 16KB, direct mapped
- Loop buffer (2KB): depth = 128 instructions, width = 16 bytes
Tools
- SimpleScalar 2.0
- Wattch power estimator
Loops with fewer than 128 instructions were hand-mapped onto the loop buffer.

ESAT/ACCA 14 Related Work (Architecture): Software controlled L0 buffers
Advantages
- 50% (avg) energy reduction, with no performance degradation
- Software control: makes it possible to map only selected program segments
Limitations
- Supports only innermost loops (regular basic blocks)
- Other frequently executed basic blocks are still fetched from the L1 cache
- No support for control constructs within loops
F. Vahid et al. [2001-2002]: hardware support for conditional constructs within loops
- Identifying the loop address bounds (preloading the program segment/loop)
- Sub-routines
- Conditional constructs
- 1-level nested loops

ESAT/ACCA 15 Related Work (Architecture): Compiler controlled L0 buffers
Memory hierarchy: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip)
N. Bellas et al., “Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors”, ISLPED 1998
- Aim: reduce instruction cache energy by letting the compiler allocate basic blocks to the L0 buffer
- L0 buffer: regular cache (< 1KB; 128 instr)
- Technique: profile, function inlining, identify basic blocks, code layout
[Figure: code layout, with the basic blocks allocated to the L0 buffer placed in the L0 buffer address space]
Advantages
- Automated: a tool can do this job
- Uses the basic block as the atomic unit of allocation
- 60% (avg) energy reduction in the instruction memory hierarchy [SPEC95]
Limitations
- Tag overhead

ESAT/ACCA 16 Loop Buffers: Commercial Processors
RISC DSP processors
- SH-DSP
  - Decoded instruction buffers
  - Supports regular loops (no conditional constructs/nested loops)
VLIW processors
- StarCore SC140
  - Supports regular and nested loops
  - Conditional constructs through predication
- STMicroelectronics ST120
  - Supports nested loops and loops with conditional constructs

ESAT/ACCA 18 Shortcomings
So far: hardware, software, and compiler optimizations to increase accesses (activity) at the L0 buffers.
Memory hierarchy: Core, L0 Buffer (increased accesses/activity), L1 cache (on-chip), Main Memory (off-chip)
Bottleneck to solve
- L0 buffer organization
- Interconnect: from L0 buffer to datapath
- Efficient buffer controller
Organization: scalable with increase in #FUs
[Figure: centralized organization, one L0 buffer with its LC feeding the FUs]

ESAT/ACCA 19 Current Organizations for L0 Buffers
Uncompressed L0 buffer
- Buffer: width follows the issue width (#FUs)
- Interconnect: long
- LC: simple addressing (counter based)
- Ref: Bajwa et al., L.H. Lee et al., F. Vahid et al.
Compressed L0 buffer (with decompressor/dispatch)
- Buffer:
  - High storage density (no NOPs)
  - Width follows the issue width (#FUs)
  - Overhead in decompressing
- Interconnect: still centralized, long lines
- LC: simple addressing (counter based)
- Ref: TI (execute packet fetch mechanism)

ESAT/ACCA 20 Current Organizations for L0 Buffers (continued)
Sub-banked/partitioned L0 buffer with compression (banks 1-4 behind a re-organizer)
- Buffer: smaller memories, overhead in the re-organizer
- Interconnect: still centralized
- LC: complex addressing (needs expensive tags)
- No correlation between partitioning and FUs
- Ref: T. Conte et al. [TINKER]
Partitioned L0 buffer (partitions 1-4)
- Buffer: smaller memories
- Interconnect: still long
- LC:
  - Simple addressing (counter based)
  - All banks must be accessed simultaneously, even if some of the FUs are not active
- Ref: sub-banking

ESAT/ACCA 21 Solution: Distributed Instruction Buffer Organization
A balance of energy consumption between buffers, interconnect, and local controllers is needed.
- Buffers: sub-banked/partitioned in correlation with FU activation
- Interconnect: localized (limited connectivity between FUs and buffers)
- Buffer control:
  - Stores instructions in each partition
  - Fetches instructions during loop execution
  - Regulates the accesses to each partition
ATC: Address Translation and Control
IROC: Instruction Registers Operation and Control
[Figure: distributor/dispatch feeding instruction clusters, each with buffers, an ATC, and FUs, together with the IROC]

ESAT/ACCA 22 Distributed L0 Buffer Operation
Similar to conventional L0 buffer operation:
- Initiation: special instruction LBON
- Filling: pre-fetching the instructions within the loop's address range
- Termination: when the program flow jumps to an address outside that range
[Figure: datapath, L1 and distributed L0 in three phases: normal operation, filling, and L0 buffer operation; transitions labeled initiation, execution, termination]

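ESAT/ACCA 23 The Buffer Operation: An Illustration
Source loop: for (..) { … if (..) { … } else { … } … }
Schedule (one row per instruction word; LBON precedes the loop):
Label  FU1   FU2   FU3   BR
S:     OP11  OP21  OP31  NOP
       NOP   OP22  OP32  BNZ ‘x’
       OP12  NOP   NOP   BR ‘y’
X:     OP13  NOP   OP33  NOP
Y:     OP14  OP23  NOP   BNZ ‘s’
The if block and the else block are the two rows between the BNZ ‘x’ row and Y.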

ESAT/ACCA 24 The Buffer Operation: An Illustration
Same loop and schedule as on the previous slide, now with the contents of the distributed buffer.
IROC registers: START_ADDR, END_ADDR, IR_USE, NEW_PC, PC
Partition  Buffer contents           Table entries (one per instruction word)
FU1        OP11 OP12 OP13 OP14       01 -0 11 21 31
FU2        OP21 OP22 OP23            01 11 -0 -0 21
FU3        OP31 OP32 OP33            01 11 -0 21 -0
BR         BNZ ‘x’ BR ‘y’ BNZ ‘s’    -0 01 11 -0 21
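
The partition contents on this slide can be derived mechanically from the schedule. The sketch below assumes each two-character table entry reads as (local buffer index, use bit), so `01` means "entry 0, active" and `-0` means "no entry, inactive"; that reading is an interpretation of the slide, not something it states. The function name `distribute` is hypothetical, and slots left empty on the slide are treated as NOPs.

```python
NOP = None  # an empty issue slot


def distribute(schedule):
    """Split a VLIW schedule (rows of per-slot operations) into one compact
    buffer per slot plus a per-row translation table for each slot."""
    n_slots = len(schedule[0])
    buffers = [[] for _ in range(n_slots)]
    tables = [[] for _ in range(n_slots)]
    for row in schedule:
        for slot, op in enumerate(row):
            if op is NOP:
                tables[slot].append((None, 0))  # nothing stored, slot idle
            else:
                tables[slot].append((len(buffers[slot]), 1))
                buffers[slot].append(op)        # NOPs are never stored
    return buffers, tables


# Schedule from the illustration; columns are FU1, FU2, FU3, BR.
schedule = [
    ["OP11", "OP21", "OP31", NOP],
    [NOP, "OP22", "OP32", "BNZ x"],
    ["OP12", NOP, NOP, "BR y"],
    ["OP13", NOP, "OP33", NOP],
    ["OP14", "OP23", NOP, "BNZ s"],
]
buffers, tables = distribute(schedule)
for name, buf, tab in zip(["FU1", "FU2", "FU3", "BR"], buffers, tables):
    entries = " ".join(f"{'-' if i is None else i}{use}" for i, use in tab)
    print(name, buf, entries)
```

Under that reading the printed entries match the slide: each partition stores only its own operations, and the table says, per instruction word, whether the partition needs to be accessed at all.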

ESAT/ACCA 30 Energy Trade-Offs
$$E = \sum_{i=1}^{\#partitions} E_{buffer,i} + \sum_{i=1}^{\#partitions} E_{LC,i} + \sum_{i=1}^{\#partitions} E_{interconnect,i}$$
[Figure: normalized energy (baseline at 1) over an axis running from 1 to #FUs, with curves for $\sum E_{buffer,i}$, $\sum E_{interconnect,i}$ and $\sum E_{LC,i}$]

ESAT/ACCA 31 Profile Based Clustering
Instruction cluster: a group of functional units with a separate local controller and an instruction buffer partition.
Inputs to instruction clustering
- Static trace (loops mapped to L0): FU activation bit-vectors, delimited per loop, e.g. begin 1 1 1 0 0 … 1, 1 0 1 0 1 … 0 end
- Dynamic trace (during loop execution): FU activation bit-vectors, e.g. 1 1 1 0 0 … 1, 1 0 1 0 1 … 0
- Energy models (register file)
Output: instruction clusters
- FU grouping
- Width and depth of the instruction buffers in each partition
Formulation
$$\min_{clust} \; Energy(clust, \text{dynamic profile}, \text{static profile}) \quad \text{s.t.} \quad \sum_{i=1}^{max\_clusters} clust(i,j) = 1 \quad \forall j$$
where $clust(i,j) = 1$ if the j-th FU is assigned to cluster i, and 0 otherwise.
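
For a handful of FUs the formulation above can be explored by exhaustive search. The sketch below is not the authors' clustering tool: the energy model is a toy assumption (energy per access grows with partition width times depth, plus a fixed local-controller cost), and the traces, constants, and function names are hypothetical.

```python
from itertools import product


def cluster_energy(fus, static_trace, dynamic_trace, e_bit=1.0, e_lc=0.5):
    """Toy energy of one cluster: accesses * (width * depth * e_bit + e_lc)."""
    width = len(fus)
    # Depth: static instruction words in which the cluster has any operation.
    depth = sum(1 for row in static_trace if any(row[f] for f in fus))
    # Accesses: executed instruction words in which the cluster is active.
    accesses = sum(1 for row in dynamic_trace if any(row[f] for f in fus))
    return accesses * (e_bit * width * depth + e_lc)


def best_clustering(n_fus, static_trace, dynamic_trace, max_clusters):
    """Try every FU-to-cluster assignment and keep the cheapest one."""
    best = None
    # Each FU gets exactly one cluster label, which enforces sum_i clust(i,j) = 1.
    for assign in product(range(max_clusters), repeat=n_fus):
        clusters = [[f for f in range(n_fus) if assign[f] == c]
                    for c in range(max_clusters)]
        clusters = [c for c in clusters if c]  # drop unused cluster labels
        energy = sum(cluster_energy(c, static_trace, dynamic_trace)
                     for c in clusters)
        if best is None or energy < best[0]:
            best = (energy, clusters)
    return best


# Hypothetical loop of 3 instruction words on 4 FUs, executed 10 times.
static_trace = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
dynamic_trace = static_trace * 10
energy, clusters = best_clustering(4, static_trace, dynamic_trace, max_clusters=2)
print(energy, clusters)
# prints: 220.0 [[0, 1], [2, 3]]
```

In this example FUs 0 and 1 are always active together while FUs 2 and 3 are rarely used, so grouping them separately lets the second partition stay shallow and idle most of the time. The search is exponential in the number of FUs and is only meant to make the objective concrete.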

ESAT/ACCA 32 Results
$$E = \sum_{i=1}^{\#partitions} E_{buffer,i} + \sum_{i=1}^{\#partitions} E_{LC,i}$$
[Figure: normalized energy]
Assumptions
- Only the buffers and controllers are modeled (no interconnect as yet)
- #FUs in datapath = 10
- Fixed schedule (activation trace)
- Schedule generated using Trimaran 2.0

ESAT/ACCA 33 In Comparison With Other Schemes
Results shown for ADPCM.
- Uncompressed: centralized L0 buffer
- Compressed: centralized L0 buffer, 2 additional registers for VL decoding
- Partitioned (sub-banked, no access regulation): 2 partitions
- Clustered (varying width only): 3 partitions
- Clustered (varying both width and depth): 2 partitions

ESAT/ACCA 34 Fully Distributed Instruction Memory Hierarchy
[Figure: main memory (off-chip); two L1 caches (on-chip), labeled L1 cluster; four groups of L0 buffers with FUs, labeled L0 cluster]

ESAT/ACCA 36 Exploration Methodology: What we have
Flow: Application -> Software transformations -> Compiler (scheduling) -> Clustering tool (with energy models) -> Instruction clusters
Pareto curve generation (energy versus delay)
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
- Optimized for performance: maximum cluster activity
- Optimized for energy: minimal cluster activity

ESAT/ACCA 37 Exploration Methodology: What we want to achieve
Flow: Application -> Software transformations -> Compiler (scheduling & clustering, with energy models) -> Instruction clusters and schedule
Pareto curve generation (energy versus delay)
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
- Optimized for performance: maximum cluster activity
- Optimized for energy: minimal cluster activity
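
Pareto curve generation amounts to filtering the explored design points down to those that no other point beats in both delay and energy. A minimal sketch, assuming each explored configuration has already been reduced to a (delay, energy) pair; the function name and the sample points are hypothetical.

```python
def pareto_front(points):
    """Keep the (delay, energy) points that no other point dominates."""
    front = []
    # Sorted by delay, then energy: a point survives only if it has strictly
    # lower energy than every point that is at least as fast.
    for delay, energy in sorted(set(points)):
        if not front or energy < front[-1][1]:
            front.append((delay, energy))
    return front


# Hypothetical (delay, energy) results from five schedule/clustering variants.
points = [(10, 50), (12, 40), (11, 55), (15, 40), (14, 30)]
print(pareto_front(points))
# prints: [(10, 50), (12, 40), (14, 30)]
```

The first point on the front is the performance-optimized end and the last is the energy-optimized end; an operating point would be picked somewhere along this list.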

ESAT/ACCA 38 Compiler Scheduling
Compiler scheduling can change the functional unit activity, and hence the clustering result, energy, and performance.
Energy reduction without performance loss
    OP11 OP12 -    OP13 -    OP14    all 3 clusters need to be active
    OP11 OP12 OP13 OP14 -    -       only 2 clusters need to be active
Energy reduction at the expense of performance loss
  Before: 2 activations of all 3 clusters
    OP11 OP12 -    OP13 -    OP14
    OP21 -    OP22 -    OP23 -
  After: 2 activations for the 1st cluster, 1 activation each for the 2nd and 3rd cluster
    OP11 OP12 -    -    -    -
    OP21 -    -    -    -    -
    -    -    OP22 OP13 OP23 OP14
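
The activation counts on this slide can be reproduced by counting, per cluster, the instruction words in which at least one of its FUs issues an operation. The sketch assumes three clusters of two adjacent FUs each, which is inferred from the slide rather than stated on it, and it reads the lone operation in the second word of the rescheduled code as OP21. The function name is hypothetical.

```python
def cluster_activations(schedule, clusters):
    """Per cluster, count the instruction words in which any of its FUs is used."""
    return [sum(1 for row in schedule if any(row[f] is not None for f in fus))
            for fus in clusters]


_ = None  # an empty issue slot
clusters = [(0, 1), (2, 3), (4, 5)]  # assumed grouping: adjacent FU pairs

before = [
    ["OP11", "OP12", _, "OP13", _, "OP14"],
    ["OP21", _, "OP22", _, "OP23", _],
]
after = [
    ["OP11", "OP12", _, _, _, _],
    ["OP21", _, _, _, _, _],
    [_, _, "OP22", "OP13", "OP23", "OP14"],
]
print(cluster_activations(before, clusters), len(before), "cycles")
# prints: [2, 2, 2] 2 cycles
print(cluster_activations(after, clusters), len(after), "cycles")
# prints: [2, 1, 1] 3 cycles
```

Total activations drop from 6 to 4 while the schedule grows from 2 to 3 cycles, which is the energy-for-performance trade the slide describes.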

ESAT/ACCA 39 Software Transformations
High-level code transformations can also change the clustering result, and hence energy and performance.
Loop transformations
- Loop splitting
- Loop merging
- Loop peeling (for nested loops)
- Loop collapsing (nested loops)
- Code movement across loops
- etc.
[Figure: loop splitting, one loop split into loop 1 and loop 2]
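
Loop splitting, the transformation shown in the figure, can be illustrated at source level. The example below is a generic sketch with hypothetical function names; it only shows that the split version computes the same result with two smaller loop bodies, each of which would need fewer operations per iteration and so could activate fewer clusters.

```python
def fused(a, b):
    """One loop whose body performs two independent computations."""
    sums, prods = [], []
    for x, y in zip(a, b):
        sums.append(x + y)
        prods.append(x * y)
    return sums, prods


def split(a, b):
    """The same work after loop splitting: two loops with smaller bodies."""
    sums = []
    for x, y in zip(a, b):   # loop 1: additions only
        sums.append(x + y)
    prods = []
    for x, y in zip(a, b):   # loop 2: multiplications only
        prods.append(x * y)
    return sums, prods


print(fused([1, 2, 3], [4, 5, 6]) == split([1, 2, 3], [4, 5, 6]))
# prints: True
```

Splitting is only legal when the two parts of the body do not depend on each other across iterations, as is the case here.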

ESAT/ACCA 41 Conclusions
L0 buffer organization
- Multimedia applications have high locality in small program segments
- An additional small L0 buffer should be used
- Current options for the L0 buffer are still not energy efficient
- A distributed L0 buffer organization should be sought
- But the clustering/partitioning should be application specific
L1 cache organization
- Distributed (?)
Instruction memory exploration
- Software transformations and compiler scheduling can change the clustering results
- An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)
