Published by Kerry Perry. Modified over 8 years ago.
1
Automated Design of Custom Architecture Tulika Mitra http://www.comp.nus.edu.sg/~tulika
2
2 Motivation
- An embedded system is designed for a specific application or a class of applications
- Design a processor for an application domain
- Processor ISA and micro-architecture optimized for that application domain
- Multiple optimization criteria: time, power, cost, ...
- Stringent time-to-market constraint
- Design of a domain-specific processor should be (semi-)automated
3
3 Architecture Synthesis
[Figure labels: Application, Automatic Tool, Customized Architecture; Power, Size, Performance, Timing]
4
4 Tensilica: Xtensa Architecture Copyright: Tensilica
5
5 Design Framework Copyright: Tensilica
6
6 Design Framework: Key Steps
- Instantiate parameters for the core processor to optimize performance, power, or cost
- Identify useful and feasible ISA extensions
- Implement the domain-specific processor
- Implement compilers, assembler, simulator, debugger, ...
7
7 Silicon Choices
- ASIC implementation of the processor
  - Fast but not flexible
  - Time-intensive design process
  - Example: Tensilica, HP-STMicroelectronics
- Processor core in ASIC, with instruction set extensions in reconfigurable logic
  - Medium speed but flexible
  - Fast design process
  - Example: Triscend Configurable System-on-Chip
8
8 Triscend Configurable SoC Copyright: Triscend
9
9 Reconfigurable Computing 101
- Higher performance than software, with a higher level of flexibility than hardware, e.g. Field Programmable Gate Arrays (FPGAs)
- Logic blocks: an array of computational elements whose functionality is determined through multiple SRAM configuration bits
- Interconnection: logic blocks are connected using programmable routing resources
- Any custom circuit can be mapped to an FPGA by computing logic functions within logic blocks and using configurable routing to connect the logic blocks together
- Dynamically reconfigurable logic
  - Logic reconfiguration during application execution
  - Temporal partitioning of software reduces logic area
  - Overhead for reconfiguration
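The idea that SRAM configuration bits determine a logic block's function can be illustrated with a small software model. This is only a sketch under assumptions: the `LUT` class, the `configure` helper, and the choice of 3-input lookup tables are hypothetical and do not describe any particular FPGA family.

```python
from itertools import product


class LUT:
    """A k-input lookup table: 2**k stored bits select the Boolean function."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:  # the first input is the most significant address bit
            index = (index << 1) | (1 if bit else 0)
        return self.config_bits[index]


def configure(k, function):
    """Derive the configuration bits by tabulating `function` over all inputs."""
    return LUT(k, [int(function(*args)) for args in product((0, 1), repeat=k)])


# "Routing": the same three signals feed two logic blocks to form a full adder.
sum_block = configure(3, lambda a, b, cin: a ^ b ^ cin)
carry_block = configure(3, lambda a, b, cin: (a & b) | (a & cin) | (b & cin))

for a, b, cin in product((0, 1), repeat=3):
    total = sum_block.evaluate(a, b, cin) + 2 * carry_block.evaluate(a, b, cin)
    print(a, b, cin, "->", total)
```

Reconfiguring the device corresponds, in this model, to replacing `config_bits`; the hardware structure stays the same.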
10
10 Use of Reconfigurable Computing
- Two choices
  - Map both control and datapath to RC
  - Map only datapath to RC
- Granularity of reconfigurable logic
  - Bit
  - Multiple bits
  - ALU
11
11 RC Coupled to I/O System Bus
- Most common form of commercial RC
- Overhead of data transfer between CPU and RC
- Requires large granularity of computation on RC
[Figure labels: CPU, RC, I/O, Memory, Local Bus, PCI Bus]
12
12 RC Coupled to Local Bus
- Pilchard from Chinese University of Hong Kong
- Still requires large granularity of computation
[Figure labels: CPU, RC, I/O, Memory, Local Bus, PCI Bus]
13
13 PICO Architecture Copyright: Bob Rau et al.
14
14 Design Framework Copyright: Bob Rau et al.
15
15 Design Flow Copyright: Bob Rau et al.
16
16 Hardware/Software Co-design
- Well-studied problem. Then what’s new?
- High Level Synthesis (HLS)
  - Time-to-market constraint forces automated generation of the reconfigurable bitstream from a high-level specification or software
  - Automated generation of the interface
- Spatial and temporal partitioning
  - Partitioning among multiple configurable devices
  - Map a function that exceeds the available space of the reconfigurable device using time sharing
  - Requires new compilation techniques
17
17 High Level Synthesis-1
- High-level hardware description language
- Start from a software programming language and add support for
  - Parallelism via threads
  - Message passing
  - Examples: Handel-C, SystemC
- Make current HDLs more abstract
  - Superlog, SystemVerilog
- Still requires the user to find parallelism
18
18 High Level Synthesis-2
- Combine research in two different fields: compilers and design automation
- Traditional HLS techniques target ASIC implementation
  - RC does not have the layout freedom
  - Objective of RC is to minimize execution time
  - Temporal partitioning if insufficient area
- Hardware library of operators or structures commonly used by software programs
19
19 High Level Synthesis-3
- Concentrate on loops
- Leverage parallelizing compiler technology combined with high-level synthesis
  - Parallelize computation
  - Optimize external memory access
- Loop transformations: area versus performance
  - Unroll and jam
  - Loop unrolling
  - Software pipelining
  - Loop-invariant code motion
  - Data layout
- Hardware-specific optimizations
  - Bitwidth reduction
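Of the loop transformations listed above, unroll and jam is perhaps the least self-explanatory, so a small sketch may help. The matrix-vector kernel and the function names `matvec` and `matvec_unroll_jam` are illustrative assumptions, not taken from the slides; the outer loop is unrolled by a factor of two and the two inner-loop copies are fused ("jammed") into one body.

```python
def matvec(a, x):
    """Reference loop nest: y[i] = sum over j of a[i][j] * x[j]."""
    y = [0] * len(a)
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_jam(a, x):
    """Outer loop unrolled by 2, with the two inner loops jammed together."""
    n = len(a)
    y = [0] * n
    i = 0
    while i + 1 < n:
        acc0 = acc1 = 0
        for j in range(len(x)):
            xj = x[j]  # fetched once, shared by two multiply-accumulates
            acc0 += a[i][j] * xj
            acc1 += a[i + 1][j] * xj
        y[i], y[i + 1] = acc0, acc1
        i += 2
    if i < n:  # leftover row when the row count is odd
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


print(matvec_unroll_jam([[1, 2], [3, 4], [5, 6]], [7, 8]))
```

In a hardware mapping, the jammed body corresponds to two multiply-accumulate units working side by side (more area) while each `x[j]` is read from memory only once per pair of rows (fewer external accesses), which is the area-versus-performance trade-off the slide names.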
20
20 RC Coupled to CPU as Coprocessor
- Tight coupling between CPU and RC
- RC can execute ISA extensions
- CPU and RC cannot share register file
[Figure labels: CPU, RC, I/O, Memory, Local Bus, PCI Bus]
21
21 RC Integrated in Processor Datapath
- Tightest coupling between CPU and RC
- RC implements custom functional units for ISA extensions
- CPU and RC share register file
[Figure labels: CPU with integrated RC, I/O, Memory, Local Bus, PCI Bus]
22
22 Custom Functional Unit
- Typically no restriction on the number of inputs and outputs of a CFU
[Figure labels: Register File, Memory, ADD, MUL, LD/ST, CFU1]
23
23 Changing Role of Compiler
- A standard compiler generates code for a fixed ISA and micro-architecture
- A retargetable compiler accepts the ISA and micro-architecture as input and generates code
- Compiler for a domain-specific processor
  - First search for and define the optimal ISA
  - Generate code for the optimal ISA
- Defining the optimal ISA requires hardware knowledge, but is more similar to traditional software compiler analysis than to hardware synthesis
- This process will work for ASIC as well, but
  - No dynamic reconfigurability
  - Different choice of ISA due to speed difference
24
24 Instruction Set Extension
- Two options to identify instruction set extensions
  - Static data flow graph (DFG)
  - Dynamic execution trace
- Example custom functional unit: FU(i1, i2, i3) = (i1 + i2) x i3
[Figure: data flow graph over +, x, -, a, b, c, d, m, n]
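The custom functional unit FU(i1, i2, i3) = (i1 + i2) x i3 from the slide above can be illustrated by a toy rewrite over straight-line code. Everything except the FU formula is an assumption made for the sketch: the three-address tuple format, the opcode names, the example program, and the requirement that every name is assigned only once.

```python
from collections import Counter

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "fu": lambda i1, i2, i3: (i1 + i2) * i3,  # the custom instruction
}


def run(program, inputs):
    """Interpret (dest, opcode, *sources) tuples in order."""
    env = dict(inputs)
    for dest, op, *srcs in program:
        env[dest] = OPS[op](*(env[s] for s in srcs))
    return env


def fuse_add_mul(program, live_out=()):
    """Collapse an add whose only consumer is a mul into one `fu` instruction.

    Assumes single assignment: each destination name is written exactly once.
    """
    uses = Counter(src for _, _, *srcs in program for src in srcs)
    uses.update(live_out)  # values needed after the block count as uses
    single_use_adds = {dest: tuple(srcs) for dest, op, *srcs in program
                       if op == "add" and uses[dest] == 1}
    absorbed, fused = set(), {}
    for dest, op, *srcs in program:
        if op != "mul":
            continue
        for k in (0, 1):
            if srcs[k] in single_use_adds and srcs[k] not in absorbed:
                absorbed.add(srcs[k])
                fused[dest] = (*single_use_adds[srcs[k]], srcs[1 - k])
                break
    rewritten = []
    for dest, op, *srcs in program:
        if dest in absorbed:
            continue  # the add now lives inside the fused instruction
        if dest in fused:
            rewritten.append((dest, "fu", *fused[dest]))
        else:
            rewritten.append((dest, op, *srcs))
    return rewritten


program = [("t", "add", "a", "b"), ("m", "mul", "t", "c"), ("n", "sub", "c", "d")]
extended = fuse_add_mul(program, live_out=("m", "n"))
print(extended)
```

The rewritten program has two instructions instead of three and computes the same `m` and `n`; a real tool would additionally have to check that the fused operation fits the functional unit's timing and port constraints.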
25
25 Static Data Flow Graph
- Identify a special sub-graph, called a MaxMISO, to be collapsed into an instruction set extension
- MaxMISO: a maximal multiple-input single-output sub-graph
[Figure: two sub-graphs built from +, +, <<, >>, and |, one labeled Correct and one labeled Incorrect]
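One way to picture MaxMISO identification is to grow a sub-graph backwards from an output operation, admitting a producer only when all of its consumers are already inside, so that the sub-graph keeps a single output. The sketch below takes that approach; the dictionary encoding of the DFG, the node names, and the example graph (loosely modeled on the +, <<, >>, | operators in the figure) are assumptions, not the slide's own data.

```python
def successors(operands):
    """Invert an {operation: [operand, ...]} map into {value: {consumers}}."""
    succs = {}
    for node, srcs in operands.items():
        for src in srcs:
            succs.setdefault(src, set()).add(node)
    return succs


def max_miso(root, operands):
    """Grow the largest single-output sub-graph whose output is `root`.

    `operands` maps each operation node to the values it reads; values that
    are not keys of `operands` are primary inputs and are never absorbed.
    """
    succs = successors(operands)
    members = {root}
    changed = True
    while changed:  # repeat until no further producer can be absorbed
        changed = False
        for node in list(members):
            for producer in operands[node]:
                if (producer in operands and producer not in members
                        and succs[producer] <= members):
                    members.add(producer)
                    changed = True
    return members


dfg = {
    "add1": ["a", "b"],
    "shl": ["add1", "c"],
    "add2": ["d", "e"],
    "shr": ["add2", "c"],
    "or": ["shl", "shr"],
}
print(sorted(max_miso("or", dfg)))

# If add1 also feeds a second consumer, absorbing it would create a second output.
dfg_shared = dict(dfg, sub=["add1", "f"])
print(sorted(max_miso("or", dfg_shared)))
```

In the second graph `add1` stays outside the MaxMISO because its value escapes to `sub`, which is the situation the single-output restriction rules out.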
26
26 MaxMISO Limitations
- Only multiple-input single-output FUs
- Cannot go beyond control flow boundaries
- Execution frequency not taken into account
- No dynamic reconfigurability
- MaxMISO may not be the optimal choice
  - Sub-graph too big
  - Sub-graph too small
27
27 Dynamic Execution Trace
- Identify and isolate frequently occurring patterns of operations
- Pattern matching algorithm
  - On-the-fly construction of a pattern library
  - Select matches for minimum cover
- Evaluate the most frequently occurring operation patterns in terms of how useful it would be to implement them as custom operations
- Downside: high-complexity algorithm
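As a much-simplified illustration of building a pattern library on the fly, the sketch below scans an execution trace and counts producer-to-consumer opcode pairs. It is not the pattern matching algorithm the slide refers to: restricting patterns to two operations, the trace format, and the example trace are all assumptions, and the minimum-cover selection step is omitted.

```python
from collections import Counter


def count_pair_patterns(trace):
    """Count (producer opcode, consumer opcode) pairs along data dependences.

    `trace` is a sequence of executed operations in (dest, opcode, *sources)
    form, listed in execution order.
    """
    last_writer = {}  # value name -> opcode that most recently produced it
    library = Counter()  # pattern -> number of dynamic occurrences
    for dest, op, *srcs in trace:
        for src in srcs:
            if src in last_writer:
                library[(last_writer[src], op)] += 1
        last_writer[dest] = op
    return library


loop_body = [
    ("t", "mul", "a", "b"),
    ("u", "add", "t", "c"),
    ("v", "ld", "u"),
]
trace = loop_body * 3 + [("w", "add", "v", "d")]

library = count_pair_patterns(trace)
for pattern, count in library.items():
    print(pattern, count)
```

Because the counts come from the dynamic trace, a pattern inside a hot loop outranks one that appears often in the static code but rarely executes, which is the information a static DFG analysis lacks.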
28
28 Pattern Construction & Matching
[Figure: a graph of *, +, and ld operations over a, b, c, d, e, f, shown next to a Pattern Library containing P1, P2, P3, and P4]
Matches: P1 (a, b, c), P2 (c, d, e), P3 (d, f), P4 (a, b, d, e)
29
29 Dynamic Reconfiguration
- Functional units organized around an interconnection network create a programmable datapath
- Synthesize a datapath for each loop
- Small reconfiguration time due to coarse logic blocks
[Figure labels: Embedded Processor, Cache, FPGA, FU, RG, Reconfigurable Interconnection]
30
30 Datapath Merging
- Merge the different loop datapaths into a single reconfigurable datapath
- Reuse hardware blocks and interconnections across the loop datapaths as much as possible
- Datapath merging problem: identify similarities among loop datapaths and produce a merged datapath with the minimum number of hardware blocks and interconnects
31
31 Datapath Merging: Example
[Figure: two datapaths built from +, x, and - over a, b, c, d, m, n, and the merged datapath that results]
32
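32 Datapath Merging Algorithm-1
[Figure: datapath blocks A11, A12, B11, C11 and A21, A22, A23, B21, C21]
Compatibility graph nodes (candidate interconnect mappings):
  A11B11 / A21B21
  A11B11 / A22B21
  A11C11 / A23C21
  B11C11 / B21C21
  A12C11 / A23C21
Other labels in the figure: A11B11, A12C11, A11C11, B11C11, A21B21, B21C21, A23C21, A22B21, C21A22
- Find the maximum clique in the compatibility graph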
33
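33 Datapath Merging Algorithm-2
[Figure: datapath blocks A11, A12, B11, C11 and A21, A22, A23, B21, C21; merged blocks labeled A11 A21, B11 B21, A22, C11 C21, A12 A23]
Compatibility graph nodes (candidate interconnect mappings):
  A11B11 / A21B21
  A11B11 / A22B21
  A11C11 / A23C21
  B11C11 / B21C21
  A12C11 / A23C21
- Find the maximum clique in the compatibility graph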
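The maximum-clique step can be sketched on the five candidate mappings listed on this slide. Two points are assumptions rather than statements from the slides: each label pair is read as "interconnect X-Y of the first datapath mapped onto interconnect X'-Y' of the second", and two mappings are treated as compatible when together they still map blocks one-to-one. The exhaustive search below is adequate only for graphs this small.

```python
# Each candidate maps an interconnect of datapath 1 onto one of datapath 2,
# written as the block-to-block pairs it implies.
MAPPINGS = [
    (("A11", "A21"), ("B11", "B21")),  # A11B11 / A21B21
    (("A11", "A22"), ("B11", "B21")),  # A11B11 / A22B21
    (("A11", "A23"), ("C11", "C21")),  # A11C11 / A23C21
    (("B11", "B21"), ("C11", "C21")),  # B11C11 / B21C21
    (("A12", "A23"), ("C11", "C21")),  # A12C11 / A23C21
]


def compatible(m1, m2):
    """Assumed rule: the combined block mapping must remain one-to-one."""
    forward, backward = {}, {}
    for src, dst in m1 + m2:
        if forward.setdefault(src, dst) != dst:
            return False  # one block mapped to two different blocks
        if backward.setdefault(dst, src) != src:
            return False  # two blocks mapped onto the same block
    return True


def max_clique(nodes, adjacent):
    """Return a largest set of pairwise-adjacent nodes (exhaustive search)."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        for i, node in enumerate(candidates):
            rest = [other for other in candidates[i + 1:] if adjacent(node, other)]
            expand(clique + [node], rest)

    expand([], list(nodes))
    return best


clique = max_clique(range(len(MAPPINGS)),
                    lambda i, j: compatible(MAPPINGS[i], MAPPINGS[j]))
merged = dict(pair for i in clique for pair in MAPPINGS[i])
print(clique)
print(merged)
```

Under these assumptions the search selects three mutually compatible mappings, and the resulting block pairing (A11 with A21, B11 with B21, C11 with C21, A12 with A23) matches the merged labels shown in the figure, leaving A22 unshared.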
34
34 Summary
- High-level synthesis techniques for loops are not yet mature
- Very little research on compilation techniques for dynamic reconfigurability of loops
- Instruction set extension
  - Static DFG-based technique has limitations
  - Dynamic trace-based technique is too slow
- Techniques for dynamic reconfiguration of custom FUs are yet to be developed