Reducing Control Power in CGRAs with Token Flow
Hyunchul Park, Yongjun Park, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science

Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration

CGRA: Attractive Alternative to ASICs
Suitable for running multimedia applications for future embedded systems
  – High throughput, low power consumption, high flexibility
Morphosys: 8x8 array with RISC processor
SiliconHive: hierarchical systolic array
ADRES: 4x4 array with tightly coupled VLIW
[Figure: Morphosys, SiliconHive, ADRES; captions: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]

Control Power Explosion
Large number of configuration signals
  – Distributed interconnect, many resources to control
  – Nearly 1000 bits each cycle
No code compression technique developed for CGRAs
  – Fully decoded instructions are stored in memory
  – 45% of total power
[Figure: single PE; PE instruction]

Code Compression
Huffman encoding
  – High efficiency, but sequential process
Dictionary-based
  – Recurring patterns stored in dictionary
  – Not many patterns found in CGRAs
Instruction-level code compression
  – No-op compression: Itanium, DSPs
  – Only 17% are no-ops in CGRAs

Fine-grain Code Compression
Compress unused fields rather than the whole instruction
  – Opcode, MUX selection, register address
  – 35% of fields contain valid information
Instruction format needs to be stored in the memory
  – Information regarding which fields exist in the memory
  – Significant overhead: 172 bits (20%) for a 4x4 CGRA
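The field-level scheme on this slide can be sketched in a few lines of Python. This is only an illustration: the names `compress` and `decompress`, and the use of `None` to mark an unused field, are assumptions made here and do not come from the slides. It shows how a per-field format mask lets the memory hold only the valid fields.

```python
from typing import List, Optional, Tuple


def compress(fields: List[Optional[int]]) -> Tuple[List[bool], List[int]]:
    """Split an instruction into a format mask and its valid fields.

    `None` marks an unused field (an assumption of this sketch).
    """
    fmt = [f is not None for f in fields]          # 1 bit per field
    packed = [f for f in fields if f is not None]  # only valid fields are stored
    return fmt, packed


def decompress(fmt: List[bool], packed: List[int]) -> List[Optional[int]]:
    """Rebuild the full instruction from the format mask and the stored fields."""
    it = iter(packed)
    return [next(it) if used else None for used in fmt]


fmt, packed = compress([3, None, None, 7])
print(fmt, packed)
```

The format mask is the part that has to be stored alongside the fields, which is the overhead the slide quantifies.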

Dynamic Instruction Format Discovery
Resources need configuration only when data flows through them
Instruction format can be discovered by looking at the data flow
Token network from dataflow machines can be utilized
  – Token is 1 bit information indicating incoming data in next cycle
  – Each PE observes incoming tokens and determines the instruction format
[Figure: FU: dest <- src0 + src1; RF: reg write]

Dynamic Configuration of PEs
Each cycle, tokens are sent to the consuming PEs
  – Consuming resources collect incoming tokens, discover instruction formats, and fetch only necessary instruction fields
Next cycle, resources can execute the scheduled operations
[Figure: dataflow graph, mapping, and configuration over cycles 0 to 4; legend: configured, executed, routing node]
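The cycle-by-cycle token flow can be illustrated with a toy simulation. This is a sketch under simplifying assumptions that are not in the slides: every hop takes exactly one cycle, and the resource names (`rf0`, `fu0`, and so on) and the function `propagate` are hypothetical.

```python
from typing import Dict, List, Sequence


def propagate(consumers: Dict[str, Sequence[str]],
              live_ins: Sequence[str],
              cycles: int) -> List[List[str]]:
    """Return, per cycle, the resources that currently hold a token.

    `consumers` maps each resource to the resources named in its dest fields.
    A resource that holds a token in cycle t is configured in that cycle and
    forwards a token to its consumers for cycle t + 1.
    """
    active = set(live_ins)
    history = [sorted(active)]
    for _ in range(cycles):
        active = {c for r in active for c in consumers.get(r, ())}
        history.append(sorted(active))
    return history


# Two live-in register reads feed an add, whose result is routed on and written back.
graph = {"rf0": ["fu0"], "rf1": ["fu0"], "fu0": ["fu1"], "fu1": ["rf2"]}
print(propagate(graph, ["rf0", "rf1"], 3))
```

Resources that never appear in the history receive no token and therefore fetch no instruction fields.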

Token Generation
Tokens are generated at the beginning of dataflow: live-in nodes in RFs
Each RF read port needs token generation info: 26 read ports in 4x4 CGRA
  – 26 bits for token generation vs. 172 bits for instruction format
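As a quick check of the numbers on this slide, the per-instruction saving can be worked out directly. The constant names are chosen here for illustration; the two values are the ones quoted on the slide.

```python
FORMAT_BITS = 172     # static instruction format for a 4x4 CGRA (from the slide)
TOKEN_GEN_BITS = 26   # token generation info, 26 RF read ports (from the slide)

saved_bits = FORMAT_BITS - TOKEN_GEN_BITS
ratio = TOKEN_GEN_BITS / FORMAT_BITS

print(saved_bits, round(ratio, 3))  # 146 0.151
```

In other words, the token generation bits are roughly 15% of the size of the stored instruction format they replace.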

Token Network
Token network between datapath and decoder
  – No instruction format, but token generation info in the memory
  – Adds 1 cycle between IF and EX stages
Created by cloning the datapath
  – 1 bit interconnect with same topology
  – Each resource translated to a token processing module
  – Encode dest fields, not src fields

Register File Token Module
Write port MUXes are converted to token receivers
  – Determine selection bits
Read ports are converted to token senders
  – Tokens are initially generated here
  – Token generation information stored in a separate memory
[Figure labels: token_gen, token sender]

FU Token Module
Input MUXes are converted to token receivers
Opcode processor
  – Fetch opcode field if necessary
  – Determine token type (data/pred), latency

System Overview
[Figure: datapath, token generation]

Experimental Setup
Target multimedia applications for embedded systems
  – Modulo scheduling for compute-intensive loops in 3D graphics, AAC decoder, AVC decoder (214 loops)
Three different control path designs
  – baseline: fully decoded instructions
  – static: fine-grain code compression with instruction format stored in the memory
  – token: fine-grain code compression with token network

Code Size / Performance
Fine-grain code compression increases code efficiency
Token network further improves code efficiency
Performance degradation
  – Sharing of fields, allowing only 2 dests

Power / Area
SRAM read power is greatly reduced with token network
  – Introducing token network slightly increases power and area
Area overhead can be mitigated with the reduced SRAM area
Hardware overhead for migrating staging predicates into token network is minimal

Staging Predicates Optimization
Modulo scheduled loops
  – Prolog (filling pipeline)
  – Kernel code (steady state)
  – Epilog (draining pipeline)
Only kernel code is stored in memory
  – Staging predicates control prolog/epilog phases
[Figure: overlapped execution of iterations i0, i1, i2 through stages A, B, C with initiation interval II]
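The way staging predicates let a single kernel cover prolog, steady state, and epilog can be sketched as follows. This is the standard kernel-only formulation rather than anything specific to the slides; the function name `staging_predicates` is hypothetical, and each row stands for one pass through the kernel (II cycles).

```python
from typing import List


def staging_predicates(num_stages: int, num_iters: int) -> List[List[bool]]:
    """Return one row of stage-enable bits per pass through the kernel.

    In pass t, stage s works on iteration t - s, so it is enabled only
    while that iteration index is inside [0, num_iters).
    """
    passes = num_iters + num_stages - 1
    return [[0 <= t - s < num_iters for s in range(num_stages)]
            for t in range(passes)]


# Three stages (A, B, C) and three iterations (i0, i1, i2).
for row in staging_predicates(3, 3):
    print("".join(name if on else "." for name, on in zip("ABC", row)))
```

The first rows, where only some stages are enabled, correspond to the prolog; the last rows correspond to the epilog.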

Migrating Staging Predicate
Staging predicate
  – Control information, not data dependent
  – 10% of configurations are used for routing staging predicates
Move staging predicates into control path
  – Increase token by 1 bit: staging predicate
  – Only top nodes are guarded
  – Staging predicate flows along with tokens
Benefits
  – Code size reduction
  – Performance increase
[Figure: data and staging predicate flowing through stages 0 to 3]

Code Size / Performance
Code size reduction by 9%
Migrating staging predicates improves performance by 7%
  – 5% increase over baseline

Power / Area
Power/area of token network increases due to valid bit
Reduced code size decreases SRAM power/area
Overall overhead for migrating staging predicates is minimal

Overall Power
System power measured for a kernel loop in AVC
Introducing token network reduces the overall system power by 25%, while achieving 5% performance gain
[Figure: system power, 226.4 mW and 170.0 mW]
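The 25% figure can be cross-checked against the two power values shown on the slide. The sketch below assumes, since the slide does not label them in this transcript, that 226.4 mW is the baseline and 170.0 mW is the design with the token network; the variable names are chosen here.

```python
baseline_mw = 226.4  # assumed to be the baseline system power
token_mw = 170.0     # assumed to be the system power with the token network

reduction = (baseline_mw - token_mw) / baseline_mw
print(f"{reduction * 100:.1f}%")  # 24.9%
```

Under that assumption the reduction is 24.9%, consistent with the rounded 25% stated on the slide.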

Conclusion
Fine-grain code compression is a good fit for CGRAs
Token network can eliminate the instruction format overhead
  – Dynamic discovery of instruction format
  – Small overhead (< 3%)
  – Migrating staging predicates to token network improves performance
Applicable to other highly distributed architectures

Questions?

Token Sender
Each resource output port is converted into a token sender
  – FU output, routing mux output, register file read ports
Send out tokens only to the consumers specified in dest fields
  – Allowing only two destinations for each output potentially limits performance
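A token sender can be modeled as a function from the dest fields to one token bit per outgoing wire. This is a behavioral sketch, not the hardware design: `send_tokens`, `MAX_DESTS`, and the integer consumer indices are names and encodings assumed here for illustration.

```python
from typing import List, Sequence

MAX_DESTS = 2  # the slide allows at most two destinations per output


def send_tokens(has_token: bool, dests: Sequence[int],
                num_consumers: int) -> List[bool]:
    """Return one token bit per outgoing wire of a resource output port.

    `dests` holds the consumer indices stored in the dest fields.
    """
    if len(dests) > MAX_DESTS:
        raise ValueError("an output may name at most two destinations")
    if any(d < 0 or d >= num_consumers for d in dests):
        raise ValueError("destination index out of range")
    # A wire carries a token only if this output produces data this cycle
    # and the wire's consumer is listed in the dest fields.
    return [has_token and i in dests for i in range(num_consumers)]


print(send_tokens(True, [1, 3], 4))  # [False, True, False, True]
```

A value with more than two consumers cannot be sent directly in this model, which is the restriction the slide identifies as a possible performance limit.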

Token Receiver
Input MUXes are converted to token receivers
  – Dest fields are stored in the memory, not src fields
  – MUX selection bits are determined with incoming token position
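Deriving MUX selection bits from the position of the incoming token can be sketched as below. The function name `mux_select` and the choice to treat two simultaneous tokens as an error are assumptions of this sketch; the slides only state that the selection is determined by the incoming token position.

```python
from typing import Optional, Sequence


def mux_select(incoming: Sequence[bool]) -> Optional[int]:
    """Return the MUX selection for an input port, or None if it is idle.

    `incoming` has one token bit per MUX input. The index of the input
    that received a token is the selection, so no src field is needed.
    """
    positions = [i for i, token in enumerate(incoming) if token]
    if not positions:
        return None  # no data arrives next cycle; nothing to configure
    if len(positions) > 1:
        # Assumed here to indicate an invalid schedule.
        raise ValueError("more than one token arrived at the same MUX")
    return positions[0]


print(mux_select([False, False, True, False]))  # 2
```

Because the receiver recovers the source from the token itself, only dest fields need to be stored in the configuration memory.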

Who Generates Tokens?
Tokens are generated at the start of dataflow
  – Live-ins
  – Terminate when they get into a register file
Tokens terminated in register files can be regenerated
Read ports of register files generate tokens
  – Token generation information at RF read ports is stored separately
  – 26 read ports in 4x4 CGRA
[Figure: dataflow chains of Live, Add, and RF nodes]

Reducing Decoder Complexity
Partitioning the configuration memory and decoder
  – Trade-off between number of memories and decoder complexity
Design space exploration for memory partitioning
  – Which fields are stored in the same memory?
  – Sharing of field entries in the memory: under-utilized fields
[Figure: configuration memories (MEM) and decoders, token network]

Memory Partitioning
Bundle fields with the same type: field width uniformity
Design space exploration result for a 4x4 CGRA
  – sharing degree = # total entries / # total fields
Reduces decoder complexity by 33% over naïve partitioning
  – Sharing incurs less than 1% performance degradation

type      # fields  # memories  # entries  # total entries  sharing degree
opcode    16        2           8                           1.0
dest      96        8           8          64               0.75
const     16        2           6          12               0.75
reg addr  48        4           6          24               0.5
