Nadathur R Satish and Pierre-Yves Droz EECS Department, University of California Berkeley.

Outline
I. Introduction
II. Coding/decoding scheme
III. Decoder
IV. Encoder
V. Experimental results
VI. Conclusion

Traditional ASIP design flow vs. Tipi design flow

Traditional ASIP design flow:
- Design the micro-architecture: data path, control path, ISA, assembler, simulator.
- Programming: write assembly in terms of the ISA (manual scheduling and register allocation).

Tipi design flow:
- Design the micro-architecture (data path); from it, generate the HDL, the control path, and a horizontal microcode description, plus a horizontal microcode code generator (automatic scheduling and register allocation) and a cycle-accurate simulator.
- Programming: write a computational DAG intermediate representation and use the generated HM code generator to get the trace.

Where does this fit in? How does it help?

(figure: HLL → Compiler → Traces → Encoder → Program Store → Decoder → Horizontal Microcode → Control Buffer → Data Path)

- The encoder turns the object microcode into a compressed form stored in memory; the decoder turns it back into the object microcode.
- Goals: reduce memory size and reduce memory bandwidth.
- We can use any scheme we want, as long as we get back the correct microcode.

Compression/decompression scheme

(figure: two trace caches with their entry sequences and microcode bit patterns, driven by sequence managers)

- We use trace caches, which can be thought of as an L0 cache.
- The trace caches are filled by the decoder, which is itself a processor.
- The encoder is the compiler that generates instructions for this processor.
- The set of instructions that the encoder generates is what is stored in memory.
- The two main instructions are WRITE and SEQUENCE:
  - WRITE fills an entry of a cache with a given value.
  - SEQUENCE gives the order in which cache entries must be accessed to produce the correct microcode output.

Compression/decompression scheme: an example with a single cache

The encoder emits:
WRITE line0, WRITE line1, WRITE line2, SEQUENCE line0, line1, line2, line0
Note: no WRITE is required for the last microcode line, because line0 is already in the cache.

How do we get compression?
- Trace cache hits: no need to WRITE.
- Lots of don't-cares in the microcode, i.e. low entropy.
- Other methods: the COPY instruction (delta coding).

Instruction set

- Variable instruction size, 8 opcodes: NOP, WRITE, SEQUENCE, START, JUMP, SEQLENGTH, STOP, COPY.
- Operand formats (from the encoding figure): WRITE (010) takes a cache, an index, and data; SEQUENCE (100) takes a cache and a sequence; JUMP (101) takes an address offset and a length; COPY (111) takes a cache, a destination, a source, and the bit changes.

Architecture

(block diagram of the decoder)

Decoder pipeline

(timing diagram: Fetch and Decode stages over successive cycles for two instruction streams)

- Instructions from two streams are fetched and decoded in alternating cycles.
- Because instructions have variable size, the next fetch cannot start until Decode has determined the current instruction's size; when the size is not ready, the pipeline stalls.

Unpacking example (issue width = 3)

(figure: packed memory layout with initial JUMPs; three fetch units, selected by cycle parity, unpack the instruction streams)

Encoder

Stage 1: the input assembly file, a parameter file, an architecture description in XML, and the original microcode feed a cache simulation and a sequencer, which produce a linear assembly file and the mapped microcode.

Stage 2: a packer turns the linear assembly file into a packed assembly file.
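The packing step can be sketched as grouping the linear instruction stream into memory words of `issue_width` instructions, padding the last word with NOPs. This is an illustrative guess at what the packer does; the function and padding policy are assumptions, and the real packer also has to handle variable instruction sizes and JUMP placement.

```python
# Hypothetical packer sketch (encoder stage 2): group a linear stream
# into fixed-width memory words, NOP-padding the final partial word.
def pack(linear, issue_width=3):
    words = []
    for i in range(0, len(linear), issue_width):
        word = linear[i:i + issue_width]
        word += ["NOP"] * (issue_width - len(word))   # pad a partial word
        words.append(word)
    return words

print(pack(["W0", "W1", "SEQ", "JMP"], issue_width=3))
# → [['W0', 'W1', 'SEQ'], ['JMP', 'NOP', 'NOP']]
```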

Original microcode

(figure: trace cache contents at a particular point; the entries are mostly don't-care bits)

- The original microcode has a lot of don't-cares.
- They can be exploited by the trace cache to avoid WRITE instructions.

Mapped microcode

- The assembly produced is much smaller if we use this mapping. Why? Because we avoid costly WRITE instructions.
- What happens on a cache miss? We replace all don't-cares by 0's.
- Motivation: the microcode has only a few 1's, so we expect the hit rate in the trace cache to increase.
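The don't-care handling above can be sketched in two small helpers: a line matches a cached entry wherever the non-'x' bits agree (so a hit needs no WRITE), and on a miss the don't-cares are materialized as 0's before the entry is written. The function names and the 'x' notation for don't-cares are assumptions for illustration.

```python
# Sketch of don't-care-aware matching and the replace-x-by-0 policy.
def matches(line, entry):
    """True if two bit-strings agree on every position where neither is 'x'."""
    return all(a == b or a == "x" or b == "x" for a, b in zip(line, entry))

def canonicalize(line):
    """On a cache miss, replace don't-cares by 0's before writing the entry."""
    return line.replace("x", "0")

print(matches("1x0x", "1000"))        # → True  (hit: no WRITE needed)
print(matches("1x0x", "0000"))        # → False (bit 0 conflicts: miss)
print(canonicalize("1x0x"))           # → 1000
```

Since the microcode has few 1's, a canonicalized entry (don't-cares as 0's) is compatible with many future lines, which is exactly why the hit rate is expected to rise.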

Simulation environment

Tool chain: C code is compiled by a GCC cross-compiler for DLX into assembly; the Tipi microcode generator produces the microcode; the encoder (a Java application) produces the packed file; the decoder (a C++ low-level cycle-accurate simulation) reconstructs the microcode; a statistics collector produces the report file.

Three test architectures:
- RSA, a hardware RSA coder/decoder.
- CC, a convolution coder.
- DLX, a DLX ISA processor.

Metrics:
- Compression ratio.
- Number of stalls in the main architecture introduced by the decoder.

References:
- Dictionary-based compression methods (Huffman).
- A hand-made DLX decoder (SimpleScalar).

Experimental results

(plots: compression ratio vs. cache size, compression ratio vs. number of caches, compression ratio vs. sequence length, and number of stalls vs. sequence length, for the benchmarks des_branch, des_unrolled, fft, gsm_decode, gsm_encode, idct, and median)

Comparison with other schemes

Architecture | Bits in one microcode line | Caches    | Sequence length | Word size | Issue width
RSA          | 9                          | 1 cache   |                 |           |
CC           | 331                        | 4 caches  |                 |           |
DLX          | 106                        | 1 cache   |                 |           |

Contribution of instructions to the total size

(chart)

Benefit from the COPY instruction

Benchmark    | COPY   | NO COPY | Improvement
idct         | 19.85% | 41.17%  | -51.79%
des_branch   |  8.70% | 11.86%  | -26.64%
fft          |  8.38% | 12.29%  | -31.81%
gsm_decode   |  8.80% | 21.39%  | -58.86%
gsm_encode   |  8.24% | 14.37%  | -42.66%
des_unrolled |  6.47% |  7.12%  |  -9.13%
Average      | 10.07% | 18.03%  | -36.82%

Conclusion

- Our approach achieves high compression ratios and is a valid alternative to a hand-crafted ISA.
- It is easily scalable: a wide range of parameters can be explored by the architecture designer.

Future work
- A better branching mechanism could be created by making the sequence managers smarter; it would allow prefetching and would avoid any performance loss for high-granularity branches.
- Since the SEQUENCE instruction represents the biggest part of the compressed file after the introduction of COPY, different ways of compressing this instruction should be explored.
- The parameters could be generated automatically by a program that carefully analyzes the main architecture.

Questions?

Tipi tool flow

(figure: application and architect views — HLL IDE, compiler, operation extractor, actor library and editor, µArch RTL extractor, simulator; object code goes to memory)

Problem: the microcode memory size and bandwidth requirements are too high!
Solution: use a compression/decompression scheme.

Packing example (issue width of 3)

(figure: a linear instruction stream with initial JUMPs packed into memory words)

Percentages in number of instructions

(chart)

Stalls as a function of the issue width

(chart)

Influence of the size of the microcode

(chart)

Compression/decompression platform-based design

- Tipi alone exports execution cycles for applications (RSA, DES, GSM, …) on the target architectures (RSA, CC, DLX).
- Tipi + compression additionally exports memory size and bandwidth along with execution cycles.