Coding Methods in Embedded Computing
Wayne Wolf, Dept. of Electrical Engineering, Princeton University

© 2004 Embedded Systems Group

Outline
- Lv/Henkel/Lekatsas/Wolf: adaptive dictionary method for bus encoding
- Lin/Xie/Wolf: dictionary coding for code compression

Adaptive bus encoding
Goal: reduce bus energy
- A significant part of total energy is related to I/O
- Inter-wire capacitances have a significant impact
Approach: exploit data properties
- Past success on address buses; few approaches for data buses
Results:
- 28% average power reduction
- One additional line for a 32-line bus
- No additional cycles
- Applies to both address and data buses

Related work
- Stan/Burleson [TVLSI95]: bus-invert encoding
- Panda/Dutt [TVLSI99]: reduce address bus switching by exploiting memory access patterns
- Benini et al. [GLS-VLSI97]: T0 encoding
- Musoll et al. [TVLSI98]: working-zone encoding
- Sotiriadis/Chandrakasan [ICCAD00]: transition pattern coding
- Kim et al. [DAC00]: coupling-sensitive scheme
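For reference, the bus-invert scheme compared against later can be sketched in a few lines. This Python sketch is illustrative only (the function name and the all-zero initial bus state are assumptions, not from the slides): a word is inverted, and an extra invert line raised, whenever sending it directly would toggle more than half the bus lines.

```python
def bus_invert(words, width=8):
    """Bus-invert coding sketch: if sending the next word would toggle
    more than half the bus lines, send its complement and raise the
    extra invert line instead."""
    mask = (1 << width) - 1
    prev = 0                      # assume the bus starts at all zeros
    encoded = []
    for w in words:
        flips = bin((prev ^ w) & mask).count("1")
        if flips > width // 2:    # inverting saves transitions
            w = ~w & mask
            encoded.append((w, 1))
        else:
            encoded.append((w, 0))
        prev = w                  # the bus carries the (possibly inverted) value
    return encoded
```

The receiver re-inverts any word whose invert bit is set, so at most width/2 lines ever toggle per transfer.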

Bus model (I)
(Circuit figure: a general two-line bus.)

Bus model (II)
Simplify the bus model by quantizing energy values to 0, 1, 2.
(Circuit figure: line capacitances C_L, inter-wire capacitances C_I, and driver resistances R_i.)

Bus model (III)
Bus energy for multiple-line buses (figure).

Source properties on data buses
Correlation of transition-signaling code on adjacent lines: D(x) = n_x / N (transitions per total transactions).
(Plot: D(x) vs. bit number.)
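The activity measure D(x) = n_x / N can be computed directly from a bus trace. This illustrative sketch (names are hypothetical) counts, for each bit line x, the fraction of bus transactions on which that line toggles:

```python
def transition_activity(words, width=32):
    """Per-line transition activity D(x) = n_x / N: for each bit line x,
    the fraction of transactions on which it toggles."""
    n = [0] * width
    prev = 0                      # assume the bus starts at all zeros
    for w in words:
        diff = prev ^ w           # bits that toggle on this transaction
        for x in range(width):
            if (diff >> x) & 1:
                n[x] += 1
        prev = w
    total = len(words)
    return [nx / total for nx in n]
```

Plotting these values against bit number reproduces the kind of per-line activity profile shown on the slide.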

Source properties (II)
Adjacent bit lines in a word are correlated.

Source properties (III)
The 10 most frequently occurring patterns (figure).

Energy savings from different compression schemes: compare transition and inter-wire energy savings (figure).

Dictionary techniques
Look up symbol strings in a dictionary; replace them with shorter codes.
Types of dictionaries:
- Static dictionary
- Adaptive dictionary
(Example entries: 'is', 'the', 'are', 'do'.)

Approach
Use a dictionary scheme to take advantage of frequent patterns.
Each word is divided into a key, an index, and a bypassed part (figure).
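The idea can be sketched in software. The class below is a simplified illustration, not the actual ADES hardware: a small dictionary of upper parts is kept in sync on both ends of the bus, a hit sends a short index plus the bypassed bits, and a miss sends the raw word while both sides update their tables. The 28/4-bit split and 4-entry table are assumptions chosen for the example.

```python
DICT_SIZE = 4    # assumed: tiny dictionary indexed by 2 bits

class AdaptiveDictCodec:
    """Illustrative adaptive-dictionary bus codec: the upper 28 bits of
    each 32-bit word are looked up in a small table mirrored by encoder
    and decoder; the low 4 bits are bypassed uncompressed."""
    def __init__(self):
        self.table = []           # most recent upper parts

    def encode(self, word):
        upper, low = word >> 4, word & 0xF
        if upper in self.table:               # hit: short index + bypassed bits
            return ("hit", self.table.index(upper), low)
        self._insert(upper)                   # miss: send the raw word
        return ("miss", word)

    def decode(self, msg):
        if msg[0] == "hit":
            _, idx, low = msg
            return (self.table[idx] << 4) | low
        _, word = msg
        self._insert(word >> 4)               # decoder updates its copy too
        return word

    def _insert(self, upper):
        self.table.insert(0, upper)
        del self.table[DICT_SIZE:]            # evict the oldest entry
```

Because both tables see the same miss traffic, they stay identical without any extra synchronization messages.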

Adaptive Dictionary Encoding Scheme (ADES)

Encoder miss
(Example figure: the word is split into an upper part, an index part, and a non-compressed part; the upper part 0xF1234FF is not found in the dictionary, so the word is sent uncompressed and the table is updated.)

Decoder hit
(Example figure: the received index selects the table entry for upper part 0x1234FFE, and the decoder reads out the reconstructed word 0x1234FFEB.)

Decoder miss
(Example figure: the decoder receives the raw word 0xF1234FF0, writes its upper part 0xF1234FF into the table, and outputs the word.)

ADES architecture (block diagram).

Area, delay, energy
- Area: about 750 gates
- Energy: primarily consumed by the relatively small memory
- Latency: encoding/decoding finishes in one cycle

Results: experimental setup
- SimpleScalar simulator
- 32-bit data bus
- Various real-world applications from SPEC95 and MediaBench

Application   Description
Adpcm-enc     ADPCM encoder for voice
Adpcm-dec     ADPCM decoder for voice
Compress      File compression program from UNIX
Gcc           GNU C compiler
Go            Go game program from SPEC95
Ijpeg         JPEG encoder/decoder program
Li            Lisp interpreter
M88ksim       Motorola 88000 chip simulator
Perl          Perl language interpreter

Results: detail (figure).

Results: comparison

Scheme        Avg. energy/access  Energy reduced  Additional lines  Gates (approx.)  Delay
Raw           1.94e-11 J          0%              N/A               N/A              N/A
BI            1.90e-11 J          2.5%            4                 100              Low
WZE           1.60e-11 J          17.8%           4                 1800             High
TPC           3.28e-11 J          -68.9%          12                N/A              Low
ADES with BI  1.38e-11 J          28.9%           2                 750              Low

Results: graphical comparison of energy savings (figure).

Summary: adaptive bus encoding
- Upcoming technologies induce inter-wire capacitances on the order of the intrinsic capacitances
- Ordinary methods (e.g., Hamming-distance minimization) cannot capture those effects
- ADES exploits information redundancy on data buses
- Average 28% energy savings on the data bus
- Extendable to address buses
- Low cost

Code compression
- Memory size is critical for embedded systems
- Program size grows with application complexity
- Code compression is one way to reduce code size
- Code size grows when RISC or VLIW instruction sets are used
- Improved VLIW code compression is needed (Xie, 2002)
(Figure: code size of an MPEG-2 encoder.)

Requirements on code compression
- Random access: start decompression at block boundaries; synchronize the model and the arithmetic coder
- Byte alignment: faster decoding; easier and more compact indexing
- Indexing: line address table (LAT)
- Patching branch offsets (only for code compression)
(Figure: blocks b1..b4 mapped through a LAT with entries l1..lk to their compressed locations.)

Previous work
- Wolfe and Chanin (1992)
- IBM CodePack (1998)
- Larin and Conte (1999): Huffman coding
- Xie et al.: F2VCC and V2FCC
(Figure: PowerPC 40x embedded processor with cache, processor local bus, decompression core with decoder table, and external memory.)

Our approach
Problem definition:
- Propose code compression schemes to reduce code size on VLIW embedded systems
- Target: Texas Instruments' TMS320C6x VLIW DSP
Our contributions:
- Branch blocks: branch targets are fixed once the code is compiled (average: 80.1 blocks, 454 bytes)
- LZW-based code compression schemes
- Selective code compression schemes

Compression/decompression (figure).

Decompression architecture
Works for both pre-cache and post-cache placement (figure).

LZW data compression
- Welch (1984) modified Ziv-Lempel (1978)
- Generates the coding table on the fly
- Searches for the longest phrase already in the table
- Outputs the index of that phrase
- Adds the phrase plus the next element as a new table entry
- Decompression lags compression by one codeword
(Figure: compression and decompression engines sharing a phrase table; input 'a a b ab aba aa'.)

Example
Input: a a b ab aba aa

Index  Phrase  Derivation
0      a       initial
1      b       initial
2      aa      0 + a
3      ab      0 + b
4      ba      1 + a
5      aba     3 + a
6      abaa    5 + a
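The table above can be reproduced with a short LZW sketch (helper names are hypothetical; the initial alphabet is {a, b} as in the example). The decoder illustrates the one-codeword lag: a new entry's final symbol becomes known only when the next codeword arrives.

```python
def lzw_encode(data, alphabet="ab"):
    """LZW encoder: emit the index of the longest phrase already in the
    table, then add that phrase extended by one symbol as a new entry."""
    table = {s: i for i, s in enumerate(alphabet)}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in table:
            phrase += ch                      # keep extending the match
        else:
            out.append(table[phrase])         # emit longest known phrase
            table[phrase + ch] = len(table)   # grow the table on the fly
            phrase = ch
    out.append(table[phrase])
    return out

def lzw_decode(codes, alphabet="ab"):
    """LZW decoder: rebuilds the same table but lags compression by one
    codeword, completing each pending entry when the next code arrives."""
    table = {i: s for i, s in enumerate(alphabet)}
    prev = table[codes[0]]
    out = [prev]
    for c in codes[1:]:
        # c == len(table): the classic case of a phrase used before it is
        # fully known; it must start with prev and repeat prev's first symbol
        cur = table[c] if c in table else prev + prev[0]
        table[len(table)] = prev + cur[0]     # complete the pending entry
        out.append(cur)
        prev = cur
    return "".join(out)
```

Running the encoder on the slide's input "aabababaaa" produces exactly the phrases and derivations listed in the table.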

LZW-based code compression
- Uses bytes (0x00-0xFF) as the basic elements
- Variable-to-fixed code compression
Longer codewords mean:
- Exponentially larger tables
- More decompression overhead
- Useless when the block is too small (more bits to encode the same phrases)
- Compression ratios: 83%, 83%, 84%, 87% for 9- to 12-bit LZW
Wider decoding tables mean:
- Linearly larger tables
- Wider decoding bandwidth
- Less than 1% compression-ratio difference for 8-20 bytes

Compression ratio vs. codeword size for two examples: a small and a large block (figure).

Compression ratio vs. codeword size on the benchmark set (figure).

Selective code compression
Motivation:
- Branch blocks vary in size
- There is no benefit to a longer codeword if the block cannot fill up the coding table
- Only 12.8% of branch blocks can fill up a 9-bit LZW table
- Less than 1% of branch blocks can fill up a 12-bit LZW table
Selective code compression:
- Apply different compression methods to different branch blocks
- Block size, instruction frequency, etc. are collected during profiling
- The profile is used to determine the compression method
(Flow: source program -> branch blocks -> profiling -> method selection -> compression -> compressed code.)

Selective compression (cont'd)
Minimum table-usage selective compression (MTUSC):
- Count the phrases generated during compression
- Select the smallest table into which all the phrases fit
- Average compression ratio: 79.2%
Minimum code-size selective compression (MCSSC):
- Some compressed blocks use more bytes than the original data
- Compress each block with different codeword lengths
- Select the smallest compressed or uncompressed block
- Average compression ratio: 76.8%
Dynamic LZW:
- Codeword length grows as compression proceeds
- 75.8% and 75.2% for MTUSC and MCSSC

Experiments
- Benchmarks collected from Texas Instruments and MediaBench
Compression ratio:
- Longer codewords work better on large benchmarks
- Dynamic MCSSC is always the best

Compression ratio vs. algorithm (figure).

Average throughput: 1.72 bytes for 12-bit LZW and 1.82 bytes for dynamic MCSSC.

Parallel decompression
- Execution time: 0.51x, 0.27x, 0.14x
- Throughput: 3.31, 6.37, … bytes
Hardware features:
- 2-30 KB decoding table
- < 4500 µm² using a TSMC 0.25 µm model
- 5508 cycles to decompress the 9344-byte ADPCM decoder
- 90K cycles to decompress the 182-KB MPEG-2 encoder
(Figure: two decoders DC1 and DC2 decompressing codewords 277, 295, 300, 301 in parallel.)

Comparison with previous work

Work          Processor  Method       Compr. ratio   Hardware               Bandwidth                            Decode
Wolfe/Chanin  MIPS       Huffman      73%            < 1 mm²                1 byte                               serial
CodePack      PowerPC    CodePack     60%            < 1 mm²                1 byte                               serial
Lekatsas      MIPS       SAMC         57%            4K table               N/A                                  serial
Xie           TMS320     F2V / V2F    65% / 70%-82%  6-48K / 2-30K table    4.9 bits avg, 13 max / 89 bits max   IID is parallel
Us            C6x        LZW / MCSSC  83%-87% / 75%  < 0.05 mm², 30K table  1.8 bytes avg, 13 bytes max          parallel

Code compression summary
- We proposed code compression schemes that use branch blocks as the compression unit
- Compression ratios are around 83% and 75%, respectively
- Low power is achieved through the smaller memory required
- Compared to previous work, our schemes have less decompression overhead and larger decompression bandwidth with comparable compression ratio
- Parallel decompression can be applied for faster decompression, which suits VLIW architectures
Future work:
- Compiler techniques could generate source programs more suitable for code compression
- Find other schemes that can take advantage of branch blocks