Optimizing Data Compression Algorithms for the Tensilica Embedded Processor
Tim Chao, Luis Robles, Rebecca Schultz

Overview: Goals
- Explore possible optimizations of data compression algorithms for embedded architectures.
- Theorize the ideal characteristics of an embedded architecture specialized for compression/decompression.

Optimizing Zlib
- Based on the Lempel-Ziv (LZ77) algorithm
- Used the included minigzip app as our performance benchmark
- Flat profile: the most commonly executed line accounted for ~9% of total cycles
- Decided to see if several TIE optimizations would produce a significant speedup…
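One of the hot spots, the UPDATE_HASH step that appears in the results below, is the rolling-hash update used to index the dictionary. The sketch below illustrates the shape of that operation in plain C++; the names and constants are illustrative stand-ins, not Zlib's exact code.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sizing constants (assumptions, not Zlib's exact values).
constexpr unsigned HASH_BITS  = 15;
constexpr unsigned HASH_MASK  = (1u << HASH_BITS) - 1;
constexpr unsigned HASH_SHIFT = 5;

// One rolling-hash step: a shift, an xor, and a mask per input byte.
inline unsigned update_hash(unsigned h, uint8_t c) {
    return ((h << HASH_SHIFT) ^ c) & HASH_MASK;
}

// Record the current position in the hash chains so a later longest-match
// search can find earlier occurrences of the same short prefix.
inline void insert_string(std::vector<uint32_t>& head, std::vector<uint32_t>& prev,
                          unsigned& h, const uint8_t* window,
                          uint32_t pos, uint32_t window_mask) {
    h = update_hash(h, window[pos + 2]);   // hash covers a 3-byte minimum match
    prev[pos & window_mask] = head[h];     // link to the previous occurrence
    head[h] = pos;                         // newest occurrence becomes the head
}
```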

Results: Speedup
Per-instruction speedups: UPDATE_HASH 48.74%, ADLER (value missing), SEND_BITS 5.20%
[Table: original cycles, optimized cycles, and % improvement for Small Txt, Large Txt, TIFF Img, and PPM Img]

Obstacles for TIE Optimization
- Most highly executed instructions were branches on values in memory: high latency but not computationally intensive (see the sketch below)
- Accesses to memory were random, so prefetching from memory was not an option
- The Zlib implementation was already highly optimized
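To make the obstacle concrete, the sketch below shows the kind of load-then-branch pattern that dominated: walking a hash chain and branching on bytes just loaded from effectively random window positions. This is a simplified illustration in plain C++ (names and structure are ours), not Zlib's actual code.

```cpp
#include <cstdint>

// Walk a hash chain looking for a candidate whose first two bytes match.
uint32_t find_candidate(const uint8_t* window, const uint32_t* prev,
                        uint32_t window_mask, uint32_t cur, uint32_t chain_len,
                        uint8_t b0, uint8_t b1) {
    uint32_t match = prev[cur & window_mask];
    while (chain_len-- != 0 && match != 0) {
        // Dependent loads followed by a branch: the load latency, not the
        // comparison, dominates, so a fused custom instruction gains little.
        if (window[match] == b0 && window[match + 1] == b1)
            return match;                     // promising candidate found
        match = prev[match & window_mask];    // next, essentially random, position
    }
    return 0;                                 // no candidate in this chain
}
```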

Optimizing LZO
- Also based on the Lempel-Ziv algorithm, but differs from Zlib: favors speed over compression ratio
- Profile was less flat, so it was likely to provide better performance gains than Zlib…
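D_INDEX, the biggest single win in the results that follow, computes the dictionary slot for the next few input bytes. The sketch below shows one plausible form of such an index, a multiplicative hash; the constant, the dictionary size, and the names are illustrative assumptions, not LZO's exact macro.

```cpp
#include <cstdint>

constexpr unsigned DICT_BITS = 14;                  // assumed dictionary size: 2^14 slots
constexpr unsigned DICT_MASK = (1u << DICT_BITS) - 1;

// Map the next three input bytes to a dictionary slot with a multiplicative hash;
// the high bits of the product select the slot.
inline uint32_t d_index(const uint8_t* p) {
    uint32_t v = (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
    return ((v * 0x9E3779B1u) >> (32 - DICT_BITS)) & DICT_MASK;
}
```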

Results: Speedup
Per-instruction speedups: D_INDEX 33.33%, ADLER (value missing)
[Table: original cycles, optimized cycles, and improvement (percent) for Small Txt, Large Txt, TIFF Img, and PPM Img]

Optimizing the Cache Size
- Motivation: memory intensive, but lacked a memory access pattern suitable for prefetching
- Results: performance improved linearly as the cache size doubled, with a greater improvement once the window (Zlib's default is 32 KB) could fit entirely in cache

Performance as Cache Size Increases

Other Zlib Architectures
- Our ideal architecture: a low-access-latency buffer to hold the window; process 2 or more windows in parallel, since the algorithm is independent across window boundaries
- IBM LZ1 for storage systems: window stored in a CAM, with a large comparator used to rapidly find the longest matches in parallel; arithmetic coding used instead of Huffman coding

Reimplementing Longest Match
- Attempted to fully optimize a subset of the Zlib architecture
- Reimplemented the LongestMatch() routine in C++
- Attempted to vectorize the matching loop (a sketch follows)
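A minimal sketch of the longest-match shape, assuming a deflate-style hash chain and sliding window: the outer loop walks candidate positions, and the inner byte-comparison loop is the part targeted for vectorization. Names and structure are ours, not Zlib's exact routine.

```cpp
#include <cstdint>

struct MatchState {
    const uint8_t*  window;      // sliding window of already-seen input
    const uint32_t* prev;        // hash-chain links
    uint32_t window_mask;
    uint32_t max_match;          // e.g. 258 bytes in deflate
    uint32_t max_chain;          // how many candidates to try
};

// Return the length of the best match for position `cur`; its position goes in best_pos.
uint32_t longest_match(const MatchState& s, uint32_t cur, uint32_t& best_pos) {
    uint32_t best_len = 0;
    uint32_t cand = s.prev[cur & s.window_mask];
    for (uint32_t chain = 0; chain < s.max_chain && cand != 0; ++chain) {
        // Inner loop: compare bytes until they differ.  This is the loop we
        // attempted to vectorize with the Vectra extensions.
        uint32_t len = 0;
        while (len < s.max_match && s.window[cand + len] == s.window[cur + len])
            ++len;
        if (len > best_len) { best_len = len; best_pos = cand; }
        cand = s.prev[cand & s.window_mask];     // next candidate in the chain
    }
    return best_len;
}
```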

Vectra Extensions for Zlib
- Fetch 8 bytes into a vector register: effectively implements local prefetching
- Test for equality in parallel: reduces the latency overhead of a load followed by a branch
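The Vectra intrinsics themselves are not reproduced here; the portable C++ sketch below shows the same idea under that assumption: fetch 8 bytes from each side with one wide load and compare them in a single operation, dropping back to a byte-by-byte scan only around the first mismatch.

```cpp
#include <cstdint>
#include <cstring>

// Count how many leading bytes of a and b match, comparing 8 bytes at a time.
// One wide load per side replaces eight scalar load+compare+branch sequences.
uint32_t match_length8(const uint8_t* a, const uint8_t* b, uint32_t max_len) {
    uint32_t len = 0;
    while (len + 8 <= max_len) {
        uint64_t wa, wb;
        std::memcpy(&wa, a + len, 8);          // "fetch 8 bytes into a register"
        std::memcpy(&wb, b + len, 8);
        if (wa != wb) break;                   // mismatch somewhere in this block
        len += 8;                              // all 8 bytes matched in parallel
    }
    while (len < max_len && a[len] == b[len])  // resolve the final partial block
        ++len;
    return len;
}
```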

Conclusions
- Zlib was already fairly optimized, as demonstrated by its flat profile
- TIE extensions are best suited for computationally intensive algorithms; memory access latency and control-flow instructions were the bottleneck in Zlib
- Zlib performance did not scale well with increased cache size
- Overall, our TIE extensions did not provide enough improvement to justify their cost

Questions? Answers?