Mapping DSP algorithms to a general-purpose out-of-order processor
ECE 734
Ilhyun Kim, Donghyun Baik
Outline
- Introduction
- Out-of-order execution overview
- Dependence graph changes when mapping to a GPP
- To-do list
- Expected results
Introduction
- Why DSP applications are implemented on a GPP
  - Lower development cost: commodity parts, lower maintenance cost, faster turnaround
  - GPPs now meet the performance requirements; in the past, only DSP chips could
- Problems with algorithm transformations
  - Transformations target faster operation or more efficient hardware, but on a GPP software has no control over the hardware configuration
  - Some transformations are effective while others are not
- Problems with extracting parallelism
  - Duplicated effort at each layer (source code, compiler, processor)
  - The machine searches for independent operations only within a narrow scope
- What are efficient ways to map an algorithm to a GPP?
  - Understand how a GPP executes instructions
  - How can software improve performance?
Out-of-order execution overview
- Dynamic parallelism extraction: instruction reordering
- The processor dynamically searches for independent operations within a limited scope (the instruction window)
- It tries to keep all available functional units busy

Example loop (from the figure):
  for (i = 1..m)
    for (j = 1..n)
      c(i,j) = c(i,j-1) + c(i-1,j)
Understanding DG change for mapping to GPP
[Figure: the loop for (i = 1..m) for (j = 1..n) c(i,j) = c(i,j-1) + c(i-1,j) is compiled into instructions that flow through the instruction window; the instructions fall into three classes: computation, memory access, and control-related]
To-do list
- Infrastructure
  - Build a perfect-machine-model simulator that tracks only the computations in the algorithm, measuring the ideal execution time of a compiled binary assuming perfect parallelism
  - Build a profiling tool that locates an instruction of interest among the instructions in the binary
- Characterization on various machine configurations
  - The effect of single assignment
  - The effect of unfolding
  - The effect of SIMD parallelism
- Optimization techniques for the Alpha architecture, based on the characterization data
- Optimize an existing DSP application: the MPEG-2 decoder

Machine configuration: 20-instruction integer window, 15-instruction FP window; 2 integer units, 2 address-generation/memory units, 2 FP units
Expected Results
- Single-assignment transformation doesn't help
  - Rather, recycle storage space whenever possible
  - The hardware already performs single assignment (register renaming)
- Unfolding transformation
  - Works on iteration-independent loops with trivial computations
  - Reduces loop-index overhead (even on iteration-dependent loops)
  - Do not unfold loops with non-trivial computations
- SIMD parallelism to reduce memory communication
  - Alpha has no SIMD instruction set, but it has a 64-bit datapath, and instructions can read/write 64 bits at a time by splitting/merging narrow words
  - There are more computing units (4) than memory units (2), so reducing memory operations helps performance
- Performance improvement of the MPEG-2 decoder from the optimizations we apply