CSCE 313: Embedded Systems Performance Analysis and Tuning


CSCE 313: Embedded Systems Performance Analysis and Tuning Instructor: Jason D. Bakos

Performance Considerations
time per frame = (number of pixels) * (time per pixel)
  number of pixels: constant
  time per pixel: variable
time per pixel = (cycles per pixel) * (clock period)
  cycles per pixel: variable
  clock period: constant
cycles per pixel = (instructions per pixel) * (cycles per instruction)
  instructions per pixel: variable (programmer, compiler)
  cycles per instruction: variable (data and control hazards)
instructions per pixel = (ops per pixel) * (instructions per op)
  ops per pixel: constant
  instructions per op: variable
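
As a worked example with made-up numbers (not the values you will measure): a 320x240 frame has 76,800 pixels; if each pixel takes 100 instructions at a CPI of 2 with a 20 ns clock period (50 MHz), then

  time per pixel  = 100 * 2 * 20 ns = 4 us
  time per frame  = 76,800 * 4 us = 0.3072 s  (about 3.3 frames per second)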

Easy Performance Tweaks
The Nios II has no hardware floating-point instructions
  The compiler replaces floating-point multiplies with calls to __muldf3
  The compiler replaces int-to-float typecasts with calls to __floatsisf
Add the floating-point "custom instructions"
  Works for float only (not double)
  Floating-point operations will then appear as "custom" instructions in the disassembly
Make sure you don't have any doubles
  cos() should be cosf(), floor() should be floorf(), 1.0 should be 1.0f
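
A minimal sketch of what "float-only" code looks like (transform_weight is a made-up name, not part of the lab code):

  #include <math.h>

  float transform_weight(float angle)
  {
      /* cosf()/floorf() and the 1.0f literal keep everything single precision;
         cos(), floor(), or 1.0 would silently promote to double and bypass
         the floating-point custom instructions */
      return floorf(cosf(angle) + 1.0f);
  }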

Easy Performance Tweaks
Turn on compiler optimization
  Right-click the Eclipse project, then Properties, then Nios II Application Properties
  (turn optimization back off for debugging)

Counting Cycles
Performance counters: hardware counters that increment under certain conditions
  Allow cycle-level precision
  Allow high-resolution hardware instrumentation:
    Events, e.g. cache misses, instruction commits, exceptions
    Timing: use a counter to count the cycles needed to execute a segment of code
In Lab 3, we will use them to time sections of code:

  counter = <current counter state>
  <code to time>
  cycles elapsed = <current counter state> - counter

Performance Counters in NIOS SOPC
To use, add a Performance Counter Unit in SOPC Builder:
  Name: "performance_counter_0"
  Number of simultaneous counters = 1
  Make sure you put it on sys_clk and (automatically) assign base addresses

Performance Counters in NIOS SOPC
To use:
  Include the header: #include <altera_avalon_performance_counter.h>
  Reset the counters: PERF_RESET(PERFORMANCE_COUNTER_0_BASE);
  Start measuring: PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);
  Read the counter: time = perf_get_total_time((void*)PERFORMANCE_COUNTER_0_BASE);
  Restart the counter: PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);
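
Putting the calls together, a minimal timing sketch (assumes the counter was named performance_counter_0 as above; transform_frame() is a hypothetical stand-in for the code being timed):

  #include <stdio.h>
  #include "system.h"                      /* defines PERFORMANCE_COUNTER_0_BASE */
  #include "altera_avalon_performance_counter.h"

  extern void transform_frame(void);       /* hypothetical code to time */

  int main(void)
  {
      unsigned long long cycles;

      PERF_RESET(PERFORMANCE_COUNTER_0_BASE);            /* clear the counters       */
      PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);  /* global counter runs now  */

      transform_frame();

      cycles = (unsigned long long)perf_get_total_time((void *)PERFORMANCE_COUNTER_0_BASE);
      printf("cycles elapsed: %llu\n", cycles);

      PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);  /* restart for the next measurement */
      return 0;
  }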

Instruction Counting
There is no easy way to measure the dynamic instruction count
Solution: find the static instruction count for the innermost loop
  Count only the transformation and interpolation code (no loops)
  Consider only the most common execution paths:
    if-statements for pixel bounds (assume the pixel is in bounds)
    function calls (include the instructions in all called functions, including built-ins like __fixsfsi, but do not include alt_up_pixel_buffer_dma_draw())
Open <project>.objdump and search for a unique substring from your code

Object Dump

C source:

  for(i = 0; i<240; i++) {
    for(j = 0; j<320; j++) {
      row = (j-120)*cosf(counter) - (i-160)*sinf(counter) + 120;
      col = (j-120)*sinf(counter) + (i-160)*cosf(counter) + 160;
      alt_up_pixel_buffer_dma_draw(my_pixel_buffer,
          (my_image[(row*320)*3+col*3+2]) +
          (my_image[(row*320*3+col*3+1)] << 8) +
          (my_image[(row*320*3+col*3+0)] << 16), j, i);
    }
  }

As it appears in <project>.objdump (source lines interleaved with the disassembly):

  for(i = 0; i<240; i++) {
    for(j = 0; j<320; j++) {
      row = (j-120)*cosf(counter) - (i-160)*sinf(counter) + 120;
   800055c:  913fe204  addi    r4,r18,-120
   8000560:  80025880  call    8002588 <__floatsisf>
   8000564:  1023883a  mov     r17,r2
      col = (j-120)*sinf(counter) + (i-160)*cosf(counter) + 160;
      alt_up_pixel_buffer_dma_draw(my_pixel_buffer, (my_image[(row*320)*3+col*3+2]) +
   8000568:  d8800117  ldw     r2,4(sp)
   800056c:  8889ff32  custom  252,r4,r17,r2
   8000570:  2589ffb2  custom  254,r4,r4,r22
   8000574:  0090bc34  movhi   r2,17136
   8000578:  2089ff72  custom  253,r4,r4,r2
   800057c:  80026200  call    8002620 <__fixsfsi>
   8000580:  d8c00017  ldw     r3,0(sp)

Instruction-Level Debugging
Alternatively, you can use instruction stepping in the debugger:
  Open your project in the debugger
  Set a breakpoint on your pixel code
  Make sure the first iteration doesn't exhibit unusual behavior (e.g. a transform to an out-of-bounds location)
  Switch to instruction stepping mode
  Use "step into" (F6) to walk through your code, until the call to alt_up_pixel_buffer_dma_draw()
  Copy the instruction sequence from the disassembly view

Data Types
Colors are only 8 bits
Likewise, weights require 8 bits of precision
  A weight value of 2^-8 corresponds to one increment
Row and col require a 10-bit range
  +/- 2^9 = [-512, 511] will encompass the column values for a 320x240 frame
Floating point is overkill for this application
Use 18-bit fixed point

Fixed-Point
Need a way to represent fractional numbers in binary
(N,M) fixed-point representation: assume a binary point at a fixed location in an N-bit value, with M bits to the right of the point
Example, (6,4) fixed format: the 6-bit (unsigned) value 10.1101 is
  1x2^1 + 0x2^0 + 1x2^-1 + 1x2^-2 + 0x2^-3 + 1x2^-4 = 2.8125
Unsigned range = [0, 2^(N-M) - 2^-M] = [0, 4 - 2^-4] = [0, 3.9375]
For signed values, use two's complement
  Signed range = [-2^(N-M-1), 2^(N-M-1) - 2^-M]
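
As a sanity check, the same (6,4) example in C (a minimal sketch; the constant is just the bit pattern above):

  #include <stdio.h>

  int main(void)
  {
      /* (6,4) unsigned fixed point: the raw integer r represents r / 2^4 */
      unsigned int raw = 0x2D;          /* binary 10.1101 (decimal 45) */
      float value = raw / 16.0f;        /* 45 / 16 = 2.8125 */
      printf("%f\n", value);
      return 0;
  }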

Range and Accuracy Requirements
For the image transformation:
  Need sufficient bits to the left of the binary point (LHS) to accommodate pixel locations
  Need sufficient bits to the right of the binary point (RHS) to accommodate pixel depth

Lab 3 Fixed-Point
C doesn't offer a dedicated fixed-point type
We don't have fixed-point sin and cos, so use floating point for these, then convert the result to (32,9) fixed point:
  Multiply the result by 512.0
  Typecast to int
Multiplying cos/sin, which is a (32,9) fixed-point value, by row/col, which is a (32,0) fixed-point value, gives a (32,9) value, so keep this in mind!
Adding two (32,9) values gives a (32,9) value
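
A minimal sketch of the (32,9) flow for one pixel (variable names follow the earlier loop; FRAC_BITS and the final shifts are assumptions about how integer coordinates might be recovered, not prescribed code):

  #include <math.h>

  #define FRAC_BITS 9   /* (32,9) fixed point: 9 bits to the right of the binary point */

  void transform_pixel(float counter, int i, int j, int *row, int *col)
  {
      /* sin/cos still come from floating point, then convert to (32,9):
         multiply by 512.0 (= 2^9) and typecast to int */
      int cos_fix = (int)(cosf(counter) * 512.0f);
      int sin_fix = (int)(sinf(counter) * 512.0f);

      /* (32,9) * (32,0) gives (32,9); the added constants must also be (32,9), hence << 9 */
      int row_fix = (j - 120) * cos_fix - (i - 160) * sin_fix + (120 << FRAC_BITS);
      int col_fix = (j - 120) * sin_fix + (i - 160) * cos_fix + (160 << FRAC_BITS);

      /* drop the fractional bits to get integer pixel coordinates back */
      *row = row_fix >> FRAC_BITS;
      *col = col_fix >> FRAC_BITS;
  }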

Lab 3
Split into two parts
  Change the CPU to Nios II/f
  Enable hardware multiply

Lab 3 Enable hardware floating point

Lab 3
Use the performance counters to find the average time, in cycles, to transform each individual pixel when using different cache sizes:
  4K instruction cache, 4K data cache
  4K instruction cache, 16K data cache

Lab 3
For rotation only
For floating point and fixed point, calculate:
  Instructions per pixel
  And, for data cache sizes of 4KB and 16KB:
    Cycles per pixel
    Cycles per instruction (CPI)
Calculate the speedup of fixed point relative to floating point for each cache size
Calculate the speedup of the 16KB cache over the 4KB cache for each datatype
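
As a reminder of the arithmetic (the numbers below are made up, not expected results):

  CPI = (cycles per pixel) / (instructions per pixel)
  speedup = (cycles per pixel, before) / (cycles per pixel, after)

  e.g. if floating point takes 5,000 cycles per pixel and fixed point takes 1,250 with the same cache, the fixed-point speedup is 5,000 / 1,250 = 4.0x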