Measuring Program Performance Matrix Multiply


Measuring Program Performance: Matrix Multiply CSCE 513 Computer Architecture Topics: Linux times; matrix multiplication Readings: November 20, 2017

Times in Unix
File times: ls -l gives the modification date (stored as seconds since Jan 1, 1970)
Process times:
struct timeval {
  long tv_sec;  /* seconds */
  long tv_usec; /* microseconds */
};
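The modification date that ls -l formats can also be read programmatically with stat(2); a minimal sketch, assuming a POSIX system (the helper name file_mtime is mine, not from the slides):

```c
#include <sys/stat.h>
#include <time.h>

/* Return a file's last-modification time as seconds since the
 * Epoch (Jan 1, 1970), or (time_t)-1 on error.  This is the same
 * quantity that ls -l formats as the modification date. */
time_t file_mtime(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) == -1)
        return (time_t)-1;
    return sb.st_mtime;
}
```

ctime(3) can then turn the returned time_t into the familiar date string.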

The time command
cocsce-l1d39-11> time gcc pthread1.c -l pthread -o pthread1
real 0m0.077s
user 0m0.052s
sys 0m0.012s
cocsce-l1d39-11>
Note: real is wall-clock time; for a single-threaded program, real >= user + sys (a multithreaded program can have user + sys exceed real).

cocsce-l1d39-11> gcc pthread1.c -l pthread -o pthread1
cocsce-l1d39-11> ./pthread1
In main: creating thread 0
In main: creating thread 1
In main: creating thread 2
Hello World! It's me, thread 0!
In main: creating thread 3
Hello World! It's me, thread 1!
Hello World! It's me, thread 2!
In main: creating thread 4
Hello World! It's me, thread 3!
Hello World! It's me, thread 4!

TIME(7) Linux Programmer's Manual TIME(7)
NAME
time - overview of time and timers
DESCRIPTION
Real time and process time
Real time is defined as time measured from some fixed point, either from a standard point in the past (see the description of the Epoch and calendar time below), or from some point (e.g., the start) in the life of a process (elapsed time).
Process time is defined as the amount of CPU time used by a process. This is sometimes divided into user and system components. User CPU time is the time spent executing code in user mode. System CPU time is the time spent by the kernel executing in system mode on behalf of the process (e.g., executing system calls).
The time(1) command can be used to determine the amount of CPU time consumed during the execution of a program. A program can determine the amount of CPU time it has consumed using times(2), getrusage(2), or clock(3).
The hardware clock …
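Of the interfaces the man page lists, clock(3) is the simplest; a hedged sketch of timing a function with it (the helper names cpu_seconds and spin are illustrative, not from the slides):

```c
#include <time.h>

/* Busy-work function so there is measurable CPU time. */
static void spin(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; ++i)
        x += (double)i * 0.5;
}

/* Return the CPU time consumed by fn(), in seconds.
 * clock(3) reports process CPU time in units of CLOCKS_PER_SEC. */
double cpu_seconds(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Unlike the real/user/sys split from time(1), clock() returns a single combined process-CPU-time figure.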

getrusage
struct rusage {
  struct timeval ru_utime; /* user CPU time used */
  struct timeval ru_stime; /* system CPU time used */
  long ru_maxrss;          /* maximum resident set size */
  long ru_ixrss;           /* integral shared memory size */
  long ru_idrss;           /* integral unshared data size */
  long ru_isrss;           /* integral unshared stack size */
  long ru_minflt;          /* page reclaims (soft page faults) */
  long ru_majflt;          /* page faults (hard page faults) */
  long ru_nswap;           /* swaps */
  long ru_inblock;         /* block input operations */
  long ru_oublock;         /* block output operations */
  long ru_msgsnd;          /* IPC messages sent */
  long ru_msgrcv;          /* IPC messages received */
  long ru_nsignals;        /* signals received */
  long ru_nvcsw;           /* voluntary context switches */
  long ru_nivcsw;          /* involuntary context switches */
};

struct timeval
struct timeval {
  long tv_sec;  /* seconds */
  long tv_usec; /* microseconds */
};
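Since tv_sec and tv_usec split one instant across two fields, a small conversion helper is handy; a sketch of my own, mirroring the arithmetic used in the seconds() function later in these slides:

```c
#include <sys/time.h>

/* Collapse a struct timeval into a single double-precision
 * count of seconds: seconds + microseconds / 1e6. */
double tv_to_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
```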

Matmult.c - example
Headers / declarations
Initialize arrays A and B
Multiplication: C[i][j] = sum over k = 0 .. n-1 of A[i][k] * B[k][j]

3 Nested Loops to compute product
for(i=0; i<rows; ++i){
  for(j=0; j<cols2; ++j){
    for(k=0; k<cols; ++k){
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
Note: rows*cols2*cols multiplications and additions. For square matrices (rows = cols2 = cols = n) there are n^3 multiplications.

Headers
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <time.h>
#include <sys/resource.h>
double **allocmatrix(int, int);
int freematrix(double **, int, int);
void nerror(char *error_text);
double seconds(int nmode);
double rand_gen(double fmin, double fmax);
void SetSeed(int flag);

Main: args
int main(int argc, char** argv) {
  int l, rows, cols2, cols;
  int i, j, k;
  double temp;
  double **A, **B, **C;
  double tstart, tend;
  /* The following allows matrix parameters to be entered on the
     command line to take advantage of dynamically allocated memory.
     You may modify or remove it as you wish. */
  if (argc != 4) {
    nerror("Usage: <executable> <rows-value> <cols-value> <cols2-value>");
  }
  rows  = atoi(argv[1]); /* A is a rows x cols matrix */
  cols  = atoi(argv[2]); /* B is a cols x cols2 matrix */
  cols2 = atoi(argv[3]); /* So C=A*B is a rows x cols2 matrix */
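atoi() silently returns 0 on malformed input, which would make the dimension checks above pass garbage along. A safer parsing sketch using strtol(3) (the parse_dim helper is my own, not part of matmult.c):

```c
#include <stdlib.h>
#include <limits.h>

/* Parse a positive matrix dimension from a string.
 * Returns the value, or -1 if the string is not a
 * positive integer that fits in an int. */
int parse_dim(const char *s)
{
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v <= 0 || v > INT_MAX)
        return -1;
    return (int)v;
}
```

With this, bad arguments like "100x" or "-5" can be rejected with a usage message instead of silently producing a zero-sized matrix.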

Initializing the arrays
A = (double **) allocmatrix(rows, cols);
/* Initialize matrix elements so compiler does not optimize out */
for(i=0; i<rows; i++) {
  for(j=0; j<cols; j++) {
    A[i][j] = rand_gen(1.0, 2.0);
    /* if(i == j) A[i][j]=1.0; else A[i][j] = 0.0; */
  }
}

rand_gen
/* generate a random double between fmin and fmax */
double rand_gen(double fmin, double fmax)
{
  return fmin + (fmax - fmin) * drand48();
}
/* The drand48() and erand48() functions return nonnegative double-precision floating-point values uniformly distributed over the interval [0.0, 1.0). */
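The header list earlier declares SetSeed(int flag) but the slides never show its body; one plausible sketch, assuming flag selects between a time-based seed and a fixed seed for the drand48() stream (the exact behavior is my assumption):

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical sketch of the SetSeed() helper declared in the
 * headers slide: seed drand48() either from the current time
 * (different stream each run) or with a fixed constant
 * (reproducible stream for debugging/timing comparisons). */
void SetSeed(int flag)
{
    if (flag)
        srand48((long)time(NULL)); /* varies per run */
    else
        srand48(12345L);           /* fixed, reproducible */
}
```

A reproducible seed is useful here: it guarantees the ijk, kij, and kji variants all multiply the same matrices.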

seconds - a function to combine all the times into one double
/* Returns the total cpu time used in seconds. */
double seconds(int nmode)
{
  struct rusage buf;
  double temp;
  getrusage(nmode, &buf);
  /* Get system time and user time in micro-seconds. */
  temp = (double)buf.ru_utime.tv_sec*1.0e6 + (double)buf.ru_utime.tv_usec
       + (double)buf.ru_stime.tv_sec*1.0e6 + (double)buf.ru_stime.tv_usec;
  /* Return the sum of system and user time in SECONDS. */
  return temp*1.0e-6;
}

Timing a section of code
tstart = seconds(RUSAGE_SELF);
for(i=0; i<rows; ++i){
  for(j=0; j<cols2; ++j){
    for(k=0; k<cols; ++k){
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
tend = seconds(RUSAGE_SELF);

Timing a section of code – kij variation
tstart = seconds(RUSAGE_SELF);
for(k=0; k<cols; ++k){
  for(i=0; i<rows; ++i){
    for(j=0; j<cols2; ++j){
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
tend = seconds(RUSAGE_SELF);

Timing a section of code – kji variation
tstart = seconds(RUSAGE_SELF);
for(k=0; k<cols; ++k){
  for(j=0; j<cols2; ++j){
    for(i=0; i<rows; ++i){
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
  }
}
tend = seconds(RUSAGE_SELF);

Performance variations
cocsce-l1d39-11> gcc kij.c -o kij
cocsce-l1d39-11> gcc kji.c -o kji
cocsce-l1d39-11> ./matmul 1000 1000 1000
The total CPU time is: 9.212000 seconds
cocsce-l1d39-11> ./kij 1000 1000 1000
The total CPU time is: 3.712000 seconds
cocsce-l1d39-11> ./kji 1000 1000 1000
The total CPU time is: 14.264000 seconds
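The kij ordering wins because its innermost loop walks C[i][...] and B[k][...] with stride 1 through memory, while kji's inner loop strides down columns and misses in cache. The ikj ordering, not shown in the slides, shares kij's stride-1 inner loop; a sketch assuming the same double** matrices as matmult.c:

```c
/* ikj ordering: the inner j loop touches C[i][j] and B[k][j]
 * consecutively in memory (stride 1), like kij, so it has
 * similar cache behavior. */
void matmul_ikj(double **A, double **B, double **C,
                int rows, int cols, int cols2)
{
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < cols; ++k)
            for (int j = 0; j < cols2; ++j)
                C[i][j] += A[i][k] * B[k][j];
}
```

All six loop orders compute the same C; only the memory access pattern, and therefore the cache miss rate, differs.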

Address Trace - & address-of operator
&x – address of x
if ((mytracefile = fopen("trace", "w")) == NULL)
  fprintf(stderr, "Could not open file %s!\n", "trace");
fprintf(mytracefile, "address of x is %p\n", &x);
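Tracing addresses this way makes C's row-major layout visible: within a row, successive elements are sizeof(double) bytes apart. A small illustrative helper of my own that measures that distance directly:

```c
#include <stddef.h>

/* For a 2-D array double x[3][4], elements of a row are adjacent
 * in memory.  Return the byte distance between x[0][0] and
 * x[0][1]; only addresses are taken, the values are never read. */
ptrdiff_t row_major_stride(void)
{
    double x[3][4];
    return (char *)&x[0][1] - (char *)&x[0][0];
}
```

This is exactly why the inner-loop index that varies fastest should walk the rightmost subscript, as the kij timing results show.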
