Optimization: The Art of Computing


Optimization: The Art of Computing
Intel Challenge experience and other tricks …
Mathieu Gravey

Golden principle of optimizing
[Diagram: performance over the long term, across the three levels where speed is won: Algorithm, Implementation, Hardware]

Example: Prime Number Algorithm

For i = 2 to N
    bool isPrime = true
    For j = 2 to N
        If (mod(i,j) == 0 and i != j)
            isPrime = false
            break
        End if
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for

Example: Prime Number Algorithm

For i = 2 to N
    bool isPrime = true
    For j = 2 to i-1
        If (mod(i,j) == 0)
            isPrime = false
            break
        End if
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for

Example: Prime Number Algorithm

For i = 2 to N
    bool isPrime = true
    For j = 2 to √i
        If (mod(i,j) == 0)
            isPrime = false
            break
        End if
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for

Example: Prime Number Algorithm

// the job
For i = 2 to N
    bool isPrime = true
    For j = 2 to √i
        If (mod(i,j) == 0)
            isPrime = false
            break
        End if
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for

Example: Prime Number Algorithm

// the job
For i = 2 to N
    bool isPrime = true
    // vectorize the job
    For j = 2 to √i
        isPrime = isPrime && (mod(i,j) != 0)
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for
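
A minimal C++ sketch of the vectorized job (the function name is mine, not from the slides). Dropping the early break gives the inner loop a fixed trip count, and a bitwise & in place of the short-circuiting && removes the data-dependent branch, which is what lets the compiler emit SIMD code:

    int isPrimeVectorizable(int i) {
        int isPrime = 1;
        for (int j = 2; j * j <= i; ++j)   // j <= √i, without floating point
            isPrime &= (i % j != 0);       // no break: branch-free, SIMD-friendly
        return isPrime;
    }

The trade-off: every candidate now pays for all divisions up to √i even after a factor is found; the slides accept that cost in exchange for vector throughput.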

Example: Prime Number Algorithm

// the job  (2 is added to the listOfPrimeNumber beforehand; even i > 2 cannot be prime)
For i = 3 to N step 2
    bool isPrime = true
    // vectorize the job
    For j = 3 to √i step 2
        isPrime = isPrime && (mod(i,j) != 0)
    End for
    If (isPrime) add i to the listOfPrimeNumber
End for

Example: Prime Number Algorithm

// the job
For i = 2 to N
    bool isPrime = true
    // vectorize the job
    // j must run up to √i inclusive, or squares of primes (e.g. 25) would pass
    For j in listOfPrimeNumber while j ≤ √i
        isPrime = isPrime && (mod(i,j) != 0)
    End for
    If (isPrime) add i to the listOfPrimeNumber in order
End for

Example: Prime Number Algorithm

// the job  (2 and 3 are seeded into the listOfPrimeNumber; every prime above 3 has the form 6k ± 1)
For i = 5 to N where i mod 6 == 1 or i mod 6 == 5
    bool isPrime = true
    // vectorize the job
    For j in listOfPrimeNumber while j ≤ √i
        isPrime = isPrime && (mod(i,j) != 0)
    End for
    If (isPrime) add i to the listOfPrimeNumber in order
End for
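
A compilable C++ sketch of this final version, under the slide's assumptions (2 and 3 seeded, 6k ± 1 candidates, trial division by the stored primes up to √i); the early break is kept here for readability in place of the vectorized accumulation:

    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000;                            // illustrative bound
        std::vector<int> listOfPrimeNumber = {2, 3};   // seeded, as the slide assumes
        // i += 2 after a 6k-1 candidate, i += 4 after a 6k+1 candidate: 5,7,11,13,...
        for (int i = 5; i <= N; i += (i % 6 == 5) ? 2 : 4) {
            bool isPrime = true;
            for (int j : listOfPrimeNumber) {
                if (j * j > i) break;                  // only divisors up to √i
                if (i % j == 0) { isPrime = false; break; }
            }
            if (isPrime) listOfPrimeNumber.push_back(i);  // stays in increasing order
        }
        for (int p : listOfPrimeNumber) std::printf("%d ", p);
        std::printf("\n");
    }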

Basic principles
- Pareto principle
- Structure
- Parallelization
- Vectorization

Basic principles
- Start with the main issues: global view → critical issue
- No "monkey development": start simple → go to complex
- Optimization is an iterative process; it often starts by slowing the code down
- Keep the global picture in mind!

Rules | Guidelines
- Be lazy: don't reinvent the wheel
- Don't be idle: use design patterns
- Global variables are your enemies
- Don't overgeneralize

Rules | Guidelines
- Trust the compiler
- Simple for you = simple for the compiler | computer
- Share your knowledge with the compiler

Rules | Guidelines
- Think different: try, change, and try again …
- Don't aim for the Best, but for something Good, then Better

Concrete trick: Memory
- Array vs. list: an array is sequential, so the hardware prefetcher can stream it; a list means one random access per element
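
A self-contained sketch of the point (sizes and the timing harness are mine, not from the slides): the same sum over contiguous vs. node-based storage.

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    template <typename Container>
    void timedSum(const char* label, const Container& c) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = std::accumulate(c.begin(), c.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%lld in %.2f ms\n", label, s,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    int main() {
        const int n = 1000000;
        std::vector<int> arr(n, 1);  // contiguous: the prefetcher streams it
        std::list<int> lst(n, 1);    // one heap node per element: pointer chasing
        timedSum("vector", arr);
        timedSum("list", lst);
    }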

Concrete trick: First-step optimization

Compiler optimization:
    icpc myCodeFile -O3 -xhost -o myCompiledProgram    (⚠ -g)
- const → promises no writes
- inline
- restrict / __restrict__ → promises no aliasing, so no redundant re-reads after a store
- loop unrolling
- __builtin_expect((x),(y)) → branch-probability hint
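
A short illustration of two of these hints (the function names are made up; __restrict__ and __builtin_expect are the GCC/Clang/ICC spellings the slide lists):

    #include <cstddef>

    // __restrict__ promises x and y never alias, so the compiler may keep
    // values in registers and unroll/vectorize instead of reloading after
    // every store.
    void axpy(std::size_t n, float alpha,
              const float* __restrict__ x, float* __restrict__ y) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += alpha * x[i];               // candidate for -O3 unrolling + SIMD
    }

    int readChecked(const int* p) {
        if (__builtin_expect(p == nullptr, 0))  // hint: the null case is rare
            return -1;
        return *p;
    }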

Concrete trick: OpenMP

Vectorization => SIMD
- #pragma omp simd
- multiple operations with one instruction
- ⚠ non-aligned data

Multi-thread
- shared memory, communication through the L3 cache
- How to use:
      #pragma omp parallel for default(none) shared(x,y) firstprivate(array) reduction(max:MaxValue) schedule(static)
      for (int i = 0; i < 10000; i++) { something … }
- #pragma omp critical
- #pragma omp barrier
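
A compilable sketch of the slide's pattern, assuming an OpenMP compiler (icpc -qopenmp, g++ -fopenmp); the data and loop bodies are placeholders:

    #include <cstdio>
    #include <vector>

    int main() {
        int n = 10000;
        std::vector<float> x(n), y(n);
        float MaxValue = 0.0f;

        // Vectorization: one SIMD instruction covers several iterations
        #pragma omp simd
        for (int i = 0; i < n; i++)
            x[i] = 0.5f * i;

        // Multi-threading: explicit data-sharing policy, race-free max reduction
        #pragma omp parallel for default(none) shared(x, y, n) \
                reduction(max:MaxValue) schedule(static)
        for (int i = 0; i < n; i++) {
            y[i] = x[i] * x[i];
            if (y[i] > MaxValue) MaxValue = y[i];
        }

        std::printf("MaxValue = %f\n", MaxValue);
    }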

Multi-chip | multi-socket: NUMA (non-uniform memory access)
- Remote memory is slower than local memory
- Placement in memory follows the first touch → parallelize the initialisation with schedule(static)
- Read-only data → copy it into each socket's local memory
- Control thread affinity
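
A sketch of the first-touch idea (buffer name and size are illustrative): malloc leaves the pages untouched, the parallel initialisation places each page on the node of the thread that writes it first, and the compute loop reuses that placement because it runs with the same schedule(static).

    #include <cstdlib>

    void compute(double* a, long n) {
        // Initialisation: each thread first-touches "its" pages → local placement
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 0.0;

        // Same static schedule: each thread now works on node-local memory
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] += 1.0;
    }

    int main() {
        long n = 100000000;                                    // ~800 MB, illustrative
        double* a = (double*)std::malloc(n * sizeof(double));  // pages not yet touched
        compute(a, n);
        std::free(a);
    }

Pinning threads (OMP_PROC_BIND=close, or Intel's KMP_AFFINITY) keeps the thread-to-core mapping stable across the two loops, which is the thread-affinity point on the slide.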

Questions?