Software performance enhancement using multithreading and architectural considerations Prepared by: Andrey Sloutsman Konstantin Muradov 01/2006.


Chosen application: Refocus-it, an iterative refocus plug-in for GIMP. Home Page: Download Page: GIMP Page:

Refocus-it: The iterative refocus GIMP plug-in can be used to refocus images acquired by a defocused camera or blurred by Gaussian blur, motion blur, or any combination of these. Adaptive or static area smoothing can be used to remove the so-called "ringing" effect.

Algorithm Description: The algorithm runs i iterations (i is supplied on the command line). In every iteration, a hopfield_iteration…() function is invoked on each color map and does the refocusing. Inside hopfield_iteration…(), the weight of each pixel of the picture is recalculated from its previous value and the values in the scanning area.
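The loop structure described above (iterations, then color maps, then a per-pixel update over a scanning area) can be sketched in C. This is a minimal illustration, not the plug-in's code: the names `update_channel` and `refocus`, the 3x3 scanning area, the clamped borders, the plain weighted sum, and the use of a scratch buffer are all assumptions standing in for the real hopfield_iteration…() functions.

```c
#include <string.h>

#define RADIUS 1                 /* hypothetical scanning-area radius (3x3 window) */
#define SIDE (2 * RADIUS + 1)

/* Clamp an index into [0, n-1]; a stand-in for the plug-in's boundary handling. */
static int clamp_index(int v, int n)
{
    if (v < 0) return 0;
    if (v >= n) return n - 1;
    return v;
}

/* One pass over one color map: every output pixel is recomputed from the
 * previous values inside its scanning area (placeholder rule: weighted sum). */
static void update_channel(const float *prev, float *next,
                           int width, int height, const float *weights)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int p = -RADIUS; p <= RADIUS; p++) {
                for (int r = -RADIUS; r <= RADIUS; r++) {
                    int yy = clamp_index(y + p, height);
                    int xx = clamp_index(x + r, width);
                    sum += weights[(p + RADIUS) * SIDE + (r + RADIUS)]
                         * prev[yy * width + xx];
                }
            }
            next[y * width + x] = sum;
        }
    }
}

/* Run `iterations` passes; in each pass every color map is updated once.
 * `scratch` must hold width * height floats. */
static void refocus(float **channels, int n_channels, int width, int height,
                    int iterations, const float *weights, float *scratch)
{
    for (int it = 0; it < iterations; it++) {
        for (int c = 0; c < n_channels; c++) {
            update_channel(channels[c], scratch, width, height, weights);
            memcpy(channels[c], scratch,
                   sizeof(float) * (size_t)width * (size_t)height);
        }
    }
}
```

The innermost two loops are where almost all the time goes, which is why the later slides concentrate on them.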

hopfield_iteration…() How the function works:

hopfield_iteration…(): There are 4 different functions, chosen by the command-line parameters: hopfield_iteration_mirror_lambda(), hopfield_iteration_mirror(), hopfield_iteration_period_lambda(), hopfield_iteration_period().

Threading approaches: Split by color

Threading approaches: Using OpenMP, or divide the picture

Threading approaches: Divide pixels (Thread 0, Thread 1)

Threading approaches: Divide columns (final; Main Thread, Helper Thread)
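A minimal sketch of the column split with POSIX threads, assuming the main thread takes the left half and a helper thread the right half. The slides do not say which thread owns which columns. `ColumnJob`, `process_columns`, `process_image_two_threads`, and the per-pixel work (a simple gain with no neighbor dependency) are hypothetical placeholders.

```c
#include <pthread.h>

typedef struct {
    float *image;            /* row-major pixels, shared by both threads */
    int width, height;
    int col_begin, col_end;  /* half-open column range owned by this thread */
    float gain;              /* placeholder for the real per-pixel work */
} ColumnJob;

/* Process every row, but only the columns assigned to this job. */
static void *process_columns(void *arg)
{
    ColumnJob *job = arg;
    for (int y = 0; y < job->height; y++) {
        for (int x = job->col_begin; x < job->col_end; x++) {
            job->image[y * job->width + x] *= job->gain;
        }
    }
    return NULL;
}

/* Returns 0 on success, otherwise the pthread error code. */
static int process_image_two_threads(float *image, int width, int height,
                                     float gain)
{
    int mid = width / 2;
    ColumnJob main_job   = { image, width, height, 0,   mid,   gain };
    ColumnJob helper_job = { image, width, height, mid, width, gain };
    pthread_t helper;

    int rc = pthread_create(&helper, NULL, process_columns, &helper_job);
    if (rc != 0) return rc;

    process_columns(&main_job);        /* main thread works on its own half */
    return pthread_join(helper, NULL); /* wait for the helper to finish */
}
```

Because the placeholder work touches only a thread's own columns, the final join is the only synchronization needed here. The real update reads neighboring pixels, which is what makes barriers and mutexes necessary.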

Synchronization (time): Barriers (to provide algorithm consistency). The threading solution must take a data dependency into account. Diagram: Main Thread and Helper Thread timelines, each with its barriers.
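The role of the barrier can be sketched with `pthread_barrier_t`: neither thread may start iteration k+1 until both have finished iteration k, because each reads values the other produced. This is an assumption-laden toy, not the plug-in's scheme. It works on a single periodic row of `WIDTH` cells, uses two buffers that swap roles each iteration, and the update rule (sum of the two neighbors) and the names `Worker`, `run_worker`, and `iterate_with_barrier` are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for pthread_barrier_* under strict modes */
#include <pthread.h>
#include <stddef.h>

#define WIDTH 4   /* hypothetical row length */

typedef struct {
    int *buf[2];                 /* two buffers, swapped every iteration */
    int col_begin, col_end;      /* half-open column range of this thread */
    int iterations;
    pthread_barrier_t *barrier;  /* shared by both threads */
} Worker;

static void *run_worker(void *arg)
{
    Worker *w = arg;
    for (int it = 0; it < w->iterations; it++) {
        const int *prev = w->buf[it % 2];
        int *next = w->buf[(it + 1) % 2];
        for (int x = w->col_begin; x < w->col_end; x++) {
            /* Reads cross into the other thread's columns. */
            next[x] = prev[(x + WIDTH - 1) % WIDTH] + prev[(x + 1) % WIDTH];
        }
        /* Wait until both threads have finished this iteration. */
        pthread_barrier_wait(w->barrier);
    }
    return NULL;
}

/* `a` holds the input, `b` is a second buffer of the same size.
 * Returns the buffer holding the final result, or NULL on failure. */
static int *iterate_with_barrier(int *a, int *b, int iterations)
{
    pthread_barrier_t barrier;
    if (pthread_barrier_init(&barrier, NULL, 2) != 0) return NULL;

    Worker main_w   = { { a, b }, 0,         WIDTH / 2, iterations, &barrier };
    Worker helper_w = { { a, b }, WIDTH / 2, WIDTH,     iterations, &barrier };
    pthread_t helper;

    if (pthread_create(&helper, NULL, run_worker, &helper_w) != 0) {
        pthread_barrier_destroy(&barrier);
        return NULL;
    }
    run_worker(&main_w);
    pthread_join(helper, NULL);
    pthread_barrier_destroy(&barrier);
    return (iterations % 2 == 0) ? a : b;
}
```

One barrier per iteration is enough in this sketch because the buffer written in iteration k+1 is the one that was only read in iteration k, and a thread cannot pass the barrier before the other has finished reading.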

Synchronization (space): Mutexed areas (to prevent write/read conflicts). Intersection of the scanning areas causes W/R conflicts. Diagram: Main Thread and Helper Thread column ranges, with the barrier, the main mutex, and the period mutexes.

Randomizer Thread: Using rand() in threaded code makes the output of the optimized code differ from that of the original code, because the threads generate the same random series. Solution: a Randomizer Thread with Random Buffers for the Main Thread and the Helper Thread.
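One way to picture the random-buffer idea: a single thread is the only caller of rand(), and it fills one buffer per worker, so the workers consume pre-generated numbers instead of calling rand() themselves. The sketch below is an assumption about how such a scheme could look. `RandomizerJob`, `randomizer_thread`, `generate_random_buffers`, the interleaved fill order, and generating everything up front rather than refilling on demand are all hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    unsigned seed;
    int *main_buf;    /* numbers the main thread will consume */
    int *helper_buf;  /* numbers the helper thread will consume */
    size_t count;     /* entries per buffer */
} RandomizerJob;

/* The only place rand() is called, so the series is drawn in one fixed order. */
static void *randomizer_thread(void *arg)
{
    RandomizerJob *job = arg;
    srand(job->seed);
    for (size_t i = 0; i < job->count; i++) {
        job->main_buf[i] = rand();
        job->helper_buf[i] = rand();
    }
    return NULL;
}

/* Fills both buffers on a separate thread and waits for it.
 * Returns 0 on success, otherwise the pthread error code. */
static int generate_random_buffers(unsigned seed, int *main_buf,
                                   int *helper_buf, size_t count)
{
    RandomizerJob job = { seed, main_buf, helper_buf, count };
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, randomizer_thread, &job);
    if (rc != 0) return rc;
    return pthread_join(tid, NULL);
}
```

With the same seed the two buffers come out identical on every run, regardless of how the worker threads are later scheduled.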

Threads’ Loads Let’s take a look at the Intel® VTune™ Performance Analyzer plot

Threading – holes’ covering: Consider a DP or MP system where each core is hyperthreaded. Problem: the OS can put both of the application’s threads on the same physical core. Solution: take care of the processor affinities.

General Code Optimization: Get rid of the heavy macro image_get_mirror. Diagram regions: no calculation needed; only the Y parameter should be recalculated; only the X parameter should be recalculated; original calculation needed.
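The region split can be sketched as follows, assuming (the slide's diagram is not preserved) that the interior needs no mirroring, the left/right bands need only X mirrored, the top/bottom bands need only Y mirrored, and the corners need both. The `mirror` function here is a hypothetical stand-in for image_get_mirror, and `window_sum_naive` and `window_sum_fast` are illustrative names.

```c
#define RADIUS 1   /* hypothetical scanning-area radius */

/* Reflect an out-of-range index back into [0, n-1]; valid for RADIUS <= n. */
static int mirror(int v, int n)
{
    if (v < 0) return -v - 1;
    if (v >= n) return 2 * n - v - 1;
    return v;
}

/* Original style: mirror both coordinates on every single access. */
static int window_sum_naive(const int *img, int w, int h, int x, int y)
{
    int sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; dy++)
        for (int dx = -RADIUS; dx <= RADIUS; dx++)
            sum += img[mirror(y + dy, h) * w + mirror(x + dx, w)];
    return sum;
}

/* Optimized style: decide once per pixel which coordinates can leave the image. */
static int window_sum_fast(const int *img, int w, int h, int x, int y)
{
    int x_inside = (x >= RADIUS && x < w - RADIUS);
    int y_inside = (y >= RADIUS && y < h - RADIUS);
    int sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        /* Y is mirrored only in the top and bottom bands. */
        int yy = y_inside ? y + dy : mirror(y + dy, h);
        const int *row = img + yy * w;
        if (x_inside) {
            for (int dx = -RADIUS; dx <= RADIUS; dx++)
                sum += row[x + dx];             /* no X mirroring needed */
        } else {
            for (int dx = -RADIUS; dx <= RADIUS; dx++)
                sum += row[mirror(x + dx, w)];  /* left and right bands */
        }
    }
    return sum;
}
```

For a large image nearly every pixel falls in the interior case, so the branchy mirror computation disappears from the hot path.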

SIMD Approach: The heaviest line in the program is a classic case for SIMD: loop { sum += weights[p, r] * image[i+p, j+r] }. Why didn't it work, then? The innermost loop is short. Most of the time weights[curr_ptr] and image[curr_ptr] are unaligned. There is overhead in adding the “SIMD sum” to the “non-SIMD sum”.
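Two of the costs listed above, the short inner loop and the merge of the "SIMD sum" with the "non-SIMD sum", can be seen in a plain C sketch that mimics a 4-wide vector with four explicit accumulators. This is an illustration only: it uses no real SIMD intrinsics, so the alignment issue is not shown, and `LANES`, `dot_scalar`, and `dot_lanes` are hypothetical names.

```c
#include <stddef.h>

#define LANES 4   /* width of the imagined vector register, in floats */

/* Reference: the original scalar multiply-accumulate loop. */
static float dot_scalar(const float *w, const float *img, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += w[i] * img[i];
    return sum;
}

/* SIMD-shaped version: full groups of LANES elements go through the
 * "vector" accumulators, leftovers go through a scalar tail. */
static float dot_lanes(const float *w, const float *img, size_t n)
{
    float acc[LANES] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += w[i + l] * img[i + l];

    /* Horizontal add: fixed overhead paid once per call. */
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail: the "non-SIMD sum" added to the "SIMD sum". */
    for (; i < n; i++)
        sum += w[i] * img[i];

    return sum;
}
```

With a window row of, say, 7 elements, only one vector step runs before the horizontal add and a 3-element tail, so the fixed overhead can outweigh the gain.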

Results HT machine (P4 3.0GHz) Threading Only

Results HT machine (P4 3.0GHz) Code Optimization Only

Results HT machine (P4 3.0GHz) Full Optimization

The results on Dual Core machine (Pentium D)

Compilation by Intel Compiler

Compilation by Intel Compiler (cont): The Intel compiler gives up to a 26.1% performance boost. The 64-bit compilation gave results similar to those of the 32-bit compilation by the Intel Compiler.