Submitters: Vitaly Panor, Tal Joffe. Instructors: Zvika Guz, Koby Gottlieb. Software Laboratory, Electrical Engineering Faculty, Technion, Israel.

Project Goal Gain knowledge of software optimization. Learn and implement different optimization techniques. Become acquainted with different performance-analysis tools.

Optimization Approaches Multithreading (main part). Implementation considerations. Architectural considerations.

Chosen Program Called EOCF. Implements the "Burrows–Wheeler lossless compression algorithm" by M. Burrows and D.J. Wheeler. Can compress and decompress files. We chose to work on the compression part.

Algorithm Description Compression: The source file is read in blocks of bytes. A Burrows–Wheeler transform followed by a Move-To-Front transform is applied to each block. Each processed block is written to a temp file. After all the blocks have been written, Huffman compression is applied to the temp file.
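To make the Move-To-Front step concrete, here is a minimal C sketch of an MTF encoder over a byte block. It illustrates the classic transform, not the EOCF source; the function name is ours.

```c
#include <stddef.h>
#include <string.h>

/* Move-To-Front sketch (illustrative, not the EOCF source): each byte
   is replaced by its position in a recency list, and the byte is then
   moved to the front, so runs of similar bytes become small values. */
static void mtf_encode(const unsigned char *in, unsigned char *out, size_t n)
{
    unsigned char table[256];
    int i;
    size_t k;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;                 /* identity recency list */

    for (k = 0; k < n; k++) {
        unsigned char c = in[k];
        int idx = 0;
        while (table[idx] != c)                      /* find c's current rank */
            idx++;
        out[k] = (unsigned char)idx;                 /* emit the rank */
        memmove(&table[1], &table[0], (size_t)idx);  /* shift front entries down */
        table[0] = c;                                /* c is now most recent */
    }
}
```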

Algorithm Description Decompression: The compression steps are performed in reverse order.

EOCF – Program Structure [diagram] Input File → Block processing section (for each block in the file): Read Block → BW transformation → MTF transformation → Write Block to temporary file → Temp File → Huffman compression → Output File.

Code Analysis The following two functions account for about 2/3 of the runtime: [profiler screenshot]

Code Analysis Call graph: [call-graph screenshot]

Code Analysis The conclusion: the code spends most of its runtime performing the transformations.

Multithreading Based on these results, the block-processing section was multithreaded. The Huffman compression section was not multithreaded. A data-decomposition approach was used.

Data Decomposition The data-decomposition approach in general: [diagram] instead of running one pipeline Func1 → Func2 → … → FuncN over the whole input, the input is split into chunks of size Input/N, and each of K threads runs the full Func1 → Func2 → … → FuncN pipeline on its own chunk.

Data Decomposition The data-decomposition approach applied to EOCF: [diagram] Input File → input buffers → threads 1…n, each running BW → MTF on its own blocks → output buffers → Temp File → Huffman → Output File.

Thread Design Read a block from the input buffer. Perform the transformations. Write to the output buffer. Fill the input buffer or empty the output buffer if needed.

Thread Design yes no Fill buffer from input file Current block is the last block? no finish yes Read next input block Perform transformations Current write Buffer is full? yes Write buffer to temp file no Write block to output buffer Current Read buffer is empty? finish

Implementation The Win32 API was used rather than the OpenMP API. It yields better performance, according to research based on previous projects and internet articles.

Synchronization Critical Section objects were used. They provide a slightly faster, more efficient mutual-exclusion mechanism than Mutex objects.
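A sketch of the Win32 pattern, with illustrative names (claim_next_block and buf_lock are ours): unlike a Mutex, an uncontended CRITICAL_SECTION is acquired in user mode without a kernel transition.

```c
#include <windows.h>

/* Illustrative sketch (not the EOCF source): guard shared buffer
   bookkeeping with a CRITICAL_SECTION. */
static CRITICAL_SECTION buf_lock;

void buffers_init(void) { InitializeCriticalSection(&buf_lock); }
void buffers_done(void) { DeleteCriticalSection(&buf_lock); }

int claim_next_block(int *next_block, int total_blocks)
{
    int mine;
    EnterCriticalSection(&buf_lock);   /* user-mode fast path when uncontended */
    mine = (*next_block < total_blocks) ? (*next_block)++ : -1;
    LeaveCriticalSection(&buf_lock);
    return mine;                       /* -1 means no work left */
}
```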

Thread Performance The threads share the load almost equally, and about 2/3 of the time is spent in the parallel section, as expected.

Thread Checker The thread checker found no errors. *The single warning is due to the fact that the main thread waits for the worker threads to finish.

Number of Threads Best performance is achieved when the number of threads equals the number of cores. Measured on a dual-core machine: [chart]
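One way to pick that count at runtime on Windows: a sketch in which GetSystemInfo, CreateThread, and WaitForMultipleObjects are the real Win32 calls, while the surrounding function names and the worker routine from the earlier sketch are ours.

```c
#include <windows.h>

/* Sketch: create one worker per core, matching the measurement above. */
DWORD core_count(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors;    /* number of logical processors */
}

void start_workers(HANDLE *threads, void *ctx)
{
    DWORD n = core_count();
    for (DWORD i = 0; i < n; i++)
        threads[i] = CreateThread(NULL, 0, worker, ctx, 0, NULL);
    WaitForMultipleObjects(n, threads, TRUE, INFINITE);  /* join all workers */
}
```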

Input Buffers We implement the double-buffering technique: while one buffer is being filled, the other threads continue to read from the second buffer.
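A sketch of the idea with hypothetical names; the synchronization around the swap is omitted for brevity.

```c
#include <stddef.h>

/* Double-buffering sketch (illustrative names, not EOCF's): workers
   drain buf[active] while the reader refills buf[1 - active]. */
typedef struct {
    unsigned char *buf[2];   /* two input buffers              */
    size_t         len[2];   /* bytes currently held by each   */
    int            active;   /* index workers are reading from */
} input_buffers;

/* Called when buf[active] is drained: the freshly filled buffer becomes
   active and the drained one is handed back to the reader for refilling. */
void swap_input_buffers(input_buffers *ib)
{
    ib->active ^= 1;         /* 0 <-> 1 */
}
```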

Output Buffers To comply with the decompression algorithm, sequential output had to be achieved. Based on empirical observation, we hold enough buffers so that each thread can write at least four blocks. The minimum number of buffers is two.

Buffer Size Based on our observations using a dual-core processor, the optimal buffer size is 16 KB: [chart]

Data Sharing and Alignment To eliminate false sharing, the following steps were taken: Moving as much shared data as possible into each thread's private data. Aligning shared arrays of data to the cache-line size when each individual element is accessed by a different thread. (A sketch of the second step follows.)
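A sketch of the alignment step in MSVC-flavored C, assuming a 64-byte cache line; the struct and field names are illustrative, not the project's.

```c
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each thread writes only its own slot; alignment and padding keep
   neighboring slots on different cache lines, avoiding false sharing. */
typedef __declspec(align(64)) struct {
    long blocks_done;                      /* written by one thread only */
    char pad[CACHE_LINE - sizeof(long)];   /* fill the rest of the line  */
} per_thread_stats;

static per_thread_stats stats[8];          /* one aligned slot per thread */
```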

Data Sharing and Alignment Runtime without cache alignment vs. runtime with cache alignment: [timing chart] Overall improvement of 0.5%.

SIMD SIMD optimizations were not implemented.

Optimization Achieved Using a dual-core processor, the ideal speed-up would be x2. Since we multithreaded only about 2/3 of the code, by Amdahl's law we could expect a speed-up of 1 / ((1 - 2/3) + (2/3) / 2) = x1.5.

Optimization Achieved We have achieved a speed-up of x1.47. Unavoidably, we lose time on managing and synchronizing threads.

Comparison to Other Intel Architectures We ran our program on 2 other computers: an Intel® Core™2 Quad and an Intel® Core™ i7. [chart: measured speed-ups of x1.47, x1.9, x2.17, and x1.96]

Thank You