Software performance enhancement using multithreading and architectural considerations Prepared by: Andrey Sloutsman Evgeny Gokhfeld 06/2006.

Chosen application
Archer Smart Password Recovery Tool
Download Page:

Archer
The Archer application is intended to recover a lost password that was used to garble an ARJ archive. The ARJ archive program uses a Huffman-table search-and-substitute algorithm, which is expected to shrink the size of the file being archived.

Algorithm Description
The program operates as follows:
- The input file is read.
- The smallest garbled file inside the archive is selected.
- Passwords are tried iteratively until the CRC32 computed for the tried password matches the CRC32 of the stored file.

Algorithm Description (cont.)
The ARJ archive program uses the following technique to garble the produced archive:
- Compress the file(s) as usual.
- XOR the resulting contents with the password, which is chained as necessary to match the length of the compressed data.
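
The garbling step above can be sketched in C as follows. This is a minimal illustration of "XOR with a chained password" only; the function name `xor_with_chained_password` is hypothetical, and any further detail of the real ARJ format is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* XOR `len` bytes of `data` in place with `password`, repeating ("chaining")
 * the password as often as needed to cover the data.
 * Hypothetical helper: it illustrates the slide, not the actual ARJ code. */
static void xor_with_chained_password(unsigned char *data, size_t len,
                                      const unsigned char *password,
                                      size_t password_len)
{
    assert(password_len > 0);
    for (size_t i = 0; i < len; i++) {
        /* Byte i of the data meets byte (i mod password_len) of the password. */
        data[i] ^= password[i % password_len];
    }
}
```

Because XOR is its own inverse, calling the function a second time with the same password restores the original bytes, which is why a password trial can reuse the same routine for decrypting.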

Method 1,2,3 vs. 4
- Methods 1, 2, 3
  - Differ only by the dynamic dictionary
  - Different maximal depth influences compression
  - The same decompressing procedure
  - Possibility to employ sanity check heuristics
  - Skipping passwords at large speeds
- 4th method
  - Fast compression: fixed dictionary size
  - No shortcuts or sanity checks
  - Each trial leads to CRC32 calculation of the data
  - Slow password rate for large files

Optimization Steps
- 32-bit variables
  - Original code was written with 16-bit variables, which slows down the Pentium
  - The majority of variables were converted to 32 bit
  - Some variables and buffers remained 16 bit
    - Those which inherently must be 16 bit for algorithmic reasons (overflow, shifts etc.)

Optimization Steps
- Power Buffer Unwinding
  - Buffers for constant data were created dynamically
    - Certain combinations of powers of 2
  - Those were hard-coded in the program
  - Several parameters to procedures were suppressed
  - One procedure was rewritten and split in 2

Optimization Steps
- Threading
  - Original code was single-threaded
  - Virtually no dependence between password trials
    - As many workers as possible can be launched
    - The only interaction point is password incrementing
  - Every worker has its own local storage
  - The shared data is global

Threading (cont.)
- Our original threading scheme
  - Main Thread: initializes the Workers and the Show Progress (SP) thread, then waits for the workers to finish
  - Worker №1, Worker №2: each one calls "Increment password" to obtain its next trial
  - Show Progress Thread: shows global data every 1 sec., then goes to sleep
- The only critical section (CS) is "Increment password"
- Some fake data races were reported by the Thread Checker

Threading (cont.)
- Threading scheme, revisited
  - Main Thread: initializes the Workers and the Show Progress (SP) thread, then waits for the workers to finish
  - Worker №1: increments the password by Workers Count and continues independently
  - Worker №2: increments the password by Workers Count and continues independently
  - Show Progress Thread: shows global data every 1 sec., then goes to sleep
- Best suited for methods 1,2,3
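
One way to picture the revisited scheme is to number the candidate passwords and let each worker walk that numbering with a stride equal to the workers count, so no critical section is needed. The sketch below is an assumption about how such partitioning could look, not the presenters' code: `index_to_password`, `worker_candidate`, the fixed password length and the character set are all hypothetical, and thread creation is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the `index`-th password of fixed length `len` over `charset`
 * into `out` (NUL-terminated). The rightmost character changes fastest. */
static void index_to_password(uint64_t index, const char *charset,
                              size_t charset_len, size_t len, char *out)
{
    assert(charset_len > 0);
    for (size_t i = 0; i < len; i++) {
        out[len - 1 - i] = charset[index % charset_len];
        index /= charset_len;
    }
    out[len] = '\0';
}

/* Index of the k-th candidate tried by worker `worker_id` when
 * `worker_count` workers share the search space: the worker starts at
 * its own id and then advances by the workers count on every trial. */
static uint64_t worker_candidate(unsigned worker_id, unsigned worker_count,
                                 uint64_t k)
{
    assert(worker_id < worker_count);
    return (uint64_t)worker_id + k * worker_count;
}
```

With two workers over the character set "abc" and length 2, worker 0 would try indices 0, 2, 4, ... and worker 1 would try 1, 3, 5, ..., so the two sequences never overlap and together cover every candidate.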

Optimization Steps
- Optimizing CRC32
  - Practically, influences only the 4th method
  - Rewritten using 4 pre-generated polynomial value tables
  - Calculation is done in buckets of 4 bytes
  - Instead of iteratively calculating CRC32 with each byte, the bucket values are combined
  - The performance of the CRC32 algorithm improves by approximately a factor of 2

Optimizing CRC32 (cont.)
Original pseudo-code:

    void calccrc(BYTE *buf, int count)
    {
        /* One table lookup per input byte; crc32 is the global running CRC. */
        while (count--) {
            crc32 = (crc32 >> 8) ^ crctbl[(BYTE)crc32 ^ *buf];
            buf++;
        }
    }

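Optimizing CRC32 (cont.)
Optimized pseudo-code:

    /* Consume 4 input bytes with one lookup in each of the 4 tables.
       Assumes a little-endian CPU and 32-bit words. */
    #define DO4 c ^= *buf4++; \
        c = crc_table[3][c & 0xff] ^ crc_table[2][(c >> 8) & 0xff] ^ \
            crc_table[1][(c >> 16) & 0xff] ^ crc_table[0][c >> 24]
    #define DO32 DO4; DO4; DO4; DO4; DO4; DO4; DO4; DO4

    void calccrc(BYTE *buf, int count)
    {
        const DWORD *buf4 = (const DWORD *)buf;
        DWORD c = crc32;            /* global running CRC, as in the original */
        while (count >= 32) {
            DO32;
            count -= 32;
        }
        /* Handle the remainder byte by byte, as in the original loop. */
        buf = (BYTE *)buf4;
        while (count--)
            c = (c >> 8) ^ crc_table[0][(BYTE)c ^ *buf++];
        crc32 = c;
    }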
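
For readers who want to try the 4-table idea outside the Archer sources, here is a self-contained sketch. It assumes the standard reflected CRC-32 polynomial 0xEDB88320 with the usual initial and final inversion; the names `crc_init_tables`, `crc32_bytewise` and `crc32_by4` are hypothetical. Unlike the pseudo-code above, it assembles each 32-bit word from individual bytes, so it does not depend on the CPU's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[4][256];

/* Table 0 is the classic byte-wise table; table t extends table t-1
 * by one more zero byte of input. Must be called before the CRC functions. */
static void crc_init_tables(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[0][n] = c;
    }
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = crc_table[0][n];
        for (int t = 1; t < 4; t++) {
            c = crc_table[0][c & 0xff] ^ (c >> 8);
            crc_table[t][n] = c;
        }
    }
}

/* Reference version: one lookup per byte. */
static uint32_t crc32_bytewise(uint32_t crc, const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    crc = ~crc;
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *buf++) & 0xff];
    return ~crc;
}

/* 4 bytes per step: four independent lookups combined with XOR. */
static uint32_t crc32_by4(uint32_t crc, const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    crc = ~crc;
    while (len >= 4) {
        crc ^= (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
               (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
        crc = crc_table[3][crc & 0xff] ^ crc_table[2][(crc >> 8) & 0xff] ^
              crc_table[1][(crc >> 16) & 0xff] ^ crc_table[0][crc >> 24];
        buf += 4;
        len -= 4;
    }
    while (len--)  /* remaining 0..3 bytes */
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *buf++) & 0xff];
    return ~crc;
}
```

Both functions should return the same value for any input. The gain comes from the four lookups in one step being independent of each other, whereas the byte-wise loop has to finish each lookup before starting the next.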

Optimization Steps
- Using SIMD instructions for decrypting
  - LOAD 16 bytes from the source file
  - XMM1: 16 bytes of data
  - XMM2: 16 bytes of chained password
  - XMM3: 16 bytes of constant
  - XOR
  - STORE 16 bytes to the decrypted file
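
A load / XOR / store loop of this kind can be sketched with SSE2 intrinsics. This is an illustration under assumptions, not the presenters' code: the name `xor_blocks_sse2` is hypothetical, the same 16-byte key block is reused for every data block (which is only right when the password length divides 16), and the constant operand from the slide is left out because the slides do not say what it holds.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics, x86 / x86-64 only */
#include <stddef.h>

/* XOR `blocks` 16-byte blocks of `src` with the 16-byte `key16`
 * and write the result to `dst`. Unaligned loads and stores are used,
 * so the buffers need no special alignment. */
static void xor_blocks_sse2(unsigned char *dst, const unsigned char *src,
                            const unsigned char key16[16], size_t blocks)
{
    assert(dst != NULL && src != NULL && key16 != NULL);
    const __m128i key = _mm_loadu_si128((const __m128i *)key16);
    for (size_t i = 0; i < blocks; i++) {
        __m128i data = _mm_loadu_si128((const __m128i *)(src + 16 * i));
        _mm_storeu_si128((__m128i *)(dst + 16 * i), _mm_xor_si128(data, key));
    }
}
```

When the password length does not divide 16, the key register has to be refreshed between blocks so that the chained password continues where the previous block stopped.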

Optimization Steps
- Using SIMD instructions for password maintenance
  - XMM1 holds the chained password for the current 16 bytes: abcabcabcabcabca
  - 1) Shift XMM1 right by 16 % password_length bytes
    - XMM1: 0bcabcabcabcabca
  - 2) Copy the current password string to another register
    - XMM2: 0bcabcabcabcabca
  - 3) Shift the copy left by 16 - (16 % password_length) bytes
    - XMM2: b (remaining bytes are zero)
  - 4) XMM1 = XMM1 OR XMM2
    - XMM1: bbcabcabcabcabca
  - Now XMM1 contains the chained password for the next 16 bytes of data
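
The four register steps amount to rotating the 16-byte key block by 16 % password_length bytes. The scalar model below mirrors those steps on a plain byte array so they can be checked without SIMD hardware. It is a sketch under assumptions: `next_key_block` is a hypothetical name, the bytes are kept in memory order (byte 0 meets the first data byte), the shift counts are in bytes, and the password is assumed to be at most 16 bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* `key` holds the chained password for the current 16 data bytes.
 * On return it holds the chained password for the next 16 data bytes. */
static void next_key_block(unsigned char key[16], size_t password_len)
{
    assert(password_len >= 1 && password_len <= 16);
    const size_t r = 16 % password_len;
    unsigned char shifted[16] = {0};
    unsigned char wrapped[16] = {0};

    /* Step 1: byte shift that drops the first r bytes and zero-fills the end. */
    for (size_t i = 0; i + r < 16; i++)
        shifted[i] = key[i + r];

    /* Steps 2 and 3: copy the shifted block and move its first r bytes
     * to the last r positions; everything else in the copy is zero. */
    for (size_t i = 0; i < r; i++)
        wrapped[16 - r + i] = shifted[i];

    /* Step 4: OR the two blocks together. */
    for (size_t i = 0; i < 16; i++)
        key[i] = shifted[i] | wrapped[i];
}
```

For the password "abc" the current block "abcabcabcabcabca" becomes "bcabcabcabcabcab" in memory order, which is exactly how the repeated password continues at byte 16 of the data.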

Optimization Steps
- Limited Buffers
  - Allocating memory for the whole file causes cache misses
  - Small buffers cause an overhead penalty
  - Large buffers cause a cache-miss penalty
  - The best buffer size found is 128K

Optimization Steps
- Compilation by Intel Compiler
  - Method 1, 2, 3: penalty of 9.26%
  - Method 4: boost of 18.44%

Results (Times)

Results (Boost)