CSc 352 Performance Tuning
Saumya Debray, Dept. of Computer Science, The University of Arizona, Tucson

Background
Performance tuning: modifying software to make it more efficient
– often the performance metric is execution speed
– other metrics are also possible, e.g., memory footprint, response time, energy efficiency
How to get performance improvements
– "system tweaking" (e.g., compiler optimizations) can get some improvement; typically this is relatively small
– most large improvements are algorithmic in nature
  – needs active and focused human intervention
  – requires data to identify where to focus efforts

When to optimize?
1. Get the program working correctly
– calculating incorrect results quickly isn't useful
– "premature optimization is the root of all evil" – Knuth (?)
2. Determine whether performance is adequate
– optimization is unnecessary for many programs
3. Figure out what code changes are necessary to improve performance
Be cognizant of the possibility that performance tuning may be necessary later on: design and write the program with this in mind.

Compiler optimizations
Invoked using compiler options, e.g., "gcc -O2"
– usually several different optimization levels are supported (gcc: -O0 … -O3)
– compilers may also allow fine-grained control over individual optimizations; gcc supports ~200 optimization-related command-line options
They address machine-level inefficiencies, not algorithm-level inefficiencies
– e.g., gcc optimizations improve hardware register usage…
– … but they will not replace a sequential search over a long linked list (a small illustration follows)
Significant performance improvements usually need human intervention.
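
A minimal illustration of the distinction, as a hedged sketch (the functions below are hypothetical, not from the slides): an optimizing compiler can make the first loop faster by keeping values in registers, unrolling, or vectorizing, but no optimization flag turns the second loop's linear scan into a cheaper algorithm.

    #include <stddef.h>

    struct node { int key; struct node *next; };

    /* Machine-level inefficiency: with -O2, gcc keeps sum and i in
     * registers and may unroll or vectorize -- no source change needed. */
    long sum_array(const int *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Algorithm-level inefficiency: -O2 makes each iteration slightly
     * cheaper, but the search is still O(n); only an algorithmic change
     * (e.g., a hash table) fixes that. */
    struct node *find(struct node *head, int key) {
        for (struct node *p = head; p != NULL; p = p->next)
            if (p->key == key)
                return p;
        return NULL;
    }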

Example
(timing chart omitted) An overall improvement of about 10% from compiler optimization is not atypical, though it is possible to do better. The effect of compiler optimization is small if either the code is already highly optimized or the algorithm is lousy.

Where to optimize?
Consider a program with this execution time distribution (chart omitted):
– doubling the speed of func3 gives an overall improvement of 5%
– doubling the speed of func1 gives an overall improvement of 30%
Focusing on func1 gives better results for the time invested.
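
These numbers are consistent with a distribution in which func1 accounts for roughly 60% of total execution time and func3 for roughly 10% (an assumption here, since the chart is not reproduced). Doubling the speed of a component that takes a fraction f of the total time saves f × (1 − 1/2) = f/2 of the total: 0.60/2 = 30% for func1, and 0.10/2 = 5% for func3.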

Profiling tools
These are tools that:
– monitor the program's execution at runtime
– give data on how often routines are called and where the program spends its time
– provide guidance on where to focus one's efforts
Many different tools are available; we'll focus on two:
– gprof: connected to gcc
– kcachegrind: connected to valgrind

Using gprof
Compile using "gcc -pg"
– this adds some book-keeping code, so the program will run a little slower
Run this executable, say a.out, on "representative" inputs
– this creates a data file "gmon.out"
Run "gprof a.out"
– extracts information from gmon.out
– "flat profile": time and #calls info per function
– "call graph": time and #calls per function, broken down by each place where the function is called

Using gprof: example
(sample flat-profile output omitted) For each function, the columns of the flat profile report:
– % time: percentage of total time spent in the function
– self seconds: time accounted for by the function alone
– calls: number of times the function was called
– self μs/call: average time per call spent in the function itself
– total μs/call: average time per call spent in the function and its descendants

Using the profile information
Expect %time and self-seconds to correlate.
If self μs/call is high [or: self-seconds is high and calls is low]:
– each call is expensive; the overhead is due to the code of the function itself
If calls is high and self μs/call is low:
– each call is inexpensive; the overhead is mainly due to the number of function calls
If self μs/call is low and total μs/call is high:
– each call is expensive, but the overhead is due to some descendant routine

Examining the possibilities 1
Code for the function is expensive [self μs/call is high]:
– need to get a better idea of where time is being spent in the function body
– it may help to pull parts of the function body into separate functions (see the sketch below)
  – allows more detailed profile information
  – the pieces can be "inlined back" after performance optimization
Optimization approach:
– reduce the cost of the common-case execution path through the function
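
A minimal sketch of this refactoring, with hypothetical names (Record and normalize_name are illustrative, not from the slides): the suspect loop is pulled out into its own function so that gprof reports its time separately from the rest of the caller.

    #include <ctype.h>
    #include <string.h>

    typedef struct { char name[64]; int id; } Record;   /* hypothetical type */

    /* The suspect loop, pulled into its own function so the profiler
     * attributes its time separately; it can be "inlined back" (or made
     * static inline) once tuning is done. */
    static void normalize_name(Record *r) {
        for (size_t i = 0; i < strlen(r->name); i++)     /* deliberately naive */
            r->name[i] = (char) tolower((unsigned char) r->name[i]);
    }

    void process_record(Record *r) {
        if (r->id < 0)          /* cheap validation stays in the caller */
            return;
        normalize_name(r);      /* the hot part is now profiled on its own */
    }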

Examining the possibilities 2
No. of calls to a function is the problem [calls is high but self μs/call is low]:
– need to reduce the number/cost of calls
– possible approaches:
  [best] avoid the call entirely whenever possible, e.g.:
  – use hashing to reduce the set of values to be considered; or
  – see if the call can be avoided in the common case (e.g., by maintaining some extra information; see the sketch below)
  reduce the cost of making the call:
  – inline the body of the called function into the caller
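
One way to "maintain some extra information" so that a call becomes unnecessary, as a hedged sketch (the list type and function names are hypothetical): keep a count alongside a linked list so that asking for its length never requires an O(n) traversal.

    #include <stdlib.h>

    struct node { int val; struct node *next; };

    struct list {
        struct node *head;
        size_t len;            /* extra information maintained on every insert */
    };

    void list_push(struct list *l, int v) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return;            /* error handling kept minimal for the sketch */
        n->val = v;
        n->next = l->head;
        l->head = n;
        l->len++;              /* keep the count up to date */
    }

    size_t list_length(const struct list *l) {
        return l->len;         /* O(1) instead of walking the whole list */
    }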

Examining the possibilities
Often, a performance improvement will involve a tradeoff, e.g.:
– transforming linear search into binary search:
  – reduces the no. of values considered in the search
  – requires the data to be sorted
– transforming a simple linked list into a hash table:
  – reduces the no. of values considered when searching
  – requires more memory (the hash table) and some computation (hash values)
Need to be aware of these tradeoffs.

Approaching performance optimization
Different problems may require very different solutions.
Essential ideas:
– avoid unnecessary work whenever possible
– prefer cheap operations to expensive ones
Apply these ideas at all levels:
– library routines used
– language-level operations (e.g., function calls vs. macros)
– higher-level algorithms

Optimization 1: Filtering
Useful when:
– we are searching a large collection of items, most of which don't match the search criteria
– determining whether a particular item matches is expensive
– there is a (relatively) cheap check such that any item failing the check definitely does not match
What we do:
– use the cheap check to quickly disqualify items that won't match
– effectiveness depends on how many items get disqualified

Filtering
Hashing
– particularly useful for strings (but not restricted to them)
– can give order-of-magnitude performance improvements
– sensitive to the quality of the hash function
Binary search
– knowing that the data items are sorted allows us to quickly exclude many of them that won't match

Filters can apply to complex structures
In a research project, we were searching through a large no. of code fragments looking for repetition:
– the code was in the compiler's internal form (a directed graph), not source code
– we used a 64-bit "fingerprint" of each code region as a filter:
  – 16 bits: size of the region
  – 48 bits: type and size of the first 8 code blocks in the region (6 bits per block: 2 bits for the type, 4 bits for the no. of instructions)

Optimization 2: Buffering
Useful when:
– an expensive operation is being applied to a large no. of items
– the operation can also be applied collectively to a group of items
What we do:
– collect the items into groups
– apply the operation to groups instead of individual items (see the sketch below)
Most often used for I/O operations.
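
A minimal sketch of buffering applied to output, assuming the POSIX write() system call; the names put_byte and flush_buf are illustrative. Instead of paying for one system call per byte, bytes are collected and written a group at a time.

    #include <stddef.h>   /* size_t */
    #include <unistd.h>   /* write(), STDOUT_FILENO */

    #define BUFSZ 4096

    static char buf[BUFSZ];
    static size_t used = 0;

    /* Apply the expensive operation (a system call) to a whole group of bytes. */
    void flush_buf(void) {
        if (used > 0) {
            (void) write(STDOUT_FILENO, buf, used);   /* error handling omitted */
            used = 0;
        }
    }

    /* Collect individual items (bytes) into the group. */
    void put_byte(char c) {
        if (used == BUFSZ)
            flush_buf();
        buf[used++] = c;
    }

This is essentially what the stdio library's own buffering does behind putchar() and fwrite().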

Optimization technique 3: Precomputation
Useful when:
– a result can be computed once and reused many times
– we can predict which results will be needed
– we can look up a saved result cheaply
What we do:
– identify operations that get executed over and over
– compute the result ahead of time and save it
– use the saved result later in the program
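
A minimal sketch of precomputation (a hypothetical example, not from the slides): build a table of bit counts for every byte value once, then answer later queries with cheap table lookups.

    #include <stdint.h>

    static uint8_t bits_in_byte[256];    /* computed once, reused many times */

    /* Compute the results ahead of time (e.g., at program startup). */
    void init_popcount_table(void) {
        for (int v = 0; v < 256; v++) {
            int n = 0;
            for (int b = v; b != 0; b >>= 1)
                n += b & 1;
            bits_in_byte[v] = (uint8_t) n;
        }
    }

    /* Later uses are cheap lookups instead of repeated bit-counting loops. */
    int popcount32(uint32_t x) {
        return bits_in_byte[x & 0xff]
             + bits_in_byte[(x >> 8) & 0xff]
             + bits_in_byte[(x >> 16) & 0xff]
             + bits_in_byte[(x >> 24) & 0xff];
    }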

Optimization 3: Caching
Useful when:
– we repeatedly perform an expensive operation
– there is a cheap way to check whether a computation has been done before
What we do:
– keep a cache of computations and their results; reuse a result if it is already in the cache
Differences from precomputation:
– caches usually have a limited size
– the cache may need to be emptied if it fills up
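
A minimal sketch of a small, fixed-size cache in front of a hypothetical expensive_op(); the direct-mapped eviction (one slot per hash value, overwritten on collision) is just one simple policy.

    #include <stdbool.h>

    #define CACHE_SIZE 64

    struct entry { bool valid; int arg; long result; };
    static struct entry cache[CACHE_SIZE];

    extern long expensive_op(int arg);    /* hypothetical costly computation */

    long cached_op(int arg) {
        struct entry *e = &cache[(unsigned) arg % CACHE_SIZE];
        if (e->valid && e->arg == arg)    /* cheap check: done this before? */
            return e->result;
        long r = expensive_op(arg);       /* expensive path */
        e->valid = true;                  /* unlike precomputation, the cache  */
        e->arg = arg;                     /* fills on demand, and an old entry */
        e->result = r;                    /* is evicted when slots collide     */
        return r;
    }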

Optimization 4: Using cheaper operations
Macros vs. functions
– sometimes it may be cheaper to write a code fragment as a macro than as a function
– a macro does not incur the cost of function call/return
– but macro arguments may be evaluated multiple times:

    #define foo(x, y, z)   … x … y … x … y … x … y … z … x … y …

    foo(e1, e2, e3) expands to   … e1 … e2 … e1 … e2 … e1 … e2 … e3 … e1 … e2 …

Function inlining
– conceptually similar to (but slightly different from) macros
– replaces a call to a function by a copy of the function body, eliminating function call/return overhead
A concrete sketch follows.
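
A hedged sketch of the tradeoff (SQUARE_M, square_f, and next_value are illustrative names): the macro avoids call/return overhead but evaluates its argument once per occurrence in the body; the inline function also avoids the overhead (under optimization) and evaluates the argument exactly once.

    /* Macro version: no call/return overhead, but the argument text is
     * substituted -- and therefore evaluated -- at every occurrence. */
    #define SQUARE_M(x) ((x) * (x))

    /* Inline-function version: the compiler can also remove the
     * call/return overhead, and the argument is evaluated exactly once. */
    static inline long square_f(long x) { return x * x; }

    static int counter = 0;
    static long next_value(void) { return ++counter; }   /* has a side effect */

    long demo(void) {
        long a = SQUARE_M(next_value());  /* next_value() runs twice: a == 1 * 2 */
        long b = square_f(next_value());  /* next_value() runs once:  b == 3 * 3 */
        return a + b;
    }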

Optimization 4: Using cheaper operations (continued; example figure omitted)

Hashing and Filtering
Many computations involve looking through data to find the items that have some property:

    for each data item X {
        if (X has the property) {
            process X
        }
    }

Total cost = no. of data items × cost of checking each item
This can be expensive if the no. of items is large and/or checking for the property is expensive.
Hashing and filtering can be used to reduce the cost of checking.

Filtering: Basic Idea
Given:
– a set of items S
– some property P
Find:
– a function h such that
  1. h() is easy to compute
  2. h(x) says something useful about whether x has property P
Goal: (cheaply) reduce the no. of items to process.

Filtering: Examples
isPrime(n):
– full test: check whether any integer between 2 and n-1 divides n
– filter: n == 2 or n is odd (filters out the even numbers > 2)
Equality of two strings s1 and s2:
– full test: strcmp(s1, s2)
– filter: s1[0] == s2[0]   (a C rendering appears below)
Further examples to consider: isDivisibleBy3(n); s1 and s2 are anagrams.
The filter depends on the property we're testing!
It must be a necessary condition: (∀x)[ ¬filter(x) ⟹ ¬full_test(x) ]
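
A minimal C rendering of the string-equality filter above (same_string is an illustrative name): the cheap first-character check disqualifies most non-matching strings before the full comparison is made.

    #include <stdbool.h>
    #include <string.h>

    /* Cheap filter followed by the full test: if the first characters
     * differ, the strings cannot be equal, so strcmp() is skipped. */
    bool same_string(const char *s1, const char *s2) {
        if (s1[0] != s2[0])           /* filter: necessary condition for equality */
            return false;
        return strcmp(s1, s2) == 0;   /* full test, only for items that pass */
    }

The same pattern pays off most when the full test is much more expensive than a single strcmp().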

Hashing
Conceptually related to filtering.
Basic idea: given a set of items S and a property P:
– use a hash function h() to divide the set S into a number of "buckets"
  – usually, h() maps S to integers (natural numbers)
– h(x) == h(y) means x and y are in the same bucket
  – if x and y fall in the same bucket, they may share the property P (need to check)
  – if x and y are in different buckets, they definitely don't share the property P (no need to check)

Hashing: An Implementation
(diagram omitted: a hash table of n buckets, indexed 0 … n-1, each bucket holding a list of items)
– compute a hash function h() where h(x) ∈ {0, …, n-1}
– use h(x) to index into the appropriate bucket
– search/insert only in that bucket
A sketch follows.
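
A minimal sketch of this scheme for strings, using separate chaining; the bucket count and the hash function are simple illustrative choices, not prescribed by the slides.

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct bucket_node {
        char *key;
        struct bucket_node *next;
    };

    static struct bucket_node *table[NBUCKETS];   /* hash table: n buckets */

    /* Hash function h() mapping a string into {0, ..., NBUCKETS-1}. */
    static unsigned hash(const char *s) {
        unsigned h = 0;
        while (*s != '\0')
            h = h * 31 + (unsigned char) *s++;
        return h % NBUCKETS;
    }

    /* Search: only the items in one bucket need to be checked. */
    struct bucket_node *lookup(const char *key) {
        for (struct bucket_node *p = table[hash(key)]; p != NULL; p = p->next)
            if (strcmp(p->key, key) == 0)
                return p;
        return NULL;
    }

    /* Insert: prepend to the appropriate bucket (no duplicate check here). */
    struct bucket_node *insert(const char *key) {
        unsigned b = hash(key);
        size_t len = strlen(key) + 1;
        struct bucket_node *p = malloc(sizeof *p);
        if (p == NULL)
            return NULL;
        p->key = malloc(len);
        if (p->key == NULL) {
            free(p);
            return NULL;
        }
        memcpy(p->key, key, len);
        p->next = table[b];
        table[b] = p;
        return p;
    }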

Performance Tuning: Summary
Big improvements come from algorithmic changes
– but don't ignore code-level issues (e.g., using cheaper operations)
Use profiling to understand performance behavior
– where to focus efforts
– reasons for performance overheads
Figure out how to transform the program based on the nature of the overheads.
Good design and modularization are essential.