Performance Tuning
John Black, CS 425, UNR, Fall 2000

Why Go Fast?
– Real-time systems
– Solve bigger problems
– Less waiting time
– Better simulations, video games, etc.
– Moore's Law: every 18 months, CPU speeds double at constant cost
  – Has held true for the last 30 years!

How Do We Go Faster?
– Find the bottlenecks!
  – Hardware problems
  – Bad algorithms/data structures
  – I/O bound (disk, network, peripherals)
– If none of these, we have to hand tune

Hand Tuning
– Use a "profiler"
  – 80% of the time is spent in 20% of the code
  – Find this 20% and tune it by hand
  – Do NOT waste time tuning code that is not a bottleneck
– How can we hand-tune?
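Before any hand tuning, it helps to measure. The sketch below is a minimal, hypothetical timing harness using clock() from the standard C library; do_work() is a placeholder for whatever routine you suspect is hot. A real profiler (for example, compiling with gcc -pg and running gprof) gives a per-function breakdown rather than a single number.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the routine suspected of being the hot spot. */
static void do_work(void)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 10000000L; i++)
        x += (double)i * 0.5;
}

int main(void)
{
    clock_t start, end;

    start = clock();
    do_work();
    end = clock();

    printf("CPU time: %.3f seconds\n",
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```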

Hand Tuning (cont.)
– Exploit architecture-dependent features (of course, this approach limits portability)
– We focus on the Pentium family
  – Memory system
  – Pipeline

Memory System
– Modern processors have a hierarchical memory system
  – Memory units nearer the processor are faster but smaller
– The hierarchy, from nearest to farthest:
  – Registers (about 8 32-bit registers)
  – Level 1 cache (32K)
  – Level 2 cache (256K or 512K on the P3)
  – Main memory

Common Memory-Related Bottlenecks
– Alignment: the requirement that an accessed object lie at an address that is a multiple of 16, 32, 64, etc.
– For the Pentium Pro, P2, and P3, there is no penalty for misaligned data (unless you cross a cache line boundary)
  – Cache lines are 32-byte cells in the caches
  – We always read 32 bytes at a time
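As an illustration (not from the original slides), the layout a C compiler chooses for a struct determines where each field lands relative to these boundaries. The sketch below simply prints the sizes and offsets the compiler picked; the struct names and field types are made up for the example.

```c
#include <stdio.h>
#include <stddef.h>

/* A char followed by a double usually forces padding so that the double
 * sits on its natural alignment boundary. */
struct scattered {
    char   tag;
    double value;
    char   flag;
};

/* Ordering fields from largest to smallest typically wastes less space
 * and keeps hot fields together on one cache line. */
struct packed_better {
    double value;
    char   tag;
    char   flag;
};

int main(void)
{
    printf("scattered:     size %lu, value at offset %lu\n",
           (unsigned long)sizeof(struct scattered),
           (unsigned long)offsetof(struct scattered, value));
    printf("packed_better: size %lu, value at offset %lu\n",
           (unsigned long)sizeof(struct packed_better),
           (unsigned long)offsetof(struct packed_better, value));
    return 0;
}
```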

Locality
– Since we fetch 32 bytes at a time, accessing memory sequentially (or in a small neighborhood) is efficient
  – This is called "spatial locality"
– Since cache contents are aged (with an LRU algorithm) and eventually kicked out, accessing items repeatedly is efficient
  – This is called "temporal locality"
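A standard illustration of spatial locality (a sketch, not taken from the slides) is traversing a two-dimensional C array in row order versus column order. C stores rows contiguously, so the row-order walk uses every byte of each 32-byte line it fetches, while the column-order walk touches a new line on almost every access.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];   /* C stores this row by row */

/* Row-order walk: consecutive accesses fall in the same cache line. */
double sum_by_rows(void)
{
    double s = 0.0;
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order walk: each access is N*sizeof(double) bytes from the last,
 * so nearly every access misses the line that was just fetched. */
double sum_by_cols(void)
{
    double s = 0.0;
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}
```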

Digression on Big-Oh
– We learn in algorithms class to be concerned only with asymptotic running time on a RAM machine
  – With new architectures, this may no longer be a good way to measure performance
  – An O(n lg n) algorithm may be FASTER than an O(n) algorithm due to architectural considerations
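A concrete way to see this (an illustrative sketch, not from the slides): summing n integers is O(n) whether they live in a contiguous array or in a linked list, yet the array version is usually far faster on a cached machine because its accesses are sequential while the list's nodes may be scattered across the heap.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* O(n), sequential accesses: excellent spatial locality. */
long sum_array(const int *a, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Also O(n), but each node may live anywhere in the heap, so the cache
 * helps much less; the constant factor can be dramatically worse. */
long sum_list(const struct node *p)
{
    long s = 0;
    while (p != NULL) {
        s += p->value;
        p = p->next;
    }
    return s;
}
```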

Pipelining
– Modern processors execute many instructions in parallel to increase speed
– Pipeline stages: Decode, Get Args, Execute, Retire
– An instruction may be getting decoded at the same time as another is getting its arguments and another is executing; this parallelism greatly speeds up the processor

Is a Pentium RISC or CISC?
– The Pentium instruction set is CISC (Complex Instruction Set Computer), but it is actually translated into micro-ops that run on a RISC (Reduced Instruction Set Computer) core under the hood
– The RISC instructions (micro-ops) are what actually get pipelined

How Does This Affect Tuning?
– If the pipeline is not fully utilized, we can improve performance by fixing this:
  – Reorder instructions to reduce dependencies and "pipeline stalls"
  – Avoid mispredicted branches with loop unrolling
  – Avoid function calls with inlining
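As a sketch of the first two ideas (my example, not the course's), consider a dot product. The plain loop has one branch per element and a single accumulator, so every addition depends on the previous one. Unrolling by four with two accumulators cuts the number of loop branches and shortens the dependency chain; an inlined or macro-expanded body would similarly remove call overhead. Whether this wins in practice depends on the compiler, which may already perform these transformations.

```c
/* Straightforward version: one accumulator, one loop branch per element,
 * and each add must wait for the previous add to finish. */
double dot_plain(const double *x, const double *y, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by four with two independent accumulators: fewer branches to
 * predict, and two adds can be in flight at once.  The cleanup loop
 * handles any leftover elements when n is not a multiple of four. */
double dot_unrolled(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s0 += x[i + 2] * y[i + 2];
        s1 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return s0 + s1;
}
```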

The Downside
– All this tuning makes code harder to write, maintain, debug, port, etc.
– Assembly language may be required, which has all of the above drawbacks