1
Performance Tuning
John Black
CS 425, UNR, Fall 2000
2
Why Go Fast?
– Real-time systems
– Solve bigger problems
– Less waiting time
– Better simulations, video games, etc.
– Moore's Law: every 18 months, CPU speed doubles at constant cost
  – Has held true for the last 30 years!
3
How Do We Go Faster?
Find the bottlenecks!
– Hardware problems
– Bad algorithms/data structures
– I/O bound (disk, network, peripherals)
If none of these, we have to hand-tune.
4
Hand Tuning
Use a "profiler"
– 80% of the time is spent in 20% of the code
– Find this 20% and tune it by hand
– Do NOT waste time tuning code that is not a bottleneck
How can we hand-tune?
5
Hand Tuning (cont.)
Exploit architecture-dependent features (of course, this approach limits portability).
We focus on the Pentium family:
– Memory system
– Pipeline
6
Memory System
Modern processors have a hierarchical memory system.
– Memory units nearer the processor are faster but smaller:
  – Registers (about 40 32-bit registers)
  – Level 1 cache (32K)
  – Level 2 cache (P3 has 256K or 512K)
  – Main memory
7
Common Memory-Related Bottlenecks
Alignment: the requirement that an accessed object lie at an address that is a multiple of its size (16-bit, 32-bit, 64-bit objects, etc.).
On the Pentium Pro, P2, and P3, there is no penalty for misaligned data unless the access crosses a cache-line boundary.
– Cache lines are 32-byte cells in the caches
– We always read 32 bytes at a time
8
Locality
Since we fetch 32 bytes at a time, accessing memory sequentially (or within a small neighborhood) is efficient.
– This is called "spatial locality"
Since cache contents are aged (with an LRU algorithm) and eventually kicked out, accessing the same items repeatedly is efficient.
– This is called "temporal locality"
9
Digression on Big-Oh
We learn in algorithms class to be concerned only with the asymptotic running time on a RAM machine.
– With new architectures this may no longer be a good way to measure performance
– An O(n lg n) algorithm may be FASTER than an O(n) algorithm due to architectural considerations
10
Pipelining
Modern processors execute many instructions in parallel to increase speed:
Decode → Get Args → Execute → Retire
One instruction may be getting decoded at the same time another is getting its arguments and a third is executing; this parallelism greatly speeds up the processor.
11
Is a Pentium RISC or CISC?
The Pentium instruction set is CISC (Complex Instruction Set Computer), but it actually translates into micro-ops and runs on a RISC (Reduced Instruction Set Computer) core under the sheets.
The RISC instructions (micro-ops) are what is actually pipelined.
12
How Does This Affect Tuning?
If the pipeline is not fully utilized, we can improve performance by fixing this:
– Reorder instructions to reduce dependencies and "pipeline stalls"
– Avoid mispredicted branches with loop unrolling
– Avoid function-call overhead with inlining
13
The Downside
All this tuning makes code harder to write, maintain, debug, and port.
Assembly language may be required, which has all of the above drawbacks.