Instructor: Erol Sahin
Hyperthreading, Multi-core Architectures, Amdahl’s Law and Review
CENG331: Introduction to Computer Systems, 14th Lecture
Acknowledgement: Some of the slides are adapted from the ones prepared by R.E. Bryant and D.R. O’Hallaron of Carnegie-Mellon Univ.
– 2 – Computation power is never enough
Our machines are a million times faster than ENIAC, yet we ask for more:
  To make sense of the universe
  To build bombs
  To understand climate
  To unravel the genome
Up to now, the solution was easy: make the clock run faster.
– 3 – Hitting the limits…
No more. We’ve hit the fundamental limits on clock speed:
  Speed of light: 30 cm/nsec in vacuum and 20 cm/nsec in copper
  In a 10 GHz machine, signals cannot travel more than 2 cm in total per clock cycle
  In a 100 GHz machine, at most 2 mm, just to get the signal from one end to the other and back
We can make computers as small as that, but then we face other problems:
  They are more difficult to design and verify
  Heat dissipation is a fundamental problem: coolers are already bigger than the CPUs
Going from 1 MHz to 1 GHz was simply a matter of engineering the chip fabrication process; going from 1 GHz to 1 THz is not.
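The distance figures on this slide follow from multiplying the signal speed by the clock period. A minimal sketch of that arithmetic, taking the slide's 20 cm/nsec figure for copper as given; the function name `max_signal_distance_cm` is made up for illustration:

```python
def max_signal_distance_cm(clock_hz, speed_cm_per_ns=20.0):
    """Distance a signal can cover during one clock cycle, in centimetres.

    speed_cm_per_ns defaults to the slide's figure for copper (an input
    assumption, not a measured value).
    """
    period_ns = 1e9 / clock_hz  # length of one cycle in nanoseconds
    return speed_cm_per_ns * period_ns


for ghz in (1, 10, 100):
    distance = max_signal_distance_cm(ghz * 1e9)
    print(f"{ghz:>3} GHz: {distance:g} cm per cycle")
```

At 10 GHz the budget comes out at about 2 cm and at 100 GHz at about 0.2 cm (2 mm), matching the slide.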
– 4 – Parallel computers
New approach: build parallel computers
  Each computer has a CPU running at normal speed
  Systems with 1000 CPUs are already available
The big challenge:
  How to utilize parallel computers to achieve speedups
    Programming has been mostly a “sequential business”
    » Parallel programming remained exotic
  How to share information among different computers
  How to coordinate processing
  How to create/adapt OSs
Superscalar Architectures, Simultaneous Multi-threading
Acknowledgement: Some of the slides are adapted from the ones prepared by Neil Chakrabarty and William May
– 6 – Threading Algorithms
Time-slicing
  A processor switches between threads at fixed time intervals.
  High overhead, especially if one of the processes is in the wait state.
Switch-on-event
  Task switching in case of long pauses.
  While one process waits for data coming from a relatively slow source, CPU resources are given to other processes.
– 7 – Threading Algorithms (cont.)
Multiprocessing
  Distribute the load over many processors.
  Adds extra cost.
Simultaneous multi-threading
  Multiple threads execute on a single processor without switching.
  Basis of Intel’s Hyper-Threading technology.
– 8 – Hyper-Threading Concept
At any point in time, only a part of the processor’s resources is used to execute the program code.
The unused resources can be put to work too, for example by executing another thread or application in parallel.
Extremely useful in desktop and server applications where many threads are used.
– 10 – Hyper-Threading Architecture
First used in the Intel Xeon MP processor.
Makes a single physical processor appear as multiple logical processors.
Each logical processor has its own copy of the architecture state.
Logical processors share a single set of physical execution resources.
– 11 – Hyper-Threading Architecture (cont.)
Operating systems and user programs can schedule processes or threads to logical processors as if they were in a multiprocessing system with physical processors.
From an architecture perspective, we have to worry about the logical processors using shared resources: caches, execution units, branch predictors, control logic, and buses.
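Because the operating system treats every logical processor as a schedulable CPU, a program can simply ask how many of them it sees. A small sketch using Python's standard `os.cpu_count()`, which reports logical CPUs, so on a Hyper-Threading machine the number is typically larger than the count of physical cores; the helper name `logical_cpu_count` is illustrative only:

```python
import os


def logical_cpu_count():
    """Number of logical CPUs the OS reports, or 1 if it cannot tell."""
    # os.cpu_count() may return None when the count is undetermined.
    return os.cpu_count() or 1


print("logical processors visible to the OS:", logical_cpu_count())
```

The standard library does not report physical cores separately, so this sketch cannot tell whether the logical processors come from Hyper-Threading or from additional cores.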
– 12 – Advantages
The extra architecture adds only about 5% to the total die area.
No performance loss if only one thread is active.
Increased performance with multiple threads.
Better resource utilization.
– 13 – Disadvantages
To take advantage of hyper-threading, serial execution cannot be used: the work has to be split across threads.
Threads are non-deterministic and involve extra design effort.
Threads have increased overhead.
Multi-Core Processors
Acknowledgement: Some of the slides are adapted from the ones prepared by Neil Chakrabarty and William May
– 15 – Multicore chips
Recall Moore’s law: the number of transistors that can be put on a chip doubles about every 18 months. It still holds!
  300 million transistors on an Intel Core 2 Duo class chip
Question: what do you do with all those transistors?
  Increase the cache? We already have 4 MB caches, so the performance gain is small.
  Another option: put two or more cores on the same chip (technically, the same die)
Dual-core and quad-core chips are already on the market.
80-core chips have been manufactured. More will follow.
– 16 – Multicore Chips
Figure: (a) A quad-core chip with a shared L2 cache (Intel style). (b) A quad-core chip with separate L2 caches (AMD style).
Source: Tanenbaum, Modern Operating Systems 3e, (c) 2008 Prentice-Hall, Inc. All rights reserved.
Shared L2:
  Good for sharing resources
  Greedy cores may hurt the performance of others
– 18 – Comparison: SMT versus Multi-core
Multi-core
  Several cores, each designed to be smaller and not too powerful
  Great thread-level parallelism
SMT
  One large, powerful superscalar core
  Great performance on running a single thread
  Exploits instruction-level parallelism
– 19 – Cloud Computing and Server Farms
Cloud computing
Server farms
Amdahl’s Law and Review
Acknowledgement: Some of the slides are adapted from the ones prepared by R.E. Bryant and D.R. O’Hallaron of Carnegie-Mellon Univ.
– 21 – Performance Evaluation
This week:
  Performance evaluation
  Amdahl’s law
  Review
– 22 – Problem
You plan to visit a friend in Cannes, France, and must decide whether it is worth taking the Concorde SST instead of a 747 from NY (New York) to Paris, assuming it will take 4 hours from LA (Los Angeles) to NY and 4 hours from Paris to Cannes.
– 23 – Amdahl’s Law
For the trip from LA to Cannes, with 4 hours LA (Los Angeles) to NY (New York) and 4 hours Paris to Cannes:

plane   time NY->Paris   total trip time   speedup over 747
747     8.5 hours        16.5 hours        1
SST     3.75 hours       11.75 hours       1.4

Taking the SST (which is about 2.3 times faster on the NY to Paris leg, 8.5 / 3.75) speeds up the overall trip by only a factor of 1.4!
– 24 – Speedup
Old program (unenhanced), old time: $T = T_1 + T_2$
  $T_1$ = time that can NOT be enhanced.
  $T_2$ = time that can be enhanced.
New program (enhanced), new time: $T' = T_1' + T_2'$
  $T_1' = T_1$
  $T_2' \le T_2$, where $T_2'$ = time after the enhancement.
Speedup: $S_{overall} = T / T'$
– 25 – Computing Speedup
Two key parameters:
  $F_{enhanced} = T_2 / T$ (fraction of original time that can be improved)
  $S_{enhanced} = T_2 / T_2'$ (speedup of enhanced part)
Derivation:
  $T' = T_1' + T_2'$
  $= T_1 + T_2'$
  $= T(1 - F_{enhanced}) + T_2'$
  $= T(1 - F_{enhanced}) + T_2 / S_{enhanced}$   [by def of $S_{enhanced}$]
  $= T(1 - F_{enhanced}) + T \cdot F_{enhanced} / S_{enhanced}$   [by def of $F_{enhanced}$]
  $= T\left((1 - F_{enhanced}) + F_{enhanced} / S_{enhanced}\right)$
Amdahl’s Law:
  $S_{overall} = T / T' = \dfrac{1}{(1 - F_{enhanced}) + F_{enhanced} / S_{enhanced}}$
Key idea: Amdahl’s Law quantifies the general notion of diminishing returns. It applies to any activity, not just computer programs.
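Amdahl's Law translates directly into a one-line function. The sketch below is only an illustration: the name `amdahl_speedup`, the argument checks, and the 80% / 4x figures in the usage line are assumptions, not part of the slides.

```python
def amdahl_speedup(f_enhanced, s_enhanced):
    """Overall speedup when a fraction f_enhanced of the original time
    is sped up by a factor s_enhanced (Amdahl's Law)."""
    if not 0.0 <= f_enhanced <= 1.0:
        raise ValueError("f_enhanced must lie in [0, 1]")
    if s_enhanced <= 0.0:
        raise ValueError("s_enhanced must be positive")
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)


# Hypothetical program: 80% of the run time is made 4 times faster.
print(round(amdahl_speedup(0.8, 4.0), 2))  # 2.5
```

Even though the enhanced part runs 4 times faster, the untouched 20% limits the overall gain to 2.5.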
– 26 – Amdahl’s Law Example
Trip example: Suppose that for the New York to Paris leg, we now consider the possibility of taking a rocket ship (15 minutes) or a handy rip in the fabric of space-time (0 minutes):

vehicle   time NY->Paris   total trip time   speedup over 747
747       8.5 hours        16.5 hours        1
SST       3.75 hours       11.75 hours       1.4
rocket    0.25 hours       8.25 hours        2.0
rip       0.0 hours        8 hours           2.1
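The trip table can be recomputed from the leg times alone. This is a sketch under the slide's assumptions (4 hours LA to NY, 4 hours Paris to Cannes, and the NY to Paris times listed in the table); the names `FIXED_HOURS`, `NY_PARIS_HOURS`, and `trip_speedups` are invented for the example.

```python
FIXED_HOURS = 4.0 + 4.0  # LA->NY plus Paris->Cannes, the part that cannot be enhanced

# NY->Paris leg times in hours, as given in the trip example.
NY_PARIS_HOURS = {"747": 8.5, "SST": 3.75, "rocket": 0.25, "rip": 0.0}


def trip_speedups(leg_hours, baseline="747"):
    """Overall trip speedup of each option relative to the baseline option."""
    base_total = FIXED_HOURS + leg_hours[baseline]
    return {name: base_total / (FIXED_HOURS + hours)
            for name, hours in leg_hours.items()}


for name, speedup in trip_speedups(NY_PARIS_HOURS).items():
    total = FIXED_HOURS + NY_PARIS_HOURS[name]
    print(f"{name:<7} total {total:>5.2f} h   speedup {speedup:.1f}")
```

No matter how fast the middle leg becomes, the 8 fixed hours keep the overall speedup below 16.5 / 8, or roughly 2.06.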
– 27 – Lesson from Amdahl’s Law
Useful Corollary of Amdahl’s law:
  $1 \le S_{overall} \le 1 / (1 - F_{enhanced})$
(Table: $F_{enhanced}$ versus maximum $S_{overall}$.)
Moral: It is hard to speed up a program.
Moral++: It is easy to make premature optimizations.
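The upper bound in the corollary is what Amdahl's Law gives when the enhanced part takes no time at all. A minimal sketch that tabulates the bound; the function name `max_speedup` and the sample fractions are chosen for illustration and are not taken from the slide's table.

```python
def max_speedup(f_enhanced):
    """Upper bound 1 / (1 - F) on overall speedup for enhanceable fraction F."""
    if not 0.0 <= f_enhanced <= 1.0:
        raise ValueError("f_enhanced must lie in [0, 1]")
    if f_enhanced == 1.0:
        return float("inf")  # everything can be enhanced, so no finite bound
    return 1.0 / (1.0 - f_enhanced)


# Illustrative fractions only.
for f in (0.0, 0.5, 0.75, 0.9, 0.99):
    print(f"F = {f:<5} max speedup = {max_speedup(f):.1f}")
```

Reading the output top to bottom shows why the moral holds: the fraction has to get very close to 1 before large overall speedups are even possible.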
– 28 – Other Maxims
Second Corollary of Amdahl’s law:
  When you identify and eliminate one bottleneck in a system, something else will become the bottleneck.
Beware of optimizing on small benchmarks:
  It is easy to cut corners that lead to asymptotic inefficiencies.
– 29 – Review: Time to look back!
– 30 – Take-home message: Nothing is real!
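– 31 – Great Reality #1: Int’s are not Integers, Float’s are not Reals
Example 1: Is $x^2 \ge 0$?
  Float’s: Yes!
  Int’s: Not always. A product that overflows can come out negative.
Example 2: Is $(x + y) + z = x + (y + z)$?
  Unsigned & Signed Int’s: Yes!
  Float’s: Not always. Adding a small value to 1e20 + -1e20 gives a different result depending on which pair is added first.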
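Both effects are easy to reproduce. The sketch below makes several assumptions: the operand values are picked for illustration, and because Python integers never overflow, the helper `wrap_int32` (a made-up name) mimics what a 32-bit two's-complement `int` holds on typical hardware.

```python
def wrap_int32(value):
    """Reduce a Python int to the value a 32-bit two's-complement int would hold."""
    value &= 0xFFFFFFFF  # keep the low 32 bits
    return value - 0x100000000 if value >= 0x80000000 else value


# Example 1: squaring can produce a negative result once it overflows 32 bits.
print(wrap_int32(40000 * 40000))  # 1600000000
print(wrap_int32(50000 * 50000))  # -1794967296

# Example 2: floating-point addition is not associative.
big, small = 1e20, 3.14
print((big + -big) + small)  # 3.14
print(big + (-big + small))  # 0.0
```

In the last line, `small` is lost when it is added to `-big`, because 3.14 is far below the spacing between neighbouring doubles near 1e20.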
– 32 – Great Reality #2: You’ve Got to Know Assembly
Chances are, you’ll never write a program in assembly
  Compilers are much better & more patient than you are
But: understanding assembly is key to the machine-level execution model
  Behavior of programs in the presence of bugs
    High-level language model breaks down
  Tuning program performance
    Understand optimizations done/not done by the compiler
    Understanding sources of program inefficiency
  Implementing system software
    Compiler has machine code as target
    Operating systems must manage process state
  Creating / fighting malware
    x86 assembly is the language of choice!
– 33 – Great Reality #3: Memory Matters
Random Access Memory is an unphysical abstraction
Memory is not unbounded
  It must be allocated and managed
  Many applications are memory dominated
Memory referencing bugs are especially pernicious
  Effects are distant in both time and space
Memory performance is not uniform
  Cache and virtual memory effects can greatly affect program performance
  Adapting a program to the characteristics of the memory system can lead to major speed improvements
– 34 – Great Reality #4: There’s more to performance than asymptotic complexity
Constant factors matter too!
And even the exact op count does not predict performance
  Easily see a 10:1 performance range depending on how the code is written
  Must optimize at multiple levels: algorithm, data representations, procedures, and loops
Must understand the system to optimize performance
  How programs are compiled and executed
  How to measure program performance and identify bottlenecks
  How to improve performance without destroying code modularity and generality
– 35 – Great Reality #5: Computers do more than execute programs
They need to get data in and out
  I/O system is critical to program reliability and performance
They communicate with each other over networks
  Many system-level issues arise in the presence of a network
    Concurrent operations by autonomous processes
    Coping with unreliable media
    Cross-platform compatibility
    Complex performance issues
– 36 – What waits for you in the future?
– 37 – New tricks to learn..
– 38 – Coming to a classroom near you in the Spring semester: 334 - Introduction to Operating Systems
Processes and threads
Synchronization, semaphores, monitors
  Assignment 1: Synchronization
CPU scheduling policies
  Assignment 2: System calls and scheduling
Virtual memory, paging, and TLBs
  Assignment 3: Virtual memory
Filesystems, FFS, and LFS
  Assignment 4: File systems
RAID and NFS filesystems