CSCE 713 Computer Architecture, January 10, 2012
Topics: Speedup, Amdahl's Law, Execution Time. Readings.
– 2 – CSCE 713 Spring 2012 Overview
Readings for today: The Landscape of Parallel Computing Research: A View from Berkeley (EECS-2006-183); Parallel Benchmarks Inspired by Berkeley Dwarfs; Ka10_7dwarfsOfSymbolicComputationNew
Topics overview: syllabus and other course pragmatics (website, dates); the power wall and ILP wall, and the move to multicore; the Seven Dwarfs; Amdahl's law and Gustafson's law
– 3 – CSCE 713 Spring 2012 Single Processor Performance
Growth through the RISC era, then the move to multi-processors. (Figure copyright © 2012, Elsevier Inc. All rights reserved.)
– 4 – CSCE 713 Spring 2012 Power Wall
Note that both dynamic power and dynamic energy have the square of the supply voltage as their dominant term, so lowering the voltage improves both. Supply voltages fell from 5 V to about 1 V over the years, but voltage cannot keep dropping without causing errors.
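This is the standard first-order CMOS model (as in Hennessy and Patterson); the formulas below are a sketch of that model, not reproduced from the slide:

    % C = capacitive load, V = supply voltage, f = switching frequency
    E_{\text{dynamic}} \propto \tfrac{1}{2}\, C\, V^{2}
    P_{\text{dynamic}} \propto \tfrac{1}{2}\, C\, V^{2}\, f

Halving V cuts dynamic energy to a quarter; lowering f reduces power but not the energy per operation.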
– 5 – CSCE 713 Spring 2012 Static Power
CMOS chips lose power to current leakage even when their transistors are switched off. In 2006 the design goal was to hold leakage to 25% of total power.
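The matching first-order model for leakage (again the textbook formulation, not from the slide):

    P_{\text{static}} \propto I_{\text{static}} \times V

Lowering V helps static power as well, but leakage current grows as feature sizes shrink, which is why static power became a first-class concern.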
– 6 – CSCE 713 Spring 2012 Single CPU Single Thread Programming Model
– 11 – CSCE 713 Spring 2012 CSAPP (Computer Systems: A Programmer's Perspective) – Bryant and O'Hallaron
– 12 – CSCE 713 Spring 2012 Topics Covered
The need for gains in performance
The need for parallelism
Amdahl's and Gustafson's laws
Various problems: the 7 Dwarfs and …
Various approaches, and bridges between them:
  Multithreaded multicore – POSIX pthreads, Intel's TBB
  Distributed – MPI
  Shared memory – OpenMP
  GPUs
  Grid computing
  Cloud computing
– 13 – CSCE 713 Spring 2012 Top 10 Challenges in Parallel Computing
By Michael Wrinn (Intel), in priority order:
1. Finding concurrency in a program: how to help programmers "think parallel"?
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks and doing it in a way that our target audience will be able to use correctly.
4. Scalability support in hardware: bandwidth and latencies to memory plus interconnects between processing elements.
5. Scalability support in software: libraries, scalable algorithms, and adaptive runtimes to map high-level software onto platform details.
http://www.multicoreinfo.com/2009/01/wrinn-top-10-challenges/
– 14 – CSCE 713 Spring 2012
6. Synchronization constructs (and protocols) that enable programmers to write programs free from deadlock and race conditions.
7. Tools, APIs, and methodologies to support the debugging process.
8. Error recovery and support for fault tolerance.
9. Support for good software engineering practices: composability, incremental parallelism, and code reuse.
10. Support for portable performance. What are the right models (or abstractions) so programmers can write code once and expect it to execute well on the important parallel platforms?
http://www.multicoreinfo.com/2009/01/wrinn-top-10-challenges/
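As a small illustration of the synchronization item, a minimal C sketch (mine, not from the slides) of a POSIX mutex keeping a shared counter free of races; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITERS   100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread increments the shared counter; the mutex makes the
     * read-modify-write sequence atomic, preventing a data race. */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < NITERS; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        printf("counter = %ld\n", counter);  /* 400000 with the mutex; unpredictable without */
        return 0;
    }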
– 15 – CSCE 713 Spring 2012 Berkeley Conventional Wisdom
1. Old CW: Power is free, but transistors are expensive.
   New CW is the "Power wall": Power is expensive, but transistors are "free". That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
   New CW: For desktops and servers, static power due to leakage can be 40% of total power.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 16 – CSCE 713 Spring 2012
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
   New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
   New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 17 – CSCE 713 Spring 2012
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
   New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 18 – CSCE 713 Spring 2012
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
   New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast.
   New CW is the "Memory wall" [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 19 – CSCE 713 Spring 2012
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
   New CW is the "ILP wall": There are diminishing returns on finding more ILP.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 20 – CSCE 713 Spring 2012
9. Old CW: Uniprocessor performance doubles every 18 months.
   New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 [of the Berkeley report] plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 21 – CSCE 713 Spring 2012
10. Old CW: Don't bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
    New CW: It will be a very long wait for a faster sequential computer.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 22 – CSCE 713 Spring 2012
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
    New CW: Increasing parallelism is the primary method of improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
    New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 24 – CSCE 713 Spring 2012 Amdahl's Law
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the whole system is limited by the fraction of the time the enhancement can be used.
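In symbols (the standard statement of the law, not copied from the slide): if the enhancement speeds its portion up by a factor s and applies for a fraction f of the original execution time, then

    \text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}

As s grows without bound the speedup approaches 1/(1 - f); with f = 0.9, no enhancement can do better than 10x.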
– 25 – CSCE 713 Spring 2012 Exec Time of Parallel Computation
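A common way to write the execution time of a parallel computation, consistent with Amdahl's law above (a sketch under ideal assumptions, not taken from the slide): with parallel fraction f, p processors, and no communication or load-imbalance overhead,

    T(p) = T(1)\left((1 - f) + \frac{f}{p}\right)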
– 26 – CSCE 713 Spring 2012 Gustafson’s Law: Scale the problem http://en.wikipedia.org/wiki/Gustafson%27s_law
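In symbols (the standard form, matching the Wikipedia article cited): with N processors and α the sequential fraction of the scaled workload,

    S(N) = N - \alpha\,(N - 1)

Unlike Amdahl's fixed-size view, the problem grows with the machine, so speedup scales nearly linearly when α is small.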
– 27 – CSCE 713 Spring 2012 Matrix Multiplication: Scaling the Problem
Note: in practice we would scale a model of a "real problem," but matrix multiplication might be one required step.
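A minimal C sketch (my own, not from the slides) of the kind of matrix multiply being scaled; the OpenMP pragma splits the outer loop across threads, and growing n grows the work per thread. Compile with -fopenmp; without it the pragma is ignored and the code runs serially:

    /* Multiply n x n matrices: C = A * B, row-major storage. */
    void matmul(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }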
– 29 – CSCE 713 Spring 2012 Phillip Colella's "Seven Dwarfs"
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
Well-defined targets from an algorithmic, software, and architecture standpoint.
If we add 4 more for embedded computing, the list covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004. www.eecs.berkeley.edu/bears/presentations/06/Patterson.ppt
– 30 – CSCE 713 Spring 2012 Seven Dwarfs: Dense Linear Algebra
Data are dense matrices or vectors. Generally, such applications use unit-stride memory accesses to read data from rows, and strided accesses to read data from columns. (Communication-pattern figure: black indicates no communication.)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
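To make the access patterns concrete, a small C sketch (illustrative, not from the report): with row-major storage, walking a row is unit-stride while walking a column strides by the row length:

    /* Row-major n x n matrix: element (i, j) lives at a[i * n + j]. */
    double sum_row(const double *a, int n, int i) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a[i * n + j];   /* unit stride: consecutive addresses */
        return s;
    }

    double sum_col(const double *a, int n, int j) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i * n + j];   /* stride of n doubles: for large n, a new cache line per access */
        return s;
    }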
– 31 – CSCE 713 Spring 2012 Seven Dwarfs: Sparse Linear Algebra
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 32 – CSCE 713 Spring 2012 Seven Dwarfs: Spectral Methods (e.g., FFT)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 33 – CSCE 713 Spring 2012 Seven Dwarfs: N-Body Methods
Depends on interactions between many discrete points. Variations include particle-particle methods, where every point depends on all others, leading to an O(N²) calculation, and hierarchical particle methods, which combine forces or potentials from multiple points to reduce the computational complexity to O(N log N) or O(N).
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
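The particle-particle case in a C sketch (illustrative; the struct layout and the gravity-style force law are my assumptions), showing where the O(N²) comes from:

    #include <math.h>

    typedef struct { double x, y, fx, fy; } Particle;

    /* Naive particle-particle method: every pair interacts,
     * so the doubly nested loop does O(N^2) work. */
    void accumulate_forces(Particle *p, int n) {
        for (int i = 0; i < n; i++) {
            p[i].fx = p[i].fy = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = p[j].x - p[i].x;
                double dy = p[j].y - p[i].y;
                double r2 = dx * dx + dy * dy + 1e-12;  /* softening avoids divide-by-zero */
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                p[i].fx += dx * inv_r3;   /* unit masses and unit G for simplicity */
                p[i].fy += dy * inv_r3;
            }
        }
    }

Hierarchical methods (Barnes-Hut, fast multipole) replace the inner loop with aggregated contributions from groups of distant points, which is how the cost drops to O(N log N) or O(N).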
– 34 – CSCE 713 Spring 2012 Seven Dwarfs: Structured Grids
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 35 – CSCE 713 Spring 2012 Seven Dwarfs: Unstructured Grids
An irregular grid where data locations are selected, usually by underlying characteristics of the application.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
– 36 – CSCE 713 Spring 2012 Seven Dwarfs: Monte Carlo
Calculations depend on statistical results of repeated random trials. Considered embarrassingly parallel; communication is typically not dominant in Monte Carlo methods. (See "Embarrassingly Parallel" / NSF TeraGrid.)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
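A classic toy instance (my sketch, not from the report): estimating π by throwing random points at the unit square. Trials are independent, so they split across workers with essentially no communication:

    #include <stdio.h>
    #include <stdlib.h>

    /* Monte Carlo estimate of pi: the fraction of random points in the
     * unit square that land inside the quarter circle approaches pi/4. */
    int main(void) {
        const long trials = 10000000;
        long hits = 0;
        srand(12345);                       /* fixed seed for reproducibility */
        for (long i = 0; i < trials; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
        printf("pi ~ %f\n", 4.0 * hits / trials);
        return 0;
    }

In a parallel version each worker would need its own random-number generator state, since rand() is not thread-safe.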
– 37 – CSCE 713 Spring 2012 Principle of Locality
Rule of thumb: a program spends 90% of its execution time in only 10% of the code. So what do you try to optimize?
Locality of memory references: temporal locality (recently used items tend to be used again soon) and spatial locality (items near recently used items tend to be used soon).
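A small C illustration (mine, not from the slide) of both kinds of locality in one loop nest over a row-major array:

    /* Row-major traversal shows both localities:
     * - spatial: a[i][j] and a[i][j+1] are adjacent in memory, so one
     *   cache-line fill serves several iterations;
     * - temporal: `total` is touched every iteration, so it stays in a
     *   register or the L1 cache. */
    double sum2d(int rows, int cols, const double a[rows][cols]) {
        double total = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)   /* innermost index varies fastest */
                total += a[i][j];
        return total;
    }

Swapping the loop order would stride through memory and lose the spatial locality, typically costing a large constant factor.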
– 38 – CSCE 713 Spring 2012 Taking Advantage of Parallelism
Logic parallelism: carry-lookahead adder
Word parallelism: SIMD
Instruction pipelining: overlap fetch and execute
Multithreading: executing instructions from independent threads at the same time
Speculative execution
– 39 – CSCE 713 Spring 2012 Linux – System Info
saluda> lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:               2393.830
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
saluda>
– 40 – CSCE 713 Spring 2012 Windows System Info (screenshot: Control Panel > System and Sec… > System …)
– 41 – CSCE 713 Spring 2012 Task Manager