1 CSCE 930 Advanced Computer Architecture, Lecture 1: Evaluating Computer Architectures. Dr. Jun Wang
2 Computer Architecture Trends
3 Figure 1.1 (H&P): growth in microprocessor performance of about 35% per year
4 Technology Trends
Smaller feature sizes – higher speed and density
Density has increased by 77 times
5 Technology Trends
Larger chips
–Trend is toward more RAM, less logic per chip
–Historically 2x per generation; leveling off?
–McKinley has large on-chip caches => larger wafers to reduce fabrication costs
6 Moore’s Law Number of transistors doubles every 18 months (amended to 24 months) Combination of both greater density and larger chips
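The compounding this implies can be sketched in a few lines (a rough sketch: the starting count, the 2,300-transistor Intel 4004, and the 30-year span are chosen only for illustration):

```python
# Sketch: transistor-count growth under Moore's Law, comparing the
# 18-month doubling period with the amended 24-month period.
def transistors(initial, years, doubling_months):
    # One doubling every `doubling_months`; growth compounds.
    return initial * 2 ** (years * 12 / doubling_months)

base = 2300  # transistors on the Intel 4004 (1971)
for months in (18, 24):
    count = transistors(base, 30, months)
    print(f"{months}-month doubling over 30 years: {count:.2e} transistors")
```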
7 Tech. Trends, contd. More, faster, cheaper transistors have fed an application demand for higher performance –1970s -- serial, 1-bit integer microprocessors –1980s -- pipelined 32-bit RISC ISA simplicity allows processor on chip –1990s -- large, superscalar processors, even for CISC –2000s -- multiprocessors on a chip
8 Pipelining and Branch Prediction
Two basic ways of increasing performance
–Pipelining: overlap instructions in stages (IF, ID, EX, MEM, WB) separated by clocked latches
–Branch prediction: speculate on branch outcome to avoid waiting
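A back-of-the-envelope sketch of why prediction matters: with an assumed branch frequency and pipeline-flush penalty (both numbers are illustrative, not from the slide), prediction accuracy translates into CPI as follows:

```python
# Sketch: CPI impact of branch mispredictions in a 5-stage pipeline
# (IF, ID, EX, MEM, WB). All parameter values are illustrative.
def cpi_with_branches(base_cpi, branch_frac, accuracy, flush_penalty):
    # Each mispredicted branch costs `flush_penalty` extra cycles.
    return base_cpi + branch_frac * (1 - accuracy) * flush_penalty

print(cpi_with_branches(1.0, 0.20, 0.90, 3))  # good predictor: ~1.06
print(cpi_with_branches(1.0, 0.20, 0.50, 3))  # coin flip: ~1.30
```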
9 Tech. Trend: memory sizes Memories have grown very dense –Feeding application demand for large, complex software
10 Tech. Trend: memory speeds Main memory speeds have not kept up with processor speeds
11 Memory Hierarchies Gap between processor and memory performance has led to widespread use of memory hierarchies –1960s no caches, no virtual memory –1970s shared I & D-cache, 32-bit virtual memory –1980s Split I- and D-caches –1990s Two level caches, 64-bit virtual memory –2000s Multi-level caches, both on and off-chip
12 Memory Hierarchies
(Diagram: PROCESSOR – Registers, L1 Cache, L2 Cache; MEMORY SYSTEM – L3 Cache, Main Memory; small/fast at the top of the hierarchy, large/slow at the bottom.)
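One standard way to quantify such a hierarchy is average memory access time (AMAT); a minimal sketch, with every latency and hit rate below assumed purely for illustration:

```python
# Sketch: average memory access time through a multi-level cache
# hierarchy, fastest level first. All numbers are illustrative.
def amat(levels, memory_latency):
    """levels: list of (hit_time_cycles, hit_rate), L1 first."""
    total, reach = 0.0, 1.0  # reach = fraction of accesses reaching level
    for hit_time, hit_rate in levels:
        total += reach * hit_time
        reach *= 1 - hit_rate
    return total + reach * memory_latency

# L1: 1 cycle, 95% hits; L2: 10 cycles, 90%; L3: 30 cycles, 80%
print(amat([(1, 0.95), (10, 0.90), (30, 0.80)], memory_latency=200))
```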
13 I/O a key system component I/O has evolved into a major distinguishing feature of computer systems –1960s: disk, tape, punch cards, tty; batch processing –1970s: character oriented displays –1980s: video displays, audio, increasing disk sizes, beginning networking –1990s: 3D graphics; networking a fundamental element; high quality audio –2000s: real-time video, immersion…
14 I/O Systems
A hierarchy that divides bandwidth
(Diagram: processor and DRAM on a local bus; a controller bridges to a high-speed I/O bus carrying the LAN interface, hard drives, frame buffer/monitor, and expansion slots; a slow-speed I/O bus carries the floppy interface and CD-ROM.)
Data rates:
–Memory: 100 MHz, 8 bytes wide: 800 MB/s (peak)
–PCI: 33 MHz, 4 bytes wide: 132 MB/s (peak)
–SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes): 80 MB/s (peak)
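The peak rates quoted on the slide are just clock rate times transfer width (with MB taken as 10^6 bytes); a quick check:

```python
# The slide's peak data rates: clock (MHz) x width (bytes) = MB/s.
def peak_mb_per_s(clock_mhz, bytes_wide):
    return clock_mhz * bytes_wide  # 10^6 transfers/s * bytes = MB/s

print(peak_mb_per_s(100, 8))  # memory bus: 800 MB/s
print(peak_mb_per_s(33, 4))   # PCI: 132 MB/s
print(peak_mb_per_s(40, 2))   # SCSI Ultra2 Wide: 80 MB/s
```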
15 Multiprocessors
Multiprocessors have been available for decades…
–1960s: small MPs
–1970s: small MPs; dream of automatic parallelization
–1980s: small MPs; emergence of servers; dream of automatic parallelization
–1990s: expanding MPs; very large MPPs failed; dream of automatic parallelization fading
–2000s: widespread MPs; on-chip multithreading; many applications have independent threads; programmers write applications to be parallel in the first place
(Diagram: processors (P), each with a cache (C), connected to memories (M) through an interconnection network.)
16 Evaluating Computer Architectures
17 Computation Science
Computation is synthetic
–Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world
–Very different from the natural sciences
»When one discovers a fact about nature, it is a contribution, no matter how small
»Creating something new alone does not establish a contribution
–Anyone can create something new in a synthetic field
–Rather, one must show that the creation is better
18 What Does “Better” Mean?
“Better” can mean many things
–Solves a problem in less time (faster)
–Solves a larger class of problems (more powerful)
–Uses resources more efficiently (cheaper)
–Is less prone to errors (more reliable)
–Is easier to manage/program (lower human cost)
19 Amdahl's Law
Find the speedup coming from some enhancement E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
Amdahl's Law defines the speedup that can be gained by using the special feature:
Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E
20 Amdahl’s Law
ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
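The formulas on this slide translate directly into code:

```python
# Amdahl's Law: a fraction F of the task is accelerated by a factor S;
# the remainder is unaffected.
def speedup_overall(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(speedup_overall(0.5, 10))   # ~1.82: half the task sped up 10x
print(speedup_overall(0.5, 1e9))  # ~2.0: bounded by 1 / (1 - F)
```

Note the limiting behavior: however large S grows, the overall speedup is capped at 1 / (1 − F).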
21 The “better” property is not simply an observation
–Rather, the research postulates that a new idea (an architecture, algorithm, protocol, data structure, methodology, language, optimization, model, etc.) will lead to a “better” result
–Making the connection between the idea and the improvement is as important as quantifying how large the improvement is
The contribution is the idea, and it is generally a component of a larger computational system.
22 How to Evaluate Architecture Ideas Measuring/observing/analyzing real systems –Accurate results –Need a working system » Too expensive to evaluate architecture/system ideas
23 Analytic models
–Fast & easy analysis of relations
–T_program = NumOfInst × (T_cpu + T_mem × (1 − CacheHitRate))
–Allow extrapolation to ridiculous parameters, e.g. thousands of processors
–Sometimes infeasible to obtain accuracy (e.g. modeling caches)
–To obtain reasonable accuracy, the models may become very complex (e.g. modeling of network contention)
–Queuing theory is a commonly used technique
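Evaluating the slide's model for a concrete (assumed) parameter set:

```python
# The slide's analytic model:
#   T_program = NumOfInst * (T_cpu + T_mem * (1 - cache_hit_rate))
def t_program(num_inst, t_cpu, t_mem, cache_hit_rate):
    return num_inst * (t_cpu + t_mem * (1 - cache_hit_rate))

# 10^9 instructions, 1 ns per instruction, 100 ns memory, 98% hit rate
# (all values assumed for illustration)
print(t_program(1e9, 1e-9, 100e-9, 0.98))  # ~3.0 seconds
```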
24 Simulation –The most popular method in computer architecture and system research –Mimic the architecture/system using software –Very flexible: nearly unlimited evaluation –Prototyping of non-existing machines possible –Evaluation of design options (design space exploration) cheap & flexible –Requires some sort of validation –Can be VERY slow
25 Tradeoff between accuracy and computational intensity
–Low level of abstraction: slow (e.g. simulating at the level of gates)
–High level of abstraction: fast (e.g. only simulating processor, cache and memory components)
The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated
26 Three Simulation Techniques
Profile-based static modeling
–Simplest and least costly
–Uses hardware counters on the chip or instrumented execution (e.g. pgprof on a Beowulf Linux cluster, SGI perfex, and Alpha ATOM)
Trace-driven
–A more sophisticated technique
–How it works (example: modeling memory system performance):
1. Collect traces generated by ATOM (trace format: instruction address executed, data address accessed)
2. Build the memory hierarchy model
3. Feed the trace into the simulation model and analyze the results
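A minimal sketch of the trace-driven workflow: instead of an ATOM-generated trace, a synthetic list of data addresses is fed through a direct-mapped cache model (the cache geometry and fill policy are assumed for illustration):

```python
# Sketch: trace-driven simulation of a direct-mapped cache.
# `trace` stands in for the data-address stream an ATOM trace provides.
def simulate(trace, num_lines=64, line_bytes=32):
    tags = [None] * num_lines  # one tag per cache line
    hits = 0
    for addr in trace:
        line = addr // line_bytes
        index, tag = line % num_lines, line // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag  # miss: fill the line
    return hits / len(trace)

trace = [a * 4 for a in range(100)] * 2  # two passes over 400 bytes
print(simulate(trace))  # 187/200 hits = 0.935
```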
27 Using pgprof (example)
1. Compile: pgcc -Mprof=func prg.cc
2. Run the code to produce a profile data file called pgprof.out
3. View the execution profile: pgprof pgprof.out
28 Using Perfex
Usage: perfex [-e num] [-y] program [program args]
–-e num: count only event type num
–-y: generate a “cost report”
Example: perfex -e 41 -e 13 -y a.out
EVENT #  Events counted              Count
41       Floating point ops retired  25292884493
13       L2 cache lines loaded       223490870
Statistics:
MFLOPS: 29.175907
Main memory / L2 bandwidth: 8.249655 MB/s
29 Execution-driven
–The most accurate and most costly
–Trace-driven simulation cannot capture the interaction between the memory system and the processor
–Detailed simulation of the memory system and the processor pipeline is done simultaneously, by actually executing the program on top of a simulation framework such as Simics, SimOS, or SimpleScalar
30 Measuring by Means of Benchmarks
Micro-benchmarks (e.g. instruction latencies, file system throughput)
Application benchmarks: general system behavior (e.g. SPEC2000 or SPLASH-2)
–Only limited evaluation possible (e.g. limited system support for measurement)
–The machine must be available
Benchmark suites: collections of kernels and real and benchmark programs, lessening the weakness of any one benchmark by the presence of the others
31 Summarize Results
Weighted arithmetic mean execution time: Σ_i (W_i × T_i)
–Sum the products of weighting factors and execution times, reflecting the individual frequency of each workload
–W_i = 1 / (Time_i × Σ_{j=1..n} (1/Time_j))
Geometric mean execution time: (Π_i (T_i / N_i))^(1/n)
–Normalize execution times to a reference machine (N_i) and take the geometric average of the normalized times
–Used by SPEC
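Both summaries can be sketched directly (the example times are arbitrary):

```python
# Weighted arithmetic mean and normalized geometric mean of
# execution times, as defined on the slide.
def weighted_arithmetic_mean(times, weights):
    return sum(w * t for w, t in zip(weights, times))

def equal_time_weights(times):
    # W_i = 1 / (Time_i * sum_j (1 / Time_j))
    inv_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inv_sum) for t in times]

def geometric_mean(times, ref_times):
    prod = 1.0
    for t, r in zip(times, ref_times):
        prod *= t / r
    return prod ** (1.0 / len(times))

ref = [1.0, 1000.0]                        # reference machine
print(weighted_arithmetic_mean(ref, equal_time_weights(ref)))
print(geometric_mean([10.0, 100.0], ref))  # sqrt(10 * 0.1) = 1.0
```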
32 A Report Example (P&H Figure 1.17)
Execution times normalized to machines A, B, and C respectively:

                 Normalized to A     Normalized to B     Normalized to C
                 A     B     C       A     B    C        A      B     C
Program P1       1.0   10.0  20.0    0.1   1.0  2.0      0.05   0.5   1.0
Program P2       1.0   0.1   0.02    10.0  1.0  0.2      50.0   5.0   1.0
Arithmetic mean  1.0   5.05  10.01   5.05  1.0  1.1      25.03  2.75  1.0
Geometric mean   1.0   1.0   0.63    1.0   1.0  0.63     1.58   1.58  1.0
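The table can be regenerated from raw execution times consistent with its normalized values (P1/P2 of 1 s/1000 s on A, 10 s/100 s on B, 20 s/20 s on C, the usual P&H example; any common rescaling gives the same table). This also shows the geometric mean's key property: ratios between machines do not depend on the reference.

```python
# Raw execution times (seconds) behind the normalized table above.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

def normalized_means(machine, ref):
    ratios = [t / r for t, r in zip(times[machine], times[ref])]
    arith = sum(ratios) / len(ratios)
    geo = 1.0
    for x in ratios:
        geo *= x
    geo **= 1.0 / len(ratios)
    return arith, geo

print(normalized_means("B", "A"))  # arithmetic 5.05, geometric 1.0
print(normalized_means("B", "C"))  # arithmetic 2.75, geometric ~1.58

# The B-to-C ratio of geometric means is the same for any reference:
r_a = normalized_means("B", "A")[1] / normalized_means("C", "A")[1]
r_c = normalized_means("B", "C")[1] / normalized_means("C", "C")[1]
print(r_a, r_c)  # both ~1.58
```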