COMPUTER ARCHITECTURE CS 6354: Multi-Cores
Samira Khan, University of Virginia, Sep 12, 2018
The content and concept of this course are adapted from CMU ECE 740
AGENDA
- Logistics
- Review from last lecture
- Multi-Cores
LOGISTICS
- Project list
  - Posted in Piazza
  - Be prepared to spend time on the project
- Sample project proposals
- Project Proposal
  - Due on Sep 26
  - Project Proposal Presentations: Oct 1-3
  - Slides due on Oct 1; send them to Yizhou
  - Groups of at most 2-3 students
Project Proposal
- Problem: Clearly define the problem you are trying to solve
- Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings?
- Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from prior work?
- Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What experiments do you need to do to prove/disprove your idea?
- Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What are the things you must finish? Can you do more? If you finish, can you submit it to a conference? Which conference do you think is a better fit for the work?
LITERATURE SURVEY
- Goal: Critically analyze work related to your project
- Pick 2-3 papers related to your project
- Use the same format as the reviews:
  - What is the problem the paper is solving?
  - What is the key insight?
  - What are the advantages and disadvantages?
  - How can you do better?
- Send the list of papers to the TA by Sep 20
- Will become the related work section in your proposal
Review 3 (due Sep 19, 2018)
- Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," ISCA 2017.
TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System- and task-level tradeoffs
- How to divide the labor between hardware and software
SMALL VERSUS LARGE SEMANTIC GAP
John Cocke's RISC (large semantic gap) concept: the compiler generates control signals (open microcode)

Advantages of a small semantic gap (complex instructions):
+ Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
+ Simpler compiler

Disadvantages:
- Larger chunks of work → compiler has less opportunity to optimize
- More complex hardware → translation to control signals and optimization need to be done by hardware

Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.
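To make the semantic-gap tradeoff concrete, here is a toy sketch (not any real ISA; all names, addresses, and values are invented): the same operation, mem[dst] = mem[a] + mem[b], expressed once as a single complex memory-to-memory instruction and once as the load/add/store sequence a RISC compiler would emit.

```python
# Toy illustration only: two semantic-gap levels for mem[dst] = mem[a] + mem[b].
mem = {0x10: 3, 0x14: 4, 0x18: 0}  # hypothetical word-addressed memory

# Small semantic gap (complex instruction): one memory-to-memory add.
def add_mem_mem(dst, a, b):
    # Hardware sequences the loads, add, and store internally (microcode).
    mem[dst] = mem[a] + mem[b]

# Large semantic gap (simple instructions): compiler emits explicit steps.
regs = {}
def load(r, addr):   regs[r] = mem[addr]
def add(rd, r1, r2): regs[rd] = regs[r1] + regs[r2]
def store(r, addr):  mem[addr] = regs[r]

add_mem_mem(0x18, 0x10, 0x14)              # one dense instruction
load("r1", 0x10); load("r2", 0x14)         # vs. four simple instructions
add("r3", "r1", "r2"); store("r3", 0x18)   # that the compiler can reorder
```

The simple-instruction version exposes every step to the compiler (which can optimize across them), at the cost of more instructions to fetch and decode.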
ISA-LEVEL TRADEOFFS: INSTRUCTION LENGTH
Fixed length: all instructions are the same length
+ Easier to decode a single instruction in hardware
+ Easier to decode multiple instructions concurrently
- Wasted bits in instructions (Why is this bad?)
- Harder-to-extend ISA (how to add new instructions?)

Variable length: instruction lengths differ (determined by opcode and sub-opcode)
+ Compact encoding (Why is this good?)
  Intel 432: Huffman encoding (sort of); 6- to 321-bit instructions. How?
- More logic to decode a single instruction
- Harder to decode multiple instructions concurrently

Tradeoffs:
- Code size (memory space, bandwidth, latency) vs. hardware complexity
- ISA extensibility and expressiveness
- Performance? Smaller code vs. imperfect decode
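The concurrent-decode point can be sketched as follows. This is an illustrative model, not a real encoding: the instruction width and the opcode-to-length table are invented. With fixed-length instructions every start address is known immediately; with variable-length instructions each start address depends on the previous instruction's length.

```python
# Sketch: finding instruction boundaries under fixed vs. variable length.

def fixed_boundaries(code, width=4):
    # Every instruction starts at a multiple of `width` bytes, so all
    # start addresses are known up front and N decoders can run in parallel.
    return list(range(0, len(code), width))

# Invented opcode -> instruction length table (bytes), for illustration.
LENGTHS = {0x01: 1, 0x02: 2, 0x05: 5}

def variable_boundaries(code):
    # Each start address depends on decoding the previous instruction's
    # length field, so boundaries must be discovered sequentially.
    pcs, pc = [], 0
    while pc < len(code):
        pcs.append(pc)
        pc += LENGTHS[code[pc]]
    return pcs

print(fixed_boundaries(bytes(12)))                               # [0, 4, 8]
print(variable_boundaries(bytes([0x02, 0, 0x05, 0, 0, 0, 0, 0x01])))  # [0, 2, 7]
```

Real variable-length decoders (e.g., for x86) speculate on instruction boundaries or pre-decode lengths into the instruction cache to recover this parallelism, at a hardware cost.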
ISA-LEVEL TRADEOFFS: UNIFORM DECODE
Uniform decode: the same bits in each instruction correspond to the same meaning
- Opcode is always in the same location; ditto operand specifiers, immediate values, ...
- Many "RISC" ISAs: Alpha, MIPS, SPARC
+ Easier decode, simpler hardware
+ Enables parallelism: generate the target address before knowing the instruction is a branch
- Restricts instruction format (fewer instructions?) or wastes space

Non-uniform decode
- E.g., the opcode can be in the 1st-7th byte in x86
+ More compact and powerful instruction format
- More complex decode logic (e.g., more logic to speculatively generate branch targets)
X86 VS. ALPHA INSTRUCTION FORMATS
ISA-LEVEL TRADEOFFS: NUMBER OF REGISTERS
Affects:
- Number of bits used for encoding register addresses
- Number of values kept in fast storage (register file)
- (uarch) Size, access time, power consumption of the register file

Large number of registers:
+ Enables better register allocation (and optimizations) by the compiler → fewer saves/restores
- Larger instruction size
- Larger register file size
- (Superscalar processors) More complex dependency-check logic
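The encoding cost is simple arithmetic: each register specifier needs ceil(log2(N)) bits, and a three-operand instruction carries three of them. A quick back-of-envelope check (the 3-operand format is just a representative example):

```python
# Register-count impact on instruction encoding bits.
import math

def reg_bits(n_regs):
    # Bits needed to name one of n_regs registers.
    return math.ceil(math.log2(n_regs))

# Operand-specifier bits for a 3-operand ALU instruction (rd, rs1, rs2):
print(3 * reg_bits(16))   # 16 registers  -> 12 bits
print(3 * reg_bits(32))   # 32 registers  -> 15 bits
print(3 * reg_bits(128))  # 128 registers -> 21 bits
```

Going from 32 to 128 architectural registers costs 6 extra bits per 3-operand instruction, which is significant in a 32-bit fixed-length format.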
ISA-LEVEL TRADEOFFS: ADDRESSING MODES
An addressing mode specifies how to obtain an operand of an instruction
- Register
- Immediate
- Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)

More modes:
+ Help better support programming constructs (arrays, pointer-based accesses)
- Make it harder for the architect to design
- Too many choices for the compiler? Many ways to do the same thing complicates compiler design

Read: Wulf, "Compilers and Computer Architecture"
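The memory modes differ only in how the effective address is computed. A minimal sketch of three of them, with made-up register contents (the register names and values are illustrative, not tied to any real ISA):

```python
# Effective-address computation for a few common memory addressing modes.
regs = {"base": 0x1000, "index": 3}

def displacement(base_reg, disp):        # operand at mem[reg + disp]
    return regs[base_reg] + disp

def indexed(base_reg, index_reg):        # operand at mem[base + index]
    return regs[base_reg] + regs[index_reg]

def scaled(base_reg, index_reg, scale):  # operand at mem[base + index*scale]
    # The scale (e.g., 4) matches the element size of an array access.
    return regs[base_reg] + regs[index_reg] * scale

print(hex(displacement("base", 8)))       # 0x1008
print(hex(indexed("base", "index")))      # 0x1003
print(hex(scaled("base", "index", 4)))    # 0x100c
```

The scaled mode is exactly what an `a[i]` access to a 4-byte array compiles to, which is why richer modes "help support programming constructs" at the cost of more address-generation hardware.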
X86 VS. ALPHA INSTRUCTION FORMATS
[Figure: x86 addressing-mode examples — register indirect, absolute, register + displacement]
[Figure: x86 addressing-mode examples — indexed (base + index), scaled (base + index*4)]
OTHER ISA-LEVEL TRADEOFFS
- Load/store vs. memory/memory
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- ...
AGENDA
- Logistics
- Review from last lecture
- Multi-Cores
MULTIPLE CORES ON CHIP
- Simpler and lower power than a single large core
- Large-scale parallelism on chip

Examples:
- AMD Barcelona: 4 cores
- Intel Core i7: 8 cores
- IBM Cell BE: 8+1 cores
- IBM POWER7: 8 cores
- Nvidia Fermi: 448 "cores"
- Intel SCC: 48 cores, networked
- Tilera TILE Gx: 100 cores, networked
- Sun Niagara II: 8 cores
MOORE’S LAW Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
MULTI-CORE Idea: Put multiple processors on the same die.
- Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
- What else could you do with the die area you dedicate to multiple processors?
  - Have a bigger, more powerful core
  - Have larger caches in the memory hierarchy
  - Integrate platform components on chip (e.g., network interface, memory controllers)
WHY MULTI-CORE? Alternative: Bigger, more powerful single core
- Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance are elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)
MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages:
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications

Multi-core disadvantages:
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
WHY MULTI-CORE? Alternative: Bigger caches
+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy
CACHE VS. CORE
WHY MULTI-CORE? Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)
WHY MULTI-CORE? Other alternatives?
- Dataflow?
- Vector processors (SIMD)?
- Integrating DRAM on chip?
- Reconfigurable logic? (general purpose?)
WITH MULTIPLE CORES ON CHIP
- What we want: N times the performance with N times the cores when we parallelize an application on N cores
- What we get:
  - Amdahl's Law (serial bottleneck)
  - Bottlenecks in the parallel portion
CAVEATS OF PARALLELISM
Amdahl's Law
- f: parallelizable fraction of a program
- N: number of processors

  Speedup = 1 / ((1 - f) + f/N)

Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.

- Maximum speedup is limited by the serial portion: the serial bottleneck
- The parallel portion is usually not perfectly parallel:
  - Synchronization overhead (e.g., updates to shared data)
  - Load imbalance overhead (imperfect parallelization)
  - Resource sharing overhead (contention among N processors)
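Plugging numbers into the formula makes the serial bottleneck vivid; the fractions and core counts below are arbitrary examples:

```python
# Amdahl's Law: Speedup = 1 / ((1 - f) + f/N)
# f = parallelizable fraction, N = number of processors.

def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# With 95% of the program parallelizable (f = 0.95):
print(speedup(0.95, 8))       # ~5.9x on 8 cores, not 8x
print(speedup(0.95, 10**6))   # ~20x even with a million cores

# The 5% serial portion caps the speedup at 1 / (1 - f) = 20x.
```

This is why shrinking the serial portion, not just adding cores, dominates the engineering effort in parallel systems.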
THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely

Causes of serialized code sections:
- Sequential portions (Amdahl's "serial part")
- Critical sections
- Barriers

Serialized code sections:
- Reduce performance
- Limit scalability
- Waste energy