COMPUTER ARCHITECTURE CS 6354: Multi-Cores
Samira Khan, University of Virginia, Sep 12, 2018
The content and concept of this course are adapted from CMU ECE 740
AGENDA
- Logistics
- Review from last lecture
- Multi-Cores
LOGISTICS
- Project list
  - Posted in Piazza
  - Be prepared to spend time on the project
- Sample project proposals
- Project Proposal
  - Due on Sep 26
  - Project Proposal Presentations: Oct 1-3
  - Slides due on Oct 1; send them to Yizhou
  - Groups of at most 2-3 students
Project Proposal
- Problem: Clearly define the problem you are trying to solve
- Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings?
- Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from prior work?
- Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What experiments do you need to do to prove/disprove your idea?
- Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What are the things you must finish? Can you do more? If you finish, can you submit it to a conference? Which conference do you think is a better fit for the work?
LITERATURE SURVEY
- Goal: Critically analyze work related to your project
- Pick 2-3 papers related to your project
- Use the same format as the reviews:
  - What is the problem the paper is solving?
  - What is the key insight?
  - What are the advantages and disadvantages?
  - How can you do better?
- Send the list of papers to the TA by Sep 20
- Will become the related work section in your proposal
Review 3 (due Sep 19, 2018)
- Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," ISCA 2017.
TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System- and task-level tradeoffs
- How to divide the labor between hardware and software
SMALL VERSUS LARGE SEMANTIC GAP
John Cocke's RISC (large semantic gap) concept: the compiler generates control signals (open microcode)

Advantages of a small semantic gap (complex instructions):
+ Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
+ Simpler compiler

Disadvantages:
- Larger chunks of work → compiler has less opportunity to optimize
- More complex hardware → translation to control signals and optimization need to be done by hardware

Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.
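To make the semantic-gap tradeoff concrete, here is a toy sketch (not any real ISA; all names, addresses, and values are invented): the same operation, mem[dst] = mem[a] + mem[b], expressed once as a single complex memory-to-memory instruction and once as the load/add/store sequence a RISC compiler would emit.

```python
# Toy illustration only: two semantic-gap levels for mem[dst] = mem[a] + mem[b].
mem = {0x10: 3, 0x14: 4, 0x18: 0}  # hypothetical word-addressed memory

# Small semantic gap (complex instruction): one memory-to-memory add.
def add_mem_mem(dst, a, b):
    # Hardware sequences the loads, add, and store internally (microcode).
    mem[dst] = mem[a] + mem[b]

# Large semantic gap (simple instructions): compiler emits explicit steps.
regs = {}
def load(r, addr):   regs[r] = mem[addr]
def add(rd, r1, r2): regs[rd] = regs[r1] + regs[r2]
def store(r, addr):  mem[addr] = regs[r]

add_mem_mem(0x18, 0x10, 0x14)              # one dense instruction
load("r1", 0x10); load("r2", 0x14)         # vs. four simple instructions
add("r3", "r1", "r2"); store("r3", 0x18)   # that the compiler can reorder
```

The simple-instruction version exposes every step to the compiler (which can optimize across them), at the cost of more instructions to fetch and decode.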
ISA-LEVEL TRADEOFFS: INSTRUCTION LENGTH
Fixed length: all instructions are the same length
+ Easier to decode a single instruction in hardware
+ Easier to decode multiple instructions concurrently
- Wasted bits in instructions (Why is this bad?)
- Harder-to-extend ISA (how to add new instructions?)

Variable length: instruction lengths differ (determined by opcode and sub-opcode)
+ Compact encoding (Why is this good?)
  Intel 432: Huffman encoding (sort of); 6- to 321-bit instructions. How?
- More logic to decode a single instruction
- Harder to decode multiple instructions concurrently

Tradeoffs:
- Code size (memory space, bandwidth, latency) vs. hardware complexity
- ISA extensibility and expressiveness
- Performance? Smaller code vs. imperfect decode
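The concurrent-decode point can be sketched as follows. This is an illustrative model, not a real encoding: the instruction width and the opcode-to-length table are invented. With fixed-length instructions every start address is known immediately; with variable-length instructions each start address depends on the previous instruction's length.

```python
# Sketch: finding instruction boundaries under fixed vs. variable length.

def fixed_boundaries(code, width=4):
    # Every instruction starts at a multiple of `width` bytes, so all
    # start addresses are known up front and N decoders can run in parallel.
    return list(range(0, len(code), width))

# Invented opcode -> instruction length table (bytes), for illustration.
LENGTHS = {0x01: 1, 0x02: 2, 0x05: 5}

def variable_boundaries(code):
    # Each start address depends on decoding the previous instruction's
    # length field, so boundaries must be discovered sequentially.
    pcs, pc = [], 0
    while pc < len(code):
        pcs.append(pc)
        pc += LENGTHS[code[pc]]
    return pcs

print(fixed_boundaries(bytes(12)))                               # [0, 4, 8]
print(variable_boundaries(bytes([0x02, 0, 0x05, 0, 0, 0, 0, 0x01])))  # [0, 2, 7]
```

Real variable-length decoders (e.g., for x86) speculate on instruction boundaries or pre-decode lengths into the instruction cache to recover this parallelism, at a hardware cost.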
ISA-LEVEL TRADEOFFS: UNIFORM DECODE
Uniform decode: the same bits in each instruction correspond to the same meaning
- Opcode is always in the same location; ditto operand specifiers, immediate values, ...
- Many "RISC" ISAs: Alpha, MIPS, SPARC
+ Easier decode, simpler hardware
+ Enables parallelism: generate the target address before knowing the instruction is a branch
- Restricts instruction format (fewer instructions?) or wastes space

Non-uniform decode
- E.g., the opcode can be in the 1st-7th byte in x86
+ More compact and powerful instruction format
- More complex decode logic (e.g., more logic to speculatively generate branch targets)
X86 VS. ALPHA INSTRUCTION FORMATS
ISA-LEVEL TRADEOFFS: NUMBER OF REGISTERS
Affects:
- Number of bits used for encoding register addresses
- Number of values kept in fast storage (register file)
- (uarch) Size, access time, power consumption of the register file

Large number of registers:
+ Enables better register allocation (and optimizations) by the compiler → fewer saves/restores
- Larger instruction size
- Larger register file size
- (Superscalar processors) More complex dependency-check logic
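The encoding cost is simple arithmetic: each register specifier needs ceil(log2(N)) bits, and a three-operand instruction carries three of them. A quick back-of-envelope check (the 3-operand format is just a representative example):

```python
# Register-count impact on instruction encoding bits.
import math

def reg_bits(n_regs):
    # Bits needed to name one of n_regs registers.
    return math.ceil(math.log2(n_regs))

# Operand-specifier bits for a 3-operand ALU instruction (rd, rs1, rs2):
print(3 * reg_bits(16))   # 16 registers  -> 12 bits
print(3 * reg_bits(32))   # 32 registers  -> 15 bits
print(3 * reg_bits(128))  # 128 registers -> 21 bits
```

Going from 32 to 128 architectural registers costs 6 extra bits per 3-operand instruction, which is significant in a 32-bit fixed-length format.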
ISA-LEVEL TRADEOFFS: ADDRESSING MODES
An addressing mode specifies how to obtain an operand of an instruction
- Register
- Immediate
- Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)

More modes:
+ Help better support programming constructs (arrays, pointer-based accesses)
- Make it harder for the architect to design
- Too many choices for the compiler? Many ways to do the same thing complicates compiler design

Read: Wulf, "Compilers and Computer Architecture"
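The memory modes differ only in how the effective address is computed. A minimal sketch of three of them, with made-up register contents (the register names and values are illustrative, not tied to any real ISA):

```python
# Effective-address computation for a few common memory addressing modes.
regs = {"base": 0x1000, "index": 3}

def displacement(base_reg, disp):        # operand at mem[reg + disp]
    return regs[base_reg] + disp

def indexed(base_reg, index_reg):        # operand at mem[base + index]
    return regs[base_reg] + regs[index_reg]

def scaled(base_reg, index_reg, scale):  # operand at mem[base + index*scale]
    # The scale (e.g., 4) matches the element size of an array access.
    return regs[base_reg] + regs[index_reg] * scale

print(hex(displacement("base", 8)))       # 0x1008
print(hex(indexed("base", "index")))      # 0x1003
print(hex(scaled("base", "index", 4)))    # 0x100c
```

The scaled mode is exactly what an `a[i]` access to a 4-byte array compiles to, which is why richer modes "help support programming constructs" at the cost of more address-generation hardware.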
X86 VS. ALPHA INSTRUCTION FORMATS
[Figure: x86 addressing-mode examples — register indirect, absolute, register + displacement]
[Figure: x86 addressing-mode examples — indexed (base + index), scaled (base + index*4)]
OTHER ISA-LEVEL TRADEOFFS
- Load/store vs. memory/memory
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- ...
AGENDA
- Logistics
- Review from last lecture
- Multi-Cores
MULTIPLE CORES ON CHIP
- Simpler and lower power than a single large core
- Large-scale parallelism on chip

Examples:
- AMD Barcelona: 4 cores
- Intel Core i7: 8 cores
- IBM Cell BE: 8+1 cores
- IBM POWER7: 8 cores
- Nvidia Fermi: 448 "cores"
- Intel SCC: 48 cores, networked
- Tilera TILE Gx: 100 cores, networked
- Sun Niagara II: 8 cores
MOORE’S LAW Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
MULTI-CORE Idea: Put multiple processors on the same die.
- Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
- What else could you do with the die area you dedicate to multiple processors?
  - Have a bigger, more powerful core
  - Have larger caches in the memory hierarchy
  - Integrate platform components on chip (e.g., network interface, memory controllers)
WHY MULTI-CORE? Alternative: Bigger, more powerful single core
- Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance are elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)
MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages:
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications

Multi-core disadvantages:
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
WHY MULTI-CORE? Alternative: Bigger caches
+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy
CACHE VS. CORE
WHY MULTI-CORE? Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)
WHY MULTI-CORE? Other alternatives?
- Dataflow?
- Vector processors (SIMD)?
- Integrating DRAM on chip?
- Reconfigurable logic? (general purpose?)
WITH MULTIPLE CORES ON CHIP
- What we want: N times the performance with N times the cores when we parallelize an application on N cores
- What we get:
  - Amdahl's Law (serial bottleneck)
  - Bottlenecks in the parallel portion
CAVEATS OF PARALLELISM
Amdahl's Law
- f: parallelizable fraction of a program
- N: number of processors

  Speedup = 1 / ((1 - f) + f/N)

Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.

- Maximum speedup is limited by the serial portion: the serial bottleneck
- The parallel portion is usually not perfectly parallel:
  - Synchronization overhead (e.g., updates to shared data)
  - Load imbalance overhead (imperfect parallelization)
  - Resource sharing overhead (contention among N processors)
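Plugging numbers into the formula makes the serial bottleneck vivid; the fractions and core counts below are arbitrary examples:

```python
# Amdahl's Law: Speedup = 1 / ((1 - f) + f/N)
# f = parallelizable fraction, N = number of processors.

def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# With 95% of the program parallelizable (f = 0.95):
print(speedup(0.95, 8))       # ~5.9x on 8 cores, not 8x
print(speedup(0.95, 10**6))   # ~20x even with a million cores

# The 5% serial portion caps the speedup at 1 / (1 - f) = 20x.
```

This is why shrinking the serial portion, not just adding cores, dominates the engineering effort in parallel systems.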
THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely

Causes of serialized code sections:
- Sequential portions (Amdahl's "serial part")
- Critical sections
- Barriers

Serialized code sections:
- Reduce performance
- Limit scalability
- Waste energy