Samira Khan University of Virginia Sep 12, 2018


COMPUTER ARCHITECTURE CS 6354: Multi-Cores
Samira Khan, University of Virginia, Sep 12, 2018
The content and concept of this course are adapted from CMU ECE 740.

AGENDA
- Logistics
- Review from last lecture
- Multi-Cores

LOGISTICS
- Project list and sample project proposals posted in Piazza
- Be prepared to spend time on the project
- Project proposal due on Sep 26
- Project proposal presentations: Oct 1-3; slides due on Oct 1, send them to Yizhou
- Groups of at most 2-3 students

Project Proposal
Problem: Clearly define the problem you are trying to solve.
Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings?
Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from the prior work?
Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What experiments do you need to prove or disprove your idea?
Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What must you finish? Can you do more? If you finish, can you submit the work to a conference? Which conference do you think is the best fit?

LITERATURE SURVEY
Goal: Critically analyze work related to your project.
- Pick 2-3 papers related to your project
- Use the same format as the reviews: What is the problem the paper is solving? What is the key insight? What are the advantages and disadvantages? How can you do better?
- Send the list of papers to the TA by Sep 20
- This will become the related work in your proposal

Review 3: Due Sep 19, 2018
- Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," ISCA 2017.

TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System and task-level tradeoffs
- How to divide the labor between hardware and software

SMALL VERSUS LARGE SEMANTIC GAP
John Cocke's RISC (large semantic gap) concept: the compiler generates control signals; open microcode.
Advantages of a small semantic gap (complex instructions):
+ Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
+ Simpler compiler
Disadvantages:
- Larger chunks of work → the compiler has less opportunity to optimize
- More complex hardware → translation to control signals and optimization must be done by hardware
Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.

ISA-LEVEL TRADEOFFS: INSTRUCTION LENGTH
Fixed length: all instructions have the same length
+ Easier to decode a single instruction in hardware
+ Easier to decode multiple instructions concurrently
- Wasted bits in instructions (Why is this bad?)
- Harder to extend the ISA (how to add new instructions?)
Variable length: instruction lengths differ (determined by opcode and sub-opcode)
+ Compact encoding (Why is this good?)
  Intel 432: Huffman encoding (sort of); 6- to 321-bit instructions. How?
- More logic to decode a single instruction
- Harder to decode multiple instructions concurrently
Tradeoffs:
- Code size (memory space, bandwidth, latency) vs. hardware complexity
- ISA extensibility and expressiveness
- Performance? Smaller code vs. imperfect decode
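The code-size side of this tradeoff can be sketched with a toy calculation. The instruction mix and byte counts below are invented for illustration, not taken from any real ISA; the only real pattern assumed is that variable-length encodings give the most common operations the shortest encodings.

```python
# Toy illustration: compare code size for a fixed-length encoding
# (4 bytes per instruction) against a variable-length encoding where
# common instructions get shorter encodings.

# Hypothetical instruction stream with per-opcode frequencies.
program = ["add"] * 50 + ["load"] * 30 + ["branch"] * 15 + ["syscall"] * 5

FIXED_BYTES = 4  # every instruction occupies 4 bytes

# Variable-length: frequent opcodes get 2-byte encodings, rare ones more.
var_bytes = {"add": 2, "load": 2, "branch": 4, "syscall": 6}

fixed_size = len(program) * FIXED_BYTES
variable_size = sum(var_bytes[op] for op in program)

print(f"fixed-length code size:    {fixed_size} bytes")
print(f"variable-length code size: {variable_size} bytes")
```

The smaller footprint is what "saves memory space, bandwidth, latency" on the slide; the price is that instruction boundaries are no longer known before decode begins.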

ISA-LEVEL TRADEOFFS: UNIFORM DECODE
Uniform decode: the same bits in each instruction correspond to the same meaning
- The opcode is always in the same location; ditto operand specifiers, immediate values, ...
- Many "RISC" ISAs: Alpha, MIPS, SPARC
+ Easier decode, simpler hardware
+ Enables parallelism: generate the target address before knowing the instruction is a branch
- Restricts the instruction format (fewer instructions?) or wastes space
Non-uniform decode: e.g., the opcode can be in the 1st through 7th byte in x86
+ More compact and powerful instruction format
- More complex decode logic (e.g., more logic to speculatively generate the branch target)
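As a sketch of why uniform decode simplifies hardware, here is a toy decoder for a MIPS-style 32-bit I-format (6-bit opcode, two 5-bit register specifiers, 16-bit immediate). The specific bit pattern below is just an example value; the point is that the same few masks and shifts work for every instruction.

```python
# Sketch of uniform decode (MIPS-like 32-bit I-format, used only as an
# example): opcode and register specifiers sit at fixed bit positions,
# so decode is a handful of masks/shifts regardless of the instruction.

def decode(insn: int) -> dict:
    return {
        "opcode": (insn >> 26) & 0x3F,  # bits 31..26: always the opcode
        "rs":     (insn >> 21) & 0x1F,  # bits 25..21: source register
        "rt":     (insn >> 16) & 0x1F,  # bits 20..16: target register
        "imm":    insn & 0xFFFF,        # bits 15..0: immediate/displacement
    }

# The field extraction applies before we know what the instruction is;
# this is what lets hardware, e.g., start computing a branch target in
# parallel with identifying the opcode.
fields = decode(0x8C0A0004)  # an example 32-bit instruction word
print(fields)
```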

X86 VS. ALPHA INSTRUCTION FORMATS

ISA-LEVEL TRADEOFFS: NUMBER OF REGISTERS
Affects:
- Number of bits used for encoding a register address
- Number of values kept in fast storage (register file)
- (uarch) Size, access time, power consumption of the register file
Large number of registers:
+ Enables better register allocation (and optimizations) by the compiler → fewer saves/restores
- Larger instruction size
- Larger register file size
- (Superscalar processors) More complex dependency-check logic
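The encoding cost in the first bullet is easy to quantify. A back-of-the-envelope sketch (the 3-operand format is an assumption, typical of RISC-style arithmetic instructions):

```python
import math

# Bits spent on register specifiers in a 3-operand instruction, as a
# function of the number of architectural registers: each specifier
# needs ceil(log2(num_regs)) bits.
def specifier_bits(num_regs: int, operands: int = 3) -> int:
    return operands * math.ceil(math.log2(num_regs))

for n in (8, 16, 32, 128):
    print(f"{n:3d} registers -> {specifier_bits(n)} specifier bits")
```

Going from 32 to 128 registers adds 6 bits to every 3-operand instruction, which is why register count interacts directly with instruction length.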

ISA-LEVEL TRADEOFFS: ADDRESSING MODES
An addressing mode specifies how to obtain an operand of an instruction:
- Register
- Immediate
- Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)
More modes:
+ Better support for programming constructs (arrays, pointer-based accesses)
- Harder for the architect to design
- Too many choices for the compiler? Many ways to do the same thing complicates compiler design
Read: Wulf, "Compilers and Computer Architecture."

X86 VS. ALPHA INSTRUCTION FORMATS

x86 addressing mode examples: register indirect, absolute, register + displacement

x86 addressing mode examples: indexed (base + index), scaled (base + index*4)
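The effective-address arithmetic behind the memory modes just listed can be sketched directly. The register names echo x86, but the contents and displacements below are invented for illustration:

```python
# Illustrative effective-address computations for common memory
# addressing modes (register values and displacements are made up).
regs = {"ebx": 0x1000, "esi": 0x20}   # hypothetical register contents

absolute     = 0x4000                        # address encoded in the instruction
reg_indirect = regs["ebx"]                   # address held in a register
displacement = regs["ebx"] + 0x8             # base + constant offset (struct fields)
indexed      = regs["ebx"] + regs["esi"]     # base + index (byte arrays)
scaled       = regs["ebx"] + regs["esi"] * 4  # base + index*4 (arrays of 4-byte elements)

print(hex(displacement), hex(indexed), hex(scaled))
```

Each extra mode folds one more address computation into a single instruction, which is the "better support for arrays and pointer-based accesses" on the previous slide.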

OTHER ISA-LEVEL TRADEOFFS
- Load/store vs. memory/memory
- Condition codes vs. condition registers vs. compare & test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, or 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- ...

AGENDA
- Logistics
- Review from last lecture
- Multi-Cores

MULTIPLE CORES ON CHIP
Simpler and lower power than a single large core; large-scale parallelism on chip.
- AMD Barcelona: 4 cores
- Intel Core i7: 8 cores
- IBM Cell BE: 8+1 cores
- IBM POWER7: 8 cores
- Nvidia Fermi: 448 "cores"
- Intel SCC: 48 cores, networked
- Tilera TILE Gx: 100 cores, networked
- Sun Niagara II: 8 cores

MOORE'S LAW
Moore, "Cramming more components onto integrated circuits," Electronics, 1965.

MULTI-CORE
Idea: Put multiple processors on the same die.
Technology scaling (Moore's Law) enables more transistors to be placed on the same die area.
What else could you do with the die area you dedicate to multiple processors?
- Have a bigger, more powerful core
- Have larger caches in the memory hierarchy
- Integrate platform components on chip (e.g., network interface, memory controllers)

WHY MULTI-CORE? Alternative: a bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance remain elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled up. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this remain elusive)

MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages:
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → fewer context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages:
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- The number of pins limits data supply for increased demand

WHY MULTI-CORE? Alternative: bigger caches
+ Improves single-thread performance transparently to programmer and compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy

CACHE VS. CORE

WHY MULTI-CORE? Alternative: integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)

WHY MULTI-CORE? Other alternatives?
- Dataflow?
- Vector processors (SIMD)?
- Integrating DRAM on chip?
- Reconfigurable logic? (general purpose?)

WITH MULTIPLE CORES ON CHIP
What we want: N times the performance when we parallelize an application across N cores.
What we get:
- Amdahl's Law (serial bottleneck)
- Bottlenecks in the parallel portion

CAVEATS OF PARALLELISM
Amdahl's Law:
  Speedup = 1 / ((1 - f) + f/N)
where f is the parallelizable fraction of the program and N is the number of processors.
Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
- Maximum speedup is limited by the serial portion: the serial bottleneck
- The parallel portion is usually not perfectly parallel:
  - Synchronization overhead (e.g., updates to shared data)
  - Load imbalance overhead (imperfect parallelization)
  - Resource sharing overhead (contention among N processors)
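Amdahl's Law is easy to evaluate directly; the short sketch below shows how quickly a modest serial fraction caps the achievable speedup (the particular f and N values are arbitrary examples):

```python
# Amdahl's Law: f is the parallelizable fraction of the program,
# n the number of processors. Speedup = 1 / ((1 - f) + f/n).
def speedup(f: float, n: int) -> float:
    return 1.0 / ((1.0 - f) + f / n)

# Even a small serial fraction caps the achievable speedup: as n -> inf,
# the bound is 1 / (1 - f).
for f in (0.5, 0.9, 0.99):
    print(f"f={f}: N=16 gives {speedup(f, 16):.1f}x; upper bound {1/(1-f):.0f}x")
```

With f = 0.9, even infinitely many cores can never deliver more than 10x, which is the "serial bottleneck" bullet above in numbers.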

THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely.
Causes of serialized code sections:
- Sequential portions (Amdahl's "serial part")
- Critical sections
- Barriers
Serialized code sections:
- Reduce performance
- Limit scalability
- Waste energy
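A minimal critical-section example, using Python threads (the function names and iteration counts are arbitrary): the lock forces the shared-counter updates to execute one thread at a time. That serialization is required for correctness, and it is exactly the kind of code section that limits scalability as the thread count grows.

```python
import threading

# Shared state and the lock that protects it.
counter = 0
lock = threading.Lock()

def worker(iterations: int) -> None:
    """Each thread repeatedly executes the critical section."""
    global counter
    for _ in range(iterations):
        with lock:            # critical section: serialized across threads
            counter += 1      # read-modify-write on shared data

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 80000: with the lock, no increment is lost
```

Without the lock, `counter += 1` is not atomic and increments can be lost; with it, every update lands, but only one thread makes progress through the critical section at a time.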
