Computational Methods in Astrophysics ASTR 5210 Dr Rob Thacker (AT319E)


Today's Lecture
More Computer Architecture
Flynn's Taxonomy
Improving CPU performance (instructions per clock)
Instruction Set Architecture classifications
Future of CPU design

Machine architecture classifications
Flynn's taxonomy (see IEEE Trans. Comput., Vol. C-21, pp. 948-960, 1972)
A way of describing the information flow in computers: an architectural definition
Information is divided into instructions (I) and data (D)
There can be single (S) or multiple (M) instances of both
Four combinations: SISD, SIMD, MISD, MIMD

SISD
Single Instruction, Single Data
An absolutely serial execution model
Typically viewed as describing a serial computer, but today's CPUs exploit parallelism internally
(Diagram: a single processor operating on a single data element in memory)

SIMD
Single Instruction, Multiple Data
One instruction is applied to multiple data streams at the same time
(Diagram: a single instruction processor K broadcasts each instruction to an array of processing elements (PEs); each PE typically has its own data memory)
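A minimal C sketch (not from the slides) of the kind of loop SIMD hardware is built for: the same arithmetic is applied to many data elements, so a vector unit or an auto-vectorizing compiler can execute several iterations with a single instruction. The function name and arguments are illustrative.

    #include <stddef.h>

    /* One operation, many data elements: a vector unit (or e.g. gcc -O3
       auto-vectorization) turns several iterations into one SIMD add/multiply. */
    void saxpy(float a, const float *x, const float *y, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = a * x[i] + y[i];   /* same instruction, multiple data */
    }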

MISD
Multiple Instruction, Single Data
Largely a useless definition in practice (not important)
The closest relevant example would be a CPU that can 'pipeline' instructions
(Diagram: each processor has its own instruction stream but operates on the same data stream)
Example: a systolic array, a network of small elements connected in a regular grid, operating under a global clock and reading/writing values from/to their neighbours

MIMD
Multiple Instruction, Multiple Data
Covers a host of modern architectures
Processors have independent data and instruction streams
Processors may communicate directly or via shared memory

Instruction Set Architecture
ISA: the interface between hardware and software
ISAs are typically common to a CPU family, e.g. x86, MIPS (family members are more alike than different)
Assembly language is a realization of the ISA in a form that is easy to remember (and program)

Key Concept in ISA evolution and CPU design
Efficiency gains are to be had by executing as many operations per clock cycle as possible
Instruction-level parallelism (ILP): exploit parallelism within the instruction stream
The programmer does not see this parallelism explicitly
The goal of modern CPU design: maximize the number of instructions per clock cycle (IPC), or equivalently reduce the cycles per instruction (CPI)
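Not on the original slide, but the standard way to tie these quantities together (using the N_i and t_c notation that appears in the RISC vs CISC recap later) is the classic processor performance equation:

    T_{\rm CPU} = N_i \times {\rm CPI} \times t_c

where N_i is the number of instructions executed, CPI the average cycles per instruction, and t_c the clock period. ILP techniques attack CPI (equivalently, raise IPC); the RISC/CISC debate is essentially about trading N_i against CPI and t_c.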

ILP versus thread-level parallelism
Many modern programs have more than one (parallel) "thread" of execution
Instruction-level parallelism breaks down a single thread of execution to try to find parallelism at the instruction level
(Diagram: within one "thread", nearby instructions are executed in parallel even though there is only one thread)

ILP techniques
The two main ILP techniques are:
Pipelining, including additional techniques such as out-of-order execution
Superscalar execution

Pipelining
Multiple instructions are overlapped in execution
A throughput optimization: it doesn't reduce the time for individual instructions
(Diagram: a 7-stage pipeline; once full, instructions 1-7 occupy stages 7-1 simultaneously, and one instruction completes per step)
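An illustrative C fragment (not from the slides) showing why dependences matter to a pipeline: in the first loop every add needs the result of the previous one, so the adds cannot overlap; the second keeps two independent partial sums in flight, exposing more instruction-level parallelism. The function names are illustrative.

    /* One long dependency chain: each add waits for the previous result. */
    double sum_chained(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];              /* every add depends on the one before */
        return s;
    }

    /* Two independent chains: the adds can overlap in the pipeline. */
    double sum_two_chains(const double *x, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i];             /* independent of the s1 chain */
            s1 += x[i + 1];         /* can issue while the s0 add is in flight */
        }
        if (i < n)
            s0 += x[i];
        return s0 + s1;
    }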

Design sweet spot
The pipeline stepping time is determined by the slowest operation in the pipeline
Best speed-up: when all operations take the same amount of time
Net time per instruction = stepping time (i.e., the unpipelined instruction time divided by the number of pipeline stages, when the stages are balanced)
Perfect speed-up factor = number of pipeline stages
Never achieved in practice: there are start-up overheads to consider

Pipeline compromises
(Diagram: a 7-stage pipeline whose stages need alternately 10 ns and 5 ns, so the real work per instruction sums to 55 ns; because every stage must step at the slowest stage's 10 ns, an instruction actually spends 7 × 10 ns = 70 ns in the pipeline, and the 5 ns stages take longer than necessary)

Superscalar execution
Be careful about definitions: superscalar execution is not simply about having multiple instructions in flight
Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or the load/store unit)

Benefits of superscalar design
Having more than one functional unit of a given type helps schedule more instructions within the pipeline
The Pentium 4 pipeline was 20 stages deep!
Enormous throughput potential, but a big penalty when the pipeline stalls
Driving the clock up with a very deep pipeline like this is sometimes called superpipelining; combined with multiple functional units it gives a superscalar, superpipelined design

Other ways of increasing ILP
Branch prediction: predict which path a branch will take by assigning probabilities to the outcomes
Out-of-order execution: independent operations can be rescheduled within the instruction stream
Pipelined functional units: floating point units can themselves be pipelined to increase throughput
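A hedged C illustration (not from the slides) of why branch prediction matters: if the branch outcome is effectively random the predictor is often wrong and the pipeline is flushed; rewriting the condition as a conditional expression lets most compilers emit a branch-free select instead. The function names are illustrative.

    /* Data-dependent branch: hard to predict when the signs are random. */
    double sum_positive_branchy(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            if (x[i] > 0.0)
                s += x[i];
        return s;
    }

    /* Same result, typically compiled to a branch-free select (cmov-style). */
    double sum_positive_branchless(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += (x[i] > 0.0) ? x[i] : 0.0;
        return s;
    }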

Limits of ILP
See D. Wall, "Limits of Instruction-Level Parallelism" (1991)
The probability of hitting hazards (instructions that cannot be overlapped in the pipeline) increases with pipeline length
Instruction fetch and decode rate: remember the "von Neumann" bottleneck? It would be nice to have a single instruction describe multiple operations...
Branch prediction: multiple conditional statements increase the number of branches severely
Cache locality and memory limitations: there are finite limits to the effectiveness of prefetching

Scalar Processor Architectures
(Diagram: progression from a plain 'scalar' design, to a pipelined design exploiting functional-unit parallelism, e.g. the load/store and arithmetic units working on different instructions in parallel, to a superscalar design with multiple functional units of the same type, e.g. 4 floating point units operating at the same time)
Modern processors exploit this parallelism and can't really be called SISD

Complex Instruction Set Computing
CISC: the older design idea (the x86 instruction set is CISC)
Many (powerful) instructions are supported within the ISA
Upside: makes assembly programming much easier (there was a lot of assembly programming in the 1960s-70s)
Upside: reduced instruction memory usage
Downside: designing the CPU is much harder

Reduced Instruction Set Computing
RISC: a newer concept than CISC (but still old)
MIPS, PowerPC and SPARC are all RISC designs
Small instruction set; a CISC-type operation becomes a chain of RISC operations
Upside: easier to design the CPU
Upside: a smaller instruction set permits a higher clock speed
Downside: assembly language programs are typically longer (though this is really a compiler design issue)
Most modern x86 processors are implemented internally using RISC techniques
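A minimal sketch of the load/store distinction; the pseudo-assembly in the comment is purely illustrative and does not correspond to any specific ISA.

    /* CISC style (memory operands):   add   [b], [a]       ; one instruction
       RISC style (load/store only):   load  r1, [a]
                                       load  r2, [b]
                                       add   r3, r1, r2
                                       store [b], r3
       The same work is done either way: a few complex instructions versus a
       longer chain of simple ones that touch memory only via load/store. */
    void add_in_place(const long *a, long *b)
    {
        *b = *b + *a;
    }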

Birth of RISC
Its roots can be traced to three research projects:
IBM 801 (late 1970s, J. Cocke)
Berkeley RISC processor (~1980, D. Patterson)
Stanford MIPS processor (~1981, J. Hennessy)
The Stanford and Berkeley projects were driven by interest in building a simple chip that could be made in a university environment
Commercialization benefitted from the three independent projects:
The Berkeley project begat Sun Microsystems
The Stanford project begat MIPS (used by SGI)

Modern RISC processors
Complexity has nonetheless increased significantly
Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
What if we could remove the scheduling complexity by using a smart compiler...?

VLIW & EPIC
VLIW: very long instruction word
Idea: pack a number of non-interdependent operations into one long instruction
Strong emphasis on the compiler to schedule instructions
When executed, the long words are easily broken up and the operations dispatched to independent execution units
(Diagram: 3 instructions scheduled into one long instruction word)
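A small C illustration (not from the slides) of what a VLIW compiler looks for: the three statements have no dependences on one another, so they could be scheduled into one long instruction word, one slot per functional unit, with no run-time scheduling hardware. The names are illustrative.

    void vliw_candidate(double *a, double *b, double *c,
                        const double *x, const double *y, int i)
    {
        a[i] = x[i] + y[i];   /* slot 1: floating-point add      */
        b[i] = x[i] * y[i];   /* slot 2: floating-point multiply */
        c[i] = x[i] - y[i];   /* slot 3: floating-point subtract */
    }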

VLIW & EPIC II
A natural successor to RISC, designed to avoid the need for complex scheduling hardware in RISC designs
VLIW processors should, in principle, be faster and less expensive than RISC
EPIC: explicitly parallel instruction computing, Intel's (rough) implementation of VLIW
The ISA is called IA-64

VLIW & EPIC III
Hey, it's 2015: why aren't we all using Intel Itanium processors?
AMD figured out an easy extension to make x86 support 64 bits, and introduced multicore
Backwards compatibility + "good enough" performance + poor Itanium compiler performance killed IA-64

RISC vs CISC recap

RISC (popular by the mid 80s): operations on registers
Pro: a small instruction set makes the design easy
Pro: decreased CPI, plus a faster CPU through easier design (reduced t_c)
Con: complicated instructions must be built from simpler ones
Con: efficient compiler technology is absolutely essential

CISC (pre 1970s): operations directly on memory
Pro: many powerful instructions, easy to write assembly language*
Pro: reduced memory requirement for instructions, and a reduced total number of instructions (N_i)*
Con: the ISA is often large and wasteful (typically only 20-25% of instructions are used)
Con: the ISA is hard to debug during development

*Driven by 1970s constraints: memory was small, and (then) faster than the CPU

Who "won"? Not VLIW!
Modern x86 processors are RISC-CISC hybrids
The ISA is translated at the hardware level into shorter, simpler internal instructions
They are very complicated designs, though, with lots of scheduling hardware
MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal
A modern metric for judging how RISC-like a design is: does the ISA access memory only through LOAD and STORE instructions?

Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
Separation of programming model from implementation: High-level Language Based (B 5000); Concept of a Family (IBM 360)
General purpose register machines: Complex Instruction Sets (VAX, Intel 432); Load/Store Architecture (CDC 6600, Cray-1)
RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC)
LIW/"EPIC"? (IA-64)
From Patterson's lectures (UC Berkeley CS252)

Simultaneous multithreading
A completely different technology to ILP
NOT multi-core
Designed to overcome the lack of fine-grained parallelism in code
The idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
This requires the programmer to have created a parallel program for it to work, though
One physical processor looks like two logical processors

Motivation for SMT
A strong motivation for SMT: memory latency is making load operations take longer and longer
We need some way to hide this bottleneck (the memory wall again!)
SMT: switch execution over to threads that already have their data, and execute those
The Tera MTA (whose technology later ended up at Cray) was an attempt to design a computer entirely around this concept

SMT Example: IBM POWER
Dual core, and each core can support 2 SMT threads
"MCM" package: 4 dual-core processors and 144 MB of cache
SMT gives a ~40-60% improvement in performance; not bad
Intel Hyper-Threading gives roughly a 10% improvement

Multiple cores
Simply add more CPUs: the easiest way to increase throughput now
Why do this?
It is a response to the problem of increasing power consumption in modern CPUs
We've essentially reached the limit on improving individual core speeds
The design involves a compromise: n CPUs must now share the memory bus, leaving less bandwidth for each
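A minimal sketch (not from the slides) of the explicit, programmer-visible parallelism that multiple cores require; it assumes an OpenMP-capable compiler (e.g. gcc -fopenmp), and all names are illustrative. Unlike ILP, nothing happens automatically: the programmer marks the loop and each core runs a chunk of the iterations.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double x[N];

    int main(void)
    {
        double sum = 0.0;

        /* The iterations are divided among the available cores; the
           reduction clause combines the per-thread partial sums safely. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            x[i] = (double)i;
            sum += x[i];
        }

        printf("threads available: %d, sum = %g\n",
               omp_get_max_threads(), sum);
        return 0;
    }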

Intel & AMD multi-core processors
Intel: 18-core processors, codename "Haswell"
Design envelope of 150 W, but divide by the number of cores and each core is very power efficient
AMD: 16-core processors, codename "Warsaw"
115 W design envelope
The individual cores are not as good as Intel's, though

Summary
Flynn's taxonomy categorizes instruction and data flow in computers
Modern processors are MIMD
Pipelining and superscalar design improve CPU performance by increasing the number of instructions per clock
The CISC/RISC design approaches appear to be reaching the limits of their applicability
VLIW didn't make an impact; will it return?
In the absence of improved single-core performance, designers are simply integrating more cores