Computational Methods in Astrophysics
ASTR 5210
Dr. Rob Thacker (AT319E)
thacker@ap.smu.ca
Today's Lecture
- More computer architecture
- Flynn's taxonomy
- Improving CPU performance (instructions per clock)
- Instruction Set Architecture classifications
- The future of CPU design
Machine architecture classifications
- Flynn's taxonomy (see IEEE Trans. Comput., Vol. C-21, pp. 948-960, 1972) is a way of describing the information flow in computers: an architectural definition
- Information is divided into instructions (I) and data (D)
- There can be single (S) or multiple (M) instances of each
- Four combinations: SISD, SIMD, MISD, MIMD
SISD
- Single Instruction, Single Data: an absolutely serial execution model
- Diagram: a single processor attached to a memory; one instruction stream operates on a single data element
- Typically viewed as describing a serial computer, but today's CPUs exploit parallelism internally
SIMD
- Single Instruction, Multiple Data: one instruction is applied to multiple data streams at the same time (see the sketch below)
- Diagram: a single instruction processor K broadcasts each instruction to an array of processing elements (PEs); each PE typically has its own data memory
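As a concrete sketch of the SIMD idea on a modern CPU (illustrative only; it assumes an x86 compiler with SSE support, and the function name vec_add is made up), one instruction adds four floats at once:

```c
/* Minimal SIMD sketch using x86 SSE intrinsics: _mm_add_ps maps to the
 * ADDPS instruction, which adds four packed floats in one operation. */
#include <xmmintrin.h>

/* Adds n floats elementwise; to keep the sketch short, n is assumed to
 * be a multiple of 4 and the arrays 16-byte aligned. */
void vec_add(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);             /* load 4 floats */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&out[i], _mm_add_ps(va, vb));  /* 4 adds in one instruction */
    }
}
```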
MISD
- Multiple Instruction, Single Data: a largely useless category (not important)
- The closest relevant example would be a CPU that can 'pipeline' instructions
- Diagram: each processor has its own instruction stream but operates on the same data stream
- Example: a systolic array, a network of small elements connected in a regular grid, operating under a global clock and reading/writing values from/to their neighbours
MIMD
- Multiple Instruction, Multiple Data: covers a host of modern architectures
- Processors have independent instruction and data streams, and may communicate directly or via shared memory
Instruction Set Architecture
- The ISA is the interface between hardware and software
- ISAs are typically common to a CPU family, e.g. x86 or MIPS (family members are more alike than different)
- Assembly language is a realization of the ISA in a form that is easy to remember (and program)
Key concept in ISA evolution and CPU design
- There are efficiency gains to be had by executing as many operations per clock cycle as possible
- Instruction-level parallelism (ILP): exploit parallelism within the instruction stream
- The programmer does not see this parallelism explicitly
- The goal of modern CPU design is to maximize instructions per clock cycle (IPC), or equivalently to reduce cycles per instruction (CPI); see the sketch below
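A minimal sketch of what "parallelism within the instruction stream" means in practice (illustrative only; the function names are made up). Both loops compute the same sum, but the second exposes more independent operations for the hardware to overlap:

```c
#include <stddef.h>

/* Serial dependency chain: each add needs the previous result, so the
 * CPU cannot overlap the additions (low IPC). */
double sum_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];               /* s depends on s from the last iteration */
    return s;
}

/* Four independent accumulators: the four adds in each iteration have no
 * mutual dependencies, so a pipelined/superscalar CPU can keep several
 * floating-point operations in flight per cycle (higher IPC). */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* remainder loop */
    return (s0 + s1) + (s2 + s3);
}
```

Optimizing compilers can perform this transformation themselves; for floating point it usually requires relaxed-math flags, since reassociating the sum changes the rounding.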
ILP versus thread-level parallelism
- Many modern programs have more than one (parallel) "thread" of execution
- ILP instead breaks down a single thread of execution to try to find parallelism at the instruction level
- Diagram: instructions 1, 2, and 3 from one thread are executed in parallel, even though there is only one thread
ILP techniques
The two main ILP techniques are:
- Pipelining, including additional techniques such as out-of-order execution
- Superscalar execution
Pipelining
- Multiple instructions are overlapped in execution
- A throughput optimization: it does not reduce the time for an individual instruction
- Diagram: a 7-stage pipeline; once full, instructions 1-7 occupy stages 1-7 simultaneously, and one instruction completes per step
Design sweet spot
- The pipeline stepping time is determined by the slowest operation in the pipeline
- The best speed-up is obtained when all operations take the same amount of time
- Then the net time per instruction = (unpipelined instruction time) / (number of pipeline stages), since one instruction completes per step
- Perfect speed-up factor = number of pipeline stages
- Never achieved in practice: there are start-up overheads to consider while the pipeline fills (see the sketch below)
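The start-up overhead is easy to quantify. A small worked example (assumed numbers: a 7-stage pipeline with a 10 ns step, matching the diagrams): running n instructions takes (k + n - 1) steps rather than n * k, so the speed-up only approaches k for large n:

```c
/* Pipeline fill overhead: time for n instructions through a k-stage
 * pipeline with stepping time t is (k + n - 1) * t, versus n * k * t
 * unpipelined.  The speed-up tends to k as n grows. */
#include <stdio.h>

int main(void) {
    const int k = 7;          /* pipeline stages (as in the diagrams) */
    const double t = 10e-9;   /* stepping time: 10 ns (assumed)       */
    for (long n = 1; n <= 1000000; n *= 100) {
        double unpiped = (double)n * k * t;
        double piped   = (double)(k + n - 1) * t;
        printf("n = %7ld  speed-up = %.2fx\n", n, unpiped / piped);
    }
    return 0;   /* prints 1.00x, 6.60x, 7.00x, 7.00x */
}
```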
Pipeline compromises
- Example: a 7-stage pipeline whose stages take 10, 5, 10, 5, 10, 5, and 10 ns
- The total work per instruction is 55 ns, but the pipeline must step at the slowest stage time (10 ns), so the effective per-instruction latency is 7 x 10 = 70 ns
- The 5 ns stages are held for longer than necessary
Superscalar execution
- Be careful about definitions: superscalar execution is not simply about having multiple instructions in flight (pipelining already achieves that)
- Superscalar processors have more than one of a given functional unit, such as the arithmetic logic unit (ALU) or the load/store unit
Benefits of superscalar design
- Having more than one functional unit of a given type can help schedule more instructions within the pipeline
- The Pentium 4 pipeline was 20 stages deep: enormous throughput potential, but a big pipeline-stall penalty
- The incorporation of multiple units into the pipeline is sometimes called superpipelining
Other ways of increasing ILP
- Branch prediction: predict which path will be taken by assigning probabilities to the possible outcomes (see the sketch below)
- Out-of-order execution: independent operations can be rescheduled within the instruction stream
- Pipelined functional units: floating-point units can themselves be pipelined to increase throughput
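A brief sketch of why branch prediction matters to the programmer (illustrative only; the function names are made up). The branch in the first loop is cheap when the data is sorted (long, predictable runs of taken/not-taken) and expensive when the values are random, where each misprediction flushes the pipeline. The second loop removes the branch entirely:

```c
#include <stddef.h>

/* Data-dependent branch: easy to predict on sorted input, hard on
 * random input. */
long count_big(const int *data, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)      /* branch whose outcome depends on data */
            count++;
    }
    return count;
}

/* Branchless alternative: the control dependency becomes a data
 * dependency, so there is nothing for the predictor to miss. */
long count_big_branchless(const int *data, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (data[i] >= 128);   /* comparison yields 0 or 1 */
    return count;
}
```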
Limits of ILP
- See D. Wall, "Limits of Instruction-Level Parallelism" (1991)
- The probability of hitting hazards (instructions that cannot be pipelined) increases with pipeline length
- Instruction fetch and decode rate: remember the "von Neumann" bottleneck? It would be nice to have a single instruction trigger multiple operations...
- Branch prediction: multiple conditional statements increase the number of branches severely
- Cache locality and memory limitations: there are finite limits to the effectiveness of prefetching
Scalar processor architectures
- 'Scalar': one instruction operates on one data item at a time
- Pipelined: functional-unit parallelism, e.g. the load/store and arithmetic units can be used in parallel (instructions overlap in flight)
- Superscalar: multiple functional units, e.g. 4 floating-point units can operate at the same time
- Modern processors exploit this parallelism and can't really be called SISD
Complex Instruction Set Computing
- CISC is the older design idea (the x86 instruction set is CISC)
- Many (powerful) instructions are supported within the ISA
- Upside: makes assembly programming much easier (there was a lot of assembly programming in the 1960s-70s)
- Upside: reduced instruction memory usage
- Downside: designing the CPU is much harder
Reduced Instruction Set Computing
- RISC is a newer concept than CISC (but still old)
- MIPS, PowerPC, and SPARC are all RISC designs
- Small instruction set: a CISC-type operation becomes a chain of RISC operations (see the sketch below)
- Upside: the CPU is easier to design
- Upside: a smaller instruction set permits a higher clock speed
- Downside: assembly language programs are typically longer (though this is largely a compiler-design issue)
- Most modern x86 processors are implemented internally using RISC techniques
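As an illustration of a CISC operation becoming a chain of RISC operations (the assembly below is schematic, written from memory, not actual compiler output):

```c
/* Consider incrementing a counter held in memory: */
void bump(int *counter) {
    *counter += 1;
}

/* CISC (x86-style): one instruction can operate directly on memory:
 *
 *     add dword ptr [rdi], 1
 *
 * RISC (MIPS-style): only loads and stores touch memory, so the same
 * operation becomes a chain of three simple instructions:
 *
 *     lw   $t0, 0($a0)     # load word from memory
 *     addi $t0, $t0, 1     # add immediate, register to register
 *     sw   $t0, 0($a0)     # store word back to memory
 */
```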
Birth of RISC
- Roots can be traced to three research projects:
  - IBM 801 (late 1970s, J. Cocke)
  - Berkeley RISC processor (~1980, D. Patterson)
  - Stanford MIPS processor (~1981, J. Hennessy)
- The Stanford and Berkeley projects were driven by interest in building a simple chip that could be made in a university environment
- Commercialization benefited from the three independent projects:
  - the Berkeley project begat Sun Microsystems
  - the Stanford project begat MIPS (used by SGI)
Modern RISC processors
- Complexity has nonetheless increased significantly
- Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
- What if we could remove the scheduling complexity by using a smart compiler...?
VLIW & EPIC
- VLIW: very long instruction word
- Idea: pack a number of non-interdependent operations into one long instruction (see the sketch below)
- Strong emphasis on the compiler to schedule instructions
- When executed, the long words are easily broken up and the operations dispatched to independent execution units
- Diagram: three instructions scheduled into one long instruction word
VLIW & EPIC II
- A natural successor to RISC, designed to avoid the need for complex scheduling hardware in RISC designs
- VLIW processors should, in principle, be faster and less expensive than RISC
- EPIC (explicitly parallel instruction computing) is Intel's implementation (roughly) of VLIW
- The ISA is called IA-64
VLIW & EPIC III
- Hey, it's 2015: why aren't we all using Intel Itanium processors?
- AMD figured out an easy extension to make x86 support 64 bits, and introduced multicore parts
- Backwards compatibility + "good enough" x86 performance + poor Itanium compiler performance killed IA-64
RISC vs CISC recap
RISC (popular by the mid 80s): operations on registers
- Pro: a small instruction set makes the design easy
- Pro: decreased CPI, and also a faster CPU through the easier design (reduced cycle time t_c)
- Con: complicated instructions must be built from simpler ones
- Con: efficient compiler technology is absolutely essential
CISC (pre 1970s): operations directly on memory
- Pro: many powerful instructions, easy to write assembly language*
- Pro: reduced memory requirement for instructions, and a reduced total instruction count N_i*
- Con: the ISA is often large and wasteful (only 20-25% of it sees real use)
- Con: the ISA is hard to debug during development
*Driven by 1970s constraints: memory was small, and faster than the CPU
Who "won"? Not VLIW!
- Modern x86 processors are RISC-CISC hybrids
- The ISA is translated at the hardware level into shorter, simpler internal instructions
- They are very complicated designs though, with lots of scheduling hardware
- MIPS, Sun SPARC, and DEC Alpha were much truer implementations of the RISC ideal
- A modern metric for the "RISCiness" of a design: does the ISA access memory only through explicit LOAD/STORE instructions?
Evolution of instruction sets (from Patterson's lectures, UC Berkeley CS252)
- Single accumulator (EDSAC, 1950)
- Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953)
- Separation of the programming model from the implementation:
  - high-level-language-based machines (B5000, 1963)
  - concept of a family (IBM 360, 1964)
- General-purpose register machines:
  - complex instruction sets (VAX, Intel 432, 1977-80)
  - load/store architectures (CDC 6600, Cray-1, 1963-76)
- RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ... 1987)
- LIW/"EPIC"? (IA-64, ... 1999)
Simultaneous multithreading
- A completely different technology from ILP, and NOT multi-core
- Designed to overcome the lack of fine-grained parallelism in code
- The idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
- This requires the programmer to have created a parallel program, though (see the sketch below)
- One physical processor looks like two logical processors
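A minimal POSIX-threads sketch (illustrative only; assumes a POSIX system): with two runnable software threads, the OS can place both onto the two logical processors of one SMT core, so stalls in one thread's pipeline can be filled with instructions from the other:

```c
/* Two software threads that an SMT core can interleave.
 * Compile with: cc smt_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = 1; i <= 100000000L; i++)
        s += 1.0 / i;            /* independent work per thread */
    printf("thread %ld: s = %f\n", id, s);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* The OS may schedule these on two logical processors that share one
     * physical core; SMT fills one thread's pipeline gaps (e.g. cache-miss
     * stalls) with instructions from the other thread. */
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```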
Motivation for SMT
- A strong motivation for SMT: memory latency makes load operations take longer and longer
- We need some way to hide this bottleneck (the memory wall again!)
- SMT switches execution over to threads that already have their data, and executes those
- The Tera MTA was an attempt to design a computer entirely around this concept (Tera later acquired Cray Research, becoming Cray Inc.)
SMT example: IBM POWER5
- Dual core, and each core can support 2 SMT threads
- "MCM" (multi-chip module) package: 4 dual-core processors and 144 MB of cache
- SMT gives roughly a 40-60% improvement in performance; not bad
- Intel's Hyper-Threading gives roughly a 10% improvement
Multiple cores
- Simply add more CPUs: now the easiest way to increase throughput
- Why do this? It is a response to the problem of increasing power dissipation in modern CPUs
- We have essentially reached the limit on improving individual core speeds
- The design involves a compromise: n cores must now share the memory bus, so each gets less bandwidth (see the sketch below)
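A minimal sketch of spreading work across cores (illustrative only; assumes a C compiler with OpenMP support). Note the caveat from the last bullet: a loop like this mostly streams memory, so the shared memory bus, rather than the core count, may limit the speed-up:

```c
/* Spreading a loop across cores with OpenMP.
 * Compile with: cc -fopenmp multicore_demo.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    /* Each core gets a chunk of the iterations.  Compute-heavy loops
     * scale well; memory-streaming loops like this one are limited by
     * the memory bandwidth the cores share. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}
```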
Intel & AMD multi-core processors
- Intel: 18-core processors, codename "Haswell"
  - 150 W design envelope, but dividing by the number of cores shows each core is very power-efficient
- AMD: 16-core processors, codename "Warsaw"
  - 115 W design envelope
  - individual cores not as good as Intel's, though
Summary
- Flynn's taxonomy categorizes computers by their instruction and data flow
- Modern processors are MIMD
- Pipelining and superscalar design improve CPU performance by increasing the number of instructions per clock
- The CISC and RISC design approaches appear to be reaching the limits of their applicability
- VLIW didn't make an impact; will it return?
- In the absence of improved single-core performance, designers are simply integrating more cores