A whistle-stop tour of some parallel architectures that are used in practice
David Gregg
Department of Computer Science
University of Dublin, Trinity College

Language Design by Committee According to the Auraicept na n-Éces, Fenius journeyed from Scythia together with Goídel mac Ethéoir, Íar mac Nema and a retinue of 72 scholars. They came to the plain of Shinar to study the confused languages at Nimrod's tower. Finding that the speakers had already dispersed, Fenius sent his scholars to study them, staying at the tower, coordinating the effort. After ten years, the investigations were complete, and Fenius created in Bérla tóbaide "the selected language", taking the best of each of the confused tongues, which he called Goídelc, Goidelic, after Goídel mac Ethéoir. He also created extensions of Goídelc, called Bérla Féne, after himself, Íarmberla, after Íar mac Nema, and others, and the Beithe-luis-nuin (the Ogham) as a perfected writing system for his languages. The names he gave to the letters were those of his 25 best scholars.

Parallel Architectures
There is a wide range of parallel architectures
Many are in use, and many more have been proposed
The same ideas emerge again and again
But Moore's law means that the components we build parallel computers from keep changing
So the trade-offs between the approaches change

Levels of parallelism
Different parallel architectures exploit parallelism at different levels
  Coarse or fine grain
  Different types of synchronization
Trend towards architectures that exploit multiple types and levels of parallelism

E.g. Cell B/E
The Cell Broadband Engine is a processor developed by Sony, Toshiba and IBM
Its most celebrated use is in the PlayStation 3
  Good graphics, physics, etc.
Also used in IBM Cell clusters for HPC
  Computing the values of options
Also used in Toshiba consumer electronics

Cell B/E

Flynn's Taxonomy
Prof. Mike Flynn's famous taxonomy of parallel computers
Single instruction stream, single data stream
  E.g. VLIW, found in the Cell BE
Single instruction stream, multiple data streams
  E.g. vector SIMD, found in the Cell BE
Multiple instruction streams, multiple data streams
  E.g. multiple parallel cores, found in the Cell BE
Multiple instruction streams, single data stream
  Unusual; sometimes found in safety-critical systems
  E.g. Texas Instruments Hercules processors

Flynn's Taxonomy
It is quite difficult to fit many parallel architectures into Flynn's taxonomy
It attempted to describe the different types of parallel architecture in 1966
It's not entirely clear where some architectures fit
  E.g. instruction level parallel processors
  E.g. fine-grain speculative multithreading
The most important distinction is between SIMD and MIMD
These terms are used a lot

Instruction level parallelism
Attempt to execute multiple machine instructions from the same instruction stream in parallel
Two major approaches
  Get the compiler to find the parallelism
    VLIW, in-order superscalar
  Get the hardware to find independent instructions in the running program
    Out-of-order superscalar
The Cell B/E main processor is superscalar
The accelerator cores are VLIW
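
For illustration only, here is a small C fragment (a made-up example, not from the slides) where the four multiplies have no dependences on each other, so either a VLIW compiler or an out-of-order superscalar core could issue them in the same cycle:

    /* The four multiplies are independent of each other, so they can be
       executed in parallel; only the final adds must wait for them. */
    float dot4(const float *a, const float *b)
    {
        float p0 = a[0] * b[0];          /* independent */
        float p1 = a[1] * b[1];          /* independent */
        float p2 = a[2] * b[2];          /* independent */
        float p3 = a[3] * b[3];          /* independent */
        return (p0 + p1) + (p2 + p3);    /* tree-shaped adds expose more ILP */
    }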

Vector Computers
Most machine instructions operate on a single value at a time
Vector computers operate on an array of data with a single instruction
Many parallel computing problems operate on large arrays of data
Vector computers were some of the earliest and most successful parallel computers
  E.g. many supercomputers designed by Seymour Cray

Vector Computers
Cray-1: the glass panels allow the viewer to "see more Cray"

Vector Computers
Vector instructions are a form of SIMD
Very popular on modern computers
Accelerate multimedia, games, scientific and graphics apps
Vector sizes are smallish
  E.g. 16 bytes (four floats) – SSE, Neon
  Intel AVX uses 32 bytes
  Intel AVX-512 uses 64 bytes
Vector SIMD is used in the Cell BE, where all SPE instructions are SIMD
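
As a concrete sketch (assuming a compiler that provides the x86 SSE intrinsics via immintrin.h), four pairs of floats can be added with a single SIMD instruction:

    #include <immintrin.h>   /* x86 SIMD intrinsics */

    /* Add arrays four floats at a time; n is assumed to be a multiple of 4. */
    void vec_add(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* 4 adds in one instruction */
        }
    }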

Multithreaded Computers
Pipelined computers spend a lot of their time stalled
  Cache misses, branch mispredictions, long-latency instructions
Multi-threaded computers attempt to cover the cost of stalls by running multiple threads on the same processor core
Require very fast switching between threads
  Separate register set for each thread

Multithreaded Computers
Three major approaches
  Switch to a different thread every cycle
  Switch thread on a major stall
  Use OOO superscalar hardware to mix instructions from different threads together (aka simultaneous multithreading)
The big problem is software
  The programmer must divide the code into threads
Multithreading is used in the Cell main core
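
The software side of this is the same whatever the hardware does with the threads. A minimal POSIX threads sketch (assuming pthreads is available) that divides a summation between two threads the hardware can then interleave:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];                 /* work shared between threads */

    struct range { int lo, hi; double sum; };

    /* Each thread sums its own half of the array. */
    static void *partial_sum(void *arg)
    {
        struct range *r = arg;
        r->sum = 0.0;
        for (int i = r->lo; i < r->hi; i++)
            r->sum += data[i];
        return NULL;
    }

    int main(void)
    {
        struct range r[2] = { {0, N / 2, 0.0}, {N / 2, N, 0.0} };
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, partial_sum, &r[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("sum = %f\n", r[0].sum + r[1].sum);
        return 0;
    }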

Multi-core computers
The power wall, ILP wall and memory wall pushed architecture designers to put multiple cores on a single chip
Two major approaches
  Shared memory multi-core
  Distributed memory multi-core
Shared memory is usually easier to program, but the hardware does not scale as well
  Shared memory needs cache coherency
The big problem is dividing the software into threads
The Cell BE has distributed-memory SPEs, but the SPEs can also access the global memory shared with the PPE

Shared memory multiprocessors
Shared memory multiprocessors are parallel computers where multiple processors share a single memory
Main approaches
  Symmetric multiprocessors
  Non-uniform memory access (NUMA)
    Usually a distributed shared memory
The big problem is software
IBM Cell Blades allow two Cell processors on the same board to share memory
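
To illustrate why shared memory is easier to program (this example assumes an OpenMP-capable C compiler, e.g. with -fopenmp): every core can read and write the arrays through ordinary loads and stores, so parallelising a loop needs only a pragma.

    /* Threads write disjoint parts of c[] directly in the shared address
       space; no explicit data movement is needed. Compile with OpenMP
       support enabled (e.g. -fopenmp). */
    void add_arrays(float *c, const float *a, const float *b, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }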

Distributed memory multicomputers
A system of communicating processors, each with their own memory
Distributed memory systems scale extremely well
  The hardware is simple
Huge variety of configurations
  Ring, star, tree, hypercube, mesh
Programming is arguably even more difficult than for shared memory machines
  Data must always be moved explicitly to where it is needed
The Cell processor provides 8 SPEs, which each have their own local memory and are connected in a ring topology
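
A sketch of what explicit data movement looks like, assuming an MPI installation (MPI is the usual message-passing library on such machines; the Cell itself uses DMA transfers rather than MPI):

    #include <mpi.h>

    /* Nothing is shared: rank 0 must explicitly send its buffer to rank 1
       over the interconnect before rank 1 can use the data. */
    int main(int argc, char **argv)
    {
        int rank;
        double buf[1024] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }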

Clusters
A group of loosely coupled computers that work closely together and can be viewed as a single machine
Multiple stand-alone machines connected by a network
Clusters are a form of distributed memory multicomputer
  But may support a virtual shared memory using software
  E.g. PVM (parallel virtual machine)
The classic example is the Beowulf cluster, made from standard PCs, running Linux or FreeBSD, connected by Ethernet

Clusters
Most big supercomputers these days are clusters
It's usually cheaper to build large machines out of thousands of cheap PCs
The Cell BE is used in IBM Cell clusters
Clusters of PlayStation 3 machines running Linux are also being used as cheap supercomputers

Graphics Processing Units (GPU)
The emergence of GPUs is a major recent development in computer architecture
GPUs have very large amounts of parallelism on a single chip
  Vector processing units
  Multithreaded processing units
  Multiple cores
GPUs were originally aimed at graphics rendering
  Very large memory bandwidth
  But high memory latency
  Lots of low-precision floating point units
  GPUs implement the graphics pipeline directly
Generality was once very limited, but is getting better with each new generation of GPU

Graphics Processing Units (GPU)
GPUs are increasingly programmable
Trend towards using GPUs for more "general purpose" computation (GPGPU)
But mapping general purpose applications to the GPU architecture is challenging
  Only certain types of apps run well on a GPU
Special languages are popular for GPGPU
  E.g. CUDA, OpenCL
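
To give a flavour of such languages, here is a minimal OpenCL C kernel for adding two arrays (the lengthy host-side setup code that compiles and launches the kernel is omitted); each GPU work-item handles one element:

    /* OpenCL C kernel: one work-item (GPU thread) per array element.
       The host program launches this over an n-element index space. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   /* this work-item's index */
        c[i] = a[i] + b[i];
    }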

Field Programmable Gate Arrays
There is an array of gates that can be used to implement hardware circuits
Field programmable: the array can be configured (programmed) at any time, "in the field"
FPGAs can exploit parallelism at many levels
  Special purpose processing units
    E.g. bit twiddling operations
  Fine grain parallelism
    The circuit is clocked, so very fine grain synchronization is possible
  Coarse grain parallelism
    An FPGA can implement any parallel architecture that fits on the chip

Field Programmable Gate Arrays FPGAs are commonly used to implement special-purpose parallel architectures E.g. Systolic arrays

Low Power Parallel Processors
There have been several big changes in computing (exact categories and dates are very debatable)
  Giant computer room with a single user – 1950s
  Mainframe computers with hundreds of users – 1960s
  Minicomputer with dozens of users – 1970s
  Microcomputer with a single user – 1980s
  Networks and microcomputer servers – 1990s
  Continuous network connection – 2000s
  Mobile internet devices – 2010s

Low Power Parallel Processors
Power and energy are extremely important for mobile computing
  Trend towards mobile devices, mobile phones, wearable computing, the internet of things
Battery life depends directly on energy consumption
Even devices that each consume very little energy can really add up if you have dozens or hundreds of them
Parallelism is an alternative to a faster clock
  Dynamic power ≈ Capacitance × Frequency × Voltage²
  A significant increase in clock frequency normally requires an increase in voltage
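
A rough worked example of why parallelism wins here (the numbers are illustrative only, not from the slides): two cores at half the frequency and a lower voltage can match one fast core's throughput for much less dynamic power.

    #include <stdio.h>

    /* Compare one core at (f, V) with two cores at (f/2, 0.8*V),
       using dynamic power ~ C * f * V^2, in normalised units. */
    int main(void)
    {
        double C = 1.0, f = 1.0, V = 1.0;
        double one_fast_core = C * f * V * V;                         /* 1.00 */
        double two_slow_cores = 2.0 * C * (f / 2) * (0.8 * V) * (0.8 * V); /* 0.64 */
        printf("one fast core: %.2f  two slow cores: %.2f\n",
               one_fast_core, two_slow_cores);
        return 0;
    }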

Low Power Parallel Processors
Low power multicore processors often trade software complexity for hardware complexity
Many are VLIW processors
  Instruction-level parallelism under software control
  Complicated compilers
On-chip scratchpad memory instead of caches
  Caches need logic to find each cache line
    Cache line tags, comparators, replacement logic, etc.
  Scratchpad memory is fully under software control
  Data transfers use direct memory access (DMA)
  More complicated compilers and software
Cores do not share on-chip memory
  The programmer must manage each core's scratchpad memory
  Explicit movement of data between cores
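
A sketch of the double-buffering pattern this implies, using hypothetical dma_get(), dma_wait() and compute() helpers (stand-ins for a platform's real DMA primitives, such as the Cell SDK's mfc_get family), so that computation on one buffer overlaps the transfer filling the other:

    #define CHUNK 4096

    /* Hypothetical platform interface: start a DMA transfer from global
       memory into the scratchpad, and wait for a tagged transfer. */
    void dma_get(void *local, unsigned long global_addr, int size, int tag);
    void dma_wait(int tag);
    void compute(float *data, int n);   /* hypothetical per-chunk work */

    void process_stream(unsigned long global_addr, int nchunks)
    {
        static float buf[2][CHUNK / sizeof(float)];   /* two scratchpad buffers */
        int cur = 0;

        dma_get(buf[cur], global_addr, CHUNK, cur);   /* prefetch first chunk */
        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                      /* start fetching next chunk */
                dma_get(buf[next],
                        global_addr + (unsigned long)(i + 1) * CHUNK,
                        CHUNK, next);
            dma_wait(cur);                            /* wait for current chunk */
            compute(buf[cur], CHUNK / sizeof(float)); /* work while DMA proceeds */
            cur = next;
        }
    }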