Multi-core and Beyond COMP25212 System Architecture

Multi-core and Beyond COMP25212 System Architecture Dr. Javier Navaridas

From Last Lecture
- Explain the differences between snoopy and directory-based cache coherence protocols
  - Global view vs local view plus a directory
  - Minimal info vs extra info for the directory and remote shared lines
  - Centralized communication vs parallel communication
  - Poor scalability vs better scalability
- Explain the concept of false sharing
  - Pathological behaviour when two unrelated variables are stored in the same cache line
  - If they are written often by two different cores, they will generate lots of invalidate/update traffic (see the sketch below)
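A minimal C sketch of false sharing, assuming a 64-byte cache line; the struct names and loop counts are illustrative. Two threads each update their own counter, but because both counters live in the same cache line every write invalidates the other core's copy; padding one counter out to its own line removes the coherence traffic.

```c
/* Illustrative false-sharing sketch (assumes 64-byte cache lines). */
#include <pthread.h>
#include <stdio.h>

struct counters {
    long a;                       /* written by thread 0                   */
    long b;                       /* written by thread 1 -- same line as a */
};

struct padded_counters {
    long a;
    char pad[64 - sizeof(long)];  /* pushes b into the next cache line     */
    long b;
};

static struct counters shared;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump_a, NULL);
    pthread_create(&t1, NULL, bump_b, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With struct counters the two writers keep invalidating each other's
     * copy of the line; swapping in struct padded_counters avoids this. */
    printf("%ld %ld\n", shared.a, shared.b);
    return 0;
}
```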

On-chip Interconnects

The Need for Networks
- Any multi-core system must clearly contain the means for cores to communicate
  - With memory
  - With each other (coherence/synchronization)
- There are many different options
  - Each has different characteristics and trade-offs: performance, energy, area, fault tolerance, scalability
  - Each may provide different functionality and can restrict the type of coherence mechanism

The Need for Networks
- Most multi- and many-core applications require some sort of communication
  - Why have so many cores otherwise? We rarely run that many independent applications at the same time
- Multi-core systems need to provide a way for cores to communicate effectively
  - What 'effectively' means depends on the context

The Need for Networks: Shared-Memory Applications
- Multicores need to ensure consistency and coherence
- Memory consistency: ensure correct ordering of memory accesses
  - Synchronization within a core
  - Synchronization across cores – needs to send messages
- Memory coherence: ensure changes are seen everywhere
  - Snooping: all the cores see what is going on – centralized
  - Directory: distributed communications; more traffic required, but higher parallelism achieved – needs an interconnection network

The Need for Networks: Distributed-Memory Applications
- Independent processor/store pairs
  - Each core has its own memory, independent from the rest
  - No coherence is guaranteed at the processor level, which saves chip area
- Communication/synchronization is introduced explicitly in the code – message passing (see the sketch below)
  - Needs to be handled efficiently to avoid becoming the bottleneck
  - The interconnection network becomes an important part of the design
- E.g. Intel Single-chip Cloud Computer – SCC (2009), later replaced by the cache-coherent Xeon Phi (2012)
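As an illustration of explicit message passing in code, here is a generic MPI sketch (not the SCC's actual programming interface); the ranks, tag and payload value are arbitrary. One core sends a value that another receives, with no shared memory involved.

```c
/* Minimal message-passing sketch using MPI (illustrative only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data to share          */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* explicit communication */
    }

    MPI_Finalize();
    return 0;
}
```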

Evaluating Networks
- Bandwidth: amount of data that can be moved per unit of time
- Latency: how long it takes a given piece of a message to traverse the network
- Congestion: the effect on bandwidth and latency of using the network close to its peak
- Fault tolerance
- Area
- Power dissipation

Bandwidth vs. Latency
- Definitely not the same thing:
- A truck carrying one million 256-GByte flash memory cards to London
  - Latency = 4 hours (14,400 s)
  - Bandwidth ≈ 142 Tbit/s (≈1.4 × 10^14 bit/s) – see the worked calculation below
- A broadband internet connection
  - Latency = 100 microseconds (10^-4 s)
  - Bandwidth = 100 Mbit/s (10^8 bit/s)
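A quick back-of-the-envelope check of those figures, assuming decimal gigabytes (1 GByte = 10^9 bytes); all values are approximate.

```c
/* Worked numbers for the truck-vs-broadband comparison (decimal units). */
#include <stdio.h>

int main(void) {
    double cards        = 1e6;             /* flash cards on the truck    */
    double bits_each    = 256e9 * 8;       /* 256 GByte per card, in bits */
    double trip_secs    = 4 * 3600;        /* 4-hour drive to London      */

    double truck_bw     = cards * bits_each / trip_secs;   /* bit/s       */
    double broadband_bw = 100e6;                            /* 100 Mbit/s  */

    printf("truck:     latency %.0f s, bandwidth %.2e bit/s (~%.0f Tbit/s)\n",
           trip_secs, truck_bw, truck_bw / 1e12);
    printf("broadband: latency 1e-4 s, bandwidth %.2e bit/s\n", broadband_bw);
    return 0;
}
```

Enormous bandwidth but hopeless latency for the truck; the reverse for the broadband link.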

Important Features of a NoC
- Topology: how cores and networking elements are connected together
- Routing: how traffic moves through the topology
- Switching: how traffic moves from one component to the next

Topology

Bus
- Common wire interconnection – a broadcast medium
- Only a single user at any point in time
- Controlled by a clock – divided into time slots
  - A sender must 'grab' a slot (via arbitration) to transmit
- Often 'split transaction'
  - E.g. send the memory address in one slot
  - Data returned by memory in a later slot
  - Intervening slots free for use by others
- Main scalability issue is limited throughput
  - Bandwidth is divided by the number of cores

Crossbar
- E.g. to connect N inputs to N outputs
- Can achieve 'any to any' (disjoint) connections in parallel
- Area and power scale quadratically with the number of nodes – not scalable

Tree
- Variable bandwidth
- Variable latency (depends on the depth of the tree)
- Reliability?

Fat Tree

Ring
- Simple, but:
  - Low bandwidth
  - Variable latency
- E.g. Cell Processor – PS3 (2006)

Mesh / Grid
- Reasonable bandwidth
- Variable latency
- Convenient physical layout for very large systems
- E.g. Tilera TILE64 Processor (2007), Xeon Phi Knights Landing Processor (2016)

Routing

Length of Routes
- Minimal routing
  - Always selects the shortest path to a destination
  - Packets always move closer to their destination
  - Packets are more likely to be blocked
- Non-minimal routing
  - Packets can be diverted to avoid blocking (keeping the traffic moving) or to escape congested areas
  - Risk of livelock

Oblivious Routing
- Unaware of network state
  - Deterministic routing: fixed path, e.g. XY routing (see the sketch below)
  - Non-deterministic routing: more complex strategies
- Pros
  - Simpler router
  - Deadlock-free oblivious routing
- Cons
  - Prone to contention
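A minimal sketch of deterministic XY routing on a 2-D mesh: route fully in the X dimension first, then in Y. The coordinate scheme and port names are illustrative assumptions, not taken from any specific router design.

```c
/* Deterministic XY routing sketch for a 2-D mesh. */
#include <stdio.h>

typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

/* Decide the output port at router (x, y) for a packet destined to (dx, dy). */
port_t xy_route(int x, int y, int dx, int dy) {
    if (dx > x) return EAST;
    if (dx < x) return WEST;
    if (dy > y) return NORTH;
    if (dy < y) return SOUTH;
    return LOCAL;               /* arrived: deliver to the attached core */
}

int main(void) {
    /* Trace a packet from (0,0) to (2,1) hop by hop. */
    int x = 0, y = 0, dx = 2, dy = 1;
    const char *names[] = { "EAST", "WEST", "NORTH", "SOUTH", "LOCAL" };
    while (1) {
        port_t p = xy_route(x, y, dx, dy);
        printf("(%d,%d) -> %s\n", x, y, names[p]);
        if (p == LOCAL) break;
        if (p == EAST) x++; else if (p == WEST) x--;
        else if (p == NORTH) y++; else y--;
    }
    return 0;
}
```

Because every packet between a given source and destination follows the same fixed path, the router logic stays simple, but hot routes cannot be avoided.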

Adaptive Routing
- Aware of network state: packets adapt to avoid contention
- Pros
  - Higher performance
- Cons
  - Router instrumentation is required
  - More complex, i.e. more area and power
  - Deadlock prone – avoiding deadlock requires even more hardware
- Barely used in NoCs

Switching

Packet Switching
- Data is split into small packets, and these into flits
- Some extra info is added to the packets to identify the data and to perform routing (packet = head flit + data flits)
- Allows time-multiplexing of network resources
- Typically better performance, especially for short messages
- Several packet switching strategies: store-and-forward, cut-through, wormhole (a packetisation sketch follows below)
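A toy sketch of packetisation, just to make the terms concrete; the flit size, flits per packet and header fields are invented for the example.

```c
/* Split a message into packets, each a head flit plus data flits (toy sizes). */
#include <stdio.h>
#include <string.h>

#define FLIT_BYTES    8      /* payload carried by one data flit        */
#define FLITS_PER_PKT 4      /* one head flit + three data flits        */

typedef struct {
    int dest;                /* routing information in the head flit    */
    int length;              /* number of payload bytes in this packet  */
} head_flit_t;

int main(void) {
    const char *message = "an example message to be packetised";
    int total = (int)strlen(message);
    int payload_per_pkt = FLIT_BYTES * (FLITS_PER_PKT - 1);

    for (int off = 0; off < total; off += payload_per_pkt) {
        int chunk = total - off < payload_per_pkt ? total - off : payload_per_pkt;
        head_flit_t head = { .dest = 5, .length = chunk };
        printf("packet to node %d: head flit + %d data flit(s), %d bytes\n",
               head.dest, (chunk + FLIT_BYTES - 1) / FLIT_BYTES, chunk);
    }
    return 0;
}
```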

Store-and-Forward Switching
- A packet is not forwarded until all of its phits have arrived at each intermediate node
- Pros
  - On-the-fly failure detection
- Cons
  - Low performance; latency ≈ distance × #phits
  - Large buffering required
  - Long, bursty transmissions
- E.g. the Internet

Cut-Through / Wormhole Switching
- A packet can be forwarded as soon as the head arrives at an intermediate node
- Pros
  - Better performance; latency ≈ distance + #phits
  - Less hardware (smaller buffers)
- Cons
  - Fault detection only possible at the destination
- The latency difference is illustrated below
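A first-order comparison of the two latency models from these slides, with arbitrary example values for the hop count and packet length (one cycle per phit per hop assumed):

```c
/* Compare the first-order latency models:
 *   store-and-forward: distance * phits     cut-through: distance + phits */
#include <stdio.h>

int main(void) {
    int distance = 5;     /* hops to the destination           */
    int phits    = 20;    /* physical transfer units per packet */

    int store_and_forward = distance * phits;   /* whole packet buffered per hop  */
    int cut_through       = distance + phits;   /* head streams ahead of the tail */

    printf("store-and-forward: %d cycles\n", store_and_forward);  /* 100 */
    printf("cut-through:       %d cycles\n", cut_through);        /*  25 */
    return 0;
}
```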

Beyond Multicore

Typical Multi-core Structure
[Diagram: on chip, each core has L1 instruction and data caches and a private L2 cache; an L3 shared cache and a memory controller connect to main memory (DRAM). Off chip, QPI or HT links and an Input/Output Hub provide PCIe to the graphics card and an Input/Output Controller for the motherboard I/O buses (PCIe, USB, Ethernet, SATA HD).]

Multiprocessor (Shared Memory)
[Diagram: several multi-core chips on one motherboard, each with its own DRAM memory, connected to each other and to Input/Output Hubs via QPI or HT links.]

Multicomputer (Distributed Memory)
[Diagram: many independent nodes connected by an interconnection network.]

Amdahl's Law
- Estimates a parallel system's maximum performance based on the available parallelism of an application
- It was intended to discourage parallel architectures
  - But was later reformulated to show that S is normally constant while P depends on the size of the input data
  - If you want more parallelism, just increase your dataset
- Speedup = (S + P) / (S + P/N)
  - S = fraction of the code which is serial
  - P = fraction of the code which can be parallel
  - S + P = 1
  - N = number of processors
- The sketch below evaluates the formula for a few values of N
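A small sketch evaluating Amdahl's speedup; the 5% serial fraction is just an example value.

```c
/* Evaluate Amdahl's speedup (S + P) / (S + P/N) for an example serial fraction. */
#include <stdio.h>

double amdahl_speedup(double serial, int n) {
    double parallel = 1.0 - serial;             /* S + P = 1 */
    return (serial + parallel) / (serial + parallel / n);
}

int main(void) {
    double S = 0.05;                            /* 5% of the code is serial */
    int cores[] = { 1, 2, 4, 8, 16, 64, 1024 };
    for (int i = 0; i < 7; i++)
        printf("N = %4d  speedup = %6.2f\n", cores[i], amdahl_speedup(S, cores[i]));
    /* As N grows the speedup saturates at 1/S = 20, however many cores we add. */
    return 0;
}
```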

Amdahl's Law

Clusters, Datacentres and Supercomputers

Clusters, Supercomputers and Datacentres
- All terms are overloaded and misused
  - All have lots of CPUs on lots of motherboards
  - The distinction is becoming increasingly blurred
- High Performance Computing: run one large task as quickly as possible
  - Supercomputers and (to an extent) clusters
- High Throughput Computing: run as many tasks per unit of time as possible
  - Clusters/farms (compute) and datacentres (data)
- Big Data Analytics: analyse and extract patterns from large, complex data sets
  - Datacentres

Building a Cluster, Supercomputer or Datacentre
- Large numbers of self-contained computers in a small form factor
  - Optimised for cooling and power efficiency
- Racks house 1000s of cores
  - High redundancy for fault tolerance
  - They normally also contain separate units for networking and power distribution

Building a Cluster, Supercomputer or Datacentre
- Join lots of compute racks
- Add a network
- Add power distribution
- Add cooling
- Add dedicated storage
- Add some frontend node(s)
  - So that small user tasks (compile, read results, etc.) do not affect compute-node performance

Top 500 List of Supercomputers
- A list of the most powerful supercomputers in the world, updated twice a year (Jun/Nov) – www.top500.org
- Theoretical peak performance (Rpeak) vs maximum performance running a computation-intensive application (Rmax)
- Let's peek at the latest Top 10 (Nov '18)

Questions?