Lecture 2 : Introduction to Multicore Computing Bong-Soo Sohn Associate Professor School of Computer Science and Engineering Chung-Ang University.

What is Parallel Computing? Parallel computing means using multiple processors in parallel to solve problems more quickly than with a single processor. Examples of parallel machines: a cluster computer, which contains multiple PCs combined with a high-speed network; a shared memory multiprocessor, built by connecting multiple processors to a single memory system; a Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip. Here concurrent execution comes from the desire for performance, unlike the inherent concurrency of a multi-user distributed system.

Multicore. A computer composed of two or more independent cores. Core (CPU): a computing unit that reads and executes program instructions. Examples: dual-core, quad-core, hexa-core, octa-core, ... Cores may or may not share a cache, and may be symmetric or asymmetric. Cores are integrated onto a single integrated circuit die (CMP: Chip Multi-Processor), or they may be integrated onto multiple dies in a single chip package.

Multicore. The performance gained from a multi-core processor depends strongly on the software algorithms and their implementation. (Figure: dual-core CPU.)

Manycore processor: a multi-core architecture with an especially high number of cores (tens, hundreds, or even more). CUDA (Compute Unified Device Architecture): a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that it produces.

Parallel Programming Techniques. Shared memory: OpenMP, pthreads. Distributed memory: MPI. Distributed/shared memory: hybrid (MPI+OpenMP). GPU parallel programming: CUDA programming (NVIDIA), OpenCL.

Parallel Processing Systems. Small-scale multicore environment: notebook, workstation, server; OS supports multicore; POSIX threads (pthread), win32 threads; GPGPU-based supercomputer; development of CUDA/OpenCL/GPGPU. Large-scale multicore environment: supercomputer (more than 10,000 cores); clusters; servers; grid computing.

Parallel Computing vs. Distributed Computing. Parallel computing: all processors may have access to a shared memory to exchange information between processors; more tightly coupled to multi-threading. Distributed computing: multiple computers communicate through a network; each processor has its own private memory (distributed memory); sub-tasks are executed on different machines and the results are then merged.
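The two communication styles contrasted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `shared_memory_sum` and `message_passing_sum` are hypothetical, threads stand in for processors or machines, and `queue.Queue` objects stand in for the network. In standard CPython, threads do not speed up CPU-bound code, so the sketch shows the communication pattern, not a performance gain.

```python
import queue
import threading


def shared_memory_sum(data, workers=4):
    """Shared-memory style: every thread updates one shared total under a lock."""
    total = {"value": 0}          # memory visible to all threads
    lock = threading.Lock()       # synchronizes access to the shared total

    def work(chunk):
        partial = sum(chunk)
        with lock:
            total["value"] += partial

    threads = [
        threading.Thread(target=work, args=(data[i::workers],))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(data, workers=4):
    """Distributed style: each worker receives its sub-task as a message,
    computes on its private copy, and sends the result back to be merged."""
    results = queue.Queue()       # stands in for the network back to the coordinator

    def work(inbox):
        chunk = inbox.get()       # receive the sub-task
        results.put(sum(chunk))   # send the partial result

    inboxes = [queue.Queue() for _ in range(workers)]
    threads = [threading.Thread(target=work, args=(inbox,)) for inbox in inboxes]
    for t in threads:
        t.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::workers])
    for t in threads:
        t.join()
    # Merge step: combine one partial result per worker.
    return sum(results.get() for _ in range(workers))


if __name__ == "__main__":
    numbers = list(range(1000))
    print(shared_memory_sum(numbers), message_passing_sum(numbers))
```

Both functions return the same sum; the difference is only in how the partial results travel, through a locked shared variable in the first case and through explicit messages in the second.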

Parallel Computing vs. Distributed Computing. There is no clear distinction between distributed computing and parallel computing.

Cluster Computing vs. Grid Computing. Cluster computing: a set of loosely connected computers that work together so that in many respects they can be viewed as a single system; good price/performance; memory not shared. Grid computing: a federation of computer resources from multiple locations to reach a common goal (a large-scale distributed system); grids tend to be more loosely coupled, heterogeneous, and geographically dispersed.

Cluster Computing vs. Grid Computing

Cloud Computing. Shares networked computing resources rather than having local servers or personal devices handle applications. “Cloud” is used as a metaphor for “Internet”, meaning “a type of Internet-based computing” in which different services, such as servers, storage, and applications, are delivered to a user’s computers and smartphones through the Internet.

Good Parallel Program. Writing good parallel programs: correct (result); good performance; scalability; load balance; portability; hardware-specific utilization.

Moore’s Law: Review. The number of transistors on integrated circuits doubles roughly every two years. Microprocessors have become smaller, denser, and more powerful. Processing speed, memory capacity, sensors, and even the number and size of pixels in digital cameras are all improving at (roughly) exponential rates.
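The doubling rule is easy to turn into arithmetic. The sketch below is an illustration only: the function name `projected_transistors` and the starting count of one million transistors are hypothetical, chosen to show how quickly a two-year doubling period compounds.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count forward, assuming one doubling per period."""
    return initial_count * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    start = 1_000_000  # hypothetical starting transistor count
    for years in (2, 10, 20):
        print(f"after {years:2d} years: {projected_transistors(start, years):,.0f}")
```

Ten years gives five doublings (a factor of 32), and twenty years gives ten doublings (a factor of 1024).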

Computer Hardware Trend. Chip density continues to increase, ~2x every 2 years. Clock speed does not (at high clock speeds, power consumption and heat generation are too high to be tolerated). The number of cores may double instead. No more hidden parallelism (ILP: instruction-level parallelism) to be found. Transistor count still rising; clock speed flattening sharply. Need multicore programming! Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond).

Examples of Parallel Computers. Chip MultiProcessor (CMP): Intel Core Duo, AMD Dual Core. Symmetric Multiprocessor (SMP): Sun Fire E25K. Heterogeneous chips: Cell processor. Clusters. Supercomputers.

Intel Core Duo. Two 32-bit Pentium processors. Each has its own 32K L1 cache. Shared 2MB or 4MB L2 cache. Fast communication through shared L2. Coherent shared memory.

AMD Dual Core Opteron. Each core with 64K L1 cache. Each core with 1MB L2 cache. Coherent shared memory.

Intel vs. AMD. Main difference: L2 cache position. AMD: more core-private memory; easier to share cache coherency info with other CPUs; preferred in multi-chip systems. Intel: a core can use more of the shared L2 at times; lower-latency communication between cores; preferred in single-chip systems.

Generic SMP. Symmetric MultiProcessor (SMP) system: a multiprocessor hardware architecture in which two or more identical processors are connected to a single shared memory and controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture, both multicore and multi-CPU. Single logical memory image. The shared bus is often the bottleneck.

GPGPU: NVIDIA GPU. Tesla K20 GPU: 1 Kepler GK cores; 706 MHz; Tpeak 3.52 Tflop/s (32-bit floating point); Tpeak 1.17 Tflop/s (64-bit floating point). GTX CUDA cores; 1.0 GHz.

Hybrid Programming Model. The main CPU performs the hard-to-parallelize portion. The attached processor (GPU) performs the compute-intensive parts.

Summary All computers are now parallel computers! Multi-core processors represent an important new trend in computer architecture. Decreased power consumption and heat generation. Minimized wire lengths and interconnect latencies. They enable true thread-level parallelism with great energy efficiency and scalability.

Summary. To utilize their full potential, applications will need to move from a single-threaded to a multi-threaded model. Parallel programming techniques are likely to gain importance. Hardware/software: the software industry needs to get back into the state where existing applications run faster on new hardware.

Why writing (fast) parallel programs is hard

Principles of Parallel Computing. Finding enough parallelism (Amdahl’s Law); granularity; locality; load balance; coordination and synchronization. All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism. Suppose only part of an application seems parallel. Amdahl’s law: let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable, and let P be the number of processors. Then Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s. Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
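To get a feel for the bound, it helps to evaluate it for a few processor counts. The sketch below assumes a hypothetical helper name, `amdahl_speedup`, and an assumed sequential fraction of s = 0.1; neither comes from the slides.

```python
def amdahl_speedup(s, p):
    """Upper bound on speedup for sequential fraction s on p processors."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    s = 0.1  # assumed: 10% of the work is sequential
    for p in (1, 2, 4, 16, 256, 65536):
        print(f"P = {p:6d}  speedup <= {amdahl_speedup(s, p):6.2f}")
    print(f"limit as P grows: 1/s = {1.0 / s:.2f}")
```

With s = 0.1, sixteen processors give at most a 6.4x speedup, and no number of processors can exceed 1/s = 10x.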

Overhead of Parallelism. Given enough parallel work, this is the biggest barrier to getting the desired speedup. Parallelism overheads include: the cost of starting a thread or process; the cost of communicating shared data; the cost of synchronizing; extra (redundant) computation. Each of these can be in the range of milliseconds (= millions of flops) on some systems. Tradeoff: an algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.
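The granularity tradeoff can be made concrete with a toy cost model. Everything in this sketch is an assumption made for illustration, not a measurement: the work is split into equal tasks of size `grain`, every task pays a fixed `overhead`, and tasks run in rounds of at most `p` at a time. The function name `modeled_time` is hypothetical.

```python
import math


def modeled_time(total_work, grain, p, overhead):
    """Toy model of parallel run time.

    total_work: units of work in the whole job
    grain:      units of work per task (granularity)
    p:          number of processors
    overhead:   fixed cost paid per task (start-up, communication, sync)
    """
    tasks = math.ceil(total_work / grain)
    rounds = math.ceil(tasks / p)       # at most p tasks run at once
    return rounds * (grain + overhead)  # each round lasts one task plus its overhead


if __name__ == "__main__":
    work, procs, cost = 1024, 8, 1
    for grain in (1, 16, 128, 512, 1024):
        t = modeled_time(work, grain, procs, cost)
        print(f"grain = {grain:4d}  time = {t:5d}  speedup = {work / t:5.2f}")
```

In this model, a grain of 1 spends half of every task on overhead (time 256), a grain of 128 keeps all 8 processors busy with little overhead (time 129), and a grain of 512 leaves 6 of the 8 processors with nothing to do (time 513).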

Locality and Parallelism. Large memories are slow; fast memories are small. Storage hierarchies are large and fast on average. Parallel processors, collectively, have large, fast caches; the slow accesses to “remote” data we call “communication”. An algorithm should do most of its work on local data. (Figure: conventional storage hierarchy of Proc, Cache, L2 Cache, L3 Cache, Memory, repeated per processor and joined by potential interconnects.)

Load Imbalance. Load imbalance is the time that some processors in the system are idle due to: insufficient parallelism (during that phase); unequal-size tasks. An algorithm needs to balance load.
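A small sketch can show how unequal-size tasks turn into idle time, and how a different assignment reduces it. The function names, the task costs, and the choice of a largest-task-first greedy heuristic are all assumptions made for illustration; the slides do not prescribe a balancing method.

```python
import heapq


def idle_time(loads):
    """Total time processors sit idle waiting for the most loaded one."""
    finish = max(loads)
    return sum(finish - load for load in loads)


def block_assignment(task_costs, p):
    """Static assignment: give each processor a contiguous block of tasks,
    ignoring how expensive each task is."""
    n = len(task_costs)
    return [sum(task_costs[i * n // p:(i + 1) * n // p]) for i in range(p)]


def greedy_assignment(task_costs, p):
    """Largest task first, each placed on the currently least loaded processor."""
    heap = [(0, i) for i in range(p)]  # (current load, processor index)
    heapq.heapify(heap)
    loads = [0] * p
    for cost in sorted(task_costs, reverse=True):
        load, i = heapq.heappop(heap)
        loads[i] = load + cost
        heapq.heappush(heap, (loads[i], i))
    return loads


if __name__ == "__main__":
    costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical task costs
    for name, assign in (("block", block_assignment), ("greedy", greedy_assignment)):
        loads = assign(costs, 4)
        print(f"{name:6s} loads = {loads}  idle time = {idle_time(loads)}")
```

For these costs on 4 processors, the block assignment produces loads of 15, 11, 7, and 3 (24 units of idle time), while the greedy assignment gives every processor a load of 9 and no idle time.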