4. Shared Memory Parallel Architectures 4.4. Multicore Architectures

Slides:



Advertisements
Similar presentations
Network Processor Technical Report Present by: Jiening Jiang June 05.
Advertisements

CSE431 Chapter 7A.1Irwin, PSU, 2008 CSE 431 Computer Architecture Fall 2008 Chapter 7A: Intro to Multiprocessor Systems Mary Jane Irwin (
Multicore Architectures Michael Gerndt. Development of Microprocessors Transistor capacity doubles every 18 months © Intel.
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Lecture 6: Multicore Systems
Structure of Computer Systems
Chapter 8-1 : Multiple Processor Systems Multiple Processor Systems Multiple Processor Systems Multiprocessor Hardware Multiprocessor Hardware UMA Multiprocessors.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,
CSC457 Seminar YongKang Zhu December 6 th, 2001 About Network Processor.
Nov COMP60621 Concurrent Programming for Numerical Applications Lecture 6 Chronos – a Dell Multicore Computer Len Freeman, Graham Riley Centre for.
1 Microprocessor-based Systems Course 4 - Microprocessors.
SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.
Chapter Hardwired vs Microprogrammed Control Multithreading
Chapter 17 Parallel Processing.
Parallel Computer Architectures
Computer Organization and Assembly language
Joram Benham April 2,  Introduction  Motivation  Multicore Processors  Overview, CELL  Advantages of CMPs  Throughput, Latency  Challenges.
Dr Mohamed Menacer College of Computer Science and Engineering Taibah University CE-321: Computer.
Computer performance.
Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.
Semiconductor Memory 1970 Fairchild Size of a single core –i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core.
LOGO Multi-core Architecture GV: Nguyễn Tiến Dũng Sinh viên: Ngô Quang Thìn Nguyễn Trung Thành Trần Hoàng Điệp Lớp: KSTN-ĐTVT-K52.
Parallel and Distributed Systems Instructor: Xin Yuan Department of Computer Science Florida State University.
Company LOGO High Performance Processors Miguel J. González Blanco Miguel A. Padilla Puig Felix Rivera Rivas.
Multi-core architectures. Single-core computer Single-core CPU chip.
Multi-Core Architectures
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.
Lynn Choi School of Electrical Engineering Microprocessor Microarchitecture The Past, Present, and Future of CPU Architecture.
A Gentler, Kinder Guide to the Multi-core Galaxy Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Guest lecture for ECE4100/6100.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi 1. Prerequisites.
1 The IBM Cell Processor – Architecture and On-Chip Communication Interconnect.
Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.
1 Introduction CEG 4131 Computer Architecture III Miodrag Bolic.
A few issues on the design of future multicores André Seznec IRISA/INRIA.
High Performance Computing Group Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE TM Architecture Feasibility Study of MPI.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi 1.Prerequisites.
Multi-core processors. 2 Processor development till 2004 Out-of-order Instruction scheduling Out-of-order Instruction scheduling.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Course.
Understanding Parallel Computers Parallel Processing EE 613.
Chao Han ELEC6200 Computer Architecture Fall 081ELEC : Han: PowerPC.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
Carnegie Mellon /18-243: Introduction to Computer Systems Instructors: Anthony Rowe and Gregory Kesden 27 th (and last) Lecture, 28 April 2011 Multi-Core.
High performance computing architecture examples Unit 2.
Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.
IBM Cell Processor Ryan Carlson, Yannick Lanner-Cusin, & Cyrus Stoller CS87: Parallel and Distributed Computing.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
Hardware Architecture
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
William Stallings Computer Organization and Architecture 8th Edition
Lynn Choi School of Electrical Engineering
Multi-core processors
Parallel Computers Today
ECE/CS 757: Advanced Computer Architecture II
William Stallings Computer Organization and Architecture 8th Edition
An Overview of MIMD Architectures
William Stallings Computer Organization and Architecture 8th Edition
Multicore and GPU Programming
Multicore and GPU Programming
Intel CPU for Desktop PC: Past, Present, Future
Presentation transcript:

4. Shared Memory Parallel Architectures 4.4. Multicore Architectures Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi 4. Shared Memory Parallel Architectures 4.4. Multicore Architectures

Multicore examples General purpose vs special purpose Special purpose: Network Processors, DSP Homogeneous vs heterogeneous SPM vs NUMA Low parallelism vs high parallelism Moore law: exponential growth of core number per chip … MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

General purpose - current (low parallelism) chips X86 based Intel Xeon (Core 2 Duo, Core 2 Quad), Nehalem AMD Athlon, Opteron quad-core (Barcelona) Power based IBM Power 5, 6 IBM Cell UltraSPARC based Sun UltraSparc T1 Sun UltraSparc T2 Except IBM Cell: homogeneous, shared cache (C2 / C3) multiprocessors MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Intel SMPs Automatic cache coherence MESI Snooping Xeon Core 2 Duo 3 GHz L1: 32Kb + 32Kb L2: 6Mb Off-chip main memory interface System bus: 10.6 GB/s One thread per core, 4-superscalar Xeon Core 2 Quad /Harpertown Two Core 2 in the same chip External main memory: shared by both L2 caches Automatic cache coherence MESI Snooping MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Intel SMPs: Nehalem Evolution of SMP Xeon Private L2 caches Shared L3 cache, MESIF Trend: 8, 16 core Memory interface on chip Point-to-point interconnection structure, 32 GB/s Two simultaneous threads per core MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

AMD Opteron Quad Core Shared L3 cache Possible Cache-coherent NUMA behaviour: access to remote L2 caches, via point-to-point interconnection, with L3 cache acting (also) as a synchronization agent. MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

SUN Niagara 2 UltraSPARC T2 SMP, shared L2 cache 8 simple, pipelined, in-order cores, 1 floating point unit per core 8 simultaneous threads per core Interconnection: crossbar MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

IBM BlueGene/P PowerPC 32-bit 450 quad-core NUMA Mesh interconnect Automatic cache coherence 4 simultaneous floating point operations per clock cycle (850 MHz) Significantly less power consumption (16 W) compared to > 65 W of x86 quad-core BlueGene massively parallel system: > 75000 quad-core chips (> 290000 cores total). MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

IBM Cell BE EIB BEI P E MIC S E0 I/O0 E1 E2 E3 E4 E5 E6 E7 I/O1 Evolution of uniprocessor (Power PC Processor Element, PPE) with 8 powerful I/O coprocessors towards a heterogeneous NUMA multiprocessor, Coprocessors evolution towards Processign Cores: Synergistic Processing Element (SPE), with vectorization capabilities MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

IBM Cell BE PPE: superscalar, in-order, L2 cache accessible by SPEs SPE: RISC, 128-bit,pipelined, in-order, vectorized instructions SPE Local Memory: 256 Kb, NOT cache SPEs set = NUMA, with additional access to the PPE memory (DMA) Interconnection structure: 4 bidirectional Rings, 16 bytes per ring Robust cost model for memory access and communication. MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Intel Terascale project 80 core Network on-chip: bidimensional mesh (k-ary 2-cube) Wormohole routing VLIW processors: 96 bit instruction word NOT cache MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Tilera Tile 64 64 cores 8-ary 2-cube toroidal interconnection NUMA Local cache (L1, L2): program-controlled cache coherence No floating-point Oriented to Video encoding, Network Packet processing MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Network processors Parallel architectures oriented to network processing: Real-time processing of multiple data streams IP protocol packet switching and forwarding capabilities Packet operations Packet queueing Checksum / CRC per packet Pattern matching per packet Tree searches Frame forrwarding Frame filtering Frame alteration Traffic control and statistics QoS control Enhance security Primitive network interfaces on-chip Intel, IBM, Ezchip, Xelerated, Agere, Alchemy, AMCC, Cisco, Cognigine, Motorola, … Similar features for DSP multicore processors MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Network processors and multithreading Network processors apply multithreading to bridge latencies during (remote) memory accesses Blocked multithreading (BMT) Multithreading applied to cores that perform the data traffic handling Hard real-time events (i.e., deadline soluld never be missed) Specific instruction scheduling during multithreaded execution Examples: Intel IXP IBM PowerNP MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms

Network processors: Intel Internet eXchange Processor (IXP) 8 cores (IXP2400) or 16 cores (IPX2800), specialized for low-level packet processing, fifty 40-bit instructions + one RISC Intel Xscale, 600-700 MHz: heterogeneous NUMA Pipelined architecture, 8 threads per core (zero cost context switching thread-thread) Ring-like core interconnection MCSN - M. Vanneschi: High Performance Computing Systems and Enabling Platforms