Special Course on Computer Architecture


#7 Simulation of Multi-Processors
Hiroki Matsutani and Hideharu Amano
June 3rd, 2011

Outline: Simulation of Multi-Processors
- Background
  - Recent multi-core and many-core processors
  - Network-on-Chip
- Shared-memory chip multi-processors
  - Architecture
  - Coherence protocols
- Simulation environment: GEMS/Simics
- Exercises [50 min]
  - Performance evaluation of parallel applications
  - Performance evaluation of coherence protocols

Multi- and many-core architectures
[Chart: number of PEs per chip (caches not included), plotted from 2004 to 2011 on a scale of 2 to 256 PEs. Examples include MIT RAW, UT TRIPS (OPN), STI Cell BE, Sun T1, Sun T2, ClearSpeed CSX600/CSX700, picoChip PC102/PC205, Intel 80-core, Intel SCC, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, and Fujitsu SPARC64]

Network-on-Chip (NoC)
- Interconnection network to connect many cores
- [Figure: 16-core tile architecture; each tile contains a core and a router]

On-chip router architecture
- 1) Selecting an output channel (routing)
- 2) Arbitration for the selected output channel
- 3) Sending the packet to the output channel (switch traversal)
- Routing, arbitration, and switch traversal are performed in a pipelined manner
- [Figure: 5x5 crossbar router with input FIFOs, arbiter, and GRANT signals; ports X+, X-, Y+, Y-, and CORE]
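
To make the three stages concrete, here is a minimal C sketch (not taken from the slides) of routing computation, arbitration, and switch traversal for a 5-port router; the XY routing function, the fixed-priority arbiter, and all names are assumptions.

/* Minimal sketch (illustration only): the three router stages described
 * above for a 5-port (X+, X-, Y+, Y-, CORE) router with XY routing.
 * The helper names and the fixed-priority arbiter are assumptions. */
#include <stdio.h>

enum port { XPLUS, XMINUS, YPLUS, YMINUS, CORE_PORT, NUM_PORTS };

struct packet { int dst_x, dst_y; };

/* Stage 1: routing -- select an output channel (dimension-order XY). */
static enum port route_compute(const struct packet *p, int my_x, int my_y) {
    if (p->dst_x > my_x) return XPLUS;
    if (p->dst_x < my_x) return XMINUS;
    if (p->dst_y > my_y) return YPLUS;
    if (p->dst_y < my_y) return YMINUS;
    return CORE_PORT;                         /* arrived at the destination */
}

/* Stage 2: arbitration -- grant one requesting input for the output. */
static int arbitrate(const int request[NUM_PORTS], enum port out) {
    for (int in = 0; in < NUM_PORTS; in++)
        if (request[in] == (int)out) return in;   /* fixed priority */
    return -1;
}

int main(void) {
    struct packet p = { .dst_x = 2, .dst_y = 1 };
    int request[NUM_PORTS] = { -1, -1, -1, -1, -1 };

    enum port out = route_compute(&p, 1, 1);  /* stage 1 at tile (1,1)  */
    request[CORE_PORT] = (int)out;            /* CORE input wants 'out' */
    int winner = arbitrate(request, out);     /* stage 2                */
    /* Stage 3: switch traversal -- the crossbar forwards the packet.   */
    printf("crossbar: input %d -> output %d\n", winner, (int)out);
    return 0;
}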

Outline: Simulation of Multi-Processors
- Background
  - Recent multi-core and many-core processors
  - Network-on-Chip
- Shared-memory chip multi-processors
  - Architecture
  - Coherence protocols
- Simulation environment: GEMS/Simics
- Exercises [50 min]
  - Performance evaluation of parallel applications
  - Performance evaluation of coherence protocols

Today's target architecture
- Chip multi-processors (CMPs)
  - Multiple processors (each has a private L1 cache)
  - Shared L2 cache divided into multiple banks (SNUCA)
- [Figure: processor tiles (UltraSPARC, L1 I & D caches) and cache tiles (L2 cache banks)]

Today's target architecture
- Chip multi-processors (CMPs)
  - Multiple processors (each has a private L1 cache)
  - Shared L2 cache divided into multiple banks (SNUCA; see the sketch below)
  - Processors and L2 cache banks are connected via the NoC
- [Figure: processor tiles (UltraSPARC, L1 I & D caches) and cache tiles (L2 cache banks) connected by on-chip routers]
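
As a concrete illustration of the banked shared L2, the sketch below (not from the slides) maps a physical address to an L2 bank by interleaving cache blocks across the banks; the 64-byte block size and the 8-bank count are assumptions.

/* Minimal sketch (assumption, not from the slides): a static NUCA
 * (SNUCA) style mapping of a physical address to a shared L2 bank.
 * Block size and bank count are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6   /* 64-byte cache blocks (assumed)    */
#define NUM_L2_BANKS      8   /* one bank per cache tile (assumed) */

static unsigned l2_bank(uint64_t paddr) {
    /* Interleave consecutive cache blocks across the banks. */
    return (unsigned)((paddr >> BLOCK_OFFSET_BITS) % NUM_L2_BANKS);
}

int main(void) {
    uint64_t addr = 0x12345678;
    printf("block 0x%llx maps to L2 bank %u\n",
           (unsigned long long)(addr >> BLOCK_OFFSET_BITS), l2_bank(addr));
    return 0;
}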

Cache coherence is maintained
- Write-back policy: a cache write updates main memory only when the block is evicted (sketched below)
- Write-invalidate policy: a cache write invalidates all copies held by the other sharers
- [Figure: processor tiles, cache tiles, and main memory]
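
The sketch below (not from the slides) illustrates the write-back policy: a write only sets the block's dirty bit, and main memory is updated when the dirty block is evicted. The struct layout and the memory_write() helper are hypothetical.

/* Minimal sketch (illustration only): write-back behavior. A write only
 * marks the block dirty; memory is updated at eviction time.
 * memory_write() is a hypothetical helper. */
#include <stdbool.h>
#include <stdio.h>

struct cache_block {
    unsigned long long addr;
    bool valid, dirty;
};

static void memory_write(unsigned long long addr) {
    printf("writing block 0x%llx back to main memory\n", addr);
}

static void cache_write(struct cache_block *b) {
    b->dirty = true;                 /* no memory traffic on the write itself */
}

static void cache_evict(struct cache_block *b) {
    if (b->valid && b->dirty)
        memory_write(b->addr);       /* write-back happens only here */
    b->valid = b->dirty = false;
}

int main(void) {
    struct cache_block b = { .addr = 0x1000, .valid = true, .dirty = false };
    cache_write(&b);
    cache_evict(&b);                 /* triggers the write-back */
    return 0;
}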

Cache coherence is maintained
- Example: a CPU wants to read a block that is cached in another tile
  1. The CPU sends a read request to the memory controller
  2. The controller forwards the request to the current owner
  3. The owner sends the block to the requestor
- [Figure: processor tiles, cache tiles, and main memory]
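
The three-hop read above can be traced as messages between the requestor, the directory/memory controller, and the current owner. The sketch below is an illustration only, not GEMS protocol code; the message names GetS, Fwd_GetS, and Data are assumptions.

/* Minimal sketch (illustration only): the three-hop read described
 * above, printed as a message trace. Node and message names are assumed. */
#include <stdio.h>

enum node { REQUESTOR, DIRECTORY, OWNER };

static void send(enum node from, enum node to, const char *msg) {
    static const char *name[] = { "requestor", "directory", "owner" };
    printf("%-9s -> %-9s : %s\n", name[from], name[to], msg);
}

int main(void) {
    send(REQUESTOR, DIRECTORY, "GetS (read request)");            /* step 1 */
    send(DIRECTORY, OWNER,     "Fwd_GetS (forward to the owner)"); /* step 2 */
    send(OWNER,     REQUESTOR, "Data (block sent to requestor)");  /* step 3 */
    return 0;
}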

Cache coherence: MOESI protocol class
- The status of each cache block is represented with M/O/E/S/I
  - Modified (M): modified (i.e., dirty); valid in only one cache
  - Owned (O): may or may not be clean; exists in multiple caches but is owned by one cache, and the owner is responsible for responding to any requests
  - Exclusive (E): clean; exists in only one cache
  - Shared (S): shared by multiple CPUs
  - Invalid (I)
- MOESI protocol class: MSI, MOSI, MESI, MOESI, ...
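
To make the five states concrete, a minimal C sketch of a per-block coherence tag follows; this is an illustration only, not GEMS code, and the struct layout is an assumption.

/* Minimal sketch (illustration only, not GEMS code): a per-block
 * coherence tag using the five MOESI states described above. */
#include <stdio.h>

enum moesi_state {
    MOESI_M,   /* Modified: dirty, valid in exactly one cache                 */
    MOESI_O,   /* Owned: shared, possibly dirty; this cache answers requests  */
    MOESI_E,   /* Exclusive: clean, valid in exactly one cache                */
    MOESI_S,   /* Shared: read-only copy, possibly held by multiple CPUs      */
    MOESI_I    /* Invalid: no valid copy in this cache                        */
};

struct cache_block_tag {
    unsigned long long tag;      /* address tag                          */
    enum moesi_state   state;    /* MOESI coherence state of the block   */
};

int main(void) {
    struct cache_block_tag t = { .tag = 0x1234, .state = MOESI_I };
    t.state = MOESI_E;   /* e.g., after a read miss with no other sharers */
    printf("block 0x%llx is now in state %d\n", t.tag, (int)t.state);
    return 0;
}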

Cache coherence protocols
- MSI/MOSI directory protocol
  - The E state is not implemented
  - An S-to-M transition always updates the main memory
- MESI directory protocol
  - The O state is not implemented; dirty sharing is not allowed
  - An M-to-S transition always updates the main memory
- MOESI directory protocol
- MOESI token protocol [Martin, ISCA'03]
  - Each block has as many tokens as the number of CPUs
  - A CPU holding one or more tokens can read the block
  - A CPU holding all tokens can modify (write) the block (see the sketch below)
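
The token rule from the last bullets can be sketched as below; this is a simplification, not the full token protocol of Martin et al., and NUM_CPUS and the helper names are assumptions.

/* Minimal sketch (assumption: simplified token-counting rule from the
 * slide, not the full MOESI token protocol): a CPU may read a block if
 * it holds at least one of its tokens, and may write it only if it
 * holds all of them. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CPUS 8   /* one token per CPU exists for every block */

static bool can_read(int tokens_held)  { return tokens_held >= 1; }
static bool can_write(int tokens_held) { return tokens_held == NUM_CPUS; }

int main(void) {
    int held = 3;   /* this CPU currently holds 3 of the 8 tokens */
    printf("read allowed:  %d\n", can_read(held));   /* 1 (yes) */
    printf("write allowed: %d\n", can_write(held));  /* 0 (no)  */
    return 0;
}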

MSI Protocol: State transition
[State transition diagrams for the M, S, and I states: processor-initiated transitions (CpuRd/CpuWr with the resulting BusRd/BusWr actions) and bus-initiated, snooped transitions (BusRd/BusWr with Flush)]
- S-to-M transitions flush (update) the main memory
- Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009)
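
The processor-side transitions in the diagram can be written as a small state machine. The sketch below assumes the textbook snooping MSI formulation, not the exact GEMS implementation; event and action names follow the slide's CpuRd/CpuWr and BusRd/BusWr labels, and snooped (bus-side) transitions are not modeled.

/* Minimal sketch (assumption: textbook-style snooping MSI, processor
 * side only): state transitions for one cache block, returning the bus
 * transaction the cache must issue. */
#include <stdio.h>

enum msi_state { MSI_M, MSI_S, MSI_I };
enum cpu_event { CPU_RD, CPU_WR };
enum bus_action { BUS_NONE, BUS_RD, BUS_WR };

static enum bus_action cpu_transition(enum msi_state *st, enum cpu_event ev) {
    switch (*st) {
    case MSI_M:                       /* read or write hit, no bus traffic   */
        return BUS_NONE;
    case MSI_S:
        if (ev == CPU_RD) return BUS_NONE;          /* read hit              */
        *st = MSI_M; return BUS_WR;                 /* upgrade: invalidate others */
    case MSI_I:
        if (ev == CPU_RD) { *st = MSI_S; return BUS_RD; }   /* read miss     */
        *st = MSI_M; return BUS_WR;                 /* write miss            */
    }
    return BUS_NONE;
}

int main(void) {
    enum msi_state st = MSI_I;
    cpu_transition(&st, CPU_RD);      /* I -> S via BusRd                 */
    cpu_transition(&st, CPU_WR);      /* S -> M via BusWr (memory updated) */
    printf("final state: %s\n", st == MSI_M ? "M" : st == MSI_S ? "S" : "I");
    return 0;
}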

MESI Protocol: State transition
[State transition diagrams for the M, E, S, and I states: processor-initiated transitions (CpuRd/CpuWr with BusRd(C), BusRd(!C), and BusUpgr actions) and bus-initiated transitions (BusRd/BusWr/BusUpgr with Flush/FlushOpt)]
- M-to-S transitions flush (update) the main memory
- Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009)

MOESI Protocol: State transition (1/2)
[Processor-initiated state transitions for the M, O, E, S, and I states (CpuRd/CpuWr with BusRd(C), BusRd(!C), and BusUpgr actions)]
- Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009)

MOESI Protocol: State transition (2/2)
[Bus-initiated (snooped) state transitions for the M, O, E, S, and I states (BusRd/BusWr/BusUpgr with Flush/FlushOpt)]
- Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009)

Outline: Simulation of Multi-Processors
- Background
  - Recent multi-core and many-core processors
  - Network-on-Chip
- Shared-memory chip multi-processors
  - Architecture
  - Coherence protocols
- Simulation environment: GEMS/Simics
- Exercises [50 min]
  - Performance evaluation of parallel applications
  - Performance evaluation of coherence protocols

Full-system simulation: GEMS/Simics
- Wind River's Simics: commercial detailed processor simulator
- Univ. of Wisconsin's GEMS: cache, memory, and network modules for Simics
- [Figure: processor tiles (UltraSPARC, L1 I & D caches), cache tiles (L2 cache banks), on-chip routers, and main memory]

Full-system simulation: GEMS/Simics
- Today's simulation target: Solaris 9 OS on eight UltraSPARC processors
- Parallel application examples: Pi and Integer Sort
- Various coherence protocols are supported
- [Figure: processor tiles, cache tiles, on-chip routers, and main memory]

Full-system simulation: GEMS/Simics
- Simulation target: Solaris 9 OS running on eight UltraSPARC processors
- Parallel application example: Integer Sort (IS)
- Workflow: compile a parallel program on the target, then execute it on the 8 cores
- [Figure: processor tiles, cache tiles, on-chip routers, and main memory]

Parallel application example: OpenMP
Hello from all threads:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    printf("hello world from %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
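
As a usage note (an addition, not on the slide): such a program is typically built with an OpenMP-capable compiler, for example gcc -fopenmp hello.c -o hello with GCC; the exact flag differs for other compilers, including the one used on the Solaris/UltraSPARC target.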

Parallel application example: OpenMP

/* Note: N, A, and num are not declared on the slide; the definitions
   below are assumptions added so the example compiles. */
#include <stdio.h>
#include <omp.h>

#define N 1000000
static double A[N];
static int num = 8;

int main() {
    int i;
    double start_time, end_time;

    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] * A[i] - 3.0;
    }
    end_time = omp_get_wtime();

    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}

Parallel application example: OpenMP

/* Note: the includes, N, and the final return are not on the slide;
   they are assumptions added so the example compiles. */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main() {
    int i;
    double s = 0.0;
    double start_time, end_time;

    start_time = omp_get_wtime();
    #pragma omp parallel private(i) reduction(+:s)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
    }
    printf("pi = %lf\n", s);
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}

Outline: Simulation of Multi-Processors
- Background
  - Recent multi-core and many-core processors
  - Network-on-Chip
- Shared-memory chip multi-processors
  - Architecture
  - Coherence protocols
- Simulation environment: GEMS/Simics
- Exercises [50 min]
  - Performance evaluation of parallel applications
  - Performance evaluation of coherence protocols

The first step: How to use the simulator
- Please pick up your account information
- Log in to one of the ICS cluster machines (id = 01…15):
    ssh -X <username>@cluster<id>.ics.keio.ac.jp
- Copy the sample scripts and configuration files:
    cp -r ~matutani/comparch2011/files work
    cd work

The first step: How to use the simulator
- Start Simics:
    ./start_ideal_memory.sh
- You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).

The first step: How to use the simulator
- In the target machine, for example, you can check the number of processors as follows:
    bash-2.05# /usr/sbin/psrinfo -v
- You will see that there are eight processors.

Parallel application: "pi" calculation
- You can execute a "pi" calculation program using eight, four, and one thread:
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./pi

Parallel application: Integer Sort (IS)
- You can execute an Integer Sort (IS) program using eight, four, and one thread:
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./IS

Exercise 1
- Report the execution time of "pi" using 1, 4, 8, and 16 threads.
- Does the execution time decrease linearly as the number of threads increases? Discuss the results.

Coherence protocols: Integer Sort (IS)
- The following scripts automatically run the IS program with different cache coherence protocols:
    ./start_moesi_directory.sh
    ./start_mesi_directory.sh
    ./start_msi_mosi_directory.sh
    ./start_moesi_token.sh
- Each simulation takes five to ten minutes. Do not run more than one script at the same time!

Exercise 2
- Report the execution time of the MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token protocols. Discuss the results.
- For more detail about the protocols, see pages 14-19.