Special Course on Computer Architecture

Uploaded: 2017-10-08T09:52:53+00:00

#7 Simulation of Multi-Processors
Hiroki Matsutani and Hideharu Amano
June 3rd, 2011, Special Course on Computer Architecture

Outline: Simulation of Multi-Processors
Background
  Recent multi-core and many-core processors
  Network-on-Chip
Shared-memory chip multi-processors
  Architecture
  Coherence protocols
Simulation environment: GEMS/Simics
Exercises [50min]
  Performance evaluation of parallel applications
  Performance evaluation of coherence protocols

Multi- and many-core architectures
[Figure: number of PEs per chip (caches are not included), from 2 to 256, plotted against year, 2004 to 2011. Chips shown: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN), Fujitsu SPARC64, Intel SCC.]

Network-on-Chip (NoC)
An interconnection network that connects many cores
[Figure: 16-core tile architecture; each tile consists of a core and a router.]

On-chip router architecture
1) Selecting an output channel
2) Arbitration for the selected output channel
3) Sending the packet to the output channel
Routing, arbitration, and switch traversal are performed in a pipelined manner.
[Figure: on-chip router with five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, an arbiter that issues grants, and a 5x5 crossbar driving the five output ports.]
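Steps 1 and 2 can be sketched in C. The slide does not name a routing algorithm or an arbitration policy, so this sketch assumes dimension-order (XY) routing on a 2D mesh and round-robin arbitration, two common choices. The names `port_t`, `route_xy`, and `arbitrate_rr` are illustrative and do not come from the course material.

```c
#include <assert.h>

/* The five router ports shown in the figure. */
typedef enum { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_CORE } port_t;

/* Step 1: select the output port for a packet at router (cur_x, cur_y)
 * heading to (dst_x, dst_y). Assumption: XY dimension-order routing, i.e.
 * correct the X coordinate first, then Y, and eject to the core on arrival. */
port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_XPLUS;
    if (dst_x < cur_x) return PORT_XMINUS;
    if (dst_y > cur_y) return PORT_YPLUS;
    if (dst_y < cur_y) return PORT_YMINUS;
    return PORT_CORE;
}

/* Step 2: round-robin arbitration (assumed policy) among the five input
 * ports that request the same output port. `requests` is a 5-bit mask and
 * `last` is the input port granted previously. Returns the granted input
 * port, or -1 if no port is requesting. */
int arbitrate_rr(unsigned requests, int last)
{
    for (int i = 1; i <= 5; i++) {
        int cand = (last + i) % 5;
        if (requests & (1u << cand)) return cand;
    }
    return -1;
}
```

Starting the search one position after the previous winner keeps any single input port from monopolizing an output channel.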

Today’s target architecture
Chip multi-processors (CMPs)
  Multiple processors (each has a private L1 cache)
  Shared L2 cache divided into multiple banks (SNUCA)
  Processors and L2 cache banks are connected via a NoC
[Figure: processor tiles (UltraSPARC with L1 I & D caches) and cache tiles (L2 cache bank), each attached to an on-chip router.]
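In a static NUCA (SNUCA) organization, each block address maps to one fixed L2 bank. A minimal sketch of such a mapping follows, assuming 64-byte blocks, 16 banks, and interleaving on the low-order bits of the block address. These parameters and the name `home_bank` are assumptions for illustration, not values taken from the slides or from GEMS.

```c
#include <stdint.h>

#define BLOCK_BYTES 64u   /* assumed cache block size */
#define NUM_BANKS   16u   /* assumed number of L2 banks */

/* Home L2 bank of a physical address: drop the block offset, then
 * interleave consecutive blocks across the banks. */
unsigned home_bank(uint64_t paddr)
{
    uint64_t block = paddr / BLOCK_BYTES;
    return (unsigned)(block % NUM_BANKS);
}
```

Because the mapping depends only on the address, a processor can compute the destination tile of an L2 request locally and send it over the NoC without a lookup.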

Cache coherence is maintained
Write-back policy: a cache write updates main memory only when the block is evicted
Write-invalidate policy: a cache write invalidates all copies held by the other sharers

Cache coherence is maintained
A CPU wants to read a block cached at another cache (the current owner)
1) The CPU sends a read request to the memory controller
2) The controller forwards the request to the current owner
3) The owner sends the block to the requestor

Cache coherence: MOESI protocol class
The status of each cache block is represented with M/O/E/S/I
  Modified (M): modified (i.e., dirty); valid in one cache
  Owned (O): may or may not be clean; exists in multiple caches; owned by one cache
  Exclusive (E): clean; exists in one cache
  Shared (S): shared by multiple CPUs
  Invalid (I)
Owner: responsible for responding to any request
MOESI protocols: MSI, MOSI, MESI, MOESI, …

Cache coherence protocols
MSI/MOSI directory protocol
  E state is not implemented
  S-to-M transition always updates the main memory
MESI directory protocol
  O state is not implemented; dirty sharing is not allowed
  M-to-S transition always updates the main memory
MOESI directory protocol
MOESI token protocol [Martin ISCA03]
  Each block has as many tokens as there are CPUs
  A CPU that holds one or more tokens can read the block
  A CPU that holds all tokens can modify (write) the block
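The two permission rules of the token protocol can be written down directly. This is a minimal sketch of the rules as stated on the slide, not of the GEMS implementation; `NUM_CPUS`, `can_read`, and `can_write` are illustrative names, and eight CPUs are assumed to match the simulated target.

```c
#include <stdbool.h>

#define NUM_CPUS 8   /* tokens per block = number of CPUs (eight assumed) */

/* A CPU may read a block if it holds at least one of the block's tokens. */
bool can_read(int tokens_held)
{
    return tokens_held >= 1;
}

/* A CPU may write a block only if it holds all of the block's tokens. */
bool can_write(int tokens_held)
{
    return tokens_held == NUM_CPUS;
}
```

Since the number of tokens per block is fixed, a CPU that holds all of them leaves none for any other CPU, so no other cache can read the block while it is being written.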

MSI Protocol: State transition
CpuRd--- CpuWr--- CpuRd--- CpuRd--- M S M S CpuWrBusWr BusRdFlush CpuWr BusWr CpuRd BusRd BusWr Flush BusWr--- I I BusRd--- BusWr--- S-to-M transitions flush (update) the main memory Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MESI Protocol: State transition
CpuRd--- CpuWr--- CpuRd--- M E BusRd FlushOpt M E BusWr Flush CpuWr--- BusRd Flush BusWr FlushOpt CpuWrBusWr CpuRd BusRd(!C) CpuWr BusUpgr S I S CpuRd BusRd(C) I BusRd FlushOpt BusRd--- BusWr--- BusUpgr--- CpuRd--- M-to-S transitions flush (update) the main memory Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (1/2)
CpuRd--- CpuWr--- CpuRd--- CpuWr BusUpgr M E CpuWr--- CpuWrBusWr CpuRd BusRd(!C) CpuWr BusUpgr O S CpuRd BusRd(C) I CpuRd--- CpuRd--- Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (2/2)
BusRd Flush BusRd FlushOpt BusWr Flush BusWr FlushOpt O S I BusRdFlush BusRd FlushOpt BusRd--- BusWr--- BusUpgr--- BusWrFlush BusUpgr--- Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

Full-system simulation: GEMS/Simics
Wind River’s Simics: commercial detailed processor simulator
Univ. of Wisconsin’s GEMS: cache, memory, and network module for Simics
[Figure: simulated CMP with processor tiles (UltraSPARC, L1 I & D caches), cache tiles (L2 cache bank), on-chip routers, and main memory.]

Today’s simulation target
Solaris 9 OS on eight UltraSPARC processors
Parallel application examples: Pi and Integer Sort
Various coherence protocols are supported

Simulation target
Solaris 9 OS on eight UltraSPARC processors
Parallel application example: Integer Sort (IS)
Workflow: write a parallel program, compile it, and execute it on the 8-core UltraSPARC target on which Solaris 9 is running

Parallel application example: OpenMP
Hello from all threads:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Every thread in the parallel region executes the printf once. */
        #pragma omp parallel
        printf("hello world from %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

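    #include <stdio.h>
    #include <omp.h>

    #define N 1000000          /* array length (assumed; not given on the slide) */

    static double A[N];        /* shared array (declaration assumed) */

    int main(void)
    {
        int i;
        int num = 8;           /* number of threads (assumed) */
        double start_time, end_time;

        start_time = omp_get_wtime();
        omp_set_num_threads(num);
        #pragma omp parallel shared(A) private(i)
        {
            /* The loop iterations are divided among the threads. */
            #pragma omp for
            for (i = 0; i < N; i++)
                A[i] = A[i] * A[i] - 3.0;
        }
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }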

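    #include <stdio.h>
    #include <omp.h>

    #define N 10000000         /* number of series terms (assumed; not given on the slide) */

    int main(void)
    {
        int i;
        double s = 0.0;
        double start_time, end_time;

        start_time = omp_get_wtime();
        #pragma omp parallel private(i) reduction(+:s)
        {
            /* Each thread accumulates into a private copy of s; the
               reduction adds the partial sums together at the end. */
            #pragma omp for
            for (i = 0; i < N; i++)
                s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
        }
        printf("pi = %lf\n", s);
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }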
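The loop body sums the Leibniz series for pi with consecutive terms taken in pairs. The slide does not state this, so the identity is written out here as background; the symbol N stands for the loop bound in the program.

```latex
\[
\pi \;=\; 4\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1}
    \;=\; \sum_{i=0}^{\infty}\left(\frac{4}{4i+1}-\frac{4}{4i+3}\right),
\qquad
\frac{4}{4i+1}-\frac{4}{4i+3} \;=\; \frac{8}{(4i+1)(4i+3)} .
\]
\[
\pi \;-\; \sum_{i=0}^{N-1}\left(\frac{4}{4i+1}-\frac{4}{4i+3}\right)
    \;=\; \sum_{i=N}^{\infty}\frac{8}{(4i+1)(4i+3)}
    \;\approx\; \frac{1}{2N}
    \quad\text{for large } N .
\]
```

Every paired term is positive and independent of the others, so the iterations can be distributed over threads in any order and combined with a sum reduction.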

The first step: How to use the simulator
Please pick up your account information
Log in to one of the ICS cluster machines (id = 01…15):
    ssh -X
Copy the sample scripts and configuration files:
    cp -r ~matutani/comparch2011/files work
    cd work

Start Simics:
    ./start_ideal_memory.sh
You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).

On the target machine, for example, you can check the number of processors as follows:
    bash-2.05# /usr/sbin/psrinfo -v
You will see that there are eight processors.

Parallel application: “pi” calculation
You can execute the "pi" calculation program with eight, four, or one thread. Set OMP_NUM_THREADS, then run ./pi again after each change.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# export OMP_NUM_THREADS=1

Parallel application: Integer Sort (IS)
You can execute the Integer Sort (IS) program with eight, four, or one thread. Set OMP_NUM_THREADS, then run ./IS again after each change.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# export OMP_NUM_THREADS=1

Exercise 1
Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results.
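One way to judge whether the time decreases linearly is to convert the measured times into speedup and parallel efficiency. The helpers below are a minimal sketch of those two standard definitions. The function names are illustrative, and no measured values are implied.

```c
#include <assert.h>

/* Speedup of a multi-threaded run relative to the 1-thread run. */
double speedup(double t1_sec, double tp_sec)
{
    assert(t1_sec > 0.0 && tp_sec > 0.0);
    return t1_sec / tp_sec;
}

/* Parallel efficiency: 1.0 means the time shrank exactly linearly
 * with the number of threads. */
double efficiency(double t1_sec, double tp_sec, int threads)
{
    assert(threads > 0);
    return speedup(t1_sec, tp_sec) / threads;
}
```

For example, `efficiency(t1, t8, 8)` takes the elapsed times printed by the 1-thread and 8-thread runs; a value well below 1.0 indicates sub-linear scaling.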

Coherence protocols: Integer Sort (IS)
The following scripts automatically run the IS program with different cache coherence protocols:
    ./start_moesi_directory.sh
    ./start_mesi_directory.sh
    ./start_msi_mosi_directory.sh
    ./start_moesi_token.sh
Each simulation takes five to ten minutes. Do not run more than one script at the same time!

Exercise 2
Report the execution time of IS with the MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token protocols. Discuss the results. For more detail about the protocols, see pages 14-19.
