Software Cache Coherent Control by Parallelizing Compiler


Software Cache Coherent Control by Parallelizing Compiler
Boma A. Adhi, Masayoshi Mase, Yuhei Hosokawa, Yohei Kishimoto, Taisuke Onishi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara
Department of Computer Science and Engineering, Waseda University
LCPC 2017, Texas A&M University

What is Cache Coherency?

Cache coherency: shared data must appear consistent to every CPU core.

[Figure: PE0 and PE1, each with a CPU and a private cache, connected through an interconnection network to shared memory holding x = 1. Animation: both PEs execute load x and cache x = 1; PE0 then executes x = 2 in its own cache, so PE1's cached copy of x is now invalid.]

A coherent memory system satisfies three properties:
- A read by processor P of location X that follows a write by P to X, with no writes to X by another processor between the write and the read, always returns the value written by P.
- A read by a processor of location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
- Writes to the same location are serialized: two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, no processor can read the location as 2 and later read it as 1.

Why Compiler-Controlled Cache Coherency?

Current many-core CPUs (Xeon Phi, TILEPro64, etc.) use hardware cache coherency mechanisms. For many-cores with hundreds to thousands of cores, hardware cache coherency:
- makes the hardware complex and occupies a large silicon area,
- consumes significant power,
- takes a long time and a large cost to design.

Software-based coherency needs only small hardware, consumes little power, and scales well. However, no efficient software coherence control method had been proposed.

Proposed Software Coherence Method on the OSCAR Parallelizing Compiler

Coarse-grain task parallelization with earliest-executable-condition analysis (control and data dependency analysis). The OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
- To cope with the stale data problem: data synchronization by the compiler.
- To cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers.

[Figure: macro task graph (MTG) generated by earliest-executable-condition analysis.]

The Renesas RP2 Embedded Multicore

Two clusters of four cores each. Hardware cache-coherent (SMP) control is available only within a four-core cluster; there is no coherence support when five or more cores, i.e. both clusters, are used.

Non-Coherent Cache Architecture

[Figure: PE0, PE1, ..., PEn, each containing a CPU and a local cache, connected through an interconnection network to shared memory. Each cache entry holds a valid bit (V), a dirty bit (D), a tag, and data.]

The local cache is controlled by instructions from the owner CPU: self-invalidate, write-back, and flush.
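As a conceptual illustration of the per-line state described above, here is a minimal C sketch of one local-cache entry. The field widths and the 16-byte line size are assumptions for illustration (the line size follows the aligned(16) attributes used later in this talk), not the RP2's actual layout:

    #include <stdint.h>

    #define LINE_SIZE 16  /* assumed cache line size in bytes */

    /* One entry of a non-coherent local cache: valid bit (V),
       dirty bit (D), tag, and data, as in the figure above. */
    struct cache_line {
        unsigned valid : 1;       /* V: entry holds usable data             */
        unsigned dirty : 1;       /* D: entry was modified by this CPU and
                                     must be written back before eviction   */
        uint32_t tag;             /* identifies which memory block is cached */
        uint8_t  data[LINE_SIZE]; /* the cached copy of the block           */
    };

    /* The owner CPU manages such entries explicitly:
       - self-invalidate: clear V so the next access refetches from memory
       - write-back:      copy a dirty entry to shared memory and clear D
       - flush:           write back if dirty, then invalidate             */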

Problem in a Non-Coherent Architecture (1): Stale Data

Stale data: an obsolete copy of shared data remains in a processor's cache after another processor core has written a new value.

Example, with global variables int a = 0; int b = 0; int c = 10; in shared memory:
- PE1 executes b = a; and caches a = 0.
- PE0 executes a = 20; the new value stays in PE0's cache and is never published to PE1.
- PE1 later executes c = a;. Because the stale copy a = 0 is still in PE1's cache, PE1 computes c = 0, whereas the correct value under coherent control would be c = 20. The values are not coherent.

Solving the Stale Data Problem

The compiler automatically inserts a write-back instruction and a synchronization barrier into the PE0 code, then inserts a self-invalidate instruction into the PE1 code after the synchronization barrier.
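A minimal sketch of the restructured code for the example above, assuming hypothetical primitives writeback(), self_invalidate(), and barrier_sync() in place of the RP2's actual cache-control and synchronization instructions (which the slides do not name):

    /* Hypothetical cache-control primitives standing in for the real ones. */
    extern void writeback(void *addr);       /* write the line holding addr back to memory  */
    extern void self_invalidate(void *addr); /* drop the cached copy of the line at addr    */
    extern void barrier_sync(void);          /* synchronization barrier between PEs         */

    int a = 0, b = 0, c = 10;  /* shared globals from the stale-data example */

    void pe0_code(void) {
        a = 20;               /* the new value lands in PE0's cache         */
        writeback(&a);        /* compiler-inserted: publish a = 20          */
        barrier_sync();       /* compiler-inserted: order PE1 after this    */
    }

    void pe1_code(void) {
        b = a;                /* may cache the old a = 0                    */
        barrier_sync();       /* compiler-inserted: wait for PE0's write-back */
        self_invalidate(&a);  /* compiler-inserted: drop the stale copy     */
        c = a;                /* refetches from memory: reads 20, not 0     */
    }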

Problem in a Non-Coherent Architecture (2): False Sharing

False sharing: multiple cores share the same memory block or cache line, and each processor updates different elements of that block/line.

Example, with global variables int a = 0; int b = 0; placed as different elements of the same cache line: PE0 executes a = 10; and PE1 executes b = 20;, each in its own cached copy of the line. When the two copies are written back, whichever line replacement happens last overwrites the other PE's update, so depending on the timing of the line replacements the data stored in memory is inconsistent.

Solving False Sharing (1): Data Alignment

Data alignment solves the false sharing problem: different data accessed by different PEs should not share a single cache line. This approach works for scalar variables and small one-dimensional arrays.

    int a __attribute__((aligned(16))) = 0;
    int b __attribute__((aligned(16))) = 0;
    /* a and b are each assigned to the first element of a cache line */

With a and b on different cache lines, the updates a = 10; by PE0 and b = 20; by PE1 are written back separately without interfering.
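A small standalone test (not from the slides) to confirm what the alignment attribute buys: with a 16-byte line, two variables are free of false sharing exactly when their addresses fall in different 16-byte blocks.

    #include <stdio.h>
    #include <stdint.h>

    int a __attribute__((aligned(16))) = 0;
    int b __attribute__((aligned(16))) = 0;

    int main(void) {
        /* With a 16-byte cache line, address >> 4 gives the line index;
           aligned(16) guarantees a and b start on different lines. */
        printf("a at %p, line %lu\n", (void *)&a,
               (unsigned long)((uintptr_t)&a >> 4));
        printf("b at %p, line %lu\n", (void *)&b,
               (unsigned long)((uintptr_t)&b >> 4));
        return 0;
    }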

Two-Dimensional Array Problem

Splitting an array cleanly along cache-line boundaries is not always possible when the lowest dimension is not an integer multiple of the cache line size.

    int a[6][6];

    /* PE0 writes rows 0-2 */
    for (i = 0; i < 3; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = i * j;

    /* PE1 writes rows 3-5 */
    for (i = 3; i < 6; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = i * j;

With 4-byte ints and a 16-byte cache line, each 24-byte row ends in the middle of a line: for example, the line holding a[2][4] and a[2][5] (written by PE0) also holds a[3][0] and a[3][1] (written by PE1), so PE0 and PE1 share a cache line.

Solving False Sharing (2): Array Padding

The compiler inserts padding (dummy data) at the end of each row of the array so that the row size matches a multiple of the cache line size:

    int a[6][8] __attribute__((aligned(16)));
    /* a is aligned to the first element of a cache line */

    /* PE0 writes rows 0-2 */
    for (i = 0; i < 3; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = i * j;

    /* PE1 writes rows 3-5 */
    for (i = 3; i < 6; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = i * j;

Each padded row is now 8 x 4 = 32 bytes, an exact multiple of the 16-byte line, so rows written by different PEs never share a cache line; only the padding elements a[i][6] and a[i][7] go unused.
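The padding arithmetic can be checked at compile time; a sketch assuming C11 and the 16-byte line size implied by aligned(16):

    #include <assert.h>

    #define LINE_SIZE 16  /* assumed cache line size in bytes */

    int a[6][8] __attribute__((aligned(16)));  /* padded row: 8 ints = 32 bytes */

    /* Each row must span a whole number of cache lines so that rows
       assigned to different PEs never share a line. */
    static_assert(sizeof(a[0]) % LINE_SIZE == 0,
                  "row size must be a multiple of the cache line size");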

Solving False Sharing (3): Non-Cacheable Buffer

Data transfer using a non-cacheable buffer: the idea is to put a small area in main memory that is never copied to the cache along the border between the regions modified by different processor cores.

    int a[6][6] __attribute__((aligned(16)));
    /* a is aligned to the first element of a cache line; with the
       rows 1-3 / 4-5 split below, its border falls exactly on a
       cache-line boundary, so a needs no buffer */
    int b[5][6];
    /* b's border row b[3] starts mid-line even when b is line-aligned,
       so PE0 and PE1 would falsely share that line */
    int nc_buf[1][6] __attribute__((section("UNCACHE")));
    /* nc_buf is placed in a non-cacheable area */

    /* PE0 */
    for (i = 1; i < 4; i++)
        for (j = 0; j < 6; j++) {
            a[i][j] = i * j;
            b[i-1][j] = i * j;
        }

    /* PE1: the border row of b is written to nc_buf instead of b */
    for (i = 4; i < 5; i++)
        for (j = 0; j < 6; j++) {
            a[i][j] = i * j;
            nc_buf[i-4][j] = i * j;
        }
    for (i = 5; i < 6; i++)
        for (j = 0; j < 6; j++) {
            a[i][j] = i * j;
            b[i-1][j] = i * j;
        }

    /* After synchronization: transfer the border row from the buffer */
    for (i = 4; i < 5; i++)
        for (j = 0; j < 6; j++)
            b[i-1][j] = nc_buf[i-4][j];

Since non-cacheable accesses are slow, only the border row that would share a cache line goes through the buffer; everything else is written through the cache as usual.

Performance Evaluation

10 benchmark applications in C, written in the Parallelizable C format (similar to MISRA C in the embedded field), were fed into the source-to-source OSCAR automatic parallelizing compiler with non-coherent-cache control. The parallelized code was then compiled with the Renesas C compiler and evaluated on the RP2 processor: 8 cores as dual 4-core modules, with hardware cache coherency available only among the 4 cores of a module.

Performance of Hardware Coherence Control and Software Coherence Control on the 8-core RP2

[Results chart.]

SPEC Results

[Results chart.]

NAS Parallel Benchmark and MediaBench II

[Results chart.]

Software Coherence Impact on Performance

Configurations compared:
- SMP: hardware cache coherence (baseline).
- NCC (soft): both stale data handling and false sharing avoidance turned on; hardware coherence off.
- NCC (hard): stale data handling and false sharing avoidance turned on, with hardware coherence also on.
- Stale data handling only; hardware coherence off.
- False sharing avoidance only; hardware coherence off.

Conclusions

The OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
- To cope with the stale data problem: data synchronization by the compiler.
- To cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers.

The method was evaluated using 10 benchmark applications from SPEC2000, SPEC2006, the NAS Parallel Benchmark, and MediaBench II on the Renesas RP2 8-core multicore processor.

Conclusion (cont'd)

The method provides good speedup automatically. A few examples with 4 cores (software coherence vs. hardware coherence):
- 2.63x on SPEC2000 equake (2.52x with hardware)
- 3.28x on SPEC2006 lbm (2.9x with hardware)
- 3.71x on NPB cg (3.34x with hardware)
- 3.02x on MediaBench II MPEG2 encoder (3.17x with hardware)

It also enables automatic use of all 8 cores of the RP2, where hardware coherence is unavailable, with good speedup:
- 4.37x on SPEC2000 equake
- 4.76x on SPEC2006 lbm
- 5.66x on NPB cg
- 4.92x on MediaBench II MPEG2 encoder

Thank you