Published by Dora Townsend. Modified over 9 years ago.
1
COMP60621 Concurrent Programming for Numerical Applications
Lecture 6: Chronos – a Dell Multicore Computer
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester
Nov 2010
2
Overview
- Processor
  - AMD Opteron quad-core processor (‘Shanghai’)
  - Chronos has four processors (i.e. 16 cores)
- Cache structure
  - L1 and L2 cache per core
  - L3 cache shared between the four cores of each processor
- Memory
  - 6GB (6 x 1GB memory modules) per processor (24GB total)
- Interconnect
  - AMD ‘Direct Connect Architecture’ (Coherent HyperTransport Technology)
  - No ‘Front side bus’, as found in some Intel platforms
- Performance issues
- Further information
3
Processor: Quad-Core AMD Opteron
Source: www.amd.com, Quad-Core AMD Opteron Product Brief
4
Processor – AMD Opteron 8378 ‘Shanghai’
- 64 bit
- 2.4GHz clock speed
- Separate 64KB level 1 data and instruction caches per core
  - 2-way set associative, LRU replacement, exclusive
- 512KB level 2 cache per core (exclusive, i.e. data in L1 does not need to be in other caches)
  - unified (code and data)
  - 16-way set associative, pseudo LRU replacement
- 6144KB (6MB) level 3 cache per processor (can be inclusive)
  - shared by 4 cores
  - unified
  - 64-way set associative, pseudo LRU replacement
- Cache line sizes are 64B (‘unit of coherency’)
5
AMD Opteron cache behaviour
- L1 and L2 are exclusive caches: data is never in both caches
  - L2 holds data evicted from L1
  - On an L2 hit, data is moved to L1 and removed from L2
  - L2 evicts data to L3
- Access to an address that would lead to an L3 miss brings data straight to L1
  - Only after eviction from L1 and L2 does data come into L3 (L2 and L3 are ‘victim’ caches)
- If data held in L3 is required in L1 again, L3 keeps a copy (inclusive behaviour) if the data is likely to be shared with other cores, but doesn’t keep a copy if the data is unlikely to be shared (exclusive behaviour)
- Cache behaviour on the Opteron is ‘mostly exclusive’
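The exclusive L1/L2 behaviour described on this slide can be sketched with a toy model. This is only an illustration under strong simplifying assumptions: both levels are treated as fully associative with true LRU replacement (the real caches are set associative, and L2 uses pseudo LRU), L3 is not modelled, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveTwoLevel:
    """Toy exclusive L1/L2 pair: a line lives in at most one level."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # oldest entry first, i.e. LRU order
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line):
        """Touch a cache line and report where it was found."""
        if line in self.l1:
            self.l1.move_to_end(line)  # mark as most recently used
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]  # on an L2 hit the line leaves L2 ...
            result = "L2 hit"
        else:
            result = "miss"  # ... and on a miss it goes straight to L1
        self._fill_l1(line)
        return result

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict LRU line from L1
            if len(self.l2) >= self.l2_lines:
                # The real L2 would pass its victim on to L3; not modelled.
                self.l2.popitem(last=False)
            self.l2[victim] = True  # L2 holds data evicted from L1
        self.l1[line] = True


cache = ExclusiveTwoLevel(l1_lines=2, l2_lines=2)
for line in [0, 1, 2, 0, 0]:
    print(line, cache.access(line))
```

With two lines per level, the third access pushes line 0 out of L1 into L2, so the fourth access finds it in L2 and moves it back to L1.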
6
AMD Opteron latencies
- Getting data into the registers
  - L1 access, 3 cycles then 1 cycle per load (~1.5ns)
  - L2 access, 9 cycles beyond L1 (~4ns)
  - L3 access, 29 cycles (at best) (~13ns)
  - Local memory (read access), ~140ns (not directly related to cpu cycles!)
    - An average benchmarked figure using, e.g., lmbench
- On chronos, 1 cpu cycle is just under 0.42ns
- Memory access time is approximate: it depends on how much work the memory system has to do to get the data and how ‘busy’ it is
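The cycle counts above can be converted to time using the 2.4GHz clock speed quoted for the processor. The following is a small arithmetic sketch; the function and variable names are illustrative only.

```python
CLOCK_HZ = 2.4e9  # clock speed quoted for the Opteron 8378


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of cpu cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


for label, cycles in [
    ("one cycle", 1),
    ("L1 access", 3),
    ("L2 access (beyond L1)", 9),
    ("L3 access (best case)", 29),
]:
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
```

One cycle comes out at about 0.417ns, which matches "just under 0.42ns". The pure cycle conversions (1.25ns, 3.75ns and about 12.08ns) are a little below the approximate figures quoted on the slide.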
7
AMD Opteron 4P server architecture
Source: www.amd.com, AMD 4P Server and Workstation Comparison
8
AMD Quad-quad ccNUMA architecture
- Each processor is directly connected to some memory
  - Each processor has a memory controller
  - Bandwidth, 12.8GB/s (aggregate over two channels)
- Processors are connected to each other with:
  - Bi-directional Coherent HyperTransport Technology (HT)
  - Coherency unit is 64 Bytes (i.e. cache line size)
  - Up to 8.0GB/s per link (4GB/s in each direction)
  - 3 HT links per processor, usually 2 used to connect to other processors and 1 used for I/O (via PCI bridge)
- Separate memory and I/O paths
  - Compare with Front side bus architecture used by, e.g., Intel
9
Performance issues
- Cores on the same processor can directly access some of the system’s memory (local memory) through the cache hierarchy
  - They can communicate with each other via the shared L3 cache
- Cores on different processors access remote memory via the cHT (coherent HyperTransport) links, which maintain coherency of data in the L3 caches (and memory)
  - Access to remote memory may take 1 ‘hop’ (to memory on the two other processors one cHT link away) or 2 ‘hops’ (to memory on the fourth processor, two cHT links away)
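The hop counts can be illustrated with a short breadth-first search. This sketch assumes the four processors are joined in a ring, which is consistent with each processor using two HT links for other processors and with two neighbours being one hop away. The processor numbering 0 to 3 and the `LINKS` table are hypothetical, not taken from chronos.

```python
from collections import deque

# Assumed ring of four processors: 0 - 1 - 2 - 3 - 0 (numbering is illustrative).
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def hops_from(source, links=LINKS):
    """Return the number of cHT hops from `source` to every processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


print(hops_from(0))
```

From any processor in this assumed ring, local memory is 0 hops away, two processors are 1 hop away and the remaining one is 2 hops away.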
10
AMD Opteron Memory latencies
- Local memory reads, 100% (base case)
- Local memory writes, ~113%
- 1 hop reads, ~108%
- 2 hop reads, ~130%
- 1 hop writes, ~128%
- 2 hop writes, ~150%
- Remember, data is placed in physical memory according to a ‘first touch’ (by a thread) policy!
- This is benchmarked data, 1 thread, idle machine
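To get a feel for absolute numbers, the relative figures can be scaled by a base latency. The sketch below assumes the ~140ns local read figure from the earlier latencies slide is the 100% base case. The slides do not state this explicitly, so the results are rough estimates only, and the names used are illustrative.

```python
BASE_LOCAL_READ_NS = 140.0  # assumption: the ~140ns local read figure is the base case

RELATIVE_PERCENT = {
    "local read": 100,
    "local write": 113,
    "1 hop read": 108,
    "2 hop read": 130,
    "1 hop write": 128,
    "2 hop write": 150,
}


def estimated_latency_ns(kind, base_ns=BASE_LOCAL_READ_NS):
    """Scale the base latency by the relative figure for one access kind."""
    return base_ns * RELATIVE_PERCENT[kind] / 100


for kind in RELATIVE_PERCENT:
    print(f"{kind}: ~{estimated_latency_ns(kind):.0f} ns")
```

Under that assumption a 2 hop write costs roughly 210ns, about 70ns more than a local read, which is why it matters which thread first touches the data.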
11
Further information
- See www.amd.com. Follow: Products and Technologies -> Server Products -> Server Processors:
  - Product Brief
  - Key Architectural Features
  - Direct Connect Architecture
  - HyperTransport Technology
  - Quad-Core AMD Opteron Processor
  - 4P Server and Workstation Comparison
- Another useful, though slightly old, document is: Performance Guidelines for AMD Athlon and Opteron ccNUMA Multiprocessor Systems. Available at: www.amd.com.cn/CHCN/assets/content_type/white_papers_and_tech_docs/40555.pdf
12
Information on chronos
- Look in files such as:
  - /proc/cpuinfo
  - /proc/meminfo
  - /sys/devices/system/cpu/cpu0/cache/index0 to index3
- From information in /proc/cpuinfo you can create a map of the logical processor ids (in the range [0-15], one per core) to physical processor ids [0-3] and (physical) core ids [0-3]. You should do this!
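One possible way to build that map is sketched below. It assumes the usual x86 Linux layout of /proc/cpuinfo: one blank-line-separated block per logical processor, containing `processor`, `physical id` and `core id` fields. The `SAMPLE` text is made up for illustration and is used only as a fallback when /proc/cpuinfo is absent; it is not output from chronos. The function name is hypothetical.

```python
from pathlib import Path

# Made-up two-core excerpt, used only if /proc/cpuinfo does not exist.
SAMPLE = """processor\t: 0
physical id\t: 0
core id\t\t: 0

processor\t: 1
physical id\t: 1
core id\t\t: 0
"""


def parse_cpuinfo(text):
    """Return {logical processor id: (physical processor id, core id)}."""
    mapping = {}
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Skip blocks that lack any of the three fields (e.g. on non-x86 systems).
        if all(name in fields for name in ("processor", "physical id", "core id")):
            mapping[int(fields["processor"])] = (
                int(fields["physical id"]),
                int(fields["core id"]),
            )
    return mapping


path = Path("/proc/cpuinfo")
text = path.read_text() if path.exists() else SAMPLE
for logical, (physical, core) in sorted(parse_cpuinfo(text).items()):
    print(f"logical {logical:2d} -> physical processor {physical}, core {core}")
```

On chronos this should print 16 lines, one per core. Which logical ids share a physical processor is worth checking, since only those cores share an L3 cache and local memory.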
13
Results of vec.f on chronos
[Figure: performance (Mflop/s) plotted against log10 N (bytes), with the cache capacities marked: L1 = 64KB, L2 = 512KB, L3 = 6MB]