

1 COMP60621 Concurrent Programming for Numerical Applications
Lecture 6: Chronos – a Dell Multicore Computer
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester
Nov 2010

2 Overview
- Processor
  - AMD Opteron quad-core processor (‘Shanghai’)
  - Chronos has four processors (i.e. 16 cores)
- Cache structure
  - L1 and L2 cache per core
  - L3 cache shared between the four cores
- Memory
  - 6GB (6 x 1GB memory modules) per processor (24GB total)
- Interconnect
  - AMD ‘Direct Connect Architecture’ (Coherent HyperTransport Technology)
  - No ‘front side bus’, as found in some Intel platforms
- Performance issues
- Further information

3 Processor: Quad-Core AMD Opteron
Source: www.amd.com, Quad-Core AMD Opteron Product Brief

4 Processor – AMD Opteron 8378
- ‘Shanghai’, 64-bit
- 2.4GHz clock speed
- Separate 64KB level 1 data and instruction caches per core
  - 2-way set associative, LRU replacement, exclusive
- 512KB level 2 cache per core (exclusive, i.e. data in L1 does not need to be in other caches)
  - unified (code and data)
  - 16-way set associative, pseudo-LRU replacement
- 6144KB (6MB) level 3 cache per processor (can be inclusive)
  - shared by the 4 cores
  - unified
  - 64-way set associative, pseudo-LRU replacement
- Cache line size is 64B (the ‘unit of coherency’)
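The cache sizes, associativities and line size quoted above fix the number of sets in each cache. A minimal sketch of that arithmetic follows; the function name `cache_sets` is hypothetical, and the figures are taken from the slide as given rather than checked against AMD documentation.

```python
KB = 1024
LINE_BYTES = 64  # cache line size quoted on the slide


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = (number of lines in the cache) / (lines per set)."""
    lines = size_bytes // line_bytes
    return lines // ways


# Figures as quoted on the slide (size, associativity).
L1_SETS = cache_sets(64 * KB, ways=2)
L2_SETS = cache_sets(512 * KB, ways=16)
L3_SETS = cache_sets(6144 * KB, ways=64)

print(L1_SETS, L2_SETS, L3_SETS)  # 512 512 1536
```

Addresses that are a multiple of (sets x line size) bytes apart map to the same set, which is why strided accesses can thrash a low-associativity cache such as the 2-way L1.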

5 AMD Opteron cache behaviour
- L1 and L2 are exclusive caches
  - Data is never in both caches; L2 holds data evicted from L1
  - On an L2 hit, data is moved to L1 and removed from L2
  - L2 evicts data to L3
- An access to an address that misses in L3 brings the data straight into L1
  - Only after eviction from L1 and L2 does data come into L3 (L2 and L3 are ‘victim’ caches)
- If data held in L3 is required in L1 again, L3 keeps a copy (inclusive behaviour) if the data is likely to be shared with other cores, but does not keep a copy if the data is unlikely to be shared (exclusive behaviour)
- Cache behaviour on the Opteron is ‘mostly exclusive’
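The victim-cache behaviour described above can be sketched with a toy model. This is an illustration only, under loud assumptions: each level is fully associative with true LRU replacement, capacities are counted in lines, there is a single core, and L3 always behaves exclusively (the real L3 may keep a copy of shared data). The class name `ToyExclusiveHierarchy` is hypothetical.

```python
from collections import OrderedDict


class ToyExclusiveHierarchy:
    """Toy single-core model: L1 fills from memory, L2 and L3 hold victims."""

    def __init__(self, l1_lines, l2_lines, l3_lines):
        self.capacity = {"L1": l1_lines, "L2": l2_lines, "L3": l3_lines}
        self.level = {name: OrderedDict() for name in self.capacity}

    def _insert(self, name, line):
        """Insert a line at level `name`; push any evicted victim one level down."""
        cache = self.level[name]
        cache[line] = True
        cache.move_to_end(line)
        if len(cache) > self.capacity[name]:
            victim, _ = cache.popitem(last=False)  # least recently used line
            lower = {"L1": "L2", "L2": "L3"}.get(name)
            if lower is not None:
                self._insert(lower, victim)
            # A victim leaving L3 is simply dropped in this model.

    def access(self, line):
        """Return the level where the line was found; the line ends up in L1."""
        if line in self.level["L1"]:
            self.level["L1"].move_to_end(line)
            return "L1"
        for name in ("L2", "L3"):
            if line in self.level[name]:
                del self.level[name][line]  # assumption: exclusive, no copy kept
                self._insert("L1", line)
                return name
        self._insert("L1", line)  # a miss everywhere fills L1 directly
        return "memory"
```

With capacities of two lines per level, touching lines 0, 1, 2 and then 0 again reports `memory`, `memory`, `memory`, `L2`: line 0 was evicted from L1 into L2 and is moved back on the hit.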

6 AMD Opteron latencies
- Getting data into the registers
  - L1 access: 3 cycles, then 1 cycle per load (~1.5ns)
  - L2 access: 9 cycles beyond L1 (~4ns)
  - L3 access: 29 cycles at best (~13ns)
  - Local memory (read access): ~140ns (not directly related to CPU cycles!)
    - An average benchmarked figure, obtained using e.g. lmbench
- On chronos, 1 CPU cycle is just under 0.42ns (1/2.4GHz)
- Memory access time is approximate
  - It depends on how much work the memory system has to do to get the data and how ‘busy’ it is
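The cycle counts above can be converted to times using the 2.4GHz clock. The sketch below shows the arithmetic; the helper names are hypothetical, and the slide's nanosecond figures are rounded, so the computed values are slightly lower.

```python
CLOCK_GHZ = 2.4  # chronos clock speed; GHz is the same as cycles per ns


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Time in ns taken by a given number of CPU cycles."""
    return cycles / clock_ghz


def ns_to_cycles(ns, clock_ghz=CLOCK_GHZ):
    """Number of CPU cycles that elapse in a given time in ns."""
    return ns * clock_ghz


print(round(cycles_to_ns(1), 3))   # one cycle: 0.417 ns
print(round(cycles_to_ns(3), 2))   # L1 access: 1.25 ns
print(round(cycles_to_ns(29), 2))  # best-case L3 access: 12.08 ns
print(round(ns_to_cycles(140)))    # local memory read: 336 cycles
```

A local memory read therefore costs roughly as much as a few hundred cycles of computation, which is the motivation for keeping working sets in cache.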

7 AMD Opteron 4P server architecture
Source: www.amd.com, AMD 4P Server and Workstation Comparison

8 AMD Quad-quad ccNUMA architecture
- Each processor is directly connected to some memory
  - Each processor has a memory controller
  - Bandwidth: 12.8GB/s (aggregate over two channels)
- Processors are connected to each other with bi-directional Coherent HyperTransport Technology (HT)
  - Coherency unit is 64 bytes (i.e. the cache line size)
  - Up to 8.0GB/s per link (4GB/s in each direction)
  - 3 HT links per processor; usually 2 are used to connect to other processors and 1 is used for I/O (via a PCI bridge)
- Separate memory and I/O paths
  - Compare with the front side bus architecture used by, e.g., Intel

9 Performance issues
- Cores on the same processor can directly access some of the system’s memory (their local memory) through the cache hierarchy
  - They can communicate with each other via the shared L3 cache
- Cores on different processors access remote memory via the cHT (coherent HyperTransport) links, which maintain coherency of data in the L3 caches (and memory)
- Access to remote memory may take 1 ‘hop’ (to memory on either of the two processors one cHT link away) or 2 ‘hops’ (to memory on the fourth processor, two cHT links away)
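The hop counts follow from each processor being linked to two of the other three. The sketch below models this as a square of four processors and counts hops with a breadth-first search. The numbering in `LINKS` is an assumption made for illustration; the actual wiring on chronos is not given in the slides.

```python
from collections import deque

# Hypothetical numbering: each processor is linked to two others (a square).
LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}


def hops(src, dst, links=LINKS):
    """Minimum number of cHT links between two processors (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination not reachable")


for dst in range(4):
    print(f"processor 0 -> memory on processor {dst}: {hops(0, dst)} hop(s)")
```

Under this numbering, processor 0 reaches its own memory in 0 hops, the memory of processors 1 and 2 in 1 hop, and the memory of processor 3 in 2 hops.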

10 AMD Opteron memory latencies
- Local memory reads: 100% (base case)
- Local memory writes: ~113%
- 1 hop reads: ~108%
- 2 hop reads: ~130%
- 1 hop writes: ~128%
- 2 hop writes: ~150%
- Remember, data is placed in physical memory by a ‘first touch’ policy, i.e. as a result of the first touch by a thread!
- This is benchmarked data: 1 thread, idle machine
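Combining these relative figures with the ~140ns local read latency quoted earlier gives rough absolute estimates. This is a back-of-the-envelope sketch: it assumes the 100% base case corresponds to that ~140ns figure, which the slides do not state explicitly, and the names below are hypothetical.

```python
LOCAL_READ_NS = 140.0  # assumption: the base case is the ~140ns local read figure

# Relative latencies from the slide, keyed by (access kind, number of hops).
RELATIVE = {
    ("read", 0): 1.00,
    ("write", 0): 1.13,
    ("read", 1): 1.08,
    ("write", 1): 1.28,
    ("read", 2): 1.30,
    ("write", 2): 1.50,
}


def estimated_latency_ns(kind, hop_count, base_ns=LOCAL_READ_NS):
    """Rough latency estimate for a read or write that is 0, 1 or 2 hops away."""
    return base_ns * RELATIVE[(kind, hop_count)]


for (kind, hop_count), _ in sorted(RELATIVE.items(), key=lambda kv: kv[1]):
    print(f"{hop_count}-hop {kind}: ~{estimated_latency_ns(kind, hop_count):.0f} ns")
```

On these figures a 2-hop write costs about 210ns against 140ns for a local read, so where a thread first touches its data matters.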

11 Further information
- See www.amd.com. Follow: Products and Technologies -> Server Products -> Server Processors:
  - Product Brief
  - Key Architectural Features
  - Direct Connect Architecture
  - HyperTransport Technology
  - Quad-Core AMD Opteron Processor 4P Server and Workstation Comparison
- Another useful, though slightly old, document is:
  - Performance Guidelines for AMD Athlon and Opteron ccNUMA Multiprocessor Systems. Available at: www.amd.com.cn/CHCN/assets/content_type/white_papers_and_tech_docs/40555.pdf

12 Information on chronos
- Look in files such as:
  - /proc/cpuinfo
  - /proc/meminfo
  - /sys/devices/system/cpu/cpu0/cache/index0 to index3
- From the information in /proc/cpuinfo you can create a map from the logical processor ids (in the range [0-15], one per core) to physical processor ids [0-3] and (physical) core ids [0-3].
  - You should do this!
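One possible way to build that map is sketched below. It assumes the usual x86 Linux layout of /proc/cpuinfo: one blank-line-separated block per logical processor, with `processor`, `physical id` and `core id` fields. The function name `parse_cpuinfo` is hypothetical.

```python
import re


def parse_cpuinfo(text):
    """Map logical processor id -> (physical processor id, core id)."""
    mapping = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Skip blocks that lack any of the three fields we need.
        if all(k in fields for k in ("processor", "physical id", "core id")):
            mapping[int(fields["processor"])] = (
                int(fields["physical id"]),
                int(fields["core id"]),
            )
    return mapping


# Usage on a Linux machine such as chronos:
#     with open("/proc/cpuinfo") as f:
#         for cpu, (socket, core) in sorted(parse_cpuinfo(f.read()).items()):
#             print(f"logical {cpu}: processor {socket}, core {core}")
```

The resulting map is what you need in order to pin threads to chosen cores, for example to keep communicating threads on the same processor so that they share an L3 cache.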

13 Results of vec.f on chronos
[Figure: performance (Mflop/s) plotted against log10 N (bytes), with the cache capacities marked: L1 = 64KB, L2 = 512KB, L3 = 6MB]
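The cache capacities marked on the plot can be placed on its log10 N axis, and a working set of N bytes can be matched to the smallest level that holds it. The sketch below treats each level's capacity separately, as the plot annotations do, and ignores the extra effective capacity of a mostly exclusive hierarchy; the names are hypothetical.

```python
import math

KB = 1024
CACHE_BYTES = [("L1", 64 * KB), ("L2", 512 * KB), ("L3", 6144 * KB)]


def smallest_level_holding(n_bytes):
    """Smallest cache level whose capacity covers a working set of n_bytes."""
    for name, capacity in CACHE_BYTES:
        if n_bytes <= capacity:
            return name
    return "memory"


for name, capacity in CACHE_BYTES:
    print(f"{name}: {capacity} bytes, log10 N = {math.log10(capacity):.2f}")
# L1: 65536 bytes, log10 N = 4.82
# L2: 524288 bytes, log10 N = 5.72
# L3: 6291456 bytes, log10 N = 6.80
```

Changes in measured performance would be expected near these three positions on the axis, as the working set spills from one level to the next.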

