Published by Eliezer Sopp. Modified over 9 years ago.
1
Intel Core 2 Quad CPU Q6700 @ 2.66 GHz
L2 Cache: 8 MB (4 MB shared per pair of cores)
L1 Cache: 128 KB instruction + 128 KB data in total (32 KB + 32 KB per core)
L3 Cache: none
CPU Frequency: 2.66 GHz
Bus Speed: 1.066 GHz effective (FSB = Front Side Bus; 266 MHz clock, multiplier = 10)
Code Name: Kentsfield (Xeon or not? Not on my machine?)
Sam Williams's diagram: Clovertowns are marketed as Xeons
2
One thread (8.87/2.66 = 3.33 flops/cycle)
>> maxNumCompThreads(1);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.4818
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.4852
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.5231
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.6101
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.8097
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.8310
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 8.8702
Slowest (n=1000) / fastest (n=5000): 8.48/8.87 = 0.95
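The rate expression used in these sessions, (2*n^3)/(t*1e9), counts n^3 multiplies and n^3 adds for an n-by-n product and divides by the elapsed time. A minimal Python sketch of the same bookkeeping follows; the function names and the 0.25 second timing are made up for illustration, and only the 2.66 GHz clock comes from the slides.

```python
def gflops(n: int, seconds: float) -> float:
    """GFLOPS of an n-by-n matrix product, counting 2*n^3 flops."""
    return (2 * n**3) / (seconds * 1e9)


def flops_per_cycle(rate_gflops: float, clock_ghz: float = 2.66) -> float:
    """Convert a GFLOPS rate into flops per clock cycle."""
    return rate_gflops / clock_ghz


# Hypothetical timing: a 1000-by-1000 product that took 0.25 seconds.
rate = gflops(1000, 0.25)
print(rate, flops_per_cycle(rate))
```

Dividing by the clock rate is what turns the best one-thread figure, 8.87 GFLOPS, into the 3.33 flops/cycle in the slide title.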
3
Two threads (17.11/2.66 = 6.43 flops/cycle)
>> maxNumCompThreads(2);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 14.8793
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 15.8802
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 15.5001
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 16.3604
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 16.5596
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 16.3035
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 16.8308
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 16.8309
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 17.0555
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 17.0995
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 17.0704
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 17.1110
Slowest (n=1000) / fastest (n=6000): 14.8793/17.111 = 0.86
4
Four threads (29.59/2.66 = 11.1 flops/cycle)
>> maxNumCompThreads(4);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 23.9690
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 25.4798
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 25.8126
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 28.0110
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 28.0495
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 29.3411
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 29.5863
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans = 29.1100
Slowest (n=1000) / fastest (n=6000): 23.969/29.5863 = 0.81
5
Summary
Threads:                               1      2      4
Maximum Gflops:                        8.87   17.11  29.59
Maximum flops/cycle:                   3.33   6.43   11.1
Maximum flops/cycle/thread:            3.33   3.21   2.78
Minimum (n=1000) / maximum (n=5000 or 6000): 0.95   0.86   0.81
These runs are in double precision (MATLAB's default). All of this is consistent with each core being able to do 2 multiplies and 2 adds per cycle in double precision (4 and 4 in single precision), but without enough memory bandwidth to keep the processors going at full capacity.
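The summary figures can be recomputed from the best rate seen at each thread count. The sketch below does so in Python. The measured rates are copied from the sessions above. The peak of 4 double-precision flops per core per cycle is an assumption about the Core 2 SSE units, not something the measurements establish, and the helper name is made up.

```python
CLOCK_GHZ = 2.66
# Assumption: peak double-precision flops per core per cycle (2 mults + 2 adds).
ASSUMED_PEAK_PER_CORE = 4.0

# Best measured GFLOPS for 1, 2 and 4 threads, taken from the sessions.
best_gflops = {1: 8.8702, 2: 17.1110, 4: 29.5863}


def scaling_row(threads: int, rate: float, one_thread_rate: float) -> dict:
    """Derive per-cycle, per-thread and speedup figures from one GFLOPS rate."""
    per_cycle = rate / CLOCK_GHZ
    per_thread = per_cycle / threads
    return {
        "flops_per_cycle": per_cycle,
        "per_thread": per_thread,
        "speedup": rate / one_thread_rate,
        "fraction_of_peak": per_thread / ASSUMED_PEAK_PER_CORE,
    }


rows = {p: scaling_row(p, r, best_gflops[1]) for p, r in best_gflops.items()}
for p, row in rows.items():
    print(p, {k: round(v, 2) for k, v in row.items()})
```

The speedup column makes the falling per-thread rate easier to see: four threads deliver well under four times the one-thread rate.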
6
Matrix Add
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans = 12.3890
>> maxNumCompThreads(4);
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans = 12.2825
Conclusion: it takes about 12 cycles per 8-byte read or write, and the thread setting makes no difference,
i.e. in one cycle (1/12) of 8 bytes moves.
In one second we have (2.66*1e9)*(1/12)*8 bytes = about 1.7 GB/second (seems slow!)
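The step from "12 cycles per access" to a bandwidth figure is short enough to check directly. This Python sketch repeats it under the slide's own assumptions: each element costs one 8-byte read and one 8-byte write, and the clock runs at 2.66 GHz. The function names are made up for illustration.

```python
def cycles_per_access(n: int, seconds: float, clock_hz: float = 2.66e9) -> float:
    """Cycles per 8-byte access for c = a + 0 on an n-by-n matrix.

    Counts one read and one write per element, i.e. 2*n^2 accesses.
    """
    return (clock_hz * seconds) / (2 * n**2)


def bandwidth_gb_per_s(cycles: float, clock_hz: float = 2.66e9,
                       bytes_per_access: int = 8) -> float:
    """Sustained bandwidth implied by a cycles-per-access figure."""
    return clock_hz / cycles * bytes_per_access / 1e9


# 12.3890 is the first measurement from the session above.
print(round(bandwidth_gb_per_s(12.3890), 2))
```

Using the measured 12.39 cycles rather than the rounded 12 is what gives the 1.7 GB/second quoted on the slide.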
7
One can try a model:
Cycles = (reads/writes)*12 + (flops)/(4*p*efficiency)
where p is the number of threads.
But good luck! (I am not sure this accounts for everything that is going on, and maybe one shouldn't decouple the memory starvation from the efficiency. See what you can do with it if you like. I'm disappointed this is so non-predictive.)
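To try the model against the matrix-multiply timings, one has to decide how many reads and writes to charge. The Python sketch below charges 3*n^2 (each of the two inputs read once, the result written once) and sets efficiency to 1. Both choices are assumptions made here, not values from the slides, and the function names are hypothetical.

```python
def model_cycles(accesses: float, flops: float, p: int,
                 efficiency: float = 1.0,
                 cycles_per_access: float = 12.0,
                 flops_per_cycle: float = 4.0) -> float:
    """Cycles = accesses*12 + flops/(4*p*efficiency), as on the slide."""
    return accesses * cycles_per_access + flops / (flops_per_cycle * p * efficiency)


def predicted_gflops(n: int, p: int, efficiency: float = 1.0,
                     clock_hz: float = 2.66e9) -> float:
    """Model prediction for an n-by-n product on p threads."""
    accesses = 3 * n**2   # assumption: read A, read B, write C, once each
    flops = 2 * n**3
    seconds = model_cycles(accesses, flops, p, efficiency) / clock_hz
    return flops / seconds / 1e9


for p in (1, 2, 4):
    print(p, round(predicted_gflops(5000, p), 2))
```

With these choices the memory term is tiny next to the flop term, so the prediction sits close to the flop peak and above every measured rate. That is one way to see why the model is not predictive as it stands.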
8
https://agora.cs.illinois.edu/download/attachments/19925366/a38-mattson.pdf
"As a second point of comparison, consider Intel® Core™ 2 Quad processor CPU running at 2.66 GHz with a thermal design power of 95W (model number Q6700) [Intel2008]. This CPU was manufactured using the same 65 nm process technology as was used for the 80-core Terascale processor. A Core™ 2 core includes two 128 bit wide SIMD FPU that support the SSE3 instructions each of which can retire up to 4 single precision floating point operations per cycle. Hence, the peak performance of this quad core CPU is:
4 core * 8 flop/core * 2.66 GHz = 85.12 single precision GFLOPS
This translates to 0.9 GFLOP/Watt making the 80-core Terascale processors (19.4 GFLOP/W at 0.394 TFLOP) over 20 times more power efficient than a more traditional “big core” multicore CPU."
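The arithmetic in this passage is easy to reproduce. The Python sketch below uses only the numbers quoted above (4 cores, 8 single precision flops per core per cycle, 2.66 GHz, 95 W, 19.4 GFLOP/W); the function name is made up for illustration.

```python
def peak_gflops(cores: int, flops_per_core_per_cycle: float, clock_ghz: float) -> float:
    """Peak rate = cores * flops per core per cycle * clock in GHz."""
    return cores * flops_per_core_per_cycle * clock_ghz


sp_peak = peak_gflops(4, 8, 2.66)        # single precision peak of the Q6700
gflops_per_watt = sp_peak / 95           # thermal design power of 95 W
terascale_ratio = 19.4 / gflops_per_watt

print(round(sp_peak, 2), round(gflops_per_watt, 2), round(terascale_ratio, 1))
```

Halving the flops per core to 4 gives the corresponding double precision peak, which is the figure to compare with the MATLAB timings.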
9
Wikipedia: The Kentsfields comprise two separate silicon dies (each equivalent to a single Core 2 duo) on one MCM. [30] This results in lower costs but lesser share of the bandwidth from each of the CPUs to the northbridge than if the dies were each to sit in separate sockets as is the case for example with the AMD Quad FX platform.
10
Wikipedia: The multiple cores of the Kentsfield most benefit applications that can easily be broken into a small number of parallel threads (such as audio and video transcoding, data compression, video editing, 3D rendering and ray-tracing). To take a specific example, multi-threaded games such as Crysis and Gears of War which must perform multiple simultaneous tasks such as AI, audio and physics benefit from the quad-core CPUs. [35] In such cases, the processing performance may increase relative to that of a single-CPU system by a factor approaching the number of CPUs. This should, however, be considered an upper limit as it presupposes the user-level software is well-threaded. To return to the above example, some tests have demonstrated that Crysis fails to take advantage of more than two cores at any given time. [36] On the other hand, the impact of this issue on broader system performance can be significantly reduced on systems which frequently handle numerous unrelated simultaneous tasks such as multi-user environments or desktops which execute background processes while the user is active. There is still, however, some overhead involved in coordinating execution of multiple processes or threads and scheduling them on multiple CPUs which scales with the number of threads/CPUs. Finally, on the hardware level there exists the possibility of bottlenecks arising from the sharing of memory and/or I/O bandwidth between processors.
I read this as: you might hopefully get fourfold speedups, but some people say you might only get twofold, it all depends, and nobody really seems to know for sure.
11
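Theoretical Memory Bandwidth
(Clock Frequency) * (Data Path Width) * (Transfers per clock cycle)
The FSB is "quad-pumped": four transfers per clock (perhaps two around the clock rise and two around the clock fall?). The 1.066 GHz bus speed is the rate after quad-pumping, so the clock itself is about 266 MHz:
(0.2665 GHz) * (8 bytes) * (4) = 8.5 GB/sec
(Multiplying 1.066 GHz by 4 again would count the quad-pumping twice and give about 34 GB/sec.)
Sam Williams says 10.6 or 21.3 GB/sec on Clovertown.
I see 1.7??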
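A small Python sketch of this bandwidth estimate, assuming the quoted bus speed is the effective transfer rate (millions of transfers per second, quad-pumping already included) and that the bus is 8 bytes wide. The function name is made up. The 1333 figure is the transfer rate that reproduces the 10.6 GB/sec attributed to Sam Williams for Clovertown, and is an inference here rather than a number from the slides.

```python
def fsb_bandwidth_gb_per_s(effective_mtransfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak front side bus bandwidth = transfers per second * bytes per transfer."""
    return effective_mtransfers_per_s * 1e6 * bus_bytes / 1e9


q6700_peak = fsb_bandwidth_gb_per_s(1066)   # the bus speed quoted for the Q6700
faster_bus = fsb_bandwidth_gb_per_s(1333)   # assumption: a 1333 MT/s bus

measured = 1.7                              # GB/sec from the matrix add timing
print(round(q6700_peak, 2), round(faster_bus, 2), round(measured / q6700_peak, 2))
```

Under these assumptions the matrix add reaches only about a fifth of the bus peak, which is still a large gap but far smaller than a comparison against 32 GB/sec suggests.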
12
SSE: Streaming SIMD Extensions
Cores have 128-bit registers (eight of them in 32-bit mode, sixteen in 64-bit mode) that allow four single precision or two double precision operations per instruction.
See: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Especially packed add ADDPS and packed multiply MULPS (single precision; ADDPD and MULPD are the double precision versions).
See:
http://developer.intel.com/software/products/college/ia32/strmsimd/simd.htm
http://www.cortstratton.org/articles/OptimizingForSSE.php
http://www.cortstratton.org/articles/HugiCode.html