Published by Rosa Gray. Modified over 9 years ago.
1
Multicore – The Future of Computing
Chief Engineer Terje Mathisen
2
Moore’s Law
«The number of transistors we can put on a chip will double every two years»
– Originally from 1965, modified in 1975
– Up to around the turn of the century this meant a doubling in performance every 18 months.
– Power has become the worst problem.
– Bipolar transistors -> NMOS -> CMOS -> (lots of tweaks) -> 3D
– Voltage scaling
– Today, leakage current is a limiter
– Even CMOS transistors leak when they get really tiny
3
Moore's Law has held for 40 years
Haswell: 5.6e9 transistors, 22 nm
4
What could we use all the transistors for?
Increase scalar performance
Increasingly complicated CPUs
Multiple cycles/instruction:
  – 8088 (29K)
  – 80286 (134K)
  – 80386 (275K)
Pipeline, one cycle/instruction
  – 80486 (1.2M)
Superscalar: Multiple instructions/cycle
  – Pentium (3.1M) (two in-order pipelines)
Out of order/superscalar/multithreaded
  – Pentium Pro/Pentium III/Pentium4/Core/etc (5.5M -> 5.6B)
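As a rough sanity check on the transistor counts above, the sketch below counts how many doublings separate the 8088 (29K) from the 5.6B part and converts that to years at one doubling every two years. The function name and constants are illustrative and not from the presentation.

```python
import math


def doublings(start_count: float, end_count: float) -> float:
    """Number of doublings needed to grow from start_count to end_count."""
    return math.log2(end_count / start_count)


# Transistor counts as quoted on the slides.
TRANSISTORS_8088 = 29e3
TRANSISTORS_HASWELL = 5.6e9

n = doublings(TRANSISTORS_8088, TRANSISTORS_HASWELL)
years = n * 2  # assumption: exactly one doubling every two years

print(f"{n:.1f} doublings, about {years:.0f} years at two years per doubling")
```

The result, between 17 and 18 doublings, is in line with the decades of growth the slides describe.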
5
Pentium4 had the fastest pipeline ever
3 GHz clock
  – Inner core ran at 2x, i.e. 6 GHz
  – Only simple instructions, like ADD/SUB/AND/OR
Guessing at branches
  – if (a > b) {...} else {...}
Mistakes were very costly, both in time and power
  – 10 to 200 wasted instructions each time the CPU guessed wrong!
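To get a feel for why wrong guesses are so expensive, here is a toy model, not a description of how the Pentium4 actually behaves. Only the 10 to 200 wasted instructions per wrong guess comes from the slide. The branch frequency (one instruction in five) and the misprediction rate (5%) are assumed values chosen purely for illustration.

```python
def wasted_fraction(branch_fraction: float, miss_rate: float,
                    wasted_per_miss: float) -> float:
    """Fraction of all issued instructions that end up discarded.

    branch_fraction: share of useful instructions that are branches (assumed)
    miss_rate:       share of branches that are guessed wrong (assumed)
    wasted_per_miss: instructions discarded per wrong guess (slide: 10 to 200)
    """
    wasted_per_useful = branch_fraction * miss_rate * wasted_per_miss
    return wasted_per_useful / (1.0 + wasted_per_useful)


for penalty in (10, 50, 200):
    share = wasted_fraction(branch_fraction=0.2, miss_rate=0.05,
                            wasted_per_miss=penalty)
    print(f"{penalty:3d} wasted instructions per miss -> "
          f"{share:.0%} of issued work discarded")
```

Under these assumptions the discarded share grows quickly with the penalty, which is the argument for shorter pipelines.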
6
Core 2: Multiple complicated cores
Running two individual processes in parallel causes fewer wasted instructions and leads to more power-efficient computing.
  – Shorter pipelines are better at branching
  – Object-oriented programming uses many branches
Every two years: Double the number of cores
  – Core 2 -> Core 2 Duo -> Core 2 Quad
  – Latest server CPUs have up to 18 cores, using 5.6e9 transistors
7
Vector operations
SIMD: Work with more data in each instruction
  – SSE uses 16-byte vectors (4 float/2 double)
  – AVX uses 32-byte vectors (8 float/4 double)
Each core can do two SSE operations/cycle
  – Quad-core CPU with 4*2*4 = 32 fp operations/cycle
  – 64 Gflops @ 2 GHz, 100 Gflops @ 3+ GHz
  – High-end AVX implementation doubles this, 12-18 cores add another multiplier
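The arithmetic on this slide can be reproduced with a short sketch: lanes per vector follow from the vector width divided by the element size, and peak throughput is cores x vector operations per cycle x lanes x clock. The helper names are illustrative, and the result is a theoretical peak, not measured performance.

```python
def lanes(vector_bytes: int, element_bytes: int) -> int:
    """How many elements fit in one SIMD vector."""
    return vector_bytes // element_bytes


def peak_gflops(cores: int, vector_ops_per_cycle: int,
                lanes_per_vector: int, clock_ghz: float) -> float:
    """Theoretical peak in Gflops: every lane does one fp operation per vector op."""
    return cores * vector_ops_per_cycle * lanes_per_vector * clock_ghz


FLOAT_BYTES, DOUBLE_BYTES = 4, 8

sse_float = lanes(16, FLOAT_BYTES)   # 16-byte SSE vector of floats
avx_float = lanes(32, FLOAT_BYTES)   # 32-byte AVX vector of floats

# Quad-core CPU, two vector operations per core per cycle, 2 GHz clock.
quad_sse = peak_gflops(4, 2, sse_float, 2.0)
quad_avx = peak_gflops(4, 2, avx_float, 2.0)

print(f"SSE: {sse_float} floats/vector, {quad_sse:.0f} Gflops peak")
print(f"AVX: {avx_float} floats/vector, {quad_avx:.0f} Gflops peak")
```

The same function with a 3 GHz clock gives 96 Gflops for SSE, which the slide rounds to 100.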
8
Other CPU architectures
Sun Sparc
  – 2005: Niagara: 8 cores, 4 threads/core, low clock speed
  – Multithreaded server workloads
Oracle Sparc M7
  – 2014: 32 cores, 8 threads/core
  – Optimized for DB operations
10
Other CPU architectures
Sparc
  – Multithreaded server workloads
IBM/Sony Cell
  – 2005: PlayStation 3
  – 1 PPE + 7-8 SPE cores, each capable of 25 Gflops
  – Works on 16-byte vectors (4 float/2 double)
  – ~200 Gflops SP -> 14 Gflops DP
  – Special HPC version with 100+ Gflops DP
12
Other CPU architectures
Sun Sparc
IBM/Sony Cell
GPGPU
  – Graphics cards with semi-general fp pipelines
13
Intel Larrabee/Many Integrated Core/Xeon Phi
Project started 2003
  – Architecture review Oct 2006
Announced 2007
  – 64-bit
  – x86 compatible
Similar to Pentium
  – Dual in-order pipelines
  – More flexible mixing of instructions
Special graphics instructions, incl. scatter/gather
  – S/G are very useful for HPC applications
14
LRB cont.
Even longer vectors
  – Works with 64-byte blocks (16 float/8 double)
  – Combined FMUL/FADD instruction
More than 50 cores on first product
  – 4 threads/core
  – 16x2x51 = 1632 flops/cycle
  – 1.3 GHz core -> 2 Tflops (Seismic cluster is ~10 Tflops)
First product will be graphics coprocessor card
Will use the same 125 watts (max) as a single P4
New name: Many Integrated Core (MIC)/Knights Corner/Xeon Phi
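The peak figure for the first product follows from the numbers on this slide: 16 single-precision lanes per 64-byte vector, 2 flops per lane for the combined multiply/add, and 51 cores, so 16 x 2 x 51 works out to 1632 flops per cycle. The sketch below spells this out, with illustrative constant names, and compares the result with the ~10 Tflops seismic cluster. It is a theoretical peak only.

```python
VECTOR_BYTES = 64      # vector width from the slide
FLOAT_BYTES = 4        # single precision
FLOPS_PER_FMA = 2      # combined FMUL/FADD: one multiply plus one add per lane
CORES = 51
CLOCK_GHZ = 1.3
CLUSTER_TFLOPS = 10    # approximate seismic cluster figure from the slide

lanes = VECTOR_BYTES // FLOAT_BYTES
flops_per_cycle = lanes * FLOPS_PER_FMA * CORES
peak_tflops = flops_per_cycle * CLOCK_GHZ / 1000

# How many such cards would match the cluster on paper.
cards_for_cluster = CLUSTER_TFLOPS / peak_tflops

print(f"{lanes} lanes x {FLOPS_PER_FMA} x {CORES} cores = {flops_per_cycle} flops/cycle")
print(f"Peak: {peak_tflops:.2f} Tflops at {CLOCK_GHZ} GHz")
print(f"Cards needed to match the cluster on paper: {cards_for_cluster:.1f}")
```

On paper, roughly five such cards match the cluster, which is the basis for the deskside-workstation claim later in the talk.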
15
Future directions
Heterogeneous CPUs:
  – Maybe 2-4 Core2 + 20-60 Larrabee?
  – Run single-threaded applications on Core, multi-threaded/vector-based on Xeon Phi. (2013 - Fastest computer in the world: Ivy Bridge+Phi)
  – OS threads without fp operations can also use simple in-order LRB cores
Power-efficient processing
  – Both laptops/mobiles and servers are limited by power use
Simpler/slower cores with mostly in-order processing can use 80% less power
16
Conclusion
Multicore will give us an extra factor of ~10 increase in fp processing power
  – Most current forms of simulation become possible on a single workstation with 2-4 CPUs
MIPS/Watt is crucial
  – Easier to make many simpler cores than one complex
  – Less wasted work
  – Server farms and laptops
17
What are the consequences?
High performance requires multithreading
  – Currently this is mostly server workloads
  – Games are next, today they use 2-4 threads
High performance requires vector programming
  – Can we work on 4, 16 or more variables simultaneously?
Many programs (and most programmers) don't care!
  – If it is fast enough today, it will surely be OK in the future as well?
Not necessarily, because
  – Data grows exponentially!
18
HPC applications
Seismic processing
  – PC with
      – Complete model of small fields
      – Reduced resolution test runs for larger fields
  – Deskside server with nearly the same capability as current 2048-cpu seismic cluster
Crash simulation
  – Everything could fit on a laptop in 2012-2015
Financial modelling, incl. Monte Carlo risk analysis
Dynamic global process control
19
From current Unix cluster…
20
… to deskside workstation in 5 years?
21
Summary
Multicore will give us an extra factor of ~10 increase in fp processing power
Moore's law will go on
MIPS/Watt is crucial
Evry is at the leading edge of this development
22
Thank you!
24
Do we have the required programmers?
Will we get them from the universities in the future?
  – Possibly
  – Today, most graduates learn only Java, which isn't very suitable
There's hope:
  – LRB on the NTNU CS curriculum today
Similar situation at most universities
Can our standard vendors deliver updated SW?
  – Eclipse, GeoFrame, Sismage, Ansys, Finite Element
25
Smaller transistors & slightly larger chips