cache efficiency and parallelization of numerical simulations

Presentation on theme: "cache efficiency and parallelization of numerical simulations"— Presentation transcript:

1 Space Filling Curves: cache efficiency and parallelization of numerical simulations
11/15/2018

2 Agenda
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

3 Computer architecture
[Diagram: memory, CPU, and input/output unit, with address, instruction, and data paths between them]

4 Command execution
- Read instruction (bus access)
- Interpret instruction
- Read operands (bus access)
- Calculate / shift / …
- Write results back (bus access)
=> The memory bus is the bottleneck

5 Cycle times of CPU vs. Memory I
[Chart: CPU vs. memory cycle times; annotated factors: 2.5 and > 200]

6 Cycle times of CPU vs. Memory II
- CPU and memory cycle times have developed differently
- Main memory access wastes CPU cycles
- Fast memory is available, but it is too expensive and too small
Solution: keep data in different memories
- Try to keep frequently used data in fast memory
- Use a memory hierarchy: big slow memory, small fast memory
- Fast memory is small => easy access
- Big memory needs a lot of management effort, which makes it slower

7 Memory hierarchy
Level               Available size   Speed
Registers           ~1KB             ~0.5ns
Cache (L1, L2, L3)  16KB – 4MB       0.5-25ns
Main memory         ~1GB             ~80ns
Disk memory         ~1TB             ~5ms
Archive memory      >> 1TB           >> 1s

8 Caching
- Keep a copy of recently used data in quickly accessible memory (cache)
- CPU <-> Cache <-> Memory
- Also: web surfer <-> HTTP proxy <-> web server
- Exploits the locality properties of programs
  - Temporal locality: recently used variables will probably be used again soon
  - Spatial locality: memory locations near recently used ones are likely to be used soon
- Cost: DDR RAM ~3€/MB, L3 cache ~1000€/MB (Intel Xeon)

9 Cache efficiency
- Cache hit: the memory access can be served from the cache
- Cache miss: the requested data is not in the cache and must be fetched from memory
- Cache efficiency: ratio between cache hits and misses
- Aim: >> 95% cache hits
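
The effect of locality on the hit rate can be made concrete with a small simulation. The following is a minimal sketch, not a model of any real processor: it assumes a fully associative cache with LRU replacement, and the names `hit_rate`, `num_lines`, and `line_size` are hypothetical. It measures efficiency as hits divided by total accesses and compares a traversal with good spatial locality against a strided one.

```python
from collections import OrderedDict


def hit_rate(addresses, num_lines=64, line_size=16):
    """Fraction of accesses served by a fully associative LRU cache (assumed model)."""
    cache = OrderedDict()  # keys are cache-line tags, ordered from oldest to newest
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        tag = addr // line_size  # all addresses in one line share a tag
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)  # mark the line as most recently used
        else:
            cache[tag] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


n = 256
# Row-wise sweep: consecutive addresses, good spatial locality.
row_major = [i * n + j for i in range(n) for j in range(n)]
# Column-wise sweep over the same field: stride of n between accesses.
col_major = [i * n + j for j in range(n) for i in range(n)]

print(hit_rate(row_major), hit_rate(col_major))
```

Under these assumptions the row-wise sweep misses only once per cache line, while the column-wise sweep touches more lines than the cache holds before returning to any of them.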

10 Contents
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

11 Numerical algorithms
- Discretizing a PDE leads to a linear equation system (LES) Au = b
- The LES is solved by an iterative algorithm such as Jacobi or Gauss-Seidel
- Repeated evaluation of a 5-point stencil on the two-dimensional field u
- Assume a large field u in main memory
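
A repeated 5-point stencil evaluation might look like the sketch below. It assumes the Laplace equation on a square grid with fixed boundary values as the model problem, which the slides do not specify; `jacobi_sweep` and `solve` are hypothetical names. The plain Jacobi update for this model problem reads the four neighbours and does not read the centre value.

```python
def jacobi_sweep(u):
    """Return a new field after one Jacobi sweep; boundary values stay fixed."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, rows - 1):
        for i in range(1, cols - 1):
            # In a flat row-major array, u[j-1][i] and u[j+1][i] would each be
            # a full row away from u[j][i].
            new[j][i] = 0.25 * (u[j - 1][i] + u[j][i - 1] + u[j][i + 1] + u[j + 1][i])
    return new


def solve(u, sweeps):
    """Apply a fixed number of Jacobi sweeps (no convergence check in this sketch)."""
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u


# 5x5 field: boundary fixed at 1.0, interior starts at 0.0.
field = [[1.0] * 5 for _ in range(5)]
for j in range(1, 4):
    for i in range(1, 4):
        field[j][i] = 0.0

result = solve(field, 200)
```

Every sweep walks the whole field, so for a field much larger than the cache each sweep streams the data through the memory hierarchy again.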

12 5-point stencil
Field u (assume n = 5000)
- Calculating û_{5,4} needs: u_{5,3}, u_{4,4}, u_{5,4}, u_{6,4}, u_{5,5}
- At memory positions: 10005, 15004, 15005, 15006, 20005
Memory needs:
- u: 200MB >> cache size
- 3 lines of u: 120KB > L1 cache
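
The positions and sizes on this slide can be reproduced with a few lines of arithmetic. This sketch assumes 1-based, row-by-row storage and 8-byte values; both are assumptions chosen because they match the numbers given, and `position` is a hypothetical helper name.

```python
def position(i, j, n):
    """1-based position of u_{i,j} when the field is stored row by row, n values per row."""
    return (j - 1) * n + i


n = 5000
stencil = [(5, 3), (4, 4), (5, 4), (6, 4), (5, 5)]
positions = [position(i, j, n) for i, j in stencil]

BYTES_PER_VALUE = 8  # assumption: double precision
field_bytes = n * n * BYTES_PER_VALUE
three_rows_bytes = 3 * n * BYTES_PER_VALUE

print(positions, field_bytes, three_rows_bytes)
```

The three neighbours in the same row are adjacent in memory, while the neighbours above and below are n positions away, which is why three full rows must fit in cache for the stencil to run without misses.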

13 Benchmark: 5-point stencil
Tested on a Pentium IV Xeon:
- 3D problem with 1.25·10^8 elements
- 512 KBytes L2 cache (128 bytes per line)
- L2 cache miss rate: 15.00%

14 Contents
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

15 FEM using SFC
- Now calculate element (cell) based values and distribute them onto the nodes of the grid
- Read and write only the few top elements of a stack; these should be in cache and are used several times
[Figure labels: 1, 2, 3, 4, …, n]

16 FEM using SFC II
- Elements are stored in caches according to the number of accesses
[Figure labels: 1, 2, 3, 4, …, n]

17 FEM using SFC III
- Cache 4 stores nodes which have been accessed from all surrounding cells
- These nodes can be stored on hard disk, as they won't be needed again during the current iteration
[Figure labels: 1, 2, 3, 4, …, n]

18 FEM using SFC IV
- Now the observed points are "covered" by other nodes
- The distance from the top of the stack is important
- How big do the stacks grow?
[Figure labels: 1, 2, 3, 4, …, n]

19 SFC
- By construction, SFCs fill squares (cubes)
- Covered areas stay compact
- Borders (surfaces) tend to be small
Worst case: [figure]

20 Results
Number of nodes on the border is small
- n: number of nodes in one dimension
- #nodes in grid: n^d
- #nodes in border: O(n^(d-1))
Stacks stay small
- Elements are always taken from the top of the stack
- The fewer elements lie above an element, the more likely it is to be used soon
- Elements near the top of the stack are used several times within a short period
- This can be used to implement the stacks efficiently

21 Implementation of SFC: stacks
Parts of a stack:
- Top: used soon; stays in cache / registers
- Center: used in the near future; should be loaded into main memory
- Bottom: will be used in the "far" future; can be stored on disk
[Figure: top, center, and bottom of the stack mapped onto the memory hierarchy (registers, cache L1/L2/L3, main memory, disk memory, archive memory)]

22 Implementation of Peano: stacks
Tested on a Pentium IV Xeon:
- 3D problem with 10^8 elements
- 512 KBytes L2 cache (128 bytes per line)
- L2 cache miss rate: ~0.01%

23 Contents
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

25 Parallelization of FEM I
- Stored nodes contain the calculated contributions of the neighbouring elements (cells)
- The grid can be irregular and adaptively refined

26 Parallelization of FEM II
Requirements for a partition:
- Handle adaptively (not regularly) refined grids
- Load balancing (same number of cells for each process)
- Minimal border size (minimal communication)
[Figure: grid partitioned among process 1, process 2, process 3]

27 Partition algorithms
NP-complete => use of heuristics:
- Scheduling
- Partition processor
- Recursive spectral bisection
- Recursive coordinate bisection
- Inertial recursive bisection
- Space filling curves
Scheduling:
- If one processor has finished its work, it can help the others finish theirs
- If one processor often helps another, it adds part of that workload to its own
Partition processor:
- One processor determines the optimal distribution while the others compute the iteration step
- This master processor takes into account that redistribution consumes time and must therefore be worthwhile
Recursive spectral bisection:
- Graph algorithm, with the elements as nodes of the graph
- The graph is divided using the Fiedler vector
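
Of the heuristics listed, recursive coordinate bisection is the simplest to illustrate. The slide only names it, so the following is a generic sketch under stated assumptions rather than the presenter's method: cells are represented by coordinate tuples, each cut is made along the coordinate with the largest extent, there are at least as many points as parts, and `rcb` is a hypothetical name.

```python
def rcb(points, parts):
    """Split points into `parts` groups of near-equal size by recursive coordinate bisection."""
    if parts == 1:
        return [points]
    dims = len(points[0])
    # Cut along the coordinate direction in which the point set is widest.
    axis = max(
        range(dims),
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = parts // 2
    # Give the left side a share of points proportional to its share of parts.
    cut = len(ordered) * left_parts // parts
    return rcb(ordered[:cut], left_parts) + rcb(ordered[cut:], parts - left_parts)


cells = [(x, y) for x in range(8) for y in range(8)]
four = rcb(cells, 4)
three = rcb(cells, 3)
print([len(g) for g in four], [len(g) for g in three])
```

The sketch balances the number of cells per part but does nothing explicit to keep borders small, which is one of the requirements stated for a partition.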

28 Contents
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

29 Parallelization using SFC I
- The SFC fills the adaptively refined grid
- Cutting the one-dimensional SFC into pieces of equal length can be done easily in O(n)
- SFC partitions tend to have small surfaces (as seen before)
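
Cutting the curve into equal pieces can be sketched as follows. The slides use a Peano curve on an adaptive grid; this example swaps in a Hilbert curve on a regular 2^k x 2^k grid because its index is short to compute, so it illustrates the idea only. `hilbert_index` and `partition_along_curve` are hypothetical names. The sort is there only to bring the cells into curve order; once they are in that order, the cut itself is a single linear pass.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_along_curve(cells, n, parts):
    """Order cells along the curve, then cut the sequence into `parts` contiguous pieces."""
    ordered = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
    bounds = [len(ordered) * k // parts for k in range(parts + 1)]
    return [ordered[bounds[k]:bounds[k + 1]] for k in range(parts)]


n = 8
cells = [(x, y) for x in range(n) for y in range(n)]
pieces = partition_along_curve(cells, n, 5)
print([len(piece) for piece in pieces])
```

Because consecutive cells along the curve are always grid neighbours, each piece is a connected region, which is what keeps the borders between processes compact.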

30 Parallelization using SFC II
[Figure: grid cut along the SFC into partitions for process 1, process 2, process 3, process 4, process 5]

31 Parallelization using SFC III
- On entering a process's area, border values must be in the correct stack
- How can border values be brought into the correct stack without running along the whole curve?

32 Parallelization using SFC IV
- Finest level at the border
- Coarse level for the rest

33 Contents
- Motivation
- Caching
- Parallelization
- Numeric without SFC
- Numeric using SFC / stack architecture
- Parallelization
- Partitioning without SFC
- Partitioning using SFC
- Repartitioning due to adaptivity

34 Repartitioning
- During iteration, based on error estimates that use extensions of element values, the algorithm adjusts the refinement locally
- On-the-fly repartitioning
- Obeys the same requirements as partitioning: small borders, same number of cells for all processes, able to handle adaptively refined grids
- Fast
- Uses a small amount of memory
- Distributed in parallel
- Minimizes data transfer

35 Repartition algorithms
Scratch-remap algorithms
- Idea: compute a new partition of the area, then remap it intelligently onto the old partition to minimize data transfer
- Can change the initial partition completely
- Lots of data transfer needed
- Fast
- Useful results when repartitioning after each adaptive refinement
Diffusion-based repartitioning
- Exchanges workload with direct neighbours
- Only appropriate when refinements are globally distributed and only slight refinements preceded the repartitioning
=> Good results, even when repartitioning is done seldom
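
The diffusion idea, exchanging workload with direct neighbours, can be sketched on a chain of processes. This is a simplified illustration under stated assumptions: processes are arranged in a line, workloads are treated as real numbers rather than whole cells, and the exchange factor `alpha` as well as the names `diffusion_step` and `balance` are hypothetical.

```python
def diffusion_step(loads, alpha=0.25):
    """One exchange round: each pair of neighbours evens out part of its load difference."""
    new = list(loads)
    for p in range(len(loads) - 1):
        # Transfers are computed from the old loads, so the pair order does not matter.
        transfer = alpha * (loads[p] - loads[p + 1])
        new[p] -= transfer
        new[p + 1] += transfer
    return new


def balance(loads, steps):
    """Apply a fixed number of exchange rounds."""
    for _ in range(steps):
        loads = diffusion_step(loads)
    return loads


start = [100.0, 0.0, 0.0, 0.0]  # all work initially on the first process
after_one = diffusion_step(start)
balanced = balance(start, 200)
print(after_one, balanced)
```

Load moves only one neighbour further per round, so a strongly localized imbalance takes many rounds to spread, which fits the slide's restriction to slight, globally distributed refinements.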

36 Focus of research
Repartitioning a field which is:
- Partitioned by SFC
- Traversed using the presented stack algorithms
Problems:
- Adjustment is only possible in one dimension
- How to reorganize the stacks
- Almost certainly several problems not yet discovered

37 Any questions?

