A cache-aware algorithm for PDEs on hierarchical data structures

Frank Günther, June 2004
Welcome

Overview

Motivation of the work and requirements
Space-filling curves
Mathematical ingredients
The Peano curve and stacks
Algorithms
Numerical results and performance
Summary and outlook

Motivation: combine numerical efficiency and hardware efficiency.
Space-filling curves are known in numerics as a tool for defining orderings. Griebel/Zumbusch group: index computations for data stored in arrays, quasi-optimal load balancing for parallelization.
Mathematical ingredients: which numerical requirements do we place on modern programs?
Peano curve and stacks: how does one arrive at stacks as the data structure?
Algorithms: how can all of this be combined into an efficient program?
Some results …
Summary: What have we achieved? Where is the journey heading? How should the approach be classified?

Motivation

Efficiency in memory
Multigrid solver for PDEs
Parallelization
Hierarchical data structures
Amount of data
Adaptivity

Goals for developing a state-of-the-art PDE solver:
Mathematical and numerical goals:
  Multigrid for resolution-independent convergence
  Adaptivity to avoid unnecessary grid points
  Both lead to a hierarchical data representation
Hardware-oriented goals:
  Efficient memory access
  Amount of data per variable as small as possible
  Parallelization should be possible
The goals conflict in known implementations:
  A hierarchical data representation conflicts with the amount of data (pointer-oriented structures) and with efficient memory access (scattered coarse-level points)
  Known implementations are often memory-bound
Idea:
  Replace explicit (pointer-based) tree structures with linear data structures
  Strictly linear program and data flow on units with small mutual dependence, hence spatial and temporal locality
  Divide the linear flow into several independent packets for parallelization
  Not new as a whole (Griebel et al.), but our implementation applies the known concepts more consistently

Space-filling curves

G. Cantor: cardinality of manifolds
Is there a continuous and bijective mapping? E. Netto: No!
Search for continuous and surjective mappings
Peano, Hilbert, Sierpinski, Moore, Lebesgue

Cantor: two manifolds of arbitrary but finite dimension have the same cardinality (simplified: number of elements). Example: unit interval and unit square.
Typical question a mathematician would ask: is there …
Netto: the requirement has to be relaxed to surjective and continuous.
The first to find such a mapping was Giuseppe Peano …
A space-filling curve is the limit of a recursively defined finite approximation.
The finite approximation evolves from repeated insertion of a Leitmotiv (as for fractals) into the sub-squares under rotations and reflections.
Finite approximations are interesting because:
  The domain is completely partitioned into square cells on every level
  Every cell on every level is visited in a linear and predictable order
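The linear, predictable visiting order of a finite approximation can be made concrete with a short sketch. The Python below is an illustration and not code from the talk: it uses Peano's classical ternary-digit construction to map a position t along the curve to a cell (x, y) of a 3^n x 3^n grid. The names `peano_cell` and `peano_order` are hypothetical.

```python
def peano_cell(t, n):
    """Map curve position t (0 <= t < 9**n) to a cell (x, y) on a 3**n x 3**n grid.

    Uses Peano's ternary-digit rule: a digit d is mirrored to 2 - d whenever
    the sum of the relevant earlier digits is odd.
    """
    # Ternary digits of t, most significant first: t_1, t_2, ..., t_2n.
    digits = []
    for _ in range(2 * n):
        digits.append(t % 3)
        t //= 3
    digits.reverse()

    x = y = 0
    sum_odd = 0   # t_1 + t_3 + ... (digits that feed x)
    sum_even = 0  # t_2 + t_4 + ... (digits that feed y)
    for i in range(n):
        a = digits[2 * i]      # digit contributing to x on this level
        b = digits[2 * i + 1]  # digit contributing to y on this level
        # x digit is mirrored by the parity of the earlier y-feeding digits.
        xd = a if sum_even % 2 == 0 else 2 - a
        sum_odd += a
        # y digit is mirrored by the parity of the x-feeding digits so far.
        yd = b if sum_odd % 2 == 0 else 2 - b
        sum_even += b
        x = 3 * x + xd
        y = 3 * y + yd
    return x, y


def peano_order(n):
    """All cells of the level-n approximation in visiting order."""
    return [peano_cell(t, n) for t in range(9 ** n)]


if __name__ == "__main__":
    for cell in peano_order(1):
        print(cell)
```

On every level, consecutive cells in this order share an edge, which is the property the algorithm relies on when it processes the grid as one linear stream.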

Mathematical ingredients and demands

Finite-element discretization
Squares on Cartesian grids
Multigrid
Hierarchical generating system
(A priori) adaptivity
Embedding of arbitrary geometries in the unit square
Poisson's equation
Stokes' equation
Dirichlet boundary conditions
In this talk: 2D

FEM: just to demonstrate one method.
Squares: the natural choice because of the space-filling curve.
Multigrid: resolution-independent convergence for the ill-conditioned matrices produced by FEM.
Hierarchical generating system: the natural choice in conjunction with multigrid FEM on the multilevel partitioning produced by the space-filling curve. Griebel has shown that multigrid on a hierarchical generating system can be furnished with a standard convergence theory.
Adaptivity: adequate resolution of hot spots or of the boundary at low cost.
Embedding: geometries different from the unit square should be possible. Embedding is rather simple from an algorithmic point of view.
Equations: first applications as a proof of concept. A Stokes solver can be developed quite simply from a program for Poisson's equation.
Boundary conditions: homogeneous Dirichlet conditions are simple to realize. Other boundary conditions are possible because of the well-known theory of FEM.

Peano's curve and stacks

Formation of "lines of points", which are processed monotonically
Lines are conserved for all depths of recursion
Stacks as data structure
"Coloring" depends on the basis:
  Nodal basis on the finest level: 2 colors, 2 stacks
  Hierarchical generating system: 4 colors, 8 stacks

Lines of points and conservation of lines: show on the picture.
Stacks as data structure:
  Linear memory with two methods: push and pop
  Hope for cache efficiency
Coloring:
  Number of stacks as small as possible
  The number of stacks depends only on the dimensionality of the problem, not on the number of grid points
  Recursively definable
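Why a stack fits the "lines of points" can be seen on the coarsest 3 x 3 approximation. The sketch below is an illustration with hypothetical names, not the talk's code. It assumes a serpentine order that runs up the first column of cells and down the second: values attached to the edges shared by the two columns are pushed during the first pass and popped during the second, and the pops arrive in exactly the order in which the curve revisits those edges.

```python
def serpentine_columns(n_rows=3):
    """Cells of two adjacent columns in serpentine order: up column 0, down column 1."""
    up = [(0, y) for y in range(n_rows)]
    down = [(1, y) for y in reversed(range(n_rows))]
    return up + down


def process_shared_line(n_rows=3):
    """Push shared-edge data while going up, pop it while coming back down."""
    stack = []  # plain Python list used as a stack (assumption for brevity)
    log = []
    for col, y in serpentine_columns(n_rows):
        if col == 0:
            # First visit: store the data of the edge shared with column 1.
            stack.append(y)
            log.append(("push", y))
        else:
            # Second visit: the edge needed now is exactly the top of the stack.
            edge_y = stack.pop()
            if edge_y != y:
                raise RuntimeError("stack order does not match visiting order")
            log.append(("pop", edge_y))
    return log, stack


if __name__ == "__main__":
    log, leftover = process_shared_line()
    for entry in log:
        print(entry)
    print("left on stack:", leftover)
```

No index computation is needed to find the data of a revisited edge; the last-in, first-out order of the stack already matches the order of the curve.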

The Algorithm I

Develop "rule sets" for stack access:
  Deterministic
  No unnecessary access
  All kinds of points (inner, outer, on the boundary, hanging) must be covered
Efficient programming of stacks and stack access

Rule sets: efficiency through local determinism, cell-oriented …
Efficient programming:
  No lists with pointers (Kernighan / Ritchie)
  Stacks realized as arrays
  Better locality properties
  Lower complexity in address arithmetic
  The concrete implementation is hidden; programs only see push and pop
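A stack realized as an array, with the implementation hidden behind push and pop, might look like the following minimal sketch. The class name `ArrayStack`, the fixed capacity and the choice of double-precision entries are assumptions made for illustration; the talk does not show its implementation.

```python
from array import array


class ArrayStack:
    """Fixed-capacity stack of doubles stored in one contiguous array."""

    def __init__(self, capacity):
        # One contiguous block of memory, allocated once up front.
        self._data = array("d", [0.0]) * capacity
        self._top = 0  # index of the next free slot

    def push(self, value):
        if self._top == len(self._data):
            raise OverflowError("stack is full")
        self._data[self._top] = value
        self._top += 1

    def pop(self):
        if self._top == 0:
            raise IndexError("pop from empty stack")
        self._top -= 1
        return self._data[self._top]

    def __len__(self):
        return self._top


if __name__ == "__main__":
    s = ArrayStack(4)
    s.push(1.5)
    s.push(2.5)
    print(s.pop(), s.pop(), len(s))
```

The only address arithmetic is incrementing or decrementing one index, and neighbouring entries sit next to each other in memory, which is where the hoped-for locality comes from.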

The Algorithm II

Input: linearized geometry-based tree
Recursive cell-oriented programming
Optimizations for cell types
"OO by hand"

Linearized space tree (important):
  Geometry-based space tree, possibly modified for numerical reasons (adequate resolution of hot spots and boundaries)
  The space tree is produced in a bottom-up process from a color-coded picture describing geometry and boundaries
  Linearized storage in a file
Recursive cell-oriented programming:
  Top-down depth-first traversal of the space tree in every iteration
  The program is completely controlled by the space-filling curve
  Stack access is controlled by the space-filling curve
Optimization for cell types:
  Cell-oriented implementation of the types of points
  Individual implementations for inner cells, boundary cells and outer cells
  Fewer branches for the frequently used inner cells (order n boundary cells, but order n^2 inner cells)
  Duplicated and slightly changed parts of code -> "OO by hand"
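A top-down depth-first pass over a linearized space tree can be sketched as follows. The storage format here is an assumption for illustration only (the talk does not specify its file format): one flag per cell in depth-first order, where 1 means the cell is refined into 3 x 3 = 9 children and 0 means it is a leaf. The function name `traverse` and the callback `visit_leaf` are hypothetical.

```python
def traverse(flags, visit_leaf, pos=0, depth=0):
    """Depth-first pass over a linearized tree; returns the next unread position.

    flags      -- sequence of 0/1 refinement flags in depth-first order (assumed format)
    visit_leaf -- callback invoked with the depth of every leaf cell
    """
    refined = flags[pos]
    pos += 1
    if not refined:
        visit_leaf(depth)
        return pos
    # A refined cell has 3 x 3 children, stored directly after its own flag.
    for _ in range(9):
        pos = traverse(flags, visit_leaf, pos, depth + 1)
    return pos


if __name__ == "__main__":
    # Root is refined; its fifth child is refined once more.
    flags = [1] + [0] * 4 + [1] + [0] * 9 + [0] * 4
    depths = []
    end = traverse(flags, depths.append)
    print(len(depths), "leaves, stream fully consumed:", end == len(flags))
```

The stream is read strictly front to back and never needs a pointer to a child or parent, which is the point of storing the tree in linearized form.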

Results for Poisson's equation

Regular grids

Resolution     Variables   Iter
27 x 27              744     39
81 x 81            7,144      -
243 x 243         65,708      -
729 x 729              -      -
2187 x 2187            -      -
6561 x 6561            -      -

Adaptive grids

Resolution     Variables   % reg. grid   Iter
27 x 27              298         40.05     38
81 x 81            1,140         15.95      -
243 x 243          3,800          5.78      -
729 x 729         13,840          2.3       -
2187 x 2187       36,369          0.67      -
6561 x 6561            -          0.23      -

Figures: input -> grid -> solution.
Tables:
  Typical multigrid behavior
  Up to 48 million variables in 75 minutes
  Similar results for embedded geometries and Stokes' equation
Numerical demands are fulfilled.

Performance

Simulation: prediction ("What happens, and where does it happen?")
Measuring: hardware performance counters, confirmation of the prediction
Cache hit rate in the L2 cache beyond 99.0%

Simulation:
  Cachegrind is used like the Unix time command
  Simulation with a simplified cost model (number of cycles = instructions + 10 * L1 misses + 100 * L2 misses)
  Simulation of several events
  Relation to the source code: what happens how often, and where?
  Important for us: prediction of cycles and cache misses
Measuring:
  Evaluation of specialized registers that count events such as cache misses
  For us: measuring of cycles and cache misses
  Predictions are confirmed (qualitatively)
The major part of the hardware-oriented demands is fulfilled.
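The simplified cost model from the notes can be written down directly. The sketch below is illustrative only: the function names are hypothetical and the counts in the usage example are invented, not measurements from the talk.

```python
def estimated_cycles(instructions, l1_misses, l2_misses):
    """Simplified cost model: cycles = instructions + 10 * L1 misses + 100 * L2 misses."""
    return instructions + 10 * l1_misses + 100 * l2_misses


def hit_rate(accesses, misses):
    """Fraction of accesses served by the cache."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return 1.0 - misses / accesses


if __name__ == "__main__":
    # Invented counts, chosen only to show how the terms are weighted.
    print(estimated_cycles(instructions=1000, l1_misses=10, l2_misses=1))
    print(hit_rate(accesses=1000, misses=5))
```

Because an L2 miss is weighted a hundred times heavier than an instruction, even a small L2 miss rate can dominate the estimate, which is why the L2 hit rate is the headline number on this slide.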

What are the costs of a variable?

Within a regular or an adaptive grid?
Within an embedded geometry?
Nearly always the same!

What is the advantage of high cache efficiency in our context? Look at the costs of a variable: are there substantial differences for different geometries and grids? Yes, but only very small ones!
The picture shows the costs per variable per iteration for all grids used for solving Poisson's equation:
  Strong variations in the grid parameters (degree of adaptivity, embedded or not, …)
  The more variables, the higher the efficiency
  Outer areas in embedded geometries are very cheap (variables are plotted!)
  Variables on coarser levels do not have significantly higher costs than variables on the finest grid

Conclusion and outlook

Technology with high potential
3D
Numerical efficiency
Parallelization
Efficiency in hardware
PDE solver with space-filling curves
More equations
Efficiency by methodology
Full adaptivity
Advanced treatment of boundaries
Fluid-structure interaction

What did we obtain?
  High efficiency in hardware in spite of high numerical efficiency
  Memory access is not the relevant bottleneck
  Efficiency evolves from the methodology, not from after-the-fact optimization of a running program
  The crucial effect is the deep integration of the space-filling curve into the flow control of the program, in conjunction with the control of memory access
  Temporal and spatial locality of data is enforced; the hierarchical memory structures of modern processors are used nearly optimally
  There is potential for optimization in MFLOPS. Currently 15% of peak performance is possible, e.g. 360 MFLOPS on a Xeon 2.4 GHz
Roadmap:
  Many extensions are currently being implemented …
  Extensions are not difficult in principle, we just have to do it!
  We are sure that this approach has high potential for further applications
