Download presentation
Presentation is loading. Please wait.
Published byHarry Greer Modified over 9 years ago
1
Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory
2
2 Goal ● Provide cache-coherent layouts of meshes and graphs ● Derive metrics that measure cache- coherence of layouts ● Generality ● Simplicity ● Efficiency ● Accuracy
3
3 Cache-Coherent Metrics ● Measure the expected number of cache misses of a layout given block-based caches ● Should correlate well with the observed number of cache misses ● Cache-aware metrics ● Measure cache-coherence given known cache parameters (e.g., block size) ● Cache-oblivious metrics ● Consider all possible cache parameters
4
4 Motivation Accumulated growth rate during 1993 – 2004 (log scale) Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/ ●Lower growth rate of data access speed 1.5X 20X 46X 130X during 99 - 04
5
5 Memory Hierarchies and Block- Based Caches CPU Fast memory or cache Slow memory Block transfer Disk 10 -2 sec Access time: 10 -7 sec10 -8 sec
6
6 Main Contributions ● Propose novel and practical cache-aware and cache-oblivious metrics ● Derive metrics given block-based caches ● Propose efficient cache-coherent layout constructions ● Apply to different applications
7
7 Related Work ● Computation reordering ● Data layout optimization
8
8 Computational Reordering ● Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02] ● Cache-oblivious [Frigo et al. 99, Arge et al. 04] Focus on specific problems such as sorting and linear algebra computations
9
9 Data Layout Optimization ● Graph and matrix layout [Diaz et al. 02] ● Minimum linear arrangement (MLA), bandwidth, and wavefront, etc. ● Space-filling curves ● [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] ● Rendering and processing sequences ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05] ● Cache-oblivious mesh layout ● [Yoon et al. 05]
10
10 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results
11
11 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results
12
12 ncnc General Framework of Layout Computation nana nbnb ndnd Input directed graph, G (N, A) Layout algorithm, φ 1D layout, φ(N) …….. nana ndnd ncnc nbnb Cache-coherent metric
13
13 ncnc Two-Level I/O Model [Aggarwal and Vitter 88] nana nbnb ndnd Input directed graph Layout algorithm Cache M cache blocks, whose size is B 1D layout with block size = 3 …….. nana ndnd ncnc nbnb nana ndnd ncnc nbnb
14
14 Graph Representation ● Directed graph, G = (N, A) ● Represent access patterns between nodes ● Nodes, N ● Data element ● (e.g., mesh vertex or mesh triangle) ● Directed arcs, A ● Connects two nodes if they are accessed sequentially ncnc nana nbnb ndnd
15
15 Weights of Nodes and Arcs ● Indicate probabilities that each element will be accessed ● Computed in an equilibrium status during infinite random walks ● Assume that applications infinitely access the data according to the input graph ● Correspond to eigen-values of the probability transition matrix
16
16 Cache-Coherence of a Layout given Block-Based Caches ● Expected number of cache misses of a layout ● Probability accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern ncnc nana nbnb ndnd
17
17 Specialization to Meshes ● Expected number of cache misses of a layout ● Probability accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern ncnc nana nbnb ndnd ncnc nana nbnb ndnd An input mesh Implicitly derived graph = constant 1. Two opposite directed arcs 2. Uniform distribution to access adjacent nodes given a node
18
18 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results
19
19 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1
20
20 Cache-Aware: Single Cache Block, M=1 ncnc nana nbnb ndnd Input directed graph 1D layout with block size = 3 …….. nana ndnd ncnc nbnb Cache, whose block size is B Straddling arcs
21
21 Cache-Aware: Multiple Cache Blocks, M>1 ncnc nana nbnb ndnd Input directed graph 1D layout with block boundary …….. nana ndnd ncnc nbnb Straddling arcs Cache
22
22 Final Cache-Aware Metric ● Counts the number of straddling arcs of the layout given a block size B : block index containing the node, i : Unit step function, 1 if x > 0 0 otherwise. where
23
23 High Accuracy of Cache-Aware Metric Linear correlation [-1, 1] Observed number of cache misses With 5 cache blocks With 25 cache blocks Cache-aware metric 0.97 Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row, (bi or uni) diagnoal layouts Z-curve on a uniform grid Tested block size = 4KB
24
24 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1
25
25 Cache-Oblivious: Single Cache Block, M=1 Cache Does not assume a particular block size: Then, what are good representatives for block sizes?
26
26 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2 0, 2 1, 2 2, 2 3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.
27
27 Probability that an Arc is a Straddling Arc …….. nana ndnd ncnc nbnb Is an arc straddling given a block size? Computed as a probability as a function of arc length, l Arc length, l, = 2
28
28 Two Cache-Oblivious Metrics ● Arithmetic cache-oblivious metric, ● Geometric cache-oblivious metric, Arc length of arc (i, j) Geometric mean of arc lengths MLA metric, Arithmetic mean
29
29 Validation for Cache-Oblivious (CO) Metrics ● Geometric cache-oblivious metric ● Practical and useful Geometric CO layout Arithmetic CO layout 97% of tested block sizes The number of cache misses when M = 1 (log scale) 73% of tested power-of-two block sizes
30
30 Correlations between Metrics and Observed Number of Cache Misses Linear correlation [-1, 1] Observed number of cache misses With 1 cache block With 5 cache blocks Geometric CO metric0.980.81 Arithmetic CO metric-0.19-0.32 Tested block size = 4KB Tested with 10 different layouts on a uniform grid
31
31 Efficient Layout Computation for Our Metrics ● Cache-aware layouts ● Optimized with cache-aware metric given a block size B ● Computed from the graph partitioning ● Geometric cache-oblivious metric ● Very efficient ● Can be used in different layout methods
32
32 Layout Computation with Geometric Cache-Oblivious Metric ● Multi-level construction method ● Partition an input mesh into k different sets ● Layout partitions based on our metric … 1. Partition 2. Lay out Generalized layout method for unstructured meshes
33
33 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results
34
34 Applications ● Isosurface extraction ● View-dependent rendering
35
35 Iso-Surface Extraction ● Uses contour tree [van Kreveld et al. 97] ● Runtime is dominated by the traversal of iso- surface ● Layout graph ● Use an input tetrahedral mesh Spx model (140K vertices)
36
36 High Correlation with Number of Cache Misses Linear correlation [-1, 1] Observed number of cache misses With 1 cache block With 10K cache blocks Geometric CO metric 0.990.98 Tested block size = 4KB Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache- oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts
37
37 High Correlation with Runtime Performance Linear correlation [-1, 1] First iso-surface extraction time Second iso-surface extraction time Geometric CO metric 0.94 Disk I/O time is major bottleneck Memory access time is major bottleneck
38
38 Comparison with Other Layouts The first iso-surface extraction time (sec) 8% - 77% improvement and very close to the cache-aware performance
39
39 View-Dependent Rendering ● Layout vertices and triangles of progressive meshes ● Used in an efficient VDR system [Yoon et al. 04] ● Reduce misses in GPU vertex cache
40
40 Cache Miss Ratio on Bunny Model GPU vertex cache miss ratio Vertex cache size Hoppe [Hoppe 99] Theoretical lower bound [Bar-Yehuda and Gotsman 96] Universal rendering seq. [Bogomjakov and Gotsman 02] Geometric CO layout
41
41 Cache Miss Ratio on Power Plant Model GPU vertex cache miss ratio Vertex cache size Z-curve Hoppe’s rendering seq. [Hoppe 99] Theoretical lower bound [Bar-Yehuda and Gotsman 96] COML [Yoon et al. 05] Geometric CO layout
42
42 Conclusion ● Novel cache-aware and cache-oblivious metrics to evaluate layouts ● Derived metrics based on two-level I/O model ● Improved the performance of applications without modifying codes OpenCCL, open source library http:://gamma.cs.unc.edu/COL/OpenCCL
43
43 Ongoing and Future Work ● Derive a lower bound on our geometric cache-oblivious metric ● Employ mesh compression to further reduce disk I/O accesses ● Investigate efficient layout method for deforming models ● Apply to non-graphics applications ● e.g., shortest path or other graph computations
44
44 Cache-Efficient Layouts of Bounding Volume Hierarchies ● Yoon and Manocha, Eurographics 06 Ray tracingCollision detection
45
45 Acknowledgements ● Ajith Mascarenhas ● Martin Isenburg ● Dinesh Manocha ● Fabio Bernardon, Joao Comba, and Claudio Silva ● For their unstructured tetrahedra rendering program ● Members of data analysis group in LLNL ● Anonymous reviewers
46
46 UCRL-PRES-225448 This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.
47
47 Additional slides
48
48 Cache-Coherence of Layouts ● Well known heuristics for cache-coherent layouts ● Space-filling curves [Sagan 94] ● How can we compute better layouts? ● Requires metrics measuring cache-coherence of layouts
49
49 Main Results ● Define cache-coherence of layout as: ● Expected number of cache misses during random walks of a graph given block-based caches ● Then, the exp. number of cache misses = Number of straddling arcs in a cache-aware cache ≈ Geometric mean of arc lengths in a cache- oblivious case
50
50 Data Layout Optimization ● Rendering sequences ● Triangle strips ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02] ● Processing sequences ● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05] Assume that access pattern globally follows the layout order
51
51 Data Layout Optimization ● Space-filling curves ● [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] Assume geometric regularity
52
52 Data Layout Optimization ● Graph and matrix layout ● A survey [Diaz et al. 02] ● Minimum linear arrangement (MLA) ● Bandwidth ● Profile ● Wavefront, etc. Does not necessarily produce good layouts for block-based caches
53
53 Cache-Aware Metric ● Expected number of cache misses given a block size B
54
54 Correlation between Cache-Aware Metric and Observed Number of Cache Misses Cache-aware metric M = 5 M = 25 Observed number of cache misses Cache block size = 4KB R 2 = 0.97
55
55 Evaluating Existing Layouts ● No known tight bound ● Compare against the best layout we can construct ● Employ an efficient sampling method Is it close to the optimal layout? An existing layout, φ Use it Build a new one
56
56 Implementation ● Modify our open source layout computation codes, OpenCCL ● Based on METIS graph partitioning library [Karypis and Kumar 98] ● Processing speed ● 15k triangles / second
57
57 Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05] ● Two major improvements ● Accuracy ● Usability
58
58 Pros and Cons ● Limitations ● Not directly applicable to dynamic models ● Pros ● Generality ● Can have benefits without modifying underlying codes
59
59 Specialization to Meshes ● Assume an equally likelihood to access adjacent nodes given a node ncnc nana nbnb ndnd ncnc nana nbnb ndnd An input mesh Implicitly derived graph
60
60 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2 0, 2 1, 2 2, 2 3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc. B Pr(B) B Geometric distribution Uniform distribution
61
61 Cache Miss Ratio with Different Mesh Resolutions of Bunny Model GPU vertex cache miss ratio (case size: 32) Mesh resolution Hoppe’s rendering sequence [Hoppe 99] COLg
62
62 Correlation between Cache-Aware Metric and Number of Cache Misses M = 5 M = 25 Observed number of cache misses Cache block size = 4KB R 2 = 0.97
63
63 Correlations with Observed Number of Cache Misses Cache misses One blk : Mult blks Arithmetic CO metric Geometric CO metric R 2 = 0.98R 2 = 0.81 Cache block size = 4KB,
64
64 Comparison with Other Layouts 0.99 0.98 0.94 X-axis BFL Spectral layout [Juvan and Mohar 92] DFL COML [Yoon et al. 05] Z-curve COLg Aware Correlation: Up to 2X speedup 9% lower than cache- aware layout
65
65 Comparison with Other Layouts ● Compute eight different layouts ● Our geometric cache-oblivious layout ● Our cache-aware layout ● Breadth-first layout ● Depth-first layout ● Spectral layout [Juvan and Mohar 92] ● Cache-oblivious mesh layout [Yoon et al. 05] ● Z-curve [Sagan 94] ● X-axis sorted layout
66
66 Our Layout of Bunny Model
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.