Slide 1: Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Michael Chu (Microsoft), Rajiv Ravindran (HP), Scott Mahlke
University of Michigan
Slide 2: Multicore is Here
Power-efficient design
– Decrease core complexity, increase the number of cores
Examples: Intel Core 2 Duo (2006), AMD Athlon 64 X2 (2005), Sun Niagara (2005)
[Image source: Intel]
Slide 3: Compiling in the Multicore Era
Coarse-grain vs. fine-grain parallelism
[Figure: the same dataflow graph of loads, stores, and ALU operations mapped to a single core vs. split across Core 1 and Core 2]
[Software Queues, PMUP 06], [Scalar Operand Network, HPCA 05]
Slide 4: Objectives of this Work
Goal: detect and exploit available fine-grain parallelism
Compiler support is key for good performance
– Divide computation operations and data across cores
– Maximize direct access to the values each core needs
[Figure: two cores with private caches; int x[100] and int y[100] and the operations that access them are split between the cores]
Slide 5: Data/Computation Partitioning
Data partitioning is separated from computation partitioning:
– First partition the data, then use that partition to guide the partitioning of computation
[Flow: Program → Profile-driven Data Partitioning → binding decisions for memory operations → Region Selection → Region-level Computation Partitioning → partition assignment for all ops]
Slide 6: Data Partitioning for Caches
Goals:
– Maximize parallel computation
– Reduce stall cycles (coherence traffic, conflicts/misses)
Static analysis [1]: object granularity
Profile-driven partitioning: memory-instruction granularity
[Figure: two cores with private caches connected by a coherence network]
[1] Compiler-directed Object Partitioning for Multicluster Processors [CGO 06]
Slide 7: Memory Access Graph
Nodes: memory operations in the program
– Node weight: working-set size
Edges: relationships between memory operations
– Edge weight: affinity
[Figure: memory access graph spanning basic blocks BB1 and BB2]
Slide 8: Node Weight: Working Set Estimate
– Weighted average of the cache blocks a memory operation requires
[Figure: an example sequence of loads and stores to addresses; the sequence requires 5 cache blocks in total]
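The slide does not give the exact estimator, so the following is only a minimal sketch of one plausible working-set weight, assuming a profiled trace of (memory op, address) pairs per execution of a basic block and a fixed cache-block size; BLOCK_SIZE and the op names are illustrative, not from the deck.

```python
from collections import defaultdict

BLOCK_SIZE = 32  # assumed cache-block size in bytes

def working_set_weight(bb_executions):
    """Node weight per memory op: the number of distinct cache blocks it
    touches, averaged over the profiled executions of its basic block."""
    blocks_per_exec = defaultdict(list)
    for trace in bb_executions:                  # one trace per basic-block execution
        touched = defaultdict(set)
        for op_id, addr in trace:
            touched[op_id].add(addr // BLOCK_SIZE)
        for op_id, blocks in touched.items():
            blocks_per_exec[op_id].append(len(blocks))
    return {op: sum(counts) / len(counts) for op, counts in blocks_per_exec.items()}

# Hypothetical trace: the block's accesses together need 5 distinct cache blocks.
trace = [("LD2", 5 * BLOCK_SIZE), ("LD6", 3 * BLOCK_SIZE), ("LD1", 1 * BLOCK_SIZE),
         ("ST2", 5 * BLOCK_SIZE), ("LD3", 3 * BLOCK_SIZE), ("LD2", 4 * BLOCK_SIZE),
         ("LD1", 2 * BLOCK_SIZE)]
print(working_set_weight([trace]))   # e.g. LD2 touches 2 blocks, LD1 touches 2, ...
```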
Slide 9: Edge Weight: Memory Op Affinity
Finds relationships between memory accesses
– Positive affinity: possible hit
– Negative affinity: possible miss
[Figure: a sliding window over the profiled access trace; each entry records the memory op, the block address it accesses, and the cache line it maps to]
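The affinity formula itself is not spelled out on the slide; the sketch below assumes one simple rule: within the sliding window, two ops that touch the same cache block earn positive affinity (a likely hit if they share a cache), and two ops that touch different blocks mapping to the same cache line earn negative affinity (a likely conflict if they share a cache). The window length and line count are illustrative.

```python
from collections import deque, defaultdict

NUM_LINES = 4          # assumed number of cache lines (sets)
WINDOW = 8             # assumed sliding-window length

def affinity_edges(trace):
    """Edge weights between memory ops from a profiled access trace.

    trace: list of (op_id, block_address) pairs in program order.
    Returns {(op_a, op_b): affinity}, positive or negative."""
    window = deque(maxlen=WINDOW)
    affinity = defaultdict(int)
    for op, block in trace:
        line = block % NUM_LINES
        for prev_op, prev_block in window:
            if prev_op == op:
                continue
            edge = tuple(sorted((op, prev_op)))
            if prev_block == block:
                affinity[edge] += 1          # same block soon after: possible hit
            elif prev_block % NUM_LINES == line:
                affinity[edge] -= 1          # different block, same line: possible conflict
        window.append((op, block))
    return dict(affinity)

trace = [("LD1", 1), ("LD2", 2), ("LD4", 3), ("ST1", 4), ("LD3", 4),
         ("LD2", 2), ("ST1", 4), ("LD2", 1), ("LD1", 2), ("LD2", 4)]
print(affinity_edges(trace))
```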
Slide 10: Partitioning of the Memory Access Graph
– Prevent overloading of node weights (reduces cache conflicts)
– Minimize cutting of positive-affinity edges (localizes accesses to one cache)
– Cut highly negative-affinity edges
[Figure: the example graph with edge weights 22, 11, 10, 3, and 1 split across the two caches]
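The deck does not name the partitioner used for this graph, so the following is only a greedy sketch of the three stated objectives: keep the total node weight per cache under a budget, pull positively related ops onto the same cache, and push negatively related ops apart. The function and parameter names are hypothetical.

```python
def partition_memory_graph(node_weight, affinity, capacity):
    """Greedy two-way split of the memory access graph.

    node_weight: {op: working-set estimate}
    affinity:    {(op_a, op_b): edge weight}, positive or negative
    capacity:    rough node-weight budget per cache
    Returns {op: 0 or 1}, giving each memory op's cache."""
    assign, load = {}, [0.0, 0.0]
    for op in sorted(node_weight, key=node_weight.get, reverse=True):  # heavy ops first
        pull = [0.0, 0.0]
        for (a, b), w in affinity.items():
            if op not in (a, b):
                continue
            other = b if a == op else a
            if other in assign:
                pull[assign[other]] += w      # positive w attracts, negative w repels
        side = 0 if pull[0] >= pull[1] else 1
        if load[side] + node_weight[op] > capacity and load[1 - side] + node_weight[op] <= capacity:
            side = 1 - side                   # avoid overloading one cache
        assign[op] = side
        load[side] += node_weight[op]
    return assign
```

Fed the node and edge weights from the two previous sketches, this yields a per-memory-op cache assignment that the computation partitioner then treats as prebound (see Slide 12).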
Slide 11: Computation Partitioning
Region-based Hierarchical Operation Partitioner (RHOP) [PLDI 03]
– Modified multilevel-FM graph partitioning algorithm
Any standard operation partitioner could be used
– Must transfer the data decisions down to it
[Figure: a program region is given weights (weight calculation) and then split across cores (graph partitioning)]
Slide 12: Dealing with Prebound Data
The data partition locks memory operations in place
– Don't move memory operations during computation partitioning
– This conveys the global data partition down to RHOP
[Figure: basic block BB1 with its memory operations prebound to their assigned cores]
Slide 13: Profile-driven Partitioning
[Figure: example basic blocks BB1 and BB2, with non-memory ops and memory ops assigned to Core 1 and Core 2]
Slide 14: Experimental Methodology
Trimaran compiler toolset
Profiled and ran with different input sets
Machine model:
– 2, 4 cores
– 512 B, 1 kB, 4 kB, 8 kB caches per core
– 1, 2, 3 cycle operand-network move latency per hop
Slide 15: Coherence Traffic Reduction
[Chart: results for Kernels, Mediabench, SPECint, SPECfp, and the average]
Slide 16: Stall Cycle Reduction
[Chart: results for Kernels, Mediabench, SPECint, SPECfp, and the average]
Slide 17: Speedup Over Single Core
[Chart: results for Kernels, Mediabench, SPECint, SPECfp, and the average]
Slide 18: Conclusion
Fine-grain parallelism can be exploited
– Instruction-level parallelism is not dead
– Coarse-grain parallelism is still important!
Data-cognizant partitioning can improve performance
– Reduces stall cycles by 51%
– Reduces coherence traffic by 87%
Slide 19: Thank You
http://cccp.eecs.umich.edu
Slide 20: Backup
Slide 21: 2-Core Speedup vs. 4-Core Speedup
[Chart: results for Kernels, Mediabench, SPECint, SPECfp, and the average]
Slide 22: Speedup with Four Cores
[Chart: results for Kernels, Mediabench, SPEC, and the average]
Slide 23: Multicore is the Past
Embedded processors:
– Low-power constraints
– High-performance requirements
Examples:
– Multiflow Trace [1992]
– TI C6x [1997]
– Lx ST/200 [2000]
– Philips TM1300 [1999]
– MIT Raw [1997]
[Figure: a centralized processor vs. two clusters, each with its own register file and data cache, connected by an intercluster communication network and a data-coherence network]
Slide 24: Basics of Computation Partitioning
Goal: minimize schedule length
Strategy:
– Exploit parallel resources
– Minimize critical intercluster communication
[Figure: a dataflow graph split across two clusters; each cut edge becomes an intercluster move over the communication network]
Slide 25: Problem #1: Local vs. Region Scope
– Local-scope clustering; examples: [BUG, UAS, B-ITER]
– Region-scope clustering; examples: [CARS, Aletà '01]
[Figure: schedules of the same dataflow graph under local-scope and region-scope clustering, with intercluster move cycles marked]
Slide 26: Problem #2: Scheduler-centric
– Cluster assignment during scheduling adds complexity
– A detailed resource model (reservation tables) is slow
– Forces local decisions
Examples: [BUG, UAS]
[Figure: per-cluster reservation tables filled cycle by cycle during scheduling]
Slide 27: Our Approach
Opposite approach to conventional clustering
Hierarchical region view
– Graph partitioning strategy
– Identify tightly coupled operations and treat them uniformly
Non-scheduler-centric mindset
– Pre-scheduling technique
– Estimates schedule length
Advantages: efficient, hierarchical, and doesn't complicate the scheduler
Slide 28: Region-based Hierarchical Operation Partitioning (RHOP)
– Code is considered one region at a time
– Weight calculation creates guides for good partitions
– Partitioning clusters operations based on the given weights
[Figure: a program region goes through weight calculation and then graph partitioning]
Region-based Hierarchical Operation Partitioning for Multicluster Processors [PLDI 03]
Slide 29: Node Weight
Nodes: operations in the scheduling region
Metric for resource usage
– Resources used by a single operation per cycle
– Accounts for dedicated function units
Used to estimate overcommitment of resources
[Figure: operations mapped onto a cluster's I, F, M, and B function units and register file]
Slide 30: Edge Weight
Edges: data flow between operations
Slack distribution allocates slack to certain edges
– Edge slack identifies preferred places to cut edges
– A first-come, first-served method is used
[Figure: a dataflow DAG with slack labels (high/medium/low) on its edges]
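Slack is not defined on the slide; the sketch below assumes the usual meaning, the gap between an edge's earliest producer-finish time and latest consumer-start time, computed from ASAP/ALAP times over the region's dataflow DAG. Distributing that slack along each path (for example first come, first served) marks the edges where a cut, and hence an intercluster move, is cheapest. All names and the example graph are hypothetical.

```python
def asap_alap(succ, latency, length):
    """ASAP/ALAP issue times for each op in a dataflow DAG.

    succ:    {op: [successor ops]}
    latency: {op: latency in cycles}
    length:  assumed schedule-length estimate for the region"""
    pred = {v: [] for v in succ}
    for u in succ:
        for v in succ[u]:
            pred[v].append(u)
    asap, alap = {}, {}

    def asap_of(v):
        if v not in asap:
            asap[v] = max((asap_of(u) + latency[u] for u in pred[v]), default=0)
        return asap[v]

    def alap_of(v):
        if v not in alap:
            alap[v] = min((alap_of(w) for w in succ[v]), default=length) - latency[v]
        return alap[v]

    for v in succ:
        asap_of(v)
        alap_of(v)
    return asap, alap

def edge_slack(succ, latency, length):
    """Cycles each edge can stretch (e.g., absorb an intercluster move)
    without lengthening the estimated schedule."""
    asap, alap = asap_alap(succ, latency, length)
    return {(u, v): alap[v] - (asap[u] + latency[u]) for u in succ for v in succ[u]}

# Hypothetical region: op 'a' feeds 'b' and 'c'; the path through 'b' is critical.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
lat = {"a": 1, "b": 3, "c": 1, "d": 1}
print(edge_slack(g, lat, length=5))   # edges through 'c' get slack 2, others 0
```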
Slide 31: RHOP - Partitioning Phase
Modified multilevel-FM algorithm [Fiduccia '82]
Multilevel graph partitioning consists of two stages:
1. Coarsening stage
2. Refinement stage
[Figure: the operation graph is coarsened and then refined into Cluster 1 and Cluster 2]
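A minimal skeleton of the two stages under simplifying assumptions: coarsening merges pairs of nodes joined by the heaviest edges, and refinement makes single-node FM-style moves chosen by cut-weight gain. RHOP's actual weights and gain function (the resource and schedule-length estimates on the surrounding slides) are richer than this.

```python
def coarsen(nodes, edges):
    """One coarsening pass: collapse pairs joined by the heaviest edges."""
    merged, mapping = set(), {}
    for (a, b), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        if a not in merged and b not in merged:
            mapping[b] = a                    # b collapses into a
            merged.update((a, b))
    for v in nodes:
        mapping.setdefault(v, v)
    coarse_nodes = set(mapping.values())
    coarse_edges = {}
    for (a, b), w in edges.items():
        ca, cb = mapping[a], mapping[b]
        if ca != cb:
            key = tuple(sorted((ca, cb)))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_nodes, coarse_edges, mapping

def fm_refine(assign, edges, passes=4):
    """Refinement: repeatedly move the node whose move most reduces cut weight."""
    for _ in range(passes):
        best_gain, best_node = 0, None
        for v in assign:
            gain = sum(w if assign[a if b == v else b] != assign[v] else -w
                       for (a, b), w in edges.items() if v in (a, b))
            if gain > best_gain:
                best_gain, best_node = gain, v
        if best_node is None:
            break
        assign[best_node] = 1 - assign[best_node]
    return assign
```

The coarsest graph is partitioned first; the assignment is then projected back through `mapping` and refined again at each finer level.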
Slide 32: Cluster Refinement
Three questions to answer:
1. How good is the current partition?
2. Which cluster should operations move from?
3. How profitable is it to move operation X from cluster A to cluster B?
[Figure: a candidate group of operations considered for a move from Cluster 1 to Cluster 2]
Slide 33: Resource Overcommitment Estimate
[Figure: per-cycle resource-usage estimates for each cluster before and after a candidate move; each cluster's weight sums its estimated usage, e.g. the move changes Cluster 1's weight from 5.0 to 1.0 and Cluster 2's from 0.67 to 4.5]
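A minimal sketch of the cluster-weight numbers in the figure, assuming each operation spreads one unit of resource demand evenly over the cycles between its ASAP and ALAP times, and a cluster's weight is whatever demand exceeds its per-cycle capacity. The names and the capacity value are illustrative, not taken from the deck.

```python
def overcommit_weight(cluster_ops, asap, alap, capacity, cycles):
    """Estimate how overcommitted a cluster is for a candidate assignment.

    cluster_ops: operations currently assigned to this cluster
    asap, alap:  earliest/latest issue cycle for each op
    capacity:    function units usable per cycle in this cluster
    cycles:      estimated schedule length"""
    usage = [0.0] * cycles
    for op in cluster_ops:
        window = range(asap[op], alap[op] + 1)
        for c in window:
            usage[c] += 1.0 / len(window)     # spread the op's demand over its slack window
    return sum(max(0.0, u - capacity) for u in usage)

# A move from cluster 1 to cluster 2 is profitable if it lowers the summed weights:
# delta = (overcommit_weight(c1_after, ...) + overcommit_weight(c2_after, ...)) \
#       - (overcommit_weight(c1_before, ...) + overcommit_weight(c2_before, ...))
```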
Slide 34: RHOP Computation Partitioning
Improves over local algorithms
– Pre-scheduling technique
– Schedule-length estimator
– Combines slack distribution with multilevel partitioning
Performs better as the number of resources increases
Slide 35: Profile-driven Data Partitioning
Static analysis: coarse-grain partition
– Object granularity
Profile: fine-grain partition
– Memory-instruction granularity
Global view for data
– Consider memory relationships throughout the program
[Figure: a static object-granularity partition vs. a profile-guided partition of struct foo and int bar]
Slide 36: Communication-Unaware Partitioning
8-wide centralized vs. two clusters with 1-cycle communication
[Chart: performance comparison]
Slide 37: Overall Performance Improvement
[Chart: results for Kernels, Mediabench, and the average]
Slide 38: Next Steps: Auto Parallelization
Integrate fine- and coarse-grain parallelism
– Better take advantage of hundreds of cores
Future strategies for general parallelization
– Analyze memory/data access patterns between threads
– Break and untangle dependencies
– New parallel programming models
Balance data-partitioning decisions with computation
Slide 39: Edge Weight: Memory Op Affinity (backup copy of Slide 9)
Finds relationships between memory accesses
– Positive affinity: possible hit
– Negative affinity: possible miss
[Figure: the same sliding-window example as Slide 9]
Slide 40: RHOP Computation Partitioning
Improves over local algorithms
– Pre-scheduling technique
– Estimates of schedule length are used instead of a scheduler
– Combines slack distribution with multilevel partitioning
Performs better as the number of resources increases

Average improvement, RHOP vs. BUG:
  2-cluster 8-issue      -1.8%
  2-cluster 10-issue      3.7%
  4-cluster 16-issue     14.3%
  4-cluster 18-issue     15.3%
  4-cluster H 13-issue    8.0%
Slide 41: 2-Cluster Results vs. 1 Cluster
[Chart: speedup results]
Slide 42: 4-Cluster Results vs. 1 Cluster
[Chart: speedup results]
Slide 43: Experimental Evaluation
Trimaran toolset: a retargetable VLIW compiler
Evaluated DSP kernels and SPECint2000
Baseline: centralized processor with the sum of the clustered resources

  Name                  Configuration
  2-cluster 8-issue     2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
  2-cluster 10-issue    2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
  4-cluster 16-issue    4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
  4-cluster 18-issue    4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
  4-cluster H 13-issue  4 heterogeneous clusters; IM, IF, IB, and IMF clusters

64 registers per cluster; latencies similar to Itanium; perfect caches
Slide 44: Architectural Model
Solve the easier problem: computation partitioning
– Assume a centralized, shared data memory
– Accessible from each cluster with uniform latency
[Figure: two clusters (I/F/M units) sharing a centralized data memory]
Slide 45: Architectural Model
Use scratchpad memories
– Compiler-controlled memory
– Each cluster has one memory
– Each object is placed in one specific memory
– A data object is available in that memory throughout the lifetime of the program
[Figure: two clusters with scratchpad memories holding int x[100], foo, and int y[100]]
Slide 46: Problem: Partitioning of Data
Determine object placement into the data memories
Limited by:
– Memory sizes/capacities
– The computation operations related to the data
Partitioning is relevant to both caches and scratchpad memories
[Figure: two clusters with private data memories holding int x[100], struct foo, and int y[100]]
Slide 47: Data-Unaware Partitioning
Ignoring data loses an average of 30% performance
[Chart: performance comparison]
Slide 48: Our Objective
Goal: produce efficient code
Strategy:
– Partition data objects and computation operations together
  Consider the interplay between data and computation decisions
  Minimize intercluster transfers
– Evenly distribute data objects
  Improve memory bandwidth
  Maximize parallelism
[Figure: int x[100], struct foo, and int y[100] distributed across the clusters]
Slide 49: Pass 1: Global Data Partitioning
Determine memory relationships
– Pointer analysis and profiling of memory
Build a program-level graph representation of all operations
Perform data-object/memory-operation merging
– Respect the correctness constraints of the program
Steps: (1) interprocedural pointer analysis & memory profile, (2) build program data graph, (3) merge memory operations, (4) METIS graph partitioner
Slide 50: Global Data Graph Representation
Nodes: operations, either memory or non-memory
– Memory operations: loads, stores, malloc callsites
Edges: data flow between operations
Node weight: data object size
– Sum of data sizes for the referenced objects
Object size determined by:
– Globals/locals: pointer analysis
– Malloc callsites: memory profile
[Figure: example objects (int x[100], struct foo, malloc site 1) with their sizes (400 bytes, 1 Kbyte, 200 bytes)]
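A minimal sketch of the data-graph bookkeeping described above, assuming pointer-analysis points-to sets per memory op and object sizes taken from declarations or the memory profile; ops whose points-to sets overlap are merged with a union-find so each object lands wholly in one memory. The class and identifier names are hypothetical.

```python
class DataGraph:
    """Program-level data graph for global data partitioning (a sketch)."""

    def __init__(self):
        self.parent = {}       # union-find over memory ops
        self.points_to = {}    # op -> set of data objects it may reference
        self.obj_size = {}     # object -> size in bytes

    def add_mem_op(self, op, points_to, sizes):
        """sizes: {object: bytes}; globals/locals from pointer analysis,
        malloc callsites from the memory profile."""
        self.parent[op] = op
        self.points_to[op] = set(points_to)
        self.obj_size.update(sizes)

    def find(self, op):
        while self.parent[op] != op:
            self.parent[op] = self.parent[self.parent[op]]
            op = self.parent[op]
        return op

    def merge_aliasing_ops(self):
        """Merge ops whose points-to sets overlap, so every data object
        ends up wholly in one memory (a correctness constraint)."""
        ops = list(self.parent)
        for i, a in enumerate(ops):
            for b in ops[i + 1:]:
                ra, rb = self.find(a), self.find(b)
                if ra != rb and self.points_to[ra] & self.points_to[rb]:
                    self.points_to[ra] |= self.points_to.pop(rb)
                    self.parent[rb] = ra

    def node_weight(self, op):
        """Node weight: total size of the data the (merged) op may reference."""
        return sum(self.obj_size[o] for o in self.points_to[self.find(op)])

g = DataGraph()
g.add_mem_op("load_x", {"int x[100]"}, {"int x[100]": 400})
g.add_mem_op("store_x", {"int x[100]"}, {"int x[100]": 400})
g.add_mem_op("load_m1", {"malloc site 1"}, {"malloc site 1": 1024})
g.merge_aliasing_ops()
print(g.node_weight("load_x"), g.node_weight("load_m1"))   # 400 1024
```

The merged nodes, their weights, and the dataflow edges are then handed to the METIS partitioner in step 4 of the previous slide.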
Slide 51: Experimental Methodology
Compared on 2 clusters: one integer, float, memory, and branch unit per cluster
All results relative to a unified, dual-ported memory

  Scheme          Data Partitioning                          Computation Partitioning
  Global          Global-view, data-centric                  Knows data location
  Greedy          Region-view, greedy, computation-centric   Knows data location
  Data Unaware    None (assume unified memory)               Assumes unified memory
  Unified Memory  N/A                                        Unified memory
Slide 52: Performance: 1-cycle Remote Access
[Chart: performance relative to the unified-memory baseline]
Slide 53: Performance: 10-cycle Remote Access
[Chart: performance relative to the unified-memory baseline]
Slide 54: Global Data Partitioning
– Data placement: a first-order design principle
– Global, data-centric partition of computation
– Phase-ordered approach
  Global view for decisions on data
  Region view for decisions on computation
Achieves 95.2% of unified-memory performance on partitioned memories
Slide 55: GDP for Cache Memories
Used the scratchpad GDP approach with caches
Improvement comes from coherence traffic reduction
rls and fsed suffer because data is not replicated
Slide 56: Decentralized Architectures
Multicluster architecture
– Smaller, faster register files
– Transfers data through an interconnect network
– Cycle time, area, and power benefits
Examples: TI C6x, Analog Devices TigerSHARC, Alpha 21264, MIT Raw
[Figure: two clusters, each with its own register file and data memory, connected by an intercluster communication network]
Slide 57: Clustered Architectures
[Figure: a centralized processor (register file, I/F/M units, data memory) vs. two clusters with separate register files, trading schedule length for clock rate]