PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA)
Unique Chips and Systems (UCAS-4)
Dimitris Kaseridis & Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu
Outline
Brief Description of the UltraSPARC T1 Architecture
Analysis Objectives / Methodology
Analysis of Results:
  Interference on Shared Resources
  Scaling of Multiprogrammed Workloads
  Scaling of Multithreaded Workloads
UltraSPARC T1 (Niagara)
A multithreaded processor that combines CMP and SMT into a CMT design
8 cores, each handling 4 hardware context threads, for 32 active hardware context threads in total
Simple in-order pipeline per core, with no branch predictor unit
Optimized for multithreaded performance (throughput): memory and pipeline stalls/latencies are hidden by scheduling other available threads, with a zero-cycle thread-switch penalty
UltraSPARC T1 Core Pipeline
A thread group shares the L1 caches, TLBs, execution units, pipeline registers, and data path
Blue areas in the pipeline figure are resources replicated per hardware context thread
Objectives
Purpose:
  Analyze the interference of multiple executing threads on the shared resources of Niagara
  Evaluate the scaling ability of CMT architectures for both multiprogrammed and multithreaded workloads
Methodology:
  Interference on Shared Resources (SPEC CPU2000)
  Scaling of a Multiprogrammed Workload (SPEC CPU2000)
  Scaling of a Multithreaded Workload (SPECjbb2005)
Analysis Objectives / Methodology
Methodology (1/2)
On-chip performance counters used for real, accurate measurements
Niagara / Solaris 10 tools: cpustat and cputrack to read the counters, psrset to bind processes to hardware threads (see the sketch below)
2 counters per hardware thread, one of which only counts instructions
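A minimal sketch of how such a bound, counter-instrumented run could be scripted, assuming a processor set has already been created with psrset -c; the counter names (Instr_cnt, DC_miss) and the set id in the example are assumptions for illustration, not values taken from the study.

```python
# Hypothetical sketch of the measurement flow on this slide: run a benchmark inside a
# pre-created Solaris processor set (one hardware thread) and sample its performance
# counters with cputrack. The counter names and the processor set id are assumptions.
import subprocess

def run_in_set_with_counters(cmd, processor_set_id, events="pic0=Instr_cnt,pic1=DC_miss"):
    """Execute cmd bound to processor_set_id and return cputrack's text output."""
    full_cmd = ["psrset", "-e", str(processor_set_id), "cputrack", "-c", events] + cmd
    result = subprocess.run(full_cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Example (assumes a set, e.g. id 1, was built beforehand around one virtual CPU):
# print(run_in_set_with_counters(["./crafty"], processor_set_id=1))
```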
Methodology (2/2)
Niagara has only one FP unit, so only integer benchmarks were considered
The Performance Counter Unit works at the granularity of a single hardware context thread, so there is no way to break down the effects of several co-running threads, and software profiling tools are too invasive
Therefore, only pairs of benchmarks were considered, to allow correlation of benchmarks with counter events
Many iterations were run and the average behavior was used
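As a hedged illustration of the iterate-and-average step, the sketch below averages per-counter values over repeated runs; the measurement callable and the counter names in the example are placeholders, not part of the original methodology.

```python
# Minimal sketch of "many iterations, average behavior": repeat a measurement callable
# (e.g. a wrapper around the counter-collection helper sketched earlier) and report
# the mean of each counter across runs.
from statistics import mean

def average_counters(measure_once, iterations=10):
    """measure_once() -> dict of counter name to value; returns the per-counter mean."""
    samples = [measure_once() for _ in range(iterations)]
    return {name: mean(sample[name] for sample in samples) for name in samples[0]}

# Example with a stand-in measurement function and placeholder numbers:
# print(average_counters(lambda: {"Instr_cnt": 1.2e9, "DC_miss": 3.1e6}, iterations=5))
```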
Analysis of Results
Interference on shared resources
Scaling of a multiprogrammed workload
Scaling of a multithreaded workload
Interference on Shared Resources
Two modes considered (a placement sketch follows below):
  “Same core” mode runs both benchmarks of a pair on the same core: sharing of the pipeline, TLBs, and L1 bandwidth; behaves more like an SMT
  “Two cores” mode runs each member of the pair on a different core: sharing of L2 capacity/bandwidth and main memory; behaves more like a CMP
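A hedged sketch of how the two placements could be expressed in terms of Niagara's virtual CPU ids; it assumes Solaris enumerates the 32 hardware threads as core_id * 4 + thread_id, which is an assumption for illustration rather than something stated on the slides.

```python
# Hypothetical helper for the two interference modes described above. The mapping
# virtual_cpu = core_id * 4 + thread_id is an assumed enumeration.
THREADS_PER_CORE = 4

def placement(mode):
    """Return the two virtual CPU ids to pin a benchmark pair to."""
    if mode == "same_core":   # share pipeline, TLBs, L1 bandwidth (SMT-like)
        return [0 * THREADS_PER_CORE + 0, 0 * THREADS_PER_CORE + 1]  # core 0, threads 0 and 1
    if mode == "two_cores":   # share only L2 capacity/bandwidth and memory (CMP-like)
        return [0 * THREADS_PER_CORE + 0, 1 * THREADS_PER_CORE + 0]  # core 0 and core 1
    raise ValueError(f"unknown mode: {mode}")

# print(placement("same_core"))  # -> [0, 1]
# print(placement("two_cores"))  # -> [0, 4]
```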
Interference “same core” (1/2)
On average, a 12% drop in IPC when running in a pair
crafty, followed by twolf, showed the worst performance
eon showed the best behavior, keeping its IPC close to the single-thread case
Interference “same core” (2/2)
DC misses increased by 20% on average (15% when excluding crafty); the worst DC miss increases are in vortex and perlbmk
The pairs with the highest L2 miss ratios are not the ones with the largest IPC drops: the mcf and eon pairs show L2 miss increases of more than 70%
Overall, a small performance penalty even when sharing the pipeline and the L1/L2 bandwidth; the latency-hiding technique is promising
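A hedged sketch of the ratio metrics reported on these slides, i.e. the relative IPC drop and the relative increase in misses of a paired run versus a single-thread run; the numbers in the usage example are placeholders, not measurements from the study.

```python
# Ratio metrics used on these result slides (illustrative, with placeholder numbers).

def ipc_drop(ipc_alone, ipc_paired):
    """Fractional IPC loss of the paired run relative to the single-thread run."""
    return (ipc_alone - ipc_paired) / ipc_alone

def miss_increase(misses_alone, misses_paired):
    """Fractional growth in miss count when co-scheduled with another benchmark."""
    return misses_paired / misses_alone - 1.0

# print(f"IPC drop: {ipc_drop(0.70, 0.62):.0%}")                  # ~11%
# print(f"DC miss increase: {miss_increase(1.0e6, 1.2e6):.0%}")   # 20%
```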
Interference “two cores”
Only the L2 and the shared communication buses are stressed
On average, L2 misses are almost the same as in the “same core” case: the available resources are underutilized
Multiprogrammed workload with no data sharing between the pair
Scaling of Multiprogrammed Workload
Reduced benchmark pair set
Scaling from 4 to 8 to 16 threads, using the corresponding core/thread configurations
Scaling of Multiprogrammed Workload
Two configurations: “same core” mode and “mixed” mode
Scaling of Multiprogrammed “same core”
4 → 8 threads: IPC and data cache misses are not affected; L2 data misses increase but IPC does not, since there are enough resources even when running fully occupied, thanks to memory latency hiding
8 → 16 threads: more cores run the same benchmark, each adding its footprint and requests to the L2 and main memory; the increased L2 requirements and shared-interconnect traffic decrease performance
(Charts: IPC ratio, DC misses ratio, L2 misses ratio)
Scaling of Multiprogrammed “mixed mode”
Mixed mode case: significant decrease in IPC when moving both from 4 to 8 and from 8 to 16 threads
Same behavior as the “same core” case for DC and L2 misses, with an average difference of 1% - 2%
Overall, for both modes, Niagara demonstrated that moving from 4 to 16 threads costs less than a 40% performance drop on average
Both modes showed that significantly increased L1 and L2 misses can be handled while favoring throughput
(Chart: IPC ratio)
Scaling of Multithreaded Workload
Scaled from 1 up to 64 threads (see the mapping sketch below):
  1 - 8 threads: mapped 1 thread per core
  8 - 16 threads: at most 2 threads per core
  16 - 32 threads: up to 4 threads per core
  32 - 64 threads: more software threads than hardware contexts per core, so swapping is necessary
Configuration used for SPECjbb2005
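A hedged sketch of the thread-to-core mapping described above for the 8-core, 32-hardware-thread T1; the even spread of threads across cores is an assumption consistent with the slide's ranges, not a statement of how the study actually pinned SPECjbb2005 warehouses.

```python
# Illustrative mapping of software thread counts onto the T1's cores and hardware contexts.
import math

CORES = 8
HW_THREADS_PER_CORE = 4

def threads_per_core(total_threads):
    """Maximum software threads any single core hosts, spreading them evenly (assumed)."""
    return math.ceil(total_threads / CORES)

def needs_swapping(total_threads):
    """True once software threads exceed the 32 available hardware contexts."""
    return total_threads > CORES * HW_THREADS_PER_CORE

# for n in (8, 16, 32, 64):
#     print(n, threads_per_core(n), needs_swapping(n))  # 8:1, 16:2, 32:4, 64:8 (swapping)
```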
Scaling of Multithreaded Workload
(Chart: SPECjbb2005 score per warehouse, with the GC effect annotated)
Scaling of Multithreaded Workload
Ratios are shown over the 8-thread case with 1 thread per core
The instruction fetch unit and the DTLB are stressed the most
The L1 data and L2 caches managed to scale even for more than 32 threads
(Chart annotation: GC effect)
Scaling of Multithreaded Workload
Scaling of performance (a rough arithmetic check follows below):
  Nearly linear scaling of about 0.66 per thread up to 32 threads, giving a 20x speedup at 32 threads
  SMT with 2 threads/core gives, on average, a 1.8x speedup over the single-threaded CMP configuration (region 1 of the chart)
  SMT with up to 4 threads/core gives a 1.3x and a 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
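A back-of-the-envelope check, under an assumed idealized linear model, that a per-thread scaling factor of about 0.66 is consistent with the reported roughly 20x speedup at 32 threads; this is illustrative arithmetic, not the study's computation.

```python
# Rough sanity check of the scaling numbers quoted above (illustrative linear model).

def approx_speedup(threads, per_thread_scaling=0.66):
    """Idealized model: each additional thread contributes per_thread_scaling of one thread."""
    return 1.0 + per_thread_scaling * (threads - 1)

# print(approx_speedup(32))  # ~21.5, in the neighborhood of the reported ~20x
```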
Conclusions
Demonstration of interference on a real CMT system
The long-latency hiding technique is effective against L1 and L2 misses and could therefore be a good, promising alternative to aggressive speculation
Promising scaling of up to 20x for multithreaded workloads, with an average of 0.66x per thread
The instruction fetch subsystem and the DTLBs are the most contended resources, followed by the L2 cache
Q/A
Thank you… Questions?
The Laboratory for Computer Architecture website: http://lca.ece.utexas.edu
Email: kaseridi@ece.utexas.edu