
CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)
Dimitris Kaseridis and Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture, http://lca.ece.utexas.edu
10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)

Outline
Brief description of UltraSPARC T1
Objectives
SPECjbb2005 benchmark
Results

UltraSPARC T1
A new multithreaded processor that combines chip multiprocessing (CMP) and simultaneous multithreading (SMT) into chip multithreading (CMT).
8 cores, each handling 4 hardware thread contexts, for a total of 32 active hardware threads.
A simple in-order pipeline per core, with no branch prediction unit.
Optimized for multithreaded performance, i.e. throughput.
High throughput comes from hiding memory and pipeline stalls by scheduling other threads, with a zero-cycle thread-switch penalty.
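The latency-hiding argument can be illustrated with an idealized model, which is an assumption of this note and not the authors' method: each thread alternates `run_cycles` of useful work with a `stall_cycles` memory stall, and with a zero-cycle switch the core issues from whichever thread is ready. The function name and the numbers are hypothetical.

```python
# Idealized latency-hiding model (hypothetical, not from the slides).
# Each thread computes for run_cycles, then stalls for stall_cycles.
# With a zero-cycle thread switch, the single-issue core issues from any
# ready thread, so utilization grows with thread count until it saturates.

def core_utilization(threads: int, run_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles in which the core's pipeline does useful work."""
    if threads < 1 or run_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, run_cycles > 0, stall_cycles >= 0")
    single = run_cycles / (run_cycles + stall_cycles)  # one thread running alone
    return min(1.0, threads * single)                  # capped at a full pipeline


if __name__ == "__main__":
    # Illustrative numbers only: 20 busy cycles, then a 60-cycle stall.
    for n in (1, 2, 4, 8):
        print(n, core_utilization(n, run_cycles=20, stall_cycles=60))
```

The model is an upper bound: it ignores contention among threads for the shared pipeline, caches and TLBs, so measured gains can be well below `threads` times the single-thread rate.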

SMP vs. CMT [figure]

UltraSPARC T1 Core Pipeline
The thread group shares the L1 cache, TLBs, execution units, pipeline registers and datapath.
Core area = 11 mm² (90 nm technology).
4-way multithreading adds ~20% area to the core.
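Assuming the 11 mm² figure already includes the multithreading hardware, and that the ~20% is measured relative to a single-threaded core (both assumptions), the area split works out as below; `split_core_area` is a hypothetical helper.

```python
def split_core_area(mt_core_area_mm2: float, mt_overhead_fraction: float) -> tuple[float, float]:
    """Return (single-threaded core area, area added by multithreading).

    Assumes the overhead fraction is relative to the single-threaded core,
    i.e. mt_core_area = base * (1 + mt_overhead_fraction).
    """
    base = mt_core_area_mm2 / (1.0 + mt_overhead_fraction)
    return base, mt_core_area_mm2 - base


base, extra = split_core_area(11.0, 0.20)
print(f"single-threaded core ~ {base:.2f} mm^2, MT overhead ~ {extra:.2f} mm^2")
```

If the ~20% were instead a share of the final 11 mm² core, the overhead would be 0.20 x 11 = 2.2 mm².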

Objectives
Evaluate the benefits of CMP/CMT.
Quantify the benefits of additional cores and/or additional hardware threads in a multithreaded environment.
Show the effectiveness of latency hiding.

SPECjbb2005 Benchmark Characteristics
Models a self-contained 3-tier system: server, database and clients.
Every warehouse is a collection of Java objects with ~25 MB of data.
Each client is represented by an individual thread.
No I/O effects.
Reported score: business operations per second (bops).
Targets the performance of CPUs, caches and the memory hierarchy, and the scalability of shared-memory processors.
Stresses the implementations of the JVM (Java Virtual Machine), the JIT (Just-In-Time) compiler, garbage collection and threads.
[Figure: SPECjbb2005 3-tier architecture]
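A toy model of the thread-per-client structure, hypothetical and not SPECjbb code: each thread owns a private in-memory "warehouse" and counts completed operations, and throughput is total operations divided by wall-clock time. Under CPython's global interpreter lock these threads do not execute bytecode in parallel, so the sketch shows the shape of the workload and of an operations-per-second score, not hardware scaling; the official SPECjbb2005 metric is defined by SPEC's run rules, not by this code.

```python
import threading
import time


def run_toy_warehouses(n_warehouses: int, ops_per_warehouse: int) -> tuple[int, float]:
    """Hypothetical toy: one thread per warehouse; returns (total ops, ops/second)."""
    totals = [0] * n_warehouses  # each thread writes only its own slot

    def client(idx: int) -> None:
        orders = [0] * 1000  # private per-warehouse state (stand-in for its objects)
        for op in range(ops_per_warehouse):
            orders[op % len(orders)] += 1  # stand-in "transaction"
            totals[idx] += 1

    threads = [threading.Thread(target=client, args=(i,)) for i in range(n_warehouses)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    ops = sum(totals)
    return ops, (ops / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__":
    print(run_toy_warehouses(4, 10_000))
```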

Parameters
Experimental parameters:
Operating system: SunOS 5.10 Generic_118833-17
CPU frequency: 1 GHz
Main memory size: 8 GB DDR2 DRAM
JVM version: Java(TM) 2 build 1.5.0_06-b05
SPECjbb execution command:
java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props
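Under the usual HotSpot reading of these flags, `-Xms` equal to `-Xmx` fixes the Java heap at 2560 MB, `-Xmn1536m` sizes the young generation, the old generation gets the remainder, and `-Xss128k` sets each thread's stack size. The sketch below, built around a hypothetical `heap_layout` helper, does that arithmetic; it assumes the Java 5 permanent generation is sized separately and is not part of `-Xmx`.

```python
import re

SIZE_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
MIB = 1024**2


def parse_size(text: str) -> int:
    """Parse a HotSpot-style size such as '2560m' or '128k' into bytes."""
    m = re.fullmatch(r"(\d+)([kmg]?)", text.lower())
    if not m:
        raise ValueError(f"bad size: {text!r}")
    return int(m.group(1)) * SIZE_UNITS[m.group(2)]


def heap_layout(args: list[str]) -> dict[str, int]:
    """Hypothetical helper: derive heap regions from -Xmx/-Xmn/-Xss flags.

    Assumption: old generation = max heap - young generation
    (permanent generation sized separately, as in Java 5 HotSpot).
    """
    sizes = {}
    for arg in args:
        for flag in ("-Xmx", "-Xms", "-Xmn", "-Xss"):
            if arg.startswith(flag):
                sizes[flag] = parse_size(arg[len(flag):])
    return {
        "max_heap_mib": sizes["-Xmx"] // MIB,
        "young_gen_mib": sizes["-Xmn"] // MIB,
        "old_gen_mib": (sizes["-Xmx"] - sizes["-Xmn"]) // MIB,
        "thread_stack_kib": sizes["-Xss"] // 1024,
    }


print(heap_layout("-Xmx2560m -Xms2560m -Xmn1536m -Xss128k".split()))
```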

Measurement Methodology
On-chip performance counters, for real and accurate results.
Niagara: Solaris 10 tools cpustat and cputrack.
2 counters per hardware thread, one of which counts only instructions.
Event name: description
Instr_cnt: number of completed instructions
SB_full: number of store buffer full cycles
FP_instr_cnt: number of completed floating-point instructions
IC_miss: number of instruction cache (L1) misses
DC_miss: number of data cache (L1) misses for loads
ITLB_miss: number of instruction TLB miss traps taken
DTLB_miss: number of data TLB miss traps taken (includes real_translation misses)
L2_imiss: number of secondary cache (L2) misses due to instruction cache requests
L2_dmiss_ld: number of secondary cache (L2) misses due to data cache load requests
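Raw counts become comparable across runs once they are normalized. A common choice, assumed here rather than taken from the slides, is IPC plus events per thousand instructions. `cycles` would come from elapsed time at the 1 GHz clock, and summing the per-thread `Instr_cnt` values of one core before dividing gives that core's IPC. The helper and the sample counts are hypothetical.

```python
MISS_EVENTS = ("IC_miss", "DC_miss", "ITLB_miss", "DTLB_miss", "L2_imiss", "L2_dmiss_ld")


def derived_metrics(counts: dict[str, int], cycles: int) -> dict[str, float]:
    """Hypothetical helper: IPC and misses per 1000 instructions from raw counts."""
    instr = counts["Instr_cnt"]
    if instr <= 0 or cycles <= 0:
        raise ValueError("need positive instruction and cycle counts")
    kilo_instr = instr / 1000.0
    out = {"IPC": instr / cycles}
    for event in MISS_EVENTS:
        if event in counts:
            out[f"{event}_per_kinstr"] = counts[event] / kilo_instr
    return out


# Made-up sample counts, for illustration only (5 ms at 1 GHz = 5,000,000 cycles).
sample = {"Instr_cnt": 2_000_000, "DC_miss": 30_000, "L2_dmiss_ld": 4_000}
print(derived_metrics(sample, cycles=5_000_000))
```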

Results – Latency Hiding Pays Off
Panels: single-thread execution on T1; single-core execution using 4 threads on one core.
Running 4 threads on one core gives roughly 2x the single-thread score instead of the ideal 4x.
[Figures: SPECjbb score (bops) vs. number of warehouses]
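The ~2x result can be restated as a speedup and a per-thread efficiency. The helper is hypothetical; the inputs are the slide's figures normalized to the single-thread score (1 thread = 1.0, 4 threads on one core ≈ 2.0).

```python
def threading_efficiency(score_1t: float, score_nt: float, n_threads: int) -> tuple[float, float]:
    """Return (speedup over one thread, speedup divided by thread count)."""
    if score_1t <= 0 or n_threads < 1:
        raise ValueError("need a positive baseline score and n_threads >= 1")
    speedup = score_nt / score_1t
    return speedup, speedup / n_threads


speedup, efficiency = threading_efficiency(1.0, 2.0, 4)
print(f"speedup {speedup:.1f}x, per-thread efficiency {efficiency:.0%}")
```

A per-thread efficiency of 50% means the four contexts deliver half of the ideal 4x, which is still a 2x gain from one core's worth of hardware.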

CMP / CMT Scaling – CMP Benefits
[Figure: SPECjbb score (bops) vs. number of warehouses, 8 cores x 1 thread/core]

CMP / CMT Scaling – CMT Benefits
A second hardware thread per core gives about 75% of the benefit of adding a single core, with significantly lower area and power requirements (recall that 4-way MT adds ~20% area to each core).
[Figure: SPECjbb score (bops) vs. number of warehouses, 8 cores x 2 threads/core]
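A rough benefit-per-area comparison follows from the numbers on this slide, under stated assumptions: a second thread yields 0.75 of the gain of an extra core, its area cost is at most ~20% of a core (an upper bound, since ~20% is quoted for the full 4-way MT), and an extra core costs one full core's area. Power is not modeled, and `gain_per_area` is a hypothetical helper.

```python
def gain_per_area(relative_gain: float, relative_area: float) -> float:
    """Performance gain per unit of added area, both relative to one core."""
    if relative_area <= 0:
        raise ValueError("relative_area must be positive")
    return relative_gain / relative_area


core = gain_per_area(1.0, 1.0)      # extra core: full gain, full core area (assumption)
thread = gain_per_area(0.75, 0.20)  # extra thread: 75% of the gain, <= ~20% area (upper bound)
print(f"extra thread gives ~{thread / core:.2f}x the gain per unit area of an extra core")
```

Because 20% is an upper bound on the cost of one extra context, the ~3.75x ratio is a lower bound under these assumptions.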

CMP / CMT Scaling – SMT Benefits
[Figure: SPECjbb score (bops) vs. number of warehouses, 8 cores x 4 threads/core]

CMP / CMT Scaling – SMT Benefits
Going beyond 2 hardware threads per core gives an additional 45% benefit.
SMT efficiency shows gradually diminishing returns.
The garbage collector significantly affects regions 4 and 5.
[Figure: SPECjbb score (bops) vs. number of warehouses]
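Diminishing returns show up as shrinking relative gains between successive thread counts. The hypothetical helper below computes the gain of each step and the gain per added thread context. The example reads the slide's 45% as the score gain from 2 to 4 threads per core; feeding in the full set of measured scores from the figure (not reproduced here) would show the whole trend.

```python
def marginal_gains(scores_by_threads: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map each thread count to (relative gain over previous level, gain per added thread)."""
    levels = sorted(scores_by_threads)
    gains = {}
    for prev, cur in zip(levels, levels[1:]):
        step_gain = scores_by_threads[cur] / scores_by_threads[prev] - 1.0
        gains[cur] = (step_gain, step_gain / (cur - prev))
    return gains


# Scores normalized to the 2-threads-per-core configuration.
print(marginal_gains({2: 1.0, 4: 1.45}))
```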

SPECjbb Score Scaling
Panels: IPC of three configurations; best-case SPECjbb score speedup.
[Figures: normalized SPECjbb score and IPC vs. number of virtual processors]

Conclusions
Throughput vs. latency in multiprocessing/multithreaded environments:
Latency hiding is a promising alternative to aggressive speculation.
Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost.
Moving to higher levels of SMT shows diminishing returns, so there are tradeoffs between the number of cores and the number of hardware threads per core.

Thank you. Questions?
The Laboratory for Computer Architecture website: http://lca.ece.utexas.edu
