CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara) Dimitris Kaseridis and Lizy K. John The University of Texas at Austin Laboratory for Computer.

Presentation transcript:

CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)
Dimitris Kaseridis and Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)

Outline
- Brief description of UltraSPARC T1
- Objectives
- SPECjbb2005 benchmark
- Results

UltraSPARC T1
- A new multithreaded processor that combines CMP and SMT into CMT
- 8 cores, each handling 4 hardware thread contexts, for 32 active hardware thread contexts in total
- Simple in-order pipeline per core, with no branch prediction unit
- Optimized for multithreaded performance (throughput)
- High throughput: hides memory and pipeline stalls/latencies by scheduling other threads, with a zero-cycle thread switch penalty
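The latency-hiding idea on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' methodology: it assumes every thread alternates between a fixed number of busy cycles and a fixed number of stall cycles, and that a stalled thread is replaced at zero cost. The function name and the cycle counts are hypothetical.

```python
# Toy model of latency hiding on one multithreaded core (illustrative only).
# Assumptions (not from the slides): each thread runs `run_cycles` cycles,
# then stalls for `stall_cycles` cycles, and thread switches cost zero cycles.

CORES = 8
THREADS_PER_CORE = 4


def pipeline_utilization(threads: int, run_cycles: float, stall_cycles: float) -> float:
    """Return the fraction of cycles in which the core issues an instruction."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # A single thread keeps the pipeline busy only while it is not stalled.
    single_thread = run_cycles / (run_cycles + stall_cycles)
    # Extra threads fill the stall slots until the pipeline saturates at 1.0.
    return min(1.0, threads * single_thread)


if __name__ == "__main__":
    print("hardware thread contexts:", CORES * THREADS_PER_CORE)
    for n in range(1, THREADS_PER_CORE + 1):
        # Hypothetical workload: 10 busy cycles followed by a 30-cycle stall.
        print(n, "thread(s):", pipeline_utilization(n, run_cycles=10, stall_cycles=30))
```

In this idealized model utilization grows linearly with the thread count until the pipeline saturates. Real workloads also contend for the shared caches and execution units, so measured scaling is typically lower.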

SMP vs. CMT

UltraSPARC T1 Core Pipeline
- Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath
- Core area = 11 mm² (90 nm technology)
- 4-way MT adds ~20% area to the core

Objectives
- Evaluate CMP/CMT benefits
- Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment
- Show the effectiveness of latency hiding

SPECjbb2005 Benchmark
Characteristics:
- Models a self-contained 3-tier system: server, database and clients
- Every warehouse is a collection of Java objects with ~25 MB of data
- Each client is represented by an individual thread
- No I/O effects
- Reported score: Business Operations per Second (BOPS)
Targets the performance of CPUs, caches, the memory hierarchy and the scalability of shared-memory processors.
Stresses the implementations of the JVM (Java Virtual Machine), the JIT (Just-In-Time) compiler, garbage collection and threads.
Figure: SPECjbb tier architecture

Parameters
Experimental parameters (parameter: value):
- Operating system: SunOS 5.10 Generic_
- CPU frequency: 1 GHz
- Main memory size: 8 Gbytes DDR2 DRAM
- JVM version: Java(TM) 2 build 1.5.0_06-b05
- SPECjbb execution command: java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props
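The heap layout implied by the size flags in the execution command can be worked out with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it relies on the usual HotSpot meaning of the flags (-Xmx maximum heap, -Xms initial heap, -Xmn young generation, -Xss per-thread stack).

```python
import re

# Size flags copied from the SPECjbb execution command on the slide.
JVM_FLAGS = ["-Xmx2560m", "-Xms2560m", "-Xmn1536m", "-Xss128k"]

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(flag: str, prefix: str) -> int:
    """Return the size in bytes encoded in a flag such as '-Xmx2560m'."""
    match = re.fullmatch(re.escape(prefix) + r"(\d+)([kKmMgG])", flag)
    if match is None:
        raise ValueError(f"{flag!r} is not a {prefix} size flag")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def size_of(prefix: str) -> int:
    """Look up one size flag (hypothetical helper) in JVM_FLAGS."""
    for flag in JVM_FLAGS:
        if flag.startswith(prefix):
            return parse_size(flag, prefix)
    raise KeyError(prefix)


MIB = 1024 ** 2
max_heap_mib = size_of("-Xmx") // MIB
initial_heap_mib = size_of("-Xms") // MIB
young_gen_mib = size_of("-Xmn") // MIB
# Whatever the young generation does not take is left for the old generation.
old_gen_mib = max_heap_mib - young_gen_mib

if __name__ == "__main__":
    print("fixed-size heap:", max_heap_mib == initial_heap_mib)
    print("young generation (MiB):", young_gen_mib)
    print("old generation (MiB):", old_gen_mib)
```

Setting -Xms equal to -Xmx fixes the heap size for the whole run, so heap resizing does not perturb the measurements.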

Measurement Methodology
- On-chip performance counters for real, accurate results
- Niagara: Solaris 10 tools cpustat and cputrack
- 2 counters per hardware thread, one of which counts instructions only
Events (event name: description):
- Instr_cnt: Number of completed instructions
- SB_full: Number of store buffer full cycles
- FP_instr_cnt: Number of completed floating-point instructions
- IC_miss: Number of instruction cache (L1) misses
- DC_miss: Number of data cache (L1) misses for loads
- ITLB_miss: Number of instruction TLB miss traps taken
- DTLB_miss: Number of data TLB miss traps taken (includes real_translation misses)
- L2_imiss: Number of secondary cache (L2) misses due to instruction cache requests
- L2_dmiss_ld: Number of secondary cache (L2) misses due to data cache load requests
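Because one of the two counters always holds the instruction count, each sample pairs a single event with Instr_cnt, which makes per-instruction rates the natural derived metric. The sketch below shows that post-processing step. The Sample type, the function names and the sample values are hypothetical and are not measurements from the presentation; the 1 GHz clock comes from the parameters slide.

```python
from dataclasses import dataclass

CPU_HZ = 1_000_000_000  # 1 GHz, from the experimental parameters


@dataclass(frozen=True)
class Sample:
    """One sampling interval for one hardware thread (hypothetical layout)."""
    event: str          # event selected on the first counter, e.g. "DC_miss"
    event_count: int    # value of the selected event counter
    instr_count: int    # value of the Instr_cnt counter
    interval_ns: int    # length of the sampling interval in nanoseconds


def events_per_kilo_instr(sample: Sample) -> float:
    """Events per 1000 completed instructions (e.g. misses per kilo-instruction)."""
    return 1000.0 * sample.event_count / sample.instr_count


def ipc(sample: Sample) -> float:
    """Instructions per cycle for this hardware thread over the interval."""
    cycles = sample.interval_ns * CPU_HZ // 1_000_000_000
    return sample.instr_count / cycles


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    s = Sample(event="DC_miss", event_count=25_000,
               instr_count=1_000_000, interval_ns=4_000_000)
    print(s.event, "per 1000 instructions:", events_per_kilo_instr(s))
    print("IPC:", ipc(s))
```

Normalizing by the instruction count lets events that were collected in separate runs be compared, since only one event can be measured per hardware thread at a time.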

Results: latency hiding pays off
- Plot: single-thread execution on T1 (x-axis: number of warehouses; y-axis: SPECjbb score, BOPS)
- Plot: single-core execution using 4 threads on one core (x-axis: number of warehouses; y-axis: SPECjbb score, BOPS)
- Plot annotation: x2 instead of x4

CMP/CMT Scaling: CMP benefits
- Plot: 8 cores x 1 thread/core (x-axis: number of warehouses; y-axis: SPECjbb score, BOPS)

CMP/CMT Scaling: CMT benefits
- 75% of the benefit of adding a single core
- Significantly lower area and power requirements (recall that 4-way MT adds ~20% area to each core)
- Plot: 8 cores x 2 threads/core (x-axis: number of warehouses; y-axis: SPECjbb score, BOPS)

CMP/CMT Scaling: SMT benefits
- Plot: 8 cores x 4 threads/core (x-axis: number of warehouses; y-axis: SPECjbb score, BOPS)

CMP/CMT Scaling: SMT benefits
- Going beyond 2 hardware threads per core gives an additional benefit of 45%
- Gradually diminishing returns in terms of SMT efficiency
- The garbage collector significantly affects regions 4 and 5
- Plot: x-axis: number of warehouses; y-axis: SPECjbb score (BOPS)

SPECjbb Score Scaling
- Plot: IPC of three configurations
- Plot: best-case SPECjbb score speedup
- Axes: number of virtual processors; normalized SPECjbb score; IPC

Conclusions
- Throughput vs. latency in multiprocessing/multithreaded environments
- Latency hiding is a promising technique compared with aggressive speculation
- Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost
- Moving to higher levels of SMT shows diminishing returns: a tradeoff between the number of cores and the number of hardware threads per core
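The cost argument behind these conclusions can be put into rough numbers. The sketch below is a back-of-the-envelope illustration only: it combines the "up to 75% of the benefit of CMP" figure with the "~20% area" figure quoted earlier for 4-way MT, and it assumes that benefit and area can be compared as simple ratios, which the presentation does not claim. All names are hypothetical.

```python
# Back-of-the-envelope benefit-per-area comparison (illustrative assumptions only).
# Units are relative: adding one core costs 1.0 "core area" and yields benefit 1.0.

MT_AREA_OVERHEAD = 0.20       # ~20% extra core area for 4-way MT (from the slides)
SMT_RELATIVE_BENEFIT = 0.75   # up to 75% of the benefit of adding a core (from the slides)


def benefit_per_area(benefit: float, area: float) -> float:
    """Return benefit obtained per unit of added silicon area."""
    if area <= 0:
        raise ValueError("area must be positive")
    return benefit / area


extra_core = benefit_per_area(benefit=1.0, area=1.0)
extra_threads = benefit_per_area(benefit=SMT_RELATIVE_BENEFIT, area=MT_AREA_OVERHEAD)
# How many times more benefit per unit area multithreading gives in this model.
advantage = extra_threads / extra_core

if __name__ == "__main__":
    print("benefit per area, extra core:   ", extra_core)
    print("benefit per area, extra threads:", round(extra_threads, 2))
    print("relative advantage of threads:  ", round(advantage, 2))
```

Under these assumptions multithreading delivers several times more benefit per unit of area than an extra core, but the diminishing returns noted above mean the ratio would shrink as more threads are added to each core.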

Thank you… Questions?
The Laboratory for Computer Architecture web site: