Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.

Presentation transcript:

Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of Wisconsin-Madison

Outline
- TPC-W Benchmarks in Java
- IBM RS6000 S80 Enterprise Server
- Hardware Counters in S80
- Experiment Results
- Problems and Future work
- Conclusions

TPC-W benchmark
- TPC-W is the TPC's new benchmark for Transactional Web Environments (E-Commerce).
- Models an online book store, with three workload mixes:
  - Browsing: 95% browsing, 5% transactions
  - Shopping: 80% browsing, 20% transactions
  - Ordering: 50% browsing, 50% transactions
- Transactional Web Environment:
  - Web serving of static and dynamic content
  - Online transaction processing (OLTP)
  - Some decision support (DSS)
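The three mixes above differ only in the ratio of browsing to transactional interactions. A minimal sketch of how a load generator might draw interactions according to those ratios is shown below. The function names and the two-label model ("browse" / "transaction") are illustrative assumptions, not part of the TPC-W specification or of the emulated browser used in this study.

```python
import random

# Fraction of browsing interactions per mix, taken from the slide above.
# The remainder of each mix is transactional.
MIXES = {
    "browsing": 0.95,
    "shopping": 0.80,
    "ordering": 0.50,
}


def sample_interactions(mix, n, seed=0):
    """Draw n interaction labels ('browse' or 'transaction') for a mix.

    Hypothetical helper: a real emulated browser chooses among many
    specific page types, not just two coarse classes.
    """
    rng = random.Random(seed)
    p_browse = MIXES[mix]
    return ["browse" if rng.random() < p_browse else "transaction"
            for _ in range(n)]


def browse_fraction(labels):
    """Return the observed fraction of browsing interactions."""
    return labels.count("browse") / len(labels)


if __name__ == "__main__":
    for name in MIXES:
        labels = sample_interactions(name, 20000)
        print(name, round(browse_fraction(labels), 3))
```

With a large enough sample, the observed browsing fraction for each mix should sit close to the target ratio listed on the slide.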

IBM RS6000 S80 Enterprise Server
- 6 RS64-III Pulsar processors (451MHz)
  - 4-issue in-order superscalar microprocessor with on-chip 128KB L1 I-Cache, 128KB L1 D-Cache and 8MB L2 Cache.
  - No branch prediction; aggressive early branch resolution.
  - 2-way coarse-grained multithreading.
- SMP system with a snooping bus as the inter-processor connection.
- 8GB main memory, large disk volumes, and high-bandwidth I/O systems.
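The clock rate, issue width and processor count on this slide imply a theoretical peak issue rate, which is a useful yardstick when reading the IPC results later in the deck. The sketch below is plain arithmetic on the slide's figures. It is an upper bound under the assumption that every issue slot is filled every cycle, not a measured value, and the function names are hypothetical.

```python
# Figures taken from the slide above.
CLOCK_HZ = 451_000_000   # 451MHz
ISSUE_WIDTH = 4          # 4-issue in-order superscalar
PROCESSORS = 6           # 6 RS64-III processors


def peak_instructions_per_second(clock_hz, issue_width, processors=1):
    """Theoretical peak: every issue slot filled on every cycle."""
    return clock_hz * issue_width * processors


def issue_slot_utilization(measured_ipc, issue_width=ISSUE_WIDTH):
    """Fraction of issue slots used, given a measured per-processor IPC."""
    return measured_ipc / issue_width


if __name__ == "__main__":
    per_cpu = peak_instructions_per_second(CLOCK_HZ, ISSUE_WIDTH)
    system = peak_instructions_per_second(CLOCK_HZ, ISSUE_WIDTH, PROCESSORS)
    print(f"per processor: {per_cpu:,} instructions/s")
    print(f"whole system:  {system:,} instructions/s")
```

Dividing a measured IPC by the issue width gives the fraction of issue slots actually used, which is one way to quantify how efficiently the dispatch mechanism is exercised.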

System Configuration
[Diagram. Labels shown: two RS64-III processors, each with a 32-bit control word; snooping bus; AIX kernel; Kernel Extension; Performance Monitor; Java Virtual Machine; Emulated Browser; Java Virtual Machine; SUN Java Web Server 2.0; Java Servlet; DB2 DBMS processes; JDBC; http.]

Hardware Counters in S80
- 3 major components
  - 8 built-in hardware counters in each RS64-III processor.
  - Kernel extension to AIX.
  - Performance Monitor API in the next release of AIX.
- 3 levels of counting, each with its own counting context:
  - System-level counting: whole-system context.
  - Process / process group: process-level context.
  - Individual thread: thread-level context.
- Some problems with the current version:
  - Cannot count for an individual processor.
  - Some listed events are not available.

Hardware Counters in S80
- Processor events
  - Execution cycles and the number of instructions executed.
- Instruction mix events
  - Pipeline M, S, B and S instructions executed.
- Branch events
  - Conditional branch T/NT events, unconditional branches, zero-cycle branches.
- Address translation events
  - TLB/SLB and ERAT/IERAT miss and duration events.
- Cache events
  - Cache misses and latencies for each of the L1 I-Cache, L1 D-Cache and L2 Cache.
- Bus and multi-processor bus snooping events
  - Bus utilization, multi-processor bus snooping events.
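Raw event counts like those listed above are usually reduced to a few derived metrics before plotting. The sketch below shows three common reductions: IPC, the taken ratio of conditional branches, and cache misses per thousand instructions. The `CounterSample` type, its field names and the sample values are hypothetical placeholders, not the AIX Performance Monitor API and not data from this study.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a group of hardware counters (hypothetical layout)."""
    cycles: int
    instructions: int
    cond_branches_taken: int
    cond_branches_not_taken: int
    l1d_misses: int


def ipc(s):
    """Instructions per cycle."""
    return s.instructions / s.cycles


def taken_ratio(s):
    """Fraction of conditional branches that were taken."""
    total = s.cond_branches_taken + s.cond_branches_not_taken
    return s.cond_branches_taken / total


def misses_per_kilo_instruction(misses, instructions):
    """Cache misses per 1000 executed instructions (MPKI)."""
    return 1000 * misses / instructions


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = CounterSample(
        cycles=1000,
        instructions=500,
        cond_branches_taken=30,
        cond_branches_not_taken=10,
        l1d_misses=5,
    )
    print("IPC:", ipc(sample))
    print("taken ratio:", taken_ratio(sample))
    print("L1D MPKI:",
          misses_per_kilo_instruction(sample.l1d_misses, sample.instructions))
```

Misses per thousand instructions is used here instead of a miss rate because it needs only miss and instruction counts, both of which appear in the event list, whereas a miss rate would also need an access count.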

Results: IPC for RBE, JavaWebServer and DB2

Results: Instruction Dispatch

Results: Instruction Mix

Results: Branch Behavior

Results: Cache Behavior

Problems & Future Work
- Problems:
  - Large dataset
  - Network and server-end software are the bottleneck?
  - Hardware counters vs. simulations.
- Future work:
  - Measurement of other transaction processing and web serving benchmarks for comparison.
  - More architectural characterizations, such as multithreaded processors, multiprocessor snooping and scaling.

Conclusions
- Server-end software is critical for high-end servers
  - Network and server-end software are the bottleneck
  - This is true for
- Preliminary performance characterization shows:
  - CPU utilization is highly dependent on the application workload.
  - The wide dispatch mechanism on the RS64-III appears to be used inefficiently.
  - Branch behavior depends on the web interactions.
  - L2 cache miss rate is unreasonably low and

Acknowledgement