Varying Memory Size with TPC-C: Performance and Resource Effects
Jay Veazey and Blaine Gaither, Hewlett-Packard
CAECW 2008 -- Salt Lake City

Presentation transcript:

Motivation: why is this interesting?
More memory increases performance
→ How much?
→ Why exactly?
→ Reveal and quantify the underlying causes
Focus is on R&D tradeoffs
→ Performance, cost, schedule, power
→ How much memory to design into a commercial server?
→ Is memory latency more important than memory size?

Experimental Design
Vary memory (GBytes) → measure:
– Throughput
– Resource utilization: CPU, disk I/O, memory BW, CPI, OS context switches (a rough sampling sketch follows below)
HP Integrity rx6600
→ Itanium CPUs (2S / 4C)
→ About 750 disk drives
TPC-C
→ Resource intensive
→ Standard, “coin of the realm”…easy to communicate
→ Unofficial results
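As a rough illustration of how the per-run resource metrics above could be sampled: this is not the authors' harness, it assumes the third-party psutil package, and it covers only CPU utilization and disk IOs/sec (memory bandwidth and CPI come from hardware counters, which psutil does not expose).

    # Minimal sampling sketch: average CPU utilization and disk IOs/sec
    # over a measurement window. Assumes the psutil package is installed.
    import time
    import psutil

    def sample(duration_s=60, interval_s=5):
        cpu_samples = []
        io_start = psutil.disk_io_counters()
        t_start = time.time()
        while time.time() - t_start < duration_s:
            # cpu_percent blocks for interval_s and returns % busy over it
            cpu_samples.append(psutil.cpu_percent(interval=interval_s))
        io_end = psutil.disk_io_counters()
        elapsed = time.time() - t_start
        ios_per_sec = ((io_end.read_count - io_start.read_count) +
                       (io_end.write_count - io_start.write_count)) / elapsed
        return sum(cpu_samples) / len(cpu_samples), ios_per_sec

    avg_cpu, ios = sample()
    print(f"avg CPU {avg_cpu:.1f}%, disk {ios:.0f} IOs/sec")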

Throughput
Increase of 48% in throughput

Resource Utilization
Disk I/O and CPU utilization

GB Mem | Thruput | CPU Util. | IOs/sec | Relative thruput | Approx. % insts. devoted to I/O
32     | 149,…   | …%        | 71,…    | …                | …%
64     | 173,…   | …%        | 58,…    | …                | …%
96     | 184,…   | …%        | 50,…    | …                | …%
…      | …       | …%        | 44,…    | …                | …%
…      | …       | …%        | 29,…    | …                | …%

I/O reduction accounts for 20% of the 48% throughput improvement. Where's the rest of it?
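A hypothetical worked version of that attribution, only to make the arithmetic concrete: the two I/O shares below are assumed values, not the measured figures.

    # Hypothetical numbers chosen to illustrate the attribution arithmetic.
    io_share_32gb = 0.12   # assumed share of the 32 GB per-transaction pathlength spent on I/O
    io_share_big  = 0.03   # assumed share of that original pathlength still needed at the largest size

    # Eliminating 9% of the instructions per transaction raises throughput by
    # 1 / (1 - 0.09) - 1, holding CPI and available CPU cycles constant.
    gain_from_io = 1.0 / (1.0 - (io_share_32gb - io_share_big)) - 1.0
    total_gain = 0.48      # end-to-end throughput gain from the earlier slide

    print(f"gain from shorter I/O pathlength: {gain_from_io:.1%}")        # ~9.9%
    print(f"share of the 48% total gain: {gain_from_io / total_gain:.0%}")  # ~21%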

CPI and Memory
As memory is added, CPU cycles are used more efficiently
But this is an effect, not a cause: why does CPI fall?

CPI and Memory Bandwidth
CPI can change for many reasons, most of them irrelevant here
Memory accesses are the relevant one
– When a load misses in cache, the delay counts toward CPI
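A minimal sketch of the CPI decomposition this slide alludes to, with assumed miss rates and miss penalties; none of these numbers come from the measurements.

    # CPI = core CPI + sum over levels of (misses per instruction x miss penalty).
    # All values below are assumed for illustration only.
    core_cpi = 0.9                                        # CPI with an ideal memory system
    penalty_cycles = {"L2": 12, "L3": 40, "memory": 400}  # assumed miss penalties in cycles

    def total_cpi(misses_per_instr):
        return core_cpi + sum(misses_per_instr[lvl] * penalty_cycles[lvl]
                              for lvl in penalty_cycles)

    # With more memory (less I/O, fewer context switches), fewer loads miss:
    small_mem = {"L2": 0.020, "L3": 0.008, "memory": 0.004}
    large_mem = {"L2": 0.018, "L3": 0.006, "memory": 0.002}

    print(total_cpi(small_mem))   # 3.06 -- higher CPI
    print(total_cpi(large_mem))   # ~2.16 -- lower CPI: same code, fewer memory stalls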

Caches Stabilize with Increasing Memory
Units normalized for throughput
– accesses (or misses) / sec / CPU / tpmC
L1 accesses imply that the registers also stabilize
(chart: L1 accesses, L1 misses, L2 misses, and L3 misses vs. memory size)
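The normalization on this slide divides each raw counter rate by CPU count and by throughput; a tiny sketch follows, where the counter value, CPU count, and tpmC are placeholders rather than measured data.

    # events / sec / CPU / tpmC: raw event rate divided by CPUs and by throughput.
    def normalize(events_per_sec, n_cpus, tpmc):
        return events_per_sec / n_cpus / tpmc

    # Placeholder inputs, not measured values:
    l3_misses_per_sec = 5.0e6
    print(normalize(l3_misses_per_sec, n_cpus=4, tpmc=150_000))   # ~8.3 misses/sec/CPU/tpmC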

OS Thread Switches and Memory
Reduced thread switches are probably the cause of the register / cache stabilization: working sets stay around longer

Summary and Conclusions
Adding memory increases performance significantly
I/O is reduced, as is the I/O instruction pathlength
Context switches are reduced as a result of less I/O
– Fewer memory accesses
– Lower CPI
– More stable caches and registers