1 MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS Garba Ya’u Isa Master’s Thesis Oral Defense Computer Engineering King Fahd University of Petroleum & Minerals Saturday, 7th June 2003

2  Introduction  Problem Statement  Analysis of Memory Accesses  Measurement Based Performance Evaluation  Design and Implementation of Prototype  Contributions  Conclusions  Future Work Outline

3 Introduction  Processor and memory performance discrepancy  Growing network bandwidth  Data rates in terabits per second possible  Gigabit-per-second LANs already deployed  High throughput servers in network infrastructure  Streaming media servers  Web servers  Software routers

4 Dealing with Performance Gap  Hierarchical memory architecture  temporal locality  spatial locality  Constraints  Characteristics of network payload data:  Large → won’t fit into cache  Hardly reusable → poor temporal locality

5 Problem Statement  Network servers should:  Deliver high throughput  Respond to requests with low latency  Serve a large number of clients  Our goal  Identify the specific conditions at which server memory becomes a bottleneck  Includes:  cache,  main memory, and  virtual memory  Benefits  Better server designs that alleviate memory bottlenecks  Optimal performance can be achieved  Constraints  A large amount of data flows through the CPU and memory  Writing code to optimize memory utilization is a challenge

6 Analysis of Memory Accesses: Data Flow Analysis Four data transfer paths:  Memory-CPU  Memory-memory  Memory-I/O  Memory-network

7 Latency Model and Memory Overhead  Each transaction involves:  CPU cycles  Data transfers: one or more of the four identified types  Transaction latency:  T_trans = T_cpu + n1·T_m-c + n2·T_m-m + n3·T_m-disk + n4·T_m-net  T_cpu – total CPU time needed for the transaction  T_m-c – time to transfer an entire PDU from memory to CPU for processing  T_m-m – latency of a memory-memory copy of a PDU  T_m-disk – latency of a memory-I/O read/write of a block of data  T_m-net – latency of a memory-network read/write of a PDU  n_i – number of data movement operations of each type
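As a minimal sketch (not from the thesis; the example arguments are hypothetical placeholders), the latency model translates directly into C:

    /* Transaction latency per the model above; all times in usec.
     * n1..n4 count how many data movements of each type one
     * transaction performs. */
    double t_trans(double t_cpu,
                   int n1, double t_mc,    /* memory-CPU transfers     */
                   int n2, double t_mm,    /* memory-memory copies     */
                   int n3, double t_mdisk, /* memory-I/O transfers     */
                   int n4, double t_mnet)  /* memory-network transfers */
    {
        return t_cpu + n1 * t_mc + n2 * t_mm + n3 * t_mdisk + n4 * t_mnet;
    }

For instance, an HTTP-like transaction with one CPU pass, one protocol-layer copy, one disk read, and one NIC write would be t_trans(t_cpu, 1, T_m-c, 1, T_m-m, 1, T_m-disk, 1, T_m-net).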

8 Memory-CPU Transfers  PDU Processing  checksum computation and header updating  Typically, one-way data flow (memory to CPU via cache)  Memory stall cycles  Number of memory stall cycles = (IC)(AR)(MR)(MP)  Cache miss rate  Worst case: MR = 1 (not as bad as it looks; see the spatial locality slide)  Best case: MR = 0 (trivial)

9  Cache overhead in various cases:  Worst case: MR = 1, MP = 10, so (MR)(MP) = 10  Best case: MR = 0 → trivial  Average case: MR = 0.1, MP = 10, so (MR)(MP) = 1  Memory-CPU latency depends on the internal bus bandwidth:  T_m-c = S/(32·B_i) usec, where S is the PDU size and B_i is the internal bus bandwidth in MB/s Memory-CPU Transfers cont.

10  Memory-memory transfer:  Due to memory copies of PDUs between protocol layers  Transfers through caches and CPU  Stride = 1 (contiguous)  Transfer involves memory → cache → CPU → cache → memory data movement  Latency:  Depends on internal (system) bus bandwidth  T_m-m = 2S/B_i usec Memory-Memory Transfers

11  Memory-network transfers:  Pass over the I/O bus  DMA can be used  Again, stride = 1 (contiguous)  Latency:  Limiting factor is the I/O bus bandwidth  T_m-net = S/B_e usec, where B_e is the I/O (external) bus bandwidth in MB/s Memory-I/O and Memory-Network Transfers

12 Latency of Reference Applications [Equations 1, 2, and 3: RTP, HTTP, and IP transaction latency]

13  Assumptions  CPU latency is negligible compared to data transfer latency and can be ignored  Bus contention from multiple simultaneously executing transactions does not add any overhead  Server throughput = S/T  S = size of transaction data  T = latency of a transaction given by equations 1, 2 and 3 Peak Throughputs
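The per-path latencies combine with S/T as in the following sketch (the bandwidths, PDU size, and the assumed IP-forwarding transfer counts are illustrative placeholders, not the thesis figures):

    #include <stdio.h>

    int main(void)
    {
        double S  = 1460.0;  /* PDU size in bytes (hypothetical)            */
        double Bi = 2100.0;  /* internal bus bandwidth, MB/s (hypothetical) */
        double Be = 1064.0;  /* I/O bus bandwidth, MB/s (hypothetical)      */

        double t_mc  = S / (32.0 * Bi); /* memory-CPU latency, usec         */
        double t_net = S / Be;          /* memory-network latency, usec     */

        /* assumed composition of one IP-forwarding transaction:
         * one CPU pass over the PDU plus NIC-in and NIC-out transfers */
        double T = t_mc + 2.0 * t_net;  /* usec */

        /* bytes/usec * 8 = Mbits/sec */
        printf("IP forwarding peak: %.0f Mbits/sec\n", S * 8.0 / T);
        return 0;
    }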

14 Peak Throughputs cont. [Table: internal bus bandwidth (MB/sec) and throughput of the three network applications (IP forwarding, HTTP, and RTP streaming, in Mbits/sec) for Intel Pentium IV 3.06 GHz, AMD Athlon XP, MIPS R (clock not recovered), and Sun UltraSPARC III 900 MHz; the numeric cells were not recovered.]

15 Measurement Based Performance Evaluation  Experimental testbed (see backup slide)  Dual-boot server (Pentium IV 2.0 GHz)  256 MB RAM  1.0 Gbps NIC  Closed LAN (Cisco Catalyst 3550 switch, 1.0 Gbps)  Tools  Intel VTune  Windows Performance Monitor  Netstat  Linux tools: vmstat, sar, iostat

16 Platforms and Applications  Platforms  Linux (kernel)  Windows 2000  Applications  Streaming media servers  Darwin streaming server  Windows media server  Web servers  Apache web server  Microsoft Internet Information Services (IIS)  Software router  Linux kernel IP forwarding

17 Analysis of Operating System Role  Memory Throughput Test  ECT (extended copy transfer) – memperf  Locality of reference:  temporal locality – varying working set size (block size)  spatial locality – varying access pattern (strides)
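A minimal sketch of such a locality test (an assumption about how an ECT-style measurement works, not memperf’s actual code); the working-set size and the stride are the two knobs being varied:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t size   = 8u << 20;  /* working-set size: 8 MB (hypothetical) */
        size_t stride = 64;        /* access pattern: one read per stride   */
        volatile char *buf = malloc(size);
        if (!buf) return 1;

        struct timespec t0, t1;
        unsigned long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 100; rep++)
            for (size_t i = 0; i < size; i += stride)
                sum += buf[i];             /* touch one byte per stride */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mb  = 100.0 * (size / (double)stride) / 1e6; /* MB explicitly read */
        printf("stride %zu: %.1f MB/s (sum=%lu)\n", stride, mb / sec, sum);
        free((void *)buf);
        return 0;
    }

Sweeping the size exposes the temporal-locality cliffs at the L1 and L2 capacities; sweeping the stride past the cache line size exposes the spatial-locality penalty.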

18  Context switching overhead Analysis of Operating System Role cont.
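For reference, the context-switch counts that vmstat and sar report come from the kernel’s /proc/stat counter, which a test harness can also sample directly (a Linux-specific sketch, not the thesis tooling):

    #include <stdio.h>

    /* Return the cumulative number of context switches since boot,
     * or -1 on error, by scanning /proc/stat for the "ctxt" line. */
    long read_ctxt(void)
    {
        char line[256];
        long v = -1;
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return -1;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "ctxt %ld", &v) == 1)
                break;
        fclose(f);
        return v;
    }

Sampling this before and after a run and dividing by the elapsed time gives the context-switch rate.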

19 Streaming Media Servers Experimental Design  Factors  Number of streams (streaming clients)  Media encoding rate (56 kbps and 300 kbps)  Stream distribution (unique and multiple media)  Metrics  Cache misses (L1 and L2)  Page fault rate  Throughput  Benchmarking Tools  DSS – streaming load tool  WMS – media load simulator

20 Cache Performance  L1 cache misses (56 kbps)

21  L1 cache misses (300 kbps) Cache Performance cont.

22 Memory Performance  Page faults (300 kbps)

23 Throughput  Throughput (300 kbps)

24 Summary: Streaming Media Server Memory Performance  Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple media objects.  When clients request unique media objects, the page fault rate is constant; when they request multiple objects, the page fault rate increases with the number of clients.  Throughput increases with the number of clients, and the higher encoding rate (300 kbps) also yields higher throughput. The Darwin streaming server has lower throughput than the Windows media server.

25 Web Servers Experimental Design  Factors  Number of web clients  Document size  Metrics  Cache misses (L1 and L2)  Page fault rate  Throughput  Transactions/sec (connection rate)  Average latency  Benchmarking Tool  WebStone

26 Transactions

27 L1 Cache Miss

28 Page Fault

29 Throughput

30 Summary: Web Server Memory Performance Evaluation [Table comparing Apache and IIS for an average file size of 10 KB: max. transaction rate (conn/sec), max. throughput (Mbps), CPU utilization (%), L1 misses (millions), L2 misses (millions), and page fault rate (pfs/sec). Among the surviving cells: a throughput of 349 Mbps (62% more than Apache), a figure 58% higher than Apache’s, a CPU utilization of 63%, and page fault rates below 10/sec; the remaining values were not recovered.]

31 Software Router  Experimental Design  Factors  Routing configurations (see backup slide)  TCP message size (64 bytes, 10 KB, and 64 KB)  Metrics  Throughput  Context switch rate  Number of active pages  Benchmarking Tool  Netperf

32 Software Router Throughput

33 CPU Utilization

34 Context Switching

35 Active Pages

36 Summary: Software Router Performance Evaluation  Maximum throughput of 449 Mbps for configuration 2 (full-duplex one-to-one communication).  The highest CPU utilization was 84%, and the highest context switching rate was 5378/sec.  The number of active pages was fairly uniformly distributed, indicating low memory activity.

37 Design, Implementation and Evaluation of Prototype DB-RTP Server Architecture  Implementation  Linux platform (C)  Our implementation of RTSP/RTP (why?)

38 Double Buffering and Synchronization [Diagram: buffer read and buffer write]
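A minimal sketch of the double-buffering idea (an assumption for illustration, not the DB-RTP code): a disk-reader thread fills one buffer while the RTP-sender thread drains the other, and the two swap at a rendezvous so that disk latency is hidden behind network transmission:

    #include <pthread.h>

    #define BUF_SZ (64 * 1024)

    static char buf[2][BUF_SZ];
    static int  fill_idx = 0;    /* buffer currently being filled       */
    static int  done[2];         /* reader/sender finished current turn */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    /* Two-thread barrier: the last thread to finish its buffer swaps
     * the buffers and wakes the peer.  who: 0 = reader, 1 = sender. */
    static void rendezvous(int who)
    {
        pthread_mutex_lock(&lock);
        done[who] = 1;
        if (done[0] && done[1]) {
            done[0] = done[1] = 0;
            fill_idx = 1 - fill_idx;   /* swap read and write buffers */
            pthread_cond_broadcast(&cv);
        } else {
            while (done[who])          /* wait for the peer to arrive */
                pthread_cond_wait(&cv, &lock);
        }
        pthread_mutex_unlock(&lock);
    }

    /* reader thread loop:  read_media_block(buf[fill_idx]);     rendezvous(0);
     * sender thread loop:  send_rtp_packets(buf[1 - fill_idx]); rendezvous(1);
     * read_media_block and send_rtp_packets are hypothetical helpers. */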

39 RTP Server Throughput

40 Jitter

41  Throughput  DB-RTP server – Mbps  RTP server – 59 Mbps  Both servers exhibit steady jitter, but the DB-RTP server has relatively lower jitter than the RTP server. Summary: DB-RTP Server Performance Evaluation

42 Contributions  Cache overhead analysis  Memory latency and bandwidth analysis  Measurement-based performance evaluation  Design, implementation, and evaluation of a prototype streaming server, the Double Buffer RTP (DB-RTP) server

43 Conclusions  High throughput is possible with server design enhancements.  Server throughput is significantly degraded by excessive cache misses and page faults.  Latency hiding with prefetching and buffering can improve throughput and jitter performance.

44 Future Work  Server Development  hybrid = multiplexing + multithreading  Special Architectures (Network processors & ASICs)  resource scheduling  investigation of the role of I/O  use of IRAM (intelligent RAM) architectures  integrated network infrastructure server

45 Thank you

46 Array restructuring, loop nest transformation, and array padding [illustrations]

47 Testbeds [Diagrams: streaming media/web server testbed; software router testbed]

48 Communication Configurations

49 Backup slides

50 Memory Performance  Page faults (56 kbps and 300 kbps)

51 Streaming Server: CPU Utilization

52  L2 cache misses (56 kbps) Cache Performance cont.

53  L2 cache misses (300 kbps) Cache Performance cont.

54 Web Servers Cache Performance [charts: L1 cache misses, L2 cache misses, transactions]

55 Web Servers [charts: latency, CPU utilization]

56 DB-RTP Server [charts: L1 cache misses, L2 cache misses, CPU utilization]

57 Memory Performance Evaluation Methodologies  Analytical  Requires just paper and pencil  Accuracy?  Simulation  Requires programming  Time and cost?  Measurement  Real system or a prototype required  Using on-chip counters  Benchmarking tools  More accurate

58 Server Performance Tuning  Memory performance tuning  Array padding  Array restructuring  Loop nest transformation  Latency hiding and multithreading  EPIC (IA-64)  VIRAM  Impulse  Multiprocessing and clustering  Task parallelization  E.g. Panama cluster router  Special Architectures  Network processors  ASICs and data flow architectures
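As an illustration of the first technique, array padding (a generic sketch; the pad size is hypothetical and would be tuned to the target cache):

    /* With a power-of-two row size, the same column of successive rows
     * maps to the same cache set and keeps evicting itself on a column
     * walk; a small pad per row breaks the conflict pattern. */
    #define N   1024
    #define PAD 8                 /* hypothetical pad, cache-dependent */

    double plain[N][N];           /* row stride: 8192 bytes, conflict-prone */
    double padded[N][N + PAD];    /* row stride: 8256 bytes                 */

    double column_sum(int j)      /* walks one column: padded wins */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += padded[i][j];
        return s;
    }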

59  Temporal vs. spatial locality  A PDU lacks temporal locality  Observation: PDU processing exhibits excellent spatial locality  Suppose a data cache line is 32 bytes (or 16 two-byte words) long  Sequential accesses with stride = 1  Accessing one word brings in the other 15 words as well  Thus, effective MR = 1/16 = 6.25% → better than even scientific apps  Thus, generally MR = W/L  W – width of each memory access (in bytes)  L – length of each cache line (in bytes)  Validation of the above observation:  Similar spatial locality characteristics reported via measurements: S. Sohoni et al., “A Study of Memory System Performance of Multimedia Applications,” in Proc. of ACM SIGMETRICS 2001  MR for a streaming media player is better than SPEC benchmark apps!
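For example, an Internet-checksum-style pass over a PDU (a sketch, not the thesis code) makes 2-byte sequential accesses, so only the first access to each 32-byte line misses:

    #include <stdint.h>
    #include <stddef.h>

    /* One-pass 16-bit ones'-complement checksum over a PDU.  Accesses
     * are W = 2 bytes wide with stride 1, so MR = W/L = 2/32 = 6.25%. */
    uint16_t checksum16(const uint16_t *pdu, size_t words)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < words; i++)   /* sequential, stride = 1 */
            sum += pdu[i];
        while (sum >> 16)                    /* fold carries back in   */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }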

60 Memory-CPU Transfers  PDU Processing  checksum computation and header updating  Typically, one-way data flow (memory to CPU via cache)  Memory stall cycles  Number of memory stall cycles = (IC)(AR)(MR)(MP)  IC – instruction count per transaction  AR – number of memory accesses per instruction (AR = 1)  MR – ratio of cache misses to memory accesses  MP – miss penalty in clock cycles  Cache miss rate  Worst case: MR = 1 while typically MP = 10  Stall cycles = 10 × IC

61  Determine cache overhead w.r.t. execution time:  (Execution time)_no-cache = (IC)(CPI)(CC)  (Execution time)_with-cache = (IC)(CPI)(CC){1 + (MR)(MP)}  Cache overhead = 1 + (MR)(MP)  Cache overhead in various cases:  Worst case: MR = 1 and MP = 10 → cache misses result in 11 times higher latency for each transaction!  Best case: MR = 0 → trivial  Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1 → latency due to stalls = ideal execution time without stalls  Memory-CPU latency depends on the internal bus bandwidth:  T_m-c = S/(32·B_i) usec, where S is the PDU size and B_i is the internal bus bandwidth in MB/s Memory-CPU Transfers cont.

62 Open Questions  Role of special-purpose architectures (e.g., network processors) in the performance of high throughput servers  Role of memory compression  Role of scheduling