File System Benchmarking

File System Benchmarking Advanced Research Computing

Outline
IO benchmarks
- What is benchmarked
- Micro-benchmarks
- Synthetic benchmarks
Benchmark results for
- Shelter NFS server, client on hokiespeed
- NetApp FAS 3240 server, clients on hokiespeed and blueridge
- EMC Isilon X400 server, clients on blueridge

IO BENCHMARKING

IO Benchmarks
Micro-benchmarks: measure one basic operation in isolation
- Read and write throughput: dd, IOzone, IOR
- Metadata operations (file create, stat, remove): mdtest
- Good for: tuning an operation, system acceptance
Synthetic benchmarks: mix of operations that model real applications
- Useful if they are good models of real applications
- Examples: kernel build, kernel tar and untar, NAS BT-IO

IO Benchmark pitfalls
Not measuring what you want to measure: masking of the results by various caching and buffering mechanisms
Examples of different behaviors:
- Sequential bandwidth vs. random IO bandwidth
- Direct IO bandwidth vs. bandwidth in the presence of the page cache (in the latter case an fsync is needed)
- Caching of file attributes: stat-ing a file on the same node on which the file has been written
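To make the page-cache effect concrete, a minimal sketch of three dd write runs (the target path is a placeholder, not from the slides): the first can be satisfied largely by the client page cache, the second forces the data out before dd exits, and the third bypasses the cache entirely.
  # Buffered write: the timing may mostly reflect the client page cache
  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024
  # Buffered write plus fsync before dd exits, so the flush is included in the time
  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 conv=fsync
  # Direct IO write: bypasses the page cache
  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 oflag=direct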

What is benchmarked
What we measure is the combined effect of:
- the native file system on the NFS server (shelter)
- NFS server performance, which depends on factors such as enabling/disabling write-delay and the number of server threads
  - Too few threads: client retries several times
  - Too many threads: server thrashing
- the network between the compute cluster and the NFS server
- NFS client and mount options
  - Synchronous or asynchronous
  - Enable/disable attribute caching
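For reference, a hedged sketch of the kind of client-side mount options being contrasted (the export path and mount point are placeholders; the exact option sets used are not given in the slides):
  # Asynchronous writes, attribute caching enabled (typical defaults)
  mount -t nfs -o rw,async shelter:/export /mnt/shelter
  # Synchronous writes, attribute caching disabled
  mount -t nfs -o rw,sync,noac shelter:/export /mnt/shelter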

Micro-benchmarks
- IOzone: measures read/write bandwidth; historical benchmark with the ability to test multiple readers/writers
- dd: measures read/write bandwidth; tests file write/read
- mdtest: measures metadata operations per second; file/directory create/stat/remove
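As an illustration, a single-node IOzone run of the kind summarized later might look like this (file size, record size, thread count, and paths are illustrative, not the exact parameters used):
  # Sequential write (-i 0) and read (-i 1) of a 1 GB file with 1 MB records,
  # using O_DIRECT (-I) to keep the client page cache out of the measurement
  iozone -i 0 -i 1 -s 1g -r 1m -I -f /mnt/shelter/iozone.tmp
  # Throughput mode with 4 concurrent threads, one test file per thread
  iozone -i 0 -i 1 -s 1g -r 1m -I -t 4 -F t1 t2 t3 t4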

Mdtest: metadata test
Measures the rate of file/directory create, stat, and remove operations
Mdtest creates a tree of files and directories
Parameters used:
- tree depth: z = 1
- branching factor: b = 3
- number of files/directories per tree node: I = 256
- stat run by another node than the create node: N = 1
- number of repeats of the run: i = 5
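These parameters correspond to an invocation along the following lines (a sketch; the target directory is a placeholder):
  # depth-1 tree, branching factor 3, 256 items per tree node,
  # stat performed by a neighbouring task (-N 1), 5 repetitions
  mdtest -z 1 -b 3 -I 256 -N 1 -i 5 -d /mnt/shelter/mdtest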

Synthetic benchmarks
- tar-untar-rm: measures time
  - tests creation/deletion of a large number of small files
  - tests filesystem metadata creation/deletion
- NAS BT-IO: measures bandwidth and time spent doing IO
  - solves a block tri-diagonal linear system arising from the discretization of the Navier-Stokes equations

Kernel source tar-untar-rm
Run on 1 to 32 nodes. Tarball size: 890M
Total directories: 4732; max directory depth: 10
Total files: 75984; max file size: 919 kB
File size distribution:
- <= 1k: 14490
- <= 10k: 40190
- <= 100k: 20518
- <= 1M: 786
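A sketch of the per-node test sequence, assuming a kernel source tarball staged on the NFS mount (file and directory names are illustrative):
  cd /mnt/shelter/$HOSTNAME                 # each node works in its own directory
  time tar xf linux-kernel.tar              # untar: tens of thousands of small file creates
  time tar cf linux-copy.tar linux-kernel   # tar: many small-file reads
  time rm -rf linux-kernel                  # rm: metadata-heavy deletes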

NAS BT-I/O
Test mechanism: BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional compressible Navier-Stokes equations. The finite-differences solution is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension. BT-I/O is a test of different parallel I/O techniques in BT.
Reference: http://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf
What it measures:
- Multiple cores doing I/O to a single large file (blocking MPI calls mpi_file_write_at_all and mpi_file_read_at_all)
- I/O timing percentage, total data written, I/O data rate
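For context, a hedged sketch of how a BT-IO run is typically built and launched from the NPB MPI distribution (class, process count, launcher, and binary name are illustrative and depend on the NPB version; SUBTYPE=full selects the collective MPI-IO variant):
  # Build the class C, 4-process BT benchmark with full MPI-IO
  make bt CLASS=C NPROCS=4 SUBTYPE=full
  # Run with 4 MPI ranks; the binary name follows the NPB naming scheme
  mpirun -np 4 bin/bt.C.4.mpi_io_full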

SHELTER NFS RESULTS

dd throughput (MB/sec)
Run on 1 to 32 nodes
Two block sizes: 1MB and 4MB
Three file sizes: 1GB, 5GB, 15GB

Block size  File size  Average  Median  Stdev
1M          1G         8.01     6.10    4.58
1M          5G         7.75     5.95    4.52
1M          15G        5.74     5.60    0.34
4M          4G         11.17    11.80   2.87
4M          20G        15.71    12.70   10.68
4M          60G        14.60    10.50   9.22
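For reference, the block-size/file-size combinations above map to dd invocations along these lines (the target path is a placeholder):
  dd if=/dev/zero of=/mnt/shelter/ddfile bs=1M count=1024   # 1 MB blocks x 1024 = 1 GB file
  dd if=/dev/zero of=/mnt/shelter/ddfile bs=4M count=5120   # 4 MB blocks x 5120 = 20 GB file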

dd throughput (MB/sec)

IOZone write throughput

IOZone write vs read (single thread)

Mdtest file/directory create rate

Mdtest file/directory remove rate

Mdtest file/directory stat rate

Tar-untar-rm time (sec)

tar
                    Real     User  Sys
Average             781.27   1.35  10.41
Median              1341.72  1.66  13.08
Standard deviation  644.16   0.44  3.39

untar
                    Real     User  Sys
Average             1214.82  1.51  18.02
Median              1200.13  -     17.90
Standard deviation  99.03    0.06  0.62

rm
                    Real     User  Sys
Average             227.48   0.22  3.91
Median              216.28   -     3.87
Standard deviation  64.21    0.02  0.16

BT-IO Results
Attribute: Class C / Class D
- Problem size: 162 x 162 x 162 / -
- Iterations: 200 / 250
- Number of processes: 4 / 361
- I/O timing percentage: 13.44 / 91.66
- Total data written in a single file (MB): 6802.44 / 135834.62
- I/O data rate (MB/sec): 94.99 / 73.45
- Data written or read at every I/O instance into a single file per processor (MB/core): 42.5 / 7.5

NETAPP FAS 3240 RESULTS

Server and Clients
NAS server: NetApp FAS 3240
Clients running on two clusters: Hokiespeed and Blueridge
- Hokiespeed: Linux kernel compile, tar-untar and rm tests have been run with nodes spread uniformly over racks and with consecutive nodes (rack-packed)
- Blueridge: Linux kernel compile, tar-untar, and rm tests have been run on consecutive nodes

IOzone read and write throughput (KB/s), Hokiespeed

dd bandwidth (MB/sec)
Two node placement policies: packed on a rack, spread across racks
Direct IO was used
Two operations: read and write
Two block sizes: 1MB and 4MB
Three file sizes: 1GB, 5GB, 15GB
Results show throughput in MB/s

dd read throughput (MB/sec), 1MB blocks
Hokiespeed (nodes spread, nodes packed) vs. BlueRidge (nodes packed)

dd read throughput (MB/sec), 4MB blocks
Hokiespeed (nodes spread, nodes packed) vs. BlueRidge (nodes packed)

dd write throughput (MB/sec), 1MB blocks
Hokiespeed (nodes spread, nodes packed) vs. BlueRidge (nodes packed)

dd write throughput (MB/sec), 4MB blocks
Hokiespeed (nodes spread, nodes packed) vs. BlueRidge (nodes packed)

Linux Kernel tests
Two node placement policies: packed on a rack, spread across racks
Operations:
- Compile: make -j 12
- Tar creation and extraction
- Remove directory tree
Results show execution time in seconds
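The compile step per node is essentially the following (the kernel source path is illustrative):
  cd /mnt/netapp/$HOSTNAME/linux-kernel   # one untarred kernel tree per node
  time make -j 12                         # 12-way parallel build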

Linux Kernel compile time (sec)

Hokiespeed, nodes spread
nodes  real   user  sys
1      817    4968  1096
2      990    5014  1138
4      993    5223  1171
8      939    5143  1167
16     1318   5112  1198
32     2561   5087  1183
64     4985   5111  1209

BlueRidge, nodes packed
nodes  real   user  sys
1      694    4589  951
2      1092   4572  993
4      2212   4631  1038
8      4451   4691  1073
16     5636   4716  1098
32     5999   4702  1111
64     6609   4699  1089

Hokiespeed, nodes packed
nodes  real   user  sys
1      733    5001  1116
2      1546   5086  1233
4      3189   5146  1273
8      6343   5219  1317
16     9476   5251  1366
32     10012  5255  1339

Tar extraction time (sec)

Hokiespeed, nodes spread
nodes  real  user  sys
1      143   1.05  9.5
2      125   0.98  9.4
4      144   1.04  9.8
8      149   -     -
16     216   1.08  10.4
32     399   1.23  12.5
64     809   1.42  15.0

BlueRidge, nodes packed
nodes  real  user  sys
1      98    0.6   6.6
2      103   -     -
4      106   -     6.5
8      130   0.7   7.1
16     217   0.8   9.1
32     406   1.2   13
64     818   1.1   14

Hokiespeed, nodes packed
nodes  real  user  sys
1      167   1.0   9.5
2      172   0.98  -
4      177   1.06  9.6
8      202   1.03  9.7
16     312   1.09  10.2
32     421   1.18  11.9

Rm execution time (sec)

Hokiespeed, nodes spread
nodes  real  user  sys
1      20    0.12  2.5
2      21    0.15  2.7
4      25    0.16  2.8
8      33    0.17  -
16     123   0.22  3.7
32     284   0.24  4.0
64     650   0.27  4.4

BlueRidge, nodes packed
nodes  real    user  sys
1      19.21   0.07  1.69
2      19.14   0.10  -
4      26.68   0.11  1.98
8      63.75   0.16  3.16
16     152.59  0.22  4.24
32     324.90  0.26  4.98
64     699.04  0.25  5.06

Hokiespeed, nodes packed
nodes  real  user  sys
1      21    0.14  2.84
2      22    -     2.82
4      -     0.15  2.80
8      47    0.18  3.30
16     135   0.21  3.85
32     248   0.23  4.01
64     811   0.27  4.54

Uplink switch traffic, runs on hokiespeed: nodes spread vs. nodes packed

Mdtest file/directory create rate
IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (BlueRidge and Hokiespeed)

Mdtest file/directory remove rate
IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (Hokiespeed and BlueRidge)

Mdtest file/directory stat rate
IO ops/sec for mdtest -z 1 -b 3 -I 256 -i 10 -N 1 (Hokiespeed and BlueRidge)

NAS BT-IO results, Class D
Iterations: 250 (I/O after every 5 steps)
Number of jobs: 50
Total data size (written/read): 6.5 TB (50 files of 135GB each)

System                                         HokieSpeed       BlueRidge
Nodes per job                                  3                4
Total number of cores                          1800             3200
Average I/O timing in hours                    5.175, 5.85      5.3, 5.5
Average I/O timing (percentage of total time)  92.6, 93.4       92.7, 96.6
Average Mop/s/process                          80.6, 72         79.6, 44.5
Average I/O rate per node (MB/s)               2.44, 2.15       2.34, 1.71
Total I/O rate (MB/s)                          357.64, 323.02   359.8, 343.42

Uplink switch traffic for BT-IO on hokiespeed
The boxes (1, 2, 3) indicate the three NAS BT-IO runs; red is write, green is read

EMC Isilon X400 RESULTS

dd bandwidth (MB/sec)
Runs on BlueRidge; no special node placement policy
Direct IO was used
Two operations: read and write
Two block sizes: 1MB and 4MB
Three file sizes: 1GB, 5GB, 15GB
Results show throughput in MB/s

dd read throughput (MB/sec), 1MB blocks: EMC Isilon vs. NetApp

dd read throughput (MB/sec), 4MB blocks: Isilon vs. NetApp

dd write throughput (MB/sec), 1MB blocks: Isilon vs. NetApp

dd write throughput (MB/sec), 4MB blocks: Isilon vs. NetApp

Linux Kernel tests
Runs on BlueRidge; no special node placement policy
Direct IO was used
Operations:
- Compile: make -j 12
- Tar creation and extraction
- Remove directory tree
Results show execution time in seconds

Linux Kernel compile time (sec)

Isilon
nodes  real  user  sys
1      701   4584  957
2      1094  4558  989
4      2228  4631  1038
8      4642  4713  1084
16     5860  4723  1107
32     6655  4754  1120
64     7181  4760  1113

NetApp
nodes  real  user  sys
1      694   4589  951
2      1092  4572  993
4      2212  4631  1038
8      4451  4691  1073
16     5636  4716  1098
32     5999  4702  1111
64     6609  4699  1089

Tar creation time (sec)

Isilon
nodes  real  user  sys
1      32    0.50  4.45
2      -     0.51  4.54
4      -     0.47  4.39
8      -     0.48  4.38
16     33    0.49  4.28
32     35    -     4.19
64     57    -     4.20

NetApp
nodes  real  user  sys
1      30    0.51  4.50
2      -     0.49  4.46
4      34    0.50  4.51
8      41    -     4.45
16     62    0.54  -
32     116   0.60  4.83
64     238   0.89  7.10

Tar extraction time (sec)

Isilon
nodes  real  user  sys
1      230   0.65  10.1
2      234   0.62  10.3
4      237   0.63  10.4
8      255   0.64  10.5
16     300   0.67  10.9
32     431   0.74  11.8
64     754   0.87  14.1

NetApp
nodes  real  user  sys
1      98    0.6   6.6
2      103   -     -
4      106   -     6.5
8      130   0.7   7.1
16     217   0.8   9.1
32     406   1.2   13
64     818   1.1   14

Rm execution time (sec)

Isilon
nodes  real  user  sys
1      110   0.23  4.76
2      113   0.24  4.80
4      124   -     4.82
8      158   -     4.85
16     234   0.25  4.93
32     340   0.26  4.99
64     655   -     5.27

NetApp
nodes  real  user  sys
1      19.2  0.07  1.69
2      19.1  0.10  -
4      26.7  0.11  1.98
8      63.7  0.16  3.16
16     152   0.22  4.24
32     324   0.26  4.98
64     699   0.25  5.06

IOzone write throughput (KB/s), Isilon: buffered IO/BlueRidge vs. direct IO/BlueRidge

IOzone read throughput (KB/s), Isilon: buffered IO/BlueRidge vs. direct IO/BlueRidge

IOzone write throughput (KB/s): Isilon/BlueRidge vs. NetApp/HokieSpeed

IOzone read throughput (KB/s): Isilon/BlueRidge vs. NetApp/HokieSpeed

Thank you.