PROOF Benchmark on Different Hardware Configurations

Mengmeng Chen, Annabelle Leung, Bruce Mellado, Neng Xu and Sau Lan Wu
University of Wisconsin-Madison
PROOF Workshop 2007, 11/29/2007
Special thanks to Catalin Cirstoiu (ApMon), Gerardo Ganis, Jan Iwaszkiewicz and Fons Rademakers (the PROOF team), and Sergey Panitkin (BNL)

The Purpose of This Benchmark
- To find out the detailed hardware usage of PROOF jobs.
- To find out what the bottleneck of the processing speed is.
- To see the performance on a multi-core system.
- To see the performance with different disk and RAID configurations.
- To see what else affects PROOF performance (file size, file content).

Hardware/Software Configuration
Hardware:
- Intel 8-core, 2.66 GHz, 16 GB DDR2 memory
- With a single disk
- With 8 disks without RAID
- With RAID 5 (8 disks)
- With RAID 5 (24 disks)
Benchmark files:
- PROOF benchmark files (900 MB)
- ATLAS format files: EV0 files (50 MB)
ROOT version:
- URL:
- Repository UUID: 27541ba8-7e3a c3a389f83636
- Revision: 21025

The Data Processing Settings
Benchmark files (details can be found in $ROOTSYS/test/ProofBench/README):
- With EventTree_ProcOpt.C (reads 25% of the branches)
- With EventTree_Pro.C (reads all the branches)
ATLAS format files:
- With EV0.C (reads only a few branches)
Memory refresh: after each PROOF job, the Linux kernel keeps the processed data cached in physical memory. When we process the same data again, PROOF reads it from memory instead of from disk. In order to see the real disk I/O in the benchmark, we have to clean up the cached memory after each test (see the sketch below).
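The slides do not show how an individual test round was launched; the following is a minimal, hypothetical ROOT/C++ sketch of one such round, assuming a PROOF session on a cluster master (placeholder URL), a chain over the ProofBench files (placeholder tree name and file paths), and the all-branches selector named above. The closing comment notes a common way to drop the Linux page cache, which is assumed here to be what the "memory refresh" step amounts to.

```cpp
// run_bench.C -- hypothetical sketch of one benchmark round (not from the slides).
// Master URL, tree name and file paths are placeholders.
#include "TProof.h"
#include "TChain.h"
#include "TStopwatch.h"
#include <cstdio>

void run_bench(Int_t nWorkers = 4)
{
   // Connect to the PROOF cluster and restrict the session to nWorkers active workers.
   TProof *p = TProof::Open("proof-master.example.edu");
   p->SetParallel(nWorkers);

   // Chain over the benchmark files generated with $ROOTSYS/test/ProofBench.
   TChain *chain = new TChain("EventTree");                 // placeholder tree name
   chain->Add("/data/proofbench/event_tree_*.root");        // placeholder file path
   chain->SetProof();                                       // route Process() through PROOF

   // Run the selector that reads all branches and time it to get events/sec.
   TStopwatch sw;
   sw.Start();
   chain->Process("EventTree_Pro.C+");
   sw.Stop();
   printf("%d workers: %.1f events/sec\n",
          nWorkers, chain->GetEntries() / sw.RealTime());

   p->Close();

   // Between rounds the Linux page cache would be dropped (as root), e.g.
   //   sync; echo 3 > /proc/sys/vm/drop_caches
   // so that the next round measures real disk I/O rather than cached reads.
}
```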

What Can We See from the Results?
How much resource PROOF jobs need:
- CPU
- Memory
- Disk I/O
How PROOF jobs use those resources:
- How do they use a multi-core system?
- How much data do they load into memory?
- How fast do they load it into memory?
- Where does the data go after processing? (cached memory)
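The title slide credits ApMon for the monitoring, and the slides do not describe the sampling setup. Purely as an illustration of the kind of numbers behind the memory and cached-memory plots, here is a small, hypothetical C++ probe that reads two fields from /proc/meminfo once per second; the field names and sampling interval are illustrative only.

```cpp
// meminfo_probe.cpp -- hypothetical sketch; the real monitoring used ApMon.
// Samples MemFree and Cached from /proc/meminfo once per second while a test runs.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>

// Return the value (in kB) of a field such as "Cached" from /proc/meminfo, or -1 if absent.
static long meminfoField(const std::string &key)
{
   std::ifstream in("/proc/meminfo");
   std::string line;
   const std::string prefix = key + ":";
   while (std::getline(in, line)) {
      if (line.compare(0, prefix.size(), prefix) == 0)
         return std::atol(line.c_str() + prefix.size());   // e.g. "Cached:   123456 kB"
   }
   return -1;
}

int main()
{
   for (;;) {
      std::cout << "MemFree: " << meminfoField("MemFree") << " kB, "
                << "Cached: "  << meminfoField("Cached")  << " kB" << std::endl;
      sleep(1);
   }
}
```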

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 disks on RAID 5. Benchmark files, big size, read all the data.]

Performance Rate (with RAID 5)
- The maximum disk I/O reaches ~55 MB/sec.
- Performance keeps increasing until the number of workers equals the number of CPUs.
- CPU usage is limited by the RAID disk throughput.
- Scalability is limited by the RAID disk throughput after the number of workers reaches 6.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, single disk. Benchmark files, big size, read all the data.]

Performance Rate (with 1 disk)
- The maximum disk I/O reaches ~25 MB/sec.
- CPUs are only ~25% used.
- 2 workers provide the best performance: for a single disk, two PROOF workers best utilize the disk throughput. Therefore, 2 cores per disk is an ideal ratio.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 single disks on the RAID controller. Benchmark files, big size, read all the data.]

Performance Rate (RAID vs. non-RAID)
- Using single disks provides a significant increase in disk throughput and performance.
- The trade-off between data protection and performance has to be carefully checked.
- 8 single disks could sustain the scalability until running out of the 8 CPUs.
- A more concrete conclusion could be obtained from the same test on machines with more cores and disks (e.g., 16 CPUs, 24 disks).
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 disks on RAID 5, without memory refresh. Benchmark files, big size, read all the data.]

Performance Rate (read from memory)
- When we didn't refresh the memory, there was no disk I/O at all.
- CPU usage reached 100%.
- The CPU becomes the only bottleneck for analysis speed and scalability.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 disks on RAID 5, with Xrootd preload. Benchmark files, big size, read all the data.]

Performance Rate (with Xrootd preload)
- Xrootd preloading doesn't change the disk throughput much (compare the RAID 5 results above); disk I/O reaches ~60 MB/sec.
- Xrootd preloading helps to increase the top speed by ~12.5%.
- CPU usage reached 60%.
- The best performance is achieved when the number of workers is less than the number of CPUs (6 workers give the best performance).
[Plot: Average Processing Speed (events/sec) vs. number of workers]

An Overview of the Performance Rate
- For I/O-bound jobs, disk throughput is the bottleneck for most hardware configurations; reading data from memory provides the best scalability and performance.
- A trade-off between data protection and working performance has to be made.
- Using the Xrootd preload function helps to increase the analysis speed.
- 2 cores per disk seems to be a reasonable hardware ratio when not using RAID technology.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 disks on RAID 5. Benchmark files, big size, read 25% of the data.]

Performance Rate
- For an I/O-bound-like case (the benchmark analysis), the less I/O-intensive job has better performance.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

[Monitoring plots vs. number of workers: CPU Usage (%), Disk I/O (KB/s), Memory Usage (MB), Cached Memory. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory, 8 disks on RAID 5. EV0 files, read one branch.]

Performance Rate (with ATLAS data)
- Disk I/O only reaches ~25 MB/sec even though we were using RAID.
- CPU usage reached ~85%.
- 8 workers provide the best performance.
[Plot: Average Processing Speed (events/sec) vs. number of workers]

Summary
We think PROOF is a viable technology for Tier-3s. Various performance tests have been done with the purpose of understanding how PROOF uses the computing resources and of finding the optimal configuration of those resources. A number of issues have come to our attention:
- The balance between CPU processing power and disk throughput.
- The configuration of Xrootd file servers (Xrootd preload).
- The trade-off between performance and data protection.
- Different types of jobs (I/O-intensive or CPU-intensive).
- Hardware/software configuration of the data-serving storage.
More issues need to be understood when designing a larger-scale PROOF cluster for multiple users (user priority, PROOF job scheduling, network traffic, data distribution, etc.).