Xrootd Present & Future: The Drama Continues
Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
HEPiX, 13-October-05

Slide 2: Outline
- The state of performance
  - Single server
  - Clustered servers
- The SRM Debate
- The Next Big Thing
- Conclusion

Slide 3: Application Design Point
- Complex, embarrassingly parallel analysis
  - Determine particle decay products
- 1000's of parallel clients hitting the same data
- Small-block sparse random access
  - Median read size < 3K
  - Uniform seek across the whole file (mean 650MB)
  - Only about 22% of the file read (mean 140MB)
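To make this access pattern concrete, here is a minimal, hypothetical Python sketch of a synthetic reader that mimics what the slide describes: uniform seeks across a large file, reads with a roughly 3K median size, stopping once about 22% of the bytes have been touched. This is not the BetaMiniApp; the file path and the read-size distribution are assumptions for illustration.

```python
# Illustrative sketch (not the actual BetaMiniApp client): reproduce the
# access pattern described on this slide -- uniform seeks across a large
# file, small reads around a ~3 KB median, stopping once ~22% of the file
# has been read.  The file path and size parameters are hypothetical.
import os
import random

def sparse_random_read(path, median_read=3 * 1024, read_fraction=0.22):
    size = os.path.getsize(path)
    bytes_read = 0
    with open(path, "rb", buffering=0) as f:              # unbuffered raw reads
        while bytes_read < read_fraction * size:
            f.seek(random.randrange(size))                # uniform seek across whole file
            length = random.randint(512, 2 * median_read) # ~3 KB median read size
            bytes_read += len(f.read(length))
    return bytes_read

if __name__ == "__main__":
    print(sparse_random_read("/tmp/sample-event-file"))   # hypothetical test file
```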

Slide 4: Performance Measurements
- Goals
  - Very low latency
  - Handle many parallel clients
- Test setup
  - Sun V20z: 1.86 GHz dual Opteron, 2GB RAM
  - 1Gb on-board Broadcom NIC (same subnet)
  - Solaris 10 x86
  - Linux RHEL ELsmp
- Client running BetaMiniApp with the analysis removed

Slide 5: Latency Per Request (xrootd)

Slide 6: Capacity vs Load (xrootd)

Slide 7: xrootd Server Scaling
- Linear scaling relative to load
  - Allows deterministic sizing of the server: disk, NIC, CPU, memory
- Performance tied directly to hardware cost
- Competitive with best-in-class commercial file servers

Slide 8: OS Impact on Performance

Slide 9: Device & Filesystem Impact
- CPU limited vs. I/O limited regions (1 event ≈ 2K)
- UFS good on small reads
- VXFS good on big reads

Slide 10: Overhead Distribution

Slide 11: Network Overhead Dominates

Slide 12: Xrootd Clustering (SLAC)
[Diagram: client machines (kan01, kan02, kan03, kan04, ..., kanxx), redirectors (bbr-olb03, bbr-olb04, kanolb-a), hidden details]

Slide 13: Clustering Performance
- Design can scale to at least 256,000 servers
  - SLAC runs a 1,000-node test server cluster
  - BNL runs a 350-node production server cluster
- Self-regulating (via a minimal spanning tree algorithm)
  - 280 nodes self-cluster in about 7 seconds
  - 890 nodes self-cluster in about 56 seconds
- Client overhead is extremely low
  - Overhead added to meta-data requests (e.g., open): ~200us * log64(number of servers) / 2
  - Zero overhead for I/O
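As a quick sanity check, the per-open overhead formula quoted above can be evaluated directly for the cluster sizes mentioned on the slide; the short Python sketch below does only that arithmetic (no xrootd code involved).

```python
import math

# Meta-data (open) overhead estimate from this slide:
#   overhead ~= 200 us * log64(number of servers) / 2
def open_overhead_us(n_servers, per_hop_us=200.0):
    return per_hop_us * math.log(n_servers, 64) / 2.0

# Cluster sizes taken from the slide (64-ary tree, 350- and 1,000-node
# clusters, 256,000-server design limit).
for n in (64, 350, 1000, 256000):
    print(f"{n:>7} servers -> ~{open_overhead_us(n):.0f} us added per open")
# 64 -> ~100 us, 350 -> ~141 us, 1000 -> ~166 us, 256000 -> ~299 us
```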

Slide 14: Cluster Fault Tolerance
- Servers and resources may come and go
  - New servers can be added/removed at any time
  - Files can be moved around in real time
  - Clients simply adjust to the new configuration
- Client-side interface handles the recovery protocol
  - Uses the Real-Time Client Steering Protocol
  - Can be used to perform reactive client scheduling
- Any volunteers for the bleeding edge?

Slide 15: Current MSS Support
- Lightweight, agnostic interfaces provided
  - oss.mssgwcmd command
    - Invoked for each create, dirlist, mv, rm, stat
  - oss.stagecmd |command
    - Long-running command, request stream protocol
    - Used to populate the disk cache (i.e., "stage-in")
[Diagram: xrootd (oss layer) -> mssgwcmd / stagecmd -> MSS]
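For concreteness, a hedged sketch of the kind of gateway script an oss.mssgwcmd directive might point at is shown below. The slide only says the command is invoked for each create, dirlist, mv, rm, and stat; the argument convention and the mss_client helper it shells out to are made up for illustration and are not the actual xrootd/MSS interface.

```python
#!/usr/bin/env python3
# Hypothetical oss.mssgwcmd gateway.  The actual calling convention is not
# shown on the slide; assume here that xrootd passes the operation name
# followed by its path arguments.  "mss_client" is a made-up site-local
# MSS command-line tool, not a real utility.
import subprocess
import sys

HANDLERS = {
    "create":  lambda a: ["mss_client", "create", a[0]],
    "dirlist": lambda a: ["mss_client", "list", a[0]],
    "mv":      lambda a: ["mss_client", "rename", a[0], a[1]],
    "rm":      lambda a: ["mss_client", "remove", a[0]],
    "stat":    lambda a: ["mss_client", "stat", a[0]],
}

def main(argv):
    if len(argv) < 3 or argv[1] not in HANDLERS:
        print("usage: mssgw <create|dirlist|mv|rm|stat> <path> [path2]", file=sys.stderr)
        return 2
    try:
        cmd = HANDLERS[argv[1]](argv[2:])
    except IndexError:
        print("missing path argument", file=sys.stderr)
        return 2
    return subprocess.call(cmd)            # exit status is reported back to the oss layer

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```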

Slide 16: Future Leaf Node SRM
- The MSS interface is an ideal spot for an SRM hook
  - Use the existing hooks (mssgwcmd & stagecmd) or a new long-running hook: oss.srm |command
  - Processes external disk cache management requests
- Should scale quite well
[Diagram: xrootd (oss layer) -> srm -> MSS and Grid]
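The |command form implies a long-running helper fed by a request stream. Since that stream is not defined on the slide, the Python sketch below assumes a simple line-oriented format ("stage <lfn>" / "evict <lfn>") purely to show the shape such an oss.srm helper could take.

```python
#!/usr/bin/env python3
# Hypothetical long-running helper for an oss.srm |command hook.  The
# request-stream protocol is not specified on the slide; a line-oriented
# "stage <lfn>" / "evict <lfn>" format is assumed here for illustration.
import sys

def handle(request: str) -> str:
    op, _, lfn = request.partition(" ")
    if op == "stage":
        # ...ask the SRM/MSS to bring the file onto local disk...
        return f"OK staged {lfn}"
    if op == "evict":
        # ...tell the SRM the local copy may be released...
        return f"OK evicted {lfn}"
    return f"ERR unknown request: {request}"

def main():
    for line in sys.stdin:                 # one request per line, until the pipe closes
        if line.strip():
            print(handle(line.strip()), flush=True)

if __name__ == "__main__":
    main()
```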

Slide 17: BNL/LBL Proposal
[Diagram: LBL generic/standard components (srm, drm, das) coupled to xrootd with BNL replica services (dm, rc), the Replica Registration Service and DataMover]

Slide 18: Alternative Root Node SRM
- Team olbd with SRM
  - File management & discovery
  - Tight management control
- Several issues need to be considered
  - Introduces many new failure modes
  - Will not generally scale
[Diagram: olbd (root node) -> srm -> MSS and Grid]

Slide 19: SRM Integration Status
- Unfortunately, the SRM interface is in flux
  - Heavy vs. light protocol
- Working with the LBL team
  - Working towards an OSG-sanctioned future proposal
- Trying to use the Fermilab SRM
  - Artem Turnov at IN2P3 is exploring the issues

Slide 20: The Next Big Thing
- High performance data access servers plus efficient large-scale clustering
- Allows novel, cost-effective, super-fast massive storage
  - Optimized for sparse random access
- Imagine 30TB of DRAM at commodity prices

Slide 21: Device Speed Delivery

Slide 22: Memory Access Characteristics
- Server: zsuntwo
- CPU: Sparc
- NIC: 100Mb
- OS: Solaris 10
- UFS: Standard

Slide 23: The Peta-Cache
- Cost-effective memory access impacts science
  - It is the nature of all random access analysis
  - Not restricted to just High Energy Physics
- Enables faster and more detailed analysis
  - Opens new analytical frontiers
- Have a 64-node test cluster, V20z each with 16GB RAM
  - A 1TB "toy" machine (64 x 16GB)

Slide 24: Conclusion
- High performance data access systems are achievable
  - The devil is in the details
  - Must understand the processing domain and deployment infrastructure
  - Requires a comprehensive, repeatable measurement strategy
- High performance and clustering are synergetic
  - Allows unique performance, usability, scalability, and recoverability characteristics
  - Such systems produce novel software architectures
- Challenges
  - Creating application algorithms that can make use of such systems
- Opportunities
  - Fast, low-cost access to huge amounts of data to speed discovery

Slide 25: Acknowledgements
- Fabrizio Furano, INFN Padova
  - Client-side design & development
- Bill Weeks
  - Performance measurement guru: 100's of measurements repeated 100's of times
- US Department of Energy
  - Contract DE-AC02-76SF00515 with Stanford University
- And our next mystery guest!