Passive NFS Tracing of Email and Research Workloads Daniel Ellard, Jonathan Ledlie, Pia Malkani, Margo Seltzer FAST 2003 - April 1, 2003.

Slides:



Advertisements
Similar presentations
1 CS533 Modeling and Performance Evaluation of Network and Computer Systems Capacity Planning and Benchmarking (Chapter 9)
Advertisements

Outperforming LRU with an Adaptive Replacement Cache Algorithm Nimrod megiddo Dharmendra S. Modha IBM Almaden Research Center.
Predicting Tor Path Compromise by Exit Port IEEE WIDA 2009December 16, 2009 Kevin Bauer, Dirk Grunwald, and Douglas Sicker University of Colorado Client.
Directory Delegations in NFSv4 CITI University of Michigan Ann Arbor Brian Wickman.
1 William Lee Duke University Department of Electrical and Computer Engineering Durham, NC Analysis of a Campus-wide Wireless Network February 13,
1 LAN Traffic Measurements Carey Williamson Department of Computer Science University of Calgary.
The Network Weather Service A Distributed Resource Performance Forecasting Service for Metacomputing Rich Wolski, Neil T. Spring and Jim Hayes Presented.
Wireshark – Introduction Wire 1 Due date: Friday, October 30th.
Project 4 U-Pick – A Project of Your Own Design Proposal Due: April 14 th (earlier ok) Project Due: April 25 th.
EEC-484/584 Computer Networks Lecture 6 Wenbing Zhao
An Adaptable Benchmark for MPFS Performance Testing A Master Thesis Presentation Yubing Wang Advisor: Prof. Mark Claypool.
Internet Traffic Patterns Learning outcomes –Be aware of how information is transmitted on the Internet –Understand the concept of Internet traffic –Identify.
Storage system designs must be evaluated with respect to many workloads New Disk Array Performance (CDF of latency) seconds % I/Os seconds % I/Os seconds.
The Google File System. Why? Google has lots of data –Cannot fit in traditional file system –Spans hundreds (thousands) of servers connected to (tens.
On the Constancy of Internet Path Properties Yin Zhang, Nick Duffield AT&T Labs Vern Paxson, Scott Shenker ACIRI Internet Measurement Workshop 2001 Presented.
COS 420 DAY 25. Agenda Assignment 5 posted Chap Due May 4 Final exam will be take home and handed out May 4 and Due May 10 Latest version of Protocol.
Multi-Scale Analysis for Network Traffic Prediction and Anomaly Detection Ling Huang Joint work with Anthony Joseph and Nina Taft January, 2005.
Network Traffic Measurement and Modeling CSCI 780, Fall 2005.
Towards a Better Understanding of Web Resources and Server Responses for Improved Caching Craig E. Wills and Mikhail Mikhailov Computer Science Department.
NFS Tricks and Benchmarking Traps Daniel Ellard and Margo Seltzer FREENIX June 12, 2003.
Tracking the Evolution of Web Traffic: Felix Hernandez-Campos, Kevin Jeffay F. Donelson Smith IEEE/ACM International Symposium on Modeling, Analysis.
Copyright © 2005 Department of Computer Science CPSC 641 Winter Network Traffic Measurement A focus of networking research for 20+ years Collect.
Distributed File System: Design Comparisons II Pei Cao Cisco Systems, Inc.
Copyright © 2005 Department of Computer Science CPSC 641 Winter LAN Traffic Measurements Some of the first network traffic measurement papers were.
Lab Manager Maintenance July, 2008 VMware Confidential Lab Manager 3 Training Series Module 9.
File Access Patterns in Coda Distributed File System Yevgeniy Vorobeychik.
1 YouTube Traffic Characterization: A View From the Edge Phillipa Gill¹, Martin Arlitt²¹, Zongpeng Li¹, Anirban Mahanti³ ¹ Dept. of Computer Science, University.
Designing a Data Warehouse
Ji-Yong Shin Cornell University In collaboration with Mahesh Balakrishnan (MSR SVC), Tudor Marian (Google), and Hakim Weatherspoon (Cornell) Gecko: Contention-Oblivious.
DESIGN AND IMPLEMENTATION OF THE SUN NETWORK FILESYSTEM R. Sandberg, D. Goldberg S. Kleinman, D. Walsh, R. Lyon Sun Microsystems.
Graybox NFS Caching Proxy By: Paul Cychosz and Garrett Kolpin.
Network File System (NFS) Brad Karp UCL Computer Science CS GZ03 / M030 6 th, 7 th October, 2008.
Sun NFS Distributed File System Presentation by Jeff Graham and David Larsen.
Internet Communication Models. Two Communication Paradigms: Stream vs. Message StreamMessage Connection-orientedConnectionless 1-1 communication1-1, 1-many,
Object-based Storage Long Liu Outline Why do we need object based storage? What is object based storage? How to take advantage of it? What's.
Consistent Hashing: Load Balancing in a Changing World
Interposed Request Routing for Scalable Network Storage Darrell Anderson, Jeff Chase, and Amin Vahdat Department of Computer Science Duke University.
Computer Networking From LANs to WANs: Hardware, Software, and Security Chapter 12 Electronic Mail.
Lecture On Database Analysis and Design By- Jesmin Akhter Lecturer, IIT, Jahangirnagar University.
SIGCOMM 2002 New Directions in Traffic Measurement and Accounting Focusing on the Elephants, Ignoring the Mice Cristian Estan and George Varghese University.
A Web Crawler Design for Data Mining
Distributed File Systems
Web Cache Replacement Policies: Properties, Limitations and Implications Fabrício Benevenuto, Fernando Duarte, Virgílio Almeida, Jussara Almeida Computer.
Distributed File Systems Overview  A file system is an abstract data type – an abstraction of a storage device.  A distributed file system is available.
Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.
Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.
ICOM 6115: Computer Systems Performance Measurement and Evaluation August 11, 2006.
An Anonymous Approach to Group Based Assessment Wayne Ellis & Mark Ratcliffe University of Wales, Aberystwyth
NDT: Update Duplex Mismatch Detection Rich Carlson Winter Joint Tech February 15, 2005.
Replicating Memory Behavior for Performance Skeletons Aditya Toomula PC-Doctor Inc. Reno, NV Jaspal Subhlok University of Houston Houston, TX By.
ENERGY-EFFICIENCY AND STORAGE FLEXIBILITY IN THE BLUE FILE SYSTEM E. B. Nightingale and J. Flinn University of Michigan.
90% U.S. corporations currently engaged in litigation 147 Average number of active lawsuits for $1B+ companies $1M A verage per case cost of eDiscovery.
1EMC CONFIDENTIAL—INTERNAL USE ONLY FAST VP and Exchange Server 2010 Don Turner Consultant Systems Integration Engineer Microsoft TPM.
Product Presentation. SysKit By Acceleratio Acceleratio Ltd. is a software development company based in Zagreb, Croatia, Europe founded in Technology.
Nexthink V5 Demo ITSM – Users Impacted. Situation › It’s Wednesday morning › Last night the infrastructure team we worked hard on a proxy migration We.
1 Exploiting Nonstationarity for Performance Prediction Christopher Stewart (University of Rochester) Terence Kelly and Alex Zhang (HP Labs)
COMP2322 Lab 1 Introduction to Wireshark Weichao Li Jan. 22, 2016.
Review CS File Systems - Partitions What is a hard disk partition?
PARALLEL DATA LABORATORY Carnegie Mellon University A framework for implementing IO-bound maintenance applications Eno Thereska.
An Analysis of Internet Content Delivery Systems 19 rd November, 2007 Youngsub CSE, SNU.
#16 Application Measurement Presentation by Bobin John.
1 Internet Traffic Measurement and Modeling Carey Williamson Department of Computer Science University of Calgary.
On the scale and performance of cooperative Web proxy caching 2/3/06.
Chapter Five Distributed file systems. 2 Contents Distributed file system design Distributed file system implementation Trends in distributed file systems.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
By Jason Swoyer.  Computer forensics is a branch of forensic science pertaining to legal evidence found in computers and digital storage mediums.  Computer.
A Comparison of File System Workloads D. Roselli J. Lorch T. Anderson University of California, Berkeley.
Embedded Computing Cluster: NFS Server Analysis
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung Google Presented by Jiamin Huang EECS 582 – W16.
Reminders Project 4 due tomorrow, 4:00pm Review lecture tomorrow
Presentation transcript:

Passive NFS Tracing of and Research Workloads Daniel Ellard, Jonathan Ledlie, Pia Malkani, Margo Seltzer FAST April 1, 2003

Daniel Ellard - FAST /1/20032 Talk Outline Motivation Tracing Methodology Trace Summary New Findings Conclusion

Daniel Ellard - FAST /1/20033 Motivation - Why Gather Traces? Our research agenda: build file systems that –Tune themselves for their workloads –Can adapt to diverse workloads Underlying assumptions: –There is a significant variation between workloads. –There are workload-specific optimizations that we can apply on-the fly. We must test whether these assumptions hold for contemporary workloads.

Daniel Ellard - FAST /1/20034 Why Use Passive NFS Traces? Passive: no changes to the server or client –Sniff packets from the network –Non-invasive trace methods are necessary for real-world data collection NFS is ubiquitous and important –Many workloads to trace –Analysis is useful to real users Captures exactly what the server sees –Matches our research needs

Daniel Ellard - FAST /1/20035 Difficulties of Analyzing NFS Traces Underlying file system details are hidden –Disk activity –File layout The NFS interface is different from a native file system interface –No open/close, no seek –Client-side caching can skew the operation mix Some NFS calls and responses are lost NFS calls may arrive out-of-order

Daniel Ellard - FAST /1/20036 Our Tracing Software Based on tcpdump and libpcap (a packet- capture library) –Captures more information than tcpdump –Handles RPC over TCP and jumbo frames Anonymizes the traces –Very important for real-world data collection –Tunable to remove/preserve specific information Open source, freely available

Daniel Ellard - FAST /1/20037 Overview of the Traced Systems CAMPUS Central college facility Almost entirely SMTP, POP/IMAP, pine No R&D “Normal” users Digital UNIX 53G of storage (1 of 14 home directory disks) EECS EE/CS facility No Research: software projects, experiments Research users Network Appliances filer 450G of storage

Daniel Ellard - FAST /1/20038 Summary of Average Daily Activity CAMPUSEECS Total Ops26.7 Million4.4 Million Read Ops65% (119.6 GB)10% (5.1 GB) Write Ops21% (44.6 GB)15% (9.1 GB) Other14%75% - getattr, lookup, access R/W Ops /21/ /27/2001

Daniel Ellard - FAST /1/20039 Workload Characteristics CAMPUS Data-Oriented 95%+ of reads/writes are to large mailboxes For newly created files: –96%+ are zero-length –Most of the remainder are < 16k –< 1% are “write-only” EECS Metadata-Oriented Mix of applications, mix of file sizes For newly created files: –5% are zero-length –Less than half of the remainder are < 16k –57% are “write-only”

Daniel Ellard - FAST /1/ How File Data Blocks Die CAMPUS: –99.1% of the blocks die by overwriting –Most blocks live in “immortal” mailboxes EECS: –42.4% of the blocks die by overwriting –51.8% die because their file is deleted Overwriting is common, and a potential opportunity to relocate/reorganize blocks on disk

Daniel Ellard - FAST /1/ File Data Block Life Expectancy CAMPUS: –More than 50% live longer than 15 minutes EECS: –Less than 50% live longer than 1 second –Of the rest, only 50% live longer than two minutes Most blocks die in the cache on EECS, but on CAMPUS blocks are more likely to die on disk

Daniel Ellard - FAST /1/ Talk Outline Motivation Tracing Methodology Trace Summary New Findings Conclusion

Daniel Ellard - FAST /1/ Variation of Load Over Time EECS: load has detectable patterns CAMPUS: load is quite predictable –Busiest 9am-6pm and evenings Monday - Friday –Quiet in the late night / early morning Each system has idle times, which could be used for file system tuning or reorganization. Analyses of workload must include time.

Daniel Ellard - FAST /1/ The Daily Rhythm of CAMPUS

Daniel Ellard - FAST /1/ New Finding: File Names Predict File Properties For most files, there is a strong relationship between the file name and its properties –Many filenames are chosen by applications –Applications are predictable The filename suffix is useful by itself, but the entire name is better The relationships between filenames and file properties vary from system to another

Daniel Ellard - FAST /1/ Name-Based Hints for CAMPUS Files named “inbox” are large, live forever, are overwritten frequently, and read sequentially. Files with names starting with “inbox.lock” or ending with the name of the client host are zero-length lock files and live for a fraction of a second. Files with names starting with # are temporary composer files. They always contain data, but are usually short and are deleted after a few minutes. Dot files are read-only, except.history.

Daniel Ellard - FAST /1/ Name-Based Hints for EECS On EECS the patterns are harder to see. We wrote a program to detect relationships between file names and properties, and make predictions based upon them –Developed this tool on CAMPUS and EECS data –Successful on other later traces as well We can automatically build a model to accurately predict important attributes of a file based on its name.

Daniel Ellard - FAST /1/ Accuracy of the Models Prediction Accuracy Length = 099.4% Length < 16K91.8% Accuracy of predictions for EECS, for the model trained on 10/22/2001 for the trace from 10/23/2001.

Daniel Ellard - FAST /1/ Accuracy of the Models Prediction Accuracy % of files Accuracy w/o names Length = 099.4%12.5%87.5% Length < 16K91.8%35.2%64.8% Accuracy of predictions for EECS, for the model trained on 10/22/2001 for the trace from 10/23/2001.

Daniel Ellard - FAST /1/ Accuracy of the Models Prediction Accuracy % of filesAccuracy w/o names % Error Reduction Length = 099.4%12.5%87.5%94.9% Length < 16K91.8%35.2%64.8%76.6% Accuracy of predictions for EECS, for the model trained on 10/22/2001 for the trace from 10/23/2001.

Daniel Ellard - FAST /1/ New Finding: Out-of-Order Requests On busy networks, requests can be delivered to the server in a different order than they were generated by the client –nfsiods can re-order requests –Network effects can also contribute This can break fragile read-ahead heuristics on the server We investigated this for FreeBSD and found that read-ahead was affected

Daniel Ellard - FAST /1/ Conclusions Workloads do vary, sometimes enormously New traces are valuable –We gain new insights from almost every trace We have identified several possible areas for future research: –Name-based file system heuristics –Handling out-of-order requests

Daniel Ellard - FAST /1/ The Last Word Please contact me if you are interested in exchanging traces or using our tracing software or anonymizer: Another resource: