File Access Patterns in the Coda Distributed File System
Yevgeniy Vorobeychik

Outline
- Terminology
- Motivation
- Project Description
- Related Work
- Case Analysis
- Experimental Setup
  - DFSTrace
  - Custom Perl library
  - Process
- Results
- Analysis
- Implications
- Flaws and Limitations
- Future Work

Terminology
- DFS: Distributed File System
- CMU: Carnegie Mellon University
- Coda: a DFS created at CMU
- (File) caching: storing replicas of files locally
- Unstable files: files that are frequently updated
- Peer-to-peer network: a network with no central server
- Ousterhout, Baker, Sandhu, Zhou: authors of the related work discussed below

Motivation
- File caching has long been used as a technique to improve DFS performance
- When a cached copy is updated, it has to be written back to the server at some point
- Or does it?
  - What if you have a peer-to-peer network?
  - What if there are many unstable files?

Motivation (cont’d)
- What if there is a “very small” set of computers that update a file?
  - Then you can avoid writing back to the server, reducing server load (if there is a server at all)
  - Members of the “writers” group can synchronize the file amongst themselves
  - Clients can contact a member of the “writers” group directly for an updated version of the file (sketched below)
- What does “very small” mean?
  - The reduction in server load should justify the amount of intra-group synchronization
  - I make a very conservative assumption that “very small” = 1
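To make the idea concrete, here is a minimal Perl sketch of the read path this motivates: a client consults the file's known “writers” group and bypasses the server whenever that group has exactly one member. The data structure, paths, and function name are illustrative assumptions, not part of Coda.

    use strict;
    use warnings;

    # Hypothetical writers-group lookup: which computers are known to
    # update each file. Hosts and paths are made up for illustration.
    my %writers_of = (
        '/coda/usr/report.tex' => ['machine-a'],                  # one writer
        '/coda/usr/shared.log' => ['machine-a', 'machine-b'],     # many writers
    );

    # Decide where a client should fetch the latest copy of a file:
    # if the writers group is "very small" (here, exactly 1), ask the
    # writer directly and skip the server write-back path.
    sub fetch_latest {
        my ($path) = @_;
        my $writers = $writers_of{$path} || [];
        if (@$writers == 1) {
            return "fetch $path directly from $writers->[0]";
        }
        return "fetch $path from the central server";
    }

    print fetch_latest('/coda/usr/report.tex'), "\n";
    print fetch_latest('/coda/usr/shared.log'), "\n";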

Project Description
In this project I tried to determine the access patterns that can be observed in the Coda Distributed File System:
- Used Coda traces collected continuously for over 2 years at CMU
- Collected information on “create”, “read”, and “write” system calls
- Created several access summary files (discussed later)

Related Work
- Ousterhout et al. (1985)
  - Analyzed the UNIX 4.2 BSD file system to determine file access patterns and the effects of memory caching
- Baker et al. (1991)
  - Analyzed user-level access patterns in Sprite
- Sandhu, Zhou (1992)
  - Noted a high level of sharing of unstable files in a corporate environment
  - However, there tends to be one cluster that writes to a file and many that read it
  - Introduced the FROLIC system for cluster-based file replication

What About Access Patterns?
A case analysis of file access:
- CASE I (“No Creators”): the file was created outside of the trace set
- CASE II (“1 Creator”): the file was created by one computer and never deleted and recreated by another
  Create-and-write subcases:
  a) created, but never updated
  b) updated by only one computer (was that computer the creator?)
  c) updated by multiple computers (was one of those computers the creator?)
  Create-and-read subcases:
  d) created, but never read
  e) read by only one computer (was that computer the creator?)
  f) read by multiple computers (was one of those computers the creator?)

Case Analysis (cont’d)
- CASE III (“Many Creators”): the file was recreated by multiple computers
- CASE IV (“No Writers”): the file was never updated
- CASE V (“1 Writer”): the file was updated by only one computer
  a) the file was written to but never read
  b) the file was read by only one computer (was the reader also the writer?)
  c) the file was read by many computers (was the writer one of the readers?)
- CASE VI (“Many Writers”): the file was updated by many computers
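A minimal sketch, in the project's own Perl, of how a file's access record could be mapped onto these cases. The record layout (creators and writers as lists of computer names) is an assumption for illustration, not the project's actual data structure.

    use strict;
    use warnings;

    # Given the sets of computers that created and updated a file,
    # name its creator case (I-III) and writer case (IV-VI).
    sub classify {
        my ($rec) = @_;    # $rec->{creators}, $rec->{writers}: array refs
        my $c = scalar @{ $rec->{creators} };
        my $w = scalar @{ $rec->{writers} };
        my $creator_case = $c == 0 ? 'CASE I: No Creators'
                         : $c == 1 ? 'CASE II: 1 Creator'
                         :           'CASE III: Many Creators';
        my $writer_case  = $w == 0 ? 'CASE IV: No Writers'
                         : $w == 1 ? 'CASE V: 1 Writer'
                         :           'CASE VI: Many Writers';
        return ($creator_case, $writer_case);
    }

    my ($cc, $wc) = classify({ creators => ['host1'], writers => [] });
    print "$cc / $wc\n";    # prints "CASE II: 1 Creator / CASE IV: No Writers"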

Experimental Setup
- DFSTrace
  - Library and related programs for analyzing Coda traces
- Custom Perl library
  - Wrote a small (4-class) library in Perl for analyzing the ASCII Coda traces generated by DFSTrace
- Process
  - Generated summary files of only creates, reads, and writes for each computer from the original trace files
  - Used the summary files to tally the access patterns for each file

DFSTrace
- Library for writing, reading, and manipulating Coda traces
- I used it to convert traces to ASCII for further manipulation with Perl scripts

PERL Library
4 classes:
- Tracefile class
  - Reads a trace file and outputs the create, read, and write system calls and the affected files
  - Information is stored in a .sum.txt file, as each trace file contains information gathered from a specified computer
- TracefileSet class
  - Uses the Tracefile class and collects information for all the trace files on CD or on the web (as specified by a switch)
- File class
  - Maintains and manipulates information about a specified file accessed within the traces
- ComputerSet class
  - Uses the File class to maintain information for all files accessed within the traces
  - Writes the access summary information into the “accesstally.txt” file
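The original source is not reproduced in the slides, so the following is only a hypothetical reconstruction of what the Tracefile class might look like. The ASCII trace line format ("<call> <file>") and the method names are assumptions.

    package Tracefile;    # hypothetical reconstruction, not the original class
    use strict;
    use warnings;

    sub new {
        my ($class, $path) = @_;
        return bless { path => $path }, $class;
    }

    # Scan an ASCII trace (one record per line, as produced by
    # DFSTrace) and keep only the create/read/write records, writing
    # them to a .sum.txt summary file.
    sub summarize {
        my ($self) = @_;
        open my $in,  '<', $self->{path}            or die "open: $!";
        open my $out, '>', "$self->{path}.sum.txt"  or die "open: $!";
        while (my $line = <$in>) {
            my ($call, $file) = split ' ', $line;
            next unless defined $file
                and $call =~ /^(create|read|write)$/;
            print $out "$call $file\n";
        }
        close $in;
        close $out;
    }

    1;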

PERL Library (cont’d)
2 scripts that use the above classes:
- gettracedata.pl uses the TracefileSet class to read and summarize all the trace files on a CD or the web
- gettracesum.pl uses the ComputerSet class to read and summarize information for all the traced files
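A hedged sketch of the tally step that gettracesum.pl performs: fold the per-computer summary files into a per-file table of creator, reader, and writer sets. The one-summary-file-per-computer naming convention and record format are assumed to match the Tracefile sketch above.

    use strict;
    use warnings;

    # file => { creators => {host => 1}, readers => {...}, writers => {...} }
    my %access;

    for my $sum (glob '*.sum.txt') {
        (my $host = $sum) =~ s/\.sum\.txt$//;    # recover the computer name
        open my $fh, '<', $sum or die "open $sum: $!";
        while (<$fh>) {
            my ($call, $file) = split ' ';
            next unless defined $file;
            my $role = $call eq 'create' ? 'creators'
                     : $call eq 'read'   ? 'readers'
                     :                     'writers';
            $access{$file}{$role}{$host} = 1;
        }
        close $fh;
    }

    # Example tally: files updated by exactly one computer ("1 Writer").
    my $one_writer = grep { scalar(keys %{ $_->{writers} || {} }) == 1 }
                     values %access;
    print "files with exactly one writer: $one_writer\n";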

Results
Each traced file was tallied by creator case (“No Creators”, “1 Creator”, “Many Creators”) and by writer case (“No Writers”, “1 Writer”, “Many Writers”). Counts for the single-creator and single-writer subcases:
- “1 Creator” files: 1 writer: 0; many writers: 0; 1 reader: 3,871 where the reader = creator and 2 where the reader ≠ creator; many readers: 10, all including the creator
- “1 Writer” files: 1 reader: 13, reader ≠ writer; many readers: 1, not including the writer
- Total: 30,126 files

Analysis
- 136 files were updated by only one computer vs. only 3 files updated by more than one computer
  - Thus, even the conservative assumption of “very small” = 1 encompasses 136 of the 139 files that were updated
- There are very few unstable files
  - The vast majority of files are accessed only to be read, as found in earlier studies
- It is very likely that a file will be read by the same computer that created it
- In most of the instances when a file has one writer or one creator, it is read by only one computer
  - The reader group for unstable files tends to be small
- It is likely that a file will be read by a different computer from the one that updated it
  - Thus, there seems to be a separation between computers that update files and computers that only read them

Analysis (cont’d)
Do the results make sense?
- It makes sense that a computer that created a file will subsequently read it
- It seems counterintuitive that the computer that updated a file will not be the one reading it in the future
  - Such a scenario is possible in a project-oriented environment
  - Indeed, this is similar to the observation made by Sandhu and Zhou that there is typically one cluster that updates a file, while other clusters read it

Implications
- Since the “writers” group is “very small” for most files, this group can be contacted directly by other clients, avoiding server write-back
- It makes a lot of sense for a computer that creates a file to cache a copy of it
- Since unstable files tend to have small “readers” groups, a DFS may maintain a list of “readers” as well as “writers” to optimize file-sharing performance

Flaws and Limitations
- Traces were collected only at CMU and only for Coda
- Only 5 of 38 CDs of data were analyzed, leaving a lot of questions unanswered
- Very little data is analyzed in detail: there is no further analysis of the “No Creators” and “No Writers” cases, into which most of the data falls

Future Work
This follows directly from the “Flaws and Limitations” section:
- Analyze the rest of the Coda trace data
- Analyze other available trace data (Sprite, etc.)
- Analyze the “No Creators” and “No Writers” cases in more detail