Understand Distributed Object System (Short)

95-702 Distributed Systems: Distributed Objects and Distributed File Systems

Learning Goals:
Understand distributed object systems (short)
Understand Amazon Simple Storage Service (S3)
Understand distributed file systems
Understand NFS, the Network File System developed at Sun (1984)
Understand AFS, the Andrew File System developed at CMU (1980s)
Compare the above with the Google File System (GFS) (2004)
Be able to work with HDFS, the Java-based open source Hadoop Distributed File System (2008)
Understand and be able to program with MapReduce

Distributed Objects

The traditional object model (OOP 101). Each object is a set of data and a set of methods. Object references are assigned to variables. Interfaces define an object's methods. Actions are initiated by invoking methods. Exceptions may be thrown for unexpected or illegal conditions. Garbage collection may be handled by the developer (C++) or by the runtime (.NET and Java). We have inheritance and polymorphism. We want similar features in the distributed case.

The distributed object model. Having client and server objects in different processes enforces encapsulation: you must call a method to change an object's state. Methods may be synchronized to protect against conflicting access by multiple clients. Objects are accessed remotely through RMI, or objects are copied to the local machine (if the object's class is available locally) and used locally. Remote object references are analogous to local ones in that: 1. the invoker uses the remote object reference to identify the object, and 2. the remote object reference may be passed as an argument to, or returned from, a local or remote method.

Figure: a remote object consists of data and an implementation of its methods; its remote interface defines the methods available for remote invocation. Enterprise Java Beans (EJBs) provide both a remote and a local interface. EJBs are a component-based middleware technology. EJBs live in a container. The container provides a managed server-side hosting environment, ensuring that non-functional properties are achieved. Middleware supporting the container pattern is called an application server. Quiz: Describe some non-functional concerns that would be handled by the application server. Java RMI presents us with plain old distributed objects. Fowler's First Law of Distributed Objects: "Don't distribute your objects".

Generic RMI architecture (figure): on the client, object A invokes a method through a local proxy for B; the request passes through the client's remote reference and communication modules to the server, where the communication module, remote reference module, and the skeleton and dispatcher for B's class deliver the call to the remote object B. The reply returns along the same path.

Registries promote space decoupling. Java uses the rmiregistry; CORBA uses the CORBA Naming Service. Binders allow an object to be named and registered. Before interacting with the remote object, the client uses the RMI registry to obtain a reference. (The figure repeats the generic RMI architecture of the previous slide: object A, proxy for B, the remote reference and communication modules, and the skeleton and dispatcher for B's class.) A minimal Java RMI sketch follows.
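For concreteness, here is a minimal Java RMI sketch of the registry pattern described above. It is illustrative only: the interface name Hello, the bound name, and the use of the default registry port 1099 are our choices, not from the slides.

import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

// The remote interface: the methods available for remote invocation.
interface Hello extends Remote {
    String sayHello() throws RemoteException;
}

// The remote object (the "object B" of the figure).
class HelloImpl extends UnicastRemoteObject implements Hello {
    HelloImpl() throws RemoteException { super(); }
    public String sayHello() throws RemoteException { return "Hello from the server"; }
}

public class RmiRegistryDemo {
    public static void main(String[] args) throws Exception {
        // Server side: create a registry and bind the remote object under a name.
        LocateRegistry.createRegistry(1099);
        Naming.rebind("rmi://localhost/Hello", new HelloImpl());

        // Client side: look the name up, obtain a proxy, and invoke it as if local.
        Hello proxy = (Hello) Naming.lookup("rmi://localhost/Hello");
        System.out.println(proxy.sayHello());
    }
}

Run in a single JVM this works as written; in a real deployment the server and client would of course be separate processes that share only the Hello interface.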

Registries promote space decoupling. (Figure from the Sun Microsystems Java RMI documentation.)

Distributed Objects: Takeaways. Full OOP implies more complexity than web services. Full OOP implies more complexity than RPC. Remember Martin Fowler's First Law of Distributed Objects.

Amazon Simple Storage Service (S3). Remote object storage. An object is simply a piece of data in no specific format, not like the objects described in RMI. Accessible via REST: PUT, GET, DELETE. Each object has data, a key, user metadata (name-value pairs of your choosing), and system metadata (e.g. time of creation). Objects are placed into buckets; buckets do not contain sub-buckets. Objects may be versioned (same object, different versions, in the same bucket). Object keys are unique within a bucket, and the namespace is flat. Each object has a storage class (high or low performance requirements).

Amazon Simple Storage Service (S3) Use Cases. Backup is the most popular: three replicas over several regions. Infrequently accessed data and archival storage. Data for a static web site. Source of streamed video data. Source and destination of data for big data applications running on Amazon EC2. Challenge: no locking. S3 provides no capability to serialize access to data; the user application is responsible for ensuring that multiple PUT requests for the same object do not clash with each other.

Amazon Simple Storage Service (S3) Consistency Model. If you PUT a new object to S3, a subsequent read will see the new object. If you overwrite an existing object with PUT, the change will eventually be reflected at all replicas; a read after the write may see the old value. If you delete an old object, it will eventually be removed; it may briefly appear to be still present after the delete. Amazon S3 is Available and tolerates Partitions between replicas but is only eventually Consistent. This is the "A" and "P" in the CAP theorem.

Amazon Simple Storage Service (S3) From Amazon An HTTP PUT to a bucket with versioning turned on.

Amazon Simple Storage Service (S3), from Amazon. An HTTP DELETE on a bucket with versioning turned on: the delete marker becomes the current version of the object. (The figure shows the bucket before and after the call to delete.)

Amazon Simple Storage Service (S3), from Amazon. An HTTP DELETE on a bucket with versioning turned on: now, if we call GET, we get a 404 Not Found.

Amazon Simple Storage Service (S3), from Amazon. We can GET a specific version, say a GET with ID = 1111111. We can also delete permanently by including a version number in the DELETE request.

Amazon Simple Storage Service (S3), from Amazon. A Java client accesses the data in an object stored on S3.

AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();
// Call a method to read the objectData stream.
display(objectData);
objectData.close();

Quiz: Where is all of the HTTP? You may share objects with others by providing a signed URL; they may access the object for a specific period of time (a sketch follows).
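The slide mentions sharing via a signed URL. A rough sketch of how this might look with the same AWS SDK for Java (1.x) client style used above; the bucket name, object key, and one-hour expiry are made up, and the exact API details should be checked against the SDK documentation.

import java.net.URL;
import java.util.Date;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class SharedObjectUrl {
    public static void main(String[] args) {
        AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());

        // The presigned URL stops working once the expiration time passes.
        Date expiration = new Date(System.currentTimeMillis() + 60L * 60 * 1000); // one hour

        URL url = s3Client.generatePresignedUrl("my-bucket", "my-object-key", expiration);
        System.out.println("Share this URL: " + url);
    }
}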

Amazon Simple Storage Service (S3), from Amazon. A Java client writes data to an object on S3.

AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
s3client.putObject(new PutObjectRequest(bucketName, keyName, file));

Distributed File Systems

Figure 12.2 File system modules

filedes = open("CoolData\text.txt", "r");
count = read(filedes, buffer, n)

A typical non-distributed file system's layered organization. Each layer depends only on the layer below it. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.3 File attribute record structure: File length, Creation timestamp, Read timestamp, Write timestamp, Attribute timestamp, Reference count, Owner, File type, Access control list. Files contain both data and attributes. The shaded attributes are managed by the file system and are not normally directly modified by user programs. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.4 UNIX file system operations
filedes = open(name, mode): Opens an existing file with the given name.
filedes = creat(name, mode): Creates a new file with the given name. Both operations deliver a file descriptor referencing the open file. The mode is read, write or both.
status = close(filedes): Closes the open file filedes.
count = read(filedes, buffer, n): Transfers n bytes from the file referenced by filedes to buffer.
count = write(filedes, buffer, n): Transfers n bytes to the file referenced by filedes from buffer. Both operations deliver the number of bytes actually transferred and advance the read-write pointer.
pos = lseek(filedes, offset, whence): Moves the read-write pointer to offset (relative or absolute, depending on whence).
status = unlink(name): Removes the file name from the directory structure. If the file has no other names, it is deleted.
status = link(name1, name2): Adds a new name (name2) for a file (name1).
status = stat(name, buffer): Gets the file attributes for file name into buffer.
These operations are implemented in the Unix kernel and are the operations available in the non-distributed case. Programs cannot observe any discrepancies between cached copies and stored data after an update. This is called strict one-copy semantics. Suppose we want the files to be located on another machine...
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012
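Since the later examples in these slides use Java, here, as an aside, is what the same open/write/seek/read/close pattern looks like locally with java.io.RandomAccessFile. This is only an analogy to the UNIX calls above, not part of any distributed file system; the file name is made up.

import java.io.IOException;
import java.io.RandomAccessFile;

public class LocalFileOps {
    public static void main(String[] args) throws IOException {
        // open(name, mode): "rw" opens for reading and writing, creating the file if needed
        RandomAccessFile f = new RandomAccessFile("CoolData.txt", "rw");
        f.write("some data".getBytes());      // write(filedes, buffer, n)
        f.seek(0);                            // lseek(filedes, 0, SEEK_SET)
        byte[] buffer = new byte[9];
        int count = f.read(buffer);           // read(filedes, buffer, n)
        f.close();                            // close(filedes)
        System.out.println(count + " bytes: " + new String(buffer, 0, count));
    }
}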

Figure 12.5 File service architecture (a generic distributed file system). The client computer runs the application program and a client module; the server computer runs a flat file service and a directory service. The client module provides a single interface used by applications and emulates a traditional file system. The flat file service and the directory service both provide an RPC interface used by clients. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.6 Flat file service operations
Read(FileId, i, n) -> Data, throws BadPosition: If 1 ≤ i ≤ Length(File), reads a sequence of up to n items from the file starting at item i and returns it in Data.
Write(FileId, i, Data): If 1 ≤ i ≤ Length(File)+1, writes a sequence of Data to the file, starting at item i and extending the file if necessary.
Create() -> FileId: Creates a new file of length 0 and delivers a UFID for it.
Delete(FileId): Removes the file from the file store.
GetAttributes(FileId) -> Attr: Returns the file attributes for the file.
SetAttributes(FileId, Attr): Sets the file attributes (only those attributes that are not shaded in Figure 12.3).
The client module will make calls on these operations, and so will the directory service, acting as a client of the flat file service. Unique File Identifiers (UFIDs) are passed in on all operations except Create(). A Java-style sketch of this interface appears below.
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012
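As a rough illustration of the RPC interface idea, the flat file service operations of Figure 12.6 could be expressed as a Java RMI-style remote interface. This is only a sketch, not an implementation from the text; the type names (Attributes, BadPositionException) and the use of long for UFIDs are our own choices.

import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Placeholder types, invented for this sketch.
class Attributes implements Serializable {}
class BadPositionException extends Exception {}

// A Java rendering of the flat file service operations in Figure 12.6.
interface FlatFileService extends Remote {
    byte[] read(long fileId, long i, int n) throws BadPositionException, RemoteException;
    void write(long fileId, long i, byte[] data) throws BadPositionException, RemoteException;
    long create() throws RemoteException;                       // returns a new UFID
    void delete(long fileId) throws RemoteException;
    Attributes getAttributes(long fileId) throws RemoteException;
    void setAttributes(long fileId, Attributes attr) throws RemoteException;
}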

Figure 12.5 File service architecture (repeated), showing the flat file service operations invoked by the client module: read(FileId, ...), write(FileId, ...), create() -> FileId, delete(FileId), getAttributes(FileId), setAttributes(FileId). Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.7 Directory service operations
Lookup(Dir, Name) -> FileId, throws NotFound: Locates the text name in the directory and returns the relevant UFID. If Name is not in the directory, throws an exception.
AddName(Dir, Name, FileId), throws NameDuplicate: If Name is not in the directory, adds (Name, File) to the directory and updates the file's attribute record. If Name is already in the directory, throws an exception.
UnName(Dir, Name): If Name is in the directory, the entry containing Name is removed from the directory. If Name is not in the directory, throws an exception.
GetNames(Dir, Pattern) -> NameSeq: Returns all the text names in the directory that match the regular expression Pattern.
Primary purpose: translate text names to UFIDs. Each directory is stored as a conventional file, so the directory service is itself a client of the flat file service. Once a flat file service and a directory service are in place, it is a simple matter to build client modules that look like UNIX. A Java-style sketch of this interface appears below.
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012
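In the same spirit, a sketch of the directory service operations in Figure 12.7, again with invented Java types. A client module would combine the two services, for example calling lookup() to translate a name into a UFID and then read() on the flat file service.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Placeholder exceptions, invented for this sketch.
class NotFoundException extends Exception {}
class NameDuplicateException extends Exception {}

// A Java rendering of the directory service operations in Figure 12.7.
// Each directory is itself a file, identified here by a long UFID.
interface DirectoryService extends Remote {
    long lookup(long dir, String name) throws NotFoundException, RemoteException;
    void addName(long dir, String name, long fileId) throws NameDuplicateException, RemoteException;
    void unName(long dir, String name) throws NotFoundException, RemoteException;
    List<String> getNames(long dir, String pattern) throws RemoteException;
}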

Figure 12.5 File service architecture (repeated), showing the directory service operations invoked by the client module: lookup(dir, name) -> fileId, addName(dir, name, fileId), unName(dir, name), getNames(dir, pattern). We have seen this pattern before. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.5 File service architecture (repeated): the client module first presents a name to the directory service and receives a fileId; it then invokes operations on the flat file service with that fileId and receives data or status. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Two models for a distributed file system: the remote access model and the upload/download model. Figure 11-1, Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5

Tanenbaum and Van Steen, Distributed Systems, Section 10.1: File sharing semantics. UNIX semantics, or strict one-copy: a read after a write gets the value just written (Figure a). Session semantics: changes are initially visible only to the process that modifies the file, and become visible to others when the file is closed (Figure b). For session semantics, you might adopt "last writer wins" or transactions. Transactions make concurrent access appear serial.

NFS Goals: Be unsurprising and look like a UNIX file system. Implement the full POSIX API (the Portable Operating System Interface is an IEEE family of standards that describe how Unix-like operating systems should behave). Make your files available from any machine. Distribute the files so that we do not have to implement new protocols. NFS has been a major success. NFS was originally based on UDP and was stateless; TCP was added later. NFS defines a virtual file system: the NFS client pretends to be a real file system but makes RPC calls instead.

To deal with concurrent access, NFS v4 supports clients that require locks on files. A client informs the server of its intent to lock; the server may not grant the lock if it is already held. If granted, the client gets a lease (say, 3 minutes). If a client dies while holding a lock, its lease will expire. The client may renew a lease before the old lease expires, as sketched below.
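A minimal sketch of the lease idea, purely illustrative and not an actual NFSv4 client API; the class name, the 3-minute lease length, and the methods are invented for the example.

import java.time.Duration;
import java.time.Instant;

// Hypothetical client-side view of a lock lease.
public class LockLease {
    private static final Duration LEASE_LENGTH = Duration.ofMinutes(3);
    private Instant expiresAt;

    public LockLease() { this.expiresAt = Instant.now().plus(LEASE_LENGTH); }

    // The server would only honor lock-protected operations while the lease is valid.
    public boolean isValid() { return Instant.now().isBefore(expiresAt); }

    // Renew before expiry. If the client has crashed, no renewal happens
    // and the server reclaims the lock when the lease runs out.
    public void renew() { this.expiresAt = Instant.now().plus(LEASE_LENGTH); }
}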

Figure 12.8 NFS architecture. On the client computer, application programs make system calls into the UNIX kernel; a virtual file system layer routes local requests to the UNIX file system (or other file systems) and remote requests to the NFS client. On the server computer, the NFS server translates incoming requests into operations on its local UNIX file system. NFS uses RPC over TCP or UDP; external requests are translated into RPC calls on the server. The virtual file system module provides access transparency. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.9 NFS server operations (simplified) – 1
lookup(dirfh, name) -> fh, attr: Returns the file handle and attributes for the file name in the directory dirfh.
create(dirfh, name, attr) -> newfh, attr: Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.
remove(dirfh, name) -> status: Removes file name from directory dirfh.
getattr(fh) -> attr: Returns the file attributes of file fh. (Similar to the UNIX stat system call.)
setattr(fh, attr) -> attr: Sets the attributes (mode, user id, group id, size, access time and modify time) of a file. Setting the size to 0 truncates the file.
read(fh, offset, count) -> attr, data: Returns up to count bytes of data from a file starting at offset. Also returns the latest attributes of the file.
write(fh, offset, count, data) -> attr: Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.
rename(dirfh, name, todirfh, toname) -> status: Changes the name of file name in directory dirfh to toname in directory todirfh.
link(newdirfh, newname, dirfh, name): Creates an entry newname in the directory newdirfh which refers to the file name in the directory dirfh.
Continues on the next slide ...
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.9 NFS server operations (simplified) – 2
symlink(newdirfh, newname, string) -> status: Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string but makes a symbolic link file to hold it.
readlink(fh) -> string: Returns the string that is associated with the symbolic link file identified by fh.
mkdir(dirfh, name, attr) -> newfh, attr: Creates a new directory name with attributes attr and returns the new file handle and attributes.
rmdir(dirfh, name) -> status: Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.
readdir(dirfh, cookie, count) -> entries: Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file handle, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. If the value of cookie is 0, reads from the first entry in the directory.
statfs(fh) -> fsstats: Returns file system information (such as block size, number of free blocks and so on) for the file system containing a file fh.
The directory and file operations are integrated into a single service.
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.10 Local and remote file systems accessible on an NFS client Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2. A “mount point” is a particular point in the hierarchy. Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Andrew File System. Unlike NFS, the most important design goal is scalability; one enterprise AFS deployment at Morgan Stanley exceeds 25,000 clients. To achieve scalability, whole files are cached in client nodes. Why does this help with scalability? We reduce client-server interactions. A client cache would typically hold several hundred of the files most recently used on that computer. The cache is permanent, surviving reboots. When the client opens a file, the cache is examined and used if the file is available there. AFS provides no support for large shared databases or for the updating of records within files that are shared between client systems.

Andrew File System - Typical Scenario (modified from Coulouris). If the client code tries to open a file, the client cache is tried first. If the file is not there, a server is located and asked for the file. The copy is stored on the client side and opened. Subsequent reads and writes hit the copy on the client. When the client closes the file, if the file has changed it is sent back to the server; the client-side copy is retained for possible further use. Consider UNIX commands and libraries copied to the client, and consider files only used by a single user. These last two cases only require weak consistency, and they represent the vast majority of cases. Gain: your files are available from any workstation. Principle: make the common case fast (see Amdahl's Law). Measurements show only 0.4% of changed files were updated by more than one user during one week.

Figure 12.11 Distribution of processes in the Andrew File System Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.12 File name space seen by clients of AFS Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.13 System call interception in AFS Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.14 Implementation of file system calls in AFS. A callback promise is a token stored with the cached copy, marked either valid or cancelled. If a client closes a file that has changed, Vice makes RPC calls on all other clients to cancel their callback promises. In the event that two clients both write and then close, the last writer wins. On restart of a failed client (which may have missed callbacks), Venus sends a cache validation request to Vice. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 12.15 The main components of the Vice service interface
Fetch(fid) -> attr, data: Returns the attributes (status) and, optionally, the contents of the file identified by fid and records a callback promise on it.
Store(fid, attr, data): Updates the attributes and (optionally) the contents of a specified file.
Create() -> fid: Creates a new file and records a callback promise on it.
Remove(fid): Deletes the specified file.
SetLock(fid, mode): Sets a lock on the specified file or directory. The mode of the lock may be shared or exclusive. Locks that are not removed expire after 30 minutes.
ReleaseLock(fid): Unlocks the specified file or directory.
RemoveCallback(fid): Informs the server that a Venus process has flushed a file from its cache.
BreakCallback(fid): This call is made by a Vice server to a Venus process. It cancels the callback promise on the relevant file.
Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

CMU's Coda is an enhanced descendant of AFS. Very briefly, two important features are: disconnected operation for mobile computing, and continued operation during partial network failures in the server network. During normal operation, a user reads and writes to the file system normally, while the client fetches, or "hoards", all of the data the user has listed as important in the event of network disconnection. If the network connection is lost, the Coda client serves data from its local cache and logs all updates. Upon network reconnection, the client moves to a reintegration state and sends the logged updates to the servers. (From Wikipedia.) Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Google File System (GFS) Hadoop (HDFS) What is Hadoop? Sort of the opposite of virtual machines where one machine may act like many. Instead, with Hadoop, many machines act as one. Hadoop is an open source implementation of GFS. Microsoft has Dryad with similar goals. At its core, an operating system (like Hadoop) is all about: (a) storing files (b) running applications on top of files From “Introducing Apache Hadoop: The Modern Data Operating System”, Amr Awadallah

Figure 21.3 Organization of the Google physical infrastructure. Commodity PCs, which are assumed to fail. 40-80 PCs per rack. Racks are organized into clusters; each cluster has more than 30 racks. Each PC has more than 2 terabytes of storage, so 30 racks hold about 4.8 petabytes. All of Google stores more than 1 exabyte (10^18 bytes). (To avoid clutter, the Ethernet connections are shown from only one of the clusters to the external links.) Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Requirements of the Google File System (GFS): Run reliably with component failures. Solve the problems that Google needs solved: not a massive number of files, but massively large files are common. Write once, append, read many times. Access is dominated by long sequential streaming reads and sequential appends. No need for caching on the client. Throughput is more important than latency. Think of very large files, each holding a very large number of HTML documents scanned from the web; these need to be read and analyzed. This is not your everyday use of a distributed file system (unlike NFS and AFS). Not POSIX.

GFS. Each file is mapped to a set of fixed-size chunks. Each chunk is 64 MB in size. Each cluster has a single master and multiple (usually hundreds of) chunk servers. Each chunk is replicated on three different chunk servers. The master knows the locations of chunk replicas. The chunk servers know what replicas they have and are polled by the master on startup.

Figure 21.9 Overall architecture of GFS. Each GFS cluster has a single master, which manages the metadata, and hundreds of chunkservers. Data is replicated on three independent chunkservers; the replica locations are known by the master. With log files, the master is restorable after failure. Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

GFS – Reading a file sequentially Suppose a client wants to perform a sequential read, processing a very large file from a particular byte offset. The client can compute the chunk index from the byte offset. Client calls master with file name and chunk index. Master returns chunk identifier and the locations of replicas. Client makes call on a chunk server for the chunk and it is processed sequentially with no caching. It may ask for and receive several chunks.
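A small sketch of the chunk-index calculation described above, in Java and assuming the 64 MB chunk size from the earlier slide; the class and method names are ours, not a GFS API.

public class GfsChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB chunks

    // The client derives the chunk index from the byte offset, then asks the
    // master for that chunk's identifier and the locations of its replicas.
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;   // start reading 200 MB into the file
        System.out.println("chunk index: " + chunkIndex(offset));  // prints 3
    }
}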

GFS – Mutation operations Suppose a client wants to perform sequential writes to the end of a file. The client can compute the chunk index from the byte offset. This is the chunk holding End Of File. Client calls master with file name and chunk index. Master returns chunk identifier and the locations of replicas. One is designated as the primary. The client sends all data to all replicas. The primary coordinates with replicas to update files consistently across replicas.

MapReduce

MapReduce. Runs on Hadoop. Provides a clean abstraction on top of parallelization and fault tolerance. Easy to program; the parallelization and fault tolerance are automatic. Google uses over 12,000 different MapReduce programs over a wide variety of problems, and these are often pipelined. The programmer implements two interfaces: one for mappers and one for reducers. Map takes records from the source in the form of key-value pairs. The key might be a document name and the value a document, or the key might be a file offset and the value a line of the file. Map produces one or more intermediate values along with an output key. When the map phase is complete, all of the intermediate values for a given output key are combined into a list. The combiners run on the mapper machines.

MapReduce. Reduce combines the intermediate values into one or more final values for the same output key (usually one final value per key). The master tries to place a mapper on the same machine as its data, or nearby. A mapper object is initialized for each map task. In configuring a job, the programmer provides only a hint on the number of mappers to run; the final decision depends on the physical layout of the data. A reducer object is initialized for each reduce task. The reduce method is called once per intermediate key. The programmer can specify precisely the number of reduce tasks.

MapReduce – From the Google Paper Map (k1,v1) --> list(k2,v2) Reduce (k2, list(v2)) --> list(k3,v3) All values associated with one key are brought together in the reducer. Final output is written to the distributed file system, one file per reducer. The output may be passed to another MapReduce program.

More detail

Figure 21.18 Some examples of the use of MapReduce Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Figure 21.19 The overall execution of a MapReduce program Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Overall Execution of MapReduce
Mappers run on the input data scattered over n machines:
Data on Disk 1 => (key, value) => map1
Data on Disk 2 => (key, value) => map2
:
Data on Disk n => (key, value) => mapn
The map tasks produce (key, value) pairs:
map1 => (key 1, value) (key 2, value)
map2 => (key 1, value) (key 3, value) (key 1, value)
The output of each map task is collected and sorted on the key. These (key, value) pairs are passed to the reducers:
(key 1, value list) => reducer1 => list(value)
(key 2, value list) => reducer2 => list(value)
(key 3, value list) => reducer3 => list(value)
Maps run in parallel. Reducers run in parallel. The map phase must be completely finished before the reduce phase can begin. The combiner phase is run on mapper nodes after the map phase; this is a mini-reduce on local map output. For complex activities, it is best to pipe the output of one reducer to another mapper. A local simulation of this flow is sketched below.
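To make the shuffle step concrete, here is a minimal single-machine simulation of the map, sort/group, and reduce phases in plain Java, with no Hadoop involved. It is illustrative only; the class name and the two tiny "documents" are ours.

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Input: (documentName, contents) pairs, as in the word count example.
        Map<String, String> input = Map.of(
                "Doc1", "car bell",
                "Doc2", "car bell car");

        // Map phase: emit (word, 1) for every word in every document.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String contents : input.values())
            for (String word : contents.split("\\s+"))
                intermediate.add(Map.entry(word, 1));

        // Shuffle/sort phase: group all values for the same key together, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: sum the value list for each key.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + " " + sum);   // prints: bell 2, car 3
        }
    }
}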

MapReduce to Count Word Occurrences in Many Documents
Disk 1 => (Document name, Document) => map1 (on a machine near disk 1)
Disk 2 => (Document name, Document) => map2 (on a machine near disk 2)
Disk n => (Document name, Document) => mapn
map1 => (ball, 1) (game, 1)
map2 => (ball, 1) (team, 1) (ball, 1)
Gather the map output, sort it by key, and send the pairs to the reducers:
(ball, 1,1,1) => reducer => (ball, 3)
(game, 1) => reducer => (game, 1)
(team, 1) => reducer => (team, 1)

Some MapReduce Examples: 1. Count the number of occurrences of each word in a large collection of documents. 2. Distributed grep: count the number of lines with a particular pattern. 3. From a web server log, determine URL access frequency. 4. Reverse a web link graph: for a given URL, find the URLs of the pages pointing to it. 5. For each word, create a list of the documents containing it (same structure as 4). 6. Distributed sort of a large number of records with keys.

MapReduce Example (1)
Count the number of occurrences of each word in a large collection of documents.

// (K1, V1) -> List(K2, V2)
map(String key, String value)
    // key: document name
    // value: document contents
    for each word w in value
        emitIntermediate(w, "1")

// (K2, List(V2)) -> List(V2)
reduce(String key, Iterator values)
    // key: a word
    // values: a list of counts
    result = 0
    for each v in values
        result += v
    emit(key, result)

Example (Doc1, Doc2): map output (car,1), (bell,1), (car,1); after grouping, (bell,[1]), (car,[1,1]); reduce output (bell,1), (car,2).

MapReduce Example (2)
Distributed grep: count the number of lines with a particular pattern. Suppose searchString is "th".

// (K1, V1) -> List(K2, V2)
map(fileOffset, lineFromFile)
    if searchString in lineFromFile
        emitIntermediate(lineFromFile, 1)

// (K2, List(V2)) -> List(V2)
reduce(k2, iterator values)
    s = sum up values
    emit(s, k2)

Example input (offset, line): (0, the line), (8, a line), (14, the store), (22, the line).
Map output: (the line, 1), (the store, 1), (the line, 1).
After grouping: (the line, [1,1]), (the store, [1]).
Reduce output: (2, the line), (1, the store).

MapReduce Example (3)
From a web server log, determine URL access frequency.
Web page request log: URL1 was visited, URL1 was visited, URL2 was visited, ...

// (K1, V1) -> List(K2, V2)
map(offset, url)
    emitIntermediate(url, 1)

// (K2, List(V2)) -> List(V2)
reduce(url, values)
    sum values into total
    emit(url, total)

Example input (offset, url): (0, URL1), (45, URL1), (90, URL2), (135, URL1).
Map output: (URL1,1), (URL1,1), (URL2,1), (URL1,1).
After grouping: (URL1, [1,1,1]), (URL2, [1]).
Reduce output: (URL1, 3), (URL2, 1).

MapReduce Example (4)
Reverse a web link graph: for a given URL, find the URLs of the pages pointing to it.

// (K1, V1) -> List(K2, V2)
map(String sourceDocURL, sourceDoc)
    for each target in sourceDoc
        emitIntermediate(target, sourceDocURL)

// (K2, List(V2)) -> List(V2)
reduce(target, listOfSourceURLs)
    emit(target, listOfSourceURLs)

Example input: (URL1, {P1, P2, P3}), (URL2, {P1, P3}).
Map output: (P1, URL1), (P2, URL1), (P3, URL1), (P1, URL2), (P3, URL2).
Reduce output: (P1, (URL1, URL2)), (P2, (URL1)), (P3, (URL1, URL2)).

Example (5) has the same structure as Example (4).

MapReduce Example (6)
Distributed sort of a large number of records with keys.

// (K1, V1) -> List(K2, V2)
map(offset, record)
    sk = find sort key in record
    emitIntermediate(sk, record)

// (K2, List(V2)) -> List(V2)
reduce emits records unchanged

Example input: (0, k2, data), (20, k1, data), (30, k3, data).
Map output: (k2, data), (k1, data), (k3, data).
Because intermediate keys are sorted before the reduce phase, the final output is (k1, data), (k2, data), (k3, data).

Recall Example (1), word count: map emits (word, "1") for each word in each document, and reduce sums the counts for each word, e.g. (bell,[1]) and (car,[1,1]) become (bell,1) and (car,2). The Hadoop Java version follows.

Word Counting in Java - Mapper (using the offset into the file rather than the document name as the key)

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012

Word Counting in Java - Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Instructor's Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012
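The slides do not show the driver that wires these two classes together. As a rough sketch using the same old org.apache.hadoop.mapred API as the code above, a job could be configured roughly as follows; the job name and the use of command-line arguments for the input and output paths are our choices, and MapClass and Reduce are assumed to be the nested classes from the previous two slides.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  // MapClass and Reduce from the previous slides would be nested here.

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");                 // job name is arbitrary

    conf.setOutputKeyClass(Text.class);           // reducer output key type
    conf.setOutputValueClass(IntWritable.class);  // reducer output value type

    conf.setMapperClass(MapClass.class);          // the Mapper shown above
    conf.setReducerClass(Reduce.class);           // the Reducer shown above

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not already exist

    JobClient.runJob(conf);                       // submit the job and wait for completion
  }
}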

Computing π. Can you think of an embarrassingly parallel approach to approximating the value of π? Randomly throw one thousand darts at 100 square 1 x 1 boards, all with inscribed circles. Count the number of darts landing inside the circles and the number landing outside. Compute the fraction A = (landing inside) / (landing inside + landing outside). Since each square has area 1, A approximates the area of an inscribed circle, so A ≈ π r^2 = π (1/2)^2 = π/4 and π ≈ 4A.
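A minimal single-machine sketch of the dart-throwing estimate in Java. In a MapReduce setting, each map task could play the role of one board and a single reducer could sum the inside/outside counts. The split of 1,000 darts per board across 100 boards is our reading of the slide; any split gives the same form of estimate.

import java.util.Random;

public class MonteCarloPi {
    public static void main(String[] args) {
        Random rng = new Random();
        long inside = 0, total = 0;

        // 100 boards x 1,000 darts; each board is the unit square [0,1] x [0,1]
        // with a circle of radius 1/2 centered at (0.5, 0.5) inscribed in it.
        for (int board = 0; board < 100; board++) {
            for (int dart = 0; dart < 1000; dart++) {
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                if ((x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5) <= 0.25) {
                    inside++;          // dart landed inside the inscribed circle
                }
                total++;
            }
        }

        double a = (double) inside / total;   // fraction inside approximates pi/4
        System.out.println("pi is approximately " + 4 * a);
    }
}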