Distributed File Systems

Cullen Eason, Jordan Messec, Jason Propp, Richard Briglio
CS455: Distributed Systems, Spring 2015 Semester

Why is the Problem Important?
- We desire to find answers to complex and relevant problems
- Parsing and aggregation of massive data sets in a timely manner
- Effective solutions bring efficiency, reliability, profitability and client trust

Problem Characterization
- Data storage needs are increasing
  - Accessibility
  - Reliability
  - Consistency
- Google processes more than twenty petabytes of data daily
- Millions of pages added to the internet daily
- Hundreds of petabytes in the future?
- Upgrading within data centers
  - Needs simplicity in expansion/replacement while keeping services available
- Issues with standard file systems
  - Files exist on separate servers
  - Must be accessed through that server directly
  - Location matters
- Load balancing
  - Repeated access to a file should spread across multiple servers
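The load-balancing point above (repeated access to a file should spread across multiple servers) can be sketched with a simple round-robin replica picker. This is only an illustration: the file name, the server names, and the `ReplicaPicker` class are hypothetical, and real systems typically also weigh locality and current load.

```python
from collections import Counter
from itertools import cycle

# Hypothetical replica map: file name -> servers that hold a copy.
REPLICAS = {"logs/2015-03.dat": ["server-a", "server-b", "server-c"]}


class ReplicaPicker:
    """Rotate reads for each file across its replicas (round-robin)."""

    def __init__(self, replicas):
        # One endless iterator per file, cycling through its servers.
        self._cycles = {name: cycle(servers) for name, servers in replicas.items()}

    def pick(self, name):
        """Return the server that should handle the next read of `name`."""
        return next(self._cycles[name])


picker = ReplicaPicker(REPLICAS)
# Nine repeated reads of the same file are spread evenly over three servers.
hits = Counter(picker.pick("logs/2015-03.dat") for _ in range(9))
print(hits)
```

With a single-server file system, all nine reads would land on the same machine; here each replica serves a third of them.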

Trade-Off Space for Solutions in this Area
- Reliability vs Increased Storage Requirements
  - Increased replication gives more reliability, but results in higher storage and metadata requirements
- Ease of Access vs Single Point of Failure
  - Centralizing system knowledge eases use, but increases dependence on the Master Node
- Network Efficiency vs Reliability
  - Locality between replicas reduces network traffic but decreases reliability
  - Load balancing on the Master Node reduces network traffic, but reduces reliability by increasing stress on the Master Node
- CAP: Consistency, Availability, Partition Tolerance
  - At most two of the three can be guaranteed at one time
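The reliability versus storage trade-off can be made concrete with a little arithmetic. The sketch below assumes that replica failures are independent and uses made-up numbers (100 TB of logical data, a 1% chance of losing any one copy); neither figure comes from the slides, and the independence assumption is optimistic when replicas share a rack or power supply.

```python
def replication_cost(logical_tb, replicas, p_fail):
    """Return (raw storage in TB, probability that every replica is lost).

    Assumes each replica fails independently with probability p_fail.
    """
    raw_tb = logical_tb * replicas   # storage grows linearly with replicas
    p_loss = p_fail ** replicas      # loss probability shrinks geometrically
    return raw_tb, p_loss


for r in (1, 2, 3):
    raw, loss = replication_cost(100, r, 0.01)
    print(f"replicas={r}: raw storage={raw} TB, P(all copies lost)={loss:.0e}")
```

Each extra replica adds the full data size in storage cost while multiplying the loss probability by `p_fail`, which is why the choice of replication factor is a trade-off rather than a free win.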

Dominant Approaches to the Problem
- GFS
  - Only available to Google
  - Proprietary permissions system
  - Oldest MapReduce-based DFS
  - Poorer namenode splitting
- HDFS
  - Open source and broadly used
  - POSIX permissions system
  - No physical security
- Lustre
  - Open source
  - Specific file system clients
  - Multiple simultaneous control servers
- Ceph
  - Does not need object lists
  - Traffic goes directly between object storage clusters and clients

Dominant Approaches to the Problem
- Virtual Resource Distance Management
  - Addresses data and resource locality
  - Addresses slow resources
  - Especially relevant with multiple data centers
- Load Rebalancing Algorithms
  - Necessitated by hardware failures, upgrades, or usage changes
  - Classify chunk servers as overloaded or underloaded
  - Underloaded servers seek out overloaded servers to take work from
- Power Usage
  - Data centers use large amounts of power, with electricity bills reaching the billions
  - Power usage levels can be split into “gears” for more efficiency with low usage
  - RABBIT is better than PARAID for read operations, but worse for write operations
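The load rebalancing idea (classify chunk servers as overloaded or underloaded, then let underloaded servers take work from overloaded ones) might look roughly like the sketch below. It is a simplified, centralized illustration and not the algorithm from any particular paper: the `rebalance` function, the 20% `tolerance` band, and the one-chunk-at-a-time moves are all assumptions made for the example.

```python
def rebalance(loads, tolerance=0.2):
    """Move chunks from overloaded to underloaded servers.

    loads: dict of server name -> chunk count (the input is not modified).
    tolerance: assumed band around the mean load that counts as balanced.
    Returns (new loads, list of (from_server, to_server) moves).
    """
    loads = dict(loads)
    mean = sum(loads.values()) / len(loads)
    high, low = mean * (1 + tolerance), mean * (1 - tolerance)
    moves = []

    # Classify: servers below the band are underloaded.
    underloaded = [s for s in loads if loads[s] < low]

    for light in underloaded:
        # An underloaded server keeps taking chunks until it reaches the mean
        # or no overloaded server remains.
        while loads[light] < mean:
            heavy = max(loads, key=loads.get)
            if loads[heavy] <= high:
                break  # nobody is overloaded any more
            loads[heavy] -= 1
            loads[light] += 1
            moves.append((heavy, light))
    return loads, moves


balanced, moves = rebalance({"a": 100, "b": 10, "c": 40})
print(balanced, len(moves))
```

A real distributed file system would run this decision on each server using partial knowledge of its peers, and would also account for the network cost of every move.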

Insights Gleaned
- Issues facing DFS
  - Data locality
  - Single point bottlenecks and failures
  - Power consumption
- Solutions
  - Hardware and architecture
  - Software and algorithms
- Trade-offs
  - Cost-benefit balance
  - Needs of specific developers and the DFS's application

What the Problem Space in the Future Will Look Like
- Vertical and Horizontal Expansion
  - Current fields producing data will produce more
    - The web is expanding rapidly, new weather centers come online
  - New fields entering data collection
    - Companies collect new fields of information, e.g. length of customer phone calls, time of day of sales
- Higher volume of queries
  - User field will expand from large companies to individuals
  - Big data analysis will be the new web search
  - Range of companies making data requests will expand geographically and across industries
- Volatile Nodes
  - Cell phones, laptops, glasses, watches, and other mobile devices become further integrated and begin to be used as storage and analysis nodes in DFSs
  - Possibly even human augmentations or implants

Trade-off Space and Solutions in the Future
- Future of physical fabric for data transfer
  - Fibre Channel's multiple pipes vs. Ethernet's single pipe
  - Multi-colored LEDs for optical fiber cables?
    - Separation of traffic
    - Colors define routes for data
- Geographic data locality
  - Master data center updates other data centers
  - Maintaining consistent data across geographically different locations
  - Reduced latency for data access
- Cheaper memory modules
  - One billion files on HDFS = 300 GB of memory needed on the namenode
  - 150 bytes per file, folder, block
- Hybrid storage/memory
  - Fast access of RAM applied to storage
  - Separated RAM no longer necessary
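The namenode memory figure above can be checked with a short calculation. The 150 bytes per object comes from the slide; the assumption that each file occupies exactly one block (so one billion files means two billion namespace objects) is ours, as are the function and parameter names.

```python
BYTES_PER_OBJECT = 150  # per file, folder, or block (figure from the slide)


def namenode_memory_bytes(files, blocks_per_file=1, directories=0):
    """Estimate namenode heap needed to hold the namespace in memory.

    Every file, every block, and every directory counts as one object.
    blocks_per_file=1 is an assumption; large files need more blocks.
    """
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT


total = namenode_memory_bytes(1_000_000_000)
print(total / 10**9, "GB")
```

Under these assumptions the estimate reproduces the 300 GB quoted on the slide; files that span several blocks, or a deep directory tree, would push it higher.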