Distributed File Systems
Cullen Eason, Jordan Messec, Jason Propp, Richard Briglio
CS455: Distributed Systems, 2015 Spring Semester
Why is the Problem Important?
- We want answers to complex, relevant problems
- Massive data sets must be parsed and aggregated in a timely manner
- Effective solutions bring efficiency, reliability, profitability, and client trust
Problem Characterization
- Data storage needs are increasing: accessibility, reliability, consistency
- Google processes more than twenty petabytes of data daily; millions of pages are added to the internet daily; hundreds of petabytes in the future?
- Upgrading within data centers requires simple expansion and replacement while keeping services available
- Issues with standard file systems: files exist on separate servers and must be accessed through those servers directly, so location matters
- Load balancing: repeated access to a file should be spread across multiple servers
Trade-Off Space for Solutions in this Area
- Reliability vs. increased storage requirements: more replication gives more reliability but increases storage and metadata overhead
- Ease of access vs. single point of failure: centralizing system knowledge eases use but increases dependence on the master node
- Network efficiency vs. reliability: locality between replicas reduces network traffic but decreases reliability; load balancing on the master node reduces network traffic but adds stress to the master node
- CAP (consistency, availability, partition tolerance): only two can be fully guaranteed at one time
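The reliability-vs-storage trade-off above can be sketched with back-of-the-envelope arithmetic. This is an illustrative calculation, not a figure from the slides: the chunk size and per-replica metadata cost are assumptions (64 MB is the common GFS/HDFS default).

```python
CHUNK_SIZE_MB = 64          # assumed chunk size (common GFS/HDFS default)
METADATA_PER_REPLICA = 150  # assumed bytes of metadata tracked per replica

def storage_cost(data_gb, replication):
    """Return (total storage in GB, metadata bytes) for a replication factor."""
    chunks = (data_gb * 1024) // CHUNK_SIZE_MB
    total_storage_gb = data_gb * replication
    metadata_bytes = chunks * replication * METADATA_PER_REPLICA
    return total_storage_gb, metadata_bytes

# 1 TB of user data at replication factors 1, 3, and 5:
for r in (1, 3, 5):
    storage, meta = storage_cost(1024, r)
    print(f"replication={r}: {storage} GB stored, {meta} bytes of metadata")
```

Tripling the replication factor triples both the raw storage bill and the metadata the master node must hold, which is exactly the tension the bullet describes.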
Dominant Approaches to the Problem
- GFS: only available to Google; proprietary permissions system; oldest MapReduce-based DFS; weaker namenode splitting
- HDFS: open source and broadly used; POSIX permissions system; no physical security
- Lustre: open source; requires file-system-specific clients; multiple simultaneous control servers
- Ceph: does not need object lists; traffic flows directly between object storage clusters and clients
Dominant Approaches to the Problem (cont.)
- Virtual Resource Distance Management: addresses data and resource locality and slow resources; especially relevant with multiple data centers
- Load-rebalancing algorithms: necessitated by hardware failures, upgrades, or usage changes; chunk servers are classified as overloaded or underloaded, and underloaded servers seek out overloaded servers to take work from
- Power usage: data centers use large amounts of power, with electricity bills reaching billions of dollars; power usage levels can be split into "gears" for more efficiency under low usage; RABBIT outperforms PARAID for read operations but is worse for writes
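The load-rebalancing idea above can be sketched in a few lines. This is a minimal illustration of the overloaded/underloaded classification, not any published algorithm: the tolerance margin and the one-chunk-at-a-time transfer policy are assumptions.

```python
AVG_MARGIN = 0.1  # assumed tolerance band around the cluster-average load

def rebalance(loads):
    """loads: dict of server -> chunk count. Returns a list of (src, dst) moves."""
    avg = sum(loads.values()) / len(loads)
    over = [s for s, l in loads.items() if l > avg * (1 + AVG_MARGIN)]
    under = [s for s, l in loads.items() if l < avg * (1 - AVG_MARGIN)]
    moves = []
    # Underloaded servers pull one chunk at a time from overloaded servers
    # until both fall back inside the tolerance band.
    for dst in under:
        for src in over:
            while (loads[src] > avg * (1 + AVG_MARGIN)
                   and loads[dst] < avg * (1 - AVG_MARGIN)):
                loads[src] -= 1
                loads[dst] += 1
                moves.append((src, dst))
    return moves

loads = {"s1": 100, "s2": 10, "s3": 40}
rebalance(loads)
print(loads)  # every server ends within 10% of the average load of 50
```

A real system would move chunk replicas in bulk and weigh network locality before choosing a source, but the classify-then-pull structure matches the bullet above.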
Insights Gleaned
- Issues facing DFSs: data locality, single-point bottlenecks and failures, power consumption
- Solutions: hardware and architecture; software and algorithms
- Trade-offs: cost-benefit balance; the needs of specific developers and the DFS's application
What the Problem Space in the Future Will Look Like
- Vertical and horizontal expansion: fields already producing data will produce more (the web is expanding rapidly, new weather centers come online), and new fields are entering data collection (companies gather new kinds of information, e.g. length of customer phone calls, time of day of sales)
- Higher volume of queries: the user base will expand from large companies to individuals; big data analysis will be the new web search; the range of companies making data requests will expand geographically and across industries
- Volatile nodes: further integration of cell phones, laptops, glasses, watches, and other mobile devices, which will begin to be utilized as storage and analysis nodes in DFSs; possibly even human augmentations or implants
Trade-off Space and Solutions in the Future
- Future of the physical fabric for data transfer: Fibre Channel's multiple pipes vs. Ethernet's single pipe; multi-colored LEDs for optical fiber cables? Separation of traffic, with colors defining routes for data
- Geographic data locality: a master data center updates the other data centers, maintaining consistent data across geographically distant locations and reducing latency for data access
- Cheaper memory modules: one billion files on HDFS requires roughly 300 GB of memory on the namenode, at about 150 bytes per file, folder, and block
- Hybrid storage/memory: the fast access of RAM applied to storage, so separate RAM is no longer necessary
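The namenode memory figure above can be checked with simple arithmetic. The 150 bytes per namespace object comes from the slide; the assumption that each file contributes roughly two objects (one file record plus one block record) is ours, made so the total lands near the quoted 300 GB.

```python
BYTES_PER_OBJECT = 150  # per file, folder, or block (from the slide)
OBJECTS_PER_FILE = 2    # assumed: roughly one file record + one block record

def namenode_memory_gb(num_files):
    """Rough namenode heap estimate, in GiB, for num_files files."""
    return num_files * OBJECTS_PER_FILE * BYTES_PER_OBJECT / 1024**3

print(f"{namenode_memory_gb(1_000_000_000):.0f} GB")
```

One billion files works out to about 279 GiB under these assumptions, consistent with the ~300 GB the slide quotes, and it shows why namenode memory is the scaling bottleneck the bullet is pointing at.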