GPFS-FPO: A Cluster File System for Big Data Analytics
Prasenjit Sarkar, IBM Research
2011 Storage Developer Conference

GPFS-FPO: Motivation
- Motivation: SANs are built for latency, not bandwidth. The traditional SAN architecture hits an I/O bottleneck, while a shared-nothing environment scales out performance (1 server -> 1 GB/s, 100 servers -> 100 GB/s).
- Advantages of using GPFS: high scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity, POSIX semantics, workload isolation, and enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication).
- Challenges: adapt GPFS to shared-nothing clusters; maximize application performance relative to cost; failure is common and the network is a bottleneck.

GPFS: Wide Commercial Deployment
- High-performance computing: PERCS, Blue Gene, Linux, Windows
- Scalable file and web servers
- Archiving and storage management
- Databases and digital libraries
- Digital media
- OLAP, financial data management, engineering design, ...

GPFS-FPO: Architecture
[Architecture diagram: nodes 1..N, each with local or rack-attached disks, solid-state disks, and a cache, holding both data and metadata replicas alongside a SAN; accessed by InfoSphere BigInsights through POSIX and MapReduce interfaces]
- Replicated metadata: no single point of failure
- Multi-tiered storage with automated policy-based migration
- Multiple block sizes
- Smart caching
- Data layout control
- Disaster recovery
- High scale: 1000 nodes, PBs of data

Locality Awareness: make applications aware of where data is located
- The scheduler assigns tasks to the nodes where data chunks reside, so it needs to know the location of each chunk
- A GPFS API maps each chunk to its node location; the map is compacted to optimize its size
- Flow (see the sketch below): the task engine's scheduler reads the location map through the GPFS connector and schedules task F(D1) on node N1, where chunk D1 resides
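The following is a minimal, illustrative sketch of this idea in Python. The map format and the get_chunk_locations() helper are assumptions made up for illustration; GPFS-FPO exposes an equivalent block-location API to its Hadoop connector, but this is not that API.

```python
# Sketch of locality-aware scheduling over a chunk -> node map (hypothetical API).
from collections import defaultdict

def get_chunk_locations(path):
    """Stand-in for a block-location API: chunk id -> nodes holding a replica
    (the first entry is treated as the preferred/local replica)."""
    return {
        "D1": ["N1", "N3"],
        "D2": ["N2", "N1"],
        "D3": ["N3", "N2"],
    }

def schedule(path, task):
    """Assign one task per chunk to a node that stores that chunk."""
    assignments = defaultdict(list)
    for chunk, nodes in get_chunk_locations(path).items():
        assignments[nodes[0]].append((task, chunk))   # prefer the local replica
    return assignments

if __name__ == "__main__":
    for node, work in schedule("/gpfs/data/file1", "F").items():
        for task, chunk in work:
            print(f"schedule {task}({chunk}) on {node}")
```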

Chunks: allow applications to define their own logical block size
- Workloads have two types of access:
  - Small blocks (e.g., 512 KB) for index file lookups
  - Large blocks (e.g., 64 MB) for file scans
- Solution: multiple block sizes in the same file system
- New allocation scheme:
  - A block group factor is applied to the block size
  - Effective block size = block size * block group factor
  - File blocks are laid out based on the effective block size, so an application "chunk" maps onto a group of file-system blocks (e.g., an effective block size of 128 MB); see the sketch below
- Optimum block size: caching and pre-fetching effects favor larger block sizes
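Here is a small arithmetic sketch of the effective-block-size layout. The 1 MB file-system block size is an assumption chosen so that the slide's 128 MB effective block size falls out of a block group factor of 128; nothing here is GPFS code.

```python
# Illustrative only: how a block group factor turns the FS block size into a
# larger effective block size ("chunk"), and how file offsets map onto it.
BLOCK_SIZE = 1 * 1024 * 1024        # assumed FS block size: 1 MB
BLOCK_GROUP_FACTOR = 128            # gives the slide's 128 MB effective size

EFFECTIVE_BLOCK_SIZE = BLOCK_SIZE * BLOCK_GROUP_FACTOR

def chunk_index(offset):
    """Which application-level chunk (block group) a byte offset falls into."""
    return offset // EFFECTIVE_BLOCK_SIZE

def fs_blocks_of_chunk(idx):
    """The contiguous run of FS block numbers that make up one chunk."""
    first = idx * BLOCK_GROUP_FACTOR
    return range(first, first + BLOCK_GROUP_FACTOR)

if __name__ == "__main__":
    print(EFFECTIVE_BLOCK_SIZE)                  # 134217728 bytes = 128 MB
    print(chunk_index(200 * 1024 * 1024))        # offset 200 MB -> chunk 1
    print(list(fs_blocks_of_chunk(1))[:3])       # FS blocks 128, 129, 130, ...
```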

Write Affinity: allow applications to dictate layout
- Wide striping with multiple replicas:
  - No locality of data on a particular node
  - Bad for commodity architectures due to excessive network traffic
- Write affinity for a file or a set of files on a given node:
  - All relevant files are allocated on that node, near that node, or in the same rack
- Configurability of replicas: each replica can be configured for either wide striping or write affinity (see the sketch below)
- Layout trade-offs (diagram): a wide-striped layout reads over local disk plus the network; a write-affinity layout reads over only the local disk; combining write affinity with a wide-striped replica reads over only the local disk and recovers in parallel from the wide-striped replica
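A toy sketch of per-replica placement follows. The node names and the specific rule (first replica stays local, remaining replicas round-robin across the rest of the cluster) are assumptions used to illustrate the configurability described above, not GPFS internals.

```python
# Sketch: replica 1 follows write affinity, replicas 2..n are wide striped.
import itertools

NODES = ["N1", "N2", "N3", "N4"]
_stripe = itertools.count()          # round-robin cursor for wide striping

def place_block(writer, replicas=3):
    """Return the nodes holding each replica of one block."""
    placement = [writer]                                  # replica 1: write affinity
    others = [n for n in NODES if n != writer]
    for _ in range(replicas - 1):                         # replicas 2..n: wide striped
        placement.append(others[next(_stripe) % len(others)])
    return placement

if __name__ == "__main__":
    for b in range(4):
        print(f"block {b}: replicas on {place_block('N1')}")
```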

Hybrid allocation: treat metadata and data differently
- GPFS-FPO distributes metadata; with too many metadata nodes, the probability that any 3 metadata nodes go down at once increases, which means loss of data (assuming replication = 3)
- Typical metadata profile in GPFS-FPO: one or two metadata nodes per rack (failure domain), with random access patterns
- Observation: metadata is suited to regular GPFS allocation, data is suited to GPFS-FPO allocation
- Solution: make the allocation type a storage-pool property (see the sketch below)
  - Metadata pool: no write affinity (wide striped across racks)
  - Data pools: write affinity
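The sketch below models allocation type as a pool property. The pool names, node names, and the write_affinity attribute are illustrative assumptions, not GPFS configuration syntax.

```python
# Toy model of hybrid allocation: the metadata ("system") pool wide-stripes,
# while data pools use write affinity.
from dataclasses import dataclass
import random

@dataclass
class StoragePool:
    name: str
    write_affinity: bool        # pool-level property, as described on the slide

POOLS = {
    "system": StoragePool("system", write_affinity=False),   # metadata
    "data":   StoragePool("data",   write_affinity=True),    # user data
}
NODES = ["rack1-n1", "rack1-n2", "rack2-n1", "rack2-n2", "rack3-n1"]

def choose_node(pool_name, writer):
    pool = POOLS[pool_name]
    if pool.write_affinity:
        return writer                       # keep the first replica local
    return random.choice(NODES)             # wide stripe across the cluster

if __name__ == "__main__":
    print("metadata block ->", choose_node("system", writer="rack1-n1"))
    print("data block     ->", choose_node("data",   writer="rack1-n1"))
```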

Pipelined Replication: efficient replication of data
- Diagram: the client writes to the first replica node, which forwards blocks down the chain (client -> N1 -> N2 -> N3, with blocks B0, B1, B2 in flight along the pipeline); a minimal sketch follows
- Supports a higher degree of replication
- Pipelining gives lower latency; a single source does not utilize the bandwidth effectively
- Able to saturate a 1 Gbps Ethernet link
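The following sketch illustrates why pipelining keeps all links busy. It is a timing model under simple assumptions (one block per hop per time step), not the GPFS wire protocol.

```python
# Store-and-forward pipeline: each block goes to the first replica, which
# forwards it down the chain, so transfers to different replicas overlap
# instead of fanning out from one source.
def pipeline_write(blocks, chain):
    """Yield (time_step, node, block) events for a replication pipeline."""
    for step in range(len(blocks) + len(chain) - 1):
        for hop, node in enumerate(chain):
            blk = step - hop
            if 0 <= blk < len(blocks):
                yield step, node, blocks[blk]

if __name__ == "__main__":
    for step, node, block in pipeline_write(["B0", "B1", "B2"], ["N1", "N2", "N3"]):
        print(f"t={step}: {node} receives {block}")
    # At t=2, N1 receives B2 while N2 receives B1 and N3 receives B0:
    # all three links are busy at once, unlike single-source replication.
```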

Fast Recovery: recover quickly from disk failures
- Handling failures is policy driven, with node-failure and disk-failure triggers
- Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced
- Fast recovery:
  - Incremental recovery: keep track of changes during the failure and recover only what is needed (see the sketch below)
  - Distributed restripe: the restripe load is spread out over all the nodes
  - Quickly figure out which blocks need to be recovered when a node fails
- Example (diagram): a well-replicated file; N2 reboots, so new writes go only to N1 and N3; when N2 comes back, it catches up by copying the missed writes (deltas) from N1 and N3 in parallel
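The sketch below shows the bookkeeping behind incremental recovery: while a replica node is down, record which blocks it missed, and on rejoin copy only those deltas. The data structures are assumptions for illustration, not GPFS internals.

```python
from collections import defaultdict

class ReplicaTracker:
    """Track blocks a down node misses so only deltas are copied on rejoin."""

    def __init__(self):
        self.down = set()
        self.missed = defaultdict(set)          # node -> block ids written while down

    def node_failed(self, node):
        self.down.add(node)

    def write(self, block, replica_nodes):
        for n in replica_nodes:
            if n in self.down:
                self.missed[n].add(block)       # remember the delta for later

    def node_recovered(self, node):
        self.down.discard(node)
        return sorted(self.missed.pop(node, set()))   # only these blocks are re-copied

if __name__ == "__main__":
    t = ReplicaTracker()
    t.node_failed("N2")
    t.write("B7", ["N1", "N2", "N3"])
    t.write("B9", ["N1", "N2", "N3"])
    print("copy to N2 on rejoin:", t.node_recovered("N2"))   # ['B7', 'B9']
```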

Snapshots (a.k.a. FlashCopy)
- Snapshot: a logical, read-only copy of the file system at a point in time
- Typical uses: recover from accidental operations (most errors), disaster recovery
- Copy-on-write: a file in a snapshot does not occupy disk space until the file is modified or deleted, so snapshots are space efficient and fast to create
- Snapshots are accessible through the ".snapshots" sub-directory in the file system root directory (usage sketch below), e.g.:
  - /gpfs/.snapshots/snap1 - last week's snapshot
  - /gpfs/.snapshots/snap2 - yesterday's snapshot
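Because snapshots appear under the file system's ".snapshots" directory (per the slide), an accidental change can be undone with an ordinary copy. The file path below is hypothetical.

```python
# Usage sketch: restore one file from yesterday's snapshot with a plain copy.
import shutil

SNAPSHOT = "/gpfs/.snapshots/snap2"      # yesterday's snapshot (path from the slide)
LIVE = "/gpfs"

def restore(relative_path):
    """Copy one file back from the snapshot into the live file system."""
    shutil.copy2(f"{SNAPSHOT}/{relative_path}", f"{LIVE}/{relative_path}")

if __name__ == "__main__":
    restore("projects/report.txt")       # hypothetical file
```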

Caching and Disaster Recovery
- Diagram: clients write to a Panache+GPFS cluster configured as primary, which pushes all updates, asynchronously or synchronously, to a Panache+GPFS cluster configured as secondary
- Establish a per-fileset replication relationship using Panache:
  - Configure Panache in a primary-secondary relation
  - Configure the fileset to be in "exclusive-writer" mode
  - Reads at the primary are all local
- Only deltas are pushed (see the sketch below):
  - Panache tracks the exact filesystem operation
  - Only the updated byte range is pushed
- Multi-site snapshots for a consistent copy: a snapshot can be taken at home when the changed data is pushed
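The sketch below illustrates byte-range delta push: the primary records each write as a (path, offset, data) operation and replays only that range at the secondary, instead of re-sending whole files. It models files in memory and is not Panache's actual mechanism.

```python
# Illustrative delta-push model: only updated byte ranges cross the wire.
primary_fs, secondary_fs, update_log = {}, {}, []

def _apply(fs, path, offset, data):
    buf = fs.setdefault(path, bytearray())
    buf.extend(b"\0" * max(0, offset + len(data) - len(buf)))   # grow if needed
    buf[offset:offset + len(data)] = data

def primary_write(path, offset, data):
    _apply(primary_fs, path, offset, data)          # apply locally first
    update_log.append((path, offset, data))         # track only the updated range

def push_deltas():
    """Asynchronous (or synchronous) push of queued byte-range deltas."""
    while update_log:
        path, offset, data = update_log.pop(0)
        _apply(secondary_fs, path, offset, data)    # replay at the secondary

if __name__ == "__main__":
    primary_write("/gpfs/db/table", 0, b"hello world")
    primary_write("/gpfs/db/table", 6, b"panache")
    push_deltas()
    print(secondary_fs["/gpfs/db/table"])           # bytearray(b'hello panache')
```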

Information Lifecycle Management (ILM)
- What does it offer?
  - One global file system namespace across multiple pools of independent storage
  - Files in the same directory can be in different pools
- GPFS ILM abstractions:
  - Storage pool: a group of storage volumes (disk or tape)
  - Policy: rules for placing files into storage pools, migrating files, and retention; GPFS policy rules are much richer than the conventional HSM test of "how big is the file and when was it last touched" (see the sketch below)
  - Tiered storage: create files on fast, reliable storage (e.g., solid state), move files to slower storage as they age, then to tape (a.k.a. HSM)
  - Differentiated storage: place media files on storage with high throughput, databases on storage with high I/Os per second
  - Grouping: keep related files together, e.g., for failure containment or project storage
- Diagram: GPFS clients run POSIX applications with placement policies and talk over the GPFS RPC protocol to a manager node (cluster, lock, quota, allocation, and policy managers); the system pool and data pools (SSD, SAS, SATA) sit on the storage network, with external (tape) pools such as HPSS, TSM, MAID, or VTL behind them
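The sketch below is a toy model of richer-than-HSM rules: initial placement by file type plus age-based migration across tiers. It is deliberately written in Python rather than the GPFS policy language; the pool names and thresholds are illustrative assumptions.

```python
# Toy ILM rules: placement by file type, migration by age across tiers.
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    size_mb: int
    days_since_access: int

def placement_pool(f):
    """Initial placement: index/database files on SSD, media on SAS, rest on SATA."""
    if f.name.endswith((".idx", ".db")):
        return "ssd"
    if f.name.endswith((".mp4", ".wav")):
        return "sas"
    return "sata"

def migration_target(f, current_pool):
    """Age files off fast pools; very cold files go to the external tape pool."""
    if current_pool == "ssd" and f.days_since_access > 30:
        return "sata"
    if f.days_since_access > 365:
        return "tape"
    return current_pool

if __name__ == "__main__":
    f = FileInfo("scan.mp4", size_mb=900, days_since_access=400)
    print(placement_pool(f), "->", migration_target(f, placement_pool(f)))
```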

Comparison with HDFS and MapR

File System                         | GPFS-FPO                                      | HDFS                                           | MapR
Robust                              | No single point of failure                    | NameNode vulnerability                         | No single point of failure
Data Integrity                      | High                                          | Evidence of data loss                          | New file system
Scale                               | Thousands of nodes                            | Hundreds of nodes                              |
POSIX Compliance                    | Full - supports a wide range of applications  | Limited                                        |
Data Management                     | Security, backup, replication                 | Limited                                        |
MapReduce Performance               | Good                                          |                                                |
Workload Isolation                  | Supports disk isolation                       | No support                                     | Limited support
Traditional Application Performance | Good                                          | Poor performance with random reads and writes  | Limited

Roadmap
- 4Q12: GA
- 1H13: WAN caching, preliminary disaster recovery, advanced write affinity, KVM support
- 1Q14: encryption, automated disaster recovery