
1  GPFS-FPO: A Cluster File System for Big Data Analytics
Prasenjit Sarkar, IBM Research
2011 Storage Developer Conference

2  GPFS-FPO: Motivation
Motivation: SANs are built for latency, not bandwidth. In the traditional architecture the SAN becomes the I/O bottleneck; a shared-nothing environment scales performance out with the cluster (1 server -> 1 GB/s, 100 servers -> 100 GB/s).
Advantages of using GPFS: high scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity, POSIX semantics, workload isolation, and enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication).
Challenges: adapt GPFS to shared-nothing clusters; maximize application performance relative to cost; failure is common and the network is a bottleneck.

3  GPFS: Wide Commercial Deployment
High-performance computing: PERCS, Blue Gene, Linux, Windows
Scalable file and Web servers
Archiving and storage management
Databases and digital libraries
Digital media
OLAP, financial data management, engineering design, …

4  GPFS-FPO: Architecture
Shared-nothing cluster of Node 1 … Node N: each node holds its share of data and metadata plus a cache, on local or rack-attached disks and solid-state disks; a SAN can also back part of the cluster.
Replicated metadata: no single point of failure.
Multi-tiered storage with automated policy-based migration.
Access interfaces: POSIX, MapReduce, InfoSphere BigInsights.
Feature callouts: multiple block sizes, smart caching, data layout control, disaster recovery, high scale (1000 nodes, PBs of data).

5  Locality Awareness: make applications aware of where data is located
The scheduler assigns tasks to the nodes where the data chunks reside, so it needs to know the location of each chunk.
A GPFS API maps each chunk to its node location, and the map is compacted to optimize its size.
Diagram: the application's GPFS connector reads the location map (e.g. chunk D1 -> node N1) and the scheduler then schedules task F(D1) on N1.
A minimal sketch of this locality-aware assignment follows.
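A minimal sketch of the idea, not the actual GPFS connector API: the chunk and node names, the run-length compaction, and the helper functions below are all illustrative assumptions.

```python
# Hypothetical illustration of locality awareness: the file system exposes a
# chunk -> node map, the map is compacted (run-length encoding over
# consecutive chunks on the same node), and the scheduler places each task
# on a node that holds its input chunk.

def compact(locations):
    """Run-length encode a list of per-chunk node locations.

    [('D1', 'N1'), ('D2', 'N1'), ('D3', 'N2')] -> [(2, 'N1'), (1, 'N2')]
    """
    compacted = []
    for _, node in locations:
        if compacted and compacted[-1][1] == node:
            compacted[-1] = (compacted[-1][0] + 1, node)
        else:
            compacted.append((1, node))
    return compacted

def schedule(locations, task):
    """Assign task(chunk) to the node where that chunk resides."""
    return [(node, f"{task}({chunk})") for chunk, node in locations]

locations = [("D1", "N1"), ("D2", "N1"), ("D3", "N2")]
print(compact(locations))        # [(2, 'N1'), (1, 'N2')]
print(schedule(locations, "F"))  # [('N1', 'F(D1)'), ('N1', 'F(D2)'), ('N2', 'F(D3)')]
```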

6  Chunks: let applications define their own logical block size
Workloads have two types of access: small blocks (e.g. 512 KB) for index file lookups and large blocks (e.g. 64 MB) for file scans.
Solution: multiple block sizes in the same file system, using a new allocation scheme (block size plus a metablock/block group factor).
A block group factor is applied to the block size: effective block size = block size * block group factor, and file blocks are laid out based on the effective block size (e.g. an effective block size of 128 MB). An application "chunk" therefore maps onto a group of file-system blocks.
Optimum block size: caching and pre-fetching effects favor larger block sizes. The arithmetic is sketched below.
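A small sketch of the arithmetic only. The 128 MB effective block size comes from the slide; the 1 MiB base block size and the helper names are assumptions made for the example.

```python
# Effective block size = file-system block size * block group factor; a file
# offset then maps to a "chunk" of effective-block-size bytes that is laid
# out as one contiguous group of file-system blocks.

MiB = 1024 * 1024

def effective_block_size(block_size, block_group_factor):
    return block_size * block_group_factor

def chunk_index(offset, block_size, block_group_factor):
    return offset // effective_block_size(block_size, block_group_factor)

# Assumed example: 1 MiB file-system blocks grouped by a factor of 128 give
# the 128 MiB effective block ("chunk") used for MapReduce-style scans.
ebs = effective_block_size(1 * MiB, 128)
print(ebs // MiB)                            # 128
print(chunk_index(200 * MiB, 1 * MiB, 128))  # offset 200 MiB falls in chunk 1
```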

7  Write Affinity: allow applications to dictate layout
Wide striping with multiple replicas gives no locality of data on any particular node, which is bad for commodity architectures because of excessive network traffic.
Write affinity for a file or a set of files on a given node: all relevant files are allocated on that node, near that node, or in the same rack.
Replicas are configurable individually: each replica can use either wide striping or write affinity.
Diagram (reading File 1 across nodes N1-N4): with a wide-striped layout, reads go over local disk plus the network; with a write-affinity layout, reads use only the local disk; combining write affinity with wide-striped replicas keeps reads local while allowing parallel recovery from the wide-striped replica.
A replica-placement sketch follows.
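A placement sketch under assumed names (the node list, rack map, and choose_replica_nodes are invented for illustration; this is not the GPFS allocation code). It places the first replica on the writing node (write affinity) and the remaining replicas wide-striped outside the writer's rack.

```python
# Per-replica layout choice: replica 1 uses write affinity (the writing
# node); replicas 2..r are wide-striped across nodes outside the writer's
# rack, so a local-disk loss can still be recovered from the striped copies.

import itertools

RACK = {"N1": "rack1", "N2": "rack1", "N3": "rack2", "N4": "rack2", "N5": "rack3"}

def choose_replica_nodes(writer, replicas=3):
    local = [writer]
    remote = [n for n in RACK if RACK[n] != RACK[writer]]
    stripe = itertools.cycle(remote)          # round-robin wide striping
    return local + [next(stripe) for _ in range(replicas - 1)]

print(choose_replica_nodes("N1"))  # ['N1', 'N3', 'N4']
```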

8  Hybrid allocation: treat metadata and data differently
GPFS-FPO distributes metadata, but too many metadata nodes increase the probability that some 3 metadata nodes go down together, which (at replication = 3) means loss of data.
Typical metadata profile in GPFS-FPO: one or two metadata nodes per rack (failure domain), with random access patterns.
Observation: metadata is suited to regular GPFS allocation while data is suited to GPFS-FPO allocation, so the allocation type is made a storage-pool property.
Metadata pool: no write affinity (striped across racks 1-3 in the diagram). Data pools: write affinity.
A back-of-the-envelope illustration of the failure-probability argument follows.
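A back-of-the-envelope illustration only, assuming independent node failures with an invented per-node failure probability p: with replication 3 and metadata spread over m metadata nodes, losing any 3 of them at once can lose some metadata, and the chance of 3 simultaneous failures grows quickly with m.

```python
# P(at least 3 of m metadata nodes are down simultaneously), assuming
# independent failures with per-node probability p (illustrative model only,
# not a measurement from the talk).
from math import comb

def prob_three_or_more_down(m, p):
    return 1.0 - sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(3))

for m in (3, 6, 30, 300):
    # The probability rises sharply as the number of metadata nodes grows,
    # which is why GPFS-FPO keeps only one or two metadata nodes per rack.
    print(f"{m:4d} metadata nodes -> {prob_three_or_more_down(m, p=0.01):.2e}")
```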

9  Pipelined Replication: efficient replication of data
The client writes each block to the first replica node (N1), which forwards it to N2, which forwards it to N3, forming a write pipeline (diagram: blocks B0-B2 in flight across nodes 1-5 and disks 1-5).
This supports a higher degree of replication, and pipelining lowers latency: a single source pushing all replicas cannot utilize its bandwidth effectively, while the pipeline is able to saturate a 1 Gbps Ethernet link.
A toy model of the pipeline follows.
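A toy in-memory model of the idea (ReplicaNode, client_write, and the packet size are invented for illustration, and real replication happens over the network and onto disk): each replica forwards a packet downstream as soon as it arrives instead of the client sending the whole block to every replica itself.

```python
# Pipelined replication sketch: the client talks only to the head of the
# pipeline; each node stores a packet locally and immediately forwards it to
# the next replica, so packets overlap in flight along the chain.

PACKET_SIZE = 64 * 1024  # assumed packet granularity

class ReplicaNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()

    def receive(self, packet: bytes):
        self.stored += packet          # stand-in for a local disk write
        if self.downstream:            # forward right away down the pipeline
            self.downstream.receive(packet)

def client_write(block: bytes, head: ReplicaNode):
    for off in range(0, len(block), PACKET_SIZE):
        head.receive(block[off:off + PACKET_SIZE])

# Pipeline N1 -> N2 -> N3; the client only sends to N1.
n3 = ReplicaNode("N3")
n2 = ReplicaNode("N2", downstream=n3)
n1 = ReplicaNode("N1", downstream=n2)
client_write(b"x" * (256 * 1024), n1)
assert n1.stored == n2.stored == n3.stored
```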

10  Fast Recovery: recover quickly from disk failures
Handling failures is policy driven, with triggers for node failures and disk failures.
Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced.
Fast recovery is incremental: keep track of changes made during the failure, recover only what is needed, and quickly figure out which blocks need to be recovered when a node fails.
Distributed restripe: the restripe load is spread out over all the nodes.
Example (diagram): a well-replicated file on N1-N3; N2 is rebooted, so new writes go only to N1 and N3; when N2 comes back it catches up by copying the missed writes (the deltas) from N1 and N3 in parallel.
The delta-tracking bookkeeping is sketched below.
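A sketch of the incremental-recovery bookkeeping (the class and method names are invented for illustration): while a replica is down, the surviving copies record which blocks changed, and when the node returns only those blocks are copied back.

```python
# Track deltas while a replica is unavailable, then recover only the changed
# blocks instead of re-replicating the whole file.

class ReplicatedFile:
    def __init__(self, nodes):
        self.copies = {n: {} for n in nodes}   # node -> {block_id: data}
        self.down = set()
        self.dirty = {}                        # node -> blocks missed while down

    def fail(self, node):
        self.down.add(node)
        self.dirty[node] = set()

    def write(self, block_id, data):
        for node in self.copies:
            if node in self.down:
                self.dirty[node].add(block_id)   # remember the missed write
            else:
                self.copies[node][block_id] = data

    def recover(self, node):
        """Node is back: copy only the blocks written while it was down."""
        self.down.discard(node)
        source = next(n for n in self.copies if n not in self.down and n != node)
        for block_id in self.dirty.pop(node, ()):
            self.copies[node][block_id] = self.copies[source][block_id]

f = ReplicatedFile(["N1", "N2", "N3"])
f.write(0, b"a")
f.fail("N2")
f.write(1, b"b")          # N2 misses this write
f.recover("N2")           # N2 copies only block 1
assert f.copies["N2"] == f.copies["N1"]
```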

11  Snapshots (a.k.a. FlashCopy)
A snapshot is a logical, read-only copy of the file system at a point in time.
Typical uses: recovering from accidental operations (most errors) and disaster recovery.
Copy-on-write: a file in a snapshot does not occupy disk space until the file is modified or deleted, so snapshots are space efficient and snapshot creation is fast.
Snapshots are accessible through the “.snapshots” sub-directory in the file system root directory, e.g.:
/gpfs/.snapshots/snap1 – last week’s snapshot
/gpfs/.snapshots/snap2 – yesterday’s snapshot
The copy-on-write idea is illustrated below.
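A toy illustration of copy-on-write on an in-memory block map, not the GPFS implementation: a snapshot starts out sharing every block with the live file system and only keeps its own copy of a block once the live copy is overwritten.

```python
# Copy-on-write snapshot sketch: creating a snapshot copies no data; the old
# version of a block is preserved only when the live file system first
# overwrites it, which is why snapshots are space efficient and fast to take.

class SnapshottedFS:
    def __init__(self):
        self.live = {}         # block_id -> data
        self.snapshots = {}    # name -> (block ids at creation, preserved old blocks)

    def create_snapshot(self, name):
        self.snapshots[name] = (set(self.live), {})   # record ids only, copy nothing

    def write(self, block_id, data):
        # Before overwriting, preserve the old version for any snapshot that
        # still shares this block with the live file system.
        for ids, preserved in self.snapshots.values():
            if block_id in ids and block_id not in preserved:
                preserved[block_id] = self.live[block_id]
        self.live[block_id] = data

    def read_snapshot(self, name, block_id):
        ids, preserved = self.snapshots[name]
        if block_id not in ids:
            return None                                   # created after the snapshot
        return preserved.get(block_id, self.live[block_id])  # preserved or shared copy

fs = SnapshottedFS()
fs.write(0, b"v1")
fs.create_snapshot("snap1")
fs.write(0, b"v2")                   # old contents preserved on first overwrite
print(fs.read_snapshot("snap1", 0))  # b'v1'
print(fs.live[0])                    # b'v2'
```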

12  Caching and Disaster Recovery
A per-fileset replication relationship is established using Panache: one Panache+GPFS cluster is configured as the primary and another as the secondary, the fileset is put in “exclusive-writer” mode, and reads at the primary are all local.
All updates are pushed to the secondary, asynchronously or synchronously, and only deltas are pushed: Panache tracks the exact file-system operation, so only the updated byte range is transferred.
Multi-site snapshots provide a consistent copy: a snapshot can be taken at the home site once the changed data has been pushed.
A byte-range delta tracker is sketched below.
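A sketch of the only-push-deltas idea (DeltaTracker and its methods are invented names; Panache's actual mechanism tracks the file-system operations themselves): record the byte ranges each write touches at the primary, then transfer only those ranges to the secondary.

```python
# Track the byte ranges modified at the primary and push only those ranges to
# the secondary copy, instead of re-sending whole files.

class DeltaTracker:
    def __init__(self):
        self.dirty = {}                 # path -> list of (offset, length)

    def record_write(self, path, offset, data):
        self.dirty.setdefault(path, []).append((offset, len(data)))

    def push(self, primary, secondary):
        """Copy only the recorded byte ranges from primary to secondary files."""
        for path, ranges in self.dirty.items():
            src, dst = primary[path], secondary.setdefault(path, bytearray())
            for offset, length in ranges:
                end = offset + length
                if len(dst) < end:
                    dst.extend(b"\0" * (end - len(dst)))
                dst[offset:end] = src[offset:end]
        self.dirty.clear()

primary = {"/gpfs/f1": bytearray(b"hello world")}
secondary = {}
t = DeltaTracker()
t.record_write("/gpfs/f1", 6, b"world")   # only bytes 6..11 will be pushed
t.push(primary, secondary)
print(bytes(secondary["/gpfs/f1"]))       # b'\x00\x00\x00\x00\x00\x00world'
```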

13  Information Lifecycle Management (ILM)
What does it offer? One global file-system namespace across multiple pools of independent storage; files in the same directory can be in different pools.
GPFS ILM abstractions:
Storage pool – a group of storage volumes (disk or tape).
Policy – rules for placing files into storage pools, migrating files, and retention; GPFS policy rules are much richer than the conventional HSM criteria of “how big is the file and when was it last touched”.
Tiered storage – create files on fast, reliable storage (e.g. solid state) and move them as they age to slower storage and then to tape (a.k.a. HSM).
Differentiated storage – place media files on storage with high throughput and databases on storage with high I/Os per second.
Grouping – keep related files together, e.g. for failure containment or project storage.
Diagram: applications on GPFS clients apply the GPFS placement policy and reach the pools over the GPFS RPC protocol and a storage network; the pools include a system pool and data pools on SSD, SAS, and SATA disks plus external (tape) pools such as HPSS, TSM, MAID, and VTL; a GPFS manager node runs the cluster, lock, quota, allocation, and policy managers.
A toy policy evaluator is sketched below.
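Not the GPFS policy language itself, but a toy evaluator in the same spirit (the rule names, pool names, and thresholds below are invented): placement rules pick a pool when a file is created, and an age-based migration rule moves cold files to a slower pool.

```python
# Toy ILM-style rules: placement by file type at creation time and age-based
# migration between pools. Real GPFS rules are expressed in its SQL-like
# policy language and are considerably richer than this sketch.

import time

PLACEMENT_RULES = [
    ("media-to-sas", lambda f: f["name"].endswith((".mp4", ".wav")), "sas_pool"),
    ("db-to-ssd",    lambda f: f["name"].endswith(".db"),            "ssd_pool"),
    ("default",      lambda f: True,                                 "sata_pool"),
]

def place(file_meta):
    """Return the pool chosen by the first matching placement rule."""
    for _, predicate, pool in PLACEMENT_RULES:
        if predicate(file_meta):
            return pool

def migrate(file_meta, now=None, max_idle_days=90):
    """Move files not accessed within max_idle_days to the tape pool."""
    now = now or time.time()
    idle_days = (now - file_meta["atime"]) / 86400
    return "tape_pool" if idle_days > max_idle_days else file_meta["pool"]

f = {"name": "movie.mp4", "atime": time.time() - 200 * 86400, "pool": None}
f["pool"] = place(f)   # 'sas_pool' (high-throughput media placement)
print(migrate(f))      # 'tape_pool' after 200 idle days
```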

14  Comparison with HDFS and MapR
Robustness: GPFS-FPO – no single point of failure; HDFS – NameNode vulnerability; MapR – no single point of failure.
Data integrity: GPFS-FPO – high; HDFS – evidence of data loss; MapR – new file system.
Scale: GPFS-FPO – thousands of nodes; HDFS and MapR – hundreds of nodes.
POSIX compliance: GPFS-FPO – full, supports a wide range of applications; HDFS and MapR – limited.
Data management: GPFS-FPO – security, backup, replication; HDFS and MapR – limited.
MapReduce performance: GPFS-FPO – good.
Workload isolation: GPFS-FPO – supports disk isolation; HDFS – no support; MapR – limited support.
Traditional application performance: GPFS-FPO – good; HDFS – poor performance with random reads and writes; MapR – limited.

15  Roadmap
4Q12: GA
1H13: WAN caching and preliminary disaster recovery; advanced write affinity; KVM support
1Q14: encryption; automated disaster recovery

