Published by Rolf Osborne. Modified over 8 years ago.
1
RozoFS Architecture Overview: RozoFS components (edition 1.4, 23/01/2015)
2
RozoFS architecture overview: components
Diagram: a client node runs rozofsmount, which mounts the filesystem at /fs1/home/. The control path connects rozofsmount to the metadata server (exportd), which holds the metadata. The data path connects rozofsmount to the storage nodes (shown as Storage Sid1: host1).
3
© This document is proprietary and confidential. No part of this document may be disclosed in any manner to a third party without the prior written consent of FIZIANS SAS.
Storage component
Diagram: a storage node (IP@:port) runs a storage process that serves the storages [cid1,sid1], [cid2,sid1], ..., [cidn,sid1]. Each storage is backed by Device 0 to Device n; each device is a file system (e.g. XFS) on RAID 0 (0+1, 5, 6) over physical disks.
A storage (cid/sid) is a set of logical disks (devices) with the same capacity and performance.
On the same server, RozoFS can provide storages based on different technologies.
Note: configuration can be done with or without a RAID controller.
4
RozoFS clusters and volumes
Diagram: Volume 1 groups Cluster 1, Cluster 2, ..., Cluster n. Each cluster lists its storages as Sid1:host_1 .. Sidn:host_n, spread over the storage nodes host_1 to host_n.
A RozoFS cluster (cid) is a uniform set of storages (sid) in terms of disk capacity and performance.
A cluster id is unique within a RozoFS system.
5
Mapping filesystems on volumes
Diagram: Volume 1 contains Cluster 1 to Cluster n and hosts Filesystem 1 to Filesystem j. Volume 2 contains Cluster n+1 to Cluster n+p and hosts Filesystem j+1 to Filesystem j+k.
RozoFS supports configurations with multiple volumes.
A volume can host more than one file system (thin provisioning).
There are quotas (hard and soft) per file system.
A file system is identified by a unique id (eid) within the configuration.
6
File localization within a filesystem
Diagram: a file of Filesystem 1 .. Filesystem j (hosted on Volume 1, Cluster 1 .. Cluster n) goes through the Mojette Transform, and the resulting projections are distributed to storages (cid/sid) on the storage nodes.
7
RozoFS configuration
Diagram: each node type has its own configuration.
Exportd node: the exports (Eid1:/metadata/fs1,vid=1) and the volumes (Volume 1 .. Volume i). Each volume lists its clusters (Cluster 1 .. Cluster n) and each cluster lists its storages (Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4).
Storage node (Storage_conf): Listening_endpoints (@IP:port), then one entry per storage, such as [cid1,sid1]:pathname1,device_count and [cid2,sid1]:pathname2,device_count.
Rozofsmount node (fstab): rozofsmount mount_path rozofs export@IP,/metadata/fs1
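The relationships in this configuration (an export points to a volume, a volume lists clusters, a cluster lists storages) can be sketched as a small data model. This is an illustration only: the class names, field names and the helper `storages_for_export` are assumptions made for the example, and the code does not reproduce the syntax of the RozoFS configuration files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Storage:
    sid: int   # storage id, unique within its cluster
    host: str  # storage node hosting this sid


@dataclass
class Cluster:
    cid: int
    storages: List[Storage] = field(default_factory=list)


@dataclass
class Volume:
    vid: int
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Export:
    eid: int   # filesystem id
    root: str  # metadata path, e.g. /metadata/fs1
    vid: int   # volume hosting this filesystem


def storages_for_export(export: Export, volumes: List[Volume]) -> Dict[int, List[Tuple[int, str]]]:
    """Return {cid: [(sid, host), ...]} for the volume backing an export."""
    volume = next(v for v in volumes if v.vid == export.vid)
    return {c.cid: [(s.sid, s.host) for s in c.storages] for c in volume.clusters}


# Eid1:/metadata/fs1,vid=1 on a volume with one cluster of four storages.
cluster1 = Cluster(cid=1, storages=[Storage(i, f"host{i}") for i in range(1, 5)])
volume1 = Volume(vid=1, clusters=[cluster1])
export1 = Export(eid=1, root="/metadata/fs1", vid=1)
layout = storages_for_export(export1, [volume1])
```

Several exports may carry the same `vid`, which is how one volume hosts more than one file system.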
8
RozoFS architecture overview: components
Diagram: the client node runs rozofsmount, mounted at /fs1/home/. The control path connects rozofsmount to the metadata server (exportd); the data path connects it to Storage Sid1: host1, Storage Sid2: host2, Storage Sid3: host3 and Storage Sid4: host4.
RozoFS export configuration held by exportd: Eid1:/metadata/fs1,vid=1; Volume 1 .. Volume i; Cluster 1 .. Cluster n, each listing Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4.
9
Typical RozoFS deployments
10
RozoFS native mode (scale-out NAS)
Diagram: Linux clients run the applications together with rozofsmount, and reach the storage and exportd nodes over the native protocol on a GigE infrastructure shared by data storage and metadata.
Note: the exportd function can also reside on some of the storage nodes.
11
RozoFS cluster: NAS mode
Diagram: Windows, Linux, UNIX and Apple clients/applications reach the rozofsmount nodes over SMB, NFS, AFP, etc. on a GigE infrastructure. The rozofsmount nodes reach the storage and exportd nodes over a GigE infrastructure that carries data storage and metadata.
12
Virtualisation solution with RozoFS: CloudStack + KVM
Diagram: nodes run rozofsmount together with storage, and one node runs rozofsmount together with exportd. The nodes share a GigE infrastructure (data storage and metadata), plus a standard GigE infrastructure at the client/application level that connects to the external network.
13
RozoFS basic exchanges
14
RozoFS basic exchanges: inter-component interfaces
Diagram: on the client node, rozofsmount works with the storcli processes (Storcli 1 .. Storcli n). Interfaces shown:
Client node and metadata server: cluster configuration, metadata operations, mount.
Metadata server and storages: storage monitoring, projection deletion.
Storcli and storages: read, write, truncate.
15
RozoFS basic exchanges: filesystem mounting
Command: rozofsmount -H exportd_host -E /metadata/fs1 /fs1/home/
1) rozofsmount is started with the exportd host, the export path and the mount point.
2) rozofsmount sends Mount /metadata/fs1 to the metadata server (exportd), which holds the RozoFS export configuration: Eid1:/metadata/fs1,vid=1; Volume 1 .. Volume i; Cluster 1 .. Cluster n, each with Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4.
3) exportd returns the clusters list.
4) Storcli 1 opens TCP connections to Storage Sid1: host1, Storage Sid2: host2, Storage Sid3: host3 and Storage Sid4: host4.
16
RozoFS basic exchanges: file creation
Message flow:
1) Application/VFS to rozofsmount: open("/fs1/home/foo", O_CREAT|O_RDWR, 0640)
2) rozofsmount to metadata server (exportd): mknod(EID, parent_fid, "foo", O_RDWR, 0640)
3) exportd to rozofsmount: attrs(FID, cid1:{sid1..sid4}, 0640, etc…)
4) rozofsmount to Application/VFS: file descriptor
Export_mknod (on exportd, backed by the disk of Eid1:/metadata/fs1,vid=1):
1) Allocate a unique file id (FID)
2) Volume distribute(EID)
3) Insert (FID, "foo") in the parent directory
4) Write the new file attributes
5) Update the parent attributes
Volume distribute(EID), shown on Cluster 1 with Sid1:host1 to Sid6:host6, …:
1) Get the volume associated with the EID (VID)
2) Get the cluster list (CID)
3) Get 4 storages for a cluster (SID)
FID: unique file identifier. Parent_fid: FID of the parent directory.
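The Export_mknod and Volume distribute steps above can be sketched as a toy in-memory metadata server. Everything here is a hypothetical illustration: the class `Exportd`, its dictionaries, the use of a UUID as FID, and the selection policy (first cluster with enough storages, first four storages) are assumptions; the slide does not say how RozoFS picks the cluster or the storages.

```python
import uuid


class Exportd:
    """Toy in-memory metadata server (illustrative names only)."""

    def __init__(self, eid_to_vid, volumes):
        self.eid_to_vid = eid_to_vid  # eid -> vid
        self.volumes = volumes        # vid -> {cid: [sid, ...]}
        self.directories = {}         # parent_fid -> {name: fid}
        self.attributes = {}          # fid -> attributes dict

    def volume_distribute(self, eid, count=4):
        vid = self.eid_to_vid[eid]    # 1) volume associated with the EID
        clusters = self.volumes[vid]  # 2) cluster list of that volume
        # 3) pick `count` storages in one cluster (assumed policy: first fit)
        cid = next(c for c, sids in sorted(clusters.items()) if len(sids) >= count)
        return cid, clusters[cid][:count]

    def mknod(self, eid, parent_fid, name, mode):
        fid = uuid.uuid4().hex                                   # 1) unique FID
        cid, sids = self.volume_distribute(eid)                  # 2) distribution
        self.directories.setdefault(parent_fid, {})[name] = fid  # 3) directory entry
        attrs = {"fid": fid, "cid": cid, "sids": sids, "mode": mode, "size": 0}
        self.attributes[fid] = attrs                             # 4) file attributes
        parent = self.attributes.setdefault(parent_fid, {"fid": parent_fid, "children": 0})
        parent["children"] += 1                                  # 5) parent attributes
        return attrs


exportd = Exportd({1: 1}, {1: {1: [1, 2, 3, 4, 5, 6]}})
attrs = exportd.mknod(eid=1, parent_fid="root", name="foo", mode=0o640)
```

The returned attributes carry the cluster and storage list, which is what lets the client reach the storages directly on later reads and writes.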
17
RozoFS basic exchanges: file opening
Message flow:
Application to VFS to rozofsmount: open("/fs1/home/foo", O_RDWR, 0640)
rozofsmount to metadata server (exportd): lookup(EID, parent_fid, "foo", O_RDWR, 0640)
exportd to rozofsmount: File_attributes(attrs3)
rozofsmount to VFS to application: Fd 1
Export_lookup (on exportd, backed by the disk of Eid1:/metadata/fs1,vid=1):
1) Get the file FID from the parent directory (cache or disk)
2) Get the file attributes (cache or disk)
Directory entries cache (per parent directory): Name1->FID1, Name2->FID2, foo->FID3, …
Attributes cache: FID1->attrs1, FID2->attrs2, FID3->attrs3, …
attrs3 holds FID3, cid:{sid1,sid2,sid3,sid4}, atime, mtime, …
File descriptor allocator (rozofsmount): associates Fd 1 with the attributes of FID3.
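The two-step lookup (directory entry, then attributes, each served from cache or disk) followed by file descriptor allocation can be sketched as follows. This is a minimal sketch with hypothetical names (`LookupService`, `FdAllocator`, plain dictionaries standing in for the disk); it ignores cache eviction and invalidation, which the slide does not describe.

```python
class LookupService:
    """Two read-through caches in front of 'disk' dictionaries (illustrative)."""

    def __init__(self, disk_dirs, disk_attrs):
        self.disk_dirs = disk_dirs    # (parent_fid, name) -> fid
        self.disk_attrs = disk_attrs  # fid -> attributes
        self.dentry_cache = {}
        self.attr_cache = {}
        self.disk_reads = 0           # counts cache misses

    def _cached(self, cache, disk, key):
        if key not in cache:
            self.disk_reads += 1
            cache[key] = disk[key]    # raises KeyError if the entry does not exist
        return cache[key]

    def lookup(self, parent_fid, name):
        fid = self._cached(self.dentry_cache, self.disk_dirs, (parent_fid, name))  # 1)
        return self._cached(self.attr_cache, self.disk_attrs, fid)                 # 2)


class FdAllocator:
    """Client-side table binding a file descriptor to the file attributes."""

    def __init__(self):
        self.next_fd = 1
        self.contexts = {}

    def open(self, attrs):
        fd = self.next_fd
        self.next_fd += 1
        self.contexts[fd] = attrs
        return fd


service = LookupService(
    {("root", "foo"): "FID3"},
    {"FID3": {"fid": "FID3", "cid": 1, "sids": [1, 2, 3, 4]}},
)
allocator = FdAllocator()
fd = allocator.open(service.lookup("root", "foo"))
```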
18
RozoFS basic exchanges: synchronous file write
Redundancy level (2+1): 2 reads, 3 writes.
Application/VFS: len = pwrite(fd, offset, size, buffer)
File context bound to Fd 1: FID3, cid:{sid1,sid2,sid3,sid4}, file size, atime, mtime, …
Write (rozofsmount), on write(fd1, offset, size, data):
1) Find the context associated with fd1
2) Submit the data to write to storcli: write(FID3, offset, data, size)
3) Wait for the end of the write (size or errcode)
4) Update the blocks on exportd: Wr_blks(EID1, FID3, offset, size), answered by Attrs(attrs3)
5) Return the written size (or errcode) to the upper layer
Write projections (storcli, Mojette Transform Forward):
1) Generate the projections (prj1, prj2, prj3) from the data
2) Send all the projection writes in parallel: write(FID3, prj1), write(FID3, prj2), write(FID3, prj3)
3) Wait for all the write responses (status)
Write_blocks (file attributes update, on the metadata server exportd):
1) Update the time information
2) Update the size if greater
3) Update the cache (attributes cache: FID1->attrs1, FID2->attrs2, FID3->attrs3, …) and the disk (Eid1:/metadata/fs1,vid=1)
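The storcli side of the write (generate the projections, send them in parallel, wait for every response) can be sketched as below for the (2+1) case. Two loud assumptions: the encoder is a stand-in that produces two halves plus their XOR parity, which is not the Mojette transform, and the storages are plain dictionaries rather than network endpoints. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_2_plus_1(block: bytes):
    """Stand-in for the forward transform: two halves plus their XOR parity."""
    half = (len(block) + 1) // 2
    a = block[:half]
    b = block[half:].ljust(half, b"\0")  # pad the second half if the length is odd
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def write_block(storages, fid, offset, block):
    """Write one block; `storages` holds one dict-like store per projection."""
    projections = encode_2_plus_1(block)        # 1) generate the projections

    def send(index):
        storages[index][(fid, offset, index)] = projections[index]

    with ThreadPoolExecutor(max_workers=len(projections)) as pool:
        # 2) send all writes in parallel; 3) list() waits for every response
        # and re-raises the first failure, if any.
        list(pool.map(send, range(len(projections))))
    return len(block)                           # size returned to the caller


stores = [{}, {}, {}]
written = write_block(stores, "FID3", 0, b"abcdef")
```

Only after this call returns would the client report the new size to the metadata server, matching step 4 of the write sequence.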
19
RozoFS basic exchanges: file read
Redundancy level (2+1): 2 reads, 3 writes.
Application/VFS: len = pread(fd, offset, size, buffer)
File context bound to Fd 1: FID3, cid:{sid1,sid2,sid3,sid4}, file size, atime, mtime, …
Read (rozofsmount), on pread(fd1, offset, size):
1) Find the context associated with fd1
2) Request the data from storcli: Read(FID3, offset, size)
3) Return the requested data to the VFS (data, length)
Read projections (storcli, Mojette Transform Inverse):
1) Send parallel read requests: Read(FID3, prj1, offset_prj) and Read(FID3, prj2, offset_prj)
2) Wait for the projection data (prj1, prj2) returned from the storages
3) Rebuild the initial block
20
RozoFS basic exchanges: file deletion
Message flow:
Application/VFS: unlink("/fs1/home/foo")
VFS to rozofsmount: unlink(parent_fid, "/fs1/home/foo")
rozofsmount to metadata server (exportd): unlink(EID, parent_fid, "foo")
exportd to rozofsmount: Parent_attributes
rozofsmount to Application/VFS: errcode
File deletion (on exportd, backed by the disk of Eid1:/metadata/fs1,vid=1):
1) Remove the file from the parent directory (disk and cache)
2) Delete the attributes of the file (disk and cache)
3) Update the parent attributes
4) Insert the file reference in the trash (list and disk)
Trash list: FID6->attrs6, FID7->attrs7, FID3->attrs3, …, where attrs3 holds FID3, cid:{sid1,sid2,sid3,sid4}, …
Trash thread: sends the projection deletions (FID3) to Storage Sid1: host1, Storage Sid2: host2, Storage Sid3: host3 and Storage Sid4: host4.
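The split between the synchronous part of unlink and the asynchronous projection deletion can be sketched with a queue drained by a background thread. This is an assumption-laden sketch: the class `Trash`, the in-memory dictionaries and the per-storage sets of FIDs are hypothetical, and the parent attribute update (step 3) and the on-disk trash are left out.

```python
import queue
import threading


class Trash:
    """unlink() only queues the file; a background thread deletes projections."""

    def __init__(self, storages):
        self.storages = storages      # sid -> set of FIDs that still have projections
        self.pending = queue.Queue()  # the trash list
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def unlink(self, directory, attributes, name):
        fid = directory.pop(name)     # 1) remove the entry from the parent directory
        attrs = attributes.pop(fid)   # 2) delete the file attributes
        self.pending.put(attrs)       # 4) insert the file reference in the trash
        return 0                      # errcode returned without waiting for storages

    def _run(self):
        while True:
            attrs = self.pending.get()
            for sid in attrs["sids"]:  # projection deletion on each storage
                self.storages[sid].discard(attrs["fid"])
            self.pending.task_done()


storages = {sid: {"FID3"} for sid in (1, 2, 3, 4)}
directory = {"foo": "FID3"}
attributes = {"FID3": {"fid": "FID3", "sids": [1, 2, 3, 4]}}
trash = Trash(storages)
errcode = trash.unlink(directory, attributes, "foo")
trash.pending.join()  # wait for the trash thread, for demonstration only
```

Keeping the attributes in the trash entry matters: once the file is gone from the directory, the trash entry is the only record of which storages hold its projections.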
21
RozoFS data path
Mojette Transform performances
Mojette Transform use cases
22
Mojette Transform performances
Encoding/decoding performance with 2 redundancy projections (4+2)
1. Mojette decoding/encoding is not CPU intensive and fits well on the client side.
2. Mojette decoding time does not depend on the number of failures.
23
Mojette Transform performances
Encoding/decoding performance with 4 redundancy projections (8+4)
24
Write: Mojette Transform Forward (4+2)
Diagram: a file system block (4 KB) goes through the Mojette erasure coding transform, which produces six projections (labelled 1 kB each). Each projection is sent to a storaged process and stored on its local FS.
The initial block is divided into 4 parts.
The Mojette Transform generates 6 projections.
Any 4 of the 6 projections are enough to rebuild the initial block.
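To make the forward step concrete, here is a minimal sketch of a discrete Mojette projection over a small grid. It assumes one common convention: the pixel at row l and column k goes to bin b = p*l - q*k, and bins are combined with XOR. The six directions used below (p from -2 to 3, q = 1) and the tiny 4x4 grid are assumptions chosen for the example; the slides do not state the directions or the block layout RozoFS uses, and the inverse transform is not shown.

```python
def mojette_forward(grid, p, q=1):
    """Project `grid` (a list of equal-length rows of ints) along direction (p, q).

    The pixel at row l, column k is XOR-ed into bin b = p*l - q*k,
    shifted so that the smallest bin index is 0.
    """
    rows, cols = len(grid), len(grid[0])
    raw = [p * l - q * k for l in range(rows) for k in range(cols)]
    base = min(raw)
    bins = [0] * (max(raw) - base + 1)
    for l in range(rows):
        for k in range(cols):
            bins[p * l - q * k - base] ^= grid[l][k]
    return bins


# A block "divided in 4 parts" becomes a grid of 4 rows (here 4 values per row).
grid = [[4 * l + k for k in range(4)] for l in range(4)]
projections = {p: mojette_forward(grid, p) for p in range(-2, 4)}  # 6 projections
```

With this convention a projection has (cols - 1)*|q| + (rows - 1)*|p| + 1 bins, so projections with a larger |p| are slightly longer than a quarter of the block.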
25
RozoFS data-path write service: file system block forward transformation (nominal use case)
Diagram: the user payload is split into UDB 1 to UDB 5. The Mojette Transform Forward + write process turns each UDB i into the projections proj 1.i, proj 2.i and proj 3.i. With a RozoFS layout distribution over OSD nodes (1,2,3,4), the projections go to the nodes of the optimal distribution and the remaining node is a spare node.
The set of OSDs is provided within the metadata associated with the file.
The user payload is split into User Data Blocks (UDB) of 4K or 8K.
The Mojette transform is applied to each UDB.
26
RozoFS data-path write service: nominal use case sequence diagram
Write transactions are performed in parallel.
The write service ends upon receiving all the responses from the OSD nodes.
27
RozoFS data-path write service: failure use case
Diagram: same payload (UDB 1 to UDB 5) and same RozoFS layout distribution over OSD nodes (1,2,3,4) as in the nominal case. The projections proj 3.1 to proj 3.5 are also shown on the spare node.
A spare OSD is used when an OSD belonging to the optimal distribution fails.
The write operation is successful when n+m projections are successfully written.
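The fallback rule (use a spare OSD when an OSD of the optimal distribution fails, and succeed only when every projection is written) can be sketched as follows. The function `write_with_spares`, the node names and the `write` callback are hypothetical, and the sketch writes sequentially for readability whereas the slides describe parallel write transactions.

```python
def write_with_spares(projections, optimal, spares, write):
    """Write one projection per node, falling back to spare nodes on failure.

    `write(node, projection)` must raise OSError when the node is unavailable.
    Returns the node that finally holds each projection.
    """
    remaining = list(spares)
    placement = []
    for node, projection in zip(optimal, projections):
        while True:
            try:
                write(node, projection)
                placement.append(node)
                break
            except OSError:
                if not remaining:
                    # Fewer than n+m projections can be written: the write fails.
                    raise OSError("not enough healthy OSD nodes")
                node = remaining.pop(0)  # retry the same projection on a spare
    return placement


failed = {"osd3"}
stored = {}


def write(node, projection):
    if node in failed:
        raise OSError(f"{node} is down")
    stored.setdefault(node, []).append(projection)


placement = write_with_spares(
    ["proj 1.1", "proj 2.1", "proj 3.1"], ["osd1", "osd2", "osd3"], ["osd4"], write
)
```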
28
RozoFS data-path write service: failure sequence diagram
29
Read: Mojette Transform Inverse (1/2)
Diagram: six storaged processes each hold a projection (labelled 1 kB) on their local FS; the Mojette erasure coding transform rebuilds the file system blocks (4 KB) from the projections.
Read 4 projections from any of the 6 storage nodes.
30
Read: Mojette Transform Inverse (2/2)
Diagram: same six storaged processes and projections as in part 1/2, with one node failed.
In case of a failure of one node, another one is selected among the set of servers associated with the file.
31
RozoFS data-path read service: file system block Mojette inverse transformation (nominal use case)
Diagram: with a RozoFS layout distribution over OSD nodes (1,2,3,4), the read process fetches projections (projection 1, projection 2, projection 3 are shown) from the OSD nodes of the optimal distribution and applies the inverse Mojette Transform to rebuild a UDB (4K or 8K).
The read process selects n projections among the n+m projections to rebuild a User Data Block.
It can be any subset of n projections in the set of n+m projections.
Read transactions towards the OSDs are performed in parallel, which minimizes the data transfer delay over the network.
32
RozoFS data-path read service: sequence diagram (nominal use case)
33
RozoFS data-path read service: failure use case
Diagram: same layout as the nominal read case (UDB of 4K or 8K, OSD nodes (1,2,3,4), projections 1 to 3), with an additional read towards another OSD.
When a projection read fails, the read is attempted on the remaining OSDs. Possible causes:
Disk failure
Network failure
Out-of-date projection
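The retry rule can be sketched for the (2+1) case as below. As a stated assumption, the code uses a stand-in erasure code (two halves plus their XOR parity) instead of the Mojette inverse transform, reads the OSDs one after the other rather than in parallel, and models each OSD as a callable that raises OSError on any of the failures listed above. All names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def decode_2_plus_1(available: dict, length: int) -> bytes:
    """Rebuild a block from any two of: 0 = first half, 1 = second half, 2 = parity."""
    if 0 in available and 1 in available:
        a, b = available[0], available[1]
    elif 0 in available:
        a = available[0]
        b = xor_bytes(a, available[2])
    else:
        b = available[1]
        a = xor_bytes(b, available[2])
    return (a + b)[:length]


def read_block(readers, length, needed=2):
    """Collect `needed` projections, moving on to the next OSD on failure."""
    available = {}
    for index, reader in enumerate(readers):
        if len(available) == needed:
            break
        try:
            available[index] = reader()
        except OSError:
            continue  # disk failure, network failure or out-of-date projection
    if len(available) < needed:
        raise OSError("not enough projections to rebuild the block")
    return decode_2_plus_1(available, length)


def failing_reader():
    raise OSError("disk failure")


first, second = b"abc", b"def"
parity = xor_bytes(first, second)
block = read_block([failing_reader, lambda: second, lambda: parity], length=6)
```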
34
RozoFS data-path read service: failure sequence diagram
Fast projection recovery time:
Start a guard timer on the first projection read reply.
At timer expiration, read requests are propagated towards the remaining OSDs.
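The guard timer can be sketched with futures: the timer starts when the first reply arrives, and when it expires the read is extended to the remaining OSDs. This is a minimal sketch under assumptions: `read_with_guard_timer`, its parameters and the thread pool are hypothetical, readers are plain callables, and the timer value is arbitrary.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def read_with_guard_timer(primary, spares, needed, guard_seconds):
    """Return `needed` projections, asking the spare OSDs once the guard timer expires.

    `primary` and `spares` are lists of callables returning a projection.
    """
    pool = ThreadPoolExecutor(max_workers=len(primary) + len(spares))
    try:
        pending = {pool.submit(reader) for reader in primary}
        # The first reply starts the guard timer.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        late, pending = wait(pending, timeout=guard_seconds)
        done |= late
        good = [f.result() for f in done if f.exception() is None]
        if len(good) < needed:
            # Timer expired (or reads failed): propagate to the remaining OSDs.
            pending |= {pool.submit(reader) for reader in spares}
            while len(good) < needed and pending:
                finished, pending = wait(pending, return_when=FIRST_COMPLETED)
                good += [f.result() for f in finished if f.exception() is None]
        if len(good) < needed:
            raise OSError("not enough projections to rebuild the block")
        return good[:needed]
    finally:
        pool.shutdown(wait=False)  # do not block on slow readers
```

If every primary OSD replies before the timer expires, no request is sent to the spares, so the nominal case costs no extra traffic.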
35
RozoFS data-path read service: failure sequence diagram in case of a CRC32 error
The CRC error is detected on the storage node.
The storage node reports that the read failure is due to a CRC error.
After rebuilding the initial data, the storcli process triggers a forward transform.
The forward transform concerns only the faulty projection.
There might be more than one block to regenerate (it depends on the number of CRC errors).
Once the projection has been regenerated, it is sent back to the associated storage node.
36
Data integrity
37
End-to-end data integrity
RozoFS projection authentication: one crc32 per projection.
The CRC covers the payload of the projection as well as the block identifier.
A block identifier is defined by the file inode allocated by RozoFS and its offset in the file: Block_id = fid + offset(i).
The crc32 is stored along with the projection.
Diagram: the Mojette erasure coding transform produces six projections (labelled 1 kB each), each stored with its own checksum.
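A checksum that covers both the payload and the block identifier can be sketched with `zlib.crc32`. The byte layout is an assumption made for the example (the FID as raw bytes, the offset as a little-endian 64-bit integer, the CRC appended as 4 little-endian bytes); the slides do not specify how RozoFS serializes these fields, and the function names are hypothetical.

```python
import struct
import zlib


def projection_crc(fid: bytes, offset: int, payload: bytes) -> int:
    """CRC32 over the block identifier (fid + offset) followed by the payload."""
    block_id = fid + struct.pack("<Q", offset)
    return zlib.crc32(payload, zlib.crc32(block_id))


def seal(fid: bytes, offset: int, payload: bytes) -> bytes:
    """Return the projection as stored: payload followed by its CRC32."""
    return payload + struct.pack("<I", projection_crc(fid, offset, payload))


def verify(fid: bytes, offset: int, stored: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails the CRC check."""
    payload = stored[:-4]
    (crc,) = struct.unpack("<I", stored[-4:])
    if crc != projection_crc(fid, offset, payload):
        raise ValueError("CRC mismatch")
    return payload


stored = seal(b"FID3", 4096, b"x" * 1024)
payload = verify(b"FID3", 4096, stored)
```

Including the block identifier means that an intact projection returned for the wrong file or offset also fails the check, not only a projection whose payload was damaged.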
38
Projection self-healing in RozoFS
Diagram in three panels: OSD1, OSD2 and OSD3 are storaged processes, each holding a projection (labelled 1 kB) on its local FS.
1) The application issues a read to RozoFS. Data can be rebuilt with OSD 1 and 2. The checksum on OSD 1 reveals that the projection is corrupted on disk.
2) RozoFS reads the projection on OSD 3. Block 0 (red, 4 kB) is rebuilt with the projections from OSD 2 and 3. Good data is returned to the application.
3) RozoFS regenerates the corrupted projection from the rebuilt block and sends it to OSD 1 for rewriting.
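The three panels can be sketched end to end for a (2+1) layout. Assumptions, stated plainly: the erasure code is a stand-in (two halves plus XOR parity) rather than the Mojette transform, each OSD is a dictionary holding a projection and its CRC32, the checksum covers the payload only, and the check runs in the reader whereas the slides say the storage node detects the error. The function `read_and_heal` is hypothetical.

```python
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def read_and_heal(osds, length):
    """Rebuild a block from three OSDs and rewrite any corrupted projection.

    `osds` is a list of three dicts {"data": bytes, "crc": int} holding the
    first half, the second half and their XOR parity.
    Returns (block, indices of the repaired OSDs).
    """
    good = {i: o["data"] for i, o in enumerate(osds) if zlib.crc32(o["data"]) == o["crc"]}
    if len(good) < 2:
        raise OSError("not enough valid projections")
    # Rebuild the block from the valid projections.
    a = good[0] if 0 in good else xor_bytes(good[1], good[2])
    b = good[1] if 1 in good else xor_bytes(good[0], good[2])
    # Regenerate only the faulty projection and send it back to its OSD.
    fresh = [a, b, xor_bytes(a, b)]
    repaired = []
    for index, osd in enumerate(osds):
        if index not in good:
            osd["data"], osd["crc"] = fresh[index], zlib.crc32(fresh[index])
            repaired.append(index)
    return (a + b)[:length], repaired


halves = [b"abc", b"def"]
projections = halves + [xor_bytes(*halves)]
osds = [{"data": p, "crc": zlib.crc32(p)} for p in projections]
osds[0]["data"] = b"xbc"  # simulate on-disk corruption on OSD 1
block, repaired = read_and_heal(osds, length=6)
```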