New layout for describing block devices and file systems
Luis Fernando Muñoz Mejías
Universidad Autónoma de Madrid
4th Quattor Workshop (UAM, 2007)

Outline
➲ Current layout
   ● Limitations
➲ Tim Bell and Andras Horvath's model
➲ New model
➲ Example
➲ Conclusions
➲ What's next

Current layout
➲ Oriented to partitions
➲ Mixes partition and file system definitions
➲ Based on the set_partitions function

    set_partitions (nlist ("hda1", nlist ("mountpoint", "/",
                                          "size", 10*GB,
                                          "type", "ext3"...

Limitations of current layout
➲ Poor control of advanced features (large partitions, tuning options...)
➲ Software RAID is difficult to achieve, at best
➲ It depends on "hda"-like naming
   ● There are other naming schemes, for instance MegaRAID
➲ It's not meant to control hardware RAID

Tim Bell and Andras Horvath's model
➲ Separation between file systems and block devices
   ● A file system just references the block device it lies on
   ● Easy to extend
➲ Very natural for humans
   ● Partitions are part of the disk structure, logical volumes are part of the volume group structure...

Tim Bell and Andras Horvath's model: file systems

    type filesystem = {
        "preserve" : boolean
        "format" : boolean
        "type" : string
        "block_device" : string    # Reference
        "mountpoint" : string
        ...
    };

Tim Bell and Andras Horvath's model: disks

    type disk = {
        "partitions" : partition{}
        "label" ? string
        ...
    };

➲ "Natural": partitions are disk members
➲ Allows for large partitions

Tim Bell and Andras Horvath's model: hardware RAID

    type hwraid = {
        "partitions" : partition{}
        "label" : string
        "level" : string
        ...
    };

➲ Duplicated from disks!

Tim Bell and Andras Horvath's model: LVM and software RAID

    type volume_group = {
        "device_list" : string[]    # List of references
        "logical_volumes" : logical_volume{}
    };

    type sw_raid = {
        "device_list" : string[]
        "raid_level" : string
    };

➲ Bi-directional referencing

Tim Bell and Andras Horvath's model: problems
➲ Difficult to implement
➲ What is natural for humans is not the best for computers
   ● Bi-directional creation is complex, slow and error-prone
   ● The model doesn't allow file systems to control creation, destruction and modification

New model
➲ Based on Tim's and Andras'
➲ Top-down only
➲ File systems control creation, growth and shrinking

New model: top-down
File systems and block devices can be modelled as a tree-like structure rooted on the file system

    File system: /Homer
      Blockdev: LV /dev/Springfield/EvergreenTrc
        Blockdev: VG /dev/Springfield
          Blockdev: partition /dev/sda1
            Blockdev: disk /dev/sda
          Blockdev: partition /dev/sda2
            Blockdev: disk /dev/sda
          Blockdev: disk /dev/ciss/c0d0

    Figure labels: "Tree structure" and "Non-tree structure" (/dev/sda is shared by both partitions)

New model: disks and hardware RAID

    type physical_dev = {
        "label" : string    # msdos, gpt, bsd...
        "raid_level" ? string
        "raid_members" ?...
    };

➲ Hardware RAID and disks are merged
➲ Partitions are defined outside their disks

New model: partitions

    type partition = {
        "size" ? long    # Optional!
        "holding_dev" : string
        "type" : string    # primary, extended, logical
        ...
    };

➲ Partitions reference the disk they lie on
   ● More flexibility in naming schemes
➲ The "grow" flag is gone

New model: LVM

    type volume_group = {
        "device_list" : string[]
    };

    type logical_volume = {
        "volume_group" : string
        "size" ? long
    };

➲ Volume groups don't know about the logical volumes they hold
   ● This enforces the top-down approach

New model: software RAID

    type md = {
        "raid_level" : string
        "device_list" : string[]    # References to defined block devices
    };

➲ Software RAID can lie on arbitrary devices
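
As an illustration of the md type above, the following sketch mirrors two partitions. The names md0, sda1 and sdb1 and the "RAID1" spelling of raid_level are assumptions made for this example; the slides do not state which values are accepted.

```pan
# Hypothetical example: a software RAID mirror built on two partitions.
# "md0", "sda1", "sdb1" and the value "RAID1" are illustrative assumptions.
# device_list entries are written relative to /system/blockdevices,
# as in the volume group example of this talk.
"/system/blockdevices/md" = nlist (
    "md0", nlist (
        "raid_level", "RAID1",
        "device_list", list ("partitions/sda1", "partitions/sdb1")
    )
);
```

A file system would then presumably reference the array as "md/md0" in its block_device field, by analogy with the other references.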

New model: files

    type file = {
        "size" : string
        "owner" : string
        "group" : string
        "perms" : long
    };

➲ Files can hold file systems with the loopback module
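
A minimal sketch of a file-backed file system follows. The image path, the mount point, the unit of "size" and the "files/..." reference format are all assumptions; the reference format is guessed by analogy with the "physical_devs/" + escape (...) reference used in the example slides.

```pan
# Hypothetical example: an image file owned by root.
# The path and the meaning of "size" (here assumed to be megabytes)
# are illustrative assumptions. 384 is 0600 in octal.
"/system/blockdevices/files" = nlist (
    escape ("/var/scratch.img"), nlist (
        "size", "1024",
        "owner", "root",
        "group", "root",
        "perms", 384
    )
);

# A file system on that file, with the same fields as the example slides.
# The "files/" prefix is an assumption.
"/system/filesystems" = list (
    nlist ("mountpoint", "/scratch",
           "preserve", true,
           "format", false,
           "type", "ext3",
           "block_device", "files/" + escape ("/var/scratch.img"),
           "mount", true)
);
```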

New model: file systems

    type filesystem = {
        "mountpoint" : string
        "type" : string
        "tuneopts" ? string
        "preserve" : boolean
        "format" : boolean
        "mount" : boolean
        "freq" : long
        "sync" : long
        "block_device" : string
    };

➲ File systems only reference the block device they lie on

New model: tying it all together

    type blockdevices = {
        "physical_devs" ? physical_dev{}
        "partitions" ? partition{}
        "volume_groups" ? volume_group{}
        "logical_volumes" ? logical_volume{}
        "md" ? md{}
        "files" ? file{}
    };

    bind "/system/blockdevices" = blockdevices;
    bind "/system/filesystems" = filesystem[];    # This is a list!!
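
Because every reference in this model is a plain string, a typo in a reference would go unnoticed by the types shown so far. One way to catch it at compile time is a constrained string type, sketched below. The type name blockdev_ref is hypothetical and this is not necessarily how the actual schema does it.

```pan
# Sketch: a string that must name an existing path under /system/blockdevices.
# "blockdev_ref" is a hypothetical type name introduced for illustration.
type blockdev_ref = string with exists ("/system/blockdevices/" + SELF)
    || error (SELF + " must be a path relative to /system/blockdevices");

# It could then replace the plain string in reference fields, for example:
#     "block_device" : blockdev_ref
```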

Some advice
➲ Don't use extended/logical partitions
   ● Use LVM instead
➲ Always use partitions
   ● Don't place file systems directly on disks, they might get destroyed by Quattor
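
The first piece of advice can be sketched as follows: one primary partition holds a volume group, and the space is divided with logical volumes rather than with logical partitions. The names vg0, home and var and the sizes are assumptions; GB is taken to be defined as in the other snippets of this talk.

```pan
# Hypothetical layout: a single partition carrying an LVM volume group.
"/system/blockdevices/physical_devs" = nlist (
    "sda", nlist ("label", "msdos")
);

# No size given, so the partition fills the disk.
"/system/blockdevices/partitions" = nlist (
    "sda1", nlist ("holding_dev", "sda")
);

"/system/blockdevices/volume_groups" = nlist (
    "vg0", nlist ("device_list", list ("partitions/sda1"))
);

# Logical volumes take the role that logical partitions would have had.
"/system/blockdevices/logical_volumes" = nlist (
    "home", nlist ("volume_group", "vg0", "size", 20*GB),
    "var", nlist ("volume_group", "vg0", "size", 10*GB)
);
```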

Let's see an example

Example: diagram

    File system: /Homer
      Blockdev: LV /dev/Springfield/EvergreenTrc
        Blockdev: VG /dev/Springfield
          Blockdev: partition /dev/sda1
            Blockdev: disk /dev/sda
          Blockdev: partition /dev/sda2
            Blockdev: disk /dev/sda
          Blockdev: disk /dev/ciss/c0d0

    Figure labels: "Tree structure" and "Non-tree structure" (/dev/sda is shared by both partitions)

Example
Let's suppose that /dev/sda1 uses 1 GB, /dev/sda2 uses the rest of the disk, and the logical volume Springfield/EvergreenTrc uses its whole volume group.

Example: the file system

    "/system/filesystems" = list (
        nlist ("mountpoint", "/Homer",
               "preserve", true,
               "format", false,
               "type", "xfs",
               "block_device", "logical_volumes/EvergreenTrc",
               "mount", true
        )
    );

Example: the LVM

    # "volume_group" is relative to /system/blockdevices/volume_groups
    "/system/blockdevices/logical_volumes" = nlist (
        "EvergreenTrc", nlist ("volume_group", "Springfield")
    );

    # "device_list" entries are relative to /system/blockdevices
    "/system/blockdevices/volume_groups" = nlist (
        "Springfield", nlist ("device_list", list ("partitions/sda1",
                                                   "partitions/sda2",
                                                   "physical_devs/" + escape ("ciss/c0d0")))
    );

Example: the partitions

    # "holding_dev" is relative to /system/blockdevices/physical_devs
    "/system/blockdevices/partitions" = nlist (
        "sda1", nlist ("holding_dev", "sda",
                       "size", 1*GB),
        "sda2", nlist ("holding_dev", "sda")    # /dev/sda2 fills the rest of the disk
    );

➲ A primary partition is assumed

Example: the disks

    "/system/blockdevices/physical_devs" = nlist (
        "sda", nlist ("label", "msdos"),
        escape ("ciss/c0d0"), nlist ("label", "none")
    );

➲ A PV lies directly on the disk /dev/ciss/c0d0, without partitions. No label must be set for this

Conclusion
➲ The new layout is more flexible and easier to extend
➲ Implemented in AII and ncm-filesystems
   ● See next presentations
➲ Temporary path under /software/components/filesystems
➲ Ready to stabilize on /system/...

What's next
➲ LVM snapshots
   ● Are they needed?
   ● Are they Quattor business at all?
➲ LVM striping?
➲ Software RAID monitoring?
➲ Quota definition
➲ Other stuff...

More information
CERN's twiki on the new layout