Seven Problems of Linux Containers

1 Seven Problems of Linux Containers
Kir Kolyshkin 28 April LinuxFest Northwest parallels.com || openvz.org || criu.org

2 Seventy Seven Problems of Linux Containers
(of which I am going to cover six)
Kir Kolyshkin 28 April LinuxFest Northwest

3 Problem 1: Effective virtualization
Virtualization is partitioning
Historical way: $M mainframes
Modern way: virtual machines
Problem: performance overhead
Partial solution: hardware support (Intel VT, AMD-V)

4 Solution: isolation
Run many isolated userspace instances on top of one single (Linux) kernel
All processes see each other:
  files, process information, network, shared memory, users, etc.
Make them unsee it!

6 One historical way to unsee
chroot()

7 Namespaces
Implemented in the Linux kernel:
  PID
  net
  IPC
  UTS
  mnt
  user
clone() with CLONE_NEW* flags
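
As a rough illustration of the flags named on this slide, the Python sketch below maps each listed namespace to its CLONE_NEW* bit (values as defined in the Linux kernel header include/uapi/linux/sched.h) and ORs them into one mask. The helper names clone_flags and unshare_into are made up for this example. The slide mentions clone(); the sketch uses unshare(), which accepts the same namespace flags, through os.unshare, which exists only on Linux with Python 3.12 or later and usually needs root, so it is guarded and not called here.

```python
import os

# CLONE_NEW* values from the Linux kernel's include/uapi/linux/sched.h.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "UTS": 0x04000000,   # CLONE_NEWUTS
    "IPC": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "PID": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def clone_flags(namespaces):
    """OR together the CLONE_NEW* bits for the given namespace names."""
    mask = 0
    for name in namespaces:
        mask |= CLONE_FLAGS[name]
    return mask


def unshare_into(namespaces):
    """Hypothetical helper: request new namespaces for this process.

    A new PID namespace only takes effect for children created afterwards.
    """
    if not hasattr(os, "unshare"):
        raise NotImplementedError("os.unshare needs Linux and Python 3.12+")
    os.unshare(clone_flags(namespaces))


# Mask for all six namespaces listed on the slide.
ALL_SIX = clone_flags(["PID", "net", "IPC", "UTS", "mnt", "user"])
print(hex(ALL_SIX))
```

Passing only some of the names, for example clone_flags(["net", "UTS"]), gives a process its own network stack and hostname while it keeps sharing everything else with the host.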

8 Problem 2: Shared resources
All containers share the same set of resources (CPU, RAM, disk, various kernel things ...)
Need fair distribution of goods so everyone gets their share
Need DoS prevention
Need prioritization
“All animals are equal, but some animals are more equal than others” -- George Orwell

10 Solution: OpenVZ resource controls
User beancounters
  control 20 parameters
Hierarchical CPU scheduler
Disk quota per container
I/O priorities per container
Dynamic control, can “resize” at runtime

11 Solution: cgroups
Cgroups is a mechanism to control resources per hierarchical groups of processes
Cgroups is nothing without controllers:
  blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio
Cgroups are orthogonal to namespaces
Still a work in progress (kernel memory)
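
To make the controller idea concrete, here is a minimal Python sketch assuming the cgroup v1 file layout: each controller is a mounted directory tree, a group is a subdirectory, the memory controller's limit lives in memory.limit_in_bytes, and processes are attached by writing a PID to tasks. The function name create_memory_group and the group name ct101 are hypothetical. On a real host the controller root is commonly /sys/fs/cgroup/memory and writing there needs root, so the example runs against a scratch directory instead.

```python
import os
import tempfile


def create_memory_group(controller_root, group, limit_bytes, pid=None):
    """Create a group under the memory controller and set its limit.

    Assumes the cgroup v1 layout: a group is a directory, the limit is in
    memory.limit_in_bytes and member processes are written to tasks.
    """
    path = os.path.join(controller_root, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    if pid is not None:
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(pid))
    return path


# Dry run on a scratch directory rather than a real cgroup mount.
scratch = tempfile.mkdtemp()
group_path = create_memory_group(
    scratch, "ct101", 512 * 1024 * 1024, pid=os.getpid()
)
with open(os.path.join(group_path, "memory.limit_in_bytes")) as f:
    limit_written = f.read()
print(limit_written)
```

Because groups are just directories, nesting one group inside another gives the hierarchy the slide refers to.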

12 Problem 3: easy resources
User Beancounters are complicated:
  the user has to set all these parameters,
  some of which are interdependent
We created a collection of valid configs,
... wrote a whole book about UBC
... and a set of tools to help

14 Solution: VSwap
Only two primary parameters: RAM and swap
  others still exist, but no longer need to be set
Swap is virtual, no actual I/O is performed
Slow down to emulate real swap
Only when an actual global RAM shortage occurs does virtual swap go into the real swap
Currently only available in the OpenVZ kernel
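
The accounting described on this slide can be pictured with a toy model. The Python sketch below is not OpenVZ kernel logic; the function name vswap_state, its parameters, and the "over limit" outcome for usage beyond RAM plus swap are assumptions made for illustration only.

```python
def vswap_state(used_mb, ram_mb, swap_mb, host_ram_short=False):
    """Classify a container's memory use under a VSwap-like policy (toy model)."""
    if used_mb <= ram_mb:
        return "in RAM"
    if used_mb <= ram_mb + swap_mb:
        # Over the RAM limit but within RAM + swap: the excess counts as
        # swapped out. Without a host-wide shortage no I/O happens and the
        # container is only slowed down; otherwise pages go to real swap.
        if host_ram_short:
            return "real swap"
        return "virtual swap (throttled)"
    # Assumption: the slide does not say what happens past RAM + swap.
    return "over limit"


for used, short in ((256, False), (700, False), (700, True), (2000, False)):
    print(used, short, vswap_state(used, ram_mb=512, swap_mb=512,
                                   host_ram_short=short))
```

The point of the model is that the container administrator only reasons about two numbers, while the host decides whether "swap" ever touches a disk.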

15 Problem 4: fast live migration
We can migrate an OpenVZ container from one physical server to another without a shutdown
We want to do it fast even for huge containers
  huge disk: use shared storage
  huge RAM: ???

16 Normal migration process
(Assuming shared storage)
1. Freeze the container
2. Dump its complete state to a dump file
3. Copy dump file to destination server
4. Undump
5. Unfreeze
Problem: huge dump file

17 Solution 1: network swap
1. Dump the minimal memory, lock the rest
2. Restore the minimal memory, mark the rest as swapped out
3. Set up network swap from the source
4. Unfreeze. Missing RAM will be “swapped in”
5. Migrate the rest of RAM and kill it on source

19 Solution 1: network swap
1. Dump the minimal memory, lock the rest
2. Copy, undump what we have, mark the rest as swapped out
3. Set up network swap served from the source
4. Unfreeze. Missing RAM will be “swapped in”
5. Migrate the rest of RAM and kill it on source
PROBLEM? Reliability, no way to rollback
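
The network swap steps can be sketched as a small in-memory simulation. This is only an illustration of on-demand page fetching; the classes SourceNode and DestinationNode and their methods are hypothetical and do not correspond to any OpenVZ interface.

```python
class SourceNode:
    """Holds the frozen container's pages until they are pulled."""

    def __init__(self, pages):
        self.pages = dict(pages)

    def fetch(self, page_no):
        # Hand the page over and drop the local copy.
        return self.pages.pop(page_no)


class DestinationNode:
    def __init__(self, source, minimal_pages):
        self.source = source
        # Steps 1-2: only the minimal set is restored up front.
        self.pages = {n: source.fetch(n) for n in minimal_pages}
        self.faults = 0

    def read(self, page_no):
        # Step 4: a missing page is "swapped in" over the network on demand.
        if page_no not in self.pages:
            self.faults += 1
            self.pages[page_no] = self.source.fetch(page_no)
        return self.pages[page_no]

    def drain(self):
        # Step 5: pull whatever is still left on the source.
        for page_no in list(self.source.pages):
            self.pages[page_no] = self.source.fetch(page_no)


source = SourceNode({n: f"data-{n}" for n in range(8)})
dest = DestinationNode(source, minimal_pages=[0, 1])
dest.read(5)   # not restored yet, fetched from the source
dest.read(1)   # already local, no fault
dest.drain()
print(dest.faults, len(dest.pages), len(source.pages))
```

Between the unfreeze and the end of drain(), the container's memory is split across two machines. If the source or the link fails in that window, neither side holds a complete copy, which is the reliability problem with no rollback.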

20 Solution 2: Iterative RAM migration
1. Ask kernel to track modified pages
2. Copy all memory to destination system
3. Ask kernel for list of modified pages
4. Copy those pages
5. GOTO 3 until satisfied
6. Freeze and do migration as usual
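
The loop above can be sketched with page counts only. The Python simulation below assumes the workload modifies a fixed percentage of pages for every page copied (dirty_percent) and treats "until satisfied" as a threshold on the remaining pages, bounded by max_rounds; the function name and all three parameters are assumptions for illustration, not part of OpenVZ.

```python
def iterative_migration(total_pages, dirty_percent, threshold, max_rounds=10):
    """Simulate iterative RAM migration using page counts only.

    dirty_percent is the assumed number of pages modified per 100 pages
    copied; it must be below 100 for the loop to converge.
    Returns (rounds, pages copied while running, pages left for the freeze).
    """
    pending = total_pages      # step 2: the first pass copies all memory
    copied_live = 0
    rounds = 0
    while pending > threshold and rounds < max_rounds:
        copied_live += pending                      # steps 2 and 4: copy
        pending = pending * dirty_percent // 100    # step 3: modified meanwhile
        rounds += 1
    # Step 6: whatever is still pending is copied while the container is frozen.
    return rounds, copied_live, pending


result = iterative_migration(total_pages=1_000_000, dirty_percent=10,
                             threshold=1_000)
print(result)
```

With these numbers the container stays running while about 1.11 million pages are copied over three rounds, and only 1,000 pages remain for the frozen phase. A workload that dirties memory as fast as it is copied (dirty_percent of 100) never converges, which is why the sketch caps the number of rounds.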

21 Problem 5: upstreaming
OpenVZ was developed separately
Then we wanted to merge it upstream (i.e. to vanilla Linux kernel)
Problem?

23 Problem 5: upstreaming
OpenVZ was developed separately
Then we wanted to merge it upstream (i.e. to vanilla Linux kernel)
Problem: upstream devs are not accepting our work

24 Solution 1: rewrite from scratch
User Beancounters -> CGroups
Did 2 rewrites for PID namespace until it finally got accepted
Network namespace redone
It works!
  about 1500 patches landed in vanilla
  Parallels made it to the top 10 contributors

25 Solution 2: CRIU
We tried hard to merge checkpoint/restore
Other people tried hard too, no luck
Can't make it to the kernel, let's go userspace
With minimal kernel intervention when required
Kernel exports most of the information already, so let's just add missing bits and pieces

26 Checkpoint / Restore (mostly) In Userspace
Tools currently at version 0.4
Will do 1.0 release this year
Kernel 3.8 has about 120 patches from us
95% of needed features are there
Memory snapshot recently made it to -mm tree

28 Problem 6: common file system
Container is just a directory on host, all CTs reside on the same FS
File system journal is a bottleneck
Lots of small-size files I/O on CT backup
No sub-tree disk quota support in upstream
No per-container snapshots
Live migration: rsync -- changed inodes
File system type and properties are fixed

29 Solution 1: LVM
Works only on top of a block device
Hard to manage (e.g. how to migrate a huge volume?)
No dynamic allocation
Complicated management

30 Solution 2: loop device
VFS operations lead to double page-caching (already fixed in recent kernels)
No dynamic allocation, max space is used
Limited feature set

31 Solution 3: ploop
Basic idea: same as loop, just better
Modular design:
  various image formats (qcow2 in TODO)
  various I/O backends
More features:
  live resize
  instant live snapshots
  write tracker to help in live migration

32 Any problems questions?