Virtual Machines vs. Containers Revisited - Part 3

Mobycast
Episode • Oct 23, 2019 • 58m

In this episode, we cover the following topics:

  • Operating-system-level virtualization = containers
    • Allows the resources of a computer to be partitioned via the kernel
      • All containers share a single kernel with each other AND the host system
    • Depend on their host OS to do all the communication and interaction with the physical machine
      • Containers don't need a hypervisor; they run directly on the host machine's kernel
    • Containers use the underlying operating system's resources and drivers
      • This is why you cannot run different OSes on the same host system
        • i.e. Windows containers can run on Windows only, and Linux containers can run on Linux only
      • What we think of as different OSes (RHEL, CentOS, SUSE, Debian, Ubuntu) are not really different...
        • They all share the same core OS (the Linux kernel); they differ only in apps/files
    • Based on the virtualization, isolation, and resource management mechanisms provided by the Linux kernel
      • namespaces
      • cgroups
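
The shared-kernel point above can be checked from a shell. This is a minimal sketch, assuming a Linux system where `/etc/os-release` is present; the function names `kernel_release` and `distro_name` are made up for illustration. Run on the host and again inside a container, the kernel line should match while the distro line reflects the container's own files.

```shell
#!/bin/sh
# Print the kernel release and the distribution name seen by this process.
# The kernel release comes from the running (shared) kernel; the distribution
# name comes from a plain file in the current root filesystem.

kernel_release() {
    uname -r
}

distro_name() {
    # /etc/os-release holds KEY=value pairs; it may be missing in very
    # small images, so fall back to "unknown".
    if [ -r /etc/os-release ]; then
        # Source it in a subshell so its variables do not leak into the caller.
        ( . /etc/os-release && printf '%s\n' "${PRETTY_NAME:-unknown}" )
    else
        printf 'unknown\n'
    fi
}

printf 'kernel: %s\n' "$(kernel_release)"
printf 'distro: %s\n' "$(distro_name)"
```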
  • Container history
    • FreeBSD Jails (2000)
      • BSD userland software that runs on top of the chroot(2) system call
        • chroot is used to change the root directory of a set of processes
      • Processes created in the chrooted environment cannot access files or resources outside of it
      • Jails virtualize access to the file system, the set of users, and the networking subsystem
      • A jail is characterized by four elements:
        • Directory subtree: the starting point from which a jail is entered
          • Once inside the jail, a process is not permitted to escape outside of this subtree
        • Hostname
        • IP address
        • Command: the path name of an executable to run inside the jail
      • Configured via jail.conf file
    • LXC containers (2008)
      • Userspace interface for the Linux kernel features to contain processes, including:
        • Kernel namespaces (ipc, uts, mount, pid, network and user)
        • AppArmor and SELinux profiles
        • Seccomp policies
        • Chroots (using pivot_root)
        • Kernel capabilities
        • CGroups (control groups)
    • Docker containers (2014)
      • Early versions of Docker used LXC as the container runtime
      • LXC was made optional in v0.9 (March 2014)
        • Replaced by libcontainer
        • libcontainer became the core of runC
      • LXC was dropped in v1.10 (February 2016)
  • Container technology
    • Containers are just processes. So what makes them special?
    • Namespaces
      • Restrict what you can SEE
      • Virtualize system resources, like the file system or networking
        • Makes it appear to processes within the namespace that they have their own isolated instance of the resource
        • Changes to the global resource are visible only to processes that are members of the namespace
      • Processes inherit from parent
      • Linux provides the following namespaces:
        • IPC (interprocess communications)
          • CLONE_NEWIPC: Isolates System V IPC, POSIX message queues
        • Network
          • CLONE_NEWNET: Isolates network devices, stacks, ports, etc
        • Mount
          • CLONE_NEWNS: Isolates mount points
        • PID
          • CLONE_NEWPID: Isolates process IDs
        • User
          • CLONE_NEWUSER: Isolates user and group IDs
        • UTS (Unix Timesharing System)
          • CLONE_NEWUTS: Isolates hostname and NIS domain name
        • Cgroup
          • CLONE_NEWCGROUP: Isolates cgroup root directory
      • Syscall interface
        • A system call is the fundamental interface between an app and the Linux kernel
          • e.g. the clone(2), unshare(2), and setns(2) system calls create or enter namespaces for processes
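
To see which namespaces a process belongs to, the kernel exposes one symlink per namespace under `/proc/<pid>/ns`. The sketch below only reads those links, so it needs no special privileges; `show_namespaces` is a name chosen for this example, and the output depends on the kernel in use. Two processes are in the same namespace when the link targets match.

```shell
#!/bin/sh
# Print the namespace identifiers of a process (default: the current shell).
# Each entry under /proc/<pid>/ns is a symlink with a target such as
# "pid:[4026531836]".
show_namespaces() {
    pid="${1:-$$}"
    for ns in /proc/"$pid"/ns/*; do
        # Skip the unexpanded glob when the directory does not exist.
        [ -e "$ns" ] || [ -L "$ns" ] || continue
        printf '%s -> %s\n' "$(basename "$ns")" "$(readlink "$ns")"
    done
}

show_namespaces
```

Comparing `show_namespaces` on the host with `show_namespaces <pid>` for a process inside a container (run as root) should show different targets for the namespaces the container runtime created.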
    • Control groups (cgroups)
      • Restrict what you can DO
      • Limit an application (container) to a specific set of resources, such as CPU and memory
      • Allow containers to share available hardware resources and optionally enforce limits and constraints
      • Creating, modifying, and using cgroups is done through the cgroup virtual filesystem
      • Processes inherit from parent
      • Processes can be reassigned to different cgroups
      • Resources that can be controlled include:
        • Memory
        • CPU / CPU cores
        • Devices
        • I/O
        • Processes
      • Using cgroups
        • To see mounted cgroups:
          • mount | grep cgroup
        • To create a new cgroup:
          • mkdir /sys/fs/cgroup/cpu/chris
        • To set "cpu.shares" to 512:
          • echo 512 > /sys/fs/cgroup/cpu/chris/cpu.shares
        • Now add a process to this cgroup:
          • echo <get_pid> > /sys/fs/cgroup/cpu/chris/cgroup.procs
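
The cgroup commands above can be wrapped in one small function. This is a sketch under the same assumption as the notes: a cgroup v1 layout with the cpu controller mounted at `/sys/fs/cgroup/cpu`. The `CGROUP_ROOT` variable and the `limit_cpu_shares` name are hypothetical, added so the function can be pointed at a scratch directory when trying it without root.

```shell
#!/bin/sh
# Assumption: cgroup v1, cpu controller mounted at /sys/fs/cgroup/cpu.
# CGROUP_ROOT is a hypothetical override for experimenting safely.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/cpu}"

# limit_cpu_shares NAME SHARES PID
# Creates the cgroup NAME, sets its cpu.shares, and moves PID into it.
limit_cpu_shares() {
    name="$1"
    shares="$2"
    pid="$3"
    group="$CGROUP_ROOT/$name"

    # Creating a directory in the cgroup filesystem creates the cgroup.
    mkdir -p "$group" || return 1
    # Writing to the control file sets the relative CPU weight.
    echo "$shares" > "$group/cpu.shares" || return 1
    # Writing a PID to cgroup.procs moves that process into the cgroup.
    echo "$pid" > "$group/cgroup.procs" || return 1
}

# Example (as root, on a cgroup v1 host):
#   limit_cpu_shares chris 512 "$$"
```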
  • Pseudo code: Creating a container
    • Steps:
      • Create root filesystem for container
        • Spin up busybox in a Docker container, then export its filesystem
      • Run "launcher" process that sets up "child" namespace
      • Launcher process forks new child process (now under new namespaces)
        • Child process then forks new process for container
          • chroot (to our root filesystem)
          • mount any other FS
          • set cgroups (e.g. apply CPU constraints)
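
The steps above can be approximated from a shell. This is a rough sketch, not the code discussed in the episode: it swaps the hand-written launcher for the util-linux `unshare` command, and it assumes Docker is installed, the script runs as root, and the exported busybox filesystem contains `/bin/sh` and a `/proc` directory. The `run`, `make_rootfs`, and `run_container` helpers and the `DRY_RUN` variable are hypothetical names for this example.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1, so the sketch
# can be inspected without root privileges.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: create a root filesystem by exporting a busybox container.
make_rootfs() {
    rootfs="$1"
    mkdir -p "$rootfs" || return 1
    cid="$(docker create busybox)" || return 1
    docker export "$cid" | tar -x -C "$rootfs"
    docker rm "$cid" >/dev/null
}

# Steps 2 and 3: fork a child into new namespaces, chroot into the root
# filesystem, mount /proc there, then exec the requested command.
# CPU limits can be applied by placing the calling shell in a cgroup first,
# since child processes inherit their parent's cgroup.
run_container() {
    rootfs="$1"
    shift
    run unshare --fork --pid --mount --uts --ipc --net \
        chroot "$rootfs" /bin/sh -c 'mount -t proc proc /proc && exec "$@"' sh "$@"
}

# Example (as root):
#   make_rootfs ./rootfs
#   run_container ./rootfs /bin/sh
```

Setting `DRY_RUN=1` before calling `run_container` prints the `unshare` command line instead of executing it, which is a convenient way to see how the namespace flags map onto the list earlier in these notes.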

End Song
Bettie Black & Sophia - Something Beautiful

For a full transcription of this episode, please visit the episode webpage.

We'd love to hear from you! You can reach us at: