Intel Optane Storage with Docker's layers

Using Intel’s Optane SSD storage to dive into Docker’s layers

ash wilson / 07.11.17

Background:

Containerization is a rapidly growing trend in application hosting infrastructure. There are a number of guiding principles and best practices for building container images (containerization’s analog for virtual-machine images). One guiding principle is to build small, concise, single-concern images. The tools for building container images (Docker, for instance) are flexible, which leaves room for creativity and means those principles aren’t necessarily enforced. We were curious to know how community-built images are constructed, and whether the most popular ones adhere to these best practices.

Not too long ago, Intel issued an open call for interesting use cases to test out on their new Intel Optane technology. We responded to that call with a use case for analyzing community-contributed images from Dockerhub.

Originally, we tried doing this research on a test server, but we were unable to get results in a reasonable amount of time.

Intel’s team agreed that it was an interesting use case, and gave us access to a server with a pair of blazing-fast 375GB Optane SSDs installed. The problem with our internal research server was I/O, and after switching to the server with Optane storage, I/O ceased to be a problem and the constraining factor became internet bandwidth… on a pipe that was peaking at over 550Mbps.

Armed with this new server, we began to use our analysis tools to better understand how people construct container images. One area of investigation involved researching the number of layers comprising a Docker image, paying special attention to whether or not being concise was paramount in community-contributed images.

So let’s dive in.

In this post we make the assumption that you are:

  • Linux proficient:
    • Have a basic understanding of inodes and blocks
    • Have heard of OverlayFS, and/or overlay2
  • Docker proficient:
    • Have used it before, at least in testing, and understand how to build an image
    • Understand the basics of writing a Dockerfile
  • Interested in build process subtleties that can impact performance and stability

What we were looking for:

We used the Intel server to download and analyze more than 4,900 community-built images from Dockerhub, starting with the most popular.

We examined different aspects of these images, some having to do with the metadata that ships in the image manifest, others having to do with the actual contents of the image. The number of layers in each Docker image immediately stood out to us in our research.

What makes a layer? Each action in the build process that changes the file system creates a new layer. When you build a new image based on another image (like an Alpine or Ubuntu base image), you inherit the layers used to create those base images.
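
To make the layer rule concrete, here is a minimal Python sketch that counts the instructions in a Dockerfile that would add a filesystem layer. It assumes that only RUN, COPY, and ADD change the file system, and the two Dockerfiles are hypothetical examples written for illustration, not the ones from the tests later in this post.

```python
# Instructions assumed to change the image file system and so add a layer.
LAYER_INSTRUCTIONS = {"RUN", "COPY", "ADD"}


def count_fs_layers(dockerfile_text):
    """Count instructions that add a filesystem layer, joining continuation lines."""
    count = 0
    continuing = False  # True while inside a backslash-continued instruction
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not continuing:
            keyword = line.split(None, 1)[0].upper()
            if keyword in LAYER_INSTRUCTIONS:
                count += 1
        continuing = line.endswith("\\")
    return count


# Hypothetical Dockerfile with one RUN per command.
SPLIT = """\
FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install -y python-pip
RUN pip install flask
RUN apt-get clean
COPY hello.py /app/hello.py
CMD ["python", "/app/hello.py"]
"""

# Hypothetical Dockerfile with the same commands grouped into a single RUN.
GROUPED = """\
FROM ubuntu:16.04
RUN apt-get update && \\
    apt-get install -y python-pip && \\
    pip install flask && \\
    apt-get clean
COPY hello.py /app/hello.py
CMD ["python", "/app/hello.py"]
"""

print("split:", count_fs_layers(SPLIT))
print("grouped:", count_fs_layers(GROUPED))
```

Both files run the same commands, but the grouped version adds fewer layers on top of whatever the base image already carries.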

The average Docker image in our sample set had more than 12 layers (12.62 across the whole sample set). How many layers did the most layered image have? 122. Nope, that’s not a typo. And that’s among the top 4,000 most popular images (out of more than 600,000) on Dockerhub.

It’s hard to say how many layers are too many – it’s subjective and in the end you have to do what’s right for your application. Who cares, anyway? Specifically, why should you care about how many layers your images have?

Because the devil is in the details.

Let’s talk for a minute about layered file systems. In a common scenario, you’ll format a volume with ext4 and mount it at /var/lib/docker. The layered file system then sits on top of that file system, so to speak. When you tell the Docker engine to download an image, it fetches the image manifest, which contains a list of layer digests (SHA-256 sums). The Docker engine then downloads each layer referenced by its digest as a gzipped tarball, and expands it on your file system, somewhere under /var/lib/docker.
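
As a rough illustration of that manifest lookup, the sketch below reads the layer list out of a manifest. The manifest here is a hypothetical, trimmed-down document in the shape of Docker's image manifest (version 2, schema 2); the digests and sizes are placeholders, not real layers.

```python
import json

# Hypothetical manifest, trimmed to the fields used here. Digests are placeholders.
MANIFEST_JSON = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "layers": [
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
     "size": 1000, "digest": "sha256:aaaa"},
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
     "size": 2000, "digest": "sha256:bbbb"},
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
     "size": 3000, "digest": "sha256:cccc"}
  ]
}
"""


def layer_digests(manifest_text):
    """Return the layer digests in the order the manifest lists them."""
    manifest = json.loads(manifest_text)
    return [layer["digest"] for layer in manifest.get("layers", [])]


digests = layer_digests(MANIFEST_JSON)
print(len(digests), "layers")
for digest in digests:
    print(digest)
```

Counting the entries in that list is one way to get a layer count for an image without downloading its contents.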

The exact location depends on the storage driver you’re using. If you navigate to /var/lib/docker/overlay2 (assuming you’ve configured Docker to use the overlay2 storage driver) you can run `tree` and see that each layer directory’s diff/ directory represents the root of the image’s file system, but only contains the files that were created or changed. Remember, each of these corresponds to a build step where the filesystem of the image was altered.
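
If you would rather count than eyeball `tree` output, a small Python sketch along these lines can summarize each layer. It assumes the overlay2 layout described above (one directory per layer, each with a diff/ subdirectory); the function name is ours, and reading /var/lib/docker normally requires root.

```python
import os


def entries_per_layer(overlay_root):
    """Map each layer directory name to the number of files and directories under its diff/."""
    counts = {}
    for name in sorted(os.listdir(overlay_root)):
        diff_dir = os.path.join(overlay_root, name, "diff")
        if not os.path.isdir(diff_dir):
            continue  # not a layer directory (for example, the directory of short links)
        total = 0
        for _dirpath, dirnames, filenames in os.walk(diff_dir):
            total += len(dirnames) + len(filenames)
        counts[name] = total
    return counts


if __name__ == "__main__":
    root = "/var/lib/docker/overlay2"
    if os.path.isdir(root) and os.access(root, os.R_OK):
        for layer, count in entries_per_layer(root).items():
            print(layer, count)
```

Layers with very few entries in diff/ correspond to build steps that touched only a handful of files.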

So each layer gets expanded into a directory and some magic ties it all together when you instantiate a container from the image, right? Sort of. It’s not magic, it’s mount. With OverlayFS in Linux kernel 4.0 and later, you can mount multiple directories as a single file system. You can see this in action by running `mount | grep ^overlay` on the host that’s running Docker, if you have any containers currently running. You’ll see a colon-delimited list of layer ID links.
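
To pull that colon-delimited list apart, something like the following Python sketch works on a single line of `mount` output. The sample line is hypothetical, with placeholder IDs; real lines are much longer and their exact paths depend on your Docker version.

```python
def lower_layers(mount_line):
    """Return the lowerdir entries from one overlay line of `mount` output."""
    start = mount_line.find("(")
    end = mount_line.rfind(")")
    if start == -1 or end == -1:
        return []
    for option in mount_line[start + 1:end].split(","):
        if option.startswith("lowerdir="):
            return option[len("lowerdir="):].split(":")
    return []


# Hypothetical sample line; the IDs are placeholders.
SAMPLE = (
    "overlay on /var/lib/docker/overlay2/abc123/merged type overlay "
    "(rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAAA:"
    "/var/lib/docker/overlay2/l/BBBB:/var/lib/docker/overlay2/l/CCCC,"
    "upperdir=/var/lib/docker/overlay2/abc123/diff,"
    "workdir=/var/lib/docker/overlay2/abc123/work)"
)

layers = lower_layers(SAMPLE)
print(len(layers), "lower layers")
for path in layers:
    print(path)
```

Each entry in the result is one read-only layer stacked beneath the container's writable upper directory.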

For more information on how to chase down the actual layer IDs from those all-caps links, check out this helpful Docker page.

If you follow that link, you’ll notice that Docker recommends the use of overlay2 instead of overlay as a storage driver. One of the most compelling reasons for choosing overlay2 is the difficulty of keeping inode consumption under control with overlay. And now we’re getting to why you should care:

In the CIA Triad, the A is for Availability. If you consume all the inodes on your Docker engine’s storage volume, your availability suffers.
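
A quick way to keep an eye on this is to ask the file system how many inodes are left. The sketch below uses Python's `os.statvfs`; the function name and the 90% threshold are our own choices for illustration, and some file systems report zero total inodes, which the code treats as "nothing to measure".

```python
import os


def inode_usage(path):
    """Return (used, total, fraction_used) inodes for the file system containing path."""
    st = os.statvfs(path)
    total = st.f_files
    used = total - st.f_ffree
    fraction = used / total if total else 0.0
    return used, total, fraction


if __name__ == "__main__":
    # Point this at /var/lib/docker on a Docker host; "." keeps the example portable.
    used, total, fraction = inode_usage(".")
    print(f"{used} of {total} inodes used ({fraction:.1%})")
    if fraction > 0.9:  # arbitrary warning threshold for this sketch
        print("warning: inode usage is above 90%")
```

This reports the same counts that `df -i` shows for the volume.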

We used Munin to track system resource consumption throughout the testing period, and we noticed a big difference when switching from overlay to overlay2:

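[Figure: two Munin charts of disk space and inode consumption, with overlay on the left and the transition to overlay2 on the right]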

The chart on the left shows disk space and inode consumption when using overlay, and the chart on the right shows the transition to overlay2 on the 5th of the month. Inode consumption was a consistent problem with overlay. When we switched to overlay2 (and kernel version 4), we were able to consume more images concurrently because inode exhaustion ceased to be an issue in our testing environment. (Something to note with implementation: if you decide to go with Optane storage in your environment, know that we were able to fill, analyze, and flush image storage extremely fast with a pair of 375GB P4800X Optane SSDs, striped with LVM.)

Here’s a real-world exercise that my colleague Leslie Devlin put together to highlight the difference between overlay and overlay2, and the benefits of being concise:

We tested six use cases using overlay and overlay2 on identically configured Ubuntu hosts with ext4 /var/lib/docker partitions. We built Ubuntu-based containers running a Python “hello world” script within Flask. The Dockerfiles for the first three tests used the same install commands, but had slight changes in the ordering and grouping of those commands. You’d expect nearly identical Dockerfiles to build nearly identical containers, but with overlay involved, you’d be mistaken.

Test   Overlay blocks   Overlay inodes   Overlay2 blocks   Overlay2 inodes
1      506976           31509            480980            17160
2a     506960           31506            480876            17133
2b     500636           27341            480788            17111
3b     500652           27344            480892            17138

Test case 2b, which uses the fewest inodes in overlay, has its RUN commands together in a single layer:

When a file is created in one layer of a container, and then removed in the same layer (as the install artifacts are here with apt-get clean), the files are not saved when the next layer is built. However, when a file is created in one layer, and “deleted” in another, it’s not actually removed, but hidden from view with a whiteout in that later, higher layer. Since that file is still stored in a lower layer of the container, it comes at a cost in actual disk usage.
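
The difference is easier to see in a toy model. The Python sketch below represents each layer as a dictionary of paths and uses a marker object in place of a whiteout; it is a simplified simulation of the idea, not how OverlayFS stores data, and the file names are made up.

```python
WHITEOUT = object()  # marker standing in for a whiteout entry


def merged_view(layers):
    """Paths visible in the merged file system; later layers override earlier ones."""
    visible = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                visible.pop(path, None)  # hidden from view, not removed from lower layers
            else:
                visible[path] = content
    return visible


def stored_entries(layers):
    """Entries kept on disk across all layers, whiteouts included."""
    return sum(len(layer) for layer in layers)


# Created and removed in the same layer: the archive never reaches disk.
same_layer = [{"/usr/bin/python": "binary"}]

# Created in one layer, "deleted" in a later one: the archive and its whiteout are both stored.
later_layer = [
    {"/usr/bin/python": "binary", "/var/cache/apt/pkg.deb": "archive"},
    {"/var/cache/apt/pkg.deb": WHITEOUT},
]

print(merged_view(same_layer) == merged_view(later_layer))
print(stored_entries(same_layer), stored_entries(later_layer))
```

Both images look the same from inside the container, but the second one stores more entries underneath.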

So 3b, which is virtually identical to the above test in terms of inode usage, moves apt-get clean to the end of the file:

Since apt-get clean is run in a later layer, it’s creating a whiteout of all of the files left over from the install commands in previous layers rather than actually removing them from the disk. What’s a whiteout? Here’s a brief article that describes what it is and how it works.

In these cases, conciseness saves inodes – but even the containers that are kinder to inodes use about 60% more of them under overlay than under overlay2.
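
For anyone who wants to check the arithmetic, this short Python sketch computes the extra inode consumption under overlay for each test case, using the inode counts from the table above.

```python
# (overlay inodes, overlay2 inodes) for each test case, copied from the table.
INODES = {
    "1": (31509, 17160),
    "2a": (31506, 17133),
    "2b": (27341, 17111),
    "3b": (27344, 17138),
}


def extra_percent(overlay, overlay2):
    """Percentage of additional inodes used by overlay relative to overlay2."""
    return (overlay / overlay2 - 1) * 100


for case, (overlay, overlay2) in INODES.items():
    print(f"{case}: {extra_percent(overlay, overlay2):.1f}% more inodes under overlay")
```

The grouped cases (2b and 3b) come out near 60%, while the less concise cases (1 and 2a) come out noticeably higher.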

In our second set of test cases we actually tried to eat up inodes. We started with 2b, the best case in terms of inode usage in overlay. For case 3a, we added three small data writes at the end of the Dockerfile:

For case 3c, we used those same writes, but also added a command to delete the padding file:

Test   Overlay blocks   Overlay inodes   Overlay2 blocks   Overlay2 inodes
3a     520712           40109            481004            17162
3c     527400           44364            481072            17179

Once again, the overlay2 containers are pretty resilient, using only a few inodes for our malicious writes. Overlay, on the other hand, uses more than 2½ times as many inodes as overlay2 for our test 3c container. The rm command alone used over four thousand inodes. (The numbers for overlay on these two tests looked so out of place that we ran the same tests again on a different day just to be sure.)
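
The per-step cost can be worked out directly from the two tables. This Python sketch subtracts the inode counts of consecutive cases (2b as the baseline, then 3a, then 3c) to show what the extra writes and the rm command cost under each driver.

```python
# Inode counts from the tables: 2b is the baseline, 3a adds writes, 3c adds the rm.
OVERLAY = {"2b": 27341, "3a": 40109, "3c": 44364}
OVERLAY2 = {"2b": 17111, "3a": 17162, "3c": 17179}


def step_costs(counts):
    """Inodes added by the writes (2b to 3a) and by the rm command (3a to 3c)."""
    return counts["3a"] - counts["2b"], counts["3c"] - counts["3a"]


writes_overlay, rm_overlay = step_costs(OVERLAY)
writes_overlay2, rm_overlay2 = step_costs(OVERLAY2)
ratio_3c = OVERLAY["3c"] / OVERLAY2["3c"]

print("overlay:  writes", writes_overlay, "rm", rm_overlay)
print("overlay2: writes", writes_overlay2, "rm", rm_overlay2)
print(f"3c overlay/overlay2 inode ratio: {ratio_3c:.2f}")
```

Under overlay2 both steps cost a few dozen inodes at most, which is what makes the overlay numbers stand out.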

We see here that, when using the overlay2 driver, Dockerfiles that do more or less the same thing build out as containers that use more or less the same amount of disk resources. But even minor differences in Dockerfiles have a huge impact on inode usage in containers built using overlay. These differences might be unintentional, or they might be slipped into a complex build file with intent to cause harm.

Disk may be cheap, but no one wants to be surprised by running out of inodes. And even if you build your filesystems anticipating this kind of inode burn, we expect that, with more complex containers built out at scale, this would have a real impact on how much you’d have to spend on file storage.

Conclusion:

There’s no hard and fast limit to how many layers are too many, but not paying attention to this aspect of image construction can have a significant impact on the performance and reliability of your application and platform. The impact can be even more pronounced if you’re using the overlay (as opposed to overlay2) storage driver in Docker.

We would like to thank Intel for giving us access to an Optane testing server and making this kind of research possible!