Containerization

Motivation

Science today faces a reproducibility crisis. Key findings published across scientific disciplines are not corroborated by other scientists when tested independently. A survey conducted by Nature asked scientists which factors they thought contributed the most to the crisis. Over 80% of respondents felt that ‘Methods, code unavailable’ contributed to irreproducible research.

For many scientists, software and analysis have become an indispensable and increasingly unavoidable component of their research. Critical findings now arise from data analyses that use tools developed in house as well as tools published by others. These components are usually integrated by custom ‘glue code’ that connects them together.

This environment poses a new set of challenges to scientists who use computational methods in their research:

  • How do we write analysis code that is robust and reproducible?
  • How can we concisely communicate our code to other researchers?
  • How do we share analysis code with other researchers in a form that can be easily executed?

As computational analyses and tools become more complex, so do the environments needed to execute them. Modern software packages often require hundreds of supporting software packages, provided either by a particular operating system or by a third party. Further, each of these dependencies has a specific version or set of versions that is needed for the package to run. The author of a package could in principle record all of these packages and their dependencies and provide this list with their software distribution, but maintaining this list and ensuring cross-platform compatibility is a major challenge. Environment management software such as miniconda is available to address this challenge, but it introduces additional complexity: it is itself an additional software dependency, package availability depends largely on community support, and third-party software packages may not be supported across different platforms. A superior solution to managing and deploying complex software environments is to create containerized applications.
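To make the idea of recording a dependency list concrete, here is a minimal Python sketch that lists every package installed in the current Python environment together with its version. The function name snapshot_environment is hypothetical and chosen only for illustration. The sketch uses the standard library module importlib.metadata and covers Python packages only, not libraries supplied by the operating system.

```python
from importlib import metadata


def snapshot_environment():
    """Return a sorted list of 'name==version' strings for installed Python packages.

    Hypothetical helper for illustration; it sees only Python distributions,
    not system libraries or other non-Python dependencies.
    """
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip distributions whose metadata is missing or has no name.
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue
        pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in snapshot_environment():
        print(pin)
```

Even a complete list of this kind describes only one layer of the environment, which is part of why maintaining such lists by hand across platforms becomes difficult.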

Containerization

Containerization, also known as operating-system-level virtualization, is a technology that enables the encapsulation and execution of sets of software and their dependencies in a platform-agnostic manner. A software container is a file that has been built by specific containerization software, e.g. docker or singularity, to contain all of the software and instructions needed to run an application.

What is a container?

Generally speaking, a container is a file that specifies a collection of software that can run in a particular execution environment. The execution environment is provided by the containerization software, e.g. docker, such that the container doesn’t have to be aware of the particular machine it is running on. This means that a container will be portable to any environment where the containerization software can run, thus eliminating the need for software authors (i.e. us) to worry about whether or not our code will run on any given hardware or operating system.
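One way to see what an execution environment provides is to ask a program to describe the environment it is running in. The following is a minimal Python sketch. The function name describe_environment is hypothetical, and the sketch assumes Python is available both on the host and inside the container. Running the same script in both places should show which details come from the container (for example, the Python version) and which come from the underlying machine and kernel.

```python
import platform


def describe_environment():
    """Return a small dictionary describing the current execution environment.

    Hypothetical helper for illustration only.
    """
    return {
        "operating_system": platform.system(),   # e.g. the OS family name
        "kernel_release": platform.release(),    # release of the running kernel
        "machine": platform.machine(),           # hardware architecture
        "python_version": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

The values are deliberately not shown here, since they depend entirely on where the script is run.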

At the time of writing (July 2018), docker is by far the most popular containerization software. docker has been open source since its release in 2013, and an enormous community has grown around it since. Due to its popularity, this workshop will use docker exclusively as the vehicle for demonstrating containerization of custom applications.

A more recent containerization software package called singularity addresses some of the usability shortcomings of docker. If docker is not available on your computational resources due to security concerns, then singularity may be an option. The containerization concepts are identical between docker and singularity, and all of the content of this workshop is easily adaptable from docker to singularity.