Packaging your own application

Workflow Overview

The simplest workflow for building a docker image containing your own code usually follows these steps:

  1. Identify an appropriate image
  2. Identify additional dependencies needed for your application
  3. Install those dependencies with the appropriate RUN commands
  4. Add your code to the image, either with ADD or git
  5. Specify an appropriate CMD or ENTRYPOINT specification
  6. Build your image, repeating steps 2-4 as needed until the build succeeds
  7. Run a container from your image and test its behavior
  8. Iterate, if needed
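In Dockerfile terms, steps 1-5 above map onto a skeleton like the following (a schematic sketch only; the base image, the git package, and the /app paths are placeholders, not a working build):

```dockerfile
# 1. choose a base image
FROM python:3.6

# 2-3. install additional dependencies (placeholder package)
RUN apt update && apt install -y git

# 4. add your code to the image
ADD . /app

# 5. specify what runs when a container of this image starts
CMD ["python", "/app/app.py"]
```

Steps 6 and 7 then become docker build --tag myapp . followed by docker run myapp.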

Preparing docker image for your code

Choosing a base image

The first step in creating a docker container is choosing an appropriate base image. In general, picking the most specific image that meets your requirements is desirable. For example, if you are packaging a python app, it is likely advantageous to choose a python base image with the appropriate python version rather than pulling an ubuntu base image and installing python using RUN commands.

Installing dependencies

Once a base image is chosen, any additional dependencies need to be installed. For Debian-based images, the apt package manager is used to manage additional packages; for Fedora-based images, the yum package manager is used. Be sure to check which Linux distribution a more specific image is built on to know which package manager to use.
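For instance, installing the same illustrative package (curl here) differs only in the package manager invocation; a given Dockerfile uses whichever line matches its base image:

```dockerfile
# on a Debian/Ubuntu-based image
RUN apt update && apt install -y curl

# on a Fedora/CentOS-based image
RUN yum install -y curl
```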

Annoyance Alert

In practice, it can be hard to know in advance all of the additional system packages that need to be installed. Often, building an image to completion and running it to identify errors is the most expedient way to create a working image.

Occasionally, a software package dependency, or a specific version of it, is not available in the software repositories of the base image's linux distro. In these cases, it might be necessary to download and install precompiled binaries manually, or to build the package from source. For example, this Dockerfile installs a specific version of samtools from a source release available on github:

FROM ubuntu:bionic

# packages needed to download and build samtools:
# https://github.com/samtools/samtools/blob/1.9/INSTALL
RUN apt update && apt install -y wget bzip2 gcc make \
    zlib1g-dev libncurses5-dev libbz2-dev liblzma-dev \
    libcurl4-openssl-dev libssl-dev

RUN wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 && \
    tar jxf samtools-1.9.tar.bz2 && \
    cd samtools-1.9 && ./configure && make install

CMD ["samtools"]

Putting your code into a docker image

Once your dependencies are installed, the final step is to move your own code into the image. There are two main strategies for doing so:

  • Copy source files into the image using the ADD command in the Dockerfile
  • Clone a git repository into the image from a publicly hosted repo like github or bitbucket

Nota Bene

In any case, it is a good idea to develop your code in a git (or other version control) repository, hosted publicly if possible. Your Dockerfile should be developed and tracked along with your code, so that both can evolve over time while maintaining reproducibility.

Locally

The local strategy is convenient when developing software. Running development code in a docker container ensures your testing and debugging environment is consistent with the execution environment where your code will ultimately run. To build from a local source tree:

  1. Create a Dockerfile in the root directory where your code resides
  2. Prepare the Dockerfile for your code as in Preparing docker image for your code
  3. Copy all of the source files into a directory (e.g. /app) in the container with ADD . /app
  4. Perform any setup that comes bundled with your package source (e.g. pip install -r requirements.txt or python setup.py) with the RUN command
  5. Set the CMD entry point appropriately for your app
  6. Build your image with an appropriate tag
  7. Run and test your application, ideally with unit tests

Assuming we have written a python application named app.py, from within the source code directory containing the application we could write the following Dockerfile:

# Use an official Python runtime as a parent image
FROM python:2.7-slim

# Copy the current (host) directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r /app/requirements.txt

# set the working directory to /cwd; mount the host's current
# directory here when the container is run
WORKDIR /cwd

# Run app.py when the container launches
ENTRYPOINT ["python", "/app/app.py"]

When a container is run, app.py will be run directly and passed any additional arguments specified to the docker run command.
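As a concrete stand-in, here is a minimal hypothetical app.py that fits the Dockerfile above and accepts the --in/--out flags used with docker run later in this section. The flag names and the line-counting logic are illustrative only, not the real application:

```python
# app.py -- minimal illustrative application (hypothetical)
import argparse

def parse_args(argv=None):
    # --in/--out mirror the flags passed to `docker run` below
    parser = argparse.ArgumentParser(description="count lines in a text file")
    parser.add_argument("--in", dest="infile", required=True,
                        help="path to the input file")
    parser.add_argument("--out", dest="outfile", required=True,
                        help="path to write the output")
    return parser.parse_args(argv)

def count_lines(text):
    # stand-in for a real analysis step
    return len(text.splitlines())

def main(argv=None):
    args = parse_args(argv)
    with open(args.infile) as fh:
        result = count_lines(fh.read())
    with open(args.outfile, "w") as fh:
        fh.write(str(result) + "\n")

# a real app.py would end with:  if __name__ == "__main__": main()
```

Because the script is the ENTRYPOINT, everything after the image name on the docker run command line is handed to argparse unchanged.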

Cloning from github/bitbucket

For software projects hosted on github or bitbucket, or when you do not wish to include a Dockerfile alongside your application source code, the Dockerfile can clone and install a git repo instead of adding code locally. In place of the ADD command above, use RUN git clone <repo url>:

FROM python:3.6

# have to install git to clone
RUN apt update && apt install -y git

# git clone repo instead of ADD
RUN git clone https://bitbucket.org/bubioinformaticshub/docker_test_app /app
RUN pip install --trusted-host pypi.python.org -r /app/requirements.txt

# set the working directory to /cwd; mount the host's current
# directory here when the container is run
WORKDIR /cwd

# use ENTRYPOINT so we can pass files on the command line
ENTRYPOINT ["python", "/app/app.py"]

Cloning a public repo into a Docker container in this way has the advantage that the environment where you write your code can be the same as, or different from, the platform where the code is run.

There is one additional caveat to this method of adding code to your image. To save on build time, docker caches the sequential steps in your Dockerfile when building an image, and only reruns steps from the first command where a change is detected. The ADD command detects when local files have changed and re-copies them into the container on docker build. A RUN git clone command, however, looks unchanged to docker even when the remote repo has new commits, so the cached (stale) clone is reused. When cloning your application from a public git repo, provide the --no-cache flag to your docker build command:

$ docker build --no-cache --tag app:latest .

This invalidates all build cache and re-clones your repo on each build.

Running your docker container

Once your code has been loaded into an image, containers from your image can be run in the normal way with docker run. Any host directories containing files needed for the analysis must be mounted:

$ docker run --mount type=bind,source=/data,target=/data \
    --mount type=bind,source=$PWD,target=/cwd app \
    --in=/data/some_data.txt --out=/data/some_data_output.csv

Remember that any time your code changes you will need to rebuild your image, including --no-cache if you pull your code from a git repo.

Publishing your docker image

Once your docker image is complete and your app is ready to share, you can create a free account on Docker Hub and upload your image. Be sure to provide a full description of what the image does, what software it contains, and how to run it, specifying any directories the container expects to be mounted to access data (e.g. /data). You might alternatively consider hosting your image on the Amazon Elastic Container Registry or the Google Cloud Container Registry. If your app will primarily be executed in AWS or Google Cloud environments, it may be preferable to publish your image to the corresponding registry.

Hands On Exercise

Writing the Dockerfile

Write, build, and run a Dockerfile that:

  1. Uses the python:3.6 base image
  2. Installs git with apt
  3. Clones the repo docker_test_app
  4. Installs the dependencies using the requirements.txt file in the repo
  5. Configures the ENTRYPOINT to run the script in the repo with python3

Running the Dockerfile with data from an S3 bucket

Nota Bene

When you run this app, you should specify the -t flag to your docker run command.

Try running the container using docker run with no arguments to see the usage.

A fastq file that can be passed to this script has been made available on a shared S3 bucket. You will download this file to your local instance using the aws cli. First, you must run aws configure and provide your access keys. Specify us-east-1 as the region. The bucket address of the file is:

s3://buaws-training-shared/test_reads.fastq.gz

Download the file using the aws cli and pass it to the app using docker run. You must mount the directory where you downloaded the fastq file using the --mount command line option as above.