Commit 3bbc6d32 authored by Frere, Jonathan (FWCC) - 142176

Merge branch 'docker-guide-blog-post'

parents e64ac398 ccde789a
Pipeline #42835 passed with stages
in 5 minutes and 58 seconds
title: "Docker for Science (Part 2)"
date: 2020-08-25
- frere
layout: blogpost
title_image: docker-cover.png
- tutorial
- consulting
- docker
excerpt: >
Previously, we learned about Docker, and how to run other people's Docker containers.
In this post, we will explore building our own images to package up our projects.
> This post is part of a short blog series on using Docker for scientific applications.
> My aim is to explain the motivation behind Docker,
> show you how it works,
> and offer an insight into the different ways that you might want to use it in different research contexts.
> Quick links:
> - [Part 1 (Getting Started with Docker)]({% post_url 2020/08/2020-08-25-getting-started-with-docker-1 %})
> - Part 2 (A Dockerfile Walkthrough) ← You are here!
> - [Part 3 (Using Docker in Practical Situations)]({% post_url 2020/08/2020-08-25-getting-started-with-docker-3 %})
# An Example Dockerfile
Let's get straight to business:
Here's what an example Dockerfile for a simple Python project might look like.
(The comments are added to make it easier to reference later in this post.)
# (1)
FROM python:3.8.5
# (2)
WORKDIR /opt/my-project
# (3)
COPY . /opt/my-project
# (4)
RUN pip install -r requirements.txt
# (5)
ENTRYPOINT [ "python3", "" ]
# Building Our Example Project
First let's figure out how to turn this Dockerfile into a container that we can run.
The first step is to get the code --
you can find it in this repository, so you can clone it and follow along.
With the code in place, the next step towards running it is `docker build`.
To build an image, you need a Dockerfile, a name for the image, and a context.
The Dockerfile is what tells Docker how to build the image,
the name is what Docker will use to reference this image later (e.g. `python` or `hello-world`),
and the context is the set of files from your file system that Docker will have access to when it tries to build the project.
Usually the context is the project directory (which is typically also the directory that the build command is run from).
Likewise, by convention, a Dockerfile is generally called `Dockerfile` (with no extension),
and lives in the project's root directory.
If this isn't the case, you can pass the `-f` (or `--file`) flag to `docker build` to specify where it is located.
The name is given with the `-t` flag, also specifying any tags that you want to provide (as always, these default to `:latest`).
The `-t` flag can be provided multiple times, so you can tag one build with multiple tags,
for example if your current build should belong to both the `latest` tag, and a fixed tag for this release version.
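To make the three ingredients (Dockerfile, name, context) a bit more concrete, here is a minimal Python sketch that assembles the `docker build` arguments described above. The function name `docker_build_command` and the `1.0.0` release tag are my own inventions for illustration; only the `-t` and `-f` flags come from Docker itself. The sketch only builds the argument list, it doesn't run anything.

```python
from typing import List, Optional, Sequence


def docker_build_command(
    names: Sequence[str],
    context: str = ".",
    dockerfile: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for a `docker build` call (hypothetical helper)."""
    command = ["docker", "build"]
    for name in names:
        # Each -t adds one name (with an optional :tag) to the same build.
        command += ["-t", name]
    if dockerfile is not None:
        # -f is only needed when the Dockerfile is not at <context>/Dockerfile.
        command += ["-f", dockerfile]
    # The context always comes last.
    command.append(context)
    return command


# One name, default tag, current directory as the context.
print(" ".join(docker_build_command(["my-analyser"])))

# The same build tagged twice: once as latest, once with a made-up release tag.
print(" ".join(docker_build_command(["my-analyser:latest", "my-analyser:1.0.0"])))
```

If you did want to run the result from Python, passing the list to `subprocess.run` would be one way to do it, assuming Docker is installed on that machine.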
Having cloned the example repository, you can run this build process like this:
$ # builds the file at ./Dockerfile, with the current working directory as the context,
$ # with the name `my-analyser`.
$ docker build -t my-analyser .
Sending build context to Docker daemon 20.48kB
Step 1/5 : FROM python:3.8.5
3.8.5: Pulling from library/python
d6ff36c9ec48: Pull complete
c958d65b3090: Pull complete
edaf0a6b092f: Pull complete
80931cf68816: Pull complete
7dc5581457b1: Pull complete
87013dc371d5: Pull complete
dbb5b2d86fe3: Pull complete
4cb6f1e38c2d: Pull complete
0b3d7b2fc317: Pull complete
Digest: sha256:4c62d8c5ef331e485143c7a664fd6deeea4595ac17008ef5c10cc470d259e39f
Status: Downloaded newer image for python:3.8.5
---> 62aa40094bb1
Step 2/5 : WORKDIR /opt/my-project
Removing intermediate container 3e718c528a63
---> f6845bcf9e20
Step 3/5 : COPY . /opt/my-project
---> 8977a9a29d1c
Step 4/5 : RUN pip install -r requirements.txt
---> Running in 8da06d6427d0
Collecting numpy==1.19.1
Downloading numpy-1.19.1-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Collecting click==7.1.2
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Installing collected packages: numpy, click
Successfully installed click-7.1.2 numpy-1.19.1
Removing intermediate container 8da06d6427d0
---> ba22084bd57e
Step 5/5 : ENTRYPOINT [ "python3", "" ]
---> Running in d1c9dc9bc09f
Removing intermediate container d1c9dc9bc09f
---> d12d76ae371b
Successfully built d12d76ae371b
Successfully tagged my-analyser:latest
There are a few things to notice here.
Firstly, Docker sends the build context (that's the `.` part) to the Docker daemon.
We'll discuss the role of the Docker daemon a bit in the next post, but for now, the daemon is the process that actually does the work here.
After that, we start going through the steps defined in the Dockerfile
(you'll notice the five steps each match up to the five commands).
We'll go through what each command is actually doing in a moment,
although it might be interesting to get an idea of what each line is doing before reading onwards.
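The way the build steps line up with the Dockerfile can be sketched in a few lines of Python. This is a deliberately simplified reader, not Docker's real parser: it assumes one instruction per line and ignores line continuations and other Dockerfile syntax. The function name `list_instructions` is made up, and the Dockerfile text is copied exactly as it appears in this post.

```python
from typing import List, Tuple

DOCKERFILE = """\
# (1)
FROM python:3.8.5
# (2)
WORKDIR /opt/my-project
# (3)
COPY . /opt/my-project
# (4)
RUN pip install -r requirements.txt
# (5)
ENTRYPOINT [ "python3", "" ]
"""


def list_instructions(text: str) -> List[Tuple[str, str]]:
    """Return (instruction, arguments) pairs, skipping blank lines and comments."""
    instructions = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # The first word is the instruction, everything after it is its arguments.
        keyword, _, arguments = stripped.partition(" ")
        instructions.append((keyword.upper(), arguments.strip()))
    return instructions


steps = list_instructions(DOCKERFILE)
for number, (keyword, arguments) in enumerate(steps, start=1):
    print(f"Step {number}/{len(steps)} : {keyword} {arguments}")
```

The comment lines don't count as steps, which is why five instructions give us exactly the five steps in the build output.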
Before we explore the individual commands, however, we should figure out how to actually run the image that we have just built.
The Python script that we're running is a fairly simple one --
it has two commands, one to tell us how many items of data we've got, and another to give us the average values from that data.
We can run it like this:
$ docker run my-analyser
--help Show this message and exit.
$ docker run my-analyser count-datapoints
My Custom Application
datapoint count = 100
$ docker run my-analyser analyse-data
My Custom Application
height = 1.707529904338
weight = 76.956408654431
This is very similar to the `hello-world` container that we ran,
except without any need to download anything (because the container has already been built on our system).
We'll look at transferring the container to other computers in the next post,
but, in principle, this is all we need to do to get a completely self-sufficient container containing all the code
that we need to run our project.
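If you're curious what a script with these two subcommands might look like, here is a rough sketch. It is not the code from the example repository: the build log shows `click` and `numpy` being installed, so the real script presumably uses those, whereas this version sticks to the standard library (`argparse` and `statistics`) so that it runs anywhere. The data values are made up, and so are the function names.

```python
import argparse
import statistics
from typing import Dict, List, Optional, Sequence

# Made-up sample data; the real project ships its own data set.
DATA: List[Dict[str, float]] = [
    {"height": 1.62, "weight": 58.0},
    {"height": 1.75, "weight": 72.5},
    {"height": 1.81, "weight": 80.0},
]


def count_datapoints() -> str:
    """Report how many items of data we have."""
    return f"datapoint count = {len(DATA)}"


def analyse_data() -> str:
    """Report the mean of each measured quantity."""
    lines = []
    for key in ("height", "weight"):
        mean = statistics.mean(row[key] for row in DATA)
        lines.append(f"{key} = {mean:.3f}")
    return "\n".join(lines)


def main(argv: Optional[Sequence[str]] = None) -> str:
    """Pick a subcommand based on the command-line arguments and run it."""
    parser = argparse.ArgumentParser(description="My Custom Application")
    subcommands = parser.add_subparsers(dest="command", required=True)
    subcommands.add_parser("count-datapoints")
    subcommands.add_parser("analyse-data")
    args = parser.parse_args(argv)
    handlers = {"count-datapoints": count_datapoints, "analyse-data": analyse_data}
    return handlers[args.command]()


# Passing the arguments explicitly here stands in for what
# `docker run my-analyser count-datapoints` would append to the entrypoint.
print(main(["count-datapoints"]))
print(main(["analyse-data"]))
```

In a real script you would call `main()` without arguments, so that `argparse` reads them from the command line instead.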
For now, let's go through the Dockerfile step-by-step and clarify what each command given there is doing.
# Step-by-step Through The Dockerfile
The first thing **(1)** a Dockerfile needs is a parent image.
In our case, we're using one of the pre-built Python images.
This is an official image provided by Docker that starts with a basic Debian Linux installation,
and installs Python on top of it.
We can also specify the exact version that we want (here we use 3.8.5).
There are a large number of these pre-built official images available,
for tools such as Julia.
There are also unofficial images that often bring together a variety of scientific computing tools for convenience.
For example, the Jupyter Notebooks team have a wide selection of different images with support for different setups.
Alternatively, most Linux distributions, including Ubuntu and Debian[^1], are available as parent images.
You may need to do more work to get these set up
(for example, you'll need to manually install Python first)
but you also have more flexibility to get things set up exactly how you want.
Once we've got our base image, we want to make this image our own.
Each of the commands in this file adds a new layer on top of the previous one.
The first command we use **(2)** is fairly simple -- it just sets the current working directory.
It's a bit like running `cd` to get to the directory you want to start working in.
Here, we set it to `/opt/my-project`.
It doesn't really matter what we use here,
but I recommend `/opt/<project-name>` as a reasonable default.
The next step **(3)** is to add our own code to the image.
The image that we're building will be (mostly)[^2] isolated from the computer that we run it from,
so if we want our code to be built into this project, we need to explicitly put it there.
The `COPY` command is the way to do that.
It creates a new layer that contains files from our local system (`.`) in the location in the image that we specify (`/opt/my-project`).
At this point, we have a Python project inside a Docker image.
However, our project probably has some third-party dependencies that will also need to be installed.
As I pointed out before, the Docker container that we're aiming for is isolated from the computer that we will run it from,
which also means that any dependencies that we need must also be installed inside the container.
The `RUN` **(4)** command allows us to run arbitrary commands inside the container.
After running the command, Docker then creates a new layer with the changes that were made by the command that we ran.
Here, we run the `pip` command to install all of our dependencies[^3].
We load the dependencies from a file called `requirements.txt` --
if you're not so used to this system, this is a way of defining dependencies in a reproducible way,
so that any future user can look through a project and see exactly what they will need to run it.
It's important to emphasize that Docker doesn't need to replace `requirements.txt`, CMake, or other dependency management tools.
Rather, Docker can work together with these and other tools to help provide additional reproducibility guarantees.
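As a small illustration of what "defining dependencies in a reproducible way" means, here is a sketch that reads exact version pins out of a `requirements.txt`-style text. The two pins are the ones that show up in the build log earlier in this post; I'm assuming the real file looks roughly like this, but it may well differ. The parser (`parse_pins`, a made-up name) only understands simple `name==version` lines, which is a tiny subset of what `pip` actually accepts.

```python
from typing import Dict

# The two pins that appear in the build log earlier in this post.
REQUIREMENTS = """\
numpy==1.19.1
click==7.1.2
"""


def parse_pins(text: str) -> Dict[str, str]:
    """Map package name to exact version for simple `name==version` lines."""
    pins = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, separator, version = line.partition("==")
        if not separator:
            # Anything without an exact pin is not reproducible in this sense.
            raise ValueError(f"not pinned to an exact version: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


print(parse_pins(REQUIREMENTS))
```

Because every version is pinned exactly, rebuilding the image later should install the same versions of these packages that we tested with.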
The final part of our `Dockerfile` is the `ENTRYPOINT` command **(5)**.
Part of the idea of Docker is that each Docker container does one thing, and it does it well.
(You might recognise the UNIX philosophy here.)
As a result, a Docker container should generally contain one application,
and only the dependencies that that application needs to run.
The `ENTRYPOINT` command, along with the `CMD` command, tells Docker which application should run.
The difference between the `ENTRYPOINT` and `CMD` is a bit subtle, but it roughly comes down to how you use the `docker run` command.
When we ran it in the previous post, we generally used the default commands set by the containers --
for `hello-world`, the default command was the executable that printed out the welcome message,
while in `python`, the default command was the Python REPL.
However, it's possible to override this command from the `docker run` command.
For example, we can run the Python container to jump straight into a bash shell, skipping the Python process completely:
$ docker run -it python:3.8.5 bash # note the addition of 'bash' here to specify a different command to run
This ability to replace the default command comes from using `CMD`.
In the Python Dockerfile, there is a line that looks like `CMD python`, which essentially tells Docker
"if nobody has a better plan, just run the Python executable".
On the other hand, the arguments to `ENTRYPOINT` will just be put before whatever this command ends up being.
(It is possible to override this as well, but it's not as common.)
For example, consider the following Dockerfile:
FROM ubuntu:20.04
# using `echo` allows us to "debug" what arguments get
# passed to the ENTRYPOINT command
ENTRYPOINT [ "echo" ]
# this command can be overridden
CMD [ "Hello, World" ]
When we run this container, we get the following options:
$ docker run echotest # should print the default CMD value
Hello, World
$ docker run echotest override arguments # should print the overridden arguments
override arguments
$ docker run -it --entrypoint bash echotest # overrides the entrypoint
As a rule, I would recommend using `ENTRYPOINT` when building a container for a custom application,
and `CMD` when you're building a container that you expect to be a base layer,
or an environment in which you expect people to run a lot of other commands.
In our case, using `ENTRYPOINT` allows us to add subcommands to the `` script that can be run easily from the command line,
as demonstrated in the opening examples.
If we'd used `CMD` instead of `ENTRYPOINT`,
then running `docker run my-analyser count-datapoints` would have just tried to run the `count-datapoints` command in the system,
which doesn't exist, and would have caused an error.
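The interplay between `ENTRYPOINT`, `CMD`, and the arguments to `docker run` can be written down as a small Python function. This is a sketch of the rules as described above, assuming the list form (`[ "echo" ]`) is used for both instructions; the function name `resolve_command` is my own, and it only models which process gets started, not anything else Docker does.

```python
from typing import List, Optional, Sequence


def resolve_command(
    entrypoint: Sequence[str],
    cmd: Sequence[str],
    run_args: Sequence[str] = (),
    entrypoint_override: Optional[str] = None,
) -> List[str]:
    """Return the argument list the container's main process starts with."""
    if entrypoint_override is not None:
        # --entrypoint replaces ENTRYPOINT and also discards the image's CMD.
        entrypoint, cmd = [entrypoint_override], []
    # Arguments after the image name replace CMD; otherwise CMD is the default.
    arguments = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + arguments


echotest = {"entrypoint": ["echo"], "cmd": ["Hello, World"]}

# docker run echotest
print(resolve_command(**echotest))
# docker run echotest override arguments
print(resolve_command(**echotest, run_args=["override", "arguments"]))
# docker run -it --entrypoint bash echotest
print(resolve_command(**echotest, entrypoint_override="bash"))
# docker run -it python:3.8.5 bash (an image that only sets CMD)
print(resolve_command(entrypoint=[], cmd=["python"], run_args=["bash"]))
```

The last call shows why a `CMD`-only image behaves differently: with no entrypoint in front, whatever we pass to `docker run` becomes the whole command.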
# Next: Part 3 -- Practical Applications in Science
In this second of three parts, we've looked at an example project with an example Dockerfile.
We explored how to build and run this Dockerfile,
and we explored some of the most important commands needed to set up the Dockerfile for a project.
In the final part, I want to explore some of the different ways that someone might use Docker as part of research.
For example,
how to distribute Docker containers to other places,
how to run Docker containers on HPC systems,
building Docker via Continuous Integration,
and other places where you might see Docker being used.
View part three [here]({% post_url 2020/08/2020-08-25-getting-started-with-docker-3 %}).
<!-- doing spacing with html is fun... -->
<br />
<div class="alert alert-success">
<h2 id="contact-us"><i class="fas fa-info-circle"></i> Get In Touch</h2>
HIFIS offers free-of-charge workshops and consulting to research groups within the Helmholtz umbrella.
You can read more about what we offer on our
<strong><a href="{% link services/ %}">services page</a></strong>.
If you work for a Helmholtz-affiliated institution, and think that something like this would be useful to you, send us an e-mail at
<strong><a href="mailto:{{site.contact_mail}}">{{site.contact_mail}}</a></strong>,
or fill in our
<strong><a href="{% link services/consulting.html %}#consultation-request-form">consultation request form</a></strong>.
# Footnotes
[^1]: If you look deeper into Docker, you might notice that a distribution called "Alpine Linux" crops up a lot.
This is an alternative distribution that is specifically designed to be as light as possible.
This _can_ save a lot of space in docker images, _but_ it also comes with some additional complexities.
I recommend starting with a Debian-based distribution, particularly for Python-based projects,
and then switching to Alpine Linux later if you find that your docker images are getting too large to handle.
[^2]: "Mostly" is an important caveat here!
To usefully run a Docker container, we need to send some input in and get some sort of output out --
this is mostly handled with command-line arguments and the console output of whatever runs inside Docker.
However, for some applications (less so scientific ones),
we will also want to access a service running inside the container, e.g. a webserver.
Alternatively, we may want to access files inside the container while running it,
or even allow the container to access files from the "parent" computer that's running it.
These things can all be enabled using different arguments to the `docker run` command.
I'll talk a little bit more about some specifics here in the final part of this series,
where I'll also mention tools like Singularity (that you're more likely to run into on HPC systems),
and explain some of the limitations of these tools a bit more clearly.
[^3]: If you have a lot of different Python projects,
you might (rightly!) ask why I haven't used something like `virtualenv` to isolate the Python environment.
The answer is that, in this case, it's not really necessary.
The Docker image that we build will have isolation built-in --
and not only for Python, but for all our other dependencies too.
This diff is collapsed.
title: "Docker For Science (Part 1)"
date: 2020-09-23
- frere
layout: blogpost
title_image: docker-cover.png
- tutorial
- consulting
- docker
excerpt: >
Understanding Docker probably won't solve all of your problems,
but it can be a really useful tool when trying to build reproducible software that will run almost anywhere.
In this series of blog posts, we will explore how to set up and use Docker for scientific applications.
> This post is part of a short blog series on using Docker for scientific applications.
> My aim is to explain the motivation behind Docker,
> show you how it works,
> and offer an insight into the different ways that you might want to use it in different research contexts.
> Quick links:
> - Part 1 (Getting Started with Docker) &#8592; You are here!
> - Part 2 (A Dockerfile Walkthrough) (Coming Soon!)
> {% comment %}[Part 2 (A Dockerfile Walkthrough)]({% post_url 2020/08/2020-08-25-getting-started-with-docker-2 %}){% endcomment %}
> - Part 3 (Using Docker in Practical Situations) (Coming Soon!)
> {% comment %}[Part 3 (Using Docker in Practical Situations)]({% post_url 2020/08/2020-08-25-getting-started-with-docker-3 %}){% endcomment %}
Understanding Docker probably won't solve all of your problems,
but it can be a really useful tool when trying to build reproducible software that will run almost anywhere.
Unfortunately, a lot of existing tutorials are aimed primarily at web developers, backend engineers, or cloud DevOps teams,
which is a pity, because Docker can be useful in much wider contexts.
This series explains what Docker is, how to use it practically, and where it might be useful in the context of scientific research.
# What is Docker?
One of the key challenges in modern research is how to achieve reproducibility.
Interestingly, this is also a major concern in software development.
If I write some code, it should work on my machine
(I mean, I hope it does!)
but how do I guarantee that it will work on anyone else's?
Similarly, when writing code to analyse data, it is important that it produces the correct result,
not just when you run the code multiple times with the same input data,
but _also_ when someone else runs the code on a different computer.
[Image: a comic showing a confusing network of interlinking Python environments. The subtitle reads "My Python environment has become so degraded that my laptop has been declared a superfund site."]
The complexity of Python environments, as explained by XKCD (comic by Randall Munroe, CC BY-NC 2.5).
One of the common ways that software developers have traditionally tried to solve this problem is using virtual machines (or VMs).
The idea is that on your computer, you've probably got different versions of dependencies that will all interact in different messy ways,
not to mention the complexity of packaging in languages like Python and C.
However, if you have a VM, you can standardise things a bit more easily.
You can specify which packages are installed, and what versions, and what operating system everything is running on in the first place.
Everyone in your group can reproduce each other's work, because you're all running it in the same place.
The problem occurs when a reviewer comes along, who probably won't have access to your specific VM.
You either need to give them the exact instructions about how to set up your VM correctly
(and can you remember the precise instructions you used then, and what versions all your dependencies were at?)
_or_ you need to copy the whole operating system (and all of the files in it) out of your VM, into a new VM for the reviewer.
Docker is both of those solutions at the same time.
Docker thinks of a computer as an _image_, which is a bundle of _layers_.
The bottom layer is a computer with almost nothing on it[^1].
The top layer is a computer with an operating system, all your dependencies, and your code, compiled and ready to run.
All the layers between those two points are the individual steps that you need to perform to get your computer in the right state to run your code.
Each step defines the changes between it and the next layer,
with each of these steps being written down in a file called a _Dockerfile_.
Moreover, once all of these layers have been built on one computer, they can be shared with other people,
meaning that you can always share your exact setup with anyone else who needs to run and review the code.
When these layers are bundled together, we call that an image.
Finally, to run the image, Docker transforms it into a container,
and runs that container as if it were running inside a virtual machine[^2].
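One way to picture layers, images, and containers is as a stack of dictionaries, where each layer only records the files that it adds or changes. The sketch below uses Python's `collections.ChainMap` for this. It is a toy model rather than a description of how Docker actually stores anything, and all of the paths and descriptions are invented for illustration.

```python
from collections import ChainMap
from typing import Dict, List

# Each layer records only the files it adds or changes (path -> description).
# The paths and descriptions are invented for illustration.
layers: List[Dict[str, str]] = [
    {"/bin/sh": "shell from the base operating system"},
    {"/usr/local/bin/python3": "Python interpreter"},
    {"/opt/my-project/requirements.txt": "our dependency list"},
]

# ChainMap searches its maps left to right, so the newest layer goes first.
image = ChainMap(*reversed(layers))

# A container adds one writable layer on top; the image layers stay untouched.
container = image.new_child({"/tmp/output.txt": "written while running"})

print(sorted(image))
print(sorted(container))
```

The image sees the files from all three layers at once, while the file written by the running container only exists in the container's own top layer, so the same image can be reused for as many containers as we like.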
[^1]: This is a bit of a simplification.
The canonical base image ("scratch") is a zero-byte empty layer,
_but_, if you were able to explore inside it,
you'd find that there is still enough of an operating system for things like files to exist, and to run certain programs.
This is because Docker images aren't separate virtual machines --
the operating system that you can see is actually the operating system of the computer that's running Docker.
This is a concept called _containerisation_ or _OS-level Virtualisation_, and how it works is very much beyond the scope of this blog post!
[^2]: The differences between layers, images, and containers are not always obvious, and I had to look them up a lot while writing this post.
Most of the time, it's possible to think of layers and images being the same thing, and containers being the way that you run the final layer.
However, this isn't technically accurate, and can cause some confusion when exploring container IDs, image IDs, and layer IDs.
If you want to explore this more, I recommend reading Sofija Simic's post,
followed by Nigel Brown's post.
Please remember that none of the above information is necessary to truly use and understand Docker --
the main reason that I ran into these questions was when trying to get a completely solid understanding of what different IDs referred to while writing this post.
Most of the time, these specifics are completely transparent to the user.
# Setting Up Docker
Setting up Docker will look different between different operating systems.
This is to cover certain cross-platform issues.
Basically, as a general rule, in any given operating system, it's only possible to run containers that _also_ use that same operating system.
(Linux on Linux, Windows on Windows, etc.)[^3]
Obviously this is very impractical, given that most pre-built and base layers available for Docker are built for Linux.
As a result, for Windows and MacOS, Docker provides a tool called Docker Desktop,
which includes a virtual machine to basically paper over the differences between Linux and the host operating system[^4].
It also provides a number of other tools for more advanced Docker usage that we won't go into now.
For Linux, you will need to install "Docker Engine" --
this is essentially just the core part of Docker that runs containers.
The installation instructions for Mac, Windows, and Linux are available at the Get Docker page --
if you want to follow along with the rest of these commands, feel free to complete those installation instructions, and then come back here.
[^3]: As I mentioned in the previous footnote,
containerisation isn't about creating new virtual machines --
it's about running a mostly-sandboxed version of an operating system inside the parent operating system
(this is the _containerisation_ concept).
Because it's still running inside the same operating system as before, you can't switch between Linux and Windows.
[^4]: Note that you can also use Windows Subsystem for Linux (WSL) instead of a "true" virtual machine.
# Running Our First Docker Container
The first step with any new programming language is the "Hello World" program --
what does "Hello World" look like on Docker?
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:7f0a9f93b4aa3022c3a4c147a449bf11e0941a1fd0bf4a8e6c9408b2600777c5
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
-text snipped for convenience-
The first thing we get when we run this docker command is a series of messages about what Docker is doing to run the `hello-world` container.
1. First, Docker tries (and fails) to search the computer that it's running on for an already cached copy of a container called `hello-world:latest`.
The `:latest` part is called the tag, and roughly corresponds to the version of the relevant software that is installed on this container.
When no tag is specified, Docker defaults to "latest", which is usually the most recent build of a container.
2. Because it can't find the image, it "pulls" the image from an external repository -- in this case, Docker Hub.
The `hello-world` container is actually part of the "standard library" of official Docker images, which is where the `library/` part comes from.
Normally, if we were to host our own images on Docker Hub, we'd need to include a user or organisation namespace (e.g. `helmholtz/...`).
3. The line beginning with a seemingly random string of letters and digits means that Docker is downloading a layer.
(The string is an identifier for the layer being downloaded.)
On slower computers or connections, you might see a progress bar appear here while the actual download takes place.
4. The next two lines ("Digest" and "Status") are simply updates to say that everything has been downloaded and that Docker is ready to run the image.
The digest is a unique identifier for this exact image which will never be updated,
which can be useful if you want to be completely certain that you'll never accidentally update something.
5. Finally, a message is printed (this is the "Hello from Docker!" section).
This explains a bit about what has just happened, and confirms that everything was successful.
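The first two points in that list (the default tag and the `library/` prefix) can be sketched as a small function that expands a short image name into the full form that Docker reports. This is a simplification under my own assumptions: it only handles plain Docker Hub names, and ignores registry hostnames and digests entirely. The function name `normalise_reference` and the `helmholtz/example` image are hypothetical.

```python
from typing import Tuple


def normalise_reference(reference: str) -> Tuple[str, str]:
    """Split a short Docker Hub reference into (repository, tag)."""
    name, _, tag = reference.partition(":")
    # No tag given, so Docker falls back to "latest".
    tag = tag or "latest"
    if "/" not in name:
        # Official images live under the library/ namespace.
        name = f"library/{name}"
    return name, tag


for reference in ("hello-world", "python:3.8.5", "helmholtz/example"):
    repository, tag = normalise_reference(reference)
    print(f"{reference} -> {repository}:{tag}")
```

This is why running `docker run hello-world` produces messages about `hello-world:latest` and `library/hello-world`, even though we typed neither of those.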
# Running Our Second Docker Container
The "Hello World" container runs, but it doesn't actually do anything very useful --
let's try running something more interesting and useful.
Part of our original motivation for this exercise was managing the chaos of different ways of installing Python and its dependencies,
so let's see if we can get a container up and running with Python.
The first step is generally to find a Python base image.
Thankfully, as part of the set of officially maintained images, Docker provides some Python images for us to use.
This includes images for different versions of Python.
Whereas last time, we used the default `latest` tag, this time we can try explicitly using the 3.8.5 tag to set the Python version.
However, if we try running this, we'll run into a bit of an issue:
$ docker run python:3.8.5
Unable to find image 'python:3.8.5' locally
3.8.5: Pulling from library/python
d6ff36c9ec48: Pull complete
c958d65b3090: Pull complete
edaf0a6b092f: Pull complete
80931cf68816: Pull complete
7dc5581457b1: Pull complete
87013dc371d5: Pull complete
dbb5b2d86fe3: Pull complete
4cb6f1e38c2d: Pull complete
c2df8846f270: Pull complete
Digest: sha256:bc765f71aaa90648de6cfa356ec201d50549031a244f48f8f477f386517c5d1b
Status: Downloaded newer image for python:3.8.5
If you run this, you'll immediately see that there are a lot more layers that need to be downloaded and extracted --
this makes sense, as Python is a much more complicated piece of software than something that just prints a "Hello World" message!
You'll also see that instead of `latest`, the tag is `3.8.5`, so we can be sure what version we are using.
However, when we ran this image, the docker command immediately exited, and we're back to where we started.
We've downloaded _something_ -- but what does that something actually do?
By default, when Docker runs a container, it just prints the output of that container --
it doesn't send any user input into that container.
However, the default Python command is a REPL -- it requires some sort of input to do something with.
To allow us to send terminal input in and out, we can use the `-it` flags, like this:
$ docker run -it python:3.8.5
Python 3.8.5 (default, Sep 1 2020, 18:44:24)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
That looks better!
Feel free to play around and convince yourself that this is a working, standard Python installation.
Pressing Ctrl+D will exit the terminal and close the container.
It's worth noting that the second time we ran this command, there was no information about pulling layers or downloading images.
This is because Docker caches this sort of information locally.
# Running Our Second Docker Container (Again!)
All Docker containers have a command that runs as the main process in that container.
With the "Hello World" container, that command was a small binary that prints out a welcome message.
With Python, the command was the standard `python` executable.
What if we want to run a different command in the same container?
For example, say we have a Python container, and we're using the Python interpreter.
Is there a way that we can open a shell on that container so that we can run commands like `pip` to install dependencies?
The first thing we need to do is deal with a problem that we're about to run into.
When the main process in a container exits
(the "Hello World" command has printed all it needs to print, or the Python interpreter has been exited)
the whole container is closed.
This is mostly useful
(when the main process exits, we probably don't need the container any more)
but it does mean that we need to think a bit about how we're going to interact with the running container.
Firstly, let's create a new container, but give it a special name (here `my-python-container`).
$ docker run --name my-python-container -it python:3.8.5
Python 3.8.5 (default, Sep 1 2020, 18:44:24)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Now, opening a second terminal (and _not_ closing the Python process in the first terminal),
we can use the `docker exec` command to run a second command inside the same container, as long as we know the name.
In this case, we can use `bash` as the second command, and from there we can `pip install` whatever we want.
$ docker exec -it my-python-container bash
root@f30676215731:/# pip install numpy
Pressing Ctrl+D in this second terminal will close bash and bring us back out of the container.
We could also have directly run `docker exec my-python-container pip install numpy` --
in this case, because we only wanted to run one command inside the container, it would have had the same effect.
However, opening up a bash terminal inside the container is a very useful ability,
because it's then possible to root around inside the container and examine what's going on --
often helpful for debugging!
# Next: Part 2 -- A Dockerfile Walkthrough
In this post, I explained a bit about how Docker works, and how to use Docker to run Python
(and many other tools!)
in an isolated environment on your computer.
All the images that we used in this post were created by others and hosted on Docker Hub.
In the next post, I'm going to explain how to create your own image, containing your own application code,
by going line-by-line through an example Dockerfile.
By creating an image in this way, we can clearly define the instructions needed to set up, install, and run our code,
making our development process much more reproducible.
{% comment %}View part two [here]({% post_url 2020/08/2020-08-25-getting-started-with-docker-2 %}).{% endcomment %}
<!-- doing spacing with html is fun... -->