Firstly, Docker sends the build context (that's the `.` part) to the Docker daemon.
We'll discuss the role of the Docker daemon a bit in the next post, but for now, the daemon is the process that actually does the work here.
After that, we start going through the steps defined in the Dockerfile
(you'll notice that each of the five steps matches one of the five commands).
We'll go through what each command is actually doing in a moment,
although it might be interesting to try working out what each line does before reading onwards.
Before we explore the individual commands, however, we should figure out how to actually run the image we've just built.
The Python script that we're running is a fairly simple one --
it has two commands, one to tell us how many items of data we've got, and another to give us the average values from that data.
We can run it like this:
```console?prompt=$,#
$ docker run my-analyser
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  analyse-data
  count-datapoints
$ docker run my-analyser count-datapoints
My Custom Application
datapoint count = 100
$ docker run my-analyser analyse-data
My Custom Application
height = 1.707529904338
weight = 76.956408654431
```
This is very similar to the `hello-world` container that we ran,
except without any need to download anything (because the container has already been built on our system).
We'll look at transferring the container to other computers in the next post,
but, in principle, this is all we need to do to get a completely self-sufficient container with all the code
that we need to run our project.
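As a side note, the analyser's source code isn't shown in this post, but the usage output above suggests it's built with a library like [Click](https://click.palletsprojects.com/). A minimal sketch that would produce the same interface might look like this (the data here is randomly generated purely for illustration):
```python
# main.py -- an illustrative sketch, not the post's actual script;
# Click is assumed from the Usage output above
import random

import click

# stand-in data; the real project presumably loads this from a file
DATA = [
    {"height": random.gauss(1.7, 0.1), "weight": random.gauss(77.0, 10.0)}
    for _ in range(100)
]

@click.group()
def cli():
    # the group callback runs before every subcommand
    click.echo("My Custom Application")

@cli.command()
def count_datapoints():
    # Click turns the underscore into a hyphen: `count-datapoints`
    click.echo(f"datapoint count = {len(DATA)}")

@cli.command()
def analyse_data():
    for key in ("height", "weight"):
        mean = sum(row[key] for row in DATA) / len(DATA)
        click.echo(f"{key} = {mean}")

if __name__ == "__main__":
    cli()
```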
For now, let's go through the Dockerfile step-by-step and clarify what each command given there is doing.
# Step-by-step Through The Dockerfile
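The numbers **(1)**-**(5)** below refer to the five commands in the Dockerfile. The file itself isn't reproduced in this section, but based on the steps discussed below it looks something like this (treat it as a reconstruction rather than the literal file):
```docker
# (1) the parent image
FROM python:3.8.5
# (2) set the working directory
WORKDIR /opt/my-project
# (3) copy our code into the image
COPY . /opt/my-project
# (4) install our dependencies
RUN pip install -r requirements.txt
# (5) set the application that the container runs
ENTRYPOINT [ "python", "main.py" ]
```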
The first thing **(1)** a Dockerfile needs is a parent image.
In our case, we're using one of the pre-built Python images.
This is an official image provided by Docker that starts with a basic Debian Linux installation,
and installs Python on top of it.
We can also specify the exact version that we want (here we use 3.8.5).
There are a large number of these pre-built official images available,
for tools such as
[Python](https://hub.docker.com/_/python),
[R](https://hub.docker.com/_/r-base),
and [Julia](https://hub.docker.com/_/julia).
There are also unofficial images that often bring together a variety of scientific computing tools for convenience.
For example, the Jupyter Notebooks team have a [wide selection](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html) of different images with support for different setups.
Alternatively, most Linux distributions, including Ubuntu and Debian[^1], are available as parent images.
You may need to do more work to get these set up
(for example, you'll need to manually install Python first)
but you also have more flexibility to get things set up exactly how you want.
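For instance, a sketch of the extra work involved when starting from a plain distribution image (the package names here are the usual Debian/Ubuntu ones):
```docker
FROM ubuntu:20.04
# a plain distribution image doesn't include Python,
# so we have to install it (and pip) ourselves
RUN apt-get update && apt-get install -y python3 python3-pip
```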
Once we've got our base image, we want to make this image our own.
Each of the commands in this file adds a new layer on top of the previous one.
The first command we use **(2)** is fairly simple -- it just sets the current working directory.
It's a bit like running `cd` to get to the directory you want to start working in.
Here, we set it to `/opt/my-project`.
It doesn't really matter what we use here,
but I recommend `/opt/<project-name>` as a reasonable default.
The next step **(3)** is to add our own code to the image.
The image that we're building will be (mostly)[^2] isolated from the computer that we run it from,
so if we want our code to be built into this project, we need to explicitly put it there.
The `COPY` command is the way to do that.
It creates a new layer that contains files from our local system (`.`) in the location in the image that we specify (`/opt/my-project`).
At this point, we have a Python project inside a Docker image.
However, our project probably has some third-party dependencies that will also need to be installed.
As I pointed out before, the Docker container that we're aiming for is isolated from the computer that we will run it from,
which also means that any dependencies that we need must also be installed inside the container.
The `RUN` command **(4)** allows us to run arbitrary commands inside the container.
After running the command, Docker then creates a new layer with the changes that were made by the command that we ran.
Here, we run the `pip` command to install all of our dependencies[^3].
We load the dependencies from a file called `requirements.txt` --
if you're not so used to this system, this is a way of defining dependencies in a reproducible way,
so that any future user can look through a project and see exactly what they will need to run it.
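A `requirements.txt` file is just a list of packages, usually pinned to exact versions; the packages and versions below are purely illustrative:
```
# requirements.txt -- illustrative contents
click==7.1.2
numpy==1.19.1
```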
It's important to emphasize that Docker doesn't need to replace `requirements.txt`, CMake, or other dependency management tools.
Rather, Docker can work together with these and other tools to help provide additional reproducibility guarantees.
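One practical consequence of the layer model is worth noting here: Docker caches unchanged layers between builds, so it's a common pattern to copy `requirements.txt` and install the dependencies *before* copying the rest of the code, so that editing your code doesn't force every dependency to be reinstalled. A sketch of that variation:
```docker
# copy only the dependency list first...
COPY requirements.txt .
RUN pip install -r requirements.txt
# ...then copy the rest of the code; changes to the code now
# only invalidate this layer, not the dependency layer above
COPY . .
```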
The final part of our `Dockerfile` is the `ENTRYPOINT` command **(5)**.
Part of the idea of Docker is that each Docker container does one thing, and it does it well.
(You might recognise the UNIX philosophy here.)
As a result, a Docker container should generally contain one application,
and only the dependencies that that application needs to run.
The `ENTRYPOINT` command, along with the `CMD` command, tells Docker which application should run.
The difference between the `ENTRYPOINT` and `CMD` is a bit subtle, but it roughly comes down to how you use the `docker run` command.
When we ran it in the previous post, we generally used the default commands set by the containers --
for `hello-world`, the default command was the executable that printed out the welcome message,
while in `python`, the default command was the Python REPL.
However, it's possible to override this command from the `docker run` command line.
For example, we can run the Python container to jump straight into a bash shell, skipping the Python process completely:
```console?prompt=$,#
$ docker run -it python:3.8.5 bash # note the addition of 'bash' here to specify a different command to run
root@f30676215731:/#
```
This ability to replace the default command comes from using `CMD`.
In the Python Dockerfile, there is a line that looks like `CMD python`, which essentially tells Docker
"if nobody has a better plan, just run the Python executable".
On the other hand, the arguments to `ENTRYPOINT` are simply prepended to whatever that command ends up being.
(It is possible to override this as well, but it's not as common.)
For example, consider the following Dockerfile:
```docker
FROM ubuntu:20.04
# using `echo` allows us to "debug" what arguments get
# passed to the ENTRYPOINT command
ENTRYPOINT [ "echo" ]
# this command can be overridden
CMD [ "Hello, World" ]
```
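To try this out, we first need to build the image and tag it with the name used below (here, `echotest`):
```console?prompt=$,#
$ docker build -t echotest .
```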
When we run this container, we get the following options:
```console?prompt=$,#
$ docker run echotest # should print the default CMD value
Hello, World
$ docker run echotest override arguments # should print the overridden arguments
override arguments
$ docker run -it --entrypoint bash echotest # overrides the entrypoint
```
As a rule, I would recommend using `ENTRYPOINT` when building a container for a custom application,
and `CMD` when you're building a container that you expect to be a base layer,
or an environment in which you expect people to run a lot of other commands.
In our case, using `ENTRYPOINT` allows us to add subcommands to the `main.py` script that can be run easily from the command line,
as demonstrated in the opening examples.
If we'd used `CMD` instead of `ENTRYPOINT`,
then running `docker run my-analyser count-datapoints` would have just tried to run the `count-datapoints` command in the system,
which doesn't exist, and would have caused an error.
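To make that concrete, here's how the two options expand, assuming the entrypoint from the reconstructed Dockerfile above:
```docker
ENTRYPOINT [ "python", "main.py" ]
# `docker run my-analyser count-datapoints` then executes
#     python main.py count-datapoints
# inside the container; with only `CMD [ "python", "main.py" ]` instead,
# the same `docker run` would replace the whole command and try to
# execute `count-datapoints` on its own
```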
# Next: Part 3 -- Practical Applications in Science
In this second of three parts, we've looked at an example project with an example Dockerfile.
We explored how to build and run this Dockerfile,
and we explored some of the most important commands needed to set up the Dockerfile for a project.
In the final part, I want to explore some of the different ways that someone might use Docker as part of research.
For example,
how to distribute Docker containers to other places,
how to run Docker containers on HPC systems,
how to build Docker images via continuous integration,
and other places where you might see Docker being used.
View part three [here]({% post_url 2020/08/2020-08-25-getting-started-with-docker-3 %}).
<!-- doing spacing with html is fun... -->
<br/>
<div class="alert alert-success">
<h2 id="contact-us"><i class="fas fa-info-circle"></i> Get In Touch</h2>
<p>
HIFIS offers free-of-charge workshops and consulting to research groups within the Helmholtz umbrella.
You can read more about what we offer on our
<strong><a href="{% link services/index.md %}">services page</a></strong>.
If you work for a Helmholtz-affiliated institution, and think that something like this would be useful to you, send us an e-mail.
</p>
</div>
Understanding Docker probably won't solve all of your problems,
but it can be a really useful tool when trying to build reproducible software that will run almost anywhere.
Unfortunately, a lot of existing tutorials are aimed primarily at web developers, backend engineers, or cloud DevOps teams,
which is a pity, because Docker can be useful in much wider contexts.
This series explains what Docker is, how to use it practically, and where it might be useful in the context of scientific research.
# What is Docker?
One of the key challenges in modern research is how to achieve reproducibility.
Interestingly, this is also a major concern in software development.
If I write some code, it should work on my machine
(I mean, I hope it does!)
but how do I guarantee that it will work on anyone else's?
Similarly, when writing code to analyse data, it is important that it produces the correct result,
not just when you run the code multiple times with the same input data,
but _also_ when someone else runs the code on a different computer.
{:.treat-as-figure}
{:.float-left}
["Python Environment"](https://xkcd.com/1987/)
The complexity of Python environments, as explained by XKCD (Comic by Randall Munroe -- [CC BY-NC 2.5](https://xkcd.com/license.html))
One of the common ways that software developers have traditionally tried to solve this problem is using virtual machines (or VMs).
The idea is that on your computer, you've probably got different versions of dependencies that will all interact in different messy ways,
not to mention the complexity of packaging in languages like Python and C.
However, if you have a VM, you can standardise things a bit more easily.
You can specify which packages are installed, and what versions, and what operating system everything is running on in the first place.
Everyone in your group can reproduce each other's work, because you're all running it in the same place.
The problem occurs when a reviewer comes along, who probably won't have access to your specific VM.
You either need to give them exact instructions for setting up your VM correctly
(and can you remember the precise steps you followed, and which versions all your dependencies were at?)
_or_ you need to copy the whole operating system (and all of the files in it) out of your VM, into a new VM for the reviewer.
Docker is both of those solutions at the same time: it records the exact setup instructions, and it packages up the resulting system so that it can be copied.
Docker thinks of a computer as an _image_, which is a bundle of _layers_.
The bottom layer is a computer with almost nothing on it[^1].
The top layer is a computer with an operating system, all your dependencies, and your code, compiled and ready to run.
All the layers between those two points are the individual steps that you need to perform to get your computer in the right state to run your code.
Each step defines the changes between it and the next layer,
with each of these steps being written down in a file called a _Dockerfile_.
Moreover, once all of these layers have been built on one computer, they can be shared with other people,
meaning that you can always share your exact setup with anyone else who needs to run and review the code.
When these layers are bundled together, we call that an image.
Finally, to run the image, Docker transforms it into a container,
and runs that container as if it were running inside a virtual machine[^2].
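As a preview of what's to come later in this series, each instruction in a Dockerfile produces one of those layers. A purely illustrative example (the script and package here are made up):
```docker
# each instruction adds a new layer on top of the previous one
FROM python:3.8.5
# a layer adding our (illustrative) script to the image
COPY analysis.py /opt/
# a layer recording the changes made by this command
RUN pip install numpy
# the default command to run (metadata rather than a filesystem change)
CMD [ "python", "/opt/analysis.py" ]
```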
[^1]:
This is a bit of a simplification.
The canonical base image ("scratch") is a zero-byte empty layer,
_but_, if you were able to explore inside it,
you'd find that there is still enough of an operating system for things like files to exist, and to run certain programs.
This is because Docker images aren't separate virtual machines --
the operating system that you can see is actually the operating system of the computer that's running Docker.
This is a concept called _containerisation_ or _OS-level Virtualisation_, and how it works is very much beyond the scope of this blog post!
[^2]:
The distinction between layers, images, and containers is not always obvious, and I had to look it up a lot while writing this post.
Most of the time, it's possible to think of layers and images being the same thing, and containers being the way that you run the final layer.
However, this isn't technically accurate, and can cause some confusion when exploring container IDs, image IDs, and layer IDs.
If you want to explore this more, I recommend reading Sofija Simic's post [here](https://phoenixnap.com/kb/docker-image-vs-container),
followed by Nigel Brown's post [here](https://windsock.io/explaining-docker-image-ids/).
Please remember that none of the above information is necessary to use and understand Docker effectively --
I mainly ran into these questions while trying to pin down exactly what the different IDs referred to when writing this post.
Most of the time, these specifics are completely transparent to the user.
# Setting Up Docker
Setting up Docker looks a bit different on each operating system, because Docker has to work around a fundamental cross-platform limitation.
As a general rule, any given operating system can only run containers that _also_ use that same operating system.
(Linux on Linux, Windows on Windows, etc.)[^3]
Obviously this is very impractical, given that most pre-built images and base layers available for Docker are built for Linux.
As a result, for Windows and MacOS, Docker provides a tool called Docker Desktop,
which includes a virtual machine to basically paper over the differences between Linux and the host operating system[^4].
It also provides a number of other tools for more advanced Docker usage that we won't go into now.
For Linux, you will need to install "Docker Engine" --
this is essentially just the core part of Docker that runs containers.
The installation instructions for Mac, Windows, and Linux are available at the [Get Docker](https://docs.docker.com/get-docker/) page --
if you want to follow along with the rest of these commands, feel free to complete those installation instructions, and then come back here.
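Once the installation is complete, you can check that everything is wired up by asking Docker for its version (the exact version number and build hash will differ on your machine):
```console?prompt=$,#
$ docker --version
Docker version 19.03.12, build 48a66213fe
```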
[^3]:
Why?
As I mentioned in the previous footnote,
containerisation isn't about creating new virtual machines --
it's about running a mostly-sandboxed version of an operating system inside the parent operating system
(this is the _containerisation_ concept).
Because it's still running inside the same operating system as before, you can't switch between Linux and Windows.
[^4]: Note that you can also use Windows Subsystem for Linux (WSL) instead of a "true" virtual machine.
# Running Our First Docker Container
The first step with any new programming language is the "Hello World" program --
and Docker has its own equivalent, the `hello-world` image, which we can run like this:
```console?prompt=$,#
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:...
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
-text snipped for convenience-
```
The first thing we get when we run this docker command is a series of messages about what Docker is doing to run the `hello-world` container.
1. First, Docker searches the computer that it's running on for an already-cached copy of an image called `hello-world:latest`, and fails to find one.
The `:latest` part is called the tag, and roughly corresponds to the version of the relevant software that is installed on this container.
When no tag is specified, Docker defaults to "latest", which is usually the most recent build of a container.
2. Because it can't find the image, it "pulls" the image from an external repository -- in this case, [Docker Hub](https://hub.docker.com/search?q=&type=image).
The `hello-world` container is actually part of the "standard library" of official Docker images, which is where the `library/` part comes from.
Normally, if we were to host our own images on Docker Hub, we'd need to include a user or organisation namespace (e.g. `helmholtz/...`).
3. The line beginning with a string of random letters and numbers means that Docker is downloading a layer.
(The numbers and digits are an identifier for the file being downloaded.)
On slower computers, you might see a loading bar appear here while the actual download takes place.
4. The next two lines ("Digest" and "Status") are simply updates to say that everything has been downloaded and that Docker is ready to run the image.
The digest is a unique identifier for this exact image that will never be updated to point at anything else,
which can be useful if you want to be completely certain that you'll never accidentally update something.
5. Finally, a message is printed (this is the "Hello from Docker!" section).
This explains a bit about what has just happened, and confirms that everything was successful.
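As a quick aside, we can confirm that the image is now cached locally using `docker images` (the ID, age, and size shown here are illustrative and will differ on your machine):
```console?prompt=$,#
$ docker images hello-world
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
hello-world         latest              bf756fb1ae65        7 months ago        13.3kB
```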
# Running Our Second Docker Container
The "Hello World" operation runs, but it doesn't actually do much useful --
let's try running something more interesting and useful.
Part of our original motivation for this exercise was managing the chaos of different ways of installing Python and its dependencies,
so let's see if we can get a container up and running with Python.
The first step is generally to find a Python base image.
Thankfully, as part of the set of officially maintained images, Docker provides some Python images for us to use.
This includes images for different versions of Python.
Whereas last time we used the default `latest` tag, this time we can explicitly use the `3.8.5` tag to set the Python version.
However, if we try running this, we'll run into a bit of an issue: