---
title: "Docker for Science (Part 3)"
date: 2020-08-25
authors:
  - frere
layout: blogpost
title_image: docker-cover.png
categories:
  - tutorial
tags:
  - consulting
  - docker
excerpt: >
---
{:.summary}
Let's start off by doing it manually, and then do it "properly" -- by which I mean automatically.
If you want to follow along, create a new private repository that you can use as a sandbox,
and push the code from the previous post to play around with.
[^1]:
    Notice that all third-party images have two parts --
    a group/maintainer name (e.g. `jupyter`),
    and a specific image name (e.g. `scipy-notebook`).
    This is the main way that you can tell the difference between official and third-party images.

[^2]:
    Unfortunately, the second-most common code hosting option at Helmholtz, BitBucket, doesn't include a container registry.
    You can check with your local administrators if they have a tool like Artifactory or JFrog available.
    Alternatively, part of the evolution of the HIFIS project is to provide code hosting infrastructure across the whole Helmholtz community,
    which will include access to package and container registries,
    so please keep an eye out for more HIFIS news on this blog!
## Saving Images -- The Manual Process
In the top-right corner of this Container Registry page, there is a button that says "CLI Commands".
This will walk us through the main steps of getting the image that we generated earlier into this registry.
for example: `registry.hzdr.de/bauer34/docker-test:my-tag`.
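
In case it helps to see those steps in one place, the whole manual process looks roughly like this -- a sketch, assuming the example project path `bauer34/docker-test` from above and a locally-built image called `my-image` (substitute your own names):

```sh
# Log in to the GitLab container registry (you will be prompted for your GitLab password)
docker login registry.hzdr.de

# Give the locally-built image a tag that points at the project's registry
docker tag my-image registry.hzdr.de/bauer34/docker-test:my-tag

# Upload the tagged image to the registry
docker push registry.hzdr.de/bauer34/docker-test:my-tag
```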
That was the manual version, and there were a few steps, but it wasn't so complicated.
How does the automatic version compare?
## Saving Images -- The Automatic Process
In GitLab, we can use pipelines to make things happen automatically when we update our code.
Often, this will be building our project,
you should already be able to see the build taking place.
The documentation for this template is available [here](https://gitlab.com/hifis/templates/gitlab-ci/-/blob/master/docs/docker.md).
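
To give a feel for what such a pipeline does behind the scenes, the build job essentially boils down to commands along these lines -- a sketch written against GitLab's predefined CI variables, not the template's actual code:

```sh
# GitLab injects temporary credentials for the project's own registry into every CI job
docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"

# Build an image from the Dockerfile in the repository, tagged with the project's registry path
docker build -t "$CI_REGISTRY_IMAGE:latest" .

# Push the freshly-built image into the project's container registry
docker push "$CI_REGISTRY_IMAGE:latest"
```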
## Sharing Images
Having an image in a registry is one thing, but sharing it with other people is another.
Private GitLab projects will also have private registries,
which means that anyone else who wants to access the registry will need to log in to GitLab via Docker
(as we did in the manual saving process)
and have sufficient privileges in the team.
However, there is another route.
GitLab also provides access tokens that can be given to people, allowing them to pull images from the registry,
but not to make other changes.
They don't even need to have a GitLab account!
In a project's settings, under _Settings > Access Tokens_, there is a page where you can create tokens to share with other people.
These tokens are randomly-generated passwords that are linked to a particular project and specify exactly what a person is able to access.
For the purposes of sharing a Docker image, the `read_registry` permission is enough --
this will allow the bearer of the token to access the registry, but not push new images there, or access other project features.
To create an access token, give the token a name to describe what it's being used for,
select the relevant permissions that you want to grant[^3],
and optionally give an expiry date, if you know that the token will only be needed until a certain time.
In response, GitLab will provide a string of letters, digits, and symbols,
which can be copied and sent to the people who need to use it.
To use this token, use the `docker login` command with your normal GitLab username, and the token provided.
For more information, see the documentation [here](https://docs.gitlab.com/ee/user/packages/container_registry/#authenticating-to-the-gitlab-container-registry).
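
As a sketch, using the example registry path from earlier (prefer `--password-stdin` in practice, so that the token doesn't end up in your shell history):

```sh
# Log in with your GitLab username, supplying the access token as the password
docker login registry.hzdr.de -u <your-gitlab-username> -p <access-token>

# Images from the (otherwise private) project can now be pulled as usual
docker pull registry.hzdr.de/bauer34/docker-test:my-tag
```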
[^3]:
    Selecting which permissions to grant is an interesting question of security design that we won't go into too deeply here,
    but a general guideline is "everything needed to do the job required, and not a step more".
    That is, give only the permissions that are actually needed right now, not permissions that might be useful at some point.
    This probably doesn't matter so much in the scientific world, where open research is increasingly important,
    but it's a good principle when dealing with computers in general.

    Consider a badly-written tool (they do exist... 😉) that is designed to clean up images that aren't needed any more.
    One mistake in the filter for deciding which images aren't needed any more,
    and this sort of tool could rampage through all the registries that it is connected to, deleting everything it can see.
    (This sort of thing happens far more often than you would think -- see
    [this bug](https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/issues/123) and
    [this bug](https://github.com/ValveSoftware/steam-for-linux/issues/3671) and
    [this bug](https://itsfoss.com/accidentally-deletes-company-wrong-command/) and
    [this bug](https://www.wired.com/2001/11/glitch-in-itunes-deletes-drives/) --
    all just from one small `rm` command!)
    By limiting the access scope to read-only, we can limit how much these sorts of problems affect us.
    At least until we decide to run this particular (thankfully fictional) clean-up tool ourselves,
    and make the same mistake...
# Docker for HPC (High Performance Computing)
Once you've got a Docker image that you can build and run on your computer, it makes sense to look for more useful places to run this image.
One of the side-effects of the way that Docker works
is that it is generally possible for a Docker image running on a server to gain administrator access on that parent server,
essentially "breaking out" of the container.
This makes the administrator's job much more difficult in terms of locking down each user's processes and isolating them from each other.
As a result, it's generally not a good idea to run Docker in this way.
However, surprisingly, Docker isn't the only way to run Docker images.
There are a number of other tools used to do containerisation,
and there is one tool in particular that is both designed to run on HPC systems,
_and_ can interoperate with Docker, meaning you can usually run your Docker image just like normal.
This tool is known as [_Singularity_](https://sylabs.io/guides/3.6/user-guide/introduction.html).
It is actually a complete containerisation tool in its own right,
with its own format for defining containers,
and its own way of running containers[^4].
More importantly, it knows how to convert other container formats (including Docker) into its own `.sif` format.
In addition, it runs as the current user --
it doesn't require any magical higher privileges like Docker.
(This is a trade-off, but for the purposes of scientific applications, it's usually a reasonable one to make.)
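
As a rough sketch (the image here is just an arbitrary public example), converting and running a Docker image looks like this:

```sh
# Pull a Docker image and convert it into a Singularity .sif file
# (this creates python_3.8.sif in the current directory)
singularity pull docker://python:3.8

# Run a command inside the converted container, as your normal, unprivileged user
singularity exec python_3.8.sif python --version
```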
If you want to install Singularity and try it out yourself,
you will need a Linux machine and a Go compiler, along with a few other dependencies.
You can find the full instructions [here](https://sylabs.io/guides/3.6/admin-guide/installation.html).
Running Singularity on an HPC system will also depend on how exactly that HPC system has been set up,
but it will generally involve requesting a job, and running the `singularity` command as part of that job,
with the desired resources.
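
Exactly what that looks like is very site-specific, but on a SLURM-based cluster (an assumption -- check your centre's documentation) a batch script might be shaped roughly like this, with placeholder image and script names:

```sh
#!/bin/bash
#SBATCH --job-name=container-job
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Many clusters provide Singularity as a module; the module name varies between sites
module load singularity

# Run the analysis inside the container, using the resources granted to this job
singularity exec my-analysis.sif python /app/analysis.py
```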
One of the key questions when using a private registry such as GitLab (see above)
is how to log in to that registry.
Interactively, Singularity provides a `--docker-login` flag when pulling containers.
In addition, it's possible to use SSH keys for authentication in certain circumstances.
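
The `--docker-login` route looks like this (again a sketch, reusing the example registry path from above; for non-interactive jobs, the Singularity documentation also describes supplying credentials via environment variables):

```sh
# Prompts for a username and password/token before pulling from the private registry
singularity pull --docker-login docker://registry.hzdr.de/bauer34/docker-test:my-tag
```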
[^4]:
    It has its own container system?
    And it's more suited to scientific applications?
    Why are these blog posts about Docker then -- why not just go straight to this clearly more convenient tool?
    Two reasons:
    Firstly, Docker is way more popular than Singularity, or indeed any other similar tool.
    This means more documentation,
    more starter images to base our own changes off,
    more people to find and fix bugs in the software,
    and more support in third-party tools like GitLab.
    Secondly, Singularity only runs on Linux,
    and the installation process involves cloning the latest source code,
    installing a compiler for the Go programming language,
    and compiling the whole project ourselves.
    Given that Singularity can run Docker images,
    we can use Docker in the knowledge that we can also get the advantages of Singularity later.
# Docker in The Wild
So far, we've generally assumed that the Docker containers being created are wrapping up whole programs for use on the command line.
However, there are also situations where you might want to send a whole environment to other people,
so that they have access to a variety of useful tools.
If you've used GitLab CI (and some other similar systems), this is how it works.
When GitLab runs a job in a pipeline, it creates a fresh Docker container for that job.
That way, the environment is (mostly) freshly-created for each job, which means that individual jobs are isolated.
It also means that the environment can be anything that the user wants or needs.
By default, this will probably be some sort of simple Linux environment,
such as a recent Ubuntu release.
However, if a CI job needs specific tools, it may well be simpler to find a Docker image that already has those tools installed,
than to go through the process of reinstalling those tools every time the job runs.
For example, a CI job that builds a LaTeX document may find it easiest to use a pre-built installation such as
[`blang/latex`](https://hub.docker.com/r/blang/latex).
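
As an illustration, such an image can be tried out locally before wiring it into a pipeline -- a sketch, where the mount point and tag are assumptions worth checking against the image's documentation:

```sh
# Compile a LaTeX document with the pre-built image, mounting the current directory into the container
docker run --rm -v "$PWD":/data blang/latex:ubuntu pdflatex paper.tex
```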
In fact, in GitLab, it's even possible to use the registries from other projects to access custom images,
and use those custom images in jobs in other projects.
It's even possible to use jobs to create images to use in other jobs, if that's something that you really need!
# Conclusion
Thus, as they say, endeth the lesson.
Over the course of these three blog posts,
we've talked about the purpose of Docker,
and how it can be used to package applications and their dependencies up in a convenient way;
we've got started with Docker, and learned how to run Docker containers on our system;
we've walked through how to create our own Docker containers using a Dockerfile;
and finally, in this post,
we've talked about some of the ways that we can use Docker practically for scientific software development.
Docker can often be a hammer when all you need is a screwdriver --
very forceful, and it'll probably get the job done,
but sometimes a more precise tool is ideal.
The motivating example for this blog series was the complexity of Python project development,
where trying to remember which packages are installed, and which packages are needed by a particular project,
can cause a lot of issues when sharing that project with others.
For this case alone, Docker can be useful, but you may want to consider a package manager such as [Poetry](https://python-poetry.org/),
which can manage dependencies and virtual Python environments in a much simpler way.
However, when different tools, languages, and package management needs come together,
using Docker can often be a good way to make sure that the system really is well-defined,
for example by ensuring that the right system packages are always installed,
along with the right Python packages and the right R or Julia software.
If you feel like your project is a spiralling mess of complexity,
and you're not sure what packages need to exist, or how to build it on any computer other than your own,
then hopefully this approach of building a Docker container step-by-step can help.
However, if you would like more support for your project,
HIFIS offers a consulting service, which is free of charge and available to any Helmholtz-affiliated groups and projects.
Consultants like myself can come and discuss the issues that you are facing,
and explore ways of solving them in the way that is most appropriate to your team.
For more details about this, see the "Get In Touch" box below.
<!-- doing spacing with html is fun... -->
<br />
</div>
# Footnotes