17 - Containerisation for scientific reproducibility [ID:20178]

Okay, sorry. Hi. So, I'm Stefano. I will talk a bit about containerization, with a specific focus on scientific reproducibility. Sorry, I didn't include the graphics.

I will just start with a quick refresher on what containers are. This is a definition from CIO.com, so it's pretty dumbed down, but I think it's very effective: containers are a solution for the problem of how to get software to run reliably when moved from one computing environment to another. This is also known as the dependency hell problem, which we all know too well. But just as a recap, the dependency hell problem shows up in two different forms, I would say.

The first one: Mike wants to use a new piece of software. He cannot find a precompiled version. Mike then asks around or googles for help, and he gets some basic instructions like "compile it". At this point, Mike starts downloading the whole development environment, and soon he realizes he needs to upgrade or even downgrade his main operating system. He starts doing that, but during the process something goes wrong, and Mike spends the entire afternoon trying to fix his own primary operating system. And in the end, the software he really wanted to install turns out not to do what he hoped it would. This is one take on the dependency hell problem.

The other take is more serious in my opinion: Mike wants to use this new software. Mike does find a precompiled version. He downloads and installs it. Mike runs the software and gets an answer, which is 43. But after a year, Mike runs the same software again, and with the same data he gets the result 42, which is of course the right answer. At this point Mike takes a very deep dive into the problem and finds out that the software, at the time he first used it, was making a wrong API call to a library, because the library had changed. So he basically worked for a year with software giving wrong results. And this is something that actually happened, and it generated quite a lot of noise around chemical potentials.
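A minimal sketch of how the second scenario could be caught earlier, assuming the library versions in use were written down when the first result was produced: compare those recorded versions with what is installed now. The `PINNED` mapping and its contents are hypothetical; `importlib.metadata` is part of the Python standard library (3.8 and later).

```python
from importlib import metadata


def find_version_drift(pinned):
    """Return {package: (expected, installed)} for every package whose
    installed version differs from the pinned one (None = not installed)."""
    drift = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            drift[name] = (expected, installed)
    return drift


# Hypothetical pins, recorded when the result was first produced.
PINNED = {"numpy": "1.19.0"}

if __name__ == "__main__":
    for name, (expected, installed) in find_version_drift(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

A check like this only reports that the environment has changed; it does not say whether the old or the new result is the right one.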

So the dependency hell problem, like any problem, has a spectrum of solutions. They range from properly stating the requirements of the software, to defining virtual environments where those requirements are set up automatically. We can also use statically linked binaries, which to some extent are a very good solution, except where you cannot use them, as with Python or complex programs, for example. On the other side of the spectrum, we have virtual machines with hardware emulation. In this case you can reach full reproducibility, because you don't even have problems with big-endian versus little-endian or with CPU instruction sets. Then we have a lighter version of virtual machines, the ones we all know, which work with a hypervisor. And then we have this containerization thing in the middle, which is where we focus.
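As a small illustration of why byte order is mentioned as a hardware-level concern, the sketch below packs the same integer in both byte orders and then reads it back with the wrong one. The value is arbitrary; `struct` and `sys.byteorder` are standard-library features.

```python
import struct
import sys

value = 1  # a 32-bit unsigned integer

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print("host byte order:", sys.byteorder)
print("little-endian bytes:", little.hex())  # 01000000
print("big-endian bytes:   ", big.hex())     # 00000001

# Reading bytes with the wrong byte order silently gives a different number.
misread = struct.unpack(">I", little)[0]
print("little-endian bytes read as big-endian:", misread)  # 16777216
```

Software that writes raw binary data without fixing the byte order can therefore give different results on different hardware, which is the kind of difference only full hardware emulation hides.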

Now, a container engine is something slightly different from a virtual machine. You can think of it as a virtual machine to a first approximation, but it's different because there is no hardware abstraction layer between the container engine and the host operating system, so they share the kernel. Everything is like in a virtual machine, except that containers use the host's kernel. They can also share dependencies. Here, for example, the dependencies for those two containers are the same because of the incremental nature of the container file system itself, which I will touch on a bit in the next slide.
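The idea of two containers sharing the same dependencies through an incremental file system can be pictured with a toy model. This is only a conceptual sketch using `collections.ChainMap`, not how a real container file system is implemented, and all paths and contents are made up.

```python
from collections import ChainMap

# Shared, read-only base layer (hypothetical file names and contents).
base_layer = {
    "/usr/lib/libexample.so": "libexample 1.0",
    "/usr/bin/python3": "python 3.8",
}

# Each container adds its own thin writable layer on top of the shared base.
container_a = ChainMap({"/app/run_a.py": "analysis A"}, base_layer)
container_b = ChainMap({"/app/run_b.py": "analysis B"}, base_layer)

# Both containers see the dependency, but it is stored only once.
assert container_a["/usr/lib/libexample.so"] == container_b["/usr/lib/libexample.so"]
assert container_a.maps[-1] is container_b.maps[-1]

# A write inside container A lands in its own top layer, not in the base.
container_a["/usr/lib/libexample.so"] = "libexample 2.0 (patched)"
print(container_a["/usr/lib/libexample.so"])  # libexample 2.0 (patched)
print(container_b["/usr/lib/libexample.so"])  # libexample 1.0
```

Lookups fall through from the top layer to the shared base, so the common dependencies exist once while each container keeps its own changes.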

So this is a very important difference. And actually, there are three different solutions for containerization. One is Singularity, which you might have heard of and which was mentioned in the presentation before. Another one is Docker, which is probably the most common one. They sit at opposite sides of this subspectrum, because Singularity is closer to a process, while Docker is much closer to a virtual machine. And this is an important trait. So let's see the differences between Singularity and Docker, which again are container engines. Singularity is a scientific computing program, let's say.
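To make "container engine" a bit more concrete, here is a hedged sketch of how the same command might be launched through each engine's command line. The image name and the command are assumptions chosen for illustration; the sketch only builds the argument lists and does not run anything, so neither engine needs to be installed.

```python
import shlex


def docker_command(image, command):
    """argv for running `command` in a fresh Docker container from `image`."""
    return ["docker", "run", "--rm", image, *command]


def singularity_command(image, command):
    """argv for running `command` with Singularity, using a Docker image."""
    return ["singularity", "exec", f"docker://{image}", *command]


IMAGE = "python:3.8"  # example image name, an assumption
COMMAND = ["python3", "-c", "print(6 * 7)"]

print(shlex.join(docker_command(IMAGE, COMMAND)))
print(shlex.join(singularity_command(IMAGE, COMMAND)))
```

Either list could be passed to `subprocess.run` on a machine where the corresponding engine is available.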

It was born in the scientific computing and high-performance computing space, while Docker is the IT industry standard, from Amazon to Google and going through DeepMind. In Singularity, running containers are seen as processes, while in Docker running containers

Accessible via

Open access

Duration

00:24:59 min

Recording date

2020-07-27

Uploaded on

2020-07-28 00:46:23

Language

en-US

Speaker

Stefano Alberto Russo, INAF

Content

The role of software containerization and example use of Docker containers.

The Workshop

The Workshop on Open-Source Software Lifecycles (WOSSL) was held in the context of the European Science Cluster of Astronomy & Particle Physics ESFRI infrastructures (ESCAPE), bringing together people, data and services to contribute to the European Open Science Cloud. The workshop was held online from 23rd-28th July 2020, organized at FAU.

Copyright: CC-BY 4.0
