Okay, cool. So I'm going to start. We split our talk into two: I'll give the first ten minutes, and then Gyula will give the remainder of the talk. The talk is about Stimela and CARAcal, which are two sister projects, and the idea is to get to system-agnostic, portable and configurable data reduction pipelines for radio interferometry.
So I'll just go straight into it. So I'll start with Dimela. So basically when I came
into radio astronomy software back in 2014, I found it was not a very friendly environment
for new users, even for actually more established users as well. That's not the most friendly
environment because the software that was there was often not very stable. It was badly
written, if I may say, and it was just very complicated in order to basically get something
going even less things. And then if you wanted to basically write a pipeline, it was a natural
nightmare. So we needed to find a way to do this. So basically we asked ourselves what
is needed in order to have a good pipeline or a good software package that can be useful
and robust. And basically this is what we have come to understand about basically doing
What we have come to understand about doing this well is the following. You need version control and issue tracking; this can be done in multiple ways, GitHub is my choice, but there are others. You need to document your code very well; ReadTheDocs is quite good, and GitHub and GitLab also give you this. You need to package your code and have a good build system; for any Python project, PyPI is good enough. We also have KERN suite, which is managed by Gijs, who is in Amsterdam right now; he creates Debian packages for most of the software that we use in radio interferometry, and if anyone has something new, they can just send him an email or open an issue on the kernsuite repository on GitHub. This has actually been very, very helpful. And obviously continuous integration: I use Travis, but I am now moving towards GitHub Actions, which I find a bit easier to use and to integrate, since most of our code base is on GitHub.
And one more thing, especially in radio interferometry, is isolation and portability. We use Docker, Podman and Singularity. We tried to use udocker for a while, but we found that it was not mature enough to ensure stability, so we ditched it; maybe over time, when it becomes more stable, we will reintroduce it. What Docker also gives you is portability: since the software is in these containers, you can just send your pipeline script to someone else in another part of the world, and they can run the same pipeline on the same data and should get exactly the same results, because the isolation you get from containers ensures this. I'll show a quick sketch of that idea below, and then let me start with some definitions.
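For instance, driving one containerized task from Python could look something like the following. This is only a sketch of the idea, not Stimela's actual internals; the image name and mount layout are illustrative assumptions.

```python
# Because the software lives in the container image, this snippet behaves
# the same on any machine with Docker: same image + same data = same result.
import subprocess

def run_in_container(image, command, data_dir):
    """Run `command` inside `image`, sharing only `data_dir` with the host."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{data_dir}:/data",  # the data volume is the only shared state
         image] + command,
        check=True)

# Illustrative call: the image name and paths are assumptions for the example.
run_in_container("stimela/wsclean", ["wsclean", "-version"], "/scratch/ms")
```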
So what I call a pipeline, or sometimes a recipe, is a set of tasks which are connected in a simple way to produce a known result. When I say a known result, I mean that you know what type of data to expect given a set of inputs. Furthermore, we stipulate that a pipeline must be versioned, and that whenever it fails, it must fail with a non-zero exit status. These things may seem obvious, but you'd be surprised how many packages don't actually follow this.
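As a minimal illustration of those two stipulations, a pipeline driver only has to record its version and propagate failure as a non-zero exit status. The task command lines and the version string here are invented for the example:

```python
# Illustrative sketch: the pipeline is versioned, and any task failure
# makes the whole pipeline exit with a non-zero status.
import subprocess
import sys

PIPELINE_VERSION = "1.2.0"  # a pipeline must be versioned

TASKS = [
    ["flag_data", "--ms", "obs.ms"],                       # hypothetical tasks
    ["calibrate", "--ms", "obs.ms"],
    ["make_image", "--ms", "obs.ms", "--out", "obs.fits"],
]

def main():
    print(f"pipeline version {PIPELINE_VERSION}")
    for task in TASKS:
        returncode = subprocess.run(task).returncode
        if returncode != 0:
            sys.exit(returncode)  # fail with a non-zero exit status

if __name__ == "__main__":
    main()
```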
Then let me define what a pipeline task is. In the Stimela framework, we call a pipeline task a cab, and a cab is any application that processes or visualizes data. We impose that a cab must have the following attributes: it takes input in a standardized form, and any application or any binary can be wrapped to conform to the cab standard using serialization formats like JSON or YAML. We also impose that a cab must execute using container technology; Stimela supports Docker, Singularity and Podman, as I've said before. However, you can also have Python functions as part of your pipeline tasks.
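To illustrate the wrapping idea, here is a rough sketch, not Stimela's actual implementation; the field names and the helper are made up for this example. Wrapping an arbitrary binary as a cab amounts to translating a standardized parameter dictionary, which could equally be loaded from JSON or YAML, into a command line and executing it inside a container:

```python
# A sketch of the cab idea: any binary can be wrapped by describing its
# parameters in a serializable form (JSON/YAML) and executing it inside a
# container. The field names are illustrative, not the real cab schema.
import subprocess

def run_cab(cab, params, data_dir, extra_args=()):
    """Translate a standardized parameter dict into a command line and run
    it inside the cab's container image (Docker here; Podman or Singularity
    would work the same way)."""
    argv = [cab["binary"]]
    for name, value in params.items():
        values = value if isinstance(value, list) else [value]
        argv += [cab["prefix"] + name] + [str(v) for v in values]
    argv += list(extra_args)
    return subprocess.run(
        ["docker", "run", "--rm", "-v", f"{data_dir}:/data", cab["base"]]
        + argv,
        check=True)
```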
Okay, let me just give you a simple definition of a cab. This is a cab for a tool called WSClean, which takes in radio interferometry visibilities and makes an image out of them. This is the standard definition: you have the name of the task, which is wsclean, and the base, which means the Docker image that has the software installed.
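A cab definition along these lines might look like the following. This is illustrative only: the real Stimela cab files are JSON documents with a richer schema (descriptions, typed parameters, defaults), and this dict mirrors just the fields mentioned above.

```python
# An illustrative cab definition for WSClean, following the run_cab sketch
# from earlier; field names and values are assumptions, not the real schema.
wsclean_cab = {
    "task": "wsclean",          # name of the task
    "base": "stimela/wsclean",  # Docker image with the software installed
    "binary": "wsclean",        # executable to run inside the image
    "prefix": "-",              # parameters map to flags like "-size"
}

# Illustrative parameter values for imaging a measurement set:
run_cab(wsclean_cab,
        {"size": [4096, 4096],      # image dimensions in pixels
         "scale": "2asec",          # angular size of a pixel
         "name": "/data/myimage"},  # output image prefix
        data_dir="/scratch",
        extra_args=["/data/obs.ms"])  # the input visibilities
```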
Access: Open Access
Duration: 00:28:00
Recorded: 2020-07-23
Uploaded: 2020-07-24 00:56:21
Language: en-US
Speakers: Gyula I. G. Jozsa, Sphesihle Makhathini, SKA Observatory
Content: Data production pipeline with Stimela and CARAcal as example for interoperable software.
The Workshop: The Workshop on Open-Source Software Lifecycles (WOSSL) was held in the context of the European Science Cluster of Astronomy & Particle Physics ESFRI infrastructures (ESCAPE), bringing together people, data and services to contribute to the European Open Science Cloud. The workshop was held online from 23rd-28th July 2020, organized at FAU.
Copyright: CC-BY 4.0