5 - Stimela and CARAcal - Towards system-agnostic, portable and configurable data reduction pipelines [ID:20072]

Okay, cool. So I'm going to start. We split our talk into two: I'll give the first ten minutes and then Józsa will give the remainder. The talk is about Stimela and CARAcal, which are two sister projects, and the idea is to get to system-agnostic, portable and configurable data reduction pipelines for radio interferometry.

So I'll go straight into it, starting with Stimela. When I came into radio astronomy software back in 2014, I found it was not a very friendly environment for new users, and not even for more established users. The software that was there was often not very stable, it was badly written, if I may say, and it was very complicated to get even simple things going. And if you wanted to write a pipeline, it was an absolute nightmare. So we needed to find a way to do this better, and we asked ourselves what is needed in order to have a good pipeline, or a good software package, that is useful and robust. This is what we have come to understand about doing this well: you need version control and issue tracking, and this can be done in multiple ways.

GitHub is my choice, but there are other ways of doing it. You need to document your code very well; Read the Docs is quite good for this, and GitHub and GitLab also give you this. You need to package your code and have a good build system; for any Python project, PyPI is good enough. We also have the KERN suite, which is managed by Gijs, who is in Amsterdam right now. What he does is build Debian packages for most of the software that we use in radio interferometry, and if anyone has something new, they can just send him an email or open an issue on the KERN suite repository on GitHub. This has been very, very helpful.

You also obviously need continuous integration: I use Travis, but we are now moving towards GitHub Actions, which I find a bit easier to use and to integrate, since most of our code base is on GitHub. And one more thing, especially in radio interferometry, is isolation and portability. We use Docker, Podman and Singularity. We tried to use udocker for a while, but we found it was not mature enough to ensure stability, so we ditched it; maybe over time, when it becomes more stable, we will reintroduce it. What Docker also gives you is portability: since the software is in these containers, you can just send your pipeline script to someone else in another part of the world, and they can run the same pipeline on the same data and should get exactly the same results, because the isolation you get from containers ensures this. So let me start with some definitions.

What I call a pipeline, or sometimes a recipe, is a set of tasks which are connected in a simple way to produce a known result. By a known result I mean that you know what type of data to expect given a set of inputs. Furthermore, we stipulate that a pipeline must be versioned and that, whenever it fails, it must fail with a non-zero exit status. These things may seem obvious, but you'd be surprised how many packages don't actually follow them.
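To make that contract concrete, here is a minimal sketch in plain Python; the task binaries and file names are hypothetical, and the actual Stimela recipe API appears further down:

    import subprocess
    import sys

    # A pipeline must be versioned.
    PIPELINE_VERSION = "0.1.0"

    def run_task(args):
        """Run one task and propagate failure as a non-zero exit status."""
        result = subprocess.run(args)
        if result.returncode != 0:
            # Fail loudly instead of carrying on with bad intermediate data.
            sys.exit(result.returncode)

    # Hypothetical chain of connected tasks: given these inputs, you know
    # what data products to expect (a calibrated measurement set and an image).
    run_task(["flag_data", "obs.ms"])
    run_task(["calibrate", "obs.ms", "--gaintable", "gains.tbl"])
    run_task(["make_image", "obs.ms", "--output", "image.fits"])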

Then let me define what a pipeline task is. In the Stimela framework, we call a pipeline task a cab, and a cab is any application that processes or visualizes data. We impose that a cab must have the following attributes: it takes input in a standardized form, and any application or binary can be wrapped to conform to the cab standard using serialization formats like JSON or YAML. We also impose that a cab must execute using container technology, and Stimela supports Docker, Singularity and Podman, as I've said before. However, you can also have Python functions as part of your pipeline tasks.
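As a toy illustration of that standardized-input idea, the snippet below describes a hypothetical wrapped binary as plain data and serializes it; the field names are illustrative and should not be read as the exact Stimela cab schema:

    import json

    # Toy sketch (illustrative field names, not the exact Stimela schema):
    # a cab pairs a container image with a typed description of the
    # parameters of the binary it wraps.
    cab = {
        "task": "flagger",              # hypothetical task name
        "base": "example/flagger:1.0",  # container image with the software installed
        "binary": "flag_data",          # the executable being wrapped
        "parameters": [
            {"name": "msname", "dtype": "file", "required": True,
             "info": "Input measurement set"},
            {"name": "threshold", "dtype": "float", "default": 5.0,
             "info": "Flagging threshold in sigma"},
        ],
    }

    # Serialized form, as it would live in a cab definition file:
    print(json.dumps(cab, indent=2))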

Okay, let me give you a simple example of a cab definition. This is a cab for a tool called WSClean, which takes in radio interferometry visibilities and makes an image out of them. This is the standard definition: you have the name of the task, which is wsclean, and here 'base' means the Docker image which has the software installed...
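To show how such a cab would be used, here is a minimal recipe sketch in the style of the Stimela 1.x Python API; the measurement set name and parameter values are made up for illustration:

    import stimela

    # Build a recipe (pipeline); ms_dir points at the measurement sets.
    recipe = stimela.Recipe("Imaging example", ms_dir="msdir")

    # Add the wsclean cab as one step; parameters are passed in the
    # standardized form declared by the cab definition.
    recipe.add("cab/wsclean", "image_target",
               {
                   "msname": "obs.ms",    # input visibilities
                   "size": [4096, 4096],  # image size in pixels
                   "scale": "1.5asec",    # angular size of a pixel
                   "niter": 10000,        # number of CLEAN iterations
               },
               input="input", output="output",
               label="image_target:: Image the visibilities with WSClean")

    # Run all steps; each cab executes in its container, and the recipe
    # fails with a non-zero exit status if any step fails.
    recipe.run()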

Part of a video series
Access: Open Access
Duration: 00:28:00 min
Recording date: 2020-07-23
Uploaded: 2020-07-24 00:56:21
Language: en-US
Speakers: Gyula I. G. Jozsa, Sphesihle Makhathini, SKA Observatory
Content: Data production pipeline with Stimela and CARAcal as an example of interoperable software.
The Workshop: The Workshop on Open-Source Software Lifecycles (WOSSL) was held in the context of the European Science Cluster of Astronomy & Particle Physics ESFRI infrastructures (ESCAPE), bringing together people, data and services to contribute to the European Open Science Cloud. The workshop was held online from 23rd-28th July 2020, organized at FAU.
Copyright: CC-BY 4.0
