6 - Hangar - Git for your data [ID:20073]

Okay, so everybody should be able to see a start slide here.

Yes. Excellent. Okay, cool.

So, I am here to present Hangar, whose tagline could be: Git for your data.

And what this project is about: about a year ago, myself and my colleagues started asking the question of why the open source software ethos hasn't translated to data sets and data set curation, and what the main things holding back open data set curation really are. There were a couple of different answers we came up with, around, you know, data being proprietary information that is very important to companies.

It's a lot of work to process.

And the pipelines built around it tend to not be all that configurable.

You know, especially for outsiders.

And really, one of the bigger things that we decided to latch onto was this question of how you would even begin to share data.

You know, effectively between teams. As much fun as zipping up a bunch of files in a directory and sending them across an FTP server is, it's not the best tool.

And, you know, given what Git has done for open source software, being a fairly standard tool most software developers are familiar with, we wanted to develop something that could deal with data natively.

As we started to think about this data workflow, we realized that the transformation pipelines feeding machine learning, AI, or really any sort of analysis model for numeric data are very complex.

The data sets are improving and expanding continuously.

They're expensive, and they take a lot of time to build: from annotation and labeling to IO optimization and all the considerations and trade-offs, to the point where you can actually do an exploratory analysis, much less train and deploy. And publishing the results of whatever calculations you did requires a certain amount of quality assurance.

And so we started to look at this and realized (we originally built for AI and machine learning production) that the steps which are most important change depending on which stage you are in of developing some sort of computational algorithm or machine learning model. In the early stages, the model changes often: the network, the graph, the computations. Typically though, in the late stages, it is the data that is augmented and changes often.

So, managing data on disk is really kind of a pain. There are a million different formats, each pretty much domain specific.

The problem is that anybody who wants to work with an organization's data needs to not only have some domain expertise; they also need to be familiar with the actual format the data is stored in and know how to actually access it.

So, each of these commonly used tools is designed to solve a particular problem, but they have limitations, and the setup can be very complex.

So, in a world where highly specialized talent is already scarce, a lot of organizations spend massive amounts of money and time to build a team, collect and annotate data, and build infrastructure around that information. It's hard work, and it's easy to see why people regard data as something proprietary.

Could we merge the code which analyzes some piece of data together with that data in a system like Git? You know, could we reconstruct the exact input data which produced some result?

And frankly, creating this type of system is a chore for most people. I happen to find it fascinating, but I know that's not true of the majority of people here, so building that system is basically what we've tried to do with Hangar.

How would we build a version control system designed for numeric data, starting from the ground up, making no assumptions?

Basically a greenfield here.

This was the situation we found ourselves in about a year ago.

And we had some design goals. Efficiently store n-dimensional arrays as a first-class data type.

So, basically, if you put in a NumPy array, you get out a NumPy array. Python is our language of choice here, mainly because, you know, the PyData ecosystem is phenomenal.
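To make that array-in, array-out workflow concrete, here is a minimal sketch assuming the Hangar Python API roughly as documented around the time of this talk (Repository, checkout, add_ndarray_column); exact names and signatures have shifted between releases, so treat this as an approximation rather than canonical usage.

import numpy as np
from hangar import Repository

# Open (or create) a repository directory on local disk.
repo = Repository(path='/tmp/hangar-demo')
repo.init(user_name='Alice', user_email='alice@example.com')

# A write-enabled checkout operates on the staging area of a branch.
co = repo.checkout(write=True)

# Columns hold n-dimensional arrays; a prototype fixes dtype and shape.
images = co.add_ndarray_column(name='images',
                               prototype=np.zeros((28, 28), dtype=np.uint8))

# Samples go in as NumPy arrays and come back out as NumPy arrays.
images['sample-0'] = np.random.randint(0, 255, (28, 28), dtype=np.uint8)
retrieved = images['sample-0']  # -> numpy.ndarray

co.commit('add first image sample')
co.close()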

We wanted to be able to time travel through history, check out from any point in time, and ensure the integrity of data and history.

Something that Git does really well is it provides, you know, these commit hashes; if you know the top-level commit hash, you can do a check which rehashes all the contents in a repository and validates that the hashes come out exactly the same.

Every single commit, every single file, every change is preserved in history.

And it's a guarantee of the system.

So we wanted data integrity to be, you know, sort of built in using a Merkle DAG, similar to how Git does it.
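The integrity guarantee described here comes from content addressing applied recursively: every piece of data is stored under the hash of its bytes, and a commit hashes the hashes beneath it, so recomputing the top-level hash verifies everything below. The following is only a toy illustration of that idea in Python, not Hangar's actual storage code.

import hashlib

def digest(data: bytes) -> str:
    # Content address: the key of a chunk is the hash of its bytes.
    return hashlib.blake2b(data, digest_size=20).hexdigest()

store = {}

def put(data: bytes) -> str:
    key = digest(data)
    store[key] = data
    return key

# A "commit" hash covers the hashes of everything it references,
# forming a tiny Merkle structure.
chunk_keys = [put(b'array chunk A'), put(b'array chunk B')]
commit_hash = digest('\n'.join(sorted(chunk_keys)).encode())

def verify(commit_hash: str, chunk_keys: list) -> None:
    # Rehash every stored chunk and the commit itself; any silent
    # modification of the data changes the recomputed hashes.
    assert all(digest(store[k]) == k for k in chunk_keys), 'chunk corrupted'
    assert digest('\n'.join(sorted(chunk_keys)).encode()) == commit_hash

verify(commit_hash, chunk_keys)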

Zero-cost branching and merging.

We want this to be very easy to collaborate with people.

Built for distributed collaboration. And something that we've got to deal with, but which isn't really a consideration for Git, is that even small teams of people working with data can generate massive quantities of content.

And, you know, the ability for somebody to work with a repository of data on their local machine is, I find, pretty important.

So we want to be able to just partially clone or fetch small parts of data from a massive data set, all while preserving the ability to to actually branch merge.

And we want to be able to figure out the metadata describing what's in a repository from the actual content.

I'll explain how that works in a little bit.

And it needs to be performant.

We want to saturate requests from, I'm going to say, a reasonably sized compute cluster. You know, a couple hundred nodes, low hundreds of nodes.

Part of a video series:

Accessible via: Open Access

Duration: 00:22:31 min

Recording date: 2020-07-24

Uploaded on: 2020-07-24 01:06:21

Language: en-US

Speaker: Richard Izzo, Tensorwerk

Content: Presentation of the open-source data versioning tool Hangar

The Workshop

The Workshop on Open-Source Software Lifecycles (WOSSL) was held in the context of the European Science Cluster of Astronomy & Particle Physics ESFRI infrastructures (ESCAPE), bringing together people, data and services to contribute to the European Open Science Cloud. The workshop was held online from 23rd-28th July 2020, organized@FAU.

Copyright: CC-BY 4.0
