NHR PerfLab Seminar: Asynchronous MPI communication with OpenMP tasks – spawning task dependency graphs across nodes

Welcome also from my side. As I already announced, I have been working in the OpenMP community and also in the MPI community for quite some time now. I work mostly on tools for MPI and OpenMP: correctness tools and also performance tools. But the work I'm talking about today is actually more towards the application space, and the idea is to get more scalability into applications so that we can actually scale up to exascale, if you want to use this buzzword. So what is it about?

The initial idea came shortly after Joseph Schuchart published a paper at IWOMP '18 where he tried to build an application that uses OpenMP tasks, put it into a distributed-memory setting, and then integrate MPI communication into OpenMP tasking. The issue that he pointed out in this paper was that OpenMP, because of some weak guarantees in the specification, is not really suitable for executing asynchronous communication while guaranteeing progress at all times. So in different configurations that he tried out, he ran into deadlocks, and these deadlocks were caused by weak guarantees in the OpenMP specification.
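To make that failure mode concrete, here is a minimal sketch of the problematic pattern (my illustration, not a slide from the talk): once every OpenMP thread is blocked inside MPI_Recv, the tasks containing the matching sends may never be scheduled, and the OpenMP specification does not guarantee any progress that would rescue this.

```c
#include <mpi.h>

#define NUM_MSGS 64
#define COUNT    1024

/* Sketch of the deadlock pattern: two ranks exchange NUM_MSGS messages
 * through tasks. If all OpenMP threads happen to pick receive tasks
 * first, every thread blocks in MPI_Recv while the tasks that would
 * post the matching sends are never scheduled. */
void exchange(int peer, double sbuf[][COUNT], double rbuf[][COUNT]) {
  #pragma omp parallel
  #pragma omp single
  for (int i = 0; i < NUM_MSGS; ++i) {
    #pragma omp task firstprivate(i)               /* receive task */
    MPI_Recv(rbuf[i], COUNT, MPI_DOUBLE, peer, i, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    #pragma omp task firstprivate(i)               /* send task */
    MPI_Send(sbuf[i], COUNT, MPI_DOUBLE, peer, i, MPI_COMM_WORLD);
  }
}
```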

What I'm talking about today is based on a paper that I submitted to EuroMPI '20, where Schuchart also published a similar proposal to do truly asynchronous communication in MPI and get a notification from MPI when the communication is done. And what I will show today is how you can integrate OpenMP tasks and OpenMP task dependencies with this kind of asynchronous MPI communication.
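As a preview of that integration, here is a hedged sketch combining an OpenMP 5.0 detached task with the detached-communication interface proposed in that paper. The MPIX_Detach signature and the helper names are assumptions based on the proposal; MPIX_Detach is not part of the MPI standard.

```c
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

/* Assumed interface from the EuroMPI '20 proposal (NOT standard MPI):
 * once `req` completes locally, the MPI library invokes `callback(data)`
 * and takes ownership of the request. */
int MPIX_Detach(MPI_Request *req, void (*callback)(void *), void *data);

void consume(double *buf, int count);   /* hypothetical consumer kernel */

/* Completion callback run by MPI: fulfilling the event completes the
 * detached task, which releases its out-dependence on the buffer. */
static void fulfill_cb(void *data) {
  omp_event_handle_t *ev = data;
  omp_fulfill_event(*ev);
  free(ev);
}

void recv_into_graph(double *buf, int count, int src) {
  omp_event_handle_t event;

  /* The task body only *starts* the receive; with detach(event) the task
   * does not complete until fulfill_cb fires, so successor tasks that
   * depend on buf really wait for the message to arrive. */
  #pragma omp task depend(out: buf[0]) detach(event)
  {
    omp_event_handle_t *ev = malloc(sizeof *ev);
    *ev = event;                 /* the event handle is firstprivate here */
    MPI_Request req;
    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);
    MPIX_Detach(&req, fulfill_cb, ev);
  }

  #pragma omp task depend(in: buf[0])  /* runs only after the data is in */
  consume(buf, count);
}
```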

So first, for those who are not so familiar with OpenMP tasks and task dependencies: what are OpenMP tasks and how do OpenMP task dependencies work?

There are mainly three OpenMP constructs that are used to generate tasks and that accept the depend clause to express dependencies between different parts of work. The task and the target construct actually generate tasks. The taskwait construct is a synchronization construct to guarantee that certain tasks have finished. Since OpenMP 5.0, I think, the taskwait construct also accepts the depend clause, which means you can actually wait for the specific tasks that share this dependency. Other tasks are not necessarily finished when you pass such a taskwait with a depend clause.
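In code, a taskwait with a depend clause looks roughly like this (a minimal sketch, OpenMP 5.0 or newer):

```c
#include <stdio.h>

int main(void) {
  int x = 0, y = 0;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task depend(out: x)    /* producer of x */
    x = 42;

    #pragma omp task depend(out: y)    /* independent producer of y */
    y = 23;

    /* Waits only for tasks with a dependency on x; the task writing y
     * may still be running at this point. */
    #pragma omp taskwait depend(in: x)
    printf("x = %d\n", x);
  }
  return 0;
}
```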

The dependences that you can declare there are based on variables: you pass the address of a variable to the depend clause. And then you can have different dependence types: in dependences and out dependences. An out dependence always also implies an in dependence; that's the reason why the specification basically makes no difference between out and inout dependences.

Then something quite new is mutexinoutset, which declares mutual exclusion between tasks that have the same dependence, but also ordering with respect to all other depend clauses. And then there is inoutset, which is a weird thing: it's basically a second kind of in. You can have one set of in dependences, so all these tasks can execute in parallel. And then if you have inoutset, that's basically the same thing: they form another set of tasks that can execute concurrently among themselves. And then you can have another set of in dependences. So it's a weird thing; I never understood the use case for that, but that's a different topic. So it's basically the same as in.
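A minimal sketch of these dependence types (illustrative only; inoutset requires a fairly recent OpenMP 5.x compiler):

```c
#include <stdio.h>

int main(void) {
  double a = 0.0;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task depend(out: a)           /* writer: runs first */
    a = 1.0;

    #pragma omp task depend(in: a)            /* readers: this set may */
    printf("in:  %f\n", a);                   /* run concurrently      */
    #pragma omp task depend(in: a)
    printf("in:  %f\n", a);

    #pragma omp task depend(mutexinoutset: a) /* mutually exclusive with */
    a += 2.0;                                 /* each other, but ordered */
    #pragma omp task depend(mutexinoutset: a) /* against all other tasks */
    a += 3.0;

    #pragma omp task depend(inoutset: a)      /* a second concurrent set, */
    printf("set: %f\n", a);                   /* ordered against the in   */
    #pragma omp task depend(inoutset: a)      /* tasks and the writers    */
    printf("set: %f\n", a);
  }
  return 0;
}
```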

So how would you use tasks to parallelize a code? The example that I'm using here is blocked Cholesky decomposition. The idea there is: we have a huge matrix, we partition the matrix into blocks, and then we want to factorize this matrix. The algorithm iterates over smaller and smaller trailing parts until the complete matrix is factorized. The code for this is basically a larger for loop that iterates over the smaller partitions, so basically the layers that I showed there. And then you have four steps. The first is the red function that you apply to the top-left block in this triangular blocking scheme. Then you apply the next kernel; all of those are BLAS or LAPACK functions, so you throw in your matrix coordinates and pointers, and you get the result into one of the blocks. So the first kernel is the red one. Then we iterate over the column; that's the next for loop. Then we iterate over the diagonal to calculate those blocks. And then we iterate over all the remaining blocks to calculate a matrix multiplication. In a small example that fits onto the slide, it's not so obvious that, in the end, the matrix-multiplication kernels dominate the number of kernels, because if you scale that up, they are…
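A hedged sketch of the tasked version of this loop nest: the depend clauses encode the data flow between the four kernels. The block layout A[i][j] and the kernel wrappers (potrf_block, trsm_block, syrk_block, gemm_block, standing for the usual LAPACK/BLAS calls) are placeholders, not the talk's actual code.

```c
/* Blocked Cholesky with OpenMP task dependencies (sketch).
 * A[i][j] points to block (i,j) of the lower-triangular part; NB is the
 * number of blocks per dimension. */
void cholesky_tasks(int NB, double *A[NB][NB]) {
  #pragma omp parallel
  #pragma omp single
  for (int k = 0; k < NB; ++k) {
    /* 1. Factorize the diagonal block (the "red" kernel). */
    #pragma omp task depend(inout: A[k][k])
    potrf_block(A[k][k]);

    /* 2. Update the blocks of column k below the diagonal. */
    for (int i = k + 1; i < NB; ++i) {
      #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
      trsm_block(A[k][k], A[i][k]);
    }

    /* 3. Update the remaining diagonal blocks. */
    for (int i = k + 1; i < NB; ++i) {
      #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
      syrk_block(A[i][k], A[i][i]);
    }

    /* 4. Matrix-multiplication updates for all remaining off-diagonal
     * blocks; for large NB these gemm tasks dominate the task count. */
    for (int i = k + 1; i < NB; ++i) {
      for (int j = k + 1; j < i; ++j) {
        #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
        gemm_block(A[i][k], A[j][k], A[i][j]);
      }
    }
  }
}
```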

Part of the video series / chapter: NHR@FAU PerfLab Seminar

Access: open access

Duration: 01:04:34 min

Recording date: 2023-04-04

Uploaded on: 2023-04-06 15:16:04

Language: en-US

NHR PerfLab Seminar talk on April 4, 2023
Speaker: Joachim Jenke, RWTH Aachen University, Chair of Computer Science (High Performance Computing)
Abstract:
Block-synchronous execution is a major source of parallel inefficiency. To improve the scalability of parallel codes, it can be crucial to replace block-synchronous execution with more fine-grained synchronization. OpenMP tasks with dependencies make it possible to express asynchronous execution with just the necessary synchronization at the process level. OpenMP 5.0 introduced detached tasks. In combination with MPI detached communication (a.k.a. MPI continuations), detached tasks allow building task dependency graphs across MPI processes. In this presentation you will learn how you can integrate MPI detached communication into your project and profit from truly asynchronous communication. For an example code, we will compare the parallel performance of different levels of synchronization. If you don't want to use OpenMP tasks, the same approach will also work with C++ futures/promises.
 
Short Bio:
Joachim Jenke is a postdoctoral researcher at the IT Center of RWTH Aachen University. He received his doctoral degree from RWTH Aachen University in 2021. His research interests focus on the correctness and performance of HPC applications. As leader of the OpenMP tools subcommittee and member of the MPI tools working group, he is interested in pushing both programming models to new limits. He is the principal developer of the correctness analysis tools MUST and Archer.
 
See https://hpc.fau.de/research/nhr-perflab-seminar-series/ for past and upcoming NHR PerfLab seminar talks.