Welcome also from my side. As I already announced, I have been working in the OpenMP community and also
in the MPI community for quite some time now. I'm working mostly on tools for MPI and OpenMP,
correctness tools and also performance tools. But the work I'm talking about today is actually more
towards the application space. The idea is to get more scalability into applications
so that we can actually scale up to exascale, if you want to use this buzzword. So what is it about?
The initial idea came shortly after Joseph Schuchart published a paper at IWOMP 2018 where he tried
to build an application that uses OpenMP tasks, put it into a distributed-memory setting,
and then integrate MPI communication into the OpenMP tasking. The issue that he pointed out in this
paper was that OpenMP, because of some weak guarantees in the specification, is not really suitable
for executing asynchronous communication while guaranteeing progress at all times.
So in the different configurations that he tried out, he ran into deadlocks, and these deadlocks were
caused by weak guarantees in the OpenMP specification.
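To make the failure mode concrete, here is a minimal sketch, not taken from his paper, of the kind of pattern that can go wrong: tasks that call blocking MPI routines. Message counts, tags, and buffer names are made up for illustration.

```c
// Sketch (illustrative, not from the paper): blocking MPI calls inside tasks.
// If all worker threads end up blocked in MPI_Recv and the tasks that would
// post the matching sends are never scheduled, the program deadlocks --
// OpenMP gives no guarantee that the remaining tasks make progress.
// (Assumes MPI was initialized with MPI_THREAD_MULTIPLE.)
#include <mpi.h>

#define NMSG 64

void exchange(int peer) {
    int send[NMSG], recv[NMSG];
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < NMSG; ++i) {
            send[i] = i;
            #pragma omp task firstprivate(i) shared(recv)
            {
                // A blocking receive can occupy a thread indefinitely.
                MPI_Recv(&recv[i], 1, MPI_INT, peer, /*tag=*/i,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            #pragma omp task firstprivate(i) shared(send)
            {
                MPI_Send(&send[i], 1, MPI_INT, peer, /*tag=*/i, MPI_COMM_WORLD);
            }
        }
    }
}
```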
So what I'm talking about today is basically based on a paper that I submitted to EuroMPI 2020, where he
also published a similar proposal to do truly asynchronous communication in MPI and get a notification from
MPI when the communication is done. And what I will show today is how you can integrate
OpenMP tasks and OpenMP task dependencies with this kind of asynchronous MPI communication.
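As a rough preview, and only as my own hedged sketch rather than the mechanism from the paper, the same idea can be emulated with standard OpenMP 5.0 detached tasks plus plain nonblocking MPI; here a polling task stands in for the notification that the proposal would get directly from MPI, and the helper compute_on is a hypothetical placeholder.

```c
// Hedged sketch: tying MPI completion to an OpenMP task dependence with a
// detached task (OpenMP 5.0). A polling task emulates the MPI-side
// notification proposed in the paper. Assumes we are inside a parallel/single
// region and that MPI was initialized with MPI_THREAD_MULTIPLE.
#include <mpi.h>
#include <omp.h>

void compute_on(double *buf, int count);   // hypothetical consumer kernel

void recv_then_compute(double *buf, int count, int src) {
    omp_event_handle_t ev;
    MPI_Request req;

    // Post the nonblocking receive, then create a detached task that holds
    // the 'out' dependence on buf; it completes only once the event is fulfilled.
    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);
    #pragma omp task depend(out: buf[0]) detach(ev)
    { /* empty body: completion is deferred until omp_fulfill_event(ev) */ }

    // Stand-in for the notification from MPI: poll the request and release
    // the dependence as soon as the communication has finished.
    #pragma omp task shared(req) firstprivate(ev)
    {
        int done = 0;
        while (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        omp_fulfill_event(ev);
    }

    // This task can only start after the message has actually arrived.
    #pragma omp task depend(in: buf[0])
    compute_on(buf, count);

    // Keep the local request alive until all tasks above have completed.
    #pragma omp taskwait
}
```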
So first, for those who are not so familiar with OpenMP tasks and task dependencies:
what are OpenMP tasks, and how do OpenMP task dependencies work?
There are mainly three OpenMP constructs that are used to generate tasks and that accept
the depend clause to express dependences between different parts of the work. The task and
the target construct actually generate tasks. The taskwait construct is a synchronization
construct to guarantee that certain tasks have finished. The taskwait
construct, since OpenMP 5.0 I think, also accepts the depend clause, which means you can
actually wait for specific tasks that share this dependence. Other tasks are not necessarily
finished when you pass such a taskwait with a depend clause.
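A minimal sketch of what that looks like in code; the variables and values are made up for illustration.

```c
// Minimal sketch of tasks with depend clauses and a taskwait with a depend
// clause (available since OpenMP 5.0).
#include <stdio.h>

int main(void) {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)   // produces x
        x = 42;

        #pragma omp task depend(out: y)   // produces y, independent of x
        y = 7;

        // Waits only for previously generated tasks that this depend clause
        // conflicts with, i.e. the producer of x; the task writing y may
        // still be running at this point.
        #pragma omp taskwait depend(in: x)
        printf("x = %d\n", x);
    }   // the implicit barrier here guarantees that all tasks have finished
    return 0;
}
```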
The dependence that you declare there is based on variables, specifically the address of the variable you
pass into the depend clause. And then you can have different dependence types: an in dependence and an
out dependence. An out dependence is always also an in dependence. That's the reason why the description
in the specification basically makes no difference between an out and an inout dependence.
Then something quite new is mutexinoutset, which declares
mutual exclusion between tasks that have the same dependence, but also ordering with respect to all other
dependence types. And then there is inoutset, which is a weird thing: it is basically
a second kind of in. You can have one set of in dependences, so all of these tasks can execute
in parallel. And then if you have inoutset, that's basically the same thing: they
are another set of tasks that can execute concurrently with each other. And then you can have another set of
in dependences. So it's a weird thing; I never understood the use case for it, but that's a
different topic. So it's basically the same as in.
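Here is a hedged sketch of how these dependence types look on a single variable; which tasks may run concurrently is noted in the comments. mutexinoutset needs OpenMP 5.0 and inoutset needs OpenMP 5.1.

```c
// Sketch of the dependence types discussed above, all expressed on one
// variable d. The order of task creation matters for the resulting ordering.
#include <stdio.h>

int main(void) {
    int d = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: d)           // writer: runs before the readers
        d = 1;

        #pragma omp task depend(in: d)            // the two 'in' tasks may run
        printf("reader 1 sees %d\n", d);          // concurrently with each other
        #pragma omp task depend(in: d)
        printf("reader 2 sees %d\n", d);

        #pragma omp task depend(mutexinoutset: d) // mutually exclusive with the
        d += 1;                                   // other mutexinoutset task on d,
        #pragma omp task depend(mutexinoutset: d) // but ordered after the readers
        d += 2;

        #pragma omp task depend(inoutset: d)      // inoutset tasks form their own
        d *= 2;                                   // set: concurrent with each other,
        #pragma omp task depend(inoutset: d)      // ordered against the 'in' tasks
        d *= 3;                                   // and the out/inout/mutexinoutset tasks
    }
    return 0;
}
```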
So how would you use tasks to parallelize a code? The example that I'm using here is a blocked
Cholesky decomposition. The idea there is that we have a huge matrix, we partition the matrix into
blocks, and then we want to factorize this matrix. The algorithm iterates over smaller and smaller
parts until you have the complete matrix factorized. The code for this is basically a larger for loop
that iterates over the smaller partitions, so basically the layers that I showed there. Then you have
four steps. The first is the red function that you apply to the top-left block in this
triangular blocking scheme. Then you apply the next kernel; all of those are BLAS functions
or LAPACK functions, so you throw in your matrix coordinates and pointers, and then you get the
result into one of the blocks. So the first kernel is the red one. Then we iterate over the
column; that's the next for loop. Then we iterate over the diagonal to calculate those blocks. And
then we iterate over all the remaining blocks to calculate a matrix multiplication. In a small
example that fits onto the slide, it's not so obvious that, in the end, the matrix multiplication
kernels dominate the number of kernels, because if you scale that up, they clearly do.
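Putting these pieces together, here is a hedged sketch of the tasked loop nest just described, assuming the matrix is partitioned into nblocks x nblocks tiles addressed through one pointer per block; the kernel wrappers potrf_block, trsm_block, syrk_block, and gemm_block are illustrative names standing in for the corresponding LAPACK/BLAS calls (dpotrf, dtrsm, dsyrk, dgemm), and the function is assumed to be called from inside a parallel/single region.

```c
// Hedged sketch of the blocked Cholesky task graph described above.
// The *_block helpers are hypothetical wrappers around dpotrf/dtrsm/dsyrk/dgemm.
void potrf_block(double *Akk);
void trsm_block(const double *Akk, double *Aik);
void syrk_block(const double *Aik, double *Aii);
void gemm_block(const double *Aik, const double *Ajk, double *Aij);

void cholesky_tasks(int nblocks, double *A[nblocks][nblocks]) {
    for (int k = 0; k < nblocks; ++k) {
        // 1) factorize the diagonal block (the "red" kernel on the slide)
        #pragma omp task depend(inout: A[k][k])
        potrf_block(A[k][k]);

        // 2) update the blocks in column k below the diagonal
        for (int i = k + 1; i < nblocks; ++i) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm_block(A[k][k], A[i][k]);
        }

        // 3) update the diagonal blocks of the trailing matrix
        for (int i = k + 1; i < nblocks; ++i) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk_block(A[i][k], A[i][i]);
        }

        // 4) matrix-multiplication updates for the remaining off-diagonal
        //    blocks; at scale, these gemm tasks dominate the task count
        for (int i = k + 1; i < nblocks; ++i) {
            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm_block(A[i][k], A[j][k], A[i][j]);
            }
        }
    }
    #pragma omp taskwait   // wait until the whole factorization is done
}
```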