NHR PerfLab Seminar 2023-09-05: DGEMM on Integer Tensor Cores

Hi, I'm Hiroyuki Ootomo from Tokyo Institute of Technology, and thank you for inviting me to this seminar. I'm very glad to be able to talk about my recent research, DGEMM on Integer Tensor Cores.

So before talking about my research, I'd like to introduce myself briefly. My name is Hiroyuki Ootomo, and I'm a PhD candidate at Tokyo Institute of Technology. The defense is already over, so I received my PhD this September. My research interests lie in mixed-precision computing, randomized numerical linear algebra, quantum circuit simulation, and HPC processors, including GPUs and other accelerators.

Before talking about double-precision emulation, I would like to introduce the motivation of our research.

In recent years, significant advances in deep learning have been achieved, and many processors are now equipped with mixed-precision or low-precision matrix multiplication units. These units take advantage of the fact that deep learning computation can tolerate low precision and relies heavily on matrix multiplication.

On the other hand, traditional HPC applications require higher accuracy, such as FP32, FP64, or even higher.

So there is a gap between the precision that deep learning hardware provides and the precision that HPC applications require.

This leads to the question that motivates our research: can deep learning processors be used for HPC applications? I believe the answer is yes, and I would like to present an example answer to this question in this talk. But it's not easy.

A similar discussion can be found in the reference shown on the left of this slide.

So I'll start introducing my research.

So before talking about double-precision emulation, I'd like to introduce single-precision emulation, that is, single-precision matrix multiplication on Tensor Cores.

The two methods are very similar, but there are important differences, and single-precision emulation is the more straightforward method, so it is easier to understand. So I will introduce single-precision emulation first, and then double-precision.

So first, what is an NVIDIA Tensor Core? An NVIDIA Tensor Core is a mixed-precision matrix multiplication and addition unit on NVIDIA GPUs. The inputs of the Tensor Core are low-precision, such as FP16 or TF32, while the computation inside and the output are FP32.
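
To make this programming model concrete, here is a minimal sketch of driving a Tensor Core through CUDA's WMMA API; the 16x16x16 tile size, kernel name, and matrix layouts are my illustration, not code from the talk.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile by a 16x16 FP16 tile on a Tensor
// Core, accumulating and storing the result in FP32 (requires sm_70+).
__global__ void wmma_fp16_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // FP32 accumulator
    wmma::load_matrix_sync(a_frag, a, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // FP16 in, FP32 out
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```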

TF32 is a floating-point format with an 8-bit exponent, the same as FP32, and a 10-bit mantissa, the same as FP16.
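
As a quick illustration of that bit layout (a sketch of mine, assuming the format just described), an FP32 value can be truncated to a TF32-representable one by masking off the mantissa bits TF32 does not keep; CUDA itself provides nvcuda::wmma::__float_to_tf32 for this conversion.

```cuda
#include <cstring>

// TF32 keeps FP32's sign bit, the full 8-bit exponent, and only the top 10
// of FP32's 23 mantissa bits, so truncation zeroes the 13 lowest bits.
inline float to_tf32_truncate(float x) {
    unsigned int bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // keep sign (1) + exponent (8) + mantissa (10)
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```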

And the throughput of the Tensor Cores is very high. For example, on the NVIDIA A100 and H100 GPUs, the throughput of the TF32-input Tensor Cores is about 7 to 8 times that of the FP32 computing units, and the FP16 Tensor Cores are about 15 to 16 times faster than the FP32 computing units. So we would like to utilize these high-throughput computing units to improve the performance of HPC applications.

Now we want to compute single-precision matrix multiplication on Tensor Cores. But we have a problem: can we compute SGEMM on Tensor Cores? The answer is, unfortunately, no, because, as I mentioned before, the input matrices of the Tensor Core are low-precision, so we need to convert the input FP32 matrices to low precision.
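
To see concretely why that conversion is lossy, here is a small sketch (my illustration, not from the talk): FP32-to-FP16 conversion drops 13 of the 23 mantissa bits, but the dropped residual can be rescaled into a second FP16 value, which hints at the splitting idea behind the emulation. The 2^11 scale factor is an assumption of this sketch.

```cuda
#include <cuda_fp16.h>
#include <cstdio>

// Demonstrates the mantissa bits lost by a naive FP32 -> FP16 conversion
// and recovers them in a second, rescaled FP16 value.
int main() {
    float x = 1.0f + 1.0f / (1 << 20);         // exact in FP32, not in FP16
    half  hi = __float2half(x);                // rounds to 1.0 (10-bit mantissa)
    float residual = x - __half2float(hi);     // what the conversion dropped
    half  lo = __float2half(residual * 2048);  // 2^11 rescale fits FP16
    printf("x = %.8f, hi = %.8f, residual = %g, recovered lo = %g\n",
           x, __half2float(hi), residual, __half2float(lo) / 2048);
    return 0;
}
```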

Part of a chapter: NHR@FAU PerfLab Seminar

Accessible via: Open access
Duration: 00:35:29 min
Recording date: 2023-09-05
Uploaded on: 2023-09-08 19:06:03
Language: en-US

Speaker: Hiroyuki Ootomo, Tokyo Institute of Technology
Title: DGEMM on Integer Tensor Cores
Date and time: Tuesday, September 5, 2 p.m. – 3 p.m.
Abstract:
In order to meet the increasing demand for dense matrix-matrix multiplication from the deep learning community, processors with specialized computing units for matrix multiplication are being developed by numerous vendors, such as NVIDIA Tensor Cores and Google TPUs. This hardware is designed to perform matrix multiplication efficiently at low precision, taking advantage of the fact that deep learning can tolerate low-precision operations and that the computation relies heavily on matrix multiplications. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units. This talk introduces a double-precision-equivalent matrix multiplication using Int8 Tensor Cores and the Ozaki scheme, a high-precision matrix multiplication scheme using a lower-precision computing unit.
Short bio:
Hiroyuki Ootomo is a Ph.D. candidate at Tokyo Institute of Technology, studying under Dr. Rio Yokota. His research interests lie in high-performance computing, especially mixed-precision computing using special hardware, randomized numerical linear algebra, and quantum circuit simulation. His current work is on fast, high-accuracy GEMM on NVIDIA Tensor Cores and its applications.
For a list of past and upcoming NHR PerfLab seminar events, see: https://hpc.fau.de/research/nhr-perflab-seminar-series/