So today I'm going to share my research on efficient and robust hardware for neural networks with you.
So neural networks were originally proposed to simulate the network of synapses in the human brain.
For example, this figure shows a very simple neural network.
The computation at a neuron includes multiplication operations of the inputs with the weights; the accumulated result is then added to the bias and processed by a
nonlinear activation function.
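Below is a minimal sketch of that neuron computation, purely as my own toy illustration (the specific input, weight, and bias values are made up):

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: activation(sum_i x_i * w_i + b)."""
    z = np.dot(x, w) + b            # multiply-and-accumulate (MAC), then add the bias
    return np.maximum(z, 0.0)       # ReLU as an example nonlinear activation

print(neuron(np.array([1.0, 2.0, 3.0]),
             np.array([0.5, -0.25, 0.1]),
             b=0.2))                # -> 0.5
```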
But with the increasing depth of neural networks, massive numbers of multiplication and addition operations,
or multiply-and-accumulate (MAC) operations, are required to execute neural networks.
For example, GPT-3, the large language model behind ChatGPT, consists of 96 layers with 175 billion weights.
So running this large language model requires an enormous number of MAC operations.
And training GPT-3, of course, consumes a lot of energy, which is comparable to the electricity an average US household consumes in 120 years.
And inference, where pre-trained neural networks are used to make predictions, consumes even more energy.
According to NVIDIA and Amazon, 80% of this energy is used for inference.
So my group is trying to develop methodologies, from efficient AI algorithms to efficient hardware, to reduce this energy consumption.
So in this talk, I'm going to present two efficient AI algorithms: class-aware pruning
and class-exclusion early-exit.
And I will also present some methodologies to reduce the energy consumption of digital accelerators for neural
networks, such as the systolic arrays in TPUs.
The energy consumption in accelerating neural networks comes mainly from the data movement
caused by loading the weights, and also from the huge number of MAC operations.
So to reduce the data movement, I further developed a logic-based neural network design.
After training a neural network, we embed the weights into the circuits and optimize the
circuits, so that the circuits can be used as an application-specific design.
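As a rough software analogy of this idea (my own sketch, not the actual hardware design flow), the trained weights can be thought of as constants baked into the computation, so they never have to be loaded from memory at inference time:

```python
import numpy as np

def specialize_layer(weights, bias):
    """Hypothetical helper: freeze trained weights as constants of the computation."""
    W = np.asarray(weights)          # fixed at "design time", never reloaded
    b = np.asarray(bias)
    def layer(x):
        # In hardware, multiplications by constant weights can be simplified
        # away by logic optimization; here they are just a closure over constants.
        return np.maximum(W @ x + b, 0.0)   # ReLU activation
    return layer

layer = specialize_layer([[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2])
print(layer(np.array([1.0, 3.0])))   # -> [0.   2.55]
```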
And we also target analog in-memory computing platforms like RRAM-based crossbars.
As in-memory computing platforms, they don't need to load the weights from the memory;
they do the computation directly where the data is stored.
But this technology is not mature enough yet.
It suffers from variations.
So if you do not address the variations, the accuracy of the neural networks accelerated
by the crossbars will be degraded significantly.
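A minimal sketch of why such variations hurt (my own illustration, not the speaker's model, with an assumed 10% relative deviation): the crossbar computes a matrix-vector product in the analog domain, but each programmed conductance deviates from its target value, which perturbs the output.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))      # target weights mapped to conductances
x = rng.standard_normal(16)           # input voltages

sigma = 0.10                          # assumed relative device variation
W_var = W * (1.0 + sigma * rng.standard_normal(W.shape))   # perturbed conductances

y_ideal = W @ x                       # what the crossbar should compute
y_actual = W_var @ x                  # what it computes under variation
print("relative output error:",
      np.linalg.norm(y_actual - y_ideal) / np.linalg.norm(y_ideal))
```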
So I will present a topic on this.
So let's first look at the first topic about pruning.
So you might have heard of a lot of pruning techniques.
But our work proposed a new pruning perspective called class-aware pruning.
So the basic concept can be seen from this figure.
The basic idea is that if we use a very simple neural network to do a three-class classification,
we find that different neurons are important for different numbers of classes.
So, for example, this neuron is important for three classes.
Important means that if you remove this neuron, the accuracy on those three classes
will be reduced significantly.
This neuron is important for two classes, for example.
This neuron is important for one class.
And this neuron is important for no class at all.
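A simplified sketch of this scoring idea (my own rendering; the exact criterion used in the class-aware pruning work may differ): count, for each neuron, the number of classes whose accuracy drops noticeably when the neuron is removed, and prune the neurons that are important for the fewest classes first.

```python
import numpy as np

def class_counts(per_class_drop, threshold=0.02):
    """per_class_drop: (num_neurons, num_classes) accuracy drop when a neuron is removed."""
    return (per_class_drop > threshold).sum(axis=1)

def neurons_to_prune(per_class_drop, keep_ratio=0.75, threshold=0.02):
    counts = class_counts(per_class_drop, threshold)
    num_prune = int(len(counts) * (1.0 - keep_ratio))
    return np.argsort(counts, kind="stable")[:num_prune]   # least class-important first

# Toy example with 4 neurons and 3 classes, matching the figure's idea:
drops = np.array([[0.10, 0.08, 0.05],    # important for 3 classes
                  [0.06, 0.04, 0.00],    # important for 2 classes
                  [0.00, 0.09, 0.01],    # important for 1 class
                  [0.00, 0.00, 0.00]])   # important for no class
print(class_counts(drops))               # -> [3 2 1 0]
print(neurons_to_prune(drops))           # -> [3]
```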
Speaker: Prof. Dr. Grace Li Zhang, Technische Universität Darmstadt
Abstract:
The last decade has witnessed significant breakthroughs of deep neural networks (DNNs) in many fields. These breakthroughs have been achieved at extremely high computation and memory costs. Accordingly, the increasing complexity of DNNs has led to a quest for efficient hardware platforms. In this talk, class-aware pruning is first presented to reduce the number of multiply-and-accumulate (MAC) operations in DNNs. Class-exclusion early-exit is then examined to reveal the target class before the last layer is reached. To accelerate DNNs, digital accelerators such as systolic arrays from Google can be used. Such an accelerator is composed of an array of processing elements to efficiently execute MAC operations in parallel. However, such accelerators suffer from high energy consumption. To reduce energy consumption of MAC operations, we select quantized weight values with good power and timing characteristics. To reduce energy consumption incurred by data movement, the logic design of neural networks is presented. Analog In-Memory-Computing platform based on RRAM crossbars will also be discussed. In the end, ongoing research topics and future research plans will be summarized.
For a list of past and upcoming NHR PerfLab seminar events, see: https://hpc.fau.de/research/nhr-perfl...