Welcome back to the University. I hope you all had a good time over the turn of the year. I think last time we stopped somewhere around these slides on MPI. I decided to drop the MPI coding example from the lecture; if anyone is interested in that, we can do it in the exercises. I would like to continue now with the actual content of the lecture, simply because I realized that we don't have as much time left in this term as I was expecting. Usually the teaching time of the winter term runs until early or even mid-February; this year it will end at the end of January, so we have one or two lectures fewer than I was expecting. So let's just continue. We will now continue with the review of the fastest systems of the Top 500 list. We are currently in November 2009 and we are looking at Jaguar.
Jaguar was located at the Oak Ridge National Laboratory. It surpassed the previous fastest system, which was Roadrunner. Roadrunner was not really a system loved by programmers, as I said last time. Jaguar, in contrast, was a rather convenient system to program for, because it used just standard Opteron processors: no accelerators, no really complicated architecture. It was more or less a standard Cray system. So it was not one of a kind; it was rather something that everyone else bought at that time, just a slightly bigger version of it. It was also quite a lot faster than Roadrunner. I think Roadrunner had 1.1 petaflops, so this is a step up of about 60 percent, quite a lot. However, it stayed at the top of the Top 500 list for just one year, until Tianhe-1A (I think no one here can correct me, so I can claim that was the correct pronunciation) surpassed it.
I said before that the architecture is more or less standard. I think if we go back, probably even ten years back, and look at all Cray systems, we will always see some sort of 3D torus network. The version Jaguar was using was the SeaStar network, which offered 9.6 gigabytes per second of bandwidth per cardinal direction of the torus. It had a dedicated processor for managing network traffic, and each node was connected to it via a 6.4 gigabyte per second link. If you look at this, you may wonder: why would I connect the node via a slower link when the network itself is faster? Why would I ever need the higher bandwidth there? The reason is, of course, that this is a torus network, so I might have to route traffic from neighboring nodes through this node. Those 9.6 gigabytes per second are therefore not exclusively for the connected node; it might have to share them if I'm not doing just next-neighbor communication. That's the reason for that.
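Since we dropped the MPI coding example from the lecture, here is only a small sketch, not from the slides, of what such a 3D torus looks like from the programmer's side: MPI can build a Cartesian communicator with wrap-around in every direction, which is exactly the torus idea. The grid dimensions are whatever MPI_Dims_create picks for the given number of ranks, and how those ranks end up on the physical SeaStar links is up to the scheduler and the MPI implementation.

/* Sketch (not from the slides): a 3D torus as an MPI Cartesian communicator,
   plus the six nearest neighbors of one rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[3] = {0, 0, 0};            /* let MPI choose a 3D factorization */
    MPI_Dims_create(size, 3, dims);

    int periods[3] = {1, 1, 1};         /* wrap-around in every direction -> torus */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);        /* may differ from the world rank if reordered */

    /* For each cardinal direction, find the ranks one hop "down" and "up".
       Because the dimensions are periodic, there is always a neighbor. */
    for (int d = 0; d < 3; d++) {
        int down, up;
        MPI_Cart_shift(torus, d, 1, &down, &up);
        if (rank == 0)
            printf("direction %d: neighbors of rank 0 are %d and %d\n", d, down, up);
    }

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}

Run on, say, 27 ranks, this reports the two torus neighbors of rank 0 in each of the three directions; with reordering enabled, the implementation is free to place the Cartesian ranks so that neighbors in the communicator are also close on the physical network.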
Each node had two Opteron processors connected to this network link, and each processor had its own RAM. That's just to be expected, because we have a cache-coherent non-uniform memory access, or ccNUMA, architecture, just like every dual-socket multi-core design in today's supercomputers, with the exception, I think, of BlueGene/Q, which has just a single socket per node.
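Because each socket has its own RAM, it matters on such a ccNUMA node which socket first touches a page of memory; that is typically where the operating system places it. A minimal sketch, not from the slides, of the usual first-touch idiom with OpenMP (array size and compile line are just examples):

/* Sketch: first-touch placement on a ccNUMA node.
   Compile e.g. with: gcc -O2 -fopenmp first_touch.c */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 1L << 26;                    /* 64 Mi doubles, about 512 MiB */
    double *a = malloc((size_t)n * sizeof *a);
    if (a == NULL) return 1;

    /* First touch in parallel: each thread touches "its" chunk of the array,
       so those pages land in the RAM attached to that thread's socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 1.0;

    /* Later work with the same static schedule mostly hits socket-local memory
       instead of going across to the other socket. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %.1f\n", sum);
    free(a);
    return 0;
}

The point is simply that the threads that initialize the array are the same ones that later compute on it, so most accesses stay local.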
What is interesting: I think this is the first slide in this lecture where we see the cooling system. The cooling system was actually rather important for Jaguar to achieve a much better power efficiency than Roadrunner. What they did was use a sort of liquid cooling with an evaporator: they let the coolant flow from the bottom to the top of the rack, and at the top of the rack the hot coolant would be cooled back down.
What's similar in this architecture to the BlueGene/Q architectures we've seen before, or to the Roadrunner architecture, is the modular and hierarchical design. Compute blades consisted of multiple nodes: in this case we have four nodes per blade, eight blades per chassis, and three chassis per rack. So that's actually similar to Roadrunner. They were using, I think, standard 19-inch racks. And if we try to visualize this, you can already tell that per rack that's a lot of nodes, right?
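Just to put a number on that, here is a quick back-of-the-envelope check; only the four/eight/three split is from the slides, and the two six-core Opterons per node are my assumption about the 2009 configuration:

/* Back-of-the-envelope: nodes and cores per rack for the hierarchy above. */
#include <stdio.h>

int main(void) {
    int nodes_per_blade    = 4;
    int blades_per_chassis = 8;
    int chassis_per_rack   = 3;

    int nodes_per_rack = nodes_per_blade * blades_per_chassis * chassis_per_rack;
    printf("nodes per rack: %d\n", nodes_per_rack);              /* 4 * 8 * 3 = 96 */

    /* Assumption: two six-core Opterons per node (2009 configuration). */
    int cores_per_node = 2 * 6;
    printf("cores per rack: %d\n", nodes_per_rack * cores_per_node);   /* 1152 */
    return 0;
}

So a single rack already holds 96 nodes, on the order of a thousand cores.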
Of course they didn't stop at one rack; they had a lot of racks. Actually, I think Jaguar, even if it hadn't been upgraded later on, would still be among the ten fastest systems. And indeed they didn't scrap the whole system later on; they rather upgraded it. But we'll see that later on in the lecture.
What's also interesting: if we visualize each node as such a bubble here, we can also see the I/O nodes as yellow bubbles, and the red bubbles are, I think, the interconnects to the external network. So if you want to do I/O on this system efficiently