Okay, so the main thing here is: how do we sample? What you really want is to balance exploration of the search space, which is essentially the sampling, against exploitation of the information you have gathered. So we go back here: we have explored a little bit of the search space, and at some point we say this is enough information, now we are going to act. When you make that switch is a choice that matters, and we have no good idea of when that actually is. It is something that needs to be explored anew for each particular game, because there is no general rule that says exploitation after three levels is good; we simply don't know that. It is a parameter of your search algorithm, one you might want to test. On a very abstract level you can only say that you have to strike a good balance between the two and optimize for it. How exactly that works is up to you.
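To make that concrete, here is a minimal sketch in Python of the "explore for a fixed budget, then act" idea. The helpers legal_actions, apply_action, and random_playout are hypothetical placeholders for whatever game you are searching, and the simulation budget n_simulations is exactly the kind of parameter you would have to tune per game.

    import random

    def choose_action(root_state, n_simulations=1000):
        # n_simulations is the tuning parameter discussed above:
        # how much exploration we allow before committing to a move.
        stats = {a: [0, 0] for a in legal_actions(root_state)}  # action -> [wins, visits]
        for _ in range(n_simulations):
            action = random.choice(list(stats))                        # explore: sample an action
            reward = random_playout(apply_action(root_state, action))  # 1 = win, 0 = loss
            stats[action][0] += reward
            stats[action][1] += 1
        # exploit: act on the information gathered so far
        return max(stats, key=lambda a: stats[a][0] / max(stats[a][1], 1))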
Of course you can bring in advanced mathematics or statistics to optimize this, and there are established techniques for it. One is Upper Confidence bounds applied to Trees, UCT; that is the name that comes up. What you do there builds on a well-understood area of mathematics: playing one-armed bandits in casinos. It is relatively easy to understand, and if you understand it well you might actually do better than if you don't. For some games you can even do card counting and that kind of thing and actually win. So casino situations are settings where we understand the mathematics extremely well, and if you look at this situation, or even better at this one, you can think of them as random processes about which you can make predictions; these kinds of choices you can think of as a one-armed bandit. You know these machines: you pull the lever, three discs spin, and they come back with a value, and then you have to choose which machine to play, something like that. So you can apply multi-armed bandit theory to these kinds of choices, and that actually gives you a strategy which happens to work relatively well.
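For reference, here is a minimal sketch of the UCB1 selection rule that UCT applies at each tree node. The node objects with wins, visits, and children fields are assumptions made for illustration, not any particular library's API, and the constant c is the exploration/exploitation balance parameter from above.

    import math

    def ucb1_select(node, c=1.4):
        # UCB1 score = win_rate + c * sqrt(ln(parent_visits) / child_visits).
        # The first term rewards exploitation, the second exploration.
        def score(child):
            if child.visits == 0:
                return float("inf")  # always try an unvisited child first
            exploit = child.wins / child.visits
            explore = c * math.sqrt(math.log(node.visits) / child.visits)
            return exploit + explore
        return max(node.children, key=score)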
But you can also do something else, and that is very attractive if you have huge amounts of computing power and not that much brain power or math power: you can take neural networks and let them learn by playing against each other. It is a little bit like the genetic algorithms we were looking at. The AlphaGo people went this route; Google backs them, so they have all the computing power you would ever want. Some of their networks are called policy networks and some are called value networks. The policies are learned from analyzing human games and human rule books and so on, where you say: if you are in this kind of situation, then you put your piece there, unless you are white, in which case it goes there, something like that. So if you can replay those games, you can actually extract information from them. But you can also learn from self-play, because you play against yourself, and sometimes you win against yourself and sometimes you lose against yourself, and that gives you an evaluation. Those evaluations you can, as we say, backpropagate through the neural networks, which is what triggers the learning.
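As a rough illustration of the value-network idea only (not AlphaGo's actual architecture), here is a small PyTorch sketch in which self-play outcomes serve as training targets and are backpropagated through the network. The layer sizes, the flat board encoding of 361 features, and the training loop are all assumptions.

    import torch
    import torch.nn as nn

    class ValueNet(nn.Module):
        # Tiny value network: board features in, a single evaluation out.
        def __init__(self, n_features=361):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 128), nn.ReLU(),
                nn.Linear(128, 1), nn.Tanh(),   # value in [-1, 1]: loss .. win
            )

        def forward(self, x):
            return self.net(x)

    def train_on_selfplay(model, positions, outcomes, epochs=10):
        # positions: board encodings from self-play games,
        # outcomes: +1 / -1 for the eventual result from the mover's view.
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(positions).squeeze(-1), outcomes)
            loss.backward()   # backpropagation is what triggers the learning
            opt.step()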
Now, the neural networks themselves are standard stuff; there is very little choice you actually have there. What you can do is look at the pattern of the networks, but usually you have almost complete networks with a kind of many-to-one stage at the end, and up until recently