Welcome everybody to the next part of Deep Learning. Today we want to finish talking about common practices, and in particular we want to have a look at evaluation.
So of course we need to evaluate the performance of the models that we've trained: by now we have set up the training set and estimated all of the hyperparameters.
Now we want to evaluate the generalization performance on previously unseen data. This means the test data, and it's time to open the vault.
Remember: of all things the measure is man [8]. Data is annotated and labeled by humans.
During training, all labels are assumed to be correct. But of course, to err is human, which means that in addition we may have erroneous labels in our data.
The ideal situation that you actually want to have for your data is that it has been annotated by multiple human raters; then you can take the mean or a majority vote.
There's also a very nice paper by Stefan Steidl from 2005 [8] that introduces an entropy-based measure which takes into account the confusions of human reference labelers.
So this is very useful in situations where you have unclear labels, in particular in emotion recognition.
This is a problem where humans also sometimes confuse classes like angry versus annoyed, while they are not very likely to confuse angry versus happy.
That is a very clear distinction. But of course there are different degrees of happiness. Sometimes you're just a little bit happy, and then it becomes really difficult to differentiate happy from neutral.
This is also hard for humans. For prototypical emotions, if you have actors playing them, you get recognition rates way over 90 percent.
But if you have real data, emotions as they occur in daily life, they are much harder to predict.
This can then also be seen in the distribution of the labels. If you have a prototype, all of the raters will agree:
it's clearly this particular class. If you have nuances and not-so-clear emotions, you will see that the raters produce a more or less uniform distribution over the labels, because they also cannot clearly assess the specific sample.
So mistakes by the classifier are obviously less severe if the same classes are also confused by humans, and this is considered in the entropy-based measure.
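To make the idea concrete, here is a minimal Python sketch of just the entropy ingredient (not the exact measure from [8]): it computes the entropy of the per-sample human label distribution, which is zero for prototypical samples where all raters agree and high for ambiguous ones.

```python
import numpy as np

def label_entropy(rater_counts):
    """Entropy (in bits) of the label distribution assigned by human raters.

    rater_counts: counts per class, e.g. [3, 1, 0] means three raters chose
    class 0 and one rater chose class 1.
    """
    p = np.asarray(rater_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # ignore classes nobody chose
    return -np.sum(p * np.log2(p))

# Prototypical sample: all raters agree -> entropy 0, a confusion counts fully.
print(label_entropy([4, 0, 0]))       # 0.0
# Ambiguous sample: raters disagree -> high entropy, a confusion is less severe.
print(label_entropy([2, 2, 0]))       # 1.0
```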
Now, if you look into performance measures, you want to take into account the typical classification measures.
They are typically built around the true positives, false positives, true negatives, and false negatives.
From those, for binary classification problems, you can then compute the true and false positive rates.
So this typically then leads to numbers like the accuracy, that is the number of true positives plus true negatives over the number of positives and negatives.
Then there is the precision or positive predictive value that is computed as the number of true positives over the number of true positives plus false positives.
There's the so-called recall that is defined as the true positives over the true positives plus the false negatives.
Specificity or true negative rate is given as the true negatives over the true negatives plus the false positives. Finally, there's the F1 score,
which mixes these measures: it is two times the precision times the recall, divided by the sum of precision and recall.
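As a quick reference, here is a small sketch that computes these measures from the four raw counts (the helper name is just for illustration):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from the four confusion counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)           # positive predictive value
    recall      = tp / (tp + fn)           # true positive rate / sensitivity
    specificity = tn / (tn + fp)           # true negative rate
    f1          = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```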
I typically recommend receiver operating characteristic (ROC) curves, because all of the measures above depend on a decision threshold.
With ROC curves, you essentially evaluate your classifier for all different thresholds.
This then gives you an analysis of how well it performs in different scenarios.
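Here is a minimal sketch of how an ROC curve is traced out by sweeping the threshold; scikit-learn's roc_curve is used purely for illustration, with made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth and classifier scores on a test set.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

# roc_curve evaluates true/false positive rates for all relevant thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))

# The area under the curve summarizes performance across all thresholds.
print(roc_auc_score(y_true, y_score))
```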
Furthermore, there are performance measures for multiclass classification.
These are adapted versions of the measures above, for example the top-K error: the true class label is not among the K classes with the highest prediction score. Common choices are the top-1 and top-5 error.
ImageNet, for example, usually uses the top five error.
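A small numpy sketch of the top-K error (the helper name and the random toy data are just for illustration):

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores.

    scores: (n_samples, n_classes) prediction scores
    labels: (n_samples,) integer class labels
    """
    # Indices of the k classes with the highest score per sample.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hit = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hit.mean()

scores = np.random.rand(100, 1000)    # e.g. 1000 ImageNet-style classes
labels = np.random.randint(0, 1000, size=100)
print(top_k_error(scores, labels, k=5))
```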
If you really want to understand what's going on in multiclass classification, I recommend looking at confusion matrices.
Confusion matrices are useful for, let's say, 10 to 15 classes.
If you have a thousand classes, then confusion matrices don't make much sense anymore.
Still, you can gain a lot of understanding of what's happening if you look at confusion matrices.
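A minimal sketch of accumulating a confusion matrix, with rows as true classes and columns as predicted classes:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: true class, columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
print(confusion_matrix(y_true, y_pred, n_classes=3))
```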
Now, sometimes you have very little data.
In these cases, you may want to choose cross-validation, for example K-fold cross-validation, where you split your data into K folds.
You then use K minus one folds as training data, test on the remaining fold, and repeat this K times.
This way, all of the data has been seen during evaluation, but you always trained on independent data, because the test fold was held out at training time.
It's rather uncommon in deep learning because it implies very long training times.
And you have to repeat the entire training K times, which is really hard if you train for seven days and then you have a seven fold cross validation.
You know, you can do the math: that is about seven weeks, so it will take really long.
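A minimal sketch of K-fold cross-validation; the trivial threshold model is only a stand-in so the example runs, in practice you would train your network on each training split:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy data standing in for your dataset
y = (X[:, 0] > 0).astype(int)

kf = KFold(n_splits=7, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Here you would train your network on X[train_idx], y[train_idx]
    # and evaluate it on the held-out fold. A trivial threshold "model"
    # is used here so the sketch actually runs.
    threshold = X[train_idx, 0].mean()
    y_pred = (X[test_idx, 0] > threshold).astype(int)
    scores.append((y_pred == y[test_idx]).mean())

print(np.mean(scores), np.std(scores))
```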
But it can typically be used for the hyperparameter estimation.
But if you do so, you have to nest it. Don't perform cross-validation on all of your data,
select the hyperparameters, and then go ahead and work with the same data again in testing.
This will give you overly optimistic results.
You should always make sure that if you select parameters, you hold out the test data where you want to test on.
So there are techniques for nesting cross-validation inside cross-validation.
But then it also becomes computationally very expensive;
the runtime problem gets even worse when you nest the cross-validation.
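A minimal sketch of nested cross-validation with scikit-learn, using a simple logistic regression as a stand-in for a network: the inner loop selects the hyperparameter, the outer loop provides the test estimate.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))                           # toy data
y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)

# Inner loop: hyperparameter selection; outer loop: unbiased test estimate.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
```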
One thing that you have to keep in mind is that the variance of the results is typically underestimated, because the training runs are not independent.
Also pay attention that you may introduce additional bias by incorporating the architecture selection and hyperparameter selection into the evaluation.
Deep Learning - Common Practices Part 4
This video discusses how to evaluate deep learning approaches.
Further Reading:
A gentle Introduction to Deep Learning
References:
[1] M. Aubreville, M. Krappmann, C. Bertram, et al. “A Guided Spatial Transformer Network for Histology Cell Differentiation”. In: ArXiv e-prints (July 2017). arXiv: 1707.08525 [cs.CV].
[2] James Bergstra and Yoshua Bengio. “Random Search for Hyper-parameter Optimization”. In: J. Mach. Learn. Res. 13 (Feb. 2012), pp. 281–305.
[3] Jean Dickinson Gibbons and Subhabrata Chakraborti. “Nonparametric statistical inference”. In: International encyclopedia of statistical science. Springer, 2011, pp. 977–979.
[4] Yoshua Bengio. “Practical recommendations for gradient-based training of deep architectures”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
[6] Boris T Polyak and Anatoli B Juditsky. “Acceleration of stochastic approximation by averaging”. In: SIAM Journal on Control and Optimization 30.4 (1992), pp. 838–855.
[7] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions”. In: CoRR abs/1710.05941 (2017). arXiv: 1710.05941.
[8] Stefan Steidl, Michael Levit, Anton Batliner, et al. “Of All Things the Measure is Man: Automatic Classification of Emotions and Inter-labeler Consistency”. In: Proc. of ICASSP. IEEE - Institute of Electrical and Electronics Engineers, Mar. 2005.