Welcome everybody to the next part of deep learning. Today we want to finish talking
about common practices and in particular we want to have a look at the evaluation.
Of course we need to evaluate the performance of our models that we've trained so far.
Now we've set up the training, set hyperparameters and configured all of this.
Now we want to evaluate the generalization performance on previously unseen data.
This means the test data and it's time to open the vault.
Remember: of all things, the measure is man.
So data is annotated and labeled by humans and during training all labels are assumed
to be correct.
But of course to err is human.
This means that we might have ambiguous data.
The ideal situation that you actually want for your data is that it has been annotated by multiple human raters.
Then you can take the mean or a majority vote.
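As a minimal sketch, assuming each sample has been labeled by several raters and the annotations are available as integer class indices, mean and majority aggregation could look like this (the array values are made up for illustration):

```python
import numpy as np

# Hypothetical annotations: one row per sample, one column per rater,
# entries are integer class indices.
rater_labels = np.array([
    [0, 0, 1],   # two raters say class 0, one says class 1
    [2, 2, 2],   # all raters agree on class 2
    [1, 0, 1],
])

# Majority vote: the most frequent label per sample.
majority = np.array([np.bincount(row).argmax() for row in rater_labels])
print(majority)                   # [0 2 1]

# Mean label: useful for ordinal or continuous annotations (e.g. ratings).
print(rater_labels.mean(axis=1))  # [0.33 2.   0.67] (approximately)
```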
There's also a very nice paper by Stefan Steidl from 2005.
It introduces an entropy-based measure that takes into account the confusion of human reference labelers.
This is very useful in situations where you have unclear labels.
In emotion recognition in particular this is a problem, as humans also sometimes confuse classes like angry versus annoyed, while they are not very likely to confuse angry versus happy, as this is a very clear distinction.
There are different degrees of happiness.
Sometimes you're just a little bit happy.
In these cases it is really difficult to differentiate happy from neutral.
This is also hard for humans.
For prototypes, if you have for example actors playing the emotions, you get emotion recognition rates way over 90%.
If you have real data with real emotions as they occur in daily life, it is much harder to predict them.
This can then also be seen in the labels and the distribution of the labels.
If you have a prototype all of the raters will agree that the observation is clearly
this particular class.
If you have nuances and less clear emotions, you will also see that the raters produce a less peaked or even uniform distribution over the labels, because they too cannot clearly assess the specific sample.
So mistakes by the classifier are obviously less severe if the same class is also confused
by humans.
Exactly this is considered in Steidl's entropy-based measure.
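To give an impression of the underlying idea, the following sketch computes the Shannon entropy of the raters' label distribution per sample; note that this only illustrates the notion of ambiguity and is not the exact measure defined in Steidl's paper. The class counts used here are hypothetical.

```python
import numpy as np

def label_entropy(label_counts):
    """Shannon entropy (in bits) of the raters' label distribution for one sample.
    A peaked distribution (clear prototype) gives low entropy,
    a uniform distribution (ambiguous sample) gives high entropy."""
    p = np.asarray(label_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # ignore classes no rater chose
    return -(p * np.log2(p)).sum()

# Hypothetical rater counts over the classes (angry, annoyed, happy, neutral):
print(label_entropy([5, 0, 0, 0]))   # 0.0   -> clear prototype
print(label_entropy([3, 2, 0, 0]))   # ~0.97 -> angry/annoyed confusion
print(label_entropy([2, 1, 1, 1]))   # ~1.92 -> highly ambiguous
```

A prototype on which all raters agree yields zero entropy, while a sample on which the raters split between angry and annoyed yields a higher value; classifier errors on such high-entropy samples can then be penalized less severely.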
Now if we look into performance measures, we want to consider the typical classification measures.
They are built around the false negatives, the true negatives, the true positives and
the false positives.
From that for binary classification problems you can compute true and false positive rates.
This typically leads to numbers like the accuracy, that is, the number of true positives plus true negatives over the total number of positives and negatives.
Then there is the precision, or positive predictive value, that is computed as the number of true positives over the number of true positives plus false positives.
There is the so-called recall that is defined as the true positives over the true positives plus the false negatives.
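As a small sketch with made-up confusion-matrix counts, these three definitions translate directly into code:

```python
def binary_classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from the confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)  # correct decisions / all samples
    precision = tp / (tp + fp)                   # positive predictive value
    recall    = tp / (tp + fn)                   # true positive rate
    return accuracy, precision, recall

# Hypothetical counts from a binary classifier on a test set:
acc, prec, rec = binary_classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}")
# accuracy=0.85, precision=0.89, recall=0.80
```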
References:
[1] M. Aubreville, M. Krappmann, C. Bertram, et al. “A Guided Spatial Transformer Network for Histology Cell Differentiation”. In: ArXiv e-prints (July 2017). arXiv: 1707.08525 [cs.CV].
[2] James Bergstra and Yoshua Bengio. “Random Search for Hyper-parameter Optimization”. In: J. Mach. Learn. Res. 13 (Feb. 2012), pp. 281–305.
[3] Jean Dickinson Gibbons and Subhabrata Chakraborti. “Nonparametric statistical inference”. In: International encyclopedia of statistical science. Springer, 2011, pp. 977–979.
[4] Yoshua Bengio. “Practical recommendations for gradient-based training of deep architectures”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
[6] Boris T Polyak and Anatoli B Juditsky. “Acceleration of stochastic approximation by averaging”. In: SIAM Journal on Control and Optimization 30.4 (1992), pp. 838–855.
[7] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions”. In: CoRR abs/1710.05941 (2017). arXiv: 1710.05941.
[8] Stefan Steidl, Michael Levit, Anton Batliner, et al. “Of All Things the Measure is Man: Automatic Classification of Emotions and Inter-labeler Consistency”. In: Proc. of ICASSP. IEEE - Institute of Electrical and Electronics Engineers, Mar. 2005.