A few months ago, I blogged about how to reduce AlexNet’s input image resolution without giving up much ImageNet accuracy. I’m not quite sure what I did wrong, but I haven’t been able to reproduce those results. I’m sorry for leading you down the wrong path, and thanks to the 10 people who wrote to me with questions about this!
New numbers
Here are revised numbers that I’ve been able to produce consistently with ImageNet-1k on an NVIDIA K40. All accuracy deltas and speedups are relative to 256×256 AlexNet:
DNN architecture | Input size | Crop size | Top-1 accuracy | Top-5 accuracy | Test-time frame rate |
AlexNet [1] | 256×256 | 227×227 | 57.1% | 80.2% | 624 fps |
AlexNet | 128×128 | 99×99 | 42.7% (-14.4) | 67.3% (-12.8) | 3368 fps (5.4x speedup) |
AlexNet | 128×128 | 111×111 | 46.2% (-10.9) | 70.1% (-10.1) | 2191 fps (3.4x speedup) |
VGG_F [2][3] | 128×128 | 99×99 | 41.2% (-15.9) | 65.7% (-14.5) | 2876 fps (4.6x speedup) |
VGG_F_extralayer [4] | 128×128 | 99×99 | 48.3% (-8.8) | 72.8% (-7.4) | 1600 fps (2.5x speedup) |
VGG_F_extralayer | 128×128 | 111×111 | 50.2% (-6.9) | 75.1% (-5.1) | 1248 fps (2x speedup) |
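The deltas and speedups in the table are simple differences and ratios against the 256×256 AlexNet row. Here is a minimal Python sketch that recomputes them from the rounded table values. The names `BASELINE`, `RESULTS`, and `compare` are made up for illustration. Because the inputs are already rounded, a recomputed value can differ from the table in the last digit.

```python
# Baseline row (256x256 AlexNet, 227x227 crop), copied from the table.
BASELINE = {"top1": 57.1, "top5": 80.2, "fps": 624}

# Reduced-resolution rows, keyed by (architecture, crop size), copied from the table.
RESULTS = {
    ("AlexNet", 99): {"top1": 42.7, "top5": 67.3, "fps": 3368},
    ("AlexNet", 111): {"top1": 46.2, "top5": 70.1, "fps": 2191},
    ("VGG_F", 99): {"top1": 41.2, "top5": 65.7, "fps": 2876},
    ("VGG_F_extralayer", 99): {"top1": 48.3, "top5": 72.8, "fps": 1600},
    ("VGG_F_extralayer", 111): {"top1": 50.2, "top5": 75.1, "fps": 1248},
}


def compare(result, baseline=BASELINE):
    """Return accuracy deltas (percentage points) and speedup versus the baseline."""
    return {
        "top1_delta": round(result["top1"] - baseline["top1"], 1),
        "top5_delta": round(result["top5"] - baseline["top5"], 1),
        "speedup": round(result["fps"] / baseline["fps"], 1),
    }


for (name, crop), result in RESULTS.items():
    c = compare(result)
    print(f"{name} @ {crop}x{crop}: "
          f"top-1 {c['top1_delta']:+.1f}, top-5 {c['top5_delta']:+.1f}, "
          f"{c['speedup']:.1f}x speedup")
```

Keeping the rows in a dictionary like this also makes it easy to add your own runs and compare them against the same baseline.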
As you can see, the drop in accuracy for 128×128 AlexNet is larger than what I listed in my previous blog post. Oops.
After trying a few other DNN architectures, I settled on one that I’m calling VGG_F_extralayer [4]. With VGG_F_extralayer, we claw our way back above 50% top-1 accuracy (with the 111×111 crop) while keeping some of the speed benefit of 128×128 images.
There are a few differences between VGG_F and VGG_F_extralayer:
1. VGG_F_extralayer has an additional 1×1 conv layer with 256 filters after conv4. (Going deeper sometimes improves accuracy.)
2. In its final pooling layer, VGG_F_extralayer uses average pooling instead of max pooling. (I often find that average pooling near the end of a DNN provides a moderate bump in accuracy.)
3. The conv1 layer has 5×5 instead of 11×11 filters. (11×11 would probably give similar accuracy.)
4. The strides for conv1 and pool1 in VGG_F_extralayer differ slightly from those in VGG_F (see the VGG_F_extralayer prototxt file [4] for details).
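Crop size, conv1 filter size, and conv1 stride together decide how large the first feature map is, which is a big part of why smaller crops run faster. Here is a rough Python sketch of the usual output-size arithmetic for an unpadded convolution with floor rounding. The 11×11, stride-4 setting is AlexNet’s conv1 from [1]. The 5×5, stride-2 setting is a hypothetical value chosen only for illustration; the actual VGG_F_extralayer strides are in the prototxt [4] and are not reproduced here.

```python
def conv_output_size(input_size, kernel, stride, pad=0):
    """Spatial size of a convolution output, with floor rounding."""
    return (input_size + 2 * pad - kernel) // stride + 1


# AlexNet conv1 (11x11 filters, stride 4, no padding) at the three crop sizes
# that appear in the table.
for crop in (227, 111, 99):
    side = conv_output_size(crop, kernel=11, stride=4)
    print(f"crop {crop}x{crop} -> conv1 output {side}x{side} "
          f"({side * side} positions per filter)")

# Hypothetical setting, NOT taken from the prototxt: a 5x5 conv1 with stride 2.
# It only shows how a smaller filter and stride change the feature-map size.
side = conv_output_size(99, kernel=5, stride=2)
print(f"crop 99x99, 5x5 stride 2 -> conv1 output {side}x{side}")
```

With AlexNet’s conv1, a 227×227 crop gives a 55×55 feature map, while 111×111 and 99×99 crops give 26×26 and 23×23. Every later layer then has far fewer positions to process, though the measured speedups also depend on the fully connected layers and on GPU utilization.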
What’s next?
There are plenty of open questions, such as “which of these modifications have the biggest impact on accuracy?” I invite you to explore them.
[1] A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[2] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
[3] VGG_F prototxt: https://gist.github.com/ksimonyan/a32c9063ec8e1118221a
[4] VGG_F_extralayer prototxt: http://www.forrestiandola.com/blog/wp-content/uploads/2015/07/vgg_f_extraLayer_trainval.prototxt