Notes

  • Important : See this review to understand ConvNeXt V1
  • You can also refer to this review about Masked Autoencoders (MAE)

Highlights

  • ConvNeXt V1 focused on supervised learning only. However, similarly to transformers, convolution-based models can benefit from self-supervised learning techniques (such as MAE)
  • Simply applying self-supervised learning methods (e.g. MAE) to ConvNeXt leads to suboptimal results
  • This article proposes a fully convolutional masked autoencoder framework (FCMAE) and modifies the ConvNeXt architecture with a Global Response Normalization (GRN) layer

Fully convolutional masked autoencoder

Figure 1 : Fully convolutional masked autoencoder design

  • They use a random masking strategy with a masking ratio of 0.6 (mask generation is sketched after this list)

  • They use ConvNeXt as the encoder

  • Unlike with transformers, masked patches cannot simply be removed from the image, as the 2D structure of the image must be preserved for convolution

  • Naive solutions such as replacing masked patches with mask tokens also don't perform well in practice

  • The idea here is to treat the masked image as sparse data. From that viewpoint, it is natural to use sparse convolutions, which operate only on visible pixels (a sketch follows the notes below)

Note 1 : if the center pixel of the convolution window is masked, the convolution does not operate and simply returns a masked pixel

Note 2 : sparse convolution layers can be converted back to standard convolutions at the fine-tuning stage without requiring additional handling
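
A minimal sketch of how sparse convolution can be emulated with a dense masked convolution in PyTorch (the class name and the depthwise setup are illustrative assumptions; the paper's actual implementation relies on sparse convolution libraries):

```python
import torch
import torch.nn as nn

class MaskedDepthwiseConv(nn.Module):
    """Emulates a sparse convolution with binary masking: masked pixels are
    zeroed before the convolution and the output is re-masked, so masked
    locations neither contribute to nor receive information (cf. Note 1).
    Dropping the masks at fine-tuning recovers a standard convolution
    (cf. Note 2)."""

    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x, mask):
        # x: (N, C, H, W) features; mask: (N, 1, H, W), 1 = visible, 0 = masked
        return self.conv(x * mask) * mask
```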

  • They tested several decoder architectures/depths but in the end chose a simple, plain ConvNeXt decoder

  • The loss function is the mean squared error (MSE) between the reconstructed and target images, computed only on the masked patches (sketched below)
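
A sketch of the 0.6-ratio random masking and the masked-patch MSE, following the MAE convention of averaging the loss per patch; the function names and tensor layouts are assumptions for illustration:

```python
import torch

def random_patch_mask(n, num_patches, ratio=0.6, device="cpu"):
    """Per-sample random mask at patch granularity: 1 = masked, 0 = visible."""
    num_masked = int(ratio * num_patches)
    noise = torch.rand(n, num_patches, device=device)
    ids = noise.argsort(dim=1)                 # random permutation per sample
    mask = torch.zeros(n, num_patches, device=device)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    return mask

def fcmae_loss(pred, target, mask):
    """MSE averaged over masked patches only.
    pred, target: (N, L, patch_dim) patchified images; mask: (N, L)."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (N, L)
    return (per_patch * mask).sum() / mask.sum()
```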

Evaluation of FCMAE

  • They pre-train and fine-tune on ImageNet-1K for 800 and 100 epochs respectively and report top-1 accuracy

  • An ablation study is done to justify the design choices

    Figure 2 : Results of the FCMAE’s ablation study

  • They also compare the self-supervised approach to fully supervised learning

    Figure 3 : Comparison with fully supervised approach

  • They perform better than the fully supervised setup trained for 100 epochs but are still worse than the original ConvNeXt V1 baseline trained for 300 epochs

This is in contrast to the recent success of masked image modeling using transformer-based models [..] where the pre-trained models significantly outperform the supervised counterparts.

Global Response Normalization

  • To try to improve on this and gain more insight into the learning behavior, they perform a qualitative analysis in the feature space

Figure 4 : Illustration of the feature collapse phenomenon

  • They noticed a feature collapse phenomenon and computed the cosine distance between features to get more insight

    Figure 5 : Cosine distance between features for all models

Note : ConvNeXt V2 FCMAE is the new architecture, i.e. ConvNeXt V1 FCMAE with the new normalization layer added to fix the feature collapse phenomenon

  • This analysis showed a reduction in feature diversity through the network for the ConvNeXt V1 FCMAE model (one way to compute such a diversity metric is sketched below)
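
A sketch of a channel-diversity metric along these lines: the average pairwise cosine distance between the feature maps of one layer (the exact formula used in the paper may differ in details):

```python
import torch
import torch.nn.functional as F

def channel_cosine_distance(feat):
    """Average pairwise cosine distance between channel feature maps.
    feat: (C, H, W) activations of one layer for one image.
    Returns a scalar in [0, 1]; higher means more diverse channels."""
    c = feat.shape[0]
    x = F.normalize(feat.reshape(c, -1), dim=1)  # unit-norm per channel
    cos = x @ x.t()                              # (C, C) cosine similarities
    return ((1.0 - cos) / 2).mean().item()       # collapsed channels -> 0
```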

  • They propose a new normalization layer called global response normalization (GRN) to increase feature diversity

  • This layer is composed of three steps:

    • global feature aggregation : consists of mapping each feature map \(X_i\) into a scalar and constructing a vector representing all feature maps

      Here they use the \(L2\)-norm : \(G(X) = \lbrace \Vert X_1 \Vert, \Vert X_2 \Vert, ...,\Vert X_C \Vert \rbrace\)

    • feature normalization : \(N( \Vert X_i \Vert) = \frac {\Vert X_i \Vert}{\sum_{j=1,...,C} \Vert X_j \Vert}\)

    • feature calibration : \(X_i = X_i * N(G(X)_i)\)

  • They add a residual connection and two learnable parameters \(\gamma\) and \(\beta\) to obtain the final block: \(X_i = \gamma * X_i * N(G(X)_i) + \beta + X_i\) (a PyTorch sketch follows Figure 6)

  • They incorporate the GRN layer into the ConvNeXt block creating ConvNeXt V2

    Figure 6 : Illustration of ConvNeXt block
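
A minimal PyTorch sketch of the GRN layer following the three steps above (the channels-last layout and the epsilon are assumptions; the released implementation may differ in details, e.g. normalizing by the mean over channels rather than the sum):

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization, assuming channels-last input (N, H, W, C)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1) global feature aggregation: L2 norm of each channel's feature map
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # 2) feature normalization: relative importance across channels
        nx = gx / (gx.sum(dim=-1, keepdim=True) + self.eps)    # (N, 1, 1, C)
        # 3) feature calibration with learnable affine and residual connection
        return self.gamma * (x * nx) + self.beta + x
```

In the ConvNeXt V2 block, GRN is inserted after the GELU activation between the two pointwise layers, which also makes LayerScale unnecessary.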

Impact of GRN

  • GRN succeeds in mitigating the feature collapse behavior (see the cosine distance between feature maps)
  • The new model outperforms the supervised counterpart trained for 300 epochs

Figure 7 : Result of ConvNeXt V2

  • Ablation study

    Figure 8 : Ablation study of GRN layer

ImageNet Experiments

Classification

  • Comparison with ConvNeXt V1 and contribution of pre-training:

Figure 9 : Detailed results

Figure 10 : Comparison with SOTA methods

  • They also evaluate the performance of the framework when adding an intermediate pre-training on ImageNet-22K
  • They achieve state-of-the-art results on ImageNet-1K using only public data

Figure 11 : Results with intermediate pre-training on ImageNet-22K

Object Detection

Figure 12 : Object detection results

Semantic Segmentation

Figure 13 : Semantic segmentation results

Conclusion

Fully convolutional masked autoencoder pre-training improves performance on various tasks, but it requires a specific architecture design (here, the addition of GRN layers).