Highlights

  • Masked Auto-Encoder (MAE) is a very simple self-supervised pre-training approach that achieves state-of-the-art results
  • Asymmetric encoder-decoder architecture that reduces the computational cost
  • Does not rely heavily on data augmentation: only crops are used

Method

Masking

The image is split into patches as in the original ViT, then 75% of the patches are masked (removed), sampled uniformly at random
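A minimal PyTorch sketch of this uniform masking, modeled on the shuffle-based sampling described in the paper; `random_masking` and its tensor layout are illustrative, not the authors' exact code:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Uniformly sample which patches to keep (hypothetical helper)."""
    B, N, D = patches.shape                          # batch, num patches, token dim
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                         # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]             # the first 25% are kept
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # binary mask in the original patch order: 1 = masked, 0 = visible
    mask = torch.ones(B, N)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```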

Encoder

It uses the ViT architecture and takes as input only the tokens of the visible patches
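In code, the saving comes from running the transformer blocks on the visible subset only. A hedged sketch reusing `random_masking` from above; the patch embedding, position embedding, and encoder below are hypothetical stand-ins sized like ViT-Large:

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 1024, kernel_size=16, stride=16)  # 16x16 patches -> 1024-d tokens
pos_embed = nn.Parameter(torch.zeros(1, 196, 1024))          # 224x224 image -> 14x14 = 196 patches
vit_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True), num_layers=24)

images = torch.randn(8, 3, 224, 224)
# Position embeddings are added before masking, so every visible token
# keeps the location of the patch it came from.
tokens = patch_embed(images).flatten(2).transpose(1, 2) + pos_embed  # (B, 196, 1024)
visible, mask, ids_restore = random_masking(tokens, mask_ratio=0.75)
latent = vit_encoder(visible)                  # self-attention over ~25% of the tokens only
```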

Decoder

  • It also uses the ViT architecture, but is smaller than the encoder
  • Mask tokens are added to the sequence before it is fed to the decoder
  • The final layer of the decoder is a linear layer that predicts pixel values for each token
  • The mask token is shared across all masked positions and learned during training (see the sketch after this list)
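A sketch of this decoder path, using the sizes reported in the Experiments section (encoder width 1024, decoder width 512, 16x16 RGB patches); all module names are hypothetical, and details of the official code such as CLS-token handling are omitted:

```python
import torch
import torch.nn as nn

decoder_embed = nn.Linear(1024, 512)               # project encoder output to decoder width
mask_token = nn.Parameter(torch.zeros(1, 1, 512))  # one shared token, learned with the model
decoder_pos_embed = nn.Parameter(torch.zeros(1, 196, 512))
decoder_blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=16, batch_first=True), num_layers=8)
decoder_pred = nn.Linear(512, 16 * 16 * 3)         # pixel values of one 16x16 RGB patch

def decode(latent, ids_restore):
    x = decoder_embed(latent)                      # (B, N_visible, 512)
    B, n_visible, d = x.shape
    n_total = ids_restore.shape[1]
    # append one mask token per removed patch, then unshuffle to the original order
    x = torch.cat([x, mask_token.expand(B, n_total - n_visible, -1)], dim=1)
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, d))
    x = decoder_blocks(x + decoder_pos_embed)      # small ViT: 8 blocks, width 512
    return decoder_pred(x)                         # (B, N, 768) predicted pixel values
```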

Loss

  • The loss is the MSE between the ground-truth pixel values and the predicted pixel values
  • The loss is only computed on the masked tokens! (see the sketch after this list)
  • However, the authors report that the accuracy drops only slightly if the loss is computed on all tokens
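A minimal sketch of this masked MSE, assuming `pred` holds the decoder's per-patch pixel predictions, `target` the patchified ground-truth pixels, and `mask` the 1-for-masked indicator produced by `random_masking` above:

```python
def mae_loss(pred, target, mask):
    """MSE averaged over the masked patches only (mask: 1 = masked, 0 = visible)."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, N): mean over pixels in a patch
    return (per_patch * mask).sum() / mask.sum()      # average over masked patches only
```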

Experiment

  • Dataset: ImageNet-1K
  • Model:
    • Encoder: ViT-Large (24 blocks, width of 1024)
    • Decoder: 8 blocks with a width of 512
  • Training time: 16 hours for 800 epochs, using 128 TPU-v3 cores (still faster than the other methods)

Results

Masking ratio

They find an optimal masking ratio of 75%

Examples from the validation set

Increasing the masking ratio

Note: the masking ratio was increased only at inference time; the model was still trained with a masking ratio of 75%

Accuracy

They first do self-supervised pre-training and then fine-tuning; the top-1 validation accuracy is reported below:

\[\begin{array}{ccc} \text { scratch, original } & \text { scratch, our impl. } & \text { baseline MAE } \\ \hline 76.5\% & 82.5\% & 84.9\% \end{array}\]

Comparison with the state of the art