Notes

  • Code is available on GitHub
  • This work was done by the same team behind the Compact Convolutional Transformer (CCT), reviewed in this post
  • They therefore use the same Convolutional Tokenization method

Highlights

  • Similar to the Swin Transformer, the idea is to reduce the computational cost of the attention mechanism
  • The authors introduce Neighborhood Attention (NA) and the Neighborhood Attention Transformer (NAT)
  • With Neighborhood Attention, attention is computed only over a neighborhood around each token
  • This not only reduces the computational cost of the attention mechanism but also introduces local inductive biases
  • The drawback is that it reduces the receptive field

Neighborhood Attention

Neighborhood attention on a single pixel \((i, j)\) is defined as follows:

\[NA(X_{i, j}) = \text{softmax}\left(\frac{Q_{i,j}K^T_{\rho(i,j)} + B_{i,j}}{\text{scale}}\right)V_{\rho(i,j)}\]

where:

  • \(Q, K, V\) are linear projections of \(X\)

  • \(B_{i,j}\) denotes the relative positional bias

  • \(\rho(i, j)\) is a fixed-length set of indices of the pixels nearest to \((i, j)\). For a neighborhood of size \(L \times L\), \(\vert \rho(i,j) \vert = L^2\)

However, if the function \(\rho\) maps each pixel to all pixels (i.e. \(L^2\) equals the feature map size), this becomes equivalent to self-attention.
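
To make the formula concrete, here is a minimal NumPy sketch of neighborhood attention for a single pixel, assuming the neighborhood indices \(\rho(i, j)\) are already given as index arrays; na_single_pixel is a hypothetical helper name, and the relative positional bias is passed in explicitly (zeros in the toy usage):

```python
# Minimal NumPy sketch of the NA(X_ij) formula above (not the paper's CUDA kernel).
import numpy as np

def na_single_pixel(Q, K, V, B, i, j, rows, cols):
    """NA(X_ij) = softmax((Q_ij K_rho^T + B_ij) / sqrt(d)) V_rho."""
    d = Q.shape[-1]
    K_n, V_n = K[rows, cols], V[rows, cols]          # (L*L, d) neighborhood keys/values
    scores = (Q[i, j] @ K_n.T + B) / np.sqrt(d)      # (L*L,) attention logits
    attn = np.exp(scores - scores.max())             # numerically stable softmax
    attn /= attn.sum()
    return attn @ V_n                                # (d,) output for pixel (i, j)

# Toy usage: 8x8 feature map, 16 channels, 3x3 neighborhood centered on (4, 4).
H, W, d, L = 8, 8, 16, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((H, W, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # linear projections of X
rows, cols = np.meshgrid(np.arange(3, 6), np.arange(3, 6), indexing="ij")
B = np.zeros(L * L)                                  # relative positional bias (zeros here)
out = na_single_pixel(Q, K, V, B, 4, 4, rows.ravel(), cols.ravel())
print(out.shape)                                     # (16,)
```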

  • Unlike self-attention, whose cost grows quadratically with the number of pixels, the complexity of neighborhood attention is linear with respect to resolution: roughly \(O(HW L^2 d)\) instead of \(O((HW)^2 d)\) for an \(H \times W\) feature map with \(d\) channels
  • The function \(\rho\) which maps a pixel to a set of neighboring pixels is realized with a sliding window.
  • For edge and corner pixels on which the window cannot be centered, the neighborhood is expanded so that it keeps the same size, as illustrated in the image below and in the sketch that follows
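
A minimal sketch of \(\rho(i, j)\) as a sliding \(L \times L\) window with the border behavior just described; neighborhood_indices is a hypothetical helper name, and the clamping logic is one assumption of how the expansion can be implemented:

```python
import numpy as np

def neighborhood_indices(i, j, H, W, L):
    """rho(i, j): the L x L window nearest to (i, j), shifted inward near the
    borders so that it always contains exactly L * L pixels."""
    half = L // 2
    top = min(max(i - half, 0), H - L)    # clamp so the window stays inside the map
    left = min(max(j - half, 0), W - L)
    rows, cols = np.meshgrid(np.arange(top, top + L),
                             np.arange(left, left + L), indexing="ij")
    return rows.ravel(), cols.ravel()

# A corner pixel keeps an L x L neighborhood, it is just no longer centered on it.
rows, cols = neighborhood_indices(0, 0, 8, 8, 3)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # 3x3 window anchored at the corner
```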

Neighborhood Attention Transformer

  • For tokenization, they use the overlapping convolution method introduced in the Compact Convolutional Transformer

  • The rest of the architecture is a succession of stages, each consisting of a token merging layer that reduces the spatial resolution followed by standard multi-head attention blocks in which self-attention is replaced by neighborhood attention

  • The token merging layer is also different from the patch merging layer in the Swin Transformer

  • Here, the overlapping downsampler consists of a 3x3 convolution with a 2x2 stride applied to the feature map (see the sketch below)
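
A minimal PyTorch sketch of these two convolutional components, assuming channels-first \((N, C, H, W)\) tensors; ConvTokenizer and ConvDownsampler are hypothetical module names, and the exact two-convolution tokenizer layout is an assumption rather than a copy of the official NAT code:

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Overlapping convolutional tokenization (CCT-style): strided 3x3 convs
    instead of non-overlapping patch slicing."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )  # overall 4x spatial reduction with overlapping receptive fields

    def forward(self, x):
        return self.proj(x)

class ConvDownsampler(nn.Module):
    """Overlapping downsampler between stages: a 3x3 conv with a 2x2 stride,
    doubling the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.reduce(x)

# Toy usage: a 224x224 RGB image -> 56x56 tokens -> 28x28 merged tokens.
x = torch.randn(1, 3, 224, 224)
tokens = ConvTokenizer(3, 64)(x)        # (1, 64, 56, 56)
merged = ConvDownsampler(64)(tokens)    # (1, 128, 28, 28)
print(tokens.shape, merged.shape)
```

Because the 3x3 kernels overlap under a 2x2 stride, neighboring tokens share pixels, unlike Swin's patch merging, which concatenates non-overlapping 2x2 groups of tokens before a linear projection.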

Results

Classification

  • Trained on ImageNet-1k (1.2 million training images, 1,000 classes)

  • NAT outperforms Swin Transformers and ConvNeXt

Object Detection

  • Mask R-CNN and Cascade Mask R-CNN with different backbones trained on MS-COCO

Semantic Segmentation

  • UPerNet with different backbones trained on ADE20K (20 000 training images)

  • NAT performs better than the Swin Transformer on the segmentation task
  • However, NAT fails to beat ConvNeXt, a recent and very efficient convolutional network

Ablation studies

  • To assess the effectiveness of Neighborhood Attention, they test their architecture on ImageNet-1k with different kinds of attention

  • They also study different merging methods with a Swin Transformer to assess the effectiveness of the overlapping downsampler

Conclusion

This paper introduces a new and interesting attention mechanism based on the neighborhood of a token, and builds a transformer architecture on top of it that achieves competitive results on several computer vision tasks.