Thanks to the authors for making their code available. Whenever something in a paper was unclear to me, I could check the code to confirm my understanding.
@InProceedings{chen2020simple,
title = {A Simple Framework for Contrastive Learning of Visual Representations},
author = {Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
booktitle= {Proceedings of the 37th International Conference on Machine Learning},
pages = {1597--1607},
year = {2020},
volume = {119},
series = {Proceedings of Machine Learning Research},
month = {13--18 Jul},
publisher= {PMLR},
url = {https://proceedings.mlr.press/v119/chen20j.html},
}
@InProceedings{grill2020bootstrap,
title = {Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning},
author = {Grill, Jean-Bastien and Strub, Florian and Altch\'{e}, Florent and Tallec, Corentin and Richemond, Pierre and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan and Gheshlaghi Azar, Mohammad and Piot, Bilal and Kavukcuoglu, Koray and Munos, R\'{e}mi and Valko, Michal},
booktitle= {Advances in Neural Information Processing Systems},
volume = {33},
year = {2020},
publisher= {Curran Associates, Inc.},
url = {https://papers.nips.cc/paper/2020/hash/f3ada80d5c4ee6ad7316347209072cdc-Abstract.html},
}
@InProceedings{caron2020unsupervised,
title = {Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
author = {Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand},
booktitle= {Advances in Neural Information Processing Systems},
volume = {33},
year = {2020},
publisher= {Curran Associates, Inc.},
url = {https://papers.nips.cc/paper/2020/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html},
}
@InProceedings{caron2021emerging,
title = {Emerging Properties in Self-Supervised Vision Transformers},
author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J\'{e}gou, Herv\'{e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
booktitle= {Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages = {9650--9660},
year = {2021},
publisher= {IEEE},
url = {https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html},
}
@InProceedings{zhou2021ibot,
title = {iBOT: Image BERT Pre-Training with Online Tokenizer},
author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
booktitle= {Proceedings of the International Conference on Learning Representations},
year = {2022},
publisher= {OpenReview.net},
url = {https://openreview.net/forum?id=0sH0m4gG9F},
}
@Article{oquab2023dinov2,
title = {DINOv2: Learning Robust Visual Features without Supervision},
author = {Oquab, Maxime and Darcet, Timoth\'{e}e and Moutakanni, Th\'{e}o and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and others},
journal = {arXiv preprint arXiv:2304.07193},
year = {2023},
url = {https://arxiv.org/abs/2304.07193},
}
SwAV added an online clustering step on top of the features. The motivation was to avoid the costly explicit pairwise comparisons between positive and negative pairs used in contrastive methods.
Comparing cluster assignments allows to contrast different image views while not relying on explicit pairwise feature comparisons.
source: Caron et al., 2020
SwAV also uses multi-crop training, where several additional low-resolution crops of the image are used alongside the standard full-resolution crops.
In this work, we propose multi-crop that uses smaller-sized images to increase the number of views while not increasing the memory or computational requirements during training.
source: Caron et al., 2020
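To make the multi-crop idea concrete, here is a minimal PyTorch/torchvision sketch of a transform that produces two full-resolution global views plus several low-resolution local views per image. The crop sizes, scale ranges, and the absence of color augmentations are my own simplifications, not the exact recipe from the paper.

```python
# Minimal multi-crop sketch: 2 global (high-res) views + n_local low-res views.
from torchvision import transforms
from PIL import Image

def make_multicrop_transform(global_size=224, local_size=96, n_local=6):
    global_crop = transforms.Compose([
        transforms.RandomResizedCrop(global_size, scale=(0.4, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    local_crop = transforms.Compose([
        transforms.RandomResizedCrop(local_size, scale=(0.05, 0.4)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    def apply(img):
        # The small local crops add extra views at a fraction of the memory and
        # compute cost of full-resolution crops.
        views = [global_crop(img) for _ in range(2)]
        views += [local_crop(img) for _ in range(n_local)]
        return views

    return apply

# Usage (hypothetical file name):
# views = make_multicrop_transform()(Image.open("cat.jpg").convert("RGB"))
```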
DINOv1 leveraged the power of vision transformers, replacing the ConvNet backbone used in previous works.
In this work, inspired from these methods, we study the impact of self-supervised pretraining on ViT features. Of particular interest, we have identified several interesting properties that do not emerge with supervised ViTs, nor with convnets:
- Self-supervised ViT features explicitly contain the scene layout and, in particular, object boundaries, as shown in Figure 1. This information is directly accessible in the self-attention modules of the last block.
- Self-supervised ViT features perform particularly well with a basic nearest neighbors classifier (k-NN) without any finetuning, linear classifier nor data augmentation, achieving 78.3% top-1 accuracy on ImageNet.
source: Caron et al., 2021
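The k-NN evaluation mentioned above is simple enough to sketch. Below is a minimal similarity-weighted k-NN classifier over frozen features; the values of k, the temperature, and the exact weighting scheme are assumptions for illustration, not necessarily the paper's evaluation protocol.

```python
# Similarity-weighted k-NN over frozen (precomputed) features.
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, num_classes, k=20, T=0.07):
    """train_feats: (N, D) float, train_labels: (N,) long, test_feats: (M, D) float."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T                 # (M, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)         # k nearest training samples
    topk_labels = train_labels[topk_idx]              # (M, k) their labels
    weights = (topk_sims / T).exp()                   # similarity-weighted votes
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)       # accumulate votes per class
    return votes.argmax(dim=1)                        # predicted class per test sample
```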
The iBOT paper introduced the concept of masked patch tokens and designed a loss for recovering them (masked image modeling with an online tokenizer).
The target network is fed with a masked image while the online tokenizer with the original image. The goal is to let the target network recover each masked patch token to its corresponding tokenizer output. Our online tokenizer naturally resolves two major challenges.
- On the one hand, our tokenizer captures high-level visual semantics progressively learned by enforcing the similarity of cross-view images on class tokens.
- On the other hand, our tokenizer needs no extra stages of training as pre-processing setup since it is jointly optimized with MIM via momentum update.
source: Zhou et al., 2022
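A rough sketch of that objective: the student receives the masked image, the EMA teacher (the "online tokenizer") receives the clean image, and a cross-entropy between their patch-token distributions is computed only at the masked positions. Function names, temperatures, and the momentum value below are illustrative assumptions, not iBOT's exact settings.

```python
# Sketch of a masked-patch-token loss with an EMA teacher as online tokenizer.
import torch
import torch.nn.functional as F

def mim_loss(student_patch_logits, teacher_patch_logits, mask,
             student_temp=0.1, teacher_temp=0.04):
    """
    student_patch_logits, teacher_patch_logits: (B, N_patches, K) head outputs.
    mask: (B, N_patches) bool, True where the patch was masked for the student.
    """
    teacher_probs = F.softmax(teacher_patch_logits.detach() / teacher_temp, dim=-1)
    student_logp = F.log_softmax(student_patch_logits / student_temp, dim=-1)
    loss_per_patch = -(teacher_probs * student_logp).sum(dim=-1)   # (B, N_patches)
    # Only masked patches contribute; the teacher output is the recovery target.
    return (loss_per_patch * mask).sum() / mask.sum().clamp(min=1)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher is a momentum (EMA) copy of the student, so no separate
    # tokenizer-pretraining stage is needed.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)
```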
DINOv2 used a similar architecture to iBOT but with some changes. The most notable change was the dataset: they used the LVD-142M dataset, which is much larger than ImageNet-22K. Another change was the use of separate heads for the DINO and iBOT losses.
In Zhou et al. (2022a), an ablation study shows that sharing parameters between the DINO and iBOT heads leads to better performance. At scale, we observed that the opposite is true, and we therefore use two separate heads in all our experiments.
source: Oquab et al., 2023
They also increased the image resolution to $518 \times 518$ pixels for a short period at the end of pretraining.
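A minimal sketch of what "separate heads" means architecturally: one shared backbone with two distinct projection heads, one for the image-level (DINO) loss on the class token and one for the patch-level (iBOT) loss on the patch tokens. The head design and output dimension below are simplified stand-ins, not DINOv2's actual MLP head.

```python
# One shared ViT backbone, two separate projection heads (DINO vs. iBOT).
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.mlp(x)

class SSLModel(nn.Module):
    def __init__(self, backbone, embed_dim):
        super().__init__()
        self.backbone = backbone                    # e.g. a ViT returning all tokens
        self.dino_head = ProjectionHead(embed_dim)  # image-level ([CLS]) objective
        self.ibot_head = ProjectionHead(embed_dim)  # patch-level (masked-token) objective

    def forward(self, x):
        tokens = self.backbone(x)                   # (B, 1 + N_patches, D), [CLS] first
        cls_out = self.dino_head(tokens[:, 0])      # (B, K)
        patch_out = self.ibot_head(tokens[:, 1:])   # (B, N_patches, K)
        return cls_out, patch_out
```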
Check out my post on DINOv2 architecture.