
Self-Supervised Learning Review: from SimCLR to DINOv2

Acknowledgment:

Thanks to the authors for making their code available. Whenever something in a paper was unclear to me, I checked the code to confirm my understanding.

References:

@InProceedings{chen2020simple,
  title    = {A Simple Framework for Contrastive Learning of Visual Representations},
  author   = {Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
  booktitle= {Proceedings of the 37th International Conference on Machine Learning},
  pages    = {1597--1607},
  year     = {2020},
  volume   = {119},
  series   = {Proceedings of Machine Learning Research},
  month    = {13--18 Jul},
  publisher= {PMLR},
  url      = {https://proceedings.mlr.press/v119/chen20j.html},
}
@InProceedings{grill2020bootstrap,
  title    = {Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning},
  author   = {Grill, Jean-Bastien and Strub, Florian and Altch\'{e}, Florent and Tallec, Corentin and Richemond, Pierre and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan and Gheshlaghi Azar, Mohammad and Piot, Bilal and Kavukcuoglu, Koray and Munos, R\'{e}mi and Valko, Michal},
  booktitle= {Advances in Neural Information Processing Systems},
  volume   = {33},
  year     = {2020},
  publisher= {Curran Associates, Inc.},
  url      = {https://papers.nips.cc/paper/2020/hash/f3ada80d5c4ee6ad7316347209072cdc-Abstract.html},
}
@InProceedings{caron2020unsupervised,
  title    = {Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
  author   = {Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand},
  booktitle= {Advances in Neural Information Processing Systems},
  volume   = {33},
  year     = {2020},
  publisher= {Curran Associates, Inc.},
  url      = {https://papers.nips.cc/paper/2020/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html},
}
@InProceedings{caron2021emerging,
  title    = {Emerging Properties in Self-Supervised Vision Transformers},
  author   = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J\'{e}gou, Herv\'{e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle= {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages    = {9650--9660},
  year     = {2021},
  publisher= {IEEE},
  url      = {https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html},
}
@InProceedings{zhou2021ibot,
  title    = {iBOT: Image BERT Pre-Training with Online Tokenizer},
  author   = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
  booktitle= {Proceedings of the International Conference on Learning Representations},
  year     = {2022},
  publisher= {OpenReview.net},
  url      = {https://openreview.net/forum?id=0sH0m4gG9F},
}
@Article{oquab2023dinov2,
  title    = {DINOv2: Learning Robust Visual Features without Supervision},
  author   = {Oquab, Maxime and Darcet, Timoth\'{e}e and Moutakanni, Th\'{e}o and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yu and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and J\'{e}gou, Herv\'{e} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal  = {arXiv preprint arXiv:2304.07193},
  year     = {2023},
  url      = {https://arxiv.org/abs/2304.07193},
}


SimCLR

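SimCLR (Chen et al., 2020) builds two augmented views of every image in a batch, encodes them with a shared network and projection head, and trains with the NT-Xent loss: the two views of the same image are pulled together while every other image in the batch serves as a negative. Below is a minimal PyTorch-style sketch of that loss; the function and variable names are mine, not from the official code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss. z1, z2: (N, D) projections of two views of N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # a view is not its own negative
    n = z1.size(0)
    # the positive for row i is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```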

BYOL

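BYOL (Grill et al., 2020) removes negative pairs entirely: an online network predicts the target network's projection of a second view of the same image, and the target's weights are an exponential moving average (EMA) of the online weights. A hedged sketch of the two key pieces, with names of my own choosing:

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    """Negative cosine similarity between the online predictor's output and
    the target projection; gradients never flow into the target network."""
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)
    return 2 - 2 * (p * z).sum(dim=1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """The target network slowly tracks the online network."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_(o, alpha=1 - tau)
```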

SwAV

SwAV added a clustering step on top of the features. The motivation was to avoid costly pairwise comparisons between positive and negative pairs: instead of contrasting features directly, SwAV contrasts cluster assignments.

Comparing cluster assignments allows to contrast different image views while not relying on explicit pairwise feature comparisons.

source: Caron et al., 2020

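Concretely, each view's features are scored against a set of learnable prototypes, the scores are turned into balanced soft cluster assignments with a few Sinkhorn-Knopp iterations, and each view then predicts the other view's assignment. The sketch below follows the structure of the official SwAV code but is simplified, and it assumes the prototype scores are already computed:

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn prototype scores (B, K) into soft assignments whose clusters are
    roughly equally used across the batch (this prevents collapse)."""
    q = torch.exp((scores - scores.max()) / eps).t()     # (K, B), stabilized
    q /= q.sum()
    n_protos, n_samples = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= n_protos   # normalize prototypes
        q /= q.sum(dim=0, keepdim=True); q /= n_samples  # normalize samples
    return (q * n_samples).t()                           # (B, K), rows sum to 1

def swav_loss(scores_1, scores_2, temperature=0.1):
    """Swapped prediction: view 1 predicts view 2's assignment and vice versa."""
    q1, q2 = sinkhorn(scores_1), sinkhorn(scores_2)
    p1 = torch.log_softmax(scores_1 / temperature, dim=1)
    p2 = torch.log_softmax(scores_2 / temperature, dim=1)
    return -0.5 * ((q1 * p2).sum(dim=1) + (q2 * p1).sum(dim=1)).mean()
```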


SwAV also trains with multi-crop: in addition to two standard crops, it samples several smaller, low-resolution crops of the same image, which adds extra views almost for free.

In this work, we propose multi-crop that uses smaller-sized images to increase the number of views while not increasing the memory or computational requirements during training.

source: Caron et al., 2020

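A multi-crop batch can be sketched with standard torchvision transforms; the crop sizes and scale ranges below are illustrative (the papers tune them), not the official values:

```python
from torchvision import transforms

def multi_crop(image, n_local=6):
    """Two global crops plus several cheap low-resolution local crops."""
    global_t = transforms.RandomResizedCrop(224, scale=(0.4, 1.0))
    local_t = transforms.RandomResizedCrop(96, scale=(0.05, 0.4))
    return [global_t(image) for _ in range(2)] + \
           [local_t(image) for _ in range(n_local)]
```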

DINOv1

DINOv1 brought vision transformers into this line of work, replacing the ConvNet backbone used by the previous methods.

In this work, inspired from these methods, we study the impact of self-supervised pretraining on ViT features. Of particular interest, we have identified several interesting properties that do not emerge with supervised ViTs, nor with convnets:

source: Caron et al., 2021
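DINO trains the ViT by self-distillation: a student matches the output distribution of an EMA teacher, and collapse is avoided by centering and sharpening the teacher's outputs. A simplified sketch of the loss, assuming the teacher is updated with an EMA as in the BYOL sketch above:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the teacher's centered, sharpened distribution
    (low temperature tau_t) and the student's distribution."""
    t = F.softmax((teacher_out - center) / tau_t, dim=1).detach()
    s = F.log_softmax(student_out / tau_s, dim=1)
    return -(t * s).sum(dim=1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    """The center is an EMA of teacher outputs; subtracting it keeps the
    teacher from collapsing onto a single dimension."""
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)
```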

(figures from Caron et al., 2021)

iBOT

The iBOT paper introduced the idea of masking patch tokens and designed a loss for recovering them.

The target network is fed with a masked image while the online tokenizer with the original image. The goal is to let the target network recover each masked patch token to its corresponding tokenizer output. Our online tokenizer naturally resolves two major challenges.

source: Zhou et al., 2022
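In code, this amounts to a DINO-style cross-entropy applied per patch, but only at the masked positions. A simplified sketch (shapes and names are my own):

```python
import torch
import torch.nn.functional as F

def ibot_patch_loss(student_patches, teacher_patches, mask,
                    center, tau_s=0.1, tau_t=0.04):
    """student_patches, teacher_patches: (B, N, K) patch-level head outputs,
    where the student saw a masked image and the teacher saw the original.
    mask: (B, N) boolean, True where the student's input patch was masked."""
    t = F.softmax((teacher_patches - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_patches / tau_s, dim=-1)
    per_patch = -(t * s).sum(dim=-1)                 # (B, N) cross-entropy
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```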

(figure from Zhou et al., 2022)

DINOv2

DINOv2 used an architecture similar to iBOT's, with a few changes. The most notable change was the dataset: DINOv2 was trained on LVD-142M, which is much larger than ImageNet-22K. Another change was the use of separate heads for the DINO and iBOT losses.

In Zhou et al. (2022a), an ablation study shows that sharing parameters between the DINO and iBOT heads leads to better performance. At scale, we observed that the opposite is true, and we therefore use two separate heads in all our experiments.

source: Oquab et al., 2023
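Structurally, that just means the class token and the patch tokens go through two different projection heads instead of one shared head. A toy sketch (the real DINOv2 heads are deeper MLPs with a weight-normalized last layer):

```python
import torch.nn as nn

class SeparateHeads(nn.Module):
    """Distinct heads for the image-level (DINO) and patch-level (iBOT)
    losses, as DINOv2 uses at scale; iBOT originally shared one head."""
    def __init__(self, dim, out_dim, hidden=2048):
        super().__init__()
        def make_head():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))
        self.dino_head = make_head()  # applied to the class token
        self.ibot_head = make_head()  # applied to (masked) patch tokens

    def forward(self, cls_token, patch_tokens):
        return self.dino_head(cls_token), self.ibot_head(patch_tokens)
```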

They also increased the image resolution to $518 \times 518$ pixels for a short period at the end of pretraining.
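That resolution fits DINOv2's 14-pixel ViT patches exactly:

```python
patch = 14
side = 518 // patch    # 37 patches per side, no remainder
tokens = side * side   # 1369 patch tokens (plus one class token)
```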

Check out my post on DINOv2 architecture.