Open Access Publication

Conference Paper, 2024

CL-MAE: Curriculum-Learned Masked Autoencoders

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), ISBN 979-8-3503-1892-0, Pages 2480-2490, DOI: 10.1109/WACV57701.2024.00248

Contributors

Madan, Neelu (ORCID: 0000-0001-5778-3470) [1]
Ristea, Nicolae-Catalin [2] [3]
Nasrollahi, Kamal (ORCID: 0000-0002-1953-0429) [1] [4]
Moeslund, Thomas Baltzer (ORCID: 0000-0001-7584-5209) [1]
Ionescu, Radu Tudor (ORCID: 0000-0002-9301-1950) (Corresponding author) [3] [5]

Affiliations

  [1] Aalborg University [NORA names: AAU Aalborg University; University; Denmark; Europe, EU; Nordic; OECD]
  [2] Polytechnic University of Bucharest [NORA names: Romania; Europe, EU]
  [3] University of Bucharest [NORA names: Romania; Europe, EU]
  [4] Milestone Systems, Denmark [NORA names: Denmark; Europe, EU; Nordic; OECD]
  [5] SecurifAI, Romania [NORA names: Romania; Europe, EU]

Abstract

Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that generalize effectively across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module capable of generating masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is trained jointly with the MAE, and its behavior is adjusted during training so that it transitions from a partner of the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), passing through a neutral state. The transition between these behaviors is smooth, regulated by a factor that is multiplied by the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.
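The abstract describes the masking module's objective as the MAE reconstruction loss scaled by a factor that moves from positive (partner) through zero (neutral) to negative (adversary). The sketch below illustrates that weighting scheme in a PyTorch-style training step; the linear schedule, the MSE reconstruction loss, and all function and module interfaces (curriculum_factor, training_step, masking_module) are illustrative assumptions rather than the authors' implementation, which is available at the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the curriculum factor described in the abstract,
# assuming a PyTorch-style setup. Names, schedule, and loss choice are
# assumptions for illustration, not the authors' exact code.

def curriculum_factor(epoch: int, total_epochs: int) -> float:
    """Anneal the factor from +1 (partner) through 0 (neutral) to -1 (adversary)."""
    progress = epoch / max(total_epochs - 1, 1)
    return 1.0 - 2.0 * progress

def training_step(mae: nn.Module, masking_module: nn.Module,
                  images: torch.Tensor, epoch: int, total_epochs: int,
                  opt_mae: torch.optim.Optimizer,
                  opt_mask: torch.optim.Optimizer) -> float:
    lam = curriculum_factor(epoch, total_epochs)

    # The learnable masking module decides which patches to hide.
    mask = masking_module(images)
    # The MAE reconstructs the image from the visible patches.
    reconstruction = mae(images, mask)
    rec_loss = F.mse_loss(reconstruction, images)

    opt_mae.zero_grad()
    opt_mask.zero_grad()
    rec_loss.backward()  # gradients for both the MAE and the masking module

    # The MAE always minimizes the reconstruction loss; the masking module
    # effectively minimizes lam * rec_loss, so it cooperates when lam > 0,
    # is neutral at lam = 0, and acts adversarially when lam < 0.
    for p in masking_module.parameters():
        if p.grad is not None:
            p.grad.mul_(lam)

    opt_mae.step()
    opt_mask.step()
    return rec_loss.item()
```

Under this assumed schedule, scaling the masking module's gradients by the factor yields the easy-to-hard curriculum described above: early in training the module proposes masks the MAE can reconstruct easily, and later it proposes masks that maximize the reconstruction error.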

Keywords

ImageNet, Masked Autoencoders, masked image modeling, curriculum learning, self-supervision, pretext task, masking strategy, masking module, reconstruction loss, reconstruction task, task complexity, representation learning, robust representations, transferable representations, downstream tasks, adversary, neutral state, training procedure

Data Provider: Digital Science