Open access publication

Conference Paper, 2024

Learning the What and How of Annotation in Video Object Segmentation

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), ISBN 979-8-3503-1892-0, pp. 6936-6946, DOI: 10.1109/WACV57701.2024.00680

Contributors

Delatolas, Thanos (corresponding author) [1] [2]
Kalogeiton, Vicky S. [3]
Papadopoulos, Dim P. (ORCID: 0000-0002-5278-2273) [1] [2]

Affiliations

  1. [1] Pioneer Center for AI
  2. [2] Technical University of Denmark [NORA names: DTU Technical University of Denmark; University; Denmark; Europe, EU; Nordic; OECD]
  3. [3] LIX, École Polytechnique, CNRS, Institut Polytechnique de Paris

Abstract

Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that iteratively predicts both which frame ("What") to annotate and which annotation type ("How") to use. The annotator then annotates only the selected frame, and this annotation is used to update a VOS module, leading to significant gains in annotation time. We conduct experiments on the MOSE and DAVIS datasets and show that: (a) EVA-VOS produces masks with accuracy close to human agreement 3.5× faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; and (c) EVA-VOS yields significant gains in annotation time compared to all other methods and baselines.
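The iterative loop described in the abstract can be summarized in pseudocode. The sketch below is a reading of the abstract only, not the authors' implementation: all names (select_frame, select_annotation_type, human_annotate, vos_model, time_budget) are hypothetical placeholders for the "What" selector, the "How" selector, the human annotator, the VOS module, and the annotation-time budget.

    # Minimal sketch of the human-in-the-loop cycle described in the abstract.
    # All names below (select_frame, select_annotation_type, human_annotate,
    # vos_model) are hypothetical placeholders, not the paper's actual API.

    def annotate_video(video, vos_model, select_frame, select_annotation_type,
                       human_annotate, time_budget):
        """Iteratively choose a frame ("What") and an annotation type ("How"),
        collect the human annotation, and update the VOS module until the
        annotation-time budget is exhausted."""
        time_spent = 0.0
        masks = vos_model.predict(video)  # current mask predictions for all frames
        while time_spent < time_budget:
            frame_idx = select_frame(video, masks)                       # "What"
            ann_type = select_annotation_type(video, masks, frame_idx)   # "How"
            annotation, cost = human_annotate(video[frame_idx], ann_type)
            time_spent += cost
            vos_model.update(frame_idx, annotation)  # refine and propagate masks
            masks = vos_model.predict(video)
        return masks

In this reading, cheaper annotation types (for example, quick corrective inputs) would be chosen when the current masks need only small fixes, while full mask drawing would be reserved for frames where predictions fail badly; trading annotation type against frame choice under a time budget is what drives the reported gains in annotation time.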

Keywords

video object segmentation, human-in-the-loop annotation, annotation cost, annotation time, annotation types, frame selection, segmentation masks, human agreement, video editing, video data generation, MOSE dataset, DAVIS dataset, VOS model, state-of-the-art performance

Funders

  • Agence Nationale de la Recherche

Data Provider: Digital Science