Abstract
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking. Show more
Publication status
publishedExternal links
Book title
Computer Vision – ECCV 2020Journal / series
Lecture Notes in Computer ScienceVolume
Pages / Article No.
Publisher
SpringerEvent
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc
Notes
Due to the Coronavirus (COVID-19) the conference was conducted virtually.More
Show all metadata
ETH Bibliography
yes
Altmetrics