Domain-Robust Network Architectures and Training Strategies for Visual Scene Understanding
Open access
Author
Date: 2024
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Understanding the content of images is an important part of many applications in autonomous driving, augmented reality, robotics, medical imaging, and remote sensing. With the breakthrough of deep neural networks, semantic image understanding has progressed substantially in the last few years. However, neural networks require large amounts of annotated data to be trained properly. As the annotation of large-scale real-world datasets is a costly process, the network can instead be trained on a dataset with existing or cheaper annotations, such as automatically labeled synthetic data. Unfortunately, neural networks are usually sensitive to domain shifts and therefore perform rather poorly on domains that differ from the training data. Therefore, unsupervised domain adaptation (UDA) and domain generalization (DG) methods aim to enable a model trained on a source domain (e.g., synthetic data) to perform well on unlabeled or even unseen target domains (e.g., real-world data).
Most UDA/DG research has focused specifically on the design of adaptation and generalization techniques to overcome the problem of domain shifts. However, the influence of other aspects of the learning framework on domain robustness has been mostly overlooked. Therefore, we take a more holistic view of domain robustness and study the impact of different aspects of the learning framework on UDA and DG, including the network architecture, general training schemes, image resolution, crop size, and context information. In particular, we address the following problems of existing DG and UDA methods: (1) Instead of relying on generic and outdated segmentation architectures for evaluating DG/UDA strategies, we study the influence of recent architectures on domain-robust semantic/panoptic segmentation and design a network architecture specifically tailored for domain-generalizable and domain-adaptive segmentation. (2) To avoid overfitting to the source domain, we propose general training strategies that preserve prior knowledge. (3) To achieve fine segmentation details under the increased GPU memory consumption of DG/UDA, we propose a domain-robust and memory-efficient multi-resolution training framework. (4) To resolve local appearance ambiguities on the target domain, we propose a method to enhance the learning of spatial context relations. These contributions are detailed in the following paragraphs.
As previous UDA and DG semantic segmentation methods are mostly based on the outdated DeepLabV2 network with ResNet backbones, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network architecture tailored for UDA and DG. It consists of a hierarchical Transformer encoder and a multi-level context-aware feature fusion decoder. The DAFormer network is enabled by three simple but crucial training strategies to stabilize the training and avoid overfitting to the source domain: Rare Class Sampling on the source domain improves the quality of the pseudo-labels by mitigating the confirmation bias of self-training toward common classes, while a Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pre-training. With these techniques, DAFormer achieves major performance advances in UDA and DG and enables learning even difficult classes such as train, bus, and truck.
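To make the Rare Class Sampling idea concrete, the following minimal Python/NumPy sketch shows how source images could be drawn with a probability that grows for rare classes. The temperature value, class frequencies, and per-class image lists are illustrative assumptions and not the thesis implementation.

# A minimal sketch of the rare class sampling idea, assuming per-class pixel
# frequencies and a list of images containing each class are precomputed.
import numpy as np

def class_sampling_probs(class_pixel_freq, temperature=0.01):
    """Turn per-class pixel frequencies into sampling probabilities.
    Rarer classes (lower frequency) receive an exponentially higher probability."""
    freq = np.asarray(class_pixel_freq, dtype=np.float64)
    logits = (1.0 - freq) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

def sample_source_image(images_with_class, class_pixel_freq, rng=np.random):
    """Sample a class according to its rarity, then an image containing it.
    images_with_class: list mapping class index -> image indices that contain the class."""
    probs = class_sampling_probs(class_pixel_freq)
    c = rng.choice(len(probs), p=probs)
    return rng.choice(images_with_class[c])

# Example: three classes where class 2 (e.g. a rare class such as "train") is scarce.
freq = [0.6, 0.35, 0.05]
images_with_class = [[0, 1, 2, 3], [0, 2, 4], [5]]
print(sample_source_image(images_with_class, freq))

In this sketch, a lower pixel frequency leads to an exponentially higher sampling probability, so source images containing rare classes are seen more often during training, which counteracts the confirmation bias of self-training toward common classes.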
Further, we study principal architecture designs for panoptic segmentation with respect to their UDA capabilities. We show that previous panoptic UDA methods made suboptimal design choices. Based on these findings, we propose EDAPS, a network architecture particularly designed for domain-adaptive panoptic segmentation. It uses a shared, domain-robust Transformer encoder to facilitate the joint adaptation of semantic and instance features, combined with task-specific decoders tailored to the requirements of domain-adaptive semantic and instance segmentation.
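The following PyTorch sketch only illustrates the shared-encoder, task-specific-decoder layout described above. The simple convolutional encoder and the center/offset instance head are toy stand-ins, not the EDAPS architecture, which uses a hierarchical Transformer encoder and dedicated segmentation decoders.

# A minimal sketch of a panoptic network with one shared encoder and
# task-specific decoders; all modules and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class PanopticNet(nn.Module):
    def __init__(self, num_classes=19, feat_dim=64):
        super().__init__()
        # One shared encoder so that semantic and instance features
        # are adapted jointly across domains.
        self.shared_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Task-specific decoders: a semantic head and a simple
        # center/offset head in the spirit of bottom-up instance segmentation.
        self.semantic_head = nn.Conv2d(feat_dim, num_classes, 1)
        self.center_head = nn.Conv2d(feat_dim, 1, 1)
        self.offset_head = nn.Conv2d(feat_dim, 2, 1)

    def forward(self, x):
        feats = self.shared_encoder(x)
        return {
            "semantic": self.semantic_head(feats),
            "center": self.center_head(feats),
            "offset": self.offset_head(feats),
        }

model = PanopticNet()
out = model(torch.randn(1, 3, 128, 256))
print({k: tuple(v.shape) for k, v in out.items()})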
While DAFormer and EDAPS can better distinguish different classes, we observe that they lack fine segmentation details. We pinpoint the reason to the use of downscaled images, which results in low-resolution predictions. However, naively using full high-resolution images is infeasible due to the higher GPU memory consumption of UDA/DG compared to supervised methods. The alternative of training with random crops of high-resolution images alleviates this problem but falls short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution training approach for UDA and DG that combines the strengths of small high-resolution crops, which preserve fine segmentation details, and large low-resolution crops, which capture long-range context dependencies, using a learned scale attention while maintaining a manageable GPU memory footprint. HRDA enables adapting small objects and preserving fine segmentation details, significantly improving the performance of previous UDA and DG methods.
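A minimal PyTorch sketch of the multi-resolution fusion idea behind HRDA is given below: predictions from a large low-resolution context crop and a small high-resolution detail crop are combined with a learned scale attention. For simplicity, the detail crop is assumed to cover the same image region as the context crop, and the segmentation and attention heads are toy stand-ins rather than the HRDA implementation.

# A minimal sketch of multi-resolution fusion with a learned scale attention;
# crop sizes, the shared toy segmentation head, and the attention head are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResFusion(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.seg_head = nn.Conv2d(3, num_classes, 3, padding=1)  # stand-in segmentation network
        self.attn_head = nn.Conv2d(3, 1, 3, padding=1)           # scale attention from the context crop

    def forward(self, context_crop_lr, detail_crop_hr):
        # Low-resolution context prediction, upsampled to the high-resolution size.
        ctx_logits = F.interpolate(self.seg_head(context_crop_lr),
                                   size=detail_crop_hr.shape[-2:],
                                   mode="bilinear", align_corners=False)
        # High-resolution detail prediction on the small crop.
        det_logits = self.seg_head(detail_crop_hr)
        # Learned scale attention decides per pixel which scale to trust.
        attn = torch.sigmoid(F.interpolate(self.attn_head(context_crop_lr),
                                           size=detail_crop_hr.shape[-2:],
                                           mode="bilinear", align_corners=False))
        return attn * det_logits + (1.0 - attn) * ctx_logits

fusion = MultiResFusion()
context = torch.randn(1, 3, 128, 128)  # large crop, downscaled
detail = torch.randn(1, 3, 256, 256)   # small crop, full resolution
print(fusion(context, detail).shape)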
Even with the improved discriminative and high-resolution abilities of DAFormer and HRDA, UDA methods struggle with classes that have a similar visual appearance on the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces the consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks and domain gaps.
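The following PyTorch sketch illustrates the masked consistency idea: random square patches of the target image are withheld, and the prediction on the masked image is trained to match a pseudo-label obtained from the complete image. The patch size, mask ratio, and toy network are illustrative assumptions, and a detached forward pass of the same network stands in here for the separate pseudo-labeling used in the thesis framework.

# A minimal sketch of a masked image consistency loss on unlabeled target images.
import torch
import torch.nn.functional as F

def mask_patches(images, patch_size=16, mask_ratio=0.5):
    """Zero out random square patches of the input images."""
    b, _, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch_size, w // patch_size) > mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return images * keep

def masked_consistency_loss(network, target_images):
    # Pseudo-label from the prediction on the complete image (no gradients).
    with torch.no_grad():
        pseudo_label = network(target_images).argmax(dim=1)
    # The prediction on the masked image must be inferred from the remaining context.
    masked_pred = network(mask_patches(target_images))
    return F.cross_entropy(masked_pred, pseudo_label)

# Example with a toy fully convolutional "network".
net = torch.nn.Conv2d(3, 19, 3, padding=1)
images = torch.randn(2, 3, 64, 64)
print(masked_consistency_loss(net, images).item())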
Overall, this thesis reveals the importance of a holistic view on the different aspects of the learning framework, such as network architectures and general training strategies, for domain-robust visual scene understanding. The presented methods substantially improve the performance on synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather domain adaptation across several perception tasks. For instance, they achieve an overall gain of +18.4 mIoU for semantic segmentation on GTA-to-Cityscapes. Beyond adaptation, DAFormer and HRDA even work in the more challenging domain generalization setting, where they improve the performance by +12.0 mIoU when generalizing from GTA to 5 unseen real-world datasets. The implementations are open-sourced and available at https://github.com/lhoyer.
Permanent link
https://doi.org/10.3929/ethz-b-000702004
Publication status: published
Contributors
Examiner: Van Gool, Luc
Examiner: Dai, Dengxin
Examiner: Schiele, Bernt
Examiner: Salzmann, Mathieu
Publisher: ETH Zurich
Organisational unit: 03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)