Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation

We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. To capture the cross-task relationships, we propose a neural network architecture that contains task-specific and cross-task refinement heads. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain. We experimentally observe improvements in both tasks' performance because the complementary information present in these tasks is better captured. Specifically, we show that: (1) our approach improves performance on all tasks when they are complementary and mutually dependent; (2) the CTRL helps to improve the performance of both semantic segmentation and depth estimation in the challenging UDA setting; (3) the proposed ISL training scheme further improves the semantic segmentation performance. The implementation is available at https://github.com/susaha/ctrl-uda.


Introduction
Semantic segmentation and monocular depth estimation are two important computer vision tasks that allow us to perceive the world around us and enable agents' reasoning, e.g., in an autonomous driving scenario. Moreover, these tasks have been shown to be complementary to each other, i.e., information from one task can improve the other task's performance [29,42,60]. Domain Adaptation (DA) [11] refers to maximizing model performance in an environment with a smaller degree of supervision (the target domain) relative to what the model was trained on (the source domain). Unsupervised Domain Adaptation (UDA) assumes access only to unannotated samples from the target domain at training time. This is the setting of interest in this paper, explained in greater detail in Sec. 6.

Figure 1: Semantic segmentation improvement with our approach to unsupervised domain adaptation over the state-of-the-art DADA [62] method. Left to right: Cityscapes test images, DADA, and the proposed method (CTRL). Our model correctly segments the "bus", "rider", and "wall" classes underrepresented in the target domain (highlighted).

Corresponding author: Suman Saha (suman.saha@vision.ee.ethz.ch). * Equal contribution.
Recent domain adaptation techniques [34,62] proposed to leverage depth information available in the source domain to improve semantic segmentation on the target domain. However, they lack an explicit multi-task formulation to relate depth and semantics, that is to say, how each semantic category relates to different depth levels. The term depth levels refers to different discrete ranges of depth values, e.g., "near" (1-5 m), "medium-range" (5-20 m), or "far" (>20 m). This paper aims to design a model that learns explicit relationships between different visual semantic classes and depth levels within the UDA context.
To this end, we design a network architecture and a new multitask-aware feature space alignment mechanism for UDA. First, we propose a Cross-Task Relation Layer (CTRL), a novel parameter-free differentiable module tailored to capture the task relationships given the network's semantic and depth predictions. Second, we utilize a Semantics Refinement Head (SRH) that explicitly captures cross-task relationships by learning to predict semantic segmentation given predicted depth features. Both CTRL and SRH boost the model's ability to effectively encode correlations between semantics and depth, thus improving predictions on the target domain. Third, we employ an Iterative Self-Learning (ISL) scheme. Coupled with the model design, it further pushes the performance of semantic segmentation. As a result, our method achieves state-of-the-art semantic segmentation performance on three challenging UDA benchmarks (Sec. 4). Fig. 1 demonstrates our method's effectiveness by comparing semantic predictions of classes underrepresented in the target domain to predictions made by the previous state-of-the-art method. The paper is organized as follows: Sec. 2 discusses the related work; Sec. 3 describes the proposed approach to UDA, the network architecture, and the learning scheme; Sec. 4 presents the experimental analysis with ablation studies; Sec. 5 concludes the paper.

Related Work
Semantic Segmentation. Semantic segmentation refers to the task of assigning a semantic label to each pixel of an image. Conventionally, the task has been addressed using hand-crafted features combined with classifiers, such as Random Forests [53], SVMs [16], or Conditional Random Fields [31]. Powered by the effectiveness of Convolutional Neural Networks (CNNs) [33], we have seen an increasing number of deep learning-based models. Long et al. [38] were among the first to use fully convolutional networks (FCNs) for semantic segmentation. Since then, this design has quickly become a state-of-the-art method for the task. The encoder-decoder design is still widely used [67,5,1,72,4].
Cross-domain Semantic Segmentation. Training deep networks for semantic segmentation requires large amounts of labeled data, which presents a significant bottleneck in practice, as acquiring pixel-wise labels is a labor-intensive process. A common approach to address the issue is to train the model on a source domain and apply it to a target domain in a UDA context. However, this often causes a performance drop due to the domain shift. Domain Adaptation aims to solve the issue by aligning the features from different domains. DA is a highly active research field, and techniques have been developed for various applications, including image classification [18,36,39,40], object detection [8], fine-grained recognition [19], etc.
More related to our method are several works on unsupervised domain adaptation for semantic segmentation [69,52,74,9,61,25,65,73,48,66]. This problem has been tackled with curriculum learning [69], GANs [52], adversarial training on the feature space [9], output space [55], or entropy maps [61], and self-learning using pseudo- or weak labels [74,48,25]. However, prior works typically only consider adapting semantic segmentation while neglecting any multi-task correlations. A few methods [7,62] model correlations between semantic segmentation and depth estimation, similarly to our work, yet, as explained in Sec. 1, these works come with crucial limitations.
Monocular Depth Estimation. Similar to semantic segmentation, monocular depth estimation is dominated by CNN-based methods [13,15,32,35]. [13] introduced a CNN-based architecture for depth estimation, which regresses a dense depth map. Their approach was then improved by incorporating techniques such as a CRF [37,35] and multi-scale CRF techniques [64]. Besides, improvements in the loss design itself also lead to better depth estimation. Examples include the reverse Huber (berHu) loss [46,75] and the ordinal regression loss [15].
Multi-task Learning for Semantic Segmentation and Depth Estimation. Within the context of multi-task learning, semantic segmentation is shown to be highly correlated with depth estimation, and vice versa [68,63,29,70,71,42,54,60,59,28]. To leverage this correlation, some authors have proposed to learn them jointly [50,27,6]. In particular, [45,27,58,3] proposed to share the encoder and use multiple decoders, whereas a shared conditional decoder is used in [6]. Semantic segmentation was also demonstrated to help guide the depth training process [21,26].
In this paper, we build upon these observations. We argue that task relationships, like the ones between depth and semantics, are not entirely domain-specific. As a result, if we correctly model these relationships in one domain, they can be transferred to another domain to help guide the DA process. The proposed method and its components are explicitly designed around this hypothesis.

Method
In this section, we describe our approach to UDA in the autonomous driving setting. Sec. 3.1 presents an overview of the proposed approach; Sec. 3.2 explains the notation and problem formulation; Sec. 3.3 describes supervision on the source domain; Sec. 3.4 presents the CTRL module design; Sec. 3.5 describes the ISL technique; Sec. 3.6 provides the remaining network architecture details.

Overview
The primary hypothesis behind our approach is that task dependencies persist across domains, i.e., most semantic classes fall within a finite depth range. We can exploit this information from source samples and transfer it to the target domain using adversarial training. As our goal is to train the network in a UDA setting, we follow an adversarial training scheme [24,55] to learn domain-invariant representations.
Unlike [62], which directly aligns a combination of semantics and depth features, we wish to design a joint feature space for domain alignment by fusing the task-specific and the cross-task features and then learn to minimize the domain gap through adversarial training. To this end, we propose CTRL, a novel module that constructs the joint feature space by computing entropy maps of both the semantic label and discretized depth distributions (Fig. 2). Thus, CTRL entropy maps, generated on the source and target domains, are expected to carry similar information.
Further enhancement of semantic segmentation performance appears possible by utilizing the Iterative Self-Learning (ISL) training scheme, which does not require expensive patch-based pseudo-label generation like [25]. As our CTRL helps the network to produce high-quality predictions (Fig. 1), ISL training exploits high-confidence predictions as supervision (pseudo-labels) on the target domain.

Problem Formulation
Let $\mathcal{D}^{(s)}$ and $\mathcal{D}^{(t)}$ denote the source and target domains, with samples from them represented by tuples $(x^{(s)}, y^{(s)}, z^{(s)})$ and $(x^{(t)})$, respectively, where $x \in \mathbb{R}^{H \times W \times 3}$ are color images, $y \in \{1, \ldots, C\}^{H \times W}$ are semantic annotations with $C$ classes, and $z \in [Z_{min}, Z_{max}]^{H \times W}$ are depth maps from a finite frustum. Furthermore, $F_e$ is the shared feature extractor, which includes a pretrained backbone and a decoder; $F_s$ and $F_d$ are the task-specific semantics and depth heads, respectively; $F_r$ is the SRH (Fig. 2).
First, $F_e$ extracts a shared feature map to be used by SRH and the task-specific semantics and depth heads. The semantics head $F_s$ predicts a semantic segmentation map $\hat{y}_s = F_s(F_e(x))$ with $C$ channels per pixel, denoting predicted class probabilities. The depth head $F_d$ predicts a real-valued depth map $\hat{z} = F_d(F_e(x))$, where each pixel is mapped into the finite frustum specified in the source domain. We further employ SRH to learn the cross-task relationship between semantics and depth by making it predict semantics from the shared feature map, attenuated by the predicted depth map. Formally, the shared feature map is point-wise multiplied by the predicted depth map, and then SRH predicts a second (auxiliary) semantic segmentation map: $\hat{y}_r = F_r(\hat{z} \odot F_e(x))$.
We refer to the part of the model enclosing the $F_e$, $F_s$, $F_r$, $F_d$ modules as the prediction network. The predictions made by the network on the source and target domains are denoted as $(\hat{y}_s^{(s)}, \hat{y}_r^{(s)}, \hat{z}^{(s)})$ and $(\hat{y}_s^{(t)}, \hat{y}_r^{(t)}, \hat{z}^{(t)})$, respectively. We upscale these predictions along the spatial dimensions to match the original input image dimensions $H \times W$ before any further processing. Given these semantics and depth predictions on the source and target domains, we optimize the network using a supervised loss on the source domain and an unsupervised domain alignment loss on the target domain within the same training process.

Supervised Learning
Since the semantic segmentation predictions $\hat{y}_s^{(s)}$, $\hat{y}_r^{(s)}$ and the ground truth $y^{(s)}$ are represented as pixel-wise class probabilities over $C$ classes, we employ the standard cross-entropy loss with the semantic heads (Eq. 1). We use the berHu loss (the reversed Huber criterion [32]) for penalizing depth predictions (Eq. 2). Following [29], we regress inverse depth values (normalized disparity), which is shown to improve the precision of predictions on the full range of the view frustum. The parameters of the network $\theta_e$, $\theta_s$, $\theta_r$, $\theta_d$ (parameterizing the $F_e$, $F_s$, $F_r$, $F_d$ modules), collectively denoted as $\theta_{net}$, are learned to minimize the supervised objective on the source domain (Eq. 3), which combines the semantic, SRH, and depth losses; $\lambda_r$ and $\lambda_d$ are the hyperparameters weighting the relative importance of the SRH and depth supervision.
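For reference, the three losses of this subsection can be written out as follows. This is a sketch assembled from the definitions above, not a verbatim copy of the displayed equations: the per-pixel averaging, the one-hot encoding $y_{ijc}$ of the labels, and the berHu threshold symbol $\tau$ (whose value is not specified here) are assumptions.

```latex
\begin{align}
\mathcal{L}_{CE}(\hat{y}, y) &= -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} y_{ijc} \log \hat{y}_{ijc}, \tag{1} \\
\mathcal{L}_{berHu}(\hat{z}, z) &= \frac{1}{HW} \sum_{i,j}
\begin{cases}
|e_{ij}|, & |e_{ij}| \le \tau, \\[2pt]
\dfrac{e_{ij}^{2} + \tau^{2}}{2\tau}, & \text{otherwise},
\end{cases}
\qquad e_{ij} = z_{ij} - \hat{z}_{ij}, \tag{2} \\
\min_{\theta_{net}} \; & \mathcal{L}_{CE}\big(\hat{y}_s^{(s)}, y^{(s)}\big)
+ \lambda_r \, \mathcal{L}_{CE}\big(\hat{y}_r^{(s)}, y^{(s)}\big)
+ \lambda_d \, \mathcal{L}_{berHu}\big(\hat{z}^{(s)}, z^{(s)}\big). \tag{3}
\end{align}
```

The berHu criterion behaves like an L1 penalty for small residuals and like a scaled L2 penalty for large ones, which is why it is a common choice for depth regression.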

Cross-Task Relation Layer
In the absence of ground truth annotations for the target samples, we train the network on the target images using unsupervised domain alignment loss. Existing works either align source and target domain in a semantic space [61] or a depth-aware semantic space [62] by fusing the continuous depth predictions with predicted semantic maps. Here, we argue that simple fusion of the continuous depth prediction into the semantics does not enable the network to learn useful semantic features at different depth levels. Instead, explicit modeling is required to achieve this goal.
Humans learn to relate semantic categories at each discrete depth level differently. For example, "sky" is "far away" (large depth), "vehicles" are "nearby", and "road" appears to be both "far" and "nearby". Taking inspiration from the way humans relate semantics and depth, we design CTRL (Fig. 2) to capture the semantic class-specific dependencies at different discrete depth levels. Moreover, CTRL also preserves task-specific information by fusing task-specific and task-dependent features learned by the semantics, depth, and refinement (SRH) heads. CTRL consists of a depth discretization module, an entropy map generation module, and a fusion layer, described in the following subsections.

Depth Discretization Module
The prediction $\hat{z}$ made by the depth head contains continuous depth values. We want to map it to a discrete probability space to learn visual semantic features at different depth levels. We quantize the view frustum depth range into a set of representative discrete values following the spacing-increasing discretization (SID) [15]. Such discretization assigns progressively larger depth sub-ranges further away from the point of view to separate bins, which allows us to simulate the human perception of depth relations in the scene with a finite number of categories.
Given the depth range $[Z_{min}, Z_{max}]$ and the number of depth bins $K$, SID outputs a $K$-dimensional vector of discretization bin centers $b$ (Eq. 4). We can then assign probabilities of the predicted depth values falling into the defined bins (Eq. 5).
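The discretization step can be illustrated with a minimal Python sketch. It assumes the usual SID form of [15], with bin centers spaced uniformly in log-depth and indexed $i = 0, \ldots, K-1$, and it assumes that bin probabilities come from a softmax over negative squared distances to the centers. Both the indexing and the soft-assignment form are assumptions made for illustration, and the function names are hypothetical. The sketch works on metric depth, whereas the network regresses inverse depth.

```python
import math

def sid_bin_centers(z_min, z_max, k):
    """Bin centers spaced uniformly in log-depth (assumed SID form)."""
    log_ratio = math.log(z_max / z_min)
    return [math.exp(math.log(z_min) + log_ratio * i / k) for i in range(k)]

def soft_bin_probabilities(z_hat, centers):
    """Softmax over negative squared distances to the centers (assumed form)."""
    logits = [-(z_hat - b) ** 2 for b in centers]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Settings reported in the experimental setup: Z_min = 1 m, Z_max = 655.36 m, K = 15.
centers = sid_bin_centers(1.0, 655.36, 15)
probs = soft_bin_probabilities(10.0, centers)
```

With these settings the gap between consecutive centers grows with depth, so distant regions share coarse bins while nearby regions are resolved finely.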

Joint Space for Domain Alignment
The task-dependency map $\hat{y}_r$ (output by SRH), alongside the task-specific semantics $\hat{y}_s$ and depth $\hat{z}$ probability maps, can be considered as discrete distributions over semantic classes and depth levels. As we do not have access to the ground truth labels for the target domain, one way to train the network to make high-confidence predictions is by minimizing the uncertainty (or entropy) in the predicted distributions over the target domain [61]. The source and target domains share similar spatial features, and it is recommended to align them in the structured output space [23].
To this end, we propose a novel UDA training scheme, where task-specific and task-dependent knowledge is transferred from the source to the target domain by constraining the target distributions to be similar to the source by aligning the entropy maps of $\hat{y}_r$, $\hat{y}_s$, and $\hat{z}$. Note that unlike [62,61], which constrain only the task-specific space ($\hat{y}_s$ in our case) for domain alignment, we train the network to output highly certain predictions by aligning features in the task-specific and task-dependent spaces.
We argue that aligning source and target distributions jointly in task-specific and task-dependent spaces helps to bridge the domain gap for underrepresented classes, which are learned poorly without the presence of a joint representation. To encode such a joint representation, we generate the entropy maps $E_r$, $E_s$, and $E_d$ from $\hat{y}_r$, $\hat{y}_s$, and the discretized depth probabilities, respectively (Eq. 6). We then concatenate these maps along the channel dimension to get the fused entropy map $E = \mathrm{concat}(E_r, E_s, E_d)$ and employ adversarial training on it. For aligning the source and target domain distributions, we train the proposed segmentation and depth prediction network (parameterized by $\theta_{net}$) and the discriminator network $D$ (parameterized by $\theta_D$) following an adversarial learning scheme. More specifically, the discriminator is trained to correctly classify the sample domain as either source or target given only the fused entropy map (Eq. 7). At the same time, the prediction network parameters are learned to maximize the domain classification loss (i.e., fooling the discriminator) on the target samples (Eq. 8). The hyperparameter $\lambda_{adv}$ weighs the relative importance of the adversarial loss (Eq. 8). Our training scheme jointly optimizes the model parameters of the prediction network ($\theta_{net}$) and the discriminator ($\theta_D$). Updates to the prediction network and the discriminator happen at every training iteration; however, when updating the prediction network, the discriminator parameters are kept fixed. Parameters of the discriminator are updated separately using the domain classification objective (Eq. 7).
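One plausible way to write the entropy maps and the two adversarial objectives referenced above is sketched below. It assumes per-channel weighted self-information maps in the style of [61], a binary cross-entropy domain loss $\mathcal{L}_{BCE}$, and the label convention 1 = source, 0 = target; here $\hat{z}$ denotes the $K$-channel discretized depth probability map. These choices are assumptions made for illustration rather than the exact published form.

```latex
\begin{align}
E_r &= -\hat{y}_r \odot \log \hat{y}_r, \qquad
E_s = -\hat{y}_s \odot \log \hat{y}_s, \qquad
E_d = -\hat{z} \odot \log \hat{z}, \tag{6} \\
\min_{\theta_D} \; & \mathcal{L}_{BCE}\big(D(E^{(s)}), 1\big) + \mathcal{L}_{BCE}\big(D(E^{(t)}), 0\big), \tag{7} \\
\min_{\theta_{net}} \; & \lambda_{adv} \, \mathcal{L}_{BCE}\big(D(E^{(t)}), 1\big). \tag{8}
\end{align}
```

Minimizing the target-domain loss with the flipped label in Eq. 8 is the usual practical way of maximizing the discriminator's classification error on target samples.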

Iterative Self Learning
Following prior work [74], we train our network end-to-end using the ISL scheme summarized in Algorithm 1. We first train the prediction ($\theta_{net}$) and discriminator ($\theta_D$) networks for $Q_1$ iterations. We then generate semantic pseudo-labels ($\hat{y}_s^{(t)}$) on the target training samples $x^{(t)}$ using the trained prediction network.
We further train the prediction network on the target training samples using pseudo-label supervision and a masked cross-entropy loss (Eq. 1), masking target prediction pixels with confidence less than 0.9, for $Q_3$ iterations. Instead of training the prediction network using self-learning (SL) only once, we iterate over generating high-confidence pseudo-labels and self-training $Q_2$ times to refine the pseudo-labels, resulting in better-quality semantics output on the target domain.
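The pseudo-label masking described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: `IGNORE_INDEX` and the function names are hypothetical, and per-pixel probabilities are represented as plain lists rather than tensors.

```python
import math

IGNORE_INDEX = 255  # hypothetical label marking pixels excluded from the loss

def make_pseudo_labels(probs, threshold=0.9):
    """Keep the arg-max class where confidence reaches the threshold, else ignore."""
    labels = []
    for p in probs:
        confidence = max(p)
        labels.append(p.index(confidence) if confidence >= threshold else IGNORE_INDEX)
    return labels

def masked_cross_entropy(probs, labels):
    """Average negative log-likelihood over the pixels that were not masked out."""
    kept = [(p, y) for p, y in zip(probs, labels) if y != IGNORE_INDEX]
    if not kept:
        return 0.0
    return -sum(math.log(max(p[y], 1e-12)) for p, y in kept) / len(kept)

def isl_total_iterations(q1, q2, q3):
    """Q1 warm-up iterations followed by Q2 rounds of Q3 self-training iterations."""
    return q1 + q2 * q3

pixel_probs = [[0.95, 0.05], [0.60, 0.40], [0.05, 0.95]]
pseudo = make_pseudo_labels(pixel_probs)
loss = masked_cross_entropy(pixel_probs, pseudo)
total = isl_total_iterations(65_000, 5, 5_000)
```

In this toy example the middle pixel has confidence 0.6 and is therefore masked, so only the two confident pixels contribute to the loss. Under this reading of the schedule, the settings $Q_1$ = 65K, $Q_2$ = 5, $Q_3$ = 5K add up to 90K iterations.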
We show in the ablation studies (Sec. 4.4) that our ISL scheme outperforms the simple SL. The discriminator network parameters ($\theta_D$) are kept fixed during self-training.

Algorithm 1: Iterative Self-Learning (ISL)
1: Train $\theta_{net}$ and $\theta_D$ for $Q_1$ iterations;
2: for $q = 1, \ldots, Q_2$ do
3: Generate $\hat{y}_s^{(t)} = F_s(F_e(x^{(t)}))$ using trained $\theta_{net}$;
4: Train $\theta_{net}$ on $(x^{(t)}, \hat{y}_s^{(t)})$ for $Q_3$ iterations;
5: end for

Network Architecture
The shared part of the prediction network $F_e$ consists of a ResNet-101 backbone and a decoder (Fig. 2). The decoder consists of four convolutional layers; its outputs are fused with the backbone output features, which are denoted as the "shared feature map". This shared feature map is then fed forward to the respective semantics and semantics refinement heads. Following the residual auxiliary block [43] (as in [62]), we place the depth prediction head between the last two convolutional layers of the decoder. In the supplementary materials, we show that our proposed approach is not sensitive to the residual auxiliary block and performs equally well with a standard multi-task learning network architecture (i.e., a shared encoder followed by multiple task-specific decoders). We adopt the Deeplab-V2 [4] architectural design with Atrous Spatial Pyramid Pooling (ASPP) for the prediction heads. We use DC-GAN [49] as our domain discriminator for adversarial learning.

UDA Benchmarks
We use three standard UDA evaluation protocols (EPs) to validate our model: EP1: SYNTHIA → Cityscapes (16 classes), EP2: SYNTHIA → Cityscapes (7 classes), and EP3: SYNTHIA → Mapillary (7 classes). A detailed explanation of these settings can be found in [62]. In all settings, the SYNTHIA dataset [51] is used as the synthetic source domain. In particular, we use the SYNTHIA-RAND-CITYSCAPES split consisting of 9,400 synthetic images and their corresponding pixel-wise semantic and depth annotations. For target domains, we use the Cityscapes [10] and Mapillary Vistas [44] datasets. Following EP1, we train models on 16 classes common to SYNTHIA and Cityscapes; in EP2 and EP3, models are trained on 7 classes common to SYNTHIA, Cityscapes, and Mapillary. We use intersection-over-union to evaluate segmentation: IoU (class-IoU) and mIoU (mean-IoU). To promote reproducibility and emphasize the significance of our results, we report two outcomes: the best mIoU and the confidence interval. The latter is denoted as mean ± std collected over five runs, thus describing a 68% confidence interval centered at the mean. For depth, we use Absolute Relative Difference (|Rel|), Squared Relative Difference (Rel$^2$), Root Mean Squared Error (RMS), its log-variant LRMS, and the accuracy metrics [14] denoted by $\delta_1$, $\delta_2$, and $\delta_3$. For each metric, we use ↑ and ↓ to denote the improvement direction.
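The segmentation metrics above can be made concrete with a short Python sketch that computes class IoU and mIoU from a confusion matrix, plus the mean ± std summary over runs. The function names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator is used.

```python
import math

def confusion_matrix(pred, gt, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        cm[g][p] += 1
    return cm

def class_iou(cm):
    """IoU per class: TP / (TP + FP + FN); NaN for classes that never occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

def mean_iou(ious):
    """Average IoU over the classes that are present."""
    valid = [v for v in ious if not math.isnan(v)]
    return sum(valid) / len(valid)

def mean_std(values):
    """Mean and sample standard deviation over repeated runs."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(var)

cm = confusion_matrix(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
ious = class_iou(cm)
miou = mean_iou(ious)
```

In the toy example class 0 has one true positive and one false positive (IoU 1/2), and class 1 has two true positives and one false negative (IoU 2/3). Reading mean ± std as a 68% interval additionally presumes that the run-to-run variation is roughly normal.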

Experimental Setup
All our experiments are implemented in PyTorch [47]. The backbone network is a ResNet-101 [22] initialized with ImageNet [12] weights. The prediction and discriminator networks are optimized with SGD [2] and Adam [30] with learning rates $2.5 \times 10^{-4}$ and $10^{-4}$, respectively. Throughout our experiments, we use $\lambda_r = 1.0$ and $\lambda_d = \lambda_{adv} = 10^{-3}$. For generating depth bins, we use $Z_{min} = 1$ m, $Z_{max} = 655.36$ m, and $K = 15$. In all ISL experiments, the parameters of the algorithm are $Q_1$ = 65K, $Q_2$ = 5, and $Q_3$ = 5K. The link to the project page with source code is given in the Abstract.

Table 1 reports the semantic segmentation performance of our proposed model trained and evaluated following EP1. For a fair comparison with [55,56,41], we also report results on the 13 classes and the standard 16 classes settings. Our method achieves SOTA performance in EP1 on both 16 and 13 classes, outperforming [62,34] by large margins. We can now identify the major class-specific improvements of our method over the SOTA DADA [62]. The major gains come from the following classes: "wall" (+12.8%), "motorbike" (+10.8%), "bus" (+6.9%), "person" (+5.5%), "rider" (+2.7%), and "car" (+2.0%).

Table 1: Semantic segmentation performance (IoU and mIoU, %) comparison to the prior art. All models are trained and evaluated using the EP1 protocol. mIoU* is computed on a subset of 13 classes, excluding those marked with *. For our method, we report the results of the run giving the best mIoU, as well as the 68% confidence interval over five runs as mean ± std.

Table 2: Semantic segmentation performance (IoU and mIoU, %) comparison to the prior art. All models are trained and evaluated using the EP2 and EP3 protocols at different resolutions, as indicated in the resolution ("Res.") column. For our method, we report the results of the run giving the best mIoU, as well as the 68% confidence interval over five runs as mean ± std.
Moreover, our method shows consistent improvements on classes underrepresented in the target domain: "motorbike" (+10.8%), "pole" (+4.0%), "sign" (+2.7%), and "bicycle" (+1.8%). Fig. 3 shows the results of the qualitative comparison of our method with DADA [62]. Note that our model delineates small objects like "human", "bicycle", and "motorbike" more accurately than DADA.

Figure 3: Our method demonstrates notable improvements over [62] on "bus", "person", "motorbike", and "bicycle" classes as highlighted using the yellow boxes.

EP2 and EP3
Table 2 presents the semantic segmentation results in the EP2 and EP3 benchmarks. The models are evaluated on the Cityscapes and Mapillary validation sets on their common 7 classes. We also train and evaluate our model at the 320 × 640 resolution to obtain a fair comparison with the reference low-resolution models. In a similar vein, the proposed method outperforms the prior works in the EP2 and EP3 benchmarks for both full- and low-resolution (320 × 640) settings. We further show in Sec. 4.5 that our approach achieves state-of-the-art performance without ISL in EP2 and EP3 in both full- and low-resolution settings. The proposed CTRL coupled with SRH demonstrates consistent improvements over three challenging benchmarks by capitalizing on the inherent semantic and depth correlations. In EP2 and EP3, our models show noticeable improvements over the state-of-the-art [62] with mIoU gains of +3.1% (EP2-full-res), +2.0% (EP2-low-res), +4.2% (EP3-full-res), and +6.1% (EP3-low-res). Despite the challenging domain gap between SYNTHIA and Mapillary, our model shows a significant improvement (+6.1%) in the low-resolution setting, which suggests robustness to scale changes.

Ablation Studies
A comprehensive ablation study is reported in Table 3. We trained 11 different models, each having a different configuration; these are denoted as C1, ..., C11. We use the following shortcuts in Table 3 to represent different combinations of settings: "Sem" -semantic, "Dep" -depth, "Sup" -supervision, "Adv" -adversarial, and "Conf" -configuration. Configurations C1 to C4 denote supervised learning settings without any adversarial training. These models are trained on the SYNTHIA dataset and evaluated on Cityscapes validation set. Configurations from C5 to C7 denote different combinations of supervised and adversarial losses on the semantics, depth, and semantics refinement heads. C8 is the proposed model with CTRL, but without ISL. C9 to C11 are models trained with SL or ISL with or without SRH. C5 to C11 follow EP1 protocol: SYNTHIA → Cityscapes UDA training and evaluation setting.
C1 is trained using semantics label supervision without any depth information or adversarial learning. By enabling parts of the model and training procedure, we observed the following tendencies: C2 and C3: depth supervision (either direct or through SRH) improves performance; C4: however, adding SRH on top of the depth head in the supervised learning setting does not bring improvements; C5: effectiveness of entropy map domain alignment in semantics feature space [61]; C6 and C7: domain alignment in the depth or refined semantics feature spaces does not bring any further improvements; C8: a combination of depth and SRH with task-specific semantics improves the performance (i.e., our CTRL model); C9: SL brings further improvement but not as good as with our ISL training scheme; C10: emphasizes the improvement over C6 with ISL enabled; C11: positive contribution of the SRH towards improving the overall model performance. Finally, we achieve state-of-the-art segmentation results (mIoU 44.9%) by combining the proposed CTRL, SRH, and ISL (configuration C11).

Table 5: Improvement over the state-of-the-art [62] in monocular depth estimation. The models are trained following the SYNTHIA → Cityscapes (16 classes) UDA setting w/o ISL and evaluated on the Cityscapes validation set.

Effectiveness of the Joint UDA Feature Space
This section analyzes the effectiveness of the joint feature space learned by CTRL for unsupervised domain alignment. We train and evaluate our CTRL model without ISL on two UDA benchmarks: (a) EP2: SYNTHIA to Cityscapes 7 classes (S → C) and (b) EP3: SYNTHIA to Mapillary 7 classes (S → M), in both full- and low-resolution (FR and LR) settings. In Table 4, we show the segmentation performance of our model on these four benchmark settings and compare it against the state-of-the-art DADA model [62]. The proposed CTRL model (w/o ISL) outperforms the DADA model with mIoU gains of +1.5%, +0.1%, +1.3%, and +2.9% on the four UDA benchmark settings, attesting to the effectiveness of the joint feature space learned by the proposed CTRL.
Besides, we train both DADA and our model with ISL and notice improvements in both the models with mIoU 43.5% (DADA) and 44.9% (ours). The superior quality of the predictions of our model, when used as pseudo labels, provides better supervision to the target semantics; the same can be observed in both our quantitative (Tables 1 and 2) and qualitative results (Figs. 3 and 4).

Monocular Depth Estimation Results
In this section, we show that our model not only improves semantic segmentation but also learns a better representation for monocular depth estimation. This property is of great importance for multi-task learning. According to [43], paying too much attention to depth is detrimental to the segmentation performance. Following [43], DADA [62] uses depth as purely auxiliary supervision. We observed that the depth predictions of [62] are noisy (as also acknowledged by the authors), resulting in failure cases. We conjecture that a proper architectural design choice coupled with a robust multi-tasking feature representation (encoding task-specific and cross-task relationships) improves both semantics and depth. In Table 5, we report the depth estimation evaluation results of our method on the Cityscapes validation set and compare them against the DADA model [62]. Training and evaluation are done following the EP1 protocol: SYNTHIA → Cityscapes (16 classes). We use Cityscapes disparity maps as ground truth depth pseudo-labels for evaluation. Table 5 demonstrates a consistent improvement of depth predictions with our method over [62].
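The depth metrics reported in Table 5 follow standard definitions, which the Python sketch below spells out on flat lists of positive depth values. The function name is hypothetical, and details of the actual evaluation (valid-pixel masking, depth caps, conversion from disparity) are not specified here and are left out.

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over paired positive depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(g - p) / g for p, g in pairs) / n          # |Rel|, lower is better
    sq_rel = sum((g - p) ** 2 / g for p, g in pairs) / n         # Rel^2, lower is better
    rms = math.sqrt(sum((g - p) ** 2 for p, g in pairs) / n)     # RMS, lower is better
    log_rms = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for p, g in pairs) / n
    )                                                            # LRMS, lower is better
    ratios = [max(g / p, p / g) for p, g in pairs]
    # delta_k: fraction of pixels whose ratio is below 1.25^k, higher is better
    deltas = [sum(1 for r in ratios if r < 1.25 ** k) / n for k in (1, 2, 3)]
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rms": rms,
            "log_rms": log_rms, "delta": deltas}

metrics = depth_metrics(pred=[2.0, 4.0], gt=[2.0, 5.0])
```

In the toy example the second pixel has a ratio of exactly 1.25, so it fails the strict $\delta_1$ test but passes $\delta_2$ and $\delta_3$.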

Conclusion
We proposed a novel approach to semantic segmentation and monocular depth estimation within a UDA context. The main highlights of this work are: (1) a Cross-Task Relation Layer (CTRL), which learns a joint feature space for domain alignment; the joint space encodes both task-specific features and cross-task dependencies shown to be useful for UDA; (2) a semantic refinement head (SRH), which aids in learning task correlations; (3) a depth discretization technique, which facilitates learning distinctive relationships between different semantic classes and depth levels; (4) a simple yet effective iterative self-learning (ISL) scheme, which further improves the model's performance by capitalizing on the high-confidence predictions in the target domain. Our comprehensive experimental analysis demonstrates that the proposed method consistently outperforms prior works on three challenging UDA benchmarks by a large margin.

In this document, we provide supplementary materials for our main paper submission. First, Sec. 6 provides a bird's-eye view of the assumed UDA setting and how CTRL fits into it. The main paper reported our experimental results using three standard UDA evaluation protocols (EPs) where the SYNTHIA dataset [51] is used as the synthetic domain. To demonstrate our proposed method's effectiveness on an entirely new UDA setting, in Sec. 7, we report semantic segmentation results of our method on a new EP: Virtual KITTI → KITTI. In this setup, we use synthetic Virtual KITTI [17] as the source domain and real KITTI [20] as the target domain. We show that our proposed method consistently outperforms the SOTA DADA method [62] when evaluated on this new EP with different synthetic and real domains. In Sec. 8, we present a t-SNE [57] plot comparing our method with [62]. We also share additional qualitative results on SYNTHIA → Cityscapes (16 classes). Sec. 9 details our network design.
To demonstrate that the proposed CTRL is not sensitive to a particular network design (in our case, the residual auxiliary block [43]), we train a standard multi-task learning network architecture (i.e., a shared encoder followed by multiple task-specific decoders without any residual auxiliary block) with CTRL and notice a similar improvement trend over the baselines. The set of experiments and the results are discussed in Sec. 10.

Overview of the UDA setting
Unsupervised Domain Adaptation (UDA) aims at training high-performance models with no label supervision on the target domain. As seen in Fig. 5, label supervision is applied only on the source domain predictions, whereas tuning the model to perform well on the target domain is the task of adversarial supervision. Since both types of supervision are applied within the same training protocol, adversarial supervision is responsible for teaching the model the specificity of the target domain by means of bridging the domain gap. When dealing with multi-modal predictions, it is crucial to correctly choose the joint feature space subject to adversarial supervision. CTRL provides such a rich feature space, which allows training much better models using the same training protocols. This allows us to leverage the abundance of samples in the synthetic source domain and produce high-quality predictions in the real target domain.

Virtual KITTI → KITTI
Following [7], we train and evaluate our model on 10 common classes of Virtual KITTI and KITTI. In KITTI, ground-truth labels are only available for the training set; thus, we use the official unlabelled test images for domain alignment. We report the results on the official training set following [7]. The model is trained on the annotated training samples of VKITTI and unannotated samples of KITTI. For this experiment, we train our model without (w/o) ISL. Table 6 reports the semantic segmentation performance (mIoU%) of our approach. Our model outperforms DADA [62], with significant gains coming from the following classes: "sign" (+8.1%), "pole" (+5.7%), "building" (+2.7%), and "light" (+1.9%). Notably, these classes are highly relevant in practice to an autonomous driving scenario. In Figure 7, we present some qualitative results of DADA and our models trained following the new Virtual KITTI → KITTI UDA protocol.
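For reference, the mIoU metric reported above can be computed from a confusion matrix as sketched below. This is a generic illustration, not our evaluation script: the function names and the `ignore_index` value of 255 are assumptions, and labels are given as flat lists of class indices.

```python
def confusion_matrix(gt, pred, num_classes, ignore_index=255):
    """Count (ground-truth, prediction) pairs over flat label sequences."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        if g == ignore_index:  # skip unlabelled pixels (assumed convention)
            continue
        conf[g][p] += 1
    return conf


def per_class_iou(conf):
    """IoU = TP / (TP + FP + FN) per class; None for classes never seen."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else None)
    return ious


def mean_iou(conf):
    """Mean IoU in percent, averaged over classes that occur at least once."""
    valid = [iou for iou in per_class_iou(conf) if iou is not None]
    return 100.0 * sum(valid) / len(valid) if valid else 0.0
```

Because mIoU is an unweighted mean over classes, gains on rare classes such as "sign" or "pole" count as much as gains on frequent ones.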

SYNTHIA → Cityscapes
This section presents a t-SNE [57] plot of the feature embeddings learned by the proposed model guided by CTRL, and by [62]. Fig. 6 shows the 10 top-scoring classes of each method; distinct classes are circled. As can be seen from the figure, CTRL leads to a more structured feature space, which concurs with our analysis in the main paper. Both models are trained and evaluated following the UDA protocol SYNTHIA → Cityscapes (16 classes). Furthermore, we present additional qualitative results of our model for semantic segmentation and monocular depth estimation. Figures 8 and 9 show a qualitative comparison of our method with [62]. Note that our proposed method has higher spatial acuity in delineating small objects like "human", "bicycle", and "person" compared to [62]. Figure 10 shows some qualitative monocular depth estimation results.

Network Architecture Design
The shared part of the semantic and depth prediction network, F_e, consists of a ResNet-101 backbone and a decoder. The decoder consists of four convolutional layers, each followed by a Rectified Linear Unit (ReLU). The decoder outputs a feature map that is shared between the semantic and depth heads. This shared feature map is fed forward to the respective semantic segmentation, monocular depth estimation, and semantic refinement heads. For the task-specific and task-refinement heads, we use Atrous Spatial Pyramid Pooling (ASPP) with sampling rates [6,12,18,24] and the DeepLabV2 [4] architecture. Our DC-GAN [49] based domain discriminator takes as input a feature map with channel dimension 2 × C + K, where C is the number of semantic classes, and K is the number of depth levels.
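The channel arithmetic of the discriminator input can be made concrete with a small sketch. It assumes that the 2 × C + K channels arise from concatenating two C-channel semantic maps (one from the semantic head, one from the semantic refinement head) with one K-channel map over depth levels; this decomposition, the function names, and the value K = 15 used in the usage note are illustrative assumptions.

```python
def discriminator_in_channels(num_classes, num_depth_levels):
    """Channel dimension of the discriminator input: 2 * C + K."""
    return 2 * num_classes + num_depth_levels


def joint_feature(sem, sem_refined, depth_levels):
    """Concatenate per-pixel vectors along the channel axis.

    sem, sem_refined: length-C vectors (assumed to come from the semantic
    and semantic refinement heads, respectively).
    depth_levels: length-K vector over discretized depth levels.
    """
    if len(sem) != len(sem_refined):
        raise ValueError("both semantic vectors must have C entries")
    return list(sem) + list(sem_refined) + list(depth_levels)
```

For example, with C = 16 (the SYNTHIA → Cityscapes 16-class protocol) and a hypothetical K = 15, `discriminator_in_channels(16, 15)` gives 47 input channels.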

Robustness to Different Network Design
Our proposed model adopts the residual auxiliary block [43] (as in [62]), which was originally proposed to tackle a particular MTL setup where the objective was to improve one primary task by leveraging several other auxiliary tasks. However, unlike [62], which does not have any decoder for depth, we introduce a DeepLabV2 decoder for depth estimation to improve the performance of both tasks. Our qualitative and quantitative experimental results show an improvement in depth estimation performance over [62]. Furthermore, we are interested in the proposed model's performance when used with a standard MTL architecture (a common encoder followed by multiple task-specific decoders without any residual auxiliary blocks). To this end, we modify our existing network design to obtain a standard MTL network design. We then train it following the UDA protocols. The details of our experimental analysis are given below.
For the standard MTL model (denoted as "Ours*" in Table 7), the depth head is placed after the shared feature extractor F_e. The shared feature extractor consists of a ResNet backbone and a decoder network (see Fig. 2). For the second model, with the residual auxiliary block (denoted as "Ours"), we position the depth head after the decoder's third convolutional layer. The semantic segmentation performance of these two variants of the proposed model is shown in Table 7. Both models are evaluated on five different UDA protocols and outperform the state-of-the-art DADA [62] results. The results show that our proposed CTRL is not sensitive to architectural changes and can be used with standard encoder-decoder MTL frameworks. Our findings may benefit the domain-adaptive MTL community, e.g., in answering the question of whether learning additional complementary tasks (surface normals, instance segmentation) helps domain alignment.
[62] predictions; (d) our model predictions. Our method demonstrates notable improvements over [62] on "bus", "person", and "bicycle" classes, as highlighted by the yellow boxes.