Systematic Calibration of a Convection‐Resolving Model: Application Over Tropical Atlantic

Non‐hydrostatic km‐scale weather and climate models show significant improvements in simulating clouds and precipitation, especially of convective nature. However, even km‐scale models need to parameterize some physical processes and are thus subject to the corresponding parameter uncertainty. Systematic calibration has the advantage of improving model performance with transparency and reproducibility, thus benefiting model intercomparison projects, process studies, and climate‐change scenario simulations. In this paper, the regional atmospheric climate model COSMO v6 is systematically calibrated over the Tropical South Atlantic. First, the parameters' sensitivities are evaluated with respect to a set of validation fields. Five of the most sensitive parameters are chosen for calibration. The objective calibration then closely follows a methodology previously used for regional climate simulations. This includes simulations considering the interaction of all pairs of parameters, and the exploitation of a quadratic‐form metamodel to emulate the simulations. In the current set‐up with 5 parameters, 51 simulations are required to build the metamodel. The model is calibrated for the year 2016 and validated in two different years using slightly different model setups (domain and resolution). Both years demonstrate significant improvements, in particular for outgoing shortwave radiation, with reductions of the bias by a factor of 3–4. The results thus show that parameter calibration is a useful and efficient tool for model improvement. Calibrating over a larger domain might help to further improve the overall performance, but could potentially also lead to compromises among different regions and variables, and require more computational resources.

then returning equatorwards. This is one of the most important circulations in our climate system that functions as an atmospheric heat engine, and many studies have demonstrated that the spatial organization of subtropical and tropical clouds associated with the Hadley circulation can be represented more credible at high resolutions (Bretherton & Khairoutdinov, 2015;Heim & Hentgen, 2021). This concerns especially shallow cumulus and stratocumulus clouds (Hohenegger et al., 2020).
In spite of these improvements when going toward higher resolution, there are still some challenges. Although CRMs run at a relatively high resolution (typically less than 4 km grid spacing) (Prein et al., 2015), some processes still need to be parameterized, such as cloud microphysics and turbulence (Schär et al., 2020), which are approximations of subgrid-scale processes and rely on semi-empirical parameters that are poorly constrained by observations. Thus, when applying CRMs over the tropics, the simulations are subject to high parametric uncertainty related to poorly confined model parameters. Model results could be very sensitive to those parameters (Di et al., 2015). In practice, the values of uncertain parameters are determined using subjective expert tuning. Normally, the tuning does not follow a unique well-defined methodology (Hourdin et al., 2017). Subjective model tuning implies some difficult challenges. For instance, differences in model results reflect both differences in model structure (such as dynamical cores and type of parameterizations) and model tuning, thereby hazing the value of model intercomparison projects. This is particularly important for cloud-radiative feedback, as the magnitudes of the anthropogenic forcing and cloud-radiative feedbacks are small, often smaller than the systematic model biases in terms of radiation budget (Stocker, 2014).
Compared with subjective tuning, systematic calibration methods, using a predefined mathematical framework to perform model tuning, possess the advantage of making the process more explicit and reproducible (Hourdin et al., 2017). The framework encompasses the validation strategy, the set of to-be-calibrated parameters, and the modeling strategy (period and domain). Within such a stipulated framework, the calibration is objective, but the definition of the framework is subjective. Thus, to ensure a valid intercomparison of different model versions (e.g., different resolutions or parameterizations) and an assessment of the parametric uncertainty, a systematic model calibration method is preferable (Bellprat et al., 2012(Bellprat et al., , 2016García-Díez et al., 2015).
Current calibration techniques mainly include two categories in terms of the optimization (Hourdin et al., 2017). One is fast optimization of some cost function, evaluating model performance given specific metrics like averaged radiation or precipitation (Bellprat et al., 2012;Bracco et al., 2013;Duan et al., 2017;Gorman & Oliver, 2018;Langenbrunner & Neelin, 2017;Liu et al., 2022;Neelin et al., 2010;Tett et al., 2017). The other, instead of trying to find the optimum parameter setting, involves using Bayesian approaches to provide the uncertainty for the parameters (Bony & Dufresne, 2005;Couvreux et al., 2021;Rougier, 2007;Salter et al., 2019;Sanderson, 2011;Sexton et al., 2012). Except for some studies that use particle-based approaches (Lee et al., 2020) or adaptive sampling algorithms (Phipps et al., 2021), most of the research uses emulators, mapping model inputs with outputs to reduce computational resources. In terms of the emulators, the calibration methods can also be divided into those that use statistical models (Voudouri et al., 2021) and machine learning methods (Li et al., 2019).
In this study, we choose a fast optimization method given limited computational resources, and applied a simple statistical emulator for clearer input-output relationships. We systematically calibrated the non-hydrostatic fully compressible limited-area model of the Consortium for Small-Scale Modeling (COSMO) in climate mode (Doms & Förstner, 2004;Steppeler et al., 2003) and obtained optimal parameter settings over the tropical Atlantic. The objective of this study is to examine the potential of systematic calibration in improving the model performance of cloud simulation over the tropics. Future applications will address the role of cloud-radiative feedbacks in climate change.

Numerical Simulations
We use the European Center for Medium-Range Weather Forecast (ECMWF) Re-Analysis (ERA5) data (Hersbach et al., 2020) as initial conditions and lateral boundaries to drive the COSMO v6 model. The parameterization schemes applied are similar as in Heim and Hentgen (2021): deep and shallow convection parameterizations are switched off, radiative fluxes are computed following the δ-two-stream approach after Ritter and Geleyn (1992), the single-moment bulk scheme after Reinhardt and Seifert (2006) is used as cloud micro-  (Raschendorfer, 2001) is employed for the computation of subgrid-scale vertical turbulent flux and we use prescribed sea-surface temperature over the ocean.
All simulations are run with 60 vertical levels and a horizontal grid spacing of 4 km. The sensitivity and calibration simulations are performed over domain D01 as displayed in Figure 1 with a size of 1,000 × 575 grid columns. The simulation period covers 4 months (February, May, August, November) in 2016, each with a 5-day-spin-up period. Based on previous calibration studies (Russo et al., 2020;Voudouri et al., 2018), 13 parameters that are thought to exert a significant impact on model results were tested, shown in Table 1. In the end, five of these parameters are selected for calibration, and the reasoning is elaborated in Section 3.1. For validation of the optimized parameter setting, we proceed in two steps. First we present a validation over D01 with the same set-up as for the calibration. Second, we use a larger validation domain D02 at a refined horizontal grid spacing of 3 km. It has a size of 2,750 × 2,065 grid columns. Both validation periods consider another year than the one used in calibration, to avoid overfitting of parameters.

Calibration Methodology
The calibration with N parameters optimizes the parameter choice in an N-dimensional cube spanned by the min/max ranges of the selected parameters (see Table 1). To construct a metamodel, we follow previous studies (Bellprat et al., 2016;Voudouri et al., 2018) and employ the following set of simulations: the default simulation (all parameters at default value), pairs of sensitivity simulations (one parameter changed to min/max values), and quadruplets of interaction simulations (two parameters changed to min/max values). The total number of simulations is then 1 + 2N + 2N(N − 1) = 2N 2 + 1, and for N = 5 this yields 51 simulations. Based on this set of simulations, a metamodel is constructed, and

Table 1
Perturbed Parameters the optimal value of the parameters is selected. The restriction to using only quadratic interactions (with two non-default values) in the set of simulations is consistent with the choice of the metamodel (see below). The set of simulations considered in the current study is shown in Table 2. The technical details of the calibration closely follow Bellprat et al. (2012). Significant differences concern the choice of the validation data, differences in the performance score, and the use of scaled parameter ranges (see below).

Performance Score
Since the target is to improve cloud-related performance, top of atmosphere (TOA) radiative fluxes (outgoing longwave radiation (OLR) and outgoing shortwave radiation (OSR)) are chosen to calibrate the model results.
Besides, the surface latent heat flux (LHFL) is also included as a target validation field, since it plays an important role in humidifying the atmosphere. Furthermore, LHFL also enables us to take a surface field into consideration, apart from the TOA fields. The TOA observation data is from Satellite Application Facility on Climate Monitoring (CM SAF) (Schulz et al., 2009). Since LHFL observation data is limited, ERA5 reanalysis data (Hersbach et al., 2020) is used to constrain this field. This special choice of validation data is owed to the limited availability of in-situ observations in the area of interest. A critical element of this choice is the use of ERA5 data for LHFL. The use of such data in the calibration hinges upon an appropriate estimate of the data's uncertainties.
The variables are evaluated using monthly means, averaged spatially for 28 rectangular regions (5° × 5° each, 4 rows and 7 columns over the calibration domain D03 as displayed in Figure 1). The error of these time series is measured using a performance score (PS): . (1) The . The σ ϵ of LHFL is from the standard deviation of the ERA5 assimilation ensemble members, which provides background-error estimates for the deterministic reanalysis system (Hersbach et al., 2019(Hersbach et al., , 2020. The best simulations will show the largest values of PS. Table 3 displays the σ o and σ ϵ used for the calibration.

Metamodel
Since direct simulations with the convection-resolving model (  MM is based on the assumption that the climate model results from parameter perturbation are smooth and can be approximated by a second order polynomial regression. Interactions of parameter perturbations are approximated by a non-linear term for each parameter pair.
Relative parameter values μ * and model fields Φ * are used as independent and dependent variables separately to fit the MM. For each field, month and grid point, the corresponding formulations can be written as: where subscripts def and p refer to default and perturbed parameter values. The MM is then given bŷ where f MM indicates the polynomial function of the MM. The hat symbol above Φ indicates that the MM is not a perfect replication of the model and the corresponding field is an estimation. f MM includes one linear and one quadratic term for each relative parameter, and also one interactive term for every parameter pair (First order for each parameter in the pair). This can be expressed in vector notation aŝ where the vector a contains the N linear coefficients for each parameter, and the matrix B includes coefficients for N quadratic terms on its diagonal and for N(N − 1)/2 interactive terms in the off-diagonal elements (with the general assumption B i,j = B j,i ). Together this yields N(N + 3)/2 coefficients defining the MM. For example, if two parameters (μ 1 , μ 2 ) are calibrated, f MM would bê Φ * = 1 1 + 2 2 + 2 1 1 + 2 2 2 + 2 1 2 1,2, where a 1 , a 2 , b 1 , b 2 and b 1,2 are coefficients to be solved for.
As discussed above, a total of 2N 2 + 1 simulations are used to fit the MM, which is more than the number N(N + 3)/2 of unknown coefficients. The resulting linear system of equations is thus overdetermined, and optimal interaction parameters are estimated using least squares error measures.
In general, the default value μ def will not be in the center of the parameter range [μ min , μ max ], and this may lead to unsatisfactory results when fitting the MM. Parabolic fitting works best with a default value at the center, therefore we applied a logarithmic transformation of parameter values to fit the MM as in Voudouri et al. (2018), where α and β are determined such that = ( min + max) ∕2 .
After the construction of the MM, 3,000,000 parameter sets are sampled with the Latin hypercube design (McKay et al., 2000). The set of parameter values with maximum PS was chosen as the optimal parameter set. Figure 2 presents the PS's of the sensitivity tests of the 13 parameters. The default PS (the black dots) indicates that LHFL performance is quite good, which is reasonable since we use the prescribed sea-surface temperature. Besides, as the domain D01 is mainly affected by low clouds, which hardly modify emitted longwave radiation from surface, the longwave radiation performance is also good. One of the target is to improve the representation of low clouds, which is related to variations in the OSR-field. Therefore, when choosing the final parameters for calibration, the ones that strongly impact OSR are the priority. Based on this figure the following parameters are selected for the calibration: tur_len, clc_diag, cloud_num, qi0 and rat_sea.

of 15
The choice follows the following considerations: First, tur_len, clc_diag and cloud_num have the largest potential in increasing OSR performance, with the largest OSR PS around 0.6. We also include two parameters to constrain OLR and LHFL. OLR is most sensitive to qi0, which controls the autoconversion of cloud ice and has almost no impact on OSR and LHFL. This makes qi0 a suitable parameter for calibration. For LHFL, rlam_heat and rat_sea exert the most significant impact. Since they have a similar impact over the ocean (rlam_heat controls the overall latent heat flux and rat_sea is a scaling factor exerted on the rlam_heat to distinguish sea and land) and the domain  The drastic impact of qi0 on longwave radiation can be seen when setting it to the maximum value. Because larger qi0 indicates less conversion of cloud ice to precipitable snow and more cloud ice would accumulate, thus preventing longwave radiation from escaping. The remaining parameters effectively control the shortwave radiation.

Calibration Results
Once the coefficients of the metamodel have been determined from the calibration simulations, the optimal parameter setting is chosen based on a sampling of the five-dimensional cube. Figure 4 shows the resulting distribution of the PS. The PS increases from the default 0.62 (black line) to the optimum 0.86 (red line). This improvement is very substantial, but will require independent validation (see Section 3.3). Figure 5 displays the corresponding distributions of PS as a function of the parameters. The default and optimized parameter values are shown by the black and red vertical lines. Results show that the parameter qi0 mainly affects high clouds and controls longwave radiation. Increasing qi0 results in lower values for OLR due to larger cloud ice content. The parameter for computing the rate of cloud liquid water in unsaturated cases (clc_diag) approaches 1, which indicates no subgrid-scale clouds. That is reasonable for high-resolution modeling due to smaller grid cells. The optimal value for tur_len is a bit lower than its default. This leads to less vertical mixing within the planetary boundary layer. This indicates decreased moisture supply and cloud amount. Besides, turbulence also affects the boundary layer stability and the inversion height (Heim & Hentgen, 2021), which indirectly influences the amount of low clouds. A shallower boundary layer favors the formation of low clouds, especially of persistent stratocumulus decks, yet a too shallow boundary layer top might be lower than the surface-determined lifting condensation level (LCL) and thus not allow clouds to form (Wood, 2012). Lower values of rat_sea favor higher surface latent heat fluxes. Clouds react to decreased rat_sea mainly in two ways. One is higher PBL moisture which allows for more cloud water. The other is decreased boundary layer stability, which may not favor the formation of low clouds. Furthermore, a lower value of cloud_num results in a larger cloud droplet size. That leads to increased precipitation, and might thus decrease cloud amount. In the mean time, reduced cloud_num  also suppresses buoyant turbulence kinetic energy (TKE) production, thus may decrease cloud-top entrainment and increase cloud amount (Ackerman et al., 2004;Coakley Jr. & Walsh, 2002). Besides, the blue histograms in Figure 5 indicate parameter convergence to some degree. Most of the parameters converge into a certain range. Except for tur_len, whose distribution seems to have a relatively large range, indicating that the results are not very sensitive to this parameter.
Results also demonstrate that the joint calibration of the parameters is superior in comparison to calibrating the parameters individually. Calibrating a parameter individually means to select its value where it has maximum performance score in Figure 2. The resulting parameter values are illustrated by the green lines in Figure 5. For two of the five parameters there is a significant difference to the joint calibration. Moreover, the performance of  single-parameter calibration is also displayed in Figure 4 with the dashed green line. The performance is much lower than the joint five-parameter calibration, in which the parameter interactions are considered. This indicates that parameter interactions are important and tuning parameters individually one by one might not be able to exploit the model's potential.

Robustness of the Optimized Parameter Setting
To verify the calibration and the key result in Figure 4, the default simulation for the year 2016 has been repeated with the calibrated parameter settings. This confirmed the results and showed an improvement in PS from 0.62 before calibration, to 0.86 after calibration. The agreement with the metamodel is surprisingly good, as the optimal performance score is missed by less than a percent.
To test whether the calibrated parameter setting also works for another year, Figure 6 displays the comparison between simulations using the optimized parameter setting as described before and the default simulation during four full seasons in 2013 with domain D03: December, January and February (DJF), March, April and May (MAM), June, July and August (JJA), September, October and November (SON). The model performance is significantly improved in all seasons for shortwave radiation and surface latent heat flux. OLR is mainly affected by high clouds, whereas the spatial domain is dominated by low clouds for most of the seasons. Therefore, the change in OLR is minor. In MAM, when the ITCZ is southernmost and partially within the simulation domain, there is a significant underestimation of OLR, and an increase of the bias with the calibration. This kind of effect is to be expected with, as with the use of a PS there may be compensation of errors. In this particular case, the large OSR bias in the default is being reduced, but at the cost of increases in the OLR bias. The underestimation in MAM is mainly due to the overestimated ice cloud in the ITCZ. Therefore, the longwave radiation bias in MAM might indicate a deficiency of the model in simulating the high clouds with the same set of optimum parameters obtained over the current domain (since more weight is given to the low clouds due to the selection of the domain). However, overall PS is reduced, corresponding to a net reduction of the weighted overall bias. The daily bias over the domain D03 in 2013 is presented in Figure 7. For the longwave radiation, the bias is almost the same between the optimum and default setting for most of the time. However, in April and May, where the ITCZ moves to the Southernmost, the bias with the optimum parameter setting is significantly higher than with the default setting. For shortwave radiation, there is a systematic decrease of bias using the optimum parameter setting, especially in austral winter and spring, when low cloud prevails. It should be noted that the consideration of daily biases includes biases due to predictability limitations and chaotic processes in the model domain.
To further explore how robust the optimum parameter setting is, we use another year (2006) and an extended simulation domain (D02 as displayed in Figure 1) for validation. Due to the limitation of computational resources, we only simulated 4 months (February, May, August, November) to represent each season. Figure 8 shows the comparison between the optimized parameter setting and the default ones averaged over the 4 months (grid points close to the borders are discarded as spacial spin-up). Table 4 lists the biases for the simulations with the optimum and default setting for 2006 over the whole domain D02 (grid points close to the borders are discarded as spacial spin-up) and calibration domain D03. Within D03 (Figure 1), the performance improved substantially, where the OSR bias decreased from 25, 25, 36, 53 W m −2 under the default setting to 4, 12, 2, 3 W m −2 under the optimum setting in February, May, August, November respectively. OLR performance has also improved, except for May. The deteriorated underestimation of OLR in May with the optimum setting might be due to the impact of the ITCZ, which is a similar case as the validation results in 2013 ( Figure 6). These results indicate that the optimum parameter setting is robust for different years and slightly different resolutions (4 vs. 3 km). When taking the remaining part of the domain D02 ( Figure 1) into consideration, the performances still improve significantly for OSR and LHFL. The four months average bias in 2006 decreased from 13 to −1 W m −2 for OSR and from −5 to 2 W m −2 for LHFL. For OLR, it is evident that the optimum simulation underestimates OLR over the ITCZ (Figure 8), and Table 4 shows that overall D02 domain average OLR is underestimated in all four months. Because D02 encompasses the ITCZ during all four months. This is consistent with the aforementioned result that the set of parameters that suits low clouds over sea might not apply as well for ITCZ.

Summary and Conclusions
In this paper, the regional climate model COSMO v6 was systematically calibrated over the Tropical South Atlantic. First, the most sensitive parameters were identified with respect to the target fields that are important for the representation of clouds (shortwave/longwave radiation and surface latent heat flux). Based on single-parameter sensitivity studies, a total of 5 parameters were selected for calibration. The calibration is based on single-parameter sensitivity experiments and simulations considering quadratic interactions. A polynomial metamodel (MM) is then used to emulate the model simulations. We applied Latin hypercube sampling and chose the set of parameters with the best performance score (PS) as the optimal parameter set. We calibrated the COSMO v6 model in 2016 and validated the results in 2013 and 2006 in two different computational domains. With the calibrated optimal parameter settings, the performance improved significantly compared to the default parameter setting, especially for OSR. Even when we applied the optimal setting over a significantly extended domain with a slightly higher resolution (3 vs. 4 km), the optimal setting also showed significant improvements. However, since the calibration domain covers only oceans, and since the impact of the ITCZ is small in this region, applying the obtained optimal parameter setting over land and the northern part of the domain encounters problems, especially for OLR, which is highly relevant with ITCZ high clouds. Thus, calibrating over a larger domain might improve the overall performance, but would potentially also lead to compromises among different regions and variables, and would require more computational resources to achieve improved results for the whole domain. Besides, we also conducted tests to see how sensitive the optimum parameter value is to the length of the calibration period. This might provide information for future calibration studies in balancing computational cost and calibration performance. Detailed results can be found in the supplementary document.
Besides the aforementioned performance improvements, another advantage of systematic calibration is that it could benefit model intercomparisons, process studies and climate-change scenario simulations. The traditional way of tuning a model does not follow a unique well-defined methodology and thus hazes the value of model intercomparisons. Instead, systematic calibration, based on a well-defined methodology, is promising in constraining parameterization-related uncertainties with transparency and reproducibility.
Conducting model calibrations with regional climate model (RCM) simulations driven by prescribed lateral boundary conditions from reanalysis fields, as presented in the current study, provides substantial advantages over using calibration with global climate models (GCMs). In a GCM there will in general be significant circulation biases. For instance, biases in polar regions will affect the circulation in tropical regions, and a calibration will at least partly attempt to compensate for associated circulation biases. With RCMs driven by reanalyses, the calibration targets the parameterization suite with realistic large-scale circulations. As a result, the RCM approach requires much shorter calibration and validation periods, as demonstrated by our study. Indeed, we used merely 4 months of a particular year for the calibration, and have demonstrated that this significantly improves simulations in other years and extended domains. It is thus attractive to consider a combined GCM/RCM calibration framework, that considers both approaches. Indeed, there is an increasing number of GCMs that are available in both limited-area and global configurations, such as the ICON model (Pham et al., 2021) or the Unified Model (Bush et al., 2020). With such models, it is feasible to combine RCM-style calibrations in subdomains. For instance, one could calibrate boundary-layer and warm microphysics parameters over tropical oceans, snow and ice microphysics parameters over polar regions, and land-surface parameters over major continental regions. We believe that this kind of approach would be superior in comparison with conventional GCM model tuning, and yield a more physically-based set of model parameters.  There are a number of fundamental limitations with model calibration. First of all, it can only improve parameterization-related model performance of the subjectively predefined validation fields. It is thus important to select a broad range of validation data sets. Second, there are compensations of errors between different variables and areas. Since the model itself is not perfect (i.e., will have biases irrespective of the parameter choices), compensation of errors cannot be completely avoided. However, the current methodology does so objectively and accounts for observational uncertainties. Third, emulators are necessary within the calibration framework, since it is impossible to traverse the parameter space with the climate model. In this study, we used deterministic polynomial regression to build the emulator, which already provided enough accuracy as indicated in Section 3.3, but emulators inevitably bring in uncertainties. Nevertheless, we believe that the results achieved in this study are very promising and suggest that climate models should undergo systematical model calibration.
Note that the open-source code provided with this paper is independent of the target model, the geographical region, and the validation fields. It could easily be applied to the calibration of other models, in other regions, and even in other research domains.