Performance assessment of using various solar radiation data in modelling large-scale solar thermal systems integrated in district heating networks

The use of solar radiation data models is widespread in energy system analysis; however, a gap exists when assessing their impact in modelling large-scale solar thermal systems integrated in district heating (DH) systems. Therefore, this study presents an analysis of how using satellite-based radiation data models (SARAH), reanalysis models (CFSR, ERA5 and MERRA2) and other data models (Danish Reference Year, DRY) affects the modelling of these systems. Taking three DH plants in Denmark as study cases, the radiation measured between 2016 and 2019 is used. Using energyPRO-based mathematical models of the systems, heat outputs are calculated and compared with measured data. Moreover, the yearly DH plant operational cost is calculated to observe the economic impact of using inaccurate models. It is found that heat production assessments based on the SARAH model show a better agreement with measured data than the reanalysis-based ERA5, MERRA2 and CFSR models. The empirically-based DRY shows low errors when observing its yearly values but has a higher inaccuracy on the hourly level, providing inaccurate operation profiles of the plant. Additionally, the satellite-based solar data model SARAH is further analyzed to identify patterns of its inaccuracy. After comparing it with 18 locations in Denmark using month-hourly profiles, no error trend can be identified, supporting the robustness of


Introduction
In recent years, a transition of the energy system towards a more sustainable system has been called for by the international community in response to the growing effects of climate change [1]. Renewable energy sources (RES) are expected to play a crucial role in this transition, providing cleaner and more efficient alternatives to the fossil fuels dominating today's energy system [2–4].
One of the RES most readily available as an alternative to fossil fuels is solar energy, which can be harvested by transforming solar radiation into electricity via the photovoltaic (PV) effect, or as thermal energy either for heating purposes or indirectly to electricity through concentrated solar power (CSP) [2,5,6]. However, as with other fluctuating RES, solar energy has a dependence on external conditions for its production, since solar radiation will vary by location, time, and other factors such as cloud coverage [7].
Thus, obtaining accurate solar radiation measurements has traditionally proven to be crucial for properly designing solar installations, minimizing the uncertainty of variable energy production while optimising the plant design characteristics and location of such plants, as well as accurately predicting operational costs and benefits [8]. To gather accurate radiation measures, devices such as pyranometers, pyrheliometers and reference solar cells are used. However, they must be installed beforehand and be functional for a long period of time before enough radiation data can be gathered. This is costly and delays the design and establishment of new plants.
An alternative to relying on actual site-specific measurements is to base the design of solar collectors on other data sources. In recent years, much progress has been made in the field of modelling solar radiation, with the appearance of more accurate models and corresponding solar data. Two major methods exist to calculate solar radiation at a location without using physical devices. One method is based on satellite images, such as those from the European Meteosat, which are used to estimate solar radiation by converting them into maps of global solar radiation at ground level. The advantage of this method is that it provides historical data for as many years as there are images, and with the same temporal frequency. Using these images also allows a high spatial resolution. The disadvantage comes from the dependency on the satellite, meaning that the databases only provide information where the satellite has coverage and, therefore, are geographically limited. In addition, the resolution of the data is limited to the pixel resolution of the images. An example of a satellite-based data set is SARAH-2, which provides information for Europe, Africa and parts of South America [9].
The other method to establish solar radiation data is reanalysis. Reanalysis is the process where an unchanging data assimilation system is used to reprocess meteorological observations during a certain period of time [10]. The data assimilation system consists of a model of the atmosphere and its main characteristics, such as radiative transfer and convection, providing dynamic analysis of its behavior and obtaining historical data models [11].
The data assimilation combines both models and observations, accounting for their uncertainties and respecting certain constraints to ensure data reliability. These methods allow for a broader geographical data coverage, reaching larger areas than satellite-based methods, although with a lower spatial resolution. Examples of reanalysis-based databases are CFSR-2, ERA-5, MERRA-2, ERA-Interim, JRA-55 and NCEP-NCAR. These use different calculation methods and different meteorological observations to establish data [12].
The validation of these solar data models is crucial and has been addressed in several studies. In some of these, the data models are compared amongst each other and with measured data to assess their accuracy in estimating solar radiation. In Gueymard et al. (2014) [13], a review of validation methodologies and statistical performance indicators for modeled solar radiation data is conducted, noting a lack of development in this topic until recently and the importance of the data validation process, since higher solar radiation data accuracy translates into lower financial risks and better economic feasibility.
In Yang et al. (2020) [14], a comprehensive assessment and validation of eight different solar data models is carried out, concluding that satellite-based data models perform better and that using only a few error metrics could be misleading. In Boiley et al. (2015) [15], a similar comparison is conducted between the reanalysis-based models MERRA and ERA, observing that they tend to report clear-sky conditions when the weather is actually cloudy, and concluding that satellite-based models should be preferred.
Other studies focus specifically on PV and even on CSP. In Victoria et al. (2019) [16], the impact of PV system configurations at high penetration levels in European countries is assessed using validated reanalysis data that has been bias-corrected using satellite data. Moraes (2018) [17] conducted a comparative analysis of long-term PV power capacity factor datasets with an open licence, concluding that the datasets diverged noticeably despite being based on a common meteorological data source, and that the data availability for reanalysis has increased since 2015.
In Pfenninger et al. (2016) [18], PV output simulations using meteorological reanalysis and satellite data for 30 years are created and later validated against a large set of measured data. From this, the website Renewables.ninja was created to make these simulations publicly available. In Jurus (2013) [19], an estimation of climate variability was carried out, introducing the MERRA reanalysis system for PV power modelling. In Carra et al. (2020) [20], a methodology was created to increase the accuracy of solar data models when modelling CSP plants, thereby better accounting for optical losses in these types of plants.
However, to the best of the authors' knowledge, the impact of using these solar data models in the planning process of solar thermal systems has not been studied, although they represent more than a third of the total solar installations [5]. Individual solar heating represents more than 90% of annual solar thermal installations, mainly in individual households, apartments and public buildings.
Meanwhile, large-scale solar thermal systems have been emerging and becoming more competitive, especially in areas where district heating (DH) systems are already in place [21,22]. This becomes more relevant as DH systems are set to play a crucial role in achieving 100% RES and enabling the integration of different energy sectors in smart energy systems. Therefore, the usage of large-scale solar thermal systems appears to be a promising solution to use available solar resources [23,24].
In the International Energy Agency report "The role of Solar thermal in Future Energy Systems" [25], it is concluded that the energy system design is crucial when implementing solar thermal, that globally the solar thermal potential is around 3–12% of the heat production, and that it can be a valid alternative to scarce renewable sources such as biomass. It also shows how integrating large-scale solar thermal systems in DH networks has a better socioeconomic result than implementing individual rooftop solar collectors.
By 2019, almost 400 large-scale solar thermal systems were operating worldwide with around 1350 MWth of capacity [5]. Denmark has historically been the leader in the sector by a large margin, having almost 18 times more area of solar collectors than the second country, China [5]. However, large-scale solar thermal systems are gaining market share in other regions such as Germany [26], where DH grids are already present around the country, or China [27], where studies show that the solar heated area can reach 756 million m², covering around 3% of the total heated area, especially in rural areas such as Tibet.
The existing literature on large-scale solar thermal planning focuses mostly on technological design aspects like Tian et al. (2019) [28], analysing the development and recent trends of solar DH in Denmark; Renaldi et al. (2019) [29], analysing a study case in the UK with seasonal heating; Paulus et al. (2014) [30], focusing on decentralized systems or Winterscheid et al. (2017) [31], analysing the integration of solar thermal in DH plants that are already in place.
All these studies highlight the importance of using relevant data in modelling solar thermal systems, although they do not analyze the impact or the accuracy of the solar data. In Winterscheid et al. (2017) [31], one of the main conclusions is that using yearly data provides an approximate idea of the needed storage capacity and collector areas, but that it is necessary to use hourly data for dimensioning the heat load for the whole year. This is especially the case when taking into account storage systems, which are present in almost all large-scale solar thermal systems connected to DH networks. With sector integration forming a key element in the transition towards RES-based energy systems, it can also be argued that temporal variation of heating systems is not only a heating issue; it ultimately impacts the entire energy system.
Having identified both a gap in understanding the impacts of using solar data models in solar thermal systems and a growing trend in large-scale solar heating systems integrated in DH networks, this study aims to address these topics. To that end, the analysis focuses on Denmark, where large-scale solar thermal systems are widespread, to obtain measured data and compare it to the results of modelling using solar data models.
The main innovation of the presented work resides in identifying how different solar radiation data models perform when modelling solar collector production for large-scale solar thermal systems integrated in district heating networks and how this performance can be improved. This overarching goal can be further broken down into the following research questions, which we address through this study:
1. How accurate are the heat production results when using different solar radiation data inputs in large-scale solar thermal systems modelling?
2. What is the economic impact on the operational cost calculation when using these solar data models in large-scale solar thermal systems design?
3. Can the performance of these solar data models be improved by identifying correction factors that increase the accuracy of the models?
The work presented here is divided into five parts. The methods used are presented in Section 2, along with study cases and the different data sources employed. In Section 3, we address the first and second research questions by comparing the measured solar radiation data from the study cases with the different solar data models and then estimating the corresponding heat outputs and operational costs in large-scale heating system models. In Section 4, correction factors are analytically obtained to observe whether it is possible to improve the performance of these solar data models when modelling large-scale solar heating systems, hence answering the third research question. Finally, the results obtained are discussed in Section 5 and conclusions are presented in Section 6. In addition, the data and models used, as well as additional graphs, are available in the authors' dataset [32].

Methods
This section describes the relevant data and analyses used in the study. First, relevant data sources are presented, including both modeled solar data and measured data from the study cases. Then, the data processing procedure, statistical analysis parameters, and energy system models are discussed. Finally, the method used to obtain correction factors to improve the solar data models is presented.

Data sources
This study has been done in collaboration with the local knowledge and software center EMD International A/S (EMD) (https://www.emd.dk/) [33]. EMD offers consultancy services to companies and institutions worldwide, in the pre- and post-construction phases of different energy systems, including large-scale solar thermal systems in Denmark. In particular, there are three solar thermal plants where EMD monitors the daily operation and allows plant operators to access live and historical data on the state of the plants. Moreover, EMD provided access to the solar data models used in this study via its software energyPRO (further explained in Section 2.2).

Study cases
The three solar DH plants from which it has been possible to obtain first-hand data, and which are therefore used as study cases, can be seen in Fig. 1, with their main characteristics in Table 1.
EMD has stored hourly data from these plants since they were installed, including solar radiation, temperature, the heat output from the solar collectors, and fuel consumption from the DH boilers. The most recent plant, Hvide Sande, was installed in 2016, meaning that there are at least 4 years of data from the three plants. Hence, it has been chosen to use these locations as study cases, to help answer the research questions of this study.

Solar radiation data
The first step of this study is to compare physically measured solar radiation with the solar data models. Most of the solar data models used are obtained through the energyPRO software. That is the case for the satellite-based SARAH solar data model, the reanalysis models ERA5 and CFSR, and the Danish Reference Year (DRY). In addition, the MERRA2 reanalysis model is used, since it is publicly available from the website renewables.ninja, as well as the information available at solvarmedata.dk. A summary of their main characteristics can be found in Table 2.
2.1.2.1. Ground-measured data. Ground-measured data is obtained through EMD from the study cases. The solar radiation is measured using laboratory-tested pyranometers, installed at the same inclination as the solar collectors at each plant. The data is provided by EMD from 2016 to 2019, both inclusive, for the three study case plants.
[...] Its main upgrades are in the assimilation of modern hyperspectral radiance and microwave observations, while also using NASA's ozone profile observations. It has a spatial resolution of 0.5° × 0.5°, providing hourly data from 1980 [18].
2.1.2.6. DRY. In addition to the solar data models, the Danish Design Reference Year (DRY) is also used for comparison. The DRY consists of a year-long dataset with hourly values and is used as a reference for designing and building solar energy systems in Denmark. It is developed by the Danish Meteorological Institute and is based on meteorological measurements in multiple locations of the country during a specific period of years. It includes neither atypical months nor the variability of general meteorological conditions. This dataset was first built using the period 2001–2010 [36], but has recently been updated as of 2018 [37].
The dataset is divided into six different climatological zones for Denmark and includes data regarding temperature, global radiation, and diffuse irradiance, among others. These zones account for the slight variations in radiation that might occur within the Danish geography. Saeby corresponds to climatological zone 1 (north of Jutland), and Hvide Sande and Ringkøbing are in zone 2 (coastal mid-Jutland).

Solar heat production data
In addition to the solar radiation data, the heat output from the large-scale solar thermal systems is also assessed.

Measured data.
The measured data is obtained from the EMD database, including hourly values for heat production from the solar collectors, as well as other measurements related to the operation of the DH plants, such as fuel and electricity consumption or storage usage among others.
The energyPRO mathematical models of the heating systems are also obtained from EMD. These models are the ones used to assess and manage the operation of the plants and have been adapted for this study. The details of the mathematical modelling are explained in Section 2.5.

SolvarmeData.
To further verify the measured data, data from the portal solvarmedata.dk is also used [38]. There, large-scale solar heating systems upload their measured data so as to have a common database with the rest of the Danish plants and thus enable knowledge sharing. Currently, around 60 of the large-scale solar heating plants in Denmark are connected, having their historical and current measurements of heat output and solar radiation uploaded on the platform, and it is ready to start including international plants as well. However, Hvide Sande plant data is not available; only Saeby and Ringkøbing have data there.

The energyPRO simulation model
The energy systems simulation and analysis model energyPRO is a priority-list model originally created for the design of DH plants with combined heat and power (CHP) units, boilers, thermal energy storage, and demands; however, its breadth and capabilities have increased over the years. This has in part been a consequence of the increasing complexity of DH plants, but also of its application outside its original scope. Today, energyPRO is widely used in consultancy as well as in academia. It is characterized by a user-defined temporal resolution and a user-defined simulation horizon; thus, while hourly analyses for a one-year period are typical, five-year analyses with a quarter-hourly resolution are possible.
In terms of the technical aggregation level, all units may be defined in the desired level of detail. All energy production, conversion, storage, and demand types may be modeled, and new types of technologies and carriers can be analyzed. To facilitate simulation of RES-based systems, energyPRO links to various databases providing data on, e.g., solar energy and its temporal distribution.
In terms of optimization, energyPRO is based on an economic priority list where the cost of running a given unit in a given time step is calculated and compared to other units. Units are then dispatched in ascending order of production cost for all units in all time steps. This procedure is further detailed in Ref. [39]. Furthermore, energyPRO offers a MILP solver to do the dispatch [40], which allows a more complex operation to be planned, as often found in market-optimized, private-wire operated energy systems and Power2X systems.
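The priority-list idea can be illustrated with a minimal sketch. It is not energyPRO's implementation: it covers a single time step, ignores storage and start-up constraints, and the unit names, capacities and costs are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    capacity_mw: float    # maximum heat output in the time step (hypothetical)
    cost_per_mwh: float   # marginal heat production cost (hypothetical)


def dispatch(units, demand_mw):
    """Dispatch units in ascending order of cost until the demand is covered."""
    remaining = demand_mw
    schedule = {}
    for unit in sorted(units, key=lambda u: u.cost_per_mwh):
        output = min(unit.capacity_mw, remaining)
        schedule[unit.name] = output
        remaining -= output
    return schedule, remaining  # remaining > 0 would mean unmet demand


units = [
    Unit("gas_boiler", 10.0, 45.0),
    Unit("solar_collectors", 4.0, 1.0),
    Unit("electric_boiler", 5.0, 30.0),
]
schedule, unmet = dispatch(units, demand_mw=12.0)
print(schedule, unmet)
```

In this toy case the solar collectors are dispatched first because they have the lowest marginal cost, which mirrors why the accuracy of the radiation input directly affects how much the more expensive units have to run.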
In academia, energyPRO has particularly been used to model systems with a focus on CHP [41,42], power-to-heat [43,44], energy storage [45] or full energy systems [46]. The model is also widely applied to analyze economic conditions and incentives of the energy system [47–50] due to its focus on business-economic optimal solutions. Specifically, within solar energy exploitation, Kazagic et al. analyzed solar energy combined with CHP in Visoko, Bosnia-Herzegovina [51], and Valančius and Mikučionienė applied energyPRO to model building energy supply for a block of flats [52].
Ben Amer-Allam and coauthors designed a heating system for Helsingør, Denmark, with a focus on various RES, including solar [53]. A conversion of the heating systems of Samsø, Denmark, from solar and biomass to solar and heat pumps was investigated in Ref. [54], and lastly Rämä and Wahlroos investigated the role of RES in heating systems using energyPRO in Ref. [55]. There was not, however, a specific focus on the solar data or the accuracy of the representation of solar energy in these reviewed articles.

Statistical analysis parameters
To perform the comparisons and statistical analyses of the data, it is necessary to define the parameters that are used for this comparison. Since this study includes comparisons between different locations with different expected outputs, all statistical parameters are normalized, dividing them by the mean of the measured values.

RMSE
The first parameter is the Root Mean Squared Error or Difference (RMSE). Its normalized version is defined by Equation (1) [13].
$$\mathrm{RMSE} = \frac{1}{\overline{y_1}} \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( y_{2,t} - y_{1,t} \right)^2} \qquad (1)$$
where $y_1$ represents the measured data, $y_2$ is the predicted data, $T$ is the time-frame to be analyzed and the overline indicates the mean value. The RMSE allows observing the variation between the expected data and the measured data. Since it uses the square of the differences, it penalizes large errors in the data. It allows comparing the accuracy of the data sets analyzed.

MBE
The other main parameter is the Mean Bias Error or Difference (MBE). Equation (2) defines how it is calculated [13].
$$\mathrm{MBE} = \frac{1}{\overline{y_1}} \cdot \frac{1}{T} \sum_{t=1}^{T} \left( y_{2,t} - y_{1,t} \right) \qquad (2)$$
Here $y_1$ is the measured data, $y_2$ the predicted data and $T$ the time-frame analyzed. The MBE allows observing systematic error, identifying constant deviations of the measurements that can be produced by malfunctions of the measurement equipment or by a bias in the simulation results. These bias errors are important to detect and easier to compensate, since they usually correspond to a constant value or an offset in the data.
By combining these two parameters, it is possible to assess if the data predicted has a low accuracy (RMSE) as well as to identify possible bias errors (MBE).
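As a minimal sketch of how the two parameters complement each other, the functions below compute both, normalized by the mean of the measured values. The sign convention (predicted minus measured) and the sample values are assumptions made for illustration.

```python
import math


def normalized_rmse(measured, predicted):
    """Root mean squared difference divided by the mean of the measured values."""
    n = len(measured)
    mean_measured = sum(measured) / n
    mse = sum((p - m) ** 2 for m, p in zip(measured, predicted)) / n
    return math.sqrt(mse) / mean_measured


def normalized_mbe(measured, predicted):
    """Mean difference (predicted minus measured, an assumed sign convention)
    divided by the mean of the measured values."""
    n = len(measured)
    mean_measured = sum(measured) / n
    bias = sum(p - m for m, p in zip(measured, predicted)) / n
    return bias / mean_measured


measured = [0.0, 200.0, 400.0, 200.0]     # illustrative hourly values
biased = [m + 20.0 for m in measured]     # constant offset of +20
scattered = [20.0, 180.0, 420.0, 180.0]   # errors of +/-20 that cancel out

print(normalized_rmse(measured, biased), normalized_mbe(measured, biased))
print(normalized_rmse(measured, scattered), normalized_mbe(measured, scattered))
```

Both synthetic series have the same RMSE, but only the one with a constant offset shows a non-zero MBE, which is why the two parameters are read together.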
In addition to statistical parameters, different graph types are used to obtain a visual comparison of the data to analyze. The main plots are average curves for a limited duration or duration curves, among others. By plotting the several results scaled and with the same horizontal axis of time, it is possible to identify patterns of behavior of the different locations analyzed, such as a higher than expected radiation or a similar heat output, among others.

Data treatment
For this study, it is crucial to ensure that all the data used is correctly assessed and treated. The following data treatment processes have been necessary to perform the analysis adequately.

Data gaps
Several data gaps have been found in both the measured data and some of the solar data models, meaning that no data is available for some hours. This has been found in all the years in all the plants, with up to 100 h of missing data per year. To compensate for this, it has been chosen to calculate a yearly average for all the plants and use this reference year for comparisons.
All the data from the solar data models, as well as the outputs of the mathematical models, have been obtained for each of the four years, from which yearly averages have been built. By doing this, the errors between models and measured data might be diluted, but it ensures that the data gaps do not alter the result of the analysis, since these gaps do not count towards the average and it is highly unlikely that the exact same hour is missing in more than one year.
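A minimal sketch of this gap handling is given below, assuming that each year is stored as a list of hourly values aligned by hour of the year and that missing hours are marked with None. Both the data layout and the numbers are assumptions for illustration.

```python
def reference_year(years):
    """Average several years hour by hour, skipping missing values (None).

    An hour stays None only if it is missing in every year.
    """
    n_hours = len(years[0])
    reference = []
    for h in range(n_hours):
        values = [year[h] for year in years if year[h] is not None]
        reference.append(sum(values) / len(values) if values else None)
    return reference


# Four "hours" per year are enough to show the behaviour (illustrative values).
y2016 = [0.0, 100.0, None, 300.0]
y2017 = [0.0, 120.0, 210.0, None]
y2018 = [0.0, None, 190.0, 330.0]

ref = reference_year([y2016, y2017, y2018])
print(ref)
```

Each hour is averaged over the years in which it is available, so a gap in one year reduces the sample for that hour without leaving a hole in the reference year.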

Removing offset
An offset error has been identified in the solar radiation data from the Ringkøbing plant in the years 2018 and 2019. According to the DH plant technicians, this error can be associated with a malfunction of the measurement equipment. The offset has been observed to change during the year, being higher on days where the radiation is higher. The variance of this error, however, is low. To remove this offset, the following methodology has been used.
First, the offset has been identified in the hours when all the solar data models predict null radiation but the measurement has a non-zero value, mainly at night-time. Two methods have been used. The first calculates the average of this night-time offset and removes it for the whole day, limiting the result to zero where it would otherwise be negative. The second removes the daily maximum of this error, also limiting the result to zero.
To assess if these methods are valid, the corrected data has been compared with the data of the years with no observable offset errors, to evaluate whether the differences between the years are consistent. The offset is mainly observable in the MBE shown in Fig. 2. With both methods, the MBE between the years is near 0, with a behavior similar to that found when comparing the two years without offset. The maximum method appears to produce an MBE slightly closer to this reference behavior, hence it is the one that has been used.
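The two offset-removal methods can be sketched as follows for a single day. The function name, the data layout (one list of hourly values per series) and the numbers are hypothetical; the sketch only follows the description above.

```python
from statistics import mean


def remove_offset(measured, models, method="max"):
    """Remove the night-time offset from one day of measured radiation.

    measured: hourly measured values for one day
    models:   list of hourly series, one per solar data model
    method:   "max" subtracts the largest night-time reading,
              "mean" subtracts the average night-time reading
    """
    # Hours in which every data model predicts zero radiation.
    night = [h for h in range(len(measured)) if all(m[h] == 0.0 for m in models)]
    night_values = [measured[h] for h in night if measured[h] != 0.0]
    if not night_values:
        return list(measured)  # no observable offset on this day
    offset = max(night_values) if method == "max" else mean(night_values)
    # Limit the corrected radiation to zero.
    return [max(v - offset, 0.0) for v in measured]


# Six "hours" stand in for a full day (illustrative values).
measured = [10.0, 12.0, 150.0, 400.0, 160.0, 14.0]
model_a = [0.0, 0.0, 140.0, 380.0, 150.0, 0.0]
model_b = [0.0, 0.0, 120.0, 390.0, 140.0, 0.0]

print(remove_offset(measured, [model_a, model_b], method="max"))
print(remove_offset(measured, [model_a, model_b], method="mean"))
```

With the maximum method all night-time readings end at zero, whereas the average method can leave a small residual in some night hours.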

Converting to horizontal radiation
The solar radiation data from the solar collector plants have been measured on the tilted surface, which is 30° at Ringkøbing and Saeby, and 38° at Hvide Sande. Since the solar data models offer radiation on a horizontal surface, it is necessary to transform this measured data to horizontal radiation.
To do so, the same equations that the energyPRO models use have been utilized. The impact of direct radiation, diffuse radiation, and reflected radiation is taken into account, as seen in Ref. [56].

Mathematical modelling of large-scale solar thermal systems with DH
To mathematically model the large-scale solar thermal systems in the three cases analyzed, energyPRO is used. Models for the three study cases are obtained from EMD. However, these models are used to operate and manage the plants on a daily basis and therefore are not suitable to include year-long hourly values. Therefore, the models have been adapted, changing the operation time frame (from one day to a year) with the corresponding modifications of input data such as electricity prices or demands.
These models include the complete DH systems of these plants on top of the solar collectors and heating storage systems. These DH systems include gas boilers, electric boilers, CHP plants, and storage systems, while also including the electricity market prices, fuel prices, and heat demand.
The model in energyPRO prioritizes the lowest cost of production; therefore, it uses solar collectors whenever there is enough radiation and stores the heat produced that cannot be consumed in thermal energy storage facilities. These storage systems are designed so that there is enough capacity to store the excess heat and therefore the solar collectors are always operating. The mathematical models are available at [32].
To calculate the heat output from solar collectors, the model requires as input the size and position of the collectors, the collector specifications regarding the efficiency, the expected temperatures in and out of the collector, the ambient temperature, and solar radiation, on top of the orientation and geometrical disposition to account for array shading. The equations that energyPRO uses to calculate heat production from solar radiation can be consulted at [56]. The input radiation in energyPRO can be taken from any of the solar radiation data models mentioned in Section 2.1.2, or any other dataset such as the ground-measured data.
Therefore, to assess the accuracy of the models, the measured radiation is added as an input, and the heat output is compared to the measured production data. Then, energyPRO is also used to assess the impact of using the different solar data models and their sensitivity to changes in solar radiation. To observe the impact of these differences, the yearly operational costs calculated for each case are compared, as well as the yearly heat output.

Correction factor calculation
In Section 4, correction factors to improve the solar data models are presented. To obtain these factors, the following method was used.
The aim is to obtain a K factor which, multiplied by the solar data model, could increase its accuracy when comparing it to the measured values. It was chosen to work with hourly month profiles, hence having 24 K values for each month, 288 in total. These K values would then be multiplied by the hourly values of the solar data model accordingly, generating a new data set with the corrected data.
Four different sets of K values are created to assess their results. They are obtained by calculating average hourly monthly profiles of the measured data and dividing them by the profiles obtained using the solar data model. With this methodology, it is expected to identify patterns of behavior in the errors of the solar data model, for instance, if the errors are higher in certain months or hours.
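The K-factor calculation can be sketched as below, assuming the data is available as (month, hour, value) records. The record layout, the function names and the choice of K = 1 where the model profile is zero (e.g. at night) are assumptions made for this illustration.

```python
from collections import defaultdict


def month_hour_profile(records):
    """Average value per (month, hour) from (month, hour, value) records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, hour, value in records:
        totals[(month, hour)] += value
        counts[(month, hour)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def correction_factors(measured_profile, model_profile):
    """K = measured profile / model profile for every (month, hour)."""
    factors = {}
    for key, model_mean in model_profile.items():
        measured_mean = measured_profile.get(key)
        if measured_mean is None or model_mean == 0.0:
            factors[key] = 1.0  # assumption: leave the model value untouched
        else:
            factors[key] = measured_mean / model_mean
    return factors


def apply_factors(records, factors):
    """Multiply each hourly model value by the K of its (month, hour)."""
    return [(m, h, v * factors.get((m, h), 1.0)) for m, h, v in records]


# Illustrative June records: two noon hours and one midnight hour.
measured = [(6, 12, 800.0), (6, 12, 600.0), (6, 0, 0.0)]
modelled = [(6, 12, 450.0), (6, 12, 550.0), (6, 0, 0.0)]

k = correction_factors(month_hour_profile(measured), month_hour_profile(modelled))
corrected = apply_factors(modelled, k)
print(k)
print(corrected)
```

With a full year of hourly data the dictionary holds 12 × 24 = 288 factors, and plotting them by month and hour is one way to look for the error patterns mentioned above.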

Data assessment and model performance
In this section, the solar radiation data obtained from the different plants and the solar data models are assessed and compared. Then, using the models for each plant, the heat output is calculated for each case. With that, the model performance is assessed and the sensitivity to different solar radiation inputs observed, concluding with the calculation of the different operational costs associated with each case.

Radiation
Before analyzing the model performance, the radiation data has to be assessed. To do so, the different sites are first compared with each other to identify possible errors and patterns of behavior. Then, the measured solar radiation is compared with the different solar data models analyzed in this study.

Site comparison
The radiation measured at the three sites has been compared. All the data is available at [32]. To illustrate the comparison, an example week in June is shown in Fig. 3.
The week shown in Fig. 3 has been chosen because it is representative of the solar data measured at the three plants and therefore helps visualize the key findings of the site comparison. First of all, the radiation measured in Ringkøbing and Hvide Sande is not as similar as one would expect from their proximity. Both curves follow an almost identical shape, but in peak hours the radiation values read in Hvide Sande are consistently higher. Saeby follows a different radiation profile (as expected, due to its distance from the other plants), but in clear peak hours its profile is almost identical to that of Ringkøbing. These findings are consistent for the rest of the data, which can be found in Ref. [32]. To better observe and analyze these behaviors, day-type month profiles have been created, using the hourly average radiation per month (Fig. 4). These behaviors can be found throughout the year, although they are clearer in the summer months, when the radiation is higher. In the monthly profiles, where averaging reduces the irregularity produced by clouds, the curves of Saeby and Ringkøbing are almost identical. Because the three plants lie at similar latitudes, there is no seasonal difference in the radiation measurements. The differences come from the clouds, which have local dynamics and manifest differently in Saeby than at the other plants.
In Hvide Sande, however, the peak-hour values in the day-type month profiles can be around 15% higher. Hvide Sande and Ringkøbing are situated within 15 km of each other, so one would expect their radiation values to be more similar than they actually are. The main observable difference between the two plants is the location, with Ringkøbing being situated inland and Hvide Sande on a narrow isthmus between the North Sea and the 300 km² lagoon Ringkøbing Fjord. Theoretically, being next to the sea would correspond to a lower reflection coefficient [57], hence even lower radiation than at the Ringkøbing plant. Therefore, the main hypothesis for this difference is a measurement error in Hvide Sande, which will be further analyzed throughout the study.
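The day-type month profiles used in this comparison can be illustrated with a minimal sketch. It assumes hourly samples with timestamps and averages all values that share the same month and hour of day; the function name `day_type_profiles` and the two-day example series are hypothetical and are not taken from the study's data or tooling.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_type_profiles(timestamps, values):
    """Average the values for each (month, hour-of-day) pair.

    Returns a dict {(month, hour): mean}. Samples equal to None
    (data gaps) are skipped, so gaps do not bias the average.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in zip(timestamps, values):
        if value is None:
            continue
        key = (ts.month, ts.hour)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical two-day hourly series in June (W/m2): only noon is non-zero.
start = datetime(2017, 6, 1)
stamps = [start + timedelta(hours=h) for h in range(48)]
radiation = [0.0] * 48
radiation[12] = 500.0  # noon, day 1
radiation[36] = 700.0  # noon, day 2

profile = day_type_profiles(stamps, radiation)
print(profile[(6, 12)])  # mean of the two noon values
```

With four years of data, each (month, hour) entry would average roughly 120 samples, which is what smooths out the cloud-driven irregularity mentioned above.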

Comparison of the different solar data models
The solar data measured in the three locations are then compared to the data obtained from the different data models presented in Section 2.1.2.
Observing an example week at the three sites in Fig. 5, the DRY shows the most erratic behavior. This is to be expected, since this dataset does not represent any specific year and therefore does not account for the variability in general meteorological conditions as the satellite- or reanalysis-based datasets do. Among the others, SARAH replicates the measured path more accurately, whereas others such as MERRA2 do not reflect as well the quick variability of radiation due to passing clouds.
This behavior is further observed when calculating the RMSE and MBE. They have been calculated for each plant and each solar data model using hourly, daily, monthly, and yearly average values for the RMSE, and only hourly values for the MBE, since the resolution of the data does not affect its result. For all these calculations, the 4 years of available data have been used.
From the RMSE calculation in Fig. 6, it can be seen how using a higher temporal resolution in the computation increases the value of the error in all cases. This is expected when evaluating radiation data [16]. The temporal dependence of the error is more pronounced for the DRY, whose RMSE increases drastically when using hourly values. However, when using the yearly average, the error is almost zero for both Saeby and Ringkøbing.
The RMSE of the different models behaves differently in Hvide Sande than at the other two plants. In Saeby and Ringkøbing, SARAH always has the lowest error, followed by ERA5, CFSR, and MERRA2 in a proportionate manner. In Hvide Sande, however, that is the case only when using hourly values; at coarser resolutions, ERA5 and CFSR are more accurate. This would further corroborate the hypothesis that there is a measurement error in Hvide Sande.
In the case of the MBE, similar behavior is observed in Fig. 7. In Ringkøbing and Saeby, the DRY has the lowest error, practically 0, followed closely by SARAH. The other solar data models have higher values, following the same order, with ERA5 performing best, then CFSR, and finally MERRA2. In Hvide Sande, however, both the DRY and SARAH have a considerable negative MBE, while ERA5 and CFSR have a lower value.
Both Figs. 6 and 7 show that none of the solar data models predicted the higher solar radiation found at the Hvide Sande plant. Looking only at the satellite- and reanalysis-based solar data models, SARAH appears to perform best. Among the other models, ERA5 performs better than CFSR despite having the same spatial resolution. Finally, MERRA2 appears to be the least accurate, with higher MBE and RMSE values, which can be linked to its more limited spatial resolution.
The DRY performs even better than the SARAH model on the aggregate data but has the worst performance when considering hourly operation. This means that the satellite- and reanalysis-based solar data models better predict the variability of the solar radiation within the year. For SARAH, one of the reasons is its spatial resolution, since it allows solar data estimations to be obtained from locations closer to the plants than with the reanalysis methods. The DRY, in turn, is not useful for capturing this variation over short periods of time; however, it has a lower bias error than SARAH and the reanalysis models when accounting for the total yearly values.
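The dependence of the RMSE on temporal resolution, and the independence of the MBE from it, can be reproduced with a small sketch. It assumes unnormalized errors defined as model minus measurement (the sign convention suggested by the negative MBE reported for Hvide Sande) and equal-length averaging blocks; the helper names and the synthetic series are hypothetical.

```python
import math


def block_means(values, block):
    """Average consecutive, non-overlapping blocks (e.g. block=24 for daily means)."""
    return [sum(values[i:i + block]) / block
            for i in range(0, len(values) - block + 1, block)]


def rmse(model, measured):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, measured)) / len(measured))


def mbe(model, measured):
    """Mean bias error, model minus measurement (sign convention is an assumption)."""
    return sum(m - o for m, o in zip(model, measured)) / len(measured)


# Synthetic two-day hourly series: the model alternately over- and underestimates.
measured = [100.0] * 48
model = [150.0 if hour % 2 else 70.0 for hour in range(48)]

hourly_rmse = rmse(model, measured)
daily_rmse = rmse(block_means(model, 24), block_means(measured, 24))
hourly_mbe = mbe(model, measured)
daily_mbe = mbe(block_means(model, 24), block_means(measured, 24))

print(hourly_rmse, daily_rmse)  # hourly errors partly cancel in the daily means
print(hourly_mbe, daily_mbe)    # the bias is unchanged by the averaging
```

Opposite-signed hourly errors cancel inside each averaging block, which is why a dataset such as the DRY can show a near-zero yearly error while still having a large hourly RMSE.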

Heating output
Once the radiation has been analyzed, the next step is to compare this radiation to the heating output from the solar collectors, to identify possible anomalies in the data. Then, the mathematical models are tested using the measured solar radiation as the input. Later, the sensitivity of these models is assessed using the different solar data models as inputs.

Comparison between plants
The first step is to compare the measured data among the different plants to assess their efficiency and identify possible patterns when comparing it with the measured radiation. Since the plants have different installed capacities, the heat output is divided by the solar collector area so that the values can be compared between plants. This normalized heat output is used for all comparisons.
A week of the measured heat output can be seen in Fig. 8 and the monthly profile of June in Fig. 9, following the same methodology explained for the radiation. The remaining monthly heat output profiles can be consulted at [32].
This comparison shows how the measured heat follows a pattern proportional to that of the radiation, especially when comparing Figs. 8 and 3. The monthly profiles show that Hvide Sande has the highest heat output per m², with Ringkøbing and Saeby following proportionally. However, it cannot be determined whether this difference in heat output comes from the higher measured radiation, since the plants have different efficiencies.
To further verify the measured data, the portal solvarmedata.dk is used; however, only Saeby and Ringkøbing have data there. In these cases, although the values are not exactly the same as the ones provided by the plant operators, the errors between them are quite low, with relative RMSE of 0.08 and 0.19 and MBE of 0.091 and 0.004 for Saeby and Ringkøbing, respectively.

Model assessment
The energyPRO models used for solar collectors require a radiation input, as well as key parameters regarding the efficiency of the collector. These parameters are obtained directly from EMD and are assumed to be valid. Therefore, for different solar radiation inputs, the heat output profiles are obtained, which can be compared to the measured ones. The first step to assess the energyPRO models, as well as to validate the measured data, is to use the measured solar radiation as the input.
The RMSE and MBE between the measured heat output and the modeled heat output obtained using the measured solar radiation are calculated and can be observed in Figs. 10 and 11 as the measEP parameter.
It can be observed how in Hvide Sande the RMSE is consistently higher than in the Ringkøbing and Saeby cases, and the difference increases when increasing the resolution of the calculation. In terms of the MBE, Ringkøbing has an MBE of −0.014, Saeby 0.026, and Hvide Sande up to 0.083. Other than this difference between the plants, it can be observed how the model has some base error which needs to be taken into account in the following analyses. This error is associated with inaccuracies in the different parameters of the modelling process, as well as with using hourly values instead of a higher resolution.
These errors can be better seen when calculating the solar efficiency, defined as the ratio between heat production and horizontal radiation. This efficiency is calculated for both the measured values and the output values of the energyPRO models. Hence, it is possible to observe whether the data from the measured radiation and measured heating are proportional, validating the operation of the mathematical models. The efficiency is higher the higher the measured horizontal radiation, hence these values are more relevant in the summer months.
Monthly day-types using 4 years of data have been plotted in Fig. 12 to further observe the differences between the measured efficiency and the model's efficiency (the rest of the monthly data is available at [32]). In the case of Saeby and Ringkøbing, both efficiency curves are practically the same. There is a small desynchronization, since the model predicts the start and finish of the efficiency curve earlier, especially in Ringkøbing, although the prediction during peak hours is quite accurate. In the case of Hvide Sande, this offset is also present, but there is a more noticeable difference at peak hours, which can reach up to 10%. This peak-hour difference at Hvide Sande shows that the higher radiation identified in the previous analysis is not matched by a proportional increase in heat production, which lends further support to the hypothesis of a measurement error.
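The solar efficiency used above, the ratio between heat production and horizontal radiation, can be sketched as follows. The sketch assumes hourly heat in MWh, hourly mean horizontal radiation in W/m² (numerically equal to Wh/m² over one hour), and a cut-off below which the ratio is not evaluated; the cut-off value, the function name, and the example figures are hypothetical and not taken from the study.

```python
def solar_efficiency(heat_mwh, radiation_w_m2, area_m2, min_radiation=50.0):
    """Hourly efficiency = heat per m2 of collector / horizontal radiation per m2.

    Hours with radiation below `min_radiation` (an assumed cut-off) return
    None, because the ratio is unstable when the denominator approaches zero.
    """
    efficiencies = []
    for heat, radiation in zip(heat_mwh, radiation_w_m2):
        if radiation < min_radiation:
            efficiencies.append(None)
            continue
        heat_wh_per_m2 = heat * 1e6 / area_m2  # MWh -> Wh, then per m2
        efficiencies.append(heat_wh_per_m2 / radiation)
    return efficiencies


# Hypothetical plant: 10,000 m2 of collectors, one night hour and one peak hour.
eff = solar_efficiency([0.0, 4.0], [0.0, 800.0], area_m2=10_000.0)
print(eff)
```

Applying the same function to the measured heat and to the modeled heat, both against the measured radiation, gives the two curves that are compared per plant.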

Modelling using solar data models
To assess the relevance and accuracy of the model and its sensitivity to solar radiation, the different solar data models are used as inputs to the model and their results are assessed. The RMSE and MBE calculated at different resolutions can be seen in Figs. 10 and 11.
For the different solar data models, it can be observed how the RMSE follows a similar pattern to that of the radiation in Fig. 6, but with increased values. The differences between the solar data models also increase, with SARAH having the lowest error when using hourly values. ERA5 comes second, followed by CFSR and MERRA2, respectively.
The DRY has a substantially higher RMSE when using hourly values, although at the yearly resolution it has a lower error than SARAH. In the case of Hvide Sande, the SARAH model performs similarly to the other plants, but the other models behave differently, with MERRA2 having a slightly lower RMSE than ERA5 and CFSR, respectively, although with higher values than at the other plants.
The heat output error can also be expressed as the difference between the total measured heat output and the modeled one for each of the inputs. The absolute values can be seen in Table 3, while Fig. 13 shows the proportional difference, obtained by dividing the difference by the measured heat output.
As can be observed, the total heat produced in the model using the measured radiation has a slight error (83 MWh) for all the plants, which can be associated with the assumptions made when building the model or with using hourly data instead of a higher resolution. Among the rest of the data models, both SARAH and the DRY are the better performers in the total yearly heat calculation by a solid margin, with the DRY having almost no error in Saeby and SARAH likewise in Ringkøbing. The rest of the solar data models predict a higher heat output than the measured one, with MERRA2 and CFSR performing the worst.
However, the impact of the usage of these data models can not only be observed using the total heat output but also the accuracy of the heat output values in the hourly data set, as calculated in the RMSE.This is relevant, for instance, when assessing the operation cost of the DH plants associated with the large-scale solar collectors.

Economic impact
In the cases studied, the large-scale solar thermal systems are connected to DH systems, which also have CHPs, gas boilers, and electric heaters to supply the heat. Therefore, the operational cost of the system can be assessed and compared when using the different solar data models as inputs. The difference in cost is not only dependent on the amount of energy produced by the solar collectors but also on when this energy is produced. Since the system always minimizes the cost, if the electricity price is high the CHP will operate and sell excess electricity, whereas the gas boiler or the electric boiler will operate otherwise. To assess this, energyPRO models of these systems have been used, available at [32].
In the case of Ringkøbing, the solar plant has an installed capacity of 22 MW, with a yearly heat demand of around 114,000 MWh, of which around 12–14% is supplied by the solar collectors. In Saeby, only data until 2018 is used and the expansion of solar collectors from 2019 is ignored. Therefore, Saeby is modeled with an installed capacity of 8.2 MW and a yearly heat demand of 82,000 MWh, of which around 6% is provided by the solar collectors. In Hvide Sande, the solar plant has a capacity of 7 MW and a yearly demand from its 1500 consumers of around 35,000 MWh, of which approximately 13–15% is produced by the collectors.
The operational cost for the plants is estimated using the different radiation datasets as an input. This operational cost accounts for the purchase of fuel and electricity, the associated taxes, and the sale of electricity to the market from the CHPs, as well as maintenance costs. Due to the differences in size of the solar collector plants, the absolute economic values also differ, as seen in Table 4. To put costs into perspective, the operational cost difference has been divided by the collector area of each plant in Fig. 14. The operational cost difference is calculated in relation to the values obtained from modelling with the measured radiation as an input.
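Why the timing of the solar production matters for cost can be illustrated with a deliberately simplified merit-order sketch. It is not the energyPRO optimization: it ignores taxes, storage, start-up costs, and capacity limits, and every price and efficiency below is a hypothetical placeholder. It only shows how the cheapest unit for covering the heat demand left after the solar contribution changes with the electricity price.

```python
# All parameters are illustrative assumptions, not values from the study.
CHP_HEAT_EFF = 0.50      # MWh heat per MWh gas
CHP_EL_EFF = 0.40        # MWh electricity per MWh gas
GAS_BOILER_EFF = 0.95    # MWh heat per MWh gas
EL_BOILER_EFF = 1.00     # MWh heat per MWh electricity


def marginal_heat_costs(el_price, gas_price):
    """Net cost (currency per MWh of heat) for each unit at the given prices."""
    el_per_heat = CHP_EL_EFF / CHP_HEAT_EFF  # electricity sold per MWh of heat
    return {
        "chp": gas_price / CHP_HEAT_EFF - el_per_heat * el_price,
        "gas_boiler": gas_price / GAS_BOILER_EFF,
        "electric_boiler": el_price / EL_BOILER_EFF,
    }


def cheapest_unit(el_price, gas_price):
    """Name of the unit with the lowest net heat cost."""
    costs = marginal_heat_costs(el_price, gas_price)
    return min(costs, key=costs.get)


for price in (100.0, 280.0, 500.0):
    print(price, cheapest_unit(price, gas_price=250.0))
```

Under these assumed parameters, a misplaced hour of solar heat displaces a different unit, with a different marginal cost, depending on the electricity price in that hour, which is why hourly accuracy of the radiation input affects the yearly operational cost.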
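The normalization described above can be written as a short sketch. It assumes yearly operational costs in thousands of DKK, takes the run with measured radiation as the baseline, and defines the difference as model minus baseline (the sign convention is an assumption). The dataset names, costs, and collector area are hypothetical placeholders.

```python
def cost_difference_per_area(costs_kdkk, baseline_key, area_m2):
    """Operational cost difference to the baseline run, in DKK per m2 of collector.

    costs_kdkk: {input name: yearly operational cost in 1000 DKK}
    baseline_key: name of the run that used the measured radiation
    """
    baseline = costs_kdkk[baseline_key]
    return {
        name: (cost - baseline) * 1000.0 / area_m2
        for name, cost in costs_kdkk.items()
        if name != baseline_key
    }


# Placeholder figures for one plant with 20,000 m2 of collectors.
costs = {"measured": 1000.0, "model_a": 1010.0, "model_b": 940.0}
diff = cost_difference_per_area(costs, "measured", area_m2=20_000.0)
print(diff)
```

Dividing by the collector area makes plants of different size comparable, at the price of hiding the absolute amounts, which is why both the table and the figure are reported.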
Ringkøbing, being the largest plant, also has the largest operational cost difference between the solar data models and the measured data. SARAH has the smallest operational cost difference, followed closely by the DRY and ERA5. CFSR and MERRA2 lie further away, with a difference of around 660k DKK in the operational cost, which totals around 36,000k DKK to put it in perspective. In Saeby, SARAH performs slightly better than the DRY, followed by MERRA2, ERA5, and CFSR, respectively. Finally, in Hvide Sande, the DRY and SARAH have the highest values, while MERRA2 and CFSR perform better compared to using the measured radiation.
This analysis shows the relevance and impact of having an accurate solar radiation data set since the inaccuracy of the data models can be translated to miscalculations on operational costs, which can affect the perceived viability of the project in the design process.The more accurate the solar data model, the lower the uncertainties in the project design.

Identification of correction factors
Once the impact of data and of solar data models on the planning process has been assessed, the next step is to improve the accuracy of the solar data model to minimize these errors. Since SARAH has been shown to be the most consistent of the solar data models analyzed, it is the one chosen for further analysis. Therefore, several correction factors are presented and assessed following the method presented in Section 2.6.
However, more data needs to be utilized to avoid any bias that using only two or three plants might cause, which would make the results non-transferable to other cases. To broaden the scope, solar radiation data obtained from solvarmedata.dk is used.
The database has been filtered to only use those sites located in Jutland, with complete data from 2016 until 2019, so that the data is comparable to the data logged by EMD. The radiation is measured on a tilted plane, so it is also necessary to know the angle of the collectors to obtain the horizontal radiation, as explained in Section 2.4.3. This results in an additional 15 plants for which radiation values are obtained. Moreover, SARAH values for those plants are also gathered for comparison. A map with the location of the plants, as well as the technical characteristics of these sites, is available at [32].
The same data treatment as for the other plants is applied, using the hourly average of the four years to ignore data gaps. It is also important to note that the data obtained from solvarmedata.dk comes from a secondary source and therefore might be less reliable than that of the three study cases, which comes directly from the plant operators.
The radiation monthly day-types are created for all the plants. In Fig. 15 the profile for July can be seen, whereas the rest of the months are available at [32]. It can be observed how Saeby and Ringkøbing are situated on the average of the radiation curve, whereas Hvide Sande has the highest peak, although followed closely by other plants. From the duration curve between the measured data and the SARAH model in Fig. 16, the trend of the errors of these plants can be observed globally. In most of the 18 plants, the duration curve for the measured radiation follows the one from SARAH accurately, which is the case for both Saeby and Ringkøbing.
In three plants, a higher SARAH error has been observed, resembling the one observed for Hvide Sande in Section 3.1.1. However, no common factor has been found for these sites in terms of location, inclination angle, size, or year of construction that would explain why the error is higher in these plants.
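One way to build a duration-curve comparison of the kind shown in Fig. 16 is sketched below. It is an interpretation, not the authors' procedure: both series are sorted independently in descending order and paired rank by rank, so that points on the diagonal indicate matching distributions even if individual hours differ. The function name and the sample values are hypothetical.

```python
def duration_pairs(measured, modelled):
    """Pair the sorted (descending) measured and modelled values rank by rank.

    Each pair gives one point with the measured value on the X-axis and the
    modelled value on the Y-axis.
    """
    if len(measured) != len(modelled):
        raise ValueError("both series must cover the same number of hours")
    return list(zip(sorted(measured, reverse=True),
                    sorted(modelled, reverse=True)))


# Hypothetical hourly radiation values in W/m2.
pairs = duration_pairs([100.0, 300.0, 200.0], [210.0, 90.0, 330.0])
print(pairs)
```

Because the timing information is discarded, such a curve reveals systematic over- or underestimation across the radiation range but says nothing about hourly agreement, which is what the RMSE captures.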
The final step was to try to correct the SARAH solar data model and reduce its error, taking Ringkøbing and Saeby as the plants to analyze and discarding Hvide Sande due to the unexpectedly higher radiation measured. To assess whether the use of correction factors improves the performance of the solar data model, the RMSE, MBE, total yearly heat production prediction, and operational cost are assessed. The correction factors K have been obtained following the method explained in Section 2.6. The first correction factor, K18, is calculated using all the 18 plants obtained from solvarmedata.dk. Since some of these plants have been observed to have higher errors from SARAH, as could be seen in Fig. 16, a second factor, K9, has been calculated by choosing the nine plants with a SARAH error similar to that of the two studied plants. Then, a third correction factor, K3, is calculated using the three plants analyzed in this report, since their data has been properly assessed. Finally, a correction factor using only the data from the two plants, K2, is chosen. The results obtained can be seen in Table 5.

Fig. 14. Operational cost difference for the different solar data models, calculated by subtracting the total operational cost of the DH system obtained when using the measured radiation as an input.
As can be seen for both plants, using any of the proposed correction factors lowers the RMSE, except when using K2 in Saeby. The MBE increases in almost all the cases, except precisely for K2 in Saeby as well as K3 at the same plant. This shows that, since the MBE in both plants is quite low, it is not easy to identify monthly hourly profiles that improve the SARAH accuracy and that can be used together in both plants.
To further assess these results, the errors in heat production from the solar collectors (hs) and in the yearly operational cost (OC) are examined; neither of these parameters improves on the results of the original SARAH dataset. Therefore, it can be concluded that correction factors based on monthly profiles do not improve the results from SARAH in terms of heat production calculation and operational cost, at least with the current data used. Moreover, no trends regarding where the errors tend to occur have been identified in terms of location, hour of the day, or season.
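A month-hourly correction factor of the kind tested here can be sketched as follows. The exact definition is given in Section 2.6 and is not reproduced in this section, so the sketch rests on an assumption: K for each (month, hour) is taken as the ratio of the mean measured radiation to the mean SARAH radiation, pooled over the selected plants, and is applied multiplicatively. All names and sample values are hypothetical.

```python
from collections import defaultdict


def month_hour_means(samples):
    """samples: iterable of (month, hour, value) -> {(month, hour): mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, hour, value in samples:
        sums[(month, hour)] += value
        counts[(month, hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def correction_factors(measured, sarah):
    """Assumed definition: K = mean(measured) / mean(SARAH) per (month, hour)."""
    measured_means = month_hour_means(measured)
    sarah_means = month_hour_means(sarah)
    return {
        key: measured_means[key] / sarah_means[key]
        for key in measured_means
        if sarah_means.get(key, 0.0) > 0.0  # skip night hours and missing keys
    }


def apply_correction(sarah, factors):
    """Scale each SARAH sample by its month-hour factor (1.0 when undefined)."""
    return [(month, hour, value * factors.get((month, hour), 1.0))
            for month, hour, value in sarah]


# Hypothetical July noon samples pooled from two plants (W/m2).
measured = [(7, 12, 660.0), (7, 12, 600.0)]
sarah = [(7, 12, 500.0), (7, 12, 700.0)]

k = correction_factors(measured, sarah)
corrected = apply_correction(sarah, k)
print(k[(7, 12)], corrected)
```

The example also hints at the limitation reported above: a pooled factor moves both plants in the same direction, so it cannot help when their individual errors have opposite signs.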

Discussion
This section presents a discussion on the relevance of this study as well as the impact that certain limitations may have on the analysis.
First of all, when using the energyPRO models, relevant differences have been observed between the directly measured heat outputs and the values obtained through the model using the measured solar radiation. It can be argued that a higher accuracy of the model could have been obtained if the theoretical parameters of the different equipment, such as the efficiency of the collectors or the losses in the transmission systems, had been further analyzed and corrected with empirical values. Nevertheless, since the analysis has been done in a comparative manner, it still provides relevant results as to how different solar radiation inputs impact the performance of the mathematical models, by observing their sensitivity to changes in the radiation.
From the results obtained, it is observed how in the yearly operational cost and heat production calculation, the DRY and the SARAH datasets have similar results, with SARAH performing slightly better. However, it may also be observed how the DRY has a higher RMSE than all the other solar data models when using hourly values. Therefore, this has an impact on the calculation of the usage of the storage systems as well as on the intraday operation.
Hence, despite the DRY showing accurate results in terms of yearly calculations, satellite-based or reanalysis-based solar datasets can be argued to be preferable when designing large-scale solar collector systems, since they give more accurate hourly models of how the system would operate. If storage or transmission systems were to be designed, more accurate hourly values would be helpful in obtaining a correctly designed installation.
Regarding the correction factor analysis, it can be argued that suitable correction factors could have been found if a higher resolution or different methods had been used. The goal of identifying correction factors was to validate SARAH and try to identify its error patterns in order to improve its performance. To do so, the methodology of using monthly hourly profiles was selected, identifying several K factors using secondary data from up to 18 plants with similar characteristics. This was done to obtain a more global and relevant result, as well as to possibly identify trends regarding when the errors had a higher impact. However, these errors do not have a common denominator and were different in each case, and hence could not be corrected using the values of several plants. It could be argued that using more data, with more plants and more years, could have given better results. A possible future work would be to collect more data from more plants, ensure its validation, and identify trends that were not found in this study. This could be done, for instance, using machine-learning-based algorithms or by analyzing in more detail how the solar data model estimates the solar radiation.
Finally, throughout the whole article, it has been observed how the behavior of the solar radiation in Hvide Sande does not fit with that of the other two plants, nor with the expected results from the solar data model analysis or the data from solvarmedata.dk. Compared to the other plants, the values are consistently higher, especially in peak hours, while none of the solar data models predicts that. This higher radiation is also not reflected in the heat output measurements, as observed in the efficiency plots (Fig. 12). The reason for this increased solar radiation has not been identified, and the main hypothesis is a malfunction of the measuring system, which is recommended to be tested, although alternative causes could be related to surface characteristics, the albedo effect, or elements reflecting onto the collectors.

Conclusion
The main goal of this study was to fill the knowledge gap regarding the impact of using solar data models in large-scale solar thermal systems rather than in PV cases. The same trends as in those cases have been observed, with satellite-based models like SARAH performing better than the reanalysis-based ones (ERA5, CFSR, and MERRA2) in terms of RMSE and MBE, but also in heat production results and operational cost predictions.
Substantial differences in the operational cost of the DH plants associated with these systems can occur when using some of the less accurate data models, such as CFSR or MERRA2, reaching almost 660k DKK (90k €) yearly in some of the cases. Although it is hard to compare plants with different sizes and configurations, the importance of using accurate solar data models for more precise modelling of large-scale solar thermal plants is clear.
The DRY shows similar results to the SARAH data model in terms of MBE when using aggregated data, but has the worst performance when analyzing hourly operation. When accounting for the yearly heat production or operational cost, both models have similar errors. Hence, the DRY model appears to be accurate for yearly approximations, but inaccurate when observing the hourly behavior of the plant. Therefore, to have a better understanding of how the plant would behave at the hourly level, the use of the SARAH data model is recommended.
With the data used, it was not possible to obtain correction factors that could be used to reduce the errors in the SARAH data model using month profiles. More data and different methods would be necessary to identify possible improvements to this model. However, this shows that the SARAH solar data model is a robust dataset and can be recommended for use in the planning of large-scale solar thermal plants integrated in DH systems where no measured radiation data is available.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

2.1.2.2. SARAH-2. SARAH-2, further referred to as SARAH, is a satellite-based solar data model. It is the second version of the Surface Solar Radiation Data Set - Heliosat and a product of EUMETSAT's Satellite Application Facility on Climate Monitoring (CM SAF) [9]. This climate data record has information regarding surface irradiance, as well as direct irradiance and sunshine

Fig. 1. Location of the three study-case plants in Jutland, Denmark. As can be seen, Ringkøbing and Hvide Sande are quite close (15 km) whereas Saeby lies on the east coast at around 200 km away.

Fig. 2. Mean biased error between the different years without correcting the offset and correcting it with the mean and maximum method.The dotted line represents the MBE between 2016 and 2017 where there is no offset.The ref indicates which year is used as the reference.

Fig. 4. Horizontal radiation day-type month profile of June, calculated using the hourly average of the 4 years of data for each hour of the month.

Fig. 5. Horizontal radiation from the measured tilted radiation as well as the solar data models for a week in June 2017 at the three plants.

Fig. 6. Root Mean Squared Error of solar radiation between the different data models and the plants for different resolutions using 4 years of data. The circle corresponds to Ringkøbing, the cross to Saeby, and the triangle to Hvide Sande.

Fig. 7. Mean Biased Error of solar radiation between the different data models and the plants using 4 years of data.

Fig. 8. Measured heating output from the solar collectors divided by solar area during a week in June in the three plants.

Fig. 9. Heating output divided by solar area day-type month profile of June, calculated using the hourly average of the 4 years of data for each hour of the month.

Fig. 10. Root Mean Squared Error of heat output between the different data models and the plants for different resolutions using 4 years of data. The circle corresponds to Ringkøbing, the cross to Saeby, and the triangle to Hvide Sande.

Fig. 11. Mean Biased Error of heat output between the different data models and the plants using 4 years of data.

Fig. 12. June day-type efficiency measured in the plants (solid line) and calculated from the energyPRO models (dashed line) for each of the three plants using 4 years of data.

Fig. 13. Yearly solar heat production difference between the measured values and the simulated ones, using the measured radiation (measEP) and the different solar data models.

Fig. 15. Horizontal radiation day-type month profile of July, calculated using the hourly average of the 4 years of data for each hour of the month for all the sites analyzed.

Fig. 16. Duration curve of the different sites, using the measured values as the X-axis and the SARAH ones as the Y-axis.

Table 1
Main characteristics of the three study case plants. Note that the Saeby plant was expanded in 2019, increasing to 37,213 m² and 25.7 MW, but only the old plant is included in this study.

Table 2
Summary of the main characteristics of the solar data models used.

Table 3
Yearly solar heat production in absolute values [MWh] measured directly in the plant, simulated using the measured radiation (measEP) and with the different solar data models as inputs.

Table 4
Yearly operational cost difference in x1000 DKK between the simulations using the measured solar radiation and the different solar data models in absolute values.

Table 5
Tables showing the RMSE, MBE, error in heat produced from solar collectors (hs) and the error in the operational cost calculation in Ringkøbing and Saeby for the different K parameters used.