Exploring machine learning to predict the pore solution composition of hardened cementitious systems



Introduction
In-depth knowledge of pore solution composition is critical for several areas related to cementitious materials and their durability. For example, the pore solution of concrete plays a vital role in the corrosion behavior of steel in reinforced structures [1][2][3][4][5]. The high alkalinity of the pore solution (pH 12.5-14) promotes steel passivation by creating a protective barrier of iron (hydr)oxides that inhibits active corrosion. This is one of the main reasons why a relatively non-noble metal such as carbon steel can be used to produce durable reinforced concrete structures, even in corrosive exposure conditions [6]. Moreover, the pore solution composition influences how this passivity may break down during the service life of a structure [7]. The pore solution also plays a critical role in other deteriorative processes such as alkali-aggregate reaction [8,9], carbonation [10], sulfate attack [11], salt damage [12,13], and microbially-induced concrete deterioration [14]. Furthermore, the chemical composition of the pore solution can be used to predict its electrical resistivity, which is required for calculating the formation factor of concrete, a parameter that provides critical information on its pore structure [15,16]. In recent years, the formation factor has become an essential parameter in reactive-transport models for service-life predictions of concrete structures [15], because it can be used to determine key transport properties of concrete such as chloride diffusion coefficients [17] and water permeability [18].
Thus, there is a need to evaluate the chemical composition of concrete pore solutions, either experimentally or with the help of predictive models. A common approach is to extract pore solution from hardened cement paste, mortar, or concrete samples by applying pressure to the sample [19,20], followed by chemical analysis of the extracted solution. However, this process is not always straightforward. Extracting an adequate amount of pore solution from samples that are mature and/or have a low water-to-cementitious materials ratio, whether they come from laboratory specimens or from existing structures, is not always possible or practical [21]. Alternative experimental techniques such as cold [21] or hot [22] water extraction or alkaline leaching [23][24][25] typically introduce other inaccuracies, as these techniques heavily depend on factors such as temperature, fineness of the sample particles, and duration [26]. Modeling approaches have also been used to determine the pore solution composition; some examples include Taylor's prediction model [27], the algorithms developed by the National Institute of Standards and Technology (NIST) [28][29][30], and thermodynamic modeling [31]. However, some of these modeling approaches involve assumptions that cannot be easily generalized to cover the diversity of materials, binder kinetics, and curing conditions, and others require input data that is not always available. The problem with these modeling approaches becomes more pronounced as cementitious mixtures have become more complex due to the partial replacement of Portland cement (PC) with supplementary cementitious materials (SCM) [32][33][34][35][36], as demonstrated by Vollpracht et al. [37]. These blended systems have distinct reaction mechanisms that influence the pore solution composition [36,37]. For example, SCM can decrease the pore solution alkalinity and pH buffer capacity due to dilution effects, the consumption of Ca(OH)₂ during pozzolanic reactions, and increased alkali uptake in hydrated phases [31,37,38]. Because SCM show a high degree of variability in reactivity and chemical composition [39], and accurate methods of quantifying reactivity are lacking, the existing modeling approaches are not always reliable in predicting pore solution composition.
Machine learning (ML) algorithms have the potential to address many of these challenges [40,41].ML is an approach through which the computers are trained to learn from data, understand trends, and make predictions based on this information [42].The theoretical basis of ML models relies on statistics, optimization methods, training, and validation tests.ML is especially useful for modeling complex problems when the role of each input variable (henceforth called "feature") is not well understood or cannot be easily delineated from other features.For instance, ML has been applied to estimate the compressive strength of concrete, which is affected by several factors such as chemical composition and reactivity of constituent materials, mixture proportions, curing conditions, and the age of testing [40,[43][44][45][46][47][48][49][50].With similar thinking, it can be hypothesized that ML is also suitable to predict the pore solution composition in cementitious materials, as this problem is another case in which a multitude of factors controls the final result.If successful, ML models may be an alternative for laborious pore solution extraction and analysis experiments, and other prediction approaches.
This study explores the use of ML algorithms to predict the pore solution composition of hardened cementitious systems produced with PC and SCM. For this purpose, a database of pore solution compositions has been established from the literature and is used for algorithm training and evaluation. The accuracy of the ML algorithms is also compared to traditional modeling approaches such as Taylor's model, the NIST algorithm, and thermodynamic modeling.

Pore solution composition database
Data on pore solution composition of cementitious matrices were compiled from 34 peer-reviewed sources published over the last 40 years (1981-2021) [23,26,.The complete database is available in [83].In this database, 310 different mixtures were collected; however, not all cases cover all ions that are part of this investigation.The materials consist of PC + (1 or 2) SCM in binary or ternary systems.The work presented here excludes systems younger than 28 days because of their highly transient behavior at early ages.
Table 1 describes the parameters considered in the database, comprising mixture proportioning, binder characteristics, curing conditions, and the pore solution extraction method.The nomenclature of the features discussed here is provided in Table 2, while detailed information on all features can be found in a companion paper on Data in Brief [84].In the evaluated literature, information on mixture proportioning, extraction methods, and age were commonly reported; the chemical composition of the cements was frequently given but not always complete.The curing conditions and specific details on extraction and analysis techniques were less frequently reported, while materials reactivity or amorphous content was rarely provided in the literature.
The systems included in the database were filtered to eliminate cases that are known to introduce inconsistencies. Specifically, the following cases have been excluded: (a) studies that focused on relative, rather than absolute, concentrations; and (b) PC or SCM with high P₂O₅ content (>0.50 wt%), as these systems have been reported to have mixing issues that artificially affect the pore solution [85]. To expand the data containing sulfur species, SO₄²⁻ and total sulfur-containing ions were combined into a single category. Moreover, when clinker phase contents were not reported, they were calculated with the Bogue equations where suitable, following ASTM C150 [86].
In studies that did not report the measured OH⁻ directly, OH⁻ activity or concentration was calculated from the measured pH (Eq. (1)) or charge balance (Eq. (2)), when possible. This approach was also used by some of the studies in the database that reported the OH⁻ concentration.
Calculating OH⁻ concentration through Eq. (1) has a pitfall because the ion activity is usually (70-80 %) lower than the ionic concentration in solutions of high ionic strength such as the ones in cementitious materials [15,[87][88][89][90]. The interchangeable use of activity (a_OH⁻) and concentration ([OH⁻]), as is commonly done in the literature, may thus introduce errors in the interpretation of pore solution data. It should be mentioned, however, that the literature extensively applies the measured pH to calculate OH⁻ concentration through this assumption. Therefore, the database inevitably contains values from both approaches for determining OH⁻ concentration (i.e., indirectly from pH measurement and directly from OH⁻ titration), with limited room for correcting the inconsistencies. We ignored this potential inconsistency in the reported OH⁻ concentration. Nevertheless, to assess the potential impact of this common practice on the literature and thus on our database, a feature representing the OH⁻ analysis technique was included in the database ("OH_method"). Concerning Eq. (2), note that charge balance calculations were not performed if the reported data did not contain at least Na⁺ and K⁺ concentrations. Eqs. (1) and (2) were used to fill gaps in database entries where the original study gave no direct information on OH⁻. This was done (i) to increase the number of data points for analysis, as ML methods benefit from a large dataset, and (ii) to cover different methods and make the model more general, reflecting a broader set of studies.
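Because Eqs. (1) and (2) are not reproduced in this text, the following minimal sketch only illustrates the two calculations as they are described above. It assumes that Eq. (1) treats activity as concentration with pK_w = 14, and that Eq. (2) is an electroneutrality balance over Na⁺, K⁺, Ca²⁺, and SO₄²⁻; the function names and the mmol/L units are illustrative choices, not part of the original study.

```python
def oh_from_ph(ph: float) -> float:
    """Approximate OH- in mmol/L from a measured pH.

    Assumption: activity is treated as concentration and pKw = 14,
    which is exactly the simplification criticized in the text.
    """
    return 10.0 ** (ph - 14.0) * 1000.0


def oh_from_charge_balance(na: float, k: float,
                           ca: float = 0.0, so4: float = 0.0) -> float:
    """OH- in mmol/L from electroneutrality (all inputs in mmol/L).

    Assumed form: [OH-] = [Na+] + [K+] + 2[Ca2+] - 2[SO4 2-].
    Na+ and K+ are mandatory, mirroring the rule stated in the text.
    """
    return na + k + 2.0 * ca - 2.0 * so4


print(round(oh_from_ph(13.0), 1))   # 100.0
print(round(oh_from_ph(12.3), 1))   # 20.0
print(oh_from_charge_balance(na=100.0, k=300.0, ca=1.0, so4=10.0))
```

The two pH values reproduce the correspondences used later in the text (pH 13 for 100 mMol and pH 12.3 for about 20 mMol).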
More information on the database's content is provided in Section 3.

Machine learning (ML) algorithm
This study implemented the ML algorithms through improved decision tree regressors: Random Forest (RF) and Extra Trees (ET), also called Extremely Randomized Trees. Decision tree algorithms create rules to predict outcomes by finding patterns in the data [91,92]. These rules split groups in a hierarchical process, in which each rule represents a node of a whole tree [92]. RF and ET regressors are adaptations of traditional decision trees [93,94]. They are strong ensemble models that combine predictions from independent decision trees, as schematized in Fig. 1. Features are distributed over several decision trees, each with its own train/test process, and the results are then averaged into a robust prediction [41,93]. Despite their similarity, RF and ET differ in some aspects [94,95]. RF trains each tree on a bootstrap sample of the data, while ET considers the whole database. In RF, the features and split points of the nodes are optimized, whereas ET chooses them randomly; hence, it is dubbed "Extremely Randomized". In terms of prediction metrics, ET models have a lower variance than RF models; ET is also computationally faster.
The RF and ET regressors were selected after a first evaluation of several established algorithms regarding their prediction accuracy, with successful results also for other ensembled decision tree models (as Gradient Boosting and Light Gradient Boosting).In a more in-depth evaluation, the top features influencing RF and ET regressors provided the most reasonable interpretation of the working mechanism of the models, in accordance with concrete science theory, thus justifying their selection for this study.The need for complex decision tree models for a fair prediction of pore solutions implies the intrinsic complexity of this task.
The ML model was implemented with PyCaret [97], a low-code Python library based on scikit-learn [98] and shared as open code [99]. The predefined model parameters were n_estimators = 80 (number of individual decision trees whose predictions are averaged) and max_depth = 8 (maximum depth of a tree, i.e., the maximum number of successive splits), while the ML regressors automatically chose the others. This study did not intend to exhaust the optimization potential of ML algorithms but rather to explore the possibility of applying them to predict pore solution composition.
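As an illustration of the two regressors with the stated hyperparameters, the sketch below calls scikit-learn directly rather than PyCaret (a substitution made here for brevity; the study itself used PyCaret). The synthetic features and target are placeholders and have no relation to the pore solution database.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic stand-in data (assumption): 200 samples, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Hyperparameters from the text; everything else is left at library defaults.
models = {
    "rf": RandomForestRegressor(n_estimators=80, max_depth=8, random_state=0),
    "et": ExtraTreesRegressor(n_estimators=80, max_depth=8, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    # RF bootstraps the training rows by default; ET does not.
    print(name, "bootstrap =", model.bootstrap,
          "| trees =", len(model.estimators_))
```

In scikit-learn, the default `bootstrap` setting reflects the difference described above: each RF tree sees a resampled copy of the data, while each ET tree sees the full training set and draws its split thresholds at random.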
First, ML models are trained on a subset of the database, and then the model is tested on another subset; this iteration leads to optimized prediction rules [42]. The flowchart in Fig. 2 illustrates the algorithm applied to the database. The ML model was trained on the logarithm of the ionic concentration (log[ion]), so that the back-transformed predictions remain positive even when the concentrations are small. The original database was first divided into 85 % of the data for modeling and 15 % for the final validation. Five-fold cross-validation was used to mitigate the limitations of a small database. These data splits depend on the random seeds used as input to the model, which define how the systems are divided into training, testing, and validation sets. Once finalized, the model was applied to predict the ionic concentrations of the data set aside at the beginning.
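The workflow described above (log-transformed target, 85/15 split, five-fold cross-validation, final prediction on the hold-out set) can be sketched as follows. This is a simplified scikit-learn version under stated assumptions: the data are synthetic, the seed value is arbitrary, and a base-10 logarithm is assumed because the text does not specify the base.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score, train_test_split

seed = 42  # arbitrary; the study varied the random seed between models
rng = np.random.default_rng(seed)

# Synthetic stand-in data: strictly positive "concentrations" in mmol/L.
X = rng.normal(size=(300, 6))
conc = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(scale=0.2, size=300))
y_log = np.log10(conc)  # model the logarithm of the concentration

# 85 % for modeling, 15 % held out for the final validation.
X_model, X_hold, y_model, y_hold = train_test_split(
    X, y_log, test_size=0.15, random_state=seed
)

model = ExtraTreesRegressor(n_estimators=80, max_depth=8, random_state=seed)

# Five-fold cross-validation on the modeling subset only.
cv_r2 = cross_val_score(model, X_model, y_model, cv=5, scoring="r2")
print("CV R2 per fold:", np.round(cv_r2, 3))

# Final fit, then back-transform the hold-out predictions to concentrations.
model.fit(X_model, y_model)
pred_conc = 10.0 ** model.predict(X_hold)
print("all predictions positive:", bool((pred_conc > 0).all()))
```

Because the predictions are back-transformed with a power of ten, they cannot become negative, which is the practical benefit of modeling log[ion].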
The predictions were compared to the ionic concentrations reported in the literature. The prediction accuracy was quantified by four metrics described in Table 3: (i) the coefficient of determination (R²), (ii) the root mean square error (RMSE), and (iii, iv) the portions of predictions with a maximum relative error of 10 % and 25 % around the measured value (E10 and E25, respectively).
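A minimal sketch of the four metrics is given below. It assumes that all metrics are evaluated on concentrations (not on their logarithms) and that the relative error is taken with respect to the measured value; Table 3 may define details differently. The example values are invented for illustration.

```python
import math


def prediction_metrics(measured, predicted):
    """Return R2, RMSE, E10, and E25 for paired concentration values."""
    n = len(measured)
    mean_m = sum(measured) / n
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    # Relative error of each prediction with respect to the measured value.
    rel_err = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "E10": sum(e <= 0.10 for e in rel_err) / n,  # share within 10 %
        "E25": sum(e <= 0.25 for e in rel_err) / n,  # share within 25 %
    }


measured = [100.0, 200.0, 300.0, 400.0]    # invented values, mmol/L
predicted = [105.0, 230.0, 290.0, 520.0]
metrics = prediction_metrics(measured, predicted)
print(metrics["E10"], metrics["E25"])      # 0.5 0.75
```

RMSE carries the unit of the concentration, whereas E10 and E25 are dimensionless fractions and can be compared across ions of very different magnitude.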
To optimize the models, hyperparameters and random seeds were varied.From 200 models (100 for each type of regressor), the 20 best ones were selected.Eq. ( 3) guided this choice, since R 2 and the portion of predictions with max. 10 % and 25 % relative error are positively correlated with the prediction accuracy.
Additionally, the models were evaluated by analyzing their working mechanisms. The feature importance ranking of each regressor was investigated, taking into account the scores for the 20 best models. This ranking indicates the most relevant factors that ML identified for predicting the pore solution composition. In this study, the built-in feature importance method of PyCaret was applied. It is based on the mean decrease in impurity, which evaluates how decisive each feature is in splitting the nodes of the decision trees, weighted by the number of samples that reach each node. A drawback of this method is that it may overrate high-cardinality features; this was minimized by normalizing the database with the z-score method.
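The ranking procedure can be sketched with scikit-learn, whose tree ensembles expose impurity-based importances through the `feature_importances_` attribute. The example below is a simplified stand-in: it averages the importances over five seeds instead of the 20 best models, and the data and feature names (`f0` to `f3`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Synthetic features on very different scales; only f0 drives the target.
X = rng.normal(size=(300, 4)) * np.array([1.0, 50.0, 0.1, 5.0])
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=300)

# z-score normalization, as applied to the database in the study.
X_z = StandardScaler().fit_transform(X)

# Collect impurity-based importances from several models (different seeds).
importances = np.array([
    ExtraTreesRegressor(n_estimators=80, max_depth=8, random_state=seed)
    .fit(X_z, y)
    .feature_importances_
    for seed in range(5)
])

mean_imp = importances.mean(axis=0)
std_imp = importances.std(axis=0)
ranking = [names[i] for i in np.argsort(mean_imp)[::-1]]
print(ranking, np.round(mean_imp, 3), np.round(std_imp, 3))
```

The importances of each model sum to one, so the scores express relative weight within a model rather than a physical quantity.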
As an additional investigation, a subset of features was selected: approximately 5-8 out of the original 85 features considered by the ML model. This choice was based on the feature importance ranking of the comprehensive models and on cement and concrete science principles. The feature importance rankings are displayed in Fig. 8, while the theory behind the most influential features is discussed in Section 4.2.1. The features included in the models are listed in Table 4, where SCM 1 and SCM 2 were combined into a single SCM whose chemical composition is the weighted average of the two. Thus, only a reduced number of features were used as input to the ML regressors. The aim was to simplify the data required for pore solution prediction, since not all cement characteristics are equally easy to obtain, and to reduce possible noise in the predictions originating from minor features included in the model. It should be noted that the selection of features is not a strict process and could be further explored and narrowed down in more advanced ML models.
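The merging of SCM 1 and SCM 2 into a single SCM can be illustrated as follows. The sketch assumes that the weights are the mass fractions of each SCM in the binder and that an unreported oxide counts as zero; both are assumptions made for illustration, and the compositions are invented.

```python
def merge_scm(frac1, comp1, frac2, comp2):
    """Merge two SCMs into one with a mass-weighted average composition.

    frac1, frac2: SCM contents in the binder (% by mass, assumed weights).
    comp1, comp2: dicts mapping oxide name to content (% by mass).
    Returns the total SCM content and the merged composition.
    """
    total = frac1 + frac2
    if total == 0:
        return 0.0, {}
    merged = {
        oxide: (frac1 * comp1.get(oxide, 0.0)
                + frac2 * comp2.get(oxide, 0.0)) / total
        for oxide in sorted(set(comp1) | set(comp2))
    }
    return total, merged


# Invented example: 20 % fly ash plus 5 % silica fume in the binder.
total, merged = merge_scm(
    20.0, {"SiO2": 50.0, "Al2O3": 25.0},
    5.0, {"SiO2": 95.0, "Al2O3": 0.5},
)
print(total, merged)
```

For this example the merged SCM makes up 25 % of the binder, with an SiO₂ content of (20 × 50 + 5 × 95) / 25 = 59 %.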

Comparison with Taylor's and NIST prediction models
ML predictions were compared with Taylor's model and the NIST algorithm.Taylor's method [27] proposes the prediction of Na + and K + based on the alkali oxides available in PC and fly ash.The dissolution of the cementitious materials is modeled, and it is assumed that part of the alkalis binds to hydration products, and part remains free in the pore solution, according to an equation with empirical parameters.The free water of the pore solution is estimated proportionally to the degree of hydration.Then, the Na + and K + concentrations in free water are calculated.Sulfates are calculated based on Eq. ( 4) [27,28,100].Finally, OH − is calculated by the charge balance of Na + , K + , and SO 4 2− , similar to the approach used in Eq. ( 2) but without Ca 2+ .
The main limitation of Taylor's model is in the assumption of parameters such as the dissolution constants for the materials and the amount of alkalis present in sulfates and clinker.The constants proposed in Taylor's study [27] apply primarily to older cements that reacted slower than the modern ones.Also, the only SCM included in the model is fly ash.The lack of knowledge on fly ash reactivity and the adoption of statistic averages further limits the model.
The NIST algorithm [28][29][30] uses information on the water/binder ratio, PC and SCM amount (silica fume, fly ash, and slag), alkalis, and silica content of the systems, and the total degree of hydration. Here, the degree of hydration was estimated by the Modified Parrot-Killoh (MPK) model, detailed elsewhere [101][102][103], because the original source of the data did not provide this information. Note that the MPK model was not validated for cements blended with slag, with both pozzolanic and latent hydraulic behavior; therefore, for these slag-based systems, the degree of hydration of PC was assumed for the entire cementitious system. The NIST model considers the immediate dissolution of 75 % of alkalis and their uptake by the reaction products, which is increased in the presence of silica fume. The overall calculation process is similar to Taylor's model.

Table 3 (recovered rows)
E10: Portion of the predictions with maximum 10 % relative error on the ionic concentrations reported in the literature.
E25: Portion of the predictions with maximum 25 % relative error on the ionic concentrations reported in the literature.

Table 4
Features selected to predict the pore solution composition.
For the comparison, ML models were created following the same steps described in Section 2.2, but now with a reduced database accounting only for the ions OH⁻, Na⁺, and K⁺, because Taylor's and NIST models are limited to these ions. To compare with Taylor's model, 130 systems with PC and fly ash were available for consideration, while for NIST, 200 systems with PC, silica fume, fly ash, and slag could be used. The group of systems reserved for predictions was analyzed with the statistical metrics of Table 3. These metrics apply to samples of the database chosen according to the same random seeds initially used in the ML model to split off 15 % of the data for validation. ML models can handle missing data, whereas the other two models cannot. In ML, missing numeric features were replaced by the mean values of the dataset, while missing categorical features were assigned to a "NaN" category [97]. Unknown information on cement characteristics (mostly alkali content and clinker phases) made it impossible to calculate the pore solution composition with Taylor's model and the NIST algorithm in some cases, which reduced the size of their samples.
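The handling of missing data described above can be sketched in plain Python. This is an illustration of the stated rule only (mean for numeric features, a "NaN" category for categorical ones), not PyCaret's internal code; the rows and the "extraction" feature name are hypothetical.

```python
from statistics import mean


def impute(rows, numeric_cols, categorical_cols):
    """Return copies of the rows with missing values (None) filled in."""
    # Column means are computed from the reported values only.
    col_means = {
        col: mean(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    filled = []
    for row in rows:
        new_row = dict(row)
        for col in numeric_cols:
            if new_row[col] is None:
                new_row[col] = col_means[col]
        for col in categorical_cols:
            if new_row[col] is None:
                new_row[col] = "NaN"  # explicit category for missing labels
        filled.append(new_row)
    return filled


rows = [  # hypothetical database entries
    {"PC_K2O": 0.8, "extraction": "pressure"},
    {"PC_K2O": None, "extraction": None},
    {"PC_K2O": 0.4, "extraction": "pressure"},
]
filled = impute(rows, ["PC_K2O"], ["extraction"])
print(filled[1])
```

Taylor's model and the NIST algorithm have no comparable fallback, which is why systems with unreported alkali contents or clinker phases had to be dropped from their samples.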

Comparison with thermodynamic modeling
The prevalent pore solution ions (OH⁻, Na⁺, K⁺) of characteristic systems were predicted with thermodynamic modeling and compared to the best ML models based on databases with all features and with only selected features. The thermodynamic calculations were done using GEMS3K [104], which minimizes the Gibbs free energy of the reaction products for a given set of input compositions. The CemData v18.01 [105] database was used in conjunction with GEMS3K to calculate the reaction products of cementitious systems. The formation of some of the carbonate-ettringite, hydrotalcite, and hydrogarnet phases was blocked, based on evidence from the literature showing that these phases are frequently not observed to form in substantial quantities in cementitious systems at ambient temperatures (<60 °C) [62,[106][107][108]. The evaluated systems consist of PC with replacements of fly ashes class C and F (20 and 40 %), metakaolin (10 and 20 %), and silica fume (4 and 8 %), at different typical maximum degrees of reactivity (DOR*) and ages of 28 and 56 days, according to Table 5. Thus, a total of 12 systems was assessed for each SCM. The range of DOR* of each SCM was assumed according to its type, following the statistical data reported by Bharadwaj et al. [39]. For silica fume and metakaolin, simulations were performed at DOR* of 50 %, 70 %, and 90 %; for fly ashes, simulations were performed at DOR* of 20 %, 40 %, and 60 %. The amounts of phases that react at any given time (i.e., 28 days and 56 days) were calculated using the MPK model [101][102][103], which accounts for the kinetics of blended cementitious mixes. The same model provided the degrees of hydration of PC at 28 and 56 days as ~77 % and ~82 %, respectively. The phase assemblage of the systems is typical of blended cementitious materials.
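The count of 12 systems per SCM follows from combining two replacement levels, three DOR* values, and two ages. The short enumeration below makes this explicit, using the values quoted above; the dictionary layout and variable names are illustrative only.

```python
from itertools import product

# Replacement levels (% of binder) and DOR* values (%) quoted in the text.
replacement = {
    "fly ash class C": (20, 40),
    "fly ash class F": (20, 40),
    "metakaolin": (10, 20),
    "silica fume": (4, 8),
}
dor = {
    "fly ash class C": (20, 40, 60),
    "fly ash class F": (20, 40, 60),
    "metakaolin": (50, 70, 90),
    "silica fume": (50, 70, 90),
}
ages_days = (28, 56)

# One tuple per simulated system: (SCM, replacement, DOR*, age).
systems = [
    (scm, level, reactivity, age)
    for scm in replacement
    for level, reactivity, age in product(replacement[scm], dor[scm], ages_days)
]

per_scm = {scm: sum(1 for s in systems if s[0] == scm) for scm in replacement}
print(per_scm)        # 12 systems for each SCM
print(len(systems))   # 48 systems in total
```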

Overview of pore solution database
The database contained 310, 210, 214, 182, and 150 entries for OH⁻, Na⁺, K⁺, Ca²⁺, and sulfur species concentrations, respectively. As shown in Fig. 3, the database had a large number of pure PC systems (35 %), while the main SCM were silica fume, fly ash, and slag (in increasing order of replacement level). A small number of systems (17) also contained limestone or quartz powder.
The analytical techniques applied in the literature to measure ion

Table 5
Assumed mix proportioning and composition of systems for pore solution prediction with ML models and thermodynamic modeling.

The distribution of all the reported ion concentrations is presented in Fig. 5, clearly showing that OH⁻, Na⁺, and K⁺ are the prevalent ions of the pore solution, as expected [31]. This is the primary reason why most studies using simulated concrete pore solution mainly consider these ions. When a more comprehensive analysis of pore solution composition was performed, other ions (e.g., Al, Fe, Mg, Si species) were generally found at much lower concentrations [23,52,[64][65][66]72,76,80,109,110]. This may be seen as justification for only considering OH⁻, Na⁺, K⁺, Ca²⁺, and ionic sulfur species, as done in the reviewed literature and in this study.
Occasionally, outlier concentrations were observed, e.g., OH⁻ lower than 20 mMol, which corresponds to systems with pH values lower than 12.3, with a minimum pH of 10.4. Such systems are primarily cementitious materials in which a large share of PC was replaced by silica fume, fly ash, and slag, sometimes combined in ternary blends and evaluated at later ages, from 90 to 730 days [23,54,57,70,81,111], as well as cements with low alkali content such as white PC [112]. These systems might be cases of Ca(OH)₂ depletion that resulted in undersaturated pore solutions with extremely low OH⁻ concentration [33].
Regarding minor ions, Ca²⁺ and especially sulfur compounds had wide concentration ranges. Low Ca²⁺ concentrations are expected due to the low solubility of Ca(OH)₂, which keeps it precipitated as an alkaline reserve in the form of portlandite crystals. Adopting a solubility product K_sp of 6.5·10⁻⁶ for Ca(OH)₂, its maximum solubility in pure water is 11.8 mMol, at a pH of 12.37. However, the Ca²⁺ concentration is expected to be even lower because OH⁻ is also supplied by NaOH and KOH (common-ion effect). For example, assuming a pre-existing [OH⁻] of 100 mMol, associated with pH 13, the maximum solubility of Ca(OH)₂ would be reduced to 0.63 mMol. In the database, this 100 mMol OH⁻ concentration was exceeded in 70 % of the systems, which would reduce the solubility of Ca(OH)₂ even further. In contrast, only 16 % of these systems had a Ca²⁺ concentration below the 0.63 mMol limit. This first-order estimate suggests that 84 % of the pore solutions may show inconsistencies in the measurement of Ca²⁺ ions, with reported Ca²⁺ concentrations higher than they should have been. Some experimental errors could be related to the storage conditions before testing and the acidification of solutions to avoid fast precipitation of Ca²⁺ [37].
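The solubility figures quoted above can be reproduced with a short calculation. The sketch assumes ideal behavior (activity coefficients of 1) and pK_w = 14, and solves K_sp = s·(c + 2s)² for the dissolved Ca(OH)₂ concentration s, where c is the OH⁻ concentration already present from NaOH and KOH. The function name and the bisection solver are choices made here for illustration.

```python
import math


def caoh2_solubility_mmol(oh0_mmol: float, ksp: float = 6.5e-6) -> float:
    """Ca(OH)2 solubility in mmol/L for a given pre-existing OH- level.

    Solves ksp = s * (c + 2 s)**2 for s by bisection, where c is the
    pre-existing OH- concentration (mol/L). Ideal solution assumed.
    """
    c = oh0_mmol / 1000.0
    low, high = 0.0, 1.0  # the left-hand side increases monotonically with s
    for _ in range(200):
        mid = 0.5 * (low + high)
        if mid * (c + 2.0 * mid) ** 2 < ksp:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high) * 1000.0


s_water = caoh2_solubility_mmol(0.0)
ph_water = 14.0 + math.log10(2.0 * s_water / 1000.0)
s_alkaline = caoh2_solubility_mmol(100.0)

print(round(s_water, 1))     # 11.8 mmol/L in pure water
print(round(ph_water, 2))    # 12.37
print(round(s_alkaline, 2))  # 0.63 mmol/L with 100 mmol/L OH- present
```

The 0.63 mMol value includes the small amount of OH⁻ contributed by the dissolving Ca(OH)₂ itself; neglecting it (s ≈ K_sp / c²) would give 0.65 mMol.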
Furthermore, Na⁺ and K⁺ were often indicated as a simple sum in the literature, assuming that NaOH and KOH provide most of the OH⁻ in the system [113]. When this assumption was checked against the reported pore solution compositions, a wide scatter was obtained for the ratios shown in Fig. 6, based on charge balance calculations. This graph includes all systems for which all the ions appearing in the respective equation (shown on the y-axis, adapted from Eq. (2)) were reported in the original studies, for a total of 210, 172, and 90 systems for the three methods of charge balance calculation, respectively. The average ratio was equal to 1, and the outlier points were mainly systems with high concentrations of Ca²⁺ and sulfur. The large divergences in charge balance suggest uncertainties in the analytical measurements or missing information on other ions that were not experimentally measured [37].
In summary, a large database for the pore solution composition of cementitious materials was collected from the literature.Some imperfections have been identified in the data, mainly due to the experimental errors underlined in this section.Thus, the database itself is not completely accurate and contains uncertainties, which should be considered when assessing the effectiveness of the ML models that are based on the database.

Results and discussion
After the initial analysis of the ion concentrations, ML was applied to predict the pore solution composition of the cementitious materials.Models were created considering i) all features collected from the literature and ii) the selected features listed in Table 4. Finally, comparisons with literature prediction models are presented.

Prediction accuracy
The prediction metrics for the comprehensive models based on all features (Table 1) are shown in Fig. 7a, while the results for the models based on selected features (Table 4) are presented in Fig. 7b.Generally, ET and RF regressors shared the best models and no significant difference between them can be claimed.Only a slight advantage was observed for the ET regressor for the prediction of OH − and sulfur species, according to Fig. 8.The similar performance of the models agrees with their theoretical reasoning since both are ensembled decision tree regressors.
High prediction accuracy was obtained for the prevalent ions OH − , Na + , and K + .R 2 varied from 0.90 to 0.97; approximately 50 % of predictions had up to 10 % relative error, and 75-90 % had relative errors lower than 25 %.The average RMSE was equal to 80, 20, and 45 mMol for OH − , Na + , and K + , respectively.The relatively high RMSE values may be explained by the greater absolute concentration of these ions when compared to Ca 2+ and sulfur species.Given that the database covered a range of systems that do not necessarily perform similarly (as discussed in Section 3), we believe that the trained ML models can well predict the concentrations of OH − , Na + , and K + in the pore solutions.
On the other hand, the prediction of the minor ions was not equally reliable. Average values of R² were 0.53 and 0.83 for Ca²⁺ and sulfur, respectively, and the average RMSE was 6 mMol for both. The portion of predictions also showed this trend: only 10 % to 30 % of predictions had up to 10 % relative error, and about 30-50 % were within the 25 % relative error range. This indicates that the prediction of minor ions was more challenging than for the prevalent ions. The two levels of prediction accuracy for the pore solution ions might be attributed to the difference in absolute concentration and the database size for each group of ions. Ca²⁺ and sulfur species at low concentrations may exhibit larger scatter and uncertainties in the original literature data than species at higher concentrations, since measuring small concentrations in solutions is challenging [114]. Also, the database for OH⁻, Na⁺, and K⁺ has more entries to train and test the ML models in diverse scenarios, making the models more robust and reliable for predicting these ions. Furthermore, the lower RMSE for these minor ions compared to prevalent ions should be seen in the context of their low absolute concentration, since RMSE is not a normalized metric and it mostly provides information on the expected absolute deviation of predictions from true values. For normalized prediction accuracy metrics on deviation, E10 and E25 are recommended.

Fig. 7 (caption fragment). Part (2) presents the results for the models based on the selected features (Table 4). Refer to Table 3 for more details on the metrics.
Fig. 8 (caption fragment). (c,d) the selected features (Table 4), for the different ions.
Concerning the models based on the selected features, Fig. 7b, results had a small decrease in accuracy compared to the models based on all features.The slightly lower accuracy is in accordance with the limited information to make predictions -only up to 10 % of the original features were considered.Even in such conditions, ML models could predict OH − , Na + , and K + with satisfactory accuracy.This suggests that many of the features considered in the original database (Table 1) have a negligible effect on the pore solution composition (see Section 4.2) and might have only disturbed predictions and created noise in the data [41].The simplification of the database and ML models becomes highly advantageous for practical applications, when not all information is easily accessible.
The uncertainties in the ML models should not be seen only as a deficiency of the ML models but rather a mirror of literature.The database of pore solutions was collected from 34 studies that applied different methods and have their own type of intrinsic errors, as well as missing data for some features.Therefore, it becomes indeed complex to determine an accurate fit for such vast data and studies.At the same time, it is an advantage that the models were trained for a general and broad dataset, exploring the literature to reduce bias.ML algorithms still showed a reasonable accuracy for different data and research within such conditions.

Feature importance ranking
Analyzing the feature importance ranking is fundamental to understand how the ML algorithms work.This ranking indicates the most important factors for predicting ion concentrations.It should be noted that the feature importance is a normalized ranking and does not have a physical meaning.Fig. 8a,b shows the top 10 features for the 20 best models comprising the complete set of features from the database, while Fig. 8c,d presents the top 5 ranking for the models based on selected features of Table 4. Fig. 8 is also divided according to the type of regressor (RF or ET), and the value of n indicates how many models among the 20 best ones corresponded to each regressor.
It can be observed that usually very few features dominate the predictions, while the importance scores drop sharply for the remaining features. For predictions of OH⁻, Na⁺, and K⁺, aspects related to the high content of SiO₂ in the SCM were highly important, namely the features "SCM1_type_1" (which is silica fume, meaning a high silica content) and "SCM1_SiO2". Other types of SCM did not present such a relevant effect, suggesting that the pore solution is mainly affected by silica fume. Moreover, the alkali content of PC ("PC_Na2O", "PC_K2O") was relevant, as well as the replacement ratio of PC to SCM. The promotion of both low (alkali content in PC) and high (silica content in SCM) cardinality features implies that, owing to data normalization, the ML models succeeded in overcoming the difference in magnitudes of the features and have not overrated the importance of high-cardinality properties.
In contrast, when predicting Ca 2+ , the feature importance ranking was far less confident on the most important features.This is represented by the gradual decrease in the scores for the top features and the large standard deviations.This lower confidence also reflects on the lower prediction accuracy of ML models for Ca 2+ .Also, the double presence of aluminate containing features (PC_Al2O3 and PC_C3A) was assumed in this selected database due to the high importance attributed to both features in the comprehensive importance ranking.Theoretically, this effect might be caused by the reaction of C 3 A with gypsum (calcium sulfate), the formation of ettringite, and overall concentration of Ca 2+ and sulfur species in solution.The potential redundancy of aluminate phases was evaluated by modeling cases with both or only one of the aluminates, and no significant difference was observed for the prediction accuracy of Ca 2+ ions.This suggests no issues with redundancy decreasing the performance of the ML models.
For sulfur species, "PC_K2O" showed an apparent leading influence on the predictions, and the secondary features showed balanced relative importance. The effect of temperature (feature "C1_T") was relevant for sulfur species, in agreement with [31], due to the increased solubility of ettringite at higher temperatures, which releases sulfur species into the pore solution.
The important features detected by the ET and RF regressors are mostly in agreement. Some slight distinctions could be attributed to the bootstrapping method of RF, which selects a sample of features from the total available and then trains/tests the model on the sampled data [94,95]. This could bias the features observed by the model during training, while ET considers all of them in the model development. Therefore, some less important features could differ between the models.
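The averaging of importance scores over several random seeds, as plotted in Fig. 8, can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn regressors; the helper name `importance_over_seeds`, the number of trees, and the toy data are hypothetical and not taken from this study. Note that in scikit-learn the default difference between the two regressors is that Random Forest fits each tree on a bootstrap sample of the training rows, whereas Extra Trees uses the whole training set with randomized split thresholds.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split


def importance_over_seeds(make_model, X, y, seeds):
    """Return mean and standard deviation of the normalized feature
    importances of models trained with different random seeds."""
    scores = []
    for seed in seeds:
        # Hold out 15 % of the data, mirroring the validation split in the text.
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=0.15, random_state=seed
        )
        model = make_model(seed).fit(X_train, y_train)
        scores.append(model.feature_importances_)  # sums to 1 per model
    scores = np.asarray(scores)
    return scores.mean(axis=0), scores.std(axis=0)


# Synthetic data (assumption): feature 0 dominates, feature 1 is secondary,
# features 2-4 are pure noise.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(300)

seeds = range(5)
rf_mean, rf_std = importance_over_seeds(
    lambda s: RandomForestRegressor(n_estimators=50, random_state=s), X, y, seeds
)
et_mean, et_std = importance_over_seeds(
    lambda s: ExtraTreesRegressor(n_estimators=50, random_state=s), X, y, seeds
)

for name, mean, std in (("RF", rf_mean, rf_std), ("ET", et_mean, et_std)):
    ranking = np.argsort(mean)[::-1]  # most important feature first
    print(name, [(int(i), round(float(mean[i]), 3), round(float(std[i]), 3)) for i in ranking])
```

Because each model's importances are normalized to sum to one, the averaged scores remain a relative ranking without physical meaning, consistent with the remark above.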

Main features
Established knowledge about cement hydration, SCM reactions, and alkali binding [34,36] supports the main features influencing the OH⁻, Na⁺, and K⁺ concentrations in the pore solution. Within this context, emphasis is given to the alkali oxides of PC ("PC_K2O", "PC_Na2O"), the amount of SiO₂ in the SCM ("SCM1_SiO2"), and the relative amount of PC and SCM ("PC" and "SCM").
The alkali oxides Na₂O and K₂O in the cementitious materials directly affect the pore solution through their dissolution into Na⁺ and K⁺ ions. When more alkali oxides are available in the binder materials, more alkalis should remain in the pore solution [27]. However, it should be noted that the alkali content of SCM did not appear as important as the alkali content of PC. This is likely because the SCM content in the systems was not large and the SCM had limited reactivity. Besides, note that the free alkalis in the pore solution could bind to hydrated phases. The formation of these phases is affected by the speed of cement hydration and the curing conditions.

Fig. 9. Statistical metrics for OH⁻ prediction accuracy based on a database with (i) SCM1 and SCM2 and type specified, and (ii) SCM merged and without type definition. Refer to Table 3 for more details on the metrics (a-d).
It is also clear in Fig. 8 that SCM type and amount ("SCM", "SCM1_type_1", "SCM1_SiO2") affect the pore solution composition significantly. These materials modify the cementitious reactions by means of physical and chemical effects [33,34]. Fine particles can contribute to the filler effect and accelerate cementitious reactions by acting as nucleation sites [35]. The partial replacement of PC with SCM dilutes the calcium-rich clinker content of the blended mixture while increasing the reactive SiO₂ and Al₂O₃ contents. Due to clinker dilution, the hydrated systems typically contain lower Ca(OH)₂ content, and pozzolanic reactions further reduce the Ca(OH)₂ content while producing additional C-S-H and C-(A)-S-H phases. The additional hydrated products will then take up more alkalis from the pore solution. These processes directly influence the pore solution alkalinity in terms of OH⁻, Na⁺, and K⁺ concentrations. This is particularly true for highly reactive SCM such as silica fume, which has a fine particle size distribution and large SiO₂ content. As a result, "SCM1_type_1" (silica fume) was identified as a crucial feature in the ML predictions.
SCM reactivity was also expected to be a critical parameter defining the pore solution of cementitious systems because it indicates the amount of SCM actively participating in the pozzolanic reactions. However, the experimental data in the collected literature did not report the quantified reactivity of the SCM; therefore, it was not possible to properly consider this as part of the ML features. Since the reactivity of SCM is highly dependent on the SCM type [39], we hypothesize that the ML algorithm could indirectly identify typical reactivities for the materials through the feature "SCMx_type" in the database, and somehow estimate which are the most reactive materials. That could be associated with the feature "SCM1_type_1", which is in the top ranking of comprehensive models based on all features.
To test this "indirect reactivity hypothesis", we performed an additional analysis of the prediction accuracy, namely by omitting the SCM type. This was performed in the models for OH⁻ using the database with selected features, by merging both SCMs of ternary systems into a single SCM and not including information on the SCM type. Fig. 9 shows the results for the 20 best models, obtained analogously to the procedure described in Section 2.2. Statistical metrics suggest a slight decrease in prediction accuracy when the SCMs were combined and their type was not specified. This can be most clearly observed for the portion of predictions with max. 25 % relative error. Indeed, as discussed above, SCM type seems to have at least a small positive effect on the prediction accuracy of the ML models. The small effect of SCM type might also be influenced by the data used in the training and predictions. SCM type might show higher importance if the systems included in the analysis had a higher SCM replacement and reactivity.
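One possible way to build such a merged, type-free record is sketched below. The text does not state how the oxide contents of the two SCMs were combined, so the mass-weighted average used here is an assumption, as are the helper name `merge_scm`, the key "SCM2_SiO2", and the example values; only the feature names "PC", "SCM", "SCM1_SiO2", and "PC_K2O" follow the nomenclature quoted above.

```python
def merge_scm(record):
    """Collapse SCM1 and SCM2 of one mixture into a single SCM entry and
    drop the SCM type information (hypothetical helper).

    Assumption: oxide contents are merged as an average weighted by the
    replacement level of each SCM.
    """
    f1 = record.get("SCM1", 0.0)
    f2 = record.get("SCM2", 0.0)
    total = f1 + f2

    # Keep every feature that does not belong to an individual SCM.
    merged = {
        key: value
        for key, value in record.items()
        if not key.startswith(("SCM1", "SCM2"))
    }
    merged["SCM"] = total
    if total > 0:
        merged["SCM_SiO2"] = (
            f1 * record.get("SCM1_SiO2", 0.0) + f2 * record.get("SCM2_SiO2", 0.0)
        ) / total
    else:
        merged["SCM_SiO2"] = 0.0  # plain PC system without SCM
    return merged


# Illustrative ternary mixture (made-up numbers, in wt%).
ternary = {
    "PC": 70.0,
    "PC_K2O": 0.9,
    "SCM1": 20.0,
    "SCM1_type": "silica fume",
    "SCM1_SiO2": 90.0,
    "SCM2": 10.0,
    "SCM2_type": "fly ash",
    "SCM2_SiO2": 45.0,
}
merged = merge_scm(ternary)
print(merged)
```

Training the same regressors once on the original records and once on the merged ones, with identical seeds, would then isolate the contribution of the type information to the prediction accuracy.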
On the other hand, since the SCM reactivity can vary significantly within each type of SCM [39], not having SCM reactivity as an explicit feature results in inaccuracies and uncertainties in the ML predictions. This problem, however, is not unique to ML but exists for all modeling approaches, including thermodynamic modeling, when the quantified SCM reactivity is not known for the modeled materials and has to be assumed.

Minor features
Some features, in fact most of the considered features, were identified here to play a minor if not negligible role in the prediction of the pore solution composition (Fig. 8). This could be because the feature in question indeed has a small effect on the pore solution composition. However, it should be borne in mind that the "quality" of the database may be another reason why some features show a lower influence on the pore solution. By "quality" we refer to aspects such as completeness and representativeness of the database as well as the reliability of the individually collected data. For instance, the data distribution (e.g., how frequently the features assume certain values) becomes important here, mainly when the features are represented by only a few systems. For example, ternary mixtures constituted only 11 % of the systems considered for training the models on OH⁻ ions. This has to be considered when interpreting the finding of the ML models that the features associated with the second SCM (SCM2) were not important in the predictions.
Another feature with a minor effect on the prediction was the "matrix_type" feature. Different matrices (pastes, mortars, or concrete) did not significantly affect the pore solution composition. While this ML-based finding is in agreement with experiments reported in the literature [113], we cannot exclude that the limited variability of the database explains why the matrix type did not seem relevant for the predictions. Overall, 90 % of the pore solutions collected in the database referred to pastes, which may compromise a fair statistical comparison for this feature. This again supports the need for expanding the database to test and train the ML model.
The age of the samples also showed a minor influence on the prediction. This may be explained by the fact that we restricted the pore solution prediction to systems with an age of at least 28 days. First exploratory attempts (not reported here) with databases including ages as early as a few hours showed a greater influence of the sample age. It is well known that after 28 days, most PC hydration reactions have already developed [33], and this may explain the moderate effect of sample age in the Portland cement based systems investigated here. Additionally, the SCM included in the database might have had a low reactivity and, therefore, did not significantly change the systems even at older ages. However, as very limited information on SCM reactivity is given in the studies collected in the database, we cannot test this hypothesis. It should be noted that the age distribution of the database is fairly spread, with 30 % of the pore solution systems at exactly 28 days, 35 % between 29 and 91 days, 33 % up to 720 days, and the remaining ones older than 720 days, up to 16 years. Thus, the age distribution of the database is not considered to have biased the feature importance ranking presented above.
The ML-based analysis of the database suggests that methods to extract pore solutions are among the minor features as well. This is represented by the "Extraction_x" and "Pressure_MPa" features, which did not appear in the top rankings of feature importance. Some experimental studies have also observed a negligible influence of extraction pressure [51,113], which supports this ML-based finding. However, in other individual studies under fixed conditions, features like the pressure applied to extract the pore solution have been reported to influence the pore solution composition by expelling more ions from the cementitious system [31,115]. In addition, the literature shows that leaching methods do not necessarily provide the same pore solution concentrations as extractions with a pressure apparatus [26,115,116]. The current finding based on ML, namely that the extraction methods were not identified to be important features, should thus be interpreted with caution. This is because the majority of the data (93 %) compiled in the database was obtained with a pressure apparatus, which could bias the interpretation of ML algorithms.
Furthermore, the OH⁻ analysis method did not significantly influence the results, as expressed by the absence of "OH_method" from the top feature importance rankings of Fig. 8. This is somewhat surprising, as the pH calculated from OH⁻ could be different from the measurements of a pH meter, even for the same study. For example, there were cases with pH 12.99 when calculated from OH⁻ (obtained by charge balance) while the pH measured with a pH meter was only 12.08 [80]. Similarly, in another case, OH⁻ obtained using titration with HCl led to a calculated pH of 13.11 while the pH measured by a pH meter was only 12.88 [60]. Vollpracht et al. [37] explain that the OH⁻ concentration could be overestimated if obtained from acid titration of the pore solution because other complexes (KOH or CaOH⁺) are also counted. The direct pH measurement is based on the activity of the ions, which is generally lower than their concentration. Such examples create doubts about the accuracy of the OH⁻ concentrations contained in the database. However, this feature (OH⁻ analysis method) seems not to dramatically affect the predictions. The methods used to measure the pore solution composition in the studies collected in our database (Fig. 4) consist of several different techniques. Therefore, the unbalanced distribution of the data is not considered a concern. However, the random noise in the data due to measurement uncertainties might have prevented the ML algorithms from identifying clear trends for each analytical method.
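To give a sense of scale for these pH differences, the conversion between OH⁻ concentration and pH can be sketched as below. This is a simplified illustration, not the procedure of the cited studies: it assumes pKw = 14 (about 25 °C) and lets the user supply an activity coefficient `gamma` (1.0 means concentration equals activity); the charge-balance expression assumes that only Na⁺, K⁺, Ca²⁺, and SO₄²⁻ accompany OH⁻, which are the ions listed for the charge balance in Fig. 6. All function names are hypothetical.

```python
import math


def oh_from_charge_balance(na, k, ca, so4):
    """OH- concentration (mol/L) from electroneutrality, assuming the only
    other ions are Na+, K+, Ca2+ and SO4 2- (all in mol/L)."""
    return na + k + 2.0 * ca - 2.0 * so4


def ph_from_oh(oh, gamma=1.0, pkw=14.0):
    """pH from an OH- concentration; gamma is the assumed activity coefficient."""
    return pkw + math.log10(gamma * oh)


def oh_from_ph(ph, gamma=1.0, pkw=14.0):
    """Inverse of ph_from_oh: OH- concentration implied by a pH value."""
    return 10.0 ** (ph - pkw) / gamma


# A 0.1 mol/L OH- solution corresponds to pH 13 under these assumptions.
print(ph_from_oh(0.1))

# The gap between pH 12.99 and pH 12.08 quoted above corresponds to a
# factor of 10**0.91 (roughly 8) in OH- concentration.
print(oh_from_ph(12.99) / oh_from_ph(12.08))
```

Under these assumptions, a difference of less than one pH unit already translates into a several-fold difference in the OH⁻ concentration, which illustrates why the choice of analysis method could matter for the database values.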
An additional point is that clinker phases were not found to be relevant for capturing PC reactivity in the prediction of the prevalent ions, which does not fulfill our original hypothesis. Thus, mineralogy information was not included in the selected database for the prevalent ions, excluding a potential redundancy of having both chemical and mineralogical data in the models.
Concerning sulfur species, no significant distinction was observed between the concentrations of sulfate (SO₄²⁻) and total sulfur species (SO₄²⁻, S²⁻, S₂O₃²⁻). For cements without slag, Lothenbach et al. [64] reported a ±20 % variation comparing SO₄²⁻ and all sulfur species. This might already be within the error range of the ML model for sulfur. Also, this is related to the fact that sulfur species in the pore solution mainly consist of SO₄²⁻.

Comparison with Taylor's and NIST prediction models
The ML models for predicting the pore solution concentrations of OH⁻, Na⁺, and K⁺ were compared to models available in the literature: Taylor's model [27] and NIST calculations [28,29]. For Taylor's model, the systems only included fly ash as SCM, whereas NIST calculations supported silica fume, fly ash, and slag. The metrics of the selected 10 best ML models (blue dots for ET and red triangles for RF) were compared to Taylor's results (black squares in the first column) and NIST's results (green diamonds in the second column) and are presented in Fig. 10. These metrics apply to samples of the database, divided according to the same random seeds initially used in the ML model to split off 15 % of the data for validation.
Comparing the statistical metrics, the ML models were clearly advantageous over both Taylor's model and the NIST algorithm. When looking at all cases, represented by the scattered black squares and green diamonds in Fig. 10, Taylor's and NIST results were widely spread out. This suggests that the metrics strongly depend on the subsample groups selected from the database and used for comparison with the subsamples selected for the evaluation of ML models (the samples of 15 % of data shown in Fig. 2). In a few cases, some results from Taylor's model were similar to the best ML predictions, while others were significantly worse. The R² of some sampled data from Taylor's and NIST models was negative, suggesting a substantial divergence from the experimentally measured pore solution composition.
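A negative R² arises whenever a model's squared errors exceed those of simply predicting the mean of the measured values. The short sketch below illustrates this together with the RMSE and the fraction of predictions within a relative-error tolerance (as in the 25 % criterion used above). The function names and the concentration values are made up for illustration and are not taken from the database.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fraction_within(y_true, y_pred, rel_tol=0.25):
    """Share of predictions whose relative error does not exceed rel_tol."""
    hits = sum(abs(p - t) <= rel_tol * abs(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Illustrative OH- concentrations in mol/L (assumed values).
measured = [0.20, 0.30, 0.40, 0.50]
close_model = [0.22, 0.30, 0.55, 0.50]   # small scatter around the data
biased_model = [0.60, 0.70, 0.80, 0.90]  # correct trend, large systematic offset

for name, pred in (("close", close_model), ("biased", biased_model)):
    print(name, r2_score(measured, pred), rmse(measured, pred),
          fraction_within(measured, pred))
```

The biased example reproduces the trend of the data perfectly yet yields a strongly negative R², because its constant offset makes it worse than the mean of the measurements. A systematic offset of this kind is one way a generalized empirical parameter set could produce negative R² values.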
The lower accuracy of Taylor's and NIST models may be explained as follows. Standard parameters described in Taylor's study [27], such as alkali contents in clinker and sulfate phases, were used in the predictions because they were not reported for the systems in the database. This puts Taylor's and NIST models at a disadvantage since they are based on theoretical equations for the dissolution and reaction rate of the materials, with empirical parameters generalized for several materials. The average literature values adopted in 1987, when Taylor's model was proposed, are certainly not the best fit for cements studied over the last 30 years, which have a finer grain size distribution, more reactive phases in PC (such as C₃S as opposed to C₂S), and higher replacement levels of SCM with improved properties. All these parameters could influence the hydration reactions and affect the pore solution as well. This limitation of theory-based models can also be seen as an asset for the ML models, which do not require sophisticated information on cement dissolution and reactivity of SCM but can still provide reasonable predictions.
The promising results of ML models might also derive from the higher adaptability of their predictions to the existing data. Despite the apparent high accuracy, the ML models remain very sensitive to the reliability of the input data, which should be carefully taken into consideration. In this regard, it should again be borne in mind that experimental data is imperfect and to some extent dependent on the actual methods used (compare Section 3), and thus related errors may have a strong influence on the predictions. In summary, ML models adapt better to experimental data from different studies, while Taylor's and NIST models are more theory-based. As seen in the predictions, the ML models captured the concentrations of the main ions OH⁻, Na⁺, and K⁺ well. The data-driven concept behind ML models also makes them more widely applicable and versatile than the theory-based models.

Comparison with thermodynamic modeling
Fig. 11 shows the comparison of ML and thermodynamic modeling on the prediction of the pore solution composition of typical cementitious materials, with different replacement amounts, SCM reactivities, and ages. Note that no comparison with experimental data is possible in this case. Systems of PC with fly ashes (FA-C and FA-F) and metakaolin (MK) showed comparable results, but thermodynamic modeling provided slightly more alkaline compositions. For the systems with PC and silica fume (SF), in Fig. 11d, while the Na⁺ predictions were comparable, the ML models predicted lower concentrations for OH⁻ and K⁺ than the thermodynamic model. The difference in OH⁻ and K⁺ for the systems containing SF could be attributed to the fact that SF was assumed to be highly reactive (50 %, 70 %, 90 %) in the thermodynamic model, and the reactivity of SCM was not a feature used in the ML simulations. However, it should be noted that in most cases, unlike the comparison here, the reactivity of SCM would not be available, and thermodynamic simulations would have to be done without this information. In addition, ML models attribute high importance to the silica content of SCM, and SF was rich in SiO₂ (91 wt%). This combination of factors might have led to the lower pore solution concentration predicted by ML for the systems with PC and SF.
It should be noted that thermodynamic predictions for OH⁻, Na⁺, and K⁺ concentrations are highly influenced by the activity and alkali binding models used in the simulations. Therefore, it is not straightforward to use the thermodynamic predictions for these ionic concentrations as a benchmark for comparison purposes, as they might be inaccurate themselves. This is particularly relevant to systems with highly reactive SCM, such as silica fume. The literature [117] exemplifies the differences between thermodynamic predictions and experimental results on the pore solution composition. Thus, we should bear in mind that thermodynamic predictions provide relevant insights into the pore solution, but are also not perfect.
However, thermodynamic modeling has some advantages as well. For example, when combined with kinetic models like the Parrot and Killoh model [103], thermodynamic modeling can consider the age of the cementitious mixture in the predictions. In addition, when the SCM reactivity is quantified, thermodynamic modeling can incorporate it in the calculations. Since the database that formed the basis for the ML model did not have SCM reactivity as a parameter, the ML model can only incorporate reactivity indirectly through the SCM type, as discussed in Section 4.2.1. Ideally, the database for the ML algorithm should have more data on reactivity, which is lacking in the current literature. This could improve the sensitivity of the ML models to SCM reactivity. At the same time, the good accuracy of ML models even without considering the reactivity is a major advantage of this approach. Since the reactivity of the materials is often not readily available, thermodynamic modeling might not even be an option in such cases.

Conclusion
The pore solution composition (ion concentrations OH⁻, Na⁺, K⁺, Ca²⁺, and sulfur species) of cementitious systems was investigated in this paper, including binary and ternary mixes of PC and SCM. Based on a comprehensive database collected from literature, machine learning (ML) regressors were applied to predict ion concentrations.
The prevalent ions (OH⁻, Na⁺, and K⁺) of the pore solution were successfully predicted, with up to 75-90 % of predictions within 25 % relative error from the reported experimental values. Ca²⁺ and sulfur species had lower accuracy in predictions, up to 50 % within 25 % relative error, which may be explained by the significantly lower concentrations of these species compared to the prevalent ions and the related uncertainties in the literature data.
The most important features identified by the ML models were the silica content of the SCM, the alkali content of the PC, and the SCM replacement level. This finding is in agreement with theory and experimental studies in the literature. Other features, such as the age of the mixtures, pore solution extraction methods, and alkali content of SCM, were not identified as significant features by ML models. However, the sometimes unbalanced data distribution in the collected literature might have influenced the low importance attributed to some of these features.
Finally, a comparison with well-established, theory-based methods for pore solution prediction (Taylor's classic model and the NIST algorithm) showed a higher accuracy of the ML model when applied to this database.
Throughout the analysis of the data compiled here from different literature sources, a number of uncertainties have been identified in the pore solutions reported in publications. Thus, additional care should be taken in experimental measurements of pore solution composition, as well as in reporting complete data in the literature. In this context, the ability of ML to overcome missing information, particularly on the reactivity of the materials, may be considered an advantage over methods like thermodynamic modeling. An example is the hypothesis that ML indirectly incorporates reactivity through the known type of SCM.
In conclusion, there are promising perspectives for employing machine learning as an additional tool to substantiate the prediction of the pore solution composition. The next step to increase the prediction accuracy of these models and enable their application is to explore more ML techniques, such as optimizing the hyperparameters of the algorithms and clustering features. We also emphasize the importance of disseminating open-source databases and codes used or created in research with ML algorithms.

Table 3
Statistic metrics to evaluate the prediction accuracy of pore solution composition.
Metric | Interpretation
R² | The Coefficient of Determination indicates how well the model fits and interprets the data.
RMSE | The Root Mean Squared Error expresses the dispersion of the residuals, i.e., how spread are the prediction errors in relation to the experimental data.
E10

Fig. 3. Mix proportion distribution of a) PC, and SCM for b) binary and c) ternary systems, for which pore solution composition data is available in the literature compiled in our database.

Fig. 4. Analytical techniques employed in the collected literature studies to measure ion concentrations in the extracted pore solution.

Fig. 5. Distribution of ion concentrations of the pore solutions from the collected literature studies.

Fig. 7. Statistical metrics associated with the prediction accuracy of the ML models, classified by their regressor. The scattered points represent the 20 best models created with different random seeds. Part (1) shows the results for the models based on all features (Table 1) and part (2) presents the results for the models based on the selected features (Table 4). Refer to Table 3 for more details on the metrics.

Fig. 6. Ratio of charge balance considering different ions: Na⁺, K⁺, Ca²⁺, and SO₄²⁻ to determine OH⁻. The outlier points represent pore solutions with issues on the charge balance of these ions, which reflects unreliable concentrations in the publications or missing data.

Fig. 8. Feature importance ranking for models based on (a,b) all features (Table 1) and (c,d) the selected features (Table 4), for the different ions (1) OH⁻, (2) Na⁺, (3) K⁺, (4) Ca²⁺, and (5) sulfur species. Only the top 10 or 5 features were plotted. The scores are the average of the 20 best models created with different random seeds, with the error bars equaling the standard deviation. The left column shows the ranking for the Extra Trees (ET) regressor, in blue, and the right column is associated with the Random Forest (RF) regressor, in red. The value n is the number of models of that regressor among the selected 20 best models. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Statistical metrics comparing the prediction accuracy of the ML models with selected important features and (1) Taylor's model, in the first column, and (2) NIST algorithm, in the second column. The two evaluations consider the pore solution composition of cementitious materials with Portland cement and fly ash, and, additionally for NIST calculations, also silica fume and slag. For ML models (ET and RF regressor), the scattered points represent the 10 best models created with different random seeds. For Taylor's model and NIST calculations, the scattered points are prediction accuracy results for diverse samples of the entire database.

Fig. 11. Predictions of pore solution composition of cementitious systems produced with Portland cement (PC) and fly ashes (a) class C (FA-C) and (b) class F (FA-F), (c) metakaolin (MK), and (d) silica fume (SF). The pore solutions were predicted with thermodynamic modeling and ML algorithms based on all features and on selected features.

Table 1
Data collected on the pore solution composition and cementitious materials.

Table 2
Description of the main features nomenclature.