Validation of ambulatory monitoring devices to measure energy expenditure and heart rate in a military setting

Objectives. To investigate the validity of different devices and algorithms used in military organizations worldwide to assess physical activity energy expenditure (PAEE) and heart rate (HR) among soldiers. Design. Device validation study. Methods. Twenty-three male participants serving their mandatory military service completed, firstly, nine different military-specific activities indoors, and secondly, a normal military routine outdoors. Participants simultaneously wore an ActiHeart, Everion, MetaMax 3B, Garmin Fenix 3, Hidalgo EQ02, and PADIS 2.0 system. The PAEE and HR data of each system were compared to the criterion measures MetaMax 3B and Hidalgo EQ02, respectively. Results. Overall, the recorded systematic errors in PAEE estimation ranged from 0.1 (±1.8) kcal.min−1 to −1.7 (±1.8) kcal.min−1 for the systems PADIS 2.0 and Hidalgo EQ02 running the Royal Dutch Army algorithm, respectively, and in the HR assessment ranged from −0.1 (±2.1) b.min−1 to 0.8 (±3.0) b.min−1 for the PADIS 2.0 and ActiHeart systems, respectively. The mean absolute percentage error (MAPE) in PAEE estimation ranged from 29.9% to 75.1%, with only the Everion system showing an overall MAPE <30%, but all investigated devices reported overall MAPE <1.4% in the HR assessment. Conclusions. The present study demonstrated poor to moderate validity in terms of PAEE estimation, but excellent validity in all investigated devices in terms of HR assessment. Overall, the Everion performed among the best in both parameters. Because it is worn on the upper arm, the Everion system is particularly useful during military service, as it does not interfere with other relevant equipment.


Introduction
Armed forces worldwide monitor the demands and activities performed by their personnel in different military occupations (Rosendal et al 2003, Pihlainen et al 2014, Wyss et al 2014, Friedl 2018, Buller et al 2021). This allows commanders to make data-based decisions about military performance tasks, missions and injury prevention. When physical demands are monitored in a military setting, data on distance covered on foot, heart rate (HR), energy expenditure (EE), and skin or core temperature are commonly assessed and analyzed (Wyss and Mader 2011, Pihlainen et al 2014, Buller et al 2021). Many different commercial or self-developed objective monitoring devices are available. They claim to assess the same parameters; however, their estimates are based, for example, on HR data, on acceleration data, or on a combination of both, obtained on the upper arm, on the chest, or on the hip with different technology in each case (Brage et al 2007, Wyss et al 2012, Burrell et al 2016, Buller et al 2021). The major limitation of these devices is the limited knowledge about data validity and reliability for measuring physical demands in a military setting, or their limited comparability (Crouter et al 2008, Dannecker et al 2013, Lee et al 2014, Sperlich and Holmberg 2017, Friedl 2018, Bent et al 2020, Carrier et al 2020). Yet, to adhere to and to benefit from data monitoring, a basic prerequisite is to validate technology against a criterion measure, ideally through a comparison of multiple systems simultaneously, in the environment and during activities of the user group (Sperlich and Holmberg 2017). Validity is especially important when decisions based on these data may affect soldier health and safety. Knowledge about the accuracy of a system can then be used to determine whether it is good enough for the intended purpose.
Therefore, the aim of the present study was to simultaneously investigate the validity of different devices and algorithms used in military organizations worldwide to assess EE and HR among soldiers.

Methods
Twenty-three male participants (age: 20.9±2.5 years, weight: 76.0±11.6 kg, height: 1.8±0.1 m, body mass index: 24.2±2.5 kg m−2, estimated peak oxygen consumption (Wyss et al 2007): 50.3±4.3 ml kg−1 min−1) serving their mandatory military service at a Swiss Army Infantry Training School participated in this study. Participation was voluntary and all participants provided their written informed consent. This study was performed in accordance with the principles of the Declaration of Helsinki and was approved by the local Ethics Committee of Northwest and Central Switzerland (Project-ID: 2016-01842 Switzerland;Firmware 30.03.2017, beta-version). Each device was compared to the criterion devices Hidalgo EQ02, for HR data, and MetaMax 3B (Cortex Biophysik GmbH, Leipzig, Germany; Firmware 2.8.6, July 2017), for EE data, respectively. The portable breath-by-breath gas analyzer MetaMax 3B was shown to measure metabolic demands reliably and has been used previously to assess the validity of wearables designed to monitor EE (Vogler et al 2010, Duking et al 2020). The Equivital LifeMonitor EQ02 is a multiparameter body-worn system including electrocardiography and demonstrated good HR validity (Akintola et al 2016). To estimate EE, the Garmin Fenix 3 and the PADIS 2.0 had to be combined with a HR monitor. As the available space around the chest region was a limiting factor for the number of investigated devices, the HR monitor TICKR X (Wahoo Fitness, Atlanta, USA) was chosen because it was simultaneously compatible with both devices (Bluetooth and ANT+ sending mode). The sampling rate within all devices was set to 30 s. The ActiHeart was worn on the chest with two self-adhesive electrocardiogram electrodes, the Everion on the left upper arm, the Hidalgo EQ02 holster and the TICKR X strap around the chest and the Garmin Fenix 3 on the left wrist (figure 1). Two PADIS 2.0 sensors were worn, one on the right hip and one on the right side of the backpack.
All tested devices as well as the criterion devices were calibrated according to the manufacturer's manual with the original equipment, and the participants' information was entered into the respective user profiles before each individual measurement. Furthermore, each measurement system was used with body location and wear as intended by the manufacturer.
One week prior to the validation measurements, participants completed a self-conducted endurance run at the Infantry Training School. After this initial endurance run, measurements were taken on two different test days at least 24 h apart and within two weeks of each other.
On measurement day one, military activities were performed indoors in laboratory conditions. Firstly, information about the study was verbally repeated. Secondly, body weight (in underwear only) and body height were measured with a calibrated scale and portable stadiometer, respectively (Model 861 and 213, Seca GmbH, Hamburg, Germany), and upper arm (relaxed midbiceps) and chest (during normal breathing excursions) circumferences were obtained using a non-stretch nylon tape measure. Thirdly, the participants were fitted with all devices and put on their military trousers, t-shirt, and boots. Resting EE and HR were assessed for 15 min after an initial 5 min period with the participants lying awake on a nonconductive bed. This was followed by nine activities with increasing intensities (table 1). The activity tasks were office work in a sitting position, cleaning weapon and boots, mopping the floor, walking, self-paced marching with loaded backpack (20.3±1.9 kg) and weapon, lifting and lowering 30 kg loads in 10 s repetitions, lifting and carrying 30 kg loads 20 m with 10 s rest at every turn, shoveling sand for 30 s followed by 8 s rest periods, and running outdoors in running shoes at a steady pace that could be sustained for 60 min (speed 7.9±0.9 km h−1). Each of the nine activities lasted 5.5 min with 2 min breaks between activities. During the recovery phases, the upcoming activity was explained.
On measurement day two, field activities were investigated outdoors during a normal military day routine without interference by the study team. The participants were equipped with the same devices as during measurement day one. Activities differed according to the daily military program and could include a 25 km march, shooting, building combat, setting up a check point, material inspection, classroom lecture, or sports. The daily study duration included 30 min of preparation and 90 min of data collection.
[...] and mean absolute (percentage) errors in heart rate (HR) estimation in beats per minute reported by different devices during laboratory (orange and italic) and field (green and non-italic) measurements. AMEQ=activity module Equivital * algorithms.
Table 1. Activity protocol of measurement day one during the structured tests of military tasks of increasing intensity (indoors).
Intensity level | Activity
1 | Lying awake, no task for 20 min (resting EE and HR)
2 | Sitting and doing office work
2 | Cleaning weapon and boots
2 | Mopping the floor
2 | Walking
3 | Self-paced marching with loaded backpack (20.3±1.9 kg) and weapon
3 | Lifting and lowering 30 kg loads in 10 s repetitions
3 | Lifting and carrying 30 kg loads 20 m with 10 s rest at every turn
3 | Shoveling sand for 30 s followed by 8 s rest periods
4 | Running at a steady pace that could be sustained for 60 min (speed 7.9±0.9 km h−1)
Note. Each activity lasted 5.5 min with 2 min breaks between activities. The task order within intensity levels 2 and 3 was randomly assigned.
Resting EE and HR were averages of measurements obtained between minutes 10 and 14 during the 15 min resting measurement (Compher et al 2006). Peak oxygen consumption was estimated based on the formula of Wyss et al (2007) from the results of the progressive endurance run the week prior to the laboratory measurements. For the analysis of the laboratory tasks, the 30 s EE and HR values were averaged over the 120 s window from seconds 180 to 300 to calculate kilocalories per minute (kcal.min−1) and beats per minute (b.min−1) for each activity. For the field activities, the average 30 s EE and HR over the 90 min measurements were calculated for each device and compared to the respective criterion device. The oxygen consumption and carbon dioxide expiration values were used to estimate EE based on the formula by Peronnet and Massicotte (1991). In addition, based on the HR data from the Hidalgo EQ02, two further formulas for EE estimation in soldiers were investigated and validated: firstly, the algorithm of Obesense (Gilgen-Ammann et al 2017), and secondly, the algorithm activity module Equivital (AMEQ) developed and in use by the Royal Dutch Army. Some devices recorded total EE (Everion, Garmin Fenix 3, MetaMax 3B, and Obesense) and others recorded physical activity EE (PAEE). In the present study, PAEE was used. For the devices recording total EE, PAEE was computed as total EE minus resting EE. After each measurement, the data of each device was downloaded with the respective software and the raw data was exported as an Excel file (Windows 2013, Microsoft Corporation).
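The windowing and PAEE steps described above can be sketched as follows. This is a minimal illustration, not the authors' processing pipeline: the function names `window_mean` and `paee_from_total` are hypothetical, and the sketch assumes that the 30 s epochs of an activity are stored in order, with epoch i covering seconds 30·i to 30·(i+1).

```python
from statistics import mean

EPOCH_S = 30  # sampling interval stated in the Methods


def window_mean(epochs, start_s=180, end_s=300, epoch_s=EPOCH_S):
    """Average the epochs that lie inside [start_s, end_s).

    Assumption (not from the source): epochs[i] covers the interval
    [i * epoch_s, (i + 1) * epoch_s) of the activity.
    """
    first = start_s // epoch_s
    last = end_s // epoch_s
    window = epochs[first:last]
    if len(window) != last - first:
        raise ValueError("recording is too short for the requested window")
    return mean(window)


def paee_from_total(total_ee, resting_ee):
    """PAEE for devices that report total EE: total EE minus resting EE."""
    return total_ee - resting_ee


if __name__ == "__main__":
    # Eleven 30 s epochs of total EE (kcal/min) for one 5.5 min activity (made-up values).
    total_ee_epochs = [3.0, 4.1, 5.0, 5.6, 6.0, 6.2, 6.4, 6.5, 6.5, 6.6, 6.6]
    activity_total = window_mean(total_ee_epochs)  # mean of epochs 6 to 9
    print(paee_from_total(activity_total, resting_ee=1.2))
```

With the default arguments, four epochs (120 s) enter the average, which matches the 180 to 300 s window given in the text. The same `window_mean` call could be applied to HR epochs.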
In case of technical difficulties that led to missing data of a reference device, the entire measurement was excluded from the PAEE (5/42, 12%) or HR (3/42, 7%) analysis. In case of missing data in an investigated device (<4%), the missing data was interpolated applying the expectation maximization method (Blankers et al 2010).
Statistical analyses were performed using IBM SPSS Statistics 24 (IBM Corporation, Armonk, NY, United States). The alpha level was set at p<0.05. For PAEE and HR data, the mean absolute errors (MAE) and mean absolute percentage errors (MAPE), Bland-Altman analyses with corresponding systematic biases and 95% limits of agreement (Bland and Altman 1986), Pearson correlations, and % accuracy were calculated. For the % accuracy, a percentage of the PAEE data within ±20% and of the HR data within ±5% from the criterion values was deemed meaningful (Schweizer and Gilgen-Ammann 2018, Gilgen-Ammann et al 2019). Furthermore, univariate analyses with least significant difference post-hoc tests were used to detect activities with significant influence on the MAE in the PAEE and HR estimations, respectively.
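For readers who want to reproduce the agreement statistics named above outside SPSS, the following sketch computes them for paired device and criterion values. It is an illustration under stated assumptions, not the authors' analysis: the function name `validity_metrics` is hypothetical, the limits of agreement use the standard bias ± 1.96 SD form for independent pairs, and repeated measurements per participant are not modelled.

```python
import math


def validity_metrics(device, criterion, tolerance=0.20):
    """Agreement statistics for paired device and criterion values.

    tolerance is the relative band for % accuracy (0.20 for PAEE, 0.05 for HR).
    Criterion values must be non-zero and both series must have some variance.
    """
    if len(device) != len(criterion) or len(device) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    n = len(device)
    diffs = [d - c for d, c in zip(device, criterion)]

    # Mean absolute error and mean absolute percentage error.
    mae = sum(abs(e) for e in diffs) / n
    mape = 100.0 * sum(abs(e) / abs(c) for e, c in zip(diffs, criterion)) / n

    # Bland-Altman: systematic bias and 95% limits of agreement (bias ± 1.96 SD).
    bias = sum(diffs) / n
    sd = math.sqrt(sum((e - bias) ** 2 for e in diffs) / (n - 1))
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)

    # Share of estimates inside the relative tolerance band around the criterion.
    hits = sum(1 for e, c in zip(diffs, criterion) if abs(e) <= tolerance * abs(c))
    accuracy = 100.0 * hits / n

    # Pearson correlation coefficient.
    mean_d = sum(device) / n
    mean_c = sum(criterion) / n
    cov = sum((d - mean_d) * (c - mean_c) for d, c in zip(device, criterion))
    var_d = sum((d - mean_d) ** 2 for d in device)
    var_c = sum((c - mean_c) ** 2 for c in criterion)
    r = cov / math.sqrt(var_d * var_c)

    return {"mae": mae, "mape": mape, "bias": bias, "limits": limits,
            "accuracy": accuracy, "r": r}


if __name__ == "__main__":
    # Made-up PAEE values in kcal/min, for illustration only.
    result = validity_metrics([4.5, 6.5, 9.0, 13.0], [5.0, 5.0, 10.0, 10.0])
    for name, value in result.items():
        print(name, value)
```

Calling the function with `tolerance=0.05` gives the ±5% accuracy band used for HR. The bias is defined here as device minus criterion, so a negative value means the device underestimates.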

Results
Complete day 1 laboratory measurements were obtained for all 23 participants but only 19 participants completed day 2 field measurements due to withdrawal from the study or assignment transfer. From these, useable data was obtained for laboratory (n=22) and field (n=15) PAEE validation, and laboratory (n=22) and field (n=17) HR validation.
For all laboratory and field measurements combined, the average reference value for PAEE was 4.6±1.3 kcal.min−1 (range of mean 0.3-13.9 kcal.min−1) and for HR 108.9±12.7 b.min−1 (range of mean 72.6-153.2 b.min−1). The recorded systematic errors (limits of agreement) in PAEE estimation ranged from 0.1 (±1.8) kcal.min−1 to −1.7 (±1.8) kcal.min−1 for the PADIS 2.0 and AMEQ, respectively, and in the HR assessment ranged from −0.1 (±2.1) b.min−1 to 0.8 (±3.0) b.min−1 for the PADIS 2.0 and ActiHeart, respectively (table 2). The MAPE in PAEE estimation ranged from 29.9% to 75.1% for the Everion and the [...] (figure 2, HR). For the laboratory and field measurements separately, the MAE and MAPE in PAEE and HR estimations are presented in figure 1. The univariate analyses revealed significant differences in the PAEE estimation depending on the activity type (F(9, 1488)=95.53, p<0.01). However, no single activity or activity level (i.e. intensity) could be shown to increase MAE in the PAEE estimation. Hence, each investigated device has its own advantages and disadvantages depending on activity type, and these differ between devices. Bland and Altman plots that illustrate the best and the worst system, respectively, estimating PAEE with a distinction by activity types are presented in figure 3. For the HR estimation, the univariate analysis resulted in F(10, 236)=18.63 (p=0.084). Yet, the running activity resulted in significantly higher MAE in HR estimation than all other activity types. Extended tables of device results are presented in an online appendix for PAEE and HR estimations (supplemental tables (available online at stacks.iop.org/PMEA/42/085008/mmedia)).
Figure 2. Relative deviation (%) of the investigated systems compared with the reference device MetaMax 3B for physical activity energy expenditure (PAEE) and Hidalgo EQ02 for heart rate (HR) data. The red lines indicate the proposed equivalence zone (±20% of the mean in the PAEE and ±5% in the HR); the boxplots' lower and upper boundaries indicate the 25% and 75% quantiles of the data, respectively, and the middle notch indicates the median data value. The whiskers include all the data points that fall within the 1.5 interquartile range of the 25% and 75% quantile values. Circles and stars indicate distant data points that lie beyond the 1.5 and 3 interquartile ranges, respectively. AMEQ=activity module Equivital * algorithms.

Discussion
The present study investigated the validity of different devices and algorithms used in military organizations worldwide to assess PAEE and HR among Swiss Army soldiers. Data outputs were assessed in both structured tests of military tasks of increasing intensity (indoors) and during a normal day of routine military activities (outdoors). Our results showed that the Everion and the PADIS 2.0 were the most precise devices in estimating PAEE overall, albeit with systematic errors of −0.6 and 0.1 kcal.min−1, MAPE of 29.9% and 37.5%, correlations of r=0.841 and 0.814, and 70.3% and 66.7% of all PAEE estimations within ±20% of the reference values, respectively. In terms of HR measurements, all investigated devices demonstrated very good validity with a systematic error of −0.1 b.min−1 in the PADIS 2.0, a MAPE of 0.3% in the AMEQ, excellent correlations from r=0.993 to 0.999, and 3/5 devices having 100% of their HR data within ±5% of the reference values.
In terms of PAEE estimation, similar error rates and differences between investigated devices and intensity levels were observed elsewhere. Roos et al (2017) validated three commercially available sport watches and corresponding chest belts estimating EE and reported MAPE of 10%-42% when assessed during low- and moderate-intensity running. Also, a MAPE of 20.6%, ranging from 9.1% to 31.4%, in the PAEE estimation was observed in the Polar Vantage M compared to the MetaMax 3B when obtained during various activities from sitting in a chair to accomplishing a floorball course (Gilgen-Ammann et al 2019). On average, 59.5% of the mean PAEE values were accurate to within 20%, which is comparable to some of the systems investigated in the present study (Everion 70.3%, PADIS 2.0 66.7%, ActiHeart 45.9%). Notably, the Polar Vantage M and some of the systems in our study involve only a single sensor, whereas other systems consist of two sensors (e.g. PADIS 2.0 or Garmin Fenix 3), which may hamper the usability without improving validity. The proprietary algorithms used in the devices to estimate PAEE are not publicly disclosed and the results can only be assessed with empirical testing such as this experiment (Sperlich and Holmberg 2017, Duking et al 2020). Historically, PAEE has been derived from activity or from HR, but published studies have suggested that a combination of these two measurements, particularly with individual calibration, might produce even better PAEE estimates (Brage et al 2007).
Generally, no activity type was revealed to particularly affect PAEE estimation accuracy in any measurement system. This might be explained by the fact that the devices were placed on different body parts, so that, for example, strong arm movement did not play as significant a role as reported elsewhere (Gilgen-Ammann et al 2019). However, looking at the single systems, error measurements seem to be device- and activity-type dependent (supplemental tables). In line with this, Dooley et al (2017) stated that the examination of overall activities may lead to misinterpretation as differences would cancel each other and show minimal differences compared to a reference value. The existing devices and algorithms used in military organizations worldwide to estimate PAEE have only poor to moderate validity. Notably, only the Everion system showed overall MAPE <30% in the PAEE estimation. Users and commanders must be aware of these errors and take them into consideration when giving instructions or recommendations to personnel based on these systems. During military training, personnel often complete periods of intense exercise leading to high PAEE. Valid monitoring of PAEE is essential to applications involving safe limits of training and the prevention of training injuries (Epstein et al 1988, Edwards et al 2020). It was demonstrated that the mean sustained work intensity of soldiers was close to 50% of their maximal aerobic capacity (Pihlainen et al 2014). This is approximately the upper limit of sustainable effort, equivalent to a physical activity level of 2.25-2.50 or EE of ~4000 kcal.d−1 observed in many other studies (Wyss et al 2012). The devices tested in this study can provide useful quantification of military training intensity and daily workload at a group level but do not appear to be sufficiently accurate and precise for individual guidance.
In terms of HR assessment, the results were excellent in all investigated devices (MAPE <1.4%). This finding was in line with recent research demonstrating high concordance in the HR measurement during various activities between optical HR monitors and the criterion measure Polar H10 chest strap (Schweizer and Gilgen-Ammann 2018). In the present study, running resulted in significantly higher MAE in HR estimation than all the other activity types, which was in contrast to the findings by Schweizer and Gilgen-Ammann (2018), who found no differences, but had the lowest MAE in the running activity. In the present study, ±5% accuracy was between 97% and 100%, when considering the mean values for each activity. Such high accuracy is required, for example, for meaningful prediction of heat strain from algorithms that estimate core body temperature from HR (Buller et al 2013, Buller et al 2021). Also, mental stress and discomfort increase HR without a simultaneous increase in the oxygen consumption, e.g. due to changes in the activity intensity (Lambiase et al 2012). Based on the present data, all investigated systems can be recommended for valid HR measurements during different activities.
When choosing a monitoring system, apart from validity, wearing comfort and feasibility should be taken into consideration as well. From a holistic point of view, all these aspects have an effect on user compliance. Previously, sensors worn around the chest were rated to have a significantly greater negative impact on soldiers' bodies than sensors worn around the wrist and other body parts (Wyss et al 2020). Poor wearing comfort was the most frequently reported negative impact (21.0%), followed by interference with equipment (9.9%), and movement restrictions (7.4%). Considering this information and the present findings, the Everion on the upper arm was found to be the most valid and feasible system to assess PAEE and HR during structured tests of military tasks of increasing intensity (indoors) and during a normal day of routine military activities (outdoors).