Get rid of the beat in mobile EEG applications: A framework towards automated cardiogenic artifact detection and removal in single-channel EEG

Brain activity recordings outside clinical or laboratory settings using mobile EEG systems have recently gained popular interest allowing for realistic long-term monitoring and eventually leading to identification of possible biomarkers for diseases. The less obtrusive, minimized systems (e.g. single-channel EEG, no ECG reference) have the drawback of artifact contamination with varying intensity that are particularly difficult to identify and remove. We developed brMEGA, the first algorithm for automated detection and removal of cardiogenic artifacts using non-linear time-frequency analysis and machine learning to (1) detect whether and where cardiogenic artifacts exist, and (2) remove those artifacts. We compare our algorithm against visual artifact identification and a previously established approach and validate it in one real and semi-real datasets. We demonstrated that brMEGA successfully identifies and substantially removes cardiogenic artifacts in single-channel EEG recordings. Moreover, recovery of cardiogenic artifacts gives the opportunity for future extraction of heart rate features without ECG measurement.


42
Wearable and portable EEG technology emerged in the last decades allowing for brain 43 activity monitoring in a natural setting. Moving from clinical and laboratory settings 44 into in-home settings opens up a vast amount of opportunities for real-life, long-term 45 and large-scale monitoring of brain activity during different vigilance states and 46 conditions, such as during sleep or while performing specific everyday tasks. 47 The cardiogenic artifact in EEG can occur when cardiac electrical fields affect the 66 surface potentials on the scalp 9 . It can arise in all vigilance states, e.g. also during sleep 67 that is otherwise less affected by movement artifacts or eye blinks. The cardiogenic 68 artifact has frequency components in the delta-theta range but due to its non-sinusoidal 69 shape, the artifact is also represented in higher harmonics of the initial band leading to  For (1), to our knowledge, there is very limited work aiming to automatically determine 81 whether cardiogenic artifacts exist when only a single-channel EEG signal is available.  17 . Yet, all the approaches did not provide an application to tackle (1) and only 93 well-controlled recording segments with known ECG artifact presence were included. 94 We emphasize that while (1) seems to ask the same problem as (2), they are essentially Correct removal of artifacts initially requires identification of participants and epochs 115 that are contaminated by EEG. Even though similar EEG montages might be used, not 116 all participants will show artifacts. Furthermore, specifically for long-lasting recordings 117 (e.g. nocturnal sleep period), these artifacts might only temporarily be present and will 118 presumably vary in their magnitude and shape across the recording period. Specifically, 119 for large-scale and long-term recordings that are collected with mobile applications, 120 visual inspection (which is typically involved) is not feasible. tracking algorithm is applied to obtain the peak locations (step 1-3). When cardiogenic 174 artifacts exist, we obtain the cardiogenic artifact locations; otherwise, we obtain random EEG spikes. From steps 1-2 and 1-3, we form four features associated with cardiogenic 176 artifacts, two from the time domain and two from the TFR (step 1-4). We apply the 177 well-established support vector machine algorithm (step 1-5) to learn whether a 4-178 second epoch is contaminated by cardiogenic artifact or not. The flowchart of this part 179 of algorithm is shown in Figure 1.   Table A1).   Based on this model, we apply the optimal shrinkage (OS) to find similar cardiogenic 209 patterns (step 2-1), so that the nonlocal Euclidean median (step 2-2) leads to a less  In order to validate the performance of our brMEGA to accurately define ECG peak 224 locations we included two in-lab measured subjects from one study site that also 225 included ECG as reference recordings. In Figure 3, a representative segment before and 226 after cardiogenic artifact removal is shown, and the simultaneously recorded   In order to quantify and compare the TS performance, we consider the following two 254 indices: the artifact residue (AR) index, which was considered in Malik,et. al. 33 for the  Here we propose the automated framework brMEGA for determination, peak location 300 identification, and removal of cardiogenic artifacts from single-channel EEG. We 301 demonstrate that our algorithm can successfully detect and substantially recover EEG 302 periods with cardiogenic artifacts without the need of an additional ECG derivation.

303
BrMEGA is the first algorithm for automated cardiogenic artifact determination and 304 removal that can be applied to long duration recordings with minimal EEG channel 305 settings. Therefore, it provides a highly attractive artifact removal solution for mobile 306 EEG applications that have gained tremendous impetus in the last decade. and amplitude that need identification with high temporal and frequency resolution (e.g.

314
IHR) such as is the case for cardiogenic artifacts. Therefore, a method is needed that 315 overcomes the limitations of classical spectral analysis. We have recently established 316 that SST can successfully isolate and characterize this kind of signal, even when the 317 signal oscillates in a non-sinusoidal way (e.g. non-sinusoidal signals such as ECG) 28 . 318 Accordingly, the brMEGA detection algorithm included a combination of features 319 extracted from the raw signal and the SST to capture typical cardiogenic signatures.

320
With these features we have trained an SVM to identify cardiogenic artifacts in EEG 321 based on visual expert labelling. Compared to a neuroscience expert, the algorithm 322 achieved a very high specificity (at ~95%) in identifying an artifact that was also 323 identified visually. The sensitivity was slightly lower at around 70%, thus the algorithm 324 identified some epochs as including cardiogenic artifacts in the EEG that were not 325 labelled by the expert. However, this might be biased to the visual capacity of the expert. 326 Artifacts could be less pronouncedly visible, specifically if strong EEG activity is 327 present, even though they exist in the EEG. We took a closer look at some epochs that 328 were only identified by the algorithm and could also see indices of cardiogenic artifacts 329 that were labelled by the algorithm indication that at least in some epochs, the algorithm 330 might outperform the scorer (see Appendix Figure A13). A limitation of the datasets 331 derived from the mobile EEG device is that we did not have the ground truth ECG to 332 evaluate whether the located cardiogenic peaks with our algorithm matched with real 333 R-peaks in the ECG. To overcome this limitation, we include additional in-lab datasets 334 that provided simultaneous EEG and ECG recordings. Applying our brMEGA 335 algorithm to these recordings demonstrated highly precise peak overlap between the 336 cardiogenic artifact and the ECG with 96% precision and 96% sensitivity. This result is 337 comparable with the accuracy of previously published approaches, for which accuracies 338 around 97-99% were reported 10,16 . Here, we have to consider the achieved performance 339 is slightly lower than in previous publications because we include a whole night 340 recording with some time periods that might not have had an artifact. In other words,  Removal and recovery of the cardiogenic artifact 359 Upon detection of cardiogenic artifacts, we applied the wave-shape manifold model 360 template subtraction to remove the cardiogenic artifact. Our approach can substantially 361 recover the EEG as illustrated in the time-domain but also in the frequency domain, for 362 which we should removal in the primary frequency bands but also the affected higher 363 harmonics. This is also quantified in four indices that depict EEG recovery in contrast 364 to the ground truth EEG in our semi-real dataset. All indices show pronounced 365 improvement after brMEGA application compared to the contaminated EEG. The 366 proposed brMEGA is suitable to handle non-stationary time series, particularly when 367 the signal is recorded for hours or longer. It is well known that the EEG signal is non-368 stationary. The cardiogenic artifact, on the other hand, is also non-stationary, due to the 369 inevitable heart rate variability and the fact that the cardiogenic artifact strengths might 370 vary from one to another. Furthermore, changes in head positions can affect the extent 371 of the artifact, which is often happening during sleep or recordings that involve head 372 movements 9 . Unlike the traditional TS algorithms, like EAS we compared in this 373 work 15 and its variations 16,22 , we leverage this non-stationarity to more efficiently 374 remove cardiogenic artifacts. Such non-stationarity is captured by the wave-shape 375 manifold, which quantifies how cardiogenic artifacts change from one cycle to another.

376
Specifically, for each given cardiogenic artifact, we design a specific filter, the optimal 377 shrinkage, to "clean up the EEG signal" so that the median is taken only over those to the warping, the spectral spreading issue caused by the inevitable heart rate 396 variability is mitigated, and hence the EEG recovery can be better evaluated.

397
Furthermore, our algorithm also outperformed in metrics for EEG signal recovery when 398 we compared recovered signal in the semi-real dataset to the ground truth EEG. 399 We further identified that the cardiogenic artifact does not only entail the strong QRS 400 complex but also the less pronounced T-wave, which has previously been reported 9 . 401 Thus, we highlight the importance of capturing a long enough window around the QRS 402 complex to remove the T-wave of the cardiogenic artifact. approaches. Using visual identification as the ground truth of artifact existence is the 423 step that can be improved. One problem is that strong EEG background activity might 424 mask its presence so that the ground truth is not correct, which will mislead the SVM    interval is greater than 120 microvolt. We concatenate them to form the clean ECG 494 signal. If two consecutive beats have RR interval less than 0.4 s, the first one is removed. 495 Next, we generate the cardiogenic artifact by randomly scaling the peaks in the 496 following way. Let be the height of the th R peak in the clean ECG, and be 497 the interval of the QRS interval of the th heartbeat, and be the interval from the S 498 wave of the th QRS to the Q wave of the ( + 1)th QRS, which includes the th T wave. 499 We scale the ECG over by 60ℎ / , where ℎ is independently and identically

513
The main difference is that the periods between two consecutive stimulation artifacts 514 are usually fixed and their times are known, while the periods between consecutive 515 cardiogenic artifacts are time-varying and their times are unknown. Below we detail the 516 algorithm step by step. Denote the digitalized EEG signal as 0 ∈ ℝ , where we pad 517 the recorded signal by 0 so that the overall length of the EEG signal is of the multiple 518 of 4 seconds. We assume that the signal is uniformly sampled every ∆ > 0 second.

519
Without loss of generality, we assume that 1/∆ is an integer to simplify the notation 520 and suppose there are > 0 4-second non-overlapping epochs; that is = × 4/∆ .

522
Step 0: Preprocessing of EEG data for algorithm 523 The mean of the EEG signal is removed and an IIR notch filter applied at 50 Hz Step 1: 537 In the first step, we automatically determine if a given epoch is contaminated by 538 cardiogenic artifacts or not. We first correct the polarity of the signal. To do this, we 539 need to locate peaks in 1 . We rectify the signal and take the square root of it. Denote respectively. We then divide into submatrices associated with non-overlapping 551 epochs, denoted as ,1 … , ∈ ℂ (4/∆ )× .

552
According to the well-established theory, the cardiogenic artifact in EEG will be 553 represented as a dominant horizontal curve in the TFR 28 . See Figure 1,  Since we focus on subjects without severe arrhythmia like ventricular fibrillation, we 559 follow the normal physiological constraint and assume that the heart beats at the 560 frequency less than 4Hz and higher than 1/3Hz; that is, the consecutive two beats are 561 separated by at least 250 ms and no longer than 3 seconds 29 . Under this constraint, in 562 this work, we apply the following curve extraction algorithm to determine the IHR.
(1) 564 where > 0 is the penalty term chosen by the user, = {⌊1/3∆ ⌋ + 1, ⌊1/3∆ ⌋ + 565 2, … ,4/∆ }, ⌊1/3∆ ⌋ means the largest integer smaller than 1/3∆ , and * is a dim 566 vector representing a curve on the TFR. The first term in the function in (1) essentially 567 fits the TFR with a curve that captures the frequency with the highest intensity at each 568 time, while the second term imposes a smoothness constraint on the extracted curve.

569
As a result, when cardiogenic artifacts exist, the IHR is captured, and at time ∆ -th 570 second the IHR is * ( )∆ Hz. When there is no cardiogenic artifact, the extracted curve 571 represents the instantaneous frequency associated with the spectral distribution of the 572 EEG as a random process.

573
With the estimated * , we apply the peak tracing algorithm 36 to determine the 574 location and heights of peaks from . Specifically, we consider the following 575 optimization:  Step 2. Once we confirm that the EEG signal is contaminated by the cardiogenic artifact,

733
The above AR and SC indices are designed to evaluate the algorithm performance when 734 the ground truth EEG signal is not available. When the true EEG signal is available, we apply 735 the following four metrics 33 to quantify how well the EEG is recovered from the cardiogenic 736 artifact contaminated EEG signal. Denote the true EEG as a ∈ ℝ N and the estimated EEG as 737 ã ∈ ℝ N . Here, the estimated EEG can be the one after applying brMEGA or EAS, or without 738 applying any algorithm. The first index is the normalized mean square error (NMSE), which 739 is defined as 740 . 741 Clearly, the smaller NMSE is, the better the EEG is estimated by ã. The second index is the 742 cross correlation between a and ã , denote as ρ. The better ã estimates a, the closer ρ is to 1. 743 The third and fourth indices are signal-to-noise ratio (SNR) and peak-signal-to-noise ratio 744 where (a, ã) is the root mean square error between a and ã. Clearly, the higher the SNR 747 and PSNR, the better the recovery is. 748 To report NMSE, we divide the signal into non-overlapping 20 second epochs, and we 749 take the median value of all NMSEs in each epoch, which represents the algorithm 750 performance over that epoch. The same procedure is applied to ρ, SNR and PSNR in the same 751 way. In the end we have four indices to quantify the cardiogenic artifact, and hence EEG, 752 recovery quality over each epoch.  The result is shown in Supplementary Table 1, where the sensitivity is 0.93, the 761 specificity 0.69, and the F1 score is 0.84 and the overall accuracy is 0.8. Compared with 762 the 5-fold cross validation shown in Table 1, the result is compatible and encouraging.

763
Note that while the leave-one-subject-out cross validation scheme is challenged by the 764 inter-individual variability, this scheme is close to the real-world scenario, particularly 765 when there is no available information for a newly arrival subject.

767
Implementation details and parameters 768 The proposed brMEGA algorithm is implemented in Matlab, and we adhere to the 769 following parameters in this work. We set 1/∆ = 250 . In the nonlinear-type TF 770 analyses, including SST, the window is Gaussian and the Full width at half maximum 771 is set to be 10s, and the maximal frequency is set to be = 25 Hz. We fix the frequency 772 resolution at ∆ : = / = 0.02 Hz. Set =1 in the curve extraction algorithm (1), We 773 set the parameter γ = 20 in the peak tracing algorithm in (2). For the kernel SVM 774 classifier, we apply the following parameters: the cost is set to 1, Gamma is set to 0.2, 775 and the weights are set to be 1 and 3 for the cardiogenic artifact free epochs and epochs 776 with cardiogenic artifacts respectively. We do not carry out any grid optimization 777 search for the parameters to avoid over-fitting. In the NLEM, we take = 100. The power spectral density is calculated using Welch's method, featuring 5000 discrete 789 Fourier transform points, Hamming windows of 5000 samples, and 50% overlapping. The datasets analyzed during the current study are available from the corresponding 815 author on reasonable request. In the future, the dataset will be made available in a 816 repository.

818
Code availability 819 The code of brMEGA is currently available via request but will be made open source 820 in the future. Figure 1: The flowchart of the first step of brMEGA. The four designed features are marked from 1 to 4. The first feature is the peak height in the time domain, the second and third features are the features in the time-frequency (TF) domain that are determined by the "optimal transport distance", and the fourth feature is the total energy around the peak. The features in the time domain depend on the peak tracking algorithm with the extracted instantaneous frequency from the TF representation.   . Red x marks the respective peak detection for both ECG (R peak) and EEG (cardiogenic artifact peak) separately. The R peak and artifact peaks are correctly identified and overlap. The recovered EEG signal by brMEGA is shown in the third panel. Figure 5: An illustration of the cardiogenic artifacts and the "T wave" bumps (marked by red arrows). If the window is short, the T-wave-like artifact cannot be recovered. Red x indicate cardiogenic artifact peak or R peak occurrence.   Table 2. The confusion matrix of cardiogenic artifact detection. A detected cardiogenic artifact is viewed as a false positive when there does not exist a R peak within the 50 ms grace period. A false negative cardiogenic artifact is defined when there is a R peak, but we do not detect a cardiogenic artifact within the 50 msec grace period.