Spindle dataset collection
Polysomnographic data from 180 subjects were sourced from the Montreal Archive of Sleep Studies (MASS)37. The dataset was split into two “phases”: phase 1 consisted of 100 younger subjects (mean age 24.1 years) and phase 2 consisted of 80 older subjects (mean age 62.0 years). A subset of N2 stage sleep from the C3 channel was sampled from each subject (see Methods for details). Epochs of 25 s of this single-channel EEG were presented to expert PSG technologists, researchers, and non-expert scorers via a custom web-based scoring platform. Users identified the start and stop of candidate spindles and indicated their confidence (high, medium, low) for each spindle marked. In total, 47 PSG technologists, 18 researchers and 695 non-experts viewed 10,453, 6,636 and 37,467 epochs, respectively, in phase 1. Phase 2 was viewed by 31 PSG technologists (7,941 epochs viewed). No scorer viewed the whole dataset; the histogram of the number of scorer views per epoch image is shown in Fig. 1. A minimum number of scorers per epoch was crucial to compile a reliable gold standard (GS): the median number of scorers per epoch was 5 for the PSG technologists (Fig. 1a,b), 4 for researchers (Fig. 1c) and 18 for non-experts (Fig. 1d). More than 95% of all epochs were seen by at least 3 PSG technologists. Table 1 presents the number of scorers and amount of data scored for each user subtype and phase. Almost 100,000 candidate spindles were identified by all scorers combined.
Human group consensus
The collected scores include many candidate spindles, some of which showed low agreement across scorers (an event scored as a spindle by some can be scored as “not a spindle” by others). To create our GS (the dataset of the highest-quality spindles from the Group Consensus (GC) of experts), we averaged scoring across experts and kept, by thresholding, only the candidate spindles that exceeded a desired minimum consensus between experts, termed the Group Consensus Threshold (GCt; see Methods). The GCt was chosen to maximize the mean individual expert performance (see Supplementary Fig. 1 and Table 1) against the leave-one-out GS (the GS to which the evaluated expert did not contribute). We identified an optimum GCt of 0.2 in phase 1 and 0.35 in phase 2. These GCts are similar to what has been previously reported14. The scorers’ performance was evaluated using a “by-event” f1 score (f1), which is the harmonic mean of the precision and the recall. Recall is the percentage of gold standard spindles correctly detected by a scorer (true positives divided by true positives plus false negatives, i.e. completeness), whereas precision is the percentage of a scorer’s spindles that are part of the gold standard set (true positives divided by true positives plus false positives, i.e. exactness). This by-event performance depends on how similar the estimated spindle (marked by a scorer or detected by an algorithm) has to be to the GS spindle to be considered a match (true positive): the most lenient criterion accepts spindles that are merely adjacent (no overlap between spindles), and the strictest requires spindles that are temporally aligned with exactly the same length (100% overlap). Figure 2 presents the by-event performance of experts (as well as researchers, non-experts and algorithms) as a function of the overlap threshold between estimated and GS spindles.
An overlap threshold of 0.2 (also previously reported14) was the highest threshold that maximized performance and was used for further analyses in the current study.
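The by-event scoring described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the overlap measure (intersection over union), the strict "greater than threshold" rule and the greedy one-to-one matching are assumptions made for illustration, and the function names and toy events are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_s, stop_s) of one marked spindle


def overlap_ratio(a: Event, b: Event) -> float:
    """Intersection over union of two events (assumed overlap definition)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0  # disjoint or merely adjacent events
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union


def by_event_scores(detected: List[Event], gold: List[Event],
                    threshold: float = 0.2) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one scorer or detector.

    Pairs are matched greedily, best overlap first, and each event can be
    matched at most once (an assumed matching rule).
    """
    pairs = sorted(
        ((overlap_ratio(d, g), i, j)
         for i, d in enumerate(detected)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_detected, used_gold = set(), set()
    tp = 0
    for ratio, i, j in pairs:
        if ratio <= threshold:
            break  # pairs are sorted, so no later pair can match
        if i in used_detected or j in used_gold:
            continue
        used_detected.add(i)
        used_gold.add(j)
        tp += 1
    fp = len(detected) - tp  # detected events with no GS match
    fn = len(gold) - tp      # GS events that were missed
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy events in seconds, invented for illustration only.
gold_events = [(1.0, 2.0), (5.0, 6.0), (10.0, 11.0)]
detected_events = [(1.1, 2.1), (5.8, 7.0), (20.0, 21.0)]
print(by_event_scores(detected_events, gold_events, threshold=0.2))
```

In this toy case the second detected event overlaps its GS counterpart by only 10%, so it counts as a false positive at a threshold of 0.2 but would count as a true positive at a more lenient threshold.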
With the GC threshold and overlap threshold chosen, the gold standard consists of 5342 spindles (3338 in phase 1, 2004 in phase 2). The properties of these spindles are reported in Table 2. This set of GS annotations is freely available on the Open Science Framework38, and the corresponding EEG data can be downloaded from the Montreal Archive of Sleep Studies website (http://www.ceams-carsm.ca/mass/). See the Readme document on the Open Science Framework38 for details on how to obtain a license to download these data.
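As an illustration of how a consensus threshold turns individual scorings into GS events, the following Python sketch averages binary per-sample votes and keeps the samples whose mean exceeds the GCt. It is a simplified sketch, not the authors' pipeline: it assumes unweighted votes (confidence ratings are ignored), the same set of scorers for every sample, and hypothetical function names; the vote matrix is invented. The `leave_out` argument mimics the leave-one-out GS used to evaluate individual experts.

```python
from typing import List, Optional, Tuple


def group_consensus(votes: List[List[int]], gct: float,
                    leave_out: Optional[int] = None) -> List[int]:
    """Binary consensus mask from per-sample votes.

    votes[s][t] is 1 if scorer s marked sample t as part of a spindle.
    A sample is kept when the mean vote exceeds gct. If leave_out is given,
    that scorer's votes are excluded (leave-one-out consensus).
    """
    rows = [v for s, v in enumerate(votes) if s != leave_out]
    if not rows:
        raise ValueError("at least one scorer is required")
    n = len(rows)
    return [1 if sum(column) / n > gct else 0 for column in zip(*rows)]


def mask_to_events(mask: List[int], fs: float) -> List[Tuple[float, float]]:
    """Convert a binary mask sampled at fs Hz into (start_s, stop_s) events."""
    events = []
    start = None
    for t, value in enumerate(mask):
        if value and start is None:
            start = t                      # rising edge: event begins
        elif not value and start is not None:
            events.append((start / fs, t / fs))  # falling edge: event ends
            start = None
    if start is not None:                  # event runs to the end of the mask
        events.append((start / fs, len(mask) / fs))
    return events


# Invented votes from four scorers over ten samples.
toy_votes = [
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]
lenient = group_consensus(toy_votes, gct=0.2)
strict = group_consensus(toy_votes, gct=0.35)
print(mask_to_events(lenient, fs=2.0), mask_to_events(strict, fs=2.0))
```

A lower GCt keeps events marked by only a minority of scorers, whereas a higher GCt keeps only the core on which most scorers agree, which is why the threshold is tuned against individual expert performance.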
Performance of the human group consensus and automated detectors
A rigorous evaluation of spindle results from clinical and academic sleep studies hinges on quantifying the accuracy and biases of the spindle detection method used. Therefore, to inform future work, we evaluate the spindle detection performance of experts, researchers and non-experts. Human detection of spindles is still considered the highest standard; however, many recent publications have utilized automated methods to save time and cost. Therefore, along with evaluating the performance of humans, seven popular and previously published spindle detection algorithms6,34,39,40,41,42,43 were run on the EEG data (see Methods for details on the algorithms). We compared the by-event performance of each automated detector or human group consensus (GCre and GCne) against the GS, and the individual experts were evaluated against the leave-one-out GS to avoid reporting bias.
The mean individual expert f1 was higher in phase 1 (0.76) than phase 2 (0.65), suggesting that spindles are easier to score in the younger cohort. A mean individual expert f1 of 0.67 has previously been reported14 for a cohort similar to our phase 2. The f1 of the GCre and GCne was ~0.8, suggesting that the group consensus performs better than individual experts, on average (Figs. 2a, 3d). It is noteworthy that individuals (including individual experts, non-experts and researchers) that have very high or low f1 scores tend to be scorers that did not score much data (indicated by lighter colored markers in Fig. 3). Scoring a small amount of data and thereby not encountering the full variety of epochs could have resulted in artificially high/low individual scores.
Similar to human scores, the f1 of the detectors was slightly reduced in the older cohort compared to the younger cohort, except for a943, which remained the same (Fig. 2a,b and Supplementary Table 2). The top performers (based on f1 score) on the younger cohort (phase 1) were the GCre, followed closely by the GCne. Among the automated detectors, a742 had the highest f1 in the younger cohort, closely matching the performance of the average human expert (Figs. 2a, 3d). The highest f1 in the older cohort was reached by a9. Interestingly, a9 was the method most sensitive to the overlap threshold, as its performance decreased more rapidly than that of the other methods as the threshold became more stringent (see Methods). Therefore, spindles detected by the a9 algorithm and matching GS spindles are less precisely temporally aligned (i.e. the start/stop and duration of spindles are less accurate) compared to the other methods. The performance of detector a9 was followed closely by that of a7. We also evaluated the detectors’ performance against the GCre (see Supplementary Fig. 2a) and the GCne (see Supplementary Fig. 2b). The performance of the automated methods remained essentially the same (for more details see Supplementary Table 3).
Each automated detector had its own tradeoff between precision (how many detected spindles matched GS spindles) and recall (how many GS spindles were detected); the most balanced algorithms were a4 and a7 (Figs. 3a,d and Supplementary Table 2). The highest f1 on the whole cohort (phases 1 and 2, 180 subjects) was reached by a7 (0.72 against the GS), which is the same as the average individual expert f1. This performance was followed closely by a9 with an f1 of 0.71; a9 showed a higher recall (0.8) but a lower precision (0.65) (Fig. 3d). Figure 3(b,c) shows the Precision-Recall plot of the individual re or ne and their GC (GCre and GCne respectively). Note that the majority of the individual researchers showed a high precision to the detriment of the recall (i.e. they are overly conservative when marking spindles), and the resulting GCre is perfectly balanced with a GCt = 0. The performance evaluation of the detectors against the three different human references (GS, GCre, GCne) provided similar results (for more information see Supplementary Table 3). The number of spindles and detailed performance metrics (true positives, false positives, false negatives) for the GS, GCre, GCne and each automated algorithm are reported in Supplementary Table 4. The performance (as quantified by the precision, recall and f1 score) of the seven tested detectors was essentially the same as reported previously14,34,42,43. Note that the performance of a9 was slightly more balanced in the original publication43 than in the current study.
Spindle characteristics by-subject as a function of age and sex
Spindle activity decreases with age, and sex differences have also been reported3,4,5,6,7,8,9,10,11,12,13. We evaluated the age group difference between 100 subjects 18–35 years old and 80 subjects 50–76 years old, and the sex difference between the 88 females and 92 males. We tested, by-subject, the spindle density measured in spindles per minute (spm), average maximum peak-to-peak amplitude (µV), average duration (s) and average dominant oscillation frequency (Hz) of the spindles included in the GS (see Methods). A 2 × 2 ANOVA showed main effects of age and sex, but no interaction, for both spindle density (age p = 0.0001 and sex p = 0.001) and average amplitude (age p = 1.5e-6 and sex p = 3e-8). The effect on the average spindle duration was significant only for age (p = 0.01). No significant effect was found for the dominant oscillation frequency of the spindle. Further analyses of the age and sex differences were performed with the non-parametric Mann-Whitney test (Fig. 4), since not all of the spindle characteristics were normally distributed according to the Shapiro-Wilk test. In the GS, the spindle density was higher (p = 0.0002), the average duration was longer (p = 0.008) and the average amplitude was higher (p = 2e-06) in younger compared to older subjects (Fig. 4). The spindle density (p = 0.0008) and the average spindle amplitude (p = 1e-06) in the GS were also higher in females compared to males (Fig. 4). Supplementary Tables 2 and 3 contain a detailed analysis of each detector’s ability to capture the sex and age trends present in the GS.
The average spindle activity reported in the previous crowdsourcing project14 was similar to our phase 2 (older cohort) despite a relatively high standard deviation across subjects. Warby et al.14 reported 2.3 ± 2 spm with an average duration of 0.75 ± 0.27 s, a maximum peak-to-peak amplitude of 27 ± 11 μV and an oscillation frequency mean of 13.3 ± 1 Hz. We measured a by-subject dominant oscillation frequency of 13.1 ± 0.8 Hz (see Supplementary Table 5).
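For readers unfamiliar with the Mann-Whitney test used for the group comparisons in this section, the sketch below computes the U statistic and a two-sided p-value in plain Python. It is an illustration only: it uses the normal approximation with a tie correction and no continuity correction, which is an assumption and may differ from the statistical software used in the study. The function name and the toy values are hypothetical and are not data from this work.

```python
from math import erfc, sqrt
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # [i, j) is one block of tied values
        avg_rank = (i + 1 + j) / 2      # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        raise ValueError("all values are identical; the test is undefined")
    z = (u_x - mean_u) / sqrt(var_u)
    p_two_sided = erfc(abs(z) / sqrt(2))
    return u_x, p_two_sided


# Invented by-subject spindle densities (spm) for two small groups.
group_a = [5.0, 6.0, 7.0, 8.0]
group_b = [1.0, 2.0, 3.0, 4.0]
print(mann_whitney_u(group_a, group_b))
```

With samples as small as this toy example an exact test would normally be preferred; the normal approximation is more appropriate for group sizes like those in the two cohorts.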
Comparison of detection methods
When considering which method to use to detect spindles, automated or otherwise, it is important to understand which spindle properties are best captured by each. To this end, we computed the correlation of the spindle density and spindle characteristics between the GS spindles and the automatically detected spindles for each algorithm (a2-a9) as well as GCre and GCne. The correlations for the spindle density in phase 1 (younger cohort, 100 subjects) are reported in Table 3. For phase 1, the correlation is higher for the human GC than for the automated detectors. The GCne is slightly more correlated (r2 = 0.91) than the GCre (r2 = 0.88). The correlation for the detectors is low for spindle density (r2 average across detectors is 0.37) and spindle duration (r2 = 0.32), but very high for spindle amplitude (r2 = 0.90) and high for spindle frequency (r2 = 0.69). The detectors a7 and a9 performed better than the average of the detectors, especially for spindle density, for which their r2 values were 0.73 and 0.85, respectively. The correlation coefficients for the detectors in phase 2 are reported in Supplementary Table 6. Briefly, the correlation was higher for spindle density but lower for all the other characteristics compared to phase 1. Again, the detectors a7 and a9 outperformed the other detectors for the correlation with the GS spindle density, with r2 = 0.83 and 0.88, respectively.
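The by-subject correlations reported above can be illustrated with a short Python sketch. It assumes that r2 denotes the squared Pearson correlation coefficient between paired by-subject values (one pair per subject), which the text does not state explicitly; the function name and the toy densities are hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between paired by-subject values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy * sxy / (sxx * syy)


# Invented spindle densities (spm) for four subjects: GS vs. one detector.
gs_density = [1.0, 2.0, 3.0, 4.0]
detector_density = [1.0, 3.0, 2.0, 4.0]
print(r_squared(gs_density, detector_density))
```

Note that a high r2 only indicates that a detector ranks subjects similarly to the GS; a detector can correlate well with the GS density while still over- or under-estimating it systematically.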
We compared the by-subject distribution of spindle characteristics of each detector (a2-a9) and human group consensus (GCre and GCne) to that of the GS using a Mann-Whitney test, for the whole cohort in the case of the detectors and for phase 1 only in the case of GCre and GCne. The variance in spindle characteristics was much larger across detectors than across the three human subtypes (PSG technologists, researchers and non-experts) (Fig. 5 and Supplementary Table 7). The spindle density of a2 was much lower (0.9 spm, p = 9e-38) than that of the GS (3.8 spm), whereas the densities of a3 (7 spm, p = 3.6e-25) and a8 (6.9 spm, p = 2.3e-34) were much higher. The average duration was much longer for a2 (1.15 s, p = 1.6e-33) and a9 (1.15 s, p = 2e-49) compared to the GS (0.78 s), but much shorter for a3 (0.56 s, p = 4.7e-43), a4 (0.67 s, p = 1.1e-15) and a5 (0.5 s, p = 1.2e-48). The average amplitude and oscillation frequency were about the same for all the detectors except a2, which showed spindles with greater amplitude (43 µV, p = 9.5e-30) than the GS (30 µV). The cohort-level histogram (by-subject analysis) of the dominant oscillation frequency of the GS spindles, or of the spindles of any of the automated detectors, is unimodal and does not support the hypothesis of decomposing the spindles into fast and slow spindles (Fig. 5d). Note that the slightly higher spindle density, duration and amplitude for the re and ne spindle datasets (Fig. 5) are biased because only the younger cohort (phase 1) was scored by these groups (see Table 2 for the true comparison for phase 1, “Phase 1 - Younger” column).
How many scorers are needed for crowdsourcing sleep spindle annotations?
Obtaining quality spindle scoring is costly and time-consuming; knowing the number of scorers per epoch needed to achieve reliable results is worthwhile and may help to create future GS datasets. We found that aggregating the scoring from two to four experts or researchers per epoch is optimal (Fig. 6a). However, three to ten non-experts were needed for similar performance (Fig. 6b). Zhao et al.35