Introduction
Executive Functioning (EF) and Processing Speed (PS) are foundational for many complex cognitive functions, and worsening performance in these domains has been hypothesized to play a dominant role in cognitive decline (e.g., Reynolds et al., 2009; Salthouse, 1996, 2000). As such, longitudinal monitoring of both EF and PS has important implications for early detection, intervention, and management of pathological cognitive impairment. Neuropsychological batteries and standardized assessments typically include measures of EF and PS; however, most standardized measures are designed for one-on-one, in-person administration, which can be costly, burdensome, and often impractical for both researchers and participants.
Fortunately, recent advances in connected technologies have provided opportunities to perform remote health monitoring in a manner that can support research advances toward healthy aging and early disease detection. Mobile app-based assessments offer a particularly appealing mechanism for administration, as they can be completed at any time in nearly any location (Ben-Zeev & Atkins, 2017; Koo & Vizer, 2019). By leveraging personal smartphone technologies, remote cognitive assessment can offer a cost-effective and efficient alternative to in-person assessment and enable study designs that include frequent and/or longitudinal cognitive monitoring.
To address the need for reliable, standardized, remote cognitive assessments, the National Institute on Aging (NIA) has awarded multiple grants to create the “Mobile Toolbox” (MTB; www.mobiletoolbox.org; Gershon et al., 2022), a library of cognitive tests and supplemental scales embedded within the REDCap system and its companion MyCap app (Harris et al., 2022; Harris et al., 2019; Harris et al., 2009). MTB and REDCap make it possible for researchers to design and deploy smartphone-based studies that participants can complete anywhere and anytime on an iOS- or Android-based smartphone. Through MyCap, the MTB measures are highly accessible, affordable, and, as we review here, validated for use across the adult lifespan. Raw data from the tasks are uploaded to servers, where they are aggregated, processed for quality, and used to generate performance metrics. All MTB assessments are designed to measure well-established constructs, using existing paradigms optimized for self-administration on a personal smartphone. Furthermore, the MTB system is designed to allow these and other measures to be combined into electronic protocols customized for the needs of a large number and variety of future studies. Originally intended for research use, the MTB is not currently suitable for individual diagnosis, but the platform provides a mechanism for future research to explore this potential use. Most of the initial core MTB cognitive tests were adapted from measures included in the NIH Toolbox® for Assessment of Neurological and Behavioral Function Cognitive Battery (NIHTB-CB) or are measures of similar constructs (Carlozzi et al., 2015; Carlozzi et al., 2014; Gershon et al., 2013; Weintraub et al., 2013; Zelazo et al., 2014). The goal of the current paper is to describe the process of developing and validating three MTB measures that assess EF and PS: Arrow Matching (inhibitory control), Shape-Color Sorting (cognitive flexibility), and Number-Symbol Match (speed of information processing). These measures are similar in that they all rely on reaction time and are all adaptations of existing NIHTB assessments.
Data from three distinct samples were used to evaluate the psychometric properties of the MTB measures. In Study 1, participants completed the MTB measures in a lab on a study-provided smartphone and were administered external measures of similar and dissimilar cognitive constructs. The goal of Study 1 was to evaluate the convergent and divergent validity, internal consistency (split half), and correlations with age for MTB measures when completed in a controlled laboratory setting. In Study 2, participants completed MTB measures remotely on their own smartphones and were administered measures of similar cognitive constructs in the lab. The goal of Study 2 was to replicate the results from Study 1 and to evaluate the psychometric properties of the MTB measures when taken remotely on a personal smartphone. Study 2 also allowed us to compare results across Android and iOS devices. In Study 3, participants completed the MTB twice on their own smartphone, two weeks apart. The goal of Study 3 was to examine the test-retest reliability of the MTB measures when taken remotely.
Method
Measure development
Two EF measures from the NIHTB-CB, the Flanker Inhibitory Control and Attention Test and the Dimensional Change Card Sort Test (DCCS; Weintraub et al., 2013; Zelazo et al., 2014), and one PS measure from the NIHTB supplemental tests, the Oral Symbol Digit Test (see Carlozzi et al., 2015; Healy & Fernald, 1911), were selected for adaptation for self-administration on the Mobile Toolbox app. A preliminary version of each measure was created for usability and pilot testing. The results of those initial tests informed revisions, which were incorporated into the versions of the measures that were used for the validation studies.
Arrow matching
Arrow Matching assesses the inhibitory control component of executive functioning. Based on the original Eriksen flanker task (Eriksen & Eriksen, 1974) as well as the NIHTB version (Flanker Inhibitory Control and Attention Test; Zelazo et al., 2014), participants indicate whether a central stimulus is oriented to the left or right, while inhibiting focus on potentially incongruent flanking stimuli on either side.
Designed to be taken in landscape orientation on a smartphone screen, Arrow Matching presents five arrows in a line (see Figure 1A). Four flanking arrows appear for a fraction of a second (100 ms) prior to a central arrow. Examinees then have 2,000 ms to respond with the direction of the central arrow, selecting from two buttons. Participants complete 50 trials in a pseudo-random order, and on approximately one third of trials the central stimulus is incongruent with the flankers. A centrally located star rotates during a variable (500 ms, 1,250 ms, or 2,000 ms) inter-stimulus interval (ISI). The movement of the star was chosen to help participants maintain attention and provide a sense of system status, communicating that another trial will soon appear.
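To make the trial structure concrete, the following R sketch generates one possible 50-trial schedule with the parameters described above (a 100 ms flanker lead, a 2,000 ms response window, and a variable ISI). It is illustrative only, not the MTB implementation, and all object names are hypothetical.

```r
# Illustrative only: one possible Arrow Matching trial schedule with the
# parameters described above (not the MTB implementation).
set.seed(1)
n_trials <- 50
arrow_schedule <- data.frame(
  trial     = seq_len(n_trials),
  congruent = sample(rep(c(TRUE, FALSE), times = c(33, 17))),   # ~1/3 incongruent
  direction = sample(c("left", "right"), n_trials, replace = TRUE),
  isi_ms    = sample(c(500, 1250, 2000), n_trials, replace = TRUE),
  flanker_lead_ms    = 100,    # flankers precede the central arrow by 100 ms
  response_window_ms = 2000    # time allowed to respond
)
head(arrow_schedule)
```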
One difference between MTB Arrow Matching and its NIHTB counterpart (Flanker) was the inclusion of more trials (50 vs. 20), with less time allotted for each item (2,000 ms vs. 10,000 ms). This faster auto-advance, combined with a variable ISI, was implemented to increase task difficulty, with the goal of widening the distribution of performance.
Shape-color sorting
Shape-Color Sorting measures the cognitive flexibility component of executive function. Based on the Dimensional Change Card Sort Test (Zelazo et al., 2014), participants are cued to match a bivalent central test stimulus to one of two target stimuli based on one of two dimensions (Figure 1B). Trials vary in the relevant dimension, requiring participants to shift their matching rules.
In this test, which is taken in portrait orientation, trials switch between cueing “color” and “shape.” The measure begins with five mixed-practice items, followed by 30 test trials, 20 percent of which cue “color.” The cued word is presented in lowercase text because words in lowercase font are more easily recognizable (Tinker, 1963). There is a variable-length ISI (either 300 or 1,000 ms) between trials, and participants have 2,500 ms to respond on each trial.
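As a concrete illustration of the trial parameters above, the R sketch below builds one possible 30-trial schedule (20% “color” cues, 300/1,000 ms ISIs, 2,500 ms response window). It is illustrative only, not the MTB implementation, and the object names are hypothetical.

```r
# Illustrative only: one possible Shape-Color Sorting trial schedule
# with the parameters described above (not the MTB implementation).
set.seed(2)
n_trials <- 30
sorting_schedule <- data.frame(
  trial  = seq_len(n_trials),
  cue    = sample(rep(c("color", "shape"), times = c(6, 24))),  # 20% "color" cues
  isi_ms = sample(c(300, 1000), n_trials, replace = TRUE),
  response_window_ms = 2500
)
table(sorting_schedule$cue)
```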
Whereas the NIHTB version (DCCS) uses a bunny and a sailboat for practice items and a ball and a truck for live items, the MTB version uses a dog and a car for practice items and a balloon and a house for live items. These new stimuli use the same colors, general shapes, and a similar style of drawing as those in the NIHTB version.
Number-symbol match
Number-Symbol Match is an electronic adaptation of the many extant “coding” types of tests that originated in the early 20th century (Healy & Fernald, 1911), and shares similarities with the NIH Toolbox Oral Symbol Digit Test (Carlozzi et al., 2014), which was similarly adapted from this original source. This measure assesses processing speed by instructing participants to use a reference key to pair numbers with symbols within a constrained time.
Number-Symbol Match, completed in landscape orientation, presents a “key” at the top of the screen, showing the numbers one through nine, each connected to a unique symbol (see Figure 1C). Below the key, nine symbols are presented per screen, and the participant must tap the correct number for each symbol presented, according to the key at the top. Symbol order is pseudo-random, with the condition that no identical symbols appear contiguously. The test includes 16 successive screens of 9 items each (total items = 144), and participants are given 90 s to complete as many items as possible.
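The pseudo-random ordering constraint described above (no identical symbols appearing contiguously) can be illustrated with the following R sketch, which generates one possible 144-item sequence arranged as 16 screens of 9 items. It is illustrative only, and the placeholder letters simply stand in for the nine key symbols.

```r
# Illustrative only: a 144-item (16 screens x 9 items) sequence in which
# no two identical symbols appear back to back (not the MTB implementation).
set.seed(3)
symbols  <- LETTERS[1:9]                 # placeholders for the nine key symbols
sequence <- character(144)
sequence[1] <- sample(symbols, 1)
for (i in 2:144) {
  sequence[i] <- sample(setdiff(symbols, sequence[i - 1]), 1)  # avoid immediate repeats
}
screens <- matrix(sequence, nrow = 16, byrow = TRUE)  # one row per screen
any(sequence[-1] == sequence[-144])                   # FALSE: no contiguous repeats
```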
One difference between the MTB version of Number-Symbol Match and other similar tests, including the NIHTB Oral Symbol Digit Test, is that self-corrections are not permitted in the MTB design. Moreover, the Oral Symbol Digit Test uses oral responses whereas Number-Symbol Match uses a motoric response (tapping a button), as the latter was expected to reduce potential effects of background noise when collecting data in unknown environments and to obviate the need for an examiner or speech-recognition processing of responses.
Validation studies
Participants
Ninety-two participants in Study 1 (M age = 49.27 years, SD = 17.65) and 1,021 participants in Study 2 (M age = 43.97 years, SD = 21.24) were enrolled in the NIHTB version 3 re-norming study and had been recruited by a third-party market research firm. They were racially and ethnically diverse and represented a range of age groups and education levels (albeit few participants had less than a high school education). The only inclusion criteria for participants were: 1) age 18 or older; 2) ownership of an iOS or Android smartphone; and 3) ability to consent to participation in English. Participants were not screened for cognitive impairments prior to participation. Participants in Study 3 (N = 168, M age = 63.54, SD = 12.10) were enrolled as part of a larger independent validation study through the Brain Health Registry (BHR), an online, longitudinal platform with over 100,000 members (Weiner et al., 2023). BHR consists of a public-facing website and a participant portal, where participants over the age of 18 can register, create an online profile, complete an online informed consent form, and complete study tasks. Participants are recruited to BHR through a variety of methods (Weiner et al., 2023; Weiner et al., 2018), and advertising themes and messages include those tailored toward older adults with normal cognition as well as those likely to have subjective cognitive decline and cognitive impairment.
Study 3 participants were required to be fluent in English, to have previously opted in to learning about additional research opportunities within BHR, and to have a compatible smartphone device. Participants were not screened for cognitive impairment. Due to an unexpected technical issue that corrupted data from Android devices, only iOS users were included in this sample. See Table 1 for a full demographic breakdown of participants in the three studies.
Procedure
Study 1. Participants self-administered the MTB measures on study-provided iOS smartphones (iPhones), unproctored in the lab. They were also administered the NIHTB Version 3 measures on study-provided tablets (iPads), which included measures of interest for validation: Flanker, DCCS, the Oral Symbol Digit Test, and the Pattern Comparison Processing Speed Test. Participants were also administered several external measures of similar constructs, including the Delis-Kaplan Executive Function System (D-KEFS) Color-Word Interference Test (Delis et al., 2001), the Wisconsin Card Sorting Test (WCST-64; Heaton, 1981), and the Coding and Symbol Search subtests from the Wechsler Adult Intelligence Scale, 4th edition (Wechsler, 2008), as well as two measures for divergent validity: the Peabody Picture Vocabulary Test, 5th edition (PPVT-5; Dunn, 2018) and the NIH Toolbox Picture Vocabulary Test (TPVT; Gershon et al., 2014), both of which measure receptive vocabulary, a construct that is distinct from EF and PS. The PPVT has previously been used as a measure of divergent validity vis-à-vis the NIHTB EF measures (Zelazo et al., 2014). We expected the MTB EF measures to correlate at least moderately with respective measures of similar constructs (r > .3) and weakly with a measure of a divergent construct (i.e., the PPVT-5; r < .3).
Study 2. Participants were administered the NIHTB measures used in Study 1 in the lab and then completed the MTB measures remotely on their own iOS or Android smartphone no more than 14 days later.
Study 3. BHR participants were invited by email, screened for eligibility (access to a compatible smartphone), and provided online instructions for downloading the MTB app. Participants self-administered the MTB measures on their own iOS smartphone remotely twice: once at baseline and once 14 (± 3) days later. Participants were included in the final analyses only if they completed the measures at both timepoints and did not switch devices between sessions (e.g., from iOS to Android or iPhone to iPad). Of the 168 participants enrolled, 144 provided test-retest data for Shape-Color Sorting, 142 for Arrow Matching, and 141 for Number-Symbol Match.
Studies 1 and 2 were conducted in compliance with, and approved by, the Institutional Review Board (IRB) at Northwestern University (IRB STU00207455), and Study 3 was conducted in compliance with, and approved by, the IRB at the University of California, San Francisco (IRB 20-30058). All data were obtained in accordance with the Declaration of Helsinki.
Analyses
Scores for Arrow Matching and Shape-Color Sorting use a rate-based metric: the number of correct trials completed per second, which matches the scoring model used for NIHTB version 3 and is taken from prior literature (Woltz & Was, 2006). The score for Number-Symbol Match is the number of correct responses completed in the allocated time (90 s). All MTB analyses reported here used raw scores. While these reflect the primary scores, additional metrics, similar to those of other EF tests (e.g., the D-KEFS), are also available for Arrow Matching and Shape-Color Sorting, including error rate, anticipation errors, medians for correct and incorrect trials, etc.
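The scoring rules above can be summarized in a brief R sketch. The column names and the choice of denominator for the rate-based score are assumptions made for illustration; this is not the exact MTB scoring code.

```r
# Sketch of the scoring described above (assumed column names; not the
# exact MTB scoring code).
rate_score <- function(trials) {
  # rate-based score: correct trials completed per second, here using total
  # elapsed response time as the denominator (an assumption for illustration)
  sum(trials$correct) / (sum(trials$rt_ms) / 1000)
}
nsm_score <- function(trials, limit_s = 90) {
  # Number-Symbol Match: number of correct responses within the 90 s limit
  sum(trials$correct[cumsum(trials$rt_ms) / 1000 <= limit_s])
}
```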
Spearman correlations were conducted to explore convergent validity against external measures of similar cognitive constructs (Study 1), and NIHTB equivalent measures (Studies 1 & 2), as well as divergent validity against the NIH Toolbox PVT (Studies 1 & 2) and PPVT (Study 2). Note that TPVT is an interval scale based on IRT models, and the remaining convergent measures are each ratio scales. Parametric tests are appropriate for the comparison of interval scales to ratio scales, as well as between ratio scales.
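For illustration, the Spearman correlations could be computed in R as follows, assuming a participant-level data frame dat with hypothetical column names for the MTB and criterion scores.

```r
# Convergent and divergent validity via Spearman correlations
# (hypothetical column names; exact = FALSE because tied ranks are expected).
cor.test(dat$arrow_matching, dat$nihtb_flanker, method = "spearman", exact = FALSE)  # convergent
cor.test(dat$arrow_matching, dat$ppvt,          method = "spearman", exact = FALSE)  # divergent
```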
Tests of independent correlations assessed whether the correlations between MTB and NIHTB measures varied as a function of testing environment (in-person vs. remote, i.e., Study 1 vs. Study 2). Performance across age was also examined with Spearman correlations (Studies 1 & 2). We anticipated that all three measures would correlate negatively with age, as EF and PS are known to decline across the adult lifespan.
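A test of independent correlations of this kind is typically based on Fisher's r-to-z transformation; the R sketch below shows one common formulation (the values in the example call are placeholders, not study results).

```r
# Compare two independent correlations via Fisher's r-to-z transformation.
compare_correlations <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                                    # Fisher transformation
  z2 <- atanh(r2)
  z  <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  c(z = z, p = 2 * pnorm(-abs(z)))
}
compare_correlations(r1 = .70, n1 = 92, r2 = .65, n2 = 1021)  # placeholder values
```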
Internal consistency reliability (Studies 1 & 2) was calculated as the median of bootstrapped random split-half correlations, each adjusted with the Spearman-Brown correction. Test-retest reliability (Study 3) was calculated using intraclass correlation coefficients, and practice effects were analyzed using linear mixed-effects models with two time points and a random intercept for participant (which is statistically equivalent to a paired samples t-test).
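The reliability analyses can be sketched in R as follows. The data objects (item_correct, scores_wide, long) and their column names are hypothetical, and the binary-item split-half shown here is a simplification of the rate-based scores actually used.

```r
library(psych)   # ICC()
library(lme4)    # lmer()

# Bootstrapped random split-half reliability with Spearman-Brown correction:
# item_correct is a hypothetical participants x trials matrix of scored trials.
split_half <- replicate(1000, {
  idx  <- sample(ncol(item_correct))
  half <- ncol(item_correct) / 2
  h1   <- rowSums(item_correct[, idx[1:half]])
  h2   <- rowSums(item_correct[, idx[-(1:half)]])
  r    <- cor(h1, h2)
  2 * r / (1 + r)                        # Spearman-Brown correction
})
median(split_half)

# Test-retest reliability (ICC) and practice effects (mixed-effects model)
ICC(scores_wide[, c("t1", "t2")])                  # intraclass correlations
summary(lmer(score ~ timepoint + (1 | id), data = long))
```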
Finally, because subtle differences in operating system can influence the timing of stimulus presentation and response recording, we considered the effect of operating system (iOS vs. Android) on standardized scores, using linear regression models and controlling for age in each model. In addition to evaluating differences in performance scores across device type, we also compared validity and reliability metrics across device types by considering overlapping 95% confidence intervals (for validity estimates) and overlapping 25th-75th percentile ranges (for reliability estimates).
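The operating-system comparison described here amounts to a linear regression of the standardized score on device type with age as a covariate; a minimal R sketch, with hypothetical variable names, is shown below.

```r
# Effect of operating system on standardized scores, controlling for age
# (hypothetical variable names).
dat$score_z <- as.numeric(scale(dat$arrow_matching))  # standardized score
summary(lm(score_z ~ os + age, data = dat))           # os is a factor: "iOS" vs. "Android"
```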
Analyses for all validation studies were conducted in R (R Core Team, 2023). Due to the number of comparisons, p-values were only considered significant if they were less than .01.
Results
Study 1 (In-person sample)
Examination of score distributions did not suggest floor or ceiling effects of the measures (i.e., there were very few cases with perfect or zero scores). Validity estimates comparing scores between MTB measures and their NIHTB counterparts were strong, ranging from r = .58 to r = .74 (see Table 2). Validity estimates compared to external measures were more variable. Number-Symbol Match had strong correlations with the WAIS-IV Coding (r = .68) and Symbol Search (r = .63) scores. Shape-Color Sorting correlated moderately in the expected negative direction with all D-KEFS subscores (−.54 < r < −.38), as well as moderately in the expected positive direction with the WCST-64 (r = .41). Similarly, Arrow Matching correlated moderately in the expected negative direction with all D-KEFS subscores (−.45 < r < −.39), as well as in the expected positive direction with the WCST-64 (r = .32).
Note. All correlations significant at p < .01 unless otherwise noted.
^Negative correlation expected due to the inverse relationship between scores (e.g., performance vs. age; speed vs. accuracy).
All three measures showed no significant relationship with scores on the two vocabulary measures, NIHTB PVT (r’s < .15) and PPVT (r’s < .29), demonstrating evidence of divergent validity. Measures in this study also demonstrated very strong internal consistency reliability (r’s ≥ .93) and the expected negative correlations with age (r’s ≤ −.39).
Study 2 (Fully remote sample)
Examination of score distributions did not suggest floor or ceiling effects of the measures. Reliability and validity estimates, as well as correlations with age, were largely similar to those seen in Study 1 (see Table 3 for all results on the full sample). Tests of independent correlations comparing NIHTB convergent validity correlations between the in-person (Study 1) and remote (Study 2) samples indicated no significant differences between the estimates for Arrow Matching vs. NIHTB Flanker (z = 1.56, p = .12), Shape-Color Sorting vs. NIHTB DCCS (z = 0.18, p = .86), Number-Symbol Match vs. NIHTB Pattern Comparison (z = 0.21, p = .83), or Number-Symbol Match vs. NIHTB Oral Symbol Digit Test (z = 0.14, p = .89). Finally, all three measures again showed no significant correlation with the divergent measures, NIHTB PVT (r’s ≤ .15) and PPVT (r’s ≤ .27). As in Study 1, all three measures demonstrated very strong internal consistency reliability (r’s ≥ .94).
Note. All correlations significant at p < .01 unless otherwise noted.
^Negative correlation expected due to the inverse relationship between scores (e.g., performance vs. age; speed vs. accuracy). All values, with the exception of internal reliability, are reported with 95% CIs.
Unlike Study 1, in which all individuals completed the MTB measures on a study-provided iOS smartphone (iPhone), in Study 2, participants completed the measures on their own phones, 31% of which were Android devices. Therefore, Study 2 allowed us to compare MTB reliability and validity between operating systems.
First, we considered the effect of operating system on scores. Linear regressions demonstrated that operating system did indeed have a significant effect on standardized scores, controlling for age, for both Arrow Matching (β = 0.23, p < .001) and Shape-Color Sorting (β = 0.21, p < .001), but not for Number-Symbol Match (β = 0.005, p = .93). For both Arrow Matching and Shape-Color Sorting, scores were higher on iOS than on Android. See Table 4 for full regression results.
Note. **p < .001.
Importantly, despite the small effect of operating system on scores, there were no significant differences in convergent validity, divergent validity, or internal reliability between operating systems, as evidenced by overlapping confidence intervals (See Table 3 for estimate comparisons by operating system).
Study 3 (Fully remote test-retest sample)
Again, examination of score distributions did not suggest floor or ceiling effects of the measures. Shape-Color Sorting showed good test-retest reliability (N = 144, ICC = .78, 95% CI [.71, .84]) with no significant practice effects from baseline to retest (average increase of 0.03 items correct per second, t(143) = 1.8, p = .08). Number-Symbol Match exhibited excellent test-retest reliability (N = 141, ICC = .83, 95% CI [.74, .89]), with a small but significant increase of 2.73 additional items correct the second time it was taken (t(140) = 4.8, p < .0001). Finally, Arrow Matching showed good test-retest reliability (N = 142, ICC = .69, 95% CI [.59, .76]), and unexpectedly showed a small but significant decrease in performance between baseline and retest; on average, scores changed by −0.05 items correct per second (t(141) = −2.3, p = .02) on the second administration.
Discussion
This paper describes the results of a multi-part validation effort demonstrating the psychometric properties of three MTB measures that assess EF and PS. Results from the in-person sample (Study 1) provide convergent validity evidence compared to both NIHTB measure equivalents as well as external measures, under “optimal” circumstances. Results from a larger and fully remote sample (Study 2) replicate the validity and reliability results of Study 1 while demonstrating an effect of phone operating system on results for two of the tests. Study 3 provides evidence of test-retest reliability and practice effects.
All three measures demonstrated good evidence of convergent and divergent validity on both iOS and Android devices, supporting their effectiveness in assessing the specified constructs. Scores on Arrow Matching correlated strongly with those on the NIHTB Flanker and moderately with the D-KEFS Color-Word Interference scores, including Color Naming, Word Reading, Inhibition, and Inhibition/Switching. Shape-Color Sorting scores correlated strongly with the NIHTB DCCS measure, as well as moderately with D-KEFS Inhibition/Switching. Number-Symbol Match correlated strongly with the NIHTB Pattern Comparison and Oral Symbol Digit tests, and with the WAIS-IV Coding and Symbol Search scores. Additionally, all three measures had small, non-significant correlations with both the NIHTB PVT and the PPVT, measures of vocabulary knowledge that tend to be a proxy for general abilities. Together, these results suggest that Arrow Matching, Shape-Color Sorting, and Number-Symbol Match assess the targeted constructs of EF and PS.
Correlations between MTB scores and age are similar to those reported for the original NIHTB (−0.50 to −0.55; Carlozzi et al., 2014; Zelazo et al., 2014). Additionally, correlations between MTB and NIHTB equivalents reported here were quite strong (ranging from .58 to .70), and correlations between MTB measures and external measures of similar cognitive constructs were in a range similar to those seen for the original NIHTB validation study (.52 between NIHTB Flanker and D-KEFS Color-Word Interference Inhibition; .55 between NIHTB DCCS and D-KEFS Color-Word Interference Inhibition). This level of correlation is impressive given that MTB measures are self-administered (and, for two of our three samples, were completed in an unproctored setting).
It is also notable that there were no differences in validity or internal consistency reliability coefficients between the sample that completed the MTB measures in person (Study 1) and the sample that completed them remotely (Study 2). Despite the potential challenges facing remote assessment, the consistency in reliability and validity estimates offers further confidence in the utility of this tool as intended.
Finally, in Study 3, we considered test-retest reliability on iOS devices only, as participants in this sample completed each measure twice, 14 days apart. All measures exhibited strong test-retest reliability with generally stable performance after two weeks. Nevertheless, we did see slight practice effects in the positive direction for Number-Symbol Match, which is comparable to practice effects on similar processing speed tests (Carlozzi et al., 2014), and in the negative direction for Arrow Matching, which differs from the positive practice effects previously found for NIHTB Flanker (Zelazo et al., 2014). Although the practice effects were minimal, they suggest that those interested in using the MTB EF and PS measures for high-frequency testing or Ecological Momentary Assessment designs should use caution in interpreting changes in scores over relatively short periods of time.
One challenge in developing cognitive assessments for remote administration on individuals’ smartphones is that devices vary in their timing precision, which is problematic for tests that depend on precise stimulus presentation and response timing (Germine et al., 2019; Passell et al., 2021). In comparing performance across users who completed the measures on an iOS vs. Android device in Study 2, we indeed found that operating system affected scores for the two measures that are highly time dependent (Arrow Matching and Shape-Color Sorting), whereas it did not impact Number-Symbol Match, for which precise timing matters less. Given that the effect of operating system emerged only for the two measures that rely on precise stimulus presentation timing and response recording, these results suggest that the effect is likely due to software or hardware differences rather than a third, person-specific variable. Note also that there are differences beyond operating system across devices: Android devices in particular are manufactured by a wide range of companies using different hardware components that could affect the measurement of fine-grained timing events during tests. However, the diversity of devices used precluded any formal study of the effect of device hardware on results. Future work will determine whether there is a subset of hardware devices, such as older or less expensive models, that yield differing results. In the absence of this additional research, for studies proposing multiple assessments of the same individual on time-dependent tests, we strongly recommend ensuring that participants complete the measures on the same device over time, so that any device effect is consistent within an examinee, and that operating system is included as a covariate in analyses.
Despite a small impact on scores for the timed measures (Arrow Matching and Shape-Color Sorting), we found no differences in the convergent validity, divergent validity, or internal consistency across operating systems for any of the measures. This suggests that MTB measures can be used reliably with both types of operating systems. However, researchers should use caution when comparing Arrow Matching or Shape-Color Sorting scores from different devices and may want to avoid combining different operating systems in their samples when using these measures. In contrast, this should not be a concern for Number-Symbol Match, which showed no differences in reliability, validity, or mean scores between operating systems.
Limitations
Despite their strengths, our studies have some limitations. First, test-retest reliability and practice effects were only examined on iOS devices due to a technical error in the Android sample. Further research with Android phones should be conducted to understand the influence of repeated administrations on the measures’ reliability before they are used with Android samples. Second, although the demographics of our three samples were reasonably diverse, they lacked representation from certain groups, for example, those with less than a high school education. Future validation studies with underrepresented groups are important if the measures are to be used in research with these populations. Third, the current samples were not recruited specifically to include individuals with cognitive impairments, and no cognitive assessments were conducted prior to enrollment. As the MTB was designed to track cognitive change across the lifespan and support research on cognitive decline, it is imperative that future work determine the feasibility, reliability, and validity of the MTB in samples with cognitive impairments and in other clinical groups.
The MTB is designed to be a remote assessment tool, which comes with both strengths and limitations. Remote measures that can be self-administered on a personal smartphone can reduce the cost and participation burden of research (Naito et al., 2021). However, it is difficult to monitor for cheating or poor effort, as well as for other environmental factors that may influence test performance, when measures are taken in remote settings. The measures time out after 10 minutes of inactivity to protect against low engagement; however, we were not able to monitor for other types of performance validity within these measures. Although it is difficult to cheat on these EF measures, as participants cannot look up answers, it is possible that some participants asked another person to complete the test for them or did not try their best on the measures. Future versions of the MTB will implement measures to monitor and control for performance validity in remote settings, as well as collect data on contextual factors such as background noise or movement, to empirically test if and how these factors impact test performance in the real world. Moreover, users can easily include their own instructions to participants through the REDCap system to address engagement concerns.
Finally, while the three samples, particularly those from Studies 1 and 2, were reasonably diverse, they do not reflect the comprehensive demographic breakdown of populations in the US. This limits the generalizability of the validity and reliability findings across all populations and limits the use of these data for creating normed scores. As of now, the MTB can be considered useful for research purposes in the tested populations but is not appropriate for clinical use or high-stakes testing, and it may not be as useful when testing populations underrepresented in this study.
Conclusion
MTB Arrow Matching, Shape-Color Sorting, and Number-Symbol Match are shown here to be reliable and valid tools for remotely assessing EF and PS in healthy adults. Future work should consider their efficacy in additional contexts, including with clinical populations; however, the work reported here provides a critical foundation for the expansion of the MTB in future studies. Our hope is that the MTB will enhance research on cognitive change across the lifespan and advance our knowledge of both typical and atypical cognitive decline.
Acknowledgements
This work was supported by the National Institutes of Health grant U2CAG060426. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to recognize the contributions of the Sage Bionetworks UI / UX Design team in conducting the user research studies and contributing to the design of the tests, and the Sage Engineering teams that developed and managed the system for remote administration of the study.
Competing interests
Dr Weiner serves on Editorial Boards for Alzheimer’s & Dementia, and the Journal for Prevention of Alzheimer’s Disease. He has served on Advisory Boards for Acumen Pharmaceutical, Alzheon, Inc., Amsterdam UMC; MIRIADE, Cerecin, Merck Sharp & Dohme Corp., and NC Registry for Brain Health. He also serves on the University of Southern California (USC) ACTC grant, which receives funding from Eisai. He has provided consulting to Boxer Capital, LLC, Cerecin, Inc., Clario, Dementia Society of Japan, Dolby Family Ventures, Eisai, Guidepoint, Health and Wellness Partners, Indiana University, LCN Consulting, MEDA Corp., Merck Sharp & Dohme Corp., NC Registry for Brain Health, Prova Education, T3D Therapeutics, USC, and WebMD. He has acted as a speaker/lecturer for China Association for Alzheimer’s Disease and Taipei Medical University, as well as a speaker/lecturer with academic travel funding provided by: AD/PD Congress, Amsterdam UMC, Cleveland Clinic, CTAD Congress, Foundation of Learning; Health Society (Japan), Kenes, U. Penn, U. Toulouse, Japan Society for Dementia Research, Korean Dementia Society, Merck Sharp & Dohme Corp., National Center for Geriatrics and Gerontology (NCGG; Japan), USC. He holds stock options with Alzeca, Alzheon, Inc., ALZPath, Inc., and Anven. Dr Weiner received support for his research from the following funding sources: National Institutes of Health (NIH)/NINDS/NIA, Department of Defense, California Department of Public Health, University of Michigan, Siemens, Biogen, Hillblom Foundation, Alzheimer’s Association, Johnson & Johnson, Kevin and Connie Shanahan, GE, VUmc, Australian Catholic University (HBI-BHR), The Stroke Foundation, and the Veterans Administration.
All other authors have no conflicts to report.