Introduction
COVID-19 related precautions forced the field of neuropsychology to rapidly embrace telecommunication-based evaluations. At the peak of the COVID-19 crisis, leading public health organizations advised millions to shelter in place and limit person-to-person contact to prevent transmission (World Health Organization, 2020). Consequently, telemedicine became a critical medium for health care delivery among neuropsychologists, with many providers pivoting to virtually administered assessments (Hammers et al., 2020; Marra, Hoelzle, et al., 2020; Zane et al., 2021). To accommodate this sudden increase in telehealth utilization, insurance billing and reimbursement structures became more flexible (Centers for Medicare and Medicaid Services, 2021) and best practice guidelines emerged to support the responsible and ethical provision of remote neuropsychological services (Bilder et al., 2020). In addition, research that required ongoing in-person participation quickly adapted procedures to promote study continuity and to allow for remote data collection. These efforts were guided by the limited telehealth literature available at that time, little of which evaluated home-based virtual assessment.
Teleneuropsychology (teleNP), which the Inter Organizational Practice Committee defines as the use of any audiovisual technology (e.g., telephone, video conference) to facilitate remote neuropsychological assessment (Bilder et al., 2020), is being increasingly relied upon to bridge gaps in the provision of neuropsychological services, particularly when in-person evaluations are not possible. Although teleNP was used infrequently before the global COVID-19 pandemic (Miller & Barr, 2017), its adoption has grown and many neuropsychologists report an increased use of teleNP for clinical interviewing, test administration, and feedback (Hammers et al., 2020). For the purposes of this study, the term “teleNP” refers to traditional, face-to-face neuropsychological assessments that have been adapted to either a telephone-based or video conference-based format; other forms of audiovisual-aided neuropsychological assessment (e.g., computerized testing via specialized software packages or web-based platforms) were beyond the scope of this report.
Empirical evidence related to telephone-based neuropsychological evaluations
Telephone-based neuropsychological assessments characterize some of the earliest iterations of teleNP (Brandt & Folstein, 1988). This format may serve as a particularly useful medium for teleNP administration, given the widespread availability of telephones and the general ease of operating such devices. Cognitive screeners, such as the Telephone Interview for Cognitive Status (TICS) (Brandt & Folstein, 1988) and its modified version (TICS-M) (Welsh et al., 1993), are among the most extensively validated and widely used telephone-based instruments to date (Carlew et al., 2020; Castanho et al., 2014; Hunter et al., 2021). Additionally, the comparability of telephone and in-person assessments is well supported for most verbally administered tasks (e.g., verbal memory and language measures) (Carlew et al., 2020). For instance, Bunker et al. (2017) administered a battery of neuropsychological tests to a sample of older adults (mean age = 74.9 years) in person and then again via telephone 2–4 weeks later; the investigators found that mean scores obtained through in-person and telephone testing were strongly correlated for several measures, including the Hopkins Verbal Learning Test-Revised (HVLT-R) Total Recall (r = 0.87), Verbal Fluency (r = 0.92), and the Boston Naming Test 15-item (BNT-15) (r = 0.85) (note: the BNT-15 was heavily revised by the researchers to be compatible with verbal administration over the telephone). However, the evidence supporting the use of visuospatial measures via telephone is exceedingly limited (Thompson et al., 2001), and few telephone-based instruments are available that evaluate processing speed and executive functioning (Carlew et al., 2020). Interestingly, whereas early video-based teleNP studies were most frequently conducted onsite at a clinical or research setting (i.e., optimal conditions conducive to high standardization and experimental control), the vast majority of telephone-based investigations involve testing administered to participants directly in their homes (Carlew et al., 2020).
Empirical evidence related to video-based neuropsychological evaluations
Existing research evaluating video-based teleNP relative to traditional, in-person neuropsychological (NP) evaluation has thus far been encouraging, albeit under highly controlled conditions (Brearly et al., 2017; Cullum et al., 2006; Cullum et al., 2014; Grosch et al., 2015; Hildebrand et al., 2004; Marra, Hamlet, et al., 2020a; Wadsworth et al., 2016, 2018). For example, an early meta-analysis of 12 studies revealed that testing modality had a minor, statistically nonsignificant influence on performance, with verbally based measures showing the strongest reliability (Brearly et al., 2017). Interestingly, however, this meta-analysis revealed higher (33%), lower (61%), and equivalent (6%) mean test scores for video relative to in-person NP evaluations (Brearly et al., 2017). A relatively recent systematic review comparing video-based and in-person testing in older adults (aged 65 and older) supported teleNP-administered tests, particularly for cognitive screening tools (e.g., Montreal Cognitive Assessment, Mini-Mental State Examination) and tests measuring language, attention, and memory (Marra, Hamlet, et al., 2020a). In addition, a large early investigation of video-based teleNP (Cullum et al., 2014) in older adults reported moderate to excellent reliability across the assessed measures (i.e., ICCs ranging from 0.55 to 0.91). Importantly, however, most of these early studies examined a limited number of tests and administered video-based testing in well-controlled environments (e.g., via video in an office next to the examiner) that may not accurately reflect the less predictable nature of in-home teleNP.
There is limited research evaluating in-home, video-administered teleNP when applied to older adult populations (Abdolahi et al., 2016; Alegret et al., 2021; Fox-Fuller et al., 2022; Lindauer et al., 2017; Parks et al., 2021; Stillerova et al., 2016), with the home environment potentially being more susceptible to confounding factors, such as ambient noise (e.g., visitors or delivery persons ringing the doorbell), lapses in internet connectivity, poor audio or visual quality, and people or pets entering the testing area. Only three such studies were available in early 2020 at the start of the COVID-19 pandemic (Abdolahi et al., 2016; Lindauer et al., 2017; Stillerova et al., 2016), all of which focused on the video conference administration of the Montreal Cognitive Assessment (MoCA) and were associated with moderate to strong agreement across modalities.
Thus, there was a paucity of empirical data available when the University of Michigan temporarily halted in-person human subjects research to comply with state and federal COVID-19 lockdowns. Our Michigan Alzheimer’s Disease Research Center’s (MADRC) longitudinal study of memory and aging was therefore forced to shift to a virtual format. Shortly thereafter, the National Alzheimer’s Coordinating Center (NACC) disseminated a revised version of the Uniform Data Set version 3 (UDS v3.0) cognitive test battery to the ADRC network. This UDS v3.0 Telephone Cognitive Battery (known as the UDS v3.0 t-cog) preserved many of the core tests from the in-person battery, which we augmented with additional verbal measures (i.e., C Letter Fluency, Hopkins Verbal Learning Test-Revised) and UDS v3.0 tests that involved the presentation of visual stimuli (i.e., Benson Complex Figure, Multilingual Naming Test). This paper is, to our knowledge, the first to evaluate the reliability of the UDS v3.0 t-cog test battery (and additional tests from our local protocol) in a clinically mixed sample of 181 older adults. In exploratory analyses, we also assessed reliability estimates separately for the video-based and telephone-based testing modalities. Our primary goal was to describe the reliability of the UDS v3.0 t-cog measures in this real-world teleNP situation through an examination of intraclass correlation coefficients (ICCs).
Method
Participants
All study procedures – which were reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRBMED) – adhered to the ethical standards outlined in the Declaration of Helsinki. A total of 210 participants provided written informed consent at each time point and completed both an in-person UDS v3.0 assessment before March 12th, 2020 (when COVID-19 restrictions were implemented) and the next subsequent evaluation using the UDS v3.0 t-cog. Given the new remote testing format, a separate virtual meeting was held with each participant prior to virtual testing to obtain informed consent via SignNow, a secure, electronic signature platform supported by the University of Michigan. If participants were unable to navigate the SignNow interface, a physical copy of the informed consent document was mailed to the participant and reviewed during a telephone call. Participants were then instructed to sign and return their consent form via postal mail, using a pre-addressed, pre-stamped envelope that had been provided. Of the 210 total participants, 25 cases were deemed potentially invalid and were not included in our data analysis. Threats to validity were documented for each of these excluded assessments, with several cases (36%) citing multiple potential confounds. Reasons for exclusion included hearing impairment (9/25), technological issues (9/25), distractions/interruptions in the home (5/25), note-taking/“cheating” by the participant (3/25), unapproved assistance from others in the home (2/25), lack of effort or interest (3/25), fatigue (2/25), and emotional issues (2/25). Two cases that used a hybrid testing modality (i.e., a combination of telephone and video) were removed, and two other cases with an “impaired, not MCI” research diagnosis (i.e., participants with objectively impaired performance on neuropsychological testing but without subjective cognitive complaints or evidence of functional decline) were also excluded due to the small group size. These steps resulted in a total of 181 participants who were included in our final reported data analysis.
The UDS v3.0 t-cog was administered either via video conference (n = 122) or telephone (n = 59), with an average of approximately 16 months between evaluations (mean = 479.2 days; SD = 122.0 days; range = 320–986 days). All participants were English-speaking adults aged 52 years and older. Exclusionary criteria included a history of non-neurodegenerative neurologic injury or disease (e.g., moderate to severe traumatic brain injury, stroke, or epilepsy), central nervous system radiation therapy, or developmental delay. Those with significant psychiatric diagnoses (e.g., Bipolar Disorder, Schizophrenia, moderate to severe Major Depressive Disorder) or active substance abuse/dependence were also excluded.
The sample was predominantly female (66.9%) and mostly college educated (M = 16.3 years of education; SD = 2.5; range = 12–20 years). Mean age was 71.9 (SD = 6.8; range = 52.3–93.9). Self-reported race was 54.1% “White” and 38.7% “Black or African American” (see Table 1 for complete demographic characteristics). Participants held a consensus diagnosis of cognitively unimpaired (n = 120), mild cognitive impairment (MCI; n = 50), or dementia (n = 11) following the in-person evaluation and were re-diagnosed following the remote visit. Diagnoses were rendered via consensus conference that included neurologists, neuropsychologists, nurses, social workers, and other relevant specialists according to NACC guidelines (National Alzheimer’s Coordinating Center, 2015).
Abbreviations: SD, Standard Deviation; TeleNP, teleneuropsychology; UDS, Uniform Data Set.
Procedures
Neuropsychological test battery
Table 2 lists the tests used for each assessment type (i.e., in-person, video, and telephone). No alternate forms were used, as NACC does not provide alternate test forms for the UDS v3.0. Measures used in all formats included: the Montreal Cognitive Assessment (MoCA), Craft Story 21, Number Span Forward and Backward, Category Fluency (Animals and Vegetables), Letter Fluency (C, F, and L), the Hopkins Verbal Learning Test-Revised (HVLT-R), and the Trail Making Test A and B (note that C Letter Fluency and the HVLT-R were part of our “local” protocol and are not included in the UDS v3.0). Importantly, the video and telephone evaluations used the oral version of the Trail Making Test A and B as well as the Blind/Telephone MoCA. Blind/Telephone MoCA scores were converted to the traditional MoCA scale using the formula provided on the test publisher’s website (Nasreddine, 2022). We included both the Benson Complex Figure and the Multilingual Naming Test (MINT) during the video visits even though these measures were not included in the UDS v3.0 t-cog battery. For the Benson Complex Figure, examiners shared a digital version of the image via screenshare and asked participants to copy the image following standard (i.e., in-person) instructions. Once completed, participants held their figure in front of the webcam and the examiner saved a screenshot for subsequent scoring. Participants were then instructed to fold the piece of paper in half with the image on the inside, take it in their left hand, and place it on the floor. This three-step command effectively removed the Benson drawing from view while concurrently evaluating the participant’s ability to follow a multi-step command. Following the delay, the participant was asked to draw the figure from memory, and another screenshot was captured and later scored. At the end of each session, participants were instructed to dispose of their Benson Figure drawings in the trash; this was intended to protect test security and to help prevent any unapproved inspection/reproduction of test stimuli. To evaluate confrontation naming, we showed digital images of the MINT stimuli to participants and recorded their responses following standard procedures. Since visually based stimuli could not be administered during the telephone-based sessions, we used the Verbal Naming Test (VNT) instead of the MINT (per NACC guidance) and omitted the Benson Complex Figure.
Abbreviations: HVLT-R, Hopkins Verbal Learning Test-Revised; IP, In-Person; MoCA, Montreal Cognitive Assessment; MINT, Multilingual Naming Test.
a Proportion correct calculated for MINT and Verbal Naming Test given the different scales. Likewise, TMT B/A ratios were calculated for oral and written trails to ensure comparable metrics.
UDS v3.0 t-cog set-up
Participants completed cognitive testing from their homes using a personally owned telephone (for telephone assessments) or computing device (for video assessments). For video-administered testing, we were unable to standardize the nature of the device or screen size given pandemic-related restrictions; as such, participants were permitted to use any internet-enabled device (e.g., desktop computers, laptops, tablets, and smartphones). Examiners conducted testing from either the MADRC office space or from their homes using a University of Michigan computer and virtual private network (VPN). Technology was consistent across all examiners; the examiner set-up included a desktop computer, dual monitors, a headset with a built-in microphone, and a webcam. The UDS v3.0 t-cog battery was administered by secured video conference (n = 122) or telephone (n = 59) using either the “BlueJeans” or “Zoom for Health” telecommunication platforms. For video assessments, an identical, nondescript virtual background (i.e., an image of an empty room with a white wall and wooden floor) was used by all examiners. The test examiner asked participants to power down all other electronic devices and remain in a quiet place where they would not be disturbed for approximately 90 minutes. Participants were explicitly instructed to complete the testing session by themselves and were reminded that they were not allowed to take notes or receive assistance from others in the home while completing their evaluation. Another person was permitted to set up the telephone or video call if the participant was unable to do so on their own (National Alzheimer’s Coordinating Center, 2020); the individual providing assistance was then asked to leave the room immediately in order for testing to commence. At the start of each session, examiners performed an initial check of connection quality by ensuring participants could adequately hear and see the examiner and that the audio and visual connections were not “dropping” during conversation. Participants were also reminded to use sensory aids (e.g., hearing aids, eyeglasses) if they normally used such aids. Any factors that may have influenced the validity of a neuropsychological measure (e.g., note-taking, significant disruptions to internet connectivity) were recorded by the examiner and discussed with the larger team before deciding whether to exclude the case (see above). To enhance comparability of measurement, we converted MINT and VNT scores to percent correct and evaluated both raw scores (i.e., time to completion in seconds) and ratio scores (i.e., B/A ratios) for the Trail Making Tests (written for in-person; oral for the UDS v3.0 t-cog).
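As a brief illustration of the derived scores described above, the sketch below shows the two conversions: naming scores expressed as proportion correct (so that the differently scaled MINT and VNT can be compared on a common metric) and the Trail Making Test B/A ratio. This is a minimal sketch for clarity only; the function names are our own and the item totals in the usage example are placeholders rather than values taken from the study protocol.

```python
def proportion_correct(raw_score: int, max_items: int) -> float:
    """Convert a raw naming score (e.g., MINT or VNT) to proportion correct so
    that tests with different maximum scores can be compared directly."""
    return raw_score / max_items

def trails_ratio(time_b_seconds: float, time_a_seconds: float) -> float:
    """Compute the Trail Making Test B/A ratio (written or oral versions), which
    partially adjusts Part B performance for basic speed on Part A."""
    return time_b_seconds / time_a_seconds

# Hypothetical usage; the raw values and item totals are illustrative only.
print(proportion_correct(raw_score=28, max_items=32))            # naming proportion correct
print(trails_ratio(time_b_seconds=95.0, time_a_seconds=38.0))    # TMT B/A ratio
```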
Statistical methods
Except when otherwise noted, all analyses used raw scores. We used ICCs and 95% confidence intervals (CIs) to estimate test–retest reliability across neuropsychological measures. ICCs were interpreted according to established thresholds (Koo & Li, 2016): values below 0.50 indicate “poor” reliability, values between 0.50 and 0.75 suggest “moderate” reliability, values between 0.75 and 0.90 suggest “good” reliability, and values of 0.90 and above imply “excellent” reliability. Significance tests of the ICCs evaluated the null hypothesis that ICC = 0 and are represented by the 95% CIs.
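The text does not specify which ICC model was used, so the sketch below should be read as one plausible illustration rather than a reproduction of the study's analysis: it computes a two-way random-effects, absolute-agreement, single-measure ICC (often labeled ICC(2,1)) for one test's paired in-person and remote scores and then applies the interpretive bands described above. The function names and simulated scores are hypothetical and are not drawn from the study data.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-measure ICC.

    `scores` is an (n_participants, 2) array whose columns hold one test's paired
    raw scores, e.g. column 0 = in-person visit, column 1 = remote t-cog visit.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)            # per-participant means
    col_means = scores.mean(axis=0)            # per-modality means

    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((scores - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-participant mean square
    ms_cols = ss_cols / (k - 1)                # between-modality mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def interpret_icc(icc: float) -> str:
    """Apply the Koo and Li (2016) interpretive bands used in this paper."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

# Hypothetical example with simulated paired scores (not study data).
rng = np.random.default_rng(42)
in_person = rng.normal(loc=25, scale=4, size=181)
remote = in_person + rng.normal(loc=0, scale=3, size=181)   # correlated retest scores
icc = icc_2_1(np.column_stack([in_person, remote]))
print(f"ICC = {icc:.2f} ({interpret_icc(icc)})")
```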
To frame the primary results more accurately, we calculated comparable ICCs under two control conditions: (1) restricting analyses to participants who remained diagnostically stable across the two assessment points (n = 158; Table 4) and (2) comparing two consecutive in-person UDS v3.0 evaluations that both occurred prior to the COVID-19 pandemic (n = 276; mean time between visits = 398.9 days; SD = 88.1; range = 188–880 days; Table 5). For the repeat in-person control analysis, participants were selected from our longitudinal cohort of older adults who had completed two in-person assessments on or before March 11th, 2020; these analyses included data associated with the participants’ two most recent evaluations. Demographic characteristics of the in-person to in-person sample were similar to those of the primary sample [mean age = 72.1 years (SD = 7.6; range = 51.1–92.9); mean years of education = 15.9 (SD = 2.5; range = 8–20); 68.8% female; 55.8% White; 35.5% Black or African American]. Of the 276 cases compared in the repeat in-person sample, 141 were cognitively unimpaired; those with cognitive impairment held consensus research diagnoses of Amnestic MCI (n = 61), non-Amnestic MCI (n = 30), dementia of the Alzheimer’s type (n = 40), and mixed dementia (n = 4). Notably, this group had a higher proportion of dementia cases (15.9%) relative to the overall sample (6.1%) (Table 1).
Results
Overall sample comparing in-person with UDS v3.0 t-cog
Overall ICCs ranged from 0.01 to 0.79 across the tests of the 181 individuals included in the analysis (Table 3; Figure 1). ICCs fell in poor (15%), moderate (70%), and good (15%) agreement ranges. We found the strongest ICCs (i.e., “good”) for Craft Story Recall – Delayed Verbatim (ICC = 0.79) and Paraphrase (ICC = 0.77), and the Benson Complex Figure – Delayed (ICC = 0.79 during the video assessments). Conversely, the lowest ICCs were observed for the Trail Making Test-A/Oral Trail Making Test-A (TMT-A/OTMT-A) (ICC = 0.01), Trail Making Test-B/Oral Trail Making Test-B (TMT-B/OTMT-B) (ICC = 0.21), and TMT B/A Ratio (ICC = 0.11) (Table 3). This general pattern of results was evident when considering the video-based and telephone-based sessions separately, though it should be noted that four ICCs were relatively lower for telephone than for video-based sessions (Table 3: Number Span Forward, Number Span Backward, Category Fluency – Animals, and TMT B/A Ratio).
Abbreviations: HVLT-R, Hopkins Verbal Learning Test-Revised; ICC, Intraclass Correlation Coefficient; MINT, Multilingual Naming Test; MoCA, Montreal Cognitive Assessment; Obs, Observations; OTMT, Oral Trail Making Test; SD, Standard Deviation; TMT, Trail Making Test; VNT, Verbal Naming Test.
a Proportion correct calculated for MINT and Verbal Naming Test given the different scales. Likewise, TMT B/A ratios were calculated for oral and written trails to ensure comparable metrics.
b Video-only analyses.
c Means and SD for the combined (video and telephone remote visits) overall sample except where indicated as video only analyses.
Additional analyses
Our additional control analyses revealed two primary findings: (1) results were largely unchanged when limiting our analyses to those who remained diagnostically stable across these two time points (ICC Range: 0–0.78) (Table 4; Figure 1) and (2) ICCs were higher (ICC Range: 0.35–0.87) between consecutive in-person evaluations that occurred on or before March 11th, 2020 (Table 5; Figure 1). Importantly, the mean number of prior evaluations (i.e., those which occurred before the two visits included in the data analysis) was similar for our primary analysis (mean = 0.96; SD = 0.74; median = 1; range = 0–2) and the in-person to in-person control sample (mean = 0.38; SD = 0.49; median = 0; range = 0–2). Of the 181 participants in the primary in-person/virtual cohort, 125 were also in the in-person to in-person sample (69.1% overlap across groups).
Abbreviations: HVLT-R, Hopkins Verbal Learning Test-Revised; ICC, Intraclass Correlation Coefficient; MINT, Multilingual Naming Test; MoCA, Montreal Cognitive Assessment; Obs, Observations; OTMT, Oral Trail Making Test; SD, Standard Deviation; TMT, Trail Making Test; VNT, Verbal Naming Test.
a Proportion correct calculated for MINT and Verbal Naming Test given the different scales. Likewise, TMT B/A ratios were calculated for oral and written trails to ensure comparable metrics.
b Video only analyses.
c Means and SD for the combined (video and telephone remote visits) overall sample except where indicated as video only analyses.
Abbreviations: CU, Cognitively Unimpaired; Cog Imp, Cognitively Impaired (MCI + dementia); HVLT-R, Hopkins Verbal Learning Test-Revised; ICC, Intraclass Correlation Coefficient; MCI, Mild Cognitive Impairment; MINT, Multilingual Naming Test; MoCA, Montreal Cognitive Assessment; Obs, Observations; SD, Standard Deviation; TMT, Trail Making Test.
ICCs by diagnostic group
Exploratory diagnosis-specific results were limited by relatively small sample sizes for those with cognitive impairment (i.e., MCI and dementia) but revealed notable differences across diagnostic groups (Supplemental Tables 1 and 2). Specifically, cognitively unimpaired participants showed poor ICCs for HVLT-R Delayed Recall (ICC = 0.2) and HVLT-R Retention (ICC = 0.14), as well as TMT-A/OTMT-A (ICC = −0.01), TMT-B/OTMT-B (ICC = 0.19), and TMT B/A Ratio (ICC = 0.13). Symptomatic participants showed poor ICCs for Number Span Forward (ICC = 0.31), Number Span Backward (ICC = 0.39), TMT-A/OTMT-A (ICC = 0), TMT-B/OTMT-B (ICC = 0.15), and TMT B/A Ratio (ICC = 0.03).
Discussion
The COVID-19 pandemic necessitated accessible neuropsychological assessments capable of reaching individuals outside of traditional research and clinical settings. Growing evidence suggests that both telephone- and video-administered teleNP may serve as viable alternatives to traditional, in-person assessment (Brearly et al., 2017; Carlew et al., 2020; Marra, Hamlet, et al., 2020); however, the psychometric properties of teleNP when administered directly to the home remain understudied, particularly for video-based neuropsychological evaluations. This investigation is the first, to our knowledge, to evaluate the reliability of the UDS v3.0 t-cog test battery, as well as additional measures from our local study protocol. In general, our results are encouraging and suggest mostly moderate to good agreement between in-person and teleNP testing conditions (overall ICC Range = 0.01–0.79; ICC Range = 0.53–0.79 if excluding TMT/OTMT) (Table 3). Although our reliability estimates are, in some cases, less robust than those of prior teleNP investigations (see Cullum et al., 2014 as an example), this may be partially explained by a lengthier testing interval than has typically been reported in other studies (Brearly et al., 2017; Hunter et al., 2021) – a factor that was outside of our control given the pandemic. Additionally, the variability in scores observed across assessments might be reasonably attributed to a certain degree of expected change within aging populations. For perspective, Webb et al. (2022) conducted test–retest analyses in a large sample (n = 16,956) of older adults (age ≥ 65 years) who completed a series of cognitive tests in person (i.e., the Modified Mini-Mental State, Symbol Digit Modalities Test, Hopkins Verbal Learning Test-Revised, and Controlled Oral Word Association Test) at baseline and at one-year follow-up; results were associated with ICCs in the moderate to good range (ICC Range = 0.53–0.77) and provide a useful point of comparison with our study. Our findings were not driven by clinical conversion/reversion, given that our first control analysis revealed comparable ICCs in a diagnostically stable subgroup (ICC Range = 0–0.78) (Table 4). Our second set of control analyses revealed relatively higher ICCs in comparably timed, repeat in-person evaluations using the same neuropsychological measures (ICC Range = 0.35–0.87) (Table 5). These latter differences cannot be accounted for by prior experience or practice effects since our samples had a comparable number of prior evaluations. Thus, there appears to be some relative loss of reliability when shifting from in-person to virtual assessment, though we cannot comment on the clinical or research ramifications of this difference.
Our findings suggest that the general field of neuropsychology can have confidence in several UDS v3.0 measures when administered virtually: specifically, the Craft Story Recall – Delayed Paraphrase and Verbatim, Letter Fluency (C, F, and L), MINT, and the Benson Complex Figure – Delayed. The strong reliability estimates associated with Craft Story and Letter Fluency are consistent with prior research that has supported the cross-modal comparability of verbally mediated tasks across traditional, in-person and remote (i.e., telephone or video-based) testing conditions (Brearly et al., 2017; Carlew et al., 2020; Hunter et al., 2021). The latter two measures (i.e., MINT and Benson Complex Figure) are important to note since they were not included in the NACC UDS v3.0 t-cog test battery. The relatively strong ICCs observed for the Benson Complex Figure – Delayed are important for the field given the relative paucity of teleNP measures that assess visuospatial functioning (Brearly et al., 2017; Carlew et al., 2020). These findings suggest our approach may be viable for other measures of visuoperception and visuoconstruction. The MINT was moderately reliable when comparing video-based and in-person administrations (ICC = 0.73); reliability was also moderate when comparing in-person MINT scores with scores obtained via telephone on the Verbal Naming Test (ICC = 0.71; ICC calculated using the percentage of correct responses to account for the different scales of the MINT and VNT). Overall, our results are slightly less favorable than past studies that have compared in-person and video-based administrations of the Boston Naming Test 15-item short form (BNT-15) – a confrontation naming task similar to the MINT – which has previously been associated with ICCs of 0.81 (Cullum et al., 2014), 0.87 (Cullum et al., 2006), and 0.93 (Wadsworth et al., 2016). Notably, participants in these prior studies completed both video and in-person evaluations on the same day, whereas our testing interval was far longer (i.e., approximately 16 months). As such, it is unsurprising that our lengthy retest interval resulted in comparatively lower ICCs.
The HVLT-R and its subtests revealed moderately strong ICCs, ranging from 0.56 to 0.65 in our overall remote sample (ICC Range = 0.54–0.66 for video-based evaluations; ICC Range = 0.59–0.71 for telephone-based evaluations). This pattern is consistent with, albeit somewhat weaker than, past studies using video-based HVLT-R administration (ICCs of 0.77–0.88) (Cullum et al., 2006, 2014; Wadsworth et al., 2016, 2018). Interestingly, Bunker and colleagues (2017) found HVLT-R correlation coefficients of r = 0.27–0.87 for in-person versus telephone-based administration. Our data are consonant with those of Bunker et al. (2017), as both studies observed the weakest relationships for the HVLT-R percent retention scores, so some degree of caution may be warranted when interpreting this measure. We again suspect that our longer test–retest interval played a role in these findings but do not believe it fully accounts for them, given the stronger ICCs for the consecutive in-person evaluations (ICCs of 0.56–0.76; Table 5).
Our results warrant caution when using the OTMT-A or B in place of the written versions of these measures, based on ICCs that fell in the poor reliability range. This conclusion was somewhat anticipated, as the raw scores for the two versions of each task are known to differ considerably from one another (i.e., higher raw values are expected on the in-person, written TMT relative to the oral analog of the task) (Ricker & Axelrod, 1994). To account for these differences, we calculated an ICC using the TMT B/A ratio, although this produced a similarly weak correlation between in-person and remote testing conditions (ICC = 0.11). The original OTMT validity study (Ricker & Axelrod, 1994) found strong Pearson correlation coefficients for OTMT-A/TMT-A (r = 0.68) and OTMT-B/TMT-B (r = 0.72). However, a more recent study (Mrazik et al., 2010) revealed weaker correlations between OTMT-A/TMT-A (r = 0.29) and OTMT-B/TMT-B (r = 0.62). Another investigation reported that the OTMT-A failed to distinguish between cognitively healthy and cognitively impaired participants due to a compressed range of scores with little variability (Bastug et al., 2013). Thus, there is a consensus across studies (Bastug et al., 2013; Kaemmerer & Riordan, 2016; Mrazik et al., 2010) that the OTMT-A may not be an adequate substitute for the TMT-A due to fundamental differences in task design: the OTMT-A places fewer cognitive demands on the participant (e.g., it does not involve visual scanning or effortful number sequencing) and may elicit an over-learned, rote response. Overall, with respect to the OTMT, our findings align with past studies showing questionable agreement between the OTMT and TMT and suggest that users should be cautious when using the OTMT for diagnostic purposes. Future studies should evaluate whether the lack of agreement between the OTMT and TMT arises from the oral-only nature of the task and/or the teleNP platform.
Strengths and limitations
As with all studies, several limitations exist, many of which were due to the unanticipated COVID-19 pandemic. First, the significant lapse in time between the in-person and UDS v3.0 t-cog testing sessions (i.e., approximately 16 months on average, but in some cases as great as three years) likely weakened test–retest reliability estimates. However, our control analyses revealed that these patterns were not due to diagnostic conversion/reversion (Table 4) and were instead related to the cross-modal assessment, since comparably timed in-person evaluations had relatively higher ICCs (Table 5). While our total sample (n = 181) rivals the largest pre-COVID-19 investigation of teleNP (n = 202) (Cullum et al., 2014), our study was more heavily weighted toward cognitively unimpaired participants, so it was surprising that ICCs on some measures (e.g., HVLT-R) were notably below those for cognitively impaired participants. This is an unexpected finding that warrants replication, as we had anticipated greater variability in patient populations. We again emphasize that ICCs were not driven by diagnostic change and that they were relatively stronger for consecutive in-person visits. Another notable limitation relates to our well-educated sample (M = 16.3 years of education), which potentially limits the generalizability of our findings. Additionally, pandemic-related restrictions rendered us unable to standardize participants’ testing equipment and technological set-up (e.g., computing device type, audiovisual quality, internet connection speed). We cannot rule out the possibility that variability in participant technology influenced our results (e.g., worse performance on devices with smaller screens, such as tablets or smartphones), and we encourage future investigations to control for these factors more systematically. Furthermore, as some methods (e.g., shredding Benson Complex Figure renderings) were simply not feasible given the context of the current study, future efforts should ensure more rigorous test security methods according to published guidelines (Boone et al., 2022). Finally, the number of sessions performed for each modality (i.e., video vs. telephone) was relatively modest, so we encourage replication of all findings. While each limitation is notable, the overall study may provide a more ecologically valid reflection of real-world teleNP when compared to prior studies conducted under tightly controlled (i.e., ideal) test parameters.
A strength of our study lies in its racial diversity (i.e., 38.7% Black or African American), which addresses a critical gap in the literature (Marra, Hamlet, et al., 2020). Although it was beyond the scope of this investigation to understand whether reliability estimates were differentially influenced by race, our mostly moderate to good agreement across in-person and remote testing modalities lends general support for the adoption of teleNP in a racially diverse sample. We encourage other researchers to explore teleNP more thoroughly within diverse populations to ensure appropriate inclusion and generalizability of empirical findings.
Conclusion and future directions
Within the context of the naturalistic “experiment” created by the COVID-19 pandemic, our findings revealed primarily moderate to good relationships between the UDS v3.0 t-cog test battery and its in-person counterpart. For certain measures, reliability was somewhat stronger when testing was delivered via video as opposed to telephone, possibly owing to the additional facial cues available in this format (e.g., participants might more effectively register verbal information when able to see the examiner’s facial expressions and lip movements). In summary, this report is an important initial step in evaluating the reliability of the UDS v3.0 t-cog test battery, and of in-home teleNP testing more broadly. Future work should clarify how diagnostic group and retest interval affect reliability and should consider the ecological validity of in-home testing relative to traditional, tightly controlled settings.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S1355617723000383
Acknowledgments
The authors acknowledge the National Institute on Aging and the Michigan Alzheimer’s Disease Research Center, University of Michigan (P30AG053760 and P30AG072931) for making this work possible.
Funding statement
The authors also acknowledge funding from the National Institute on Aging to BMH (R35AG072262).
Competing interests
None.