Introduction
In clinical practice, impressions of speech are routinely employed as an element of mental status examination and are a primary source of information in the diagnostic process. Especially in schizophrenia-spectrum disorders, descriptions of speech are used to assess specific symptoms such as alogia (e.g. poverty of speech) and blunted affect (e.g. diminished vocal intonation), as well as positive symptoms such as excitement (e.g. pressured speech; Alpert, Shaw, Pouget, & Lim, Reference Alpert, Shaw, Pouget and Lim2002). Although there is no consensus on the best sub-categorization of the different symptoms of schizophrenia-spectrum disorders yet, an important distinction is that between positive and negative symptoms. Being able to determine at different moments in time whether a schizophrenia-spectrum patient experiences predominantly positive or negative symptoms is of great clinical importance because these symptom subtypes require different treatment (Fusar-Poli et al., Reference Fusar-Poli, Papanastasiou, Stahl, Rocchetti, Carpenter, Shergill and McGuire2015). Moreover, persistent negative symptoms are strongly related to functional outcomes (Lin et al., Reference Lin, Huang, Chang, Chen, Lin, Tsai and Lane2013; Milev, Ho, Arndt, & Andreasen, Reference Milev, Ho, Arndt and Andreasen2005), further highlighting the need for a reliable distinction between subtypes. From a scientific perspective, reliable identification of relevant subgroups of patients is highly important because the neurobiological underpinnings of their disorders are likely different (Cuthbert & Insel, Reference Cuthbert and Insel2013; Insel, Reference Insel2014). Consider a patient with grandiose delusions, hallucinations, pressured speech and derailment, v. a patient who presents with social withdrawal, alogia and catatonia: although both patients could be classified as having schizophrenia, the processes underlying these clinically non-overlapping symptom-collections might be entirely different.
Traditionally, clinical rating scales, such as the Positive and Negative Syndrome Scale (PANSS; Kay, Fiszbein, & Opfer, Reference Kay, Fiszbein and Opfer1987) are used to assess both positive and negative symptoms in schizophrenia. However, previous research has stressed the dramatic disparity between speech-based objective measurements of individual characteristics (e.g. flat affect) and scale-based subjective assessments assessed by clinicians; symptom rating scales are, at best, only modestly related to objective measurements of the behavior they intend to reflect (Cohen, Mitchell, & Elvevåg, Reference Cohen, Mitchell and Elvevåg2014). Moreover, some PANSS items are highly influenced by experience and preconceptions of the rater, which results in low inter-rater reliability (Kølbæk et al., Reference Kølbæk, Blicher, Buus, Feller, Holm, Dines and Correll2018). In daily practice, these subjective assessments are not performed scale-based, but ad hoc, leading to even higher inter-observer variation. Especially, the assessment of negative symptoms remains problematic, since consensus-based instruments diverge on what should be considered true negative symptoms and what should be considered general/other illness components; such as, for example, cognition (Marder & Galderisi, Reference Marder and Galderisi2017). Developing objective measurements to complement clinical ratings is thus fundamental to overcome these serious validity concerns with clinical assessments of positive and negative symptoms, and enable development towards measurement-based care (Insel, Reference Insel2017).
Recent advances in natural language processing have paved the way towards automated speech analyses as a biomarker for psychosis (Corcoran et al., Reference Corcoran, Mittal, Bearden, Gur, Hitczenko, Bilgrami and Wolff2020; de Boer et al., Reference de Boer, Voppel, Begemann, Schnack, Wijnen and Sommer2018; de Boer, Brederoo, Voppel, & Sommer, Reference de Boer, Brederoo, Voppel and Sommer2020; Hitczenko, Mittal, & Goldrick, Reference Hitczenko, Mittal and Goldrick2020; Tan & Rossell, Reference Tan and Rossell2020). Most studies in this field analyze language content, including methods of measuring discourse coherence, syntactic complexity and referential coherence (Bedi et al., Reference Bedi, Carrillo, Cecchi, Slezak, Sigman, Mota and Corcoran2015; Corcoran et al., Reference Corcoran, Carrillo, Fernández-Slezak, Bedi, Klim, Javitt and Cecchi2018; Holmlund et al., Reference Holmlund, Chandler, Foltz, Cohen, Cheng, Bernstein and Elvevåg2020; Mota, Furtado, Maia, Copelli, & Ribeiro, Reference Mota, Furtado, Maia, Copelli and Ribeiro2014; Palaniyappan et al., Reference Palaniyappan, Mota, Oowise, Balain, Copelli, Ribeiro and Liddle2019; Rezaii, Walker, & Wolff, Reference Rezaii, Walker and Wolff2019). A source of information that is less commonly used is the acoustic aspects of speech. Articulation and other components involved in speech production can be measured in the acoustic signal, for example by decomposing sound waves into formants; i.e. acoustic resonance frequencies that indicate the position and movement of the articulatory organs when speaking. F1 (first formant) indicates jaw/mouth opening and tongue height, while F2 corresponds to tongue positioning (front/back) and lip rounding. Furthermore, F0, or fundamental frequency, is an acoustic parameter of pitch. Such acoustic speech features (F0 and F2 variability) have been associated with specific negative symptoms (Bernardini et al., Reference Bernardini, Lunden, Covington, Broussard, Halpern, Alolayan and Compton2016; Covington et al., Reference Covington, Lunden, Cristofaro, Wan, Bailey, Broussard and Compton2012) and have also been utilized as differentiators between individuals with schizophrenia-spectrum disorders and healthy individuals in small samples (Martínez-Sánchez et al., Reference Martínez-Sánchez, Muela-Martínez, Cortés-Soto, García Meilán, Vera Ferrándiz, Egea Caparrós and Pujante Valverde2015; Tahir et al., Reference Tahir, Yang, Chakraborty, Thalmann, Thalmann, Maniam and Dauwels2019), attaining overall classification accuracies varying from 81% to 94%. Although these studies are promising, the larger literature also reports studies that failed to find a difference in acoustic features between schizophrenia-spectrum patients and healthy individuals (Cohen, Mitchell, Docherty, & Horan, Reference Cohen, Mitchell, Docherty and Horan2016; Meaux, Mitchell, & Cohen, Reference Meaux, Mitchell and Cohen2018).
These inconsistent findings may be due to the fact that several studies took different approaches to computing and standardizing acoustic parameters (Eyben et al., Reference Eyben, Scherer, Schuller, Sundberg, Andre, Busso and Truong2016). Individual research groups develop and employ their own set of features, and the studies reported above used acoustic feature sets that overlap only partially. Furthermore, studies often include non-acoustic features (e.g. the usage of determiners in speech) in their analyses. Therefore, the results reported in the literature cannot always be meaningfully compared (de Boer et al., Reference de Boer, Brederoo, Voppel and Sommer2020). Indeed, a recent meta-analysis found the existing literature to be highly diverse and unsystematic (Parola, Simonsen, Bliksted, & Fusaroli, Reference Parola, Simonsen, Bliksted and Fusaroli2020). Furthermore, the rapid developments in natural language processing have led to a proliferation in features, often amounting in up to tens of thousands of acoustic parameters (Marmar et al., Reference Marmar, Brown, Qian, Laska, Siegel, Li and Vergyri2019). While this abundance in features allows capturing many acoustic characteristics, it comes at the cost of overfitting and preventing clear interpretations of the underlying mechanisms (de Boer et al., Reference de Boer, Brederoo, Voppel and Sommer2020). We analyzed the speech of patients with a schizophrenia-spectrum disorder and healthy controls using the extended Geneva Acoustic Minimalistic Parameter Set (eGeMAPS) for acoustic analyses (Eyben et al., Reference Eyben, Scherer, Schuller, Sundberg, Andre, Busso and Truong2016). This parameter set was developed to provide a baseline for affective speech processing, in order to ensure easy replication and to improve comparability of parameters.
Overall, some recent studies have shown promising results in identifying schizophrenia-spectrum disorders from healthy controls utilizing automated analysis of acoustic speech patterns (Martínez-Sánchez et al., Reference Martínez-Sánchez, Muela-Martínez, Cortés-Soto, García Meilán, Vera Ferrándiz, Egea Caparrós and Pujante Valverde2015; Tahir et al., Reference Tahir, Yang, Chakraborty, Thalmann, Thalmann, Maniam and Dauwels2019). Yet, inconsistencies remain and replications in large samples using interpretable and standardized speech parameters are required (Meaux et al., Reference Meaux, Mitchell and Cohen2018; Parola et al., Reference Parola, Simonsen, Bliksted and Fusaroli2020). Thus, the first aim of the current study is to replicate the diagnostic potential of standardized speech parameters in a large sample of patients with a schizophrenia-spectrum disorder. The second aim is to explore the value of acoustic analyses in differentiating between patients who at that time experience predominantly positive v. negative psychotic symptoms. Third, we aim to explore the relation between acoustic features and cognitive and psychotic symptoms, as well as the relation with antipsychotic medication. Based on previous study from our group, we specifically expect increased pauses to be related to the use of typical antipsychotics (de Boer, Voppel, Brederoo, Wijnen, & Sommer, Reference de Boer, Voppel, Brederoo, Wijnen and Sommer2020).
Methods
Participants
Data from 142 Dutch patients with a schizophrenia-spectrum disorder and 142 healthy age- and sex-matched controls from the same community were collected between 2015 and 2020. All procedures were approved by the Ethical Committee of the University Medical Center Utrecht. Psychiatric diagnoses were confirmed by the Structured Clinical Interview for DSM-IV (SCID; First, Reference First2014), the Comprehensive Assessment of Symptoms and History (CASH; Andreasen, Flaum, & Arndt, Reference Andreasen, Flaum and Arndt1992) or the Mini-International Interview (MINI; Sheehan et al., Reference Sheehan, Lecrubier, Sheehan, Amorim, Janavs, Weiller and Dunbar1998), depending on the study the participants originally enrolled in. Healthy controls were screened for the absence of previous or current mental illness. Symptom severity and cognitive functioning were rated by consensus rating of two trained researchers, blinded to phonetic analysis, with the PANSS (Kay et al., Reference Kay, Fiszbein and Opfer1987) and Brief Assessment of Cognition (BACS; Keefe et al., Reference Keefe, Goldberg, Harvey, Gold, Poe and Coughenour2004). The patients were categorized as having predominant negative v. positive symptoms based on PANSS scores. Since there is no consensus in the literature on cutoff scores for such subcategorizations, a median split was used. Patients were therefore categorized as having predominantly negative symptoms if they had a greater score on the negative than on the positive subscale of the PANSS, and a negative subscale score ⩾ the median, being 12. The patients with predominantly positive symptoms consisted of patients with PANSS positive subscale scores that were greater than their negative subscale scores, and a positive subscale score ⩾ the median, being 9. Of the 142 schizophrenia-spectrum cases, 44 had predominantly positive symptoms and 45 had predominantly negative symptoms, the remaining patients had an equal combination of positive and negative symptoms.
Patients were divided into two categories based on different dopamine binding profiles, namely patients with (1) low dopamine D2 receptor (D2R) occupancy drugs (i.e. quetiapine, paliperidone, olanzapine and clozapine) or (2) high D2R occupancy drugs (i.e. aripiprazole, risperidone, flupentixol, amisulpride and haloperidol), following a previous report by our group (de Boer et al., Reference de Boer, Voppel, Brederoo, Wijnen and Sommer2020). Antipsychotic drug dosages were recalculated into chlorpromazine equivalents to evaluate the effect of dosage between the drugs (Leucht et al., Reference Leucht, Samara, Heres, Patel, Woods and Davis2014).
Procedure
Speech recording
Semi-structured interviews varying from 5 to 30 min in length (average 11 min) were conducted using a set of neutral open-ended questions. For an elaborate description of this methodology, see previous reports by our group (de Boer, van Hoogdalem, et al., Reference de Boer, van Hoogdalem, Mandl, Brummelman, Voppel, Begemann and Sommer2020; de Boer, Voppel, et al., Reference de Boer, Voppel, Brederoo, Wijnen and Sommer2020; Voppel, de Boer, Brederoo, Schnack, & Sommer, Reference Voppel, de Boer, Brederoo, Schnack and Sommer2021). AKG-C544l cardioid microphones were used to record the participant's and interviewer's speech onto two different channels. Speech was digitally recorded onto a Tascam DR40 solid state recorder, at a sampling rating of 44.1 kHz with 16-bit quantization. Head-worn microphones were used to keep the distance between the mouth and the microphone as constant as possible (2 cm), since this has a considerable effect on recorded vocal loudness. Microphones were placed at a 45° angle from the participant's mouth in order to prevent aerodynamic noise.
Speech pre-processing
To remove crosstalk (i.e. speech from the interviewer on the participant's audio channel), the following steps were taken: (1) the ‘annotate to text grid silences’ function in PRAAT (Boersma & Weenink, Reference Boersma and Weenink2013) was used on the interviewer's channel (settings: minimum pitch 100 Hz, time step 0.0, silence threshold −30.0 dB, minimum silence duration 1.0 s, minimum sounding duration 0.1 s); (2) all resulting speech segments in which the interviewer was silent were selected on the participants channel and (3) the resulting speech segments were concatenated into a new audio file containing only segments of speech of the participant.
Acoustic parameters
The eGeMAPS parameter set was extracted from the participant's speech using OpenSMILE (Eyben, Weninger, Gross, & Schuller, Reference Eyben, Weninger, Gross and Schuller2013). eGeMAPS provides arithmetic means and coefficients of variation [standard deviation (s.d.) normalized by the arithmetic mean] for each parameter. A total of 88 parameters were computed at the speaker level and were used for feature selection and classification model building. For a full overview of the parameters, see online Supplementary Table S1. The parameters can be divided into the following types (Eyben et al., Reference Eyben, Weninger, Gross and Schuller2013): temporal (e.g. speech rate; six parameters), frequency (e.g. fundamental frequency; 24 parameters), spectral [e.g. Mel-frequency cepstral coefficients (MFCCs), relative energy in different frequency bands; 43 parameters] and energy/amplitude (e.g. intensity; 14 parameters).
Cognitive functioning
Cognition was assessed in all patients using the BACS (Keefe et al., Reference Keefe, Goldberg, Harvey, Gold, Poe and Coughenour2004) which consists of the following tasks: (1) List learning – Verbal memory; (2) Digit sequencing - Working memory; (3) Token motor task - Motor speed; (4) Category Instances and Controlled Oral Word Association Test - Verbal fluency; (5) Symbol coding - Attention and information processing speed and (6) Tower of London - Executive function. Individual BACS scores were converted into standardized Z-scores that are corrected for age and gender based on previously published norm scores (Keefe et al., Reference Keefe, Harvey, Goldberg, Gold, Walker, Kennel and Hawkins2008).
Statistical analysis
For categorical variables, a χ2 test, and for continuous variables analysis of variance (ANOVAs) were used to assess differences between groups in demographic characteristics. Following previous research (Marmar et al., Reference Marmar, Brown, Qian, Laska, Siegel, Li and Vergyri2019), random forest algorithms were used to build machine-learning classifiers using acoustic speech parameters to differentiate between schizophrenia-spectrum patients and healthy controls, and to classify symptom patterns (i.e. predominantly positive v. predominantly negative symptoms). In this study, random forest classifiers were built using the 88 speech markers, using leave-ten-out cross-validation to divide the data into test and train sets. Random forests are based on multiple classification trees. The nodes in these decision trees are based on a binary ‘split’ of a predictor, aimed at minimizing misclassification in a training subset of the data. The process continues recursively until a tree is formed that does not improve from further splits or nodes. The probability of being a member of a certain target class, is the fraction of that class in the terminal node into which they fall. Once trees are iteratively built, per-case classification is performed on the testing subset. The resulting probability estimates are used to generate a receiver operator curve (ROC) and its area under the curve (AUC). From the trained trees, the importance of a specific predictor (Gini-importance score) is estimated by setting the predictor to a random number, and comparing the difference in predictive power (AUCs) to the final model, measuring how much worse the model becomes when replacing the predictor with random data. Gini-importance scores are average scores over all recursive trees, and can only be interpreted in relation to each other. Bivariate Pearson correlations were performed to analyze associations between acoustic variables and clinical characteristics in the patients for the top 10 features in both models. Correlation analyses were corrected for multiple comparison using false-discovery rate (FDR). Alpha was set to 0.05 for all analyses.
Results
The schizophrenia-spectrum patients and controls did not differ significantly in age, sex or parental educational levels (see Table 1). The patients overall received less education than the controls, which is to be expected given that the first psychosis often occurs during educational years. The schizophrenia-spectrum patients with predominantly positive symptoms had a longer illness duration than the patients with predominantly negative symptoms. These subgroups did not differ significantly on age, sex and parental educational levels. The patients with predominantly positive symptoms had a longer illness duration than the patients with predominantly negative symptoms (Table 1).
HC: healthy controls; SSD: schizophrenia-spectrum disorder patients.
Psychotic symptom subgroup analyses were performed in those patients that met criteria for either predominantly positive or negative symptoms. These subgroups are a subset (n = 89) of the full schizophrenia-spectrum disorder sample (n = 142).
** Indicates p value <0.01.
Diagnostic classifier
The trained 10-fold cross-validated classifier had an accuracy of 86.2% with an AUC of 0.92 in distinguishing schizophrenia-spectrum patients from healthy controls. Sensitivity was 85.1% and specificity 87.2% (see Table 2).
Legend: SSD: schizophrenia-spectrum disorder patients, HC: healthy controls, sens: sensitivity, spec: specificity, CI: confidence interval, NPV: negative predictive value, PPV: positive predictive value, AUC: area under the curve.
Table 3 lists the ten acoustic parameters with the highest importance scores in the final model, as well as the means and s.d.s of these parameters in the schizophrenia-spectrum patients and controls. For a full list of Gini-importance scores, see online Supplementary Table S1. Three of the top parameters in the model were temporal (voiced segment rate, unvoiced segment length and s.d.), indicating short, fragmented speech segments and longer pauses in the schizophrenia-spectrum patients compared to controls. Patients with a schizophrenia-spectrum disorder were further classified using spectral parameters between groups, namely a reduced mean spectral slope of voiced and unvoiced regions (indicating a more tensed voice in the patients) and reduced spectral flux variation (i.e. less difference between spectra measured between two consecutive time windows, which may indicate more monotone speech). Patients were further characterized by reduced variation of F2 bandwidth (i.e. vowel bandwidth, can indicate breathiness). Moreover, the energy of the first and third formant as compared to pitch (F0) differed from healthy controls, indicating differences in intonation characteristics and voice quality. The patients showed a larger variation in loudness compared to the controls.
s.d., standard deviation; UV, unvoiced; V, voiced.
Ranking based on average Gini-importance scores.
** indicates p value <0.01.
Psychotic symptoms classifier
The trained 10-fold cross-validated model had an accuracy of 74.2% with an AUC–ROC of 0.76 in distinguishing patients with predominantly negative from those with predominantly positive symptoms. Sensitivity was 65.9% and specificity 80.0% (see Table 2). Table 4 lists the top ten acoustic parameters in the trained model, as well as the means and s.d.s of these parameters in the predominant positive and negative symptom groups. For a full list of Gini-importance scores, see online Supplementary Table S2. Patients with positive symptoms had a smaller spectral slope indicating a smaller difference in energy between low and high frequencies. Other spectral parameters did not differ between groups. Seven of the top ten parameters were frequency parameters (i.e. relating to the way vowels are pronounced); patients with positive symptoms had a smaller F1 and F2 bandwidth (i.e. vowel bandwidth, indicating breathiness), lower mean F1 frequencies, less jitter variation (i.e. roughness or voice quality) and less variation in F3 frequency (i.e. vowel frequency).
Ranking based on average Gini-importance scores. s.d.: standard deviation; UV: unvoiced; V: voiced. ** indicates p value <0.01, * indicates p value <0.05.
Associations between acoustic variables and clinical characteristics
Psychotic and cognitive symptoms
Pearson correlations for the top ten parameters from both classifiers with PANSS positive, negative and general and composite BACS scores are presented in Table 5. Acoustic variables correlated most strongly with negative and to a lesser degree general PANSS scores. After FDR correction, no significant associations with cognitive functioning were found.
Legend: Reported values are Pearson's r values. ** indicates p value <0.01, * indicates p value <0.05. Bold values remained significant after FDR correction for multiple comparisons.
Antipsychotic medication
Pearson correlations between chlorpromazine equivalents and all 88 acoustic parameters revealed only one weak association; slope V500–1500 (s.d.) was correlated with chlorpromazine equivalent dose (r = −0.207, p = 0.019). This association did not remain significant after FDR correction. A multivariate analysis of variance (MANOVA) analysis comparing patients with low DR2 occupancy drugs with patients using high DR2 occupancy drugs, and chlorpromazine equivalent dose as a covariate, revealed no significant overall group effect of medication type (F (1, 85) = 1.191, p = 0.269) or dose (F (1, 85) = 1.260, p = 0.206) on the acoustic parameters. An ANOVA analysis comparing the two medication groups on pause duration (mean unvoiced segment length) specifically, with chlorpromazine equivalent dose as a covariate, revealed a significant main effect of drug type (F (1, 126) = 4.532, p = 0.035, partial η 2 = 0.035). Antipsychotic dose showed no significant effect on pause duration (p = 0.642). Patients with high DR2 occupancy drugs showed a greater pause duration (mean = 0.28) than the patients with low DR2 occupancy drugs (mean = 0.24).
Discussion
This study confirms previous research in showing that patients with a schizophrenia-spectrum disorder can be reliably distinguished from healthy control participants by using automated speech-based techniques, in a large sample. We extend the current literature by further showing that patients with a schizophrenia-spectrum disorder can be divided into those with predominantly positive v. negative symptoms based on their speech acoustics.
By using standardized acoustic measures, high sensitivity, specificity and accuracy could be achieved in differentiating patients from controls. The model assigns higher probability to a schizophrenia-spectrum diagnosis when speech is fragmented (i.e. frequent short-speech segments), has a decreased spectral slope (i.e. less difference in energy between high- and low-speech frequencies) and longer pauses. We further demonstrated that speech parameters can be used to identify subjects with positive and negative symptoms in schizophrenia-spectrum disorders. This model assigns higher probability to positive symptoms when there is less variation in jitter (i.e. roughness of the voice), as well as differences in vowel quality (e.g. breathiness and rounding of vowels, indicated by reduced variation in F3 formant frequencies, a lower F1 formant frequency and a smaller F1 and F2 formant bandwidth (i.e. reach of the formants, affected by jaw/mouth opening as well as tongue and lip positioning). This indicates that speech-markers do not only differentiate between ‘normal’ and ‘disturbed’ speech, but also (reliably) reflect positive and negative symptoms.
Our results showed that temporal (i.e. timing related) parameters are highly important in classifying patients and controls, as three out of six temporal parameters were included in the top ten parameters. This is an interesting finding since temporal parameters comprise only 6.8% of the acoustic parameters used. These findings are in correspondence with a recent meta-analysis indicating that temporal aspects such as speech rate and pausing patterns are often disturbed in schizophrenia-spectrum disorders (Cohen et al., Reference Cohen, Mitchell and Elvevåg2014; Parola et al., Reference Parola, Simonsen, Bliksted and Fusaroli2020). Contrastingly, temporal parameters were not among the top ten parameters characterizing positive v. negative psychotic symptoms. Timing-related parameters are not specific to a single-mental process. In this study, we did not find an association between overall cognitive functioning and the timing related parameters in our models. However, a more detailed exploration of this relation could reveal associations between acoustic-timing and specific aspects of cognitive functioning. Reduced speech rate can, for example, be the result of slower word-retrieval, slower processing speed, slower articulation or increased distractibility. Recent work has shown that increased pausing in schizophrenia-spectrum disorders is mostly observed clause-initial, suggesting that these patients require more time to formulate a sentence (Çokal et al., Reference Çokal, Zimmerer, Turkington, Ferrier, Varley, Watson and Hinzen2019). Moreover, in a previous report by our group we found that patients with increased pausing specifically had difficulties with memory-related tasks (Oomen et al., Reference Oomen, de Boer, Brederoo, Voppel, Brand, Wijnen and Sommer2021). We found that several of the acoustic parameters in our models were associated with negative and to a lesser degree general PANSS scores, including loudness, formant bandwidth and amplitude. These findings suggest that patients with high negative PANSS scores have a more breathy voice and reduced voice quality. Moreover, voiced segments per s was negatively associated with positive PANSS scores, suggesting that patients with high positive PANSS have less fragmented speech. A thorough exploration of these associations is beyond the scope of the current study; further research is needed to explore how acoustic aspects are related to individual PANSS items.
Our results revealed no effect of antipsychotic medication dose on the acoustic parameters, which is in accordance with our previous work (de Boer et al., Reference de Boer, Voppel, Brederoo, Wijnen and Sommer2020). Furthermore, although we did not find an overall effect of drug type on the acoustic variables, we replicated our previous work in showing that the use of high DR2 drugs was associated with increased pausing (de Boer et al., Reference de Boer, Voppel, Brederoo, Wijnen and Sommer2020).
Research on spectral profile parameters suggests they are mostly associated with vocal valence, such as angry speech, as well as the control over emotions (Eyben et al., Reference Eyben, Scherer, Schuller, Sundberg, Andre, Busso and Truong2016; Goudbeek & Scherer, Reference Goudbeek and Scherer2010). Spectral slope and pitch variation have also been associated with emotional stress (Simantiraki, Giannakakis, Pampouchidou, & Tsiknakis, Reference Simantiraki, Giannakakis, Pampouchidou and Tsiknakis2016). Frequency related parameters have been associated with arousal, alertness and engagement in the conversation (Goudbeek & Scherer, Reference Goudbeek and Scherer2010). Applying these interpretations to our findings suggests that patients with positive v. negative symptoms show differences in terms of arousal, alertness and engagement (frequency related parameters), and to a lesser degree in vocal valence (e.g. angry/happy). However, it should be noted that there is little research on the interpretation of acoustic parameters in relation to psychopathology. Most research has been conducted in healthy individuals. Interpreting the different parameters in relation to schizophrenia-spectrum disorders or psychotic symptoms therefore remains speculative for now.
Of note, there is some circularity in this line of research. On the one hand PANSS scores are – for some part – based on a clinical interpretation of speech (e.g. pauses; Alpert, Pouget, & Silva, Reference Alpert, Pouget and Silva1995), on the other hand acoustic speech measurements are validated by their association with negative symptoms. Clinicians describe a persons' speech as being ‘flat’, ‘monotonous’ or ‘aprosodic’, which is then traced back to smaller F0 range and decreased variability in F2 (Compton et al., Reference Compton, Lunden, Cleary, Pauselli, Alolayan, Halpern and Covington2018). Speech analysis should be used routinely to understand the changes in speech during psychosis, and should be favored over less precise clinical descriptions of spoken language. Phonetic parameters have (relatively) clear underlying biological mechanisms (Eyben et al., Reference Eyben, Scherer, Schuller, Sundberg, Andre, Busso and Truong2016), while we do not know what a clinician hears when he describes a patient's speech as ‘monotonous’. For example, there is evidence that clinicians are influenced by speaking patterns in assessing the severity of all negative symptoms; when pause length is manipulated in spoken language, clinicians tend to rate other (non-speech) negative symptoms as more severe (Alpert et al., Reference Alpert, Pouget and Silva1995).
This line of research is still in the proof of concept phase and will face a number of challenges before it can be clinically implemented (Dukart, Weis, Genon, & Eickhoff, Reference Dukart, Weis, Genon and Eickhoff2021). For example, biases in machine learning (Schnack, Reference Schnack, Mechelli and Vieira2020), privacy issues, generalizability and technical pitfalls should be dealt with in the first place (Jacobson et al., Reference Jacobson, Bentley, Walton, Wang, Fortgang, Millner and Coppersmith2020; Starke, De Clercq, Borgwardt, & Elger, Reference Starke, De Clercq, Borgwardt and Elger2020). After these critical hurdles are taken, we envision acoustic analysis to have several valuable clinical implementations. For example, speech analyses could be used in primary care facilities as an initial screening instrument. Currently, diagnostic accuracy in primary care is low as the personnel is not specifically trained in psychiatric problems. Individuals with a high likelihood of a schizophrenia-spectrum disorder could for example be prioritized for a referral to a mental health care professional. Moreover, we expect that accuracy of models like these will significantly improve from the combination of different types of analyses, such as a combination of acoustic, semantic and syntactic analyses (de Boer et al., Reference de Boer, Brederoo, Voppel and Sommer2020). With sufficient levels of accuracy, possible clinical implementations could include relapse prediction, prediction of treatment efficacy, prognosis, early diagnosis and differential diagnosis.
Although the temporal stability of speech anomalies in schizophrenia-spectrum disorders has scarcely been studied (Cohen et al., Reference Cohen, Fedechko, Schwartz, Le, Foltz, Bernstein and Elvevåg2019), phonetic research has shown that some acoustic properties such as formant trajectories are highly stable over time (Hasan, Jamil, & Rahman, Reference Hasan, Jamil and Rahman2004; Ingram, Prandolini, & Ong, Reference Ingram, Prandolini and Ong2013; Nolan & Grigoras, Reference Nolan and Grigoras2005), making them suitable even for speaker identification. If acoustic speech properties are indeed highly personal and stable, deviations from such a stable personal pattern may very well be a cue for relapse into psychosis in an individual person. Further longitudinal studies are required to ascertain how speech disturbances in schizophrenia-spectrum disorders develop over time (Arevian et al., Reference Arevian, Bone, Malandrakis, Martinez, Wells, Miklowitz and Narayanan2020; Cohen et al., Reference Cohen, Fedechko, Schwartz, Le, Foltz, Bernstein and Elvevåg2019).
There were a number of limitations in the study. First, although conservative leave-ten-out cross-validation was employed, there remains a risk of overfitting. Replication in an independent dataset is therefore required. Secondly, although several variables that can influence the results have been explored, not all factors could be controlled for. For example, we had no data on smoking behavior, length or weight, which are known to influence some aspects of speech. Moreover, the recordings were made in different rooms, which may have affected some of the acoustic aspects of speech. Therefore, future research should assess the specificity of the classifier by controlling for these factors and by testing its performance in differential diagnosis of psychiatric disorders. Finally, it should be noted that we averaged the acoustic parameters over the interview duration (mean duration 10.8 min), which precludes dynamic differences over time. Further research should therefore also incorporate dynamic aspects of speech acoustics, such as convergence between speech partners, to fully acknowledge the within-subject variety of speech acoustics as well.
In conclusion, these findings show that standardized acoustic speech parameters can be used to accurately classify patients with a schizophrenia-spectrum disorder and healthy controls, as well as differentiate between patients with predominantly positive v. negatives symptoms. Our findings support the usefulness of computational tools to characterize complex human behavior such as speech. We strongly encourage the use of standardized open-source software packages such as eGeMAPS since they ensure easy comparison and replication across samples and even cross-linguistically.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291721002804.
Data
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available as they contain information that could compromise research participant privacy or consent.
Acknowledgements
We are grateful to all participants and research interns for their help with data collection and preparation.
Author contributions
Janna de Boer: conceptualization, Methodology, Formal analysis, Investigation, Writing – Original draft. Alban Voppel: Methodology, Software, Investigation. Sanne Brederoo: Writing – Review and Editing. Hugo Schnack: Writing - Review & Editing, Supervision. Khiet Truong: Writing - Review & Editing, Supervision, Frank Wijnen: Writing - Review & Editing, Supervision. Iris Sommer: Writing - Review & Editing, Supervision, Funding acquisition
Financial support
I. S. received a TOP grant from The Netherlands Organization for Health Research and Innovation (ZonMW, project: 91213009).
Conflict of interest
I. S. is a consultant to Gabather, received research support from Janssen Pharmaceuticals Inc. and Sunovion Pharmaceuticals Inc.