Introduction
Talkers spontaneously adjust their speech based on communicative goals and listener characteristics. One prominent example of such adaptation is infant-directed speech (IDS), in which adults adjust their speaking style along several prosodic and linguistic dimensions when addressing infants. IDS is characterised by higher and more variable pitch, expanded vowel space usage (Kuhl et al., 1997; Miyazawa et al., 2017), shorter sentences, and longer pauses (Fernald & Simon, 1984; Grieser & Kuhl, 1988). IDS is hypothesised to engage infants’ attention (Cooper & Aslin, 1990), communicate affect (Trainor et al., 2000), and facilitate language acquisition (Golinkoff et al., 2015), contributing to socio-emotional and cognitive development (Jaffe et al., 2001). Thus, IDS plays a crucial role in early parent–infant interactions, alongside other modalities such as facial expressions and touch (Beebe et al., 2016).
However, IDS is not a monolithic phenomenon in which the same exaggerated register and linguistic modifications are used universally in all speech addressed to infants (see Farran et al., 2016, for a discussion). Instead, the characteristics and amount of IDS also depend on the linguistic and cultural context (e.g., Hilton et al., 2022), the age and development of the recipient (e.g., Julien & Munson, 2012; Ko, 2012), and the dynamics of the interaction within the dyad (e.g., Smith & Trainor, 2008). Additionally, IDS can resemble adult-directed speech (ADS) in certain contexts, with the usage of the ADS register increasing as the infant grows (Farran et al., 2016).
Properties of IDS may also change because of adverse conditions affecting the dynamics and quality of dyadic interactions. Research on early parent–infant interactions, while not specifically centred on prosodic aspects of vocal exchanges, has indicated a number of infant- and/or parent-related factors that can influence the quality of the interaction. For instance, maternal postnatal depression or other adverse mental health symptoms (e.g., anxiety, stress, or birth-related trauma) are significant risk factors for early mother–infant bonding (Korja et al., 2008; Murray et al., 1996). Early interactions may also be altered in cases of infant congenital facial malformations, such as cleft lip (Murray et al., 2008), in dyads with infants at risk of neurodevelopmental disorders (NDDs) like autism spectrum disorder (ASD; Saint-Georges et al., 2011; Wan et al., 2019), and following preterm birth (Korja et al., 2012). Similarly, and pertinent to this study, early infant–caregiver dyadic exchanges can become compromised by the occurrence of severe perinatal adverse events, including extremely preterm birth, perinatal asphyxia, stroke, or other adverse conditions that put the infants at high risk for developing severe neurological disorders, in particular Cerebral Palsy (CP) (Festante et al., 2019). In their review, Festante et al. (2019) report that, overall, infants who experienced brain injury or other major neurological events tend to exhibit suboptimal interactive behaviours likely stemming from their clinical conditions. Caregivers, in turn, inevitably face an unexpected emotional burden and may experience increased mental health symptoms. Such symptomatology, along with awareness of their infant’s high risk for chronic neurological disorders, may also impact caregivers’ interactive behaviours during dyadic exchanges, including the way caregivers speak to their infants.
However, limited research exists on how prosodic properties of IDS might be affected by infants’ high risk for neurological disorders during the first 6 months of life, that is, during the prelinguistic period of development. Characterising early IDS for infants at high risk for neurological disorders is therefore important, especially in the first months of life, which correspond to the key period for leveraging neuroplasticity and promoting neurodevelopment. Identifying deviations in IDS can help guide early interventions, enhancing caregiver–infant bonds and benefiting both infants’ development and parental well-being (Morgan et al., 2021).
This study addresses the gap in knowledge by investigating the properties of IDS addressed to a group of prelinguistic 4.5-month-old infants at high risk for developing CP or severe NDDs. We audio-recorded and analysed spontaneous mother–infant interactions and compared them to control dyads with age-comparable typically developing (TD) infants. Because of the lack of specific predictions (as discussed below), we explored various acoustic dimensions of IDS, also analysing the degree of parental responsiveness and prosodic alignment to infant vocalisations. As a secondary analysis, we analysed our measurements in light of later neurodevelopmental outcomes of the high-risk infants and also controlled for whether our findings could be attributed to depression-, stress-, or anxiety-related symptoms of the mothers. The overall goal was to evaluate whether parents’ knowledge of their baby being at risk already imposes measurable changes in their IDS prosody, or whether any potential changes in IDS are related to the severity of the infant’s clinical outcome instead of the risk per se.
Prosodic properties of IDS in typical development
IDS is associated with prosodic and lexico-syntactic modifications of speech (see Soderstrom, 2007, for a review), and we focus on the former in this study. Speech prosody is characterised by intonation, intensity, and timing of speech units. Phonation style, such as whispered, breathy, pressed, or normal (a.k.a. modal) phonation, is also a property of speaking style that talkers vary independently of the literal meaning of the message (in most languages; see Gordon & Ladefoged, 2001). These prosodic and stylistic properties can be analysed through direct acoustic measures like fundamental frequency (F0) for intonation, or via acoustic correlates, like estimated syllable counts per time unit for speaking rate, utterance duration as a proxy for the number of linguistic units (Räsänen et al., 2019), or voicing ratio for the amount of whispered speech. The prototypical IDS prosody, compared with ADS, features higher and more variable pitch contours (Fernald & Simon, 1984; Hilton et al., 2022), shorter utterances, and longer pauses (see Soderstrom, 2007, for a review). In addition, the articulation of vowels tends to be exaggerated (Kuhl et al., 1997) or more variable (Miyazawa et al., 2017).
Reports have also linked IDS with an increased proportion of whispered or breathy phonation relative to ADS (Fernald & Simon, 1984; Garnica, 1977; Miyazawa et al., 2017; Sundberg & Lacerda, 1999), although phonation style is not systematically included in acoustic–phonetic analyses of IDS despite its importance in the communication of affect and intimacy in speech (e.g., Gobl & Ní Chasaide, 2003; Ishi et al., 2010; Laver, 1980).
The basic prosodic parameters of IDS undergo diverse changes as the infant develops. Over the first 2 years of infants’ life, IDS speaking rate systematically increases to align with ADS (Ko, 2012; Narayan & McDermott, 2016; Raneri et al., 2020), while IDS utterances become shorter from 3 to 9 months of age (Murray, 1990). Regarding intonation, Stern et al. (1982) found a larger pitch range for IDS toward 4-month-old infants compared with newborns or older infants, Bergeson et al. (2006) reported higher pitch in IDS to 3–18 month-olds compared with 10–37 month-olds, and Lui et al. (2009) reported higher mean pitch and pitch range for Mandarin mothers when they addressed prelinguistic infants compared with 5-year-old children. Kitamura et al. (2001) found an increasing mean pitch from birth up to 6 or 9 months for English and Thai mothers, respectively, with a rising trend in F0 range until 12 months in both languages (the study’s maximum infant age). Furthermore, the proportion of speech directed at the child in the ADS register increases with the child’s age (Farran et al., 2016), potentially affecting the observed statistics of the aforementioned prosodic features of IDS, insofar as the earlier studies defined IDS based on the addressee instead of the observed speech register.
The observed changes in IDS reflect the development of the dyadic interaction and language skills of the recipient, thereby also affecting the role of IDS at different stages of development. For instance, earlier research indicates a shift in parental communication from initially capturing attention and soothing to dynamic dyadic interaction, recognising infants as active social partners. Henning et al. (2005) suggested this change takes place from 1 to 3 months of age, whereas Fernald (1992) suggested that the development proceeds from initially attentional-affective to more content-driven (linguistic) communication toward the end of the first year of life. Concretely, Henning et al. (2005) observed that increased positive vocalisations in infants correlated with fewer, shorter maternal utterances, and more one-word responses from mothers, signalling adaptation to the infants’ growing social engagement. Additionally, perceived increases in infant engagement in verbal communication reinforce adult use of IDS-like high-pitched speech (Smith & Trainor, 2008), whereas a lack of infant responsiveness decreases the amount of maternal speech that is prototypical of the IDS register (Braarud & Stormark, 2008). Caregivers are also known to adjust temporal properties of their speech to the perceived linguistic proficiency of their children (e.g., Julien & Munson, 2012), and caregivers and their 12- to 30-month-old children entrain to each other’s pitch patterns (Ko et al., 2016).
IDS in clinical infant populations
Atypical IDS or vocal interaction patterns can occur in various clinical conditions. One such population involves autistic children (e.g., Warlaumont et al., 2014) or infants at risk for ASD (Quigley & McNally, 2013; Quigley et al., 2016; Seidl et al., 2018). Notably, in their extensive review, Woolard et al. (2022) identified only three studies investigating prosodic (acoustic) features of IDS in infants at risk of ASD or autistic children (Brisson et al., 2014; Quigley et al., 2016; Xu et al., 2012) and two studies looking at utterance lengths in the same populations (Brisson et al., 2014; Choi et al., 2020), where certain differences were observed for IDS addressed to infants at risk or autistic children compared with controls. Brisson et al. (2014) studied two acoustic features of caregiver speech: mean duration and pitch of IDS utterances directed at 0- to 6-month-old infants later diagnosed with ASD. They did not find differences in mean pitch, but maternal utterances addressed to autistic children (during infancy) were shorter in duration than those addressed to controls. This aligns with Choi et al. (2020), who found that maternal utterances addressed to high-risk infants have fewer morphemes than those spoken to low-risk infants. Quigley et al. (2016) studied maternal pitch mean, range, and variance and utterance intensity with 12- and 18-month-old infants at low and high risk for ASD. They observed an increase in mean pitch for the high-risk group from 12 to 18 months and a decrease for the low-risk group, yet no significant difference in the mean pitch between the groups. Moreover, Quigley et al. (2016) did not observe any statistically significant differences in pitch range, variance, or intensity between the two groups. However, they identified strong correlations between maternal and infant pitch range (Spearman rho = 0.77) and intensity (rho = 0.75) in the 12-month-old low-risk group, with no correlations in 18-month-olds. For the dyads in the 18-month-old high-risk group, mean pitch correlated (r = 0.61), while other measures were mostly weakly and inversely correlated (Quigley et al., 2016). Xu et al. (2012) compared consonant and vowel durations, intensities, spectral entropies, and vowel F0 in IDS to children with and without ASD diagnosis, finding higher duration, loudness, and F0 of vowels in the autistic group. In addition to these studies, Seidl et al. (2018) explored prosodic alignment of pitch range in dyads with 12- to 24-month-old infants at low and high risk for ASD and their mothers, finding no effects of alignment in general nor differences in alignment between the groups. Recently, Woolard et al. (2023) studied the relationship between early signs of autism and maternal pitch contours in 12-month-old infants, finding fewer rise–fall–rise or fall–rise–fall intonations for mothers who rated their infants as displaying more signs of autism, and less use of flat intonation toward infants rated with more signs of autism by experts. However, Woolard et al. did not find correlations between other prevalent intonation contour types and infant status.
In addition to these prosodic studies, a handful of studies have examined non-prosodic interactional aspects of dyads involving prelinguistic infants at a high risk of NDDs. Quigley and McNally (2013) compared interaction patterns of dyads with infants at low or high risk for ASD (age range: 3–7 months), finding less contingent parental responses but more attention bids from parents of high-risk infants. Saint-Georges et al. (2011) examined dyadic interactions of infants under 6 months, between 6 and 12 months, and older than 12 months, later diagnosed with ASD or other intellectual disabilities. They observed no differences in parental responsiveness compared with parents of healthy controls. Reissland and Stephenson (1999) studied interactions of five very preterm infants and their mothers two months post-discharge, finding increased maternal responsiveness to preterm infants’ vocalisations compared with mothers of full-term infants. Recently, Provera et al. (2023) investigated syntactic and lexical aspects of maternal IDS toward 3-month-old preterm infants but did not find any significant effects of birth status and/or maternal postnatal depression on the studied aspects of IDS or maternal speech verbosity. This aligns with an earlier report by Salerni et al. (2007), who did not find any lexical or morphosyntactic differences in IDS addressed to full-term versus preterm infants at 6 months of age. When looking at functional properties of IDS, Provera et al. (2023) found that extremely low birth weight preterm infants (<1000 g) received a lower proportion of affect-salient speech and a higher proportion of information-salient speech as compared with low birth weight infants (<1500 g). It is crucial to note that these investigations, however, did not take into account the severity of prematurity, which might have a fundamental role in influencing the characteristics of maternal IDS. Moreover, none of these studies explored the prosodic features of IDS, focusing exclusively on the interactional or lexico-syntactic aspects of maternal input.
Regarding acoustic analysis of IDS in the context of the risk for CP or other severe neurological conditions, we are not aware of any prior published research. The paucity of studies on IDS to young infants at risk for neurological impairments thus leaves open whether IDS to such infants shares the typical acoustic properties of IDS directed at TD infants, and whether there are aspects of verbal communication that are early indicators of potential neurological deficits in infants. The present study aims to address this gap by studying IDS addressed to prelinguistic infants at high risk for neurological disorders.
How IDS may be affected by the infant’s neurological condition
Several mechanisms might account for potential differences in IDS to infants at high risk for neurological impairments compared with IDS heard by TD infants.
First, explicit parental awareness of infant risk may alter parents’ communicative behaviours. While severe brain injury and the high risk of CP can be detected early and accurately through reliable and validated diagnostic and prognostic clinical tests (Novak et al., 2017), a conclusive diagnosis of CP is usually made between 12 and 24 months, after specific clinical signs emerge, especially in infants with less severe impairments (Boychuck et al., 2020). This indicates that while parents recognise their infants’ clinical risk and may observe early atypical behaviours, they generally remain unaware of the long-term clinical outcomes in the early stages of development, specifically at the time the interaction was assessed in the current study. Consequently, the uncertainty regarding the infant’s condition might already prompt atypical parental interactive behaviours, regardless of the infant’s actual current or future health status.
For instance, parents may treat infants as more fragile or sensitive than usual. The parents themselves may also experience an atypical spectrum of emotions or mental states when interacting with their infants. Specifically, the high prevalence of maternal depression, anxiety, or post-traumatic stress among mothers of high-risk infants (e.g., Davis et al., 2003; Korja et al., 2008) may impact their responsiveness and communication of affect toward their infants. This could indirectly result from the potentially traumatic experiences linked to the infant’s perinatal clinical history, including extended NICU hospitalisations, current clinical conditions, and uncertainty about future clinical outcomes. Earlier research has revealed that depressed mothers, when speaking in IDS, speak less, tend to use less modulated intonation contours (i.e., more monotonic intonation), and have slower and less consistent responses to infant behaviours (Lam-Cassettari & Kohlhoff, 2020, and references therein; Zlochower & Cohn, 1996).
If mothers in our high-risk group experience more depression, anxiety, or stress symptoms than mothers of TD infants, we could then expect them to adopt a more detached and monotonic speaking style compared with that of the controls.
Alternatively, potential differences from maternal IDS directed to TD infants could be driven by delayed or impaired socio-emotional, cognitive, and motor development of infants. These may result in increased passiveness and lower responsiveness to maternal initiations, as previously reported by Festante et al. (2019). Combined with the notion that IDS shifts from primarily affective and directive to linguistic and interactive as the infant matures (see above), these potential developmental delays may result in diverging interactional styles already at a prelinguistic stage. If the present 4.5-month-old at-risk babies are lagging behind their TD peers in terms of social engagement (e.g., because of their motor, visual, or sensory impairments), we may then expect IDS to exhibit more affect-conveying and attention-capturing properties compared with controls. This would involve, for example, higher variability of pitch and stress patterns, lower prosodic alignment of the dyads, and more breathy or whispered phonation to support communication of warmth and intimacy (Ishi et al., 2010; Laver, 1980; Miyazawa et al., 2017). In contrast, we could expect greater maternal coordination with infant vocalisations (e.g., higher contingency of maternal responses, stronger dyadic alignment of vocal parameters) from dyads with TD infants compared with the risk group (e.g., Warlaumont et al., 2014; but see also Seidl et al., 2018). Based on Murray (1990), Raneri et al. (2020), and Henning et al. (2005), we would also expect shorter utterances toward TD infants if the at-risk babies are lagging behind in development.
Overall, the potential consequences of parental awareness of the infant’s neurological risk, potential emotional and mental challenges of the caregivers, and the potential developmental differences between the high-risk and TD infants result in somewhat conflicting predictions for our study. There are potentially several mechanisms at play, and the earlier research is very sparse in terms of aspects of IDS studied in a comparable risk population. Given this starting point, the present study was primarily framed as an exploratory one with a comparison of various acoustic and a few interactional aspects of IDS between the study groups.
Materials and methods
Participants
Fourteen infants at high risk for neurological disorders (Risk Group, RG) and 14 control infants at very low risk (Control Group, CG) participated in this observational study with their mothers at the Stella Maris Infant Lab for Early intervention (SMILE) of the IRCCS Stella Maris Foundation in Pisa, for a total sample of 28 mother–infant dyads.
Dyads within the RG were recruited at the Infant Neurology Section of the IRCCS Stella Maris Foundation in Pisa, where infants were admitted as inpatients or outpatients. Control dyads (CG) were recruited from the local postnatal ward and at the SMILE Lab through online advertisements and flyers. Inclusion criteria for RG infants were: the occurrence of perinatal neurological adverse events (including preterm birth, perinatal hypoxic–ischemic encephalopathy, and perinatal brain vascular events) and/or abnormal perinatal neuroimaging (cranial ultrasound and/or neonatal MRI) and/or atypical movement patterns at the general movements assessment (GMA), a validated assessment tool to identify neurological issues that may lead to CP and other severe developmental disabilities (see, e.g., Einspieler & Prechtl, 2005). Inclusion criteria for CG infants were: uncomplicated delivery, absence of perinatal complications, and typical movement patterns at the GMA.
All mothers were over the age of 18 and fluent Italian speakers. Specifically, 25 mothers were monolingual Italian speakers (N = 13 CG, N = 12 RG), while 3 mothers were bilingual or multilingual (foreign language(s) and Italian; N = 1 CG, N = 2 RG). Mother and infant sample characteristics are reported in Table 1. Demographic characteristics of the two groups reported in the table were compared using t-tests for independent samples and Pearson’s chi-square tests, as appropriate, with no significant differences between RG and CG. An additional three dyads (N = 1, CG; N = 2, RG) were recruited and took part in the study but were subsequently excluded from analyses because the mothers, despite being fluent Italian speakers, spoke to the infant in a foreign language during most of the experimental procedures. To keep the mothers’ interactions with their infants as natural as possible, no restrictions were imposed on mothers in terms of language use.
Notes (Table 1): * Corrected age (CA) for preterm infants.
The study was approved by the Tuscan Pediatric Ethics Committee (200/2019) and conducted in accordance with the ethical principles of the Declaration of Helsinki. Families gave written informed consent before participating in the study.
Mother–infant interaction video-recording
Mothers and infants were invited to the SMILE Lab (RG: N = 14, CG: N = 6) or, alternatively, were visited at their home (CG: N = 8) by a researcher when infants were 4.5 months old (Range RG: 16.0–22.4 weeks, M = 18.5, SD = 1.5; Range CG: 16.3–19.5 weeks, M = 18.4, SD = 0.97), using corrected age (CA) in the case of preterm birth. All experimental procedures and the experimental setting were identical at both locations and common to all dyads. At this age, the vocal modality is already of great importance for communication during mother–infant interactions, and patterns of bidirectional turn-taking and contingent vocal coordination between mothers and infants can already be observed (Jaffe et al., 2001).
During the visit, approximately 5 minutes (M = 331.9 s, SD = 83.8 s) of spontaneous mother–infant face-to-face interaction were videotaped in a quiet room while the infant was calm and alert. A researcher monitored filming, staying out of the mother’s and the infant’s sight and not communicating with either of them during filming unless necessary. Details of video-recording procedures are reported in Appendix A.1.
Audio coding
Interaction videos (MP4) and the corresponding audio tracks (waveforms, pitch, and intensity contours) were coded using ELAN (https://archive.mpi.nl/tla/elan), version 5.4. Two trained coders, who worked independently of each other, did not participate in the original recordings, and were blind to the purpose of the study, annotated the following vocal behaviours on each audio–video recording: (i) maternal utterances or other non-lexical vocal patterns (e.g., sounds, sighs, singing, or humming) directed to the infant, (ii) infant vocalisations (but not vegetative sounds such as coughs or sneezes), (iii) maternal utterances directed to another adult (i.e., the experimenter), and (iv) utterances of other adults (e.g., other family members occasionally entering the room or the experimenter) directed toward the infant and/or the mother. In addition, (v) any environmental noise occurring during the interaction was annotated for each audio–video recording. All vocal behaviours and noises were assigned to the above categories with onset and offset timestamps. Overlap of infant and adult vocalisations was possible, as speakers were annotated independently of each other.
Inter-rater agreement was calculated for 28% of all videos (8 out of 28), and substantial agreement was achieved for each behaviour scored, with Cohen’s κ = 0.86 for IDS utterances, κ = 0.74 for infant vocalisations, κ = 0.85 for mother-to-adult utterances, κ = 0.77 for other adult utterances, and κ = 0.77 for environmental noises.
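To make the agreement statistic concrete, the following is a minimal sketch of how Cohen’s κ can be computed for two coders. It assumes that each recording has been discretised into equal-length time frames with one categorical label per coder; this frame-based unit of comparison, the function name `cohen_kappa`, and the toy labels are assumptions made for illustration and do not describe the procedure actually used in the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of units on which both coders agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of the two coders'
    # marginal proportions for that label.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical frame labels: 1 = IDS utterance present, 0 = absent.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the coders agree on 9 of 10 frames (observed agreement 0.9) and chance agreement is 0.5, giving κ = 0.8. Because κ corrects for chance agreement, it is lower than raw percentage agreement whenever one category dominates the recording.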
Annotations related to the maternal vocal patterns directed to the infant and infant vocalisations were used for subsequent analyses.
Infant neurodevelopmental outcome
Subgroup analyses were performed according to the clinical diagnosis and neurodevelopmental outcome of RG infants, which are reported in Table 2. Clinical diagnosis was defined within 18 months CA. Neurodevelopmental outcome was established using clinical referrals and according to predefined clinical outcome categories (see also Iyer et al., 2014) as typical, mildly abnormal (including mild motor, language, or cognitive developmental delay, and CP with walking ability and without intellectual disability), or severely abnormal (including severe spastic and dystonic CP with or without intellectual disability, severe sensory deficits, and ASD).
Maternal mental health
Mental health issues, such as depression, have been previously reported to affect properties of IDS (e.g., Lam-Cassettari & Kohlhoff, 2020, and references therein), and it is well established that mothers of infants at risk for neurological disorders, such as those in our risk group, are themselves at risk for increased symptoms of negative emotional states (e.g., Davis et al., 2003; Festante et al., 2019; Korja et al., 2008).
In the current study, we assessed aspects of maternal mental health using the Depression, Anxiety, and Stress Scale (DASS-21) (Lovibond & Lovibond, Reference Lovibond and Lovibond1995). It is a 21-item self-report questionnaire, structured into three sub-scales, designed to measure symptoms of depression (D), anxiety (A), and stress (S) experienced over the past week, in which higher scores indicate greater symptomatology (see also Appendix A.2).
All mothers participating in the study were asked to complete the questionnaire at the end of the visit. All but two mothers within the RG returned the completed questionnaire.
Between-group comparisons were performed by means of the Wilcoxon–Mann–Whitney test for independent samples. Detailed results are reported in Table 3. Results indicated that mothers in the RG reported higher scores (M = 7.50, SD = 7) than mothers in the CG (M = 1.57, SD = 2.9) in the Depression subscale (p = 0.015). No statistically significant differences between groups were instead found for the Anxiety and Stress subscales or the DASS-21 Total score.
Notes: † Moderate symptoms, ‡ Severe symptoms, * Statistical significance.
Based on the between-group differences observed in the DASS-21 depression subscale, the depression scores were included in control analyses for acoustic feature comparisons to see if the acoustic findings correlated with depressive symptoms among the participating mothers (see Results section).
Studied speech features
We chose a total of 13 features to characterise the IDS of the caregivers, as listed in Table 4. Given the small sample size, non-ideal audio recording conditions, and across-talker comparisons, we focused on features that are robust to speaker characteristics and background noise levels.
The list includes features to quantify richness of intonational expression in terms of variance and skewness of fundamental frequency (F0) of speech, as calculated from log-F0, and in terms of intonation contours employed by the speakers, as captured by relative frequencies of four prototypical intonation contours of utterances (see below for details). We also measure phonation style (e.g., modal, tense, soft, whisper) with variance of spectral tilt during voiced speech, and by calculating voicing ratio, a measure of the relative duration of voiced sounds (sounds that are excited by the periodic fluctuations of the vocal folds) compared with unvoiced sounds (e.g., consonants) that also indirectly reflects the strength of voicing. As temporal measures and as proxies for linguistic complexity, we use logarithmic utterance length and speaking rate (syllables per second). Finally, interactional aspects are captured by the total amount of verbal activity per time unit (‘vocalisation rate’), and by maternal responsiveness in terms of the probability that the mother responds verbally to an infant’s vocalisation within 1 second from the offset of infant vocalisation, and in terms of the average delay for the mother to respond independently of the 1-s threshold. See Appendix B for a longer description and motivation for the features.
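The two responsiveness measures can be illustrated with a short sketch operating on annotated onset and offset times. This is an illustration under stated assumptions rather than the implementation used in the study: the function name and data layout are hypothetical, and maternal utterances that begin before the infant vocalisation has ended are simply ignored here.

```python
def maternal_responsiveness(infant_vocs, mother_utts, window_s=1.0):
    """Return (response probability, mean response delay in seconds).

    infant_vocs, mother_utts: lists of (onset, offset) tuples in seconds.
    """
    if not infant_vocs:
        return float("nan"), float("nan")
    mother_onsets = sorted(onset for onset, _ in mother_utts)
    delays = []
    for _, infant_offset in infant_vocs:
        # First maternal utterance starting at or after the infant offset
        # (assumption: overlapping maternal speech does not count).
        following = [t for t in mother_onsets if t >= infant_offset]
        if following:
            delays.append(following[0] - infant_offset)
    # Probability of a response within the window, over all infant vocalisations.
    p_response = sum(d <= window_s for d in delays) / len(infant_vocs)
    # Mean delay, independent of the window threshold.
    mean_delay = sum(delays) / len(delays) if delays else float("nan")
    return p_response, mean_delay

# Hypothetical annotations (seconds).
infant = [(0.0, 1.0), (5.0, 6.0), (10.0, 11.0)]
mother = [(1.5, 3.0), (8.0, 9.0), (11.2, 12.0)]
p_resp, delay = maternal_responsiveness(infant, mother)
```

With these toy annotations, two of the three infant vocalisations receive a response within 1 s, and the mean delay across all three responses is about 0.9 s.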
As pre-processing steps, we first reduced the impact of any stationary noise in the original audio recordings by applying spectral subtraction with a minimum statistics estimator (Martin, Reference Martin2001), using the VOICEBOX toolbox (http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html). The recordings were then split into utterance-level clips based on the manual annotations, and feature extraction was applied to each utterance separately.
F0 contours were estimated with the YAAPT algorithm (v4.0; Zahorian & Hu, Reference Zahorian and Hu2008), and spectral tilt was represented by the first real cepstrum coefficient, both with a 10-ms temporal resolution. The number of voiced frames (i.e., time-instances with ongoing vocal fold vibration, such as vowels and voiced consonants) was calculated from the output of YAAPT voicing detection, which labels every 10-ms signal frame as voiced (with an associated F0 estimate) or unvoiced (no F0 estimate). We report the voicing ratio as the proportion of voiced frames to all speech frames (as annotated by the sound type scoring). Syllable counts were extracted using the thetaSeg algorithm by Räsänen et al. (Reference Räsänen, Doyle and Frank2018). Utterance durations, response delays, and overall vocalisation rates were extracted from the manual annotations.
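As a minimal illustration of the voicing ratio, the sketch below counts voiced frames in a frame-level F0 track. It assumes that unvoiced frames are marked with 0.0 (or None); the actual output format of the pitch tracker may differ, and the track shown is hypothetical.

```python
def voicing_ratio(f0_track):
    """Proportion of voiced frames in an utterance-level F0 track.

    f0_track: one value per 10-ms frame; 0.0 or None marks an unvoiced frame
    (an assumed convention for this sketch).
    """
    if not f0_track:
        raise ValueError("empty F0 track")
    voiced = sum(1 for f0 in f0_track if f0)
    return voiced / len(f0_track)

# Hypothetical 8-frame utterance: five voiced frames, three unvoiced.
example_track = [0.0, 210.0, 215.5, 220.1, 0.0, 0.0, 198.3, 201.7]
ratio = voicing_ratio(example_track)  # 5 / 8
```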
Intonation contour features corresponded to the relative frequencies of four prototypical intonation contour types (Figure 1) utilised by each speaker. The prototypes were obtained in a data-driven manner by first linearly interpolating each utterance-level log-F0 contour across unvoiced segments of speech, resampling the contour to a fixed number (N = 20) of relative positions independently of the original duration, and applying zero-mean and unit-variance normalisation to each contour. The resulting contours of all adult speech in the dataset were then clustered with the k-means algorithm into four clusters, resulting in monotonic, rising, falling, and rising–falling prototypes (Figure 1). The number of clusters was determined empirically by finding the number for which k-means systematically produced the same contour shapes despite different (random) cluster initialisations. Finally, the normalised F0 contour of each individual utterance was assigned to the nearest cluster. Empirical proportions (relative frequencies) of these cluster assignments were then used as speaker-level features in the acoustic analyses. As a result, these proportions indicate how much each speaker employed monotonic, rising, falling, or rising–falling intonation, reflecting the variety of employed intonation styles.
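The contour normalisation and nearest-prototype assignment can be sketched as follows. This is a simplified illustration, not the study's code: unvoiced frames are assumed to be marked with None, contour edges outside the voiced region are held flat, the population standard deviation is used, and the two prototypes are hypothetical stand-ins for the four k-means centroids.

```python
import math

def normalise_contour(log_f0, n_points=20):
    """Interpolate over unvoiced frames, resample to n_points, z-normalise.

    log_f0: per-frame log-F0 values, with None marking unvoiced frames.
    """
    voiced = [(i, v) for i, v in enumerate(log_f0) if v is not None]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")

    def value_at(pos):
        # Linear interpolation between voiced frames; flat beyond the ends.
        if pos <= voiced[0][0]:
            return voiced[0][1]
        if pos >= voiced[-1][0]:
            return voiced[-1][1]
        for (i0, v0), (i1, v1) in zip(voiced, voiced[1:]):
            if i0 <= pos <= i1:
                return v0 + (v1 - v0) * (pos - i0) / (i1 - i0)

    last = len(log_f0) - 1
    resampled = [value_at(last * k / (n_points - 1)) for k in range(n_points)]
    mean = sum(resampled) / n_points
    std = math.sqrt(sum((v - mean) ** 2 for v in resampled) / n_points)
    if std == 0.0:
        return [0.0] * n_points  # perfectly flat contour
    return [(v - mean) / std for v in resampled]

def nearest_prototype(contour, prototypes):
    """Name of the prototype with the smallest squared Euclidean distance."""
    return min(prototypes,
               key=lambda name: sum((c - p) ** 2
                                    for c, p in zip(contour, prototypes[name])))

# Hypothetical prototypes: a normalised rising ramp and its mirror image.
rising = normalise_contour([float(k) for k in range(20)])
prototypes = {"rising": rising, "falling": [-v for v in rising]}
# Hypothetical utterance in Hz, with two unvoiced frames in the middle.
utterance = [math.log(f) if f else None
             for f in [180, 190, None, None, 220, 240, 260]]
contour = normalise_contour(utterance)
label = nearest_prototype(contour, prototypes)
```

Because every contour is reduced to the same length and scale, utterances of different durations and pitch ranges become directly comparable before clustering or assignment.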
In the analyses, each feature value was first calculated at the utterance-level. For frame-level features (F0, tilt), statistical moments (variance, skewness) were extracted per utterance. Then the features were averaged across all utterances from the given speaker. Speaker-level mean features were then subjected to statistical analyses across the groups. To study alignment between infants and caregivers, comparable acoustic measures were also calculated for each infant. This was possible since all the used measures were based on acoustic or annotated properties of speech. For instance, the sonority-based syllabifier can track open–close cycles of infant vocalisations even if they do not reflect adult-like well-formed syllabic structure. Similarly, vocal fold vibration frequency (F0), the corresponding temporal contours, and the proportion of vocalisation time with vibration can be extracted with YAAPT irrespective of how proficient language users employ intonation.
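The two-step aggregation (statistical moments per utterance, then averaging per speaker) can be sketched as below for log-F0. The function names are hypothetical, unvoiced frames are assumed to be marked with 0, and population (rather than sample) moments are used, which may differ from the original implementation.

```python
import math

def variance_and_skewness(values):
    """Population variance and skewness of a list of frame-level values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return m2, skew

def speaker_level_features(utterances_f0_hz):
    """Average utterance-level log-F0 variance and skewness for one speaker."""
    per_utt = []
    for f0_hz in utterances_f0_hz:
        log_f0 = [math.log(f) for f in f0_hz if f > 0]  # voiced frames only
        if len(log_f0) >= 2:
            per_utt.append(variance_and_skewness(log_f0))
    if not per_utt:
        raise ValueError("no utterance with at least two voiced frames")
    mean_var = sum(v for v, _ in per_utt) / len(per_utt)
    mean_skew = sum(s for _, s in per_utt) / len(per_utt)
    return mean_var, mean_skew

# Hypothetical speaker with two utterances (F0 in Hz, 0 = unvoiced frame).
speaker = [[200, 220, 0, 240], [180, 0, 0, 180, 360]]
mean_variance, mean_skewness = speaker_level_features(speaker)
```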
Although generally informative of speaking style, commonly used acoustic measures such as mean F0, mean spectral tilt, spectral centroid, or harmonic-to-noise ratio (HNR) were not measured because of their susceptibility to idiosyncratic speaker differences (for F0) or uncontrolled background noise levels and types present in the recording environments (for tilt, centroid, and HNR), as they can be more easily biased in small-sample analyses.
All acoustic measures and primary acoustic analyses (below) were defined and implemented before gaining access to the full dataset or any of the associated infant metadata.
Primary acoustic analyses
Our primary analysis consisted of a comparison of the 13 acoustic features between RG and CG dyads. Given our limited sample size and an array of potential variables of interest, the overall aim was to detect only the most substantial differences between the groups by employing largely conservative statistical analyses of our data. Across-group differences were tested separately for each feature using an unpaired two-tailed t-test with α < 0.05 and using Bonferroni–Holm correction for 13 comparisons. While this is a highly stringent criterion for significance with our relatively small sample size, we wanted to ensure the robustness of any potential findings from such a broad exploratory analysis. Any significant findings were then subjected to more detailed pairwise comparisons (unpaired t-test) based on the infants’ long-term clinical outcomes: [control, CG] versus [at-risk w. typical/mildly abnormal outcome, RG-M], [CG] versus [at-risk w. severely abnormal outcome, RG-S], and [RG-M] versus [RG-S], to understand whether observed differences are driven by parents’ vocal behaviours, likely related to the awareness of their infants being at risk, or driven by the severity of the infants’ clinical condition.
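The Bonferroni–Holm step-down procedure can be sketched in a few lines. The p-values below are hypothetical and serve only to show how the per-comparison threshold relaxes from α/13 for the smallest p-value towards α for the largest.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with
        # alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return rejected

# Hypothetical p-values for 13 feature comparisons.
p_values = [0.03, 0.0004] + [0.5] * 11
decisions = holm_bonferroni(p_values)
```

Here only the second comparison survives: 0.0004 is below 0.05/13, whereas 0.03 exceeds the next threshold of 0.05/12, so the procedure stops.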
Dyadic correlations and alignment in turn-taking
As secondary analyses, we were also interested in potential dependencies between adult speech and infant vocalisations at recording and interactional exchange levels. If mothers adjust their speaking style based on infant vocal responses, if infants adjust their vocalisation prosody to IDS, or if the alignment is reciprocal (e.g., Quigley et al., Reference Quigley, McNally and Lawson2016, and references therein), we would expect to see a statistical coupling between acoustic characteristics of IDS and infant vocalisations. At the recording level, we measured speaker-level mean features also for the infants using an identical practice to adult feature extraction (see Section: Studied Speech Features), and then measured the linear correlation between infant and adult feature values to see if the recording-level alignment was observable in the dyads. This was done separately for the two groups (RG vs. CG) to see if the groups differed in terms of their level of alignment, and whether alignment was present at all. For example, the analysis would reveal if infants with above-average variability in F0 would also have parents with above-average F0 variance and vice versa.
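The recording-level analysis amounts to a Pearson correlation across dyads between the infant's and the mother's speaker-level mean of a given feature. A minimal sketch with hypothetical voicing-ratio values follows; it computes the correlation coefficient only, not the associated p-value.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical speaker-level voicing ratios, one pair per dyad.
infant_voicing = [0.40, 0.55, 0.62, 0.48, 0.70]
mother_voicing = [0.52, 0.60, 0.66, 0.58, 0.69]
r = pearson_r(infant_voicing, mother_voicing)
```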
A more detailed analysis of alignment at the level of individual interactional exchanges (IE) was conducted by comparing infant and adult feature values from their subsequent utterances. First, we identified changes in speaker turn from infant vocalisation to mother’s response by requiring that the mother responded within 1-s from infant vocalisation offset. For each IE q and feature f, the difference between the feature value of the child (CHI) and mother (MOT) was calculated as
$$ d_{f,q} = \left| x_{f,q}^{\mathrm{CHI}} - x_{f,q}^{\mathrm{MOT}} \right| . $$
The corresponding expectation of the difference was also defined as
$$ E\left[ d_{f,q} \right] = \frac{1}{N-1} \sum_{j=1,\, j\ne q}^{N} \left| x_{f,q}^{\mathrm{CHI}} - x_{f,j}^{\mathrm{MOT}} \right| , $$
where N is the number of all CHI→MOT turn changes for the given dyad, that is, by measuring the average difference between the current child vocalisation and all other responses of the same mother that did not follow the current infant vocalisation. Finally, a so-called alignment score for the IE was measured as the difference between the expected and actual feature value:
$$ a_{f,q} = E\left[ d_{f,q} \right] - d_{f,q} . $$
For the intonation contour prototypes (cf. Figure 1), the process was somewhat different. Subsequent matching contours from CHI and MOT were scored as 1 and mismatching ones as 0 for each q, followed by subtraction of the prior probability that the mother would use the given contour type independently of the leading infant vocalisation (based on relative empirical frequencies of maternal intonation patterns for the given dyad).
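The contour-based alignment can be sketched as follows. The text leaves some room for interpretation, so this sketch makes one explicit assumption: the prior subtracted for each exchange is the empirical probability that the mother uses the contour type just produced by the infant, which is the chance level of a match. Function and variable names are hypothetical.

```python
from collections import Counter

def contour_alignment(ie_pairs, mother_contours):
    """Mean of (match indicator - chance probability of a match) over IEs.

    ie_pairs: list of (infant_contour, mother_contour) labels, one per IE.
    mother_contours: contour labels of all maternal utterances in the dyad.
    """
    if not ie_pairs or not mother_contours:
        raise ValueError("need at least one IE and one maternal utterance")
    counts = Counter(mother_contours)
    total = len(mother_contours)
    scores = []
    for infant_type, mother_type in ie_pairs:
        match = 1.0 if infant_type == mother_type else 0.0
        # Assumed prior: how often the mother uses the infant's contour type.
        prior = counts[infant_type] / total
        scores.append(match - prior)
    return sum(scores) / len(scores)

# Hypothetical dyad: 10 maternal utterances and 4 interactional exchanges.
mother_all = ["rising"] * 5 + ["falling"] * 3 + ["monotonic"] * 2
exchanges = [("rising", "rising"), ("falling", "rising"),
             ("monotonic", "monotonic"), ("rising", "falling")]
contour_score = contour_alignment(exchanges, mother_all)
```

In this toy dyad the per-exchange scores are 0.5, -0.3, 0.8, and -0.5, giving a mean of 0.125, that is, slightly more matching than expected by chance.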
After obtaining the IE-level alignment scores, they were averaged across all IEs of the dyad into an overall alignment score $ {A}_f\in \left[-\infty, \infty \right] $ for the given feature and dyad and subjected to statistical analyses. If $ {A}_f $ is zero, the subsequent infant and mother vocalisations are not any more similar to each other than would be expected by random sampling of a response from the mother’s vocal repertoire. If $ {A}_f $ is positive, that is, the expected feature difference is larger than the actual difference, then the mothers are accommodating their speaking style to the preceding infant vocalisations. If $ {A}_f $ is negative, the mothers are responding to their infants with more distinct speech than what would be expected by chance.
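To make the sign convention of the alignment score concrete, the following sketch computes a dyad-level score for one scalar feature. It assumes that the feature difference is an absolute difference and that the expected difference is the mean over all other maternal responses of the same dyad (a leave-one-out mean); both are assumptions for illustration, and the feature values are hypothetical.

```python
def alignment_score(child_values, mother_values):
    """Dyad-level alignment for one feature (expected minus actual difference).

    child_values[q] and mother_values[q] are the feature values of the infant
    vocalisation and the maternal response in interactional exchange q.
    """
    n = len(child_values)
    if n != len(mother_values) or n < 2:
        raise ValueError("need at least two paired exchanges")
    scores = []
    for q in range(n):
        actual = abs(child_values[q] - mother_values[q])
        # Assumed expectation: mean difference to the mother's other responses.
        expected = sum(abs(child_values[q] - mother_values[j])
                       for j in range(n) if j != q) / (n - 1)
        scores.append(expected - actual)
    return sum(scores) / n

# Hypothetical feature values: the mother mirrors the infant exactly ...
aligned = alignment_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# ... or responds with values far from the preceding infant vocalisation.
misaligned = alignment_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The mirrored case yields a positive score (accommodation) and the reversed case a negative one (responses more distinct than chance), matching the interpretation given above.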
In statistical analyses, we test if the alignment scores of the groups differ from zero (two-tailed unpaired t-test), followed by across-group comparisons with unpaired t-tests similar to the primary acoustic analyses. Note that delay and probability of response, overall vocalisation rate, and the individual F0 contour type probabilities were not applicable to alignment analysis. A significance criterion of α < 0.05 with Bonferroni–Holm correction was applied, with 13 comparisons for the correlational analysis and 7 comparisons for the turn-taking analysis.
Results
Primary acoustic analyses
We first ensured that the acoustic measures of mothers’ speech were within a typical range for IDS. The raw F0 values had a mean of 221.5 Hz and SD of 67.4 Hz, which is typical of female speech (Fitch & Holbrook, Reference Fitch and Holbrook1970). The average speaking rate was slightly above five syllables per second, which also aligns with earlier cross-linguistic measurements of speaking rate (e.g., Räsänen et al., Reference Räsänen, Doyle and Frank2018). Since all our IDS features are derivatives of F0 tracking, syllabification, or manual annotation, we thereby consider the feature extraction process reliable for the IDS on the present dataset.
Figure 2 shows the results of the primary acoustic analysis of maternal speech in CG and RG groups, and Table 5 shows the corresponding descriptive statistics. Out of the 13 compared measures, only the voicing ratio (proportion of voiced speech of all speech) is significantly lower for RG mothers than controls (p < 0.001; t(26) = 4.092; Cohen’s $d_s$ = 1.547). All other measures have notable overlap in their distribution without statistically significant group differences.
Note: * Statistical Significance.
Given the detected difference in the voicing ratio of mothers’ speech, we proceeded to analyses with the two outcome-based subgroups of the at-risk infant dyads: those with infants having a typical or a mildly abnormal neurodevelopmental outcome (RG-M; N = 7), and those with a severe neurodevelopmental outcome (RG-S; N = 7), as shown in Figure 3. We used one-way ANOVA to detect if group differences existed with the outcome grouping, resulting in significant group differences (F(2,25) = 13.320; p < 0.001). Post hoc tests with unpaired t-tests indicated that the difference in voicing ratio between controls (CG) and RG-M mothers was significant (p = 0.045; t(19) = 2.141; $d_s$ = 0.991) with M = 0.68 for CG and M = 0.62 for RG-M. In addition, there was a difference between RG-M and RG-S (p = 0.042; t(12) = 2.281; $d_s$ = 1.219) with a decreasing voicing ratio associated with more severe neurodevelopmental outcomes (M = 0.55 for RG-S). The difference between CG and RG-S was also significant (p < 0.001; t(19) = 5.788; $d_s$ = 2.679).
Primary acoustic analyses comparable to those of maternal speech were conducted for infant vocalisations. Results relative to the infant raw F0 values and basic acoustic comparisons between CG and RG infants are reported in Appendix C. No significant differences were observed between the two infant groups (p > 0.05 for all comparisons after controlling for multiple comparisons) for any of the 13 compared acoustic measures (see Figure C1).
Dyadic and turn-taking analyses
Analysis of the dependencies between feature values of infants and their mothers revealed a correlation between the average voicing ratio of infants and their mothers in the at-risk group (r = 0.75, p = 0.003; Figure 4). This means that in the at-risk group, lower amounts of voicing in infant vocalisations were associated with lower amounts of voicing in the mothers’ speech. No other correlations were observed for the other features in the two groups (p > 0.05 for all).
Figure 5 shows the outcome of the turn-taking alignment analysis. There were no significant differences between the groups, but both CG and RG dyads had negative alignment in F0 skewness (p < 0.001, $d_s$ > 1.78 for both) and voicing ratio (p < 0.005; $d_s$ > 0.989), indicating that parents in both groups were using intonation and voicing ratio conditioned on the preceding infant vocalisation, but making these features more distinct from those of the infant than would be expected by chance. In addition, negative alignment was observed for tilt variability (p < 0.005; $d_s$ = 1.121) and utterance duration (p < 0.001; $d_s$ = 1.810) in RG, indicating that mothers of this group chose a more distinct instead of similar vocal response to the leading infants’ vocal initiations.
Based on the results of the correlation analysis, we also conducted a post-hoc subgroup comparison of the voicing ratio measured from the infant vocalisations (see Fig. C2). This analysis revealed that RG-S infants had a lower voicing ratio compared with CG infants (p = 0.022) but not compared with RG-M infants (p > 0.05) (detailed results are reported in Appendix C).
Finally, we re-ran the basic acoustic comparisons shown in Figure 2 using the mothers’ DASS-21 depression scores as covariates. This was done to check if the conflicting predictions from the potential developmental lag of infants in RG and higher maternal depressive symptoms in RG (Section: Maternal Mental Health) could counterbalance each other when averaging the results across subjects (see Footnote 1). However, the analysis did not reveal new significant differences between the groups.
Interim discussion on acoustic analyses
The results reveal a substantial difference in voicing ratio between control and at-risk dyads. In addition, the worse the long-term neurodevelopmental outcome of the infant, the more there is unvoiced relative to voiced speech by the mother. From the perspective of speech production, a lower voicing ratio could be attributed to a higher proportion of whispered speech or speech with otherwise weaker voicing in mothers of infants at risk compared with controls. More specifically, there might be a difference in voice quality, a.k.a. phonation style, between the talkers. For instance, IDS has previously been associated with higher amounts of whispered speech compared with ADS (Fernald & Simon, Reference Fernald and Simon1984; Garnica, Reference Garnica, Snow and Ferguson1977; Sundberg & Lacerda, Reference Sundberg and Lacerda1999) and with a higher amount of breathy speech compared with modal (normal) phonation (Miyazawa et al., Reference Miyazawa, Shinya, Martin, Kikuchi and Mazuka2017). It is also established that voice quality is a strong cue to communicating affect and attitudes in speech (Gobl & Ní Chasaide, Reference Gobl and Ní Chasaide2003; Ishi et al., Reference Ishi, Ishiguro and Hagita2010; Laukkanen et al., Reference Laukkanen, Vilkman, Alku and Oksanen1996; Scherer, Reference Scherer1986). This raises the possibility that the infant’s health status affects the amount of whisper or breathy speech mothers employ to communicate with their children. However, the basic acoustic analysis of annotated caregiver utterances does not allow identification of the exact source of the difference. To address this issue, we conducted another round of manual data annotation and analysis to investigate whether phonation style was indeed responsible for the observed differences in voicing, as detailed next.
IDS phonation analysis
Annotation of speech for phonation style
In the second stage of the study, two annotators with notable experience in speech analysis (the first and second author) annotated all mother-to-infant utterances of the first-stage annotation that were longer than 500 ms into one of three phonation categories: (i) whisper, (ii) soft/breathy, or (iii) modal. These three categories can be considered as different points on a continuum of phonation styles depending on the completeness of glottal closure (Gordon & Ladefoged, Reference Gordon and Ladefoged2001). In addition, compared with the first stage of the study, we applied a more specific criterion that the annotated utterances should consist of varying (non-repetitive) lexical content, as we wanted to make sure the phonation differences are related to speech in particular. For this purpose, a fourth category of non-lexical infant-directed vocalisations without clear lexical content was introduced, capturing vocalisations such as singing without words, humming, and other non-lexical fillers, acknowledgements, and responses without obvious consonant-vowel alternating structure. This category was not used in the phonation analyses. Moreover, utterances with more than 50% of duration overlapping with notable baby vocalisations or other environmental sounds were excluded from the phonation analyses, as we wanted the annotated segments to also support further automated analyses (not utilised in this study).
The annotators were instructed to assign each utterance to the majority phonation style present in an utterance, as sometimes phonation could change during the utterance. Similarly, the decision on lexical versus non-lexical was made based on the majority duration, as maternal vocalisations could shift from non-linguistic to linguistic (or vice versa) in a gradual manner, for example, by starting a play song by first humming a tune before following with words. We did not annotate infant vocalisations in terms of phonation style because of the difficulty of judging phonation from the relatively short non-linguistic infant vocalisations.
The annotation process was conducted using a MATLAB interface, in which an utterance was played to the annotator while showing the waveform and spectrogram of the signal. The annotator used a keyboard to select one of the four annotation options. The annotators could listen to the samples as many times as they wished, and they were allowed to go back to previous samples if they wanted (e.g., to correct mistakes or re-evaluate annotation decisions as the following context unfolded with subsequent samples). Utterances of each mother were annotated in a continuous block to allow annotators to adapt to the voice and general speaking style of the mother. The order of the dyads was randomised. The annotators could complete the annotation in multiple sessions on different days to avoid fatigue, and both annotators chose to do so. The annotators did not have access to the group assignment of the dyads during the annotation process. A small amount of general familiarisation with the audio material had taken place earlier in the context of piloting the acoustic measures, ensuring general validity of the first-stage annotations for vocalisations, and subjectively evaluating the need and outcomes for signal denoising. However, no information on infant study groups was available to the annotators at any of the stages.
Inter-annotator analysis of the resulting annotations revealed Fleiss’ kappa of κ = 0.65 (‘moderate’) for the four-class annotation. Binary pairwise agreement rates were κ = 0.81 for speech versus non-speech distinction, κ = 0.70 for whisper versus non-whisper, κ = 0.49 for soft versus non-soft phonation, and κ = 0.62 for modal versus non-modal phonation. Figure 6 shows the corresponding inter-annotator confusion matrix.
Overall, the inter-annotator agreement rate forms a reasonable basis for group comparisons. As expected, distinctions between neighbouring categories of the phonation continuum (whisper vs. soft and soft vs. modal) were the most difficult to determine. Besides the general lack of a strict categorical boundary between soft and modal voice, observed ambiguity was partially because of some of the utterances being relatively long in duration and consisting of multiple phonation styles in a sequence. In addition, non-speech versus speech distinction was sometimes challenging for very short utterances or utterances with intervening noise that was clearly present for some duration of the clip, as there were several borderline cases of whether the majority of speech duration was masked by substantial noise or not.
Analyses and results
After obtaining the annotations, analyses of maternal phonation style proportions were conducted for the CG, RG-M, and RG-S groups. Studied measures included (1) proportion of breathy speech, (2) proportion of whispered speech, and (3) proportion of modal speech, where the speaker-level dependent variable was the proportion of the given phonation style across all the utterances from the given caregiver. Only the utterances where both annotators agreed on the style were included in the analysis. In addition, a numeric overall speaker phonation score (0–2) was devised by scoring each utterance as 0 for whisper, 1 for breathy, or 2 for modal voicing, and then taking the average across the utterances of the speaker and their labels from both annotators. The motivation for this was that we expected annotators to disagree primarily on borderline cases between two phonation styles, and therefore scoring each utterance with a numeric average would take the continuous nature of phonation style more naturally into account. Besides the phonation style, the amount of linguistic versus non-linguistic utterances was tested for any group differences.
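The numeric phonation score can be sketched as below. The label strings and function name are hypothetical, and utterances that either annotator placed outside the three phonation categories (for example, non-lexical vocalisations) are skipped here, which is an assumption about a detail the text does not spell out.

```python
PHONATION_POINTS = {"whisper": 0, "breathy": 1, "modal": 2}

def phonation_score(labels_annotator_1, labels_annotator_2):
    """Mean 0-2 phonation score over utterances and both annotators."""
    per_utterance = []
    for a, b in zip(labels_annotator_1, labels_annotator_2):
        if a in PHONATION_POINTS and b in PHONATION_POINTS:
            # Disagreements on neighbouring styles yield half-point scores.
            per_utterance.append((PHONATION_POINTS[a] + PHONATION_POINTS[b]) / 2)
    if not per_utterance:
        raise ValueError("no utterances with phonation labels from both annotators")
    return sum(per_utterance) / len(per_utterance)

# Hypothetical labels for four utterances of one mother.
annotator_1 = ["modal", "breathy", "whisper", "modal"]
annotator_2 = ["modal", "modal", "whisper", "breathy"]
score = phonation_score(annotator_1, annotator_2)  # (2 + 1.5 + 0 + 1.5) / 4
```

Averaging over both annotators means that a borderline utterance labelled breathy by one and modal by the other contributes 1.5 instead of being discarded, which is how the score reflects the continuous nature of phonation.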
We conducted a one-way ANOVA separately for each of the four metrics, revealing significant group differences in the numeric phonation score (F(2,25) = 7.692; p = 0.002), the proportion of soft phonation (F(2,25) = 7.692; p = 0.002), and proportion of modal phonation (F(2,25) = 8.772; p = 0.001). Group differences were not significant for whisper. We then proceeded to pairwise analyses of the groups for the significant measures. Figure 7 shows the results from the phonation analyses. The results indicate that there were no statistical differences between CG and RG-M mothers on any of the measures. In contrast, there is a systematic difference between IDS to severe long-term outcome (RG-S) babies and the other two groups. First of all, the overall numeric phonation score (0–2) indicates that there is a clear difference between CG (M = 1.56, SD = 0.17) and RG-S (M = 1.19, SD = 0.24) dyads (p = 0.001, t(19) = 4.121, $d_s$ = 1.908), and also RG-M (M = 1.55, SD = 0.27) and RG-S dyads (p = 0.020, t(12) = 2.669, $d_s$ = 1.426). Analysis of the phonation style proportions reveals that soft (breathy) phonation is much more prevalent in speech by parents of RG-S infants (M = 44.7%, SD = 15.4%) compared to parents of CG infants (M = 22.0%, SD = 15.6%) (p = 0.005; t(19) = 3.151; $d_s$ = 1.458) and to parents of RG-M infants (M = 16.5%, SD = 12.0%) (p = 0.002; t(12) = 3.830; $d_s$ = 2.047). The redundant measure, the proportion of modal phonation, also shows a clear difference between RG-S infants and the other two groups with less modal phonation in the severe group. Even though ANOVA did not reveal significant group differences for the proportion of whispered speech, there appears to be a trend toward more whispering and more variable use of whispering across the RG-S mothers (M = 20.9%, SD = 15.4%, min–max: 0.0%–42.1%) compared with CG (M = 9.1%, SD = 9.1%, min–max: 0.0%–24.5%).
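As a rough consistency check, the effect size and t-statistic for the CG versus RG-S comparison of the numeric phonation score can be approximated from the reported means and standard deviations. The sketch assumes group sizes of 14 (CG) and 7 (RG-S), as implied by the reported degrees of freedom, and uses the pooled-SD definition of Cohen's d for independent samples.

```python
import math

def cohens_d_s(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def t_from_d(d, n1, n2):
    """Equal-variance unpaired t-statistic implied by d and the group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))

# Reported summary statistics; the group sizes are an assumption (see above).
d_s = cohens_d_s(1.56, 0.17, 14, 1.19, 0.24, 7)
t_stat = t_from_d(d_s, 14, 7)
```

This gives roughly 1.90 for the effect size and 4.10 for t, close to the reported 1.908 and 4.121; the small gap is consistent with the means and SDs being rounded to two decimals.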
As for the comparison of the proportion of lexical versus non-lexical utterances, there were no group differences (p > 0.05 for all comparisons). Proportions of non-lexical maternal vocalisations were M = 31.9% (SD = 22.8%) for CG, M = 25.4% (SD = 16.6%) for RG-M, and M = 25.8% (SD = 14.1%) for the RG-S.
As a control, we analysed if the subject-level phonation style measures were related to self-reported depressive symptoms of the mothers (from the DASS-21 questionnaire). When measuring the Spearman correlation between the depression scores and numeric phonation scores in the full set of subjects for whom the DASS-21 scores were available (14 CG and 12 in RG), there was no significant correlation (p = 0.55). Depression score correlations with respect to the proportion of soft or modal speech were also not significant (p > 0.05 for all). We also repeated the correlation analysis separately for CG and RG, again without any significant findings across the four measures of phonation (p > 0.05 for all analyses). This indicates that the mothers’ depression scores, even though somewhat higher in the RG mothers, do not appear to drive the observed differences in phonation style.
Finally, we ensured that the differing recording environments in the CG (8 dyads at home, 6 at the lab) were not responsible for the phonation differences between mothers of TD infants and infants with severe neurodevelopmental outcomes. This was done by comparing the phonation styles of the 6 CG in-lab dyads to the 7 RG-S dyads (all recorded in the lab). As a result, the overall numeric phonation score was still significantly higher for the controls (M = 1.489, SD = 0.118) than for RG mothers of infants with a severe outcome (M = 1.189, SD = 0.239) (p = 0.0179, t(11) = –2.78), with modal phonation being much more common in mothers of CG infants (M = 61.5%, SD = 15.7%) compared with mothers of RG-S infants (M = 34.4%, SD = 19.6%) (p = 0.020, t(11) = –2.714). In other words, CG mothers recorded in the lab used soft or whispered speech in 38.5% of their utterances, whereas mothers of the RG-S babies did so in 65.6% of the utterances. This confirms that the phonation style differences cannot be solely attributed to different speaking styles in different communicative environments. Yet, the recording environment also appears to have some effect on the style, as CG mothers used more soft phonation in the lab (M = 32.2%) than at home (M = 14.4%) (p = 0.0282; t(12) = –2.494).
Interim discussion on phonation analysis
Analyses of phonation style (whisper, breathy, or modal) reveal that parents change their phonation style based on the neurodevelopmental condition of the infant. More specifically, the substantially increased use of breathy speech in IDS seems to be associated with infants who are later diagnosed with more severe neurodevelopmental impairments. In contrast, no such increase in breathiness is observed for infants who are also involved in the clinical cohort for increased risk for neurological disorders, but who later turn out to have less severe or no adverse long-term clinical outcomes (RG-M). In addition, there were signs of whispered speech being more common in the RG-S than CG dyads, especially for some mothers, even though the group differences were not significant. Since there are no differences in the frequency of non-speech versus speech utterances, the initial acoustic analysis finding related to the voicing ratio seems to be attributable to a difference in phonation. It also appears that the differences in phonation styles are not explained by the severity of maternal depressive symptoms, even though mothers of infants at risk generally show higher self-reported depressive symptoms.
Discussion
This study sought to identify any major differences in IDS of Italian-speaking mothers to their infants who are at a high risk of neurological disorders (RG) compared with infants without such a risk (CG). Moreover, we compared IDS in two subgroups of at-risk infants: those who eventually developed severe long-term neurodevelopmental outcomes (RG-S) and those who turned out to have a typical development or only had mild neurological impairments (RG-M). As a result, we found a systematic difference between the groups in the proportion of voiced speech in mothers’ speech, where infants with later more severe outcomes received less voiced speech. In addition, it appears that the amount of voicing in mothers’ speech and infants’ vocalisations was correlated at the dyad level, and infants with severe long-term outcomes also had a lower voicing ratio than controls. When zooming into the origins of this difference through phonation analysis, we discovered that mothers whose infants had long-term severe neurodevelopmental outcomes used breathy phonation substantially more often than mothers of controls or at-risk infants with less severe long-term outcomes. Such phonation style has been earlier attributed to communication of intimacy (Laver, Reference Laver1980) and positive affect (e.g., Miyazawa et al., Reference Miyazawa, Shinya, Martin, Kikuchi and Mazuka2017).
As for infant vocalisations, we did not observe differences in the main comparison of acoustic features between the groups. However, this does not necessarily mean that no differences exist. Besides our generally small sample size, the acoustic measures applied to infant vocalisations were the same as those designed primarily for adult speech instead of being tailored for prelinguistic vocalisations. In addition, the overall duration of infant vocalisation data per recording was generally low compared with adult utterances (e.g., vocalisation rates in Figures 2 and C1), further limiting the potential generality of the findings. Therefore, further research would be required to draw stronger conclusions regarding infant vocalisations in the present type of study groups.
The origin of the group differences in caregiver phonation remains unclear. In theory, the difference in voicing observed for the control versus at-risk comparison could be attributable to the mothers’ awareness of their infants being at risk for neurological impairments at the time of recording. This might have affected their emotional state, for instance making them more fearful, anxious, or depressed compared with mothers of controls, or caused the mothers to communicate additional closeness and tenderness through weakly (softly) phonated speech. In addition, the at-risk and control groups were not balanced in terms of the recording environment, as some of the controls were recorded at home instead of the lab. However, the voicing ratio difference and phonation style differences are observable not only between the control (CG) and the at-risk group (RG) but also within the at-risk group when the group is further split into infants with typical/mild (RG-M) and severe later neurodevelopmental outcomes (RG-S). Moreover, subject-level self-reported maternal depression ratings were not significantly correlated with the phonation style proportions. This suggests that neither parental awareness of infant risk nor a higher prevalence of depressive symptoms in the risk group is the primary factor explaining the change in speaking style, and that the recording environment (e.g., familiar vs. unfamiliar settings) is unlikely to have caused the mothers to use phonation differently.
As an alternative explanation, the findings suggest that the severity of the clinical condition of infants (those who later develop a more severe outcome) likely prompted a different style of maternal speaking compared with speech directed at TD peers, at the time of the recording. In the present data, this adaptation appears to take place through changes in the voicing of speech. In addition, the lower voicing ratio in severe outcome infants compared with control infants and the correlation in mothers’ and infants’ voicing ratio might suggest that parents align their speaking style to that of the infants. However, this alignment is not observed at the level of individual interactional exchanges but reflects more overall use of voicing throughout the recording session.
The underlying cause or functional role of the increased use of soft phonation style in the severe outcome group can be related to several factors. One of the predictions based on earlier literature (see the Introduction) was that parents of infants with developmental delays would utilise a proportionally more affect-centred communication style compared with parents of typically developing age-matched infants. Since breathiness is associated with the communication of affect, especially gentleness/tenderness (Ishi et al., Reference Ishi, Ishiguro and Hagita2010) and intimacy (Laver, Reference Laver1980), whereas whispered speech is typical of private communicative engagement while sharing a general affective and attitude profile with breathy speech (Gobl & Ní Chasaide, Reference Gobl and Ní Chasaide2003), it is possible that the observed phonation differences reflect more affect-centred IDS in the severe outcome group. Alternatively, mothers of infants with more severe neurological conditions may instinctively stick with the communication style that is typically observed with younger infants (e.g., attention capturing or directing and soothing) rather than the more dynamic interactive behaviours that would typically be observed with 4.5-month-old infants and that might be undertaken by mothers of control infants or infants with milder neurodevelopmental deficits. To test whether this is simply because of developmental delays in the severe outcome group, an analysis of phonation styles should be conducted for younger (TD) infants from an otherwise comparable language and recording environment, to see whether the proportion of breathy speech decreases during the first 4–5 months of life to the levels now observed for controls.
Yet another possibility is that the mothers of severe outcome infants tried to avoid overstimulation of their babies. ASD is associated with a high prevalence of sensory processing difficulties, such as hyperacuity (Khalfa et al., Reference Khalfa, Bruneau, Rogé, Georgieff, Veuillet, Adrien, Barthélémy and Collet2004). Abnormal sensory reactivity is common also in infants at neurological risk and correlates with abnormal outcomes (Chorna et al., Reference Chorna, Solomon, Slaughter, Stark and Maitre2014). Since whispering and breathy speech are quieter than modal phonation, they may help to keep the infants more comfortable. This comes at the cost of reduced acoustic clarity, as the resonances of the vocal tract (formants) are attenuated in breathy speech compared with modal phonation. It is also possible that the infants with more severe outcomes were more fretful and/or cried more often at the time of recordings, leading to overall more soothing behaviours by the mothers compared with controls. However, the recording was designed to capture spontaneous mother–infant interactions when the infant was calm and alert, while excessive distress and crying of the infant were not filmed, and thereby we did not annotate and count infant cries explicitly. Therefore, this hypothesis cannot be quantitatively tested with the present interactional data. In addition, it is possible that infants who then developed more severe outcomes were generally more prone to become fussy during normal infant–caregiver interaction or had been more fussy before the recording onset. The mothers might have therefore tried to avoid excessive stimulation of the baby to ensure the success of the recording session. 
In general, it would be important to replicate the present analyses using more naturalistic long-form audio recordings from infants’ daily environments to reduce biases and the risk of artefacts arising from the data collection procedure, such as the presence of an experimenter nearby, even if out of the dyad’s sight, as in the current study. However, in all the above cases, regardless of the underlying cause of the mothers’ choice of speaking style, the basic role of a soft speaking style is related to the regulation of infants’ affective state.
Besides voicing and speaking style, our analyses did not find any other systematic group differences in caregiver speech to their infants. Utterance lengths, speaking rates, variability and style of intonation contours, variability of spectral tilt, or, for example, parental responsiveness to infant vocalisations or overall amount of speech did not show statistically significant differences between the groups. Alignment analyses at the recording and interactional exchange level also did not indicate any additional differences between the studied groups. While our current sample is relatively small, the other acoustic measures did not show even notable trends toward potential group differences (e.g., post hoc analysis without the normalisation for multiple comparisons does not reveal any additional effects; see also Table 5). At the same time, the voicing ratio difference and the associated phonation-style differences were extremely robust with large effects (e.g., d_s = 2.679 for the voicing ratio and d_s = 1.458 for the proportion of soft phonation in severe vs. control groups). While phonation can be considered as a largely independent dimension from other aspects of speech production (at least for languages without phonation-related phonemic contrasts; see Gordon & Ladefoged, Reference Gordon and Ladefoged2001), it is somewhat puzzling that parental accommodation of IDS to their infants’ neurodevelopmental status showed up only in this particular aspect of speaking style. To better understand this finding, additional investigations of the studied features with larger sample sizes (thereby enabling reliable analysis of interactions between multiple prosodic features), additional age groups (longitudinal design) and native languages, and with more naturalistic, acoustically controlled, and longer recordings would be required.
A more complete characterisation of infant vocalisation types (e.g., cries) and characterisation of non-verbal aspects of infant–caregiver interaction would help to shed light on the issue.
In addition, the interpretation of our results must consider that prematurity may also contribute to the effects we found in the RG maternal IDS. The potential impact of prematurity on the syntactic and lexical characteristics of IDS has been previously investigated alone or in combination with maternal depression, although with conflicting results (Salerni et al., Reference Salerni, Suttora and D’Odorico2007; Provera et al., Reference Provera, Neri and Agostini2023). RG babies in our population, however, often present a combination of prematurity and other risk factors for neurological disorders. Therefore, it is not possible to disentangle the roles of these two factors in maternal IDS in the current study. Given the significant impact that prematurity, the risk for neurological illness, and maternal emotional states may have on early parent–infant interactions and infant development, investigating their distinct roles in IDS in future studies is warranted.
Conclusions
The present study shows that some aspects of maternal IDS are sensitive to the neurodevelopmental state of the child already at an early age, earlier than when infants typically start to exhibit any conventional signs of speech comprehension or production. Moreover, it appears that parents perform this type of accommodation automatically, and probably unconsciously, according to the severity of their infant’s clinical condition and already several months before the long-term clinical outcome was established. This maternal sensitivity is reflected by substantially increased usage of soft voice toward infants who later turn out to have severe neurological issues. The use of soft voice may reflect a higher focus on affect regulation in the case of neurologically impaired infants. In contrast, speech toward healthy controls makes more use of modal phonation, which may be more useful as an input for language learning because of higher clarity of vocal resonances.
In conclusion, the exact reason for the phonation style differences in our study groups remains unclear and cannot be determined with the present data. Potential explanations may include developmental delays, increased passiveness, hyperacuity, or increased fussiness in the group of infants with more severe long-term outcomes, and therefore more research and larger sample sizes are needed to better understand the underlying causes and generality of the observed findings. Moreover, comprehensive and microanalytic analyses of interactive exchanges, including interactional domains other than vocal interactions (e.g., facial expression, motor behaviours, or gaze/attention), would complement the current results and provide a more exhaustive picture of interactive dynamics between mothers and infants at high risk for neurological impairments. In addition, the present study was conducted with a particular sample of participants (all mothers being White, educated, and Italian-speaking), and hence it remains to be tested how the finding on phonation style generalises to other participant populations.
Finally, the results demonstrate how phonation style can vary in speech directed at infants across different circumstances and recipient characteristics. Even though phonation style is known to be related to the communication of affect, intimacy, and attitude (e.g., Laukkanen et al., Reference Laukkanen, Vilkman, Alku and Oksanen1996; Laver, Reference Laver1980; Scherer, Reference Scherer1986), which is one of the hypothesised key roles of IDS in early development, its analysis is frequently overlooked in IDS research (but see Miyazawa et al., Reference Miyazawa, Shinya, Martin, Kikuchi and Mazuka2017). Besides its affect-related dimension, phonation style also has notable consequences for the spectral properties of speech, such as clarity and audibility. Hence, we argue that phonation should be taken into account in future studies on the nature of speech heard by infants.
Acknowledgements
We would like to sincerely thank the families who participated in this research. We thank S. Zanforlini and L. Foti for their assistance in coding the audio–video recordings and Dr. C. Antonelli for her help during recruitment and data collection. We acknowledge the scientific and financial support of the American Academy of Cerebral Palsy and Developmental Medicine (AACPDM Grant 2019 awarded to F.F. and A.G.). This work was partially supported by the Italian Ministry of Health - Ricerca Corrente (2024-2025), and Horizon 2020 project BornToGetThere no. 848201. O.R. was funded by the Academy of Finland grant nos. 314602 and 345365, and M.A. was funded by the Academy of Finland grant nos. 335752 and 343498.
Competing interest
The authors declare none.
Appendix A: Description of methodology
A.1 Mother–infant interaction video-recording procedures
Spontaneous face-to-face interactions between mothers and infants were video-recorded utilizing a 25 Hz digital video camera (Panasonic FullHD HC-V180) positioned laterally to the dyad. During the interaction, infants were placed semi-reclined in an infant seat or on a nursing pillow on the floor; the mother sat directly opposite, facing the infant, at a distance of ~30–40 cm, for optimal engagement. Mothers were invited to engage with their infants as they would normally do, without the use of a pacifier or toys. Videotaping took place in a quiet room, far from feeding times and when the infant was calm and alert. In the event of infant distress, the recording could be paused until the infant had calmed and then resumed, thereby prioritizing their well-being and comfort throughout the recording process.
A.2 Depression, Anxiety, and Stress Scale (DASS-21) questionnaire
Maternal mental health was assessed through the Depression, Anxiety, and Stress Scale (DASS-21) (Lovibond & Lovibond, Reference Lovibond and Lovibond1995), a 21-item self-report questionnaire, structured into three sub-scales, and designed to measure symptoms of depression (D), anxiety (A), and stress (S) experienced over the past week. The questionnaire includes 7 items for each sub-scale (D, A, and S), and each item is graded on a 4-point Likert scale from 0 to 3 (0: ‘Did not apply to me at all’, 3: ‘Applied to me very much or most of the time’). The score for each sub-scale is calculated by summing up the scores for the relevant items. A total score representing the overall maternal emotional distress can be calculated by summing up the scores of all items. Summed scores of each sub-scale and the total score are then multiplied by 2 to be compared with the normative data of the 42-item original DASS (or DASS-42; Lovibond & Lovibond, Reference Lovibond and Lovibond1995). Higher scores indicate greater symptomatology.
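The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the scoring script used in the study: the function name score_dass21 is hypothetical, and the item-to-sub-scale key is the commonly published DASS-21 key, included here as an assumption that should be checked against the official scoring template.

```python
# ASSUMPTION: commonly published DASS-21 key (1-based item numbers);
# verify against the official scoring template before real use.
DASS21_KEY = {
    "depression": (3, 5, 10, 13, 16, 17, 21),
    "anxiety": (2, 4, 7, 9, 15, 19, 20),
    "stress": (1, 6, 8, 11, 12, 14, 18),
}


def score_dass21(responses, key=DASS21_KEY):
    """Return DASS-42-comparable sub-scale and total scores.

    responses: sequence of 21 integers in 0..3, ordered by item number.
    """
    if len(responses) != 21:
        raise ValueError("expected 21 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be an integer from 0 to 3")
    # Sum the 7 items of each sub-scale, then double the sum so that it can
    # be compared with the norms of the 42-item DASS.
    scores = {
        name: 2 * sum(responses[i - 1] for i in items)
        for name, items in key.items()
    }
    # Total distress: sum over all 21 items, doubled in the same way.
    scores["total"] = 2 * sum(responses)
    return scores
```

For example, a respondent who answers 1 to every item obtains 14 on each sub-scale and 42 in total.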
Appendix B: Description of the studied acoustic features
Richness of intonation was measured with variance and skewness of logarithmic fundamental frequency (log-F0). The variance of log-F0 represents the general variability of intonation, such as the presence and extent of linguistic accent and sentence stress or emotional expression. F0 is also well-known to vary between IDS and ADS (see the Introduction section). Skewness of log-F0 is complementary to variance by measuring the asymmetry (non-normality) of the log-F0 distribution. It captures, for instance, the degree to which the speakers might use very high F0 to highlight certain parts of an utterance among otherwise less extreme intonation patterns.
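As a minimal sketch of these two measures, the following Python function computes the variance and skewness of log-F0 from a frame-level F0 track. Several details are assumptions rather than the study's documented choices: the function name log_f0_stats, the convention that unvoiced frames are marked with 0 or None, the use of the natural logarithm, and the use of population (biased) moments. The choice of logarithm base rescales the variance but leaves the skewness unchanged.

```python
import math


def log_f0_stats(f0_hz):
    """Return (variance, skewness) of log-F0 over voiced frames.

    f0_hz: iterable of F0 values in Hz. Frames with F0 <= 0 or None are
    treated as unvoiced and skipped (assumed convention).
    """
    logs = [math.log(f) for f in f0_hz if f is not None and f > 0]
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two voiced frames")
    mean = sum(logs) / n
    # Second and third central moments (population form).
    m2 = sum((x - mean) ** 2 for x in logs) / n
    m3 = sum((x - mean) ** 3 for x in logs) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return m2, skew
```

A contour that sits mostly at a moderate F0 with occasional high excursions yields a positive skewness, whereas a contour distributed symmetrically on the logarithmic scale yields a skewness near zero.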
For phonation, the variance of spectral tilt during voiced speech reflects phonation style, as the vibration style of the vocal folds is directly linked to the spectral balance of low versus high speech frequencies (Klatt & Klatt, Reference Klatt and Klatt1990). Variance of spectral tilt is also likely to reflect stress patterns utilised by the speaker. The voicing ratio measures the proportion of voicing time (i.e., sounds that are excited by the periodic fluctuations of the vocal folds) compared with the duration of unvoiced speech (e.g., consonants), thereby reflecting the amount of whispered speech. The automatically estimated voicing ratio is also likely to reflect the strength of voicing more generally, as weak voicing (e.g., soft speech) may result in fewer voicing detections from an automatic estimator, especially when transitioning from voiced to unvoiced or silent segments. Previously, voicing has been linked to IDS because of the higher breathiness of IDS (Miyazawa et al., Reference Miyazawa, Shinya, Martin, Kikuchi and Mazuka2017) and with more whispering in IDS than in ADS (Fernald & Simon, Reference Fernald and Simon1984; Garnica, Reference Garnica, Snow and Ferguson1977; Sundberg & Lacerda, Reference Sundberg and Lacerda1999).
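A voicing ratio of this kind can be sketched from a frame-level F0 track as follows. This is an illustration under assumptions, not the study's estimator: the function name voicing_ratio is hypothetical, unvoiced frames are assumed to be marked with 0 or None, and the ratio is taken as voiced frames divided by all frames of the utterance. The last assumption is consistent with the values between 0 and 1 reported in Appendix C, but the exact definition is not spelled out in the text.

```python
def voicing_ratio(f0_track):
    """Return the fraction of frames in an utterance that are voiced.

    f0_track: iterable of frame-level F0 values in Hz, covering speech
    frames only. F0 <= 0 or None marks an unvoiced frame (assumed
    convention).
    """
    frames = list(f0_track)
    if not frames:
        raise ValueError("empty F0 track")
    voiced = sum(1 for f in frames if f is not None and f > 0)
    return voiced / len(frames)
```

Because the count depends on the voicing detector, weakly phonated frames that the detector misses lower the ratio in the same way as genuinely whispered frames do, which is the effect described above.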
Logarithmic utterance length and speaking rate (syllables per second) were used for temporal analysis. In IDS, they also act as proxies for the complexity of language input, as faster speech tends to be appropriate for more advanced listeners, whereas utterance length correlates strongly with the number of linguistic units (e.g., Räsänen et al., Reference Räsänen, Seshadri, Karadayi, Riebling, Bunce, Cristia, Metze, Casillas, Rosemberg, Bergelson and Soderstrom2019).
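The two temporal measures can be illustrated with a short Python sketch. The function name temporal_features is hypothetical, and it is assumed here that utterance length is measured as duration in seconds, that the natural logarithm is used, and that the syllable count comes from a separate syllable estimator.

```python
import math


def temporal_features(start_s, end_s, n_syllables):
    """Return (log utterance length, speaking rate in syllables per second).

    start_s, end_s: utterance boundaries in seconds.
    n_syllables: syllable count from a separate estimator (assumed given).
    """
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    return math.log(duration), n_syllables / duration
```

For instance, a 2-second utterance containing 6 syllables has a speaking rate of 3 syllables per second.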
Appendix C: Acoustic analyses of infant vocalisations
C.1 Primary acoustic analyses of infant vocalisations
For infant vocalisations, the mean F0 was 246.1 Hz (±86.2 Hz; max 695.7 Hz), which is somewhat lower than expected. Manual analysis revealed that the F0 tracking algorithm often resulted in pitch-halving errors for high-pitched vocalisations (e.g., detecting 600-Hz F0 as 300 Hz), which is a known problem for F0 estimation and appears to be more prominent for infant sounds. Despite the halving errors, the F0 estimates for infants still reflect their use of vocal cords as a function of time, and, by default, the errors should not be biased toward any of our study groups. Hence, we proceeded also with the F0-based analyses for infants, although their results should be treated with caution. Infant ‘speech rate’, that is, the number of syllable-like peaks and troughs in sonority per second, was comparable to that of adults.
Figure C1 shows the acoustic comparison of infant vocalisations between control infants (CG) and infants at high risk (RG). No statistically significant differences are observed between the groups (p > 0.05 for all comparisons; Holm–Bonferroni corrected for multiple comparisons).
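For readers unfamiliar with the Holm–Bonferroni procedure mentioned above, a minimal step-down implementation is sketched below. The function name holm_bonferroni and the example p-values used with it are illustrative only and are not taken from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        adj = min(1.0, (m - rank) * p_values[i])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject
```

With the illustrative p-values 0.01, 0.04, 0.03, and 0.005, the adjusted values are 0.03, 0.06, 0.06, and 0.02, so only the first and last comparisons remain significant at the 0.05 level.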
C.2 Subgroup comparison of voicing ratio measured from the infant vocalisations
Given that the correlation analysis of maternal speech and infant vocalisations revealed signs of coupling in the voicing ratio between infants and their mothers in the risk group, we also conducted a post hoc subgroup comparison of the voicing ratio measured from the infant vocalisations (Figure C2). The result indicates that infants in the severe outcome group had a lower voicing ratio (M = 0.53, SD = 0.06) than controls (M = 0.65, SD = 0.10) (p = 0.022; t(17) = 2.524; d_s = 1.168; two infants did not have at least five utterances with valid F0 contours and were excluded from the analysis). The difference between typical/mildly abnormal outcomes and severe outcomes in infants was not significant.
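The relation between the reported t statistic and the effect size can be made concrete with a short sketch. It assumes that the reported effect size is the standardised mean difference with a pooled standard deviation (Cohen's d for independent samples) and that a Student's t-test with equal variances was used. The function name cohens_ds is hypothetical, and the sketch is not the analysis code of the study.

```python
import math


def cohens_ds(x, y):
    """Return (t, df, d_s) for two independent samples x and y."""
    n1, n2 = len(x), len(y)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    m1, m2 = sum(x) / n1, sum(y) / n2
    # Unbiased sample variances of the two groups.
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    df = n1 + n2 - 2
    # Pooled standard deviation across both groups.
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / df)
    if sp == 0:
        raise ValueError("pooled standard deviation is zero")
    d_s = (m1 - m2) / sp
    # Equal-variance t statistic, expressed through d_s.
    t = d_s / math.sqrt(1 / n1 + 1 / n2)
    return t, df, d_s
```

Under these assumptions, the two statistics are tied by d_s = t · sqrt(1/n1 + 1/n2), so either can be recovered from the other once the group sizes are known.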