1. Introduction
Spoken language comprehension seems like an easy, automatized process. But intelligibility and comprehension of speech can be rendered difficult in our daily conversations due to adverse listening conditions like background noise and distortion of the speech signal (e.g., Chen & Loizou, 2011; Fontan et al., 2015). For example, the voice of a person talking on the other end of a telephone connection can sound robotic and difficult to understand when the signal quality or transmission is poor. Perception and comprehension of speech in such an adverse condition is effortful (Pals et al., 2013; Strauss & Francis, 2017; Winn et al., 2015). To deal with perceptual difficulties, listeners rely on top-down prediction based on the context that has been understood so far (Obleser & Kotz, 2010; Pichora-Fuller, 2008; Sheldon et al., 2008b). The context can contain information about the topic of the conversation, syntactic information about the structure of the sentence, world knowledge, visual information, and so forth (Altmann & Kamide, 2007; Brothers et al., 2020; Kaiser & Trueswell, 2004; Knoeferle et al., 2005; Xiang & Kuperberg, 2015; for reviews, see Ryskin & Fang, 2021; Stilp, 2020).
To utilize context information, listeners must attend to it and build up a meaning representation of what has been said. Listeners attend to the context information in clear speech with minimal effort, but processing and comprehending degraded speech is more effortful and requires more attentional resources (Eckert et al., 2016; Peelle, 2018; Wild et al., 2012). However, it is less clear how listeners distribute attentional resources: On the one hand, listeners can attend throughout the whole stream of speech and may thereby profit from the context information to predict sentence endings. On the other hand, listeners can focus their attention on linguistic material at a particular time point in the speech stream and, as a result, miss critical parts of the sentence context. If the goal is to understand a specific word in an utterance, there is a trade-off between allocating attentional resources to the perception of that word vs. allocating resources also to the understanding of the linguistic context and generating predictions.
The aim of this study was to investigate how the allocation of attentional resources induced by different task instructions influences language comprehension and, in particular, the use of context information under adverse listening conditions. To examine the role of attention on predictive processing under degraded speech, we conducted two experiments in which we manipulated task instructions. In Experiment 1, participants were instructed to only repeat the final word of the sentence they heard, while in Experiment 2, they were instructed to repeat the whole sentence, thus drawing attention to the entire sentence including the context. In both experiments, we varied the degree of predictability of sentence endings as well as the degree of speech degradation. In the following, we first summarize the findings of studies that have investigated predictive language processing in the comprehension of degraded speech, and then results on the role of attention and task instruction in speech perception.
1.1. Predictive processing and language comprehension under degraded speech
It is broadly agreed that human comprehenders generate expectations about upcoming linguistic material based on context information (for reviews, see Kuperberg & Jaeger, 2016; Nieuwland, 2019; Pickering & Gambi, 2018; Staub, 2015). These expectations are formed while a sentence unfolds. The claims about the predictive nature of language comprehension are based on a variety of behavioral and electrophysiological experimental measures including eye-tracking and electroencephalography (EEG). For instance, in the well-known visual world paradigm, listeners fixate on a picture of an object (e.g., a cake) that is predictable based on the prior sentence context (e.g., ‘The boy will eat the …’) even before hearing the final target word (e.g., Altmann & Kamide, 1999, 2007; Ankener et al., 2018). Moreover, highly predictable words are read faster and are skipped more often compared to less predictable words (Frisson et al., 2005; Rayner et al., 2011).
In EEG studies, the N400, a negative-going EEG component that usually peaks around 400 ms poststimulus, is considered a neural marker of semantic unexpectedness (Kutas & Federmeier, 2011). For instance, in the highly predictable sentence context ‘The day was breezy so the boy went outside to fly …’, DeLong et al. (2005) found that the amplitude of the N400 component for the expected continuation ‘a kite’ was much smaller than for the unexpected continuation ‘an airplane’. Although these studies demonstrated that as the sentence context builds up, listeners form predictions about upcoming words in the sentence, the universality and ubiquity of predictive language processing have been questioned (see Huettig & Mani, 2016). Also, the use of context for top-down prediction can be limited by factors like literacy (Mishra et al., 2012), age, and working memory (Federmeier et al., 2002, 2010), as well as by the experimental setup (Huettig & Guerra, 2019). While these language comprehension studies investigating predictive processing have used clean speech and sentence reading, the present study focuses on examining how attention influences the use of context to form top-down predictions under adverse listening conditions.
There is already some evidence that when the bottom-up speech signal is less reliable due to degradation, listeners tend to rely more on the context information to support language comprehension (Amichetti et al., 2018; Obleser & Kotz, 2010; Sheldon et al., 2008a). For example, Sheldon et al. (2008a, Figure 2) estimated that for both younger and older adults, the number of noise-vocoding channels required to achieve 50% accuracy varied as a function of sentence context. Compared to highly predictable sentences, a greater number of channels (i.e., more bottom-up information) was required in less predictable sentences to achieve the same level of accuracy. Therefore, they concluded that when speech is degraded, predictable sentence context facilitates word recognition. Obleser et al. (2007) found that at a moderate level of spectral degradation, listeners’ word recognition accuracy was higher for highly predictable sentence contexts than for less predictable ones. However, while listening to the least degraded speech, there was no such beneficial effect of sentence context (see also Obleser & Kotz, 2010). Hence, especially when the bottom-up speech signal is less reliable due to moderate degradation, information available from the sentence context is used to enhance language comprehension, suggesting that there is a dynamic interaction between top-down predictive and bottom-up sensory processes in language comprehension (Bhandari et al., 2021).
1.2. Attention and predictive language processing
It is not only the quality of the speech signal that influences the reliance on and use of predictive processing; attention to auditory input is also important. Auditory attention allows a listener to focus on the speech signal of interest (for reviews, see Fritz et al., 2007; Lange, 2013). For instance, it has been shown that a listener can attend to and derive information from one stream of sound among many competing streams, as demonstrated in the well-known cocktail party effect (Cherry, 1953; Hafter et al., 2007). When a participant is instructed to attend to only one of two or more competing speech streams in a diotic or dichotic presentation, response accuracy to the attended speech stream is higher than to the unattended speech (e.g., Tóth et al., 2020). Similarly, when a listener is presented with a stream of tones (e.g., musical notes varying in pitch, pure tones of different harmonics) but attends to any one of the tones appearing at a specified time point, this is reflected in a larger amplitude of the N1 (e.g., Lange & Röder, 2010; see also Sanders & Astheimer, 2008). The N1 is the first negative-going ERP component, peaking around 100 ms poststimulus, and is considered a marker of auditory selective attention (Näätänen & Picton, 1987; Thorton et al., 2007). Hence, listeners can draw attention to and process one among multiple competing speech streams.
So far, most previous studies investigated listeners’ attention within a single speech stream by using acoustic cues like accentuation and prosodic emphasis. For example, Li et al. (2014) examined whether the comprehension of critical words in a sentence context was influenced by a linguistic attention probe such as ‘ba’ presented together with an accented or deaccented critical word. The N1 amplitude was larger for words with such an attention probe than for words without a probe. These findings support the view that attention can be flexibly directed either by instructions toward a specific signal or by linguistic probes (Li et al., 2017; see also Brunellière et al., 2019). Thus, listeners are able to select a part or segment of a stream of auditory stimuli to pay attention to.
The findings on the interplay of attention and prediction mentioned above come from studies which, for the most part, used a stream of clean speech or multiple streams of clean speech in their experiments. They cannot tell us about the attention–prediction interplay in degraded speech comprehension. Specifically, we do not know what role attention to a segment of a speech stream plays in the contextual facilitation of degraded speech comprehension, although separate lines of research show that listeners attend to the most informative portion of the speech stream (e.g., Astheimer & Sanders, 2011), and semantic predictability facilitates comprehension of degraded speech (e.g., Obleser & Kotz, 2010).
1.3. The present study
We examined whether context-based semantic predictions are automatic during effortful listening to degraded speech, when participants are instructed to report either the final word of the sentence or the entire sentence. We manipulated semantic predictions and speech degradation by orthogonally varying cloze probability of target words and number of channels for the noise-vocoding of speech in a factorial design. Noise-vocoded speech is difficult to understand, as the frequency-specific information of a specific bandwidth is replaced with white noise while temporal cues are preserved (e.g., Corps & Rabagliati, 2020; Davis et al., 2005; Shannon et al., 1995).
In two experiments, we varied the task instructions to the listeners, which required them to differentially attend to the target word. In Experiment 1, listeners were asked to report the noun which was in the final position of the sentence that they heard. This instruction did not require listeners to pay attention to the context. Hence, processing the context was not strictly necessary for the task. In Experiment 2, listeners were asked to report the entire sentence by typing in everything they heard. Thus, the listeners’ attention in Experiment 2 was not focused on any specific part of the sentence. We hypothesized that when listeners pay attention only to the contextually predicted target word, as they might choose to do in Experiment 1, they do not form top-down predictions, that is, there should not be a facilitatory effect of target word predictability. In contrast, when listeners attend to the whole sentence, they do form expectations, such that a facilitatory effect of target word predictability will be observed.
2. Experiment 1
2.1. Method
2.1.1. Participants
We recruited 50 participants online via Prolific Academic (Prolific, 2014). One participant whose response accuracy was less than 50% across all experimental conditions was removed. Among the remaining 49 participants (M age ± SD = 23.31 ± 3.53 years; age range = 18–30 years), 27 were male and 22 were female. All participants were native speakers of German and did not have any speech-language disorder, hearing loss, or neurological disorder (all self-reported). All participants received 6.20 euros as monetary compensation for their participation. The experiment was approximately 40 minutes long. The German Society for Language Science ethics committee approved the study and participants provided informed consent in accordance with the Declaration of Helsinki.
2.1.2. Materials
We used the same materials as in our previous study (Bhandari et al., 2021). They consist of 360 German sentences spoken by a female native German speaker, unaccented, at a normal rate of speech. The sentences were recorded and digitized at 44.1 kHz with 32-bit linear encoding. All sentences consisted of pronoun, verb, determiner, and object (noun) (for example stimulus sentences with their English translations, see Supplementary Material). We used 120 nouns to create three types of sentences differing in the cloze probability of the target words (nouns), which mostly appeared as the final word of the sentence. We thereby compared sentences with low, medium, and high cloze target words.
The cloze probability ratings for each of these sentences were measured in a norming study with a separate group of participants (n = 60; age range = 18–30 years). Mean cloze probabilities for sentences with low cloze target words (low predictability sentences), medium cloze target words (medium predictability sentences) and high cloze target words (high predictability sentences) were 0.022 ± 0.027 (M ± SD; range = 0.00–0.09), 0.274 ± 0.134 (M ± SD; range = 0.10–0.55), and 0.752 ± 0.123 (M ± SD; range = 0.56–1.00), respectively.
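As a minimal sketch of how a cloze probability is conventionally obtained, the following computes the proportion of norming responses that match a target word and assigns it to one of the three predictability bins. The function names and the example responses are hypothetical; only the bin ranges are taken from the values reported above, and the actual norming procedure may have differed in detail (e.g., in how spelling variants were handled).

```python
from collections import Counter


def cloze_probability(completions, target):
    """Proportion of norming responses that match the target word (case-insensitive)."""
    if not completions:
        raise ValueError("need at least one completion")
    normalized = [c.strip().lower() for c in completions]
    return Counter(normalized)[target.strip().lower()] / len(normalized)


def predictability_bin(p):
    """Assign a cloze probability to a bin, using the ranges reported in the text."""
    if p < 0.10:       # low: 0.00-0.09
        return "low"
    if p < 0.56:       # medium: 0.10-0.55
        return "medium"
    return "high"      # high: 0.56-1.00


# Hypothetical completions from 60 norming participants for one sentence frame
responses = ["Kuchen"] * 45 + ["Apfel"] * 10 + ["Brot"] * 5
p_target = cloze_probability(responses, "Kuchen")
label = predictability_bin(p_target)
```

In this made-up example, 45 of 60 participants produce the target, so the sentence would fall into the high predictability range.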
The speech signal was divided into 1, 4, 6, and 8 frequency bands between 70 and 9,000 Hz to create four different levels of speech degradation for each of the 360 recorded sentences. Frequency boundaries were approximately logarithmically spaced, determined by cochlear-frequency position functions (Erb, 2014; Greenwood, 1990). A customized Praat script originally written by Darwin (2005) was used to create noise-vocoded speech. Boundary frequencies for each noise-vocoding condition are given in Table 1.
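The spacing of band boundaries along a cochlear frequency-position function can be sketched as follows. This is an illustration only, not the Praat script used in the study: it assumes the commonly cited human constants of the Greenwood function (A = 165.4, a = 2.1, k = 0.88, with position expressed as a proportion of cochlear length) and divides the 70 to 9,000 Hz range into bands of equal cochlear extent. The resulting values are not guaranteed to reproduce Table 1.

```python
import math

# Assumed Greenwood (1990) constants for humans; x is the proportion of cochlear length
A, ALPHA, K = 165.4, 2.1, 0.88


def freq_to_position(f_hz):
    """Inverse Greenwood function: frequency (Hz) -> relative cochlear position."""
    return math.log10(f_hz / A + K) / ALPHA


def position_to_freq(x):
    """Greenwood function: relative cochlear position -> frequency (Hz)."""
    return A * (10 ** (ALPHA * x) - K)


def band_boundaries(n_bands, f_low=70.0, f_high=9000.0):
    """Return n_bands + 1 boundary frequencies equally spaced in cochlear position."""
    x_low, x_high = freq_to_position(f_low), freq_to_position(f_high)
    step = (x_high - x_low) / n_bands
    return [position_to_freq(x_low + i * step) for i in range(n_bands + 1)]


# Boundaries for the four vocoding conditions described in the text
boundaries = {n: band_boundaries(n) for n in (1, 4, 6, 8)}
```

Because equal steps are taken in cochlear position rather than in Hz, the bands are narrow at low frequencies and wide at high frequencies, which is what "approximately logarithmically spaced" amounts to.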
2.1.3. Procedure
Participants were asked to use headphones or earphones. A sample of vocoded speech not used in the practice trial or the main experiment was provided so that the participants could adjust the volume to their preferred level of comfort at the beginning of the experiment. The participants were instructed to listen to the sentences and to type in the target word (noun) by using the keyboard. The time for typing in the response was not limited. They were also informed at the beginning of the experiment that some of the sentences would be ‘noisy’ and not easy to understand, and in these cases, they were encouraged to guess what they might have heard. Eight practice trials with different levels of speech degradation were given to familiarize the participants with the task before presenting all 120 experimental trials with an intertrial interval of 1,000 ms.
Each participant had to listen to 40 high predictability, 40 medium predictability, and 40 low predictability sentences. Levels of speech degradation were also balanced across each predictability level, so that for each of the three predictability conditions (high, medium, and low predictability), ten 1-channel, ten 4-channel, ten 6-channel, and ten 8-channel noise-vocoded sentences were presented, resulting in 12 experimental lists. The sentences in each list were pseudo-randomized so that no more than three sentences of the same degradation and predictability condition appeared consecutively.
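The pseudo-randomization constraint can be illustrated with a simple rejection-sampling sketch: shuffle the trial order and accept it only if no condition repeats more than three times in a row. The helper names are hypothetical and the trials are represented only by their condition labels; the procedure actually used to build the lists is not described in more detail here, so this is one plausible way to satisfy the stated constraint.

```python
import random
from itertools import product


def max_run_length(seq):
    """Length of the longest run of identical consecutive items."""
    longest = run = 0
    prev = object()  # sentinel that equals no real item
    for item in seq:
        run = run + 1 if item == prev else 1
        prev = item
        longest = max(longest, run)
    return longest


def pseudo_randomize(trials, max_run=3, rng=None, max_tries=10_000):
    """Shuffle until no condition occurs more than max_run times consecutively."""
    rng = rng or random.Random()
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if max_run_length(order) <= max_run:
            return order
    raise RuntimeError("no valid order found within max_tries")


# 4 degradation levels x 3 predictability levels, 10 trials per cell = 120 trials
channels = (1, 4, 6, 8)
predictability = ("low", "medium", "high")
trials = [cond for cond in product(channels, predictability) for _ in range(10)]
order = pseudo_randomize(trials, rng=random.Random(0))
```

With twelve conditions of ten trials each, long runs are rare in a plain shuffle, so the rejection loop typically accepts an order after very few attempts.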
2.2. Analyses
We performed data preprocessing and analyses in RStudio (R version 3.6.3; R Core Team, 2020). At 1-channel, there were only five correct responses, one each from 5 participants out of 49. Therefore, the 1-channel speech degradation condition was excluded from the analyses.
Accuracy was analyzed using Generalized Linear Mixed Models (GLMMs) with the lmerTest (Kuznetsova et al., 2017) and lme4 (Bates et al., 2015) packages. Binary responses (categorical: correct and incorrect) for all participants were fit with a binomial linear mixed-effects model (Jaeger, 2006, 2008). Correct responses were coded as 1 and incorrect responses were coded as 0. Number of channels (categorical: 4-channel, 6-channel, and 8-channel noise-vocoding), target word predictability (categorical: high predictability sentences, medium predictability sentences, low predictability sentences), and the interaction of number of channels and target word predictability were included in the fixed effects.
We first fitted a model with maximal random effects structure that included random intercepts for each participant and item (Barr et al., 2013). Both by-participant and by-item random slopes were included for number of channels, target word predictability, and their interaction, which was supported by the experiment design. Based on the previous findings on perceptual adaptation (e.g., Cooke et al., 2022; Davis et al., 2005; Erb et al., 2013; but see also Bhandari et al., 2021), we further added trial number (centered) to the fixed effects structure to control for whether the listeners adapted to the degraded speech. We report the results of the model that includes trial number as a fixed effect.
We applied treatment contrast for number of channels (8-channel as a baseline) and sliding difference contrast for target word predictability (low predictability vs. medium predictability, and low predictability vs. high predictability sentences). The code and data are available in the following publicly accessible repository: https://osf.io/t6unj/.
2.3. Results and discussion
Mean response accuracy for all experimental conditions is shown in Table 2 and Fig. 1. We found that accuracy increased with an increase in the number of noise-vocoding channels, that is, with a decrease in speech degradation. However, accuracy did not increase with an increase in target word predictability. The results of statistical analysis confirmed these observations (see Table 3).
There was a significant main effect of number of channels, indicating that response accuracy for the 8-channel vocoded speech was higher than for both 4-channel (β = −3.50, SE = 0.22, z (4,410) = −16.19, p < 0.001) and 6-channel vocoded speech (β = −0.70, SE = 0.21, z (4,410) = −3.29, p = 0.001), that is, when the number of channels increased to 8, listeners gave more correct responses (see Fig. 2). There was, however, no significant main effect of target word predictability (β = 0.30, SE = 0.36, z (4,410) = 0.84, p = 0.40, and β = 0.50, SE = 0.43, z (4,410) = 1.16, p = 0.25), and no interaction between number of channels and target word predictability (all ps > 0.05). There was also no significant main effect of trial number (β = 0.001, SE = 0.002, z (4,410) = 0.48, p = 0.63), suggesting that the listeners’ performance did not improve over time.
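Because the model is binomial, the reported β values are on the log-odds scale. As a rough aid to interpretation, the short sketch below converts the two channel coefficients reported above into odds ratios relative to the 8-channel baseline. The function name is illustrative, and the conversion ignores the random effects structure, so the numbers should be read as approximate.

```python
import math


def odds_ratio(beta):
    """Convert a log-odds coefficient from a binomial model into an odds ratio."""
    return math.exp(beta)


# Coefficients reported in the text (baseline: 8-channel speech)
or_4_vs_8 = odds_ratio(-3.50)  # roughly 0.03
or_6_vs_8 = odds_ratio(-0.70)  # roughly 0.50
```

Read this way, the odds of a correct response at 4 channels were about 3% of the odds at 8 channels, and at 6 channels about half, with the other predictors held fixed.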
These results indicated a decrease in response accuracy with an increase in speech degradation from the 8-channel to the 6-channel noise-vocoding condition, and from the 8-channel to the 4-channel noise-vocoding condition. However, response accuracy did not increase with an increase in target word predictability, and the interaction between number of channels and target word predictability was also absent, in contrast to previous findings (Obleser & Kotz, 2010; Obleser et al., 2007; see also Hunter & Pisoni, 2018). These results suggest that the task instruction, which asked participants to report only the final word, indeed led to neglecting the context. Although participants were able to neglect the context, there was still uncertainty about the speech quality of the next trial; hence, they could not adapt to the different levels of degraded speech.
To confirm that the predictability effect (or contextual facilitation) is replicable and dependent on attentional focus, we conducted a second experiment in which we changed the task instruction to draw participants’ attention to decoding the whole sentence.
3. Experiment 2
3.1. Method
3.1.1. Participants and materials
We recruited 48 participants (M age ± SD = 24.44 ± 3.55 years; age range = 18–31 years; 32 males) online via Prolific Academic. The same procedure was followed as in Experiment 1, and the same stimuli were used.
3.1.2. Procedure
Participants were presented with sentences at a comfortable volume level. They were asked to use headphones or earphones, and a prompt was presented before the experiment began to adjust the volume to their level of comfort. Eight practice trials were presented, followed by 120 experimental trials. The participants were instructed to report the entire sentence by typing in what they heard. We did not limit the response time.
3.2. Analysis
We followed the same data analysis procedure as in Experiment 1. The 1-channel speech degradation condition was excluded from the analysis. We did not consider whether listeners reported other words in a sentence correctly; only the final words of the sentences (target words) were considered as either correct or incorrect responses. As in Experiment 1, we report the results from the maximal model supported by the design.
3.3. Results and discussion
Mean response accuracy for different conditions is shown in Table 4 and Fig. 2. We found that accuracy increased when the number of noise-vocoding channels increased, as well as when the target word predictability increased. The results of statistical analysis confirmed these observations (Table 5): We again found a main effect of number of channels, such that response accuracy at 8-channel was higher than for both 4-channel (β = −3.51, SE = 0.24, z (4,320) = −14.64, p < 0.001), and 6-channel noise-vocoding (β = −0.65, SE = 0.22, z (4,320) = −2.93, p = 0.003). Similar to Experiment 1, the main effect of trial number was not significant (β = 0.002, SE = 0.002, z (4,320) = 1.11, p = 0.27) indicating that the response accuracy did not increase over the course of the experiment.
In contrast to Experiment 1, there was also a main effect of target word predictability: Response accuracy in high predictability sentences was significantly higher than in low predictability sentences (β = 1.42, SE = 0.47, z (4,320) = 3.02, p = 0.003). We also found a statistically significant interaction between speech degradation and target word predictability (β = −1.14, SE = 0.50, z (4,320) = −2.30, p = 0.02). Subsequent subgroup analyses of each channel condition showed that the interaction was driven by the difference in response accuracy between high predictability sentences and low predictability sentences in the 8-channel (β = 1.42, SE = 0.62, z (1,440) = 2.30, p = 0.02) and 6-channel noise-vocoding conditions (β = 1.14, SE = 0.34, z (1,440) = 3.31, p < 0.001); in the 4-channel condition, the difference in response accuracy between high and low predictability sentences was not significant (β = 0.28, SE = 0.18, z (1,440) = 1.59, p = 0.11).
In contrast to Experiment 1, these results indicate an effect of target word predictability; that is, response accuracy was higher when the target word predictability was high as compared to low. Also, the interaction between target word predictability and speech degradation, which was not observed in Experiment 1, showed that semantic predictability facilitated the comprehension of degraded speech already at moderate levels (like 6- or 8-channel). In line with the findings from Experiment 1, response accuracy was better with a higher number of channels.
We combined the data from both experiments in a single analysis to test whether participants’ response accuracy changes across the experiments, that is, to test whether the difference between experimental manipulations is statistically significant. We ran a binomial linear mixed-effects model on response accuracy and followed the same procedure as in Experiments 1 and 2. A full random effects structure supported by the study design was modeled. The model summary is shown in Table 6. The model revealed that there was no significant main effect of experimental group (β = 0.04, SE = 0.26, z (8,730) = 0.15, p = 0.88), indicating that the overall response accuracy did not change with the change in instructions from Experiment 1 to Experiment 2. However, the critical interaction between experimental group and target word predictability was statistically significant (β = 0.46, SE = 0.20, z (8,730) = 2.34, p = 0.02), that is, the effect of predictability was larger in the group that was asked to type in the whole sentence (Experiment 2) than in the group that was asked to type only the sentence-final target word (Experiment 1). Together, these findings suggest that the change in task instruction, which draws attention either to the entire sentence or only to the final word, is critical to whether the context information is used under degraded speech. However, overall comprehension of degraded speech is not reduced when listeners’ attention is bound to one part of the speech stream.
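The numbers given in parentheses after z in the reported statistics appear to correspond to the number of observations entering each model. The following arithmetic sketch reconstructs them from the design described above, assuming that every participant contributed all trials of the three analyzed channel conditions and that no further trials were excluded; the variable names are illustrative.

```python
# Design facts taken from the text
TRIALS_PER_PARTICIPANT = 120   # experimental trials per participant
CHANNEL_LEVELS = 4             # 1-, 4-, 6-, and 8-channel, equally frequent
N_EXP1, N_EXP2 = 49, 48        # participants retained in each experiment

trials_per_channel = TRIALS_PER_PARTICIPANT // CHANNEL_LEVELS         # 30
analysed_per_participant = TRIALS_PER_PARTICIPANT - trials_per_channel  # 90 (1-channel dropped)

obs_exp1 = N_EXP1 * analysed_per_participant      # 4410
obs_exp2 = N_EXP2 * analysed_per_participant      # 4320
obs_combined = obs_exp1 + obs_exp2                # 8730
obs_exp2_per_channel = N_EXP2 * trials_per_channel  # 1440 (subgroup analyses)
```

Under these assumptions the reconstructed counts agree with the values reported for Experiment 1 (4,410), Experiment 2 (4,320), the per-channel subgroup analyses (1,440), and the combined analysis (8,730).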
4. General discussion
The main goals of the present study were to investigate whether online semantic predictions are formed in comprehension of degraded speech when task instructions encourage attention to the processing of the context information, or only to the critical target word. The results of two experiments revealed that attentional processes clearly modulate the use of context information for predicting sentence endings when the speech signal is moderately degraded.
In contrast to the first experiment, the results of our second experiment show an interaction between target word predictability and degraded speech. This is generally in line with the few existing studies that found a facilitatory effect of predictability at different levels of speech degradation when the participants were instructed to pay attention to the entire sentence (e.g., at 4-channel, or at 8-channel; Bhandari et al., Reference Bhandari, Demberg and Kray2021; Obleser & Kotz, Reference Obleser and Kotz2010; Obleser et al., Reference Obleser, Wise, Alex Dresner and Scott2007). The important new finding that our study adds to the present literature is that this effect may be weakened or lost when listeners are instructed to report only the final word of the sentence that they heard (Experiment 1). The lack of predictability effect (or contextual facilitation) can most likely be attributed to listeners not successfully decoding the meaning of the verb of the sentence, as the verb is the primary predictive cue in our stimuli for the target word (noun). Hence, this small change in task instructions from Experiment 1 to Experiment 2 sheds light on the role of top-down regulation of attention in using context for language comprehension in adverse listening conditions. In an adverse listening condition, language comprehension is generally effortful, so that focusing attention on only a part of the speech signal seems beneficial in order to enhance stimulus decoding. However, the results of this study also show that this comes at the cost of neglecting the context information that could be beneficial for language comprehension. Our findings hence demonstrate that there is a trade-off between the use of context for generating top-down predictions vs. focusing all attention on a target word. 
Specifically, the engagement in the use of context and generation of top-down predictions may change as a function of attention (see also Li et al., Reference Li, Lu and Zhao2014). This claim is also corroborated by the significant change in predictability effects (or contextual facilitation) from Experiment 1 to Experiment 2, in the combined dataset. Findings from the irrelevant-speech paradigm also support our conclusion. It has been shown that the predictability of unattended speech has no effect on the main experimental task (e.g., memorization of auditorily presented digits). Wöstmann and Obleser (Reference Wöstmann and Obleser2016) did not find predictability effects when the participants ignored the degraded speech (see also Ellermeier et al., Reference Ellermeier, Kattner, Ueda, Doumoto and Nakajima2015). An alternative explanation of ‘participants neglecting the context’ could be that the participants did not listen to the context at all, or they heard but did not process the context. However, irrelevant-speech paradigm studies show that listeners cannot avoid listening to the speech presented to them; to-be-ignored speech has been shown to interfere with the main experimental task (e.g., LeCompte, Reference LeCompte1995). It is not implausible that the listeners listened to the context but did not do a deep processing. This is not incompatible with our first explanation, as in either case, attention to the final word leaves the listeners with limited resources to process and form a representation of the context information.
At this point, we note the differences in response accuracies across different levels of speech degradation, and contextual facilitation therein. At 8-channel condition, the speech was least degraded, and listeners recognized more words than in the 4- or 6-channel conditions, which is in line with prior studies that have found an increase in intelligibility and word recognition with an increase in number of channels (e.g., Davis et al., Reference Davis, Johnsrude, Hervais-Adelman, Taylor and McGettigan2005; Obleser et al., Reference Obleser and Kotz2011). Speech signal passed through 4-channel noise-vocoding was most degraded. Therefore, in the second experiment, at 4-channel, attending to the entire sentence did not confer contextual facilitation because decoding the context itself was difficult. Listeners could not utilize the context differentially across high and low predictability sentences to generate semantic predictions. At 6-channel – a moderate level of degradation – listeners could attend to, identify, and decode the context; hence we observed the significant difference in response accuracy between high and low predictability sentences. We observed a similar contextual facilitation at 8-channel as well. This is in line with previous findings (e.g., Obleser et al., Reference Obleser, Wise, Alex Dresner and Scott2007; but see also Obleser & Kotz, Reference Obleser and Kotz2010) which show that predictability effects can be observed at a moderate degradation level of 8-channel or less. To summarize, our results indicate that there was a very strong difference in intelligibility between 4- and 6-channel conditions, but that the difference in intelligibility between 6- and 8-channel conditions was minor. Note, however, that even for 8-channel, low predictability sentences were not always understood correctly.
Considering theoretical accounts of predictive language processing (Friston et al., 2020; Kuperberg & Jaeger, 2016; McClelland & Elman, 1986; Norris et al., 2016; Pickering & Gambi, 2018), one would expect that listeners automatically form top-down predictions about upcoming linguistic stimuli based on prior context. Moreover, when speech is degraded, top-down predictions benefit word recognition and language comprehension (e.g., Corps & Rabagliati, 2020; Sheldon et al., 2008a, 2008b). The results of our study offer new theoretical insight by showing that this is not always the case. Top-down predictions depend on attentional processes (see also Kok et al., 2012), which are directed by task instructions; thus they are not always automatic, and predictability does not always facilitate comprehension of degraded speech. In this respect, our findings contribute to the growing body of literature indicating limitations of predictive language processing accounts (Huettig & Guerra, 2019; Huettig & Mani, 2016; Mishra et al., 2012; Nieuwland et al., 2018).
Results from both experiments show that the effect of trial number is not significant. In contrast to previous studies (e.g., Davis et al., 2005; Erb et al., 2013), we did not observe adaptation to noise-vocoded speech. In those studies, there was certainty about the speech quality of the next trial, as the participants were presented with only one level of spectral degradation (only 4-channel or only 6-channel noise-vocoding), and, crucially, without specific regard to semantic predictability. By contrast, in our study, listeners were always uncertain about the speech quality of the next trial as well as its semantic predictability. Because of this changing context, the perceptual system of the participants may not have retuned itself (cf. Goldstone, 1998; Mattys et al., 2012). This is also in line with our prior finding that listeners do not adapt to degraded speech when there is trial-by-trial variation in perceptual and semantic features (Bhandari et al., 2021).
We should also note the limitations of the current study. In our experiments, we used short Subject–Verb–Object sentences in which the verb is predictive of the noun, and we gave participants the somewhat unnatural task of reporting the last word of a sentence. In more naturalistic sentence comprehension, participants would normally aim to understand the full utterance, and would most likely not have restricted goals such as first and foremost decoding a word in a specific position of the sentence. Instead, the speaker would usually indicate important words or concepts via pitch contours, stress, or intonation patterns, which would then direct the attention of a listener. Furthermore, the sentences uttered in most day-to-day conversations are longer, and context information builds up more gradually: information from several words is usually jointly predictive of upcoming linguistic units. Finally, the design of our experiments limits our ability to discern whether participants generated predictions online while processing the speech, or only later while typing in the words after listening to the degraded speech.
To conclude, we show that task instructions affect the distribution of attention to the noisy speech signal. This, in turn, means that when insufficient attention is given to the context, top-down predictions cannot be generated, and the facilitatory effect of predictability is substantially reduced.
Supplementary Materials
To view supplementary materials for this article, please visit http://doi.org/10.1017/langcog.2022.16.
Data Availability Statement
The code and data mentioned above are available in the public repository of Open Science Framework – https://osf.io/t6unj/.
Conflict of Interest
We conducted this research with no relationship, financial or otherwise, that could be a potential conflict of interest.