1 Introduction
Our capacity to understand spoken language is remarkable. We achieve this seemingly with ease through complex and overlapping processes that take a continuous acoustic signal as input, leading to the perception of speech sounds and coherent speech; prosodic features such as rhythm, stress, and intonational contours and categories; discrete words and, ultimately, meaning.
Speech unfolds over time, often in challenging circumstances such as busy streets or noisy restaurants, at rates of 10–15 phonemes per second (Studdert-Kennedy, 1987), with the sound signal only available for 50–100 milliseconds in auditory memory (Elliott, 1962; Remez et al., 2010; Remez et al., 2008). This places great demands on the auditory system and even exceeds the system’s basic capabilities (Repp, 1988). As opposed to written language, speech has no blank spaces between words and no commas or full stops. Through multiple steps at timescales of tens to hundreds of milliseconds, the auditory system must transform this signal into phonetic representations that are interpretable to linguistic interfaces in our brains (Chomsky & Halle, 1968), segment the unbroken stream into possible words, make contact with long-term memory storage and finally reach the intended word and achieve comprehension. Even then, words themselves are complex, multi-layered and multi-modal entities (Durst-Andersen & Bentsen, 2021; Elman, 2004, 2009), and we add new words to our lexicon almost every day (Brysbaert et al., 2016). Complicating matters even further, speech is built up of potentially meaningless units – phonemes – that can be combined in countless ways to create and distinguish meaningful morphemes and words (Hjelmslev, 1961 [1943]; Hockett, 1958; Martinet, 1949), and we can even use these to create words or sentences that may never have been heard before but can be understood or guessed by listeners if they are constructed according to the morphosyntax or grammar of a certain language (Chomsky, 1957).
As a whole, speech can be considered a code that needs to be cracked by the listener to arrive at the message intended by the speaker (Liberman et al., 1967).
When listening to speech, our brains must map the continuous acoustic signal onto linguistic categories, interface with some type of mental lexicon in long-term memory and recognise the spoken word. There is also always a balance between the speaker’s articulatory economy and constraints, and the listener’s need for perceptual distinctiveness (Lindblom, 1990). Rather than a set of unique acoustic features neatly corresponding to each possible speech sound in a particular language, there is a many-to-one mapping problem inherent to speech that the system needs to overcome. Different speakers produce different speech sounds when pronouncing the same words. They may reduce or drop speech sounds altogether, and one speaker may even pronounce the same speech sounds differently on different occasions (Allen, Miller, & DeSteno, 2003; R. S. Newman, Clouse, & Burnham, 2001). Similar or identical speech sounds may also lead to different word meanings depending on their context in a word. The s in the word skit contributes differently to its meaning than the s in kits. In English, the presence or lack of a puff of aspiration after the initial plosive phoneme /k/ of the word cat does not change its meaning: it is merely treated as variance in the signal that still leads – through the transformation from signal to phonetic representations and categories – to the ‘invariant’ perception of the English phoneme /k/, regardless of its allophonic manifestation. In a language like Hindi, however, where aspiration is phonemic, such a sound difference can change the meanings of words, so that aspirated and unaspirated instances of /k/ in the speech stream need to be mapped to two different phonemes.
This is an overview of the challenges faced and – seemingly easily – overcome by the neural auditory system as it processes speech. Beginning with the history of neurolinguistics from the nineteenth century to the present day, it discusses modern neuroimaging methods and analysis techniques before a description of sound and speech, and how they are processed by the brain from cochlea to cortex, finishing with a few directions in which the field of phonetics in the brain is moving into the future.
2 The Birth of Neurolinguistics
2.1 Paul Broca: The Seat of Language
The foundations of cognitive neuroscience, neuropsychology and the subsequent development of neurolinguistics – as well as attendant neuro-prefixed subdisciplines such as neurophonetics – were arguably laid in the 1860s through the work of Paul Broca (1824–1880), a surgeon and anatomist based in Paris who studied the connection between brain damage and function. At the time, there were ongoing discussions about the localisation of brain functions, including the seat of language functions in the brain. The idea that brain functions could be localised to restricted areas of the cerebral cortex – the outermost layer of the brain, folded into grooves (sulci, singular sulcus) and ridges (gyri, singular gyrus) – had been contested but gained popularity through the work of Franz Gall and colleagues (Gall & Spurzheim, 1809). However, Gall viewed brain function through the framework of phrenology: the idea that the localisation of brain functions could be ascertained through measurements of bumps on the skull. While Paul Broca’s subsequent research into the connection between brain injury and language function would serve to discredit the claims of the phrenologists by showing that distinct areas of the brain were, in fact, important for language function, he was not the first to suggest a connection between pathology and brain function. A few decades earlier, Jean-Baptiste Bouillaud (1825) presented several cases of patients who had lost the ability to speak but could still understand spoken language.
While their damage was too extensive to draw any conclusions as to localised lesions, Bouillaud suggested that the anterior – or frontal – lobes contain an organe législateur de la parole: the legislative organ of speech, which could be paralysed in the absence of any other paralysis, and which contained subsystems for both the ‘intellectual’ (in the grey matter, i.e., neuronal cell bodies) and ‘muscular’ facets of speech (white matter, i.e., connections between cells). As we will see, the frontal lobes of the brain indeed contain crucial centres for speech and language – one named after Paul Broca himself – but the particular importance of the left hemisphere for language did not reach the mainstream until the Paris anatomist published his findings in the 1860s, with a previous similar suggestion by neurologist Marc Dax in 1836 having gone largely unnoticed (Dax, 1865).
On 11 April 1861, 51-year-old Louis Leborgne was admitted to Bicêtre hospital in Paris under the care of Paul Broca. Leborgne’s only response to questions was the syllable ‘tan’ repeated twice, accompanied by left-hand gestures (Broca, 1861b, 1861c). The speed at which he had lost his ability to speak was unknown, but when he was admitted to hospital, he had not spoken for two or three months. While he understood everything said to him and appeared to have good hearing, all he could say in return was ‘tan tan’. Broca suggested that the lesion had been relatively limited in size for the first ten years, but subsequently led to increasing paralysis of the limbs. The goal was now to identify the primary location of this lesion, and the suggestion was that it started in the left frontal lobe and spread to left subcortical areas. Twenty-four hours after Leborgne’s death on 17 April, an autopsy was conducted. It was concluded that the most extensive substance loss had occurred in the posterior part of the left inferior frontal gyrus, and that the lesion must have begun to form there, causing the aphemia – defined as damage to the faculty responsible for articulating words, a condition subsequently known as (motor) aphasia – and then slowly over the course of ten years spread to the insula as well as subcortical areas, leading to limb paralysis. Broca saw this connection between the brain damage and loss of speech as evidence that the localisation of the ‘seat’ of spoken language is incompatible with le système des bosses – phrenology – which had previously been proposed by Franz Gall.
This general language faculty was proposed to establish connections between ideas and signs – foreshadowing the theoretical work of Ferdinand De Saussure (1916) – and like Bouillaud before him, Broca made a distinction between the production and perception of language, the former but not the latter being damaged in the case of Leborgne.
Later that year, another patient – eighty-four-year-old Lazare Lelong – was admitted to Bicêtre for femoral surgery (Broca, 1861a). Following a fall and brain haemorrhage a year and a half earlier, he could only speak a few words, with difficulty, while he still understood everything that was said to him. The words he could produce did carry meaning in French: oui (‘yes’), non (‘no’), tois [trois] (‘three’), toujours (‘always’) and Lelo for Lelong. Trois appeared to encompass all numbers (he would say trois and indicate he meant ‘four’ with his fingers: the number of children he had), and toujours did not seem to have any specific meaning. Again, Broca concluded that what had been lost was the ‘faculty of articulated speech’ (faculté du langage articulé), but the case differed from Leborgne’s in that the patient could say several words and thus had a limited ‘vocabulary’. Lelong passed away on 8 November 1861. Following an autopsy, a lesion was found in the left frontal lobe. While it was considerably less widely spread than Leborgne’s lesion, it was noted that the ‘centre’ of Lelong’s lesion was in the same spot as that of Leborgne’s: the posterior part of the left inferior frontal gyrus. Pars opercularis and pars triangularis of the inferior frontal gyrus are today commonly referred to as Broca’s area (Brodmann, 1909). While the lesions present in these patients were subsequently found to extend further than Broca initially assumed (Dronkers et al., 2007) – in fact, Broca’s aphasia is commonly associated with damage to regions outside this area (Mohr et al., 1978) – we now know that Broca’s area indeed plays important roles in the articulation of speech as well as in semantic and syntactic processing (Goucha & Friederici, 2015).
Broca summarised his ideas about the lateralisation of language function by claiming that ‘we speak with the left hemisphere’, but that there is a minority of people who process speech in the right hemisphere. He was careful to note, however, that even in ‘left-brained’ people, the left hemisphere was not the only possible seat of the language faculty, that is, where the link between ideas and linguistic signs is established. Since the link between these concepts appeared intact in those patients who were still able to perceive and meaningfully comprehend language, Broca hinted that the general language faculty may be spread out over more areas, but gave only the broad suggestion that the right hemisphere of the brain may take on this role in case of damage to the left hemisphere (Broca, 1865).
2.2 Carl Wernicke: From Production to Perception
In the next decade, the discussion of language function and pathology in the brain expanded from language production to include the perception of speech. Another important innovation was the beginning of a move from isolated ‘seats’ of functions in the brain to recognising the importance of associations, or connections, between areas. German anatomist and neuropathologist Carl Wernicke (1848–1905) was inspired by Theodor Meynert, with whom he studied for six months, as well as by the work of Paul Broca. Meynert was a proponent of models of brain function that not only included discrete localised areas but also the connections between them, and – like Broca – he had also investigated the connection between aphasia and brain injury (Whitaker & Etlinger, 1993). Wernicke’s Der aphasische Symptomenkomplex [The Aphasic Syndrome] (1874) references Meynert’s ideas and was based on descriptions of patients who appeared to mainly have deficits in comprehending rather than producing speech. The first such description concerned a fifty-nine-year-old woman presenting with nausea and headaches. She could use words and phrases correctly and spontaneously but could only understand a few spoken words, with great difficulty. The second – a woman of seventy-five years who was initially assumed to be deaf – could not answer any questions correctly and used only a small number of words in her confused and garbled speech. Wernicke concluded that these patients had lost their ability to understand spoken language and that they showed signs of sensory or receptive aphasia. This contrasted with Broca’s patients, whose aphasia was primarily expressive. Wernicke suggested that the symptoms of sensory aphasia in these patients were due to damage to a posterior part of the left superior temporal gyrus (STG), which we commonly refer to today as part of Wernicke’s area.
However, it is important to note that definitions of the actual area – as is also the case with Broca’s area – often refer to anatomy rather than function (Binder, 2015). Thus, the picture becomes more complex when one considers the myriad functions served by the different constituent parts and cellular composition that make up the ‘classical’ language areas.
Wernicke applied Meynert’s ideas of neural connectivity to begin building an extended model of neurolinguistic brain function. He defined a speech centre in which the inferior frontal gyrus – Broca’s area – was responsible for motor-articulatory function and Wernicke’s area served as a sensory centre for conceptual ‘sound-images’ (Klangbilder). Wernicke also hypothesised a connection between the sound-image and motor areas. While the sound-image centres were assumed to be distributed bilaterally (across both hemispheres of the brain), the sound-image centre was only connected to the motor centre on the left side of the brain, leading to a generally left-dominant STG. Higher cognitive functions of the brain were thus not assumed to be localised to particular areas but arose as a result of connections between cortical areas. Wernicke originally proposed that the pathway between these two language centres would run through the insula, but later accepted that the relevant structure is the arcuate fasciculus, a bundle of fibres that connects temporal and parietal areas with the frontal lobe (Dejerine, 1895; Geschwind, 1967). Wernicke even hypothesised that damage to this connection would give rise to a new type of aphasia – Leitungsaphasie, or conduction aphasia – and he correctly predicted that this type of aphasia would lead to problems with spoken word or sentence repetition. Even though conduction aphasia is now known to be associated with damage to areas in the temporal and parietal cortices rather than the arcuate fasciculus (Buchsbaum et al., 2011; Shuren et al., 1995), this prediction was a testament to the innovation and explanatory power of his model.
2.3 From Neuroanatomy to Neuropsychology and Cognitive Neuroscience
Wernicke’s model was subsequently updated by German physician Ludwig Lichtheim (1885), whose aim was to describe the pathways necessary for both normal language function and pathology, and to relate functions to neurophysiology. He achieved this by adding complexities and nodes to Wernicke’s more rudimentary diagrams (see Figure 1), suggesting that once normal language function is established through the diagram and its assumed neurophysiological underpinnings, it would be possible to define language disorders by assuming lesions along the pathways. In this way, Broca’s aphasia was caused by damage to the ‘motor centre of speech’ (area M), Wernicke’s aphasia by damage to area A (the acoustic word-centre), conduction aphasia by damage to the connection between M and A, and so on. It was also assumed that node B, responsible for the elaboration of (semantic) concepts in accordance with Wernicke and connected to both A and M, was distributed over many areas of the cortex, something which has since been corroborated using modern neuroimaging techniques (Huth et al., 2016). Damage to the pathway B-M was associated with a condition today known as transcortical motor aphasia, linked to areas surrounding Broca’s area. The symptoms of this disorder are similar to those of Broca’s aphasia, with the main difference being that patients with transcortical motor aphasia can repeat words and sentences (Berthier et al., 1991).
The Wernicke-Lichtheim model and its focus on tying together models of brain function with underlying physiology constituted a crucial milestone for the development of modern neuropsychology and cognitive neuroscience. Another important contribution was the detailed map of the cerebral cortex by German neurologist Korbinian Brodmann (1909). Brodmann identified fifty-two regions of the cortex based on their cytoarchitectonic features, that is, their cellular composition. His map is still extensively used to refer to cortical areas today, with the abbreviation BA for ‘Brodmann area’. For example, Broca’s area is commonly defined as being made up of pars opercularis, or BA44, and pars triangularis, or BA45, two areas that Brodmann identified as being closely cytoarchitectonically related, while Wernicke’s area is often associated with BA22, the posterior part of the superior temporal gyrus, and BA40, the supramarginal gyrus, which borders BA22 with no sharp cytoarchitectonic boundary.
In the 1960s, American neurologist Norman Geschwind published highly influential papers that drew on and further updated the connectionist model of brain function (Geschwind, 1965a, 1965b). He stressed the importance of understanding disorders like aphasia as disconnection disorders, arising from damage either to white matter pathways between primary receptive and motor areas or to parts of the cortex known as association areas (‘obligatory way stations’), which receive inputs from multiple areas of the brain. Moreover, Geschwind extended the area of research from patient case studies to the mammalian brain in humans and other animals. In particular, he suggested that the inferior parietal lobule – comprising the angular and supramarginal gyri – is unique to humans among the primates due to its importance for speech processing. This evolutionarily advanced association area is surrounded by other association areas, leading Geschwind to call it the ‘association area of association areas’, or a secondary association area. It is involved in forming the cross-modal associations between auditory and visual representations and thus plays a role in language-specific tasks such as object naming and semantic processing (see Section 5). However, it also underpins more domain-general complex cognitive functions, such as future planning, spatial attention and social cognition (Numssen, Bzdok, & Hartwigsen, 2021).
The second half of the twentieth century saw rapid and paradigm-shifting advances in both linguistic theory (Chomsky, 1965a, 1965b) and cognitive neuroscience and methodology, as well as the establishment and development of psycholinguistics throughout the century as a fruitful and innovative field of research (Cutler, 2012). With these advances, the speech and language models of Broca and Wernicke turned out to be both anatomically and linguistically underspecified (Poeppel & Hickok, 2004). It has become clear that language comprises a number of complex systems – phonology, phonetics, syntax, semantics and so on, each made up of separate subsystems and each overlapping with other systems – and that neurolinguistic theories of brain function have to take into account the fact that language processing capacities underpinning these systems appear to be distributed across the brain: both hemispheres of the brain are involved, and processing takes place in both cortical and subcortical structures. To understand how language is processed in the brain, we need detailed theories of how both language and the brain work, with both sets of theories informing and constraining each other to create linking hypotheses and testable predictions.
Based on these theoretical and methodological advances from the decades leading into the twenty-first century, Hickok and Poeppel (2000, 2004) presented a functional anatomical framework of the cortical organisation of speech perception, as well as an account of different types of aphasias. It was grounded in the nineteenth-century idea that speech processing and comprehension necessarily involve interfaces with a conceptual and a motor-articulatory system. The framework drew inspiration from previous work on cortical vision and auditory processing that had identified functionally and anatomically differentiated processing streams in the cerebral cortex (Milner & Goodale, 1995; Rauschecker, 1998; Ungerleider & Mishkin, 1982): a ventral and a dorsal stream (Latin venter, ‘belly’; dorsum, ‘back [of the body]’). Since all tasks involving speech appear to activate the STG, the earliest stages of speech perception are proposed to involve auditory-responsive cortical fields in the STG bilaterally, that is, on both the left and right sides of the brain, albeit with some functional asymmetry, as originally suggested by Wernicke. This asymmetric lateralisation means that sound is processed differentially by the two hemispheres. For example, it has been suggested that the left hemisphere specialises in temporal processing and the right specialises in analysing spectral information (Zatorre, 1997). Alternatively, the left hemisphere may specialise in shorter temporal integration windows, or faster sampling rates (25–50 ms), while the right specialises in longer windows, or slower sampling rates (150–250 ms) (Poeppel, 2001, 2003).
Yet another reason for the left-hemisphere dominance for speech sounds is that categorical perception – a crucial speech perception mechanism (see Section 4) – appears to be subserved by areas in the left temporal lobe (Liebenthal et al., 2005).
After the acoustic-phonetic analysis, processing is split into two streams. The dorsal stream is critical for mapping sound onto auditory-motor (articulatory) representations, while the ventral stream maps sound onto meaning. The processing streams are bidirectional, so that they underpin both speech perception and production. Thus, the dorsal stream is involved in verbatim repetition tasks that require a mapping from conceptual to articulatory motor representations, and it may play a role in but is not critical for speech perception in passive listening conditions. The ventral stream is broadly responsible for comprehension, that is, the conversion of continuous speech input to something that can be analysed linguistically, as well as acoustic-phonetic processing and the interfaces between lexical and morphological and syntactic processing.
3 Neuroimaging
A range of powerful psycholinguistic behavioural paradigms have been devised to infer the structure, flow and time course of information processing in the brain. Examples include speech shadowing, where listeners can repeat incoming speech at latencies of 150–200 ms (Chistovich, 1960); dichotic listening, which can be used to determine the (most often left-dominant) laterality of speech function in individuals (Broadbent, 1954, 1956); word spotting, aimed at testing the process of speech segmentation (see Section 4) (McQueen, Norris, & Cutler, 1994); gating, where progressively longer portions of words are presented to test the time course of lexical processing (Grosjean, 1980); and lexical decision, where listeners determine whether a word is real or not (D. E. Meyer & Schvaneveldt, 1971; Rubenstein, Garfield, & Millikan, 1970). In addition, a number of neuroimaging techniques were developed in the twentieth century to track brain activity in space and time. In the following sections, the focus lies on electroencephalography, but other widely used neuroimaging techniques are functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG) and functional near-infrared spectroscopy (fNIRS).
3.1 Electroencephalography and Functional Magnetic Resonance Imaging
Electroencephalography (EEG) is a non-invasive method for measuring fluctuations in the naturally occurring electrical activity in the brain – measured in microvolts – at the millisecond scale, using electrodes placed on the scalp. The trace of voltage over time is referred to as the electroencephalogram. This electrical activity has its source in brain cells, or neurons. Originating near the cell body of a neuron, action potentials are electrical signals that the brain uses to convey, receive, and analyse information. Neurons receive information through several short processes known as dendrites and send signals to other neurons through an axon, which is a single, tubular process covered in the lipid substance myelin. The fatty myelin sheath acts as an insulating layer that increases the speed of action potentials. The axon ends in synapses, which contact other neurons. A transmitting cell is referred to as presynaptic and a receiving cell as postsynaptic. The activity reflected in the electroencephalogram is overwhelmingly made up of summed postsynaptic potentials (about 10–100 ms) as a result of activity in large groups of similarly oriented neurons rather than individual action potentials (about 1 ms), with the exception of the early auditory brainstem response (ABR), which reflects action potentials generated in the cochlea that travel through the auditory nerve (Pratt, 2011). Thus, the functional temporal resolution of EEG is commonly between tens and hundreds rather than single milliseconds.
The temporal resolution of EEG is in contrast with functional magnetic resonance imaging, which measures the comparatively slower magnetic signatures of blood flow to areas in the brain active in response to certain conditions (Ogawa et al., 1990). Put simply, as an area becomes active – compared to a baseline of, for example, rest or silence – blood flows to the area with every heartbeat to provide it with oxygen, replace the deoxygenated blood and replenish energy. In fact, more oxygen than is needed according to neuronal energy consumption is delivered to areas with increased neural activation (Fox et al., 1988). The main dependent variable in fMRI analyses – the blood-oxygen-level-dependent (BOLD) signal – is not a result of an increase in deoxygenated blood in active areas, which actually decreases the BOLD signal (Ogawa & Lee, 1990; Ogawa et al., 1990), but rather of oxygenated blood washing the deoxygenated blood away, providing an indirect but highly useful link between neural activity and the fMRI signal. It takes several seconds for oxygenated blood to saturate an area. Thus, the temporal resolution of fMRI and the BOLD signal is orders of magnitude lower than that of EEG (seconds vs. milliseconds). However, the spatial resolution of fMRI is excellent, allowing researchers to track active brain areas and networks at the scale of millimetres and below in three dimensions across the entire brain, including subcortical areas and areas deep within the brain. As such, it is mainly used to answer questions of ‘where’ something happens in the brain rather than ‘when’, including at the level of separate cortical layers (Lawrence et al., 2019).
Both neuroimaging methods can be combined and used concurrently with specialised EEG equipment and pre-processing methods, enabling correlations between BOLD signal and EEG amplitude to answer research questions that require both spatial and temporal data.
A technique which is closely related to EEG – magnetoencephalography – has been used to illustrate the time course of information transmission between different areas in the brain, taking full advantage of the excellent temporal resolution and the additional spatial resolution offered by the electromagnetic properties of groups of neurons as measured using MEG (Pulvermüller & Shtyrov, 2008; Pulvermüller, Shtyrov, & Ilmoniemi, 2003).
3.2 EEG and Event-Related Potentials
By placing electrodes on the scalp (normally ranging from 32 to 128, and up to 256 electrodes) and amplifying the signal, the voltage at each electrode can be tracked over time relative to a reference electrode. The reference electrode is ideally placed in a location where it picks up as little neural activity as possible, commonly the mastoid process behind the ear. The average signal of all electrodes can also serve as a reference. However, there is no perfect reference, and it should be chosen carefully since it can influence the data significantly. The temporal resolution of the signal depends on the sampling rate used for the recording, such that a 1 kHz rate provides a voltage reading per electrode every millisecond, 250 Hz every four milliseconds and so on. The sampling rate also limits the frequency content of the recording: the highest frequency that can be resolved is half the sampling rate (the Nyquist theorem). With a 250 Hz sampling rate, the highest reliably resolvable frequency is 125 Hz, or the Nyquist frequency. In addition to this, filters are usually applied to the EEG so that frequencies above and below certain thresholds are attenuated, and noise can be suppressed. High-pass filters – attenuating low frequencies – can help suppress slow drifts in the signal, often caused by perspiration (Picton & Hillyard, 1972). Low-pass filters, which attenuate frequencies above a certain threshold, have an anti-aliasing effect (attenuating potentially artifactual frequencies greater than the Nyquist frequency) and they also act to reduce the effect of high-frequency electromyographic muscle artifacts. The bone and connective tissue surrounding the brain act as a natural low-pass filter of frequencies, but also as a spatial filter, so that the signal is smeared and spread out against the skull (Srinivasan et al., 1996).
Thus, while EEG has excellent temporal resolution at the millisecond scale, its spatial resolution is poor: there is an infinite number of possible neural generators that can explain any given data measured on the scalp (the inverse problem).
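The relationship between sampling rate, temporal resolution and the Nyquist frequency described above can be made concrete with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are illustrative and do not belong to any EEG software package, and the folding formula assumes an idealised sampler with no anti-aliasing filter.

```python
def sample_interval_ms(sampling_rate_hz: float) -> float:
    """Time between successive voltage readings, in milliseconds."""
    return 1000.0 / sampling_rate_hz


def nyquist_hz(sampling_rate_hz: float) -> float:
    """Highest frequency that can be resolved at this sampling rate."""
    return sampling_rate_hz / 2.0


def aliased_hz(signal_hz: float, sampling_rate_hz: float) -> float:
    """Apparent frequency of a sinusoid once sampled (folding about Nyquist)."""
    folded = signal_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


for rate in (1000.0, 250.0):
    print(f"{rate:.0f} Hz: one sample every {sample_interval_ms(rate):.0f} ms, "
          f"Nyquist frequency {nyquist_hz(rate):.0f} Hz")

# A 150 Hz component recorded at 250 Hz without low-pass filtering
# would be indistinguishable from a 100 Hz component.
print(aliased_hz(150.0, 250.0))
```

The last line illustrates why low-pass filtering matters: energy above the Nyquist frequency does not disappear but reappears at a lower, artifactual frequency.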
The raw output of an EEG recording contains contributions from many different sources, such as muscle or cardiac activity, body movements and mains electricity noise, many of which have a signal strength orders of magnitude stronger than those of neural signals. There are various methods for reducing the impact of noise and artifacts. For example, discrete artifacts such as eye-blinks can be attenuated and effectively removed (Jung et al., 2000), and this is commonly combined with amplitude cut-offs so that trials with amplitude fluctuations exceeding ±100 µV are discarded before the final averaging and analysis, though some have argued for minimal pre-processing of relatively clean EEG data obtained in laboratory conditions (Delorme, 2023). By averaging the EEG across trials, neural responses to events – such as a visually or auditorily presented stimulus – can be isolated from the noise, leading to the extraction of event-related potentials (ERPs). These waveforms are time-locked to events and consist of negative or positive voltage deflections relative to a baseline, referred to as peaks or components, which can be defined as changes in voltage that vary systematically across conditions and subjects (Luck, 2014). Some components, such as P1, N1 and P2, are named based on whether the deflection is positive (‘P’) or negative (‘N’) relative to the baseline, as well as their occurrence relative to other waveforms (so that P2 is the second positive deflection after P1). These particular components – the P1-N1-P2 complex – are obligatorily evoked by auditory stimuli (Näätänen & Picton, 1987), such as pure tones or spoken words, and reflect the detection of auditory onsets as well as their acoustic properties.
Other components may have more descriptive names, such as the mismatch negativity (MMN), a widely used component that appears in response to a stimulus that deviates from previously repeated stimuli, indicating that the participant has perceived the difference between the two types of stimuli (Näätänen, Gaillard, & Mäntysalo, 1978). There are a number of more or less descriptively named components that are useful for speech research. The phonological mismatch (or mapping) negativity (PMN) is elicited by mismatched phonemes in words otherwise expected based on the sentence context (Connolly & Phillips, 1994; R. L. Newman & Connolly, 2009; R. L. Newman et al., 2003) and the left-anterior negativity (LAN) – named after its usual topographical distribution on the scalp – is found for morphosyntactic violations (A. J. Newman et al., 2007; Osterhout & Mobley, 1995).
When the EEG is averaged into event-related potentials, an ERP is obtained for each condition and each electrode site, in time-windows or epochs ranging from hundreds to thousands of milliseconds, typically preceded by a pre-stimulus baseline window of 100–200 ms. The most common analysis is of signal amplitude in a time-window, and this measurement is relative to the baseline. Thus, the experimenter must take care to ensure that the baseline does not differ between conditions since this may introduce confounds and influence the interpretation of ERPs. For example, if there is consistent noise in baselines in condition A but not in condition B, or if condition A stimuli are preceded by silence and stimuli in B are not, this will be reflected in the ERPs.
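The steps of baseline correction, amplitude-based trial rejection and averaging can be sketched on synthetic data. The example below is a simplified illustration in plain Python and not a description of any particular analysis pipeline: the sampling rate, noise level, the size and latency of the simulated deflection and the function names are all assumptions chosen for the demonstration.

```python
import math
import random

SFREQ = 250          # sampling rate in Hz (assumed)
N_BASELINE = 50      # 200 ms pre-stimulus baseline at 250 Hz
N_POST = 150         # 600 ms post-stimulus
REJECT_UV = 100.0    # absolute amplitude cut-off in microvolts


def simulate_epoch(rng, artifact=False):
    """One synthetic trial: a -5 uV deflection near 100 ms, noise and a DC offset."""
    offset = rng.uniform(-20.0, 20.0)
    epoch = []
    for i in range(-N_BASELINE, N_POST):
        t_ms = i * 1000.0 / SFREQ
        evoked = -5.0 * math.exp(-((t_ms - 100.0) / 30.0) ** 2) if i >= 0 else 0.0
        epoch.append(offset + evoked + rng.gauss(0.0, 10.0))
    if artifact:
        epoch[N_BASELINE + 10] += 300.0   # blink-like spike 40 ms after onset
    return epoch


def baseline_correct(epoch):
    """Subtract the mean pre-stimulus voltage from every sample."""
    base = sum(epoch[:N_BASELINE]) / N_BASELINE
    return [v - base for v in epoch]


def is_clean(epoch):
    """Keep the trial only if no sample exceeds the +/-100 uV cut-off."""
    return max(abs(v) for v in epoch) <= REJECT_UV


def average_erp(epochs):
    """Baseline-correct, reject artifacts, then average across trials."""
    kept = [e for e in map(baseline_correct, epochs) if is_clean(e)]
    erp = [sum(samples) / len(kept) for samples in zip(*kept)]
    return erp, len(kept)


rng = random.Random(0)
epochs = [simulate_epoch(rng, artifact=(k % 20 == 0)) for k in range(400)]
erp, n_kept = average_erp(epochs)
peak_index = N_BASELINE + 25          # sample at 100 ms post-stimulus
print(f"kept {n_kept} of {len(epochs)} trials")
print(f"amplitude at 100 ms: {erp[peak_index]:.2f} uV")
```

In a single trial the simulated 5 µV deflection is smaller than the noise, but it emerges in the average because noise that is not time-locked to the event tends to cancel across trials.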
ERPs are by definition tied to events, the onset of which may be more or less difficult to define. The onset of an auditory stimulus from silence constitutes a relatively well-defined context from which to time-lock and extract an ERP. However, this may be difficult to obtain in studies of intonation or other prosodic phenomena, for example, where the definition of a discrete, perceptually relevant event for time-locking could be more elusive, leading to temporal jitter that could mask small or short-lasting effects in the data. In these cases, it is important to perform detailed acoustic analyses of the stimuli to ascertain that event onsets are well controlled and that the baseline does not vary between conditions.
Since an ERP is obtained for each electrode site, one can calculate difference waves – reflecting the total amplitude difference over time between conditions A and B in a certain time-window – and subsequently construct a scalp map or topography of a component, showing its distribution across the scalp. In the case of the mismatch negativity, for example, the specific topography may vary depending on stimulus features, but it typically displays a frontal distribution (Alho, 1995), skewed towards the right hemisphere, except for language-related deviants which tend to show left-lateralised MMNs (Tervaniemi & Hugdahl, 2003). Component scalp topography is further influenced by the choice of reference site, and this must be taken into consideration when inspecting topography plots. While it is essentially impossible to infer the source of brain activity from component topography, it can be a useful tool to argue in favour of – or rule out – interpretations of ERP effects. Thus, if component A reliably shows frontal distributions in the literature and component B is typically posterior, this may be used to interrogate the interpretation of an effect elicited in a carefully controlled experiment.
The temporal resolution of EEG makes it an excellent method for answering a multitude of questions regarding cognitive processes. Like fMRI and other neuroimaging techniques, it does not require an overt response from participants, which means that brain activity can be probed without particular tasks, and the pool of participants that can be tested is expanded to non-verbal or pre-verbal populations, or to participants who cannot physically give overt responses to stimuli. EEG has been referred to as ‘reaction time for the 21st century’ (Luck, Woodman, & Vogel, 2000). As such, it can provide information about cognitive processes that are more or less invisible to mental chronometry, which may occur after or even before the onset of a stimulus. For example, EEG can reveal subconscious brain responses to phonetic differences between such things as minimally different speech sounds (for example, using the MMN). It can also measure grammaticality or acceptability of linguistic structures, at multiple levels and at the millisecond scale, without overt responses. In addition to this, it allows researchers to ask questions regarding the ordering and timing of cognitive processes. Thus, five-month-old pre-verbal infants have been shown to recode complex input into abstract categories within minutes of training, as evidenced by EEG mismatch responses (Kabdebon & Dehaene-Lambertz, 2019). EEG has also been used to show that the brain can tell the difference between real and pseudowords within thirty milliseconds of a mismatching phoneme (Shtyrov & Lenzen, 2017). With regard to the ordering of processes – and a caveat that the underpinnings and drivers of most ERP components still remain to be fully elucidated – two commonly studied ERP components, the N400 and the P600, are often found in succession.
This has sparked debates, for example, about whether semantic processing precedes syntactic processing in speech comprehension (Bornkessel-Schlesewsky & Schlesewsky, 2008), or vice versa (Friederici, 2002; but see also Steinhauer & Drury, 2012, for a critical review). Broadly speaking, the N400 is the default neural response to the semantic content of any potentially meaningful stimulus, occurring at 300–500 milliseconds after its onset. In addition to being modulated by lexical characteristics of words presented in isolation, the N400 amplitude is sensitive to the probability of encountering a word’s semantic features given the preceding context, and is larger in cases where a word is semantically unexpected, such as the sentence He spread the warm bread with socks (Kutas & Federmeier, 2011; Kutas & Hillyard, 1980; Kuperberg, Brothers, & Wlotko, 2020; Nour Eddine, Brothers, & Kuperberg, 2022). The P600 occurs at 600–1,000 milliseconds following the onset of a violation (Osterhout & Holcomb, 1992), and is suggested to reflect an error signal and subsequent reprocessing, reanalysis or reinterpretation as the brain tries to determine whether an initial decision was correct.
The P600 is not restricted to syntax as previously believed (Brouwer et al., 2017; Knoeferle et al., 2008; Kuperberg et al., 2020), but also appears in response to semantic or thematic incongruities such as For breakfast, the eggs would only eat … (Kuperberg et al., 2003), and its function has been connected to another common, non-linguistic neural error signal, the P300 (Coulson, King, & Kutas, 1998; Kuperberg et al., 2020; Sassenhagen, Schlesewsky, & Bornkessel-Schlesewsky, 2014).
In summary, the event-related potential technique allows researchers to gain insight into both the actual phenomena of written and spoken language, as well as the neural mechanisms that give rise to these potentials on the scalp. The latter can be achieved by combining EEG with fMRI to investigate the neural source of ERP components, or by computational modelling, where researchers build algorithmic models to determine which factors are necessary to give rise to effects similar to those found in ERPs. This is then used to create new hypotheses and predictions for further study. Much knowledge of ERPs has also come from meticulous and replicated experimental work. Alternatively, ERPs can be and are often used as an experimental tool without necessarily referencing the underlying mechanisms. For example, to determine whether native or second-language speakers can tell the difference between speech sound categories, such as dental and retroflex plosives, the presence of an MMN indicates that they are indeed perceived as different sounds. The MMN is also useful since it persists even in the absence of attention: participants often watch a silent film to divert attention away from the stimuli, but the MMN still occurs to signal perceptual differences between standard and deviant stimuli. Finding a LAN would, for example, suggest a subject-verb number agreement error (Osterhout & Mobley, 1995). Language learning has been tested using a methodology based on the N400 and P600 components: early learners have been shown to display an N400-like effect to syntactic errors, which changes to a P600 for later, more advanced learners (Osterhout et al., 2008).
3.3 Statistical Analysis of EEG Data
Traditionally, the amplitudes of event-related potentials have been interrogated using analyses of variance (ANOVAs). The experimenter chooses a time-window in which to extract average ERP amplitudes from the experimental conditions (the dependent variable) and performs an analysis which can include both within-subject (in an MMN experiment, this might be an experimental contrast between two phonemes of interest to the experimenter: one as a standard and one as a deviant) and between-subject factors (such as native/non-native speakers). Thus, data is averaged over time-windows, conditions and participants, and entered into a repeated-measures ANOVA. For the mismatch negativity, the dependent variable is often the difference between the average response to the standard stimulus and the average deviant response, that is, a difference wave. The MMN typically peaks between 100 and 250 ms (Näätänen, 1995; Schröger, 1997), and this is consequently a common a priori time-window for the statistical analysis of this component. Additionally, topographical factors are included in the ANOVA. This commonly involves a factor covering clusters of electrode sites along the anterior-central-posterior axis of the scalp as well as a laterality factor (left-right or left-midline-right). Interrogating interactions between these factors could thus reveal differences between experimental conditions with – for example – left-anterior or right-posterior topographical distributions, and so on.
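The extraction of a dependent variable from an a priori time-window can be sketched as follows. This is a deliberately reduced illustration in plain Python: the participants and their effect sizes are made up, the helper names are hypothetical, and the full repeated-measures ANOVA with topographical factors is replaced by a one-sample t-test on the difference-wave amplitudes, which for a single two-level within-subject factor yields the same inference (F equals t squared).

```python
import math
import statistics

TIMES_MS = list(range(0, 400, 4))     # 250 Hz sampling, 0-396 ms


def difference_wave(deviant, standard):
    """Deviant-minus-standard ERP, sample by sample."""
    return [d - s for d, s in zip(deviant, standard)]


def window_mean(wave, start_ms, end_ms):
    """Mean amplitude of a waveform within an a priori time-window."""
    values = [v for v, t in zip(wave, TIMES_MS) if start_ms <= t <= end_ms]
    return sum(values) / len(values)


def one_sample_t(values):
    """t statistic testing whether the mean of `values` differs from zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def toy_participant(mmn_uv):
    """Hypothetical ERPs: flat standard, deviant shifted by mmn_uv at 100-250 ms."""
    standard = [0.0 for _ in TIMES_MS]
    deviant = [mmn_uv if 100 <= t <= 250 else 0.0 for t in TIMES_MS]
    return standard, deviant


amplitudes = []
for mmn_uv in (-1.0, -2.0, -3.0, -2.0):          # made-up effect sizes
    standard, deviant = toy_participant(mmn_uv)
    diff = difference_wave(deviant, standard)
    amplitudes.append(window_mean(diff, 100, 250))

t_stat = one_sample_t(amplitudes)
print(amplitudes)
print(f"t({len(amplitudes) - 1}) = {t_stat:.2f}")
```

The key point is that the window (here 100–250 ms) is fixed before the data are inspected, so that a single value per participant and condition enters the statistical test.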
A common issue that arises in EEG data analysis is that pre-processing steps such as artifact rejection may lead to missing data, so that the number of observations differs across conditions, violating core equal-variance assumptions used in ANOVAs. One proposed solution is mixed-effects models (Baayen, Davidson, & Bates, 2008), which are more robust in this respect and decrease the risk of Type I errors, that is, rejecting the null hypothesis when it is actually true (false positive). Furthermore, mixed-effects models have the advantage of allowing the experimenter to account for the effect of participant and item variability (Barr et al., 2013) – random effects – on the dependent variable, as well as include both categorical and continuous variables in the analysis of EEG data (N. J. Smith & Kutas, 2015).
Neuroimaging data is also highly multidimensional, with potentially thousands of readings per second (time) per electrode site (space) in an EEG recording. What is often referred to as the multiple-comparisons problem arises from the large number of simultaneous statistical comparisons, which increases the risk of erroneous statistical inferences such as Type I (false positive) and II (false negative) errors. The problem can be exacerbated by the experimenter visually inspecting the waveforms to choose a time-window (or cluster of electrodes) for analysis where the difference between conditions appears largest but may simply be due to noise. It is therefore recommended that researchers select time-windows and electrode sites for analysis in a theory-driven manner, based on a priori assumptions from the literature (such as in the MMN example in Section 3.2). Another, data-driven way of analysing ERPs and correcting for multiple comparisons is the non-parametric cluster-based permutation approach, which has gained popularity in recent years. Here, data-driven refers to the fact that – apart from extracting epochs time-locked to events – one does not need to know the spatiotemporal distribution of the effect in advance: it allows for ‘prior ignorance’, as well as exploratory analyses of potentially novel phenomena (Maris & Oostenveld, 2007; Sassenhagen & Draschkow, 2019). This method does not require the researcher to select a particular time-window for analysis, and it solves the multiple-comparisons problem by reducing comparisons of condition differences in each sample to a single comparison between experimental conditions in a spatiotemporal grid (Maris & Oostenveld, 2007), thereby decreasing Type I error rates (Pernet et al., 2015).
It also takes advantage of the fact that in high-dimensional spatiotemporal data such as EEG, clusters of adjacent electrodes (and time-points) are likely to show similar effects in time and space, leading to increased statistical sensitivity – and a lower Type II error rate – compared to methods such as Bonferroni correction, by providing prior knowledge about the expected effect (Maris & Oostenveld, 2007). However, while cluster-based permutation techniques have been claimed and used to find the time at which effects begin (effect onset) in the literature, this type of analysis does not in fact test the statistical significance of effect latency (i.e., its onset in milliseconds) or topography (scalp distribution). Thus, care must be taken when interpreting the analysis output, so as not to overstate the significance of latency or topography results (Sassenhagen & Draschkow, 2019).
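The logic of the cluster-based permutation approach can be illustrated with a stripped-down, one-dimensional version that clusters over time only. The sketch below is written in plain Python under several assumptions that are not taken from the cited papers: a within-subject design in which each participant contributes one difference wave, random sign-flipping as the permutation scheme, summed t-values as the cluster statistic, and a cluster-forming threshold of 2.131 (the two-tailed 5 per cent critical t-value for 15 degrees of freedom). Established toolboxes additionally cluster over neighbouring electrodes.

```python
import math
import random


def t_values(diffs):
    """One-sample t statistic at each time point; diffs[participant][time]."""
    n = len(diffs)
    out = []
    for column in zip(*diffs):
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out


def max_cluster_mass(tvals, threshold):
    """Largest |summed t| over runs of adjacent, same-sign, supra-threshold samples."""
    best = mass = 0.0
    sign = 0
    for t in tvals:
        s = (t > threshold) - (t < -threshold)
        if s != 0 and s == sign:
            mass += t            # extend the current cluster
        elif s != 0:
            mass = t             # start a new cluster
        else:
            mass = 0.0           # sub-threshold sample ends any cluster
        sign = s
        best = max(best, abs(mass))
    return best


def cluster_permutation_p(diffs, threshold=2.131, n_perm=500, seed=0):
    """Compare the observed maximum cluster mass with a sign-flip null distribution."""
    rng = random.Random(seed)
    observed = max_cluster_mass(t_values(diffs), threshold)
    exceed = 0
    for _ in range(n_perm):
        flipped = []
        for row in diffs:
            sgn = rng.choice((-1.0, 1.0))
            flipped.append([sgn * v for v in row])
        if max_cluster_mass(t_values(flipped), threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic difference waves: 16 participants, 60 samples, effect at samples 20-34.
rng = random.Random(1)
diffs = []
for _ in range(16):
    row = [rng.gauss(0.0, 1.0) for _ in range(60)]
    for t in range(20, 35):
        row[t] -= 2.0
    diffs.append(row)

observed_mass, p_value = cluster_permutation_p(diffs)
print(f"largest cluster mass = {observed_mass:.1f}, p = {p_value:.3f}")
```

Only one statistic per permutation (the largest cluster mass) enters the null distribution, which is how the many sample-wise comparisons are reduced to a single test. In line with the caveat above, the resulting p-value licenses the claim that the conditions differ, not a claim about exactly when or where the difference begins.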
Yet another increasingly popular method for analysing EEG data is multivariate pattern analysis (MVPA). MVPA encompasses a set of neuroimaging analysis methods where machine-learning classifier algorithms use patterns of brain activation to ‘decode’ the underlying model that explains the data. A subset of the neuroimaging data is used to train the classifier to distinguish a reliable difference in brain activation pattern between the experimental conditions, and the classifier is then evaluated on the remaining data. Whether its accuracy exceeds chance can be tested using parametric tests such as Student’s t-test or nonparametric tests like the Wilcoxon signed-rank test or permutation tests, with different options for addressing the multiple-comparisons problem (Grootswagers, Wardle, & Carlson, 2017). The classifier’s decoding accuracy can be tracked over time at the millisecond scale, making MVPA an excellent tool to investigate the temporal dynamics of neural processes and information processing in the brain. As in traditional ERP analyses, the data can consist of evoked brain responses to stimuli, such as images or sounds. Like cluster-based permutation, MVPA allows for similar ‘prior ignorance’, as well as exploratory analyses, regarding the spatial distribution and timing of effects. It can also have increased sensitivity compared to traditional univariate approaches, with multivariate techniques more capable of detecting subtle differences between conditions at an earlier stage (Cauchoix et al., 2014). A typical use of MVPA decoding could be an experiment where a participant views green squares or blue circles, while their brain activity is recorded using EEG or MEG. The aim is then to predict – based on patterns of brain activation – whether the participant viewed a green square or a blue circle, with the assumption that the brain activation patterns differ between the two conditions (Grootswagers et al., 2017).
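Time-resolved decoding of the kind described above can be sketched on synthetic data. The example below is a minimal illustration in plain Python and makes a number of assumptions of its own: a nearest-centroid classifier stands in for the classifiers used in the literature, cross-validation uses five interleaved folds, and the simulated ‘channels’, trial counts and effect onset are arbitrary choices.

```python
import random

N_CHANNELS, N_TIMES, ONSET, N_TRIALS = 8, 20, 10, 40
PATTERN = [0.8 if c % 2 == 0 else -0.8 for c in range(N_CHANNELS)]


def simulate_trials(rng, sign):
    """trials[trial][time][channel]; the condition pattern appears from ONSET on."""
    return [[[rng.gauss(0.0, 1.0) + (sign * PATTERN[c] if t >= ONSET else 0.0)
              for c in range(N_CHANNELS)]
             for t in range(N_TIMES)]
            for _ in range(N_TRIALS)]


def centroid(vectors):
    """Mean channel pattern across a set of trials."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two channel patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decode_timepoint(a_vecs, b_vecs, n_folds=5):
    """Cross-validated nearest-centroid accuracy at a single time point."""
    correct = total = 0
    for fold in range(n_folds):
        train_a = [v for i, v in enumerate(a_vecs) if i % n_folds != fold]
        train_b = [v for i, v in enumerate(b_vecs) if i % n_folds != fold]
        cen_a, cen_b = centroid(train_a), centroid(train_b)
        for label, vecs in (("A", a_vecs), ("B", b_vecs)):
            for i, v in enumerate(vecs):
                if i % n_folds == fold:          # held-out trials only
                    guess = "A" if sq_dist(v, cen_a) < sq_dist(v, cen_b) else "B"
                    correct += guess == label
                    total += 1
    return correct / total


rng = random.Random(0)
cond_a = simulate_trials(rng, +1.0)
cond_b = simulate_trials(rng, -1.0)
accuracy = [decode_timepoint([tr[t] for tr in cond_a], [tr[t] for tr in cond_b])
            for t in range(N_TIMES)]
for t, acc in enumerate(accuracy):
    print(t, f"{acc:.2f}")
```

Before the simulated onset the two conditions are statistically identical, so accuracy should hover around the 50 per cent chance level, and it should rise once the condition-specific pattern is present. Testing held-out trials only is what keeps the accuracy estimate from being inflated by overfitting.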
In spoken-language research, MVPA has been used to investigate multiple levels of linguistic processing simultaneously, tracking near-parallel brain responses to grammatical and ungrammatical structures, words and pseudowords, as well as semantic features in task-free paradigms, allowing the inclusion of participants unable to give an overt response, such as those with brain damage or children with developmental disorders (Jensen, Hyder, & Shtyrov, 2019). At lower levels of speech perception, MVPA has begun to be used to investigate long-standing questions, taking advantage of its excellent tracking of temporal dynamics in information processing. Beach et al. (2021) applied MVPA to MEG brain responses to syllables on the ba-da continuum in active and passive listening conditions to investigate the stages involved in the transformation from detailed (continuous) acoustic analysis to (categorical) phonemic representations, to determine for how long subphonemic information is available. Stimulus decoding accuracy above chance began at 165 milliseconds, underpinned chiefly by activity in the left hemisphere. It was observed for longer when a response was required, suggesting that decision-relevant stimulus information was available for longer in the active condition. Furthermore, even when a categorical phoneme representation had been reached (see Section 4), subphonemic information was still available, something which may be important in higher levels of spoken-word recognition and lexical processing, allowing recovery from an initial word interpretation that turns out to be incorrect (McMurray, Tanenhaus, & Aslin, 2009), as well as for perceptual processes such as compensation for coarticulation (see Section 4).
4 From Sound to Perception
Sound is a sensation produced by waves of energy that cause pressure changes in the air. The number of pressure changes – increases and decreases in air pressure – per time period is referred to as the frequency of the sound, measured in Hertz (Hz, cycles per second). The size of the wave is referred to as its amplitude. The range of hearing for young, healthy adult humans is around 20–20,000 Hz (Fletcher, 1940). Speech is a complex and rapidly changing sound signal, made up of a spectrum of sound waves with different frequencies and amplitudes. As a comparison, the timbre of an instrument is one of the main distinguishing factors between the sound of an oboe or a bassoon playing the same note at the same volume, where sound waves emanating from these instruments have different spectra – or increased magnitude at certain frequencies – leading, along with some other factors, to the perception of a bassoon or an oboe. The perception of speech sounds is similarly dependent upon changes in the sound spectrum over time. When we listen to speech, peaks or components at certain frequencies in the spectrum can lead to the perception of a certain speech sound, such as a vowel or a consonant. These peaks are commonly referred to as formants and are numbered from 1 upwards (F1, F2, F3 and so on). The fundamental frequency (F0) underlies the perception of the pitch of the sound, so that, in standard concert tuning, middle C has a fundamental frequency of approximately 262 Hz, and the F0 of the A above it is 440 Hz, with octaves at double those values and harmonics as multiples of the fundamental. Fundamental frequency in speech varies between 100 and 250 Hz (Peterson & Barney, 1952). However, pitch can still be perceived even if all F0 energy is removed (the missing fundamental effect), and thus the perception of pitch is more complex than a simple tracking of the fundamental frequency (Hall & Peters, 1981).
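The arithmetic relationship between a fundamental, its harmonics and the octave can be shown in a few lines. The sketch below is an idealisation in plain Python with illustrative function names: it treats partials as exact integer frequencies and recovers the implied fundamental as their greatest common divisor. It is not intended as a model of how the auditory system extracts pitch, only as a demonstration that the harmonic spacing still specifies F0 when the fundamental itself carries no energy.

```python
from functools import reduce
from math import gcd


def harmonics(f0_hz: int, count: int) -> list[int]:
    """First `count` harmonics: integer multiples of the fundamental."""
    return [f0_hz * k for k in range(1, count + 1)]


def implied_fundamental(partials_hz: list[int]) -> int:
    """Greatest common divisor of the partials: the F0 they are harmonics of."""
    return reduce(gcd, partials_hz)


full = harmonics(200, 5)              # 200, 400, 600, 800, 1000 Hz
without_f0 = full[1:]                 # F0 energy removed
print(full, implied_fundamental(full))
print(without_f0, implied_fundamental(without_f0))
print(440 * 2)                        # the octave above A at 440 Hz
```

Both sets of partials imply the same 200 Hz fundamental, which parallels the missing fundamental effect described above.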
The vocal tract, including constrictions produced by the tongue, teeth and lips, acts as a filter of the signal originating in the glottis, and this leads to formants, or peaks, of acoustic energy in the speech signal at certain frequencies (Fant, 1970). Formants are a result of factors such as the configuration or length of the vocal tract from the glottis to the lips – such that longer vocal tracts lead to lower formant frequencies – as well as parts of the vocal tract modulating the acoustic sound waves along the way. In this way, an important difference between the words tar and tea in a non-rhotic version of English (where post-vocalic /r/ is absent) lies in the formant structure of the vowels: the height of the tongue body modulates the frequency of the first formant (F1), so that low vowels (as in tar) have a higher-frequency F1 than high vowels (tea) (Peterson & Barney, 1952). In addition to this, the vowel in tea is articulated with the tongue body to the front of the mouth, leading to a higher second formant (F2) than the vowel in tar, which is articulated further back in the mouth. In this way, the two-dimensional vowel space of a particular language or speaker can be modelled as a function of first and second formant frequency and thus along the low-high and front-back dimensions.
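The idea of a two-dimensional vowel space can be illustrated by treating each vowel as a point with F1 and F2 coordinates. In the sketch below, written in plain Python, the formant values are hypothetical round numbers chosen only to respect the qualitative pattern described above (low F1 and high F2 for the vowel in tea, high F1 and low F2 for the vowel in tar); they are not measurements from any study. Classifying by Euclidean distance in Hertz is likewise a simplifying assumption.

```python
import math

# Hypothetical (F1, F2) reference points in Hz, for illustration only.
VOWEL_SPACE = {
    "tea": (300.0, 2300.0),   # high front vowel: low F1, high F2
    "tar": (700.0, 1100.0),   # low back vowel: high F1, low F2
}


def nearest_vowel(f1_hz: float, f2_hz: float) -> str:
    """Return the reference vowel closest to the given (F1, F2) point."""
    return min(VOWEL_SPACE,
               key=lambda word: math.dist((f1_hz, f2_hz), VOWEL_SPACE[word]))


print(nearest_vowel(320.0, 2200.0))
print(nearest_vowel(650.0, 1200.0))
```

A realistic vowel space would contain more categories, would vary with speaker and vocal tract length, and would usually be expressed on a perceptually motivated frequency scale rather than in raw Hertz.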
A purely monochromatic – single-wavelength – sinusoidal wave cannot transmit any information, and thus changes or modulations to the carrier signal are crucial for the transmission of information (Picinbono, 1997). From the structures of the ear all the way to the cerebral cortex, the auditory system acts as an analyser of (patterns of) frequency but also of the temporal information of sound, such as changes or modulations in sound amplitude over time, including silences that contain no energy but may be informationally salient with regard to features such as stop consonant voicing (Repp, 1988; Rosen, 1992). All natural sounds involve patterns of amplitude modulation. The auditory system decomposes complex sounds into a number of filtered signals divided into different frequency bands (Fletcher, 1940; Moore, Glasberg, & Baer, 1997), but ultimately accomplishes the subsequent integration of these frequencies to give rise to the percept of speech sounds such as vowels rather than a disjointed collection of formant frequencies (Repp, 1988). All natural sounds such as music and speech can be described in the form of amplitude modulations over time and frequency (Singh & Theunissen, 2003). These modulations are crucial for speech perception (Chi et al., 1999), and the auditory system appears to be highly selective towards and specialised for amplitude-modulated natural sounds such as speech (Joris, Schreiner, & Rees, 2004; Koumura, Terashima, & Furukawa, 2023; Liang, Lu, & Wang, 2002; Yin et al., 2011).
The system is also robust to degraded spectral resolution (Remez et al., 1981) – something which often occurs in hearing-impaired listeners or users of cochlear implants – so long as amplitude modulations are preserved. When spectral information is degraded, cues to voicing and manner in consonants are still correctly perceived, whereas the perception of vowels or of consonantal place of articulation – which require more spectral information – is less accurate (Loizou, Dorman, & Tu, 1999; Shannon, Zeng, & Wygonski, 1998; Shannon et al., 1995). An important concept in the neural processing of speech is the amplitude envelope, which refers to changes to intensity and duration in sound amplitude – including falls and rises – over time, across a range of frequency bands. A simple fluctuating amplitude envelope can be created by mixing two pure tones with slightly differing frequency, something which will give rise to the perception of beats (Helmholtz, 1877/1895) and is occasionally used by musicians to tune instruments to a reference pitch. Another concept is the temporal fine structure of sound. In terms of signal processing, the fine structure can be viewed as a carrier signal, while the envelope is an amplitude modulator of that signal (Hilbert, 1912). A word spoken in isolation will bring about an onset of the amplitude envelope above the ambient noise level of the environment. The rate at which the amplitude rises (its rise time from onset to maximum amplitude) is an important cue to things such as speech rhythm. Listeners are also more sensitive to spectrotemporal features of sound onsets than offsets, such that sound onsets receive greater perceptual weighting (Phillips, Hall, & Boehnke, 2002).
Fluctuations in factors such as intensity give rise to the perception of loudness, while the duration of modulations can be heard as differences in vowel length, such as English hit and heat. Conversely, offsets in the envelope can be important cues to segmental information, syllable structure, or the endings of words or phrases. The envelope thus represents relatively slow fluctuations in amplitude over time and can be imagined as the upper and lower outlines of the speech signal. Its different frequency bands contain different information that is useful for perceiving speech, as well as information linked to the physical characteristics of the speaker. Many parallel streams of information thus occur and are processed simultaneously, but across different timescales. For example, some parts of the speech signal, such as prosodic phrase-boundary marking, can occur over longer timescales (several seconds) than others, such as the realisation of stop consonants, which occur over tens of milliseconds. Fast or transient changes in envelope amplitude are thus an important cue for distinguishing consonants (such as stops) from non-consonants, and the amplitude information contained over the tens of milliseconds of consonant release burst further acts as a cue to place of articulation, such as the difference between labial ba and velar ka (Stevens & Blumstein, 1978). Similarly, the perception of segments differing in manner of articulation – for example, the distinction between the voiceless fricative /ʃ/ and affricate /t͡ʃ/ in sheep and cheap respectively – relies on factors such as the rise time and overall duration of the frication noise (Howell & Rosen, 1983; Repp et al., 1978).
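The beating produced by mixing two pure tones, mentioned above as a simple fluctuating amplitude envelope, offers a compact illustration of the distinction between carrier (fine structure) and envelope. By a standard trigonometric identity, the sum of two sinusoids at frequencies f1 and f2 equals a carrier at their mean frequency multiplied by a slowly varying envelope, 2·cos(π(f1 − f2)t), whose magnitude waxes and wanes |f1 − f2| times per second. The plain-Python sketch below checks this numerically; the tone frequencies and sampling rate are arbitrary choices for the demonstration.

```python
import math


def two_tone(t, f1, f2):
    """Sum of two unit-amplitude pure tones at time t (seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)


def carrier_times_envelope(t, f1, f2):
    """The same signal written as a slow envelope modulating a fast carrier."""
    envelope = 2.0 * math.cos(math.pi * (f1 - f2) * t)
    carrier = math.sin(math.pi * (f1 + f2) * t)
    return envelope * carrier


def beat_rate_hz(f1, f2):
    """Number of loudness fluctuations (beats) per second."""
    return abs(f1 - f2)


f1, f2, sfreq = 440.0, 443.0, 8000
max_error = max(abs(two_tone(n / sfreq, f1, f2)
                    - carrier_times_envelope(n / sfreq, f1, f2))
                for n in range(sfreq))
print(f"largest discrepancy over one second: {max_error:.2e}")
print(f"beat rate: {beat_rate_hz(f1, f2):.0f} Hz")
```

Two tones 3 Hz apart thus produce three beats per second, and the beats slow down and vanish as the tones are brought into unison, which is what makes them useful for tuning.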
Figure 2 shows the waveform and spectrogram for the spoken sentence dunk the stale biscuits into strong drink. The spectrogram does not represent the sensory input directly, but rather a transformation of the data which is similar to that undertaken by the auditory system.
Different frequency bands of the envelope have been shown to be useful for different types of cues in speech perception, with some bands containing complementary information about similar percepts. Amplitude modulations in the lower frequency band of the envelope (1–50 Hz, or modulations over a range of 20–1,000 ms) contain cues used in the processing of syllables (at around 2–5 Hz, or 200–500 ms). Modulations in this band also provide some information about phonemic segmental identity, such as the difference between voiceless affricates and fricatives. The band between 50 and 500 Hz (2–20 ms) contains information about periodicity in the speech signal. For example, the perception of voice pitch or melody is dependent upon changes in fundamental frequency (F0), which in turn is reflective of the rate of change of periodic fluctuations in vocal fold vibrations (Rosen, 1992). An example of a segment marked by increased periodicity or quasi-periodicity is the voiced nasal /m/ in mat. Aperiodicity – an irregular or random pattern of fluctuations over time – on the other hand can lead to the perception of noise, voicelessness or frication in segments (such as /h/ in horse or /ʃ/ in ship) as well as voicing distinctions between allophones. In addition to the concept of lexical competition (Norris, McQueen, & Cutler, 1995), voicing distinctions have been found to be useful in the segmentation of speech. Since spoken language does not include blank spaces between words that are reliable for speech segmentation, we must use other cues to divide or segment the continuous speech signal into discrete items. Thus, the spectral content of the amplitude envelope turns out to be an important perceptual cue to the difference between phrases and words like night rate and nitrate, where the /r/ is voiced in night rate but not in nitrate.
This is due to the aspiration of the /t/ in nitrate rendering the /r/ more aperiodic and thus devoicing it (Lehiste, 1960). While there are no pauses between words in speech, silences between individual speech sounds can, in fact, change phonemic perception. For example, introducing a sufficiently long silent interval between /s/ and /l/ in slit can give rise to a percept of split (J. Bastian, Eimas, & Liberman, 1961), while an extended silence between the words grey ship can result in a percept of great ship (Repp et al., 1978). Different languages provide different cues to speech segmentation, involving phonotactics (McQueen, 1998; McQueen & Cox, 1995), vowel phonology (Suomi, McQueen, & Cutler, 1997), metrical structure (Cutler & Norris, 1988; Norris et al., 1995) and lexical prosody (Söderström, Lulaci, & Roll, 2023).
The temporal fine structure of sound, which carries timbre and formant patterns, occupies a wide frequency band between 600 and 10,000 Hz (Rosen, 1992). The fine structure also carries information important for pitch perception (Z. M. Smith, Delgutte, & Oxenham, 2002). In vowels, high-frequency spectral formant information from vocal tract configuration is crucial for their identity. For example, a vowel articulated at the front of the mouth such as /i/ contains more high-frequency energy than /u/, which is articulated at the back of the mouth. Whereas consonants are marked by rapid spectral changes at around 10–30 ms, spectral changes in vowels (monophthongs) are slower and less dynamic (Stevens, 1980). Rapid formant transitions also give important information about the identity of unfolding speech sounds, such as the difference between date and gate, which is dependent on both the spectral information in the word-initial burst and on the following dynamic formant transitions marking the unfolding diphthong (Hazan & Rosen, 1991).
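The correspondence between modulation rates and time windows in the three bands described above can be made concrete with a small sketch. The band edges are those given in the text; the band labels, the function names and the decision to leave 500–600 Hz unassigned are illustrative assumptions rather than part of any published procedure.

```python
from typing import Optional

# Band edges in Hz as stated in the text; the labels are illustrative.
BANDS = [
    ("envelope", 1.0, 50.0),
    ("periodicity", 50.0, 500.0),
    ("fine structure", 600.0, 10_000.0),
]


def modulation_period_ms(rate_hz: float) -> float:
    """Convert a modulation rate in Hz to its period in milliseconds."""
    return 1000.0 / rate_hz


def classify_rate(rate_hz: float) -> Optional[str]:
    """Return the first band containing rate_hz, or None if no band does.

    Assumption: a rate on a shared edge (50 Hz) goes to the lower band.
    """
    for name, low, high in BANDS:
        if low <= rate_hz <= high:
            return name
    return None


if __name__ == "__main__":
    for rate in (4.0, 200.0, 2500.0):
        print(rate, "Hz ->", modulation_period_ms(rate), "ms,", classify_rate(rate))
```

A syllable-rate modulation of 4 Hz thus corresponds to a 250 ms window and falls in the envelope band, whereas a 200 Hz fluctuation (5 ms) falls in the periodicity band.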
Apart from a detailed analysis of sound, the auditory system must be able to handle noise and variability in the signal, with the objective of extracting meaning from the acoustic sound waves reaching our ears. We regularly hear speech in wildly differing contexts and from different speakers, meaning that it is highly variable, and words and speech sounds are almost never heard out of the context of surrounding speech sounds. In quiet conditions, speech intelligibility is relatively intact even when spectral information is reduced and only envelope information is preserved (Reference Loizou, Dorman and TuLoizou et al., 1999; Reference Shannon, Zeng, Kamath, Wygonski and EkelidShannon et al., 1995), but speech in the presence of background noise may require more fine-structure information in order to be intelligible (Reference Qin and OxenhamQin & Oxenham, 2003; Reference Shamma and LorenziShamma & Lorenzi, 2013).
While the continuous signal proceeds in a ‘left-to-right’ fashion, the system transforms the signal through non-linear, parallel processes in which the perception of individual speech sounds is ultimately influenced by context at different levels. In this way, perception is driven by our experience with acoustic stimuli: it is active (Reference BajcsyBajcsy, 1988; Reference Helmholtz and KahlHelmholtz, 1878/1971). For example, syntactic boundaries between words are not necessarily marked by silences in the acoustic signal (see examples in Figure 2), but listeners nevertheless analyse phrases as perceptual units, as shown by experiments where listeners erroneously report hearing superimposed clicks at syntactic boundaries (Reference Fodor and BeverFodor & Bever, 1965; Reference Holmes and ForsterHolmes & Forster, 1972). The listener thus contributes perceptual structure to the signal, based on experience and rules of a particular language (Reference Fodor and BeverFodor & Bever, 1965). Identical acoustic signals can be perceived as different phonemes depending on context, and instances of the same phonetic category also vary in their physical properties within speakers (Reference Allen, Miller and DeStenoAllen et al., 2003; Reference Newman, Clouse and BurnhamR. S. Newman et al., 2001). For example, if we consider the formant frequencies of the speech sound /d/ as in date, the acoustics are so strongly influenced by the following vowel that it is impossible to find one definitive acoustic correlate that sets the sound apart as a /d/ (Reference Liberman, Delattre, Cooper and GerstmanLiberman et al., 1954). Studies using synthesised tone glissandos closely matching formant frequencies and transitions confirm that there is no simple psychoacoustic mapping between spectrotemporal properties and phonetic perception (Reference Klatt and ShattuckKlatt & Shattuck, 1974).
A common source of context-dependent signal variance that the system must be able to handle is coarticulation, where the phonetic context – surrounding speech sounds – leads to the realisation of an allophonic (non-phonemic) variant of the intended speech sound. Since our articulators take time to shift between different configurations, articulatory gestures flow into and modify one another. For example, velar stops are articulated more frontally before a front vowel like /i/, and more towards the back of the mouth before a back vowel like /u/ (Reference ÖhmanÖhman, 1966), while lip spreading or rounding influence the frequency content of a fricative like /s/ in see and Sue respectively: in Sue, the spectral energy of the fricative noise is lower, creating an anticipatory cue to the degree of roundedness of the upcoming vowel (Reference Lulaci, Tronnier, Söderström and RollLulaci et al., 2022; Reference Schreiber and McMurraySchreiber & McMurray, 2019). The perception of unvoiced stops similarly differs depending on the subsequent vowel: identical noise bursts are identified as /p/ if they precede /i/ or /u/, but as /k/ if they precede /a/ (Reference Liberman, Delattre and CooperLiberman, Delattre, & Cooper, 1952). Conversely, the second vowel in Henry is darker than that in Henley in Standard Southern British English, due to a lowering of F2 and F3 under influence from the consonant /ɹ/ (Reference Local and KellyLocal & Kelly, 1986). Coarticulatory effects are not restricted to immediately adjacent speech sounds but can spread through entire syllables and even further ahead in an utterance. At the level of the syllable, information about syllable-final voicing is available as early as in a syllable-initial phoneme, such as in the words lack and lag, where voiced codas are preceded by longer vowels with lower F1 and higher F2, as well as a darker and longer syllable-initial /l/ in British English (Reference Hawkins and NguyenHawkins & Nguyen, 2004). 
Over even longer timespans, anticipatory effects of an /ɹ/ can be heard up to five syllables, or one second, before the speech sound is heard, such as in a sentence like we heard it might be ram (Heid & Hawkins, 2000). It has been suggested that coarticulation may serve a communicative function in speech (Whalen, 1990), including over longer timespans or domains (West, 1999).
The extent of coarticulation between neighbouring sounds varies across languages (Manuel, 1999). For example, in English, the word bank is realised with the voiced velar nasal /ŋ/ under influence of the velar /k/. This type of coarticulatory process can be either optional or obligatory, both within and between languages. In Russian, the similar word банка (/banka/, ‘jar’) is realised with the voiced dental nasal /n/ with no effect of the velar stop, whereas, in English bank, this assimilation is obligatory. Listeners take advantage of their language experience and knowledge to account for how phonemes are actually realised in speech and to process the signal accurately. This mechanism is important since there is not necessarily any one-to-one mapping from formant frequency or spectral content to phoneme or sound identity, or between sound and perception. Because of this lack of invariance (Liberman et al., 1967), the auditory perception system needs to achieve a many-to-one mapping. For example, the perception of stops such as [b] and glides such as [w] is dependent on the duration of the following vowel, where a longer subsequent vowel is more likely to lead to the perception of a stop (J. L. Miller & Liberman, 1979), while vowels embedded in consonant-vowel-consonant (CVC) contexts are perceived differently depending on the spectral content of the surrounding speech sounds (Lindblom & Studdert-Kennedy, 1967).
If the system had no mechanism to compensate for sources of variance in the speech signal, perception would prove difficult. We thus need to create perceptual constancy across speakers and contexts (Kuhl, 1979; Summerfield, 1981). One of the most important solutions to the lack of invariance is categorical perception. The brain is able to generalise and find patterns across exemplars to form functionally equivalent categories. This is demonstrated by the fact that listeners find it easier to discriminate sounds that lie on opposite sides of a phoneme boundary, that is, between categories, as compared to sounds that belong to the same phoneme category (Liberman, Harris, Hoffman, & Griffith, 1957). Thus, while speech sounds vary in their acoustic, subphonemic realisation for a number of reasons, they are ultimately perceived as distinct categories of sounds. Since the main aim of the listener is to distinguish one word from another as quickly and as efficiently as possible, this is a crucial mechanism. For example, given synthesised tokens ranging from ba to da, listeners will report an abrupt change from one category to another, without reporting sounds as belonging to ambiguous or in-between categories. The categorical perception effect is stronger for consonants than vowels, which tend to be perceived in a more continuous and less categorical manner (Fry et al., 1962), suggesting that listeners are sensitive to finer distinctions in vowels. Since phoneme categories and category boundaries are by definition language-specific, so is categorical perception. Thus, in a language such as Japanese, where low and row are not heard as different words, the sounds /l/ and /r/ are perceived as variants of the same sound: they belong to the same category (Goto, 1971; Miyawaki et al., 1975).
Categorical perception is also influenced by the phoneme inventory of the specific language, that is, how crowded the phoneme space is for certain categories of sounds. For example, phoneme-detection tasks have shown that if a language has many fricatives, like Polish, detection of fricatives in nonsense words is slower and less accurate. If a language has many vowels, like English, vowel detection is similarly impacted, and so on (Wagner & Ernestus, 2008).
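The discrimination pattern described above can be illustrated with a toy simulation. The sketch below assumes a seven-step ba–da continuum, a logistic labelling function with an arbitrary boundary and slope, and the classic label-based prediction for ABX discrimination, in which accuracy is 0.5 + 0.5(p1 − p2)² for two stimuli labelled as da with probabilities p1 and p2. All parameter values and function names are illustrative assumptions, not fitted data.

```python
import math


def p_label_da(step: float, boundary: float = 4.0, slope: float = 3.0) -> float:
    """Probability of labelling a continuum step as 'da' (logistic curve).

    boundary and slope are arbitrary illustrative values.
    """
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))


def predicted_abx_accuracy(step_a: float, step_b: float) -> float:
    """Label-based prediction of ABX discrimination accuracy.

    Assumes listeners can only discriminate two sounds to the extent
    that they label them differently; 0.5 corresponds to chance.
    """
    difference = p_label_da(step_a) - p_label_da(step_b)
    return 0.5 + 0.5 * difference ** 2


if __name__ == "__main__":
    for step in range(1, 8):
        print(f"step {step}: P(da) = {p_label_da(step):.3f}")
    print("within category (1 vs 3):", round(predicted_abx_accuracy(1, 3), 3))
    print("across boundary (3 vs 5):", round(predicted_abx_accuracy(3, 5), 3))
```

Although both pairs are two steps apart, the pair straddling the boundary is predicted to be discriminated well above chance, while the within-category pair stays close to chance. A shallower slope would approximate the more continuous perception reported for vowels.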
However, phonetic categories are not immutable, and the compensation-for-coarticulation mechanism can shift phoneme category boundaries under a large number of conditions, at both lower and higher levels of processing (Reference Repp, Liberman and HarnadRepp & Liberman, 1987). Thus, at the level of individual speech sounds, listeners hear ambiguous stops ranging between /t/ and /k/ as more like /k/ following the fricative /s/, but as /t/ following /ʃ/, due to listeners’ knowledge of the influence on the vocal tract configuration of lip spreading (as in /s/) and rounding (as in /ʃ/), as well as its effect on neighbouring speech sounds (Reference Mann and ReppMann & Repp, 1981). Category adjustments also occur at the word and sentence levels. In the lexical domain, an ambiguous sound between /d/ and /t/ is more likely to be reported by English listeners as /t/ before /i:k/, but as /d/ before /i:p/. This is because teak and deep are words in English, whereas /di:k/ and /ti:p/ are not, showing a biasing effect of the contents of the listener’s mental lexicon on phoneme categorisation, known as the Ganong effect: the tendency to perceive an ambiguous sound as a phoneme that could complete a real word rather than a nonword (Reference GanongGanong, 1980). At the sentence level, the semantics of a preceding sentence lead to sounds between /b/ and /p/ being reported as /p/ in She likes to jog along the -ath but as /b/ in She ran hot water for the -ath (Reference Miller, Green and SchermerJ. L. Miller, Green, & Schermer, 1984). In addition to these types of contexts, we must also be able to adapt to the physiology of individual speakers, as well as variations in pronunciation and dialects. We adapt to these variations in speech rapidly and efficiently. 
When hearing a word with ambiguous formant frequency in the vowel, our vowel percept is influenced by the spectral content of a preceding sentence, taking into account the physiology of the speaker’s vocal tract (Broadbent, Ladefoged, & Lawrence, 1956; Ladefoged & Broadbent, 1957). We also adapt to deviations in the realisation of phonemes, allowing us to comprehend speakers with a different accent or even temporary differences in pronunciation (Norris, McQueen, & Cutler, 2003).
As a general principle in the auditory system, the immediate contrast between neighbouring sounds plays an important role in category adjustment (Reference Diehl, Elman and McCuskerDiehl, Elman, & McCusker, 1978). Indeed, the basic function of speech segments in spoken language is to separate and differentiate sounds from each other, making them distinctive (Reference Jakobson, Fant and HalleJakobson, Fant, & Halle, 1961). As Reference Broadbent, Ladefoged and LawrenceBroadbent et al. (1956) and Reference Ladefoged and BroadbentLadefoged and Broadbent (1957) showed, the perception of vowels with ambiguous formant frequencies – for example, between bit and bet – is influenced by the spectral content of a preceding sentence (Please say what this word is). Specifically, when the introductory sentence had relatively low F1, a target word was perceived as a word with relatively high F1 (bet), while if the preceding context had higher F1, the word was perceived as bit, which has a relatively low F1 frequency. In more immediate contexts, following a voiced consonant with a high frequency content such as /d/ – or indeed a sine-wave non-speech analogue with similar acoustic properties – the subsequent vowel is more likely to be heard as a vowel with a low F2 (/ʌ/). Conversely, a vowel following a low-frequency consonant such as /b/ is more likely to be perceived as the high-F2 vowel /ɛ/ (Reference Holt, Lotto and KluenderHolt, Lotto, & Kluender, 2000). The frequency of preceding non-speech sine wave tones can also influence the perception of subsequent stops, with a subsequent sound being perceived as being lower in frequency as a function of increasing frequency in the preceding sound (Reference Lotto and KluenderLotto & Kluender, 1998). 
Furthermore, when two similar consonants occur successively, an ambiguous consonant – for example, between /b/ and /d/ – is more likely to be perceived as having a posterior place of articulation when it is preceded by a consonant with an anterior place of articulation. Thus, the contrast between the two sounds is perceptually enhanced by the auditory system (Repp, 1978). Similarly, listeners are more likely to report hearing synthesised ambiguous stop consonants on a /d-g/ continuum as the velar stop /ga/ after /al/ but more likely to report the dental stop /da/ after /ar/ (Mann, 1980). The syllable /al/ has a more frontal place of articulation than /ar/. Consequently, with a lifetime of exposure to coarticulatory assimilation effects on native-language speech sounds, listeners expect stops following /l/ to be produced with a more forward place of articulation than those following /r/. Listeners also know that lip rounding in anticipation of an upcoming speech sound lowers the spectral frequency of a preceding fricative. When listeners are presented with an ambiguous sound between /s/ and /ʃ/, they are more likely to hear it as /s/ before a rounded vowel like /u/ (Mann & Repp, 1980). If this were not the case, the lower spectral frequency brought about by coarticulation could lead to the erroneous perception of a /ʃ/. This process of perceptual compensation – the strength of which can vary across listeners (Yu & Lee, 2014) – is influenced by the listener’s native phonology and the transitional probabilities of the language (Pitt & McQueen, 1998), as well as basic auditory perception mechanisms, with a resulting decrease in the perceptual difference between canonical and assimilated speech sounds (Kang, Johnson, & Finley, 2016; Mitterer, Csépe, & Blomert, 2006).
The system thus combines linguistic biases and acoustic knowledge to maximise perceptual contrast between speech sounds, accounting and compensating for these effects, and gives rise to a percept that can be influenced by both spectral content and attributes of the phonetic context (Kingston et al., 2014).
The brain’s ability to perceive speech sounds in a categorical manner does not mean that we are insensitive to subphonemic detail, especially if that detail is perceptually useful. In fact, even in categorical perception tasks, subphonemic information is still available to the listener at the neural level, suggesting that both continuous and categorical representations may be active in parallel (Reference Beach, Ozernov-Palchik and MayBeach et al., 2021; Reference Dehaene-Lambertz and PallierDehaene-Lambertz et al., 2005). Thus, while speech sounds are realised differently depending on the surrounding phonetic context, and we can use our linguistic knowledge to compensate for this fact, we also take advantage of these variations during speech perception. The vowels in the English words job and jog contain subphonemic information about the place of articulation of the upcoming stop, meaning that the words become distinct even before the end of the vowel (Reference Marslen-Wilson and WarrenMarslen-Wilson & Warren, 1994; Reference McQueen, Norris and CutlerMcQueen, Norris, & Cutler, 1999) (see Figure 3). Listeners make use of this type of coarticulatory information to make word recognition more efficient and to rule potential similar-sounding competing words out of contention. It also allows the processor to retain information that may be useful in cases where an initial interpretation of the word turns out to be incorrect (Reference McMurray, Tanenhaus and AslinMcMurray et al., 2009).
In English, regressive assimilation can cause phrases like freight bearer to be produced as [freɪpbɛrə] rather than [freɪtbɛrə], with labial features spreading backwards from the /b/ in bearer. However, listeners still report hearing a /t/, albeit more slowly than in a canonical, non-assimilated version of the phrase. Thus, the auditory system helps the listeners to rapidly restore assimilated phonemes with little effort at an early prelexical stage (Gaskell & Marslen-Wilson, 1998; Mitterer & Blomert, 2003). Bringing lexical, semantic and other expectations to bear, listeners can even restore phonemes that have been masked or fully replaced by a noise burst or cough (R. M. Warren, 1970), provided that the burst is spectrally similar to the replaced sound (Samuel, 1981a, 1981b; R. M. Warren, 1984). This is extremely useful, given that we often hear speech in noisy conditions. Phoneme restoration is thus the effect of hearing a speech sound instead of the noise, given enough ambiguity, such as the medial /s/ in legislatures being replaced by or overlaid with noise in a sentence like The state governors met with the respective legislatures convening in the capital city (Samuel, 1981a, 1981b; R. M. Warren, 1970). In fact, listeners find it difficult to even locate the noise in a subsequently presented written version of the sentence (R. M. Warren, 1970), illustrating the strength of the effect.
In conclusion, while subphonemic information is available and actively used by the listener in speech perception, the auditory system performs a ‘normalisation’ as it transforms the continuous auditory input into discrete behaviourally and linguistically relevant categories. Categorical perception constitutes a solution to the variance problem, and malleable speech sound categories allow us to adjust to sound differences caused by factors ranging from individual speaker physiology or circumstance to phonetic context, distinguishing discrete words in the signal to ultimately lead to speech comprehension.
4.1 Prediction in Speech Perception and Spoken-Word Recognition
It has long been suggested that we do not just passively perceive the world. Rather, we actively but unconsciously infer the likely causes of the input, something which was originally discussed in relation to cognitive optical illusions that we cannot help but be tricked by (Helmholtz, 1867). Through unconscious inference, we construct and constantly update ‘hypotheses’ about the world. These are based on an internal model of how the world works, and we process input with respect to those hypotheses, creating structure in our perceived reality by combining the input with our knowledge and assumptions, presumably stored as statistical distributions (Leonard & Chang, 2014). Our perception and behaviour can thus operate on prior probabilities based on past experience and can, in this way, be predictive, analogously to a curve fitted to extant and expected data points, helping us fill in the blanks in partial or incomplete data using top-down modulation throughout the neural hierarchy (Asilador & Llano, 2020). This is supported by the fact that the majority of input connections to the primary auditory cortex originate in areas further up in the hierarchy (see Section 5.2). It is important to note, however, that unlikely perceptions do indeed occur, meaning that we cannot always simply accept the most likely hypothesis as true (Gregory, 1980). It has been proposed that the brain approximates Bayes’ theorem or Bayesian inference (Bayes, 1763; Hohwy, 2020), which, put simply, provides the conditional probability of an event (such as encountering a particular spoken word), or the probability of a hypothesis being true, given the evidence and prior information. In perception, a central problem lies in the fact that the same sensory effect may have many potential different sources, and data can be noisy or ambiguous.
Using Bayesian inference, bottom-up sensory information can be combined with prior information – including from the linguistic and communicative context – to arrive at the most likely causes of the sensory data and achieve optimal word recognition, that is, to recognise words as quickly as possible given an acceptable level of accuracy (Reference Norris and McQueenNorris & McQueen, 2008). This assumption of optimality has led to entities using Bayesian decision theory being referred to as ‘ideal observers’ (Reference GeislerGeisler, 2011; Reference Geisler and KerstenGeisler & Kersten, 2002). This is not to say that human performance is always optimal, but the assumption instead provides a starting point for building explanatory theories and models, based on observations of deviations from optimality by human listeners. A related model commonly used to explain neural processing is predictive coding (Reference Rao and BallardRao & Ballard, 1999). Predictive coding postulates that the brain generates models of the external world and updates them when new information violates expectations, generating a prediction error, which is the difference between sensory input and the prediction, that is, the ‘newsworthy’ information that cannot be predicted (Reference FristonFriston, 2018). The system then converges on the response that best explains the current input (Reference Rao and BallardRao & Ballard, 1999). Thus, instead of representing the input directly, the brain can process the prediction error, making processing more efficient. The goal of the system is to minimise prediction error in the long run, allowing for unsupervised learning and inference, to constitute a solution to the problem of multiple potential causes of perceptual data (Reference HohwyHohwy, 2020). 
Through predictive coding, the brain has been proposed to predict input across the linguistic and neural hierarchies over multiple timescales (Caucheteux, Gramfort, & King, 2023).
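The core idea of prediction error minimisation can be illustrated with a deliberately simplified sketch. It uses a single scalar prediction that is nudged towards each observation by a fixed fraction of the prediction error (a delta rule). This is an assumption made for illustration only: it is not the hierarchical model of Rao and Ballard, and the function names and learning rate are hypothetical.

```python
from typing import List, Tuple


def update_prediction(
    prediction: float, observation: float, learning_rate: float = 0.5
) -> Tuple[float, float]:
    """Return the updated prediction and the prediction error.

    The error is the part of the input that was not predicted; the
    prediction moves a fraction (learning_rate) of the way towards it.
    """
    error = observation - prediction
    return prediction + learning_rate * error, error


def absolute_errors(
    observations: List[float], initial: float = 0.0, learning_rate: float = 0.5
) -> List[float]:
    """Run the update over a sequence and collect absolute errors."""
    prediction = initial
    errors = []
    for observation in observations:
        prediction, error = update_prediction(prediction, observation, learning_rate)
        errors.append(abs(error))
    return errors


if __name__ == "__main__":
    # A perfectly regular input: the error shrinks as the model adapts.
    print(absolute_errors([1.0] * 6))
```

With a regular input, the error halves on every step, so less and less of the signal is ‘newsworthy’. An unexpected change in the input would produce a fresh burst of error, which is the quantity the system is proposed to process.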
In spoken-word recognition, an important source of prior information is word frequency, that is, how often a particular word occurs in speech. More frequent words like cat are more easily recognised than less frequent words, such as vat (Reference HowesHowes, 1957; Reference Pollack, Rubenstein and DeckerPollack, Rubenstein, & Decker, 1960; Reference SavinSavin, 1963), which is almost fifty times less frequent (Reference Balota, Yap and CorteseBalota et al., 2007). The use of word frequency as prior information has been proposed to scale with ambiguity and noise: the more ambiguous the input, the more prior information exerts an influence (Reference Norris and McQueenNorris & McQueen, 2008). In this general sense, prediction in speech perception and spoken-word recognition is a mechanism through which our beliefs are constantly updated as more data arrives. For example, the Ganong effect – lexical effects on phoneme categorisation (see Section 4) – can be explained by the interaction of pre-lexical and lexical information according to Bayesian principles (Reference Norris, McQueen and CutlerNorris, McQueen, & Cutler, 2016). At a more general level, word frequency provides a wide range of possible outcomes – a weak prior – while sentence context can conversely be highly constraining: the sentence onset the cat sat on the … may lead to the strong expectation of mat, while he ate a … is less constraining (Reference Norris, McQueen and CutlerNorris et al., 2016). Indeed, multiple words can become probable at the same time. At shorter timescales, incoming phonemes likewise provide prior information as regards the rest of the word, widening or narrowing probability distributions of possible outcomes (Reference Friston, Sajid and Quiroga-MartinezFriston et al., 2021; Reference Gagnepain, Henson and DavisGagnepain, Henson, & Davis, 2012; Reference Roll, Söderström, Hjortdal and HorneRoll et al., 2023; Reference Söderström and CutlerSöderström & Cutler, 2023). 
According to predictive coding models of word recognition, lexical candidates compete by making incompatible predictions of upcoming speech sounds and suppressing prediction errors from their neighbours (Gagnepain et al., 2012; Spratling, 2008).
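How a frequency-based prior interacts with the clarity of the input can be shown with a minimal application of Bayes’ theorem to a two-word lexicon. The 50:1 prior ratio loosely follows the cat/vat frequency difference mentioned above; the likelihood values and the function name are invented for illustration and do not come from any published model.

```python
from typing import Dict

# Illustrative priors: cat is taken to be fifty times more frequent than vat.
PRIORS: Dict[str, float] = {"cat": 50 / 51, "vat": 1 / 51}


def posterior(likelihoods: Dict[str, float], priors: Dict[str, float]) -> Dict[str, float]:
    """Posterior P(word | input), proportional to P(input | word) * P(word)."""
    joint = {word: likelihoods[word] * priors[word] for word in likelihoods}
    total = sum(joint.values())
    return {word: value / total for word, value in joint.items()}


if __name__ == "__main__":
    # Ambiguous onset: the acoustics support both words equally.
    ambiguous = posterior({"cat": 0.5, "vat": 0.5}, PRIORS)
    # Clear onset: the acoustics strongly support vat.
    clear = posterior({"cat": 0.01, "vat": 0.99}, PRIORS)
    print("ambiguous input:", ambiguous)
    print("clear input:", clear)
```

When the input is ambiguous, the posterior simply mirrors the prior and the frequent word wins; when the input clearly favours vat, the evidence outweighs the prior. This is one way of picturing the proposal that prior information exerts more influence as ambiguity increases.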
As a way to explain why the brain extracts meaning from speech with such apparent ease, predictive processing has been postulated at all levels of speech perception and comprehension – from sentence contexts to specific phonological or lexical predictions – and it remains a widely researched and discussed topic. For example, in conversational turn-taking (Reference Sacks, Schegloff and JeffersonSacks, Schegloff, & Jefferson, 1974), speakers take on average 200 milliseconds – a mere fifth of a second – to transition between turns (Reference Stivers, Enfield and BrownStivers et al., 2009). This is despite the fact that it takes much longer to plan and produce even short utterances, suggesting that mental processes must overlap (Reference Levinson and TorreiraLevinson & Torreira, 2015). Crucially, the speed at which this happens also implies that some type of prediction is taking place: listeners can use a number of cues in the signal to anticipate the end of a turn (see Reference MeyerA. S. Meyer (2023) for a review). This appeal to predictive processing is similar to arguments made in the context of speech shadowing, where speakers can repeat speech at speeds – 250 milliseconds or less between hearing and repeating – that strongly suggest a predictive influence from higher-order syntactic, semantic or pragmatic contexts (Reference ChistovichChistovich, 1960; Reference Marslen-WilsonMarslen-Wilson, 1973, Reference Marslen-Wilson1985). 
A sentence context can thus be used to pre-activate the semantic features of expected sentence-final words in a graded fashion (Federmeier & Kutas, 1999; Federmeier et al., 2002), and the phonological structure (DeLong, Urbach, & Kutas, 2005) or acoustic features (Broderick, Anderson, & Lalor, 2019) of words can be predicted based on context or lexical knowledge (Brodbeck, Hong, & Simon, 2018). Furthermore, the endings of words can be predicted based on the ‘micro-context’ of word onsets (Roll et al., 2017; Roll et al., 2023; Söderström & Cutler, 2023; Söderström et al., 2016; Söderström et al., 2017; Söderström, Horne, & Roll, 2017).
5 Structure and Function of the Auditory System
5.1 From the Cochlea to Auditory Nuclei
When sound waves reach the cochlea in the inner ear, the mechanical energy is converted to electrical energy that can be analysed by the nervous system (Reference HudspethHudspeth, 1997). The cochlea performs a spectral decomposition of the signal, and this transformed acoustic information is sent on in the form of electrical signals – sequences of action potentials, or spikes – from the cochlea in several parallel streams through cochlear ganglion cells and the auditory nerve (or cochlear nerve). The physical properties of the incoming sound are encoded through the temporal and spatial distribution of action potentials (Reference Rouiller and EhretRouiller, 1997; Reference ShammaShamma, 2001). For example, increased sound intensity leads to an increase in nerve impulses ascending the auditory nerve (Reference Galambos and DavisGalambos & Davis, 1943), and different frequency bands are processed in different parts of the basilar membrane in the cochlea (Reference FletcherFletcher, 1940). Signals are sent through to several auditory nuclei (a nucleus is a cluster of neurons) and areas in the brainstem and midbrain, including the olivary complex in the brainstem, the inferior colliculus in the midbrain, and on to the thalamus, which relays the information to auditory areas in the cerebral cortex higher up in the processing hierarchy (Reference Huffman and HensonHuffman & Henson, 1990) (see Figure 4).
These parallel streams are responsible for conveying different aspects or features of the acoustic signal, such as pitch or spectral information, as well as the onsets and offsets of sounds. The subcortical detection and extraction of these acoustic features allows for cortical structures to subsequently merge them into more complete acoustic objects (Nelken, 2004). Several layers of this hierarchy are further defined by tonotopy, meaning that there is spatial separation in how different frequencies of sound are transmitted and processed (Romani, Williamson, & Kaufman, 1982). For example, the cochlea is organised so that low-frequency components of the sound are processed at one end, with increasingly higher frequencies being processed along the length of the cochlea towards the other end. This is then reflected in the fact that fibres from the low-frequency end – the apex – terminate at different parts of the neuron clusters connected to the cochlea – the cochlear nuclei – as compared to fibres from the high-frequency part of the cochlea. The tonotopic organisation of the system ensures that a representation or map of the cochlea is maintained all the way from subcortical networks to the cerebral cortex. This spatial organisation of nerve fibres is sometimes referred to as a place code, while the rate or frequency code refers to the frequency of a signal being reflected in the spiking rate of neurons.
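The place code described above can be made concrete with a small numerical sketch. The mapping used here is Greenwood's position-to-frequency function with commonly quoted constants for the human cochlea; neither the function nor its constants come from the present text, so they should be read as an illustrative assumption rather than as the model assumed in this section.

```python
import math

# Greenwood-style constants for the human cochlea. These values are an
# assumption brought in for illustration and are not taken from the text.
A, ALPHA, K = 165.4, 2.1, 0.88


def place_to_frequency(x: float) -> float:
    """Characteristic frequency (Hz) at relative position x along the cochlea.

    x = 0.0 is the apex (low frequencies), x = 1.0 is the base (high frequencies).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie between 0 (apex) and 1 (base)")
    return A * (10 ** (ALPHA * x) - K)


def frequency_to_place(freq_hz: float) -> float:
    """Inverse map: relative cochlear position for a given frequency in Hz."""
    return math.log10(freq_hz / A + K) / ALPHA


if __name__ == "__main__":
    # Print the frequency represented at five equally spaced positions.
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"position {x:.2f} -> {place_to_frequency(x):8.0f} Hz")
```

Under these assumptions, equal steps along the cochlea correspond to roughly logarithmic steps in frequency, which is one way of picturing how a spatial map of frequency could be preserved at later stages.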
5.2 Subcortical Networks and the Extraction of Acoustic Features of Speech
As the signal travels from the cochlea, subcortical networks – located in the hierarchy between the cochlea and primary auditory cortex (see Figure 4, Section 5.1) – play an important role in extracting and transforming the features that are crucial for successful perception of sound and speech. At early stages in this ascending auditory pathway, the firing of auditory nerve fibres closely represents both the fine and coarse structure of complex sounds. Thus, the temporal and frequency information of speech is represented in auditory nerve activity, so that certain fibres respond more strongly to certain frequencies while temporal modulations in the signal are represented in the latency, timing and firing rate of the neural response (Joris & Yin, 1992; Rose et al., 1967; Young & Sachs, 1979). For example, modulations over time in the neural response directly represent temporal features of speech such as voice-onset time (Young, 2008). Thus, it has been suggested that in the early auditory system, the timing of neuronal spiking underlies the processing of consonant sounds while the fine-structure detail in vowel sounds is represented by spiking rates of neurons synchronising with the signal (Perez et al., 2013). The electric auditory brainstem response (ABR) can be tracked over time using EEG electrodes on the scalp (Jewett & Williston, 1971; Jewett, Romano, & Williston, 1970). Short, non-periodic stimuli elicit transient responses, while periodic stimuli (such as vowels) elicit sustained responses that are part of the ABR.
The ABR is commonly used in clinical settings to test auditory function using simple click sounds, but it has also been widely used to track brainstem processing of speech sounds, where it has been found to reflect speech-specific information such as fundamental and formant frequencies, as well as syllable structure (Greenberg, 1980; Moushegian, Rupert, & Stillman, 1973; Russo et al., 2004; Worden & Marsh, 1968; Young & Sachs, 1979).
As the signals converge and are integrated to begin to form percepts, there is a gradual decrease in the temporal precision of the representation as we reach higher levels of the system. Neurons in the auditory nerve may phase-lock to information at up to 10,000 Hz, corresponding to the fine structure of speech (Heinz, Colburn, & Carney, 2001), and neurons in the cochlear nuclei synchronise with signals at rates of hundreds or thousands of cycles per second (Rhode & Greenberg, 1994), whereas neurons at subsequent higher stages of the hierarchy – from the inferior colliculus to the thalamus and on to the primary auditory cortex – synchronise to the stimulus at increasingly slower rates (Bartlett & Wang, 2007; Liang et al., 2002; Rees & Palmer, 1989; Yin et al., 2011) as representations also become more complex.
At this point, it is important to note that, while this simplified illustration of the auditory system proceeds in a hierarchical, linear fashion from cochlea to cortex – also mostly overlooking hemispheric differences between the left and right sides of the brain – the flow of processing from the ear to the brain is not simply unidirectional. Moreover, many neural responses represent the speech signal non-linearly (Christianson, Sahani, & Linden, 2008; David et al., 2009) and seldom isomorphically. That is, one-to-one mappings between the input and neural representation appear to be rare, apart from the level of the early auditory system (Repp, 1988; Young, 2008). Recall that perception is an active process (Bajcsy, 1988; Friston, 2005; Helmholtz, 1867, 1878/1971) and thus we do not just hear, we listen (Friston et al., 2021). There are multiple two-way flows of information at all levels of the system, meaning that there are both ascending (afferent) pathways going towards the brain and descending (efferent) pathways carrying information back down through the auditory system, in several feedback loops from the cortex to subcortical structures, including the thalamus and inferior colliculus (Winer et al., 2002), as well as nuclei further down in the hierarchy, such as the olivary complex (Coomes & Schofield, 2004) and cochlear nucleus (Held, 1893; Schofield & Coomes, 2006; Weedman & Ryugo, 1996a, 1996b). The pathways from the auditory cortex all the way down to the cochlea are often referred to as corticofugal projections.
In fact, roughly one-third of inputs to the primary auditory cortex originate in and ascend from subcortical areas, while two-thirds – the majority – are descending signals from cortical areas higher up in the hierarchy (Diamond, Jones, & Powell, 1969; Scheich et al., 2007). Descending signals serve to sharpen and tune the response of subcortical neurons (Suga, 2008) to filter and control incoming acoustic information, and the longest feedback signals go back as far as the hair cells in the cochlea, where they continue to play an important role in speech perception (Froehlich et al., 1990; Garinis, Glattke, & Cone, 2011; Huffman & Henson, 1990; Luo et al., 2008). At the level of the cochlea, signals descending through the system from the olivary complex also help to protect the cochlea and its hair cells from traumatic effects caused by loud sounds (Guinan, 2006; Taranda et al., 2009). This descending pathway is modulated by attention, the mechanism that allows listeners to focus on behaviourally relevant stimuli (Galbraith & Arroyo, 1993; Giard et al., 1994; Petersen & Posner, 2012). Attention increases the signal-to-noise ratio (Mertes, Johnson, & Dinger, 2019), sharpens the response to speech in noise and facilitates cocktail-party speech perception – the ability to focus on one stimulus in the presence of others competing for attention – prior to cortical processing (Cherry, 1953; Festen & Plomp, 1990; Price & Bidelman, 2021).
The importance of peripheral subcortical networks for speech perception has been highlighted by research into patients with auditory neuropathy. These patients have minimal cochlear and cognitive deficits but have great difficulty understanding speech. One such patient, an eleven-year-old girl, perceived speakers as sounding ‘weird, like spacemen’ (Starr et al., 1991). This was marked by difficulty in distinguishing vowel sounds, but a relatively unaffected ability to distinguish words based on high-frequency consonants. Auditory neuropathy appears to affect the temporal precision of neural coding and transmission in the auditory nerve, with less effect on percepts based on frequency or intensity. A larger study of twenty-one patients further corroborated the impact of auditory nerve timing deficits on speech perception (Zeng et al., 2005), finding a decreased ability to detect transient and rapidly changing sounds. In addition, pitch discrimination was found to be impaired below 4 kHz, and temporal processing deficits further manifested as difficulties in separating successively occurring sounds and detecting both slow and fast temporal modulations, as well as gaps between sounds. These types of temporal mechanisms perform important functions in speech perception. A relatively slow temporal modulation such as voice-onset time (VOT) – the time between the release of a stop consonant and the onset of voicing – is an important cue to the difference between the voiced and voiceless consonants in pa and ba in English (Lisker & Abramson, 1964), where a gap shorter than 30 ms leads to the perception of voicing and a longer gap signals voicelessness (Wood, 1976). Thus, while a sound sequence like ama does not contain any perceptible gaps, aba contains a voice-onset gap which may be a useful cue to phoneme identification.
Listeners with normal hearing can detect gaps of only a couple of milliseconds (Fitzgibbons, 1984), but a perceptual threshold of around 30 ms has been posited with regard to speech phenomena like voice-onset time (Pastore & Farrington, 1996). In the subcortical auditory system, circuits in the ventral cochlear nucleus (VCN) involving temporally precise and sensitive octopus cells detect and track acoustic onsets and periodicity (Ferragamo & Oertel, 2002; Golding, Ferragamo, & Oertel, 1999), as well as synchrony (Oertel et al., 2000), suggesting an important role of these circuits in the processing of VOT and similar, relatively slow, temporal modulations in speech perception. Octopus cells also synchronise strongly to faster amplitude envelope modulations (Rhode, 1994; Rhode & Greenberg, 1994) and thus appear to be involved in the processing of fundamental frequency and vowel-type formant information (Rhode, 1998). Ascending the pathway, octopus cells target areas of the superior olivary complex – the first point of binaural convergence in the auditory system (Walton & Burkard, 2001) – as well as the lateral lemniscus (Felix II et al., 2017).
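The role of voice-onset time as a cue to voicing, with a perceptual boundary of around 30 ms, can be illustrated with a minimal sketch. The function names and the treatment of the boundary as a sharp cut-off are assumptions made purely for illustration; actual category boundaries are graded and vary with context, place of articulation and language.

```python
# Boundary of about 30 ms as mentioned in the text. Treating it as a sharp
# cut-off is a simplification made for this sketch.
VOICING_BOUNDARY_MS = 30.0


def classify_stop(vot_ms: float) -> str:
    """Label a stop consonant as 'voiced' or 'voiceless' from its VOT in ms.

    A VOT of exactly 30 ms is assigned to 'voiceless' here; the text does not
    specify this edge case, so the choice is arbitrary.
    """
    return "voiced" if vot_ms < VOICING_BOUNDARY_MS else "voiceless"


def label_continuum(vot_values_ms):
    """Classify every step of a VOT continuum, e.g. from 'ba' to 'pa'."""
    return [(vot, classify_stop(vot)) for vot in vot_values_ms]


if __name__ == "__main__":
    # A hypothetical seven-step continuum from 0 to 60 ms in 10 ms steps.
    for vot, label in label_continuum(range(0, 70, 10)):
        print(f"VOT {vot:2d} ms -> {label}")
```

Although the input varies continuously, the output falls into two discrete categories, which is the property that makes VOT useful as a cue to phoneme identity.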
Many acoustic features of speech have already been analysed through multiple afferent pathways in the brainstem by the time the signal reaches the inferior colliculus, an important nucleus located in the midbrain. All ascending auditory pathways converge here, and it also receives descending information from the thalamus and cortex (Rouiller, 1997). Processing in the inferior colliculus is more complex than in systems preceding it in the peripheral auditory system, but less so than in the cortex (Portfors & Sinex, 2005). Information about the timing and intensity of sounds reaching the ears at subtly different times is processed and sent to the superior colliculus, where it is used to localise sounds in space. The inferior colliculus contains neurons sensitive to amplitude- and frequency-modulated sounds (Rees & Møller, 1987; Rodríguez, Read, & Escabí, 2010; Schuller, 1979), as well as sound duration and offset (Casseday, Ehrlich, & Covey, 1994, 2000; Ehrlich, Casseday, & Covey, 1997) and gap detection (Walton, Frisina, & O’Neill, 1998). Voice-onset time appears to be represented in a similar fashion to the auditory nerve, that is, through a pause in neuronal spiking corresponding to the VOT (Young, 2008). The inferior colliculus plays a crucial role in the process of filtering and sharpening the signal, as well as compensating for the effects of reverberation on the amplitude envelope of the speech signal (Slama & Delgutte, 2015; Suga, 1995), for example when the system perceives vowels such as /a/ and /i/ (Sayles, Stasiak, & Winter, 2016).
This early filtering and compensation system appears to help the primary auditory cortex further up in the hierarchy fulfil important functions, such as processing speech sounds as robust and invariant categories in conditions marked by noise or reverberation (Mesgarani et al., 2014), which may occur in a loud restaurant or cocktail party where we may hear many people speaking at once (Cherry, 1953).
5.3 From Subcortical to Cortical Processing of Speech
From the inferior colliculus, signals are relayed through the medial geniculate body – the auditory part of the thalamus – and on to the auditory cortex, which is located in the temporal lobe of the brain. It takes ten to twenty milliseconds for the acoustic information to be transferred from the cochlea to the auditory cortex (Eldredge & Miller, 1971; Rupp et al., 2002) and much acoustic processing has occurred before the signal reaches this point. It has been suggested that the detailed analysis of spectrotemporal features of speech is complete at the level of the inferior colliculus (Nelken et al., 2003) and the nature of the processing and representation of sound broadly changes as the signal reaches cortical areas (L. M. Miller et al., 2001). For example, the modulation transfer function of the auditory system – essentially its temporal resolution – is around ten times lower in cortical than subcortical structures (<100 Hz vs. ~1,000 Hz) (Joris & Yin, 1992; Kowalski, Depireux, & Shamma, 1996; Rhode & Greenberg, 1994; Schreiner & Urbas, 1986; Yin et al., 2011). The auditory cortex can be viewed as a bank of filters, arranged according to tonotopy, that responds to spectrotemporal modulations, so that sounds are decomposed through axes going from slow to fast temporal rates of modulation, and from narrow to broad scales of spectral modulation.
Thus, while peripheral subcortical structures transform the acoustic signal to a time-frequency representation, the auditory cortex performs a more complex, joint spectrotemporal decomposition and analysis: just as the cochlea represents the sound wave at different frequencies, the auditory cortex represents the sound spectrogram at different resolutions (Chi, Ru, & Shamma, 2005). As a principle, it has been suggested that posterior/dorsal regions of the auditory cortex respond selectively to coarse spectral information with high temporal precision, while anterior/ventral regions encode fine-grained spectral information with low temporal precision (Santoro et al., 2014).
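The roughly tenfold difference in temporal resolution between subcortical and cortical stages can be pictured with a deliberately simple sketch. Each stage is treated as a first-order low-pass filter over modulation rate, with cut-offs set to the approximate limits quoted above. The filter shape, the dictionary `CUTOFF_HZ` and the function name are assumptions chosen for illustration and are not measured modulation transfer functions.

```python
import math

# Approximate upper limits of temporal resolution quoted in the text.
CUTOFF_HZ = {"subcortical": 1000.0, "cortical": 100.0}


def modulation_gain(mod_rate_hz: float, cutoff_hz: float) -> float:
    """Gain of a first-order low-pass filter at a given modulation rate.

    The first-order low-pass shape is an illustrative assumption, not a
    measured modulation transfer function.
    """
    return 1.0 / math.sqrt(1.0 + (mod_rate_hz / cutoff_hz) ** 2)


if __name__ == "__main__":
    # Compare how strongly each stage follows slow and fast modulations.
    for rate in (4.0, 30.0, 300.0, 3000.0):
        gains = {stage: modulation_gain(rate, fc) for stage, fc in CUTOFF_HZ.items()}
        print(f"{rate:7.0f} Hz  " + "  ".join(f"{s}: {g:.2f}" for s, g in gains.items()))
```

In this toy version, slow syllable-rate modulations pass through both stages almost unchanged, whereas modulations of several hundred hertz are followed well only by the subcortical stage.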
Representations also become more complex and categorical – invariant – as we reach higher stages in the hierarchy (Carruthers et al., 2015; Perez et al., 2013; Sharpee, Atencio, & Schreiner, 2011), such that cortical areas respond strongly to behaviourally meaningful categories of sounds rather than only general spectrotemporal properties. From the thalamus and primary auditory cortex onwards, it has been proposed that the brain thus operates primarily on complex, higher-order sound objects rather than basic acoustic features (Mesgarani et al., 2008; Nelken et al., 2003). Cortical responses are also malleable – or plastic – and change in the short or long term depending on behavioural or contextual requirements, as well as statistical regularities. Thus, cortical responses can change if required due to experience, an experimental task or expectation of a reward, that is, if something is behaviourally relevant (Fritz, Shamma, Elhilali, & Klein, 2003; Scheich et al., 2007). This allows for perceptual enhancement of degraded speech and enables more efficient perception (Holdgraf et al., 2016).
In oddball paradigms, where stimuli appear with different probabilities of occurrence (see Section 3.2), rarer stimuli elicit stronger responses in the primary auditory cortex than common stimuli (Ulanovsky, Las, & Nelken, 2003), something which – combined with later processing stages (Schönwiesner et al., 2007) – is subsequently reflected in the mismatch negativity (MMN) ERP component on the scalp (Näätänen et al., 1978). These types of findings corroborate the idea that cortical responses are sensitive to behaviourally relevant and more abstract representations of sound in a way that those found lower down in the hierarchy are not (Chechik & Nelken, 2012; Nelken, 2008). Another feature of processing that is typical for the cerebral cortex – especially as we move beyond primary auditory areas – is speech specificity or selectivity, meaning that neurons may respond preferentially to speech over non-speech stimuli (Scott & Johnsrude, 2003). This may be driven by the particular nature of speech, which is acoustically complex, as seen in its envelope variability and in the structure and transitions of its formants (Hullett et al., 2016).
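The logic of the oddball paradigm can be sketched in a few lines. In this toy version, a deviant occurs on every tenth trial and the response to each stimulus is modelled as its surprisal given the stimuli heard so far. Equating response strength with surprisal is an assumption made for illustration, as are all function names; it is not a claim about the mechanism underlying the cortical responses or the MMN.

```python
import math
from collections import Counter


def make_oddball_sequence(n_trials: int, deviant_every: int = 10) -> list[str]:
    """Deterministic sequence: a 'deviant' on every n-th trial, else 'standard'."""
    return ["deviant" if (i + 1) % deviant_every == 0 else "standard"
            for i in range(n_trials)]


def surprisal_responses(sequence: list[str]) -> list[float]:
    """Response to each stimulus, modelled as surprisal (bits) given the history.

    Probabilities are running estimates with add-one smoothing over the
    stimulus types that occur in the sequence.
    """
    counts = Counter()
    n_types = len(set(sequence))
    responses = []
    for n_seen, stim in enumerate(sequence):
        p = (counts[stim] + 1) / (n_seen + n_types)
        responses.append(-math.log2(p))
        counts[stim] += 1
    return responses


def mean_response(sequence: list[str], responses: list[float], stim: str) -> float:
    """Average modelled response to one stimulus type."""
    values = [r for s, r in zip(sequence, responses) if s == stim]
    return sum(values) / len(values)


if __name__ == "__main__":
    seq = make_oddball_sequence(200)
    resp = surprisal_responses(seq)
    for stim in ("standard", "deviant"):
        print(f"{stim:8s} mean response: {mean_response(seq, resp, stim):.2f} bits")
```

Under these assumptions, the rare stimulus produces a larger modelled response than the common one simply because it is less probable given the preceding context.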
The earliest stage of the processing hierarchy in the cerebral cortex lies in the primary auditory cortex, more specifically in the medial part of an area known as the transverse temporal or Heschl’s gyrus (Morosan et al., 2001), comprising Brodmann areas 41 and 42 (Brodmann, 1909). Its importance for speech perception was noted as early as the nineteenth century when Adolf Kussmaul and Ludwig Lichtheim connected damage to the area with an auditory comprehension disorder known as pure word deafness (Pandya et al., 2015). While this area also responds to unmodulated spectral non-speech noise (Hickok & Poeppel, 2004), Heschl’s gyrus appears to contain mechanisms that are specialised for speech processing. In this role, the auditory cortex responds strongly to amplitude- and frequency-modulated sounds (Ding & Simon, 2009; Liégeois-Chauvel et al., 2004) and transforms acoustic features from simple to more complex representations. Heschl’s gyrus also plays a role in pitch processing (De Angelis et al., 2018; Griffiths & Hall, 2012; Kumar et al., 2011). Certain parts of the primary auditory cortex respond preferentially to phonemes as compared to non-speech sounds, and it also encodes speaker-specific features as well as speaker-invariant categorical representations of phonemes, allowing us to achieve perceptual constancy of phoneme identity in processing (Khalighinejad et al., 2021; Town, Wood, & Bizley, 2018).
These functions are crucial for speech processing since we need both to be able to distinguish and identify different speakers and to create abstract, speaker-invariant categories. Since there is variation in how speech sounds are pronounced and realised (see Section 4), we need to normalise the acoustic signal and create robust phonemic categories that are insensitive to acoustic variations, such as allophones. In this way, we can suppress information that may be perceptually irrelevant.
While the surrounding secondary auditory areas also exhibit more domain-general properties (Griffiths & Warren, 2002), a relative specialisation and sensitivity to speech sounds continues to define sound processing as we progress along the cortical hierarchy. Thus, the nearby supratemporal plane – comprising the planum polare, planum temporale and superior temporal gyrus (STG) – encodes the formant frequencies of vowels and spectrotemporal composition of consonants (Formisano et al., 2008; Näätänen et al., 1997; J. D. Warren, Jennings, & Griffiths, 2005), including extremely transient sounds such as consonantal stops (Obleser et al., 2007). It is also involved in abstract sublexical processing in speech perception (Hasson et al., 2007) and is sensitive to transitional probabilities between speech sounds or syllables and other statistical regularities, as is Heschl’s gyrus (Leonard et al., 2015; McNealy, Mazziotta, & Dapretto, 2006; Roll et al., 2015; Söderström et al., 2017; Tobia et al., 2012; Tremblay, Baroni, & Hasson, 2013). The planum temporale completes the spectral envelope analysis and abstraction before further phoneme-level processing higher up in the temporal lobe (Kumar et al., 2007).
The primary auditory cortex connects through a cortico-cortical stream to the posterior part of the superior temporal gyrus (pSTG) (Brodmann, 1909; Brugge et al., 2003), which – in the left hemisphere of the brain – is traditionally considered as part of Wernicke’s area (Binder, 2015; Bogen & Bogen, 1976). The pSTG, which is connected to but functionally distinct from Heschl’s gyrus, is a core association area for acoustic processing and spectrotemporal analysis (Hickok & Poeppel, 2007; Howard et al., 2000). It has even been shown that spectrotemporal details of speech can be reconstructed using cortical neuroelectric data from the pSTG (Pasley et al., 2012). Brodmann (1909) defined the STG as area 22, and modern analyses of cell and receptor composition have been used to refine definitions of this area further (Morosan et al., 2005; Zachlod et al., 2020). A ‘tuning’ gradient runs across the length of the superior temporal gyrus so that pSTG specialises in speech varying fast over time – with high temporal but low spectral modulation – while the anterior part of the axis specialises in speech with slow temporal modulations but with higher spectral modulation. Thus, the posterior part is more specialised in phonemic processing, while temporally slow syllabic or prosodic processing occurs in the anterior part, towards the front of the brain (Hullett et al., 2016).
The transformation of speech sounds to categorical phoneme representations thus emerges in Heschl’s gyrus and continues onto the surface of the superior temporal gyrus (Chang et al., 2010; Formisano et al., 2008; Khalighinejad et al., 2021; Steinschneider et al., 2011). In fact, the entire inventory of American English phonemes has been mapped onto sites along the STG, with distinct neural populations sensitive to contrastive features such as place and manner of articulation, voicing and voice-onset time rather than to discrete phonemes, suggesting a complex, multidimensional mechanism that operates on the acoustic features that make up the phonemes and phonemic contrasts of a language (Chang et al., 2010; Mesgarani et al., 2014; Steinschneider et al., 2011).
The STG also performs a normalisation of these sound representations, adjusting for differences in individual voices so that speaker-independent meaning can be extracted from the spoken message (Sjerps et al., 2019). It can also perceptually restore phonemes (see Section 4) based on top-down input from frontal regions, at around 150 milliseconds after the onset of an ambiguous noise that replaces a phoneme within a word. This helps make processing robust to noisy conditions (Leonard et al., 2016), highlighting the important role that the STG plays in transforming sound to phonological representations and in the solution to the variance problem in speech perception. Categorical phonemic perception is also subserved by the superior temporal sulcus (STS), which lies lateral to and below Heschl’s gyrus (Uppenkamp et al., 2006). At this point, according to the dual-route model of speech perception, the system diverges into the ventral and dorsal streams, with the ventral stream mapping sensory or phonological representations onto lexical conceptual representations (sound to meaning), and the dorsal stream responsible for mapping phonological representations onto articulatory motor representations (sound to articulation) (Hickok & Poeppel, 2007).
The STS has been suggested to be part of a network with the middle temporal gyrus (MTG) that goes from phonological processing and the categorical perception of phonemes to their integration into higher-level semantic representations to drive speech comprehension, with the anterior portions of the STS being involved in the integration of phonemes into words (DeWitt & Rauschecker, 2012; Liebenthal et al., 2005; Overath et al., 2015; Scott & Johnsrude, 2003). The posterior MTG is thus considered a lexical interface and storage of abstract word representations in the ventral stream, mapping sound to meaning (Davis & Gaskell, 2009; Gow, 2012; Hickok & Poeppel, 2000, 2004, 2007). The STS is sensitive to the higher-order word-recognition process of lexical competition – words or lexical candidates competing with each other for activation and recognition – in the form of lexical neighbourhood density (Luce & Pisoni, 1998; Okada & Hickok, 2006). The superior temporal gyrus is also involved in this process. Gagnepain et al. (2012) suggest a model whereby neurons in the STG represent the difference between predicted and heard speech sounds (see Section 4.1). In this way, lexical candidates compete by giving rise to incompatible predictions for which speech sounds will be heard next. Next to the posterior STG lies the supramarginal gyrus (SMG, BA40), with no sharp border between the regions with respect to cellular composition (Brodmann, 1909).
An inferior parietal area traditionally viewed as part of Wernicke’s area along with the STG, the supramarginal gyrus continues the higher-level categorical analysis of speech sounds together with the nearby angular gyrus (BA39) (Joanisse, Zevin, & McCandliss, 2007) and serves as an interface between phonetic and semantic representations for articulation in the dorsal stream (Gow, 2012). The supramarginal gyrus – along with frontal areas such as the inferior frontal gyrus – has also been found to exert top-down influence on lexical and phonemic processing in the STS and STG (Gow et al., 2008), and the SMG itself is subject to modulation from higher-level frontal areas (Gelfand & Bookheimer, 2003). The perception of categories in the supramarginal gyrus is driven by the selective amplification of key stimulus differences, that is, across phoneme boundaries, while differences treated as invariances (within-category) are suppressed (Raizada & Poldrack, 2007). Together, the supramarginal and angular gyri form the inferior parietal lobule, the area referred to by Norman Geschwind (1965a) as the ‘association area of association areas’ (see Section 2). The angular gyrus itself has been suggested to be at the top of a processing hierarchy in the retrieval and integration of semantic representations (Binder et al., 2009; Righi et al., 2010), and is also involved together with SMG in the active prediction of upcoming words (Willems et al., 2016) and word endings (Roll et al., 2017; Söderström et al., 2017).
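The idea that lexical candidates compete by generating incompatible predictions about upcoming speech sounds, and that word endings can be predicted from word onsets, can be illustrated with a toy sketch. Letters stand in for phonemes, the five-word lexicon is hypothetical, and all candidates are weighted equally. This is a simplified illustration under those assumptions and not an implementation of the model proposed by Gagnepain et al. (2012).

```python
from collections import Counter

# Toy lexicon with letters standing in for phonemes. The words and the equal
# weighting of candidates are assumptions made for illustration.
LEXICON = ["cat", "cap", "can", "cot", "dog"]


def candidates(heard: str, lexicon=LEXICON) -> list[str]:
    """Words that are still compatible with the input heard so far."""
    return [w for w in lexicon if w.startswith(heard) and len(w) > len(heard)]


def next_sound_predictions(heard: str, lexicon=LEXICON) -> dict[str, float]:
    """Probability of each possible next sound, pooled over the candidates."""
    counts = Counter(w[len(heard)] for w in candidates(heard, lexicon))
    total = sum(counts.values())
    return {sound: n / total for sound, n in counts.items()} if total else {}


def prediction_error(heard: str, next_sound: str, lexicon=LEXICON) -> float:
    """One minus the predicted probability of the sound that actually arrives."""
    return 1.0 - next_sound_predictions(heard, lexicon).get(next_sound, 0.0)


if __name__ == "__main__":
    print(candidates("ca"))
    print(next_sound_predictions("c"))
    print(prediction_error("c", "a"), prediction_error("c", "o"))
```

In this sketch, a sound that most remaining candidates predict yields a small error, while a sound predicted by only a few candidates yields a large one; each incoming sound also prunes the candidate set, so predictions about the word ending sharpen as the word unfolds.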
Areas in the temporal and parietal lobes connect via a large network of white-matter pathways to each other and to frontal regions of the brain (Gow, 2012). The arcuate fasciculus, which has traditionally been considered the most important language-network connection, runs between Broca’s and Wernicke’s areas (H. C. Bastian, 1887; Dejerine, 1895). This tract contains a direct pathway between temporal and frontal regions, as well as two indirect pathways described using modern neuroimaging techniques. These indirect pathways connect the inferior parietal lobe – the supramarginal and angular gyri – to Broca’s and Wernicke’s areas, respectively. This suggests that speech perception involves a more complex white-matter network, anatomically and functionally dissociable, than has traditionally been assumed. The direct pathway has been suggested to mainly subserve phonological processing, with a more semantically oriented role for the indirect, inferior parietal pathway (Catani, Jones, & ffytche, 2005).
While much processing occurs in parallel at different levels of the linguistic and neural hierarchies (Beach et al., 2021; Gwilliams et al., 2022; Rauschecker & Scott, 2009), information can pass between temporal and frontal areas of the brain in ten to thirty milliseconds, as measured by the timing of early auditory ERP components (Matsumoto et al., 2004; Pulvermüller & Shtyrov, 2008; Pulvermüller et al., 2003). Much like the descending corticofugal pathways from the primary auditory cortex that influence subcortical processing all the way down to the level of the cochlea (see Section 5.1), frontal areas of the brain play a crucial role in providing descending top-down modulations of processing in speech perception, mediating activity in areas such as the temporal and primary auditory cortices, and allowing us to attend to and predict stimuli that are relevant to behaviour (Braga, Wilson, Sharp, Wise, & Leech, 2013; Brass & von Cramon, 2004; Cope et al., 2017; Tzourio et al., 1997). Depending on the source of this top-down information, the modulation can take place over tens or hundreds of milliseconds in the case of phonemic processing, over seconds in the case of prosody, or over longer timescales. There is thus a hierarchy of linguistic representations over a number of timescales in the brain, with higher-order predictions generated in frontal and associative areas (Wacongne et al., 2011).
While the exact neural principles and mechanisms involved in top-down processing remain debated and widely researched, frontal areas have been proposed to play more abstract, decision-related roles in auditory processing (Binder, Liebenthal, Possing, Medler, & Ward, 2004; Scott & Johnsrude, 2003). Activity in prefrontal areas can influence processing at lower levels in the auditory hierarchy, such as the primary auditory cortex (Wang, Zhang, Zou, Luo, & Ding, 2019), and has been suggested to drive representational computations to achieve category invariance in perception (Myers, Blumstein, Walsh, & Eliassen, 2009). Recall that the superior temporal gyrus can process phoneme-replacing noise as phonetic information given a surrounding lexical context, so that a word like fa[?]tor can be perceived as factor, with the appropriate phoneme rapidly restored by the perceptual system. This restoration in the temporal lobe is preceded by biasing predictive neural activity in frontal areas, particularly the left inferior frontal gyrus (Leonard et al., 2016). Similarly, being given explicit context information about upcoming ambiguous speech sounds, syllables or words helps us disambiguate and perceive these sounds (G. A. Miller, Heise, & Lichten, 1951; O’Neill, 1957), much like the implicit effect that preceding sounds or within-word context have on the perception of phonemes (Lotto & Kluender, 1998; Marslen-Wilson, 1975).
In this way, a preceding stimulus or context – spoken language or written text – can provide disambiguating top-down information to bias the perception of subsequently presented degraded speech. This use of prior information to disambiguate speech is also associated with activity in the inferior frontal gyrus, which lies at a high level of the auditory hierarchy (Sohoglu & Davis, 2016). This is in contrast with disambiguating bottom-up information, such as an increase in the perceptual detail in the auditory stimulus, which triggers activity in lower-level auditory areas in the superior temporal gyrus rather than frontal areas (Sohoglu et al., 2012). Importantly, the mediating connection between frontal and temporal regions allows acoustic and linguistic information to interact as we process the phonemes of incoming words, and thus fuses phonetic and phonological information (Cope et al., 2023; Cope et al., 2017; Kim, Martino, & Overath, 2023; Overath & Lee, 2017). This occurs through spectral analyses of formant structures in loops between the primary auditory cortex and superior temporal gyrus and sulcus, which are modulated by signals from frontal areas such as the left inferior frontal gyrus.
The main goal of the speech perception and word recognition process is to establish what a spoken word is as quickly as possible, from sound waves entering our ears through to acoustic and linguistic analysis of the fleeting signal. We recognise words extremely rapidly: the brain can tell real words and nonwords apart based on an incoming disambiguating speech sound as quickly as thirty to fifty milliseconds after phoneme onset, performing ‘first pass’ lexical processing in left temporal and frontal cortical circuits (MacGregor et al., 2012; Shtyrov & Lenzen, 2017). Speech perception is an active process through which combined bottom-up and top-down mechanisms allow us to consider information as soon as it becomes available and represent discrete, invariant, and categorically perceived phonemes, continuously resolving ambiguous information and integrating prior information with sensory signals – throughout the neural hierarchy from cochlea to cortex – as words unfold in time, to ultimately comprehend the speaker’s message.
6 Directions for Future Research
Advances in our understanding of language and speech, as well as the brain, auditory system and experimental methodology, have propelled the fields of phonetics and neuroscience over the past century and a half. A parallel evolution in both fields has been necessary to reach the current levels of knowledge we have about how the brain processes spoken language. Neuroimaging techniques now allow temporal and spatial resolutions of neural processing at the millisecond and submillimetre scales. A slew of psycholinguistic methods developed over the past half-century are used to probe a wide range of detailed questions in language processing. Meanwhile, new statistical methods can provide more robust interpretations of both behavioural and neuroimaging data.
However, continued linguistic and neuroscientific theory and model-building are still crucial if we are to generate and constrain hypotheses to explain the actual data: not just how something works, but why it works the way it does (Norris & Cutler, 2021). This includes extending the experimental psycholinguistic endeavour to more languages so as to understand what linguistic phenomena are possible and how listeners take advantage of them in perception. Less-studied languages with particular lexical and morphosyntactic properties, such as Welsh (Boyce, Browman, & Goldstein, 1987; Vaughan-Evans et al., 2014) or Iwaidja (Evans, 2000), spoken on Croker Island in northern Australia, can be used to expand theories about both language and the brain. In these languages, word-initial phonemes can change depending on syntactic context, something which will have implications for theories of lexical processing. The more specific and representative the linguistic and psycholinguistic description, the easier it is to create linking hypotheses together with the neurophysiologist and search for corresponding neural correlates, to ultimately create models that are consistent with both language and brain function. That being said, a phonetician does not necessarily have to be interested in neurobiology to use neuroimaging techniques to address empirical questions: one can study the mind without studying the brain. For example, in EEG studies, it is perfectly valid to use the presence of an MMN (see Section 3.2) to determine whether listeners can perceive a difference between two speech sounds, without addressing the potential neural generators of the MMN.
Similarly, while the neural underpinnings of the N400 are widely researched but still largely unknown, a larger N400 component in one condition is nevertheless a strong indication that a stimulus is perceived like a word (in contrast to a pseudoword), and so on. This may be less straightforward in fMRI, where the careless use of reverse inference – where a particular cognitive process is inferred from observed activity in a particular brain region – can lead to incorrect interpretations of the data (Hutzler, 2014; Poldrack, 2006). Likewise, one cannot understand brain function by simply knowing where something is processed: an understanding of the hierarchical and parallel neural systems and subsystems that underlie perception is necessary.
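The risk in reverse inference can be illustrated with a short worked example. It treats the inference as an application of Bayes' rule, which is one standard way of framing the problem. The probabilities and the function name below are hypothetical values chosen purely for illustration and are not drawn from any dataset or from the studies cited.

```python
def reverse_inference_posterior(p_act_given_proc, p_act_given_not_proc, prior_proc):
    """Probability that a cognitive process is engaged, given that a
    region is active, computed with Bayes' rule.

    p_act_given_proc:     P(activation | process engaged)
    p_act_given_not_proc: P(activation | process not engaged)
    prior_proc:           P(process engaged) before seeing the data
    """
    evidence = (p_act_given_proc * prior_proc
                + p_act_given_not_proc * (1.0 - prior_proc))
    return p_act_given_proc * prior_proc / evidence


# Hypothetical region that is rarely active unless the process is engaged.
selective = reverse_inference_posterior(0.8, 0.1, 0.5)

# Hypothetical region that is active in many other tasks as well.
unselective = reverse_inference_posterior(0.8, 0.6, 0.5)

print(round(selective, 3), round(unselective, 3))
```

With these assumed values, activity in the selective region raises the probability of the process from 0.5 to about 0.89, whereas the same activity in the unselective region raises it only to about 0.57. Observed activation is therefore informative about a cognitive process only to the extent that the region responds selectively to that process.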
A key development is currently underway in the combination of linguistic and neural models with machine-learning and artificial intelligence techniques (see Section 3.3), as well as ongoing interdisciplinary collaborations between speech scientists and neuroscientists worldwide. This will benefit further research using naturalistic speech and large-scale corpora of spoken language as well as the carefully controlled experimental paradigms and stimuli traditionally employed in psycholinguistic research. Models such as predictive coding keep spawning and constraining hypotheses regarding multiple facets of brain function (Friston, 2018), while the generation, content and temporal dynamics of predictions in the brain remain a fruitful subject of ongoing study: not just to answer how speech perception occurs, but also why it is usually so efficient.
David Deterding
Universiti Brunei Darussalam
David Deterding is a Professor at Universiti Brunei Darussalam. His research has involved the measurement of rhythm, description of the pronunciation of English in Singapore, Brunei and China, and the phonetics of Austronesian languages such as Malay, Brunei Malay, and Dusun.
Advisory Board
Bill Barry, Saarland University
Anne Cutler, Western Sydney University
Jette Hansen Edwards, Chinese University of Hong Kong
John Esling, University of Victoria
Ulrike Gut, Münster University
Jane Setter, Reading University
Marija Tabain, La Trobe University
Benjamin V. Tucker, University of Alberta
Weijing Zhou, Yangzhou University
Carlos Gussenhoven, Radboud University
About the Series
The Cambridge Elements in Phonetics series will generate a range of high-quality scholarly works, offering researchers and students authoritative accounts of current knowledge and research in the various fields of phonetics. In addition, the series will provide detailed descriptions of research into the pronunciation of a range of languages and language varieties. There will be Elements describing the phonetics of the major languages of the world, such as French, German, Chinese and Malay as well as the pronunciation of endangered languages, thus providing a valuable resource for documenting and preserving them.