1. Introduction
Our experience of the world consists of continuous streams of actions involving movements of people, objects and entities. Organizing these continuous streams into discrete event units and communicating about them with others is a core aspect of human cognition. How people communicate about motion events and how this is shaped by linguistic diversity have mostly been studied with a focus on linguistic encoding in speech (Slobin, 1996; Talmy, 1985). Yet, human communication typically occurs in face-to-face settings with an interactional exchange of multimodal signals (Holler & Levinson, 2019; Perniss, 2018). One of these multimodal signals is the hand gestures that accompany speech (Kendon, 2004; McNeill, 2005; Özyürek, 2017). People express motion events and their components using spontaneous gestures that have different affordances for representing and packaging event components, such as relying on iconic links between form and meaning to varying degrees (Kita et al., 2017). In this article, we review recent empirical evidence on multimodal encoding of motion to gain a deeper understanding of whether and how language typology shapes linguistic expressions in different modalities (i.e., in verbal and visual channels), how this changes across different sensory modalities of input (e.g., information perceived through the auditory versus the visual modality) and how it interacts with other aspects of cognition (e.g., event apprehension, memory). Our goal is to bring these lines of work together for the first time to enhance our understanding of event language and cognition from a multimodal perspective and to discuss how they expand the seminal work of Leonard Talmy on the typology of event integration (Talmy, 1975, 1985, 2000).
Motion events (e.g., a woman walking towards a bus stop) are central to everyday life and involve the displacement of an object known as the figure (the woman), with respect to a reference object known as the ground (the bus stop), along a trajectory or path (towards) and in a manner through which the motion unfolds (walking). Yet, languages differ in how they encode these semantic components of motion events. Talmy’s (2000) typology of event integration provides a useful framework for explaining how languages map complex event structures onto different syntactic categories. This framework classifies languages based on whether the core schematic event component – path in the case of motion events – is expressed in the main verb or in a satellite. Satellite-framed languages (e.g., English, Dutch) typically express manner of motion in the main verb and use satellites such as particles or prepositional phrases to express path of motion. As a result, in satellite-framed languages, path and manner of motion are mostly conflated in a single clause (e.g., she ran into the house). By contrast, verb-framed languages (e.g., Turkish, Spanish, Greek, Japanese) typically express path of motion in the main verb and supporting event components, such as manner of motion, in adverbial phrases or subordinate verbs. Therefore, in verb-framed languages, path and manner of motion are typically distributed across separate clauses (e.g., she entered the house running), and manner is more likely to be omitted from the event description (Slobin, 2003). Although Talmy’s (2000) typology of event integration captures both (intransitive) spontaneous motion events (e.g., walk, run, jump) and (transitive) caused motion events (e.g., hit, push, put), here we focus on the former, as the majority of the co-speech gesture work is on spontaneous motion events.
In recent years, a substantial amount of work has shown that speakers of these typologically different languages indeed express motion events adhering to these attested differences (Allen et al., 2007; Bohnemeyer et al., 2007; Gennari et al., 2002; Naigles et al., 1998; Papafragou et al., 2002, 2006; Slobin, 1996, 2006; see also articles in Bylund & Athanasopoulos, 2015; Ibarretxe-Antuñano, 2017). This work has shown that the typologies described above reflect the most frequent and typical patterns of linguistic expression used in narratives or short event descriptions across satellite-framed and verb-framed languages. However, deviations from these patterns have also been reported, since, for example, satellite-framed languages can also express path in the main verb (e.g., English: the car exited the garage) and verb-framed languages can express manner in the main verb (e.g., Turkish: kız içeri koştu – the girl ran inside; koş corresponding to run; Özçalışkan, 2015; Özçalışkan & Slobin, 2003). Yet, the use of such deviant patterns is limited, and speakers of these languages tend to conform to the typological patterns, as reflected in their most frequent usage patterns.
In addition to systematic differences in cross-linguistic encoding in speech, motion events are also an ideal test bed for investigating whether and how these differences are reflected in gestural encoding. This is because motion events involve rich visuospatial information, and gestures have modality-specific advantages for conveying visuospatial information – such as iconic gestures that exploit the similarity between the gesture form and the meaning of a referent. In fact, a core assumption shared by different models of gesture production is that gesture derives from visuospatial imagery (Sketch Model, de Ruiter, 2000; Postcard Model, de Ruiter, 2007; Gesture as Simulated Action Framework, Hostetter & Alibali, 2008, 2019; Information Packaging Hypothesis, Kita, 2000; Interface Model, Kita & Özyürek, 2003; Lexical Retrieval Hypothesis, Krauss et al., 2000; Growth Point Theory, McNeill, 1992; McNeill & Duncan, 2000). However, what is interesting from the perspective of linguistic encoding of motion events is that, according to the Interface Model of co-speech gesture production (Kita & Özyürek, 2003), gestures are generated through interactions between the linguistic conceptualization underlying speech production and the visuospatial imagery underlying gesture production. Through these interactions, co-speech gestures represent information following language-specific constraints on information packaging in the speech that they accompany. That is, each co-speech gesture is likely to express semantic information that is encoded within one processing unit (i.e., verbal clause) in speech.
This view is supported by cross-linguistic work showing that gestural encoding of event components differs in ways tightly linked to linguistic encoding in speech (Akhavan et al., 2017; Gullberg et al., 2008; Kita et al., 2017; Kita & Özyürek, 2003; Özçalışkan et al., 2016a, 2016b; Özyürek et al., 2005). However, the strict effect of language typology on multimodal encoding of motion events does not persist under all circumstances. In the sections that follow, we discuss empirical evidence on how multimodal expressions of motion in speech and gesture might interact with the effect of language typology and other aspects of cognition (e.g., visual attention and memory).
2. Cross-linguistic variability in encoding of motion events in speech and co-speech gesture
As mentioned already, one important consequence of the typological patterns in motion event encoding predicted by the event integration framework (Talmy, 2000) is the packaging of semantic components at the clausal level in speech. In satellite-framed languages, path and manner are tightly packaged in a single clause, whereas in verb-framed languages, path and manner are separated across two clauses and, in many cases, manner might be omitted because it is expressed outside the main verb, for example in a subordinate verb. In a seminal study that cross-linguistically tested the consequences of these typological patterns for gestural representations of motion events, speakers of a satellite-framed language, English, and two verb-framed languages, Turkish and Japanese, were asked to describe motion events (Kita & Özyürek, 2003). The results revealed that English speakers typically encoded path and manner of motion in a single verbal clause in speech and also tended to conflate path and manner in a single gesture. On the other hand, speakers of Turkish and Japanese typically distributed path and manner information across different clauses in speech and also tended to produce separate gestures for path and manner. Similar findings have been replicated across different languages and language pairs, such as English (Kita et al., 2007), Farsi (Akhavan et al., 2017), French (Gullberg et al., 2008), Turkish (Mamus et al., 2022, 2023; Ünal et al., 2022), Dutch–Turkish (ter Bekke et al., 2022), Turkish–English (Özçalışkan et al., 2016a, 2016b; Özyürek et al., 2005), Korean–English (Choi & Lantolf, 2008) and Japanese–English–Turkish (Kita & Özyürek, 2003). Thus, co-speech gesture often follows the typological patterns in motion event encoding defined by Talmy (2000).
Although the tight semantic link between speech and gesture is now well established, their semantic relation was initially depicted in a different way. McNeill and Duncan (2000) claimed that gestures serve a compensatory purpose for speech. They examined Spanish and English speakers’ motion event descriptions of Tweety cartoons and found that Spanish speakers – in line with their language typology – often omit manner in speech but depict it in gestures. On this view, gestures encode information additional to speech when this information is difficult to encode linguistically. However, later empirical work provided abundant evidence against the claims of McNeill and Duncan (2000) and showed that speech and gesture typically express similar content, as co-speech gestures appear with the element expressed in the main verb (Akhavan et al., 2017; Gullberg et al., 2008; Kita & Özyürek, 2003; Özyürek et al., 2005; ter Bekke et al., 2022; Ünal et al., 2022). For example, when English and Turkish speakers encode both path and manner, they package them in syntactically different ways both in speech and in co-speech gesture – that is, English speakers produce conflated gestures more frequently and Turkish speakers produce separate path or manner gestures more frequently. Nevertheless, when the same English and Turkish speakers encode only path or only manner of motion – that is, when they produce syntactically similar descriptions – their co-speech gestures, importantly, also look similar (Özyürek et al., 2005). Crucially, in cases where one motion event component is omitted from speech, the same semantic element is also omitted from co-speech gesture (see also Sümer & Özyürek, 2022). Furthermore, in verb-framed languages where path is expressed in the main verb, even when both path and manner are expressed in speech, people may express only path in gesture (French: Gullberg et al., 2008; Turkish: Mamus et al., 2022, 2023; Özçalışkan et al., 2016b, 2018; ter Bekke et al., 2022). A similar pattern is observed in Farsi, which has a mixed verb-framed and satellite-framed typology (Akhavan et al., 2017). In the study with Farsi speakers, participants typically encoded path in light verbs plus prepositions and manner in adverbs (e.g., corresponding to the girl came towards the tree in a running fashion) and produced gestures that expressed only path. These findings extend previously shown links between language-specific encoding in speech and co-speech gesture by showing that the semantic elements that can be packaged within the main verb in speech guide co-speech gesture production.
Thus, these findings also provide evidence against a mechanism whereby gesture is organized such that it conventionally expresses those motion event components that are not expressed in speech, contrary to the claims of McNeill and Duncan (2000).
3. Within-language variability and susceptibility of manner in multimodal encoding of motion events
While there is a tight semantic link between speech and co-speech gesture where cross-linguistic variability is concerned, studies also show some within-language variability in motion event encoding in speech, which has further consequences for co-speech gesture. These studies indicate that variation in the type of manner in the event might trigger variation in speech and gesture patterns.
One study with speakers of English has shown that the syntactic encoding of path and manner varies depending on how manner relates to path of motion (Kita et al., 2007). In that study, English speakers were more likely to tightly package path and manner in a single clause in speech and a single gesture when the manner was inherent to the change of location (e.g., a triangle jumping while going up an inclined surface). On the other hand, the same participants were more likely to use separate clauses expressing either path or manner of motion, together with path-only or manner-only gestures, when the manner was incidental to the change of location (e.g., a triangle rotating on its horizontal axis while going down into the water).
Further evidence on the influence of manner type on the mention of manner comes from cross-linguistic work. In a study comparing motion event descriptions of English and Greek speakers, both language groups described events involving predictable and unpredictable manners (Papafragou et al., 2006). When manner was predictable from the context (e.g., a man walking down the stairs), Greek speakers frequently omitted the manner of motion from their event descriptions, in line with the verb-framed typology, as these manners could be inferred even if not explicitly expressed in speech. However, when manner was unusual and not easily predictable (e.g., a man sliding down the stairs), Greek speakers were twice as likely to mention it as when it was a predictable manner. These findings converge with recent cross-linguistic evidence from speakers of Turkish and Dutch, showing that when describing spontaneous motion events involving a person changing location in non-default ways (e.g., twirling, skipping), both language groups mentioned manner more often than path (ter Bekke et al., 2022). Furthermore, in the same study, both Turkish and Dutch speakers were equally likely to gesture about path and manner. Thus, for Turkish speakers, these atypical manners possibly increased the frequency of mention of manner in speech, which in turn increased the frequency of manner gestures, eliminating previously shown manner omissions in verb-framed languages. This interpretation is corroborated by the findings of another study (Ünal et al., 2022) that used similar motion event stimuli, with the exception that the manner of motion was a rather typical way of changing location (e.g., walking, running). In that study, Turkish speakers mentioned both path and manner of motion in speech but often produced path-only gestures. Interestingly, recent evidence suggests that the expression of manner gestures is more susceptible to the influence of social context than that of path gestures in Korean and Catalan – both verb-framed languages; for example, speakers produced fewer manner gestures when they interacted with an unknown superior than when interacting with a friend (Brown et al., 2023).
Together, the studies reviewed in this and the previous section suggest that spoken descriptions of typical motion events are most likely to be characterized by the patterns defined by Talmy’s (2000) typology of event integration. These patterns in speech further influence the frequency, form and content of co-speech gestures, as in the case of path-only gestures accompanying path verbs in speech in verb-framed languages. However, other factors such as the saliency or typicality of the manner or pragmatic requirements can interact with the lexical or syntactic constraints on the expression of motion event components. For example, type of manner can influence syntactic choices within speakers of a satellite-framed language, which can override typological patterns (Kita et al., 2007). Furthermore, in verb-framed languages – such as Greek, Turkish, Korean and Catalan – manner expressions in event descriptions may be more sensitive and open to variation than path expressions. This may arise from the fact that manner is often optional and omittable in verb-framed languages (Slobin, 2003; Sümer & Özyürek, 2022). These patterns conform to but also extend Talmy’s (2000) typology of events and demonstrate the need to take into account the variability within the event itself.
4. Role of sensory modality and visual experience in multimodal encoding of motion events
Most studies reviewed so far have used visual stimuli – video clips, cartoons, line drawings and so on – to examine motion event expressions and their patterns (Akhavan et al., 2017; Gennari et al., 2002; Gullberg et al., 2008; Kita & Özyürek, 2003; Papafragou et al., 2002; Slobin et al., 2014; ter Bekke et al., 2022; Ünal et al., 2022). These studies have not taken into account whether these patterns might change depending on the modality of the input. The sensory modality of input may influence multimodal encoding of motion events, as each sensory modality has different perceptual affordances – for example, vision dominates in spatial perception despite the fact that auditory and haptic channels can also provide spatial information through cross-modal integration (Alais & Burr, 2004; Eimer, 2004; Thinus-Blanc & Gaunet, 1997).
One exception to the previous literature using visual stimuli is the work of Özçalışkan et al. (2016b, 2018). They conducted cross-linguistic studies to examine differences in the packaging of motion event elements in congenitally blind, sighted and blindfolded (sighted with covered eyes during the experiment) speakers of Turkish and English. They created static haptic scenes consisting of landmark objects (e.g., a toy house) and dolls in different postures to indicate the motion (e.g., a girl running into a house). Sighted participants observed the scenes without touching them, whereas blind and blindfolded participants explored the scenes through touch. The main goal of the study was to investigate how sighted and blind participants would package motion event elements at the syntactic level. The findings revealed that Turkish and English speakers packaged path and manner according to their language typology, and there were no differences between blind and non-blind speakers within a language group. Thus, blind and non-blind Turkish participants separated path and manner, while blind and non-blind English participants conflated them, both in speech and in co-speech gesture, as in the previously mentioned cross-linguistic work (e.g., Kita & Özyürek, 2003; Özyürek et al., 2005). These findings indicate similarities in the syntactic form of co-speech gestures across participants with different visual experience (i.e., blind versus non-blind participants). Thus, the authors suggested that language typology is the main factor that determines speech and gesture patterns. However, the role of input modality in multimodal descriptions was not an interest of Özçalışkan et al. (2016b); therefore, they did not report a direct comparison between the descriptions of blindfolded and sighted participants.
A recent study that systematically investigated the role of input modality in sighted individuals, however, showed that the sensory modality of input matters for the encoding of motion events in speech (Mamus et al., 2022). In that study, everyday motion events (e.g., someone running to an elevator) were presented as audio-only, visual-only or multimodal (audio+visual) stimuli. Turkish speakers who only listened to the events produced more path and fewer manner descriptions in speech compared to other Turkish speakers who watched the events with or without the audio. Therefore, compared to auditory input, visual input elicited manner expressions more than path expressions in a verb-framed language. Interestingly, though, the change in speech patterns was not reflected in co-speech gestures. Speakers predominantly produced path-only gestures regardless of the sensory modality of input. Thus, speech appears to be more sensitive to input modality than gesture, which adhered to the verb-framed typology of the language. These findings suggest that the sensory modality of input influences speakers’ encoding of motion events to some extent, apart from language typology (Slobin, 1996; Talmy, 2000).
Another study by Mamus et al. (2023) examined the role of (lack of) visual experience in the motion event descriptions of blind, blindfolded and sighted speakers of Turkish using auditory stimuli only. They compared how often blind and non-blind speakers encoded path and manner in their speech and co-speech gesture. Blind participants produced more path than manner expressions in their speech compared to sighted participants. In co-speech gesture, blind participants overall produced fewer path and fewer manner gestures than sighted participants, but path-only gestures were dominant across all groups. This suggests that long-term lack of visual experience might in fact change the expression of path and manner both in speech and in gesture.
Taken together, these findings reveal some similarities and differences in how the sensory modality of input and visual experience affect the expression of motion in speech and gesture, at least in speakers of a verb-framed language. Both temporary changes in the input modality available to sighted participants and long-term lack of visual experience influenced manner expression in speech. Participants expressed manner less frequently when they experienced motion events through audio as opposed to visual stimuli and when they lacked visual experience of the world in general. This concurs with the idea that linguistic encoding of manner is more susceptible to visual input than that of path – at least in a verb-framed language. Motion expressions in gesture were modulated by long-term visual experience – possibly by changing how sensory information was mapped onto the event construal over time – but not by temporary changes in the sensory modality of input. However, the packaging of path and manner in speech and co-speech gesture seems resistant to change regardless of (lack of) long-term visual experience. This highlights the importance of comparing different sensory modalities of input, as well as different aspects of multimodal event descriptions (e.g., frequency of mention in addition to packaging), to better understand whether and how sensory modality of input and long-term visual experience influence multimodal motion event descriptions. Furthermore, as in the work reviewed in the previous section, manner expression seems to be more susceptible to variability in input modality than path expression.
5. Interface between multimodal encoding of motion and cognition
So far, we have focused on variations in speech and gesture across speakers of different languages, as well as within speakers of a single language, driven by various factors such as the type of manner in which the motion event unfolds, the sensory modality of the stimuli and (lack of) long-term visual experience. We now turn to the interface between multimodal encoding of motion in speech and gesture and other aspects of cognition, such as visual attention during event apprehension and event memory.
The link between speech production and event apprehension is well established. While or before describing visual events, speakers attend to those aspects of the events that they (plan to) speak about (Gleitman et al., 2007; Griffin & Bock, 2000; Konopka & Meyer, 2014; Meyer et al., 1998; van de Velde et al., 2014). Such findings have been taken as evidence for a speech production model according to which speaking begins with a preverbal apprehension of an event that includes the people, objects, entities and spatial-temporal features involved in the event (Levelt, 1989). A cross-linguistic extension of this model, predominantly supported by empirical work in the domain of motion, is known as the thinking for speaking hypothesis (Slobin, 1996). In this view, speakers attend to the aspects of experience they plan to communicate about in ways consistent with how their language packages information at the lexical and syntactic level. Consistent with this possibility, cross-linguistic eye-tracking studies show that speakers of English and Greek attend to motion events differently in ways that parallel how their language expresses motion, but only prior to describing the events in speech (Papafragou et al., 2008; see also Bunger et al., 2012, 2021; Flecken et al., 2014; Sakarias & Flecken, 2019; Trueswell & Papafragou, 2010).
If the language specificity of gestures accompanying speech is an outcome of an interface between linguistic conceptualization and visuospatial imagery during message preparation (Kita & Özyürek, 2003), one would expect similar links between event apprehension and co-speech gesture production. This possibility was recently tested in an eye-tracking study with Turkish-speaking adults (Ünal et al., 2022). Participants watched videos of motion events while their eye movements were recorded. Once the video ended, they described the event to an addressee sitting across from them. The videos were constructed in such a way that the path and manner information relevant for linguistic descriptions of motion could be defined as separate areas of interest. As a first step, motion descriptions in speech and gesture were investigated. As predicted by Talmy’s (2000) typology of event integration, the majority of the descriptions included both path and manner of motion, with path of motion mostly expressed through path verbs. Furthermore, consistent with the predictions of the Interface Model (Kita & Özyürek, 2003), such spoken descriptions were frequently accompanied by gestures that expressed only path of motion.
Next, the relation between visual attention allocated to event components and the encoding of event components in speech and gesture was examined. Of interest was whether the additional encoding of path in gesture would be linked to even more visual attention allocated to path of motion during message preparation. The results were in line with this prediction: Turkish speakers allocated more attention to path of motion when their speech was accompanied by a path gesture compared to when they did not express any motion information in gesture. Crucially, these differences were found after controlling for the content of the motion descriptions in speech such that all descriptions expressed both path and manner. This suggests that the links between visual attention and gesture production emerged in addition to the links found between visual attention and speech production.
The study reviewed above demonstrates links between motion event descriptions in speech and gesture and event cognition before these descriptions are produced. Such links could also be observed after multimodal motion event descriptions are produced. A number of studies have explored this possibility by investigating whether encoding motion events in speech and gesture has consequences for memory for motion event components and whether this relation is modulated by language typology. We begin with studies testing the relation between motion event speech and memory and then discuss whether encoding motion event components in gesture has additional benefits for motion event memory.
In one study, adult English speakers had better motion event memory when their memory was tested after describing events than when they did not describe the events (Bunger et al., 2012). However, this study did not test whether the gains in memory accuracy come from mentioning specific motion event components in the description of that event. In another study examining the relation between motion event descriptions and memory more closely (Skordos et al., 2020), speakers of English and Greek viewed a set of motion event clips and were asked to produce a single verb describing each event. Immediately after this task, they saw another set of motion event clips and indicated whether or not these clips were the same as the ones they had seen in the production task. Verb production patterns were consistent with Talmy’s (2000) typology: English speakers were less likely to produce path verbs than Greek speakers, and Greek speakers were less likely to produce manner verbs than English speakers. In the memory task, participants had worse memory for manner of motion when they had used a path verb to describe the event. However, memory for path of motion was not predicted by the type of verb produced. Importantly, the relation between verb production and memory was not modulated by language typology.
The findings above are suggestive of a relation between expressing motion event components in speech and memory for those components; however, this work has an important limitation. Participants were required to use a single verb to describe the event, yet people typically describe events in complete utterances rather than in single verbs. This limitation was addressed by a recent study comparing another pair of typologically different languages: Turkish and Dutch (ter Bekke et al., 2022). Participants viewed and described videos of motion events to an addressee seated across from them. They were not given any instructions that would constrain the kind of descriptions they produced. Later, participants saw another set of motion events and indicated whether or not they had seen them before. Half of the events in this second set were the same as the ones they had seen before. Of the remaining events, half had a different path and the other half had a different manner. Neither Turkish nor Dutch speakers had manner memory that was above chance level. Therefore, the relation between motion event speech and memory could only be tested for path of motion. The results revealed that both Turkish and Dutch speakers were better at recognizing that the path of motion had changed when they had mentioned path of motion in speech. Furthermore, as in the previous study, the relation between motion event descriptions in speech and motion event memory was not modulated by language typology.
Together, these findings on the relation between motion event encoding in speech and memory indicate that encoding motion event components in speech is associated with better memory for those components. However, this relation between speech and memory is not further modulated by language typology. Instead, it seems to be characterized by patterns that generalize across speakers of typologically different languages. For example, memory for manner is worse than memory for path for speakers of both verb-framed and satellite-framed languages – Greek–English (Skordos et al., 2020), Turkish–Dutch (ter Bekke et al., 2022), English (Bunger et al., 2012), Greek–English (Papafragou et al., 2002) and Spanish–English (Gennari et al., 2002). All in all, these studies suggest that, in addition to some language-general tendencies, whether or not an event component is encoded in speech seems to be more critical for motion event memory than how it is encoded with regard to language typology.
The mechanism behind the relation between motion event encoding in speech and memory can be better understood when the relation between motion event encoding in speech and event apprehension is considered. Since the event components mentioned in spoken descriptions are allocated more visual attention and construed as part of the conceptualization of the event (Levelt, 1989; Papafragou et al., 2008), they may also be remembered better. As discussed above, gesturing about motion event components guides visual attention to those components (Ünal et al., 2022), and these effects emerge in addition to the effects of speech production on visual attention. An important question is whether similar links exist between motion event encoding in gesture and memory.
Contrary to this possibility, the study by ter Bekke et al. (2022) showed that gesturing about path of motion was not related to an improvement in path memory. Importantly, in this study, gestures were spontaneously produced, and when participants produced a path gesture, they typically also described path of motion in their speech. Therefore, the findings of this study speak to (the lack of) an additional advantage of gesture production over and above speech production for motion event memory, rather than to the relative benefits of motion event expressions in speech versus gesture. Evaluating the latter possibility requires further research focusing on non-redundant gestures that express motion event components not expressed in the accompanying speech. Finally, similar to the findings on the relation between motion event descriptions in speech and memory, language typology did not interact with the relation between motion event descriptions in gesture and motion event memory. Together, these findings indicate that even though language typology shapes multimodal event descriptions in speech and gesture, and visual attention to event components prior to producing those descriptions, it does not further modulate the relation between these event descriptions and the cognitive processes that follow, such as subsequent memory.
6. Discussion and conclusions
In this article, we reviewed a growing body of research on multimodal encoding of motion events in speech and gesture. We asked to what extent and under which conditions language typology shapes event descriptions, whether this changes across the modality of expression or the sensory modality of input, and how event descriptions in different modalities interface with other aspects of cognition, such as visual attention to events and memory for events. Our goal was to draw on the evidence from these different lines of research to broaden our understanding of language as a multimodal and multisensory phenomenon and to expand on the insights provided by Talmy’s typology of event integration.
Cross-linguistic evidence on multimodal expressions of the same motion events shows that language typology shapes motion event descriptions in speech and co-speech gesture in ways paralleling the patterns formulated by Talmy’s typology of event integration (1975, 1985, 2000). According to this typology, the informational content of a unit of processing in speech (i.e., a clause) differs in descriptions of motion events in satellite-framed and verb-framed languages. Similarly, the semantic elements expressed in gestures accompanying spoken motion event descriptions also differ across speakers of satellite-framed and verb-framed languages. This similarity is uniquely predicted by the Interface Model (Kita & Özyürek, 2003), according to which gestures are generated through an interface between visuospatial imagery and how information is packaged within a processing unit in the accompanying speech. Importantly, these cross-linguistic differences in co-speech gesture cannot be explained by other models of gesture production proposing that gestures are planned prior to and independently of linguistic formulation in speech (de Ruiter, 2000, 2007; Krauss et al., 2000).
Converging evidence for this view comes from recent cross-linguistic work examining descriptions of motion events with gesture only, in the absence of speech (i.e., silent gesture). For example, when asked to describe motion events with silent gestures, speakers of both Turkish and English conflated path and manner in a single gesture, unlike the typological patterns in Turkish (Özçalışkan et al., 2016a). In the same study, gestures produced along with speech differed cross-linguistically, following typological patterns (see also Özçalışkan et al., 2018, 2023). Thus, gestural representations of events reflect language typology, but only when gestures accompany speech. These findings corroborate the idea that the language specificity of co-speech gestures arises from interactions between linguistic conceptualization and visuospatial imagery during online language production.
Nevertheless, motion event expressions in speech and gesture are not shaped by language typology alone. Several studies have demonstrated that factors related to event structure, such as the type of manner in which the motion unfolds, the sensory input through which the event to be described is perceived, speakers’ lifetime experience with the visual world and pragmatic requirements can modulate how people express motion event components in speech. Moreover, under certain circumstances, variations in motion descriptions in speech further modulate the expression of the same event components in co-speech gestures. In both modalities, the expression of manner seems to be more susceptible to these influences than that of path, possibly because manner is a more peripheral event component. There has been a growing body of work investigating how the linguistic expression of core versus peripheral event components is modulated by linguistic, conceptual or pragmatic factors (Do et al., 2020, 2022; Grigoroglou & Papafragou, 2019; Ünal et al., 2021). The work reviewed here extends this line of work to motion expressions in the visual modality (i.e., co-speech gestures) and to the path and manner components of motion events.
Another conclusion emerging from the work reviewed here concerns the relation between multimodal motion event descriptions and other aspects of cognition. Prior to language production, language-specific encoding of motion event components in both speech and gesture guides the allocation of attention to event components. These findings support influential theories of speech production (Levelt, 1989) and the thinking for speaking hypothesis (Slobin, 1996) by showing that the way speakers pick up information from the visual world takes into account the lexical and structural constraints on how those aspects are expressed in language. These findings also provide evidence for the Interface Model (Kita & Özyürek, 2003) from the planning phase of multimodal language production by showing that gesture production interfaces with event conceptualization in ways similar to speech production.
However, after language production, how event descriptions interface with cognition differs for speech and gesture – at least for motion event memory. Motion event descriptions in speech, but not in gesture, predict memory for those event components mentioned in the descriptions. Even though the evidence reviewed here only concerns the domain of motion events, there is converging evidence from the domain of object locations (Karadöller et al., 2021, 2022). This work shows that memory for object locations is predicted by whether or not object locations are expressed in the descriptions of scenes, but whether object locations are described in the verbal/auditory modality (i.e., speech) or the visual modality (i.e., gesture or sign) does not matter for memory.
Unlike the relation between multimodal motion event descriptions and visual attention, the relation between speech and memory does not seem to be modulated by language typology. Furthermore, memory for motion events seems to be characterized by some cognitive biases that generalize across speakers of different languages. Memory for manner might be more fragile (Gennari et al., 2002; Papafragou et al., 2002; ter Bekke et al., 2022) and even affected by how path is expressed (Skordos et al., 2020), while the opposite is not true. These findings are consistent with the idea that path is a more central aspect of motion and manner is more peripheral – a distinction that is also central to Talmy’s typological classification of languages. Importantly, these findings on the relation between motion event speech and memory also cohere with cross-linguistic evidence on motion event descriptions in speech and gesture, for example, the path bias in gestures and the susceptibility of manner expressions to the influence of factors other than language typology.
6.1. Future directions and open questions
At present, several questions remain open for further research on this topic. One critical issue is the definition of path and manner verbs in motion events. As introduced previously, path verbs of motion specify a direction of motion without detailing how the motion occurs, whereas manner verbs of motion specify a manner of motion without specifying its precise direction. Manner also encompasses a wide range of aspects, including motor pattern (e.g., crawl, walk, run), attitude (e.g., stroll, amble, saunter) and rate (e.g., hurry, dash) (e.g., Slobin, 2004; Slobin et al., 2014). However, it is not always straightforward to decide whether a motion verb is a manner or a path verb, as there are verbs encoding manner and path together (e.g., Cifuentes-Férez, 2008; Slobin et al., 2014). For example, the verb fall (as well as sink) is considered a path verb in some studies but a manner verb in others. Support for fall as a path verb comes from the fact that fall is a change-of-location verb and involves no particular motor pattern, rate or attitude. Yet, it is also possible to claim that fall includes information about manner in a broad sense (such as descending with increasing momentum). Others argue that it is an in-between case that contains manner (in a broad sense) and path (downward motion) together. As this categorization may vary depending on the subject of interest, researchers should provide operational definitions of path and manner verbs in a given work, which is often not the case in the work discussed above.
The majority of the empirical evidence reviewed in this article comes from studies conducted with adults. In terms of development, it is widely accepted that children’s early motion event descriptions in speech are characterized by both language-specific and language-general patterns (Allen et al., 2007). However, the development of language-specific co-speech gestures is less well understood. Some findings indicate language specificity in speech and gesture around the same time (Özçalışkan, 2007; Özçalışkan et al., 2023), while others suggest that language-specific patterns develop later in gesture than in speech (Özyürek et al., 2008). Furthermore, the relation between speech and gesture may change throughout development, since children are typically more likely to express non-redundant information in co-speech gestures, for example, when describing caused motion events (Furman et al., 2014; Niu et al., 2022). Thus, developmental work can provide novel insights into language-specific influences on multimodal event descriptions in speech and gesture.
This article has mainly focused on motion event descriptions in a first language and how they interact with broader cognition. However, the majority of language users around the world are multilingual. There has been a growing body of research investigating how people adjust their event descriptions when speaking a typologically different second language (e.g., Aktan-Erciyes et al., 2020; Cadierno, 2008; Emerson et al., 2021; Hohenstein et al., 2006; Soroli et al., 2012), how speaking a typologically different second language interacts with motion descriptions in the first language (A. Brown & Gullberg, 2010, 2011; Emerson et al., 2021), and how bilingual experience interacts with other aspects of motion event cognition (e.g., Aktan-Erciyes et al., 2022). Future research should focus on how the factors discussed in the present article interact with bilingualism in shaping motion event expressions in speech and gesture.
The present discussion has focused on multimodal expression of literal motion, where people describe a physical change of location. However, motion verbs are also used to refer to non-physical motion when describing a metaphorical change of location (Özçalışkan, 2003, 2004; Lakoff & Johnson, 2008; see also Caballero, 2007, 2017; Johnson & Larson, 2003). A growing body of research strongly suggests that the typological patterns of verb-framed and satellite-framed languages in the expression of literal motion also extend to metaphorical uses of motion expressions (Özçalışkan, 2003, 2004, 2005). Nevertheless, existing work on the expression of metaphorical motion is based on written descriptions in magazines (Caballero, 2007, 2017) and novels (Özçalışkan, 2003, 2004) or on elicited descriptions in written form (Özçalışkan, 2005). Future work based on naturalistic face-to-face interactions can reveal whether these typological patterns are manifested in both spoken expressions and multimodal expressions in speech and co-speech gesture.
As discussed above, Talmy’s (2000) typology of event integration offers a binary classification of languages as verb- or satellite-framed based on whether path of motion is encoded within or outside the main verb. However, this classification fails to characterize languages where path and manner are expressed by equivalent linguistic elements. For example, languages with pervasive serial verb constructions allow two or more main verb slots in a single clause (e.g., pǎ chū – run exit in Mandarin Chinese) and cannot be classified as either verb- or satellite-framed based on the semantics of the main verb. This limitation of Talmy’s typology has been addressed by proposing a third category, known as equipollently-framed languages (Slobin, 2004), and empirical studies have provided support for the presence of this third typology (Chen & Guo, 2009; Guo & Chen, 2009). Nevertheless, little is known about gestural representations of motion events in equipollently-framed languages. One study by Brown and Chen (2013) showed that speakers of Mandarin Chinese frequently encoded manner in speech but did not tend to produce gestures that highlight manner, instead encoding path in their gestures (though this study did not focus on the expression of path in speech). However, evidence from co-speech gestures accompanying descriptions of other types of events shows that serial verb constructions tend to co-occur with single rather than multiple gestures (Defina, 2016; Niu et al., 2022). It remains to be seen how the factors discussed above, such as type of manner or sensory modality, shape motion event descriptions in equipollently-framed languages across spoken and gestural modalities.
Another direction for future research with important implications for application is the area of translation studies. A number of studies have documented the challenges of translating motion information across typologically different languages (Filipović, 2008; Hijazo-Gascón, 2019; Ibarretxe-Antuñano, 2004; Rojo López & Cifuentes-Férez, 2015; Slobin, 1996). For instance, when translating motion descriptions from satellite-framed languages into verb-framed languages, information about manner tends to be omitted or expressed less elaborately. However, since the available evidence comes from text-based translations, little is known about how such omissions shape expressions in gesture or how the information conveyed in the gestural modality gets translated. To address these issues, further work on multimodal translation of motion descriptions in naturalistic settings is required.
Finally, empirical investigations of Talmy’s (2000) typology of event integration that consider the multimodal nature of language have primarily been in the domain of spontaneous motion events, although there is also growing evidence on caused motion events (Furman et al., 2014; Niu et al., 2022). Future research should investigate whether the empirical conclusions drawn from the current lines of work generalize to other classes of events included in Talmy’s event integration framework, such as change-of-state events and caused motion events, as well as to other core schematic features, such as the temporal contours of the event.
6.2. Final conclusions
In conclusion, the present review strongly suggests that Talmy’s typology of event integration predicts multimodal event descriptions not only in speech but also in co-speech gesture. This typology also predicts visual attention to event components prior to producing these language-specific event descriptions. However, these influences of language typology may be overridden by variability within the event itself, such as the type and modality of the stimuli, especially for the expression of peripheral event components, such as manner. Together, the empirical evidence reviewed here confirms but also extends Talmy’s event integration framework.
Data availability statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the preparation of this article.
Competing interest
The authors declare no competing interests.