Clark and Fischer (C&F) provide a timely reminder that there is a large and underappreciated gap between the ambitions of social robotics and the actual social competence of robots (Park, Healey, & Kaniadakis, 2021). As they demonstrate, natural conversation presents complex challenges that go well beyond current engineering capabilities (see also Healey, 2021). Nonetheless, they also point to parallels in the ways in which people interact with each other and with social robots.
This commentary questions the ontological distinction underlying C&F's discussion. Specifically, does their account of depiction provide a principled basis for their argument that depictions of social agency fundamentally differ from actual social agency?
C&F discuss various examples of depictions of social agents, including Laurence Olivier's performance of Hamlet. Depiction in these examples is complex. The character – Hamlet – is based on a mixture of characters from earlier plays (possibly also Shakespeare's son); there are multiple versions of the text of Hamlet; different productions select different parts of those texts; different actors perform those parts differently; direction, costume, staging, and scenography vary; and so on. C&F embrace this complexity and use it to characterise the various ways in which people treat interaction with social robots as performance.
The problem, as we see it, is that C&F's account of depiction is so rich, encompassing so much of human social interaction, that the distinction between actual social agents and depictions of social agents dissolves. As C&F show, there are familiar contexts in which people perform a role, such as hotel receptionist, that also involve derived authority, particular communicative styles, and particular costumes and props. These roles are depictions, and successful interaction in these cases requires that we recognise and engage with the performance (Eshghi, Howes, & Gregoromichelaki, 2022). However, arguably, all human social interaction has these properties (Kempson, Cann, Gregoromichelaki, & Chatzikyriakidis, 2016). It was Goffman's (1959) insight that this kind of performative, depictive, dramaturgical description can be applied to any human social interaction.
When the receptionist in C&F's example (target article, sect. 8.1) switches to being someone who grew up in the same region as Clark, this is, in Goffman's terms, a switch from one kind of performed identity to another. It involves, for example, switching to certain kinds of community-specific knowledge, norms, and patterns of language use (see also Clark, 1996). People have multiple overlapping identities, all involving elements of depiction: different social repertoires, forms of authority, and conventions of interpretation. Moreover, it is unclear why such performances of identity involve depictions rather than indices to contextual features ("contextualisation cues") that transform the current situation into a new one in which the terms of the interaction have changed.
Despite this, we share the intuition that the features of interaction that C&F highlight are important. However, the crucial role that they assign to inference and pretense seems uncharacteristically individualistic, presenting the role of potentially sophisticated robots as passive and ignoring the efforts people make to scaffold the interaction. Our suggestion is that one way to retain a meaningful, explanatory role for depictions is to abandon the assumption of any fundamental discontinuity between authentic and performed social agency and, instead, look at how depiction functions in interaction. Specifically, depictions are used as a means of transforming the relation between interlocutors when social performances threaten to break down: they provide a way to account for the gap between a represented social role and the role invoked to explain the performative failure. Returning to C&F's receptionist example, the inability to provide local hotel information leads to the discovery of the receptionist's actual location, which prompts the conversation to switch from "customer"–"receptionist" to "people from Rapid City."
Not all failures emerge at the level of social performance. When we encounter contemporary social robots, there are a variety of ways in which things can go wrong and a variety of stances we can take to explain the failure (cf. Dennett, 1987). We quickly discover the limitations of robots' social affordances, and this forces us to reason about, for example: who made this thing (authority)? What is it supposed to do (intention/character)? Is there a hardware failure (base scene)? This applies equally to humans and robots: We sometimes invoke problems with authority (e.g., someone is too junior or too young to answer), intention (e.g., deceit), or hardware (e.g., someone can't hear, or is too drunk).
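To make this taxonomy of explanatory stances concrete, here is a minimal sketch of how one might represent it when annotating interaction breakdowns. This is our illustration, not C&F's proposal; the labels, class names, and example observations are all hypothetical:

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureStance(Enum):
    """Stances invoked to explain an interaction failure (labels are ours)."""
    AUTHORITY = auto()   # who made/sanctioned this agent? (e.g., too junior to answer)
    INTENTION = auto()   # what is it supposed to do? (character, deceit)
    BASE_SCENE = auto()  # physical/hardware failure (can't hear, broken sensor)

@dataclass
class FailureAttribution:
    """One observed breakdown and the stance cited to explain it."""
    description: str
    stance: FailureStance

# Hypothetical annotations of breakdowns in a single interaction:
log = [
    FailureAttribution("could not provide local hotel information", FailureStance.INTENTION),
    FailureAttribution("microphone cut out mid-turn", FailureStance.BASE_SCENE),
]
for item in log:
    print(f"{item.stance.name}: {item.description}")
```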
There are some empirical advantages to approaching depiction in this way. It restricts the range of possible depictions to those actually cited to account for disruptions to interaction, rather than the indefinitely many possible forms of social depiction we could imagine. It also provides an index of social competence: the relative frequency with which we invoke interactive depictions rather than, for example, hardware problems provides a measure of how sophisticated a social agent is. Embarrassment accompanies the failure of social roles (Goffman, 1967) and involves characteristic displays such as blushing, averted gaze, face touching, smiling, and laughter. Unlike shame, embarrassment also directly implicates other participants in a coordinated understanding of what has failed, how it failed, and how to recover from it. Interestingly, robots are not currently designed to systematically recognise or produce signals of embarrassment (Park et al., 2021).
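As a rough illustration of this proposed index, the sketch below computes the relative frequency with which an agent's breakdowns are explained at the social/depictive level rather than as, say, hardware faults. The function name and stance labels are our own hypothetical choices, not an established measure:

```python
from collections import Counter

def sophistication_index(attributions):
    """Proportion of breakdowns explained as failures of social depiction
    rather than, e.g., hardware faults; higher values suggest observers
    treat the agent as a more sophisticated social participant."""
    counts = Counter(attributions)
    total = sum(counts.values())
    if total == 0:
        return None  # no observed breakdowns, so no basis for an index
    return counts["depiction"] / total

# Hypothetical annotations: one depiction-level and two hardware-level attributions.
print(sophistication_index(["depiction", "hardware", "hardware"]))  # ≈ 0.33
```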
Our assumption is that what makes an "authentic" social interaction is the ability to detect and recover from failure – something in principle achievable by machines. Machines can participate in interactions in which cognitive abilities are distributed across multiple agents and each can compensate for the failures or inadequacies of the other. The centrality of miscommunication (and of the ability to recover from it) in human–human interaction (Healey, de Ruiter, & Mills, 2018; Howes & Eshghi, 2021) follows from the observation that we never share exactly the same language, skills, or information as anyone we nevertheless successfully interact with (Clark, 1996). This is obvious in, for example, parent–child or expert–non-expert interactions, but is arguably characteristic of all social exchanges, including interactions with social robots. At present the possibilities for divergence may be broader and lie along different dimensions, but this is not, we argue, a difference in kind.
Financial support
Christine Howes was supported by two grants from the Swedish Research Council (VR): 2016-0116, Incremental Reasoning in Dialogue (IncReD), and 2014-39, for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at the University of Gothenburg. Purver received financial support from the Slovenian Research Agency via research core funding for the programme Knowledge Technologies (P2-0103) and the project Sovrag (Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration, J5-3102), and from the UK EPSRC via the project Sodestream (Streamlining Social Decision Making for Improved Internet Standards, EP/S033564/1).
Competing interest
None.