Automated surveillance of human activities has traditionally been a computer vision field interested in the recognition of motion patterns and in the production of high-level descriptions for actions and interactions among entities of interest (Cedras & Shah, 1995; Aggarwal & Cai, 1999; Gavrila, 1999; Moeslund, Hilton, & Krüger, 2006; Buxton, 2003; Hu et al., 2004; Turaga et al., 2008; Dee & Velastin, 2008; Aggarwal & Ryoo, 2011; Borges, Conci, & Cavallaro, 2013). The study of human activities has been revitalized in the last five years by addressing the so-called social signals (Pentland, 2007). In fact, these nonverbal cues, inspired by the social, affective, and psychological literature (Vinciarelli, Pantic, & Bourlard, 2009), have allowed a more principled understanding of how humans act and react to other people and to their environment.
Social Signal Processing (SSP) is the scientific field that performs a systematic, algorithmic, and computational analysis of social signals, drawing significant concepts from anthropology and social psychology (Vinciarelli et al., 2009). In particular, SSP does not stop at modeling human activities, but aims at coding and decoding human behavior. In other words, it focuses on unveiling the underlying hidden states that drive one to act in a distinct way with particular actions. This challenge is supported by decades of investigation in the human sciences (psychology, anthropology, sociology, etc.) that have shown how humans use nonverbal behavioral cues, such as facial expressions, vocalizations (laughter, fillers, back-channel, etc.), gestures, or postures, to convey, often outside conscious awareness, their attitude toward other people and social environments, as well as their emotions (Richmond & McCroskey, 1995). Understanding these cues is thus paramount to understanding the social meaning of human activities.
The formal marriage of automated video surveillance with Social Signal Processing had its programmatic start during SISM 2010 (the International Workshop on Socially Intelligent Surveillance and Monitoring; http://profs.sci.univr.it/~cristanm/SISM2010/), associated with the IEEE Computer Vision and Pattern Recognition conference. At that venue, the discussion focused on what kind of social signals can be captured in a generic surveillance scenario, detailing the specific scenarios where the modeling of social aspects could be the most beneficial.
After 2010, SSP hybridizations with surveillance applications have grown rapidly in number, and systematic essays on the topic have started to appear in the computer vision literature (Cristani et al., 2013).