This chapter describes several probabilistic techniques for representing, recognizing, and generating spatiotemporal configuration sequences. We first describe how such techniques can be used to visually track and recognize lip movements in order to augment a speech recognition system. We then demonstrate additional techniques that can be used to animate video footage of talking faces and synchronize it to different sentences of an audio track. Finally, we outline the alternative low-level representations needed to apply these techniques to articulated body gestures.
Introduction
Gestures can be described as characteristic configurations over time. While uttering a sentence, we express fine-grained verbal gestures as complex lip configurations over time, and while performing bodily actions, we generate articulated configuration sequences of jointed arm and leg segments. Such configurations lie in constrained subspaces, and different gestures are embodied as different characteristic trajectories through these constrained subspaces.
We present a general technique, called Manifold Learning, that estimates such constrained subspaces from example data. We apply this technique to tracking, recognition, and interpolation, and we estimate characteristic trajectories through the learned subspaces using Hidden Markov Models. We demonstrate the utility of these techniques in the domain of visual-acoustic recognition of continuously spelled letters.
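As a concrete illustration of this pipeline, the sketch below estimates a constrained subspace from example configuration vectors with PCA and models each gesture class's trajectory through that subspace with a Gaussian HMM. It is only a minimal stand-in for the approach described here: the choice of PCA, the hmmlearn library, and all names and dimensions are assumptions made for illustration.

```python
# Minimal sketch: learn a constrained subspace from example configuration
# vectors (e.g. lip-shape parameters) via PCA, then model characteristic
# trajectories through that subspace with one Gaussian HMM per gesture class.
# Illustrative only; names, dimensions, and the PCA/hmmlearn choices are
# assumptions, not the chapter's actual implementation.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # external dependency, assumed available

def learn_subspace(examples, n_dims=5):
    """PCA 'manifold' estimate: mean plus the top principal directions."""
    mean = examples.mean(axis=0)
    _, _, vt = np.linalg.svd(examples - mean, full_matrices=False)
    return mean, vt[:n_dims]            # basis spanning the constrained subspace

def project(frames, mean, basis):
    """Map raw configuration vectors into subspace coordinates."""
    return (frames - mean) @ basis.T

def train_gesture_model(sequences, mean, basis, n_states=4):
    """Fit an HMM to the subspace trajectories of one gesture class."""
    coords = [project(seq, mean, basis) for seq in sequences]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(np.vstack(coords), lengths=[len(c) for c in coords])
    return hmm

def recognize(frames, mean, basis, models):
    """Pick the gesture whose HMM assigns the new trajectory the highest likelihood."""
    coords = project(frames, mean, basis)
    return max(models, key=lambda label: models[label].score(coords))
```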
We also show how visual-acoustic lip and facial feature models can be used for the inverse task: facial animation. For this domain we developed a modified tracking technique and a different lip interpolation technique, as well as a more general decomposition of visual speech units based on visemes.
Using computers to watch human activity has proven to be a research area that not only has a large number of potentially important applications (in surveillance, communications, health, etc.) but has also led to a variety of new, fundamental problems in image processing and computer vision. In this chapter we review research conducted at the University of Maryland during the past five years on various topics involving the analysis of human activity.
Introduction
Our interest in this general area started with consideration of the problem of how a computer might recognize a facial expression from the changing appearance of the face displaying the expression. Technically, this led us to address the problem of how the non-rigid deformations of facial features (eyes, mouth) could be accurately measured even while the face was moving rigidly.
In Section 10.2 we discuss our solution to this problem. Our approach, in which the rigid head motion is estimated and used to stabilize the face so that the non-rigid feature motions can be recovered, naturally led us to consider the problem of head gesture recognition. Section 10.3 discusses two approaches to recognizing head gestures, both of which employ the rigid head motion descriptions estimated in the course of recognizing expressions.
The ability to track a face in a video stream opens up new possibilities for human computer interaction. Applications range from head gesture-based interfaces for physically handicapped people, to image-driven animated models for low bandwidth video conferencing. Here we present a novel face tracking algorithm which is robust to partial occlusion of the face. Since the tracker is tolerant of noisy, computationally cheap feature detectors, frame-rate operation is comfortably achieved on standard hardware.
Introduction
The ability to detect and track a person's face is potentially very powerful for human-computer interaction. For example, a person's gaze can be used to indicate something, in much the same manner as pointing. One can envisage a window manager which automatically shuffles to the foreground whichever window the user is looking at [153, 152]. Gaze aside, the head position and orientation can be used for virtual holography [14]: as the viewer moves around the screen, the computer displays a different projection of a scene, giving the illusion of holography. Another application lies in low-bandwidth video conferencing: live images of a participant's face can be used to guide a remote, synthesised “clone” face which is viewed by the other participants [180, 197]. A head tracker could also provide a very useful computer interface for physically handicapped people, some of whom can only communicate using head gestures. With an increasing number of desktop computers being supplied with video cameras and framegrabbers as standard (ostensibly for video mail applications), it is becoming both useful and feasible to track the computer user's face.
Present day Human–Computer Interaction (HCI) revolves around typing at a keyboard, moving and pointing with a mouse, selecting from menus and searching through manuals. These interfaces are far from ideal, especially when dealing with graphical input and trying to visualise and manipulate three-dimensional data and structures. For many, interacting with a computer is a cumbersome and frustrating experience. Most people would prefer more natural ways of dealing with computers, closer to face-to-face, human-to-human conversation, where talking, gesturing, hearing and seeing are part of the interaction. They would prefer computer systems that understand their verbal and non-verbal languages.
Since its inception in the 1960s, the emphasis of HCI design has been on improving the “look and feel” of a computer by incremental changes, to produce keyboards that are easier to use, and graphical input devices with bigger screens and better sounds. The advent of low-cost computing and memory, and of computers equipped with tele-conferencing hardware (including cameras mounted above the display), means that video and audio input are available at little additional cost. It is now possible to conceive of more radical changes in the way we interact with machines: of computers that are listening to and looking at their users.
Progress has already been achieved in computer speech synthesis and recognition [326]. Promising commercial products already exist that allow natural speech to be digitised (at 32 kilobytes per second) and processed by computer for use in dictation systems.
This chapter introduces the Human Reader project and some research results of human-machine interfaces based on image sequence analysis. Real-time responsive and multimodal gesture interaction, which is not an easily achieved capability, is investigated. Primary emphasis is placed on real-time responsive capability for head and hand gestural interaction as applied to the project's Headreader and Handreader. Their performances are demonstrated in experimental interactive applications, the CG Secretary Agent and the FingerPointer. Next, we focus on facial expression as a rich source of nonverbal message media. A preliminary experiment in facial expression research using an optical-flow algorithm is introduced to show what kind of information can be extracted from facial gestures. Real-time responsiveness is left to subsequent research work, some of which is introduced in other chapters of this book. Lastly, new directions in vision-based interface research are briefly addressed based on these experiences.
Introduction
Human body movement plays a very important role in our daily communications. Such communications include not merely human-to-human interactions but also human-to-computer (and other inanimate object) interactions. We can easily infer people's intentions from their gestures. I believe that a computer possessing “eyes,” in addition to a mouse and keyboard, would be able to interact with humans in a smooth, enhanced and well-organized way by using visual input information. If a machine can sense and identify an approaching user, for example, it can load his/her personal profile and prepare the necessary configuration before he/she starts to use it.
Bayesian approaches have enjoyed a great deal of recent success in their application to problems in computer vision (Grenander, 1976-1981; Bolle & Cooper, 1984; Geman & Geman, 1984; Marroquin et al., 1985; Szeliski, 1989; Clark & Yuille, 1990; Yuille & Clark, 1993; Madarasmi et al., 1993). This success has led to an emerging interest in applying Bayesian methods to modeling human visual perception (Bennett et al., 1989; Kersten, 1990; Knill & Kersten, 1991; Richards et al., 1993). The chapters in this book represent to a large extent the fruits of this interest: a number of new theoretical frameworks for studying perception and some interesting new models of specific perceptual phenomena, all founded, to varying degrees, on Bayesian ideas. As an introduction to the book, we present an overview of the philosophy and fundamental concepts which form the foundation of Bayesian theory as it applies to human visual perception. The goal of the chapter is two-fold: first, to serve as a tutorial on the basics of the Bayesian approach for readers who are unfamiliar with it, and second, to characterize the type of theory of perception the approach is meant to provide. The latter topic, by its meta-theoretic nature, is necessarily subjective. This introduction represents the views of the authors in this regard, not necessarily those held by other contributors to the book.
First, we introduce the Bayesian framework as a general formalism for specifying the information in images which allows an observer to perceive the world.
A task of visual perception is to find the scene which best explains visual observations. Figure 9.1 can be used to illustrate the problem of perception. The visual data is the image held by two cherubs at the right. Scattered in the middle are various geometrical objects – “scene interpretations” – which may account for the observed data. How does one choose between the competing interpretations for the image data?
One approach is to find the probability that each interpretation could have created the observed data. Bayesian statistics provide a powerful tool for this; see, e.g., Geman & Geman (1984), Jepson & Richards (1992), Kersten (1991), and Szeliski (1989). One expresses prior assumptions as probabilities and calculates for each interpretation a posterior probability, conditioned on the visual data. The best interpretation may be the one with the highest probability density, or a more sophisticated criterion may be used. Other computational techniques, such as regularization (Poggio et al., 1985; Tikhonov & Arsenin, 1977), can be posed in a Bayesian framework (Szeliski, 1989). In this chapter, we will apply the powerful assumption of “generic view” in a Bayesian framework. This will lead us to an additional term from Bayesian theory involving the Fisher information matrix. (See also Chapter 7 by Blake et al.) This will modify the posterior probabilities to give additional information about the scene.
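As a toy numerical version of this comparison, the sketch below assigns each candidate interpretation a prior and a likelihood of generating the observed image, and normalizes their product into a posterior. The particular interpretations and numbers are invented for illustration and are not drawn from the chapter.

```python
# Toy illustration of the Bayesian comparison described above: each candidate
# scene interpretation gets a prior and a likelihood of producing the observed
# image, and the posterior is their normalized product. The numbers are
# invented for illustration only.
import numpy as np

interpretations = ["cube", "flat polygon", "accidental wire frame"]
prior      = np.array([0.50, 0.45, 0.05])   # p(scene): prior assumptions
likelihood = np.array([0.60, 0.10, 0.90])   # p(image | scene) for the observed data

posterior = prior * likelihood
posterior /= posterior.sum()                 # Bayes' rule: p(scene | image)

for name, p in zip(interpretations, posterior):
    print(f"{name}: {p:.3f}")

# The maximum a posteriori (MAP) choice is the simplest decision criterion
# mentioned above; more sophisticated criteria could be used instead.
print("MAP choice:", interpretations[int(np.argmax(posterior))])
```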
Texture cues in the image plane are a potentially rich source of surface information available to the human observer. A photograph of a cornfield, for example, can give a compelling impression of the orientation of the ground plane relative to the observer. Gibson (1950) designed the first experiments to test the ability of humans to use this texture information in their estimation of surface orientation. Since that time, various authors have proposed and tested hypotheses concerning the relative importance of different visual cues in human judgements of shape from texture (Cutting & Millard, 1984; Todd & Akerstrom, 1987). This work has generally relied on a cue conflict paradigm in which one cue is varied while the other is held constant. This is potentially problematic, since surfaces with conflicting texture cues do not occur in nature. It is possible that in a psychophysical experiment our visual system might employ a different mechanism to resolve the cue conflict condition. We show in this paper that the strength of individual texture cues can be measured and compared with an ideal observer model without resorting to a cue conflict paradigm.
Ideal observer models for estimation of shape from texture have been described by Witkin (1981), Kanatani & Chou (1989), Davis et al. (1983), Blake & Marinos (1990), and Marinos & Blake (1990). Given basic assumptions about the distribution and orientation of texture elements, an estimate of surface orientation can be obtained, together with crucial information about the reliability of the estimate.
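The following toy sketch illustrates the flavour of such an estimate under a simple isotropy assumption: if texture elements are circular on the surface and projection is approximately orthographic, a projected element's aspect ratio is roughly the cosine of the surface slant, so a set of measured aspect ratios yields both a slant estimate and a measure of its reliability. This is a simplified stand-in, not a reimplementation of the cited models.

```python
# Toy ideal-observer-style estimate of surface slant from texture
# foreshortening, assuming isotropic (circular) texture elements and
# orthographic projection, so each element's projected aspect ratio is
# approximately cos(slant). Simplified illustration only.
import numpy as np

def estimate_slant(aspect_ratios):
    """Return (slant_estimate_deg, standard_error_deg) from element aspect ratios."""
    ratios = np.clip(np.asarray(aspect_ratios, dtype=float), 0.0, 1.0)
    slants = np.degrees(np.arccos(ratios))                   # one slant guess per element
    estimate = slants.mean()
    reliability = slants.std(ddof=1) / np.sqrt(len(slants))  # standard error of the mean
    return estimate, reliability

# Example: noisy aspect ratios measured from imaged texture elements.
measured = [0.49, 0.52, 0.47, 0.55, 0.50, 0.48]
slant, se = estimate_slant(measured)
print(f"estimated slant = {slant:.1f} deg +/- {se:.1f} deg")  # roughly 60 deg here
```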
By the late eighties, the computational approach to perception advocated by Marr (1982) was well established. In vision, most properties of the 2½-D sketch, such as surface orientation and 3D shape, admitted solutions, especially for machine vision systems operating in constrained environments. Similarly, tactile and force sensing was rapidly becoming a practicality for robotics and prostheses. Yet in spite of this progress, it was increasingly apparent that machine perceptual systems were still enormously impoverished versions of their biological counterparts. Machine systems simply lacked the inductive intelligence and knowledge that allowed biological systems to operate successfully over a variety of unspecified contexts and environments. The role of “top-down” knowledge was clearly underestimated and was much more important than precise edge, region, “textural”, or shape information. It was also becoming obvious that even when adequate “bottom-up” information was available, we did not understand how this information should be combined from the different perceptual modules, each operating under its often quite different and competing constraints (Jain, 1989). Furthermore, what principles justified the choice of these “constraints” in the first place? Problems such as these all seemed to be subsumed under a lack of understanding of how prior knowledge should be brought to bear upon the interpretation of sensory data. Of course, this conclusion came as no surprise to many cognitive and experimental psychologists (e.g. Gregory, 1980; Hochberg, 1988; Rock, 1983), or to neurophysiologists who were exploring the role of massive reciprocal descending pathways (Maunsell & Newsome, 1987; Van Essen et al., 1992).
The previous chapters have demonstrated the many ways one can use a Bayesian formulation for computationally modeling perceptual problems. In this chapter, we look at the implications of a Bayesian view of visual information processing for investigating human visual perception. We will attempt to outline the elements of a general program of empirical research which results from taking the Bayesian formulation seriously as a framework for characterizing human perceptual inference. A major advantage of following such a program is that it supports a strong integration of psychophysics and computational theory, since its structure is the same as that of the Bayesian framework for computational modeling. In particular, it provides the foundation for a psychophysics of constraints, used to test hypotheses about the quantitative and qualitative constraints used in human perceptual inferences. The Bayesian approach also suggests new ways to conceptualize the general problem of perception and to decompose it into isolatable parts for psychophysical investigation. Thus, it not only provides a framework for modeling solutions to specific perceptual problems; it also guides the definition of the problems.
The chapter is organized into four major sections. In the next section, we develop a framework for characterizing human perception in Bayesian terms and analyze its implications for studying human perceptual performance. The third and fourth sections of the chapter apply the framework to two specific problems: the perception of 3-D shape from surface contours and the perception of 3-D object motion from cast shadow motion.
By B.M. Bennett (University of California at Irvine), D.D. Hoffman (University of California at Irvine), C. Prakash (California State University), and S.N. Richman (University of California at Irvine).
The search is on for a general theory of perception. As the papers in this volume indicate, many perceptual researchers now seek a conceptual framework and a general formalism to help them solve specific problems.
One candidate framework is “observer theory” (Bennett, Hoffman, & Prakash, 1989a). This paper discusses observer theory, gives a sympathetic analysis of its candidacy, describes its relationship to standard Bayesian analysis, and uses it to develop a new account of the relationship between computational theories and psychophysical data. Observer theory provides powerful tools for the perceptual theorist, psychophysicist, and philosopher. For the theorist it provides (1) a clean distinction between competence and performance, (2) clear goals and techniques for solving specific problems, and (3) a canonical format for presenting and analyzing proposed solutions. For the psychophysicist it provides techniques for assessing the psychological plausibility of theoretical solutions in the light of psychophysical data. And for the philosopher it provides conceptual tools for investigating the relationship of sensory experience to the material world.
Observer theory relates to Bayesian approaches as follows. In Bayesian approaches to vision one is given an image (or small collection of images), and a central goal is to compute the probability of various scene interpretations for that image (or small collection of images). That is, a central goal is to compute a conditional probability measure, called the “posterior distribution”, which can be written p(Scene | Image) or, more briefly, p(S | I).
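For readers new to the notation, Bayes' rule expresses this posterior in terms of a likelihood and a prior over scenes (a standard identity, stated here only to fix notation):

```latex
p(S \mid I) \;=\; \frac{p(I \mid S)\, p(S)}{p(I)}
            \;=\; \frac{p(I \mid S)\, p(S)}{\sum_{S'} p(I \mid S')\, p(S')}
```

Here p(S) encodes the prior assumptions about scenes and p(I | S) is the likelihood that scene S would produce the observed image I.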
The term “Pattern Theory” was introduced by Ulf Grenander in the 1970s as a name for a field of applied mathematics which gave a theoretical setting for a large number of related ideas, techniques and results from fields such as computer vision, speech recognition, image and acoustic signal processing, pattern recognition and its statistical side, neural nets and parts of artificial intelligence (see Grenander, 1976-81). When I first began to study computer vision about ten years ago, I read parts of this work but did not really understand his insight. However, as I worked in the field, every time I felt I saw what was going on in a broader perspective, or saw some theme which seemed to pull together the field as a whole, it turned out that this theme was part of what Grenander called pattern theory. It seems to me now that this is the right framework for these areas, and, as these fields have been growing explosively, the time is ripe for making an attempt to reexamine recent progress and try to make the ideas behind this unification better known. This article presents pattern theory from my point of view, which may be somewhat narrower than Grenander's, updated with recent examples involving interesting new mathematics.
When we see objects in the world, what we actually “see” is much more than the retinal image. Our perception is three-dimensional. Moreover, it reflects constant properties of the objects and the environment, regardless of changes in the retinal image with varying viewing condition. How does the visual system make this possible?
Two different approaches have been evident in the study of visual perception. One approach, most successful in recent times, is based on the idea that perception emerges automatically by some combination of neuronal receptive fields. In the study of depth perception, this general line of thinking has been supported by psychophysical and physiological evidence. The “purely cyclopean” perception in Julesz's random dot stereogram (Julesz, 1960) shows that depth can emerge without the mediation of any higher order form recognition. This suggested that relatively local disparity-specific processes could account for the perception of a floating figure in an otherwise camouflaged display. Corresponding electrophysiological experiments using single cell recordings demonstrated that the depth of such stimuli could be coded by neurons in the visual cortex receiving input from the two eyes (Barlow et al., 1967; Poggio & Fischer, 1977). In contrast to this more modern approach, there exists an older tradition which asserts that perception is inferential, that it can cleverly determine the nature of the world from limited image data. Starting with Helmholtz's unconscious inference (Helmholtz, 1910) and continuing with more recent formulations such as Gregory's “perceptual hypotheses”, this approach stresses the importance of problem solving in the process of seeing (Hochberg, 1981; Gregory, 1970; Rock, 1983).