1. Introduction
The rapid technological advances in robotics and deep learning have facilitated the development of robots that operate alongside humans. Numerous applications require humans and robots to co-exist, including robotic guides, robot slaves and humans and robots moving collectively as a group. The common scenarios correspond to a human following a robot, a robot following another robot, a robot moving towards a goal with a human following it, or a robotic convoy guiding a human group. Modern robots already come with a sophisticated navigation stack that enables them to move autonomously from one place to another. The addition of humans poses the new problem of human tracking, since robots and humans both move in real time. Knowledge of the position of the human enables navigation decision-making for the robot, including the intention to move towards or away from the human while remaining generally directed towards a goal. Tracking helps to anticipate the future positions of the robots and humans in real time for navigation decision-making. Specifically, the anticipated trajectories are very useful for autonomous social robot navigation, human tracking, traffic monitoring, and human-robot and robot-human following behaviours [Reference Schreier, Willert and Adamy1, Reference Deo, Rangesh and Trivedi2].
In this paper, a framework is presented for human tracking in applications where a human follows the robot. The robot moves towards its goal, covering some distance, while the human follows it. The problem considered is that of a robot taking a human on a tour; the robot therefore knows the human who is following it. The robot can use face recognition techniques to identify the human to track if there are multiple people at its rear. Here, a socialistic particle filter framework is presented to track the human who is following the robot.
Estimation of the human position in a dynamic system with noisy measurements is a very active area of research in computer vision and social robot navigation. Over the past decades, the particle filter has been applied to different types of state prediction problems, including visual tracking [Reference Zhang, Liu, Xu, Liu and Yang3], human tracking [Reference Chen, Cui, Kong, Guo and Cao4] and social robot localization [Reference Luo and Qin5]. The particle filter represents the probability density function with a collection of samples. The complexity of the particle filter is directly proportional to the number of samples used for prediction. Particle filters have become popular in robotics because a limited number of particles gives a reasonable computation time, while the particles remain concentrated on the region of interest.
In the proposed application, the robot moves in dense indoor scenarios where the shape and area of the workspace are not consistent. The average linear and angular velocities of the robot therefore change drastically, which causes sharp deviations in the human’s velocity and position. Most conventional tracking algorithms smooth out such behaviours, which leads to erroneous results. The case of the robot making a sharp turn is particularly interesting. If tracking is done in the image plane, a sharp robot turn produces a very sharp change in the position of the person in the image, an effect that most conventional trackers do not account for. The proposed model hence tracks the human in 3D real-world coordinates, using a socialistic motion model that accurately predicts the behaviour of the human, knowing the motion made by the robot.
Current applications rely on expensive socialistic robots that come with a lot of autonomy. However, if robots are to be widely used, it is important to keep them on a budget. Owing to advances in deep learning, it is possible to use low-end vision cameras to reach the goal while avoiding obstacles. Further, current research makes it feasible to localize the robot using affordable odometry systems with special markers for long-term correction. In this paper, tracking and localization of moving people are carried out using budget sensors. The tracking is demonstrated on a budget differential wheel drive mobile robot using a single monocular camera with a limited field of view for tracking the human in 3D. Social robot applications require knowledge of the positions of the humans, which necessitates tracking the humans in 3D. When a human follows the robot, the human cannot always place himself/herself exactly behind the robot. When the robot reaches a corner or turning point, the human may or may not be behind the robot. Moreover, space may not always be available behind the robot, for example in the narrow corridors or limited spaces of a visited place. Hence, the human cannot always stay just behind the robot, and visibility is lost. The proposed model handles these situations and continuously tracks the human without losing the position. In this work, an inexpensive monocular camera (such as a webcam) is sufficient to perform 3D tracking in real time; expensive sensors (such as stereo cameras or lidar) are not needed. This makes the approach feasible in terms of cost.
Initially, a human face is detected in the image from a rear-looking monocular camera. Once the face of the human is detected at a certain position, the proposed particle filter tracker is initialized on the human face and tracks the human during the motion. The robot’s initial position is assumed to be the origin $(0,0,0)$, and the human is assumed to be standing behind the robot initially. As the robot moves, the human follows the robot while maintaining a certain socialistic distance. The particle filter tracker knows only the initial position of the human face in the 3D world and predicts the future position of the human face as both the robot and the human move.
The particle filter tracker predicts the next position using a socialistic motion model. The socialistic motion model assumes that the human always moves towards the robot while maintaining a socialistic distance. The human further shows a smooth transition of speeds at any point of time, including switching between behaviours, which is incorporated in the motion model. The socialistic motion model is important since the human will be out of the visibility of the robot for prolonged times. Conventional trackers fail in such a situation. Even if parameter tweaking is done to let tracking continue, the uncertainty of the position grows so large that the tracking becomes useless. However, in real life, a mother leading a child does not look back very often and still always has an indicative idea of the position of the child. This motivates a tracker that can have a robust performance even with an extremely limited observation, which is facilitated by a knowledge of the social behaviour of the human when following the robot. Different kinds of fusion between the socialistic and random motion models are also tried; however, experiments reveal that the socialistic model by itself is extremely robust and gives the best results.
The observation model uses a monocular camera mounted on the rear of the robot. To determine the error between the predicted position and the actual position, the 3D world coordinate of each particle is converted into a 2D image plane coordinate. To convert from the 3D real world to the 2D image plane, the predicted human position is first transformed into the coordinate system of the robot’s camera (knowing the position of the robot), and then the calibration matrix is used to project the face onto the 2D image plane. The 2D coordinate in the image plane represents the centre of the human face. The error between the observed position of the face in the image and the predicted position of the face is computed using this technique. A deep learning-based face detection method gives the actual position of the human face in the 2D image plane, while the particle filter model predicts the position of the human face in the 3D real world.
It must be noted that the approach is different from the standard problem of human tracking because of the following differences, due to which none of the existing literature can be directly applied:
- The standard human tracking approaches use a monocular camera for tracking in the image plane, while the proposed approach tracks the human in the 3D environment from a monocular camera mounted on a moving robot.
- The standard human tracking approaches assume that the human will mostly be visible, while the proposed approach is useful when there are large spans of time in which the human is outside the field of view of the camera.
- The standard human tracking approaches are applicable only for limited time durations, while the proposed approach can track a human for long durations. In this duration, the tracked person disappears and reappears multiple times and must still be tracked, an ability that the current tracking algorithms lack.
The main contributions of the paper are as follows:
1. In this paper, a particle filter model is used to track a human who is following the mobile robot for navigation. Only a single low-cost camera is used to perform tracking in 3D. The problem in question is the 3D tracking of humans from a moving robot, while most existing approaches primarily track objects in the image plane using a static camera. This is a new problem under these settings that has not been actively researched.
2. A socialistic human following behaviour is developed that accounts for the social attraction and repulsion forces between people and between people and the robot. The model can predict the motion of the person even with a lack of visibility. The force constants are obtained realistically using real-life socialistic experiments with human subjects. Such social modelling is missing in the competing approaches, which limits their performance. Comparisons with several filters show the improvement of the proposed approach, which is catalysed by an accurate social model that can continue tracking even with a lack of visibility of the moving object, a capability that has not been demonstrated in the current tracking literature.
3. The robot can handle the loss of visibility due to sharp turns and corners, where detection is not possible, attributed to the strong social prediction model. Everyday scenarios involve congested places where the robot often needs to circumvent narrow turns and visibility is lost, which can be handled by the proposed approach. Conventional trackers are prone to deleting tracks in cases of repeated lack of visibility. Even if track deletion is not performed, the existing trackers fail to accurately predict the position in cases of a prolonged lack of visibility. The proposed algorithm can handle such practical cases.
4. The working environment has different types of noises due to non-ideal settings, like false positives and false negatives in the detection of the humans. Conventional trackers cannot estimate when the object being tracked will show a slow or a fast motion, based on which the tracker could be made more sensitive or more stable. The proposed technique understands the social context to reason about possible human positions and estimates whether the human will make a slow or a fast motion in the real world, and whether the projected position in the image plane will show a slow or a fast transition. This lets the particle filter rely on the predicted motion model in case of a false negative, while not unduly moving the particles in case of a false positive.
2. Related work
In the fields of social robotics and computer vision, a lot of work has been done, and the domain of social robotics is growing very rapidly. A popular method is to extract the legs using moving-blob detection, where the legs appear as local minima in range data [Reference Lenser and Veloso6–Reference Wengefeld, Müller, Lewandowski and Gross8]. The authors computed motion features and geometrical features to define and characterize the human. If the humans are not moving, these features are not suitable for detecting them. The technique also suffers from false negatives for a variety of attires, while assuming a large field of view and high-precision sensing.
The study of human motion has attracted much attention due to its huge applicability in various fields. Many applications have fixed cameras, which makes it possible to use numerous heuristics for tracking. The authors of [Reference Liu, Huang, Han, Bu and Lv9] captured and tracked human motion based on multiple RGB-D cameras; the method could also handle body occlusion. Furthermore, in the paper [Reference Yu, Zhao, Huang, Li and Liu10], an algorithm was developed to capture very fast human motion based on a single RGB-D camera, combining pose detection with motion tracking. Moreover, human motion prediction can also be done by modelling a convolutional hierarchical auto-encoder [Reference Li, Wang, Yang, Wang, Poiana, Chaudhry and Zhang11]. In the paper [Reference Malviya and Kala12], the authors tracked multiple faces with the camera at different heights to determine the best field of view of the camera. However, in the proposed application, the camera is on a moving robot, which makes tracking challenging. For a static target, a visual servoing-based go-to-goal behaviour has been designed for controlling a mobile robot using a Gaussian function [Reference Dönmez, Kocamaz and Dirik13]. Moreover, a decision tree controller has been developed for vision-based control and integrated with the potential field method [Reference Dönmez and Kocamaz14]. Deep reinforcement learning has been utilized for training an agent to learn safe motion planning and collision avoidance in dynamic scenarios [Reference Chen, Liu, Everett and How15, Reference Everett, Chen and How16].
The current paper is limited to tracking for the specific application of a robotic guide; however, numerous applications of tracking for human-robot interaction are worth a mention. Admittance control was utilized to implement robot guidance (the concept of ‘walk-through programming’) as discussed in [Reference Ferraguti, Landi, Secchi, Fantuzzi, Nolli and Pesamosca17, Reference Landi, Ferraguti, Secchi and Fantuzzi18]. Memory-augmented networks have been proposed and applied to predict the trajectory of vehicles on the road [Reference Marchetti, Becattini, Seidenari and Bimbo19]. A fuzzy neural network can also be applied for the tracking of a robot trajectory with the integration of the kinematic model and parameter learning [Reference Bencherif and Chouireb20]. Moreover, a motion planning algorithm [Reference Park, Park and Manocha21] was proposed for human motion prediction, which was also applicable for determining a smooth and collision-free path for the robot. For human-computer interaction and autonomous navigation, the MotionFlow model [Reference Jang, Elmqvist and Ramani22] was introduced as a visual analytic system for analysing patterns of human motion; further, a robot slowing-down behaviour [Reference Malviya, Reddy and Kala23], a hybrid method [Reference Reddy, Malviya and Kala24] and customization of robot appearance [Reference Guo, Xu, Thalmann and Yao25] have been implemented. In the paper [Reference Liang, Zhang, Lu, Zhou, Li, Ye and Zou26], a carefully designed network was integrated with one-shot multi-object tracking, known as CSTrack. Detection and re-identification form an essential framework in tracking, and tracking systems based on the tracking-by-detection paradigm have been proposed [Reference Zhang, Wang, Wang, Zeng and Liu27, Reference Wang, Zheng, Liu, Li and Wang28].
Human tracking in a large crowd is a related problem in robotics and computer vision. It is hard to detect and track human motion and behaviour in real time as the robots and vehicles co-exist with multiple humans [Reference Bera and Manocha29]. Human behaviour is not fixed and keeps changing, which makes the tracking tough. Humans continually change their velocity to avoid obstacles and other moving humans. A few tracking approaches have been proposed that are very reliable but work only for non-real-time and offline applications [Reference Bera and Manocha29]. Methodologies developed for real-time or online human tracking are limited to simpler scenes with fewer humans. In real-time tracking, the trajectory of every moving human is based upon their sub-goal positions and basic interactions with the other humans and obstacles, so an appropriate motion model that includes these characteristics should be built for accurate crowd tracking. Broadly applied motion models based on constant acceleration or constant velocity do not account for the large uncertainties of human trajectories in a crowded scenario and break down when modelling these human behaviours [Reference Bimbo and Dini30]. The proposed approach is a step in the same direction of socialistically modelling the person, applied to a camera on a moving base (robot), which makes the problem harder. The proposal is for a specific application where the human follows the robot, which is a largely untouched problem.
There are multiple applications involving research in robot motion concerning social behaviour, such as navigation, emotion detection and guiding a human. The motion model in the proposed approach is closely related to the problem of path planning, where the motion of a moving person is guessed. A constraint in the motion model, unlike planning, is the availability of only partial knowledge of the world that affects the motion of the person; hence, most of this research is not applicable. In the distance-aware dynamic roadmap [Reference Knobloch, Vahrenkamp, Wächter and Asfour31], collision-free path planning was implemented by incorporating the distance from the obstacles to determine the trajectory of the robot. Another approach, called the constriction decomposition method [Reference Brown and Waslander32], was derived and tested in complex indoor environments. Variants of the artificial potential field [Reference Orozco-Rosas, Picos and Montiel33], adding deliberation to reactive algorithms [Reference Paliwal and Kala34], pedestrian model-based reactive planning [Reference Bevilacqua, Frego, Fontanelli and Palopoli35], a coordination strategy [Reference Kala36] and a vision-based control approach [Reference Dirik, Kocamaz and Dönmez37] have been used for motion planning. Further, an adaptive side-by-side navigation methodology has been developed for guiding the robot [Reference Repiso, Garrell and Sanfeliu38].
Understanding human motion is useful for social robot motion planning, so that mobile robots are able to maintain the same distances as observed between humans under different behaviours [Reference Malviya and Kala39]. Unorganized chaining behaviour is essential for robot path planning, especially in situations like corridors, where a queue is formed automatically even though there is no rule to form a specific queue [Reference Malviya and Kala40]. If multiple robots are moving in a group, they can follow a leader robot in a sequential manner [Reference Kumar, Banerjee and Kala41]. Human-robot interaction also requires attention to themes such as path planning, safety, psychological constraints and prediction [Reference Lasota, Fong and Shah42]. RNN- and CNN-based approaches have also been applied for agent information and trajectory prediction [Reference Zhao, Xu and Monfort43, Reference Rhinehart, McAllister, Kitani and Levine44], and a recent survey covers forecasting of agent behaviour and trajectories [Reference Rudenko, Palmieri, Herman, Kitani, Gavrila and Arras45]. In the paper [Reference Jain, Semwal and Kaushik46], an autocorrelation procedure was utilized for fine-tuning the threshold for stride segmentation on gait inertial sensor data. In the paper [Reference Semwal, Gaud, Lalwani, Bijalwan and Alok47], deep learning models were used for addressing different walking problems of a humanoid robot. Further, in the paper [Reference Gupta and Semwal48], a gait authentication model was proposed for occluded scenarios, using Kinect sensor data to detect the occluded gait cycle. In the domain of computer vision and robotics, human motion and trajectory prediction can also be carried out using group affiliation based on contextual cues [Reference Rudenko, Palmieri, Lilienthal and Arras49]. A single robot can also act in a crowded environment; here, SPENCER has been used for human detection [Reference Linder, Breuers, Leibe and Arras50]. Furthermore, flow models and trajectory uncertainty are also useful for practical motion planning applications [Reference Swaminathan, Kucner, Magnusson, Palmieri and Lilienthal51]. Trajectory prediction can further be used to learn personal traits for determining different interaction parameters among people [Reference Bera, Randhavane and Manocha52, Reference Ma, Huang, Lee and Kitani53].
The artificial potential field method is widely employed for social robot navigation. In the paper [Reference Li, Chang and Fu54], several behaviours of the social robot were analysed to control the robot. The most challenging issue was not only how to design and develop the social behaviour mechanism but also how to retune different parameters for exact and accurate operation. In the proposed work, the aim is not to control the robot, but to design a socialistic module for the prediction of the human motion, using a variety of fused behaviours. Correspondingly, the proposed method integrates the motion primitives with an observation model to make a socialistic particle filter.
Based on the literature, it is observed that tracking for the specific socialistic behaviour of a human following a robot acting as a guide has not been done, while the approaches in the literature for other tracking applications are not applicable, as many use fixed cameras and many others assume a large field of view so that the person is mostly within the sensing range. The literature on socialistic motion planning is more focussed on general obstacle avoidance, while the literature on socialistic interaction does not account for motion problems. Hence, this paper notes the specific problems of the application and experimental setting (person not visible for long, especially at corners) and solves them.
3. Particle filter
The paper uses a particle filter for human tracking, and therefore this section very briefly presents the working of particle filters. The basic idea of the particle filter is that any probability density function can be represented by a set of particles (samples). Every particle has a numerical value for the state variables. A particle filter is an efficient method to express and maintain a probability density function of a non-linear and non-stationary nature. The particle filter method is also known as a sequential Monte Carlo method. In the specific problem, the robot is tracking the moving human, with the robot’s own pose described by Eq. (1)
where $(r_{t,X}, r_{t,Y})$ is the 2D coordinate of the robot, $r_{t,\theta }$ is the robot heading and $r_{t}$ is the robot state at time $t$.
In a particle filter tracking framework [Reference Choi and Christensen55], the posterior density function $p(q_{t}|z_{1:t},r_{1:t})$, given all past observations $z_{1:t}$ and all past positions of the robot $r_{1:t}$, is defined as a set of weighted particles given by Eq. (2)
In Eq. (2), the symbol $q_{t}$ denotes the set of all samples representing the current state of the human face, with particle $i$ given by $q_{t}^{(i)}$. The subscript $t$ stands for time and $n$ represents the total number of particles. The corresponding weights $w_{t}^{(i)}$ are directly proportional to the likelihood function $p(z_{t}|q_{t}^{(i)})$, which is the probability of observing $z_{t}$ given the state $q_{t}^{(i)}$.
In the proposed particle filter model, the current position $\overline{q_{t}}$ of the human’s face is predicted as the weighted mean of the particles, as expressed in Eq. (3).
The motion model of the particle filter predicts the future position of the person being tracked, given the current probability density function represented by using particles. The motion model is applied to every particle using Eqs. (4–5).
Here, $u_{t}^{(i)}$ is the predicted control for the person, consisting of a linear velocity $(v_{t}^{(i)})$ and an angular velocity $(\omega _{t}^{(i)})$. The function predict tries to predict the motion of the human, given the current prospective position $q_{t}^{(i)}$ as per the $i$th particle and the best-estimated position of the robot $r_{t}$. Motion noise $u_{\textit{noise}}$, sampled from a uniform distribution, is added to model the uncertainties in prediction. The control $u_{t}^{(i)}$ is applied to the particle to give the next estimated state $q_{t+1}^{(i)}$ using the known kinematic equation $K$. The kinematic equation also samples a noise to model the natural motion noise.
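As a concrete illustration, a minimal sketch of how the kinematic update of Eqs. (4–5) could be realized for a single particle is given below, assuming a simple unicycle model for $K$ and uniform motion noise; the function and variable names are illustrative and are not taken from the paper.

```python
import math
import random

def kinematics(q_i, u_i, dt, pos_noise=0.05):
    """Kinematic equation K (Eq. (5)): advance the particle state q_t^(i) = (x, y, theta)
    using the predicted control u_t^(i) = (v, omega), with a small sampled motion noise."""
    x, y, theta = q_i
    v, omega = u_i
    theta_new = theta + omega * dt
    x_new = x + v * math.cos(theta_new) * dt + random.uniform(-pos_noise, pos_noise)
    y_new = y + v * math.sin(theta_new) * dt + random.uniform(-pos_noise, pos_noise)
    return (x_new, y_new, theta_new)
```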
Thereafter, an observation is made in the form of an image with the known position of the human face in it, given by $z_{t+1} = (u_{t+1}, v_{t+1})^{T}$. The weights are thus calculated as given by Eq. (6).
Here, $H(q_{t+1}^{(i)})$ is the observation model that predicts the observation (the position of the face in the image) if the human face were at $q_{t+1}^{(i)}$, as suggested by particle $i$.
A natural consequence of running a particle filter with the stated prediction and observation models is that, with time, many particles accumulate in areas of very low likelihood, while there are too few particles in the areas of high likelihood. This significantly reduces the algorithm performance. Hence, resampling is applied, which grows more samples in prominent areas and deletes samples in low-likelihood areas.
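A minimal sketch of one common realization of this resampling step (low-variance or systematic resampling) is given below; the paper does not specify the exact resampling scheme, so this is an illustrative assumption.

```python
import random

def systematic_resample(particles, weights):
    # Low-variance (systematic) resampling: particles with large weights are
    # duplicated and particles with negligible weights are dropped, keeping n fixed.
    n = len(particles)
    cumulative, total = [], 0.0
    for w in weights:
        total += w
        cumulative.append(total)
    step = total / n
    u = random.uniform(0.0, step)           # single random offset
    resampled, i = [], 0
    for k in range(n):
        while cumulative[i] < u + k * step:
            i += 1
        resampled.append(particles[i])
    return resampled, [1.0 / n] * n         # weights are reset to uniform
```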
4. Algorithm design
The basic architecture of the particle filter-based tracking model is summarized in Fig. 1. A new method is designed to track the human in the 3D environment. A social particle filter motion model is proposed to track the human following the robot. Here, the robot starts its journey from the initial state (origin (0, 0)), and the human follows the robot while maintaining a certain socialistic distance. The interesting aspect is when either the person or the robot has to account for a sharp turn in the workspace, wherein the ideal following behaviour is no longer applicable and the person additionally gets outside the field of view of the camera. Only the initial position of the human is known to the particle filter, which tracks the human and determines its position in the 3D environment using a single rear-facing monocular camera.
4.1. Face detection
A convolutional neural network is used to detect the human’s face. As the human comes into the sight of the robot and appears in front of the monocular camera (mounted on the robot), the human is detected by his/her face. As the face of the human is detected, a rectangle is plotted around the human’s face in the video frame. The model in the reference paper [Reference Zhang, Zhang, Li and Qiao56] has been applied for continuous face detection and localization. This network is subdivided into three stages, namely P-Net (fast proposal network), R-Net (refinement network) and O-Net (output network). Here, candidate windows are generated by the fast proposal network, refinement is done by the refinement network, and the facial landmarks and a bounding box are obtained from the output network.
Four distinct kinds of data annotation are used for the training process: (i) Negatives: regions that have an Intersection over Union (IoU) ratio of less than 0.3 with any ground truth face, (ii) Positives: regions that have an IoU of more than 0.65 with a ground truth face, (iii) Part faces: regions with an IoU between 0.4 and 0.65 and (iv) Landmark faces: faces in which the positions of five landmarks are labelled. Among these annotations, positives and negatives are used for face classification, positives and part faces are utilized for bounding box regression, and facial landmark localization is carried out using the landmark faces. The training data for each network are characterized as follows:
P-Net: Patches are randomly cropped from WIDER FACE [Reference Yang, Luo, Loy and Tang57] to assemble negatives, positives and part faces, while faces from CelebA [Reference Liu, Luo, Wang and Tang58] are cropped as landmark faces.
R-Net: Faces are detected from WIDER FACE [Reference Yang, Luo, Loy and Tang57] to assemble negatives, positives and part faces, while landmark faces are detected from CelebA [Reference Liu, Luo, Wang and Tang58].
O-Net: The data are assembled similarly to R-Net, but the first two stages of the framework are used to detect the faces.
Suppose an image is available. Using an image pyramid, it is resized to distinct scales, which serve as the input to the three stages of the model.
Stage I: Initially, P-Net (fast proposal network) is applied to generate the candidate (facial) windows and their bounding box regression vectors. Furthermore, non-maximum suppression (NMS) is employed to merge the highly overlapping candidates.
Stage II: In this stage, each candidate window is fed into the R-Net (refinement network), which eliminates a large number of false positives. After the false positives are eliminated, calibration is performed with bounding box regression, and non-maximum suppression is applied again.
Stage III: This stage is similar to the previous stages, but its main objective is to determine the entire area covered by the human’s face with very high accuracy. The output of this stage is the position of the five facial landmarks.
Various experiments and measurements are done to determine the performance. Compared with the multi-class object detection approach, face detection is a challenging binary classification task, and filters with little diversity can limit the performance of progressive CNNs [Reference Li, Lin, Shen, Brandt and Hua59]. Hence, the filter size is reduced from $5\times 5$ to $3\times 3$ to reduce the computational time, while the depth is increased to obtain high performance. Here, PReLU [Reference He, Zhang, Ren and Sun60] is used as the non-linear activation function after the convolution and fully connected layers, excluding the output layers. PReLU is an extended and modified version of ReLU, a widely used activation function in deep networks [Reference Krizhevsky, Sutskever and Hinton61]. Figure 2 gives the overall depiction [Reference Zhang, Zhang, Li and Qiao56] of the convolutional neural network with P-Net, R-Net and O-Net designed for the detection of the human’s face. Here, 2 and 1 are used as the step sizes for pooling and convolution, respectively. MP and Conv are abbreviations for max-pooling and convolution.
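For illustration, the snippet below shows how such a cascaded detector can be invoked at run time to obtain the face-centre observation used later by the particle filter. It assumes the open-source mtcnn Python package (an implementation of the same P-Net/R-Net/O-Net cascade) and OpenCV for image capture; it is a usage sketch, not the exact code of the paper.

```python
import cv2
from mtcnn import MTCNN  # open-source implementation of the P-Net/R-Net/O-Net cascade

detector = MTCNN()
capture = cv2.VideoCapture(0)                 # rear-looking webcam mounted on the robot

ok, frame = capture.read()
if ok:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(rgb)        # each result has a bounding box and 5 landmarks
    if faces:
        x, y, w, h = faces[0]['box']
        u_obs, v_obs = x + w / 2.0, y + h / 2.0   # face centre used as the observation z_t
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```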
In the detection phase, a cross-entropy loss is employed for every sample $s_{k}$ . Mathematically, it can be written as Eq. (7):
Here, the ground truth label is represented as $\eta_{k}^{det}$, and $p_{k}$ is the probability, obtained from the network, that sample $s_{k}$ is a face.
In the sub-problem of bounding box regression, the network determines, for each candidate, the offset between it and its closest ground truth box (the top, left, height and width of the bounding box). This is posed as a regression problem, and a Euclidean loss is applied for each sample $s_{k}$ as shown in Eq. (8):
Here, the term $\hat{\eta}_{k}^{box}$ represents the regression target obtained from the network, and $\eta_{k}^{box}$ represents the ground truth coordinates (top, left, width and height), implying $\eta_{k}^{box}\in R^{4}$. The facial landmark step similarly minimizes a Euclidean loss, analogous to the bounding box regression problem, and can be formulated as given by Eq. (9)
Here, the term $\hat{\eta}_{k}^{\textit{landmark}}$ denotes the facial landmark coordinates computed from the network, and $\eta_{k}^{\textit{landmark}}$ represents the ground truth coordinates for the $k$th sample. The five facial landmarks are the left eye, right eye, nose, left mouth corner and right mouth corner, and hence $\eta_{k}^{\textit{landmark}}\in R^{10}$.
In the learning process of the CNN, different kinds of images are used, such as non-face images, face images and partially aligned face images. If a sample belongs to the background, then only $loss_{k}^{det}$ is calculated, while the remaining losses are set to zero. This can be implemented directly using a simple type indicator. The target of the learning process is given in Eq. (10)
Here, $M'$ is the number of training samples. The task importance is represented by the parameter $\aleph _{j}$ , and $\beth _{k}^{j}$ is a type indicator.
Instead of training the classifier on traditionally collected hard samples, online hard sample mining is employed for the face versus non-face classification, making the training adaptive. In every mini-batch, the losses obtained in the forward propagation are sorted for each sample, and the hard samples are selected as those with the topmost losses. Only these hard samples are used to compute the gradients in the backpropagation. Easy samples are ignored because they are very unlikely to strengthen the detector during training.
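The following sketch summarizes how the multi-task objective of Eqs. (7)–(10) and the online hard sample mining could be combined for one mini-batch; the indicator arrays, the task weights and the hard-sample ratio of 70% are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def multitask_loss(p, eta_det, box_pred, box_gt, lmk_pred, lmk_gt,
                   beta_det, beta_box, beta_lmk,
                   alpha=(1.0, 0.5, 0.5), hard_ratio=0.7):
    # Per-sample losses of Eqs. (7)-(9); the beta_* arrays act as the type
    # indicators, zeroing out the tasks that do not apply to a given sample.
    eps = 1e-12
    loss_det = -(eta_det * np.log(p + eps) + (1.0 - eta_det) * np.log(1.0 - p + eps))
    loss_box = np.sum((box_pred - box_gt) ** 2, axis=1)
    loss_lmk = np.sum((lmk_pred - lmk_gt) ** 2, axis=1)
    total = (alpha[0] * beta_det * loss_det +
             alpha[1] * beta_box * loss_box +
             alpha[2] * beta_lmk * loss_lmk)          # weighted sum as in Eq. (10)
    # Online hard sample mining: back-propagate only the largest per-sample losses.
    k = max(1, int(hard_ratio * len(total)))
    hard = np.sort(total)[-k:]
    return hard.mean()
```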
4.2. Motion model of the particle filter
The first task in the design of the particle filter is to make a motion model. The typical approaches in the literature assume a randomized motion model in which the uncertainties sharply increase in the absence of observation, which is not suitable for the current application with limited sensing. Furthermore, the randomized motion model may not accommodate the sharp turns by the humans. The humans are less kinematically constrained and can display sharp turns, which may be needed due to a densely packed workspace or a sudden turn by the robot. The robot position in 3D is assumed to be known at every instance of time, which is obtained from the localization module of the robot. While computing the human’s estimated position, the information of the robot position is also considered.
It is assumed that the person is following the robot, which is very useful information to predict the motion of the human. The human is assumed to follow a strategy to maintain a socialistic distance from the robot. Humans tend to always keep a socialistic distance between themselves while navigating. The socialistic distance is complex and depends upon the affinity between the two people and the group to which they belong. Fortunately, the robot is not considered socially amicable and not a part of any social group involving humans. Hence, let the desired separation between the human and robot be given by $d_{soc}$, which includes the radius of the person and robot. The notations are shown in Fig. 3.
Some experiments were done with human subjects to get the socialistic distance $d_{soc}$. The distance, unfortunately, depends upon the particular subject and situation. The distance is hence sampled from a uniform distribution. For modelling, the socialistic distance is however taken as a constant (represented by the mean alone) and uncertainties are added to the other noise terms.
Let $q_{t}^{(i)}$ be the pose of the person as inferred by particle $i$ at any time $t$. Let the pose of the robot be given by $r_{t}$. The person’s immediate speed is set so as to attain the socialistic distance $d_{soc}$ eventually, which is modelled by Eqs. (11–13) using the principles of a proportionate controller with proportionality constant $K_{P}$.
Here, ${\Delta} t$ is the time difference between two successive iterations of the particle filter algorithm.
Here, $d_{soc}$ is the socialistic distance between the person and the robot, including the radii of the two entities, and $d_{t}^{(i)}$ is the distance between the robot and the human (particle). A restriction applied in the selection of the speed is that the person does not go backwards, which is another social etiquette when people follow each other. Temporarily, the distance may be less than required, but the robot eventually moves ahead, increasing the distance. $\epsilon _{\textit{motion}}$ is the noise added to account for the randomized nature of $d_{soc}$, the change of behaviour of the person, momentary distraction, person-specific characteristics, the presence of obstacles, etc. The noise is sampled from a uniform distribution $[\!-\!\epsilon _{vmax},\epsilon _{vmax}]$.
The speed of the person is subjected to a threshold $v_{max}$, further constraining it not to be negative, given by Eq. (14)
A heuristic is that the person cannot move faster than the maximum speed of the robot, which is available from the robot’s documentation. This helps in setting the value of $v_{max}$.
The person is expected to turn so as to face the robot. Unlike robots, humans do not have non-holonomic constraints and can make sharp turns. The person is hence assumed to orient immediately in the direction of the robot, without the need for a separate angular speed, and thus the orientation is given by Eqs. (15–16).
Here, $\epsilon _{\theta }$ is a noise sampled from a uniform distribution $[\!-\!\epsilon _{\theta max},\epsilon _{\theta max}]$. The noise accounts for factors including the smoothness preference of the human, specific behaviour characteristics, presence of obstacles, etc.
The predicted position of the particle is thus approximately given by Eqs. (17–18).
It is also worth discussing the randomized motion model, which simply adds noise to produce the state of the person at any instant of time, as given by Eqs. (19–22)
Here, $v_{s}$ and $\theta _{s}$ are the maximum noise related to speed and angle, respectively.
A purely socialistic behaviour may not be able to deal with non-anticipated changes like the person going around an obstacle, the person deliberately showing non-cooperation, etc. Hence, a fusion of the socialistic and random motion models is also used, given by Eqs. (23–24).
Here, $w_{1}$ controls the contribution of the socialistic and randomized terms.
In the proposed model, Eq. (11) is used to compute the velocity of the particles, where $\epsilon _{v}$ is the uniformly distributed noise explained in Eq. (13) and the distance between the particle and the robot is given by Eq. (12). Eq. (14) bounds the velocity of a particle at a particular time $t$ so that this quantity never becomes negative. The orientation between the particle and the robot is given by Eq. (15), with the noise again uniformly distributed as mentioned in Eq. (16). On the basis of the socialistic force vector (human-following-robot behaviour), the particle position is computed as per Eqs. (17–18), while the random behaviour computes the next updated position of the particle as per Eqs. (19–20), with the speed and angle noises given by Eqs. (21–22). The socialistic and random behaviours are merged in Eqs. (23–24) to compute the next updated position of the particle.
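A minimal sketch of one particle update under the socialistic model and its fusion with the random model is given below, assuming a proportional-control form for Eq. (11) and a simple linear blend for Eqs. (23–24); all function names, parameter names and the exact noise placement are illustrative readings of the equations, not the authors' code.

```python
import math
import random

def socialistic_predict(q_i, r_t, dt, d_soc, K_p, v_max, eps_vmax, eps_theta_max):
    # One particle step of the socialistic (human-following-robot) motion model, Eqs. (11)-(18).
    x, y, _ = q_i
    rx, ry, _ = r_t
    d = math.hypot(rx - x, ry - y)                                   # Eq. (12)
    v = K_p * (d - d_soc) + random.uniform(-eps_vmax, eps_vmax)      # Eqs. (11), (13)
    v = min(max(v, 0.0), v_max)                                      # Eq. (14): bounded, non-negative
    theta = math.atan2(ry - y, rx - x) \
        + random.uniform(-eps_theta_max, eps_theta_max)              # Eqs. (15)-(16)
    return (x + v * math.cos(theta) * dt,                            # Eqs. (17)-(18)
            y + v * math.sin(theta) * dt,
            theta)

def fused_predict(q_i, r_t, dt, w1, params):
    # Fusion of the socialistic and random behaviours (Eqs. (23)-(24)).
    xs, ys, ts = socialistic_predict(q_i, r_t, dt, **params)
    xr = q_i[0] + random.uniform(-params['v_max'], params['v_max']) * dt  # random-walk step, Eqs. (19)-(22)
    yr = q_i[1] + random.uniform(-params['v_max'], params['v_max']) * dt
    return (w1 * xs + (1 - w1) * xr, w1 * ys + (1 - w1) * yr, ts)
```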
4.3. Observation model of the particle filter
The motion model predicts the human position in 3D real-world coordinates, and this 3D point is projected onto the 2D image plane. The proposed approach uses a monocular camera that is calibrated to obtain the lens distortion and intrinsic parameters as a priori information. Hence, the camera model uses the standard calibration matrix ( $[\![{\zeta _C}]\!]$ ), which is given by Eq. (25)
Here, $f_{x}$ and $f_{y}$ are the focal lengths, and $(\partial _{x},\partial _{y})$ is the principal point offset. $s_k$ represents the axis skew, which causes shear distortion in the projected image. The coordinate systems used in the computation are represented in Fig. 4.
The robot coordinate axis system $(X_{R}Y_{R}Z_{R})$ is defined with its characteristic point taken as the centre of the camera lens. The transformation from the world frame to the robot frame, used to express the human’s position relative to the robot, is given by Eq. (26)
Here, $r_{t,\theta }$ is the orientation of the robot. The robot coordinate axis system is rotated from the world coordinate axis system by $r_{t,\theta }+\pi$ , where the extra rotation of π suggests that the camera is looking at the rear side and not the forward side. $\eta_{c}$ is the difference in height between the person and camera.
The camera coordinate axis system is the same with the difference that the Z-axis of the camera goes out of the camera facing the object, and hence, the transformation between the robot and camera coordinate axis is given by Eq. (27).
Figure 4 explains the coordinate geometry and its relationship to the real-world, robot, camera and image coordinate systems. Here, $(X_{W}, Y_{W},$ $Z_{W})$ is the real-world coordinate system. The robot is situated on the ground with the $(X_{R}, Y_{R},$ $Z_{R})$ coordinate system. Further, $(X_{C'}, Y_{C'},$ $Z_{C'})$ is the uncorrected camera coordinate system (with a different permutation of axes in contrast to the camera coordinate system standards), and $(X_{C}, Y_{C},$ $Z_{C})$ is the corrected camera coordinate system.
Since the camera is mounted on the robot, the Z-axis of the camera (uncorrected) is the same as the Z-axis of the robot. Since the camera looks behind the robot, the X-axis and Y-axis of the camera (uncorrected) are opposite in direction to the X-axis and Y-axis of the robot coordinate system, respectively.
Eqs. (28–30) describe the uncorrected camera coordinate system. The human is detected in the image plane, for which the Z-axis corresponds to the X-axis of the uncorrected camera coordinate system (Eq. (31)). The height of the camera, i.e. the uncorrected Z-axis of the camera, becomes the Y-axis of the image plane (Eq. (32)), and the uncorrected Y-axis of the camera becomes the X-axis of the image plane (Eq. (33)).
The person prospectively at $p_{t+1}^{(i)}$ should hence be seen at the position $T_{R}^{W}p_{t+1}^{(i)}$ in the robot coordinate frame and $T_{C}^{R}T_{R}^{W}p_{t+1}^{(i)}$ in the camera coordinate frame. The projection onto the image happens by passing through the calibration matrix, producing the image points given by Eqs. (34–36)
Here, $(c_{x},c_{y})$ are the coordinates of the centre of the image, which shift the coordinates from being measured from the centre of the image to the top left corner.
The face detection method is used to detect the human face, and the observation model is applied here to assign weights to the particles. Let $z_{t+1} = (u_{t+1}^{obs}, v_{t+1}^{obs})$ be the observed centre of the bounding box of the face in the 2D image plane. The updated weight of particle $i$ is given by Eq. (37).
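The projection and weighting steps of Eqs. (26)–(37) can be sketched as follows; the exact axis permutations and the Gaussian form of the likelihood are plausible readings of the equations, and the variable names (e.g. face_height, cam_height) are illustrative assumptions.

```python
import numpy as np

def particle_weight(q_i, robot_pose, K, z_obs, face_height, cam_height, sigma2):
    # Project the particle's 3D face hypothesis into the image and weight it by
    # how well it matches the detected face centre.
    x, y, _ = q_i
    rx, ry, rtheta = robot_pose
    a = rtheta + np.pi                          # the camera looks backwards (Eq. (26))
    # World -> robot (camera-centre) frame.
    xr = np.cos(a) * (x - rx) + np.sin(a) * (y - ry)
    yr = -np.sin(a) * (x - rx) + np.cos(a) * (y - ry)
    zr = face_height - cam_height
    # Robot -> corrected camera frame: depth along the backward-looking axis,
    # with the axis permutation following Eqs. (27)-(33).
    p_cam = np.array([-yr, -zr, xr])
    # Perspective projection with the calibration matrix (Eqs. (34)-(36)).
    u_img, v_img, w = K @ p_cam
    u_img, v_img = u_img / w, v_img / w
    err2 = (u_img - z_obs[0]) ** 2 + (v_img - z_obs[1]) ** 2
    return np.exp(-err2 / (2.0 * sigma2))       # Gaussian likelihood (Eq. (37))
```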
The complete pseudo-code of the tracking and detection module is given as Algorithm I. The algorithm assumes the initial position of the human, which is initialized in line 1. The loop in lines 3–6 initializes all particles with the known position of the human and equal weights. The main loop on time is in lines 8–20. First, the motion model is applied in line 9. Then, an observation is made in line 10. If there is an observation, lines 12–15 project the human as per each particle hypothesis onto the image plane and set the weight based on the error. Line 16 resamples the particles. Line 18 computes the output position of the tracker.
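Tying the earlier sketches together, the structure of Algorithm I could look roughly as follows; detect_face stands for the face detector of Section 4.1 returning the face centre or None, and all names remain illustrative.

```python
import numpy as np

def track(robot_poses, frames, q0, motion_params, obs_params, n=100, dt=0.1):
    # Rough structure of Algorithm I built from the sketches above.
    particles = [q0] * n                                   # lines 1-6: known initial position
    weights = [1.0 / n] * n
    trajectory = []
    for r_t, frame in zip(robot_poses, frames):            # main loop (lines 8-20)
        particles = [socialistic_predict(q, r_t, dt, **motion_params)
                     for q in particles]                   # line 9: motion model
        z = detect_face(frame)                             # line 10: observation, None if no face
        if z is not None:                                  # lines 12-15: weight by projection error
            weights = [particle_weight(q, r_t, z_obs=z, **obs_params) for q in particles]
            particles, weights = systematic_resample(particles, weights)   # line 16
        mean = np.average(np.array(particles), axis=0, weights=weights)    # line 18: Eq. (3)
        trajectory.append(tuple(mean))
    return trajectory
```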
5. Results
The experiments are done on an Amigobot differential wheel drive robot. The robot localizes primarily using high-precision wheel encoders. A web camera was mounted on the robot, looking backwards. The camera was elevated to nearly the human height, which by experimentation significantly improves the face detection accuracy. The camera was calibrated to get the intrinsic and extrinsic parameters. The initial position of the robot was taken as the origin. The initial setup is shown in Fig. 5. First, the human position was fixed, the robot was moved along the X and Y axes, and rotations were also applied to the robot via teleoperation using the ROS framework. Thereafter, the person followed the robot. The human is detected by a monocular camera; in this paper, a Logitech webcam has been used for human detection. The camera is mounted on the AmigoBot robot at a height of 1220 mm and detects the human by the face in 3D, which is further visualized in the 2D image plane. Here, the resolution of the 2D image plane is 640 $\,\times\,$ 480. The robot localizes using the position encoders, which are optical quadrature shaft encoders. The encoders have a resolution of 9550 ticks per wheel revolution, which translates to approximately 30 ticks per millimetre.
In this paper, the Robot Operating System, the Python language and the ROSARIA library are used. The experiments were carried out on a real mobile robot (hardware) named Amigobot. The sensor used was a Logitech web camera that can record videos at up to 60 Hz; however, due to software overheads, the camera was operated at 24 Hz. The robot knows its position primarily using the wheel encoders. Because of the encoder resolution, the frequency of the wheel encoders is practically extremely high; however, again because of the software overheads, the pose is calculated at a frequency of 10 Hz. The robot is also equipped with sonar sensors for autonomous navigation; however, for the experiments, the robot followed a fixed path and these sensors were not used. The 3D monocular tracking system is tested on three major scenarios, which are discussed below. The dataset was collected with a real robot and camera. Initially, the robot localizes itself at the origin (0, 0) and the human is placed just behind the robot at the coordinate point (−1950, 0) with respect to the robot coordinate frame. Here, the coordinate system is the same for the robot and the human. The robot positions are known through the Robot Operating System and the ROSARIA library and are recorded at every time stamp. The human is following the robot, and only the initial position of the human is known; the further positions have to be predicted by the robot. The human is detected in the 2D image plane by the face using a monocular camera mounted on the robot. Three different scenarios have been used for data collection so that all the distinct challenges occur and their solutions can be evaluated. All testing is done at the Centre of Intelligent Robotics of the institute to ensure the validity and correctness of the proposed method. In all scenarios, the human starts walking and follows the robot. In the first scenario (scenario ID I), the robot moved straight for 6 m, then gradually took a $90^{\circ}$ anticlockwise turn and moved at most 0.5 m, then again took a $90^{\circ}$ anticlockwise turn and thereafter moved straight forward. When the robot had covered a distance of 6 m, it again took at most a $60^{\circ}$ anticlockwise turn and stopped. The challenge in this scenario is the repeated frequent turns. First, the robot turned and thus the human was out of view; thereafter, the person took time to turn, and after a brief period the robot already had to take the second turn. This gave a very brief period for the tracker to re-converge the particles.
In the second scenario (scenario ID II), the robot moved straight for 6 m, followed by a gradual $90^{\circ}$ anticlockwise turn, then covered about 3 m, again gradually rotated $90^{\circ}$ clockwise and covered a 5 m distance before stopping. The challenge in this scenario is the sharpness of the turns. The robot took a long time to fully complete nearly $180^{\circ}$ of turning, within which there was barely any visibility of the human. In the next scenario (scenario ID III), the robot also covered a major distance with different rotations, specifically taking a $360^{\circ}$ clockwise rotation. This increases the complexity manifold, because the visibility of the person is lost while the robot is rotating. In this case specifically, the turn is too large and sudden; even after the robot completes the turn, the person takes time to turn and come behind the robot.
Two metrics are used to assess the performance of the algorithm. The first metric is the trace distance, calculated from the distance between every point in the tracked trajectory and the closest point in the ground truth. The second metric is the inverted trace distance, calculated from the distance between every point in the ground truth and the closest point in the tracked trajectory. The metrics are given by Eqs. (38–39), where d() is the distance function.
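Assuming the aggregation in Eqs. (38–39) is an average over points (one plausible reading; the paper's exact aggregation may differ), the two metrics can be computed as follows.

```python
import numpy as np

def trace_distance(tracked, ground_truth):
    # Eq. (38): for every tracked point, the distance to the closest ground-truth
    # point, aggregated here as a mean.
    t = np.asarray(tracked, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    pairwise = np.linalg.norm(t[:, None, :] - g[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()

def inverted_trace_distance(tracked, ground_truth):
    # Eq. (39): the same computation with the roles of the trajectories swapped.
    return trace_distance(ground_truth, tracked)
```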
The first metric (trace) would take a value of 0 if the tracker kept the person at the source only and did not move. Therefore, a second metric that conveys nearly the same meaning is used, under which such an output is given a very large error. The comparison is done with multiple variants of the proposed algorithm and baselines. The first method is the socialistic particle filter, whose motion model is the human following the robot with motion noises. The second is the baseline, which is a randomized behaviour with random motion. The third algorithm is the fusion of the socialistic and random motion. The contribution of the socialistic component is weighted by a factor p, giving an algorithm parameter and the so-called p-biased fusion. A fusion with random weights for the socialistic and random components is also used. The next algorithm is a baseline that tracks the humans in the image plane and transforms the results into 3D by estimating the depth of the person. The estimation of depth is possible since the height of the person is assumed to be known, which is enough to calculate the scaling factor. The Kalman Filter has been widely used in the literature for tracking and is hence used for the comparisons as well. Further, another domain of work is the literature solving the prediction problem. These approaches predict the future trajectory of the human, and the predictions are matched with the observed trajectories. The LSTM is a popular prediction algorithm; therefore, a comparison with an LSTM-based trajectory prediction algorithm is also done.
Table I gives the error from the tracked trajectory to the ground truth and from the ground truth to the tracked trajectory for the distinct algorithms (all measurements in mm). The proposed socialistic particle filter approach has the least error as compared to all other approaches and baselines, on both error metrics. The baseline approach of random motion of the human performed consistently poorly; therefore, fusing the random motion model with the proposed approach also makes the proposed approach perform relatively poorly. The Kalman Filter assumes a linearity that does not hold, and therefore the approach gives extremely poor results. When the robot rotates, there is an extremely sharp change in the location of the tracked human, who may not have moved much. The Kalman Filter cannot anticipate such sharp changes due to the robot’s turns, and the approach hence accumulates severe errors. The LSTM network predicts without observation. It is applicable only for short sequences, and the errors keep increasing as they cannot be corrected by observation. The experiments reported all involve prolonged navigation where the trajectory sequence is large (long duration of time), and hence the tracking gets lost.
Human tracking is performed with the different algorithms and baselines, and the trajectories of the human and the robot are plotted for all approaches. The results are given in Figs. 6 and 7 for Scenario ID I, Figs. 8 and 9 for Scenario ID II, and Figs. 10 and 11 for Scenario ID III. In all the scenarios, the proposed socialistic tracker gave the best results as compared to all other approaches and baselines.
Parameter analysis is done on the distinct parameters such as the motion noise, the variance and the number of particles. It is evident that a smaller number of particles gave the best performance, which indicates the strength of the model. As a result, a large number of particles to increase diversity is not necessary.
Table II summarizes the error with respect to the number of particles. Table III studies the observation variance parameter ($\sigma ^{2}$). The ideal values are the ones that capture the diversity of observations within the image as the uncertainty increases. Finally, the testing with the motion noise is done in Table IV. The noises also correspond to the person generally following the robot religiously, and hence a small noise is enough to model the uncertainties of the following behaviour. A few snapshots of the 3D tracking for scenario ID I are shown in Fig. 12. Here, the robot took a single tour of the laboratory and the human subjects followed the robot. In this scenario, the robot took a rotation four times at four different corners. Human tracking is also performed in another scenario with ID II. Here, the robot took a few rotations and tracked the human at several places and corners. The tracking is shown in Fig. 13.
Scenario ID III is more complicated because the robot took several turns as the human followed. The robot tracked the human at each rotation and corner while the human got out of view on multiple occasions. The pictorial demonstration of scenario ID III is shown in Fig. 14.
Currently, the face detection algorithm takes 0.122 s per frame, while the particle filter including face detection takes 0.155 s per frame. The particle filter alone therefore takes 0.155 − 0.122 = 0.033 s per frame. The authors have previously worked extensively on real robots. In experiments related to control, the authors’ group typically controls the robot at a frequency of 10 Hz or at 0.1 s per frame. When dealing with more involved re-planning and other decision-making modules, the group operates at a frequency of up to 1 Hz or 1 s per frame. As per the current compute used, the computation time of 0.155 s per frame gives an operating frequency of 6.45 Hz, which is an acceptable frequency for such applications. Furthermore, the time required by the novel particle filter constitutes only 21.29% of the total computation time, while face detection is the step that takes the largest amount of time. Currently, the computation is done on a CPU. Like the deep learning-driven research in self-driving cars, robotics and other intelligent systems, we assume that a powerful GPU would be available to significantly lower this time, enabling real-time deep learning-driven solutions.
6. Conclusion
In this paper, a social particle filter model has been used for tracking a human who follows a mobile robot for navigation in indoor scenarios. The applications include robotic guides in museums, shopping centres and other places. The proposed model is powered by a socialistic motion model that can always guess the human’s trajectory with uncertainties limited by a small noise. Here, a single low-cost camera has been used to perform the tracking in the 3D real world, since our model does not require a high-grade stereo camera or a robot with sophisticated sensors. In the recent literature, tracking has typically been performed in the image plane using a static camera, whereas our approach can perform tracking with a moving camera in the 3D real world. The proposed model can predict the motion of the human even with a lack of visibility of the person due to sharp turns, corners and situations where, due to a lack of space, the person cannot place themselves behind the robot. A socialistic human following behaviour is also developed through this model. Comparisons with several filters and different methods have been carried out to show the improvement of the proposed approach, which is catalysed by an accurate social model that can continue tracking even with a lack of visibility of the moving object, a capability that has not been demonstrated in the current tracking literature. The proposed method gave much smaller errors as compared to the other baseline approaches and variants.
One of the major limitations currently is in benchmarking the tracking. While the experiments were carried out under realistic settings with the complete robotic setup including humans, the benchmarking was carried out using simulators to get the ground truth, and the simulated humans may not reflect how real humans socially behave around robots. The experiments need to be done using external trackers to prepare the ground truth. The major application was enabling the robot to take an informed decision based on the tracked positions of the people. Using the tracker uncertainty as an input to the planning algorithm of the robot needs to be explored. In the future, people can be observed for some time, and a new algorithm can be used that learns the different types of patterns that the people form between themselves. The algorithm will be able to identify which patterns are good for better motion prediction; resultantly, a self-adaptive motion model will be proposed as future work that uses a parameter setting based on the observed pattern formation of the people. In this way, the robot may act as an expert guide that adapts its behaviour based on the perceived behaviour of the other people.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/S0263574722001795.
Author contributions
Both authors equally contributed.
Financial support
This work is supported by the Indian Institute of Information Technology, Allahabad, India.
Ethical standards
Not applicable.
Conflicts of interest
The authors have no conflict of interest to declare.