1. Introduction
The intensive care unit (ICU) provides comprehensive treatment, nursing care, and rehabilitation services to critically ill patients. The assessment and management of pain, agitation, and delirium (PAD) are key responsibilities of clinicians during treatment and the postoperative period [Reference Rodriguez, Cucurull, Gonzlez and Gonfaus1, Reference Miriam, Meixner and Lautenbacher2]. However, continuously monitoring a patient for hours at a time is both costly and labor-intensive [Reference Drennan and Fiona3]. Studies have revealed a low doctor-patient ratio in the ICU [Reference Stretch and Shepherd4]. Consequently, long-term monitoring of patients in the ICU places significant physical and psychological strain on the medical staff. In addition, clinicians in the ICU are in direct contact with patients during a pandemic such as COVID-19 [Reference Tian, Hu and Lou5], which increases the risk of infection. Thanks to advances in facial expression recognition [Reference Roy, Bhowmik and Saha6–Reference Mohammad and Hadid8], particularly for infants and the elderly, automatic monitoring of PAD has become possible. However, a patient’s head may move from side to side while the patient is suffering from PAD. Therefore, a robotic face tracking system is required to accurately capture facial actions for the intelligent management of PAD in critical care.
Face tracking with a manipulator can be modeled as a visual servoing problem. Visual servoing can be generally divided into position-based visual servoing (PBVS) and image-based visual servoing (IBVS) [Reference Qi, Tang and Zhang9–Reference Dong and Zhu12]. Several algorithms have been proposed to track the face based on PBVS or IBVS. Face tracking using IBVS has to select more than three features to construct the interaction Jacobian matrix, but head rotation can lead to the loss of feature points. In addition, if the selected feature points are close to each other, a slight head rotation leads to a large change in the joint angles. Ref. [Reference Zhang, Fan, Zhang and Zhan13] presents a PBVS-based face tracking method using a 6-DOF robot to collect patients’ physiological parameters when they are in inconvenient situations. Ref. [Reference Zhang, Ouyang, He, Yuan and Yang14] proposes a face-tracking strategy based on the manipulability of the manipulator and models the head motion as ellipsoids. However, all these methods assume that the optimal configuration of the camera installed on the end effector lies in the reachable workspace of the manipulator, which may not hold for large-amplitude head motions. Therefore, tracking the face under large-amplitude head motion is still an open question.
Clinical observations and workspace analysis show that the position variance of the head is small due to the constraint of the bed. The unreachable optimal configuration of the camera is mainly caused by head rotation. Therefore, the tracking algorithm should increase the weight of the orientation control objective [Reference Rubio15–Reference Silva-Ortigoza, Hernandez-Marquez, Roldan-Caballero, Tavera-Mosqueda, Marciano-Melchor, Garcia-Sanchez, Hernandez-Guzman and Silva-Ortigoza20]. Furthermore, a patient’s head typically rotates from side to side because the body is constrained on the bed. To recapture a face that exceeds the reachable workspace, the camera should always point toward the center of the head. This resembles the remote center of motion (RCM) constraint, which requires the endoscope and instruments to pivot around their incision ports to avoid damaging the surrounding tissue [Reference Sadeghian, Zokaei and Hadian Jazi21]. RCM constraints can be classified into mechanical RCM and programmable RCM. Mechanical RCM is based on mechanical structures such as parallelogram structures. Programmable RCM is often implemented on a 6-DOF manipulator through software, which gives higher flexibility and versatility than mechanical RCM. However, incision ports are generally fixed during surgery, whereas the head center varies because of head motion. Hence, the RCM constraint should be extended to suit the face-tracking system.
Collected face data show that critically ill patients often exhibit small-amplitude head motion. Although the network can still recognize the facial action in this case, changing the configuration of the manipulator unnecessarily increases its chance of damage. Region-based visual servoing (RBVS) is a good way to handle this problem, since it keeps the camera static within a given region [Reference Tahri and Chaumette22, Reference M., Tsai, Chen and Cheng23]. However, the joint movement may be large once the given region is exceeded. A redundant manipulator helps here because it can hold the configuration of the end-effector while adjusting part of the joints in the null space [Reference Chen, Li and Wu24, Reference Zhang, Long and Long25]. Many studies have been conducted on tracking objects with redundant manipulators. Ref. [Reference Tarokh and Zhang26] presents a genetic algorithm that realizes real-time motion tracking for both redundant and non-redundant manipulators without using the inverse or pseudo-inverse Jacobian matrix of the manipulator. Ref. [Reference Tsai, Hung and Chang27] presents a biological inverse kinematics method and a trajectory tracking approach for a 7-DOF manipulator, which employs a Jacobian inverse kinematic method to achieve trajectory tracking control with acceptable accuracy. However, these tracking controllers compute control actions that guide the end-effector along user-defined or desired paths in the workspace [Reference Pan, Gao and Xu28], whereas the patient’s head trajectory is time-varying and, in most cases, random. Hence, the robotic face tracking system needs to change the joint angles of the manipulator in the null space to reduce the chance of damage and to improve traceability at the next moment.
In this paper, a face-tracking algorithm is designed for large-amplitude head motions with a 7-DOF manipulator. The face configuration is recognized by MediaPipe. Then, we define an optimal control problem with visual feedback, which optimizes the orientation and position simultaneously by converting the angle error into an arc length and assigning the two terms different importance weights. The joint angles at the previous moment are used as the initial condition, and the joint angles that minimize the error with respect to the theoretical optimal pose are searched in the workspace. To reduce the chance of losing the face and to increase the probability of recapturing it after it is lost, we define the facial orientation center (FOC) constraint, which regards the center of the head as a fixed point. By predicting the pose of the face at the next moment, the directivity constraints of the camera beam are constructed and added to the optimization function so that the camera always points to the center of the face during the movement. The advantages of a redundant manipulator are used to design the region-based face-tracking approach. When the face moves within a small region, all the feature points of the face can be captured while keeping the camera static, that is, keeping the end-effector of the manipulator still. To avoid excessive joint rotation when tracking starts at the next moment, the manipulator moves in the null space in the direction of the head movement trend, which smooths the tracking trajectory. The conceptual representation of the framework is shown in Fig. 1.
The main contributions of this paper are as follows:
1. We propose a face-tracking algorithm for large-amplitude head motion that does not assume the optimal configuration of the manipulator is reachable. The CBO algorithm assigns different importance weights to trade off between the theoretical optimal configuration and the workspace constraints.
2. To the best of our knowledge, this is the first work to introduce the facial orientation center (FOC) constraint for face tracking. By constraining the camera to always point toward the center of the head, the tracking system remains stable when part of the facial image is lost during tracking, and the probability of recapturing a face that exceeds the reachable workspace of the manipulator increases.
3. We present a region-based tracking approach that stabilizes the manipulator for small-amplitude head motions and minimizes the error between the optimal and current configurations in the null space of the 7-DOF manipulator to smooth the tracking trajectory.
2. System design and working principle
2.1. Clinical requirements
In the ICU, patients with agitation and delirium are often encountered. These symptoms not only complicate medical treatment and nursing but, more importantly, pose a threat to the safety of the patient and the medical staff and are closely related to poor prognosis. Therefore, it is particularly important for medical staff to recognize the symptoms of agitation and delirium in time and to carry out systematic assessments and effective treatments.
As an active field of research in artificial intelligence, deep learning (DL) can automatically extract features from input signals and process them for recognition or classification, which makes facial expression recognition and pain level classification based on facial feature points possible [Reference Semwal and Londhe29]. Based on supervised learning, it can determine the patient’s symptoms through ratings and scales such as the Riker Sedation-Agitation Scale [Reference Riker, Jean and Gilles30], the Motor Activity Assessment Scale [Reference Devlin, Boleski and Mlynarek31], and the Adaptation to the Intensive Care Environment scale [Reference Weinert and McFarland32]. Applying this technology to the daily monitoring of patients would help determine their symptoms and reduce the burden on the medical staff. However, facial feature points must be obtained before pain level classification can be carried out. Therefore, it is imperative to develop a face-tracking system. Clinically, agitation is a psychomotor disorder characterized by excessive physical activity due to excessive stress. It often manifests as pacing back and forth and restlessness, which can require the manipulator to move beyond its workspace when tracking the face [Reference Chevrolet and Jolliet33]. Therefore, traditional face tracking within the workspace is not applicable, and it is necessary to develop a system that achieves the best possible tracking when the optimal pose lies outside the workspace.
2.2. System architecture
This subsection presents the parametric analysis of the face-tracking framework using a 7-DOF manipulator model. First, the manipulator model assumed in this paper is described. Second, the face detection system is introduced. Next, we explain how the target pose is determined and present the tracking system architecture.
2.2.1. Manipulator model
We consider the 7-DOF S-R-S manipulator in this paper. The 7-DOF manipulator is designed with a humanoid arm configuration. The origins of the coordinate systems of the first three rotation axes are located at the same position to form a spherical joint similar to the human shoulder; the fourth rotation axis is similar to the elbow joint; and the origins of the last three rotation axes are also located at the same position to form a spherical joint similar to the wrist [Reference Shimizu, Kakuya and Yoon34, Reference Fu and Pan35]. With this design, the posture of the arm can be changed continuously without changing its end position, which avoids the large motions that industrial robots typically need to change posture. The basic configuration is completed by placing adjacent rotation axes perpendicular to each other. The configuration of the 7-DOF S-R-S manipulator is shown in Fig. 2. One possible set of Denavit-Hartenberg (DH) parameters that describes the kinematic chain of an S-R-S serial manipulator is listed in Table I [Reference Corke36]. The four parameters of each joint give its pose transformation relative to the previous joint. For example, the four parameters of joint one give the pose of the first rotation axis relative to the base coordinate system. By analogy, the pose of the wrist is obtained by multiplying the seven pose transformation matrices. The relation between one assigned reference frame ( $i-1$ ) and the next ( $i$ ) is given by the homogeneous transformation matrix ${}^{i-1}\mathrm{T}_i$ built from these four parameters.
In this matrix, $c\alpha, s\alpha$ represent $\cos\alpha, \sin\alpha$, respectively, and the pose of the wrist in the task space is obtained by multiplication as ${}^{0}\mathrm{T}_7 ={}^{0}\mathrm{T}_1{}^{1}\mathrm{T}_2{}^{2}\mathrm{T}_3{}^{3}\mathrm{T}_4{}^{4}\mathrm{T}_5{}^{5}\mathrm{T}_6{}^{6}\mathrm{T}_7$ .
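The chain of link transforms can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: it assumes the standard (distal) DH convention with every link length $a_i = 0$, and the twist angles in `ALPHA` are one common S-R-S assignment, taken here as an assumption rather than as the values of Table I. The link offsets are the ones quoted in Section 4.

```python
import numpy as np

# Link offsets in mm (values quoted in Section 4).
D_BS, D_SE, D_EW, D_WF = 341.5, 395.1714, 367.3077, 250.3

# Assumed DH table for an S-R-S arm: a_i = 0 for every joint.
ALPHA = np.array([-np.pi / 2, np.pi / 2, np.pi / 2, -np.pi / 2,
                  -np.pi / 2, np.pi / 2, 0.0])
D = np.array([D_BS, 0.0, D_SE, 0.0, D_EW, 0.0, D_WF])


def dh_transform(theta, d, a, alpha):
    """Standard DH transform from frame i-1 to frame i."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ])


def forward_kinematics(q):
    """Multiply the seven link transforms to get the wrist pose 0_T_7."""
    T = np.eye(4)
    for theta, d, alpha in zip(q, D, ALPHA):
        T = T @ dh_transform(theta, d, 0.0, alpha)
    return T


# With all joints at zero the assumed arm is stretched along the base z-axis.
T_home = forward_kinematics(np.zeros(7))
```

Under these assumptions the zero configuration places the wrist on the base $z$-axis at a height equal to the sum of the four link offsets, which is a convenient sanity check before using any other DH table.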
2.2.2. Real-time facial information collection
MediaPipe Face Detection is an ultra-fast face detection solution with six landmarks and multi-face support. It is based on BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. The detector’s super-real-time performance enables it to be applied to any real-time viewfinder experience that requires accurate facial regions of interest as input to other task-specific models, such as 3D facial key points or geometric estimation (e.g., MediaPipe Face Mesh), facial feature or expression classification, and facial region segmentation. MediaPipe Face Mesh is a face geometry solution that estimates 468 3D face landmarks in real time, even on mobile devices, using machine learning (ML) to infer 3D surface geometry. From the 3D face landmarks, the face pose is estimated by solving a Perspective-n-Point (PnP) problem.
The pose of the nose is taken as the center pose of the face. To represent the position of the face relative to the camera more robustly, the pixel coordinates of the nose and the surrounding feature points are averaged to obtain the pixel coordinates of the face center in the image. The depth of the face center is obtained by matching the pixels of the color image and the depth image. The coordinate axes of the detected face pose are defined as follows: when a person is standing upright, the $z$-axis is the forward direction of the eyes, the $y$-axis points vertically upward through the top of the head, and the $x$-axis is horizontal to the left.
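The averaging and depth lookup described above can be illustrated with a short sketch. It assumes a pinhole camera model and a depth image already aligned to the color image; the function name, the intrinsics `fx, fy, cx, cy`, and the sample landmark pixels are hypothetical and do not come from the MediaPipe or RealSense APIs.

```python
import numpy as np


def face_center_camera(landmarks_px, depth_mm, fx, fy, cx, cy):
    """Average nose-region pixels and back-project with a pinhole model.

    landmarks_px: (N, 2) sequence of (u, v) pixel coordinates around the nose.
    depth_mm: depth image aligned to the colour image, in millimetres.
    fx, fy, cx, cy: colour-camera intrinsics (hypothetical values below).
    """
    u, v = np.mean(np.asarray(landmarks_px, dtype=float), axis=0)
    # Depth lookup at the pixel nearest to the averaged centre.
    z = float(depth_mm[int(round(v)), int(round(u))])
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


# Synthetic example: a flat scene 600 mm away, nose region at the image centre.
depth = np.full((480, 640), 600.0)
nose_region = [(318, 238), (322, 238), (318, 242), (322, 242)]
center = face_center_camera(nose_region, depth,
                            fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Averaging several landmarks before the depth lookup reduces the effect of a single noisy landmark, which is the robustness argument made in the text.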
The intrinsic and extrinsic parameters of the camera are obtained automatically. The head pose obtained by the face detection module contains six values, which can be expressed as a vector composed of two parts:
$t_f = \left [x_p \quad y_p \quad z_p\right ]$ is the translation vector, and $r_f = \left [\theta _p \quad \theta _y \quad \theta _r\right ]$ denotes the rotation of the head relative to the camera.
2.2.3. Target pose determination
Four coordinate systems are set in the 7-DOF face tracking system: the base coordinate system ${}^{b}\mathrm{T}:(o-x_by_bz_b)$, the wrist coordinate system ${}^{w}\mathrm{T}:(o-x_wy_wz_w)$, the camera coordinate system ${}^{c}\mathrm{T}:(o-x_cy_cz_c)$, and the face coordinate system ${}^{f}\mathrm{T}: (o-x_fy_fz_f)$.
The coordinate systems and their relationships involved in this paper are shown in Fig. 3. The manipulator is installed on a fixed base so that it does not vibrate during movement. The camera is rigidly connected to the end flange of the manipulator through a connecting device, and the axes of the camera coordinate system and the end coordinate system of the manipulator point in the same directions, so there is only a positional offset between them. To ensure the safety of people during tracking, the safe distance between the face and the camera is set as $L$ . The transformation matrix between the wrist and the camera is obtained by coordinate calibration and denoted as ${}^{w}\mathrm{T}_c$ .
These coordinate systems are used to solve for the target pose. First, the 6-dimensional pose of the head in camera coordinates is converted into a transformation matrix ${}^{c}\mathrm{T}_f$, as shown in Eq. (4), so that the head pose can be transferred from the camera coordinate system to the base coordinate system. The transformation matrix ${}^{b}\mathrm{T}_w$ of the wrist relative to the base coordinate system is obtained from the joint angles of the manipulator. The pose ${}^{b}\mathrm{T}_f$ of the head in the base coordinate system is obtained by successive multiplication, ${}^{b}\mathrm{T}_f={}^{b}\mathrm{T}_w\,{}^{w}\mathrm{T}_c\,{}^{c}\mathrm{T}_f$, where ${}^{b}\mathrm{T}_w,{}^{w}\mathrm{T}_c,{}^{c}\mathrm{T}_f \in \mathrm{SE}(3)$ .
Once ${}^{b}\mathrm{T}_f$ is known, the target pose of the manipulator can be obtained in reverse through the pose transformations between the face and the camera and between the camera and the end of the manipulator. With all the transformation matrices available, when the pose of the head is predicted to move to the dashed coordinate axes, the target pose of the manipulator ${}^{b}\mathrm{T}_g$ can be calculated, and the manipulator moves to the dashed target pose through joint rotation.
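The forward chain and the reverse computation of the target pose can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the desired camera pose is taken to sit at distance $L$ along the face $z$-axis, and the 180-degree flip that turns the camera back toward the face is assumed to be about the $x$-axis. The function names are hypothetical.

```python
import numpy as np


def make_transform(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def target_wrist_pose(T_b_w, T_w_c, T_c_f, L):
    """Return (b_T_f, b_T_g): face pose in the base frame and target wrist pose."""
    # Forward chain: base -> wrist -> camera -> face.
    T_b_f = T_b_w @ T_w_c @ T_c_f
    # Desired camera pose in the face frame: L along the face z-axis,
    # rotated 180 deg about x so the camera z-axis points back at the face
    # (the choice of the x-axis for this flip is an assumption).
    R_flip = np.diag([1.0, -1.0, -1.0])
    T_f_c_des = make_transform(R_flip, [0.0, 0.0, L])
    # Reverse chain: face -> desired camera -> wrist.
    T_b_g = T_b_f @ T_f_c_des @ np.linalg.inv(T_w_c)
    return T_b_f, T_b_g


# Example: the face already sits 600 mm in front of the camera, facing it,
# so the target wrist pose should equal the current wrist pose.
T_b_w = np.eye(4)
T_w_c = make_transform(np.eye(3), [0.0, 0.0, 50.0])   # positional offset only
T_c_f = make_transform(np.diag([1.0, -1.0, -1.0]), [0.0, 0.0, 600.0])
T_b_f, T_b_g = target_wrist_pose(T_b_w, T_w_c, T_c_f, L=600.0)
```

If the face shifts sideways in the camera frame, the same computation shifts the target wrist pose by the same amount, which is a quick way to check the sign conventions of a real calibration.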
2.2.4. System structure
This system uses a RealSense camera module (D435i, Intel, USA) to collect RGB-D images containing human faces. The MediaPipe toolkit reads the camera stream and processes the images frame by frame to obtain head information. In this system, only the head position, given by the $x$, $y$, and $z$ coordinates of the head in the camera coordinate system, and three Euler angles describing the rotation of the head are used. After the head information is received from the camera, the pose of the head at the next moment is predicted from the previous head information. The motion control module then converts the data to the base coordinate system according to the current wrist pose of the manipulator, which is selected as the initial condition. The FOC constraint is constructed from the predicted value, and the optimization function is designed to directly output the joint angles of the manipulator while respecting its speed and angle limits; the manipulator then drives the camera to move. Finally, the face pose is predicted sequentially for tracking purposes.
Every time the motion control module receives a group of head information, the above process is repeated. The detailed block diagram of the 7-DOF face tracking system is shown in Fig. 4.
3. Methodology
3.1. Converting radian to arc length
In face-tracking research, orientation tracking is more important than position tracking: even if the manipulator reaches the optimal position, the camera loses the face if it is not pointing at it, and face detection and tracking fail. However, orientation angle and position distance have different dimensions and cannot be directly combined for optimization. It is therefore necessary to convert the angle, expressed in radians, into an arc length so that it has the same dimension as distance. The relative importance of position and orientation is then set by assigning importance weights. The transformation relations based on the different coordinate systems and a more detailed explanation can be found in [Reference Zhang, Zhao, He, Ouyang and Yang37].
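As a rough illustration of the unit conversion, suppose the orientation error $\Delta\theta$ is swept at a radius equal to the tracking distance $L$. This choice of radius is an assumption made here for the example; the exact construction is the one given in the cited reference.

```latex
s = L\,\Delta\theta, \qquad
L = 600\,\mathrm{mm},\quad \Delta\theta = 0.1\,\mathrm{rad}
\;\Longrightarrow\; s = 60\,\mathrm{mm}.
```

Under this assumption, a pointing error of 0.1 rad counts as much as a 60 mm position error before the importance weights are applied.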
3.2. CBO algorithm
The seven rotation angles $\theta _i, i=1,\dots,7$ of the 7-DOF manipulator are used as the unknown variables of the optimization function. The homogeneous transformation matrix of the manipulator is then a function of these angles, with entries denoted $r_{ij}$, and it is converted to Euler angle form to obtain the orientation $\boldsymbol{\theta }_{x,y,z}$.
A CBO algorithm is then proposed to trade off between the theoretical optimal pose and the workspace constraints by converting the angle error into an arc length.
In the optimization function $f$, $\boldsymbol{{P}}_{x,y,z}=\left [r_{14},r_{24},r_{34}\right ]^{\mathrm{T}}$ is the position, $\boldsymbol{{P}}_\textrm{opt},\boldsymbol{\theta }_{x,y,z},\boldsymbol{\theta }_\textrm{opt} \in \mathbb{R}^{3\times 1}$ , $\Delta \theta _i$ is the change of joint $i$ between two adjacent moments, and $\boldsymbol{{P}}_\textrm{opt},\boldsymbol{\theta }_\textrm{opt}$ are the optimal position and the optimal Euler angles, respectively. $\alpha$ and $\beta$ are the importance weights corresponding to the position and orientation. After the optimization function is constructed, the joint angles at the previous moment are used as the initial condition, and the joint angles that minimize $f$ are searched in the global workspace based on Algorithm 1.
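A cost of this kind can be sketched as below. The exact form of $f$ is not asserted here: the sketch simply combines a position error, an orientation error scaled to an arc length, and an optional penalty on the joint changes. The arc radius `radius`, the smoothness weight `gamma`, and the function name are hypothetical.

```python
import numpy as np


def cbo_cost(p, euler, p_opt, euler_opt, dq,
             alpha=0.5, beta=0.5, radius=600.0, gamma=0.0):
    """Weighted pose error with the orientation error expressed as arc length (mm)."""
    pos_err = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(p_opt, dtype=float))
    # Wrap each Euler-angle difference into [-pi, pi] before scaling.
    d_ang = np.asarray(euler, dtype=float) - np.asarray(euler_opt, dtype=float)
    d_ang = np.arctan2(np.sin(d_ang), np.cos(d_ang))
    arc_err = radius * np.linalg.norm(d_ang)
    # Optional penalty on joint changes between adjacent moments (assumed form).
    smooth = gamma * float(np.sum(np.square(dq)))
    return float(alpha * pos_err + beta * arc_err + smooth)


# 50 mm position error and 0.1 rad orientation error (60 mm of arc at 600 mm).
cost = cbo_cost(p=[0.0, 0.0, 0.0], euler=[0.1, 0.0, 0.0],
                p_opt=[30.0, 40.0, 0.0], euler_opt=[0.0, 0.0, 0.0],
                dq=np.zeros(7))
```

With equal weights the two error terms contribute 25 mm and 30 mm in this example; raising `beta` relative to `alpha` makes the search favor pointing accuracy over position accuracy, which is the trade-off the weights are meant to control.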
3.3. FOC constraint
Minimally invasive surgical robots use an RCM to reduce damage to human tissue: the instrument is limited to motions such as rotation about and vertical translation through the incision point, and the robot always works around this point during treatment. We draw on this feature and set the head center point as the FOC so that the camera always points to the head center during tracking. This not only improves the tracking accuracy but also improves the ability to recapture the face after the optimal pose leaves the workspace of the manipulator, since the camera still points toward the face. In contrast to the RCM, the FOC lies outside the operating mechanism. RCM constraints can be achieved through mechanical design or algorithmic control. Because the pose of the head changes, the FOC constraint can only be implemented by constraining the $z$-axis direction of the camera. The difference between the two is shown in Fig. 5, where $P_d$ is the design point to be reached. In Fig. 5(a), the wound point on the skin does not change while the design point is being reached. When the optimal pose is in the workspace, the camera points to the center of the head and the $z$-axis of the camera is parallel to the $z$-axis of the head. When the optimal pose is beyond the workspace, the camera still points to the center of the head, but the $z$-axis of the camera is not necessarily parallel to the $z$-axis of the head, as shown in Fig. 5(b). The FOC here is not a point that is conceptually fixed in space; it indicates that the pointing target of the camera is fixed.
To set the FOC, the $z$-axis of the camera is constrained to coincide with the line connecting the camera and the center of the head. Let $\boldsymbol{{z}}_d$ denote the direction of this line, determined by $\boldsymbol{{P}}_f-\boldsymbol{{P}}$,
where $\boldsymbol{{P}}_f$ is the center of the head and $\boldsymbol{{P}}=\boldsymbol{{P}}_{x,y,z}$ . The cosine of the angle between the camera’s $z$-axis and $\boldsymbol{{z}}_d$ is then required to equal 1, that is, $\boldsymbol{{z}}_c\cdot \boldsymbol{{z}}_d/(\lVert \boldsymbol{{z}}_c\rVert \lVert \boldsymbol{{z}}_d\rVert )=1$,
where $\boldsymbol{{z}}_c=\left [r_{13},r_{23},r_{33}\right ]^{\mathrm{T}}, \boldsymbol{{z}}_d \in \mathbb{R}^{3\times 1}$ . Eq. (16) is added as a constraint to the optimization function of Eq. (14).
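The pointing condition can be checked numerically with a small helper. This is a sketch that assumes $\boldsymbol{z}_d$ is taken as the direction from the camera position to the head center; the function names are hypothetical.

```python
import numpy as np


def foc_cosine(R_cam, p_cam, p_head):
    """Cosine of the angle between the camera z-axis and the camera-to-head line."""
    z_c = np.asarray(R_cam, dtype=float)[:, 2]        # third column of the rotation
    z_d = np.asarray(p_head, dtype=float) - np.asarray(p_cam, dtype=float)
    z_d = z_d / np.linalg.norm(z_d)
    return float(z_c @ z_d / np.linalg.norm(z_c))


def foc_violation(R_cam, p_cam, p_head):
    """Zero when the camera points exactly at the head centre, positive otherwise."""
    return 1.0 - foc_cosine(R_cam, p_cam, p_head)


# Camera at the origin looking along +z.
on_axis = foc_cosine(np.eye(3), [0.0, 0.0, 0.0], [0.0, 0.0, 600.0])
off_axis = foc_cosine(np.eye(3), [0.0, 0.0, 0.0], [600.0, 0.0, 0.0])
```

Writing the condition as a violation term makes it usable either as an equality constraint or as a penalty added to the cost, depending on the solver.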
3.4. Region-based tracking approach
Even a slight movement of the head is captured by the camera and changes the optimal pose of the manipulator. Because the head cannot be fixed, the optimal pose changes all the time and the manipulator moves all the time, which increases the probability of damage to the manipulator. From the perspective of emotion recognition, it is not necessary to obtain the most complete face image at all times, as long as the facial feature points can be detected. Therefore, a small region is set within which the manipulator is kept still.
Head movement is a continuous process. Although the face may lie within the set region at one moment, it may leave the region at the next, resulting in tracking failure. The 7-DOF manipulator can keep the wrist pose unchanged while its internal joint angles change. Even while the head moves within the set region, the manipulator can therefore move in its null space toward the optimal pose determined by the motion of the head. The region-based tracking approach is designed to adjust the joint angles in this way, reducing the difference from the optimal pose so as to improve the ability to follow the face and to smooth the tracking trajectory.
The relationship between the wrist velocity of the manipulator and the joint velocities is $\dot{X}=J(\boldsymbol{\theta })\dot{\boldsymbol{\theta }}$ , where $\dot{X}\in \mathbb{R}^m, \dot{\boldsymbol{\theta }}\in \mathbb{R}^n$, and $J(\boldsymbol{\theta })\in \mathbb{R}^{m\times n}$ is the Jacobian matrix of the manipulator. When the pose of the wrist of the manipulator is determined, its inverse solution is $\dot{\boldsymbol{\theta }}=J^+\dot{X}+(I-J^+J)\omega$,
where $J^+$ is the generalized inverse matrix of $J$ (if $J$ has full row rank, $J^+=J^{\mathrm{T}}(JJ^{\mathrm{T}})^{-1}$ ), $I\in \mathbb{R}^{n\times n}$ is the identity matrix, and $\omega \in \mathbb{R}^n$ is an arbitrary velocity vector. When the wrist pose remains unchanged, $\dot{X}=0$ and hence $J^+\dot{X} = 0$ . Therefore, the goal of moving in the null space is achieved by optimizing $\omega$ . However, the joint angles $\boldsymbol{\theta }$ are then not optimized directly, which inevitably introduces an additional calculation error. We therefore optimize the joint angles $\boldsymbol{\theta }$ directly, which ensures that the manipulator moves in the null space while minimizing the joint angle error with respect to the theoretical optimal pose.
In this optimization, $\boldsymbol{\theta }_{d}$ denotes the joint angles of the theoretical optimal pose and $\boldsymbol{\theta }_{s}$ denotes the joint angles of the starting pose. By solving this function, the changes of the joint angles are obtained directly based on Algorithm 2.
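For reference, the classic velocity-level null-space projection that this discussion starts from can be sketched as below. It is not Algorithm 2, which optimizes the joint angles directly; the Jacobian is a random stand-in rather than that of a real arm, and the function name and the joint vectors are hypothetical.

```python
import numpy as np


def nullspace_step(J, x_dot, omega):
    """Joint velocity: task term plus a null-space term that leaves x_dot unchanged."""
    J_pinv = np.linalg.pinv(J)
    N = np.eye(J.shape[1]) - J_pinv @ J      # projector onto the null space of J
    return J_pinv @ x_dot + N @ omega


rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))              # stand-in 6x7 Jacobian, not a real arm
theta_d = rng.standard_normal(7)             # hypothetical optimal joint angles
theta_s = rng.standard_normal(7)             # hypothetical starting joint angles
omega = theta_d - theta_s                    # push the joints toward theta_d
q_dot = nullspace_step(J, np.zeros(6), omega)
```

Because the wrist velocity is set to zero, the resulting joint velocity moves the arm toward `theta_d` only along directions that leave the wrist pose unchanged to first order.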
4. Simulations
To evaluate the performance of the proposed face-tracking method, we perform simulations in the MATLAB environment. A simulated manipulator is constructed with the Robotics Toolbox [Reference Corke38]. This section compares the tracking performance of the CBO algorithm with the FOC constraint (CBOwFOC) and the CBO algorithm without the FOC constraint (CBOw/oFOC) against the tracking method that uses the pseudo-inverse of the Jacobian matrix (PIJM) [Reference Liao and Liu39]. We set $d_\textrm{bs} = 341.5\,\textrm{mm}, d_\textrm{se} = 395.1714\,\textrm{mm}, d_\textrm{ew} = 367.3077\,\textrm{mm}, d_\textrm{wf} = 250.3\,\textrm{mm}$ according to the actual installation, and the safe tracking distance is set to $L=600\,\textrm{mm}$ . To reflect that position and orientation are equally important in face tracking, $\alpha =0.5$ and $\beta =0.5$ are used. By analyzing the tracking effect, the camera pointing, and the tracking trajectories, we illustrate the tracking accuracy of the proposed method when the optimal pose exceeds the workspace of the manipulator.
4.1. Impact of FOC on tracking performance in the reachable workspace
Head motion data are collected with the manipulator joints at $\theta _1 = -0.034^{\circ }$, $\theta _2=51.707^{\circ }$, $\theta _3= 0.676^{\circ }$, $\theta _4= -92.206^{\circ }$, $\theta _5=-0.050^{\circ }$, $\theta _6= -56.105^{\circ }$, and $\theta _7 =90.237^{\circ }$. The distance between the camera and the head is set to $600\,\textrm{mm}$ . The position and velocity of the head are predicted from the error and velocity of the head at the previous moment. We evaluate four quantities: the angle between the camera’s $z$-axis and the line connecting the camera and the head, which expresses whether the camera points to the center of the head; the angle between the camera’s $z$-axis and the head’s $z$-axis, which indicates whether the camera and the head face each other during tracking; the tracking distance; and the tracking trajectories in space. The results are shown in Fig. 6.
In the beginning, PIJM tracks the face well, but it diverges later, mainly because it has to compensate for position and angle at the same time during tracking, so the accumulated error keeps growing. This problem is reflected more clearly in Fig. 6(d), where the tracking trajectory of PIJM gradually deviates from the real trajectory. Fig. 6 shows that the FOC constraint yields superior performance in both tracking distance and tracking angle. The angle between the $z$-axis of the camera and the line connecting the camera and the head always stays small, which indicates the validity of the FOC constraint. Whether the manipulator has reached the optimal pose is judged from the angle between the $z$-axes of the camera and the head; this angle always stays close to $180^{\circ }$, which indicates that the camera and the head face each other. The tracking distance of CBOwFOC stays closer to the set value than that of CBOw/oFOC. In Fig. 6(d), the red line represents the tracking trajectory and direction of CBOw/oFOC, the blue line is the tracking trajectory and direction of CBOwFOC, the black line is the real trajectory and direction, and the green line is the tracking trajectory and direction of the PIJM method. The spatial trajectory plots in later sections use the same color scheme. To show the tracking trajectories more clearly, we translate the tracking trajectories of CBOwFOC and CBOw/oFOC along the $z$-axis by $\pm 200\, \textrm{mm}$ . The tracking trajectory of CBOw/oFOC is more scattered than that of CBOwFOC, which means that the manipulator shakes more during the tracking process.
4.2. Impact of FOC on tracking performance in the unreachable workspace
Head motion data are collected with the manipulator joints at $\theta _1 = -0.317^{\circ }$, $\theta _2=59.184^{\circ }$, $\theta _3= -0.780^{\circ }$, $\theta _4= -94.092^{\circ }$, $\theta _5=-0.056^{\circ }$, $\theta _6= -57.594^{\circ }$, and $\theta _7 =90.643^{\circ }$. The distance between the camera and the head is set to $600\,\textrm{mm}$ . We evaluate the angle between the camera’s $z$-axis and the line connecting the camera and the head, the angle between the camera’s $z$-axis and the head’s $z$-axis, the tracking distance, and the tracking trajectories in space. The results are shown in Fig. 7.
When collecting facial data, we assume that all six-dimensional pose information of the head is available but that the optimal tracking pose is beyond the workspace of the manipulator. This setting makes it easier to judge the strengths and weaknesses of the methods when the optimal tracking pose exceeds the workspace. With the FOC constraint, the angle shown in Fig. 7(a) stays close to zero, which indicates that the camera points to the center of the head. Although there are fluctuations, they do not affect the tracking. In contrast, with CBOw/oFOC the angle between the camera’s $z$-axis and the line connecting the camera and the head reaches a large value from the beginning, which means that the camera does not point to the center of the head at all. The performance of PIJM lies in between. Considering the field of view of the camera, Fig. 7(b) shows that CBOwFOC can always capture the face. The tracking performance of the different methods is reflected more clearly by the tracking trajectories in Fig. 7(d). Although the tracking trajectory of CBOwFOC differs from the optimal trajectory, the trajectory directions are basically the same. The tracking trajectory of CBOw/oFOC, however, not only differs from the real tracking trajectory but also points completely away from the facial orientation, as a comparison with the real trajectory orientation shows. This demonstrates that CBOwFOC keeps the camera pointed at the head, so that the manipulator can capture and track the face when head movement causes the optimal tracking pose to exceed the workspace.
4.3. Impact of recapturing the face on tracking performance
Collecting head motion data with the joints $\theta _1 = -0.317^{\circ }$ , $\theta _2=59.184^{\circ }$ , $\theta _3= -0.780^{\circ }$ , $\theta _4= -94.092^{\circ }$ , $\theta _5=-0.056^{\circ }$ , $\theta _6= -57.594^{\circ }$ , $\theta _7 =90.643^{\circ }$ of the manipulator. Setting that if the angle between the camera’s $z$ -axis orientation and the line connecting the camera and the head is larger than $30^{\circ }$ , the camera will lose the face. We will take the optimal tracking pose out of the workspace and see whether the camera will lose faces and cause tracking to fail. We evaluate the angle between the camera’s $z$ -axis orientation and the line connecting the camera and the head, the angle between the camera’s $z$ -axis and the head’s $z$ -axis, the tracking distance, and the tracking trajectories in the space, respectively. Its effect diagrams are shown in Fig. 8.
In the beginning, both CBOwFOC and CBOw/oFOC track the face well. At around $t = 300$, the rotation of the head moves the optimal tracking pose out of the workspace; CBOwFOC exhibits tracking jitter, whereas CBOw/oFOC loses the face outright and tracking fails. When the head continues to rotate so that the optimal tracking pose returns to the workspace, the jitter disappears and the tracking is as good as for $t < 300$. The occurrence of jitter and the tracking effect are shown more clearly in Fig. 8(d). While the head pose is within the green ellipse, the corresponding optimal tracking pose, shown by the black tracking trajectory, is also within the green ellipse. CBOw/oFOC loses the face outright and stops tracking. Although there is a certain gap between the tracking trajectory of CBOwFOC and the optimal tracking trajectory because the optimal tracking pose exceeds the workspace, CBOwFOC keeps tracking throughout, and the angle of the camera pointing to the center of the head stays within the maximum allowed value.
4.4. Impact of region-based tracking approach on tracking performance
The head motion data are collected with the manipulator joints set to $\theta_1 = -0.317^{\circ}$, $\theta_2 = 59.184^{\circ}$, $\theta_3 = -0.780^{\circ}$, $\theta_4 = -94.092^{\circ}$, $\theta_5 = -0.056^{\circ}$, $\theta_6 = -57.594^{\circ}$, and $\theta_7 = 90.643^{\circ}$. We set a small region based on half of the maximum range of motion of the manipulator during a $1/30\,\textrm{s}$ interval. We evaluate the tracking distance and the tracking trajectories in space. The results are shown in Fig. 9.
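The idea of the small region can be illustrated with a minimal Python sketch. It assumes that the maximum range of motion during one interval can be written as a maximum end-effector speed multiplied by the $1/30\,\textrm{s}$ interval; the parameter `max_end_speed`, the function names, and the simple hold-or-recenter rule are hypothetical simplifications. The sketch does not model the null space motion used in the actual method.

```python
import numpy as np

FRAME_INTERVAL_S = 1.0 / 30.0  # interval stated in the text


def region_radius(max_end_speed, dt=FRAME_INTERVAL_S):
    """Half of the distance the end effector could cover in one interval.

    max_end_speed is a hypothetical parameter (m/s), not a value from the paper.
    """
    return 0.5 * max_end_speed * dt


def update_target(region_center, desired_pos, radius):
    """Hold the end-effector target while the desired position stays in the region.

    Returns (target_position, moved). If the desired position leaves the region,
    the region is re-centered on it and moved is True.
    """
    region_center = np.asarray(region_center, dtype=float)
    desired_pos = np.asarray(desired_pos, dtype=float)
    if np.linalg.norm(desired_pos - region_center) <= radius:
        return region_center, False
    return desired_pos, True
```

With an assumed maximum speed of 0.3 m/s, the radius is 5 mm, so a 3 mm shift of the desired position leaves the target unchanged while a 10 mm shift re-centers the region.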
The amplitude of the tracking distance does not change much, and the tracking trajectory is consistent with the tracking distance. The tracking trajectory of CBOw/oFOC follows a pattern similar to that of the tracking distance and has a larger jitter range, whereas the tracking trajectory of CBOwFOC is smoother. The tracking trajectories show that the null space motion not only keeps up with the head motion but also effectively reduces the motion of the end of the manipulator. Moreover, the direction of the CBOw/oFOC trajectory differs considerably from the true value, whereas the direction of the CBOwFOC trajectory is closer to it.
4.5. Impact of the whole framework on tracking performance
To increase the difficulty of tracking, we introduce dropped frames and incorrect face detections, and fill in the missing and erroneous values by interpolation. Head location detection fails, returning position values of $0$, for $33.2\%$ of the detections, and for $1.8\%$ the position values are far too large and are considered incorrect. The head motion data are collected with the manipulator joints set to $\theta_1 = -0.317^{\circ}$, $\theta_2 = 59.184^{\circ}$, $\theta_3 = -0.780^{\circ}$, $\theta_4 = -94.092^{\circ}$, $\theta_5 = -0.056^{\circ}$, $\theta_6 = -57.594^{\circ}$, and $\theta_7 = 90.643^{\circ}$. We set a small region based on half of the maximum range of motion of the manipulator during a $1/30\,\textrm{s}$ interval. We evaluate the angle between the camera’s $z$-axis orientation and the line connecting the camera and the head, the angle between the camera’s $z$-axis and the head’s $z$-axis, the tracking distance, and the tracking trajectories in space. The results are shown in Fig. 10.
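As an illustration of how missing and erroneous head positions might be filled in, the Python sketch below marks all-zero samples as failed detections and samples with an implausibly large coordinate as incorrect, then replaces both by interpolation over the frame index. Linear interpolation, the threshold `max_abs_value`, and the function name are assumptions of this sketch; the text only states that interpolation is used.

```python
import numpy as np


def fill_invalid_positions(positions, max_abs_value):
    """Replace failed or implausible head positions by linear interpolation.

    positions: array of shape (N, 3), one head position per frame.
    max_abs_value: hypothetical plausibility bound; any coordinate whose
        magnitude exceeds it marks the sample as incorrect.
    Returns (filled_positions, invalid_mask).
    """
    pos = np.asarray(positions, dtype=float).copy()
    missing = np.all(pos == 0.0, axis=1)                   # failed detections
    outlier = np.any(np.abs(pos) > max_abs_value, axis=1)  # far-too-large values
    invalid = missing | outlier
    valid_idx = np.flatnonzero(~invalid)
    if valid_idx.size == 0:
        raise ValueError("no valid samples to interpolate from")
    frame_idx = np.arange(len(pos))
    for axis in range(pos.shape[1]):
        # np.interp holds the nearest valid value for leading/trailing gaps.
        pos[invalid, axis] = np.interp(
            frame_idx[invalid], valid_idx, pos[valid_idx, axis]
        )
    return pos, invalid
```

The returned mask makes it easy to check what fraction of samples was replaced, which could be compared with the failure proportions reported above.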
As Fig. 10 shows, once the camera detects the face incorrectly, CBOw/oFOC loses the face and fails to track. Although CBOwFOC still exhibits jitter, the overall tracking effect is not affected. This demonstrates the robustness gained by using FOC to keep the camera facing the center of the face. As Fig. 10(c) shows, the tracking distance does not change continuously and even holds a fixed value for certain periods, which is displayed more clearly in Fig. 10(d). The tracking trajectories of CBOwFOC and CBOw/oFOC are much shorter than the real optimal tracking trajectory. These results are caused by the tracking failure of CBOw/oFOC and the null space motion of CBOwFOC.
5. Experiments
To further validate the effectiveness of our proposed method, we constructed a real experimental platform. The camera is connected to the end of the manipulator through a rigid coupling mechanism, and its pose variations are consistent with the positional and orientational changes defined in Eq. (3) for the camera and the manipulator’s end. The initial pose of the manipulator is set to $\theta_1 = -0.002^{\circ}$, $\theta_2 = 60.003^{\circ}$, $\theta_3 = 0.001^{\circ}$, $\theta_4 = -90.022^{\circ}$, $\theta_5 = 0.000^{\circ}$, $\theta_6 = -59.985^{\circ}$, and $\theta_7 = 89.992^{\circ}$. The experimental setup is illustrated in Fig. 11(a). The setting of the evaluation index is shown in Fig. 11(b).
Because the poses of the face and the camera are initially misaligned, a pretracking “registration” procedure is conducted to position the face centrally in the captured camera image and maintain a predetermined distance of 600 mm. We evaluate the pointing deviation of the camera center from the face center through the $y$-axis distance of the head pose relative to the camera throughout the entire CBOwFOC experiment. Additionally, we analyze the motion trajectories of the head and the manipulator to determine whether the end pose of the manipulator remains stationary when the head exhibits minimal movement. The results are shown in Fig. 12.
A face cannot remain absolutely still, and the face pose detected with MediaPipe changes continuously, which makes the manipulator shake. After balancing computational efficiency against the tolerance setting of the optimization algorithm, we consider the optical axis of the camera to point to the center of the face as long as the distance along the $y$-axis of the camera coordinate system is less than 5 mm. The experimental results show that, after tracking starts, the $y$-axis distance converges to within 5 mm and the deviation retains only a small jitter. Setting a smaller tolerance for the optimization algorithm reduces the pointing error, but meeting it costs a large amount of computing time and therefore lowers the tracking frequency. Moreover, the tracked trajectory shows that when the face shakes within a small range, the end of the manipulator does not move continuously in pursuit of the optimal tracking accuracy, which is consistent with the simulation results. This reduces the risk of damage to the manipulator from unnecessary movement. From the perspective of real experimental efficiency, these results show the effectiveness of our FOC constraint and region-based tracking approach.
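The 5 mm criterion can be checked on a recorded sequence of $y$-axis distances with a few lines of Python. The sketch below returns the first frame from which the deviation stays below the tolerance for the remainder of the recording; the function name and the definition of convergence as "stays within tolerance until the end" are assumptions made for illustration.

```python
import numpy as np

Y_TOLERANCE_MM = 5.0  # pointing tolerance stated in the text


def convergence_frame(y_offsets_mm, tol_mm=Y_TOLERANCE_MM):
    """First frame index from which |y| stays below tol_mm until the end.

    Returns None if the sequence is empty or ends outside the tolerance.
    """
    within = np.abs(np.asarray(y_offsets_mm, dtype=float)) < tol_mm
    if within.size == 0 or not within[-1]:
        return None
    outside = np.flatnonzero(~within)
    # Convergence starts right after the last out-of-tolerance sample.
    return 0 if outside.size == 0 else int(outside[-1]) + 1
```

Dividing the returned index by the tracking frequency would give an approximate settling time, provided that frequency is known for the recording.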
6. Conclusion
Using the proposed CBO algorithm, which minimizes the error by converting radians to arc length, the framework can track the face even when the theoretically optimal tracking poses lie outside the workspace of the manipulator. The proposed tracking method with the FOC constraint ensures that the camera points to the center of the head and improves the ability to reacquire the face when head motion moves the theoretically optimal pose outside the workspace of the manipulator. The region-based tracking approach effectively reduces the movement of the end of the manipulator without degrading tracking performance. In our tracking simulations and experiments, we evaluated the effect of each proposed method on tracking performance as well as the tracking effect of the whole framework. Interesting topics for future work are designing the small region more intelligently and modeling the head movement intent of ICU patients to make head pose prediction more accurate.
Despite the challenges that must be tackled before the method can be used in a clinical setting, the presented work shows the potential to realize face tracking for large-amplitude head motions and will hopefully pave the way for a promising future of medical care.
Author contributions
Shuai Zhang and Bo Ouyang conceived and designed the study. Cancan Zhao collected the face data. Xin Yuan conducted face detection. Shuai Zhang conducted the simulation experiments. Shuai Zhang, Bo Ouyang, and Shanlin Yang wrote the article.
Financial support
This work was supported in part by the National Key Research and Development Program of China (Grant No. 2021YFC0122602), in part by the Joint Funds Program of the National Natural Science Foundation of China (Grant No. U21A20517), and in part by the Basic Science Centre Program of the National Natural Science Foundation of China (Grant No. 72188101).
Competing interests
The authors declare that no competing interests exist.