Gesture recognition is an important part of Human-Computer Interaction (HCI). It can be used to improve user interfaces, e.g. virtual reality in the gaming industry or remote control of robot arms using gestures. As an assistive technology, such as sign language translation, it can even make life easier for people with disabilities.
Example Problem Formulation: Classify the type of walk a person is doing.
As of today, with deep learning and only 2D RGB images as input, we can build fairly robust models for static gesture recognition. Continuous gesture recognition, the more promising application, still faces many challenges and remains an open problem.
This article is intended for data scientists who have at least a basic understanding of machine learning and deep learning in computer vision.
Challenges to overcome
Most problems in gesture recognition are associated with accuracy and performance. For instance, different people might perform the same gesture differently in both speed and movement range. This greatly increases the complexity of the problem if we have many types of gestures. In this article, we will restrict our discussion only to hand gestures as they are the most important features for many gesture recognition tasks.
- Occlusions (i.e. hands not in full sight) occur frequently and can be hard to model. A robust model must take both global and local features into account in order to handle them. Imagine a person holding a banana: the part of the hand behind the banana is not visible to the camera, and therefore not to the model.
Illustration of keypoint detection of hands.
- Co-articulation occurs when a gesture is performed differently depending on what gesture comes before and after. Co-articulation usually happens between two gestures as an overlap. This occurs frequently in fluent sign language.
- A limiting factor when building end-to-end software is the trade-off between real-time performance and accuracy. Real-time performance may be a requirement, e.g. for reliable gesture user interfaces that control robot arms.
- Another problem with continuous gesture recognition is that it is expensive to create datasets with frame-level annotations, which may be necessary for sign language recognition or other fast-paced gesticulation. Imagine any sign language with over 1000 words (an understatement) and the amount of time that must be spent annotating such a dataset.
Creating your own datasets
Let's address the problem of acquiring a frame-level dataset for continuous hand gesture recognition. This problem was addressed by a paper from 2016 in which the authors train a CNN on 1 000 000 images using weakly supervised learning. Their algorithm uses the Expectation-Maximization (EM) algorithm and "inaccurate" labels to train a fairly robust CNN model on different hand shapes. In other words, the model learns different hand shapes through clustering. This is a time-saving approach if you want to create your own frame-level dataset for hand shapes.
Illustration of weak annotations/labels of the frames in a video. Each frame is an image.
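To make the idea more concrete, here is a minimal sketch of EM-style training with weak labels. It is not the paper's method: a nearest-centroid classifier stands in for the CNN, the assignments are hard (each frame gets exactly one label) rather than probabilistic, and the 2D feature points, the class names `fist` and `open`, and all helper names are made up for illustration. The weak label of a video only says which hand shapes may occur in it. The loop alternates between guessing per-frame labels (E-step) and refitting the model on those guesses (M-step).

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def init_centroids(frames, allowed, classes):
    # Start each class at the mean of every frame that MAY show it.
    return {c: mean_point([f for f, a in zip(frames, allowed) if c in a])
            for c in classes}

def hard_em(frames, allowed, centroids, n_iters=5):
    labels = []
    for _ in range(n_iters):
        # E-step: pick the closest class among those the weak label allows.
        labels = [min(a, key=lambda c: squared_distance(f, centroids[c]))
                  for f, a in zip(frames, allowed)]
        # M-step: refit the stand-in "model" (the centroids) on the guesses.
        for c in list(centroids):
            members = [f for f, l in zip(frames, labels) if l == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids

# Toy data (hypothetical): video 1 may contain both shapes, video 2 only "open".
video_1 = [(0.0, 0.2), (0.3, 0.0), (0.1, 0.1), (5.0, 4.8), (4.9, 5.2), (5.1, 5.0)]
video_2 = [(5.2, 5.1), (4.8, 4.9), (5.0, 5.3), (5.1, 4.7)]
frames = video_1 + video_2
allowed = [("fist", "open")] * len(video_1) + [("open",)] * len(video_2)

centroids = init_centroids(frames, allowed, ["fist", "open"])
labels, centroids = hard_em(frames, allowed, centroids)
print(labels)
```

On this toy data the first three frames of video 1 end up labelled `fist` and all other frames `open`, even though no frame was annotated individually. With real images the separation would of course be much less clean.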
Deep Learning Architectures
Gestures are characterized by spatial movements of the hand through time, so our model must be able to capture spatio-temporal features. In other words, the model will handle 2D images (spatial features) through time (temporal features). Here I list some approaches that can be explored and tested:
- 3D Convolutional Neural Networks (3D CNN): the author of this paper argues that this type of model is more suitable for modelling spatio-temporal features than 2D CNNs. These models perform well on action recognition in videos, and their usage may be extended to continuous gesture recognition. For a demonstration of this type of network, skip to 18:00 in the following presentation.
The presentation was made by TwentyBN for a PyData conference. Skip to 18:00 in the video to see the demonstration of their model, which is partly based on 3D convolutional layers.
- Object localization models that predict bounding boxes, such as YOLOv3, Single Shot Detectors (SSD) or Faster R-CNN, can be used for static hand gesture recognition (see the small demonstration of YOLOv3 in the figure below). If you need fast real-time inference, you can choose e.g. YOLOv3 or SSD. If you value accuracy over speed, you can use R-CNNs.
A demonstration of YOLOv3 trained on static hand gestures.
- Keypoint detection models can estimate joint positions, which in turn model human poses. These systems are usually built on top of an object localization model, and their output can be used as features for lightweight classification models such as Hidden Markov Models (HMM) or Recurrent Neural Networks (RNN). Alternatively, there exist complete end-to-end systems such as MediaPipe, OpenPose and wrnchAI. Check them out! You might be able to integrate these solutions into your gesture recognition system if you don't want to develop one yourself.
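To illustrate what "spatio-temporal" means for the 3D CNN approach in the list above, here is a small pure-Python sketch of a single 3D convolution over a video clip. The clip layout `[time][height][width]`, the function name `conv3d_valid` and the toy clips are assumptions for illustration. As in most deep learning libraries, the operation is technically a cross-correlation (the kernel is not flipped). A real 3D CNN learns many such kernels and stacks them in layers.

```python
def conv3d_valid(clip, kernel):
    """Slide a 3D kernel over a clip indexed as [time][height][width] (no padding)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                # Weighted sum over a small block of space AND time.
                row.append(sum(kernel[dt][dy][dx] * clip[t + dt][y + dy][x + dx]
                               for dt in range(kt)
                               for dy in range(kh)
                               for dx in range(kw)))
            plane.append(row)
        out.append(plane)
    return out

# Hand-written kernel spanning 2 frames: "next frame minus current frame".
temporal_diff = [[[-1]], [[1]]]

# Toy clips with 3 frames of 1x3 pixels: a bright pixel that moves right,
# and the same pixel standing still.
moving = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
static = [[[1, 0, 0]] for _ in range(3)]

motion_response = conv3d_valid(moving, temporal_diff)
static_response = conv3d_valid(static, temporal_diff)
print(motion_response)  # [[[-1, 1, 0]], [[0, -1, 1]]]
print(static_response)  # [[[0, 0, 0]], [[0, 0, 0]]]
```

The kernel responds only where something changes between consecutive frames. A 2D kernel applied to one frame at a time has no access to the neighbouring frames, so it cannot compute this kind of feature by itself.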
Summary
- In this article I have introduced you to some of the challenges in continuous hand gesture recognition, for instance occlusions and co-articulation.
- You can use weakly supervised learning to train your model and therefore save time.
- An end-to-end system could be a combination of different systems such as OpenPose + RNN (example) for continuous gesture recognition.
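For a keypoint-based combination like this, it can help to normalize the detected keypoints before passing them to the sequence model, so that the same hand shape gives similar features regardless of where the hand is in the image or how close it is to the camera. The following is only a sketch under assumptions: the keypoints are plain `(x, y)` tuples with the wrist as the first point, which is a hypothetical layout and not the output format of any particular library.

```python
import math

def normalize_keypoints(keypoints):
    """Turn (x, y) hand keypoints into a flat, position- and scale-invariant vector.

    Assumes (for illustration) that keypoints[0] is the wrist.
    """
    wrist_x, wrist_y = keypoints[0]
    # Translate so that the wrist is the origin.
    centered = [(x - wrist_x, y - wrist_y) for x, y in keypoints]
    # Scale by the largest wrist-to-keypoint distance (fall back to 1.0 if all coincide).
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [value / scale for point in centered for value in point]

# The same (toy) hand shape, once near the origin and once shifted and 3x larger.
hand_near = [(0, 0), (1, 0), (0, 2)]
hand_far = [(10, 20), (13, 20), (10, 26)]

features_near = normalize_keypoints(hand_near)
features_far = normalize_keypoints(hand_far)
print(features_near)
print(features_far)

# One feature vector per frame forms the input sequence for e.g. an RNN or HMM.
sequence = [normalize_keypoints(frame) for frame in (hand_near, hand_far)]
```

Both hands produce the same feature vector, so the classifier on top only has to model the hand shape and its change over time.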
The next steps are to decide the following:
- What will your dataset look like (which hand gestures) and how will you gather it?
- Which approach/model will you use? Compare all the different candidates.
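Because of the trade-off between real-time performance and accuracy, one way to compare candidates is to measure both on the same labelled frames. The sketch below assumes each model can be wrapped as a Python function that maps one frame to a label. The two toy "models", the frames and the labels are placeholders, not real detectors or data.

```python
import time

def evaluate(model, frames, labels):
    """Return accuracy and average latency per frame for one candidate model."""
    start = time.perf_counter()
    predictions = [model(frame) for frame in frames]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "ms_per_frame": 1000 * elapsed / len(frames),
    }

# Placeholder data: each "frame" is a tiny feature tuple.
frames = [(0, 0), (0, 1), (2, 2), (3, 1)]
labels = ["fist", "fist", "open", "open"]

# Placeholder candidates; in practice these would wrap your trained models.
candidates = {
    "always_fist": lambda frame: "fist",
    "threshold": lambda frame: "open" if sum(frame) > 2 else "fist",
}

results = {name: evaluate(model, frames, labels) for name, model in candidates.items()}
for name, metrics in results.items():
    print(name, metrics)
```

For a fair comparison, all candidates should be timed on the same hardware and with enough frames that the latency numbers are stable.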