
Clay AIR Patents Release: A Comparison with Google MediaPipe Hand Tracking Models


Jean Baptiste Guignard

Clay AIR’s proprietary tools are designed to optimize the training time and accuracy of in-house hand tracking models

Hand tracking and gesture recognition technologies represent a revolution in the way people interact with technology: virtual interactions with digital and holographic objects, touchless controls with smart displays, and remote interactions with autonomous devices are now possible. These new ways of interacting pave the way for a wide variety of applications in industries such as entertainment, manufacturing, robotics, automotive, and healthcare.

How to provide ready-to-run, accurate and battery-efficient hand tracking models

Clay AIR has been developing gesture recognition and hand tracking solutions since 2015, backed by ten years of R&D. For every player aiming to provide a hardware-agnostic, performant, and intuitive solution, the challenge remains the same.

How can ready-to-run, accurate hand tracking models be provided on any device while keeping CPU consumption low?

Hands generated by KANT, Clay AIR’s new proprietary tool for speeding up the annotation and training of hand tracking models. The third hand from the right is the real model.

Clay AIR introduces new proprietary tools, USG & KANT, designed to improve the accuracy, performance and training time of hand tracking models

Clay AIR hand tracking and gesture recognition technology combines our models, the proprietary tools we designed to train them, and our technical capabilities in other interaction technologies (i.e. 6DoF, SLAM, planar detection, body and face recognition).

In this article, we refer to Google AI’s post on state-of-the-art real-time hand tracking, “On-Device, Real-Time Hand Tracking with MediaPipe,” to introduce Clay AIR’s two latest proprietary tools, USG & KANT. They are designed to improve model training, resulting in higher gesture recognition accuracy, increased performance, and quicker model readiness and training time. You can find more information about our patents and scientific papers here.

Clay AIR hand tracking technology using Nreal cameras

Similarities and differences to Google’s Machine Learning Pipeline for real-time hand tracking 

Google’s approach provides high-fidelity hand and finger tracking from an HD RGB camera by employing machine learning (ML) to infer 21 key points of the hand from a single frame. 

The architecture of Clay AIR’s machine learning pipeline for gesture recognition and hand tracking differs in the methods and tools used to train our models, which results in higher performance, quicker implementation time, and higher accuracy.

Machine Learning Model Illustration

Hand landmark model differences

Google’s hand landmark model performs precise key point localization of 21 3D hand-knuckle coordinates inside the detected hand regions via regression (direct coordinate prediction). Google feeds its real-time hand tracking model with cropped real-world photos and rendered synthetic images to predict the 21 key points.

Clay AIR’s hand landmark models predict 22 3D key point coordinates and are trained on a database of 1.4M samples made of cropped real-world and synthetic images.

However, the input (monochrome), the resolution (96x96, 112x112 or 128x128 pixels, with a corresponding maximum distance of 5.6 feet), the blending distribution (synthetic/manual), the bounding box (adaptive/rectangular), the model itself (direct 3D) and the training method differ from Google’s.
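As a rough illustration of what direct coordinate regression produces, the sketch below groups a flat model output into 3D key points and maps them from the cropped hand region back to the full frame. The function names, the interleaved x, y, z output layout and the bounding box format are assumptions made for this example; they are not taken from Clay AIR’s or Google’s code.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (left, top, width, height), hypothetical format


def decode_landmarks(flat: List[float], num_keypoints: int) -> List[Point3D]:
    """Group a flat regression output [x0, y0, z0, x1, ...] into 3D points."""
    if len(flat) != num_keypoints * 3:
        raise ValueError("expected %d values, got %d" % (num_keypoints * 3, len(flat)))
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]


def crop_to_frame(points: List[Point3D], box: Box, crop_size: int) -> List[Point3D]:
    """Map x and y from crop pixels to full-frame pixels; z is left unchanged."""
    left, top, width, height = box
    scale_x = width / crop_size
    scale_y = height / crop_size
    return [(left + x * scale_x, top + y * scale_y, z) for x, y, z in points]
```

With `num_keypoints=22` and `crop_size=96`, this would match the 22-point, 96x96 configuration described above; the same helper works for 21 points.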

Clay AIR tracks 22 key points on each hand located in the camera’s field of view

Input differences

At Clay AIR, we are able to use a 6DoF input from a monochrome camera, versus Google’s 256x256p input from an RGB camera.

Monochrome sensors are typically already used for room-scaling purposes. Running our software through the same camera allows us to avoid opening other cameras such as the RGB camera, which is well known for its heat and high CPU consumption.

Monochrome inputs are more challenging to process, as the images are lower resolution, in black and white, and more distorted. Even so, we are able to run machine-learning-based tracking and gesture recognition through monochrome cameras, in addition to RGB, IR and ToF cameras.

Google uses models without SSD, which results in slower and less accurate object detection.

On the left, a monochrome fisheye camera input (Nreal). Note the distortion of the picture compared to the RGB picture (iPhone) on the right.

Annotation, training and mixing method differences

According to Google, 30K samples were used, partly manually annotated and partly synthetic. Manual annotation usually costs 0.5 cents per sample and takes 3 to 4 weeks.

Because part of the process is manual, the positioning is uncertain and the confidence of the key points is lower, so jitter is likely to occur. On the other hand, synthetic data can carry biases, such as the image’s grain, that can result in fewer recognized hands.

Clay AIR developed two proprietary tools to accelerate the annotation and training process of in-house hand tracking models

KANT (Knowledge Automated Notation Tool)

KANT is a generic annotation tool that enables us to generate 90k samples per hour. Any object could be generated, but we use it to generate balanced and representative hand poses and positions.

It includes luminance and background matching to adjust seamlessly to new devices, ISP, and FOV. The resulting generated samples feed our 2D or 3D hand models. 
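The source does not describe how KANT performs luminance matching. As one simple way to picture the idea, the sketch below rescales a synthetic monochrome image so that its mean brightness matches a target value measured on a new device. The gain-based approach, the function names and the list-of-rows image format are all assumptions for illustration only.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit monochrome pixel values (0-255)


def mean_luminance(image: Image) -> float:
    """Average pixel value over the whole image."""
    pixels = [p for row in image for p in row]
    if not pixels:
        raise ValueError("image is empty")
    return sum(pixels) / len(pixels)


def match_luminance(synthetic: Image, target_mean: float) -> Image:
    """Scale pixel values so the mean approaches target_mean, clamped to 0-255."""
    current = mean_luminance(synthetic)
    if current == 0:
        # An all-black image has no signal to scale; return an unchanged copy.
        return [row[:] for row in synthetic]
    gain = target_mean / current
    return [[min(255, max(0, round(p * gain))) for p in row] for row in synthetic]
```

A real pipeline would likely also match contrast, noise and background statistics; a single global gain is only the simplest possible starting point.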

USG (Unity Sample Generator)

USG is a semi-assisted tool designed to support and accelerate the manual annotation process (real-world data). In particular, combining the simultaneous IR, ToF, monochrome, and RGB camera streams on the same calibrated device enables us to project 3D coordinates out of 2D monochrome images.

In addition, a grid that correlates ten monochrome, six RGB and eleven ToF cameras calibrated together makes it possible to multiply each annotated image by the total number of cameras, substantially increasing the number of annotated images.
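To make the multiplication effect concrete, here is a minimal sketch of how one set of annotated 3D key points could be reprojected into every camera of a calibrated rig. It assumes an ideal pinhole model without lens distortion, and the `Camera` fields and function names are hypothetical; USG’s actual calibration and projection method is not described in this article.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


@dataclass
class Camera:
    name: str
    rotation: Tuple[Vec3, Vec3, Vec3]  # world-to-camera rotation, one row per axis
    translation: Vec3                  # world-to-camera translation
    fx: float                          # focal lengths and principal point, in pixels
    fy: float
    cx: float
    cy: float


def project(cam: Camera, point: Vec3) -> Optional[Pixel]:
    """Pinhole projection of a world point; None if it lies behind the camera."""
    xc, yc, zc = [
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(cam.rotation, cam.translation)
    ]
    if zc <= 0:
        return None
    return (cam.fx * xc / zc + cam.cx, cam.fy * yc / zc + cam.cy)


def propagate_annotation(cameras: List[Camera], keypoints: List[Vec3]) -> Dict[str, List[Pixel]]:
    """Reuse one 3D annotation as 2D labels for every camera that sees all key points."""
    labels: Dict[str, List[Pixel]] = {}
    for cam in cameras:
        pixels = [project(cam, p) for p in keypoints]
        if all(px is not None for px in pixels):
            labels[cam.name] = pixels
    return labels
```

With the 27 cameras mentioned above (ten monochrome, six RGB, eleven ToF), a single annotated pose would yield up to 27 labeled images, one per camera in which the hand is visible.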

Training data samples: 3D hand samples performing different gestures in different environments, generated by Clay AIR’s new proprietary tools to shorten training time

Our mixing method is different too: we feed our three-model 2D or 3D architecture with a controlled proportion of positive and negative samples, and we adjust the number and nature of the rendered samples accordingly to modify the learning rate and perturbation data in real time.
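The exact mixing procedure is not detailed here. As a loose illustration of ratio-controlled mixing, the sketch below draws a training batch from several sample pools according to fixed proportions. The pool names, the ratios and the `mixed_batch` function are hypothetical, and the real-time adjustment of learning rate and perturbation data is not modeled.

```python
import random
from typing import Dict, List, Sequence, Tuple


def mixed_batch(
    pools: Dict[str, Sequence[str]],
    ratios: Dict[str, float],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Draw a shuffled batch of (pool_name, sample) pairs following the given ratios."""
    if set(pools) != set(ratios):
        raise ValueError("pools and ratios must have the same keys")
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    counts = {name: int(batch_size * ratios[name]) for name in pools}
    # Any rounding remainder goes to the pool with the largest ratio.
    largest = max(ratios, key=lambda name: ratios[name])
    counts[largest] += batch_size - sum(counts.values())
    batch: List[Tuple[str, str]] = []
    for name, count in counts.items():
        batch.extend((name, rng.choice(pools[name])) for _ in range(count))
    rng.shuffle(batch)
    return batch
```

For example, ratios of 0.25 real positives, 0.5 synthetic positives and 0.25 negatives with a batch size of 32 give 8, 16 and 8 samples respectively; changing the ratios between epochs is one way such a mix could be tuned over time.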

Ultimately, the two tools enable Clay AIR to use only one camera feed as no more triangulation is needed to predict the Z point, thus cutting the DSP load in half.

Google’s annotation process compared to Clay AIR proprietary tools

Google vs Clay AIR hand tracking: comparison of annotation, training method, lens, input data processing, hand landmark model, and computation.
Computing time comparison on different devices: Google vs Clay AIR

Implications for our partners and end-users using Clay AIR hand tracking models

A shorter implementation time

KANT and USG drastically increase training velocity: from 10,000 2D images per month with the previous tools to 90,000 3D images per hour, resulting in a shorter time to implementation for our partners.
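A quick back-of-the-envelope calculation shows the scale of this change. The two rates and the 1.4M sample database size come from this article; the 30-day, around-the-clock month used to compare them is an assumption, and the comparison is only indicative since the old figure refers to 2D images and the new one to 3D samples.

```python
OLD_RATE_PER_MONTH = 10_000   # 2D images per month with the previous tools
NEW_RATE_PER_HOUR = 90_000    # 3D samples per hour with the new tools
DATASET_SIZE = 1_400_000      # size of the sample database cited in this article
HOURS_PER_MONTH = 30 * 24     # assumption: 30-day month, generation running continuously


def months_at_old_rate(samples: int) -> float:
    """Months needed to produce `samples` at the previous monthly rate."""
    return samples / OLD_RATE_PER_MONTH


def hours_at_new_rate(samples: int) -> float:
    """Hours needed to produce `samples` at the new hourly rate."""
    return samples / NEW_RATE_PER_HOUR


def speedup_factor() -> float:
    """Ratio of new to old throughput once both are expressed per month."""
    return NEW_RATE_PER_HOUR * HOURS_PER_MONTH / OLD_RATE_PER_MONTH


print(months_at_old_rate(DATASET_SIZE))           # 140.0 months
print(round(hours_at_new_rate(DATASET_SIZE), 1))  # 15.6 hours
print(speedup_factor())                           # 6480.0
```

Under these assumptions, a database of that size goes from more than a decade of annotation to well under a day of generation.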

Increased accuracy

With semi-automated annotation processes and an increased diversity of samples, Clay AIR is able to reduce jitter and the inaccuracy that comes with manual-only annotated data. The hand tracking models are therefore more accurate, which increases the sense of immersion for users.
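Jitter can be quantified in several ways, and this article does not specify which measure is used. One simple option, sketched below, is the mean frame-to-frame displacement of a key point while the hand is held still. The function name and the choice of metric are assumptions for illustration.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def mean_jitter(track: List[Point3D]) -> float:
    """Mean Euclidean displacement between consecutive frames for one key point."""
    steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
    return sum(steps) / len(steps) if steps else 0.0
```

For a perfectly still hand, an ideal tracker would return 0; comparing this value across models trained on different annotation sets gives a rough sense of how annotation noise carries over into tracking stability.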

A power-efficient solution

Because the Z point is predicted directly, triangulation is no longer needed and Clay AIR can use a single camera feed, which cuts the DSP load in half.

This is particularly interesting for partners looking to implement intuitive hand interactions on lightweight devices with lower computing power, and in systems where the CPU load must be spared for essential features, such as driver monitoring systems in cars or trucks.

Using Clay AIR proprietary tools to annotate and train data sets: faster process and increased performance

Before and after using KANT and USG

About Clay AIR

Clay AIR is the only hardware-agnostic software solution for hand tracking and gesture recognition with leading-edge performance.

Clay AIR is a proprietary software solution that enables realistic interaction with the digital world for a variety of mobile, XR, and other devices using computer vision and artificial intelligence.

Recently, Clay AIR collaborated with Lenovo to bring native gesture recognition to the ThinkReality A6 augmented reality (AR) headset. 

Clay AIR also partnered with Renault-Nissan-Mitsubishi to create their prototype in-car air gesture controls to increase safety and improve driving experiences. 

The company is also working with Nreal to add hand tracking and gesture recognition to its mixed reality headsets, and with Qualcomm to implement Clay AIR’s technology at the chipset level to simplify integrations and bring hand tracking and gesture controls to more AR and VR devices. 

If you would like more information about implementing our solutions, feel free to reach out to us here.


