written by
Simon Germeau

Eye tracking using webcam images in Tensorflow


Gaze location estimation, or in simpler words eye tracking, has become a useful way to test and evaluate a user interface and improve the user experience. It gives a direct understanding of how users ‘look’ at an interface.

Of course, this is not the only way eye tracking can be used. For example, you can create an application that enables a user to control a mouse with their eyes. There is only one downside to this: you need an external eye tracking device to get this information. Brainjar wanted to build a solution for this and that’s where I came in!

During my twelve-week internship at Brainjar, I created a deep learning model that can perform eye tracking using webcam images. In this blog post, I’ll explain how I went about it. 👀

First things first: DATA

Every deep learning project starts with a dataset. After some online research into different eye tracking datasets, I found a large-scale crowd-sourced dataset called GazeCapture. This dataset contains 2.5 million samples from approximately 1.5K unique subjects. Every sample contains a webcam image of a subject who is looking at a certain point on the screen, together with the exact location of that point.

GazeCapture dataset

Since a predicted X and Y pixel depends heavily on the type of screen, this dataset takes another approach: the target is the distance between the point the subject is looking at and the camera. This way, the model can be made device-independent.
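To show what device independence means in practice, here is a minimal sketch of how a camera-relative prediction could be mapped back to pixels on one specific screen. The `ScreenGeometry` class, the `gaze_cm_to_pixels` function, the axis convention, and the example laptop dimensions are all assumptions for illustration; they are not taken from GazeCapture or from my project code.

```python
from dataclasses import dataclass


@dataclass
class ScreenGeometry:
    """Hypothetical description of one device; every value is an assumption."""
    width_px: int
    height_px: int
    width_cm: float
    height_cm: float
    # Camera position relative to the screen's top-left corner, in cm.
    # x grows to the right, y grows downward (screen convention).
    camera_x_cm: float
    camera_y_cm: float


def gaze_cm_to_pixels(gaze_x_cm, gaze_y_cm, screen):
    """Map a gaze point given in cm relative to the camera to screen pixels.

    Assumes gaze_x_cm grows to the right and gaze_y_cm grows downward.
    """
    # Shift the origin from the camera to the screen's top-left corner.
    x_cm = gaze_x_cm + screen.camera_x_cm
    y_cm = gaze_y_cm + screen.camera_y_cm
    # Scale centimeters to pixels using the physical screen size.
    x_px = x_cm / screen.width_cm * screen.width_px
    y_px = y_cm / screen.height_cm * screen.height_px
    return x_px, y_px


# Example device: a 34.5 cm x 19.4 cm panel at 1920x1080, with the webcam
# horizontally centered and 1 cm above the top edge of the screen.
laptop = ScreenGeometry(1920, 1080, 34.5, 19.4, camera_x_cm=17.25, camera_y_cm=-1.0)
x_px, y_px = gaze_cm_to_pixels(0.0, 10.7, laptop)
```

With these assumed numbers, a gaze point 10.7 cm straight below the camera lands at roughly the center of the screen. The model itself never needs to know the screen size; only this last conversion step does.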

AI and eye tracking magic

As I am very new to the field of AI, I decided to first implement the model as proposed in the GazeCapture paper. I did this to become more familiar with the basic concepts of AI and TensorFlow. You need to keep a lot of external factors in mind when building an eye tracking solution. For example, you need to think about where the subject is located in the frame. For this reason, it is not possible to create an accurate model using only images of the eyes; you need more input variables.

Original model with a twist

The original GazeCapture eye tracking model has four different inputs: a crop of the left eye, a crop of the right eye, a crop of the face, and a 25x25 px segmentation mask that represents where the subject is located in the frame. Even though this model gave a good result, I wanted to put my own twist on it. Therefore, I implemented my own version.

I decided to replace the segmentation mask used in the GazeCapture model with a list of 68 facial landmarks. Facial landmarks are points that are ‘drawn’ over a face to represent salient points like the eyes, mouth, nose, eyebrows, and so on. These points give me information about where the person is in the frame. Moreover, they can give an indication of the direction the person is looking in. I calculated these points for every sample in my dataset using the Face-Alignment library.

Example input

In the first version of my model, I gave an array containing 68 landmarks together with a crop of the left and the right eye as input for the model. Unfortunately, I noticed the model had a hard time finding a correlation between the landmarks and the point on the screen.

On my mentor’s advice, I stopped passing the landmarks as an array and instead passed an image with all these points plotted on it. Every feature of the face, like the nose and eyebrows, has a different color. All colors use maximum RGB channel values, e.g. (255, 0, 255) and (0, 0, 255). The architecture of my best and final AI eye-tracking model can be seen in the following image. As you can see, I chose to implement a ResNet structure in my model.

Model architecture
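The landmark image that replaced the array input can be sketched roughly as follows. This is a minimal illustration, not my actual preprocessing code: the `render_landmarks` function, the 64x64 output size, the dot radius, and the color chosen per feature are assumptions. The index ranges follow the widely used 68-point landmark layout.

```python
import numpy as np

# Index ranges of the common 68-point landmark layout, each paired with one
# saturated color. Which color belongs to which feature is an assumption.
FEATURE_COLORS = {
    "jaw": (range(0, 17), (255, 255, 255)),
    "right_eyebrow": (range(17, 22), (255, 0, 0)),
    "left_eyebrow": (range(22, 27), (0, 255, 0)),
    "nose": (range(27, 36), (0, 0, 255)),
    "right_eye": (range(36, 42), (255, 255, 0)),
    "left_eye": (range(42, 48), (0, 255, 255)),
    "mouth": (range(48, 68), (255, 0, 255)),
}


def render_landmarks(landmarks, frame_width, frame_height, out_size=64, radius=1):
    """Draw 68 (x, y) landmarks, given in frame pixels, on a black RGB image."""
    landmarks = np.asarray(landmarks, dtype=np.float64)
    if landmarks.shape != (68, 2):
        raise ValueError("expected landmarks with shape (68, 2)")
    image = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # Rescale from frame coordinates to the output image, so the drawing
    # keeps the information about where the face sits in the frame.
    scale = np.array([(out_size - 1) / (frame_width - 1),
                      (out_size - 1) / (frame_height - 1)])
    points = np.rint(landmarks * scale).astype(int)
    # Landmarks that fall outside the frame are clamped to the border.
    points = np.clip(points, 0, out_size - 1)
    for indices, color in FEATURE_COLORS.values():
        for i in indices:
            x, y = points[i]
            # Paint a small square around the point, cut off at the edges.
            x0, x1 = max(x - radius, 0), min(x + radius + 1, out_size)
            y0, y1 = max(y - radius, 0), min(y + radius + 1, out_size)
            image[y0:y1, x0:x1] = color
    return image
```

An image like this can go through convolutional layers in the same way as the eye crops, which may be why the model found it easier to use than a flat list of coordinates.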

Let’s track some eyes: real-world testing

A model with a low validation loss can still be useless if it does not perform well in a real-world test. To run such a test, I created a small web application with a React front-end and a FastAPI back-end.

The front-end is a simple application that takes a webcam photo and sends it to the back-end. When the front-end gets a coordinate back, it colors whichever of the four squares you are looking at. There is also a calibration step, so the application knows which prediction corresponds to the center of the screen.
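The decision logic behind the four squares can be sketched in a few lines. The real front-end is written in React; this Python version only illustrates the idea, and the function names, the averaging used for calibration, and the downward-growing y axis are assumptions.

```python
def calibrate(samples):
    """Average several predictions taken while the user looks at the center.

    `samples` is a list of (x, y) predictions. Averaging is an assumed way
    to reduce noise; a single sample would work as well.
    """
    if not samples:
        raise ValueError("calibration needs at least one sample")
    xs, ys = zip(*samples)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def pick_square(gaze_x, gaze_y, center_x, center_y):
    """Return which square of a 2x2 grid the predicted gaze point falls in.

    Assumes x grows to the right and y grows downward.
    """
    horizontal = "left" if gaze_x < center_x else "right"
    vertical = "top" if gaze_y < center_y else "bottom"
    return f"{vertical}-{horizontal}"


# Example: calibrate on two predictions, then classify a new one.
center = calibrate([(0.2, 5.0), (0.4, 5.2)])
square = pick_square(-2.0, 3.0, *center)
```

Comparing against a calibrated center instead of a fixed origin means that a constant offset in the predictions, for example from an unusual camera position, does not shift every result into the same square.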

The result

Follow-up eye tracking project

Are you looking for an internship where you can work with AI? Even better: an internship that also uses an eye tracking solution like this one? At Brainjar, a new internship project just launched: Brainhouse. During this project, you will help ALS patients by providing an AI-driven application that helps them surf the web with their eyes. You can still apply!
