Biometric Data & Machine Learning

Written by William Rosengren

This article was written by William Rosengren, who worked as a Data Scientist at NordAxon. It was originally published on medium.com on the 6th of May 2021.

Introduction

Just like with music, clothing and food trends, there are also trends in our world of AI (Artificial Intelligence) and ML (Machine Learning). Predictive analysis of biometric data has grown in popularity over the last few years, a clear trend in the AI/ML community. This is, for example, the data used by casual runners to measure their performance, or by forensic scientists to investigate crime. Nurses and medical doctors use biometric data to evaluate the condition of patients, and it can even be used to authenticate the identity of a person with a high level of accuracy. In this article, I will take a look at the analysis of biometric data, the approaches that are used to model the data, and the challenges that data scientists may encounter when dealing with this type of data, and relate all of this to a concrete use case: nystagmus, a medical condition causing involuntary eye movements.

What is Biometric Data?

Before we can conduct an in-depth analysis of biometric data, we first need to know what it is. The word biometric consists of two components: bio, meaning life, and metric, relating to a standard of measurement. Basically, biometric data is any information that can be measured from a human body. When we think of biometric data we might think of respiration rate, heart activity and brain activity, but biometric data may also include a person’s voice or their vein structure.

Biometric data versus Personal data

Some of you might ask yourselves: what is the difference between biometric data and personal data? As discussed above, biometric data are data that can be measured from a person. Personal data, on the other hand, are data that can be used to determine the identity of a person. Examples of personal data include the name of a person, a social security number, a home address, a telephone number and medical records [1]. For example, the fingerprint of a person is biometric data, but it is also personal data, since it can be used to identify that person. This means that biometric data is in many cases also personal data. When dealing with biometric data, it is therefore important to treat it with care, and I will come back to how to handle this later in the article.

Machine Learning

Machine learning is, as we all know, one of the hottest topics in the world of research, science, computers, and just about everywhere data is talked about. This technology can provide remarkable insight into the most complex of problems, and it is therefore an incredible tool for us data enthusiasts. When working with biometric data, it is particularly useful, since biometric data is in many cases quite complex. Here is a brief summary of the two most common types of machine learning approaches used today: supervised learning and unsupervised learning.

Supervised learning is used to establish a relationship between some input, and an output, e.g., the max pulse and the age of a person. When building supervised learning models, input data is collected together with corresponding outputs, and the model is evaluated based on how well it can describe this relationship.
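To make this concrete, here is a minimal sketch in Python of a supervised model fitted to the age/max-pulse example above, using scikit-learn. The data are made up for illustration (a noisy "220 minus age" rule), and a linear model is just one of many possible choices.

```python
# Minimal supervised learning sketch: learn the relationship between
# age (input) and max pulse (output) from labelled examples.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=100).reshape(-1, 1)           # input data
max_pulse = 220 - age.ravel() + rng.normal(0, 5, size=100)   # corresponding outputs

model = LinearRegression().fit(age, max_pulse)               # fit input -> output
print(f"Predicted max pulse at age 40: {model.predict([[40]])[0]:.0f}")
```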

Unsupervised models, on the other hand, are used on data where there is no corresponding output. Two applications of unsupervised learning are frequently used: dimensionality reduction and clustering. The aim of dimensionality reduction is to extract the relevant information from datasets where the input data have many dimensions, and it is particularly useful when the number of input dimensions is large.
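As a small illustration, here is a dimensionality reduction sketch using PCA from scikit-learn, compressing 50-dimensional data down to 2 dimensions. PCA is just one common choice; the data here are random and purely illustrative.

```python
# Minimal dimensionality reduction sketch: 50 input dimensions -> 2.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))         # 200 samples, 50 input dimensions

pca = PCA(n_components=2)              # keep the 2 most informative directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (200, 2)
```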

Clustering is used to find data that have similar properties. Unsupervised learning can be used as an analysis method all by itself, or as a pre-processing step to improve the performance of supervised learning models. We can think of unsupervised learning as a way to sort our data.
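Here is a correspondingly minimal clustering sketch with k-means, again on synthetic data; the method choice is mine for illustration, as the paragraph above does not prescribe one.

```python
# Minimal clustering sketch: group unlabelled 2-D points into clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of points around different centres
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster assignment per point
```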

Deep learning is a modelling approach used to create complex models that can find patterns in data that are otherwise hard to find, and it can be used for both supervised and unsupervised machine learning problems.

A case with eye movement signals

I will now go through an example of how to use machine learning on biometric data, taken from my research as a PhD student. I studied a condition called nystagmus, and patients who suffer from it exhibit involuntary eye movements. There are various reasons why a person might have nystagmus, and in many cases, it is a bit of an enigma for the medical doctors working with the patients. The goal of my research was to give these medical professionals a way to accurately evaluate the condition of their patients, and to measure and predict the development of the patients’ eye movements. Below you can see an example of a person exhibiting horizontal nystagmus [2].

In order to build our models, we need some data, and the first thing I did was to record the eye movements of nystagmus patients. This was done with a technique called eye tracking, which films the eyes of the person and, with the help of image analysis, estimates where the person looked. The outputs from the eye tracker are time-dependent horizontal and vertical eye positions. The next step was to model these eye movements, and this was done using a frequency domain analysis based on the Fourier transform. This is a form of supervised learning, where the goal is to estimate the original signal using a summation of sinusoidal components. From the Fourier analysis, I was able to extract characteristics, or features as they are called in the machine learning world, that describe the eye movements of each of the participants in my studies.
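As a rough sketch of the general idea, the snippet below extracts simple frequency-domain features from a synthetic, sawtooth-like eye position signal using NumPy's FFT. The sampling rate, waveform and feature choices are illustrative assumptions, not the exact method from the studies.

```python
# Minimal sketch: frequency-domain features from a synthetic eye position signal.
import numpy as np

fs = 500                                   # sampling rate in Hz (assumed)
t = np.arange(0, 5, 1 / fs)                # 5 seconds of samples
# Synthetic jerk-like waveform: a 3 Hz sawtooth plus measurement noise
signal = (t * 3) % 1.0 - 0.5 + np.random.default_rng(0).normal(0, 0.02, t.size)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# Simple illustrative features: dominant frequency and its amplitude
peak = spectrum[1:].argmax() + 1           # skip the DC component
features = {"dominant_freq_hz": freqs[peak], "peak_amplitude": spectrum[peak]}
print(features)
```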

For some nystagmus patients, the eye movements change based on how the eyes are angled. For example, the eye movements were different when the person looked to the right compared to when the person looked to the left. Therefore, I wanted to compare the eye movements from one recording to another. To do so, I used an unsupervised clustering method to compare the features that I got from the Fourier analysis. The result was a method that could distinguish between eye movements from a person without having to go through the eye movement recordings manually. The method could also be used to compare eye movements from different people. Now, let’s imagine that we have a dataset of 1000 people, recorded at different locations by different people. By running my algorithm, we could find the people that have similar eye movement patterns. This is an amazing result, as it lets us compare data from different patients, recorded in different years, at different locations, without having to consult an ophthalmologist!
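To illustrate the comparison step, here is a minimal sketch that clusters per-recording feature vectors with agglomerative clustering from scikit-learn. The feature values are invented, and the specific clustering method is an assumption, not necessarily the one used in the research.

```python
# Minimal sketch: group recordings by the similarity of their Fourier features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# One feature vector per recording: [dominant_freq_hz, peak_amplitude]
features = np.array([
    [3.1, 0.80], [3.0, 0.78],   # recordings with a similar pattern
    [5.9, 0.40], [6.1, 0.42],   # a different pattern
])

X = StandardScaler().fit_transform(features)    # put features on the same scale
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # recordings sharing a label share an eye movement pattern
```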

Once the algorithm was developed, we could use it to link results to things like treatments or patient diagnoses. This could, for example, be used to investigate which treatments gave the best effect for a certain type of eye movement pattern. This type of analysis could just as well have been applied to heart activity data (ECG), or just about any other form of biometric time series data, and it shows how powerful machine learning can be.

Machine learning on Biometric data - Challenges and solutions

As we have seen, machine learning is quite useful for analysing biometric data, and the number of applications using machine learning will likely grow rapidly in the years to come. However, there are many challenges when working with biometric data, since these data are most often treated as personal data. In some cases, it is possible to anonymise the data so that it is no longer possible to link the data to a specific person, but even then it is not always guaranteed that the data will no longer be classified as personal data.

Within the European Union there is the General Data Protection Regulation (GDPR), which regulates how data may be stored and who may have access to the data. Biometric data are of course included here, and this will sometimes put limitations on how models can be developed and how data are allowed to be shared. In addition to the GDPR, the European Commission released a document on the 21st of April 2021 titled Laying down harmonised rules on artificial intelligence and amending certain Union legislative acts. This document is focused on ensuring that AI systems are safe and trustworthy. In the coming years, this will be a hot-button topic, and it will be important for data scientists and others working with machine learning on biometric data.

To run experiments with personal data, it is necessary to have consent from each participant, even if the data have been anonymised, and in the case of research, ethical approval is in most cases required. This can lead to practical problems when using supervised learning, because it requires annotated data. To get these annotations, an expert needs to go through the data and provide them, which is invaluable but time-consuming. When developing machine learning models, this is a problem, since most models need a lot of data in order to give good results. This is especially true when working with complex models like deep neural networks. If data cannot be shared, then many hours need to be devoted to data annotation for each project, and it becomes difficult to compare performance between different methods. If the data collection process is slowed down, it may hurt the development of this useful technology. In some cases, there are open datasets where the participants of research studies have approved that their data are shared. One example of an open-source database for biometric data is PhysioNet [3], which contains a wide variety of medical recordings.

Creating our own data

What do we do if we don’t have enough data to build our models? If we can’t collect more data, or if we need our data now, there are typically two approaches that we can use. The first alternative is to create synthetic data. A synthetic dataset is just what it sounds like: data that have been synthesised. Here it is possible to either use a model that can generate the type of data that is desired, or to manually create data that resemble the original dataset. Let’s say that we are interested in analysing a population of people, where we know their age and max pulse. There are recordings from 100 people, but we believe that we would need twice as many for our model development. By estimating the statistical distribution of the 2-dimensional data (age and max pulse), we could sample from the estimated distribution, and thus create “new” data points. The distribution could for example be estimated using a kernel density estimation (KDE) method. This approach could also be used to create “fake” medical records with realistic values. The benefit of this is that we don’t need to give away any real medical data to someone who needs to test an application on medical record data.
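Here is a minimal sketch of this KDE approach using SciPy's gaussian_kde. The age and max-pulse data are made up, and the simple "220 minus age" rule is only there to give the toy data a plausible shape.

```python
# Minimal KDE sketch: estimate the joint (age, max pulse) distribution
# of 100 people and sample 100 "new" synthetic points from it.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 100)
max_pulse = 220 - age + rng.normal(0, 5, 100)
data = np.vstack([age, max_pulse])       # shape (2, 100): one row per dimension

kde = gaussian_kde(data)                 # estimate the 2-D distribution
synthetic = kde.resample(100, seed=1)    # sample 100 new (age, max pulse) pairs
print(synthetic.shape)                   # (2, 100)
```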

The second option is to use data augmentation, which is a method for creating “new” data from a dataset that already exists. This is a common method for image data, where the images can be rotated, flipped, or have noise added. Let’s say that we are working on a method for detecting early-stage brain tumours, and we have used computed tomography (CT) to scan a number of patients and healthy controls. We have some data, but as always when developing machine learning models, we would like to have more, so we decide to use data augmentation to create more images. In the example below, a brain image from a CT scan [4] has been augmented using three different methods.

 

Original image, noise added, blurred, rotated

Here I have used noise addition, a lowpass filter, and image rotation to create three new images. These images could represent disturbances during the measurement, a camera that was out of focus, or an image captured with the wrong settings, respectively. If the original label for this particular image was that the patient was healthy, the labels for the augmented data would be healthy as well.
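A minimal sketch of these three augmentations with NumPy and SciPy, using a random array as a stand-in for a real CT slice:

```python
# Minimal augmentation sketch: noise, blur and rotation on an image array.
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

rng = np.random.default_rng(0)
image = rng.random((256, 256))                       # placeholder for a CT slice

noisy = image + rng.normal(0, 0.05, image.shape)     # measurement disturbance
blurred = gaussian_filter(image, sigma=2)            # lowpass filter / out of focus
rotated = rotate(image, angle=15, reshape=False)     # wrong capture orientation

# All three augmented images keep the original label, e.g. "healthy"
augmented = {"noisy": noisy, "blurred": blurred, "rotated": rotated}
```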

For data augmentation it is important to know what variation is realistic for our dataset. Synthetic and augmented data are sometimes very useful, but they cannot replace real data, and domain expertise about the real data is key in this case!

Closing remarks

Biometric data gives us valuable information about medical conditions and human psychology, and we have not seen the end of the uses for this unique and individual data. Having domain experience is crucial when entering the wonderful world of biometric data analysis. In order to ensure that ethical requirements are met when we make our analyses, we as data scientists and developers need to keep philosophers and the medical profession close when treading in these deep waters.

[1] ec.europa.eu

[2] commons.wikimedia.org

[3] https://physionet.org/

[4] CT Scan