
Facial Expression Detection of Live Video with Neural Networks

By Hunter Abraham & Collin Lenz

Facial Expression Detection

Emotion detection involves classifying the emotion of a facial expression using computer vision algorithms, in our case a convolutional neural network. The problem can be extended to live video feeds, classifying the emotions of the person or people being recorded in real time. The classification comprises many subproblems within computer vision, such as edge detection, shadow detection, feature engineering, and machine learning with convolutional neural networks.



Why?

The solution to this problem has many applications. One medical use could be to track the rehabilitation of psychiatric patients: rather than having a doctor monitor the patient’s emotional stability, a live video feed could monitor them autonomously. Another application could help people who struggle socially on live video calls. People with Asperger’s or autism often find it difficult to read others’ emotions, and with the increase in video calls due to the pandemic, a plug-in for Zoom could display the emotions of a call’s participants so that those emotions are easier to recognize. Yet another application could involve managing crowds during large events: the application could analyze the emotions of crowd members and estimate the aggregate mood of the crowd, which could be useful for anticipating riots or gauging how an audience responds to a performance. While there are many applications of emotion recognition, there are also serious ethical concerns. The ethical dilemmas surrounding mass surveillance apply equally to emotion detection, and a person’s facial expression does not necessarily reflect their emotional state. Still, developing the tool is an interesting problem with the potential to help many people.

Our Classification Method

We found an emotion detection dataset, FER-2013, and implemented a baseline CNN that reaches a test accuracy of 54%.

Convolutional Neural Network




We created our baseline CNN variant based on the CNN described here.
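The exact architecture from that post is not reproduced here, but the sketch below shows the general shape of a small Keras CNN for 48x48 grayscale FER-2013 images. The specific layer sizes, dropout rate, and optimizer are illustrative assumptions rather than our exact configuration.

```python
# Illustrative sketch of a small Keras CNN for 48x48 grayscale FER-2013 images.
# Layer sizes and dropout rates are examples, not the exact baseline architecture.
from tensorflow.keras import layers, models

def build_baseline_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.25),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```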


CNN Demo
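To illustrate how a trained model like this could be applied to a live video feed, here is a rough sketch that captures webcam frames with OpenCV, localizes faces with a pretrained Haar-cascade detector (one reasonable choice), and classifies each face. The model file name `model.h5`, the detection parameters, and the drawing details are hypothetical rather than our exact demo code.

```python
# Sketch of live-video emotion classification, assuming a trained Keras model saved as
# "model.h5" (hypothetical filename) that takes 48x48 grayscale faces and outputs
# probabilities over the seven FER-2013 emotion classes.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
model = load_model("model.h5")
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        # Crop the detected face, resize to the network's input size, and normalize.
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("emotions", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```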

Recurrent Neural Network Attempt

Our plan was to use a Viola-Jones object detector to extract facial features. Then, we would feed these features into a recurrent neural network to capture temporal context in our model. For example, if a face was sad a moment ago and now looks surprised, the expression may actually be fear. However, no dataset of time-series facial expressions was available to us, so we decided to implement a feed-forward network instead.
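For reference, this is roughly what the planned recurrent model would have looked like: an LSTM over a sequence of per-frame feature vectors extracted around the detected faces. The sequence length and feature dimension below are hypothetical, since the pipeline was never completed and no time-series dataset was available.

```python
# Sketch of the planned recurrent model: an LSTM over per-frame feature vectors.
# seq_len and feature_dim are hypothetical placeholders.
from tensorflow.keras import layers, models

def build_rnn(seq_len=16, feature_dim=128, num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(seq_len, feature_dim)),
        layers.LSTM(64),  # summarizes temporal context across the frame sequence
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```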

Feed-Forward Network Attempt

The feed-forward network would use the same input features to classify facial expressions. However, we ran into an implementation issue: the Viola-Jones object detector would not work on the low-resolution (48x48) images in our dataset. Hence, we proceeded with a new convolutional neural network architecture.


Left: Viola-Jones classifier run on a test image. Right: sample image from the FER-2013 dataset.
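The check that led us away from this approach can be reproduced with a short script like the one below, which runs OpenCV's pretrained Haar-cascade face detector on a single 48x48 dataset image. The file name "fer_sample.png" is a hypothetical placeholder; as described above, the detector tends to find no faces at this resolution.

```python
# Quick check of the issue described above: run OpenCV's Haar-cascade face detector
# on a single 48x48 FER-2013 image ("fer_sample.png" is a hypothetical filename).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("fer_sample.png", cv2.IMREAD_GRAYSCALE)  # 48x48 grayscale face
faces = cascade.detectMultiScale(img, scaleFactor=1.05, minNeighbors=3)
print(f"Detected {len(faces)} face(s) in a {img.shape[1]}x{img.shape[0]} image")
```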

New Convolutional Neural Network Attempt




The new Convolutional Neural Network contained several changes from the old one. We used higher dropout after each layer in hopes of improving the model's generalization, switched the activation function from ReLU to tanh, altered the sizes of our convolutional layers, and increased the number of perceptrons in the Dense layers. However, these changes ended up hurting our performance, dropping accuracy to 29%.
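The sketch below illustrates the kinds of changes described above, using tanh activations, heavier dropout after each block, and a larger Dense layer. The exact filter counts and dropout rates are illustrative, not the precise configuration we trained.

```python
# Illustrative sketch of the second CNN variant: tanh activations, heavier dropout
# after each block, and a larger Dense layer. Sizes are examples only.
from tensorflow.keras import layers, models

def build_second_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(64, (3, 3), activation="tanh", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.4),
        layers.Conv2D(128, (3, 3), activation="tanh", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.4),
        layers.Flatten(),
        layers.Dense(512, activation="tanh"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```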

Training Data

We used a dataset from a Kaggle competition called fer2013. The dataset consists of 48x48-pixel greyscale images of faces categorized as 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, and 6=Neutral. The data is represented as two columns, one with the numerical label from 0-6 and the other with the raw pixel values.
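A minimal sketch of loading this data is shown below, assuming the standard Kaggle fer2013.csv layout with an "emotion" column (0-6) and a "pixels" column of space-separated grayscale values.

```python
# Sketch of loading fer2013.csv into image arrays and labels, assuming the standard
# Kaggle layout: an "emotion" column (0-6) and a "pixels" column of space-separated values.
import numpy as np
import pandas as pd

def load_fer2013(csv_path="fer2013.csv"):
    df = pd.read_csv(csv_path)
    # Each row's "pixels" string becomes a 48x48 grayscale image.
    images = np.stack([
        np.array(pix.split(), dtype=np.float32).reshape(48, 48)
        for pix in df["pixels"]
    ])
    labels = df["emotion"].to_numpy()
    # Add a channel dimension and scale pixel values to [0, 1].
    return images[..., np.newaxis] / 255.0, labels
```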



Presentation

A link to our presentation can be found here.