Lip Recognition Using 3D Convolutional Neural Network Cross Audio-Visual Recognition Technology

A lip recognition system uses machine vision technology to continuously recognize faces in images, determine which person is speaking, and extract the continuously changing features of that person's mouth region. These features are then fed into a lip recognition model, which outputs the pronunciation corresponding to the speaker's mouth shapes; finally, the most likely natural language sentence is computed from the recognized pronunciation.

Lip recognition is not a recent technology. As early as 2003, Intel developed the lip recognition software Audio Visual Speech Recognition (AVSR), which enabled developers to build computers that could read lips. By 2016, Google DeepMind's lip-reading technology could already support 17,500 words, and its recognition accuracy on a news test set exceeded 50%.

You may be curious about how a lip recognition system is implemented. Amirsina Torfi et al. implemented lip recognition using 3D convolutional neural network cross audio-visual recognition technology and open-sourced the code on GitHub:

Repository:

https://github.com/astorfi/lip-reading-deeplearning

Next, I will introduce how to use 3D convolutional neural network cross audio-visual recognition technology for lip recognition. For the complete paper, please refer to:

https://ieeexplore.ieee.org/document/8063416

The following is a simple implementation method for lip recognition.

The user needs to prepare the input data in the required format. This project implements audio-visual matching using coupled 3D convolutional neural networks, and lip recognition is one specific application of the project.

Overview

Audio-visual recognition (AVR) is considered an alternative solution for speech recognition when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. The approach of an AVR system is to use the information extracted from one modality to improve the recognition ability of the other modality by filling in the missing information.

▌Problems and methods

The key problem in this work is to find the correspondence between audio and video streams. We propose a coupled 3D convolutional neural network architecture that maps the two modalities into a representation space and uses the learned multimodal features to judge the correspondence between audio-visual streams.

▌How to use 3D Convolutional Neural Networks

Our proposed architecture combines temporal and spatial information to efficiently discover correlations between the temporal information of different modalities. Our method uses a relatively small network architecture and a much smaller dataset, yet outperforms existing audio-visual matching methods, which mainly use CNNs for feature representation. We also demonstrate that an efficient pair selection method can significantly improve performance.
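The paper trains the two streams with a pairwise loss and selects informative pairs online. Purely as an illustration (the margin value and the exact formulation below are assumptions, not the published ones), such a contrastive-style loss over the distance between the audio and visual embeddings can be written as:

```python
import tensorflow as tf

def contrastive_loss(distance, label, margin=1.0):
    """Illustrative contrastive loss for audio-visual matching.

    distance: Euclidean distance between the audio and visual embeddings.
    label:    1.0 for genuine (matching) pairs, 0.0 for impostor pairs.
    margin:   arbitrary margin; the paper's actual choice may differ.
    """
    genuine_term = label * tf.square(distance)
    impostor_term = (1.0 - label) * tf.square(tf.maximum(margin - distance, 0.0))
    return tf.reduce_mean(genuine_term + impostor_term)
```

An online pair selection scheme would then, for example, pick the hardest impostor pairs (smallest distances) within each mini-batch before applying this loss.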

Code

The input pipeline must be provided by the user. The rest of the implementation assumes a dataset that contains utterance-based extracted features.

▌Lip recognition

For lip recognition, video must be used as input. First, use the cd command to enter the appropriate directory and run the dedicated Python file, as sketched below:
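The directory and file names here follow the repository layout as I understand it (a lip_tracking folder under code/ containing VisualizeLip.py); treat the exact paths and arguments as assumptions and adapt them to your checkout:

```
cd code/lip_tracking/
python VisualizeLip.py --input input_video_file_name.ext --output output_video_file_name.ext
```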

Running the above script extracts the lip motion by saving the mouth region of each frame, and creates a new video in which the mouth region is outlined in each frame for better visualization.

The required arguments are already defined inside the VisualizeLip.py file.
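A sketch of those argument definitions (names and defaults reconstructed from the script's documented command line, so treat them as indicative rather than authoritative) looks roughly like this:

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
                help="path to the input video file")
ap.add_argument("-o", "--output", required=True,
                help="path to the output video file")
ap.add_argument("-f", "--fps", type=int, default=30,
                help="frame rate of the output video")
ap.add_argument("-c", "--codec", type=str, default="MJPG",
                help="codec of the output video")
args = vars(ap.parse_args())
```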

Some of the defined parameters have default values and do not require further action.

▌Processing

In the visual part, the videos are post-processed so that they all have an equal frame rate of 30 f/s. Then, the dlib library is used to track the faces in the video and extract the mouth regions. Finally, all mouth regions are resized to the same size and concatenated to form the input feature dataset. The dataset does not contain any audio files; the audio is extracted from the videos using the FFmpeg framework. The data processing pipeline is shown in the following figure:
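As a rough code sketch of the same pipeline (assuming dlib's 68-point facial landmark model, OpenCV for frame handling, and FFmpeg invoked from the command line; this is not the project's actual preprocessing code):

```python
import subprocess

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard dlib 68-point landmark model, downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth_frames(video_path, size=(100, 60)):
    """Return a list of resized grayscale mouth crops, one per frame."""
    mouths = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue
        landmarks = predictor(gray, faces[0])
        # Landmarks 48-67 outline the mouth region.
        pts = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                        for i in range(48, 68)], dtype=np.int32)
        x, y, w, h = cv2.boundingRect(pts)
        mouths.append(cv2.resize(gray[y:y + h, x:x + w], size))
    cap.release()
    return mouths

def extract_audio(video_path, wav_path):
    """Extract a 16 kHz mono WAV track from the video with FFmpeg."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-ac", "1", "-ar", "16000", wav_path], check=True)
```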

▌Input pipeline

Our proposed architecture uses two non-identical convolutional networks (ConvNets) whose input is a pair of speech and video streams. The network input is a pair of features representing the lip motion and the speech extracted from 0.3 seconds of video. The main task is to determine whether an audio stream corresponds to a lip motion video within the desired stream duration. The next two subsections describe the inputs of the speech and visual streams, respectively.

Speech Net

On the time axis, the temporal features are non-overlapping 20 ms windows that are used to generate local spectral features. The speech feature input is represented as an image-like data cube whose channels are the MFEC spectrogram and the first and second derivatives of the MFEC features; these three channels correspond to the image depth. From a 0.3-second clip, 15 temporal feature sets (each consisting of 40 MFEC features) can be derived, and together they form the speech feature cube. The input feature dimension of the audio stream is therefore 15x40x3, as shown below:

Speech features are extracted using the SpeechPy package.
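The exact extraction code lives in the file referenced below; purely as an illustration of how such a 15x40x3 MFEC cube could be produced with SpeechPy (the file name is hypothetical, and the parameters are simply chosen to match the 20 ms non-overlapping windows and 40 filters described above):

```python
import scipy.io.wavfile as wav
import speechpy

# Hypothetical 0.3-second audio clip extracted from the video.
fs, signal = wav.read("sample_0.3s.wav")

# 40 log-energy filterbank (MFEC) features from non-overlapping 20 ms windows.
mfec = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                             frame_length=0.020, frame_stride=0.020,
                             num_filters=40)

# Stack the features with their first and second derivatives.
cube = speechpy.feature.extract_derivative_feature(mfec)
print(cube.shape)  # expected: (15, 40, 3) for a 0.3-second clip
```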

To understand how input pipelines work, see:

code/speech_input/input_feature.py

Visual Net

The frame rate of each video clip used in this work is 30 f/s. Thus, 9 consecutive image frames form a 0.3 second video stream. The input to the network's video stream is a cube of size 9x60x100, where 9 is the number of frames representing temporal information. Each channel is a 60x100 grayscale image of the mouth region.
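As a small illustration (with random placeholder crops standing in for the output of a mouth-extraction step like the sketch above), the visual cube is simply the nine stacked frames:

```python
import numpy as np

# Nine consecutive 60x100 grayscale mouth crops; random placeholders here.
mouth_frames = [np.random.randint(0, 256, (60, 100), dtype=np.uint8)
                for _ in range(9)]

visual_cube = np.stack(mouth_frames, axis=0).astype(np.float32)
print(visual_cube.shape)  # (9, 60, 100)
```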

Architecture

The architecture is a coupled 3D convolutional neural network, where two networks with different weights must be trained. In the visual network, spatial and temporal information of lip movements are combined to exploit temporal correlations. In the audio network, the extracted energy features serve as the spatial dimension, and the stacked audio frames constitute the temporal dimension. In our proposed 3D convolutional neural network architecture, convolution operations are performed on two audiovisual streams over consecutive time frames.
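The paper and the repository define the exact layer configuration; the following is only a minimal two-tower sketch in Keras (layer sizes and the 64-dimensional embedding are arbitrary choices, not the authors' values) showing how a coupled pair of 3D ConvNets with separate weights can map both streams into a common representation space:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv3d_tower(input_shape, name):
    """Small 3D-conv feature extractor; all sizes are illustrative only."""
    inp = layers.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling3D()(x)
    emb = layers.Dense(64)(x)  # common embedding size for both towers
    emb = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(emb)
    return Model(inp, emb, name=name)

# Visual tower: 9 frames of 60x100 grayscale mouth crops.
visual_net = conv3d_tower((9, 60, 100, 1), "visual_net")
# Audio tower: the 15x40x3 MFEC cube, treated here as a single-channel volume.
audio_net = conv3d_tower((15, 40, 3, 1), "audio_net")

# The distance between the two embeddings feeds a pairwise loss such as the
# contrastive-style loss sketched earlier.
video_in = layers.Input(shape=(9, 60, 100, 1))
audio_in = layers.Input(shape=(15, 40, 3, 1))
distance = layers.Lambda(lambda t: tf.norm(t[0] - t[1], axis=1))(
    [visual_net(video_in), audio_net(audio_in)])
matcher = Model([video_in, audio_in], distance)
matcher.summary()
```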

Training/Evaluation

First, clone the repository and use the cd command to enter the dedicated directory. Then the train.py file must be executed for training; for the evaluation phase, a similar test script is executed:
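The directory and script names below follow the repository layout as described in its README at the time of writing (a training_evaluation folder under code/ with train.py and test.py) and may differ in newer versions:

```
git clone https://github.com/astorfi/lip-reading-deeplearning.git
cd lip-reading-deeplearning/code/training_evaluation/

# Training phase
python train.py

# Evaluation phase
python test.py
```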

▌Running results

The results below show the effect of this method on accuracy as well as on the speed of convergence.

The best result, the one on the far right, belongs to our proposed method.

The effect of the proposed online pair selection method is shown in the figure above.

After reading this, I hope you will find the source code on GitHub and start practicing! Attached below are the demos given by the author.

Demo addresses:

1. Training/Evaluation:

https://asciinema.org/a/kXIDzZt1UzRioL1gDPzOy9VkZ

2. Lip Tracking:

https://asciinema.org/a/RiZtscEJscrjLUIhZKkoG3GVm
