A hands-on guide to developing a video dynamic gesture recognition model

This article is shared from the Huawei Cloud Community post "CNN-VIT Video Dynamic Gesture Recognition [Playing with Huawei Cloud]" by HouYanSong.

CNN-VIT video dynamic gesture recognition


Artificial intelligence is advancing rapidly and has profoundly influenced the field of human-computer interaction. Gestures, as a natural and fast way to interact, are widely used in areas such as intelligent driving and virtual reality. The task of gesture recognition is to let a computer quickly and accurately determine which gesture an operator is making. This article uses ModelArts to develop and train a video dynamic gesture recognition model that detects dynamic gesture categories such as swipe up, swipe down, swipe left, swipe right, open, and close, achieving functionality similar to the air gestures on Huawei phones.

Algorithm introduction

The CNN-VIT video dynamic gesture recognition algorithm first uses the pre-trained InceptionResNetV2 network to extract features from a video clip frame by frame, and then feeds the frame features into a Transformer Encoder for classification. We validate the algorithm on a dynamic gesture recognition sample dataset containing 108 videos, covering 7 gesture classes: invalid gesture, swipe up, swipe down, swipe left, swipe right, open, and close. The overall pipeline is as follows:

[Figure: overall pipeline of the CNN-VIT gesture recognition algorithm]
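
The snippets below share a few imports and hyperparameters that the published Notebook defines up front. A minimal sketch of that setup, assuming an input size of 299 (the default for InceptionResNetV2); MAX_SEQUENCE_LENGTH and NUM_FEATURES follow from the 40-frame padding and the 1536-dimensional pooled features used later:

import cv2
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tqdm import tqdm

# Shared hyperparameters (the IMG_SIZE value is an assumption)
IMG_SIZE = 299              # frame size fed to InceptionResNetV2
MAX_SEQUENCE_LENGTH = 40    # number of frames kept per video
NUM_FEATURES = 1536         # dimensionality of the pooled InceptionResNetV2 output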

First, we decode the captured video file into frames, keeping one frame out of every 4, and then center-crop and preprocess each image. The code is as follows:

def load_video(file_name):
    cap = cv2.VideoCapture(file_name)
    # Keep one frame every frame_interval frames
    frame_interval = 4
    frames = []
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        
        # Save every frame_interval-th frame
        if count % frame_interval == 0:
            # Center crop
            frame = crop_center_square(frame)
            # Resize
            frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
            # BGR -> RGB  [0,1,2] -> [2,1,0]
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
        count += 1
    
    cap.release()
    return np.array(frames)
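
The helper crop_center_square is not shown in the article; a minimal sketch of what it might look like, cropping the largest centered square out of each frame:

def crop_center_square(frame):
    # Crop the largest centered square from an (H, W, C) frame
    h, w = frame.shape[:2]
    side = min(h, w)
    start_y = (h - side) // 2
    start_x = (w - side) // 2
    return frame[start_y:start_y + side, start_x:start_x + side]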

Next, we create an image feature extractor that uses the pre-trained InceptionResNetV2 model to extract per-frame features. The code is as follows:

def get_feature_extractor():
    feature_extractor = keras.applications.inception_resnet_v2.InceptionResNetV2(
        weights = 'imagenet',
        include_top = False,
        pooling = 'avg',
        input_shape = (IMG_SIZE, IMG_SIZE, 3)
    )
    
    preprocess_input = keras.applications.inception_resnet_v2.preprocess_input
    
    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)
    outputs = feature_extractor(preprocessed)
    
    model = keras.Model(inputs, outputs, name = 'feature_extractor')
    
    return model
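
A quick usage check, assuming the constants above: with pooling='avg', InceptionResNetV2 maps each frame to a 1536-dimensional vector, which is where NUM_FEATURES comes from:

feature_extractor = get_feature_extractor()
# The avg-pooled InceptionResNetV2 output is a 1536-dim feature vector per frame
print(feature_extractor.output_shape)   # (None, 1536)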

Then we extract the feature vectors for each video. If a video has fewer than 40 frames, it is padded with all-zero arrays:

def load_data(videos, labels):
    
    video_features = []

    for video in tqdm(videos):
        frames = load_video(video)
        counts = len(frames)
        # If the number of frames is less than MAX_SEQUENCE_LENGTH
        if counts < MAX_SEQUENCE_LENGTH:
            # Pad with all-zero frames
            diff = MAX_SEQUENCE_LENGTH - counts
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # Concatenate along the frame axis
            frames = np.concatenate((frames, padding))
        # Keep the first MAX_SEQUENCE_LENGTH frames
        frames = frames[:MAX_SEQUENCE_LENGTH, :]
        # Extract features in batches
        video_feature = feature_extractor.predict(frames)
        video_features.append(video_feature)
        
    return np.array(video_features), np.array(labels)
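
As a hedged example of how this might be called, assuming the videos live under hand_gesture/ and the label index is encoded in the filename (as in the sample file name shown later):

import glob

# Collect video paths and parse the label index from the filename,
# e.g. 'hand_gesture/woman_014_0_7.mp4' -> label 0 (invalid gesture)
videos = glob.glob('hand_gesture/*.mp4')
labels = [int(v.split('_')[-2]) for v in videos]

video_features, classes = load_data(videos, labels)
print(video_features.shape)   # (num_videos, MAX_SEQUENCE_LENGTH, NUM_FEATURES)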

Finally, create the VIT Model with the following code:

# Positional encoding
class PositionalEmbedding(layers.Layer):
    def __init__(self, seq_length, output_dim):
        super().__init__()
        # Positions 0 .. seq_length - 1
        self.positions = tf.range(start=0, limit=seq_length)
        self.positional_embedding = layers.Embedding(input_dim=seq_length, output_dim=output_dim)
    
    def call(self, x):
        # Look up the embedding for each position
        positions_embedding = self.positional_embedding(self.positions)
        # Add the position embeddings to the inputs
        return x + positions_embedding

# Encoder
class TransformerEncoder(layers.Layer):
    
    def __init__(self, num_heads, embed_dim):
        super().__init__()
        self.p_embedding = PositionalEmbedding(MAX_SEQUENCE_LENGTH, NUM_FEATURES)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=0.1)
        self.layernorm = layers.LayerNormalization()
    
    def call(self,x):
        # positional embedding
        positional_embedding = self.p_embedding(x)
        # self attention
        attention_out = self.attention(
            query = positional_embedding,
            value = positional_embedding,
            key = positional_embedding,
            attention_mask = None
        )
        # layer norm with residual connection        
        output = self.layernorm(positional_embedding + attention_out)
        return output

def video_cls_model(class_vocab):
    #Number of categories
    classes_num = len(class_vocab)
    # Define model
    model = keras.Sequential([
        layers.InputLayer(input_shape=(MAX_SEQUENCE_LENGTH, NUM_FEATURES)),
        TransformerEncoder(2, NUM_FEATURES),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.1),
        layers.Dense(classes_num, activation="softmax")
    ])
    # Compile model
    model.compile(optimizer = keras.optimizers.Adam(1e-5), 
                  loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics = ['accuracy']
    )
    return model
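
A usage sketch, assuming the classes array produced by load_data above; class_vocab only needs to have one entry per category:

# Build the classification model from the set of observed class labels
class_vocab = sorted(set(classes))
model = video_cls_model(class_vocab)
model.summary()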

Model training

For the complete experience, you can open the Notebook I published in ModelArts and run it with one click.
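
The published Notebook contains the full training step; a minimal sketch of what it might look like (the split ratio, epoch count, and batch size here are assumptions):

from sklearn.model_selection import train_test_split

# Hold out part of the data for validation (split ratio is an assumption)
train_x, test_x, train_y, test_y = train_test_split(
    video_features, classes, test_size=0.2, random_state=42)

history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=50,
                    batch_size=16)

# Save in the SavedModel format that is loaded later during inference
model.save('saved_model')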

The final accuracy of the model on the full dataset reached 87%, which is a relatively good result for training on such a small dataset.

Video inference

First, load the VIT model and define the mapping from category index to label:

import random
import tensorflow as tf

# Load the trained model
model = tf.keras.models.load_model('saved_model')
# Category labels
label_to_name = {0: 'Invalid gesture', 1: 'Swipe up', 2: 'Swipe down', 3: 'Swipe left',
                 4: 'Swipe right', 5: 'Open', 6: 'Close', 7: 'Zoom in', 8: 'Zoom out'}

Then use the image feature extractor InceptionResNetV2 to extract video features:

# Get video features
def getVideoFeat(frames):
    
    frames_count = len(frames)
    
    # If the number of frames is less than MAX_SEQUENCE_LENGTH
    if frames_count < MAX_SEQUENCE_LENGTH:
        # Pad with all-zero frames
        diff = MAX_SEQUENCE_LENGTH - frames_count
        padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
        # Concatenate along the frame axis
        frames = np.concatenate((frames, padding))

    # Keep the first MAX_SEQUENCE_LENGTH frames
    frames = frames[:MAX_SEQUENCE_LENGTH, :]
    # Compute the per-frame features, shape (MAX_SEQUENCE_LENGTH, 1536)
    video_feat = feature_extractor.predict(frames)

    return video_feat

Finally, the feature vector of the video sequence is input into the Transformer Encoder for prediction:

import imageio
from IPython.display import Image, display

# Video prediction
def testVideo():
    test_file = random.sample(videos, 1)[0]
    label = test_file.split('_')[-2]

    print('File name: {}'.format(test_file))
    print('Real category: {}'.format(label_to_name.get(int(label))))

    # Read the frames of the video
    frames = load_video(test_file)
    # Keep the first MAX_SEQUENCE_LENGTH frames for display
    frames = frames[:MAX_SEQUENCE_LENGTH].astype(np.uint8)
    # Save as GIF
    imageio.mimsave('animation.gif', frames, duration=10)
    # Extract features
    feat = getVideoFeat(frames)
    # Model inference
    prob = model.predict(tf.expand_dims(feat, axis=0))[0]
    
    print('Predicted category:')
    for i in np.argsort(prob)[::-1][:5]:
        print('{}: {}%'.format(label_to_name[i], round(prob[i]*100, 2)))
    
    return display(Image(open('animation.gif', 'rb').read()))
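
Running the helper picks a random video, prints the ground-truth and predicted categories, and displays the saved GIF:

testVideo()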

Model prediction results:

File name: hand_gesture/woman_014_0_7.mp4
Real category: Invalid gesture
Predicted category:
Invalid gesture: 99.82%
Swipe down: 0.12%
Close: 0.04%
Swipe left: 0.01%
Open: 0.01%

 

Click to follow and be the first to learn about Huawei Cloud's new technologies~

 
