This article is shared from the Huawei Cloud Community article "CNN-VIT Video Dynamic Gesture Recognition [Playing with Huawei Cloud]", author: HouYanSong.
CNN-VIT video dynamic gesture recognition
Artificial intelligence is developing rapidly and has profoundly shaped the field of human-computer interaction. Gestures, as a natural and fast way of interacting, are widely used in areas such as intelligent driving and virtual reality. The task of gesture recognition is to let the computer quickly and accurately determine the type of gesture an operator makes. This article uses ModelArts to develop and train a video dynamic gesture recognition model that detects gesture categories such as swipe up, swipe down, swipe left, swipe right, open, and close, achieving a function similar to the air gestures on Huawei mobile phones.
Algorithm introduction
The CNN-VIT video dynamic gesture recognition algorithm first uses the pre-trained InceptionResNetV2 network to extract features from the video clip frame by frame, then feeds the feature sequence into a Transformer Encoder for classification. We evaluate the algorithm on a dynamic gesture recognition sample data set containing 108 videos, covering 7 categories: invalid gesture, swipe up, swipe down, swipe left, swipe right, open, and close. The overall workflow is as follows:
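The code snippets below rely on a few global constants that the original post defines elsewhere. Based on the description (40-frame clips, 1536-dimensional frame features), plausible values would be the following; IMG_SIZE in particular is an assumption, 299 being the canonical InceptionResNetV2 input size:

# Assumed global constants (not shown in the original post)
IMG_SIZE = 299               # assumed; the canonical InceptionResNetV2 input size
MAX_SEQUENCE_LENGTH = 40     # clips are padded/truncated to 40 frames
NUM_FEATURES = 1536          # width of InceptionResNetV2 features with pooling='avg'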
First, we decode the captured video file and extract key frames, keeping one frame out of every four, then center-crop and preprocess each image. The code is as follows:
import cv2
import numpy as np

def load_video(file_name):
    cap = cv2.VideoCapture(file_name)
    # Keep one frame out of every frame_interval frames
    frame_interval = 4
    frames = []
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Save every frame_interval-th frame
        if count % frame_interval == 0:
            # Center crop
            frame = crop_center_square(frame)
            # Resize
            frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
            # BGR -> RGB: reorder channels [0, 1, 2] -> [2, 1, 0]
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
        count += 1
    cap.release()
    return np.array(frames)
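The helper crop_center_square is not defined in the post; a minimal sketch of what it presumably does (cropping the largest centered square from the frame) is:

def crop_center_square(frame):
    # Crop the largest centered square from a (H, W, 3) frame
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x - min_dim) // 2
    start_y = (y - min_dim) // 2
    return frame[start_y:start_y + min_dim, start_x:start_x + min_dim]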
Next, we create an image feature extractor that uses the pre-trained InceptionResNetV2 model to extract a feature vector from each frame. The code is as follows:
def get_feature_extractor():
    feature_extractor = keras.applications.inception_resnet_v2.InceptionResNetV2(
        weights='imagenet',
        include_top=False,
        pooling='avg',
        input_shape=(IMG_SIZE, IMG_SIZE, 3)
    )
    preprocess_input = keras.applications.inception_resnet_v2.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)
    outputs = feature_extractor(preprocessed)

    model = keras.Model(inputs, outputs, name='feature_extractor')
    return model
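As a quick sanity check, each frame should map to a 1536-dimensional feature vector (the "(N, 1536)" shape referenced in the inference code later on):

feature_extractor = get_feature_extractor()
print(feature_extractor.output_shape)  # (None, 1536)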
Next, extract a feature vector for each video. If a video has fewer than 40 frames, it is padded with all-zero arrays:
from tqdm import tqdm

def load_data(videos, labels):
    video_features = []

    for video in tqdm(videos):
        frames = load_video(video)
        counts = len(frames)
        # If the number of frames is less than MAX_SEQUENCE_LENGTH, pad
        if counts < MAX_SEQUENCE_LENGTH:
            diff = MAX_SEQUENCE_LENGTH - counts
            # Create an all-zero numpy array for padding
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # Concatenate along the frame axis
            frames = np.concatenate((frames, padding))
        # Keep the first MAX_SEQUENCE_LENGTH frames
        frames = frames[:MAX_SEQUENCE_LENGTH, :]
        # Extract features for the whole clip in one batch
        video_feature = feature_extractor.predict(frames)
        video_features.append(video_feature)

    return np.array(video_features), np.array(labels)
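Called on lists of video paths and integer labels (variable names assumed here), this yields one (MAX_SEQUENCE_LENGTH, NUM_FEATURES) feature matrix per video:

# 'videos' and 'labels' are assumed lists of file paths and integer class ids
video_features, classes = load_data(videos, labels)
print(video_features.shape)  # (num_videos, 40, 1536)
print(classes.shape)         # (num_videos,)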
Finally, create the VIT Model with the following code:
# Positional encoding
class PositionalEmbedding(layers.Layer):
    def __init__(self, seq_length, output_dim):
        super().__init__()
        # Positions 0 .. seq_length-1
        self.positions = tf.range(start=0, limit=seq_length)
        self.positional_embedding = layers.Embedding(input_dim=seq_length, output_dim=output_dim)

    def call(self, x):
        # Look up the position encodings and add them to the inputs
        positions_embedding = self.positional_embedding(self.positions)
        return x + positions_embedding

# Encoder
class TransformerEncoder(layers.Layer):
    def __init__(self, num_heads, embed_dim):
        super().__init__()
        self.p_embedding = PositionalEmbedding(MAX_SEQUENCE_LENGTH, NUM_FEATURES)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=0.1)
        self.layernorm = layers.LayerNormalization()

    def call(self, x):
        # Positional embedding
        positional_embedding = self.p_embedding(x)
        # Self-attention
        attention_out = self.attention(
            query=positional_embedding,
            value=positional_embedding,
            key=positional_embedding,
            attention_mask=None
        )
        # Layer norm with residual connection
        output = self.layernorm(positional_embedding + attention_out)
        return output

def video_cls_model(class_vocab):
    # Number of categories
    classes_num = len(class_vocab)
    # Define the model
    model = keras.Sequential([
        layers.InputLayer(input_shape=(MAX_SEQUENCE_LENGTH, NUM_FEATURES)),
        TransformerEncoder(2, NUM_FEATURES),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.1),
        layers.Dense(classes_num, activation="softmax")
    ])
    # Compile the model
    model.compile(
        optimizer=keras.optimizers.Adam(1e-5),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )
    return model
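With a hypothetical class vocabulary matching the 7 categories above (the notebook derives the real one from the data set), the model can be instantiated and inspected; its input is the (MAX_SEQUENCE_LENGTH, NUM_FEATURES) feature sequence:

# Hypothetical class vocabulary
class_vocab = ['Invalid gesture', 'Swipe up', 'Swipe down', 'Swipe left',
               'Swipe right', 'Open', 'Close']
model = video_cls_model(class_vocab)
model.summary()  # expects input of shape (None, 40, 1536)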
Model training
For the complete experience, you can open the notebook I published on ModelArts and run it with a single click:
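The training loop itself lives in the notebook; a minimal sketch, assuming the arrays returned by load_data have been split into train and test sets (all names and hyperparameters here are assumptions, not the notebook's actual settings), might be:

# Hypothetical training invocation
history = model.fit(
    train_features, train_labels,
    validation_data=(test_features, test_labels),
    epochs=50,       # assumed
    batch_size=16    # assumed
)
model.save('saved_model')  # matches the path loaded in the inference section below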
The model ultimately reaches 87% accuracy on the full data set, a fairly good result for training on such a small data set.
Video inference
First, load the VIT model and set up the mapping from category indices to labels:
import random

# Load the trained model
model = tf.keras.models.load_model('saved_model')

# Category labels
label_to_name = {
    0: 'Invalid gesture',
    1: 'Swipe up',
    2: 'Swipe down',
    3: 'Swipe left',
    4: 'Swipe right',
    5: 'Open',
    6: 'Close',
    7: 'Zoom in',
    8: 'Zoom out'
}
Then use the image feature extractor InceptionResNetV2 to extract video features:
# Compute the feature sequence for one video
def getVideoFeat(frames):
    frames_count = len(frames)
    # If the number of frames is less than MAX_SEQUENCE_LENGTH, pad
    if frames_count < MAX_SEQUENCE_LENGTH:
        diff = MAX_SEQUENCE_LENGTH - frames_count
        # Create an all-zero numpy array for padding
        padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
        # Concatenate along the frame axis
        frames = np.concatenate((frames, padding))
    # Keep the first MAX_SEQUENCE_LENGTH frames
    frames = frames[:MAX_SEQUENCE_LENGTH, :]
    # Per-frame features, shape (MAX_SEQUENCE_LENGTH, 1536)
    video_feat = feature_extractor.predict(frames)
    return video_feat
Finally, the feature vector of the video sequence is input into the Transformer Encoder for prediction:
import imageio
from IPython.display import display, Image

# Video prediction
def testVideo():
    test_file = random.sample(videos, 1)[0]
    label = test_file.split('_')[-2]

    print('File name: {}'.format(test_file))
    print('Real category: {}'.format(label_to_name.get(int(label))))

    # Read the frames of the video
    frames = load_video(test_file)
    # Keep the first MAX_SEQUENCE_LENGTH frames for display
    frames = frames[:MAX_SEQUENCE_LENGTH].astype(np.uint8)
    # Save as a GIF
    imageio.mimsave('animation.gif', frames, duration=10)
    # Extract features
    feat = getVideoFeat(frames)
    # Model inference
    prob = model.predict(tf.expand_dims(feat, axis=0))[0]

    print('Predicted categories:')
    for i in np.argsort(prob)[::-1][:5]:
        print('{}: {}%'.format(label_to_name[i], round(prob[i]*100, 2)))

    return display(Image(open('animation.gif', 'rb').read()))
Model prediction results:
File name: hand_gesture/woman_014_0_7.mp4
Real category: Invalid gesture
Predicted categories:
Invalid gesture: 99.82%
Swipe down: 0.12%
Close: 0.04%
Swipe left: 0.01%
Open: 0.01%