Implementing the SSD Algorithm

Objective of this article: to introduce an excellent project, a Keras implementation of the SSD algorithm.

This article's contents:

  • 0 Introduction
  • 1 How to train an SSD model
  • 2 How to evaluate an SSD model
  • 3 How to fine-tune an SSD model
  • 4 Other notes

0 Introduction

After reading the SSD paper, I still had many doubts about the specific details, recorded as follows:

  • How is the SSD network implemented?
  • What does the existing data look like?
  • How is an image divided into anchors?
  • How is a label assigned to each anchor?
  • How are positive and negative samples defined? What is the matching strategy?
  • How is hard negative mining implemented?
  • How is data fed into the model, and what comes out?
  • Where is L2 Normalization, and how is it implemented?
  • Where is the atrous (dilated) layer?
  • How is the SSD loss function implemented?
  • How does data flow through the model?
  • How is data augmentation implemented?
  • How are predicted boxes mapped back onto the original image?
  • How is the model's mAP computed on Pascal VOC or MS COCO?

Searching on GitHub, I found this awesome project, an SSD implementation in Keras. It is ideal for those who are learning the SSD algorithm but are still vague on some of the details (mostly myself). The documentation and comments are very detailed, and very clear instructions are provided, such as how to train an SSD model, how to evaluate the model's performance, and how to fine-tune a pre-trained model on your own dataset.

To make the project quick to understand, this article takes a simplified version of the SSD network as an example (the ssd7_training.ipynb file) and records a summary; for details, refer to the project's documentation and comments.

1 How to train an SSD model

Apart from importing libraries and predefining common parameters, the main workflow is divided into four parts:

  • Prepare the model;
    • Build the model;
    • Define a custom loss function and compile;
  • Prepare the data;
    • Define DataGenerator image-generator objects for the training and validation sets;
    • Read the images and label information from files with a generator method;
    • Define the image augmentation chain;
    • Encode the label information into the format required by the loss function with an encoder;
    • Define the data iterator (generator);
  • Train;
    • Define the callback functions;
    • Train;
    • Visualize the training results;
  • Predict (visualize the detection results);
    • Define a data iterator and obtain a batch of samples;
    • Feed the samples to the model for prediction, and decode the predicted boxes;
    • Draw the predicted boxes and the ground-truth boxes on the image for comparison.

1.1 Prepare the model

1.1.1 Build the model

In training mode, build a small SSD model. Predictors are drawn from four feature maps, and each pixel of each prediction feature map corresponds to 4 anchor boxes.

Model building process:

  • Build the base network;
  • Draw predictors from four of the feature maps;
  • Each predictor splits into three branches (refer to the diagram in the first article);
  • The classification branch requires a softmax; the three branches are finally concatenated along the last dimension to give (batch, n_boxes_total, 21 + 4 + 8), which is the raw output of the model. Here n_boxes_total is the total number of anchor boxes over the four prediction feature maps, and the last dimension is 21 classes + 4 gt box offsets + 4 anchor box coordinates + 4 variances;
  • If the mode is inference, a decoder DecodeDetections must also be attached at the end (it outputs the predicted boxes remaining after confidence thresholding, NMS and other filtering).
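The shape bookkeeping above can be sketched in plain Python. The feature map sizes below are hypothetical placeholders; only the 4-anchors-per-pixel and 21 + 4 + 8 structure comes from the text:

```python
# Hypothetical feature map sizes for the four predictor layers (height, width).
feature_map_sizes = [(48, 64), (24, 32), (12, 16), (6, 8)]
n_boxes_per_cell = 4   # each prediction feature-map pixel has 4 anchor boxes
n_classes = 21         # 20 object classes + background

# n_boxes_total: total anchors over all four prediction feature maps
n_boxes_total = sum(h * w * n_boxes_per_cell for h, w in feature_map_sizes)
# last dimension of the raw output: classes + gt offsets + anchor coords + variances
raw_output_last_dim = n_classes + 4 + 4 + 4

print(n_boxes_total, raw_output_last_dim)  # 16320 33
```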

Remarks:

  • The AnchorBoxes layer generates the anchor boxes, but why is it attached after boxes4 rather than after conv4? A: it only uses two dimensions of the intermediate tensor, namely the feature map's height and width, which are the same for both, so either works. However, according to the function description below, it should be attached after conv4, since the input is (batch, height, width, channels).

The AnchorBoxes layer

  • Purpose: given an input feature map, divide the original image into a series of anchor boxes.
  • Process: from the scaling factor and aspect ratio parameters, compute the sizes and the number of the anchor boxes corresponding to one feature-map pixel; then, from the feature map's height and width, obtain the centers of the anchor boxes.
  • Input: (batch, height, width, channels), i.e., the feature map.
  • Output: (batch, height, width, n_boxes, 8), where n_boxes is the number of anchor boxes per feature-map pixel and 8 is the anchor box information, i.e., 4 coordinates + 4 variances.
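The process above can be sketched in numpy. This is a hedged simplification, not the project's layer: the function name, the single scale per layer, and the min(height, width) scaling rule are assumptions for illustration.

```python
import numpy as np

def make_anchors(img_h, img_w, fmap_h, fmap_w, scale, aspect_ratios):
    # Anchor sizes: one (w, h) pair per aspect ratio -> n_boxes per pixel.
    size = scale * min(img_h, img_w)
    wh = np.array([(size * np.sqrt(ar), size / np.sqrt(ar)) for ar in aspect_ratios])
    # Anchor centers: one per feature-map pixel, spaced by the step size.
    step_y, step_x = img_h / fmap_h, img_w / fmap_w
    cy = (np.arange(fmap_h) + 0.5) * step_y
    cx = (np.arange(fmap_w) + 0.5) * step_x
    grid_cx, grid_cy = np.meshgrid(cx, cy)
    boxes = np.zeros((fmap_h, fmap_w, len(aspect_ratios), 4))  # (cx, cy, w, h)
    boxes[..., 0] = grid_cx[..., None]
    boxes[..., 1] = grid_cy[..., None]
    boxes[..., 2] = wh[:, 0]
    boxes[..., 3] = wh[:, 1]
    return boxes

anchors = make_anchors(300, 300, 6, 6, scale=0.2, aspect_ratios=[0.5, 1.0, 2.0])
print(anchors.shape)  # (6, 6, 3, 4): height, width, n_boxes, coordinates
```

The real layer additionally appends the 4 variances to each anchor, giving the trailing dimension of 8.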

The DecodeDetections layer

  • When the model is built with mode = inference, the decoder is attached after the predictors;
  • Process: using parameters such as the confidence threshold, NMS, and the maximum number of outputs, select the top_k predicted boxes for each image;
  • Input is the raw output of the model, (batch, n_boxes_total, n_classes + 4 + 8), where the last dimension is classes (21) + box offsets + anchor boxes and variances (centroids format);
  • Output: (batch, top_k, 6), where the last dimension is (class_id, confidence, box_coordinates) and the coordinate format is xmin, ymin, xmax, ymax. Here top_k = 200, so even when the predicted boxes are not reasonable, 200 boxes will still be produced.

Remarks:

  • The input parameter coords only supports coords = 'centroids'. Note that coords = 'centroids' refers to the format of the input; the actual output format is [xmin, ymin, xmax, ymax].
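The confidence filtering and NMS that the decoder performs can be sketched as follows. This is a hedged single-class simplification (the project's layer does this per class, in TensorFlow); the function names are illustrative:

```python
import numpy as np

def iou(box, boxes):
    # Intersection over union between one box and an array of boxes,
    # all in (xmin, ymin, xmax, ymax) format.
    xmin = np.maximum(box[0], boxes[:, 0])
    ymin = np.maximum(box[1], boxes[:, 1])
    xmax = np.minimum(box[2], boxes[:, 2])
    ymax = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(xmax - xmin, 0, None) * np.clip(ymax - ymin, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def filter_and_nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45, top_k=200):
    keep = scores >= conf_thresh                  # 1) confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)                   # highest score first
    selected = []
    while order.size:                             # 2) greedy NMS
        i = order[0]
        selected.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return boxes[selected][:top_k], scores[selected][:top_k]  # 3) keep top_k

b = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
s = np.array([0.9, 0.8, 0.6])
kept, _ = filter_and_nms(b, s)
print(len(kept))  # 2: the second box overlaps the first and is suppressed
```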

1.1.2 Define a custom loss function and compile

(For the SSD300 model, the pre-trained VGG16 weights need to be loaded at this point.)

Custom loss function: keras_ssd_loss.py

  • Defines a class SSDLoss containing the various concrete loss functions, such as the smooth L1 loss and the log loss;
  • Smooth L1 loss: both parameters are (batch_size, #boxes, 4), and the output is (batch_size, #boxes). Doubt: this applies smooth L1 directly; is the loss computed directly on the coordinate values? Shouldn't the loss be computed on the offsets, or shouldn't the input be offsets rather than raw coordinates? A: what is passed in when compute_loss calls it is in fact the offsets, so it is fine. The log loss is straightforward;
  • compute_loss computes the total loss; its parameters y_true and y_pred are both (batch_size, #boxes, #classes + 12), and the output is a scalar. Question 1: why is the total loss divided by the number of positive samples rather than by the total number? A: the positive-to-negative ratio is fixed at 1:3, so a constant multiple does not affect the result. Question 2: the returned result is still (batch,), not a scalar, so does multiplying by batch_size still make sense? A: Keras insists on per-sample values, i.e., it always keeps the batch dimension and at run time averages the (batch, ...) values over the batch. compute_loss computes the total loss of the whole batch, so to undo the averaging that Keras forces over batch_size, it multiplies by batch_size to recover the sum.

Remarks:

  • Since the loss function is custom-defined, the object passed to compile is a function; Keras calls this function with y_true and y_pred to compute the loss;
  • The format of y_pred is (batch_size, #boxes, #classes + 12), i.e., the raw output of the model; y_true is the ground-truth box values encoded by an SSDInputEncoder class instance, described later;
  • If you load a saved model, note that custom layers and functions must be passed to load_model via custom_objects.
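As a minimal sketch of the two component losses named above (the project implements them in TensorFlow; the numpy versions here only illustrate the math and shapes):

```python
import numpy as np

def smooth_l1_loss(y_true, y_pred):
    # Inputs: (batch, #boxes, 4) offsets; output: (batch, #boxes).
    d = np.abs(y_true - y_pred)
    per_coord = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return per_coord.sum(axis=-1)

def log_loss(y_true, y_pred, eps=1e-15):
    # Inputs: (batch, #boxes, #classes) one-hot vs. softmax; output: (batch, #boxes).
    return -(y_true * np.log(np.clip(y_pred, eps, 1.0))).sum(axis=-1)

yt = np.zeros((1, 2, 4))
yp = np.array([[[0.5, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]])
print(smooth_l1_loss(yt, yp))  # [[0.125 1.5]]: quadratic below 1, linear above
```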

1.2 Prepare the data

This is generally implemented with a custom image-generator class DataGenerator and its methods. The generate() method of DataGenerator needs to receive the image augmentation chain and the ground-truth box encoder as parameters, so these two classes must also be customized.

The DataGenerator class

  • __init__() is invoked automatically when a DataGenerator is instantiated; image preprocessing could be done there first (that is how Keras does it; here, the transformations are done in the generate function);
  • The parse_csv() method of DataGenerator reads the images and labels (i.e., the ground-truth boxes) from files. The ground-truth boxes are read in as a list whose length is the number of samples, where each element is a 2D array, i.e., array([[class_id, xmin, ymin, xmax, ymax], ...]) with shape (n_boxes, 5), where n_boxes is the number of ground-truth boxes for that sample;
  • The generate() method of DataGenerator receives the image augmentation chain, the ground-truth box encoder, and other parameters, and produces batches of data (X, y). (Note: Keras's built-in flow_from_directory implements both functions, reading file data and generating (X, y); but because the CSV file must be parsed here, and other formats are possible, the work is split into two functions.)
  • One method for acceleration: first read the images and labels with parse_csv, then use the create_hdf5_dataset() function to convert the images and labels into h5 files (the training set is nearly 8 GB and the validation set nearly 2 GB; the ground-truth boxes are included but not yet encoded). Once created, DataGenerator can read the h5 files directly, and parse_csv is no longer needed before generate. But in my tests, training with and without the h5 files seemed to make no difference: one epoch took 12 minutes either way.
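The structure of generate() can be sketched as follows. This is a hypothetical simplification: the real method also shuffles, handles epoch boundaries more carefully, and uses the project's own transform and encoder objects.

```python
import numpy as np

def generate(images, labels, batch_size, transformations=(), label_encoder=None):
    # Yields (X, y) batches forever, as Keras fit_generator expects.
    i = 0
    while True:
        batch_X = images[i:i + batch_size]
        batch_y = labels[i:i + batch_size]
        for t in transformations:           # apply the augmentation chain
            batch_X, batch_y = t(batch_X, batch_y)
        if label_encoder is not None:       # encode gt boxes -> y_encoded
            batch_y = label_encoder(batch_y)
        yield np.array(batch_X), batch_y
        i = (i + batch_size) % len(images)

imgs = [np.zeros((32, 32, 3)) for _ in range(4)]
lbls = [np.array([[1, 0, 0, 10, 10]]) for _ in range(4)]
X, y = next(generate(imgs, lbls, batch_size=2))
print(X.shape, len(y))  # (2, 32, 32, 3) 2
```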

Define the data augmentation chain

  • The DataAugmentationConstantInputSize class distorts the images; are the ground-truth boxes transformed as well? They would no longer match otherwise. How is this handled? A: the transformation methods in the data_generator.object_detection_2d_geometric_ops module take the labels in and transform them together with the image.
  • In Python, if a class defines a __call__() method, then after instantiating the class, calling the instance with instance_name() invokes __call__(). Custom Keras layers use call() instead of __call__();
  • DataAugmentationConstantInputSize's __init__() collects the transformation objects into a sequence, and its __call__() function applies them.
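The __call__-chaining pattern described above can be sketched like this (the class and transform names here are illustrative, not the project's own):

```python
import numpy as np

class Flip:
    def __call__(self, image, labels):
        # A real transform would also flip the box coordinates in labels.
        return image[:, ::-1], labels

class AugmentationChain:
    def __init__(self):
        self.sequence = [Flip()]  # __init__ collects the transform objects

    def __call__(self, image, labels):
        for transform in self.sequence:  # __call__ applies them in order
            image, labels = transform(image, labels)
        return image, labels

img = np.arange(6).reshape(2, 3)
out, _ = AugmentationChain()(img, labels=None)  # instance_name() -> __call__
print(out.tolist())  # [[2, 1, 0], [5, 4, 3]]
```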

An SSDInputEncoder class instance encodes the ground-truth boxes into the y_true format required by the loss function (denoted y_encoded here)

  • The input gt_labels is a list of length batch_size, where each element is a 2D array, i.e., array([[class_id, xmin, ymin, xmax, ymax], ...]);
  • The main functionality is implemented in the __call__() function in three steps:
    • From the original image size, the scaling factors, the aspect ratios, and the feature map sizes, create the y_encoded template (i.e., a series of anchors) with shape (batch, #boxes, 21 + 4 + 4 + 4), where the last dimension is 21 classes + 4 gt box coordinates + 4 anchor box coordinates + 4 variances;
    • Match the ground-truth boxes to the anchor boxes, i.e., update the 21 + 4 part of the last dimension;
    • Convert the gt coordinates into offsets relative to the anchor boxes;
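Steps 2 and 3 can be sketched in numpy. This is a hedged simplification: real matching also uses an IoU threshold and handles ties; the variance values [0.1, 0.1, 0.2, 0.2] are the usual SSD defaults, assumed here.

```python
import numpy as np

def iou_centroids(gt, anchors):
    # IoU between one gt box and an array of anchors, all in (cx, cy, w, h).
    def to_corners(b):
        cx, cy, w, h = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
        return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    g, a = to_corners(gt), to_corners(anchors)
    inter_w = np.clip(np.minimum(g[2], a[:, 2]) - np.maximum(g[0], a[:, 0]), 0, None)
    inter_h = np.clip(np.minimum(g[3], a[:, 3]) - np.maximum(g[1], a[:, 1]), 0, None)
    inter = inter_w * inter_h
    return inter / (gt[2] * gt[3] + anchors[:, 2] * anchors[:, 3] - inter)

def encode(gt_box, anchors, variances=(0.1, 0.1, 0.2, 0.2)):
    best = np.argmax(iou_centroids(gt_box, anchors))      # matching step
    a = anchors[best]
    # Offsets: (dcx/w_a, dcy/h_a, log(w/w_a), log(h/h_a)), each / variance.
    off = np.array([(gt_box[0] - a[0]) / a[2], (gt_box[1] - a[1]) / a[3],
                    np.log(gt_box[2] / a[2]), np.log(gt_box[3] / a[3])]) / variances
    return best, off

anchors = np.array([[50., 50., 40., 40.], [150., 150., 40., 40.]])
best, off = encode(np.array([55., 50., 40., 40.]), anchors)
print(best)  # 0: the first anchor overlaps the gt box most
print(off)   # approximately [1.25, 0, 0, 0]
```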

Call train_dataset.generate() to produce the required data (X, y)

  • Once the augmentation-chain object and the SSDInputEncoder object are ready, they are passed together with other parameters to train_dataset.generate, specifying that the generator return data in the format (processed_images, encoded_labels) (the former has shape (batch, h_img, w_img, channel); the latter is the output of the SSDInputEncoder class instance), for later use by model.fit_generator.

1.3 Train

Several useful callback functions are defined: ModelCheckpoint (saves the model after each epoch), CSVLogger (saves the loss and metrics to a CSV file), EarlyStopping (early stopping), ReduceLROnPlateau (automatically reduces the learning rate on plateaus); SSD300 also uses LearningRateScheduler (adjusts the learning rate on a schedule) and TerminateOnNaN (stops training when NaN appears in the data). The most commonly used are ModelCheckpoint and CSVLogger.

The training parameters initial_epoch and final_epoch are also very interesting: they let you resume training from where you left off. (No more fear of the connection dropping during a noon nap :-))

Visualizing the training results: you can plot the return value of fit directly, or read the values recorded in the CSV file.

1.4 Predict (visualize the detection results)

Get the predictions

  • Define a data iterator and obtain a batch of samples;
  • Feed this batch of samples into the model to obtain the predictions (what is obtained here is y_pred, the raw output of the model).

Decode the predicted values

  • The decode_detections function does the same thing as the decoder layer DecodeDetections in the model architecture; both:
    • Convert the offsets into coordinates (either absolute or relative coordinates); after the conversion, the last 12 numbers become 4;
    • Filter by confidence and apply NMS for each class;
    • Select the top_k predicted boxes (if top_k is set); if there are fewer than top_k, output them as they are.
  • Input parameter y_pred: the raw output of the SSD model in training mode, (batch, #boxes, 21 + 4 + 4 + 4), where #boxes covers all anchor boxes;
  • Return value: (batch, filtered_boxes, 6), where filtered_boxes is the number of predicted boxes remaining after filtering and 6 is [class_id, confidence, xmin, ymin, xmax, ymax];
  • Note one difference between the decode_detections function and the DecodeDetections layer: if the number of predicted boxes left after filtering is less than top_k, the former outputs them as they are, while the latter pads up to top_k (to keep the output dimensions fixed).
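The offset-to-coordinates step is the inverse of the encoding. A hedged sketch (variances again assumed to be the usual [0.1, 0.1, 0.2, 0.2]; the function name is illustrative):

```python
import numpy as np

def decode(offsets, anchor, variances=(0.1, 0.1, 0.2, 0.2)):
    # Invert the centroids encoding for one box: offsets + anchor -> corners.
    off = np.asarray(offsets) * variances
    cx = off[0] * anchor[2] + anchor[0]
    cy = off[1] * anchor[3] + anchor[1]
    w = np.exp(off[2]) * anchor[2]
    h = np.exp(off[3]) * anchor[3]
    # Convert centroids (cx, cy, w, h) to corners (xmin, ymin, xmax, ymax).
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

anchor = np.array([50., 50., 40., 40.])  # (cx, cy, w, h)
print(decode([1.25, 0., 0., 0.], anchor))  # [35. 30. 75. 70.]
```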

Draw the predicted boxes on the image for comparison

  • Display the image, and draw the annotation boxes and the predicted boxes on it;
  • plt.cm in matplotlib can map values to pseudo-colors (useful, because we are more sensitive to color changes than to luminance changes); see the reference.

1.5 Differences when training SSD300

  • The SSD300 model is trained on Pascal VOC data; the label files are XML files;
  • Three things to note in the SSD300 model structure:
    • The model structure is the native SSD structure;
    • The atrous (dilated) convolution layer: fc6 = Conv2D(1024, (3, 3), dilation_rate=(6, 6), ...);
    • The L2 Normalization layer: conv4_3_norm = L2Normalization(gamma_init=20, ...)(conv4_3);
  • Question: when defining the SSD300 model, the image channels are swapped to BGR for training, but at prediction time, are the image channels also converted to BGR?

2 How to evaluate an SSD model

The main points:

  • This is a standalone notebook, SSD300_evaluation.ipynb;
  • For SSD evaluation, the model is created in inference mode, while the downloaded weight file VGG_VOC0712Plus_SSD_300x300_ft_iter_160000.h5 was trained on a model created in training mode (since it is a weights file, it was certainly obtained by training, hence on a model created in training mode), so model.load_weights(weights_path, by_name=True) needs by_name=True; otherwise the layers do not match up;
  • How to draw the PR curve.

3 How to fine-tune an SSD model

This part is covered in weight_sampling_tutorial.ipynb.

The author provides several trained SSD models. How can these models be fine-tuned to perform your own tasks on your own dataset? For example, suppose I now want to recognize eight kinds of objects, and the author provides a model trained on MS COCO that recognizes 80 kinds of objects. How should I proceed?

The author suggests three methods and considers the best one to be directly subsampling the classifiers. For example, the output of the first classifier in SSD is (batch, h1, w1, 81 * 4), where h1 and w1 are the height and width of the conv4_3 feature map; subsampling along the output channels gives (batch, h1, w1, 9 * 4), where 9 means eight kinds of objects plus the background; then fine-tune on your own dataset. This method is particularly effective for tasks whose target objects fall within the 80 MS COCO categories.
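The subsampling step can be sketched in numpy. This is a hedged illustration, not the tutorial's code: the kernel shape, class indices, and 4 boxes per pixel are assumptions for the example.

```python
import numpy as np

# A conv classifier kernel predicting 81 COCO classes for each of 4 anchors.
n_boxes = 4
kernel = np.random.rand(3, 3, 512, 81 * n_boxes)   # (kh, kw, in, 81 per box)
classes_to_keep = [0, 1, 2, 3, 4, 5, 6, 7, 8]      # background + 8 chosen classes

# For each anchor box, pick out the output channels of the kept classes.
idx = np.concatenate([b * 81 + np.array(classes_to_keep) for b in range(n_boxes)])
sub_kernel = kernel[..., idx]
print(sub_kernel.shape)  # (3, 3, 512, 36): 9 classes per anchor box
```

The bias vector of the classifier would be sliced with the same indices, and the subsampled weights are then loaded into a model built for 9 classes.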

4 Other notes

  • model.load_weights('./ssd7_weights.h5', by_name=True): here by_name means weights are loaded only into layers with matching names, which is useful for loading weights between models with different structures; see the reference.
  • Try to save the whole model with model.save, because if the weights are stored separately, the optimizer state will be reset when reloading; see the reference.


Origin www.cnblogs.com/inchbyinch/p/12045140.html