[Machine Learning] Pedestrian Detection with HOG+SVM

Task: Using the INRIA Person dataset, extract HOG features and use an SVM to detect pedestrians in images.
This article gives detailed steps and the pitfalls you may hit along the way.

1. Preparation

1. Download the dataset

The INRIA dataset contains images of upright or walking people and was used by Navneet Dalal to train a human detector published in CVPR 2005.

Pitfall 1: The official website http://pascal.inrialpes.fr/data/human/ returns 403 Forbidden.
Solution: Use Motrix/IDM/Xunlei (or another download tool that supports FTP) to fetch ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar (about 1 GB compressed).

2. Unzip the dataset

Pitfall 2: Decompressing with WinRAR/7-Zip runs into problems such as file-overwrite prompts and requests for administrator privileges.
Solution: The archive contains soft links, which WinRAR/7-Zip cannot handle, so use the Linux tar command instead. If you have a Linux-like environment such as WSL (Windows Subsystem for Linux) or MinGW installed, run:

tar xvf INRIAPerson.tar

Here x stands for extract, v for verbose (show each file as it is extracted), and f for file (followed by the archive name). The command creates a folder INRIAPerson in the current directory containing the extracted files.

2. Introduction to HOG features

HOG stands for Histogram of Oriented Gradients, a feature descriptor used for object detection in computer vision. The role of a feature descriptor is to extract useful information and discard redundant information. What distinguishes an object is its shape, that is, its boundary; and since the gray level generally changes abruptly at a boundary, we can locate boundaries by examining the gradient of the image.

1. Gradient

First, assume the input image is a grayscale image (in practice we usually work on a window of the image rather than the whole image). It can be viewed as a function $I(r,c)$ of the row $r$ and column $c$, where $I$ is the gray level of the pixel in row $r$, column $c$ (its value ranges over 0~255). When studying such a function we often consider its gradient; here we need the gradients of $I$ in the $x$ and $y$ directions, approximated by the difference of the gray levels of neighboring pixels:

$$I_x(r,c)=I(r,c+1)-I(r,c-1)$$
$$I_y(r,c)=I(r+1,c)-I(r-1,c)$$

Strictly speaking, both formulas should be divided by 2, but such constants do not matter because of the normalization performed later. The computation can also be understood as convolving the original image with the kernels $\begin{bmatrix}-1&0&1\end{bmatrix}$ and $\begin{bmatrix}-1&0&1\end{bmatrix}^{\mathrm{T}}$. Next we convert the gradients to polar coordinates, constraining the angle to 0°~180°:

$$\mu=\sqrt{I_x^2+I_y^2},\qquad\theta=\frac{180}{\pi}\arctan\frac{I_y}{I_x}$$

where $\arctan$ is defined as

$$\arctan x=\begin{cases}\tan^{-1}x,&x\ge 0\\\tan^{-1}x+\pi,&x<0\end{cases}$$

so that $\theta$ is expressed in degrees.
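A minimal NumPy sketch of these formulas (my own helper, not the article's code; np.arctan2 folded into [0°, 180°) is equivalent to the arctan definition above):

import numpy as np

def gradient_polar(I):
    """Per-pixel gradient magnitude and orientation (in [0, 180) degrees)
    of a 2-D grayscale image, using the central differences above."""
    I = I.astype(float)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = I[:, 2:] - I[:, :-2]  # I(r, c+1) - I(r, c-1)
    Iy[1:-1, :] = I[2:, :] - I[:-2, :]  # I(r+1, c) - I(r-1, c)
    mu = np.hypot(Ix, Iy)               # gradient magnitude
    theta = np.degrees(np.arctan2(Iy, Ix)) % 180  # fold into [0, 180)
    return mu, theta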

2. Cells

We proceed to segment the image into cells of size $C\times C$ (generally $C=8$). The figure below shows such a segmentation; each green box is an $8\times 8$ cell:
8x8 grid
Each cell contains $C^2$ (generally 64) pixels, and each pixel has a gradient. We want to summarize how the gradient directions (the angles $\theta$) of these pixels are distributed. The gradient magnitudes and directions of one cell in the figure above look like this:

gradient
To describe the distribution of angles we use a histogram. Each interval on the histogram's x-axis is called a bin; think of a bin as a bucket into which input data falls according to its range. Since $\theta$ ranges over 0°~180°, we divide this range into $B$ bins (generally $B=9$), so each bin has width $w=\frac{180}{B}=20°$. Number the bins from $0$ to $B-1$: bin $i$ covers $[wi,\,w(i+1))$ and has center $c_i=w\left(i+\frac{1}{2}\right)$. For example, when $B=9$, bin 3 covers $[60°, 80°)$ with center 70°. However, we do not simply drop each pixel into the bucket containing its $\theta$; instead we split its gradient magnitude $\mu$ between the two nearest buckets in a certain ratio. The value accumulated in each bucket is therefore not a count but a "contribution". A pixel's contribution to a bin depends both on its gradient magnitude $\mu$ and on how far its angle $\theta$ lies from the bin's center: the larger the magnitude, the greater the contribution; the farther the distance, the smaller the contribution. Specifically, for a pixel with gradient magnitude $\mu$ and orientation $\theta$, let $j=\left\lfloor\frac{\theta}{w}-\frac{1}{2}\right\rfloor$; then the pixel

  • contributes $v_j=\mu\,\dfrac{c_{j+1}-\theta}{w}$ to the bin numbered $j\bmod B$;
  • contributes $v_{j+1}=\mu\,\dfrac{\theta-c_j}{w}$ to the bin numbered $(j+1)\bmod B$.

In the end, each cell gets a histogram whose entries are the sums of the contributions of all the cell's pixels to that bin. Interestingly, each pixel's contributions to its two bins always sum to exactly $\mu$.

The image below is an example. We divide 0°~180° into $B=9$ parts centered at 10°, 30°, …, 170°. A gradient with $\theta=77°$ and magnitude $\mu$ contributes $0.65\mu$ to bin 3 (range 60°~80°, center 70°) and $0.35\mu$ to bin 4 (range 80°~100°, center 90°).
Histogram example
For the athlete picture, the figure below shows how to compute the contribution of a pixel with gradient magnitude 85 and angle 165°:
distribution contribution
The histogram of this cell is as follows:
histogram
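For concreteness, here is a small sketch (my own helper, not part of the article's code) that splits one pixel's vote exactly as described; the wraparound through mod B handles angles near 0° and 180°:

import math

def bin_votes(mu, theta, B=9):
    """Split one pixel's gradient (magnitude mu, angle theta in [0, 180))
    between the two nearest bins."""
    w = 180.0 / B                       # bin width
    j = math.floor(theta / w - 0.5)     # left neighbouring bin (may be -1)
    c_j = w * (j + 0.5)                 # centre of bin j
    return {j % B: mu * (c_j + w - theta) / w,       # vote for bin j
            (j + 1) % B: mu * (theta - c_j) / w}     # vote for bin j+1

print(bin_votes(1.0, 77.0))  # the worked example above: {3: 0.65, 4: 0.35}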

3. Block Normalization

Although we now have a histogram per cell, its overall height depends strongly on the brightness of the image, and we don't want the histograms of photos taken during the day and photos taken at night to differ wildly in height. Therefore, we normalize. Group the cells into blocks of $2\times 2$ cells each; blocks may overlap, so each block covers $2C\times 2C$ pixels. We scan the whole window in sliding-window fashion, moving one cell at a time. This ensures that every cell not on the edge is covered by four blocks.
block normalization
The picture above is $64\times 128$, i.e. $8\times 16$ cells, so there are 7 horizontal and 15 vertical block positions.

Now, since each block has 4 cells and each cell's histogram has 9 entries, we can concatenate these histograms into a 36-dimensional vector $\boldsymbol{b}$. We normalize $\boldsymbol{b}$ with the Euclidean norm so that its length is close to 1:

$$\boldsymbol{b}:=\frac{\boldsymbol{b}}{\sqrt{\|\boldsymbol{b}\|^2+\varepsilon}}$$

where $\varepsilon$ is a very small positive number added to prevent division by 0.

You may ask: why not normalize each cell individually? Because the overall difference in histogram height between cells carries part of the information, and we don't want to erase it completely. Normalizing per $2\times 2$ block preserves, to some extent, the information carried by the relative gray-level differences between neighboring cells.
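A minimal sketch of this block-normalization step (the helper name and the ε value are my choices, not from the article):

import numpy as np

def normalize_block(cell_hists, eps=1e-5):
    """L2-normalize the concatenated histograms of one 2x2 block of cells.
    cell_hists: array of shape (4, B) -- the four cell histograms."""
    b = np.concatenate(cell_hists)            # 36-dimensional for B = 9
    return b / np.sqrt(np.sum(b ** 2) + eps)  # b := b / sqrt(||b||^2 + eps)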

4. The HOG Feature

Next, we concatenate the $\boldsymbol{b}$ vectors of all the blocks into one long vector $\boldsymbol{h}$, then perform the following three steps:

(1) A preliminary normalization: $\boldsymbol{h}:=\dfrac{\boldsymbol{h}}{\sqrt{\|\boldsymbol{h}\|^2+\varepsilon}}$;

(2) Clip each component of $\boldsymbol{h}$ at a positive threshold $\tau$: for the $n$-th component $h_n$, set $h_n:=\min(h_n,\tau)$;

(3) Normalize again: $\boldsymbol{h}:=\dfrac{\boldsymbol{h}}{\sqrt{\|\boldsymbol{h}\|^2+\varepsilon}}$. And we're done.

For a window with $Y$ rows and $X$ columns, the number of cells is $\frac{Y}{C}\times\frac{X}{C}$ and the number of blocks is $\left(\frac{Y}{C}-1\right)\times\left(\frac{X}{C}-1\right)$, so the final HOG feature $\boldsymbol{h}$ has dimension $4B\times\left(\frac{Y}{C}-1\right)\times\left(\frac{X}{C}-1\right)$. For the athlete picture this is $4\times 9\times 15\times 7=3780$.
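The dimension formula is easy to check in code (a throwaway helper of my own):

def hog_dimension(X, Y, C=8, B=9):
    """HOG feature dimension for an X-column by Y-row window,
    following the block-count formula above."""
    blocks_x = X // C - 1  # horizontal block positions
    blocks_y = Y // C - 1  # vertical block positions
    return 4 * B * blocks_x * blocks_y

print(hog_dimension(64, 128))  # 3780, the athlete window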

5. Extracting HOG features with skimage.feature.hog

skimage can be installed with:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scikit-image

Take this $96\times 160$ picture (named hog_test.png):

hog_test.png

The code below extracts its HOG features:

# encoding: UTF-8
# File: hog.py
# Purpose: extract the HOG features of an image

from skimage.io import imread
from skimage.feature import hog

def extract_hog_feature(filename):
    # Extract the HOG features of the file `filename`
    image = imread(filename, as_gray=True)
    # Read the image; as_gray=True loads it as grayscale
    feature = hog( # extract HOG features
        image, # the image
        orientations=9, # number of orientations, i.e. the number of bins B
        pixels_per_cell=(8, 8), # cell size, C x C
        cells_per_block=(2, 2), # each block has 2x2 cells
        block_norm='L2-Hys', # normalization method
        visualize=False # whether to also return a visualization image
    )
    return feature

if __name__ == '__main__':
    feature = extract_hog_feature('hog_test.png')
    print(feature) # show the HOG features
    print(feature.shape) # show the dimension of the HOG features

The output is:

[0.24284172 0.24284172 0.21779826 ... 0.1942068  0.25568547 0.10666346]
(7524,)

In other words, the HOG feature we obtained is a 7524-dimensional vector; indeed $7524=4\times 9\times\left(\frac{96}{8}-1\right)\times\left(\frac{160}{8}-1\right)$.
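A quick sanity check of that count in plain arithmetic (my addition, same formula as above):

# 96x160 window, C = 8, B = 9: 11 x 19 block positions, 36 values per block
print(4 * 9 * (96 // 8 - 1) * (160 // 8 - 1))  # 7524, matching feature.shape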

If we set visualize to True, the hog function returns a tuple of the HOG features and a visualization image. Calling

import matplotlib.pyplot as plt
...
feature, visimg = hog(...)
plt.imshow(visimg)
plt.show()

produces

HOG features

This is the visualization of the HOG features of hog_test.png.

6. Summary

Now we can turn a picture into a vector via its HOG features; next we classify these vectors with an SVM.

3. Training and testing of the support vector machine model

The article on the basics of Support Vector Machines (SVM) introduced the underlying theory; here I mainly cover the Python implementation.

1. Processing the dataset

Training dataset

Logically, the positive training samples (pictures containing people) should sit in the INRIAPerson\train_64x128_H96\pos folder and the negative samples (pictures without people) in INRIAPerson\train_64x128_H96\neg, with all pictures of size $64\times 128$. However, this is not the case:

Pitfall 3: INRIAPerson\train_64x128_H96\pos contains no pictures, only soft links, and INRIAPerson\train_64x128_H96\neg is itself a soft link. Moreover, pos links to the files in INRIAPerson\96X160H96\Train\pos, which are all $96\times 160$, while neg links to INRIAPerson\Train\neg, whose picture sizes vary.
Solution:

  • Use INRIAPerson\96X160H96\Train\pos as the positive-sample folder. Its images are $96\times 160$ because a 16-pixel border of padding surrounds each image, so when reading a picture we must crop the central $64\times 128$ part, i.e. from $(16,16)$ to $(80,144)$.
  • Use INRIAPerson\Train\neg as the negative-sample folder, and from each picture randomly crop 10 regions of size $64\times 128$.

Test dataset

In the same way, use INRIAPerson\70X134H96\Test\pos as the positive test folder and INRIAPerson\Test\neg as the negative test folder.

2. Read pictures and extract HOG features

Because of the large amount of data, taking the cell size $C$ to be 8 makes training extremely slow, so here I take $C=16$.

For positive samples, we crop the central $64\times 128$ part and extract its HOG features. For negative samples, we randomly crop 10 regions of size $64\times 128$ from each picture and extract their HOG features. We end up with two lists, x and y, where x holds the HOG features (the training data) and y holds the labels: y[i] = 1 if x[i] comes from a positive sample, otherwise y[i] = 0. The code is as follows:

import random
import os
from tqdm import tqdm # shows a progress bar around loops
from skimage.io import imread # image-reading function
from skimage.feature import hog # skimage's built-in HOG extractor

def clip_image(img, left, top,
    width=64, height=128):
    '''
    Crop a region out of an image.

    Parameters
    ---
    img: the input image.
    left: x-coordinate of the region's left edge.
    top: y-coordinate of the region's top edge.
    width: region width.
    height: region height.
    '''
    return img[top:top + height, left:left + width]

def extract_hog_feature(img):
    '''
    Extract the HOG features of a single image img.
    '''
    return hog(
        img,
        orientations=9,
        pixels_per_cell=(16, 16),
        cells_per_block=(2, 2),
        block_norm='L2-Hys',
        visualize=False
    ).astype('float32')

def read_images(pos_dir, neg_dir,
    neg_area_count, description):
    '''
    Read the images and extract the samples' HOG features.

    Parameters
    ---
    pos_dir: folder containing the positive samples.
    neg_dir: folder containing the negative samples.
    neg_area_count: number of regions to crop at random
    from each negative sample.
    description: purpose description (training/testing).

    Returns
    -----
    A tuple (x, y): x holds the HOG features of all images,
    y holds their classes (1 = positive, 0 = negative).
    '''
    pos_img_files = os.listdir(pos_dir)
    # list of positive-sample files
    neg_img_files = os.listdir(neg_dir)
    # list of negative-sample files

    area_width = 64 # width of the cropped region
    area_height = 128 # height of the cropped region

    x = [] # HOG features of the images
    y = [] # classes of the images

    for pos_file in tqdm(pos_img_files,
        desc=f'{description} positive samples'):
        # read all positive samples
        pos_path = os.path.join(pos_dir, pos_file)
        # path of the positive sample
        pos_img = imread(pos_path, as_gray=True)
        # the positive-sample image
        img_height, img_width = pos_img.shape
        # height and width of this image
        clip_left = (img_width - area_width) // 2
        # left edge of the cropped region
        clip_top = (img_height - area_height) // 2
        # top edge of the cropped region
        pos_center = clip_image(pos_img,
            clip_left, clip_top, area_width, area_height)
        # crop the central part
        hog_feature = extract_hog_feature(
            pos_center) # extract HOG features
        x.append(hog_feature) # append the HOG vector
        y.append(1) # 1 denotes the positive class

    for neg_file in tqdm(neg_img_files,
        desc=f'{description} negative samples'):
        # read all negative samples
        neg_path = os.path.join(neg_dir, neg_file)
        # path of the negative sample
        neg_img = imread(neg_path, as_gray=True)
        # the negative-sample image
        img_height, img_width = neg_img.shape
        # height and width of this image
        left_max = img_width - area_width
        # maximum x-coordinate of a region's left edge
        top_max = img_height - area_height
        # maximum y-coordinate of a region's top edge
        for _ in range(neg_area_count):
            # crop neg_area_count regions at random
            left = random.randint(0, left_max) # region's left edge
            top = random.randint(0, top_max) # region's top edge
            clipped_area = clip_image(neg_img,
                left, top, area_width, area_height)
            # the cropped region
            hog_feature = extract_hog_feature(
                clipped_area) # extract HOG features
            x.append(hog_feature)
            y.append(0)
    return x, y

The read_images function above can read both the training data and the test data.

Pitfall 4: skimage.io.imread returns images as height × width, not width × height.
Solution: when cropping, remember that the first index is the vertical coordinate and the second the horizontal coordinate.

The following two functions, read_training_data and read_test_data, call read_images to read the training and test data respectively (again returning HOG features and labels). For both training and test data, 10 regions are randomly cropped from each negative sample.

def read_training_data():
    '''
    Read the training data.
    '''
    pos_dir = 'INRIAPerson/96X160H96/Train/pos'
    neg_dir = 'INRIAPerson/Train/neg'
    neg_area_count = 10
    description = 'training'
    return read_images(pos_dir, neg_dir,
        neg_area_count, description)

def read_test_data():
    '''
    Read the test data.
    '''
    pos_dir = 'INRIAPerson/70X134H96/Test/pos'
    neg_dir = 'INRIAPerson/Test/neg'
    neg_area_count = 10
    description = 'testing'
    return read_images(pos_dir, neg_dir,
        neg_area_count, description)

If we train multiple times, we don't have to re-extract HOG features on every run: we can save the computed features to a file and load them back when training the SVM. The save and load functions are as follows:

import pickle

def save_hog(x, y, filename):
    '''
    Write the return value (x, y) of read_images
    to the file named filename.
    '''
    with open(filename, 'wb') as file:
        pickle.dump((x, y), file)

def load_hog(filename):
    '''
    Load the training data (x, y) from the file named filename.
    '''
    result = None
    with open(filename, 'rb') as file:
        result = pickle.load(file)
    return result
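One possible way to wire this cache together (my sketch; it assumes the functions defined above and reuses the hog_xy.pickle file name that also appears in the complete program below):

import os

if os.path.exists('hog_xy.pickle'):
    x, y = load_hog('hog_xy.pickle')  # reuse cached HOG features
else:
    x, y = read_training_data()       # slow: reads and processes every image
    save_hog(x, y, 'hog_xy.pickle')   # cache for the next run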

3. Training SVM

After reading the training data we can train the SVM. For an introduction to the theory, see the article on the basics of Support Vector Machines (SVM); here we only need to call the library.

The SVM classifier class in sklearn is sklearn.svm.SVC:

from sklearn.svm import SVC

Several SVC parameters deserve our attention.

  • tol: the tolerance. The SVM optimization problem has the condition that for each $i$ $(1\le i\le n)$, $y_i\left(\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i+b\right)\ge 1$. For data that are not linearly separable this cannot always be satisfied, so the condition is loosened to $y_i\left(\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i+b\right)\ge 1-\varepsilon$, where $\varepsilon$ is tol. The smaller tol, the stricter the condition. sklearn's default tol is 1e-3; I use 1e-6 to make the classification a little stricter.
  • C: the penalty coefficient. For samples that are not linearly separable we must allow some samples to violate the constraints, so we introduce slack variables $\xi_i$: the constraint for the $i$-th sample becomes $y_i\left(\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i+b\right)\ge 1-\varepsilon-\xi_i$ with $\xi_i\ge 0$. The optimization objective changes accordingly: since we want as few violations as possible, it becomes $\min\limits_{\boldsymbol{w},b}\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum\limits_{i=1}^n\xi_i$, where $C$ is the penalty coefficient. It reflects how strongly violations are punished: the larger $C$, the lower the tolerance for violations. According to Dalal's paper (reference [6]), $C=0.01$ gave good results, so I also choose C=0.01.
  • max_iter: the maximum number of iterations. We want the SVM to iterate until it converges, so we set it to -1, meaning no limit.
  • gamma: the parameter of the Gaussian kernel; we set it to 'auto', i.e. let the library choose the value automatically.
  • kernel: the kernel function. According to Dalal's paper, a Gaussian kernel improves recognition accuracy somewhat at the cost of speed (compared with a linear kernel), so I use the Gaussian kernel (kernel='rbf').
  • probability: whether to output probabilities. For pedestrian detection we want the probability that a region of the picture contains a pedestrian, so we set probability=True.

The code for training the SVM is as follows:

def train_SVM(x, y):
    '''
    Train the SVM.

    Parameters
    ---
    x, y: the return value of read_training_data.

    Returns
    -----
    The trained SVM.
    '''
    SVM = SVC(
        tol=1e-6,
        C=0.01,
        max_iter=-1,
        gamma='auto',
        kernel='rbf',
        probability=True
    ) # create the SVM instance
    SVM.fit(x, y) # train
    return SVM

The SVM can also be replaced with a logistic regression model: just replace SVM = SVC(...) with LR = sklearn.linear_model.LogisticRegression(tol=1e-6, C=0.01, max_iter=10000).
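For reference, a sketch of that logistic-regression variant (train_LR is my name for it, mirroring train_SVM above):

from sklearn.linear_model import LogisticRegression

def train_LR(x, y):
    '''
    Drop-in alternative to train_SVM; predict_proba is available
    without the extra cost of SVC's probability=True.
    '''
    LR = LogisticRegression(tol=1e-6, C=0.01, max_iter=10000)
    LR.fit(x, y)
    return LR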

4. Testing the SVM

After training, we test the SVM on the test data. For a given sample, the SVM's prediction may be right or wrong. If the SVM says a positive sample contains a person, the sample is a true positive (TP); if it says a positive sample contains no person, a false negative (FN); if it says a negative sample contains a person, a false positive (FP); and if it says a negative sample contains no person, a true negative (TN). True positives and true negatives are correct predictions; false positives and false negatives are wrong ones. Summarized as a table:

True value \ SVM prediction    Positive                Negative
Positive                       True positive (TP)      False negative (FN)
Negative                       False positive (FP)     True negative (TN)

The number of positive samples (samples whose true value is positive) is TP+FN, and the number of negative samples is FP+TN.

Define the recall as the proportion of positive samples that are predicted positive, i.e. Recall = TP/(TP+FN); the precision is the proportion of samples the SVM calls positive that really are positive, i.e. Precision = TP/(TP+FP). Recall is also called the true positive rate (TPR); correspondingly there is the false positive rate (FPR), the proportion of negative samples that are predicted positive, i.e. FPR = FP/(FP+TN). Define the miss rate (MR) as the proportion of positive samples that go undetected, i.e. MR = 1 - Recall = FN/(TP+FN).
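These four definitions translate directly into code (a small helper of mine, computed from raw confusion-matrix counts):

def rates(TP, FN, FP, TN):
    """The quantities defined above, from confusion-matrix counts."""
    recall = TP / (TP + FN)     # a.k.a. true positive rate (TPR)
    precision = TP / (TP + FP)
    fpr = FP / (FP + TN)        # false positive rate
    miss = FN / (TP + FN)       # miss rate = 1 - recall
    return recall, precision, fpr, miss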

The SVM outputs the probability that a region of the picture contains a pedestrian. We define a threshold: if the probability given by the SVM is at least the threshold, the region is considered to contain a pedestrian (prediction positive); if it is below the threshold, it is considered not to (prediction negative). Different thresholds lead to different predictions. When the threshold is close to 1, the conditions for a positive prediction are extremely harsh: true positives and false positives both decrease, while false negatives and true negatives increase. Conversely, when the threshold is close to 0, the SVM calls many samples pedestrians: true positives and false positives increase, false negatives and true negatives decrease. Since TP+FN and FP+TN stay constant, a larger threshold means a larger miss rate (FN grows) and a smaller FPR (FP shrinks). Thus each threshold yields one miss rate and one FPR; sweeping the threshold over the distinct probabilities the SVM assigns to the test samples, we can draw a Miss Rate-False Positive Rate curve:

Miss Rate-False Positive Rate curve

The closer the curve hugs the axes (i.e., the smaller the area under it), the more reliable the model. Both the miss rate and the false positive rate are quantities we want to reduce, so for any given miss rate a smaller false positive rate is better; in other words, the smaller the area under the curve, the better.

We test the SVM with the test_SVM function. It first calls SVM.predict_proba to compute the probability that each test sample contains a pedestrian, then computes the miss rate and false positive rate at every threshold and draws the Miss Rate-False Positive Rate curve. Finally, it returns the area under the ROC curve (AUC, for Receiver Operating Characteristic curve); the closer this value is to 1, the more reliable the model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

def test_SVM(SVM, test_data, show_stats=False):
    '''
    Test the trained SVM.

    Parameters
    ---
    SVM: the trained SVM model.
    test_data: the test data (return value of read_test_data).
    show_stats: whether to show statistics (the miss rate vs.
    false positive rate curve).

    Returns
    -----
    The AUC (area under the ROC curve), between 0.5 and 1.
    The closer the AUC is to 1, the more reliable the model.
    '''
    hog_features = test_data[0] # HOG features of the test data
    labels = np.asarray(test_data[1])
        # labels (0 = not a person, 1 = person), as an array
        # so it can be indexed by another array below
    prob = SVM.predict_proba(hog_features)[:, 1]
    if show_stats:
        # sort prob and labels by descending prob
        sorted_indices = np.argsort(
            prob, kind="mergesort")[::-1]
        labels = labels[sorted_indices]
        prob = prob[sorted_indices]
        distinct_value_indices = np.where(np.diff(prob))[0]
            # indices where each distinct value of prob first appears
        threshold_idxs = np.r_[
            distinct_value_indices, labels.size - 1]
            # threshold indices, with the last sample's index appended
        tps = np.cumsum(labels)[threshold_idxs]
            # number of true positives at each probability threshold.
            # prob is now sorted in descending order, so this works:
            # probabilities before a given position are above the
            # threshold and those after are below, hence the true
            # positive count is the number of positives before it.
        fps = 1 + threshold_idxs - tps
            # number of false positives at each threshold.
            # threshold_idxs stores indices; adding one turns
            # them into counts, and subtracting the true
            # positives leaves the false positives.
        num_positive = tps[-1]
            # the last entry of tps is the sum of labels,
            # i.e. the number of positive samples.
        recall = tps / num_positive
            # recall: fraction of all positives that were found
        miss = 1 - recall # miss rate
        num_negative = fps[-1] # number of negative samples
        fpr = fps / num_negative
            # false positive rate
        plt.plot(fpr, miss, color='red')
            # FPR on the x-axis, miss rate on the y-axis,
            # matching the axis labels below
        plt.xlabel('False Positive Rate')
        plt.ylabel('Miss Rate')
        plt.title('Miss Rate - '
            'False Positive Rate Curve')
        plt.show()
    AUC = metrics.roc_auc_score(labels, prob)
    return AUC

5. Boxing pedestrians in an image

Next we introduce how to draw boxes around the pedestrians in an image. The main idea is sliding windows: slide windows of different sizes across the image with a chosen step, computing the HOG features of the window contents each time. The algorithm has three nested loops:

  • The first level enumerates the window width. The aspect ratio is fixed (2:1); the width starts at min_width (48 by default) and is multiplied by width_scale (1.25 by default) each round, stopping when it exceeds the image width. For better recognition you can lower min_width and width_scale, at the cost of slower recognition.
  • The second level enumerates the x-coordinate of the window's left edge, starting at 0 and increasing by coord_step (16 by default) until the right edge reaches the image boundary. Lowering coord_step likewise improves recognition at the cost of speed.
  • The third level enumerates the y-coordinate of the window's top edge, starting at 0 and increasing by coord_step.
  • Each window is scaled to area_width × area_height ($64\times 128$ by default) and its HOG features are extracted. (The loop structure is sketched below.)
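A compact sketch of just the loop structure (my own generator; detect_pedestrian below implements the same loops inline):

def sliding_windows(img_width, img_height, min_width=48,
                    width_scale=1.25, coord_step=16, ratio=2):
    """Yield (left, top, width, height) for every window
    the three loops above would visit."""
    width = min_width
    height = int(width * ratio)
    while width < img_width and height < img_height:
        for left in range(0, img_width - width, coord_step):
            for top in range(0, img_height - height, coord_step):
                yield (left, top, width, height)
        width = int(width * width_scale)   # grow the window
        height = int(width * ratio)        # keep the 2:1 aspect ratio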

Then the SVM assigns each window's HOG features a probability of containing a pedestrian; when the probability exceeds the threshold threshold (0.99 by default), the window is considered to contain one. One problem remains: a single pedestrian may be boxed by several windows, and we need to select the most suitable box among them. This calls for Non-Maximum Suppression (NMS).

NMS example, picture from https://zhuanlan.zhihu.com/p/78504109

The basic idea of NMS: of two boxes that overlap heavily, discard the one with the lower probability of containing a pedestrian and keep the higher one. How do we measure overlap? With the Intersection over Union (IoU), the ratio of the two boxes' intersection area to their union area. The larger the IoU, the greater the overlap. When the IoU is at least the threshold IoU_threshold, one of the boxes is discarded.

IoU, picture from https://zhuanlan.zhihu.com/p/78504109

The union area is simply the sum of the two boxes' areas minus their intersection area (inclusion-exclusion). The code is as follows:

def area_of_box(box):
    '''
    Compute the area of a box.

    Parameters
    ---
    box: a box in the format (left, top, width, height).

    Returns
    -----
    The area of the box, i.e. width * height.
    '''
    return box[2] * box[3]

def intersection_over_union(box1, box2):
    '''
    Intersection over union (IoU) of two boxes.

    Parameters
    ---
    box1: the first box.
    box2: the second box.
    '''
    intersection_width = max(0,
        min(box1[0] + box1[2], box2[0] + box2[2])
        - max(box1[0], box2[0]))
        # overlap width = max(0, min of right edges - max of left edges)
    intersection_height = max(0,
        min(box1[1] + box1[3], box2[1] + box2[3])
        - max(box1[1], box2[1]))
        # overlap height = max(0, min of bottom edges - max of top edges)
    intersection_area = intersection_width * \
        intersection_height # intersection area
    area_box1 = area_of_box(box1) # area of box1
    area_box2 = area_of_box(box2) # area of box2
    union_area = area_box1 + area_box2 - \
        intersection_area
        # union area = sum of the areas minus the intersection
    if abs(union_area) < 1:
        IoU = 0 # avoid division by 0
    else:
        IoU = intersection_area / union_area
    return IoU

The main flow of NMS: traverse the boxes, and add a box to the result list only if no other box suppresses it. Finally return the result list. The code is as follows:

def non_maximum_suppression(pos_box_list, pos_prob,
    IoU_threshold=0.4):
    '''
    Non-maximum suppression (NMS).

    Parameters
    ---
    pos_box_list: list of boxes whose probability of
    containing a person exceeds the threshold.
    pos_prob: the corresponding probabilities.
    IoU_threshold: IoU threshold above which a box is discarded.

    Returns
    -----
    The list of boxes after suppression.
    '''
    result = [] # the result
    for box1, prob1 in zip(pos_box_list, pos_prob):
        discard = False # whether to discard box1
        for box2, prob2 in zip(
            pos_box_list, pos_prob):
            if intersection_over_union(
                box1, box2) > IoU_threshold:
                # IoU exceeds the threshold
                if prob2 > prob1: # discard the lower-confidence box
                    discard = True
                    break
        if not discard: # box1 was not discarded
            result.append(box1) # add it to the result list
    return result

Finally, the code that detects and boxes pedestrians in a single image:

import numpy as np
from cv2 import rectangle, imshow, waitKey
from skimage.io import imread
from skimage.transform import resize

def detect_pedestrian(SVM, filename, show_img=False,
    threshold=0.99, area_width=64, area_height=128,
    min_width=48, width_scale=1.25, coord_step=16,
    ratio=2):
    '''
    Detect pedestrians in the file filename with the SVM,
    using non-maximum suppression (NMS) to avoid
    duplicate boxes.

    Parameters
    ---
    SVM: the trained SVM model.
    filename: the input file name.
    show_img: whether to show the user the image with boxes drawn.
    threshold: probability threshold for treating a region as a person.
    area_width: width of a region after scaling.
    area_height: height of a region after scaling.
    min_width: minimum (and initial) box width.
    width_scale: factor by which the box width grows each round.
    coord_step: step size of the coordinates.
    ratio: height-to-width ratio of the box.

    Returns
    -----
    A list whose items are tuples
    (left, top, width, height), the pedestrians' boxes.
    '''
    box_list = [] # list of candidate boxes
    hog_list = [] # list of HOG features
    with open(filename, 'rb') as file:
        img = imread(file, as_gray=True) # read the file
        img_height, img_width = img.shape # image height and width
        width = min_width # box width
        height = int(width * ratio) # box height
        while width < img_width and height < img_height:
            for left in range(0, img_width - width,
                coord_step): # box's left edge
                for top in range(0, img_height - height,
                    coord_step): # box's top edge
                    patch = clip_image(img, left, top,
                        width, height) # crop part of the image
                    resized = resize(patch,
                        (area_height, area_width))
                        # scale the patch
                    hog_feature = extract_hog_feature(
                        resized) # extract HOG features
                    box_list.append((left, top,
                        width, height))
                    hog_list.append(hog_feature)
            width = int(width * width_scale)
            height = width * ratio
        prob = SVM.predict_proba(hog_list)[:, 1]
            # let the SVM model judge each window
        mask = (prob >= threshold)
            # boolean array; mask[i] says whether
            # prob[i] reaches the threshold
        pos_box_list = np.array(box_list)[mask]
            # boxes considered to contain a person
        pos_prob = prob[mask] # the corresponding probabilities
        box_list_after_NMS = non_maximum_suppression(
            pos_box_list, pos_prob)
            # box list after NMS
        if show_img:
            shown_img = np.array(img)
                # copy the original image before drawing boxes
            for box in box_list_after_NMS:
                shown_img = rectangle(shown_img,
                    pt1=(box[0], box[1]),
                    pt2=(box[0] + box[2],
                        box[1] + box[3]),
                    color=(0, 0, 0),
                    thickness=2)
            imshow('', shown_img)
            waitKey(0)
        return box_list_after_NMS

4. Complete code

# encoding: UTF-8
# File: hog_svm.py
# Author: seh_sjij

import numpy as np
import time
import random
import os
import pickle
import joblib
from tqdm import tqdm
from cv2 import rectangle, imshow, waitKey
from skimage.io import imread
from skimage.feature import hog
from skimage.transform import resize
from sklearn import metrics
from sklearn.svm import SVC
import matplotlib.pyplot as plt

def clip_image(img, left, top,
    width=64, height=128):
    '''
    Crop a region out of an image.

    Parameters
    ---
    img: the input image.
    left: x-coordinate of the region's left edge.
    top: y-coordinate of the region's top edge.
    width: region width.
    height: region height.
    '''
    return img[top:top + height, left:left + width]

def extract_hog_feature(img):
    '''
    Extract the HOG features of a single image img.
    '''
    return hog(
        img,
        orientations=9,
        pixels_per_cell=(16, 16),
        cells_per_block=(2, 2),
        block_norm='L2-Hys',
        visualize=False
    ).astype('float32')

def read_images(pos_dir, neg_dir,
    neg_area_count, description):
    '''
    Read the images and extract the samples' HOG features.

    Parameters
    ---
    pos_dir: folder containing the positive samples.
    neg_dir: folder containing the negative samples.
    neg_area_count: number of regions to crop at random
    from each negative sample.
    description: purpose description (training/testing).

    Returns
    -----
    A tuple (x, y): x holds the HOG features of all images,
    y holds their classes (1 = positive, 0 = negative).
    '''
    pos_img_files = os.listdir(pos_dir)
    # list of positive-sample files
    neg_img_files = os.listdir(neg_dir)
    # list of negative-sample files

    area_width = 64 # width of the cropped region
    area_height = 128 # height of the cropped region

    x = [] # HOG features of the images
    y = [] # classes of the images

    for pos_file in tqdm(pos_img_files,
        desc=f'{description} positive samples'):
        # read all positive samples
        pos_path = os.path.join(pos_dir, pos_file)
        # path of the positive sample
        pos_img = imread(pos_path, as_gray=True)
        # the positive-sample image
        img_height, img_width = pos_img.shape
        # height and width of this image
        clip_left = (img_width - area_width) // 2
        # left edge of the cropped region
        clip_top = (img_height - area_height) // 2
        # top edge of the cropped region
        pos_center = clip_image(pos_img,
            clip_left, clip_top, area_width, area_height)
        # crop the central part
        hog_feature = extract_hog_feature(
            pos_center) # extract HOG features
        x.append(hog_feature) # append the HOG vector
        y.append(1) # 1 denotes the positive class

    for neg_file in tqdm(neg_img_files,
        desc=f'{description} negative samples'):
        # read all negative samples
        neg_path = os.path.join(neg_dir, neg_file)
        # path of the negative sample
        neg_img = imread(neg_path, as_gray=True)
        # the negative-sample image
        img_height, img_width = neg_img.shape
        # height and width of this image
        left_max = img_width - area_width
        # maximum x-coordinate of a region's left edge
        top_max = img_height - area_height
        # maximum y-coordinate of a region's top edge
        for _ in range(neg_area_count):
            # crop neg_area_count regions at random
            left = random.randint(0, left_max) # region's left edge
            top = random.randint(0, top_max) # region's top edge
            clipped_area = clip_image(neg_img,
                left, top, area_width, area_height)
            # the cropped region
            hog_feature = extract_hog_feature(
                clipped_area) # extract HOG features
            x.append(hog_feature)
            y.append(0)
    return x, y

def read_training_data():
    '''
    Read the training data.
    '''
    pos_dir = 'INRIAPerson/96X160H96/Train/pos'
    neg_dir = 'INRIAPerson/Train/neg'
    neg_area_count = 10
    description = 'training'
    return read_images(pos_dir, neg_dir,
        neg_area_count, description)

def read_test_data():
    '''
    Read the test data.
    '''
    pos_dir = 'INRIAPerson/70X134H96/Test/pos'
    neg_dir = 'INRIAPerson/Test/neg'
    neg_area_count = 10
    description = 'testing'
    return read_images(pos_dir, neg_dir,
        neg_area_count, description)

def save_hog(x, y, filename):
    '''
    Write the return value (x, y) of read_images
    to the file named filename.
    '''
    with open(filename, 'wb') as file:
        pickle.dump((x, y), file)

def load_hog(filename):
    '''
    Load the training data (x, y) from the file named filename.
    '''
    result = None
    with open(filename, 'rb') as file:
        result = pickle.load(file)
    return result

def train_SVM(x, y):
    '''
    Train the SVM.

    Parameters
    ---
    x, y: the return value of read_training_data.

    Returns
    -----
    The trained SVM.
    '''
    SVM = SVC(
        tol=1e-6,
        C=0.01,
        max_iter=-1,
        gamma='auto',
        kernel='rbf',
        probability=True
    ) # create the SVM instance
    SVM.fit(x, y) # train
    return SVM

def test_SVM(SVM, test_data, show_stats=False):
    '''
    Test the trained SVM.

    Parameters
    ---
    SVM: the trained SVM model.
    test_data: the test data (return value of read_test_data).
    show_stats: whether to show statistics (the miss rate vs.
    false positive rate curve).

    Returns
    -----
    The AUC (area under the ROC curve), between 0.5 and 1.
    The closer the AUC is to 1, the more reliable the model.
    '''
    hog_features = test_data[0] # HOG features of the test data
    labels = np.asarray(test_data[1])
        # labels (0 = not a person, 1 = person), as an array
        # so it can be indexed by another array below
    prob = SVM.predict_proba(hog_features)[:, 1]
    if show_stats:
        # sort prob and labels by descending prob
        sorted_indices = np.argsort(
            prob, kind="mergesort")[::-1]
        labels = labels[sorted_indices]
        prob = prob[sorted_indices]
        distinct_value_indices = np.where(np.diff(prob))[0]
            # indices where each distinct value of prob first appears
        threshold_idxs = np.r_[
            distinct_value_indices, labels.size - 1]
            # threshold indices, with the last sample's index appended
        tps = np.cumsum(labels)[threshold_idxs]
            # number of true positives at each probability threshold.
            # prob is now sorted in descending order, so this works:
            # probabilities before a given position are above the
            # threshold and those after are below, hence the true
            # positive count is the number of positives before it.
        fps = 1 + threshold_idxs - tps
            # number of false positives at each threshold.
            # threshold_idxs stores indices; adding one turns
            # them into counts, and subtracting the true
            # positives leaves the false positives.
        num_positive = tps[-1]
            # the last entry of tps is the sum of labels,
            # i.e. the number of positive samples.
        recall = tps / num_positive
            # recall: fraction of all positives that were found
        miss = 1 - recall # miss rate
        num_negative = fps[-1] # number of negative samples
        fpr = fps / num_negative
            # false positive rate
        plt.plot(fpr, miss, color='red')
            # FPR on the x-axis, miss rate on the y-axis,
            # matching the axis labels below
        plt.xlabel('False Positive Rate')
        plt.ylabel('Miss Rate')
        plt.title('Miss Rate - '
            'False Positive Rate Curve')
        plt.show()
    AUC = metrics.roc_auc_score(labels, prob)
    return AUC

def area_of_box(box):
    '''
    Compute the area of a box.

    Parameters
    ---
    box: a box in the format (left, top, width, height).

    Returns
    -----
    The area of the box, i.e. width * height.
    '''
    return box[2] * box[3]

def intersection_over_union(box1, box2):
    '''
    Intersection over union (IoU) of two boxes.

    Parameters
    ---
    box1: the first box.
    box2: the second box.
    '''
    intersection_width = max(0,
        min(box1[0] + box1[2], box2[0] + box2[2])
        - max(box1[0], box2[0]))
        # overlap width = max(0, min of right edges - max of left edges)
    intersection_height = max(0,
        min(box1[1] + box1[3], box2[1] + box2[3])
        - max(box1[1], box2[1]))
        # overlap height = max(0, min of bottom edges - max of top edges)
    intersection_area = intersection_width * \
        intersection_height # intersection area
    area_box1 = area_of_box(box1) # area of box1
    area_box2 = area_of_box(box2) # area of box2
    union_area = area_box1 + area_box2 - \
        intersection_area
        # union area = sum of the areas minus the intersection
    if abs(union_area) < 1:
        IoU = 0 # avoid division by 0
    else:
        IoU = intersection_area / union_area
    return IoU

def non_maximum_suppression(pos_box_list, pos_prob,
    IoU_threshold=0.4):
    '''
    Non-maximum suppression (NMS).

    Parameters
    ---
    pos_box_list: list of boxes whose probability of
    containing a person exceeds the threshold.
    pos_prob: the corresponding probabilities.
    IoU_threshold: IoU threshold above which a box is discarded.

    Returns
    -----
    The list of boxes after suppression.
    '''
    result = [] # the result
    for box1, prob1 in zip(pos_box_list, pos_prob):
        discard = False # whether to discard box1
        for box2, prob2 in zip(
            pos_box_list, pos_prob):
            if intersection_over_union(
                box1, box2) > IoU_threshold:
                # IoU exceeds the threshold
                if prob2 > prob1: # discard the lower-confidence box
                    discard = True
                    break
        if not discard: # box1 was not discarded
            result.append(box1) # add it to the result list
    return result

def detect_pedestrian(SVM, filename, show_img=False,
    threshold=0.99, area_width=64, area_height=128,
    min_width=48, width_scale=1.25, coord_step=16,
    ratio=2):
    '''
    Detect pedestrians in the file filename with the SVM,
    using non-maximum suppression (NMS) to avoid
    duplicate boxes.

    Parameters
    ---
    SVM: the trained SVM model.
    filename: the input file name.
    show_img: whether to show the user the image with boxes drawn.
    threshold: probability threshold for treating a region as a person.
    area_width: width of a region after scaling.
    area_height: height of a region after scaling.
    min_width: minimum (and initial) box width.
    width_scale: factor by which the box width grows each round.
    coord_step: step size of the coordinates.
    ratio: height-to-width ratio of the box.

    Returns
    -----
    A list whose items are tuples
    (left, top, width, height), the pedestrians' boxes.
    '''
    box_list = [] # list of candidate boxes
    hog_list = [] # list of HOG features
    with open(filename, 'rb') as file:
        img = imread(file, as_gray=True) # read the file
        img_height, img_width = img.shape # image height and width
        width = min_width # box width
        height = int(width * ratio) # box height
        while width < img_width and height < img_height:
            for left in range(0, img_width - width,
                coord_step): # box's left edge
                for top in range(0, img_height - height,
                    coord_step): # box's top edge
                    patch = clip_image(img, left, top,
                        width, height) # crop part of the image
                    resized = resize(patch,
                        (area_height, area_width))
                        # scale the patch
                    hog_feature = extract_hog_feature(
                        resized) # extract HOG features
                    box_list.append((left, top,
                        width, height))
                    hog_list.append(hog_feature)
            width = int(width * width_scale)
            height = width * ratio
        prob = SVM.predict_proba(hog_list)[:, 1]
            # let the SVM model judge each window
        mask = (prob >= threshold)
            # boolean array; mask[i] says whether
            # prob[i] reaches the threshold
        pos_box_list = np.array(box_list)[mask]
            # boxes considered to contain a person
        pos_prob = prob[mask] # the corresponding probabilities
        box_list_after_NMS = non_maximum_suppression(
            pos_box_list, pos_prob)
            # box list after NMS
        if show_img:
            shown_img = np.array(img)
                # copy the original image before drawing boxes
            for box in box_list_after_NMS:
                shown_img = rectangle(shown_img,
                    pt1=(box[0], box[1]),
                    pt2=(box[0] + box[2],
                        box[1] + box[3]),
                    color=(0, 0, 0),
                    thickness=2)
            imshow('', shown_img)
            waitKey(0)
        return box_list_after_NMS

def detect_multiple_images(SVM, dir):
    '''
    Detect pedestrians in multiple image files
    (every file in the folder dir).

    Parameters
    ---
    SVM: the trained SVM model.
    dir: the folder containing the images.
    '''
    files = os.listdir(dir)
    for file in files:
        file_path = os.path.join(dir, file)
        detect_pedestrian(SVM, file_path,
            show_img=True)

if __name__ == '__main__':
    print('execution starts')

    random.seed(time.time()) # seed the random number generator
    x, y = read_training_data() # read training data, extract HOG features
    save_hog(x, y, 'hog_xy.pickle')
    print('training data hog extraction done')

    test_data = read_test_data() # read test data, extract HOG features
    save_hog(*test_data, 'test_data_hog.pickle')
    print('test data hog extraction done')

    x, y = load_hog('hog_xy.pickle') # train the SVM model
    time_before_training = time.time()
    SVM = train_SVM(x, y)
    time_after_training = time.time()
    print('SVM training done, cost %.2fs.' % \
        (time_after_training - time_before_training))
    joblib.dump(SVM, 'SVM.model', compress=9)

    SVM = joblib.load('SVM.model') # test the SVM model
    test_data = load_hog('test_data_hog.pickle')
    print('AUC=%.8f.' % test_SVM(SVM, test_data, True))

    detect_multiple_images(SVM, # detect pedestrians with the SVM model
        'INRIAPerson/Test/pos')

5. Results

The Miss Rate-False Positive Rate curve I got is as follows:

Miss Rate-False Positive Rate curve

The AUC of the ROC curve is 0.99204004. I then ran detection on some pictures:

Picture 1

picture 2
picture 3

The results can only be described as mediocre; this is probably close to the ceiling of the HOG+SVM approach. I found that it is particularly fond of classifying columnar objects (such as street-light poles and window frames) as people, probably because their HOG features are rather similar.

6. Summary

That is the whole of HOG+SVM pedestrian detection. I can't guarantee this implementation is optimal; tuning some parameters may further improve the SVM's performance. Feel free to explore on your own in practice~

References

  1. https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
  2. https://learnopencv.com/histogram-of-oriented-gradients/
  3. https://courses.cs.duke.edu/fall15/compsci527/notes/hog.pdf
  4. https://baike.baidu.com/item/HOG/9738560
  5. https://blog.csdn.net/jingyu_1/article/details/124217455
  6. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886-893 vol. 1.
  7. https://zhuanlan.zhihu.com/p/594165143
  8. https://zhuanlan.zhihu.com/p/27202924
  9. https://zhuanlan.zhihu.com/p/78504109
