Task: using the INRIA Person dataset, extract HOG features and train an SVM to detect pedestrians in images.
This article walks through the steps in detail and points out the pitfalls along the way.
1. Preparation
1. Download the dataset
The INRIA dataset contains images of upright or walking people and was used by Navneet Dalal to train a human detector published in CVPR 2005.
Pitfall 1: The official website http://pascal.inrialpes.fr/data/human/ returns 403 Forbidden.
Solution: Use a download tool that supports FTP, such as Motrix, IDM, or Xunlei, to download ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar (the archive is about 1 GB).
2. Extract the dataset
Pitfall 2: Extracting the archive with WinRAR or 7-Zip runs into problems such as prompts to overwrite files and requests for administrator privileges.
Solution: The archive contains symbolic links, so WinRAR and 7-Zip cannot extract it correctly; use the Linux tar command instead. If you have a Linux-like environment such as WSL (Windows Subsystem for Linux) or MinGW installed, run the following command:
tar xvf INRIAPerson.tar
Here x stands for extract, v stands for verbose (print each file as it is extracted), and f stands for file (the archive name follows). The command creates a folder named INRIAPerson in the current directory, which contains the extracted files.
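To see which entries are the symbolic links that trip up WinRAR and 7-Zip, the archive can be inspected without extracting it. The following is a minimal sketch using Python's standard tarfile module; the helper name list_symlinks is hypothetical and not part of the original article.

```python
import tarfile


def list_symlinks(tar_path):
    """Return (name, target) pairs for every symbolic link stored in the archive."""
    with tarfile.open(tar_path) as archive:
        return [(member.name, member.linkname)
                for member in archive.getmembers()
                if member.issym()]
```

For example, list_symlinks('INRIAPerson.tar') would list the link entries, assuming the archive is in the current directory.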
2. Introduction to HOG features
HOG stands for Histogram of Oriented Gradients, a feature descriptor used for object detection in computer vision. The role of a feature descriptor is to extract useful information and discard redundant information. What characterizes an object is its shape, that is, its boundary. Since the gray level generally changes abruptly at a boundary, we can find where the boundaries are by examining the gradient of the image.
1. Gradient
First, we assume that the input image is a grayscale image (in practice we usually process only a part of the image, the window, rather than the entire image). It can be viewed as a function of two variables, the row $r$ and the column $c$: $I(r,c)$ is the gray level of the pixel in row $r$, column $c$ (ranging from $0$ to $255$). When studying a function of two variables we often consider its gradient, so here we need the gradient of $I$ in the $x$ and $y$ directions. We approximate it by the difference between the gray levels of neighboring pixels:
$$\begin{aligned} I_x(r,c)&=I(r,c+1)-I(r,c-1)\\ I_y(r,c)&=I(r+1,c)-I(r-1,c) \end{aligned}$$
Strictly speaking, the right-hand sides should be divided by $2$, but such constants do not matter because of the normalization performed later. Equivalently, we convolve the original image with the vectors $\begin{bmatrix}-1&0&1\end{bmatrix}$ and $\begin{bmatrix}-1\\0\\1\end{bmatrix}$. Next we convert the gradient to polar coordinates, with the angle restricted to $0\degree$ to $180\degree$:
$$\begin{aligned} \mu&=\sqrt{I_x^2+I_y^2}\\ \theta&=\frac{180}{\pi}\left(\arctan\frac{I_y}{I_x}\right) \end{aligned}$$
Here we define $\arctan$ as
$$\arctan x=\begin{cases} \tan^{-1}x,&x\ge 0\\ \tan^{-1}x+\pi,&x<0 \end{cases}$$
so that $\theta$ is expressed in degrees.
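The gradient formulas can be tried out on a tiny image. The sketch below is plain Python and the function name gradient_at is hypothetical. It assumes the image is a list of rows and that (r, c) is an interior pixel. Instead of the piecewise arctan it uses atan2 followed by a modulo, which gives the same angle whenever $I_x\ne 0$ and returns $90\degree$ for a purely vertical gradient.

```python
import math


def gradient_at(image, r, c):
    """Centered-difference gradient at the interior pixel (r, c).

    `image` is a list of rows of gray levels. Returns (magnitude, angle),
    with the angle in degrees folded into [0, 180).
    """
    i_x = image[r][c + 1] - image[r][c - 1]  # horizontal difference
    i_y = image[r + 1][c] - image[r - 1][c]  # vertical difference
    magnitude = math.hypot(i_x, i_y)
    angle = math.degrees(math.atan2(i_y, i_x)) % 180.0
    if angle >= 180.0:  # guard against floating-point rounding
        angle = 0.0
    return magnitude, angle
```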
2. Grid (Cell)
We then divide the image into cells of size $C\times C$ (usually $C=8$). The figure below shows such a division; each green box is an $8\times 8$ cell:
Each cell has $C^2$ (usually $64$) pixels. Each pixel has a gradient, and we need to summarize how the gradient directions (that is, the angles $\theta$) of these pixels are distributed. The gradient magnitudes and directions of one cell in the figure above are as follows:
To describe the distribution of the angles we use a histogram. Each interval on the $x$-axis of a histogram is called a bin. You can think of a bin as a bucket: each input value is put into the bucket whose range it falls in. The range of $\theta$ is $0\degree$ to $180\degree$, and we divide it into $B$ bins. Usually $B=9$, so the width of each bin is $w=\frac{180}{B}=20\degree$. We number the bins from $0$ to $B-1$. Bin $i$ covers $[wi,w(i+1))$ and its center is $c_i=w\!\left(i+\frac{1}{2}\right)$. For example, when $B=9$, bin $3$ ($i=3$) covers $[60\degree,80\degree)$ and its center is $70\degree$.
However, we do not simply drop each pixel into the bucket that contains its $\theta$. Instead, its magnitude $\mu$ is split between two adjacent buckets in a certain ratio, so the value of each bucket is not a count but a measure of "contribution". The contribution of a pixel to a bin depends not only on the gradient magnitude $\mu$ but also on the distance between its angle $\theta$ and the center of the bin: the larger the magnitude, the greater the contribution; the larger the distance, the smaller the contribution. Specifically, for a pixel with gradient magnitude $\mu$ and orientation $\theta$, let $j=\left\lfloor\cfrac{\theta}{w}-\cfrac{1}{2}\right\rfloor$; then
- its contribution to bin $j\bmod B$ is $v_j=\mu\cfrac{c_{j+1}-\theta}{w}$;
- its contribution to bin $(j+1)\bmod B$ is $v_{j+1}=\mu\cfrac{\theta-c_j}{w}$.
In the end, each cell gets a histogram, and each entry of the histogram is the sum of the contributions of all pixels in the cell to that bin. Note that the contributions of one pixel to its two bins always sum to $\mu$.
The image below is an example. First, we divide $0\degree$ to $180\degree$ into $B=9$ parts, centered at $10\degree$, $30\degree$, …, $170\degree$. Now take a gradient with $\theta=77\degree$ and magnitude $\mu$: it contributes $0.65\mu$ to bin 3 (range $60\degree$ to $80\degree$, center $70\degree$) and $0.35\mu$ to bin 4 (range $80\degree$ to $100\degree$, center $90\degree$).
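The voting rule can be checked numerically. The following sketch (the function name vote is hypothetical) implements the two contribution formulas with bin centers at $w\left(i+\frac{1}{2}\right)$ and returns the two bins a gradient votes for, assuming $\theta$ lies in $[0\degree,180\degree)$.

```python
import math


def vote(theta, mu, num_bins=9):
    """Split a gradient (angle theta in degrees, magnitude mu) between two bins.

    Returns a dict {bin_index: contribution}.
    """
    w = 180.0 / num_bins              # bin width
    j = math.floor(theta / w - 0.5)   # index of the bin center at or below theta
    c_j = w * (j + 0.5)               # center of bin j
    c_next = c_j + w                  # center of bin j + 1
    return {
        j % num_bins: mu * (c_next - theta) / w,
        (j + 1) % num_bins: mu * (theta - c_j) / w,
    }
```

With theta = 77 and mu = 1 this gives about 0.65 for bin 3 and 0.35 for bin 4, matching the example above. An angle below $10\degree$ wraps around and shares its vote between bin 8 and bin 0.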
For the athlete picture, the figure below shows how to compute the contributions of a pixel with gradient magnitude $85$ and angle $165\degree$:
The histogram of this cell is as follows:
3. Block Normalization
Although we now have a histogram, its overall height depends heavily on the brightness of the image. We do not want the histograms of photos taken during the day and at night to differ greatly in overall height, so we need to normalize them. We group the cells into blocks; each block has $2\times 2$ cells, and blocks may overlap. Each block therefore contains $2C\times 2C$ pixels. We scan the whole window with a sliding block, moving one cell at a time. This ensures that every cell that is not on the edge is covered by four blocks.
The picture above is $64\times 128$, that is, $8\times 16$ cells, so a block has $7$ horizontal positions and $15$ vertical positions.
Since each block has $4$ cells and the histogram of each cell has $9$ entries, we can concatenate the entries of these histograms into a $36$-dimensional vector $\boldsymbol{b}$. We then normalize $\boldsymbol{b}$ with the Euclidean norm so that its length is close to $1$: $$\boldsymbol{b}:=\frac{\boldsymbol{b}}{\sqrt{\|\boldsymbol{b}\|^2+\varepsilon}}$$ where $\varepsilon$ is a very small positive number added to prevent division by $0$.
You may ask: why not normalize each cell separately? The answer is that the differences in overall histogram height between cells carry part of the information, which should not be erased completely. Normalizing over $2\times 2$ blocks preserves, to some extent, the information carried by the differences in average gray level between the cells.
4. HOG Feature (HOG Feature)
Next, we concatenate the vectors $\boldsymbol{b}$ of all blocks into one large vector $\boldsymbol{h}$ and then perform the following three steps:
(1) Perform a preliminary normalization: $\boldsymbol{h}:=\cfrac{\boldsymbol{h}}{\sqrt{\|\boldsymbol{h}\|^2+\varepsilon}}$;
(2) Clip $\boldsymbol{h}$ so that no component exceeds a positive threshold $\tau$; that is, for the $n$-th component $h_n$ of $\boldsymbol{h}$, $h_n:=\min(h_n,\tau)$;
(3) Finally, normalize again: $\boldsymbol{h}:=\cfrac{\boldsymbol{h}}{\sqrt{\|\boldsymbol{h}\|^2+\varepsilon}}$. And we're done.
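The three steps can be written out in a few lines. The following is a minimal sketch in plain Python, not the code of any library; the function name l2_hys and the default values tau=0.2 and eps=1e-10 are assumptions chosen for illustration.

```python
import math


def l2_hys(h, tau=0.2, eps=1e-10):
    """Normalize, clip at tau, and normalize again (steps (1) to (3))."""
    def l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v) + eps)
        return [x / norm for x in v]

    h = l2_normalize(h)             # step (1): preliminary normalization
    h = [min(x, tau) for x in h]    # step (2): clip every component at tau
    return l2_normalize(h)          # step (3): normalize again
```

Because eps sits under the square root, an all-zero vector is returned unchanged instead of causing a division by zero.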
For a window with $Y$ rows and $X$ columns, the number of cells is $\cfrac{Y}{C}\times\cfrac{X}{C}$, the number of blocks is $\left(\cfrac{Y}{C}-1\right)\times\left(\cfrac{X}{C}-1\right)$, and the dimension of the final HOG feature $\boldsymbol{h}$ is $4B\times\left(\cfrac{Y}{C}-1\right)\times\left(\cfrac{X}{C}-1\right)$. The HOG feature of the athlete picture has dimension $4\times 9\times 15\times 7=3780$.
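The dimension formula is easy to check with a small helper. This is a sketch with a hypothetical function name, assuming the window size is a multiple of the cell size and that blocks slide one cell at a time.

```python
def hog_dimension(height, width, cell=8, bins=9, cells_per_block=2):
    """Length of the HOG vector for a height x width window."""
    cells_y = height // cell
    cells_x = width // cell
    blocks_y = cells_y - cells_per_block + 1   # vertical block positions
    blocks_x = cells_x - cells_per_block + 1   # horizontal block positions
    return cells_per_block ** 2 * bins * blocks_y * blocks_x
```

hog_dimension(128, 64) returns 3780 for the $64\times 128$ window, and hog_dimension(128, 64, cell=16) returns 756, which shows how much a larger cell size shrinks the feature vector.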
5. Extracting HOG features with skimage.feature.hog
skimage can be installed as follows:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scikit-image
For this $96\times 160$ picture (named hog_test.png), the code below extracts its HOG features:
# encoding: UTF-8
# File: hog.py
# Description: extract the HOG features of an image
from skimage.io import imread
from skimage.feature import hog


def extract_hog_feature(filename):
    # Extract the HOG features of the file `filename`
    image = imread(filename, as_gray=True)
    # Read the image; as_gray=True reads it as a grayscale image
    feature = hog(  # extract the HOG features
        image,  # the image
        orientations=9,  # number of orientations, i.e. the number of bins B
        pixels_per_cell=(8, 8),  # cell size, C×C
        cells_per_block=(2, 2),  # each block has 2×2 cells
        block_norm='L2-Hys',  # normalization method
        visualize=False  # whether to also return a visualization image
    )
    return feature


if __name__ == '__main__':
    feature = extract_hog_feature('hog_test.png')
    print(feature)  # print the HOG features
    print(feature.shape)  # print the dimension of the HOG features
The output is:
[0.24284172 0.24284172 0.21779826 ... 0.1942068 0.25568547 0.10666346]
(7524,)
In other words, the HOG feature we obtained is a $7524$-dimensional vector, where $7524=4\times 9\times\left(\cfrac{96}{8}-1\right)\times\left(\cfrac{160}{8}-1\right)$.
If we set visualize to True, the hog function returns a tuple containing the HOG features and a visualization image. Calling
import matplotlib.pyplot as plt
...
feature, visimg = hog(...)
plt.imshow(visimg)
plt.show()
gives the following image, which is the visualization of the HOG features of hog_test.png.
6. Summary
We can now convert an image into a vector of HOG features, which can then be classified with an SVM.
3. Training and testing of the support vector machine model
The article on the basics of Support Vector Machine (SVM) introduced the basic principles of SVM. Here I mainly cover the Python implementation.
1. Processing the dataset
Training dataset
In principle, the positive training samples (pictures containing people) should be in the folder INRIAPerson\train_64x128_H96\pos, the negative samples (pictures containing no people) in the folder INRIAPerson\train_64x128_H96\neg, and all pictures should be $64\times 128$. However, this is not the case:
Pitfall 3: The folder INRIAPerson\train_64x128_H96\pos contains no pictures, only symbolic links, and INRIAPerson\train_64x128_H96\neg is itself a symbolic link. Moreover, the links in pos point to the files in INRIAPerson\96X160H96\Train\pos, which are all $96\times 160$; neg points to INRIAPerson\Train\neg, whose pictures vary in size.
Solution:
- Use the folder INRIAPerson\96X160H96\Train\pos for the positive samples. Its images are $96\times 160$ because there is a padding of $16$ pixels on each side of the image. When reading a picture, we therefore crop the central $64\times 128$ part, that is, the region from $(16,16)$ to $(80,144)$.
- Use the folder INRIAPerson\Train\neg for the negative samples, and randomly crop $10$ regions of size $64\times 128$ from each picture.
Test dataset
In the same way, use INRIAPerson\70X134H96\Test\pos as the folder of positive test samples and INRIAPerson\Test\neg as the folder of negative test samples.
2. Read pictures and extract HOG features
Because the dataset is large, setting the cell size $C$ to $8$ makes training extremely slow, so here I use $C=16$.
For each positive sample, we crop the central $64\times 128$ part and extract its HOG features. For each negative sample, we randomly crop $10$ regions of size $64\times 128$ and extract their HOG features. This yields two lists, x and y, where x holds the HOG features (that is, the training data) and y holds the labels: y[i] = 1 if x[i] comes from a positive sample, otherwise y[i] = 0. The code is as follows:
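import random
import os
from tqdm import tqdm  # shows a progress bar while looping
from skimage.io import imread  # function for reading images
from skimage.feature import hog  # skimage's built-in HOG extractor


def clip_image(img, left, top,
               width=64, height=128):
    '''
    Crop a region of the image.
    Parameters
    ---
    img: input image.
    left: x coordinate of the left edge of the region.
    top: y coordinate of the top edge of the region.
    width: width of the region.
    height: height of the region.
    '''
    return img[top:top + height, left:left + width]


def extract_hog_feature(img):
    '''
    Extract the HOG features of a single image img.
    '''
    return hog(
        img,
        orientations=9,
        pixels_per_cell=(16, 16),
        cells_per_block=(2, 2),
        block_norm='L2-Hys',
        visualize=False
    ).astype('float32')


def read_images(pos_dir, neg_dir,
                neg_area_count, description):
    '''
    Read the images and extract the HOG features of the samples.
    Parameters
    ---
    pos_dir: folder containing the positive samples.
    neg_dir: folder containing the negative samples.
    neg_area_count: number of regions randomly cropped from
        each negative sample.
    description: what the data is for (training/test).
    Returns
    -----
    A tuple (x, y): x holds the HOG features of all images,
    y holds their classes (1 = positive, 0 = negative).
    '''
    pos_img_files = os.listdir(pos_dir)
    # list of positive sample files
    neg_img_files = os.listdir(neg_dir)
    # list of negative sample files
    area_width = 64  # width of the cropped region
    area_height = 128  # height of the cropped region
    x = []  # HOG features of the images
    y = []  # classes of the images
    for pos_file in tqdm(pos_img_files,
                         desc=f'{description} positive samples'):
        # read all positive samples
        pos_path = os.path.join(pos_dir, pos_file)
        # path of the positive sample
        pos_img = imread(pos_path, as_gray=True)
        # the positive sample image
        img_height, img_width = pos_img.shape
        # height and width of the image (rows come first)
        clip_left = (img_width - area_width) // 2
        # left edge of the cropped region
        clip_top = (img_height - area_height) // 2
        # top edge of the cropped region
        pos_center = clip_image(pos_img,
                                clip_left, clip_top,
                                area_width, area_height)
        # crop the central part
        hog_feature = extract_hog_feature(
            pos_center)  # extract the HOG features
        x.append(hog_feature)  # append the HOG vector
        y.append(1)  # 1 stands for the positive class
    for neg_file in tqdm(neg_img_files,
                         desc=f'{description} negative samples'):
        # read all negative samples
        neg_path = os.path.join(neg_dir, neg_file)
        # path of the negative sample
        neg_img = imread(neg_path, as_gray=True)
        # the negative sample image
        img_height, img_width = neg_img.shape
        # height and width of the image (rows come first)
        left_max = img_width - area_width
        # maximum x coordinate of the region's left edge
        top_max = img_height - area_height
        # maximum y coordinate of the region's top edge
        for _ in range(neg_area_count):
            # randomly crop neg_area_count regions
            left = random.randint(0, left_max)  # left edge
            top = random.randint(0, top_max)  # top edge
            clipped_area = clip_image(neg_img,
                                      left, top,
                                      area_width, area_height)
            # the cropped region
            hog_feature = extract_hog_feature(
                clipped_area)  # extract the HOG features
            x.append(hog_feature)
            y.append(0)  # 0 stands for the negative class
    return x, y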
The read_images function above can be used to read both the training data and the test data.
Pitfall 4: The image returned by skimage.io.imread is indexed as "height × width", not "width × height".
Solution: When cropping an image, remember that the first index is the y coordinate (row) and the second index is the x coordinate (column).
The following two functions, read_training_data and read_test_data, call read_images to read the training data and the test data respectively (each still returns two values: the HOG features and the labels). For both training and test data, 10 regions are randomly cropped from each negative sample.
def read_training_data():
    '''
    Read the training data.
    '''
    pos_dir = 'INRIAPerson/96X160H96/Train/pos'
    neg_dir = 'INRIAPerson/Train/neg'
    neg_area_count = 10
    description = 'training'
    return read_images(pos_dir, neg_dir,
                       neg_area_count, description)


def read_test_data():
    '''
    Read the test data.
    '''
    pos_dir = 'INRIAPerson/70X134H96/Test/pos'
    neg_dir = 'INRIAPerson/Test/neg'
    neg_area_count = 10
    description = 'test'
    return read_images(pos_dir, neg_dir,
                       neg_area_count, description)
If we need to train multiple times, we don't need to extract the HOG features of the image every time the program is executed. We can save the calculated HOG features in a file, and read the data from the file when training SVM. The functions for saving and reading files are as follows:
import pickle


def save_hog(x, y, filename):
    '''
    Write (x, y), the return value of read_training_data,
    to the file named filename.
    '''
    with open(filename, 'wb') as file:
        pickle.dump((x, y), file)


def load_hog(filename):
    '''
    Load the training data (x, y) from the file named filename.
    '''
    result = None
    with open(filename, 'rb') as file:
        result = pickle.load(file)
    return result
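One way to combine saving and loading is a small caching helper that only computes the features when no cache file exists yet. The sketch below is an illustration rather than part of the original program; the name load_or_compute is hypothetical.

```python
import os
import pickle


def load_or_compute(filename, compute):
    """Return the cached (x, y) from `filename`; on a miss, call
    `compute()`, save its result to `filename`, and return it."""
    if os.path.exists(filename):
        with open(filename, 'rb') as file:
            return pickle.load(file)
    data = compute()
    with open(filename, 'wb') as file:
        pickle.dump(data, file)
    return data
```

For example, x, y = load_or_compute('train_hog.pkl', read_training_data) would extract the features on the first run and read them from disk afterwards; the file name train_hog.pkl is only an example.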
3. Training SVM
With the training data read, we can move on to training the SVM. For an introduction to the principles of SVM, see the article on the basics of Support Vector Machine (SVM); here we only need to call the library.
The SVM classifier class in sklearn is sklearn.svm.SVC:
from sklearn.svm import SVC
The SVC parameters we focus on are:
- tol: the tolerance of the solver's stopping criterion. The SVM optimization problem has a constraint for each $i$ ($1\le i\le n$): $y_i\left(\bm{w}^{\mathrm{T}}\bm{x}_i+b\right)\ge 1$. The solver is iterative and stops once the optimality conditions are satisfied to within tol, so the smaller tol is, the more strictly the conditions are enforced (and the longer training takes). The sklearn default is 1e-3; I use 1e-6 to make the classification a little stricter.
- C: the penalty coefficient. For data that are not linearly separable, we must allow some samples to violate the constraint, so we introduce slack variables $\xi_i\ge 0$ and relax the constraint for sample $i$ from $y_i\left(\bm{w}^{\mathrm{T}}\bm{x}_i+b\right)\ge 1$ to $y_i\left(\bm{w}^{\mathrm{T}}\bm{x}_i+b\right)\ge 1-\xi_i$. The objective must change as well: we want as few violations of the constraint as possible, so the objective becomes $\min\limits_{\bm{w},b}\frac{1}{2}{\|\bm{w}\|}^2+C\sum\limits_{i=1}^n\xi_i$, where $C$ is the penalty coefficient. It reflects how heavily samples that violate the constraint are penalized: the larger $C$, the lower the tolerance for such samples. According to Dalal's paper (Reference [6]), $C=0.01$ gave good results, so I also choose C = 0.01.
- max_iter: the maximum number of iterations. We want the SVM to stop only when the stopping criterion is met, so we set it to -1, which means no limit on the number of iterations.
- gamma: the parameter of the Gaussian kernel. We set it to auto, which lets sklearn choose the value automatically (1 / n_features).
- kernel: the kernel function. According to Dalal's paper, a Gaussian kernel improves the detection accuracy to some extent at the cost of slower speed compared with a linear kernel, so I use the Gaussian kernel (kernel = rbf).
- probability: whether to output probabilities. In pedestrian detection we want the probability that a region of the picture contains a pedestrian, so we set probability = True.
The code for training the SVM is as follows:
def train_SVM(x, y):
    '''
    Train the SVM.
    Parameters
    ---
    x, y: the return value of read_training_data.
    Returns
    -----
    The trained SVM.
    '''
    SVM = SVC(
        tol=1e-6,
        C=0.01,
        max_iter=-1,
        gamma='auto',
        kernel='rbf',
        probability=True
    )  # create the SVM instance
    SVM.fit(x, y)  # train it
    return SVM
The SVM can also be replaced with a logistic regression model: after importing sklearn.linear_model, just replace SVM = SVC(...) with LR = sklearn.linear_model.LogisticRegression(tol=1e-6, C=0.01, max_iter=10000).
4. Testing the SVM
After training, we need to test the SVM on the test data. For a given sample, the prediction of the SVM may be right or wrong. If the SVM says that a positive sample contains a person, the sample is called a true positive (TP); if it says that a positive sample contains no person, a false negative (FN); if it says that a negative sample contains a person, a false positive (FP); if it says that a negative sample contains no person, a true negative (TN). True positives and true negatives are correct predictions; false positives and false negatives are wrong predictions. This can be summarized in the following table:
True value \ SVM prediction | Positive | Negative |
---|---|---|
Positive | True positive (TP) | False negative (FN) |
Negative | False positive (FP) | True negative (TN) |
The number of positive samples (that is, the true value is positive) is TP+FN, and the number of negative samples is FP+TN.
Define the recall as the proportion of positive samples that are predicted to be positive, that is, Recall=TP/(TP+FN); the precision is the proportion of samples predicted to be positive by the SVM that really are positive, that is, Precision=TP/(TP+FP). The recall is also called the true positive rate (TPR). Correspondingly, the false positive rate (FPR) is the proportion of negative samples that are predicted to be positive, that is, FPR=FP/(FP+TN). Note that FPR=1-TNR, where TNR=TN/(FP+TN) is the true negative rate; the FPR is not determined by the TPR. Define the miss rate (MR) as the proportion of positive samples that are not detected, that is, MR=1-Recall=FN/(TP+FN).
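A short numeric example makes these definitions concrete. The helper below is a sketch with a hypothetical name, and it assumes that every denominator is non-zero.

```python
def classification_rates(tp, fn, fp, tn):
    """Compute recall, precision, FPR and miss rate from the four counts."""
    return {
        'recall': tp / (tp + fn),      # also the true positive rate
        'precision': tp / (tp + fp),
        'fpr': fp / (fp + tn),
        'miss_rate': fn / (tp + fn),   # equals 1 - recall
    }
```

With tp=90, fn=10, fp=30, tn=870, for instance, the recall is 0.9, the precision 0.75, the FPR 30/900 and the miss rate 0.1.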
The output of the SVM is the probability that a region of the picture contains a pedestrian. We define a threshold: if the probability given by the SVM is greater than or equal to the threshold, the region is considered to contain a pedestrian (the prediction is positive); if the probability is below the threshold, the region is considered to contain no pedestrian (the prediction is negative). Different thresholds lead to different predictions. When the threshold is close to $1$, the condition for a positive prediction is extremely strict: true positives and false positives both decrease, while false negatives and true negatives increase. Conversely, when the threshold is close to $0$, the SVM labels many samples as pedestrians: true positives and false positives increase, while false negatives and true negatives decrease. Since TP+FN and FP+TN stay constant, a larger threshold gives a larger miss rate (because FN increases) and a smaller FPR (because FP decreases). Each threshold thus gives one miss rate and one FPR. Letting the threshold run over the distinct probabilities that the SVM assigns to the test samples, we can draw a miss rate vs. false positive rate curve:
The closer the curve is to the axes (that is, the smaller the area under the curve), the more reliable the model. Both the miss rate and the false positive rate are quantities we want to be small at the same time: for a given miss rate, the smaller the false positive rate the better, which means the smaller the area under the curve the better.
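The threshold sweep can be illustrated in plain Python. This is a deliberately simple quadratic-time sketch with a hypothetical function name; it assumes that labels are 0 or 1 and that both classes are present.

```python
def miss_fpr_curve(labels, probs):
    """Return (threshold, miss_rate, fpr) for every distinct probability,
    from the highest threshold to the lowest."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = []
    for threshold in sorted(set(probs), reverse=True):
        # a sample is predicted positive when its probability >= threshold
        tp = sum(1 for y, p in zip(labels, probs)
                 if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(labels, probs)
                 if y == 0 and p >= threshold)
        points.append((threshold, 1 - tp / num_pos, fp / num_neg))
    return points
```

As the threshold decreases along the returned list, the miss rate can only fall and the false positive rate can only rise.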
We test the SVM with the test_SVM function. It first calls the SVM.predict_proba method to compute the probability that each test sample contains a pedestrian, then computes the miss rate and false positive rate for each threshold and draws the miss rate vs. false positive rate curve. Finally, the function returns the area under the curve (AUC) of the ROC curve (receiver operating characteristic curve); the closer this value is to $1$, the more reliable the model.
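import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics


def test_SVM(SVM, test_data, show_stats=False):
    '''
    Test the trained SVM.
    Parameters
    ---
    SVM: the trained SVM model.
    test_data: the test data (return value of read_test_data).
    show_stats: whether to show statistics (the miss rate vs.
        false positive rate curve).
    Returns
    -----
    The AUC (area under the ROC curve), a value between 0 and 1.
    The closer the AUC is to 1, the more reliable the model.
    '''
    hog_features = test_data[0]  # HOG features of the test data
    labels = np.asarray(test_data[1])
    # labels (0 = not a person, 1 = a person), converted to an
    # array so that they can be reordered with an index array
    prob = SVM.predict_proba(hog_features)[:, 1]
    if show_stats:
        # sort prob and labels in descending order of prob
        sorted_indices = np.argsort(
            prob, kind="mergesort")[::-1]
        labels = labels[sorted_indices]
        prob = prob[sorted_indices]
        distinct_value_indices = np.where(np.diff(prob))[0]
        # indices at which the value of prob changes, i.e. the
        # last occurrence of each distinct value but the final one
        threshold_idxs = np.r_[
            distinct_value_indices, labels.size - 1]
        # indices of the thresholds, with the index of the
        # last sample appended
        tps = np.cumsum(labels)[threshold_idxs]
        # number of true positives for each threshold.
        # prob is sorted in descending order, so every sample up
        # to a threshold index has probability >= the threshold
        # and every later sample has a smaller one; the number of
        # true positives is the number of positive samples up to
        # that index.
        fps = 1 + threshold_idxs - tps
        # number of false positives for each threshold:
        # threshold_idxs holds indices, adding one gives counts,
        # and subtracting the true positives leaves the
        # false positives.
        num_positive = tps[-1]
        # the last entry of tps is the sum of labels,
        # i.e. the number of positive samples
        recall = tps / num_positive
        # recall: the fraction of positive samples detected
        miss = 1 - recall  # miss rate
        num_negative = fps[-1]  # number of negative samples
        fpr = fps / num_negative
        # false positive rate
        plt.plot(fpr, miss, color='red')
        # x axis: false positive rate, y axis: miss rate
        plt.xlabel('False Positive Rate')
        plt.ylabel('Miss Rate')
        plt.title('Miss Rate - '
                  'False Positive Rate Curve')
        plt.show()
    AUC = metrics.roc_auc_score(labels, prob)
    return AUC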
5. Drawing boxes on an image
Next we introduce how to draw boxes around the pedestrians in an image. The main idea is the sliding window: slide windows of different sizes over the image with a fixed step, and compute the HOG features of the image patch inside the window each time. The algorithm has three nested loops:
- The first loop enumerates the window width. The aspect ratio of the window is fixed (height : width = 2:1). The width starts at min_width ($48$ by default) and is multiplied by width_scale ($1.25$ by default) each time, stopping once the window no longer fits inside the image. For better detection results, set min_width and width_scale to smaller values, at the cost of slower detection.
- The second loop enumerates the x coordinate of the left edge of the window, starting from $0$ and increasing by one step coord_step ($16$ by default) each time until the right edge reaches the image boundary. For better detection results, decrease coord_step, again at the cost of slower detection.
- The third loop enumerates the y coordinate of the top edge of the window, starting from $0$ and increasing by one step coord_step each time.
- Each window is scaled to area_width * area_height ($64\times 128$ by default) and its HOG features are extracted.
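The three loops can be sketched as a generator that only enumerates the window geometry, without touching any pixels. The function name sliding_windows is hypothetical; the parameter names and defaults follow the description above.

```python
def sliding_windows(img_width, img_height, min_width=48,
                    width_scale=1.25, coord_step=16, ratio=2):
    """Yield (left, top, width, height) for every window size and position."""
    width = min_width
    height = int(width * ratio)
    while width < img_width and height < img_height:
        for left in range(0, img_width - width, coord_step):
            for top in range(0, img_height - height, coord_step):
                yield left, top, width, height
        width = int(width * width_scale)   # grow the window
        height = int(width * ratio)        # keep the aspect ratio fixed
```

For a $100\times 200$ image with the default parameters this yields 52 windows, which gives a feel for how quickly the number of HOG extractions grows as coord_step or width_scale is reduced.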
Then the SVM gives, for the HOG features of each window, the probability that the window contains a pedestrian. When the probability is at least the threshold threshold ($0.99$ by default), the window is considered to contain a pedestrian. One problem remains: a single pedestrian may be covered by several boxes, and we need to select the most suitable one. This requires non-maximum suppression (NMS).
The basic idea of NMS is: of two boxes that overlap heavily, discard the one with the lower probability of containing a pedestrian and keep the one with the higher probability. How do we measure the overlap? We use the intersection over union (IoU), the ratio of the area of the intersection of two boxes to the area of their union. The larger the IoU, the greater the overlap. When the IoU exceeds a threshold IoU_threshold, one of the boxes is discarded.
To compute the area of the union, add the areas of the two boxes and subtract the area of the intersection (the inclusion-exclusion principle). The code is as follows:
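def area_of_box(box):
    '''
    Compute the area of a box.
    Parameters
    ---
    box: the box, in the format (left, top, width, height).
    Returns
    -----
    The area of box, i.e. width * height.
    '''
    return box[2] * box[3]


def intersection_over_union(box1, box2):
    '''
    Intersection over union (IoU) of two boxes.
    Parameters
    ---
    box1: box 1.
    box2: box 2.
    '''
    intersection_width = max(0,
        min(box1[0] + box1[2], box2[0] + box2[2])
        - max(box1[0], box2[0]))
    # width of the intersection = max(0, smaller right edge
    # - larger left edge)
    intersection_height = max(0,
        min(box1[1] + box1[3], box2[1] + box2[3])
        - max(box1[1], box2[1]))
    # height of the intersection = max(0, smaller bottom edge
    # - larger top edge)
    intersection_area = intersection_width * \
        intersection_height  # area of the intersection
    area_box1 = area_of_box(box1)  # area of box1
    area_box2 = area_of_box(box2)  # area of box2
    union_area = area_box1 + area_box2 - \
        intersection_area
    # the area of the union is the sum of the two areas
    # minus the area of the intersection
    if abs(union_area) < 1:
        IoU = 0  # avoid division by zero
    else:
        IoU = intersection_area / union_area
    return IoU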
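As a sanity check on the IoU definition, the following self-contained sketch computes the intersection from the box corners and can be verified by hand on three simple cases. The function name iou_xywh is hypothetical; boxes use the same (left, top, width, height) format.

```python
def iou_xywh(box1, box2):
    """IoU of two boxes given as (left, top, width, height)."""
    left = max(box1[0], box2[0])
    top = max(box1[1], box2[1])
    right = min(box1[0] + box1[2], box2[0] + box2[2])
    bottom = min(box1[1] + box1[3], box2[1] + box2[3])
    intersection = max(0, right - left) * max(0, bottom - top)
    union = box1[2] * box1[3] + box2[2] * box2[3] - intersection
    return intersection / union if union > 0 else 0.0
```

Two $10\times 10$ boxes offset by 5 pixels in both directions share a $5\times 5$ square, so their IoU is $25/175=1/7$; disjoint boxes give 0 and identical boxes give 1.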
The main flow of the NMS algorithm is: traverse each box; if it is suppressed by another box (one whose IoU with it exceeds the threshold and whose probability is higher), it is not added to the result list; otherwise it is added. Finally, return the result list. The code is as follows:
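def non_maximum_suppression(pos_box_list, pos_prob,
                            IoU_threshold=0.4):
    '''
    Non-maximum suppression (NMS).
    Parameters
    ---
    pos_box_list: list of boxes whose probability of containing
        a person is at least the threshold.
    pos_prob: the corresponding probabilities.
    IoU_threshold: IoU threshold for discarding a box.
    Returns
    -----
    The list of boxes after suppression.
    '''
    result = []  # the result
    for box1, prob1 in zip(pos_box_list, pos_prob):
        discard = False  # whether to discard box1
        for box2, prob2 in zip(
                pos_box_list, pos_prob):
            if intersection_over_union(
                    box1, box2) > IoU_threshold:
                # the IoU exceeds the threshold
                if prob2 > prob1:  # discard the less confident box
                    discard = True
                    break
        if not discard:  # box1 is kept
            result.append(box1)  # add it to the result list
    return result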
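For comparison, a commonly used greedy variant of NMS sorts the boxes by probability and lets only boxes that were actually kept suppress others. The sketch below illustrates that variant; it is not the function above, and the names iou and greedy_nms are hypothetical. The two can differ on chains of overlapping boxes: a box can be removed by the function above because of a higher-scoring neighbour that was itself discarded.

```python
def iou(box1, box2):
    """IoU of two boxes given as (left, top, width, height)."""
    w = min(box1[0] + box1[2], box2[0] + box2[2]) - max(box1[0], box2[0])
    h = min(box1[1] + box1[3], box2[1] + box2[3]) - max(box1[1], box2[1])
    intersection = max(0, w) * max(0, h)
    union = box1[2] * box1[3] + box2[2] * box2[3] - intersection
    return intersection / union if union > 0 else 0.0


def greedy_nms(boxes, probs, iou_threshold=0.4):
    """Keep boxes in descending order of probability, skipping any box
    that overlaps an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: probs[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```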
Finally, the code to frame a single image is as follows:
import numpy as np
from cv2 import rectangle, imshow, waitKey
from skimage.io import imread
from skimage.transform import resize
def detect_pedestrian(SVM, filename, show_img=False,
                      threshold=0.99, area_width=64, area_height=128,
                      min_width=48, width_scale=1.25, coord_step=16,
                      ratio=2):
    '''
    Detect pedestrians in the image file `filename` with the SVM,
    using non-maximum suppression (NMS) to avoid duplicate boxes.
    Parameters
    ---
    SVM: the trained SVM model.
    filename: input file name.
    show_img: whether to display the image with the boxes drawn.
    threshold: probability threshold for treating a region as a person.
    area_width: width of a region after resizing.
    area_height: height of a region after resizing.
    min_width: minimum (and initial) box width.
    width_scale: factor by which the box width grows at each step.
    coord_step: step size of the box coordinates.
    ratio: height-to-width ratio of the box.
    Returns
    -----
    A list whose items are boxes
    (left, top, width, height) around the pedestrians.
    '''
    box_list = []  # candidate boxes
    hog_list = []  # HOG features of the candidate boxes
    with open(filename, 'rb') as file:
        img = imread(file, as_gray=True)  # read the file
    img_height, img_width = img.shape  # image height and width
    width = min_width  # box width
    height = int(width * ratio)  # box height
    while width < img_width and height < img_height:
        for left in range(0, img_width - width,
                          coord_step):  # left edge of the box
            for top in range(0, img_height - height,
                             coord_step):  # top edge of the box
                patch = clip_image(img, left, top,
                                   width, height)  # crop part of the image
                resized = resize(patch,
                                 (area_height, area_width))
                # resize the patch to the training size
                hog_feature = extract_hog_feature(
                    resized)  # extract the HOG feature
                box_list.append((left, top,
                                 width, height))
                hog_list.append(hog_feature)
        width = int(width * width_scale)
        height = int(width * ratio)
    if not hog_list:  # image too small for even one box
        return []
    prob = SVM.predict_proba(hog_list)[:, 1]
    # classify every candidate with the SVM model
    mask = (prob >= threshold)
    # boolean array; mask[i] tells whether prob[i] reaches the threshold
    pos_box_list = np.array(box_list)[mask]
    # boxes that contain a person
    pos_prob = prob[mask]  # the corresponding predicted probabilities
    box_list_after_NMS = non_maximum_suppression(
        pos_box_list, pos_prob)
    # boxes left after NMS
    if show_img:
        shown_img = np.array(img)
        # copy the original image before drawing on it
        for box in box_list_after_NMS:
            # convert NumPy integers to plain ints for OpenCV
            shown_img = rectangle(shown_img,
                                  pt1=(int(box[0]), int(box[1])),
                                  pt2=(int(box[0] + box[2]),
                                       int(box[1] + box[3])),
                                  color=(0, 0, 0),
                                  thickness=2)
        imshow('', shown_img)
        waitKey(0)
    return box_list_after_NMS
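To get a feel for how many candidate boxes the multi-scale sliding window produces, the box enumeration can be isolated from the image processing. The sketch below reuses the default parameters of `detect_pedestrian`; the generator name `sliding_windows` and the 320×240 image size are assumptions made only for this illustration.

```python
def sliding_windows(img_width, img_height, min_width=48,
                    width_scale=1.25, coord_step=16, ratio=2):
    """Yield (left, top, width, height) for every candidate box,
    growing the box by width_scale until it no longer fits."""
    width = min_width
    height = int(width * ratio)
    while width < img_width and height < img_height:
        for left in range(0, img_width - width, coord_step):
            for top in range(0, img_height - height, coord_step):
                yield (left, top, width, height)
        width = int(width * width_scale)
        height = int(width * ratio)


# Hypothetical 320x240 image.
windows = list(sliding_windows(320, 240))
print(sorted({w[2] for w in windows}))  # box widths that were tried
# → [48, 60, 75, 93, 116]
print(len(windows))                     # number of HOG vectors to classify
```

Every box costs one resize, one HOG extraction and one SVM evaluation, so a smaller `coord_step` or `width_scale` quickly makes detection slower.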
4. Complete code
# encoding: UTF-8
# File: hog_svm.py
# Author: seh_sjij
import numpy as np
import time
import random
import os
import pickle
import joblib
from tqdm import tqdm
from cv2 import rectangle, imshow, waitKey
from skimage.io import imread
from skimage.feature import hog
from skimage.transform import resize
from sklearn import metrics
from sklearn.svm import SVC
import matplotlib.pyplot as plt
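def clip_image(img, left, top,
               width=64, height=128):
    '''
    Crop a region of the image.
    Parameters
    ---
    img: input image.
    left: x coordinate of the region's left edge.
    top: y coordinate of the region's top edge.
    width: region width.
    height: region height.
    '''
    return img[top:top + height, left:left + width]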
def extract_hog_feature(img):
    '''
    Extract the HOG feature of a single image img.
    '''
    return hog(
        img,
        orientations=9,
        pixels_per_cell=(16, 16),
        cells_per_block=(2, 2),
        block_norm='L2-Hys',
        visualize=False
    ).astype('float32')
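def read_images(pos_dir, neg_dir,
                neg_area_count, description):
    '''
    Read the images and extract the HOG features of the samples.
    Parameters
    ---
    pos_dir: folder containing the positive samples.
    neg_dir: folder containing the negative samples.
    neg_area_count: number of regions cropped at random
        from each negative sample.
    description: what the data is for (training/test).
    Returns
    -----
    A tuple (x, y): x holds the HOG features of all images,
    y holds their classes (1 = positive, 0 = negative).
    '''
    pos_img_files = os.listdir(pos_dir)
    # list of positive sample files
    neg_img_files = os.listdir(neg_dir)
    # list of negative sample files
    area_width = 64  # width of the cropped region
    area_height = 128  # height of the cropped region
    x = []  # HOG features of the images
    y = []  # classes of the images
    for pos_file in tqdm(pos_img_files,
                         desc=f'{description} positive samples'):
        # read all positive samples
        pos_path = os.path.join(pos_dir, pos_file)
        # path of the positive sample
        pos_img = imread(pos_path, as_gray=True)
        # positive sample image
        img_height, img_width = pos_img.shape
        # height and width of this image
        clip_left = (img_width - area_width) // 2
        # left edge of the cropped region
        clip_top = (img_height - area_height) // 2
        # top edge of the cropped region
        pos_center = clip_image(pos_img,
                                clip_left, clip_top, area_width, area_height)
        # crop the central part
        hog_feature = extract_hog_feature(
            pos_center)  # extract the HOG feature
        x.append(hog_feature)  # store the HOG vector
        y.append(1)  # 1 marks the positive class
    for neg_file in tqdm(neg_img_files,
                         desc=f'{description} negative samples'):
        # read all negative samples
        neg_path = os.path.join(neg_dir, neg_file)
        # path of the negative sample
        neg_img = imread(neg_path, as_gray=True)
        # negative sample image
        img_height, img_width = neg_img.shape
        # height and width of this image
        left_max = img_width - area_width
        # largest allowed left coordinate of a region
        top_max = img_height - area_height
        # largest allowed top coordinate of a region
        for _ in range(neg_area_count):
            # crop neg_area_count random regions
            left = random.randint(0, left_max)  # left edge of the region
            top = random.randint(0, top_max)  # top edge of the region
            clipped_area = clip_image(neg_img,
                                      left, top, area_width, area_height)
            # the cropped region
            hog_feature = extract_hog_feature(
                clipped_area)  # extract the HOG feature
            x.append(hog_feature)
            y.append(0)  # 0 marks the negative class
    return x, y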
def read_training_data():
    '''
    Read the training data.
    '''
    pos_dir = 'INRIAPerson/96X160H96/Train/pos'
    neg_dir = 'INRIAPerson/Train/neg'
    neg_area_count = 10
    description = 'Training'
    return read_images(pos_dir, neg_dir,
                       neg_area_count, description)
def read_test_data():
    '''
    Read the test data.
    '''
    pos_dir = 'INRIAPerson/70X134H96/Test/pos'
    neg_dir = 'INRIAPerson/Test/neg'
    neg_area_count = 10
    description = 'Test'
    return read_images(pos_dir, neg_dir,
                       neg_area_count, description)
def save_hog(x, y, filename):
    '''
    Write (x, y), the return value of read_training_data
    or read_test_data, to the file named filename.
    '''
    with open(filename, 'wb') as file:
        pickle.dump((x, y), file)
def load_hog(filename):
    '''
    Load the data (x, y) from the file named filename.
    '''
    result = None
    with open(filename, 'rb') as file:
        result = pickle.load(file)
    return result
def train_SVM(x, y):
    '''
    Train the SVM.
    Parameters
    ---
    x, y: return value of read_training_data.
    Returns
    -----
    The trained SVM.
    '''
    SVM = SVC(
        tol=1e-6,
        C=0.01,
        max_iter=-1,
        gamma='auto',
        kernel='rbf',
        probability=True
    )  # create the SVM instance
    SVM.fit(x, y)  # train it
    return SVM
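def test_SVM(SVM, test_data, show_stats=False):
    '''
    Test the trained SVM.
    Parameters
    ---
    SVM: the trained SVM model.
    test_data: test data (return value of read_test_data).
    show_stats: whether to show statistics (the miss rate vs.
        false positive rate curve).
    Returns
    -----
    The AUC (area under the ROC curve). For a useful classifier
    the AUC lies between 0.5 and 1; the closer it is to 1,
    the more reliable the model.
    '''
    hog_features = test_data[0]  # HOG features of the test data
    labels = np.asarray(test_data[1])
    # labels as an array (0 = not a person, 1 = person)
    prob = SVM.predict_proba(hog_features)[:, 1]
    if show_stats:
        # sort prob and labels by prob in descending order
        sorted_indices = np.argsort(
            prob, kind="mergesort")[::-1]
        labels = labels[sorted_indices]
        prob = prob[sorted_indices]
        distinct_value_indices = np.where(np.diff(prob))[0]
        # indices after which the sorted prob changes value
        threshold_idxs = np.r_[
            distinct_value_indices, labels.size - 1]
        # threshold indices, with the last sample's index appended
        tps = np.cumsum(labels)[threshold_idxs]
        # number of true positives at each probability threshold.
        # prob is now sorted in descending order, so every sample
        # up to a given position is at or above the threshold and
        # every sample after it is below; the true positives are
        # therefore the positive samples up to that position.
        fps = 1 + threshold_idxs - tps
        # number of false positives at each probability threshold.
        # threshold_idxs holds indices; adding one gives counts,
        # and subtracting the true positives leaves the
        # false positives.
        num_positive = tps[-1]
        # the last item of tps is the sum of labels,
        # i.e. the number of positive samples.
        recall = tps / num_positive
        # recall: the fraction of all positives that were found.
        miss = 1 - recall  # miss rate
        num_negative = fps[-1]  # number of negative samples
        fpr = fps / num_negative
        # false positive rate
        plt.plot(fpr, miss, color='red')
        # x = false positive rate, y = miss rate, matching the labels
        plt.xlabel('False Positive Rate')
        plt.ylabel('Miss Rate')
        plt.title('Miss Rate - '
                  'False Positive Rate Curve')
        plt.show()
    AUC = metrics.roc_auc_score(labels, prob)
    return AUC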
def area_of_box(box):
    '''
    Compute the area of a box.
    Parameters
    ---
    box: a box in the format (left, top, width, height).
    Returns
    -----
    The area of box, i.e. width * height.
    '''
    return box[2] * box[3]
def intersection_over_union(box1, box2):
    '''
    Intersection over union (IoU) of two boxes.
    Parameters
    ---
    box1: box 1.
    box2: box 2.
    '''
    intersection_width = max(0,
        min(box1[0] + box1[2], box2[0] + box2[2])
        - max(box1[0], box2[0]))
    # intersection width = max(0, smaller right edge - larger left edge)
    intersection_height = max(0,
        min(box1[1] + box1[3], box2[1] + box2[3])
        - max(box1[1], box2[1]))
    # intersection height = max(0, smaller bottom edge - larger top edge)
    intersection_area = intersection_width * \
        intersection_height  # intersection area
    area_box1 = area_of_box(box1)  # area of box1
    area_box2 = area_of_box(box2)  # area of box2
    union_area = area_box1 + area_box2 - \
        intersection_area
    # the union area equals the sum of both areas minus the intersection area
    if abs(union_area) < 1:
        IoU = 0  # avoid division by zero
    else:
        IoU = intersection_area / union_area
    return IoU
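def non_maximum_suppression(pos_box_list, pos_prob,
                            IoU_threshold=0.4):
    '''
    Non-maximum suppression (NMS).
    Parameters
    ---
    pos_box_list: boxes whose person probability exceeds the threshold.
    pos_prob: the corresponding probabilities.
    IoU_threshold: IoU threshold above which a box may be discarded.
    Returns
    -----
    The list of boxes that survive suppression.
    '''
    result = []  # boxes to keep
    for box1, prob1 in zip(pos_box_list, pos_prob):
        discard = False  # whether to discard box1
        for box2, prob2 in zip(
                pos_box_list, pos_prob):
            if intersection_over_union(
                    box1, box2) > IoU_threshold:
                # IoU exceeds the threshold
                if prob2 > prob1:  # discard the less confident box
                    discard = True
                    break
        if not discard:  # box1 was not discarded
            result.append(box1)  # add it to the result list
    return result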
def detect_pedestrian(SVM, filename, show_img=False,
                      threshold=0.99, area_width=64, area_height=128,
                      min_width=48, width_scale=1.25, coord_step=16,
                      ratio=2):
    '''
    Detect pedestrians in the image file `filename` with the SVM,
    using non-maximum suppression (NMS) to avoid duplicate boxes.
    Parameters
    ---
    SVM: the trained SVM model.
    filename: input file name.
    show_img: whether to display the image with the boxes drawn.
    threshold: probability threshold for treating a region as a person.
    area_width: width of a region after resizing.
    area_height: height of a region after resizing.
    min_width: minimum (and initial) box width.
    width_scale: factor by which the box width grows at each step.
    coord_step: step size of the box coordinates.
    ratio: height-to-width ratio of the box.
    Returns
    -----
    A list whose items are boxes
    (left, top, width, height) around the pedestrians.
    '''
    box_list = []  # candidate boxes
    hog_list = []  # HOG features of the candidate boxes
    with open(filename, 'rb') as file:
        img = imread(file, as_gray=True)  # read the file
    img_height, img_width = img.shape  # image height and width
    width = min_width  # box width
    height = int(width * ratio)  # box height
    while width < img_width and height < img_height:
        for left in range(0, img_width - width,
                          coord_step):  # left edge of the box
            for top in range(0, img_height - height,
                             coord_step):  # top edge of the box
                patch = clip_image(img, left, top,
                                   width, height)  # crop part of the image
                resized = resize(patch,
                                 (area_height, area_width))
                # resize the patch to the training size
                hog_feature = extract_hog_feature(
                    resized)  # extract the HOG feature
                box_list.append((left, top,
                                 width, height))
                hog_list.append(hog_feature)
        width = int(width * width_scale)
        height = int(width * ratio)
    if not hog_list:  # image too small for even one box
        return []
    prob = SVM.predict_proba(hog_list)[:, 1]
    # classify every candidate with the SVM model
    mask = (prob >= threshold)
    # boolean array; mask[i] tells whether prob[i] reaches the threshold
    pos_box_list = np.array(box_list)[mask]
    # boxes that contain a person
    pos_prob = prob[mask]  # the corresponding predicted probabilities
    box_list_after_NMS = non_maximum_suppression(
        pos_box_list, pos_prob)
    # boxes left after NMS
    if show_img:
        shown_img = np.array(img)
        # copy the original image before drawing on it
        for box in box_list_after_NMS:
            # convert NumPy integers to plain ints for OpenCV
            shown_img = rectangle(shown_img,
                                  pt1=(int(box[0]), int(box[1])),
                                  pt2=(int(box[0] + box[2]),
                                       int(box[1] + box[3])),
                                  color=(0, 0, 0),
                                  thickness=2)
        imshow('', shown_img)
        waitKey(0)
    return box_list_after_NMS
def detect_multiple_images(SVM, dir):
    '''
    Detect pedestrians in multiple image files
    (every file in the folder dir).
    Parameters
    ---
    SVM: the trained SVM model.
    dir: folder that holds the images.
    '''
    files = os.listdir(dir)
    for file in files:
        file_path = os.path.join(dir, file)
        detect_pedestrian(SVM, file_path,
                          show_img=True)
if __name__ == '__main__':
    print('execution starts')
    random.seed(time.time())  # seed the random number generator
    x, y = read_training_data()  # read training data, extract HOG features
    save_hog(x, y, 'hog_xy.pickle')
    print('training data hog extraction done')
    test_data = read_test_data()  # read test data, extract HOG features
    save_hog(*test_data, 'test_data_hog.pickle')
    print('test data hog extraction done')
    x, y = load_hog('hog_xy.pickle')  # train the SVM model
    time_before_training = time.time()
    SVM = train_SVM(x, y)
    time_after_training = time.time()
    print('SVM training done, cost %.2fs.' % \
          (time_after_training - time_before_training))
    joblib.dump(SVM, 'SVM.model', compress=9)
    SVM = joblib.load('SVM.model')  # test the SVM model
    test_data = load_hog('test_data_hog.pickle')
    print('AUC=%.8f.' % test_SVM(SVM, test_data, True))
    detect_multiple_images(SVM,  # detect pedestrians with the SVM model
                           'INRIAPerson/Test/pos')
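As a sanity check on the listing above, the length of the vector returned by `extract_hog_feature` can be worked out by hand. The sketch below assumes the block layout used by `skimage.feature.hog`, where blocks overlap and slide one cell at a time; the helper name `hog_feature_length` is made up for this illustration.

```python
def hog_feature_length(window_width, window_height,
                       pixels_per_cell=(16, 16),
                       cells_per_block=(2, 2),
                       orientations=9):
    """Length of a HOG vector when blocks slide one cell at a time.
    pixels_per_cell and cells_per_block are (rows, columns) pairs."""
    cell_rows = window_height // pixels_per_cell[0]
    cell_cols = window_width // pixels_per_cell[1]
    block_rows = cell_rows - cells_per_block[0] + 1
    block_cols = cell_cols - cells_per_block[1] + 1
    values_per_block = (cells_per_block[0] * cells_per_block[1]
                        * orientations)
    return block_rows * block_cols * values_per_block


# Parameters used in extract_hog_feature on a 64x128 window:
# 8x4 cells -> 7x3 blocks -> 21 blocks of 2*2*9 = 36 values each.
print(hog_feature_length(64, 128))
# → 756
```

With 8×8-pixel cells the same window would give 3780 values, so the 16×16 cells chosen here trade some detail for a much shorter vector and faster SVM training.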
5. Results
The Miss Rate-False Positive Rate curve I got is as follows:
The AUC of the ROC curve is 0.99204004. I then ran the detector on some pictures:
The results are not entirely satisfactory, which is probably close to the limit of what the HOG+SVM model can do. I found that it is especially prone to recognizing columnar objects (such as street light poles and window frames) as people, probably because their HOG features are relatively similar.
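For reference, the two quantities plotted in this section can be computed without any library. The sketch below is a plain-Python illustration on made-up labels and scores, not the code that produced the curve above; the function name `miss_rate_fpr_curve` is hypothetical.

```python
def miss_rate_fpr_curve(labels, scores):
    """Return (fpr, miss): one point per distinct score threshold,
    with thresholds taken in descending order."""
    pairs = sorted(zip(scores, labels),
                   key=lambda p: p[0], reverse=True)
    num_pos = sum(label for _, label in pairs)
    num_neg = len(pairs) - num_pos
    fpr, miss = [], []
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label        # positives accepted so far
        fp += 1 - label    # negatives accepted so far
        is_last_of_value = (i == len(pairs) - 1
                            or pairs[i + 1][0] != score)
        if is_last_of_value:  # emit one point per distinct score
            fpr.append(fp / num_neg)
            miss.append(1 - tp / num_pos)
    return fpr, miss


# Hypothetical toy data: 1 = person, 0 = not a person.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, miss = miss_rate_fpr_curve(labels, scores)
for f, m in zip(fpr, miss):
    print(f'FPR={f:.2f}  miss rate={m:.2f}')
```

Lowering the threshold can only reduce the miss rate and raise the false positive rate, which is why the curve falls from left to right.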
6. Summary
That is all there is to implementing pedestrian detection with HOG+SVM. I can't guarantee that this implementation is optimal; adjusting some parameters may further improve the SVM's performance. Feel free to explore on your own as you practice.
References
- https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
- https://learnopencv.com/histogram-of-oriented-gradients/
- https://courses.cs.duke.edu/fall15/compsci527/notes/hog.pdf
- https://baike.baidu.com/item/HOG/9738560
- https://blog.csdn.net/jingyu_1/article/details/124217455
- Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886-893 vol. 1.
- https://zhuanlan.zhihu.com/p/594165143
- https://zhuanlan.zhihu.com/p/27202924
- https://zhuanlan.zhihu.com/p/78504109