The basic principle of object recognition and its implementation in Python

Overview

Object recognition is a general term describing a group of related computer vision tasks that involve recognizing objects in images.

Image classification involves predicting the class of an object in an image, while object localization involves identifying the location of one or more objects in an image and drawing a bounding box around each of them. Object recognition combines these two tasks to locate and classify one or more objects in an image, so when people speak of object recognition or object detection, they are usually referring to the same task.

Region-based convolutional neural network (R-CNN) is a family of convolutional neural network models designed for object detection. R-CNN is a two-stage detection algorithm. The first stage identifies a subset of regions in an image that are likely to contain objects. The second stage classifies the objects in each region.

There are four main variants in the R-CNN family: R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. Each variant attempts to optimize, speed up, or improve one or more of the algorithmic steps, but the overall flow remains essentially the same regardless of the variant. The basic process of object detection with R-CNN is as follows:
(Figure: the R-CNN detection pipeline)

  • Take an image as input and extract about 2000 region proposals (candidate bounding boxes that may contain objects) from it.
  • Warp (resize) each region proposal to a fixed size and pass it as input to the CNN.
  • The CNN extracts a fixed-length feature vector for each region proposal.
  • These features are used to classify the region proposals with class-specific linear SVMs.

CNN (convolutional neural network)

R-CNN builds on techniques such as CNNs (convolutional neural networks), linear regression, and support vector machines (SVMs) to perform object detection. To understand the basic principles of R-CNN, you first have to understand what a CNN is.

So what is a CNN? To answer that, it helps to understand why CNNs appeared in the first place. CNNs emerged to overcome the shortcomings of the BP (backpropagation) neural network when processing images. When a BP neural network processes an image, every pixel value is fed into the network as an independent feature, which inevitably discards the two-dimensional structure of the image. As a result, when a BP network is used to classify images, the object may not move or deform within the image, or the recognition will fail. Moreover, the neurons of a BP network are fully connected, which makes the weight matrices very large and the computation expensive. For these reasons, BP neural networks hit a bottleneck on image tasks, and CNNs were created to solve this series of problems.

A traditional BP neural network is structured as shown in the figure below.

(Figure: a traditional fully connected BP neural network)

A CNN only changes the function and form of some of the layers of a BP neural network and adds a few kinds of layers that the BP network does not have, so we can think of a CNN as an improved version of the BP neural network.

The following figure shows a typical CNN structure. As shown, this CNN has 7 layers, from left to right: a convolutional layer, a pooling layer, a convolutional layer, a pooling layer, two fully connected layers, and the output layer. (Note that the input features are not counted as a layer.)

(Figure: a typical 7-layer CNN architecture)

From the figure we can see that the convolutional and pooling layers perform feature extraction, and the extracted features are then fed into the fully connected layers (essentially a traditional BP neural network).

Convolutional layer

In the first convolutional layer of a CNN, a small local window (the filter, or kernel) is chosen and slid across the entire image. The pixels covered by the window are multiplied element-wise by the filter weights and accumulated, and the result is connected to a single node in the next layer. Assuming the image being scanned is a grayscale image (i.e. it has only one color channel), each filter is a two-dimensional matrix, and the convolution process can be illustrated by the following animation.

https://img-blog.csdnimg.cn/20200413223638467.gif#pic_center
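The same multiply-and-accumulate scan can be written in a few lines of NumPy. This is only a minimal sketch (stride 1, no padding), and the toy image and filter values are made up for illustration.

import numpy as np

def conv2d(image, kernel):
    # slide the filter over a single-channel image with stride 1 and no padding
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply the covered pixels by the filter weights and accumulate
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 grayscale image
kernel = np.array([[1, 0, -1]] * 3, dtype=float)  # toy 3x3 filter
print(conv2d(image, kernel))                      # 3x3 feature map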

If the image being scanned by the convolutional layer is a color image (i.e. it has three RGB channels), its pixels form a three-dimensional array, and the chosen filter is also three-dimensional, as shown in the figure.

(Figure: a three-dimensional filter for a three-channel image)

The multi-channel convolution process can likewise be illustrated with an animation.

Note that convolving an image with one filter (whether the image is single-channel or multi-channel) yields only a single feature map. The convolution process can be regarded as feature extraction, with each filter extracting one kind of feature from the image; multiple filters are therefore used to extract different features (for example, separate filters may respond to the pixel features of the red, green, and blue channels respectively).
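The multi-channel case can be sketched the same way: the filter carries one 3x3 slice per channel, the products from all channels are summed into one number, and a single filter still produces a single feature map. The shapes and random values below are purely illustrative.

import numpy as np

rgb = np.random.rand(3, 5, 5)   # toy 3-channel 5x5 image (channels first)
filt = np.random.rand(3, 3, 3)  # one filter: a 3x3 slice for each channel
out = np.zeros((3, 3))          # a single feature map
for i in range(3):
    for j in range(3):
        # sum over all channels and over the 3x3 window
        out[i, j] = np.sum(rgb[:, i:i+3, j:j+3] * filt)
print(out.shape)                # (3, 3): one filter yields one feature map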

Pooling layer

If the input is an image, the main function of the pooling layer is to compress (downsample) it. When a pooling layer is sandwiched between consecutive convolutional layers, it compresses the data produced by the preceding convolutional layer (you can also think of this as filtering), which helps reduce overfitting.

The working process of the pooling layer can also be illustrated with a simple animation.

https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbG5vdGVib29rLmdpdGh1Yi5pby9pbWcvQ05OL3Bvb2xmaWcuZ2lm

In the figure above, for each 3×3 window the largest value is selected as the corresponding element of the output matrix. For example, the largest value in the first 3×3 window of the input matrix is 5, so the first element of the output matrix is 5, and so on. Note that the windows scanned do not overlap, so no pixel is pooled more than once.
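Non-overlapping max pooling as described above can be sketched as follows; the 6x6 input and 3x3 window are arbitrary, and the stride equals the window size so no pixel is pooled twice.

import numpy as np

def max_pool(x, size):
    # non-overlapping windows: the stride equals the window size
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # keep the largest value inside each window
            out[i, j] = np.max(x[i*size:(i+1)*size, j*size:(j+1)*size])
    return out

x = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 input
print(max_pool(x, 3))                         # 2x2 output, one maximum per window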

Fully connected layer

The fully connected layers usually sit at the end of the convolutional neural network and are connected in the same way as the neurons of a traditional neural network. They take the output of the preceding convolutional and pooling layers as the features for training, and the backpropagation algorithm is used to adjust the weights and thresholds of the fully connected layers as well as the filter weights of the earlier convolutional layers.

(Figure: fully connected layers at the end of the network)

To sum up, the convolutional neural network (CNN) is a variant of the BP neural network. Instead of feeding the raw pixel values of the image to the model as features, as a BP network does, it feeds the model the values produced by a series of operations (convolution and pooling) on those pixels. The advantage of this approach is that it preserves the multi-dimensional structure of the image and improves the accuracy of the model's classification.
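As a rough illustration, the 7-layer structure described above (convolution, pooling, convolution, pooling, two fully connected layers, output layer) could be written in PyTorch as follows. The channel counts, kernel sizes, input resolution, and class count are arbitrary choices for the sketch, not the exact network in the figure, and activation functions are omitted to keep the sketch aligned with the seven layers listed.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer
            nn.MaxPool2d(2),                  # pooling layer
            nn.Conv2d(6, 16, kernel_size=5),  # convolutional layer
            nn.MaxPool2d(2),                  # pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # fully connected layer
            nn.Linear(120, 84),               # fully connected layer
            nn.Linear(84, num_classes),       # output layer
        )
    def forward(self, x):
        return self.classifier(self.features(x))

cnn = SimpleCNN()
print(cnn(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])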

Find bounding boxes that may contain objects in an image


Region proposals are bounding boxes that may contain objects, represented as tuples (x, y, h, w), where (x, y) are the coordinates of the center of the bounding box and (h, w) are its height and width. These region proposals are computed by an algorithm called selective search; for each image, about 2000 region proposals are extracted.
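Selective search is available through OpenCV's contrib modules (the opencv-contrib-python package), so the proposal step might look roughly like the sketch below. Note that OpenCV returns each box as (x, y, w, h) with (x, y) at the top-left corner rather than the center, and the image path here is just the test image used later in this article.

import cv2

img = cv2.imread("images/8433365521_9252889f9a_z.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # the "quality" mode yields more proposals but is slower
rects = ss.process()              # array of (x, y, w, h) candidate boxes
proposals = rects[:2000]          # keep roughly the first 2000 proposals, as in R-CNN
print(len(rects), len(proposals))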

Extract CNN features from region proposals

(Figure: extracting CNN features from each region proposal)

To train a CNN for feature extraction, an architecture such as VGG-16 is initialized with weights pretrained on the ImageNet dataset, and its 1000-class output layer is cut off. When a region proposal image is passed to this network (because of the fully connected layers, each region proposal must first be warped to a fixed size to satisfy the CNN's input requirements), we obtain a 4096-dimensional feature vector.

When all (roughly 2000) region proposals have been passed through the CNN, we end up with a feature matrix of about 2000 × 4096 (one 4096-dimensional vector per proposal). The next step is to classify each region and identify what kind of object it contains.
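A sketch of this feature-extraction step with torchvision's pretrained VGG-16: the final 1000-class layer is cut off so the network outputs a 4096-dimensional vector, and each proposal crop is warped to the fixed 224x224 input the fully connected layers require. The normalization constants are the standard ImageNet values, and a whole image stands in for a single proposal crop here.

import torch
import torchvision
import torchvision.transforms as T
from PIL import Image

vgg = torchvision.models.vgg16(pretrained=True)
# drop the final Linear(4096, 1000) layer; the output is now the 4096-d feature vector
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()
preprocess = T.Compose([
    T.Resize((224, 224)),  # warp the proposal to the fixed input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
crop = Image.open("images/8433365521_9252889f9a_z.jpg").convert("RGB")  # stand-in for one proposal crop
with torch.no_grad():
    feature = vgg(preprocess(crop).unsqueeze(0))
print(feature.shape)  # torch.Size([1, 4096])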

Use the extracted features to classify objects

Each row of the feature matrix, i.e. the feature vector of each region, is sent to a set of binary SVM classifiers, each of which outputs the probability that the region belongs to its class. With 20 binary SVMs, this produces a 2000 × 20 matrix of probabilities that each region proposal belongs to each class. Each region proposal is assigned to the class with the highest probability, so all 2000 region proposals in the image end up labeled with a class and the probability of belonging to that class. Among so many region proposals, many are redundant, overlapping bounding boxes that need to be removed. To achieve this, the non-maximum suppression algorithm is used.
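Conceptually, the scoring step is just a matrix product between the 2000 × 4096 feature matrix and one linear SVM per class; the weights below are random stand-ins purely to show the shapes involved.

import numpy as np

num_proposals, feat_dim, num_classes = 2000, 4096, 20
features = np.random.rand(num_proposals, feat_dim)  # stand-in for the CNN feature matrix
W = np.random.rand(num_classes, feat_dim)           # one weight vector per class-specific SVM
b = np.random.rand(num_classes)
scores = features @ W.T + b                         # (2000, 20) matrix of class scores
best_class = scores.argmax(axis=1)                  # highest-scoring class for each proposal
best_score = scores.max(axis=1)
print(scores.shape, best_class.shape)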
(Figure: final detections after non-maximum suppression)

Non-maximum suppression is a greedy algorithm. It selects the box with the highest SVM probability and then computes the IoU between that box and every other bounding box of the same class. Boxes whose IoU exceeds a threshold (you can choose the threshold yourself; 0.7 is a common choice) are deleted; in other words, bounding boxes that overlap the selected box very heavily are removed. The next highest-scoring remaining box is then selected, and so on, until all overlapping bounding boxes of that class have been processed. Doing this for every class gives the result shown above.
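A minimal sketch of non-maximum suppression for a single class, assuming each box is given as (x1, y1, x2, y2) corners together with its SVM score, and using 0.7 as the IoU threshold:

import numpy as np

def iou(box, boxes):
    # intersection-over-union between one box and an array of boxes, all as (x1, y1, x2, y2)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.7):
    order = np.argsort(scores)[::-1]  # indices sorted by score, highest first
    keep = []
    while len(order) > 0:
        best = order[0]               # keep the highest-scoring remaining box
        keep.append(best)
        rest = order[1:]
        # discard boxes that overlap the kept box by more than the threshold
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed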

Code

The original image selected for object recognition:

(Figure: the original input image)

Here we use the Faster R-CNN object detector provided by PyTorch (torchvision) for object recognition. First, install PyTorch and torchvision.

pip install torch torchvision

Start writing the code. First, import the required packages.

from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision.transforms as T
import torchvision
import numpy as np
import cv2

Download the pretrained model, a Faster R-CNN with a ResNet-50 FPN backbone, which already comes with trained weight parameters.

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)  # downloads COCO-pretrained weights on first use
model.eval()  # switch the model to inference mode


Define the category names given in the official PyTorch documentation (the COCO instance categories):

COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

Define a new function get_prediction(), which takes an image path and recognizes the objects in that image. It first loads the image from the path, converts it to a tensor, and passes the tensor to the model to obtain the recognition result pred.

def get_prediction(img_path, threshold):
    img = Image.open(img_path)             # load the image from disk
    transform = T.Compose([T.ToTensor()])  # convert the PIL image to a tensor
    img = transform(img)
    pred = model([img])                    # run the detector on a single-image batch

From pred, extract the predicted class pred_class of each region proposal, the bounding-box coordinates pred_boxes of each object, and the classification probability pred_score, and filter by the threshold (regions whose class probability is below the threshold are discarded). Then print the results and return them.

    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
    # convert the box corners to integer pixel coordinates so cv2 can draw them
    pred_boxes = [[(int(i[0]), int(i[1])), (int(i[2]), int(i[3]))] for i in list(pred[0]['boxes'].detach().numpy())]
    pred_score = list(pred[0]['scores'].detach().numpy())
    # index of the last score above the threshold (scores are returned in descending order)
    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
    pred_boxes = pred_boxes[:pred_t+1]
    pred_class = pred_class[:pred_t+1]
    print("pred_class:", pred_class)
    print("pred_boxes:", pred_boxes)
    return pred_boxes, pred_class

Define a new function object_detection_api(), which calls the get_prediction() function defined above to get the recognition results for the image and then draws them.

def object_detection_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    boxes, pred_cls = get_prediction(img_path, threshold)  # boxes and class labels above the threshold
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB for matplotlib
    for i in range(len(boxes)):
        cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)
    plt.imshow(img)
    plt.show()

Finally, call the object_detection_api() function with the path of the image to be recognized to perform object recognition on it.

if __name__ == '__main__':
    object_detection_api(img_path="images/8433365521_9252889f9a_z.jpg")

The recognition results are printed to the console and then drawn with matplotlib.

(Figure: the recognition results printed to the console)

(Figure: the image with detected objects, bounding boxes, and class labels)
