The original image is split into a grid, and a classifier is run on each patch of the grid to determine whether the patch contains an object. Bounding boxes are then assigned to the patches that are classified as positive.
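The grid idea described above can be sketched in a few lines. This is only an illustration: `grid_patches`, `positive_patches`, and the `contains_object` callback are hypothetical names, and the callback stands in for whatever classifier is actually used.

```python
def grid_patches(width, height, rows, cols):
    """Yield (row, col, x0, y0, x1, y1) pixel bounds for every grid patch."""
    for r in range(rows):
        for c in range(cols):
            x0 = c * width // cols
            x1 = (c + 1) * width // cols
            y0 = r * height // rows
            y1 = (r + 1) * height // rows
            yield r, c, x0, y0, x1, y1


def positive_patches(width, height, rows, cols, contains_object):
    """Return the bounds of every patch the classifier marks as positive.

    contains_object is a hypothetical stand-in for the per-patch classifier:
    it receives (x0, y0, x1, y1) and returns True or False.
    """
    hits = []
    for _, _, x0, y0, x1, y1 in grid_patches(width, height, rows, cols):
        if contains_object((x0, y0, x1, y1)):
            hits.append((x0, y0, x1, y1))
    return hits
```

In a real detector the classifier and the box regression share one network pass rather than being run patch by patch; the loop here only makes the per-patch decision explicit.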
1. Yolo v1
The model is composed of 6 conv/leaky_relu/max_pool blocks, followed by 3 conv/leaky_relu blocks, 2 fully connected layers, 1 leaky_relu layer, and 1 fully connected layer. The output layer has 1470 nodes. The model has about 45 million parameters. The grid is made of 7 × 7 = 49 patches.
The output layer is composed of:
- Classifier nodes: 980 nodes. For each of the 49 patches, there are 20 classification nodes, one for each of the 20 classes (49 × 20 = 980).
- Confidence scores: 98 nodes. There is one confidence score for each bounding box (49 × 2 = 98).
- Bounding boxes: 392 nodes. 2 boxes are predicted for each patch, and each box is defined by 4 parameters: (x, y, w, h) (49 × 2 × 4 = 392).
(x, y) are the coordinates of the center of the box relative to the bounds of the grid cell. The width w and height h are relative to the whole image.
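The node counts above add up to 980 + 98 + 392 = 1470. A minimal sketch of how a flat output vector could be split accordingly follows. The segment order (classes, then confidences, then boxes), the row-major cell order, and the function names are assumptions made for illustration; a given implementation may lay the vector out differently.

```python
GRID = 7      # cells per side, so 7 x 7 = 49 patches
BOXES = 2     # boxes predicted per patch
CLASSES = 20  # classes per patch


def split_output(flat):
    """Split a flat 1470-value output into per-cell classes, confidences, boxes.

    Assumed layout: 980 class values, then 98 confidences, then 392 box values.
    """
    n_cells = GRID * GRID
    n_class = n_cells * CLASSES   # 980
    n_conf = n_cells * BOXES      # 98
    n_box = n_cells * BOXES * 4   # 392
    if len(flat) != n_class + n_conf + n_box:
        raise ValueError("expected %d values" % (n_class + n_conf + n_box))

    class_part = flat[:n_class]
    conf_part = flat[n_class:n_class + n_conf]
    box_part = flat[n_class + n_conf:]

    classes = [class_part[i * CLASSES:(i + 1) * CLASSES] for i in range(n_cells)]
    confs = [conf_part[i * BOXES:(i + 1) * BOXES] for i in range(n_cells)]
    boxes = [
        [box_part[(i * BOXES + b) * 4:(i * BOXES + b + 1) * 4] for b in range(BOXES)]
        for i in range(n_cells)
    ]
    return classes, confs, boxes


def center_in_image(row, col, x, y):
    """Convert a cell-relative center (x, y in [0, 1]) to image-relative units."""
    return (col + x) / GRID, (row + y) / GRID
```

Because the width and height are already relative to the whole image, only the center needs the cell offset added before scaling to pixels.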
2. Yolo v3
For Yolo v3, each grid cell predicts 3 bounding boxes at each detection scale.
3. Explanation
3.1 Confidence Score
The confidence score reflects how confident the model is that the box contains an object, and also how accurate it thinks the predicted box is.
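If no object is detected, the confidence score is 0. Otherwise, the confidence score should equal the intersection over union (IOU) between the predicted box and the ground truth.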
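As a rough sketch of how such a confidence target could be computed, assume it follows the original YOLO formulation: 0 when the cell holds no object, and otherwise the intersection over union (IOU) between the predicted box and the ground truth. Boxes are taken as (x, y, w, h) with (x, y) the center, all in the same units; the function names are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_center, y_center, width, height) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is clamped at zero so disjoint boxes give an empty intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, predicted_box, truth_box):
    """Target confidence: 0 without an object, else IOU(predicted, truth)."""
    if not has_object:
        return 0.0
    return iou(predicted_box, truth_box)
```

A perfectly placed box therefore has a target of 1, and the target falls toward 0 as the predicted box drifts away from the ground truth.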
3.2 Bounding Box
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.