For image classification tasks, a common choice of convolutional neural network (CNN) architecture is repeated blocks of convolution and max pooling layers, followed by two or more densely connected layers. The final dense layer has a softmax activation function and one node for each potential object category.
As an example, consider the VGG-16 model architecture, shown in the figure below.
We can summarize the layers of the VGG-16 model by executing the following line of code in the terminal:
- python -c 'from keras.applications.vgg16 import VGG16; VGG16().summary()'
Your output should appear as follows:
You will notice five blocks of (two to three) convolutional layers followed by a max pooling layer. The final max pooling layer is then flattened and followed by three densely connected layers. Notice that most of the parameters in the model belong to the fully connected layers!
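To make the parameter claim concrete, here is a rough tally in plain Python. It assumes the standard VGG-16 configuration (3 × 3 convolutions with 64, 128, 256, 512 and 512 filters per block, a 224 × 224 × 3 input, two 4096-unit dense layers and a 1000-way output). Those figures are not stated above, and the helper names are made up for illustration.

```python
# Rough parameter tally for VGG-16 (assumed standard ImageNet configuration).

def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolution layer."""
    return k * k * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases of a densely connected layer."""
    return n_in * n_out + n_out

# (number of conv layers, filters per layer) for each of the 5 blocks.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

conv_total = 0
in_ch = 3  # RGB input
for n_layers, filters in blocks:
    for _ in range(n_layers):
        conv_total += conv_params(in_ch, filters)
        in_ch = filters

# Five 2 x 2 max pooling layers shrink 224 x 224 down to 7 x 7 before flattening.
flat = 7 * 7 * 512
dense_total = (dense_params(flat, 4096)
               + dense_params(4096, 4096)
               + dense_params(4096, 1000))

total = conv_total + dense_total
print(f"conv: {conv_total:,}  dense: {dense_total:,}  "
      f"dense share: {dense_total / total:.1%}")
```

Under these assumptions the three dense layers hold roughly 89% of the parameters, and the first dense layer alone (fed by the flattened 7 × 7 × 512 tensor) accounts for most of that.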
As you can probably imagine, an architecture like this has the risk of overfitting to the training dataset. In practice, dropout layers are used to avoid overfitting.
Global Average Pooling
In the last few years, experts have turned to global average pooling (GAP) layers to minimize overfitting by reducing the total number of parameters in the model. Similar to max pooling layers, GAP layers are used to reduce the spatial dimensions of a three-dimensional tensor. However, GAP layers perform a more extreme type of dimensionality reduction, where a tensor with dimensions h × w × d is reduced in size to have dimensions 1 × 1 × d. GAP layers reduce each h × w feature map to a single number by simply taking the average of all hw values.
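As a minimal sketch of that reduction, the NumPy snippet below averages each feature map of a toy h × w × d tensor. The function name and the toy values are illustrative only. In Keras the same operation is provided by a pooling layer rather than written by hand.

```python
import numpy as np

def global_average_pool(x):
    """Reduce an (h, w, d) tensor to (1, 1, d) by averaging each h x w feature map."""
    return x.mean(axis=(0, 1), keepdims=True)

# Toy tensor with h = 2, w = 2, d = 3; the values are chosen only for illustration.
x = np.arange(12, dtype=float).reshape(2, 2, 3)

pooled = global_average_pool(x)
print(pooled.shape)    # (1, 1, 3)
print(pooled.ravel())  # one average per feature map
```

Each of the three feature maps here holds four values, so each output entry is the mean of four numbers. The layer itself has no trainable parameters.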
The first paper to propose GAP layers designed an architecture where the final max pooling layer contained one activation map for each image category in the dataset. The max pooling layer was then fed to a GAP layer, which yielded a vector with a single entry for each possible object in the classification task. The authors then applied a softmax activation function to yield the predicted probability of each class. If you peek at the original paper, I especially recommend checking out Section 3.2, titled “Global Average Pooling”.
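The pipeline described above (one activation map per class, then GAP, then softmax) can be sketched in a few lines of NumPy. The random maps, the 6 × 6 spatial size and the 10-class setting are placeholders, not values from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes = 10  # hypothetical number of image categories

# Stand-in for the final pooling layer's output: one 6 x 6 activation map per class.
maps = rng.normal(size=(6, 6, num_classes))

scores = maps.mean(axis=(0, 1))   # GAP: a single number per class
probs = softmax(scores)           # predicted probability of each class
predicted = int(np.argmax(probs))
print(predicted, probs.round(3))
```

Note that no dense layer appears anywhere in this sketch, which is exactly why the design has so few parameters to overfit with.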
The ResNet-50 model takes a less extreme approach; instead of getting rid of dense layers altogether, the GAP layer is followed by one densely connected layer with a softmax activation function that yields the predicted object classes.
In mid-2016, researchers at MIT demonstrated that CNNs with GAP layers (a.k.a. GAP-CNNs) that have been trained for a classification task can also be used for object localization. That is, a GAP-CNN not only tells us what object is contained in the image; it also tells us where the object is in the image, and through no additional work on our part! The localization is expressed as a heat map (referred to as a class activation map), where the color-coding scheme identifies regions that are relatively important for the GAP-CNN to perform the object identification task.
In the repository, I have explored the localization ability of the pre-trained ResNet-50 model, using the technique from this paper. The main idea is that each of the activation maps in the final layer preceding the GAP layer acts as a detector for a different pattern in the image, localized in space. To get the class activation map corresponding to an image, we need only transform these detected patterns into detected objects.
This transformation is done by noticing that each node in the GAP layer corresponds to a different activation map, and that the weights connecting the GAP layer to the final dense layer encode each activation map’s contribution to the predicted object class. To obtain the class activation map, we sum the contributions of each of the detected patterns in the activation maps, where detected patterns that are more important to the predicted object class are given more weight.
How the Code Runs
Let’s examine the ResNet-50 architecture by executing the following line of code in the terminal:
- python -c 'from keras.applications.resnet50 import ResNet50; ResNet50().summary()'
The final few lines of output should appear as follows (notice that, unlike the VGG-16 model, the majority of the trainable parameters are not located in the fully connected layers at the top of the network!):
The Activation, AveragePooling2D, and Dense layers towards the end of the network are of the most interest to us. Note that the AveragePooling2D layer is in fact a GAP layer!
We’ll begin with the Activation layer. This layer contains 2048 activation maps, each with dimensions 7 × 7. Let fk represent the k-th activation map, where k ∈ {1, …, 2048}.
The following AveragePooling2D GAP layer reduces the size of the preceding layer to (1,1,2048) by taking the average of each feature map. The next Flatten layer merely flattens the input, without resulting in any change to the information contained in the previous GAP layer.
The object category predicted by ResNet-50 corresponds to a single node in the final Dense layer, and this single node is connected to every node in the preceding Flatten layer. Let wk represent the weight connecting the k-th node in the Flatten layer to the output node corresponding to the predicted image category.
Then, in order to obtain the class activation map, we need only compute the sum
w1 ⋅ f1 + w2 ⋅ f2 + … + w2048 ⋅ f2048
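The weighted sum above can be sketched in NumPy as follows. Random arrays stand in for the real activation maps and dense-layer weights, and the class index is arbitrary. In practice both arrays would be read from the trained ResNet-50 model, which this sketch does not attempt to do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for quantities that would come from the trained network.
activation_maps = rng.random((7, 7, 2048))      # f_k: output of the Activation layer
dense_weights = rng.normal(size=(2048, 1000))   # weights linking Flatten to Dense
predicted_class = 42                            # hypothetical predicted class index

# w_k: the weights feeding the output node of the predicted class.
w = dense_weights[:, predicted_class]

# Class activation map: sum over k of w_k * f_k, giving one 7 x 7 map.
cam = activation_maps @ w
print(cam.shape)  # (7, 7)

# Sanity check: averaging the map over space recovers the class score
# (before the bias and softmax), because GAP and the dense layer are both linear.
gap_vector = activation_maps.mean(axis=(0, 1))
assert np.isclose(cam.mean(), gap_vector @ w)
```

The final check is one way to see why the map is meaningful: the class score is just the spatial average of the class activation map, so regions with large values are the ones pushing the prediction up.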
You can plot these class activation maps for any image of your choosing, to explore the localization ability of ResNet-50. Note that in order to permit comparison to the original image, bilinear upsampling is used to resize each activation map to 224 × 224. (This results in a class activation map with size 224 × 224.)
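For intuition, here is a hand-written bilinear upsampling of a single 7 × 7 map to 224 × 224 in NumPy. It is only an illustration: the function name is invented, the input is random, and it assumes a convention in which the corner pixels of input and output line up. A library resize routine may treat borders differently.

```python
import numpy as np

def bilinear_resize(a, out_h, out_w):
    """Bilinearly resize a 2-D array, aligning the corner pixels of input and output."""
    in_h, in_w = a.shape
    # Fractional source coordinates for every output row and column.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    # Interpolate along x on the two neighbouring rows, then along y.
    top = a[y0][:, x0] * (1 - wx) + a[y0][:, x1] * wx
    bottom = a[y1][:, x0] * (1 - wx) + a[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
cam = rng.random((7, 7))                     # stand-in for a 7 x 7 map
upsampled = bilinear_resize(cam, 224, 224)   # same size as the input image
print(upsampled.shape)  # (224, 224)
```

Because bilinear interpolation is linear, resizing each activation map and then taking the weighted sum gives the same result as taking the weighted sum first and resizing once, so either order can be used.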
If you’d like to use the repository’s code to do your own object localization, you need only download the repository.