The core innovations of Faster R-CNN are the RPN and its companion classifier network. Structurally, both networks are quite simple; the innovation lies mainly in the idea: generate anchors on the original image and extract the corresponding ROIs from the feature map. The hard part is implementing that idea.
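To make the anchor idea concrete, here is a minimal sketch of anchor generation. The helper make_anchors is hypothetical (not part of keras_frcnn), and the stride, scales, and ratios are assumed from a typical keras_frcnn config (rpn_stride = 16, anchor_box_scales = [128, 256, 512], three aspect ratios). With a stride of 16, a roughly 600x600 input maps to a 37x37 feature map, and each of the 37x37 positions owns 9 anchors on the original image:

import numpy as np

# Sketch only: enumerate the 9 anchors owned by every feature-map cell,
# expressed in original-image coordinates (x1, y1, x2, y2).
def make_anchors(feat_w=37, feat_h=37, stride=16,
                 scales=(128, 256, 512), ratios=((1, 1), (1, 2), (2, 1))):
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # centre of this feature-map cell, mapped back onto the input image
            cx, cy = ix * stride + stride / 2, iy * stride + stride / 2
            for scale in scales:
                for rx, ry in ratios:
                    w, h = scale * rx, scale * ry
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors().shape)  # (12321, 4), i.e. 37 * 37 * 9 anchors

Each feature-map cell thus corresponds to a 16x16 patch of the input, and the 9 anchors centred on it cover several scales and aspect ratios.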
The code for the RPN and the classifier lives inside the file of the corresponding base network. Taking VGG as an example, both networks are defined in keras_frcnn/vgg.py.
The code, with comments, is as follows:
# Input:  base_layers (37, 37, 512), num_anchors = 9   (600 / 16 = 37)
# Output: [x_class, x_regr, base_layers]  (37, 37, 9 + 36 + 512)
#   x_class:     (37, 37, 9), the binary (object vs. background) score of each anchor at each position
#   x_regr:      (37, 37, 36 = 9 * 4), linear activation, the regression parameters of each anchor
#   base_layers: (37, 37, 512), the VGG feature map, passed through unchanged
def rpn(base_layers, num_anchors):
    # a 3x3 convolution first, producing (37, 37, 256)
    x = Conv2D(256, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)

    # 9 (1, 1) filters convolved over (37, 37, 256) give (37, 37, 9):
    # the binary classification result for each anchor at each position
    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)

    # 36 (1, 1) filters convolved over (37, 37, 256) give (37, 37, 36 = 9 * 4):
    # with a linear activation, these are the regression parameters of each anchor
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]


# Input:  base_layers, the VGG feature map (37, 37, 512)
#         input_rois, the ROIs on one image's feature map (the proposals derived from the anchors)
#         num_rois, the number of ROIs processed per pass
#         nb_classes, the total number of annotated classes: 20 foreground classes + 1 background class
# Output: [out_class, out_regr]
#         out_class: the classification of each ROI
#         out_regr:  the regression parameters of each ROI
# Classification and regression share the same trunk; both take the pooled ROIs as input.
# The regression here runs bounding-box regression on the proposals a second time,
# yielding higher-precision boxes.
def classifier(base_layers, input_rois, num_rois, nb_classes=21, trainable=False):

    # compile times on theano tend to be very high, so we use smaller ROI pooling regions to workaround
    if K.backend() == 'tensorflow':
        pooling_regions = 7
        input_shape = (num_rois, 7, 7, 512)
    elif K.backend() == 'theano':
        pooling_regions = 7
        input_shape = (num_rois, 512, 7, 7)

    # RoiPoolingConv is a custom Keras layer. It rescales each ROI's patch of the
    # feature map, much like a pooling layer mapping (4, 4) to (1, 1), so that
    # ROIs of different sizes all come out at a fixed (7, 7).
    out_roi_pool = RoiPoolingConv(pooling_regions, num_rois)([base_layers, input_rois])

    # out_roi_pool has shape (1, num_rois, pool_size, pool_size, nb_channels).
    # TimeDistributed applies a layer to every temporal slice of an input, so each
    # of the following layers runs once per pooled ROI, whose shape is
    # (pool_size, pool_size, nb_channels) = (7, 7, 512).
    out = TimeDistributed(Flatten(name='flatten'))(out_roi_pool)
    # two fully connected layers produce the 4096-dim vector that feeds the softmax
    out = TimeDistributed(Dense(4096, activation='relu', name='fc1'))(out)
    out = TimeDistributed(Dense(4096, activation='relu', name='fc2'))(out)

    out_class = TimeDistributed(Dense(nb_classes, activation='softmax', kernel_initializer='zero'),
                                name='dense_class_{}'.format(nb_classes))(out)
    # note: no regression target for bg class
    out_regr = TimeDistributed(Dense(4 * (nb_classes - 1), activation='linear', kernel_initializer='zero'),
                               name='dense_regress_{}'.format(nb_classes))(out)

    return [out_class, out_regr]
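For context, this is roughly how the two functions are wired into trainable models in a train_frcnn.py-style script. This is a sketch under assumptions: nn_base is taken to be the VGG backbone function defined in the same vgg.py, and num_rois = 32 is an assumed config value, not something fixed by the code above:

from keras.layers import Input
from keras.models import Model

num_anchors = 9   # 3 scales x 3 ratios
num_rois = 32     # assumed: ROIs processed per forward pass

img_input = Input(shape=(None, None, 3))
roi_input = Input(shape=(None, 4))

shared_layers = nn_base(img_input, trainable=True)   # VGG conv layers from vgg.py
rpn_layers = rpn(shared_layers, num_anchors)         # [x_class, x_regr, base_layers]
classifier_layers = classifier(shared_layers, roi_input, num_rois, nb_classes=21)

model_rpn = Model(img_input, rpn_layers[:2])                          # trained on anchor labels
model_classifier = Model([img_input, roi_input], classifier_layers)   # trained on ROI labels

Because both models are built on the same shared_layers, the VGG feature map is computed once and reused: the RPN scores and regresses every anchor, while the classifier refines only the ROIs handed to it through roi_input.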