Hikvision

Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2

Our work on object detection is based on Faster R-CNN. We design and validate the following improvements: 

* Better network. We find that the identity-mapping variant of ResNet-101 is superior to the original version for object detection. 

* Better RPN proposals. We propose a novel cascade RPN that refines the scores and locations of proposals. Constraining the negative/positive anchor ratio further increases proposal recall dramatically. 

* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by about 0.5 points. 

* Training strategies. To address the class imbalance problem, we design a balanced sampling strategy over the different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, such as multi-scale training and online hard example mining, are also applied. 

* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied. 
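
As an illustration of the weighted box voting step mentioned above, the following is a minimal pure-Python sketch. It assumes the common formulation in which each box kept by NMS is replaced by the score-weighted average of all candidate boxes that overlap it above an IoU threshold. The function names and the 0.5 threshold are hypothetical and are not taken from the team's implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_box_voting(kept, candidates, iou_thresh=0.5):
    """Refine each kept box with score-weighted votes from overlapping candidates.

    kept:       list of (box, score) pairs that survived NMS
    candidates: list of (box, score) pairs before NMS, e.g. pooled over
                scales and horizontal flips
    iou_thresh: assumed overlap threshold for a candidate to cast a vote
    """
    refined = []
    for box, score in kept:
        voters = [(b, s) for b, s in candidates if iou(box, b) >= iou_thresh]
        total = sum(s for _, s in voters)
        if total <= 0:
            # No overlapping candidates: keep the box unchanged.
            refined.append((tuple(box), score))
            continue
        # Each coordinate is the score-weighted mean over the voters.
        voted = tuple(
            sum(b[i] * s for b, s in voters) / total for i in range(4)
        )
        refined.append((voted, score))
    return refined
```

In this sketch the kept box retains its original score; only its coordinates are adjusted by the votes.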

The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2. 


[CLS-LOC] 

A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top-5 classification error rate is 3.46% and the localization error is 8.8% on the validation set. 



Trimps-Soushen

Object detection (DET) 

We use several pre-trained models, including ResNet, Inception and Inception-ResNet, among others. Taking the predicted boxes from our best model as region proposals, we average the softmax scores and the box regression outputs across all models. Other improvements include annotation refinement, box voting and feature maxout. 
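
The averaging step can be sketched as follows. This is a simplified illustration, not the team's code: it assumes every model scores the same proposals in the same order and that each model outputs one regressed box per proposal (real detectors often regress one box per class). The function name `average_ensemble` is hypothetical.

```python
def average_ensemble(per_model_scores, per_model_boxes):
    """Average softmax scores and box regression outputs across models.

    per_model_scores: list over models; each entry is a list over proposals
                      of per-class softmax scores
    per_model_boxes:  list over models; each entry is a list over proposals
                      of regressed boxes (x1, y1, x2, y2)
    Assumes all models were run on the same proposals in the same order.
    """
    n_models = len(per_model_scores)
    if n_models == 0 or n_models != len(per_model_boxes):
        raise ValueError("need matching, non-empty score and box lists")

    # zip(*per_model_scores) groups the outputs of all models per proposal;
    # the inner zip then groups values per class (or per box coordinate).
    avg_scores = [
        [sum(vals) / n_models for vals in zip(*proposal_scores)]
        for proposal_scores in zip(*per_model_scores)
    ]
    avg_boxes = [
        [sum(vals) / n_models for vals in zip(*proposal_boxes)]
        for proposal_boxes in zip(*per_model_boxes)
    ]
    return avg_scores, avg_boxes
```

Because all models share the proposals from the best model, their outputs align one-to-one and can be averaged directly, without any box matching between models.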


Object classification/localization (CLS-LOC) 

Using image classification models such as Inception, Inception-ResNet, ResNet and Wide Residual Network (WRN), we predict the class labels of the image. We then follow the Faster R-CNN framework to predict bounding boxes based on those labels. Results from multiple models are fused in different ways, using the model accuracy as weights.
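
One possible form of accuracy-weighted fusion is a weighted average of per-model class probabilities, sketched below. The description does not specify which fusion schemes were actually used, so this is only an assumed example; `fuse_by_accuracy` and `top_k` are hypothetical names.

```python
def fuse_by_accuracy(model_probs, model_accuracies):
    """Fuse class probabilities with weights proportional to model accuracy.

    model_probs:      list over models of per-class probability lists
    model_accuracies: list over models of validation accuracies (assumed
                      to be positive and measured on held-out data)
    """
    if not model_probs or len(model_probs) != len(model_accuracies):
        raise ValueError("need one accuracy per model")
    total = sum(model_accuracies)
    weights = [a / total for a in model_accuracies]  # normalized to sum to 1
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(n_classes)
    ]


def top_k(probs, k=5):
    """Indices of the k highest-scoring classes, best first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
```

For example, `top_k(fuse_by_accuracy(probs, accs), k=5)` would give the five labels that are then passed to the localization stage in this sketch.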