Detectron Model Zoo and Baselines

卫甫

2023-12-01

Reference: Detectron Model Zoo and Baselines - Cloud+ Community (云+社区) - Tencent Cloud (腾讯云)

Introduction

This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.

Common Settings and Notes

All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
All baselines were trained using 8 GPU data parallel sync SGD with a minibatch size of either 8 or 16 images (see the im/gpu column).
For training, only horizontal flipping data augmentation was used.
For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
Inference times are often expressed as "X + Y", in which X is time taken in reasonably well-optimized GPU code and Y is time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
The model id column is provided for ease of reference.
To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash.
All models and results below are on the COCO dataset.
Baseline models and results for the Cityscapes dataset are coming soon!
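The integrity check described above (appending .md5sum to a download URL) can be sketched as follows. This is a minimal illustration, not part of Detectron: the helper names are hypothetical, and we assume the .md5sum file starts with the hex digest, as in the output of the md5sum command-line tool. Downloading is left to the reader; the sketch only builds the hash URL and compares a local file against the hash text.

```python
import hashlib


def md5sum_url(download_url: str) -> str:
    """Return the URL of the md5 hash file for a model zoo download URL."""
    return download_url + ".md5sum"


def file_md5(path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex md5 digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, md5sum_text: str) -> bool:
    """Compare a local file against the contents of a downloaded .md5sum file.

    Assumption: the digest is the first whitespace-separated token of the
    text, as in the output of the `md5sum` tool.
    """
    expected = md5sum_text.split()[0].lower()
    return file_md5(path) == expected
```

Reading the file in chunks keeps memory use flat, which matters for the larger model files.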

Training Schedules

We use three training schedules, indicated by the lr schd column in the tables below.

1x: For minibatch size 16, this schedule starts at a LR of 0.02, multiplies the LR by 0.1 after 60k and again after 80k iterations, and terminates at 90k iterations. This schedule results in 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, but also extends the duration of the first learning rate. With a minibatch size of 16, it multiplies the LR by 0.1 at 100k and again at 120k iterations, and ends after 130k iterations.

All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
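The three schedules and the minibatch adjustment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Detectron implementation: the function names are hypothetical, the 2x change points are derived by doubling the 1x values, the minibatch adjustment is read as linear scaling (LR proportional to minibatch size, iteration counts inversely proportional), and the warm-up starting factor of 1/3 is an assumption because this document only states the warm-up length.

```python
# Step boundaries and total length for minibatch size 16, taken from the text.
SCHEDULES = {
    "1x": {"steps": (60_000, 80_000), "max_iter": 90_000},
    "2x": {"steps": (120_000, 160_000), "max_iter": 180_000},  # 1x doubled
    "s1x": {"steps": (100_000, 120_000), "max_iter": 130_000},
}
BASE_LR = 0.02                   # base LR for minibatch size 16
REF_MINIBATCH = 16
WARMUP_ITERS = 500
WARMUP_START_FACTOR = 1.0 / 3.0  # assumption: not stated in this document
NUM_TRAIN_IMAGES = 118_287


def scaled_schedule(name, minibatch=16):
    """Linear scaling: LR grows with the minibatch, iterations shrink with it."""
    k = minibatch / REF_MINIBATCH
    sched = SCHEDULES[name]
    return {
        "base_lr": BASE_LR * k,
        "steps": tuple(round(s / k) for s in sched["steps"]),
        "max_iter": round(sched["max_iter"] / k),
    }


def lr_at(iteration, name="1x", minibatch=16):
    """Learning rate at a given iteration, including the linear warm-up."""
    sched = scaled_schedule(name, minibatch)
    # Multiply by 0.1 once for every change point already reached.
    lr = sched["base_lr"] * 0.1 ** sum(iteration >= s for s in sched["steps"])
    if iteration < WARMUP_ITERS:
        alpha = iteration / WARMUP_ITERS
        lr *= WARMUP_START_FACTOR * (1 - alpha) + alpha
    return lr


def epochs(name, minibatch=16):
    """Number of passes over the 118,287 training images."""
    sched = scaled_schedule(name, minibatch)
    return sched["max_iter"] * minibatch / NUM_TRAIN_IMAGES
```

For example, `epochs("1x")` evaluates to about 12.17, matching the figure quoted for the 1x schedule, and `scaled_schedule("1x", minibatch=8)` gives a base LR of 0.01 over 180k iterations, which covers the same number of epochs.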

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

ImageNet Pretrained Models

The backbone models pretrained on ImageNet are available in the format used by Detectron. Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.

R-50.pkl: converted copy of MSRA's original ResNet-50 model
R-101.pkl: converted copy of MSRA's original ResNet-101 model
X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)

Log Files

Training and inference logs are available for most models in the model zoo.

Proposal, Box, and Mask Detection Baselines

RPN Proposal Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	RPN	1x	2	4.3	0.187	4.7	0.113	-	-	-	51.6	35998355	model \| props: 1, 2, 3
R-50-FPN	RPN	1x	2	6.4	0.416	10.4	0.080	-	-	-	57.2	35998814	model \| props: 1, 2, 3
R-101-FPN	RPN	1x	2	8.1	0.503	12.6	0.108	-	-	-	58.2	35998887	model \| props: 1, 2, 3
X-101-64x4d-FPN	RPN	1x	2	11.5	1.395	34.9	0.292	-	-	-	59.4	35998956	model \| props: 1, 2, 3
X-101-32x8d-FPN	RPN	1x	2	11.6	1.102	27.6	0.222	-	-	-	59.5	36760102	model \| props: 1, 2, 3

Notes:

Inference time only includes RPN proposal generation.
"prop. AR" is proposal average recall at 1000 proposals per image.
Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.

Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	Fast	1x	1	6.0	0.456	22.8	0.241 + 0.003	34.4	-	-	-	36224013	model \| boxes
R-50-C4	Fast	2x	1	6.0	0.453	45.3	0.241 + 0.003	35.6	-	-	-	36224046	model \| boxes
R-50-FPN	Fast	1x	2	6.0	0.285	7.1	0.076 + 0.004	36.4	-	-	-	36225147	model \| boxes
R-50-FPN	Fast	2x	2	6.0	0.287	14.4	0.077 + 0.004	36.8	-	-	-	36225249	model \| boxes
R-101-FPN	Fast	1x	2	7.7	0.448	11.2	0.102 + 0.003	38.5	-	-	-	36228880	model \| boxes
R-101-FPN	Fast	2x	2	7.7	0.449	22.5	0.103 + 0.004	39.0	-	-	-	36228933	model \| boxes
X-101-64x4d-FPN	Fast	1x	1	6.3	0.994	49.7	0.292 + 0.003	40.4	-	-	-	36226250	model \| boxes
X-101-64x4d-FPN	Fast	2x	1	6.3	0.980	98.0	0.291 + 0.003	39.8	-	-	-	36226326	model \| boxes
X-101-32x8d-FPN	Fast	1x	1	6.4	0.721	36.1	0.217 + 0.003	40.6	-	-	-	37119777	model \| boxes
X-101-32x8d-FPN	Fast	2x	1	6.4	0.720	72.0	0.217 + 0.003	39.7	-	-	-	37121469	model \| boxes
R-50-C4	Mask	1x	1	6.4	0.466	23.3	0.252 + 0.020	35.5	31.3	-	-	36224121	model \| boxes \| masks
R-50-C4	Mask	2x	1	6.4	0.464	46.4	0.253 + 0.019	36.9	32.5	-	-	36224151	model \| boxes \| masks
R-50-FPN	Mask	1x	2	7.9	0.377	9.4	0.082 + 0.019	37.3	33.7	-	-	36225401	model \| boxes \| masks
R-50-FPN	Mask	2x	2	7.9	0.377	18.9	0.083 + 0.018	37.7	34.0	-	-	36225732	model \| boxes \| masks
R-101-FPN	Mask	1x	2	9.6	0.539	13.5	0.111 + 0.018	39.4	35.6	-	-	36229407	model \| boxes \| masks
R-101-FPN	Mask	2x	2	9.6	0.537	26.9	0.109 + 0.016	40.0	35.9	-	-	36229740	model \| boxes \| masks
X-101-64x4d-FPN	Mask	1x	1	7.3	1.036	51.8	0.292 + 0.016	41.3	37.0	-	-	36226382	model \| boxes \| masks
X-101-64x4d-FPN	Mask	2x	1	7.3	1.035	103.5	0.292 + 0.014	41.1	36.6	-	-	36672114	model \| boxes \| masks
X-101-32x8d-FPN	Mask	1x	1	7.4	0.766	38.3	0.223 + 0.017	41.3	37.0	-	-	37121516	model \| boxes \| masks
X-101-32x8d-FPN	Mask	2x	1	7.4	0.765	76.5	0.222 + 0.014	40.7	36.3	-	-	37121596	model \| boxes \| masks

Notes:

Each row uses the precomputed RPN proposals from the row of the RPN Proposal Baselines table above that has the same backbone.
Inference time excludes proposal generation.
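The two train time columns in the table above appear to be linked through the schedule length: total hours are roughly s/iter times the number of iterations. The sketch below checks this on a few rows. It assumes that the iteration count scales inversely with the minibatch size (8 GPUs times im/gpu), so a 1x run at im/gpu 1 takes 180k iterations; the function names are hypothetical.

```python
NUM_GPUS = 8
ITERS_AT_MINIBATCH_16 = {"1x": 90_000, "2x": 180_000, "s1x": 130_000}


def schedule_iterations(lr_schd, im_per_gpu):
    """Iterations for a schedule, assuming inverse scaling with minibatch size."""
    minibatch = NUM_GPUS * im_per_gpu
    return ITERS_AT_MINIBATCH_16[lr_schd] * 16 // minibatch


def total_train_hours(sec_per_iter, lr_schd, im_per_gpu):
    """Estimated wall-clock training time in hours."""
    return sec_per_iter * schedule_iterations(lr_schd, im_per_gpu) / 3600.0


# (backbone, type, lr schd, im/gpu, s/iter, reported total hr) from the table.
ROWS = [
    ("R-50-C4", "Fast", "1x", 1, 0.456, 22.8),
    ("R-50-FPN", "Fast", "1x", 2, 0.285, 7.1),
    ("R-50-FPN", "Fast", "2x", 2, 0.287, 14.4),
    ("X-101-64x4d-FPN", "Fast", "2x", 1, 0.980, 98.0),
]

for backbone, kind, schd, im_gpu, s_iter, reported in ROWS:
    estimate = total_train_hours(s_iter, schd, im_gpu)
    print(f"{backbone} {kind} {schd}: estimated {estimate:.2f} hr, reported {reported} hr")
```

The estimates agree with the reported totals to within rounding, which is a convenient way to estimate training time for a different schedule from the s/iter column alone.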

End-to-End Faster & Mask R-CNN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-C4	Faster	1x	1	6.3	0.566	28.3	0.167 + 0.003	34.8	-	-	-	35857197	model \| boxes
R-50-C4	Faster	2x	1	6.3	0.569	56.9	0.174 + 0.003	36.5	-	-	-	35857281	model \| boxes
R-50-FPN	Faster	1x	2	7.2	0.544	13.6	0.093 + 0.004	36.7	-	-	-	35857345	model \| boxes
R-50-FPN	Faster	2x	2	7.2	0.546	27.3	0.092 + 0.004	37.9	-	-	-	35857389	model \| boxes
R-101-FPN	Faster	1x	2	8.9	0.647	16.2	0.120 + 0.004	39.4	-	-	-	35857890	model \| boxes
R-101-FPN	Faster	2x	2	8.9	0.647	32.4	0.119 + 0.004	39.8	-	-	-	35857952	model \| boxes
X-101-64x4d-FPN	Faster	1x	1	6.9	1.057	52.9	0.305 + 0.003	41.5	-	-	-	35858015	model \| boxes
X-101-64x4d-FPN	Faster	2x	1	6.9	1.055	105.5	0.304 + 0.003	40.8	-	-	-	35858198	model \| boxes
X-101-32x8d-FPN	Faster	1x	1	7.0	0.799	40.0	0.233 + 0.004	41.3	-	-	-	36761737	model \| boxes
X-101-32x8d-FPN	Faster	2x	1	7.0	0.800	80.0	0.233 + 0.003	40.6	-	-	-	36761786	model \| boxes
R-50-C4	Mask	1x	1	6.6	0.620	31.0	0.181 + 0.018	35.8	31.4	-	-	35858791	model \| boxes \| masks
R-50-C4	Mask	2x	1	6.6	0.620	62.0	0.182 + 0.017	37.8	32.8	-	-	35858828	model \| boxes \| masks
R-50-FPN	Mask	1x	2	8.6	0.889	22.2	0.099 + 0.019	37.7	33.9	-	-	35858933	model \| boxes \| masks
R-50-FPN	Mask	2x	2	8.6	0.897	44.9	0.099 + 0.018	38.6	34.5	-	-	35859007	model \| boxes \| masks
R-101-FPN	Mask	1x	2	10.2	1.008	25.2	0.126 + 0.018	40.0	35.9	-	-	35861795	model \| boxes \| masks
R-101-FPN	Mask	2x	2	10.2	0.993	49.7	0.126 + 0.017	40.9	36.4	-	-	35861858	model \| boxes \| masks
X-101-64x4d-FPN	Mask	1x	1	7.6	1.217	60.9	0.309 + 0.018	42.4	37.5	-	-	36494496	model \| boxes \| masks
X-101-64x4d-FPN	Mask	2x	1	7.6	1.210	121.0	0.309 + 0.015	42.2	37.2	-	-	35859745	model \| boxes \| masks
X-101-32x8d-FPN	Mask	1x	1	7.7	0.961	48.1	0.239 + 0.019	42.1	37.3	-	-	36761843	model \| boxes \| masks
X-101-32x8d-FPN	Mask	2x	1	7.7	0.975	97.5	0.240 + 0.016	41.7	36.9	-	-	36762092	model \| boxes \| masks

Notes:

For these models, RPN and the detector are trained jointly and end-to-end.
Inference time is fully image-to-detections, including proposal generation.
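When comparing rows, it can help to turn the "X + Y" inference time cells (GPU time plus CPU time, as described under Common Settings and Notes) into a single number. The following is a small sketch with hypothetical helper names; cells without a "+" are treated as GPU time only.

```python
def parse_inference_time(cell):
    """Split an 'X + Y' cell into (gpu_seconds, cpu_seconds)."""
    parts = [float(p) for p in cell.split("+")]
    gpu = parts[0]
    cpu = parts[1] if len(parts) > 1 else 0.0
    return gpu, cpu


def images_per_second(cell):
    """End-to-end throughput implied by a single inference time cell."""
    gpu, cpu = parse_inference_time(cell)
    return 1.0 / (gpu + cpu)


# R-50-C4 Faster 1x from the table above.
gpu, cpu = parse_inference_time("0.167 + 0.003")
print(f"GPU {gpu:.3f} s, CPU {cpu:.3f} s, {images_per_second('0.167 + 0.003'):.2f} im/s")
```

Keeping the two components separate is useful because the CPU part is the one that additional engineering could reduce.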

RetinaNet Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	RetinaNet	1x	2	6.8	0.483	12.1	0.125	35.7	-	-	-	36768636	model \| boxes
R-50-FPN	RetinaNet	2x	2	6.8	0.482	24.1	0.127	35.7	-	-	-	36768677	model \| boxes
R-101-FPN	RetinaNet	1x	2	8.7	0.666	16.7	0.156	37.7	-	-	-	36768744	model \| boxes
R-101-FPN	RetinaNet	2x	2	8.7	0.666	33.3	0.154	37.8	-	-	-	36768840	model \| boxes
X-101-64x4d-FPN	RetinaNet	1x	2	12.6	1.613	40.3	0.341	39.8	-	-	-	36768875	model \| boxes
X-101-64x4d-FPN	RetinaNet	2x	2	12.6	1.625	81.3	0.339	39.2	-	-	-	36768907	model \| boxes
X-101-32x8d-FPN	RetinaNet	1x	2	12.7	1.343	33.6	0.277	39.5	-	-	-	36769563	model \| boxes
X-101-32x8d-FPN	RetinaNet	2x	2	12.7	1.340	67.0	0.276	38.6	-	-	-	36769641	model \| boxes

Notes: none

Mask R-CNN with Bells & Whistles

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
X-152-32x8d-FPN-IN5k	Mask	s1x	1	9.6	1.188	85.8	12.100 + 0.046	48.1	41.5	-	-	37129812	model \| boxes \| masks
[above without test-time aug.]							0.325 + 0.018	45.2	39.7	-	-

Notes:

A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale)
Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
Like the other results, this is a single model result (it is not an ensemble of models)
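To make the cost of the row 1 test-time augmentation concrete, the sketch below enumerates the augmented passes (every test scale, with and without horizontal flipping) and compares the two rows of the table. The names are illustrative only, and how Detectron merges the per-pass predictions is not covered here.

```python
from itertools import product

TEST_SCALES = (400, 500, 600, 700, 900, 1000, 1100, 1200)
FLIPS = (False, True)


def tta_passes():
    """One entry per augmented forward pass: each scale, unflipped and flipped."""
    return [{"scale": s, "hflip": f} for s, f in product(TEST_SCALES, FLIPS)]


# GPU inference time and AP from the two table rows.
with_aug = {"gpu_s": 12.100, "box_ap": 48.1, "mask_ap": 41.5}
without_aug = {"gpu_s": 0.325, "box_ap": 45.2, "mask_ap": 39.7}

slowdown = with_aug["gpu_s"] / without_aug["gpu_s"]
box_gain = with_aug["box_ap"] - without_aug["box_ap"]
mask_gain = with_aug["mask_ap"] - without_aug["mask_ap"]

print(f"{len(tta_passes())} augmented passes per image")
print(f"GPU time ratio: {slowdown:.1f}x")
print(f"AP gain: box +{box_gain:.1f}, mask +{mask_gain:.1f}")
```

The 16 augmented passes cost roughly 37 times the GPU time of a single-scale pass, in exchange for about 2.9 box AP and 1.8 mask AP.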

Keypoint Detection Baselines

Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)

Our keypoint detection baselines differ from our box and mask baselines in a few details:

Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set).
Metrics are reported for the person class only (inference is still run on the entire coco_2014_minival dataset).
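The training set filter described above can be sketched against a COCO-style annotation dictionary. This is an illustration, not Detectron's data loader: the function name is hypothetical, and we interpret "at least one person with keypoint annotations" as a person annotation whose num_keypoints field is greater than zero.

```python
PERSON_CATEGORY_ID = 1  # id of the "person" category in the COCO annotation files


def images_with_person_keypoints(coco):
    """Keep only images with at least one person annotation that has keypoints.

    `coco` is a dict in the COCO annotation layout, with "images" and
    "annotations" lists.
    """
    keep = {
        ann["image_id"]
        for ann in coco["annotations"]
        if ann["category_id"] == PERSON_CATEGORY_ID
        and ann.get("num_keypoints", 0) > 0
    }
    return [img for img in coco["images"] if img["id"] in keep]


# Toy example: image 1 has a labeled person, image 2 has a person without
# keypoints, and image 3 has no person at all.
toy = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "num_keypoints": 12},
        {"image_id": 2, "category_id": 1, "num_keypoints": 0},
        {"image_id": 3, "category_id": 18},
    ],
}
kept_ids = [img["id"] for img in images_with_person_keypoints(toy)]
print(kept_ids)
```

Only image 1 survives the filter in this toy example.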

Person-Specific RPN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	RPN	1x	2	6.4	0.391	9.8	0.082	-	-	-	64.0	35998996	model \| props: 1, 2, 3
R-101-FPN	RPN	1x	2	8.1	0.504	12.6	0.109	-	-	-	65.2	35999521	model \| props: 1, 2, 3
X-101-64x4d-FPN	RPN	1x	2	11.5	1.394	34.9	0.289	-	-	-	65.9	35999553	model \| props: 1, 2, 3
X-101-32x8d-FPN	RPN	1x	2	11.6	1.104	27.6	0.224	-	-	-	66.2	36760438	model \| props: 1, 2, 3

Notes:

Metrics are for the person category only.
Inference time only includes RPN proposal generation.
"prop. AR" is proposal average recall at 1000 proposals per image.
Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.

Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	Kps	1x	2	7.7	0.533	13.3	0.081 + 0.087	52.7	-	64.1	-	37651787	model \| boxes \| kps
R-50-FPN	Kps	s1x	2	7.7	0.533	19.2	0.080 + 0.085	53.4	-	65.5	-	37651887	model \| boxes \| kps
R-101-FPN	Kps	1x	2	9.4	0.668	16.7	0.109 + 0.080	53.5	-	65.0	-	37651996	model \| boxes \| kps
R-101-FPN	Kps	s1x	2	9.4	0.668	24.1	0.108 + 0.076	54.6	-	66.0	-	37652016	model \| boxes \| kps
X-101-64x4d-FPN	Kps	1x	2	12.8	1.477	36.9	0.288 + 0.077	55.8	-	66.7	-	37731079	model \| boxes \| kps
X-101-64x4d-FPN	Kps	s1x	2	12.9	1.478	53.4	0.286 + 0.075	56.3	-	67.1	-	37731142	model \| boxes \| kps
X-101-32x8d-FPN	Kps	1x	2	12.9	1.215	30.4	0.219 + 0.084	55.4	-	66.2	-	37730253	model \| boxes \| kps
X-101-32x8d-FPN	Kps	s1x	2	12.9	1.214	43.8	0.218 + 0.071	55.9	-	67.0	-	37731010	model \| boxes \| kps

Notes:

Metrics are for the person category only.
Each row uses the precomputed RPN proposals from the row of the Person-Specific RPN Baselines table above that has the same backbone.
Inference time excludes proposal generation.

End-to-End Keypoint-Only Mask R-CNN Baselines

backbone	type	lr schd	im/gpu	train mem (GB)	train time (s/iter)	train time total (hr)	inference time (s/im)	box AP	mask AP	kp AP	prop. AR	model id	download links
R-50-FPN	Kps	1x	2	9.0	0.832	20.8	0.097 + 0.092	53.6	-	64.2	-	37697547	model \| boxes \| kps
R-50-FPN	Kps	s1x	2	9.0	0.828	29.9	0.096 + 0.089	54.3	-	65.4	-	37697714	model \| boxes \| kps
R-101-FPN	Kps	1x	2	10.6	0.923	23.1	0.124 + 0.084	54.5	-	64.8	-	37697946	model \| boxes \| kps
R-101-FPN	Kps	s1x	2	10.6	0.921	33.3	0.123 + 0.083	55.3	-	65.8	-	37698009	model \| boxes \| kps
X-101-64x4d-FPN	Kps	1x	2	14.1	1.655	41.4	0.302 + 0.079	56.3	-	66.0	-	37732355	model \| boxes \| kps
X-101-64x4d-FPN	Kps	s1x	2	14.1	1.731	62.5	0.322 + 0.074	56.9	-	66.8	-	37732415	model \| boxes \| kps
X-101-32x8d-FPN	Kps	1x	2	14.2	1.410	35.3	0.235 + 0.080	56.0	-	66.0	-	37792158	model \| boxes \| kps
X-101-32x8d-FPN	Kps	s1x	2	14.2	1.408	50.8	0.236 + 0.075	56.9	-	67.0	-	37732318	model \| boxes \| kps

Notes:

Metrics are for the person category only.
For these models, RPN and the detector are trained jointly and end-to-end.
Inference time is fully image-to-detections, including proposal generation.
