Detectron Model Zoo and Baselines


Introduction

This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.

Common Settings and Notes

  • All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
  • All baselines were trained using 8 GPU data parallel sync SGD with a minibatch size of either 8 or 16 images (see the im/gpu column).
  • For training, only horizontal flipping data augmentation was used.
  • For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
  • All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
  • All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
  • Inference times are often expressed as "X + Y", in which X is time taken in reasonably well-optimized GPU code and Y is time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
  • Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
  • The model id column is provided for ease of reference.
  • To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash (a scripted check is sketched after this list).
  • All models and results below are on the COCO dataset.
  • Baseline models and results for the Cityscapes dataset are coming soon!
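The integrity check can be scripted. Below is a minimal sketch, assuming the .md5sum file contains the expected hash as its first whitespace-separated token; the URL in the usage comment is a placeholder, not a real download link.

```python
import hashlib
import urllib.request

def verify_download(url, local_path):
    """Compare a local file's md5 against the published <url>.md5sum hash."""
    # Per the convention above, appending .md5sum to a download URL yields
    # a small text file containing the expected md5 hash (assumed here to
    # be the first whitespace-separated token).
    with urllib.request.urlopen(url + ".md5sum") as resp:
        expected = resp.read().decode().split()[0]
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected

# Placeholder URL; substitute any download URL from the tables below.
# ok = verify_download("https://example.com/detectron/R-50.pkl", "R-50.pkl")
```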

Training Schedules

We use three training schedules, indicated by the lr schd column in the tables below.

  • 1x: For minibatch size 16, this schedule starts at an LR of 0.02, reduces it by a factor of 0.1 after 60k and 80k iterations, and finally terminates at 90k iterations. This schedule results in roughly 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
  • 2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
  • s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, extending the duration of the first learning rate phase. With a minibatch size of 16, it reduces the LR by a factor of 0.1 at 100k and 120k iterations, finally ending after 130k iterations.

All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
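As a concrete illustration of that rule, the sketch below halves the minibatch size from 16 to 8, which halves the base LR and doubles every iteration milestone so that the number of training epochs stays fixed:

```python
def scale_schedule(new_batch, base_lr=0.02, base_batch=16,
                   milestones=(60000, 80000, 90000)):
    """Linear scaling rule: the LR scales linearly with minibatch size and
    iteration milestones scale inversely, so the epoch count is unchanged."""
    factor = new_batch / base_batch
    new_lr = base_lr * factor
    new_milestones = tuple(int(m / factor) for m in milestones)
    return new_lr, new_milestones

# 1x schedule at 8 images per minibatch:
# LR 0.01, drops at 120k and 160k iterations, ends at 180k iterations.
print(scale_schedule(new_batch=8))  # (0.01, (120000, 160000, 180000))
```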

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

ImageNet Pretrained Models

The backbone models pretrained on ImageNet are available in the format used by Detectron (a loading sketch follows the list below). Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.

  • R-50.pkl: converted copy of MSRA's original ResNet-50 model
  • R-101.pkl: converted copy of MSRA's original ResNet-101 model
  • X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
  • X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
  • X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)
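A minimal sketch for inspecting one of these files is below. It assumes the .pkl unpickles to a dict of blob name → numpy array (Detectron pickles were written under Python 2, hence the latin1 encoding; trained checkpoints nest the weights under a 'blobs' key, which the sketch also handles):

```python
import pickle

def load_detectron_weights(path):
    """List the parameter blobs stored in a Detectron .pkl weights file."""
    with open(path, "rb") as f:
        # latin1 lets Python 3 unpickle numpy arrays written by Python 2.
        data = pickle.load(f, encoding="latin1")
    # Trained checkpoints wrap the weights in a 'blobs' dict; the ImageNet
    # backbones are assumed to be a flat name -> array dict.
    blobs = data.get("blobs", data)
    for name, blob in sorted(blobs.items()):
        print(name, getattr(blob, "shape", type(blob)))
    return blobs

# e.g. blobs = load_detectron_weights("R-50.pkl")
```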

Log Files

Training and inference logs are available for most models in the model zoo.

Proposal, Box, and Mask Detection Baselines

RPN Proposal Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | RPN | 1x | 2 | 4.3 | 0.187 | 4.7 | 0.113 | - | - | - | 51.6 | 35998355 | model, props: 1, 2, 3 |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.416 | 10.4 | 0.080 | - | - | - | 57.2 | 35998814 | model, props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.503 | 12.6 | 0.108 | - | - | - | 58.2 | 35998887 | model, props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.395 | 34.9 | 0.292 | - | - | - | 59.4 | 35998956 | model, props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.102 | 27.6 | 0.222 | - | - | - | 59.5 | 36760102 | model, props: 1, 2, 3 |

Notes:

  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.
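For reference, proposal AR at 1000 proposals per image can be computed with pycocotools as sketched below, assuming the downloaded proposals have first been converted to the COCO detection-results json format (one class-agnostic box per entry):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def proposal_ar_1000(ann_file, results_json):
    """Class-agnostic average recall at 1000 proposals per image."""
    coco_gt = COCO(ann_file)
    coco_dt = coco_gt.loadRes(results_json)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    coco_eval.params.useCats = 0              # proposals are class-agnostic
    coco_eval.params.maxDets = [1, 10, 1000]  # report AR@1000, not AR@100
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    return coco_eval.stats[8]  # AR @ maxDets=1000, all object areas

# e.g. proposal_ar_1000("instances_minival2014.json", "rpn_props.json")
```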

Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Fast | 1x | 1 | 6.0 | 0.456 | 22.8 | 0.241 + 0.003 | 34.4 | - | - | - | 36224013 | model, boxes |
| R-50-C4 | Fast | 2x | 1 | 6.0 | 0.453 | 45.3 | 0.241 + 0.003 | 35.6 | - | - | - | 36224046 | model, boxes |
| R-50-FPN | Fast | 1x | 2 | 6.0 | 0.285 | 7.1 | 0.076 + 0.004 | 36.4 | - | - | - | 36225147 | model, boxes |
| R-50-FPN | Fast | 2x | 2 | 6.0 | 0.287 | 14.4 | 0.077 + 0.004 | 36.8 | - | - | - | 36225249 | model, boxes |
| R-101-FPN | Fast | 1x | 2 | 7.7 | 0.448 | 11.2 | 0.102 + 0.003 | 38.5 | - | - | - | 36228880 | model, boxes |
| R-101-FPN | Fast | 2x | 2 | 7.7 | 0.449 | 22.5 | 0.103 + 0.004 | 39.0 | - | - | - | 36228933 | model, boxes |
| X-101-64x4d-FPN | Fast | 1x | 1 | 6.3 | 0.994 | 49.7 | 0.292 + 0.003 | 40.4 | - | - | - | 36226250 | model, boxes |
| X-101-64x4d-FPN | Fast | 2x | 1 | 6.3 | 0.980 | 98.0 | 0.291 + 0.003 | 39.8 | - | - | - | 36226326 | model, boxes |
| X-101-32x8d-FPN | Fast | 1x | 1 | 6.4 | 0.721 | 36.1 | 0.217 + 0.003 | 40.6 | - | - | - | 37119777 | model, boxes |
| X-101-32x8d-FPN | Fast | 2x | 1 | 6.4 | 0.720 | 72.0 | 0.217 + 0.003 | 39.7 | - | - | - | 37121469 | model, boxes |
| R-50-C4 | Mask | 1x | 1 | 6.4 | 0.466 | 23.3 | 0.252 + 0.020 | 35.5 | 31.3 | - | - | 36224121 | model, boxes, masks |
| R-50-C4 | Mask | 2x | 1 | 6.4 | 0.464 | 46.4 | 0.253 + 0.019 | 36.9 | 32.5 | - | - | 36224151 | model, boxes, masks |
| R-50-FPN | Mask | 1x | 2 | 7.9 | 0.377 | 9.4 | 0.082 + 0.019 | 37.3 | 33.7 | - | - | 36225401 | model, boxes, masks |
| R-50-FPN | Mask | 2x | 2 | 7.9 | 0.377 | 18.9 | 0.083 + 0.018 | 37.7 | 34.0 | - | - | 36225732 | model, boxes, masks |
| R-101-FPN | Mask | 1x | 2 | 9.6 | 0.539 | 13.5 | 0.111 + 0.018 | 39.4 | 35.6 | - | - | 36229407 | model, boxes, masks |
| R-101-FPN | Mask | 2x | 2 | 9.6 | 0.537 | 26.9 | 0.109 + 0.016 | 40.0 | 35.9 | - | - | 36229740 | model, boxes, masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.3 | 1.036 | 51.8 | 0.292 + 0.016 | 41.3 | 37.0 | - | - | 36226382 | model, boxes, masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.3 | 1.035 | 103.5 | 0.292 + 0.014 | 41.1 | 36.6 | - | - | 36672114 | model, boxes, masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.4 | 0.766 | 38.3 | 0.223 + 0.017 | 41.3 | 37.0 | - | - | 37121516 | model, boxes, masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.4 | 0.765 | 76.5 | 0.222 + 0.014 | 40.7 | 36.3 | - | - | 37121596 | model, boxes, masks |

Notes:

  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.

End-to-End Faster & Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Faster | 1x | 1 | 6.3 | 0.566 | 28.3 | 0.167 + 0.003 | 34.8 | - | - | - | 35857197 | model, boxes |
| R-50-C4 | Faster | 2x | 1 | 6.3 | 0.569 | 56.9 | 0.174 + 0.003 | 36.5 | - | - | - | 35857281 | model, boxes |
| R-50-FPN | Faster | 1x | 2 | 7.2 | 0.544 | 13.6 | 0.093 + 0.004 | 36.7 | - | - | - | 35857345 | model, boxes |
| R-50-FPN | Faster | 2x | 2 | 7.2 | 0.546 | 27.3 | 0.092 + 0.004 | 37.9 | - | - | - | 35857389 | model, boxes |
| R-101-FPN | Faster | 1x | 2 | 8.9 | 0.647 | 16.2 | 0.120 + 0.004 | 39.4 | - | - | - | 35857890 | model, boxes |
| R-101-FPN | Faster | 2x | 2 | 8.9 | 0.647 | 32.4 | 0.119 + 0.004 | 39.8 | - | - | - | 35857952 | model, boxes |
| X-101-64x4d-FPN | Faster | 1x | 1 | 6.9 | 1.057 | 52.9 | 0.305 + 0.003 | 41.5 | - | - | - | 35858015 | model, boxes |
| X-101-64x4d-FPN | Faster | 2x | 1 | 6.9 | 1.055 | 105.5 | 0.304 + 0.003 | 40.8 | - | - | - | 35858198 | model, boxes |
| X-101-32x8d-FPN | Faster | 1x | 1 | 7.0 | 0.799 | 40.0 | 0.233 + 0.004 | 41.3 | - | - | - | 36761737 | model, boxes |
| X-101-32x8d-FPN | Faster | 2x | 1 | 7.0 | 0.800 | 80.0 | 0.233 + 0.003 | 40.6 | - | - | - | 36761786 | model, boxes |
| R-50-C4 | Mask | 1x | 1 | 6.6 | 0.620 | 31.0 | 0.181 + 0.018 | 35.8 | 31.4 | - | - | 35858791 | model, boxes, masks |
| R-50-C4 | Mask | 2x | 1 | 6.6 | 0.620 | 62.0 | 0.182 + 0.017 | 37.8 | 32.8 | - | - | 35858828 | model, boxes, masks |
| R-50-FPN | Mask | 1x | 2 | 8.6 | 0.889 | 22.2 | 0.099 + 0.019 | 37.7 | 33.9 | - | - | 35858933 | model, boxes, masks |
| R-50-FPN | Mask | 2x | 2 | 8.6 | 0.897 | 44.9 | 0.099 + 0.018 | 38.6 | 34.5 | - | - | 35859007 | model, boxes, masks |
| R-101-FPN | Mask | 1x | 2 | 10.2 | 1.008 | 25.2 | 0.126 + 0.018 | 40.0 | 35.9 | - | - | 35861795 | model, boxes, masks |
| R-101-FPN | Mask | 2x | 2 | 10.2 | 0.993 | 49.7 | 0.126 + 0.017 | 40.9 | 36.4 | - | - | 35861858 | model, boxes, masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.6 | 1.217 | 60.9 | 0.309 + 0.018 | 42.4 | 37.5 | - | - | 36494496 | model, boxes, masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.6 | 1.210 | 121.0 | 0.309 + 0.015 | 42.2 | 37.2 | - | - | 35859745 | model, boxes, masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.7 | 0.961 | 48.1 | 0.239 + 0.019 | 42.1 | 37.3 | - | - | 36761843 | model, boxes, masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.7 | 0.975 | 97.5 | 0.240 + 0.016 | 41.7 | 36.9 | - | - | 36762092 | model, boxes, masks |

Notes:

  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.

RetinaNet Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RetinaNet | 1x | 2 | 6.8 | 0.483 | 12.1 | 0.125 | 35.7 | - | - | - | 36768636 | model, boxes |
| R-50-FPN | RetinaNet | 2x | 2 | 6.8 | 0.482 | 24.1 | 0.127 | 35.7 | - | - | - | 36768677 | model, boxes |
| R-101-FPN | RetinaNet | 1x | 2 | 8.7 | 0.666 | 16.7 | 0.156 | 37.7 | - | - | - | 36768744 | model, boxes |
| R-101-FPN | RetinaNet | 2x | 2 | 8.7 | 0.666 | 33.3 | 0.154 | 37.8 | - | - | - | 36768840 | model, boxes |
| X-101-64x4d-FPN | RetinaNet | 1x | 2 | 12.6 | 1.613 | 40.3 | 0.341 | 39.8 | - | - | - | 36768875 | model, boxes |
| X-101-64x4d-FPN | RetinaNet | 2x | 2 | 12.6 | 1.625 | 81.3 | 0.339 | 39.2 | - | - | - | 36768907 | model, boxes |
| X-101-32x8d-FPN | RetinaNet | 1x | 2 | 12.7 | 1.343 | 33.6 | 0.277 | 39.5 | - | - | - | 36769563 | model, boxes |
| X-101-32x8d-FPN | RetinaNet | 2x | 2 | 12.7 | 1.340 | 67.0 | 0.276 | 38.6 | - | - | - | 36769641 | model, boxes |

Notes: none

Mask R-CNN with Bells & Whistles

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X-152-32x8d-FPN-IN5k | Mask | s1x | 1 | 9.6 | 1.188 | 85.8 | 12.100 + 0.046 | 48.1 | 41.5 | - | - | 37129812 | model, boxes, masks |
| [above without test-time aug.] | | | | | | | 0.325 + 0.018 | 45.2 | 39.7 | - | - | | |

Notes:

  • A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
  • The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
  • Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
  • Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale); a merging sketch follows these notes
  • Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
  • Like the other results, this is a single model result (it is not an ensemble of models)
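A minimal sketch of this kind of multi-scale, flipped box inference is below. It is an illustration of the idea, not Detectron's implementation; run_model is a hypothetical single-scale detector returning an (N, 5) array of (x1, y1, x2, y2, score) boxes in the coordinates of the image it was given.

```python
import numpy as np

def nms(boxes, thresh):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    x1, y1, x2, y2, scores = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]
    return keep

def tta_boxes(image, run_model,
              scales=(400, 500, 600, 700, 900, 1000, 1100, 1200),
              nms_thresh=0.5):
    """Merge detections over multiple test scales and horizontal flips."""
    width = image.shape[1]
    detections = []
    for scale in scales:
        for flip in (False, True):
            img = image[:, ::-1] if flip else image
            boxes = run_model(img, short_side=scale)  # hypothetical detector
            if flip:  # map x-coordinates back to the unflipped image
                boxes = boxes.copy()
                boxes[:, [0, 2]] = width - boxes[:, [2, 0]]
            detections.append(boxes)
    merged = np.vstack(detections)
    return merged[nms(merged, nms_thresh)]
```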

Keypoint Detection Baselines

Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)

Our keypoint detection baselines differ from our box and mask baselines in a couple of details:

  • Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
  • Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set; a filtering sketch follows this list).
  • Metrics are reported for the person class only (still run on the entire coco_2014_minival dataset).
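A sketch of that training-set filter using pycocotools (the annotation file name is illustrative):

```python
from pycocotools.coco import COCO

def keypoint_image_ids(ann_file):
    """Ids of images with at least one person that has keypoint annotations."""
    coco = COCO(ann_file)
    person = coco.getCatIds(catNms=["person"])
    keep = []
    for img_id in coco.getImgIds(catIds=person):
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, catIds=person))
        if any(ann.get("num_keypoints", 0) > 0 for ann in anns):
            keep.append(img_id)
    return keep

# Apply to the keypoint annotations for coco_2014_train and
# coco_2014_valminusminival; illustrative file name:
# ids = keypoint_image_ids("person_keypoints_train2014.json")
```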

Person-Specific RPN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.391 | 9.8 | 0.082 | - | - | - | 64.0 | 35998996 | model, props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.504 | 12.6 | 0.109 | - | - | - | 65.2 | 35999521 | model, props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.394 | 34.9 | 0.289 | - | - | - | 65.9 | 35999553 | model, props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.104 | 27.6 | 0.224 | - | - | - | 66.2 | 36760438 | model, props: 1, 2, 3 |

Notes:

  • Metrics are for the person category only.
  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.

Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 7.7 | 0.533 | 13.3 | 0.081 + 0.087 | 52.7 | - | 64.1 | - | 37651787 | model, boxes, kps |
| R-50-FPN | Kps | s1x | 2 | 7.7 | 0.533 | 19.2 | 0.080 + 0.085 | 53.4 | - | 65.5 | - | 37651887 | model, boxes, kps |
| R-101-FPN | Kps | 1x | 2 | 9.4 | 0.668 | 16.7 | 0.109 + 0.080 | 53.5 | - | 65.0 | - | 37651996 | model, boxes, kps |
| R-101-FPN | Kps | s1x | 2 | 9.4 | 0.668 | 24.1 | 0.108 + 0.076 | 54.6 | - | 66.0 | - | 37652016 | model, boxes, kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 12.8 | 1.477 | 36.9 | 0.288 + 0.077 | 55.8 | - | 66.7 | - | 37731079 | model, boxes, kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 12.9 | 1.478 | 53.4 | 0.286 + 0.075 | 56.3 | - | 67.1 | - | 37731142 | model, boxes, kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 12.9 | 1.215 | 30.4 | 0.219 + 0.084 | 55.4 | - | 66.2 | - | 37730253 | model, boxes, kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 12.9 | 1.214 | 43.8 | 0.218 + 0.071 | 55.9 | - | 67.0 | - | 37731010 | model, boxes, kps |

Notes:

  • Metrics are for the person category only.
  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.

End-to-End Keypoint-Only Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 9.0 | 0.832 | 20.8 | 0.097 + 0.092 | 53.6 | - | 64.2 | - | 37697547 | model, boxes, kps |
| R-50-FPN | Kps | s1x | 2 | 9.0 | 0.828 | 29.9 | 0.096 + 0.089 | 54.3 | - | 65.4 | - | 37697714 | model, boxes, kps |
| R-101-FPN | Kps | 1x | 2 | 10.6 | 0.923 | 23.1 | 0.124 + 0.084 | 54.5 | - | 64.8 | - | 37697946 | model, boxes, kps |
| R-101-FPN | Kps | s1x | 2 | 10.6 | 0.921 | 33.3 | 0.123 + 0.083 | 55.3 | - | 65.8 | - | 37698009 | model, boxes, kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 14.1 | 1.655 | 41.4 | 0.302 + 0.079 | 56.3 | - | 66.0 | - | 37732355 | model, boxes, kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 14.1 | 1.731 | 62.5 | 0.322 + 0.074 | 56.9 | - | 66.8 | - | 37732415 | model, boxes, kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 14.2 | 1.410 | 35.3 | 0.235 + 0.080 | 56.0 | - | 66.0 | - | 37792158 | model, boxes, kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 14.2 | 1.408 | 50.8 | 0.236 + 0.075 | 56.9 | - | 67.0 | - | 37732318 | model, boxes, kps |

Notes:

  • Metrics are for the person category only.
  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.