
detectron2 error: RuntimeError: CUDA error: device-side assert triggered

益楷
2023-12-01

Error message

[10/19 20:29:19] d2.data.datasets.coco INFO: Loaded 38743 images in COCO format from /home/dlsvr3/server/train_server/uploads/物料纸箱20221019/out_dir/COCO/annotations/train.json
[10/19 20:29:20] d2.data.build INFO: Removed 0 images with no usable annotations. 38743 images left.
[10/19 20:29:20] d2.data.dataset_mapper INFO: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640,), max_size=640, sample_style='choice'), RandomFlip()]
[10/19 20:29:20] d2.data.build INFO: Using training sampler TrainingSampler
[10/19 20:29:20] d2.data.common INFO: Serializing 38743 elements to byte tensors and concatenating them all ...
[10/19 20:29:20] d2.data.common INFO: Serialized dataset takes 101.06 MiB
[10/19 20:29:20] fvcore.common.checkpoint INFO: [Checkpointer] Loading from mm01_p/models/model_final_f10217.pkl ...
[10/19 20:29:20] fvcore.common.checkpoint INFO: Reading a file from 'Detectron2 Model Zoo'
[10/19 20:29:20] d2.engine.train_loop INFO: Starting training from iteration 0
[10/19 20:29:38] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/dlsvr3/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/dlsvr3/detectron2/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/dlsvr3/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/dlsvr3/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dlsvr3/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 163, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/dlsvr3/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dlsvr3/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 739, in forward
    losses = self._forward_box(features, proposals)
  File "/home/dlsvr3/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 804, in _forward_box
    losses = self.box_predictor.losses(predictions, proposals)
  File "/home/dlsvr3/detectron2/detectron2/modeling/roi_heads/fast_rcnn.py", line 324, in losses
    proposal_boxes, gt_boxes, proposal_deltas, gt_classes
  File "/home/dlsvr3/detectron2/detectron2/modeling/roi_heads/fast_rcnn.py", line 338, in box_reg_loss
    fg_inds = nonzero_tuple((gt_classes >= 0) & (gt_classes < self.num_classes))[0]
  File "/home/dlsvr3/detectron2/detectron2/layers/wrappers.py", line 132, in nonzero_tuple
    return x.nonzero(as_tuple=True)
RuntimeError: CUDA error: device-side assert triggered
[10/19 20:29:38] d2.engine.hooks INFO: Total training time: 0:00:18 (0:00:00 on hooks)
[10/19 20:29:38] d2.utils.events INFO:  iter: 0    lr: N/A  max_mem: 2963M

This error occurred while training Mask R-CNN with detectron2.

Cause

    cfg.MODEL.RETINANET.NUM_CLASSES = class_name_number + 1
    cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = class_name_number + 1
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = class_name_number + 1

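The training set has more than 100 classes, but the NUM_CLASSES settings above were either set incorrectly or only one of them was set. The default value is 80 (for ROI_HEADS.NUM_CLASSES), which does not match the number of classes in the training set. Class labels beyond the configured range then index out of bounds during the loss computation on the GPU, which surfaces as the device-side assert.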
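The mismatch described above can be checked before training by inspecting the annotation file. The following is a minimal sketch, not part of the original post: the helper names `count_coco_categories` and `labels_exceeding` are hypothetical, and it assumes the category ids are mapped to contiguous labels 0..N-1 in sorted order when the dataset is loaded.

```python
import json


def load_coco(path):
    """Load a COCO-format annotation file (e.g. a train.json) into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_coco_categories(coco):
    """Return the number of distinct category ids in a COCO-style dict."""
    return len({cat["id"] for cat in coco["categories"]})


def labels_exceeding(coco, num_classes):
    """Return the category ids whose label would not fit into num_classes.

    Assumption: ids are remapped to contiguous labels 0..N-1 in sorted
    order, so the label of a category is its position in the sorted id list.
    """
    ids = sorted({cat["id"] for cat in coco["categories"]})
    return [cid for idx, cid in enumerate(ids) if idx >= num_classes]


# In-memory stand-in for a real annotation file with 100 categories.
example = {"categories": [{"id": i, "name": "class_%d" % i} for i in range(1, 101)]}
print("categories:", count_coco_categories(example))
print("ids that do not fit into 80 classes:", len(labels_exceeding(example, 80)))
```

If `labels_exceeding` returns a non-empty list for the configured NUM_CLASSES, the configuration is too small for the dataset and an error like the one above is likely.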

Solution

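The fix is simple: set all of the following correctly. The +1 added to the class count is for the background class. Note that detectron2's config reference describes ROI_HEADS.NUM_CLASSES as the number of foreground classes, so the extra 1 is not strictly required there. What matters is that the value is not smaller than the number of classes in the training set.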

    cfg.MODEL.RETINANET.NUM_CLASSES = class_name_number + 1
    cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = class_name_number + 1
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = class_name_number + 1
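
Since the error can also appear when only one of the three keys is set, it may help to derive all of them from a single number. The sketch below is an illustration under assumptions, not the author's code: `num_classes_overrides` and its `add_background` flag are hypothetical names, and the default follows this post's convention of adding 1.

```python
def num_classes_overrides(class_name_number, add_background=True):
    """Build a flat [key, value, key, value, ...] list for the three settings.

    add_background=True follows this post's "+1" convention; pass False to
    use the plain number of dataset classes instead.
    """
    value = class_name_number + 1 if add_background else class_name_number
    keys = [
        "MODEL.RETINANET.NUM_CLASSES",
        "MODEL.SEM_SEG_HEAD.NUM_CLASSES",
        "MODEL.ROI_HEADS.NUM_CLASSES",
    ]
    overrides = []
    for key in keys:
        overrides.extend([key, value])
    return overrides


# With detectron2 installed, the list could be applied in one call, e.g.:
#   from detectron2.config import get_cfg
#   cfg = get_cfg()
#   cfg.merge_from_list(num_classes_overrides(class_name_number))
print(num_classes_overrides(100))
```

Keeping the three values in one place makes it harder to update one setting and forget the others.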