Paper link: https://arxiv.org/abs/2104.14294
Code link: https://github.com/facebookresearch/dino
You can choose to download only the weights of the pretrained backbone for downstream tasks, or the full checkpoint containing the backbone and projection head weights for both the student and teacher networks. The backbone is also provided in ONNX format, together with detailed arguments and training/evaluation logs. Note that "DeiT-S" and "ViT-S" refer to the same architecture.
import torch

vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
vitb16 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')
vitb8 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
xcit_small_12_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p16')
xcit_small_12_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p8')
xcit_medium_24_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p16')
xcit_medium_24_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p8')
resnet50 = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')
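As a quick illustration of how one of these backbones can be used for a downstream task, here is a minimal feature-extraction sketch. The 224x224 preprocessing, the ImageNet normalization constants, and the image path example.jpg are illustrative assumptions, not part of the repository.

import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet-style preprocessing (an assumption; adapt it to your downstream pipeline).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits16.eval()

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    feats = vits16(img)  # backbone output: the [CLS] embedding, shape (1, 384) for ViT-S/16
print(feats.shape)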
Please install PyTorch and download the ImageNet dataset. This codebase was developed with Python 3.6, PyTorch 1.7.1, CUDA 11.0 and torchvision 0.8.2. The exact arguments to reproduce the models presented in the paper can be found in the "args" column of the pretrained models section. For the full documentation of DINO training, run:
python main_dino.py --help
Run DINO with the ViT-small network on a single node with 8 GPUs for 100 epochs with the following command. Training time is 1.75 days and the resulting checkpoint should reach 69.3% on k-NN evaluation and 74.0% on linear evaluation. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
We use Slurm and submitit (pip install submitit). To train on 2 nodes with 8 GPUs each (a total of 16 GPUs):
python run_with_submitit.py --nodes 2 --ngpus 8 --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
python run_with_submitit.py --nodes 2 --ngpus 8 --use_volta32 --arch vit_base --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
You can improve the performance of this run by:
- training for more epochs: --epochs 300,
- increasing the teacher temperature: --teacher_temp 0.07 --warmup_teacher_temp_epochs 30,
- removing last layer normalization (only safe with --arch vit_small): --norm_last_layer false.
python run_with_submitit.py --arch vit_small --epochs 300 --teacher_temp 0.07 --warmup_teacher_temp_epochs 30 --norm_last_layer false --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
The resulting pretrained model should reach 73.3% on k-NN evaluation and 76.0% on linear evaluation. Training time is 2.6 days with 16 GPUs. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
This code also works for training DINO on convolutional networks, such as ResNet-50. In that case we strongly recommend adapting some optimization arguments. For example, the following is the command to train DINO on ResNet-50 for 100 epochs on a single node with 8 GPUs. We provide training logs for this run.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch resnet50 --optimizer sgd --lr 0.03 --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
You can look at the self-attention of the [CLS] token on the different heads of the last layer by running:
python visualize_attention.py
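For a rough idea of what visualize_attention.py computes, the sketch below extracts the last-layer [CLS] self-attention from a hub-loaded ViT-S/8. The image path and the 480-pixel resize are illustrative assumptions; the snippet relies on the get_last_selfattention method defined in the repo's vision_transformer.py.

import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

transform = transforms.Compose([
    transforms.Resize(480),  # illustrative resolution
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical image path

# Crop so that height and width are divisible by the patch size (8 here).
_, _, h, w = img.shape
img = img[:, :, :h - h % 8, :w - w % 8]

with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, num_heads, num_tokens, num_tokens)

# Attention of the [CLS] token (query index 0) over the patch tokens, one map per head.
cls_attn = attn[0, :, 0, 1:]
print(cls_attn.shape)  # (num_heads, num_patches)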
To evaluate a simple k-NN classifier with a single GPU on a pretrained model, run:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --data_path /path/to/imagenet
If you choose not to specify --pretrained_weights, then DINO reference weights are used by default. If instead you want to evaluate checkpoints from a run of your own, you can run for example:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --pretrained_weights /path/to/checkpoint.pth --checkpoint_key teacher --data_path /path/to/imagenet
To train a supervised linear classifier on frozen weights on a single node with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node=8 eval_linear.py --data_path /path/to/imagenet
We release the logs and weights from evaluating the different models:
You can check the performance of the pretrained weights on the ImageNet validation set by running the following command lines:
python eval_linear.py --evaluate --arch vit_small --patch_size 16 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_small --patch_size 8 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 16 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 8 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch resnet50 --data_path /path/to/imagenet/train
import argparse
import os
import sys
import datetime
import time
import math
import json
from pathlib import Path
import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torchvision import models as torchvision_models
import utils
import vision_transformer as vits
from vision_transformer import DINOHead
# Model architecture (backbone)
parser.add_argument('--arch', default='vit_small', type=str,
choices=['vit_tiny', 'vit_small', 'vit_base', 'xcit', 'deit_tiny', 'deit_small'] \
+ torchvision_archs + torch.hub.list("facebookresearch/xcit:main"),
help="""Name of architecture to train. For quick experiments with ViTs,
we recommend using vit_tiny or vit_small.""")
# ViT patch size: smaller values give better performance but require more memory. Applies only to ViTs (vit_tiny, vit_small and vit_base)
parser.add_argument('--patch_size', default=16, type=int, help="""Size in pixels
of input square patches - default 16 (for 16x16 patches). Using smaller
values leads to better performance but requires more memory. Applies only
for ViTs (vit_tiny, vit_small and vit_base). If <16, we recommend disabling
mixed precision training (--use_fp16 false) to avoid instabilities.""")
# Dimensionality of the DINO head output
parser.add_argument('--out_dim', default=65536, type=int, help="""Dimensionality of
the DINO head output. For complex and large datasets large values (like 65k) work well.""")
# Whether to weight-normalize the last layer of the DINO head
parser.add_argument('--norm_last_layer', default=True, type=utils.bool_flag,
help="""Whether or not to weight normalize the last layer of the DINO head.
Not normalizing leads to better performance but can make the training unstable.
In our experiments, we typically set this parameter to False with vit_small and True with vit_base.""")
# Teacher EMA update parameter
parser.add_argument('--momentum_teacher', default=0.996, type=float, help="""Base EMA
parameter for teacher update. The value is increased to 1 during training with cosine schedule.
We recommend setting a higher value with small batches: for example use 0.9995 with batch size of 256.""")
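# Note: in main_dino.py the momentum_schedule used in train_one_epoch below is built roughly as
# utils.cosine_scheduler(args.momentum_teacher, 1, args.epochs, len(data_loader)),
# i.e. the EMA momentum increases from 0.996 towards 1.0 following a cosine schedule.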
# Whether to use batch normalization in the projection head
parser.add_argument('--use_bn_in_head', default=False, type=utils.bool_flag,
help="Whether to use batch normalizations in projection head (Default: False)")
# Temperature teacher parameters
parser.add_argument('--warmup_teacher_temp', default=0.04, type=float,
help="""Initial value for the teacher temperature: 0.04 works well in most cases.
Try decreasing it if the training loss does not decrease.""")
parser.add_argument('--teacher_temp', default=0.04, type=float, help="""Final value (after linear warmup)
of the teacher temperature. For most experiments, anything above 0.07 is unstable. We recommend
starting with the default value of 0.04 and increase this slightly if needed.""")
parser.add_argument('--warmup_teacher_temp_epochs', default=0, type=int,
help='Number of warmup epochs for the teacher temperature (Default: 30).')
# Training/optimization parameters
# Train with half precision: improves training time and memory requirements, but can cause instability and a slight drop in performance
parser.add_argument('--use_fp16', type=utils.bool_flag, default=True, help="""Whether or not
to use half precision for training. Improves training time and memory requirements,
but can provoke instability and slight decay of performance. We recommend disabling
mixed precision if the loss is unstable, if reducing the patch size or if training with bigger ViTs.""")
parser.add_argument('--weight_decay', type=float, default=0.04, help="""Initial value of the
weight decay. With ViT, a smaller value at the beginning of training works well.""")
parser.add_argument('--weight_decay_end', type=float, default=0.4, help="""Final value of the
weight decay. We use a cosine schedule for WD and using a larger decay by
the end of training improves performance for ViTs.""")
parser.add_argument('--clip_grad', type=float, default=3.0, help="""Maximal parameter
gradient norm if using gradient clipping. Clipping with norm .3 ~ 1.0 can
help optimization for larger ViT architectures. 0 for disabling.""")
parser.add_argument('--batch_size_per_gpu', default=64, type=int,
help='Per-GPU batch-size : number of distinct images loaded on one GPU.')
parser.add_argument('--epochs', default=100, type=int, help='Number of epochs of training.')
parser.add_argument('--freeze_last_layer', default=1, type=int, help="""Number of epochs
during which we keep the output layer fixed. Typically doing so during
the first epoch helps training. Try increasing this value if the loss does not decrease.""")
parser.add_argument("--lr", default=0.0005, type=float, help="""Learning rate at the end of
linear warmup (highest LR used during training). The learning rate is linearly scaled
with the batch size, and specified here for a reference batch size of 256.""")
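# Note: in main_dino.py the learning rate above is scaled linearly with the total batch size,
# i.e. args.lr * (args.batch_size_per_gpu * world_size) / 256. With the defaults on 8 GPUs
# this gives 0.0005 * (64 * 8) / 256 = 0.001 at the end of warmup.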
parser.add_argument("--warmup_epochs", default=10, type=int,
help="Number of epochs for the linear learning-rate warm up.")
parser.add_argument('--min_lr', type=float, default=1e-6, help="""Target LR at the
end of optimization. We use a cosine LR schedule with linear warmup.""")
parser.add_argument('--optimizer', default='adamw', type=str,
choices=['adamw', 'sgd', 'lars'], help="""Type of optimizer. We recommend using adamw with ViTs.""")
parser.add_argument('--drop_path_rate', type=float, default=0.1, help="stochastic depth rate")
# Multi-crop parameters
parser.add_argument('--global_crops_scale', type=float, nargs='+', default=(0.4, 1.),
help="""Scale range of the cropped image before resizing, relatively to the origin image.
Used for large global view cropping. When disabling multi-crop (--local_crops_number 0), we
recommend using a wider range of scale ("--global_crops_scale 0.14 1." for example)""")
parser.add_argument('--local_crops_number', type=int, default=8, help="""Number of small
local views to generate. Set this parameter to 0 to disable multi-crop training.
When disabling multi-crop we recommend to use "--global_crops_scale 0.14 1." """)
parser.add_argument('--local_crops_scale', type=float, nargs='+', default=(0.05, 0.4),
help="""Scale range of the cropped image before resizing, relatively to the origin image.
Used for small local view cropping of multi-crop.""")
# Misc: data path, output path, and other miscellaneous options
………………(omitted)
return parser
if __name__ == '__main__':
# Parse the arguments
parser = argparse.ArgumentParser('DINO', parents=[get_args_parser()])
args = parser.parse_args()
# Create the output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
# Start training
train_dino(args)
class DINOLoss(nn.Module):
def __init__(self, out_dim, ncrops, warmup_teacher_temp, teacher_temp,
warmup_teacher_temp_epochs, nepochs, student_temp=0.1,
center_momentum=0.9):
super().__init__()
self.student_temp = student_temp
self.center_momentum = center_momentum
self.ncrops = ncrops
self.register_buffer("center", torch.zeros(1, out_dim))
# we apply a warm up for the teacher temperature because
# a too high temperature makes the training instable at the beginning
self.teacher_temp_schedule = np.concatenate((
np.linspace(warmup_teacher_temp,
teacher_temp, warmup_teacher_temp_epochs),
np.ones(nepochs - warmup_teacher_temp_epochs) * teacher_temp
))
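# For example, with the suggested --teacher_temp 0.07 --warmup_teacher_temp_epochs 30
# (and the default warmup_teacher_temp=0.04), the teacher temperature rises linearly
# from 0.04 to 0.07 over the first 30 epochs and then stays at 0.07.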
def forward(self, student_output, teacher_output, epoch):
"""
Cross-entropy between softmax outputs of the teacher and student networks.
"""
student_out = student_output / self.student_temp
student_out = student_out.chunk(self.ncrops)
# The centering of the teacher output flattens its distribution; the low-temperature softmax then sharpens it
temp = self.teacher_temp_schedule[epoch]
teacher_out = F.softmax((teacher_output - self.center) / temp, dim=-1)
teacher_out = teacher_out.detach().chunk(2)
total_loss = 0
n_loss_terms = 0
for iq, q in enumerate(teacher_out):
for v in range(len(student_out)):
if v == iq:
# we skip cases where student and teacher operate on the same view
continue
loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
total_loss += loss.mean()
n_loss_terms += 1
total_loss /= n_loss_terms
self.update_center(teacher_output)
return total_loss
# Update of the center applied to the teacher output
@torch.no_grad()
def update_center(self, teacher_output):
"""
Update center used for teacher output.
"""
batch_center = torch.sum(teacher_output, dim=0, keepdim=True)
dist.all_reduce(batch_center)
batch_center = batch_center / (len(teacher_output) * dist.get_world_size())
# ema update
self.center = self.center * self.center_momentum + batch_center * (1 - self.center_momentum)
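For context, this is roughly how main_dino.py instantiates the loss from the arguments defined earlier (a condensed sketch; the exact call sits inside train_dino):

# Each image yields 2 global crops plus local_crops_number local crops,
# so the loss sees ncrops = local_crops_number + 2 student views.
dino_loss = DINOLoss(
    args.out_dim,
    args.local_crops_number + 2,
    args.warmup_teacher_temp,
    args.teacher_temp,
    args.warmup_teacher_temp_epochs,
    args.epochs,
).cuda()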
def train_one_epoch(student, teacher, teacher_without_ddp, dino_loss, data_loader,
optimizer, lr_schedule, wd_schedule, momentum_schedule,epoch,
fp16_scaler, args):
metric_logger = utils.MetricLogger(delimiter=" ")
header = 'Epoch: [{}/{}]'.format(epoch, args.epochs)
for it, (images, _) in enumerate(metric_logger.log_every(data_loader, 10, header)):
# update weight decay and learning rate according to their schedule
it = len(data_loader) * epoch + it # global training iteration
for i, param_group in enumerate(optimizer.param_groups):
param_group["lr"] = lr_schedule[it]
if i == 0: # only the first group is regularized
param_group["weight_decay"] = wd_schedule[it]
# move images to gpu
images = [im.cuda(non_blocking=True) for im in images]
# teacher and student forward passes + compute dino loss
with torch.cuda.amp.autocast(fp16_scaler is not None):
teacher_output = teacher(images[:2]) # only the 2 global views pass through the teacher
student_output = student(images)
loss = dino_loss(student_output, teacher_output, epoch)
if not math.isfinite(loss.item()):
print("Loss is {}, stopping training".format(loss.item()), force=True)
sys.exit(1)
# Student network weight update
optimizer.zero_grad()
param_norms = None
if fp16_scaler is None:
loss.backward()
if args.clip_grad:
param_norms = utils.clip_gradients(student, args.clip_grad)
utils.cancel_gradients_last_layer(epoch, student,
args.freeze_last_layer)
optimizer.step()
else:
fp16_scaler.scale(loss).backward()
if args.clip_grad:
fp16_scaler.unscale_(optimizer) # unscale the gradients of optimizer's assigned params in-place
param_norms = utils.clip_gradients(student, args.clip_grad)
utils.cancel_gradients_last_layer(epoch, student,
args.freeze_last_layer)
fp16_scaler.step(optimizer)
fp16_scaler.update()
# Teacher parameter update: momentum update using the student parameters
# EMA update for the teacher
with torch.no_grad():
m = momentum_schedule[it] # momentum parameter
for param_q, param_k in zip(student.module.parameters(), teacher_without_ddp.parameters()):
param_k.data.mul_(m).add_((1 - m) * param_q.detach().data)
# logging
torch.cuda.synchronize()
metric_logger.update(loss=loss.item())
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
metric_logger.update(wd=optimizer.param_groups[0]["weight_decay"])
# gather the stats from all processes
metric_logger.synchronize_between_processes()
print("Averaged stats:", metric_logger)
return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
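To place train_one_epoch in context, here is a condensed sketch of the surrounding epoch loop in train_dino (checkpoint contents and log writing are simplified, and the fp16 scaler state is omitted):

for epoch in range(start_epoch, args.epochs):
    data_loader.sampler.set_epoch(epoch)

    # one full pass over the training set
    train_stats = train_one_epoch(student, teacher, teacher_without_ddp, dino_loss,
                                  data_loader, optimizer, lr_schedule, wd_schedule,
                                  momentum_schedule, epoch, fp16_scaler, args)

    # save a resumable checkpoint (main process only)
    save_dict = {
        'student': student.state_dict(),
        'teacher': teacher.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch + 1,
        'args': args,
        'dino_loss': dino_loss.state_dict(),
    }
    utils.save_on_master(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))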