Paper link: https://arxiv.org/abs/2104.14294
Code link: https://github.com/facebookresearch/dino
You can choose to download only the weights of the pretrained backbone for downstream tasks, or the full checkpoint containing the backbone and projection head weights for both the student and teacher networks. The backbone is also provided in ONNX format, together with detailed arguments and training/evaluation logs. Note that "DeiT-S" and "ViT-S" refer to the same architecture.
import torch

vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
vitb16 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')
vitb8 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
xcit_small_12_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p16')
xcit_small_12_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p8')
xcit_medium_24_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p16')
xcit_medium_24_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p8')
resnet50 = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')
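As a quick illustration of how one of these backbones can be used for a downstream task, here is a minimal feature-extraction sketch. The 224x224 preprocessing, the ImageNet normalization constants, and the image path example.jpg are illustrative assumptions, not part of the repository.

import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet-style preprocessing (an assumption; adapt it to your downstream pipeline).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits16.eval()

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    feats = vits16(img)  # backbone output: the [CLS] embedding, shape (1, 384) for ViT-S/16
print(feats.shape)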
Please install PyTorch and download the ImageNet dataset. This codebase was developed with Python 3.6, PyTorch 1.7.1, CUDA 11.0 and torchvision 0.8.2. The exact arguments to reproduce the models presented in the paper can be found in the "args" column of the pretrained models section. For the full documentation of DINO training, run:
python main_dino.py --help
Run DINO with the ViT-small network on a single node with 8 GPUs for 100 epochs with the following command. Training time is 1.75 days and the resulting checkpoint should reach 69.3% on k-NN evaluation and 74.0% on linear evaluation. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
We use Slurm and submitit (pip install submitit). To train on 2 nodes with 8 GPUs each (a total of 16 GPUs):
python run_with_submitit.py --nodes 2 --ngpus 8 --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
python run_with_submitit.py --nodes 2 --ngpus 8 --use_volta32 --arch vit_base --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
You can improve the performance of this run by:
- training for more epochs: --epochs 300,
- increasing the teacher temperature: --teacher_temp 0.07 --warmup_teacher_temp_epochs 30,
- removing last layer normalization (only safe with --arch vit_small): --norm_last_layer false.
python run_with_submitit.py --arch vit_small --epochs 300 --teacher_temp 0.07 --warmup_teacher_temp_epochs 30 --norm_last_layer false --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
The resulting pretrained model should reach 73.3% on k-NN evaluation and 76.0% on linear evaluation. Training time is 2.6 days with 16 GPUs. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
This code also works for training DINO on convolutional networks, such as ResNet-50. In that case we strongly recommend adapting some optimization arguments. For example, the following is the command to train DINO on ResNet-50 for 100 epochs on a single node with 8 GPUs. We provide training logs for this run.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch resnet50 --optimizer sgd --lr 0.03 --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
You can look at the self-attention of the [CLS] token on the different heads of the last layer by running:
python visualize_attention.py
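For a rough idea of what visualize_attention.py computes, the sketch below extracts the last-layer [CLS] self-attention from a hub-loaded ViT-S/8. The image path and the 480-pixel resize are illustrative assumptions; the snippet relies on the get_last_selfattention method defined in the repo's vision_transformer.py.

import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

transform = transforms.Compose([
    transforms.Resize(480),  # illustrative resolution
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical image path

# Crop so that height and width are divisible by the patch size (8 here).
_, _, h, w = img.shape
img = img[:, :, :h - h % 8, :w - w % 8]

with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, num_heads, num_tokens, num_tokens)

# Attention of the [CLS] token (query index 0) over the patch tokens, one map per head.
cls_attn = attn[0, :, 0, 1:]
print(cls_attn.shape)  # (num_heads, num_patches)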
To evaluate a simple k-NN classifier with a single GPU on a pretrained model, run:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --data_path /path/to/imagenet
If you choose not to specify --pretrained_weights, then DINO reference weights are used by default. If instead you want to evaluate checkpoints from a run of your own, you can run for example:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --pretrained_weights /path/to/checkpoint.pth --checkpoint_key teacher --data_path /path/to/imagenet
To train a supervised linear classifier on frozen weights on a single node with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node=8 eval_linear.py --data_path /path/to/imagenet
We release the logs and weights from evaluating the different models:
You can check the performance of the pretrained weights on the ImageNet validation set by running the following command lines:
python eval_linear.py --evaluate --arch vit_small --patch_size 16 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_small --patch_size 8 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 16 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 8 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch resnet50 --data_path /path/to/imagenet/train
import argparse
import os
import sys
import datetime
import time
import math
import json
from pathlib import Path
import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torchvision import models as torchvision_models
import utils
import vision_transformer as vits
from vision_transformer import DINOHead
# Model architecture (backbone)
parser.add_argument('--arch', default='vit_small', type=str,
choices=['vit_tiny', 'vit_small', 'vit_base', 'xcit', 'deit_tiny', 'deit_small'] \
+ torchvision_archs + torch.hub.list("facebookresearch/xcit:main"),
help="""Name of architecture to train. For quick experiments with ViTs,
we recommend using vit_tiny or vit_small.""")
# ViT patch size: smaller values give better performance but require more memory. Applies only to ViTs (vit_tiny, vit_small and vit_base)
parser.add_argument('--patch_size', default=16, type=int, help="""Size in pixels
of input square patches - default 16 (for 16x16 patches). Using smaller
values leads to better performance but requires more memory. Applies only
for ViTs (vit_tiny, vit_small and vit_base). If <16, we recommend disabling
mixed precision training (--use_fp16 false) to avoid instabilities.""")
# Dimensionality of the DINO head output
parser.add_argument('--out_dim', default=65536, type=int, help="""Dimensionality of
the DINO head output. For complex and large datasets large values (like 65k) work well.""")
# Whether to weight-normalize the last layer of the DINO head
parser.add_argument('--norm_last_layer', default=True, type=utils.bool_flag,
help="""Whether or not to weight normalize the last layer of the DINO head.
Not normalizing leads to better performance but can make the training unstable.
In our experiments, we typically set this parameter to False with vit_small and True with vit_base.""")
# Teacher EMA update parameter
parser.add_argument('--momentum_teacher', default=0.996, type=float, help="""Base EMA
parameter for teacher update. The value is increased to 1 during training with cosine schedule.
We recommend setting a higher value with small batches: for example use 0.9995 with batch size of 256.""")
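# Note: in main_dino.py the momentum_schedule used in train_one_epoch below is built roughly as
# utils.cosine_scheduler(args.momentum_teacher, 1, args.epochs, len(data_loader)),
# i.e. the EMA momentum increases from 0.996 towards 1.0 following a cosine schedule.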
# Whether to use batch normalization in the projection head
parser.add_argument('--use_bn_in_head', default=False, type=utils.bool_flag,
help="Whether to use batch normalizations in projection head (Default: False)")
# Temperature teacher parameters
parser.add_argument('--warmup_teacher_temp', default=0.04, type=float,
help="""Initial value for the teacher temperature: 0.04 works well in most cases.
Try decreasing it if the training loss does not decrease.""")
parser.add_argument('--teacher_temp', default=0.04, type=float, help="""Final value (after linear warmup)
of the teacher temperature. For most experiments, anything above 0.07 is unstable. We recommend
starting with the default value of 0.04 and increase this slightly if needed.""")
parser.add_argument('--warmup_teacher_temp_epochs', default=0, type=int,
help='Number of warmup epochs for the teacher temperature (Default: 30).')
# Training/optimization parameters
# Train with half precision: improves training time and memory requirements, but can cause instability and a slight drop in performance
parser.add_argument('--use_fp16', type=utils.bool_flag, default=True, help="""Whether or not
to use half precision for training. Improves training time and memory requirements,
but can provoke instability and slight decay of performance. We recommend disabling
mixed precision if the loss is unstable, if reducing the patch size or if training with bigger ViTs.""")
parser.add_argument('--weight_decay', type=float, default=0.04, help="""Initial value of the
weight decay. With ViT, a smaller value at the beginning of training works well.""")
parser.add_argument('--weight_decay_end', type=float, default=0.4, help="""Final value of the
weight decay. We use a cosine schedule for WD and using a larger decay by
the end of training improves performance for ViTs.""")
parser.add_argument('--clip_grad', type=float, default=3.0, help="""Maximal parameter
gradient norm if using gradient clipping. Clipping with norm .3 ~ 1.0 can
help optimization for larger ViT architectures. 0 for disabling.""")
parser.add_argument('--batch_size_per_gpu', default=64, type=int,
help='Per-GPU batch-size : number of distinct images loaded on one GPU.')
parser.add_argument('--epochs', default=100, type=int, help='Number of epochs of training.')
parser.add_argument('--freeze_last_layer', default=1, type=int, help="""Number of epochs
during which we keep the output layer fixed. Typically doing so during
the first epoch helps training. Try increasing this value if the loss does not decrease.""")
parser.add_argument("--lr", default=0.0005, type=float, help="""Learning rate at the end of
linear warmup (highest LR used during training). The learning rate is linearly scaled
with the batch size, and specified here for a reference batch size of 256.""")
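# Note: in main_dino.py the learning rate above is scaled linearly with the total batch size,
# i.e. args.lr * (args.batch_size_per_gpu * world_size) / 256. With the defaults on 8 GPUs
# this gives 0.0005 * (64 * 8) / 256 = 0.001 at the end of warmup.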
parser.add_argument("--warmup_epochs", default=10, type=int,
help="Number of epochs for the linear learning-rate warm up.")
parser.add_argument('--min_lr', type=float, default=1e-6, help="""Target LR at the
end of optimization. We use a cosine LR schedule with linear warmup.""")
parser.add_argument('--optimizer', default='adamw', type=str,
choices=['adamw', 'sgd', 'lars'], help="""Type of optimizer. We recommend using adamw with ViTs.""")
parser.add_argument('--drop_path_rate', type=float, default=0.1, help="stochastic depth rate")
# Multi-crop parameters
parser.add_argument('--global_crops_scale', type=float, nargs='+', default=(0.4, 1.),
help="""Scale range of the cropped image before resizing, relatively to the origin image.
Used for large global view cropping. When disabling multi-crop (--local_crops_number 0), we
recommend using a wider range of scale ("--global_crops_scale 0.14 1." for example)""")
parser.add_argument('--local_crops_number', type=int, default=8, help="""Number of small
local views to generate. Set this parameter to 0 to disable multi-crop training.
When disabling multi-crop we recommend to use "--global_crops_scale 0.14 1." """)
parser.add_argument('--local_crops_scale', type=float, nargs='+', default=(0.05, 0.4),
help="""Scale range of the cropped image before resizing, relatively to the origin image.
Used for small local view cropping of multi-crop.""")
# Misc: data path, output path, and other miscellaneous options
………………(omitted)
return parser
if __name__ == '__main__':
# Parse the arguments
parser = argparse.ArgumentParser('DINO', parents=[get_args_parser()])
args = parser.parse_args()
# Create the output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
# Start training
train_dino(args)
class DINOLoss(nn.Module):
def __init__(self, out_dim, ncrops, warmup_teacher_temp, teacher_temp,
warmup_teacher_temp_epochs, nepochs, student_temp=0.1,
center_momentum=0.9):
super().__init__()
self.student_temp = student_temp
self.center_momentum = center_momentum
self.ncrops = ncrops
self.register_buffer("center", torch.zeros(1, out_dim))
# we apply a warm up for the teacher temperature because
# a too high temperature makes the training instable at the beginning
self.teacher_temp_schedule = np.concatenate((
np.linspace(warmup_teacher_temp,
teacher_temp, warmup_teacher_temp_epochs),
np.ones(nepochs - warmup_teacher_temp_epochs) * teacher_temp
))
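# For example, with the suggested --teacher_temp 0.07 --warmup_teacher_temp_epochs 30
# (and the default warmup_teacher_temp=0.04), the teacher temperature rises linearly
# from 0.04 to 0.07 over the first 30 epochs and then stays at 0.07.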
def forward(self, student_output, teacher_output, epoch):
"""
Cross-entropy between softmax outputs of the teacher and student networks.
"""
student_out = student_output / self.student_temp
student_out = student_out.chunk(self.ncrops)
# The centering of the teacher output flattens its distribution; the low-temperature softmax then sharpens it
temp = self.teacher_temp_schedule[epoch]
teacher_out = F.softmax((teacher_output - self.center) / temp, dim=-1)
teacher_out = teacher_out.detach().chunk(2)
total_loss = 0
n_loss_terms = 0
for iq, q in enumerate(teacher_out):
for v in range(len(student_out)):
if v == iq:
# we skip cases where student and teacher operate on the same view
continue
loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
total_loss += loss.mean()
n_loss_terms += 1
total_loss /= n_loss_terms
self.update_center(teacher_output)
return total_loss
# Update of the center applied to the teacher output
@torch.no_grad()
def update_center(self, teacher_output):
"""
Update center used for teacher output.
"""
batch_center = torch.sum(teacher_output, dim=0, keepdim=True)
dist.all_reduce(batch_center)
batch_center = batch_center / (len(teacher_output) * dist.get_world_size())
# ema update
self.center = self.center * self.center_momentum + batch_center * (1 - self.center_momentum)
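For context, this is roughly how main_dino.py instantiates the loss from the arguments defined earlier (a condensed sketch; the exact call sits inside train_dino):

# Each image yields 2 global crops plus local_crops_number local crops,
# so the loss sees ncrops = local_crops_number + 2 student views.
dino_loss = DINOLoss(
    args.out_dim,
    args.local_crops_number + 2,
    args.warmup_teacher_temp,
    args.teacher_temp,
    args.warmup_teacher_temp_epochs,
    args.epochs,
).cuda()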
def train_one_epoch(student, teacher, teacher_without_ddp, dino_loss, data_loader,
optimizer, lr_schedule, wd_schedule, momentum_schedule,epoch,
fp16_scaler, args):
metric_logger = utils.MetricLogger(delimiter=" ")
header = 'Epoch: [{}/{}]'.format(epoch, args.epochs)
for it, (images, _) in enumerate(metric_logger.log_every(data_loader, 10, header)):
# update weight decay and learning rate according to their schedule
it = len(data_loader) * epoch + it # global training iteration
for i, param_group in enumerate(optimizer.param_groups):
param_group["lr"] = lr_schedule[it]
if i == 0: # only the first group is regularized
param_group["weight_decay"] = wd_schedule[it]
# move images to gpu
images = [im.cuda(non_blocking=True) for im in images]
# teacher and student forward passes + compute dino loss
with torch.cuda.amp.autocast(fp16_scaler is not None):
teacher_output = teacher(images[:2]) # only the 2 global views pass through the teacher
student_output = student(images)
loss = dino_loss(student_output, teacher_output, epoch)
if not math.isfinite(loss.item()):
print("Loss is {}, stopping training".format(loss.item()), force=True)
sys.exit(1)
# Student network weight update
optimizer.zero_grad()
param_norms = None
if fp16_scaler is None:
loss.backward()
if args.clip_grad:
param_norms = utils.clip_gradients(student, args.clip_grad)
utils.cancel_gradients_last_layer(epoch, student,
args.freeze_last_layer)
optimizer.step()
else:
fp16_scaler.scale(loss).backward()
if args.clip_grad:
fp16_scaler.unscale_(optimizer) # unscale the gradients of optimizer's assigned params in-place
param_norms = utils.clip_gradients(student, args.clip_grad)
utils.cancel_gradients_last_layer(epoch, student,
args.freeze_last_layer)
fp16_scaler.step(optimizer)
fp16_scaler.update()
# Teacher parameter update: momentum update using the student parameters
# EMA update for the teacher
with torch.no_grad():
m = momentum_schedule[it] # momentum parameter
for param_q, param_k in zip(student.module.parameters(), teacher_without_ddp.parameters()):
param_k.data.mul_(m).add_((1 - m) * param_q.detach().data)
# logging
torch.cuda.synchronize()
metric_logger.update(loss=loss.item())
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
metric_logger.update(wd=optimizer.param_groups[0]["weight_decay"])
# gather the stats from all processes
metric_logger.synchronize_between_processes()
print("Averaged stats:", metric_logger)
return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
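To place train_one_epoch in context, here is a condensed sketch of the surrounding epoch loop in train_dino (checkpoint contents and log writing are simplified, and the fp16 scaler state is omitted):

for epoch in range(start_epoch, args.epochs):
    data_loader.sampler.set_epoch(epoch)

    # one full pass over the training set
    train_stats = train_one_epoch(student, teacher, teacher_without_ddp, dino_loss,
                                  data_loader, optimizer, lr_schedule, wd_schedule,
                                  momentum_schedule, epoch, fp16_scaler, args)

    # save a resumable checkpoint (main process only)
    save_dict = {
        'student': student.state_dict(),
        'teacher': teacher.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch + 1,
        'args': args,
        'dino_loss': dino_loss.state_dict(),
    }
    utils.save_on_master(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))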