Vision Transformer (ViT): Applying Transformers to Computer Vision (Part 4)
The second half of the Experiments section.
PRE-TRAINING DATA REQUIREMENTS
ViT is pre-trained on a very large dataset; in this section the paper examines how much the scale of that dataset actually matters, i.e., how much data it takes to make up for the inductive biases that the Transformer lacks. As the paper puts it: "The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments."
Experiment 1: pre-training dataset scale determines whether the larger model (ViT-L) can realize its advantage
The paper controls variables along two axes (a minimal sketch of this setup follows the list below):
- Model variable: ViT-Base (the smaller model) vs. ViT-Large (the larger model);
- Data variable: pre-training datasets of increasing scale, from small to medium to large ("First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M."):
  - Small: ImageNet (1.3M images, 1k classes);
  - Medium: ImageNet-21k (14M images, 21k classes);
  - Large: JFT-300M (300M images, 18k classes).
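To make the model axis of this grid concrete, here is a minimal sketch assuming PyTorch and the timm library (the model names `vit_base_patch16_224` and `vit_large_patch16_224` are timm conventions standing in for the paper's ViT-B/16 and ViT-L/16, not something the paper specifies); it instantiates both variants and compares their parameter counts:

```python
import timm  # assumption: timm's standard ViT definitions stand in for the paper's models

# The two model variants on the "model" axis (patch size 16, 224x224 input).
for name in ["vit_base_patch16_224", "vit_large_patch16_224"]:
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```

This prints roughly 86M parameters for ViT-Base versus roughly 304M for ViT-Large; the experiment then asks at which point along the data axis (ImageNet → ImageNet-21k → JFT-300M) that extra capacity starts to pay off.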