
Lightweight Transformers: A New Chapter for Computer Vision

1. Compact Convolutional Transformers

       ViT performs poorly on small datasets, and this is a very practical problem: when a large dataset is not available and there is no suitable pretrained model, so the model has to be trained from scratch, the ViT architecture tends to fall behind CNNs. Interestingly, this paper does not introduce a large amount of convolution; simply adjusting the patch size and using SeqPool is enough to reach good accuracy.

       The core design idea of CCT is to combine the strong local feature extraction of convolutional networks with the global modeling ability of Transformers. Carefully designed convolutional modules extract local details such as object edges and textures, while the Transformer modules model long-range dependencies between different local regions, giving the model a more complete understanding of the overall structure and semantics of the image.

The core contributions are:

  • ViT-Lite can be trained from scratch on small datasets and still reach high accuracy, dispelling the myth that Transformers always need large amounts of data.
  • CVT (Compact Vision Transformer) introduces a new sequence pooling strategy (SeqPool) so that the Transformer no longer needs a class token (see the short sketch after this list).
  • CCT (Compact Convolutional Transformer) further improves accuracy and makes the input image size more flexible.
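
Sequence pooling (SeqPool) replaces the class token with an attention-weighted average of the output tokens: a single Dense layer assigns a score to each token, the scores are softmax-normalized over the sequence, and the tokens are averaged with those weights. A minimal sketch of the computation (toy shapes chosen purely for illustration):

import tensorflow as tf
from keras import layers

# Toy sequence: batch of 1, 5 tokens, dimension 8.
tokens = tf.random.normal((1, 5, 8))
scores = layers.Dense(1)(tokens)                       # (1, 5, 1): one scalar score per token
weights = tf.nn.softmax(scores, axis=1)                # normalize the scores over the sequence
pooled = tf.matmul(weights, tokens, transpose_a=True)  # (1, 1, 8): attention-weighted average
pooled = tf.squeeze(pooled, axis=1)                    # (1, 8): this replaces the class token
print(pooled.shape)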

(1) ViT-Lite

(2) CVT

(3) CCT

Unlike the standard ViT, which simply splits the image into uniform, non-overlapping patches, the CCTTokenizer introduces a small fully convolutional network. Each convolutional layer scans the image and extracts features at a different level; the zero-padding layers preserve the spatial size so that no border information is lost, and the max-pooling layers reduce the spatial resolution while keeping the most salient features.
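
As a quick sanity check of the shapes, the tokenizer's mini-network with the defaults used in the code below (two 3x3 "valid" convolutions, ZeroPadding2D(1), and 3x3 / stride-2 "same" max pooling) maps a 32x32x3 CIFAR-10 image to an 8x8x128 feature map, i.e. a sequence of 64 tokens of dimension 128:

import tensorflow as tf
from keras import layers

x = tf.zeros((1, 32, 32, 3))
for channels in (64, 128):
    x = layers.Conv2D(channels, 3, 1, padding="valid", activation="relu")(x)  # 32 -> 30, then 16 -> 14
    x = layers.ZeroPadding2D(1)(x)                                            # 30 -> 32, then 14 -> 16
    x = layers.MaxPooling2D(3, 2, "same")(x)                                  # 32 -> 16, then 16 -> 8
print(x.shape)  # (1, 8, 8, 128): flattened into a sequence of 64 tokens of dimension 128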

CCT Model Construction and Training

(1) Data Preprocessing and Augmentation

Data preprocessing and augmentation is the essential first step in training the CCT model. This experiment uses the CIFAR-10 dataset, which contains 60,000 color images in 10 classes, 6,000 per class, covering airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, and thus provides a rich and diverse set of training samples.
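
A minimal sketch for loading and previewing a few samples (standard keras.datasets API; the class-name list is written out here just for display):

import matplotlib.pyplot as plt
import keras

(x_train, y_train), _ = keras.datasets.cifar10.load_data()
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, img, label in zip(axes, x_train[:5], y_train[:5]):
    ax.imshow(img)
    ax.set_title(class_names[int(label[0])])
    ax.axis("off")
plt.show()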

(2) Building the Model (Keras)

# Imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from keras import layers

# Hyperparameters
positional_emb = True
conv_layers = 2
projection_dim = 128
num_heads = 2
transformer_units = [
    projection_dim,
    projection_dim,
]
transformer_layers = 2
stochastic_depth_rate = 0.1
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 128
num_epochs = 30
image_size = 32

# Dataset: CIFAR-10
num_classes = 10
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")


# CCT tokenizer: a small convolutional network that turns an image into a token sequence.
class CCTTokenizer(layers.Layer):
    def __init__(
        self,
        kernel_size=3,
        stride=1,
        padding=1,
        pooling_kernel_size=3,
        pooling_stride=2,
        num_conv_layers=conv_layers,
        num_output_channels=[64, 128],
        positional_emb=positional_emb,
        **kwargs,
    ):
        super().__init__(**kwargs)
        # This is our tokenizer.
        self.conv_model = keras.Sequential()
        for i in range(num_conv_layers):
            self.conv_model.add(
                layers.Conv2D(
                    num_output_channels[i],
                    kernel_size,
                    stride,
                    padding="valid",
                    use_bias=False,
                    activation="relu",
                    kernel_initializer="he_normal",
                )
            )
            self.conv_model.add(layers.ZeroPadding2D(padding))
            self.conv_model.add(
                layers.MaxPooling2D(pooling_kernel_size, pooling_stride, "same")
            )
        self.positional_emb = positional_emb

    def call(self, images):
        outputs = self.conv_model(images)
        # After passing the images through our mini-network the spatial dimensions
        # are flattened to form sequences.
        batch_size = tf.shape(outputs)[0]
        h = tf.shape(outputs)[1]
        w = tf.shape(outputs)[2]
        c = tf.shape(outputs)[3]
        reshaped = tf.reshape(outputs, (batch_size, h * w, c))
        return reshaped


# Learnable positional embedding.
class PositionEmbedding(keras.layers.Layer):
    def __init__(
        self,
        sequence_length,
        initializer="glorot_uniform",
        **kwargs,
    ):
        super().__init__(**kwargs)
        if sequence_length is None:
            raise ValueError("`sequence_length` must be an Integer, received `None`.")
        self.sequence_length = int(sequence_length)
        self.initializer = keras.initializers.get(initializer)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "initializer": keras.initializers.serialize(self.initializer),
            }
        )
        return config

    def build(self, input_shape):
        feature_size = input_shape[-1]
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=[self.sequence_length, feature_size],
            initializer=self.initializer,
            trainable=True,
        )
        super().build(input_shape)

    def call(self, inputs, start_index=0):
        shape = tf.shape(inputs)
        feature_length = shape[-1]
        sequence_length = shape[-2]
        # Trim to match the length of the input sequence, which might be less
        # than the sequence_length of the layer.
        position_embeddings = tf.convert_to_tensor(self.position_embeddings)
        position_embeddings = tf.slice(
            position_embeddings,
            [start_index, 0],
            [sequence_length, feature_length],
        )
        return tf.broadcast_to(position_embeddings, shape)

    def compute_output_shape(self, input_shape):
        return input_shape


# Sequence pooling (SeqPool): an attention-weighted average of the output tokens,
# used instead of a class token.
class SequencePooling(layers.Layer):
    def __init__(self):
        super().__init__()
        self.attention = layers.Dense(1)

    def call(self, x):
        attention_weights = tf.nn.softmax(self.attention(x), axis=1)
        attention_weights = tf.transpose(attention_weights, perm=(0, 2, 1))
        weighted_representation = tf.matmul(attention_weights, x)
        return tf.squeeze(weighted_representation, -2)


# Stochastic depth: randomly drops the whole residual branch during training.
class StochasticDepth(layers.Layer):
    def __init__(self, drop_prop, **kwargs):
        super().__init__(**kwargs)
        self.drop_prob = drop_prop

    def call(self, x, training=None):
        if training:
            keep_prob = 1 - self.drop_prob
            shape = (tf.shape(x)[0],) + (1,) * (len(x.shape) - 1)
            random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
            random_tensor = tf.floor(random_tensor)
            return (x / keep_prob) * random_tensor
        return x


# MLP block used inside the Transformer encoder.
def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x


# Data augmentation pipeline.
data_augmentation = keras.Sequential(
    [
        layers.Rescaling(scale=1.0 / 255),
        layers.RandomCrop(image_size, image_size),
        layers.RandomFlip("horizontal"),
    ],
    name="data_augmentation",
)


def create_cct_model(
    image_size=image_size,
    input_shape=input_shape,
    num_heads=num_heads,
    projection_dim=projection_dim,
    transformer_units=transformer_units,
):
    inputs = layers.Input(input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Encode patches.
    cct_tokenizer = CCTTokenizer()
    encoded_patches = cct_tokenizer(augmented)
    # Apply positional embedding.
    if positional_emb:
        sequence_length = encoded_patches.shape[1]
        encoded_patches += PositionEmbedding(sequence_length=sequence_length)(
            encoded_patches
        )
    # Calculate Stochastic Depth probabilities.
    dpr = [x for x in np.linspace(0, stochastic_depth_rate, transformer_layers)]
    # Create multiple layers of the Transformer block.
    for i in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-5)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        attention_output = StochasticDepth(dpr[i])(attention_output)
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-5)(x2)
        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        x3 = StochasticDepth(dpr[i])(x3)
        encoded_patches = layers.Add()([x3, x2])
    # Apply sequence pooling.
    representation = layers.LayerNormalization(epsilon=1e-5)(encoded_patches)
    weighted_representation = SequencePooling()(representation)
    # Classify outputs.
    logits = layers.Dense(num_classes)(weighted_representation)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model


def run_experiment(model):
    optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.0001)

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.CategoricalCrossentropy(
            from_logits=True, label_smoothing=0.1
        ),
        metrics=[
            keras.metrics.CategoricalAccuracy(name="accuracy"),
            keras.metrics.TopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )

    checkpoint_filepath = "/tmp/checkpoint.weights.h5"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")
    return history


cct_model = create_cct_model()
history = run_experiment(cct_model)

Model Performance
Based on the implementation and comments above, the CCT model performs efficiently on the CIFAR-10 dataset:

Parameter count: about 0.4 million parameters, only about 9% of a standard ViT
Accuracy: roughly 79% top-1 accuracy after 30 epochs of training
Training speed: fast to train thanks to the small parameter count, suitable for limited compute budgets

Note that CCT is designed for parameter efficiency rather than absolute accuracy; in resource-constrained settings it offers a good balance between performance and resource consumption.
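
Assuming the training script above has already run, the parameter count and training curves can be checked with a short snippet like this (it reuses the cct_model and history objects created above; exact numbers vary between runs):

print(f"Total parameters: {cct_model.count_params():,}")

plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="val accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()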

References

https://arxiv.org/abs/2104.05704

https://github.com/SHI-Labs/Compact-Transformers

https://zhuanlan.zhihu.com/p/364589899


2. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Paper: https://arxiv.org/pdf/2110.02178

  MobileViT encodes both local and global information effectively. Unlike ViT and its variants, it learns global representations from a different perspective. A standard convolution involves three operations: unfolding, local processing, and folding. The MobileViT block uses a Transformer to replace the local modeling in convolution with global modeling. This gives the MobileViT block properties of both CNNs and ViTs and helps it learn better representations with fewer parameters and a simpler training recipe.
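
The official implementation is more involved, but the unfold → global Transformer processing → fold idea can be sketched in the same Keras style as the CCT code above. Note this is only an illustrative sketch, not the paper's exact architecture: the patch size, channel widths, and toy input below are assumptions made for demonstration.

import tensorflow as tf
from keras import layers

def unfold(x, ph, pw):
    # (B, H, W, C) -> (B*ph*pw, num_patches, C): pixels at the same relative
    # position inside each patch form one sequence, so attention spans the image.
    b = tf.shape(x)[0]
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    nh, nw = h // ph, w // pw
    x = tf.reshape(x, (b, nh, ph, nw, pw, c))
    x = tf.transpose(x, (0, 2, 4, 1, 3, 5))
    return tf.reshape(x, (b * ph * pw, nh * nw, c)), (nh, nw)

def fold(seq, b, ph, pw, nh, nw, c):
    # Inverse of unfold: (B*ph*pw, num_patches, C) -> (B, H, W, C).
    x = tf.reshape(seq, (b, ph, pw, nh, nw, c))
    x = tf.transpose(x, (0, 3, 1, 4, 2, 5))
    return tf.reshape(x, (b, nh * ph, nw * pw, c))

def mobilevit_block(x, dim=64, ph=2, pw=2, num_heads=2):
    # Local modeling with convolutions, as in a standard CNN block.
    local = layers.Conv2D(dim, 3, padding="same", activation="swish")(x)
    local = layers.Conv2D(dim, 1)(local)
    # Global modeling: self-attention over the unfolded patch sequences.
    b = tf.shape(local)[0]
    seq, (nh, nw) = unfold(local, ph, pw)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim)(seq, seq)
    seq = layers.LayerNormalization(epsilon=1e-5)(seq + attn)
    folded = fold(seq, b, ph, pw, nh, nw, dim)
    # Fold back and fuse the global features with the original input.
    fused = layers.Concatenate()([x, layers.Conv2D(x.shape[-1], 1)(folded)])
    return layers.Conv2D(x.shape[-1], 3, padding="same", activation="swish")(fused)

# Toy usage: a 32x32 feature map with 32 channels keeps its shape.
features = tf.random.normal((1, 32, 32, 32))
print(mobilevit_block(features).shape)  # (1, 32, 32, 32)

In the actual MobileViT network, blocks of this kind are interleaved with MobileNetV2-style inverted residual blocks at several stages.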
