
RK3588 ArmNN CPU/GPU ResNet50 FP32/FP16/INT8 Inference Tests


    • Background and Goals
    • 1. Performance Data [the INT8 model's CPU inference results are incorrect; cause not yet analyzed]
    • 2. Procedure
      • 2.1 Generate the `onnx` model and quantized `tflite` models on `x86-Linux` (to avoid installing too many dependencies on the RK3588)
        • 2.1.1 Create a container
        • 2.1.2 Install dependencies
        • 2.1.3 Download the test image
        • 2.1.4 Generate the `onnx` model
        • 2.1.5 Convert `onnx` to `tensorflow` and verify the result
        • 2.1.6 Convert the `tensorflow saved_model` to `tflite` and verify the result
      • 2.2 Run the `ArmNN` inference tests on the `RK3588`
        • 2.2.1 Download the `ArmNN SDK` and set environment variables
        • 2.2.2 Create the `MSE` script
        • 2.2.3 Measure the `MSE` of `CPU-FP32`
        • 2.2.4 Measure the `MSE` of `GPU-FP32`
        • 2.2.5 Measure the `MSE` of `CPU-FP16`
        • 2.2.6 Measure the `MSE` of `GPU-FP16`
        • 2.2.7 Measure the `MSE` of `CPU-INT8`
        • 2.2.8 Measure the `MSE` of `GPU-INT8`
        • 2.2.9 Performance tests

Background and Goals

This article covers the following tasks on the RK3588:

  1. Convert the PyTorch ResNet-50 model to ONNX format
  2. Convert the ONNX model to TFLite (FP16 and INT8)
  3. Use ArmNN on the RK3588 to measure inference performance and accuracy loss for each precision on CPU and GPU
  4. Evaluate each model with two metrics: MSE (mean squared error, defined below) and execution time
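
For reference, the MSE used throughout is the mean squared difference between a model's 1000 output logits and the PyTorch FP32 reference output, which is what the `compute_mse.py` script in Section 2.2.2 computes:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2, \qquad N = 1000$$

where $y_i$ is the reference logit and $\hat{y}_i$ is the corresponding logit from the model under test.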

1. Performance Data [the INT8 model's CPU inference results are incorrect; the cause has not yet been analyzed]

  • Model: resnet50; input: [1,3,224,224] float32; output: [1,1000] float32

| Precision | CPU inference time (ms) | GPU inference time (ms) | Correctness (MSE) |
| --- | --- | --- | --- |
| ONNX FP32 | 175.39 | 71.74 | MSE: 0.00000 |
| TFLite FP16 | 117.10 | 66.84 | MSE: 0.00000 |
| TFLite INT8 | 46.88 (MSE: 4.33176) | 27.59 (MSE: 4.28963) | MSE from the x86 TFLite validation (2.1.6): 4.276822 |

2. Procedure

2.1 Generate the onnx model and quantized tflite models on x86-Linux (to avoid installing too many dependencies on the RK3588)

2.1.1 Create a container
docker run -it --privileged --net=host \
    -v $PWD:/home -w /home --rm nvcr.io/nvidia/pytorch:24.03-py3 /bin/bash
2.1.2 Install dependencies
pip install tensorflow==2.15 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow_probability==0.22 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install onnx onnx-tf  -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple
2.1.3 Download the test image
wget https://raw.githubusercontent.com/hi20240217/csdn_images/refs/heads/main/YellowLabradorLooking_new.jpg
2.1.4 Generate the onnx model
cat> resnet50.py <<-'EOF'
import requests
from PIL import Image
from io import BytesIO
import torchvision.transforms as transforms
import torch
import numpy as np
import torchvision.models as models

# Read the image
image = Image.open("YellowLabradorLooking_new.jpg")

# Define the preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Apply the preprocessing
img_t = preprocess(image)
input_tensor = torch.unsqueeze(img_t, 0).float()
input_np = input_tensor.numpy()

# Save the preprocessed input for the later quantization and accuracy comparison
np.savetxt('resnet50_input.txt', input_np.reshape(-1), delimiter=' ', fmt='%.6f')
np.save('resnet50_input.npy', input_np)

# Load the pretrained ResNet50 model
model = models.resnet50(pretrained=True).float()
model.eval()  # set the model to evaluation mode

# Run forward inference
with torch.no_grad():
    output = model(input_tensor).numpy()

# Save the inference result for the later accuracy comparison
np.savetxt('resnet50_output.txt', output.reshape(-1), delimiter=' ', fmt='%.6f')

# Get the predicted class
predicted = np.argmax(output)
print("Index:", predicted)

# Export the model to ONNX
input_names = ["input"]
output_names = ["output"]
torch.onnx.export(model, input_tensor, "resnet50.onnx", verbose=False,
                  input_names=input_names, output_names=output_names)
EOF
python3 resnet50.py
onnxsim resnet50.onnx resnet50_opt.onnx
2.1.5 Convert onnx to tensorflow and verify the result
cat> onnx2tf.py<<-'EOF'
import onnx
from onnx_tf.backend import prepare

onnx_model = onnx.load("resnet50_opt.onnx")   # load the ONNX model
tf_rep = prepare(onnx_model)                  # convert to a TensorFlow representation
tf_rep.export_graph("saved_model")            # export in SavedModel format

import tensorflow as tf
import numpy as np

tf_model = tf.saved_model.load("saved_model")
tf_model = tf_model.signatures["serving_default"]  # get the inference function

# Verify that the result matches expectations
input_data = np.load('resnet50_input.npy')
tf_input = tf.constant(input_data)
tf_output = tf_model(tf_input)
print(tf_output.keys())
tf_output = tf_output["output"].numpy()
predicted = np.argmax(tf_output)
print("Index:", predicted)
EOF
rm saved_model -rf
python3 onnx2tf.py
2.1.6 Convert the tensorflow saved_model to tflite and verify the result
cat> tf2lite.py<<-'EOF'
import numpy as np
import sys
import tensorflow as tf

def representative_dataset():
    for _ in range(1):
        data = tf.convert_to_tensor(np.load('resnet50_input.npy').astype(np.float32))
        yield [data]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enable default optimizations

quant_type = sys.argv[1]
output_path = sys.argv[2]
if quant_type == "int8":
    converter.representative_dataset = representative_dataset                    # calibration dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # full-integer ops
else:
    converter.target_spec.supported_types = [tf.float16]

tflite_quant_model = converter.convert()
with open(output_path, "wb") as f:
    f.write(tflite_quant_model)

# Inference validation
def validate_inference():
    # Load the original model
    original_model = tf.saved_model.load("saved_model")
    infer = original_model.signatures["serving_default"]

    # Load the quantized model
    interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
    interpreter.allocate_tensors()

    # Get the model I/O details
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Load the test data
    test_data = np.load('resnet50_input.npy')

    # Inference with the original model
    original_input = tf.constant(test_data, dtype=tf.float32)
    original_output = infer(original_input)
    original_output = list(original_output.values())[0].numpy()

    # Prepare the quantized model's input
    input_shape = input_details[0]['shape']
    input_dtype = input_details[0]['dtype']

    # Handle quantization parameters (if present)
    if input_details[0]['quantization'] != (0, 0):
        scale, zero_point = input_details[0]['quantization']
        quantized_input = np.round(test_data / scale + zero_point)
        quantized_input = np.clip(quantized_input,
                                  np.iinfo(input_dtype).min,
                                  np.iinfo(input_dtype).max).astype(input_dtype)
    else:
        quantized_input = test_data.astype(input_dtype)

    # Set the input and run inference
    interpreter.set_tensor(input_details[0]['index'], quantized_input)
    interpreter.invoke()

    # Get the output and dequantize it if necessary
    quant_output = interpreter.get_tensor(output_details[0]['index'])
    if output_details[0]['quantization'] != (0, 0):
        scale, zero_point = output_details[0]['quantization']
        quant_output = (quant_output.astype(np.float32) - zero_point) * scale

    # Compute the error metrics
    print("Index:", np.argmax(original_output), np.argmax(quant_output))
    mse = np.mean((original_output - quant_output) ** 2)
    mae = np.mean(np.abs(original_output - quant_output))
    print("\nValidation results:")
    print(f"Original model output mean: {np.mean(original_output):.4f}")
    print(f"Quantized model output mean: {np.mean(quant_output):.4f}")
    print(f"Mean squared error (MSE): {mse:.6f}")
    print(f"Mean absolute error (MAE): {mae:.6f}")

# Run the validation
validate_inference()
EOF
python3 tf2lite.py int8 resnet50_int8.tflite
python3 tf2lite.py fp16 resnet50_fp16.tflite

Output

# int8 (the error is already present here)
Index: 208 904

Validation results:
Original model output mean: 0.0000
Quantized model output mean: 0.0006
Mean squared error (MSE): 4.276822
Mean absolute error (MAE): 1.536772

# fp16
Index: 208 208

Validation results:
Original model output mean: 0.0000
Quantized model output mean: 0.0000
Mean squared error (MSE): 0.000003
Mean absolute error (MAE): 0.001248

2.2 Run the ArmNN inference tests on the RK3588

2.2.1 Download the ArmNN SDK and set environment variables
wget https://github.com/ARM-software/armnn/releases/download/v25.02/ArmNN-linux-aarch64.tar.gz
tar -xf ArmNN-linux-aarch64.tar.gz
export LD_LIBRARY_PATH=$PWD:$PWD/delegate:$LD_LIBRARY_PATH
2.2.2 Create the MSE script
cat> compute_mse.py<<-'EOF'  
import numpy as np
import sys
l = np.loadtxt(sys.argv[1], delimiter=' ').reshape(-1)
r = np.loadtxt(sys.argv[2], delimiter=' ').reshape(-1)
print(l.shape,r.shape)
print(l[:8])
print(r[:8])
mse=np.mean((l - r) ** 2)
print(f'MSE:{mse:.5f}')
EOF
2.2.3 Measure the MSE of CPU-FP32
./ExecuteNetwork -c CpuAcc -m resnet50_opt.onnx -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_fp32_cpu_pred.txt
python3 compute_mse.py resnet50_fp32_cpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-0.964521  1.21882  -2.75427  -1.55273  -0.906159 -1.08674  -3.01336  0.647321]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363 0.647324]
MSE:0.00000
2.2.4 Measure the MSE of GPU-FP32
./ExecuteNetwork -c GpuAcc -m resnet50_opt.onnx -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_fp32_gpu_pred.txt
python3 compute_mse.py resnet50_fp32_gpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-0.964525  1.21881  -2.75427  -1.55272  -0.906155 -1.08674  -3.01337  0.64732 ]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363 0.647324]
MSE:0.00000
2.2.5 Measure the MSE of CPU-FP16
./ExecuteNetwork -c CpuAcc -m resnet50_fp16.tflite -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_fp16_cpu_pred.txt
python3 compute_mse.py resnet50_fp16_cpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-0.965583  1.21759  -2.75328  -1.55201  -0.907105 -1.08602  -3.01482   0.646945]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363  0.647324]
MSE:0.00000
2.2.6 Measure the MSE of GPU-FP16
./ExecuteNetwork -c GpuAcc -m resnet50_fp16.tflite -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_fp16_gpu_pred.txt
python3 compute_mse.py resnet50_fp16_gpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-0.965581  1.21759  -2.75328  -1.55201  -0.907109 -1.08602  -3.01482   0.646946]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363  0.647324]
MSE:0.00000
2.2.7 Measure the MSE of CPU-INT8
./ExecuteNetwork -c CpuAcc -m resnet50_int8.tflite -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_int8_cpu_pred.txt
python3 compute_mse.py resnet50_int8_cpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-1.57105   1.45883  -2.30047  -1.2344    0.168327  1.12218  -0.617199  -1.12218 ]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363  0.647324]
MSE:4.33176
2.2.8 Measure the MSE of GPU-INT8
./ExecuteNetwork -c GpuAcc -m resnet50_int8.tflite -I 1 -d resnet50_input.txt -w temp.txt
cat temp.txt | awk -F, '{print $2}' | awk -F: '{print $2}' | sed 's/ /\n/g' > resnet50_int8_gpu_pred.txt
python3 compute_mse.py resnet50_int8_gpu_pred.txt resnet50_output.txt

Output

(1000,) (1000,)
[-1.40273   1.62716  -2.30047  -1.17829   0.336654  1.06607  -0.617199  -1.12218 ]
[-0.964521  1.218818 -2.754273 -1.552727 -0.906162 -1.086741 -3.013363  0.647324]
MSE:4.28963
2.2.9 Performance tests
./ExecuteNetwork -c CpuAcc -m resnet50_opt.onnx -I 5 -N 
./ExecuteNetwork -c GpuAcc -m resnet50_opt.onnx -I 5 -N
./ExecuteNetwork -c CpuAcc -m resnet50_fp16.tflite -I 5 -N
./ExecuteNetwork -c GpuAcc -m resnet50_fp16.tflite -I 5 -N
./ExecuteNetwork -c CpuAcc -m resnet50_int8.tflite -I 5 -N
./ExecuteNetwork -c GpuAcc -m resnet50_int8.tflite -I 5 -N 

Output

Info: Execution time: 175.39 ms.
Info: Execution time: 71.74 ms.
Info: Execution time: 117.10 ms.
Info: Execution time: 66.84 ms.
Info: Execution time: 46.88 ms.
Info: Execution time: 27.59 ms.
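
As a quick cross-check against the summary table in Section 1, here is a minimal sketch (not part of the original test flow; the timings are simply copied from the output above) that derives the relative speedups of each configuration against FP32 on the CPU:

# speedup_summary.py - convenience summary of the six ExecuteNetwork timings above
timings_ms = {
    ("ONNX FP32", "CPU"): 175.39, ("ONNX FP32", "GPU"): 71.74,
    ("TFLite FP16", "CPU"): 117.10, ("TFLite FP16", "GPU"): 66.84,
    ("TFLite INT8", "CPU"): 46.88,  ("TFLite INT8", "GPU"): 27.59,
}
baseline = timings_ms[("ONNX FP32", "CPU")]  # slowest configuration as the reference
for (model, device), ms in timings_ms.items():
    print(f"{model:11s} {device}: {ms:7.2f} ms  ({baseline / ms:4.1f}x vs ONNX FP32 on CPU)")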
