TensorRT Notes (3): Dissecting the BufferManager Class from the Samples
In https://blog.csdn.net/ouliten/article/details/154490047?spm=1001.2014.3001.5502#t14 I only mentioned BufferManager in passing; there is quite a lot to it, so it gets its own article here.
For comparison, see cuda编程笔记(21)-- TensorRT-CSDN博客: the inference flow in that article allocates and frees GPU memory by hand. So that GPU memory does not have to be managed explicitly, the official samples wrap it all in the BufferManager class and avoid the clutter.
All of the source code in this article can be found in samples/common/buffers.h: TensorRT/samples/common/buffers.h at main · NVIDIA/TensorRT
GenericBuffer
```cpp
//!
//! \brief The GenericBuffer class is a templated class for buffers.
//!
//! \details This templated RAII (Resource Acquisition Is Initialization) class handles the allocation,
//!          deallocation, querying of buffers on both the device and the host.
//!          It can handle data of arbitrary types because it stores byte buffers.
//!          The template parameters AllocFunc and FreeFunc are used for the
//!          allocation and deallocation of the buffer.
//!          AllocFunc must be a functor that takes in (void** ptr, size_t size)
//!          and returns bool. ptr is a pointer to where the allocated buffer address should be stored.
//!          size is the amount of memory in bytes to allocate.
//!          The boolean indicates whether or not the memory allocation was successful.
//!          FreeFunc must be a functor that takes in (void* ptr) and returns void.
//!          ptr is the allocated buffer address. It must work with nullptr input.
//!
template <typename AllocFunc, typename FreeFunc>
class GenericBuffer
{
public:
    //!
    //! \brief Construct an empty buffer.
    //!
    GenericBuffer(nvinfer1::DataType type = nvinfer1::DataType::kFLOAT)
        : mSize(0), mCapacity(0), mType(type), mBuffer(nullptr)
    {
    }

    //!
    //! \brief Construct a buffer with the specified allocation size in bytes.
    //!
    GenericBuffer(size_t size, nvinfer1::DataType type)
        : mSize(size), mCapacity(size), mType(type)
    {
        if (!allocFn(&mBuffer, this->nbBytes()))
        {
            throw std::bad_alloc();
        }
    }

    GenericBuffer(GenericBuffer&& buf)
        : mSize(buf.mSize), mCapacity(buf.mCapacity), mType(buf.mType), mBuffer(buf.mBuffer)
    {
        buf.mSize = 0;
        buf.mCapacity = 0;
        buf.mType = nvinfer1::DataType::kFLOAT;
        buf.mBuffer = nullptr;
    }

    GenericBuffer& operator=(GenericBuffer&& buf)
    {
        if (this != &buf)
        {
            freeFn(mBuffer);
            mSize = buf.mSize;
            mCapacity = buf.mCapacity;
            mType = buf.mType;
            mBuffer = buf.mBuffer;
            // Reset buf: remember to clear the moved-from object's resources.
            buf.mSize = 0;
            buf.mCapacity = 0;
            buf.mBuffer = nullptr;
        }
        return *this;
    }

    //!
    //! \brief Returns pointer to underlying array.
    //!
    void* data()
    {
        return mBuffer;
    }

    //!
    //! \brief Returns pointer to underlying array.
    //!
    const void* data() const
    {
        return mBuffer;
    }

    //!
    //! \brief Returns the size (in number of elements) of the buffer.
    //!
    size_t size() const
    {
        return mSize;
    }

    //!
    //! \brief Returns the size (in bytes) of the buffer.
    //!
    size_t nbBytes() const
    {
        return this->size() * samplesCommon::getElementSize(mType);
    }

    //!
    //! \brief Resizes the buffer. This is a no-op if the new size is smaller than or equal to the current capacity.
    //!
    void resize(size_t newSize) // grow-on-demand logic
    {
        mSize = newSize;
        if (mCapacity < newSize)
        {
            freeFn(mBuffer);
            if (!allocFn(&mBuffer, this->nbBytes()))
            {
                throw std::bad_alloc{};
            }
            mCapacity = newSize;
        }
    }

    //!
    //! \brief Overload of resize that accepts Dims
    //!
    void resize(const nvinfer1::Dims& dims)
    {
        return this->resize(samplesCommon::volume(dims));
    }

    ~GenericBuffer() // RAII in action: the memory is released automatically
    {
        freeFn(mBuffer);
    }

private:
    size_t mSize{0}, mCapacity{0};
    nvinfer1::DataType mType;
    void* mBuffer;
    AllocFunc allocFn;
    FreeFunc freeFn;
};
```
This class gives unified management of host and device memory: whether the backing calls are malloc/free or cudaMalloc/cudaFree, everything is wrapped into a single "buffer object".
Why is it called GenericBuffer?
Because the class does not depend on any concrete allocation method; that is decided by its template parameters.
AllocFunc and FreeFunc here are template parameters. They are not function types but functor types: classes that overload operator() to customize how memory is allocated and freed. Since CPU memory and GPU memory are allocated and released in different ways, this is how the difference is abstracted away (the CPU and GPU allocator/deleter classes are shown later).
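As a quick illustration of this functor-based design (my own sketch, not part of the official buffers.h), a hypothetical allocator/deleter pair for pinned (page-locked) host memory could be plugged in like this:

```cpp
// Hypothetical example: a custom AllocFunc/FreeFunc pair for GenericBuffer.
// Pinned (page-locked) host memory can speed up host<->device copies.
class PinnedHostAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        return cudaMallocHost(ptr, size) == cudaSuccess;
    }
};

class PinnedHostFree
{
public:
    void operator()(void* ptr) const
    {
        // FreeFunc must tolerate nullptr (the default-constructed buffer holds no memory).
        if (ptr)
        {
            cudaFreeHost(ptr);
        }
    }
};

using PinnedHostBuffer = GenericBuffer<PinnedHostAllocator, PinnedHostFree>;
```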
nvinfer1::DataType
The enumeration of data types.
```cpp
//!
//! \enum DataType
//! \brief The type of weights and tensors.
//!
enum class DataType : int32_t
{
    //! 32-bit floating point format.
    kFLOAT = 0,

    //! IEEE 16-bit floating-point format -- has a 5 bit exponent and 11 bit significand.
    kHALF = 1,

    //! Signed 8-bit integer representing a quantized floating-point value.
    kINT8 = 2,

    //! Signed 32-bit integer format.
    kINT32 = 3,

    //! 8-bit boolean. 0 = false, 1 = true, other values undefined.
    kBOOL = 4,

    //! Unsigned 8-bit integer format.
    //! Cannot be used to represent quantized floating-point values.
    //! Use the IdentityLayer to convert kUINT8 network-level inputs to {kFLOAT, kHALF} prior
    //! to use with other TensorRT layers, or to convert intermediate output
    //! before kUINT8 network-level outputs from {kFLOAT, kHALF} to kUINT8.
    //! kUINT8 conversions are only supported for {kFLOAT, kHALF}.
    //! kUINT8 to {kFLOAT, kHALF} conversion will convert the integer values
    //! to equivalent floating point values.
    //! {kFLOAT, kHALF} to kUINT8 conversion will convert the floating point values
    //! to integer values by truncating towards zero. This conversion has undefined behavior for
    //! floating point values outside the range [0.0F, 256.0F) after truncation.
    //! kUINT8 conversions are not supported for {kINT8, kINT32, kBOOL}.
    kUINT8 = 5,

    //! Signed 8-bit floating point with
    //! 1 sign bit, 4 exponent bits, 3 mantissa bits, and exponent-bias 7.
    kFP8 = 6,

    //! Brain float -- has an 8 bit exponent and 8 bit significand.
    kBF16 = 7,

    //! Signed 64-bit integer type.
    kINT64 = 8,

    //! Signed 4-bit integer type.
    kINT4 = 9,
};
```
samplesCommon::getElementSize
Returns the actual size in bytes for each of the enum values above.
```cpp
//#include "common.h"
inline uint32_t getElementSize(nvinfer1::DataType t) noexcept
{
    switch (t)
    {
    case nvinfer1::DataType::kINT64: return 8;
    case nvinfer1::DataType::kINT32:
    case nvinfer1::DataType::kFLOAT: return 4;
    case nvinfer1::DataType::kBF16:
    case nvinfer1::DataType::kHALF: return 2;
    case nvinfer1::DataType::kBOOL:
    case nvinfer1::DataType::kUINT8:
    case nvinfer1::DataType::kINT8:
    case nvinfer1::DataType::kFP8: return 1;
    case nvinfer1::DataType::kINT4:
        ASSERT(false && "Element size is not implemented for sub-byte data-types");
    }
    return 0;
}
```
The automatic memory-management classes
```cpp
class DeviceAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        return cudaMalloc(ptr, size) == cudaSuccess;
    }
};

class DeviceFree
{
public:
    void operator()(void* ptr) const
    {
        cudaFree(ptr);
    }
};

class HostAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        *ptr = malloc(size);
        return *ptr != nullptr;
    }
};

class HostFree
{
public:
    void operator()(void* ptr) const
    {
        free(ptr);
    }
};

using DeviceBuffer = GenericBuffer<DeviceAllocator, DeviceFree>;
using HostBuffer = GenericBuffer<HostAllocator, HostFree>;

//!
//! \brief The ManagedBuffer class groups together a pair of corresponding device and host buffers.
//!
class ManagedBuffer
{
public:
    DeviceBuffer deviceBuffer;
    HostBuffer hostBuffer;
};
```
This part is fairly easy to follow. If you are not familiar with CUDA, see my post cuda编程笔记(2)--传递参数、设备属性_结构体参数的核函数-CSDN博客.
BufferManager
```cpp
class BufferManager
{
public:
    static const size_t kINVALID_SIZE_VALUE = ~size_t(0);

    //!
    //! \brief Create a BufferManager for handling buffer interactions with engine, when the I/O tensor volumes
    //!        are provided
    //!
    BufferManager(
        std::shared_ptr<nvinfer1::ICudaEngine> engine, std::vector<int64_t> const& volumes, int32_t batchSize = 0)
        : mEngine(engine)
        , mBatchSize(batchSize)
    {
        // Create host and device buffers
        for (int32_t i = 0; i < mEngine->getNbIOTensors(); i++)
        {
            auto const name = engine->getIOTensorName(i);
            mNames[name] = i;

            nvinfer1::DataType type = mEngine->getTensorDataType(name);

            std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
            manBuf->deviceBuffer = DeviceBuffer(volumes[i], type);
            manBuf->hostBuffer = HostBuffer(volumes[i], type);
            void* deviceBuffer = manBuf->deviceBuffer.data();
            mDeviceBindings.emplace_back(deviceBuffer);
            mManagedBuffers.emplace_back(std::move(manBuf));
        }
    }

    //!
    //! \brief Create a BufferManager for handling buffer interactions with engine.
    //!
    BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, int32_t const batchSize = 0,
        nvinfer1::IExecutionContext const* context = nullptr)
        : mEngine(engine)
        , mBatchSize(batchSize)
    {
        // Create host and device buffers
        for (int32_t i = 0, e = mEngine->getNbIOTensors(); i < e; i++)
        {
            auto const name = engine->getIOTensorName(i);
            mNames[name] = i;

            auto dims = context ? context->getTensorShape(name) : mEngine->getTensorShape(name);
            size_t vol = context || !mBatchSize ? 1 : static_cast<size_t>(mBatchSize);
            nvinfer1::DataType type = mEngine->getTensorDataType(name);
            int32_t vecDim = mEngine->getTensorVectorizedDim(name);
            if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
            {
                int32_t scalarsPerVec = mEngine->getTensorComponentsPerElement(name);
                dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
                vol *= scalarsPerVec;
            }
            vol *= samplesCommon::volume(dims);

            std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
            manBuf->deviceBuffer = DeviceBuffer(vol, type);
            manBuf->hostBuffer = HostBuffer(vol, type);
            void* deviceBuffer = manBuf->deviceBuffer.data();
            mDeviceBindings.emplace_back(deviceBuffer);
            mManagedBuffers.emplace_back(std::move(manBuf));
        }
    }

    //!
    //! \brief Returns a vector of device buffers that you can use directly as
    //!        bindings for the execute and enqueue methods of IExecutionContext.
    //!
    std::vector<void*>& getDeviceBindings()
    {
        return mDeviceBindings;
    }

    //!
    //! \brief Returns a vector of device buffers.
    //!
    std::vector<void*> const& getDeviceBindings() const
    {
        return mDeviceBindings;
    }

    //!
    //! \brief Returns the device buffer corresponding to tensorName.
    //!        Returns nullptr if no such tensor can be found.
    //!
    void* getDeviceBuffer(std::string const& tensorName) const
    {
        return getBuffer(false, tensorName);
    }

    //!
    //! \brief Returns the host buffer corresponding to tensorName.
    //!        Returns nullptr if no such tensor can be found.
    //!
    void* getHostBuffer(std::string const& tensorName) const
    {
        return getBuffer(true, tensorName);
    }

    //!
    //! \brief Returns the size of the host and device buffers that correspond to tensorName.
    //!        Returns kINVALID_SIZE_VALUE if no such tensor can be found.
    //!
    size_t size(std::string const& tensorName) const
    {
        auto record = mNames.find(tensorName);
        if (record == mNames.end())
            return kINVALID_SIZE_VALUE;
        return mManagedBuffers[record->second]->hostBuffer.nbBytes();
    }

    //!
    //! \brief Templated print function that dumps buffers of arbitrary type to std::ostream.
    //!        rowCount parameter controls how many elements are on each line.
    //!        A rowCount of 1 means that there is only 1 element on each line.
    //!
    template <typename T>
    void print(std::ostream& os, void* buf, size_t bufSize, size_t rowCount)
    {
        assert(rowCount != 0);
        assert(bufSize % sizeof(T) == 0);
        T* typedBuf = static_cast<T*>(buf);
        size_t numItems = bufSize / sizeof(T);
        for (int32_t i = 0; i < static_cast<int>(numItems); i++)
        {
            // Handle rowCount == 1 case
            if (rowCount == 1 && i != static_cast<int>(numItems) - 1)
                os << typedBuf[i] << std::endl;
            else if (rowCount == 1)
                os << typedBuf[i];
            // Handle rowCount > 1 case
            else if (i % rowCount == 0)
                os << typedBuf[i];
            else if (i % rowCount == rowCount - 1)
                os << " " << typedBuf[i] << std::endl;
            else
                os << " " << typedBuf[i];
        }
    }

    //!
    //! \brief Copy the contents of input host buffers to input device buffers synchronously.
    //!
    void copyInputToDevice()
    {
        memcpyBuffers(true, false, false);
    }

    //!
    //! \brief Copy the contents of output device buffers to output host buffers synchronously.
    //!
    void copyOutputToHost()
    {
        memcpyBuffers(false, true, false);
    }

    //!
    //! \brief Copy the contents of input host buffers to input device buffers asynchronously.
    //!
    void copyInputToDeviceAsync(cudaStream_t const& stream = 0)
    {
        memcpyBuffers(true, false, true, stream);
    }

    //!
    //! \brief Copy the contents of output device buffers to output host buffers asynchronously.
    //!
    void copyOutputToHostAsync(cudaStream_t const& stream = 0)
    {
        memcpyBuffers(false, true, true, stream);
    }

    ~BufferManager() = default;

private:
    void* getBuffer(bool const isHost, std::string const& tensorName) const
    {
        auto record = mNames.find(tensorName);
        if (record == mNames.end())
            return nullptr;
        return (isHost ? mManagedBuffers[record->second]->hostBuffer.data()
                       : mManagedBuffers[record->second]->deviceBuffer.data());
    }

    bool tenosrIsInput(const std::string& tensorName) const
    {
        return mEngine->getTensorIOMode(tensorName.c_str()) == nvinfer1::TensorIOMode::kINPUT;
    }

    void memcpyBuffers(bool const copyInput, bool const deviceToHost, bool const async, cudaStream_t const& stream = 0)
    {
        for (auto const& n : mNames)
        {
            void* dstPtr = deviceToHost ? mManagedBuffers[n.second]->hostBuffer.data()
                                        : mManagedBuffers[n.second]->deviceBuffer.data();
            void const* srcPtr = deviceToHost ? mManagedBuffers[n.second]->deviceBuffer.data()
                                              : mManagedBuffers[n.second]->hostBuffer.data();
            size_t const byteSize = mManagedBuffers[n.second]->hostBuffer.nbBytes();
            const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
            if ((copyInput && tenosrIsInput(n.first)) || (!copyInput && !tenosrIsInput(n.first)))
            {
                if (async)
                    CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream));
                else
                    CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
            }
        }
    }

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine;              //!< The pointer to the engine
    int mBatchSize;                                              //!< The batch size for legacy networks, 0 otherwise.
    std::vector<std::unique_ptr<ManagedBuffer>> mManagedBuffers; //!< The vector of pointers to managed buffers
    std::vector<void*> mDeviceBindings;                          //!< The vector of device buffers needed for engine execution
    std::unordered_map<std::string, int32_t> mNames;             //!< The map of tensor name and index pairs
};
```
Member variables

```cpp
std::shared_ptr<nvinfer1::ICudaEngine> mEngine;              //!< The pointer to the engine
int mBatchSize;                                              //!< The batch size for legacy networks, 0 otherwise.
std::vector<std::unique_ptr<ManagedBuffer>> mManagedBuffers; //!< The vector of pointers to managed buffers
std::vector<void*> mDeviceBindings;                          //!< The vector of device buffers needed for engine execution
std::unordered_map<std::string, int32_t> mNames;             //!< The map of tensor name and index pairs
```

```
BufferManager
├── mEngine          // the TensorRT engine
├── mBatchSize       // the batch size
├── mManagedBuffers  // the host/device buffer pair for each tensor
│   ├── ManagedBuffer
│   │   ├── HostBuffer hostBuffer;
│   │   └── DeviceBuffer deviceBuffer;
├── mDeviceBindings  // pointers to all of the device buffers
└── mNames           // map from tensor name to index
```
Every tensor corresponds to a pair of buffers, one on the host and one on the device.
"Tensor" here does not mean the intermediate activation tensors; it refers only to the input and output tensors at the two ends of the network.
Private member functions

```cpp
// Return the host or device memory belonging to the tensor named tensorName
void* getBuffer(bool const isHost, std::string const& tensorName) const
{
    auto record = mNames.find(tensorName);
    if (record == mNames.end())
        return nullptr;
    return (isHost ? mManagedBuffers[record->second]->hostBuffer.data()
                   : mManagedBuffers[record->second]->deviceBuffer.data());
}

// Check whether tensorName refers to an input tensor
bool tenosrIsInput(const std::string& tensorName) const
{
    return mEngine->getTensorIOMode(tensorName.c_str()) == nvinfer1::TensorIOMode::kINPUT;
}

// Copy between device and host memory in either direction; optionally asynchronous
void memcpyBuffers(bool const copyInput, bool const deviceToHost, bool const async, cudaStream_t const& stream = 0)
{
    for (auto const& n : mNames)
    {
        void* dstPtr = deviceToHost ? mManagedBuffers[n.second]->hostBuffer.data()
                                    : mManagedBuffers[n.second]->deviceBuffer.data();
        void const* srcPtr = deviceToHost ? mManagedBuffers[n.second]->deviceBuffer.data()
                                          : mManagedBuffers[n.second]->hostBuffer.data();
        size_t const byteSize = mManagedBuffers[n.second]->hostBuffer.nbBytes();
        const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
        if ((copyInput && tenosrIsInput(n.first)) || (!copyInput && !tenosrIsInput(n.first)))
        {
            if (async)
                CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream));
            else
                CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
        }
    }
}
```
Constructors

```cpp
//!
//! \brief Create a BufferManager for handling buffer interactions with engine, when the I/O tensor volumes
//!        are provided
//!
BufferManager(
    std::shared_ptr<nvinfer1::ICudaEngine> engine, std::vector<int64_t> const& volumes, int32_t batchSize = 0)
    : mEngine(engine)
    , mBatchSize(batchSize)
{
    // Create host and device buffers
    for (int32_t i = 0; i < mEngine->getNbIOTensors(); i++)
    {
        auto const name = engine->getIOTensorName(i);
        mNames[name] = i;

        nvinfer1::DataType type = mEngine->getTensorDataType(name);

        std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
        manBuf->deviceBuffer = DeviceBuffer(volumes[i], type);
        manBuf->hostBuffer = HostBuffer(volumes[i], type);
        void* deviceBuffer = manBuf->deviceBuffer.data();
        mDeviceBindings.emplace_back(deviceBuffer);
        mManagedBuffers.emplace_back(std::move(manBuf));
    }
}

//!
//! \brief Create a BufferManager for handling buffer interactions with engine.
//!
BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, int32_t const batchSize = 0,
    nvinfer1::IExecutionContext const* context = nullptr)
    : mEngine(engine)
    , mBatchSize(batchSize)
{
    // Create host and device buffers
    for (int32_t i = 0, e = mEngine->getNbIOTensors(); i < e; i++)
    {
        auto const name = engine->getIOTensorName(i);
        mNames[name] = i;

        // If a context is supplied (i.e. a dynamic-shape model), take the current actual shape
        // from the execution context; otherwise (static shape) use the shape recorded in the engine.
        auto dims = context ? context->getTensorShape(name) : mEngine->getTensorShape(name);
        // For dynamic-shape models or when batchSize is 0 (e.g. explicit-batch mode), start vol at 1;
        // otherwise start with batchSize.
        size_t vol = context || !mBatchSize ? 1 : static_cast<size_t>(mBatchSize);
        nvinfer1::DataType type = mEngine->getTensorDataType(name);
        // getTensorVectorizedDim(): check whether vectorized access is enabled
        // (e.g. Tensor Core optimizations for FP16 or INT8).
        int32_t vecDim = mEngine->getTensorVectorizedDim(name);
        if (-1 != vecDim) // one dimension of the tensor was packed into vectors (e.g. groups of 4 FP16 values)
        {
            // That dimension of the shape becomes "original size / scalars per vector";
            // the total volume is then multiplied back by scalarsPerVec so the memory size stays correct.
            int32_t scalarsPerVec = mEngine->getTensorComponentsPerElement(name);
            dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
            vol *= scalarsPerVec;
        }
        // samplesCommon::volume(dims) computes the product of all dimensions, i.e. ∏ dims.d[i]
        vol *= samplesCommon::volume(dims);

        std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
        manBuf->deviceBuffer = DeviceBuffer(vol, type);
        manBuf->hostBuffer = HostBuffer(vol, type);
        void* deviceBuffer = manBuf->deviceBuffer.data();
        mDeviceBindings.emplace_back(deviceBuffer);
        mManagedBuffers.emplace_back(std::move(manBuf));
    }
}
```
- The first overload is for tensors with fixed, already-known sizes (the volumes are passed in directly), e.g. a simple network like MNIST.
- The second overload is for networks with dynamic shapes or a batch dimension; it has to query each tensor's actual dimensions via IExecutionContext::getTensorShape().

In the end both do the same thing: compute the total number of bytes from the shape and the DataType, then allocate the host/device buffers. A usage sketch follows below.
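To see where these constructors fit in practice, here is a rough usage sketch; the tensor names "input"/"output" and the `engine`/`context` variables are illustrative assumptions, not taken from a particular sample:

```cpp
// Sketch of a typical workflow, assuming `engine` is a std::shared_ptr<nvinfer1::ICudaEngine>
// and `context` an already created IExecutionContext with one input tensor named "input"
// and one output tensor named "output".
samplesCommon::BufferManager buffers(engine); // second constructor: shapes queried from the engine

// 1. Fill the input host buffer.
float* hostInput = static_cast<float*>(buffers.getHostBuffer("input"));
// ... write preprocessed data into hostInput ...

// 2. Host -> device copy for all input tensors.
buffers.copyInputToDevice();

// 3. Run inference with the device bindings.
bool ok = context->executeV2(buffers.getDeviceBindings().data());

// 4. Device -> host copy for all output tensors, then read the results.
buffers.copyOutputToHost();
float const* hostOutput = static_cast<float const*>(buffers.getHostBuffer("output"));
```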
Vectorization (vectorized)
During optimization, TensorRT may apply vectorized access along one dimension of a tensor. For example:

- if the weights or activations are FP16 or INT8,
- TensorRT will automatically try to use wider load/store instructions (such as float4, half2, int4),
- which improves bandwidth utilization and reduces the number of memory-access instructions;
- vectorization usually happens on the last dimension (the channel dimension C), but it can occur on other dimensions as well.
For more on float4, see cuda编程笔记(10)--memory access 优化_coalesced memory access-CSDN博客.
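To make "wider load/store instructions" concrete, here is a tiny standalone CUDA sketch (illustrative only, unrelated to TensorRT's internal kernels) that copies data with float4 loads, i.e. 16 bytes per memory instruction instead of 4:

```cpp
// Each thread loads/stores one float4 (16 bytes) instead of one float (4 bytes),
// so fewer memory instructions are issued for the same amount of data.
__global__ void copyVec4(const float4* __restrict__ in, float4* __restrict__ out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
    {
        out[i] = in[i];
    }
}
```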
The tensor dimensions in TensorRT depend on the network's format. Two common layouts are:

| Format | Meaning | Dimension order | Example (image) |
|---|---|---|---|
| kLINEAR (default) | plain linear layout | [N, C, H, W] | batch, channels, height, width |
| kHWC8, kCHW32, kCHW4, etc. | vectorized layouts | the last dimension is "packed" in groups of 8/16/32, etc. | the C (channel) dimension is split out into an extra dimension |
| Tensor shape | Layout | Vectorized-dim return value | Meaning |
|---|---|---|---|
| [N, C, H, W], not vectorized | kLINEAR | -1 | plain float |
| [N, C/4, H, W, 4] | kCHW4 | 1 | the C dimension is packed in groups of 4 |
| [N, H, W, C/8, 8] | kHWC8 | 3 | the last dimension is packed |
| [N, C/32, H, W, 32] | kCHW32 | 1 | the channel dimension is vectorized by 32 |
Suppose your tensor is [1, 128, 224, 224] and each element is FP16 (2 bytes). TensorRT may:

- merge every 2 channels into one vector unit (half2);
- so the shape logically becomes [1, 64, 224, 224, 2];
- with vecDim = 1 (the channel dimension) and scalarsPerVec = 2.
However, when we read the tensor's dimensions, what gets returned is still the logical shape (the un-vectorized shape).
| Concept | Meaning | Example |
|---|---|---|
| Logical shape | the tensor dimensions seen at the network level (the shape from the original model definition) | [1, 64, 7, 7] |
| Physical shape | how the data is actually laid out in memory (after vectorization/alignment) | [1, 16, 7, 7] (each element is a vec4 containing 4 scalars) |
So the dims we originally read in is [1, 64, 7, 7]; to match the actual memory layout we have to adjust the dimensions of dims ourselves.
getTensorVectorizedDim
```cpp
int32_t ICudaEngine::getTensorVectorizedDim(char const* tensorName) const noexcept;
```

- Parameter: the tensor name.
- Return value:
  - -1 → not vectorized;
  - >= 0 → the index (0-based) of the vectorized dimension.
getTensorComponentsPerElement
```cpp
int32_t ICudaEngine::getTensorComponentsPerElement(char const* tensorName) const noexcept;
```

- Return value:
  - -1 → not vectorized;
  - >= 0 → the number of scalar elements in one vector when vectorized; e.g. a float4 contains 4 floats.
divUp
```cpp
template <typename A, typename B>
inline A divUp(A x, B n)
{
    return (x + n - 1) / n;
}
```

Integer division that rounds up (ceiling division).
samplesCommon::volume
```cpp
inline int64_t volume(nvinfer1::Dims const& d)
{
    return std::accumulate(d.d, d.d + d.nbDims, int64_t{1}, std::multiplies<int64_t>{});
}
```

Multiplies all of the dimensions together.
The processing logic
Inside the constructor:
```cpp
auto dims = context ? context->getTensorShape(name) : mEngine->getTensorShape(name);
size_t vol = context || !mBatchSize ? 1 : static_cast<size_t>(mBatchSize);
nvinfer1::DataType type = mEngine->getTensorDataType(name);
int32_t vecDim = mEngine->getTensorVectorizedDim(name);
if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
{
    int32_t scalarsPerVec = mEngine->getTensorComponentsPerElement(name);
    dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
    vol *= scalarsPerVec;
}
vol *= samplesCommon::volume(dims);
```
This block is where vectorization is handled.
1️⃣ Get the tensor shape

- If a context (execution context) is supplied, the current tensor shape is taken from the context. This covers dynamic shapes or a dynamically bound batch.
- Otherwise the engine's tensor shape is used (the logical shape from the original model definition).
- The resulting dims is the logical shape, e.g. [1, 64, 7, 7].
I honestly have not fully figured out the mBatchSize logic here. If context is non-null or mBatchSize == 0, vol starts at 1, which is fine. What I don't understand is that batchSize can also be set manually: doesn't the dims obtained from mEngine already include the batch dimension? (Presumably this path is for TensorRT's legacy implicit-batch networks, whose tensor shapes do not contain a batch dimension, so the per-sample volume has to be multiplied by batchSize.)
2️⃣ Initialize the volume (vol)

- If there is a context, or batchSize == 0 (not a legacy network), vol starts at 1.
- Otherwise batchSize is used as the initial volume.
- vol accumulates the total number of elements in the final buffer (it may be multiplied by the vectorization factor later).
3️⃣ Get the data type
4️⃣ Check whether the tensor is vectorized

- vecDim: TensorRT reports which dimension is vectorized (e.g. the channel dimension); -1 means not vectorized.
- scalarsPerVec: how many scalar elements each vector contains (e.g. an FP16 vec4 → 4 scalars).
- dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec)
  - adjusts the size of the vectorized dimension, rounding up to get the number of vectors actually needed.
  - Example: 64 logical channels with vec4 → divUp(64, 4) = 16 vectors.
- vol *= scalarsPerVec: multiplies back the number of scalars per vector, giving the total number of scalars in the physical buffer.

The vol *= scalarsPerVec step is needed because we first divided the dimension with divUp and only afterwards multiply all dimensions together with samplesCommon::volume(dims); without multiplying scalarsPerVec back in, we would undercount the elements.
5️⃣ Compute the total number of elements

- samplesCommon::volume(dims) computes the tensor's volume (the product of all dimensions in dims).
- The final vol is the total number of scalars in the buffer (used to allocate the host/device buffers).
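Putting the pieces together with the hypothetical FP16 example from earlier (logical shape [1, 128, 224, 224], channel dimension packed as half2), the arithmetic works out roughly like this:

```cpp
// Worked example with the hypothetical numbers from above.
nvinfer1::Dims4 dims{1, 128, 224, 224};                 // logical shape
int32_t vecDim = 1;                                     // channel dimension is vectorized
int32_t scalarsPerVec = 2;                              // half2: 2 FP16 scalars per vector
size_t vol = 1;                                         // explicit-batch / context case

dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);  // 128 -> 64 vectors
vol *= scalarsPerVec;                                   // vol = 2
vol *= samplesCommon::volume(dims);                     // 2 * (1*64*224*224) = 6,422,528 scalars

size_t bytes = vol * samplesCommon::getElementSize(nvinfer1::DataType::kHALF); // 12,845,056 bytes
```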
getDeviceBindings
```cpp
//!
//! \brief Returns a vector of device buffers that you can use directly as
//!        bindings for the execute and enqueue methods of IExecutionContext.
//!
std::vector<void*>& getDeviceBindings()
{
    return mDeviceBindings;
}

//!
//! \brief Returns a vector of device buffers.
//!
std::vector<void*> const& getDeviceBindings() const
{
    return mDeviceBindings;
}
```
There is not much to say about these two: they return the device memory needed for inference, i.e. the device buffers corresponding to the input and output tensors.
In practice, though, this interface only applies to IExecutionContext's executeV2 API; the newer inference APIs no longer use it. For details see: https://blog.csdn.net/ouliten/article/details/151794803?spm=1001.2014.3001.5502#t21
More commonly you just use getDeviceBuffer to obtain the corresponding device memory.
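For reference, here is a minimal sketch of how BufferManager could be combined with the newer name-based API (setTensorAddress + enqueueV3); the stream handling and the loop are illustrative assumptions, and `engine`, `context`, and `buffers` are assumed to exist already:

```cpp
// With the name-based API, each I/O tensor's device address is set explicitly
// and inference is launched with enqueueV3 on a stream.
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

for (int32_t i = 0; i < engine->getNbIOTensors(); ++i)
{
    char const* name = engine->getIOTensorName(i);
    context->setTensorAddress(name, buffers.getDeviceBuffer(name));
}

buffers.copyInputToDeviceAsync(stream);
bool ok = context->enqueueV3(stream);
buffers.copyOutputToHostAsync(stream);
CHECK(cudaStreamSynchronize(stream));
```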
