
Introducing and Using the Thrust Library

news 2025/7/11 7:17:25

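Contents

A Guide to the Thrust Library
- Overview of Thrust
- Key features
- Basic usage
  - Including the headers
  - A basic example
- Configuring memory resources
  - 1. Using the default memory resources
  - 2. Custom memory resources
    - Creating a custom memory resource
    - Using the custom allocator
  - 3. Configuring a global memory resource
  - 4. Memory pool options
- Advanced memory management techniques
  - 1. Using managed memory (Unified Memory)
  - 2. Asynchronous memory operations
  - 3. Reusing memory
- Performance considerations
- Summary
Using `thrust::universal_vector` and Unified Memory
- Basic usage
- Hardware and software requirements
  - 1. Hardware requirements
  - 2. Software requirements
- How it works
- Performance considerations
- Advanced usage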

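A Guide to the Thrust Library

Overview of Thrust

Thrust is a C++ template library of parallel algorithms modeled on the C++ Standard Template Library (STL). It was designed for CUDA, but it can also target multicore CPUs through its OpenMP and TBB backends. It provides a rich set of data-parallel primitives such as sort, prefix sum (scan), and reduce, so developers can write high-performance parallel code at a high level of abstraction.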

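Key features

STL-like interface: developers who know the STL can get started quickly
Cross-platform support: CUDA GPU and multicore CPU backends
Rich set of algorithms: sort, scan, reduce, transform, and more
Container management: device vectors (device_vector) and host vectors (host_vector)
Extensibility: custom algorithms and memory allocation strategies are supported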

Basic usage

Including the headers

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

A basic example

// Create a host vector
thrust::host_vector<int> h_vec(4);
h_vec[0] = 3; h_vec[1] = 1; h_vec[2] = 4; h_vec[3] = 2;

// Transfer it to the device
thrust::device_vector<int> d_vec = h_vec;

// Sort on the device
thrust::sort(d_vec.begin(), d_vec.end());

// Copy the result back to the host
h_vec = d_vec;

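Configuring memory resources

Thrust 1.9.4 and later support custom memory resources through the thrust::mr namespace, modeled on the C++17 memory resource (std::pmr::memory_resource) concept.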
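For readers who have not used the C++17 facility that Thrust's design is compared to, here is a minimal standard-library sketch of the same idea: a pool resource sits in front of an upstream resource and absorbs repeated small allocations. This is plain `std::pmr`, not Thrust code, and `counting_resource` and `upstream_calls_with_pool` are hypothetical names made up for the illustration.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Upstream resource that counts how many allocations actually reach it.
class counting_resource : public std::pmr::memory_resource {
public:
    std::size_t upstream_allocations = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        ++upstream_allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, alignment);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Creates and destroys `rounds` small vectors through a pool and reports
// how many allocation requests the pool forwarded to the upstream resource.
std::size_t upstream_calls_with_pool(int rounds) {
    counting_resource upstream;
    {
        std::pmr::unsynchronized_pool_resource pool(&upstream);
        for (int i = 0; i < rounds; ++i) {
            std::pmr::vector<int> v(64, i, &pool);  // freed back to the pool each round
        }
    }  // the pool returns its chunks to upstream here
    return upstream.upstream_allocations;
}
```

The pool hands the same cached block back on every round, so the upstream count stays well below the number of vectors created. Thrust's pool resources play the same role in front of cudaMalloc.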

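1. Using the default memory resources

By default, Thrust allocates through the following:

Host: std::allocator (the default allocator of thrust::host_vector)
Device: cudaMalloc/cudaFree (through thrust::device_allocator)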

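2. Custom memory resources

Creating a custom memory resource

#include <thrust/device_allocator.h>
#include <thrust/mr/allocator.h>
#include <thrust/mr/disjoint_pool.h>
#include <thrust/mr/new.h>
#include <thrust/mr/pool_options.h>

// Pool type: blocks come from device memory, bookkeeping is kept in host memory
using cuda_pool_resource = thrust::mr::disjoint_unsynchronized_pool_resource<
    thrust::device_memory_resource, thrust::mr::new_delete_resource>;

// Start from the pool's default options and adjust them
thrust::mr::pool_options options = cuda_pool_resource::get_default_options();
options.min_blocks_per_chunk = 16;  // minimum number of blocks per upstream chunk

// Create the device memory pool (it uses the global upstream resources)
cuda_pool_resource cuda_pool(options);

// Create an allocator that draws from the pool
using cuda_pool_allocator = thrust::mr::allocator<int, cuda_pool_resource>;
cuda_pool_allocator alloc(&cuda_pool);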

Using the custom allocator

// Create a vector that allocates through the pool
thrust::device_vector<int, cuda_pool_allocator> d_vec(1024, alloc);

// A second vector can share the same pool by reusing the allocator
thrust::device_vector<int, cuda_pool_allocator> d_vec2(1024, alloc);

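3. Configuring a global memory resource

Thrust does not provide a function for replacing the default resource behind thrust::device_vector<T>. What it does provide is one global instance per resource type, available through thrust::mr::get_global_resource and picked up automatically by thrust::mr::stateless_resource_allocator. A type alias therefore gives every vector in the program the same pool. Note that the unsynchronized pool is not thread-safe.

// Pool type whose single global instance is shared by all vectors below
using cuda_pool_resource = thrust::mr::disjoint_unsynchronized_pool_resource<
    thrust::device_memory_resource, thrust::mr::new_delete_resource>;

template <typename T>
using pooled_device_vector = thrust::device_vector<
    T, thrust::mr::stateless_resource_allocator<T, cuda_pool_resource>>;

// Every pooled_device_vector draws from the shared pool
pooled_device_vector<int> d_vec(1024);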

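4. Memory pool options

The pool configuration has a significant effect on performance. Block sizes and the alignment must be powers of two, and the alignment must not exceed the smallest block size:

using cuda_pool_resource = thrust::mr::disjoint_unsynchronized_pool_resource<
    thrust::device_memory_resource, thrust::mr::new_delete_resource>;

thrust::mr::pool_options options = cuda_pool_resource::get_default_options();
options.min_blocks_per_chunk = 32;     // minimum blocks requested per upstream chunk
options.max_blocks_per_chunk = 1024;   // maximum blocks requested per upstream chunk
options.smallest_block_size = 256;     // smallest pooled block size (bytes)
options.largest_block_size = 1 << 20;  // larger requests bypass the pools and go upstream
options.alignment = 256;               // alignment of pooled blocks (bytes)

cuda_pool_resource pool(options);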

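Advanced memory management techniques

1. Using managed memory (Unified Memory)

#include <thrust/universal_vector.h>

// Create a vector backed by managed memory
thrust::universal_vector<int> u_vec(1024);
// Its elements can be accessed transparently from both host and device code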

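2. Asynchronous memory operations

#include <thrust/async/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

thrust::device_vector<int> d_vec(1024);
thrust::host_vector<int> h_vec(1024);

// Start the copy; source and destination systems are deduced from the iterators
auto event = thrust::async::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

// Other work can run here...

// Wait for the copy to finish
event.wait();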

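3. Reusing memory

thrust::device_vector<int> temp;

// Reuse the allocation
temp.resize(1024);  // first allocation
// ...use temp...
temp.resize(2048);  // reuses the existing storage if capacity allows, otherwise reallocates

// Release the memory explicitly
thrust::device_vector<int>().swap(temp);  // frees all storage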

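Performance considerations

Reduce allocations: frequent allocation and deallocation hurts performance, so reuse memory where possible
Use a memory pool: especially effective for frequent small allocations
Batch operations: merging several small operations reduces kernel launch overhead
Choose a suitable algorithm: some algorithms have specific requirements on memory access patterns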

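Summary

Thrust offers flexible memory management, from simple default allocation to custom memory pools. Configuring memory resources well can noticeably improve performance, particularly in workloads that allocate and free memory often. Choosing a memory management strategy that matches the application is an important step in optimizing Thrust code.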

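Using `thrust::universal_vector` and Unified Memory

thrust::universal_vector is Thrust's container for Unified Memory. Its storage is CUDA managed memory, which the CPU and GPU address through the same pointers, so no explicit host-device transfers are needed.

Basic usage

#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <cstddef>
#include <iostream>

int main() {
    // Create a universal_vector with 10 elements
    thrust::universal_vector<int> vec(10);

    // Fill it on the host; the thrust::host policy forces CPU execution
    thrust::fill(thrust::host, vec.begin(), vec.end(), 42);

    // Use the data on the device (pages migrate on demand)
    int* raw_ptr = thrust::raw_pointer_cast(vec.data());
    // Pass raw_ptr to a kernel, or run a Thrust algorithm with thrust::device...

    // Once the device work has finished (call cudaDeviceSynchronize() after
    // a raw kernel), the host reads the updated data directly
    for (std::size_t i = 0; i < vec.size(); i++) {
        std::cout << vec[i] << " ";
    }
    return 0;
}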

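Hardware and software requirements

Using thrust::universal_vector and Unified Memory requires the following:

1. Hardware requirements

NVIDIA GPU: Unified Memory itself is available from compute capability 3.0 (Kepler); on-demand page migration requires compute capability 6.0 (Pascal) or higher
- Compute capability 6.x (Pascal) adds GPU page faulting, on-demand migration, and memory oversubscription
- Compute capability 7.0 (Volta) and later offer better page migration and concurrent access support
AMD GPU: Thrust itself targets CUDA; AMD GPUs on the ROCm platform are served by the separate rocThrust port
CPU: x86_64 is the most common host architecture; CUDA also supports 64-bit Arm hosts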

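2. Software requirements

CUDA 6.0 or later (for NVIDIA GPUs)
- CUDA 8.0 added on-demand page migration for Pascal GPUs
- Newer releases bring further optimizations
Thrust 1.12 or later (the release that introduced thrust::universal_vector)
Operating system:
- Linux: kernel 3.13 or later
- Windows: a version supported by CUDA; note that on Windows, Unified Memory runs in the older pre-Pascal mode without on-demand page migration
- For NVIDIA hardware, a matching GPU driver must be installed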

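How it works

Unified Memory exposes a single managed address space shared by the CPU and GPU, with the following properties:

Automatic migration: pages move between CPU and GPU memory according to where they are accessed
Single pointer: host and device code use the same pointer to reach the data
Simpler programming: no manual cudaMemcpy calls or similar transfer management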

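Performance considerations

First-touch latency: the first access to data from a processor pays the page migration cost
Excessive migration: frequent alternating CPU and GPU access to the same pages can degrade performance
Prefetching: cudaMemPrefetchAsync can move data to the target device ahead of time
Pinned memory: for data accessed frequently from both sides, consider page-locked host memory allocated with cudaMallocHost, which does not migrate; cudaMallocManaged, by contrast, allocates managed memory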

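Advanced usage

// Query the current device to use as the prefetch target
int device_id = 0;
cudaGetDevice(&device_id);

// Prefetch the data to the GPU
cudaMemPrefetchAsync(thrust::raw_pointer_cast(vec.data()), vec.size() * sizeof(int), device_id);

// Advise the driver on where the data should preferably reside
cudaMemAdvise(thrust::raw_pointer_cast(vec.data()), vec.size() * sizeof(int),
              cudaMemAdviseSetPreferredLocation, device_id);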