
A Detailed Guide to thrust::device_vector in CUDA

news 2025/9/12 9:14:47

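1. Background

thrust::device_vector is the GPU container provided by the Thrust library. It works much like std::vector from the C++ standard library, except that its elements live in CUDA device memory.
It is an RAII wrapper: it allocates and frees CUDA memory automatically, so you do not have to call cudaMalloc / cudaFree by hand.

A simple way to think about it:

std::vector<T>: manages CPU (host) memory
thrust::device_vector<T>: manages GPU (device) memory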
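To make the RAII point concrete, here is a minimal sketch that sends data to the GPU and back twice: once with manual cudaMalloc / cudaFree and once with device_vector. The function names roundtrip_manual and roundtrip_raii are invented for this illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: manual memory management with the CUDA runtime API.
std::vector<int> roundtrip_manual(const std::vector<int>& in) {
    int* d_ptr = nullptr;
    const std::size_t bytes = in.size() * sizeof(int);
    cudaMalloc(&d_ptr, bytes);                                    // explicit allocation
    cudaMemcpy(d_ptr, in.data(), bytes, cudaMemcpyHostToDevice);  // host -> device
    std::vector<int> out(in.size());
    cudaMemcpy(out.data(), d_ptr, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_ptr);                                              // easy to forget on early returns
    return out;
}

// Hypothetical helper: the same round trip with device_vector.
std::vector<int> roundtrip_raii(const std::vector<int>& in) {
    thrust::device_vector<int> d(in.begin(), in.end());  // allocates and copies host -> device
    std::vector<int> out(in.size());
    thrust::copy(d.begin(), d.end(), out.begin());       // device -> host
    return out;                                          // d's destructor frees the device memory
}
```

Both functions return a copy of their input; the second one simply has no code path on which the device allocation can leak.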

2. Basic syntax

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cstddef>
#include <iostream>

int main() {
    // Create a 10-element vector on the GPU, initialized to 0
    thrust::device_vector<int> d_vec(10, 0);

    // Create a host_vector on the CPU
    thrust::host_vector<int> h_vec(10, 1);

    // Copy host_vector -> device_vector
    d_vec = h_vec;

    // Copy the data from the GPU back to the CPU
    thrust::host_vector<int> h_out = d_vec;

    // Print the result
    for (std::size_t i = 0; i < h_out.size(); i++) {
        std::cout << h_out[i] << " ";
    }
    std::cout << std::endl;
}

Output:

1 1 1 1 1 1 1 1 1 1

3. Constructors

thrust::device_vector<T> v;               // empty vector
thrust::device_vector<T> v(n);            // n value-initialized elements (0 for arithmetic types)
thrust::device_vector<T> v(n, value);     // n elements, each initialized to value
thrust::device_vector<T> v(v2);           // copy constructor
thrust::device_vector<T> v(h_vec);        // construct from a host_vector
thrust::device_vector<T> v(begin, end);   // construct from an iterator range

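4. Common member functions

Member function	Purpose
`size()`	Returns the number of elements
`resize(n)`	Changes the size to n elements
`empty()`	Reports whether the vector is empty
`operator[]`	Accesses one element (⚠️ callable only from host code; each access is a separate host/device transfer, so it is slow)
`begin() / end()`	Return iterators (usable with Thrust algorithms)
`data()`	Returns a `thrust::device_ptr<T>`; pass it through `thrust::raw_pointer_cast` to get a `T*` for a CUDA kernel
`clear()`	Removes all elements
`swap()`	Swaps the contents of two vectors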
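The following sketch exercises several of the member functions from the table in host code. The function name members_behave_as_described is made up for this example; it returns true when every check matches the behavior described above.

```cuda
#include <thrust/device_vector.h>

// Hypothetical helper: runs the container operations and checks their effects.
bool members_behave_as_described() {
    thrust::device_vector<int> a(4, 7);   // {7, 7, 7, 7}
    thrust::device_vector<int> b;         // empty

    bool ok = (a.size() == 4) && b.empty();

    a.resize(6);                          // appended elements are value-initialized (0 for int)
    ok = ok && (a.size() == 6)
            && (static_cast<int>(a[3]) == 7)   // each a[i] is a small device -> host copy
            && (static_cast<int>(a[5]) == 0);

    a.swap(b);                            // exchanges the two vectors' contents
    ok = ok && a.empty() && (b.size() == 6);

    b.clear();                            // size drops to 0
    ok = ok && b.empty();
    return ok;
}
```

The static_cast on a[i] makes the implicit conversion from the device reference to int explicit; it is also a reminder that each such read goes through the host/device boundary.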

5. Interacting with the host

Thrust provides host_vector for managing CPU memory.
The two containers can be assigned to each other directly; Thrust performs the host/device copy (cudaMemcpy) for you.

thrust::host_vector<int> h(4);
h[0] = 10; h[1] = 20; h[2] = 30; h[3] = 40;

thrust::device_vector<int> d = h;  // host -> device
thrust::host_vector<int> h2 = d;   // device -> host

6. Working with CUDA kernels

To use the data of a device_vector inside a custom CUDA kernel, obtain a raw pointer with thrust::raw_pointer_cast:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

__global__ void kernel(int* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] *= 2;
    }
}

int main() {
    thrust::device_vector<int> d(5, 1);
    int* raw_ptr = thrust::raw_pointer_cast(d.data());

    kernel<<<1, 5>>>(raw_ptr, static_cast<int>(d.size()));
    cudaDeviceSynchronize();

    thrust::host_vector<int> h = d;
    for (int x : h) std::cout << x << " "; // prints 2 2 2 2 2
}
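The launch above uses a single block of 5 threads, which only works for tiny inputs. As a hedged sketch of how the same pattern might scale to an arbitrary length, the version below computes the grid size from the element count. The names double_elements and double_on_device, and the block size of 256, are choices made for this illustration.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <vector>

__global__ void double_elements(int* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2;   // guard: the last block may be only partly used
}

// Hypothetical helper: doubles every element on the GPU and returns the result.
std::vector<int> double_on_device(const std::vector<int>& in) {
    thrust::device_vector<int> d(in.begin(), in.end());
    const int n = static_cast<int>(d.size());
    if (n > 0) {
        const int threads = 256;                          // assumed block size
        const int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
        double_elements<<<blocks, threads>>>(thrust::raw_pointer_cast(d.data()), n);
        cudaDeviceSynchronize();                          // kernel launches are asynchronous
    }
    thrust::host_vector<int> h = d;                       // device -> host
    return std::vector<int>(h.begin(), h.end());
}
```

The bounds check inside the kernel matters here because the rounded-up grid usually contains more threads than elements.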

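7. Combining with Thrust algorithms

Thrust provides STL-like parallel algorithms that operate directly on a device_vector:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Initializer-list construction requires a Thrust version that supports it
thrust::device_vector<int> d = {5, 1, 4, 2, 3};

// Sort
thrust::sort(d.begin(), d.end());

// Sum
int sum = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());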

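8. Caveats

Element access through operator[] is slow: every call transfers a single element between the GPU and the CPU.
For bulk access, use thrust::copy or a host_vector.
Synchronization:
In current Thrust releases (1.9.4 and later), algorithms called with the default execution policy block the host until they finish, so no extra cudaDeviceSynchronize() is needed after them. Work is asynchronous only when you opt in, for example with thrust::cuda::par_nosync or by launching your own kernels; in those cases, synchronize before using the results on the host.
Thread safety:
device_vector is not thread-safe; multiple host threads must not write to the same vector concurrently.
Memory management:
Storage is allocated and released with cudaMalloc / cudaFree, which cost more than the allocations behind std::vector, so avoid creating and destroying many small vectors in a hot path.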
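Two of the caveats above can be sketched in code: reading data back in bulk instead of element by element, and reusing one buffer instead of allocating inside a loop. The function names read_elementwise, read_bulk, and sum_of_batches are hypothetical and exist only for this illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <cstddef>
#include <vector>

// Slow path: every d[i] is a separate device -> host transfer.
std::vector<int> read_elementwise(const thrust::device_vector<int>& d) {
    std::vector<int> out(d.size());
    for (std::size_t i = 0; i < d.size(); ++i) {
        out[i] = d[i];
    }
    return out;
}

// Preferred path: one device -> host transfer for the whole range.
std::vector<int> read_bulk(const thrust::device_vector<int>& d) {
    thrust::host_vector<int> h = d;
    return std::vector<int>(h.begin(), h.end());
}

// Allocate the scratch buffer once and overwrite it in each iteration,
// rather than constructing a new device_vector per batch.
long long sum_of_batches(int batches, std::size_t n) {
    thrust::device_vector<int> scratch(n);                 // single allocation up front
    long long total = 0;
    for (int b = 0; b < batches; ++b) {
        thrust::fill(scratch.begin(), scratch.end(), b);   // reuse the same storage
        total += thrust::reduce(scratch.begin(), scratch.end(), 0LL);
    }
    return total;
}
```

Both read functions return the same data; the difference is the number of transfers, which grows with the vector size in the element-wise version.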

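9. Use cases

Containers for GPU-accelerated parallel code (in place of hand-written cudaMalloc memory management)
Use with Thrust parallel algorithms (sort, reduce, transform, and so on)
Managing the input and output data of CUDA kernels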

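10. Summary

thrust::device_vector is essentially std::vector on the GPU.
It simplifies CUDA memory management and supports STL-style algorithms.
It is a good fit for GPU parallel computing, but keep an eye on the cost of operator[] and of memory allocation.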

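11. `thrust::device_vector` cheat sheet of common operations

1. Construction and initialization

thrust::device_vector<int> d1;             // empty vector
thrust::device_vector<int> d2(10);         // 10 value-initialized elements (all 0)
thrust::device_vector<int> d3(10, 42);     // 10 elements, all equal to 42
thrust::device_vector<int> d4 = {1, 2, 3}; // list initialization (C++11; needs a Thrust version with initializer-list support)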

2. Host / device copies

thrust::host_vector<int> h(5, 1);

// host -> device
thrust::device_vector<int> d = h;

// device -> host
thrust::host_vector<int> h2 = d;

// device -> device
thrust::device_vector<int> d2 = d;

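3. Element access

Use operator[] sparingly; it is slow because each access triggers a host/device copy.

// Recommended: copy in bulk
thrust::host_vector<int> h = d;

// Not recommended: access a single element
int x = d[0];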

4. Getting a raw pointer (for kernels)

int* raw_ptr = thrust::raw_pointer_cast(d.data());

5. Combining with a CUDA kernel

__global__ void kernel(int* data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2;
}

thrust::device_vector<int> d(5, 1);
int* raw_ptr = thrust::raw_pointer_cast(d.data());

kernel<<<1, 5>>>(raw_ptr, static_cast<int>(d.size()));
cudaDeviceSynchronize();

6. Copying (Thrust algorithm)

// The destination range must already be large enough to hold the source.

// host -> device
thrust::copy(h.begin(), h.end(), d.begin());

// device -> host
thrust::copy(d.begin(), d.end(), h.begin());

// device -> device
thrust::copy(d.begin(), d.end(), d2.begin());

7. Sorting

thrust::device_vector<int> d = {4, 2, 1, 3};
thrust::sort(d.begin(), d.end());  // result: 1 2 3 4

8. Reduction (sum / maximum / minimum)

// Requires <climits>, <thrust/functional.h>, and <thrust/reduce.h>
int sum = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
int maxv = thrust::reduce(d.begin(), d.end(), INT_MIN, thrust::maximum<int>());
int minv = thrust::reduce(d.begin(), d.end(), INT_MAX, thrust::minimum<int>());

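9. Transformation (parallel for_each / transform)

// Multiply every element by 2
thrust::transform(d.begin(), d.end(), d.begin(),
                  thrust::placeholders::_1 * 2);

// for_each style; device lambdas require compiling with nvcc --extended-lambda
thrust::for_each(d.begin(), d.end(), [] __device__ (int& x) { x += 10; });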
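If you would rather not depend on device lambdas, a named functor can do the same job. The sketch below is one way to write it; AffineOp and affine_on_device are names invented for this example.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <vector>

// Hypothetical functor: computes a * x + b for each element.
struct AffineOp {
    int a;
    int b;
    __host__ __device__ int operator()(int x) const { return a * x + b; }
};

// Hypothetical helper: applies AffineOp on the GPU and returns the result.
std::vector<int> affine_on_device(const std::vector<int>& in, int a, int b) {
    thrust::device_vector<int> d(in.begin(), in.end());               // host -> device
    thrust::transform(d.begin(), d.end(), d.begin(), AffineOp{a, b}); // in-place transform
    thrust::host_vector<int> h = d;                                   // device -> host
    return std::vector<int>(h.begin(), h.end());
}
```

A functor carries its parameters as data members, so it needs no special compiler flag and can be reused across several algorithm calls.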

10. Filling / initialization

thrust::fill(d.begin(), d.end(), 7);   // set every element to 7
thrust::sequence(d.begin(), d.end(), 0, 1); // generate 0, 1, 2, 3, ...

11. Conditional counting / filtering

int count_even = thrust::count_if(d.begin(), d.end(),
                                  [] __device__ (int x) { return x % 2 == 0; });

12. unique / remove

// Remove duplicates; unique only collapses adjacent equal elements, so sort first if the data is unsorted
auto new_end = thrust::unique(d.begin(), d.end());
d.erase(new_end, d.end());

// Remove elements less than 0
auto new_end2 = thrust::remove_if(d.begin(), d.end(),
                                  [] __device__ (int x) { return x < 0; });
d.erase(new_end2, d.end());
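Because thrust::unique only merges neighboring duplicates, removing all duplicates from unsorted data takes a sort first. A minimal sketch of that sort, unique, erase sequence follows; the function name distinct_values is made up for the example.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <vector>

// Hypothetical helper: returns the distinct values of the input in ascending order.
std::vector<int> distinct_values(const std::vector<int>& in) {
    thrust::device_vector<int> d(in.begin(), in.end());
    thrust::sort(d.begin(), d.end());                   // bring equal values next to each other
    auto new_end = thrust::unique(d.begin(), d.end());  // compact; returns the new logical end
    d.erase(new_end, d.end());                          // shrink the container to match
    thrust::host_vector<int> h = d;
    return std::vector<int>(h.begin(), h.end());
}
```

Without the erase call the vector keeps its old size, and the elements past new_end are left in an unspecified state.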

13. scan (prefix sum)

thrust::device_vector<int> d = {1, 2, 3, 4};
thrust::device_vector<int> inc(d.size());
thrust::device_vector<int> exc(d.size());

// inclusive_scan: 1, 3, 6, 10
thrust::inclusive_scan(d.begin(), d.end(), inc.begin());

// exclusive_scan: 0, 1, 3, 6
thrust::exclusive_scan(d.begin(), d.end(), exc.begin());
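One common use of an exclusive scan is turning per-item counts into starting offsets, for example when packing variable-length segments into one buffer. The sketch below shows the idea under that assumption; offsets_from_counts is a hypothetical name.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <vector>

// Hypothetical helper: offsets[i] is the sum of counts[0..i-1], so offsets[0] is 0.
std::vector<int> offsets_from_counts(const std::vector<int>& counts) {
    thrust::device_vector<int> d_counts(counts.begin(), counts.end());
    thrust::device_vector<int> d_offsets(d_counts.size());   // separate output keeps the input intact
    thrust::exclusive_scan(d_counts.begin(), d_counts.end(), d_offsets.begin());
    thrust::host_vector<int> h = d_offsets;
    return std::vector<int>(h.begin(), h.end());
}
```

For counts {3, 0, 2, 4} the offsets are {0, 3, 3, 5}: each segment starts where the previous ones end.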