Is not as trivial as it seems. I just finished to implement one and I can tell what you need
read the scan
Gpu Gems 3 Article in particular chapter **39.3.1 Stream Compaction**.
To implement your own start from the LargeArrayScan example in the SDK, that will give you just the prescan. Assuming you have the selection array in device memory (an array of 1 and 0 meaning 1- select 0 - discard),
dev_selection_array a
dev_elements_array elements to be selected a
dev_prescan_array and a
dev_result_array all of size
N then you do
prescan(dev_prescan_array,dev_selection_array, N);
scatter(dev_result_array, dev_prescan_array,
dev_selection_array, dev_elements_array, N);
where the scatter is
__global__ void scatter_kernel( T*dev_result_array,
const T* dev_prescan_array,
const T* dev_selection_array,
const T* dev_elements_array, std::size_t size){
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= size) return;
if (dev_selection_array[idx] == 1){
dev_result_array[dev_prescan_array[idx]] = dev_elements_array[idx];
}
}
for other nice application of the prescan see the paper
Ble93Otherwise you can use the
CUDPP library where such primitives exists and are highly optimized even more that the sdk one.