c++ - CUDA - Generating the Halton sequence in parallel -
i want write kernel in cuda generate the halton sequence in parallel, 1 value generated , stored each thread.
looking @ sequence, seem generating each subsequent value in sequence involves work done in generating previous value. generating each value scratch involve redundant work , cause large gap between execution times of threads.
is there way parallel kernel improves upon serial algorithm? i'm new parallel programming pardon question if answer well-known pattern.
note: did find this link in textbook (which uses without describing how works) file link there dead.
halton sequence generated by:
- get representation of in base-p numeral system
- reverse bit order
for example, base-2 halton sequence:
index binary reversed result 1 1 1 1 / 10 = 1 / 2 2 10 01 01 / 100 = 1 / 4 3 11 11 11 / 100 = 3 / 4 4 100 001 001 / 1000 = 1 / 8 5 101 101 101 / 1000 = 5 / 8 6 110 011 011 / 1000 = 3 / 8 7 111 111 111 / 1000 = 7 / 8
so there indeed lot of repeated work in bit-wise reverse. first thing can reuse previous results.
when computing element index in base-p halton sequence, first determine leading bit , remaining part of base-p representation of (this can done scheduling threads in base-p fashion). have
out[i] = out[remaining_part] + leading_bit / p^(length_of_i_in_base_p_representation - 1) //"^" used convenience
to avoid unnecessary global memory read, each thread should handle elements same "remaining part" different "leading bit". if generating halton sequence between p^n , p^(n+1), there should conceptually p^n parallel tasks. however, causes no problem if assign thread group of tasks.
further optimization can made mixing re-compute , load-from-memory.
example code:
total thread number should p^m.
const int m = 3 //any value __device__ void halton(float* out, int p, int n) { const int tid = ... //globally unique , continuous thread id const int step = p^m; //you know mean int w = step; //w weight of leading bit for(int n = m; n <= n; ++n) //n position of leading bit { for(int r = tid; r < w; r += step) //r remaining part for(int l = 1; l < p; ++l) //l leading bit out[l*w + r] = out[r] + l/w; w *= p; } }
note: example not compute first p^m elements in halton sequence, these values still needed.