c++ - CUDA - Generating the Halton sequence in parallel -

i want write kernel in cuda generate the halton sequence in parallel, 1 value generated , stored each thread.

looking @ sequence, seem generating each subsequent value in sequence involves work done in generating previous value. generating each value scratch involve redundant work , cause large gap between execution times of threads.

is there way parallel kernel improves upon serial algorithm? i'm new parallel programming pardon question if answer well-known pattern.

note: did find this link in textbook (which uses without describing how works) file link there dead.

halton sequence generated by:

get representation of in base-p numeral system
reverse bit order

for example, base-2 halton sequence:

index      binary     reversed     result 1             1           1           1 /   10 = 1 / 2 2            10          01          01 /  100 = 1 / 4 3            11          11          11 /  100 = 3 / 4 4           100         001         001 / 1000 = 1 / 8 5           101         101         101 / 1000 = 5 / 8 6           110         011         011 / 1000 = 3 / 8 7           111         111         111 / 1000 = 7 / 8

so there indeed lot of repeated work in bit-wise reverse. first thing can reuse previous results.

when computing element index in base-p halton sequence, first determine leading bit , remaining part of base-p representation of (this can done scheduling threads in base-p fashion). have

out[i] = out[remaining_part] + leading_bit / p^(length_of_i_in_base_p_representation - 1) //"^" used convenience

to avoid unnecessary global memory read, each thread should handle elements same "remaining part" different "leading bit". if generating halton sequence between p^n , p^(n+1), there should conceptually p^n parallel tasks. however, causes no problem if assign thread group of tasks.

further optimization can made mixing re-compute , load-from-memory.

example code:

total thread number should p^m.

const int m = 3 //any value __device__ void halton(float* out, int p, int n) {     const int tid = ... //globally unique , continuous thread id     const int step = p^m; //you know mean     int w = step; //w weight of leading bit     for(int n = m; n <= n; ++n) //n position of leading bit     {         for(int r = tid; r < w; r += step) //r remaining part         for(int l = 1; l < p; ++l) //l leading bit             out[l*w + r] = out[r] + l/w;         w *= p;     } }

note: example not compute first p^m elements in halton sequence, these values still needed.

Search This Blog

hj

c++ - CUDA - Generating the Halton sequence in parallel -

Popular posts from this blog

title2

title3

title3