
**Design document for OpenMP reductions on the GPU**

//Abstract: //In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises
* a succinct background review,
* an introduction to the decoupling of reduction algorithm and
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used and
* a brief overview of the steps we have taken beyond the previous
  implementation.

**Problem Review**

Consider a typical OpenMP program with a reduction pragma:

```
double foo, bar;
#pragma omp parallel for reduction(+:foo, bar)
for (int i = 0; i < N; i++) {
  foo += A[i]; bar += B[i];
}
```
where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to aggregate the values of foo and bar in such a manner
that the implementation
* keeps the compiler-generated code logically concise, and
* reduces efficiently within threads, warps, blocks, and across the device.

**Introduction to Decoupling**

In this section we address the problem of keeping the compiler-generated code
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was tightly coupled to the specifics of
the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple the two; in the final implementation,
this allowed us to remove all template functions from our runtime system.

The (simplified) pseudo code generated by LLVM is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
                                  interWarpCpyFn)
   where:
   struct ReduceData {
     double *foo;
     double *bar;
   } reduceData
   reduceData.foo = &foo_p
   reduceData.bar = &bar_p

   shuffleReduceFn and interWarpCpyFn are two auxiliary functions
   generated to aid the runtime in performing the algorithmic steps
   while remaining agnostic about the data structure of ReduceData.

   In particular, shuffleReduceFn is a function that takes the following
   inputs:
   a. the local copy of ReduceData
   b. its lane_id
   c. the offset of the lane_id which hosts a remote ReduceData
      relative to the current one
   d. an algorithm version parameter determining which reduction
      algorithm to use.
   This shuffleReduceFn retrieves the remote ReduceData through shuffle
   intrinsics, reduces the local ReduceData with the remote ReduceData
   element-wise using the algorithm specified by the 4th parameter, and
   places the resultant values into the local ReduceData.

   Different reduction algorithms are implemented with different runtime
   functions, but they all make calls to this same shuffleReduceFn to
   perform the essential reduction step. Therefore, based on the 4th
   parameter, this shuffleReduceFn will behave slightly differently to
   cooperate with the runtime function to ensure correctness under
   different circumstances.

   InterWarpCpyFn, as the name suggests, is a function that copies data
   across warps. Its function is to tunnel all the thread-private
   ReduceData that is already reduced within a warp to a lane in the first
   warp with minimal shared memory footprint. This is an essential step to
   prepare for the last step of a block reduction.

   (Warp, block, and device level reduction routines that utilize these
   auxiliary functions will be discussed in the next section.)

4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
```
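
To make the decoupling concrete, the following is a minimal sketch of the kind
of shuffleReduceFn the compiler might emit for the foo/bar example. It is
illustrative only: the helper shuffle_down_double and the exact calling
convention are assumptions, not the actual generated code, and the pre-CUDA-9
__shfl_down intrinsic is used for brevity.

```
// Illustrative sketch only; shuffle_down_double is a hypothetical helper.
struct ReduceData {
  double *foo;
  double *bar;
};

// A double is shuffled as two 32-bit halves because the shuffle
// intrinsics operate on 32-bit registers.
__device__ double shuffle_down_double(double val, int offset) {
  int lo = __double2loint(val);
  int hi = __double2hiint(val);
  lo = __shfl_down(lo, offset, 32);
  hi = __shfl_down(hi, offset, 32);
  return __hiloint2double(hi, lo);
}

// Data-structure specific: knows ReduceData and the '+' operator.
// Algorithm agnostic: the version parameter selects the reduction variant.
__device__ void shuffleReduceFn(void *reduce_data, int lane_id, int offset,
                                int algo_version) {
  ReduceData *rd = (ReduceData *)reduce_data;
  double remote_foo = shuffle_down_double(*rd->foo, offset);
  double remote_bar = shuffle_down_double(*rd->bar, offset);
  bool reduce;
  if (algo_version == 0)        // full warp reduction: always combine
    reduce = true;
  else if (algo_version == 1)   // contiguous partial warp reduction
    reduce = lane_id < offset;
  else                          // dispersed partial warp reduction
    reduce = (lane_id % 2 == 0 && offset > 0);
  if (reduce) {
    *rd->foo = *rd->foo + remote_foo;   // '@' instantiated as '+'
    *rd->bar = *rd->bar + remote_bar;
  } else {
    *rd->foo = remote_foo;
    *rd->bar = remote_bar;
  }
}
```

The three runtime algorithms described in the next section differ only in how
they compute the lane_id and offset arguments before calling this function.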

**Reduction Algorithms**

At the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

```
gpu_regular_warp_reduce(void *reduce_data,
                        kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
    ShuffleReduceFn(reduce_data, 0, offset, 0);
}
```
ShuffleReduceFn is called here with lane_id set to 0 because the parameter is
unused in this version; we thereby save instructions by not retrieving lane_id
from the corresponding special registers. The 4th parameter, which represents
the version of the algorithm being used, is set to 0 to signify full warp
reduction.

In this version (=0), ShuffleReduceFn behaves, per element, as follows:

```
// reduce_elem refers to an element in the local ReduceData
// remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
reduce_elem = reduce_elem @ remote_elem;
```

An illustration of this algorithm operating on a hypothetical 8-lane full warp
is:
{F74}
The coloring invariant is that elements with the same color are combined and
reduced in the next reduction step. As can be observed, no overhead is
incurred and exactly log2(N) steps are needed.
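
As a standalone illustration of this pattern, here is a minimal, self-contained
CUDA sketch that sum-reduces one full warp with the same offset sequence
(16, 8, 4, 2, 1). It is for exposition only, assumes a single fully active
warp, and uses the CUDA 9+ __shfl_down_sync intrinsic with a full mask:

```
#include <cstdio>

// Sum-reduce 32 doubles within one fully active warp; lane 0 ends up
// holding the total, mirroring gpu_regular_warp_reduce with '@' = '+'.
__global__ void warp_sum(const double *in, double *out) {
  double v = in[threadIdx.x];
  for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffffu, v, offset);
  if (threadIdx.x == 0)
    *out = v;
}

int main() {
  double h_in[32], h_out = 0.0;
  for (int i = 0; i < 32; ++i) h_in[i] = 1.0;   // expected sum: 32
  double *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(double));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  warp_sum<<<1, 32>>>(d_in, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
  printf("warp sum = %f\n", h_out);             // prints 32.000000
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
```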

2. Contiguous Partial Warp Reduction
```
gpu_irregular_warp_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn, int size,
                          int lane_id) {
  int curr_size;
  int offset;
  curr_size = size;
  offset = curr_size/2;
  while (offset > 0) {
    ShuffleReduceFn(reduce_data, lane_id, offset, 1);
    curr_size = (curr_size+1)/2;
    offset = curr_size/2;
  }
}
```

In this version (=1), ShuffleReduceFn behaves, per element, as follows:
```
// reduce_elem refers to an element in the local ReduceData
// remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id < offset) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```

An important invariant (also a restriction on the starting state of the
reduction) is that this algorithm assumes that all ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp starting from
lane 0.

When the number of participating lanes is odd, the trailing active lane's
value would not be aggregated with any other lane. Therefore, in order to
preserve the invariant, such ReduceData is copied to the first lane whose
thread-local ReduceData has already been consumed in that same reduction step
and would therefore be useless otherwise. For example, with five
participating lanes the offset is 2: lanes 0 and 1 reduce with lanes 2 and 3,
and lane 4's value is copied to lane 2, whose own value was just consumed by
lane 0.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp would be:
{F75}

As illustrated, this version of the algorithm introduces overhead whenever
there is an odd number of participating lanes in any reduction step, because
data must be copied between lanes.
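
The lane pairing can be traced on the host. The following sketch walks the
size/offset arithmetic of gpu_irregular_warp_reduce for size = 5, printing
which lanes combine (the '@' operator is only printed, not evaluated) and
where the copy step occurs:

```
#include <stdio.h>

// Host-side trace of gpu_irregular_warp_reduce's index arithmetic for
// size = 5. Prints which lanes combine and which lane receives a copy.
int main() {
  int curr_size = 5;
  int offset = curr_size / 2;
  while (offset > 0) {
    for (int lane = 0; lane < curr_size; ++lane) {
      if (lane < offset)
        printf("lane %d @= lane %d\n", lane, lane + offset);
      else if (lane + offset < curr_size)
        printf("lane %d <- lane %d  (copy preserves contiguity)\n",
               lane, lane + offset);
    }
    curr_size = (curr_size + 1) / 2;
    offset = curr_size / 2;
    printf("-- %d live lane(s) remain --\n", curr_size);
  }
  return 0;
}
```

Running this prints three steps: 5 lanes reduce to 3 (with one copy), 3 to 2
(with one copy), and 2 to 1, matching the overhead discussion above.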

3. Dispersed Partial Warp Reduction
```
gpu_irregular_simt_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  int size, remote_id;
  int logical_lane_id = find_number_of_dispersed_active_lanes_before_me() * 2;
  do {
    remote_id = find_the_next_active_lane_id_right_after_me();
    // the above function returns 0 if no active lane
    // is present right after the current thread.
    size = get_number_of_active_lanes_in_this_warp();
    logical_lane_id /= 2;
    ShuffleReduceFn(reduce_data, logical_lane_id, remote_id-1-threadIdx.x, 2);
  } while (logical_lane_id % 2 == 0 && size > 1);
}
```

There is no assumption made about the initial state of the reduction.
Any number of lanes (>=1) could be active at any position. The reduction
result is kept in the first active lane.

In this version (=2), ShuffleReduceFn behaves, per element, as follows:
```
// reduce_elem refers to an element in the local ReduceData
// remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id % 2 == 0 && offset > 0) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```
We proceed with a brief explanation of some of the arguments passed in. Note
that this section introduces the concept of a logical_lane_id, which must be
distinguished from the physical lane_id defined by NVIDIA.
1. //logical_lane_id//: as the name suggests, this is a calculated lane id
   (instead of the physical one defined by NVIDIA) that makes our algorithm
   logically concise. A thread with logical_lane_id k has k active threads
   before it.
2. //remote_id-1-threadIdx.x//: remote_id is one more than the NVIDIA-defined
   lane id of the remote lane from which we will retrieve the ReduceData (it
   is 1-based so that 0 can signal the absence of a following active lane).
   We subtract (threadIdx.x + 1) from it because we would like to maintain
   only one underlying shuffle intrinsic (which is used to communicate among
   lanes in a warp). The particular shuffle intrinsic we use accepts only
   offsets relative to the current lane, rather than an absolute lane id, so
   the subtraction converts the computed lane id into an offset. A sketch of
   how these helpers could be implemented follows this list.
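
The following illustrative guesses implement the helper functions named in the
pseudocode using ballot and population-count intrinsics. The bodies are
assumptions, not the runtime's actual code (the pre-CUDA-9 __ballot intrinsic
is used, matching the rest of this document):

```
// Hypothetical implementations of the helpers used by
// gpu_irregular_simt_reduce; illustrative only.
__device__ int find_number_of_dispersed_active_lanes_before_me() {
  unsigned mask = __ballot(1);          // bit i is set iff lane i is active
  unsigned lane = threadIdx.x % 32;
  return __popc(mask & ((1u << lane) - 1));   // count active lanes below me
}

__device__ int find_the_next_active_lane_id_right_after_me() {
  unsigned mask = __ballot(1);
  unsigned lane = threadIdx.x % 32;
  // Keep only lanes strictly above mine; guard the lane == 31 shift.
  unsigned higher = (lane == 31) ? 0u : (mask & (~0u << (lane + 1)));
  // __ffs is 1-based, which yields exactly the 1-based remote_id the
  // pseudocode expects (0 when no active lane follows).
  return higher ? __ffs(higher) : 0;
}

__device__ int get_number_of_active_lanes_in_this_warp() {
  return __popc(__ballot(1));           // total active lanes in the warp
}
```

With these definitions, remote_id-1-threadIdx.x is precisely the positive
shuffle offset from the current lane to the next active lane.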

This algorithm is different from the previous ones in two ways, and it is not,
conceptually, a generalization of them.
1. It reduces elements close to each other: for instance, the values in the
   0th lane are combined with those of the 1st lane, and the values in the
   2nd lane are combined with those of the 3rd lane. We did not use the
   previous scheme, in which the first half of the (partial) warp is reduced
   with the second half, because the mapping
     f(x): logical_lane_id -> physical_lane_id
   can be easily calculated, whereas its inverse
     f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the active
   lanes, we do not need to perform the copying step of the second
   algorithm.
An illustrative run would look like:
{F76}
As observed, the overhead is high because logical_lane_id, and likewise
remote_id, must be recalculated in each and every reduction step.

At the block level, we have implemented the following block reduce algorithm:

```
gpu_irregular_block_reduce(void *reduce_data,
                           kmp_ShuffleReductFctPtr shuflReduceFn,
                           kmp_InterWarpCopyFctPtr interWarpCpyFn,
                           int size) {

  int wid = threadIdx.x/WARPSIZE;
  int lane_id = threadIdx.x%WARPSIZE;

  int warp_needed = (size+WARPSIZE-1)/WARPSIZE; // ceiling of division

  unsigned tnum = __ballot(1);
  int thread_num = __popc(tnum);

  // full warp reduction
  if (thread_num == WARPSIZE) {
    gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
  }
  // partial warp reduction
  if (thread_num < WARPSIZE) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                              lane_id);
  }
  // Gather all the reduced values from each warp
  // to the first warp. A named_barrier inside this
  // function ensures correctness; it is effectively
  // a sync_threads that won't deadlock.
  interWarpCpyFn(reduce_data, warp_needed);

  // Reduce the data gathered from each "warp master".
  if (wid == 0) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                              lane_id);
  }

  return;
}
```
In this function, ShuffleReduceFn is never called directly; instead, the
function invokes the appropriate warp-reduction routines. It first reduces
ReduceData warp by warp; at that point, the number of live ReduceData equals
the number of warps present in the thread block. We then proceed to gather
all such ReduceData into the first warp.

As observed, this algorithm makes use of the function InterWarpCpyFn, which
copies data from each "warp master" (the 0th lane of each warp, where a
warp-reduced ReduceData is held) to the 0th warp. This step reduces (in a
mathematical sense) the problem of reduction across warp masters in a block
to the problem of a warp reduction, which we already have solutions to.
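
For concreteness, an interWarpCpyFn for the foo/bar example could look like
the following sketch. It reuses the ReduceData struct from the earlier sketch;
the named_barrier helper (taken to be the runtime's deadlock-free block-wide
barrier, as in the appendix) and the 32-slot transfer medium are assumptions:

```
// Illustrative sketch only; named_barrier is assumed to be the runtime's
// deadlock-free block-wide barrier.
__device__ void interWarpCpyFn(void *reduce_data, int warp_num) {
  __shared__ double transfer[32];   // one slot per possible warp
  ReduceData *rd = (ReduceData *)reduce_data;
  int wid = threadIdx.x / 32;
  int lane_id = threadIdx.x % 32;

  // Tunnel 'foo' through shared memory: one value per warp master.
  if (lane_id == 0)
    transfer[wid] = *rd->foo;
  named_barrier(1, blockDim.x);
  if (wid == 0 && lane_id < warp_num)
    *rd->foo = transfer[lane_id];
  named_barrier(1, blockDim.x);

  // Repeat for 'bar' so the single transfer array can be reused.
  if (lane_id == 0)
    transfer[wid] = *rd->bar;
  named_barrier(1, blockDim.x);
  if (wid == 0 && lane_id < warp_num)
    *rd->bar = transfer[lane_id];
  named_barrier(1, blockDim.x);
}
```

After this call, lanes 0..warp_num-1 of warp 0 hold the per-warp results, and
the final gpu_irregular_warp_reduce over warp_needed lanes completes the
block reduction.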

We can thus completely avoid the use of atomics to reduce within a
threadblock.

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). The tuple (s, id) therefore
uniquely identifies an active threadblock on the GPU.

To implement cross block reduce efficiently, we first allocate, for each
value to be reduced, an array of size S*N (the maximum number of active
threadblocks at any time on the device).

Each threadblock reduces its value into slot [s][id]. This can be done
without locking since no other threadblock can write to the same slot
concurrently.
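
Obtaining the 's' in the (s, id) tuple is possible on NVIDIA GPUs by reading
the %smid special register, as in the sketch below; how the per-SM instance
identifier 'id' is assigned is left unspecified here, and the slot-update
helper is hypothetical:

```
// Read the id of the SM the calling thread runs on (the 's' of the
// (s, id) tuple) via the %smid special register exposed by PTX.
__device__ unsigned get_smid() {
  unsigned smid;
  asm("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;
}

// Hypothetical use: the block master accumulates the block-reduced value
// into its private slot. '+=' rather than '=' because successive (never
// concurrent) threadblocks may legitimately reuse the same slot.
__device__ void accumulate_block_result(double *reduction_array, int N,
                                        int instance_id, double value) {
  if (threadIdx.x == 0)
    reduction_array[get_smid() * N + instance_id] += value;
}
```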

As a final stage, we reduce the values in the array as follows:

```
// Compiler generated wrapper function for each target region with a
// reduction clause.
target_function_wrapper(map_args, reduction_array) <--- start with 1 team
                                                        and 1 thread.
  // Use dynamic parallelism to launch M teams, N threads as requested by
  // the user to execute the target region.
  target_function<<<M, N>>>(map_args)

  Reduce values in reduction_array
```
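
For illustration, a concrete shape for this wrapper under the legacy CUDA
dynamic parallelism of this era might look as follows. Everything here is an
assumption for exposition: the parameter list, the child-kernel signature,
the zero-initialized reduction array, and the device-side
cudaDeviceSynchronize (which legacy dynamic parallelism permitted):

```
// Illustrative sketch only; target_function is the user's target region.
// reduction_array has slots == S*N entries and is assumed zero-initialized
// so that unwritten slots contribute the identity of '+'.
__global__ void target_function_wrapper(double *A, int n, double *foo_out,
                                        double *reduction_array, int teams,
                                        int threads, int slots) {
  // This wrapper itself is launched <<<1, 1>>>.
  target_function<<<teams, threads>>>(A, n, reduction_array);
  cudaDeviceSynchronize();   // legacy CDP: wait for the child grid
  // All active threadblocks have accumulated into their slots; fold the
  // S*N partial results serially.
  double foo = 0.0;
  for (int i = 0; i < slots; ++i)
    foo += reduction_array[i];
  *foo_out += foo;
}
```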

**Comparison with the Previous Version**

The (simplified) pseudo code generated by LLVM on the host is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
   where:
   struct ReduceData {
     double *foo;
     double *bar;
   } reduceData
   reduceData.foo = &foo_p
   reduceData.bar = &bar_p

   reduceFn is a pointer to a function that takes in two inputs
   of type ReduceData, "reduces" them element wise, and places the
   result in the first input:
   reduceFn(ReduceData *a, ReduceData *b)
     a = a @ b

   Every thread in the parallel region calls kmpc_reduce_nowait with
   its private copy of reduceData. The runtime reduces across the
   threads (using tree reduction on the operator 'reduceFn') and stores
   the final result in the master thread if successful.
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
   In this case kmpc_reduce_nowait() could not use tree reduction,
   so use atomics instead:
   each thread atomically writes to foo
   each thread atomically writes to bar
```

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the code generated are as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
   status = can_block_reduce()
   if status == 1:
     reduce efficiently to thread 0 using shuffles and shared memory.
     return 1
   else
     cannot use efficient block reduction, fallback to atomics
     return 2
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
   In this case kmpc_reduce_nowait() could not use tree reduction,
   so use atomics instead:
   each thread atomically writes to foo
   each thread atomically writes to bar
```

The function can_block_reduce() is defined as follows:

```
int32_t can_block_reduce() {
  int tid = GetThreadIdInTeam();
  int nt = GetNumberOfOmpThreads(tid);
  if (nt != blockDim.x)
    return 0;
  unsigned tnum = __ballot(1);
  if (tnum != (~0x0)) {
    return 0;
  }
  return 1;
}
```

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (return 1) only if (a) all SIMT threads in
a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region
(set by the num_threads clause) equals blockDim.x.

If either of these preconditions is not met, each thread in the threadblock
updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs, and we must avoid them completely.


**Appendix: Implementation Details**

```
// Compiler generated function.
reduceFn(ReduceData *a, ReduceData *b)
  a->foo = a->foo + b->foo
  a->bar = a->bar + b->bar

// Compiler generated function.
swapAndReduceFn(ReduceData *thread_private, int lane)
  ReduceData *remote = new ReduceData()
  remote->foo = shuffle_double(thread_private->foo, lane)
  remote->bar = shuffle_double(thread_private->bar, lane)
  reduceFn(thread_private, remote)

// OMP runtime function.
warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
  offset = 16
  while (offset > 0)
    swapAndReduceFn(thread_private, offset)
    offset /= 2

// OMP runtime function.
warpReduce_irregular():
  ...

// OMP runtime function.
kmpc_reduce_warp(reduceData, swapAndReduceFn)
  if all_lanes_active:
    warpReduce_regular(reduceData, swapAndReduceFn)
  else:
    warpReduce_irregular(reduceData, swapAndReduceFn)
  if in_simd_region:
    // all done, reduce to global in simd lane 0
    return 1
  else if in_parallel_region:
    // done reducing to one value per warp, now reduce across warps
    return 3

// OMP runtime function; one for each basic type.
kmpc_reduce_block_double(double *a)
  if lane == 0:
    shared[wid] = *a
  named_barrier(1, num_threads)
  if wid == 0:
    block_reduce(shared)
  if lane == 0:
    *a = shared[0]
  named_barrier(1, num_threads)
  if wid == 0 and lane == 0:
    return 1 // write back reduced result
  else:
    return 0 // don't do anything
```

```
// Compiler generated code.
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 3:
   ret = block_reduce_double(reduceData.foo)
   if ret == 1:
     foo += reduceData.foo
   ret = block_reduce_double(reduceData.bar)
   if ret == 1:
     bar += reduceData.bar
```

**Notes**

1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
   functions. This functionality now works.
2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
   (including calls through function pointers) is optimized away.
3. If we are reducing multiple variables in a parallel region, the reduce
   operations are all performed in warpReduce_[ir]regular(). This results
   in more instructions in the loop but should result in fewer stalls due
   to data dependencies. Unfortunately we cannot do the same in
   kmpc_reduce_block_double() without increasing shared memory usage.