**Design document for OpenMP reductions on the GPU**

//Abstract: //In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises
* a succinct background review,
* an introduction to the decoupling of the reduction algorithm from the
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used, and
* a brief overview of the steps we have taken beyond the last implementation.

**Problem Review**

Consider a typical OpenMP program with a reduction pragma:

```
double foo, bar;
#pragma omp parallel for reduction(+:foo, bar)
for (int i = 0; i < N; i++) {
  foo += A[i]; bar += B[i];
}
```
where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to aggregate the values of foo and bar efficiently and in
a manner that
* keeps the compiler-generated code logically concise, and
* reduces efficiently within threads, warps, and blocks, and across the device.

**Introduction to Decoupling**

In this section we address the problem of keeping the compiler
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was tightly coupled to the specifics of
the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple them. As a result, in the final
implementation we were able to remove all template functions from our runtime
system.

The (simplified) pseudo code generated by LLVM is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
         interWarpCpyFn)
     where:
       struct ReduceData {
         double *foo;
         double *bar;
       } reduceData
       reduceData.foo = &foo_p
       reduceData.bar = &bar_p

     shuffleReduceFn and interWarpCpyFn are two auxiliary functions
     generated by the compiler to help the runtime perform the
     algorithmic steps while remaining agnostic about the layout of
     ReduceData.

     In particular, shuffleReduceFn is a function that takes the
     following inputs:
     a. a local copy of ReduceData
     b. its lane_id
     c. the offset, relative to the current lane, of the lane which
        hosts a remote ReduceData
     d. an algorithm-version parameter determining which reduction
        algorithm to use.
     shuffleReduceFn retrieves the remote ReduceData through shuffle
     intrinsics, reduces the local ReduceData with the remote ReduceData
     element-wise using the algorithm specified by the 4th parameter,
     and places the resulting values into the local ReduceData.

     Different reduction algorithms are implemented in different runtime
     functions, but they all call this same shuffleReduceFn to perform
     the essential reduction step. Therefore, based on the 4th parameter,
     shuffleReduceFn behaves slightly differently so as to cooperate with
     the runtime function and ensure correctness under different
     circumstances.

     interWarpCpyFn, as the name suggests, is a function that copies data
     across warps. Its job is to tunnel all the thread-private ReduceData
     that has already been reduced within a warp to a lane in the first
     warp, with a minimal shared memory footprint. This is an essential
     step in preparing for the final step of a block reduction.

     (Warp, block, and device level reduction routines that utilize these
     auxiliary functions are discussed in the next section.)

4. if ret == 1:
     The master thread stores the reduced result in the globals.
     foo += reduceData.foo; bar += reduceData.bar
```
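
To make the decoupling concrete, below is a minimal CUDA sketch of what a
compiler-generated shuffleReduceFn could look like for the foo/bar example.
The shuffle_down_double helper and the exact control flow are illustrative
assumptions, not the code clang actually emits.

```
// Illustrative sketch only; not the code clang actually emits.
struct ReduceData {
  double *foo;
  double *bar;
};

// Fetch the value held by the lane 'offset' positions above the caller.
__device__ double shuffle_down_double(double val, int offset) {
  return __shfl_down_sync(__activemask(), val, offset);
}

// Per-element reduction step; 'algo' selects the behavior expected by the
// runtime: 0 = full warp, 1 = contiguous partial warp, 2 = dispersed lanes.
__device__ void shuffleReduceFn(ReduceData *rd, int lane_id, int offset,
                                int algo) {
  double remote_foo = shuffle_down_double(*rd->foo, offset);
  double remote_bar = shuffle_down_double(*rd->bar, offset);
  bool combine;
  if (algo == 0)
    combine = true;                                // always reduce
  else if (algo == 1)
    combine = lane_id < offset;                    // lower half reduces
  else
    combine = (lane_id % 2 == 0) && (offset > 0);  // even logical lanes reduce
  if (combine) {
    *rd->foo += remote_foo;                        // '@' is '+' in this example
    *rd->bar += remote_bar;
  } else {
    *rd->foo = remote_foo;                         // forward the remote copy
    *rd->bar = remote_bar;
  }
}
```
The runtime only ever sees a void pointer to the reduce data plus this
function pointer, which is what allows it to remain free of templates.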

**Reduction Algorithms**

On the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

```
gpu_regular_warp_reduce(void *reduce_data,
                        kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
    ShuffleReduceFn(reduce_data, 0, offset, 0);
}
```
ShuffleReduceFn is called here with lane_id set to 0 because the lane id is
not needed in this version; we thereby save the instructions that would
otherwise retrieve it from the corresponding special register. The 4th
parameter, which represents the version of the algorithm being used, is set
to 0 to signify full warp reduction.

In this version (=0), ShuffleReduceFn behaves, per element, as follows:

```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
reduce_elem = reduce_elem @ remote_elem;
```

An illustration of this algorithm operating on a hypothetical 8-lane full
warp would be:
{F74}
The coloring invariant is that elements with the same color will be combined
and reduced in the next reduction step. As can be observed, no overhead is
present and exactly log2(N) steps are needed.
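
For reference, the same shuffle-down pattern on a single double, written as
self-contained CUDA (assuming all 32 lanes are active and '@' is '+'):

```
// Reduce one double across a full 32-lane warp; after the loop, lane 0
// holds the warp-wide sum. Exactly log2(32) = 5 shuffle steps are needed.
__device__ double full_warp_sum(double reduce_elem) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    reduce_elem += __shfl_down_sync(0xffffffffu, reduce_elem, offset);
  return reduce_elem;
}
```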

2. Contiguous Partial Warp Reduction
```
gpu_irregular_warp_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn, int size,
                          int lane_id) {
  int curr_size;
  int offset;
  curr_size = size;
  offset = curr_size/2;
  while (offset > 0) {
    ShuffleReduceFn(reduce_data, lane_id, offset, 1);
    curr_size = (curr_size+1)/2;
    offset = curr_size/2;
  }
}
```

In this version (=1), ShuffleReduceFn behaves, per element, as follows:
```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id < offset) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```

An important invariant (also a restriction on the starting state of the
reduction) is that this algorithm assumes that all ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp starting from
lane 0.

When there is a trailing active lane whose value has no partner in a given
step (i.e., the number of participating lanes is odd), that value cannot be
aggregated with any other lane in that step. Therefore, in order to preserve
the invariant, such ReduceData is copied to the first lane whose thread-local
ReduceData has already been consumed in a previous reduction step and would
therefore be unused otherwise.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp would be:
{F75}

As illustrated, this version of the algorithm introduces overhead whenever
there is an odd number of participating lanes in a reduction step, because
data must be copied between lanes.
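
A self-contained CUDA sketch of the same scheme for a single double may help;
it assumes lanes 0..size-1 hold valid data and that '@' is '+':

```
// Reduce one double across the first 'size' lanes of a warp; lane 0 ends
// up with the result. Reads from lanes without valid data are harmless
// because their results never propagate back to lane 0.
__device__ double contiguous_partial_warp_sum(double reduce_elem, int size,
                                              int lane_id) {
  int curr_size = size;
  int offset = curr_size / 2;
  while (offset > 0) {
    double remote = __shfl_down_sync(__activemask(), reduce_elem, offset);
    if (lane_id < offset)
      reduce_elem += remote;   // combine with the lane 'offset' above us
    else
      reduce_elem = remote;    // copy data down to preserve contiguity
    curr_size = (curr_size + 1) / 2;
    offset = curr_size / 2;
  }
  return reduce_elem;
}
```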

3. Dispersed Partial Warp Reduction
```
gpu_irregular_simt_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  int size, remote_id;
  int logical_lane_id = find_number_of_dispersed_active_lanes_before_me() * 2;
  do {
    remote_id = find_the_next_active_lane_id_right_after_me();
    // the above function returns 0 if no active lane
    // is present right after the current thread.
    size = get_number_of_active_lanes_in_this_warp();
    logical_lane_id /= 2;
    ShuffleReduceFn(reduce_data, logical_lane_id, remote_id-1-threadIdx.x, 2);
  } while (logical_lane_id % 2 == 0 && size > 1);
}
```

No assumption is made about the initial state of the reduction.
Any number of lanes (>= 1) could be active at any position. The reduction
result is kept in the first active lane.

In this version (=2), ShuffleReduceFn behaves, per element, as follows:
```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id % 2 == 0 && offset > 0) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```
We now briefly explain some of the arguments passed in. Note that this
section introduces the concept of a //logical_lane_id//, which must be
distinguished from the physical lane id defined by NVIDIA.
1. //logical_lane_id//: as the name suggests, this is a calculated lane id
   (instead of the physical one defined by NVIDIA) that keeps our algorithm
   logically concise. A thread with logical_lane_id k means there are (k-1)
   threads before it.
2. //remote_id-1-threadIdx.x//: remote_id is the NVIDIA-defined lane id of
   the remote lane from which we will retrieve the ReduceData. We subtract
   (threadIdx.x + 1) from it because we would like to maintain only one
   underlying shuffle intrinsic (which is used to communicate among the lanes
   in a warp). The particular shuffle intrinsic we use accepts only offsets
   relative to the current lane, not absolute lane ids; the subtraction
   therefore converts the calculated absolute lane id into an offset.

This algorithm differs in two ways from the previous ones and is not,
conceptually, a generalization of them.
1. It reduces elements that are close to each other. For instance, the values
   in the 0th lane are combined with those of the 1st lane, and the values in
   the 2nd lane are combined with those of the 3rd lane. We did not use the
   previous scheme, in which the first half of the (partial) warp is reduced
   with the second half, because the mapping
     f(x): logical_lane_id -> physical_lane_id
   can be easily calculated whereas its inverse
     f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the active
   lanes, we do not need to perform the copying step required by the second
   algorithm.
An illustrative run would look like
{F76}
As observed, the overhead is high because in each and every reduction step
both logical_lane_id and remote_id must be recalculated.
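
The helper functions named in the pseudo code can be expressed with standard
CUDA warp intrinsics. The following is a sketch under our own naming (these
are not the runtime's functions); active_mask would typically come from
__activemask() or a __ballot of the participating lanes:

```
// Illustrative helpers for the dispersed-lane case.
__device__ unsigned my_lane_id() { return threadIdx.x % warpSize; }

// Number of active lanes in this warp with a lower lane id than ours.
__device__ int dispersed_active_lanes_before_me(unsigned active_mask) {
  unsigned lower_lanes = (1u << my_lane_id()) - 1u;
  return __popc(active_mask & lower_lanes);
}

// Lane id of the next active lane above ours, or 0 if there is none
// (matching the convention used by the pseudo code above).
__device__ int next_active_lane_after_me(unsigned active_mask) {
  unsigned higher_lanes = active_mask & ~((2u << my_lane_id()) - 1u);
  return higher_lanes ? (__ffs(higher_lanes) - 1) : 0;
}
```
With these, logical_lane_id starts as 2 * dispersed_active_lanes_before_me(mask)
and remote_id is next_active_lane_after_me(mask), as in the pseudo code.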

At the block level, we have implemented the following block reduce algorithm:

```
gpu_irregular_block_reduce(void *reduce_data,
                           kmp_ShuffleReductFctPtr shuflReduceFn,
                           kmp_InterWarpCopyFctPtr interWarpCpyFn,
                           int size) {

  int wid = threadIdx.x/WARPSIZE;
  int lane_id = threadIdx.x%WARPSIZE;

  int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division

  unsigned tnum = __ballot(1);
  int thread_num = __popc(tnum);

  //full warp reduction
  if (thread_num == WARPSIZE) {
    gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
  }
  //partial warp reduction
  if (thread_num < WARPSIZE) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                              lane_id);
  }
  //Gather all the reduced values from each warp
  //to the first warp.
  //A named_barrier inside this function ensures
  //correctness. It is effectively a sync_threads
  //that won't deadlock.
  interWarpCpyFn(reduce_data, warp_needed);

  //This reduces the data gathered from each "warp master".
  if (wid == 0) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                              lane_id);
  }

  return;
}
```
In this function, ShuffleReduceFn is never called directly; instead the
function calls the various warp-reduction routines, which in turn use it. It
first reduces ReduceData warp by warp; at that point the number of remaining
ReduceData equals the number of warps present in the thread block. It then
gathers all of these ReduceData into the first warp.

As observed, this algorithm makes use of interWarpCpyFn, which copies data
from each "warp master" (the 0th lane of each warp, where the warp-reduced
ReduceData is held) to the 0th warp. This step reduces (in a mathematical
sense) the problem of reduction across the warp masters in a block to a warp
reduction, which we already know how to solve.

We can thus completely avoid the use of atomics to reduce within a
threadblock.
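
To make the inter-warp step concrete, here is a minimal CUDA sketch of an
interWarpCpyFn for the ReduceData layout sketched earlier. It substitutes
__syncthreads() for the runtime's named barrier, which is only valid when
every thread of the block reaches it; the real runtime uses a named barrier
precisely so that partial teams do not deadlock.

```
#define MAX_WARPS 32  // 1024 threads per block / 32 lanes per warp

// Copy each warp master's warp-reduced values into the first warp so that
// the final step is again a (possibly partial) warp reduction.
__device__ void interWarpCpyFn(ReduceData *rd, int warp_needed) {
  __shared__ double xfer_foo[MAX_WARPS];
  __shared__ double xfer_bar[MAX_WARPS];
  int wid = threadIdx.x / warpSize;
  int lane_id = threadIdx.x % warpSize;

  if (lane_id == 0) {            // warp masters publish their values
    xfer_foo[wid] = *rd->foo;
    xfer_bar[wid] = *rd->bar;
  }
  __syncthreads();               // stand-in for the runtime's named barrier

  if (wid == 0 && lane_id < warp_needed) {  // first warp picks them up
    *rd->foo = xfer_foo[lane_id];
    *rd->bar = xfer_bar[lane_id];
  }
  __syncthreads();
}
```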

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). Therefore, the tuple (s, id)
uniquely identifies an active threadblock on the GPU.

To efficiently implement cross block reduce, we first allocate, for each value
to be reduced, an array of size S*N (which is the maximum number of active
threadblocks at any time on the device).

Each threadblock reduces its value into slot [s][id]. This can be done without
locking since no other threadblock can write to the same slot concurrently.
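
A sketch of the slot computation: the SM number can be read from the %smid
PTX special register, while the way the per-SM instance id is assigned is
left abstract here (get_instance_id below is hypothetical).

```
// Compute this threadblock's private slot in the S*N scratch array.
__device__ int get_sm_id() {          // 0 <= s < S
  unsigned smid;
  asm("mov.u32 %0, %%smid;" : "=r"(smid));
  return (int)smid;
}

__device__ int get_instance_id();     // hypothetical: 0 <= id < N on this SM

__device__ double *reduction_slot(double *scratch, int N) {
  return &scratch[get_sm_id() * N + get_instance_id()];
}
```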

As a final stage, we reduce the values in the array as follows:

```
// Compiler generated wrapper function for each target region with a
// reduction clause.
target_function_wrapper(map_args, reduction_array)  <--- start with 1 team
                                                         and 1 thread.
  // Use dynamic parallelism to launch M teams, N threads as requested by
  // the user to execute the target region.
  target_function<<<M, N>>>(map_args)

  Reduce values in reduction_array
```
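
The final-stage reduction itself can be a simple serial pass over the S*N
slots performed by the single wrapper thread, assuming unused slots were
initialized to the identity of the operation (0 for '+'); a sketch:

```
// Fold the scratch array into the reduction variable; executed by the one
// thread of the wrapper after the child grid has completed.
__device__ void reduce_scratch(const double *scratch, int S, int N,
                               double *foo) {
  for (int i = 0; i < S * N; ++i)
    *foo += scratch[i];
}
```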

**Comparison with Last Version**

The (simplified) pseudo code generated by LLVM on the host is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
     where:
       struct ReduceData {
         double *foo;
         double *bar;
       } reduceData
       reduceData.foo = &foo_p
       reduceData.bar = &bar_p

     reduceFn is a pointer to a function that takes in two inputs
     of type ReduceData, "reduces" them element wise, and places the
     result in the first input:
       reduceFn(ReduceData *a, ReduceData *b)
         a = a @ b

     Every thread in the parallel region calls kmpc_reduce_nowait with
     its private copy of reduceData. The runtime reduces across the
     threads (using tree reduction on the operator 'reduceFn') and stores
     the final result in the master thread if successful.
4. if ret == 1:
     The master thread stores the reduced result in the globals.
     foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
     In this case kmpc_reduce_nowait() could not use tree reduction,
     so use atomics instead:
       each thread atomically writes to foo
       each thread atomically writes to bar
```

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the generated code are as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
     status = can_block_reduce()
     if status == 1:
       reduce efficiently to thread 0 using shuffles and shared memory.
       return 1
     else
       cannot use efficient block reduction, fallback to atomics
       return 2
4. if ret == 1:
     The master thread stores the reduced result in the globals.
     foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
     In this case kmpc_reduce_nowait() could not use tree reduction,
     so use atomics instead:
       each thread atomically writes to foo
       each thread atomically writes to bar
```

The function can_block_reduce() is defined as follows:

```
int32_t can_block_reduce() {
  int tid = GetThreadIdInTeam();
  int nt = GetNumberOfOmpThreads(tid);
  if (nt != blockDim.x)
    return 0;
  unsigned tnum = __ballot(1);
  if (tnum != (~0x0)) {
    return 0;
  }
  return 1;
}
```

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (return 1) only if (a) all SIMT threads in
a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region
(set by the num_threads clause) equals blockDim.x.

If either of these preconditions is not true, each thread in the threadblock
updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs and we must avoid them completely.
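
For contrast, the atomic fallback amounts to every thread issuing a global
read-modify-write per reduction variable, along the lines of this sketch
(atomicAdd on double requires sm_60 or newer):

```
// Sketch of the fallback being avoided: all threads of all blocks contend
// on the same two global locations.
__device__ void atomic_fallback(double *foo, double *bar,
                                double foo_p, double bar_p) {
  atomicAdd(foo, foo_p);
  atomicAdd(bar, bar_p);
}
```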

**Appendix: Implementation Details**

```
// Compiler generated function.
reduceFn(ReduceData *a, ReduceData *b)
  a->foo = a->foo + b->foo
  a->bar = a->bar + b->bar

// Compiler generated function.
swapAndReduceFn(ReduceData *thread_private, int lane)
  ReduceData *remote = new ReduceData()
  remote->foo = shuffle_double(thread_private->foo, lane)
  remote->bar = shuffle_double(thread_private->bar, lane)
  reduceFn(thread_private, remote)

// OMP runtime function.
warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
  offset = 16
  while (offset > 0)
    swapAndReduceFn(thread_private, offset)
    offset /= 2

// OMP runtime function.
warpReduce_irregular():
  ...

// OMP runtime function.
kmpc_reduce_warp(reduceData, swapAndReduceFn)
  if all_lanes_active:
    warpReduce_regular(reduceData, swapAndReduceFn)
  else:
    warpReduce_irregular(reduceData, swapAndReduceFn)
  if in_simd_region:
    // all done, reduce to global in simd lane 0
    return 1
  else if in_parallel_region:
    // done reducing to one value per warp, now reduce across warps
    return 3

// OMP runtime function; one for each basic type.
kmpc_reduce_block_double(double *a)
  if lane == 0:
    shared[wid] = *a
  named_barrier(1, num_threads)
  if wid == 0
    block_reduce(shared)
  if lane == 0
    *a = shared[0]
  named_barrier(1, num_threads)
  if wid == 0 and lane == 0
    return 1  // write back reduced result
  else
    return 0  // don't do anything
```

```
// Compiler generated code.
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
4. if ret == 1:
     The master thread stores the reduced result in the globals.
     foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 3:
     ret = block_reduce_double(reduceData.foo)
     if ret == 1:
       foo += reduceData.foo
     ret = block_reduce_double(reduceData.bar)
     if ret == 1:
       bar += reduceData.bar
```

**Notes**

1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
   functions. This functionality now works.
2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
   (including the calls through function pointers) is optimized away.
3. If we are reducing multiple variables in a parallel region, the reduce
   operations are all performed in warpReduce_[ir]regular(). This places more
   instructions in the loop body and should result in fewer stalls due to
   data dependencies. Unfortunately we cannot do the same in
   kmpc_reduce_block_double() without increasing shared memory usage.