src/third_party/llvm-project/polly/www/documentation/gpgpucodegen.html - cobalt - Git at Google

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
         "http://www.w3.org/TR/html4/strict.dtd">
 <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
 <html>
 <head>
   <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
   <title>Polly - GPGPU Code Generation</title>
   <link type="text/css" rel="stylesheet" href="../menu.css">
   <link type="text/css" rel="stylesheet" href="../content.css">
 </head>
 <body>
 <div id="box">
 <!--#include virtual="../menu.html.incl"-->
 <div id="content">
   <!--*********************************************************************-->
   <h1>Polly - GPGPU Code Generation</h1>
   <!--*********************************************************************-->
 <p><em>WARNING: This project was part of the Google Summer of Code 2012.
 It is currently not finished, but it is in the design and implementation stage.
 The ideas/plans described here may not yet be implemented in Polly and may
 change later on.</em></p>

 This project adds GPGPU code generation feature to Polly.

 <h2>Objective</h2>
 <p>The overall objective of this GSoC project is to create a preliminary
    implementation of GPGPU code generation for Polly. With this addition, users
    can parallelize some perfectly nested loops with Polly to execute on a
    heterogeneous platform, composed of CPU and GPU.</p>
 <p>There are several successful projects about automatic source-to-source gpu
    code transformation. C-to-CUDA[1] uses the standard Pluto algorithms for
    computing an affine schedule and then applies a wavefront transformation to
    obtain one sequential and n-1 parallel loops. The parallel loops are then
    mapped onto the blocks and threads of GPU. PPCG[2] introduces some advanced
    algorithms which can expose much more parallelism than other methods . And It
    also introduces affine partition heuristics and code generation algorithms
    for locality enhancement in the registers and shared memory.</p>
 <p>Since automatic GPGPU code generation is quite a complex problem and what we
    target is a low-level intermediate representation, LLVM IR, rather than a
    high-level language source, it is important for us to set a proper objective
    as a start step to give a complete solution to GPGPU code generation for LLVM
    IR.</p>
 <p>Firstly, we plan to target two kinds of relatively simple test cases. One is
    comprised of pure parallel and perfectly nested loops, like the following
    code.</p>
 <pre>
 parfor(int i=0 to M)
   parfor(int j=0 to N)
     LoopBody(i, j);
 </pre>
 <p>Another one is that all the loops in it are parallel except the inner-most
    one, just like this:</p>
 <pre>
 parfor(int i=0 to M)
   parfor(int j=0 to N)
     non-parfor(int k=0 to K)
       LoopBody(i, j, k);
 </pre>
 <p>The LoopBody part should be limited to instructions or functions calls
    (intrinsics) which can be handled by LLVM's NVPTX backend.</p>
 <p>On the other hand, we focus on building a preliminary and scalable framework
    of GPGPU code generation for polly. Thus we plan to employ relatively simple
    tiling and mapping algorithms and optimize them later.</p>
 <h2>Work Flow</h2>
 <h3>GPGPU Code Generation In General</h3>
 <p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU
    code generation problem.</p>
 <li>Look for parallel loops.</li>
 <li>Create a polyhedral model from the loops.</li>
 <li>Tile and map the loops to GPU blocks and threads.</li>
 <li>Determine where to place the data.</li>
 <h3>What has been done in Polly</h3>
 <p>Polly has implemented the 1st, 2nd and part of the 3rd of the above steps and
    many other analysis and transformation passes.</p>
 <h3>What to do in Polly</h3>
 <p>Unlike many source-to-source optimizers such as C-to-CUDA and PPCG, Polly is
    a low-level optimizer, which means we can't use a source-level compiler
    (e.g. NVCC) to generate the final assembly for the device. We need manually
    insert device driver API calls to execute the generated kernel assembly
    text.</p>
 <p>In this project, we assume that the device driver library has provided an
    interface to launch kernels in the form of assembly text. Fortunately, most
    of the mainstream GPU vendors provide such a feature in thier products (see
    ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we
    are going to do in Polly is:</p>
 <li>Find a way to tile the parallel loops.</li>
 <li>Find a way to extract the loop body and transform it into thread-centric
     parallel code.</li>
 <li>Find a way to store/load the thread-centric code into/from a device module.
 <li>Find a way to pass the target machine information and generate code of the
     device module for the target.
 <li>Find a way to map the tiled loop to GPU blocks and threads.</li>
 <li>Find a way to insert CUDA synchronization operations on-demand.
 <li>Find a way to generate the memory copy operations between a host and a
     device.</li>
 <li>Implement/Wrap a runtime library to serve as the execution engine for the
     generated device code.</li>

 <h3>The Work Flow</h3>
 <p>In this section, we assume that the host cpu is X86 and the device is NVIDIA
    CUDA-compatible. we will use the following test case to describe our work
    flow.</p>
 <pre>
 for(i = 0; i &lt; 128; i++)
       for(j = 0; j &lt; 128; j++)
               A[i][j] = i*128 + j;
 </pre>
 <p>The work flow of our code generator is as follows.</p>
 <p>1.We first use Polly's jscop file importer to get a wanted 4-level parallel
    tiled code.</p>
 The "schedule" part of the pre-optimization jscop file is as the following:
 <pre>
 "schedule" : "{ Stmt_for_body3[i0, i1] -&gt; schedule[0, i0, 0, i1, 0] }"
 </pre>
 The jscop file describing the tiling transformation is:
 <pre>
 "schedule" : "{ Stmt_for_body3[i0, i1] -&gt; schedule[0, o0, o1, o2, o3]:
               o0 &gt;= 0 and o0 &lt;= 7 and o1 &gt;= 0 and o1 &lt;= 15 and
               o2 &gt;= 0 and o2 &lt;= 7 and o3 &gt;= 0 and o3 &lt;= 15 and
               i0 = 16o0 + o1 and i1 = 16o2 + o3 }"
 </pre>
 We can test the schedule with the following command line.
 <pre>
 opt -load /path/to/polly/build/LLVMPolly.so -basicaa -polly-import-jscop
     -polly-ast -analyze -q ./test.ll
     -polly-import-jscop-postfix=transformed+gpu
 </pre>
 The output of this schedule is:
 <pre>
 for (c2=0;c2&lt;=7;c2++) {
   for (c3=0;c3&lt;=15;c3++) {
     for (c4=0;c4&lt;=7;c4++) {
       for (c5=0;c5&lt;=15;c5++) {
         Stmt_for_body3(16*c2+c3,16*c4+c5);
       }
     }
   }
 }
 </pre>
 Now we get a 4-dimensional parallel loops with a single SCoP statement in it.
 <p>2.We then extract the loop body (or the inner-most non-parallel loop) into a
    LLVM function, tagging it with PTX_Kernel call convention.</p>
 <p>3.We extract the PTX_kernel function into a temporary module, set the target
    triple (e.g. nvptx64-unknown-linux) for the module, transform the temporary
    module into a string, store it in the original module and erase the
    PTX_kernel function.</p>
 <p>4.We replace the loops with their GPGPU counterpart. The GPGPU part of code
    is composed of a call to the llvm.codegen intrinsic and function calls to our
    GPU runtime library.</p>
 <p>5.Finally, we generate the executable program with <em>llc</em> or run the
    optimized LLVM IRs with a JIT compiler like  <em>lli</em>.</p>
 <h2>Usage</h2>
 <p>1. Apply the llvm.codegen intrinsic patch to LLVM code base.</p>
 <pre>cd /path/to/llvm/source
 git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre>
 <p>2. Build the test case.</p>
 <pre>/path/to/polly/source/test/create_ll.sh test.c</pre>
 <p>3. Get and edit the jscop file (take function "gpu_codegen" as an example).
 </p>
 <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
     -polly-export-jscop ./test.ll
 cp gpu_codegen___%for.cond---%for.end8.jscop
    gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu
 vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre>
 <p><em>(Please refer to section "The Work Flow" on how to edit the "schedule"
        part of a statement)</em></p>
 <p>4. Optimize the code with GPGPU code generation.</p>
 <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
     -polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu
     -polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S
     -o test.gpued.ll</pre>
 <p>5. Build the final assembly and executable.</p>
 <pre>llc test.gpued.ll -o test.s
 gcc test.s -lGPURuntime -o test</pre>
 <p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that
         /path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p>
 <h2>TODO List</h2>

 <table class="wikitable" cellpadding="2">
 <tbody>
 <tr style="background: rgb(239, 239, 239)">
   <th width="400px"> Tasks</th>
   <th width="150px"> Status </th>
   <th> Owner </th>
 </tr>
 <tr>
 <th align="left">Tiling the Parallel Loops with An External Jscop File</th>
 <td align="center" class='open'>Open, In Design</td>
 <td>Yabin Hu</td>
 </tr>
 <tr>
 <th align="left">GPU Runtime Library Implementation</th>
 <td align="center" class='inprogress'>Coding Finished, In Reviewing</td>
 <td></td>
 </tr>
 <tr>
 <th align="left">llvm.codegen Intrinsic Implementation</th>
 <td align="center" class='inprogress'>Codeing Finished, To Be Reviewed</td>
 <td></td>
 </tr>
 <tr>
 <th align="left">Code Generation For Host</th>
 <td align="center" class='inprogress'>50% Done</td>
 <td></td>
 </tr>

 </tbody></table>

 <h2>References</h2>
 <li type="1" value="1">
 <em>Automatic C-to-CUDA Code Generation for Affine Programs. </em><br />
     Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<br />
     International Conference on Compiler Construction (CC) 2010.<br />
 </li>
 <li type="1"><em>PPCG Project</em><br />
 <a href="http://freecode.com/projects/ppcg">http://freecode.com/projects/ppcg
 </a></li>
 <li type="1">
 <em>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the
     Answer. </em><br />
   Chris Gregg and Kim Hazelwood<br />
   International Symposium on Performance Analysis of Systems and Software
   (ISPASS) 2011.
 </li>
 <p></p>
 </div>
 </div>
 </body>
 </html>
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
	"http://www.w3.org/TR/html4/strict.dtd">
	<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
	<html>
	<head>
	<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
	<title>Polly - GPGPU Code Generation</title>
	<link type="text/css" rel="stylesheet" href="../menu.css">
	<link type="text/css" rel="stylesheet" href="../content.css">
	</head>
	<body>
	<div id="box">
	<!--#include virtual="../menu.html.incl"-->
	<div id="content">
	<!--*********************************************************************-->
	<h1>Polly - GPGPU Code Generation</h1>
	<!--*********************************************************************-->
	<p><em>WARNING: This project was part of the Google Summer of Code 2012.
	It is currently not finished, but it is in the design and implementation stage.
	The ideas/plans described here may not yet be implemented in Polly and may
	change later on.</em></p>

	This project adds GPGPU code generation feature to Polly.

	<h2>Objective</h2>
	<p>The overall objective of this GSoC project is to create a preliminary
	implementation of GPGPU code generation for Polly. With this addition, users
	can parallelize some perfectly nested loops with Polly to execute on a
	heterogeneous platform, composed of CPU and GPU.</p>
	<p>There are several successful projects about automatic source-to-source gpu
	code transformation. C-to-CUDA[1] uses the standard Pluto algorithms for
	computing an affine schedule and then applies a wavefront transformation to
	obtain one sequential and n-1 parallel loops. The parallel loops are then
	mapped onto the blocks and threads of GPU. PPCG[2] introduces some advanced
	algorithms which can expose much more parallelism than other methods . And It
	also introduces affine partition heuristics and code generation algorithms
	for locality enhancement in the registers and shared memory.</p>
	<p>Since automatic GPGPU code generation is quite a complex problem and what we
	target is a low-level intermediate representation, LLVM IR, rather than a
	high-level language source, it is important for us to set a proper objective
	as a start step to give a complete solution to GPGPU code generation for LLVM
	IR.</p>
	<p>Firstly, we plan to target two kinds of relatively simple test cases. One is
	comprised of pure parallel and perfectly nested loops, like the following
	code.</p>
	<pre>
	parfor(int i=0 to M)
	parfor(int j=0 to N)
	LoopBody(i, j);
	</pre>
	<p>Another one is that all the loops in it are parallel except the inner-most
	one, just like this:</p>
	<pre>
	parfor(int i=0 to M)
	parfor(int j=0 to N)
	non-parfor(int k=0 to K)
	LoopBody(i, j, k);
	</pre>
	<p>The LoopBody part should be limited to instructions or functions calls
	(intrinsics) which can be handled by LLVM's NVPTX backend.</p>
	<p>On the other hand, we focus on building a preliminary and scalable framework
	of GPGPU code generation for polly. Thus we plan to employ relatively simple
	tiling and mapping algorithms and optimize them later.</p>
	<h2>Work Flow</h2>
	<h3>GPGPU Code Generation In General</h3>
	<p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU
	code generation problem.</p>
	<li>Look for parallel loops.</li>
	<li>Create a polyhedral model from the loops.</li>
	<li>Tile and map the loops to GPU blocks and threads.</li>
	<li>Determine where to place the data.</li>
	<h3>What has been done in Polly</h3>
	<p>Polly has implemented the 1st, 2nd and part of the 3rd of the above steps and
	many other analysis and transformation passes.</p>
	<h3>What to do in Polly</h3>
	<p>Unlike many source-to-source optimizers such as C-to-CUDA and PPCG, Polly is
	a low-level optimizer, which means we can't use a source-level compiler
	(e.g. NVCC) to generate the final assembly for the device. We need manually
	insert device driver API calls to execute the generated kernel assembly
	text.</p>
	<p>In this project, we assume that the device driver library has provided an
	interface to launch kernels in the form of assembly text. Fortunately, most
	of the mainstream GPU vendors provide such a feature in thier products (see
	ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we
	are going to do in Polly is:</p>
	<li>Find a way to tile the parallel loops.</li>
	<li>Find a way to extract the loop body and transform it into thread-centric
	parallel code.</li>
	<li>Find a way to store/load the thread-centric code into/from a device module.
	<li>Find a way to pass the target machine information and generate code of the
	device module for the target.
	<li>Find a way to map the tiled loop to GPU blocks and threads.</li>
	<li>Find a way to insert CUDA synchronization operations on-demand.
	<li>Find a way to generate the memory copy operations between a host and a
	device.</li>
	<li>Implement/Wrap a runtime library to serve as the execution engine for the
	generated device code.</li>

	<h3>The Work Flow</h3>
	<p>In this section, we assume that the host cpu is X86 and the device is NVIDIA
	CUDA-compatible. we will use the following test case to describe our work
	flow.</p>
	<pre>
	for(i = 0; i < 128; i++)
	for(j = 0; j < 128; j++)
	A[i][j] = i*128 + j;
	</pre>
	<p>The work flow of our code generator is as follows.</p>
	<p>1.We first use Polly's jscop file importer to get a wanted 4-level parallel
	tiled code.</p>
	The "schedule" part of the pre-optimization jscop file is as the following:
	<pre>
	"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, i0, 0, i1, 0] }"
	</pre>
	The jscop file describing the tiling transformation is:
	<pre>
	"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, o0, o1, o2, o3]:
	o0 >= 0 and o0 <= 7 and o1 >= 0 and o1 <= 15 and
	o2 >= 0 and o2 <= 7 and o3 >= 0 and o3 <= 15 and
	i0 = 16o0 + o1 and i1 = 16o2 + o3 }"
	</pre>
	We can test the schedule with the following command line.
	<pre>
	opt -load /path/to/polly/build/LLVMPolly.so -basicaa -polly-import-jscop
	-polly-ast -analyze -q ./test.ll
	-polly-import-jscop-postfix=transformed+gpu
	</pre>
	The output of this schedule is:
	<pre>
	for (c2=0;c2<=7;c2++) {
	for (c3=0;c3<=15;c3++) {
	for (c4=0;c4<=7;c4++) {
	for (c5=0;c5<=15;c5++) {
	Stmt_for_body3(16c2+c3,16c4+c5);
	}
	}
	}
	}
	</pre>
	Now we get a 4-dimensional parallel loops with a single SCoP statement in it.
	<p>2.We then extract the loop body (or the inner-most non-parallel loop) into a
	LLVM function, tagging it with PTX_Kernel call convention.</p>
	<p>3.We extract the PTX_kernel function into a temporary module, set the target
	triple (e.g. nvptx64-unknown-linux) for the module, transform the temporary
	module into a string, store it in the original module and erase the
	PTX_kernel function.</p>
	<p>4.We replace the loops with their GPGPU counterpart. The GPGPU part of code
	is composed of a call to the llvm.codegen intrinsic and function calls to our
	GPU runtime library.</p>
	<p>5.Finally, we generate the executable program with <em>llc</em> or run the
	optimized LLVM IRs with a JIT compiler like <em>lli</em>.</p>
	<h2>Usage</h2>
	<p>1. Apply the llvm.codegen intrinsic patch to LLVM code base.</p>
	<pre>cd /path/to/llvm/source
	git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre>
	<p>2. Build the test case.</p>
	<pre>/path/to/polly/source/test/create_ll.sh test.c</pre>
	<p>3. Get and edit the jscop file (take function "gpu_codegen" as an example).
	</p>
	<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
	-polly-export-jscop ./test.ll
	cp gpu_codegen___%for.cond---%for.end8.jscop
	gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu
	vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre>
	<p><em>(Please refer to section "The Work Flow" on how to edit the "schedule"
	part of a statement)</em></p>
	<p>4. Optimize the code with GPGPU code generation.</p>
	<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa
	-polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu
	-polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S
	-o test.gpued.ll</pre>
	<p>5. Build the final assembly and executable.</p>
	<pre>llc test.gpued.ll -o test.s
	gcc test.s -lGPURuntime -o test</pre>
	<p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that
	/path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p>
	<h2>TODO List</h2>

	<table class="wikitable" cellpadding="2">
	<tbody>
	<tr style="background: rgb(239, 239, 239)">
	<th width="400px"> Tasks</th>
	<th width="150px"> Status </th>
	<th> Owner </th>
	</tr>
	<tr>
	<th align="left">Tiling the Parallel Loops with An External Jscop File</th>
	<td align="center" class='open'>Open, In Design</td>
	<td>Yabin Hu</td>
	</tr>
	<tr>
	<th align="left">GPU Runtime Library Implementation</th>
	<td align="center" class='inprogress'>Coding Finished, In Reviewing</td>
	<td></td>
	</tr>
	<tr>
	<th align="left">llvm.codegen Intrinsic Implementation</th>
	<td align="center" class='inprogress'>Codeing Finished, To Be Reviewed</td>
	<td></td>
	</tr>
	<tr>
	<th align="left">Code Generation For Host</th>
	<td align="center" class='inprogress'>50% Done</td>
	<td></td>
	</tr>

	</tbody></table>

	<h2>References</h2>
	<li type="1" value="1">
	<em>Automatic C-to-CUDA Code Generation for Affine Programs. </em><br />
	Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<br />
	International Conference on Compiler Construction (CC) 2010.<br />
	</li>
	<li type="1"><em>PPCG Project</em><br />
	<a href="http://freecode.com/projects/ppcg">http://freecode.com/projects/ppcg
	</a></li>
	<li type="1">
	<em>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the
	Answer. </em><br />
	Chris Gregg and Kim Hazelwood<br />
	International Symposium on Performance Analysis of Systems and Software
	(ISPASS) 2011.
	</li>
	<p></p>
	</div>
	</div>
	</body>
	</html>