Requirements:

- automake, autoconf, libtool
  (not needed when compiling a release)
- pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config)
  (not needed when compiling a release using the included isl and pet)
- gmp (http://gmplib.org/)
- libyaml (http://pyyaml.org/wiki/LibYAML)
  (only needed if you want to compile the pet executable)
- LLVM/clang libraries, 2.9 or higher (http://clang.llvm.org/get_started.html)
  Unless you have some other reason for wanting to use the svn version,
  it is best to install the latest release (3.9).
  For more details, see pet/README.

If you are installing on Ubuntu, then you can install the following packages:

    automake autoconf libtool pkg-config libgmp3-dev libyaml-dev libclang-dev llvm

Note that you need at least version 3.2 of libclang-dev (Ubuntu raring).
Older versions of this package did not include the required libraries.
If you are using an older version of Ubuntu, then you need to compile and
install LLVM/clang from source.


Preparing:

Grab the latest release and extract it, or get the source from
the git repository as follows. The latter requires autoconf,
automake, libtool and pkg-config.

    git clone git://repo.or.cz/ppcg.git
    cd ppcg
    ./get_submodules.sh
    ./autogen.sh


Compilation:

    ./configure
    make
    make check

If you have installed any of the required libraries in a non-standard
location, then you may need to use the --with-gmp-prefix,
--with-libyaml-prefix and/or --with-clang-prefix options
when calling "./configure".


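For example, if LLVM/clang was installed under a non-standard prefix
(the path below is purely illustrative), the build could look as follows:

```shell
./configure --with-clang-prefix=/opt/llvm
make
make check
```

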
Using PPCG to generate CUDA or OpenCL code

To convert a fragment of a C program to CUDA, insert a line containing

    #pragma scop

before the fragment and add a line containing

    #pragma endscop

after the fragment. To generate CUDA code run

    ppcg --target=cuda file.c

where file.c is the file containing the fragment. The generated
code is stored in file_host.cu and file_kernel.cu.

To generate OpenCL code run

    ppcg --target=opencl file.c

where file.c is the file containing the fragment. The generated code
is stored in file_host.c and file_kernel.cl.


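For illustration, an annotated input file could look as follows; the
array size and the computation are made up for the example. Note that
a plain C compiler simply ignores the pragmas, so the annotated file
remains valid C.

```c
#define N 16

/* The loop between the pragmas is the fragment that PPCG would
   extract and map to the GPU. */
void squares(int A[N])
{
#pragma scop
    for (int i = 0; i < N; ++i)
        A[i] = i * i;
#pragma endscop
}
```

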
Specifying tile, grid and block sizes

The iteration space tile size, grid size and block size can
be specified using the --sizes option. The argument is a union map
in isl notation mapping kernels identified by their sequence number
in a "kernel" space to singleton sets in the "tile", "grid" and "block"
spaces. The sizes are specified outermost to innermost.

The dimension of the "tile" space indicates the (maximal) number of loop
dimensions to tile. The elements of the single integer tuple
specify the tile sizes in each dimension.
In case of hybrid tiling, the first element is half the size of
the tile in the time (sequential) dimension. The second element
specifies the number of elements in the base of the hexagon.
The remaining elements specify the tile sizes in the remaining space
dimensions.

The dimension of the "grid" space indicates the (maximal) number of block
dimensions in the grid. The elements of the single integer tuple
specify the number of blocks in each dimension.

The dimension of the "block" space indicates the (maximal) number of thread
dimensions in each block. The elements of the single integer tuple
specify the number of threads in each dimension.

For example,

    { kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }

specifies that in kernel 0, two loops should be tiled with a tile
size of 64 in both dimensions and that all kernels except kernel 4
should be run using a block of 16 threads.

Since PPCG performs some scheduling, it can be difficult to predict
what exactly will end up in a kernel. If you want to specify
tile, grid or block sizes, you may want to run PPCG first with the defaults,
examine the kernels and then run PPCG again with the desired sizes.
Instead of examining the kernels, you can also specify the option
--dump-sizes on the first run to obtain the effectively used default sizes.


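On the command line, the map has to be quoted so that the shell does
not interpret the braces. Reusing the sizes from the example above:

```shell
ppcg --target=cuda \
     --sizes='{ kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }' \
     file.c
```

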
Compiling the generated CUDA code with nvcc

To get optimal performance from nvcc, it is important to choose --arch
according to your target GPU. Specifically, use the flag "--arch sm_20"
for Fermi, "--arch sm_30" for GK10x Kepler and "--arch sm_35" for
GK110 Kepler. We discourage the use of older cards as we have seen
correctness issues with compilation for older architectures.
Note that in the absence of any --arch flag, nvcc defaults to
"--arch sm_13". This will not only be slower, but can also cause
correctness issues.
If you want to obtain results that are identical to those obtained
by the original code, then you may need to disable some optimizations
by passing the "--fmad=false" option.


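For a GK110 Kepler card, for example, the two generated files could be
compiled and linked together as follows (the output name is arbitrary):

```shell
nvcc --arch sm_35 file_host.cu file_kernel.cu -o file
```

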
Compiling the generated OpenCL code with gcc

To compile the host code you need to link against the file
ocl_utilities.c which contains utility functions used by the generated
OpenCL host code. To compile the host code with gcc, run

    gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL

Note that we have experienced the generated OpenCL code freezing
on some inputs (e.g., the PolyBench symm benchmark) when using
at least some version of the Nvidia OpenCL library, while the
corresponding CUDA code runs fine.
We have experienced no such freezes when using AMD, ARM or Intel
OpenCL libraries.

By default, the compiled executable will need the _kernel.cl file at
run time. Alternatively, the option --opencl-embed-kernel-code may be
given to place the kernel code in a string literal. The kernel code is
then compiled into the host binary, such that the _kernel.cl file is no
longer needed at run time. Any kernel include files, in particular
those supplied using --opencl-include-file, will still be required at
run time.


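To obtain an executable that does not depend on file_kernel.cl at run
time, the two steps could for example be combined as follows:

```shell
ppcg --target=opencl --opencl-embed-kernel-code file.c
gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL
```

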
Function calls

Function calls inside the analyzed fragment are reproduced
in the CUDA or OpenCL code, but for now it is left to the user
to make sure that the functions that are being called are
available from the generated kernels.

In the case of OpenCL code, the --opencl-include-file option
may be used to specify one or more files to be #include'd
from the generated code. These files may then contain
the definitions of the functions being called from the
program fragment. If the pathnames of the included files
are relative to the current directory, then you may need
to additionally specify the --opencl-compiler-options=-I.
option to make sure that the files can be found by the OpenCL compiler.
The included files may contain definitions of types used by the
generated kernels. By default, PPCG generates definitions for
types as needed, but these definitions may collide with those in
the included files, as PPCG does not consider the contents of the
included files. The --no-opencl-print-kernel-types option will prevent
PPCG from generating type definitions.


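For example, assuming the called functions are defined in a
(hypothetical) file maths.cl in the current directory:

```shell
ppcg --target=opencl --opencl-include-file=maths.cl \
     --opencl-compiler-options=-I. file.c
```

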
GNU extensions

By default, PPCG may print out macro definitions that involve
GNU extensions such as __typeof__ and statement expressions.
Some compilers may not support these extensions.
In particular, OpenCL 1.2 beignet 1.1.1 (git-6de6918)
has been reported not to support __typeof__.
The use of these extensions can be turned off with the
--no-allow-gnu-extensions option.


Processing PolyBench

When processing a PolyBench/C 3.2 benchmark, you should always specify
-DPOLYBENCH_USE_C99_PROTO on the ppcg command line. Otherwise, the source
files are inconsistent, having fixed-size arrays but parametrically
bounded loops iterating over them.
However, you should not specify this define when compiling
the PPCG generated code using nvcc since CUDA does not support VLAs.


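A typical invocation might look as follows; the include path and the
location of the benchmark source depend on your PolyBench checkout:

```shell
ppcg --target=cuda -DPOLYBENCH_USE_C99_PROTO \
     -I utilities linear-algebra/kernels/gemm/gemm.c
```

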
CUDA and function overloading

While CUDA supports function overloading based on the argument types,
no such function overloading exists in the input language C. Since PPCG
simply prints out the same function name as in the original code, this
may result in a different function being called based on the types
of the arguments. For example, if the original code contains a call
to the function sqrt() with a float argument, then the argument will
be promoted to a double and the sqrt() function will be called.
In the transformed (CUDA) code, however, overloading will cause the
function sqrtf() to be called. Until this issue has been resolved in PPCG,
we recommend that users either explicitly call the function sqrtf() or
explicitly cast the argument to double in the input code.


Contact

For bug reports, feature requests and questions,
contact http://groups.google.com/group/isl-development

Whenever you report a bug, please mention the exact version of PPCG
that you are using (output of "./ppcg --version"). If you are unable
to compile PPCG, then report the git version (output of "git describe")
or the version number included in the name of the tarball.


Citing PPCG

If you use PPCG for your research, you are invited to cite
the following paper.

@article{Verdoolaege2013PPCG,
    author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
              G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
              Catthoor, Francky},
    title = {Polyhedral parallel code generation for CUDA},
    journal = {ACM Trans. Archit. Code Optim.},
    issue_date = {January 2013},
    volume = {9},
    number = {4},
    month = jan,
    year = {2013},
    issn = {1544-3566},
    pages = {54:1--54:23},
    doi = {10.1145/2400682.2400713},
    acmid = {2400713},
    publisher = {ACM},
    address = {New York, NY, USA},
}