src/third_party/llvm-project/polly/docs/Architecture.rst - cobalt - Git at Google

 ================
 The Architecture
 ================

 Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
 extracts interesting loop kernels. For each kernel a mathematical model is
 derived which precisely describes the individual computations and memory
 accesses in the kernels. Within Polly a variety of analysis and code
 transformations are performed on this mathematical model. After all
 optimizations have been derived and applied, optimized LLVM-IR is regenerated
 and inserted into the LLVM-IR module.

 .. image:: images/architecture.png
     :align: center

 Polly in the LLVM pass pipeline
 -------------------------------

 The standard LLVM pass pipeline as it is used in -O1/-O2/-O3 mode of clang/opt
 consists of a sequence of passes that can be grouped in different conceptual
 phases. The first phase, we call it here **Canonicalization**, is a scalar
 canonicalization phase that contains passes like -mem2reg, -instcombine,
 -cfgsimplify, or early loop unrolling. It has the goal of removing and
 simplifying the given IR as much as possible focusing mostly on scalar
 optimizations. The second phase consists of three conceptual groups that  are
 executed in the so-called **Inliner cycle**, This is again a set of **Scalar
 Simplification** passes, a set of **Simple Loop Optimizations**, and the
 **Inliner** itself. Even though these passes make up the majority of the LLVM
 pass pipeline, the primary goal of these passes is still canonicalization
 without loosing semantic information that complicates later analysis. As part of
 the inliner cycle, the LLVM inliner step-by-step tries to inline functions, runs
 canonicalization passes to exploit newly exposed simplification opportunities,
 and then tries to inline the further simplified functions. Some simple loop
 optimizations are executed as part of the inliner cycle. Even though they
 perform some optimizations, their primary goal is still the simplification of
 the program code. Loop invariant code motion is one such optimization that
 besides being beneficial for program performance also allows us to move
 computation out of loops and in the best case enables us to eliminate certain
 loops completely.  Only after the inliner cycle has been finished, a last
 **Target Specialization** phase is run, where IR complexity is deliberately
 increased to take advantage of target specific features that maximize the
 execution performance on the device we target. One of the principal
 optimizations in this phase is vectorization, but also target specific loop
 unrolling, or some loop transformations (e.g., distribution) that expose more
 vectorization opportunities.

 .. image:: images/LLVM-Passes-only.png
     :align: center

 Polly can conceptually be run at three different positions in the pass pipeline.
 As an early optimizer before the standard LLVM pass pipeline, as a later
 optimizer as part of the target specialization sequence, and theoretically also
 with the loop optimizations in the inliner cycle. We only discuss the first two
 options, as running Polly in the inline loop, is likely to disturb the inliner
 and is consequently not a good idea.

 .. image:: images/LLVM-Passes-all.png
     :align: center

 Running Polly early before the standard pass pipeline has the benefit that the
 LLVM-IR processed by Polly is still very close to the original input code.
 Hence, it is less likely that transformations applied by LLVM change the IR in
 ways not easily understandable for the programmer. As a result, user feedback is
 likely better and it is less likely that kernels that in C seem a perfect fit
 for Polly have been transformed such that Polly can not handle them any
 more. On the other hand, codes that require inlining to be optimized won't
 benefit if Polly is scheduled at this position. The additional set of
 canonicalization passes required will result in a small, but general compile
 time increase and some random run-time performance changes due to slightly
 different IR being passed through the optimizers. To force Polly to run early in
 the pass pipleline use the option *-polly-position=early* (default today).

 .. image:: images/LLVM-Passes-early.png
     :align: center

 Running Polly right before the vectorizer has the benefit that the full inlining
 cycle has been run and as a result even heavily templated C++ code could
 theoretically benefit from Polly (more work is necessary to make Polly here
 really effective). As the IR that is passed to Polly has already been
 canonicalized, there is also no need to run additional canonicalization passes.
 General compile time is almost not affected by Polly, as detection of loop
 kernels is generally very fast and the actual optimization and cleanup passes
 are only run on functions which contain loop kernels that are worth optimizing.
 However, due to the many optimizations that LLVM runs before Polly the IR that
 reaches Polly often has additional scalar dependences that make Polly a lot less
 efficient. To force Polly to run before the vectorizer in the pass pipleline use
 the option *-polly-position=before-vectorizer*. This position is not yet the
 default for Polly, but work is on its way to be effective even in presence of
 scalar dependences. After this work has been completed, Polly will likely use
 this position by default.

 .. image:: images/LLVM-Passes-late.png
     :align: center
	================
	The Architecture
	================

	Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
	extracts interesting loop kernels. For each kernel a mathematical model is
	derived which precisely describes the individual computations and memory
	accesses in the kernels. Within Polly a variety of analysis and code
	transformations are performed on this mathematical model. After all
	optimizations have been derived and applied, optimized LLVM-IR is regenerated
	and inserted into the LLVM-IR module.

	.. image:: images/architecture.png
	:align: center

	Polly in the LLVM pass pipeline
	-------------------------------

	The standard LLVM pass pipeline as it is used in -O1/-O2/-O3 mode of clang/opt
	consists of a sequence of passes that can be grouped in different conceptual
	phases. The first phase, we call it here Canonicalization, is a scalar
	canonicalization phase that contains passes like -mem2reg, -instcombine,
	-cfgsimplify, or early loop unrolling. It has the goal of removing and
	simplifying the given IR as much as possible focusing mostly on scalar
	optimizations. The second phase consists of three conceptual groups that are
	executed in the so-called Inliner cycle, This is again a set of **Scalar
	Simplification passes, a set of Simple Loop Optimizations**, and the
	Inliner itself. Even though these passes make up the majority of the LLVM
	pass pipeline, the primary goal of these passes is still canonicalization
	without loosing semantic information that complicates later analysis. As part of
	the inliner cycle, the LLVM inliner step-by-step tries to inline functions, runs
	canonicalization passes to exploit newly exposed simplification opportunities,
	and then tries to inline the further simplified functions. Some simple loop
	optimizations are executed as part of the inliner cycle. Even though they
	perform some optimizations, their primary goal is still the simplification of
	the program code. Loop invariant code motion is one such optimization that
	besides being beneficial for program performance also allows us to move
	computation out of loops and in the best case enables us to eliminate certain
	loops completely. Only after the inliner cycle has been finished, a last
	Target Specialization phase is run, where IR complexity is deliberately
	increased to take advantage of target specific features that maximize the
	execution performance on the device we target. One of the principal
	optimizations in this phase is vectorization, but also target specific loop
	unrolling, or some loop transformations (e.g., distribution) that expose more
	vectorization opportunities.

	.. image:: images/LLVM-Passes-only.png
	:align: center

	Polly can conceptually be run at three different positions in the pass pipeline.
	As an early optimizer before the standard LLVM pass pipeline, as a later
	optimizer as part of the target specialization sequence, and theoretically also
	with the loop optimizations in the inliner cycle. We only discuss the first two
	options, as running Polly in the inline loop, is likely to disturb the inliner
	and is consequently not a good idea.

	.. image:: images/LLVM-Passes-all.png
	:align: center

	Running Polly early before the standard pass pipeline has the benefit that the
	LLVM-IR processed by Polly is still very close to the original input code.
	Hence, it is less likely that transformations applied by LLVM change the IR in
	ways not easily understandable for the programmer. As a result, user feedback is
	likely better and it is less likely that kernels that in C seem a perfect fit
	for Polly have been transformed such that Polly can not handle them any
	more. On the other hand, codes that require inlining to be optimized won't
	benefit if Polly is scheduled at this position. The additional set of
	canonicalization passes required will result in a small, but general compile
	time increase and some random run-time performance changes due to slightly
	different IR being passed through the optimizers. To force Polly to run early in
	the pass pipleline use the option -polly-position=early (default today).

	.. image:: images/LLVM-Passes-early.png
	:align: center

	Running Polly right before the vectorizer has the benefit that the full inlining
	cycle has been run and as a result even heavily templated C++ code could
	theoretically benefit from Polly (more work is necessary to make Polly here
	really effective). As the IR that is passed to Polly has already been
	canonicalized, there is also no need to run additional canonicalization passes.
	General compile time is almost not affected by Polly, as detection of loop
	kernels is generally very fast and the actual optimization and cleanup passes
	are only run on functions which contain loop kernels that are worth optimizing.
	However, due to the many optimizations that LLVM runs before Polly the IR that
	reaches Polly often has additional scalar dependences that make Polly a lot less
	efficient. To force Polly to run before the vectorizer in the pass pipleline use
	the option -polly-position=before-vectorizer. This position is not yet the
	default for Polly, but work is on its way to be effective even in presence of
	scalar dependences. After this work has been completed, Polly will likely use
	this position by default.

	.. image:: images/LLVM-Passes-late.png
	:align: center