| ================================================== |
| How to manually use the Individual pieces of Polly |
| ================================================== |
| |
| Execute the individual Polly passes manually |
| ============================================ |
| |
| .. sectionauthor:: Singapuram Sanjay Srivallabh |
| |
| This example presents the individual passes that are involved when optimizing |
| code with Polly. We show how to execute them individually and explain for |
| each which analysis is performed or what transformation is applied. In this |
| example the polyhedral transformation is user-provided to show how much |
| performance improvement can be expected by an optimal automatic optimizer. |
| |
| 1. **Create LLVM-IR from the C code** |
| ------------------------------------- |
| Polly works on LLVM-IR. Hence it is necessary to translate the source |
| files into LLVM-IR. If more than one file should be optimized the |
| files can be combined into a single file with llvm-link. |
| |
| .. code-block:: console |
| |
| clang -S -emit-llvm matmul.c -o matmul.s |
| |
| |
| 2. **Prepare the LLVM-IR for Polly** |
| ------------------------------------ |
| |
| Polly is only able to work with code that matches a canonical form. |
| To translate the LLVM-IR into this form we use a set of |
| canonicalication passes. They are scheduled by using |
| '-polly-canonicalize'. |
| |
| .. code-block:: console |
| |
| opt -S -polly-canonicalize matmul.s > matmul.preopt.ll |
| |
| 3. **Show the SCoPs detected by Polly (optional)** |
| -------------------------------------------------- |
| |
| To understand if Polly was able to detect SCoPs, we print the structure |
| of the detected SCoPs. In our example two SCoPs are detected. One in |
| 'init_array' the other in 'main'. |
| |
| .. code-block:: console |
| |
| $ opt -polly-ast -analyze -q matmul.preopt.ll -polly-process-unprofitable |
| |
| .. code-block:: guess |
| |
| :: isl ast :: init_array :: %for.cond1.preheader---%for.end19 |
| |
| if (1) |
| |
| for (int c0 = 0; c0 <= 1535; c0 += 1) |
| for (int c1 = 0; c1 <= 1535; c1 += 1) |
| Stmt_for_body3(c0, c1); |
| |
| else |
| { /* original code */ } |
| |
| :: isl ast :: main :: %for.cond1.preheader---%for.end30 |
| |
| if (1) |
| |
| for (int c0 = 0; c0 <= 1535; c0 += 1) |
| for (int c1 = 0; c1 <= 1535; c1 += 1) { |
| Stmt_for_body3(c0, c1); |
| for (int c2 = 0; c2 <= 1535; c2 += 1) |
| Stmt_for_body8(c0, c1, c2); |
| } |
| |
| else |
| { /* original code */ } |
| |
| 4. **Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)** |
| ---------------------------------------------------------------------------------------- |
| |
| Polly can use graphviz to graphically show a CFG in which the detected |
| SCoPs are highlighted. It can also create '.dot' files that can be |
| translated by the 'dot' utility into various graphic formats. |
| |
| |
| .. code-block:: console |
| |
| $ opt -view-scops -disable-output matmul.preopt.ll |
| $ opt -view-scops-only -disable-output matmul.preopt.ll |
| |
| The output for the different functions: |
| |
| - view-scops : main_, init_array_, print_array_ |
| - view-scops-only : main-scopsonly_, init_array-scopsonly_, print_array-scopsonly_ |
| |
| .. _main: http://polly.llvm.org/experiments/matmul/scops.main.dot.png |
| .. _init_array: http://polly.llvm.org/experiments/matmul/scops.init_array.dot.png |
| .. _print_array: http://polly.llvm.org/experiments/matmul/scops.print_array.dot.png |
| .. _main-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.main.dot.png |
| .. _init_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.init_array.dot.png |
| .. _print_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.print_array.dot.png |
| |
| 5. **View the polyhedral representation of the SCoPs** |
| ------------------------------------------------------ |
| |
| .. code-block:: console |
| |
| $ opt -polly-scops -analyze matmul.preopt.ll -polly-process-unprofitable |
| |
| .. code-block:: guess |
| |
| [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end19' in function 'init_array': |
| Function: init_array |
| Region: %for.cond1.preheader---%for.end19 |
| Max Loop Depth: 2 |
| Invariant Accesses: { |
| } |
| Context: |
| { : } |
| Assumed Context: |
| { : } |
| Invalid Context: |
| { : 1 = 0 } |
| Arrays { |
| float MemRef_A[*][1536]; // Element size 4 |
| float MemRef_B[*][1536]; // Element size 4 |
| } |
| Arrays (Bounds as pw_affs) { |
| float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4 |
| float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4 |
| } |
| Alias Groups (0): |
| n/a |
| Statements { |
| Stmt_for_body3 |
| Domain := |
| { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; |
| Schedule := |
| { Stmt_for_body3[i0, i1] -> [i0, i1] }; |
| MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body3[i0, i1] -> MemRef_A[i0, i1] }; |
| MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body3[i0, i1] -> MemRef_B[i0, i1] }; |
| } |
| [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end30' in function 'main': |
| Function: main |
| Region: %for.cond1.preheader---%for.end30 |
| Max Loop Depth: 3 |
| Invariant Accesses: { |
| } |
| Context: |
| { : } |
| Assumed Context: |
| { : } |
| Invalid Context: |
| { : 1 = 0 } |
| Arrays { |
| float MemRef_C[*][1536]; // Element size 4 |
| float MemRef_A[*][1536]; // Element size 4 |
| float MemRef_B[*][1536]; // Element size 4 |
| } |
| Arrays (Bounds as pw_affs) { |
| float MemRef_C[*][ { [] -> [(1536)] } ]; // Element size 4 |
| float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4 |
| float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4 |
| } |
| Alias Groups (0): |
| n/a |
| Statements { |
| Stmt_for_body3 |
| Domain := |
| { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; |
| Schedule := |
| { Stmt_for_body3[i0, i1] -> [i0, i1, 0, 0] }; |
| MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body3[i0, i1] -> MemRef_C[i0, i1] }; |
| Stmt_for_body8 |
| Domain := |
| { Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1535 }; |
| Schedule := |
| { Stmt_for_body8[i0, i1, i2] -> [i0, i1, 1, i2] }; |
| ReadAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] }; |
| ReadAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body8[i0, i1, i2] -> MemRef_A[i0, i2] }; |
| ReadAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body8[i0, i1, i2] -> MemRef_B[i2, i1] }; |
| MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] |
| { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] }; |
| } |
| |
| |
| 6. **Show the dependences for the SCoPs** |
| ----------------------------------------- |
| |
| .. code-block:: console |
| |
| $ opt -polly-dependences -analyze matmul.preopt.ll -polly-process-unprofitable |
| |
| .. code-block:: guess |
| |
| [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end19' in function 'init_array': |
| RAW dependences: |
| { } |
| WAR dependences: |
| { } |
| WAW dependences: |
| { } |
| Reduction dependences: |
| n/a |
| Transitive closure of reduction dependences: |
| { } |
| [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end30' in function 'main': |
| RAW dependences: |
| { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } |
| WAR dependences: |
| { } |
| WAW dependences: |
| { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } |
| Reduction dependences: |
| n/a |
| Transitive closure of reduction dependences: |
| { } |
| |
| 7. **Export jscop files** |
| ------------------------- |
| |
| .. code-block:: console |
| |
| $ opt -polly-export-jscop matmul.preopt.ll -polly-process-unprofitable |
| |
| .. code-block:: guess |
| |
| [...]Writing JScop '%for.cond1.preheader---%for.end19' in function 'init_array' to './init_array___%for.cond1.preheader---%for.end19.jscop'. |
| |
| Writing JScop '%for.cond1.preheader---%for.end30' in function 'main' to './main___%for.cond1.preheader---%for.end30.jscop'. |
| |
| |
| |
| 8. **Import the changed jscop files and print the updated SCoP structure (optional)** |
| ------------------------------------------------------------------------------------- |
| |
| Polly can reimport jscop files, in which the schedules of the statements |
| are changed. These changed schedules are used to descripe |
| transformations. It is possible to import different jscop files by |
| providing the postfix of the jscop file that is imported. |
| |
| We apply three different transformations on the SCoP in the main |
| function. The jscop files describing these transformations are |
| hand written (and available in docs/experiments/matmul). |
| |
| **No Polly** |
| |
| As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-ast -analyze -polly-process-unprofitable |
| |
| .. code-block:: c |
| |
| [...] |
| :: isl ast :: main :: %for.cond1.preheader---%for.end30 |
| |
| if (1) |
| |
| for (int c0 = 0; c0 <= 1535; c0 += 1) |
| for (int c1 = 0; c1 <= 1535; c1 += 1) { |
| Stmt_for_body3(c0, c1); |
| for (int c3 = 0; c3 <= 1535; c3 += 1) |
| Stmt_for_body8(c0, c1, c3); |
| } |
| |
| else |
| { /* original code */ } |
| [...] |
| |
| **Loop Interchange (and Fission to allow the interchange)** |
| |
| We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_for_body8. |
| |
| .. Although I feel (and have created a .jscop) we can avoid splitting the loops. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-ast -analyze -polly-process-unprofitable |
| |
| .. code-block:: c |
| |
| [...] |
| :: isl ast :: main :: %for.cond1.preheader---%for.end30 |
| |
| if (1) |
| |
| { |
| for (int c1 = 0; c1 <= 1535; c1 += 1) |
| for (int c2 = 0; c2 <= 1535; c2 += 1) |
| Stmt_for_body3(c1, c2); |
| for (int c1 = 0; c1 <= 1535; c1 += 1) |
| for (int c2 = 0; c2 <= 1535; c2 += 1) |
| for (int c3 = 0; c3 <= 1535; c3 += 1) |
| Stmt_for_body8(c1, c3, c2); |
| } |
| |
| else |
| { /* original code */ } |
| [...] |
| |
| **Interchange + Tiling** |
| |
| In addition to the interchange we now tile the second loop nest. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable |
| |
| .. code-block:: c |
| |
| [...] |
| :: isl ast :: main :: %for.cond1.preheader---%for.end30 |
| |
| if (1) |
| |
| { |
| for (int c1 = 0; c1 <= 1535; c1 += 1) |
| for (int c2 = 0; c2 <= 1535; c2 += 1) |
| Stmt_for_body3(c1, c2); |
| for (int c1 = 0; c1 <= 1535; c1 += 64) |
| for (int c2 = 0; c2 <= 1535; c2 += 64) |
| for (int c3 = 0; c3 <= 1535; c3 += 64) |
| for (int c4 = c1; c4 <= c1 + 63; c4 += 1) |
| for (int c5 = c3; c5 <= c3 + 63; c5 += 1) |
| for (int c6 = c2; c6 <= c2 + 63; c6 += 1) |
| Stmt_for_body8(c4, c6, c5); |
| } |
| |
| else |
| { /* original code */ } |
| [...] |
| |
| |
| **Interchange + Tiling + Strip-mining to prepare vectorization** |
| |
| To later allow vectorization we create a so called trivially |
| parallelizable loop. It is innermost, parallel and has only four |
| iterations. It can be replaced by 4-element SIMD instructions. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable |
| |
| .. code-block:: c |
| |
| [...] |
| :: isl ast :: main :: %for.cond1.preheader---%for.end30 |
| |
| if (1) |
| |
| { |
| for (int c1 = 0; c1 <= 1535; c1 += 1) |
| for (int c2 = 0; c2 <= 1535; c2 += 1) |
| Stmt_for_body3(c1, c2); |
| for (int c1 = 0; c1 <= 1535; c1 += 64) |
| for (int c2 = 0; c2 <= 1535; c2 += 64) |
| for (int c3 = 0; c3 <= 1535; c3 += 64) |
| for (int c4 = c1; c4 <= c1 + 63; c4 += 1) |
| for (int c5 = c3; c5 <= c3 + 63; c5 += 1) |
| for (int c6 = c2; c6 <= c2 + 63; c6 += 4) |
| for (int c7 = c6; c7 <= c6 + 3; c7 += 1) |
| Stmt_for_body8(c4, c7, c5); |
| } |
| |
| else |
| { /* original code */ } |
| [...] |
| |
| 9. **Codegenerate the SCoPs** |
| ----------------------------- |
| |
| This generates new code for the SCoPs detected by polly. If |
| -polly-import-jscop is present, transformations specified in the |
| imported jscop files will be applied. |
| |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-codegen -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged.ll |
| |
| .. code-block:: guess |
| |
| Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged'. |
| File could not be read: No such file or directory |
| Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged'. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-codegen -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled.ll |
| |
| .. code-block:: guess |
| |
| Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled'. |
| File could not be read: No such file or directory |
| Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled'. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled+vector.ll |
| |
| .. code-block:: guess |
| |
| Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'. |
| File could not be read: No such file or directory |
| Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'. |
| |
| .. code-block:: console |
| |
| $ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-parallel -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled+openmp.ll |
| |
| .. code-block:: guess |
| |
| Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'. |
| File could not be read: No such file or directory |
| Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'. |
| |
| |
| 10. **Create the executables** |
| ------------------------------ |
| |
| .. code-block:: console |
| |
| $ llc matmul.normalopt.ll -o matmul.normalopt.s && gcc matmul.normalopt.s -o matmul.normalopt.exe |
| $ llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe |
| $ llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe |
| $ llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe |
| $ llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && gcc -fopenmp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe |
| |
| 11. **Compare the runtime of the executables** |
| ---------------------------------------------- |
| |
| By comparing the runtimes of the different code snippets we see that a |
| simple loop interchange gives here the largest performance boost. |
| However in this case, adding vectorization and using OpenMP degrades |
| the performance. |
| |
| .. code-block:: console |
| |
| $ time ./matmul.normalopt.exe |
| |
| real 0m11.295s |
| user 0m11.288s |
| sys 0m0.004s |
| $ time ./matmul.polly.interchanged.exe |
| |
| real 0m0.988s |
| user 0m0.980s |
| sys 0m0.008s |
| $ time ./matmul.polly.interchanged+tiled.exe |
| |
| real 0m0.830s |
| user 0m0.816s |
| sys 0m0.012s |
| $ time ./matmul.polly.interchanged+tiled+vector.exe |
| |
| real 0m5.430s |
| user 0m5.424s |
| sys 0m0.004s |
| $ time ./matmul.polly.interchanged+tiled+vector+openmp.exe |
| |
| real 0m3.184s |
| user 0m11.972s |
| sys 0m0.036s |
| |