src/third_party/llvm-project/polly/www/performance.html - cobalt - Git at Google

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
 <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
 <html>
 <head> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
   <title>Polly - Performance</title>
   <link type="text/css" rel="stylesheet" href="menu.css">
   <link type="text/css" rel="stylesheet" href="content.css">
 </head>
 <body>
 <div id="box">
 <!--#include virtual="menu.html.incl"-->
 <div id="content">
 <h1>Performance</h1>

 <p>To evaluate the performance benefits Polly currently provides we compiled the
 <a href="http://www.cse.ohio-state.edu/~pouchet/software/polybench/">Polybench
 2.0</a> benchmark suite.  Each benchmark was run with double precision floating
 point values on an Intel Core Xeon X5670 CPU @ 2.93GHz (12 cores, 24 thread)
 system. We used <a href="http://pocc.sf.net">PoCC</a> and the included <a
 href="http://pluto-compiler.sf.net">Pluto</a> transformations to optimize the
 code. The source code of Polly and LLVM/clang was checked out on
 25/03/2011.</p>

 <p>The results shown were created fully automatically without manual
 interaction. We did not yet spend any time to tune the results. Hence
 further improvments may be achieved by tuning the code generated by Polly, the
 heuristics used by Pluto or by investigating if more code could be optimized.
 As Pluto was never used at such a low level, its heuristics are probably
 far from perfect. Another area where we expect larger performance improvements
 is the SIMD vector code generation. At the moment, it rarely yields to
 performance improvements, as we did not yet include vectorization in our
 heuristics. By changing this we should be able to significantly increase the
 number of test cases that show improvements.</p>

 <p>The polybench test suite contains computation kernels from linear algebra
 routines, stencil computations, image processing and data mining. Polly
 recognices the majority of them and is able to show good speedup. However,
 to show similar speedup on larger examples like the SPEC CPU benchmarks Polly
 still misses support for integer casts, variable-sized multi-dimensional arrays
 and probably several other construts. This support is necessary as such
 constructs appear in larger programs, but not in our limited test suite.

 <h2> Sequential runs</h2>

 For the sequential runs we used Polly to create a program structure that is
 optimized for data-locality. One of the major optimizations performed is tiling.
 The speedups shown are without the use of any multi-core parallelism. No
 additional hardware is used, but the single available core is used more
 efficiently.
 <h3> Small data size</h3>
 <img src="images/performance/sequential-small.png" /><br />
 <h3> Large data size</h3>
 <img src="images/performance/sequential-large.png" />
 <h2> Parallel runs</h2>
 For the parallel runs we used Polly to expose parallelism and to add calls to an
 OpenMP runtime library. With OpenMP we can use all 12 hardware cores
 instead of the single core that was used before. We can see that in several
 cases we obtain more than linear speedup. This additional speedup is due to
 improved data-locality.
 <h3> Small data size</h3>
 <img src="images/performance/parallel-small.png" /><br />
 <h3> Large data size</h3>
 <img src="images/performance/parallel-large.png" />
 </div>
 </div>
 </body>
 </html>
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
	"http://www.w3.org/TR/html4/strict.dtd">
	<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
	<html>
	<head> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
	<title>Polly - Performance</title>
	<link type="text/css" rel="stylesheet" href="menu.css">
	<link type="text/css" rel="stylesheet" href="content.css">
	</head>
	<body>
	<div id="box">
	<!--#include virtual="menu.html.incl"-->
	<div id="content">
	<h1>Performance</h1>

	<p>To evaluate the performance benefits Polly currently provides we compiled the
	<a href="http://www.cse.ohio-state.edu/~pouchet/software/polybench/">Polybench
	2.0</a> benchmark suite. Each benchmark was run with double precision floating
	point values on an Intel Core Xeon X5670 CPU @ 2.93GHz (12 cores, 24 thread)
	system. We used <a href="http://pocc.sf.net">PoCC</a> and the included <a
	href="http://pluto-compiler.sf.net">Pluto</a> transformations to optimize the
	code. The source code of Polly and LLVM/clang was checked out on
	25/03/2011.</p>

	<p>The results shown were created fully automatically without manual
	interaction. We did not yet spend any time to tune the results. Hence
	further improvments may be achieved by tuning the code generated by Polly, the
	heuristics used by Pluto or by investigating if more code could be optimized.
	As Pluto was never used at such a low level, its heuristics are probably
	far from perfect. Another area where we expect larger performance improvements
	is the SIMD vector code generation. At the moment, it rarely yields to
	performance improvements, as we did not yet include vectorization in our
	heuristics. By changing this we should be able to significantly increase the
	number of test cases that show improvements.</p>

	<p>The polybench test suite contains computation kernels from linear algebra
	routines, stencil computations, image processing and data mining. Polly
	recognices the majority of them and is able to show good speedup. However,
	to show similar speedup on larger examples like the SPEC CPU benchmarks Polly
	still misses support for integer casts, variable-sized multi-dimensional arrays
	and probably several other construts. This support is necessary as such
	constructs appear in larger programs, but not in our limited test suite.

	<h2> Sequential runs</h2>

	For the sequential runs we used Polly to create a program structure that is
	optimized for data-locality. One of the major optimizations performed is tiling.
	The speedups shown are without the use of any multi-core parallelism. No
	additional hardware is used, but the single available core is used more
	efficiently.
	<h3> Small data size</h3>
	<img src="images/performance/sequential-small.png" /><br />
	<h3> Large data size</h3>
	<img src="images/performance/sequential-large.png" />
	<h2> Parallel runs</h2>
	For the parallel runs we used Polly to expose parallelism and to add calls to an
	OpenMP runtime library. With OpenMP we can use all 12 hardware cores
	instead of the single core that was used before. We can see that in several
	cases we obtain more than linear speedup. This additional speedup is due to
	improved data-locality.
	<h3> Small data size</h3>
	<img src="images/performance/parallel-small.png" /><br />
	<h3> Large data size</h3>
	<img src="images/performance/parallel-large.png" />
	</div>
	</div>
	</body>
	</html>