blob: 3a3b85b829f54da82f96e81550bfd5bb81269576 [file] [view]
# Performance Tuning
Cobalt is designed to choose sensible parameters for all performance-related
options and parameters, however sometimes these need to be explicitly set
to allow Cobalt to run optimally for a specific platform. This document
discusses some of the tweakable parameters in Cobalt that can have an
affect on performance.
A number of tweaks are listed below in no particular order. Each item
has a set of tags keywords to make it easy to search for items related
to a specific type of performance metric (e.g. "framerate").
Many of the tweaks involve adding a new gyp variable to your platform's
`gyp_configuration.gypi` file. The default values for these variables are
defined in either
[`base_configuration.gypi`](../../starboard/build/base_configuration.gypi) or
[`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi).
### Use a Release Build
Cobalt has a number of different build configurations (e.g. "debug", "devel",
"qa" and "gold" in slowest-to-fastest order), with varying degrees of
optimizations enabled. For example, while "devel" has compiler optimizations
enabled, it does not disable DCHECKS (debug assertions) which can decrease
Cobalt's performance. The "qa" build is most similar to "gold", but it still
has some debug features enabled (such as the debug console which can consume
memory, and decrease performance while it is visible). For the best
performance, build Cobalt in the "gold" configuration.
**Tags:** *framerate, startup, browse-to-watch, cpu memory, input latency.*
### Switch JavaScript Engine to V8
Cobalt supports both SpiderMonkey and V8 as JavaScript engines. SpiderMonkey
is the default JavaScript engine since it is the most compatible in that it
does not require your platform to support Just-In-Time (JIT) compiling.
However, if your platform supports it, we strongly recommend that you use
V8, as it has been shown to provide 20-50% speed improvements on JavaScript
execution across the board. Note however that V8 has also been found to
consume around 10MB more memory than SpiderMonkey.
To enable V8, you must modify the `GetVariables()` method in your
`gyp_configuration.py` file and ensure that the variables dictionary that is
returned contains the following key/value pairs:
```
{
'javascript_engine': 'v8',
'cobalt_enable_jit': 1,
}
```
Note also that use of V8 requires Starboard version 10 or higher.
**Tags:** *startup, browse-to-watch, cpu memory, input latency.*
### Framerate throttling
If you're willing to accept a lower framerate, there is potential that
JavaScript execution can be made to run faster (which can improve startup
time, browse-to-watch time, and input latency). Without any special
settings in place, the renderer will attempt to render each frame as fast
as it can, limited only by the display's refresh rate, which is usually 60Hz.
By artificially throttling this rate to a lower value, like 30Hz, CPU
resources can be freed to work on other tasks. You can enable framerate
throttling by setting a value for `cobalt_minimum_frame_time_in_milliseconds`
in your platform's `gyp_configuration.gypi` file. Setting it to 33, for
example, will throttle Cobalt's renderer to 30 frames per second.
**Tags:** *gyp_configuration.gypi, framerate, startup, browse-to-watch,
input latency.*
### Image cache capacity
Cobalt's image cache is used to cache decoded image data. The image data
in the image cache is stored as a texture, and so it will occupy GPU memory.
The image cache capacity dictates how long images will be kept resident in
memory even if they are not currently visible on the web page. By reducing
this value, you can lower GPU memory usage, at the cost of having Cobalt
make more network requests and image decodes for previously seen images.
Cobalt will automatically set the image cache capacity to a reasonable value,
but if you wish to override this, you can do so by setting the
`image_cache_size_in_bytes` variable in your `gyp_configuration.gypi` file. For
the YouTube web app, we have found that at 1080p, 32MB will allow around
5 thumbnail shelves to stay resident at a time, with 720p and 4K resolutions
using proportionally less and more memory, respectively.
**Tags:** *gyp_configuration.gypi, cpu memory, gpu memory.*
### Image cache capacity multiplier during video playback
Cobalt provides a feature where the image cache capacity will be reduced
as soon as video playback begins. This can be useful for reducing peak
GPU memory usage, which usually occurs during video playback. The
downside to lowering the image cache during video playback is that it
may need to evict some images when the capacity changes, and so it is
more likely that Cobalt will have to re-download and decode images after
returning from video playback. Note that this feature is not well tested.
The feature can be activated by setting
`image_cache_capacity_multiplier_when_playing_video` to a value between
`0.0` and `1.0` in your `gyp_configuration.gypi` file. The image cache
capacity will be multiplied by this value during video playback.
**Tags:** *gyp_configuration.gypi, gpu memory.*
### Scratch Surface cache capacity
This only affects GLES renderers. While rasterizing a frame, it is
occasionally necessary to render to a temporary offscreen surface and then
apply that surface to the original render target. Offscreen surface
rendering may also need to be performed multiple times per frame. The
scratch surface cache will keep allocated a set of scratch textures that
will be reused (within and across frames) for offscreen rendering. Reusing
offscreen surfaces allows render target allocations, which can be expensive
on some platforms, to be minimized. However, it has been found that some
platforms (especially those with tiled renderers, like the Raspberry Pi's
Broadcom VideoCore), reading and writing again and again to the same texture
can result in performance degradation. Memory may also be potentially saved
by disabling this cache, since when it is enabled, if the cache is filled, it
may be occupying memory that it is not currently using. This setting can
be adjusted by setting `surface_cache_size_in_bytes` in your
`gyp_configuration.gypi` file. A value of `0` will disable the surface cache.
**Tags:** *gyp_configuration.gypi, gpu memory, framerate.*
### Glyph atlas size
This only affects GLES renderers. Skia sets up glyph atlases to which
it software rasterizes glyphs the first time they are encountered, and
from which the glyphs are used as textures for hardware accelerated glyph
rendering to the render target. Adjusting this value will adjust
GPU memory usage, but at the cost of performance as text glyphs will be
less likely to be cached already. Note that if experimenting with
modifications to this setting, be sure to test many languages, as some
are more demanding (e.g. Chinese and Japanese) on the glyph cache than
others. This value can be adjusted by changing the values of
the `skia_glyph_atlas_width` and `skia_glyph_atlas_height` variables in your
`gyp_configuration.gypi` file. Note that by default, these will be
automatically configured by Cobalt to values found to be optimal for
the application's resolution.
**Tags:** *gyp_configuration.gypi, gpu memory, input latency, framerate.*
### Software surface cache capacity
This only affects Starboard Blitter API renderers. The Starboard Blitter API
has only limited support for rendering special effects, so often Cobalt will
have to fallback to a software rasterizer for rendering certain visual
elements (most notably, text). In order to avoid expensive software
renders, the results are cached and re-used across frames. The software
surface cache is crucial to achieving an acceptable framerate on Blitter API
platforms. The size of this cache is specified by the
`software_surface_cache_size_in_bytes` variable in `gyp_configuration.gypi`.
**Tags:** *gyp_configuration.gypi, gpu memory, framerate.*
### Toggle Just-In-Time JavaScript Compilation
Just-in-time (JIT) compilation of JavaScript is well known to significantly
improve the speed of JavaScript execution. However, in the context of Cobalt
and its web apps (like YouTube's HTML5 TV application), JITting may not be
the best or fastest thing to do. Enabling JIT can result in Cobalt using
more memory (to store compiled code) and can also actually slow down
JavaScript execution (e.g. time must now be spent compiling code). It is
recommended that JIT support be left disabled, but you can experiment with
it by setting the `cobalt_enable_jit` `gyp_configuration.gypi` variable to `1`
to enable JIT, or `0` to disable it.
**Tags:** *gyp_configuration.gypi, startup, browse-to-watch, input latency,
cpu memory.*
### Garbage collection trigger threshold
The SpiderMonkey JavaScript engine provides a parameter that describes how
aggressive it will be at performing garbage collections to reduce memory
usage. By lowering this value, garbage collection will occur more often,
thus reducing performance, but memory usage will be lowered. We have found
that performance reductions are modest, so it is not unreasonable to set this
value to something low like 1MB if your platform is low on memory. This
setting can be adjusted by setting the value of
`mozjs_garbage_collection_threshold_in_bytes` in your `gyp_configuration.gypi`
file.
**Tags:** *gyp_configuration.gypi, startup, browse-to-watch, input latency,
cpu memory.*
### Ensure that you are not requesting Cobalt to render unchanging frames
Some platforms require that the display buffer is swapped frequently, and
so in these cases Cobalt will render the scene every frame, even if it is
not changing, which consumes CPU resources. This behavior is defined by the
value of `SB_MUST_FREQUENTLY_FLIP_DISPLAY_BUFFER` in your platform's
`configuration_public.h` file. Unless your platform is restricted in this
aspect, you should ensure that `SB_MUST_FREQUENTLY_FLIP_DISPLAY_BUFFER`
is set to `0`.
**Tags:** *configuration_public.h, startup, browse-to-watch, input latency,
framerate.*
### Try enabling rendering only to regions that change
If you set the
[`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi) variable,
`render_dirty_region_only` to `1`, then Cobalt will invoke logic to detect which
part of the frame has been affected by animations and can be configured to only
render to that region. However, this feature requires support from the driver
for GLES platforms. In particular, `eglChooseConfig()` will first be called
with `EGL_SWAP_BEHAVIOR_PRESERVED_BIT` set in its attribute list. If this
fails, Cobalt will call eglChooseConfig() again without
`EGL_SWAP_BEHAVIOR_PRESERVED_BIT` set and dirty region rendering will
be disabled. By having Cobalt render only small parts of the screen,
CPU (and GPU) resources can be freed to work on other tasks. This can
especially affect startup time since usually only a small part of the
screen is updating (e.g. displaying an animated spinner). Thus, if
possible, ensure that your EGL/GLES driver supports
`EGL_SWAP_BEHAVIOR_PRESERVED_BIT`. Note that it is possible (but not
necessary) that GLES drivers will implement this feature by allocating a new
offscreen buffer, which can significantly affect GPU memory usage. If you are
on a Blitter API platform, enabling this functionality will result in the
allocation and blit of a fullscreen "intermediate" back buffer target.
**Tags:** *startup, framerate, gpu memory.*
### Ensure that thread priorities are respected
Cobalt makes use of thread priorities to ensure that animations remain smooth
even while JavaScript is being executed, and to ensure that JavaScript is
processed (e.g. in response to a key press) before images are decoded. Thus
having support for priorities can improve the overall performance of the
application. To enable thread priority support, you should set the value
of `SB_HAS_THREAD_PRIORITY_SUPPORT` to `1` in your `configuration_public.h`
file, and then also ensure that your platform's implementation of
`SbThreadCreate()` properly forwards the priority parameter down to the
platform.
**Tags:** *configuration_public.h, framerate, startup, browse-to-watch,
input latency.*
### Tweak compiler/linker optimization flags
Huge performance improvements can be obtained by ensuring that the right
optimizations are enabled by your compiler and linker flag settings. You
can set these up within `gyp_configuration.gypi` by adjusting the list
variables `compiler_flags` and `linker_flags`. See also
`compiler_flags_gold` and `linker_flags_gold` which describe flags that
apply only to gold builds where performance is critical. Note that
unless you explicitly set this up, it is unlikely that compiler/linker
flags will carry over from external shell environment settings; they
must be set explicitly in `gyp_configuration.gypi`.
**Tags:** *framerate, startup, browse-to-watch, input latency*
#### Link Time Optimization (LTO)
If your toolchain supports it, it is recommended that you enable the LTO
optimization, as it has been reported to yield significant performance
improvements in many high profile projects.
**Tags:** *framerate, startup, browse-to-watch, input latency*
#### The GCC '-mplt' flag for MIPS architectures
The '-mplt' flag has been found to improve all around performance by
~20% on MIPS architecture platforms. If your platform has a MIPS
architecture, it is suggested that you enable this flag in gold builds.
**Tags:** *gyp_configuration.gypi, framerate, startup, browse-to-watch,
input latency.*
### Close "Stats for Nerds" when measuring performance
The YouTube web app offers a feature called "Stats for Nerds" that enables
a stats overlay to appear on the screen during video playback. Rendering
this overlay requires a significant amount of processing, so it is
recommended that all performance evaluation is done without the
"Stats for Nerds" overlay active. This can greatly affect browse-to-watch
time and potentially affect the video frame drop rate.
**Tags:** *browse-to-watch, framerate, youtube.*
### Close the debug console when measuring performance
Cobalt provides a debug console in non-gold builds to allow the display
of variables overlayed on top of the application. This can be helpful
for debugging issues and keeping track of things like app lifetime, but
the debug console consumes significant resources when it is visible in order
to render it, so it should be hidden when performance is being evaluated.
**Tags:** *framerate, startup, browse-to-watch, input latency.*
### Toggle between dlmalloc and system allocator
Cobalt includes dlmalloc and can be configured to use it to handle all
memory allocations. It should be carefully evaluated however whether
dlmalloc performs better or worse than your system allocator, in terms
of both memory fragmentation efficiency as well as runtime performance.
To use dlmalloc, you should adjust your starboard_platform.gyp file to
use the Starboard [`starboard/memory.h`](../../starboard/memory.h) function
implementations defined in
[`starboard/shared/dlmalloc/`](../../starboard/shared/dlmalloc). To use
your system allocator, you should adjust your starboard_platform.gyp file
to use the Starboard [`starboard/memory.h`](../../starboard/memory.h) function
implementations defined in
[`starboard/shared/iso/`](../../starboard/shared/iso).
**Tags:** *framerate, startup, browse-to-watch, input latency, cpu memory.*
### Media buffer allocation strategy
During video playback, memory is reserved by Cobalt to contain the encoded
media data (separated into video and audio), and we refer to this memory
as the media buffers. By default, Cobalt pre-allocates the memory and
wraps it with a custom allocator, in order to avoid fragmentation of main
memory. However, depending on your platform and your system allocator,
overall memory usage may improve if media buffer allocations were made
normally via the system allocator instead. This can be achieved by setting
`cobalt_media_buffer_initial_capacity` and `cobalt_media_buffer_allocation_unit`
to 0 in gyp_configuration.gypi. Note also that if you choose to pre-allocate
memory, for 1080p video it has been found that 24MB is a good media buffer size.
The pre-allocated media buffer capacity size can be adjusted by modifying the
value of `cobalt_media_buffer_initial_capacity` mentioned above.
**Tags:** *configuration_public.h, cpu memory.*
### Adjust media buffer size settings
Many of the parameters around media buffer allocation can be adjusted in your
gyp_configuration.gypi file. The variables in question are the family of
`cobalt_media_*` variables, whose default values are specified in
[`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi). In
particular, if your maximum video output resolution is less than 1080, then you
may lower the budgets for many of the categories according to your maximum
resolution.
**Tags:** *cpu memory*
### Avoid using a the YouTube web app FPS counter (i.e. "?fps=1")
The YouTube web app is able to display a Frames Per Second (FPS) counter in the
corner when the URL parameter "fps=1" is set. Unfortunately, activating this
timer will cause Cobalt to re-layout and re-render the scene frequently in
order to update the FPS counter. Instead, we recommend instead to either
measure the framerate in the GLES driver and periodically printing it, or
hacking Cobalt to measure the framerate and periodically print it. In order to
hack in an FPS counter, you will want to look at the
`HardwareRasterizer::Impl::Submit()` function in
[`cobalt/renderer/rasterizer/skia/hardware_rasterizer.cc`](../renderer/rasterizer/skia/hardware_rasterizer.cc).
The work required to update the counter has the potential to affect many
aspects of performance. TODO: Cobalt should add a command line switch to
enable printing of the framerate in gold builds.
**Tags:** *framerate, startup, browse-to-watch, input latency,*
### Implement hardware image decoding
The Starboard header file [`starboard/image.h`](../../starboard/image.h) defines
functions that allow platforms to implement hardware-accelerated image
decoding, if available. In particular, if `SbImageIsDecodeSupported()` returns
true for the specified mime type and output format, then instead of using the
software-based libpng or libjpeg libraries, Cobalt will instead call
`SbImageDecode()`. `SbImageDecode()` is expected to return a decoded image as
a `SbDecodeTarget` option, from which Cobalt will extract a GL texture or
Blitter API surface object when rendering. If non-CPU hardware is used to
decode images, it would alleviate the load on the CPU, and possibly also
increase the speed at which images can be decoded.
**Tags:** *startup, browse-to-watch, input latency.*
### Use Chromium's about:tracing tool to debug Cobalt performance
Cobalt has support for generating profiling data that is viewable through
Chromium's about:tracing tool. This feature is available in all Cobalt
configurations except for "gold" ("qa" is the best build to use for performance
investigations here). There are currently two ways to tell Cobalt
to generate this data:
1. The command line option, "--timed_trace=XX" will instruct Cobalt to trace
upon startup, for XX seconds (e.g. "--timed_trace=25"). When completed,
the output will be written to the file `timed_trace.json`.
2. Using the debug console (hit CTRL+O on a keyboard once or twice), type in the
command "d.trace()" and hit enter. Cobalt will begin a trace. After
some time has passed (and presumably you have performed some actions), you
can open the debug console again and type "d.trace()" again to end the trace.
The trace output will be written to the file `triggered_trace.json`.
The directory the output files will be placed within is the directory that the
Starboard function `SbSystemGetPath()` returns with a `path_id` of
`kSbSystemPathDebugOutputDirectory`, so you may need to check your
implementation of `SbSystemGetPath()` to discover where this is.
Once the trace file is created, it can be opened in Chrome by navigating to
`about:tracing` or `chrome://tracing`, clicking the "Load" button near the top
left, and then opening the JSON file created earlier.
Of particular interest in the output view is the `MainWebModule` thread where
JavaScript and layout are executed, and `Rasterizer` where per-frame rendering
takes place.
**Tags:** *framerate, startup, browse-to-watch, input latency.*