| # Performance Tuning |
| |
| Cobalt is designed to choose sensible values for all performance-related |
| options and parameters. Sometimes, however, these need to be set explicitly |
| for Cobalt to run optimally on a specific platform. This document |
| discusses some of the tweakable parameters in Cobalt that can have an |
| effect on performance. |
| |
| A number of tweaks are listed below in no particular order. Each item |
| has a set of tag keywords to make it easy to search for items related |
| to a specific type of performance metric (e.g. "framerate"). |
| |
| Many of the tweaks involve adding a new gyp variable to your platform's |
| `gyp_configuration.gypi` file. The default values for these variables are |
| defined in either |
| [`base_configuration.gypi`](../../starboard/build/base_configuration.gypi) or |
| [`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi). |
| |
| ### Use a Release Build |
| |
| Cobalt has a number of different build configurations (e.g. "debug", "devel", |
| "qa" and "gold" in slowest-to-fastest order), with varying degrees of |
| optimizations enabled. For example, while "devel" has compiler optimizations |
| enabled, it does not disable DCHECKs (debug assertions), which can decrease |
| Cobalt's performance. The "qa" build is most similar to "gold", but it still |
| has some debug features enabled (such as the debug console which can consume |
| memory, and decrease performance while it is visible). For the best |
| performance, build Cobalt in the "gold" configuration. |
| |
| **Tags:** *framerate, startup, browse-to-watch, cpu memory, input latency.* |
| |
| |
| ### Switch JavaScript Engine to V8 |
| |
| Cobalt supports both SpiderMonkey and V8 as JavaScript engines. SpiderMonkey |
| is the default JavaScript engine since it is the most compatible in that it |
| does not require your platform to support Just-In-Time (JIT) compiling. |
| However, if your platform supports it, we strongly recommend that you use |
| V8, as it has been shown to provide 20-50% speed improvements on JavaScript |
| execution across the board. Note however that V8 has also been found to |
| consume around 10MB more memory than SpiderMonkey. |
| |
| To enable V8, you must modify the `GetVariables()` method in your |
| `gyp_configuration.py` file and ensure that the variables dictionary that is |
| returned contains the following key/value pairs: |
| |
| ``` |
| { |
| 'javascript_engine': 'v8', |
| 'cobalt_enable_jit': 1, |
| } |
| ``` |
| |
| Note also that use of V8 requires Starboard version 10 or higher. |
| |
| **Tags:** *startup, browse-to-watch, cpu memory, input latency.* |
| |
| |
| ### Framerate throttling |
| |
| If you're willing to accept a lower framerate, JavaScript execution can |
| potentially be made to run faster (which can improve startup |
| time, browse-to-watch time, and input latency). Without any special |
| settings in place, the renderer will attempt to render each frame as fast |
| as it can, limited only by the display's refresh rate, which is usually 60Hz. |
| By artificially throttling this rate to a lower value, like 30Hz, CPU |
| resources can be freed to work on other tasks. You can enable framerate |
| throttling by setting a value for `cobalt_minimum_frame_time_in_milliseconds` |
| in your platform's `gyp_configuration.gypi` file. Setting it to 33, for |
| example, will throttle Cobalt's renderer to 30 frames per second. |
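| |
| As a rough illustration, the override might look like the fragment below in |
| a platform's `gyp_configuration.gypi`. The value follows from 1000 ms / 30 |
| frames, which is about 33 ms per frame. Whether the value is written as a |
| number or as a quoted string is an assumption here; match the form used for |
| the default in `cobalt_configuration.gypi`. |
| |
| ``` |
| { |
|   'variables': { |
|     # Assumed form: 33 ms per frame throttles rendering to about 30 fps. |
|     'cobalt_minimum_frame_time_in_milliseconds': 33, |
|   }, |
| } |
| ``` |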
| |
| **Tags:** *gyp_configuration.gypi, framerate, startup, browse-to-watch, |
| input latency.* |
| |
| |
| ### Image cache capacity |
| |
| Cobalt's image cache is used to cache decoded image data. The image data |
| in the image cache is stored as a texture, and so it will occupy GPU memory. |
| The image cache capacity dictates how long images will be kept resident in |
| memory even if they are not currently visible on the web page. By reducing |
| this value, you can lower GPU memory usage, at the cost of having Cobalt |
| make more network requests and image decodes for previously seen images. |
| Cobalt will automatically set the image cache capacity to a reasonable value, |
| but if you wish to override this, you can do so by setting the |
| `image_cache_size_in_bytes` variable in your `gyp_configuration.gypi` file. For |
| the YouTube web app, we have found that at 1080p, 32MB will allow around |
| 5 thumbnail shelves to stay resident at a time, with 720p and 4K resolutions |
| using proportionally less and more memory, respectively. |
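| |
| A minimal sketch of such an override, using the 32MB figure mentioned above |
| and assuming it means 32 * 1024 * 1024 = 33554432 bytes: |
| |
| ``` |
| { |
|   'variables': { |
|     # 32MB expressed in bytes; adjust for your target resolution. |
|     'image_cache_size_in_bytes': 33554432, |
|   }, |
| } |
| ``` |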
| |
| **Tags:** *gyp_configuration.gypi, cpu memory, gpu memory.* |
| |
| |
| ### Image cache capacity multiplier during video playback |
| |
| Cobalt provides a feature where the image cache capacity will be reduced |
| as soon as video playback begins. This can be useful for reducing peak |
| GPU memory usage, which usually occurs during video playback. The |
| downside to lowering the image cache during video playback is that it |
| may need to evict some images when the capacity changes, and so it is |
| more likely that Cobalt will have to re-download and decode images after |
| returning from video playback. Note that this feature is not well tested. |
| The feature can be activated by setting |
| `image_cache_capacity_multiplier_when_playing_video` to a value between |
| `0.0` and `1.0` in your `gyp_configuration.gypi` file. The image cache |
| capacity will be multiplied by this value during video playback. |
| |
| **Tags:** *gyp_configuration.gypi, gpu memory.* |
| |
| |
| ### Scratch Surface cache capacity |
| |
| This only affects GLES renderers. While rasterizing a frame, it is |
| occasionally necessary to render to a temporary offscreen surface and then |
| apply that surface to the original render target. Offscreen surface |
| rendering may also need to be performed multiple times per frame. The |
| scratch surface cache will keep allocated a set of scratch textures that |
| will be reused (within and across frames) for offscreen rendering. Reusing |
| offscreen surfaces allows render target allocations, which can be expensive |
| on some platforms, to be minimized. However, it has been found that on some |
| platforms (especially those with tiled renderers, like the Raspberry Pi's |
| Broadcom VideoCore), reading and writing again and again to the same texture |
| can result in performance degradation. Memory may also be saved by |
| disabling this cache, since a filled cache may be holding on to memory |
| that it is not currently using. This setting can |
| be adjusted by setting `surface_cache_size_in_bytes` in your |
| `gyp_configuration.gypi` file. A value of `0` will disable the surface cache. |
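| |
| For example, a platform that sees degradation from texture reuse could try |
| disabling the cache with a fragment along these lines (a sketch; only the |
| variable name and the meaning of `0` come from the text above): |
| |
| ``` |
| { |
|   'variables': { |
|     # 0 disables the scratch surface cache entirely. |
|     'surface_cache_size_in_bytes': 0, |
|   }, |
| } |
| ``` |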
| |
| **Tags:** *gyp_configuration.gypi, gpu memory, framerate.* |
| |
| |
| ### Glyph atlas size |
| |
| This only affects GLES renderers. Skia sets up glyph atlases to which |
| it software rasterizes glyphs the first time they are encountered, and |
| from which the glyphs are used as textures for hardware accelerated glyph |
| rendering to the render target. Reducing the atlas size will lower |
| GPU memory usage, but at the cost of performance, as text glyphs will be |
| less likely to be cached already. Note that if experimenting with |
| modifications to this setting, be sure to test many languages, as some |
| are more demanding (e.g. Chinese and Japanese) on the glyph cache than |
| others. This value can be adjusted by changing the values of |
| the `skia_glyph_atlas_width` and `skia_glyph_atlas_height` variables in your |
| `gyp_configuration.gypi` file. Note that by default, these will be |
| automatically configured by Cobalt to values found to be optimal for |
| the application's resolution. |
| |
| **Tags:** *gyp_configuration.gypi, gpu memory, input latency, framerate.* |
| |
| |
| ### Software surface cache capacity |
| |
| This only affects Starboard Blitter API renderers. The Starboard Blitter API |
| has only limited support for rendering special effects, so often Cobalt will |
| have to fall back to a software rasterizer for rendering certain visual |
| elements (most notably, text). In order to avoid expensive software |
| renders, the results are cached and re-used across frames. The software |
| surface cache is crucial to achieving an acceptable framerate on Blitter API |
| platforms. The size of this cache is specified by the |
| `software_surface_cache_size_in_bytes` variable in `gyp_configuration.gypi`. |
| |
| **Tags:** *gyp_configuration.gypi, gpu memory, framerate.* |
| |
| |
| ### Toggle Just-In-Time JavaScript Compilation |
| |
| Just-in-time (JIT) compilation of JavaScript is well known to significantly |
| improve the speed of JavaScript execution. However, in the context of Cobalt |
| and its web apps (like YouTube's HTML5 TV application), JITting may not be |
| the best or fastest thing to do. Enabling JIT can result in Cobalt using |
| more memory (to store compiled code) and can also actually slow down |
| JavaScript execution (e.g. time must now be spent compiling code). It is |
| recommended that JIT support be left disabled, but you can experiment with |
| it by setting the `cobalt_enable_jit` variable in `gyp_configuration.gypi` |
| to `1` to enable JIT, or `0` to disable it. |
| |
| **Tags:** *gyp_configuration.gypi, startup, browse-to-watch, input latency, |
| cpu memory.* |
| |
| |
| ### Garbage collection trigger threshold |
| |
| The SpiderMonkey JavaScript engine provides a parameter that describes how |
| aggressive it will be at performing garbage collections to reduce memory |
| usage. By lowering this value, garbage collection will occur more often, |
| thus reducing performance, but memory usage will be lowered. We have found |
| that performance reductions are modest, so it is not unreasonable to set this |
| value to something low like 1MB if your platform is low on memory. This |
| setting can be adjusted by setting the value of |
| `mozjs_garbage_collection_threshold_in_bytes` in your `gyp_configuration.gypi` |
| file. |
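| |
| As an illustration, the 1MB threshold suggested above could be written as |
| follows, assuming 1MB means 1024 * 1024 = 1048576 bytes: |
| |
| ``` |
| { |
|   'variables': { |
|     # Lower values trigger garbage collection more often, trading some |
|     # performance for lower memory usage. |
|     'mozjs_garbage_collection_threshold_in_bytes': 1048576, |
|   }, |
| } |
| ``` |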
| |
| **Tags:** *gyp_configuration.gypi, startup, browse-to-watch, input latency, |
| cpu memory.* |
| |
| |
| ### Ensure that you are not requesting Cobalt to render unchanging frames |
| |
| Some platforms require that the display buffer is swapped frequently, and |
| so in these cases Cobalt will render the scene every frame, even if it is |
| not changing, which consumes CPU resources. This behavior is defined by the |
| value of `SB_MUST_FREQUENTLY_FLIP_DISPLAY_BUFFER` in your platform's |
| `configuration_public.h` file. Unless your platform is restricted in this |
| aspect, you should ensure that `SB_MUST_FREQUENTLY_FLIP_DISPLAY_BUFFER` |
| is set to `0`. |
| |
| **Tags:** *configuration_public.h, startup, browse-to-watch, input latency, |
| framerate.* |
| |
| |
| ### Try enabling rendering only to regions that change |
| |
| If you set the |
| [`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi) variable, |
| `render_dirty_region_only` to `1`, then Cobalt will invoke logic to detect which |
| part of the frame has been affected by animations and can be configured to only |
| render to that region. However, this feature requires support from the driver |
| for GLES platforms. In particular, `eglChooseConfig()` will first be called |
| with `EGL_SWAP_BEHAVIOR_PRESERVED_BIT` set in its attribute list. If this |
| fails, Cobalt will call `eglChooseConfig()` again without |
| `EGL_SWAP_BEHAVIOR_PRESERVED_BIT` set and dirty region rendering will |
| be disabled. By having Cobalt render only small parts of the screen, |
| CPU (and GPU) resources can be freed to work on other tasks. This can |
| especially affect startup time since usually only a small part of the |
| screen is updating (e.g. displaying an animated spinner). Thus, if |
| possible, ensure that your EGL/GLES driver supports |
| `EGL_SWAP_BEHAVIOR_PRESERVED_BIT`. Note that it is possible (but not |
| necessary) that GLES drivers will implement this feature by allocating a new |
| offscreen buffer, which can significantly affect GPU memory usage. If you are |
| on a Blitter API platform, enabling this functionality will result in the |
| allocation and blit of a fullscreen "intermediate" back buffer target. |
| |
| **Tags:** *startup, framerate, gpu memory.* |
| |
| |
| ### Ensure that thread priorities are respected |
| |
| Cobalt makes use of thread priorities to ensure that animations remain smooth |
| even while JavaScript is being executed, and to ensure that JavaScript is |
| processed (e.g. in response to a key press) before images are decoded. Thus |
| having support for priorities can improve the overall performance of the |
| application. To enable thread priority support, you should set the value |
| of `SB_HAS_THREAD_PRIORITY_SUPPORT` to `1` in your `configuration_public.h` |
| file, and then also ensure that your platform's implementation of |
| `SbThreadCreate()` properly forwards the priority parameter down to the |
| platform. |
| |
| **Tags:** *configuration_public.h, framerate, startup, browse-to-watch, |
| input latency.* |
| |
| |
| ### Tweak compiler/linker optimization flags |
| |
| Huge performance improvements can be obtained by ensuring that the right |
| optimizations are enabled by your compiler and linker flag settings. You |
| can set these up within `gyp_configuration.gypi` by adjusting the list |
| variables `compiler_flags` and `linker_flags`. See also |
| `compiler_flags_gold` and `linker_flags_gold` which describe flags that |
| apply only to gold builds where performance is critical. Note that |
| unless you explicitly set this up, it is unlikely that compiler/linker |
| flags will carry over from external shell environment settings; they |
| must be set explicitly in `gyp_configuration.gypi`. |
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency.* |
| |
| |
| #### Link Time Optimization (LTO) |
| |
| If your toolchain supports it, it is recommended that you enable the LTO |
| optimization, as it has been reported to yield significant performance |
| improvements in many high profile projects. |
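| |
| With GCC- or Clang-style toolchains, LTO is typically enabled by passing |
| `-flto` at both compile and link time. The fragment below is a sketch under |
| that assumption, using the `compiler_flags_gold` and `linker_flags_gold` |
| variables described above. It shows only the added flag; keep whatever flags |
| your platform already sets, and consult your toolchain's documentation for |
| the exact option it expects. |
| |
| ``` |
| { |
|   'variables': { |
|     'compiler_flags_gold': [ |
|       '-flto',  # Assumed GCC/Clang spelling of the LTO option. |
|     ], |
|     'linker_flags_gold': [ |
|       '-flto',  # LTO must also be requested at link time. |
|     ], |
|   }, |
| } |
| ``` |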
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency.* |
| |
| |
| #### The GCC '-mplt' flag for MIPS architectures |
| |
| The '-mplt' flag has been found to improve all around performance by |
| ~20% on MIPS architecture platforms. If your platform has a MIPS |
| architecture, it is suggested that you enable this flag in gold builds. |
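| |
| A sketch of how this might look for a GCC-based MIPS toolchain, assuming the |
| flag is added through `compiler_flags_gold` alongside your existing flags: |
| |
| ``` |
| { |
|   'variables': { |
|     'compiler_flags_gold': [ |
|       '-mplt',  # GCC MIPS option; applies to gold builds only here. |
|     ], |
|   }, |
| } |
| ``` |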
| |
| **Tags:** *gyp_configuration.gypi, framerate, startup, browse-to-watch, |
| input latency.* |
| |
| |
| ### Close "Stats for Nerds" when measuring performance |
| |
| The YouTube web app offers a feature called "Stats for Nerds" that enables |
| a stats overlay to appear on the screen during video playback. Rendering |
| this overlay requires a significant amount of processing, so it is |
| recommended that all performance evaluation is done without the |
| "Stats for Nerds" overlay active. This can greatly affect browse-to-watch |
| time and potentially affect the video frame drop rate. |
| |
| **Tags:** *browse-to-watch, framerate, youtube.* |
| |
| |
| ### Close the debug console when measuring performance |
| |
| Cobalt provides a debug console in non-gold builds to allow the display |
| of variables overlaid on top of the application. This can be helpful |
| for debugging issues and keeping track of things like app lifetime, but |
| rendering the debug console consumes significant resources while it is |
| visible, so it should be hidden when performance is being evaluated. |
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency.* |
| |
| |
| ### Toggle between dlmalloc and system allocator |
| |
| Cobalt includes dlmalloc and can be configured to use it to handle all |
| memory allocations. However, you should carefully evaluate whether |
| dlmalloc performs better or worse than your system allocator, in terms |
| of both memory fragmentation and runtime performance. |
| To use dlmalloc, you should adjust your `starboard_platform.gyp` file to |
| use the Starboard [`starboard/memory.h`](../../starboard/memory.h) function |
| implementations defined in |
| [`starboard/shared/dlmalloc/`](../../starboard/shared/dlmalloc). To use |
| your system allocator, you should adjust your `starboard_platform.gyp` file |
| to use the Starboard [`starboard/memory.h`](../../starboard/memory.h) function |
| implementations defined in |
| [`starboard/shared/iso/`](../../starboard/shared/iso). |
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency, cpu memory.* |
| |
| |
| ### Media buffer allocation strategy |
| |
| During video playback, memory is reserved by Cobalt to contain the encoded |
| media data (separated into video and audio), and we refer to this memory |
| as the media buffers. By default, Cobalt pre-allocates the memory and |
| wraps it with a custom allocator, in order to avoid fragmentation of main |
| memory. However, depending on your platform and your system allocator, |
| overall memory usage may improve if media buffer allocations were made |
| normally via the system allocator instead. This can be achieved by setting |
| `cobalt_media_buffer_initial_capacity` and `cobalt_media_buffer_allocation_unit` |
| to 0 in `gyp_configuration.gypi`. Note also that if you choose to pre-allocate |
| memory, for 1080p video it has been found that 24MB is a good media buffer size. |
| The pre-allocated media buffer capacity size can be adjusted by modifying the |
| value of `cobalt_media_buffer_initial_capacity` mentioned above. |
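| |
| The two variables named above could be set as in the following sketch. The |
| byte value in the comment assumes that 24MB means 24 * 1024 * 1024 bytes. |
| |
| ``` |
| { |
|   'variables': { |
|     # Setting both to 0 routes media buffer allocations through the |
|     # system allocator instead of a pre-allocated pool. |
|     'cobalt_media_buffer_initial_capacity': 0, |
|     'cobalt_media_buffer_allocation_unit': 0, |
|     # To pre-allocate instead, give the initial capacity in bytes, |
|     # e.g. 25165824 for the 24MB suggested for 1080p video. |
|   }, |
| } |
| ``` |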
| |
| **Tags:** *gyp_configuration.gypi, cpu memory.* |
| |
| |
| ### Adjust media buffer size settings |
| |
| Many of the parameters around media buffer allocation can be adjusted in your |
| `gyp_configuration.gypi` file. The variables in question are the family of |
| `cobalt_media_*` variables, whose default values are specified in |
| [`cobalt_configuration.gypi`](../build/cobalt_configuration.gypi). In |
| particular, if your maximum video output resolution is less than 1080p, then you |
| may lower the budgets for many of the categories according to your maximum |
| resolution. |
| |
| **Tags:** *cpu memory.* |
| |
| |
| ### Avoid using the YouTube web app FPS counter (i.e. "?fps=1") |
| |
| The YouTube web app is able to display a Frames Per Second (FPS) counter in the |
| corner when the URL parameter "fps=1" is set. Unfortunately, activating this |
| counter will cause Cobalt to re-layout and re-render the scene frequently in |
| order to update it, and this work has the potential to affect many |
| aspects of performance. Instead, we recommend either measuring the framerate |
| in the GLES driver and periodically printing it, or hacking Cobalt to measure |
| the framerate and periodically print it. In order to |
| hack in an FPS counter, you will want to look at the |
| `HardwareRasterizer::Impl::Submit()` function in |
| [`cobalt/renderer/rasterizer/skia/hardware_rasterizer.cc`](../renderer/rasterizer/skia/hardware_rasterizer.cc). |
| TODO: Cobalt should add a command line switch to |
| enable printing of the framerate in gold builds. |
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency.* |
| |
| |
| ### Implement hardware image decoding |
| |
| The Starboard header file [`starboard/image.h`](../../starboard/image.h) defines |
| functions that allow platforms to implement hardware-accelerated image |
| decoding, if available. In particular, if `SbImageIsDecodeSupported()` returns |
| true for the specified mime type and output format, then instead of using the |
| software-based libpng or libjpeg libraries, Cobalt will instead call |
| `SbImageDecode()`. `SbImageDecode()` is expected to return a decoded image as |
| a `SbDecodeTarget` object, from which Cobalt will extract a GL texture or |
| Blitter API surface object when rendering. If non-CPU hardware is used to |
| decode images, it would alleviate the load on the CPU, and possibly also |
| increase the speed at which images can be decoded. |
| |
| **Tags:** *startup, browse-to-watch, input latency.* |
| |
| |
| ### Use Chromium's about:tracing tool to debug Cobalt performance |
| |
| Cobalt has support for generating profiling data that is viewable through |
| Chromium's about:tracing tool. This feature is available in all Cobalt |
| configurations except for "gold" ("qa" is the best build to use for performance |
| investigations here). There are currently two ways to tell Cobalt |
| to generate this data: |
| |
| 1. The command line option, "--timed_trace=XX" will instruct Cobalt to trace |
| upon startup, for XX seconds (e.g. "--timed_trace=25"). When completed, |
| the output will be written to the file `timed_trace.json`. |
| 2. Using the debug console (hit CTRL+O on a keyboard once or twice), type in the |
| command "d.trace()" and hit enter. Cobalt will begin a trace. After |
| some time has passed (and presumably you have performed some actions), you |
| can open the debug console again and type "d.trace()" again to end the trace. |
| The trace output will be written to the file `triggered_trace.json`. |
| |
| The directory the output files will be placed within is the directory that the |
| Starboard function `SbSystemGetPath()` returns with a `path_id` of |
| `kSbSystemPathDebugOutputDirectory`, so you may need to check your |
| implementation of `SbSystemGetPath()` to discover where this is. |
| |
| Once the trace file is created, it can be opened in Chrome by navigating to |
| `about:tracing` or `chrome://tracing`, clicking the "Load" button near the top |
| left, and then opening the JSON file created earlier. |
| |
| Of particular interest in the output view is the `MainWebModule` thread where |
| JavaScript and layout are executed, and `Rasterizer` where per-frame rendering |
| takes place. |
| |
| **Tags:** *framerate, startup, browse-to-watch, input latency.* |