Skia's New Approach to SIMD

Most hot software paths in Skia are implemented with processor-specific SIMD instructions. For graphics performance, the parallelism from SIMD is essential: there is simply no realistic way to eek the same performance out of portable C++ code as we can from the SSE family of instruction sets on x86 or from NEON on ARM or from MIPS32's DSP instructions. Depending on the particular code path and math involved, we see 2, 4, 8, or even ~16x performance increases over portable code when really exploiting the processor-specific SIMD instructions.

But the SIMD code we‘ve piled up over the years has some serious problems. It’s often quite low-level, with poor factoring leading to verbose, bug prone, and difficult to read code. SIMD instrinsic types and functions take a good long while to get used to reading, let alone writing, and assembly is generally just a complete non-starter. SIMD coverage of Skia methods is not dense: a particular drawing routine might be specialized for NEON but not for SSE, or might have a MIPS DSP implementation but no NEON. Even when we have full instruction set coverage, the implementations of these specialized routines may not produce identical results, either when compared with each other or with our portable fallback code. The SIMD implementations are often simply incorrect, but the code is so fragile and difficult to understand, we can‘t fix it. There are long lived bugs in our tracker involving crashes and buffer under- and overflows that we simply cannot fix because no one on the team understands the code involved. And finally, to top it all off, the code isn’t always even really that fast.

This all needs to change. I want Skia developers to be able to write correct, clear, and fast code, and in software rendering, SIMD is the only way to get “fast”. This document outlines a new vision for how Skia will use SIMD instructions with no compromises, writing clear code once that runs quickly on all platforms we support.

The Plan

We‘re going to wrap low-level platform-specific instrinsics with zero-cost abstractions with interfaces matching Skia’s higher-level-but-still-quite-low-level use cases. Skia code will write to this interface once, which then compiles to efficient SSE, NEON, or portable code (MIPS is quite TBD, for now group it conceptually under portable code) via platform-specific backends. The key here is to find the right sweet spot of abstraction that allows us to express the graphical concepts we want in Skia while allowing each of those platform-specific backends flexibility to implement those concepts as efficiently as possible.

While Skia uses a mix of float, 32-bit, 16-bit, and 8-bit integer SIMD instructions, 32-bit integers fall quite behind the rest in usage. Since we tend to operate on 8888 ARGB values, 8-bit SIMD tends to be the most natural and fastest approach, but when multiplication gets involved (essentially all the time), 16-bit SIMD inevitably gets tangled in there. For some operations like division, square roots, or math with high range or precision requirements, we expand our 8-bit pixel components up to floats, and working with a single pixel as a 4-float vector becomes most natural. This plan focuses on how we'll deal with these majority cases: floats, and 8- and 16-bit integers.

`SkNf` for floats

Wrapping floats with an API that allows efficient implementation on SSE and NEON is by far the easiest task involved here. Both SSE and NEON naturally work with 128-bit vectors of 4 floats, and they have a near 1-to-1 correspondence between operations. Indeed, the correspondence is so close that it's tempting to solve this problem by picking one set of intrinsics, e.g. NEON, and just #defineing portable and SSE implementations of NEON:

#define float32x4_t __m128
#define vmulq_f32 _mm_mul_ps
#define vaddq_f32 _mm_add_ps
#define vld1q_f32 _mm_loadu_ps
#define vst1q_f32 _mm_storeu_ps
...

This temptation starts to break down when you notice:

there are operations that don't quite correspond, e.g. _mm_movemask_ps; and
math written with either SSE or NEON instrinsics is still very hard to read; and
sometimes we want to work with 4 floats, but sometimes 2, maybe even 8, etc.

So we use a wrapper class SkNf<N>, parameterized on N, how many floats the vector contains, constrained at compile time to be a power of 2. SkNf provides all the methods you‘d expect on vector of N floats: loading and storing from float arrays, all the usual arithmetic operators, min and max, low and high precision reciprocal and sqrt, all the usual comparison operators, and a .thenElse() method acting as a non-branching ternary ?: operator. To support Skia’s main graphic needs, SkNf can also load and store from a vector of N bytes, converting up to a float when loading and rounding down to [0,255] when storing.

As a convenience, SkNf<N> has two default implementations: SkNf<1> performs all these operations on a single float, and the generic SkNf<N> simply recurses onto two SkNf<N/2>. This allows our different backends to inject specialiations where most natural: the portable backend does nothing, so all SkNf<N> recurse down to the default SkNf<1>; the NEON backend specializes SkNf<2> with float32x2_t and 64-bit SIMD methods, and SkNf<4> with float32x4_t and 128-bit SIMD methods; the SSE backend specializes both SkNf<4> and SkNf<2> to use the full or lower half of an __m128 vector, respectively. A future AVX backend could simply drop in an SkNf<8> specialization.

Our most common float use cases are working with 2D coordinates and with 4-float-component pixels. Since these are so common, we've made simple typedefs for these two use cases, Sk2f and Sk4f, and also versions reminding you that it can work with vectors of SkScalar (a Skia-specific float typedef) too: Sk2s, Sk4s.

`SkNf` in practice

To date we have implemented several parts of Skia using Sk4f:

SkColorMatrixFilter
SkRadialGradient
SkColorCubeFilter
Three complicated SkXfermode subclasses: ColorBurn, ColorDodge, and SoftLight.

In all these cases, we have been able to write a single implementation, producing the same results cross-platform. The first three of those sites using Sk4f are entirely newly vectorized, and run much faster than the previous portable implementations. The 3 Sk4f transfermodes replaced portable, SSE, and NEON implementations which all produced different results, and the Sk4f versions are all faster than their predecessors.

SkColorCubeFilter stands out as a particularly good example of how and why to use Sk4f over custom platform-specific intrinsics. Starting from some portable code and a rather slow SSE-only sketch, a Google Chromium dev, an Intel contributor, and I worked together to write an Sk4f version that's more than twice as fast as the original, and runs fast on both x86 and ARM.

`SkPx` for 8- and 16-bit fixed point math

Building an abstraction layer over 8- and 16-bit fixed point math has proven to be quite a challenge. In fixed point, NEON and SSE again have some overlap, and they could probably be implemented in terms of each other if you were willing to sacrifice performance on SSE in favor of NEON or vice versa. But unlike with floats, where SkNf is really a pretty thin veneer over very similar operations, to really get the best performance out of each fixed point instruction set you need to work in rather different idioms.

SkPx, our latest approach (there have been alpha Sk16b and beta Sk4px predecessors) to 8- and 16-bit SIMD tries to abstract over those idioms to again allow Skia developers to write one piece of clear graphics code that different backends can translate into their native intrinsics idiomatically.

SkPx is really a family of three related types:

SkPx itself represents between 1 and SkPx::N 8888 ARGB pixels, where SkPx::N is a backend-specific compile-time power of 2.
SkPx::Wide represents those same pixels, but with 16-bits of space per component.
SkPx::Alpha represents the alpha channels of those same pixels.

SkPx, Wide and Alpha create a somewhat complicated algebra of operations entirely motivated by the graphical operations we need to perform. Here are some examples:

SkPx::LoadN(const uint32_t*)   -> SkPx  // Load full cruising-speed SkPx.
SkPx::Load(n, const uint32_t*) -> SkPx  // For the 0<n<N ragged tail.

SkPx.storeN(uint32_t*)   // Store a full SkPx.
SkPx.store(n, uint32_t*) // For the ragged 0<n<N tail.

SkPx + SkPx -> SkPx
SkPx - SkPx -> SkPx
SkPx.saturatedAdd(SkPx) -> SkPx

SkPx.alpha() -> Alpha   // Extract alpha channels.
Alpha::LoadN(const uint8_t*)   -> Alpha  // Like SkPx loads, in 8-bit steps.
Alpha::Load(n, const uint8_t*) -> Alpha

SkPx.widenLo()   -> Wide  // argb -> 0a0r0g0b
SkPx.widenHi()   -> Wide  // argb -> a0r0g0b0
SkPx.widenLoHi() -> Wide  // argb -> aarrggbb

Wide + Wide -> Wide
Wide - Wide -> Wide
Wide << bits -> Wide
Wide >> bits -> Wide

SkPx * Alpha -> Wide    // 8 x 8 -> 16 bit
Wide.div255() -> SkPx   // 16-bit -> 8 bit

// A faster approximation of (SkPx * Alpha).div255().
SkPx.approxMulDiv255(Alpha) -> SkPx

We allow each SkPx backend to choose how it physically represents SkPx, SkPx::Wide, and SkPx::Alpha and to choose any power of two as its SkPx::N sweet spot. Code working with SkPx typically runs a loop like this:

while (n >= SkPx::N) {
	// Apply some_function() to SkPx::N pixels.
	some_function(SkPx::LoadN(src), SkPx::LoadN(dst)).storeN(dst);
	src += SkPx::N; dst += SkPx::N; n -= SkPx::N;
}
if (n > 0) {
	// Finish up the tail of 0<n<N pixels.
	some_function(SkPx::Load(n, src), SkPx::Load(n, dst)).store(n, dst);
}

The portable code is of course the simplest place to start looking at implementation details: its SkPx is just uint8_t[4], its SkPx::Wide uint16_t[4], and its SkPx::Alpha just uint8_t. Its preferred number of pixels to work with is SkPx::N = 1. (Amusingly, GCC and Clang seem pretty good about autovectorizing this backend using 32-bit math, which typically ends up within ~2x of the best we can do ourselves.)

The most important difference between SSE and NEON when working in fixed point is that SSE works most naturally with 4 interlaced pixels at a time (argbargbargbargb), while NEON works most naturally with 8 planar pixels at a time (aaaaaaaa, rrrrrrrr, gggggggg, bbbbbbbb). Trying to jam one of these instruction sets into the other's idiom ends up somewhere between not quite optimal (working with interlaced pixels in NEON) and ridiculously inefficient (trying to work with planar pixels in SSE).

So SkPx‘s SSE backend sets N to 4 pixels, stores them interlaced in an __m128i, representing Wide as two __m128i and Alpha as an __m128i with each pixel’s alpha component replicated four times. SkPx‘s NEON backend works with 8 planar pixels, loading them with vld4_u8 into an uint8x8x4_t struct of 4 8-component uint8x8_t planes. Alpha is just a single uint8x8_t 8-component plane, and Wide is NEON’s natural choice, uint16x8x4_t.

(It‘s fun to speculate what an AVX2 backend might look like. Do we make SkPx declare it wants to work with 8 pixels at a time, or leave it at 4? Does SkPx become __m256i, or maybe only SkPx::Wide does? What’s the best way to represent Alpha? And of course, what about AVX-512?)

Keeping Alpha as a single dense uint8x8_t plane allows the NEON backend to be much more efficient with operations involving Alpha. We‘d love to do this in SSE too, where we store Alpha somewhat inefficiently with each alpha component replicated 4 times, but SSE simply doesn’t expose efficient ways to transpose interlaced pixels into planar pixels and vice versa. We could write them ourselves, but only as rather complex compound operations that slow things down more than they help.

These details will inevitably change over time. The important takeaway here is, to really work at peak throughput in SIMD fixed point, you need to work with the idiom of the instruction set, and SkPx is a design that can present a consistent interface to abstract away backend details for you.

`SkPx` in practice

I am in the process of rolling out SkPx. Some Skia code is already using its precursor, Sk4px, which is a bit like SkPx that forces N=4 and restricts the layout to always use interlaced pixels: i.e. fine for SSE, not great for NEON.

All ~20 other SkXfermode subclasses that are not implemented with SkNf.
SkBlitRow::Color32
SkBlitMask::BlitColor

I can certainly say that the Sk4px and SkPx implementations of these methods are clearer, less buggy, and that all the SkXfermode implementations sped up at least 2x when porting from custom per-platform intrinsics. Sk4px has lead to some pretty bad performance regressions that SkPx is designed to avoid. This is an area of active experiementation and iteration.

In Summary

I am confident that Skia developers soon will be able to write single, clear, maintainable, and of course fast, graphical algorithms using SkNf and SkPx. As I have been porting our algorithms, I have perversely enjoyed replacing thousands of lines of unmaintainable code with usually mere dozens of readable code.

I‘m also confident that if you’re looking to use floats, SkNf is ready. Do not write NEON or SSE SIMD code if you're looking to use floats, and do not accept external contributions that do so. Use SkNf instead.

SkPx is less proven, and while its design and early tests look promising, it's still at the stage where we should try it aware that we might need to fall back on hand-written SSE or NEON.

Skia's New Approach to SIMD

The Plan

SkNf for floats

SkNf in practice

SkPx for 8- and 16-bit fixed point math

SkPx in practice

In Summary

`SkNf` for floats

`SkNf` in practice

`SkPx` for 8- and 16-bit fixed point math

`SkPx` in practice