Blame - src/third_party/llvm-project/clang/docs/ControlFlowIntegrityDesign.rst - cobalt

blob: 15e20e1e1d8e5862b13c425444cc6def8211ca84 [file] [log] [blame]

Andrew Top	61a8495	2019-04-30 15:07:33 -0700	[diff] [blame]	1	===========================================
				2	Control Flow Integrity Design Documentation
				3	===========================================
				4
				5	This page documents the design of the :doc:`ControlFlowIntegrity` schemes
				6	supported by Clang.
				7
				8	Forward-Edge CFI for Virtual Calls
				9	==================================
				10
				11	This scheme works by allocating, for each static type used to make a virtual
				12	call, a region of read-only storage in the object file holding a bit vector
				13	that maps onto to the region of storage used for those virtual tables. Each
				14	set bit in the bit vector corresponds to the `address point`_ for a virtual
				15	table compatible with the static type for which the bit vector is being built.
				16
				17	For example, consider the following three C++ classes:
				18
				19	.. code-block:: c++
				20
				21	struct A {
				22	virtual void f1();
				23	virtual void f2();
				24	virtual void f3();
				25	};
				26
				27	struct B : A {
				28	virtual void f1();
				29	virtual void f2();
				30	virtual void f3();
				31	};
				32
				33	struct C : A {
				34	virtual void f1();
				35	virtual void f2();
				36	virtual void f3();
				37	};
				38
				39	The scheme will cause the virtual tables for A, B and C to be laid out
				40	consecutively:
				41
				42	.. csv-table:: Virtual Table Layout for A, B, C
				43	:header: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
				44
				45	A::offset-to-top, &A::rtti, &A::f1, &A::f2, &A::f3, B::offset-to-top, &B::rtti, &B::f1, &B::f2, &B::f3, C::offset-to-top, &C::rtti, &C::f1, &C::f2, &C::f3
				46
				47	The bit vector for static types A, B and C will look like this:
				48
				49	.. csv-table:: Bit Vectors for A, B, C
				50	:header: Class, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
				51
				52	A, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0
				53	B, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
				54	C, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0
				55
				56	Bit vectors are represented in the object file as byte arrays. By loading
				57	from indexed offsets into the byte array and applying a mask, a program can
				58	test bits from the bit set with a relatively short instruction sequence. Bit
				59	vectors may overlap so long as they use different bits. For the full details,
				60	see the `ByteArrayBuilder`_ class.
				61
				62	In this case, assuming A is laid out at offset 0 in bit 0, B at offset 0 in
				63	bit 1 and C at offset 0 in bit 2, the byte array would look like this:
				64
				65	.. code-block:: c++
				66
				67	char bits[] = { 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0 };
				68
				69	To emit a virtual call, the compiler will assemble code that checks that
				70	the object's virtual table pointer is in-bounds and aligned and that the
				71	relevant bit is set in the bit vector.
				72
				73	For example on x86 a typical virtual call may look like this:
				74
				75	.. code-block:: none
				76
				77	ca7fbb: 48 8b 0f mov (%rdi),%rcx
				78	ca7fbe: 48 8d 15 c3 42 fb 07 lea 0x7fb42c3(%rip),%rdx
				79	ca7fc5: 48 89 c8 mov %rcx,%rax
				80	ca7fc8: 48 29 d0 sub %rdx,%rax
				81	ca7fcb: 48 c1 c0 3d rol $0x3d,%rax
				82	ca7fcf: 48 3d 7f 01 00 00 cmp $0x17f,%rax
				83	ca7fd5: 0f 87 36 05 00 00 ja ca8511
				84	ca7fdb: 48 8d 15 c0 0b f7 06 lea 0x6f70bc0(%rip),%rdx
				85	ca7fe2: f6 04 10 10 testb $0x10,(%rax,%rdx,1)
				86	ca7fe6: 0f 84 25 05 00 00 je ca8511
				87	ca7fec: ff 91 98 00 00 00 callq *0x98(%rcx)
				88	[...]
				89	ca8511: 0f 0b ud2
				90
				91	The compiler relies on co-operation from the linker in order to assemble
				92	the bit vectors for the whole program. It currently does this using LLVM's
				93	`type metadata`_ mechanism together with link-time optimization.
				94
				95	.. _address point: http://itanium-cxx-abi.github.io/cxx-abi/abi.html#vtable-general
				96	.. _type metadata: http://llvm.org/docs/TypeMetadata.html
				97	.. _ByteArrayBuilder: http://llvm.org/docs/doxygen/html/structllvm_1_1ByteArrayBuilder.html
				98
				99	Optimizations
				100	-------------
				101
				102	The scheme as described above is the fully general variant of the scheme.
				103	Most of the time we are able to apply one or more of the following
				104	optimizations to improve binary size or performance.
				105
				106	In fact, if you try the above example with the current version of the
				107	compiler, you will probably find that it will not use the described virtual
				108	table layout or machine instructions. Some of the optimizations we are about
				109	to introduce cause the compiler to use a different layout or a different
				110	sequence of machine instructions.
				111
				112	Stripping Leading/Trailing Zeros in Bit Vectors
				113	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				114
				115	If a bit vector contains leading or trailing zeros, we can strip them from
				116	the vector. The compiler will emit code to check if the pointer is in range
				117	of the region covered by ones, and perform the bit vector check using a
				118	truncated version of the bit vector. For example, the bit vectors for our
				119	example class hierarchy will be emitted like this:
				120
				121	.. csv-table:: Bit Vectors for A, B, C
				122	:header: Class, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
				123
				124	A, , , 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ,
				125	B, , , , , , , , 1, , , , , , ,
				126	C, , , , , , , , , , , , , 1, ,
				127
				128	Short Inline Bit Vectors
				129	~~~~~~~~~~~~~~~~~~~~~~~~
				130
				131	If the vector is sufficiently short, we can represent it as an inline constant
				132	on x86. This saves us a few instructions when reading the correct element
				133	of the bit vector.
				134
				135	If the bit vector fits in 32 bits, the code looks like this:
				136
				137	.. code-block:: none
				138
				139	dc2: 48 8b 03 mov (%rbx),%rax
				140	dc5: 48 8d 15 14 1e 00 00 lea 0x1e14(%rip),%rdx
				141	dcc: 48 89 c1 mov %rax,%rcx
				142	dcf: 48 29 d1 sub %rdx,%rcx
				143	dd2: 48 c1 c1 3d rol $0x3d,%rcx
				144	dd6: 48 83 f9 03 cmp $0x3,%rcx
				145	dda: 77 2f ja e0b <main+0x9b>
				146	ddc: ba 09 00 00 00 mov $0x9,%edx
				147	de1: 0f a3 ca bt %ecx,%edx
				148	de4: 73 25 jae e0b <main+0x9b>
				149	de6: 48 89 df mov %rbx,%rdi
				150	de9: ff 10 callq *(%rax)
				151	[...]
				152	e0b: 0f 0b ud2
				153
				154	Or if the bit vector fits in 64 bits:
				155
				156	.. code-block:: none
				157
				158	11a6: 48 8b 03 mov (%rbx),%rax
				159	11a9: 48 8d 15 d0 28 00 00 lea 0x28d0(%rip),%rdx
				160	11b0: 48 89 c1 mov %rax,%rcx
				161	11b3: 48 29 d1 sub %rdx,%rcx
				162	11b6: 48 c1 c1 3d rol $0x3d,%rcx
				163	11ba: 48 83 f9 2a cmp $0x2a,%rcx
				164	11be: 77 35 ja 11f5 <main+0xb5>
				165	11c0: 48 ba 09 00 00 00 00 movabs $0x40000000009,%rdx
				166	11c7: 04 00 00
				167	11ca: 48 0f a3 ca bt %rcx,%rdx
				168	11ce: 73 25 jae 11f5 <main+0xb5>
				169	11d0: 48 89 df mov %rbx,%rdi
				170	11d3: ff 10 callq *(%rax)
				171	[...]
				172	11f5: 0f 0b ud2
				173
				174	If the bit vector consists of a single bit, there is only one possible
				175	virtual table, and the check can consist of a single equality comparison:
				176
				177	.. code-block:: none
				178
				179	9a2: 48 8b 03 mov (%rbx),%rax
				180	9a5: 48 8d 0d a4 13 00 00 lea 0x13a4(%rip),%rcx
				181	9ac: 48 39 c8 cmp %rcx,%rax
				182	9af: 75 25 jne 9d6 <main+0x86>
				183	9b1: 48 89 df mov %rbx,%rdi
				184	9b4: ff 10 callq *(%rax)
				185	[...]
				186	9d6: 0f 0b ud2
				187
				188	Virtual Table Layout
				189	~~~~~~~~~~~~~~~~~~~~
				190
				191	The compiler lays out classes of disjoint hierarchies in separate regions
				192	of the object file. At worst, bit vectors in disjoint hierarchies only
				193	need to cover their disjoint hierarchy. But the closer that classes in
				194	sub-hierarchies are laid out to each other, the smaller the bit vectors for
				195	those sub-hierarchies need to be (see "Stripping Leading/Trailing Zeros in Bit
				196	Vectors" above). The `GlobalLayoutBuilder`_ class is responsible for laying
				197	out the globals efficiently to minimize the sizes of the underlying bitsets.
				198
				199	.. _GlobalLayoutBuilder: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Transforms/IPO/LowerTypeTests.h?view=markup
				200
				201	Alignment
				202	~~~~~~~~~
				203
				204	If all gaps between address points in a particular bit vector are multiples
				205	of powers of 2, the compiler can compress the bit vector by strengthening
				206	the alignment requirements of the virtual table pointer. For example, given
				207	this class hierarchy:
				208
				209	.. code-block:: c++
				210
				211	struct A {
				212	virtual void f1();
				213	virtual void f2();
				214	};
				215
				216	struct B : A {
				217	virtual void f1();
				218	virtual void f2();
				219	virtual void f3();
				220	virtual void f4();
				221	virtual void f5();
				222	virtual void f6();
				223	};
				224
				225	struct C : A {
				226	virtual void f1();
				227	virtual void f2();
				228	};
				229
				230	The virtual tables will be laid out like this:
				231
				232	.. csv-table:: Virtual Table Layout for A, B, C
				233	:header: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
				234
				235	A::offset-to-top, &A::rtti, &A::f1, &A::f2, B::offset-to-top, &B::rtti, &B::f1, &B::f2, &B::f3, &B::f4, &B::f5, &B::f6, C::offset-to-top, &C::rtti, &C::f1, &C::f2
				236
				237	Notice that each address point for A is separated by 4 words. This lets us
				238	emit a compressed bit vector for A that looks like this:
				239
				240	.. csv-table::
				241	:header: 2, 6, 10, 14
				242
				243	1, 1, 0, 1
				244
				245	At call sites, the compiler will strengthen the alignment requirements by
				246	using a different rotate count. For example, on a 64-bit machine where the
				247	address points are 4-word aligned (as in A from our example), the ``rol``
				248	instruction may look like this:
				249
				250	.. code-block:: none
				251
				252	dd2: 48 c1 c1 3b rol $0x3b,%rcx
				253
				254	Padding to Powers of 2
				255	~~~~~~~~~~~~~~~~~~~~~~
				256
				257	Of course, this alignment scheme works best if the address points are
				258	in fact aligned correctly. To make this more likely to happen, we insert
				259	padding between virtual tables that in many cases aligns address points to
				260	a power of 2. Specifically, our padding aligns virtual tables to the next
				261	highest power of 2 bytes; because address points for specific base classes
				262	normally appear at fixed offsets within the virtual table, this normally
				263	has the effect of aligning the address points as well.
				264
				265	This scheme introduces tradeoffs between decreased space overhead for
				266	instructions and bit vectors and increased overhead in the form of padding. We
				267	therefore limit the amount of padding so that we align to no more than 128
				268	bytes. This number was found experimentally to provide a good tradeoff.
				269
				270	Eliminating Bit Vector Checks for All-Ones Bit Vectors
				271	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				272
				273	If the bit vector is all ones, the bit vector check is redundant; we simply
				274	need to check that the address is in range and well aligned. This is more
				275	likely to occur if the virtual tables are padded.
				276
				277	Forward-Edge CFI for Indirect Function Calls
				278	============================================
				279
				280	Under forward-edge CFI for indirect function calls, each unique function
				281	type has its own bit vector, and at each call site we need to check that the
				282	function pointer is a member of the function type's bit vector. This scheme
				283	works in a similar way to forward-edge CFI for virtual calls, the distinction
				284	being that we need to build bit vectors of function entry points rather than
				285	of virtual tables.
				286
				287	Unlike when re-arranging global variables, we cannot re-arrange functions
				288	in a particular order and base our calculations on the layout of the
				289	functions' entry points, as we have no idea how large a particular function
				290	will end up being (the function sizes could even depend on how we arrange
				291	the functions). Instead, we build a jump table, which is a block of code
				292	consisting of one branch instruction for each of the functions in the bit
				293	set that branches to the target function, and redirect any taken function
				294	addresses to the corresponding jump table entry. In this way, the distance
				295	between function entry points is predictable and controllable. In the object
				296	file's symbol table, the symbols for the target functions also refer to the
				297	jump table entries, so that addresses taken outside the module will pass
				298	any verification done inside the module.
				299
				300	In more concrete terms, suppose we have three functions ``f``, ``g``,
				301	``h`` which are all of the same type, and a function foo that returns their
				302	addresses:
				303
				304	.. code-block:: none
				305
				306	f:
				307	mov 0, %eax
				308	ret
				309
				310	g:
				311	mov 1, %eax
				312	ret
				313
				314	h:
				315	mov 2, %eax
				316	ret
				317
				318	foo:
				319	mov f, %eax
				320	mov g, %edx
				321	mov h, %ecx
				322	ret
				323
				324	Our jump table will (conceptually) look like this:
				325
				326	.. code-block:: none
				327
				328	f:
				329	jmp .Ltmp0 ; 5 bytes
				330	int3 ; 1 byte
				331	int3 ; 1 byte
				332	int3 ; 1 byte
				333
				334	g:
				335	jmp .Ltmp1 ; 5 bytes
				336	int3 ; 1 byte
				337	int3 ; 1 byte
				338	int3 ; 1 byte
				339
				340	h:
				341	jmp .Ltmp2 ; 5 bytes
				342	int3 ; 1 byte
				343	int3 ; 1 byte
				344	int3 ; 1 byte
				345
				346	.Ltmp0:
				347	mov 0, %eax
				348	ret
				349
				350	.Ltmp1:
				351	mov 1, %eax
				352	ret
				353
				354	.Ltmp2:
				355	mov 2, %eax
				356	ret
				357
				358	foo:
				359	mov f, %eax
				360	mov g, %edx
				361	mov h, %ecx
				362	ret
				363
				364	Because the addresses of ``f``, ``g``, ``h`` are evenly spaced at a power of
				365	2, and function types do not overlap (unlike class types with base classes),
				366	we can normally apply the `Alignment`_ and `Eliminating Bit Vector Checks
				367	for All-Ones Bit Vectors`_ optimizations thus simplifying the check at each
				368	call site to a range and alignment check.
				369
				370	Shared library support
				371	======================
				372
				373	EXPERIMENTAL
				374
				375	The basic CFI mode described above assumes that the application is a
				376	monolithic binary; at least that all possible virtual/indirect call
				377	targets and the entire class hierarchy are known at link time. The
				378	cross-DSO mode, enabled with -f[no-]sanitize-cfi-cross-dso relaxes
				379	this requirement by allowing virtual and indirect calls to cross the
				380	DSO boundary.
				381
				382	Assuming the following setup: the binary consists of several
				383	instrumented and several uninstrumented DSOs. Some of them may be
				384	dlopen-ed/dlclose-d periodically, even frequently.
				385
				386	- Calls made from uninstrumented DSOs are not checked and just work.
				387	- Calls inside any instrumented DSO are fully protected.
				388	- Calls between different instrumented DSOs are also protected, with
				389	a performance penalty (in addition to the monolithic CFI
				390	overhead).
				391	- Calls from an instrumented DSO to an uninstrumented one are
				392	unchecked and just work, with performance penalty.
				393	- Calls from an instrumented DSO outside of any known DSO are
				394	detected as CFI violations.
				395
				396	In the monolithic scheme a call site is instrumented as
				397
				398	.. code-block:: none
				399
				400	if (!InlinedFastCheck(f))
				401	abort();
				402	call *f
				403
				404	In the cross-DSO scheme it becomes
				405
				406	.. code-block:: none
				407
				408	if (!InlinedFastCheck(f))
				409	__cfi_slowpath(CallSiteTypeId, f);
				410	call *f
				411
				412	CallSiteTypeId
				413	--------------
				414
				415	``CallSiteTypeId`` is a stable process-wide identifier of the
				416	call-site type. For a virtual call site, the type in question is the class
				417	type; for an indirect function call it is the function signature. The
				418	mapping from a type to an identifier is an ABI detail. In the current,
				419	experimental, implementation the identifier of type T is calculated as
				420	follows:
				421
				422	- Obtain the mangled name for "typeinfo name for T".
				423	- Calculate MD5 hash of the name as a string.
				424	- Reinterpret the first 8 bytes of the hash as a little-endian
				425	64-bit integer.
				426
				427	It is possible, but unlikely, that collisions in the
				428	``CallSiteTypeId`` hashing will result in weaker CFI checks that would
				429	still be conservatively correct.
				430
				431	CFI_Check
				432	---------
				433
				434	In the general case, only the target DSO knows whether the call to
				435	function ``f`` with type ``CallSiteTypeId`` is valid or not. To
				436	export this information, every DSO implements
				437
				438	.. code-block:: none
				439
				440	void __cfi_check(uint64 CallSiteTypeId, void TargetAddr, void DiagData)
				441
				442	This function provides external modules with access to CFI checks for
				443	the targets inside this DSO. For each known ``CallSiteTypeId``, this
				444	function performs an ``llvm.type.test`` with the corresponding type
				445	identifier. It reports an error if the type is unknown, or if the
				446	check fails. Depending on the values of compiler flags
				447	``-fsanitize-trap`` and ``-fsanitize-recover``, this function may
				448	print an error, abort and/or return to the caller. ``DiagData`` is an
				449	opaque pointer to the diagnostic information about the error, or
				450	``null`` if the caller does not provide this information.
				451
				452	The basic implementation is a large switch statement over all values
				453	of CallSiteTypeId supported by this DSO, and each case is similar to
				454	the InlinedFastCheck() in the basic CFI mode.
				455
				456	CFI Shadow
				457	----------
				458
				459	To route CFI checks to the target DSO's __cfi_check function, a
				460	mapping from possible virtual / indirect call targets to the
				461	corresponding __cfi_check functions is maintained. This mapping is
				462	implemented as a sparse array of 2 bytes for every possible page (4096
				463	bytes) of memory. The table is kept readonly most of the time.
				464
				465	There are 3 types of shadow values:
				466
				467	- Address in a CFI-instrumented DSO.
				468	- Unchecked address (a “trusted” non-instrumented DSO). Encoded as
				469	value 0xFFFF.
				470	- Invalid address (everything else). Encoded as value 0.
				471
				472	For a CFI-instrumented DSO, a shadow value encodes the address of the
				473	__cfi_check function for all call targets in the corresponding memory
				474	page. If Addr is the target address, and V is the shadow value, then
				475	the address of __cfi_check is calculated as
				476
				477	.. code-block:: none
				478
				479	__cfi_check = AlignUpTo(Addr, 4096) - (V + 1) * 4096
				480
				481	This works as long as __cfi_check is aligned by 4096 bytes and located
				482	below any call targets in its DSO, but not more than 256MB apart from
				483	them.
				484
				485	CFI_SlowPath
				486	------------
				487
				488	The slow path check is implemented in a runtime support library as
				489
				490	.. code-block:: none
				491
				492	void __cfi_slowpath(uint64 CallSiteTypeId, void *TargetAddr)
				493	void __cfi_slowpath_diag(uint64 CallSiteTypeId, void TargetAddr, void DiagData)
				494
				495	These functions loads a shadow value for ``TargetAddr``, finds the
				496	address of ``__cfi_check`` as described above and calls
				497	that. ``DiagData`` is an opaque pointer to diagnostic data which is
				498	passed verbatim to ``__cfi_check``, and ``__cfi_slowpath`` passes
				499	``nullptr`` instead.
				500
				501	Compiler-RT library contains reference implementations of slowpath
				502	functions, but they have unresolvable issues with correctness and
				503	performance in the handling of dlopen(). It is recommended that
				504	platforms provide their own implementations, usually as part of libc
				505	or libdl.
				506
				507	Position-independent executable requirement
				508	-------------------------------------------
				509
				510	Cross-DSO CFI mode requires that the main executable is built as PIE.
				511	In non-PIE executables the address of an external function (taken from
				512	the main executable) is the address of that function’s PLT record in
				513	the main executable. This would break the CFI checks.
				514
				515	Backward-edge CFI for return statements (RCFI)
				516	==============================================
				517
				518	This section is a proposal. As of March 2017 it is not implemented.
				519
				520	Backward-edge control flow (`RET` instructions) can be hijacked
				521	via overwriting the return address (`RA`) on stack.
				522	Various mitigation techniques (e.g. `SafeStack`_, `RFG`_, `Intel CET`_)
				523	try to detect or prevent `RA` corruption on stack.
				524
				525	RCFI enforces the expected control flow in several different ways described below.
				526	RCFI heavily relies on LTO.
				527
				528	Leaf Functions
				529	--------------
				530	If `f()` is a leaf function (i.e. it has no calls
				531	except maybe no-return calls) it can be called using a special calling convention
				532	that stores `RA` in a dedicated register `R` before the `CALL` instruction.
				533	`f()` does not spill `R` and does not use the `RET` instruction,
				534	instead it uses the value in `R` to `JMP` to `RA`.
				535
				536	This flavour of CFI is precise, i.e. the function is guaranteed to return
				537	to the point exactly following the call.
				538
				539	An alternative approach is to
				540	copy `RA` from stack to `R` in the first instruction of `f()`,
				541	then `JMP` to `R`.
				542	This approach is simpler to implement (does not require changing the caller)
				543	but weaker (there is a small window when `RA` is actually stored on stack).
				544
				545
				546	Functions called once
				547	---------------------
				548	Suppose `f()` is called in just one place in the program
				549	(assuming we can verify this in LTO mode).
				550	In this case we can replace the `RET` instruction with a `JMP` instruction
				551	with the immediate constant for `RA`.
				552	This will precisely enforce the return control flow no matter what is stored on stack.
				553
				554	Another variant is to compare `RA` on stack with the known constant and abort
				555	if they don't match; then `JMP` to the known constant address.
				556
				557	Functions called in a small number of call sites
				558	------------------------------------------------
				559	We may extend the above approach to cases where `f()`
				560	is called more than once (but still a small number of times).
				561	With LTO we know all possible values of `RA` and we check them
				562	one-by-one (or using binary search) against the value on stack.
				563	If the match is found, we `JMP` to the known constant address, otherwise abort.
				564
				565	This protection is near-precise, i.e. it guarantees that the control flow will
				566	be transferred to one of the valid return addresses for this function,
				567	but not necessary to the point of the most recent `CALL`.
				568
				569	General case
				570	------------
				571	For functions called multiple times a return jump table is constructed
				572	in the same manner as jump tables for indirect function calls (see above).
				573	The correct jump table entry (or it's index) is passed by `CALL` to `f()`
				574	(as an extra argument) and then spilled to stack.
				575	The `RET` instruction is replaced with a load of the jump table entry,
				576	jump table range check, and `JMP` to the jump table entry.
				577
				578	This protection is also near-precise.
				579
				580	Returns from functions called indirectly
				581	----------------------------------------
				582
				583	If a function is called indirectly, the return jump table is constructed for the
				584	equivalence class of functions instead of a single function.
				585
				586	Cross-DSO calls
				587	---------------
				588	Consider two instrumented DSOs, `A` and `B`. `A` defines `f()` and `B` calls it.
				589
				590	This case will be handled similarly to the cross-DSO scheme using the slow path callback.
				591
				592	Non-goals
				593	---------
				594
				595	RCFI does not protect `RET` instructions:
				596	* in non-instrumented DSOs,
				597	* in instrumented DSOs for functions that are called from non-instrumented DSOs,
				598	* embedded into other instructions (e.g. `0f4fc3 cmovg %ebx,%eax`).
				599
				600	.. _SafeStack: https://clang.llvm.org/docs/SafeStack.html
				601	.. _RFG: http://xlab.tencent.com/en/2016/11/02/return-flow-guard
				602	.. _Intel CET: https://software.intel.com/en-us/blogs/2016/06/09/intel-release-new-technology-specifications-protect-rop-attacks
				603
				604	Hardware support
				605	================
				606
				607	We believe that the above design can be efficiently implemented in hardware.
				608	A single new instruction added to an ISA would allow to perform the forward-edge CFI check
				609	with fewer bytes per check (smaller code size overhead) and potentially more
				610	efficiently. The current software-only instrumentation requires at least
				611	32-bytes per check (on x86_64).
				612	A hardware instruction may probably be less than ~ 12 bytes.
				613	Such instruction would check that the argument pointer is in-bounds,
				614	and is properly aligned, and if the checks fail it will either trap (in monolithic scheme)
				615	or call the slow path function (cross-DSO scheme).
				616	The bit vector lookup is probably too complex for a hardware implementation.
				617
				618	.. code-block:: none
				619
				620	// This instruction checks that 'Ptr'
				621	// * is aligned by (1 << kAlignment) and
				622	// * is inside [kRangeBeg, kRangeBeg+(kRangeSize<<kAlignment))
				623	// and if the check fails it jumps to the given target (slow path).
				624	//
				625	// 'Ptr' is a register, pointing to the virtual function table
				626	// or to the function which we need to check. We may require an explicit
				627	// fixed register to be used.
				628	// 'kAlignment' is a 4-bit constant.
				629	// 'kRangeSize' is a ~20-bit constant.
				630	// 'kRangeBeg' is a PC-relative constant (~28 bits)
				631	// pointing to the beginning of the allowed range for 'Ptr'.
				632	// 'kFailedCheckTarget': is a PC-relative constant (~28 bits)
				633	// representing the target to branch to when the check fails.
				634	// If kFailedCheckTarget==0, the process will trap
				635	// (monolithic binary scheme).
				636	// Otherwise it will jump to a handler that implements `CFI_SlowPath`
				637	// (cross-DSO scheme).
				638	CFI_Check(Ptr, kAlignment, kRangeSize, kRangeBeg, kFailedCheckTarget) {
				639	if (Ptr < kRangeBeg \|\|
				640	Ptr >= kRangeBeg + (kRangeSize << kAlignment) \|\|
				641	Ptr & ((1 << kAlignment) - 1))
				642	Jump(kFailedCheckTarget);
				643	}
				644
				645	An alternative and more compact encoding would not use `kFailedCheckTarget`,
				646	and will trap on check failure instead.
				647	This will allow us to fit the instruction into 8-9 bytes.
				648	The cross-DSO checks will be performed by a trap handler and
				649	performance-critical ones will have to be black-listed and checked using the
				650	software-only scheme.
				651
				652	Note that such hardware extension would be complementary to checks
				653	at the callee side, such as e.g. Intel ENDBRANCH.
				654	Moreover, CFI would have two benefits over ENDBRANCH: a) precision and b)
				655	ability to protect against invalid casts between polymorphic types.