third_party/angle/src/libANGLE/renderer/vulkan/doc/PresentSemaphores.md - cobalt.git - Git at Google

 # Queue Present Wait Semaphore Management

 The following shorthand notations are used throughout this document:

 - PE: Presentation Engine
 - ANI: vkAcquireNextImageKHR
 - QS: vkQueueSubmit
 - QP: vkQueuePresentKHR
 - W: Wait
 - S: Signal
 - R: Render
 - P: Present
 - SN: Semaphore N
 - IN: Swapchain image N
 - FN: Fence N

 ---

 ## Introduction

 Vulkan requires the application (ANGLE in this case) to acquire swapchain images and queue them for
 presentation, synchronizing GPU submissions with semaphores.  A single frame looks like the
 following:

     CPU: ANI  ... QS   ... QP
          S:S1     W:S1     W:S2
                   S:S2
     GPU:          <------------ R ----------->
      PE:                                      <-------- P ------>

 That is, the GPU starts rendering after submission, and the presentation is started when rendering is
 finished.  Note that Vulkan tries to abstract a large variety of PE architectures, some of which do
 not behave in a straight-forward manner.  As such, ANGLE cannot know what the PE is exactly doing
 with the images or when the images are visible on the screen.  The only signal out of the PE is
 received through the semaphore that's used in ANI.

 With multiple frames, the pipeline looks different based on present mode.  Let's focus on
 FIFO (the arguments in this document translate to all modes) with 3 images:

     CPU: QS QP QS QP QS QP QS QP
          I1 I1 I2 I2 I3 I3 I1 I1
     GPU: <---- R I1 ----><---- R I2 ----><---- R I3 ----><---- R I1 ---->
      PE:                 <----- P I1 -----><----- P I2 -----><----- P I3 -----><----- P I1 ----->

 First, an issue is evident here.  The CPU is submitting jobs and queuing images for presentation
 faster than the GPU can render them or the PE can view them.  This can cause the length of the
 submit queue to grow indefinitely, resulting in larger and larger input lag.  In FIFO mode, the PE
 present queue also grows indefinitely.

 To address this issue, ANGLE paces the CPU such that the length of the submit queue is kept at a
 maximum of 1 image (i.e. submission with one image is being processed, and another one is in queue):

     CPU: QS   QS          W:F1 QS         W:F2 QS
          I1   I2               I3              I1
          S:F1 S:F2             S:F3            S:F4
     GPU: <---- R I1 ----><---- R I2 ----><---- R I3 ----><---- R I1 ---->

 > Note: Ideally, the length of the PE present queue should also be kept at a maximum of 1 (i.e. one
 > image being presented, and another in queue).  However, the Vulkan WSI extension doesn't provide
 > enough control to achieve this.  In heavy application, the length of the PE present queue is
 > probably 1 anyway (as the rendering time is almost as long as the frame (i.e. present time), in
 > which case pacing the submissions similarly paces the presentation).  In theory, in FIFO mode, the
 > length of the PE present queue is below n+2 where n is the number of swapchain images.
 >
 > To understand why, imagine a FIFO swapchain with 1000 images and submissions that are
 > infinitesimally short.  In this case, the CPU pacing is effectively a no-op (as the GPU instantly
 > finishes jobs) for the first 1002 submissions.  The 1003rd submission waits for F1001 (which uses
 > I1).  However, the 1001st submission will not start until the PE switches to presenting I2 (at the
 > next V-Sync).  The CPU then waits for V-Sync before the 1003rd submission.  The CPU waits for one
 > V-Sync for every subsequent submission, keeping the length of the queue 1002.
 > [`VK_GOOGLE_display_timing`][DisplayTimingGOOGLE] is likely a solution to this problem.

 Associated with each QP operation is a semaphore signaled by the preceding QS and waited on by the
 PE before the image can be presented.  Currently, there's no feedback from Vulkan (See [internal
 Khronos issue][VulkanIssue1060]) regarding _when_ the PE has actually finished waiting on the
 semaphore!  This means that the application cannot generally know when to destroy the corresponding
 semaphore.  However, taking ANGLE's CPU pacing into account, we are able to destroy (or rather
 reuse) semaphores when they are provably unused.

 This document describes an approach for destroying semaphores that should work with all valid PE
 architectures, but will be described in terms of more common PE architectures (e.g. where the PE
 only backs each VkImage and VkSemaphore handle with one actual memory object, and where the PE
 cycles between the swapchain images in a straight-forward manner).

 The interested reader may follow the discussion in this abandoned [gerrit CL][CL1757018] for more
 background and ideas.

 [DisplayTimingGOOGLE]: https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VK_GOOGLE_display_timing.html
 [VulkanIssue1060]: https://gitlab.khronos.org/vulkan/vulkan/issues/1060
 [CL1757018]: https://chromium-review.googlesource.com/c/angle/angle/+/1757018

 ## Determining When a QP Semaphore is Waited On

 Let's combine the above diagrams with all the details:

     CPU: ANI   | QS    | QP    | ANI   | QS    | QP    | ANI   | W:F1 | QS    | QP    | ANI   | W:F2 | QS    | QP
          I1    | I1    | I1    | I2    | I2    | I2    | I3    |      | I3    | I3    | I1    |      | I1    | I1
          S:SA1 | W:SA1 |       | S:SA2 | W:SA2 |       | S:SA3 |      | W:SA3 |       | S:SA4 |      | W:SA4 |
                | S:SP1 | W:SP1 |       | S:SP2 | W:SP2 |       |      | S:SP3 | W:SP3 |       |      | S:SP4 | W:SP4
                | S:F1  |       |       | S:F2  |       |       |      | S:F3  |       |       |      | S:F4  |

 Let's focus only on sequences that return the same image:

     CPU: ANI   | W:F(X-2) | QS    | QP    | ... | ANI   | W:F(Y-2) | QS    | QP
          I1    |          | I1    | I1    |     | I1    |          | I1    | I1
          S:SAX |          | W:SAX |       |     | S:SAY |          | W:SAY |
                |          | S:SPX | W:SPX |     |       |          | S:SPY | W:SPY
                |          | S:FX  |       |     |       |          | S:FY  |

 Note that X and Y are arbitrarily distanced (including possibly being sequential).

 Say we are at frame Y+2.  There's therefore a wait on FY.  The following holds:

     FY is signaled
     => SAY is signaled
     => The PE has handed I1 back to the application
     => The PE has already processed the *previous* QP of I1
     => SPX is waited on

 At this point, we can destroy SPX.  In other words, in frame Y+2, we can destroy SPX (note that 2 is
 the number of frames the CPU pacing code uses).  If frame Y+1 is not using I1, this means the
 history of present semaphores for I1 would be `{SPX, SPY}` and we can destroy the oldest semaphore
 in this list.  If frame Y+1 is also using I1, we should still destroy SPX in frame Y+2, but the
 history of the present semaphores for I1 would be `{SPX, SPY, SP(Y+1)}`.

 In the Vulkan backend, we simplify destruction of semaphores by always keeping a history of 3
 present semaphores for each image (again, 3 is H+1 where H is the swap history size used in CPU
 pacing) and always reuse (instead of destroy) the oldest semaphore of the image that is about to be
 presented.

 To summarize, we use the completion of a submission using an image to prove when the semaphore used
 for the *previous* presentation of that image is no longer in use (and can be safely destroyed or
 reused).

 ## Swapchain recreation

 When recreating the swapchain, all images are eventually freed and new ones are created, possibly
 with a different count and present mode.  For the old swapchain, we can no longer rely on the
 completion of a future submission to know when a previous presentation's semaphore can be destroyed,
 as there won't be any more submissions using images from the old swapchain.

 > For example, imagine the old swapchain was created in FIFO mode, and one image is being presented
 > until the next V-Sync.  Furthermore, imagine the new swapchain is created in MAILBOX mode.  Since
 > the old swapchain's image will remain presented until V-Sync, the new MAILBOX swapchain can
 > perform an arbitrarily large number of (throw-away) presentations.  The old swapchain (and its
 > associated present semaphores) cannot be destroyed until V-Sync; a signal that's not captured by
 > Vulkan.

 ANGLE resolves this issue by deferring the destruction of the old swapchain and its remaining
 present semaphores to the time when the semaphore corresponding to the first present of the new
 swapchain can be destroyed.  In the example in the previous section, if SPX is the present semaphore
 of the first QP performed on the new swapchain, at frame Y+2, when we know SPX can be destroyed, we
 know that the first image of the new swapchain has already been presented.  This proves that all
 previous QPs of the old swapchain have been processed.

 > Note: the swapchain can potentially be destroyed much earlier, but with no feedback from the
 > presentation engine, we cannot know that.  This delays means that the swapchain could be recreated
 > while there are pending old swapchains to be destroyed.  The destruction of both old swapchains
 > must now be deferred to when the first QP of the new swapchain has been processed.  If an
 > application resizes the window constantly and at a high rate, ANGLE would keep accumulating old
 > swapchains and not free them until it stops.  While a user will likely not be able to do this (as
 > the rate of window system events is lower than the framerate), this can be programmatically done
 > (as indeed done in EGL dEQP tests).  Nvidia for example fails creation of a new swapchain if there
 > are already 20 allocated (on desktop, or less than ten on Quadro).  If the backlog of old
 > swapchains get larger than a threshold, ANGLE calls `vkQueueWaitIdle()` and destroys the
 > swapchains.
	# Queue Present Wait Semaphore Management

	The following shorthand notations are used throughout this document:

	- PE: Presentation Engine
	- ANI: vkAcquireNextImageKHR
	- QS: vkQueueSubmit
	- QP: vkQueuePresentKHR
	- W: Wait
	- S: Signal
	- R: Render
	- P: Present
	- SN: Semaphore N
	- IN: Swapchain image N
	- FN: Fence N

	---

	## Introduction

	Vulkan requires the application (ANGLE in this case) to acquire swapchain images and queue them for
	presentation, synchronizing GPU submissions with semaphores. A single frame looks like the
	following:

	CPU: ANI ... QS ... QP
	S:S1 W:S1 W:S2
	S:S2
	GPU: <------------ R ----------->
	PE: <-------- P ------>

	That is, the GPU starts rendering after submission, and the presentation is started when rendering is
	finished. Note that Vulkan tries to abstract a large variety of PE architectures, some of which do
	not behave in a straight-forward manner. As such, ANGLE cannot know what the PE is exactly doing
	with the images or when the images are visible on the screen. The only signal out of the PE is
	received through the semaphore that's used in ANI.

	With multiple frames, the pipeline looks different based on present mode. Let's focus on
	FIFO (the arguments in this document translate to all modes) with 3 images:

	CPU: QS QP QS QP QS QP QS QP
	I1 I1 I2 I2 I3 I3 I1 I1
	GPU: <---- R I1 ----><---- R I2 ----><---- R I3 ----><---- R I1 ---->
	PE: <----- P I1 -----><----- P I2 -----><----- P I3 -----><----- P I1 ----->

	First, an issue is evident here. The CPU is submitting jobs and queuing images for presentation
	faster than the GPU can render them or the PE can view them. This can cause the length of the
	submit queue to grow indefinitely, resulting in larger and larger input lag. In FIFO mode, the PE
	present queue also grows indefinitely.

	To address this issue, ANGLE paces the CPU such that the length of the submit queue is kept at a
	maximum of 1 image (i.e. submission with one image is being processed, and another one is in queue):

	CPU: QS QS W:F1 QS W:F2 QS
	I1 I2 I3 I1
	S:F1 S:F2 S:F3 S:F4
	GPU: <---- R I1 ----><---- R I2 ----><---- R I3 ----><---- R I1 ---->

	> Note: Ideally, the length of the PE present queue should also be kept at a maximum of 1 (i.e. one
	> image being presented, and another in queue). However, the Vulkan WSI extension doesn't provide
	> enough control to achieve this. In heavy application, the length of the PE present queue is
	> probably 1 anyway (as the rendering time is almost as long as the frame (i.e. present time), in
	> which case pacing the submissions similarly paces the presentation). In theory, in FIFO mode, the
	> length of the PE present queue is below n+2 where n is the number of swapchain images.
	>
	> To understand why, imagine a FIFO swapchain with 1000 images and submissions that are
	> infinitesimally short. In this case, the CPU pacing is effectively a no-op (as the GPU instantly
	> finishes jobs) for the first 1002 submissions. The 1003rd submission waits for F1001 (which uses
	> I1). However, the 1001st submission will not start until the PE switches to presenting I2 (at the
	> next V-Sync). The CPU then waits for V-Sync before the 1003rd submission. The CPU waits for one
	> V-Sync for every subsequent submission, keeping the length of the queue 1002.
	> [`VK_GOOGLE_display_timing`][DisplayTimingGOOGLE] is likely a solution to this problem.

	Associated with each QP operation is a semaphore signaled by the preceding QS and waited on by the
	PE before the image can be presented. Currently, there's no feedback from Vulkan (See [internal
	Khronos issue][VulkanIssue1060]) regarding _when_ the PE has actually finished waiting on the
	semaphore! This means that the application cannot generally know when to destroy the corresponding
	semaphore. However, taking ANGLE's CPU pacing into account, we are able to destroy (or rather
	reuse) semaphores when they are provably unused.

	This document describes an approach for destroying semaphores that should work with all valid PE
	architectures, but will be described in terms of more common PE architectures (e.g. where the PE
	only backs each VkImage and VkSemaphore handle with one actual memory object, and where the PE
	cycles between the swapchain images in a straight-forward manner).

	The interested reader may follow the discussion in this abandoned [gerrit CL][CL1757018] for more
	background and ideas.

	[DisplayTimingGOOGLE]: https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VK_GOOGLE_display_timing.html
	[VulkanIssue1060]: https://gitlab.khronos.org/vulkan/vulkan/issues/1060
	[CL1757018]: https://chromium-review.googlesource.com/c/angle/angle/+/1757018

	## Determining When a QP Semaphore is Waited On

	Let's combine the above diagrams with all the details:

	CPU: ANI \| QS \| QP \| ANI \| QS \| QP \| ANI \| W:F1 \| QS \| QP \| ANI \| W:F2 \| QS \| QP
	I1 \| I1 \| I1 \| I2 \| I2 \| I2 \| I3 \| \| I3 \| I3 \| I1 \| \| I1 \| I1
	S:SA1 \| W:SA1 \| \| S:SA2 \| W:SA2 \| \| S:SA3 \| \| W:SA3 \| \| S:SA4 \| \| W:SA4 \|
	\| S:SP1 \| W:SP1 \| \| S:SP2 \| W:SP2 \| \| \| S:SP3 \| W:SP3 \| \| \| S:SP4 \| W:SP4
	\| S:F1 \| \| \| S:F2 \| \| \| \| S:F3 \| \| \| \| S:F4 \|

	Let's focus only on sequences that return the same image:

	CPU: ANI \| W:F(X-2) \| QS \| QP \| ... \| ANI \| W:F(Y-2) \| QS \| QP
	I1 \| \| I1 \| I1 \| \| I1 \| \| I1 \| I1
	S:SAX \| \| W:SAX \| \| \| S:SAY \| \| W:SAY \|
	\| \| S:SPX \| W:SPX \| \| \| \| S:SPY \| W:SPY
	\| \| S:FX \| \| \| \| \| S:FY \|

	Note that X and Y are arbitrarily distanced (including possibly being sequential).

	Say we are at frame Y+2. There's therefore a wait on FY. The following holds:

	FY is signaled
	=> SAY is signaled
	=> The PE has handed I1 back to the application
	=> The PE has already processed the previous QP of I1
	=> SPX is waited on

	At this point, we can destroy SPX. In other words, in frame Y+2, we can destroy SPX (note that 2 is
	the number of frames the CPU pacing code uses). If frame Y+1 is not using I1, this means the
	history of present semaphores for I1 would be `{SPX, SPY}` and we can destroy the oldest semaphore
	in this list. If frame Y+1 is also using I1, we should still destroy SPX in frame Y+2, but the
	history of the present semaphores for I1 would be `{SPX, SPY, SP(Y+1)}`.

	In the Vulkan backend, we simplify destruction of semaphores by always keeping a history of 3
	present semaphores for each image (again, 3 is H+1 where H is the swap history size used in CPU
	pacing) and always reuse (instead of destroy) the oldest semaphore of the image that is about to be
	presented.

	To summarize, we use the completion of a submission using an image to prove when the semaphore used
	for the previous presentation of that image is no longer in use (and can be safely destroyed or
	reused).

	## Swapchain recreation

	When recreating the swapchain, all images are eventually freed and new ones are created, possibly
	with a different count and present mode. For the old swapchain, we can no longer rely on the
	completion of a future submission to know when a previous presentation's semaphore can be destroyed,
	as there won't be any more submissions using images from the old swapchain.

	> For example, imagine the old swapchain was created in FIFO mode, and one image is being presented
	> until the next V-Sync. Furthermore, imagine the new swapchain is created in MAILBOX mode. Since
	> the old swapchain's image will remain presented until V-Sync, the new MAILBOX swapchain can
	> perform an arbitrarily large number of (throw-away) presentations. The old swapchain (and its
	> associated present semaphores) cannot be destroyed until V-Sync; a signal that's not captured by
	> Vulkan.

	ANGLE resolves this issue by deferring the destruction of the old swapchain and its remaining
	present semaphores to the time when the semaphore corresponding to the first present of the new
	swapchain can be destroyed. In the example in the previous section, if SPX is the present semaphore
	of the first QP performed on the new swapchain, at frame Y+2, when we know SPX can be destroyed, we
	know that the first image of the new swapchain has already been presented. This proves that all
	previous QPs of the old swapchain have been processed.

	> Note: the swapchain can potentially be destroyed much earlier, but with no feedback from the
	> presentation engine, we cannot know that. This delays means that the swapchain could be recreated
	> while there are pending old swapchains to be destroyed. The destruction of both old swapchains
	> must now be deferred to when the first QP of the new swapchain has been processed. If an
	> application resizes the window constantly and at a high rate, ANGLE would keep accumulating old
	> swapchains and not free them until it stops. While a user will likely not be able to do this (as
	> the rate of window system events is lower than the framerate), this can be programmatically done
	> (as indeed done in EGL dEQP tests). Nvidia for example fails creation of a new swapchain if there
	> are already 20 allocated (on desktop, or less than ten on Quadro). If the backlog of old
	> swapchains get larger than a threshold, ANGLE calls `vkQueueWaitIdle()` and destroys the
	> swapchains.