Single-sampled Color Compression
================================

Starting with Ivy Bridge, Intel graphics hardware provides a form of color
compression for single-sampled surfaces.  In its initial form, this provided an
acceleration of render target clear operations that, in the common case, allows
you to avoid almost all of the bandwidth of a full-surface clear operation.  On
Sky Lake, single-sampled color compression was extended to allow for the
compression of color values from actual rendering and not just the initial
clear.  From here on, the older Ivy Bridge form of color compression will be
called "fast-clears" and the term "color compression" will be reserved for the
more powerful Sky Lake form.
|
The documentation for Ivy Bridge through Broadwell overloads the term MCS for
referring both to the *multisample control surface* used for multisample
compression and the control surface used for fast-clears.  In ISL, the
:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_MCS` enum always refers to
multisample color compression while the
:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_CCS_` enums always refer to
single-sampled color compression.  Throughout this chapter and the rest of the
ISL documentation, we will use the term "color control surface", abbreviated
CCS, to denote the control surface used for both fast-clears and color
compression.  While this is still an overloaded term, Ivy Bridge fast-clears
are much closer to Sky Lake color compression than they are to multisample
compression.

CCS data
--------

Fast clears and CCS are possibly the single most poorly documented aspect of
surface layout/setup for Intel graphics hardware (with HiZ coming in a close
second).  All the documentation really says is that you can use an MCS buffer on
single-sampled surfaces (we will call it the CCS in this case).  It also
provides some documentation on how to program the hardware to perform clear
operations, but that's it.  How big is this buffer?  What does it contain?
Those questions are left as exercises to the reader.  Almost everything we know
about the contents of the CCS is gleaned from reverse-engineering of the
hardware.  The best bit of documentation we have ever had comes from the
display section of the Sky Lake PRM Vol 12 section on planes (p. 159):

    The Color Control Surface (CCS) contains the compression status of the
    cache-line pairs. The compression state of the cache-line pair is
    specified by 2 bits in the CCS.  Each CCS cache-line represents an area
    on the main surface of 16x16 sets of 128 byte Y-tiled cache-line-pairs.
    CCS is always Y tiled.

While this is technically for color compression and not fast-clears, it
provides a good bit of insight into how color compression and fast-clears
operate.  Each cache-line pair in the main surface corresponds to 1 or 2 bits
in the CCS.  The primary difference, as far as the current discussion is
concerned, is that fast-clears use only 1 bit per cache-line pair whereas color
compression uses 2 bits.
|
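As a rough illustration of what the 1-bit and 2-bit figures imply for size, the
sketch below estimates the raw number of CCS bytes for a given amount of main
surface data.  The function and macro names are made up for this example and
are not part of ISL, the 64-byte cache line is inferred from the 128-byte
cache-line pairs in the PRM quote, and the result ignores all tiling and
alignment padding, so treat it as a lower bound only:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES      64u   /* assumed: half of a 128-byte pair */
#define CACHE_LINE_PAIR_BYTES 128u  /* from the PRM quote above */

/* Number of main-surface cache-line pairs tracked by one CCS cache line. */
static uint32_t pairs_per_ccs_cache_line(uint32_t bits_per_pair)
{
    assert(bits_per_pair == 1 || bits_per_pair == 2);
    return CACHE_LINE_BYTES * 8u / bits_per_pair;
}

/* Raw CCS payload in bytes for a main surface of the given size.
 * Hypothetical helper: no tiling or alignment padding is accounted for. */
static uint64_t ccs_payload_bytes(uint64_t main_surface_bytes,
                                  uint32_t bits_per_pair)
{
    assert(bits_per_pair == 1 || bits_per_pair == 2);
    uint64_t pairs = (main_surface_bytes + CACHE_LINE_PAIR_BYTES - 1) /
                     CACHE_LINE_PAIR_BYTES;
    return (pairs * bits_per_pair + 7) / 8;
}
```

With 2 bits per pair, one 64-byte CCS cache line tracks 256 pairs, which agrees
with the 16x16 figure in the PRM text.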
|
What is a cache-line pair?  Both the X and Y tiling formats are arranged as an
8x8 grid of cache lines.  (See the [chapter on tiling](#tiling) for more
details.)  In either case, a cache-line pair is a pair of cache lines whose
starting addresses differ by 512 bytes, or 8 cache lines.  This results in the
two cache lines being vertically adjacent when the main surface is X-tiled and
horizontally adjacent when the main surface is Y-tiled.  For an X-tiled surface
this forms an area of 64B x 2 rows and for a Y-tiled surface this forms an area
of 32B x 4 rows.  In either case, it is guaranteed that, regardless of surface
format, each 2x2 subspan coming out of a shader will land entirely within one
cache-line pair.
|
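The footprint of a cache-line pair in pixels follows directly from the byte
dimensions above.  The following is a minimal sketch; the enum, struct and
function names are invented for this example and do not exist in ISL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, used only for this illustration. */
enum main_tiling { MAIN_TILING_X, MAIN_TILING_Y };

struct pair_extent {
    uint32_t width_px;     /* width of the pair in pixels */
    uint32_t height_rows;  /* height of the pair in rows */
};

/* Area covered by one cache-line pair: 64B x 2 rows when the main surface
 * is X-tiled, 32B x 4 rows when it is Y-tiled. */
static struct pair_extent cache_line_pair_extent(enum main_tiling tiling,
                                                 uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel != 0 && 32 % bytes_per_pixel == 0);
    uint32_t width_bytes = (tiling == MAIN_TILING_X) ? 64 : 32;
    uint32_t rows = (tiling == MAIN_TILING_X) ? 2 : 4;
    struct pair_extent extent = { width_bytes / bytes_per_pixel, rows };
    return extent;
}
```

Even for a 128bpp format on a Y-tiled surface, the pair is still 2 pixels wide
and 4 rows tall, so a 2x2 subspan fits inside it.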
|
What is the correspondence between bits and cache-line pairs?  The best model I
(Jason) know of is to consider the CCS as having a 1-bit color format for
fast-clears and a 2-bit format for color compression, along with a special
tiling format.  The CCS tiling formats operate on a 1 or 2-bit granularity
rather than the byte granularity of most tiling formats.

The following table represents the bit-layouts that yield the CCS tiling format
on different hardware generations.  Bits 0-11 correspond to the regular swizzle
of bytes within a 4KB page whereas the negative bits represent the address of
the particular 1 or 2-bit portion of a byte.  Because Sky Lake CCS elements are
2 bits wide, its row has no entry for bit -3.  (Note: The Haswell data was
gathered on a dual-channel system so bit-6 swizzling was enabled.  It's unclear
how this affects the CCS layout.)

============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
Generation   Tiling   11          10          9                      8           7           6           5           4           3           2           1           0           -1          -2          -3
============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
Ivy Bridge   X or Y   :math:`u_6` :math:`u_5` :math:`u_4`            :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
Haswell      X        :math:`u_6` :math:`u_5` :math:`v_3 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
Haswell      Y        :math:`u_6` :math:`u_5` :math:`v_2 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
Broadwell    X        :math:`u_6` :math:`u_5` :math:`u_4`            :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`u_3` :math:`v_3` :math:`u_2` :math:`u_1` :math:`u_0` :math:`v_2` :math:`v_1` :math:`v_0`
Broadwell    Y        :math:`u_6` :math:`u_5` :math:`u_4`            :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_1` :math:`v_0` :math:`u_0`
Sky Lake     Y        :math:`u_6` :math:`u_5` :math:`u_4`            :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_0` :math:`u_0`
============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
|
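One way to read a row of the table is as a recipe for assembling an address one
bit at a time.  The sketch below does this for the Sky Lake row only.  It
assumes that :math:`u` and :math:`v` are the horizontal and vertical
coordinates of a CCS element within a 4KB page, which the text above does not
state explicitly, and all names are invented for this example:

```c
#include <stdint.h>

/* One address bit of the CCS tiling: which coordinate it comes from
 * ('u' or 'v') and which bit of that coordinate is used. */
struct ccs_addr_bit {
    char axis;
    uint8_t bit;
};

/* Sky Lake row of the table, from address bit 11 down to bit -2. */
static const struct ccs_addr_bit skl_ccs_bits[14] = {
    {'u', 6}, {'u', 5}, {'u', 4}, {'v', 6}, {'v', 5}, {'v', 4}, {'v', 3},
    {'v', 2}, {'v', 1}, {'u', 3}, {'u', 2}, {'u', 1}, {'v', 0}, {'u', 0},
};

/* Index of the 2-bit CCS element for (u, v), counted in 2-bit units from
 * the start of the page.  The byte offset is index >> 2 and the two low
 * bits of the index select the 2-bit element within that byte. */
static uint32_t skl_ccs_element_index(uint32_t u, uint32_t v)
{
    uint32_t index = 0;
    for (int i = 0; i < 14; i++) {
        uint32_t coord = (skl_ccs_bits[i].axis == 'u') ? u : v;
        index = (index << 1) | ((coord >> skl_ccs_bits[i].bit) & 1u);
    }
    return index;
}
```

The other rows could be handled the same way with their own tables, except that
the Haswell rows would need an extra case for the XOR entry in the bit 9
column.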
|

CCS surface layout
------------------

Starting with Broadwell, fast-clears and color compression can be used on
mipmapped and array surfaces.  When considered from a higher level, the CCS is
laid out like any other surface.  The Broadwell and Sky Lake PRMs describe
this as follows:

Broadwell PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 676):

    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
    these alignments in the RT space: Horizontal Alignment = 256 and Vertical
    Alignment = 128.

Broadwell PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 279):

    For non-multisampled render target's auxiliary surface, MCS, QPitch must be
    computed with Horizontal Alignment = 256 and Surface Vertical Alignment =
    128. These alignments are only for MCS buffer and not for associated render
    target.

Sky Lake PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 632):

    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
    these alignments in the RT space: Horizontal Alignment = 128 and Vertical
    Alignment = 64.

Sky Lake PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 435):

    For non-multisampled render target's CCS auxiliary surface, QPitch must be
    computed with Horizontal Alignment = 128 and Surface Vertical Alignment
    = 256. These alignments are only for CCS buffer and not for associated
    render target.

Empirical evidence seems to confirm this.  On Sky Lake, the vertical alignment
is always one cache line.  The horizontal alignment, however, varies by main
surface format: 1 cache line for 32bpp, 2 for 64bpp and 4 cache lines for
128bpp formats.  This nicely corresponds to the alignment of 128x64 pixels in
the primary color surface.  The second PRM citation about Sky Lake CCS above
gives a vertical alignment of 256 rather than 64.  With a little
experimentation, this additional alignment appears to only apply to QPitch and
not to the miplevels within a slice.
|
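The 128x64 figure can be checked with a little arithmetic.  The sketch below
assumes that the "16x16 sets" of cache-line pairs covered by one CCS cache line
are arranged 16 pairs wide by 16 pairs tall, and that each Y-tiled pair is
32B x 4 rows as described earlier.  The function names are invented for this
example and are not ISL API:

```c
#include <assert.h>
#include <stdint.h>

/* Horizontal CCS alignment on Sky Lake, expressed in main-surface pixels.
 * Uses the observed 1, 2 and 4 CCS cache lines for 32, 64 and 128bpp. */
static uint32_t skl_ccs_halign_px(uint32_t bits_per_pixel)
{
    assert(bits_per_pixel == 32 || bits_per_pixel == 64 ||
           bits_per_pixel == 128);
    uint32_t ccs_cache_lines = bits_per_pixel / 32;
    uint32_t bytes_per_ccs_cache_line = 16 * 32; /* 16 pairs x 32B each */
    return ccs_cache_lines * bytes_per_ccs_cache_line / (bits_per_pixel / 8);
}

/* Vertical CCS alignment on Sky Lake, in main-surface rows:
 * one CCS cache line, i.e. 16 pairs of 4 rows each. */
static uint32_t skl_ccs_valign_rows(void)
{
    return 16 * 4;
}

/* Round value up to the next multiple of alignment. */
static uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}
```

Under these assumptions every supported pixel size yields a 128-pixel
horizontal alignment and a 64-row vertical alignment, so a hypothetical
130x70 miplevel would occupy a 256x128 region.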
|
On Broadwell, each miplevel in the CCS is aligned to a cache-line pair
boundary: horizontal when the primary surface is X-tiled and vertical when
Y-tiled.  For a 32bpp format, this works out to an alignment of 256x128 main
surface pixels regardless of X or Y tiling.  On Sky Lake, the alignment is
a single cache line, which works out to an alignment of 128x64 main surface
pixels.

TODO: More than just 32bpp formats on Broadwell!

Once armed with the above alignment information, we can lay out the CCS surface
itself.  The way ISL does CCS layout calculations is by a very careful and
subtle application of its normal surface layout code.

Above, we described the CCS data layout as a mapping of address bits.  In
ISL, this is represented by :cpp:enumerator:`isl_tiling::ISL_TILING_CCS`.  The
logical and physical tile dimensions correspond to the above mapping.

We also have special :cpp:enum:`isl_format` enums for CCS.  These formats are 1
bit-per-pixel on Ivy Bridge through Broadwell and 2 bits-per-pixel on Sky Lake
and above to correspond to the 1 and 2-bit values represented in the CCS data.
They have a block size (similar to a block compressed format such as BC or
ASTC) which says what area (in surface elements) in the main surface is covered
by a single CCS element (1 or 2-bit).  Because this depends on the main surface
tiling and format, we have several different CCS formats.

Once the appropriate :cpp:enum:`isl_format` has been selected, computing the
size and layout of a CCS surface is as simple as passing the same surface
creation parameters to :cpp:func:`isl_surf_init_s` as were used to create the
primary surface, only with :cpp:enumerator:`isl_tiling::ISL_TILING_CCS` and the
correct CCS format.  This not only results in a correctly sized surface but
most other ISL helpers for things such as computing offsets into surfaces work
correctly as well.

CCS on Tigerlake and above
--------------------------

Starting with Tigerlake, CCS is no longer done via a surface and, instead, the
term CCS gets overloaded once again (gotta love it!) to now refer to a form of
universal compression which can be applied to almost any surface.  Nothing in
this chapter applies to any hardware with a graphics IP version 12 or above.