SRPCore: Material Classification and Screen Space Reflection Optimization

Posted by 凶恶的真实 on 2024-05-07

Preface

Before writing this post I had planned a short guide to using RenderGraph, but deep down I felt there was not much point: I would not learn anything new in the process (especially after seeing the GC pressure in DebugView), unless I worked through the whole of RenderGraph (resource management, the pass-compile flow), at which point it would no longer be a short guide.
So I would rather dig into something newer to me, such as the Visibility Buffer I had wanted to build back when I was still employed.
But before writing the Visibility Buffer flow, I first have to add a deferred rendering path to my own pipeline.
And while studying HDRP's Deferred Lighting for that, I came across its Indirect Dispatch optimization, and realized the same trick could be applied to the SSR I wrote earlier.
That instantly felt more worth writing down than RenderGraph, so RenderGraph gets pushed back once more.

Material classification

In deferred rendering, much of the cost comes from supporting multiple lighting models: the deferred shading pass has to pick each pixel's lighting model at runtime (a dynamic branch), and supporting them all bloats the shader (and drags in more resource bindings).
To get rid of the dynamic branch, the materials on screen are classified up front, turning the dynamic branch into a static one.
HDRP's take on material classification is similar to Naughty Dog's for Uncharted 4 (SIGGRAPH 2016): https://zhuanlan.zhihu.com/p/77542272
It likewise writes a material type ID during the GBuffer pass,
then uses a compute shader with thread atomics to gather the material types within each 16×16-pixel tile and classify the tile,
and uses InterlockedAdd on the matching variant's entry in the indirect buffer to accumulate how many thread groups each feature variant must dispatch (consumed by the deferred shader's indirect dispatch).
On top of material classification, HDRP also further classifies the different lighting configurations while building the FPTL data.

FeatureFlag

Types of FeatureFlag

Here the GenerateHLSL attribute is again used to keep the pipeline C# code and the shader code in sync:
https://www.cnblogs.com/OneStargazer/p/18131227
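
For reference, the attribute makes Unity emit matching HLSL defines from the C# definitions; the LIGHTFEATUREFLAGS_* names used by the shader listings later in this post come from that generated file. A representative excerpt of what the generated output looks like (an assumption about the file's exact shape; the values follow directly from the enum below):

// LightDefinitions.cs.hlsl (auto-generated; representative excerpt)
#define LIGHTFEATUREFLAGS_PUNCTUAL (4096)       // 1 << 12
#define LIGHTFEATUREFLAGS_AREA (8192)           // 1 << 13
#define LIGHTFEATUREFLAGS_DIRECTIONAL (16384)   // 1 << 14
#define LIGHTFEATUREFLAGS_ENV (32768)           // 1 << 15
#define LIGHTFEATUREFLAGS_SKY (65536)           // 1 << 16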

From LightDefinitions.s_LightFeatureMaskFlags you can see that the light feature flags occupy the upper bits 12–23 of the 32-bit uint (mask 0xFFF000), while the material flags occupy the lower 12 bits (mask 0x000FFF).
SSRefraction and SSReflection do not get their own tile classification; they merely reserve bits in the feature flags.

[GenerateHLSL]
internal enum LightFeatureFlags
{
    // Light bit mask must match LightDefinitions.s_LightFeatureMaskFlags value
    Punctual = 1 << 12,
    Area = 1 << 13,
    Directional = 1 << 14,
    Env = 1 << 15,
    Sky = 1 << 16,
    SSRefraction = 1 << 17,
    SSReflection = 1 << 18,
    // If adding more light be sure to not overflow LightDefinitions.s_LightFeatureMaskFlags
}

[GenerateHLSL]
class LightDefinitions
{
    ...
    
    // Following define the maximum number of bits use in each feature category.
    public static uint s_LightFeatureMaskFlags = 0xFFF000;
    public static uint s_LightFeatureMaskFlagsOpaque = 0xFFF000 & ~((uint) LightFeatureFlags.SSRefraction); // Opaque don't support screen space refraction
    public static uint s_LightFeatureMaskFlagsTransparent = 0xFFF000 & ~((uint) LightFeatureFlags.SSReflection); // Transparent don't support screen space reflection
    public static uint s_MaterialFeatureMaskFlags = 0x000FFF; // don't use all bits just to be safe from signed and/or float conversions :/

}

Computing the Per-Tile Light Feature Flags

In my earlier walkthrough of the FPTL flow, the feature flag computation was skipped since it has no bearing on the forward lighting path.
Here I only pull out the code where FPTL computes each tile's feature flags; for the full FPTL flow see my earlier article, which I won't repeat here:
https://www.cnblogs.com/OneStargazer/p/18105322

The tile's light feature flags are written to the RWStructuredBuffer g_TileFeatureFlags
(tileIDX.y * nrTilesX + tileIDX.x linearizes the screen tile coordinates into a 1D index).

#ifdef PLATFORM_LANE_COUNT                                          // We can infer the size of a wave. This is currently not possible on non-consoles, so we have to fallback to a sensible default in those cases.
#define NR_THREADS              PLATFORM_LANE_COUNT
#else
#define NR_THREADS              64                                  // default to 64 threads per group on other platforms..
#endif

#ifdef USE_FEATURE_FLAGS
groupshared uint ldsFeatureFlags;
RWStructuredBuffer<uint> g_TileFeatureFlags;
#endif

#define TILE_SIZE_FPTL (16)

//lightlistbuild.compute
[numthreads(NR_THREADS, 1, 1)]
void TileLightListGen(uint3 dispatchThreadId : SV_DispatchThreadID, uint threadID : SV_GroupIndex, uint3 u3GroupID : SV_GroupID)
{
    ...(SphericalIntersectionTests)

    ...(FinePruneLights)

    uint t = threadID; // alias for the group thread index (declared earlier in the full source)

    if (t < CATEGORY_LIST_SIZE)
        ldsCategoryListCount[t] = 0;

    #ifdef USE_FEATURE_FLAGS
    if(t==0) 
        ldsFeatureFlags=0;
    #endif

    #if NR_THREADS > PLATFORM_LANE_COUNT
    GroupMemoryBarrierWithGroupSync();
    #endif

    #define LIGHT_LIST_MAX_COARSE_ENTRIES (64)
    
    //Iterate over the tile's light list that survived SphericalIntersectionTests and FinePruneLights culling
    int nrLightsCombinedList = min(ldsNrLightsFinal,LIGHT_LIST_MAX_COARSE_ENTRIES);
    for (int i = t; i < nrLightsCombinedList; i += NR_THREADS)
    {
        const int lightBoundIndex = GenerateLightCullDataIndex(prunedList[i], g_iNrVisibLights, unity_StereoEyeIndex);

        InterlockedAdd(ldsCategoryListCount[_LightVolumeData[lightBoundIndex].lightCategory], 1);

        //Atomically OR this light's feature flags into the tile's flags (one thread group per tile).
        //LightVolumeData.featureFlags holds the light feature flags of the current volume.
        //Light feature flags: Punctual, Area, Directional, Env, Sky, SSReflection, SSRefraction.
        #ifdef USE_FEATURE_FLAGS
        InterlockedOr(ldsFeatureFlags, _LightVolumeData[lightBoundIndex].featureFlags);
        #endif
    }


    #ifdef USE_FEATURE_FLAGS
    if(t == 0)
    {
        //g_BaseFeatureFlags = Directional (when the directional light count > 0) | Sky (when the sky is enabled); see BuildPerTileLightList below
        uint featureFlags = ldsFeatureFlags | g_BaseFeatureFlags;

        //The whole tile consists of background (sky) pixels
        if(ldsZMax < ldsZMin)   // is background pixel
        {
            // There is no stencil usage with compute path, featureFlags set to 0 is use to have fast rejection of tile in this case. It will still execute but will do nothing
            featureFlags = 0;
        }

        g_TileFeatureFlags[tileIDX.y * nrTilesX + tileIDX.x + unity_StereoEyeIndex * nrTilesX * nrTilesY] = featureFlags;
    }
    #endif

2f910af4467dd9f3231a480ed7479ea7.png

Debug view of the variant index derived from the per-tile light feature flags (in my own pipeline)

Setting g_BaseFeatureFlags

When the directional light count is greater than zero, g_BaseFeatureFlags gets the LightFeatureFlags.Directional bit.
When skybox rendering is enabled, g_BaseFeatureFlags gets the LightFeatureFlags.Sky bit.

//HDRenderPipeline.LightLoop.cs

static void BuildPerTileLightList(BuildGPULightListPassData data, ref bool tileFlagsWritten, CommandBuffer cmd)
{
    // optimized for opaques only
    if (data.runLightList && data.runFPTL)
    {
        ...
        var localLightListCB = data.lightListCB;

        if (data.enableFeatureVariants)
        {
            uint baseFeatureFlags = 0;
            if (data.directionalLightCount > 0)
            {
                baseFeatureFlags |= (uint)LightFeatureFlags.Directional;
            }
            if (data.skyEnabled)
            {
                baseFeatureFlags |= (uint)LightFeatureFlags.Sky;
            }
            if (!data.computeMaterialVariants)
            {
                baseFeatureFlags |= LightDefinitions.s_MaterialFeatureMaskFlags;
            }

            localLightListCB.g_BaseFeatureFlags = baseFeatureFlags;

            cmd.SetComputeBufferParam(data.buildPerTileLightListShader, data.buildPerTileLightListKernel, HDShaderIDs.g_TileFeatureFlags, data.output.tileFeatureFlags);
            tileFlagsWritten = true;
        }

        ConstantBuffer.Push(cmd, localLightListCB, data.buildPerTileLightListShader, HDShaderIDs._ShaderVariablesLightList);
        ...
    }
}

MaterialFlag

For material flags, HDRP only classifies tiles by the following Lit features:
LitSubsurfaceScattering
LitTransmission
LitStandard
LitAnisotropy
LitIridescence
LitClearCoat

LitSpecularColor is still handled with a dynamic branch in the deferred shader.

//Packages/com.unity.render-pipelines.high-definition@12.1.6/Runtime/Material/Lit/Lit.cs

partial class Lit : RenderPipelineMaterial
{
    // Currently we have only one materialId (Standard GGX), so it is not store in the GBuffer and we don't test for it

    // If change, be sure it match what is done in Lit.hlsl: MaterialFeatureFlagsFromGBuffer
    // Material bit mask must match the size define LightDefinitions.s_MaterialFeatureMaskFlags value
    [GenerateHLSL(PackingRules.Exact)]
    public enum MaterialFeatureFlags
    {
        LitStandard = 1 << 0,   // For material classification we need to identify that we are indeed use as standard material, else we are consider as sky/background element
        LitSpecularColor = 1 << 1,   // LitSpecularColor is not use statically but only dynamically
        LitSubsurfaceScattering = 1 << 2,
        LitTransmission = 1 << 3,
        LitAnisotropy = 1 << 4,
        LitIridescence = 1 << 5,
        LitClearCoat = 1 << 6
    };
...
}

BuildMaterialFlag

After FPTL has computed the light feature flags for every 16×16 screen tile, the material flags are read from the GBuffer, and the material flags of all pixels in a tile are reduced into that tile's final feature flags.

(HDRP reads GBuffer2 via MATERIAL_FEATURE_FLAGS_FROM_GBUFFER to obtain the material feature flags.)


#pragma kernel MaterialFlagsGen
//USE_OR is enabled when the light feature flag pass above (lightlistbuild.compute) has already written g_TileFeatureFlags
#pragma multi_compile _ USE_OR

...

#define USE_MATERIAL_FEATURE_FLAGS

#ifdef PLATFORM_LANE_COUNT                                          // We can infer the size of a wave. This is currently not possible on non-consoles, so we have to fallback to a sensible default in those cases.
#define NR_THREADS              PLATFORM_LANE_COUNT
#else
#define NR_THREADS              64                                  // default to 64 threads per group on other platforms..
#endif

groupshared uint ldsFeatureFlags;
RWStructuredBuffer<uint> g_TileFeatureFlags;

TEXTURE2D_X_UINT2(_StencilTexture);

[numthreads(NR_THREADS, 1, 1)]
void MaterialFlagsGen(uint3 dispatchThreadId : SV_DispatchThreadID, uint threadID : SV_GroupIndex, uint3 u3GroupID : SV_GroupID)
{
    UNITY_XR_ASSIGN_VIEW_INDEX(dispatchThreadId.z);

    uint2 tileIDX = u3GroupID.xy;

    uint iWidth = g_viDimensions.x;
    uint iHeight = g_viDimensions.y;
    uint nrTilesX = (iWidth + (TILE_SIZE_FPTL - 1)) / TILE_SIZE_FPTL;
    uint nrTilesY = (iHeight + (TILE_SIZE_FPTL - 1)) / TILE_SIZE_FPTL;

    // 64 threads * 4 pixels per thread = 256 = 16 * 16; each thread processes a group of 4 pixels
    uint2 viTilLL = 16 * tileIDX;

    float2 invScreenSize = float2(1.0 / iWidth, 1.0 / iHeight);

    if (threadID == 0)
    {
        ldsFeatureFlags = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    uint materialFeatureFlags = g_BaseFeatureFlags; // Contain all lightFeatures or 0 (depends if we enable light classification or not)
    UNITY_UNROLL
    for (int i = 0; i < 4; i++)
    {
        int idx = i * NR_THREADS + threadID;
        uint2 uCrd = min(uint2(viTilLL.x + (idx & 0xf), viTilLL.y + (idx >> 4)), uint2(iWidth - 1, iHeight - 1));

        // Unlit object, sky/background and forward opaque tag don't tag the StencilUsage.RequiresDeferredLighting bit
        uint stencilVal = GetStencilValue(LOAD_TEXTURE2D_X(_StencilTexture, uCrd));
        if ((stencilVal & STENCILUSAGE_REQUIRES_DEFERRED_LIGHTING) > 0)
        {
            PositionInputs posInput = GetPositionInput(uCrd, invScreenSize);
            materialFeatureFlags |= MATERIAL_FEATURE_FLAGS_FROM_GBUFFER(posInput.positionSS);
        }
    }

    InterlockedOr(ldsFeatureFlags, materialFeatureFlags);   //TODO: driver might optimize this or we might have to do a manual reduction
    GroupMemoryBarrierWithGroupSync();

    if (threadID == 0)
    {
        uint tileIndex = tileIDX.y * nrTilesX + tileIDX.x;

        // TODO: shouldn't this always enabled?
        #if defined(UNITY_STEREO_INSTANCING_ENABLED)
            tileIndex += unity_StereoEyeIndex * nrTilesX * nrTilesY;
        #endif

#ifdef USE_OR
        g_TileFeatureFlags[tileIndex] |= ldsFeatureFlags;
#else // Use in case we have disabled light classification
        g_TileFeatureFlags[tileIndex] = ldsFeatureFlags;
#endif
    }
}

Computing the Variant Index and Building the Indirect Buffer

Variant Index

Once the light feature flags and material feature flags computed above have been merged into the tile's final feature flags,
the tile's variant index is computed (by walking all the combinations in the static kFeatureVariantFlags table).

About the kFeatureVariantFlags table:
besides SSRefraction and SSReflection not being classified per tile, as mentioned earlier,
it is worth noting that the subsurface scattering and transmission material features are folded into a single variant
(MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING, MATERIALFEATUREFLAGS_LIT_TRANSMISSION).

kFeatureVariantFlags Table
//Packages/com.unity.render-pipelines.high-definition@12.1.6/Runtime/Material/Lit/Lit.hlsl

// Combination need to be define in increasing "comlexity" order as define by FeatureFlagsToTileVariant
static const uint kFeatureVariantFlags[NUM_FEATURE_VARIANTS] =
{
    // Precomputed illumination (no dynamic lights) with standard
    /*  0 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_STANDARD,
    // Precomputed illumination (no dynamic lights) with standard, SSS and transmission
    /*  1 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING | MATERIALFEATUREFLAGS_LIT_TRANSMISSION |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    // Precomputed illumination (no dynamic lights) for all material types
    /*  2 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIAL_FEATURE_MASK_FLAGS,

    /*  3 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /*  4 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_AREA | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /*  5 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /*  6 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /*  7 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIALFEATUREFLAGS_LIT_STANDARD,

    // Standard with SSS and Transmission
    /*  8 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING | MATERIALFEATUREFLAGS_LIT_TRANSMISSION |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    /*  9 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_AREA | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING | MATERIALFEATUREFLAGS_LIT_TRANSMISSION |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 10 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING |
    MATERIALFEATUREFLAGS_LIT_TRANSMISSION | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 11 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING |
    MATERIALFEATUREFLAGS_LIT_TRANSMISSION | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 12 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIALFEATUREFLAGS_LIT_SUBSURFACE_SCATTERING | MATERIALFEATUREFLAGS_LIT_TRANSMISSION | MATERIALFEATUREFLAGS_LIT_STANDARD,

    // Anisotropy
    /* 13 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | MATERIALFEATUREFLAGS_LIT_ANISOTROPY | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 14 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_AREA | MATERIALFEATUREFLAGS_LIT_ANISOTROPY | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 15 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_ANISOTROPY | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 16 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_ANISOTROPY |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 17 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIALFEATUREFLAGS_LIT_ANISOTROPY | MATERIALFEATUREFLAGS_LIT_STANDARD,

    // Standard with clear coat
    /* 18 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | MATERIALFEATUREFLAGS_LIT_CLEAR_COAT | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 19 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_AREA | MATERIALFEATUREFLAGS_LIT_CLEAR_COAT | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 20 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_CLEAR_COAT | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 21 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_CLEAR_COAT |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 22 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIALFEATUREFLAGS_LIT_CLEAR_COAT | MATERIALFEATUREFLAGS_LIT_STANDARD,

    // Standard with Iridescence
    /* 23 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | MATERIALFEATUREFLAGS_LIT_IRIDESCENCE | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 24 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_AREA | MATERIALFEATUREFLAGS_LIT_IRIDESCENCE | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 25 */ LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_IRIDESCENCE | MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 26 */
    LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_DIRECTIONAL | LIGHTFEATUREFLAGS_PUNCTUAL | LIGHTFEATUREFLAGS_ENV | LIGHTFEATUREFLAGS_SSREFLECTION | MATERIALFEATUREFLAGS_LIT_IRIDESCENCE |
    MATERIALFEATUREFLAGS_LIT_STANDARD,
    /* 27 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIALFEATUREFLAGS_LIT_IRIDESCENCE | MATERIALFEATUREFLAGS_LIT_STANDARD,

    /* 28 */ LIGHT_FEATURE_MASK_FLAGS_OPAQUE | MATERIAL_FEATURE_MASK_FLAGS, // Catch all case with MATERIAL_FEATURE_MASK_FLAGS is needed in case we disable material classification
};

uint FeatureFlagsToTileVariant(uint featureFlags)
{
    for (int i = 0; i < NUM_FEATURE_VARIANTS; i++)
    {
        if ((featureFlags & kFeatureVariantFlags[i]) == featureFlags)
            return i;
    }
    return NUM_FEATURE_VARIANTS - 1;
}
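
A quick worked example of how the lookup resolves, using values from the table above:

// Worked example: a tile containing one point light over standard Lit pixels,
// with the sky enabled, ends up with
//   featureFlags = LIGHTFEATUREFLAGS_SKY | LIGHTFEATUREFLAGS_PUNCTUAL
//                | MATERIALFEATUREFLAGS_LIT_STANDARD;
// Variants 0-2 fail because their masks lack the PUNCTUAL bit, so
// (featureFlags & kFeatureVariantFlags[i]) != featureFlags for i < 3.
// Variant 3 (SKY | DIRECTIONAL | PUNCTUAL | LIT_STANDARD) is a superset of
// featureFlags, so FeatureFlagsToTileVariant returns 3: the tile is shaded by
// the cheapest kernel that still covers every feature it needs.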
//Packages/com.unity.render-pipelines.high-definition@12.1.6/Runtime/Lighting/LightLoop/builddispatchindirect.compute

#ifdef PLATFORM_LANE_COUNT      // We can infer the size of a wave. This is currently not possible on non-consoles, so we have to fallback to a sensible default in those cases.
#define NR_THREADS              PLATFORM_LANE_COUNT
#else
#define NR_THREADS              64                                  // default to 64 threads per group on other platforms..
#endif

RWBuffer<uint> g_DispatchIndirectBuffer : register( u0 );   // Indirect arguments have to be in a _buffer_, not a structured buffer
RWStructuredBuffer<uint> g_TileList;
StructuredBuffer<uint> g_TileFeatureFlags;

uniform uint g_NumTiles;
uniform uint g_NumTilesX;

[numthreads(NR_THREADS, 1, 1)]
void BuildIndirect(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    if (dispatchThreadId.x >= g_NumTiles)
        return;

    UNITY_XR_ASSIGN_VIEW_INDEX(dispatchThreadId.z);

    uint featureFlags = g_TileFeatureFlags[dispatchThreadId.x + unity_StereoEyeIndex * g_NumTiles];

    uint tileY = (dispatchThreadId.x + 0.5f) / (float)g_NumTilesX;    // Integer division is extremely expensive, so we better avoid it
    uint tileX = dispatchThreadId.x - tileY * g_NumTilesX;

    // Check if there is no material (means it is a sky/background pixel).
    // Note that we can have no lights, yet we still need to render geometry with precomputed illumination.
    if ((featureFlags & MATERIAL_FEATURE_MASK_FLAGS) != 0)
    {
        uint variant = FeatureFlagsToTileVariant(featureFlags);
        uint tileOffset;
        ...
    }

31862503f634c23dc24ba0590c4cfb3b.png

Modified HDRP's deferred compute shader to tint the debug color per VARIANT

Build Indirect Buffer

Note that deferred shading runs on 8×8 tiles while the FPTL pass above uses 16×16 tiles, so each classified tile adds 4 thread groups to the variant's indirect buffer via InterlockedAdd (4 × 8×8 = 16×16).

//Packages/com.unity.render-pipelines.high-definition@12.1.6/Runtime/Lighting/LightLoop/builddispatchindirect.compute


[numthreads(NR_THREADS, 1, 1)]
void BuildIndirect(uint3 dispatchThreadId : SV_DispatchThreadID)
{
        ...
        uint variant = FeatureFlagsToTileVariant(featureFlags);
        uint tileOffset;

#ifdef IS_DRAWPROCEDURALINDIRECT
        // We are filling up an indirect argument buffer for DrawProceduralIndirect.
        // The buffer contains {vertex count per instance, instance count, start vertex location, and start instance location} = {0, instance count, 0, 0, 0}
        InterlockedAdd(g_DispatchIndirectBuffer[variant * 4 + 1], 1, tileOffset);
#else
        uint prevGroupCnt;

        // We are filling up an indirect argument buffer for DispatchIndirect.
        // The buffer contains {groupCntX, groupCntY, groupCntZ} = {groupCnt, 0, 0}.
        InterlockedAdd(g_DispatchIndirectBuffer[variant * 3 + 0], 4, prevGroupCnt);
        tileOffset = prevGroupCnt / 4; // 4x 8x8 groups per a 16x16 tile
#endif

        // See LightDefinitions class in LightLoop.cs
        uint tileIndex = (unity_StereoEyeIndex << TILE_INDEX_SHIFT_EYE) | (tileY << TILE_INDEX_SHIFT_Y) | (tileX << TILE_INDEX_SHIFT_X);
        // For g_TileList each VR eye is interlaced instead of one eye and then the other. Thus why we use _XRViewCount here
        g_TileList[variant * g_NumTiles * _XRViewCount + tileOffset] = tileIndex;
    }
}

Deferred Compute Shader

With the per-variant indirect buffer filled in, the deferred pass can indirect-dispatch each variant's shading path at a fixed argsOffset into the buffer
(CommandBuffer.DispatchCompute(ComputeShader cs, int kernelIndex, ComputeBuffer indirectBuffer, uint argsOffset)).
With three uints of dispatch arguments per variant, variant i's arguments start at byte offset i * 3 * sizeof(uint), which is exactly what the loop below passes.

static void RenderComputeDeferredLighting(DeferredLightingPassData data, RenderTargetIdentifier[] colorBuffers, CommandBuffer cmd)
{
    using (new ProfilingScope(cmd, ProfilingSampler.Get(HDProfileId.RenderDeferredLightingCompute)))
    {
        ...
        //Iterate over every possible variant
        for (int variant = 0; variant < data.numVariants; variant++)
        {
            int kernel;

            if (data.enableFeatureVariants)
            {
                kernel = s_shadeOpaqueIndirectFptlKernels[variant];
            }
            else
            {
                kernel = data.debugDisplaySettings.IsDebugDisplayEnabled() ? s_shadeOpaqueDirectFptlDebugDisplayKernel : s_shadeOpaqueDirectFptlKernel;
            }

            cmd.SetComputeTextureParam(data.deferredComputeShader, kernel, HDShaderIDs._CameraDepthTexture, data.depthTexture);

            // TODO: Is it possible to setup this outside the loop ? Can figure out how, get this: Property (specularLightingUAV) at kernel index (21) is not set
            cmd.SetComputeTextureParam(data.deferredComputeShader, kernel, HDShaderIDs.specularLightingUAV, colorBuffers[0]);
            cmd.SetComputeTextureParam(data.deferredComputeShader, kernel, HDShaderIDs.diffuseLightingUAV, colorBuffers[1]);
            cmd.SetComputeBufferParam(data.deferredComputeShader, kernel, HDShaderIDs.g_vLightListTile, data.lightListBuffer);

            cmd.SetComputeTextureParam(data.deferredComputeShader, kernel, HDShaderIDs._StencilTexture, data.depthBuffer, 0, RenderTextureSubElement.Stencil);

            // always do deferred lighting in blocks of 16x16 (not same as tiled light size)
            if (data.enableFeatureVariants)
            {
                cmd.SetComputeBufferParam(data.deferredComputeShader, kernel, HDShaderIDs.g_TileFeatureFlags, data.tileFeatureFlagsBuffer);
                cmd.SetComputeIntParam(data.deferredComputeShader, HDShaderIDs.g_TileListOffset, variant * data.numTiles * data.viewCount);
                cmd.SetComputeBufferParam(data.deferredComputeShader, kernel, HDShaderIDs.g_TileList, data.tileListBuffer);
                cmd.DispatchCompute(data.deferredComputeShader, kernel, data.dispatchIndirectBuffer, (uint)variant * 3 * sizeof(uint));
            }
            else
            {
                // 4x 8x8 groups per a 16x16 tile.
                cmd.DispatchCompute(data.deferredComputeShader, kernel, data.numTilesX * 2, data.numTilesY * 2, data.viewCount);
            }
        }
    }
}

Note that kFeatureVariantFlags above is a static uint array.
By compiling the deferred shader once per variant index and guarding each feature with checks such as
if (featureFlags & LIGHTFEATUREFLAGS_DIRECTIONAL) or HasFlag(surfaceData.materialFeatures, MATERIALFEATUREFLAGS_LIT_TRANSMISSION),
the dynamic branches can be removed at compile time.
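
A minimal sketch of the mechanism (not the verbatim HDRP shading code): inside a kernel compiled with a fixed VARIANT, the feature mask is a compile-time constant, so the compiler folds these branches away entirely.

// Minimal sketch, assuming a kernel compiled with a fixed VARIANT define:
uint featureFlags = kFeatureVariantFlags[VARIANT]; // compile-time constant

if (featureFlags & LIGHTFEATUREFLAGS_PUNCTUAL)
{
    // punctual light loop: only compiled into variants whose mask has this bit
}
if (featureFlags & MATERIALFEATUREFLAGS_LIT_ANISOTROPY)
{
    // anisotropic BRDF evaluation: stripped from every other variant
}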

#pragma kernel Deferred_Direct_Fptl                                SHADE_OPAQUE_ENTRY=Deferred_Direct_Fptl
#pragma kernel Deferred_Direct_Fptl_DebugDisplay                   SHADE_OPAQUE_ENTRY=Deferred_Direct_Fptl_DebugDisplay             DEBUG_DISPLAY

// Variant with and without shadowmask
#pragma kernel Deferred_Indirect_Fptl_Variant0      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant0      USE_INDIRECT    VARIANT=0
#pragma kernel Deferred_Indirect_Fptl_Variant1      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant1      USE_INDIRECT    VARIANT=1
#pragma kernel Deferred_Indirect_Fptl_Variant2      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant2      USE_INDIRECT    VARIANT=2
#pragma kernel Deferred_Indirect_Fptl_Variant3      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant3      USE_INDIRECT    VARIANT=3
#pragma kernel Deferred_Indirect_Fptl_Variant4      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant4      USE_INDIRECT    VARIANT=4
#pragma kernel Deferred_Indirect_Fptl_Variant5      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant5      USE_INDIRECT    VARIANT=5
#pragma kernel Deferred_Indirect_Fptl_Variant6      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant6      USE_INDIRECT    VARIANT=6
#pragma kernel Deferred_Indirect_Fptl_Variant7      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant7      USE_INDIRECT    VARIANT=7
#pragma kernel Deferred_Indirect_Fptl_Variant8      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant8      USE_INDIRECT    VARIANT=8
#pragma kernel Deferred_Indirect_Fptl_Variant9      SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant9      USE_INDIRECT    VARIANT=9
#pragma kernel Deferred_Indirect_Fptl_Variant10     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant10     USE_INDIRECT    VARIANT=10
#pragma kernel Deferred_Indirect_Fptl_Variant11     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant11     USE_INDIRECT    VARIANT=11
#pragma kernel Deferred_Indirect_Fptl_Variant12     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant12     USE_INDIRECT    VARIANT=12
#pragma kernel Deferred_Indirect_Fptl_Variant13     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant13     USE_INDIRECT    VARIANT=13
#pragma kernel Deferred_Indirect_Fptl_Variant14     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant14     USE_INDIRECT    VARIANT=14
#pragma kernel Deferred_Indirect_Fptl_Variant15     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant15     USE_INDIRECT    VARIANT=15
#pragma kernel Deferred_Indirect_Fptl_Variant16     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant16     USE_INDIRECT    VARIANT=16
#pragma kernel Deferred_Indirect_Fptl_Variant17     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant17     USE_INDIRECT    VARIANT=17
#pragma kernel Deferred_Indirect_Fptl_Variant18     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant18     USE_INDIRECT    VARIANT=18
#pragma kernel Deferred_Indirect_Fptl_Variant19     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant19     USE_INDIRECT    VARIANT=19
#pragma kernel Deferred_Indirect_Fptl_Variant20     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant20     USE_INDIRECT    VARIANT=20
#pragma kernel Deferred_Indirect_Fptl_Variant21     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant21     USE_INDIRECT    VARIANT=21
#pragma kernel Deferred_Indirect_Fptl_Variant22     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant22     USE_INDIRECT    VARIANT=22
#pragma kernel Deferred_Indirect_Fptl_Variant23     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant23     USE_INDIRECT    VARIANT=23
#pragma kernel Deferred_Indirect_Fptl_Variant24     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant24     USE_INDIRECT    VARIANT=24
#pragma kernel Deferred_Indirect_Fptl_Variant25     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant25     USE_INDIRECT    VARIANT=25
#pragma kernel Deferred_Indirect_Fptl_Variant26     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant26     USE_INDIRECT    VARIANT=26
#pragma kernel Deferred_Indirect_Fptl_Variant27     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant27     USE_INDIRECT    VARIANT=27
#pragma kernel Deferred_Indirect_Fptl_Variant28     SHADE_OPAQUE_ENTRY=Deferred_Indirect_Fptl_Variant28     USE_INDIRECT    VARIANT=28

...
#ifdef USE_INDIRECT

StructuredBuffer<uint> g_TileList;
// Indirect
[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void SHADE_OPAQUE_ENTRY(uint2 groupThreadId : SV_GroupThreadID, uint groupId : SV_GroupID)
{
    ...
}
#endif

Screen Space Reflection

As mentioned above, HDRP does not classify tiles into variants for SSRefraction and SSReflection,
possibly because classifying on SSReflection would fragment the tiles too much.
But it was the inspiration for this: while building the material flags, we can also produce an indirect buffer for SSReflection.
Along the way I will also go over some other tricks for accelerating SSR.

Hi-Z Accelerated Intersection

First, the most common acceleration: Hi-Z. HDRP already implements this part (so it can be lifted wholesale).
Here is a brief walkthrough of the process.

Depth Pyramid

Hi-Z first needs a depth pyramid generated from the camera depth buffer, which the ray trace uses to decide whether to step down to the next mip level.
The UV offset of each hierarchy level is determined the same way as for the color pyramid.
(MipGenerator: https://www.cnblogs.com/OneStargazer/p/18150428)

#include "Packages/com.unity.render-pipelines.core/ShaderLibrary/Common.hlsl"
#pragma only_renderers d3d11 playstation xboxone xboxseries vulkan metal switch

#pragma kernel KDepthDownsample8DualUav  KERNEL_SIZE=8  KERNEL_NAME=KDepthDownsample8DualUav

RW_TEXTURE2D(float, _DepthMipChain);

CBUFFER_START(cb)
uint4 _SrcOffsetAndLimit; // {x, y, w - 1, h - 1}
uint4 _DstOffset; // {x, y, 0, 0}
CBUFFER_END

#if UNITY_REVERSED_Z
# define MIN_DEPTH(l, r) max(l, r)
#else
# define MIN_DEPTH(l, r) min(l, r)
#endif

// Downsample a depth texture by taking the min value of sampled pixels
// The size of the dispatch is (DstMipSize / KernelSize).
[numthreads(KERNEL_SIZE, KERNEL_SIZE, 1)]
void KERNEL_NAME(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint2 srcOffset = _SrcOffsetAndLimit.xy;
    uint2 srcLimit = _SrcOffsetAndLimit.zw;
    uint2 dstOffset = _DstOffset.xy;

    // Upper-left pixel coordinate of quad that this thread will read
    uint2 srcPixelUL = srcOffset + (dispatchThreadId.xy << 1);

    float p00 = _DepthMipChain[(min(srcPixelUL + uint2(0u, 0u), srcLimit))];
    float p10 = _DepthMipChain[(min(srcPixelUL + uint2(1u, 0u), srcLimit))];
    float p01 = _DepthMipChain[(min(srcPixelUL + uint2(0u, 1u), srcLimit))];
    float p11 = _DepthMipChain[(min(srcPixelUL + uint2(1u, 1u), srcLimit))];
    float4 depths = float4(p00, p10, p01, p11);

    // Select the closest sample
    float minDepth = MIN_DEPTH(MIN_DEPTH(depths.x, depths.y), MIN_DEPTH(depths.z, depths.w));

    _DepthMipChain[(dstOffset + dispatchThreadId.xy)] = minDepth;
}

a6630a349a21f19127b87a5f699c4b4a.png

Screen Space Reflections Tracing

Before tracing begins, rays that need no work at all are killed off.

Texture2D<uint2> _StencilTexture;

#define NR_THREADS              8

[numthreads(NR_THREADS,NR_THREADS,1)]
void ScreenSpaceReflectionsTracing(uint3 dispatchThreadId : SV_DispatchThreadID, uint2 groupThreadId : SV_GroupThreadID, uint groupId : SV_GroupID)
{
    uint2 positionSS = dispatchThreadId;

    //Stencil test: only pixels whose stencil has the TraceReflectionRay bit receive SSR.
    //_SsrStencilBit is set to (1 << 3) by the pipeline code at the end of this post.
    uint stencilValue = GetStencilValue(LOAD_TEXTURE2D(_StencilTexture, positionSS));
    bool noReceiveSSR = (stencilValue & _SsrStencilBit) == 0;

    if (noReceiveSSR)
         return;

    float deviceDepth = LOAD_TEXTURE2D(_CameraDepthTexture, positionSS).r;

    if (deviceDepth < FLT_EPS)
        return;
}

Then decide whether a ray is worth tracing further:
check that the first reflected sample's screen-space position lies inside the NDC range, and
check that the reflection point's NdotV and roughness are within their thresholds.


[numthreads(NR_THREADS,NR_THREADS,1)]
void ScreenSpaceReflectionsTracing(uint3 dispatchThreadId : SV_DispatchThreadID, uint2 groupThreadId : SV_GroupThreadID, uint groupId : SV_GroupID)
{
    ...

    NormalData normalData;
    ZERO_INITIALIZE(NormalData, normalData);
    GetNormalData(positionSS, normalData);

    float2 positionNDC = positionSS * _ScreenSize.zw + (0.5 * _ScreenSize.zw); // Should we precompute the half-texel bias? We seem to use it a lot.
    float3 positionWS = ComputeWorldSpacePosition(positionNDC, deviceDepth, UNITY_MATRIX_I_VP); // Jittered
    float3 V = GetWorldSpaceNormalizeViewDir(positionWS);

    //Read the normal and perceptualRoughness from the depth/normal buffer; compute NdotV and test perceptualRoughness against the threshold
    float3 N = normalData.normalWS;
    float perceptualRoughness = normalData.perceptualRoughness;
    float3 R = reflect(-V, N);

    float3 cameraPositionWS = GetCurrentViewPosition();

    float bias = (1.0f - 0.001f * rcp(max(dot(N, V), FLT_EPS)));
    positionWS = cameraPositionWS + (positionWS - cameraPositionWS) * bias;
    deviceDepth = ComputeNormalizedDeviceCoordinatesWithZ(positionWS, UNITY_MATRIX_VP).z;

    float3 rayOrigin = float3(positionSS + 0.5, deviceDepth);

    //Check whether the first reflected sample's screen-space position falls inside the NDC range
    float3 reflPosWS = positionWS + R;
    float3 reflPosNDC = ComputeNormalizedDeviceCoordinatesWithZ(reflPosWS,UNITY_MATRIX_VP);
    float3 reflPosSS = float3(reflPosNDC.xy * _ScreenSize.xy, reflPosNDC.z);
    float3 rayDir = reflPosSS - rayOrigin;

    float3 rcpRayDir = rcp(rayDir);

    int2 rayStep = int2(rcpRayDir.x >= 0 ? 1 : 0,
                        rcpRayDir.y >= 0 ? 1 : 0);

    float3 raySign = float3(rcpRayDir.x >= 0 ? 1 : -1,
                            rcpRayDir.y >= 0 ? 1 : -1,
                            rcpRayDir.z >= 0 ? 1 : -1);


    bool killRay = (reflPosSS.z <= 0);
    killRay = killRay || (dot(N, V) <= 0);
    killRay = killRay || (perceptualRoughness > _SsrRoughnessFadeEnd);
    if (killRay)
    {
        return;
    }
...
}

Tracing

Before the actual marching starts, the ray's range is clamped (tMax) so that t can never leave the screen; bounds.z is adjusted depending on whether skybox pixels may be reflected (_SsrReflectsSky).

The traversal rules (mirroring the "game rules" comment in the code below):
if the closest intersection is with the wall of the cell, switch to the coarser mip and advance the ray (mip++);
if the closest intersection is with the heightmap below, switch to the finer mip and advance the ray (mip--);
if the closest intersection is with the heightmap above, switch to the finer mip and do NOT advance the ray (mip--).

The marching process, step by step:
01.png
02.png
03.png
04.png
05.png
06.png

[numthreads(NR_THREADS,NR_THREADS,1)]
void ScreenSpaceReflectionsTracing(uint3 dispatchThreadId : SV_DispatchThreadID, uint2 groupThreadId : SV_GroupThreadID, uint groupId : SV_GroupID)
{
    ...

    float bias = (1.0f - 0.001f * rcp(max(dot(N, V), FLT_EPS)));
    positionWS = cameraPositionWS + (positionWS - cameraPositionWS) * bias;
    deviceDepth = ComputeNormalizedDeviceCoordinatesWithZ(positionWS, UNITY_MATRIX_VP).z;

    float3 rayOrigin = float3(positionSS + 0.5, deviceDepth);

    float3 reflPosWS = positionWS + R;
    float3 reflPosNDC = ComputeNormalizedDeviceCoordinatesWithZ(reflPosWS,UNITY_MATRIX_VP);
    float3 reflPosSS = float3(reflPosNDC.xy * _ScreenSize.xy, reflPosNDC.z);
    float3 rayDir = reflPosSS - rayOrigin;

    float3 rcpRayDir = rcp(rayDir);

    int2 rayStep = int2(rcpRayDir.x >= 0 ? 1 : 0,
                        rcpRayDir.y >= 0 ? 1 : 0);

    float3 raySign = float3(rcpRayDir.x >= 0 ? 1 : -1,
                            rcpRayDir.y >= 0 ? 1 : -1,
                            rcpRayDir.z >= 0 ? 1 : -1);


    ...
    //Clamp the ray-marching range (tMax)
    float tMax;
    {
        const float halfTexel = 0.5f;
        float3 bounds;
        bounds.x = (rcpRayDir.x >= 0) ? _ScreenSize.x - halfTexel : halfTexel;
        bounds.y = (rcpRayDir.y >= 0) ? _ScreenSize.y - halfTexel : halfTexel;

        float maxDepth = (_SsrReflectsSky != 0) ? -0.00000024 : 0.00000024;
        bounds.z = (rcpRayDir.z >= 0) ? 1 : maxDepth;

        float3 dist = bounds * rcpRayDir - (rayOrigin * rcpRayDir);
        tMax = Min3(dist.x, dist.y, dist.z);
    }

    const int maxMipLevel = min(_SsrDepthPyramidMaxMip, 14);

    // Start ray marching from the next texel to avoid self-intersections.
    float t;
    {
        // 'rayOrigin' is the exact texel center, so min(|0.5 / rayDir.x|, |0.5 / rayDir.y|)
        // is the parametric distance to the nearest texel boundary; advancing by it
        // moves the first sample into the next texel rather than the origin's own.
        float2 dist = abs(0.5 * rcpRayDir.xy);
        t = min(dist.x, dist.y);
    }

    float3 rayPos;

    int mipLevel = 0;
    int iterCount = 0;
    bool hit = false;
    bool miss = false;
    bool belowMip0 = false; // This value is set prior to entering the cell

    //Keep looping while the ray has neither hit nor missed,
    //t is still within tMax, and the iteration limit is not exhausted
    while (!(hit || miss) && (t <= tMax) && (iterCount < _SsrIterLimit))
    {
        rayPos = rayOrigin + t * rayDir;

        #define SSR_TRACE_EPS               0.000488281f // 2^-11, should be good up to 4K

        // Ray position often ends up on the edge. To determine (and look up) the right cell,
        // we need to bias the position by a small epsilon in the direction of the ray.
        float2 sgnEdgeDist = round(rayPos.xy) - rayPos.xy;
        float2 satEdgeDist = clamp(raySign.xy * sgnEdgeDist + SSR_TRACE_EPS, 0, SSR_TRACE_EPS);
        rayPos.xy += raySign.xy * satEdgeDist;


        int2 mipCoord = (int2)rayPos.xy >> mipLevel;
        int2 mipOffset = _DepthPyramidMipLevelOffsets[mipLevel];
        // Bounds define 4 faces of a cube:
        // 2 walls in front of the ray, and a floor and a base below it.
        float4 bounds;

        bounds.xy = (mipCoord + rayStep) << mipLevel;
        bounds.z = LOAD_TEXTURE2D(_CameraDepthTexture, mipOffset + mipCoord).r;

        // Thicken the current cell's depth bound (bounds.w) so that rays that would narrowly miss are counted as hits
        // We define the depth of the base as the depth value as:
        // b = DeviceDepth((1 + thickness) * LinearDepth(d))
        // b = ((f - n) * d + n * (1 - (1 + thickness))) / ((f - n) * (1 + thickness))
        // b = ((f - n) * d - n * thickness) / ((f - n) * (1 + thickness))
        // b = d / (1 + thickness) - n / (f - n) * (thickness / (1 + thickness))
        // b = d * k_s + k_b
        bounds.w = bounds.z * _SsrThicknessScale + _SsrThicknessBias;

        float4 dist = bounds * rcpRayDir.xyzz - (rayOrigin.xyzz * rcpRayDir.xyzz);
        float distWall = min(dist.x, dist.y);
        float distFloor = dist.z;
        float distBase = dist.w;

        // Note: 'rayPos' given by 't' can correspond to one of several depth values:
        // - above or exactly on the floor
        // - inside the floor (between the floor and the base)
        // - below the base

        // #if 0
        // bool belowFloor  = (raySign.z * (t - distFloor)) <  0;
        // bool aboveBase   = (raySign.z * (t - distBase )) >= 0;
        // #else
        bool belowFloor = rayPos.z < bounds.z;
        bool aboveBase = rayPos.z >= bounds.w;
        // #endif


        bool insideFloor = belowFloor && aboveBase;
        bool hitFloor = (t <= distFloor) && (distFloor <= distWall);

        // Game rules:
        // * if the closest intersection is with the wall of the cell, switch to the coarser MIP, and advance the ray. (mip++)
        // * if the closest intersection is with the heightmap below, switch to the finer MIP, and advance the ray. (mip--)
        // * if the closest intersection is with the heightmap above, switch to the finer MIP, and do NOT advance the ray. (mip--)
        // Victory conditions:
        // * See below. Do NOT reorder the statements!

        //#ifdef SSR_TRACE_BEHIND_OBJECTS
        miss = belowMip0 && insideFloor;
        //#else
        //miss      = belowMip0;
        //#endif
        hit = (mipLevel == 0) && (hitFloor || insideFloor); // only at mip 0 (the finest level) does a floor intersection count as a hit
        belowMip0 = (mipLevel == 0) && belowFloor;

        // 'distFloor' can be smaller than the current distance 't'.
        // We can also safely ignore 'distBase'.
        // If we hit the floor, it's always safe to jump there.
        // If we are at (mipLevel != 0) and we are below the floor, we should not move.
        t = hitFloor ? distFloor : (((mipLevel != 0) && belowFloor) ? t : distWall);
        rayPos.z = bounds.z; // Retain the depth of the potential intersection

        // Warning: both rays towards the eye, and tracing behind objects has linear
        // rather than logarithmic complexity! This is due to the fact that we only store
        // the maximum value of depth, and not the min-max.
        mipLevel += (hitFloor || belowFloor || rayTowardsEye) ? -1 : 1;
        mipLevel = clamp(mipLevel, 0, maxMipLevel);

        iterCount++;
    }

    miss = miss || ((_SsrReflectsSky == 0) && (rayPos.z == 0));
    hit = hit && !miss;

    if (hit)
    {
        // Note that we are using 'rayPos' from the penultimate iteration, rather than
        // recompute it using the last value of 't', which would result in an overshoot.
        // It also needs to be precisely at the center of the pixel to avoid artifacts.
        float2 hitPositionNDC = floor(rayPos.xy) * _ScreenSize.zw + (0.5 * _ScreenSize.zw); // Should we precompute the half-texel bias? We seem to use it a lot.
        _SsrHitPointTexture[(positionSS)] = hitPositionNDC.xy;
    }

}

Frame-Split Accumulation

Full-screen ray marching is still fairly expensive, so my first instinct was to borrow the checkerboard frame-splitting idea from Red Dead Redemption 2's volumetric cloud rendering and spread the ray-march cost across four frames.
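
Concretely (the GetPositionSS helper in the Reprojection listing below implements this): the tracing kernel is dispatched at half resolution per axis, so each thread owns a 2×2 pixel quad and traces exactly one of its four pixels per frame.

// Sketch of the 4-frame rotation, given ssrFrameIndex = cameraFrameIndex % 4:
//   a quad whose groupThreadId has even/even (or odd/odd) parity traces
//     frame 0 -> (0,1), frame 1 -> (1,1), frame 2 -> (1,0), frame 3 -> (0,0);
//   mixed-parity quads start one step ahead in the same cycle, so each
//   frame's traced pixels form a checkerboard across neighbouring quads.
// After four frames every pixel has been traced exactly once.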

Reprojection

The reprojection step uses the motion vectors to reproject hitPositionNDC to that pixel's position in the previous frame (prevFrameNDC),
then samples the previous frame's _ColorPyramidTexture to produce this frame's reflection result.
The biggest benefit is that SSR can then be moved onto async compute, improving GPU utilization: https://zhuanlan.zhihu.com/p/425830762
For motion vector generation, see my previous article: https://www.cnblogs.com/OneStargazer/p/18139671
ComputeEngine.png

// Tweak parameters.
// #define DEBUG
#define SSR_TRACE_BEHIND_OBJECTS
#define SSR_TRACE_TOWARDS_EYE
#ifndef SSR_APPROX
#define SAMPLES_VNDF
#endif
#define SSR_TRACE_EPS               0.000488281f // 2^-11, should be good up to 4K
#define MIN_GGX_ROUGHNESS           0.00001f
#define MAX_GGX_ROUGHNESS           0.99999f

Texture2D<uint2> _StencilTexture;
RW_TEXTURE2D(float2, _SsrHitPointTexture);
RW_TEXTURE2D(float4, _SsrAccumPrev);
RW_TEXTURE2D(float4, _SsrLightingTextureRW);
RW_TEXTURE2D(float4, _SSRAccumTexture);


StructuredBuffer<int2> _DepthPyramidMipLevelOffsets;
StructuredBuffer<uint> _CoarseStencilBuffer;

uint2 GetPositionSS(uint2 dispatchThreadId, uint2 groupThreadId)
{
    // Each thread owns a 2x2 pixel quad; exactly one of these four offsets is traced per frame.
    const uint2 offset[4] = {uint2(0, 1), uint2(1, 1), uint2(1, 0), uint2(0, 0)};
    // Quads with mixed x/y thread parity start one step ahead in the cycle,
    // turning the per-frame pattern into a checkerboard.
    uint indexOffset = groupThreadId.x % 2 == 1 || groupThreadId.y % 2 == 1 ? 1 : 0;
    if (groupThreadId.x % 2 == 1 && groupThreadId.y % 2 == 1)
        indexOffset = 0;
    // Rotate through the four quad pixels over four frames.
    indexOffset = (ssrFrameIndex + indexOffset) % 4;
    return dispatchThreadId.xy * 2 + offset[indexOffset];
}

[numthreads(8,8,1)]
void ScreenSpaceReflectionsReprojection(uint3 dispatchThreadId : SV_DispatchThreadID, uint2 groupThreadId : SV_GroupThreadID, uint groupId : SV_GroupID)
{
    const uint2 positionSS = GetPositionSS(dispatchThreadId.xy, groupThreadId);

    float3 N;
    float perceptualRoughness;
    GetNormalAndPerceptualRoughness(positionSS, N, perceptualRoughness);


    float2 hitPositionNDC = LOAD_TEXTURE2D(_SsrHitPointTexture, positionSS).xy;
    if (max(hitPositionNDC.x, hitPositionNDC.y) == 0)
        return;


    // TODO: this texture is sparse (mostly black). Can we avoid reading every texel? How about using Hi-S?
    float2 motionVectorNDC;
    float4 motionVectorBufferValue = SAMPLE_TEXTURE2D_LOD(_CameraMotionVectorsTexture, s_linear_clamp_sampler, min(hitPositionNDC, 1.0f - 0.5f * _ScreenSize.zw) * _RTHandleScale.xy, 0);
    DecodeMotionVector(motionVectorBufferValue, motionVectorNDC);
    float2 prevFrameNDC = hitPositionNDC - motionVectorNDC;
    float2 prevFrameUV = prevFrameNDC * _ColorPyramidUvScaleAndLimitPrevFrame.xy;

    // Reject off-screen samples. Note min() here, not max(): the sample is
    // off-screen if either UV coordinate is negative.
    if (min(prevFrameUV.x, prevFrameUV.y) < 0 ||
        (prevFrameUV.x > _ColorPyramidUvScaleAndLimitPrevFrame.z) ||
        (prevFrameUV.y > _ColorPyramidUvScaleAndLimitPrevFrame.w)
    )
    {
        // Off-Screen.
        return;
    }

    float opacity = EdgeOfScreenFade(prevFrameNDC, _SsrEdgeFadeRcpLength)
        * PerceptualRoughnessFade(perceptualRoughness, _SsrRoughnessFadeRcpLength, _SsrRoughnessFadeEndTimesRcpLength);

    // TODO: filtering is quite awful. Needs to be non-Gaussian, bilateral and anisotropic.
    float mipLevel = lerp(0, _SsrColorPyramidMaxMip, perceptualRoughness);

    // Note that the color pyramid uses it's own viewport scale, since it lives on the camera.
    float3 color = SAMPLE_TEXTURE2D_LOD(_ColorPyramidTexture, s_trilinear_clamp_sampler, prevFrameUV, mipLevel).rgb;

    // Disable SSR for negative, infinite and NaN history values.
    uint3 intCol = asuint(color);
    bool isPosFin = Max3(intCol.r, intCol.g, intCol.b) < 0x7F800000;

    color = isPosFin ? color : 0;
    opacity = isPosFin ? opacity : 0;

    _SSRAccumTexture[positionSS] = float4(color, 1.0f) * opacity;
}

Accumulate

Reprojection above produced one quarter of this frame's SSR result (_SSRAccumTexture); the remaining three quarters are filled in from the history frame (_SsrAccumPrev).
The accumulated result is finally written to _SsrLightingTextureRW.

To suppress ghosting from large motion, the motion vectors are read again: if the current pixel's motion vector is too large, the history contribution is simply cut off.
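
The blend itself is an exponential moving average; a worked example using the constant-buffer setup from the pipeline code at the end of this post:

// With a static pixel (speed == 0) and _SsrAccumulationAmount = 2^-4
// (the amount is 2^lerp(0, -7, accumulationFactor), so factor = 4/7 here):
//   coefExpAvg = lerp(2^-4, 1.0, 0.0) = 1/16
//   result     = lerp(previous, original, 1/16)   // keeps 15/16 of the history
// Any motion drives coefExpAvg toward 1, discarding the history and falling
// back to the current frame's quarter-resolution trace.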

[numthreads(8,8,1)]
void ScreenSpaceReflectionsAccumulate(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint2 positionSS = dispatchThreadId.xy;
    float2 hitPositionNDC = LOAD_TEXTURE2D(_SsrHitPointTexture, positionSS).xy;

    //(x,y) current frame (z,w) last frame (this is only used for buffered RTHandle Systems)
    float2 prevHistoryScale = _RTHandleScaleHistory.zw / _RTHandleScaleHistory.xy;

    float4 original = _SSRAccumTexture[(positionSS)];
    float4 previous = _SsrAccumPrev[(positionSS * prevHistoryScale + 0.5f / prevHistoryScale)];

    float2 motionVectorNDC;
    float4 motionVectorBufferValue = SAMPLE_TEXTURE2D_LOD(_CameraMotionVectorsTexture, s_linear_clamp_sampler, min(hitPositionNDC, 1.0f - 0.5f * _ScreenSize.zw) * _RTHandleScale.xy, 0);
    DecodeMotionVector(motionVectorBufferValue, motionVectorNDC);
    float speedDst = length(motionVectorNDC);

    float2 motionVectorCenterNDC;
    float2 positionNDC = positionSS * _ScreenSize.zw + (0.5 * _ScreenSize.zw);
    float4 motionVectorCenterBufferValue = SAMPLE_TEXTURE2D_LOD(_CameraMotionVectorsTexture, s_linear_clamp_sampler, min(positionNDC, 1.0f - 0.5f * _ScreenSize.zw) * _RTHandleScale.xy, 0);
    DecodeMotionVector(motionVectorCenterBufferValue, motionVectorCenterNDC);
    float speedSrc = length(motionVectorCenterNDC);
    float speed = saturate((speedDst + speedSrc) * 128.0f);

    float coefExpAvg = lerp(_SsrAccumulationAmount, 1.0f, speed);

    float4 result = lerp(previous, original, coefExpAvg);

    uint3 intCol = asuint(result.rgb);

    bool isPosFin = Max3(intCol.r, intCol.g, intCol.b) < 0x7F800000;

    result.rgb = isPosFin ? result.rgb : 0;
    result.w = isPosFin ? result.w : 0;

    _SsrLightingTextureRW[positionSS] = result;
    _SSRAccumTexture[positionSS] = result;
}

Indirect Buffer

Mirroring the material feature flag flow described earlier, we can determine ahead of time whether each tile contains any pixel worth firing a ray for (evaluating the pixel's roughness and NdotV, the same test the tracing kernel uses to kill rays), and use that to build the indirect buffer for the SSR dispatch.
Note that because frame-split SSR runs only 1/4 of the threads, a 16×16 tile now adds one 8×8 thread group instead of four.
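
The group-count arithmetic, spelled out:

// Full-rate deferred shading: a 16x16 tile = 256 px -> 4 thread groups of 8x8.
// Checkerboard SSR tracing:   256 px / 4 traced per frame = 64 px -> 1 group of 8x8.
// Hence the InterlockedAdd in the listing below adds 1 per qualifying tile, not 4.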

RWBuffer<uint> g_SSRDispatchIndirectBuffer : register( u0 );
RWStructuredBuffer<uint> g_SSRTileList;

float _SsrRoughnessFadeEnd;

groupshared uint ldsNeedSSR;

// Returns true when the pixel is worth firing an SSR ray for
// (roughness within the fade range and NdotV > 0).
bool needsSSR(uint2 positionSS, float deviceDepth)
{
    float4 inGBuffer1 = LOAD_TEXTURE2D(_GBufferTexture1, positionSS);
    NormalData normalData;
    ZERO_INITIALIZE(NormalData, normalData);
    DecodeFromNormalBuffer(inGBuffer1, normalData);

    float2 positionNDC = positionSS * _ScreenSize.zw + (0.5 * _ScreenSize.zw); // Should we precompute the half-texel bias? We seem to use it a lot.
    float3 positionWS = ComputeWorldSpacePosition(positionNDC, deviceDepth, UNITY_MATRIX_I_VP); // Jittered
    float3 V = GetWorldSpaceNormalizeViewDir(positionWS);

    float3 N = normalData.normalWS;


    if (normalData.perceptualRoughness > _SsrRoughnessFadeEnd || dot(N, V) <= 0)
        return false;
    return true;
}

[numthreads(NR_THREADS, 1, 1)]
void BuildMaterialFlag(uint3 dispatchThreadId : SV_DispatchThreadID, uint threadID : SV_GroupIndex, uint3 u3GroupID : SV_GroupID)
{
    uint2 tileIDX = u3GroupID.xy;

    uint iWidth = g_viDimensions.x;
    uint iHeight = g_viDimensions.y;
    uint nrTilesX = (iWidth + (TILE_SIZE_FPTL - 1)) / TILE_SIZE_FPTL;
    uint nrTilesY = (iHeight + (TILE_SIZE_FPTL - 1)) / TILE_SIZE_FPTL;

    // 64 threads * 4 pixels per thread = 256 = 16 * 16; each thread processes a group of 4 pixels
    uint2 viTilLL = 16 * tileIDX;

    float2 invScreenSize = float2(1.0 / iWidth, 1.0 / iHeight);

    if (threadID == 0)
    {
        ldsFeatureFlags = 0;
        ldsNeedSSR = false;
    }
    GroupMemoryBarrierWithGroupSync();


    uint materialFeatureFlags = g_BaseFeatureFlags; // Contain all lightFeatures or 0 (depends if we enable light classification or not)
    bool tileNeedSSR = false;

    UNITY_UNROLL
    for (int i = 0; i < 4; i++)
    {
        int idx = i * NR_THREADS + threadID;
        uint2 positionSS = min(uint2(viTilLL.x + (idx & 0xf), viTilLL.y + (idx >> 4)), uint2(iWidth - 1, iHeight - 1));
        float depth = FetchDepth(positionSS);
        if (depth < VIEWPORT_SCALE_Z)
            materialFeatureFlags |= MaterialFeatureFlagsFromGBuffer(positionSS);

        tileNeedSSR = tileNeedSSR || needsSSR(positionSS, depth);
    }

    InterlockedOr(ldsFeatureFlags, materialFeatureFlags); //TODO: driver might optimize this or we might have to do a manual reduction
    InterlockedOr(ldsNeedSSR, tileNeedSSR);
    GroupMemoryBarrierWithGroupSync();

    if (threadID == 0)
    {
        uint tileIndex = tileIDX.y * nrTilesX + tileIDX.x;

        // TODO: shouldn't this always enabled?
        #if defined(UNITY_STEREO_INSTANCING_ENABLED)
        tileIndex += unity_StereoEyeIndex * nrTilesX * nrTilesY;
        #endif

        g_TileFeatureFlags[tileIndex] |= ldsFeatureFlags;

        // ldsFeatureFlags differs from g_BaseFeatureFlags only if some pixel in the
        // tile added material bits, i.e. the tile contains deferred-lit geometry.
        if (ldsFeatureFlags != g_BaseFeatureFlags && ldsNeedSSR)
        {
            const uint unity_StereoEyeIndex = 0;
            const uint _XRViewCount = 1;

            uint tileOffset;
            uint prevGroupCnt;
            InterlockedAdd(g_SSRDispatchIndirectBuffer[0], 1, prevGroupCnt);
            tileOffset = prevGroupCnt; // 1x 8x8 groups per a 16x16 tile

            // uint tileY = (dispatchThreadId.x + 0.5f) / (float)g_NumTilesX; // Integer division is extremely expensive, so we better avoid it
            // uint tileX = dispatchThreadId.x - tileY * g_NumTilesX;
            //
            //
            uint tileY = tileIDX.y;
            uint tileX = tileIDX.x;
            // See LightDefinitions class in LightLoop.cs
            uint tileIndex = (unity_StereoEyeIndex << TILE_INDEX_SHIFT_EYE) | (tileY << TILE_INDEX_SHIFT_Y) | (tileX << TILE_INDEX_SHIFT_X);
            // For g_TileList each VR eye is interlaced instead of one eye and then the other. Thus why we use _XRViewCount here

            g_SSRTileList[tileOffset] = tileIndex;
        }
    }
}


Pipeline Code Reference: the SSR Computation
//SSR
TextureHandle RenderSSR(RenderGraph renderGraph,
    HQCamera hqCamera,
    ref PrepassOutput prepassOutput,
    ref BuildGPULightListOutput buildGPULightListOutput,
    Texture skyTexture,
    bool transparent = false)
{
    void UpdateSSRConstantBuffer(HQCamera camera, ScreenSpaceReflection settings, ref ShaderVariablesScreenSpaceReflection cb)
    {
        float n = camera.camera.nearClipPlane;
        float f = camera.camera.farClipPlane;
        float thickness = settings.depthBufferThickness.value;

        cb._SsrThicknessScale = 1.0f / (1.0f + thickness);
        cb._SsrThicknessBias = -n / (f - n) * (thickness * cb._SsrThicknessScale);
        cb._SsrIterLimit = settings.RayMaxIterations;
        cb._SsrReflectsSky = 1;
        cb._SsrStencilBit = 1 << 3; //(int)StencilUsage.TraceReflectionRay;
        float roughnessFadeStart = 1 - settings.smoothnessFadeStart;
        cb._SsrRoughnessFadeEnd = 1 - settings.m_MinSmoothness.value;
        float roughnessFadeLength = cb._SsrRoughnessFadeEnd - roughnessFadeStart;
        cb._SsrRoughnessFadeEndTimesRcpLength = (roughnessFadeLength != 0) ? (cb._SsrRoughnessFadeEnd * (1.0f / roughnessFadeLength)) : 1;
        cb._SsrRoughnessFadeRcpLength = (roughnessFadeLength != 0) ? (1.0f / roughnessFadeLength) : 0;
        cb._SsrEdgeFadeRcpLength = Mathf.Min(1.0f / settings.screenFadeDistance.value, float.MaxValue);
        cb._ColorPyramidUvScaleAndLimitPrevFrame =
            HQUtils.ComputeViewportScaleAndLimit(camera.historyRTHandleProperties.previousViewportSize, camera.historyRTHandleProperties.previousRenderTargetSize);
        cb._SsrColorPyramidMaxMip = camera.colorPyramidHistoryMipCount - 1;
        cb._SsrDepthPyramidMaxMip = camera.DepthBufferMipChainInfo.mipLevelCount - 1;
        if (camera.isFirstFrame || camera.cameraFrameIndex <= 3)
            cb._SsrAccumulationAmount = 1.0f;
        else
            cb._SsrAccumulationAmount = Mathf.Pow(2, Mathf.Lerp(0.0f, -7.0f, settings.accumulationFactor.value));

        cb.ssrFrameIndex = camera.cameraFrameIndex % 4;
    }

    if (hqCamera.camera.cameraType == CameraType.Preview)
        return renderGraph.defaultResources.blackTextureXR;

    RTHandle colorPyramidRT = hqCamera.GetPreviousFrameHistoryRT((int) HQCameraHistoryRTType.ColorBufferMipChain);
    if (colorPyramidRT == null)
        return renderGraph.defaultResources.blackTextureXR;

    RTHandle depthPyramidRT = hqCamera.GetCurrentFrameHistoryRT((int) HQCameraHistoryRTType.DepthPyramid);


    using (var builder = renderGraph.AddRenderPass<RenderSSRPassData>("Render SSR", out var passData))
    {
        passData.screenSpaceReflectionsCS = m_ScreenSpaceReflectionsCS;
        passData.tracingKernel = m_SsrTracingKernel;
        passData.reprojectionKernel = m_SsrReprojectionKernel;
        passData.accumulateKernel = m_SsrAccumulateKernel;
        passData.width = hqCamera.actualWidth;
        passData.height = hqCamera.actualHeight;
        passData.viewCount = hqCamera.viewCount;

        int w = hqCamera.actualWidth;
        int h = hqCamera.actualHeight;
        int numTilesX = (w + 15) / 16;
        int numTilesY = (h + 15) / 16;
        int numTiles = numTilesX * numTilesY;

        UpdateSSRConstantBuffer(hqCamera, screenSpaceReflection, ref passData.cb);

        hqCamera.DepthBufferMipChainInfo.GetOffsetBufferData(m_DepthPyramidMipLevelOffsetsBuffer);
        passData.offsetBufferData = builder.ReadComputeBuffer(renderGraph.ImportComputeBuffer(m_DepthPyramidMipLevelOffsetsBuffer));
        passData.ssrIndirectBuffer = builder.ReadComputeBuffer(buildGPULightListOutput.SSRDispatchIndirectBuffer);
        passData.ssrTileIndexBuffer = builder.ReadComputeBuffer(buildGPULightListOutput.SSRTileList);

        passData.colorPyramid = builder.ReadTexture(renderGraph.ImportTexture(colorPyramidRT));
        passData.depthPyramid = builder.ReadTexture(renderGraph.ImportTexture(depthPyramidRT));
        passData.normalBuffer = builder.ReadTexture(prepassOutput.normalBuffer);
        passData.stencilBuffer = builder.ReadTexture(prepassOutput.depthBuffer);
        passData.motionVectorsBuffer = builder.ReadTexture(prepassOutput.motionVectorsBuffer);
        passData.hitPointsTexture = builder.CreateTransientTexture(new TextureDesc(Vector2.one, true, false)
        {
            colorFormat = GraphicsFormat.R16G16_UNorm, clearBuffer = true, clearColor = Color.clear, enableRandomWrite = true,
            name = transparent ? "SSR_Hit_Point_Texture_Trans" : "SSR_Hit_Point_Texture"
        });
        passData.lightingTexture = builder.WriteTexture(renderGraph.CreateTexture(new TextureDesc(Vector2.one, false, false)
            {colorFormat = GraphicsFormat.R16G16B16A16_SFloat, clearBuffer = true, clearColor = Color.clear, enableRandomWrite = true, name = "SSR_Lighting_Texture"}));

        passData.ssrAccum = builder.WriteTexture(renderGraph.ImportTexture(hqCamera.GetCurrentFrameHistoryRT((int) HQCameraHistoryRTType.ScreenSpaceReflectionAccumulation)));
        passData.ssrAccumPrev = builder.WriteTexture(renderGraph.ImportTexture(hqCamera.GetPreviousFrameHistoryRT((int) HQCameraHistoryRTType.ScreenSpaceReflectionAccumulation)));
        passData.validColorPyramid = hqCamera.cameraFrameIndex > 1;
        passData.ssrUseIndirect = pipelineQualitySetting.renderPathSetting.indirectSSR;
        passData.tileListOffset = 0 * numTiles;
        CoreUtils.SetKeyword(m_ScreenSpaceReflectionsCS, "USE_INDIRECT", passData.ssrUseIndirect);

        builder.AllowPassCulling(false);
        builder.SetRenderFunc((RenderSSRPassData data, RenderGraphContext context) =>
        {
            CommandBuffer cmd = context.cmd;
            ComputeShader cs = data.screenSpaceReflectionsCS;
            using (new ProfilingScope(cmd, ProfilingSampler.Get(ProfileIDList.SSRTraing)))
            {
                ConstantBuffer.Push(cmd, data.cb, cs, ShaderIDs._ShaderVariablesScreenSpaceReflection);

                cmd.SetComputeTextureParam(cs, data.tracingKernel, ShaderIDs._CameraDepthTexture, data.depthPyramid);
                cmd.SetComputeTextureParam(cs, data.tracingKernel, ShaderIDs._NormalBufferTexture, data.normalBuffer);
                cmd.SetComputeTextureParam(cs, data.tracingKernel, ShaderIDs._SsrHitPointTexture, data.hitPointsTexture);
                //uint GetStencilValue(uint2 stencilBufferVal)=>stencilBufferVal.y
                cmd.SetComputeTextureParam(cs, data.tracingKernel, ShaderIDs._StencilTexture, data.stencilBuffer, 0, RenderTextureSubElement.Stencil);
                cmd.SetComputeBufferParam(cs, data.tracingKernel, ShaderIDs._DepthPyramidMipLevelOffsets, data.offsetBufferData);

                if (data.ssrUseIndirect)
                {
                    cmd.SetComputeBufferParam(cs, data.tracingKernel, ShaderIDs.g_TileList, data.ssrTileIndexBuffer);
                    cmd.DispatchCompute(cs, data.tracingKernel, data.ssrIndirectBuffer, 0);

                    // Debug helper, left commented out: read back the indirect
                    // args on the CPU to verify how many thread groups (tiles) were appended.
                    // cmd.RequestAsyncReadback(data.ssrIndirectBuffer, (request =>
                    // {
                    //     var data = request.GetData<uint>();
                    //     foreach (var tileCount in data)
                    //     {
                    //         Debug.Log(tileCount);
                    //     }
                    // }));
                }
                else
                {
                    cmd.DispatchCompute(cs, data.tracingKernel, HQUtils.DivRoundUp(data.width / 2, 8), HQUtils.DivRoundUp(data.height / 2, 8), data.viewCount);
                }
            }

            using (new ProfilingScope(cmd, ProfilingSampler.Get(ProfileIDList.SSRReprojection)))
            {
                CoreUtils.SetRenderTarget(cmd, data.ssrAccum, ClearFlag.Color, Color.clear);

                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._CameraDepthTexture, data.depthPyramid);
                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._NormalBufferTexture, data.normalBuffer);
                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._SsrHitPointTexture, data.hitPointsTexture);
                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._CameraMotionVectorsTexture, data.motionVectorsBuffer);
                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._ColorPyramidTexture, data.colorPyramid);
                // The stencil value is read from the .y component here: uint GetStencilValue(uint2 stencilBufferVal) => stencilBufferVal.y
                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._StencilTexture, data.stencilBuffer, 0, RenderTextureSubElement.Stencil);
                cmd.SetComputeBufferParam(cs, data.reprojectionKernel, ShaderIDs._DepthPyramidMipLevelOffsets, data.offsetBufferData);


                cmd.SetComputeTextureParam(cs, data.reprojectionKernel, ShaderIDs._SSRAccumTexture, data.ssrAccum);
                cmd.DispatchCompute(cs, data.reprojectionKernel, HQUtils.DivRoundUp(data.width / 2, 8), HQUtils.DivRoundUp(data.height / 2, 8), data.viewCount);
            }

            if (!data.validColorPyramid)
            {
                CoreUtils.SetRenderTarget(cmd, data.ssrAccum, ClearFlag.Color, Color.clear);
                CoreUtils.SetRenderTarget(cmd, data.ssrAccumPrev, ClearFlag.Color, Color.clear);
            }

            using (new ProfilingScope(cmd, ProfilingSampler.Get(ProfileIDList.SSRAccumulate)))
            {
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._CameraDepthTexture, data.depthPyramid);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._NormalBufferTexture, data.normalBuffer);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._ColorPyramidTexture, data.colorPyramid);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._SsrHitPointTexture, data.hitPointsTexture);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._SSRAccumTexture, data.ssrAccum);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._SsrLightingTextureRW, data.lightingTexture);
                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._SsrAccumPrev, data.ssrAccumPrev);

                cmd.SetComputeTextureParam(cs, data.accumulateKernel, ShaderIDs._CameraMotionVectorsTexture, data.motionVectorsBuffer);

                cmd.DispatchCompute(cs, data.accumulateKernel, HQUtils.DivRoundUp(data.width, 8), HQUtils.DivRoundUp(data.height, 8), data.viewCount);
            }
        });
        return passData.lightingTexture;
    }
}
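
To make the USE_INDIRECT branch above concrete, here is a minimal sketch (not the pipeline's actual kernel) of how the tracing kernel can consume g_TileList. The x | (y << 16) tile encoding and the mapping of one 16*16 screen tile onto one 8*8 half-resolution thread group are assumptions chosen to match the HDRP-style build pass shown below:

StructuredBuffer<uint> g_TileList;

[numthreads(8, 8, 1)]
void ScreenSpaceReflectionsTracing(uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID)
{
#ifdef USE_INDIRECT
    // One thread group per appended tile: groupId.x indexes the compacted list
    // whose length was accumulated into the indirect-argument buffer.
    uint  tileData   = g_TileList[groupId.x];
    uint2 tileCoord  = uint2(tileData & 0xFFFF, tileData >> 16);
    // Tracing runs at half resolution, so one 16x16 screen tile is one 8x8 group.
    uint2 positionSS = tileCoord * 8 + groupThreadId.xy;
#else
    // Full-screen dispatch: thread groups tile the half-resolution target directly.
    uint2 positionSS = groupId.xy * 8 + groupThreadId.xy;
#endif
    // ... HiZ ray march against _CameraDepthTexture, write _SsrHitPointTexture ...
}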

Indirect Buffer

BuildDispatchIndirectArguments produces both tile lists at the end of the light-list build. It first clears the two indirect-argument buffers, then (when material classification is enabled) runs the BuildMaterialFlags kernel, which besides writing g_TileFeatureFlags also appends every tile that needs SSR to g_SSRTileList and bumps the thread-group count in g_SSRDispatchIndirectBuffer with an InterlockedAdd. Finally, the generic build-indirect kernel bins the deferred tiles into per-variant lists.

static void BuildDispatchIndirectArguments(BuildGPULightListPassData data, bool tileFlagsWritten, CommandBuffer cmd)
{
    using (new ProfilingScope(cmd, ProfilingSampler.Get(ProfileIDList.BuildDispatchIndirectArguments)))
    {
        if (!data.useIndirectDeferred)
            return;
        ConstantBuffer.Push(cmd, data.lightListCB, data.clearDispatchIndirectShader, ShaderIDs.ShaderVariablesLightList);

        cmd.SetComputeBufferParam(data.clearDispatchIndirectShader, s_ClearDispatchIndirectKernel, ShaderIDs.g_DispatchIndirectBuffer, data.output.dispatchIndirectBuffer);
        cmd.SetComputeBufferParam(data.clearDispatchIndirectShader, s_ClearDispatchIndirectKernel, ShaderIDs.g_SSRDispatchIndirectBuffer, data.output.SSRDispatchIndirectBuffer);
        cmd.DispatchCompute(data.clearDispatchIndirectShader, s_ClearDispatchIndirectKernel, 1, 1, 1);

        if (data.computeMaterialFlag)
        {
            ComputeShader buildMaterialFlagsShader = data.buildMaterialFlagsShader;

            ConstantBuffer.Push(cmd, data.lightListCB, buildMaterialFlagsShader, ShaderIDs.ShaderVariablesLightList);

            cmd.SetComputeBufferParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs.g_TileFeatureFlags, data.output.tileFeatureFlags);

            cmd.SetComputeBufferParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs.g_SSRDispatchIndirectBuffer, data.output.SSRDispatchIndirectBuffer);
            cmd.SetComputeBufferParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs.g_SSRTileList, data.output.SSRTileList);
            cmd.SetComputeFloatParam(buildMaterialFlagsShader, "_SsrRoughnessFadeEnd", data.ssrRoughnessFadeEnd);

            cmd.SetComputeTextureParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs._GBufferTexture[2], data.gBufferMaterialFlagTexture);
            cmd.SetComputeTextureParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs._GBufferTexture[1], data.gBufferNormalDataTexture);
            cmd.SetComputeTextureParam(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, ShaderIDs.g_depth_tex, data.depthBuffer);
            cmd.DispatchCompute(buildMaterialFlagsShader, s_BuildMaterialFlagKernel, data.numTilesFPTLX, data.numTilesFPTLY, data.viewCount);
        }


        const int k_ThreadGroupOptimalSize = 64;
        // add tiles to indirect buffer
        cmd.SetComputeBufferParam(data.buildDispatchIndirectShader, s_BuildIndirectKernel, ShaderIDs.g_DispatchIndirectBuffer, data.output.dispatchIndirectBuffer);
        cmd.SetComputeBufferParam(data.buildDispatchIndirectShader, s_BuildIndirectKernel, ShaderIDs.g_TileList, data.output.tileList);
        cmd.SetComputeBufferParam(data.buildDispatchIndirectShader, s_BuildIndirectKernel, ShaderIDs.g_TileFeatureFlags, data.output.tileFeatureFlags);
        cmd.SetComputeIntParam(data.buildDispatchIndirectShader, ShaderIDs.g_NumTiles, data.numTilesFPTL);
        cmd.SetComputeIntParam(data.buildDispatchIndirectShader, ShaderIDs.g_NumTilesX, data.numTilesFPTLX);
        // Round up to k_ThreadGroupOptimalSize so the buildDispatchIndirectShader kernel launches full thread groups
        cmd.DispatchCompute(data.buildDispatchIndirectShader, s_BuildIndirectKernel, (data.numTilesFPTL + k_ThreadGroupOptimalSize - 1) / k_ThreadGroupOptimalSize, 1, data.viewCount);
    }
}
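
The SSR-specific work inside the BuildMaterialFlags kernel reduces to a group-shared flag plus one InterlockedAdd per qualifying tile. The following is a hedged reconstruction rather than the actual kernel: PixelWantsSSR is a hypothetical helper standing in for the stencil/Roughness/NdotV tests driven by _SsrRoughnessFadeEnd, and the clear kernel is assumed to reset the indirect args to (0, 1, 1) beforehand:

RWStructuredBuffer<uint> g_SSRDispatchIndirectBuffer; // (groupsX, groupsY, groupsZ), assumed layout
RWStructuredBuffer<uint> g_SSRTileList;

groupshared uint gs_tileNeedsSSR;

[numthreads(16, 16, 1)]
void BuildMaterialFlags(uint3 tileId : SV_GroupID, uint threadIdx : SV_GroupIndex, uint3 id : SV_DispatchThreadID)
{
    if (threadIdx == 0)
        gs_tileNeedsSSR = 0;
    GroupMemoryBarrierWithGroupSync();

    // Hypothetical per-pixel test: geometry is present and smoothness falls
    // within the SSR fade range, i.e. casting a ray here would actually pay off.
    if (PixelWantsSSR(id.xy))
        InterlockedOr(gs_tileNeedsSSR, 1u);
    GroupMemoryBarrierWithGroupSync();

    // Thread 0 reserves a slot: element 0 is the X group count consumed by
    // cmd.DispatchCompute(cs, kernel, indirectBuffer, 0) on the C# side.
    if (threadIdx == 0 && gs_tileNeedsSSR != 0)
    {
        uint slot;
        InterlockedAdd(g_SSRDispatchIndirectBuffer[0], 1, slot);
        g_SSRTileList[slot] = tileId.x | (tileId.y << 16);
    }
}

A nice property of this scheme is that the CPU never needs to know how many tiles passed; the count lives only in the indirect buffer, which is exactly what the commented-out RequestAsyncReadback above inspects when debugging.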

private BuildGPULightListOutput BuildGPULightList(RenderGraph renderGraph, HQCamera hqCamera, TileAndClusterData tileAndClusterData,
    int directionalLightCount, int totalLightCount,
    ref ShaderVariablesLightList constantBuffer,
    TextureHandle depthStencilBuffer,
    TextureHandle stencilBufferCopy,
    ref PrepassOutput prepassOutput
)
{
    using (var builder = renderGraph.AddRenderPass<BuildGPULightListPassData>("Build Light List", out var passData, ProfilingSampler.Get(ProfileIDList.BuildLightList)))
    {
        {
            PrepareBuildGPULightListPassData(renderGraph, builder, hqCamera,
                tileAndClusterData, ref constantBuffer,
                directionalLightCount, totalLightCount,
                depthStencilBuffer, stencilBufferCopy, ref prepassOutput,
                passData);

            builder.SetRenderFunc(
                (BuildGPULightListPassData data, RenderGraphContext context) =>
                {
                    bool tileFlagsWritten = false;

                    {
                        ClearLightLists(data, context.cmd);
                        GenerateLightsScreenSpaceAABBs(data, context.cmd);
                        BigTilePrepass(data, context.cmd);
                        BuildPerTileLightList(data, ref tileFlagsWritten, context.cmd);
                        // VoxelLightListGeneration(data, context.cmd);

                        BuildDispatchIndirectArguments(data, tileFlagsWritten, context.cmd);
                    }
                });
        }
        return passData.output;
    }
}
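
For completeness, the generic s_BuildIndirectKernel that feeds the deferred-lighting feature variants follows the same append pattern, just with one tile list per variant. Again a sketch under assumptions: FeatureFlagsToVariant is a hypothetical stand-in for the flag-matching logic, and three dispatch arguments per variant in g_DispatchIndirectBuffer is an assumed layout:

RWStructuredBuffer<uint> g_DispatchIndirectBuffer; // 3 dispatch args per feature variant (assumed)
RWStructuredBuffer<uint> g_TileList;               // g_NumTiles entries reserved per variant
StructuredBuffer<uint>   g_TileFeatureFlags;

uint g_NumTiles;
uint g_NumTilesX;

[numthreads(64, 1, 1)]
void BuildIndirect(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint tileIndex = dispatchThreadId.x;
    if (tileIndex >= g_NumTiles)
        return;

    // Hypothetical helper: map the tile's combined light/material feature flags
    // to the cheapest shader variant that covers all of them.
    uint variant = FeatureFlagsToVariant(g_TileFeatureFlags[tileIndex]);

    uint offset;
    InterlockedAdd(g_DispatchIndirectBuffer[variant * 3 + 0], 1, offset);

    uint2 tileCoord = uint2(tileIndex % g_NumTilesX, tileIndex / g_NumTilesX);
    g_TileList[variant * g_NumTiles + offset] = tileCoord.x | (tileCoord.y << 16);
}

The 64-wide thread groups here line up with k_ThreadGroupOptimalSize and the (numTilesFPTL + 63) / 64 group count in the C# dispatch above.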

Performance comparison

(Figure: f76b224efeb4811ca7b47a52c57f88db.png — GPU timings of the SSR tracing pass with and without Indirect Dispatch)
The test GPU is an RTX 3080. The left column shows the SSR tracing cost with Indirect Dispatch, the right column without it; Indirect Dispatch comes out about 0.02 ms faster.
If the per-pixel test for whether a tile's pixels are worth casting rays from did not include the Roughness and NdotV checks, the timing gap would not widen much anyway; the real benefit of Indirect Dispatch is that the cost becomes much more stable frame to frame (better GPU utilization).
On top of that, BuildLightList fortunately did not get measurably more expensive from the newly added work (the InterlockedAdd).
(Figure: f35230c54e3282cfb9fb3f77bb56ab38.png — BuildLightList timing after adding the InterlockedAdd work)
