Walking backwards into the future – A look at descriptor heap in Granite
It seems like I can never quite escape the allure of fiddling with bits more efficiently every passing year. I recently went through the process of porting Granite’s Vulkan backend over to VK_EXT_descriptor_heap. There wasn’t exactly a burning need to do this work, but science demands I sacrifice my limited free time for these experiments. My name may or may not be on the extension summary, and it’s important to eat your own dog food.

In this post, I want to explore ways in which we can port an old school binding model over to newer APIs should the need arise. Granite’s binding model is designed for really old Vulkan. The project started in January 2017 after all, at which point Vulkan was in its infancy. Bindless was not really a thing yet, and I had to contend with really old mobile hardware.

Slot-based bindings have been with us since OpenGL and early D3D. I still think it’s a fine model from a user’s perspective. I have no problem writing code like: It’s very friendly to tooling and validation, and I just find it easy to use overall. GPU performance is great too, since vendors have maximal flexibility in how to implement the API. The major downside is the relatively heavy CPU cost associated with it, since there are many API calls to make. In my projects it’s rarely a concern, but when doing heavy CPU-bound workloads like PS2 GS emulation, it did start to matter quite a bit.

When SPIR-V shaders are consumed in Granite, they are automatically reflected. E.g., with GLSL: I automatically generate a VkDescriptorSetLayout for each unique set, and combine these into a VkPipelineLayout as one does. Each VkDescriptorSetLayout is hash’n’cached into a DescriptorSetAllocator. The implicit assumption in shaders I write is that low-frequency updates use lower set indices. This matches Vulkan’s pipeline layout compatibility rules too.

Given the hardcore descriptor churn this old model can incur, UBOs originally used VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC.
Since linearly allocating new UBOs per draw is a hot path, I wanted to avoid having to allocate and write new descriptor sets all the time. This is precisely what the dynamic buffer types were designed for. I did not use it for SSBOs since DYNAMIC has some unfortunate interactions with descriptor size: you cannot change the size, only the offset. The size of UBOs is somewhat irrelevant, and I just hardcoded in a 64K window.

There are two main strategies for allocating sets from a VkDescriptorPool, both of which are kinda bad. The typical model I believe most use is the “jumbo” allocator, where you create a big pool with many sets and many descriptors of different descriptor types and pray for the best. When the pool is OOM-ed, allocate another. One unfortunate thing about the jumbo pool is that you can’t really know up front exactly how to balance the descriptor types properly. It will always be a shaky heuristic. In raw Vulkan 1.0, it was straight up illegal to allocate any further once a limit had been reached, causing even more headaches. The very first maintenance extension to Vulkan fixed this and added OUT_OF_POOL_MEMORY, which allows applications to just keep going until the pool is exhausted. A fun fact is that some vendors would never exhaust the pool and just straight up ignore what you pass into vkCreateDescriptorPool, so that’s fun.

Granite went the route of a slab allocator per VkDescriptorSetLayout instead, one allocator per thread. Allocate a group of like 64 VkDescriptorSets in one go and parcel them out as needed. The main advantage here was no need to keep calling vkAllocateDescriptorSets over and over, and in the early years, I even hash’n’cached the descriptor sets. The primary reason for doing that was that some early mobile drivers were extreeeeeeeemely slow at vkUpdateDescriptorSets for some reason. Not a great time. This slab approach led to memory bloat though.
At some point, VK_KHR_descriptor_update_template was added, which aims to accelerate vkUpdateDescriptorSets. Instead of having the driver parse the structs and switch on the descriptorType to write descriptors, the update template in theory allows drivers to “precompile” a highly optimized function that updates descriptors based on the template provided in vkCreateDescriptorUpdateTemplate. This was a nice incremental thing to add to Granite. I don’t think the promise of update templates really worked out in the end though. Most drivers, I think, just resorted to parsing the template at update time instead, leading to no speedup.

Push descriptors were designed quite early on in Vulkan’s life, but their adoption was … spotty at best. They didn’t make it into core until Vulkan 1.4! Push descriptors solved some issues for us slot and binding troglodytes, since there was simply no need to mess around with allocating sets and pools when we could just push descriptors and let the driver deal with it. The major downside is that only one descriptor set can be a push set, but in Granite’s case, I could design for that limitation when writing shaders. The last set index in a VkPipelineLayout would get assigned as a push set.

After going push descriptors, I dropped the old UBO_DYNAMIC path, since push descriptors are not compatible with it, and the UBO_DYNAMIC wins were … questionable at best anyway. It took a while to move to this model though. The AMD Windows driver was infamously dragging its feet for years before finally accepting reality, and at that point I was ready to move over. It’s still not a hard requirement in Granite due to mobile concerns, but then the driver hits the slow path, and I don’t really care anymore.

At some point, any modern renderer has to deal with bindless, and Granite hit this wall with clustered shading, where an array of shadow maps became a hard necessity.
I’m not a big fan of “everything is bindless” myself, since I think it makes debugging way more annoying and stresses tooling and validation more than it should, but sometimes the scissor juggling is necessary. When Granite reflects a shader looking like this: The set layout is converted into an UPDATE_AFTER_BIND set with VARIABLE_COUNT array length. There is also a special helper function to aid in allocating these bindless sets, where the API mostly turns into: The CPU overhead of this isn’t quite trivial either, but with the set and pool model, it’s not easy to escape this reality without a lot of rewrites. For now, I only support sampled images with bindless, and I never really had any need or desire to add more. For bindless buffers, there is the glorious buffer_device_address instead.

This model has served and keeps serving Granite well. Once this model is in place, the only real reason to go beyond it for my use cases is performance (and curiosity).

VK_EXT_descriptor_buffer asks the question of what happens when we just remove the worst parts of the descriptor API: sets are now backed by a slice of memory, and pools are replaced by a big descriptor buffer that is bound to a command buffer. Some warts remain however, as VkDescriptorSetLayout and VkPipelineLayout persist. If you’re porting from the legacy model like I was, this poses no issues at all, and actually reduces the friction. Descriptor buffers are a perfectly sound middle-ground alternative for those who aren’t complete bindless junkies yet, but want some CPU gains along the way.

In the ideal use case for descriptor buffers, we have one big descriptor buffer that is always bound. This is allocated with PCI-e BAR on dGPUs, so DEVICE_LOCAL | HOST_VISIBLE. Instead of allocating descriptor sets, the command buffer performs a linear allocation which is backed by slices allocated from the global descriptor buffer. No API calls needed.
The size to allocate for a VkDescriptorSet is queried from the set layout itself, and each descriptor is assigned an offset that the driver controls. There is a wart in the spec where the min-spec for sampler descriptor buffers is very small (4K samplers). In this case, there is a risk that just linearly allocating out of the heap will trivially OOM the entire thing, and we would have to allocate new sampler descriptor buffers all the time. In practice, this limitation is completely moot. Granite only opts into descriptor buffers if the limits are reasonable. There is supposed to be a performance hit to rebinding descriptor buffers, but in practice, no vendor actually ended up implementing descriptor buffers like that. However, since VK_EXT_descriptor_heap will be way more strict about these kinds of limitations, I designed the descriptor_buffer implementation around the single global heap model to avoid rewrites later.

There is certainly a risk of going OOM when linearly allocating like this, but I’ve never come close to the limits. It’s not hard to write an app that would break Granite in half though, but I consider that a “doctor, my GPU hurts when I allocate like this” kind of situation.

This is where we should have a major win, but it’s not all that clear. For each descriptor type, I have different strategies for how to deal with them. The basic idea of descriptor buffers is that we can call vkGetDescriptorEXT to build a descriptor in raw bytes. This descriptor can now be copied around freely by the CPU with e.g. memcpy, or even on the GPU in shaders (but that’s a level of scissor juggling I am not brave enough for).

Image and sampler descriptors are the simplest ones to contend with. Descriptor buffers still retain the VkImageView and VkSampler objects. The main addition I made was to allocate a small payload up front and write the descriptor once. E.g.: Instead of vkUpdateDescriptorSets, we can now replace it with a trivial memcpy.
The memcpy functions are function pointers resolved from the descriptor byte count. This is a nice optimization, since each copy can compile down to a perfectly unrolled SIMD load-store sequence. Allocating bindless sets of sampled images with this method becomes super efficient, since it boils down to a special function that does:

I rarely use texel buffers, but they are also quite neat in descriptor buffers. VkBufferView is gone now, so we just need to create a descriptor payload once from a VkDeviceAddress, and it’s otherwise the same as above.

The combined image sampler descriptor type is somewhat of a relic these days, but anyone coming from a GL/GLES background instead of D3D will likely use it out of old habit, me included. The API here is slightly more unfortunate, since there is no obvious way to create these descriptors up-front. We don’t necessarily know all the samplers an image will be combined with, so we have to do it last minute, calling vkGetDescriptorEXT to create the combined descriptor.

We cannot meaningfully pre-create descriptors for UBOs and SSBOs, so we’re in a similar situation where we have to call vkGetDescriptorEXT for each buffer last-minute. Unfortunately, there is no array-of-descriptors version of vkGetDescriptorEXT, so in extreme cases, descriptor buffers can actually have worse CPU overhead than the legacy model. DXVK going via the winevulkan .dll <-> .so translation overhead has been known to hit this, but for everyone else I’d expect the difference to be moot.

Since descriptor buffer is an incremental improvement over the legacy model, we retain optional support for push descriptors. This can be useful in some use cases (it’s critical for vkd3d-proton), but Granite does not need it. Once we’re in descriptor buffer land, we’re locked in.

Descriptor buffers are battle tested and very well supported at this point. Perhaps not on very old mobile drivers, but slightly newer devices tend to have it, so there’s that! RenderDoc has solid support these days as well.
At a quick glance, descriptor heap looks very similar to D3D12 (and it is), but there are various additions on top to make it more compatible with the various binding models that exist out there in the wild, especially for people who come from a GL/Vulkan 1.0 kind of engine design. The normal D3D12 model has some flaws if you’re not fully committed to bindless all day every day, mainly that:

This is to match how some hardware works, nothing too complicated. I allocate the supported ~1 million resource descriptors and 4096 samplers. There is a reserved region for descriptors as well, which is new to this extension. In D3D12 this is all abstracted away, since applications don’t have direct access to the descriptor heap memory.

For the resource heap, we have a 512k descriptor area which can be freely allocated from, like we did with descriptor buffer. Unlike descriptor buffer, where we hammer this arena allocator all the time, we will only rarely need to touch it with descriptor heap.

The next ~500k or so descriptors are dedicated to holding the descriptor payloads for VkImageView, VkSampler and VkBufferView. All of these objects are now obsolete. When Granite creates a Vulkan::ImageView, it internally allocates a free slab index from this upper region, writes the descriptor there, and stores the heap index instead. This enables “true” bindless in a performant way. We could have done this before if we wanted to, but in descriptor buffer we would have eaten a painful indirection on a lot of hardware, which is not great. Some Vulkan drivers actually work just like this internally. You can easily tell, because some drivers report that an image descriptor is just sizeof(uint32_t). We’d have our index into the “heap”, which gets translated into yet another index into the “true” (hidden) heap. Chasing pointers is bad for perf as we all know.
We keep a copy of the descriptor payload in CPU memory too, in case we have to write to the arena-allocated portion of the heap later. The upper region of ~10k descriptors or so (depends on the driver) is just a reserved region we bind and never touch. It’s there so that drivers can deal with CmdResolveImage, CmdBlitImage and other such special APIs that internally require descriptors.

For samplers, there is no arena allocator; the sampler heap is so tiny. Instead, when creating a sampler, we allocate a slab index and return a dummy handle by just pointer-casting the index instead. We’ll make good use of the mapping APIs later to deal with this lack of arena allocation. In fact, we will never have to copy sampler descriptor payloads around, and we don’t have to mess around with static samplers either, neat! For the static sampler crowd, there is full support for embedded samplers, which function just like D3D12 static samplers, so there’s that, but Granite doesn’t use it.

It was a non-trivial amount of code to get to this point, but hey, that’s what happens when you try to support 3 descriptor models at once I guess …

Core Vulkan 1.0 settled on 128 bytes of push constants as the limit. This was raised in Vulkan 1.4, but Granite keeps the old limit (I could probably live with 32 or 64 bytes to be fair). Push data expands to 256 bytes as a minimum, and the main idea behind descriptor heap is that pipeline layouts are completely gone, and we get to decide how the driver should interpret the push data space. This is similar to D3D12 root parameters, except it’s not abstracted behind a SetRootParameter() kind of interface that is called one parameter at a time. In Vulkan, we can call CmdPushDataEXT once. VkPipelineLayout and VkDescriptorSetLayout are just gone now, poof, they do not exist at all. This is huge for usability. Effectively, we can pretend that the VkPipelineLayout is now just a push constant range of 256 bytes, and that’s it.
If you’re fully committed to going bindless, you could just do the equivalent of SM 6.6 ResourceDescriptorHeap and SamplerDescriptorHeap plus buffer_device_address to get everything done. However, Granite is still a good old slot-based system, so I need to use the mapping features to tell the driver how to translate set/binding into actual descriptors. This mapping can be different per shader too, which fixes a lot of really annoying problems with EXT_graphics_pipeline_library and EXT_shader_object if I feel like going down that path in the future.

The natural thing for me to do was to split up the space into a maximum 128 byte push constant area, then 32 bytes per descriptor set (I support 4 sets, the Vulkan 1.0 min-spec). It’s certainly possible to parcel out the data more intelligently, but that causes some issues with set compatibility which I don’t want to deal with. For every set, I split it up into buffers and images and decide on a strategy for each. Buffers are decided first, since they have the largest impact on performance in my experience.

This is very simple. If there are 3 or fewer buffers in a set (24 bytes), we can just stuff the raw pointers into push data and tell the driver to use that pointer. This is D3D12 root descriptors in a nutshell. Especially for UBOs, this is very handy for performance. We lose robustness here, but I never rely on buffer robustness anyway. The push data layout looks something like this:

This is a new Vulkan speciality. Without modifying the shaders, we can tell the driver to load a buffer device address from a pointer in push data instead. This way we don’t have to allocate from the descriptor heap itself; we can just do a normal linear UBO allocation, write some VkDeviceAddresses in there, and have fun. Given the single indirection to load the “descriptor” here, this looks a lot like Vulkan 1.0 descriptor sets, except there’s no API necessary to write them.

This isn’t the ideal path, but sometimes we’re forced to allocate from the heap.
This can happen if we have one of these cases:

This is pretty much D3D12’s root tables, but in Vulkan we can be a bit more optimal with memory, since buffer descriptors tend to be smaller than image descriptors and we can pack them tightly. D3D12 has one global stride for any resource descriptor, while Vulkan exposes separate sizes that applications can take advantage of. vkWriteResourceDescriptorsEXT is required here to write the SSBO descriptors.

After buffers are parceled out for a descriptor set, we have some space left for images. At minimum, we have 8 bytes left (32 – 3 * sizeof(VkDeviceAddress)).

This is the common and ideal case. If we don’t have any arrays of images, we can just have a bunch of uint32_t indices directly into the heap. At image view and buffer view creation time, we already allocated a persistent index into the heap that we can refer to. No API calls are required when emitting commands. Combined image samplers work quite well in this model, because Vulkan adds a special mapping mode that packs both the sampler index and the image index together. This fixes one of the annoying issues in EXT_descriptor_buffer.

If we cannot use the simple inline indices, we have two options. The preferred one right now is to just allocate space in the descriptor heap like the descriptor buffer path does, because I’m quite concerned with unnecessary indirections when possible. At least we get to copy the payloads around without API commands. This path is also used for bindless sets. Unlike the descriptor buffer path, there is a major problem: linearly allocating from the sampler heap is not viable. The sampler heap is really small now, just like in D3D12. In this case, Vulkan has an answer.

This is a special Vulkan feature that functions like an indirect root table. It is similar to INDIRECT_ADDRESS in that we don’t have to allocate anything from the heap directly, and we can just stuff heap indices straight into a UBO.
Overall, I think these new mapping types allow us to reuse old shaders quite effectively, and it’s possible to start slowly rewriting shaders to take full advantage of descriptor_heap once this machinery is in place.

For GPU performance, it seemed to be on par with the other descriptor models on NVIDIA and AMD, which was expected. Granite does not really hit the cases where descriptor_heap should meaningfully improve GPU performance over descriptor_buffer, but I only did a rough glance.

For CPU performance, things were a bit more interesting, and I learned that Granite has quite significant overhead on its own, which is hardly surprising. That’s the cost of an old school slot and binding model after all, and I never did a serious optimization pass over it. A more forward looking rendering abstraction can eliminate most, if not all, of this overhead. The numbers here are for RADV, using the pending merge request for descriptor_heap support.

~27 us to write 4096 image descriptors on a Ryzen 3950X with an RX 6800. This is basically exactly the same. ~13 us. This is really just a push_back and memcpy bench at this point.

This case hits the optimal inline BDA case for heap. ~279 ns per dispatch. Doesn’t feel very impressive. Basically the same perf, but lots of overhead has now shifted over to Granite. Certainly things can be optimized further. GetDescriptorEXT is somehow much faster than UpdateDescriptorSetWithTemplate though. ~157 ns / dispatch now, and most of the overhead is now in Granite itself, which is ideal.

I added an extra buffer descriptor per set, which hits the INDIRECT_ADDRESS path. Heap regressed significantly, but it’s all in Granite code at least. Likely related to having to page in new UBO blocks, but I didn’t look too closely. ~375 ns / dispatch, hnnnnnng. The other paths don’t change much, as expected. About ~310 ns / dispatch for the legacy and descriptor buffer models.

This is the happy path for descriptor heap. ~161 ns / dispatch, ~166 ns.
Quite interesting that it got slower. The slab allocator for legacy sets seems to be doing its job very well. The actual descriptor copying vanished from the top list at least. ~145 ns. A very modest gain, and most of the overhead is now just Granite jank. All the paths look very similar now, ~170 ns or so.

On an RTX 4070 with 595 drivers, the improvements, especially for buffers, are quite large on NV, interestingly enough. For the legacy buffer tests, it’s heavily biased towards driver overhead. For the image tests, the gains are modest, which is somewhat expected given how NV implements image descriptors before descriptor heap: it’s just some trivial u32 indices.

Overall, it’s interesting how well the legacy Vulkan 1.0 model holds up here, at least on RADV with my implementation. Descriptor buffer and heap cannot truly shine unless the abstraction using them is written with performance in mind. This sentiment is hardly new. Just porting OpenGL-style code over to Vulkan doesn’t give amazing gains, just like porting old and crusty binding models won’t magically perform better with newer APIs either. Either way, this level of performance is good enough for my needs, and the days of spamming out 100k draw calls are kinda over anyway, since it’s all GPU driven with large bindless data sets these days.

Adding descriptor buffer and heap support to Granite was generally motivated by curiosity rather than a desperate need for perf, but I hope this post serves as an example of what can be done. There’s a lot of descriptor heap that hasn’t been explored here. GPU performance for heavily bindless workloads is another topic entirely, and I also haven’t really touched on how it would be more practical to start writing code like: which would side-step almost all Granite overhead.

Overall, I quite like what we’ve got now with descriptor heap as an API, a bastard child of descriptor buffer and D3D12 that gets the job done.
As tooling and driver support matures, I will likely just delete the descriptor buffer path, keeping the legacy stuff around for compatibility.

Parts of the descriptor API that descriptor buffers remove:

- VkDescriptorSet
- VkDescriptorPool
- vkUpdateDescriptorSets (kinda)

Descriptor types covered:

- VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE
- VK_DESCRIPTOR_TYPE_STORAGE_IMAGE
- VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT
- VK_DESCRIPTOR_TYPE_SAMPLER
- VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER
- VK_DESCRIPTOR_TYPE_STORAGE_BUFFER
- VK_DESCRIPTOR_TYPE_ACCELERATION_STRUCTURE_KHR

Flaws of the plain D3D12 heap model:

- You very quickly end up having to call CopyDescriptorsSimple a LOT to shuffle descriptors into the heap. Since this is a call into the driver just to copy a few bytes around, it can quickly become a source of performance issues. In vkd3d-proton, we went to hell and back to optimize this case, because in many titles it was the number 1 performance overhead.
- Dealing with samplers is a major pain. The 2K sampler heap limit can be rather limiting, since there is no good way to linearly allocate on such a small heap. Static samplers are quite common as a result, but they have other problems. Recompiling shaders because you change Aniso 4x to 8x in the settings menu is kind of a hilarious situation to be in, but some games have been known to do just that …

Cases that force allocating buffer descriptors from the heap:

- The shader is using OpArrayLength on an SSBO. We need real descriptors in this case. The current implementation just scans the SPIR-V shader module for this instruction, but could be improved in theory.
- The shader is using an array of descriptors. For buffers, this should be very rare, but the PUSH_ADDRESS and INDIRECT_ADDRESS interfaces do not support this.
- Robustness is enabled.

Test results:

Legacy:
- Test #1: Write 4096 image descriptors: 17.6 us (copies u32 indices)
- Test #2: 693 ns
- Test #3: 726 ns
- Test #4: 377 ns
- Test #5: 408 ns

Descriptor buffer:
- Test #1: 10.2 us (copies u32 indices)
- Test #2: 434 ns
- Test #3: 479 ns
- Test #4: 307 ns
- Test #5: 315 ns

Descriptor heap:
- Test #1: 11 us (copies real 32 byte descriptors)
- Test #2: 389 ns
- Test #3: 405 ns
- Test #4: 321 ns
- Test #5: 365 ns