Latest Posts

A case for learning GPU programming with a compute-first mindset

Beginners coming into our little corner of the programming world have it rough. Normal CPU-centric programming tends to start out with a “Hello World” sample, which can be completed in mere minutes. It takes far longer to simply download the toolchains and set them up, and if you’re on a developer-friendly OS, even that can be completed in seconds. However, in the graphics world, young whippersnappers cut their teeth on rendering the elusive “Hello Triangle” to demonstrate that yes, we can indeed do what our forebears accomplished 40 years ago, except with 20x the effort and 100000x the performance.

There’s no shortage of examples of beginners rendering a simple triangle (or a cube), and with the new APIs having completely displaced the oxygen of older APIs, there is a certain expectation of ridiculous complexity and raw grit required to tickle some pixels on the display. 1000 lines of code, two weeks of grinding, debugging black screens etc, etc. Something is obviously wrong here, and it’s not going to get easier.

I would argue that trying to hammer through the brick wall of graphics is the wrong approach in 2025. Graphics itself is less and less relevant for any hopeful new GPU programmer. Notice I wrote “GPU programmer”, not graphics programmer, because most interesting work these days happens with compute shaders, not traditional “graphics” rendering. Instead, I would argue we should start teaching compute with a debugger/profiler-first mindset, building up an understanding of how GPUs execute code, and eventually introduce the fixed-function rasterization pipeline as a specialization once all the fundamentals are already in place.

The raster pipeline was simple enough to teach 20 years ago, but those days are long gone, and unless you plan to hack on pre-historic games as a hobby project, it’s an extremely large API surface to learn. When compute is the focus, there are a lot of APIs we could ponder, like CUDA and OpenCL, but I think Vulkan compute is the best compute-focused API to start out with. I’m totally not biased, obviously. The end goal is of course to also understand the graphics pipeline, and pure compute APIs will not help you there.

I don’t intend to write a big book here that has all the answers on how to become a competent GPU programmer. Instead, I want to try outlining some kind of “meta-tutorial” that could be fleshed out further. I’ve been writing compute shaders since the release of OpenGL 4.3 ages ago and I still learn new things.

For this exercise, I will rely on a mid-level API abstraction like my own Granite. I don’t think throwing developers into the raw API is the best idea to begin with, but there must be some connection with the underlying API, i.e., no multi-API abstraction that barely resembles the actual API, which is what you’ll typically find in large production engines. Granite is a pure Vulkan abstraction I’ve been chiseling away at for years for my own needs (I ship it in tons of side projects and stuff), but it’s not really an API I’ve intended others to actively use and ship software on. Migrating away from the training wheels quickly is important though, and compute makes that fairly painless. Granite is just one of many approaches to tackling Vulkan, and the intent is not to present this as the one true way.

Getting something to show up on screen is important to keep the dopamine juices flowing. Fortunately, we can actually do this without messing around with graphics directly.
With RenderDoc captures we get debugging + something visual in the same package, and learning tooling early is critical to being effective. Debugging sufficiently interesting GPU code is impossible without this. The debug flow I propose with RenderDoc will rely on a lot of shader replacements and roundtrips via SPIRV-Cross’ GLSL backend, so Vulkan GLSL is the appropriate language to start with. It’s more or less a dead language at this point, but it’s also the most documented language that has support for buffer device addresses, which I will introduce right away to avoid the brick wall of descriptors and binding models. This is a very compute-centric move, but it makes other parts of the API easier to grasp later.

HLSL from the Direct3D ecosystem is a popular option, but as a compute language, HLSL is weaker than Vulkan GLSL in my experience, due to lacking a lot of features that come up in more interesting compute workloads. Being bilingual in this area is unavoidable these days anyway. No matter which language you use, someone will call you a filthy degenerate regardless. :v

Being debugger-centric, we can avoid poking at explicit synchronization for a very long time, and once we get there, we can simplify a ton. You can do a lot of interesting things with a single dispatch after all. Here’s a very basic program that copies some data around. It should trivially build on Linux or Windows on the usual compilers. Make sure to clone or symlink Granite so that it can be picked up by the CMake build. If you try to run this, the output might look something like this:

The code we just wrote executes on the GPU, but we have no easy way to observe the code actually running on the device. This is where RenderDoc comes in. Point it to the executable we built. After launching, the capture happens automatically, and when the process terminates, the capture should appear. Clicking on the copy command and double-clicking the destination buffer, we can see the raw contents:

The zero-initialization flag we passed into buffer creation was technically not needed, but it helped make the capture a little easier to understand. That clear happened automagically inside Granite. Normally, memory is not assumed to be zero-cleared on allocation.

Instead of using copies, we can create our own little memcpy. Here’s an updated sample gist. To keep things simple, we can use shaderc’s method of compiling GLSL into a C header file. Vulkan 1.2 is used here since that introduced buffer device addresses in core. Building and capturing again gives us: To inspect the push constants, look under uniform buffers:

RenderDoc understands how to resolve pointers in buffers into links that open the relevant buffer. If you click on an event before the dispatch, you’ll see the writes disappear. This workflow is extremely powerful for difficult debugging scenarios and I cannot do my job without it. It’s imperative to learn this early.

Debug gives you a more traditional step debugger. Depending on the bug, this may be the correct tool (e.g. inspecting a particular broken pixel), but in my experience, when working with a ton of GPU threads in parallel, it’s often required to study things in aggregate to see what is going on, since you might not even know which thread is at fault to begin with. For now, select Edit -> Decompile with SPIRV-Cross. The Vulkan API uses the SPIR-V intermediate representation, and SPIRV-Cross converts this back to equivalent GLSL. Fortunately, the result looks very similar to our original shader.
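To make that concrete, the little buffer-device-address memcpy being discussed might look roughly like this in Vulkan GLSL. This is a sketch in the spirit of the linked gist, not its exact contents:

```glsl
#version 450
#extension GL_EXT_buffer_reference : require

layout(local_size_x = 16) in;

// Plain pointers to device memory; no descriptors needed.
layout(buffer_reference, std430, buffer_reference_align = 4) buffer WordBuffer
{
    uint words[];
};

layout(push_constant) uniform Registers
{
    WordBuffer src;
    WordBuffer dst;
    uint count;
} registers;

void main()
{
    uint index = gl_GlobalInvocationID.x;
    if (index < registers.count)
        registers.dst.words[index] = registers.src.words[index];
}
```

The whole binding model boils down to one push constant block with a couple of pointers in it, and round-tripped through SPIRV-Cross, a shader of this shape comes back looking almost identical to what went in.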
This is one of the main reasons I prefer working with GLSL: the translation back and forth to SPIR-V is the least lossy compared to the alternatives. E.g. we can hack in some debug prints, hit Apply, and the dispatch will have messages attached to it. RenderDoc implements debugPrintfEXT by rewriting the SPIR-V to write the printf values back to the host. The Vulkan driver itself does not understand how to printf; it will just ignore the SPIR-V command to printf. Shader replacement like this is not just for debug prints: you can modify the code and see the results without having to recompile the entire program and recapture. For fun, debug print “Hello World” instead, and you have your checkbox ticked off.

If you’re on a driver that exposes it, you can study the machine code. For this purpose, I highly recommend AMD GPUs with the RADV driver on Linux. The ISA is arguably the most straightforward to read. All Mesa drivers should give you ISA no matter the graphics card if you don’t have an AMD card lying around.

Before trying to make sense of this, we need a mental model for how GPU compute executes on the device. This model was more or less introduced by CUDA in 2007 and has remained effectively unchanged since, neat! At the highest level, on the CPU side, we dispatch a 3D grid of workgroups. In this sample, we just have a 1x1x1 cube of dispatches. For every workgroup, there is another 3D grid of invocations. Multiple invocations work together and are able to efficiently communicate with each other. Communicating across workgroups is possible in some situations, but requires some scissor juggling.

GPUs are extremely parallel machines. To get optimal performance, we have to map the very scalar-looking code to SIMT. The model employed for essentially all modern GPUs is that one lane in the vector units maps to one thread. Inside the workgroup, the invocations are split up into subgroups. The mental model for the distinction is summed up in the workgroup/subgroup comparison at the end of this post. For a workgroup to be running well, the number of invocations in it should be an integer multiple of the subgroup size, otherwise there will be lanes doing nothing, and that’s no fun.

The subgroup sizes in the wild vary quite a lot, but there is an upper legal limit of 128. In practice, the values you can expect to find in the wild are listed at the end of this post. Some vendors support multiple subgroup sizes. Usually you don’t have to care too much about this until you graduate to the more hardcore level of compute shader programming, but Vulkan gives you control to force subgroup sizes when need be. For desktop use cases, catering to the range of 16 to 64 is reasonable. In the example we’ve been looking at, the workgroup size is just 16, so this is not optimal. On mobile GPUs, you might need to consider a wider range of hardware. The rule of thumb (for desktop) is to use one of the three local_size constellations listed at the end of this post. Integer multiples of these are fine too. This should make almost any GPU happy. The maximum limit is 1024 invocations, but I never recommend going that high unless there are very good reasons to.

For the AMD case, v_ instructions are vector instructions, meaning while it looks like a simple register, there are multiple instances of it, one for every invocation in the subgroup. s_ instructions are scalar. They run once per subgroup and run in parallel with the vector units. Taking advantage of subgroup uniform code can be very powerful.
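To connect this back to the shader source, here is a rough, hypothetical annotation of which values tend to end up in scalar versus vector registers; the compiler makes the real decision based on divergence analysis:

```glsl
#version 450
#extension GL_EXT_buffer_reference : require

layout(local_size_x = 64) in;

layout(buffer_reference, std430) buffer WordBuffer { uint words[]; };

layout(push_constant) uniform Registers
{
    WordBuffer buf;
    uint count;
} registers;

void main()
{
    // Subgroup-uniform: every lane sees the same value, so it can live in
    // scalar registers and be loaded with s_ instructions on AMD.
    uint count = registers.count;

    // Per-lane: unique for every invocation, so it lives in vector registers
    // and all arithmetic on it becomes v_ instructions.
    uint index = gl_GlobalInvocationID.x;

    // Vector compare + per-lane memory access.
    if (index < count)
        registers.buf.words[index] += 1u;
}
```

ISA dumps make this split very visible on AMD hardware.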
One way to think of this is that v_ instructions look like SIMD, except that the SIMD width is much larger than on CPU instruction sets, while scalar instructions look more like regular CPU instructions. Vector load-store looks more like gather/scatter, and scalar load-store is more like normal CPU load-store.

In rare cases, it’s useful to do minor in-place modifications to SPIR-V itself. As a very ad-hoc sample, we can attempt to clean up the awkward 64-bit pointer math by using OpInBoundsAccessChain instead of OpAccessChain. Edit -> Decompile with spirv-dis. Now replace OpAccessChain with OpInBoundsAccessChain and apply. While keeping the shader tab open, go back to the Pipeline State viewer and look at the GPU ISA: OpInBoundsAccessChain tells the compiler that we cannot index outside the array, which means negative indices and massively large indices are not allowed, and this allows the compiler to emit the u64 + u32 addressing format. Sure looks much nicer now. This is way beyond what a beginner should care about, but the point is to demonstrate that we can replace raw SPIR-V too, and also demonstrates how you can inspect the SPIR-V of shaders easily.

We can do a lot with simple buffer device addresses to process data, but there is a limit to how far we can get with that approach if the end goal is game rendering. There are things GPUs can do that raw pointers cannot (see the list at the end of this post). CPU ISAs do none of these. In this updated memcpy sample, I introduce two descriptor types, STORAGE_BUFFER and UNIFORM_TEXEL_BUFFER. Using descriptors like this is the “normal” way to use Vulkan, and should be preferred when feasible. For pragmatic reasons, it’s easier to debug and validate descriptors compared to raw pointers. Raw pointers are also prone to GPU crashes, which are very painful and annoying to debug. Unlike CPU debugging, we cannot just capture a SIGSEGV in a debugger and call it a day.

Already, the resources show up in a more convenient way. Even though the shader didn’t specify 8-bit inputs anywhere, it just works. Typed formats have up to 4 components. This is enough to cover RGB and Alpha channels, which of course has its origins in graphics concepts. texelFetch cannot know if we have R8_UINT or R16G16B16A16_UINT for example, so we have to select the .x component in the shader.

Granite implements a rather old school binding model, but I think this model is overall very easy to understand and use. More modern bindless design is introduced later when it becomes relevant. In the shader, I declare things like layout(set = 0, binding = 1). In Granite, this simply means that we have to bind a resource of the appropriate type before dispatching. This papers over a ton of concepts, and makes it very easy and convenient to program.

In reality, there are a lot of API objects in play here. When the compiler sees layout(set = 0, binding = 1) for example, it needs to check the provided VkPipelineLayout given to pipeline creation. A set denotes a group of resources that are bound together as one contiguous entity. In the ISA, set = 0 is determined to be initialized in a certain scalar register: The VkPipelineLayout also contains information about what e.g. binding = 1 means. In this case, the driver happened to decide that binding = 0 is at offset 0, and binding = 1 is at offset 16. Since these descriptors are adjacent in memory, we got a lucky optimization where we load 32 bytes at once. On the API side, we need a compatible VkPipelineLayout object when recording the command buffer to ensure that everything lines up.
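As a sketch of what the descriptor-based variant can look like (illustrative, not the exact sample from the post; the binding numbers just mirror the set = 0, binding = 1 example above):

```glsl
#version 450
layout(local_size_x = 64) in;

// STORAGE_BUFFER: plain structured memory, written directly.
layout(set = 0, binding = 0, std430) writeonly buffer Dst
{
    uint data[];
};

// UNIFORM_TEXEL_BUFFER: the format (R8_UINT vs R16G16B16A16_UINT, etc.) is a
// property of the VkBufferView, not the shader, so texelFetch always returns
// up to 4 components and we select .x here.
layout(set = 0, binding = 1) uniform usamplerBuffer src_texels;

layout(push_constant) uniform Registers { uint count; } registers;

void main()
{
    uint index = gl_GlobalInvocationID.x;
    if (index < registers.count)
        data[index] = texelFetch(src_texels, int(index)).x;
}
```

The set/binding numbers are the contract here: the VkPipelineLayout and the descriptor sets bound at dispatch time have to match what the shader declares.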
Granite does this automatically, through shader reflection, which synthesizes a working layout for us. Based on the contained VkDescriptorSetLayout inside the pipeline layout, it knows how to allocate a VkDescriptorSet from a VkDescriptorPool and write descriptors to it. Then it can bind the descriptor set to the command buffer before dispatching. We can see all of this in effect in the capture. Turn off the Filter and we get: The descriptor set is updated, then later bound. In reality, vkCmdBindDescriptorSets is just a 32-bit push constant, which the shader ends up reading in the s3 register.

Managing descriptors is always a point of contention in Vulkan programming if you’re writing the raw API. There’s a ton of concepts to juggle and it’s mostly pretty dull stuff. As an extension to the original old school model I outlined above, it’s possible to treat a descriptor set as raw memory, which gets rid of a ton of jank. Granite supports using this model by opting in to it. Change the sample to opt in and recapture: Make sure to make a release build of the test and not a debug build, otherwise descriptor buffers are disabled. Note that this requires a recent build of RenderDoc. The latest stable v1.40 release supports descriptor buffers.

Now we explicitly tell the driver that the descriptor set lives at offset 0 from the bound buffer. If we then inspect the bound descriptor buffer … Now we can see the raw guts of the storage buffer and texel buffer being encoded. You can even see the 0x40 and 0x10 being encoded there, which correspond to the sizes of the descriptors.

To get something interesting on screen to end this bringup exercise, we could port some shadertoy shaders. These are super convenient since many of them don’t require anything fancy to run, like external textures or anything. I picked a shadertoy arbitrarily. Store this to mandelbulb.glsl, and then we replace our shader with a mandelbulb.comp that calls the shadertoy code: On the API side, we simply need to create a storage texture and bind it to the shader. Just with this simple setup, you can go completely nuts and play around with the more math-heavy side of graphics if you want.

From here, I think the natural evolution is to learn about the topics listed at the end of this post. After that, it’s a matter of learning the common algorithms that show up all the time, like parallel scans, classification, binning, etc, etc. This naturally leads to indirect dispatches, and once these concepts are in place, we can design a very simple compute shader rasterizer that renders some simple glTF models. Only when those concepts land do we consider the graphics pipeline.
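Going back to the shadertoy port for a moment, the wrapper shader might look roughly like this. This is a hypothetical sketch; the real mandelbulb.comp and its push constants will differ:

```glsl
#version 450
layout(local_size_x = 8, local_size_y = 8) in;

// Storage image bound by the application; rgba8 is just an assumption here.
layout(set = 0, binding = 0, rgba8) uniform writeonly image2D out_image;

layout(push_constant) uniform Registers
{
    vec2 resolution;
    float time;
} registers;

// Stand-in for the shadertoy body; paste the contents of mandelbulb.glsl here
// and forward iTime/iResolution-style inputs as needed.
void mainImage(out vec4 frag_color, vec2 frag_coord)
{
    frag_color = vec4(frag_coord / registers.resolution,
                      0.5 + 0.5 * sin(registers.time), 1.0);
}

void main()
{
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(coord, ivec2(registers.resolution))))
        return;

    vec4 color;
    mainImage(color, vec2(coord) + 0.5);
    imageStore(out_image, coord, color);
}
```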
The workgroup/subgroup mental model:
- Workgroup -> runs concurrently on the same shader core
- Subgroup -> runs in lock-step in a SIMT fashion

Subgroup sizes found in the wild:
- 4: Really old Mali Bifrost, old iPhones
- 8: Intel Arc Alchemist
- 16: Intel Arc Battlemage, Mali Valhall, Intel (runs slower if not Battlemage)
- 32: AMD RDNA, NVIDIA, Intel upper limit (runs even slower)
- 64: Adreno, AMD GCN + RDNA
- 128: Adreno

Rule-of-thumb local_size constellations:
- (64, 1, 1) for 1D
- (8, 8, 1) for 2D
- (4, 4, 4) for 3D

Things GPUs can do that raw pointers cannot:
- Efficiently sample textures
- Automatic format conversions
- “Free” bounds checking
- Atomics

Where to go from here:
- Living in a world without mutexes: lockless programming with millions of threads
- Shared memory
- Subgroup operations
- Case study: Scalarization
- Case study: Bindless and non-uniform indexing of descriptors
- Texture sampling and mip-mapping
- The bread and butter of graphics
- Case study: do some simple image processing with simple filters
- Memory coherency and how to communicate with other workgroups
- Case study: Single pass down-sampling
- If relevant, start porting over the code to more shading languages
- API synchronization and how to keep CPU and GPU pipelined
- … and maybe only then start looking at getting some images on screen (with compute)
- Bring up your own Vulkan code from scratch to get rid of the training wheels and make sure you understand how everything comes together


I designed my own ridiculously fast game streaming video codec – PyroWave

Streaming gameplay from one machine to another over a network is a reasonably popular use case these days. These use cases demand very, very low latency. Every millisecond counts here. We need to run through the whole chain of steps listed at the end of this post, and every step in that chain adds latency, which we want to minimize as much as possible.

The go-to solution here is GPU accelerated video compression using whatever codec you have available, usually H.264, HEVC or, if you’re really fancy, AV1. Ideally, we want all of this to complete in roughly ~20 ms. To make this use case work well, we have to strangle the codecs quite a bit. Modern video codecs love latency since it allows tricks like flexible rate control and B-frames. That is what allows these codecs to operate at ridiculous compression ratios, but we have to throw away most of these tricks now. Since the codec cannot add latency, and we’re working on a fixed bit-rate budget (the ethernet cable or WiFi antenna), we’re left with the constraints listed at the end of this post.

When game streaming, the expectation is that we have a lot of bandwidth available. Streaming locally on a LAN in particular, bandwidth is basically free. Gigabit ethernet is ancient technology and hundreds of megabits over WiFi is no problem either. This shifts priorities a little bit, for me at least.

Back in my student days, I designed a very simple low-complexity video codec for my master thesis, and after fiddling with Vulkan video and PyroFling for a while, that old itch was scratched again. I wanted to see what would happen if I designed a codec with laser focus on local streaming with the absolute lowest possible latency. What could go wrong? This is the grug-brained approach to video, but it’s not as silly as it sounds. Bit-rates explode of course, but we gain a few things in return (listed at the end of this post).

Intra-only has use cases in digital cinema (motion JPEG2000) and more professionally oriented applications where these concerns are likely more important than squeezing bandwidth. We’re now working at 100+ Mbit/s instead of ~10-20 Mbit/s, so streaming over the internet is no longer feasible outside of peer-to-peer with fiber links. For reference, raw 1080p60 with 4:2:0 chroma subsampling is in the range of 1.5 Gbit/s, and it only gets worse from there.

Entropy coding is an absolute nightmare for parallelization, which means encoding solely on the GPU with compute shaders becomes an extremely painful affair. Let’s just throw that out and see how far we get. Gotta go fast!

There are codecs in this domain too, but it’s getting very specialized at this point. In the professional broadcasting space, there are codecs designed to squeeze more video through existing infrastructure with “zero” lag and minimal hardware cost. My master thesis was about this, for example. A more consumer oriented example is VESA Display Stream Compression (I’m not sure if it does entropy coding, but the compression ratios are small enough that I doubt it). There isn’t much readily available software in this domain; it’s generally all implemented in tiny ASICs. If FFmpeg doesn’t support it, it doesn’t exist for mere mortals.

While modern codecs are all block-based Discrete Cosine Transform (DCT) + a million hacks on top to make it good, there is an alternative that tried its best in the 90s to replace DCTs, but kinda failed in the end. https://www.youtube.com/watch?v=UGXeRx0Tic4 is a nice video explaining some of the lore. DWT-based compression has a niche today, but only in intra video compression. It’s a shame, because it’s quite elegant.
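For a taste of that elegance: the 5/3 wavelet used later in this post boils down to two cheap “lifting” steps per pair of samples. This is the textbook floating-point formulation, not PyroWave’s actual kernel:

```glsl
// Textbook LeGall 5/3 lifting steps (floating-point flavor). General idea only;
// the real kernels are 2D, run in FP16 and need boundary handling.
float lift_highpass(float even_prev, float odd, float even_next)
{
    // "Error" between an odd sample and the prediction from its even neighbors.
    return odd - 0.5 * (even_prev + even_next);
}

float lift_lowpass(float even, float high_prev, float high_next)
{
    // Update the even sample so the low-pass band keeps the right average.
    return even + 0.25 * (high_prev + high_next);
}
```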
https://en.wikipedia.org/wiki/Discrete_wavelet_transform

A graphics programmer will be familiar with this structure immediately, because this is just good old mip-maps with spice. Effectively, we downsample images, and also compute the “error” between the high-res picture and low-res picture. With signal processing lenses on, we can say it’s a critically sampled filter-bank. After processing N pixels, we obtain N / 2 low-pass and N / 2 high-pass pixels. The filters designed to do this are very particular (I really don’t know or care how they were made), but it’s basically just a basic convolution kernel, nothing too wild. The number of levels can vary, but I chose 5 levels of decomposition.

Once the image is filtered into different bands, the values are quantized. Quantizing wavelets is a little tricky since we need to consider that during reconstruction, the filters have different gains. For the CDF 9/7 filter, high-pass is attenuated by 6 dB, and there are other effects when upsampling the lower resolution bands (zero-insertion). Rather than sweating out new graphs, I’ll just copy-paste from my thesis. CDF 9/7 has a very similar looking spectrum to the 5/3 I used here. After normalizing the noise power, higher frequency bands can be quantized much harder than low-frequency bands. This exploits human psychovisual effects. This effect is used during rate control, which is another interesting problem. In the end, the higher frequency bands quantize to zero for most of the frame, with bits being allocated to critical regions of the image.

The JPEG blocking artifact is infamous. The wavelet’s typical failure mode is that all high-pass information is quantized to 0, even where it shouldn’t be. This leads to a blurring – and if severe – ringing artifact. Given how blurry games these days can be with TAA, maybe this simply isn’t all that noticeable? Modern problems require modern solutions.

Fiddling with this part of the codec was the thing that took the longest, but I think I landed on something alright eventually. The basic block is 32×32 coefficients. This forms a standalone unit of the bitstream that can be decoded independently. If there is packet loss, we can error correct by simply assuming all coefficients are zero. This leads to a tiny blur somewhere random in the frame which is likely not even going to be perceptible. The 32×32 block is further broken down into 8×8 blocks, which are then broken down into 4×2 blocks. This design is optimized for the GPU’s hierarchy of threads (the mapping is listed at the end of this post). 8 coefficients per thread is deliberately chosen so that we can be byte oriented. Vulkan widely supports 8-bit storage for SSBOs, so I rely on that. We absolutely cannot be in a situation where we do bit fiddling on memory. That makes GPUs sad.

Like most wavelet codecs, I went with bit-plane encoding, but rather than employing a highly complicated (and terribly slow) entropy coding scheme, the bit-planes are just emitted raw as-is. I did this in my master thesis project and I found it surprisingly effective. The number of bits per coefficient is signaled at a 4×2 block level. I did some experiments on these block sizes and 4×2 was the right tradeoff. Using subgroup operations and some light prefix sums across the workgroup, it’s very efficient to decode and encode data this way. For non-zero coefficients, sign bits are tightly packed to the end of the 32×32 block. This was mildly tricky to implement, but not too bad. The details are in my draft of the bitstream.
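The per-block signalling can be sketched like this: each thread owns a 4×2 group of quantized coefficients and derives how many raw bit-planes it needs from the largest magnitude. This is illustrative only, not the actual PyroWave shader:

```glsl
// Illustrative only: each thread owns eight quantized coefficients (a 4x2 group)
// and signals how many raw bit-planes are needed to represent them.
int bit_planes_for_group(ivec4 row0, ivec4 row1)
{
    uint max_magnitude = 0u;
    for (int i = 0; i < 4; i++)
    {
        max_magnitude = max(max_magnitude, uint(abs(row0[i])));
        max_magnitude = max(max_magnitude, uint(abs(row1[i])));
    }
    // findMSB(0) == -1, so an all-zero group costs no bit-planes at all.
    return findMSB(max_magnitude) + 1;
}
```

The actual packing then leans on subgroup operations and light prefix sums across the workgroup, as described above.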
In this style of compression, rate control is extremely important. We have a fixed (but huge) budget we have to meet. Most video codecs struggle with this requirement since the number of bits we get out of entropy coding is not easily knowable before we have actually done the compression. There is usually a lot of slack available to codecs when operating under normal VBR constraints. If a frame overshoots by 30%, we can amortize that over a few frames no problem, but that slack does not exist here since we’re assuming zero buffering latency.

Without entropy coding, we can trivialize the problem. For every 32×32 block, I test what happens if I throw away 1, 2, 3, … bits. I measure psychovisually weighted error (MSE) and bit-cost from that and store it in a buffer for later. During RD analysis I can loosely sort the decisions to throw away bits by order of least distortion per bit saved. After the required number of bits have been saved through some prefix summing, we have achieved a roughly optimal rate-distortion tradeoff across the entire image in fixed time. In the final pass, every 32×32 block checks how many bits to throw away and packs out the final bit-stream to a buffer. The result is guaranteed to be within the rate limit we set, usually ~10-20 bytes under the target.

Being able to rate limit like this is a common strength of wavelet codecs. Most of them end up iterating from most significant to least significant bit-plane and can stop encoding when the rate limit is met, which is pretty cool, but also horribly slow …

So … is it fast? I think so. Here’s a 1080p 4:2:0 encode and decode of Expedition 33, which I found to be on the “brutal” end for image compression. Lots of green foliage and a lot of TAA noise is quite hard to encode. 0.13 ms on a RX 9070 XT on RADV. Decoding is also very fast. Under 100 microseconds. I don’t think anything else even comes close.

The DWT pass was quite heavily optimized. It’s one of the few times where I found that packed FP16 math actually helped a lot, even on a beast like RDNA4. The quantizer pass does the most work of all the passes, and it took some effort to optimize too. Doing the DWT in FP16 does have a knock-on effect on the maximum quality metrics we can achieve though. Encoding more “normal” games, the quant + analysis pass has an easier time. An 80 microsecond encode is pretty good. Here’s a 4K 4:2:0 encode of the infamous ParkJoy scene. 0.25 ms, showing that 1080p struggles a bit to keep the GPU fully occupied.

An interesting data point is that transferring a 4K RGBA8 image over the PCI-e bus is far slower than compressing it on the GPU like this, followed by copying over the compressed payload. Maybe there is a really cursed use case here … I think this is an order of magnitude faster than even dedicated hardware codecs on GPUs. This performance improvement translates directly to lower latency (less time to encode and less time to decode), so maybe I’m onto something here. Power consumption on the Steam Deck when decoding is also barely measurable. I’m not sure it’s less than the hardware video decoder, but I actually wouldn’t be surprised if it were.

Given how niche and esoteric this codec is, it’s hard to find any actual competing codecs to compare against. Given that the domain is game streaming, the only real alternative is to test against the GPU vendors’ encoders with H.264/HEVC/AV1 codecs in FFmpeg. NVENC is the obvious one to test here.
VAAPI is also an option, but at least FFmpeg’s implementation of VAAPI fails to meet CBR targets, which is cheating and broken for this particular use case. It’s possible I held it wrong, but I’m not going to try debugging that. Over 200 Mbit/s at 60 fps, I find it hard to tell any compression artifacts without side-by-side comparisons + zooming in, which is about 1.5 bpp. For something as simple as this codec, that’s quite neat.

For objective metrics, I made use of https://github.com/psy-ex/metrics . To even begin to compare a trivial codec like this against these codecs is a little silly, but we can level the playing field a bit by putting these codecs under the same harsh restrictions that we have in PyroWave (listed at the end of this post). Example command line here: No one in their right mind would stream like this, but let’s try anyway.

The video clips from the games are 5s clips I captured myself to raw NV12 video. I don’t think it’s super useful to upload those. My hacked up scripts to generate the graphs can be found here for reference. I ran the NVENC tests on an RTX 4070 on 575 drivers.

I included this as a baseline since this sequence has been seared into the mind of every video engineer out there (I think?) … This clip is 50 fps, but since my test script is hard-coded for 1080p60, I hex-edited the .y4m :’) O_O. One thing to note here is that AV1/HEVC rate control kinda fails in this scenario. It ends up using less than the allotted budget, probably because it has to be conservative to meet the ridiculous hard-capped CBR. The graphs are done using the final encoded size however. … VMAF, are you drunk? Back to reality. The more typical metrics look more like what I would expect. XPSNR is supposed to be a weighted PSNR that takes psychovisual effects into account, but I have no idea if it’s a good objective metric.

Quite hard to encode. It was this game that actually gave me the last push to make this codec, since even at 50 Mbit/s with motion estimation, I recall some sections giving the encoder real trouble. Bumping bit rates just never cleaned things up properly for whatever reason … I don’t know why, but VMAF really likes PyroWave.

Surprisingly easy to encode. Must be the blurred backgrounds in play. The use of FP16 kinda limits how high the PSNR can go. This is way beyond transparency, so, whatever. This scene is arguably a good argument for 4:4:4 …

More foliage, which I expected to be kinda hard. The game’s presentation is very soft and it shows in the compression rates. VMAF really seems like a joke metric for these use cases … Another example of PSNR flattening off due to lowered internal precision.

I’m quite happy with this as-is, and having a 100% DIY streaming solution for my own use is fun.
The streaming chain:
- Send controller input from machine A to B over network
- B renders a frame on the GPU
- B encodes the frame into a bitstream
- B sends the result over a network to A
- A decodes the bitstream
- A displays image on screen
- Dopamine is released in target brain

What we’re left with once latency is off the table:
- Hard-capped constant bit rate
  - There is no buffer that can soak up variable bit rate
- Infinite GOP
  - P-frames or intra-refresh
  - Either choice deals with packet loss differently

What intra-only buys us:
- Excellent error resilience
  - Even on local WiFi streaming to a handheld device, this does matter quite a lot, at least in my experience
- Simplicity
  - Duh
- Consistent quality
  - With CBR, video quality is heavily dependent on how good of a job motion estimation can do

Block hierarchy mapped to the GPU:
- 1 thread: 4×2 coefficients
- Cluster of threads (subgroup): 8×8 block
- Workgroup: 32×32 block (128 threads)

Restrictions applied to the reference encoders:
- CBR with hard cap
  - Encoder is not allowed any slack, which should make the rate control really sweat
- Fastest modes (not sure it matters that much to intra-only)


Conquering FidelityFX FSR4 – Enabling the pretty pixels on Linux through maniacal persistence

As AMD’s RDNA4 GPUs released, FSR 4 was released alongside them to much fanfare. It’s a massive leap in quality over FSR 3.1, and with FSR 4 moving to a machine learning model instead of an analytical model, FSR 3.1 marks the farewell to fully open and grok-able upscalers. It’s truly a shame, but the graphics world cares little about such sentimentality when pretty pixels are at stake.

FSR 4 is not actually released yet to developers for integration in engines, but exists as a driver toggle that can be enabled for certain FSR 3.1 games. The FSR 3.1 implementation had the foresight to let the implementation peek into the driver DLLs and fish out a replacement implementation. This is of course a horrible mess from any compatibility perspective, but we just have to make it work. From day 1 of RDNA4’s release, there’s been a lot of pressure from the Linux gaming community to get this working on Linux as well. Buying an RDNA4 GPU would be far less enticing if there’s no way to get FSR 4 working, after all …

Given the (currently) proprietary nature of FSR 4, we could easily have been in the situation of DLSS where it’s literally impossible to re-implement it. DLSS uses interop with CUDA for example, and there’s a 0% chance anyone outside of NVIDIA can deal with that. Fortunately, NVIDIA provides the shims required to make DLSS work on Proton these days, but we’re on our own at this time for FSR 4.

It all started with an issue made on vkd3d-proton’s tracker asking for FSR 4 support. Somehow, with the OptiScaler project, they had gotten to the point where vkd3d-proton failed to compile compute shaders. This was very encouraging, because it means that FSR 4 goes through D3D12 APIs somehow. Of course, D3D12’s way of doing vendor extensions is the most disgusting thing ever devised, but we’ll get to that later … The flow of things seems to be as outlined at the end of this post. With undocumented opcodes, I really didn’t feel like trying anything.

Attempting to reverse something like that is just too much work. There was a high risk of spending weeks on something that didn’t work out, but someone found a very handy file checked into the open source LLPC repos from AMD. This file is far from complete, but it’s a solid start. At this point it was clear that FSR 4 is based on the WMMA (wave matrix multiply accumulate) instructions found on RDNA3 and 4 GPUs. This is encouraging since we have VK_KHR_cooperative_matrix in Vulkan which maps directly to WMMA. D3D12 was supposed to get this feature in SM 6.8, but it was dropped for some inexplicable reason. It’s unknown if FSR 4 could have used standard WMMA opcodes in DXIL if we went down that timeline. It certainly would have saved me a lot of pain and suffering …

The next step was being able to capture the shaders in question. Ratchet & Clank – Rift Apart was on the list of supported games from AMD and was the smallest install I had readily available, so I fired it up with RGP on Windows and managed to observe WMMA opcodes. Encouraging! The next step was dumping the actual DXIL shaders. However, it seems like the driver blocks FSR 4 if RenderDoc is attached (or some other unknown weirdness occurs), so a different strategy was necessary. The first attempt was to take an FSR 3.1 application without anti-tampering and hook that somehow. The FSR 3.1 demo app in the SDK was suitable for this task. The driver refused to use FSR 4 here, but when I renamed the demo .exe to RiftApart.exe it worked. quack.exe lives on!
Now that I had DXIL and some example ISA to go along with it, it was looking possible to slowly piece together how it all works. Shader extensions in D3D are a disgusting mess, and this is roughly how it works (see the breakdown at the end of this post). dxil-spirv in vkd3d-proton already had some code to deal with this, as Mortal Kombat 11 has some AGS shaders for 64-bit atomics (DXIL, but before SM 6.6). Every WMMA opcode is translated to a ton of these magic instructions back-to-back. A maximum of 21 (!) in fact. dxil-spirv has to pattern match more or less.

The exact details of how WMMA is represented with these opcodes aren’t super exciting, but for testing this functionality, I implemented a header. It seems like a wave matrix is represented with 8 uints. This tracks, as for a 16×16 matrix and wave32 + FP32, you need 256 bits per lane in the worst case. Here’s how WMMA matmul can be represented for example: Fortunately, FSR 4 never tries to be clever with the WMMA_Matrix elements. It seems possible to pass in whatever uint you want, but the shaders follow a strict pattern so dxil-spirv can do type promotion from uint to OpTypeCooperativeMatrixKHR on the first element and ignore the rest. It took many days of agonizing trial and error, but eventually I managed to put together a test suite that exercises all the opcodes that I found in the DXIL files I dumped.

Ideally, there would be a straightforward mapping to KHR_cooperative_matrix, but that’s not really the case. FSR 4 is heavily reliant on FP8 WMMA. This is an exclusive RDNA4 feature. RDNA3 has WMMA, but only FP16. There is currently no SPIR-V support for Float8 either (but given that BFloat16 just released, it’s not a stretch to assume something will happen in this area at some point). To make something that is actually compliant with Vulkan at this point, I implemented emulation of FP8. Converting FP8 to FP16 is fairly straightforward. While we don’t have float8 yet, we have uint8. Doing element-wise conversions like this is not strictly well defined, since the wave layout of different types can change, but it works fine in practice. I’m quite happy with this bit-hackery. Sign-extend, shift, bit-and, and an fmul.

FSR 4 also relies on FP32 -> FP8 conversions to quantize accumulation results back to FP8 for storage between stages. This is … significantly more terrible to emulate. Doing accurate RTE with denorm handling in soft-float is GPU sadism. It explodes driver compile times and runtime performance is borderline unusable as a result. In many places, we need to handle loading 8-bit matrices, which are converted to FP32 accumulators and back. Vulkan can support this, but it relies on drivers exposing it via the physical device query. No driver I know of exposes 8-bit accumulator support in any operations, which means we’re forced to go out of spec. With some light tweaks to RADV, it works as expected however. A driver should be able to expose 8-bit accumulator types and do the trunc/extend as needed. It’s somewhat awkward that there is no way to expose format support separately, but it is what it is.

In several places, the shaders need to convert e.g. Accumulator matrices to B matrices. This is a use case not covered by KHR_cooperative_matrix. The universal workaround is to roundtrip store/load via LDS, but that’s kind of horrible for perf. I ended up with some hacky code paths for now (described at the end of this post). After implementing all the opcodes and making all my tests pass, it was time to throw this at the wall. (Screenshots from TkG, pardon the French) … fun.
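Backing up a moment, the FP8 -> FP16 trick is worth spelling out. An E4M3 value keeps the same sign/exponent/mantissa ordering as FP16, so a shift and a mask put the bits in the right place, and a single multiply fixes the exponent bias difference (7 vs 15). Here is a sketch of the idea in GLSL, assuming E4M3 and ignoring NaN; this is not necessarily the exact code in dxil-spirv/vkd3d-proton:

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical FP8 (E4M3) -> FP16 conversion along the lines of the
// shift/mask/fmul recipe described above. NaN is ignored for brevity.
float16_t fp8_e4m3_to_fp16(uint v)
{
    // Sign goes to bit 15; the 4-bit exponent and 3-bit mantissa slot straight
    // into the FP16 bit layout.
    uint h = ((v & 0x80u) << 8u) | ((v & 0x7fu) << 7u);
    // The exponent bias differs by 8 (7 vs 15), fixed up with one multiply by 2^8.
    // This also handles FP8 denormals correctly.
    return unpackFloat2x16(h).x * float16_t(256.0);
}
```

The FP32 -> FP8 direction has no shortcut this cheap, which is why the round-to-nearest-even soft-float emulation described above is so painful.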
The first step was to figure out if there was something I may have missed about opcode behavior. For SSBOs and LDS, I simply assumed that 4 byte alignment would be fine, then I found: 1 byte aligned coopmat (8-bit matrix), yikes … Technically out of spec for cooperative_matrix, but nothing we can do about that. AMD deals with this just fine. In the original code, I had used indexing into a u32 SSBO and just divided the byte offset by 4, but obviously that breaks. I added support for 8-bit SSBO aliases in dxil-spirv, updated the test suite, and we get: Still not right.

Eventually I found some questionable looking code with LDS load-store. The strides couldn’t possibly be right. It turns out that offset/stride for LDS is in terms of u32 words, not bytes (?!). This detail wasn’t caught in the test suite because that bug would cancel itself out on a store and load. Fortunately, this quirk made my life easier in dxil-spirv, since there’s no easy way of emitting 8-bit LDS aliases.

Things were looking good, but it was far from done. There was still intense shimmering and ghosting, which couldn’t be observed from screenshots, and I couldn’t figure it out by simply staring at code. It was time to bring out the big guns. I needed a way to directly compare outputs from native and vkd3d-proton, to be able to isolate exactly where the implementations diverged. Fortunately, since we’re not getting blocked by the driver when trying to capture anymore, I captured a RenderDoc frame from the 3.1 demo. Fortunately, the inputs and outputs are very simple. The middle passes were very simple from a resource binding standpoint (see the notes at the end of this post). In RenderDoc, I dumped the buffer contents to disk, and built a small D3D12 test app that invokes the shaders with the buffers in question. Every dispatch, dump the scratch buffer contents out to disk, and by running it against the native D3D12 driver and vkd3d-proton, figure out where things diverge.

Turns out, yes, they can diverge. It turns out many of the shaders only allocate 256 bytes of LDS, yet the shaders actually need 512. Classic undefined behavior. The reason this “happens” to work on native is that AMD allocates LDS space with 512 byte granularity. However, dxil-spirv also emits some LDS to deal with matrix transpositions and it ended up clobbering the AGS shader’s LDS space … One disgusting workaround later … and games were rendering correctly. The FP16 path on RDNA4 was conquered.

Performance was absolute garbage, as expected. 1440p on a 9070 XT on native is about 0.85 ms and my implementation was about 3 ms. RADV obviously cannot ship FP8 before there is a Vulkan / SPIR-V spec, but with the power of open source and courage, why not experiment. Nothing is stopping us from just emitting it anyway and seeing what happens. Georg Lehmann brought up FP8 support in NIR and ACO enough to support FP8 WMMA. Hacking FP8 support into dxil-spirv was quite straightforward and done in an hour. Getting the test suite to pass was smooth and easy, but … the real battle was yet to come.

Games were completely broken in the FP8 path. Fortunately, this difference reproduced in my test bench. The real issue now was bisecting the shader itself to figure out where the shaders diverge. These shaders are not small. They’re full of incomprehensible ML gibberish, so the only solution I could come up with was capturing both FP16 and FP8 paths and debug printing side-by-side. Fortunately, RenderDoc makes this super easy.
First, I had to hack together FP8 support in SPIRV-Cross and glslang so that roundtripping could work. Eventually, I found the divergent spot, and Georg narrowed it down to broken FP8 vectorization in ACO. Once this was fixed, FP8 was up and running. Runtime was now down to 1.3 ms. FSR 4 really likes to convert Accumulator to B matrices, but on RDNA4, the layouts match (at least for 8-bit), so until we have an NV_cooperative_matrix2 implementation, I pretended it worked by copying elements instead, and runtime went down to about 1 ms. RADV codegen for coopmat is currently very naive, especially the buffer loading code, which is extremely inefficient, but despite that, we’re pretty close to native here. Now that there is a good use case for cooperative matrix, I’m sure it will get optimized eventually.

At this point, the FP8 path is fully functional and performant enough, but of course it needs building random Mesa branches and enabling hacked up code paths in vkd3d-proton.

RDNA3 is not officially supported at the moment, but given I already went through the pain of emulating FP8, there’s no reason it cannot work on RDNA3. Given the terrible performance I got in FP16 emulation, I can understand why RDNA3 is not supported though … FSR 4 requires a lot of WMMA brute force to work, and RDNA3’s lesser WMMA grunt is simply not strong enough. Maybe it would work better if a dedicated FP16 model were designed, but that’s not on me to figure out. It took a good while to debug this, since once again the test suite was running fine … Unfortunately, RDNA3 is quite strange when it comes to WMMA layouts. The Accumulator has 8 elements, but A and B matrices have 16 for some reason. NV_cooperative_matrix2 will help here for sure. After fixing the LDS roundtrip, the test bench passed, but games still looked completely broken on RDNA3. This narrowed down the problem to either the pre-pass or post-pass. Going dual GPU and opening up a capture of the FSR 3.1 demo side by side on RDNA3 and RDNA4, I finally narrowed it down to questionable shader code that unnecessarily relies on implementation defined behavior. This is not well behaved cooperative matrix code since it relies on the register layout, and RDNA3 and 4 actually differ here. The columns are interleaved in very different ways. I found a workaround which can be applied to RDNA3 wave32. That was the best I could do without resorting to full shader replacement. The actual fix would be for the shader to just perform this addition after loading from LDS, which would at least be portable. With this, RDNA3 can do FSR 4 on vkd3d-proton if you’re willing to take a massive dump on performance.

I was fearing getting FSR 4 up and running would take a year, but here we are. Lots of different people in the community ended up contributing to this in smaller ways to unblock the debugging process. There probably won’t be a straightforward way to make use of this work until FSR 4 is released in an official SDK, FP8 actually lands in Vulkan, etc, etc, so I’ll leave the end-user side of things out of this blog.

The FSR 3.1 DLL tries to open the AMD D3D12 driver DLL (amdxc64.dll) and queries for a COM interface. Presumably, after checking that FSR 4 is enabled in the control panel and checking if the .exe is allowed to promote, it loads amdxcffx64.dll, which contains the actual implementation. amdxcffx64.dll creates normal D3D12 compute shaders against the supplied ID3D12Device (phew!), with a metric ton of undocumented AGS magic opcodes.
Until games start shipping their own FSR 4 implementation, I’d expect that users need to copy over that DLL from an AMD driver install somehow, but that’s outside the scope of my work. amdxcffx64.dll also seems to call back into the driver DLL to ask for which configuration to use. Someone else managed to patch this check out and eventually figure out how to implement this part in a custom driver shim.

How the AGS shader extension mechanism works:
- Declare a magic UAV at some magic register + space combo
- Emit a ton of oddly encoded atomic compare exchanges to that UAV
  - DXC has no idea what any of this means, but it must emit the DXIL as is
  - Compare exchange is likely chosen because it’s never okay to reorder, merge or eliminate those operations
- Driver compiler recognizes these magic back-doors, consumes the magic stream of atomic compare exchanges, and translates that into something that makes sense

The Accumulator -> B conversion hack:
- On RDNA4, Accum layout is basically same as B layout, so I can abuse implementation specific behavior by copying over elements one by one into a new coopmat with different type.
- On RDNA3, this doesn’t work since len(B) != len(Accum).
- NV_cooperative_matrix2 actually has a feature bit to support this exact use case without hackery, so I can take advantage of that when RADV implements support for it.

FSR 4 pass structure:
- A pre-pass that reads textures, and does very light WMMA work at the end
- A bunch of raw passes that spam WMMA like no tomorrow, likely to implement the ML network
- A final post-pass that synthesizes the final image with more WMMA work of course

Middle-pass resources:
- One big weight buffer
- One big scratch buffer


Graphics programming like it’s 2000 – An esoteric introduction to PlayStation 2 graphics – Part 1

Graphics programming in 2025 can be confusing and exhausting. Let’s travel back to a simpler time. Imagine it’s 2000 again and we’re anticipating what will turn out to be the most successful game console of all time. In our reverie, we have acquired a virtual development kit from the future to get ahead of the curve. Like many others, we must do our taxes and start with Hello Triangle. However, this Hello Triangle will likely be the strangest Hello Triangle yet.

Like any graphics chip – even to this day – it chews a sequence of commands and spits out pixels. The GS chip itself is based around the idea of writing to various hardware registers to trigger work. Everything from drawing triangles to copying image data around is all done by poking the right registers in the right order. To automate this process of peeking and poking hardware registers, the front-end is responsible for reading a command stream and tickling the registers. To get graphics on the screen, our goal will be to prepare a packet of data that a hypothetical GS can process. Where we’re going we need no pesky API.

First, we need to program some HW registers. The GIFTag tells the hardware how to interpret the packet, which is followed by 3 Address + Data packets that tickle the hardware registers of our choosing:

- PRMODE: Programs global settings like texture on/off, fogging on/off, blending on/off, etc. We just need to turn Gouraud shading on, i.e., color is interpolated across the triangle.
- FRAME: Programs where the frame buffer is in VRAM. There is no height. That’s what scissor is for.
- SCISSOR: Sets the scissor rect.

This now forms a packet and we can write that out to file. Time for a new packet. We need to clear the frame buffer to some aesthetically pleasing color. SPRITE primitive to the rescue. Unlike those silly modern GPUs, we have a straightforward quad primitive here. It takes two points – meaning we cannot freely rotate sprites 90 or 270 degrees this way – but we have triangles for those edge cases. The GIFTag programs a primitive list of SPRITE and sets it up so that we interpret 3 registers as RGBA color followed by XYZ. Writing to XYZ “kicks” the vertex. Sounds familiar? glVertex3f in hardware? Yup, yup!

Then the final packet for our triangle: Now we just need to program the hardware to read RGBA + XYZ in a loop 3 times and we can draw a triangle: Now the triangle is in memory and we need to display its lovely pixels on screen. To do this we must program the CRTC. This is mostly boilerplate. Flush all of this out to disk, load it up in parallel-gs-stream and presto:

Compile-able source code can be found here for reference: https://gist.github.com/HansKristian-Work/b88066eb8f14be21277c550a6f775956

parallel-gs-stream can read from a mkfifo file, so you could technically open a file as a FIFO and animate a triangle by writing the SPRITE + TRIANGLE packets followed by a vsync packet in a loop. No need to complicate things. Stay tuned for simple texture mapping with perspective correction.


PlayStation 2 GS emulation – the final frontier of Vulkan compute emulation

As you may or may not know, I wrote paraLLEl-RDP back in 2020. It aimed at implementing the N64 RDP in Vulkan compute. Lightning fast and extremely accurate, plus the added support of up-scaling on top. I’m quite happy with how it turned out. Of course, the extreme accuracy was possible because Angrylion was used as reference and I could aim for bit-exactness against that implementation. Since then, there’s been the lingering idea of doing the same thing, but for PlayStation 2. Until now, there’s really only been one implementation in town, GSdx, which has remained the state-of-the-art for 20 years. paraLLEl-GS is actually not the first compute implementation of the PS2 GS. An attempt was made back in 2014 for OpenCL as far as I recall, but it was never completed. At the very least, I cannot find it in the current upstream repo anymore.

The argument for doing compute shader raster on PS2 is certainly weaker than on N64. Angrylion was – and is – extremely slow, and N64 is extremely sensitive to accuracy, where hardware acceleration with graphics APIs is impossible without serious compromises. PCSX2 on the other hand has a well-optimized software renderer, and a pretty solid graphics-based renderer, but that doesn’t mean there aren’t issues. The software renderer does not support up-scaling for example, and there are a myriad of bugs and glitches with the graphics-based renderer, especially with up-scaling. As we’ll see, the PS2 GS is quite the nightmare to emulate in its own way. My main motivation here is basically “because I can”. I already had a project lying around that did “generic” compute shader rasterization. I figured that maybe we could retro-fit this to support PS2 rendering.

I didn’t work on this project alone. My colleague, Runar Heyer, helped out a great deal in the beginning to get this started, doing all the leg-work to study the PS2 from various resources, doing the initial prototype implementation and fleshing out the Vulkan GLSL to emulate PS2 shading. Eventually, we hit some serious roadblocks in debugging various games, and the project was put on ice for a while since I was too drained dealing with horrible D3D12 game debugging day in and day out. The last months haven’t been a constant fire fight, so I’ve finally had the mental energy to finish it.

My understanding of the GS is mostly based on what Runar figured out, and what I’ve seen by debugging games. The GSdx software renderer does not seem like it’s hardware bit-accurate, so we were constantly second-guessing things when trying to compare output. This caused a major problem when we had the idea of writing detailed tests that directly compared against the GSdx software renderer, and the test-driven approach fell flat very quickly. As a result, paraLLEl-GS isn’t really aiming for bit-accuracy against hardware, but it tries hard to avoid obvious accuracy issues at the very least. Again, this is based on my understanding, and it might not be correct.

The GS is infamous for its insane fill-rate and bandwidth. It could push over a billion pixels per second (in theory at least) back in 2000, which was nuts. While the VRAM is quite small (4 MiB), it was designed to be continuously streamed into using the various DMA engines. Given the extreme fill-rate requirements, we have to design our renderer accordingly. In many ways, the GS is actually simpler than the N64 RDP. Single texture, and a single cycle combiner, where N64 had a two stage combiner + two stage blender.
Whatever AA support is there is extremely basic as well, where the N64 is delightfully esoteric. The parts of the pixel pipeline that are painful to implement with traditional graphics APIs are:

Inherited from the PS1, 0x80 is treated as 1.0, and it can go all the way up to 0xff (almost 2). Shifting by 7 is easier than dividing by 255, I suppose. I’ve seen some extremely ugly workarounds in PCSX2 before to try working around this, since UNORM formats cannot support this as is. Textures are similar, where alpha > 1.0 is representable. There is also wrapping logic that can be used for when colors or alpha go above 0xFF.

The destination alpha can be used as a pseudo-stencil of sorts, and this is extremely painful without programmable blending. I suspect this was added as PS1 compatibility, since the PS1 also had this strange feature. Based on the alpha, it’s possible to conditionally disable blending. Quite awkward without programmable blending … This is another PS1 compat feature. With PS1, it can be emulated by rendering every primitive twice with state changes in-between, but this quickly gets impractical with PS2. Before alpha is written out, it’s possible to OR in the MSB, essentially forcing alpha to 1. It is not equivalent to alphaToOne however, since it’s a bit-wise OR of the MSB. A fun thing alpha tests can do is to partially discard. E.g. you can discard just color, but keep the depth write. Quite nutty. This is also kinda awkward. The only anti-aliasing the PS2 has is AA1, which is a coverage-to-alpha feature. Supposedly, less than 100% coverage should disable depth writes (and blending is enabled), but the GSdx software renderer behavior here is extremely puzzling. I don’t really understand it yet. I’ve still yet to see any games actually using this, but technically, it has D32_UINT support. Fun! From what I could grasp, the GSdx software renderer implements this with FP64 (one of the many reasons I refuse to believe GSdx is bit-accurate), but FP64 is completely impractical on GPUs. When I have to, I’ll implement this with fixed-point math. 24-bit and 16-bit Z should be fine with FP32 interpolation I think.

If you’re on a pure TBDR GPU, most of this is quite doable, but immediate mode desktop GPUs quickly degenerate into ROV or per-pixel barriers after every primitive to emulate programmable blending, both of which are horrifying for performance. Of course, with compute we can make our own TBDR to bypass all this.

Primitives are fortunately provided in a plain form in clip-space. No awkward N64 edge equations here. The VU1 unit is supposed to do transforms and clipping, and emit various per-vertex attributes:

- X/Y: 12.4 unsigned fixed-point
- Z: 24-bit or 32-bit uint
- FOG: 8-bit uint
- RGBA: 8-bit, for per-vertex lighting
- STQ: For perspective correct texturing with normalized coordinates. Q = 1 / w, S = s * Q, T = t * Q. Apparently the lower 8 bits of the mantissa are clipped away, so bfloat24? Q can be negative, which is always fun. No idea how this interacts with Inf and NaN …
- UV: For non-perspective correct texturing. 12.4 fixed-point, un-normalized.

All of this can be implemented fairly easily in normal graphics APIs, as long as we don’t consider upscaling. We have to rely on implementation details in GL and Vulkan, since these APIs don’t technically guarantee top-left raster rules. Since X/Y is unsigned, there is an XY offset that can be applied to center the viewport where you want. This means the effective range of X/Y is +/- 4k pixels, a healthy guard band for 640×448 resolutions.
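Going back to the fixed-point color math for a second, the “0x80 is 1.0” rule can be illustrated with a tiny sketch. This shows the general idea only, not the exact GS combiner/blender equations:

```glsl
// Loose illustration of "0x80 acts as 1.0" fixed-point modulation.
uvec3 ps2_modulate(uvec3 tex, uvec3 vertex_color)
{
    // 0x80 * 0x80 >> 7 == 0x80 (1.0 * 1.0 == 1.0), but the result can exceed 0xFF,
    // which is where the wrapping/clamping logic mentioned above comes in.
    // UNORM formats have no way to express values above 1.0, hence the pain with
    // traditional graphics APIs.
    return (tex * vertex_color) >> 7u;
}
```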
The GS feels very much like old school OpenGL 1.0 with glVertex3f and friends. It even supports TRIANGLE_FAN! Amazing … RGBA, STQ and various registers are set, and every XYZ register write forms a vertex “kick”, which latches vertex state and advances the queue. An XYZ register write may also be a drawing kick, which draws a primitive if the vertex queue is sufficiently filled. The vertex queue is managed differently depending on the topology. The semantics here seem to be pretty straightforward: strip primitives shift the queue by one, and list primitives clear the queue. Triangle fans keep the first element in the queue.

A clever idea is that while rendering to 24-bit color or 24-bit depth, there are 8 bits left unused in the MSBs. You can place textures there, because why not. The 8H, 4HL and 4HH formats support 8-bit and 4-bit palettes nicely.

Pixel coordinates on PS2 are arranged into “pages”, which are 8 KiB, then subdivided into 32 blocks, and then the smaller blocks are swizzled into a layout that fits well with a DDA-style renderer. E.g. for 32-bit RGBA, a page is 64×32 pixels, and 32 8×8 blocks are Z-order swizzled into that page. There is a dedicated cache for framebuffer rendering and textures, one page’s worth. Games often abuse this to perform feedback loops, where they render on top of the pixels being sampled from. This is the root cause of extreme pain. N64 avoided this problem by having explicit copies into TMEM (and not really having the bandwidth to do elaborate feedback effects), and other consoles rendered to embedded SRAM (ala a tiler GPU), so these feedbacks aren’t as painful, but the GS is complete YOLO. Dealing with this gracefully is probably the biggest challenge. Combined with the PS2 being a bandwidth monster, developers knew how to take advantage of copious blending and blurring passes …

Texturing on the GS is both very familiar and arcane. On the plus side, the texel center is at the half-pixel, just like modern APIs. It seems to have 4 bits of sub-texel precision instead of 8, however. This is easily solved with some rounding. It also seems to use floor-rounding instead of nearest-rounding for bi-linear. The bi-linear filter is a normal bi-linear. No weird 3-point N64 filter here.

On the weirder side, there are two special addressing modes. REGION_CLAMP supports an arbitrary clamp inside a texture atlas (wouldn’t this be nice in normal graphics APIs? :D). It also works with REPEAT, so you can have REPEAT semantics on the border, but then clamp slightly into the next “wrap”. This is trivial to emulate. REGION_REPEAT is … worse. Here we can have custom bit-wise computation per coordinate, something like u’ = (u & MASK) | FIX. This is done per-coordinate in bi-linear filtering, which is … painful, but solvable. This is another weird PS1 feature that was likely inherited for compatibility. At least on PS1, there was no bi-linear filtering to complicate things.

Mip-mapping is also somewhat curious. Rather than relying on derivatives, the log2 of the interpolated Q factor, along with some scaling factors, is used to compute the LOD. This is quite clever, but I haven’t really seen any games use it. The downside is that triangle setup becomes rather complex if you want to account for correct tri-linear filtering, and it cannot support e.g. anisotropic filtering, but this is 2000, who cares! Not relying on derivatives is a huge boon for the compute implementation.

Formats are always “normalized” to RGBA8_UNORM. The 5551 format is expanded to 8888 without bit-replication.
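As a small, hedged C++ illustration of the addressing modes and the format normalization just described (names are mine; the alpha values that 16-bit texels expand to come from separate register state on real hardware):

    #include <cstdint>
    #include <algorithm>

    // REGION_CLAMP: clamp the (possibly already repeated) coordinate into an
    // arbitrary rect, e.g. a sub-rectangle of a texture atlas.
    static int region_clamp(int u, int min_u, int max_u)
    {
        return std::min(std::max(u, min_u), max_u);
    }

    // REGION_REPEAT: bit-wise address manipulation. This has to be applied to
    // each coordinate of every tap when bi-linear filtering.
    static int region_repeat(int u, int mask, int fix)
    {
        return (u & mask) | fix;
    }

    // 5551 -> 8888 expansion without bit replication: 5-bit channels are shifted
    // up and the low bits stay zero. ta0/ta1 are the values the 1-bit alpha
    // expands to, supplied by separate register state.
    static uint32_t expand_5551(uint16_t c, uint8_t ta0, uint8_t ta1)
    {
        uint32_t r = uint32_t(c & 0x1fu) << 3;
        uint32_t g = uint32_t((c >> 5) & 0x1fu) << 3;
        uint32_t b = uint32_t((c >> 10) & 0x1fu) << 3;
        uint32_t a = (c & 0x8000u) ? ta1 : ta0;
        return r | (g << 8) | (b << 16) | (a << 24);
    }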
There is no RGBA4444 format. It’s quite feasible to implement the texturing with plain bindless. This is a 1 KiB cache that holds the current palette. There is an explicit copy step from VRAM into that CLUT cache before it can be used. Why hello there, N64 TMEM! The CLUT is organized such that it can hold one full 256 color palette in 32-bit colors. On the other end, it can hold 32 palettes of 16 colors at 16 bpp. There is an explicit command that functions like a “sync and invalidate texture cache”. In the beginning I was hoping to rely on this to guide the hazard tracking, but oh how naive I was. In the end, I simply had to ignore TEXFLUSH. Basically, there are two styles of caching we could take with GS. With “maximal” caching, we can assume that frame buffer caches and texture caches are infinitely large. The only way a hazard needs to be considered is after an explicit flush. This … breaks hard. Either games forget to use TEXFLUSH (because it happened to work on real hardware), or they TEXFLUSH way too much. With “minimal” caching, we assume there is no caching and hazards are tracked directly. Some edge case handling is considered for feedback loops. I went with “minimal”, and I believe GSdx did too. The way to interact with the GS hardware is through the GIF, which is basically a unit that reads data and pokes the correct hardware registers. At the start of a GIF packet, there is a header which configures which registers should be written to, and how many “loops” there are. This maps very well to mesh rendering. We can consider something like one “loop” being: And if we have 300 vertices to render, we’d use 300 loops. State registers can be poked through the Address + Data pair, which just encodes target register + 64-bit payload. It’s possible to render this way too of course, but it’s just inefficient. Textures are uploaded through the same mechanism. Various state registers are written to set up transfer destinations, formats, etc, and a special register is nudged to transfer 64-bit at a time to VRAM. If you missed the brain-dead simplicity of OpenGL 1.0, this is the API for you! For testing purposes, I added a tool to generate a .gs dump format that PCSX2 can consume. This is handy for comparing implementation behavior. First, we program the frame buffer and scissor: Then we nudge some registers to draw: This draws a triangle. We provide coordinates directly in screen-space. And finally, we need to program the CRTC. Most of this is just copy-pasta from whatever games tend to do. When the GS is dumped, we can load it up in PCSX2 and voila: And here’s the same .gs dump is played through parallel-gs-replayer with RenderDoc. For debugging, I’ve spent a lot of time making it reasonably convenient. The images are debug storage images where I can store before and after color, depth, debug values for interpolants, depth testing state, etc, etc. It’s super handy to narrow down problem cases. The render pass can be split into 1 or more triangle chunks as needed. To add some textures, and flex the capabilities of the CRTC a bit, we can try uploading a texture: While PS2 requires POT sizes for textures, REGION_CLAMP is handy for NPOT. Super useful for texture atlases. Here we render a sprite with un-normalized coordinates. Finally, we use the CRTC to do blending against white background. Glorious 256×179 logo Before we get into the page tracker, it’s useful to define a rendering pipeline where synchronization is implied between each stage. 
This pipeline matches what we expect a game to do over and over:

Synchronize the CPU copy of VRAM to the GPU. This is mostly unused, but happens for save state loads or similar.
Upload data to VRAM (or perform a local-to-local copy).
Update the CLUT cache from VRAM.
Unswizzle VRAM into VkImages that can be sampled directly, and handle palettes as needed, sampling from the CLUT cache.
Perform rendering. This begins building a “render pass”, a batch of primitives.
Synchronize the GPU copy of VRAM back to the CPU. This will be useful for readbacks. The CPU should then be able to unswizzle directly from a HOST_CACHED_BIT buffer as needed.

When there are no backwards hazards here, we can happily keep batching and defer any synchronization. This is critical to get any performance out of this style of renderer. Some common hazards here include:

VRAM copy after VRAM copy: This is often a false positive, but we cannot track per-byte. This becomes a simple copy barrier and we move on.

Sampling a texture after a VRAM copy write: Since the GS has a tiny 4 MiB VRAM, it’s very common that textures are continuously streamed in, sampled from, and thrown away. When this is detected, we have to submit all VRAM copy work and all texture unswizzle work, and then begin a new batch. Primitive batches are not disrupted. This means we’ll often see:

Upload texture to VRAM
Upload palette to VRAM
Update CLUT cache
Draw with texture
Trigger unswizzle from VRAM into VkImage if needed

Sampling a texture that was just rendered to: Similar, but here we need to flush out everything. This basically breaks the render pass and we start another one. Too many of these is obviously problematic for performance.

VRAM copy involving a render target: Basically the same as sampling textures, this is a full flush.

Other hazards are ignored, since they are implicitly handled by our pipeline.

Arguably, the hardest part of GS emulation is dealing with hazards. VRAM is read and written with reckless abandon, and any potential read-after-write or write-after-write hazard needs to be dealt with. We cannot rely on any game doing this for us, since the PS2 GS just deals with sync in most cases, and TEXFLUSH is the only real command games will use (or forget to use). Tracking per byte is ridiculous, so my solution is to first subdivide the 4 MiB VRAM into pages. A page is the unit for frame buffers and depth buffers, so it is the most meaningful place to start. On page granularity, we track:

Pending frame buffer write?
Pending frame buffer read? (read-only depth)
VRAM copy writes
VRAM copy reads
Pending read into CLUT cache or VkImage
Blocks which have been clobbered by any write; on the next texture cache invalidate, throw away images that overlap

Textures and VRAM copies have 256 byte alignment, and to avoid a ton of false positives, we need to track on a per-block basis. There are 32 blocks per page, so a u32 bit-mask is okay. As mentioned earlier, there are also cases where you can render to 24-bit color while sampling from the upper 8 bits without hazard. We need to optimize for that case too, so there is also:

A write mask for framebuffers
A read mask for textures

In the example above, the FB write mask is 0xffffff and the texture cache mask is 0xff000000. No overlap, no invalidate.

For host access, there are also timeline semaphore values per page. These values state which sync point to wait for if the host desires mapped read or mapped write access. Mapped write access may require more sync than mapped read if there are pending reads on that page.

Every page contains a list of VkImages which have been associated with it. When a page’s textures have been invalidated, the image is destroyed and has to be unswizzled again from VRAM. There is a one-to-many relationship between textures and pages. A texture may span more than one page, and it’s enough that only one page is clobbered for the texture to be invalidated.

Overall, there are a lot of micro-details here, but the important thing to note is that conservative and simple tracking will not work on PS2 games. Tracking at a 256 byte block level and considering write/read masks is critical.

There are various situations where we may have false positives due to how textures work. Since textures are POT sized, it’s fairly common for e.g. a 512×448 texture of a render target to be programmed as a 512×512 texture. The unused region should ideally be clamped out with REGION_CLAMP, but most games don’t. A render target might occupy those unused pages. As long as the game’s UV coordinates don’t extend into the unused red zone, there are no hazards, but this is very painful to track.
We would have to analyze every single primitive to detect if it’s sampling into the red zone. As a workaround, we ignore any potential hazard in that red zone, and just pray that a game isn’t somehow relying on ridiculous spooky-action-at-a-distance hazards to work in its favor. There are more spicy special cases, especially with texture sampling feedback, but that will come later.

Since we want to batch texture uploads, we have to batch CLUT uploads too. To make this work, we have 1024 copies of the CLUT, a ring buffer of snapshots. One workgroup loops through the updates and writes them to an SSBO. I did a similar thing for the N64 RDP’s TMEM update, where TMEM was instanced. Fortunately, the CLUT update is far simpler than the TMEM update. One potential optimization is that for 256 color / 32 bpp updates, we can parallelize the CLUT update, since nothing from previous iterations will be preserved, but the CLUT update time is tiny anyway.

Since this is Vulkan, we can just allocate a new VkImage, suballocate it from VkDeviceMemory and blast it with a compute shader. Using Vulkan’s specialization constants, we specialize the texture format, and all the swizzling logic becomes straightforward code. REGION_REPEAT shenanigans are also resolved here, so that the ubershader doesn’t have to consider that case and do manual bilinear filtering. Even for render targets, we roundtrip through the VRAM SSBO. There is not really a point in going to the length of trying to forward render targets into textures. Way too many bugs to squash and edge cases to think about.

Like paraLLEl-RDP, paraLLEl-GS is a tile-based renderer. Before binning can happen, we need triangle setup. As inputs, we provide attributes in three arrays. For rasterization, we have a straightforward barycentric-based rasterizer. It is heavily inspired by https://fgiesen.wordpress.com/2011/07/06/a-trip-through-the-graphics-pipeline-2011-part-6/ , which in turn is based on A Parallel Algorithm for Polygon Rasterization (Pineda, 1988) and describes the “standard” way to write a rasterizer with parallel hardware. Of course, the PS2 GS is DDA, i.e. a scanline rasterizer, but in practice this is just a question of nudging ULPs of precision, and since I’m not aware of a bit-exact description of the GS’s DDA, this is fine. paraLLEl-RDP implements the raw DDA form, for example, so it’s certainly possible if we have to.

As an extension to a straightforward triangle rasterizer, I also need to support parallelograms. These are used to implement wide lines and sprites. Wide lines especially are kinda questionable, but I’m not sure it’s possible to fully solve up-scaling + Bresenham in the general case. At least I haven’t run into a case where this really matters.

Evaluating coverage and barycentric I/J boils down to a handful of fixed-point edge equations (see the sketch below). inv_area is computed with a custom fixed-point RCP, which is ~24.0 bits accurate. Using the standard GPU RCP would be bad since it’s just ~22.5 bits accurate and not consistent across implementations. There is no reason to skimp on reproducibility and accuracy, since we’re not doing this work per-pixel. The error_i and error_j terms are caused by the downsampling of the edge equations and the tie-break rules. As a side effect of the GS’s [-4k, +4k] pixel range, the range of the cross-product requires 33 bits in signed integer math. By downsampling a bit, we can get 32-bit integer math to work just fine with 8 sub-pixel bits of accuracy for super-sampling / multi-sampling.
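To make that concrete, here is a rough scalar C++ sketch of the coverage test and barycentric setup; names are mine, and the top-left / error-term biases are assumed to be precomputed during triangle setup rather than derived here:

    #include <cstdint>

    // Vertices and pixel positions are signed fixed-point window coordinates
    // with a few bits of sub-pixel precision, as described above.
    struct Vec2i { int32_t x, y; };

    // 2D cross product (edge function). With the downsampled edge equations this
    // fits in 32-bit; 64-bit is used here purely for clarity.
    static int64_t edge(Vec2i a, Vec2i b, Vec2i p)
    {
        return int64_t(b.x - a.x) * (p.y - a.y) - int64_t(b.y - a.y) * (p.x - a.x);
    }

    struct Coverage { bool covered; float i, j; };

    // bias[] holds the per-edge tie-break offsets (top-left rule plus the
    // error_i / error_j style terms), precomputed in triangle setup.
    static Coverage test_pixel(Vec2i v0, Vec2i v1, Vec2i v2, Vec2i p,
                               const int64_t bias[3], float inv_area)
    {
        int64_t w0 = edge(v1, v2, p) + bias[0];
        int64_t w1 = edge(v2, v0, p) + bias[1];
        int64_t w2 = edge(v0, v1, p) + bias[2];

        Coverage c;
        c.covered = (w0 | w1 | w2) >= 0;   // inside when no edge test is negative
        c.i = float(w1) * inv_area;        // barycentric I/J; the third weight is 1 - I - J
        c.j = float(w2) * inv_area;
        return c;
    }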
Theoretically, this means our upper up-sampling limit is 8×8, but that’s ridiculous anyway, so we’re good here. The parallelogram offsets are very small numbers meant to nudge the tie-break rules in our favor as needed. The exact details of the implementation escape me; I wrote that code years ago. It’s not very hard to derive however. Every primitive gets a struct of transformed attributes as well. This is only read if we actually end up shading a primitive, so it’s important to keep this separate to avoid polluting caches with too much garbage.

Using I/J like this will lead to small inaccuracies when interpolating primitives which expect to land exactly on the top-left corner of a texel with NEAREST filtering. To combat this, a tiny epsilon offset is used when snapping texture coordinates. Very YOLO, but what can you do. As far as I know, hardware behavior is sub-texel floor, not sub-texel round.

Binning itself is mostly uninteresting. Every NxN pixel block gets an array of u16 primitive indices to shade. This makes the maximum number of primitives per render pass 64k, but that’s enough for PS2 games. Most games I’ve seen so far tend to be between 10k and 30k primitives for the “main” render pass, though I haven’t tested the real juggernauts of primitive grunt yet; even so, having to do a little bit of incremental rendering isn’t a big deal. NxN is usually 32×32, but it can be dynamically changed depending on how heavy the geometry load is. For large resolutions and high primitive counts, the binning and memory cost is unacceptable if the block size is just 16×16, for example. One subgroup is responsible for iterating through all primitives in a block. Since triangle setup and binning are stateless, triangle setup and binning for back-to-back passes are batched up nicely to avoid lots of silly barriers.

A key difference between N64 and PS2 is fill-rate and per-pixel complexity. For N64, the ideal approach is to specialize the rasterizing shader, write out per-pixel color + depth + coverage + etc, then merge that data in a much simpler ubershader that only needs to consider depth and blend state rather than full texturing state and combiner state. This is very bandwidth intensive on the GPU, but the alternative is the slowest ubershader written by man. We’re saved by the fact that N64 fill-rate is abysmal. Check out this video by Kaze to see how horrible it is. The GS is quite a different beast. Fill-rate is very high, and per-pixel complexity is fairly low, so a pure ubershader is viable. We can also rely on bindless this time around, so texturing complexity becomes a fraction of what I had to deal with on N64.

Every tile is 4×4, 4×8 and 8×8 for subgroup sizes 16, 32 and 64 respectively. For super-sampling it’s even smaller (it’s 4×4 / 4×8 / 8×8 in the higher resolution domain instead). In the outer loop, we pull in up to SubgroupSize’s worth of primitives, and bin them in parallel. In the inner loop, we can do a scalarized loop which checks coverage per-pixel, one primitive at a time. We can take advantage of early-Z testing of course, but we have to be careful if there are rasterized pixels we haven’t resolved yet and there are Z-writes in flight. In that case we have to defer to late-Z to perform the test. Since we’re an uber-shader, all pixels are “on-chip”, i.e. in registers, so we can take advantage of culling pixels that won’t be visible anyway.
The basic idea here is that after rasterization, if a pixel is considered opaque, it will simply replace the shading request that exists for that framebuffer coordinate. It won’t be visible at all anyway. We only need to perform shading when we really have to, i.e., we’re shading a pixel that depends on the previous pixel’s results. This can happen for e.g. alpha test (if test fails, we preserve existing data), color write masks, or of course, alpha blending. If our pixel remains opaque, we can just kill the pending pixel shade request. Very nice indeed. The gain here wasn’t as amazing as I had hoped since PS2 games love blending, but it helps culling out a lot of shading work. If we have flushes that need to happen, we do so if one pixel needs it. It’s just as fast to resolve all pixels anyway. The resolve is a straight forward waterfall loop that stays in uniform control flow to be well defined on devices without maximal reconvergence support. This scalarization ensures that all branches on things like alpha test mode, blend modes, etc, are purely scalar, and GPUs like that. Scalarizing on the texture index is technically not that critical, but it means we end up hitting the same branches for filtering modes, UBOs for scaling factors are loaded uniformly, etc. When everything is done, the resulting framebuffer color and depth is written out to SSBO. GPU bandwidth is kept to a minimum, just like a normal TBDR renderer. Just implementing single sampled rendering isn’t enough for this renderer to be really useful. The software renderer is certainly quite fast, but not fast enough to keep up with intense super-sampling. We can fix that now. For e.g. 8x SSAA, we keep 10 versions of VRAM on the GPU. When rendering super-sampled, we load the single-sampled VRAM and reference. If they match, we load the super-sampled version. This is important for cases where we’re doing incremental rendering. On tile completion we use clustered subgroup ops to do multi-sample resolve, then write out the super-samples, and the two single-sampled copies. The main advantage of super-sampling over straight up-scaling is that up-scaling will still have jagged edges, and super-sampling retains a coherent visual look where 3D elements have similar resolution as UI elements. One of my pet peeves is when UI elements have a significantly different resolution from 3D objects and textures. HD texture packs can of course alleviate that, but that’s a very different beast. Super-sampling also lends itself very well to CRT post-processing shading, which is also a nice bonus. It’s a fact of life that super-sampling always introduces horrible artifacts if not handled with utmost care. Mitigating this is arguably easier with software renderers over traditional graphics APIs, since we’re not limited by the fixed function interpolators. These tricks won’t make it perfect by any means, but it greatly mitigates jank in my experience, and I already fixed many upscaling bugs that GSdx Vulkan backend does not solve as we shall see later. Sprites are always UI elements or similar, and games do not expect us to up-scale them. Doing so either results in artifacts where we sample outside the intended rect, or we risk overblurring the image if bilinear filtering is used. The trick here is just to force-snap the pixel coordinate we use when rasterizing and interpolating. This is very inefficient of course, but UI shouldn’t take up the entire screen. And if it does (like in a menu), the GPU load is tiny anyway. 
Going further, we can demote SSAA interpolation to MSAA center interpolation dynamically. Many UI elements are unfortunately rendered with normal triangles, so we have to be a bit more careful. This snap only affects attribute interpolation, not Z of course. Here, we snap interpolation to the top-left pixel. This fixes any artifacts for primitives which align their rendering to a pixel center, but some games are mis-aligned, so this snapping can cause texture coordinates to go outside the expected area. To clean this up, we compute a bounding box of the final texture coordinates. Adding bounding boxes can technically cause the notorious block-edge artifacts, but that was mostly a thing on PS1, since emulators like to convert nearest sampling to bilinear.

The heuristic for this is fairly simple. If perspective is used and all vertices in a triangle have the exact same Q, we assume it’s a flat UI primitive. The primitive’s Z coordinates must also match. This is done during triangle setup on the GPU. There can of course be false positives here, but they should be rare. In my experience this hack works well enough in the games I tried.

Here’s a good example of up-sampling going awry in PCSX2. This is with the Vulkan backend: notice the bloom on the glass being mis-aligned and a subtle (?) rectangular pattern being overlaid on the image. This is caused by a post-processing pass rendering in a page-like pattern, presumably to optimize for GS caching behavior. With 8x SSAA in paraLLEl-GS it looks like this instead. There is FSR1 post-upscale in effect which changes the look a bit, but the usual trappings of bad upscale cannot be observed. This is another reason to do super-sampling; texture mis-alignment has a tendency to fix itself. Also, if you’re staring at the perf numbers, this is an RX 7600 in a low power state :’)

Typical UI issues can be seen in games as well. Here’s native resolution, and 4x upscale, which … does not look acceptable. This UI is tricky to render in upscaled mode, since it uses triangles, but the MSAA snap trick above works well and avoids all artifacts. With straight upscale, this is hard to achieve in normal graphics APIs, since you’d need interpolateAtOffset beyond 0.5 pixels, which isn’t supported. Perhaps you could do custom interpolation with derivatives or something like that, but either way, this glitch can be avoided. The core message is basically to never upscale UI beyond plain nearest neighbor integer scaling. It just looks bad.

There are cases where PCSX2 asks for high blending accuracy. One example is MGS2, and I found a spot where GPU perf is murdered. My desktop GPU cannot keep 60 FPS here at 4x upscale. PCSX2 asks you to turn up blend accuracy for this game, but … What happens here is that we hit the programmable blending path with a barrier between every primitive. Ouch! This wouldn’t be bad for tiler mobile GPUs, but for a desktop GPU, it is where perf goes to die. The shader in question does subpassLoad and does programmable blending as expected. Barrier, tiny triangle, barrier, tiny triangle, hnnnnnnng. paraLLEl-GS on the other hand always runs with 100% blend accuracy (assuming no bugs of course). Here’s 16x SSAA (equivalent to 4x upscale). This is just 25 W and 17% GPU utilization on an RX 7600. Not bad.

Other difficult cases include texture sampling feedback. One particular case I found was in Valkyrie Profile 2. This game has a case where it’s sampling its own pixel’s alpha as a palette index.
Quirky as all hell, and similar to MGS2, there’s a barrier between every pixel. In paraLLEl-GS this case is detected, and we emit a magical texture index which resolves to just looking at the in-register framebuffer color instead. Programmable blending go brr. These cases have to be checked per primitive, which is quite rough on CPU time, but it is what it is. If we don’t hit the good path, GPU performance completely tanks. The trick here is to analyze the effective UV coordinates and see if UV == framebuffer position. If we fall off this path, we have to go via texture uploads, which is bad.

It’s comfortably full-speed on PCSX2 here, despite the copious number of barriers, but paraLLEl-GS is reasonably close perf-wise, actually, at 8x SSAA. Overall, we get away with 18 render pass barriers instead of the 500+ which was the case without this optimization. You may notice the interlacing artifacts on the swirlies. The silly game has a progressive scan output, but downsamples it on its own to a field before hitting the CRTC, hnnnnng. Redirecting framebuffer locations in the CRTC might work as a per-game hack, but either way, I still need to consider a better de-interlacer. Some games actually render explicitly in fields (640×224), which is very annoying.

This scene in the MGS2 intro also exposes some funny edge cases with sampling. To get the camo effect, it’s sampling its own framebuffer as a texture, with overlapping coordinates, but not pixel aligned, so this raises some serious questions about caching behavior. PCSX2 doesn’t seem to add any barriers here, and I kinda had to do the same thing. It looks fine to me compared to the software renderer at least. If we’re in a mode where the texture points directly to the frame buffer, we should relax the hazard tracking a bit to avoid 2000+ barriers. This is clearly spooky, since Tales of the Abyss’s bloom effect as shown earlier depends on this to be well behaved, but in that case, at least it uses REGION_CLAMP to explicitly mark the ping-pong behavior. I’m not sure what the proper solution is here. The only plausible path to true bit-accuracy with real hardware is to emulate the caches directly, one pixel at a time. You can kiss performance goodbye in that case.

One of the worst stress tests I’ve found so far has to be Shadow of the Colossus. Just in the intro, we can make the GPU kneel down to 24 FPS with maximum blend accuracy on PCSX2, at just 2x upscale! Even with normal blending accuracy, it is extremely heavy during the intro cinematic. At 8x SSAA, perf is still looking pretty good for paraLLEl-GS, but it’s clearly sweating now. We’re actually still CPU bound on the geometry processing. Optimizing the CPU code hasn’t been a huge priority yet. There’s unfortunately a lot of code that has to run per-primitive, where hazards can happen around every corner and have to be dealt with somehow. I do some obvious optimizations, but it’s obviously not as well-oiled as PCSX2 in that regard. It seems fast enough to comfortably do 4x SSAA. Maybe not in SotC, but … hey.

For now, the only real way to test this is through GS dumps. There’s a hack-patch for PCSX2 that lets you dump out a raw GS trace, which can be replayed. This works via mkfifo as a crude hack to test in real-time, but some kind of integration into an emulator needs to happen at some point if this is to turn into something that’s useful for end users. There’s guaranteed to be a million bugs lurking, since the PS2 library is ridiculously large and there’s only so much I can be arsed to test myself.
At least, paraLLEl-GS has now become my preferred way to play PS2 games, so I can say mission complete. A potential use case, due to its standalone library nature, is as a very old-school rendering API for the old greybeards around who still yearn for the days of PS2 programming, for whatever reason :p

The primitive types are:

Triangles are top-left raster, just like modern GPUs. Pixel center is on an integer coordinate, just like D3D9. (This is a common design mistake that D3D10+ and GL/Vulkan avoid.)
Lines use Bresenham’s algorithm, which is not really feasible to upscale, so we have to fudge it with a rect or parallelogram.
Points snap to the nearest pixel. Unsure which rounding is used though … There is no interpolation ala gl_PointCoord.
Sprites are simple quads with two coordinates. STQ or UV can be interpolated, and it seems to assume non-rotated coordinates. To support rotation, you’d need 3 coordinates to disambiguate.

One GIF “loop” for a shaded, textured vertex is something like:

Write RGBA vertex color
Write texture coordinate
Write position with draw kick

For super-sampling, the 10 copies of VRAM are:

1 copy represents the single-sampled VRAM. It is super-sampled.
1 copy represents the reference value for single-sampled VRAM. This allows us to track when we should discard the super-samples and splat the single sample to all. This can happen if someone copies to VRAM over a render target for whatever reason.
8 copies which each represent the super-samples. Technically, we can reconstruct a higher resolution image from these samples if we really want to, but only the CRTC could easily do that.
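As a rough illustration of how those copies interact at load time (plain C++ with hypothetical names; the real data lives in GPU buffers and the logic in compute shaders):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // One full 4 MiB worth of GS VRAM per copy, for 8x SSAA.
    struct SuperSampledVRAM
    {
        std::vector<uint32_t> single;      // what single-sampled rendering sees
        std::vector<uint32_t> reference;   // last value the super-samples were derived from
        std::vector<uint32_t> samples[8];  // the 8 super-samples per location
    };

    // Super-sampled rendering only trusts the super-samples if the single-sampled
    // value still matches the reference. Otherwise something (e.g. a VRAM copy)
    // overwrote the location, and the single-sampled value is splatted to all samples.
    static uint32_t load_sample(SuperSampledVRAM &vram, size_t addr, unsigned sample)
    {
        if (vram.single[addr] != vram.reference[addr])
        {
            for (auto &s : vram.samples)
                s[addr] = vram.single[addr];
            vram.reference[addr] = vram.single[addr];
        }
        return vram.samples[sample][addr];
    }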


Real-time video streaming experiments with forward error correction

As I previously discussed in my PyroFling post about real-time video streaming, one remaining challenge was error correction. Using UDP, packet loss is inevitable, so there are two approaches to reduce streaming jank:

Error masking – hallucinate missed frames
Forward error correction (FEC) – add redundancy to avoid dropped packets

Re-sending packets is a waste of time in a low latency environment like this, so we can ignore that. If re-sending were okay, I’d just use TCP and forget about all of this anyway. With intra-refresh, error masking is half decent, so I wanted to focus on FEC.

Error correction is its own field of study, but I didn’t have the time to actually study the field. State-of-the-art error correction is extremely advanced, complex to implement and IP encumbered (*ahem*, RaptorQ, *ahem*), but I evaluated some less recent approaches.

Which FEC mechanism we choose needs to take the input data into consideration. A video packet is a variable number of bytes every frame, which I split up into sub-packets of 1024 bytes (+ header) each. A successful transmission only happens when all N bytes are received successfully. Sending partially valid data to the video decoder is likely going to result in horrible things happening, so if even one sub-packet is dropped, I have to drop the full frame. Some error correction schemes rely on fixed block lengths, which isn’t ideal for our variable length input. The classic example everyone taking classes on the subject learns is the Hamming (7, 4) code, but that code is better suited for noisy analog channels where we don’t know if any bit was actually received correctly. What we really want is a method that takes extra knowledge about packet loss into account.

Sending UDP packets over the internet functions like an erasure channel. At the receiver, we know if data was missed. Corrupt packets are dropped by the network (random bit-errors that still pass the CRC check are theoretically possible, I suppose, but I don’t consider that). Since a packet is received all or nothing, we’re actually error correcting in a vectorized fashion. The message we’re error correcting is byte n for all sub-packets P in [0, ceil(N / 1024)), where N is the video packet size in bytes. The error correction algorithm performs the exact same operation for every byte n in a given packet. Another way of looking at this is to consider every packet a single 8192-bit number, but that’s a very mathematical way of looking at it. Either way, given a typical 10 mbit/s stream at 60 fps, we expect about 20 sub-packets per video frame. Some frames will be very small, and some will be larger.

Block codes like Reed-Solomon are well known and very powerful, but they seemed a bit too rigid in their block structure, and their block lengths seem better suited to bit-streams. Being able to adapt how much FEC is used is quite useful in a dynamic system such as streaming. A Compact Disc (which uses Reed-Solomon) has to bake in a fixed amount of error correction, but with streaming, a feedback channel can let us dynamically adjust the amount of error correction as needed. I quickly rejected these codes.

I don’t think this code has a formal name, but understanding it is the foundation for the upcoming section. Given N packets, just take the XOR of all the packets and send that as one FEC block. If there is 1 packet loss, we can recover it by taking the XOR of all received packets and the FEC block. For small video frames with a small number of packets, this is actually the method I went with. The downside of course is that there is no obvious way to recover more than 1 packet.
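A minimal C++ sketch of this single-parity scheme, assuming fixed 1024-byte sub-packets (padded as needed) and hypothetical names:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr size_t SUB_PACKET_SIZE = 1024;
    using SubPacket = std::array<uint8_t, SUB_PACKET_SIZE>;

    // Encoder: the parity block is the XOR of all sub-packets.
    SubPacket make_parity(const std::vector<SubPacket> &packets)
    {
        SubPacket parity{};
        for (const auto &p : packets)
            for (size_t i = 0; i < SUB_PACKET_SIZE; i++)
                parity[i] ^= p[i];
        return parity;
    }

    // Decoder: if exactly one sub-packet was lost, XOR-ing the parity block with
    // all received sub-packets reconstructs the missing one.
    SubPacket recover_single_loss(const std::vector<SubPacket> &received, const SubPacket &parity)
    {
        SubPacket missing = parity;
        for (const auto &p : received)
            for (size_t i = 0; i < SUB_PACKET_SIZE; i++)
                missing[i] ^= p[i];
        return missing;
    }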
I found that spurious packet loss could have 2 or 3 drops in some cases, especially in very large video frames that span up to 100 sub-packets, so this approach was too naive for me. While looking around, I ran into a very clever scheme called a fountain code, in particular, the Luby Transform. There is a nice YouTube video explaining it . I also dug up Chapter 50 in an old textbook I used in my university studies , which had a chapter dedicated to this with more mathematical rigor. This method has some nice properties that are well suited for network transmission: A fountain code is called so because the encoder can spit out an arbitrary number of packets. There is no fixed block structure, and the process is pseudo-random. As long as the encoder and decoder agree on a seed for the process, very little side channel data needs to be communicated. The algorithm is essentially YOLO XOR, with a lot of statistical tweaks. First, we consider the degree d of a packet, which is the number of blocks we take the XOR of when generating a packet. A degree of 1 is the base case where we send a block as-is, and a degree of ceil(N / 1024) is taking the XOR of everything (i.e. the YOLO XOR case). The packets chosen for XOR-ing are randomized. On the receiver end, we look at all our received packets, and if we find a case where we have a packet with degree d, and d – 1 of the packets have been recovered, we can recover the last one through … more XOR-ing. By recovering a new packet, this may cause other packets to reach this condition and the cycle continues. To kick-start this process, some packets with degree = 1 must be transmitted. To make this work well, the literature describes a very particular distribution for d to minimize the expected redundancy. I implemented all of this, but I found some unfortunate practical problems. It is (very) possible I had bugs of course, but debugging completely random processes is not very fun and it’s not like I had a reference result to compare against. Given the completely random process, it’s unbounded how many packets have to be encoded to actually be able to decode, even with no packet loss. Studying the literature, the examples I found seemed to assume a very large number of blocks K. K would be 10000 for example, and as K increases, the variance of redundancy ratios decreases. For my example of K = 20, the algorithm seemed to collapse. Occasionally, I needed 2 or 3x redundancy to complete the decode, which is obviously unacceptable. The statistical distribution for the degree factor d depends on the number of blocks to send, K. This value changes every frame. K can be arbitrarily large, so computing LUTs got awkward. Following the enlightened example of grug brain , I massively simplified the LT code down to something that maybe isn’t as theoretically good, but it worked very well in practice for my particular needs, where packet loss ratios are fairly low and K is low. This basically boils down to a heavily “rigged” LT, but otherwise the encoder and decoder does not really change. This is an obvious thing to do. If there is no packet loss, we guarantee that we can start decoding immediately (good for latency). There is no randomness in this process. I found that a fixed d factor of K / 2 worked well. For odd K, alternate d between ceil(K / 2) and floor(K / 2). For large K, clamping the factor d to something reasonable like 64 worked well too. For every pair of blocks, we want to ensure that blocks selected for XOR are the complement of each other. 
This guarantees that by receiving a pair of FEC blocks, we will always be able to correct one missed packet. With 50% probability, we can recover two packet losses. The odd/even split above is designed to make sure that an odd/even pair always covers all K blocks. As the number of blocks increases, we’ll be able to recover more losses (with lower and lower probability).

Given a fixed number of data packets and N randomly lost packets, we can observe how well this FEC recovers data. The recovery rate for 1 lost packet is 100% by design, so that’s not interesting. XOR degree factor is 20. With larger data blocks, and more FEC blocks to match the redundancy ratio, the recovery ratio improves, but beyond 4 losses the code starts collapsing. That’s fine; I haven’t had too many issues with burst losses like these. XOR degree factor is 40. There’s some interesting stair-stepping here, which might be caused by the mirroring. This suggests we get the most bang for our bandwidth by using an odd number of FEC blocks.

It’s possible to tweak things, however. Using a smaller degree is good when using a lot of FEC packets and more errors are expected. Maybe it’s possible to use a blend of high degree and low degree packets as well (basically the entire point of LT), but this kind of tweaking can be left for another time. PyroFling’s expectation is a low number of losses per video frame, and simplicity beats theoretical performance. If we add e.g. a 0.5% random, uncorrelated packet loss ratio, a degree factor of d = N / 2 seems much better. For larger data sets, the number of expected losses starts increasing, so degree factors seem to prefer d = 100 over d = 200. For larger video frames, we’re far more likely to encounter at least one packet loss, which is why the loss ratio for 0 FEC packets approaches 100%.

Compared to the state of the art, this is likely far from optimal, but it was good enough for my uses, and here’s the latest log from a 4-hour play session between Trondheim and Bergen: ~99.5% of dropped packets were avoided. This was at a 25% FEC redundancy rate. Every dropped video packet is disruptive and lasts many frames, so this improvement was transformative. This isn’t an academic project, so I don’t really care about comparing against a million different FEC algorithms.

A common technique in UDP streaming is to not send all packets immediately, but pace them over an interval. Sending over the full frame interval increases latency by quite a bit, but pacing the stream to a max instantaneous rate of e.g. 60 mbit/s worked alright. The added latency is only a few ms for a ~10-15 mbit/s stream, which is acceptable. Since I’m on Linux, and I was lazy, I rely on the kernel to do the pacing automatically.

Hopefully this demonstrates a simple FEC that is fairly accessible. The last piece of the PyroFling adventure will be to finally tackle Vulkan Video encode.

The fountain code properties that made it well suited for this kind of network transmission:

Designed for erasure channels (e.g. IP networks)
Flexible FEC ratios
Receiver can complete decode after receiving enough data packets (with some major caveats)


Modernizing Granite’s mesh rendering

Granite’s renderer is currently quite old school. It was written with 2017 mobile hardware in mind, after all. Little to no indirect drawing, a bindful material system, etc. We’ve all seen that a hundred times before. Granite’s niche ended up exploring esoteric use cases, not high-end rendering, so it was never a big priority for me to fix that. Now that mesh shading has started shipping and is somewhat proven in the wild with several games shipping UE5 Nanite, as well as Alan Wake II – which all rely on mesh shaders to not run horribly slow – it was time to make a more serious push towards rewriting the entire renderer in Granite. This has been a slow burn project that’s been haunting me for almost half a year at this point. I haven’t really had the energy to rewrite a ton of code like this in my spare time, but as always, holidays tend to give me some energy for these things. Video shenanigans have also kept me distracted this fall. I’m still not done with this rewrite, but enough things have fallen into place that I think it’s time to write down my findings so far.

I had some goals for this new method. Unlike UE5 Nanite and Alan Wake II, I don’t want to hard-require actual VK_EXT_mesh_shader support to run acceptably. Just thinking in terms of meshlets should benefit us in plain multi-draw-indirect (MDI) as well. For various mobile hardware that doesn’t support MDI well (or at all …), I’d also like a fallback path that ends up using regular direct draws. That fallback path is necessary to evaluate the performance uplift as well.

One approach I want to avoid: Nanite relies heavily on rendering primitive IDs to a visibility buffer, where attributes are resolved later. In the primary compute software rasterizer, this becomes a 64-bit atomic, and in the mesh shader fallback, a single primitive ID is exported to the fragment stage as a per-primitive varying, where the fragment shader just does the atomic (no render targets, super fun to debug …). The problem here is that per-primitive varyings don’t exist in the classic vertex -> fragment pipeline. There are two obvious alternatives to work around this. From my spelunking in various shipped titles, Nanite does the latter, and fallback rendering performance is halved as a result (!). Depending on the game, meshlet fallbacks are either very common or very rare, so the real world impact is scene and resolution dependent, but Immortals of Aveum lost 5-15% FPS when I tested it. The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading suggests rendering out a visibility G-buffer using InstanceID (fed through some mechanism) and SV_PrimitiveID, which might be worth exploring at some point. I’m not sure why Nanite did not go that route. It seems like it would have avoided the duplicated vertices.

Mesh shaders are basically a hard requirement for Alan Wake II. It will technically boot without mesh shader support, but the game gives you a stern warning about performance, and they are not kidding. I haven’t dug into what the fallback is doing, but I’ve seen people posting videos demonstrating sub-10 FPS on a 1080 Ti. Given the abysmal performance, I wouldn’t be surprised if they just disabled all culling and draw everything in the fallback.

While studying https://github.com/zeux/meshoptimizer I found support for compressed meshes, a format that was turned into a glTF EXT. It seems to be designed for decompressing on the CPU (a completely serial algorithm), which was not all that exciting for me, but this sparked an idea.
What if I could decompress meshlets on the GPU instead? There are two ways this can be useful: I haven’t seen any ready-to-go implementation of this yet, so I figured this would be my starting point for the renderer. Always nice to have an excuse to write some cursed compute shaders. One annoying problem with mesh shading is that different vendors have very different fast paths through their hardware. There is no single implementation that fits all. I’ve spent some time testing various parameters and observe what makes NV and AMD go fast w.r.t. mesh shaders, with questionable results. I believe this is the number 1 reason mesh shaders are still considered a niche feature. Since we’re baking meshlets offline, the format itself must be able to adapt to implementations that prefer 32/64/128/256 primitive meshlets. It must also adapt nicely to MultiDrawIndirect-style rendering. It should be efficient to decode meshlets in parallel, and in complete isolation. I went through some (read: way too many) design iterations before landing on this design. Going wide means we get lower culling overhead and emitting larger MDI calls avoids us getting completely bottlenecked on command stream frontend churn. I tried going lower than 256, but performance suffered greatly. 256 seemed like a good compromise. With 256 prim/verts, we can use 8-bit index buffers as well, which saves quite a lot of memory. To consider various hardware implementations, very few will be happy with full, fat 256 primitive meshlets. To remedy this, the encoding is grouped in units of 32 – a “sublet” – where we can shade the 8 groups independently, or have larger workgroups that shade multiple sublets together. Some consideration is key to be performance portable. At runtime we can specialize our shaders to fit whatever hardware we’re targeting. Using grouping of 32 is core to the format as well, since we can exploit NV warps being 32-wide and force Wave32 on RDNA hardware to get subgroup accelerated mesh shading. The style signals type of mesh. This is naturally engine specific. A stream is 32 values encoded in some way. Each meshlet has stream_count number of Stream headers. The indexing is trivial: This is where things get a bit more interesting. I ended up supporting some encoding styles that are tailored for the various attribute formats. There’s two parts to this problem. First is to decide on some N-bit fixed point values, and then find the most efficient way to pull those bits from a buffer. I went through several iterations on the actual bit-stuffing. A base value is encoded in Stream::base_value, and the decoded bits are an offset from the base. To start approaching speed-of-light decoding, this is about as fancy as we can do it. I went through various iterations of this model. The first idea had a predictive encoding between neighbor values, where subgroup scan operations were used to complete the decode, but it was too slow in practice, and didn’t really improve bit rates at all. Since the sublet is just 32-wide, we can encode with 5-bit indices. 15 bit / primitive. There is no real reason to use delta encode here, so instead of storing base values in the stream header, I opted to use those bits to encode vertex/index counts. This is decoded to 3×16-bit SINT. The shared exponent is stored in top 16 bits of Stream::bits. This facilitates arbitrary quantization as well. Similar idea as position, but 2×16-bit SINT. 
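As an illustration of that decode, here is a scalar C++ sketch standing in for the shader code; the layout details and names are mine, and the payload is assumed to be padded (as the format pads anyway) so the 64-bit read never runs out of bounds:

    #include <cstdint>

    // Extract an unsigned bit-field of 'bits' width at absolute bit offset
    // 'bit_offset' from a tightly packed u32 payload. Reading two u32 words and
    // shifting in 64-bit handles any field of 16 bits or fewer.
    static uint32_t extract_bits(const uint32_t *payload, uint32_t bit_offset, uint32_t bits)
    {
        uint32_t word  = bit_offset / 32u;
        uint32_t shift = bit_offset & 31u;
        uint64_t v = uint64_t(payload[word]) | (uint64_t(payload[word + 1]) << 32);
        return uint32_t((v >> shift) & ((1ull << bits) - 1ull));
    }

    // Decoding lane 'lane' of a 32-wide sublet stream: fixed bit width per stream,
    // delta applied on top of the base value from the stream header.
    static int32_t decode_component(const uint32_t *stream_payload,
                                    uint32_t lane, uint32_t bits, int32_t base_value)
    {
        uint32_t delta = extract_bits(stream_payload, lane * bits, bits);
        return base_value + int32_t(delta);
    }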
After decoding similar to position, a simple fixup is made to cater to typical UVs which lie in range of [0, +1], not [-1, +1]. Encoded as 4×8-bit SNORM. Normal (XY) and Tangent (ZW) are encoded with Octahedral encoding from meshoptimizer library . To encode the sign of tangent, Stream::bits stores 2 bits, which signals one of three modes: Basically same as Normal/Tangent, but ignore tangent sign handling. For a long time, I was pursuing bitplane encoding, which is one of the simplest ways to encode variable bitrates. We can encode 1 bit for 32 values by packing them in one u32. To speed up decoding further, I aimed to pack everything into 128-bit aligned loads. This avoids having to wait for tiny, dependent 32-bit loads. For example, for index buffers: On Deck, this ends up looking like Thinking about ALU and loads in terms of scalar and vectors can greatly help AMD performance when done right, so this approach felt natural. For variable bit rates, I’d have code like: However, I abandoned this idea, since while favoring SMEM so heavily, the VALU with all the bitfield ops wasn’t exactly amazing for perf. I’m still just clocking out one bit per operation here. AMD performance was quite alright compared to what I ended up with in the end, but NVIDIA performance was abysmal, so I went back to the drawing board, and ended up with the absolute simplest solution that would work. This idea is to just literally pack bits together, clearly a revolutionary idea that noone has ever done before. A VMEM load or two per thread, then some shifts should be all that is needed to move the components into place. E.g. for index buffers: For the actual decode I figured it would be pretty fast if all the shifts could be done in 64-bit. At least AMD has native instructions for that. There is one detail here. For 13, 14 and 15 bit components with uvec3 decode, more than two u32 words may be needed, so in this case, encoder must choose 16 bit. (16-bit works due to alignment.) This only comes up in position encode, and encoder can easily just ensure 12 bit deltas is enough to encode, quantizing a bit more as necessary. Every 256-wide meshlet can turn into an indexed draw call with VK_INDEX_TYPE_UINT8_EXT, which is nice for saving VRAM. The “task shader” becomes a compute shader that dumps out a big multi-draw indirect buffer. The DrawIndex builtin in Vulkan ends up replacing WorkGroupID in mesh shader for pulling in per-meshlet data. Before going further with mesh shading fun, it’s important to validate performance. I needed at least a ballpark idea of how many primitives could be pumped through the GPU with a good old vkCmdDrawIndexed and the MDI method where one draw call is one meshlet. This was then to be compared against a straight forward mesh shader. Zeux’s Niagara renderer helpfully has a simple OBJ for us to play with. When exported to the new meshlet format it looks like: One annoying thing about meshlets is attribute duplication when one vertex is reused across meshlets, and using tiny 32-wide meshlets makes this way worse. Add padding on top for encode and the compression ratio isn’t that amazing anymore. The primitive to vertex ratio is ~1.95 here which is really solid, but turning things into meshlets tends to converge to ~1.0. I tried different sublet sizes, but NVIDIA performance collapsed when I didn’t use 32-wide sublets, and going to 64 primitive / 32 vertex only marginally helped P/V ratios. AMD runtime performance did not like that in my testing (~30% throughput loss), so 32/32 it is! 
After writing this section, AMD released a blog post suggesting that the 2N/N structure is actually good , but I couldn’t replicate that in my testing at least and I don’t have the energy anymore to rewrite everything (again) to test that. The classic “instance the same mesh a million times” strategy. This was tested on RTX 3070 (AMD numbers to follow, there are way more permutations to test there …). The mesh is instanced in a 13x13x13 grid. Here we’re throwing 63.59 million triangles at the GPU in one go. This is the most basic thing to do, so for reference. Here we’re just doing basic frustum culling of meshlets as well as back-face cone culling and emitting one draw call per meshlet that passes test. Significantly more geometry is rejected now due to back-face cull and tighter frustum cull, but performance isn’t that much better. Once we start considering occlusion culling, this should turn into a major win over normal draw calls. In this path, we have a bit more indirection in the vertex shader, so that probably accounts for some loss as well. Here, the meshlet will read directly from the encoded payload, and decode inline in the shader. No per-primitive culling is performed. We’re at the point where we are bound on fixed function throughput. Encoded and Decoded paths are basically both hitting the limit of how much data we can pump to the rasterizer. To actually make good use of mesh shading, we need to consider per-primitive culling. For this section, I’ll be assuming a subgroup size of 32, and a meshlet size of 32. There are other code paths for larger workgroups, which require some use of groupshared memory, but that’s not very exciting for this discussion. The gist of this idea was implemented in https://gpuopen.com/geometryfx/ . Various AMD drivers adopted the idea as well to perform magic driver culling, but the code here isn’t based on any other code in particular. This is tricky, but we only need to be conservative, not exact. We can only reject when we know for sure the primitive is not visible. The first step is to do W divide per vertex and study how that vertex clips against the X, Y, and W planes. We don’t really care about Z. Near-plane clip is covered by negative W tests, and far plane should be covered by simple frustum test, assuming we have a far plane at all. There are things to unpack here. The INACCURATE clip code is used to denote a problem where we might start to run into accuracy issues when converting to fixed point, or GPUs might start doing clipping due to guard band exhaustion. I picked the value arbitrarily. The window coordinate is then computed by simulating the fixed point window coordinate snapping done by real GPUs. Any GPU supporting DirectX will have a very precise way of doing this, so this should be okay in practice. Vulkan also exposes the number of sub-pixel bits in the viewport transform. On all GPUs I know of, this is 8. DirectX mandates exactly 8. This particular way of doing it comes into play later when discussing micro-poly rejection. One thing to note here is that Vulkan clip-to-window coordinate transform does not flip Y-sign. D3D does however, so beware. Based on clip codes we can immediately accept or reject primitives. To compute winding, we need a 2D cross product. While noodling with this code, I noticed that we can still do it in FP32 instead of full 64-bit integer math. We’re working with integer-rounded values here, so based on the magnitudes involved we can pick the exact GEQ test. 
If we risk FP rounding error, we can use GE test. If the results don’t test equal, we know for sure area must be negative, otherwise, it’s possible it could have been positive, but the intermediate values rounded to same value in the end. Culling primitives helped as expected. Less pressure on the fixed function units. Given how pathologically geometry dense this scene is, we expect that most primitives never trigger the rasterizer at all. If we can prove that the bounding box of the primitive lands between two pixel grids, we can reject it since it will never have coverage. There is a lot to unpack in this code. If we re-examine the viewport transform: First, we need to shift by 0.5 pixels. The rasterization test happens at the center of a pixel, and it’s more convenient to sample at integer points. Then, due to top-left rasterization rules on all desktop GPUs (a DirectX requirement), we shift the result by one sub-pixel. This ensures that should a primitive have a bounding box of [1.0, 2.0], we will consider it for rasterization, but [1.0 + 1.0 / 256.0, 2.0] will not. Top-left rules are not technically guaranteed in Vulkan however (it just has to have some rule), so if you’re paranoid, increase the upper bound by one sub-pixel. Now we’re only submitting 1.2 M primitives to the rasterizer, which is pretty cool, given that we started with 31 M potential primitives. Of course, this is a contrived example with ridiculous micro-poly issues. We’re actually at the point here where reporting the invocation stats (one atomic per workgroup) becomes a performance problem, so turning that off: With inline decoding there’s some extra overhead, but we’re still well ahead: This is quite straight forward. Once we have the counts, SetMeshOutputCounts is called and we can compute the packed output indices with a mask and popcount. Can we improve things from here? On NVIDIA, yes. NVIDIA seems to under-dimension the shader export buffers in their hardware compared to peak triangle throughput, and their developer documentation on the topic suggests : Using VK_KHR_fragment_shader_barycentrics we can write code like: Quite the dramatic gain! Nsight Graphics suggests we’re finally SM bound at this point (> 80% utilization), where we used to be ISBE bound (primitive / varying allocation). An alternative that I assume would work just as well is to pass down a primitive ID to a G-buffer similar to Nanite. There are a lot of caveats with this approach however, and I don’t think I will pursue it: Either way, this result was useful to observe. Before running the numbers, we have to consider that the RADV driver already does some mesh shader optimizations for us automatically. The NGG geometry pipeline automatically converts vertex shading workloads into pseudo-meshlets , and RADV also does primitive culling in the driver-generated shader. To get the raw baseline, we’ll first consider the tests without that path, so we can see how well RADV’s own culling is doing. The legacy vertex path is completely gone on RDNA3 as far as I know, so these tests have to be done on RDNA2. Even locked to 1600 MHz (peak), GPU is still just consuming 5.5 W. We’re 100% bound on fixed function logic here, the shader cores are sleeping. As expected, performance scales as we cull. Still 5.5 W. 27.9 ms Not too much changed in performance here. We’re still bound on the same fixed function units pumping invisible primitives through. 28.4 ms When we don’t cripple RADV, we get a lot of benefit from driver culling. GPU hits 12.1 W now. 
9.6 ms
Slight win.
8.9 ms

Using Vulkan 1.3’s subgroup size control feature, we can force RDNA2 to execute in Wave32 mode. This requires driver support for requiredSubgroupSize in task/mesh shaders. The Deck drivers and upstream Mesa ship that support now, which is very handy; AMD’s Windows drivers and AMDVLK/amdgpu-pro do not, however. It’s possible Wave32 isn’t the best idea for AMD mesh shaders in the first place; it’s just that the format favors Wave32, so I enable it when I can.

While NVIDIA really likes 32/32 (anything else I tried fell off the perf cliff), AMD should in theory favor larger workgroups. However, it’s not that easy in practice, as I found. These results are … surprising. Apparently the Deck (or RDNA2 in general) likes small meshlets? No meaningful difference in performance on Deck. No meaningful difference either. This is a very NVIDIA-centric optimization, I think.

In Vulkan, there are some properties that AMD sets for mesh shaders. This means that we should write outputs using LocalInvocationIndex, which corresponds to how RDNA hardware works. Each thread can export one primitive and one vertex, and the thread index corresponds to the primitive index / vertex index. Due to culling and compaction, we will have to round-trip through groupshared memory somehow to satisfy this. For the encoded representation, I found that it’s actually faster to ignore this suggestion, but for the decoded representation, we can just send the vertex IDs through groupshared and do split vertex / attribute shading, e.g. as in the sketch at the end of this section. Only computing visible attributes is a very common optimization in GPUs in general, and RADV’s NGG implementation does it roughly like this.

Either way, we’re not actually beating the driver-based meshlet culling on Deck. It’s more or less already doing this work for us. Given how close the results are, it’s possible we’re still bound on something that’s not raw compute. On the positive side, the cost of using the encoded representation is very small here, and saving RAM for meshes is always nice.

Already, the permutation hell is starting to become a problem. It’s getting quite obvious why mesh shaders haven’t taken off yet. Data dump section incoming …

By default, RADV disables NGG culling on RDNA3, because apparently it has much stronger fixed-function culling in hardware now. I tried forcing it on with RADV_DEBUG=nggc, but found no uplift in performance for normal vertex shaders. Curious. Here’s with no culling, where the shader is completely export bound. But force NGG on, and it still doesn’t help much. The culling path takes as much time as the other; the instruction latencies are just spread around more.

Wave64 mode is doing quite well here. From what I understand, RADV hasn’t fully taken advantage of the dual-issue instructions in RDNA3 yet, which is important for Wave32 performance, so that might be a plausible explanation. There was also no meaningful difference in doing VertexID passthrough. It’s not exactly easy to deduce anything meaningful out of these numbers, other than 32/32 being bad on RDNA3 while good on RDNA2 (Deck)?

AMD doesn’t seem to like the smaller 256-primitive draws on the larger desktop GPUs. I tried 512 and 1024 as a quick test and that improved throughput considerably. Still, with finer-grained culling in place, it should be a significant win. Since we cannot request a specific subgroup size, the driver is free to pick Wave32 or Wave64 as it pleases, so I cannot test the difference. It won’t hit the subgroup-optimized paths, however.
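To make the compaction and groupshared round-trip concrete, here is a minimal GLSL sketch of the idea (not Granite’s actual shader). It assumes a 32-wide workgroup with one candidate primitive and one candidate vertex per thread; cull_primitive(), load_vertex_id(), load_indices(), shade_position() and shade_attributes() are hypothetical helpers.

```glsl
// Minimal sketch, assuming subgroup size == workgroup size == 32.
#version 450
#extension GL_EXT_mesh_shader : require
#extension GL_KHR_shader_subgroup_ballot : require

layout(local_size_x = 32) in;
layout(triangles, max_vertices = 32, max_primitives = 32) out;

shared uint shared_vertex_id[32];

void main()
{
    uint tid = gl_LocalInvocationIndex;

    // Compact surviving primitives with a ballot: mask + popcount gives each
    // survivor its packed output slot.
    bool prim_alive = cull_primitive(tid);                      // hypothetical
    uvec4 ballot = subgroupBallot(prim_alive);
    uint prim_count = subgroupBallotBitCount(ballot);
    uint prim_slot = subgroupBallotExclusiveBitCount(ballot);

    SetMeshOutputsEXT(32u, prim_count);

    // Route vertex IDs through groupshared so outputs are written by
    // LocalInvocationIndex, which is what the AMD properties ask for.
    // (A real implementation would compact visible vertices the same way.)
    shared_vertex_id[tid] = load_vertex_id(tid);                // hypothetical
    barrier();

    uint vid = shared_vertex_id[tid];
    gl_MeshVerticesEXT[tid].gl_Position = shade_position(vid);  // position-only pass
    shade_attributes(tid, vid);                                 // attribute pass

    if (prim_alive)
        gl_PrimitiveTriangleIndicesEXT[prim_slot] = load_indices(tid); // hypothetical
}
```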
I also did some quick spot checks on AMDVLK, and the numbers are very similar. The proprietary driver is doing quite well here in mesh shaders. On desktop, we can get significant wins on both RADV and proprietary with mesh shaders, which is nice to see. It seems like the AMD Windows driver skipped NGG culling on RDNA3 as well. Performance is basically the same. The job of task shaders is to generate mesh shader work on the fly. In principle this is nicer than indirect rendering with mesh shaders for two reasons: However, it turns out that this shader stage is even more vendor specific when it comes to tuning for performance. So far, no game I know of has actually shipped with task shaders (or the D3D12 equivalent amplification shader), and I think I now understand why. The basic task unit I settled on was: An array of these is prepared on CPU. Each scene entity translates to one or more TaskInfos. Those are batched up into one big buffer, and off we go. The logical task shader for me was to have N = 32 threads which tests AABB of N tasks in parallel. For the tasks that pass the test, test 32 meshlets in parallel. This makes it so the task workgroup can emit up to 1024 meshlets. When I tried this on NVIDIA however … 10x slowdown … The NVIDIA docs do mention that large outputs are bad, but I didn’t expect it to be this bad: Avoid large outputs from the amplification shader, as this can incur a significant performance penalty. Generally, we encourage a flexible implementation that allows for fine-tuning. With that in mind, there are a number of generic factors that impact performance: Size of the payloads. The AS payload should preferably stay below 108 bytes, but if that is not possible, then keep it at least under 236 bytes. If we remove all support for hierarchical culling, the task shader runs alright again. 1 thread emits 0 or 1 meshlet. However, this means a lot of threads dedicated to culling, but it’s similar in performance to plain indirect mesh shading. AMD however, is a completely different story. Task shaders are implemented by essentially emitting a bunch of tiny indirect mesh shader dispatches anyway, so the usefulness of task shaders on AMD is questionable from a performance point of view. While writing this blog, AMD released a new blog on the topic, how convenient! When I tried NV-style task shader on AMD, performance suffered quite a lot. However, the only thing that gets us to max perf on both AMD and NV is to forget about task shaders and go with vkCmdDrawMeshTasksIndirectCountEXT instead. While the optimal task shader path for each vendor gets close to indirect mesh shading, having a universal fast path is good for my sanity. The task shader loss was about 10% for me even in ideal situations on both vendors, which isn’t great. As rejection ratios increase, this loss grows even more. This kind of occupancy looks way better The reason for using multi-indirect-count is to deal with the limitation that we can only submit about 64k workgroups in any dimension, similar to compute. This makes 1D atomic increments awkward, since we’ll easily blow past the 64k limit. One alternative is to take another tiny compute pass that prepares a multi-indirect draw, but that’s not really needed. Compute shader code like this works too: This prepares a bunch of (8, 32k, 1) dispatches that are processed in one go. No chance to observe a bunch of dead dispatches back-to-back like task shaders can cause. 
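A sketch of the kind of setup shader described above (names and buffer layout are mine, not the actual Granite code): it splits a single meshlet counter into indirect mesh dispatches of at most 32k groups in Y, with X = 8 so a 256-wide meshlet is shaded as 8x 32-wide workgroups. draw_count is assumed to be cleared to zero beforehand and is consumed by vkCmdDrawMeshTasksIndirectCountEXT.

```glsl
#version 450
layout(local_size_x = 64) in;

struct IndirectCommand { uint group_count_x, group_count_y, group_count_z; };

layout(set = 0, binding = 0) buffer IndirectDraws
{
    uint draw_count;
    IndirectCommand draws[];
};

layout(set = 0, binding = 1) readonly buffer MeshletCounter
{
    uint total_meshlets;   // written by the culling pass (hypothetical layout)
};

const uint MAX_GROUPS_Y = 32u * 1024u;

void main()
{
    // One thread per potential indirect draw.
    uint draw_index = gl_GlobalInvocationID.x;
    uint base = draw_index * MAX_GROUPS_Y;
    if (base >= total_meshlets)
        return;

    draws[draw_index] = IndirectCommand(8u, min(total_meshlets - base, MAX_GROUPS_Y), 1u);
    atomicMax(draw_count, draw_index + 1u);
}
```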
In the mesh shader, we can use DrawIndex to offset the WorkGroupID by the appropriate amount (yay, Vulkan). A dispatchX count of 8 is to shade the full 256-wide meshlet through 8x 32-wide workgroups. As the workgroup size increases to handle more sublets per group, dispatchX count decreases similarly. To complete the meshlet renderer, we need to consider occlusion culling. The go-to technique for this these days is two-phase occlusion culling with HiZ depth buffer. Some references: Basic gist is to keep track of which meshlets are considered visible. This requires persistent storage of 1 bit per unit of visibility. Each pass in the renderer needs to keep track of its own bit-array. E.g. shadow passes have different visibility compared to main scene rendering. For Granite, I went with an approach where 1 TaskInfo points to one uint32_t bitmask. Each of the 32 meshlets within the TaskInfo gets 1 bit. This makes the hierarchical culling nice too, since we can just test for visibility != 0 on the entire word. Nifty! Here we render all objects which were considered visible last frame. It’s extremely likely that whatever was visible last frame is visible this frame, unless there was a full camera cut or similar. It’s important that we’re actually rendering to the framebuffer now. In theory, we’d be done rendering now if there were no changes to camera or objects in the scene. Based on the objects we drew in phase 1, build a HiZ depth map. This topic is actually kinda tricky. Building the mip-chain in one pass is great for performance, but causes some problems. With NPOT textures and single pass, there is no obvious way to create a functional HiZ, and the go-to shader for this, FidelityFX SPD , doesn’t support that use case. The problem is that the size of mip-maps round down, so if we have a 7×7 texture, LOD 1 is 3×3 and LOD 2 is 1×1. In LOD2, we will be able to query a 4×4 depth region, but the edge pixels are forgotten. The “obvious” workaround is to pad the texture to POT, but that is a horrible waste of VRAM. The solution I went with instead was to fold in the neighbors as the mips are reduced. This makes it so that the edge pixels in each LOD also remembers depth information for pixels which were truncated away due to NPOT rounding. I rolled a custom HiZ shader similar to SPD with some extra subgroup shenanigans because why not (SubgroupShuffleXor with 4 and 8). In this pass we submit for rendering any object which became visible this frame, i.e. the visibility bit was not set, but it passed occlusion test now. Again, if camera did not change, and objects did not move, then nothing should be rendered here. However, we still have to test every object, in order to update the visibility buffer for next frame. We don’t want visibility to remain sticky, unless we have dedicated proxy geometry to serve as occluders (might still be a thing if game needs to handle camera cuts without large jumps in rendering time). In this pass we can cull meshlet bounds against the HiZ. Because I cannot be arsed to make a fancy SVG for this, the math to compute a tight AABB bound for a sphere is straight forward once the geometry is understood. The gist is to figure out the angle, then rotate the (X, W) vector with positive and negative angles. X / W becomes the projected lower or upper bound. Y bounds are computed separately. The math is done in view space where the sphere is still a sphere, which is then scaled to window coordinates afterwards. 
To make the math easier to work with, I use a modified view space in this code where +Y is down and +Z is in view direction. First, convert to integer coordinates. Figure out a LOD where we only have to sample a 2×2 footprint. findMSB to the rescue. And finally, sample: Trying to get up-close, it’s quite effective. Without culling: With two-phase: As the culling becomes more extreme, GPU go brrrrr. Mostly just bound on HiZ pass and culling passes now which can probably be tuned a lot more. I’ve spent way too much time on this now, and I just need to stop endlessly tuning various parameters. This is the true curse of mesh shaders, there’s always something to tweak. Given the performance I’m getting, I can call this a success, even if there might be some wins left on the table by tweaking some more. Now I just need to take a long break from mesh shaders before I actually rewrite the renderer to use this new code … And maybe one day I can even think about how to deal with LODs, then I would truly have Nanite at home! The “compression” format ended up being something that can barely be called a compression format. To chase decode performance of tens of billions of primitives per second through, I suppose that’s just how it is. Geometry shaders. Pass-through mode can potentially be used if all the stars align on supported hardware, but using geometry shaders should revoke your graphics programmer’s license. Unroll a meshlet into a non-indexed draw. Duplicate primitive ID into 3 vertices. Use flat shading to pull in the primitive ID. Would it be fast enough to decompress inline inside the mesh shader? This can potentially save a lot of read bandwidth during rendering and save precious VRAM. Bandwidth amplifier on asset loading time. Only the compressed meshlet format needs to go over PCI-e wire, and we decompress directly into VRAM. Similar idea to GDeflate and other compression formats, except I should be able to come up with something that is way faster than a general purpose algorithm and also give decent compression ratios. Wireframe: A pure position + index buffer Textured: Adds UV + Normal + Tangent Skinned: Adds bone indices and weights on top Uniform W = -1 Uniform W = +1 LSB of decoded W encodes tangent W. Tangent’s second component loses 1 bit of precision. If all three vertices are outside one of the clip planes, reject immediately If any vertex is considered inaccurate, accept immediately If one or two of the vertices have negative W, we have clipping. Our math won’t work, so accept immediately. (If all three vertices have negative W, the first test rejects). Perform actual back-face cull. 
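Putting the acceptance/rejection rules above together, here is a conceptual GLSL sketch (not the code from the actual implementation). compute_clip_code(), snap_to_window() and is_front_facing() are hypothetical helpers; the clip code carries per-plane outside bits plus INACCURATE and NEGATIVE_W flags.

```glsl
const uint CLIP_PLANE_BITS     = 0x3Fu;  // outside-plane bits (exact layout not important here)
const uint CLIP_INACCURATE_BIT = 0x40u;
const uint CLIP_NEGATIVE_W_BIT = 0x80u;

bool primitive_is_visible(vec4 c0, vec4 c1, vec4 c2)
{
    uint cc0 = compute_clip_code(c0);
    uint cc1 = compute_clip_code(c1);
    uint cc2 = compute_clip_code(c2);

    // All three vertices outside the same X/Y/W plane: reject.
    if ((cc0 & cc1 & cc2 & CLIP_PLANE_BITS) != 0u)
        return false;

    // Any vertex flagged inaccurate: accept conservatively.
    if (((cc0 | cc1 | cc2) & CLIP_INACCURATE_BIT) != 0u)
        return true;

    // One or two vertices behind W = 0: real clipping, our math breaks down, accept.
    if (((cc0 | cc1 | cc2) & CLIP_NEGATIVE_W_BIT) != 0u)
        return true;

    // Back-face cull on fixed-point-snapped window coordinates (8 sub-pixel bits).
    vec2 w0 = snap_to_window(c0);
    vec2 w1 = snap_to_window(c1);
    vec2 w2 = snap_to_window(c2);
    if (!is_front_facing(w0, w1, w2))
        return false;

    // Micro-poly rejection: shift so pixel centers sit on integers; if no integer
    // sample point lies inside the bounding box, the primitive has no coverage.
    // (The real test runs in snapped fixed point with the top-left-rule nudge.)
    vec2 lo = min(min(w0, w1), w2) - 0.5;
    vec2 hi = max(max(w0, w1), w2) - 0.5;
    if (any(lessThan(floor(hi), ceil(lo))))
        return false;

    return true;
}
```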
- Replace attributes with barycentrics and allowing the Pixel Shader to fetch and interpolate the attributes
- Moves a ton of extra work to fragment stage
- I’m not aiming for Nanite-style micro-poly hell here, so doing work per-vertex seems better than per-fragment
- This result isn’t representative of a real scene where fragment shader load would be far more significant
- Incompatible with encoded meshlet scheme
- It is possible to decode individual values, but it sure is a lot of dependent memory loads to extract a single value
- Very awkward to write shader code like this at scale
- Probably need some kind of meta compiler that can generate code, but that’s a rabbit hole I’m not going down
- Need fallbacks, barycentrics is a very modern feature
- Makes skinning even more annoying
- Loading multiple matrices with fully dynamic index in fragment shader does not scream performance, then combine that with having to compute motion vectors on top …
- Only seems to help throughput on NVIDIA
- We’re already way ahead of MDI anyway

32/32: 9.3 ms
64/64: 10.5 ms
128/128: 11.2 ms
256/256: 12.8 ms

32/32: 10.7 ms
64/64: 11.8 ms
128/128: 12.7 ms
256/256: 14.7 ms

vkCmdDrawIndexed, no frustum culling: 5.9 ms
With frustum cull: 3.7 ms
MDI: 5.0 ms
Encoded – 32/32: 3.3 ms
Encoded – 64/64: 2.5 ms
Encoded – 128/128: 2.7 ms
Encoded – 256/256: 2.9 ms
Decoded – 32/32: 3.3 ms
Decoded – 64/64: 2.4 ms
Decoded – 128/128: 2.6 ms
Decoded – 256/256: 2.7 ms

Encoded – 64/64: 2.4 ms
Encoded – 128/128: 2.6 ms
Encoded – 256/256: 2.7 ms
Decoded – 64/64: 2.2 ms
Decoded – 128/128: 2.5 ms
Decoded – 256/256: 2.7 ms

vkCmdDrawIndexed, no culling: 6.2 ms
With frustum cull: 4.0 ms
MDI: 5.3 ms
Meshlet – Encoded – 32/32: 2.5 ms
Meshlet – Encoded – 64/64: 2.6 ms
Meshlet – Encoded – 128/128: 2.7 ms
Meshlet – Encoded – 256/256: 2.6 ms
Meshlet – Decoded – 32/32: 2.1 ms
Meshlet – Decoded – 64/64: 2.1 ms
Meshlet – Decoded – 128/128: 2.1 ms
Meshlet – Decoded – 256/256: 2.1 ms

- No need to allocate temporary memory to hold the indirect draws
- No need to add extra compute passes with barriers

- https://advances.realtimerendering.com/s2015/ – GPU-Driven Rendering Pipelines
- https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501 – A quite nice tutorial on the subject.
- https://www.youtube.com/watch?v=eviSykqSUUw – 07:53 – Nanite deep-dive presentation
- https://github.com/zeux/niagara – Niagara renderer by Zeux. Basically implemented all of this a long time ago.


My scuffed game streaming adventure – PyroFling

My side projects have a tendency to evolve from a tiny weekend experiment into something that ends up satisfying a very specific niche use case after multiple weekends of nerdsniping myself. This is one of those projects where I started experimenting with how to use external memory in Vulkan and file descriptor flinging on Linux, and it just … grew from there. This is a wild braindump ride with some of the topics being: The first part of this project was to make my own custom WSI implementation and a “server” that could act as a compositor of some sorts. The main difference was that rather than putting the swapchain on screen – which is a rabbit hole I’m not getting myself into – I just wanted to dump the results to a video file. At the end of last year, I was fiddling around with Vulkan video + FFmpeg , and this was the perfect excuse to start considering encoding as well. It would be pretty neat to get a swapchain to stay in VRAM, be encoded directly in Vulkan video and then get back H.264/H.265/AV1 packets. Rather than redirecting WSI to a different “surface” which can get very tricky, this approach is very simple. This is implemented in a Vulkan layer where we hook the swapchain. The basic gist is that we copy any presented swapchain image to an image owned by a layer, which is then sent over to the “compositor” with external memory. Synchronization is all explicit using external semaphores, because of course! The protocol needed for a Vulkan swapchain is pretty simple. In Linux, we can use a Unix domain socket with SOCK_SEQPACKET. This is kinda like a reliable datagram that can also send and receive file descriptors as side band information. From here, clients can connect to the server using e.g. connect() and server can listen() / accept() clients, just like normal TCP. The main difference is that SEQPACKET is not stream based, so we can send individual messages instead, ala UDP, using sendmsg() rather than plain send(). On the receiving end: and then we grab the FDs. These FDs are tied to the message, so we know if this is an image, a semaphore, etc. The protocol from here is pretty simple. Most WSI implementations would be some kind of variant of this under the hood I think. To use external memory in Vulkan we must be sure that the devices are compatible. We can get compatibility information in VkPhysicalDeviceIDProperties. For OPAQUE_FD external types in Vulkan, these must match. There is no particular need to be fancy and use DRM modifiers here. Client sends this information over once. Each VkSurfaceKHR has one connection associated with it. In Vulkan, there can only be one active non-retired swapchain assigned to a surface, so this model works well. When using external memory in Vulkan, the creator and consumer of the external memory must agree on VkImageCreateInfo parameters, so we just fling that information over as-is. If this were a more normal WSI, like X or Wayland, this is where DRM modifiers becomes important, because the consumer is probably not Vulkan, but I only really care about OPAQUE_FD for my use case since I know the image is being consumed in Vulkan. Along with this message, num_image FDs are expected. The server will then import the memory, create images and bind. If the server’s Vulkan device differs from the client, we can round-trip through system memory with VK_EXT_external_host_memory. Two separate GPUs can import the same system memory. 
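For the cross-device fallback (the extension is spelled VK_EXT_external_memory_host), a minimal sketch of how two devices can import the same page-aligned host allocation might look like this. host_ptr, aligned_size, device and pick_memory_type() are assumptions of mine, not PyroFling code, and the extension entry points are loaded via the usual vkGetDeviceProcAddr / volk dance.

```c
VkMemoryHostPointerPropertiesEXT props = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT,
};
vkGetMemoryHostPointerPropertiesEXT(device,
        VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
        host_ptr, &props);

VkImportMemoryHostPointerInfoEXT import_info = {
    .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
    .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
    .pHostPointer = host_ptr,
};

VkMemoryAllocateInfo alloc_info = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .pNext = &import_info,
    /* Must be a multiple of minImportedHostPointerAlignment. */
    .allocationSize = aligned_size,
    .memoryTypeIndex = pick_memory_type(props.memoryTypeBits), /* hypothetical helper */
};

VkDeviceMemory memory;
vkAllocateMemory(device, &alloc_info, NULL, &memory);
/* Repeat on the second device with the same host_ptr to share the data. */
```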
This is very useful to me since I have two GPUs installed and being able to render on one GPU and encode on another GPU is pretty nifty. Can also be very nice to let iGPU do hardware accelerated encode down the line. One binary semaphore is expected as FD here. Explicit sync, yay. I could of course have used timeline semaphores here, but I really didn’t need anything more fancy than binary semaphores and Vulkan WSI requires binary semaphores anyway. If I ever want to port this to Windows, I’ll run into the problem that AMD does not support external timeline OPAQUE_WIN32, so … there’s that The client needs to perform an image barrier to VK_QUEUE_FAMILY_EXTERNAL. The server side completes the transition with an acquire barrier from EXTERNAL into whatever queue family it uses. The present ID is used later so we can implement KHR_present_wait properly. Acquire is async as intended. Typically, the server side does RGB -> YUV conversion and once that “blit” is done, we can release the image to client as long as there are new pending presents that are done. Fortunately, we don’t have to hook vkAcquireNextImageKHR in this implementation since we’re still rendering to the display as normal. In QueuePresentKHR, we’ll do: However, if we were redirecting the WSI completely, implementing the semaphore and fence parameters in vkAcquireNextImageKHR is actually quite awkward since there is no host vkSignalSemaphore and vkSignalFence in Vulkan sadly. Some bonus tips how to do it properly for the curious: The semaphore you give to vkAcquireNextImageKHR isn’t really signaled as you’d expect, rather, it has temporary import semantics with a magic payload, i.e. the semaphore is replaced with a temporary payload of unknown type. When you subsequently wait on that semaphore, the temporary payload is consumed and freed and the semaphore is reverted to its original state. This is very useful, since we should implement AcquireNextImageKHR with vkImportSemaphoreFd and vkImportFenceFd. Passing a semaphore to vkAcquireNextImageKHR is equivalent to temporarily importing a semaphore payload to that semaphore. Because the exportable handle types of an imported semaphore correspond to its current imported payload, and vkAcquireNextImageKHR behaves the same as a temporary import operation for which the source semaphore is opaque to the application, applications have no way of determining whether any external handle types can be exported from a semaphore in this state . Therefore, applications must not attempt to export external handles from semaphores using a temporarily imported payload from vkAcquireNextImageKHR . As long as we can import a payload, we can do whatever we want, neat! This is trivial, just import the binary semaphore we got from AcquireImage message. If the server gives us back a CPU-side eventfd or something similar, this is more awkward. On Linux, we can import SYNC_FD with fd -1. This means it’s already signaled, and it’s a way to signal a binary semaphore from CPU. However, not all implementations support SYNC_FD, although I believe the last holdout (NVIDIA) added support for it in a recent beta, so maybe relying on SYNC_FD on Linux is feasible soon. If that isn’t available we have to go into really nasty hackery, having a pool of already signaled binary OPAQUE_FD semaphores for example. On present, we can signal a new payload on the side, place that in our “pool” of binary semaphores that we can import into an acquire later. Supremely scuffed, but hey, gotta do what you gotta do. 
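The SYNC_FD trick mentioned above boils down to a single temporary import; a sketch (acquire_semaphore and device are assumed to exist, this is not the actual layer code):

```c
/* "Signal" a binary semaphore from the CPU by temporarily importing an
 * already-signaled SYNC_FD payload (fd == -1). Requires SYNC_FD import support. */
VkImportSemaphoreFdInfoKHR import_info = {
    .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
    .semaphore = acquire_semaphore,
    .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
    .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    .fd = -1,   /* -1 means "already signaled" for SYNC_FD */
};
VkResult res = vkImportSemaphoreFdKHR(device, &import_info);
/* The next wait consumes the temporary payload and the semaphore reverts to its
 * permanent state, which matches WSI acquire semantics nicely. */
```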
I don’t think it was a good idea in the end, but I tried splitting the acquire process in two. The basic idea was that I could aggressively signal acquire early, letting the CPU start recording commands, but before you’d actually submit rendering, you’d have to block until the retire event came through. Alternatively, you could wait for acquire + retire events to come through before considering an acquire complete. In practice, this ended up being a vestigial feature and I should probably just get rid of it. It maps rather poorly to Vulkan WSI. This event represents a “vblank” event. A completion event is fired when an image was done rendering and was consumed by a “vblank” (i.e. encoding) event. This can be used to implement KHR_present_wait, proper frame pacing, latency control, etc. I didn’t implement all of the fields here fully, but when you control the protocol you can do whatever you want Overall, this protocol ended up looking vaguely similar to X11 DRI3 Present protocol with the improvement of being explicit sync, async acquire by default, and a better FIFO queue model that does not require insane hackery to accomplish. Implementing FIFO well on X11 is actually a nightmare requiring worker threads juggling scissors to ensure forward progress. Don’t look at wsi_common_x11.c in Mesa if you value your sanity, just saying. A common concern I have with typical screen recording software is that the video output is not well-paced at all. If I record at 60 fps and I’m playing at 144 fps, there’s no way the output will be well paced if it’s just doing the equivalent of taking a snapshot every 16.6 ms. To fix this, I added some modes to optimize for various use cases: The client becomes locked to the server refresh rate. Frame limiting happens either in QueuePresentKHR or WaitForPresentKHR. If the application is using presentIds, we can just redirect WaitForPresentKHR to wait for completion events from our server, instead of the actual swapchain. If it does not use present_wait, we can fall back to frame limiting in QueuePresentKHR. (Frame limiting in AcquireNextImageKHR is broken since you can acquire multiple images in Vulkan and may happen at arbitrary times). Depending on the use case it can be useful to force MAILBOX present mode on the swapchain to avoid a scenario where we’re blocking on two separate clocks at the same time. If I’m playing on a 144 Hz VRR monitor while being frame limited to 60 fps, that’s not a problem, but recording at 60 fps with a 60 Hz monitor could be a problem. If frame pacing of recording is more important than frame pacing of local monitor, the swapchain that goes on screen should have MAILBOX or IMMEDIATE. Client renders unlocked and server will use whatever latest ready image is. Basically MAILBOX. Choose between above modes depending if application is using FIFO or non-FIFO presentation modes. Since we’re not tied to a particular display, we can pretend that every N milliseconds, we’re woken up to encode a video frame. At this point, we pick the last ready image whose requested earliest present time has not been reached, simple enough. We can implement present interval quite neatly as well. When a present request is received, we compute the earliest timestamp we should present based on existing images in the queue. The timestamp_completed here is in number of frames. This is pretty simple and handles any presentation interval. If the period is 0, we can have multiple presentations in flight where they all have target_ts being equal. 
In that case we use the largest presentation ID to make sure we’re picking the last image. Now the image is queued, but it is still in-flight on GPU. Now we kick off a background task which waits for the presentation to complete. At that point we transition the state from Queued to Ready. Once an image becomes Ready, we can retire old images since we know that they will never be used again as an encode source. If this were a normal fullscreen FLIP-style swapchain, we’d have to careful not to signal acquire semaphores until the newly Ready image was actually flipped on screen. We’re doing a BLIT-style swapchain due to our encoding however, so we can retire right away. At vblank time, we’ll pick the appropriate image to encode. If this image is in the Ready state, this is the time to transition it to Complete and send a complete event. There are some quirks compared to a normal FIFO swapchain however. If the server is being very slow to encode, it’s possible that it misses its own vblank intervals. In this case, we might end up “skipping” ahead in the FIFO queue. E.g. an application might have queued up images to be encoded at frame 1000, 1001 and 1002, but server might end up doing 1000, drop, 1002 where frame 1001 is just skipped. This is technically out of spec for Vulkan FIFO, but I don’t care I considered keeping up the pace more important rather than slowing down the client progress just because the encoder was too slow for a split second. From here, video and audio can be encoded fairly straight forward with FFmpeg. After all this, I felt the side project had kind of come to an end for the time being. I removed some old cobwebs in the IPC parts of my brain and got a deeper understanding of WSI on Linux and got basic hwaccel encoding working with NVENC and VAAPI, mission complete. Now I could do: The pyrofling layer automatically connects to the server if it’s spawned after game starts, and you can restart the server and it reconnects seamlessly. Neat! The plan at this point was to wait until Vulkan video encode matured and then hook up the encode path properly, but … things happened, as they usually do. Replaying a classic game with friends and family during the holidays tends to be quite enjoyable, and at some point we ended up trying to recreate the experience remotely. The ideal situation was that one of us would host the game and play it while the other would watch the stream and we could banter. The first attempt was to do this over Discord screen sharing, but the experience here was abysmal. Horrible video quality, stutter, performance, and no good solution for piping through high quality game audio. This testing included Windows. Maybe there’s a way, but I couldn’t find it. I don’t think Discord is designed for this use case. Bad frame pacing completely breaks immersion, simply unacceptable. At this point, I figured OBS might be a solution. Just stream to Twitch or something and people could watch that stream while talking over Discord. While this “worked” in the sense that video was smooth and audio quality good, there were some major drawbacks: At this point, I wanted to test if OBS was adding more buffering than expected, so I dusted off pyrofling, added an option to mux to RTMP / FLV which Twitch expects, and that’s about all you need to stream to Twitch, really. It worked just fine, but latency did not improve. For just watching a stream and talking / commenting alongside it, I needed to find a way to get it down to about 100-200 ms, which is the middle ground of latency. 
I figured most of the delay was due to buffering on Twitch’s end, so I wondered if it’d be possible to host something similar locally. I’d only need to serve one client after all, so bandwidth was not a concern. This venture quickly failed. The closest I found was https://github.com/ossrs/srs , but I couldn’t get it to work reliably and I couldn’t be arsed to troubleshoot some random Github project. The first idea I came up with was to use MPEG-TS as a muxer, add an IO callback, so that instead of writing the MPEG-TS to file I’d beam the data over a socket to any TCP client that connected. FFmpeg can do something similar for you by using “tcp://local-ip:port?listen=1” as the output path, but this is blocking and not practical to use with the FFmpeg API in a multiplexed server. Video players like MPV and VLC can easily open a raw stream over TCP with e.g. tcp://ip:port. It’ll figure out the stream is MPEG-TS and start playing it just fine. This actually worked! But again, I ran into issues. Even in low-latency / no-buffer modes in MPV / VLC, the latency was questionable at best. At the start of the stream, I could observe usable levels of latency, but there seemed to be no internal system to keep latency levels stable over time, especially when audio was also part of the stream. The buffer sizes would randomly grow and eventually I’d sit at over half a second latency. I spent some time searching for random comments from people having the same problems and trying a million different CLI commands that “supposedly” fix the problem, but none of them satisfied me. At this point, I was too deep in, so … Time to write a no-frills custom video player designed for stable low latency streaming. FFmpeg and most/all container formats have a concept of a PTS, when to display a video frame or play back audio. This is used to guide A/V synchronization. I already had this path implemented in Granite. Audio playback is continuous, and we can constantly measure the playback cursor of the audio. If we’re a typical media player with a long latency audio buffer to eliminate any chance of audio hick-ups, we take the audio buffer latency into account as well, so at any instantaneous time point, we can estimate current audio PTS as: This raw form of PTS cannot be used as is, since it’s too noisy. Audio is processed in chunks of about 10 ms in most cases, so the estimate will be erratic. The solution is to smooth this out. We expect the audio PTS to increase linearly with time (duh), so the way I went about it was to fuse wall clock with audio PTS to stay in sync and avoid drift. Now that we have a smooth estimate of PTS, video sync is implemented by simply displaying the frame that has the PTS closest to our estimate. If you have the luxury of present timing, you could queue up a present at some future time where you know audio PTS will match for perfect sync. In my experience you can be off by about 40 ms (don’t quote me on that) before you start noticing something’s off for non-interactive content. While sync-on-audio is great for normal video content, it is kinda iffy for latency. At no point in the algorithm above did we consider video latency, and that is kinda the point here. Video latency is the more important target. Correct audio sync becomes less important when chasing low latency I think. In a real-time decoding scenario, we’re going to be continuously receiving packets, and we have to decode them as soon as they are sent over the wire. 
This means that at any point, we can query what the last decoded video PTS is. Based on that, we can set our ideal target video PTS as: Again, this estimate will be very noisy, so we smooth it out as before using wall time as the fused timer: Now we have another problem. What to do about audio? Frame skipping or frame duplication is not possible with audio, even a single sample of gap in the audio has disastrous results. The solution is to dynamically speed audio up and down very slightly in order to tune ourselves to the correct latency target. The idea is basically to sample our estimated audio PTS regularly and adjust the resampling ratio. This of course requires you to have a high quality audio resampler that can do dynamic adjustment of resampling ratio, which I wrote myself way back in the day for retro emulation purposes. While this technically distorts the audio a bit by altering the pitch, this level of funging is inaudible. 1 cent of a semitone (about 0.05%) is nothing. I believe this is also how MPV’s sync-on-video works. It’s a useful technique for displaying buttery smooth 60 fps video on a 60 Hz monitor. By targeting a reasonably low latency in the new player, we were able to get an acceptable stream going over the internet. We did some basic comparisons and Discord voice came through at the same time as the video feed according to my testers, so mission accomplished I guess! The main drawback now was stream robustness. TCP for live streaming is not a great idea. The second there are hick-ups in the network, the stream collapses for a hot minute since TCP does not accept any loss. When we were all on ethernet instead of Wi-Fi, the experience was generally fine due to near-zero packet loss rates, but right away, a new use case arose: Wouldn’t it be really cool if we could switch who controls the game? This is basically the idea of Steam Remote Play Together, which we have used in the past, but it is not really an option for us based on past experience: At this point I knew I had work cut out for me. Latency had to drop by an order of magnitude to make it acceptable for interactive use. The first step in the process was to investigate the actual latency by the encoder and decoder chains, and the results were … kinda depressing. On the right, my test app was running, while the left was the video feedback over localhost on the same display. The video player was hacked to always present the last available image. 100 ms latency, yikes … I eventually narrowed part of this down to MPEG-TS muxer in FFmpeg adding a lot of latency, over 50 ms just on its own. It was pretty clear I had to get rid of MPEG-TS somehow. Around this point I started to investigate RTP, but I quickly rejected it. RTP does not support multiple streams. It cannot mux audio and video, which is mildly baffling. Apparently you’re supposed to use two completely different RTP streams on different ports. Some kind of external protocol is used to provide this as side band information, and when trying to play an RTP stream in FFmpeg you get hit with: Apparently this is https://en.wikipedia.org/wiki/Session_Description_Protocol , and the whole affair was so cursed I just rejected the entire idea, and rolled my own protocol. I just need to bang over some UDP packets with some sequence counters, payloads and metadata after all, how hard can it be, right? Turns out it wasn’t hard at all. The rest of the latency issues were removed by: For example, here’s some options for NVENC. 
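As an illustrative guess at the kind of AVOptions meant here for h264_nvenc / hevc_nvenc (the exact set and values used in the real implementation may differ):

```c
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>

static void set_nvenc_low_latency(AVCodecContext *ctx)
{
    av_opt_set_int(ctx->priv_data, "delay", 0, 0);         /* no frame output delay (kills the frame queue) */
    av_opt_set_int(ctx->priv_data, "zerolatency", 1, 0);   /* no reordering delay */
    av_opt_set_int(ctx->priv_data, "rc-lookahead", 0, 0);  /* no lookahead */
    av_opt_set_int(ctx->priv_data, "forced-idr", 1, 0);    /* allow forcing real IDR frames on demand */
    av_opt_set_int(ctx->priv_data, "intra-refresh", 1, 0); /* rolling intra instead of periodic I-frames */
    ctx->max_b_frames = 0;                                 /* no B-frames */
}
```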
Some local results with all this hackery in libx264. On my 144 Hz monitor I could sometimes hit a scenario where the video stream and application hit the same vblank interval, which means we just achieved < 7 ms latency, very nice! NVENC also hits this target, but more consistently, here with HEVC encode. AMD with VAAPI HEVC on RX 6800 isn’t quite as snappy though … Hoping Vulkan encode can help here. There might be some weird buffering going on in the VAAPI backends that I have no control over, but still. We’re still in the ~10 ms ballpark. I had better results with AV1 encode on my RX 7600, but I couldn’t be arsed to swap out GPUs just to get some screenshots. Of course, we’re working with the most trivial video footage possible. The true test is real games where I expect encode/decode latency to be more obvious. When doing very low-latency streaming like this, the traditional GOP structure completely breaks down. Intra frames (or I-frames) are encoded as still images and tend to be fairly large. P- and B-frames on the other hand consume far fewer bits. Low latency streaming also requires a lot more bitrate than normal high-latency encoding since we’re making life very difficult for the poor encoder. In a constant bit-rate world where we’re streaming over a link with limited bandwidth, the common solution to this bitrate fluctuation is to just buffer. It’s okay if an I-frame takes 100ms+ to transmit as long as the decode buffer is large enough to absorb the bandwidth difference, but we cannot rely on this for low latency streaming. Here’s a link to the 2010 post by x264 legend Dark Shikari. The solution is intra-refresh where parts of the image is continuously refreshed with I-blocks. Effectively, we have no GOP structure anymore at this point. This allows us to avoid the huge spikes in bandwidth. libx264 and NVENC support this, but sadly, VAAPI implementations do not Hoping we can get this working in Vulkan video encode somehow … The forced-idr option is used so that we can still force normal I-frames at will. Clients can request “pure” I-frames when connecting to the server to kick-start the decode process. Technically, with intra-refresh you can just keep decoding until the image has been fully refreshed at least once, but I had various issues with FFmpeg decoding errors when trying to decode raw intra-refresh without ever seeing a keyframe first, especially with HEVC, so I went with the dumb solution It worked fine. When I try to just display the frames as they come in over the network, the results are … less than ideal. The pacing is completely butchered due to variability in: Under ideal conditions over a local network, network jitter is mostly mitigated, but the variability in encode/decode time is still noticeable on a fixed rate display, causing constant frame drops or dupes. My solution here was to re-introduce a little bit of latency to smooth over this variability. VK_KHR_present_wait is critical to ensure we get the lowest possible latency. On a 60 Hz monitor, we want this frame loop: Just in case we barely miss deadline due to shenanigans, FIFO_RELAXED is useful as well. This is fairly magical and I don’t think any generic “screen capturing” software can and will do this. The idea is that there is an ideal time target when new video frames should be done. If they arrive too early, we can ask the game to slow down slightly, and if it arrives too late, speed up a bit. Basically, this is a phase locked loop over the network. 
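A toy sketch of that loop (names and constants are mine, not taken from the real implementation): the server nudges the period at which it releases the client via Complete events so that frames land at vblank plus the desired phase offset.

```c
#include <math.h>

double update_release_period(double frame_done_time, double vblank_time,
                             double target_phase_offset, double nominal_period)
{
    static double filtered_error;

    /* Positive error: the frame completed later than we wanted. */
    double phase_error = frame_done_time - (vblank_time + target_phase_offset);
    filtered_error += 0.1 * (phase_error - filtered_error);  /* low-pass the jitter */

    /* Late frames shorten the period slightly (client speeds up), early frames
     * lengthen it. Clamp to tiny adjustments, e.g. 60.0 -> 59.99 / 60.01 FPS. */
    double limit = 0.0005 * nominal_period;
    double adjust = fmax(-limit, fmin(limit, -0.05 * filtered_error));
    return nominal_period + adjust;
}
```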
One nice property of this is that we don’t really need to know the network latency at all, it’s self stabilizing. Since the server controls when to send Complete events to the game running, we have full control over whether to render at 60.0 FPS, 60.01 FPS or 59.99 FPS. Tiny adjustments like these is enough to keep the system stable over time. It can also handle scenarios where the refresh rates are a bit more off, for example 59.95 Hz. Of course, large network spikes, lost packets or just good old game stutter breaks the smooth video, but it will recover nicely. With target_phase_offset = -8ms and deadline of 8ms, I have a very enjoyable gaming experience with smooth frame pacing and low latency over local network. At this point, we don’t really care about A/V sync by PTS. The assumption is that audio and video packets arrive together with very similar PTS values. Rather than trying to target a specific PTS, we just want to ensure there is a consistent amount of audio buffering to safely avoid underrun while keeping latency minimal. This strategy is good enough in practice. As cherry on top, we just need to let the client send gamepad events. Using /dev/uinput on Linux, it’s very simple to create a virtual gamepad that Steam can pick up and it “just werks” in all games I tested. It works fine in other programs too of course. It’s trivial to hook this up. For game content in darker regions, I noticed that 10-bit HEVC looked dramatically better than 8-bit, so I went with that. >30mbit/s 10-bit streams with HEVC or AV1 looks basically flawless to me on a TV even with really difficult game content that tends to obliterate most streams. Good luck getting game streaming services to provide that any time soon! The main problem left is that packet loss recovery isn’t really there. I’m not sure if FFmpeg has facilities to recover gracefully from dropped packets other than freaking out about missing reference frames in the logs, and this is a bit outside my wheelhouse. Intra refresh does a decent job of quickly recovering however. I have some hopes that using Vulkan video decode directly will allow me to fake the presence of missed reference frames to mask most of the remaining issues, but that’s a lot of work for questionable gain. Audio is a bit more YOLO, I just ignore it. That seems to be the general strategy of VoIP anyways. There’s also zero security / encryption. I don’t really care. Sadly, I haven’t had much luck getting the work in progress Vulkan encode work to run yet. Hooking up a fully Vulkan encode -> decode chain will be nice when that matures. The decode path is already working. If you actually made it this far, congratulations. I mostly aimed to make this post a braindump of the techniques I went through to make this and I achieved what I set out to do, useful low-latency game streaming tailored exactly for my needs. Basic Unix IPC Fling those file descriptors like a champ Writing a Vulkan layer that captures a swapchain Knowing how to write a layer is pretty useful for any hardcore Vulkan programmer A deeper understanding of how Vulkan WSI can be implemented under the hood Acquire elite, arcane knowledge Techniques for audio/video sync with low latency A million MPV flags will not save you Bespoke hacks will, however How to coax FFmpeg into encoding video with very low latency All the AVOptions, oh my! 
Using /dev/uinput to create virtual gamepads Tie it all together Wait for QueuePresentKHR semaphores Acquire image from server (in the common case, this never blocks) Queue wait for our acquire semaphore Copy WSI image to internal image (transition image layouts as necessary) Resignal QueuePresentKHR semaphores + signal external OPAQUE_FD semaphore Send present message to server Call QueuePresentKHR as normal Async compute shader that rescales and converts RGB to YUV420 Ideally, we’d pass that on to Vulkan video encode directly, but for now, just read back YUV image to system memory Copy into an AVFrame If using hwaccel, av_hwframe_transfer (so many copies …) Send AVFrame to codec Get AVPacket out Send to muxer (e.g. MKV) Create a recording stream Either monitor the soundcard output as an input device … or use pipewire patch bay to record specific audio streams Automating this process would be cool, but … eh Twitch’s idea of “low latency” mode is misleading at best. Expect between 1 and 2 seconds of delay, and as much as 3 in some cases. This was completely useless in practice. It might be barely okay for a streamer watching comments and interacting with an audience. When communicating with “the audience” over voice, and hearing reactions delayed by seconds, it was unusable. Horrible video quality. Twitch caps you to about 6 mbit/s + 8-bit H.264 which has very questionable video quality for game content even with a competent encoder. (Popular streamers get more bandwidth, or so I hear.) This basically forced me into 720p. Surely we can do better than this in 2023 … OBS did not like my multi-GPU setup on Linux and trying to hardware encode on top of that was … not fun Latency too high Video quality not great Only supported by specific games And usually only multi-player co-op games Won’t help us playing non-Steam stuff Disabling frame queue in NVENC Disabling encoding FIFO in server Just encode as soon as possible and blast the packet over UDP Pacing be damned We’ll solve frame pacing later Remove B-frames and look-aheads Well, duh :p “zerolatency” tune in libx264 GPU time for game to render Encoding time (scene dependent) Network jitter Decoding time


Hardcore Vulkan debugging – Digging deep on Linux + AMDGPU

Everyone battle hardened in the world of Vulkan and D3D12 knows that debugging is ridiculously hard once we enter the domain of crashes and hangs. No one wants to do it, and seeing a random GPU crash show up is enough to want to quit graphics programming and take up farming on a remote island. Someone has to do it though, and given how questionable a lot of D3D12 content is w.r.t. correctness, this comes up a lot more often that we’d like in vkd3d-proton land. The end goal of this blog is to demonstrate the magical UMR tool on Linux, which I would argue is the only reasonable post-mortem debugging method currently available on PC, but before we go that deep, we need to look at the current state of crash debugging on PC and the bespoke tooling we have in vkd3d-proton to deal with crashes. Breadcrumbs is a common technique that most modern APIs have some kind of implementation of. The goal of breadcrumbs is simply to narrow down which draws or dispatches caused the failure. This information is extremely limited, but can sometimes be enough to figure out a crash if you’re very lucky and you have knowledge about the application’s intentions with every shader (from vkd3d-proton’s point of view, we don’t obviously). Depending on the competency of the breadcrumb tool, you’d get this information: As far as I know, this is where D3D12 on Windows ends, with two standard alternatives: There are vendor tools at least from NVIDIA and AMD which should make this neater, but I don’t have direct experience with any of these tools in D3D12, so let’s move on to the Vulkan side of things. Buffer markers is the simplest possible solution for implementing breadcrumbs. The basic idea is that a value is written to memory either before the GPU processes a command, or after work is done. On a device lost, counters can be inspected. The user will have to instrument the code somehow, either through a layer or directly. In vkd3d-proton, we can enable debug code which automatically does this for all D3D12 commands with VKD3D_CONFIG=breadcrumbs (not available in release builds). For example, from our dispatch implementation: Then it’s a matter of writing the breadcrumb somewhere: We’ll also record commands and the parameters used in a side band buffer so that we can display the faulting command buffers. Another thing to consider is that the buffer we write to must be coherent with the host. On a device lost happening randomly inside a command we won’t have the opportunity to perform host memory barriers and signal a fence properly, so we must make sure the memory punches straight through to VRAM. On AMD, we can do this with On fault, we scan through the host buffer, and if we observe that TOP and BOTTOM markers are not 0 (never executed) or UINT32_MAX (done), scan through and report the range of failing commands. GPUs execute commands concurrently unless barriers are emitted. This means that there is a large range of potential draws or dispatches in flight at any one time. RADV_DEBUG=syncshaders adds barriers in between every command so that we’re guaranteed a hang will narrow down to a single command. No other Vulkan driver supports this, and it makes RADV the only practical driver for breadcrumb techniques, at least on Vulkan. 
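Going back to the marker writes themselves, a sketch of roughly what gets emitted around each dispatch (cmd, breadcrumb_buffer and friends are assumed to exist; this is not the actual vkd3d-proton code). One marker lands when the command processor reaches the command and one when all prior work has drained, and the buffer is host-visible and, on AMD, ideally allocated as device-coherent/uncached (VK_AMD_device_coherent_memory) so the writes survive a crash.

```c
vkCmdWriteBufferMarkerAMD(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                          breadcrumb_buffer, marker_offset + 0, marker_value);
vkCmdDispatch(cmd, groups_x, groups_y, groups_z);
vkCmdWriteBufferMarkerAMD(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                          breadcrumb_buffer, marker_offset + 4, marker_value);
```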
Sure, it is possible to add barriers yourself between every command to emulate this, but for render passes, this becomes extremely annoying since you have to consider restarting the render pass for every draw call … As a simple example, I’ve hacked up one of the vkd3d-proton tests to write a bogus root descriptor address, which is a great way to crash GPUs in D3D12 and Vulkan. When running with just breadcrumbs, it’s useless: Instead, with syncshaders, it becomes: That’s actionable. A lot of drivers actually support the buffer marker vendor extension, at least in Mesa land, and even NVIDIA (although, on NVIDIA we use another extension for breadcrumb purposes …) With async compute, it’s possible that multiple command streams are in flight, and with breadcrumbs, it’s not possible to determine which queue actually faulted. To aid this, we have VKD3D_CONFIG=single_queue which serializes everything into one VkQueue. The NV vendor extension simplifies things a fair bit. Rather than allocating memory in sysmem and manually writing out markers, one call is made after every command: The argument is a void pointer where you can place whatever you want, so we encode command list index and a counter there. On device fault, you can then query checkpoints from your queues. From there, start looking for TOP_OF_PIPE and BOTTOM_OF_PIPE pipeline stages to get a potential range of commands. BOTTOM_OF_PIPE means we know for sure all commands before completed execution, and TOP_OF_PIPE means the command processor might have started executing all commands up to that point. The main flaw with this extension is there is no easy way to narrow down the range of commands. With RADV we can enforce sync with syncshaders as a (very useful) hack, but there is no such method on NV unless we do it ourselves If we can narrow down a breadcrumb to a specific shader – and it’s reproducible – it might be the time to perform the dark art of shader replacement and GPU debug printing. We know where it crashes, but why is still a mystery. Shader replacement is a special kind of magic that vkd3d-proton needs to consider since we have no means of modifying the original game shaders directly. We get (incomprehensible) DXIL which we then translate into SPIR-V. vkd3d-proton supports bypassing the translation step, and using our own SPIR-V instead. This SPIR-V can be instrumented with debug code which lets us drill down. This is a horribly slow and tedious process, but it’s the only thing we have to inspect shader execution in real time. First, we have to dump all shaders used by a game. From a breadcrumb trace, we’ll hopefully know the shader hashes which are relevant to look at, and we can round-trip them to Vulkan GLSL using SPIRV-Cross. From here, we can modify the shader as we please. It’s pretty obvious where it crashes here, but to demonstrate … Now we can run again with: As expected, the shader did not reach 2, because it crashed. The address also correlates with dmesg: As you can imagine, this is not a fun process when debugging games with 3 ksloc+ long shaders with tons of resource access. To aid this process, we really need UMR … To make debug print work in crash scenarios, we need to use the same trick as buffer markers, i.e., make the print ring buffer device coherent and uncached. If all else fails, we have a trump card. This tool is unique to AMD + Linux as far as I know and it lets us inspect all waves which were executing on the GPU at the time of crash. It’s developed by AMD on freedesktop . 
Alternatively, just install umr-git from AUR if you’re on Arch Linux. Now this is the real deal. RADV can invoke UMR on crashes and dump out a bunch of useful information. The UMR tool is standalone and should work with AMDVLK or amdgpu-pro as well. Nothing stops you from invoking the same CLI commands that RADV does while the device is hung. UMR needs to do pretty deep kernel poking, so permissions will be required. First, we have to expose the debug interfaces. Run this as super user after every boot: If you’re on a multi-GPU system (a good idea if you’re debugging hangs all day every day), it’s possible that the AMD GPU won’t be DRI instance 0. If the instance is not 0, RADV currently does not forward the proper arguments to UMR, so check out this Mesa MR for now and add Hopefully a more automatic solution will present itself soon. If trying to debug games from within Steam, Proton or native, the pressure-vessel container will sandbox away /usr/bin/umr, so you’ll need to bypass it somehow. Details are left elsewhere. Currently, RADV only knows how to dump the GFX ring, so we need to ensure only that queue is used if crashes happen in async compute. In vkd3d-proton, we have VKD3D_CONFIG=single_queue for that purpose. In RADV, this does a few things: It’s also possible to add this debug option to make page faults a little nicer to debug, but usually not needed: Note that while in hang debug mode, games will usually run at less than half performance due to the aggressive synchronization. RADV will try to dump dmesg too, but you probably won’t have permissions, it’s not a big deal. There’s a lot of useful information here, time to bring out your text editor. First, SPIR-V relative to the faulting PSO is dumped. In vkd3d-proton, we emit the DXBC/DXIL hash inside the SPIR-V to correlate back to breadcrumbs or shader dumps. This is a straight up ISA dump of NIR -> ACO -> AMD ISA. This logs all allocations and frees. It’s intended to be parsed with e.g. This is mostly useful to prove application use-after-free or similar. This is the most important dump of all. An entry is made for every active wave. It’s very verbose, but the important parts I think are: Here we can see the faulting instruction is the global_atomic_add. It’s using the address in SGPR 2/3, which we can see is … 10400000, ffff8001, which in little endian is 8001’10400000. Only the lower 48 bits are relevant, and if we look at the page fault, the address matches. If the fault happened in a descriptor-based instruction, we can inspect the descriptor as well, since on AMD, descriptors are consumed in the instruction itself. It’s really convenient in situations like these. Correlating fault site back to HLSL or GLSL needs to be done manually, but it is not particularly difficult. This can be used to inspect the PM4 stream. I rarely find it actionable, but it’s nice to have regardless. It adds breadcrumbs, but at the lowest level possible. The most useful thing (for me at least) is that we can inspect the user registers being set. After the last Rashid update (of course this dropped while I was on vacation, hnnnnng), users were observing random GPU hangs during online play, which is the worst kind of GPU hang. After some online play in custom rooms with people on Discord ready to join the breadcrumb crusade, I was able to reproduce the hang at a ~10% rate. I went through a goose chase. Breadcrumb trace always pointed to the same shader, which is always a good start. We observed a page fault, so it could have been anything. 
Use-after-free or OOB access is the usual cause here. The address did not correspond to any resource however, so that’s always a fun start. Replacing shaders seemed to mask the bug, which was rather puzzling. Usually this points to a Vulkan driver bug, but that lead got us nowhere. When dealing with low repro rate random GPU hangs, this is always the worst, since we don’t know if we were just very unlucky with repros, or if the change actually made a difference … (I didn’t have the old crash dumps lying around, so please excuse the lack of ISA spam. I didn’t feel like spending my Sunday reproducing a bug that I already fixed :v) Sometimes RADV_DEBUG=hang masks bugs as well due to extra sync, but fortunately we got a wave dump eventually. The failure was in a scalar load from a raw pointer. Normally, this means an out-of-bounds root CBV descriptor access. First hint was that this was loading 8 dwords in one go, i.e. an image descriptor. I correlated the ISA with the Vulkan GLSL disassembly and it pointed to this code: It was also bindless. Normally, my spider senses would immediately think that this was an out of bounds descriptor heap access. The descriptor index was computed as root table offset + dynamic offset. Studying the ISA I realized that it was not actually the dynamic offset that was the culprit, but rather the root table offset. Figuring this out would have taken an eternity without SGPR dumps. From the PM4 trace, I was then able to confirm that the SGPR root table offset correlated with vkCmdPushConstants on our end. This was rather puzzling … I had a theory that maybe our root parameter flushing code had a bug, so I added extra instrumentation to our breadcrumbs … Another crash later, I was able to prove that on a GPU fault: Game bug, oops! Turns out this scenario does not trigger an error in D3D12 validation layers when I wrote some tests to study this UB scenario (yaaaay <_<). It’s possible to trigger GPU hangs on the native AMD D3D12 driver this way. Maybe they app-opt it on their end for RE Engine, we’ll never know Our workaround was to emit offset 0 for every unset root table access and the crash went away. … For hardcore GPU debugging on PC, I think RADV currently provides the best experience by far. If you’re stuck debugging hangs in D3D12 on Windows, maybe give RADV + vkd3d-proton a shot. GPU hang recovery on amdgpu is sadly still questionable on Linux, but I have a good time as long as the AMD GPU is not driving the desktop session. I suggest multi-GPU here for an enjoyable experience. I’m also hoping this functionality can be added to the newly released RGD by AMD. A range of draws or dispatches which could potentially have caused failure. Ideally, exactly the draw or dispatch which caused failure. If page fault, which address caused failure? Which resource corresponds to that failure? It is also possible that the address does not correspond to any resource. Causing true OOB on D3D12 and Vulkan is very easy. WriteBufferImmediate (Basically VK_AMD_buffer_marker) Corresponding SPIR-V dump SGPR / VGPR register dumps Which instruction was being executed in every wave GPU disassembly around the crash site syncshaders is implied After every queue submission, RADV waits for idle. 
… For hardcore GPU debugging on PC, I think RADV currently provides the best experience by far. If you're stuck debugging hangs in D3D12 on Windows, maybe give RADV + vkd3d-proton a shot. GPU hang recovery on amdgpu is sadly still questionable on Linux, but I have a good time as long as the AMD GPU is not driving the desktop session; I suggest multi-GPU here for an enjoyable experience. I'm also hoping this functionality can be added to the newly released RGD by AMD.

To recap, this is the kind of information we are after when a GPU hangs, and what the combination of vkd3d-proton breadcrumbs (WriteBufferImmediate, basically VK_AMD_buffer_marker, sketched below) and RADV + UMR gives us:

- A range of draws or dispatches which could potentially have caused the failure, and ideally exactly the draw or dispatch which caused it
- If there is a page fault, which address caused the failure and which resource corresponds to it; it is also possible that the address does not correspond to any resource, since causing true OOB on D3D12 and Vulkan is very easy
- The corresponding SPIR-V dump
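Since breadcrumbs do the heavy lifting of narrowing down the failing submission, here is a minimal sketch of what a VK_AMD_buffer_marker style breadcrumb can look like on the Vulkan side. This is illustrative only and assumes the extension is enabled; it is not how vkd3d-proton actually structures its breadcrumbs.

```cpp
// Minimal breadcrumb sketch using VK_AMD_buffer_marker (the Vulkan analogue
// of D3D12 WriteBufferImmediate). In a real codebase the entry point comes
// from vkGetDeviceProcAddr (or volk).
#include <vulkan/vulkan.h>

// 'markers' is a host-visible VkBuffer; after a hang we read it back and see
// which commands have a "begin" marker but no matching "end" marker.
void dispatch_with_breadcrumb(VkCommandBuffer cmd, VkBuffer markers,
                              uint32_t command_index,
                              uint32_t x, uint32_t y, uint32_t z)
{
    const VkDeviceSize offset = command_index * 2 * sizeof(uint32_t);

    // Written as soon as command processing reaches this point.
    vkCmdWriteBufferMarkerAMD(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                              markers, offset, command_index);
    vkCmdDispatch(cmd, x, y, z);
    // Written only once all prior work has actually completed.
    vkCmdWriteBufferMarkerAMD(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                              markers, offset + sizeof(uint32_t), command_index);
}
```

After a hang, any entry with a begin marker but no end marker narrows the failure down to a range of dispatches, which is exactly the first item on the wishlist above.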


Vulkan video shenanigans – FFmpeg + RADV integration experiments

Vulkan video is finally here and it's a fierce battle to get things working fully. The leaders of the pack right now with the full release are RADV (Dave Airlie) and FFmpeg (Lynne). In Granite, I've been wanting a solid GPU video decoding solution, and I figured I'd work on a Vulkan video implementation over the holidays to try to help iron out any kinks with real-world application integration. The goal was achieving everything a 3D engine could potentially want out of video decode:

- Hardware accelerated GPU decode to RGB without a round-trip through system memory (with optional mip generation when placed in a 3D world)
- Audio decode
- A/V synchronization

This blog is mostly here to demonstrate the progress in FFmpeg + RADV. I made a neat little sample app that fully uses Vulkan video to do a simple Sponza cinema. It supports A/V sync and seeking, which covers most of what a real media player would need. Ideally, this can be used as a test bench. Place a video feed as a 3D object inside Sponza, why not?

This blog post by Lynne summarizes the state of Vulkan video at the time it was written. Note that none of this is merged upstream as of writing and the APIs are changing rapidly. Make sure to install the very latest Vulkan headers; on Arch Linux, install vulkan-headers-git from the AUR, for example. For FFmpeg, check out the branch in the blog and build it, and make sure to install it in some throwaway prefix. For RADV, check out https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode and build that.

Basic operation is a weird video player where the image is a flat 3D object floating in space. For fun the video is also mip-mapped and the plane is anisotropically filtered, because why not. If you have https://github.com/KhronosGroup/glTF-Sample-Models checked out, you can add a glTF scene as well for fun. I hacked it together with Sponza in mind, and then you get the screenshot above with whatever video you're using. Controls:

- WASD: move camera
- Arrow keys: rotate camera
- Space: toggle pause
- HJKL: Vim style seeking

The Granite implementation can be found in https://github.com/Themaister/Granite/blob/master/video/ffmpeg_decode.cpp . It will probably be different in the final upstreamed version, so beware. I'm not an FFmpeg developer either FWIW, so take this implementation with a few grains of salt.

To integrate with Vulkan video, there are some steps we need to take. This assumes some familiarity with FFmpeg APIs and is mostly interesting for non-FFmpeg developers. I had to figure this out with help from Lynne, spelunking in mpv, and looking over the hardware decode samples in FFmpeg upstream.

Before opening the decode context, we provide libavcodec with a hardware device context, scanning through the codec's hardware configurations until we find a Vulkan one. To interoperate with FFmpeg, we have to provide it our own Vulkan device and lots of information about how we created that device. Fortunately, I had most of this query scaffolding in place for Fossilize integration already. Vulkan 1.3 core is required here as well, so I had to bump that too when Vulkan video is enabled. We also need to let FFmpeg know how it can query queues; this is a close match with Granite, but I had to add some extra APIs to make it work. We also need a way to lock Vulkan queues. For integration purposes, not making vkQueueSubmit internally synchronized in Vulkan was a mistake I think, oh well.

Once we've created a hardware context, we can let the codec context borrow it. We also have to override get_format() and return the hardware pixel format. This will work, but we're also supposed to create a frames context before returning from get_format(), which also lets us configure how Vulkan images are created.
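As a rough sketch of what that scaffolding looks like with the stock FFmpeg hwcontext APIs: the real Granite code also fills in its own VkInstance, VkDevice, enabled extensions and queue information through AVVulkanDeviceContext, which I've elided here.

```cpp
// Sketch of wiring a Vulkan hardware device context into libavcodec.
// This uses the generic hwcontext APIs only; the Vulkan-specific fields of
// AVVulkanDeviceContext must be filled in from our own device before init.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>
}

static enum AVPixelFormat get_vulkan_format(AVCodecContext *,
                                            const enum AVPixelFormat *fmts)
{
    // Pick the Vulkan hardware format if the decoder offers it.
    // A frames context should also be set up here before returning.
    for (const enum AVPixelFormat *p = fmts; *p != AV_PIX_FMT_NONE; p++)
        if (*p == AV_PIX_FMT_VULKAN)
            return *p;
    return fmts[0];
}

static bool setup_vulkan_decode(AVCodecContext *codec_ctx, const AVCodec *codec)
{
    // Make sure the codec actually has a Vulkan hwaccel configuration.
    bool has_vulkan = false;
    for (int i = 0; ; i++)
    {
        const AVCodecHWConfig *config = avcodec_get_hw_config(codec, i);
        if (!config)
            break;
        if (config->device_type == AV_HWDEVICE_TYPE_VULKAN)
            has_vulkan = true;
    }
    if (!has_vulkan)
        return false;

    // Create the hardware device context and hand it to the codec context.
    AVBufferRef *device = av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_VULKAN);
    if (!device)
        return false;
    // ... fill in AVVulkanDeviceContext from our own Vulkan device here ...
    if (av_hwdevice_ctx_init(device) < 0)
    {
        av_buffer_unref(&device);
        return false;
    }

    codec_ctx->hw_device_ctx = av_buffer_ref(device);
    codec_ctx->get_format = get_vulkan_format;
    av_buffer_unref(&device);
    return true;
}
```

A real implementation would also create the frames context inside get_format() as described above, which is where the image creation tweaks discussed next come in.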
The primary motivation for overriding image creation was that I wanted to do YCbCr to RGB conversion in a more unified way, i.e. using individual planes. That would be compatible with non-Vulkan video as well, but taking plane views of an image requires VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT. Using per-plane views is important, as we'll see later; YCbCr samplers fall flat when dealing with practical video use cases.

In FFmpeg, decoding works by sending AVPackets to a codec, which spits out AVFrame objects. If these frames are emitted by a software codec, we just poke at AVFrame::data[] directly, but with hardware decoders the pixel format is an opaque type. There are two ways we can deal with this. For non-Vulkan hardware decoders, we can just read back and upload the planes to a VkBuffer staging buffer later, ewwww. Alternatively, each hardware pixel format lets you reinterpret AVFrame::data[] in a "magical" way if you're willing to poke into low-level data structures. For VAAPI, VDPAU and APIs like that, there are ways to use buffer sharing somehow, but the details are extremely hairy and best left to experts. For Vulkan, we don't even need external memory!

First, we need to extract the decode format. Then we can query the VkFormat if we want to stay multi-plane. However, this has some pitfalls in practice. Video frames tend to be aligned to a macro-block size or similar, meaning that the VkImage dimensions might not be equal to the actual size we're supposed to display. Even 1080p falls into this category, since 1080 does not cleanly divide into 16×16 macro blocks. The only way to resolve this without extra copies is to view the planes separately with VK_IMAGE_ASPECT_PLANE_n_BIT and do texture coordinate clamping manually. This way we avoid sampling garbage when converting to RGB. av_vkfmt_from_pixfmt can help here to deduce the per-plane Vulkan formats, but I just did it manually either way.

Processing the frame itself starts with magic casts to get at the underlying Vulkan image. We have to lock the frame while accessing it, since FFmpeg is threaded. Then we have to wait on the timeline semaphore (note that Vulkan 1.3 is required, so this is guaranteed to be supported) and create a VkImageView from the provided image. Based on av_vkfmt_from_pixfmt2 or the per-plane formats from earlier, we know the appropriate Vulkan format to use when creating a view. Queue family ownership transfer is not needed; FFmpeg uses CONCURRENT for the sake of our sanity. After transitioning the layout, we can convert to RGB as we desire. I went with an async compute formulation. If this were a pure video player, we could probably blit this directly to screen with some fancy scaling filters. When we're done, we have to "release" the image back to FFmpeg, and that's it! A rough sketch of this dance follows at the end of the post.

I tried various codec configurations to see the state of things:

- H.264 – 8bit: Works
- H.264 – 10bit: Not supported by hardware
- H.265 – 8bit: Works
- H.265 – 10bit: Works

There's a preliminary branch by Airlie again, but it doesn't seem to have been updated for the final spec yet:

- H.264: Broken
- H.265: Seems to work

Exciting times for Vulkan video. The API is ridiculously low level and way too complicated for mere graphics programming mortals, which is why having first class support in FFmpeg and friends will be so important to make the API usable.
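To make the frame-processing dance above a bit more concrete, here is a rough sketch of consuming a Vulkan AVFrame. The AVVkFrame field names follow the FFmpeg Vulkan hwcontext headers as I understand them at the time of writing, and the APIs are still moving, so treat this as illustrative rather than authoritative.

```cpp
// Rough sketch of consuming one hardware AVFrame with AV_PIX_FMT_VULKAN.
// The fields used below (img, layout, sem, sem_value) come from AVVkFrame in
// libavutil/hwcontext_vulkan.h; double check against the headers you build with.
extern "C" {
#include <libavutil/frame.h>
#include <libavutil/hwcontext_vulkan.h>
}
#include <vulkan/vulkan.h>

void process_vulkan_frame(VkDevice device, AVFrame *frame)
{
    // Magic cast: for hardware pixel formats, data[0] is not pixel data.
    // A real integration also locks the frame around this access,
    // since FFmpeg is threaded.
    AVVkFrame *vk = reinterpret_cast<AVVkFrame *>(frame->data[0]);

    // The decoder signals completion through a timeline semaphore
    // (Vulkan 1.3, so timeline semaphores are guaranteed). A simple host
    // wait is the least subtle way to consume it.
    VkSemaphoreWaitInfo wait_info = { VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO };
    wait_info.semaphoreCount = 1;
    wait_info.pSemaphores = &vk->sem[0];
    wait_info.pValues = &vk->sem_value[0];
    vkWaitSemaphores(device, &wait_info, UINT64_MAX);

    // From here: create per-plane VkImageViews of vk->img[0] with
    // VK_IMAGE_ASPECT_PLANE_n_BIT, transition vk->layout[0] as needed,
    // run the YCbCr -> RGB compute pass with manual texture coordinate
    // clamping (the image may be macro-block aligned and larger than
    // frame->width x frame->height), then update the AVVkFrame bookkeeping
    // to "release" the frame back to FFmpeg.
}
```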
