AMD GPU Debugger
I’ve always wondered why we don’t have a GPU debugger similar to the ones we use for CPUs: a tool that lets us pause execution and examine the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, it shows that this is technically possible. I then found a helpful series of blog posts by Marcell Kiss detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can find out how to do it ourselves. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Then we find that it’s calling , which is a function defined in , a library that acts as middleware between user-mode drivers (UMD) like and and kernel-mode drivers (KMD) like the amdgpu driver. When we try to do some actual work, we have to create a context, which can be achieved by calling from again. Next up, we need to allocate 2 buffers: one for our code and the other for writing our commands into. We do this by calling a couple of functions; here’s how I do it:

Here we’re choosing the domain and assigning flags based on the params. Some buffers will need to be uncached, as we will see:

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped for ease of use. We have to map the memory into both the GPU and the CPU virtual address spaces. The KMD creates the page table when we open the DRM file, as shown here. So we map it to the GPU VM and, if possible, to the CPU VM as well.
At this point, there’s a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying , it doesn’t always tag the page as uncached. I’m not quite sure if it’s a bug in my code or something in . Anyway, the function is ; I opted to do it manually here and issue the IOCTL call myself:

Now we have the context and 2 buffers. Next, we fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing.

Let’s compile our code. We can use the clang assembler for that, like this:

The bash script compiles the code. We’re only interested in the actual machine code, so we use objdump to figure out the offset and the size of the section and copy it to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer.

Next up, filling in the commands. This was extremely confusing for me because it’s not well documented. It was mostly learning how does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called , which has multiple types. We only care about : each packet has an opcode and a count of the DWORDs it contains.

The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are ; those registers are responsible for a number of configurations. Then there are pgm_[lo/hi], which hold the pointer to the code buffer, and , which are responsible for the number of threads inside a work group.
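As a rough illustration of the packet layout discussed above, here is a minimal sketch in C of how a type-3 packet header and a register-write packet could be built. The opcode and register-base values are assumptions taken from the PACKET3 definitions in the amdgpu kernel headers, not from this post, and the helper names `pkt3_header` and `emit_set_sh_regs` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, mirroring the PACKET3 definitions in the amdgpu kernel headers. */
#define PKT_TYPE3        3u
#define PKT3_SET_SH_REG  0x76u    /* opcode: write consecutive SH registers */
#define SH_REG_BASE_DW   0x2C00u  /* first SH register, as a DWORD offset */

/* Header for a type-3 packet whose body is `body_dw` DWORDs long.
 * The count field stores body_dw - 1. */
uint32_t pkt3_header(uint32_t opcode, uint32_t body_dw)
{
    assert(body_dw >= 1 && body_dw <= 0x4000);
    return (PKT_TYPE3 << 30) |
           (((body_dw - 1) & 0x3FFFu) << 16) |
           ((opcode & 0xFFu) << 8);
}

/* Append one packet that writes `n` consecutive registers starting at the
 * DWORD offset `reg_dw`. Returns the number of DWORDs written to `cmd`. */
size_t emit_set_sh_regs(uint32_t *cmd, uint32_t reg_dw,
                        const uint32_t *vals, uint32_t n)
{
    size_t i = 0;
    cmd[i++] = pkt3_header(PKT3_SET_SH_REG, n + 1); /* offset DWORD + n values */
    cmd[i++] = reg_dw - SH_REG_BASE_DW;             /* offset relative to the SH base */
    for (uint32_t k = 0; k < n; k++)
        cmd[i++] = vals[k];
    return i;
}
```

Writing two consecutive registers this way takes 4 DWORDs: the header, the relative register offset, and the two values. This sketch leaves out everything else a real command stream needs, such as the dispatch packet and submission to the KMD.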
All of those registers are set using the packets, and here is how to encode them:

{% mark %} It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive.{% /mark %}

Then we append the dispatch command:

Now we want to write those commands into our buffer and send them to the KMD:

{% mark %} Here is a good point to make a more complex shader that outputs something. For example, writing 1 to a buffer. {% /mark %}

No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU, so what’s next? Let’s try to hang the GPU by pausing the execution, aka make the GPU trap.

The RDNA3 ISA manual mentions 2 registers, ; here’s how it describes them, respectively:

“Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN.”

“Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.”

{%mark%} You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual. {%/mark%}

We know from Marcell Kiss’s blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when encountering a . The TBA register has a special bit that indicates whether the trap handler is enabled. Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can utilize the debugfs interface. Luckily, we have UMR, which uses that debugfs interface, and it’s open source, so we can copy AMD’s homework here, which is great.

The amdgpu KMD has a couple of files in debugfs under ; one of them is , which is an interface to a in the kernel that writes to the registers.
It works by simply opening the file, seeking to the register’s offset, and then writing; it also performs some synchronisation and writes the value correctly. We need to provide more parameters about the register before writing to the file, tho, and we do that with an ioctl call. Here are the ioctl arguments:

The 2 structs are there because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation. Turns out our registers here are SRBM registers and are banked by VMID, meaning each VMID has its own TBA and TMA registers.

Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate a virtual memory address. The context is created when we open the DRM file. VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch.

We can obtain the VMID of the dispatched process by querying the register with s_getreg_b32. I do a hack here by enabling the trap handler in every VMID. There are 16 of them, the first being special and used by the KMD, and the last 8 allocated to the amdkfd driver. We loop over the remaining VMIDs and write to those registers. This can cause issues for other processes using other VMIDs, since it enables trap handlers in them and writes the virtual address of our trap handler, which is only valid within our virtual address space. It’s relatively safe tho, since most other processes won’t cause a trap[^1].

[^1]: Other processes need to have an s_trap instruction or have trap-on-exception flags set, which is not true for most normal GPU processes.

Now we can write to TMA and TBA. Here’s the code:

And here’s how we write to and :

{%mark%} If you noticed, I’m using bitfields.
I use them because working with them is much easier than macros, and while the bit-field layout is not guaranteed by the C spec, it’s guaranteed by the System V ABI, which Linux adheres to. {%/mark%}

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, provided we added instruction to it or enabled the bit in rsrc3[^2] register.

[^2]: Available since RDNA3, if I’m not mistaken.

Now, let’s try to write a trap handler.

{%mark%} If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is a nice way to make sure it’s actually being run. {%/mark%}

We need 2 things: our trap handler and some scratch memory to use when needed, whose address we will store in the TMA register.

The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter a trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the stable registers, the program counter, etc. The problem, tho, is that we don’t have a stable ABI for GPUs, or at least not one I’m aware of, and compilers use all the registers they can, so we need to save everything.

AMD GPUs’ Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is that they’re not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here, since I had no luck understanding how it works, plus some other reasons I’ll discuss later in the post.
The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, which then lets us use more instructions to copy the remaining registers. The main idea is to make use of the instruction, which adds the index of the current thread within the wave to the writing address, aka

$$ ID_{thread} \times 4 + address $$

This allows us to write a unique value per thread using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave.

[^3]: VGPRs are unique per thread, and SGPRs are unique per wave.

The problem is that if we have more than 1 wave, their writes will overlap, and we will have a race condition. Here is the code:

Now that we have those values in memory, we need to tell the CPU “hey, we got the data” and pause the GPU’s execution until the CPU issues a command. Also, notice that we can just modify those values from the CPU. Before we tell the CPU, we need to write some values that might help the CPU. Here they are:

Now the GPU should just wait for the CPU. Here’s the spin code; it’s implemented as described by Marcell Kiss here:

The main loop on the CPU goes like this: enable the trap handler, dispatch the shader, wait for the GPU to write a specific value to a specific address to signal that all the data is there, examine and display it, then tell the GPU “all clear, go ahead”.

Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing into the register, which lets us do whatever we want with the wave without causing any issues. If we halt for too long, tho, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module:

From here on, we can do whatever we want with the data we have. It’s all the data we need to build a proper debugger.
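The CPU side of that handshake can be sketched in a few lines of C. This is only an illustration under assumptions: the mailbox is taken to be a 32-bit word at the start of the CPU-mapped, uncached buffer, and the state values and the names `wait_for_trap` and `release_wave` are hypothetical, not taken from the actual trap handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox states shared between the trap handler and the CPU. */
enum {
    MAILBOX_IDLE       = 0, /* nothing to report yet */
    MAILBOX_GPU_READY  = 1, /* trap handler has written the saved registers */
    MAILBOX_CPU_RESUME = 2  /* CPU is done, the spinning handler may continue */
};

/* Poll the mailbox until the GPU signals that the data is there.
 * `volatile` forces a real memory read on every iteration.
 * Returns false if nothing arrived within `max_spins` polls. */
bool wait_for_trap(const volatile uint32_t *mailbox, uint64_t max_spins)
{
    for (uint64_t i = 0; i < max_spins; i++) {
        if (*mailbox == MAILBOX_GPU_READY)
            return true;
    }
    return false;
}

/* Tell the spinning trap handler that it can restore state and return. */
void release_wave(volatile uint32_t *mailbox)
{
    *mailbox = MAILBOX_CPU_RESUME;
}
```

The bounded spin count is a design choice for the sketch: it keeps the debugger from hanging forever if the shader never traps. Halting and resuming the wave through the register writes described in the text are separate steps that this sketch does not cover.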
We will come back to what to do with the data in a bit; let’s assume we did what was needed for now.

Now that we’re done on the CPU side, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the register again:

Then the GPU should continue. We need to restore everything and return the program counter to the original address. Depending on whether it’s a hardware trap or not, the program counter may point to the instruction before or to the instruction itself. The ISA manual and Marcell Kiss’s posts explain that well, so refer to them.

Now we can run compiled code directly, but we don’t want people to compile their code manually, extract the text section, and hand it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with.

My main plan was to fork RADV, then add , then make report the Vulkan calls for us, so we can have a better view of the GPU work and know the buffers/textures it’s using, etc. This seems like a lot more work tho, so I’ll keep it in mind, but I’m not doing it for now unless someone is willing to pay me for that ;).

For now, let’s just use RADV’s compiler. Luckily, RADV has a mode, aka it will not do actual work or open DRM files, it’s just a fake Vulkan device, which is perfect for our case here, since we care about nothing other than compiling code. We can enable it by setting the env var , then we just call what we need like this:

Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let’s see how we can make it an actual debugger.
We talked earlier about CPs natively supporting context-switching. This appears to be a compute-specific feature, which prevents us from implementing it for other types of shaders. It appears, tho, that mesh shaders and raytracing shaders are just compute shaders under the hood, which would allow us to use that functionality for them. For now, debugging one wave feels like enough; we can also modify the wave parameters to debug specific indices.

Here are some of the features:

For stepping, we can use 2 bits: one in and the other in . They’re and , respectively. The former enters the trap handler after each instruction, and the latter enters before the first instruction. This means we can automatically enable instruction-level stepping.

Regarding breakpoints, I haven’t implemented them, but they’re rather simple to implement here: since we have the base address of the code buffer and know the size of each instruction, we can calculate the program counter locations ahead of time, make a list of them available to the GPU, and do a binary search in the trap handler.

The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info.

[^4]: We can get that by subtracting the address of the code buffer from the current program counter.

We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether the address is within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these:

which suggests that the hardware already supports this natively, so we don’t even need to do that dance. It needs more investigation on my part, tho, since I didn’t implement this.
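To make the breakpoint idea above a bit more concrete, here is a minimal sketch of the lookup logic, written as plain C for readability. In the scheme described, the search would run inside the trap handler in GPU assembly; this version only models it. The function names and the choice to store breakpoints as byte offsets into the code buffer are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the current instruction inside the code buffer:
 * program counter minus the buffer's base address. */
uint64_t pc_to_offset(uint64_t pc, uint64_t code_base)
{
    return pc - code_base;
}

/* Binary search over breakpoint offsets sorted in ascending order. */
bool is_breakpoint(const uint64_t *sorted, size_t count, uint64_t offset)
{
    size_t lo = 0, hi = count; /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == offset)
            return true;
        if (sorted[mid] < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}
```

Storing offsets rather than absolute addresses keeps the list valid no matter where the code buffer ends up in the GPU virtual address space. The same offset is also what the source-mapping lookup needs.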
This needs some serious plumbing, since we need to make NIR (Mesa’s intermediate representation) optimisation passes propagate debug info correctly. I already started on this here. Then we need to make ACO track variables and store the information.

This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. Our custom driver could then pause before a specific frame, or get triggered by a key, and ask before each dispatch whether to attach to it or not, or something similar. Since we’d have a full, proper Vulkan implementation, we’d already have all the information we need, like buffers, textures, push constants, types, variable names, etc. That would be a much better and more pleasant debugger to use.

Finally, here’s some live footage:

::youtube{url="https://youtu.be/HDMC9GhaLyc"}

Here is an incomplete user-mode page walking code for gfx11, aka rx7900xtx