Latest Posts (3 found)

AMD GPU Debugger

I've always wondered why we don't have a GPU debugger similar to the ones we have for CPUs: a tool that lets you pause execution and examine the current state. This capability feels essential, especially since the GPU's concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD's ROCm environment. Unfortunately, its scope is limited to that environment; still, it shows this is technically possible. I then found a helpful series of blog posts by Marcell Kiss detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can figure out how to do it ourselves. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple `open("/dev/dri/cardX")`. Tracing RADV, we find that it then calls `amdgpu_device_initialize`, a function defined in libdrm, the library that acts as middleware between user-mode drivers (UMDs) like RADV and RadeonSI and kernel-mode drivers (KMDs) like the amdgpu driver. When we try to do some actual work, we have to create a context, which can be achieved by calling `amdgpu_cs_ctx_create`, again from libdrm. Next up, we need to allocate 2 buffers: one for our code and the other for writing our commands into. We do this by calling a couple of libdrm functions (`amdgpu_bo_alloc` and friends), choosing the memory domain and assigning flags based on the params; some buffers we will need uncached, as we will see.

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped, for ease of use. We have to map the memory into both the GPU and the CPU virtual address space. The KMD creates the page table when we open the DRM file, as shown here. So map it into the GPU VM and, if possible, into the CPU VM as well. There's a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying the uncached memory-type flag, it doesn't always tag the page as uncached; I'm not quite sure if it's a bug in my code or something in libdrm. Anyway, I opted to do it manually and issue the GEM_VA ioctl myself.

Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing. Let's compile our code first. We can use the clang assembler for that: the build script compiles the code, and since we're only interested in the actual machine code, we use objdump to figure out the offset and the size of the text section and copy it to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer.

Next up, filling in the commands. This was extremely confusing for me because it's not well documented; it was mostly learning how RADV does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called PM4, which has multiple packet types; we only care about type 3: each packet has an opcode and the number of dwords it contains. The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are the COMPUTE_PGM_RSRC registers, which are responsible for a number of configurations; COMPUTE_PGM_[LO/HI], which hold the pointer to the code buffer; and COMPUTE_NUM_THREAD_[X/Y/Z], which are responsible for the number of threads inside a workgroup. All of those are set using SET_SH_REG packets. {% mark %} It's worth mentioning that we can set multiple registers in 1 packet if they're consecutive. {% /mark %} Then we append the dispatch command.
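Here's a minimal sketch of that encoding in C. The opcodes and register byte offsets below are from my reading of Mesa's sid.h, so treat the exact constants as assumptions to verify for your GPU generation:

```c
#include <stdint.h>

/* PM4 type-3 packet header: [31:30]=3, [29:16]=count (payload dwords - 1),
 * [15:8]=opcode. Constants below are from Mesa's sid.h (GFX9+). */
#define PKT3(op, count) ((3u << 30) | (((uint32_t)(count) & 0x3FFF) << 16) | (((op) & 0xFF) << 8))
#define PKT3_SET_SH_REG        0x76
#define PKT3_DISPATCH_DIRECT   0x15
#define SI_SH_REG_OFFSET       0x0000B000 /* SH register space starts here   */
#define R_COMPUTE_NUM_THREAD_X 0x0000B81C /* _Y and _Z follow consecutively  */
#define R_COMPUTE_PGM_LO       0x0000B830 /* _HI is the next register        */

/* Emit one SET_SH_REG packet that sets n consecutive registers. */
static uint32_t *set_sh_reg(uint32_t *cs, uint32_t reg, uint32_t n, const uint32_t *vals)
{
    *cs++ = PKT3(PKT3_SET_SH_REG, n);      /* payload = 1 offset dword + n values */
    *cs++ = (reg - SI_SH_REG_OFFSET) >> 2; /* dword offset into SH space          */
    for (uint32_t i = 0; i < n; i++)
        *cs++ = vals[i];
    return cs;
}

static uint32_t *emit_dispatch(uint32_t *cs, uint64_t code_va, uint32_t wg_size)
{
    const uint32_t threads[3] = { wg_size, 1, 1 };
    cs = set_sh_reg(cs, R_COMPUTE_NUM_THREAD_X, 3, threads); /* X, Y, Z in one packet */

    /* PGM_LO/HI take the code address shifted right by 8 (256-byte aligned).
     * The RSRC1/RSRC2 config registers would be set the same way. */
    const uint32_t pgm[2] = { (uint32_t)(code_va >> 8), (uint32_t)(code_va >> 40) };
    cs = set_sh_reg(cs, R_COMPUTE_PGM_LO, 2, pgm);

    *cs++ = PKT3(PKT3_DISPATCH_DIRECT, 3);
    *cs++ = 1; *cs++ = 1; *cs++ = 1; /* workgroup counts x, y, z            */
    *cs++ = 1;                       /* DISPATCH_INITIATOR: COMPUTE_SHADER_EN */
    return cs;
}
```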
Now we want to write those commands into our buffer and send them to the KMD. {% mark %} Here is a good point to make a more complex shader that outputs something, for example writing 1 to a buffer. {% /mark %} No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU; what's next? Let's try to hang the GPU by pausing the execution, aka make the GPU trap.

The RDNA3 ISA manual does mention 2 registers, TBA and TMA; here's how it describes them, respectively:

> Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit [62] is replicated into address bit [63]). Accessed via S_SENDMSG_RTN.

> Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.

{% mark %} You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual. {% /mark %}

We know from Marcell Kiss's blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when encountering a trap. The TBA register has a special bit that indicates whether the trap handler is enabled. Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can use the debugfs interface. Luckily, we have UMR, which uses that debugfs interface, and it's open source, so we copy AMD's homework here, which is great. The amdgpu KMD has a couple of files in debugfs under the card's directory in /sys/kernel/debug/dri; one of them is amdgpu_regs2, which is an interface to a function in the kernel that writes to the registers. It works by simply opening the file, seeking to the register's offset, and then writing; it also performs some synchronisation so the value is written correctly. We do need to provide more parameters about the register before writing to the file, though, and we do that with an ioctl call. The ioctl arguments contain 2 structs because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them in the Linux kernel documentation. It turns out our registers here are SRBM registers, banked by VMID, meaning each VMID has its own TBA and TMA registers.

Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate virtual memory addresses. The context is created when we open the DRM file. VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch. We could obtain the VMID of the dispatched process by querying the hardware ID register with s_getreg_b32, but I do a hack here instead: enabling the trap handler in every VMID. There are 16 of them; the first is special and used by the KMD, and the last 8 are allocated to the amdkfd driver. We loop over the remaining VMIDs and write to their registers. This can cause issues for other processes using those VMIDs, since we enable trap handlers in them and write the virtual address of our trap handler, which is only valid within our virtual address space. It's relatively safe, though, since most other processes won't cause a trap[^1].

[^1]: Other processes need to have an s_trap instruction or have trap-on-exception flags set, which is not true for most normal GPU processes.

Now we can write to TMA and TBA.
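Here's roughly how, as a C sketch. I'm paraphrasing the iocdata struct and the ioctl from the kernel's amdgpu_umr.h from memory, and the TBA register offset is a placeholder, so check both against your kernel tree and the register database:

```c
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Paraphrased from drivers/gpu/drm/amd/amdgpu/amdgpu_umr.h -- verify the
 * field order and the ioctl number against your kernel before using. */
struct regs2_iocdata {
    uint32_t use_srbm, use_grbm, pg_lock;
    struct { uint32_t se, sh, instance; } grbm;
    struct { uint32_t me, pipe, queue, vmid; } srbm;
};
#define REGS2_IOC_SET_STATE _IOWR(0x20, 0, struct regs2_iocdata) /* assumed */

/* PLACEHOLDER byte offset: look up SQ_SHADER_TBA_LO for your chip. */
#define REG_SQ_SHADER_TBA_LO 0x0

static int write_banked_reg(int fd, uint32_t vmid, uint64_t reg, uint32_t value)
{
    /* Select the SRBM bank (here: which VMID) before touching the register. */
    struct regs2_iocdata io = { .use_srbm = 1, .srbm = { .vmid = vmid } };
    if (ioctl(fd, REGS2_IOC_SET_STATE, &io) < 0)
        return -1;
    /* The file offset selects the register; the write stores the value. */
    return pwrite(fd, &value, sizeof(value), reg) == sizeof(value) ? 0 : -1;
}

int main(void)
{
    int fd = open("/sys/kernel/debug/dri/0/amdgpu_regs2", O_RDWR);
    for (uint32_t vmid = 1; vmid <= 7; vmid++) /* skip VMID 0 and the KFD ones */
        write_banked_reg(fd, vmid, REG_SQ_SHADER_TBA_LO, 0 /* tba_lo value */);
    close(fd);
    return 0;
}
```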
{% mark %} If you noticed, I'm using bitfields. I use them because working with them is much easier than macros, and while the bit-field order is not guaranteed by the C spec, it is guaranteed by the System V ABI, which Linux adheres to. {% /mark %}

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, provided we added an s_trap instruction to it or enabled the trap-on-start bit in the rsrc3[^2] register.

[^2]: Available since RDNA3, if I'm not mistaken.

Now, let's try to write a trap handler. {% mark %} If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is nice for making sure it's actually being run. {% /mark %} We need 2 things: our trap handler and some scratch memory to use when needed, whose address we will store in the TMA register. The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter a trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching: save a copy of the registers, the program counter, etc. The problem, though, is that we don't have a stable ABI for GPUs, or at least not one I'm aware of, and compilers use all the registers they can, so we need to save everything. AMD GPUs' Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is they're not documented, and we'd have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I did a workaround here, since I had no luck understanding how it works, plus some other reasons I'll discuss later in the post.

The workaround is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, which frees them up so we can use more instructions to copy the remaining registers. The main idea is to make use of a store instruction that adds the index of the current thread within the wave to the write address, aka

$$ ID_{thread} \times 4 + address $$

This lets us write a unique value per thread while using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave.

[^3]: VGPRs are unique per thread, and SGPRs are unique per wave.

The problem is that if we have more than 1 wave, their writes will overlap and we'll have a race condition.

Now that we have those values in memory, we need to tell the CPU: hey, we got the data — then pause the GPU's execution until the CPU issues a command. Also, notice that the CPU can just modify those values in memory directly. Before we tell the CPU, we need to write some values that help it make sense of the dump.
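Something like this hypothetical layout (the names, field order, and exactly what gets saved are all my illustration, not a documented format — the real contract is just whatever your trap handler and your CPU loop agree on):

```c
#include <stdint.h>

/* Hypothetical layout of the TMA-pointed scratch buffer: the trap handler
 * fills it, the CPU-side loop reads it. Purely illustrative. */
struct wave_dump {
    uint32_t gpu_done;       /* GPU sets this last: "data is ready"        */
    uint32_t cpu_ack;        /* CPU sets this to let the trap handler exit */
    uint64_t pc;             /* trapped program counter (from TTMP0/1)     */
    uint64_t exec;           /* EXEC mask at trap time                     */
    uint32_t status;         /* wave STATUS register                       */
    uint32_t trap_id;        /* which trap/exception got us here           */
    uint32_t sgprs[106];     /* saved scalar registers                     */
    uint32_t vgprs[256][32]; /* saved vector registers, per lane (wave32)  */
};
```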
Now the GPU should just wait for the CPU; the spin loop is implemented as described by Marcell Kiss here. The main loop on the CPU side is: enable the trap handler, dispatch the shader, wait for the GPU to write a specific value at a specific address to signal that all the data is there, then examine and display it, and tell the GPU all clear, go ahead. Since our uncached buffers are now in play, we just keep looping and checking whether the GPU has written the register values. When it has, the first thing we do is halt the wave by writing into the SQ_CMD register, which lets us do whatever we want with the wave without causing any issues. Though if we halt for too long, the GPU CP will reset the command queue and kill the process; we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module. From here on, we can do whatever we want with the data we have — all the data we need to build a proper debugger. We will come back to what to do with the data in a bit; let's assume we did what was needed for now.

Now that we're done on the CPU, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave by writing to the SQ_CMD register again, and the trap handler should pick it up. Then the GPU should continue. We need to restore everything and return the program counter to the original address; based on whether it's a hardware trap or not, the program counter may point to the instruction before or to the instruction itself. The ISA manual and Marcell Kiss's posts explain that well, so refer to them.

Now we can run compiled code directly, but we don't want people to compile their code manually, extract the text section, and hand it to us. The plan is to take SPIR-V code, compile it correctly, then run it — or, even better, integrate with RADV and let RADV give us more information to work with. My main plan was to fork RADV and make it report the Vulkan calls to us, so we'd have a better view of the GPU work and know the buffers/textures it's using, etc. That seems like a lot more work, though, so I'll keep it in mind but won't do it for now, unless someone is willing to pay me for it ;).

For now, let's just use RADV's compiler. Luckily, RADV has a null-device mode, aka it will not do actual work or open DRM files, just present a fake Vulkan device, which is perfect for our case, since we care about nothing other than compiling code. We can enable it by setting the RADV_FORCE_FAMILY env var, and then we just call what we need from the compiler. Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent.

Let's see how we can make it an actual debugger. We talked earlier about CPs natively supporting context switching; this appears to be a compute-specific feature, which prevents implementing it for other types of shaders. Though it appears that mesh shaders and raytracing shaders are just compute shaders under the hood, which would let us use that functionality for them too. For now, debugging one wave feels like enough, and we can modify the wave parameters to debug some specific indices. Here are some of the features:

For stepping, we can use 2 bits, one in the MODE register and the other in rsrc3: DEBUG_EN and TRAP_ON_START, respectively. The former enters the trap handler after each instruction, and the latter enters it before the first instruction. This means we can automatically enable instruction-level stepping.
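Putting the CPU side described above together — trap handler on, dispatch, poll, halt, inspect, resume — here's a sketch; the helper functions are hypothetical stand-ins for the register writes and submissions covered earlier, and `wave_dump` is the layout sketched above (trimmed here):

```c
#include <stdint.h>
#include <stdio.h>

struct wave_dump { uint32_t gpu_done, cpu_ack; uint64_t pc; /* ...regs... */ };

/* Hypothetical stand-ins for what's described in the post: */
static void enable_trap_handler(void) { /* TBA/TMA writes via debugfs       */ }
static void submit_dispatch(void)     { /* PM4 command buffer -> KMD submit */ }
static void sq_cmd_halt(int halt)     { (void)halt; /* SQ_CMD register write */ }

void debug_loop(volatile struct wave_dump *dump)
{
    enable_trap_handler();
    submit_dispatch();

    /* The dump buffer is mapped uncached, so plain polling sees GPU writes. */
    while (!dump->gpu_done)
        ;

    sq_cmd_halt(1); /* park the wave while we poke around */
    printf("trapped at pc=0x%llx\n", (unsigned long long)dump->pc);
    /* ...examine or modify registers, read memory, etc... */

    dump->gpu_done = 0;
    dump->cpu_ack  = 1; /* first byte of the TMA buffer: all clear     */
    sq_cmd_halt(0);     /* un-halt; the trap handler restores + returns */
}
```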
Regarding breakpoints, I haven't implemented them, but they're rather simple to implement here: we have the base address of the code buffer and we know the size of each instruction, so we can calculate the program counter values ahead of time, keep a sorted list of them available to the GPU, and do a binary search in the trap handler.

The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info.

[^4]: We can get that by subtracting the address of the code buffer from the current program counter.

Watchpoints we can implement by marking the GPU page as protected: on a GPU fault, we enter the trap handler, check whether the faulting address is within the range of one of our buffers or textures, and act accordingly. Also, looking through the register database, there are watchpoint-related registers, which suggests the hardware already supports this natively, so we might not even need to do that dance. It needs more investigation on my part, though, since I didn't implement this.

Variable inspection needs some serious plumbing, since we need to make NIR (Mesa's intermediate representation) optimisation passes propagate debug info correctly; I already started on this here. Then we need to make ACO track variables and store that information. This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. Then our custom driver could pause before a specific frame, or get triggered by a key, and ask before each dispatch whether to attach to it. Since we'd have a full, proper Vulkan implementation, we'd already have all the information we need — buffers, textures, push constants, types, variable names, etc. — which would make for a much better and more pleasant debugger.

Finally, here's some live footage:

::youtube{url="https://youtu.be/HDMC9GhaLyc"}

Here is an incomplete user-mode page walking code for gfx11, aka the RX 7900 XTX.
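One last sketch, since breakpoints stayed on paper above: the host side would precompute the sorted list of breakpoint PCs (shared with the GPU in a buffer), and the trap handler would run the same binary search on every single-step trap. Hypothetical code:

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Binary search over the sorted breakpoint PC list. */
int is_breakpoint(uint64_t pc, const uint64_t *bps, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bps[mid] == pc) return 1;
        if (bps[mid] < pc) lo = mid + 1; else hi = mid;
    }
    return 0;
}

void add_breakpoint(uint64_t code_va, uint64_t inst_offset,
                    uint64_t *bps, size_t *n)
{
    bps[(*n)++] = code_va + inst_offset; /* PC = code buffer base + offset */
    qsort(bps, *n, sizeof(*bps), cmp_u64);
}
```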


Vulkan Foliage rendering using GPU Instancing

I was watching Acerola's video on foliage rendering, and I liked the idea of rendering millions of grass blades; it was a good opportunity to play around with GPU instancing and indirect draw. The features and extensions used in this approach are:

VK_KHR_buffer_device_address: this extension allows using pointers in GLSL and passing them in push constants or buffers; this, along with others, makes dealing with buffers and textures so much easier and nicer IMHO.

multiDrawIndirect: this feature allows multiple draw calls in one indirect buffer; we make use of it to draw multiple LODs of the grass.

Basically, there's a compute pass that generates info about the grass blades, stores it in a buffer, does frustum culling and LOD selection, and fills the indirect commands; then a graphics pass draws the blades.

Each grass blade is defined by a handful of parameters — position, height and width multipliers, stiffness, a sin term, and a rotation angle — and the compute pass generates all of this data (see the sketch after this section). It starts by generating random positions inside a rectangular area defined by a center, width, and height; then it generates UV coords for each grass blade from its normalized position inside the rectangle. After that, using the UV coords, we can sample a simplex noise texture to get a height or width multiplier; we can also add some terms for user control. Using simplex noise for height makes sense, as tall grass tends to stick together in real life.

Stiffness is a term defining how bendable a grass blade is, which will be used later as a multiplier in the vertex shader to do the animation.

The sin term is used by the animation formula to animate the grass. It's the same for all of a blade's vertices, so calculating it here saves time and lets us avoid passing UVs to the vertex shader — that's 2 floats, but for millions of blades it would be hundreds of megabytes. In the vertex shader this value is used as a parameter to the sin function, which results in a kind of wind-like wave; you can see for yourself here in Shadertoy. It's called the sin term because I pass it to the sin function later in the vertex shader.

The rotation angle defines the angle of rotation around the UP axis; the grass always points upwards in our case, so all we need is an angle to construct a rotation matrix in the vertex shader. You can make it random, always face the camera, or have it controlled by the user — whatever suits your needs.

For the frustum culling, we start by generating a sphere around each blade; the radius is determined by the max of height and width. Then we transform the sphere's center to camera space; at this point, the distance from the camera is just the length of the point, since in camera space the camera sits at the origin. We use this to apply the cutoff distance and choose the LOD. Then, for the frustum culling itself, we kinda do the projection by hand and check whether the sphere is in range. I only cull on the x and y axes, as for the z axis we already have a cutoff, but adding it is also trivial. I learned this way of culling from Arseny Kapoulkine's stream, where he explains it much better. Basically, we extract the left or right plane (we need just one of them) and the top or bottom plane; then, on the GPU, we calculate the dot product between the sphere center and the planes, taking the abs of the x component of the sphere's center so we cull both sides at the same time, exploiting the frustum's symmetry.
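Here's that generation-plus-culling logic as a C sketch. The real thing is a GLSL compute shader; the names, the rand/noise stand-ins, and the constants are mine, but the culling follows the symmetric-frustum trick described above:

```c
#include <math.h>
#include <stdlib.h>

typedef struct { float x, y, z; } vec3;

static float rand01(void) { return (float)rand() / (float)RAND_MAX; }

typedef struct {
    vec3  position;
    float height_mul, width_mul;
    float stiffness, sin_term, angle;
} Blade;

/* Generate one blade inside a rectangle centered at (cx, cz). */
Blade make_blade(float cx, float cz, float width, float height)
{
    Blade b;
    float u = rand01(), v = rand01();
    b.position = (vec3){ cx + (u - 0.5f) * width, 0.0f, cz + (v - 0.5f) * height };
    /* (u, v) double as UVs into the simplex noise texture; stand-in here: */
    float noise = rand01();
    b.height_mul = 0.5f + noise;          /* tall grass clusters together */
    b.width_mul  = 0.8f + 0.4f * rand01();
    b.stiffness  = rand01();
    b.sin_term   = (u + v) * 6.2831853f;  /* phase reused by every vertex */
    b.angle      = rand01() * 6.2831853f; /* rotation around UP           */
    return b;
}

/* View-space sphere test against a symmetric frustum, x and y only.
 * Camera looks down -z; half_x/half_y are horizontal/vertical half-FOVs. */
int sphere_visible(vec3 c, float r, float half_x, float half_y)
{
    /* abs() folds the left/right (top/bottom) tests into one, by symmetry. */
    if (cosf(half_x) * fabsf(c.x) + sinf(half_x) * c.z > r) return 0;
    if (cosf(half_y) * fabsf(c.y) + sinf(half_y) * c.z > r) return 0;
    return 1;
}
```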
After that, we fill in the blade info at the respective index in the blades data buffer. After generating the data, we fill the command buffer: each thread atomically increments the instance count in the commands buffer, which lets us use the returned count as an index into another buffer where we store the indices of the visible blades, so the vertex shader can look them up. We could also copy the buffer and compact it with a prefix-sum scan just like Acerola does in his video, but this is better memory-wise and probably performance-wise too, though I didn't measure it.

In summary, the vertex shader will use gl_InstanceIndex to index into a buffer that contains the indices of the visible grass blades. The compute shader makes use of 2 indirect draw commands, one for the high LOD and one for the low LOD (we can add as many LOD levels as we want); we check whether a blade is low or high LOD and increment the count in the respective command. After that, we use the returned count to index into the visible-blade-indices buffer and store the index of the current grass blade. Now we have our index data in a contiguous buffer, indexed with gl_InstanceIndex, with gl_DrawID determining which indices buffer to read from. We are ready to draw the grass blades.

In the vertex shader, we start by pulling the blade data for the current instance. After that, we construct a rotation matrix and apply the height and width multipliers. Then we use the sin term to animate the grass blade with a sin function we scroll with time, scaled by the height of the vertex, because naturally the tip of a grass blade skews more than the base. Here is how it looks:

::youtube{url="https://www.youtube.com/embed/mDakjkrvH-0"}

For the color, I opted for a simple gradient that gets brighter as it goes higher. I plan to improve this, for example using normals for the grass and adding some specular lighting, as it could look really nice — for example, like Ghost of Tsushima's grass.

The simplest optimisation I thought of was reducing the amount of work the vertex shader does, since it runs millions of times. A low-hanging fruit was multiplying the projection and view matrices on the CPU and having the result ready for the vertex shader. The next thing was optimising the grass blade mesh. I used Mesh Optimizer by Arseny Kapoulkine and did multiple optimisations, and the one that had the most impact was converting the grass blade from a triangle list to a triangle strip: that drastically reduced the number of vertex shader invocations, almost cutting the vertex shader work in half, and the shape of a grass blade is represented nicely as a strip.

[^source]: I got the numbers using Vulkan's timestamp queries.

On my RX 5600 XT using the open-source drivers (RADV) on Linux at 1080p resolution, my GPU can process about 6'770'688 grass blades with all of them visible, with the compute shader taking about … and drawing taking around …; much more than that and it drops below 60 fps.[^source] Increasing the area covered by the grass to 1000 by 1000, we can consider up to 19'066'880 grass blades, with the compute shader taking about … and drawing them taking about ….
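Circling back to the compaction step: here's the pattern in C11-atomics form, a sketch of what the GLSL compute shader does with atomicAdd on the indirect command's instanceCount (the struct mirrors VkDrawIndirectCommand; lod_stride is my stand-in for however the per-LOD regions are laid out):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mirrors VkDrawIndirectCommand; instanceCount is bumped atomically. */
typedef struct {
    uint32_t    vertexCount;
    atomic_uint instanceCount;
    uint32_t    firstVertex;
    uint32_t    firstInstance;
} DrawCommand;

/* One command per LOD; visible[] has one region per LOD too.
 * In GLSL this runs once per visible blade, using atomicAdd(). */
void emit_visible(DrawCommand *cmds, uint32_t *visible, uint32_t lod_stride,
                  uint32_t blade_index, int lod)
{
    /* The atomic add returns the previous count == our slot in the list. */
    uint32_t slot = atomic_fetch_add(&cmds[lod].instanceCount, 1);
    visible[lod * lod_stride + slot] = blade_index;
}
```

At draw time, vkCmdDrawIndirect consumes the commands with drawCount equal to the number of LODs, and the vertex shader reads visible[gl_DrawID * lod_stride + gl_InstanceIndex].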


How The WebSocket Server Works

This post is largely inspired by my project zig-ws. WebSocket is an interesting protocol and relatively easy to implement, so let's see how it works on the server side. It's a protocol that provides a real-time, 2-way connection over a persistent TCP connection; it does that by using an HTTP handshake and then reusing the TCP connection from the handshake to send frames. Frames are packages of data with a header that carries the info the protocol needs to operate.

The handshake is a normal GET request from a client with some requirements specified in the spec. I won't get into all of them because they're more client-related, but you can always read the spec (and you should). The server parses that request, performs some operations, and returns a response based on info in the request. One of the requirements for the handshake request is to set the Sec-WebSocket-Key header to the base64 encoding of a random 16-byte value:

> The request MUST include a header field with the name |Sec-WebSocket-Key|. The value of this header field MUST be a nonce consisting of a randomly selected 16-byte value that has been base64-encoded. The nonce MUST be selected randomly for each connection.

The server concatenates the header value with the magic string "258EAFA5-E914-47DA-95CA-C5AB0DC85B11", SHA-1s the result, base64-encodes the hash, and sets the Sec-WebSocket-Accept header on the response to the encoded value. There's more to the handshake, but it's simple and you can always read the spec. After that, we can read directly from the TCP socket used for the handshake. {% mark %} The spec is your best friend when implementing any kind of protocol. {% /mark %}

The TCP protocol itself doesn't have a concept of framing (messages): when you write N bytes to the network stream, the other side might read them in any number of read calls, with no way to tell whether what arrived is the end of a message, one message, or several. The frame header provides the info needed to read whole messages off a network stream. {% sub %} The framing diagram in the spec describes how this works. {% /sub %}

The server starts by reading 2 bytes (the minimum WebSocket frame size). The data in those 2 bytes is:

First byte:
- bit 0 is the FIN bit, used to represent whether this is the final message fragment (we'll return to this later).
- bits 1 to 3 are reserved for future use.
- bits 4 to 7 (the last 4 bits) are the opcode of this message.

Second byte:
- bit 0 is the MASK bit, used to indicate whether the data (message payload) is masked (we'll return to this later).
- bits 1 to 7 are used for the message size (a u8 value), with the values 126 and 127 being special.

The last 7 bits of the 2-byte header contain the size: if the payload length is 125 or less, that's the length; if it's bigger than that and fits in 2 bytes (u16), the length field is set to 126; and if it's longer than that and fits in 8 bytes (u64), the length field must be set to 127. To summarize: the server reads 2 bytes and looks at the last 7 bits; if the value is 125 or less, that's the length; if it's 126, it reads the next 2 bytes after the header, and that's the payload length; otherwise (127), it reads the next 8 bytes, and that's the length. {% mark %} The bytes are in network byte order (big endian); you need to flip them if you're on a little-endian machine (you probably are). {% /mark %} {% mark %} If the data is masked, you need to read the masking key first (4 bytes). {% /mark %}

The spec requires all clients to mask the data (message payload) — to see why, consult the spec — so the server knows it will always get masked data, but you should support unmasked data too, just in case. To unmask the data, you first read the masking key, if the mask bit in the header is set (has the value 1): 4 bytes, each byte representing a number (u8). Then you XOR each payload byte with the key byte at index i mod 4. {% mark %} The mask key (4 bytes) comes before the extended length value. {% /mark %}
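Here's that read path as a C sketch (error handling trimmed; `read_exact` is a small helper I'm assuming, standing in for looped reads on the socket):

```c
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/* Read exactly n bytes; TCP may hand them to us in pieces. */
static int read_exact(int fd, uint8_t *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r <= 0) return -1;
        got += (size_t)r;
    }
    return 0;
}

/* Parse one frame header and unmask the payload into buf (caller-sized). */
int read_frame(int fd, uint8_t *buf, size_t cap,
               uint64_t *out_len, uint8_t *opcode, int *fin)
{
    uint8_t h[2], ext[8], key[4];
    if (read_exact(fd, h, 2)) return -1;
    *fin    = h[0] >> 7;        /* FIN bit                   */
    *opcode = h[0] & 0x0F;      /* opcode, low 4 bits        */
    int masked   = h[1] >> 7;   /* MASK bit                  */
    uint64_t len = h[1] & 0x7F; /* 7-bit length              */

    if (len == 126) {           /* u16 length, network order */
        if (read_exact(fd, ext, 2)) return -1;
        len = ((uint64_t)ext[0] << 8) | ext[1];
    } else if (len == 127) {    /* u64 length, network order */
        if (read_exact(fd, ext, 8)) return -1;
        len = 0;
        for (int i = 0; i < 8; i++) len = (len << 8) | ext[i];
    }
    if (len > cap) return -1;

    if (masked && read_exact(fd, key, 4)) return -1; /* key follows length */
    if (read_exact(fd, buf, (size_t)len)) return -1;
    if (masked)
        for (uint64_t i = 0; i < len; i++)
            buf[i] ^= key[i % 4]; /* RFC 6455 unmasking */

    *out_len = len;
    return 0;
}
```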
Fragmentation allows the client to send a message in fragments, which can be useful when the message size is unknown at send time — for example, when the client's message size depends on something outside of its control, so it sends a fragment every time it has, say, 8 bytes, until it's done. The way fragmentation works: in the first fragment, the client must unset the FIN bit and set the opcode to the opcode of the whole message as it will be when assembled; in each following fragment, the client does the same but sets the opcode to 0 (continuation); in the last fragment, the FIN bit must be set (have the value 1) and the opcode is 0. {% mark %} Control frames can be received in the middle of a fragmented message. {% /mark %}

Control frames have special purposes; for example, the Ping and Pong frames are used to check whether the other side is alive (redundant, IMHO). The one we care about most is the close frame, opcode 8, which indicates the end of the connection. It can have a payload (the reason for closing) or nothing; if it has a body, the first 2 bytes represent a status code (u16) and the rest is just a message. The spec lists the status codes.

The WebSocket protocol is a nice protocol to implement, with the spec being very clear and easy to follow. This blog post is a simple overview of what the server in a WebSocket connection does; if you want to implement it yourself, you should read the spec, and you can always check my Zig implementation here.
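As a final snippet, circling back to the handshake: the accept-key computation is small enough to show in full. Here's a sketch using OpenSSL's SHA1 and EVP_EncodeBlock (any SHA-1 and base64 implementation will do):

```c
#include <stdio.h>
#include <openssl/sha.h>
#include <openssl/evp.h>

/* Compute Sec-WebSocket-Accept from the client's Sec-WebSocket-Key. */
void accept_key(const char *client_key, char out[29])
{
    static const char *magic = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";
    unsigned char cat[128], digest[SHA_DIGEST_LENGTH];

    size_t n = snprintf((char *)cat, sizeof(cat), "%s%s", client_key, magic);
    SHA1(cat, n, digest);                  /* SHA-1 of key + magic string */
    EVP_EncodeBlock((unsigned char *)out,  /* base64 of the 20-byte hash  */
                    digest, SHA_DIGEST_LENGTH);
}

int main(void)
{
    char out[29]; /* base64 of 20 bytes = 28 chars + NUL */
    accept_key("dGhlIHNhbXBsZSBub25jZQ==", out); /* example key from the RFC */
    printf("Sec-WebSocket-Accept: %s\n", out);   /* s3pPLMBiTxaQ9kYGzzhZRbK+xOo= */
    return 0;
}
```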
