Posts in Matlab (2 found)

Emulating old junk from yesteryear – or my obsession making native resolution PS2 emulation look good

Lately I’ve been on a kick to tackle the latter part of PS2 graphics emulation that never seems to come up, analog TV emulation. 99% of the PS2 emulation space is all about upscaling to the max, cranking out 4K with polygons so sharp they can cleave mountains in half, but that’s not particularly to my taste, which is why paraLLEl-GS focuses more on the super-sampling aspects rather than raw upscaling. Earlier blog post on the paraLLEl-GS here if you have no idea what I’m referring to. For example, a raw native render with progressive scan (640×448) with paraLLEl-GS backend: With Vulkan HW renderer in PCSX2, a basic 2x upscale would look like: which is obviously higher resolution, but does not attempt to resolve aliasing either, and as upscaling factors go up, the mismatch between texture resolution, polygon counts and output resolution create a jarring effect for me. The approach I tend to favor is super-sampling and keeping the resolution native, even if it means a more blurred resolve. Here’s how it’d look with 16x SSAA. This is quite overkill for most cases, but why not. There is a special mode in paraLLEl-GS that can scanout the 16x super-samples at double the resolution, which is effectively 4x SSAA at double the resolution. It can look great when it works well, but the key word is when . This approach is not playing to the strengths of paraLLEl-GS at all, but it’s a thing. The main problem is that 2D images like HUD elements remain at native resolution with integer scale, which can create a jarring look, and post processing passes can mess up things in some cases. My main focus is on the native resolution, super sampled output, since it has very few gotchas compared to other upscaling styles. For a while, I’ve relied on FSR1 to do the upscaling, which is just a temporary hack. While properly anti-aliased 3D can look surprisingly good with FSR1 when blown up to large resolutions (what it is designed for), 2D game elements still create questionable artifacts. Text upscaling starts looking like those old HQ2x, SuperEagle and xBR filters that used to be popular with SNES emulators. That’s likely because FSR1 is “just” edge-aware Lanczos filtering at its core. Still, at SD resolutions there’s only so much you can do to blow it up to a modern display. The result here is of course quite blurry, but looking at it from a reasonable distance, it can look sort-of okay on a good day. The only proper way to display SD content is in my opinion to use a CRT, or in lieu of getting a nerd tan, CRT shaders. Simulating CRTs has been done to death to the point Hel is blushing, so I feel a bit uncomfortable even trying to write about it, but this post wouldn’t be complete without it. RetroArch has like 23428323 CRT shader presets already, and there’s nothing novel about any of this. However, there are some considerations for PS2 that most CRT shaders don’t target: A lot of CRT shaders assume a high quality VGA-style monitor. Nothing wrong with that of course, but I find most of them a bit “too” good for PS2. Where we’re going, we’ll need that analog fuzz too. I’ve been deep down the rabbit hole looking up old specifications to get a deeper appreciation for the brilliant engineering involved in making color TV work all those years ago, and the PS2 era was the last hurrah for SD CRTs, 480i warts and all. I’ve sort of gone through all of this stuff before back in my early days of programming, but I’d like to think I’m a little smarter than I was back then, and I learned far more details than I knew before. From a graphics programming PoV this is hardly “difficult” stuff like debugging random GPU hangs at 10pm on a Friday coming from the latest and greatest AAA game, but hey, sometimes you just gotta relax with some good old DSP coding to stay sane. The signal processing for NTSC and PAL is fairly straight forward and it’s actually a good entry point into signal processing for graphics programmers since it combines very visual things with actual real world problems. Component is the highest quality analog cable (actually, 3 cables!) available to consumers. It supports progressive scan which very few PS2 games natively supported. Simulating this cable is quite trivial. Composite is the yellow cable “everyone” used. This one is the hardest since separating luma from chroma in the same single-channel signal is actually kinda complicated to do well. Most of the time I spent was trying various strategies for this problem and try to come up with a look that seems authentic and “tastefully shitty” S-Video is very similar to composite, except that luma and chroma are separate signals. Chroma is packed into one cable and chroma decoding is basically same as for composite signals, where phase and amplitude dictate the hue and saturation. Main difference to component is that bandwidth for chroma should be quite low, so more color smearing should be present. Us Euro-bros technically had RGB SCART too, but I’ve never seen those cables myself for a Sony console. I believe our old N64 and GameCube had those, but I’d have to double check next time I visit the family … While it is very easy to find a ton of content online about old TV standards, your random YouTube video is not going to have the minute detail needed to implement much. I found “Video Demystified – A Handbook for the Digital Engineer (2005)” by Keith Jack which has ton of great detail that is often left out to make sure my implementation stays grounded. BT.601 defined how to take NTSC and PAL signals and turn them into the digital domain. It is the foundation for all digital video today. The primary interest for us is that the standard defines a 13.5 MHz sampling rate. This was supposedly chosen since it was convenient for both NTSC and PAL and filtering requirements. This scheme works out to 720 horizontal pixels for NTSC and PAL when you account for 525 lines at 29.97 Hz and 625 lines at 30 Hz. The H-sync part of the analog signal pads out to a bit over 800 pixels, but that’s irrelevant here. I also learned just recently why it’s called YUV444, 422, 420, etc: The “4:2:2” notation now commonly used originally applied to NTSC and PAL video, 480i and 480p Systems implying that Y, U and V were sampled at 4×, 2× and 2× the color subcarrier frequency, respectively. The “4:2:2” notation was then adapted to BT.601 digital component video, implying that the sampling frequencies of Y, Cb and Cr were 4×, 2× and 2× 3.375 MHz, respectively. Now you know! The Nyquist frequency of SD video is thus 6.75 MHz. Talking about video like this is a little weird, but it will make more sense later since analog signals like NTSC and PAL have a certain bandwidth, which limits how much horizontal resolution we can cram into it. The number of video lines is hardcoded. The video signal is really just a 1D signal if you look at it in an oscilloscope (haven’t seen those since my Bachelors …). The next question was to determine the sampling rate of the PS2 CRTC. Given it generates analog signals, it’s not obvious that PS2 would use a standard frequency here. However, several old forum threads I found did indeed suggest that PS2 is based around the 13.5 MHz rate. This is likely because it could use off-the-shelf video DACs at the time. The nominal maximum horizontal resolution of PS2 is 640 pixels, but it seems like overscan is supposed to stretch the 640 pixels out to fill the screen anyway. Even if 480 lines are “visible” in NTSC, games typically just render 448 lines since the top and bottom is eaten by overscan by most TVs. The actual CRTC clock seems to be 54 MHz, because when programming the CRTC, you’re supposed to set some dividers which ends up determining the resolution. E.g. 640 pixel width is a divider of 4, 512 pixels a divider of 5 and so on. It conveniently supports most relevant horizontal resolutions like this. My armchair theory for how this works is that the CRTC runs at 54 MHz and uses the divider to do a zero order hold (aka nearest filtering) which is then fed into the DAC. Either that or the console is really doing horizontal linear filtering up to 640 pixels and does composite video generation at 13.5 MHz. I’m not sure what really happens, so I just have to guess. I’m not quite obsessed enough to have an oscilloscope to suss out micro-details like these from a real PS2. The PS2 is technically able to emit 1080i and 720p signals, so there’s no reason why it couldn’t do analog video processing at the full 54 MHz. To debug any of this we need test images. I don’t exactly have test signal generators that TV engineers had back in the day, so I had to synthesize something. The use for these is to validate various edge cases in the NTSC and PAL decoding process. The basic idea is to have some 75% color bars at the top to validate the color pipeline. The middle section is a sweeping increase of horizontal frequency up to Nyquist of 6.75 MHz. Each 1 MHz section is delimited by a single line color bar. The lower portion is a sweep with increasing frequencies. This is used to test the comb filter. Doing FIR filter design by hand is uh … not something I have time for. Anything beyond the basic windowed sinc design process gets annoying very quickly. GNU Octave is a free alternative to Matlab that does what we need for these tasks. While we can do composite video generation at the 13.5 MHz rate, it is not easy to avoid aliasing, especially since we will be modulating with a chroma subcarrier that will shift the spectrum all over the place, potentially creating aliases out of thin air. The handbook calls for a 2x oversampling to avoid this. First, the image is padded out to fill 720 horizontal pixels for BT.601 reasons. Then, integer scale the image up to 2880xN (to match with how I understand the CRTC to work), and downsample that with a low-pass to 1440xN which completes a clean 2x oversampling. A polyphase upsampling filter is of course possible too to avoid the intermediate upscale, but that’s needless complexity for something that literally takes 10 microseconds on the GPU The immediate upscale is not written to memory at least, the filter just assumes a 2880 pixel input and it’s just sampling the input texture redundantly instead. Since these are fundamentally analog signals, we only filter horizontally. The natural image type for these is 1D Array. All the processing shaders are 1D as well. (The 1504 width is just extra padding for convolutions.) It’s all the same thing really. NTSC is the oddball one with YIQ but in practice this difference is completely irrelevant. The original idea of IQ was to take the blue and red difference signals and rotate them by 33 degrees to make it so that I would align with skin tones better, and give I more bandwidth than Q . However, the handbook doesn’t seem to give it too much consideration. Especially not for composite signals which are not meant to be sent over the air. The implementation of this is really just doing a 3×3 matrix multiply with RGB, nothing special. NTSC and PAL have fairly narrow bandwidth defined for their broadcast signals, but composite signals don’t really have those strict limits. However, since the digital input has Nyquist at 6.75 MHz the handbook calls for bandlimiting the signal to about 6 MHz anyway. Super sharp falloff filters just lead to ringing. Component video is specified by BT.1358 and doubles the sampling rate to 27 MHz. Y should fall off at about 11 MHz and chroma half that. Interpreting that in interlaced terms, the falloff should start at 5.5 MHz, getting close-ish to the Nyquist limit for SD video. After filtering luma and chroma, just decode back to RGB and we have a nice signal, done. Chroma is nuked down to ~1.3 MHz or so for composite / S-Video. Sometimes, 0.6 MHz is called for it seems, but it’s quite unclear … S-Video can ignore the modulation + demodulation part if we just want to simulate the smear, but it was easier to let it go through the full chain for completeness. Despite the chroma bandwidth being so horrible, it sort of looks good. Amazing how terrible our eyes are at seeing color detail. I wonder if this is there the esoteric 4:1:1 YCbCr subsampling mode comes from now that I think about it … E.g. NTSC luma filter, which starts falling off at about 4.2 MHz and stops at 6.75 MHz-ish. Broadcast NTSC is capped well below this bandwidth. Chroma encode, with ~1.3 MHz passband: Main difference for PAL is that luma passband is a bit higher. PAL-B (which Norway used) seems to specify 5 MHz rather than 4.2 MHz. Generating these filters can be done easily with firls and friends in Octave. The handbook calls for a gaussian kernel for chroma to avoid any ringing, but I missed that memo Either way, these are implemented with trivial convolutions which GPUs eat up like butter. This is the first point where NTSC and PAL differ significantly. While NTSC and PAL have different color subcarrier frequencies, PAL is also a bit more sophisticated. The “sign” of V is where Phase Alternate Line comes in. It flips every scanline. In broadcasting, this was meant to fix bad colors being introduced during broadcasting through complicated terrain (and boy do we have that over here in Norway). Bad phase shifts being introduced by NTSC will manifest as hue shifts. The basic idea behind PAL is that if phase shifts are introduced by signal reflections the sign flipping of V every scanline ensures that the decoded hue is complementary from scanline to scanline. This manifests as the Hanover Bar artifact. By averaging chroma from scanlines, the errors cancel out and we recover the correct color with slightly less saturation. The cost is of course reduced vertical chroma precision, but given how comically smeared chroma is horizontally, I’m not sure this matters, and digital video uses 4:2:0 subsampling anyway. Now, broadcasting considerations are kind of irrelevant for something like composite video (I would think), but I’m not sure if PAL TVs skip the filter for anything not coming from an antenna. I kept the vertical chroma filter in my implementation because it’s neat to have. The NTSC chroma subcarrier is constructed such that every scanline completes 227.5 cycles. Every line flips chroma phase which is very convenient and makes luma and chroma separation less complicated. The NTSC chroma pattern is a checkerboard as a result. PAL is more annoying here. A half cycle method would not work since V is already flipping every line, so PAL chose 287.75 cycles. On top of that a tiny 1 / 625 cycle offset per line is added for … reasons. The V flipping leads to a chroma pattern where U takes a diagonal pattern while V has a pattern along the other diagonal. Frame progression is also a concern. As fields are scanned, each time the same field is drawn, the chroma subcarrier should have opposite phases. This follows naturally from how fields are drawn. The period of NTSC is 4 fields: As we can see, things repeat after 4 fields and this point is easy to miss. The artifacts introduced by composite video should ideally cancel themselves out over time which manifests itself as flickery noise rather than a horribly glitched image. PAL is annoying and has a longer cycle of 8 fields due to the three-quarter of a cycle setup. After completing 2500 lines (625 * 4), the chroma subcarrier has completed an integer number of cycles, and the sign of V is back to where it started. After modulation, the NTSC color bar signal looks more like: and the next line flips phase as expected: Some old Nintendo consoles (and likely others) emit NTSC and PAL in non-standard ways. E.g. NES is infamous for shifting the chroma carrier by 120 degrees instead of 180 which leads to very particular artifacts. See NESdev Wiki for more detail. It’s easy to mess up RGB to YUV conversions and the compositing process. The handbook had reference outputs for RGB inputs in NTSC and PAL where I could confirm that the math was indeed correct. The things to check are that minimum and maximum of the signal are what they should be. NTSC, at least the US variant of it maps black to 7.5 IRE (just think of it as some abstract voltage) and white to 100.0 IRE, and it tripped me up a bit at first since the NTSC color bars were defined in terms of this shifted and scaled IRE value. Looking at the peaks and valleys of the generated composite signal in RenderDoc is enough since we just need to eyeball it. Close enough is good enough. This is non-trivial and a source of endless head scratching. Chroma information lives in the frequency spectrum around the carrier, but so does higher frequency luma detail. The basic theory for comb filters is to take advantage of the opposing phase of the chroma carrier. For NTSC, I used this basic structure from the handbook: In code, this translates to: Ideally, by adding two neighbor lines, chroma should cancel out and only luma remains. Subtracting, we cancel luma and only chroma remains. We know that chroma won’t (or at least shouldn’t) exist outside its bandwidth, so the result is run through a bandpass filter that centers around the carrier frequency, and we have an estimate for the modulated chroma signal. Since the composite signal is Y + C, we subtract the chroma estimate from composite signal to form a Y estimate. Chroma can now be demodulated and low-passed to remove the harmonics introduced by demodulation. This filter works “perfectly” for regions where the chroma is constant, but not so much where there are discontinuities. This results in “chroma dots” where the color subcarrier bleeds into decoded luminance. Notice the dot pattern on the bottom of the image. Thus, different colors modulate the luminance intensity differently, creating a “dot” pattern on the scan line between two col- ors. To eliminate these “hanging dots,” a chroma trap filter is sometimes used after the comb filter. In the real world of analog circuitry, having a perfectly locked signal like this is probably also not too realistic to assume. The literature also calls out for notch filtering as an approach. I attempted combining a comb filter with a notch filter on top to reduce the artifacts, but it is quite tricky to create a notch filter that works well. A simple FIR notch filter with zeroes is easy enough to make: This filter is convolved with a simple low-pass to complete the luma decoding filter. This approach just leads to severe blurring for NTSC and a band-stop filter approach just lead to less severe blurring and ringing instead, so I’m not sure what should be done. Unfortunately, the handbook isn’t clear on what kind of filtering is called for here. IIR notch filter designs can be super sharp to surgically carve out the carrier, but IIR filtering is also a massive pain in the ass on GPUs. It’s also likely to ring heavily too, which I found rather annoying in my testing. E.g. 3-line comb NTSC without the notch (integer nearest upscale from 640×240): and with notch: Yikes. There’s no way this notch approach is correct. It’s like we’re getting double vision here. It does clean up the chroma dots though, so … yay? Going beyond these base techniques there’s adaptive filtering where the filtering strategy changes based on which kind of case we’re dealing with. And even more sophisticated is taking advantage of temporal information (look ma’, TAA has been a thing since forever :D) since N fields in the past we have complementary chroma phase perfectly aligned to our pixel grid. Very cool stuff, but I doubt consumer TVs at the time would have those. The added latency for doing this kind of analysis doesn’t sound like something you’d want for games at least … Either way, I’m not designing high-end TV circuitry in the late 90s/early 2000s here. We can just flip on S-Video to simulate the perfect Y/C separator, so at some point I have to decide that I’ve done enough filter masturbation and move on. This was way harder than expected and I had to bang my head against the wall for a while to come up with a good solution. The ~90 degree shift every scanline means the basic comb filter for NTSC won’t work at all. The handbook has two main strategies here. Either a delay line which is slightly longer than a scanline to align the phases: Or use a highly magical “PAL modifier”: The function of this modifier is esoteric as all hell, but I think the purpose of it is to phase shift the signal by 90 degrees to “realign” the carrier somehow (it will still be off by 0.6 degrees). This filter path with two bandpass filters just got so messy, and I couldn’t figure out how to debug the thing effectively (were the inevitable visual artifacts my bugs or just the filter being bad?) that I eventually gave up and designed my own filter. That’s more fun anyway. I started from first principles and designed a 3×3 kernel that should be able to perfectly pass a chrominance signal and 100% reject any luminance signal that is DC in either horizontal or vertical direction. To make things simple I started with the assumption of 4 samples per subcarrier cycle to make the examples easier. Given a constant value of 1.0 for U, a signal would look something like: U = sin(wt) with N + 0.75 cycles per line here. A kernel that satisfies the criteria is: The sum of all rows and column is 0, meaning that if the signal is DC either horizontally or vertically, the result is completely filtered out. The filter also rejects V signals. V = +/- cos(wt) and looks like This is just the same signal flipped horizontally, so: Then a combined filter is made that accepts U and V signals together. U and V can be perfectly split later during demodulation so that is okay. This just boils down to a simple diagonal edge detection filter (high pass vertically and horizontally), but actually works quite well. To deal with the actual 2x oversampled rate of 27 MHz and the PAL subcarrier at ~4.433 MHz, the ~90 degree shift per line is about 1.51 samples, so to make this sort of work, I stretched out the horizontal kernel to a 5-tap filter: The vertical kernel remains the same of Some error is introduced since we’re not sampling the signal 100% correctly anymore (theoretically we need a sinc to reconstruct the signal perfectly), but I measured the reconstructed error to be -40 dB, which is good enough I think. The measured error for U and V were also similar, which indicates no weird artifacts from the V flips. With this 15-tap kernel, we get a pretty good chroma estimate even in PAL. From here the same ideas as NTSC apply, bandpass the estimate and subtract it from composite signal to get luma. Notch filtering to cleanup the chroma dots worked way, way better for PAL than NTSC, likely because the carrier has a much higher frequency on PAL, so the low pass behavior of the notch isn’t as devastating to image quality as it is on NTSC. In the end I think I prefer 3-line comb + notch for PAL and just plain 3-line comb for NTSC. These screenshots are just one still frame (or rather, field). The color fringing will cancel out the next field and it’s hard to show the effect without seeing it at full 60 fields per second. While PlayStation 2 didn’t support this hack mode, GameCube did back in the day. It’s a non-standard video mode that has same refresh rate as NTSC and vertical resolution, but retains the bandwidth and chroma encoding system of PAL, the best of both worlds! My implementation can trivially implement this by just enabling PAL on 60 Hz games. Only thing I’m not quite clear on is how the 1 / 625 subcarrier offset per line is supposed to work, but it’s a non-standard mode anyway, so eeeeeh. With comb filter and notch on NTSC: As expected, luma detail is murdered around ~3.58 MHz carrier. Also serious color fringing due to the extreme high frequency diagonal patterns. In the pattern at the bottom, no fringing is observed since the comb filter did its job as expected. PAL is similar, but the carrier and notch moves to ~4.43 MHz instead: The main feature of PAL is being robust against phase shifts during analog broadcasts. It’s a little unclear if composite inputs cared about this case, but for completeness sake, I implemented it. This path is naturally skipped for S-Video and Component outputs since I can’t imagine a TV caring about that for the more luxurious inputs. It doesn’t take many degrees of error in the phase to get quite different colors for NTSC. For PAL, the phase error manifests as complementary errors every line. However, by averaging out chroma vertically, we can recover the original chroma almost perfectly. At worst, a little less saturation. This topic is kind of unfortunate in that it’s done to death already, and it’s a 100% subjective topic meaning that everyone has some kind of opinion, none which agree with each other. Holy wars have been fought over less. Trying to write anything fresh about this topic is futile in 2026 – the heyday of CRT hobbyist shader development was in the early 2010s – but I felt the need to explain what I did at the very least. If anything, it’s a useful intro to writing your own shader. https://nyanpasu64.gitlab.io/blog/crt-appearance-tv-monitor/ is a good read too for more background information. The most obvious part of a CRT filter is scanlines, however, the idea that CRT images should have clearly visible scanlines is actually an artifact of 240p. For PS2 games, we’re operating at either 480i or 480p for NTSC. For interlaced video, we expect each individual field to have clearly visible scanlines, but the complete image (fused together by our brains) should not. The beam profile should be tuned as such. What most shaders do is for each scanline to take a gaussian distribution in the Y direction, sampled for the neighbor lines to cover the useful portion of the kernel. Another common effect is that very bright scanlines are smeared out, supposedly due to the electron guns not being as stable when they’re driven at high voltages. This can be simulated by varying the standard deviation. It can be subtle, but creates a neat effect I think. Exactly how to come up with the beam profile for a given input voltage is purely up to taste I suppose, I doubt there is a linear relationship between R’G’B’ value and standard deviation of the beam The sampled RGB value is in gamma-space still, since the CRT gamma curve is due to the phosphor response, not the CRT itself adjusting the gamma curve. The BT.1886 standard calls for a 2.4 gamma for SD content, which is the default and looks good. I also added options to use 2.2 (NTSC legacy) and 2.8 (?!, PAL legacy) for fun. Most CRT shaders I’ve seen apply gamma in this way: I think it would depend on whether or not the phosphor’s response is a function of how many electrons hit it, or individual “particles” respond to the energy of the electrons hitting them, where the gaussian beam profile is just a distribution of how many particles light up. In the latter interpretation, the code as-is makes sense, while the first interpretation would call for apply a gamma function on the gaussian profile. The visual output as-is looks good to me at any rate. After this point, all color math happens in linear light, so floating point render targets is a must. Color CRTs get their colors by having colored phosphors that are arranged in some kind of grid. The venerable Trinitrons use vertical stripes of RGB, and I like that look. While CRTs don’t really have a horizontal resolution, there is a “dot pitch”, which sort of dictates resolution. This part is the key to create the “texture” of a CRT. From what I read online a typical dot pitch for consumer TVs was 0.5 – 1.0 mm, and for a typical 20″ CRT, I estimated a reasonable number of RGB triads to be ~640 or so. Close enough to BT.601 standard horizontal resolution, neat. From what I understand, this value is also referred to as “TVL”, and these values seem ballpark reasonable. When looked at a distance from the screen, the dots blend together nicely as we’d expect. It’s basically just LCD subpixels, just larger. The dot layout I used was mostly lifted from Lotte’s CRT shader, but the approach can probably be found in a million shaders already. Just alternating stripes of R, G and B. I suppose a perfect mask of 0.0 should be used here, but it doesn’t end up looking as good as I’d like, even after adding glow effects, so I think the intent behind passing through a portion of other colors is more of a pragmatic decision. At least I cannot think of a physical interpretation of why we’d want to do it. The signal we are creating has a ton of high frequency information and we need to be very careful sampling it such that obvious aliasing is avoided. The common mistake is to just render this shader at output resolution and hoping for the best. This will almost surely lead to terrible aliasing patterns in the image. Bad aliasing of the scanlines in Y direction leads to a low frequency pumping pattern in the image which is extremely distracting, and bad aliasing in the X direction leads to a horribly noisy pattern due to uneven sampling of the dot mask. The easy fix is to render the effect at an integer multiple. E.g. if input image is 240 lines, render to a height that is an integer multiple of that. For the color dot mask, make sure the horizontal resolution is e.g. 3 times the dot resolution (one pixel for R, G, B dots). I ended up with something like width = 640 * 3, and height = 240 * 6 (3x sampling for progressive). A nebulous effect of the CRT is the glow aspect to it. Anyone can tell it’s there, but it’s not entirely clear to me why this happens. Google searches don’t turn up anything useful either. Without knowing the physical reason for it, it’s hard to emulate accurately. Could it be scattering effects inside the thick CRT glass perhaps? Either way, the common way to emulate this effect is to compute a gaussian blur (lots of those around here) and composite it over the original image. Very similar to the usual HDR + bloom effect that games in the 2010s loved to overuse. The main effect here is that the phosphor dots end up blending together nicely, yet retain the added “texture” that the aperture grille pattern gives. Humans like to see some high frequency detail, even if that detail is completely bogus. That’s the common trick behind video compression after all. The glow component, boosted up a ton to make it very visible: With the typical HDR effect in HD games, only very bright pixels participate in the effect, effectively spreading excess light energy over a larger area of pixels. It makes little sense to do anything like this for a CRT shader unless there is a physical threshold where phosphors just randomly start to glow more than they should, but all of this is purely up to taste anyway. Here’s from Soul Calibur II with progressive scan, component cable emulation and 16x SSAA, without any glow added. The look is very harsh to me. (The full-screen image is needed to see it without the extreme aliasing caused by thumbnailing.) Some glow on top and it looks like: Purely up to taste how much to add of course. I like a decent amount of glow. Phosphors don’t turn off right away when they’re lit. It’s very quick though, but adding a few percent of feedback between frames seems to help a bit with making 480i games look better with less flicker. This is not really how things work in the real world I think, but a reasonable approximation. We need a high quality rescaler to get the integer sampled CRT to the screen without introducing significant aliasing from the aperture grille or scanlines. The way to go here is simply to use a proper windowed sinc or something like that. I don’t like it, so I don’t implement it. If you do, make sure to consider resampling it properly to not introduce more aliasing. A point many shaders miss is that the RGB of a modern monitor is not the same as RGB on an old CRT TV. What we think of RGB today is usually BT.709 sRGB which defines a set of color primaries and white point. Old SDTV era video uses BT.601 which is a bit narrower than BT.709. In linear RGB space, this transform is a trivial 3×3 matrix multiply. I actually learned that the very old NTSC 1953 standard defines a set of primaries that were extremely saturated compared to the standards of today. While the primaries of 1953 were aspirational, it was clearly way too early. SMPTE refined NTSC to use more reasonable primaries as part of BT.601. PAL primaries are almost exactly the same (TVs tended to use the same phosphor formulation across the world I suppose), but there’s a theoretical difference so I added both for good measure. Supposedly, Japan kept the use of legacy NTSC 1953 primaries, so that opens an interesting question if the same games actually looked vastly different in Japan compared to the rest of the world? I’ve never heard anything mention this before, so who knows. Either way, I support enabling those primaries for fun. The look of it is quite … something? It would need a HDR monitor with solid gamut to do justice. Here’s with standard BT.601 primaries: (From Legaia 2, which is a purely field rendered game, hence the scanlines) and with NTSC 1953 primaries. Almost like the “Interpret sRGB as Display P3” bullshit that phones do these days to “pop”. Given all the masking we’re doing which lower average brightness, it’s beneficial to support HDR10 rendering. Now that HDR is widely available on Linux too, enjoying some HDR CRT shaders is a good time I added a few modes where I can target a specified maximum nit level. KDE at least respects MaxCLL HDR metadata and disables tonemapping if MaxCLL falls within bounds of the display. I also added a no-tonemap option where MaxCLL = 0 (unknown), which makes KDE tonemap how it wants to. Black Frame Insertion and their friends have been a thing in emulation for ages, and it works quite well with CRT simulation, especially to sell the de-interlacing effect properly. In my implementation, I query EXT_present_timing and decide how many frames I should insert in-between. There’s a gentler falloff between the frames. The overall screen brightness decreases a lot as expected, but with HDR, we can crank the brightness of the proper frame up to compensate. It’s still very experimental and any missed frame leads to horrrrrrrible flicker at the moment (big epilepsy warning), so it’s not something I actually recommend, but it’s fun to experiment with. While simulating each field independently with scanlines, we get de-interlacing the same way a CRT would in theory. This is not free from flicker of course, but most games had mitigation strategies for this. The PS2 supported blending two frames being sent to the video output. Most games render internally at e.g. 448p @ 30 FPS, but since they cannot output that resolution without component cables, the frame is output interlaced over two fields where the CRTC scans every other line rather than every line. That tends to look quite flickery if done as-is given how aliased PS2 graphics are, so what pretty much every game did was using the two CRTCs to blend the two frames vertically before sending it by programming a 1 pixel offset with duplicate inputs. By shifting the offset every field, the 30 FPS progressive image could be scanned out nicely into a 60 FPS interlaced image. This is the “flicker filter” that some games allowed toggling. Here’s a RenderDoc capture showing that CRTC 1 and 2 are configured with one pixel offset in Y: After merging and blending the frames together, a smoothed image is sent. PCSX2 GSdx and paraLLEl-GS have modes to detect this pattern and just scan out the full 480p of course, without the added blurring. Most games fall into this pattern, which is fortunate if we want to avoid interlacing shenanigans, but not all games are so nice to deal with. The “Anti-Blur” option is designed precisely for disabling this filter. I also added an option to force-disable the automatic progressive scan, mostly to test the video output that the games would actually have output back in the day, which is interlaced video. Some games decided that they wanted to render at 60 FPS and sacrifice half the vertical resolution to do so, jittering the rendering to stay in sync with the interlaced output. These are painful to deal with to this day since they absolutely require some kind of de-interlacing solution to look good. I never got satisfactory results with a typical de-interlacer, but the CRT simulation does a quite good job at it I think. It’s not perfect (interlacing wasn’t exactly perfect on CRTs either), but it’s usable for me to the point now that I can play interlaced games as intended. It’s not really possible to demonstrate this with screenshots. Some games break if you try to promote them to progressive scan, because the games might decide for some stupid reason to use the SCANMSK feature to discard pixels every other line, and rely on the FRAME scanout mode to exactly scan out the pixels that were not masked. Kings’s Field IV is an example of this absolute insanity. I maintain a patch for PCSX2-git which supports parallel-gs and now this CRT/analog thing. This is super niche stuff that I don’t really expect many people to actually use, but it’s there for those who are interested. It does what I need at least, and that’s what’s important to me. I put together some test video clips in HEVC/PQ/4:4:4/1440p at ridiculously high bitrates. I tried AV1 but my CPU could not decode it in real time, so it is what it is ._. Clip 1 is: Raw RGB passed into the CRT. It uses the game’s native 480p and widescreen support. Super sampling is 4x SSAA. Clip 2 is: “PAL60” with 3-line comb + notch. It also uses the more default 4:3 and interlaced video. It has some frame drops which completely botch the interlace and even at comically large bitrates, the aperture grill effect doesn’t translate well, so it is what it is, but it’s a rough approximation of what it should look like. Anyway, I’m happy with the results. Time to actually play something instead of debugging stuff Analog TV input path 480i instead of 240p High refresh rate simulation Field 0: 262 lines, chroma carrier starts in + phase, ends in + phase due to even number of lines Field 1: 263 lines, chroma carrier starts in + phase, ends in – phase due to odd number of lines Field 2: 262 lines, chroma carrier starts in – phase, …

0 views
Hillel Wayne 2 years ago

The Hunt for the Missing Data Type

A (directed) graph is a set of nodes, connected by arrows ( edges ). The nodes and edges may contain data. Here are some graphs: Graphs are ubiquitous in software engineering: Graphs are also widespread in business logic. Whitepapers with references form graphs of citations. Transportation networks are graphs of routes. Social networks are graphs of connections. If you work in software development long enough, you will end up encountering graphs somewhere . I see graphs everywhere and use them to analyze all sorts of systems. At the same time, I dread actually using graphs in my code. There is almost no graph support in any mainstream language. None have it as a built-in type, very few have them in the standard library, and many don’t have a robust third-party library in the ecosystem. Most of the time, I have to roll graphs from scratch. There’s a gap between how often software engineers could use graphs and how little our programming ecosystems support them. Where are all the graph types? As I ran into more and more graphs in my work, this question became more and more intriguing to me. So late last year I finally looked for an answer. I put a call out on my newsletter asking for people with relevant expertise— graph algorithm inventors, language committee members, graph library maintainers— to reach out. I expected to interview a dozen people, but in the end I only needed to talk to four: After these four people all gave similar answers, I stopped interviewing and start writing. So far I’ve been describing directed graphs. There are also undirected graphs, where edges don’t have a direction. Both directed and undirected graphs can either be simple graphs , where there is a maximum of one edge between two nodes, or multigraphs , where there can be many edges. And then for each of those types we have hypergraphs, where an edge can connect three or more nodes, and ubergraphs, where edges can point to other edges. For each possible variation you have more choices to make: do you assign ids to edges or just to nodes? What data can be stored in a node, and what can be stored in an edge? That’s a lot of decisions for a library to make! But wait, do these distinctions matter at all? A simple graph is just a degenerate multigraph, and and undirected edge can be losslessly transformed into two directed edges. A language could just provide directed hyperubermultigraphs and let users restrict it however they want. There are two problems with this. First of all, it changes the interface, like whether various operations return single values or lists. Second, as I’ll discuss later, graph algorithm performance is a serious consideration and the special cases really matter . Kelly raised the example of maximum weight matching . If you know that your graph is “bipartite”, you can use a particular fast algorithm to find a matching, while for other graphs you need to use a slow, more general algorithm. [It] ties back to the “algorithm dispatch problem.” Given a Problem P, a Graph G, and Algorithms A, B, C to solve P on G… which one do you run? If we don’t know that G is bipartite, and Algorithm C only works on bipartite graphs, how much time can we afford to determine whether or not G is bipartite? — Kelly The perfect graph library would support a lot of different kinds of graphs. But that takes time away from supporting what people want to do with graphs. Graph algorithms are notoriously hard to get right. In this essay , the inventor of Python implemented his own algorithm. It had to be updated with corrections five times! Every single implementation of pagerank that I compared to was wrong. — Nicole So which algorithms should come with the library? “The amount of things people want to do with graphs is absurd,” Kelly told me. That matches my experience, and the experiences of all my interviewees. It sometimes seems like graphs are too powerful , that all their possibilities are beyond my understanding. “The question is,” Kelly said, “where do you draw the line?” For NetworkX, “the line” is approximately 500 distinct graph algorithms, by themselves making up almost 60,000 lines of code. By comparison, the entire Python standard library, composed of 300 packages, is just under 600,000 lines. 2 With all that, it’s unsurprising that you don’t see graphs in standard libraries. The language maintainers would have to decide which types of graphs to support, what topologies to special-case, and what algorithms to include. It makes sense to push this maintenance work onto third parties. This is already the mainstream trend in language development; even Python, famous for being “batteries included”, is removing 20 batteries . Third parties can make opinionated decisions on how to design graphs and what algorithms to include. But then they’re faced with the next problem: once you have a graph interface, how do you represent it? Let’s imagine we’re supporting only barebones simple directed graphs: nodes have identities, edges do not, neither has any associated data. How do we encode this graph? Here are four possible ways a programming language could internally store it: Different graph operations have different performance characteristics on different representations. Take a directed graph with 100 nodes and 200 edges. If we use an adjacency matrix representation, we need a 100×100 matrix containing 200 ones and 9,800 zeros. If we instead use an edge list we need only 200 pairs of nodes. Depending on your PL and level of optimizations that could be a memory difference of 20x or more. Now instead take a graph with 100 nodes and 8,000 edges and try to find whether an edge exists between node 0 and node 93. In the matrix representation, that’s an O(1) lookup on . In the edge list representation, that’s an O(|edge|) iteration through all 8,000 edges. 3 Graphs with only a few edges are sparse and graphs with almost all edges are dense . The same program may need to do both operations on both kinds of graph topologies: if you’re constructing a graph from external data, you could start out with a sparse graph and later have a dense one. There’s no “good option” for the internal graph representation. And all this trouble is just for the most barebones directed graph! What about implementing node data? Edge data? Different types of nodes and edges? Most third party libraries roughly fall in one of two categories: Offer a single rich datatype that covers all use-cases at the cost of efficiency. NetworkX stores graph as a dict of dicts of dicts, so that both nodes and edges can have arbitrary data. 4 Offer separate graph types for each representation, and rely on the user to store node and edge data separately from the graph type. An example of the second case would be Petgraph , the most popular graph library for Rust. Petgraph has , , and for different use-cases. Bradford used Petgraph for Nosey Parker , a security tool that scans for secrets across an entire history of a git repo. His benchmarking graph is CPython, which has 250k commits and 1.3M objects but only a few edges per commit node. He went with an adjacency list. Supporting many representations has a serious downside: you have to do a lot more work to add algorithms. If you write a separate version of the algorithm for each graph representation, you’re tripling or quadrupling the maintenance burden. If you instead write a generic abstraction over polymorphic types, then your library is less performant. One programmer I talked to estimated that a hand-rolled graph algorithm can be 20x faster or more than a generic algorithm. And this gets into every interviewee’s major complaint. A “generic” graph implementation often doesn’t cut it. — Bradford This is the big one. Many, many graph algorithms are NP-complete or harder. 5 While NP-complete is often tractable for large problems , graphs can be enormous problems. The choice of representation plays a big role in how fast you can complete it, as do the specifics of your algorithm implementation. Everyone I talked to had stories about this. In Nosey Parker, Bradford needed to reconstruct a snapshot of the filesystem for each commit, which meant traversing the object graph. None of the four provided graph walkers scaled to his use case. Instead he had to design a “semi-novel” graph traversal algorithm on the fly, which reduced the memory footprint by a factor of a thousand. I was able to get working a proof of concept pretty quickly with [petgraph], but then… this is one of those cases where the performance constraints end up meeting reality. — Bradford Zayenz raised a different problem: what if the graph is simply too big to work with? He gave the example of finding a solution to the 15 puzzle . This is done by running a A* search on the state space. A state space with over 20 trillion states . If you generate all the nodes, you’ve lost already. — Zayenz Zayenz oversaw one research project to add graphs to the Gecode constraint solver. They eventually found that a generic graph type simply couldn’t compete with handpicking the representation for the problem. Even graph databases, designed entirely around running complex graph algorithms, struggle with this problem. Nicole, the graph database engineer, told me about some of the challenges with optimizing even basic graph operations. If you’re doing a traversal, you either have to limit your depth or accept you’re going to visit the entire graph. When you do a depth search, like “go out three steps from this and find the path if it exists”, then you’re just committing to visiting quite a bit of data. — Nicole After leaving that job, she worked as a graph query performance consultant. This usually meant migrating off the graph database. She told me about one such project: to speed the graph queries up, she left one computation as-is and rewrote the rest as MapReduce procedures. “Which was a lot harder to understand,” she said, “But would actually finish overnight.” All of this means that if you have graph problems you want to solve, you need a lot of control over the specifics of your data representation and algorithm. You simply cannot afford to leave performance on the table. So, the reasons we don’t have widespread graph support: This explains why languages don’t support graphs in their standard libraries: too many design decisions, too many tradeoffs, and too much maintenance burden. It explains why programmers might avoid third party graph libraries, because they’re either too limited or too slow. And it explains why programmers might not want to think about things in terms of graphs except in extreme circumstances: it’s just too hard to work with them. Since starting this research, I’ve run into several new graph problems in my job. I still appreciate analyzing systems as graphs and dread implementing them. But now I know why everybody else dreads them, too. Thank you for reading! Thanks to Predrag Gruevski for research help, Lars Hupel , Predrag Gruevski , Dan Luu , and Marianne Bellotti for feedback, and to all of the people who agreed to do interviews. If you liked this post, come join my newsletter ! I write new essays there every week. I train companies in formal methods, making software development faster, cheaper, and safer. Learn more here . Graph querying languages (GQLs) 6 are to graph databases what SQL is to relational databases. There is no widely-used standard, but two of the most popular are SPARQL for querying RDF triples and Neo4j’s cypher . Ironically, GraphQL is not a graph querying language, instead being named for its connection to the Facebook Graph Search . I considered graph databases themselves mostly distinct from graphs in programming languages, but their query languages show how graphs could work in a PL. The main difference between all GQLs and SQL is that the “joins” (relationships) are first-class entities. Imagine a dataset of movies and people, where people act in, direct, or produce movies. In SQL you’d implement each relationship as a many-to-many tables, which makes it easy to query “who acted in movie X” but hard to query “who had any role in movie Y, and what was that role”. In SPARQL relationships are just edges, making the same query easy. Cypher has a similar construct. GQLs can also manipulate edges: reverse them, compose them together, take the transitive closure, etc. If we wanted to find all actors with some degree of separation from Kevin Bacon, we could write SPARQL cannot give the length of the path nor do computation along the path, like collecting the chain of movies linking two actors. GQLs that support this are significantly more complicated. My main takeaway from looking at GQLs is that there’s a set of useful traversal primitives that a PL with graph support would need to provide. Interestingly, the formal specification language Alloy has all of these primitives for its “relation” datatype. For this reason I find working with a graph representation in Alloy much easier than in a proper programming language. That said, these all work with labeled edges and may not work for other graph representations. Python added a graphlib in 2020. Based on the discussion here , it was because topological sorting is a “fundamental algorithm” and it would be useful for “pure Python implementations of MRO [Method Resolution Order] logic”. Graphlib has no other methods besides , which only takes graphs represented as node dicts. Unusually, the direction of the node dict is reversed : the graph is represented as . As of 2023, nothing in CPython uses graphlib and there are fewer than 900 files referencing it on Github . By comparison, another package added in 2020, zoneinfo, appears in over 6,000 files, and the term appears in 4,000. I’d guess a lot of these are from before 2020, though. Some skimming suggests that all of these custom topological sorts take different graph representations than graphlib, so they wouldn’t be convertable regardless. Graph representation matters. There are two other languages I found with graph types: Erlang and SWI-Prolog . I don’t know either language and cannot tell when they were added; with Erlang, at least, it was before 2008. I reached out to a person on the Erlang core language committee but did not hear back. Programming languages where “everything is a graph” in the same way that everything in bash a string and everything in lisp is a list. Some examples include GP2 and Grape . Based on some correspondence with people in the field, right now this is still highly academic. Mathematica, MATLAB, Maple, etc all have graph libraries of some form or another. I am not paying the thousands of dollars in licensing needed to learn more. I’ve collected some of the comments I received on this post here . Package dependencies form directed graphs, as do module imports. The internet is a graph of links between webpages. Model checkers analyze software by exploring the “state space” of all possible configurations. Nodes are states, edges are valid transitions between states. Relational databases are graphs where the nodes are records and the edges are foreign keys. Graphs are a generalization of linked lists, binary trees, and hash tables. 1 Zayenz : Former core developer of the Gecode constraint solver , and who has “implemented every graph algorithm there is” Bradford : Author of the Nosey Parker security library and inventor of several new graph algorithms Nicole : Former graph database engineer Kelly : Maintainer on the NetworkX python graph library and compiler developer . Adjacency list: Adjacency matrix: A set of three structs with references to each other Offer a single rich datatype that covers all use-cases at the cost of efficiency. NetworkX stores graph as a dict of dicts of dicts, so that both nodes and edges can have arbitrary data. 4 Offer separate graph types for each representation, and rely on the user to store node and edge data separately from the graph type. There are many different kinds of graphs There are many different representations of each kind of graph There are many different graph algorithms Graph algorithm performance is very sensitive to graph representation and implementation details People run very expensive algorithms on very big graphs. No, really. Hash tables are bipartite graphs . This was used to prove performance of cuckoo hashing operations . [return] I derived both computations with cloc 1.96. I ran cloc in (56989) and in (588167). The whole networkX library is ~90,000 lines of code. [return] You can make this more efficient by keeping the edge list sorted and doing an binary search, at the cost of making edge insertions more expensive. [return] NetworkX has functions to convert graphs into other representations but not for working with those representations directly. [return] 14 of the 21 canonical NP-complete problems are graph problems. [return] Not to be confused with the GQL language, a proposed GQL standard that’s still under development. [return]

0 views