GreatReads - Blog Aggregator · Phoenix Framework

Text rendering and effects using GPU-computed distances

Text rendering is cursed . Anyone who has worked on text will tell you the same; whether it's about layout, bi-directional, shaping, Unicode, or the rendering itself, it's never a completely solved problem. In my personal case, I've been working on trying to render text in the context of a compositing engine for creative content. I needed crazy text effects, and I needed them to be reasonably fast, which implied working with the GPU as much as possible. The distance field was an obvious requirement because it unlocks anti-aliasing and the ability to make many great effects for basically free. In this article, we will see how to compute signed distance field on the GPU because it's much faster than doing it on the CPU, especially when targeting mobile devices. We will make the algorithm decently fast, then after lamenting about the limitations, we will see what kind of effects this opens up. Non-bitmap fonts contain glyphs defined by outlines made of closed sequences of lines and (quadratic or cubic) Bézier curves. Extracting them isn't exactly complicated: FreeType or ttf-parser typically expose a way to do that. For the purpose of this article, we're going to hard code the list of the Bézier curves inside the shader, but of course in a more serious setup those would be uploaded through storage buffers or similar. Using this tiny program , a glyph can be dumped as series of outlines into a fixed size array: contains how many Bézier curves there is for each sub-shape composing the glyph, and contains that list of Bézier cubic curves. Even though glyphs are also composed of lines and quadratic curves, we expand them all into cubics: "who can do more can do less". We use these formulas to respectively expand lines and quadratics into cubics: Where P_n are the Bézier control points. For simplicity and because we want to make sure the most complex case is well tested, we will stick to this approach in this article. But it also means there is a lot of room for further optimizations. Since solving linear and quadratics is much simpler, this is left as an exercise for the reader. You may be tempted to upload the polynomial form of these curves directly to save some computation in the shader. Don't. You will lose the exact stitching property because one evaluated polynomial end B_n(1) will not necessary match the next polynomial start B_{n+1}(0) . This makes artificial "precision holes" that will break rendering in obscure ways. In the previous article , we saw how to get the distance to a cubic Bézier curve. Each glyph being composed of multiple outlines, we can simply run over all of them and pick the shortest distance. Where is the distance to the Bézier curve, squared, as defined in the previous article . This works just fine, but as you can imagine, it's not cheap to solve so many distances per pixel. A first straightforward optimization would be to ignore any curve with a bounding box further than our currently best distance, because none of them can give a shorter one: Where each box encloses a Bézier curve like this: We could use a tighter bound but it would require more computation so this felt like a good trade-off. Implementing this in the inner loop is pretty simple: The distance to the bounding box formula comes from this explicative video by Inigo Quilez (the basic one, without the inside distance), adapted to the Bézier control point coordinates. This saves a lot of computation in certain cases, but the worst case is still pretty terrible, as shown by the heat map of this 'C' glyph: Indeed sometimes, it takes a long time to reach a good Bézier curve that is small enough to disregard most of the others. We observe it the further we go away from the beginning of the shape. So the next step is to find a good initial candidate. One cheap way to do that is to first compute the distance to the center point of each curve, and pick the smallest: This optimization is immediately reflected on the heat map, where only the central point seems to become a critical point (this glyph is a pathological case as it forms a circle): The last step is to figure out whether we are inside or outside the shape. There are two schools here, the even-odd and the non-zero rules. We'll pick the latter because that's the expectation in the case of font rendering. In deconstructing Bézier curves , we explained the theory of that specific algorithm so we're not going to dive into the details again. The basic idea is to strike a ray in one direction from our current position, and get how many times we cross a given curve. Here we will arbitrarily choose a horizontal ray line y = P_y where P is our current coordinate. The topology of each curve can hint us on whether it's worth considering it or not. For example, if every control point is above or below our current position, it can be ignored. We can store all the signs in a mask and bail out as soon as the ray is either completely below or completely above the bounding box of the curve: Each sign indicates the position of the control point with regard to the ray. We can use the relative position of the starting point as a reference for the overall orientation (if there is a crossing, we know it will come from below or above): We also need to convert the Bézier curves to the usual polynomial at^3+bt^2+ct+d : Then we can find the y-roots and check every point on the x-axis. For every crossing point (at most 3), we switch the sign: Since we already have a 5th degree root finder from the previous article, we just have to build a tiny version for the 3rd degree: Our root finder doesn't return roots outside [0,1] so no filtering is required. To summarize: For every sub-shape, we can accumulate the winding number, and use it at the end to decide whether we're inside or outside: This winding number logic might be too fragile: it doesn't cover potential degenerate cases such as horizontal tangents / duplicated roots. But for some reason, while I fought these issues for years, none of the weird corner cases seemed to glitch in my extensive tests, probably because the root finder is more resilient than what I was using before. This may look satisfying, but it's only the beginning of the problems. For example, variadic fonts are typically following chaotic patterns: In addition to the self overlapping part, notice the reverse folding triangle on the right. This completely wreck the distance field: Even with a simple character display (meaning something that doesn't exploit the wide range of effects available with an SDF), it starts to glitch: Little "cracks" should appears around the overlaps. This can be mitigated by lowering the distance by a tiny constant to avoid the zero-crossing, but it impacts the overall glyph (it gets more bold). And it's not just because of variadic problem, sometimes designers rely on overlaps for simplicity: And sometimes... well let's say they have a legitimate reason to do it: This is not something that can be addressed easily. For example, take these two overlapping shapes: We see that the actual distance (white circle) is not the smallest distance to either shape, and it's not even the smallest distance to any edge: it is at an intersection point between two curves, which we do not have. Here we're dealing with line segments, but with cubic curves, the problem explodes in complexity. At this point, we need another strategy, like feeding the GPU renderer with preprocessed outline-only curves. Many people rely on curves flattening to address this issue. This is unfortunately yet another field of research that we're not going to explore this time. Inigo talked about the combination of signed distance if you want some ideas, but aside from the first one (giving up), none seems particularly applicable here. Some effects such as blur or glow expand beyond the boundaries of the characters, so the distance field needs to be larger than the glyph itself. This means when an effect spread too large, there will be an overlap. If we're making an effect on a word, the distance field must be the unified version of all the word glyphs (or sometimes even the sentence). The classic approach of an atlas of glyph distances will not work reliably. In the following illustration, a geometry per glyph is used, each geometry is enlarged to account for the larger distance field, and we end up with potential overlaps when applying effects. Like all distance maps, it suffers from the same limitations. The most common one is the rounded corners problem. This is typically addressed using a multi-channel signed distance field generator , but it's hard for me to tell how accessible it is for a portage on the GPU. This problem only appears with intermediate textures. When computing exact distances like here directly in the shaders, this is not an issue. Despite all these limitations, we can already do so much, so let's close this article on a positive note. This is not done here, but all of these effects are free as soon as we have the distance field stored in an intermediate texture. First, we have anti-aliasing / blur: I wrote a dedicated article on the subject of AA (and blur) on SDF if you want more information on how to achieve that. The shape can also be drastically altered with a simple operator such as "rounding": This is the same technique we suggested to cover up for the overlap glitch earlier, just rebranded as an effect. In the same spirit we can also create an outline stroke (on the outer edge to preserve the original glyph design): This is sooo useful because it makes it possible for our text to be visible no matter what the background is. So many editors don't have this feature because it's hard and expensive to do correctly. Given a distance field though, all we have to do is this (which also includes anti-aliasing on every border): We can also dig into our character with : And maybe apply some glow to create a neon effect: We could also do drop shadows, all sorts of distortions, or so many other creative way exploiting this distance. You get the idea: it is fundamental as soon as you want fast visual effects. This article is the last of the series on 2D rendering for me. I've wanted to share this experience and knowledge after many years of struggling (mostly alone) on these issues. I wish I could have succeeded in providing a good free and open-source text effects rendering engine to compete with the industry standards. (Un)fortunately for me, the adventure stops here, but I hope this will benefit creators and future tinkerers interested in the subject.

Programming

Performance

0 views

pkh.me 1 months ago

Fast calculation of the distance to cubic Bezier curves on the GPU

Bézier curves are a core building block of text and 2D shapes rendering. There are several approaches to rendering them, but one especially challenging problem, both mathematically and technically, is computing the distance to a Bézier curve. For quadratic curves (one control point), this is fairly accessible, but for cubic (two control points) we're going to see why it is so hard. Having this distance field opens up many rendering possibilities. It's hard, but it's possible; here is a live proof: In this visualization, I'm borrowing your device resources to compute the distance to the curve for every single pixel. The yellow points are the control points of the curve (in white) and the blue zone is a representation of the distance field. All the demos and code in this article are self-contained GLSL fragment shaders. Most of the code can be found in the article, but feel free to inspect the source code of any of these WebGL demo for the complete code. They can be run verbatim using ShaderWorkshop . In a previous article , we explained that a Bézier curve can be expressed as a polynomial. In our case, a cubic polynomial: Where a , b , c and d are the vector coefficients derived from the start ( P_0 ), end ( P_3 ), and control points ( P_1 , P_2 ) using the following formulas (you can refer to the previous article for details): For a given point p in 2D space, the distance to that Bézier curve can be expressed as a length between our curve and p : Our goal is to find the t value where d(t) is the smallest. The length formula has an annoying square root, so we start with the distance squared for simplicity, which we are going to unroll: The derivative of that function will allow us to identify critical points: that is, points where the distance starts growing or reducing. Said differently, solving D'(t)=0 will identify all the maximums and minimums (we're interested in the latter) of D(t) (and thus d(t) as well). It is a bit convoluted in our case but straightforward to compute: A polynomial, this time of degree 5, emerges here. For conciseness, we can express D'(t) polynomial coefficients as a bunch of dot products: Finally, we notice that solving D'(t)=0 is equivalent to solving D'(t)/2 = 0 , so we simplify the expression: Assuming we are able to solve this equation, we will get at most 5 values of t , among which we should find the shortest distance from p to the curve. Since t is bound within 0 and 1 (start and end of the curve), we will also have to test the distance at these locations. We could also compute the 2nd derivative in order to differentiate minimums from maximums, but simply evaluating the 5(+2) potential t values and keeping the smallest works just fine. The red dot in the blue field is a random point in space. The red lines show which distances are evaluated (at most 5+2) to find the smallest one. Transposing these formulas into code gives us this base template code: is a shorthand for the length squared, of course cheaper than computing which contains a square root. We assume here the root finder only returns the roots that are within [0,1] . is our 5th degree root finder, that is the function that gives us all the t (at most 5) which satisfy: But before we are able to solve that, we need to study the simpler 2nd degree polynomial solving: Diving into the rabbit hole of solving polynomial numerically will lead you to insanity. But we still have to scratch the surface because superior degree solvers usually rely on it. My favorite quadratic root finding formula is the super simple one introduced by 3Blue1Brown , which involves locating a mid point m from which you get the 2 surrounding roots r : In GLSL, a code to cover most common corner cases would look like this: Not quite as straightforward as the math formula, isn't it? We cannot know in advance whether the division is going to succeed, so we do run divisions and only then check if they failed (and assume a reason for the failing). This is much more reliable than an arbitrary epsilon value. We also try to avoid duplicated roots. The roots are automatically sorted because z is always positive. may not be as reliable because in GLSL "NaNs are not required to be generated", meaning some edge case may not be supported depending on the hardware, drivers, and the current weather in Yokohama. As much as I like it, this implementation might not be the most stable numerically (even though I don't have have strong data to back this claim). Instead, we may prefer the formula from Numerical Recipes : Leading to the following alternative implementation: This is not perfect at all (especially with the b²-4ac part). There are actually many other possible implementations, and this HAL CNRS paper shows how near impossible it is to make a correct one. It is an interesting but depressing read, especially since it "only" covers IEEE 754 floats, and we have no such guarantee on GPUs. We also don't have in WebGL, which greatly limits improvements. For now, it will have to do. Solving polynomials of degree 5 cannot be solved analytically like quadratics. And even if they were, we probably wouldn't do it because of numerical instability. Typically, in my experience, analytical 3rd degree polynomials solver do not provide reliable results. The first iterative algorithm I picked was the Aberth–Ehrlich method . Nowadays, more appropriate algorithms exist, but at the time I started messing up with these problems (several years ago), it was a fairly good contender. This video explores how it works. The convergence to the roots is quick, and it's overall simple to implement. But it's not without flaws. The main problem is that it works in complex space. We can't ignore the complex roots because they all "respond" to each others. And filtering these roots out at the end implies some unreliable arbitrary threshold mechanism (we keep the root only when the imaginary part is close to 0). The initialization process also annoyingly requires you to come up with a guess at what the roots are, and doesn't provide anything relevant. Aberth-Ehrlich works by refining these initial roots, similar to a more elaborate Newton iterations algorithm. Choosing better initial estimates leads to a faster convergence (meaning less iterations). The Cauchy bound specifies a space by defining the radius of a disk (complex numbers are in 2D space) where all the roots of a polynomial should lie within. We are going to use it for the initial guess, and more specifically its "tight" version (which unfortunately relies on ). Since Aberth-Ehrlich is a refinement and not just a shrinking process, we define and use an inner disk that has half the area of Cauchy bound disk. That way, we're more likely to start with initial guesses spread in the "middle" of the roots; this is where the √2 comes from in the formula below. The circle constants are generated with the following script: Next, it's basically a simple iterative process. Unrolling everything for degree 5 looks like this: When the main coefficient is too small, we fall back on the 4th degree (and so on until we reach the analytic quadratic). The 4th and 3rd degree versions of this function are easy to guess (they're pretty much identical, just removing one coefficient at each degree). We're also hardcoding a maximum of 16 iterations here because it's usually enough. To have an idea of how many iterations are required in practice, following is a visualization of the heat map of the number of iterations for every pixel: The big picture and the weaknesses of the algorithm should be pretty obvious by now. Among all drawbacks of this approach, there are also surprising pathological cases where the algorithm is not performing well. Fortunately, there were some progress on the state of the art in recent years. In 2022, Cem Yuksel published a new algorithm for polynomial root solving . Initially I had my reservations because the official implementation had a few shortcomings on some edge cases , which made me question its reliability. It's also optimized for CPU computation and is, to my very personal taste, overly complex. Fortunately, Christoph Peters showed that it was possible on the GPU by implementing it for very large degrees, and without any recursion. Inspired by that, I decided to unroll it myself for degree 5. One core difference with Aberth approach is that it is designed for arbitrary ranges. In our case this is actually convenient because, due to how Bézier curves are defined, we are only interested in roots between 0 and 1. We will need to adjust the Quadratic function to work in this range, as well as keeping the roots ordered: The core logic of the algorithm relies on a cascade of derivatives for every degree. Christoph Peters provides an analytic formula to obtain the derivative for any degree. This is a huge helper when we need to work for an arbitrary degree, but in our case we can just differentiate manually: Since we're only interested in the roots, similar to what we did to D(t) , we can simplify some of these expressions: The purpose of that cascade of derivatives is to cut the curve into monotonic segments. In practice, the core function looks like this: We could unroll , , and , but to keep the code simple, the degree 3 to 5 will share the same function, with leading coefficients set to 0 (hopefully the compiler does its job properly). relies on roots found (at most 4) at previous stages to define intervals of search: Such an approach has the nice side effect of keeping the roots ordered. The solver itself is not that complex either: The last brick of the algorithm is the Newton bisection, the slowest part: And that's pretty much it. Looking at its heat map, it has a completely different look than Aberth: The number of iterations might be larger but it is much faster (I observed a factor 3 on my machine), the code is shorter, and actually more reliable. The scale used to represent the heat map is not the same as the one used in Aberth, but it is identical with the method presented in the next section. The bisection being the hot loop, it is interesting to ponder how to make this faster. A while back, Raph Levien hypothesized about how the ITP method could perform. Out of curiosity, I gave it a chance. The function is designed to work like a bisection, claiming to be as performant in the worst case. There isn't a lot of code, and the paper provides a pseudo-code. But implementing it was actually challenging in many ways. First of all, the authors didn't seem to find relevant to mention that it only works if f(a)<0<f(b) . If f(a)>0>f(b) , you're pretty much on your own. It requires just 2 lines of adjustments but figuring out this shortcoming of the algorithm was particularly unexpected. Another bothering aspect concerns the parameters: K_1 , K_2 and n_0 . The paper proposes those: I played with them for a while and couldn't find any set that would make a real difference, so I ended up with the following: In the end, the function looks like this: This function can be used as a drop'in replacement for . I had a lot of expectations about it, but in the end it requires more iterations than the bisection we implemented. The paper claims to perform at least as good as a bisection, but our is driven by the derivatives so it converges much faster. Here is the heat map with instead of : The naive unrolled version of Cem Yuksel paper definitely is, so far, the best choice for our problem. I have still concerns about how to implement a good quadratic formula, and I have my reservations about various edge cases. There is also still room for improvements in the cubic solver (degree 3) because it's still a special case where analytical formulas exist, but in general this implementation is satisfying. The next step is to work with chains of Bézier curves to make up complex shapes (such as font glyphs). It will lead us to build a signed distance field. This is not trivial at all and mandates one or several dedicated articles. We will hopefully study these subjects in the not-so-distant future. For performance reasons, reducing K_2 to a value of 2 saves a call to . For K_1 , CRAN seems to suggest \frac{0.2}{b-a} so I went along with it And for n_0 , well 1 or 2 seem to be the usual parameter.

Web Development

Tutorial

Programming

0 views

pkh.me 2 months ago

Code golfing a tiny demo using maths and a pinch of insanity

A matrix can be seen as a function, so mathematically writing p'=M \cdot p would be equivalent to the code with: Doing p'=M \cdot p is rotating the space p lies into, which means it gives the illusion the object is rotating clockwise . Though, in the expression , I can't help but be bothered by the redundancy of , so I would prefer to write instead. Since matrices are not commutative, this will instead do a counter-clockwise rotation of the object. The inlined rotation ends up being: To make the rotation clockwise, we can of course use , or we can transpose the matrix: . This is problematic though: we need to repeat the angle 4 times, which can be particularly troublesome if we want to create a macro and/or don't want an intermediate variable for the angle. But I got you covered: trigonometry has a shitton of identities, and we can express every according to a (and the other way around). For example, here is another formulation of the same expression: Now the angle appears only once, in a vectorized cosine call. GLSL has and functions, but it doesn't expose anything for \pi nor \tau constants. And of course, it doesn't have and implementation either. So it's obvious they want us to use \arccos(-1) for \pi and \arccos(0) for \pi/2 : To specify as a normalized value, we can use . On his Unofficial Shadertoy blog, Fabrice Neyret goes further and provide us with a very cute approximation , which is the one we will use: I checked for the best numbers in 2 digits , and I can confirm they are indeed the ones providing the best accuracy. On this last figure, the slight red/green on the outline of the circle represents the loss of precision. With 3 digits, and can respectively be used instead of and . This is good when we want a dynamic rotation angle (we will need that for the camera panning typically), but sometimes we just need a hardcoded value: for example in the of our combined noise function. is fine but we can do better. Through Inigo's demos I found the following: . It makes a rotation angle of about 37° (around 0.64 radians) in a very tiny form. Since 0.5 was pretty much arbitrary, we can just use this matrix as well. And we can make it even smaller (thank you jolle ): One last rotation tip from Fabrice's bag of tricks: rotating in 3D around an axis can be done with the help of GLSL swizzling: We will use this too. is the same , if we need to save one character and can't transpose the matrix. One last essential before going creative is the camera setup. We start with the 2D pixel coordinates which we are going to make resolution independent by transforming them into a traditional mathematical coordinates system: Since we know our demo will be rendered in landscape mode, dividing by is enough. We can also save one character using : To enter 3D space, we append a third component, giving us either a right or a left-handed Y-up coordinates system. This choice is not completely random. Indeed, it's easier/shorter to add a 3rd dimension at the end compared to interleaving a middle component. Compare the length of to (Z-up convention). In the former case, picking just a plane remains short and easy thanks to swizzling: instead of . To work in 3D, we need an origin point ( for ray origin) and a looking direction ( for ray direction). is picked arbitrarily for the eye position, while is usually calculated thanks to a helper: Which is then used like that, for example: I made a Shadertoy demo to experiment with different 3D coordinate spaces if you are interested in digging this further. All of this is perfectly fine because it is flexible, but it's also way too much unnecessary code for our needs, so we need to shrink it. One approach is to pick a simple origin and straight target point so that the matrix is as simple as possible. And then later on apply some transformations on the point. If we give and , we end up with an identity matrix, so we can ditch everything and just write: This can be shorten further: since the vector is normalized anyway, we can scale it at will, for example by a factor , saving us precious characters: And just like that, we are located at the origin , looking toward Z+, ready to render our scene. It's finally time to build our scene. We're going to start with our function previously defined, but we're going tweak it in various ways to craft a mountain height map function. Here is our first draft: We're exploiting one important correlation of the noise function: at every octave, the amplitude is halving while the frequency is doubling. So instead of having 2 running variables, we just have an amplitude getting halved every octave, and we divide our position by (which is the same as multiplying by a frequency that doubles itself). I actually like this way of writing the loop because we can stop the loop when the amplitude is meaningless ( acts as a precision stopper). Unfortunately, we'll have to change it to save one character: is too long for the iteration, we're going to double instead by using which saves one character. So instead the loop will be written the other way around: . It's not exactly equivalent, but it's good enough (and we can still tweak the values if necessary). We're going to inline the constants and rotate, and use one more cool trick: can be shortened: we just need another . Luckily we have , so we can simply write . Similarly, if we needed we could have written (it works also like that: to shorten ). We can also get rid of the braces of the loop by using the in its local scope. In the end, this is our function: To render this in 3D, we are going to do some ray-marching. The main technique used in most Shadertoy demos is ray-marching. I will assume familiarity with the technique, but if that's not the case, An introduction to Raymarching (YouTube) by kishimisu and Painting with Math: A Gentle Study of Raymarching by Maxime Heckel were good resources for me. In short: we start from a position in space called the ray origin and we project it toward a ray direction . At every iteration we check the distance to the closest solid in our scene, and step toward that distance, hoping to converge closer and closer to the object boundary. We end up with this main loop template: This works fine for solids expressed with 3D distance fields , that is functions that for a given point give the distance to the object. We will use it for our mountain, with one subtlety: the noise height map of the mountain is not exactly a distance (it is only the distance to what's below our current point ): Because of this, we can't step by the distance directly, or we're likely to go through mountains during the stepping ( ). A common workaround here is to step a certain percentage of that distance to play it safe. Technically we should figure out the theorical proper shrink factor , but we're going to take a shortcut today and just arbitrarily cut. Using trial and error I ended up with 20% of the distance. After a few simplifications, we end up with the following (complete) code: We start at so I dropped the variable entirely. Also, to avoid the division by 0 in in , is moved right at the beginning (we could also initialize to a value slightly different than 0). You may be curious about the power at the end; this is just a combination of luminance perception with gamma 2.2 (sRGB) transfer function. It only works well for grayscale; for more information, see my previous article on blending . Compared to the mountain, the clouds and fog will need a 3 dimensional noise. Well, we don't need to be very original here; we simply extend the 2D noise to 3D: The base frequency is lowered to to make it smoother, and the goes from 2 to 3 dimensions. Notice how the rotation is only done on the y-axis, the one pointing up): don't worry, it's good enough for our purpose. We also add a phase (meaning we are offsetting the sinusoid) of ( is the time in seconds, slowed down by the multiply) to slowly morph it over time. The base frequency and time scale being identical is a happy "coincidence" to be factored out later (I actually forgot about it until jolle reminded me of it). You also most definitely noticed isn't explicitly initialized: while only true WebGL, it guarantees zero initialization so we're saving a few characters here. For volumetric material (clouds and fog), the loop is a bit different: instead of calculating the distance to the solid for our current point , we do compute the density of our target "object". Funny enough, it can be thought as a 3D SDF but with the sign flipped: positive inside (because the density increases as we go deeper) and negative outside (there is no density, we're not in it). For simplicity, we're going to rewrite the function like this: Compared to the solid ray-marching loop, the volumetric one doesn't bail out when it reaches the target. Instead, it slowly steps into it, damping the light as the density increases: The core idea is that the volumetric material emit some radiance but also absorbs the atmospheric light. The deeper we get, the smaller the transmittance gets, til it converges to 0 and stops all light. All the threshold you see are chosen by tweaking them through trial and error, not any particular logic. It is also highly dependent on the total number of iterations. Steps get larger and larger as the distance increases; this is because we don't need as much precision per "slice", but we still want to reach a long distance. We want to be positioned below the clouds, so we're going to need a simple sign flip in the function. The fog will take the place at the bottom, except upside down (the sharpness will give a mountain-hug feeling) and at a different position. becomes: Having a single ray-marching loop combining the two methods (solid and volumetric) can be challenging. In theory, we should stop the marching when we hit a solid, bail out of the loop, do some fancy normal calculations along with light position. We can't afford any of that, so we're going to start doing art from now on. We start from the volumetric ray-marching loop, and add the distance to the mountain: If gets small enough, we can assume we hit a solid: In volumetric, the attenuation is calculated with the Beer-Lambert law. For solid, we're simply going to make it fairly high: This has the effect of making the mountain like a very dense gas. We're also going to disable the light emission from the solid (it will be handled differently down the line): The transmittance is not going to be changed when we hit a solid as we just want to accumulate light onto it: Finally, we have to combine the volumetric stepping ( ) with the solid stepping ( ) by choosing the safest step length, that is the minimum: We end up with the following: We can notice the mountain from negative space and the discrete presence of the fog, but it's definitely way too dark. So the first thing we're going to do is boost the radiance, as well as the absorption for the contrast: This will make the light actually overshoot, so we also have to replace the current gamma 2.2 correction with a cheap and simple tone mapping hack : . Halving the color is yet another tweak that is not obtained on anything but trial and error. There might be clever ways to reach the same result, but I leave that up to the reader: The clouds and fog are much better but the mountain is still trying to act cool. So we're going to tweak it in the loop: This boosts the overall emission. While at it, since the horizon is also sadly dark, we want to blast some light into it: When the density is null (meaning we're outside clouds and fog), an additional light is added, proportional to how far we are from any solid (the sky gets the most boost basically). The mountain looks fine but I wanted a more eerie atmosphere, so I changed the attenuation: Now instead of being a hard value, the attenuation is correlated with the proximity to the solid (when getting close to it). This has nothing to with any physics formula or anything, it's more of an implementation trick which relies on the ray-marching algorithm. The effect it creates is those crack-like polygon edges on the mountain. To add more to the effect, the emission boost is tweaked into: This makes the bottom of the mountain darker quadratically: only the tip of the mountain would have the glowing cracks. We've been working in grayscale so far, which is a usually a sound approach to visual art in general. But we can afford a few more characters to move the scene to a decent piece of art from the 21st century. Adding the color just requires very tiny changes. First, the emission boost is going to target only the red component of the color: And similarly, the overall addition of light into the horizon/atmosphere is going to get a redish/orange tint: We're almost done. For the last tweak, we're going to add a cyclic panning rotation of the camera, and adjust the moving speed: I'm currently satisfied with the "seed" of the scene, but otherwise it would have been possible to nudge the noise in different ways. For example, remember the can be replaced with in either or both volumetric and mountain related noises. Similarly, the offsetting could be changed into for a different morphing effect. And of course the rotations can be swapped (either by changing into or transposing the values). At this point, our code went through early stages of code golfing, but it still needs some work to reach perfection. Stripped out of its comments, it looks like this: The first thing we're going to do is notice that both the mountain, clouds, and fog use the exact same loop. Factoring them out and inlining the whole thing in the main function is the obvious move: Next, we are going to do the following changes: Onto the next pass of tricks: I'm also reordering a bit some instructions for clarity 🙃 The last touch is going to be nasty: we're going to reorder the instructions such that the 2nd loop is located at the very beginning of the 1st one: "Why?!" you may ask. Before answering this question, let's see why it still works: the first iteration ends up being executed with , where most calculations just cancel themselves out, leading to one wasted iteration (out of 100). Visually, it makes zero difference. But thanks to this weird change, we end up with a bunch of instructions that we can pack into the last placeholder of the main loop, comma separated. This notably allows us to drop the of the main loop: And here we are. All we have to do now is remove all unnecessary spaces and line breaks to obtain the final version. I'll leave you here with this readable version. I'm definitely breaking the magic of that artwork by explaining everything in detail here. But it should be replaced with an appreciation for how much concepts, math, and art can be packed in so little space. Maybe this is possible because they fundamentally overlap? Nevertheless, writing such a piece was extremely refreshing and liberating. As a developer, we're so used to navigate through mountains of abstractions, dealing with interoperability issues, and pissing glue code like robots. Here, even though GLSL is a very crude language, I can't stop but being in awe by how much beauty we can produce with a standalone shader. It's just... Pure code and math, and I just love it.

JavaScript

Tutorial

Programming

2 views

pkh.me 4 months ago

Perfecting anti-aliasing on signed distance functions

Doing anti-aliasing on SDF is not as straightforward as it seems. Most of the time, we see people use a with hardcoded constants, sometimes with screen space information, sometimes cryptic or convoluted formulas. Even if SDFs have the perfect mathematical properties needed for a clean anti-aliasing, the whole issue has a scope larger than it appears at first glance. And even when trivial solutions exist, it's not always clear why they are a good fit. Let's study that together. The article assumes that you are at least a bit familiar with what an SDF is, but if I had to provide a quick and informal definition, I would say something like: "It's a function (or lookup-table of said function, usually stored in a texture) which returns the signed distance from the specified coordinates to a given shape, where the sign indicates whether you're inside or outside the shape." A common visualization of it looks like this: The distance is fancily colored here for illustrative purpose, and the shape is animated to see how it affects the field. Another way of seeing it is to switch to a 3D view: For the sign interpretation, here we're using the convention positive inside and negative outside , as seen for example on the Wikipedia illustration . But this is not always the case, for example, Inigo prefers the opposite: negative inside and positive outside . I personally find the Wikipedia convention to be more intuitive and easy to work with, but that's a matter of preferences so we'll figure out the formulas for both models. Switching from one to the other is just a sign swap, but it's important to know what we are working with. A properly crafted SDF has a gradient of length 1, meaning the slope is either going up or down, but always at the same constant rate of 1: This is an important property since anti-aliasing is all about transitioning smoothly toward (or away from) the shape. For our first attempt at anti-aliasing we will simply follow that ramp and make a straight transition. Once again, we are going to rely on , one of the most useful math formulas : And more specifically we will need its saturated version : This is the same as the well-known , except it's a straight line when transitioning from to . The length of our ramp (the transition zone between and ) is going to be arbitrary at first, we will call it (for "width"). It's our diffuse, or blur parameter if you prefer. The height we are looking for corresponds to the opacity of our shape. Given a positive inside and negative outside SDF, we will start with the transition centered around the boundary between the shape and its outside. You might be confused about the relationship between the distance and the transition zone (diffuse width ). The following diagram may help clarifying why: Remember, the gradient of an SDF is supposed to have a length of 1. This means there is a direct match between the height of the signed distance (y-axis on the figure), and the spacial distance traveled (x-axis on the figure). This is why Inigo and other folks spend a lot of energy into looking for the perfect formula for the distance to an ellipse . We cannot just stretch a circle as it would distort the SDF, and thus break this important property. A broken AA would be one of the consequences. The previous figure shows that for a centered transition, when the distance is within , it represents a transition width of size around the edge, so we want it to be mapped to an opacity within . This can be expressed with: Which can be unrolled and simplified into the following tiny form: We can also decide to make the transition on the outside or the inside boundary of the shape: And we can be creative and have a cursor indicating where we are on the border. If we give for inside, for centered and for outside: For the negative inside and positive outside SDF, we simply swap the sign: Any of these one-liners is all we need to have AA for our shape, but the question of what value to use for the ramp width arises. The difference between a blur and anti-aliasing is simply the width value. With AA, it's the size of a "pixel", and with a blur it's typically a user input or an arbitrarily large value. If we are in 2D and have access to the pixel resolution, we can use it to get the pixel size. Note that this is closely tied to the coordinate space we use to calculate the SDF. For example, let's say we have a canvas for which we don't know the aspect ratio, we can calculate the screen coordinates like this: This will give us a value within on the shortest axis (the y-axis in landscape mode) and preserve a squared aspect ratio. That means the shortest axis will have an amplitude of , which means the number of pixel on that axis correspond to 2 units. As a result, the unit-width that will be used for the signed distance can be obtained with: Remember that this is true only if the position we use for the SDF is in that range. Basically, we have to adjust this formula to the coordinate space we are using. The same with a x10 resolution to better see the AA: But sometimes we might not have access to the resolution, or we may want to map that 2D SDF onto a plane in 3D or some other transformation. For example, a decal or a text on the wall in a video game. In that later case, if we were to use the screen resolution, it would lead to inconsistent anti-aliasing: We can see that, when put in perspective, the edge in the back gets way too sharp while the edge in the front becomes a bit too blurry. Fortunately, there is a magic trick we can use, the numerical derivatives: Now we magically have a smooth pixel-wise anti-aliasing, no matter the perspective. What is this sorcery? 🧙 calculate the rate of change of a given variable using fragment-based numerical derivatives. Mathematically this is a L1-norm (also known as Taxicab or Manhattan norm) defined as: . But how the hell is this providing a good pixel width estimate? Let's see the simple case where we want observe the rate of change of one variable across one axis. For example, where is the pixel coordinate: . We will have . Why? Because changes at a constant rate of 1 (exactly like our SDF) from one pixel to another. If we remap to a value between using (where is the number of pixels on the x axis), we can follow derivation rules and end up with: . This matches with our initial pixel size computation in our previous section. Now when 3D and perspective distortions are involved, this still holds and you may be wondering why. The intuitive answer is that is the rate of change of the signed distance as seen from the flat pixel screen perspective . In 3D view, in the back of the shape, the distance changes sharply from one pixel to another (meaning will be high), while in the front it's way smoother (meaning will be low). So this numerical derivative is used to scale the distance back to a transition that works smoothly from the 2D pixel point of view. Instead of , we could also use the L2-norm (also known as euclidean distance): This is more expensive than , but it can be considered as an alternative. The AA will be slightly different but it's hard to say which one really is better than the other: Instead of a some people like to use . The main reason is probably because is available in builtin while isn't. But is it a better choice? Intuitively, to me at least, it makes perfect sense for the alpha value to follow a linear ramp. A few weeks ago I would have adamantly argued that it's actually a faster and more logical choice than its curved version . Well... I did some tests. With a large diffuse, here is what it looks like with a linear ramp: It looks like there is a brighter highlight around the border (before the fall-off), doesn't it? Well, it's an illusion, it's just our brain noticing the discontinuity and telling us about it. With a things get better: I wasn't expecting that, so I stand corrected: is actually a better choice. It's also builtin, so we won't need to define our own function. Of course, one may prefer an even smoother curve, for example using the quintic curve instead of the hermite: For pixel-wise anti-aliasing, the discontinuity won't be noticed so using a linear interpolation still is a perfectly valid choice. When we have our anti-aliasing value, it's pretty much done. But we still have the question on how to use it. My previous examples were in black and white, but in many cases we need blending between colors. The question of how to blend is probably the most tricky of all, and my previous article was entirely dedicated to this particular issue . In all the examples from this page, I've been using the OkLab blending because "it's perfect". But the reality is likely to force you to use a simple linear blending. For anti-aliasing, it's honestly just fine, the illusion still works out, but if you're trying to blur I would advise you against it and switch to a better colorspace like OkLab whenever possible. See here how, because of how the human perception works, the linear blending does feel "bobby" and too large compared to OkLab: Given all these tools we can combine them according to our needs and preferences. As closing words, let me propose a few reference examples: Anti-aliasing SDFs can be beautifully simple, and now we also understand better the magic behind. ✨

Design

Programming

0 views

pkh.me 4 months ago

The current technology is not ready for proper blending

The idea that we must always linearize sRGB gradients or work in a perceptually uniform colorspace is starting to be accepted universally. But is it that simple? When I learned about the subject, it felt like being handed a hammer and using it everywhere. The reality is a bit more nuanced. In this article we will see when to use which, how to use them, and we will then see why the situation is more dire than it looks. Before we start, since we are going to use GLSL as language, following are the reference functions we will use for the rest of the article. Also, the output of the pipeline will be expected to be sRGB all the time. To illustrate how sRGB, linear RGB and OkLab respectively look like, let's interpolate between two colors with each one of them: The 3 stripes were generated like this: Where is simply the x coordinate between 0 and 1, the left color, and the right one. The input colors are considered to be sRGB in input. Similarly, we always make sure to output sRGB at the end (with ) because that's what the pipeline expects. Key takeaways: The general consensus is as follow: if you need a color transition within a shape or texture, or some sort of color map, OkLab is the best tool, while linear is cheap, physically correct, and usually acceptable visually. Things are not as obvious as they seem when we work in monochrome. If instead of red and blue we pick black and white, this is what happens: Suddenly this tells a whole different story. sRGB becomes perfectly acceptable, while linear favors way too much the lightness, and OkLab remains the best. The linear gradient felt acceptable before, but now it is highly questionable. Just to be clear, the linear strip is linear, you can see it as linear energy or casually said "wattage", to which our perception does respond non-linearly. At this point one may even argue that sRGB looks best. So what can we do about this? First of all, we always need to question what we are trying to achieve, and fortunately sometimes we can take a few shortcuts. For example, let's say we want to depict a heat map in black and white. In my previous article I had to display 2D noise, so I wanted the observer to experience a linear perception of the "height" of the noise. In this case, working in sRGB (that is, doing zero effort with regards to perception) is actually a better call than mixing between black and white in linear space: Here we are comparing these two: We removed the out of the formulas because and , which have the same value when uncompressed to linear space. To do things right we may want to use OkLab but this feels overkill since this is just a straightforward monochromatic signal. Fortunately, the perceptual lightness can be fairly simple to model. With monochromatic input, OkLab uses , which is basically equivalent to do a gamma correction with . This means that we can simplify the OkLab interpolation we used before to the very simple: Doing this simple operation is exactly equivalent to interpolating between black and white in OkLab space, except it's just 2 extra multiplications. We still need to be extra careful if we want to swap the black and white. needs to be swapped before the gamma encoding, and that means before the sRGB gamma encoding as well: sRGB has a curve that closely approximates a gamma correction . So sometimes, instead of using , we may prefer to use the simpler . It means we could replace with the following to merge the two operations: And the white-to-black version: Whenever you use , make sure your input is positive. Adding a for safety might be reasonable in certain cases. The difference between a proper sRGB conversion and the combined gamma is pretty small: Sometimes, instead of fading colors into each others, we need to compose shapes, textures, masks, ... This need for compositing, or blending, arises when the pipelines are separated, meaning we are not working in the same fragment shader for everything. For example, we could have a shape generated in a fragment, which we need to overlay onto a surface. That shape might have some non-binary transparency, either for anti-aliasing purposes, a blur, or similar. If that shape were to be blend onto another colored surface, we would like to have the same effect as the gradient earlier. For well-known reasons , it is likely that this shape would end up as a pre-multiplied color, which would be blend onto one or more layers. If what I just said is confusing, I recommend checking out this good article on alpha compositing from Bartosz Ciechanowski . The literature is quite extensive on the subject so I will assume familiarity with it. Of course, if we are to do things right, the blending would have to happen in linear space. Do not consider sRGB blending in alpha blending, it's an even more terrible idea than before because of the bilinear filtering, transforms or mipmaps that can happen between the pre-multiplication and the blending itself. But that means we would end up with the linear gradient shortcomings from earlier, wouldn't we? And this is where things get ugly. Look at the difference between a linear and an OkLab blending, in black and white: If we invert the colors: We have the exact same problem as earlier, but seeing it with an actual blending of shapes makes the problem particularly striking. The white and black OkLab circles look the same size (because they are), and they don't have the unfortunate "bobbing" effect of the linear version (on the white onto black). The OkLab blending is done with pre-multiplied Lab colors. It is important not to pre-multiply linear values which are then converted to OkLab, this will give very unexpected results. The problem is, it is very unlikely that your whole graphics pipeline would switch to OkLab for every textures and buffers. And since most of the time the pipelines are built for more than just black and white, the cube hack suggested earlier has a very limited scope. In the case of shape blending, it is almost certain that the whole pipeline would not be contained in a single shader where you can just mix in OkLab. You're probably thinking of using sRGB, but in a blending pipeline this really is a terrible idea. In practice, neither sRGB nor pure linear blending give good results, and using OkLab is not always an option. And unfortunately, I don't have a good answer to this whole situation. My next article is about anti-aliasing where this problem also exists, and I must admit this whole ordeal puts me in quite some distress; I had to talk about this issue first.

Programming

0 views

pkh.me 5 months ago

Sharing everything I could understand about gradient noise

Adjusting our gradient noise functions to return the derivatives along with the noise value itself: The derivatives are in the first component so that given we can write for x/y/z partial derivatives instead of the more awkward . To get the equivalent derivatives for value noise, we just set g=0 in the final expression of the derivatives. It is possible to unroll the nested expression to (maybe) make it faster, but I find the mix expressions simple, elegant, and likely more numerically stable. There are important things to take into consideration when working with the derivatives. Most notably, it's important to follow the chain rule and the product rule . For example, let's say we're working with a noise function that returns the derivatives: If we multiply the input of our function by the frequency, then we need to multiply the derivatives by the same frequency factor. Similarly, let's say we want to move v from [-1,1] to [0,1] : The intuitive explanation is that if we are squeezing the value in half its original size, then the derivatives (slopes) are getting flatter as well. This may not sound very important, but failing to propagate these scale factors correctly often leads to subtle bugs. For example in the fBm, if we want it to return the correct derivatives it is necessary to do something like that (notice the ): Things get tricky when the noise is being adjusted with the derivatives themselves as we saw before with the erosion, especially since this would imply second order derivatives. Similarly, there is a technique involving the rotation of p at every iteration of the fBm to reduce the correlations between noise layers. This rotation is a linear transformation, but it requires careful adjustments to the derivatives as well. Implementing these transformations correctly is left as an exercise; careful bookkeeping of derivatives is essential. With this introduction we only explored the tip of the iceberg. For example, you'll be interested to know that there are alternative noises such as OpenSimplex , where we interpolate with lattice positions on a simplex grid (triangles in 2D, tetrahedra in 3D) instead of a rectangular grid. It has useful properties, for example in fBm it yields fewer directional artifacts, eliminating the need for ad-hoc rotations. Speaking of fBm, you may want to check out domain warping where nested fBm together make fancy effects: Another thing you'll be tempted to research is how to consistently scale the noise so that it fits into a controlled range of value. Spoiler alert: it's a particularly complex subject. Similarly, we stopped ourselves at 3D, but 4D is also useful, for example let's say we want to morph the 3D noise with time: p=(x,y,z,t) . I hope this article was able to give a good overview of the concepts, and I'll see you next time for new adventures.

Programming

0 views

pkh.me 1 years ago

Fixing the iterative damping interpolation in video games

Looking at the previous curves but now with the new formula and an adjusted rate: So there we have it, the perfect formula, frame rate agnostic ✅, deterministic ✅ and resilient to overshooting ✅. If you've quickly skimmed through the maths, here is what you need to know: Should be changed to: With the precomputed (where is the natural logarithm), or simply using as a rough equivalent. Also, any existing overshooting clamping can safely be dropped. Now please adjust your game to make the world a better and safer place for everyone ♥ As suggested on HN :

Performance

Gaming

Programming

0 views

pkh.me 2 years ago

Hacking window titles to help OBS

This write-up is meant to present the rationale and technical details behind a tiny project I wrote the other day, WTH, or WindowTitleHack , which is meant to force a constant window name for apps that keep changing it (I'm looking specifically at Firefox and Krita, but there are probably many others). I've been streaming on Twitch from Linux (X11) with a barebone OBS Studio setup for a while now, and while most of the experience has been relatively smooth, one particularly striking frustration has been dealing with windows detection. If we don't want to capture the whole desktop for privacy reasons or simply to have control over the scene layout depending on the currently focused app, we need to rely on the source. This works mostly fine, and it is actually able to track windows even when their title bar is renamed. But obviously, upon restart it can't find them again because both the window titles and the window IDs changed, meaning we have to redo our setup by reselecting the windows again. It would have been acceptable if that was the only issue I had, but one of the more advanced feature I'm extensively using is the (the builtin one, available through the menu). This tool is a basic window title pattern matching system that allows automatic scene switches depending on the current window. Note that it does seem to support regex, which could help with the problem, but there is no guarantee that the app would leave a recognizable matchable pattern in its title. Also, if we want multiple Firefox windows but only match one in particular, the regex wouldn't help. One unreliable hack would be to spam commands to correct the window title. This could be a resource hog, and it would create quite a bunch of races. One slight improvement over this would be to use , but that wouldn't address the race conditions (since we would adjust the title after it's been already changed). So how do we deal with that properly? Well, on X11 with the reference library ( ) there are actually various (actually a lot of) ways of changing the title bar. It took me a while to identify which call(s) to target, but ended up with the following call graph, where each function is actually exposed publicly: From this we can easily see that we only need to hook the deepest function , and check if the property is (or its "modern" sibling, ). How do we do that? With the help of the environment variable and a dynamic library that implements a custom . First, we grab the original function: We also need to craft a custom atom: With this we are now able to intercept all the events and override them with our own: We wrap all of this into our own redefinition of and… that's pretty much it. Now due to a long history of development, has been "deprecated" and superseded by . Both are widely used, but fortunately the APIs are more or less similar. The function to hook is , and defining is slightly more cumbered but not exactly challenging: Aside from that, the code is pretty much the same. To pass down the custom title to override, I've been relying on an environment variable . From a user point of view, it looks like this: We could probably improve the usability by creating a wrapping tool (so that we could have something such as ). Unfortunately I wasn't yet able to make a self-referencing executable accepted by , so for now the manual and environment will do just fine. To avoid a bunch of redundant function round-trips we need to globally cache a few things: the new title (to avoid fetching it in the environment all the time), the original functions (to save the call), and . Those are loaded lazily at the first function call, but we have no guarantee with regards to concurrent calls on that hooked function so we must create our own lock. I initially though about using but unfortunately the initialization callback mechanism doesn't allow any custom argument. Again, this is merely a slight annoyance since we can implement our own in a few lines of code: Which we use like this: I've been delaying doing this project for weeks because it felt complex at first glance, but it actually just took me a few hours. Probably the same amount of time it took me to write this article. While the project is admittedly really small, it still feel like a nice accomplishment. I hope it's useful to other people. Now, the Wayland support is probably the most obvious improvement the project can receive, but I don't have such a setup locally to test yet, so this is postponed for an undetermined amount of time. The code is released with a permissive license (MIT); if you want to contribute you can open a pull request but getting in touch with me first is appreciated to avoid unnecessary and overlapping efforts.

Open Source

Programming

Linux

0 views

pkh.me 2 years ago

Improving color quantization heuristics

Special thanks to for helping me a ton on the math area, this last formula is from them. Looking at the formula, we can see how similar it is to certain branches of the heuristics tree, so we can start getting an intuition about the result of the experiment. Short deviation from the main topic (feel free to skip to the next section): working in C within FFmpeg quickly became a hurdle more than anything. Aside from the lack of flexibility, the implicit casts destroying the precision deceitfully, and the undefined behaviours, all kind of C quirks went in the way several times, which made me question my sanity. This one typically severely messed me up while trying to average the colors: Anyway, I know this is obvious but if you aren't already doing that I suggest you build your experiments in another language, Python or whatever, and rewrite them in C later when you figured out your expected output. Re-implementing what I needed in Python didn't take me long. It was, and still is obviously much slower at runtime, but that's fine. There is a lot of room for speed improvement, typically by relying on (which I didn't bother with). I created a research repository for the occasion. The code to reproduce and the results can be found in the color quantization README . In short, based on the results, we can conclude that: To my surprise, normalizing the weights per box is not a good idea. I initially observed that by trial and error, which was actually one of the main motivator for this research. I initially thought normalizing each box was necessary in order to compare them against each others (such that they are compared on a common ground). My loose explanation of the phenomenon was that not normalizing causes a bias towards boxes with many colors, but that's actually exactly what we want. I believe it can also be explained by our evaluation function: we want to minimize the error across the whole set of colors, so small partitions (in color counts) must not be made stronger. At least not in the context of the target we chose. It's also interesting to see how the seems to perform better than the of the variance of each component most of the time. Admittedly my picture samples set is not that big, which may imply that more experiments to confirm that tendency are required. In retrospective, this might have been quickly predictable to someone with a mathematical background. But since I don't have that, nor do I trust my abstract thinking much, I'm kind of forced to try things out often. This is likely one of the many instances where I spent way too much energy on something obvious from the beginning, but I have the hope it will actually provide some useful information for other lost souls out there. There are two main limitations I want to discuss before closing this article. The first one is related to minimizing the MSE even more. We know the Median-Cut actually provides a rough estimate of the optimal palette. One thing we could do is use it as a first step before refinement, for example by running a few K-means iterations as post-processing (how much refinement/iterations could be a user control). The general idea of K-means is to progressively move each colors individually to a more appropriate box, that is a box for which the color distance to the average color of that box is smaller. I started implementing that in a very naive way, so it's extremely slow, but that's something to investigate further because it definitely improves the results. Most of the academic literature seems to suggest the use of the K-means clustering, but all of them require some startup step. Some come up with various heuristics, some use PCA, but I've yet to see one that rely on Median-Cut as first pass; maybe that's not such a good idea, but who knows. Another more annoying problem for which I have no solution for is with regards to the human perception being much more sensitive to light changes than hue. If you look at the first demo with the parrot, you may have observed the boxes are kind of thin. This is because the and components (respectively how green/red and blue/yellow the color is) have a much smaller amplitude compared to the (perceived lightness). You may rightfully question whether this is a problem or not. In practice, this means that when is low (let's say smaller than 8 or even 16), cuts along will almost always be preferred, causing the picture to be heavily desaturated. This is because it tries to preserve the most significant attribute in human perception: the lightness. That particular picture is actually a pathological study case: We can see the hue timidly appearing around (specifically it starts being more strongly noticeable starting the cut ). For now, I'm mostly done with this "week-end long project" into which I actually poured 2 or 3 months of lifetime. The FFmpeg patchset will likely be upstreamed soon so everyone should hopefully be able to benefit from it in the next release. It will also come with additional dithering methods , which implementation actually was a relaxing distraction from all this hardship. There are still many ways of improving this work, but it's the end of the line for me, so I'll trust the Internet with it.

Python

Programming

0 views

pkh.me 2 years ago

Porting OkLab colorspace to integer arithmetic

Translated into C we can write it like this: In case you are concerned about overflowing, is , so this is safe. In the integer version, our function input is within , so we need to make a few adjustments. The first issue we have is that with integer arithmetic our and are the same. We have as input, so because we are using an integer division, which in this case is equivalent to the expression in the float version. So while it's a simple and fast way to get and , we have an issue figuring out the ratio . One tool we have is the modulo operator: the integer division is destructive of the fractional part, but fortunately the modulo (the rest of the division) gives this information back. It can also be obtained for free most of the time because CPU division instructions tend to also provide that modulo as well without extra computation. If we give , we have the fractional part of the division expressed in the scale, which means we can derivate our ratio from it: . Slipping the division in our expression we end up with the following code: Testing this function for all the possible input of , the biggest inaccuracy is an off-by-one, which concerns 6280 of the 65536 possible values (less than 10%): 2886 "off by -1" and 3394 "off by +1". It matches exactly the inaccuracy of the float version of this function, so I think we can be pretty happy with it. Given how good this approach is, we could also consider applying the same strategy for , so this is left as an exercise to the reader. We're finally in our last function. Using everything we've learned so far, it can be trivially converted to integer arithmetic: Important things to notice: We're finally there. In the end the complete code is less than 200 lines of code and even less for the optimized float one (assuming we don't implement our own ). The complete code, test functions and benchmarks tools can be found on GitHub . Comparing the integer version to the reference float gives use the following results: I find these results pretty decent for an integer version, but you're free to disagree and improve them. The benchmarks are also interesting: on my main workstation (Intel® Core™ i7-12700, glibc 2.36, GCC 12.2.0), the integer arithmetic is slightly slower that the optimized float version: Observations: On the other hand, on one of my random ARM board (NanoPI NEO 2 with a Cortex A53, glibc 2.35, GCC 12.1.0), I get different results: Not that much faster proportionally speaking, but the integer version is still significantly faster overall on such low-end device. This took me ages to complete, way longer than I expected but I'm pretty happy with the end results and with everything I learned in the process. Also, you may have noticed how much I referred to previous work; this has been particularly satisfying from my point of view (re-using previous toolboxes means they were actually useful). This write-up won't be an exception to the rule: in a later article, I will make use of OkLab for another project I've been working on for a while now. See you soon!

Programming

1 views

pkh.me 3 years ago

GCC undefined behaviors are getting wild

Happy with my recent breakthrough in understanding C integer divisions after weeks of struggle, I was minding my own business having fun writing integer arithmetic code. Life was good, when suddenly… . That code wasn't messing with memory much so it was more likely to be a side effect of an arithmetic overflow or something. Using quickly identified the issue, which confirmed the presence of an integer overflow. The fix was easy but something felt off. I was under the impression my code was robust enough against that kind of honest mistake. Turns out, the protecting condition I had in place should indeed have been enough, so I tried to extract a minimal reproducible case: The overflow can happen on . Since an integer overflow is undefined, GCC makes the assumption that it cannot happen, ever. In practice in this case it does, but the condition should be enough to take care of it, whatever crazy value it becomes, right? Well, I have bad news: This is GCC on x86-64. We have as the result of the overflow, and nevertheless the execution violates the gate condition, spout a non-sense lie, go straight into dereferencing , and die miserably. Let's study what GCC is doing here. Firing up Ghidra we observe the following decompiled code: When I said GCC makes the assumption that it cannot happen this is what I meant: is not supposed to overflow so part of the condition I had in place was simply removed. More specifically since can not be lesser than , and since GCC assumes a multiplication cannot overflow into a random value (that could be negative) because it is undefined behavior, it then decides to drop the "redundant" condition because "it cannot happen". I reported that exact issue to GCC to make sure it wasn't a bug, and it was indeed confirmed to me that the undefined behavior of an integer overflow is not limited in scope to whatever insane value it could take: it is apparently perfectly acceptable to mess up the code flow entirely. While I understand how attractive it can be from an optimization point of view, the paranoid developer in me is straight up terrified by the perspective of a single integer overflow removing security protection and causing such havoc. I've worked several years in a project where the integer overflows were (and probably still are) legion. Identifying and fixing of all them is likely a lifetime mission of several opinionated individuals. I'm expecting this article to make the rust crew go in a crusade again, and I think I might be with them this time. Edit : it was made clear to me while reading Predrag's blog that the key to my misunderstanding boils down to this: "Undefined behavior is not the same as implementation-defined behavior". While I was indeed talking about undefined behavior, subconsciously I was thinking that the behavior of an overflow on a multiplication would be "implementation-defined behavior". This is not the case, it is indeed an undefined behavior, and yes the compiler is free to do whatever it wants to because it is compliant with the specifications. It's my mistake of course, but to my defense, despite the arrogant comments I read, this confusion happens a lot. This happens I believe because it's violating the Principle of least astonishment . To illustrate this I'll take this interesting old OpenBSD developer blog post being concerned about the result of the multiplication rather than the invalidation of any guarantee with regard to what's going to happen to the execution flow (before and after). This is not uncommon and in my opinion perfectly understandable.

Programming

Rust

0 views

pkh.me 3 years ago

Figuring out round, floor and ceil with integer division

Lately I've been transforming a float based algorithm to integers in order to make it bit-exact. Preserving the precision as best as possible was way more challenging than I initially though, which forced me to go deep down the rabbit hole. During the process I realized I had many wrong assumptions about integer divisions, and also discovered some remarkably useful mathematical properties. This story is about a journey into figuring out equivalent functions to , and with and integers, while staying in the integer domain (no intermediate transformation allowed). For the sake of conciseness (and to make a bridge with the mathematics world), and will sometimes respectively be written and . Better than explained with words, here is how the functions we're looking for behave with a real as input: The dots indicate on which lines the stitching applies; for example is , not . Here are the corresponding prototypes, in C: We're going to work in C99 (or more recent), and this is actually the first warning I have here. If you're working with a different language, you must absolutely look into how its integer division works. In C, the integer division is toward zero , for both positive and negative integers , and only defined as such starting C99 (it is implementation defined before that). Be mindful about it if your codebase is in C89 or C90. This means that in C: This is typically different in Python: In Python 2 and 3, the integer division is toward -∞, which means it is directly equivalent to how the function behaves. In C, the integer division is equivalent to only for positive numbers , otherwise it behaves the same as . This is the division behavior we will assume in this article: And again, I can't stress that enough: make sure you understand how the integer division of your language works. Similarly, you may have noticed we picked the function as defined by POSIX, meaning rounding half away from . Again, in Python a different method was selected: Python is following the round toward even choice rule. This is not what we are implementing here ( Edit : a partial implementation is provided at the end though). There are many ways of rounding , so make sure you've clarified what method your language picked. The integer division is symmetrical around but and aren't, so we need a way get the sign in order to branch in one direction or another. If and have the same sign, then is positive, otherwise it's negative. This is well expressed with a operator, so we will be using the sign of (where is a operator). Of course we only need to the sign bit so we could instead use but it is a bit more complex. Edit : note that is not when . Also, as pointed out on lobste.rs it's likely to rely on unspecified / implementation-defined behavior (hopefully not undefined behavior). We could use the safer form which only generates an extra shift instruction on x86. Looking at the graphics, we observe the following symmetries: We can translate these observations into code using a modulo trick (which purpose is to not offset the stitching point when the division is round): One may wonder about the double division ( and ), but fortunately CPU architectures usually offer a division instruction that computes both at once so this is not as expensive as it would seem in the first place. Now you also have an alternative without the modulo, but it generates less effective code (at least here on with a modern CPU according to my benchmarks): Edit : note that these versions suffer from undefined behavior in case of as pointed out by in previous comment about . I have no hard proof to provide for these right now, so this is left as an exercise to the reader, but some tools can be found in Concrete Mathematics (2nd ed) by Ronald L. Graham, Donald E. Knuth and Oren Patashnik. In particular: The function is the most useful one when trying to approximate floats operations with integers (typically what I was looking for initially: converting an algorithm into a bit-exact one). We are going to study the positive ones only at first, and try to define it according to the integer C division (just like we did for and ). Since we are on the positive side, the division is equivalent to a , which simplifies a bunch of things. I initially used a function defined as and thought to myself "if we are improving the accuracy of the division by using a offset, why shouldn't we also improve the accuracy of by doing instead?" Very proud of my deep insight I went on with this, until I realized it was causing more off by ones (with a bias always in the same direction). So don't do that , it's wrong, we will instead try to find the appropriate formula. Looking at the function we can make the observation that it's pretty much the function with the offset by : So we have: We could stop right here but this suffers from overflow limitations if translated into C. We are lucky though, because we're about to discover the most mind blowing property of integers division: You may not immediately realize how insane and great this is, so let me elaborate: it basically means successive truncating divisions can be merged into one without loss of precision (and the other way around). Here is a concrete example: That's great but how does that help us? Well, we can do this now: How cute is that, we're back to the original formula I was using: (because again the C division is equivalent to for positive values). Now how about the negative version, that is when ? We can make the similar observation that for a negative , , so we have: And since is negative, the C division is equivalent to . So in the end we simply have: This is the generic version, but of course in many cases we can (and probably should) simplify the expression appropriately. Let's say for example we want to remap an to an : . The appropriate way to round this division is simply: , or just: . Edit : it was pointed out several times on HackerNews that this function still suffer from overflows. Though, it remains more robust than the previous version with . Equivalent to , this function provided by Antonio on Mastodon can be used: This only works with positive values. Since you should definitely not trust my math nor my understanding of computers, here is a test code to demonstrate the exactitude of the formulas: These trivial code snippets have proven to be extremely useful to me so far, and I have the hope that it will benefit others as well. I spent an unreasonable amount of time on this issue, and given the amount of mistakes (or at the very least non optimal code) I've observed in the wild, I'm most certainly not the only one being confused about all of this.

Programming

Python

0 views