GreatReads - Blog Aggregator · Phoenix Framework

Rust

0 views

fnands 5 months ago

Can multi-sensor foundation models be more than the sum of their parts?

Geospatial foundation models (GFMs) have been on my mind recently, partially because I attended the ESA-NASA International Workshop on AI Foundation Model for EO and partially because I’ve been working on fine tuning some GFMs for downstream use at work for the last while. This post was in part prompted by two recent LinkedIn posts, one by Christopher Ren and the other by Madeline Lisaius , both of which express some amount of skepticism about the way in which current GFMs are trained, although from somewhat different angles. Christopher Ren also wrote an expanded blog post on the subject, which takes aim mostly at IBMs new TerraMind GFM, but it is worth reading the responses from one of the TerraMind authors at the bottom of the post as that adds some nuance to the arguments. It’s somewhat clear that GFMs are a hot topic in the Earth Observation (EO) space at the moment, and it is fair to question whether the hype is warranted. At the ESA-NASA workshop one of the points made was that there seems to be much more activity in the creation of GFMs than actual downstream use of them so far, and there were some interesting discussions as to why this might be the case at the moment. A recent post from Bruno Sanchez-Andrade Nuño (director of the Clay project) also made me think that there is a rough bifurcation in the GFM space appearing: one branch goes deep and the other goes wide. I think it is best if we understand which branch a model fits into and not judge one by the standards of the others. I’m not going to directly respond to the other conversations going on: I’m just adding my two cents to the mix, and I want to be clear that the work I am doing definitely falls into the “go deep” branch, and my opinions are very much coloured by that fact. On the surface this might seem like a slightly odd question seeing as one of the principle reasons people are interested in GFMs (and FMs in general) is better generalization: EO is after all often a global endeavour, and it is desirable to have a foundation model that will help your downstream tasks generalize across geographies, illumination conditions, imaging angles etc. But there are many aspects to generalization, some of which don’t apply to all sensors. An example is the time of day an image was taken at. This can strongly affect what your image looks like, as shadows and illumination levels can vary greatly by time of day. This however does not really affect missions like Sentinel-2, where the orbit has been selected in such a way that the mean local solar time when the image is taken is always approximately 10:30 am, leading (by design) to very consistent illumination levels. Similar arguments go for viewing angles. One of the ways that people have been trying to get more general is to train foundation models on multiple sources. Examples of this are the Clay foundation model (at least V1) which was trained on a wide range of sensors from MODIS, with a 500 m GSD, to aerial imagery of New Zealand at under 50 cm GSD: Another example of this is DOFA, which takes a similar approach to variety in input sensors, this time including hyperspectral data at 224 spectral bands: The DOFA paper is worth a read, and putting on my scientist hat: this is really interesting work, and it’s really interesting to see the different solutions that these authors have come up with to make a single model deal with such varied inputs. But putting back on my engineer hat I have to ask: what do we gain from this? One of the points made in the DOFA paper: The increasing number of specialized foundation models makes it difficult to select the most appropriate one for a specific downstream task. On the surface this sounds fair, but is it really that hard to go to PapersWithCode , find the most similar dataset to your downstream task and select a model based on that? I can’t really think of a scenario where you would not just spend a day or two searching through the literature for the most fitting model for your particular use case. The one case I can maybe think this might be the case is if you are a geospatial person with no ML skills and have a model that was set up for you as a black box behind some interface and squeezing every last bit of performance out is not critical to you. When implementing a model for a specific product need, one often focuses on the specific sensor, or at least a set of very similar sensors, e.g. sub-meter GSD sensors with at least the four usual spectral bands. When building a product that will utilize exclusively Sentinel-1 data, does the model really gain anything from being trained on Sentinel-2 and aerial imagery as well? With all that being said, if you do have multiple sensors available at inference time (e.g. Sentinel-2 and Sentinel-1 data), it does seem to make sense to train/infer on multiple modalities at once. See e.g. table 2 in the TerraMind paper . A while ago we were testing out a few foundation models as backbones for a product we are developing, which boils down to bi-temporal change detection using Planet’s SkySat constellation. We chose the main backbone we are using based on benchmarks, but I did have the nagging question of how much do we really gain from this, and if other backbones might offer better performance? This was basically the theme of my talk at the aforementioned ESA-NASA workshop. I ran a few test using a variety of FM backbones, some trained on remote sensing data, and some just on natural images, just to see how much the pre-training dataset really matters. To make the comparison fair, the backbones used all had around 100 M parameters, but I did throw in a smaller baseline ( ChangeFormer ), as well as a 300 M version of the best performing network just to see if size matters (spoiler: it does). One of the most interesting comparisons here is DINOv2: I used two variations, one using the original weights trained on natural images from Meta, and another with weights from Keumgang Cha , which were trained on the MillionAID and SkyScript datasets. MillionAID is exclusively aerial imagery, while SkyScript contains mostly aerial imagery, plus some SkySat, Sentinel-2 and Landsat images. It’s abundantly clear that the same architecture trained on remote sensing images greatly improve downstream performance compared to a the variant that was trained on natural images. This is expected, but it’s impressive to see how large this gap is. The best model we tested was trained mostly on aerial imagery, showing the domain gap isn’t so much whether or not your sensor is in space or on a plane, but has more to do with similar resolutions. The models were all trained for the same number of epochs, on the same relatively small dataset (around 1600 patches of 512 x 512 pixels) with the same optimizer etc. The encoders were not frozen, but trained with a lower learning rate than the decoders, as is common practice in most transfer learning scenarios. I will caveat this all with saying that I didn’t do a massive amount of hyperparameter tuning for this test, but I think the differences are significant enough that it probably wouldn’t make too much of a difference. What I would need to see to be convinced is that when training a foundation model on multiple sensors that it would perform better on downstream tasks on each of the sensors than it would be if it was trained exclusively on the specific sensor to be used. I.e. one would need to show that the model would be more than the sum of it’s parts. The question is pretty much, given the same architecture, compute budget and dataset size, can a model learn something from one sensor that improves its performance on another? Or could it be that we need to throw everything into a big bucket and burn a lot of compute in the fashion of the current big LLMs that are so popular right now in order to really see generalization? I’m definitely not ruling out the possibility that there might be some case (e.g. the sensor you are targeting doesn’t have a lot of data available), but I have the feeling that the further away in GSD and spectral characteristics you go the less helpful pre-training becomes. It’s fairly obvious that the best GFM you can choose will likely be the one trained on the exact sensor you are targeting for your downstream task. This is fairly easy for sensors like the Sentinel missions or the Landsat missions, where anyone with a bit of compute to burn can easily download tons of data from those missions and train a model. Even for aerial imagery there is a lot of open data available, with the caveat that the data is not as global, and aerial sensors do have some sensor to sensor variability. Where this gets tricky is in the commercial domain, where data isn’t freely available and providers put strict licenses on their data 1 . To train a foundation model on commercial data requires you to dump somewhere between hundreds of thousands up to millions of Euros on data alone, which is infeasible for most researchers, and a significant investment for most companies. The only case that I know of so far of someone creating a sensor specific foundation model is a Pleiades Neo foundation model created by Disaitek , which was made possible by being granted access to Pleiades Neo imagery through a “Call for Innovation” from Aribus and CNES. Disaitek of course does not make this model public, as this presumably gives them a bit of an edge over their competitors, and as the model was trained on data covering France only, it is questionable of how much use it would be in other parts of the world. So what can be done in the commercial space? Most companies don’t have access to enough data to easily train a foundation model, and those who do are unlikely to share it as it gives them an edge over their competition. The only players with both the access to the data and the incentive to make these models available to others are the imagery providers themselves, i.e. Planet, Airbus, Maxar, Blacksky, Capella etc. Do I think these providers will just open these models for all to use? I doubt it, but they might offer it as a perk to their customers. I.e. something along the lines of “buy at least X Euro worth of imagery per year and get access to our FM”. The competition in the 30 cm class imagery space seems to be heating up, with several players building up large constellations of satellites in this resolution range, like Maxar’s Legion, Planet’s Pelican and BlackSky’s Gen-3. One way these providers can differentiate their offerings would be by offering a foundation model trained on their specific sensor. Whether I think it’s likely that they do this is another question. Please take this post for what it is: the opinionated rant of someone who works in a somewhat privileged niche of the EO domain where I have a lot of expensive VHR data to play with. The problems I am trying to solve and the constraints I have are likely quite different from those that others might encounter. With that being said, if you find yourself if a similar boat to me and are wondering which foundation model to pick for your particular task: pick the one trained on the closest thing you can find to the sensor you are targeting. I am kind of hoping that someone does prove me wrong, and I will happily write an apology post if some does so. The one exception here is Umbra, who have a very generous open data program , and probably have enough data there that anyone can just train a decently sized model on their data.↩︎ The one exception here is Umbra, who have a very generous open data program , and probably have enough data there that anyone can just train a decently sized model on their data.↩︎

Machine Learning

0 views

fnands 7 months ago

A quick first look at GPU programming in Mojo

The day has finally arrived. Well actually, the day arrived in February, but who’s counting. The Mojo language has finally publicly released the ability to do GPU programming - if you have a reasonably modern NVIDIA GPU. Luckily for me, I have an RTX 3090, and although it isn’t officially supported , it is basically an A10, which is. Looking at some of the comments on the nightly releases, it does seem that AMD support is on the way as well. The Modular team publicly released the ability to do GPU programming in Mojo in release 25.1, with further support and documentation in release 25.2. Fun fact: release 25.2 also saw my first (tiny) contribution to the Mojo standard library. This is a really important step for Mojo, a language that bills itself as a language designed to solve a variety of AI development challenges, which in this day and age basically means programming an increasingly heterogeneous stack of hardware. Today this mostly means GPUs, but there is an explosion of new accelerators like the ones from Cerebras, Groq and SambaNova, not to mention the not-so-new TPU from Google. As DeepSeek showed the world recently: if you’re willing to put the work in, there is a lot more to be squeezed out of current-gen hardware than most people thought. Now, I don’t think every ML engineer or researcher should be looking for every possible way to get more out of their compute, but there are definitely some wins to be had. As an example, I’m really fascinated by the work of Tri Dao and his collaborators, who work on deeply hardware aware improvements in machine learning, e.g. FlashAttention , which is mathematically equivalent to the attention mechanism that powers all transformer models, but with hardware aware optimizations that take into account the cost of memory access in GPUs. This does make me wonder what other optimizations are out there to be discovered. This however is not easy, as the authors note in the “Limitations and Future Directions” section of the FlashAttention paper: Our current approach to building IO-aware implementations of attention requires writing a new CUDA kernel for each new attention implementation. This requires writing the attention algorithm in a considerably lower-level language than PyTorch, and requires significant engineering effort. Implementations may also not be transferrable across GPU architectures. These limitations suggest the need for a method that supports writing attention algorithms in a high-level language (e.g., PyTorch), and compiling to IO-aware implementations in CUDA What makes GPU programming in Mojo interesting is that you don’t need the CUDA toolkit to do so, and compiles down to PTX which you can think of as NVIDIA’s version of assembly. If Mojo (and Max in general) can make it easier to write GPU kernels in a more user-friendly language, it could be a game changer. If you want to get started, there is a guide for getting started with GPU programming in Mojo from Modular (the company behind Mojo), which I strongly recommend. I learn by doing, so I wanted to try to implement something with relatively simple using the GPU. The example idea I chose is to transform an RGB image to grayscale, which is an embarrassingly parallel problem without a lot of complexity. I was halfway through writing this post before I realized that there was already an example of how to do grayscale conversion in the Mojo repo, but oh well. I basically just start with what’s in the documentation, but I added another example that I did do myself. To start, let’s read in an image using mimage , an image processing library I am working on. The image is represented here as a rank three tensor with the dimensions being width, height and channels, and the data type is an unsigned 8-bit integer. In this case we have four channels: red, green, blue and alpha (transparency), the latter being 255 for all pixels. So what we want to do here is to sum together the RGB values for each pixel, using the weights , and for red, green and blue respectively. If you want to know why we are using these weights, read this article . Now that we have that, let’s define a simple version of the transform we want on CPU. So hopefully that worked! Let’s see if it’s correct. I haven’t implemented image saving in mimage yet, so let’s use the good old Python PIL library to save the image. Now that we have a working CPU implementation, let’s try to implement the same function on the GPU. But first, let’s check if Mojo can actually find my GPU: Now that we know that Mojo can find our GPU, let’s define the function that will do the actual conversion. This kernel reads a pixel from the input tensor, converts it to grayscale and writes the result to the output tensor. It is parallelized across the output tensor, which means that each thread is responsible for one pixel in the output tensor. As you can see, it takes in as parameters the layout specifications of the input and output tensors, the width and height of the image, and the input and output tensors themselves. Now, the first slightly awkward thing I had to do was convert the image from a , which is what is returned by , to a , which is the new tensor type that is compatible with GPU programming. I am assuming that will be deprecated in the future. With this new tensor type you can explicitly set which device the tensor should be allocated on. In this case I will allocate it to the CPU, i.e. the host device, and then copy over the data from the old tensor to the new one. Next, we have to move the tensor to the GPU. Now that was easy enough. The next step is to allocate the output grayscale tensor. As we don’t need to copy over the data from the old tensor, we can just allocate it on the GPU immediately. Next, we get the layout tensors for the input and output tensors. The documentation on LayoutTensor is a bit sparse, but it seems to be there to make it easy to reason about memory layouts. There seems to be two ways to use GPU functions in Mojo. The first is to use the function, which is what I do here. This compiles the gpu kernel into a function which can be called as normal. While this function is being executed on the GPU, the host device will wait until it is finished before moving on. Later in this post I will show the other option which allows the host device to do other things while waiting for the GPU. And that’s it! Let’s call the GPU function. Here I will device the image up into blocks of 32x32 pixels, and then call the function. I have to admit, I have no clue what the best practices are for choosing the block size, so if you know a good rule of thumb, please let me know. I wonder if there is a way to tune these parameters at compile time? Once that is run, we move the grayscale tensor back to the CPU and compare the results. and there we have it! We have successfully converted an image to grayscale using the GPU. Another example I wanted to try is downsampling an image. This is a bit more complex than the grayscale conversion, because we need to handle the different dimensions of the input and output tensors. First let’s define some test images to make sure the function is doing what we expect. If this works we should have a downsampled 8x8 image with the same values as the original image. Let’s start with a CPU implementation: So it works! This does make some assumptions about the input image, like that it is a multiple of the factor. But good enough for a blog post. Now let’s try to do the same on the GPU. We again define our output tensor on the GPU, get the layout tensor and move the data from the host device to the GPU. This time we will try the other way of using GPU functions: enqueing the function(s) to be executed on the GPU. This means the host device will not wait for the GPU to finish the function, but can do other things while the GPU is running. When we call the host device will wait for the GPU to finish all enqueued functions. This allows for some interesting things, like running the GPU function in parallel with some other code on the host device. This is can also be a little bit dangerous if you try to access the GPU memory from the host device while the GPU is still running. Let’s try it out: Again, it works! Let’s try it on our original image, and downsample it by a factor of 2 and 4. Let’s also do a CPU version for comparison, and define the output tensors on the GPU. Now we can call the GPU function. Notice how we can enqueue a second function while the first one is still running. As it does not depend on the first function to finish, it can potentially start running before the first function has finished. Now let’s verify the results: Great! We can save these and see what they look like: And as we can see, the images get progressively more blurry the more we downsample. This was my first quick look at GPU programming in Mojo. I feel the hardest thing is conceptually understanding how to properly divide the work between threads, and how to assign the correct numbers of threads, blocks and warps (which I didn’t even get into here). I guess the next move is to look up some guide on how to efficiently program GPUs, and to maybe try some more substantial examples. The documentation on GPU programming in Mojo is still a bit sparse, and there aren’t many examples out there in the wild to learn from, but I am sure that will change soon. The Moduar team did say they are releasing it unpolished so that they can gather some community feedback early. For someone who uses GPUs a lot in my day job, I never really interact with the GPUs at a low level; it’s always through PyTorch or JAX or some other layer of abstraction from Python. It’s quite fun to have such low level access to the hardware in a language that doesn’t feel that dissimilar from Python. I think this is really where I am starting to see the vision behind Mojo more clearly. I think the shallow take is that Mojo is a faster Python, or basically some ungodly hybrid between Python and Rust, but the more I play with it the more I feel it’s a language designed to make programming heterogenous hardware easier. I don’t think it will be the only language like this we’ll see, and I am curious to see if other languages based on MLIR will pop up soon, or if some existing languages will adapt. Maybe basing Julia 2.0 off MLIR instead of LLVM is a good next move for the language. You only need to look at the schematic off Apple silicon chips these days to see which way the wind is blowing: a significant fraction of the chip is dedicated to GPU cores. I think the days where having a GPU attached to your computer was only for specialists is going out the window, and we might pretty soon be able to safely assume that every modern computer will have at least a decent amount of GPU cores available for general purpose tasks, and not just graphics. Still, I doubt most programmers will ever have to worry about actually directly programming GPUs, but I am interested to see how libraries take advantage of this fact.

Programming Julia

Hardware

0 views

fnands 1 years ago

Speeding up CRC-32 calculations in Mojo

In a previous post on parsing PNG images in I very briefly mentioned cyclic redundancy checks, and posted a rather cryptic looking function which I claimed was a bit inefficient. In this post I want to follow up on that a bit and see how we can speed up these calculations. For reference, this post was done with Mojo 24.5, so a few language details have changed since my last post (e.g. got moved to the top-level and a few of it’s functions have been renamed). I actually wrote most of this post in June with Mojo 24.4, but ran into a bit of an issue which has now been resolved. It even resulted in a new unit test for Mojo, so thanks to Laszlo Kindrat for fixing the issue, and for soraros and martinvuyk for helping figure out what the actual issue was. But first, let’s go through a bit of background so we know what we’re dealing with. CRCs are error detecting codes that are often used to detect corruption of data in digital files, an example of which is PNG files. In the case of PNGs for example the CRC-32 is calculated for the data of each chunk and appended to the end of the chunk, so that the person reading the file can verify whether the data they read was the same as the data that was written. A CRC check technically does “long division in the ring of polynomials of binary coefficients ( )” 😳. It’s not as complicated as it sounds. I found the Wikipedia article on Polynomial long division to be helpful, and if you want an in depth explanation then this post by Kareem Omar does an excellent job of explaining both the concept and implementation considerations. I won’t go deep into the explanations, so I recommend you read at least the first part of Kareem’s post for more background. I pretty much use his post as a guide. Did you read that post? Then welcome back, and well continue from there. But tl;dr: XOR is equivalent to polynomial long division (over a finite field) for binary numbers, and XOR is a very efficient operation to calculate. Essentially what a CRC check does in practice is to run through a sequence of bytes, and iteratively perform a lot of XORs and bit-shifts. By iteratively going through each bit, one can come up with a value that will (nearly certainly) change if data is corrupted somehow. The CRC-32 check from my previous post looked something like this: I’ll step through this in a moment, but the first thing you might notice here is that I am reversing a lot of bits here. The table argument is a bit of future proofing, which we won’t need for now, but will become apparent soon. Just ignore it for now. This is because when I was implementing this function (based off a C example), I implemented a little-endian version of the algorithm, while PNGs are encoded as big-endian. It’s not a huge deal, but it does mean that I am constantly reversing bytes, and then reversing the output again. We can make this better by implementing the big-endian version: This is very similar, and just entails that we use a the reverse of the polynomial we did before (if you bit reverse you get ). This also saves us one 24-bit bit-shift, as we are now working on the bottom 8 bits of the instead of the top 8. Just to verify that these implementations are equivalent, let’s do a quick test: And there we go, a more elegant version of the CRC-32 check I implemented last time. As the theme of today’s post is trying to speed things up, let’s do a little bit of benchmarking to see if this change has saved us any time. As we are doing one fewer bit reverse and bit shift per byte, as well as a final reverse, we should see a bit of a performance uplift. So let’s define a benchmarking function. This function will take two version of the CRC32 function and benchmark their runtimes. We have to a little bit of work first. The functions we test need to return , so we need to wrap our functions in functions with no return value. Note the calls: the compiler will realize that is never used and will compile this away unless you instruct it to keep them: Next we need a test case. I’m not sure if there is a nicer way to fill a List with random values yet, but for now we can list create an , alloc some space, and fill it with random numbers using . Then we can init a using that data: And finally we are ready to benchmark: Nice! So just by avoiding a few bit reversals and bit shifts we get about a 30% uplift in performance depending on the run. Note: I am doing this in a Jupyter notebook, so there is a bit of variance from run to run. While we’re checking performance, let’s see how this implementation would perform in Python: And let’s just do a sanity check to assure ourselves that we produce the same CRC-32 value given the same bytestream: That’s pretty slow in fact this means that: So that’s a nice little speedup we can get by writing essentially the same logic in Mojo. Now this is a bit unfair. If you actually wrote the this function in Python for anything other than educational purposes, then you are using Python wrong, but I’ll get back to how you would actually do it later. Now, the majority of CRC-32 implementations you see will usually use a table. This is called the Sarwate algorithm and exploits the fact that as we are operating on data one byte at a time, there are only 256 unique values that the calculation in the inner loop of the algorithm can take. This is a small enough table (1024 bytes for 256 unsigned 32-bit integers) that it can easily be kept in the L1 cache during computation. Let’s try it out to see how much faster this gets us. First, we need to pre-compute the table: We’ve now effectively amortized the cost of the inner loop and replaced it with a table lookup which should in principle be faster. Our CRC-32 algorithm now take this form: Now we can test it out! So a speedup of around 4-5 times as fast as when we started. This is already a pretty good result, but can we do better? In principle, we could load the data as a 16-bit integer at a time and use the same trick above to build a table. However, such a table would have 65536 entries, resulting in a 256 KB table, which is a lot bigger than the 48 KB of L1 cache on my machine. It turns out, we can still do two bytes at a time . Following the description given by Kareem Omar , we realize that if we have two successive bytes in our message, let’s call them and : if we pad these messages with zeroes What this means is that if we have a CRC algorithm that works on a 16-bit message called M then: For the operation on , leading zeros don’t affect the calculation, so we can just use the 8-bit CRC algorithm we have developed above: The same does not hold for trailing zeroes. However, as will always be of the form and there can only be 256 unique values for , we can build a new table with 256 entries to look up these values. We can build two separate 256 entry tables, or we can just build one 512 entry table, So let’s construct this new table: We have to now update our algorithm to do two table lookups instead of one. Additionally, we have to modify our algorithm to read the data two bytes at a time. So let’s see if this speeds things up: Nice, now we are at an around 7x speedup to where we began. But can we go further? There’s nothing in the above that forces us to only use two bytes at a time. Nothing stops us from doing 4 bytes, i.e. 32-bits at a time with similar logic to above. I’ll quickly create a function that will fill an arbitrarily sized table: Now to create the 4-byte function: And presto, it still works! Let’s see how much faster we are now? A 14 times speedup over where we started! But why stop there? We can in principle explicitly write out version that take 8, 16 or however many bytes at time. This get’s a little long-winded, so I’ll write some generic functions to make functions of arbitrary size. Let’s do a quick sanity check to see if this works: Let’s try and increase the table size as far as we can. We’ll go up in powers of two, and see how far we can go. And there it is. At least on my machine, 32 bytes is the limit, maxing out at a roughly 40 times speedup over the naive implementation. After that we start to see a performance decrease. Let’s plot this to see the trend: As you can see from the above, with a 32-byte table we hit our maximum speedup of around 40 times as fast as the naive implementation. After that, the performance falls off a cliff. Why would this be? If you read the blog post I linked above you already know the answer: Cache. In Kareem Omar’s original post the recommendation is to not go above a 16-byte table, as this will take up approximately half of the standard 32 KB of L1 cache on most computers. However, since this post was written in 2019, L1 cache sizes have increased a bit, and on my machine with a 48 KB L1 cache the 32-byte table performs best, but it’s clear that once you go past that you run into cache issues. This is actually a place then where some compile-time metaprogramming might help: depending on the size of your L1 cache, you might want to compile your application with a 16-byte or a 32-byte table. Or if you want to future proof your application for the (presumable not so distant) future case where the CPUs have at least 128 KB of L1 cache, then you can even add the option of a 64-byte table. At some point Mojo had an autotune functionality that would have been able to do this, but it was removed and we’ll have to wait for it to be added back. Now if you read Kareem’s post, you might realize he went even further by calling some (hardware accelerated) intrinsic functions to go even faster. There is a caveat here in that in that case, the polynomial used is baked into the silicone, and the variation for which this works is called where the C stands for Castagnoli, and importantly this is not the variation that is used for PNG checksums, so I won’t go further with this. Taking our best result from above, we get: So an astounding 500 times speedup over pure Python. Now this is a completely unfair comparison, as I am comparing the naive Python implementation to the optimized Mojo one. Now as I hinted before, no-one in their right mind would write a CRC-32 check like I did above in Python. What one would really do is use zlib from Python’s standard library. In this case we get: So when using Python as you realistically would, Python is actually still more than three times faster than our best effort so far! Of course, the counterpoint to this is that the zlib implementation is done in C, not Python, so we are effectively comparing Mojo to C at this point. The above begs the question however, why is the zlib version still faster than Mojo? What kind of black magic is happening in zlib? Well, this lead me down a rabbit hole and I ended up reading a pretty informative whitepaper by Andrew Kadatch and Bob Jenkins titled Everything We Know About CRC But Afraid To Forget . Now, I can’t find where this was officially published, and the only record of this seems to be in some guy’s github repo. I’m kidding a little on the last point, it’s in the zlib repo which is maintained by Mark Adler . Update: I have been informed that it was originally published as part of the release of crcutil on Google Code . Thanks to jorams on HN who pointed this out. The zlib CRC32 implementation is written in C that has been optimized to within an inch of it’s life, and has so many statement in there that it’s hard to know which way is up. In any case, there is a commit by Mark Adler from 2018 that’s titled Speed up software CRC-32 computation by a factor of 1.5 to 3. . Well, that’s about the amount of performance I am missing, so I guess that’s where I need to start looking. The commit message states: Use the interleaved method of Kadatch and Jenkins in order to make use of pipelined instructions through multiple ALUs in a single core. This also speeds up and simplifies the combination of CRCs, and updates the functions to pre-calculate and use an operator for CRC combination. It’s a pretty large commit and reads about as easily as hieroglyphics, so it might take me a moment to digest what’s going on there. The whole thing that kicked this work off was reading PNG image in Mojo. So how much faster does this improved CRC32 check make reading PNGs? Not much it seems. Reading a pixel PNG image with Mimage is now about 3.5% faster. I suspect the majority of the time reading PNGs is either spent reading from disk or decompressing the actual data. But hey, faster is faster. I’m wondering if/when Mojo will get it’s equivalent of zlib, and what shape that might take.

0 views

fnands 1 years ago

Building your own personal ghostwriter

This past weekend I participated in a AI hackathon organized by Factory Network and {Tech: Berlin} with my friends Axel Nordfeldt and Jonathan Nye . The sponsors were Mistral , Weaviate and LumaAI , meaning we had a bunch of fun credits to play around with. I’m an ML engineer, but I haven’t had that much exposure to the world brave new world of AI engineering that consists of calling LLM API’s and prompt engineering, so I wanted to use the hackathon as a way to get a better understanding of which tools are out there are how people are using them. We spitballed a few ideas the week before the event, and eventually settled on building a kind of ghostwriter during the event. The idea wasn’t to get chatbot to write you a story based on a minimal prompt, but rather to have something that interviews you in depth, and uses the interview as the basis of a blog post, short story or article. The idea was inspired by an experience Axel had where he went on a trip and later recounted it to a friend who is a talented writer. The friend then wrote up a summary of the trip and gave it to Axel, who was impressed by how well the friend had captured the essence of the trip. Writing about my own travel experiences is something I also wish I could do better. Not to for the sake of anyone else, but just for my own memories. I went to India in 2012 and I have a handful of photos of the trip, but I didn’t write much down at the time. I’d love to have a more detailed account of the trip, but I can’t remember all the details anymore. I recently found a blog post by someone I had met on the trip which even includes a quote from me, which I hardly remember. I struggle with journaling or keeping diaries in general, so I would like to have a tool that can help me write about my experiences. I don’t know if it is just because writing is hard (part of the reason I keep this blog is to practice writing), or if it is just because I feel awkward writing about myself. What isn’t that hard however is telling someone about my experiences, especially someone who knows how to ask pertinent questions. So the idea was to build a tool that could ask you questions about an experience, and then write up a summary based on your answers. Effectively the way a ghostwriter would work, but instead of writing a book for you, it would write a blog post or short story. For the two-day hackathon we had access to the sponsors’ APIs. We especially wanted to play around with Mistral’s API, which had just recently added a new multi-modal model, Pixtral that can understand images as well as text. So we basically built the app using Streamlit (called JournalAIst) as two parts: an interviewer and a writer. Basically we had a chat loop in which you could upload images (interpreted by Pixtral) and during which the interviewer model ( ) would ask you questions about your experience and the images that you added. After answering the interviewers questions for a while, you can choose to end the conversation, after which you are sent to the writer. The writer will then be given the transcript of your conversation, as well as Pixtral’s descriptions of any images you might have uploaded. The model then uses this context to write a few paragraphs worth of text about your experience. You get to choose if you want a blog post, short story or article, each of which having a slightly different tone and viewpoint. We had a little bit of time left at the end so we ended up also adding a section where we fed a summary of your post to LumaAI’s Dream Machine model, which generates a five second video based on this input, which was a cute little addition. Try and imagine the story that resulted in this video: Getting the whole thing up and running was surprisingly easy given that none of us had any experience with Streamlit (shout-out to Vindiw Wijesooriya who’s mistral-streamlit-chat repo pointed us in the right direction). The story writer worked pretty well off the bat, but getting the interviewer to work well was the biggest challenge. None of us had a lot of experience prompt engineering, and the interviewer kept repeating questions (like asking who you were with three times in a row), so messing with the interviewer prompt was one of the most time consuming steps. If you have ever been interviewed by a good interviewer you know that someone skilled at the craft of interviewing can steer a conversation in interesting directions, and that is the feeling we were going for here. Easier said than done. It definitely worked best when your initial response contains quite a bit of context, hence why our initial question to the user asks for at least a few sentences. Eventually we can up with something that gave a reasonable experience, but I feel there is a lot to be done there. Maybe coming up with a set list of questions, and only having the model do follow up questions? We made the story writer output the story it writes as markdown so they can easily be viewed and used as blog posts. As for the images we did something janky: we saved them as where is the order in which the images were uploaded, and told the model to refer to them in that way in markdown. This felt a little weird at first, but it seems to work well enough. The model will just dump a in the markdown when it wanted to refer to the images, and it worked most of the time. In case you were wondering, no I didn’t use JournalAIst to write this blog post. The language is clearly not flowery enough. To give you an idea of the quality of the writing, here are a few samples: I recounted my experience of being in the Kruger National Park with a couple of friends last year. For this case I basically just threw in a handful of details and sent the model off to write. See the video at the end that we created while the LumaAI credits were still good. The video seems to try and combine a Lion, Elephant and Zebra, with the Elephant getting the Elephant’s share of the mix. Also, note the abomination in the background. Blog Post: Kruger Story 1 Despite the minimal details, it did summarise what I told it quite truthfully, but with a healthy dash of breathless wonder, which doesn’t sound like me. This might be as the prompt was write in the style of Bill Bryson, but even there I wouldn’t say it sounds a lot like him either. The prompt also requests the model to write in a “humorous and engaging” manner, so I guess that might be the reason it sounds as it does. I tried giving it another go (caveat, this one was generated after the free credits expired, so we switched the model to ), but this time added more context and chatted for a bit longer. Blog Post: Kruger Story 2 Again, the story is reasonable, although I wished it was maybe a bit longer? Or maybe it is mercifully short, depending on your perspective. I also asked it to write up my experience of the hackathon with a few pictures I took during the event. I tried to continue the interview for a while to give as much info as possible. Journal Entry: Hackathon Story The story is OK. Everything it stated was factual, and as the prompt this time was to write a journal entry and not a blog post, it’s not as fantastical as before. It’s a little short though, and I could likely have written the same if not more myself. Our project was one of the few that actually fully worked by the end of the hackathon, and our first showing really impressed the judges (they told me afterwards). On the strength of that we made it to the final six teams (out of about 25 that decided to enter). The pitch went fine, and we again did a live demo in the four minutes allotted to us. We didn’t end up winning any prizes, but we were just happy to be in the finals. Maybe we needed a salesperson on our team and not just a bunch of engineers 😉. I spoke to some of the judges afterwards, and one of the criticisms was that while initially impressive, they had worries about hallucinations: in our pitch we had highlighted that this is a way to hopefully tell an authentic and meaningful story, but with hallucinations, you might spend as much time fixing the story as you do talking to the model. This is extremely fair criticism and was also mirrored our worries. It did seem the longer you chatted to it the more factual the story ended up being (unsurprisingly), but at what point do you spend more time chatting then it would just have taken you to write it? On the last point: I would still argue that the job of a ghostwriter is to take your unstructured thoughts and edit them down into something coherent. My feeling is that the chat interface, or at least typing is not as free as I would have liked. One of the judges suggested a voice interface, which is something we considered as well (i.e. using something like Whisper to turn speech into text), but would have been hard to get working in the little time we had. My feeling is that there is something interesting here, and that making the interaction with the interviewer more natural is probably where we should focus our efforts. The quality of the interviewer questions could also be better. Maybe we can prompt it by giving it some transcripts of interviews by Louis Theroux or some other good interviewer. After the hackathon the API keys provided expired, but Mistral recently opened up a free tier , so we bumped the main model down to one of the smaller ones supported on the free tier. In any case, the code is here if you wish to play around with it. The code is what I would call “Hackathon Quality” so beware. You’ll need a Mistral API key to run it, but it will work with the free tier, so won’t cost you anything. It’s also deployed on Streamlit cloud , which might work depending on whether or not we’ve hit the limits of Mistral’s free tier.

Writing

0 views

fnands 1 years ago

Parsing PNG images in Mojo

So for the past while I’ve been trying to follow along with the development of Mojo, but so far I’ve mostly just followed along with the changelog and written some pretty trivial pieces of code. In my last post I said I wanted to try something a bit more substantial, so here goes. I was looking at the Basalt project, which tries to build a Machine Learning framework in pure Mojo, and realized that the only images used so far were MNIST, which come in a weird binary format anyway. Why no other though? As Mojo does not yet support accelerators (like GPUs) Imagenet is probably impractical, but it should be fairly quick to train a CNN on something like CIFAR-10 on a CPU these days. The CIFAR-10 dataset is available from the original source as either a pickle archive or some custom binary format. I thought about writing decoders for these, but it might be more useful to write a PNG parser in Mojo, and then use the version of the dataset hosted on Kaggle and other places, or just transform the original to PNGs using this package . That way the code can be used to open PNG images in general. Don’t mistake this post for a tutorial: read it as someone discovering the gory details of the PNG standard while learning a new language. If you want to read more about the PNG format, the wikipedia page is pretty helpful as an overview, and the W3C page provides a lot of detail. For reference, this was written with Mojo , and as Mojo is still changing pretty fast a lot of what is done below might be outdated. I actually I did basically the entire post in but got released just before I published, but it required only minor changes to make it work in the new version. The goal here is not to build a tool to display PNGs, but just to read them into an array (or tensor) that can be used for ML purposes, so I will skip over a lot of the more display oriented details. To start let’s take a test image. This is the test image from the PIL library, which is an image of the OG programmer Grace Hopper : This is a relatively simple PNG, so it should be a good place to start. Now that Mojo has implemented it’s version of pathlib in the stdlib, we can actually check if the file exists: We’ll also import the image via Python so we can compare if the outputs we get match the Python case. We’re going to read the raw bytes. I would have expected the data to be unsigned 8-bit integers, but Mojo reads them as signed 8-bit integers. There is however a proposal to change this , so this might change soon. PNG files have a signature defined in the first 8-bytes, part of which is the letters PNG in ASCII. Well define a little helper function to convert from bytes to String: To make sure we are actually dealing with a PNG, we can check the bits 1 to 3: Yup, it’s telling us it is a PNG file. So now we read the first “chunk”, which should be the header. Each chunk consists of four parts, the chunk length (4 bytes), the chunk type (4 bytes), the chunk data (however long the first 4 bytes said it was), and a checksum (called the CRC) computed from the data (4 bytes). When reading in data with , the data comes as a list of signed 8-bit integers, but we would like to interpret the data as 32-bit unsigned integers. Below is a helper function to do so (thanks to Michael Kowalski ) for the help. The firs chunk after the file header should always be the image header, so let’s have a look at it: Let’s see how long the first chunk is: So the first chunk is 13 bytes long. Let’s see what type it is: IHDR, which confirms that this chunk is the image header. We can now parse the next 13 bytes of header data to get information about the image: The first two chunks tell us the width and height of the image respectively: So our image is 128x128 pixels in size. The next bytes tell is the bit depth of each pixel, color type, compression method, filter method, and whether the image is interlaced or not. So the color type is , so RGB, with a bit depth of 8. Interesting side note: in the PIL PngImagePlugin there is a changelog: I like the comment from 2004: and then interlaced PNG being supported about 13 years after PNG reading was added to PIL. I have a feeling I won’t be dealing with interlaced files in this post… The final part of this chunk is the CRC32 value, which is the 32-bit cyclic redundancy check . I don’t go into too much details, but it’s basically an error-detecting code that’s added to detect if the chunk data is corrupt. By checking the provided CRC32 value against one we calculate ourselves we can ensure that the data we are reading is not corrupt. We need a little bit of code to calculate the CRC32 value. This is not the most efficient implementation, but it is simple. I’ll probably do a follow up post where I explain what this does in more detail. Great, so the CRC hexes match, so we know that the data in our IHDR chunk is good. Now, reading parts of each chunk will get repetitive, so let’s define a struct called to hold the information contained in a chunk, and a function that will parse chunks for us and return the constituent parts: During chunk creation the CRC32 value for the chunk data is computed, and an issue will be raised if it is different to what is expected. Let’s test this to see if it parses the IHDR chunk: The next few chunks are called “Ancillary chunks”, and are not strictly necessary. They contain image attributes (like gamma ) that may be used in rendering the image: The IDAT chunk (there can actually be several of them per image) contains the actual image data. PNGs are compressed (losslessly) with the DEFLATE compression algorithm. PNGs are first filtered, then compressed, but as we are decoding, we need to first uncompress the data and the undo the filter. This next section is why I said in “pure-ish” Mojo: I considered implementing it, but that would be quite a lot of work, so I am hoping that either someone else does this, or that I might dig into this in the future. So for the moment, I am using the zlib version of the algorithm through Mojo’s foreign function interface (FFI). The following I lightly adapted from the Mojo discord from a thread between Ilya Lubenets and Jack Clayton: Drumroll… let’s see if this worked: Now we have a list of uncompressed bytes. However, these are not pixel values yet. The uncompressed data has a length of 49280 bytes. We know we have an RGB image with 8-bit colour depth, so expect bytes worth of pixel data. Notice that , and that our image has a shape of . These extra 128 bytes are to let us know what filter was used to transform the byte values each line of pixels (knows as scanlines) into something that can be efficiently compressed. The possible filter types specified by the PNG specification are: There is some subtle points to pay attention to in the specification, such as the fact that these filters are applied per byte, and not per pixel value. For 8-bit colour depth this is unimportant, but at 16-bits, this means the first byte of a pixel (the MSB, or most significant byte) will be computed separately from the second byte (the LSB, or least significant byte). I won’t go too deep into all the details here, but you can read the details of the specification here . I’ll briefly explain the basic idea behind each filter: So when decoding the filtered data, we need to reverse the above operations to regain the pixel values. Now that we know that, let’s look at our first byte value: So we are dealing with filter type 1 here. Let’s decode the first row: And let’s confirm that the row we decoded is the same as PIL would do: Now that we have the general idea of things, let’s write this more generally, and do the other filters as well. For an idea of how filters are chosen, read this stackoverflow post and the resources it points to: How do PNG encoders pick which filter to use? I’ve done these as functions that take 16-bit signed integers. This is important mostly for the case of the Paeth filter, where the standard states: The calculations within the PaethPredictor function must be performed exactly, without overflow. Arithmetic modulo 256 is to be used only for the final step of subtracting the function result from the target byte value. So basically we need to keep a higher level of precision and then cast back to bytes at the end. I based the implementation on the iPXE implementation of a png decoder written in C. For the function, I was trying to add the separate filters to some kind of Tuple or List so I could just index them (hence the uniform signatures), but wasn’t able to figure out how to do this in Mojo yet. So let’s apply these to the whole image and confirm that we have the same results as we would get from Python: And that’s it. If the above runs it means we’ve sucessfully parsed a PNG file, and at least get the same data out as you would by using Pillow. Now ideally we want the above into a Tensor. Lets write a function that will parse the image data and return a Tensor for us. I’m not entirely sure why I need to use while setting items, but when getting I can just provide indices: And there we have it. I will put it all together soon but let’s finish parsing the file quickly. There are a few more chunks at this point: text chunks which hold some comments, and an end chunk, which denotes the end of the file: The text chunks above actually have more info, but seem to be UTF-8 encoded, and Mojo just seems to handle ASCII? Let’s package the logic above up a bit more nicely. I’m thinking something that resembles PIL. Well start with a struct called Well, it’s not the prettiest, but let’s see if it works: If the above runs, then it means we read the image correctly! Let’s try on a PNG image from the CIFAR-10 dataset: This also works! Now we should be able to read the CIFAR-10 dataset! I have a few questions about my implementation above, i.e: This is one of those points where I don’t feel I know what the idiomatic way to do this in Mojo is yet. The 🪄Mojical🪄 way, you might say 😉. Mojo is so young that I’m not sure an idiomatic way has emerged yet. Reading PNGs was quite a fun topic for a blog post. It made me really get my hands dirty with some of the more low level concepts in Mojo, something I felt I didn’t fully grasp before. I’ll admit, this ended up being a bit more work than I expected. To paraphrase JFK: It’s impressive how far Mojo has come in just a few months: when I was trying to write a bit of Mojo in September of last year it felt hard to do anything practical, while now the language seems to be quite usable. There are still a few things I need to get used to. One thing is I always feel like I “need” to write functions, and not functions. This is good practice when writing libraries and such, but it makes me wonder: when is writing style functions appropriate, as will always be safer and more performant? I refactored the code from this blog post a bit and wrote it up into a library I am calling Mimage . The goal would be to be able to read and write common image formats in pure Mojo without having to resort to calling Python or C. Currently Mimage still requires some C libraries for the uncompress step, but I am hoping that those will be available in pure Mojo soon. The next steps will likely be adding support to Mimage for 16-bit PNGs and JPEGs. The long term goal would be to be able to read and write all the same image formats as Python’s Pillow , but that will likely take a long time to reach. As I am on the ML side of things, I’ll try and focus on the formats and functionality needed for ML purposes, like being able to read all the images in the Imagenet dataset. 0: None No filter is applied and each byte value is just the raw pixel value. 1: Sub Each byte has the preceding byte value subtracted from it 2: Up Each byte has the value of the byte above it subtracted from it. 3: Average: Each byte has the (floor) of the average of the bytes above and to the left of it subtracted from it. 4: Paeth: The three neighbouring pixels (left, above and upper left) are used to calculate a value that is subtracted from the pixel. It’s a little more involved than the other three. How to handle 16-bit images? Do we need a separate function, i.e. ? Is a struct the correct way to do this?

Machine Learning

0 views

fnands 1 years ago

Mojo’s standard library goes open source

A lot has changed in Mojo land since I last had a look. The last time I wrote one of these Mojo was at version 0.6, and now it’s at 24.2! This is just due to a change in version numbering, so they have basically still kept the pace of about one new version per month, so the releases (ignoring bugfix releases) goes 0.6 -> 0.7 -> 24.1 -> 24.2. Why 24.1/24.2? Modular have moved to a numbering scheme, so 24.1 and 24.2 are the first and second releases (under this naming scheme) in 2024, respectively. The reason this was done is to keep the Mojo versioning in sync with Max, Modular’s other product. A minor worry I have about this is that unlike semantic versioning where you get a nice 1.0 release, it’s a bit hard to know now what a stable release will look like. Another big piece of news is that the Mojo standard library is now open source, under the Apache 2.0 license (with LLVM exceptions). So far only the SDK has been available, but with the stdlib this is one step closer to the entire language being open source. Some people on HackerNews have grumbled about the language not being open source, but I do totally get why the developers want to reach some level of maturity before opening their language up to the peanut gallery. In any case, I want to have a bit of a look at what has changed. The one object I’ve severely missed in Mojo has been a dictionary type, and in 0.7 they finally added it. It’s one of the most useful data structures in Python, and they have kept the implementation similar to the Python one (so a hash map under the hood). A slight difference is that for Mojo dicts the types must be statically specified, unlike Python where you can just kinda do whatever you want. So to define a dictionary: So pretty straightforward stuff. Nice! Another profiling feature that I’ve never seen anywhere before is getting not just the number of cores on your system, but the types as well. So you can see my machine has 20 total “cores”, which I believe breaks down as 6 performance cores and 8 efficiency cores, for 6 + 8 = 14 physical cores, and the 6 performance cores each having two threads, so 2 * 6 + 8 = 20 threads. I’ve never really had to consider what kind of cores my program runs on, but this will probably become an increasingly important thing to consider going forward. Another interesting thing that caught my eye is the new type, which is taking a page out of the Rust playbook. Checking the roadmap, they do state that a borrow checker is on the way. This does make a lot of sense: if you want to be able to give people access to low-level memory management, do it the Rust way. Right now you can mess around with raw pointers and the compiler will just let you, but it seems this will change in the future. I mentioned in an earlier post that Mojo felt like writing C++ with Python syntax, but maybe it will be more like writing Rust with Python syntax soon. Talking of comparing Mojo to Rust, ThePrimeagen made a video discussing the topic. Actually it’s discussing a blog post by Modular that is discussing an earlier video from him discussing a Mojo community blog post . It’s a whole thing. In any case, it’s an interesting video going into a lot of the rationale behind how a lot of the design decisions about Mojo are being made. Another release, another useful data structure. This time it’s Mojo’s implementation of a set: So is a nice little addition, and works as you would expect. The syntax from Python is also new, so every day we get a bit closer to Python. A feature I haven’t appreciated before is unbound values and the declaration: so you can create an alias of an object, which is basically a partial initialization that can then be fully initialized later. This reminds me a bit of in Python (however, this does happen at compile time), and might make sense if you are initializing multiple versions of large objects. On that note, the compile-time metaprogramming side of Mojo is something I still need to explore and something I haven’t fully grasped yet. A new feature is that you can unbind any number of parameters with An issue I ran into earlier was the lack of support for iteration in , which has now been rectified! Additionally (in v24.2), the been replaced with a , which has a distinctly Pythonic feel to it: Even negative indexing is now supported: If you are wondering what the syntax in is for is that the iterator over the list returns references, and is the dereference operator in Mojo. Additionally, file handling just became easier with the introduction of and , which should be familiar to anyone who uses Python. An interesting change is the removal of the declaration, i.e. the way you declare immutable variables in Mojo. I thought this was a bit strange at first, but there has been a lot of thought put into the decision, see this post by Chris Lattner. Partly it’s due to simplicity, as Python has no concept of immutability, but the one interesting argument that stuck out to me was: > The immutability only applies to the local value i.e. only the (or ) is immutable if you use , not the value being pointed to, which was causing some confusion. It sounds like the concept is being re-evaluated and will likely make a comeback in some shape later. in Mojo is now very close to the Python version, so you can specify the separator and end: Other than that, there are a few changes that bring the language closer to Python, such as adding variadic keyword arguments (as ), but the biggest change in 24.2 is really the release of the . Mojo is finally starting to look usable. Earlier this year there was still too much missing to do much more than a little sandboxed demo in Mojo, and it looks like this is starting to change. Someone is even starting building an ML framework (called Basalt ) in Mojo, and I’m excited to see where that goes. On that note, I’m very curious to see when Mojo will finally support GPUs, as that will really be the point where it will start living up to it’s promise of being the language of machine learning. So far I’ve basically just been writing these posts as an exercise to make myself really parse the Mojo changelog, but I’m tempted to actually start running some real world tests.

C++

Open Source

Rust

0 views

fnands 1 years ago

An intuitive introduction to pansharpening

Many earth observation satellites are equipped with a higher resolution panchromatic sensor, and a lower resolution multispectral sensor. One example of this is the Pleiades satellite constellation by Airbus, for which the panchromatic sensor has a 70 cm ground sampling distance (GSD), and the multispectral sensor has a 2.8 m GSD, i.e. four times lower than the panchromatic band. The panchromatic band is sensitive to a wide spectrum of light, usually overlapping with several of the other spectral bands, while the multispectral bands focus on narrower parts of the spectrum. The image below shows the spectral resolution of the panchromatic as well the the blue, green, red and near infra-red (NIR) bands of the Pleiades sensor (taken from here ): One of the reasons for having different resolutions for different spectral bands is that a sensor needs to receive a certain amount of light before it can form a reliable image. If you increase the spatial resolution or decrease the spectral resolution, you reduce the amount of light reaching each pixel, meaning satellite designers often have to choose between high spectral resolution or high spatial resolution. For a more in-depth explanation see this answer on GIS StackExchange . What this results in is that you often have one image with high spatial information, but low spectral information, and one with low spatial information but high spectral information. But what if you want both? This is where pansharpening comes in. By combining the spatial information from the panchromatic image and the spectral information from the multispectral image, we can create an image with the best of both worlds. Remote sensing images are sometimes a bit unintuitive to deal with, so I will use a “normal” image as an example, in this case a logo I created with Ideogram : To simulate the situation in satellite images we will create a “panchromatic” image by making a grayscale image: and to simulate a lower resolution multispectral image we will downsample the image to get a lower resolution version of the colour image. In this case our original image had a resolution of 512 x 512 pixels, and we downsample each side by a factor of four to get a 128 x 128 pixel image, meaning for every pixel in the multispectral image we will have sixteen (4 x 4) pixels in the panchromatic image. To see the difference we can naively upsample the multispectral image and show the images side-by-side: As you can see we now have a high resolution “panchromatic” image and a low resolution “multispectral” image. So how can we add back sharpness to the image on the right? One of the simplest (and most widely used) methods for pansharpening is called the Brovey transformation. For reference, it is the default pansharpening operation in GDAL . To do a Brovey transform you need to upsample the multispectral image so your images have the same resolution. We did this above, but to confirm let us check the resolutions of our images: Next, for every pixel we create a “pseudo-pan” value by combining the values of the spectral values at each pixel of the (upsampled) multispectral image. However the above assumes that the spectral bands contributed equally to the panchromatic image. As you can see in the illustration of the Pleiades spectral resolution, the green and red bands share a larger overlap with the pancharomatic band than the blue or NIR bands, and should therefore contribute more. In fact, the four spectral bands of the multispectral sensor do not cover the entire spectral range covered by the panchromatic sensor. Therefore we need a set of weights that tells us how each band contributes to the panchromatic image. In case of satellite sensors, these can be found via calibration and are usually available from the manufacturer. In the case of our toy example, we know the weights used to create our “panchromatic” image. The weights are , and for red green and blue respectively, and a description of why these weights are used can be found here . Our equation is now: and we are now technically doing a weighted Brovey transform. By doing this for every pixel we can get an approximation of what the expected panchromatic image should look like: By using the ratio between real panchromatic image and the psuedo-pan image we can get a map that tells us which pixels to lighten and which to darken: Applying this ratio to each of the bands of our upscaled multispectral image, we get our pansharpened image: This looks pretty good! The image is in colour, and has crisp sharp edges. Putting these side-by-side, with the original image on the left and the pansharpened image on the right, we can see a few small differences: The most prominent artefacting occurs in the white lines in the “satellites”, i.e. in areas where there is a sudden change from one colour to another. In order to more clearly see the differences between the two cases, let us create a difference map: As might be expected, the largest differences occur where colours change quickly (i.e. high gradient). This is because the lower resolution sensor will naturally average the colour information in a pixel, so the sharpness that we add to it doesn’t perfectly correspond to the averaged out colour information. The above is an intuitive introduction to pansharpening, and the Brovey pansharpening algorithm in particular. Applying the concept to remote sensing images is not much more complex, one just needs to account for more spectral bands and different ratios between bands. There are more complex pansharpening algorithms out there, but Brovey is one of the most widely used, and gives good enough results for most applications.

Data Analysis Science

0 views

fnands 1 years ago

Mojo 0.6.0, now with traits and better Python like string wrangling.

Another month, another Mojo release. I am busy doing the 2023 edition of the Advent of Code (AoC) in Mojo, and had a few complaints 😅. If you not familiar with the AoC, it’s basically a coding advent calendar that gives you a new coding challenge every day for the first 25 days of December. In a bit of foreshadowing, I used an AoC 2022 puzzle in my first post on Mojo , which was using Mojo 0.2.1, and it is encouraging to see how far the language has come. The AoC puzzles are often pretty heavy in string wrangling, a task that Python is pretty strong in, and that Mojo is still somewhat lacking in. One of the features that I found was lacking in Mojo 0.5.0 was the ability to easily split a string as one does in Python. In the case of the first day, I found myself needing to split a string by newlines, something which you can do trivially in Python by calling . In Mojo 0.5.0 this did not exist and I had to write a struct to implement this functionality. I ended up generalizing it a bit and putting it in a library as it was super useful for the following days as well. And then on the fourth of December Mojo 0.6.0 was released , which now includes the ability to call on a string, as well as a bunch of useful Python methods ( , , , ). These will definitely help going forward with the AoC challenges. I’ll write a rundown of my experience with the AoC in Mojo when I complete all the puzzles, so now on the the spotlighted feature from 0.6.0: traits Traits are a fairly common concept in programming languages, and allow you to add required functionality to a struct if it conforms to this trait. As an example, take the function that we know and love from Python, and that is also now a part of Mojo. The trait associated with in Mojo is , meaning that any struct conforming to the trait is required to have a method that returns an integer size. When the function is applied to a struct that conforms to , the function is called. An example of a struct that conforms to the trait is the builtin : Additionally, we can then write our own struct that conforms to , and as long as it has a method named it will conform to the trait (the compiler will let you know if it doesn’t): If we now call on an instance of this struct it will return the size value: As a side note, I used the decorator above which hides a bit of boilerplate code for us. The above initialization is equivalent to: So is a pretty useful way to save us a few lines of boilerplate code. I’m still getting used to decorators in Mojo (maybe a good idea to do a post on them in the future). One question I had about traits is how difficult it is to chain them? E.g. what if I have a struct that I want to conform to both and , which allows the function to apply to the struct, and makes it printable? It turns out this is easy; just pass them during as a comma So it is very simple to add multiple traits. To create our own trait, we only need to define it with a method that conforming structs need to inherit: Here the indicates that nothing is specified yet (needs to be done per struct). It is not possible yet to define a default method, but is apparently coming in the future. Let’s create a struct that conforms to Jazzable: We can also define a function that calls a specific method. An example of this is the function that calls , we can create our own function that will call : Additionally, traits can inherit from other traits, and keep the functionality of the parent trait: This new struct will have all the methods of , so will work: And we can define additional functions that will activate the new methods as well: Traits provide a convenient way of adding functionality to structs, and as you can see they are pretty simple to use. I’ve never used traits in any other language before, but it does work similarly to generic classes, and feels really familiar, except for the fact that you can’t have default behaviour (yet). From what I’ve seen from Mojo so far, writing structs seems to be a pretty core part of how Mojo is supposed to be used, so I guess I better get used to it.

0 views

fnands 2 years ago

Mojo 0.5.0 and SIMD

Another month, another Mojo release! It feel like every time I run into a missing feature in Mojo it gets added in the next version: In my first post I complained about a lack of file handling, which was then added soon after in version . For version I ran into the issue that you can’t print Tensors, which has now been added in the release. So this means Mojo has now unlocked everyone’s favourite method of debugging: printing to stdout. In addition to that, Tensors can now also be written to and read from files with and . There have also been a couple of updates to the SIMD type, which lead me to ask: How does the SIMD type work in Mojo? For a bit of background, you might have noticed that CPUs clock speeds haven’t really increased by much in the last decade or so, but computers have definitely gotten faster. One of the factors that have increased processing speed has been a focus on vectorization through SIMD, which stands for , i.e. applying the same operation to multiple pieces of data. Modern CPUs come with SIMD registers that allow the CPU to apply the same operation over all the data in that register, resulting in large speedups, especially in cases where you are applying the same operation to multiple pieces of data, e.g. in image processing where you might apply the same operation to the millions of pixels in an image. One of the main goals of the Mojo language is to leverage the ability of modern hardware, both CPUs and GPUs, to execute SIMD operations. There is no native SIMD support in Python, however Numpy does make this possible. Note: SIMD is not the same as concurrency, where you have several different threads running different instructions. SIMD is doing the same operation on different data. Generally, SIMD objects are initialized as , so to create a SIMD object consisting of four 8-bit unsigned integers we would do: And actually, SIMD is so central to Mojo that the builtin type is actually just an alias for : Modern CPUs have SIMD registers, so lets use the package in Mojo to see what the register width on my computer is: This means we can pack 256 bits of data into this register and efficiently vectorize an operation over it. Some CPUs support AVX-512 , with as the name suggests 512 bit SIMD registers. Most modern CPUs will apply the same operation to all values in their register in one step, allowing for significant speedup for functions that can exploit SIMD vectorization. In my case, we’ll have to live with 256 bits. This means in this register we can either put 4 64-bit values, 8 32-bit values 16 16-bit values, or even 32 8-bit values. We can use the utility function to tell us how many 32-bit floating point numbers will fit in our register: One of the new features in Mojo is that SIMD types will default to the width of the architecture, meaning if we call: Mojo will automatically pack 8 32-bit values, or 32 8-bit values int the register. This is equivalent to calling: Operations over SIMD types are quite intuitive. Let’s try adding two SIMD objects together: Additionally, since the version 0.5.0, we can also concatenate SIMD objects with : Operations applied to a SIMD object will be applied element-wise to the data in it, if the function is set up to handle this: As far as I can tell, this doesn’t just work automatically. If I define a function as: Then applying it to a single floating point number works as expected: But trying this on a SIMD object does not: However, if I define a version of the function to take a SIMD object: Then (with the additional specification of the parameter ), it will apply the function to all the values: While still working on single floating point values, as they are just SIMD objects of width one under the hood: I do miss the flexibility of Julia a bit, where you can define one function and then vectorize it with a dot, i.e. if you have a function that operates on scalar values, then calling will apply it element-wise to all values of that vector, and return a vector of the same shape. But for the most part, defining functions just to apply to SIMD values in Mojo doesn’t lose you much generality anyway. To be honest, I was a little bit daunted when I first saw the SIMD datatype in Mojo. I vaguely remember playing around with SIMD in C++, where it can be quite complicate to implement SIMD operations. But in Mojo, it really is transparent and relatively straightforward to get going with SIMD. It is clear that exploiting vectorization is a top priority for the Modular team, and a lot of through has clearly gone into making it easy to exploit the SIMD capabilities of modern hardware. I might take a look at vectorization vs parallelization in Mojo in the future, and maybe even try my hand at a bit of benchmarking.

Julia

C++

Hardware

0 views

fnands 2 years ago

Parameters in Mojo

As Mojo is an extremely new language I want to keep track of the development, and try to learn the language as it evolves. I had a first look at version 0.2.1 of the language in a blog post a few weeks back, and while on vacation I decided I should probably try and write a blog post every time the Modular team releases a new Mojo update (at least every minor update, not patches). To my surprise, in the two weeks I was offline the Modular team managed to do two minor releases, so I’ll jump straight to version 0.4.0 . One of the issues I ran into in my last post was due to the fact that Mojo had no native file handling, and I had to invoke Python just to open a simple file. But this has been fixed now, and Mojo now has a very Pythonic way of opening files: Scanning the changelog, version 0.3.0 seems to bring changes mostly related to supporting keyword arguments, while 0.4.0 seems to mostly be related to how parameters are handled, e.g. adding default parameters. This brings up an interesting point: What exactly are parameters in Mojo? When searching for the definition of parameters vs. arguments in Python, I get: > A parameter is the variable listed inside the parentheses in the function definition. An argument is the value that are sent to the function when it is called. I.e. in Python it is the difference between the name in the function/method definition, vs. the actual data passed. To be honest, I have never heard anyone really making this distinction. In Mojo however takes a different route here and makes a stronger distinction between parameter and argument, as the following lines from the Mojo documentation shows: Python developers use the words “arguments” and “parameters” fairly interchangeably for “things that are passed into functions.” We decided to reclaim “parameter” and “parameter expression” to represent a compile-time value in Mojo, and continue to use “argument” and “expression” to refer to runtime values. This allows us to align around words like “parameterized” and “parametric” for compile-time metaprogramming. In Mojo, arguments are denoted by round brackets like in Python, parameter values are denoted by square brackets. So now we can define: and call it as: So the obvious question is: why ? Why separate arguments and parameters when they seem to do the same thing? The above statement worked because we are calling it with a fixed parameter value that is known at compile time. Let’s try and pass a variable instead: We get an error. Parameter values must be known at compile time, while argument values can be passed at runtime. So passing the variable to the argument will work: This is another tool one can use in Mojo for optimization: in Python, all arguments are evaluated at runtime, while Mojo adds the option of adding values that are known at compile time as parameters. This might be useful in cases where you have some generic version of a function that you might use several versions of. As an example, let’s write a sliding window summation function that can take different window sizes as parameters: The above function takes a window size as a parameter, and a tensor as an argument. The function will iterate over the tensor, and add all values in a sliding window of size . As is often the case when creating a convolutional neural network, the size of the kernel is known at compile time, but the data is unknown, allowing the compiler to optimize parts of the operation that are known ahead of time. This means we can define a generic version of the function once, and then still have compiled version of the specific functions we want. Let’s define a function that calls two versions of the summation: And let’s test it: I haven’t yet found an elegant way to print tensors in Mojo. In Python/Pytorch, you can just call to see all the values in a tensor. In Mojo this results in an error. I read a tweet by Mark Tenenholtz recently that rang true: Writing a lot of Mojo code and it feels like writing C on hard mode. No dict-like structures, file IO is super rudimentary, docs are very sparse, etc. It’s actually very fulfilling, though. As pointed out in the replies, even asking ChatGPT doesn’t get you very far as the language was released after the current training cut-off date (coupled with the fact that there is no corpus of StackOverflow posts to train on anyway).

0 views

fnands 2 years ago

Stereo vision and disparity maps (in Julia)

I’ve been working a lot recently with stereo vision and wanted to go through the basics of how disparity is calculated. I’m partially doing this as an excuse to get better at Julia (v1.9.3 used here). You can view the notebook for this blog post on Github: In much the same way that we as humans can have depth perception by sensing the difference in the images we see between our left and right eyes, we can calculate depth from a pair of images taken from different locations, called a stereo pair. If we know the positions of out cameras, then we can use matching points in our two images to estimate how far away from the camera those points are. Taking a look at the image below (from OpenCV ): If we have two identical cameras, at points and at a distance from each other, with focal length , we can calculate the distance ( ) to object by using the disparity between where the object appears in the left image ( ) and where it appears in the right image ( ). In this simple case, the relation between disparity and distance is simply: If we know an , then we can rearrange this to give us distance as a function of disparity: You might notice that in case the disparity is zero, you will have an undefined result. This is just due to the fact that in this case the cameras are pointing in parallel, so in principle a disparity of zero should not be possible. The general case is more complicated, but we will focus on this simple setup for now. We can define the function as: Where and are measured in pixels, and is measured in centimeters. There is an inverse relation between distance and disparity: So once we have a disparity, it’s relatively straightforward to get a distance. But how do we find disparities? We usually represent the disparities for a given pair of images as a disparity map , which is an array with the same dimensions as (one of) your images, but with disparity values for each pixel. In principle, this is a two-dimensional problem, as an object might be matched to a point that has both a horizontal and vertical shift, but luckily, you can always find a transformation to turn this into a one dimensional problem. The cartoon below illustrates what a disparity map might look like: Above, we calculate the disparity with respect to the right image (you can do it with respect to the left image as well), and as you can see the disparity map tells us how many pixels to the right each object shifted in the left image vs the right image. For a set of images (taken from the Middlebury Stereo Datasets ): The corresponding disparity map can be visualized as follows: With darker pixels having lower disparity values, and brighter pixels having higher disparity values, meaning the dark objects are far away from the cameras, while the bright ones are close. The ground truth disparity as shown above is usually calculated from LiDAR or some other accurate method, and our goal is to get as close as possible to those values using only the images above. So let’s try and calculate disparity for the images above. There are many, many approaches to calculating disparity, but let us begin with the most simple approach we can think of. As a start, let us go through each pixel in the right image, and for that pixel, try and find the most similar pixel in the left image. So let us try and take the squared difference between pixels values as our similarity metric. As we are going to be doing the same thing for every row of pixels, we are just going to define a function that does the basic logic, and then apply the same function to every case. Let’s define a distance metric as the squared distance: And as a test case let’s create the cartoon image we had above: Now we can try and match pixels in the right image to pixels in the left image. So how did we do? So the toy example works! The top line, which moved more pixels, shows up brighter (i.e. larger disparity values), and the lower line is dimmer. So let’s move on to real images. We’ll start with the example case above, but for simplicity we’ll stick to grayscale at first: Redefining slightly… So let’s see how we did? Looking at the predicted disparity, we can see there is some vague resemblance to the input image, but we’re still pretty far from the target: A significant problem seems to be erroneous matches, especially in the background. As you can imagine, we are only comparing single channel pixels values, and it’s very likely that we might just find a better match by chance. In grayscale we are only matching pixel intensity, and we have no idea whether something is bright green, or bright red. So let’s try and improve the odds of a good match by adding colour. So, a slight improvement! There seem to be fewer random matches in the background, but still not that close to the desired outcome. Is there more we can do? The obvious downside of the naive approach above is that it only ever looks at one pixel (in each image) at a time. That’s not a lot of information, and also not how we intuitively match objects. Look at the image below. Can you guess the best match for the pixel in the row of pixels below it? Given only this information, it’s impossible for us to guess whether the green pixel matches with the pixels at location 3, 5 or 7. If however I was to give you more context, i.e. a block of say 3x3 pixels, would this make things simpler? In this case, there is an unambiguous answer, which is the principle behind block-matching. To confirm our idea that more context results in better matches, we can take a quick look at a row of pixels: Given the pixel above, where in the row below do you think this pixel matches? You would guess somewhere in the orange part on the left right? But which pixel exactly is almost impossible to say. If we now take a block with more context: And compare it to the row below, the location of the match becomes more obvious: Calculating the difference metric for each point with different block sizes, we can clearly see that for low block sizes, the lowest metric value is ambiguous, while for larger block sizes it becomes more clear where exactly the best match is: And now we are ready to define our block matching algorithm, much in the way we did our pixel matching algorithm: Let’s see how this does on the full image in comparison to the pixel matching: Now we are getting somewhere! Compared to the earlier results we can now start making out the depth of the separate objects like the lamp, bust and camera. There are still a few things we could do to improve our simple algorithm (like only accepting matches that have below a certain score for the metric), but I will leave those as an exercise to the reader. Above we went through a basic introduction to stereo vision and disparity, and built a bare-bones block matching algorithm from scratch. The above is pretty far away from the state of the art, and there are many more advanced methods for calculating disparity, ranging from relatively simple methods like block matching to Deep Learning methods. Below are some posts/guides I found informative: Introduction to Epipolar Geometry and Stereo Vision Stereo Vision: Depth Estimation between object and camera Depth Map from Stereo Images

Tutorial Julia Computer Vision

0 views

fnands 2 years ago

A first look at Mojo 🔥

The Mojo programming language was officially released in May, but could only be used through some notebooks in a sandbox. Last week, the SDK (version 0.2.1) got released, so I decided to give it a look. Mojo’s goal is to “combine the usability of Python with the performance of C” , and bills itself as “the programming language for all AI developers” . It’s clear that Python is the dominant language when it comes to ML/AI, with great libraries like Pytorch and a few others being the main drivers of that. The problem comes with depth: all the fast libraries in Python are written in a performant language, usually C or C++, which means that if you want to dig into the internals of the tools you are using you have to switch languages, which greatly raises the barrier of entry for doing so. There are other languages that try to go for the usability of Python while retaining performance, and the first language that comes to mind for me in this respect is Julia. Julia is a pretty neat language, and writing math-heavy, fast code in it feels very elegant, while retaining a very Python like syntax. Julia is about twenty years younger than Python, and to me seems like they took the best aspects of Python and Fortran and rolled them into one language, allowing you to have performant and elegant code that is Julia all the way down. Given all this, in vacuum, Julia would seem like the obvious language to choose when it comes to ML/AI programming. The one major downside of Julia is that it doesn’t have the robust ecosystem of libraries that Python has, and unless something major changes, it seems that Python will keep winning. Enter Mojo, a language that then (aspires to) keep interoperability with Python, while itself being very performant and allowing you to write code that is Mojo all the way down. Basically if Mojo achieves its goals then we get to have our cake and eat it: we can keep the great ecosystem of packages that Python brings with it, while getting to write new performant code in a single. My guess is if this works out that all the major packages will eventually get rewritten in Mojo, but we can have a transition period where we still get to keep the C/C++ version of them until this can be done. The people behind Mojo (mostly Chris Lattner ) seem to know what they are doing, so I wish them all the best. I wanted to start with something basic, so I thought I would have a look at the first puzzle from the 2022 advent of code . Basically you are given a text file with a several lists of numbers representing the amount of calories some elves are carrying (go read up on the advent of code if you are unfamiliar, it will make sense then), and have to find which elves are carrying the most calories. So effectively a little bit of file parsing, with some basic arithmetic, i.e. a little puzzle to ease into Mojo. I won’t share the input because the creator of the AoC has explicitly asked people not to , but you can download your own and try the code below. At first glance, a lot of Python code will “just work”: However, it’s clear a lot is still missing, e.g. lambda functions don’t work yet: This is likely coming, but for now we have to live without it. So for the first step, let’s parse some text files. The first thing I found was that Mojo doesn’t have a native way to parse text yet. But luckily, you can just get Python to do it for you! In this case, you have to import Python as a module and call the builtin Python open function. It’s standard practice in Python to open text files with the incantation, but this doesn’t work in Mojo, so have to open and close files manually. All in all, it’s relatively standard Python, with a couple of caveats. One of the big things is that there is a distinction between Python types and Mojo types, i.e. the Python is not the same as Mojo’s , so if you want to get the most out of Mojo, you need to cast from the one to the other. Right now, there seems to be no direct way to go from to , so I had to take a detour via . I tried to keep the Python imports in the function, so that the other functions can be in “pure” Mojo. The my first impulse was to create a Python-esque list, but the builtin list in Mojo is immutable , so I had to go for a DynamicVector, which had a strong C++ flavour to it. Once that was done I was done with Python for this program and could go forth in pure Mojo. Below you can see I declare functions with while above I used . Both work in Mojo, but functions forces you to be strongly typed and enfoces some memory safe behaviour . You can see here the values are all declared as mutable ( ). You can also declare immutables with . This is enforced in functions. Other than that, a relatively standard loop over a container. Again, relatively straightforward. I’m definitely missing Python niceties like being able to easily sum over a container (can’t call in Mojo 😢). To put it all together we create a main , and notice that we need to indicate that it might raise errors as we are calling the unsafe . Mojo feels relatively familiar, but I will also say that when writing “pure” Mojo it feels like writing C with Python syntax. This makes sense given the goals of the language, but caught me a little off guard; I was expecting something a little closer to Julia, which still feels a lot like Python in most cases. This was not the greatest example to show Mojo off, as Mojo really shines in high performance environments, so the language didn’t really get to stretch its legs here. You can find some more performance oriented examples on the official Mojo website . I will probably give Mojo another look and try out something a bit more suited for the language in the future, maybe when the version of the language drops. I think I’ve been spoiled by mostly writing in two well supported languages (Python and C++) for which there are countless reference examples or StackOverflow posts on how to do things. Due to the fact that Mojo is brand new, there are very few examples to look to about how to do even relatively basic things. For now if you want to get started, I recommend starting with the exercises on mojodojo.dev .

C++ Julia