Latest Posts (16 found)
Ash's Blog 1 week ago

Before AI's Kepler Moment 🔭

I’ve always been fascinated by AI and mega-projects — and as I work on AI infrastructure, you might assume I’m equally fascinated by the current LLM race. In reality, I’m far more skeptical than most. While LLMs are undeniably useful, I’m not convinced “intelligence” is even the right scale to measure them against. The analogy I keep returning to comes not from computer science, but from astronomy: the story of epicycles.

Ash's Blog 3 weeks ago

How a String Library Beat OpenCV at Image Processing by 4x

To my great surprise, one of the biggest current users of my StringZilla library in the Python ecosystem is one of the world’s most widely used image augmentation libraries - Albumentations, with over 100 million downloads on PyPI, growing by 5 million every month. Last year, Albumentations swapped parts of OpenCV - the world’s most widely used image processing library, with 32 million monthly downloads in Python - for my string library 🤯

Ash's Blog 1 month ago

Processing Strings 109x Faster than Nvidia on H100

I’ve just shipped StringZilla v4, the first CUDA-capable release of my SIMD-first string processing library. In plain English, that means it is now fast not only on CPUs but also on GPUs! Not everything went to plan, but “StringZilla 4 CUDA” is finally here, bringing 500+ GigaCUPS of edit-distance calculations in an installable package, and a few more tricks up its sleeve, aimed at large-scale Information Retrieval, Database, and Datalake systems, as well as Bioinformatics workloads. All under a permissive Apache 2.0 open-source license, free for commercial use. So in this post, we’ll cover some of the most interesting parts of this release, including:
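
For scale: one “CUP” is a single cell update in the Levenshtein dynamic-programming matrix, so 500 GigaCUPS means half a trillion of the inner-loop steps below every second. Here is a minimal scalar reference in C++ (my own sketch, not StringZilla’s implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Classic two-row Levenshtein (edit) distance; every iteration of the
// inner loop is one "cell update", the unit behind the CUPS figure above.
std::size_t levenshtein(std::string_view a, std::string_view b) {
    std::vector<std::size_t> previous(b.size() + 1), current(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) previous[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        current[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitution = previous[j - 1] + (a[i - 1] != b[j - 1]);
            current[j] = std::min({previous[j] + 1, current[j - 1] + 1, substitution});
        }
        std::swap(previous, current);
    }
    return previous[b.size()];
}
```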

Ash's Blog 4 months ago

Beyond OpenMP in C++ & Rust: Taskflow, Rayon, Fork Union 🍴

TL;DR: Most C++ and Rust thread-pool libraries leave significant performance on the table - often running 10× slower than OpenMP on classic fork-join workloads and micro-benchmarks. So I’ve drafted a minimal ~300-line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies. Update (Sep 2025): Since the v2 release, Fork Union supports NUMA, Huge Pages, and other “pro” features. Check the README for details.
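
For context, the fork-join pattern those benchmarks measure is the kind of loop OpenMP dispatches with a single pragma: threads are forked at the top of the loop and joined at the implicit barrier after it. A minimal sketch in C++ (the loop body and array type are my illustrative choices, not the post’s benchmark):

```cpp
#include <cstddef>
#include <vector>

// Compile with -fopenmp (or your compiler's equivalent) to enable the pragma.
// Fork: a team of threads splits the index range; join: the implicit barrier
// at the end of the parallel-for region.
void scale_and_add(std::vector<float> &y, std::vector<float> const &x, float a) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(y.size()); ++i)
        y[i] += a * x[i];
}
```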

Ash's Blog 6 months ago

CUDA Hello World: Done Less Wrong

You’ve probably seen a CUDA tutorial like this one — a classic “Hello World” blending CPU and GPU code in a single “heterogeneous” CUDA C++ source file, with the kernel launched using NVCC’s now-iconic triple-bracket syntax. I still see this exact pattern in production code — and I’ll admit, it shows up in some of my own toy projects too: one, two, and three. But relying on triple-bracket kernel launches in production isn’t ideal. They don’t return error codes, and they encourage a false sense of simplicity. So in the next ~25 KBytes of text, we’ll explore the less wrong ways to launch kernels.

Ash's Blog 8 months ago

The Longest Nvidia PTX Instruction

The race for AI dominance isn’t just about who has the most compute - it’s increasingly about who can use it most efficiently. With the recent emergence of DeepSeek and other competitors in the AI space, even well-funded companies are discovering that raw computational power isn’t enough. The ability to squeeze maximum performance out of hardware through low-level optimization is becoming a crucial differentiator. One powerful tool in this optimization arsenal is the ability to work directly with PTX, NVIDIA’s low-level Instruction Set Architecture (ISA). However, PTX instructions are quite different from traditional CPU assembly. The PTX Intermediate Representation (IR) lives between high-level languages like CUDA and the actual hardware-specific Streaming Assembler (SASS) instructions - more akin to Java bytecode than to x86 Assembly. And as we’re about to discover, PTX instructions can reach lengths that would make even the most verbose x86 “opcodes” blush!

Ash's Blog 8 months ago

Hiding x86 Port Latency for 330 GB/s/core Reductions 🫣

For High-Performance Computing engineers, here’s the gist: on Intel CPUs, the vectorized-addition instruction executes on ports 0 and 5, and so does the vectorized fused multiply-add (FMA) instruction. On AMD CPUs, however, vectorized addition takes ports 2 and 3, while FMA takes ports 0 and 1. Since an FMA is equivalent to a simple addition when one of its multiplicands is 1, we can drastically increase the throughput of addition-heavy numerical kernels.
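
A minimal sketch of that trick with AVX2/FMA intrinsics - the accumulator count and the naive horizontal reduction are my own simplifications, not the post’s tuned kernel:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sum `n` floats (assumed to be a multiple of 16) with two accumulators:
// one fed by a plain vector add, one by an FMA with a multiplier of 1.0f,
// so the additions can spread across more execution ports.
float sum_f32(float const *data, std::size_t n) {
    __m256 ones = _mm256_set1_ps(1.0f);
    __m256 sum_via_add = _mm256_setzero_ps();
    __m256 sum_via_fma = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        sum_via_add = _mm256_add_ps(sum_via_add, _mm256_loadu_ps(data + i));
        // fma(x, 1.0, acc) == acc + x, but may be scheduled on different ports
        sum_via_fma = _mm256_fmadd_ps(_mm256_loadu_ps(data + i + 8), ones, sum_via_fma);
    }
    // Reduce the two vector accumulators to a scalar the simple way.
    float parts[8];
    _mm256_storeu_ps(parts, _mm256_add_ps(sum_via_add, sum_via_fma));
    float total = 0;
    for (int k = 0; k < 8; ++k) total += parts[k];
    return total;
}
```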

Ash's Blog 9 months ago

Parsing JSON in C & C++: Singleton Tax

I’d argue that almost every open-source developer gets an extra spark of joy when someone reads the documentation and uses their tool in a way that goes beyond the classic 101 examples. It’s a rare treat even for popular projects like JSON parsers, but if you are building high-throughput software, such as analyzing millions of network packets per second, you’ll have to dig deeper. The first thing I generally check in such libraries is the memory usage pattern and whether I can override the default memory allocator. It is the most common singleton of them all!
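
The kind of override in question, sketched with the C++ standard library’s polymorphic allocators rather than any particular JSON parser’s hooks - the function name and buffer size are hypothetical:

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Route all allocations of one parse into a stack-backed arena instead of the
// global malloc/new singleton; everything is released at once when `pool`
// goes out of scope, with no per-node deallocations.
void parse_packet(char const *payload, std::size_t length) {
    std::byte arena[64 * 1024];
    std::pmr::monotonic_buffer_resource pool(arena, sizeof(arena));
    std::pmr::vector<std::pmr::string> tokens(&pool); // containers share the pool
    // ... tokenize `payload` into `tokens` here ...
    (void)payload;
    (void)length;
}
```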

Ash's Blog 9 months ago

10x Faster C++ String Split, 16 Years Later 👴🏻

It’s 2025. Sixteen years ago, someone asked on StackOverflow how to split a string in C++. With 3000 upvotes, you might think this question has been definitively answered. However, the provided solutions can be greatly improved in terms of both flexibility and performance, yielding up to a 10x speedup. In this post, we’ll explore three better ways to split strings in C++, including a solution I briefly mentioned in 2024 as part of a longer review of the Painful Pitfalls of C++ STL Strings.
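
For reference, here is the zero-copy flavor of the idea: returning std::string_view slices into the original buffer instead of copied std::strings. This is a sketch in the spirit of the post, not its final tuned version:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split `text` on any character from `delimiters`, returning non-owning views
// into the original buffer; runs of delimiters produce no empty tokens.
std::vector<std::string_view> split(std::string_view text, std::string_view delimiters) {
    std::vector<std::string_view> tokens;
    std::size_t start = text.find_first_not_of(delimiters);
    while (start != std::string_view::npos) {
        std::size_t end = text.find_first_of(delimiters, start);
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```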

Ash's Blog 10 months ago

The Next 31 Years of Developing Unum

When I was 20, I committed the next 40 years of my life to AI, approaching it from the infrastructure perspective. In 2015, before ChatGPT and the AI surge, such a long-term commitment seemed naive to many — almost like proposing marriage on a second date — yet it’s the move I am most proud of. Yesterday, Unum celebrated its 9th anniversary, and my open-source contributions have surpassed 9,000 stars on GitHub. To mark the occasion, as I’ve done before, I’m releasing something new, free, and practical for the scientific community: efficient Bilinear Forms for real and complex numbers. These are useful in fields like statistics, computational physics, biology, and chemistry. These kernels may offer up to 5x improvements in mixed-precision throughput compared to BLAS and other libraries that power tools like NumPy, especially for simulating the time evolution of small systems of non-entangled quantum states. If you’re curious about Bilinear Forms, you can check the release notes of the SimSIMD project.
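
For readers who haven’t met the term: a bilinear form pairs two vectors through a matrix. The real-valued case below is the textbook definition (the complex case adds a conjugate; this is the standard convention, not a quote from SimSIMD’s docs):

```latex
b(x, y) = x^\top A \, y = \sum_{i} \sum_{j} x_i \, A_{ij} \, y_j
```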

Ash's Blog 10 months ago

Understanding SIMD: Infinite Complexity of Trivial Problems 🔥

This blogpost is a mirror of the original post on Modular.com. Modern CPUs have an incredible superpower: super-scalar operations, made available through single instruction, multiple data (SIMD) parallel processing. Instead of doing one operation at a time, a single core can do up to 4, 8, 16, or even 32 operations in parallel. In a way, a modern CPU is like a mini GPU, able to perform a lot of simultaneous calculations. Yet, because it’s so tricky to write parallel operations, almost all that potential remains untapped, resulting in code that only does one operation at a time.
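
What “8 operations in parallel” looks like in practice, sketched with AVX intrinsics - the function and loop structure are mine, not the post’s:

```cpp
#include <immintrin.h>
#include <cstddef>

// Add two float arrays element-wise, 8 lanes per instruction with AVX;
// `n` is assumed to be a multiple of 8 to keep the sketch short.
void add_arrays(float const *a, float const *b, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 lanes_a = _mm256_loadu_ps(a + i);                    // load 8 floats
        __m256 lanes_b = _mm256_loadu_ps(b + i);                    // load 8 more
        _mm256_storeu_ps(out + i, _mm256_add_ps(lanes_a, lanes_b)); // 8 additions at once
    }
}
```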

Ash's Blog 1 year ago

5x Faster Set Intersections: SVE2, AVX-512, & NEON 🤐

Set intersections are one of the standard operations in databases and search engines. Chances are, you rely on them every day, but you may not realize that they are some of the most complex operations to accelerate with SIMD instructions. SIMD instructions make up the majority of modern Assembly instruction sets on x86 and Arm. They can yield 10x speedups, but due to their complexity, they are almost never used in production codebases.
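
The scalar baseline those SIMD kernels compete with is the classic two-pointer merge over sorted arrays - sketched below; std::set_intersection does the same job:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect two sorted arrays of document IDs with the classic two-pointer
// merge; this is the scalar baseline that SIMD kernels race against.
std::vector<std::uint32_t> intersect(std::vector<std::uint32_t> const &a,
                                     std::vector<std::uint32_t> const &b) {
    std::vector<std::uint32_t> result;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { result.push_back(a[i]); ++i; ++j; }
    }
    return result;
}
```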

Ash's Blog 1 year ago

35% Discount on Keyword Arguments in Python 🐍

Python has a straightforward syntax for positional and keyword arguments. Positional arguments are arguments passed to a function in a specific order, while keyword arguments are passed to a function by name. To the surprise of most Python developers, the choice of positional vs keyword arguments can have huge implications for readability and performance. Let’s take the cdist interface as an example. It’s a function implemented in SimSIMD, mimicking SciPy, that computes all pairwise distances between two sets of points, each represented by a matrix. It accepts up to 6 arguments:

Ash's Blog 1 year ago

NumPy vs BLAS: Losing 90% of Throughput

Downloaded over 5 Billion times, NumPy is the most popular library for numerical computing in Python. It wraps low-level HPC libraries like BLAS and LAPACK, providing a high-level interface for matrix operations. BLAS is mainly implemented in C, Fortran, or Assembly and is available for most modern chips, not just CPUs. BLAS is fast, but bindings aren’t generally free. So, how much of the BLAS performance is NumPy leaving on the table?

Ash's Blog 1 year ago

The Painful Pitfalls of C++ STL Strings 🧵

Criticizing software is easy, yet the C++ and C standard libraries have withstood the test of time admirably. Nevertheless, they are not perfect, especially the <string> and <string_view> headers, among others. Those two alone bring in over 20,000 lines of code, slowing the compilation of every translation unit by over 100 milliseconds. Most of that code seems dated, much slower than LibC, and equally error-prone, with interfaces that are very hard to distinguish.

Ash's Blog 1 year ago

USearch Molecules: 28 Billion Chemical Embeddings on AWS ⚗️

TL;DR: I’ve finally finished a project that involved gathering 7 billion small molecules, each represented in SMILES notation and having fewer than 50 “heavy” non-hydrogen atoms. Those molecules were “fingerprinted”, producing 28 billion structural embeddings, using MACCS, PubChem, ECFP4, and FCFP4 techniques. These embeddings were indexed using Unum’s open-source tool USearch, to accelerate molecule search. This extensive dataset is now made available globally for free, thanks to a partnership with AWS Open Data. You can find the complete data sheet and scripts for data visualization on GitHub.
