
Benchmarking Apache Kafka Consumer Groups vs Share Groups (overhead test)

In my last blog post I introduced Dimster (DIMensional teSTER), a performance benchmarking tool for Apache Kafka with a specific set of philosophies. In this first share group benchmarking post, we’re going to use share groups as they are not intended to be used, but for a good reason.

Share groups allow you to move past partitions as the unit of parallelism by allowing multiple consumers to read from the same partition, using message queue semantics. We’ll run those kinds of tests in the next post. In this post I just want to understand whether the mechanics of share groups add any overhead compared to consumer groups. So we’ll use share groups as if they were consumer groups (by capping consumer count to partition count).

Objective: Use synthetic tests to measure the overhead of share groups compared to consumer groups in identical conditions.

How: Like-for-like tests which use an identical workload/topology with consumerType (CONSUMER_GROUP|SHARE_GROUP) as a dimension. Given identical producer/consumer counts, producer rate, and topic/partition counts, do share groups scale as well as consumer groups? Do they add any latency overhead?

These benchmarks are educational. They are not hard numbers and they are not some kind of canonical result (in fact, no such benchmark exists). And again, this is not a realistic test at all; it only serves to understand share group overhead.

I ran all these benchmarks on a k3d Kubernetes cluster on my Threadripper 9980X:

64 cores (128 threads)
256 GB DDR5 memory
Two Samsung 9100 PRO 8 TB (with one dedicated to the benchmarks)
Pretty decent CPU and RAM cooling

This is not a production setup, but the hardware is more than capable of handling a small to medium sized Kafka cluster with excellent performance. The SSD can sustain around 1.7 GB/s once the SLC cache has filled up, and none of these benchmarks exceed that in aggregate across the 3 brokers.
All tests were run with TLS between the clients and brokers and between each broker. I prefer to run benchmarks with TLS enabled (though it reduces the numbers) because most people (hopefully?) run Kafka with full TLS.

Dimster uses named environments located in the dimster-config.yaml. Each environment targets a specific k8s cluster (via kubectl context), specifies the Kafka and client versions, sizes the Kafka pods, determines heap sizes, broker and log config files etc, all in one yaml block.

This environment uses 36 of 128 CPU threads (18 of 64 cores) and 72 GB of 256 GB of RAM of my workstation, so we’re not pushing the Threadripper too hard. Note, the ‘requests’ block is applied to both k8s requests and limits. The client pod is over-provisioned with 12 CPU cores (24 threads) and 24 GB RAM to avoid any client bottlenecks causing spurious results.

The tests in this post compare consumer groups with share groups. To do that, I tried to isolate other factors as much as possible. Random load skew is one such important factor. In these tests, I ensured that load was as even as possible over the brokers:

Message distribution over the partitions of a given topic was even. I used the Dimster message distributor PINNED_PARTITIONS, which ensures the number of producers is divisible by the number of brokers and pins each producer to a set of partitions; each producer then sends round-robin to its partitions directly.
Multi-topic tests used a topic count divisible by the number of brokers to ensure even distribution of leaders over brokers.
Consumer counts per group were divisible by the number of brokers to ensure even distribution of partitions over consumers.

Fig 1. Dimster’s partition pinning for even load distribution

This is not like real life, but for this post I want to avoid the randomness involved with partition and broker skew so that we can compare consumer group vs share group performance without load skew randomness playing a role.
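To make the pinning idea concrete, here is a minimal sketch of how producers could be pinned to disjoint partition sets and send round-robin within them. This is not Dimster's code: the function names, the stride-based assignment, and the requirement that the partition count be divisible by the producer count are all assumptions made for illustration. Only the divisibility of producers by brokers and the round-robin sending come from the description above.

```python
from itertools import cycle


def pin_partitions(num_producers: int, num_partitions: int, num_brokers: int) -> dict[int, list[int]]:
    """Give each producer a fixed, equally sized, disjoint set of partitions."""
    if num_producers % num_brokers != 0:
        raise ValueError("producer count must be divisible by broker count")
    # Assumption for this sketch: partitions split evenly over producers.
    if num_partitions % num_producers != 0:
        raise ValueError("partition count must be divisible by producer count")
    # Producer p owns partitions p, p + P, p + 2P, ... (P = producer count).
    return {
        p: list(range(p, num_partitions, num_producers))
        for p in range(num_producers)
    }


def send_order(partitions: list[int], num_messages: int) -> list[int]:
    """Partition chosen for each message: round-robin over the pinned set."""
    rr = cycle(partitions)
    return [next(rr) for _ in range(num_messages)]


# 30 producers over 60 partitions on 3 brokers, as in the first latency scenario.
assignment = pin_partitions(num_producers=30, num_partitions=60, num_brokers=3)
```

With this assignment every partition has exactly one producer and every producer sends the same share of messages to each of its partitions, which is the even distribution the tests rely on.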
I’ll be writing about and running benchmarks with partition and broker skew in a future post.

Link to results as a tarball

For the throughput benchmarks, I used Dimster’s explore mode, which probes the cluster to find the highest sustainable throughput while staying under a target end-to-end latency in ms and percentile (50 ms, p75 in this case). It measures e2e latency per partition and uses the latency of the poorest performing partition as the yardstick.

Explore mode runs in phases:

Ramp: Start with a low throughput and keep doubling the throughput after a configured interval. Once the e2e latency exceeds the limit, move to the next phase.

Search: Perform a binary search within the bounds of [0 - max-ramp-throughput]. It starts at the midpoint and if it can sustain that throughput, it searches the high range starting at the midpoint. If it can’t sustain it, then it searches the low range. It recursively performs the search until the current search range size is < 5% of the throughput. Then it moves to the sustain phase.

Sustain: The throughput identified by the search phase is maintained for a prolonged period. If it passes, the test is complete. If it fails to sustain (under the target e2e latency), it goes back to the search phase, with the failed sustain throughput as the new upper bound of the search range.

The sustain phase is successful if 80% of the intervals (30 intervals of 10 seconds by default) meet the latency criteria. This rule exists because explore mode is trying to find the highest sustainable throughput, which sits on the edge of the cluster’s limit, so it allows for some latency spikes.

I ran explore mode on the following workload:

The first scenario has 4 test points which co-vary 4 workload aspects related to partition counts, client counts and consumer type as dimensions, repeating the tests 3 times.

Fig 2.
The merged result of three repeats (only small variance between runs)

We see that share groups matched or even exceeded consumer group performance. Moreover, this pattern was broadly the same across the three test repeats. We can’t treat this as a generalizable result based on this one test, but my general observation, having run these tests for a few weeks on EKS clusters, my Threadripper and my Mac, is that throughput in this kind of synthetic test is comparable between consumer groups and share groups.

Scenario 2 - Varying fanout

This scenario involved 1 topic with 12 partitions with a fanout of 2 and then 6.

Fig 3. The merged result of three repeats (only small variance between runs)

The surprising result was that share groups maintained a higher sustainable throughput with a fanout of 6. Explore mode is sensitive to spiky latency, and one thing I’ve observed is that share group latency can be more stable under stressful loads than consumer group latency. Again, this may not be generalizable, but it shows that share groups might actually outperform consumer groups in some cases.

I think the main takeaway from these limited tests is that share groups and consumer groups are in the same ballpark in terms of raw throughput.

Link to results as a tarball

The throughput benchmarks were a stress test of sorts, pushing Kafka right up to its limit. CPU was maxed out. We don’t want that for the latency benchmarks. We’re not going to push the Kafka cluster to the limit as we want to measure latencies within the performance envelope. With 4 vCPUs, around 100 clients and TLS, a 15 MB/s (1.3 TB daily) workload fits comfortably inside that envelope.

I used run-mode, which runs the standard fixed-throughput benchmarks (best for measuring latency). I ran a single test campaign with 3 scenarios where consumerType was the dimension:

1 topic with 60 partitions, 30 producers, 60 consumers.
12 topics with 6 partitions, 6 consumers per topic, 3 producers per topic.
6 topics with 6 partitions, 3 consumer groups per topic with 6 consumers each, 3 producers per topic.

All ran with an aggregate producer rate of 15000 msg/s with a 1 KB message size (15 MB/s).

Fig 4. End-to-end latency (p99) over time (10 second intervals). Note: you can select a time range on Dimster charts to zoom into a sub-range.

Under this lighter load, we see that share groups add some overhead, with the e2e p99 latency being a little more choppy than the much flatter consumer group latency.

Fig 5. End-to-end latency distribution. Note: you can select a percentile range on Dimster charts to zoom into a sub-range.

Fig 6. p99 end-to-end latency over time (10 second intervals)

The share group overhead is more pronounced in this test.

Fig 7. End-to-end latency distribution.

Fig 8. p99 end-to-end latency over time (10 second intervals)

Again we see the same overhead. The takeaway is that for an adequately sized cluster that is not stressed by the workload, we can expect to see some small share group end-to-end latency overhead.

Just to show you this isn’t an artifact of running these tests on k3d on a single workstation, we see the same pattern on a 50 MB/s test I ran a few weeks ago on AWS EKS with the m6i.2xlarge instance (8 vCPU, 32 GB RAM, EBS).

Fig 9. 50 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster

And a 150 MB/s test which was more stressful:

Fig 10. 150 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster

We see the typical Kafka latency spikes related to log flushing and rotation (which has this predictable cadence due to how all load starts at the same time, at a constant rate, on one topic).

The share group tests consistently used more CPU than the consumer group tests, which is understandable given share groups do a lot more accounting and state management than consumer groups.
For example, the first repeat of scenario 1 of the latency test (executed as test points CG, SG, CG, SG, CG, SG):

Fig 11. CPU over three apache/kafka pods

In all these tests, consumers did nothing with the messages except record some metrics. In the real world consumers write to databases and call APIs. It might take anywhere from < 1 ms to 30+ seconds to process a message. More useful benchmarks simulate consumer processing time which is exactly what we’ll do in the next post. When we add processing time, we start to see where share groups really shine.

To summarize some findings from this post:

Share groups add a little overhead which might show up in a latency benchmark.
Share groups consume more CPU.
Raw throughput benchmarks will probably see varied results, but share groups are not fundamentally slower than consumer groups.
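For readers who want to see the explore mode phases (ramp, search, sustain) as code, here is a rough sketch of the control loop under stated assumptions. It is not Dimster's implementation. `interval_ok` is a hypothetical callable standing in for "run one measurement interval at this rate and report whether the worst partition met the latency target", and the starting rate, the `max_throughput` cap and the exact stopping test are illustrative choices.

```python
from typing import Callable


def explore(
    interval_ok: Callable[[float], bool],
    start: float = 1_000.0,
    max_throughput: float = 1e9,
    tolerance: float = 0.05,
    sustain_intervals: int = 30,
    pass_ratio: float = 0.8,
) -> float:
    """Return the highest throughput found that passes the sustain check."""
    # Ramp: keep doubling until an interval misses the latency target
    # (or the illustrative safety cap is reached).
    high = start
    while high < max_throughput and interval_ok(high):
        high *= 2
    low = 0.0  # the search range is [0, max ramp throughput]

    while True:
        # Search: bisect until the range is under 5% of the upper bound.
        while high - low > tolerance * high:
            mid = (low + high) / 2
            if interval_ok(mid):
                low = mid   # sustained: search the upper half
            else:
                high = mid  # not sustained: search the lower half
        candidate = low
        if candidate <= 0:
            return 0.0  # nothing could be sustained at all

        # Sustain: pass if at least 80% of the intervals meet the target.
        passed = sum(interval_ok(candidate) for _ in range(sustain_intervals))
        if passed >= pass_ratio * sustain_intervals:
            return candidate

        # Failed sustain: retry the search with the failed rate as the new upper bound.
        low, high = 0.0, candidate


# Toy stand-in for a cluster that copes with up to 50,000 msg/s.
limit = 50_000.0
result = explore(lambda rate: rate <= limit)
```

Against a real cluster `interval_ok` would be noisy, which is why the sustain phase tolerates a share of failed intervals instead of requiring every one to pass.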


Piri

This week on the People and Blogs series we have an interview with Piri, whose blog can be found at pketh.org. Tired of RSS? Read this in your browser or sign up for the newsletter. People and Blogs is supported by the "One a Month" club members. If you enjoy P&B, consider becoming one for as little as 1 dollar a month.

Hey, I'm Piri. I'm a software designer, engineer, and artist of sorts. I build Kinopio, and have been blogging about the craft of making software for 12+ years (:O). I went to school in Toronto for biology and urban planning. There I learned that I liked illustration a lot more than writing boring reports and papers. After school, I got a job at a startup as an illustrator, which turned into product design, which also turned into writing code so I could build the ideas in my head.

I can't remember a time when I didn't have some kind of blog. In university, I met a lot of new friends around the world by doing more angst-y cringe-y livejournal-y style writing. I started designing pketh.org while on a flight to SF, paid for by Yahoo, for a job interview at Flickr (times sure have changed). If you’re curious about the green design, I was inspired by the 1956 Jaguar D-Type, which I still think has such a unique prototype race car shape.

My posts are usually long essays that take about a week or two to write and produce, so I try to make them timeless. When I have an idea for a post, I'll make a Kinopio space for it and collect thoughts, images, and URLs in it for a while. If after weeks or months it’s still on my mind, I'll start connecting and organizing everything into a rough outline. From there I'll start pasting things in and typing it up in either IA Writer or TextEdit. When the draft is done, I usually have someone proofread it and use that feedback to make final edits. Then the final HTML formatting bits are done in my code editor of choice, SublimeText.

Writing is like a muscle that atrophies when you don't use it.
Mine's out of shape so the process is quite painful. When I finally get a new post out to the world, I just want to lie down and never get up again. Probably related, but I end up throwing away 1/2 to 2/3 of what I write in a blog post. If I had the time to write more often I suspect it'd get easier. I think I could get pretty good at it.

I prefer different places and tools depending on where I'm at in the process. I collect notes, inspiration, and connect related ideas wherever I am, usually on my phone. I like doing the early writing stage in a coffee shop or in bed. Anywhere that doesn't make me feel like I’m doing “Real Work™” yet. When I get really into it, I like to type at a desk with a good keyboard (I'm a big HHKB fan), on a screen big enough for me to keep my context windows (dictionary.app, Kinopio spaces, related web pages) next to my writing window.

My blog uses Jekyll and is published on Github Pages. The domain stuff is done through Hover. It's quite basic. I might use something newer and nicer than Jekyll, but it would probably be compiled from markdown files the same way. The current design is a bit of a Ship of Theseus that I've been slowly and gently updating over the years, so it's kind of grown on me.

I think the domain name is ~$20/yr and I think that's it. I'm split on blogs with paid content: if writing is your job, then monetizing somehow totally makes sense. Quality independent writing and journalism is really important and should be compensated (I like Craig Mod's approach). But for basically everyone else, blogging is a thing they do on the side for fun, and I think it sucks when people feel pressured to turn everything they do into a passive-income side-hustle potential-business-empire.

Skimming the depths of my RSS feeds, I realized that I’ve subscribed to literally 1000s of blogs. But sadly most have withered away over the ages.
Funkaoshi has been around for even longer than I've been writing – I consider the author my Toronto blogging senpai. I really enjoy Alexotos' in-depth mechanical keyboard reviews. It's really cool and encouraging to see newer people blogging the same way we did. Lilly Ashton’s blog is worth reading if you're looking for something more personal and cozy.

Since 2018, I've been building Kinopio, a spatial note-taking tool to collect and connect your thoughts, ideas, and plans. You can use it to make sense of your thorniest problems and grow your coolest new ideas into plans. I hope you enjoy it.

Now that you're done reading the interview, go check the blog and subscribe to the RSS feed. If you're looking for more content, go read one of the previous 142 interviews. People and Blogs is possible because kind people support it.


Alleged Kimwolf Botmaster ‘Dort’ Arrested, Charged in U.S. and Canada

Canadian authorities on Wednesday arrested a 23-year-old Ottawa man on suspicion of building and operating Kimwolf, a fast spreading Internet-of-Things botnet that enslaved millions of devices for use in a series of massive distributed denial-of-service (DDoS) attacks over the past six months. KrebsOnSecurity publicly named the suspect in February 2026 after the accused launched a volley of DDoS, doxing and swatting campaigns against this author and a security researcher. He now faces criminal hacking charges in both Canada and the United States.

A criminal complaint unsealed today in an Alaska district court charges Jacob Butler, a.k.a. “Dort,” of Ottawa, Canada with operating the Kimwolf DDoS botnet. A statement from the Department of Justice says the complaint against Butler was unsealed following the defendant’s arrest in Canada by the Ontario Provincial Police pursuant to a U.S. extradition warrant. Butler is currently in Canadian custody awaiting an initial court hearing scheduled for early next week.

The government said Kimwolf targeted infected devices which were traditionally “firewalled” from the rest of the internet, such as digital photo frames and web cameras. The infected systems were then rented to other cybercriminals, or forced to participate in record-smashing DDoS attacks, as well as assaults that affected Internet address ranges for the Department of Defense. Consequently, the DoD’s Defense Criminal Investigative Service is investigating the case, with assistance from the FBI field office in Anchorage.

“KimWolf was tied to DDoS attacks which were measured at nearly 30 Terabits per second, a record in recorded DDoS attack volume,” the Justice Department statement reads. “These attacks resulted in financial losses which, for some victims, exceeded one million dollars. The KimWolf botnet is alleged to have issued over 25,000 attack commands.”

On March 19, U.S.
authorities joined international law enforcement partners in seizing the technical infrastructure for Kimwolf and three other large DDoS botnets (named Aisuru, JackSkid and Mossad) that were all competing for the same pool of vulnerable devices.

On February 28, KrebsOnSecurity identified Butler as the Kimwolf botmaster after digging through his various email addresses, registrations on the cybercrime forums, and posts to public Telegram and Discord servers. However, Dort continued to threaten and harass researchers who helped track down his real-life identity and dramatically slow the spread of his botnet.

Dort claimed responsibility for at least two swatting attacks targeting the founder of Synthient, a security startup that helped to secure a widespread critical security weakness that Kimwolf was using to spread faster and more effectively than any other IoT botnet out there. Synthient was among many technology companies thanked by the Justice Department today, and Synthient’s founder Ben Brundage told KrebsOnSecurity he’s relieved Butler is in custody.

“Hopefully this will end the harassment,” Brundage said.

An excerpt from the criminal complaint against Butler, detailing how he ordered a swatting attack against Ben Brundage, the founder of the security firm Synthient.

The government says investigators connected Butler to the administration of the KimWolf botnet through IP address, online account information, transaction records, and online messaging application records obtained through the issuance of legal process. The criminal complaint against Butler (PDF) shows he did little to separate his real-life and cybercriminal identities (something we demonstrated in our February unmasking of Dort).

In April, the Justice Department joined authorities across Europe in seizing domain names tied to nearly four dozen DDoS-for-hire services, although because of a bureaucratic mix-up the list of seized domains remained sealed until today.
The DOJ said at least one of those services collaborated with Butler’s Kimwolf botnet.

A statement from the Ontario Provincial Police said a search warrant was executed on March 19 at Butler’s address in Ottawa, where they seized multiple devices. As a result of that investigation, Butler was arrested and charged this week with unauthorized use of computer; possession of device to obtain unauthorized use of computer system or to commit mischief; and mischief in relation to computer data. He is scheduled to remain in custody until a hearing on May 26.

In the United States, Butler is facing one count of aiding and abetting computer intrusion. If extradited, tried and convicted in a U.S. court, Butler could face up to 10 years in prison, although that maximum sentence would likely be heavily tempered by considerations in the U.S. Sentencing Guidelines, which make allowances for mitigating factors such as youth, lack of criminal history and level of cooperation with investigators.


Datasette Agent

We just announced the first release of Datasette Agent, a new extensible AI assistant for Datasette. I've been working on my LLM Python library for just over three years now, and Datasette Agent represents the moment that LLM and Datasette finally come together. I'm really excited about it!

Datasette Agent provides a conversational interface for asking questions of the data you have stored in Datasette. Add the datasette-agent-charts plugin and it can generate charts of your data as well.

The announcement post (on the new Datasette project blog) includes this demo video:

I recorded the video against the new agent.datasette.io live demo instance, which runs Datasette Agent against example databases including the classic global-power-plants by WRI, and a copy of the Datasette backup of my blog. The live demo runs on Gemini 3.1 Flash-Lite - it's cheap, fast and has no trouble writing SQLite queries.

A question I asked in the demo was: when did Simon most recently see a pelican?

Which ran this SQL query:

And replied:

The most recent sighting of a pelican by Simon was recorded on May 20, 2026. The observation included a California Brown Pelican, along with a Common Loon, Canada Goose, Striped Shore Crab, and a California Sea Lion.

Here's that sighting on my blog, and the Markdown export of the full conversation transcript.

My favorite feature of Datasette Agent is that, like the rest of Datasette, it's extensible using plugins. We've shipped three plugins so far:

datasette-agent-charts, shown in the video, adds charts to Datasette Agent, powered by Observable Plot.
datasette-agent-openai-imagegen adds an image generation tool to Datasette Agent using ChatGPT Images 2.0.
datasette-agent-sprites provides tools for executing code in a Fly Sprites persistent sandbox.

Building plugins is really fun. I have a bunch more prototypes that aren't quite alpha-quality yet. Claude Code and OpenAI Codex are both proving excellent at writing plugins - just point them at a checkout of the datasette-agent repo for reference and tell them what you want to build!

I've also been having fun running the new plugin against local models. Here's a one-liner to run the plugin against gemma-4-26b-a4b in LM Studio on a Mac:

Datasette Agent needs reliable tool calls and the ability for a model to produce SQL queries that run against SQLite. The open weight models released in the past six months are increasingly able to handle that.

Datasette Agent opens up so many opportunities for the LLM and Datasette ecosystem in general. It's already informed the major LLM 0.32a0 refactor which I'm nearly ready to roll into a stable release, maybe with some additional "LLM agent" abstractions extracted from Datasette Agent itself.

I've been exploring my own take on Claude Artifacts, which is shaping up nicely as a plugin. I'm excited to use Datasette Agent to build my own Claw - a personal AI assistant built around data imported from different parts of my digital life, which is a neat excuse to revisit my older Dogsheep family of tools.

We'll also be rolling out Datasette Agent for users of Datasette Cloud. Join our #datasette-agent Discord channel if you'd like to talk about the project.

You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.


The Trick to Designing Highly Replayable Arcade and Linear Games

I recently finished a prototype for a game called HARVEST MOVE, based on an arcade-style game of the same name that I developed 3 years ago. In the original version, the player moves on a grid and needs to collect as many crops as possible while avoiding enemies. If they get hit, the player is presented with their current score, their best score and the ability to play again.

A screenshot of my HARVEST MOVE prototype

However, while remaking it, I made a few design decisions that massively altered the replayability of the game. I’d like to share what I did.

First, I structured the game into configurations of enemies (I’ll start using the term level rather than configuration to make things clearer) that determined which enemy types, and how many, were to be placed on the grid. I then tied those levels to specific score thresholds. Therefore, once a player reached a threshold they would be seamlessly moved to the next level.

Secondly, I turned the player’s score into a currency. When they had a game over, they could spend the wealth accumulated to unlock more valuable crops that would start spawning in their next attempt.

Even though players always replayed the game starting from the first level, they could accumulate wealth much faster, which translated to faster progress through levels they had previously played. Additionally, the more crops they unlocked, the more likely they were to reach late game levels locked behind higher score thresholds, because they simply earned more.

Now, why does this structure seemingly work in making an arcade/linear game more replayable? I have a few hypotheses, listed below.

There is now a sense of permanent progression which mitigates the feeling of loss you experience after losing in a score-based game. This makes you more likely to try again.

Replaying previous levels while being more powerful/quicker gives the player a very satisfying feeling of power which feeds into an innate power fantasy of some sort.
If your game has upgrades that meaningfully alter the gameplay experience, then you tap into a novelty effect that will make the game more enjoyable for longer. For example, having unique items/weapons the player can unlock, alternative pathways in a level, or an alternative level.

I don’t think so. Roguelikes usually rely on procedural generation and randomization to keep things fresh. There is less emphasis on level design and more emphasis on making builds and testing them out. Rather, I would estimate this structure to be more similar to time-loop-based games like The Legend of Zelda: Majora’s Mask. In Majora’s Mask, you play through the same handcrafted environment over and over and progressively unlock things in the game, allowing you to access things you couldn’t before. However, that doesn’t mean procedural generation can’t be used. In fact, I used it in my game, but I don’t believe in it being a core aspect.

Considering I accidentally discovered this through experimentation, I still think there is room for fine-tuning. However, this makes me excited to test this structure again in future projects. I wonder how this would fit with an RPG or a platformer, etc.

That said, it would be nice if I could get your feedback on my game. To make things as convenient as possible, here is a Google form (link to the game is inside the form) for you to provide your feedback so you don’t have to create an account to be able to comment.

Anyway, if you’re interested in all things related to programming, game development and game design, I recommend subscribing so you don’t miss out on future posts.
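As a rough illustration of the structure described in this post (levels tied to score thresholds, and score banked as currency for permanent crop unlocks), here is a small sketch. It is not the game's actual code: the threshold values, crop names, costs and all identifiers are made up for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

# Hypothetical score needed to enter each level (level 0 starts at score 0).
LEVEL_THRESHOLDS = [0, 50, 150, 400]


def level_for_score(score: int) -> int:
    """Index of the highest level whose threshold the score has reached."""
    return bisect_right(LEVEL_THRESHOLDS, score) - 1


@dataclass
class Profile:
    """State that persists across runs, unlike the per-run score."""
    wallet: int = 0
    # Crop name -> points per pickup; the starting crop is illustrative.
    unlocked: dict[str, int] = field(default_factory=lambda: {"wheat": 1})

    def bank_run(self, score: int) -> None:
        """On game over, the run's score is added to the spendable wallet."""
        self.wallet += score

    def unlock(self, crop: str, value: int, cost: int) -> bool:
        """Spend wallet currency to make a more valuable crop spawn in later runs."""
        if crop in self.unlocked or self.wallet < cost:
            return False
        self.wallet -= cost
        self.unlocked[crop] = value
        return True


profile = Profile()
profile.bank_run(120)                            # a run that ended with 120 points
bought = profile.unlock("pumpkin", value=5, cost=100)
```

Because higher-value crops raise the score earned per pickup, later runs cross the same thresholds sooner, which is the faster replay of early levels that the post describes.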

Jim Nielsen Yesterday

Book Notes: “Poor Charlie’s Almanack”

I’ve been slowly listening to Poor Charlie’s Almanack: The Essential Wit and Wisdom of Charles T. Munger.

I like his practicality. He’s never trying to be overly academic, as if he needs to prove how smart he is. He says Berkshire’s success doesn’t come from them solving hard problems, but from spending their time knowing what a simple solution looks like, and acting on it when they see it!

We’ve succeeded by making the world easy for us, not by solving the world’s hard problems.

Munger likens their approach to investing to jumping a fence. They don’t spend all their time trying to figure out how to jump a seven-foot-tall fence. Instead, they find a spot where the fence is only a foot tall, jump it, and take the reward on the other side.

The approach he articulates for investing, in fact, seems broadly applicable to any kind of problem solving:

Quickly eliminate the universe of what not to do.
Follow up with a multi-disciplinary attack on what remains.
Act decisively when, and only when, the right circumstances appear.

Whenever people ask him for advice (as if somehow he could bestow upon them some kind of knowledge that will save them the pain and hardship of experience) he seems averse to the idea that you can live life without making lots of mistakes. To paraphrase Charlie: “I don’t want you to think that we have a method of learning that will prevent you from making mistakes. The best you can do is learn to make fewer mistakes than others. And then, when you inevitably do make mistakes, learn to acknowledge them and fix them quickly.”

Straightforward. Practical. No bullshit. No ego. (Basically the opposite of everything I see on social platforms.)

I quite enjoyed his perspective.

0 views
Unsung Yesterday

“This is a common tell in web apps, and we did a lot of work to eliminate it.”

I have mixed feelings about Raycast announcing their move from the native interface to one powered by web tech (this is the same thing that made Photoshop’s dialogs so bad), but their blog post announcing the change has at least a useful list of some details that separate good native apps from bad web ones. I think it’s worth checking out that list and internalizing it even if you’re nowhere near that kind of a decision, because some of these are universal requirements for a better-feeling interface:
- No on interactive controls. Desktop apps don’t do this. It’s small, but it immediately signals “this is a website.”
- No hover highlights on most controls. On macOS, buttons and list items don’t highlight on hover the way they do on the web.
- Settings open in a separate native window, not a modal or a side panel.
- Popovers and tooltips render as native windows, not as DOM elements inside the WebView. They can extend beyond the window bounds, just like native popovers do.
- On macOS Tahoe, we adopted Apple’s new Liquid Glass material so Raycast blends with the system’s updated visual language from day one.
- No flickering when views appear or transition. This is a common tell in web apps, and we did a lot of work to eliminate it.
There is more in the blog post, and a lot more still left unsaid. Let me add one that I see all the time: accidental text selection. Web makes all text selectable by default, regardless of whether it makes sense for that text to be selected. On top of that, text selection heuristics on complex layouts are not that great. That means that surprisingly often you will see half a text on the page being selected in response to an accidental click or drag. Here’s an example from YouTube I just spotted, where dragging a sidebar selects everything inside it. It’s all solvable via the use of event cancellation and , but requires someone to think about it happening. Yes, there are moments where GUIs allow you to select text for a reason… …but it’s always been a tricky proposition given the scarcity of affordances. It might be better to employ a pretty common “copy to clipboard” pattern instead.
#interface design #web

0 views

Anthropic's "Profitability" Swindle

Yesterday, the Wall Street Journal ran a story about how Anthropic is “about to have its first profitable quarter,” specifically an operating profit, or EBITDA profitability: Interesting! That’s a lot of certainty considering we’re barely through the first half of the second quarter, and quite a specific number given the fact that June hasn’t started! And all of these numbers are mysteriously leaking exactly while it raises its funding round! Oh there’s also one important note: The Journal adds at the bottom of the article that “...it is unclear what accounting methods Anthropic has used to book revenue and costs, as the company isn’t yet required to follow the financial-reporting requirements of a public company.” That’s right: Anthropic is possibly going to be EBITDA profitable for a single quarter, on a non-GAAP basis. Anyway, I wonder how Anthropic did it? Because based on this unhelpfully-labeled diagram from the Journal, it appears (as I said last year) that its costs scale linearly with its revenues, except they…magically didn’t in the second quarter? I wonder if it'll stay profitable? That’s also interesting. So Anthropic may be profitable very specifically in Q2 2026, but might not be afterward. It’s almost as if it found a way to specifically cut its costs in May and June somehow… …because it did! Remember that deal Anthropic signed with SpaceX to take over Colossus-1? Well, it’s also taking over some or all of Colossus-2, paying SpaceX $1.25 billion a month starting in May and June… when it’ll have a reduced fee as it ramps up! Per SpaceX’s S-1: That’s $15 billion a year in compute costs, but reduced to an indeterminately-discounted level for the precise months that Anthropic is using to tell investors and the media that it has an operating profit. That operating profit is a result of accountancy rather than any improvements to its business model.
While I wouldn’t say this is cooking the books, it’s definitely a shiatsu-grade massaging of the numbers. Anthropic has deliberately leaked a quarterly “profit” where it knows it can suppress its costs, specifically made sure that the journalist gave it the out of “costs might increase,” and released it on the day of NVIDIA’s earnings as a means of keeping the AI bubble inflated. Nothing has changed. If Anthropic paid full-rate for its compute in those two months, its economics would shift back to what they’ve always been per my reporting from last year on its AWS costs: a business whose costs increase linearly with its revenue growth. I also severely doubt that Anthropic managed to make the cost of running its services profitable in the space of six months. Per The Information in January, Anthropic missed on its gross margin projections, saying that its inference costs were 23% higher than the company had anticipated. How did Anthropic, which faced a massive influx of new business to the point that Anthropic was forced to buy more compute from Elon Musk, magically become profitable? Other than that discount, of course. I have a few guesses. Nevertheless, the revenue side is where the real problems lie. So, Anthropic has said it brought in $4.8 billion in revenue in Q1 2026, and projects to hit $10.9 billion in Q2 2026. This is tough to reconcile with previous reporting. On February 12, 2026, Anthropic claimed it had reached $14bn in annual recurring revenue (ARR). As a reminder, ARR is an accounting tool used primarily by startups, where a snapshot of a single month’s income is taken and multiplied by twelve. Dividing by twelve gives you an implied monthly revenue of roughly $1.17bn. On March 3, 2026, Dario Amodei would claim Anthropic had reached $19bn in ARR, which works out to $1.58bn per month.
Days later, on March 9, Krishna Rao, Chief Financial Officer at Anthropic, would declare under oath in a court filing that Anthropic had brought in revenues “exceeding $5 billion to date.” Keep in mind that The Information had previously reported that Anthropic had $4.5 billion in revenue in 2025, which I already found difficult to match with Rao's statements. While boosters may claim that “exceeding” could mean literally any number they want above $5 billion, I find it doubtful that the CFO of Anthropic would, under oath, lead the court to believe its business was 30% to 40% smaller than it was, especially when trying to convince it that the damage of being labeled a supply chain risk would ruin its business. At this point it’s impossible to reconcile the 2025 reporting with that $5 billion number. If we assume that the ARR claims made by Anthropic are correct, we can presume that it made revenues of roughly $2.5bn in March (given that it claimed it had $30 billion in ARR on April 6), $1.58bn in February, and $1.17bn in January, for a total of $5.25 billion. I realize that figure is in excess of what the Wall Street Journal had and, in some world, those numbers could be cherry-picked using particular periods to the point that the real revenues would be in the region of $4.8 billion. That's possible. But they don’t make a lick of sense when you bring up what Krishna Rao said. If we believe Anthropic’s leaks, putting aside all of the ARR figures for a second, this means that Anthropic: While I acknowledge that Anthropic has grown significantly, that level of stratospheric growth does stretch the limits of credibility. Moreover, the fact that previous ARR figures are inconsistent with the leaked charts from Anthropic further raises questions about the credibility of any numbers from the company.
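The ARR arithmetic above is easy to check by hand, and a few lines make the steps explicit. The figures are the ones quoted in the text; mapping each ARR claim to the preceding calendar month follows the article's own assumption and is not something the ARR announcements state.

```python
# ARR figures in billions of dollars, as quoted in the text above.
# Assumption (the article's): each claim reflects the previous month's revenue.
arr_claims = {"January": 14.0, "February": 19.0, "March": 30.0}

# ARR is one month's revenue multiplied by 12, so divide by 12 to undo it.
implied_monthly = {month: arr / 12 for month, arr in arr_claims.items()}
q1_total = sum(implied_monthly.values())

for month, revenue in implied_monthly.items():
    print(f"{month}: ${revenue:.2f}bn")
print(f"Implied Q1 2026 total: ${q1_total:.2f}bn")
```

The implied quarterly total comes to $5.25bn, which is the figure the article compares against the leaked $4.8 billion and the CFO's “exceeding $5 billion to date.”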
The only real defense that anybody has here is that Krishna Rao, under oath, lowballed the US government and a judge to such a dramatic extent that he hid in excess of $4 billion in revenue. And as I’ve discussed before (and FlyingPenguin helpfully collated), adding up Anthropic’s previously-reported ARR from January 2025 to March 3, 2026 already gets us to around $6.66 billion. I can imagine this has felt like a big victory for boosters: proof that AI can be profitable, that inference is profitable, that some sort of business model is emerging…and I’m sorry, that’s not what’s happening. Dario Amodei and Elon Musk worked out a sweetheart deal, which they framed as a “ramp-up,” that allowed Anthropic to artificially depress its costs. I also question how much of a ramp-up there really was, or what Anthropic’s actual compute constraints were, because it immediately loosened rate limits for Claude subscribers on announcing the deal, meaning that it immediately started having higher inference costs, which…somehow led to it making a higher profit? Or did Musk (as literally described in its S-1) have SpaceX charge Anthropic less for two specific months to make the numbers look better? In July, Anthropic will start paying SpaceX $1.25 billion a month, or $15 billion a year, on top of all of its other compute deals with Google, Amazon and Microsoft. If we assume that its spend is comparable on AWS and Google Cloud (and it’s most assuredly more!), that means Anthropic is spending around $3.75 billion a month in compute costs, or $11.25 billion a quarter, or $45 billion a year. There’s also a very compelling argument that Anthropic’s costs will increase and will eat up that profitability, to once again quote the Wall Street Journal: I also have to wonder: if you’re so profitable, why not IPO? Why not take this to the public markets?
Unless, of course, you’re only non-GAAP EBITDA profitable based on a two-month-long discount specifically covering the period in which you’re profitable. And, of course, when you’re not a publicly-traded company, and so you don’t actually have to publish any numbers (and no, leaking them doesn’t count), and you’re not subject to SEC oversight.  I will give Dario Amodei credit: nobody does financial engineering and a press-led information war better than Anthropic. The willingness of the press to eat up incongruent numbers and the eagerness of many to jump up and find obtuse ways to explain away the obvious problems is only made possible when a company has perfected the art of manipulation and ingratiation of those who want to feel like they’re “first.” If you take this as incontrovertible proof that Anthropic is profitable, you are deliberately ignoring the blatantly obvious ways these numbers are being massaged. We’ve got its CFO saying numbers that don’t match up with these leaks or Anthropic’s own marketing materials, and the aggressive and deluded way in which many people ignore them is equal parts frustrating and depressing.  Let me speak directly and with more empathy than usual: if you want Anthropic to win, you should be just as skeptical of these numbers as I am. You should want to smash my face in the tarmac with the most crystal-clear, impossible-to-argue with numbers, bereft of asterisks or discounts from suppliers or obfuscated accounting metrics.  You should want better from your heroes. If you truly think this company is amazing, unstoppable, and leading the tech industry to a glorious era of innovation, there shouldn’t be this many questions, and the metrics shouldn’t be this murky .   
Every other time when a company has played this level of silly, weird bullshit has led to disaster — for example, WeWork claimed to be profitable since the second month of its operations , and repeated claims of profitability throughout its existence , and it turned out that it was only “profitable” if you removed things like “ some of the costs of doing business .” I get why you’re so defensive, and I get why you want this to work. A lot of you are very excited about generative AI, and being excited about it has given you a tremendous community of equally-excited people. I get that you like these tools.  And I need you to know these companies are laughing at you.  Anthropic timed this leak to focus on a specific quarter where it artificially suppressed costs, and gave you the flimsiest proof imaginable, specifically-crafted for you to share it as a triumph and spread the idea that “AI labs are actually profitable,” when their core economics haven’t changed. Costs increase linearly with revenue, and will continue to do so in perpetuity.  I genuinely can’t wait for both OpenAI and Anthropic to file their S-1s. If you liked this piece, you should subscribe to my premium newsletter. It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of  NVIDIA ,  Anthropic and OpenAI’s finances , and  the AI bubble writ large . My Hater's Guides To  Private Credit  and  Private Equity  are essential to understanding our current financial system, and my guide to how  OpenAI Kills Oracle  pairs nicely with my  Hater's Guide To Oracle . This week, I’ll publish the second part to my ongoing series (“ What If…We’re In An AI Bubble? ”) about the factors and events that will cause the AI bubble to finally pop.  Subscribing to premium is both great value and makes it possible to write large, deeply-researched free pieces every week.  
Guesses as to how Anthropic reached a profitable quarter:
- For large enterprises, Anthropic is taking prepayment of tokens (say, $50 million intended to be spread over 12 months) that it takes in as revenue. This would both inflate revenue numbers and depress costs, because Anthropic wouldn’t have actually provided the compute necessary to earn that revenue yet.
- Anthropic is already offering discounted tokens for Claude users through the “buy extra credits” page on their accounts, with discounts ranging from 10% to 30%. It may very well be booking this up-front.
- Anthropic could be front-loading annual commitments of any kind: subscriptions to Claude, enterprise or team agreements, and so on.
- Anthropic could have ratcheted down training to ease the burden on its infrastructure to provide inference.
If we believe Anthropic’s leaks, it:
- Made over 90% of its lifetime revenues in the first quarter of 2026,
- Made virtually no revenue in its previous years, and…
- Leaked completely imaginary run rates to the media for years.

0 views

I feel your pain Sara

I stumbled on this piece of code recently that made me laugh, cry, sigh in despair, and think of poor Sara doing her best to make the web a better place. I guess people have forgotten that is a thing that exists.

0 views
Martin Fowler Yesterday

Bliki: Vibe Coding

Vibe coding is building a software application by prompting an LLM, telling it what to build, trying it out, prompting for changes - but without looking at any of the code that the LLM generates. This technique can be used by people without any knowledge of programming. However the resulting software often shows problems with maintainability, correctness, and security - so is best used for disposable software written for a limited audience. The term was coined in February 2025 by Andrej Karpathy, an experienced programmer, in a post on X: There's a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like “decrease the padding on the sidebar by half” because I'm too lazy to find it. I “Accept All” always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. -- Andrej Karpathy The key point about vibe coding is “forget that the code even exists” . This is what gives it much of its usefulness, but also its limitations. Since the November Inflection many programmers are getting LLMs to write all their code, commenting that they may never write a line of code directly again. However they do care about this code, reviewing it, paying attention to its internal structure. 
In that case, they aren't forgetting the code exists, so it's really a different thing that I call Agentic Programming. Sadly the term “vibe coding” really caught on, so many people use it to mean agentic programming. However I feel that despite this rapid Semantic Diffusion, it's worth trying to keep the concepts of vibe coding and agentic programming separate, as they are both different to use and different in their consequences. Because a vibe coder doesn't look at the code, they don't need programming skills, so it's perfect for someone with no programming knowledge to build applications for their own use. Experienced programmers may also find it handy for rapid development of disposable software or prototypes. Vibe coding is still new, so we are exploring its limitations, and those limitations change as the sophistication of models and their harnesses change. These limitations do introduce considerable risks, particularly if the vibed software is used widely or has access to sensitive information. Perhaps the most serious risk is that of security. LLMs are inherently vulnerable as they provide a large attack surface for predators. Vibe coded applications can often expose sensitive information or, worse, credentials that let an attacker reach deeper into an organization's systems. Even non-programmers need to be aware of the Lethal Trifecta. With little attention to the code, vibe coding can rapidly produce many lines of code of a very low quality. Such code makes it difficult, even for an LLM, to modify and enhance the software in the future. While it's possible that growing LLM capabilities will allow it to work with even the largest bowls of spaghetti software, thus far it seems clear that well-structured software makes life easier for LLMs too. LLMs are famous for their habit of hallucinating incorrect facts and presenting these with great confidence. This habit also leads them to create software that behaves incorrectly - and those errors may not be manifest to the user.
Furthermore, the non-determinism of LLMs means that asking an LLM to enhance some software could easily lead it to introduce errors, even in parts of the code that shouldn't change due to the new request. We should thus treat LLM-generated software with skepticism: it can still be useful, but we need to be aware of the risks. On the whole, vibe coding is best used for disposable software that's only used by its author or a close group of collaborators who understand and accept the risks involved. Code that is more complex, more widely-used, and with more consequences to its risks should not be forgotten about.

0 views
Unsung Yesterday

Chrome’s abnormal tab search

Chrome’s find option, like every search coming from a good home, does something clever with accented characters – it normalizes them. No matter whether you search with a proper accented character, or with its basic Latin equivalent, all the same stuff matches: the “ø” letter is treated the same as “o” both in the input field, and then in the search itself. Yet, Chrome’s tab search inexplicably doesn’t do that, which confused me when working on a post about diacritics earlier this week. Here, it should match all four open tabs. Tab search was introduced years ago; Occam’s Razor says this isn’t a recent bug, but that the feature has always behaved like this. I filed the bug, but even if it gets fixed quickly, I think this doesn’t reflect well on Chrome’s team. If the right code already exists for ⌘F, why not reuse it? If it cannot be reused, why not repurpose at least its unit tests or the QA process to make sure this doesn’t fall through the cracks? Normalization should be treated as a core property of any search, rather than an optional “nice to have.” But, Marcin, didn’t you just invalidate your assertion that diacritics actually matter?
After all, wouldn’t you input “nestlé” instead of “nestle” if they did? To this, I have a few answers:
- Input is not output. This is no different than autocorrect, autocomplete, or other IME helpers.
- The very fact that on many keyboards accented characters are hard to input is itself a sign of the anglo-centrism of the companies that made early typewriters (Remington, which established a lot of European layouts like QWERTZ and AZERTY, employed a person who bragged he didn’t actually speak any languages in a “how hard could it be” way) and then most microcomputers.
- There is this really interesting rule, also known as Postel’s Law: “be conservative in what you output, but liberal in what you accept as input.” It’s not universally applicable – sometimes it’s better to teach the user to be more explicit if it benefits them in the longer run – but it feels appropriate to me here.
Why does it matter specifically for the ⌘F and the tab search experience? I have this personal theory: the simpler the search, the more the users will blame themselves if it doesn’t work, and assume the tab or the string just isn’t there, rather than rewrite their query. That’s what happened to me. I assumed that the tab wasn’t open and tried to get to it again, wasting time and effort. The rule might be universally true for any UI surface – the tighter it gets, the less likely we assume it can break. After all, there is a manual for a typewriter, but there isn’t one for the pencil! And these UIs do feel positively basic; they are small windows with basically one input field and an immediate as-you-type reaction.
#definitions #keyboard #localization
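For readers who have not implemented this kind of normalization, here is a minimal sketch of accent-insensitive matching. It is not Chrome's implementation, which the post does not show. The function names and the small `EXTRA_FOLDS` table are assumptions made for illustration. The table exists because letters like “ø” have no Unicode decomposition, so stripping combining marks alone would leave them unchanged.

```python
import unicodedata

# Letters such as "ø" have no canonical decomposition, so NFKD leaves them
# as they are. This tiny table is illustrative, not exhaustive.
EXTRA_FOLDS = str.maketrans(
    {"ø": "o", "Ø": "O", "ł": "l", "Ł": "L", "đ": "d", "Đ": "D"}
)


def fold(text: str) -> str:
    """Return a lowercase, accent-free search key for text."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA_FOLDS))
    # Drop combining marks (the accents split off by NFKD).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, title: str) -> bool:
    """Accent-insensitive substring match; both sides are folded the same way."""
    return fold(query) in fold(title)
```

Folding both the query and the candidate text is what lets “nestle” find “Nestlé” and “nestlé” find “Nestle”, which is the liberal-input behaviour the post argues for.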

0 views
Stratechery Yesterday

An Interview with Parallel Founder Parag Agarwal About Valuing Content on the Agentic Web

An interview with Parallel founder Parag Agarwal about valuing content and incentivizing its creation in a world of agents (plus questions about Twitter).

0 views

Installing SmartPoi D1 Mini version with Arduino IDE V2

1. Download and install Arduino IDE V2
2. Go to Tools -> Manage Libraries and install FastLED 3.7.5 (ESP8266 version of SmartPoi will not work with the latest FastLED!)
3. Go to File -> Preferences and input the following in “Additional boards manager URLs” (adding ESP8266 boards support): http://arduino.esp8266.com/stable/package_esp8266com_index.json
4. Go to Tools -> Boards -> Boards Manager and select esp8266 to install (may need to re-start Arduino IDE before it shows up)
5. Install the ESP8266 LittleFS Uploader program in Arduino:
Step 1: Download the Plugin
- Open your web browser and go to the official GitHub releases page for the tool: GitHub: arduino-littlefs-upload
- Download the latest version ending in (for example: ).
You need to put this file in a specific directory. If the folder doesn’t exist yet, you will need to create it.
- Windows: 1. Navigate to: 2. Look for a hidden folder named (note the dot at the front). 3. Inside , create a new folder named . 4. Move the file into that folder.
- macOS / Linux: Open Finder/File Manager and go to your home directory: (You may need to hit on Mac to see hidden folders). Create a folder named inside it. Drop the file into that folder.
- Restart Arduino IDE 2.
In IDE 1.8, the tool lived in the Tools menu. In IDE 2, it lives in the Command Palette.
- Open the Command Palette by pressing: Windows/Linux: + + macOS: + +
- Type into the prompt. You should see the option: Upload LittleFS to Pico/ESP8266/ESP32.
The workflow to actually upload files is identical to the old version:
- Open your Arduino sketch. Go to Sketch > Show Sketch Folder.
- Create a folder named exactly alongside your file. Place whatever HTML, TXT, or config files you want inside it.
- Important: Select your D1 Mini board and port, and close the Serial Monitor (if open, it blocks the upload).
- Open the Command Palette ( ) and click Upload LittleFS to Pico/ESP8266/ESP32.
The console at the bottom will compile your file system image and push it straight to the flash memory!
6. Get SmartPoi from the SmartPoi Firmware Downloader website
7. Select options in Arduino IDE 2.0:
- CPU Frequency: 160 MHz
- Board: LOLIN(WEMOS) D1 R2 & Mini
- Flash Size: “4MB (FS:3MB OTA: ~512KB)”
- Debug Port: Serial (if you want to see serial output – optional)
- Select your port (COM1, USB0 …)
- Leave everything else on default settings
8. Compile and Upload
9. Do the LittleFS Filesystem Upload mentioned above (step 5.4)
The post Installing SmartPoi D1 Mini version with Arduino IDE V2 appeared first on Circus Scientist.

0 views
Zak Knill Yesterday

AI token streaming isn't about SSE vs WebSockets

At Ably, we’ve solved production token streaming, so you don’t have to. And the hard part isn’t SSE or WebSockets. Ask an agentic coding tool or chatbot “how to stream AI tokens to a client in production” and it’ll give you a section of the answer on SSE vs WebSockets. But that’s not the question, or really the answer. In a pure comparison of SSE and WebSockets as the transport, SSE is the simpler choice, and is also the better choice for most use cases. The architecture you should build for production token streaming looks like the diagram below. It separates the ‘prompt’ request from the ‘response’ stream, and adds a token cache/data store that holds the tokens so clients can resume and reconnect.
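To make the “token cache that allows resume” idea concrete, here is a minimal in-memory sketch. It is not Ably's implementation. The `TokenStore` class, its method names, and the offset-based resume scheme are assumptions for illustration, and a production store would need to be shared across servers and durable.

```python
from collections import defaultdict
from typing import Iterator


class TokenStore:
    """In-memory stand-in for the token cache (hypothetical, single process)."""

    def __init__(self) -> None:
        self._tokens: dict[str, list[str]] = defaultdict(list)

    def append(self, response_id: str, token: str) -> int:
        """Store a token as the model produces it and return its offset."""
        self._tokens[response_id].append(token)
        return len(self._tokens[response_id]) - 1

    def read_from(self, response_id: str, offset: int = 0) -> Iterator[tuple[int, str]]:
        """Yield (offset, token) pairs from offset onward, for first connect or resume."""
        tokens = self._tokens.get(response_id, [])
        for i in range(offset, len(tokens)):
            yield i, tokens[i]


# The 'prompt' request writes tokens; the 'response' stream reads them.
store = TokenStore()
for tok in ["Hel", "lo", " wor", "ld"]:
    store.append("r1", tok)

# A client that saw offsets 0 and 1 before dropping resumes at offset 2.
resumed = list(store.read_from("r1", offset=2))
```

Because the writer and the reader only meet at the store, a dropped connection does not interrupt generation, and the transport only has to carry an offset. With SSE, that offset could travel as the event `id`, which the browser sends back in the `Last-Event-ID` header when it reconnects.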

0 views
Sean Goedecke Yesterday

The famous o3 "GeoGuessr" prompt did not work

In April last year, Kelsey Piper discovered that OpenAI’s o3 model was surprisingly good at figuring out where a photo was taken from. Like human “geoguessr” pros , o3 could sometimes take a nondescript photo of a beach and tell you exactly where it is. Here’s the example Kelsey gave: Several people reproduced this with good results: not a 100% success rate, but clearly far better than you’d do with a random human guess. The lesson here is that model capabilities can surprise us . The o3 model had been released for two weeks before Kelsey’s tweet without anyone noticing how good it was at geolocation. What obscure capabilities did we never find? What capabilities of current models are we missing today? Some people drew another lesson from this: that “prompt engineering” can unlock brand-new capabilities. This is because Kelsey had a magic prompt that she built over time. When o3 got something wrong, she would ask it how it could have avoided the mistake, and then included that in the prompt. Here’s the first 10% of that prompt, so you get the idea: You are playing a one-round game of GeoGuessr. Your task: from a single still image, infer the most likely real-world location. Note that unlike in the GeoGuessr game, there is no guarantee that these images are taken somewhere Google’s Streetview car can reach: they are user submissions to test your image-finding savvy. Private land, someone’s backyard, or an offroad adventure are all real possibilities (though many images are findable on streetview). Be aware of your own strengths and weaknesses: following this protocol, you usually nail the continent and country… This prompt impressed a lot of people, who tried it out and reported that it correctly identified a lot of images. But of course, o3 correctly identified a lot of images with just a basic “think carefully about where this picture was taken?” prompt. Did the prompt actually help? It’d be tough to figure that out just from playing around in ChatGPT. 
You’d need to build an evaluation set of images and run o3 against them twice: once with the fancy prompt and once without it. So that’s what I did . I pulled 200 images from Wikimedia Commons, Geograph Britain and Ireland, and iNaturalist for the benchmark. You can read the AI-generated summary here , but here’s the key table: In general, the basic prompt did better on average. It consistently guessed closer to the actual location. Both prompts did pretty well, actually. Despite the fancy prompt being 10x larger, it only caused o3 to think for slightly longer (about one second on average, though the max was about double, at 10 minutes instead of 5 minutes). The images in my benchmark were fairly generic geoguessr-style outdoor images, with twelve indoor images thrown in for an extra challenge (the fancy prompt also did slightly worse on these). What’s going on? I think this shows how easy it is to fool yourself about the quality of prompting . When the model is already pretty good at a task, you can give it a very elaborate prompt without impacting performance. It’ll still be pretty good, except this time it’s good because of what you did . This is particularly true if you’re iterating with the model and asking it “what should I add to the prompt” for each mistake. Models will happily make up stories for you about their own reasoning processes, and will almost always say “yes, that helped a lot!” when you ask them if a particular prompt tweak made things better. The only way to actually know is by constructing some kind of benchmark 1 . It’s also interesting to me that nobody checked this at the time. It took me about six hours of fairly-distracted work and about $15 to construct and run this benchmark. Why didn’t anyone do this when they were writing articles about how good the o3 prompt was? One charitable reason might be that the story was more about o3’s real geolocation ability than about the magic prompt. 
o3 also used to be about five times more expensive (though a benchmark of 40 images instead of 200 would still have thrown doubt on how much water the prompt was carrying). Also, AI just moves so fast. Geolocation was only the story for about a week: after that, GPT-4o’s sycophancy was what people were talking about. Another reason is that AI tooling wasn’t as good then. The benchmark was so easy for me to run because GPT-5.5 did most of the heavy lifting. Prior to strong agents, you would have had to write the (simple) benchmark yourself. I can’t point the finger too hard: I didn’t bother at the time either. Maybe my benchmark isn’t very good? The photos look reasonable enough: a wide variety of geoguessr-like shots of roads and landscapes, mostly. I could have tried to gather a few thousand photos instead of a few hundred, but if the magic prompt really was a big improvement you’d still expect to see that manifest on a benchmark this size. If someone wants to go and build a hundred-dollar geolocation benchmark instead of my fifteen-dollar one, I think that’d be an interesting project. Finally, let’s use the benchmark to answer a question I’ve had for a while: do gpt-5.4 and gpt-5.5 have o3’s geolocation abilities? The answer, apparently, is no. Whatever o3 had that made it good at this task hasn’t transferred to newer models.
1. Benchmarks can mislead as well, but they’re better than just vibes.
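For anyone who wants to build their own version of such a benchmark, the scoring step can be quite small. The sketch below assumes the metric is great-circle distance between the guessed and true coordinates, summarised per prompt. That is a guess at a reasonable metric, not a description of the author's harness, and `haversine_km` and `summarize` are hypothetical names.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def summarize(truths, guesses):
    """Mean and median error in km for one prompt's guesses over the same images."""
    errors = [haversine_km(*t, *g) for t, g in zip(truths, guesses)]
    return {"mean_km": mean(errors), "median_km": median(errors)}
```

Running `summarize` once on the basic prompt's guesses and once on the elaborate prompt's guesses, over the same images, gives the like-for-like comparison the post describes. Reporting the median alongside the mean keeps a few wildly wrong guesses from dominating the result.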

The Coder Cafe 2 days ago

LSM Trees Explained

☕ Welcome to The Coder Cafe! Some people reached out after I published the Build Your Own Key-Value Storage Engine series to say they hadn’t gone through all eight posts, but they were curious about the core ideas. So I distilled everything into a single post. No implementation, no exercises, just the core concepts behind LSM trees. Get cozy, grab a coffee, and let’s begin! Fundamental Insights To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place . When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection. LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially . Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward. The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around. The in-memory buffer is called the memtable . The sorted files on disk are called SSTables . We’ll look at each in detail. Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store . 
When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput. A hashtable works for lookups but not for in-order iteration. Sorting a hashtable takes O(n log n) at flush time. A better choice is an ordered data structure. The most common in practice is a skip list; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too. Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order. The memtable doesn’t grow forever. At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL. There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability. The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk. Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client.
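The write path described above (log first, then update the ordered memtable, then acknowledge) can be sketched in a few lines. This is a minimal illustration, not how any particular engine does it: the `Memtable` and `Engine` classes, the JSON-lines WAL format, and the `max_entries` flush trigger are all assumptions made for the example, and a sorted key list stands in for a skip list. `os.fsync` is the standard call that asks the OS to push its buffers to storage.

```python
import bisect
import json
import os
import tempfile


class Memtable:
    """Ordered in-memory store; a sorted key list stands in for a skip list."""

    def __init__(self):
        self._keys = []      # keys kept in lexicographic order
        self._entries = {}   # key -> (sequence number, value)

    def put(self, key, value, seq):
        if key not in self._entries:
            bisect.insort(self._keys, key)
        self._entries[key] = (seq, value)

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[1]

    def items(self):
        # In-order iteration, so a flush needs no sorting step.
        for key in self._keys:
            yield key, self._entries[key]

    def __len__(self):
        return len(self._keys)


class Engine:
    def __init__(self, wal_path, max_entries=1000):
        self.wal_path = wal_path
        self.max_entries = max_entries   # hypothetical size-based flush trigger
        self.memtable = Memtable()
        self.seq = 0
        self._replay()
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        # On restart, rebuild the memtable from the log, oldest entry first.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.memtable.put(rec["key"], rec["value"], rec["seq"])
                self.seq = max(self.seq, rec["seq"])

    def put(self, key, value):
        self.seq += 1
        rec = {"seq": self.seq, "key": key, "value": value}
        self.wal.write(json.dumps(rec) + "\n")   # 1. append to the WAL
        self.wal.flush()
        os.fsync(self.wal.fileno())              # 2. persist before acknowledging
        self.memtable.put(key, value, self.seq)  # 3. only then update memory
        return len(self.memtable) >= self.max_entries  # True means a flush is due

    def close(self):
        self.wal.close()


path = os.path.join(tempfile.mkdtemp(), "wal.log")
engine = Engine(path, max_entries=2)
first_due = engine.put("b", "2")
second_due = engine.put("a", "1")
engine.close()

recovered = Engine(path)  # simulated restart: WAL replay rebuilds the memtable
ordered_keys = [key for key, _ in recovered.memtable.items()]
recovered.close()
```

After the simulated restart, `recovered.memtable` holds both writes again, in key order, even though nothing was ever flushed to an SSTable.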
This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes. The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet. One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. To guarantee durability, the engine must call fsync after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though: fsync adds latency to every write. Production systems often use fdatasync instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call fsync once for the group. The WAL introduces write amplification: the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability. As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable. An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: immutability makes SSTables safe to read concurrently without locking, and sorted order makes lookups inside a file efficient. In a simple implementation, an SSTable is just a JSON array of key-value pairs, sorted by key. Production systems use a binary block-based format instead. The SSTable is divided into fixed-size blocks, typically 4 KB, though the exact size varies by implementation. Data blocks hold the actual key-value entries.
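To make the simple JSON layout concrete, here is a small sketch of flushing a set of entries to a sorted JSON array and looking a key up with binary search. The function names `flush_to_sstable` and `sstable_get` and the `[key, value]` pair layout are assumptions for illustration; a real engine would stream from an already-sorted memtable rather than sort a dict.

```python
import json
import os
import tempfile
from bisect import bisect_left


def flush_to_sstable(entries, path):
    """Write key-value pairs as a JSON array sorted by key, then fsync the file."""
    # The input here is a plain dict, so we sort; an ordered memtable would not need to.
    rows = [[key, value] for key, value in sorted(entries.items())]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f)
        f.flush()
        os.fsync(f.fileno())


def sstable_get(path, key):
    """Load the file and binary search the sorted keys for an exact match."""
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    keys = [k for k, _ in rows]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[i][1]
    return None


sst_path = os.path.join(tempfile.mkdtemp(), "000001.sst.json")
flush_to_sstable({"banana": "2", "apple": "1", "cherry": "3"}, sst_path)

with open(sst_path, encoding="utf-8") as f:
    stored_keys = [k for k, _ in json.load(f)]
found = sstable_get(sst_path, "banana")
missing = sstable_get(sst_path, "durian")
```

The file is written once and never modified afterwards, which is what makes it safe to read without locking. Loading the whole file for every lookup is the part a block-based format with an index avoids.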
The SSTable also contains an index block storing the first key of each data block, which makes it possible to binary search for the right block without reading the entire file. In most implementations, the index block is written at the end of the file, since block boundaries are only known after all data blocks have been streamed out. To look up a key, we read the index block, binary search it to find the right data block, fetch that single block from disk, verify its integrity with a checksum, and then binary search within the block. When the index block is not cached, this means most lookups read two disk pages: the index block and one data block. In practice, index blocks are typically kept in memory, so most lookups require only one disk read . Each data block also carries a checksum computed over the block’s bytes. Before using the data, the engine verifies the checksum. If they don’t match, the block is corrupted, and the read fails safely rather than returning garbage. As SSTables accumulate, the engine maintains a catalog file (often called a MANIFEST in systems like RocksDB), which is an append-only log listing all existing SSTables in order of creation. This catalog is the engine’s source of truth for what files exist on disk. On startup, the engine reads it to know which files are live, and replays the WAL to restore the memtable. After a successful flush, the old WAL can be discarded. The data is now safely in an SSTable. Production systems also compress data blocks , typically with a fast algorithm like Snappy, LZ4, or zstd. Compression reduces disk footprint and I/O at the cost of CPU, and it interacts with block sizing: a compressed block may be smaller than a disk page, so implementations often track both logical and physical block sizes. LSM trees are optimized for writes. Reads are where the trade-off shows . To look up a key, the engine searches in order of recency : first the memtable, then SSTables from newest to oldest. The first match wins. 
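The lookup order just described (memtable first, then SSTables from newest to oldest, first match wins) fits in a few lines. In this sketch plain dicts stand in for the memtable and for SSTable files, and the `lookup` function and its second return value (how many SSTables were searched) are assumptions added for illustration.

```python
def lookup(key, memtable, sstables):
    """Return (value, sstables_searched). `sstables` is ordered oldest -> newest."""
    if key in memtable:
        return memtable[key], 0
    searched = 0
    for table in reversed(sstables):   # newest first
        searched += 1
        if key in table:
            return table[key], searched   # first match wins
    return None, searched


memtable = {"c": "3-new"}
sstables = [
    {"a": "1-old", "b": "2"},   # oldest
    {"a": "1-new"},             # newest
]

hit_memtable = lookup("c", memtable, sstables)
hit_newest = lookup("a", memtable, sstables)
hit_oldest = lookup("b", memtable, sstables)
not_found = lookup("z", memtable, sstables)
```

Note how the lookup for `"a"` never sees the stale `"1-old"` value, while the lookup for a missing key has to search every SSTable before giving up.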
This ordering matters because the same key can appear multiple times across different SSTables. Each write to a key produces a new entry rather than updating the existing one. The newest version is the correct one. The problem becomes clear as SSTables accumulate. A key that was written once and never updated might still require the engine to search through dozens of SSTables before finding it, or confirming it doesn’t exist. Each SSTable search is a disk read. This is called read amplification: a single logical read triggers multiple physical reads. For a key that doesn’t exist at all, the engine must check every SSTable before returning a not-found error. That’s the worst case for read amplification, and it gets worse the more SSTables there are. This is a fundamental tension in LSM trees, and it reflects a deeper principle known as the RUM conjecture: a storage engine can excel at two of reads, updates, and memory efficiency, but not all three at once. LSM trees make a deliberate choice: optimize for updates, accept read amplification as the cost. The sorted structure also enables efficient range scans. To retrieve all keys between a start key and an end key, the engine scans the memtable in order, then merges sorted streams from the relevant SSTables. The answer to accumulating SSTables is compaction. Compaction is a background process that takes multiple SSTables, merges them into fewer, cleaner ones, and discards the originals. The result is fewer files to search through, which directly reduces read amplification. It also reclaims disk space consumed by redundant entries: if the same key appears in three different SSTables, compaction keeps only the newest version and discards the rest. One common algorithm is a k-way merge. The engine opens iterators over all SSTables being compacted, each positioned at the first entry. It uses a min-heap to always pull the smallest key across all iterators.
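A k-way merge with a min-heap can be sketched with Python’s standard `heapq` module. The `compact` function and the in-memory representation (each SSTable as a sorted list of `(key, value)` pairs with unique keys) are assumptions for illustration; a real engine would stream from files and write the output to new SSTables.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables, ordered oldest -> newest, keeping the newest version of each key."""
    iterators = [iter(table) for table in sstables]
    heap = []
    for age, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            # Negating the age makes the newest table sort first among equal keys.
            heapq.heappush(heap, (first[0], -age, first[1]))

    merged = []
    while heap:
        key, neg_age, value = heapq.heappop(heap)
        if not merged or merged[-1][0] != key:
            merged.append((key, value))   # first pop for a key is its newest version
        # Older duplicates fall through here and are simply dropped.
        following = next(iterators[-neg_age], None)
        if following is not None:
            heapq.heappush(heap, (following[0], neg_age, following[1]))
    return merged


oldest = [("a", "1"), ("c", "3")]
middle = [("b", "2"), ("c", "30")]
newest = [("a", "100"), ("d", "4")]

merged = compact([oldest, middle, newest])
```

Here `merged` comes out sorted by key, with `"a"` taken from the newest table and `"c"` from the middle one, since those are the most recent versions of each.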
When the same key appears in multiple SSTables, the engine picks the version from the newest SSTable and discards the older ones. The merged output is streamed into new SSTable files. In practice, real systems limit the number of SSTables that can participate in a single compaction run to keep resource consumption under control. Updating the catalog after compaction requires care . The engine must not delete the old SSTables before the new ones are safely written to disk. The safe sequence is: write new SSTables, fsync, write a new catalog pointing to the new files, fsync, then delete the old SSTables. A crash at any point leaves the engine in a recoverable state: either the old files are still referenced by the old catalog, or the new files are referenced by the new catalog. Compaction is not free . It consumes I/O and CPU in the background, competing with foreground reads and writes. Every byte of data gets rewritten multiple times across its lifetime, adding to write amplification. Tuning when compaction triggers (and how aggressively it runs) is one of the main knobs in LSM tree performance. We might expect deletion to be straightforward: find the key, remove it. In an LSM tree, it is anything but straightforward . SSTables are immutable. We cannot reach into an existing SSTable and remove an entry. So when a key is deleted, the engine writes a special marker to the memtable called a tombstone , an entry that says “ this key is deleted ”. It eventually gets flushed to an SSTable like any other write. During reads, the engine respects tombstones. If a tombstone for a key is found before a value for that key, scanning newest to oldest, the key is treated as deleted, and a not-found error is returned. The tombstone shadows any older value. The tricky part is knowing when it is safe to discard a tombstone during compaction. 
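Tombstone handling on the read path can be sketched as follows. The `TOMBSTONE` sentinel, the `delete` and `get` functions, and the use of dicts in place of the memtable and SSTable files are all assumptions made for the example; real engines usually encode the tombstone as a flag or entry type in the record.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this key is deleted"


def delete(memtable, key):
    # A delete is just another write: it records a marker instead of a value.
    memtable[key] = TOMBSTONE


def get(key, memtable, sstables):
    """Search newest to oldest; `sstables` is ordered oldest -> newest."""
    for table in [memtable, *reversed(sstables)]:
        if key in table:
            value = table[key]
            # A tombstone found first shadows any older value further down.
            return None if value is TOMBSTONE else value
    return None


sstables = [{"user:1": "alice", "user:2": "bob"}]   # immutable, already on disk
memtable = {}

delete(memtable, "user:1")
deleted_lookup = get("user:1", memtable, sstables)
other_lookup = get("user:2", memtable, sstables)
still_on_disk = sstables[0]["user:1"]
```

The old value for `"user:1"` is still physically present in the SSTable; it is only hidden by the newer tombstone, which is why dropping that tombstone too early would bring the value back.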
Consider this situation: a tombstone for a key exists in a newer SSTable, and an old value for that key exists in an older SSTable that hasn’t been compacted yet. If we drop the tombstone during compaction without also removing the old value, the old value becomes visible again. Deleted data reappears. This is called data resurrection, and it is a correctness bug. NOTE: Correctness here means the engine returns what was actually written, not a stale or deleted value. This is different from consistency in the distributed systems sense, which describes the guarantees clients have about which version of data they see across replicas. The rule is strict: a tombstone can only be dropped when the engine can guarantee that no older value for that key exists anywhere below it on disk. In practice, this means the compaction must include the oldest SSTables that could still hold a shadowed value. This is one of those details that seems minor until we get it wrong. A storage engine that resurrects deleted data is not a storage engine we can trust. Getting this right requires knowing exactly where older values can hide, which brings us to how SSTables are organized on disk. Basic compaction, merging all SSTables into one flat pool, works but doesn’t scale. As the dataset grows, a flat pool of SSTables means reads still have to check many files. Leveling is the structural answer. In a leveled LSM tree, SSTables are organized into levels: L0, L1, L2, and so on. Each level has different rules. L0 is the landing zone. When the memtable flushes, the resulting SSTable lands in L0. L0 files can have overlapping key ranges: two L0 files might both contain entries for the same key. This is acceptable because L0 files are small and short-lived. L1 and deeper levels are different. Each level maintains non-overlapping key ranges across all its files. A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in L1, we don’t scan all files.
We use the key ranges to jump directly to the one file that could contain it. When L0 accumulates enough files, a compaction runs to merge L0 into L1. This merge enforces the non-overlapping invariant: L0 files (which may overlap) get merged with the relevant L1 files (which define the ranges), producing new L1 files with clean, non-overlapping ranges. Similarly, when L1 grows too large, a compaction merges part of L1 into L2. Each deeper level is typically larger by a fixed ratio, for example, 10x. L1 might hold 10 MB, L2 100 MB, L3 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels. The benefit is controlled read amplification. To look up a key, we check the memtable, scan all L0 files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist. When compaction falls behind and L0 accumulates too many files, the engine may trigger a write stall: new writes are paused until compaction catches up and L0 is drained. This is one of the more painful operational issues in LSM-based systems. Leveled compaction is also not the only strategy. Tiered compaction, used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload. Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist. For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this.
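Before getting to Bloom filters, the leveled read path itself (every L0 file, then at most one file per deeper level) can be sketched like this. The `SSTable` class, `find_file`, and `leveled_get` are names invented for the example, and each file is an in-memory dict with its key range recorded; a real engine keeps these ranges in the catalog.

```python
from bisect import bisect_right


class SSTable:
    def __init__(self, data):
        self.data = dict(data)
        self.min_key = min(self.data)
        self.max_key = max(self.data)


def find_file(level, key):
    """Binary search a non-overlapping level, sorted by min_key, for the one candidate file."""
    starts = [table.min_key for table in level]
    i = bisect_right(starts, key) - 1
    if i >= 0 and key <= level[i].max_key:
        return level[i]
    return None


def leveled_get(key, memtable, l0, deeper_levels):
    if key in memtable:
        return memtable[key]
    for table in reversed(l0):        # L0 files may overlap: check all, newest first
        if key in table.data:
            return table.data[key]
    for level in deeper_levels:       # L1, L2, ...: at most one file per level
        table = find_file(level, key)
        if table is not None and key in table.data:
            return table.data[key]
    return None


l0 = [SSTable({"a": "old-a", "m": "m0"}), SSTable({"a": "new-a"})]   # oldest -> newest
l1 = [SSTable({"b": "1", "f": "2"}), SSTable({"k": "3", "p": "4"})]  # disjoint ranges
l2 = [SSTable({"a": "ancient", "z": "26"})]

from_l0 = leveled_get("a", {}, l0, [l1, l2])
from_l1 = leveled_get("k", {}, l0, [l1, l2])
from_l2 = leveled_get("z", {}, l0, [l1, l2])
absent = leveled_get("g", {}, l0, [l1, l2])
```

For the absent key `"g"`, the L1 search rules out both files from their key ranges alone, but the L2 file’s range does cover `"g"`, so that file still has to be read to learn the key is missing.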
A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low. Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there. How does it work? A Bloom filter is a bitset. When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable. The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read . Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically. Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk. Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background . This is where things get subtle. The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. 
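As a brief aside before continuing with concurrency, the Bloom filter mechanics described above (a bitset, several hash functions, all bits must be set for a possible match) can be sketched as follows. The `BloomFilter` class, its default sizes, and the trick of deriving several hash functions by salting SHA-256 are assumptions for illustration; production filters use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key):
        # Simulate several hash functions by salting one hash with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key):
        # Any unset bit proves the key was never added; all set means "maybe".
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key))


bloom = BloomFilter()
present_keys = ["apple", "banana"]
for key in present_keys:
    bloom.add(key)

no_false_negatives = all(bloom.might_contain(key) for key in present_keys)
false_positives = sum(bloom.might_contain(f"missing-{i}") for i in range(1000))
empty_filter_answer = BloomFilter().might_contain("apple")
```

With only two keys in a 1024-bit filter, `false_positives` should be at or very near zero; it grows as more keys are added to a filter of fixed size, which is the tunable trade-off between memory and accuracy.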
If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash. One common solution is a versioned catalog . A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count. Background workers (the flush worker and the compaction worker) never modify an existing catalog . Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely. An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove. This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap , which in many implementations is a single atomic pointer update. The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. 
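The pin-and-release scheme behind a versioned catalog can be sketched as below. Everything here is a simplification under stated assumptions: `CatalogVersion` and `VersionedEngine` are invented names, SSTables are represented by file names only, "deleting" a file just appends to a list, and a small lock guards the pointer swap and reference counts, where a real engine would typically use atomic operations to keep the hot path lock-free.

```python
import threading


class CatalogVersion:
    def __init__(self, memtable, sstables):
        self.memtable = memtable
        self.sstables = tuple(sstables)   # never mutated after creation
        self.refs = 0


class VersionedEngine:
    def __init__(self):
        self._lock = threading.Lock()     # simplification: guards swap and refcounts
        self.current = CatalogVersion({}, [])
        self._versions = [self.current]   # current version plus any still-pinned ones
        self.deleted_files = []           # stands in for removing files from disk

    def acquire(self):
        with self._lock:
            version = self.current
            version.refs += 1
            return version

    def release(self, version):
        with self._lock:
            version.refs -= 1
            self._cleanup(version)

    def install(self, memtable, sstables):
        """Called by a flush or compaction worker when its output is ready."""
        new = CatalogVersion(memtable, sstables)
        with self._lock:
            old, self.current = self.current, new
            self._versions.append(new)
            self._cleanup(old)

    def _cleanup(self, version):
        if version is self.current or version.refs > 0:
            return
        self._versions.remove(version)
        # Only delete files that no remaining version still references.
        live = {name for v in self._versions for name in v.sstables}
        self.deleted_files += [name for name in version.sstables if name not in live]


engine = VersionedEngine()
engine.install({}, ["sst-1", "sst-2"])

reader = engine.acquire()                 # a read pins the current version
engine.install({}, ["sst-3"])             # compaction replaces sst-1 and sst-2
deleted_while_pinned = list(engine.deleted_files)

engine.release(reader)                    # last reader done: old files can go
deleted_after_release = list(engine.deleted_files)
```

While the reader holds its pin, the compacted-away files stay on disk, so the in-flight read can finish against a consistent set of SSTables; they are removed only once the reference count reaches zero.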
LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads. The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing. The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery. SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe. A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state. Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all L0 files. Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O. Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk. Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level beyond L0. Tiered compaction is an alternative strategy that accepts more read amplification in exchange for less write amplification. Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, largely eliminating the worst-case read scenario. A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state. CRDTs Explained Availability Models Explained The PACELC Theorem Explained The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper. Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary.
Build Your Own Key-Value Storage Engine IO devices and latency Fundamental Insights To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place . When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection. LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially . Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward. The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around. The in-memory buffer is called the memtable . The sorted files on disk are called SSTables . We’ll look at each in detail. The Memtable Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store . When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput. A hashtable works for lookups but not for in-order iteration. 
Sorting a hashtable takes at flush time. A better choice is an ordered data structure. The most common in practice is a skip list ; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too. Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order. The memtable doesn’t grow forever . At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL. The Write-Ahead Log There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability . The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk . Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client. This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes. The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet. One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. 
To guarantee durability, the engine must call after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though. adds latency to every write. Production systems often use instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call once for the group. The WAL introduces write amplification : the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability. SSTables As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable . An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: Immutability makes SSTables safe to read concurrently without locking. Sorted order makes lookups inside a file efficient. is the landing zone . When the memtable flushes, the resulting SSTable lands in L0. files can have overlapping key ranges: two L0 files might both contain entries for key . This is acceptable because L0 files are small and short-lived. and deeper levels are different. Each level maintains non-overlapping key ranges across all its files . A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in , we don’t scan all files. We use the key ranges to jump directly to the one file that could contain it. Each deeper level is typically larger by a fixed ratio, for example, 10x. might hold 10 MB, 100 MB, 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels. 
The benefit is controlled read amplification . To look up a key, we check the memtable, scan all files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist. When compaction falls behind and accumulates too many files, the engine may trigger a write stall : new writes are paused until compaction catches up and is drained. This is one of the more painful operational issues in LSM-based systems. Leveled compaction is also not the only strategy. Tiered compaction , used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload. Bloom Filters Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist . For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this. A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low. Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there. How does it work? A Bloom filter is a bitset. 
When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable. The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read . Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically. Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk. Concurrency Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background . This is where things get subtle. The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash. One common solution is a versioned catalog . A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count. Background workers (the flush worker and the compaction worker) never modify an existing catalog . Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. 
From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely. An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove. This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap , which in many implementations is a single atomic pointer update. The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Summary LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads. The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing. The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery. SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe. A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state. 
Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus every file in any level whose key ranges overlap.
Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O.
Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk.
Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level.
Tiered compaction is an alternative strategy that trades less write amplification for more read amplification.
Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario.
A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state.

CRDTs Explained
Availability Models Explained
The PACELC Theorem Explained
The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper.
Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary.
Build Your Own Key-Value Storage Engine
IO devices and latency
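The Bloom filter check described in this post (set bits on add, test the same bits on lookup) can be sketched in a few lines of Python. This is a minimal illustration, not the format any particular engine uses: the class name, the default sizes, and the choice of salted SHA-256 as the family of hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: num_hashes hash functions over a num_bits bitset."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key: bytes):
        # Assumption: derive each "hash function" by salting SHA-256 with
        # the function number. Real engines use cheaper hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        # Set the bit at every index produced for this key.
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: bytes) -> bool:
        # Any 0 bit means "definitely absent"; all 1s means "maybe present".
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key)
        )
```

An engine would call `might_contain` before opening an SSTable and skip the file on `False`. A `True` result still requires the disk read, because it may be a false positive.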

Manuel Moreale 2 days ago

On people writing about their use of AI

I find the trend of people posting about the way they use generative AI to be fascinating at an anthropological level. I do not remember the last time a piece of technology pushed so many different people into writing about the way they use it, or don’t use it, or abuse it, or misuse it. To me, this is way more interesting and intriguing than the technology itself. I obviously do not know why so many people are doing so, and I suspect they all have their own specific reasons. I currently have three main theories, though I’m sure there are more than that.

The first theory is that a good percentage is trying to capitalize on the trend in an attempt to become some sort of AI thought leader. Those people are insufferable. They usually hang out on LinkedIn, but sometimes they escape containment, and they remember that they do have a blog (and that’s often a Substack, unsurprisingly) where they can post these generic-looking blog posts filled with lists and it’s-not-this-it's-that statements.

The second theory is that techies are gonna tech. A lot of the people who have blogs are also into tech, and gen AI is an interesting piece of tech, so it’s natural that those people will end up writing about how they use AI.

The third and final theory is that there’s a group of people who feel the need to distance themselves from what AI represents. So those posts are not really about the technology itself, but rather a statement on the state of the world around them, and they want to make it clear if and how they participate in it.

This final group is to me the interesting one. Now, if you’re a techie, don’t be mad at me, I’m not saying you’re not interesting, because you are (if instead you’re an AI bro, click here. You're welcome.) I’m saying the last group is the interesting one because to me, it’s fascinating how people feel compelled to justify or explain to strangers on the Internet how they interact with a piece of technology.
And it’s especially fascinating because it’s a completely pointless exercise in my opinion. Let’s pretend you just landed on my blog for the first time (hi, welcome, nice to have you here) and you have no idea who I am. For all you know, I might not even be a real person. This entire website could be a psyop run by the Italian government. With that in mind, what’s the value of a post in which I tell you how I use or not use AI from a moral perspective? Would it make a difference if I were to tell you that I don’t use it? Or that I use it maybe once a day to answer a coding-related question? What if I told you that I don’t use AI at all, but in reality, this post was entirely generated by a swarm of AI agents while I was outside walking the dog, enjoying life? Unless you have prior knowledge of me and this blog, a post like that, in a vacuum, would be meaningless. How about the opposite case, though? Let's now pretend you weren’t new here, and you had, in fact, been following this blog since 2017. If that was the case, you wouldn't even need me to write that blog post, because by this point, you’d have all the necessary information to make an informed judgment. And you’d also know that you could ping me via email or via DM and ask me directly if you had any doubt about anything related to this topic. In both cases, a post stating my use of AI would have pretty much zero value. Which genuinely makes me wonder why so many people feel compelled to write about this stuff. If you wrote one of these posts, can I ask you why? Why do you feel the need to explain how you use this technology? Is there a specific reason? I’d love to know.


Introducing Dimster, a performance benchmarking tool for Apache Kafka

Dimster = DIMensional teSTER for Apache Kafka

On GitHub: https://github.com/dimster-hq/dimster

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar, but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka.

In this post I’ll introduce some aspects of Dimster that are core to its design:

Dimensional testing
Shareable, self-contained results with reproducibility in mind
Benchmark prep and post-processing
Kubernetes as a standardized runtime

A benchmarking and stress testing technique I’ve used for years is something I have called “Dimensional Testing”. We can think of all the configs and workload aspects as forming an N-dimensional space. Within that space we can explore the impact of moving along a single dimension, or along several co-varying dimensions. Take a config or an aspect of a workload as a dimension, and run a series of identical benchmarks where a set of points along that dimension are explored (while everything else remains the same). The dimension could be a client config, such as batch.size or acks. It could be an aspect of the workload such as number of consumers, type of consumer, number of consumer groups, the partition count, the produce rate and so on. There are hundreds of dimensions to explore, which requires some patience and care lest you become overwhelmed.
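To make the idea concrete, here is a minimal Python sketch of expanding a base workload into test points along one dimension, or along several co-varied dimensions. It is not Dimster's workload format or API: the function name `expand_scenario` and every key in `BASE_WORKLOAD` are hypothetical, chosen only to illustrate the technique.

```python
# Hypothetical base workload; the keys are illustrative, not Dimster's schema.
BASE_WORKLOAD = {
    "topics": 10,
    "partitions_per_topic": 6,
    "consumers_per_topic": 6,
    "producers_per_topic": 4,
    "consumer_type": "CONSUMER_GROUP",
}


def expand_scenario(base: dict, dimensions: dict[str, list]) -> list[dict]:
    """Return one test point per position along the given dimension(s).

    With several dimensions, values at the same position change together
    (co-variation), so every value list must have the same length.
    Everything not named in `dimensions` stays identical to `base`.
    """
    lengths = {len(values) for values in dimensions.values()}
    if len(lengths) != 1:
        raise ValueError("co-varied dimensions need the same number of points")
    points = []
    for values in zip(*dimensions.values()):
        point = dict(base)  # copy, so the base workload is never mutated
        point.update(zip(dimensions.keys(), values))
        points.append(point)
    return points


# One dimension: consumer type.
by_type = expand_scenario(
    BASE_WORKLOAD, {"consumer_type": ["CONSUMER_GROUP", "SHARE_GROUP"]}
)

# Two co-varied dimensions: partitions and consumers scale together.
scaled = expand_scenario(
    BASE_WORKLOAD,
    {"partitions_per_topic": [6, 12, 24], "consumers_per_topic": [6, 12, 24]},
)
```

Each returned dictionary would correspond to one separate benchmark run, with only the explored dimension differing between runs.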
The figure below depicts just three dimensions, and a set of three scenarios which test performance along one or two dimensions at a time.

Fig 1. Three examples of varying or co-varying an aspect of a workload as dimensions

Each of the above 16 test points (across 3 scenarios) is a separate benchmark, with a fresh topic, warm-up time, recorded time, cooldown time and so on. The generated charts for throughput and various latencies are repeated for each of the three scenarios, with each test point within a scenario plotted as a series/bar on those charts. This makes it easy to compare the performance results of varying the values of a single dimension (or co-varying values across multiple dimensions).

Fig 2. Each scenario maps to a set of charts, with the test points as data series.

With share groups being relatively new, I could compare the performance of regular consumers against share group consumers, with identical benchmarks where the dimension explored is consumer type (CONSUMER_GROUP|SHARE_GROUP). The following test has a base workload of ten topics, each with 6 partitions, 6 consumers and 4 producers. Each scenario changes the producer rate, and compares consumer groups to share groups. Record keys are used, so batch sizes will be small, which is a tougher workload than a no-key test which typically results in larger batches.

The charts below show the results for an EKS deployment with Kafka deployed on 3x m6i.2xlarge with 300 MB/s provisioned gp3. At 50 MB/s we see that p99 end-to-end latency is stable, with roughly 15 ms overhead for share groups. At 200 MB/s, p99 end-to-end latency exhibits periodic peaks.

Dimster uses environments. The sizing of a test is determined by which environment is used. I ran some share group consumer scaling tests, with full mTLS, on Kafka clusters assigned 2, 4, and 8 CPUs. These are the equivalent of vCPUs, as my Threadripper has SMT (hyperthreading) enabled.
2-CPU environment on my Threadripper:

I ran the following workload with the above environment, with CPU requests/limits of 2, 4 and 8. Then I used the dimster compare command to generate comparison charts based on the JSON result files of each run. Each chart compares the test points side by side.

10k msg/s - 1000 consumers (6th test point in 1st scenario)

We see that 2 CPUs fare a lot worse than 4 and 8 CPUs.

100k msg/s, 250 consumers (4th test point, 3rd scenario)

The 2 CPU cluster simply can’t keep up with 100k msg/s and 250 consumers. If we unselect 2-CPU, we see that 4-CPU and 8-CPU were OK.

Dimster charts are interactive. Series can be toggled, time and percentile ranges can be selected.

One thing I really like about OMB is that it produces a JSON file for the results. These files are easy to store and easy to share. But there was also a lot missing for full traceability and reproducibility. Dimster includes the following in every test campaign result (a set of files in a result directory):

Results: The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against. A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations).
Source configs: The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients.
Log files: The log files of dimster-core, the benchmarking framework, and each Kafka broker.
Charts: Throughput and latency charts (clickable, zoomable) generated from the result JSON file.
Dashboards: Grafana dashboards converted to interactive HTML files.
I can run a test campaign then send you the results and you’ll be able to reproduce the results because you know exactly what was run and on what. The results are also completely self-contained. If you want to see the dashboard to look at Kafka metrics during the test, it’s right there as an HTML file in the results. No need for access to Grafana and Prometheus and no need to keep monitoring infrastructure around; it can be ephemeral.

Dimster comes with four test modes (which all support dimensional testing):

Run: Fixed throughput benchmarks, plus:
  Live-interaction: Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.
  Availability: Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark.
Explore: Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile.
Drain-backlog: Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load.
Correctness: Detects data loss, data corruption, out-of-order delivery and duplicates.

Example 1: Peak sustainable throughput, 1 partition, share group consumers

Explore mode on my Threadripper. The idea was to see the bottleneck of a single partition, as consumers are scaled out. The rule was for p75 e2e latency to stay below 50ms.

Example 2: Consumer group vs share group with 1 ms processing time

The prior example was an unrealistic synthetic test where the consumer spent no time processing. This explore test added 1 ms consumer processing time per message with 300 consumers. It compared a 300 member consumer group with 300 partitions, vs a 300 member share group, with 5, 10, 25 and 50 partitions.
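The throughput ceiling for a test like Example 2 follows directly from the processing time and the consumer count. The sketch below assumes each consumer processes one message at a time and that the configured processing time is the only cost per message. The function name is made up for illustration and is not part of Dimster.

```python
def theoretical_max_msgs_per_sec(consumers: int, processing_ms: float) -> float:
    """Upper bound on aggregate consume rate when every consumer spends
    `processing_ms` milliseconds on each message, one message at a time."""
    if processing_ms <= 0:
        raise ValueError("processing_ms must be positive")
    per_consumer = 1000.0 / processing_ms  # messages per second per consumer
    return consumers * per_consumer


# 300 consumers at 1 ms per message
ceiling = theoretical_max_msgs_per_sec(consumers=300, processing_ms=1.0)
print(ceiling)  # 300000.0
```

With 300 consumers at 1 ms each, the ceiling is 300,000 msg/s, so a run that reaches 95% of it is consuming roughly 285,000 msg/s.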
Share groups managed the same throughput (95% of theoretical max based on 1 ms processing time and consumer count), on only 10 partitions. Consumer groups needed 300 partitions.

Personally, explore and run are my bread and butter benchmark modes. For a given workload I usually start by finding the throughput limit where Kafka transitions from normal stable performance into degraded territory. I either use run mode with live interaction to discover the performance limit, or I use explore, which is slower but can be left to run and discovers the limit in an automated way. For latency benchmarks, once I know the limit, I can craft benchmarks that fit inside the performance envelope for that workload on the specific version of Kafka on the specific hardware I am using.

The Dimster CLI has some commands that help before running benchmarks and for post-processing.

Dimster resources command

The resources command calculates the network and disk throughput required to service a workload. This is important in the cloud for selecting the right instances, ensuring that baseline network and disk throughput are greater than the workload’s demands.

Dimster compare command

Compare different runs that were executed on different hardware, different broker configurations, different broker versions etc.

Dimster pivot command

You can slice and dice the data any way you want based on the CSV data. However, you can also pivot the results and generate a chart with the pivot command. This compares the Nth test point across all scenarios.

Dimster is easiest to use with Kubernetes. Dimster has a CLI you use from your laptop which speaks Kubernetes and leverages it to run benchmarks on any hardware, any cloud, any laptop or workstation using the exact same orchestration logic. All it needs is a properly configured k8s cluster. It could be minikube or k3d on a laptop or workstation, or AWS EKS or Google Cloud GKE or your own in-house cluster.
You can tell Dimster to deploy Apache Kafka to a stateful set in the k8s cluster: Fig 3. Dimster architecture in full deploy mode Or point Dimster (deployed to k8s) at a Kafka service or in-house Kafka cluster. When testing a Kafka service, you can provision a single powerful instance for the Dimster coordinator and worker, and deploy them to a local k8s distro such as Minikube, K3d or Kind. A single worker will happily consume all the cores and memory you give it. Fig 4. Dimster architecture in external deploy mode Or run a super-slim full setup in a tiny minikube/kind/etc local k8s distro: Fig 5. Dimster deployed in a tiny local k8s cluster The workflow is the same. If you can provide a k8s cluster, then Dimster does the rest. Deployment is really simple, monitoring, gathering results, troubleshooting is all simplified via a mix of the CLI being relatively capable, and k8s providing a well-understood platform. K8s is not obligatory , you can run dimster-core directly as a Java program, and point it at a Kafka cluster already provisioned. But you lose many features such as monitoring, live-interaction, automatic gathering of logs, automatic chart and CSV generation and so on. However, you can use the post-processing command dimster chart to generate the charts of a result JSON file. Run the Java directly via the benchmark script: ./bin/benchmark -w path/to/workload file I will be publishing a blog post regularly about Dimster and what you can do with it. So stay tuned. I invite you to go and play around with Dimster , even if it's just running benchmarks on your laptop or workstation. You can get an idea of what charts get produced, what kinds of benchmarks you can run, trying out dimensional testing etc. The docs are pretty decent and should cover most of it. It’s fully featured but still a 0.X version. 
A Confluent colleague and I are the only ones who have run it thus far, so you may encounter bugs. If you do, please open an issue with repro steps. If you want to run serious benchmarks, you’ll likely need an EKS or GKE type of Kubernetes cluster. Dimster comes with a special CLI for EKS to deploy EKS with node groups for Kafka, Dimster workers/coordinator, Grafana/Prometheus, as well as storage classes for gp3.

While evaluating consumer group vs share group consumers, I’ve been running benchmarks in k3d on my beefy Threadripper 9980X workstation with 64 cores (128 threads), 256 GB RAM and a Samsung 9100 PRO 8TB SSD, which is plenty to run an entire medium sized Kafka cluster plus workers on it. I’ll be sharing some share group benchmarks tomorrow. Happy testing!

Martin Fowler 2 days ago

Three more static code analysis sensors

Birgitta Böckeler adds discussion of three more sensors for static code analysis, focusing on checking and enforcing better modularity. Computational sensors for dependency checks were good at enforcing rules, but the rules were limited. Building a computational sensor for coupling data proved lackluster. Prompting an inferential sensor to review modularity was more effective.

マリウス 2 days ago

Photography Workflow with ~~Darktable on Linux~~ Lightroom on GrapheneOS

Disclaimer: I had initially prepared this post under the title Photography Workflow with Darktable on Linux , but after endless fights with Darktable I eventually decided to scrap that workflow altogether and look for an alternative. The workflow documented herein is unfortunately very far from the result I was striving for, yet it is sadly the best I can put together given the current state of open-source RAW development and photo editing software. After I gave Adobe the finger back in 2019 and moved my photography workflow to Capture One on a MacBook , I eventually had to reconsider this approach when I moved back to Linux on the desktop and replaced the device with a Linux laptop . I briefly tried running Capture One in a Windows VM on my laptop , but decided against it, as it was a huge PITA and lacked proper hardware acceleration. Initially I considered a fork of what is probably the best-known open-source RAW developer and photography workflow application out there, Darktable , called Ansel , but ultimately decided against it. The points that Ansel ’s author, Aurélien, brought up seemed like valid criticisms and demonstrated both his knowledge of and his passion for making Darktable a better tool. However, reading further through his website and his GitHub account, it became apparent that he might be the kind of misunderstood genius who has great ideas and ambition, but who would ultimately struggle to operate within, let alone lead the kind of community required to successfully maintain a fork of a piece of software this large. I therefore didn’t have high hopes of this lone cowboy keeping up with, let alone surpassing, the development efforts the Darktable community is currently putting in. Given that Ansel was explicitly billed as a hard-fork that would not remain compatible with the official Darktable release, going down that path felt too risky. 
Ansel would ultimately have to provide a migration path for existing Darktable users, as otherwise there would be little to no incentive for anyone with a functioning Darktable workflow already in place to put up with the effort. Instead, I decided to stick with Darktable . For about a year I tried to build a new workflow on top of it. The things I would miss the most from Capture One were the VSCO presets that I had brought over from Lightroom , and for which there didn’t seem to be any way to convert them into a format compatible with Darktable while producing roughly similar results. Luckily, João, a developer and photographer, made what he calls t3mujinpack , a collection of film emulation presets for Darktable . In a blog post , he provides details on which film stocks are included and how to make use of them in Darktable . His pack includes the presets I almost exclusively use from VSCO : Kodak’s Portra 160, 400 and 800. While the results aren’t 100% identical to what Capture One produces with the converted VSCO packs, neither are those exports identical to what Lightroom originally produced. Every piece of software has slight differences in its inner workings, so this is to be expected and can be adjusted for. During my travel through all of Spain in 2024 I decided to rely exclusively on Darktable for developing and editing the photos that I would ultimately upload to this site. That was a big mistake. I rarely say bad things about truly open-source software, because ultimately it is open-source, it’s driven by a community of volunteers, and everyone should be happy that these people do what they do. Also, given that it’s open-source, anyone is free to go ahead and improve what they deem worth improving. However, Darktable is, in my opinion, one of the few exceptions that seem to have derailed so badly that it’s fair to say it has reached a point of no return in terms of usability and jankiness . 
Let me explain by starting with one of the most annoying things: More often than not, Darktable crashes in the middle of editing sessions, apparently due to Wayland-related issues. However, since I’m also running GIMP and Blender , which I would argue do similar, or even slightly more complex things than Darktable , yet don’t run into such issues, I’d assume that this is not a problem with my Wayland setup specifically. I didn’t try to debug the issue further, as I was mainly focused on testing and establishing a workflow. Had Darktable otherwise worked perfectly fine for me and only run into this issue every once in a while, I would have dug deeper to find the root cause. Unfortunately, this was only one of many things that kept me from continuing to use Darktable . Besides the random crashes, Darktable is unbearably janky and slow. The UI feels like it’s about to fall over at any moment, regardless of whether ROCm acceleration is enabled or not. UI elements feel hacked together, the overall navigation is hostile towards regular users, and it’s impossible to find anything just by looking, because everything is hidden behind collapsed modules, tabs and a gazillion sliders and buttons. To give a single example of the sheer UI craziness that is Darktable : To rotate an image to the right (clockwise), you need to drag a slider to the left (counterclockwise). While on a touchscreen interface this might be more intuitive, when using a touchpad on a laptop or even a mouse it definitely doesn’t feel natural. After all, maybe a slider isn’t the best UI element for this operation to begin with? Another issue that I experienced was related to organizing photos. With over 4000 (RAW) photos in the library, Darktable becomes unbearable to work with. Aside from the spontaneous crashes and overall slow UI, finding specific photos in a library of that size is an excruciatingly painful task. 
Unlike Capture One and Lightroom , Darktable doesn’t easily support a workflow based on individual, smaller libraries, e.g. organized by location or event. There are ways to sort photos within Darktable ’s main library, but I couldn’t find an easy way to split them out into multiple small libraries. Assuming that you managed to find and edit the photos you were looking for, the headaches continue when you try to export them. It appears that Darktable is unable to export photos with pixel-perfect adherence to the crop aspect ratio . The implementation details and the proposed solution appear to be just as janky as everything else, and a quick search for in the Darktable GitHub repository uncovers a lot more of that same jankiness. I ended up running the following command over every photo exported by Darktable , just to obtain a properly shaped image, meaning I’d lose a few pixels here and there: As mentioned a long time back in an update , I ended up with a broken Darktable library, meaning that I lost all the adjustments that I did manage to export up until that point . Short story long, I eventually ditched Darktable for a plan B . After Darktable broke my library and I lost months’ worth of edits, I found myself back at square one. The idea of returning to Adobe felt like defeat, but when I looked at what was actually available for my setup, which is a Google Pixel Tablet running GrapheneOS , Adobe Lightroom for Android turned out to be the only realistic option that could handle RAW files and offer a non-destructive editing workflow. Adobe Lightroom Mobile is, on paper, a reasonably capable RAW editor for Android. It supports a wide range of camera RAW formats and offers the familiar tone curve, HSL sliders, color grading, masking, and healing tools that anyone coming from desktop Lightroom will recognize. 
It can read photos directly from the device’s own storage, edit them locally without an internet connection, and export to JPEG with full control over quality and output dimensions. In short, the feature set is there. The physical side of the workflow is straightforward. I attach a USB-C SD card reader to the Pixel Tablet, open a file manager, and copy the RAW files from the card into a dedicated folder on the tablet’s internal storage. From there I open Lightroom , import the photos from that folder into a local album, and work through them one by one. Once a photo is where I want it, I export it as a JPEG into the folder on the tablet’s storage. That folder is monitored by Syncthing , which synchronizes the finished exports to my other devices in the background. The performance of Adobe Lightroom on Android is, to put it mildly, terrible. Rendering a RAW preview after entering edit mode takes long enough that you find yourself staring at a loading indicator more often than at the actual photo. Scrolling through a grid of thumbnails is a choppy, stuttering affair that makes you wonder whether the application is doing something computationally expensive or is just poorly written. I acknowledge that the Pixel Tablet is an older budget device, yet Lightroom treats it as if it were running on hardware from 2005. Lightroom on Android is every bit as buggy as Adobe products traditionally are on macOS and Windows, but somehow worse, because the interface is also frequently broken in ways that make the application essentially unusable without restarting it. The UI will routinely enter a state where confirmation and action buttons either stop responding to taps, as if the touch layer has fallen out of sync with whatever is rendered on screen, or simply disappear altogether. The only resolution is to quit the app and reopen it, at which point you hope that the edit you were in the middle of survived. Entire features will similarly go dark without warning. 
The auto-straighten function, which should detect the horizon in a photo and level it, simply grays out and stops working at some point. No error, no indication as to why it has become unavailable, nothing. Again, restart the app, try again, maybe it works this time. These are not edge cases or exotic scenarios, but rather the normal operating experience of Adobe Lightroom Mobile. One of the things I was most concerned about before committing to this workflow was the prospect of Adobe silently uploading my photos to their cloud infrastructure. The desktop version of Lightroom has a long and well-documented history of syncing content to Adobe’s servers in ways that are easy to miss and difficult to fully disable. On Android, GrapheneOS gives you a tool that the desktop doesn’t: Per-application network permission revocation. I first disabled the cloud sync option within Lightroom’s own settings, then went into GrapheneOS’s permission manager and removed the network permission from the Lightroom app entirely. It continues to function as a local RAW editor without any network access whatsoever. Photos stay on the device. Nothing leaves without my explicit say-so via Syncthing. Note: To keep things simple, I did not go into the fact that Lightroom is running inside an Android 16 Private Space, which also contains a sandboxed instance of Google Play Services and lets me create a virtual barrier between the rest of the FOSS apps on the Pixel Tablet and this ~~spyware malware crap~~ proprietary software. With this setup, however, importing data becomes slightly more tedious, as it requires the Google Files app to be able to read an attached USB-C storage device (SD card) from within the Private Space. The Google Files app is a giant UX disaster all by itself, which, for the sake of both our time and mental health, I won’t dive into.
One pleasant surprise was that I managed to import the VSCO Lightroom presets I purchased well over a decade ago into Lightroom Mobile on Android. The preset files still work, and the film emulations I had relied on for years, in particular the Kodak Portra series, show up in the presets panel and can be applied to photos. With Adobe being Adobe, however, this had to come with a catch. Lightroom Mobile is apparently incapable of remembering which preset was applied to a given photo. Open an edited photo that had a VSCO preset applied, and Lightroom will display a warning telling you it cannot find the preset, even though the preset is sitting right there in the presets list, available and functioning, ready to be applied to new photos. The edit itself is intact… well… at least sometimes. Other times, Lightroom simply loses the edits altogether. It’s the kind of bug that suggests the feature was never properly tested beyond the initial happy path, which is about what you’d expect from Adobe. To be frank, this workflow sucks compared to the one I had on macOS using Capture One . Lightroom is still the terrible POS it had always been, and paying money to a company like Adobe feels like funding a criminal organization. Unfortunately, there doesn’t appear to be a viable alternative, especially not one that’s libre . The remaining options would be to either pay into Apple’s walled garden by purchasing one of their newer iAmtheproduct devices and subscribing to Capture One Mobile , or to rely exclusively on Fuji’s in-camera film simulations (which sadly won’t work for the Sony ). Judging by the reviews of Capture One Mobile , however, the former option doesn’t appear too promising either. Looking at the situation in a more positive light, I nevertheless managed to replace the underlying stack on which my photography workflow runs with more privacy-respecting software ( GrapheneOS ). 
That’s at least something , although it seems this workflow won’t live that long either, given that Google keeps locking down their Pixel devices and GrapheneOS appears to be pivoting to Motorola-made hardware , who might not release a GrapheneOS-compatible Moto Pad anytime soon. Oh well. Pro tip: A USI 2.0 pen makes using Lightroom on a device like the Pixel Tablet significantly less painful, at least as long as the USI pen actually works properly, which sadly isn’t always the case with the Renaisser pen I own. If you’re looking for a more general review of the Google Pixel Tablet with GrapheneOS, look here .
