Posts in Sql (20 found)
Simon Willison 6 days ago

sqlite-utils 4.0a1 has several (minor) backwards incompatible changes

I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018. is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for working with them in the terminal. Almost every feature provided by the package is available via both of those surfaces. This is hopefully the last alpha before a 4.0 stable release. I use semantic versioning for this library, so the 4.0 version number indicates that there are backward incompatible changes that may affect code written against the 3.x line. These changes are mostly very minor: I don't want to break any existing code if I can avoid it. I made it all the way to version 3.38 before I had to ship a major release and I'm sad I couldn't push that even further! Here are the annotated release notes for 4.0a1. This change is for type hint enthusiasts. The Python library used to encourage accessing both SQL tables and SQL views through the syntactic sugar - but tables and view have different interfaces since there's no way to handle a on a SQLite view. If you want clean type hints for your code you can now use the and methods instead. A new feature, not a breaking change. I realized that supporting a stream of lists or tuples as an option for populating large tables would be a neat optimization over always dealing with dictionaries each of which duplicated the column names. I had the idea for this one while walking the dog and built the first prototype by prompting Claude Code for web on my phone. Here's the prompt I used and the prototype report it created , which included a benchmark estimating how much of a performance boost could be had for different sizes of tables. I was horrified to discover a while ago that I'd been creating SQLite columns called FLOAT but the correct type to use was REAL! This change fixes that. Previously the fix was to ask for tables to be created in strict mode. As part of this I also figured out recipes for using as a development environment for the package, which are now baked into the Justfile . This one is best explained in the issue . Another change which I would have made earlier but, since it introduces a minor behavior change to an existing feature, I reserved it for the 4.0 release. Back in 2018 when I started this project I was new to working in-depth with SQLite and incorrectly concluded that the correct way to create tables and columns named after reserved words was like this: That turned out to be a non-standard SQL syntax which the SQLite documentation describes like this : A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility. Unfortunately I baked it into the library early on and it's been polluting the world with weirdly escaped table and column names ever since! I've finally fixed that, with the help of Claude Code which took on the mind-numbing task of updating hundreds of existing tests that asserted against the generated schemas. The above example table schema now looks like this: This may seem like a pretty small change but I expect it to cause a fair amount of downstream pain purely in terms of updating tests that work against tables created by ! I made this change first in LLM and decided to bring it to for consistency between the two tools. One last minor ugliness that I waited for a major version bump to fix. Update : Now that the embargo has lifted I can reveal that a substantial amount of the work on this release was performed using a preview version of Anthropic's new Claude Opus 4.5 model . Here's the Claude Code transcript for the work to implement the ability to use an iterator over lists instead of dictionaries for bulk insert and upsert operations. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Breaking change : The method now only works with tables. To access a SQL view use instead. ( #657 ) The and methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See Inserting data from a list or tuple iterator for details. ( #672 ) Breaking change : The default floating point column type has been changed from to , which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. ( #645 ) Now uses in place of for packaging. ( #675 ) Tables in the Python API now do a much better job of remembering the primary key and other schema details from when they were first created. ( #655 ) Breaking change : The and mechanisms no longer skip values that evaluate to . Previously the option was needed, this has been removed. ( #542 ) Breaking change : Tables created by this library now wrap table and column names in in the schema. Previously they would use . ( #677 ) The CLI argument now accepts a path to a Python file in addition to accepting a string full of Python code. It can also now be specified multiple times. ( #659 ) Breaking change: Type detection is now the default behavior for the and CLI commands when importing CSV or TSV data. Previously all columns were treated as unless the flag was passed. Use the new flag to restore the old behavior. The environment variable has been removed. ( #679 )

0 views
Simon Willison 1 weeks ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro , also known as Gemini 3 Pro Image . I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation : Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...] These 14 images can include the following: Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio : Here's what I got: The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original . Then I ran his follow-up prompt: I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG . The new model isn't cheap. Here's the API pricing : it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed. Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there. So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was: Infographic explaining how the Datasette open source project works This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt: That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me. Here's the thinking trace for that image generation process: Initiating Datasette Infographic I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram. Developing the Datasette workflow I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output. Visualizing Datasette Workflow I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts. Defining Core Datasette Stages I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step. Constructing Datasette Infographic I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation. Analyzing Datasette's Architecture I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow. Mapping the Datasette Flow I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow. This note from the announcement post caught my eye: We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon. Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that Gemini app and asked "Was this image created with AI?": It replied: Yes, it appears that all or part of this image was created with Google Al. SynthID detected a watermark in 25-50% of the image. Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . High-resolution output : Built-in generation capabilities for 1K, 2K, and 4K visuals. Advanced text rendering : Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets. Grounding with Google Search : The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events). Thinking mode : The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output. Up to 14 reference images : You can now mix up to 14 reference images to produce the final image. Up to 6 images of objects with high-fidelity to include in the final image Up to 5 images of humans to maintain character consistency

0 views
Robin Moffatt 1 weeks ago

Stumbling into AI: Part 6—I've been thinking about Agents and MCP all wrong

Ever tried to hammer a nail in with a potato? Nor me, but that’s what I’ve felt like I’ve been attempting to do when trying to really understand agents, as well as to come up with an example agent to build. As I wrote about previously , citing Simon Willison, an LLM agent runs tools in a loop to achieve a goal . Unlike building ETL/ELT pipelines, these were some new concepts that I was struggling to fit to an even semi-plausible real world example. That’s because I was thinking about it all wrong. For the last cough 20 cough years I’ve built data processing pipelines, either for real or as examples based on my previous experience. It’s the same pattern, always: Data comes in Data gets processed Data goes out Maybe we fiddle around with the order of things (ELT vs ETL), maybe a particular example focusses more on one particular point in the pipeline—but all the concepts remain pleasingly familiar. All I need to do is figure out what goes in the boxes: I’ve even extended this to be able to wing my way through talking about applications and microservices (kind of). We get some input, we make something else happen. Somewhat stretching beyond my experience, admittedly, but it’s still the same principles. When this thing happens, make a computer do that thing. Perhaps I’m too literal, perhaps I’m cynical after too many years of vendor hype, or perhaps it’s just how my brain is wired—but I like concrete, tangible, real examples of something. So when it comes to agents, particularly with where we’re at in the current hype-cycle, I really wanted to have some actual examples on which to build my understanding. In addition, I wanted to build some of my own. But where to start? Here was my mental model; literally what I sketched out on a piece of paper as I tried to think about what real-world example could go in each box to make something plausible: But this is where I got stuck, and spun my proverbial wheels on for several days. Every example I could think of ended up with me uttering, exasperated… but why would you do it like that . My first mistake was focussing on the LLM bit as needing to do something to the input data . I had a whole bunch of interesting data sources (like river levels , for example) but my head blocked on " but that’s numbers, what can you get an LLM to do with those?! ". The LLM bit of an agent, I mistakenly thought, demanded unstructured input data for it to make any sense. After all, if it’s structured, why aren’t we just processing it with a regular process—no need for magic fairy dust here. This may also have been an over-fitting of an assumption based on my previous work with an LLM to summarise human-input data in a conference keynote . The tool bit baffled me just as much. With hindsight, the exact problem turned out to be the solution . Let me explain… Whilst there are other options, in many cases an agent calling a tool is going to do so using MCP. Thus, grabbing the dog firmly by the tail and proceeding to wag it, I went looking for MCP servers. Looking down a list of hosted MCP servers that I found, I saw that there was only about a half-dozen that were open, including GlobalPing , AlphaVantage , and CoinGecko . Flummoxed, I cast around for an actual use of one of these, with an unstructured data source. Oh jeez…are we really going to do the ' read a stream of tweets and look up the stock price/crypto-token ' thing again? The mistake I made was this: I’d focussed on the LLM bit of the agent definition: an LLM agent runs tools in a loop to achieve a goal Actually, what an agent is about is this: […] runs tools The LLM bit can do fancy LLM stuff—but it’s also there to just invoke the tool(s) and decide when they’ve done what they need to do . A tool is quite often just a wrapper on an API. So what we’re saying is, with MCP, we have a common interface to APIs. That’s…all. We can define agents to interact with systems, and the way they interact is through a common protocol: MCP. When we load a web page, we don’t concern ourselves with what Chrome is doing, and unless we stop and think about it we don’t think about the TCP and HTTP protocols being used. It’s just the common way of things talking to each other. And that’s the idea with MCP, and thus tool calling from agents. (Yes, there are other ways you can call tools from agents, but MCP is the big one, at the moment). Given this reframing, it makes sense why there are so few open MCP servers. If an MCP server is there to offer access to an API, who leaves their API open for anyone to use? Well, read-only data provided like CoinGecko and AlphaVantage, perhaps. In general though, the really useful thing we can do with tools is change the state of systems . That’s why any SaaS platform worth its salt is rushing to provide an MCP server. Not to jump on the AI bandwagon per se, but because if this is going to be the common protocol by which things get to be automated with agents, you don’t want to be there offering Betamax when everyone else has VHS. SaaS platforms will still provide their APIs for direct integration, but they will also provide MCP servers. There’s also no reason why applications developed within an organisation wouldn’t offer MCP either, in theory. No, not really. It actually makes a bunch of sense to me. I personally also like it a lot from a SQL-first, not-really-a-real-coder point of view. Let me explain. If you want to build a system to respond to something that’s happened by interacting with another external system, you have two choices now: Write custom code to call the external system’s API. Handle failures, retries, monitoring, etc. If you want to interact with a different system, you now need to understand the different API, work out calling it, write new code to do so. Write an agent that responds to the thing that happened, and have it call the tool. The agent framework now standardises handling failures, retries, and all the rest of it. If you want to call a different system, the agent stays pretty much the same. The only thing that you change is the MCP server and tool that you call. You could write custom code—and there are good examples of where you’ll continue to. But you no longer have to . For Kafka folk, my analogy here would be data integration with Kafka Connect. Kafka Connect provides the framework that handles all of the sticky and messy things about data integration (scale, error handling, types, connectivity, restarts, monitoring, schemas, etc etc etc). You just use the appropriate connector with it and configure it. Different system? Just swap out the connector. You want to re-invent the wheel and re-solve a solved-problem? Go ahead; maybe you’re special. Or maybe NIH is real ;P So…what does an actual agent look like now, given this different way of looking at it? How about this: Sure, the LLM could do a bunch of clever stuff with the input. But it can also just take our natural language expression of what we want to happen, and make it so. Agents can use multiple tools, from multiple MCP servers. Confluent launched Streaming Agents earlier this year. They’re part of the fully-managed Confluent Cloud platform and provide a way to run agents like I’ve described above, driven by events in a Kafka topic. Here’s what the above agent would look like as a Streaming Agent: Is this over-engineered? Do you even need an agent? Why not just do this? You can. Maybe you should. But…don’t forget failure conditions. And restarts. And testing. And scaling. All these things are taken care of for you by Flink. Although having the runtime considerations taken care of for you is nice, let’s not forget another failure vector which LLMs add into the mix: talking shite hallucinations. Compared to a lump of Python code which either works or doesn’t, LLMs keep us on our toes by sometimes confidently doing the wrong thing. However, how do we know it’s wrong? Our Python program might crash, or throw a nicely-handled error, but left to its own devices an AI Agent will happily report that everything worked even if it actually made up a parameter for a tool call that doesn’t exist. There are mitigating steps we can take, but it’s important to recognise the trade-offs between the approaches. Permit me to indulge this line of steel-manning, because I think I might even have a valid argument here. Let’s say we’ve built the above simplistic agent that sends a Slack when a data point is received. Now we want to enhance it to also include information about the weather forecast. An agent would conceptually be something like this: Our streaming agent above changes to just amending the prompt and adding a new tool (just DDL statements, defining the MCP server and its tools): Whilst the bespoke application might have a seemingly-innocuous small addition: But consider what this looks like in practice. Figuring out the API, new lines of code to handle calling it, failures, and so on. Oh, whilst you’re at it; don’t introduce any bugs into the bespoke code. And remember to document the change. Not insurmountable, and probably a good challenge if you like that kind of thing. But is it as straightforward as literally changing the prompt in an agent to use an additional tool, and let it figure the rest out (courtesy of MCP)? Let’s not gloss over the reality too much here though; whilst adding a new tool call into the agent is definitely easier and less prone to introducing code errors, LLMs are by their nature non-deterministic—meaning that we still need to take care with the prompt and the tool invocation to make sure that the agent is still doing what it’s designed to do. You wouldn’t be wrong to argue that at least the non-Agent route (of coding API invocations directly into your application) can actually be tested and proved to work. There are different types of AI Agent—the one I’ve described is a tools-based one. As I mentioned above, its job is to run tools . The LLM provides the natural language interface with which to invoke the tools. It can also , optionally , do additional bits of magic: Process [unstructured] input, such as summarising or extracting key values from it Decide which tool(s) need calling in order to achieve its aim But at the heart of it, it’s about the tool that gets called. That’s where I was going wrong with this. That’s the bit I needed to think differently about :) Data comes in Data gets processed Data goes out Write custom code to call the external system’s API. Handle failures, retries, monitoring, etc. If you want to interact with a different system, you now need to understand the different API, work out calling it, write new code to do so. Write an agent that responds to the thing that happened, and have it call the tool. The agent framework now standardises handling failures, retries, and all the rest of it. If you want to call a different system, the agent stays pretty much the same. The only thing that you change is the MCP server and tool that you call. Process [unstructured] input, such as summarising or extracting key values from it Decide which tool(s) need calling in order to achieve its aim

0 views
Jim Nielsen 2 weeks ago

Data Storage As Files on Disk Paired With an LLM

I recently added a bunch of app icons from macOS Tahoe to my collection . Afterwards, I realized some of them were missing relational metadata. For example, I have a collection of iMove icons through the years which are related in my collection by their App Store ID. However, the latest iMovie icon I added didn’t have this ID. This got me thinking, "Crap, I really want this metadata so I can see apps over time . Am I gonna have to go back through each icon I just posted and find their associated App Store ID?” Then I thought: “Hey, I bet AI could figure this out — right? It should be able to read through my collection of icons (which are stored as JSON files on disk), look for icons with the same name and developer, and see where I'm missing and .” So I formulated a prompt (in hindsight, a really poor one lol): look through all the files in and find any that start with and then find me any icons like iMovie that have a correlation to other icons in where it's missing and But AI did pretty good with that. I’ll save you the entire output, but Cursor thought for a bit, then asked to run this command: I was like, “Ok. I couldn’t write that myself, but that looks about right. Go ahead.” It ran the command, thought some more, then asked to run another command. Then another. It seemed unsatisfied with the results, so it changed course and wrote a node script and asked permission to run that. I looked at it and said, “Hey that’s probably how I would’ve approached this.” So I gave permission. It ran the script, thought a little, then rewrote it and asked permission to run again. Here’s the final version it ran: And with that, boom! It found a few newly-added icons with corollaries in my archive, pointed them out, then asked if I wanted to add the missing metadata. The beautiful part was I said “go ahead” and when it finished, I could see and review the staged changes in git. This let me double check the LLM’s findings with my existing collection to verify everything looked right — just to make sure there were no hallucinations. Turns out, storing all my icon data as JSON files on disk (rather than a database) wasn’t such a bad idea. Part of the reason I’ve never switched from static JSON files on disk to a database is because I always figured it would be easier for future me to find and work with files on disk (as opposed to learning how to setup, maintain, and query a database). Turns out that wasn’t such a bad bet. I’m sure AI could’ve helped me write some SQL queries to do all the stuff I did here. But what I did instead already fit within a workflow I understand: files on disk, modified with scripting, reviewed with git, checked in, and pushed to prod. So hey, storing data as JSON files in git doesn’t look like such a bad idea now, does it future Jim? Reply via: Email · Mastodon · Bluesky

0 views
devansh 3 weeks ago

Hitchhiker's Guide to Attack Surface Management

I first heard about the word "ASM" (i.e., Attack Surface Management) probably in late 2018, and I thought it must be some complex infrastructure for tracking assets of an organization. Looking back, I realize I almost had a similar stack for discovering, tracking, and detecting obscure assets of organizations, and I was using it for my bug hunting adventures. I feel my stack was kinda goated, as I was able to find obscure assets of Apple, Facebook, Shopify, Twitter, and many other Fortune 100 companies, and reported hundreds of bugs, all through automation. Back in the day, projects like ProjectDiscovery were not present, so if I had to write an effective port scanner, I had to do it from scratch. (Masscan and nmap were present, but I had my fair share of issues using them, this is a story for another time). I used to write DNS resolvers (massdns had a high error rate), port scanners, web scrapers, directory brute-force utilities, wordlists, lots of JavaScript parsing logic using regex, and a hell of a lot of other things. I used to have up to 50+ self-developed tools for bug-bounty recon stuff and another 60-something helper scripts written in bash. I used to orchestrate (gluing together with duct tape is a better word) and slap together scripts like a workflow, and save the output in text files. Whenever I dealt with a large number of domains, I used to distribute the load over multiple servers (server spin-up + SSH into it + SCP for pushing and pulling files from it). The setup was very fragile and error-prone, and I spent countless nights trying to debug errors in the workflows. But it was all worth it. I learned the art of Attack Surface Management without even trying to learn about it. I was just a teenager trying to make quick bucks through bug hunting, and this fragile, duct-taped system was my edge. Fast forward to today, I have now spent almost a decade in the bug bounty scene. I joined HackerOne in 2020 (to present) as a vulnerability triager, where I have triaged and reviewed tens of thousands of vulnerability submissions. Fair to say, I have seen a lot of things, from doomsday level 0-days, to reports related to leaked credentials which could have led to entire infrastructure compromise, just because some dev pushed an AWS secret key in git logs, to things where some organizations were not even aware they were running Jenkins servers on some obscure subdomain which could have allowed RCE and then lateral movement to other layers of infrastructure. A lot of these issues I have seen were totally avoidable, only if organizations followed some basic attack surface management techniques. If I search "Guide to ASM" on Internet, almost none of the supposed guides are real resources. They funnel you to their own ASM solution, and the guide is just present there to provide you with some surface-level information, and is mostly a marketing gimmick. This is precisely why I decided to write something where I try to cover everything I learned and know about ASM, and how to protect your organization's assets before bad actors could get to them. This is going to be a rough and raw guide, and will not lead you to a funnel where I am trying to sell my own ASM SaaS to you. I have nothing to sell, other than offering what I know. But in case you are an organization who needs help implementing the things I am mentioning below, you can reach out to me via X or email (both available on the homepage of this blog). This guide will provide you with insights into exactly how big your attack surface really is. CISOs can look at it and see if their organizations have all of these covered, security researchers and bug hunters can look at this and maybe find new ideas related to where to look during recon. Devs can look at it and see if they are unintentionally leaving any door open for hackers. If you are into security, it has something to offer you. Attack surface is one of those terms getting thrown around in security circles so much that it's become almost meaningless noise. In theory, it sounds simple enough, right. Attack surface is every single potential entry point, interaction vector, or exploitable interface an attacker could use to compromise your systems, steal your data, or generally wreck your day. But here's the thing, it's the sum total of everything you've exposed to the internet. Every API endpoint you forgot about, every subdomain some dev spun up for "testing purposes" five years ago and then abandoned, every IoT device plugged into your network, every employee laptop connecting from a coffee shop, every third-party vendor with a backdoor into your environment, every cloud storage bucket with permissions that make no sense, every Slack channel, every git commit leaking credentials, every paste on Pastebin containing your database passwords. Most organizations think about attack surface in incredibly narrow terms. They think if they have a website, an email server, and maybe some VPN endpoints, they've got "good visibility" into their assets. That's just plain wrong. Straight up wrong. Your actual attack surface would terrify you if you actually understood it. You run , and is your main domain. You probably know about , , maybe . But what about that your intern from 2015 spun up and just never bothered to delete. It's not documented anywhere. Nobody remembers it exists. Domain attack surface goes way beyond what's sitting in your asset management system. Every subdomain is a potential entry point. Most of these subdomains are completely forgotten. Subdomain enumeration is reconnaissance 101 for attackers and bug hunters. It's not rocket science. Setting up a tool that actively monitors through active and passive sources for new subdomains and generates alerts is honestly an hour's worth of work. You can use tools like Subfinder, Amass, or just mine Certificate Transparency logs to discover every single subdomain connected to your domain. Certificate Transparency logs were designed to increase security by making certificate issuance public, and they've become an absolute reconnaissance goldmine. Every time you get an SSL certificate for , that information is sitting in public logs for anyone to find. Attackers systematically enumerate these subdomains using Certificate Transparency log searches, DNS brute-forcing with massive wordlists, reverse DNS lookups to map IP ranges back to domains, historical DNS data from services like SecurityTrails, and zone transfer exploitation if your DNS is misconfigured. Attackers are looking for old development environments still running vulnerable software, staging servers with production data sitting on them, forgotten admin panels, API endpoints without authentication, internal tools accidentally exposed, and test environments with default credentials nobody changed. Every subdomain is an asset. Every asset is a potential vulnerability. Every vulnerability is an entry point. Domains and subdomains are just the starting point though. Once you've figured out all the subdomains belonging to your organization, the next step is to take a hard look at IP address space, which is another absolutely massive component of your attack surface. Organizations own, sometimes lease, IP ranges, sometimes small /24 blocks, sometimes massive /16 ranges, and every single IP address in those blocks and ranges that responds to external traffic is part of your attack surface. And attackers enumerate them all if you won't. They use WHOIS lookups to identify your IP ranges, port scanning to find what services are running where, service fingerprinting to identify exact software versions, and banner grabbing to extract configuration information. If you have a /24 network with 256 IP addresses and even 10% of those IPs are running services, you've got 25 potential attack vectors. Scale that to a /20 or /16 and you're looking at thousands of potential entry points. And attackers aren't just looking at the IPs you know about. They're looking at adjacent IP ranges you might have acquired through mergers, historical IP allocations that haven't been properly decommissioned, and shared IP ranges where your servers coexist with others. Traditional infrastructure was complicated enough, and now we have cloud. It's literally exploded organizations' attack surfaces in ways that are genuinely difficult to even comprehend. Every cloud service you spin up, be it an EC2 instance, S3 bucket, Lambda function, or API Gateway endpoint, all of this is a new attack vector. In my opinion and experience so far, I think the main issue with cloud infrastructure is that it's ephemeral and distributed. Resources get spun up and torn down constantly. Developers create instances for testing and forget about them. Auto-scaling groups generate new resources dynamically. Containerized workloads spin up massive Kubernetes clusters you have minimal visibility into. Your cloud attack surface could be literally anything. Examples are countless, but I'd categorize them into 8 different categories. Compute instances like EC2, Azure VMs, GCP Compute Engine instances exposed to the internet. Storage buckets like S3, Azure Blob Storage, GCP Cloud Storage with misconfigured permissions. Serverless stuff like Lambda functions with public URLs or overly permissive IAM roles. API endpoints like API Gateway, Azure API Management endpoints without proper authentication. Container registries like Docker images with embedded secrets or vulnerabilities. Kubernetes clusters with exposed API servers, misconfigured network policies, vulnerable ingress controllers. Managed databases like RDS, CosmosDB, Cloud SQL instances with weak access controls. IAM roles and service accounts with overly permissive identities that enable privilege escalation. I've seen instances in the past where a single misconfigured S3 bucket policy exposed terabytes of data. An overly permissive Lambda IAM role enabled lateral movement across an entire AWS account. A publicly accessible Kubernetes API server gave an attacker full cluster control. Honestly, cloud kinda scares me as well. And to top it off, multi-cloud infrastructure makes everything worse. If you're running AWS, Azure, and GCP together, you've just tripled your attack surface management complexity. Each cloud provider has different security models, different configuration profiles, and different attack vectors. Every application now uses APIs, and all applications nowadays are like a constellation of APIs talking to each other. Every API you use in your organization is your attack surface. The problem with APIs is that they're often deployed without the same security scrutiny as traditional web applications. Developers spin up API endpoints for specific features and those endpoints accumulate over time. Some of them are shadow APIs, meaning API endpoints which aren't documented anywhere. These endpoints are the equivalent of forgotten subdomains, and attackers can find them through analyzing JavaScript files for API endpoint references, fuzzing common API path patterns, examining mobile app traffic to discover backend APIs, and mining old documentation or code repositories for deprecated endpoints. Your API attack surface includes REST APIs exposed to the internet, GraphQL endpoints with overly broad query capabilities, WebSocket connections for real-time functionality, gRPC services for inter-service communication, and legacy SOAP APIs that never got decommissioned. If your organization has mobile apps, be it iOS, Android, or both, this is a direct window to your infrastructure and should be part of your attack surface management strategy. Mobile apps communicate with backend APIs and those API endpoints are discoverable by reversing the app. The reversed source of the app could reveal hard-coded API keys, tokens, and credentials. Using JADX plus APKTool plus Dex2jar is all a motivated attacker needs. Web servers often expose directories and files that weren't meant to be publicly accessible. Attackers systematically enumerate these using automated tools like ffuf, dirbuster, gobuster, and wfuzz with massive wordlists to discover hidden endpoints, configuration files, backup files, and administrative interfaces. Common exposed directories include admin panels, backup directories containing database dumps or source code, configuration files with database credentials and API keys, development directories with debug information, documentation directories revealing internal systems, upload directories for file storage, and old or forgotten directories from previous deployments. Your attack surface must include directories which are accidentally left accessible during deployments, staging servers with production data, backup directories with old source code versions, administrative interfaces without authentication, API documentation exposing endpoint details, and test directories with debug output enabled. Even if you've removed a directory from production, old cached versions may still be accessible through web caches or CDNs. Search engines also index these directories, making them discoverable through dorking techniques. If your organization is using IoT devices, and everyone uses these days, this should be part of your attack surface management strategy. They're invisible to traditional security tools. Your EDR solution doesn't protect IoT devices. Your vulnerability scanner can't inventory them. Your patch management system can't update them. Your IoT attack surface could include smart building systems like HVAC, lighting, access control. Security cameras and surveillance systems. Printers and copiers, which are computers with network access. Badge readers and physical access systems. Industrial control systems and SCADA devices. Medical devices in healthcare environments. Employee wearables and fitness trackers. Voice assistants and smart speakers. The problem with IoT devices is that they're often deployed without any security consideration. They have default credentials that never get changed, unpatched firmware with known vulnerabilities, no encryption for data in transit, weak authentication mechanisms, and insecure network configurations. Social media presence is an attack surface component that most organizations completely ignore. Attackers can use social media for reconnaissance by looking at employee profiles on LinkedIn to reveal organizational structure, technologies in use, and current projects. Twitter/X accounts can leak information about deployments, outages, and technology stack. Employee GitHub profiles expose email patterns and development practices. Company blogs can announce new features before security review. It could also be a direct attack vector. Attackers can use information from social media to craft convincing phishing attacks. Hijacked social media accounts can be used to spread malware or phishing links. Employees can accidentally share sensitive information. Fake accounts can impersonate your brand to defraud customers. Your employees' social media presence is part of your attack surface whether you like it or not. Third-party vendors, suppliers, contractors, or partners with access to your systems should be part of your attack surface. Supply chain attacks are becoming more and more common these days. Attackers can compromise a vendor with weaker security and then use that vendor's access to reach your environment. From there, they pivot from the vendor network to your systems. This isn't a hypothetical scenario, it has happened multiple times in the past. You might have heard about the SolarWinds attack, where attackers compromised SolarWinds' build system and distributed malware through software updates to thousands of customers. Another famous case study is the MOVEit vulnerability in MOVEit Transfer software, exploited by the Cl0p ransomware group, which affected over 2,700 organizations. These are examples of some high-profile supply chain security attacks. Your third-party attack surface could include things like VPNs, remote desktop connections, privileged access systems, third-party services with API keys to your systems, login credentials shared with vendors, SaaS applications storing your data, and external IT support with administrative access. It's obvious you can't directly control third-party security. You can audit them, have them pen-test their assets as part of your vendor compliance plan, and include security requirements in contracts, but ultimately their security posture is outside your control. And attackers know this. GitHub, GitLab, Bitbucket, they all are a massive attack surface. Attackers search through code repositories in hopes of finding hard-coded credentials like API keys, database passwords, and tokens. Private keys, SSH keys, TLS certificates, and encryption keys. Internal architecture documentation revealing infrastructure details in code comments. Configuration files with database connection strings and internal URLs. Deprecated code with vulnerabilities that's still in production. Even private repositories aren't safe. Attackers can compromise developer accounts to access private repositories, former employees retain access after leaving, and overly broad repository permissions grant access to too many people. Automated scanners continuously monitor public repositories for secrets. The moment a developer accidentally pushes credentials to a public repository, automated systems detect it within minutes. Attackers have already extracted and weaponized those credentials before the developer realizes the mistake. CI/CD pipelines are massive another attack vector. Especially in recent times, and not many organizations are giving attention to this attack vector. This should totally be part of your attack surface management. Attackers compromise GitHub Actions workflows with malicious code injection, Jenkins servers with weak authentication, GitLab CI/CD variables containing secrets, and build artifacts with embedded malware. The GitHub Actions supply chain attack, CVE-2025-30066, demonstrated this perfectly. Attackers compromised the Action used in over 23,000 repositories, injecting malicious code that leaked secrets from build logs. Jenkins specifically is a goldmine for attackers. An exposed Jenkins instance provides complete control over multiple critical servers, access to hardcoded AWS keys, Redis credentials, and BitBucket tokens, ability to manipulate builds and inject malicious code, and exfiltration of production database credentials containing PII. Modern collaboration tools are massive attack surface components that most organizations underestimate. Slack has hidden security risks despite being invite-only. Slack attack surface could include indefinite data retention where every message, channel, and file is stored forever unless admins configure retention periods. Public channels accessible to all users so one breached account opens the floodgates. Third-party integrations with excessive permissions accessing messages and user data. Former contractor access where individuals retain access long after projects end. Phishing and impersonation where it's easy to change names and pictures to impersonate senior personnel. In 2022, Slack leaked hashed passwords for five years affecting 0.5% of users. Slack channels commonly contain API keys, authentication tokens, database credentials, customer PII, financial data, internal system passwords, and confidential project information. The average cost of a breached record was $164 in 2022. When 1 in 166 messages in Slack contains confidential information, every new message adds another dollar to total risk exposure. With 5,000 employees sending 30 million Slack messages per year, that's substantial exposure. Trello board exposure is a significant attack surface. Trello attack vectors include public boards with sensitive information accidentally shared publicly, default public visibility where boards are created as public by default in some configurations, unsecured REST API allowing unauthenticated access to user data, and scraping attacks where attackers use email lists to enumerate Trello accounts. The 2024 Trello data breach exposed 15 million users' personal information when a threat actor named "emo" exploited an unsecured REST API using 500 million email addresses to compile detailed user profiles. Security researcher David Shear documented hundreds of public Trello boards exposing passwords, credentials, IT support customer access details, website admin logins, and client server management credentials. IT companies were using Trello to troubleshoot client requests and manage infrastructure, storing all credentials on public Trello boards. Jira misconfiguration is a widespread attack surface issue. Common misconfigurations include public dashboards and filters with "Everyone" access actually meaning public internet access, anonymous access enabled allowing unauthenticated users to browse, user picker functionality providing complete lists of usernames and email addresses, and project visibility allowing sensitive projects to be accessible without authentication. Confluence misconfiguration exposes internal documentation. Confluence attack surface components include anonymous access at site level allowing public access, public spaces where space admins grant anonymous permissions, inherited permissions where all content within a space inherits space-level access, and user profile visibility allowing anonymous users to view profiles of logged-in users. When anonymous access is enabled globally and space admins allow anonymous users to access their spaces, anyone on the internet can access that content. Confluence spaces often contain internal documentation with hardcoded credentials, financial information, project details, employee information, and API documentation with authentication details. Cloud storage misconfiguration is epidemic. Google Drive misconfiguration attack surface includes "Anyone with the link" sharing making files accessible without authentication, overly permissive sharing defaults making it easy to accidentally share publicly, inherited folder permissions exposing everything beneath, unmanaged third-party apps with excessive read/write/delete permissions, inactive user accounts where former employees retain access, and external ownership blind spots where externally-owned content is shared into the environment. Metomic's 2023 Google Scanner Report found that of 6.5 million Google Drive files analyzed, 40.2% contained sensitive information, 34.2% were shared externally, and 0.5% were publicly accessible, mostly unintentionally. In December 2023, Japanese game developer Ateam suffered a catastrophic Google Drive misconfiguration that exposed personal data of nearly 1 million people for over six years due to "Anyone with the link" settings. Based on Valence research, 22% of external data shares utilize open links, and 94% of these open link shares are inactive, forgotten files with public URLs floating around the internet. Dropbox, OneDrive, and Box share similar attack surface components including misconfigured sharing permissions, weak or missing password protection, overly broad access grants, third-party app integrations with excessive permissions, and lack of visibility into external sharing. Features that make file sharing convenient create data leakage risks when misconfigured. Pastebin and similar paste sites are both reconnaissance sources and attack vectors. Paste site attack surface includes public data dumps of stolen credentials, API keys, and database dumps posted publicly, malware hosting of obfuscated payloads, C2 communications where malware uses Pastebin for command and control, credential leakage from developers accidentally posting secrets, and bypassing security filters since Pastebin is legitimate so security tools don't block it. For organizations, leaked API keys or database credentials on Pastebin lead to unauthorized access, data exfiltration, and service disruption. Attackers continuously scan Pastebin for mentions of target organizations using automated tools. Security teams must actively monitor Pastebin and similar paste sites for company name mentions, email domain references, and specific keywords related to the organization. Because paste sites don't require registration or authentication and content is rarely removed, they've become permanent archives of leaked secrets. Container registries expose significant attack surface. Container registry attack surface includes secrets embedded in image layers where 30,000 unique secrets were found in 19,000 images, with 10% of scanned Docker images containing secrets, and 1,200 secrets, 4%, being active and valid. Immutable cached layers contain 85% of embedded secrets that can't be removed, exposed registries with 117 Docker registries accessible without authentication, unsecured registries allowing pull, push, and delete operations, and source code exposure where full application code is accessible by pulling images. GitGuardian's analysis of 200,000 publicly available Docker images revealed a staggering secret exposure problem. Even more alarming, 99% of images containing active secrets were pulled in 2024, demonstrating real-world exploitation. Unit 42's research identified 941 Docker registries exposed to the internet, with 117 accessible without authentication containing 2,956 repositories, 15,887 tags, and full source code and historical versions. Out of 117 unsecured registries, 80 allow pull operations to download images, 92 allow push operations to upload malicious images, and 7 allow delete operations for ransomware potential. Sysdig's analysis of over 250,000 Linux images on Docker Hub found 1,652 malicious images including cryptominers, most common, embedded secrets, second most prevalent, SSH keys and public keys for backdoor implants, API keys and authentication tokens, and database credentials. The secrets found in container images included AWS access keys, database passwords, SSH private keys, API tokens for cloud services, GitHub personal access tokens, and TLS certificates. Shadow IT includes unapproved SaaS applications like Dropbox, Google Drive, and personal cloud storage used for work. Personal devices like BYOD laptops, tablets, and smartphones accessing corporate data. Rogue cloud deployments where developers spin up AWS instances without approval. Unauthorized messaging apps like WhatsApp, Telegram, and Signal used for business communication. Unapproved IoT devices like smart speakers, wireless cameras, and fitness trackers on the corporate network. Gartner estimates that shadow IT makes up 30-40% of IT spending in large companies, and 76% of organizations surveyed experienced cyberattacks due to exploitation of unknown, unmanaged, or poorly managed assets. Shadow IT expands your attack surface because it's not protected by your security controls, it's not monitored by your security team, it's not included in your vulnerability scans, it's not patched by your IT department, and it often has weak or default credentials. And you can't secure what you don't know exists. Bring Your Own Device, BYOD, policies sound great for employee flexibility and cost savings. For security teams, they're a nightmare. BYOD expands your attack surface by introducing unmanaged endpoints like personal devices without EDR, antivirus, or encryption. Mixing personal and business use where work data is stored alongside personal apps with unknown security. Connecting from untrusted networks like public Wi-Fi and home networks with compromised routers. Installing unapproved applications with malware or excessive permissions. Lacking consistent security updates with devices running outdated operating systems. Common BYOD security issues include data leakage through personal cloud backup services, malware infections from personal app downloads, lost or stolen devices containing corporate data, family members using devices that access work systems, and lack of IT visibility and control. The 60% of small and mid-sized businesses that close within six months of a major cyberattack often have BYOD-related security gaps as contributing factors. Remote access infrastructure like VPNs and Remote Desktop Protocol, RDP, are among the most exploited attack vectors. SSL VPN appliances from vendors like Fortinet, SonicWall, Check Point, and Palo Alto are under constant attack. VPN attack vectors include authentication bypass vulnerabilities with CVEs allowing attackers to hijack active sessions, credential stuffing through brute-forcing VPN logins with leaked credentials, exploitation of unpatched vulnerabilities with critical CVEs in VPN appliances, and configuration weaknesses like default credentials, weak passwords, and lack of MFA. Real-world attacks demonstrate the risk. Check Point SSL VPN CVE-2024-24919 allowed authentication bypass for session hijacking. Fortinet SSL-VPN vulnerabilities were leveraged for lateral movement and privilege escalation. SonicWall CVE-2024-53704 allowed remote authentication bypass for SSL VPN. Once inside via VPN, attackers conduct network reconnaissance, lateral movement, and privilege escalation. RDP is worse. Sophos found that cybercriminals abused RDP in 90% of attacks they investigated. External remote services like RDP were the initial access vector in 65% of incident response cases. RDP attack vectors include exposed RDP ports with port 3389 open to the internet, weak authentication with simple passwords vulnerable to brute force, lack of MFA with no second factor for authentication, and credential reuse from compromised passwords in data breaches. In one Darktrace case, attackers compromised an organization four times in six months, each time through exposed RDP ports. The attack chain went successful RDP login, internal reconnaissance via WMI, lateral movement via PsExec, and objective achievement. The Palo Alto Unit 42 Incident Response report found RDP was the initial attack vector in 50% of ransomware deployment cases. Email infrastructure remains a primary attack vector. Your email attack surface includes mail servers like Exchange, Office 365, and Gmail with configuration weaknesses, email authentication with misconfigured SPF, DKIM, and DMARC records, phishing-susceptible users targeted through social engineering, email attachments and links as malware delivery mechanisms, and compromised accounts through credential stuffing or password reuse. Email authentication misconfiguration is particularly insidious. If your SPF, DKIM, and DMARC records are wrong or missing, attackers can spoof emails from your domain, your legitimate emails get marked as spam, and phishing emails impersonating your organization succeed. Email servers themselves are also targets. The NSA released guidance on Microsoft Exchange Server security specifically because Exchange servers are so frequently compromised. Container orchestration platforms like Kubernetes introduce massive attack surface complexity. The Kubernetes attack surface includes the Kubernetes API server with exposed or misconfigured API endpoints, container images with vulnerabilities in base images or application layers, container registries like Docker Hub, ECR, and GCR with weak access controls, pod security policies with overly permissive container configurations, network policies with insufficient micro-segmentation between pods, secrets management with hardcoded secrets or weak secret storage, and RBAC misconfigurations with overly broad service account permissions. Container security issues include containers running as root with excessive privileges, exposed Docker daemon sockets allowing container escape, vulnerable dependencies in container images, and lack of runtime security monitoring. The Docker daemon attack surface is particularly concerning. Running containers with privileged access or allowing docker.sock access can enable container escape and host compromise. Serverless computing like AWS Lambda, Azure Functions, and Google Cloud Functions promised to eliminate infrastructure management. Instead, it just created new attack surfaces. Serverless attack surface components include function code vulnerabilities like injection flaws and insecure dependencies, IAM misconfigurations with overly permissive Lambda execution roles, environment variables storing secrets as plain text, function URLs with publicly accessible endpoints without authentication, and event source mappings with untrusted input from various cloud services. The overabundance of event sources expands the attack surface. Lambda functions can be triggered by S3 events, API Gateway requests, DynamoDB streams, SNS topics, EventBridge schedules, IoT events, and dozens more. Each event source is a potential injection point. If function input validation is insufficient, attackers can manipulate event data to exploit the function. Real-world Lambda attacks include credential theft by exfiltrating IAM credentials from environment variables, lateral movement using over-permissioned roles to access other AWS resources, and data exfiltration by invoking functions to query and extract database contents. The Scarlet Eel adversary specifically targeted AWS Lambda for credential theft and lateral movement. Microservices architecture multiplies attack surface by decomposing monolithic applications into dozens or hundreds of independent services. Each microservice has its own attack surface including authentication mechanisms where each service needs to verify requests, authorization rules where each service enforces access controls, API endpoints for service-to-service communication channels, data stores where each service may have its own database, and network interfaces where each service exposes network ports. Microservices security challenges include east-west traffic vulnerabilities with service-to-service communication without encryption or authentication, authentication and authorization complexity from managing auth across 40 plus services multiplied by 3 environments equaling 240 configurations, service-to-service trust where services blindly trust internal traffic, network segmentation failures with flat networks allowing unrestricted pod-to-pod communication, and inconsistent security policies with different services having different security standards. One compromised microservice can enable lateral movement across the entire application. Without proper network segmentation and zero trust architecture, attackers pivot from service to service. How do you measure something this large, right. Attack surface measurement is complex. Attack surface metrics include the total number of assets with all discovered systems, applications, and devices, newly discovered assets found through continuous discovery, the number of exposed assets accessible from the internet, open ports and services with network services listening for connections, vulnerabilities by severity including critical, high, medium, and low CVEs, mean time to detect, MTTD, measuring how quickly new assets are discovered, mean time to remediate, MTTR, measuring how quickly vulnerabilities are fixed, shadow IT assets that are unknown or unmanaged, third-party exposure from vendor and partner access points, and attack surface change rate showing how rapidly the attack surface evolves. Academic research has produced formal attack surface measurement methods. Pratyusa Manadhata's foundational work defines attack surface as a three-tuple, System Attackability, Channel Attackability, Data Attackability. But in practice, most organizations struggle with basic attack surface visibility, let alone quantitative measurement. Your attack surface isn't static. It changes constantly. Changes happen because developers deploy new services and APIs, cloud auto-scaling spins up new instances, shadow IT appears as employees adopt unapproved tools, acquisitions bring new infrastructure into your environment, IoT devices get plugged into your network, and subdomains get created for new projects. Static, point-in-time assessments are obsolete. You need continuous asset discovery and monitoring. Continuous discovery methods include automated network scanning for regular scans to detect new devices, cloud API polling to query cloud provider APIs for resource changes, DNS monitoring to track new subdomains via Certificate Transparency logs, passive traffic analysis to observe network traffic and identify assets, integration with CMDB or ITSM to sync with configuration management databases, and cloud inventory automation using Infrastructure as Code to track deployments. Understanding your attack surface is step one. Reducing it is the goal. Attack surface reduction begins with asset elimination by removing unnecessary assets entirely. This includes decommissioning unused servers and applications, deleting abandoned subdomains and DNS records, shutting down forgotten development environments, disabling unused network services and ports, and removing unused user accounts and service identities. Access control hardening implements least privilege everywhere by enforcing multi-factor authentication, MFA, for all remote access, using role-based access control, RBAC, for cloud resources, implementing zero trust network architecture, restricting network access with micro-segmentation, and applying the principle of least privilege to IAM roles. Exposure minimization reduces what's visible to attackers by moving services behind VPNs or bastion hosts, using private IP ranges for internal services, implementing network address translation, NAT, for outbound access, restricting API endpoints to authorized sources only, and disabling unnecessary features and functionalities. Security hardening strengthens what remains by applying security patches promptly, using security configuration baselines, enabling encryption for data in transit and at rest, implementing Web Application Firewalls, WAF, for web apps, and deploying endpoint detection and response, EDR, on all devices. Monitoring and detection watch for attacks in progress by implementing real-time threat detection, enabling comprehensive logging and SIEM integration, deploying intrusion detection and prevention systems, IDS/IPS, monitoring for anomalous behavior patterns, and using threat intelligence feeds to identify known bad actors. Your attack surface is exponentially larger than you think it is. Every asset you know about probably has three you don't. Every known vulnerability probably has ten undiscovered ones. Every third-party integration probably grants more access than you realize. Every collaboration tool is leaking more data than you imagine. Every paste site contains more of your secrets than you want to admit. And attackers know this. They're not just looking at what you think you've secured. They're systematically enumerating every possible entry point. They're mining Certificate Transparency logs for forgotten subdomains. They're scanning every IP in your address space. They're reverse-engineering your mobile apps. They're buying employee credentials from data breach databases. They're compromising your vendors to reach you. They're scraping Pastebin for your leaked secrets. They're pulling your public Docker images and extracting the embedded credentials. They're accessing your misconfigured S3 buckets and exfiltrating terabytes of data. They're exploiting your exposed Jenkins instances to compromise your entire infrastructure. They're manipulating your AI agents to exfiltrate private Notion data. The asymmetry is brutal. You have to defend every single attack vector. They only need to find one that works. So what do you do. Start by accepting that you don't have complete visibility. Nobody does. But you can work toward better visibility through continuous discovery, automated asset management, and integration of security tools that help map your actual attack surface. Implement attack surface reduction aggressively. Every asset you eliminate is one less thing to defend. Every service you shut down is one less potential vulnerability. Every piece of shadow IT you discover and bring under management is one less blind spot. Every misconfigured cloud storage bucket you fix is terabytes of data no longer exposed. Every leaked secret you rotate is one less credential floating around the internet. Adopt zero trust architecture. Stop assuming that anything, internal services, microservices, authenticated users, collaboration tools, is inherently trustworthy. Verify everything. Monitor paste sites and code repositories. Your secrets are out there. Find them before attackers weaponize them. Secure your collaboration tools. Slack, Trello, Jira, Confluence, Notion, Google Drive, and Airtable are all leaking data. Lock them down. Fix your container security. Scan images for secrets. Use secret managers instead of environment variables. Secure your registries. Harden your CI/CD pipelines. Jenkins, GitHub Actions, and GitLab CI are high-value targets. Protect them. And test your assumptions with red team exercises and continuous security testing. Your attack surface is what an attacker can reach, not what you think you've secured. The attack surface problem isn't getting better. Cloud adoption, DevOps practices, remote work, IoT proliferation, supply chain complexity, collaboration tool sprawl, and container adoption are all expanding organizational attack surfaces faster than security teams can keep up. But understanding the problem is the first step toward managing it. And now you understand exactly how catastrophically large your attack surface actually is.

1 views
Robin Moffatt 3 weeks ago

Tech Radar (Nov 2025) - data blips

The latest  Thoughtworks TechRadar  is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtwork’s usage and opinions on it. Databricks Assistant Apache Paimon Delta Sharing Naive API-to-MCP conversion Standalone data engineering teams Text to SQL

0 views
Simon Willison 3 weeks ago

A new SQL-powered permissions system in Datasette 1.0a20

Datasette 1.0a20 is out with the biggest breaking API change on the road to 1.0, improving how Datasette's permissions system works by migrating permission logic to SQL running in SQLite. This release involved 163 commits , with 10,660 additions and 1,825 deletions, most of which was written with the help of Claude Code. Datasette's permissions system exists to answer the following question: Is this actor allowed to perform this action , optionally against this particular resource ? An actor is usually a user, but might also be an automation operating via the Datasette API. An action is a thing they need to do - things like view-table, execute-sql, insert-row. A resource is the subject of the action - the database you are executing SQL against, the table you want to insert a row into. Datasette's default configuration is public but read-only: anyone can view databases and tables or execute read-only SQL queries but no-one can modify data. Datasette plugins can enable all sorts of additional ways to interact with databases, many of which need to be protected by a form of authentication Datasette also 1.0 includes a write API with a need to configure who can insert, update, and delete rows or create new tables. Actors can be authenticated in a number of different ways provided by plugins using the actor_from_request() plugin hook. datasette-auth-passwords and datasette-auth-github and datasette-auth-existing-cookies are examples of authentication plugins. The previous implementation included a design flaw common to permissions systems of this nature: each permission check involved a function call which would delegate to one or more plugins and return a True/False result. This works well for single checks, but has a significant problem: what if you need to show the user a list of things they can access, for example the tables they can view? I want Datasette to be able to handle potentially thousands of tables - tables in SQLite are cheap! I don't want to have to run 1,000+ permission checks just to show the user a list of tables. Since Datasette is built on top of SQLite we already have a powerful mechanism to help solve this problem. SQLite is really good at filtering large numbers of records. The biggest change in the new release is that I've replaced the previous plugin hook - which let a plugin determine if an actor could perform an action against a resource - with a new permission_resources_sql(actor, action) plugin hook. Instead of returning a True/False result, this new hook returns a SQL query that returns rules helping determine the resources the current actor can execute the specified action against. Here's an example, lifted from the documentation: This hook grants the actor with ID "alice" permission to view the "sales" table in the "accounting" database. The object should always return four columns: a parent, child, allow (1 or 0), and a reason string for debugging. When you ask Datasette to list the resources an actor can access for a specific action, it will combine the SQL returned by all installed plugins into a single query that joins against the internal catalog tables and efficiently lists all the resources the actor can access. This query can then be limited or paginated to avoid loading too many results at once. Datasette has several additional requirements that make the permissions system more complicated. Datasette permissions can optionally act against a two-level hierarchy . You can grant a user the ability to insert-row against a specific table, or every table in a specific database, or every table in every database in that Datasette instance. Some actions can apply at the table level, others the database level and others only make sense globally - enabling a new feature that isn't tied to tables or databases, for example. Datasette currently has ten default actions but plugins that add additional features can register new actions to better participate in the permission systems. Datasette's permission system has a mechanism to veto permission checks - a plugin can return a deny for a specific permission check which will override any allows. This needs to be hierarchy-aware - a deny at the database level can be outvoted by an allow at the table level. Finally, Datasette includes a mechanism for applying additional restrictions to a request. This was introduced for Datasette's API - it allows a user to create an API token that can act on their behalf but is only allowed to perform a subset of their capabilities - just reading from two specific tables, for example. Restrictions are described in more detail in the documentation. That's a lot of different moving parts for the new implementation to cover. Since permissions are critical to the security of a Datasette deployment it's vital that they are as easy to understand and debug as possible. The new alpha adds several new debugging tools, including this page that shows the full list of resources matching a specific action for the current user: And this page listing the rules that apply to that question - since different plugins may return different rules which get combined together: This screenshot illustrates two of Datasette's built-in rules: there is a default allow for read-only operations such as view-table (which can be over-ridden by plugins) and another rule that says the root user can do anything (provided Datasette was started with the option.) Those rules are defined in the datasette/default_permissions.py Python module. There's one question that the new system cannot answer: provide a full list of actors who can perform this action against this resource. It's not possibly to provide this globally for Datasette because Datasette doesn't have a way to track what "actors" exist in the system. SSO plugins such as mean a new authenticated GitHub user might show up at any time, with the ability to perform actions despite the Datasette system never having encountered that particular username before. API tokens and actor restrictions come into play here as well. A user might create a signed API token that can perform a subset of actions on their behalf - the existence of that token can't be predicted by the permissions system. This is a notable omission, but it's also quite common in other systems. AWS cannot provide a list of all actors who have permission to access a specific S3 bucket, for example - presumably for similar reasons. Datasette's plugin ecosystem is the reason I'm paying so much attention to ensuring Datasette 1.0 has a stable API. I don't want plugin authors to need to chase breaking changes once that 1.0 release is out. The Datasette upgrade guide includes detailed notes on upgrades that are needed between the 0.x and 1.0 alpha releases. I've added an extensive section about the permissions changes to that document. I've also been experimenting with dumping those instructions directly into coding agent tools - Claude Code and Codex CLI - to have them upgrade existing plugins for me. This has been working extremely well . I've even had Claude Code update those notes itself with things it learned during an upgrade process! This is greatly helped by the fact that every single Datasette plugin has an automated test suite that demonstrates the core functionality works as expected. Coding agents can use those tests to verify that their changes have had the desired effect. I've also been leaning heavily on to help with the upgrade process. I wrote myself two new helper scripts - and - to help test the new plugins. The and implementations can be found in this TIL . Some of my plugin upgrades have become a one-liner to the command, which runs OpenAI Codex CLI with a prompt without entering interactive mode: There are still a bunch more to go - there's a list in this tracking issue - but I expect to have the plugins I maintain all upgraded pretty quickly now that I have a solid process in place. This change to Datasette core by far the most ambitious piece of work I've ever attempted using a coding agent. Last year I agreed with the prevailing opinion that LLM assistance was much more useful for greenfield coding tasks than working on existing codebases. The amount you could usefully get done was greatly limited by the need to fit the entire codebase into the model's context window. Coding agents have entirely changed that calculation. Claude Code and Codex CLI still have relatively limited token windows - albeit larger than last year - but their ability to search through the codebase, read extra files on demand and "reason" about the code they are working with has made them vastly more capable. I no longer see codebase size as a limiting factor for how useful they can be. I've also spent enough time with Claude Sonnet 4.5 to build a weird level of trust in it. I can usually predict exactly what changes it will make for a prompt. If I tell it "extract this code into a separate function" or "update every instance of this pattern" I know it's likely to get it right. For something like permission code I still review everything it does, often by watching it as it works since it displays diffs in the UI. I also pay extremely close attention to the tests it's writing. Datasette 1.0a19 already had 1,439 tests, many of which exercised the existing permission system. 1.0a20 increases that to 1,583 tests. I feel very good about that, especially since most of the existing tests continued to pass without modification. I built several different proof-of-concept implementations of SQL permissions before settling on the final design. My research/sqlite-permissions-poc project was the one that finally convinced me of a viable approach, That one started as a free ranging conversation with Claude , at the end of which I told it to generate a specification which I then fed into GPT-5 to implement. You can see that specification at the end of the README . I later fed the POC itself into Claude Code and had it implement the first version of the new Datasette system based on that previous experiment. This is admittedly a very weird way of working, but it helped me finally break through on a problem that I'd been struggling with for months. Now that the new alpha is out my focus is upgrading the existing plugin ecosystem to use it, and supporting other plugin authors who are doing the same. The new permissions system unlocks some key improvements to Datasette Cloud concerning finely-grained permissions for larger teams, so I'll be integrating the new alpha there this week. This is the single biggest backwards-incompatible change required before Datasette 1.0. I plan to apply the lessons I learned from this project to the other, less intimidating changes. I'm hoping this can result in a final 1.0 release before the end of the year! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Understanding the permissions system Permissions systems need to be able to efficiently list things The new permission_resources_sql() plugin hook Hierarchies, plugins, vetoes, and restrictions New debugging tools The missing feature: list actors who can act on this resource Upgrading plugins for Datasette 1.0a20 Using Claude Code to implement this change Starting with a proof-of-concept Miscellaneous tips I picked up along the way What's next? = "test against datasette dev" - it runs a plugin's existing test suite against the current development version of Datasette checked out on my machine. It passes extra options through to so I can run or as needed. = "run against datasette dev" - it runs the latest dev command with the plugin installed. When working on anything relating to plugins it's vital to have at least a few real plugins that you upgrade in lock-step with the core changes. The and shortcuts were invaluable for productively working on those plugins while I made changes to core. Coding agents make experiments much cheaper. I threw away so much code on the way to the final implementation, which was psychologically easier because the cost to create that code in the first place was so low. Tests, tests, tests. This project would have been impossible without that existing test suite. The additional tests we built along the way give me confidence that the new system is as robust as I need it to be. Claude writes good commit messages now! I finally gave in and let it write these - previously I've been determined to write them myself. It's a big time saver to be able to say "write a tasteful commit message for these changes". Claude is also great at breaking up changes into smaller commits. It can also productively rewrite history to make it easier to follow, especially useful if you're still working in a branch. A really great way to review Claude's changes is with the GitHub PR interface. You can attach comments to individual lines of code and then later prompt Claude like this: . This is a very quick way to apply little nitpick changes - rename this function, refactor this repeated code, add types here etc. The code I write with LLMs is higher quality code . I usually find myself making constant trade-offs while coding: this function would be neater if I extracted this helper, it would be nice to have inline documentation here, this changing this would be good but would break a dozen tests... for each of those I have to determine if the additional time is worth the benefit. Claude can apply changes so much faster than me that these calculations have changed - almost any improvement is worth applying, no matter how trivial, because the time cost is so low. Internal tools are cheap now. The new debugging interfaces were mostly written by Claude and are significantly nicer to use and look at than the hacky versions I would have knocked out myself, if I had even taken the extra time to build them. That trick with a Markdown file full of upgrade instructions works astonishingly well - it's the same basic idea as Claude Skills . I maintain over 100 Datasette plugins now and I expect I'll be automating all sorts of minor upgrades in the future using this technique.

0 views
devansh 3 weeks ago

AI pentest scoping playbook

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Organizations are throwing money at "AI red teams" who run a few prompt injection tests, declare victory, and cash checks. Security consultants are repackaging traditional pentest methodologies with "AI" slapped on top, hoping nobody notices they're missing 80% of the actual attack surface. And worst of all, the people building AI systems, the ones who should know better, are scoping engagements like they're testing a CRUD app from 2015. This guide/playbook exists because the current state of AI security testing is dangerously inadequate. The attack surface is massive. The risks are novel. The methodologies are immature. And the consequences of getting it wrong are catastrophic. These are my personal views, informed by professional experience but not representative of my employer. What follows is what I wish every CISO, security lead, and AI team lead understood before they scoped their next AI security engagement. Traditional web application pentests follow predictable patterns. You scope endpoints, define authentication boundaries, exclude production databases, and unleash testers to find SQL injection and XSS. The attack surface is finite, the vulnerabilities are catalogued, and the methodologies are mature. AI systems break all of that. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. This isn't incrementally harder than web pentesting. It's just fundamentally different. And if your scope document looks like a web app pentest with "LLM" find-and-replaced in, you're going to miss everything that matters. Before you can scope an AI security engagement, you need to understand what you're actually testing. And most organizations don't. Here's the stack: This is the thing everyone focuses on because it's the most visible. But "the model" isn't monolithic. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. AI systems don't just run models. They feed data into models. And that data pipeline is massive attack surface. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. Models don't exist in isolation. They're integrated into applications. And those integration points are attack surface. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ If you've built agentic systems, AI that can plan, reason, use tools, and take actions autonomously, you've added an entire new dimension of attack surface. Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? AI systems run on infrastructure. That infrastructure has traditional security vulnerabilities that still matter. Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? How much of that did you include in your last AI security scope document? If the answer is "less than 60%", your scope is inadequate. And you're going to get breached by someone who understands the full attack surface. The OWASP Top 10 for LLM Applications is the closest thing we have to a standardized framework for AI security testing. If you're scoping an AI engagement and you haven't mapped every item in this list to your test plan, you're doing it wrong. Here's the 2025 version: That's your baseline. But if you stop there, you're missing half the attack surface. The OWASP LLM Top 10 is valuable, but it's not comprehensive. Here's what's missing: Safety ≠ security . But unsafe AI systems cause real harm, and that's in scope for red teaming. Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Traditional adversarial ML attacks apply to AI systems. Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? If your AI system handles multiple modalities (text, images, audio, video), you have additional attack surface. Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. AI systems must comply with GDPR, HIPAA, CCPA, and other regulations. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? Before you write your scope document, answer every single one of these questions. If you can't answer them, you don't understand your system well enough to scope a meaningful AI security engagement. If you can answer all these questions, you're ready to scope. If you can't, you're not. Your AI pentest/engagement scope document needs to be more detailed than a traditional pentest scope. Here's the structure: What we're testing : One-paragraph description of the AI system. Why we're testing : Business objectives (compliance, pre-launch validation, continuous assurance, incident response). Key risks : Top 3-5 risks that drive the engagement. Success criteria : What does "passing" look like? Architectural diagram : Include everything—model, data pipelines, APIs, infrastructure, third-party services. Component inventory : List every testable component with owner, version, and deployment environment. Data flows : Document how data moves through the system, from user input to model output to downstream consumers. Trust boundaries : Identify where data crosses trust boundaries (user → app, app → model, model → tools, tools → external APIs). Be exhaustive. List: For each component, specify: Map every OWASP LLM Top 10 item to specific test cases. Example: LLM01 - Prompt Injection : Include specific threat scenarios: Explicitly list what's NOT being tested: Tools : List specific tools testers will use: Techniques : Test phases : Authorization : All testing must be explicitly authorized in writing. Include names, signatures, dates. Ethical boundaries : No attempts at physical harm, financial fraud, illegal content generation (unless explicitly scoped for red teaming). Disclosure : Critical findings must be disclosed immediately via designated channel (email, Slack, phone). Standard findings can wait for formal report. Data handling : Testers must not exfiltrate user data, training data, or model weights except as explicitly authorized for demonstration purposes. All test data must be destroyed post-engagement. Legal compliance : Testing must comply with all applicable laws and regulations. If testing involves accessing user data, appropriate legal review must be completed. Technical report : Detailed findings with severity ratings, reproduction steps, evidence (screenshots, logs, payloads), and remediation guidance. Executive summary : Business-focused summary of key risks and recommendations. Threat model : Updated threat model based on findings. Retest availability : Will testers be available for retest after fixes? Timeline : Start date, end date, report delivery date, retest window. Key contacts : That's your scope document. It should be 10-20 pages. If it's shorter, you're missing things. Here's what I see organizations get wrong: Mistake 1: Scoping only the application layer, not the model You test the web app that wraps the LLM, but you don't test the LLM itself. You find XSS and broken authz, but you miss prompt injection, jailbreaks, and data extraction. Fix : Scope the full stack-app, model, data pipelines, infrastructure. Mistake 2: Treating the model as a black box when you control it If you fine-tuned the model, you have access to training data and weights. Test for data poisoning, backdoors, and alignment failures. Don't just test the API. Fix : If you control any part of the model lifecycle (training, fine-tuning, deployment), include that in scope. Mistake 3: Ignoring RAG and vector databases You test the LLM, but you don't test the document store. Adversaries inject malicious documents, manipulate retrieval, and poison embeddings—and you never saw it coming. Fix : If you're using RAG, the vector database and document ingestion pipeline are in scope. Mistake 4: Not testing multi-turn interactions You test single-shot prompts, but adversaries condition the model over 10 turns to bypass refusal mechanisms. You missed the attack entirely. Fix : Test multi-turn dialogues explicitly. Test conversation history isolation. Test memory poisoning. Mistake 5: Assuming third-party models are safe You're using OpenAI's API, so you assume it's secure. But you're passing user PII in prompts, you're not validating outputs before execution, and you haven't considered what happens if OpenAI's safety mechanisms fail. Fix : Even with third-party models, test your integration. Test input/output handling. Test failure modes. Mistake 6: Not including AI safety in security scope You test for technical vulnerabilities but ignore alignment failures, bias amplification, and harmful content generation. Then your model generates racist outputs or dangerous instructions, and you're in the news. Fix : AI safety is part of AI security. Include alignment testing, bias audits, and harm reduction validation. Mistake 7: Underestimating autonomous agent risks You test the LLM, but your agent can execute code, call APIs, and access databases. An adversary hijacks the agent, and it deletes production data or exfiltrates secrets. Fix : Autonomous agents are their own attack surface. Test tool permissions, privilege escalation, and agent behavior boundaries. Mistake 8: Not planning for continuous testing You do one pentest before launch, then never test again. But you're fine-tuning weekly, adding new plugins monthly, and updating RAG documents daily. Your attack surface is constantly changing. Fix : Scope for continuous red teaming, not one-time assessment. Organizations hire expensive consultants to run a few prompt injection tests, declare the system "secure," and ship to production. Then they get breached six months later when someone figures out a multi-turn jailbreak or poisons the RAG document store. The problem isn't that the testers are bad. The problem is that the scopes are inadequate . You can't find what you're not looking for. If your scope doesn't include RAG poisoning, testers won't test for it. If your scope doesn't include membership inference, testers won't test for it. If your scope doesn't include agent privilege escalation, testers won't test for it. And attackers will. The asymmetry is brutal: you have to defend every attack vector. Attackers only need to find one that works. So when you scope your next AI security engagement, ask yourself: "If I were attacking this system, what would I target?" Then make sure every single one of those things is in your scope document. Because if it's not in scope, it's not getting tested. And if it's not getting tested, it's going to get exploited. Traditional pentests are point-in-time assessments. You test, you report, you fix, you're done. That doesn't work for AI systems. AI systems evolve constantly: Every change introduces new attack surface. And if you're only testing once a year, you're accumulating risk for 364 days. You need continuous red teaming . Here's how to build it: Use tools like Promptfoo, Garak, and PyRIT to run automated adversarial testing on every model update. Integrate tests into CI/CD pipelines so every deployment is validated before production. Set up continuous monitoring for: Quarterly or bi-annually, bring in expert red teams for comprehensive testing beyond what automation can catch. Focus deep assessments on: Train your own security team on AI-specific attack techniques. Develop internal playbooks for: Every quarter, revisit your threat model: Update your testing roadmap based on evolving threats. Scoping AI security engagements is harder than traditional pentests because the attack surface is larger, the risks are novel, and the methodologies are still maturing. But it's not impossible. You need to: If you do this right, you'll find vulnerabilities before attackers do. If you do it wrong, you'll end up in the news explaining why your AI leaked training data, generated harmful content, or got hijacked by adversaries. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually/ Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? What base model are you using (GPT-4, Claude, Llama, Mistral, custom)? Is the model proprietary (OpenAI API) or open-source? Have you fine-tuned the base model? On what data? Have you applied instruction tuning, RLHF, or other alignment techniques? How is the model deployed (API, on-prem, container, serverless)? Do you have access to model weights? Can testers query the model directly, or only through your application? Are there rate limits? What are they? What's the model's context window size? Does the model support function calling or tool use? Is the model multimodal (vision, audio, text)? Are you using multiple models in ensemble or orchestration? Where did training data come from (public, proprietary, scraped, licensed)? Was training data curated or filtered? How? Is training data in scope for poisoning tests? Are you using RAG (Retrieval-Augmented Generation)? If RAG: What's the document store (vector DB, traditional DB, file system)? If RAG: How are documents ingested? Who controls ingestion? If RAG: Can testers inject malicious documents? If RAG: How is retrieval indexed and searched? Do you pull real-time data from external sources (APIs, databases)? How is input data preprocessed and sanitized? Is user conversation history stored? Where? For how long? Can users access other users' data? How do users interact with the model (web app, API, chat interface, mobile app)? What authentication mechanisms are used (OAuth, API keys, session tokens)? What authorization model is used (RBAC, ABAC, none)? Are there different user roles with different permissions? Is there rate limiting? At what levels (user, IP, API key)? Are inputs and outputs logged? Where? Who has access to logs? Are logs encrypted at rest and in transit? How are errors handled? Are error messages exposed to users? Are there webhooks or callbacks that the model can trigger? Can the model call external APIs? Which ones? Can the model execute code? In what environment? Can the model browse the web? Can the model read/write files? Can the model access databases? What permissions do plugins have? How are plugin outputs validated before use? Can users add custom plugins? Are plugin interactions logged? Do you have autonomous agents that plan and execute multi-step tasks? What tools can agents use? Can agents spawn other agents? Do agents have persistent memory? Where is it stored? How are agent goals and constraints defined? Can agents access sensitive resources (DBs, APIs, filesystems)? Can agents escalate privileges? Are there kill-switches or circuit breakers for agents? How is agent behavior monitored? What cloud provider(s) are you using (AWS, Azure, GCP, on-prem)? Are you using containers (Docker)? Orchestration (Kubernetes)? Where are model weights stored? Who has access? Where are API keys and secrets stored? Are secrets in environment variables, config files, or secret managers? How are dependencies managed (pip, npm, Docker images)? Have you scanned dependencies for known vulnerabilities? How are model updates deployed? What's the CI/CD pipeline? Who can deploy model updates? Are there staging environments separate from production? What safety mechanisms are in place (content filters, refusal training, constitutional AI)? Have you red-teamed for jailbreaks? Have you tested for bias across demographic groups? Have you tested for harmful content generation? Do you have human-in-the-loop review for sensitive outputs? What's your incident response plan if the model behaves unsafely? Can testers attempt to jailbreak the model? Can testers attempt prompt injection? Can testers attempt data extraction (training data, PII)? Can testers attempt model extraction or inversion? Can testers attempt DoS or resource exhaustion? Can testers poison training data (if applicable)? Can testers test multi-turn conversations? Can testers test RAG document injection? Can testers test plugin abuse? Can testers test agent privilege escalation? Are there any topics, content types, or test methods that are forbidden? What's the escalation process if critical issues are found during testing? What regulations apply (GDPR, HIPAA, CCPA, FTC, EU AI Act)? Do you process PII? What types? Do you have data processing agreements with model providers? Do you have the legal right to test this system? Are there export control restrictions on the model or data? What are the disclosure requirements for findings? What's the confidentiality agreement for testers? Model(s) : Exact model names, versions, access methods APIs : All endpoints with authentication requirements Data stores : Databases, vector stores, file systems, caches Integrations : Every third-party service, plugin, tool Infrastructure : Cloud accounts, containers, orchestration Applications : Web apps, mobile apps, admin panels Access credentials testers will use Environments (dev, staging, prod) that are in scope Testing windows (if limited) Rate limits or usage restrictions Test direct instruction override Test indirect injection via RAG documents Test multi-turn conditioning Test system prompt extraction Test jailbreak techniques (roleplay, hypotheticals, encoding) Test cross-turn memory poisoning "Can an attacker leak other users' conversation history?" "Can an attacker extract training data containing PII?" "Can an attacker bypass content filters to generate harmful instructions?" Production environments (if testing only staging) Physical security Social engineering of employees Third-party SaaS providers we don't control Specific attack types (if any are prohibited) Manual testing Promptfoo for LLM fuzzing Garak for red teaming PyRIT for adversarial prompting ART (Adversarial Robustness Toolbox) for ML attacks Custom scripts for specific attack vectors Traditional tools (Burp Suite, Caido, Nuclei) for infrastructure Prompt injection testing Jailbreak attempts Data extraction attacks Model inversion Membership inference Evasion attacks RAG poisoning Plugin abuse Agent privilege escalation Infrastructure scanning Reconnaissance and threat modeling Automated vulnerability scanning Manual testing of high-risk areas Exploitation and impact validation Reporting and remediation guidance Engagement lead (security team) Technical point of contact (AI team) Escalation contact (for critical findings) Legal contact (for questions on scope) Models get fine-tuned RAG document stores get updated New plugins get added Agents gain new capabilities Infrastructure changes Prompt injection attempts Jailbreak successes Data extraction queries Unusual tool usage patterns Agent behavior anomalies Novel attack vectors that tools don't cover Complex multi-step exploitation chains Social engineering combined with technical attacks Agent hijacking and multi-agent exploits Prompt injection testing Jailbreak methodology RAG poisoning Agent security testing What new attacks have been published? What new capabilities have you added? What new integrations are in place? What new risks does the threat landscape present? Understand the full stack : model, data pipelines, application, infrastructure, agents, everything. Map every attack vector : OWASP LLM Top 10 is your baseline, not your ceiling. Answer scoping questions (mentioned above) : If you can't answer them, you don't understand your system. Write detailed scope documents : 10-20 pages, not 2 pages. Use the right tools : Promptfoo, Garak, ART, LIME, SHAP—not just Burp Suite. Test continuously : Not once, but ongoing. Avoid common mistakes : Don't ignore RAG, don't underestimate agents, don't skip AI safety.

0 views
Armin Ronacher 3 weeks ago

Absurd Workflows: Durable Execution With Just Postgres

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution. Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd 1 , a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed. Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state. Because Postgres is excellent at queues thanks to , you can use it for the queue (e.g., with pgmq ). And because it’s a database, you can also use it to store the state. The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened. Absurd at the core is a single file ( ) which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with. The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. Tasks are subdivided into steps , which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run ). The result of a step is stored in the database (a checkpoint ). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again. Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free. What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated: This defines a single task named , and it has just a single step. The return value is the changed state, but the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. The first checkpoint will be , the second , , etc. Each state only stores the new messages it generated, not the entire message history. If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks. How do you kick it off? Simply enqueue it: And if you are curious, this is an example implementation of the function used above: And like Temporal and other solutions, you can yield if you want. If you want to come back to a problem in 7 days, you can do so: Or if you want to wait for an event: Which someone else can emit: Really, that’s it. There is really not much to it. It’s just a queue and a state store — that’s all you need. There is no compiler plugin and no separate service or whole runtime integration . Just Postgres. That’s not to throw shade on these other solutions; they are great. But not every problem necessarily needs to scale to that level of complexity, and you can get quite far with much less. Particularly if you want to build software that other people should be able to self-host, that might be quite appealing. It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years. ↩ It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years. ↩

0 views
Emil Privér 1 months ago

We Re-Built Our Integration Service Using Postgres and Go

Our integration service connects our platform to external systems. Earlier this year, we reached a scaling limit at 40 integrations and rebuilt it from the ground up. The service handles three primary responsibilities: sending data to external systems, managing job queues, and prioritizing work based on criticality. The original implementation functioned but had architectural constraints that prevented horizontal scaling. We use microservices because different components have conflicting requirements. The management API handles complex business logic with normalized schemas—separate tables for translations and categories. The public API optimizes for read performance under load, using denormalized data by adding translations directly into category tables and handling filtering in Go. A monolithic architecture would require compromising performance in one area to accommodate the other. The integration service currently processes millions of events daily, with volume increasing as we onboard new customers. This post describes our implementation of a queue system using PostgreSQL and Go, focusing on design decisions and technical trade-offs. The first implementation used GCP Pub/Sub, a topic-to-many-subscription service where messages are replicated across multiple queues. This architecture introduced several scalability issues. The integration service maintained a database for integration configurations but lacked ownership of its operational data. This violated a distributed systems principle: services should own their data rather than depend on other services for it. This dependency forced our management service to serialize complete payloads into the queue. Updating a single attribute on a sub-object required sending the entire parent object with all nested sub-objects, metadata, and relationships. Different external APIs have varying data requirements—some need individual sub-objects while others require complete hierarchies. For clients with records containing 300-500 sub-objects, this resulted in significant message size inflation. GCP charges by message size rather than count, making large messages substantially more expensive than smaller ones. GCP’s WebSocket delivery requires clients to buffer messages internally. With 40 integrations running separate consumers with filters, traffic spikes created memory pressure: This prevented horizontal scaling and limited us to vertical scaling approaches. External APIs enforce varying rate limits. Our in-memory rate limiter tracked requests per integration but prevented horizontal scaling since state couldn’t be shared across instances without risking rate limit violations. By early 2025, these issues had compounded: excessive message sizes increasing costs, memory bloat requiring oversized containers, vertical-only scaling, high operational expenses, rate limiting preventing horizontal scale, and lack of data independence. The system couldn’t accommodate our growth trajectory. A complete rebuild was necessary. The v2 design addressed specific limitations: Additional improvements: The standard approach involves the producer computing payloads and sending them to the queue for consumer processing. We used this in v1 but rejected it for v2. Customers frequently make multiple rapid changes to the same record—updating a title, then a price, then a description. Each change triggers an event. Instead of sending three separate updates, we consolidate changes into a single update. We implemented a in the jobs table. Multiple updates to the same record within a short time window are deduplicated into a single job, reducing load on both our system and recipient systems. We chose PostgreSQL as our queue backend for several reasons: Often, we think we need something bigger like Apache Kafka when a relational database like PostgreSQL is sufficient for our requirements. The jobs table structure: Each job tracks: Postgres-backed queues require careful indexing. We use partial indexes (with WHERE clauses) only for actively queried states: , , , and . We don’t index or states. These statuses contain the majority of jobs in the table and aren’t needed in the job processing flow. Indexing them would just add more data into the memory when we don’t use it in the flow. Jobs are ordered by for FIFO processing, with priority queue overrides when applicable. Jobs follow a defined lifecycle: Timestamp fields serve observability purposes, measuring job duration and identifying bottlenecks. For jobs, retry timing is calculated using exponential backoff. The worker system requirements: We evaluated two approaches: maintaining in-memory queues with multiple goroutines using for and select to fetch jobs, or having goroutines fetch data from the database and iterate over the results. We chose the database iteration approach for its simplicity. pgxpool handles connection pooling, eliminating the need for channel-based in-memory queues. Each worker runs in a separate goroutine, using a ticker to poll for jobs every second. Before processing, workers check for shutdown signals ( or channel). When shutdown is initiated, workers stop accepting new jobs and mark in-flight jobs as . This prevents stalled jobs from blocking integration queues. Checking shutdown signals between jobs ensures clean shutdowns. During shutdown, we create a fresh context with for retrying jobs. This prevents database write failures when the main context is canceled. The query implements fair scheduling to prevent high-volume integrations from monopolizing workers: Query breakdown: Step 1: Identify busy integrations This CTE identifies integrations with 50+ concurrent processing jobs. Step 2: Select jobs with priority ordering Jobs are selected from integrations not in the busy list. Priority updates are ordered first, followed by FIFO ordering. locks selected rows to the current transaction, preventing duplicate processing by concurrent workers. Step 3: Update job status Selected jobs are updated to status with a recorded start time. This ensures fair resource allocation across integrations. Job timeouts are critical for queue health. In the initial release, we reused the global context for job processing. When jobs hung waiting for slow external APIs, they couldn’t be marked completed or failed due to context lifecycle coupling. Jobs accumulated in state indefinitely. The solution: context separation. The global context controls worker lifecycle. Each job receives its own context with a timeout. Timed-out jobs are marked , allowing queue progression. This also enables database writes during shutdown using a fresh context, even when the global context is canceled. Failed jobs require retry logic with appropriate timing. Immediate retries against failing external APIs are counterproductive. We implement exponential backoff: instant first retry, 10 seconds for the second, 30 seconds for the third, up to 30 minutes. The field drives backoff calculation. After 10 attempts, jobs are marked . Error types guide retry behavior: This allows each integration to decide how to handle errors based on the external API’s response. For example, a 400 Bad Request might be a permanent validation failure (NonRetryableError), while a 503 Service Unavailable is transient and should retry (RetryableError). The integration implementation determines the appropriate error type for each scenario. Jobs occasionally become stuck in state due to worker panics, database connection failures, or unexpected container termination. A cron job runs every minute, identifying jobs in state beyond the expected duration. These jobs are moved to with incremented retry counts, treating them as standard failures. This ensures queue progression despite unexpected failures. Rate limiting across multiple containers was v2’s most complex challenge. V1’s in-memory rate limiter worked for single containers but couldn’t share state across instances. While Redis was an option, we already had PostgreSQL with sufficient performance. The solution: a table tracking request counts per integration per second: Before external API requests, we increment the counter for the integration’s current time window (rounded to the second). PostgreSQL returns the new count. If the count exceeds the limit, we sleep 250ms and retry. If under the limit, we proceed. This works because all containers share the database as the source of truth for rate limiting. Occasionally, jobs are rate-limited during heavy load due to the gap between count checking and request sending. These jobs retry immediately. The occurrence rate is acceptable. Hope you enjoyed this article and learned something new. This system has worked really well so far, and we’ve had only a few minor issues that we fixed quickly. I will update this article over time. Mass updates generate large objects per record Objects are duplicated for each configured integration Copies buffer across 5-10 consumer instances Infrastructure requires 2GB RAM and 2 cores to handle spikes, despite needing only 512MB and 1 core during normal operation Horizontal scaling - Enable scaling across multiple containers Distributed rate limiting - Coordinate rate limits across instances Data ownership - Store operational data within the service Delta updates - Send only changed data rather than complete records Fair scheduling - Prevent single integrations from monopolizing resources Priority queuing - Process critical updates before lower-priority changes Self-service re-sync - Enable customers to re-sync catalogs independently Visibility - Provide APIs for customers to monitor sent data and queue status Performance - PostgreSQL is fast enough for our use case. We don’t need sub-second message delivery. Simplicity - Using a managed PostgreSQL instance on GCP is significantly simpler than introducing new infrastructure. Familiarity - Most developers understand SQL, reducing onboarding time. Existing infrastructure - We already use PostgreSQL for our data, eliminating the need for additional systems. - Links logs across services - Specifies the action (e.g., ) - Records failure details - Tracks current workflow state - Counts retry attempts - Schedules next retry , , - Provides metrics for observability - Links to specific integrations - Identifies the platform - Contains job data - Prevents duplicate execution Created → Initial state: Picked up → Transitions to Success → Becomes , records Failed (10 retries) → Becomes , records Failed (retries remaining) → Becomes , increments , calculates Parallel worker execution Horizontal scaling across containers Graceful shutdowns without job loss Distributed rate limit enforcement—we need to respect rate limits no matter how many containers we run - Permanent failures (e.g., validation errors). No retry. - Transient failures (e.g., 500 Internal Server Error). Retry with backoff. - Retry limit reached. Mark failed.

0 views
Jack Vanlightly 1 months ago

Why I’m not a fan of zero-copy Apache Kafka-Apache Iceberg

Over the past few months, I’ve seen a growing number of posts on social media promoting the idea of a “zero-copy” integration between Apache Kafka and Apache Iceberg. The idea is that Kafka topics could live directly as Iceberg tables. On the surface it sounds efficient: one copy of the data, unified access for both streaming and analytics. But from a systems point of view, I think this is the wrong direction for the Apache Kafka project. In this post, I’ll explain why.  Zero-copy is a bit of a marketing buzzword. I prefer the term shared tiering . The idea behind shared tiering is that Apache Kafka tiers colder data to an Apache Iceberg table instead of tiering closed log segment files. It’s called shared tiering because the tiered data serves both Kafka and data analytics workloads.  The idea has been popularized recently in the Kafka world by Aiven, with a tiered storage plugin for Apache Kafka that adds Iceberg table tiering to the tiered storage abstraction inside Kafka brokers. But before we understand shared tiering we should understand the difference between tiering and materialization: Tiering is about moving data from one storage tier (and possibly storage format) to another, such that both tiers are readable by the source system and data is only durably stored in one tier. While the system may use caches, only one tier is the source of truth for any given data item. Usually it’s about moving data to a cheaper long-term storage tier.  Materialization is about making data of a primary system available to a secondary system , by copying data from the primary storage system (and format) to the secondary storage system, such that both data copies are maintained (albeit with different formats). The second copy is not readable from the source system as its purpose is to feed another data system. Copying data to a lakehouse for access by various analytics engines is a prime example. Fig 1. Tiering vs materialization There are two types of tiering: Internal Tiering is data tiering where only the primary data system can access the various storage tiers. For example, Kafka tiered storage is internal tiering. These internal storage tiers form the primary storage as a whole. Shared Tiering is data tiering where one or more data tiers is shared between multiple systems. The result is a tiering-materialization hybrid, serving both purposes. Tiering to a lakehouse is an example of shared tiering. Fig 2. Shared tiering makes the long-term storage a shared storage layer for two or more systems. There are two broad approaches to populating an Iceberg table from a Kafka topic: Internal Tiering + Materialization Shared Tiering Option 1: Internal Tiering + Materialization where Kafka continues to use traditional tiered storage of log segment files (internal tiering) and also materializes the topic as an Iceberg table (such as via Kafka Connect). Catch-up Kafka consumers will be served from tiered Kafka log segments, whereas compute engines such as Spark will use the Iceberg table. Fig 3. Internal Tiering + Materialization Option 2: Shared Tiering (zero-copy) where Kafka tiers topic data to Iceberg directly. The Iceberg table will serve both Kafka brokers (for catch-up Kafka consumers) and analytics engines such as Spark and Flink. Fig 4. Kafka tiers data directly to Iceberg. Iceberg is shared between Kafka and analytics. On the surface, shared tiering sounds attractive: one copy of the data, lower storage costs, no need to keep two copies of the data in sync. But the reality is more complicated.  Zero-copy shared tiering appears efficient at first glance as it eliminates duplication between Kafka and the data lake.  However, rather than simply taking one cost away, it shifts that cost from storage to compute. By tiering directly to Iceberg, brokers take on substantial compute overhead as they must both construct columnar Parquet files from log segment files (instead of simply copying them to object storage), and they must download Parquet files and convert them back into log segment files to serve lagging consumers. Richie Artoul of WarpStream blogged about why tiered storage is such a bad place to put Iceberg tiering work: “ First off, generating parquet files is expensive. Like really expensive. Compared to copying a log segment from the local disk to object storage, it uses at least an order of magnitude more CPU cycles and significant amounts of memory. That would be fine if this operation were running on a random stateless compute node, but it’s not, it’s running on one of the incredibly important Kafka brokers that is the leader for some of the topic-partitions in your cluster. This is the worst possible place to perform computationally expensive operations like generating parquet files. ” – Richard Artoul, The Case for an Iceberg-Native Database: Why Spark Jobs and Zero-Copy Kafka Won’t Cut It Richard is pretty sceptical of tiered storage in general, a sentiment which I don’t share so much, but where we agree is that Parquet file writing is far more expensive than log segment uploads. Things can get worse (for Kafka) when we consider optimizing the Iceberg table for analytics queries. I recently wrote about how open table formats optimize query performance. “ Ultimately, in open table formats, layout is king. Here, performance comes from layout (partitioning, sorting, and compaction) which determines how efficiently engines can skip data… Since the OTF table data exists only once, its data layout must reflect its dominant queries. You can’t sort or cluster the table twice without making a copy of it. ” – Beyond Indexes: How Open Table Formats Optimize Query Performance Layout, layout, layout. Performance comes from pruning, pruning efficiency comes from layout. So what layout should we choose for our shared Iceberg tables? To serve Kafka consumers efficiently, the data must be laid out in offset order. Each Parquet file would contain contiguous ranges of offsets of one or more topic partitions. This is ideal for Kafka as it needs only to fetch individual Parquet files or even row groups within files, in order to reconstruct a log segment to serve a lagging consumer. Best case is a 1:1 mapping from log-segment to Parquet file; otherwise read amplification grows quickly. Fig 5. Column statistics and Iceberg partitions clearly tell Kafka where a specific topic partition offset range lives. But these pruning statistics are not useful for analytics queries. However, for the analytics query, there is no useful pruning information in this layout, leading to large table scans. Queries like WHERE EventType = 'click' must read every file, since each one contains a random mixture of event types. Min/max column statistics on columns such as EventType are meaningless for predicate pushdown, so you lose all the performance advantages of Iceberg’s table abstraction. An analytics-optimized layout, by contrast, partitions and sorts data by business dimensions, such as EventType, Region and EventTime. Files are rewritten or merged to tighten column statistics and make pruning effective. That makes queries efficient: scans touch only a handful of partitions, and within each partition, min/max stats allow skipping large chunks of data. Fig 6. Column statistics and partitions clearly tell analytics queries that want to slice and dice by EventType and Region how to prune Parquet files. But these pruning statistics are not useful for Kafka at all, which must do a costly table scan (from a Kafka broker) to find a specific offset range to rebuild a log segment. With this analytics optimized layout, a lagging Kafka consumer is going to cause a Kafka broker to have a really bad day. Offset order no longer maps to file order. To fetch an offset range, the broker may have to fetch dozens of Parquet files (as the offset range is spread across files) due to analytics workload driven choices for partition scheme and sort orders. The read path becomes highly fragmented, resulting in massive read amplification. All of this downloading, scanning and segment reconstruction is happening in the Kafka broker. That’s a lot more work for Kafka brokers and it all costs a lot of CPU and memory. Normally I would say: given two equally important workloads, optimize for the workload that is less predictable, in this case, the lakehouse. The Kafka workload is sequential and therefore predictable, so we can use tricks such as read-aheads. Analytics workloads are far less predictable, slicing and dicing the data in different ways, so optimize for that.  But this is simply not practical for Apache Kafka. We can’t ask it to reconstruct sequential log segments from Parquet files that have been completely reorganized into a different data distribution. The best we could do is some half-way house, firmly leaning towards Kafka’s needs. Basically, partitioning by ingestion time (hour or date) so we preserve the sequential nature of the data and hope that most analytical queries only touch recent data. The alternative is to use the Kafka Iceberg table as a staging table for incrementally populating a cleaner (silver) table, but then haven’t we just made a copy anyway? In practice, the additional compute and I/O overhead can easily offset the storage savings, leading to a less efficient and less predictable performance profile overall. How we optimize the Iceberg table is critical to the efficiency of both the Kafka brokers, but also the analytics workload. Without separation of concerns, each workload is trapped by the other.  The first issue is that the schema of the Kafka topic may not even be suitable for the Iceberg table, for example, a CDC stream with before and after payloads. When we materialize these streams as tables, we don’t materialize them in their raw format, we use the information to materialize how the data ended up. But shared tiering requires that we write everything, every column, every header field and so on. For CDC, we’d then do some kind of MERGE operation from this topic table into a useful table, which reintroduces a copy. But it is evolution that concerns me more. When materializing Kafka data into an Iceberg table for long-term storage, schema evolution becomes a challenge. A Kafka topic’s schema often changes over time as new fields are added and others are renamed or deprecated, but the historical data still exists in the old shape. A Kafka topic partition is an immutable log where events remain in their original form until expired. If you want to retain all records forever in Iceberg, you need a strategy that allows old and new events to coexist in a consistent, queryable way. We need to avoid cases where either queries fail on missing fields or historical data becomes inaccessible for Kafka. Let’s look at two approaches to schema evolution of tables that are tiered Kafka topics. One common solution is to build an uber-schema , which is essentially the union of all fields that have ever existed in the Kafka topic schema. In this model, the Iceberg table schema grows as new Kafka fields are introduced, with each new field added as a nullable column. Old records simply leave these columns null, while newer events populate them. Deprecated fields remain in the schema but stay null for modern data. This approach preserves history in its original form, keeps ingestion pipelines simple, and leverages Iceberg’s built-in schema evolution capabilities.  The trade-off is that the schema can accumulate many rarely used columns, and analysts must know which columns are meaningful for which time ranges. Views can help, performing the necessary coalesces and such like to turn a messy set of columns into something cleaner. But it would require careful maintenance. The alternative is to periodically migrate old data forward into the latest schema. Instead of carrying every historical field forever, old records are rewritten so that they conform to the current schema.  Missing values may be backfilled with defaults, derived from other fields, or simply set to null. No longer used columns are dropped. This limits the messiness of the table’s schema over time, turning it from a shameful mess of nullable columns into something that looks reasonable. While this strategy produces a tidier schema and simplifies querying, it can be complex to implement. No vendors support this approach that I have seen, leaving this as a job for their customers, who know their data best. From the lakehouse point of view, we’d prefer to use the migrate-forward approach over the long term, but use the uber-schema in the shorter term. Once we know that no more producers are using an old schema, we can migrate the rows of that schema forward. But that migration can cause a loss of fidelity for the Kafka workload. What gets written to Kafka may not be what gets read back. This could be a good thing or a bad thing. Good if you want to go back and reshape old events to more closely match the newest schema. Really bad for some compliance and auditing purposes; imagine telling your stakeholders/auditors that we can’t guarantee the data that is read from Kafka will be the same as was written to Kafka! The needs of SQL-writing analysts and the needs of Kafka conflict. Kafka wants fidelity (and doesn’t care about column creep) and Iceberg wants clean tables. In some cases we might have a reprieve if Kafka clients only need to consume 7 days or 30 days of data, then we can use the superset method for the period that covers Kafka, and use the migrate-forward method for the rest. But we are still coupling the needs of Kafka clients with lakehouse clients. If Kafka only needs 7 days of data and Iceberg is storing years worth, then why do we care about data duplication anyway? Zero-copy may reduce data duplication but does not necessarily reduce cost. It shifts cost from storage to compute and operational overhead. When Kafka brokers handle Iceberg files directly, they assume responsibilities that are both CPU and I/O intensive: file format conversion, table maintenance, and reconstruction of ordered log segments. The result is higher broker load and less predictable performance. Any storage savings are easily offset by the increased compute requirements. On the subject of data duplication, there are reasons to believe that Kafka Iceberg tables may degenerate into staging tables, due to the unoptimized layout and proliferation of columns over time as schemas change. Such usage as staging tables eliminates the duplication argument. But even so, there is already plenty of duplication going on in the lakehouse. We already have copies. We have bronze then silver tables. We have all kinds of staging tables already. How much impact does the duplication avoidance of shared tiering actually make? Also consider: With a good materializer, we can write to silver tables directly rather than using bronze tables as tiered Kafka data (especially in the CDC case). A good materializer can also reduce data duplication. Usually, Kafka topic retention is only a few days, or up to a month, so the scope for duplication is limited to the Kafka retention period. Bidirectional fidelity is a real concern when converting between formats like Avro and formats like Parquet, which have different type systems and evolution rules. Storing the original Kafka bytes as a column in the Iceberg table is a legitimate approach as an insurance against conversion issues (but involves a copy). The deeper problem with zero-copy tiering is its erosion of boundaries and the resulting coupling. When Kafka uses Iceberg tables as its primary storage layer, that boundary disappears. Who is responsible for the Iceberg table? If Kafka doesn’t take on ownership, then Kafka becomes vulnerable to the decisions of whoever owns and maintains the Iceberg table. After all, the Iceberg table will host the vast majority of Kafka’s data! Who’s on call if tiered data can’t be read? Kafka engineers? Lakehouse engineers? Will they cooperate well? The mistake of this zero-copy thing is that it assumes that Kafka/Iceberg unification requires a single physical representation of the data. But unification is better served as logical and semantic unification when the workloads differ so much. A logical unification of Kafka and Iceberg allows for the storage of each workload to remain separate and optimized. Physical unification removes flexibility by adding coupling (the traps). Once Kafka and Iceberg share the same storage, each system constrains the other’s optimization and evolution. The result is a single, entangled system that must manage two storage formats and be tuned for two conflicting workloads, making both less reliable and harder to operate.  This entanglement also leads to Kafka becoming a data lakehouse manager, which I believe is a mistake for the project. Once Kafka stores its data in Iceberg, it must take ownership of that data. I don’t agree with that direction. There are both open-source and SaaS data lakehouse platforms which include table maintenance, as well as access and security controls. Let lakehouses do lakehouse management, and leave Kafka to do what it does best. Materialization isn’t perfect, but it decouples concerns and avoids forcing Kafka to become a lakehouse. The benefits of maintaining storage decoupling are: Lakehouse performance optimization doesn’t penalize the Kafka workload or vice-versa.  Kafka can use a native log-centric tiering mechanism that does not overly burden brokers. Schema evolution of the lakehouse does not impact Kafka topics, providing the lakehouse with more flexibility leading to cleaner tables. There is less risk and overhead associated with bidirectional conversion between Kafka and Iceberg and back again. 100% fidelity is a concern to all system builders working in this space. Materialized data can be freely projected and transformed, as it does not form the actual Kafka data. Kafka does not depend on the materialized data at all. Apache Kafka doesn’t need to perform lakehouse management itself. The drawbacks are: we need two data copies but hopefully I’ve argued why that isn’t the biggest concern here. Duplication is already a fact of life in modern data platforms. Tools such as Kafka Connect and Apache Flink already exist for materializing Kafka data in secondary systems. They move data across boundaries in a controlled, one-way flow, preserving clear ownership and interfaces on each side. Modern lakehouse platforms provide managed tables with ingestion APIs (such as Snowpipe Streaming, Databricks Unity Catalog REST API, and others). Kafka Connect can work really well with these ingestion APIs. Flink also has sinks for Iceberg, Delta, Hudi and Paimon. We don’t need Kafka brokers to do this work and do maintenance on top of that. Unifying operational and analytical systems doesn’t mean merging their physical storage into one copy. It’s about logical and semantic unification, achieved through consistent schemas and reliable data movement, not shared files. It is tempting to put everything into one cluster, but separation of concerns is what keeps complex systems comprehensible, performant and resilient. UPDATE: I’ve been asked a few times about Confluent Tableflow and how it aligns with this discussion. I didn't include Tableflow as it is a proprietary component of Confluent Cloud, whereas this post is focused on Apache Kafka. But for what’s it’s worth, Tableflow falls under the materialization approach, acting both as a sophisticated Iceberg/Delta materializer and as a table maintenance service. It is operated as a separate component (separate from the Kora brokers). In the future it will support more sophisticated data layout optimization for analytics workloads. It does not perform zero-copy for many of the reasons why zero-copy isn’t the right choice for Apache Kafka. Nor does it execute within brokers, being a separate service that is independently scalable. Tiering is about moving data from one storage tier (and possibly storage format) to another, such that both tiers are readable by the source system and data is only durably stored in one tier. While the system may use caches, only one tier is the source of truth for any given data item. Usually it’s about moving data to a cheaper long-term storage tier.  Materialization is about making data of a primary system available to a secondary system , by copying data from the primary storage system (and format) to the secondary storage system, such that both data copies are maintained (albeit with different formats). The second copy is not readable from the source system as its purpose is to feed another data system. Copying data to a lakehouse for access by various analytics engines is a prime example. Internal Tiering is data tiering where only the primary data system can access the various storage tiers. For example, Kafka tiered storage is internal tiering. These internal storage tiers form the primary storage as a whole. Shared Tiering is data tiering where one or more data tiers is shared between multiple systems. The result is a tiering-materialization hybrid, serving both purposes. Tiering to a lakehouse is an example of shared tiering. Internal Tiering + Materialization Shared Tiering Missing values may be backfilled with defaults, derived from other fields, or simply set to null. No longer used columns are dropped. With a good materializer, we can write to silver tables directly rather than using bronze tables as tiered Kafka data (especially in the CDC case). A good materializer can also reduce data duplication. Usually, Kafka topic retention is only a few days, or up to a month, so the scope for duplication is limited to the Kafka retention period. Bidirectional fidelity is a real concern when converting between formats like Avro and formats like Parquet, which have different type systems and evolution rules. Storing the original Kafka bytes as a column in the Iceberg table is a legitimate approach as an insurance against conversion issues (but involves a copy). Lakehouse performance optimization doesn’t penalize the Kafka workload or vice-versa.  Kafka can use a native log-centric tiering mechanism that does not overly burden brokers. Schema evolution of the lakehouse does not impact Kafka topics, providing the lakehouse with more flexibility leading to cleaner tables. There is less risk and overhead associated with bidirectional conversion between Kafka and Iceberg and back again. 100% fidelity is a concern to all system builders working in this space. Materialized data can be freely projected and transformed, as it does not form the actual Kafka data. Kafka does not depend on the materialized data at all. Apache Kafka doesn’t need to perform lakehouse management itself.

0 views
tonsky.me 1 months ago

I am sorry, but everyone is getting syntax highlighting wrong

Translations: Russian Syntax highlighting is a tool. It can help you read code faster. Find things quicker. Orient yourself in a large file. Like any tool, it can be used correctly or incorrectly. Let’s see how to use syntax highlighting to help you work. Most color themes have a unique bright color for literally everything: one for variables, another for language keywords, constants, punctuation, functions, classes, calls, comments, etc. Sometimes it gets so bad one can’t see the base text color: everything is highlighted. What’s the base text color here? The problem with that is, if everything is highlighted, nothing stands out. Your eye adapts and considers it a new norm: everything is bright and shiny, and instead of getting separated, it all blends together. Here’s a quick test. Try to find the function definition here: See what I mean? So yeah, unfortunately, you can’t just highlight everything. You have to make decisions: what is more important, what is less. What should stand out, what shouldn’t. Highlighting everything is like assigning “top priority” to every task in Linear. It only works if most of the tasks have lesser priorities. If everything is highlighted, nothing is highlighted. There are two main use-cases you want your color theme to address: 1 is a direct index lookup: color → type of thing. 2 is a reverse lookup: type of thing → color. Truth is, most people don’t do these lookups at all. They might think they do, but in reality, they don’t. Let me illustrate. Before: Can you see it? I misspelled for and its color switched from red to purple. Here’s another test. Close your eyes (not yet! Finish this sentence first) and try to remember what color your color theme uses for class names? If the answer for both questions is “no”, then your color theme is not functional . It might give you comfort (as in—I feel safe. If it’s highlighted, it’s probably code) but you can’t use it as a tool. It doesn’t help you. What’s the solution? Have an absolute minimum of colors. So little that they all fit in your head at once. For example, my color theme, Alabaster, only uses four: That’s it! And I was able to type it all from memory, too. This minimalism allows me to actually do lookups: if I’m looking for a string, I know it will be green. If I’m looking at something yellow, I know it’s a comment. Limit the number of different colors to what you can remember. If you swap green and purple in my editor, it’ll be a catastrophe. If somebody swapped colors in yours, would you even notice? Something there isn’t a lot of. Remember—we want highlights to stand out. That’s why I don’t highlight variables or function calls—they are everywhere, your code is probably 75% variable names and function calls. I do highlight constants (numbers, strings). These are usually used more sparingly and often are reference points—a lot of logic paths start from constants. Top-level definitions are another good idea. They give you an idea of a structure quickly. Punctuation: it helps to separate names from syntax a little bit, and you care about names first, especially when quickly scanning code. Please, please don’t highlight language keywords. , , , stuff like this. You rarely look for them: “where’s that if” is a valid question, but you will be looking not at the the keyword, but at the condition after it. The condition is the important, distinguishing part. The keyword is not. Highlight names and constants. Grey out punctuation. Don’t highlight language keywords. The tradition of using grey for comments comes from the times when people were paid by line. If you have something like of course you would want to grey it out! This is bullshit text that doesn’t add anything and was written to be ignored. But for good comments, the situation is opposite. Good comments ADD to the code. They explain something that couldn’t be expressed directly. They are important . So here’s another controversial idea: Comments should be highlighted, not hidden away. Use bold colors, draw attention to them. Don’t shy away. If somebody took the time to tell you something, then you want to read it. Another secret nobody is talking about is that there are two types of comments: Most languages don’t distinguish between those, so there’s not much you can do syntax-wise. Sometimes there’s a convention (e.g. vs in SQL), then use it! Here’s a real example from Clojure codebase that makes perfect use of two types of comments: Per statistics, 70% of developers prefer dark themes. Being in the other 30%, that question always puzzled me. Why? And I think I have an answer. Here’s a typical dark theme: and here’s a light one: On the latter one, colors are way less vibrant. Here, I picked them out for you: This is because dark colors are in general less distinguishable and more muddy. Look at Hue scale as we move brightness down: Basically, in the dark part of the spectrum, you just get fewer colors to play with. There’s no “dark yellow” or good-looking “dark teal”. Nothing can be done here. There are no magic colors hiding somewhere that have both good contrast on a white background and look good at the same time. By choosing a light theme, you are dooming yourself to a very limited, bad-looking, barely distinguishable set of dark colors. So it makes sense. Dark themes do look better. Or rather: light ones can’t look good. Science ¯\_(ツ)_/¯ There is one trick you can do, that I don’t see a lot of. Use background colors! Compare: The first one has nice colors, but the contrast is too low: letters become hard to read. The second one has good contrast, but you can barely see colors. The last one has both : high contrast and clean, vibrant colors. Lighter colors are readable even on a white background since they fill a lot more area. Text is the same brightness as in the second example, yet it gives the impression of clearer color. It’s all upside, really. UI designers know about this trick for a while, but I rarely see it applied in code editors: If your editor supports choosing background color, give it a try. It might open light themes for you. Don’t use. This goes into the same category as too many colors. It’s just another way to highlight something, and you don’t need too many, because you can’t highlight everything. In theory, you might try to replace colors with typography. Would that work? I don’t know. I haven’t seen any examples. Some themes pay too much attention to be scientifically uniform. Like, all colors have the same exact lightness, and hues are distributed evenly on a circle. This could be nice (to know if you have OCD), but in practice, it doesn’t work as well as it sounds: The idea of highlighting is to make things stand out. If you make all colors the same lightness and chroma, they will look very similar to each other, and it’ll be hard to tell them apart. Our eyes are way more sensitive to differences in lightness than in color, and we should use it, not try to negate it. Let’s apply these principles step by step and see where it leads us. We start with the theme from the start of this post: First, let’s remove highlighting from language keywords and re-introduce base text color: Next, we remove color from variable usage: and from function/method invocation: The thinking is that your code is mostly references to variables and method invocation. If we highlight those, we’ll have to highlight more than 75% of your code. Notice that we’ve kept variable declarations. These are not as ubiquitous and help you quickly answer a common question: where does thing thing come from? Next, let’s tone down punctuation: I prefer to dim it a little bit because it helps names stand out more. Names alone can give you the general idea of what’s going on, and the exact configuration of brackets is rarely equally important. But you might roll with base color punctuation, too: Okay, getting close. Let’s highlight comments: We don’t use red here because you usually need it for squiggly lines and errors. This is still one color too many, so I unify numbers and strings to both use green: Finally, let’s rotate colors a bit. We want to respect nesting logic, so function declarations should be brighter (yellow) than variable declarations (blue). Compare with what we started: In my opinion, we got a much more workable color theme: it’s easier on the eyes and helps you find stuff faster. I’ve been applying these principles for about 8 years now . I call this theme Alabaster and I’ve built it a couple of times for the editors I used: It’s also been ported to many other editors and terminals; the most complete list is probably here . If your editor is not on the list, try searching for it by name—it might be built-in already! I always wondered where these color themes come from, and now I became an author of one (and I still don’t know). Feel free to use Alabaster as is or build your own theme using the principles outlined in the article—either is fine by me. As for the principles themselves, they worked out fantastically for me. I’ve never wanted to go back, and just one look at any “traditional” color theme gives me a scare now. I suspect that the only reason we don’t see more restrained color themes is that people never really thought about it. Well, this is your wake-up call. I hope this will inspire people to use color more deliberately and to change the default way we build and use color themes. Look at something and tell what it is by its color (you can tell by reading text, yes, but why do you need syntax highlighting then?) Search for something. You want to know what to look for (which color). Green for strings Purple for constants Yellow for comments Light blue for top-level definitions Explanations Disabled code JetBrains IDEs Sublime Text ( twice )

0 views

Arrows to Arrows, Categories to Queries

I’ve had a little time off of work as of late, and been spending it in characteristically unwise ways. In particular, I’ve written a little programming language that compiles to SQL . I call it catlang . That’s not to say that I’ve written a new query language. It’s a programming language, whose compiler spits out one giant statement. When you run that query in postgres, you get the output of your program. Why have I done this? Because I needed a funny compilation target to test out the actual features of the language, which is that its intermediary language is a bunch of abstract category theory nonsense. Which I’ll get to. But I’m sure you first want to see this bad boy in action. Behold, the function that returns 100 regardless of what input you give it. But it does it with the equivalent of a while loop: If you’re familiar with arrow notation , you’ll notice the above looks kinda like one big block. This is not a coincidence (because nothing is a coincidence). I figured if I were to go through all of this work, we might as well get a working arrow desugarer out of the mix. But I digress; that’s a story for another time. Anyway, what’s going on here is we have an arrow , which takes a single argument . We then loop, starting from the value of . Inside the loop, we now have a new variable , which we do some voodoo on to compute —the current value of the loop variable. Then we subtract 100 from , and take the absolute value. The function here is a bit odd; it returns if the input was negative, and otherwise. Then we branch on the output of , where and have been renamed and respectively. If was less than zero, we find ourselves in the case, where we add 1 to and wrap the whole thing in —which the loop interprets as “loop again with this new value.” Otherwise, was non-negative, and so we can return directly. Is it roundabout? You bet! The obtuseness here is not directly a feature, I was just looking for conceptually simple things I could do which would be easy to desugar into category-theoretical stuff. Which brings us to the intermediary language. After desugaring the source syntax for above, we’re left with this IL representation: We’ll discuss all of this momentarily, but for now, just let your eyes glaze over the pretty unicode. The underlying idea here is that each of these remaining symbols has very simple and specific algebraic semantics. For example, means “do and pipe the result into .” By giving a transformation from this categorical IL into other domains, it becomes trivial to compile catlang to all sorts of weird compilation targets. Like SQL. You’re probably wondering what the generated SQL looks like. Take a peek if you dare. It’s not pretty, rather amazingly, running the above query in postgres 17 will in fact return a single row with a single column whose value is 100. And you’d better believe it does it by actually looping its way up to 100. If you don’t believe me, make the following change: which will instead return a row for each step of the iteration. There are some obvious optimizations I could make to the generated SQL, but it didn’t seem worth my time, since that’s not the interesting part of the project. Let’s take some time to discuss the underlying category theory here. I am by no means an expert, but what I have learned after a decade of bashing my head against this stuff is that a little goes a long way. For our intents and purposes, we have types, and arrows (functions) between types. We always have the identity “do nothing arrow” : and we can compose arrows by lining up one end to another: 1 Unlike Haskell (or really any programming language, for that matter), we DO NOT have the notion of function application. That is, there is no arrow: You can only compose arrows, you can’t apply them. That’s why we call these things “arrows” rather than “functions.” There are a bundle of arrows for working with product types. The two projection functions correspond to and , taking individual components out of pairs: How do we get things into pairs in the first place? We can use the “fork” operation, which takes two arrows computing and , and generates a new arrow which generates a pair of : If you’re coming from a Haskell background, it’s tempting to think of this operation merely as the pair constructor. But you’ll notice from the type of the computation that there can be no data dependency between and , thus we are free to parallelize each side of the pair. In category theory, the distinction between left and right sides of an arrow is rather arbitrary. This gives rise to a notion called duality where we can flip the arrows around, and get cool new behavior. If we dualize all of our product machinery, we get the coproduct machinery, where a coproduct of and is “either or , but definitely not both nor neither.” Swapping the arrow direction of and , and replacing with gives us the following injections: and the following “join” operation for eliminating coproducts: Again, coming from Haskell this is just the standard function. It corresponds to a branch between one of two cases. As you can see, with just these eight operations, we already have a tremendous amount of expressivity. We can express data dependencies via and branching via . With we automatically encode opportunities for parallelism, and gain the ability to build complicated data structures, with and allowing us to get the information back out of the data structures. You’ll notice in the IL that there are no variable names anywhere to be found. The desugaring of the source language builds a stack (via the pattern), and replaces subsequent variable lookups with a series of projections on the stack to find the value again. On one hand, this makes the categorical IL rather hard to read, but it makes it very easy to re-target! Many domains do have a notion of grouping, but don’t have a native notion of naming. For example, in an electronic circuit, I can have a ribbon of 32 wires which represents an . If I have another ribbon of 32 wires, I can trivially route both wires into a 64-wire ribbon corresponding to a pair of . By eliminating names before we get to the IL, it means no compiler backend ever needs to deal with names. They can just work on a stack representation, and are free to special-case optimize series of projections if they are able to. Of particular interest to this discussion is how we desugar loops in catlang. The underlying primitive is : which magically turns an arrow on s into an arrow without the eithers. We obviously must run that arrow on eithers. If that function returns , then we’re happy and we can just output that. But if the function returns , we have no choice but to pass it back in to the eithered arrow. In Haskell, cochoice is implemented as: which as you can see, will loop until finally returns a . What’s neat about this formulation of a loop is that we can statically differentiate between our first and subsequent passes through the loop body. The first time through is , while for all other times it is . We don’t take advantage of it in the original program, but how many times have you written loop code that needs to initialize something its first time through? So that’s the underlying theory behind the IL. How can we compile this to SQL now? As alluded to before, we simply need to give SQL implementations for each of the operations in the intermediary language. As a simple example, compiles to , where is the input of the arrow. The hardest part here was working out a data representation. It seems obvious to encode each element of a product as a new column, but what do we do about coproducts? After much work thought, I decided to flatten out the coproducts. So, for example, the type: would be represented as three columns: with the constraint that exactly one of or would be at any given point in time. With this hammered out, almost everything else is pretty trivial. Composition corresponds to a nested query. Forks are s which concatenate the columns of each sub-query. Joins are s, where we add a clause to enforce we’re looking at the correct coproduct constructor. Cochoice is the only really tricky thing, but it corresponds to a recursive CTE . Generating a recursive CTE table for the computation isn’t too hard, but getting the final value out of it was surprisingly tricky. The semantics of SQL tables is that they are multisets and come with an arbitrary greatest element. Which is to say, you need an column structured in a relevant way in order to query the final result. Due to some quirks in what postgres accepts, and in how I structured my queries, it was prohibitively hard to insert a “how many times have I looped” column and order by that. So instead I cheated and added a column which looks at the processor clock and ordered by that. This is clearly a hack, and presumably will cause problems if I ever add some primitives which generate more than one row, but again, this is just for fun and who cares. Send me a pull request if you’re offended by my chicanery! I’ve run out of vacation time to work on this project, so I’m probably not going to get around to the meta-circular stupidity I was planning. The compiler still needs a few string-crunching primitives (which are easy to add), but then it would be simple to write a little brainfuck interpreter in catlang. Which I could then compile to SQL. Now we’ve got a brainfuck interpreter running in postgres. Of course, this has been done by hand before, but to my knowledge, never via compilation. There exist C to brainfuck compilers. And postgres is written in C. So in a move that would make Xzibit proud, we could run postgres in postgres. And of course, it would be fun to run brainfuck in brainfuck. That’d be a cool catlang backend if someone wanted to contribute such a thing. I am not the first person to do anything like this. The source language of catlang is heavily inspired by Haskell’s arrow syntax , which in turn is essentially a desugaring algorithm for Arrows . Arrows are slightly the wrong abstraction because they require an operation —which requires you to be able to embed Haskell functions in your category, something which is almost never possible. Unfortunately, arrow syntax in Haskell desugars down to for almost everything it does, which in turn makes arrow notation effectively useless. In an ideal world, everything I described in this blog post would be a tiny little Haskell library, with arrow notation doing the heavy lifting. But that is just not the world we live in. Nor am I the first person to notice that there are categorical semantics behind programming languages. I don’t actually know whom to cite on this one, but it is well-established folklore that the lambda calculus corresponds to cartesian-closed categories . The “closed” part of “cartesian-closed” means we have an operation , but everyone and their dog has implemented the lambda calculus, so I thought it would be fun to see how far we can get without it. This is not a limitation on catlang’s turing completeness (since gives us everything we need.) I’ve been thinking about writing a category-first programming language for the better part of a decade, ever since I read Compiling to Categories . That paper takes Haskell and desugars it back down to categories. I stole many of the tricks here from that paper. Anyway. All of the code is available on github if you’re interested in taking a look. The repo isn’t up to my usual coding standards, for which you have my apologies. Of note is the template-haskell backend which can spit out Haskell code; meaning it wouldn’t be very hard to make a quasiquoter to compile catlang into what Haskell’s arrow desugaring ought to be. If there’s enough clamor for such a thing, I’ll see about turning this part into a library. When looking at the types of arrows in this essay, we make the distinction that are arrows that we can write in catlang, while exist in the metatheory. ↩︎ When looking at the types of arrows in this essay, we make the distinction that are arrows that we can write in catlang, while exist in the metatheory. ↩︎

1 views
Dangling Pointers 1 months ago

Optimizing Datalog for the GPU

Optimizing Datalog for the GPU Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski ASPLOS'25 Datalog source code comprises a set of relations, and a set of rules. A relation can be explicitly defined with a set of tuples. A running example in the paper is to define a graph with a relation named : A relation can also be implicitly defined with a set of rules. The paper uses the relation as an example: Rule 1 states that two vertices ( and ) are part of the same generation if they both share a common ancestor ( ), and they are not actually the same vertex ( ). Rule 2 states that two vertices ( and ) are part of the same generation if they have ancestors ( and ) from the same generation. “Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added). One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the relation with itself, using as the join key, and as a filter. Semi-naïve Evaluation is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets: : holds tuples that were discovered on the current iteration holds tuples which were added in the previous iteration : holds all tuples that have been found in any iteration For a join involving two relations ( and ), is computed as the union of the result of 3 joins: joined with joined with joined with The key fact for performance is that is never joined with . More details on Semi-naïve Evaluation can be found in these notes . This paper introduces the hash-indexed sorted array for storing relations while executing Semi-naïve Evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red): Source: https://dl.acm.org/doi/10.1145/3669940.3707274 The data array holds the actual tuple data. It is densely packed in row-major order. The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort). The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys. A join of relations A and , can be implemented with the following pseudo-code: Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple. Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s HIP runtime and then run on the same Nvidia GPU. Source: https://dl.acm.org/doi/10.1145/3669940.3707274 Dangling Pointers The data structure and algorithms described by this paper seem generic, it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster). I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio. Subscribe now : holds tuples that were discovered on the current iteration holds tuples which were added in the previous iteration : holds all tuples that have been found in any iteration joined with joined with joined with

2 views
Jack Vanlightly 1 months ago

Beyond Indexes: How Open Table Formats Optimize Query Performance

My career in data started as a SQL Server performance specialist, which meant I was deep into the nuances of indexes, locking and blocking, execution plan analysis and query design. These days I’m more in the world of the open table format such as Apache Iceberg. Having learned the internals of both transactional and analytical database systems, I find the use of the word “index” interesting as they mean very different things to different systems. I see the term “index” used loosely when discussing open table format performance, both in their current designs and in speculation about future features that might make it into their specs. But what actually counts as an index in this world? Some formats, like Apache Hudi, do maintain record-level indexes such as, primary-key-to-filegroup maps that enable upserts and deletes to be directed efficiently to the right filegroup in order to support primary key tables. But they don’t help accelerate read performance across arbitrary predicates like the secondary indexes we rely on in OLTP databases. Traditional secondary indexes (like the B-trees used in relational databases) don’t exist in Iceberg, Delta Lake, or even Hudi. But why? Can't we solve some performance issues if we just added secondary indexes to the Iceberg spec? The short answer is: “ no and it's complicated ”. There are real and practical reasons why the answer isn’t just " we haven't gotten around to it yet ." The aim of the post is to understand: Why we can’t just add more secondary indexes to speed up analytical queries like we can often do in RDBMS databases. How the analytics workload drives the design of the table format in terms of core structure and auxiliary files (and how that differs to the RDBMS).  One core thing I’m going to try to convince you of in this post is that whether your table is a traditional RDBMS or Iceberg, read performance comes down to reducing the amount of IO performed, which in turn comes down to: How we organize the data itself. The use of auxiliary data structures over that data organization. The key point is that the workload determines how you organize that data and what kind of auxiliary data structures are used. We’ll start by understanding indexing in the RDBMS and then switch over and contrast that to how open table formats such as Iceberg work. At its core, an index speeds up queries by reducing the amount of IO required to execute the query. If a query needs one row but has to scan the entire table, then that will be really slow if the table is large. An index in an RDBMS is a type of B-tree and provides two main access patterns: a seek which traverses the tree looking for a specific row or a scan which iterates over the whole tree, or a range of the tree. We can broadly categorize indexes into two types in the traditional RDBMS: Clustered index . The index and the data aren’t separate, they are clustered together. The table is the clustered index. Like books on a library shelf arranged alphabetically by author. The ordering is the collection itself; finding “Tolkien” means walking directly to the T section, because the data is stored in that order.  Non-clustered index . The index is a separate (smaller) structure that points to the actual table rows. A secondary index is an example of this. Like in the library there may be a computer that allows you to search for books, and which tells you the shelf locations. Then we have table statistics such as estimated cardinalities and histograms which, as we’ll see further down, become very useful for the query optimizer to select an efficient execution plan. Let’s look at these three things: the clustered index, non-clustered index and column statistics. I could add heap-based tables and a bunch of other things but let’s try and be brief. Let’s imagine we have a User table, with UserId as the primary key, with multiple columns including FirstName, LastName, Country and Nationality. The table is a clustered (B-tree) index, which is sorted by UserId. Fig 1. The clustered index with a primary key of UserId as an auto-incrementing int. If you query the table by the primary key, the database engine can do a B-tree traversal and rapidly find the location of the desired row. For example, if I run SELECT * FROM User WHERE UserId = 100; then the database will do a clustered index seek, which means it traverses the clustered index by key, quickly finding the page which hosts the desired row. When rows are inserted and deleted, they must be added to the B-tree which may require new data pages, page splits and page merges when data pages become full or too empty. Fragmentation can occur over time such that the B-tree data pages are not laid out sequentially on disk, so index rebuilds and defragmentation can be needed. All this tree management is expensive but the benefits are worth it because in an OLTP workload, queries are overwhelmingly about finding or updating a single row, or a very small set of rows, quickly. A B-tree sorted by the primary key makes this possible: the cost of a lookup is only O(log n) regardless of table size, and once the leaf page is reached, the row is right there. That means whether the table has a thousand rows or a hundred million, a clustered index seek for UserId = 18764389 takes just a handful of page reads. The same applies to joins: if another table has a foreign key reference to UserId, the database can seek into the clustered index for each match, making primary-key joins fast and predictable. Inserts and deletes may cause page splits or merges, but the tree structure ensures the data remains in key order, so seeks stay efficient. So far so good, but what about when you want to query the Users table by country or last name? That leads us to secondary (non-clustered) indexes. My application also needs to be able to execute queries with predicates based on Country and LastName. A where clause on either of these two columns would result in a full table scan (clustered index scan) to find all rows which match that predicate. The reason is that the B-tree intermediate nodes simply contain no information about the Country  and LastName columns. To cater for queries that do not use the primary key, we could use one or more secondary indexes, which are separate B-trees which map specific non-PK columns to rows in the clustered index. In SQL Server, a non-clustered index that points to a clustered index stores the indexed column(s) value and clustering key, aka, the primary key of the match. I could create one secondary index on the column ‘LastName’ and another on ‘Country’. Fig 2. Secondary indexes map columns to the clustering key (primary key) so the query engine can perform a clustered index seek. Note that in small tables, secondary indexes are not used as it's quicker to scan the table. My SELECT * FROM Users WHERE LastName = ‘Martinez’ will use a lookup operation, which is a:  Seek on the secondary index B-tree to obtain the clustering key (PK) of the matching Martinez row. A seek on the clustered index (the table itself) using the clustering key (PK) to retrieve the full row.  Selectivity is important here . If the secondary index scan returns a small number of rows, this random IO of multiple index seeks is worth it, but if the scan returns a large number of rows, it might actually be quicker to perform a clustered index scan (full table scan).  Imagine if the table had 1 million rows and the distribution of the Country column matched the figure above, that is to say roughly 800,000 rows had ‘Spain’ in the Country column. 800,000 lookups would be far less efficient than one table scan. The degree of selectivity where either lookups from a secondary index vs a clustered index scan can be impacted depending on whether the database is running on a HDD or SSD, with SSDs tolerating more random IO than a HDD. It is up to the query optimizer to decide. Secondary indexes are ideal for high selectivity queries , but also queries that can be serviced by the secondary index alone. For example, if I only wanted to perform a count aggregation on Country, then the database could scan the Country secondary index, counting the number of rows with each country and never need to touch the much larger clustered index. We’ve successfully reduced the amount of IO to service the query and therefore sped it up. But we can also add “covering columns” to secondary indexes . Imagine the query SELECT Nationality FROM User WHERE Country = ‘Spain’ . If the user table is really wide and really large, then we can speed up the query by adding the Nationality column to the Country secondary index as a covering column . The query optimizer will see that the Country secondary index includes everything it needs and decides not to touch the clustered index at all. Covering columns make the index larger which is bad, but can remove the need to seek the clustered index which is good. These secondary indexes trade off read performance for space, write cost and maintenance cost overhead. Each write must update the secondary indexes synchronously, and sometimes these indexes get fragmented and need rebuilding. For an OLTP database, this cost can be worth it, and the right secondary index or indexes on a table can help support diverse queries on the same table. Just because I created a secondary index doesn’t mean the query optimizer will use it. Likewise, just because a secondary index exists doesn’t mean the query optimizer should choose it either. I have seen SQL Server get hammered because of incorrect secondary index usage. The optimizer believed the secondary index would help so it chose the “lookup” strategy, but it turned out that there were tens of thousands of matching rows (instead of the handful predicted), leading to tens of thousands of individual clustered index seeks. This can turn into millions of seeks in queries if the Users table exists in several joins. Table statistics, such as cardinalities and histograms are very useful here to help the optimizer decide which strategy to use. For example, if every row contains ‘Spain’ for country, the cardinality estimate will tell the optimizer not to use the secondary index for a  SELECT * FROM Users WHERE Country = ‘Spain’ query as the selectivity is not low enough. Best to do a table scan. Histograms are useful when the cardinality is skewed, such that some values have large numbers of rows while others have relatively few.  There is, of course, far more to write about this topic, way more nuances on indexing, plus alternatives such as heap-based tables, other trees such as LSM trees, and alternative data models such as documents and graphs. But for the purpose of this post, we keep our mental model to the following: Workload : optimized for point lookups, short ranges, joins of a small number of rows, and frequent writes/updates. Queries often touch a handful of rows, not millions. Data organization : tables are row-based; each page holds complete rows, making single-row access efficient. Clustered index : the table itself is a B-tree sorted by the primary key; lookups and joins by key are fast and predictable (O(log n)). Secondary (non-clustered) index : separate B-trees that map other columns to rows in the clustered index. Useful for high-selectivity predicates and covering indexes, but costly to maintain. Statistics : cardinalities and histograms guide the optimizer to choose between using an index or scanning the table. Trade-off : indexes cut read I/O but add write and maintenance overhead. In OLTP this is worth it, since selective access dominates. Perhaps the main point for this post is that secondary indexes allow for one table to support diverse queries. This is something that we cannot easily say of the open table formats as we’ll see next. Analytical systems have a completely different workload from OLTP. Instead of reading a handful of rows or updating individual records, queries in the warehouse or lakehouse typically scan millions or even billions of rows, aggregating or joining them to answer analytical questions. Where row-based storage in a sorted B-tree structure makes sense for OLTP, it is not efficient for analytical workloads. So instead, data is stored in a columnar format in large contiguous blocks with a looser global structure. Secondary indexes aren’t efficient either. Jumping from index to row for millions of matches would already be massively inefficient on a file system, but given OTFs are hosted on object storage, secondary indexing becomes even less useful. So the analytical workload makes row-based B-trees and secondary indexes the wrong tool for the job. Columnar systems flipped the model: store data in columns rather than rows , grouped into large contiguous blocks. Instead of reducing IO via B-tree traversal by primary key, or pointer-chasing through secondary indexes, IO reduction comes from effective data skipping during table scans. The columnar storage allows for whole columns to be skipped and depending on a few other factors, entire files can be skipped as well. Most data skipping happens during the planning phase, where the execution engine decides which files need to be scanned, and is often referred to as pruning . There are different types of pruning and as well as different levels of pruning effectiveness which all depend on how the data is organized, what metadata exists and any other auxiliary search optimization data structures that might exist. Let’s look at the main ingredients for effective pruning: the table structure itself and the auxiliary structures. Let’s focus on Iceberg for now to keep ourselves grounded in a specific table format design. I did an in-depth description of Iceberg internals based on v2 which is still valid today, if you want to dive in deep. To keep things simple, we’ll look at the organization of an Iceberg table at one point in time. Fig 3. An Iceberg table is queried based on a single snapshot, which forms a tree of metadata and data files. An Iceberg table also forms a tree, but it looks very different from the sorted B-tree of a clustered index. At the root is the metadata file, which records the current snapshot. That snapshot points to a manifest list, which in turn points to many manifest files, each of which references a set of data files. The data files are immutable Parquet (or ORC) files. A clustered index is inherently ordered by its key, but an Iceberg table’s data files have no intrinsic order, they’re just a catalogued set of Parquet files. Given this unordered set of files managed by metadata, performance comes from effective pruning. Effective pruning comes down to data locality that aligns with your queries . If I query for rows with Country = ‘Spain’ , but the Spanish data is spread across all the files, then the query will have to scan every file. But if all the Spanish data is grouped into one small subset of the entire data set, then the query only has to read that subset, speeding it up. We can split the discussion of effective pruning into: The physical layout of the data . How well the data locality and sorting of the table aligns with its queries. Also the size and quantity of the data files. Metadata available to the query planner . The metadata and auxiliary data structures used by the query planner (and even during the execution phase in some cases) to prune files and even blocks within files. An Iceberg table achieves data locality through multiple layers of grouping, both across files and within files (via Parquet or ORC): Partitioning : groups related rows together across a subset of data files, so queries can target only the relevant partitions. Sorting : orders rows within each partition so that values that are logically close are also stored physically close together within the same file or even the same row group inside a file. Partitioning is the first major grouping lever available. It determines how data files are organized into data files dividing the table into logical partitions based on one or more columns, such as a date (or month/year), a region, etc. All rows sharing the same partition key values are written into the same directory or group of files. This creates data locality such that when a query includes a filter on the partition column (for example, WHERE EventDate = '2025-10-01' ), the engine can identify exactly which partitions contain relevant data and ignore the rest. This process is known as partition pruning , and it allows the engine to skip scanning large portions of the dataset entirely. The more closely the partitioning aligns with typical query filters, the more effective pruning becomes. Great care must be taken with the choice of partition key as the number of partitions impacts the number of data files (with many small files causing a performance issue). Within each partition, the data files have no inherent order by themselves. This is where sort order becomes important. Sorting defines how rows are organized within each partition, typically by one or more columns that are frequently used in filter predicates or range queries. For example, if queries often filter by EventTime or Country, sorting by those columns ensures that rows with similar values are stored close together. Sorting improves data locality both within files and across files: Within files, sorting determines how rows are organized as data is written, ensuring that similar values are clustered together. This groups similar data together (according to sort key) in Parquet row groups and makes row group level pruning more effective. Sort order can influence grouping across data files during compaction. When files are written in a consistent sort order, each file tends to represent a narrow slice of the sorted key space. For example, the rows of Spain might slot into a single file or a handful of files when Country is a sort key. Fig 4. A table with three partitions by Country and sort order by Surname. Sort order is preserved, but less effective on newly written small files. Compactions can sort rows of a large group of files into fewer larger files with ordering preserved across them. Partitioning is a strong, structural form of grouping that is enforced at all times. Once data is written into partitions, those boundaries are fixed and guaranteed, making partition pruning a reliable and predictable optimization. Sort order, on the other hand, is a more dynamic process that becomes most effective after compaction. Each write operation typically sorts its own batch of data before creating new files, but this sorting happens in isolation. Across multiple writes, even if all use the same declared sort order, there’s no partition-wide reordering of existing data. As a result, files within a partition may each be internally sorted but not well ordered relative to one another. Compaction resolves this by rewriting many files into fewer, larger ones while enforcing a consistent sort order across them. Only then does the sort order become a powerful performance optimization for data skipping. Delta Lake has similar features, with partitioning and sorting. Delta supports Z-ordering, which is a multidimensional clustering technique. Instead of sorting by a single column, Z-order interleaves the bits of multiple column values (for example, Country, Nationality, and EventDate) into a single composite key that preserves spatial locality. This means that rows with similar combinations of those column values are likely to be stored close together in a file, even if no single column is globally sorted. Z-ordering is particularly effective for queries that filter on multiple dimensions simultaneously. How rows are organized across files and within files impacts pruning massively. Next let's look at how metadata and auxiliary data structures help the query planner make use of that organization. Just like in the RDBMS, column statistics are critically important, however, for the open table format, they are even more important. Column statistics are the primary method that query engines use to prune data files and row groups within files. In Iceberg we have two sources of column statistics: Metadata files : Manifest files list a set of data files along with per-column min/max values. Query engines can use these column stats to skip entire files during the planning phase based on the filter predicates. Data files : A Parquet file is divided into row groups, each containing a few thousand to millions of rows stored column by column. For every column in a row group, Parquet records min/max statistics, and query engines can use these to skip entire row groups when the filter predicate falls outside the recorded range. Fig 4. Parquet splits up the data into one or more row groups, and within each row group, each column is stored contiguously. Min/max column statistics are a mirror of how the data is laid out . When data is clustered/sorted by a column, that column’s statistics reflect that structure: each file or row group will capture a narrow slice of the column’s value range. For example, if data is sorted by Country, one file might record a min/max of ["France"–"Germany"] while another covers ["Spain"–"Sweden"]. These tight ranges make pruning highly effective. But when data isn’t organized by that column and all countries are mixed randomly across files, the statistics show it. Each file’s min/max range becomes broad, often spanning most of the column’s domain. The result is poor selectivity and less effective pruning, because the query planner can’t confidently skip any files. Simply put, column statistics are a low cost tool for the query planner to understand the data layout and plan accordingly. There are other auxiliary structures that enhance pruning and search efficiency, beyond min/max statistics.  Bloom filters are one such mechanism. A Bloom filter is a compact probabilistic data structure that can quickly test whether a value might exist in a data file (or row group). If the Bloom filter says “no,” the engine can skip reading that section entirely; if it says “maybe,” the data must still be scanned. Iceberg and Delta can store Bloom filters at the file or row-group level, providing an additional layer of pruning beyond simple min/max range checks. Bloom filters are especially useful on columns that are not part of the sort order (so min/max stats are less effective) or exact match predicates. Some systems also maintain custom or proprietary sidecar files that act as lightweight search/data skipping structures. These can include: More fine-grained statistics files, such as histograms. Table level Bloom filters (Hudi). Primary-key-to-file index to support primary key tables (Hudi, Paimon). Upserts and deletes can be directed to specific files via the index. Other techniques involve creating optimized materialized projections, known as materialized views . These are precomputed datasets derived from base tables, often filtered, aggregated, or reorganized to match common query patterns. The idea is to store data in the exact shape queries need, so that the engine can answer those queries without repeatedly scanning or aggregating the raw data. They effectively trade storage and maintenance cost for speed, just like secondary indexes do in the RDBMS. Some engines implement these projections transparently, using them automatically when a query can be satisfied by a precomputed result, just like an RDBMS will choose a secondary index if the planner decides it is more efficient. This can be much more powerful than just a materialized view as a separate table. One example is Dremio’s Reflections , though there may already be other instances of this in the wild among the data platforms. Fig 5. Materialized views help support diverse queries that the base table, with its one layout, cannot. Making such materialized views transparent to the user, by allowing the query planner to use the MV instead of the base table is a powerful approach, akin to the use of secondary indexes. Open table formats like Iceberg, Delta, and Hudi store data in immutable, columnar files, optimized for large analytical scans. Query performance depends on data skipping (pruning), which is the ability to avoid reading irrelevant files or row groups based on metadata. Pruning effectiveness depends on data layout. Data layout levers: Partitioning provides strong physical grouping across files, enabling efficient partition pruning when filters match partition keys. Sorting improves data locality within partitions, tightening column value ranges and enhancing row-group-level pruning. Compaction consolidates small files and enforces consistent sort order, making pruning more effective (and reducing the small file cost that partitioning can sometimes introduce). Z-ordering (Delta) and Liquid Clustering (Databricks) extend sorting to multi-dimensional and adaptive clustering strategies. Column statistics in Iceberg manifest files and Parquet row groups drive pruning by recording min/max values per column. The statistics reflect the physical layout. Bloom filters add another layer of pruning, especially for unsorted columns and exact match predicates.  Some systems maintain sidecar indexes such as histograms or primary-key-to-file maps for faster lookups (e.g., Hudi, Paimon). Materialized views and precomputed projections further accelerate queries by storing data in the shape of common query patterns (e.g., Dremio Reflections). These require some data duplication and data maintenance, and are the closest equivalent (in spirit) to the secondary index of an RDBMS. The physical layout of the individual table isn’t the whole story. Data modeling plays just as critical a role. In a star schema, queries follow highly predictable patterns: they filter on small dimension tables and join to a large fact table on foreign keys. That layout makes diverse analytical queries fast without the need for secondary indexes. The dimensions are small enough to be read in full (and often cached), so indexing them provides no real gain. Fig 6. The central fact table contains information such as sales, with foreign keys to the dimension tables and facts such as sale amount, quantity, etc. Dimensions may include products, customers, vendors, dates and so on. A product dimension will have a ProductId and a number of columns describing aspects of the product such as manufacturer, SKU, etc. In an Iceberg-based star schema, partitioning should follow coarse-grained dimensions like time, such as days(sale_ts) or months(sale_ts) or by region, rather than high-cardinality surrogate IDs which would fragment the table too much. However, Iceberg’s bucketing partition option can still turn surrogate IDs into large boundaries, such as bucket(64, customer_id) . Sorting within partitions by the most selective keys like customer_id, product_id, or sale_ts makes Iceberg’s column statistics far more effective for skipping data in the large fact table. The goal is to make the table’s physical structure mirror the star schema’s logical access paths, so Iceberg’s metadata and statistics work efficiently. The star-schema can serve queries on various predicates efficiently, without needing ad hoc indexes for every possible predicate. The tradeoff is the complexity and cost of transforming data into this star schema (or other scheme). If you squint, the clustered index of a relational database and the Parquet and metadata layers of an open table format such as Iceberg share a common goal: minimize the amount of data scanned . Both rely on structure and layout to guide access, the difference is that instead of maintaining B-tree structures, OTFs lean on looser data layout and lightweight metadata to guide search (pruning being a search optimization). This makes sense for analytical workloads where queries often scan millions of rows: streaming large contiguous blocks is cheap, but random pointer chasing and maintaining massive B-trees is not. Secondary indexes, so valuable in OLTP, add little to data warehousing. Analytical queries don’t pluck out individual rows, they aggregate, filter, and join across millions.  Ultimately, in open table formats, layout is king. Here, performance comes from layout (partitioning, sorting, and compaction) which determines how efficiently engines can skip data, while modeling, such as star or snowflake schemas, make analytical queries more efficient for slicing and dicing. Unlike a B-tree, order in Iceberg or Delta is incidental, shaped by ongoing maintenance. Tables drift out of shape as their layout diverges from their queries, and keeping them in tune is what truly preserves performance. Since the OTF table data exists only once, its data layout must reflect its dominant queries. You can’t sort or cluster the table twice without making a copy of it. But it turns out that copying, in the form of materialized views, is a valuable strategy for supporting diverse queries over the same table, as exemplified by Dremio Reflections. These make the same cost trade offs as secondary indexes: space and maintenance for read speed.  Looking ahead, the open table formats will likely evolve by formalizing more performance-oriented metadata. The core specs may expand to include more diverse standardized sidecar files such as richer statistics so engines can make more effective pruning decisions. But secondary data structures like materialized views are probably destined to remain platform-specific. They’re not metadata; they’re alternate physical representations of data itself, with their own lifecycles and maintenance processes. The open table formats focus on representing one canonical table layout clearly and portably. Everything beyond that such as automated projections and adaptive clustering is where engines will continue to differentiate. In that sense, the future of the OTFs lies not in embedding more sophisticated logic, but in providing the hooks for smarter systems to build on top of them.  What, then, is an “index” in this world? The closest thing we have is a Bloom filter, with column statistics and secondary materialized views serving the same spirit of guiding the query planner toward efficiency. Traditional secondary indexes are nowhere to be found, and calling an optimized Iceberg table an index (in clustered index sense) stretches the term too far. But language tends to follow utility, and for lack of a better term, “index” will probably keep standing in for the entire ecosystem of metadata, statistics, and other pruning structures that make these systems more efficient. In the future I’d love to dive into how Hybrid Transactional Analytical Processing (HTAP) systems deal with the fundamentally different nature of transactional and analytical queries. But that is for a different post. Thanks for reading. ps, if you want to learn more about table format internals, I’ve written extensively on it: The ultimate guide to table format internals - all my writing so far Why we can’t just add more secondary indexes to speed up analytical queries like we can often do in RDBMS databases. How the analytics workload drives the design of the table format in terms of core structure and auxiliary files (and how that differs to the RDBMS).  How we organize the data itself. The use of auxiliary data structures over that data organization. Clustered index . The index and the data aren’t separate, they are clustered together. The table is the clustered index. Like books on a library shelf arranged alphabetically by author. The ordering is the collection itself; finding “Tolkien” means walking directly to the T section, because the data is stored in that order.  Non-clustered index . The index is a separate (smaller) structure that points to the actual table rows. A secondary index is an example of this. Like in the library there may be a computer that allows you to search for books, and which tells you the shelf locations. Seek on the secondary index B-tree to obtain the clustering key (PK) of the matching Martinez row. A seek on the clustered index (the table itself) using the clustering key (PK) to retrieve the full row.  Workload : optimized for point lookups, short ranges, joins of a small number of rows, and frequent writes/updates. Queries often touch a handful of rows, not millions. Data organization : tables are row-based; each page holds complete rows, making single-row access efficient. Clustered index : the table itself is a B-tree sorted by the primary key; lookups and joins by key are fast and predictable (O(log n)). Secondary (non-clustered) index : separate B-trees that map other columns to rows in the clustered index. Useful for high-selectivity predicates and covering indexes, but costly to maintain. Statistics : cardinalities and histograms guide the optimizer to choose between using an index or scanning the table. Trade-off : indexes cut read I/O but add write and maintenance overhead. In OLTP this is worth it, since selective access dominates. The physical layout of the data . How well the data locality and sorting of the table aligns with its queries. Also the size and quantity of the data files. Metadata available to the query planner . The metadata and auxiliary data structures used by the query planner (and even during the execution phase in some cases) to prune files and even blocks within files. Partitioning : groups related rows together across a subset of data files, so queries can target only the relevant partitions. Sorting : orders rows within each partition so that values that are logically close are also stored physically close together within the same file or even the same row group inside a file. Within files, sorting determines how rows are organized as data is written, ensuring that similar values are clustered together. This groups similar data together (according to sort key) in Parquet row groups and makes row group level pruning more effective. Sort order can influence grouping across data files during compaction. When files are written in a consistent sort order, each file tends to represent a narrow slice of the sorted key space. For example, the rows of Spain might slot into a single file or a handful of files when Country is a sort key. Metadata files : Manifest files list a set of data files along with per-column min/max values. Query engines can use these column stats to skip entire files during the planning phase based on the filter predicates. Data files : A Parquet file is divided into row groups, each containing a few thousand to millions of rows stored column by column. For every column in a row group, Parquet records min/max statistics, and query engines can use these to skip entire row groups when the filter predicate falls outside the recorded range. More fine-grained statistics files, such as histograms. Table level Bloom filters (Hudi). Primary-key-to-file index to support primary key tables (Hudi, Paimon). Upserts and deletes can be directed to specific files via the index. Open table formats like Iceberg, Delta, and Hudi store data in immutable, columnar files, optimized for large analytical scans. Query performance depends on data skipping (pruning), which is the ability to avoid reading irrelevant files or row groups based on metadata. Pruning effectiveness depends on data layout. Data layout levers: Partitioning provides strong physical grouping across files, enabling efficient partition pruning when filters match partition keys. Sorting improves data locality within partitions, tightening column value ranges and enhancing row-group-level pruning. Compaction consolidates small files and enforces consistent sort order, making pruning more effective (and reducing the small file cost that partitioning can sometimes introduce). Z-ordering (Delta) and Liquid Clustering (Databricks) extend sorting to multi-dimensional and adaptive clustering strategies. Column statistics in Iceberg manifest files and Parquet row groups drive pruning by recording min/max values per column. The statistics reflect the physical layout. Bloom filters add another layer of pruning, especially for unsorted columns and exact match predicates.  Some systems maintain sidecar indexes such as histograms or primary-key-to-file maps for faster lookups (e.g., Hudi, Paimon). Materialized views and precomputed projections further accelerate queries by storing data in the shape of common query patterns (e.g., Dremio Reflections). These require some data duplication and data maintenance, and are the closest equivalent (in spirit) to the secondary index of an RDBMS.

0 views
NorikiTech 1 months ago

A Rust web service

Part of the “ Rustober ” series. A web service in Rust is much more than it seems. Whereas in Go you can just and , in Rust you have to bring your own async runtime. I remember when async was very new in Rust, and apparently since then the project leadership never settled on a specific runtime. Now you can pick and choose (good), but also you have to pick and choose (not so good). In practical, real-world terms it means you use Tokio . And . And . And . And . I think that’s it? My default intention is to use as few dependencies as possible — preferably none. For Typingvania I wrote the engine from scratch in C and the only library I ship with the game is SDL . DDPub that runs this website only really needs Markdown support. However, with Rust I want to try something different and commit to the existing ecosystem. I expect there will be plenty of challenges as it is. So how does all of the above relate to each other and why is it necessary? In short, as far as I understand it now: Because all of the libraries are so integrated with each other, it enables (hopefully) a fairly ergonomic experience where I don’t have to painstakingly plug things together. They are all mature libraries tested in production, so even though it’s a lot of API surface, I consider it a worthwhile time investment: my project will work, and more likely than not I’ll have to work with them again if/when I get to write Rust in a commercial setting — or just work on more side projects. , being the source (all three of the others are subprojects) provides the runtime I mentioned above and defines how everything else on top of it works. provides the trait that defines how data flows around, with abstractions and utilities around it, but is protocol-agnostic. Here’s the trait: implements HTTP on top of ’s trait, but as a building block, not as a framework. is a framework for building web applications — an abstraction over and . Effectively, I will be using mostly facilities implemented using the other three libraries. Finally, is an asynchronous SQL toolkit using directly.

0 views
Karan Sharma 1 months ago

AI and Home-Cooked Software

Everyone is worried that AI will replace programmers. They’re missing the real revolution: AI is turning everyone into one. I’ve been noticing a new pattern: people with deep domain knowledge but no coding experience are now building their own tools. Armed with AI assistants, they can create custom workflows in a matter of days, bypassing traditional development cycles. Are these solutions production-ready? Not even close. But they solve urgent, specific problems, and that’s what matters. Tasks that once required weeks of specialized training are quickly becoming weekend projects. This trend is happening even within the AI companies themselves. Anthropic, for example, shared how their own teams use Claude to accelerate their work. Crucially, this isn’t limited to developers. Their post details how non-technical staff now build their own solutions and create custom automations, providing a powerful real-world example of this new paradigm. Why search for a generic tool when you can build exactly what you need? This question leads to what I call ‘home-cooked software’: small, personal applications we build for ourselves, tailored to our specific needs. Robin Sloan beautifully describes building an app as making “a home-cooked meal,” while Maggie Appleton writes about “barefoot developers” creating software outside traditional industry structures. What’s new isn’t the concept but the speed and accessibility. With AI, a custom export format, a specific workflow, or the perfect integration is now an afternoon’s work. We’re entering an unprecedented era where the barrier between wanting a tool and having it has nearly vanished. But let’s be clear: the journey from a prototype to a production-ready application is as challenging as ever. In my experience, an AI can churn out a first draft in a few hours, which gets you surprisingly far. But the devil is in the details, and the last stretch of the journey – handling edge cases, ensuring security, and debugging subtle issues – can stretch into weeks. This distinction is crucial. AI isn’t replacing programmers; it’s creating millions of people who can build simple tools. There’s a significant difference. AI is fundamentally reshaping the economics of building software. Before AI, even a simple tool required a significant time investment in learning programming basics, understanding frameworks, and debugging. Only tools with broad appeal or critical importance justified the effort. Now, that effort is measured in hours, not months, and the primary barrier is no longer technical knowledge, but imagination and a clear understanding of one’s own needs. This doesn’t apply to complex or security-critical systems, where deep expertise remains essential. But for the long tail of personal utilities, automation scripts, and custom workflows, the math has changed completely. I’m talking about solving all those minor irritations that pile up: the script to reformat a specific CSV export, the dashboard showing exactly the three metrics you care about, or a script that pulls data from a personal project management tool to sync with an obscure time-tracking app. These tools might be held together with digital duct tape, but they solve real problems for real people. And increasingly, that’s all that matters. But this newfound capability isn’t free. It comes with what I call the “AI Tax”: a set of hidden costs that are rarely discussed. First, prompt engineering can be surprisingly time-consuming, especially for tasks of moderate complexity. While simple requests are often straightforward, anything more nuanced can become an iterative dialogue. You prompt, the AI generates a flawed output, you clarify the requirements, and it returns a new version that misses a different detail. It’s a classic 80/20 scenario: you get 80% of the way there with a simple prompt, but achieving the final 20% of correctness requires a disproportionate amount of effort in refining, correcting, and clarifying your intent to the model. Second, there’s the verification burden. Every line of AI-generated code is a plausible-looking liability. It may pass basic tests, only to fail spectacularly in production with an edge case you never considered. AI learned from the public internet, which means it absorbed all the bad code along with the good. SQL injection vulnerabilities, hardcoded secrets, race conditions—an AI will happily generate them all with complete confidence. Perhaps the most frustrating aspect is “hallucination debugging”: the uniquely modern challenge of troubleshooting plausible-looking code that relies on APIs or methods that simply don’t exist. Your codebase becomes a patchwork of different AI-generated styles and patterns. Six months later, it’s an archaeological exercise to determine which parts you wrote and which parts an AI contributed. But the most significant danger is that AI enables you to build systems you don’t fundamentally understand. When that system inevitably breaks, you lack the foundational knowledge to debug it effectively. Despite these challenges, there’s something profoundly liberating about building software just for yourself. Instead of just sketching out ideas, I’ve started building these small, specific tools. For this blog, I wanted a simple lightbox for images; instead of pulling in a heavy external library, I had Claude write a 50-line JavaScript snippet that did exactly what I needed. I built a simple, single-page compound interest calculator tailored for my own financial planning. To save myself from boilerplate at work, I created prom2grafana , a tool that uses an LLM to convert Prometheus metrics into Grafana dashboards. Ten years ago, I might have thought about generalizing these tools, making them useful for others, perhaps even starting an open source project. Today? I just want a tool that works exactly how I think. I don’t need to handle anyone else’s edge cases or preferences. Home-cooked software doesn’t need product-market fit—it just needs to fit you. We’re witnessing the emergence of a new software layer. At the base are the professionally-built, robust systems that power our world: databases, operating systems, and rock-solid frameworks. In the middle are commercial applications built for broad audiences. And at the top, a new layer is forming: millions of tiny, personal tools that solve individual problems in highly specific ways. This top layer is messy, fragile, and often incomprehensible to anyone but its creator. It’s also incredibly empowering. Creating simple software is becoming as accessible as writing. And just as most writing isn’t professional literature, most of this new software won’t be professional-grade. That’s not just okay; it’s the point. The implications are profound. Subject-matter experts can now solve their own problems without waiting for engineering resources, and tools can be hyper-personalized to a degree that is impossible for commercial software. This unlocks a wave of creativity, completely unconstrained by the need to generalize or find a market. Yes, there are legitimate concerns. Security is a real risk, though the profile changes when a tool runs locally on personal data with no external access. We’re creating personal technical debt, but when a personal tool breaks, the owner is the only one affected. They can choose to fix it, rebuild it, or abandon it without impacting anyone else. Organizations, on the other hand, will soon have to grapple with the proliferation of incompatible personal tools and establish new patterns for managing them. But these challenges pale in comparison to the opportunities. The barrier between user and creator is dissolving. We’re entering the age of home-cooked software, where building your own tool is becoming as natural as cooking your own meal. The kitchen is open. What will you cook?

0 views
Simon Willison 2 months ago

Designing agentic loops

Coding agents like Anthropic's Claude Code and OpenAI's Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly exercise the code they are writing, correct errors, dig through existing implementation details, and even run experiments to find effective code solutions to problems. As is so often the case with modern AI, there is a great deal of depth involved in unlocking the full potential of these new tools. A critical new skill to develop is designing agentic loops . One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal a coding agent can often brute force its way to an effective solution. My preferred definition of an LLM agent is something that runs tools in a loop to achieve a goal . The art of using them well is to carefully design the tools and loop for them to use. Agents are inherently dangerous - they can make poor decisions or fall victim to malicious prompt injection attacks , either of which can result in harmful results from tool calls. Since the most powerful coding agent tool is "run this command in the shell" a rogue agent can do anything that you could do by running a command yourself. To quote Solomon Hykes : An AI agent is an LLM wrecking its environment in a loop. Coding agents like Claude Code counter this by defaulting to asking you for approval of almost every command that they run. This is kind of tedious, but more importantly, it dramatically reduces their effectiveness at solving problems through brute force. Each of these tools provides its own version of what I like to call YOLO mode, where everything gets approved by default. This is so dangerous , but it's also key to getting the most productive results! Here are three key risks to consider from unattended YOLO mode. If you want to run YOLO mode anyway, you have a few options: Most people choose option 3. Despite the existence of container escapes I think option 1 using Docker or the new Apple container tool is a reasonable risk to accept for most people. Option 2 is my favorite. I like to use GitHub Codespaces for this - it provides a full container environment on-demand that's accessible through your browser and has a generous free tier too. If anything goes wrong it's a Microsoft Azure machine somewhere that's burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository. There are plenty of other agent-like tools that run code on other people's computers. Code Interpreter mode in both ChatGPT and Claude can go a surprisingly long way here. I've also had a lot of success (ab)using OpenAI's Codex Cloud . Coding agents themselves implement various levels of sandboxing, but so far I've not seen convincing enough documentation of these to trust them. Update : It turns out Anthropic have their own documentation on Safe YOLO mode for Claude Code which says: Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use in a container without internet access. You can follow this reference implementation using Docker Dev Containers. Locking internet access down to a list of trusted hosts is a great way to prevent exfiltration attacks from stealing your private source code. Now that we've found a safe (enough) way to run in YOLO mode, the next step is to decide which tools we need to make available to the coding agent. You can bring MCP into the mix at this point, but I find it's usually more productive to think in terms of shell commands instead. Coding agents are really good at running shell commands! If your environment allows them the necessary network access, they can also pull down additional packages from NPM and PyPI and similar. Ensuring your agent runs in an environment where random package installs don't break things on your main computer is an important consideration as well! Rather than leaning on MCP, I like to create an AGENTS.md (or equivalent) file with details of packages I think they may need to use. For a project that involved taking screenshots of various websites I installed my own shot-scraper CLI tool and dropped the following in : Just that one example is enough for the agent to guess how to swap out the URL and filename for other screenshots. Good LLMs already know how to use a bewildering array of existing tools. If you say "use playwright python " or "use ffmpeg" most models will use those effectively - and since they're running in a loop they can usually recover from mistakes they make at first and figure out the right incantations without extra guidance. In addition to exposing the right commands, we also need to consider what credentials we should expose to those commands. Ideally we wouldn't need any credentials at all - plenty of work can be done without signing into anything or providing an API key - but certain problems will require authenticated access. This is a deep topic in itself, but I have two key recommendations here: I'll use an example to illustrate. A while ago I was investigating slow cold start times for a scale-to-zero application I was running on Fly.io . I realized I could work a lot faster if I gave Claude Code the ability to directly edit Dockerfiles, deploy them to a Fly account and measure how long they took to launch. Fly allows you to create organizations, and you can set a budget limit for those organizations and issue a Fly API key that can only create or modify apps within that organization... So I created a dedicated organization for just this one investigation, set a $5 budget, issued an API key and set Claude Code loose on it! In that particular case the results weren't useful enough to describe in more detail, but this was the project where I first realized that "designing an agentic loop" was an important skill to develop. Not every problem responds well to this pattern of working. The thing to look out for here are problems with clear success criteria where finding a good solution is likely to involve (potentially slightly tedious) trial and error . Any time you find yourself thinking "ugh, I'm going to have to try a lot of variations here" is a strong signal that an agentic loop might be worth trying! A few examples: A common theme in all of these is automated tests . The value you can get from coding agents and other LLM coding tools is massively amplified by a good, cleanly passing test suite. Thankfully LLMs are great for accelerating the process of putting one of those together, if you don't have one yet. Designing agentic loops is a very new skill - Claude Code was first released in just February 2025! I'm hoping that giving it a clear name can help us have productive conversations about it. There's so much more to figure out about how to use these tools as effectively as possible. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . The joy of YOLO mode Picking the right tools for the loop Issuing tightly scoped credentials When to design an agentic loop This is still a very fresh area Bad shell commands deleting or mangling things you care about. Exfiltration attacks where something steals files or data visible to the agent - source code or secrets held in environment variables are particularly vulnerable here. Attacks that use your machine as a proxy to attack another target - for DDoS or to disguise the source of other hacking attacks. Run your agent in a secure sandbox that restricts the files and secrets it can access and the network connections it can make. Use someone else's computer. That way if your agent goes rogue, there's only so much damage they can do, including wasting someone else's CPU cycles. Take a risk! Try to avoid exposing it to potential sources of malicious instructions and hope you catch any mistakes before they cause any damage. Try to provide credentials to test or staging environments where any damage can be well contained. If a credential can spend money, set a tight budget limit. Debugging : a test is failing and you need to investigate the root cause. Coding agents that can already run your tests can likely do this without any extra setup. Performance optimization : this SQL query is too slow, would adding an index help? Have your agent benchmark the query and then add and drop indexes (in an isolated development environment!) to measure their impact. Upgrading dependencies : you've fallen behind on a bunch of dependency upgrades? If your test suite is solid an agentic loop can upgrade them all for you and make any minor updates needed to reflect breaking changes. Make sure a copy of the relevant release notes is available, or that the agent knows where to find them itself. Optimizing container sizes : Docker container feeling uncomfortably large? Have your agent try different base images and iterate on the Dockerfile to try to shrink it, while keeping the tests passing.

3 views
Armin Ronacher 2 months ago

90%

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code” — Dario Amodei Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close. For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just share what I learned. In parts, because I approached this project differently from my first experiments with AI-assisted coding. The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue. I set a high bar, especially that I can operate it reliably. I’ve run similar systems before and knew what I wanted. Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that. There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%. As contrast: another quick prototype we built is a mess of unclear database tgables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectation to that end. I began in the traditional way: system design, schema, architecture. At this state I don’t let the AI write, but I loop it in AI as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers. I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part where I later used the LLM to redo a larger part early and clean it up. For AI-generated or AI-supported code, I now end up with a stack that looks something like something I often wanted, but was too hard to do by hand: Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now. OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical one that both clients and server shim is based on. Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is indispensable still when debugging and needing a lot of tool access (eg: why do I have a deadlock, why is there corrupted data in the database etc.). The working together of the two is where it’s most magical. Claude might find the data, Codex might understand it better. I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem. You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along. I generally create PR-sized chunks that I can review. There are two paths to this: Agent loop with finishing touches: Prompt until the result is close, then clean up. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge. It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps understanding when a task will not go anywhere, avoiding wasted cycles. The most important piece of working with an agent is the same as regular software engineering. You need to understand your state machines, how the system behaves at any point in time, your database. It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early it, you won’t be able to operate it in a stable manner later. Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and used poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t. Agents also operate on conventional wisdom from the internet and in tern do things I would never do myself. It loves to use dependencies (particularly outdated ones). It loves to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail, than hide problems. If you don’t fight this, you end up with opaque, unobservable systems. For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer. Here are some of the things I enjoyed a lot on this project: Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands on evaluation. Trying out things: I tried three different OpenAPI implementations and approaches in a day. Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you do, but if set up well, refactoring becomes easy. Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs. Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests. SQL quality: It writes solid SQL I could never remember. I just need to review which I can. But to this day I suck at remembering and when writing it. Is 90% of code going to be written by AI? I don’t know. What I do know is, that for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way. At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago. That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow. That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility. Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now. OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical one that both clients and server shim is based on. Agent loop with finishing touches: Prompt until the result is close, then clean up. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge. Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands on evaluation. Trying out things: I tried three different OpenAPI implementations and approaches in a day. Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you do, but if set up well, refactoring becomes easy. Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs. Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests. SQL quality: It writes solid SQL I could never remember. I just need to review which I can. But to this day I suck at remembering and when writing it.

0 views
Marc Brooker 2 months ago

Seven Years of Firecracker

Time flies like an arrow. Fruit flies like a banana. Back at re:Invent 2018, we shared Firecracker with the world. Firecracker is open source software that makes it easy to create and manage small virtual machines. At the time, we talked about Firecracker as one of the key technologies behind AWS Lambda, including how it’d allowed us to make Lambda faster, more efficient, and more secure. A couple years later, we published Firecracker: Lightweight Virtualization for Serverless Applications (at NSDI’20). Here’s me talking through the paper back then: The paper went into more detail into how we’re using Firecracker in Lambda, how we think about the economics of multitenancy ( more about that here ), and how we chose virtualization over kernel-level isolation (containers) or language-level isolation for Lambda. Despite these challenges, virtualization provides many compelling benefits. From an isolation perspective, the most compelling benefit is that it moves the security-critical interface from the OS boundary to a boundary supported in hardware and comparatively simpler software. It removes the need to trade off between kernel features and security: the guest kernel can supply its full feature set with no change to the threat model. VMMs are much smaller than general-purpose OS kernels, exposing a small number of well-understood abstractions without compromising on software compatibility or requiring software to be modified. Firecracker has really taken off, in all three ways we hoped it would. First, we use it in many more places inside AWS, backing the infrastructure we offer to customers across multiple services. Second, folks use the open source version directly, building their own cool products and businesses on it. Third, it was the motivation for a wave of innovation in the VM space. In this post, I wanted to write a bit about two of the ways we’re using Firecracker at AWS that weren’t covered in the paper. Bedrock AgentCore Back in July, we announced the preview of Amazon Bedrock AgentCore . AgentCore is built to run AI agents. If you’re not steeped in the world of AI right now, you might be confused by the many definitions of the word agent . I like Simon Willison’s take : An LLM agent runs tools in a loop to achieve a goal. 1 Most production agents today are programs, mostly Python, which use a framework that makes it easy to interact with tools and the underlying AI model. My favorite one of those frameworks is Strands , which does a great job of combining traditional imperative code with prompt-driven model-based interactions. I build a lot of little agents with Strands, most being less than 30 lines of Python (check out the strands samples for some ideas ). So where does Firecracker come in? AgentCore Runtime is the compute component of AgentCore. It’s the place in the cloud that the agent code you’ve written runs. When we looked at the agent isolation problem, we realized that Lambda’s per-function model isn’t rich enough for agents. Specifically, because agents do lots of different kinds of work on behalf of different customers. So we built AgentCore runtime with session isolation . Each session with the agent is given its own MicroVM, and that MicroVM is terminated when the session is over. Over the course of a session (up to 8 hours), there can be multiple interactions with the user, and many tool and LLM calls. But, when it’s over the MicroVM is destroyed and all the session context is securely forgotten. This makes interactions between agent sessions explicit (e.g. via AgentCore Memory or stateful tools), with no interactions at the code level, making it easier to reason about security. Firecracker is great here, because agent sessions vary from milliseconds (single-turn, single-shot, agent interactions with small models), to hours (multi-turn interactions, with thousands of tool calls and LLM interactions). Context varies from zero to gigabytes. The flexibility of Firecracker, including the ability to grow and shrink the CPU and memory use of VMs in place, was a key part of being able to build this economically. Aurora DSQL We announced Aurora DSQL, our serverless relational database with PostgreSQL compatibility, in December 2014. I’ve written about DSQL’s architecture before , but here wanted to highlight the role of Firecracker. Each active SQL transaction in DSQL runs inside its own Query Processor (QPs), including its own copy of PostgreSQL. These QPs are used multiple times (for the same DSQL database), but only handle one transaction at a time. I’ve written before about why this is interesting from a database perspective. Instead of repeating that, lets dive down to the page level and take a look from the virtualization level. Let’s say I’m creating a new DSQL QP in a new Firecracker for a new connection in an incoming database. One way I could do that is to start Firecracker, boot Linux, start PostgreSQL, start the management and observability agents, load all the metadata, and get going. That’s not going to take too long. A couple hundred milliseconds, probably. But we can do much better. With clones . Firecracker supports snapshot and restore , where it writes down all the VM memory, registers, and device state into a file, and then can create a new VM from that file. Cloning is the simple idea that once you have a snapshot you can restore it as many time as you like. So we boot up, start the database, do some customization, and then take a snapshot. When we need a new QP for a given database, we restore the snapshot. That’s orders of magnitude faster. This significantly reduces creation time, saving the CPU used for all that booting and starting. Awesome. But it does something else too: it allows the cloned microVMs to share unchanged ( clean ) memory pages with each other, significantly reducing memory demand (with fine-grained control over what is shared). This is a big saving, because a lot of the memory used by Linux, PostgreSQL, and the other processes on the box aren’t modified again after start-up. VMs get their own copies of pages they write to (we’re not talking about sharing writable memory here), ensuring that memory is still strongly isolated between each MicroVM. Another knock-on effect is the shared pages can also appear only once in some levels of the CPU cache hierarchy, further improving performance. There’s a bit more plumbing that’s needed to make some things like random numbers work correctly in the cloned VMs 2 . Last year, I wrote about our paper Resource management in Aurora Serverless . To understand these systems more deeply, let’s compare their approaches to one common challenge: Linux’s approach to memory management. At a high level, in stock Linux’s mind, an empty memory page is a wasted memory page. So it takes basically every opportunity it can to fill all the available physical memory up with caches, buffers, page caches, and whatever else it may think it’ll want later. This is a great general idea. But in DSQL and Aurora Serverless, where the marginal cost of a guest VM holding onto a page is non-zero, it’s the wrong one for the overall system. As we say in the Aurora serverless paper , Aurora Serverless fixes this with careful tracking of page access frequency: − Cold page identification: A kernel process called DARC [8] continuously monitors pages and identifies cold pages. It marks cold file-based pages as free and swaps out cold anonymous pages. This works well, but is heavier than what we needed for DSQL. In DSQL, we take a much simpler approach: we terminate VMs after a fixed period of time. This naturally cleans up all that built-up cruft without the need for extra accounting. DSQL can do this because connection handling, caching, and concurrency control are handled outside the QP VM. In a lot of ways this is similar to the approach we took with MVCC garbage collection in DSQL. Instead of PostgreSQL’s , which needs to carefully keep track of references to old versions from the set of running transactions, we instead bound the set of running transactions with a simple rule (no transaction can run longer than 5 minutes). This allows DSQL to simply discard versions older than that deadline, safe in the knowledge that they are no longer referenced. Simplicity, as always, is a system property.

0 views