Latest Posts (20 found)
Zak Knill 1 month ago

AI token streaming isn't about SSE vs WebSockets

At Ably, we’ve solved production token streaming, so you don’t have to. And the hard part isn’t SSE or WebSockets. Ask an agentic coding tool or chatbot “how to stream AI tokens to a client in production” and it’ll give you a section of the answer on SSE vs WebSockets. But that’s not the question, or really the answer. In a pure comparison of SSE and WebSockets as the transport, SSE is the simpler choice, and is also the better choice for most use cases. The architecture you should build for production token streaming looks like the diagram below. It separates the ‘prompt’ request from the ‘response’ stream, and adds a token cache/data store that holds the tokens, allowing for resume and reconnection.
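The resume idea in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the architecture from the post: an in-memory dict stands in for the token cache/data store, and `TokenStore`, `append`, and `read_from` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TokenStore:
    """In-memory stand-in for the token cache/data store (hypothetical)."""

    tokens: dict = field(default_factory=dict)  # response_id -> list of tokens

    def append(self, response_id: str, token: str) -> int:
        """Store a token and return its offset within the response."""
        stream = self.tokens.setdefault(response_id, [])
        stream.append(token)
        return len(stream) - 1

    def read_from(self, response_id: str, offset: int) -> list:
        """Return (offset, token) pairs at or after `offset`."""
        stream = self.tokens.get(response_id, [])
        return [(i, t) for i, t in enumerate(stream) if i >= offset]


store = TokenStore()
for tok in ["Hello", ",", " world"]:
    store.append("resp-1", tok)

# A client that received offset 0 before disconnecting resumes from offset 1.
resumed = store.read_from("resp-1", 1)
```

Because tokens are written to the store independently of any one connection, a reconnecting client only needs to remember the last offset it saw.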

Zak Knill 1 month ago

Generations of AI applications: conversational, delegative, and collaborative

Walk into most product reviews, board decks, or “AI strategy” docs and the mental model on display is still the one from November 2022: a chat window, a back-and-forth, an LLM replying in prose. That model is two generations out of date, and teams building against it are solving the wrong problems. The conversational generation of AI applications came first. ChatGPT launched in November 2022, and through the first half of 2023 the Chat product category evolved. In early 2024 Google Gemini joined the race, and the Claude 3 family of models launched. These products are all part of the conversational generation of AI applications. It’s this generation of AI apps that still matches most people’s mental models. The core interaction of a conversational app is a text box at the bottom of the screen, you type a question or instruction, and the AI replies in the same window, in prose. This is also the design of most AI library examples. This is the design that uses HTTP request/response and SSE streamed responses. It’s the design that fits well into companies’ existing technologies and architectures. This mental model is closer to instant messaging than anything else, which is why some of the first areas of disruption were the areas where users were already interacting with a chat-box. Customer support, and search. In the conversational generation of AI applications, there’s no sense that the AI is doing anything for you. You are consulting the AI and it’s responding to you; answering your questions, asking you questions. Most people’s workflows operated on copy-pasting information in and out of the conversation. The AI’s response is essentially the whole product in the first generation of AI applications.

Zak Knill 1 month ago

LLMs are breaking 20 year old system design

The ‘cloud-native’ architecture of the last decade is built on a 20-year-old assumption: that state lives in the database, and compute is stateless. If you want to scale, you scale the database vertically (get a larger machine) [1] or design the database schema around partitioning the data, and you scale your application servers horizontally (add more boxes). Any request can hit any server, the load balancer doesn’t care, and the database is the single source of truth.

Zak Knill 2 months ago

SSE token streaming is easy, they said

I wrote about AI having ‘durable sessions’ to support async agentic applications, and in the comments everyone said: “Token streaming over SSE is easy”. …so I figured I’d dig into that claim. Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks.

Zak Knill 2 months ago

All your agents are going async

Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks. For most of the time LLMs have been around, you use them by opening a chat-style window and typing a prompt. The LLM streams the response back token-by-token. It’s how ChatGPT, claude.ai, and Claude Code work. It’s also how the demos work for basically every AI SDK or AI Library. It’s easy to think that LLM chatbots are the ‘art of the possible’ for AI right now. But that’s not the case.

Zak Knill 3 months ago

You are the bottleneck

The agent can produce code faster than you can review it. That’s the bottleneck now, not the keyboard, not the compiler. You. Before agents, the constraint was how fast you could write code. Now it’s how fast you can review it. The agent ships. You approve. And the agent is faster than you. You’re not the producer anymore. You’re the reviewer. And that changes everything about how you should spend your time.

Zak Knill 4 months ago

If code is cheap, intent is the currency

Apparently writing code is cheap now. So since the barrier to producing code is gone, the intent behind the code is the most important bit. Intent is the new scarce resource, and commit messages are where that intent lives. Agents are still, for now, working inside human processes. The software development lifecycle (I’m getting flashbacks to every agile coach ever!) is still the same: we still have commits, pull requests, code review. We still have humans responsible for the agent’s output. But generating the code is cheaper, so the code review carries more of the weight and responsibility for good code.

Zak Knill 4 months ago

A chatbot's worst enemy is page refresh

How is it possible that we’ve made incredible gains in the performance of models, but virtually no gains in the infrastructure that supports them? ...or what I like to call: the worst enemy of chatbots is page refresh. There are some large GIFs in this article, let them load :) If a picture speaks a thousand words, here is a GIF of the Claude UI taken on 11th Feb 2026.

Zak Knill 4 months ago

Only use agents for tasks you already know how to do

We’ve all seen the complaints. The burden of reviewing AI ‘output’ is shifting onto project maintainers and team members. Folks can easily generate lots of code using AI, that code might even be functional (in that it passes the tests also written by the AI). But that doesn’t necessarily make the code good or correct. So if you want to be a good team member, here’s my rule for coding with AI agents:

Zak Knill 6 months ago

SSE sucks for transporting LLM tokens

I’m just going to cut to the chase here. SSE as a transport mechanism for LLM tokens is naff. It’s not that it can’t work, obviously it can, because people are using it and SDKs are built around it. But it’s not a great fit for the problem space. The basic SSE flow goes something like this:
1. Client makes an HTTP POST request to the server with a prompt
2. Server responds with a 200 OK and keeps the connection open
3. Server streams tokens back to the client as they are generated, using the SSE format
4. Client processes the tokens as they arrive on the long-lived HTTP connection
Sure, the approach has some benefits, like simplicity and compatibility with existing HTTP infrastructure. But it still sucks.
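To make "using the SSE format" concrete, here is a minimal sketch of how a server might frame each token as an SSE event. The `id:` and `data:` fields and the blank-line terminator come from the SSE wire format; the function names `format_sse_event` and `stream_tokens`, and the choice to use the token index as the event id, are assumptions made for this illustration.

```python
from typing import Iterable, Iterator, Optional


def format_sse_event(data: str, event_id: Optional[int] = None) -> str:
    """Frame a payload as one SSE event: optional id, data lines, blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    # A payload containing newlines must be sent as several data: lines.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, using the token index as the event id."""
    for index, token in enumerate(tokens):
        yield format_sse_event(token, event_id=index)


frames = list(stream_tokens(["Hel", "lo"]))
```

In a real server these frames would be written to the open HTTP response with a `Content-Type: text/event-stream` header; that part is left out here to keep the sketch framework-agnostic.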

Zak Knill 7 months ago

So you want to build AI agent group chat?

Disclaimer: I work for Ably, so I’m intimately familiar with the tech I mention here. Opinions are my own, etc. On Nov 13th OpenAI announced the pilot of group chats in ChatGPT. This post looks at the existing patterns for interacting with models, and how they make it hard to build similar features. The OpenAI group chat feature allows multiple users to join a chat with an AI model, and have a conversation together. Responses from each user are visible to all participants, and the model responds to the entire group. Building this with existing model and SDK transport patterns is hard.

Zak Knill 1 year ago

Patterns for building realtime features

Realtime features make apps feel modern, collaborative, and up-to-date. The features predominantly require sharing changes triggered by one user to other users, as the changes are happening. This typically means your server needs to send data to some set of clients, where those clients don’t know they are missing the data. These patterns rely on a connection between the client and the server, where the server can notify the client of some data. This connection could be websockets, sse, event-streams, or polling (long or short). The connection just needs to allow the server to send data to the client without the client knowing that there is new data.

Zak Knill 2 years ago

Phone call asymmetry

You get a phone call, but you’re away from your phone or you can’t answer it right at that moment. You call the number back and hear an automated voice say: Thank you for calling [some business], for accounts press 1, to place a new order press 2…. Perhaps, by sheer luck (or skill) you manage to navigate the labyrinth of options and talk to a real human (sidebar: there’s a circle of hell reserved for the flow-chart designer that creates a branch that ends up in them hanging up on you).

Zak Knill 2 years ago

Every programmer should know

Programmers should know a lot.. apparently. Programming paradigms Lockless concurrency Floating point More algorithms More latency More more latency More more more latency More more more more latency More memory Regular expressions Programming Vim commands Time complexities Optical fibre

Zak Knill 2 years ago

How to adopt Realtime updates in your app

…and why you really should! Realtime updates rely on two main technologies:
- Websockets: A stateful, persistent, bi-directional ‘channel’ of communication.
- Server sent events (SSE): Built on top of HTTP, opens a long-running HTTP connection where multiple independent messages are written to the response over time.
You might also think of polling or long polling as a mechanism for fetching ‘Realtime’ data from your backend. Polling is not Realtime.

Zak Knill 2 years ago

You don't need CRDTs for collaborative experiences

You don’t need CRDTs for collaborative experiences. First, let’s get the ‘what-about-ery’ out the way… Hold on.. that all sounds great, but.. Offline first – this is wayy harder to get useful behaviour without CRDTs. If you don’t use them, you’re pretty much destined to have LWW (which is actually a CRDT behaviour), and one user is likely to overwrite the changes of another. This isn’t a great experience for anyone involved. Text editing – everyone’s gonna say “but hey, Google Docs uses operational transform not CRDTs”.. OK yes, but you are not Google. Martin Kleppmann has a great round-up of the various people who thought they implemented OT correctly, but actually didn’t. The reason that you need CRDTs for text editing collaboration is that it’s a really extreme example of collaboration. The nature of text editing is that any tiny errors in the placement of characters by the convergence algorithm are going to create incorrect words, and incorrect words are incredibly obvious. Text editing has a high rate of edits (as you type), and the edits need to interleave perfectly or you get incorrect words, and errors in the interleaving are super obvious (incorrect words)!
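For readers unfamiliar with LWW (last-write-wins), here is a minimal sketch of the overwrite behaviour the excerpt describes. `LWWRegister` and its fields are hypothetical names for this illustration, and breaking timestamp ties by replica id is an assumption, not something the post specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """A single value tagged with the time and replica that wrote it."""

    value: str
    timestamp: int
    replica_id: str

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        """Keep the later write; ties are broken by replica id (an assumption)."""
        return max(self, other, key=lambda r: (r.timestamp, r.replica_id))


alice = LWWRegister("Alice's title", timestamp=10, replica_id="alice")
bob = LWWRegister("Bob's title", timestamp=11, replica_id="bob")

# Both replicas converge on Bob's write; Alice's edit is silently dropped.
merged = alice.merge(bob)
```

The merge is order-independent, so replicas converge, but convergence here means one user's change is lost, which is the poor experience the excerpt points at.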

Zak Knill 2 years ago

Giving up my smartphone - Duoqin F22 Pro

I was first attracted to dumbphones after seeing a series of articles on Hacker News. I like the idea of using - and relying on - my phone less and less. No plan survives contact with the enemy, and I knew that I wouldn’t manage in life with a stripped down phone that could only do calls, texts, and maybe some music. Eventually I stumbled across the dumbphones subreddit (/r/dumbphones). On this subreddit I discovered ‘transition phones’, that is, phones that can do some smartphone things, but with dumbphone characteristics. I found that you could have a dumbphone form factor but still install all the smartphone apps you might need.

Zak Knill 2 years ago

Do developers really want to give over their data?

There’s a rise in hosted database companies like Supabase, Neon, Turso, etc. When I look at those companies, here’s the thing I’ve been struggling with: Do developers really want to give over their data? You’re making a trade-off by choosing one of these companies, and the trade-off is this: They will solve some boring infrastructure and security problems, and in return, they get all your data. Not in the Cambridge-Analytica/Facebook style of “get all your data”. More like the S3 style, where the cost (in dollars 💲) or the cost (in time/effort 🕔) is high enough to dissuade you from trying to leave. There’s a strong lock-in effect.

Zak Knill 2 years ago

So you want to build Miro and Figma style collaboration?

Miro and Figma have a bunch of collaboration features. In this post I’m going to break down two of those features and look at what you’d have to think about when building these into your own apps. Disclaimer: I work for a company in this product space, which is why I care about these problems. Let’s start with.. Collaborative cursors. These allow multiple users to interact on the same page of a website, and for each participant to see where the other participants are pointing or moving their cursors.

Zak Knill 2 years ago

Streaming data aggregation

Imagine you’re presented with this problem: Design a system that can show the top 10 most popular songs over the last 10 seconds on the homepage of a music streaming service. You have access to a queue of events representing song ‘plays’ with a tuple. The data should update, and be as fresh as possible. We are given this to work with, we need to design a system that satisfies the requirements, replacing the “❓”:
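One simple way to approach the problem statement above is a single-process sliding window. This is a sketch under stated assumptions, not the design from the post: it assumes each event is a `(timestamp, song_id)` pair (the excerpt does not say what the tuple contains), that events arrive in timestamp order, and that everything fits in memory. `SlidingTopK` is a hypothetical name.

```python
from collections import Counter, deque


class SlidingTopK:
    """Counts plays per song over a trailing time window."""

    def __init__(self, window_seconds: float = 10.0):
        self.window = window_seconds
        self.events = deque()    # (timestamp, song_id), oldest first
        self.counts = Counter()  # song_id -> plays inside the window

    def record(self, timestamp: float, song_id: str) -> None:
        """Add one play event; assumes timestamps are non-decreasing."""
        self.events.append((timestamp, song_id))
        self.counts[song_id] += 1

    def _evict(self, now: float) -> None:
        """Drop events that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window:
            _, old_song = self.events.popleft()
            self.counts[old_song] -= 1
            if self.counts[old_song] == 0:
                del self.counts[old_song]

    def top(self, now: float, k: int = 10) -> list:
        """Return the k most played songs in the window ending at `now`."""
        self._evict(now)
        return self.counts.most_common(k)


agg = SlidingTopK(window_seconds=10)
agg.record(0, "a")
agg.record(5, "b")
agg.record(6, "b")
agg.record(12, "c")
top_two = agg.top(now=12, k=2)
```

A production system would need to partition the event queue and merge partial counts; this sketch only shows the windowing and counting logic.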
