Posts in JavaScript (20 found)

April 2026 blend of links

If you read this via your RSS reader, you won’t notice the tiny design refresh of the site: nothing major, just a new header, a new icon/logo, an improved HTML structure, and an overall simplification and further reduction in weight to be, you guessed it, as light as possible. I’m very pleased with the new look, and I have a feeling that this one will stick for a while. Anyway, moving on to the links I found interesting in the last few weeks.

Man Who Threw Molotov Cocktail At Sam Altman’s Home Claims He Was Following ChatGPT Recipe For Risotto – The Onion, still the best website on the internet.

Architypes, by Anthony Nelzin-Santos – A lovely collection of photos of old storefronts. (via People and Blogs)

The Song of LinkedIn – “The song confirms your existing beliefs with words that are brave and controversial because, man, they just don’t get it. They’re so dumb! Their whole business is lagging behind! Whose business? Behind whom? Who knows! The Song of LinkedIn doesn’t care. All it cares about is making you feel like insights are happening to you when really I’m just being a dick by making you feel like a dick for not being as big a dick as me. Synergy!”

I Verified My LinkedIn Identity. Here’s What I Actually Handed Over. – Deleting my LinkedIn account a few years ago is still one of the best decisions I’ve made in my life, and today when I’m asked why I am not on LinkedIn, I tend to answer with “Why are you on LinkedIn?”, and the words “pile of garbage website” may come out too, if I’m being polite. (via 82MHz)

delphitools – The kind of website that absolutely deserves a spot on my “JavaScript allowed” list. Excellent. (via Rodrigo Ghedin)

Nobody Gets Promoted for Simplicity – “The actual path to seniority isn’t learning more tools and patterns, but learning when not to use them. Anyone can add complexity. It takes experience and confidence to leave it out.” (via Violet Pixel)

All my clients wanted a carousel, now it’s an A.I. chatbot! – “I’ve learned that when a client says simple, they don’t mean easy to use. They mean not impressive enough. They mean what will people think. A lean, fast website doesn’t look like it cost anything. It doesn’t signal effort. It doesn’t say: we take this seriously.”

Dept. of Enthusiasm – Not sure when the next issue of this newsletter will be released, but reading this extremely well-written entry got me sold. I’m also very envious of that prose. (via Meanwhile)

‘He’d gaze at the stars and go: I’m gonna be up there one day’: Prince by those who knew him best, 10 years after his death – If you’re familiar with my list of favourite songs, you can have an idea of what Prince means to me. His sudden death in April 2016 still hurts, as if I had lost a part of me, or an old friend. Scottish comedian Robert Florence, then, shared a thought that I think about often: “You’ll always be the angel and the devil on my shoulders.”


Emacs is my browser

In my ever-increasing desire to use Emacs as my sole computing environment, I have started to take browsing the web inside it far more seriously. Where I previously thought EWW to be a nicety but far from capable, after using it for a few days it seems to be usable for about 85-90% of use cases - even on the JavaScript-riddled hellhole that is the internet.

I have seen in my own use of the modern browser (Chromium/Firefox) that it is far too easy to get distracted - too easy to get off track and fall down rabbit holes that take over my day. There are infinite suggestions as to how I should spend my time, uncountable shiny objects that take my eye off the prize that is Creativity and Depth. This has been almost entirely negated with EWW (or any terminal browser - lynx or browsh for the non-Emacs users).

In addition to this, my belief is that we should make strides toward leaving behind Layer 8 of the internet - the limiting frontends of social platforms and locked-away corners of the net that limit actual discourse (Discord, I’m looking at you). We have given up far too much to big tech platforms, and gotten nothing of value in return, to the point that many now think the internet is dying. The internet is just a delivery mechanism, and for people that see it for what it really is, the internet has never been more alive. I truly recommend using the internet as if it was 1999. Even on my phone, I’ve stopped using JavaScript frontends and embraced using EWW in Emacs (in Termux), only falling back to Fennec for about 5-10% of use cases. I find this way of using the internet far higher in signal than any other method, allowing me to look up information, read documentation, and produce more.

For the uninitiated, Emacs ships with EWW (the Emacs Web Wowser), permitting you to browse the web with image and GIF support (when did GIFs die? I almost never see them these days!) directly inside Emacs.
I have some sane defaults that permit ease of use, such as a key for going back, one for yanking the URL at point, and a few built-in functions. Invoking eww-readable gives you something similar to reader mode in Firefox, removing headers and footers and focusing on the text of the page. Useful. You can send the page to your default browser with eww-browse-with-external-browser; if a page doesn’t render nicely, this is a good fallback. There are also commands to download images locally, go back a page, and save bookmarks. All of this makes browsing in EWW truly enjoyable.

I have also set .pdf files to open in Emacs, .mp4 and YouTube/video-hosting links to open in mpv, and gopher/gemini links to open in Elpher. Elpher is the EWW of the Gopher and Gemini protocols, the smolweb that is all signal and no noise. EWW uses Shr to render HTML, so you will see some references to that in my Emacs configuration. My default search engine is my own Searx instance, a privacy-respecting frontend that amalgamates all other search engines.

You will not be watching YouTube videos or reading Tweets (x.com is adversarial to browsers without JavaScript support). You will not be using social media, nor will you be logging into any platforms. You will not be doing your online banking, filling in government forms, or viewing client portals. The use case is quick web searches, documentation, and reading blogs generally. Once more, hitting the external-browser key on any web page will bring up your default browser to continue your session in Chromium or Firefox when you do run into pages that don’t work well in EWW.

I have reverted all my URL functions to default to browsing in EWW, so any web interaction must first go through EWW. This has encouraged me to deeply consider what I am doing online first and foremost, and only then fall back to a modern browser when needed. I have been pleasantly surprised with how much I am able to do inside Emacs, and continue to move toward using it as my computing environment in perpetuity.
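A minimal sketch of this kind of EWW setup - the variables below are standard EWW/Shr options, but the specific values (and the search instance URL) are illustrative assumptions, not the author's exact configuration:

```elisp
;; Illustrative EWW setup; values are assumptions, not the author's config.
(setq eww-search-prefix "https://searx.example.org/search?q=") ; hypothetical SearXNG URL
(setq browse-url-browser-function #'eww-browse-url)            ; open URLs in EWW first
(setq browse-url-secondary-browser-function #'browse-url-default-browser) ; external fallback
;; Calmer Shr rendering:
(setq shr-use-colors nil
      shr-max-image-proportion 0.6)
;; Hand video/audio off to an external player (mpv via mailcap, for example):
(setq eww-use-external-browser-for-content-type
      "\\`\\(video/\\|audio/\\)")
```

With browse-url-browser-function pointed at EWW, every link in Emacs opens there first, and the secondary browser only takes over on explicit request.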
As always, God bless, and until next time. If you enjoyed this post, consider supporting my work, checking out my book, working with me, or sending me an email to tell me what you think.


Figma's woes compound with Claude Design

I think Figma is increasingly becoming a go-to case study in the victims of the so-called "SaaSpocalypse". And Claude Design's launch last week just adds a whole new dimension of pain.

Firstly, I should say that I love(d?) the Figma product. It's hard to understand now what a big deal Figma's initial product was when it launched in the mid-2010s. It ushered in a whole new category of SaaS - using the nascent WebGL and asm.js technologies to let designers design entirely in the browser. It used to be a running joke that an app like Photoshop would never run in the browser, but Figma proved that wrong. It quickly overtook Sketch as the de facto design tool in the market - first for UI/UX wireframing and prototyping, but increasingly for everything graphic design.

As it was browser-based, it was a revelation on the developer side to be able to open UI/UX files if you weren't on a Mac (Sketch is Mac-only). It was also brilliant to be able to leave comments on the design and collaborate with the designer(s) to iterate on designs really quickly. The collaborative features (without requiring anyone to download any software) quickly meant it got adoption outside of pure design roles - PMs and executives could finally collaborate in real time on the product they were building, without having to (at best) send back revisions and notes from badly screenshotted files that tended to be out of date by the time they were received.

I'll skip over the rest of the history, including a no doubt distracting takeover attempt by Adobe that was later blocked on competition grounds. But (of course) LLMs happened, and suddenly one of the most forward-looking SaaS companies became very vulnerable to disruption itself. One completely unexpected development that I and others noticed (and wrote up a few months ago in How to make great looking reports with Claude Code) was that LLMs started to get fairly "good" at design.
By good I do not mean as good as a talented designer - clearly it's nowhere near that, currently. But as with many things, not everything requires a great designer. Even if you use a great design team to build out your core product experience (and many do not), there's an awful lot of design 'resource' required for auxiliary parts of the product: reports, proposals etc. It's not stuff that tends to get designers excited, but it can sap an awful lot of time going back and forth on a pitch deck.

And this is exactly why I think Figma is almost uniquely vulnerable. The way it managed to expand into organisations by getting uptake with non-designers becomes a liability if those non-designers can get an AI agent to do the design for them. Figma's S-1 (which is somewhat out of date by now, but is the only reported breakdown I can find) corroborates this potential weakness: only 33% of Figma's userbase in Q1 2025 was designers, with developers making up 30% and other non-design roles making up 37%. A lot of Figma's continued expansion depended on this part of their userbase. A lot of their recent product development has been about enabling further expansion in organisations - "Dev Mode" for developers (which now looks incredibly quaint against LLMs), Slides (to compete against PowerPoint and other presentation tools) and Sites (a Webflow-esque site builder) are all about expanding their TAM beyond "pure" design.

The real surprise for me, though, was how basic their "flagship" AI design product, Figma Make, is. It really does feel like something someone put together in an internal AI hackathon one weekend and it never progressed beyond that. Given how much Figma managed to push the envelope on web technology, I found this surprising - perhaps they were caught off guard by how quickly LLMs' design prowess improved, or there were internal disagreements about the role AI should or will play in design. Regardless, it's an incredibly underwhelming product as it stands.
If things weren't bad enough, Anthropic themselves launched Claude Design, which is a pretty direct competitor to Figma in many ways. While it's nowhere near functional and polished enough to replace Figma's core design product, I expect it will get significant traction outside of that. The ability to grab a design system from your existing assets in one click is very powerful - it lets you pull together prototypes, presentations or reports in your corporate design style that look and feel far better than anything a non-designer could do themselves. And I thought it was extremely telling that, unlike a lot of the other Anthropic product launches that have touched design, Figma did not provide a testimonial on it (understandably). Canva did, which I found extremely odd (they are in my eyes even more vulnerable to this product than Figma).

I think this really underlines two major weaknesses in many SaaS companies' AI strategies.

Firstly, it's very difficult to compete on AI against the company that is providing your AI inference. A quick check on Figma Make suggests that Figma (at least on my account) is indeed using Sonnet 4.5 for its inference - though I have seen it use Gemini in the past. At this point Figma is effectively funding a competitor - and the more AI usage Figma has, the more money they send over to Anthropic for the tokens they use. Even worse, Sonnet 4.5 is miles behind what Anthropic uses on Claude Design (Opus 4.7, which has vastly improved vision capabilities [1]), so the results a user gets on Make vs Claude Design are almost certainly going to underwhelm. Also, unlike most/all SaaS costs, inference (especially with these frontier models) is expensive. As Cursor found out, the frontier labs can charge a lot less to end users than to API customers like Figma. When you are potentially looking at a shrinking userbase, it's far from ideal to have very expensive variable costs that start pulling your profitability down.
Secondly, it really underlines to me how incredibly headcount-efficient companies can be when building products now. Figma has close to 2,000 employees - not all working on product engineering, of course. I really doubt Anthropic even needed 10 people to build Claude Design. Indeed, the entirety of Anthropic is around 2,500 people.

It's also worth noting that a lot of the things that would traditionally lock in a company like Figma stop working as well in an agent-first world. Multiplayer matters less when your collaborator is an agent iterating on a prompt. Plugin ecosystems matter less when you can just ask for the functionality directly. Design system tooling is the whole point of Claude Design. Enterprise SSO - Claude already has that. Most of the moats that protect a mature SaaS company are moats against other SaaS companies, not against the thing providing their inference.

I might be wrong about how bad this gets for Figma specifically. Companies with strong brands, great distribution and genuinely talented teams can often adapt faster than outsiders expect, and I'd rather be long Figma than most of its competitors. But the structural point is harder to wriggle out of. Figma now needs to out-execute a competitor whose inference is ~free to them, whose marginal cost to ship is roughly zero, and who employs fewer people on the competing product than Figma has on a single pod. That's a very hard position to pivot out of.

This feels like a preview of where SaaS economics are heading. The companies that built big orgs on the assumption of steady seat expansion are going to find themselves competing with products built by tiny teams inside the frontier labs. Figma just happens to be the first big public name where one of their primary inference suppliers has started competing against them.
Both GPT 5.4 and Opus 4.7 can now "see" screenshots at much higher resolution - Opus 4.7 jumped from 1568px / 1.15MP to 2576px / 3.75MP. Resolution isn't the whole story (scaffolding and post-training matter a lot too) but it meaningfully helps with small-element detection and layout judgement. If you've ever pasted a screenshot of something broken and the model told you it looks great, the previous lack of resolution is one of the reasons why. ↩︎


This blog post tells the time

Computer clock synchronization is a complicated process, requiring protocols like NTP and a specialized server to answer requests. In this post I explore a “serverless” method, which relies on widely available CDNs to distribute time. It’s the serverless time server we didn’t realize we already had.

This clock should display the correct time. If your device’s clock is set to the wrong time, it should tell you how far off your clock is. The page starts the process by requesting a tiny asset through the Cloudflare CDN. As Cloudflare builds the response, an HTTP header transform rule adds timing information, like http.request.timestamp.sec, to the response headers. The client waits for the response and then analyzes the network request using the fine-grained metrics provided by performance resource timers. Finally, some math is applied to adjust for network delay.

The PerformanceResourceTiming interface exposes detailed network timing information. This is similar to the web developer tools network tab, just accessible via JavaScript in the page. These metrics are extremely helpful to developers who are troubleshooting performance issues, and they will prove useful here. Notice that “sending” and “receiving” are both shown as zero milliseconds. The request and response used here are so small they likely fit in a single packet, so these events appear instantaneous. The only measurable part of the HTTP portion of the request is waiting for the network and the server to do their work.

These detailed performance timing metrics help us address a major challenge of time distribution: any server-provided timestamp has grown stale by the time it reaches the client. To account for this effect, the NTP protocol and similar software estimate the network round-trip time. A good estimate for the network delay experienced by the response is “half the round-trip time”, although this is not always accurate.
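That adjustment can be sketched in a few lines. The asset path and header name below are assumptions for illustration; the arithmetic is just the “half the round trip” estimate described above:

```javascript
// Estimate how far the local clock is from the server's clock.
// serverMs:        timestamp the CDN stamped into the response (ms since epoch)
// requestStartMs:  local clock reading when the request was sent
// responseStartMs: local clock reading when the first response byte arrived
function estimateClockOffset(serverMs, requestStartMs, responseStartMs) {
  const roundTrip = responseStartMs - requestStartMs;
  // Assume the server stamped the timestamp halfway through the round trip.
  const localClockAtStamp = requestStartMs + roundTrip / 2;
  return serverMs - localClockAtStamp; // positive => server clock is ahead of ours
}

// In the page it would be wired up roughly like this (hypothetical asset
// path and header name):
//   const res = await fetch("/time-probe.txt");
//   const serverMs = 1000 * Number(res.headers.get("x-request-timestamp-sec"));
//   const [t] = performance.getEntriesByName(res.url).slice(-1);
//   const offset = estimateClockOffset(
//     serverMs,
//     performance.timeOrigin + t.requestStart,
//     performance.timeOrigin + t.responseStart);
```

Note that PerformanceResourceTiming values are relative to performance.timeOrigin, hence the additions in the commented usage.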
Additional adjustment based on server processing delay further improves accuracy. Cloudflare helps us out by providing server-side timing information. The client generally can’t distinguish between network delay and server delay, so this information helps us estimate when the server generated the timestamp. This data includes metrics like cf.timings.origin_ttfb_msec, which tells us how long the Cloudflare CDN waited on a response from Cloudflare Pages.

At the end of all the measurement and the math, the clock display is an estimate. We’re guessing how much the server-provided timestamp aged before it reached the web browser. It’s an educated guess, informed by a lot of metrics, but there is uncertainty here.

For a technique I’ve been calling serverless, I’ve sure talked about servers a lot. The term serverless really means that we’re not managing individual servers ourselves; the cloud hosting provider has abstracted those away. This setup uses Cloudflare Pages to host the tiny asset which this page fetches. The HTTP header transform rule is part of the CDN; we don’t even need Cloudflare Workers. So it’s just files I’ve pushed to GitLab, served by Cloudflare Pages, and some CDN configuration. Tons of servers, but abstracted away. Contrast this with NTP, where we’d need to run the NTP daemon and perhaps manage the underlying operating system. It feels “serverless” in comparison.

The clock display includes error bounds, which describe the precision of the provided time. Network latency plays a big part here, as we don’t know how long it took to reach Cloudflare or how long it took to get back. Network paths could be asymmetric, or packet loss could cause unexpected delay in a single direction. While in normal cases we’d expect the server to process the request (and generate the timestamp) in the middle of our waiting period, in extreme cases those events could fall far to one side.
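The error-bound reasoning above can be sketched the same way: the timestamp could have been generated anywhere inside the waiting window, so after subtracting known server-side processing time, half of what remains is the worst-case one-way uncertainty. Function and parameter names here are illustrative:

```javascript
// Worst-case uncertainty of a single timestamp measurement, in ms.
// waitingMs:          local time between sending the request and the first
//                     response byte arriving
// serverProcessingMs: time the server is known to have spent (e.g. an
//                     origin-TTFB style metric); assumed 0 if unknown
function errorBoundMs(waitingMs, serverProcessingMs = 0) {
  // The stamp happened somewhere inside the unaccounted-for window;
  // in the worst case it sits at one end of it.
  return (waitingMs - serverProcessingMs) / 2;
}
```

A 120ms wait with 20ms of known server processing leaves a ±50ms bound; lower latency shrinks the bound directly.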
This uncertainty, and the associated error bounds, are reduced when the network latency is lower, which plays into a strength of Cloudflare: their CDN points of presence are located geographically near major population centers. For most of us, network latency to Cloudflare is quite small. The performance resource timers also help us precisely estimate when Cloudflare processes the HTTP request, as we can eliminate delays caused by DNS resolution, the TCP handshake, and TLS session initialization. Precision could be improved by performing multiple requests and applying statistical analysis, but this page makes no such effort. In my testing I’ve often seen 60ms error bounds shown in the web clock. NTP clients, like the command-line ntpdig, produce a much tighter estimate, closer to 6ms. This is an order of magnitude difference.

While this method provides decent synchronization with the clock on Cloudflare’s CDN servers, we’ve got to consider how well synchronized that clock is with the official time. After all, if Cloudflare’s CDN servers provide the wrong timestamp it doesn’t matter how precisely we’ve synchronized; we’ll display the wrong time. Cloudflare’s CDN is not formally a time server, so we need to tread carefully when using it this way. I checked the accuracy against a couple of sources. When I collected the ntpdig output shown above, my web clock reported I was behind by 130±70ms. These measurements are within each other’s error bounds, which shows agreement. I also checked using a GPS debugging app on my phone. GPS provides extremely accurate clocks and is likely the most accurate clock I can access. The clocks appeared to update in lock-step, again showing agreement. In this screenshot, notice that my phone’s clock is ahead of the other clocks and this offset is detected by the web clock. In any case, it seems risky to depend on an unwitting time server, so without specific promises from Cloudflare I’d just consider this a demo.
After all, the Internet doesn’t always know the right time. So far I’ve tested this on my laptop and phone, but I’d be interested to see how well it works for others. You can use tools like ntpdig or GPS debugging apps to compare. I’ve built a standalone web clock for that sort of testing. You may be surprised by how inaccurate your system time can be; slightly offset clocks are quite common. This is especially true when a device sleeps, suspends, or hibernates. When a computer’s CMOS battery is missing or failed, clocks can fall very far out of sync. I’d be curious to see what people discover (contact info at bottom).

While the precision of this CDN-based method is relatively poor for a time synchronization protocol, it does offer some attractive features over current solutions. First and foremost: it’s web-native! NTP’s lack of security has been a growing concern. One replacement, Network Time Security (NTS), cryptographically authenticates information sent by the time server. The authenticated encryption of HTTPS similarly protects the CDN-based web clock approach. This avoids situations where an attacker-in-the-middle tampers with insecure NTP responses, messing up your system’s clock.

There are a lot of hazards here, unfortunately. Alternate time synchronization protocols have a history of mistakes, so it’s wise to be wary. Microsoft tried a TLS-based synchronization approach via Secure Time Seeding (STS). Their approach relied on time metadata in TLS connections, but most servers actually provide random data in the relevant field. This caused clocks to reset to random times. In either case, this underscores the risks of getting a clock reference from systems that don’t realize they are being used as time servers.

Closing on a more nostalgic note, NIST’s time.gov has a wonderfully retro clock widget. Unfortunately they no longer allow you to host it on your own site, probably due to server load. Here’s my own 88x31 badge, which is hereby MIT licensed.
It makes use of SVG’s questionable ability to embed scripts in images.

iDiallo, 1 week ago

Back button hijacking is going away

When websites are blatantly hostile, users close them, never to come back. Have you ever downloaded an app, realized it was deceptive, and deleted it immediately? It's a common occurrence for me. But there is truly hostile software that we still end up using daily. We don't just delete those apps, because the hostility is far more subtle. It's like the boiling frog: the heat turns up so slowly that the frog enjoys a nice warm bath before it's fully cooked. With clever hostile software, they introduce one frustrating feature at a time.

Every time I find myself on LinkedIn, it's not out of pleasure. Maybe it's an email about an enticing job. Maybe it's an article someone shared with me. Either way, before I click the link, I have no intention of scrolling through the feed. Yet I end up on it anyway, not because I want to, but because I've been tricked. You see, LinkedIn employs a trick called back button hijacking. You click a LinkedIn URL that a friend shared, read the article, and when you're done, you click the back button expecting to return to whatever app you were on before. But instead of going back, you're still on LinkedIn. Except now, you are on the homepage, where your feed loads with enticing posts that lure you into scrolling. How did that happen? How did you end up on the homepage when you only clicked on a single link? That's back button hijacking.

Here's how it works. When you click the original LinkedIn link, you land on a page and read the article. In the background, LinkedIn secretly gets to work. Using the JavaScript history.replaceState method, it swaps the page's URL to the homepage. This method doesn't add an entry to the browser's history. Then LinkedIn manually pushes the original URL you landed on into the history stack with history.pushState. This all happens so fast that the user never notices any change in the URL or the page. As far as the browser is concerned, you opened the LinkedIn homepage and then clicked on a post to read it.
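The maneuver just described can be traced with a toy model of the browser history. The stand-in below mimics the standard History API's replaceState/pushState behavior; the exact internals of any real site are an assumption:

```javascript
// Minimal stand-in for window.history so the trick can be run outside a
// browser. replaceState overwrites the current entry; pushState drops any
// forward entries and adds a new one on top, like the real History API.
function makeHistory(initialUrl) {
  const stack = [initialUrl];
  let index = 0;
  return {
    replaceState(url) { stack[index] = url; },
    pushState(url) { stack.splice(index + 1); stack.push(url); index++; },
    back() { if (index > 0) index--; return stack[index]; },
    current() { return stack[index]; },
  };
}

// You land directly on a shared article...
const hist = makeHistory("/posts/shared-article");
// ...the site silently rewrites the current entry to the homepage...
hist.replaceState("/feed");
// ...then pushes the article back on top. You never see the URL change:
hist.pushState("/posts/shared-article");
console.log(hist.current()); // "/posts/shared-article" -- still the article
console.log(hist.back());    // "/feed" -- Back now lands on the feed
```

Two calls, and the browser's record of how you got here has been rewritten.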
So when you click the back button, you're taken back to the homepage, the feed loads, and you're presented with the most engaging post to keep you on the platform. If you spent a few minutes reading the article, you probably won't even remember how you got to the site. So when you click back and see the feed, you won't question it. You'll assume nothing deceptive happened. While LinkedIn only pushes you one level down in the history stack, more aggressive websites can break the back button entirely. They push a new history state every time you try to go back, effectively trapping you on their site. In those cases, your only option is to close the tab.

I've also seen developers unintentionally break the back button, often when implementing a search feature. On a search box where each keystroke returns a result, an inexperienced developer might push a new history state on every keystroke, intending to let users navigate back to previous search terms. Unfortunately, this creates an excessive number of history entries. If you typed a long search query, you'd have to click the back button for every character (including spaces) just to get back to the previous page. The correct approach is to only push a history state when the user submits or leaves the search box.

As of yesterday, Google announced a new spam policy to address this issue. Their reasoning:

People report feeling manipulated and eventually less willing to visit unfamiliar sites. As we've stated before, inserting deceptive or manipulative pages into a user's browser history has always been against our Google Search Essentials.

Any website using these tactics will be demoted in search results:

Pages that are engaging in back button hijacking may be subject to manual spam actions or automated demotions, which can impact the site's performance in Google Search results. To give site owners time to make any needed changes, we're publishing this policy two months in advance of enforcement on June 15, 2026.
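The non-breaking search approach mentioned above can be sketched like this: update the URL on each keystroke with replaceState (which creates no history entry) and record a real entry only when the search is committed. Handler names and the URL shape are illustrative:

```javascript
// While typing: overwrite the current entry so one Back press still
// leaves the page, no matter how many characters were typed.
function onSearchInput(hist, query) {
  hist.replaceState(null, "", "/search?q=" + encodeURIComponent(query));
}

// On submit (or when the search box loses focus): record one real entry
// the user can navigate back to later.
function onSearchSubmit(hist, query) {
  hist.pushState(null, "", "/search?q=" + encodeURIComponent(query));
}
```

In a browser these would be wired to the input's input and submit events and passed window.history; typing "cat" then submitting produces three replaceState calls and a single history entry.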
I'm not sure how much search rankings affect LinkedIn specifically, but in the grand scheme of things, this is a welcome change. I hope this practice is abolished entirely.

iDiallo, 1 week ago

Your friends are hiding their best ideas from you

Back in college, the final project in our JavaScript class was to build a website. We were a group of four, and we built the best website in class. It was for a restaurant called the Coral Reef. We found pictures online, created a menu, and settled on a solid theme. I was taking a digital art class in parallel, so I used my Photoshop skills to place our logo inside pictures of our fake restaurant. All of a sudden, something clicked. We were admiring our website on a CRT monitor when my classmate pulled me aside. She had an idea. A business idea. An idea so great that she couldn't share it with the rest of the team. She whispered, covering her mouth with one hand so a lip reader couldn't steal this fantastic idea: "What if we build websites for people?"

This was the 2000s; of course it was a fantastic idea. The perfect time to spin up an online business after a market crash. But what she didn't know was that, while I was in class in the mornings, my afternoons were spent scouring Craigslist and building crappy websites for a hundred to two hundred dollars a piece. I wasn't going to share my measly spoils. If anything, this was the perfect time to build that kind of service. "That's a great idea," I said.

There is something satisfying about having an idea validated. A sort of satisfaction we get from the acknowledgment. We are smart, and our ideas are good. Whenever someone learned that I was a developer, they felt this urge to share their "someday" idea. It's an app, a website, or some technology I couldn't even make sense of. I used to try to dissect these ideas, get to the nitty-gritty details, scrutinize them. But that always ended in hostility. "Yeah, you don't get it. You probably don't have enough experience" was a common response when I didn't give a resounding yes. I don't get those questions anymore, at least not framed in the same way. I have worked for decades in the field, and I even have a few failed start-ups under my belt. I'm ready to hear your ideas.

But that job has been taken, not by another eager developer with even more experience, or maybe a successful start-up on their résumé. No, not a person. AI took this job. Somewhere behind a chatbot interface, an AI is telling one of your friends that their idea is brilliant. Another AI is telling them to write out the full details in a prompt and it will build the app in a single stroke. That friend probably shared a localhost:3000 link with you, or a Lovable app, last year. That same friend was satisfied with the demo they saw then and has most likely moved on.

In the days when I stood as a judge, validating an idea was rarely what sparked a business. The satisfaction was in the telling. And today, a prompt is rarely a spark either. In fact, the prompt is not enough. My friends share a link to their ChatGPT conversation as proof that their idea is brilliant. I can't deny it; the robot has already spoken. I'm not the authority on good or bad ideas. I've called ideas stupid that went on to make millions of dollars. (A ChatGPT wrapper for SMS, for instance.)

A decade ago, I was in Y Combinator's Startup School. In my batch, there were two co-founders: one was the developer, and the other was the idea guy. In every meeting, the idea guy would come up with a brand new idea that had nothing to do with their start-up. The instructor tried to steer him toward being the salesman, but he wouldn't budge. "My talent is in coming up with ideas," he said.

We love having great ideas. We're just not interested in starting a business, because that's what it actually takes. A friend will joke, "Here's an idea," then proceed to tell me their idea. "If you ever build it, send me my share." They are not expecting me to build it. They are happy to have shared a great idea.

As for my classmate, she never spoke of the business again. But over the years, she must have sent me at least a dozen clients. It was a great idea after all.


watgo - a WebAssembly Toolkit for Go

I'm happy to announce the general availability of watgo - the WebAssembly Toolkit for Go. This project is similar to wabt (C++) or wasm-tools (Rust), but in pure, zero-dependency Go. watgo comes with a CLI and a Go API to parse WAT (WebAssembly Text), validate it, and encode it into WASM binaries; it also supports decoding WASM from its binary format. At the center of it all is wasmir - a semantic representation of a WebAssembly module that users can examine (and manipulate). This diagram shows the functionality provided by watgo:

watgo comes with a CLI, which you can install by issuing this command:

The CLI aims to be compatible with wasm-tools [1] , and I've already switched my wasm-wat-samples projects to use it; e.g. a command to parse a WAT file, validate it and encode it into binary format:

wasmir semantically represents a WASM module with an API that's easy to work with. Here's an example of using watgo to parse a simple WAT program and do some analysis:

One important note: the WAT format supports several syntactic niceties that are flattened / canonicalized when lowered to wasmir . For example, all folded instructions are lowered to unfolded ones (linear form), function & type names are resolved to numeric indices, etc. This matches the validation and execution semantics of WASM and its binary representation. These syntactic details are present in watgo in the textformat package (which parses WAT into an AST) and are removed when the AST is lowered to wasmir . The textformat package is kept internal at this time, but in the future I may consider exposing it publicly - if there's interest.

Even though it's still early days for watgo, I'm reasonably confident in its correctness thanks to a strategy of very heavy testing right from the start. WebAssembly comes with a large official test suite , which is perfect for end-to-end testing of new implementations.
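The folded-to-unfolded lowering described above is essentially a post-order walk of the s-expression tree. A minimal Python sketch (illustrative only, not watgo's internals):

```python
# Illustrative only (not watgo's internals): lowering a folded WAT
# expression to linear form is a post-order walk -- operands are
# emitted first, then the operator.
def unfold(expr):
    if isinstance(expr, str):      # already a bare instruction
        return [expr]
    op, *operands = expr           # folded form: (op operand1 operand2 ...)
    out = []
    for operand in operands:
        out.extend(unfold(operand))
    out.append(op)
    return out

# (i32.add (i32.const 1) (i32.const 2)) -> i32.const 1; i32.const 2; i32.add
folded = ("i32.add", "i32.const 1", "i32.const 2")
assert unfold(folded) == ["i32.const 1", "i32.const 2", "i32.add"]
```

This matches WASM's stack-machine execution order, which is why the binary format only has the linear form.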
The core test suite includes almost 200K lines of WAT files that carry several modules with expected execution semantics and exercise a variety of error scenarios. These live in specially designed .wast files and leverage a custom spec interpreter. watgo hijacks this approach by using the official test suite for its own testing. A custom harness parses .wast files and uses watgo to convert the WAT in them to binary WASM, which is then executed by Node.js [2] ; this harness is a significant effort in itself, but it's very much worth it - the result is excellent testing coverage. watgo passes the entire WASM spec core test suite.

Similarly, we leverage wabt's interp test suite, which also includes end-to-end tests, using a simpler Node-based harness to test them against watgo. Finally, I maintain a collection of realistic program samples written in WAT in the wasm-wat-samples repository ; these are also used by watgo to test itself.

- Parse: a parser from WAT to wasmir
- Validate: uses the official WebAssembly validation semantics to check that the module is well formed and safe
- Encode: emits wasmir into WASM binary representation
- Decode: reads WASM binary representation into wasmir
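The Encode step targets WASM's binary format, which opens with a fixed 8-byte preamble defined by the spec. A quick Python sketch of the smallest valid module (just that preamble; nothing watgo-specific):

```python
# The smallest valid WebAssembly binary module is just the preamble:
# the 4-byte magic "\0asm" followed by version 1 as a little-endian u32.
WASM_MAGIC = b"\x00asm"
WASM_VERSION = (1).to_bytes(4, "little")

def empty_module() -> bytes:
    return WASM_MAGIC + WASM_VERSION

mod = empty_module()
assert mod == b"\x00asm\x01\x00\x00\x00"
assert len(mod) == 8
```

Everything after the preamble is a sequence of typed sections (types, functions, code, ...), which is what a real encoder emits next.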

Martin Alderson 1 week ago

Has Mythos just broken the deal that kept the internet safe?

For nearly 20 years the deal has been simple: you click a link, arbitrary code runs on your device, and a stack of sandboxes keeps that code from doing anything nasty. Browser sandboxes for untrusted JavaScript, VM sandboxes for multi-tenant cloud, ad iframes so banner creatives can't take over your phone or laptop - the modern internet is built on the assumption that those sandboxes hold. Anthropic just shipped a research preview that generates working exploits for one of them 72.4% of the time, up from under 1% a few months ago. That deal might be breaking.

From what I've read Mythos is a very large model. Rumours have pointed to it being similar in size to the short-lived (and very underwhelming) GPT-4.5 . As such I'm with a lot of commentators in thinking that a primary reason this hasn't been rolled out further is compute. Anthropic is probably the most compute-starved major AI lab right now and I strongly suspect they do not have the compute to roll this out more broadly even if they wanted to. From leaked pricing, it's expensive as well - at $125/MTok output (5x more than Opus, which is itself the most expensive model out there).

One thing that has really been overlooked with all the focus on frontier-scale models is how quickly improvements in the huge models are being achieved on far smaller models. I've spent a lot of time with the Gemma 4 open weights model, and it is incredibly impressive for a model that is ~50x smaller than the frontier models. So I have no doubt that whatever capabilities Mythos has will relatively quickly be available in smaller, and thus easier to serve, models. And even if Mythos' huge size somehow is intrinsic to the abilities it has (I very much doubt this, given current progress in scaling smaller models), it's only a matter of time before newer chips [1] are able to serve it en masse. It's important to look to where the puck is going. As I've written before, LLMs in my opinion pose an extremely serious cybersecurity risk.
Fundamentally we are seeing a radical change in how easy it is to find (and thus exploit) serious flaws and bugs in software for nefarious purposes.

To back up a step, it's important to understand how modern cybersecurity is currently achieved. One of the most important concepts is that of a sandbox . Nearly every electronic device you touch day to day has one (or many) layers of these to protect the system. In short, a sandbox is a so-called 'virtualised' environment where software can execute on the system, but with limited permissions, segregated from other software, with a very strong boundary that prevents the software from 'breaking out' of the sandbox.

If you're reading this on a modern smartphone, you have at least 3 layers of sandboxing between this page and your phone's operating system. First, your browser has (at least) two levels of sandboxing. One is for the JavaScript execution environment (which runs the interactive code on websites). This is then wrapped by the browser sandbox, which limits what the site as a whole can do. Finally, iOS or Android then has an app sandbox which limits what the browser as a whole can do.

This defence in depth is absolutely fundamental to modern information security; it is what allows users to browse "untrusted" websites with any degree of safety. For a malicious website to gain control over your device, it needs to chain together multiple vulnerabilities, all at the same time. In reality this is extremely hard to do (and these kinds of chains fetch millions of dollars on the grey market ).

Guess what? According to Anthropic, Mythos Preview successfully generates a working exploit for Firefox's JS shell in 72.4% of trials. Opus 4.6 managed this in under 1% of trials in a previous evaluation. Worth flagging a couple of caveats. The JS shell here is Firefox's standalone SpiderMonkey - so this is escaping the innermost sandbox layer, not the full browser chain (the renderer process and OS app sandbox still sit on top).
And it's Anthropic's own benchmark, not an independent one. But even hedging both of those, the trajectory is what matters - we're going from "effectively zero" to "72.4% of the time" in one model generation, on a real-world target rather than a toy CTF.

If you understand the implications, this is pretty terrifying. If an LLM can find exploits in sandboxes - which are some of the most well secured pieces of software on the planet - then suddenly every website you aimlessly browse through could contain malicious code which can 'escape' the sandbox and theoretically take control of your device - and all the data on your phone could be sent to someone nasty.

These attacks are so dangerous because the internet is built around sandboxes being safe. For example, each banner ad your browser loads runs in a separate sandboxed environment. This means ads can ship a huge amount of (mostly) untested code, with everyone relying on the browser sandbox to protect them. If that sandbox falls, then suddenly a malicious ad campaign can take over millions of devices in hours.

Equally, sandboxes (and virtualisation) are fundamental to allowing cloud computing to operate at scale. Most workloads these days are not running directly on the physical server underneath them. Instead, AWS et al take the physical hardware and "slice" it up into so-called "virtual" servers, selling each slice to a different customer. This allows many more applications to run on a single server - and enables some pretty nice profit margins for the companies involved. This operates on roughly the same model as your phone, with various layers to protect customers from accessing each other's data and (more importantly) from accessing the control plane of AWS.

So, we have a very, very big problem if these sandboxes fail, and all fingers point towards this being the case this year.
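The defence-in-depth argument above can be put in back-of-the-envelope terms: if each layer falls independently, the chance of a full chain succeeding is the product of the per-layer rates. A toy sketch (the rates are made up purely for illustration):

```python
# Back-of-the-envelope illustration with made-up, independent per-layer
# exploit success rates: a full compromise needs every layer at once.
layer_rates = {
    "js_engine_escape": 0.01,
    "browser_sandbox_escape": 0.01,
    "os_app_sandbox_escape": 0.01,
}

chain_rate = 1.0
for rate in layer_rates.values():
    chain_rate *= rate

# Three 1-in-100 layers compound to roughly one in a million,
# which is why full chains are so rare and so valuable.
```

This is also why a big jump in the success rate of even one layer (the JS shell in Anthropic's benchmark) moves the whole product so dramatically.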
I should tone down the disaster porn slightly - there have been many sandbox escapes before that haven't caused chaos, but I have a strong feeling that this time is going to be difficult. To be clear, when just AWS us-east-1 goes down (which it has done many , many , times ) it is front page news globally and tends to cause significant disruption to day-to-day life. And that is just one of AWS's data centre regions - if a malicious actor were able to take control of the AWS control plane, they could likely take all regions simultaneously, and recovery would be far harder with a bad actor in charge than it was for the internal failures behind previous outages - which were themselves extremely difficult to restore from in a timely way.

Given all this it's understandable that Anthropic are being cautious about releasing this into the wild. The issue, though, is that the cat is out of the bag. Even if Anthropic pulled a Miles Dyson and lowered their model code into a pit of molten lava, someone else is going to scale an RL model and release it. The incentives are far, far too high and the prisoner's dilemma strikes again.

The current status quo seems to be that these next-generation models will be released to a select group of cybersecurity professionals and related organisations, so they can fix things as much as possible to give them a head start. Perhaps this is the best that can be done, but it seems to me to be a repeat of the famous "security through obscurity" approach, which has become a meme in itself in the information security world. It also seems far-fetched to me that the organisations who do have access are going to find even most of the critical problems in a limited time window.

And that brings me to my final point. While Anthropic are providing $100m of credit and $4m of 'direct cash donations' to open source projects, it's not all open source projects.
There are a lot of open source projects that everyone relies on without realising. While the obvious ones like the Linux kernel are getting this "access" ahead of time, there are literally millions of pieces of open source software (never mind commercial software) that are essential to the operation of a substantial number of systems. I'm not quite sure where the plan leaves those.

Perhaps this is just another round in the cat-and-mouse cycle that reaches a mostly stable equilibrium, and at worst we have some short-term disruption. But if I step back and look at how fast the industry has moved over the past few years - I'm not so sure. And one thing I think is for certain: it looks like we do now have the fabled superhuman ability in at least one domain. I don't think it's the last.

Albeit at the cost of adding yet more pressure onto the compute crunch the AI industry is experiencing ↩︎

Simon Willison 1 week ago

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Meta announced Muse Spark today, their first model release since Llama 4 almost exactly a year ago . It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on meta.ai (Facebook or Instagram login required). Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows". The model is exposed as two different modes on meta.ai - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro. I prefer to run my pelican test via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly. Here's the pelican I got for "Instant": And this one for "Thinking": Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model output an SVG directly (with code comments) whereas the Thinking model wrapped it in a thin HTML shell with some unused JavaScript libraries. Which got me curious... Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style. But what else can it do? I asked it: what tools do you have access to? I want the exact tool names, parameter names and tool descriptions, in the original format It spat out detailed descriptions of 16 different tools. You can see the full list I got back here - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks. Here are highlights derived from that response: Browse and search . 
can run a web search through an undisclosed search engine, can load the full page from one of those search results and can run pattern matches against the returned page content. Meta content search . can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including , , , and . "Catalog search" - can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector. Image generation . generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images. container.python_execution - yes! It's Code Interpreter , my favourite feature of both ChatGPT and Claude. Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at . Python 3.9 is EOL these days but the library collection looks useful. I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from January 2021 ). container.create_web_artifact - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to for websites/apps or for vector graphics." container.download_meta_1p_media is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox. container.file_search - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar? 
Tools for editing files in the container - , (with and ), . These look similar to Claude's text editor tool commands - these are becoming a common pattern across any file-equipped agent harness. container.visual_grounding - see below, this one is fun . subagents.spawn_agent - the sub-agent as a tool pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response." third_party.link_third_party_account - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook. Let's talk about that one. Here's the description in full: Description: Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects. Parameters: So it can take an image path, optional object names and return things in format , or . Sounds to me like Meta's Segment Anything . I decided to take it for a spin. generate a photo of a raccoon sitting on a trash can wearing trash as a hat Here you go — king of the alley, in full regalia: Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too? My guess is that image generation in Meta AI is powered by their Emu model , or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's ). The tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried: use python OpenCV to analyze that image and find out neat things about it Annoyingly it did not show me the code it ran, but it did create this image: Pretty neat! We can both generate images and then use them with Python tools. Next, the tool. 
Knowing the names of the tools makes it easy to refer to them in prompts: Now try the container.visual_grounding tool The response ( full response here ): Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision: Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect? Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly mode, but how about the others? I prompted: yes draw bounding boxes, can it do masks too? Yes — bounding boxes work great. Masks? Not directly in ; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks. Here's the bbox output: What you’re seeing: Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers. About masks: since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options: Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG. Here's the HTML for that bbox visualization . Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can see that all here . The count mode is particularly interesting - here's that in full: So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools. Which means... it can count pelicans too! Here's that overlay exported as HTML . Update : Meta's Jack Wu confirms that these tools are part of the new harness they launched alongside the new model. On Twitter Alexandr Wang said : this is step one. bigger models are already in development with infrastructure scaling to match. 
private api preview open to select partners today, with plans to open-source future versions.

I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency: [...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.

So are Meta back in the frontier model game? Artificial Analysis think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively. I'm waiting for API access - while the tool collection on meta.ai is quite strong, the real test of a model like this is still what we can build on top of it.

You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .
object_names (array, required)
image_path (string, default "")
format_type ("bbox"|"point"|"count"|null, default "bbox")
title (string|null)

raccoon : big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal
coffee cup : (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space
banana peel : (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side
newspaper : (305,223)-(548,315) — left ear coverage
trash can lid : (107,800)-(1000,998) — spans almost the full image width at the bottom
face : (305,283)-(750,537) — tight crop around the mask
eyes : both ~40×37 pixel boxes, perfectly level at y=379

GrabCut — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask
SAM-style color threshold — use the bbox plus k-means to isolate the coffee cup or banana peel
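The python_execution probe above ("confirm sqlite version and python version") boils down to a couple of standard-library lookups. A rough reconstruction, not the exact code the model ran (and it will report your local versions, not the sandbox's):

```python
# Rough reconstruction of the sandbox probe: report the interpreter
# version and the bundled SQLite library version from the stdlib.
import sqlite3
import sys

python_version = ".".join(str(p) for p in sys.version_info[:3])
sqlite_version = sqlite3.sqlite_version

# Both are dotted numeric strings; Meta's sandbox reported
# "3.9.25" and "3.34.1" respectively.
print(python_version, sqlite_version)
```

The SQLite version dates the sandbox image fairly precisely, since SQLite releases are easy to look up by version number.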

Simon Willison 2 weeks ago

Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me

Anthropic didn't release their latest model, Claude Mythos ( system card PDF ), today. They have instead made it available to a very restricted set of preview partners under their newly announced Project Glasswing . The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare. Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser . Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems. There's a great deal more technical detail in Assessing Claude Mythos Preview’s cybersecurity capabilities on the Anthropic Red Team blog: In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex  JIT heap spray  that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets. Plus this comparison with Claude 4.6 Opus: Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. 
But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted. Just a few days ago ( last Friday ) I started a new ai-security-research tag on this blog to acknowledge an uptick in credible security professionals sounding the alarm on how good modern LLMs have got at vulnerability research.

Greg Kroah-Hartman of the Linux kernel: Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us. Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.

Daniel Stenberg of curl : The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good. I'm spending hours per day on this now. It's intense.

And Thomas Ptacek published Vulnerability Research Is Cooked , a post inspired by his podcast conversation with Anthropic's Nicholas Carlini. Anthropic have a 5 minute talking-heads video describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine): It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently.
But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...] I've found more bugs in the last couple of weeks than I found in the rest of my life combined . We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it . On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks.

I found this on the OpenBSD 7.8 errata page : 025: RELIABILITY FIX: March 25, 2026 All architectures TCP packets with invalid SACK options could crash the kernel. A source code patch exists which remedies this problem.

I tracked that change down in the GitHub mirror of the OpenBSD CVS repo (apparently they still use CVS!) and found it using git blame : Sure enough, the surrounding code is from 27 years ago. I'm not sure which Linux vulnerability Nicholas was describing, but it may have been this NFS one recently covered by Michael Lynch . There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.
I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.

The bad news for those of us who are not trusted partners is this: We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.

I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.

devansh 2 weeks ago

On LLMs and Vulnerability Research

I have been meaning to write this for six months. The landscape kept shifting. It has now shifted enough to say something definitive. I work at the intersection of LLMs and vulnerability triage, and I see, every day, how this landscape is changing. These views are personal and do not represent my employer. Take them with appropriate salt.

Two things happened in quick succession. Frontier models got dramatically better (Opus 4.6, GPT 5.4). Agentic toolkits (Claude Code, Codex, OpenCode) gave those models hands. The combination produces solid vulnerability research.

"LLMs are next-token predictors." This framing was always reductive. It is now actively misleading. The gap between what these models theoretically do (predict the next word) and what they actually do (reason about concurrent thread execution in kernel code to identify use-after-free conditions) has grown too wide for the old frame to hold. Three mechanisms explain why.

Implicit structural understanding. Tokenizers know nothing about code. Byte Pair Encoding treats keywords, operators, and braces as frequent byte sequences, not syntactic constructs. But the transformer layers above tell a different story. Through training on massive code corpora, attention heads specialise: some track variable identity and provenance, others develop a bias toward control-flow tokens. The model converges on internal representations that capture semantic properties of code, something functionally equivalent to an abstract syntax tree, built implicitly, never formally.

Neural taint analysis. The most security-relevant emergent capability. The model learns associations between sources of untrusted input (user-controlled data, network input, file reads) and dangerous sinks (system calls, SQL queries, memory operations). When it identifies a path from source to sink without adequate sanitisation, it flags a vulnerability. This is not formal taint analysis. There is no dataflow graph. It is a statistical approximation.
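To make the source-to-sink idea concrete, here is a deliberately naive, hand-written caricature of taint tracking over a list of pseudo-statements — nothing like the learned, statistical version the model performs, and all names invented:

```javascript
// Toy taint propagation: mark values from untrusted sources, propagate
// taint through assignments, clear it on sanitisation, and flag any
// tainted value that reaches a dangerous sink.
function findTaintFlows(statements) {
  const tainted = new Set();
  const findings = [];
  for (const stmt of statements) {
    if (stmt.op === "source") tainted.add(stmt.dest);            // e.g. dest = request param
    if (stmt.op === "assign" && tainted.has(stmt.src)) tainted.add(stmt.dest);
    if (stmt.op === "sanitize") tainted.delete(stmt.dest);       // e.g. dest = escape(dest)
    if (stmt.op === "sink" && tainted.has(stmt.src)) {
      findings.push(`tainted value '${stmt.src}' reaches sink '${stmt.name}'`);
    }
  }
  return findings;
}

const flows = findTaintFlows([
  { op: "source", dest: "userInput" },               // untrusted network input
  { op: "assign", src: "userInput", dest: "query" }, // flows into a query string
  { op: "sink", src: "query", name: "db.execute" },  // no sanitisation in between
]);
// flows → ["tainted value 'query' reaches sink 'db.execute'"]
```

The learned version has no explicit statement list or taint set; it approximates this behaviour statistically, which is why it can be both impressively general and occasionally wrong.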
But it works well for intra-procedural bugs where the source-to-sink path is short, and degrades as distance increases across functions, files, and abstraction layers.

Test-time reasoning. The most consequential advance. Standard inference is a single forward pass: reactive, fast, fundamentally limited. Reasoning models (o-series, extended thinking, DeepSeek R1) break this constraint by generating internal reasoning tokens, a scratchpad where the model works through a problem step by step before answering. The model traces execution paths, tracks variable values, evaluates branch conditions. Symbolic execution in natural language. Less precise than formal tools but capable of handling what they choke on: complex pointer arithmetic, dynamic dispatch, deeply nested callbacks. It self-verifies, generating a hypothesis ("the lock isn't held across this path"), then testing it ("wait, is there a lock acquisition I missed?"). It backtracks when reasoning hits dead ends. DeepSeek R1 showed these behaviours emerge from pure reinforcement learning with correctness-based rewards. Nobody taught the model to check its own work. It discovered that verification produces better answers. The model is not generating the most probable next token. It is spending variable compute to solve a specific problem.

Three advances compound on each other.

Mixture of Experts. Every frontier model now uses MoE. A model might contain 400 billion parameters but activate only 17 billion per token. Vastly more encoded knowledge about code patterns, API behaviours, and vulnerability classes without proportional inference cost.

Million-token context. In 2023, analysing a codebase required chunking code into a vector database, retrieving fragments via similarity search, and feeding them to the model. RAG is inherently lossy: code split at arbitrary boundaries, cross-file relationships destroyed, critical context discarded.
For vulnerability analysis, where understanding cross-module data flow is the entire point, this information loss is devastating. At one million tokens, you fit an entire mid-size codebase in a single prompt. The model traces user input from an HTTP handler through three middleware layers into a database query builder and spots a sanitisation gap on line 4,200 exploitable via the endpoint on line 890. No chunking. No retrieval. No information loss.

Reinforcement-learned reasoning. Earlier models trained purely on next-token prediction. Modern frontier models add an RL phase: generate reasoning chains, reward correctness of the final answer rather than plausibility of text. Over millions of iterations, this shapes reasoning to produce correct analyses rather than plausible-sounding ones. The strategies transfer across domains. A model that learned to verify mathematical reasoning applies the same verification to code.

A persistent belief: truly "novel" vulnerability classes exist, bugs so unprecedented that only human genius could discover them. Comforting. Also wrong. Decompose the bugs held up as examples.

HTTP request smuggling: the insight that a proxy and backend might disagree about where one request ends and another begins feels like a creative leap. But the actual bug is the intersection of known primitives: ambiguous protocol specification, inconsistent parsing between components, a security-critical assumption about message boundaries. None novel individually. The "novelty" was in combining them.

Prototype pollution RCEs in JavaScript frameworks. Exotic until you realise it is dynamic property assignment in a prototype-based language, unsanitised input reaching object modification, and a rendering pipeline evaluating modified objects in a privileged context. Injection, type confusion, privilege boundary crossing. Taxonomy staples for decades. The pattern holds universally.
"Novel" vulnerabilities decompose into compositions of known primitives: spec ambiguities, type confusions, missing boundary checks, TOCTOU gaps, trust boundary violations. The novelty is in the composition, not the components. This is precisely what frontier LLMs are increasingly good at. A model that understands protocol ambiguity, inconsistent component behaviour, and security boundary assumptions has all the ingredients to hypothesise a request-smuggling-class vulnerability when pointed at a reverse proxy codebase. It does not need to have seen that exact bug class. It needs to recognise that the conditions for parser disagreement exist and that parser disagreement at a trust boundary has security implications. Compositional reasoning over known primitives. Exactly what test-time reasoning enables.

LLMs will not discover the next Spectre tomorrow. Microarchitectural side channels in CPU pipelines are largely absent from code-level training data. But the space of "LLM-inaccessible" vulnerabilities is smaller than the security community assumes, and it shrinks with every model generation. Most of what we call novel vulnerability research is creative recombination within a known search space. That is what these models do best.

Effective AI vulnerability research = good scaffolding + adequate tokens. Scaffolding (harness design, prompt engineering, problem framing) is wildly underestimated. Claude Code and Codex are general-purpose coding environments, not optimised for vulnerability research. A purpose-built harness provides threat models, defines trust boundaries, highlights historical vulnerability patterns in the specific technology stack, and constrains the search to security-relevant code paths. The operator designing that context determines whether the model spends its reasoning budget wisely or wastes it on dead ends. Two researchers, same model, same codebase, dramatically different results. Token quality beats token quantity.
A thousand reasoning tokens on the right code path with the right threat model outperform a million tokens sprayed across a repo with "find vulnerabilities." The search space is effectively infinite. You cannot brute-force it. You narrow it with human intelligence encoded as context, directing machine intelligence toward where bugs actually live.

"LLMs are non-deterministic, so you can't trust their findings." Sounds devastating. Almost entirely irrelevant. It confuses the properties of the tool with the properties of the target. The bugs are deterministic. They are in the code. A buffer overflow on line 847 is still there whether the model notices it on attempt one or attempt five. Non-determinism in the search process does not make the search less valid. It makes it more thorough under repetition. Each run samples a different trajectory through the hypothesis space. The union of multiple runs covers more search space than any single run. Conceptually identical to fuzzing. Nobody says "fuzzers are non-deterministic so we can't trust them." You run the fuzzer longer, cover more input space, find more bugs. Same principle. Non-determinism under repetition becomes coverage.

In 2023 and 2024, the state of the art was architecture. Multi-agent systems, RAG pipelines, tool integration with SMT solvers and fuzzers and static analysis engines. The best orchestration won. That era is ending. A frontier model ingests a million tokens of code in a single prompt. Your RAG pipeline is not an advantage when the model without RAG sees the whole codebase while your pipeline shows fragments selected by retrieval that does not know what is security-relevant. A reasoning model spends thousands of tokens tracing execution paths and verifying hypotheses. Your external solver integration is not a differentiator when the model approximates what the solver does with contextual understanding the solver lacks. Agentic toolkits handle orchestration better than your custom tooling.
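Returning to the fuzzing analogy above, the "non-determinism under repetition becomes coverage" claim is easy to simulate: each run samples a random subset of a hypothesis space, and the union of several runs covers more than any single run. A toy sketch with a small deterministic RNG so the numbers are reproducible:

```javascript
// Small deterministic LCG so the simulation is reproducible.
function makeRng(seed) {
  let state = seed;
  return () => (state = (state * 1103515245 + 12345) % 2147483648) / 2147483648;
}

// Each "run" samples `samplesPerRun` hypotheses at random from a space of
// `spaceSize`; we compare the best single run against the union of all runs.
function simulateRuns(spaceSize, samplesPerRun, runs, seed = 42) {
  const rng = makeRng(seed);
  const union = new Set();
  let bestSingle = 0;
  for (let r = 0; r < runs; r++) {
    const covered = new Set();
    for (let s = 0; s < samplesPerRun; s++) {
      covered.add(Math.floor(rng() * spaceSize));
    }
    bestSingle = Math.max(bestSingle, covered.size);
    for (const h of covered) union.add(h);
  }
  return { bestSingle, unionSize: union.size };
}

const { bestSingle, unionSize } = simulateRuns(1000, 100, 10);
// With independent runs, unionSize far exceeds what any single run covers.
```

Nothing about this depends on the sampler being an LLM; it is the same arithmetic that justifies running a fuzzer longer.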
The implication the security industry has not fully processed: vulnerability research is being democratised. When finding a memory safety bug in a C library required a Project Zero-calibre researcher with years of experience, the supply was measured in hundreds worldwide. When it requires a well-prompted API call, the supply is effectively unlimited.

What replaces architecture as the competitive advantage? Two things.

Domain expertise encoded as context. Not "find bugs in this code" but "this is a TLS implementation; here are three classes of timing side-channel that have affected similar implementations; analyse whether the constant-time guarantees hold across these specific code paths." The human provides the insight. The model does the grunt work.

Access to compute. Test-time reasoning scales with inference compute. More tokens means deeper analysis, more self-verification, more backtracking. Teams that let a model spend ten minutes on a complex code path will find bugs that teams limited to five-second responses will miss.

The end state: vulnerability discovery for known bug classes becomes a commodity, available to anyone with API access and a credit card. The researchers who thrive will focus where the model cannot: novel vulnerability classes, application-level logic flaws, architectural security review, adversarial creativity. This is not a prediction. It is already happening. The pace is set by model capability, which doubles on a timeline measured in months.

Susam Pal 2 weeks ago

Wander Console 0.4.0

Wander Console 0.4.0 is the fourth release of Wander, a small, decentralised, self-hosted web console that lets visitors to your website explore interesting websites and pages recommended by a community of independent website owners. To try it, go to susam.net/wander/. This release brings a few small additions as well as a few minor fixes. You can find the previous release pages here: /code/news/wander/. The sections below discuss the current release.

Wander Console now supports wildcard patterns in ignore lists. An asterisk (*) anywhere in an ignore pattern matches zero or more characters in URLs, so a single wildcard pattern can be used to ignore a whole family of URLs. These ignore patterns are specified in a console's wander.js file. They are very important for providing a good wandering experience to visitors. The owner of a console decides what links they want to ignore in their ignore patterns. The ignore list typically contains commercial websites that do not fit the spirit of the small web, as well as defunct or incompatible websites that do not load in the console. A console with a well-maintained ignore list ensures that a visitor to that console has a lower likelihood of encountering commercial or broken websites. For a complete description of the ignore patterns, see Customise Ignore List.

By popular demand, Wander now adds a 'via' query parameter while loading a recommended web page in the console. The value of this parameter identifies the console that loaded the recommended page. For example, if you encounter midnight.pub/ while using the console at susam.net/wander/, the console loads the page with the 'via' parameter set to the console's URL. This allows the owner of the recommended website to see, via their access logs, that the visit originated from a Wander Console. While this is the default behaviour now, it can be customised in two ways.
The value can be changed from the full URL of the Wander Console to a small identifier that identifies the version of Wander Console used. The query parameter can be disabled as well. For more details, see Customise 'via' Parameter.

In earlier versions of the console, when a visitor came to your console to explore the Wander network, it picked the first recommendation from the list of recommended pages in it (i.e. your wander.js file). But subsequent recommendations came from your neighbours' consoles and then their neighbours' consoles and so on, recursively. Your console (the starting console) was not considered again unless some other console in the network linked back to your console. A common way to ensure that your console was also considered in subsequent recommendations was to add a link to your console in your own console (i.e. in your wander.js). Yes, this created self-loops in the network, but this wasn't considered a problem. In fact, it was considered desirable, so that when the console picked a console from the pool of discovered consoles to find the next recommendation, it considered itself to be part of the pool. This workaround is no longer necessary. Since version 0.4.0 of Wander, each console will always consider itself to be part of the pool from which it picks consoles. This means that the web pages recommended by the starting console have a fair chance of being picked for the next web page recommendation.

The Wander Console loads the recommended web pages in an iframe element with sandbox restrictions enabled. The sandbox properties restrict the side effects the loaded web page can have on the parent Wander Console window. For example, with the sandbox restrictions enabled, a loaded web page cannot redirect the parent window to another website. In fact, these days most modern browsers block this and show a warning anyway, but we also block it at the sandbox level in the console implementation.
It turned out that our aggressive sandbox restrictions also blocked legitimate websites from opening a link in a new tab. We decided that opening a link in a new tab is harmless behaviour, so we have relaxed the sandbox restrictions a little to allow it. Of course, when you click such a link within Wander Console, the link opens in a new tab of your web browser (not within Wander Console, as the console does not have any notion of tabs).

Although I developed this project on a whim, one early morning while taking a short break from my ongoing studies of algebraic graph theory, the subsequent warm reception on Hacker News and Lobsters has led to a growing community of Wander Console owners. There are two places where the community hangs out at the moment. New consoles are announced in this thread on Codeberg: Share Your Wander Console. We also have an Internet Relay Chat (IRC) channel named #wander on the Libera IRC network. This is a channel for people who enjoy building personal websites and want to talk to each other. You are welcome to join this channel, share your console URL, link to your website or recent articles, as well as share links to other non-commercial personal websites.

If you own a personal website but you have not set up a Wander Console yet, I suggest that you consider setting one up for yourself. You can see what it looks like by visiting mine at /wander/. To set up your own, follow the installation instructions; it just involves copying two files to your web server. It is about as simple as it gets.
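As an aside for implementers: wildcard ignore matching of the kind described above is straightforward to sketch. A minimal, hypothetical version (not Wander's actual code) compiles an ignore pattern, where an asterisk matches zero or more characters, into a RegExp:

```javascript
// Minimal sketch of wildcard ignore matching (not Wander's actual code):
// '*' matches zero or more characters; everything else matches literally.
function compileIgnorePattern(pattern) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

function isIgnored(url, patterns) {
  return patterns.some((p) => compileIgnorePattern(p).test(url));
}

// Hypothetical patterns and URLs, for illustration only:
isIgnored("https://shop.example.com/cart", ["https://shop.example.com/*"]); // → true
isIgnored("https://blog.example.org/post", ["https://shop.example.com/*"]); // → false
```

Splitting on the asterisk before escaping keeps the literal parts safe while letting each `*` become a `.*` in the compiled expression.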

The Jolly Teapot 2 weeks ago

Browsing the web with JavaScript turned off

Some time ago, I tried to use my web browser with JavaScript turned off by default. The experiment didn’t last long, and my attempt at a privacy-protecting, pain-free web experience failed. Too many websites rely on JavaScript, which made this type of web browsing rather uncomfortable. I’ve kept a Safari extension like StopTheScript around, on top of a content blocker like Wipr, just in case I needed to really “trim the fat” of the occasional problematic webpage. * 1

Recently, I’ve given this setup a new chance to shine, and even described it in a post. The results are in: the experiment failed yet again. But I’m not done. Even if this exact setup isn’t the one I currently rely on, JavaScript-blocking is nevertheless still at the heart of my web browsing hygiene on the Mac today.

For context, this need for fine-tuning comes from the fact that my dear old MacBook Air from early 2020, rocking an Intel chip, is starting to show its age. Sure, it already felt like a 10-year-old computer the moment the M1 MacBook Air was released, merely six months after I bought it, but let’s just say that a lot of webpages make this laptop choke. My goal of making this computer last one more year can only be reached if I manage not to throw the laptop through the window every time I want to open more than three tabs.

On my Mac, JavaScript is now blocked by default on all pages via StopTheScript. Leaving JavaScript on, meaning giving websites a chance, would sort of defeat the purpose of my setup (performance and privacy). Having JS turned off effectively blocks 99% of ads and trackers (I think, don’t quote me on that) and makes browsing the web a very enjoyable experience. The fan barely activates, and everything is as snappy and junk-free as expected. For websites that require JavaScript — meaning frequently visited sites like YouTube, or sites where I need to be logged in, like LanguageTool — I turn off StopTheScript permanently via the Websites > Extensions menu in Safari’s Settings.
I try to keep this list to a bare minimum, even if this means I have to accept a few annoyances like not having access to embedded video players or comments on some websites. For instance, I visit the Guardian multiple times daily, yet I won’t add it to the exception list, even if I’m a subscriber and therefore not exposed to the numerous “please subscribe” modals. I can no longer hide some categories on the home page, nor watch embedded videos: a small price to pay for a quick and responsive experience, and a minimal list of exceptions.

For the few times when I actually need to watch a video on the Guardian, comment on a blog post, or for the occasional site that needs JavaScript simply to appear on my screen (more on that later), what I do is quickly open the URL in a new private window. There, StopTheScript is disabled by default (so that JavaScript is enabled: sorry, I know this is confusing). Having to reopen a page in a different browser window is an annoying process, yes. Even after a few weeks it still feels like a chore, but it seems to be the quickest way on the Mac to get a site to work without having to mess around with permissions and exceptions, which can be even more annoying on Safari. Again, a small price to pay to make this setup work. * 2

Another perk of that private browsing method is that the ephemeral session doesn’t save cookies, and the main tracking IDs disappear when I close the window. I think. The problem I had at first was that these sessions tended to display the webpages as intended by the website owners: loaded with JavaScript, ads, modals, banners, trackers, &c. Most of the time, it is a terrible mess. Really, no one should ever experience the general web without any sort of blocker. To solve this weakness of my setup, I switched from Quad9 to Mullvad DNS to block a good chunk of ads and trackers (using the “All” profile).
Now, the private window lets through only the functional part of the JavaScript, plus a few cookie banners and Google login prompt annoyances, but at least I am not welcomed by privacy-invading and CPU-consuming ads and trackers every time my JS-free attempt fails. I know I could use a regular content blocker instead of a DNS resolver, but keeping it active all the time when JS is turned off feels a bit redundant and too much of an extension overlap. More importantly, I don’t want to be tempted to manage yet another exception list on top of the StopTheScript one (been there, done that, didn’t work). Also, with Safari I don’t think it’s possible to activate an extension in Private Mode only.

John Gruber, in a follow-up reaction to The 49MB Web Page article from Shubham Bose, which highlights the disproportionate weight of webpages relative to their content, wrote:

One of the most controversial opinions I’ve long espoused, and believe today more than ever, is that it was a terrible mistake for web browsers to support JavaScript. Not that they should have picked a different language, but that they supported scripting at all. That decision turned web pages — which were originally intended as documents — into embedded computer programs. There would be no 49 MB web pages without scripting. There would be no surveillance tracking industrial complex. The text on a page is visible. The images and video embedded on a page are visible. You see them. JavaScript is invisible. That makes it seem OK to do things that are not OK at all.

Amen to that. But if JavaScript is indeed mostly used for this “invisible” stuff, why are some websites built to use it for the most basic stuff? Video streaming services, online stores, social media platforms, I get it: JavaScript makes sense. But text-based sites? Blogs? Why? The other day I wanted to read this article, and only the website header showed up in my browser. Even Reader Mode didn’t make the article appear.
When I opened the link in a private window, where StopTheScript is disabled, lo and behold, the article finally appeared. For some obscure reason, on that website (and others) JavaScript is needed to load text on a freaking web page. Even if you want your website to have a special behaviour regarding loading speeds, design subtleties, or whatever you use JavaScript for, please use a <noscript> tag, either to display the article in its most basic form, or at least to show a message saying “JavaScript needed for no apparent reason at all. Sorry.” * 3

1. This is what I do on my phone, as managing Safari extensions on iOS is a painful process. Quiche Browser is a neat solution and a great way for me to have the “turn off JavaScript” menu handy, but without a way to sync bookmarks, history, or open tabs with the Mac, I still prefer to stick to Safari, at least for now. ^

2. I still wish StopTheScript had a one-touch feature to quickly reload a page with JavaScript turned on until the next refresh or for an hour or so, but it doesn’t. ^

3. This is what I do for this site’s search engine, where PageFind requires JavaScript to operate. Speaking of search engines, DuckDuckGo works fine in HTML-only mode (the only major search engine to offer this, I believe). ^

Den Odell 3 weeks ago

You're Looking at the Wrong Pretext Demo

Pretext, a new JavaScript library from Cheng Lou, crossed 7,000 GitHub stars in its first three days. If you've been anywhere near frontend engineering circles in that time, you've seen the demos: a dragon that parts text like water, fluid smoke rendered as typographic ASCII, a wireframe torus drawn through a character grid, multi-column editorial layouts with animated orbs displacing text at 60fps. These are visually stunning, and they're why the library went viral. But they aren't the reason this library matters.

The important thing Pretext does is predict the height of a block of text without ever reading from the DOM. This means you can position text nodes without triggering a single layout recalculation. The text stays in the DOM, so screen readers can read it and users can select it, copy it, and translate it. The accessibility tree remains intact, the performance gain is real, and the user experience is preserved for everyone. This is the feature that will change how production web applications handle text, and it's the feature almost nobody is demonstrating. The community has spent three days building dragons. It should be building chat interfaces. And the fact that the dragons went viral while the measurement engine went unnoticed tells us something important about how the frontend community evaluates tools: we optimize for what we can see, not for what matters most to the people using what we build.

The problem is forced layout recalculation, where the browser has to pause and re-measure the page layout before it can continue. When a UI component needs to know the height of a block of text, the standard approach is to measure it from the DOM. You read a property like offsetHeight or call getBoundingClientRect(), and the browser synchronously calculates layout to give you an answer. Do this for 500 text blocks in a virtual list and you've forced 500 of these pauses. This pattern, called layout thrashing, remains a leading cause of visual stuttering in complex web applications.
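One way to avoid those forced reads is to measure glyph widths once, cache them, and do line layout as pure arithmetic. Here is a simplified sketch of greedy line-breaking over cached widths — a hypothetical width table stands in for the canvas measurement a real implementation would use:

```javascript
// Simplified line-wrapping as pure arithmetic over cached word widths.
// A real implementation would measure widths once via canvas measureText();
// here a made-up "10px per character" table stands in for that step.
function countLines(words, widthOf, maxWidth, spaceWidth) {
  let lines = 1;
  let lineWidth = 0;
  for (const word of words) {
    const w = widthOf(word);
    const needed = lineWidth === 0 ? w : lineWidth + spaceWidth + w;
    if (needed > maxWidth && lineWidth > 0) {
      lines += 1;        // break before this word
      lineWidth = w;
    } else {
      lineWidth = needed;
    }
  }
  return lines;
}

// Pretend every character is 10px wide (a stand-in for cached canvas metrics).
const widthOf = (word) => word.length * 10;
countLines("the quick brown fox jumps".split(" "), widthOf, 100, 10); // → 3
```

Once the widths are cached, nothing here ever touches the DOM, which is exactly why this kind of layout is cheap to repeat on every resize.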
Pretext's insight is that canvas text measurement uses the same font engine as DOM rendering but operates outside the browser's layout process entirely. Measure a word via canvas, cache the width, and from that point forward layout becomes pure arithmetic: walk cached widths, track running line width, and insert breaks when you exceed the container's maximum. No slow measurement reads, and no synchronous pauses.

The architecture separates this into two phases. The first does the expensive work once: normalize whitespace, segment the text with Intl.Segmenter for locale-aware word boundaries, handle bidirectional text (such as mixed English and Arabic), measure segments with canvas, and return a reusable reference. The second is pure calculation over cached widths, taking about 0.09ms for a 500-text batch against roughly 19ms for the DOM-based equivalent. Cheng Lou himself calls the 500x comparison "unfair" since it excludes the one-time cost, but that cost is only paid once and spread across every subsequent call. It runs once when the text appears, and every subsequent resize takes the fast path, where the performance boost is real and substantial.

The core idea traces back to Sebastian Markbage's research at Meta, where Cheng Lou implemented the earlier prototype that proved canvas font metrics could substitute for DOM measurement. Pretext builds on that foundation with production-grade internationalization, bidirectional text support, and the two-phase architecture that makes the fast path so fast. Lou has a track record here: react-motion and ReasonML both followed the same pattern of identifying a constraint everyone accepted as given and removing it with a better abstraction.

The first use case Pretext serves, and the one I want to make the case for, is measuring text height so you can render DOM text nodes in exactly the right position without ever asking the browser how tall they are. This isn't a compromise path; it's the most capable thing the library does. Consider a virtual scrolling list of 500 chat messages.
To render only the visible ones, you need to know each message's height before it enters the viewport. The traditional approach is to insert the text into the DOM, measure it, and then position it, paying the layout cost for every message. Pretext lets you predict the height mathematically and then render the text node at the right position. The text itself still lives in the DOM, so the accessibility model, selection behavior, and find-in-page all work exactly as they would with any other text node. In practice it takes two function calls: the first measures and caches, the second predicts height through calculation. No layout cost, yet the text you render afterward is a standard DOM node with full accessibility.

The shrinkwrap demo is the clearest example of why this path matters. CSS sizes a container to the widest wrapped line, which wastes space when the last line is short. There's no CSS property that says "find the narrowest width that still wraps to exactly N lines." Pretext calculates the optimal width mathematically, and the result is a tighter chat bubble rendered as a standard DOM text node. The performance gain comes from smarter measurement, not from abandoning the DOM. Nothing about the text changes for the end user.

Accordion sections whose heights are calculated by Pretext, and masonry layouts with height prediction instead of DOM reads: both follow the same model of fast measurement feeding into standard DOM rendering.

There are edge cases worth knowing about, starting with the fact that the prediction is only as accurate as the font metrics available at measurement time, so fonts need to be loaded before measurement runs or results will drift. Ligatures (where two characters merge into one glyph, like "fi"), advanced font features, and certain CJK composition rules can introduce tiny differences between canvas measurement and DOM rendering.
These are solvable problems, and the library handles many of them already, but acknowledging them is part of taking the approach seriously rather than treating it as magic.

Pretext also supports manual line layout for rendering to Canvas, SVG, or WebGL. These APIs give you exact line coordinates so you can paint text yourself rather than letting the DOM handle it. This is the path that went viral, and the one that dominates every community showcase. The canvas demos are impressive, and they're doing things the DOM genuinely can't do at 60fps. But they're also painting pixels, and when you paint text as canvas pixels, the browser has no idea those pixels represent language. Screen readers like VoiceOver, NVDA, and JAWS derive their understanding of a page from the accessibility tree, which is itself built from the DOM, so canvas content is invisible to them. Browser find-in-page and translation tools both skip canvas pixels entirely. Native text selection is tied to DOM text nodes and canvas has no equivalent, so users can't select, copy, or navigate the content by keyboard. A canvas element is also a single tab stop, meaning keyboard users can't move between individual words or paragraphs within it, even if it contains thousands of words. In short, everything that makes text behave as text rather than an image of text disappears.

None of this means the canvas path is automatically wrong. There are legitimate contexts where canvas text rendering is the right choice: games, data visualizations, creative installations, and design tools that have invested years in building their own accessibility layer on top of canvas. For SVG rendering, the trade-offs are different again, since SVG text elements do participate in the accessibility tree, making it a middle ground between DOM and canvas. But the canvas path is not the breakthrough, because canvas text rendering has existed for fifteen or more years across dozens of libraries.
What none of them offered was a way to predict DOM text layout without paying the layout cost. Pretext's measurement APIs do exactly that, and that part is genuinely new.

This pattern repeats often across the frontend ecosystem, and I understand why. A dragon parting text like water is something you can record as a GIF, post to your socials, and collect thousands of impressions. A virtual scrolling list that pre-calculates text heights looks identical to one that doesn't. The performance difference is substantial but invisible to the eye. Nobody makes a showcase called "works flawlessly with VoiceOver" or "scrolls 10,000 messages without a single forced layout" because these things look like nothing. They look like a web page working the way web pages are supposed to work.

This is Goodhart's Law applied to web performance: once a metric becomes a target, it ceases to be a good measure. Frame rate and layout cost are proxies for "does this work well for users." GitHub stars are a proxy for "is this useful." When the proxy gets optimized instead, in this case by visually impressive demos that happen to use the path with the steepest accessibility trade-offs, the actual signal about what makes the library important gets lost. The library's identity gets set by its most visually impressive feature in the first 72 hours, and the framing becomes "I am drawing things" rather than "I am measuring things faster than anyone has before." Once that framing is set, it's hard to shift.

The best text-editing libraries on the web, CodeMirror, Monaco, and ProseMirror, all made the deliberate choice to stay in the DOM even when leaving it would have been faster, because the accessibility model isn't optional. Pretext's DOM measurement path belongs in that tradition but goes further: those editors still read from the DOM when they need to know how tall something is. Pretext eliminates that step entirely, predicting height through arithmetic before the node is ever rendered.
It's the next logical step in the same philosophy: keep text where it belongs, but stop paying the measurement cost to do so.

I've been thinking about performance engineering as a discipline for most of my career, and what strikes me about Pretext is that the real innovation is the one that is hardest to see. Predicting how text will lay out before it reaches the page, while keeping the text in the DOM and preserving everything that makes it accessible, is a genuinely new capability on the web platform. It's the kind of foundational improvement that every complex text-heavy application can adopt immediately.

If you're reaching for Pretext this week, reach for the DOM measurement path first. Build something that keeps text in the DOM and predicts its height without asking the browser. Ship an interface that every user can read, select, search, and navigate. Nobody else has done this yet, and it deserves building.

Performance engineering is at its best when it serves everyone without asking anyone to give something up. Faster frame rates that don't make someone nauseous. Fewer layout pauses that mean a page responds when someone with motor difficulties needs it to. Text that is fast and readable and selectable and translatable and navigable by keyboard and comprehensible to a screen reader. The dragons are fun. The measurement engine is important. Let's try not to confuse the two.

0 views
blog.philz.dev 3 weeks ago

computing 2+2: so many sandboxes

Sandboxes are so in right now. If you're doing agentic stuff, you've no doubt thought about what Simon Willison calls the lethal trifecta : private data, untrusted content, and external communication. If you work in a VM, for example, you can avoid putting a secret on that VM, and then that secret--that's not there!--can't be exfiltrated. If you want to deal with untrusted data, you can also cut off external communication. You can still use an agent, but you need to either limit its network access or limit its tools. So, today's task is to compute 2+2 five ways.

Cloud Hypervisor is a Virtual Machine Monitor which runs on top of the Linux kernel's KVM (Kernel-based Virtual Machine), which in turn runs on CPUs that support virtualization. A cloud-hypervisor VM sorta looks like a process on the host (and can be managed with cgroups, for example), but it's running a full Linux kernel. With the appropriate kernel options, you can run Docker containers, do tricky networking things, nested virtualization, and so on. Lineage-wise, it's in the same family as Firecracker and crosvm . It avoids implementing floppy devices and tries to be pretty small. Traditionally, people tell you to unpack a file system and maybe make a vinyl out of it using an iso image or some such. A trick is to instead start with a container image for your userspace, and then you get all the niceties (and all the warts) of Docker. Takes about 2 seconds.

gVisor implements a large chunk of the Linux syscall interface in a Go process. Think of it as a userland kernel. It came out of Google's AppEngine work. It can use systrap/seccomp, ptrace, and KVM tricks to do the interception. The downside of gVisor is that you can't do some things inside of it. For example, you can't run vanilla Docker inside of gVisor because it doesn't support Docker's networking tricks. Again, let's use Docker to get ourselves a userland. No need for a kernel image. gVisor's runtime, runsc, stands for "run secure container."
Monty is a Python interpreter written in Rust. It doesn't expose the host, but can call functions that are explicitly exposed. This one's super fast. Pyodide is CPython compiled to WebAssembly. Deno is a JS runtime with permission-based security. Deno happens to run wasm code fine, so we're using it as a wasm runtime. There are other choices. Chromium is probably the world's most popular sandbox. This is pretty much the same as Deno: it's the V8 interpreter under the hood. Lots of ways to drive Chromium: Puppeteer, headless , etc. Let's try rodney : run pyodide inside Deno inside gVisor inside cloud-hypervisor. Setting up the networking and the file system/disk sharing for these things is usually not trivial, especially if you don't want to accidentally expose the VMs to each other, and so forth.

I want to compare two possible agents: a coding agent and a logs agent. A coding agent needs a full Linux, because, at the end of the day, it needs to edit files and run tests and operate git. Your sandboxing options are going to end up being a VM or a container of some sort. A logs agent needs access to your logs (say, the ability to run readonly queries on ClickHouse) and it needs to be able to send you its output. In the minimal case, it doesn't need any sandboxing at all, since it doesn't have access to anything. If you want it to be able to produce a graph, however, it will need to write out a file. At a minimum, it will need to take the results of its queries and pair them with an HTML file that has some JS that renders them with Vega-Lite. You might also want to mix and match the results of multiple queries, and do some data munging outside of SQL. This is where a setup like Monty or Pyodide comes in handy. Giving the agent access to some Python expands considerably how much the agent can do, and you can do it cheaply and safely with these sandboxes.
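The capability idea behind Monty (untrusted code can only call functions the host explicitly hands it) has a recognizable shape even in plain JavaScript. This is an illustrative sketch only, not Monty's API, and `new Function` is not a real security boundary; genuine isolation needs a separate runtime like Monty, Pyodide, or Deno.

```javascript
// Illustrative only: the *shape* of a capability-based sandbox.
// The untrusted snippet receives exactly the functions in `exposed`
// and nothing else; a few dangerous globals are shadowed by name.
// This is NOT secure in real JS -- it just demonstrates the model.
function runWithCapabilities(code, exposed) {
  const names = Object.keys(exposed);
  const blocked = ["globalThis", "process", "require", "fetch"];
  const fn = new Function(...names, ...blocked, `"use strict"; return (${code});`);
  // Only the exposed capabilities are passed in; blocked names stay undefined.
  return fn(...names.map((n) => exposed[n]));
}

// The snippet can call query() because the host handed it in.
// `query` here is a stand-in for a read-only ClickHouse call.
const result = runWithCapabilities("query('SELECT 1') + 1", {
  query: (sql) => 41,
});
```

The logs-agent scenario above maps onto this directly: hand the interpreter a read-only `query` capability and a `writeFile` capability scoped to one output directory, and the agent can munge data and emit a Vega-Lite HTML file without being able to touch anything else.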
In this vein, if you use DSPy for RLM, its implementation uses the Deno/Pyodide approach to give the LLM "infinite" context. Browser-based agents are a thing too. Itsy-Bitsy is a bookmarklet-based agent. It runs in the context of the web page it's operating on. Let me know what other systems I missed!

0 views
Susam Pal 4 weeks ago

Wander 0.2.0

Wander 0.2.0 is the second release of Wander, a small, decentralised, self-hosted web console that lets visitors to your website explore interesting websites and pages recommended by a community of independent personal website owners. To try it, go to susam.net/wander . This release brings a number of improvements.

When I released version 0.1.0, it was the initial version of the software I was using for my own website. Naturally, I was the only user initially, and I only added trusted web pages to the recommendation list of my console. But ever since I announced this project on Hacker News , it has received a good amount of attention. It has been less than a week since I announced it there, but over 30 people have set up a Wander console on their personal websites. There are now over a hundred web pages being recommended by this network of consoles. With the growth in the number of people who have set up a Wander console came several feature requests, most of which have been implemented already. This release makes these new features available.

As of Wander 0.2.0, the file of remote consoles is executed in a sandbox to ensure that it has no side effects on the parent Wander console page. Similarly, the pages recommended by the network are also loaded into a sandbox . This release also brings several customisation features. Console owners can customise their Wander console by adding custom CSS or JavaScript. Console owners can also block certain URLs from ever being recommended on their console. This is especially important in providing a good wandering experience to visitors. Since this network is completely decentralised, console owners can add any web page they like to their console. Sometimes they inadvertently add pages that do not load successfully in the console due to frame embedding restrictions. This leads to an uneven wandering experience because these page recommendations occasionally make it to other consoles where they fail to load.
Console owners can now block such URLs in their console to decrease the likelihood of these failed page loads. This helps make the wandering experience smoother. Another significant feature in this release is the expanded Console dialog box. This dialog box now shows various details about the console and the current wandering session. For example, it shows the console's configuration: recommended pages, ignored URLs and linked consoles. It also shows a wandering history screen where you can see each link that was recommended to you along with the console that recommendation came from. There is another screen that shows all the consoles discovered during the discovery process. Those who care about how Wander works would find this dialog box quite useful. To check it out, go to my Wander console and explore. To learn more about Wander, how it works and how to set it up, please read the project README at codeberg.org/susam/wander . Read on website | #web | #technology
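The browser mechanism behind this kind of isolation is the iframe `sandbox` attribute. Here's an illustrative sketch (not Wander's actual code) of embedding a recommended page with an explicit allow-list; `sandboxedFrameHtml` is a hypothetical helper name.

```javascript
// Sketch: embed an untrusted recommended page in a sandboxed <iframe>.
// An empty sandbox="" is maximally restrictive (no scripts, no forms,
// no same-origin access); tokens like "allow-scripts" opt features back in.
// The allow-list here is illustrative, not Wander's real configuration.
function sandboxedFrameHtml(url, allow = []) {
  // Minimal attribute escaping so the URL can't break out of the markup.
  const esc = (s) => s.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return `<iframe src="${esc(url)}" sandbox="${allow.join(" ")}"></iframe>`;
}

const html = sandboxedFrameHtml("https://example.com/post", ["allow-scripts"]);
```

Note that a sandboxed iframe still can't force a page to render: sites that send `X-Frame-Options` or a restrictive `frame-ancestors` directive will refuse to load, which is exactly the failure mode the URL-blocking feature addresses.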

0 views
Loren Stewart 1 month ago

ChatGPT, Claude, and Gemini Render Markdown in the Browser. I Do the Opposite

The big AI chat apps ship heavy rendering libraries to every device. Cheddy Chat renders markdown server-side and streams finished HTML, eliminating 160-440KB of client JavaScript while keeping the main thread free.

0 views
Simon Willison 1 month ago

Profiling Hacker News users based on their comments

Here's a mildly dystopian prompt I've been experimenting with recently: "Profile this user", accompanied by a copy of their last 1,000 comments on Hacker News. Obtaining those comments is easy. The Algolia Hacker News API supports listing comments sorted by date that have a specific tag, and the author of a comment is tagged there as author_<username>. Here's a JSON feed of my ( simonw ) most recent comments, for example: https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&hitsPerPage=1000

The Algolia API is served with open CORS headers, which means you can access the API from JavaScript running on any web page. Last August I had ChatGPT build me a simple tool for hitting that API for any user, which fetches their comments and gives me a mobile-friendly "copy to clipboard" button. I've since tweaked it a few times with Claude. I can then paste the whole lot into any LLM - these days I mostly use Claude Opus 4.6 for this - and prompt "profile this user" to see what it can figure out. It's startlingly effective!

It feels invasive to quote the profile of another user here, so I'll show you what it produces for me:

This is Simon Willison — a prolific, independent software developer, blogger, and one of the most visible voices in the AI-assisted coding space. Here's a profile drawn from his comments:

Professional identity: Independent developer and writer. Co-creator of Django, creator of Datasette and many other open source tools. On the board of the Python Software Foundation. Runs a heavily-trafficked personal blog at simonwillison.net with multiple content types (entries, link blog, notes, quotations). Monetizes through GitHub sponsors, ethical ads, and occasional consulting — not through any AI company payroll, a point he defends frequently.

Core thesis on AI coding: He's the leading evangelist for what he calls "agentic engineering" — using coding agents (Claude Code, Codex CLI) as productivity multipliers for skilled developers.
His central argument: LLMs don't replace programmers, they amplify existing expertise. He's writing a guide/book on this at simonwillison.net/guides. He insists these tools are deceptively difficult to use well, and that most people who report bad results simply haven't learned the craft yet. Working style: He does most of his programming from his iPhone via Claude Code for web, often while riding BART, walking the dog, or sitting in the garden. He embraces "YOLO mode" (auto-approving agent actions) and frequently runs 2-3 agent sessions in parallel. He starts every session with "run uv run pytest" to anchor the agent in TDD. He's on the $200/month Claude Max plan. Key technical interests: Sandboxing and security (WebAssembly, Pyodide, sandbox-exec, Firecracker), SQLite (deeply), Python packaging and tooling (uv, PyPI distribution tricks), browser-in-a-browser experiments (v86, WASM Linux), and local LLM inference. He's been learning Go "by osmosis" through coding agents. Security consciousness: He coined the term "prompt injection" and the "lethal trifecta" (access to private data + exposure to untrusted input + ability to take actions). He's genuinely alarmed about OpenClaw/Clawdbot security risks and has been predicting a headline-grabbing prompt injection attack "in the next six months" for over two years — acknowledging it keeps not happening while insisting the risk is real. He references the "normalization of deviance" pattern frequently. Personality and debate style: Energetic, combative when challenged, but generally good-natured. He engages heavily in HN comment threads — sometimes dozens of replies in a single discussion. He's transparent about his biases and maintains a public disclosures page. He pushes back hard against what he sees as motivated reasoning from AI skeptics, but also against AGI hype ("I refer to that as the science fiction version of AI"). He gets visibly frustrated by sock-puppet accounts and bad-faith criticism. 
He has a self-deprecating streak ("my stupid pelican benchmark").

Recurring themes across hundreds of comments:

"Two things can be true at the same time" — he holds nuanced positions
Tests are for productivity, not just quality
The November 2025 model releases (Opus 4.5, GPT-5.2) were a genuine inflection point
Code review is the biggest bottleneck in agent-assisted workflows
"Cognitive debt" is a real and unsolved problem
The best engineering practices (tests, docs, CI/CD, clean code) also make agents work better
He's deliberately trying to "teach people good software engineering while tricking them into thinking the book is about AI"

Personal interests mentioned: Niche museums (runs niche-museums.com), New Zealand kākāpō parrots, cooking food from countries he's never visited, chickens in his garden. Lives in Half Moon Bay, CA near a major pelican roost. The pelican-on-a-bicycle SVG test is his signature LLM benchmark.

In a sentence: A deeply experienced, independently-minded developer who's genuinely excited about AI coding tools, fights hard against both uncritical hype and reflexive dismissal, and is trying to professionalize how the industry uses these tools — while also worrying loudly about the security implications almost nobody else takes seriously enough.

This all checks out! I ran this in Claude incognito mode to hopefully prevent Claude from guessing that I was evaluating myself and sycophantically glazing me - the tone of the response it gave here is similar to the tone I've seen for other accounts. I expect it guessed my real name due to my habit of linking to my own writing from some of my comments, which provides plenty of simonwillison.net URLs for it to associate with my public persona. I haven't seen it take a guess at a real name for any of the other profiles I've generated.

It's a little creepy to be able to derive this much information about someone so easily, even when they've shared it freely in a public (and API-available) place. I mainly use this to check that I'm not getting embroiled in an extensive argument with someone who has a history of arguing in bad faith. Thankfully that's rarely the case - Hacker News continues to be a responsibly moderated online space.
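The Algolia endpoint described in the post is easy to wrap from browser JavaScript, since it's served with open CORS headers. Here's a minimal sketch; `commentsUrl` is my own name for the helper, not part of any official client.

```javascript
// Build the Algolia Hacker News API URL for a user's most recent comments,
// newest first, as described in the post. The fetch itself is only sketched.
function commentsUrl(username, hits = 1000) {
  const params = new URLSearchParams({
    tags: `comment,author_${username}`,
    hitsPerPage: String(hits),
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}

// Usage from any web page (open CORS), e.g. to build a "copy to clipboard"
// blob for pasting into an LLM:
//   const res = await fetch(commentsUrl("simonw"));
//   const { hits } = await res.json();
//   const blob = hits.map((h) => h.comment_text).join("\n\n");
```

Note that `hitsPerPage=1000` is the single-request ceiling the post relies on; fetching a deeper history would mean paginating with the API's `page` parameter.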

0 views
alikhil 1 month ago

What is a CDN and Why It Matters?

With the rapid growth of GenAI solutions and the continuous launch of new applications, understanding the fundamental challenges and solutions of the web is becoming increasingly important. One of the core challenges is delivering content quickly to the end user . This is where a CDN comes into play. CDN stands for Content Delivery Network . Let’s break it down. (Note: Modern CDN providers often bundle additional services such as WAF, DDoS protection, and bot management. Here, we focus on static content delivery.)

Content refers to any asset that needs to be loaded on the user’s device: images, audio/video files, JavaScript, CSS, and more. Delivery means that this content is not only available but also delivered efficiently and quickly. A CDN is a network of distributed nodes that cache content. Instead of fetching files directly from the origin server, users receive them from the nearest node, minimizing latency.

Consider an online marketplace for digital assets, such as a photo stock or NFT platform. The application stores thousands of images on a central server. Whenever users open the app, those images must load quickly. If the application server is hosted in Paris, users in Paris will experience minimal ping. However:

Users in Spain may see about 2× ping time.
Users in the USA may see 6× ping time.
Users in Australia may see 12× ping time.

These numbers only reflect simple ICMP ping times. Actual file delivery involves additional overhead such as TCP connections and TLS handshakes, which increase delays even further.

With a CDN, each user connects to the nearest edge node instead of the origin server. This is typically achieved via GeoDNS. Importantly, only the CDN knows the actual address of the origin server, which also improves security by reducing exposure to direct DDoS attacks. CDN providers usually operate edge nodes in major world cities. When a request is made:

If the requested file is already cached on the edge node ( cache hit ), it is delivered instantly.
If not ( cache miss ), the edge node requests it from the CDN shield .
If the shield has the file cached, it is returned to the edge and then served to the user.
If not, the shield fetches it from the origin server, caching it along the way.

For popular websites, the cache hit rate approaches but rarely reaches 100% due to purges, new files, or new users. The shield node plays a critical role. Without it, each cache miss from any edge node would hit the origin server directly, increasing load. Many providers offer shields as an optional feature, and enabling them can significantly reduce origin stress.

Beyond cache hits and misses, performance can be measured with concrete indicators:

Time to First Byte (TTFB): How long it takes for the first data to arrive after a request. CDNs usually reduce TTFB by terminating connections closer to the user.
Latency reduction: The difference in round-trip time between delivery from the origin versus delivery from an edge node.
Cache hit ratio: The percentage of requests served directly from edge caches.

These KPIs provide a real, measurable view of CDN efficiency rather than theoretical assumptions. The closer the edge node is to the end user, the faster the content loads. The key questions are: Where are the users located? Which CDN providers have the best edge coverage for those locations? But don’t rely on maps alone. Measure real performance with Real User Monitoring (RUM) using metrics like TTFB and Core Web Vitals. There are plenty of ready-made tools available. If you’re interested in building your own RUM system, leave a comment or reaction – I can cover that in a follow-up post.
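The edge-to-shield-to-origin flow can be modeled as two cache tiers sharing one shield. This is a toy sketch of the lookup logic, not any real CDN's implementation; the names are mine.

```javascript
// Toy model of the edge -> shield -> origin flow. Each edge node has its own
// cache, every edge shares one shield cache, and `originHits` counts how
// often the origin is actually contacted.
function makeCdn(origin) {
  const shield = new Map();
  let originHits = 0;

  // Each call creates a new edge node (think: one per region).
  function edgeNode() {
    const cache = new Map();
    return function get(path) {
      if (cache.has(path)) return { body: cache.get(path), from: "edge" };
      let from = "shield";
      if (!shield.has(path)) {
        shield.set(path, origin(path)); // shield caches on the way through
        originHits += 1;
        from = "origin";
      }
      const body = shield.get(path);
      cache.set(path, body);            // edge caches for next time
      return { body, from };
    };
  }

  return { edgeNode, stats: () => ({ originHits }) };
}

const cdn = makeCdn((path) => `contents of ${path}`);
const paris = cdn.edgeNode();
const sydney = cdn.edgeNode();
paris("/logo.png");   // miss at edge and shield: one origin fetch
sydney("/logo.png");  // miss at this edge, hit at the shield: origin untouched
```

The model makes the shield's role concrete: N edge nodes requesting the same file cause one origin fetch instead of N, which is exactly the origin-load reduction described above.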

0 views