Latest Posts (20 found)
blog.philz.dev 2 weeks ago

computing 2+2: so many sandboxes

computing 2+2: so many sandboxes

Sandboxes are so in right now. If you're doing agentic stuff, you've no doubt thought about what Simon Willison calls the lethal trifecta : private data, untrusted content, and external communication. If you work in a VM, for example, you can avoid putting a secret on that VM, and then that secret--which isn't there!--can't be exfiltrated. If you want to deal with untrusted data, you can also cut off external communication. You can still use an agent, but you need to either limit its network access or limit its tools. So, today's task is to compute 2+2 five different ways.

Cloud Hypervisor is a Virtual Machine Monitor which runs on top of the Linux kernel's KVM (Kernel-based Virtual Machine), which in turn runs on CPUs that support virtualization. A cloud-hypervisor VM sorta looks like a process on the host (and can be managed with cgroups, for example), but it's running a full Linux kernel. With the appropriate kernel options, you can run Docker containers, do tricky networking things, nested virtualization, and so on. Lineage-wise, it's in the same family as Firecracker and crosvm . It avoids implementing floppy devices and tries to be pretty small. Traditionally, people tell you to unpack a file system and maybe make a disk image out of it using an ISO image or some such. A trick is to instead start with a container image for your userspace, and then you get all the niceties (and all the warts) of Docker. Takes about 2 seconds.

gVisor implements a large chunk of the Linux syscall interface in a Go process. Think of it as a userland kernel. It came out of Google's App Engine work. It can use systrap/seccomp, ptrace, and KVM tricks to do the interception. The downside of gVisor is that you can't do some things inside of it. For example, you can't run vanilla Docker inside of gVisor because it doesn't support Docker's networking tricks. Again, let's use Docker to get ourselves a userland. No need for a kernel image. runsc, gVisor's runtime, stands for "run secure container."
Monty is a Python interpreter written in Rust. It doesn't expose the host, but can call functions that are explicitly exposed. This one's super fast. Pyodide is CPython compiled to WebAssembly. Deno is a JS runtime with permission-based security. Deno happens to run wasm code fine, so we're using it as a wasm runtime. There are other choices.

Chromium is probably the world's most popular sandbox. This is pretty much the same as Deno: it's the V8 interpreter under the hood. Lots of ways to drive Chromium: Puppeteer, headless, etc.

Let's try rodney : run Pyodide inside Deno inside gVisor inside cloud-hypervisor. Setting up the networking and the file system/disk sharing for these things is usually not trivial, especially if you don't want to accidentally expose the VMs to each other, and so forth.

I want to compare two possible agents: a coding agent and a logs agent. A coding agent needs a full Linux, because, at the end of the day, it needs to edit files, run tests, and operate git. Your sandboxing options are going to end up being a VM or a container of some sort. A logs agent needs access to your logs (say, the ability to run read-only queries on ClickHouse) and it needs to be able to send you its output. In the minimal case, it doesn't need any sandboxing at all, since it doesn't have access to anything. If you want it to be able to produce a graph, however, it will need to write out a file. At a minimum, it will need to take the results of its queries and pair them with an HTML file that has some JS that renders them with Vega-Lite. You might also want to mix and match the results of multiple queries, and do some data munging outside of SQL. This is where a setup like Monty or Pyodide comes in handy. Giving the agent access to some Python expands considerably how much the agent can do, and you can do it cheaply and safely with these sandboxes.
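To make the logs-agent output step concrete, here's a minimal sketch of the "pair query results with an HTML file that renders them with Vega-Lite" idea. The row data and chart spec are made up for illustration; the CDN script URLs are the standard vega-embed bundle, but you'd want to pin versions you trust.

```python
import json

# Hypothetical query results the agent produced (e.g., from read-only ClickHouse queries).
rows = [
    {"hour": "2024-01-01T00:00", "errors": 12},
    {"hour": "2024-01-01T01:00", "errors": 3},
]

# A minimal Vega-Lite spec with the data inlined.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "line",
    "encoding": {
        "x": {"field": "hour", "type": "temporal"},
        "y": {"field": "errors", "type": "quantitative"},
    },
}

# A self-contained HTML page: the sandboxed Python only needs to write this one file.
html = f"""<!DOCTYPE html>
<html><head>
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head><body>
<div id="chart"></div>
<script>vegaEmbed("#chart", {json.dumps(spec)});</script>
</body></html>"""

def render_report(path="report.html"):
    with open(path, "w") as f:
        f.write(html)
```

The point is that the agent's sandbox needs exactly one capability here: writing a single HTML file; everything else is data it already has.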
In this vein, if you use DSPy for RLMs, its implementation uses the Deno/Pyodide setup to give the LLM "infinite" context. Browser-based agents are a thing too. Itsy-Bitsy is a bookmarklet-based agent. It runs in the context of the web page it's operating on. Let me know what other systems I missed!

0 views
blog.philz.dev 2 weeks ago

What is Buildkite?

If you're starting a new project, just skip the misery of GitHub Actions and move on. Buildkite mostly gets it.

The core Buildkite noun is a Pipeline, and, as is traditional for an enterprise software company, their docs don't really tell you what's what. The point is that your pipeline should be: Pipelines can add steps to themselves . So, you can write a script to generate your pipeline (or just store it in your repo), and cat it into the command, and that's how the rest of your steps are discovered. Pipeline steps are each executed in their own clean checkout of what you're building. So, if you want to run the Playwright tests in parallel with the backend tests (or whatever), you just declare that as two different steps, but they're part of the same thing. Pipelines have a dependency graph between steps that's conceptually similar to . (Perhaps was the Make replacement that I first heard of that did "generate the ninja graph"?)

The agents seem pretty good at manipulating Buildkite once you give them an API key. They also tend not to inline shell scripts into the YAML, which is Obviously Good.

The way to speed up a build is always the same: cache and parallelize. A 16-core machine for 1 minute costs the same as a 2-core machine for 8 minutes, and I know which one I'd rather wait for! Buildkite makes parallelism pretty easy.

Anyway, it's pretty good. Thanks, Buildkite.
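The "pipelines can add steps to themselves" trick can be sketched as a script that emits the pipeline as JSON (Buildkite's `pipeline upload` accepts YAML or JSON). The labels and commands here are hypothetical; only the shape matters.

```python
import json

def make_pipeline():
    # Two independent test steps run in parallel, each in its own clean checkout.
    # The "wait" entry is a barrier: everything after it waits for everything before it.
    steps = [
        {"label": ":hammer: backend tests", "command": "make test-backend"},
        {"label": ":performing_arts: playwright tests", "command": "make test-e2e"},
        "wait",
        {"label": ":rocket: deploy", "command": "make deploy"},
    ]
    return {"steps": steps}

if __name__ == "__main__":
    # The only step configured in the Buildkite UI is roughly:
    #   python3 gen_pipeline.py | buildkite-agent pipeline upload
    print(json.dumps(make_pipeline(), indent=2))
```

Because the generator lives in your repo, the pipeline is versioned alongside the code it builds.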

0 views
blog.philz.dev 1 month ago

Philip's Second Law of Robotics

Bot posts must include a pointer to their source code and execution environment.

0 views
blog.philz.dev 2 months ago

Unbundling a monorepo to a multi-repo

In a previous post , we talked about using some git plumbing techniques to combine a bunch of repos into a monorepo. The reverse also makes sense! You have a monorepo, but maybe you want to publish or open-source a subdirectory. push-to-both-repos.sh does just this.

0 views
blog.philz.dev 4 months ago

Coverage

Sometimes, the question arises: which tests trigger this code here? Maybe I've found a block of code that doesn't look like it can be hit, but it's hard to prove. Or I want to answer the age-old question of which subset of quick tests might be useful to run if the full test suite is kinda slow. So, run each test with coverage by itself. Then, instead of merging all the coverage data, find which tests cover the line in question. Oddly enough, though some of the Java tools (e.g., Clover) support per-test coverage, the tools here are in general somewhat lacking. , part of the suite, supports a ("test name") marker, but only displays the per-test data on a per-file level. This is the kind of thing where, in 2025, you can ask a coding agent to vibe-code or vibe-modify a generator, and it'll work fine. I have not found the equivalent of Profilerpedia for coverage file formats, but the lowest common denominator seems to be LCOV. The file format is described at geninfo(1) . Most language ecosystems can either produce LCOV output directly or have pre-existing conversion tools.

0 views
blog.philz.dev 5 months ago

Build Artifacts

This is a quick story about a thing I miss, that doesn't seem to have a default solution in our industry: a build artifact store. In a previous world, we had one. You could query it for a "global build number" and it would assign you a build number (and an S3 bucket writable by you). You could then produce a build, and store it back into the build database, with both immutable metadata (what it was, when it was built, from what commits, etc.) and mutable metadata (tags). You could then query the build database for the build that matches your criteria. Perhaps you want the latest build of Elephant that ran on Slackware and passed the nightly tests? This could be used both to cobble together tiers of QA and as a build artifact cache. It was a super simple service, cobbled together in a few files of Python, and it held up to our needs quite well. What do you use? Surely Git LFS or Artifactory aren't the end states here.
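To make the shape of that service concrete, here's a toy in-memory sketch of the build database described above; a real one would sit behind an HTTP API with durable storage and S3, and all the product/OS/tag names below are made up.

```python
import itertools

class BuildDB:
    """Toy build-artifact store: global build numbers, immutable
    metadata, mutable tags, and a 'latest matching' query."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._builds = {}

    def new_build(self):
        # Assign a global build number (the real thing also handed back
        # a writable S3 prefix for the artifacts).
        n = next(self._counter)
        self._builds[n] = {"immutable": {}, "tags": set()}
        return n

    def record(self, build_id, **immutable_metadata):
        # Immutable metadata: what it was, when it was built, from what commits.
        self._builds[build_id]["immutable"] = dict(immutable_metadata)

    def tag(self, build_id, tag):
        self._builds[build_id]["tags"].add(tag)  # mutable metadata

    def latest(self, tags=(), **criteria):
        # Newest build whose metadata matches and which carries all requested tags.
        needed = set(tags)
        for n in sorted(self._builds, reverse=True):
            b = self._builds[n]
            if needed <= b["tags"] and all(b["immutable"].get(k) == v for k, v in criteria.items()):
                return n
        return None

db = BuildDB()
b1 = db.new_build()
db.record(b1, product="Elephant", os="Slackware", commit="abc123")
b2 = db.new_build()
db.record(b2, product="Elephant", os="Slackware", commit="def456")
db.tag(b2, "nightly-passed")

# "Latest Elephant build on Slackware that passed the nightly tests"
print(db.latest(tags=["nightly-passed"], product="Elephant", os="Slackware"))  # 2
```

Splitting metadata into immutable facts and mutable tags is the load-bearing design choice: QA tiers are just tags applied after the fact.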

0 views
blog.philz.dev 6 months ago

Containerizing Agents

Simon Willison has been writing about using parallel coding agents ( blog ), and his post encouraged me to write down my current workflow, which involves parallelism, containerization, and web browsers. I'm spoiled by (and helped build) sketch.dev's agent containerization , so, when I need to use other agents as well, I wrote a shell script to containerize them "just so." My workflow is that I run , and I find myself in a web browser, in the same git repo I was in, but now in a randomly named branch, in a container, in . The first pane is the agent, but there are other panes doing other stuff. When I'm done, I've got a branch to work with, and I merge/rebase/cherry-pick. Let's break up the pieces:

First, my shell script is in my favorite shell scripting language, dependency-less python3. Python3 has the advantage of not requiring you to think about and is sufficiently available.

Second, I have a customized Dockerfile with the dependencies my projects need. I don't minimize the container; I add all the things I want: browsers, Playwright, subtrace, tmux, etc.

Third, I cross-mount my git repo itself into the container, and create a worktree inside the container. From the outside, this worktree is going to look "prunable", but that causes no harm, and there's a new branch that corresponds to the agent's worktree. I like worktrees more than remotes because they're in the same namespace; you don't need to "fetch" or "push" across them. It's easy to lose changes when the container exits; I commit automatically on exit. It's also easy to lose the worktree if something calls on your behalf, but recovery is possible with and some fiddling.

Fourth, I run tmux inside the container so that opening a shell in the container is as simple as opening a new pane. (Somehow, is too rich.) I'm used to sketch.dev's terminal pane to do the little git operation, take a look at a diff, run a server... tmux helps.

Fifth, networking magic with Tailscale. I publish ports 8000-9999 (and 11111) on my tailnet, using the same randomly generated name as I've used for my container and my branch. You're inevitably working on a web app, and you inevitably need to actually look at it. Docker networking is doable, but you have to pre-declare exposed ports, and avoid conflicts, and ... it's just not great for this use case. There are other solutions (ngrok, SSH port forwarding), but I already use Tailscale, so this works nicely. I originally started with tsnsrv , but then vibe-coded a custom thing that supports port ranges. is the userland networking library here, and the agents do a fine job one-shotting this stuff.

Sixth, I use to expose my to my browser over the tailnet. I'm used to having a browser tab per agent, and this gives me that. (Terminal-based agents feel weird to me. Browsers are great at scrolling, expand/collapse widgets, cut and paste, word wrap of text, etc.)

Seventh, I vibe-coded a headless browser tool called , which wraps the excellent chromedp library, which remote-controls a headless Chrome over its debugging protocol. Getting the MCPs configured for Playwright was finicky, especially across multiple agents, and I'm experimenting with this command line tool to do the same.

As I've written about before, using agents in containers gives me two things I value:

Isolation for parallel work. The agents can start processes and run tests and so forth without conflicting on ports or files.

A bit more security. Even the Economist has now picked up on the Lethal Trifecta (or Simon Willison's original ). By explicitly choosing which environment variables I forward, and not sharing my cookies and my SSH keys, I'm exerting some control over what data and capabilities are exposed to the agent. We're still playing with fire (can you break out of Colima? Sure! Can you edit my git repo? Sure! Break into my tailnet? Sorta.), but it's a smaller, more controlled burn.
If you want to try my nonsense, see https://github.com/philz/ctr-agent .
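As an illustration of the "containerize them just so" wrapper, here's a sketch of assembling the `docker run` invocation: random container/branch name, the repo cross-mounted, and only an explicit allowlist of environment variables forwarded. The image name, mount path, and word lists are all invented; the real script is in the linked repo.

```python
import random

ADJECTIVES = ["brisk", "mossy", "quiet"]
NOUNS = ["heron", "walrus", "kestrel"]

def docker_run_args(repo_path, image="ctr-agent:latest",
                    env_allowlist=("ANTHROPIC_API_KEY",)):
    # One random name reused for the container, the branch, and the tailnet host.
    name = f"{random.choice(ADJECTIVES)}-{random.choice(NOUNS)}"
    args = ["docker", "run", "-d", "--name", name,
            # Mount the git repo itself; the agent makes a worktree inside.
            "-v", f"{repo_path}:/repo"]
    # Forward only explicitly chosen env vars -- no cookies, no SSH keys.
    for var in env_allowlist:
        args += ["-e", var]
    # No -p flags: port publishing is handled by Tailscale inside the container.
    args += [image, "tmux", "new-session", "-d"]
    return name, args

name, args = docker_run_args("/home/me/src/myproject")
print(name, args)
```

Building the argv as a list (rather than a shell string) keeps quoting honest and makes the allowlist auditable at a glance.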

1 views
blog.philz.dev 6 months ago

Pipefail Fail

What the?!?! That should succeed. We're printing , and surely is finding it. Turns out the culprit is SIGPIPE (13), and the non-zero exit code is because the part of the pipeline is failing with SIGPIPE. There are lots of solutions, but the simplest one is to not use , which not only does "quiet" but also exits at first match, causing the failure.
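The failure is easy to reproduce; a sketch, driving bash from Python (assuming GNU-ish `seq` and `grep`): `grep -q` exits at the first match, the producer keeps writing into a closed pipe and dies of SIGPIPE (13), and under `pipefail` the pipeline's exit code becomes 128 + 13 = 141.

```python
import subprocess

def run(script):
    # Run a snippet under bash and return its exit code.
    return subprocess.run(["bash", "-c", script]).returncode

# grep -q exits at the first match; seq keeps writing, gets SIGPIPE,
# and pipefail surfaces seq's 141 as the pipeline's exit code.
print(run("set -o pipefail; seq 1000000 | grep -q 1"))           # 141 on most systems

# Without -q (output discarded instead), grep reads all its input,
# nobody gets SIGPIPE, and the pipeline exits 0.
print(run("set -o pipefail; seq 1000000 | grep 1 > /dev/null"))  # 0
```

The same trap applies to anything that stops reading early (`head`, `grep -m1`, a crashed consumer), not just `grep -q`.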

0 views
blog.philz.dev 8 months ago

Itsy Bitsy Agent Bookmarklet (or, adding an agent to a PyBricks simulator)

As we recall from my earlier post, an agent is just 9 lines of code . So, I built my own agent (by instructing the sketch coding assistant to follow my blog post!) that embeds itself in any web page via a bookmarklet. (Content-Security-Policy headers can prevent it from working.) You bring your own Anthropic API key (which I promise not to steal, though the target web page could, with some effort), and, voila. Try it at this link: The Itsy Bitsy Agent Bookmarklet . This demos well with a video, in which we give a Lego simulator an agent. Saying that "I built" this is a bit of an exaggeration. I used a variety of LLM coding agents, but mostly the one I work on, Sketch , with Claude Sonnet 4.0 as the underlying model.

0 views
blog.philz.dev 8 months ago

Shell Trap

Should you fall into the trap of having a load-bearing shell script, perhaps this will help: And then you can play three truths and a lie: My rule of thumb is that once you get to 100 lines of shell, it's time to move on.
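The post's snippet is elided above; as an assumption-labeled sketch of the kind of defensive preamble it likely shows, here's a strict-mode header with an ERR trap, exercised via bash from Python so it's runnable anywhere. The exact preamble is my guess, not the post's original.

```python
import subprocess

# Hypothetical defensive preamble for a load-bearing script:
# strict mode plus a trap that reports where the script died.
preamble = r"""
set -euo pipefail
trap 'echo "error: line $LINENO: exit $?" >&2' ERR
"""

# A failing command now reports its location instead of exiting silently,
# and execution stops before the final echo.
result = subprocess.run(
    ["bash", "-c", preamble + "true\nfalse\necho unreachable"],
    capture_output=True, text=True,
)
print(result.returncode)       # 1
print(result.stderr.strip())   # e.g. error: line 5: exit 1
```

Even with a trap like this, the 100-lines-of-shell rule of thumb stands: the trap tells you where it died, not why.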

0 views
blog.philz.dev 8 months ago

Infrastructure as Code for Grafana Dashboards

This post came about from some work I and others did at Resolve.ai ; check them out for your agentic on-call needs! I'm sharing it with you with their kind permission.

Checking dashboards into version control gets you all the usual "infrastructure as code" advantages: for loops, variables, version control, consistency. The essence of a dashboard is the queries that power the visualizations. More often than not, the visualizations themselves are similar across many queries. Writing the dashboards as code lets you focus on the essence—the queries—and re-use the styling. This post does that with Grafana and TypeScript. I chose TypeScript to define the dashboards, so as to embed them within a language and tooling ecosystem we already know well. (Others may choose the Terraform provider or JSONnet .) TypeScript's type system and language server are a real advantage in working with Grafana's APIs, because there exist good types for the surface area. Grafana's Foundation SDK has types for many Grafana dashboard concepts as well as examples . The JSON model for dashboards is documented as part of Grafana's API documentation .

A Grafana dashboard has 3 main components:

- Rows . These visually separate groups of metrics.
- Panels . These are the visualizations you place in your dashboard. The Grafana grid is composed of 24 columns, and each "height" unit represents 30 pixels. The grid has negative gravity, which means that a panel slides upwards to empty space, like an upside-down game of Tetris. If you want three charts per row, you use a width of 8, and if you want two, use a width of 12. Using "4! = 24" as a basis gives the chart lots of divisors for layout options! We've found that if you just specify a height and width in your panels, Grafana lays them out in order nicely enough.
- Variables . These appear at the top and can be used to drill down to specific instances of your infrastructure. They're used within the individual PromQL queries, and Grafana does a great job of letting you specify a metric to grab the possible values.

The panels are where all the action is, and there are many, many panel types. This folder has many panel types, including the very popular "timeseries", "text", "piechart", and so forth. For the most part, Grafana's JSON system has sensible defaults, so you don't need to specify all possible properties. This is a big win of using the JS bindings over checking in the "expanded" JSON directly. (We've found that the Cloudwatch panel is pretty picky and doesn't work if you don't specify nearly everything.)

Now that we sort of understand Grafana's nouns, we can build out a dashboard in code. It's very likely that you want a lot of panels of all the same type, so you define something like and invoke it many times. If you need to do advanced things, you can do one manually in the UI, and then find the "Inspect…Panel JSON" action on every Grafana panel to dig in. Most of your dashboards will look something like this. The rest is boilerplate at the per-dashboard and overall layers.

We can now look at the end-to-end example, annotated slightly. You'll need a Grafana bearer token to run this against your instance. Here are the key files you'll need:

- Dependencies and npm scripts
- TypeScript configuration
- The main script (shown below)
- (optional) ESLint setup for TypeScript

This is the dashboard it generates: Here's the TypeScript code that generates this dashboard: Happy monitoring!

Coding agents are great at modifying the code above. Give your favorite (I'm partial to Sketch ) agent the Grafana keys, and let it do its thing.

To sum up:

- A dashboard is, in essence, an array of (title, query) pairs: you can get pretty close to that essence. The styling of the panels within that dashboard is typically common!
- Use the sample code to programmatically create Grafana dashboards and alerts.
- Use your company's common programming language to define dashboards for great developer ergonomics and lower barriers to entry.
- Grafana dashboard panels play an upside-down game of Tetris!
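The post's real code is TypeScript with the Foundation SDK; as a language-agnostic sketch of the JSON model it targets, here's the 24-column `gridPos` arithmetic for an array of (title, query) pairs. The panel JSON is heavily trimmed (Grafana fills in defaults), and the queries are invented.

```python
def layout(titles_and_queries, per_row=3, height=8):
    """Assign gridPos to (title, promql) pairs: 24-column grid,
    so three-across means width 8, two-across means width 12."""
    width = 24 // per_row
    panels = []
    for i, (title, query) in enumerate(titles_and_queries):
        panels.append({
            "type": "timeseries",
            "title": title,
            "targets": [{"expr": query}],
            "gridPos": {
                "x": (i % per_row) * width,   # column slot within the row
                "y": (i // per_row) * height, # row offset (negative gravity packs upward)
                "w": width,
                "h": height,
            },
        })
    return panels

panels = layout([
    ("CPU", 'rate(cpu_seconds_total[5m])'),
    ("Memory", "process_resident_memory_bytes"),
    ("Errors", 'rate(errors_total[5m])'),
    ("Latency", "histogram_quantile(0.99, sum(rate(req_bucket[5m])) by (le))"),
])
print([p["gridPos"] for p in panels])
# x cycles 0, 8, 16, then wraps to the next row (y += 8)
```

This is the "array of (title, query) pairs" essence: the helper owns the styling, the list owns the content.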

0 views
blog.philz.dev 11 months ago

The Unreasonable Effectiveness of an LLM Agent Loop with Tool Use

My co-workers and I have been working on an AI programming assistant called Sketch for the last few months. The thing I've been most surprised by is how shockingly simple the main loop of using an LLM with tool use is: There's some pomp and circumstance to make the above work ( here's the full script ), but the core idea is the above 9 lines. Here, is a function that sends the system prompt, the conversation so far, and the next message to the LLM API. Tool use is the fancy term for "the LLM returns some output that corresponds to a schema," and, in the full script, we tell the LLM (in its system prompt and tool description prompts) that it has access to .

With just that one very general purpose tool, the current models (we use Claude 3.7 Sonnet extensively) can nail many problems, some of them in "one shot." Whereas I used to look up an esoteric git operation and then cut and paste, now I just ask Sketch to do it. Whereas I used to handle git merges manually, now I let Sketch take a first pass. Whereas I used to change a type and go through the resulting type checker errors one by one (or, let's be real, with ridiculousness), I give it a shot with Sketch.

If appropriately prompted, the agentic loop can be persistent. If you don't have some tool installed, it'll install it. If your has different command line options, it adapts. (It can also be infuriating! "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.)

For many workflows, agentic tools specialize. Sketch's quiver of tools is not just , as we've found that a handful of extra tools improve the quality, speed up iterations, and facilitate better developer workflows. Tools that let the LLM edit text correctly are surprisingly tricky. Seeing the LLM struggle with one-liners re-affirms that visual (as opposed to line) editors are a marvel.
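The loop itself is elided above; here's a sketch of its shape with a stubbed model so it runs without an API key. The message schema and the stub are my inventions, not Sketch's actual code (which is in the linked script): call the LLM, run any tool calls it asks for, append the results, repeat until it stops asking.

```python
def agent_loop(llm, tools, user_message):
    """The whole trick: LLM call, tool dispatch, append, repeat."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = llm(messages)          # sends system prompt + conversation to the API
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"]    # no more tool use: the model is done
        for call in reply["tool_calls"]:
            output = tools[call["name"]](call["input"])
            messages.append({"role": "tool", "name": call["name"], "content": output})

# Stub LLM: asks to run one bash command, then answers from the tool output.
def fake_llm(messages):
    if messages[-1]["role"] == "user":
        return {"role": "assistant", "content": "",
                "tool_calls": [{"name": "bash", "input": "echo 4"}]}
    return {"role": "assistant", "tool_calls": [],
            "content": f"The answer is {messages[-1]['content'].strip()}"}

tools = {"bash": lambda cmd: "4\n"}   # a real agent would shell out via subprocess here
print(agent_loop(fake_llm, tools, "What is 2+2? Use bash."))  # The answer is 4
```

Everything else (persistence, extra tools, editing) is refinement on top of this while loop.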
I have no doubt that agent loops will get incorporated into more day-to-day automation tedium that's historically been too specific for general purpose tools and too esoteric and unstable to automate traditionally. I keep thinking of how much time I've spent correlating stack traces with git commits, and how good LLMs are at doing a first pass on it. We'll be seeing more custom, ad hoc, throw-away LLM agent loops in our directories. Grab your favorite bearer token and give it a shot. Also published at https://sketch.dev/blog/agent-loop .

0 views
blog.philz.dev 1 year ago

LLM Log

We're all trying to build some intuition on what does and doesn't work when using LLMs.

- Making CSS counters display in hex rather than decimal. (Claude)
- Converting a C header struct into a struct. (Claude)
- In editor, converting into something that rounds to two decimal points (and uses a format string). (Continue.dev)
- Converting AppleScript to JavaScript. (Claude)
- Reading an (calendar) file and telling me what events were in it. ( llm )
- Generating a web page to print envelopes. (Val.town: https://philz-tendermagentacod.web.val.run) (Claude: https://philz-static.web.val.run/static/envelope)
- Changing the first character of a few lines (like I would usually do with regular expressions). (Copilot)
- Filling in some variables into a SQL query. (ChatGPT)
- Understanding the difference between memory usage reported by Activity Monitor and . (Claude)
- Building a website that authenticates with Passkeys. (val.town)

0 views
blog.philz.dev 1 year ago

tool report: spr

I've been using spacedentist/spr ( docs ) (not to be confused with 's tool of the same name) to send PRs. If you like Gerrit's model of , you may like spr. Phabricator's tool is also similar. Under the covers, when is invoked (on HEAD), it creates a PR and puts a pointer to that PR in the form of text in HEAD's commit message. The PR actually points to a hidden branch that spr manages. When you invoke again on your amended commit, it finds the PR and updates it by creating new synthetic commits in its branch. GitHub, which sometimes likes to lose review comments if you force-push your PR branch, is none the wiser. See this issue comment for a more accurate explanation. My full workflow involves a wrapper which uses to choose any commit between and invoke on that one by abusing in an interactive rebase.

0 views
blog.philz.dev 1 year ago

Exporting Language Server Data to SQL

Let's do what we always do... let's export the data from our language server into a SQL database. After all, the language server has all of this information, but its query language is a tedious JSON-RPC situation. This post was an excuse for me to learn a bit more about language servers; come along for the ride!

Much of our mutual drudgery is slurping data from one end of our systems to another through a tiny, awkwardly-shaped straw. Some examples, to wit: compared to OS X's infrastructure for listing processes , the file system is lovely; we have tools (like ) for querying file systems! Let me know if you've got items to add to my handy table.

Microsoft, with Visual Studio Code , led the way in standardizing language servers. They describe their history on GitHub . The specification is at https://microsoft.github.io/language-server-protocol/ . With a nod to XKCD 927 , VSCode doesn't actually use a TypeScript language server! It uses a "TypeScript Server" ( ). The Language Server Protocol evolved as a generalization, but the migration hasn't happened. For this project, I used one of the wrappers that wraps in a language server.

Language servers typically run as a separate process, and communicate with their parent via stdin/stdout pipes . This keeps them behind the scenes, as they're managed by your IDE, but it means it's tricky to see their logs, and, at least for me, it made it a bit tricky to attach a debugger to the language server. Using stdin/stdout, and needing to "read lines" to parse the content length, exposes you to some fun with Python's bytes versus strings, UTF-8 encoding, blocking reads and writes, etc. I'm sure my script breaks if you're not using UTF-8, but surely you're using UTF-8 everywhere . The protocol sends a Content-Length header line, a blank line, and then a JSON blob of content. This happens bi-directionally. There are notifications (which go one way and don't expect a response) and requests (which expect a response). The specification has a list of methods.

A reasonable person would use a pre-existing client library with typed support. This blog post was not written reasonably. The protocol is stateful. You pretend to be an editor and say that you "didOpen" a document, and then you can ask for hover annotations.

Often, you can script APIs by inspecting what they do in the Network tab and using "Copy as curl" and scripting with . This is not easy to do for language servers. Sometimes CLI tools for the API are the way to go, but I didn't find much (though maybe I should have tried lsp-cli ). Having said that, I nerd-sniped myself, and here's querying language servers in . Note how counting content-length is a pain, but we have to pad our JSON for us. The sleep is insidious: the language server is happy to exit whenever its input stream is closed, and Node is asynchronous. (See this note : "Writes may be synchronous depending on what the stream is connected to"... or also this investigation .) Let me know if I've nerd-sniped you , and you figured out how to get rid of the sleep without resorting to reasonable approaches like . 🥁🥁 Here's the same data, queried sensibly:

So, how should we look at this data? Probably the editors, with their "inlay hints" and their mouse-overs, are doing it right, but maybe like an annotated ("glossed") medieval manuscript? Maybe like my student's copy of The Aeneid , with lots of hard-to-decipher vocabulary notes? Bibliothèque nationale de France, MS Latin 7980, detail of fol. 5v, from medievalcodes.ca ❤️ AP Latin

This is a long-winded blog post, so we also get the opportunity to quote Bret Victor : We expect programmers to write code that manipulates variables, without ever seeing the values of those variables. We expect readers to understand code that manipulates variables, without ever seeing the values of the variables. The entire purpose of code is to manipulate data, and we never see the data. We write with blindfolds, and we read by playing pretend with data-phantoms in our imaginations.
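The Content-Length framing described above is small enough to sketch directly; this uses an in-memory buffer instead of the real stdin/stdout pipes, and the hover request payload is just an example.

```python
import io
import json

def write_message(stream, payload):
    # LSP base protocol: a Content-Length header, a blank line, then the JSON body.
    body = json.dumps(payload).encode("utf-8")
    stream.write(f"Content-Length: {len(body)}\r\n\r\n".encode("ascii"))
    stream.write(body)

def read_message(stream):
    length = None
    while True:
        line = stream.readline().strip()
        if not line:
            break  # blank line ends the headers
        name, _, value = line.partition(b":")
        if name.lower() == b"content-length":
            length = int(value)
    # Read exactly length bytes -- note bytes, not characters.
    return json.loads(stream.read(length))

buf = io.BytesIO()
write_message(buf, {"jsonrpc": "2.0", "id": 1, "method": "textDocument/hover",
                    "params": {"position": {"line": 4, "character": 2}}})
buf.seek(0)
print(read_message(buf)["method"])  # textDocument/hover
```

The byte/character distinction in `Content-Length` is exactly where the UTF-8 fun mentioned above bites.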
One of the all-time most popular programming models is the spreadsheet. A spreadsheet is the dual of a conventional programming language -- a language shows all the code, but hides the data. A spreadsheet shows all the data, but hides the code. Some people believe that spreadsheets are popular because of their two-dimensional grid, but that's a minor factor. Spreadsheets rule because they show the data. It's a stretch, but, in this case, the hidden data is the types the compiler knows but which are hidden from us as readers. And, so, I fed hono into it. (Hono is like Express, but it's the default web framework for val.town ...) https://philz.github.io/language-server-db/ and https://github.com/philz/language-server-db have the Python script, a full page of sample output, etc.

0 views
blog.philz.dev 1 year ago

Safari Top, Part 2

I posted recently about getting the top memory-using tabs from Safari. This is the sort of pickle you get into if you're using a laptop with only 8GB of RAM. There are two problems: (1) how to map tabs to process ids and (2) how to get the memory usage of the underlying processes. Once you enable the debug option, AppleScript works well enough to get the mapping of tabs to process ids, but, crucially for the second problem, was underreporting memory usage. For example, Claude is reportedly using 1GB of memory, but is reporting just 1MB. This led me down a rabbit hole of finding the command, and seeing memory usage more in the 1GB ballpark. I learned about from Julia Evans' blog post and went on a little bit of a detour to try to replicate it. It turns out that to get a "mach port" you need several "entitlements" like and . So, you make the Rust work, figure out how to , and, voila, it still doesn't work . Safari is protected by System Integrity Protection and doesn't allow you to open a mach port to it. So, back at square two, we find out about , find the header files in , and use Python's package. The field seems to match what Activity Monitor says. (The documentation is sparse, and I haven't delved deeper.) The reason to use Python rather than compiling a binary is to avoid a compile or installation step. So, here's the result: Here's the Python code : This time, I converted the AppleScript into "JavaScript for Automation" (JXA), and learned that the Script Editor app has an "Open Dictionary" feature which lets you browse what's possible. If you find out how Activity Monitor actually gets the pids of the tabs, let me know!

blog.philz.dev 1 year ago

Finding top memory consumers in Safari

Update: see also part two. Activity Monitor manages to do this, but it's not clear what hooks it has into asking Safari for the mapping between URLs and pids. If you enable a debug option to append pids to tab names, you can get at it with AppleScript, like so:

blog.philz.dev 1 year ago

CI Performance Debugging

A friend of mine asked me to look at why their GitHub Actions CI workflow was slow. The punchline was that their self-hosted GitHub Runner (on AWS EC2) had too few IOPS available to it and, as a result, was waiting on the EBS volume quite a bit. The monitoring tool showed a highly utilized disk in a nice red color, so, fine, we figured it out. I pointed Bazel at faster storage, and suddenly we were 3x faster. Logging into a far-away machine can be tricky. Turns out the following should work: Simon Willison has also described this on his blog , and there's a GitHub Action called action-tmate that does the same. In my case, however, this didn't work! It turned out that the GitHub runner's user had a non-login shell, and the session was refusing to start. The error was confusing, and running the command directly spewed out something equally confusing. The underlying problem was that a login session couldn't be created because of the user's shell setting. To work around this, change the shell, run as root, or use Tailscale or one of the AWS Connect options... When I went rooting around in the Bazel setup, I found that the somewhat hidden profile file is, in fact, in a format Chrome's tracing tools understand. So, if you're generating it, you have a nice graph of your CPU usage (as well as a timeline visualization of your tests) already! In the first (slower) run, the CPU graph goes up and down. Meanwhile, in the second (faster) run, the CPU graph is pegged once the (parallelized) tests start, and it remains pegged.
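Those profiles are easy to mine yourself. Assuming Chrome's trace-event JSON format (where each completed unit of work is a `"ph": "X"` event with microsecond `ts`/`dur`), here's a sketch of computing peak concurrency -- a proxy for that "pegged CPU" graph -- run on synthetic events, not a real Bazel profile:

```python
import json

def peak_concurrency(trace_json: str) -> int:
    """Peak number of simultaneously running 'X' (complete) trace events."""
    events = json.loads(trace_json).get("traceEvents", [])
    deltas = []
    for e in events:
        if e.get("ph") == "X":
            deltas.append((e["ts"], 1))              # event starts
            deltas.append((e["ts"] + e["dur"], -1))  # event ends
    deltas.sort()  # ties sort the -1 first, so touching events don't overlap
    peak = cur = 0
    for _, d in deltas:
        cur += d
        peak = max(peak, cur)
    return peak

# Synthetic profile: two overlapping actions, then one later one.
trace = json.dumps({"traceEvents": [
    {"ph": "X", "name": "compile a", "ts": 0, "dur": 100},
    {"ph": "X", "name": "compile b", "ts": 50, "dur": 100},
    {"ph": "X", "name": "link", "ts": 200, "dur": 50},
]})
print(peak_concurrency(trace))  # -> 2
```

If peak concurrency sits well below your core count while the wall clock drags, you're waiting on something other than CPU -- like an IOPS-starved disk.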

blog.philz.dev 1 year ago

direnv for Node and Python

I use direnv to manage per-directory environment variables, which comes in handy when a project needs its own Python or Node (or whatever) environment. Install it from your package manager of choice. Then hook it into your shell config like so: One time, install Node like the following. I use a downloader to pull in whatever version I want, but the important thing is the target directory and naming scheme. Add the following globally (e.g., to your direnvrc): And then, here's the per-project Node incantation: See direnv-stdlib(1) for more details. Python is a bit different. Here's the per-project incantation: It creates the virtualenv for you!
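For reference, a sketch of what such `.envrc` files can look like using direnv's own stdlib helpers (`use node` and `layout python3` are real direnv-stdlib functions; the version number and directory layout here are assumptions, not the post's exact incantation):

```bash
# .envrc for a Node project: `use node` searches $NODE_VERSIONS for a
# directory named "$NODE_VERSION_PREFIX<version>" and puts its bin/ on PATH.
export NODE_VERSIONS="$HOME/.nodes"
export NODE_VERSION_PREFIX="node-v"
use node 20.11.0
```

```bash
# .envrc for a Python project: creates and activates a virtualenv
# under .direnv/ automatically.
layout python3
```

After editing an `.envrc`, run `direnv allow` once to approve it.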

blog.philz.dev 1 year ago

Observability in Trouble

Writing down and sharing a few tools and tricks that got us out of a jam. These tools and jams are quite generic, and I wouldn't hesitate to re-implement them in new contexts. A specific kind of request for a very specific customer was really, really slow. We could see the slowness in the logs (hey, why is that taking 30+ seconds?), but we couldn't tell why, or what had changed. An incident formed, and folks started combing through commits and all that. This particular system was NodeJS-based, and it turns out that if you send the right signal to a NodeJS process, it will start listening for the debugger. It'll spew something like the following to stderr: Then, you use some incantations to tunnel your local port 9229 to production's 9229, and you can attach Chrome's debugger to your remote process. And then you use the simplest of profilers: you hit pause, note the stack trace, and hit play. And you do that a few more times. And it's very surprising that the stack trace is always the same: that's your hot spot! Once we had the stack trace, and we could inspect some variables and arguments for state, we quickly realized what was "accidentally quadratic" (hat tip to the Accidentally Quadratic blog ) and the rest was history... ( Ctrl-C Profiling is in the same spirit.) The above--connecting to production, SSH tunnels, clicking around in your debugger--is a bit imposing. It can be finicky. It should be gated with permissions and processes. Instead, set up something that allows you to start a profiler via some HTTP path (preferably gated with permissions), possibly via a magic header. Have that profiler run for, say, 60 seconds, and dump its output to S3. Log the output's path to your logging system, behind a proxy that lets you download the profile. Now, you can trigger a profile, trigger the errant action, and analyze the profile, all from the convenience of home!
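The Node-specific part (a signal flips on the inspector on port 9229) is documented Node behavior, but the pause-and-sample idea is runtime-agnostic. As an illustration in Python -- a sketch, not what the post's team ran -- the stdlib `faulthandler` module will dump every thread's stack whenever a signal arrives:

```python
import faulthandler
import os
import signal
import tempfile

# Register a handler: whenever SIGUSR1 arrives, dump every thread's stack
# trace to this file (faulthandler writes straight to the file descriptor,
# so it's safe to trigger against a live, busy process).
log = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
faulthandler.register(signal.SIGUSR1, file=log)

# Simulate someone running `kill -USR1 <pid>` against the process.
os.kill(os.getpid(), signal.SIGUSR1)

with open(log.name) as f:
    dump = f.read()
# `dump` now holds tracebacks like:  File "...", line N in <module>
```

Send the signal a few times and eyeball where the stacks agree -- the same poor man's sampling profiler as repeatedly hitting pause in a debugger.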
Since profiles only have function names and timing information, they are, unlike customer data, allowed to be on your dev machine. If you want to go one step further, integrate directly with Speedscope and serve the profiler UI directly. See Stripe's write-up on Canonical Log Lines . Have a structured log line for every request and point your log analysis tool (e.g., Kibana) at it. Just looking for the slowest requests can often pinpoint a source of trouble. Having p95 latencies in your monitoring stack is nice and all, but, at the end of the day, you need to find some actual requests that experienced those latencies, and it sure is handy when that's easy. If your app talks to a database, a common source of latency and load is the SQL queries made in responding to a request. If you have a misbehaving query, it's hard to figure out which code path is invoking it (especially if it's in a library or an ORM layer is involved). If you have a slow request, it's hard to know which queries are slowing it down. Sure, you can keep a query counter and a query latency total in your canonical log line, but you still need to figure out which queries are at the heart of the issue. So, for every Nth request (N=10,000 is reasonable), log all the queries that request does! If you have enough QPS flowing through your system, you'll have a pretty effective sample of all those queries in your log system to do further analysis. If you have a tracing header in your system (e.g., OpenTelemetry's), you can tie the sampling to the trace-id. If you're comfortable with the security implications, you can, perhaps, learn how to inject a trace-id that makes sure the particular request you're making right now is one of the Nth requests that gets sampled. It's often useful to know the stack trace that triggered a SQL query to be executed. Taking stack traces can be expensive in your runtime.
Sampling can come to the rescue again: do it for every Mth request that's already sampled, and your logs will give you stack traces. I've also, in the past, built a CI job that annotated all SQL queries and shoved them into a SQLite file (viewable, e.g., with https://datasette.io). That worked, but logs turned out more useful. You can also put the stack trace into a query comment, as suggested by Henry . Sometimes, you want to sneak a bit of data into your profile. For (a made-up) example, say you're executing a query, and you want the profile to show which table is being scanned, rather than one generic function name. In a sufficiently dynamic language (Java, Python, JS all qualify), you can sneak a string into a profile by dynamically generating a function whose name carries the data. Note that Content Security Policies can disable dynamic code evaluation, and with it this trick.
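A sketch of that trick in Python (the post's elided snippet may differ; the label and query here are made up): `exec` a wrapper whose code-object name is the label, so the label shows up as a frame in any profiler or stack trace.

```python
import traceback

def call_with_label(label: str, fn, *args, **kwargs):
    """Call fn through a wrapper whose code-object *name* is `label`,
    so the label appears as a frame in profilers and stack traces."""
    assert label.isidentifier()  # never exec an attacker-controlled string
    ns = {}
    exec(f"def {label}(fn, args, kwargs):\n    return fn(*args, **kwargs)", ns)
    return ns[label](fn, args, kwargs)

def run_query(sql):
    # Stand-in for a real query; grab the current stack to show the label.
    return [frame.name for frame in traceback.extract_stack()]

frames = call_with_label("scan_users_table", run_query, "SELECT * FROM users")
assert "scan_users_table" in frames
```

In JavaScript, the analog is building a named function via `eval` or `new Function` -- which is exactly the dynamic evaluation a Content Security Policy can forbid.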
