Latest Posts (20 found)

Coverage

Sometimes, the question arises: which tests trigger this code here? Maybe I've found a block of code that doesn't look like it can be hit, but it's hard to prove. Or I want to answer the age-old question of which subset of quick tests might be useful to run if the full test suite is kinda slow. So, run each test with coverage by itself. Then, instead of merging all the coverage data, find which tests cover the line in question. Oddly enough, though some of the Java tools (e.g., Clover) support per-test coverage, the tooling in general is somewhat lacking: one tool supports a ("test name") marker, but only displays the per-test data at a per-file level. This is the kind of thing where, in 2025, you can ask a coding agent to vibe-code or vibe-modify a generator, and it'll work fine. I have not found the equivalent of Profilerpedia for coverage file formats, but the lowest common denominator seems to be LCOV. The file format is described in geninfo(1). Most language ecosystems can either produce LCOV output directly or have pre-existing conversion tools.

0 views
blog.philz.dev 1 month ago

Build Artifacts

This is a quick story about a thing I miss, that doesn't seem to have a default solution in our industry: a build artifact store. In a previous world, we had one. You could query it for a "global build number" and it would assign you a build number (and an S3 bucket writable to you). You could then produce a build and store it back into the build database, with both immutable metadata (what it was, when it was built, from what commits, etc.) and mutable metadata (tags). You could then query the build database for the build that matches your criteria. Perhaps you want the latest build of Elephant that ran on Slackware and passed the nightly tests? This could be used both to cobble together tiers of QA and as a build artifact cache. It was a super simple service, cobbled together in a few files of Python, and it held up to our needs quite well. What do you use? Surely Git LFS and Artifactory aren't the end states here.
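A minimal sketch of the core idea (table and column names here are illustrative, not the original service's) fits in a few lines of SQLite: immutable facts in one table, mutable tags in another, and "latest build matching criteria" as a join.

```python
import sqlite3

def make_store():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Immutable facts about a build; the rowid serves as the
        -- global build number.
        CREATE TABLE builds (product TEXT, platform TEXT,
                             commit_sha TEXT, built_at TEXT);
        -- Mutable tags, e.g. 'nightly-passed', attached after the fact.
        CREATE TABLE tags (build_id INTEGER, tag TEXT);
    """)
    return db

def latest_build(db, product, platform, tag):
    # "The latest build of Elephant that ran on Slackware and passed
    # the nightly tests," expressed as a query.
    row = db.execute("""
        SELECT b.rowid FROM builds b JOIN tags t ON t.build_id = b.rowid
        WHERE b.product = ? AND b.platform = ? AND t.tag = ?
        ORDER BY b.rowid DESC LIMIT 1
    """, (product, platform, tag)).fetchone()
    return row[0] if row else None
```

The real service also handed out writable storage per build number; the metadata half really is this small.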

0 views
blog.philz.dev 1 month ago

Containerizing Agents

Simon Willison has been writing about using parallel coding agents ( blog ), and his post encouraged me to write down my current workflow, which involves parallelism, containerization, and web browsers. I’m spoiled by (and helped build) sketch.dev’s agent containerization, so when I need to use other agents as well, I wrote a shell script to containerize them "just so." My workflow is that I run the script, and I find myself in a web browser, in the same git repo I was in, but now in a randomly named branch, in a container, in tmux. The first pane is the agent, but there are other panes doing other stuff. When I'm done, I've got a branch to work with, and I merge/rebase/cherry-pick. Let's break up the pieces:

First, my shell script is in my favorite shell scripting language, dependency-less python3. Python3 has the advantage of being sufficiently available without making you think about dependencies.

Second, I have a customized Dockerfile with the dependencies my projects need. I don't minimize the container; I add all the things I want. Browsers, playwright, subtrace, tmux, etc.

Third, I cross-mount my git repo itself into the container, and create a worktree inside the container. From the outside, this worktree is going to look "prunable", but that causes no harm, and there’s a new branch that corresponds to the agent’s worktree. I like worktrees more than remotes because they’re in the same namespace; you don’t need to "fetch" or "push" across them. It’s easy to lose changes when the container exits, so I commit automatically on exit. It's also easy to lose the worktree if something prunes it on your behalf, but recovery is possible with some fiddling.

Fourth, I run tmux inside the container so that opening a shell in the container is as simple as opening a new pane. (Somehow, everything else is too rich.) I'm used to sketch.dev's terminal pane to do the little git operations, take a look at a diff, run a server... tmux helps.

Fifth, networking magic with Tailscale. I publish ports 8000-9999 (and 11111) on my tailnet, using the same randomly generated name as I've used for my container and my branch. You're inevitably working on a web app, and you inevitably need to actually look at it. Docker networking is doable, but you have to pre-declare exposed ports, and avoid conflicts, and ... it's just not great for this use case. There are other solutions (ngrok, SSH port forwarding), but I already use Tailscale, so this works nicely. I originally started with tsnsrv , but then vibe-coded a custom thing that supports port ranges. Tailscale's userland networking library does the heavy lifting, and the agents do a fine job one-shotting this stuff.

Sixth, I expose my tmux session to my browser over the tailnet. I'm used to having a browser tab per agent, and this gives me that. (Terminal-based agents feel weird to me. Browsers are great at scrolling, expand/collapse widgets, cut and paste, word wrap of text, etc.)

Seventh, I vibe-coded a headless browser tool, wrapping the excellent chromedp library, which remote-controls a headless Chrome over its debugging protocol. Getting the MCPs configured for playwright was finicky, especially across multiple agents, and I'm experimenting with this command line tool to do the same.

As I’ve written about before, using agents in containers gives me two things I value: Isolation for parallel work. The agents can start processes and run tests and so forth without conflicting on ports or files. A bit more security. Even the Economist has now picked up on the Lethal Trifecta (or Simon Willison's original). By explicitly choosing which environment variables I forward, and not sharing my cookies and my SSH keys, I’m exerting some control over what data and capabilities are exposed to the agent. We’re still playing with fire (can you break out of Colima? Sure! Can you edit my git repo? Sure! Break into my tailnet? Sorta.), but it’s a smaller, more controlled burn.
If you want to try my nonsense, see https://github.com/philz/ctr-agent .
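The actual script lives in the repo above; a stripped-down sketch of the launch sequence (the word lists, image name, and paths are illustrative stand-ins, not the real script's) looks like:

```python
import random

ADJECTIVES = ["amber", "brisk", "coral"]
NOUNS = ["otter", "maple", "comet"]

def docker_args(repo_path, image="ctr-agent"):
    # One random name shared by the branch, the container, and the
    # tailnet hostname keeps everything correlated.
    name = f"{random.choice(ADJECTIVES)}-{random.choice(NOUNS)}-{random.randrange(100)}"
    args = [
        "docker", "run", "--rm", "-it", "--name", name,
        # Mount the repo itself; a worktree on a new branch named
        # after the container is created inside.
        "-v", f"{repo_path}:/host-repo",
        "-e", f"AGENT_BRANCH={name}",
        image, "bash", "-lc",
        f"git -C /host-repo worktree add -b {name} /work && cd /work && tmux",
    ]
    return name, args
```

From the host's perspective the `/work` worktree looks prunable (its path only exists inside the container), but the branch survives the container.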

1 view
blog.philz.dev 2 months ago

Pipefail Fail

What the?!?! That should succeed. We're printing the value, and surely grep is finding it. Turns out that SIGPIPE (signal 13) surfaces as exit status 141 (128 + 13), and the non-zero exit code is because the earlier part of the pipeline is failing with SIGPIPE. There are lots of solutions, but the simplest one is to not use grep -q, which not only means "quiet" but also exits at the first match, causing the upstream failure.
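A minimal reproduction (the original post's exact commands aren't reproduced here; `seq` stands in for the producer):

```shell
#!/usr/bin/env bash
set -o pipefail

# grep -q exits at the first match; seq keeps writing, fills the pipe
# buffer, gets SIGPIPE (signal 13), and dies with status 141 (128 + 13),
# so the pipeline "fails" even though the match was found.
seq 1 1000000 | grep -q 5
echo "with -q: $?"

# Letting grep consume the whole stream avoids the SIGPIPE.
seq 1 1000000 | grep 5 > /dev/null
echo "full read: $?"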

0 views
blog.philz.dev 3 months ago

Itsy Bitsy Agent Bookmarklet (or, adding an agent to a PyBricks simulator)

As we recall from my earlier post, an agent is just 9 lines of code . So, I built my own agent (by instructing the sketch coding assistant to follow my blog post!) that embeds itself in any web page via a bookmarklet. (Content-Security-Policy headers can prevent it from working.) You bring your own Anthropic API key (which I promise not to steal, though the target web page could, with some effort), and, voila. Try it at this link: The Itsy Bitsy Agent Bookmarklet . This demos well with a video, in which we give a Lego simulator an agent. Saying that "I built" this is a bit of an exaggeration. I used a variety of LLM coding agents, but mostly the one I work on, Sketch , with Claude Sonnet 4.0 as the underlying model.

0 views
blog.philz.dev 4 months ago

Shell Trap

Should you fall into the trap of having a load-bearing shell script, perhaps this will help: And then you can play three truths and a lie: My rule of thumb is that once you get to 100 lines of shell, it's time to move on.
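A sketch of the sort of defensive preamble this calls for (the post's exact snippet may differ; the error message is mine):

```shell
#!/usr/bin/env bash
# Exit on errors, on unset variables, and on any failure in a pipeline.
set -euo pipefail

# Say where things went wrong before dying.
trap 'echo "$0: failed at line $LINENO" >&2' ERR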

0 views
blog.philz.dev 4 months ago

Infrastructure as Code for Grafana Dashboards

This post came about from some work I and others did at Resolve.ai ; check them out for your agentic on-call needs! I'm sharing it with you with their kind permission. Checking dashboards in as code gets you all the usual "infrastructure as code" advantages: for loops, variables, version control, consistency. The essence of a dashboard is the queries that power the visualizations. More often than not, the visualizations themselves are similar across many queries. Writing the dashboards as code lets you focus on the essence—the queries—and re-use the styling. This post does that with Grafana and TypeScript. I chose TypeScript to define the dashboards, so as to embed them within a language and tooling ecosystem we already know well. (Others may choose the Terraform provider or JSONnet .) TypeScript's type system and language server are a real advantage in working with Grafana's APIs, because good types exist for the surface area. Grafana's Foundation SDK has types for many Grafana dashboard concepts, as well as examples . The JSON model for dashboards is documented as part of Grafana's API documentation .

A Grafana dashboard has 3 main components:

Rows . These visually separate groups of metrics.

Panels . These are the visualizations you place in your dashboard. The Grafana grid is composed of 24 columns, and each "height" unit represents 30 pixels. The grid has negative gravity, which means that a panel slides upwards to empty space, like an upside-down game of Tetris. If you want three charts per row, you use a width of 8, and if you want two, use a width of 12. Using "4! = 24" as a basis gives the chart lots of divisors for layout options! We've found that if you just specify a height and width in your panels, Grafana lays them out in order nicely enough.

Variables . These appear at the top and can be used to drill down to specific instances of your infrastructure. They're used within the individual PromQL queries, and Grafana does a great job of letting you specify a metric to grab the possible values.

The panels are where all the action is, and there are many, many panel types, including the very popular "timeseries", "text", "piechart", and so forth. For the most part, Grafana's JSON system has sensible defaults, so you don't need to specify all possible properties. This is a big win of using the JS bindings over checking in the "expanded" JSON directly. (We've found that the Cloudwatch panel is pretty picky and doesn't work if you don't specify nearly everything.) Now that we sort of understand Grafana's nouns, we can build out a dashboard in code. It's very likely that you want a lot of panels of all the same type, so you define a small helper and invoke it many times. If you need to do advanced things, you can do one manually in the UI, and then find the "Inspect…Panel JSON" action on every Grafana panel to dig in. Most of your dashboards will look something like this. The rest is boilerplate at the per-dashboard and overall layers. We can now look at the end-to-end example, annotated slightly. You’ll need a Grafana bearer token to run this against your instance. Here are the key files you'll need:

- Dependencies and npm scripts
- TypeScript configuration
- The main script (shown below)
- (optional) ESLint setup for TypeScript

This is the dashboard it generates: Here's the TypeScript code that generates this dashboard: Happy monitoring! Coding agents are great at modifying the code above. Give your favorite (I'm partial to Sketch ) agent the Grafana keys, and let it do its thing. A dashboard is, in essence, an array of (title, query) pairs: you can get pretty close to that essence. The styling of the panels within that dashboard is typically common! Use the sample code below to programmatically create Grafana dashboards and alerts. Use your company's common programming language to define dashboards for great developer ergonomics and lower barriers to entry. Grafana dashboard panels play an upside-down game of Tetris!
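A sketch of the "array of (title, query) pairs" idea, building raw dashboard JSON by hand rather than via the Foundation SDK (the field names follow Grafana's JSON model; the helper, dashboard title, and metric names are mine):

```typescript
type Spec = { title: string; query: string };

// Three timeseries panels per row: width 8 of the 24-column grid,
// height in 30-pixel units.
function panelsFor(specs: Spec[], width = 8, height = 8) {
  return specs.map((s, i) => ({
    type: "timeseries",
    title: s.title,
    // Negative gravity packs panels upward, so x/y only need to be
    // roughly right.
    gridPos: {
      w: width,
      h: height,
      x: (i * width) % 24,
      y: Math.floor((i * width) / 24) * height,
    },
    targets: [{ expr: s.query, refId: "A" }],
  }));
}

const dashboard = {
  title: "Service overview",
  panels: panelsFor([
    { title: "QPS", query: "sum(rate(http_requests_total[5m]))" },
    { title: "Errors", query: 'sum(rate(http_requests_total{code=~"5.."}[5m]))' },
    { title: "p95 latency", query: "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))" },
  ]),
};
```

All the styling lives in one helper; the dashboard itself stays close to its essence, the (title, query) list.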

0 views
blog.philz.dev 6 months ago

The Unreasonable Effectiveness of an LLM Agent Loop with Tool Use

My co-workers and I have been working on an AI Programming Assistant called Sketch for the last few months. The thing I've been most surprised by is how shockingly simple the main loop of using an LLM with tool use is: There's some pomp and circumstance to make the above work ( here's the full script ), but the core idea is the above 9 lines. Here, the key function sends the system prompt, the conversation so far, and the next message to the LLM API. Tool use is the fancy term for "the LLM returns some output that corresponds to a schema," and, in the full script, we tell the LLM (in its system prompt and tool description prompts) that it has access to a single, general-purpose tool. With just that one very general-purpose tool, the current models (we use Claude 3.7 Sonnet extensively) can nail many problems, some of them in "one shot." Whereas I used to look up an esoteric git operation and then cut and paste, now I just ask Sketch to do it. Whereas I used to handle git merges manually, now I let Sketch take a first pass. Whereas I used to change a type and go through the resulting type checker errors one by one (or, let's be real, with some ridiculousness), I give it a shot with Sketch. If appropriately prompted, the agentic loop can be persistent. If you don't have some tool installed, it'll install it. If your tool has different command-line options, it adapts. (It can also be infuriating! "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.) For many workflows, agentic tools specialize. Sketch's quiver of tools is not just that one tool, as we've found that a handful of extra tools improve the quality, speed up iterations, and facilitate better developer workflows. Tools that let the LLM edit text correctly are surprisingly tricky. Seeing the LLM struggle with one-liners re-affirms that visual (as opposed to line) editors are a marvel.
I have no doubt that agent loops will get incorporated into more day-to-day automation tedium that's historically been too specific for general-purpose tools and too esoteric and unstable to automate traditionally. I keep thinking of how much time I've spent correlating stack traces with git commits, and how good LLMs are at doing a first pass on it. We'll be seeing more custom, ad hoc, throw-away LLM agent loops in our directories. Grab your favorite bearer token and give it a shot. Also published at https://sketch.dev/blog/agent-loop .
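The nine lines themselves are in the linked script; in spirit (with `call_llm` and `run_tool` as stand-ins for the API call and the tool execution, and a simplified message shape), the loop is:

```python
def agent_loop(call_llm, run_tool, messages):
    # Keep talking to the model until it stops asking for tools.
    while True:
        reply = call_llm(messages)       # system prompt + history go along
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:
            return reply                 # final answer, no more work
        for call in tool_calls:
            # e.g. run a shell command and feed the output back in
            messages.append({"role": "tool", "content": run_tool(call)})
```

Everything else — schemas, retries, prompts — is the pomp and circumstance around this.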

0 views
blog.philz.dev 1 year ago

LLM Log

We're all trying to build some intuition on what does and doesn't work when using LLMs.

- Making CSS counters display in hex rather than decimal. (Claude)
- Converting a C header struct into another language's struct. (Claude)
- In-editor, converting a value into something that rounds to two decimal points (and uses a format string). (Continue.dev)
- Converting AppleScript to JavaScript. (Claude)
- Reading a calendar file and telling me what events were in it. ( llm )
- Generating a web page to print envelopes. (Val.town: https://philz-tendermagentacod.web.val.run ; Claude: https://philz-static.web.val.run/static/envelope )
- Changing the first character of a few lines (like I would usually do with regular expressions). (Copilot)
- Filling in some variables into a SQL query. (ChatGPT)
- Understanding the difference between memory usage reported by Activity Monitor and by other tools. (Claude)
- Building a website that authenticates with Passkeys. (val.town)

0 views
blog.philz.dev 1 year ago

tool report: spr

I've been using spacedentist/spr ( docs ) (not to be confused with another tool of the same name) to send PRs. If you like Gerrit's model of iterating by amending a single commit, you may like spr. Phabricator's tooling is also similar. Behind the covers, when spr is invoked (on HEAD), it creates a PR and puts a pointer to that PR in the form of text in HEAD's commit message. The PR actually points to a hidden branch that spr is managing. When you invoke spr again on your amended commit, it updates the PR by creating new synthetic commits in its branch. GitHub, which sometimes likes to lose review comments if you force-push your PR branch, is none the wiser. See this issue comment for a more accurate explanation. My full workflow involves a wrapper which lets me choose any commit and invoke spr on that one by abusing an interactive rebase.

0 views
blog.philz.dev 1 year ago

Exporting Language Server Data to SQL

Let's do what we always do... let's export the data from our language server into a SQL database. After all, the language server has all of this information, but its query language is a tedious JSON-RPC situation. This post was an excuse for me to learn a bit more about language servers; come along for the ride! Much of our mutual drudgery is slurping data from one end of our systems to another through a tiny, awkwardly-shaped straw. Some examples, to wit: compared to OS X's infrastructure for listing processes , the file system is lovely; we have tools for querying file systems! Let me know if you've got items to add to my handy table. Microsoft, with Visual Studio Code , led the way in standardizing language servers. They describe their history on github . The specification is at https://microsoft.github.io/language-server-protocol/ . With a nod to XKCD 927 , VSCode doesn't actually use a TypeScript Language Server! It uses a "TypeScript Server" (tsserver). The Language Server Protocol evolved as a generalization, but the migration hasn't happened. For this project, I used one of the wrappers that wraps tsserver in a language server. Language servers typically run as a separate process, and communicate with their parent via stdin/stdout pipes . This keeps them behind the scenes, as they're managed by your IDE, but it means it's tricky to see their logs, and, at least for me, it made it a bit tricky to attach a debugger to the language server. Using stdin/stdout, and needing to "read lines" to parse the Content-Length, exposes you to some fun with Python's bytes versus strings, UTF-8 encoding, blocking reads and writes, etc. I'm sure my script breaks if you're not using UTF-8, but surely you're using UTF-8 everywhere . The protocol pushes a Content-Length line, newlines, and then a JSON blob of content. This happens bi-directionally. There are notifications (which go one way and don't expect a response) and requests (which expect a response). The specification has a list of methods.
A reasonable person would use a pre-existing client library with typed support. This blog post was not written reasonably. The protocol is stateful. You pretend to be an editor and say that you "didOpen" a document, and then you can ask for hover annotations. Often, you can script APIs by inspecting what they do in the Network tab, using "Copy as curl", and scripting from there. This is not easy to do for language servers. Sometimes CLI tools for the API are the way to go, but I didn't find much (though maybe I should have tried lsp-cli ). Having said that, I nerd-sniped myself, and here's querying language servers in Node. Note how counting the Content-Length is a pain. The sleep is insidious: the language server is happy to exit whenever its input stream is closed, and Node is asynchronous. (See this note : "Writes may be synchronous depending on what the stream is connected to"... or also this investigation .) Let me know if I've nerd-sniped you , and you figured out how to get rid of the sleep without resorting to reasonable approaches like a proper client library. 🥁🥁 Here's the same data, queried sensibly: So, how should we look at this data? Probably the editors, with their "inlay hints" and their mouse-overs, are doing it right, but maybe like an annotated ("glossed") medieval manuscript? Maybe like my student's copy of The Aeneid , with lots of hard-to-decipher vocabulary notes? Bibliothèque nationale de France, MS Latin 7980, detail of fol. 5v, from medievalcodes.ca ❤️ AP Latin This is a long-winded blog post, so we also get the opportunity to quote Bret Victor : We expect programmers to write code that manipulates variables, without ever seeing the values of those variables. We expect readers to understand code that manipulates variables, without ever seeing the values of the variables. The entire purpose of code is to manipulate data, and we never see the data. We write with blindfolds, and we read by playing pretend with data-phantoms in our imaginations.
One of the all-time most popular programming models is the spreadsheet. A spreadsheet is the dual of a conventional programming language -- a language shows all the code, but hides the data. A spreadsheet shows all the data, but hides the code. Some people believe that spreadsheets are popular because of their two-dimensional grid, but that's a minor factor. Spreadsheets rule because they show the data. It's a stretch, but, in this case, the hidden data is the types the compiler knows but are hidden to us as the readers. And, so, I fed hono into it. (Hono is like Express, but it's the default web framework for val.town ...) https://philz.github.io/language-server-db/ and https://github.com/philz/language-server-db have the Python script, a full page of sample output, etc.
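The framing itself is simple enough to hand-roll. A sketch of writing and reading one message (byte-counting per the LSP base protocol; synchronous, and my own helper names):

```python
import json

def frame(payload: dict) -> bytes:
    # Content-Length counts the *bytes* of the JSON body, not characters.
    body = json.dumps(payload).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

def unframe(stream) -> dict:
    # Read header lines until the blank separator line, then read
    # exactly Content-Length bytes of JSON.
    length = None
    while True:
        line = stream.readline().strip()
        if not line:
            break
        name, _, value = line.partition(b":")
        if name.lower() == b"content-length":
            length = int(value)
    return json.loads(stream.read(length))
```

Wire this to the server's stdin/stdout pipes, send `initialize` and `textDocument/didOpen`, and you can start asking for hovers.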

0 views
blog.philz.dev 1 year ago

Safari Top, Part 2

I posted recently about getting the top memory-using tabs from Safari. This is the sort of pickle you get into if you're using a laptop with only 8GB of RAM. There are two problems: (1) how to map tabs to process ids and (2) how to get the memory usage of the underlying processes. Once you enable a debug option, AppleScript works well enough to get the mapping of tabs to process ids, but, crucially for the second problem, the obvious tools were underreporting memory usage. For example, Claude is reportedly using 1GB of memory, but they report just 1MB. This led me down a rabbit hole of finding a better command, and seeing memory usage more in the 1GB ballpark. I learned about the technique from Julia Evans' blog post and went on a little bit of a detour to try to replicate it. It turns out that to get a "mach port" you need several "Entitlements". So, you make the Rust work, figure out how to sign it, and, voila, it still doesn't work . Safari is protected by System Integrity Protection and doesn't allow you to open a mach port to it. So, back at square two, we find out about the right libproc call, find the header files, and use Python's ctypes package. The memory footprint field seems to match what Activity Monitor says. (The documentation is sparse, and I haven't delved deeper.) The reason to use Python rather than compiling a binary is to avoid a compile or installation step. So, here's the result: Here's the Python code : This time, I converted the AppleScript into "JavaScript for Automation" (JXA), and learned that the Script Editor app has an "Open Dictionary" feature which lets you browse what's possible. If you find out how Activity Monitor actually gets the pids of the tabs, let me know!

0 views
blog.philz.dev 1 year ago

Finding top memory consumers in Safari

Update: see also part two . Activity Monitor manages to do this, but it's not clear what hooks it has into asking Safari for the mapping between URLs and pids. If you enable a debug option to append pids to tab names, you can get at it with AppleScript, like so:

0 views
blog.philz.dev 1 year ago

CI Performance Debugging

A friend of mine asked me to look at why their GitHub Actions CI workflow was slow. The punchline was that their self-hosted GitHub runner (on AWS EC2) had too few IOPS available to it, and, as a result, was waiting around for the EBS volume quite a bit. The monitoring tool showed a highly utilized disk in a nice red color, so, fine, we figured it out. I pointed Bazel at faster local storage, and suddenly we were 3x faster. Logging into a far-away machine can be tricky. Turns out the following should work: Simon Willison has also described this on his blog , and there's a GitHub action called action-tmate that does the same. In my case, however, this didn't work! It turned out that the GitHub runner's user had a nologin shell, and the session was refusing to start. The error confusingly said: If you ran it directly, it would confusingly spew out: The underlying problem was that it couldn't create a login session because the shell was set to nologin. To work around this, change the shell or run as root, or use Tailscale or one of the AWS Connect options... When I went rooting around in the Bazel setup, I found that the somewhat hidden profile file is, in fact, in the Chrome trace event format. So, if you're using Bazel, you have a nice graph of your CPU usage (as well as a timeline visualization of your tests) already! In the first (slower) run, the CPU graph goes up and down. Meanwhile, in the second (faster) run, the CPU graph is pegged once the (parallelized) tests start, and it remains pegged.

0 views
blog.philz.dev 1 year ago

direnv for Node and Python

I use direnv to manage per-directory environment variables, which comes in handy when a project needs its own Python or Node (or whatever) environment. Install it from your package manager of choice. Then hook it into your shell like so: One time, install Node like the following. I use a version manager to pull in whatever version I want, but the important thing is the target directory and naming scheme. Add the following globally (e.g., to your direnvrc): And then, here's the per-project Node incantation of .envrc: See direnv-stdlib(1) for more details. Python is a bit different. Here's the .envrc: It creates the virtualenv for you!
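As a sketch of the two per-project files (the directory layout under `~/.local` is my assumption; `PATH_add` and `layout` come from the direnv stdlib):

```shell
# Node project's .envrc -- point PATH at a specific unpacked Node.
# Assumes versions live at ~/.local/node/node-v<version>-<os>-<arch>/bin.
PATH_add "$HOME/.local/node/node-v20.11.0-darwin-arm64/bin"
PATH_add node_modules/.bin

# Python project's .envrc -- direnv's stdlib creates and activates
# a virtualenv for you.
layout python3
```

Run `direnv allow` after editing either file; direnv refuses to load an `.envrc` it hasn't been told to trust.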

0 views
blog.philz.dev 1 year ago

Observability in Trouble

Writing down and sharing a tools and tricks that got us out of a jam. These tools and jams are quite generic, and I wouldn't hesitate to re-implement them in new contexts. A specific kind of request for a very specific customer was really, really slow. We could see the slowness in the logs (hey, why is that taking 30+ seconds?), but we couldn't tell why, or what had changed. An incident formed, and folks started combing through commits. and all that. This particular system was NodeJS-based, and it turns out that if you send (via something like ) to a NodeJS process, it will start listening to the debugger. It'll spew something like the following to stderr: Then, you use some incantations to tunnel your local port 9229 to production's 9229, and then you can use to attach Chrome's debugger to your remote process. And then you use the simplest of profilers: you hit pause, note the stack trace, and hit play. And you do that a few more times. And it's very surprising that the stack trace is always the same: that's your hot spot! Once we had the stack trace, and we could inspect some variables and arguments for state, we quickly realized what was "accidentally quadratic" (hat tip to the Accidentally Quadratic blog ) and the rest was history... ( Ctrl-C Profiling is in the same spirit.) The above--connecting to production, SSH tunnels, clicking around in your debugger--is a bit imposing. It can be finnicky. It should be gated with permissions and processes. Instead, set up something that allows you to start a profiler via some HTTP path (preferably gated with permissions), possibly via a magic header. Have that profiler run for, say, 60 seconds, and dump its output to S3. Log the path to a proxy that will let you download the profile to your logging system. Now, you can trigger a profile, trigger the errant action, and analyze the profile, all from the convenience of home! 
Since profiles only have function names and timing information, they are, unlike customer data, allowed to be on your dev machine. If you want to go one step further, integrate directly with Speedscope and serve the profiler UI directly. See Stripe's write-up on Canonical Log Lines . Have a structured log line for every request and point your log analysis tool (e.g., Kibana) at it. Just looking for the slowest requests can often pin point a source of trouble. Having p95 latencies in your monitoring stack is nice and all, but, at the end of the day, you need to find some actual requests that experienced those latencies, and it sure is handy when that's easy. If your app talks to a database, a common source of latency and load is the SQL queries made in responding to a request. If you have a misbehaving query, it's hard to figure out what code path is invoking it (especially if it's in a library or an ORM layer is involved). If you have a slow request, it's hard to to know which queries are slowing it down. Sure, you can keep a query counter and a query latency total in your canonical log line, but you still need to figure out which queries are at the heart of the issue. So, for every Nth request (N=10,000 is reasonable), log all the queries that request does! If you have enough QPS flowing through your system, you'll have a pretty effective sample of all those queries in your log system to do further analysis. If you have a tracing header in your system (e.g., OpenTelemtry's ), you can tie the sampling to the trace-id. If you're comfortable with the security implications, you can, perhaps, learn how to inject a trace-id that will make sure that your particular request that you're making right now with is one of the Nth requests that gets sampled. It's often useful to know the stack trace that triggered a SQL query to be executed. Taking stack traces (via or the like) can be expensive in your runtime. 
Sampling can come to the rescue again: do it every Mth request that's already sampled, and your logs will give you stack traces. I've also, in the past, built a CI job that annotated all SQL queries and shoved them into a SQLite file (viewable, e.g., with https://datasette.io). That worked, but logs turned out more useful. You can also put the stack trace into a query comment, as suggested by Henry .

Sometimes, you want to sneak a bit of data into your profile. For a made-up example, say you're executing a query, and you want the generic query-execution frame in your profile to instead name the table being scanned, so that you can discern which table is hot. In a sufficiently dynamic language (Java, Python, JS all qualify), you can sneak a string into a profile by generating a function whose name contains that string. Note that Content Security Policies can disable the use of `eval`/`new Function`, and with it this trick.
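A minimal sketch of the dynamic-name trick in Python (the names `call_with_frame_name` and `scanning_table_users` are mine; the `exec` here is exactly the kind of dynamic code generation a CSP would forbid in a browser):

```python
import traceback

def call_with_frame_name(frame_name, fn):
    """Run fn() under a stack frame whose name is frame_name, so the
    name shows up in stack traces and sampling profiles."""
    namespace = {}
    # exec a one-off wrapper function whose *name* is frame_name.
    exec(f"def {frame_name}(fn):\n    return fn()", namespace)
    return namespace[frame_name](fn)

def current_stack():
    # Stand-in for "what a profiler would sample at this point."
    return ";".join(f.name for f in traceback.extract_stack())

stack = call_with_frame_name("scanning_table_users", current_stack)
print("scanning_table_users" in stack)
```

In real use you'd cache the generated wrappers (one per table name or request id), since `exec` on every call is not free.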

0 views
blog.philz.dev 1 years ago

pybricks

Pybricks is excellent. Through a web-based IDE that supports both Python and block-based programming (the blocks cost extra, compile down into Python, and use Google's Blockly ), you can program your Lego hubs and motors. The household winner was controlling the Lego remote-controlled cars (e.g., Off-Road Buggy ) with an Xbox controller. Here's the Xbox-controlled car with a record-replay capability that runs out of memory, to give you a sense. Some notes:

- Safari doesn't work, but Chrome does.
- The IDE first prompts you to flash the firmware of the controllers over Bluetooth, and then you connect to the hub and run a program on it. The MicroPython seems to be running on the hub itself.
- I've not figured out the exact memory limitations, but a naive "macro record-replay" program ran out of memory if the macro was too long.
- There's a `Car` abstraction that takes as input the steering and drive motors, and lets you implement a car more simply.


A Bibliography of Sorts and Some Quotes

These were impactful to me, one way or another. Did you just tell me to... is a classic from @jrecursive. Migrations are a fact of software engineering life, and this, by Manu Cornet , is on point. Julia Evans's comics and zines are a national treasure. I learned some new options to old tools! I learned about CSS! I've shared the post on SQL queries don't start with SELECT many times!

Google published The Standard of Code Review , which includes the following:

"In general, reviewers should favor approving a CL once it is in a state where it definitely improves the overall code health of the system being worked on, even if the CL isn’t perfect."

I learned a lot from the ggplot2 book . I've followed Jeff Heer's work, including Vega Lite, and you could do much worse than reading some of it. Matt Eccleston wrote a blog post on code-centric versus product-goal-centric teams . My friend Dan wrote Effective Typescript .

Sometimes, coming up with the right approach to testing a problem is the key to solving the problem. Don't take just my word for it; here's FoundationDB:

"Testing and debugging distributed systems is at least as hard as building them. Unexpected process and network failures, message reorderings, and other sources of non-determinism can expose subtle bugs and implicit assumptions that break in reality, which are extremely difficult to reproduce or debug. The consequences of such subtle bugs are especially severe for database systems, which purport to offer perfect fidelity to an unambiguous contract. Moreover, the stateful nature of a database system means that any such bug can result in subtle data corruption that may not be discovered for months. FDB took a radical approach—before building the database itself, we built a deterministic database simulation framework that can simulate a network of interacting processes and a variety of disk, process, network, and request-level failures and recoveries, all within a single physical process. This rigorous testing in simulation makes FDB extremely stable, and allows its developers to introduce new features and releases in a rapid cadence."

When I approach a new code base, I look first at what happens on disk (typically, the database schemas) and what happens across the wire (the RPC definitions like protobuf or thrift files, OpenAPI/swagger/typescript schemas, clicking around in the network tab in Chrome). This quote, from The Mythical Man-Month (Brooks), strikes a chord:

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."

Tracing. I had the privilege of using Dapper while at Google. X-Trace is similar. The ad-hoc version of this is to add a random identifier to log statements (especially "canonical log lines") and pass those identifiers along across RPC boundaries (e.g., via an HTTP header). Add canonical log lines (thanks, Stripe and Brandur Leach, for the write-up) to your system. If you can make it queryable with SQL, you have a lovely analytics system! Nelson Elhage's "What does a cache do?" is a great discussion of why read replicas and caches are different.

E-mail me if you have pointers to these! A history of UI library APIs: how did we get from Apple II graphics to React? These libraries seem so dynamic; I'd love to read a history! A picture of the Cauldron visualization for a CDH build. (This is very specific!)


Drawing flamegraphs with DuckDB and Vega-Lite

Having spent a good deal of time staring at profiles, I wanted to explore what sampling profiles and Flame Graphs are in a bit more detail. Brendan Gregg's ACM article on flamegraphs describes them: "A flame graph visualizes a collection of stack traces (aka call stacks), shown as an adjacency diagram with an inverted icicle layout." The dataset produced by a sampling profiler is quite simple: it is a collection of call stacks. If a function appears frequently, it's hot, and may be a target for optimization. The simplest analysis might be to just look for common functions with `sort | uniq -c` (or the like), but visualizing the data helps weed out the frequent, correlated, unimportant bits. (A "Poor Man's Profiler" (e.g., jpmp for Java) may be as simple as calling a stack dumper like `jstack` in a loop.) Brendan Gregg's `flamegraph.pl` takes a "semi-colon delimited stack with count" text file and produces an SVG. Can we do the same, but use DuckDB for the data munging and Vega-Lite (as wrapped by Altair) for the visualization? Sure we can! For this example, we've borrowed a sample stack trace from Speedscope's repo . (The full notebook is on Jupyter Notebook Viewer .)

Load our file into DuckDB by pretending it's a CSV file. We use "|" as the delimiter, which doesn't exist in the data, so we get one row per line. We use the `row_number()` window function to assign ids to our stacks. Now split out the weight and the stack. We parse the stack into an array. Using `unnest`, we create one row per stack frame. Two `unnest`s in the same statement pair up their results, which we use to get the frame index, which we call `depth`. We also record the ancestry. This took a bit of sleuthing to find. In PostgreSQL, the equivalent is `WITH ORDINALITY`. (Originally, I wrote a recursive CTE to build out the ancestry, but given array functions and depth, we don't need to!) Now we can finally start computing the layout. First, for a given ancestry, we compute the cumulative weights. In one dimension, this is like a stacked bar chart. Finally, we can visualize this with Vega-Lite. Here I'm using Altair as a Python front-end.
I'm inverting depth because I slightly prefer "icicle" graphs, where the stacks are drooping from the ceiling. Vega-Lite does not seem to be able to truncate text marks to a dynamic size based on the width of the bar I wish to fit the text into. Please let me know if you've got a better solution for this, but I've resorted to doing the truncation here in Python land. (Nothing prevents us from doing it in SQL.) Implementing zoom and testing at scale are an exercise left to the reader. As is hoisting this into a web app. As is implementing "time order" graphs: instead of cumulative sums, you need to merge adjacent, identical frame rectangles, by identifying runs using window functions. Evidently this is called the "Gaps and Islands" problem. Google's Perfetto works by keeping its data in an in-memory columnar SQL database (see documentation ).

If a profile is just a collection of stack traces, you can combine them by concatenating. (For pprof profiles, one such tool is pprof-merge . For our semi-colon format, `cat` works fine.) You can also diff them with subtraction, and differential flame graphs are a thing. Firefox profiler has tooling for comparison as well.

Let's take a very mini recursive implementation of gron that is deliberately slow. If we profile it, we should see that the recursive function is, well, really slow. I used node-pprof to profile this. (node-pprof uses the normal V8 profiling bindings, and then converts from the Node profile representation to pprof's protobuf representation.) Let's look at the pprof graph and the flame graph. The pprof graph draws an edge between pairs of callers and callees. We spend all of our time in the recursive function, as you'd expect. The flame graph, however, shows it multiple times, because we invoke the function at different recursion depths. Both representations are useful: sometimes you have a hot function, but it's only hot when it comes in via a particular stack trace, and you need to figure that out.
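The original snippet isn't in this excerpt, but a mini recursive gron might look something like this Python stand-in, minus the deliberate slowness (the post profiled a NodeJS version; `gron_lines` is my name). The recursion is what produces the repeated frames at different depths in the flame graph.

```python
import json

def gron_lines(value, path="json"):
    """Naive recursive gron: flatten a JSON value into assignment lines."""
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path} = {{}};")
        for key, child in value.items():
            lines += gron_lines(child, f"{path}.{key}")
    elif isinstance(value, list):
        lines.append(f"{path} = [];")
        for i, child in enumerate(value):
            lines += gron_lines(child, f"{path}[{i}]")
    else:
        # json.dumps gives gron-style scalars: true, null, "str", 42.
        lines.append(f"{path} = {json.dumps(value)};")
    return lines

doc = {"a": [1, {"b": True}]}
print("\n".join(gron_lines(doc)))
```

The repeated `lines +=` copying on the way up is one easy way such a function gets "accidentally quadratic" on deep inputs.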
Go read Nelson Elhage's "Performance engineering, profilers, and seeing the invisible" . ProfilerPedia connects all the world's profilers and their data formats. The typical profiler schema is very simple (a collection of stack traces, maybe with some timestamps and filename/line number metadata), but there are many, many implementations, file formats, and analysis tools. [Speedscope's schema description](https://github.com/jlfwong/speedscope/wiki/Importing-from-custom-sources) is a reasonable starting point. Speedscope is my go-to Flamegraph-esque visualizer. It sports both a Flamegraph view and a "time order" view.

Sneaking in metadata. For the most part, profiles contain only function names and statistics. You can, however, sneak in some custom metadata if you'd like. Let's say you want to associate part of a profile with a request. Somewhere, you have a request-handling method, but you'd really like to know much more about that request from your logs. If you're using a dynamic language, you can create a no-op passthrough function whose name carries a string like SNEAKY into the stack trace. This will show up just fine in profiles and so forth. (You can do this in JavaScript, Python, and Java for sure; doing it in C might be a fun blog post for someone.) You can do this with a request id or the like. Whether the overhead is significant depends on what you're up to and how it compares to the request lifecycle.

Friction to getting a profile is an enemy. Typically, to get a profile, you have to SSH into production, install some tools, jump through a few containers, edit configs, take the profile, and then exfiltrate the result from production back to your machine for analysis. Yuck. Profiles don't have user data (they have function names and line numbers, maybe!). Could you perhaps set a header on a request and then get a URL to the data back (either directly or via your logging system)? If you have benchmarks, add a flag to automatically produce the profile too.
Maybe even spit it out as part of your CI system with the flamegraph or trace data all ready to go. Even better, perhaps profiling will be always on? I've not used it yet, but Elastic (née Prodfiler) has an offering in this space. Some of these systems do fun eBPF magic to profile everything, and the catch is whether you get decent symbols and line numbers and such... An implication of what we're doing here with DuckDB is that you're not limited to the queries and visualizations that others have built. If you can get your profiles into queryable form, you can build out your own custom queries that make sense for you.
