How I make CI/CD (much) faster and cheaper
Why GitHub Actions runners are slow and how bare metal servers can make your CI/CD 2-10x faster while costing 10x less
TL;DR: for a very long time, GitHub Actions lacked support for YAML anchors. This was a good thing. YAML anchors in GitHub Actions (1) are redundant with existing functionality, (2) introduce a complication to the data model that makes workflows harder for humans and machines to comprehend, and (3) are not even uniquely useful, because GitHub has chosen not to support the one feature (merge keys) that lacks a semantic equivalent in GitHub Actions. For these reasons, YAML anchors are a step backwards that reinforces GitHub Actions' status as an insecure-by-default CI/CD platform. GitHub should immediately remove support for YAML anchors, before adoption becomes widespread.

GitHub recently announced that YAML anchors are now supported in GitHub Actions. That means users can now anchor a block of configuration once and reference it from multiple jobs or steps. On its face, this seems like a reasonable feature: the job and step abstractions in GitHub Actions lend themselves to duplication, and YAML anchors are one way to reduce that duplication. Unfortunately, YAML anchors are a terrible tool for this job. Furthermore (as we'll see), GitHub's implementation of YAML anchors is incomplete, precluding the actual small subset of use cases where YAML anchors are uniquely useful (but still not a good idea). We'll see why below.

(Pictured: the author's understanding of the GitHub Actions product roadmap.)

The simplest reason why YAML anchors are a bad idea is that they're redundant with other, more explicit mechanisms for reducing duplication in GitHub Actions. GitHub's own example could be rewritten without YAML anchors by hoisting the shared environment variables into a workflow-level env block. That version is significantly clearer, but has slightly different semantics: all jobs inherit the workflow-level env. But this, in my opinion, is a good thing: the need to template environment variables across a subset of jobs suggests an architectural error in the workflow design. In other words: if you find yourself wanting to use YAML anchors to share "global" configuration between jobs or steps, you probably actually want separate workflows, or at least separate jobs with job-level env blocks.

In summary: YAML anchors further muddy the abstractions of workflows, jobs, and steps, by introducing a cross-cutting form of global state that doesn't play by the rules of the rest of the system. This, to me, suggests that the current Actions team lacks a strong set of opinions about how GitHub Actions should be used, leading to a "kitchen sink" approach that serves all users equally poorly.

As noted above: YAML anchors introduce a new form of non-locality into GitHub Actions. Furthermore, this form of non-locality is fully general: any YAML node can be anchored and referenced. This is a bad idea for humans and machines alike.

For humans: a new form of non-locality makes it harder to preserve local understanding of what a workflow, job, or step does: a unit of work may now depend on any other unit of work in the same file, including one hundreds or thousands of lines away. This makes it harder to reason about the behavior of one's GitHub Actions without context switching.

It would only be fair to note that GitHub Actions already has some forms of non-locality: global contexts, scoping rules for env blocks, job dependencies, step and job outputs, and so on. These can be difficult to debug! But what sets them apart is their lack of generality: each has precise semantics and scoping rules, meaning that a user who understands those rules can comprehend what a unit of work does without referencing the source of an environment variable, output, &c.
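As a tiny illustration of that kind of action-at-a-distance, here is a hypothetical workflow fragment (the job names and variables are invented, not GitHub's announcement example): an anchored env block defined in one job silently shapes another job that may live far away in the same file.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    env: &shared-env          # anchor defined here...
      CACHE_DIR: /tmp/cache
    steps:
      - run: ./build.sh "$CACHE_DIR"

  # ...potentially hundreds of lines of unrelated jobs...

  release:
    runs-on: ubuntu-latest
    env: *shared-env          # ...silently reused here, far from its definition
    steps:
      - run: ./release.sh "$CACHE_DIR"
```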
For machines: non-locality makes it significantly harder to write tools that analyze (or transform) GitHub Actions workflows. The pain here boils down to the fact that YAML anchors diverge from the one-to-one object model 1 that GitHub Actions otherwise maps onto. With anchors, that mapping becomes one-to-many: the same element may appear once in the source, but multiple times in the loaded object representation.

In effect, this breaks a critical assumption that many tools make about YAML in GitHub Actions: that an entity in the deserialized object can be mapped back to a single concrete location in the source YAML. This is needed to present reasonable source locations in error messages, but it doesn't hold if the object model doesn't represent anchors and references explicitly. Furthermore, this is the reality for every YAML parser in wide use: all widespread YAML parsers choose (reasonably) to copy anchored values into each location where they're referenced, meaning that the analyzing tool cannot "see" the original element for source location purposes.

I feel these pains directly: I maintain zizmor as a static analysis tool for GitHub Actions, and zizmor makes both of these assumptions. Moreover, zizmor's dependencies make these assumptions: its YAML parser (like most other YAML parsers) chooses to deserialize YAML anchors by copying the anchored value into each location where it's referenced 2.

One of the few things that make YAML anchors uniquely useful is merge keys: a merge key allows a user to compose multiple referenced mappings together into a single mapping. The example in the YAML spec tidily demonstrates both their use case and how incredibly confusing merge keys are. I personally find the syntax incredibly hard to read, but at least it has a unique use case that could be useful in GitHub Actions: composing multiple sets of environment variables together with clear precedence rules is manifestly useful.

Except: GitHub Actions doesn't support merge keys! They appear to be using their own internal YAML parser that already had some degree of support for anchors and references, but not for merge keys. To me, this takes the situation from a set of bad technical decisions (and a lack of strong opinions around how GitHub Actions should be used) to farce: the one thing that makes YAML anchors uniquely useful in the context of GitHub Actions is the one thing that GitHub Actions doesn't support.

To summarize, I think YAML anchors in GitHub Actions (1) are redundant with existing functionality, (2) introduce a complication to the data model that makes workflows harder for humans and machines to comprehend, and (3) are not even uniquely useful, because GitHub has chosen not to support the one feature (merge keys) that lacks a semantic equivalent in GitHub Actions.

Of these reasons, I think (2) is the most important: GitHub Actions security has been in the news a great deal recently, with the overwhelming consensus being that it's too easy to introduce vulnerabilities in (or expose otherwise latent vulnerabilities through) GitHub Actions workflows. For this reason, we need GitHub Actions to be easy to analyze for humans and machines alike. In effect, this means that GitHub should be decreasing the complexity of GitHub Actions, not increasing it. YAML anchors are a step in the wrong direction for all of the aforementioned reasons.

Of course, I'm not without self-interest here: I maintain a static analysis tool for GitHub Actions, and supporting YAML anchors is going to be an absolute royal pain in my ass 3.
But it's not just me: tools like actionlint, claws, and poutine are all likely to struggle with supporting YAML anchors, as they fundamentally alter each tool's relationship to GitHub Actions' assumed data model. As-is, this change blows a massive hole in the larger open source ecosystem's ability to analyze GitHub Actions for correctness and security.

All told: I strongly believe that GitHub should immediately remove support for YAML anchors in GitHub Actions. The "good" news is that they can probably do so with a bare minimum of user disruption, since support has only been public for a few days and adoption is (probably) still primarily at the single-use workflow layer and not the reusable action (or workflow) layer.

That object model is essentially the JSON object model, where all elements appear as literal components of their source representation and take a small subset of possible types (string, number, boolean, array, object, null). ↩

In other words: even though YAML itself is a superset of JSON, users don't want YAML-isms to leak through to the object model. Everybody wants the JSON object model, and that means no "anchor" or "reference" elements anywhere in a deserialized structure. ↩

To the point where I'm not clear it's actually worth supporting anchors to any meaningful extent, rather than immediately flagging them as an attempt at obfuscation. ↩
Imagine this scenario: your team uses Bazel for fast, distributed C++ builds. A developer builds a change on their workstation, all tests pass, and the change is merged. The CI system picks it up, gets a cache hit from the developer's build, and produces a release artifact. Everything looks green. But when you deploy to production, the service crashes with a mysterious error. What went wrong?

The answer lies in the subtle but dangerous interaction between Bazel's caching, remote execution, and differing glibc versions across your fleet. In previous posts in this series, I've covered the fundamentals of action non-determinism, remote caching, and remote execution. Now, finally, we'll build on those to tackle this specific problem. This article dives deep into how glibc versions can break build reproducibility and presents several ways to fix it: from an interesting hack (which spawned this whole series) to the ultimate, most robust solution.

The scenario

Suppose you have a pretty standard (corporate?) development environment like the following:

- Developer workstations (WS). This is where Bazel runs during daily development, and Bazel can execute build actions both locally and remotely.
- A CI system. This is a distributed cluster of machines that run jobs, including PR merge validation and production release builds. These jobs execute Bazel too, which in turn executes build actions both locally and remotely.
- The remote execution (RE) system. This is a distributed cluster of worker machines that execute individual Bazel build actions remotely. The key components we want to focus on today are the AC, the CAS, and the workers, all of which I covered in detail in the previous two articles.
- The production environment (PROD). This is where you deploy binary artifacts to serve your users. No build actions run here.

All of the systems above run some version of Linux, and it is tempting to wish to keep that version in sync across them all. The reasons would include keeping operations simpler and ensuring that build actions can run consistently no matter where they are executed. However, this wish is misguided and plain impossible. It is misguided because you may not want to run the same Linux distribution on all three environments: after all, the desktop distribution you run on WS may not be the best choice for RE workers, CI nodes, or production. And it is plain impossible because, even if you aligned versions to the dot, you would need to take upgrades at some point: distributed upgrades must be rolled out over a period of time (weeks or even months) for reliability, so you'd have to deal with version skew anyway.

To make matters more complicated, the remote AC is writable from all of WS, CI, and RE to maximize Bazel cache hits and optimize build times. This goes against best security practices (so there are mitigations in place to protect PROD), but it's a necessity to support an ongoing onboarding into Bazel and RE.

The problem

The question becomes: can the Linux version skew among all machines involved cause problems with remote caching? It sure can, because C and C++ build actions tend to pick up system-level dependencies in a way that Bazel is unaware of (by default), and those influence the output the actions produce. Here, look at this: the version of glibc leaks into binaries, and this is invisible to Bazel's C/C++ action keys.
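For instance, a quick probe of a freshly built binary shows exactly which glibc symbol versions it picked up; a minimal sketch (the binary path is illustrative):

```sh
# List the glibc symbol versions the binary depends on, e.g. GLIBC_2.17, GLIBC_2.28.
# None of this is visible to Bazel's action keys.
objdump -T bazel-bin/app/server | grep -o 'GLIBC_[0-9.]*' | sort -Vu
```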
glibc versions its symbols to provide runtime backwards compatibility when their internal details change, and this means that binaries built against newer glibc versions may not run on systems with older glibc versions. How is this a problem, though? Let's take a look by making the problem specific.

Consider the following environment: the developer workstations (WS) run a distribution with glibc 2.28, while CI-1 and the RE workers run an older one with glibc 2.17. In this environment, developers run Bazel in WS for their day-to-day work, and CI-1 runs Bazel to support development flows (PR merge-time checks) and to produce binaries for PROD. CI-2 sometimes runs builds too. All of these systems can write to the AC that lives in RE.

As it so happens, one of the C++ actions involved in the build of the production binary has a tag which forces the action to bypass remote execution. This can lead to the following sequence of events:

1. A developer runs a build on a WS. The input to that pinned action has changed, so it is rebuilt on the WS. The action uses the C++ compiler, so the object files it produces pick up the dependency on glibc 2.28. The result of the action is injected into the remote cache.
2. CI-1 schedules a job to build the production binary for release. This job runs Bazel on a machine with glibc 2.17 and leverages the RE cluster, which also contains glibc 2.17. Many C++ actions get rebuilt, but the pinned action's output is reused from the cache. The production artifact now has a dependency on symbols from glibc 2.28.
3. Release engineering picks the output of CI-1, deploys the production binary to PROD, and… boom, PROD explodes.

The fact that the developer WS could write to the AC is very problematic on its own, but we could encounter this same scenario if we first ran the production build on CI-2 for testing purposes and then reran it on CI-1 to generate the final artifact.

So, what do we do now? In a default Bazel configuration, C and C++ action keys are underspecified and can lead to non-deterministic behavior when we have a mixture of host systems compiling them.

Let's start with the case where you aren't yet ready to restrict AC writes to the RE workers alone, yet you want to prevent obvious mistakes that lead to production breaks. The idea here is to capture the glibc version that is used in the local and remote environments, pick the higher of the two, and make that version number an input to the C/C++ toolchain. This causes the version to become part of the cache keys and should prevent the majority of the mistakes we may see.

WARNING: This is The Hack I recently implemented and that drove me to write this article series! Prefer the options presented later, but know that you have this one up your sleeve if you must mitigate problems quickly.

To implement this hack, the first thing we have to do is capture the local glibc version; we can do this with a small build action that asks the system for it (the underlying query is sketched below). One important tidbit here is the use of the workspace status file, indirectly via the requirement of stamping. This is necessary to force this action to rerun on every build because we don't want to hit the case of using an old tree against an upgraded system. As a consequence, we need to modify the script pointed at by your workspace status command (you have one, right?) to emit the glibc version as well.

The second thing we have to do is capture the remote glibc version. This is… trickier because there is no tag to force Bazel to run an action remotely. Even if we assume remote execution, features like the dynamic spawn strategy or the remote local fallback could cause the action to run locally at random.
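Wherever the probe ends up running, the version query itself is a one-liner; a minimal sketch (one of several ways to ask the C library for its version, not necessarily the author's exact approach):

```sh
# Prints just the version number of the running system's glibc, e.g. "2.28".
getconf GNU_LIBC_VERSION | awk '{print $2}'
```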
To prevent problems, we have to detect whether the action is running within the RE workers or not, and the way to do that will depend on your environment.

The third part of the puzzle is to select the highest glibc version between the two that we collected. We can do this with another small rule, leveraging a flag that compares version strings. This flag is a GNU extension… but we are talking about glibc anyway here, so I'm not going to be bothered by it.

And, finally, we can go to our C++ toolchain definition and modify it to depend on the version file produced by the previous action. Ta-da! All of our C/C++ actions now encode the highest possible glibc version that the outputs they produce may depend on. And, while not perfect, this is an easy workaround to guard against most mistakes.

But can we do better? Of course. Based on the previous articles, what we should think about is plugging the AC hole and forcing build actions to always run on the RE workers. In this way, we would precisely control the environment that generates action outputs and we should be good to go.

Unfortunately, we can still encounter problems! Remember how I said that, at some point, you will have to upgrade glibc versions? What happens when you are in the middle of a rolling upgrade of your RE workers? The worker pool will end up with different "partitions", each with a different glibc version, and you will still run into this issue. To handle this case, you would need to have different worker pools, one with the old glibc version and one with the new version, and then make the worker pool name be part of the action keys. You would then have to migrate from one pool to the other in a controlled manner. This would work well at the expense of reducing cache effectiveness, imposing a big toll on operations, and making the rollout risky because the switch from one pool to another is an all-or-nothing proposition.

The real solution comes in the form of sysroots. The idea is to install multiple parallel versions of glibc in all environments and then modify the Bazel C/C++ toolchain to explicitly use a specific one. In this way, the glibc version becomes part of the cache key and all build outputs are pinned to a deterministic glibc version. This allows us to roll out a new version slowly with a code change, pinning the version switch to a specific code commit that can be rolled back if necessary, and keeping the property of reproducible builds for older commits. This is the solution outlined at the end of Picking glibc versions at runtime and is the only solution that can provide you 100% safety against the problem presented in this article. It is difficult to implement, though, because convincing GCC and Clang not to use system-provided libraries is tricky and because this solution will sound alien to most of your peers.

The problem presented in this article is far from theoretical, but it's often forgotten about because typical build environments don't present significant skew across Linux versions. This means that facing new glibc symbols is unlikely, so the chances of ending up with binary-incompatible artifacts are low. But they can still happen, and they can happen at the worst possible moment. Therefore, you need to take action. I'd strongly recommend that you go towards the sysroot solution because it's the only one that'll give you a stable path for years to come, but I also understand that it's hard to implement.
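For a flavor of what the sysroot approach boils down to, here is a minimal sketch (the path is invented; in a real setup this would be wired into the Bazel C/C++ toolchain configuration rather than passed by hand):

```sh
# Compile against a pinned copy of glibc instead of whatever the host happens to ship.
clang++ --sysroot=/opt/sysroots/glibc-2.28 -o server server.cc
```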
Therefore, take the solutions in the order I gave them to you: start with the hack to mitigate obvious problems, follow that up with securing the AC, and finally go down the sysroot rabbit hole.

As for the glibc 2.17 mentioned in passing above, well, it is ancient by today's standards at 13 years of age, but it is what triggered this article in the first place. glibc 2.17 was kept alive for many years by the CentOS 7 distribution, an LTS system used as a core building block by many companies and one that reached EOL a year ago, causing headaches throughout the industry. Personally, I believe that relying on LTS distributions is a mistake that ends up costing more money/time than tracking a rolling release, but I'll leave that controversial topic for a future opinion post.
TL;DR: GitHub Actions provides a policy mechanism for limiting the kinds of actions and reusable workflows that can be used within a repository, organization, or entire enterprise. Unfortunately, this mechanism is trivial to bypass. GitHub has told me that they don't consider this a security issue (I disagree), so I'm publishing this post as-is.

Update 2025-06-13: GitHub has silently updated the actions policies documentation to note the bypass in this post: policies never restrict access to local actions on the runner filesystem (where the path starts with ./).

GitHub Actions is GitHub's CI/CD offering. I'm a big fan of it, despite its spotty security track record. Because a CI/CD offering is essentially arbitrary code execution as a service, users are expected to be careful about what they allow to run in their workflows, especially privileged workflows that have access to secrets and/or can modify the repository itself. That, in effect, means that users need to be careful about which actions and reusable workflows they trust.

Like with other open source ecosystems, downstream consumers (i.e., users of GitHub Actions) retrieve their components (i.e., action definitions) from an essentially open index (the "Actions Marketplace" 1). To establish trust in those components, downstream users perform all of the normal fuzzy heuristics: they look at the number of stars, the number of other users, the recency of activity, whether the user/organization is a "good" one, and so forth. Unfortunately, this isn't good enough along two dimensions:

1. Even actions that satisfy these heuristics can be compromised. They're heuristics after all, not verifiable assertions of quality or trustworthiness. The recent tj-actions attack typifies this: even popular, widely-used actions are themselves software components, with their own supply chains (and CI/CD setups).
2. This kind of acceptance scheme just doesn't scale, both in terms of human effort and system complexity: complex CI/CD setups can have dozens (or hundreds) of workflows, each of which can contain dozens (or hundreds) of jobs that in turn employ actions and reusable workflows. These sorts of large setups don't necessarily have a single owner (or even a single team) responsible for gating admission and preventing the introduction of unvetted actions and reusable workflows.

The problem (as stated above) is best solved by eliminating the failure mode itself: rather than giving the system's committers the ability to introduce new actions and reusable workflows without sufficient review, the system should prevent them from doing so in the first place. To their credit, GitHub understands this! They have a feature called "Actions policies 2" that does exactly this. From the Manage GitHub Actions settings documentation:

You can restrict workflows to use actions and reusable workflows in specific organizations and repositories. Specified actions cannot be set to more than 1000. (sic) To restrict access to specific tags or commit SHAs of an action or reusable workflow, use the same syntax used in the workflow to select the action or reusable workflow. For an action, the syntax is . For example, use to select a tag or to select a SHA. For more information, see Using pre-written building blocks in your workflow. For a reusable workflow, the syntax is . For example, . For more information, see Reusing workflows. You can use the wildcard character to match patterns.
For example, to allow all actions and reusable workflows in organizations that start with , you can specify . To allow all actions and reusable workflows in repositories that start with , you can use . For more information about using the wildcard, see Workflow syntax for GitHub Actions. Use to separate patterns. For example, to allow and , you can specify .

GitHub also provides special "preset" cases for this functionality, such as allowing only actions and reusable workflows that belong to the same organization namespace as the repository itself. I set that preset up on a dummy organization and repository of mine, and when I try to violate the policy, e.g. by using an action from outside the organization in a workflow, the workflow run is rejected with a policy error. This is fantastic, except that it's trivial to bypass. Let's see how.

To understand how we're going to bypass this, we need to understand a few of the building blocks underneath actions and reusable workflows. In particular:

1. Actions and reusable workflows share the same namespace as the rest of GitHub 3.
2. When a user writes an ordinary uses: reference in a workflow, GitHub resolves that reference to mean "the action file defined at that tag in that repository".
3. The uses: keyword can also refer to relative paths on the runner itself: for example, a step can run an action checked out in a directory relative to the current working directory.
4. Relative paths from the runner are not inherently part of the repository state itself: the runner can contain any state introduced by previous steps within the same job.

These four aspects of GitHub Actions compose together into the world's dumbest policy bypass: instead of referencing the action directly, the user can clone (or otherwise fetch) its repository into the runner's filesystem, and then use a local path reference to run the very same action. (A rough sketch of the pattern appears after this post's footnotes below.) The actual contents of the step are inconsequential: I just used that repository for the demo, but anything would work. And naturally, it works just fine.

The fix for this bypass is simple, if potentially somewhat painful: GitHub Actions could consider "local" references to be another category for the purpose of policies, and reject them whenever the policy doesn't permit them. This would seal off the entire problem, since local references would simply stop working under a restrictive policy. The downside is that it would potentially break existing users of policies who also use local actions and reusable workflows, assuming there are significant numbers of them 4. The other option would be to leave it the way it is, but explicitly document local references as a limitation of this policy mechanism. I honestly think this would be perfectly fine; what matters is that users 5 are informed of a feature's limitations, not necessarily that the feature lacks limitations.

First, I'll couch this again: this is not exactly fancy stuff. It's a very dumb bypass, and I don't think it's critical by any means. At the same time, I think this matters a great deal: ineffective policy mechanisms are worse than missing policy mechanisms, because they provide all of the feeling of security through compliance while actually incentivizing malicious forms of compliance. In this case, the maliciously complying party is almost certainly a developer just trying to get their job done: like most other developers who encounter an inscrutable policy restriction, they will try to hack around it such that the policy is satisfied in name only. For that reason alone I think GitHub should fix this bypass, either by actually fixing it or at least documenting its limitations. Without either of those, projects and organizations are likely to mistakenly believe that these sorts of policies provide a security boundary where none in fact exists.

Technically "publishing" an action to the Actions Marketplace is not required; anybody can reference an action to fetch its definition even if it isn't published. All publishing does is give the action a marketplace page and the potential for a little blue checkmark of unclear security value. ↩

Actually, I don't know what this feature is called.
It's titled "Policies" under the "Actions" section of the repo/org/enterprise settings and is documented under "GitHub Actions policies" in the Enterprise documentation, but I'm not sure if that's an umbrella term or not. I'm just going to keep calling it "Actions policies" for now. ↩

Where "owner" is an individual owner or an organization, which in turn might be controlled by an enterprise. But that last bit isn't visible in the namespace. ↩

I honestly have no idea how widely used this policy feature is. ↩

Here, policy authors and enforcers. ↩
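As referenced above, here is a rough sketch of the bypass pattern (the repository and paths are placeholders, not the author's actual demo):

```yaml
jobs:
  bypass:
    runs-on: ubuntu-latest
    steps:
      # Fetch the otherwise-disallowed action onto the runner's filesystem...
      - run: git clone --depth 1 https://github.com/some-org/some-action third-party-action
      # ...and then run it as a "local" action, which the policy never inspects.
      - uses: ./third-party-action
```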
However, it's easy to forget that there's a class of service/business where this doesn't make sense. Many small businesses (if they're not using a PaaS like Fly) don't have the resources to run a Kubernetes cluster and, 95% of the time, just need an app to run on a single VPS. Sometimes, maybe during an event or a promotional period, it might make sense to add another server to the mix, load balancing traffic somewhere (DigitalOcean load balancers are pretty effective).

Assuming we have some nice way to spin up more servers running the app (a nice Ansible playbook perhaps?), and assuming we have some nice CD pipeline to deploy the app, we need some way to make sure all the servers are running the latest version. One pattern I like for this is a simplified GitOps pull model (think ArgoCD or Flux but held together with duct tape) where we:

1. Have a repo which indicates what app releases should currently be deployed.
2. Get your servers to query this repo to determine what they should be running and automatically update the app on each server to match.
3. Make your deployment pipeline update the repo with a reference to the new release.

This way, your deployment pipeline doesn't need to know anything about the servers it's deploying to - it just needs to know how to update the repo with a reference to the new release. In one case, I have a repo called 'ops' which, along with all the Ansible playbooks, has a set of files, one for each app, which contain (along with other settings) a line pinning the release that should currently be deployed. I can then run a script on each server (see the sketch below) which pulls down the repo, checks each app, and updates the systemd service file to point to the latest release. For cases where Kubernetes and a full GitOps solution are overkill, this is a nice way to get a simple deployment pipeline set up.
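Here is a minimal sketch of what that per-server script might look like (the paths, file format, and the fetch-release helper are all illustrative, not the author's exact setup):

```sh
#!/usr/bin/env bash
# Poll the ops repo and restart an app if its pinned release has changed.
set -euo pipefail

OPS_DIR=/opt/ops     # local clone of the 'ops' repo
APP=myapp

git -C "$OPS_DIR" pull --quiet

# Assume each app's file contains a line like: RELEASE=v1.2.3
desired=$(grep '^RELEASE=' "$OPS_DIR/apps/$APP" | cut -d= -f2)
current=$(cat "/var/lib/$APP/current-release" 2>/dev/null || echo none)

if [ "$desired" != "$current" ]; then
  echo "updating $APP: $current -> $desired"
  # fetch-release is a stand-in for however you download and unpack the artifact
  # and rewrite the systemd unit to point at it.
  /usr/local/bin/fetch-release "$APP" "$desired"
  systemctl restart "$APP"
  echo "$desired" > "/var/lib/$APP/current-release"
fi
```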
A couple of days ago, all of a sudden, my jobs started running out of space.
I have a lot of experience with massive CI/CD pipelines that deploy private code to public servers. I've also worked with pipelines that deploy public repositories to private servers, such as my homelab. However, I had never experimented with a pipeline that takes a private GitHub repo, builds it, and deploys it to a server on the LAN. That's precisely what I needed for a project I'm currently working on that isn't yet public.
In this post I explain how I built my automatic blogroll using GitHub Actions and GitHub Pages.
Redownloading dependencies for every step in your CI/CD pipeline can be time-consuming. You can dramatically speed up the build time of your application with caching, making your team more responsive to breaking changes and ultimately more productive. Here's how to do it.
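As a small, generic illustration of the idea (not taken from the linked post; this assumes a GitHub Actions workflow and an npm project):

```yaml
steps:
  - uses: actions/checkout@v4
  # Restore (and later save) the npm cache, keyed on the lockfile contents.
  - uses: actions/cache@v4
    with:
      path: ~/.npm
      key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
      restore-keys: |
        npm-${{ runner.os }}-
  - run: npm ci
```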
I've written in the past that this blog is a playground for me to try various tools and play with code around it. Jenkins has been my choice as the CI for it since the start, mostly since it was something I'm used to. However, I've also stated that running it on an old laptop with no real backups is a recipe for disaster. I have since rectified the issue by hiding the laptop under a box in a closet, but that meant moving away from Jenkins to something that's lighter and more portable. The choice is the self-hosted enterprise edition of Drone.

Drone consists of two parts - server and agents. The server handles auth, users, and secret management, handles hooks for source control, and orchestrates the work for agents. The agents are responsible for actually executing the workflows they receive from the server. This basically means that you can have workers anywhere as long as they can reach the Drone server. Drone is written in Go and is extremely lightweight. This means that it has extremely low memory requirements, which is of great advantage to me as I'm trying to keep the costs to a minimum. The server I'm running takes up around 15MB of memory while the worker takes 14MB. Everything is dockerized so it's super easy to deploy as well.

My setup consists of a server running in an AWS EC2 instance and a worker running at home in an MSI Cubi Silent NUC I've recently acquired. The Raspberry I've used for Jenkins is also a great candidate to run a worker, but due to the workloads I throw at it (lots of disk IO; the SD card can't handle that; looking at you, JavaScript) it's less than ideal in my situation. I'll keep it on hand just in case I need more workers. The old laptop could also be a candidate here for the future. That's part of the joy with Drone - you can literally run it anywhere.

Drone looks for a file in an enabled repository. It's in this file that you specify your pipeline. What makes Drone great is that you can actually run the pipeline locally, using the Drone CLI. It makes testing the builds super easy. That's a huge contrast to what I'm used to with Jenkins (disclaimer: I might just be a scrub, I'm not hating on Jenkins). What this also means is that you really don't need to worry about storing the jobs themselves anywhere, as they are just as safe as the rest of your code in your source control. Hopefully anyway. The steps in the pipeline are run in Docker containers which are thrown out after the pipeline is done. It means that the jobs are nice and reproducible. And while I hate YAML, the pipelines are quite easy to understand. Click for a look at an example (a generic sketch also appears further below). I like it.

Drone seems to be on a really good path towards becoming an excellent CI tool. There are things missing though. It seems a bit basic. Things like the lack of global secrets (they are now defined per-repo instead, or I didn't manage to find them) or proper documentation, as the current docs seem a bit lacking. It took me quite a while to get my first proper job running, and I only managed that after looking at a few examples on the internet rather than the docs. There's also the question of pricing. The website is not super clear on the pricing, but from what I gather, the enterprise edition is free as long as you run fewer than 15000 jobs per year. The pricing afterwards is per user at an unknown rate. Anyways, I should be covered as I probably run less than 500 jobs per year and do not plan on adding new ones any time soon.
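As referenced above, a generic sketch of what such a pipeline file looks like (illustrative only, not this blog's actual configuration):

```yaml
kind: pipeline
type: docker
name: default

steps:
  - name: build
    image: golang:1.22
    commands:
      - go build ./...
      - go test ./...
```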
There's also the lack of an option to run multiple different jobs from the same repository, which leads to a pattern of many small repos appearing in my source control. I'm not too fond of that and wish there was a way to schedule quite a few jobs from a single repo. Nonetheless, once you get the hang of it, Drone seems to be a powerful tool. The UI is crisp, deploying it is easy, and it runs on a potato. Everything I like from my CI installation. As for the lacking features - I'm sure they are coming soon™. I'll keep this space updated with my latest projects running on Drone. Stay tuned! Have you used Drone? Do you prefer something else? Do let me know in the comments below.