Posts in Devops (20 found)
devansh Today

Hitchhiker's Guide to Attack Surface Management

I first heard the term "ASM" (i.e., Attack Surface Management) probably in late 2018, and I thought it must be some complex infrastructure for tracking the assets of an organization. Looking back, I realize I already had a similar stack for discovering, tracking, and detecting obscure assets of organizations, and I was using it for my bug hunting adventures. I feel my stack was kinda goated, as I was able to find obscure assets of Apple, Facebook, Shopify, Twitter, and many other Fortune 100 companies, and reported hundreds of bugs, all through automation. Back in the day, projects like ProjectDiscovery did not exist, so if I wanted an effective port scanner, I had to write it from scratch. (Masscan and nmap existed, but I had my fair share of issues using them; that is a story for another time.) I used to write DNS resolvers (massdns had a high error rate), port scanners, web scrapers, directory brute-force utilities, wordlists, lots of JavaScript parsing logic using regex, and a hell of a lot of other things. I had 50+ self-developed tools for bug-bounty recon and another 60-something helper scripts written in bash. I used to orchestrate (gluing together with duct tape is a better phrase) and slap together scripts like a workflow, and save the output in text files. Whenever I dealt with a large number of domains, I used to distribute the load over multiple servers (server spin-up + SSH into it + SCP for pushing and pulling files from it). The setup was very fragile and error-prone, and I spent countless nights trying to debug errors in the workflows. But it was all worth it. I learned the art of Attack Surface Management without even trying to learn about it. I was just a teenager trying to make quick bucks through bug hunting, and this fragile, duct-taped system was my edge. Fast forward to today, and I have now spent almost a decade in the bug bounty scene. I joined HackerOne as a vulnerability triager in 2020 (where I still work), and I have triaged and reviewed tens of thousands of vulnerability submissions. Fair to say, I have seen a lot of things, from doomsday-level 0-days, to leaked credentials that could have led to entire infrastructure compromise just because some dev pushed an AWS secret key into git logs, to organizations that were not even aware they were running Jenkins servers on some obscure subdomain, which could have allowed RCE and then lateral movement to other layers of infrastructure. A lot of the issues I have seen were totally avoidable if organizations had followed some basic attack surface management practices. If I search "Guide to ASM" on the Internet, almost none of the supposed guides are real resources. They funnel you toward their own ASM solution; the guide is just there to provide some surface-level information and is mostly a marketing gimmick. This is precisely why I decided to write something where I try to cover everything I have learned and know about ASM, and how to protect your organization's assets before bad actors can get to them. This is going to be a rough and raw guide, and it will not lead you to a funnel where I am trying to sell my own ASM SaaS to you. I have nothing to sell, other than offering what I know. But in case you are an organization that needs help implementing the things I am mentioning below, you can reach out to me via X or email (both available on the homepage of this blog).
This guide will provide you with insights into exactly how big your attack surface really is. CISOs can look at it and see if their organizations have all of these covered, security researchers and bug hunters can look at it and maybe find new ideas about where to look during recon, and devs can look at it and see if they are unintentionally leaving any door open for hackers. If you are into security, it has something to offer you. Attack surface is one of those terms getting thrown around in security circles so much that it's become almost meaningless noise. In theory, it sounds simple enough, right. Attack surface is every single potential entry point, interaction vector, or exploitable interface an attacker could use to compromise your systems, steal your data, or generally wreck your day. But here's the thing: it's the sum total of everything you've exposed to the internet. Every API endpoint you forgot about, every subdomain some dev spun up for "testing purposes" five years ago and then abandoned, every IoT device plugged into your network, every employee laptop connecting from a coffee shop, every third-party vendor with a backdoor into your environment, every cloud storage bucket with permissions that make no sense, every Slack channel, every git commit leaking credentials, every paste on Pastebin containing your database passwords. Most organizations think about attack surface in incredibly narrow terms. They think if they have a website, an email server, and maybe some VPN endpoints, they've got "good visibility" into their assets. That's just plain wrong. Straight up wrong. Your actual attack surface would terrify you if you actually understood it. Say you run a company with a main domain. You probably know about the obvious subdomains: the website, the mail host, maybe an API endpoint. But what about the subdomain your intern from 2015 spun up and just never bothered to delete? It's not documented anywhere. Nobody remembers it exists. Domain attack surface goes way beyond what's sitting in your asset management system. Every subdomain is a potential entry point. Most of these subdomains are completely forgotten. Subdomain enumeration is reconnaissance 101 for attackers and bug hunters. It's not rocket science. Setting up a tool that monitors active and passive sources for new subdomains and generates alerts is honestly an hour's worth of work. You can use tools like Subfinder, Amass, or just mine Certificate Transparency logs to discover every single subdomain connected to your domain. Certificate Transparency logs were designed to increase security by making certificate issuance public, and they've become an absolute reconnaissance goldmine. Every time you get an SSL certificate for a subdomain, that information is sitting in public logs for anyone to find. Attackers systematically enumerate these subdomains using Certificate Transparency log searches, DNS brute-forcing with massive wordlists, reverse DNS lookups to map IP ranges back to domains, historical DNS data from services like SecurityTrails, and zone transfer exploitation if your DNS is misconfigured. Attackers are looking for old development environments still running vulnerable software, staging servers with production data sitting on them, forgotten admin panels, API endpoints without authentication, internal tools accidentally exposed, and test environments with default credentials nobody changed. Every subdomain is an asset. Every asset is a potential vulnerability. Every vulnerability is an entry point. Domains and subdomains are just the starting point though.
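To make that concrete, here is roughly what passive subdomain discovery looks like from the defender's side. This is a hedged sketch, not a prescribed toolchain: example.com stands in for your own domain, and it assumes curl, jq, subfinder, and dnsx are installed.

    # Pull every name that ever appeared in a certificate for the domain
    curl -s "https://crt.sh/?q=%25.example.com&output=json" \
      | jq -r '.[].name_value' | tr '[:upper:]' '[:lower:]' | sort -u > ct-subdomains.txt

    # Aggregate other passive sources
    subfinder -d example.com -silent | sort -u > passive-subdomains.txt

    # Keep only the names that actually resolve today
    sort -u ct-subdomains.txt passive-subdomains.txt | dnsx -silent > live-subdomains.txt

Run something like this on a schedule, diff the output against yesterday's list, and you have the hour's worth of alerting mentioned above.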
Once you've figured out all the subdomains belonging to your organization, the next step is to take a hard look at your IP address space, which is another absolutely massive component of your attack surface. Organizations own or lease IP ranges, sometimes small /24 blocks, sometimes massive /16 ranges, and every single IP address in those blocks that responds to external traffic is part of your attack surface. And attackers will enumerate them all even if you don't. They use WHOIS lookups to identify your IP ranges, port scanning to find what services are running where, service fingerprinting to identify exact software versions, and banner grabbing to extract configuration information. If you have a /24 network with 256 IP addresses and even 10% of those IPs are running services, you've got 25 potential attack vectors. Scale that to a /20 or /16 and you're looking at thousands of potential entry points. And attackers aren't just looking at the IPs you know about. They're looking at adjacent IP ranges you might have acquired through mergers, historical IP allocations that haven't been properly decommissioned, and shared IP ranges where your servers coexist with others. Traditional infrastructure was complicated enough, and now we have the cloud, which has exploded organizations' attack surfaces in ways that are genuinely difficult to even comprehend. Every cloud service you spin up, be it an EC2 instance, S3 bucket, Lambda function, or API Gateway endpoint, is a new attack vector. In my opinion and experience so far, the main issue with cloud infrastructure is that it's ephemeral and distributed. Resources get spun up and torn down constantly. Developers create instances for testing and forget about them. Auto-scaling groups generate new resources dynamically. Containerized workloads spin up massive Kubernetes clusters you have minimal visibility into. Your cloud attack surface could be literally anything. Examples are countless, but I'd categorize them into eight categories. Compute instances like EC2, Azure VMs, and GCP Compute Engine instances exposed to the internet. Storage buckets like S3, Azure Blob Storage, and GCP Cloud Storage with misconfigured permissions. Serverless stuff like Lambda functions with public URLs or overly permissive IAM roles. API endpoints like API Gateway and Azure API Management endpoints without proper authentication. Container registries like Docker images with embedded secrets or vulnerabilities. Kubernetes clusters with exposed API servers, misconfigured network policies, and vulnerable ingress controllers. Managed databases like RDS, CosmosDB, and Cloud SQL instances with weak access controls. IAM roles and service accounts with overly permissive identities that enable privilege escalation. I've seen instances in the past where a single misconfigured S3 bucket policy exposed terabytes of data. An overly permissive Lambda IAM role enabled lateral movement across an entire AWS account. A publicly accessible Kubernetes API server gave an attacker full cluster control. Honestly, cloud kinda scares me as well. And to top it off, multi-cloud infrastructure makes everything worse. If you're running AWS, Azure, and GCP together, you've just tripled your attack surface management complexity. Each cloud provider has different security models, different configuration profiles, and different attack vectors. Every application now uses APIs, and modern applications are like a constellation of APIs talking to each other.
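Before getting to APIs: this is roughly what an external sweep of your own IP space looks like. A hedged sketch, using the documentation range 198.51.100.0/24 as a placeholder for a range you actually own and are authorized to scan.

    # Confirm the allocation, then find live hosts and fingerprint what they expose
    whois 198.51.100.0 | grep -iE 'netrange|cidr|orgname'
    nmap -sn 198.51.100.0/24 -oG - | awk '/Up$/{print $2}' > live-hosts.txt
    nmap -sV --top-ports 1000 -iL live-hosts.txt -oA perimeter-scan

Anything that shows up here and surprises you is, by definition, shadow infrastructure.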
Every API you use in your organization is part of your attack surface. The problem with APIs is that they're often deployed without the same security scrutiny as traditional web applications. Developers spin up API endpoints for specific features and those endpoints accumulate over time. Some of them are shadow APIs, meaning API endpoints that aren't documented anywhere. These endpoints are the equivalent of forgotten subdomains, and attackers can find them by analyzing JavaScript files for API endpoint references, fuzzing common API path patterns, examining mobile app traffic to discover backend APIs, and mining old documentation or code repositories for deprecated endpoints. Your API attack surface includes REST APIs exposed to the internet, GraphQL endpoints with overly broad query capabilities, WebSocket connections for real-time functionality, gRPC services for inter-service communication, and legacy SOAP APIs that never got decommissioned. If your organization has mobile apps, be it iOS, Android, or both, they are a direct window into your infrastructure and should be part of your attack surface management strategy. Mobile apps communicate with backend APIs, and those API endpoints are discoverable by reversing the app. The reversed source of the app could reveal hard-coded API keys, tokens, and credentials. JADX plus APKTool plus Dex2jar is all a motivated attacker needs. Web servers often expose directories and files that weren't meant to be publicly accessible. Attackers systematically enumerate these using automated tools like ffuf, dirbuster, gobuster, and wfuzz with massive wordlists to discover hidden endpoints, configuration files, backup files, and administrative interfaces. Common exposed directories include admin panels, backup directories containing database dumps or source code, configuration files with database credentials and API keys, development directories with debug information, documentation directories revealing internal systems, upload directories for file storage, and old or forgotten directories from previous deployments. Your attack surface also includes directories accidentally left accessible during deployments, staging servers with production data, backup directories with old source code versions, administrative interfaces without authentication, API documentation exposing endpoint details, and test directories with debug output enabled. Even if you've removed a directory from production, old cached versions may still be accessible through web caches or CDNs. Search engines also index these directories, making them discoverable through dorking techniques. If your organization uses IoT devices, and everyone does these days, they should be part of your attack surface management strategy. They're invisible to traditional security tools. Your EDR solution doesn't protect IoT devices. Your vulnerability scanner can't inventory them. Your patch management system can't update them. Your IoT attack surface could include smart building systems like HVAC, lighting, and access control. Security cameras and surveillance systems. Printers and copiers, which are computers with network access. Badge readers and physical access systems. Industrial control systems and SCADA devices. Medical devices in healthcare environments. Employee wearables and fitness trackers. Voice assistants and smart speakers. The problem with IoT devices is that they're often deployed without any security consideration.
They have default credentials that never get changed, unpatched firmware with known vulnerabilities, no encryption for data in transit, weak authentication mechanisms, and insecure network configurations. Social media presence is an attack surface component that most organizations completely ignore. Attackers can use social media for reconnaissance by looking at employee profiles on LinkedIn to reveal organizational structure, technologies in use, and current projects. Twitter/X accounts can leak information about deployments, outages, and technology stack. Employee GitHub profiles expose email patterns and development practices. Company blogs can announce new features before security review. It could also be a direct attack vector. Attackers can use information from social media to craft convincing phishing attacks. Hijacked social media accounts can be used to spread malware or phishing links. Employees can accidentally share sensitive information. Fake accounts can impersonate your brand to defraud customers. Your employees' social media presence is part of your attack surface whether you like it or not. Third-party vendors, suppliers, contractors, or partners with access to your systems should be part of your attack surface. Supply chain attacks are becoming more and more common these days. Attackers can compromise a vendor with weaker security and then use that vendor's access to reach your environment. From there, they pivot from the vendor network to your systems. This isn't a hypothetical scenario, it has happened multiple times in the past. You might have heard about the SolarWinds attack, where attackers compromised SolarWinds' build system and distributed malware through software updates to thousands of customers. Another famous case study is the MOVEit vulnerability in MOVEit Transfer software, exploited by the Cl0p ransomware group, which affected over 2,700 organizations. These are examples of some high-profile supply chain security attacks. Your third-party attack surface could include things like VPNs, remote desktop connections, privileged access systems, third-party services with API keys to your systems, login credentials shared with vendors, SaaS applications storing your data, and external IT support with administrative access. It's obvious you can't directly control third-party security. You can audit them, have them pen-test their assets as part of your vendor compliance plan, and include security requirements in contracts, but ultimately their security posture is outside your control. And attackers know this. GitHub, GitLab, Bitbucket, they all are a massive attack surface. Attackers search through code repositories in hopes of finding hard-coded credentials like API keys, database passwords, and tokens. Private keys, SSH keys, TLS certificates, and encryption keys. Internal architecture documentation revealing infrastructure details in code comments. Configuration files with database connection strings and internal URLs. Deprecated code with vulnerabilities that's still in production. Even private repositories aren't safe. Attackers can compromise developer accounts to access private repositories, former employees retain access after leaving, and overly broad repository permissions grant access to too many people. Automated scanners continuously monitor public repositories for secrets. The moment a developer accidentally pushes credentials to a public repository, automated systems detect it within minutes. 
Attackers have already extracted and weaponized those credentials before the developer realizes the mistake. CI/CD pipelines are another massive attack vector, especially in recent times, and not many organizations are paying attention to it. This should totally be part of your attack surface management. Attackers compromise GitHub Actions workflows with malicious code injection, Jenkins servers with weak authentication, GitLab CI/CD variables containing secrets, and build artifacts with embedded malware. The GitHub Actions supply chain attack, CVE-2025-30066, demonstrated this perfectly. Attackers compromised an Action used in over 23,000 repositories, injecting malicious code that leaked secrets from build logs. Jenkins specifically is a goldmine for attackers. An exposed Jenkins instance can provide complete control over multiple critical servers; access to hardcoded AWS keys, Redis credentials, and Bitbucket tokens; the ability to manipulate builds and inject malicious code; and exfiltration of production database credentials containing PII. Modern collaboration tools are massive attack surface components that most organizations underestimate. Slack has hidden security risks despite being invite-only. Slack attack surface could include indefinite data retention where every message, channel, and file is stored forever unless admins configure retention periods. Public channels accessible to all users, so one breached account opens the floodgates. Third-party integrations with excessive permissions accessing messages and user data. Former contractor access where individuals retain access long after projects end. Phishing and impersonation, where it's easy to change names and pictures to impersonate senior personnel. In 2022, Slack disclosed that it had been leaking hashed passwords for five years, affecting 0.5% of its users. Slack channels commonly contain API keys, authentication tokens, database credentials, customer PII, financial data, internal system passwords, and confidential project information. The average cost of a breached record was $164 in 2022. When 1 in 166 messages in Slack contains confidential information, every new message adds another dollar to total risk exposure. With 5,000 employees sending 30 million Slack messages per year, that's substantial exposure. Trello board exposure is a significant attack surface component. Trello attack vectors include public boards with sensitive information accidentally shared publicly, default public visibility where boards are created as public by default in some configurations, an unsecured REST API allowing unauthenticated access to user data, and scraping attacks where attackers use email lists to enumerate Trello accounts. The 2024 Trello data breach exposed 15 million users' personal information when a threat actor named "emo" exploited an unsecured REST API, using 500 million email addresses to compile detailed user profiles. Security researcher David Shear documented hundreds of public Trello boards exposing passwords, credentials, IT support customer access details, website admin logins, and client server management credentials. IT companies were using Trello to troubleshoot client requests and manage infrastructure, storing all credentials on public Trello boards. Jira misconfiguration is a widespread attack surface issue.
Common misconfigurations include public dashboards and filters with "Everyone" access actually meaning public internet access, anonymous access enabled allowing unauthenticated users to browse, user picker functionality providing complete lists of usernames and email addresses, and project visibility allowing sensitive projects to be accessible without authentication. Confluence misconfiguration exposes internal documentation. Confluence attack surface components include anonymous access at site level allowing public access, public spaces where space admins grant anonymous permissions, inherited permissions where all content within a space inherits space-level access, and user profile visibility allowing anonymous users to view profiles of logged-in users. When anonymous access is enabled globally and space admins allow anonymous users to access their spaces, anyone on the internet can access that content. Confluence spaces often contain internal documentation with hardcoded credentials, financial information, project details, employee information, and API documentation with authentication details. Cloud storage misconfiguration is epidemic. Google Drive misconfiguration attack surface includes "Anyone with the link" sharing making files accessible without authentication, overly permissive sharing defaults making it easy to accidentally share publicly, inherited folder permissions exposing everything beneath, unmanaged third-party apps with excessive read/write/delete permissions, inactive user accounts where former employees retain access, and external ownership blind spots where externally-owned content is shared into the environment. Metomic's 2023 Google Scanner Report found that of 6.5 million Google Drive files analyzed, 40.2% contained sensitive information, 34.2% were shared externally, and 0.5% were publicly accessible, mostly unintentionally. In December 2023, Japanese game developer Ateam suffered a catastrophic Google Drive misconfiguration that exposed personal data of nearly 1 million people for over six years due to "Anyone with the link" settings. Based on Valence research, 22% of external data shares utilize open links, and 94% of these open link shares are inactive, forgotten files with public URLs floating around the internet. Dropbox, OneDrive, and Box share similar attack surface components including misconfigured sharing permissions, weak or missing password protection, overly broad access grants, third-party app integrations with excessive permissions, and lack of visibility into external sharing. Features that make file sharing convenient create data leakage risks when misconfigured. Pastebin and similar paste sites are both reconnaissance sources and attack vectors. Paste site attack surface includes public data dumps of stolen credentials, API keys, and database dumps posted publicly, malware hosting of obfuscated payloads, C2 communications where malware uses Pastebin for command and control, credential leakage from developers accidentally posting secrets, and bypassing security filters since Pastebin is legitimate so security tools don't block it. For organizations, leaked API keys or database credentials on Pastebin lead to unauthorized access, data exfiltration, and service disruption. Attackers continuously scan Pastebin for mentions of target organizations using automated tools. Security teams must actively monitor Pastebin and similar paste sites for company name mentions, email domain references, and specific keywords related to the organization. 
Because paste sites don't require registration or authentication and content is rarely removed, they've become permanent archives of leaked secrets. Container registries expose significant attack surface. Container registry attack surface includes secrets embedded in image layers (30,000 unique secrets were found across 19,000 images, 10% of scanned Docker images contained secrets, and 1,200 of those secrets, about 4%, were active and valid), immutable cached layers that hold 85% of embedded secrets and can't be removed, exposed registries (117 Docker registries accessible without authentication), unsecured registries allowing pull, push, and delete operations, and source code exposure where full application code is accessible by pulling images. GitGuardian's analysis of 200,000 publicly available Docker images revealed a staggering secret exposure problem. Even more alarming, 99% of images containing active secrets were pulled in 2024, demonstrating real-world exploitation. Unit 42's research identified 941 Docker registries exposed to the internet, with 117 accessible without authentication, containing 2,956 repositories, 15,887 tags, and full source code and historical versions. Out of the 117 unsecured registries, 80 allow pull operations to download images, 92 allow push operations to upload malicious images, and 7 allow delete operations for ransomware potential. Sysdig's analysis of over 250,000 Linux images on Docker Hub found 1,652 malicious images, including cryptominers (the most common), embedded secrets (the second most prevalent), SSH keys and public keys for backdoor implants, API keys and authentication tokens, and database credentials. The secrets found in container images included AWS access keys, database passwords, SSH private keys, API tokens for cloud services, GitHub personal access tokens, and TLS certificates. Shadow IT includes unapproved SaaS applications like Dropbox, Google Drive, and personal cloud storage used for work. Personal devices like BYOD laptops, tablets, and smartphones accessing corporate data. Rogue cloud deployments where developers spin up AWS instances without approval. Unauthorized messaging apps like WhatsApp, Telegram, and Signal used for business communication. Unapproved IoT devices like smart speakers, wireless cameras, and fitness trackers on the corporate network. Gartner estimates that shadow IT makes up 30-40% of IT spending in large companies, and 76% of organizations surveyed experienced cyberattacks due to exploitation of unknown, unmanaged, or poorly managed assets. Shadow IT expands your attack surface because it's not protected by your security controls, it's not monitored by your security team, it's not included in your vulnerability scans, it's not patched by your IT department, and it often has weak or default credentials. And you can't secure what you don't know exists. Bring Your Own Device, BYOD, policies sound great for employee flexibility and cost savings. For security teams, they're a nightmare. BYOD expands your attack surface by introducing unmanaged endpoints like personal devices without EDR, antivirus, or encryption. Mixing personal and business use where work data is stored alongside personal apps with unknown security. Connecting from untrusted networks like public Wi-Fi and home networks with compromised routers. Installing unapproved applications with malware or excessive permissions. Lacking consistent security updates with devices running outdated operating systems.
Common BYOD security issues include data leakage through personal cloud backup services, malware infections from personal app downloads, lost or stolen devices containing corporate data, family members using devices that access work systems, and lack of IT visibility and control. The oft-cited 60% of small and mid-sized businesses that close within six months of a major cyberattack often have BYOD-related security gaps as contributing factors. Remote access infrastructure like VPNs and Remote Desktop Protocol, RDP, is among the most exploited attack vectors. SSL VPN appliances from vendors like Fortinet, SonicWall, Check Point, and Palo Alto are under constant attack. VPN attack vectors include authentication bypass vulnerabilities with CVEs allowing attackers to hijack active sessions, credential stuffing through brute-forcing VPN logins with leaked credentials, exploitation of unpatched vulnerabilities with critical CVEs in VPN appliances, and configuration weaknesses like default credentials, weak passwords, and lack of MFA. Real-world attacks demonstrate the risk. Check Point SSL VPN CVE-2024-24919 allowed authentication bypass for session hijacking. Fortinet SSL-VPN vulnerabilities were leveraged for lateral movement and privilege escalation. SonicWall CVE-2024-53704 allowed remote authentication bypass for SSL VPN. Once inside via VPN, attackers conduct network reconnaissance, lateral movement, and privilege escalation. RDP is worse. Sophos found that cybercriminals abused RDP in 90% of attacks they investigated. External remote services like RDP were the initial access vector in 65% of incident response cases. RDP attack vectors include exposed RDP ports with port 3389 open to the internet, weak authentication with simple passwords vulnerable to brute force, lack of MFA with no second factor for authentication, and credential reuse from compromised passwords in data breaches. In one Darktrace case, attackers compromised an organization four times in six months, each time through exposed RDP ports. The attack chain went from successful RDP login to internal reconnaissance via WMI, lateral movement via PsExec, and finally objective achievement. The Palo Alto Unit 42 Incident Response report found RDP was the initial attack vector in 50% of ransomware deployment cases. Email infrastructure remains a primary attack vector. Your email attack surface includes mail servers like Exchange, Office 365, and Gmail with configuration weaknesses, email authentication with misconfigured SPF, DKIM, and DMARC records, phishing-susceptible users targeted through social engineering, email attachments and links as malware delivery mechanisms, and compromised accounts through credential stuffing or password reuse. Email authentication misconfiguration is particularly insidious. If your SPF, DKIM, and DMARC records are wrong or missing, attackers can spoof emails from your domain, your legitimate emails get marked as spam, and phishing emails impersonating your organization succeed. Email servers themselves are also targets. The NSA released guidance on Microsoft Exchange Server security specifically because Exchange servers are so frequently compromised. Container orchestration platforms like Kubernetes introduce massive attack surface complexity.
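Before digging into Kubernetes, one quick win from the email section above: the SPF, DKIM, and DMARC records for a domain are public, so checking them takes seconds. A hedged sketch; example.com and the DKIM selector are placeholders for your own values.

    dig +short TXT example.com | grep -i 'v=spf1'          # SPF record, if one is published
    dig +short TXT _dmarc.example.com                       # DMARC policy; "p=none" means report-only
    dig +short TXT selector1._domainkey.example.com         # DKIM public key for a selector you actually use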
The Kubernetes attack surface includes the Kubernetes API server with exposed or misconfigured API endpoints, container images with vulnerabilities in base images or application layers, container registries like Docker Hub, ECR, and GCR with weak access controls, pod security policies with overly permissive container configurations, network policies with insufficient micro-segmentation between pods, secrets management with hardcoded secrets or weak secret storage, and RBAC misconfigurations with overly broad service account permissions. Container security issues include containers running as root with excessive privileges, exposed Docker daemon sockets allowing container escape, vulnerable dependencies in container images, and lack of runtime security monitoring. The Docker daemon attack surface is particularly concerning. Running containers with privileged access or allowing docker.sock access can enable container escape and host compromise. Serverless computing like AWS Lambda, Azure Functions, and Google Cloud Functions promised to eliminate infrastructure management. Instead, it just created new attack surfaces. Serverless attack surface components include function code vulnerabilities like injection flaws and insecure dependencies, IAM misconfigurations with overly permissive Lambda execution roles, environment variables storing secrets as plain text, function URLs with publicly accessible endpoints without authentication, and event source mappings with untrusted input from various cloud services. The overabundance of event sources expands the attack surface. Lambda functions can be triggered by S3 events, API Gateway requests, DynamoDB streams, SNS topics, EventBridge schedules, IoT events, and dozens more. Each event source is a potential injection point. If function input validation is insufficient, attackers can manipulate event data to exploit the function. Real-world Lambda attacks include credential theft by exfiltrating IAM credentials from environment variables, lateral movement using over-permissioned roles to access other AWS resources, and data exfiltration by invoking functions to query and extract database contents. The Scarlet Eel adversary specifically targeted AWS Lambda for credential theft and lateral movement. Microservices architecture multiplies attack surface by decomposing monolithic applications into dozens or hundreds of independent services. Each microservice has its own attack surface, including authentication mechanisms where each service needs to verify requests, authorization rules where each service enforces access controls, API endpoints for service-to-service communication channels, data stores where each service may have its own database, and network interfaces where each service exposes network ports. Microservices security challenges include east-west traffic vulnerabilities with service-to-service communication without encryption or authentication, authentication and authorization complexity from managing authn and authz across 40-plus services in three environments (well over a hundred distinct configurations), service-to-service trust where services blindly trust internal traffic, network segmentation failures with flat networks allowing unrestricted pod-to-pod communication, and inconsistent security policies with different services having different security standards. One compromised microservice can enable lateral movement across the entire application. Without proper network segmentation and zero trust architecture, attackers pivot from service to service.
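A quick sanity check on the exposed-API-server problem described above: probe your own cluster endpoint from outside and review what an anonymous caller is allowed to do. A hedged sketch; k8s.example.com is a placeholder for your API server address.

    # From outside the cluster: anonymous requests should be rejected
    curl -sk https://k8s.example.com:6443/api/v1/namespaces   # a 401 or 403 is the answer you want
    curl -sk https://k8s.example.com:6443/version             # often open by default; decide if that's acceptable

    # With admin credentials: review what the anonymous "user" is allowed to do
    kubectl auth can-i --list --as=system:anonymous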
How do you measure something this large, right. Attack surface measurement is complex. Attack surface metrics include the total number of assets with all discovered systems, applications, and devices, newly discovered assets found through continuous discovery, the number of exposed assets accessible from the internet, open ports and services with network services listening for connections, vulnerabilities by severity including critical, high, medium, and low CVEs, mean time to detect, MTTD, measuring how quickly new assets are discovered, mean time to remediate, MTTR, measuring how quickly vulnerabilities are fixed, shadow IT assets that are unknown or unmanaged, third-party exposure from vendor and partner access points, and attack surface change rate showing how rapidly the attack surface evolves. Academic research has produced formal attack surface measurement methods. Pratyusa Manadhata's foundational work defines attack surface as a three-tuple, System Attackability, Channel Attackability, Data Attackability. But in practice, most organizations struggle with basic attack surface visibility, let alone quantitative measurement. Your attack surface isn't static. It changes constantly. Changes happen because developers deploy new services and APIs, cloud auto-scaling spins up new instances, shadow IT appears as employees adopt unapproved tools, acquisitions bring new infrastructure into your environment, IoT devices get plugged into your network, and subdomains get created for new projects. Static, point-in-time assessments are obsolete. You need continuous asset discovery and monitoring. Continuous discovery methods include automated network scanning for regular scans to detect new devices, cloud API polling to query cloud provider APIs for resource changes, DNS monitoring to track new subdomains via Certificate Transparency logs, passive traffic analysis to observe network traffic and identify assets, integration with CMDB or ITSM to sync with configuration management databases, and cloud inventory automation using Infrastructure as Code to track deployments. Understanding your attack surface is step one. Reducing it is the goal. Attack surface reduction begins with asset elimination by removing unnecessary assets entirely. This includes decommissioning unused servers and applications, deleting abandoned subdomains and DNS records, shutting down forgotten development environments, disabling unused network services and ports, and removing unused user accounts and service identities. Access control hardening implements least privilege everywhere by enforcing multi-factor authentication, MFA, for all remote access, using role-based access control, RBAC, for cloud resources, implementing zero trust network architecture, restricting network access with micro-segmentation, and applying the principle of least privilege to IAM roles. Exposure minimization reduces what's visible to attackers by moving services behind VPNs or bastion hosts, using private IP ranges for internal services, implementing network address translation, NAT, for outbound access, restricting API endpoints to authorized sources only, and disabling unnecessary features and functionalities. Security hardening strengthens what remains by applying security patches promptly, using security configuration baselines, enabling encryption for data in transit and at rest, implementing Web Application Firewalls, WAF, for web apps, and deploying endpoint detection and response, EDR, on all devices. 
Monitoring and detection watch for attacks in progress by implementing real-time threat detection, enabling comprehensive logging and SIEM integration, deploying intrusion detection and prevention systems, IDS/IPS, monitoring for anomalous behavior patterns, and using threat intelligence feeds to identify known bad actors. Your attack surface is exponentially larger than you think it is. Every asset you know about probably has three you don't. Every known vulnerability probably has ten undiscovered ones. Every third-party integration probably grants more access than you realize. Every collaboration tool is leaking more data than you imagine. Every paste site contains more of your secrets than you want to admit. And attackers know this. They're not just looking at what you think you've secured. They're systematically enumerating every possible entry point. They're mining Certificate Transparency logs for forgotten subdomains. They're scanning every IP in your address space. They're reverse-engineering your mobile apps. They're buying employee credentials from data breach databases. They're compromising your vendors to reach you. They're scraping Pastebin for your leaked secrets. They're pulling your public Docker images and extracting the embedded credentials. They're accessing your misconfigured S3 buckets and exfiltrating terabytes of data. They're exploiting your exposed Jenkins instances to compromise your entire infrastructure. They're manipulating your AI agents to exfiltrate private Notion data. The asymmetry is brutal. You have to defend every single attack vector. They only need to find one that works. So what do you do. Start by accepting that you don't have complete visibility. Nobody does. But you can work toward better visibility through continuous discovery, automated asset management, and integration of security tools that help map your actual attack surface. Implement attack surface reduction aggressively. Every asset you eliminate is one less thing to defend. Every service you shut down is one less potential vulnerability. Every piece of shadow IT you discover and bring under management is one less blind spot. Every misconfigured cloud storage bucket you fix is terabytes of data no longer exposed. Every leaked secret you rotate is one less credential floating around the internet. Adopt zero trust architecture. Stop assuming that anything, internal services, microservices, authenticated users, collaboration tools, is inherently trustworthy. Verify everything. Monitor paste sites and code repositories. Your secrets are out there. Find them before attackers weaponize them. Secure your collaboration tools. Slack, Trello, Jira, Confluence, Notion, Google Drive, and Airtable are all leaking data. Lock them down. Fix your container security. Scan images for secrets. Use secret managers instead of environment variables. Secure your registries. Harden your CI/CD pipelines. Jenkins, GitHub Actions, and GitLab CI are high-value targets. Protect them. And test your assumptions with red team exercises and continuous security testing. Your attack surface is what an attacker can reach, not what you think you've secured. The attack surface problem isn't getting better. Cloud adoption, DevOps practices, remote work, IoT proliferation, supply chain complexity, collaboration tool sprawl, and container adoption are all expanding organizational attack surfaces faster than security teams can keep up. But understanding the problem is the first step toward managing it. 
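Two of the action items above, scanning code history and container images for secrets, are easy to start on today. A hedged sketch using gitleaks and plain docker tooling; the repository URL and image name are placeholders, and other scanners (trufflehog, for instance) work just as well.

    # Scan a repository's full history for committed secrets
    git clone https://github.com/example-org/example-repo.git && cd example-repo
    gitleaks detect --source . --report-format json --report-path leaks.json
    # Non-zero exit on findings, so the same command drops straight into CI

    # Crude but effective pass over an image you publish
    docker pull registry.example.com/app:latest
    docker history --no-trunc registry.example.com/app:latest      # build steps often leak ARG/ENV values
    docker save registry.example.com/app:latest -o app.tar && mkdir -p layers && tar -xf app.tar -C layers
    grep -raEl 'AKIA[0-9A-Z]{16}|BEGIN (RSA|OPENSSH) PRIVATE KEY' layers/ || echo "no obvious secrets"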
And now you understand exactly how catastrophically large your attack surface actually is.

0 views
./techtipsy 2 days ago

The day IPv6 went away

I take pride in hosting my blog on a 13-year-old ThinkPad acting as a home server, but sometimes it’s kind of a pain. It’s only fair that I cover the downsides of this setup in contrast to all the positives. Yesterday, I happened to notice that a connection to a backup endpoint was gone. Okay, happens sometimes. Then I went into the router and noticed that hey, that’s odd, there’s no WAN6 connection showing up. All gone. Just as if I had gone back to a crappy ISP that only provides IPv4! Restarting the interface did not work, but a full router restart worked. Since the IPv4 address and IPv6 prefix are all dynamic, that meant that my DNS entries had just gone stale. I do have a custom DNS auto-updater script for my DNS provider, but DNS propagation takes time. Luckily not a lot of time; my uptime checker only reported downtime of 5-15 minutes, depending on the domain. Here’s what it looked like on OpenWRT. Impact to my blog? Not really noticeable, since IPv4 kept trucking along. Perhaps a few IPv6-only readers may have noticed this (if you are that person, say hi!). I can always move to a cheap VPS or the cloud at a moment’s notice, but where’s the fun in that? I can produce AWS levels of uptime at home, thankyouverymuch! I think I’ll now need to figure out some safeguards, even if it means scheduling a weekly router reboot if the WAN6 interface is not up for X amount of time. That, and better monitoring.
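A minimal sketch of that safeguard idea, assuming an OpenWrt router and an interface named wan6; run it from /etc/crontabs/root every 15 minutes or so. It tries a plain interface restart first and only reboots if that does not help.

    #!/bin/sh
    # wan6-watchdog.sh -- restart the IPv6 WAN interface when it reports down
    if ! ifstatus wan6 | grep -q '"up": true'; then
        logger -t wan6-watchdog "wan6 is down, restarting interface"
        ifup wan6
        sleep 60
        ifstatus wan6 | grep -q '"up": true' || reboot
    fi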

0 views
codedge 2 days ago

Managing secrets with SOPS in your homelab

Sealed Secrets, Ansible Vault, 1Password or SOPS - there are multiple ways how and where to store your secrets. I went with SOPS and age for my ArgoCD GitOps environment. Managing secrets in your homelab, be it within a Kubernetes cluster or while deploying systems and tooling with Ansible, is a topic that arises with almost 100% certainty. In general you need to decide whether you want secrets to be held and managed externally or internally. One important advantage I see with internally managed solutions is that I do not need an extra service. No extra costs and connections, and no chicken-and-egg problem where your passwords are hosted inside your own Kubernetes cluster but cannot be reached when the cluster is down. Therefore I went with SOPS for both: secrets for my Ansible scripts and secrets I need to set for my K8s cluster. While SOPS can be used with PGP, GnuPG and more, I settled on age for encryption. With SOPS your secrets live, encrypted, inside your repository and can be en-/decrypted on-the-fly whenever needed. The private key for encryption should, of course, never be committed into your git repository or made available to untrusted sources. First, we need to install SOPS and age, and generate an age key. SOPS is available for all common operating systems via the package manager; I use either macOS or Arch. Now we need to generate an age key and link it to SOPS as the default key to encrypt with. The generated age key lives in a local key file, and we tell SOPS where to find it with a line in the shell profile. The last thing to do is to put a .sops.yaml file in the folder from which you want to encrypt your files. This file acts as the configuration regarding the age recipient (key) and how the data should be encrypted. My config file is close to the sketch shown at the end of this post. You might wonder about the first rule and its encryption regex. I will just quote the KSOPS docs here: “To make encrypted secrets more readable, we suggest using the following encryption regex to only encrypt data and stringData values. This leaves non-sensitive fields, like the secret’s name, unencrypted and human readable.” All the configuration can be found in the SOPS docs. Let’s now look into the specifics of using our new setup with either Ansible or Kubernetes. Ansible can automatically process (decrypt) SOPS-encrypted files with the Community SOPS Collection. Additionally, I enabled the vars plugin in my Ansible configuration (see docs). Now, taken from the official Ansible docs: after the plugin is enabled, correctly named group and host vars files will be transparently decrypted with SOPS. The files must end with one of these extensions: .sops.yaml, .sops.yml, .sops.json. That’s it. You can now encrypt your group or host vars files and Ansible can automatically decrypt them. SOPS can be used with Kubernetes via the KSOPS Kustomize plugin. The configuration is already prepared; we only need to apply KSOPS to our cluster.
I use the following manifest to deploy KSOPS - see more examples in my homelab repository. To recap the terms and configuration used above: externally managed means a self-hosted or externally hosted secrets solution such as AWS KMS, or a password manager like 1Password or similar; internally managed means a solution where your secrets live next to your code and no external service is needed. On Arch Linux (as on macOS), SOPS and age install straight from the package manager. In the .sops.yaml there are two separate rules depending on the folder where the encrypted files are located, files ending with the .sops.* extensions are targeted, and the age key that should be used for en-/decryption is specified.
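Pulling the pieces together, here is a condensed sketch of the setup described above. It is illustrative, not the author’s exact files: the key location follows the SOPS defaults, and the folder names and the age public key are placeholders.

    # Install (Arch shown; macOS users can use Homebrew), then generate an age key
    pacman -S sops age          # or: brew install sops age
    mkdir -p ~/.config/sops/age
    age-keygen -o ~/.config/sops/age/keys.txt

    # Tell SOPS where the private key lives, e.g. in ~/.bashrc or ~/.zshrc
    export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt

    # .sops.yaml -- two rules: K8s secrets only encrypt data/stringData, Ansible vars encrypt everything
    cat > .sops.yaml <<'EOF'
    creation_rules:
      - path_regex: kubernetes/.*\.sops\.ya?ml$
        encrypted_regex: ^(data|stringData)$
        age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
      - path_regex: ansible/.*\.sops\.ya?ml$
        age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
    EOF

    # Encrypt a file in place; Ansible and KSOPS decrypt transparently later
    sops --encrypt --in-place kubernetes/my-secret.sops.yaml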

0 views
neilzone 3 days ago

Upgrading our time recording system from Kimai v1 to Kimai v2

I have used Kimai as a FOSS time recording system for probably the best part of 10 years. It is a great piece of software, allowing multiple users to record the time that they spend on different tasks, linked to different customers and projects. I use it for time tracking for decoded.legal, recording all my working time. I run it on a server which is not accessible from the Internet, so the fact that we were running the now long-outdated v1 of the software did not bother me too much. But, as part of ongoing hygiene / system security stuff, I’ve had my eye on upgrading it to Kimai v2 for a while now, and I’ve finally got round to upgrading it. Fortunately, there is a clear upgrade path from v1 to v2 and It Just Worked. The installation of v2 was itself pretty straightforward, with clear installation instructions. I then imported the data from v1, and the migration/importer tool flagged a couple of issues which needed fixing (e.g. no email address associated with system users, which is now a requirement). The documentation was good in terms of how to deal with those. All in all, it took about 20 minutes to install the new software, sort out DNS, the web server configuration, TLS, and so on, and then import the data from the old installation. I used the export functionality to compare the data in v2 with what I had in v1, to check that there were no (obvious, anyway) disparities. There were not, which was good! One of the changes in Kimai v2 is the ability to create customised exportable timesheets easily, using the GUI tool. This means that, within a couple of minutes, I had created the kind of timesheet that I provide to clients along with each monthly invoice, so that they can see exactly what I did on which of their matters, and how long I spent on it. For clients who prefer to pay on the basis of time spent, this is important. This is nothing fancy; just a clear summary on the front page, and then a detailed breakdown. I have yet to work out how to group the breakdown on a per-project basis, rather than a single chronological list, but I doubt that this will be much of a problem. I have yet to investigate the possibility for some automation, particularly around the generation of timesheets at the end of each month, one per customer. I’ll still check each of them by hand, of course, but automating their production would be nice. Or, even if not automated, just one click to produce them all. As with v1, Kimai v2 stores its data in a MariaDB database, so automating backups is straightforward. Again, there are clear instructions, which is a good sign.
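For the backup part, something along these lines is usually all that is needed from cron. A sketch only: the database name, credentials file, and target path are assumptions, not the author’s setup.

    # Nightly dump of the Kimai database, keeping one file per day
    mysqldump --defaults-extra-file=/etc/kimai-backup.cnf --single-transaction kimai2 \
      | gzip > /var/backups/kimai2-$(date +%F).sql.gz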

0 views
daniel.haxx.se 3 days ago

curl 8.17.0

Download curl from curl.se. This is the 271st release: 11 changes, 56 days since the previous release (total: 10,092), 448 bugfixes (total: 12,537), 699 commits (total: 36,725), 2 new public libcurl functions (total: 100), 0 new curl_easy_setopt() options (total: 308), 1 new curl command line option (total: 273), 69 contributors of which 35 are new (total: 3,534), 22 authors of which 5 are new (total: 1,415), and 1 security fix (total: 170). CVE-2025-10966: missing SFTP host verification with wolfSSH. curl’s code for managing SSH connections when SFTP was done using the wolfSSH-powered backend was flawed and missed host verification mechanisms. We drop support for several things this time around: Heimdal, the winbuild build system, Kerberos FTP, and wolfSSH, and we up the minimum libssh2 requirement to 1.9.0. And then we did some other smaller changes: add a notifications API to the multi interface, expand to use 6 characters per size in the progress meter, support Apple SecTrust to use the native CA store, add a new option to the curl command line tool, import wcurl v2025.11.04, and make write-out able to output all occurrences of a header. We set a new project record this time with no less than 448 documented bugfixes since the previous release. The release presentation mentioned above discusses some of the perhaps most significant ones. There is a small set of pull requests waiting to get merged, but other than that our future is not set and we greatly appreciate your feedback, submitted issues and provided pull requests to guide us. If this release happens to include an annoying regression, there might be a patch release already next week. If we are lucky and it doesn’t, then we aim for an 8.18.0 release in early January 2026.

0 views
neilzone 5 days ago

Stopping a kernel module from loading on boot in Debian Trixie using /etc/modules-load.d/

I am going through and upgrading my Debian machines from bookworm to trixie. On the whole, so far anyway, it has been pretty painless. But here’s something which needed some manual intervention. I have a couple of virtual machines still using VirtualBox. Yes, I know, but now was not the time to solve that part of the problem. They ran fine on a hypervisor running bookworm. After I had upgraded the hypervisor to trixie and started the virtual machine, I got an error. Sure enough, when I listed the loaded kernel modules, the offending module was there. Temporarily unloading it did the trick, but that does not survive a reboot. Previously, I would have handled this the old way, but apparently that is no longer the recommended approach. It seems that the new (I don’t know how new) way to do things is to put a simple config file in /etc/modules-load.d/. (It is a shame that the command uses “blacklist” in 2025.) And then I rebuilt the initramfs. On rebooting the machine, that kernel module no longer loads automatically, and I can start my virtual machine.

0 views
Brain Baking 1 weeks ago

The Internet Is No Longer A Safe Haven

A couple of days ago, the small server hosting this website was temporarily knocked out by scraping bots. This wasn’t the first time, nor is it the first time I’m seriously considering employing more aggressive countermeasures such as Anubis (see for example the June 2025 summary post). But every time something like this happens, a portion of the software hobbyist in me dies. We should add this to the list of things AI scrapers destroy, next to our environment, the creative enthusiasm of the individuals who made the things being scraped, and our critical thinking skills. When I tried accessing Brain Baking, I was met with an unusual delay that prompted me to log in and see what was going on. A quick look at the process list revealed both Gitea and the Fail2ban server gobbling up almost all CPU resources. Uh oh. Quickly killing Gitea didn’t reduce the work of Fail2ban, as the Nginx access logs were being flooded with scraper requests. I have enough fail-safe systems in place to block bad bots, but the user agent isn’t immediately recognized as “bad”: it’s ridiculously easy to spoof that HTTP header. Most user agent checkers I throw this string at claim this agent isn’t a bot. That means we shouldn’t rely on this information alone. Also, I temporarily block isolated IPs that keep on poking around (e.g. IPs that trip rate limiting on Nginx get pulled into the ban list), but of course these scrapers never come from a single source. Yet the base attacking IP ranges remained the same. The website ipinfo.io can help in identifying the threat: AS45102 Alibaba (US) Technology Co., Ltd. Huh? Apparently, Alibaba provides hosting from Singapore that is frequently abused by attackers. Many others that host forum software such as phpBB experienced the same problems, and although AbuseIPDB doesn’t report recent issues for the IPs from my logs, I went ahead and blocked the entire range. Fail2ban was struggling to keep up: it ingests the Nginx access.log file to apply its rules, but the file kept on exploding. A rule to instant-ban everyone trying to access Git’s commit logs simply wasn’t fast enough. The only thing that had an immediate effect was blocking the whole range outright. In case that wasn’t yet clear: I hate having to deal with this. It’s a waste of time, it doesn’t hold back the next attack coming from another range, and intervening always happens too late. But worst of all, semi-random fire fighting is just one big mood killer. I just know this won’t be enough. Having a robust anti-attacker system in place might increase the odds, but that means either resorting to hand cannons like Anubis or moving the entire hosting to Cloudflare, which would do it for me. But I don’t want to fiddle with even more moving components and configuration, nor do I want to route my visitors through tracking-enabled USA servers. That Gitea instance should be moved off-site, or better yet, I should move the migration to Codeberg to the top of my TODO list. Yet it’s sad to see that people who like fiddling with their own little servers are increasingly punished for doing so, pushing many to a centralized solution, making things worse in the long term. The internet is no longer a safe haven for software hobbyists. I could link to dozens of other bloggers who reported similar issues to further solidify my point. Another thing I’ve noticed is increased traffic with Referer headers pointing at strange, high-profile websites. It’s not like any of these giants are going to link to an article on this site.
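For what it’s worth, the range block that finally helped can be as blunt as a single firewall rule. A sketch, with a placeholder CIDR standing in for the offending Alibaba/Singapore ranges, and the jail name in the Fail2ban variant being an assumption about the local setup:

    # Drop the entire offending range before it ever reaches Nginx or Fail2ban
    iptables -I INPUT -s 198.51.100.0/24 -j DROP

    # Or route the ban through Fail2ban so it stays visible in one place
    fail2ban-client set nginx-botsearch banip 198.51.100.23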
I don’t understand what the purpose of spoofing the Referer header is, besides upping the hit count. However worse things might get, I refuse to give in. It’s just like 50 Cent said: Get Hostin’ Or Die Tryin’. Related topics: scraping / AI. By Wouter Groeneveld on 31 October 2025.

0 views
Farid Zakaria 1 weeks ago

Nix derivation madness

I’ve written a bit about Nix and I still face moments where foundational aspects of the package system confound and surprise me. Recently I hit an issue that stumped me, as it broke some basic comprehension I had of how Nix works. I wanted to produce the build and runtime graph for the Ruby interpreter. I have Ruby, but I don’t seem to have its derivation (.drv) file present on my machine. No worries, I think I can just fetch it from the NixOS cache. I guess the NixOS cache doesn’t seem to have it. 🤷 This was actually perplexing me at this moment. In fact there are multiple Discourse posts about it. My mental model of Nix, though, is that I must have first evaluated the derivation (drv) in order to determine the output path to even substitute. How could the NixOS cache not have it present? Is this derivation wrong somehow? Nope. This is the derivation Nix believes produced this Ruby binary, according to the local database. 🤨 What does the binary cache itself say? Even the cache itself thinks this particular derivation produced this particular Ruby output. What if I try a different command? So I seem to have a completely different derivation that resulted in the same output, which is not what the binary cache announces. WTF? 🫠 Thinking back to a previous post, I remember touching on modulo fixed-output derivations. Is that what’s going on? Let’s investigate from first principles. 🤓 Let’s first create our fixed-output derivation. ☝️ Since this is a fixed-output derivation (FOD), the produced path will not be affected by changes to the derivation beyond the declared output contents. Now we will create a derivation that uses this FOD. The output path for this derivation will change on changes to the derivation, except when the only thing that changes is the derivation path of the FOD. This is in fact what makes it “modulo” the fixed-output derivations. Let’s test this all out by changing our derivation. Let’s do this by just adding some garbage attribute to the derivation. What happens now? The path of the derivation itself has changed, but the output path remains consistent. What about the derivation that leverages it? It also got a new derivation path, but the output path remained unchanged. 😮 That means changes to fixed-output derivations didn’t cause new outputs in either derivation, but they did create a complete new tree of .drv files. 🤯 That means in nixpkgs, changes to fixed-output derivations can give them new store paths for their .drv files but leave dependent derivations with the same output path. If the output path had already been stored in the NixOS cache, then we lose the link between the new .drv and this output path. 💥 The amount of churn that we are creating in derivations was unbeknownst to me. It can get even weirder! This example came from @ericson2314. We will duplicate the FOD to another file whose only difference is the value of the garbage attribute. Let’s now use both of these in our derivation. We can now instantiate and build this as normal. What is weird about that? Well, let’s take the JSON representation of the derivation and remove one of the inputs. We can do this because although there are two input derivations, we know they both produce the same output! Let’s load this modified derivation back into the store and build it again! We got the same output. Not only can many different derivations map to the same output path, but we can also take certain derivations and completely change them by removing inputs and still get the same output! 😹 The road to Nix enlightenment is no joke and full of dragons.
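Here is a minimal reconstruction of that experiment, not the author’s exact files; the builder, hash, and attribute names are assumptions chosen so the example stays self-contained.

    # fod.nix -- a fixed-output derivation: its output path depends only on the declared hash
    cat > fod.nix <<'EOF'
    derivation {
      name = "fod";
      system = builtins.currentSystem;
      builder = "/bin/sh";
      args = [ "-c" "echo hello > $out" ];
      outputHashAlgo = "sha256";
      outputHashMode = "flat";
      outputHash = "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03";
      garbage = "1";   # changing this changes the .drv path but NOT the output path
    }
    EOF

    # use.nix -- a normal derivation depending on the FOD; its output path is computed
    # "modulo" the FOD's .drv, i.e. only the FOD's declared hash matters
    cat > use.nix <<'EOF'
    derivation {
      name = "uses-fod";
      system = builtins.currentSystem;
      builder = "/bin/sh";
      args = [ "-c" "cat ${import ./fod.nix} > $out" ];
    }
    EOF

    nix-instantiate use.nix    # prints the .drv path
    nix-build use.nix          # prints the output path
    # Now edit `garbage` in fod.nix and re-run: both .drv paths change, both output paths stay the same.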

2 views
Maurycy 2 weeks ago

You already have a git server:

If you have a git repository on a server with ssh access, you can just clone it: You can then work on it locally and push your changes back to the origin server. By default, git won’t let you push to the branch that is currently checked out, but this is easy to change: This is a great way to sync code between multiple computers or to work on server-side files without laggy typing or manual copying.

If you want to publish your code, just point your web server at the git repo: … although you will have to run this command server-side to make it cloneable: That’s a lot of work, so let’s set up a hook to do that automatically: Git hooks are just shell scripts, so they can do things like running a static site generator: This is how I’ve been doing this blog for a while now.

It’s very nice to be able to type up posts locally (no network lag), and then push them to the server and have the rest handled automatically. It’s also backed up by default: If the server breaks, I’ve still got the copy on my laptop, and if my laptop breaks, I can download everything from the server. Git’s version tracking also prevents accidental deletions, and if something breaks, it’s easy to figure out what caused it.
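For a concrete sketch of the hook idea above: hooks can be any executable, not just shell scripts, so a hypothetical post-receive hook in Python might look like this. The repository path, branch name, and the my-site-generator command are all made up for illustration.

```python
#!/usr/bin/env python3
# Hypothetical hooks/post-receive for a bare repo: check out the pushed
# branch into a scratch work tree and rebuild the site from it.
import subprocess
import tempfile

BARE_REPO = "/srv/git/blog.git"   # made-up path to the bare repository
PUBLISH_DIR = "/var/www/blog"     # made-up path the web server serves

with tempfile.TemporaryDirectory() as worktree:
    # Populate the scratch work tree from the branch that was just pushed.
    subprocess.run(
        ["git", "--git-dir", BARE_REPO, "--work-tree", worktree, "checkout", "-f", "main"],
        check=True,
    )
    # Swap in hugo, jekyll, or your own script here.
    subprocess.run(
        ["my-site-generator", "--input", worktree, "--output", PUBLISH_DIR],
        check=True,
    )
```

Remember to mark the hook executable (chmod +x hooks/post-receive), otherwise git will ignore it.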

0 views
daniel.haxx.se 2 weeks ago

On 110 operating systems

In November 2022, after I had been keeping track and adding names to this slide for a few years already, we could boast about curl having run on 89 different operating systems, and only one year later we celebrated having reached 100 operating systems. This time I am back with another update, and here is the official list of the 110 operating systems that have run curl. I don’t think curl is unique in having reached this many operating systems, but I think it is a rare thing, and I think it is even rarer that we have actually tracked all these names down to have them mentioned – and counted. For several of these cases, no patches or improvements were ever sent back to the curl project, and we don’t know how much or how little work was required to make them happen. The exact definition of “operating system” in this context is vague, but separate Linux distributions do not count as another operating system. There are probably more systems to include. Please tell me if you have run curl on something not currently mentioned.

0 views

A Fire is Not an Emergency for the Fire Department

Yesterday (Oct 20, 2025), AWS had a major incident. A lot of companies had a really bad day. I’ve been chatting with friends at impacted companies and a few had similar stories. Their teams were calm while their execs were not. There was a lot of “DO SOMETHING!” or “WHY AREN’T YOU FREAKING OUT?!” Incidents happen, but panic doesn’t do anything to solve them. Fires aren’t emergencies for the fire department, and incidents shouldn’t be for engineering responders either.

In this case, very little was likely in their control. For all but the most trivial of systems, you’re not going to migrate off of AWS in the middle of an incident and expect a good outcome. But the principle stands: the worst thing you can do is freak out, especially when the incident is self-inflicted. Calm leadership is going to be better than panic every time.

Imagine you lit your stove on fire and called the fire department to put it out. Which of these two responses would you prefer after the fire chief hops out of the truck:

- [Screaming] “Oh dear! A fire! Somebody get the hose! Agggh!”
- “We see the problem, we know how to deal with it, we’ll handle it and give you an update as soon as we can.”

I’m sure the first response would have you questioning who hired this person to be the fire chief and would not give you a lot of confidence that your whole house wasn’t going to burn down. Similarly, the second response might let you be able to exhale and leave things to the professionals.

Like fires, not all incidents are created equally. A pan on fire on the stove is much more routine than a large wildfire threatening a major city. I would argue the need for calm leadership gets MORE important the bigger the impact of the incident. A small incident might not attract much attention, while a big incident certainly will. In yesterday’s AWS incident it would most definitely not be reassuring to see whoever was in charge of the response making an announcement about how bad it was and how they were on the verge of a panic attack. Same goes for wildfires, and same goes for your incidents.

When you’re in a leadership position during an incident, it serves no useful purpose to yell and scream or jump up and down or tear your hair out. Your responders want normal operations to be restored as much as you do, especially if it’s off-hours and they’ve had their free time or sleep interrupted. Shouting isn’t going to make them solve the problems any faster.

Now and then someone else will react negatively to someone in the Incident Commander or the engineering management seat behaving in a calm and collected way during an incident. Usually it’s someone who is in the position to get yelled at by a customer, like an account rep or a non-technical executive. They’re under a lot of pressure because incidents can cost your customers money, can result in churn or lost sales opportunities, and all of those things can affect the bottom line or career prospects of people who have no control over the actual technical issue. So they’re stressed about a lot of factors that are not the incident per se, but might be downstream consequences of it, much in the same way you would be after setting your kitchen on fire. Sometimes this comes out as anger or frustration that you, the technology person responsible for this, are not similarly stressed. This is understandable. Keep in mind too that to them, they are utterly dependent on you and the tech team in this moment. They don’t know how to fix whatever’s broken, but they’re still going to hear about it from customers and prospects who might not be nice about it. This can cause a feeling of powerlessness that results in expressions of frustration.

This is not a reason to change your demeanour. The best thing you can do is keep the updates flowing, even if there’s no new news. People find reassurance in knowing the team is still working the problem. Keep the stakeholder updates jargon-free and high-level. Show empathy that you understand the customer impact and the potential consequences of that, but reassure them that the focus right now is restoring normal operations and, if appropriate, you’ll be available to talk through what happened after the dust settles.

A negative reaction to the calm approach is usually born out of a fear that ‘nothing is happening’. The truth is often a big part of incident response is waiting. Without the technical context to understand what’s really going on, some people just want to see something happening to reassure themselves. You can mitigate this with updates:

- Here’s what we’ve done.
- Here’s what we’re doing (or waiting on the results of).
- Here’s what we’ll likely do next.

Use your tools to broadcast this as often as you can and you’ll tamp down on a lot of that anxiety.

Big cloud provider incidents are fairly rare. Self-inflicted incidents are not. The responders trying to fix a self-inflicted problem are very likely to be the ones who took an action that caused or contributed to the outage. Even in a healthy, blameless culture this is going to cause some stress and shame, and in a blameful, scapegoating culture it’s going to be incredibly difficult. Having their manager, manager’s manager, or department head freaking out is only going to compound the problem. Highly stressed people can’t think straight and make mistakes. Set a tone of calm, methodical response and you’ll get out from under the problem a lot faster.

This might be culturally hard to achieve if you have very hands-on execs who want to be very involved, even when their involvement is counterproductive (hint: they don’t see it that way even if you do). But to the extent you can, have one communication channel for the responders and a separate one for updates for stakeholders. Mixing the two just distracts the responders and lengthens the time-to-resolution. The key to this being successful is not neglecting the updates. I recommend using a timer and giving an update every 10 minutes whether there’s news or not. People will appreciate it and your responders can focus on responding.

I don’t want this to be a “how to run a good incident” post. There are plenty of comprehensive materials out there that do a better job than I would here. I do want to focus on how to be a good leader in the context of incidents. Like many things, it comes down to culture and practice. Firefighters train to stay calm when responding to fires. They’re methodical and follow well-known good practices. To make this as easy as possible on yourself:

- Set a good example when you’re in the incident commander seat (whether in an official or ad-hoc capacity).
- Run practice incidents where the stakes are low.
- Build up runbooks and checklists to make response activities automatic.
- Define your response process and roles.
- Define your communication channels so everyone knows where to find updates.

If you can, work “fires aren’t emergencies for the fire department” into your company lexicon and remind people that in the context of incidents, the engineering team are the fire department.

"Wildfire" by USFWS/Southeast is marked with Public Domain Mark 1.0.

0 views
blog.philz.dev 2 weeks ago

Build Artifacts

This is a quick story about a thing I miss, that doesn't seem to have a default solution in our industry: a build artifact store. In a previous world, we had one. You could query it for a "global build number" and it would assign you a build number (and an S3 bucket writable to you). You could then produce a build and store it back into the build database, with both immutable metadata (what it was, when it was built, from what commits, etc.) and mutable metadata (tags). You could then query the build database for the builds that match your criteria. Perhaps you want the latest build of Elephant that ran on Slackware and passed the nightly tests? This could be used both to cobble together tiers of QA and as a build artifact cache. It was a super simple service, cobbled together in a few files of Python, and it held up to our needs quite well. What do you use? Surely Git LFS or Artifactory aren't the end states here.
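For flavor, here is a minimal sketch of that kind of build database using sqlite3. The schema, field names, and tag strings are invented for illustration; a real service would also hand out a storage location (say, an S3 prefix) per build and sit behind an HTTP API.

```python
# Minimal sketch of a build-artifact database: monotonically increasing build
# numbers, immutable metadata, mutable tags, and "latest matching" queries.
import json
import sqlite3

db = sqlite3.connect("builds.db")
db.execute("""CREATE TABLE IF NOT EXISTS builds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,   -- the "global build number"
    product TEXT, platform TEXT,
    metadata TEXT,                          -- immutable: commits, timestamps, ...
    tags TEXT DEFAULT '[]'                  -- mutable: "passed-nightly", ...
)""")

def register_build(product, platform, metadata):
    cur = db.execute("INSERT INTO builds (product, platform, metadata) VALUES (?, ?, ?)",
                     (product, platform, json.dumps(metadata)))
    db.commit()
    return cur.lastrowid

def tag_build(build_id, tag):
    (tags,) = db.execute("SELECT tags FROM builds WHERE id = ?", (build_id,)).fetchone()
    tags = sorted(set(json.loads(tags)) | {tag})
    db.execute("UPDATE builds SET tags = ? WHERE id = ?", (json.dumps(tags), build_id))
    db.commit()

def latest_build(product, platform, tag):
    rows = db.execute("SELECT id, tags FROM builds WHERE product = ? AND platform = ? "
                      "ORDER BY id DESC", (product, platform))
    for build_id, tags in rows:
        if tag in json.loads(tags):
            return build_id
    return None

# e.g. "the latest build of Elephant on Slackware that passed the nightly tests"
bid = register_build("Elephant", "Slackware", {"commit": "abc123"})
tag_build(bid, "passed-nightly")
print(latest_build("Elephant", "Slackware", "passed-nightly"))
```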

0 views
Fredrik Meyer 3 weeks ago

What I self host

I’ve always liked reading blogs, and have used several feed readers in the past (Feedly, for example). For a long time I was thinking it would be fun to write my own RSS reader, but instead of diving into the challenge, I did the next best thing, which was finding a decent one and learning how to self host it. In this post I will tell you about the self hosting I do, and end by sketching the setup.

Miniflux is a “minimalist and opinionated feed reader”. I host my own instance at https://rss.fredrikmeyer.net/ It is very easy to set up using Docker, see the documentation. I do have lots of unread blog posts 🤨.

I host a Grafana instance, also using Docker. What first triggered me to make this instance was an old project (that I want to revive one day): I had a Raspberry Pi with some sensors measuring gas and dust at my previous apartment, and a Grafana dashboard showing the data. It was interesting seeing how making food at home had a measurable impact on volatile gas levels. Later I discovered the Strava datasource plugin for Grafana. It is a plugin that lets Grafana connect to the Strava API, and gives you summaries of your Strava activities. Below is an example of how it looks for me. One gets several other dashboards included in the plugin.

One day YourSpotify was mentioned on HackerNews. It is an application that connects to the Spotify API, and gives you aggregated statistics of artists and albums you’ve listened to over time (why they chose to store the data in MongoDB I have no idea!). It is interesting to note that I have listened to less and less music over the years (I have noticed that the more experience I have at work, the less actual programming I do). Because I didn’t bother setting up DNS, this one is only exposed locally, so I use Tailscale to be able to access YourSpotify. This works by having Tailscale installed on the host, and connecting to the Tailnet. It lets me access the application by entering its Tailnet address in the browser.

I have a problem with closing tabs, and a tendency to hoard information (don’t get me started on the number of unread PDF books on my Remarkable!). So I found Linkding, a bookmark manager, which I access at https://links.fredrikmeyer.net/bookmarks/shared . In practice it is a graveyard for interesting things I never have the time to read, but it gives me some kind of peace of mind.

I have an ambition of making the hosting “production grade”, but at the moment this setup is a mix of practices of varying levels of quality. I pay for a cheap droplet at DigitalOcean, about $5 per month, and an additional dollar for backup. The domain name and DNS are from Domeneshop. SSL certificates are from Let’s Encrypt. All the apps run in different Docker containers, with ports exposed. Nginx proxies these ports and redirects to HTTPS. I manage most of the configuration using Ansible. Here I must give thanks to Jeff Geerling’s book Ansible for DevOps, which was really good. So if I change my Nginx configuration, I edit it on my laptop and run the playbook to let Ansible do its magic. In this case, “most” means the Nginx configuration and Grafana. Miniflux and YourSpotify are managed by hand directly on the host. Ideally, I would like to have a 100% “infrastructure as code” approach, but hey, who has time for that!

It would be nice to combine AI and user manuals of house appliances to make an application that lets you ask questions like “what does the red light on my oven mean?”. Or write my own Jira, or… Lots of rabbit holes in this list on Github.
Until next time!

0 views
alikhil 3 weeks ago

kubectl-find - UNIX-find-like plugin to find resources and perform action on them

Recently, I developed a plugin for kubectl, inspired by the UNIX find utility, to find resources and perform actions on them. A few days ago the number of stars in the repo reached 50! I think it’s a good moment to tell more about the project.

As an engineer who works with Kubernetes every day, I use kubectl a lot. Actually, more than 50% of my terminal history commands are related to Kubernetes. Here are my top 10 commands: Run this command if you are curious about your own most popular commands in terminal history. I use kubectl to check the status of pods, delete orphaned resources, trigger syncs, and much more. When I realized half my terminal history was just kubectl commands, I thought — there must be a better way to find things in Kubernetes without chaining pipes through other tools. And I imagined how nice it would be to have a UNIX find-like tool — something that lets you search for exactly what you need in the cluster and then perform actions directly on the matching resources. I searched for a krew plugin like this, but there wasn’t one. For that reason, I decided to develop one!

I used sample-cli-plugin as a starting point. Its clean repository structure and straightforward design make it a great reference for working with the Kubernetes API. Additionally, it allows easy reuse of the extensive Kubernetes client libraries. Almost everything in the Kubernetes ecosystem is written in Go, and this plugin is no exception — which is great, as it allows building binaries for a wide range of CPU architectures and operating systems.

Use a filter to find any resource by any custom condition; filtering uses the gojq implementation of jq. By default, found resources are printed to stdout. However, there are flags that you can provide to perform an action on the found resources:

- to delete them
- to patch them with provided JSON
- to run a command on pods

Use krew to install the plugin. I’m currently working on adding:

- JSON/YAML output format
- More filters
- Saved queries

If you’re tired of writing long chains, give the plugin a try — it’s already saved me countless keystrokes. Check out the repo ⭐ github.com/alikhil/kubectl-find and share your ideas or issues — I’d love to hear how you use it!
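For comparison, the manual “find, then act” dance without the plugin looks roughly like this in Python, using the official kubernetes client. This is a sketch, not the plugin’s own code: it assumes pip install kubernetes, a reachable cluster, the default namespace, and a Pending-pod condition chosen purely as an example.

```python
# Rough sketch of doing by hand what a find-like kubectl plugin automates:
# find pods matching a condition, then act on the matches.
from kubernetes import client, config

config.load_kube_config()        # uses your current kubeconfig context
v1 = client.CoreV1Api()

# "Find": pods in the default namespace that are stuck in Pending.
stuck = [
    p for p in v1.list_namespaced_pod("default").items
    if p.status.phase == "Pending"
]

# "Action": delete them (printed as a dry run here; uncomment to actually delete).
for pod in stuck:
    print("would delete", pod.metadata.name)
    # v1.delete_namespaced_pod(pod.metadata.name, "default")
```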

0 views
allvpv’s space 3 weeks ago

Environment variables are a legacy mess: Let's dive deep into them

Programming languages have rapidly evolved in recent years. But in software development, the new often meets the old, and the scaffolding the OS gives you for running new processes hasn’t changed much since Unix. If you need to parametrize your application at runtime by passing a few ad-hoc variables (without special files or a custom solution involving IPC or networking), you’re doomed to a pretty awkward, outdated interface: environment variables. There are no namespaces for them, no types. Just a flat, embarrassingly global dictionary of strings.

But what exactly are these envvars? Are they some kind of special dictionary inside the OS? If not, who owns them and how do they propagate? In a nutshell: they’re passed from parent to child. On Linux, a program must use the execve syscall to execute another program. Whether you run a command in Bash, spawn a process from Python, or launch a code editor, it ultimately comes down to execve, preceded by a fork or clone. The exec family of C functions also relies on execve. This system call takes three arguments: the path of the executable, argv (the array of command line arguments – the implicit first, “zero” argument is usually the executable name), and envp (the array of envvars, typically much longer).

By default, all envvars are passed from the parent to the child. However, nothing prevents a parent process from passing a completely different or even empty environment when calling execve! In practice, most tooling passes the environment down: Bash, Python’s subprocess, the C library, and so on. And this is what you expect – variables are inherited by child processes. That’s the point – to track the environment. Which tools do not pass the parent’s environment? For example, the login executable, used when signing into a system, sets up a fresh environment for its children.

After launching the new program, the kernel dumps the variables on the stack as a sequence of null-terminated strings which contain the envvar definitions. Here is a hex view: This static layout can’t easily be modified or extended; the program must copy those variables into its own data structure. Let’s look at how Bash, C, and Python store envvars internally. I analyzed their source code and here is a summary.

Bash stores the variables in a hashmap. Or, more precisely, in a stack of hashmaps. When you spawn a new process using Bash, it traverses the stack of hashmaps to find variables marked as exported and copies them into the environment array passed to the child. Side note: why is traversing the stack needed? Each function invocation in Bash creates a new local scope – a new entry on the stack. If you declare your variable with local, it ends up in this locally-scoped hashmap. What’s interesting is that you can export a local variable too! I wouldn’t have learned this without diving into the Bash source. My intuitive (wrong) assumption was that export automatically makes the variable global! Super interesting stuff.

The C library exposes the environment as a dynamic array, managed via getenv, setenv, and related library functions. Because it uses an array, the time complexity of lookups and updates is linear in the number of envvars. Remember – envvars are not a high-performance dictionary and you should not abuse them.

Python couples its environment to the C library, which can cause surprising inconsistencies. If you’ve programmed some Python, you’ve probably used the os.environ dictionary. On startup, os.environ is built from the C library’s environment array. But those dictionary values are NOT the “ground truth” for child processes. Rather, each change to os.environ invokes a native function, which in turn calls the C library’s putenv. Note that the propagation is one-directional: modifying os.environ will call putenv, but not the other way around. Call os.putenv directly, and os.environ won’t be updated.
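A quick sketch of that one-way coupling, matching CPython’s documented behaviour on Linux (the variable names here are arbitrary and /usr/bin/env is assumed to exist):

```python
# One-way coupling between os.environ and the C-level environment.
import os
import subprocess

os.environ["GREETING"] = "hello"   # updates the dict and calls the C library's putenv()
os.putenv("SHADOW", "hi")          # updates only the C-level environment...
print("SHADOW" in os.environ)      # ...so this prints False

# A child normally inherits the process environment, but you can hand it
# any mapping you like instead:
subprocess.run(["/usr/bin/env"], env={"ONLY_VAR": "1"})   # child sees just ONLY_VAR
```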
The Linux kernel is very liberal about the format of environment variables. For example, your C program can manipulate the environment – the global environ array – such that several variables share the same name but have different values. And when you execute a child process, it will inherit this “broken” setup. You don’t even need an equals sign separating name from value! The usual entry has the form name=value, but nothing prevents you from adding an arbitrary string to the array. The kernel happily accepts any null-terminated string as an “environment variable” definition. It just imposes size limitations:

- Single variable: 128 KiB on a typical x64 Intel CPU. This is for the whole definition – name + equals sign + value. It’s computed as 32 × PAGE_SIZE. No modern hardware uses pages smaller than 4 KiB, so you can treat it as a lower bound, unless you need to deal with some legacy embedded systems.
- Total: 2 MiB on a typical machine. This limit is shared by envvars and the command line arguments. The calculation is a bit more complicated (see the man page): on a typical system, the limiting factor is the stack size limit. Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars.

But the fact that you can do something does not mean that you should. For example, if you start Bash with the “broken” environment – duplicated names and entries without an equals sign – it deduplicates the variables and drops the nonsense. One interesting edge case is a space inside the variable name. My beloved shell – Nushell – has no problem with such an assignment. Python is fine with it, too. Bash, on the other hand, can’t reference it because whitespace isn’t allowed in variable names. Fortunately, the variable isn’t lost – Bash keeps such entries in a special hashmap and still passes them to child processes.

So what name and value can you safely use for your envvar? A popular misconception, repeated on StackOverflow and by ChatGPT, is that POSIX permits only uppercase envvars, and everything else is undefined behavior. But this is seriously NOT what the standard says:

These strings have the form name=value; names shall not contain the character ‘=’. For values to be portable across systems conforming to POSIX.1-2017, the value shall be composed of characters from the portable character set (except NUL and as indicated below). There is no meaning associated with the order of strings in the environment. If more than one string in an environment of a process has the same name, the consequences are undefined. Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2017 consist solely of uppercase letters, digits, and the <underscore> ( ‘_’ ) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities.

Yes, POSIX-specified utilities use uppercase envvars, but that’s not prescriptive for your programs. Quite the contrary: you’re encouraged to use lowercase for your envvars so they don’t collide with the standard tools. The only strict rule is that a variable name cannot contain an equals sign.
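To see the odd-name case from Python, a small sketch (the variable name is made up and /usr/bin/env is assumed):

```python
# Passing along an envvar whose name contains a space: the kernel and Python
# accept it; Bash (as described above) can't reference it but still forwards it.
import subprocess

weird_env = {"PATH": "/usr/bin:/bin", "MY VAR": "works"}   # "MY VAR" is a made-up name
subprocess.run(["/usr/bin/env"], env=weird_env)            # output includes: MY VAR=works
```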
POSIX requires compliant applications to preserve all variables that conform to that rule (no equals sign in the name). But in reality, not many applications use lowercase; the de facto etiquette in software development is still uppercase names, with UTF-8 for the values. You shouldn’t hit problems on Linux. If you want to be super safe: instead of UTF-8, use the POSIX-mandated Portable Character Set (PCS) – essentially ASCII without control characters. …and I hope it wasn’t a boring read.

0 views

Abstraction, not syntax

Alternative configuration formats solve superficial problems. Configuration languages solve the deeper problem: the need for abstraction.

0 views
Filippo Valsorda 4 weeks ago

A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises

Lack of memory safety is such a predominant cause of security issues that we have a responsibility as professional software engineers to robustly mitigate it in security-sensitive use cases—by using memory safe languages. Similarly, I have the growing impression that software supply chain compromises have a few predominant causes which we might have a responsibility as professional open source maintainers to robustly mitigate. To test this impression and figure out any such mitigations, I collected all 2024/2025 open source supply chain compromises I could find, and categorized their root cause. (If you find more, do email me!)

Since I am interested in mitigations we can apply as maintainers of depended-upon projects to avoid compromises, I am ignoring: intentionally malicious packages (e.g. typosquatting), issues in package managers (e.g. internal name shadowing), open source infrastructure abuse (e.g. using package registries for post-compromise exfiltration), and isolated app compromises (i.e. not software that is depended upon). Also, I am specifically interested in how an attacker got their first unauthorized access, not in what they did with it. Annoyingly, there is usually a lot more written about the latter than the former. In no particular order, but kind of grouped.

XZ Utils – Long-term pressure campaign on the maintainer to hand over access. Root cause: control handoff. Contributing factor: non-reproducible release artifacts.

Nx S1ingularity – Shell injection in a GitHub Action with a pull_request_target trigger and unnecessary read/write permissions 1, used to extract an npm token. Root cause: pull_request_target. Contributing factors: read/write CI permissions, long-lived credential exfiltration, post-install scripts.

Shai-Hulud – Worm behavior, using compromised npm tokens to publish packages with malicious post-install scripts, and compromised GitHub tokens to publish malicious GitHub Actions workflows. Root cause: long-lived credential exfiltration. Contributing factor: post-install scripts.

npm debug/chalk/color – Maintainer phished with an "Update 2FA Now" email. Had TOTP 2FA enabled. Root cause: phishing.

polyfill.io – Attacker purchased the CDN domain name and GitHub organization. Root cause: control handoff.

MavenGate – Expired domains and changed GitHub usernames resurrected to take control of connected packages. Root causes: domain resurrection, username resurrection.

reviewdog and tj-actions/changed-files – Contributors deliberately granted automatic write access to the GitHub Action repository 2. Malicious tag re-published to compromise a GitHub PAT of a more popular GitHub Action 3. Root cause: control handoff. Contributing factors: read/write CI permissions, long-lived credential exfiltration, mutable GitHub Actions tags.

Ultralytics – Shell injection in a GitHub Action with a pull_request_target trigger (which required read/write permissions), pivoted to the publishing pipeline via GitHub Actions cache poisoning. Compromised again later using an exfiltrated PyPI token. Root cause: pull_request_target. Contributing factors: GitHub Actions cache poisoning, long-lived credential exfiltration.

Kong Ingress Controller – GitHub Action with a pull_request_target trigger restricted to trusted users but bypassed via Dependabot impersonation 4, previously patched but still available on an old branch. GitHub PAT exfiltrated and used. Root causes: pull_request_target, Dependabot impersonation. Contributing factors: per-branch CI configuration, long-lived credential exfiltration.
Rspack – Pwn request 5 against a workflow 6 in another project, leading to a GitHub classic token of a maintainer with permissions to the web-infra-dev organization 7 (kindly confirmed via email by the Rspack Team). Similar to a previously reported and fixed vulnerability 8 in the Rspack repository. Root cause: issue_comment. Contributing factor: long-lived credential exfiltration.

eslint-config-prettier – "Verify your account" 9 npm phishing. Root cause: phishing.

num2words – "Email verification" PyPI phishing. Root cause: phishing.

@solana/web3.js – A "phishing attack on the credentials for publishing npm packages." Root cause: phishing.

rustfoundation.dev – Fake compromise remediation 10 Crates.io phishing. Unclear if successful. Root cause: phishing.

React Native ARIA & gluestack-ui – "[U]nauthorized access to publishing credentials." The colorful and long Incident Report lacks any details on the "sophisticated" entry point. Presumably an exposed npm token. Root cause: long-lived credential exfiltration(?).

lottie-player – Unclear, but mitigation involved "remov[ing] all access and associated tokens/services accounts of the impacted developer." Root cause: long-lived credential exfiltration(?) or control handoff(?).

rand-user-agent – Unclear. Malicious npm versions published; the affected company seems to have deleted the project. Presumably npm token compromise. Root cause: long-lived credential exfiltration(?).

DogWifTool – GitHub token extracted from a distributed binary. Root cause: long-lived credential exfiltration.

Surprising no one, the most popular confirmed initial compromise vector is phishing. It works against technical open source maintainers. It works against TOTP 2FA. It. Works. It is also very fixable. It’s 2025 and every professional open source maintainer should be using phishing-resistant authentication (passkeys or WebAuthn 2FA) on all developer accounts, and accounts upstream of them. Upstream accounts include email, password manager, passkey sync (e.g. Apple iCloud), web/DNS hosting, and the domain registrar. Some services, such as GitHub, require a phishable 2FA method along with phishing-resistant ones. In that case, the best option is to enable TOTP, and delete the secret or write it down somewhere safe and never ever use it—effectively disabling it. This does not work with SMS, since SIM jacking is possible even without action by the victim.

Actually surprisingly—to me—a number of compromises are due to, effectively, giving access to the attacker. This is a nuanced people issue. The solution is obviously “don’t do that” but that really reduces to the decades-old issue of open source maintenance sustainability. In a sense, since this analysis is aimed at professional maintainers who can afford it, control handoff is easily avoided by not doing it.

Kind of incredible that a specific feature has a top 3 spot, but projects get compromised by “pwn requests” all the time. The pull_request_target workflow trigger runs privileged CI with a context full of attacker-controlled data in response to pull requests. It makes a meek attempt to be safer by not checking out the attacker’s code, instead checking out the upstream target. That’s empirically not enough, with shell injection attacks causing multiple severe compromises. The zizmor static analyzer can help detect injection vulnerabilities, but it seems clear that pull_request_target is unsafe at any speed, and should just never be used. Other triggers that run privileged with attacker-controlled context should be avoided for the same reason.
The Rspack compromise, for example, was due to checking out attacker-controlled code on an issue_comment trigger if the PR receives a comment. What are the alternatives? One option is to implement an external service in a language that can safely deal with untrusted inputs (i.e. not YAML’d shell), and use webhooks. That unfortunately requires long-lived credentials (see below). GitHub itself recommends using the unprivileged pull_request trigger followed by a privileged workflow_run trigger, but it’s unclear to me how much safer that would actually be against injection attacks. Finally, since two out of three compromises were due to shell injection, it might be safer to use a proper programming language, like JavaScript with actions/github-script, or any other language accessing the context via environment variables instead of YAML interpolation. This means not using any third-party actions, as well. Allowlisting actors and read-only steps are not robust mitigations; see Read/write CI permissions and Dependabot impersonation below. Overall, none of the mitigations are particularly satisfactory, so the solution might be simply to eschew features that require pull_request_target and other privileged attacker-controlled triggers. (To be honest, I am not a fan of chatty bots on issues and PRs, so I never needed them.)

Attackers love to steal tokens. There is no universal solution, but it’s so predominant that we can consider piecemeal solutions. Long-lived credentials are only a root cause when they are accidentally exposed. Otherwise, they are a secondary compromise mechanism for lateral movement or persistence, after the attacker got privileged code execution. Mitigating the latter is somewhat less appealing because an attacker with code execution can find more creative ways to carry out an attack, but we can prune some low-hanging fruit.

Go removes the need for package registry tokens by simply not having accounts. (Instead, the go command fetches modules directly from VCS, with caching by the Go Modules Proxy and universality and immutability guaranteed by the Go Checksum Database.) In other ecosystems Trusted Publishing replaces long-lived private tokens with short-lived OIDC tokens, although there is no way to down-scope the capabilities of an OIDC token. GitHub Personal Access Tokens are harder to avoid for anything that’s not supported by GitHub Actions permissions. Chainguard has a third-party Security Token Service that trades OIDC tokens for short-lived tokens, and their article has a good list of cases in which PATs end up otherwise necessary. Given the risk, it might be worth giving up on non-critical features that would require powerful tokens. Gerrit “git cookies” (which are actually just OAuth refresh tokens for the Gerrit app) can be replaced with… well, OAuth refresh tokens, but kept in memory instead of on disk, using git-credential-oauth. They can also be stored a little more safely in the platform keychain by treating them as an HTTP password, although that’s not well documented. In the long term, it would be great to see the equivalent of Device Bound Session Credentials for developer and automated workflows.

Turns out you can just exfiltrate a token from a GitHub Actions runner to impersonate Dependabot with arbitrary PRs ??? I guess! Fine! Just don’t allowlist Dependabot. Not sure what a deeper meta-mitigation that didn’t require knowing this factoid would have been. This is also a social engineering risk, so I guess just turn off Dependabot?
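To make the environment-variables-instead-of-interpolation option above concrete, here is a minimal sketch of a script a workflow step could run. It reads the event payload from GITHUB_EVENT_PATH, which the Actions runner sets, and treats attacker-controlled fields purely as data; the specific field accessed is just an example.

```python
# Minimal sketch: read the GitHub Actions event payload as data, never
# interpolating attacker-controlled strings into a shell command.
import json
import os

with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

# Attacker-controlled fields (e.g. the PR title) stay plain Python strings here.
title = event.get("pull_request", {}).get("title", "")
print(f"PR title is {len(title)} characters long")
```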
Multiple ecosystems (Go and Maven, for example) are vulnerable to name takeovers, whether expired domain names or changed GitHub user/org names. The new owner of the name gets to publish updates for that package. From the point of view of the maintainer, the mitigation is just not to change GitHub names (at least without registering the old one), and to register critical domains for a long period, with expiration alerting.

Some CI compromises happened in contexts that could or should have been read-only. It sounds like giving GitHub Actions workflows only read permissions should be a robust mitigation for any compromise of the code they run. Unfortunately, and kind of incredibly, even a read-only workflow is handed a token that can write to the cross-workflow cache for any key. This cache is then used implicitly by a number of official actions, allowing cross-workflow escalation by GitHub Actions cache poisoning. This contradicts some of GitHub’s own recommendations, and makes the existence of a setting to make GitHub Actions read-only by default more misleading than useful. The behavior does not extend to regular pull request triggers, which are actually read-only (otherwise anyone could poison caches with a PR). GitHub simply doesn’t seem to offer a way to opt in to it. I can see no robust mitigation in the GitHub ecosystem. I would love to be wrong; this is maddening.

Two compromises propagated by injecting npm post-install scripts, to obtain code execution as soon as a dependency was installed. This can be disabled (via npm’s ignore-scripts setting), which is worth doing for defense in depth. However, it’s only useful if the dependency is not going to be executed in a privileged context, e.g. to run tests in Node.js. Go, unlike most ecosystems, considers code execution during fetch or compilation to be a security vulnerability, so it has this safety margin by default.

The XZ backdoor was hidden in a release artifact that didn’t match the repository source. It would be great if that were more detectable, in the form of reproducible artifacts. The road to a fail-closed world where systems automatically detect non-reproducing artifacts is still long, though.

How supply chain attacks usually work these days is that an attacker gets the ability to publish new versions of a package, publishes a malicious version, and waits for dependents to update (maybe with the help of Dependabot) or install the latest version ex novo. Not with GitHub Actions! The recommended and most common way to refer to a GitHub Action is by its major version, which is resolved to a git tag that is expected to change arbitrarily when new versions are published. This means that an attacker can instantly compromise every dependent workflow. This was an unforced error already in 2019, when GitHub Actions launched while Go had already shipped an immutable package system. This has been discussed many times since, and most other ecosystems have improved somewhat. A roadmap item for immutable Actions has been silent since 2022. The new immutable releases feature doesn’t apply to non-release tags, and the GitHub docs still recommend changing tags for Actions. As maintainers, we can opt in to pinning where it’s somehow still not the default. For GitHub Actions, that means pinning unreadable commit hashes, which can be somewhat ameliorated with tooling. For npm, it means pinning exact versions instead of ranges.

One compromise was due to a vulnerability that was already fixed, but had persisted on an old branch.
Any time we make a security improvement (including patching a vulnerable Action) on a GitHub Actions workflow, we need to remember to cherry-pick it to all branches, including stale ones. Can’t think of a good mitigation; just yet another sharp edge of GitHub Actions you need to be aware of, I suppose.

There are a number of useful mitigations, but the ones that appear to be as clearly a professional responsibility as memory safety are phishing-resistant authentication; not handing over access to attackers; and avoiding privileged attacker-controlled GitHub Actions triggers (e.g. pull_request_target).

This research was part of an effort to compile a Geomys Standard of Care that amongst other things mitigates the most common security risks to the projects we are entrusted with. We will publish and implement it soon; to keep up to date follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected].

On Saturday, between 250,000 and 1,000,000 people (depending on who you believe, 0.4–1.7% of the whole population of Italy) took part in a demonstration against the genocide unfolding in Gaza. Anyway, here's a picture of the Archbasilica of San Giovanni in Laterano at the end of the march.

My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!

Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

1. https://github.com/nrwl/nx/security/advisories/GHSA-cxm3-wv7p-598c#:~:text=20%20AM%20EDT-,Attack%20Vector,-Vulnerable%20Workflow
2. https://github.com/reviewdog/reviewdog/issues/2079
3. https://github.com/tj-actions/changed-files/issues/2464#issuecomment-2727020537
4. https://www.synacktiv.com/publications/github-actions-exploitation-dependabot
5. https://github.com/module-federation/core/pull/3324
6. https://github.com/module-federation/core/tree/c3aff14a4b9de2588122ec24cf456dc1fdd742f0/.github/workflows
7. https://github.com/web-infra-dev/rspack/issues/8767#issuecomment-2563345582
8. https://www.praetorian.com/blog/compromising-bytedances-rspack-github-actions-vulnerabilities/
9. https://github.com/prettier/eslint-config-prettier/issues/339#issuecomment-3090304490
10. https://github.com/rust-lang/crates.io/discussions/11889#discussion-8886064

0 views

Co-Pilots, Not Competitors: PM/EM Alignment Done Right

In commercial aviation, most people know there's a "Pilot" and a "Co-Pilot" up front (also known as the "Captain" and "First Officer"). The ranks of Captain and First Officer denote seniority, but both pilots, in practice, take turns filling one of two roles: "pilot flying" and "pilot monitoring". The pilot flying is at the controls actually operating the plane. The pilot monitoring is usually the one talking to Air Traffic Control, running checklists, and watching over the various aircraft systems to make sure they're healthy throughout the flight. Obviously, both pilots share the goal of getting the passengers to the destination safely and on time while managing the costs of fuel etc. secondarily. They have the same engines, fuel tank and aircraft systems and no way to alter that once airborne. They succeed or fail together.

It would be a hilariously bad idea to give one pilot the goal of getting to the destination as quickly as possible while giving the other pilot the goal of limiting fuel use as much as possible and making sure the wings stay on. Or, to take an even stupider example, give one pilot the goal of landing in Los Angeles and the other the goal of landing in San Francisco. Obviously that wouldn't work at all, because those goals are in opposition to each other. You can get there fast if you don't care about fuel or the long-term health of the aircraft, or you can optimize for fuel use and aircraft stress if you don't need to worry about the travel time as much. It sounds ridiculous! And yet, this is how a lot of organizations structure EM and PM goals.

An EM and PM have to use the same pool of available capacity to achieve their objectives. They are assigned a team of developers, and what that team of developers is able to do in a given period is a zero-sum game. In a modern structure, following the kinds of practices that lead to good DORA metrics, their ownership is probably going to be a mix of building new features and care and feeding of what they've shipped before. Like a plane's finite fuel tank, the capacity of the team is fixed and exhaustible. Boiled down to its essence, the job of leaders is managing that capacity to produce the best outcomes for the business. Problems arise, however, in determining what those outcomes are, because they're going to be filtered through the lens of each individual's incentive structure. Once fuel is spent, it's spent. Add more destinations without adjusting the plan, and you're guaranteeing failure.

It would not be controversial to say that a typical Product Manager is accountable and incentivized to define and deliver new features that increase user adoption and drive revenue growth. Nor would it be controversial to say that a typical Engineering Manager is accountable and incentivized to make sure that the code that is shipped is relatively error-free, doesn't crash, treats customer data appropriately, scales to meet demand, and so on. You know the drill. I think this is fine and reflects the reality that EMs and PMs aren't like pilots in that their roles are not fully interchangeable and there is some specialization in those areas they're accountable for. But if you just stop there, you have a problem. You've created the scenario where one pilot is trying to get to the destination as quickly as possible and one is trying to save fuel and keep the airplane from falling apart. You need to go a step further and coalesce all of those things into shared goals for the team.
An interview question I like to ask both prospective EM and PM candidates is "How do you balance the need to ship new stuff with the need to take care of the stuff you've already shipped?" The most common answer is something like "We reserve X% of our story points for 'Engineering Work' and the rest is for 'Product Work'." This way of doing things is a coping strategy disguised as a solution. It's the wrong framing entirely, because everything is product work. Performance is a feature, reliability is a feature, scalability is a feature. A product that doesn't work isn't a product, and so a product manager needs to care about that sort of thing too. Pretending there's a clean line between "Product Work" and "Engineering Work" is how teams can quietly drift off-course.

On the flip side, I'm not letting engineering managers off the hook either. New features and product improvements are usually key to maintaining business growth and keeping existing customers happy. All the scalability and performance in the world don't matter if you don't have users. Over-rotating on trying to get to "five nines" of uptime when you're not making any revenue is a waste of your precious capacity that could be spent on things that grow the business, and that growth will bring more opportunities for solving interesting technical challenges and more personal growth opportunities for everyone who is there. EMs shouldn't ignore the health of the business any more than PMs should ignore the health of the systems their teams own. Different hats are fine, different destinations are not.

If you use OKRs or some other cascading system of goals where the C-Suite sets some broad goals for the company and those cascade into ever more specific versions as you walk down the org chart hierarchy, what happens when you get to the team level? Do the EM and PM have separate OKRs they're trying to achieve? Put them side by side and ask yourself the following questions:

- If we had infinite capacity, are these all achievable? Or are some mutually exclusive? Example: Launch a new feature with expensive storage requirements (PM) vs. Cut infra spend in half (EM)
- Does achieving any of these objectives make another one harder? Example: Increase the number of shipped experiments and MVPs (PM) vs. Cut the inbound rate of customer support tickets escalated to the team (EM)
- Do we actually have the capacity to do all of this? Example: Ship a major feature in time for the annual customer conference (PM) vs. Upgrade our key framework which is 2 major versions behind and is going out of support (EM)

If you answer 'no' to any of those, congratulations, you've just planned your team's failure to achieve their goals. To put it another way, your planning is not finished yet! You need to keep going and narrow down the objectives to something that both leaders can get behind and do their best to commit to.

You've already got your goals side-by-side. After you've identified the ones that are in conflict, have a conversation about business value. Don't bring in the whole team yet; they're engineers and the PM might feel outnumbered. Just have the conversation between the two of you and see if you can come to some kind of consensus about which of the items in conflict are more valuable. Prioritize that one and deprioritize the one it's in opposition to. Use "what's best for the business" as your tie-breaker. That's not always going to be perfectly quantifiable, so there's a lot of subjectivity and bias that's going to enter into it. This is where alignment up the org chart is very beneficial. These concepts don't just apply at the team level.

If you find you can't agree and aren't making progress, try to bring in a neutral third party to facilitate the discussion. If you have scrum-masters, agile coaches, program managers or the like, they can often fill this role nicely. You could also consult a partner from Customer Support, who will have a strong incentive to advocate for the customer above all, to add more context to the discussion. Don't outsource the decision though; both the EM and PM need to ultimately agree (even reluctantly) on what's right for the team.

If your org, like many, has parallel product and engineering hierarchies, alignment around goals at EACH level is critical. The VP Eng and VP Product should aim to have shared goals for the entire team. Same thing at the Director level, and every other level. That way if each team succeeds, each leader up the org chart succeeds, and ultimately the customer and the business are the beneficiaries. If you don't have that, doing it at the team level alone is going to approach impossible, and you should send this post to your bosses. 😉 But in seriousness, you should spend some energy trying to get alignment up and down the org chart. If your VP Eng and VP Product don't agree on direction, your team is flying two different flight plans. No amount of team-level cleverness can fully fix that. Your job is to surface it, not absorb it. Things you can do:

- If your original set of goals conflicted, the goals at the next level up probably do too. Call that out and ask for it to be reconciled.
- Escalate when structural problems get in the way of the team achieving their goals. Only the most dysfunctional organizations would create systems that guarantee failure on purpose.

You likely set goals on a schedule. Every n months you're expected to produce a new set and report on the outcomes from the last set. That's all well and good, but I think most people who've done it a few times recognize that the act of doing it is more valuable than the artifacts the exercise produces. (Often described as "Plans are useless, planning is critical.") The world changes constantly, the knowledge you and your team have about what you're doing increases constantly, and all of these things can impact your EM/PM alignment, so it's also important to stay close, keep the lines of communication very open, and make adjustments whenever needed.

The greatest predictor of a team's success, health, and happiness is the quality of the relationship between the EM and PM. Keeping that relationship healthy, and fixing it if it starts to break down, will keep the team on the right flight path with plenty of spare fuel for surprise diversions if they're needed. Regardless of what framework you're using, or even if you're not using one at all, the team should have one set of goals they're working toward, and the EM and PM should succeed or fail together on achieving those goals. Shared goals don't guarantee a smooth landing. Misaligned ones guarantee a crash.

"A380 Cockpit" by Naddsy is licensed under CC BY 2.0.

0 views