Latest Posts (20 found)
Karan Sharma 1 week ago

AI and Home-Cooked Software

Everyone is worried that AI will replace programmers. They’re missing the real revolution: AI is turning everyone into one. I’ve been noticing a new pattern: people with deep domain knowledge but no coding experience are now building their own tools. Armed with AI assistants, they can create custom workflows in a matter of days, bypassing traditional development cycles. Are these solutions production-ready? Not even close. But they solve urgent, specific problems, and that’s what matters. Tasks that once required weeks of specialized training are quickly becoming weekend projects. This trend is happening even within the AI companies themselves. Anthropic, for example, shared how their own teams use Claude to accelerate their work. Crucially, this isn’t limited to developers. Their post details how non-technical staff now build their own solutions and create custom automations, providing a powerful real-world example of this new paradigm. Why search for a generic tool when you can build exactly what you need? This question leads to what I call ‘home-cooked software’: small, personal applications we build for ourselves, tailored to our specific needs. Robin Sloan beautifully describes building an app as making “a home-cooked meal,” while Maggie Appleton writes about “barefoot developers” creating software outside traditional industry structures. What’s new isn’t the concept but the speed and accessibility. With AI, a custom export format, a specific workflow, or the perfect integration is now an afternoon’s work. We’re entering an unprecedented era where the barrier between wanting a tool and having it has nearly vanished. But let’s be clear: the journey from a prototype to a production-ready application is as challenging as ever. In my experience, an AI can churn out a first draft in a few hours, which gets you surprisingly far. But the devil is in the details, and the last stretch of the journey – handling edge cases, ensuring security, and debugging subtle issues – can stretch into weeks. This distinction is crucial. AI isn’t replacing programmers; it’s creating millions of people who can build simple tools. There’s a significant difference. AI is fundamentally reshaping the economics of building software. Before AI, even a simple tool required a significant time investment in learning programming basics, understanding frameworks, and debugging. Only tools with broad appeal or critical importance justified the effort. Now, that effort is measured in hours, not months, and the primary barrier is no longer technical knowledge, but imagination and a clear understanding of one’s own needs. This doesn’t apply to complex or security-critical systems, where deep expertise remains essential. But for the long tail of personal utilities, automation scripts, and custom workflows, the math has changed completely. I’m talking about solving all those minor irritations that pile up: the script to reformat a specific CSV export, the dashboard showing exactly the three metrics you care about, or a script that pulls data from a personal project management tool to sync with an obscure time-tracking app. These tools might be held together with digital duct tape, but they solve real problems for real people. And increasingly, that’s all that matters. But this newfound capability isn’t free. It comes with what I call the “AI Tax”: a set of hidden costs that are rarely discussed. First, prompt engineering can be surprisingly time-consuming, especially for tasks of moderate complexity. 
While simple requests are often straightforward, anything more nuanced can become an iterative dialogue. You prompt, the AI generates a flawed output, you clarify the requirements, and it returns a new version that misses a different detail. It’s a classic 80/20 scenario: you get 80% of the way there with a simple prompt, but achieving the final 20% of correctness requires a disproportionate amount of effort in refining, correcting, and clarifying your intent to the model. Second, there’s the verification burden. Every line of AI-generated code is a plausible-looking liability. It may pass basic tests, only to fail spectacularly in production with an edge case you never considered. AI learned from the public internet, which means it absorbed all the bad code along with the good. SQL injection vulnerabilities, hardcoded secrets, race conditions—an AI will happily generate them all with complete confidence. Perhaps the most frustrating aspect is “hallucination debugging”: the uniquely modern challenge of troubleshooting plausible-looking code that relies on APIs or methods that simply don’t exist. Your codebase becomes a patchwork of different AI-generated styles and patterns. Six months later, it’s an archaeological exercise to determine which parts you wrote and which parts an AI contributed. But the most significant danger is that AI enables you to build systems you don’t fundamentally understand. When that system inevitably breaks, you lack the foundational knowledge to debug it effectively. Despite these challenges, there’s something profoundly liberating about building software just for yourself. Instead of just sketching out ideas, I’ve started building these small, specific tools. For this blog, I wanted a simple lightbox for images; instead of pulling in a heavy external library, I had Claude write a 50-line JavaScript snippet that did exactly what I needed. I built a simple, single-page compound interest calculator tailored for my own financial planning. To save myself from boilerplate at work, I created prom2grafana , a tool that uses an LLM to convert Prometheus metrics into Grafana dashboards. Ten years ago, I might have thought about generalizing these tools, making them useful for others, perhaps even starting an open source project. Today? I just want a tool that works exactly how I think. I don’t need to handle anyone else’s edge cases or preferences. Home-cooked software doesn’t need product-market fit—it just needs to fit you. We’re witnessing the emergence of a new software layer. At the base are the professionally-built, robust systems that power our world: databases, operating systems, and rock-solid frameworks. In the middle are commercial applications built for broad audiences. And at the top, a new layer is forming: millions of tiny, personal tools that solve individual problems in highly specific ways. This top layer is messy, fragile, and often incomprehensible to anyone but its creator. It’s also incredibly empowering. Creating simple software is becoming as accessible as writing. And just as most writing isn’t professional literature, most of this new software won’t be professional-grade. That’s not just okay; it’s the point. The implications are profound. Subject-matter experts can now solve their own problems without waiting for engineering resources, and tools can be hyper-personalized to a degree that is impossible for commercial software. This unlocks a wave of creativity, completely unconstrained by the need to generalize or find a market. 
Yes, there are legitimate concerns. Security is a real risk, though the profile changes when a tool runs locally on personal data with no external access. We’re creating personal technical debt, but when a personal tool breaks, the owner is the only one affected. They can choose to fix it, rebuild it, or abandon it without impacting anyone else. Organizations, on the other hand, will soon have to grapple with the proliferation of incompatible personal tools and establish new patterns for managing them. But these challenges pale in comparison to the opportunities. The barrier between user and creator is dissolving. We’re entering the age of home-cooked software, where building your own tool is becoming as natural as cooking your own meal. The kitchen is open. What will you cook?

Karan Sharma 1 week ago

State of My Homelab 2025

For the past five years, I have maintained a homelab in various configurations. This journey has served as a practical exploration of different technologies, from Raspberry Pi clusters running K3s to a hybrid cloud setup and eventually a cloud-based Nomad setup . Each iteration provided valuable lessons, consistently highlighting the operational benefits of simplicity. This article details the current state of my homelab. A primary motivation for this build was to dip my toes into “actual” homelabbing—that is, maintaining a physical server at home. The main design goal was to build a dedicated, reliable, and performant server that is easy to maintain. This led me to move away from complex container orchestrators like Kubernetes in favor of a more straightforward Docker Compose workflow. I will cover the hardware build, software architecture, and the rationale behind the key decisions. After considerable research, I selected components to balance performance, power efficiency, and cost. The server is designed for 24/7 operation in a home environment, making noise and power consumption important considerations. My previous setups involved Kubernetes and Nomad, but the operational overhead proved unnecessary for my use case. I have since standardized on a Git-based, Docker Compose workflow that prioritizes simplicity and transparency. The core of the system is a Git repository that holds all configurations. Each service is defined as a self-contained “stack” in its own directory. The structure is organized by machine, making it easy to manage multiple environments: This modular approach allows me to manage each application’s configuration, including its and any related files, as an independent unit. Deployments are handled by a custom script, with a providing a convenient command-runner interface. The process is fundamentally simple: Each machine’s connection settings ( , , ) are defined in its file. This file can also contain and hooks for custom actions. The makes daily operations trivial: This system provides fine-grained control over deployments, with support for actions like , , , , and (which also removes persistent volumes). To keep the system consistent, I follow a few key patterns: The homelab comprises three distinct machines to provide isolation and redundancy. This distributed setup isolates my home network from the public internet and ensures that critical public services remain online even if the home server is down for maintenance. The following is a breakdown of the services, or “stacks,” running on each machine. A few key services that are central to the homelab are detailed further in the next section. I came across Technitium DNS after seeing a recommendation from @oddtazz , and it has been a revelation. For anyone who wants more than just basic ad blocking from their DNS server, it’s a game-changer. It serves as both a recursive and authoritative server, meaning I don’t need a separate tool like to resolve from root hints. The level of configuration is incredible—from DNSSEC, custom zones, and SOA records to fine-grained caching control. The UI is a bit dated, but that’s a minor point for me given the raw power it provides. It is a vastly underrated tool for any homelabber who wants to go beyond Pi-hole or AdGuard Home. For a long time, I felt that monitoring a homelab meant spinning up a full Prometheus and Grafana stack. Beszel is the perfect antidote to that complexity. 
It provides exactly what I need for basic node monitoring—CPU, memory, disk, and network usage—in a simple, lightweight package. It’s incredibly easy to set up and provides a clean, real-time view of my servers without the overhead of a more complex system. For a simple homelab monitoring setup, it’s hard to beat.

While Beszel monitors the servers from the inside, Gatus watches them from the outside. Running on an independent Hetzner VM, its job is to ensure my services are reachable from the public internet. It validates HTTP status codes, response times, and more. This separation is crucial; if my entire home network goes down, Gatus is still online to send an alert to my phone. It’s the final piece of the puzzle for robust monitoring, ensuring I know when things are broken even if the monitoring service itself is part of the outage.

Data integrity and recoverability are critical, so my strategy is built on layers of redundancy and encryption. I chose BTRFS for its modern features (listed at the end of this section). The two 4TB drives are mirrored in a RAID 1 array, providing redundancy against a single drive failure. The entire array is encrypted using LUKS2, with the key stored on the boot SSD for automatic mounting. This protects data at rest in case of physical theft or drive disposal. The relevant mount options live in /etc/fstab.

RAID does not protect against accidental deletion, file corruption, or catastrophic failure, so my backup strategy follows the 3-2-1 rule. Daily, automated backups are managed by systemd timers. Backups are encrypted and sent to Cloudflare R2, providing an off-site copy; R2 was chosen for its zero-cost egress, which is a significant advantage for restores. The backup script covers critical application data and the Docker Compose configurations. Each backup run reports its status to a healthchecks.io endpoint, which sends a push notification on failure. I really appreciate its generous free tier, which is more than sufficient for my needs.

This homelab represents a shift in philosophy from exploring complexity to valuing simplicity and reliability. The upfront hardware investment of ~$1,200 is offset by eliminating recurring cloud hosting costs and providing complete control over my data and services. For those considering a homelab, my primary recommendation is to start with a simple, well-understood foundation. A reliable machine with a solid backup strategy is more valuable than a complex, hard-to-maintain cluster. The goal is to build a system that serves your needs, not one that you serve.

For reference, the hardware choices break down as follows:

- CPU: The Ryzen 5 7600X provides a strong price-to-performance ratio. Its 6 cores offer ample headroom for concurrent containerized workloads and future experimentation.
- Storage: The boot drive is a 500GB NVMe for fast OS and application performance. The primary storage consists of two 4TB HDDs in a BTRFS RAID 1 configuration. To mitigate the risk of correlated failures, I chose drives from different manufacturers (WD and Seagate) purchased at different times.
- RAM: 32GB of DDR5-6000 provides sufficient memory for a growing number of services without risking contention.
- Case & PSU: The ASUS Prime AP201 is a compact MicroATX case with a clean aesthetic suitable for a home office. The Corsair SF750 (80+ Platinum) PSU was chosen for its efficiency and to provide capacity for a future GPU for local LLM or transcoding workloads.

The deployment flow itself boils down to two steps (a sketch follows below):

- Sync: copy the specified stack’s directory from the local Git repository to a destination directory on the target machine.
- Execute: run the appropriate command on the remote machine.
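To illustrate those two steps, the whole thing is little more than the following. This is a hedged sketch rather than my actual script; the host names, remote paths, and action names are placeholders.

```bash
#!/usr/bin/env bash
# Illustrative deploy helper: sync a stack's directory, then run Docker Compose remotely.
# Hosts, paths, and actions are placeholders, not the real setup.
set -euo pipefail

host="$1" stack="$2" action="${3:-deploy}"

# Sync: copy the stack's directory from the local Git repo to the target machine.
rsync -az --delete "hosts/${host}/${stack}/" "${host}:/opt/stacks/${stack}/"

# Execute: run the matching Docker Compose command on the remote machine.
case "$action" in
  deploy)  ssh "$host" "cd /opt/stacks/${stack} && docker compose up -d" ;;
  stop)    ssh "$host" "cd /opt/stacks/${stack} && docker compose stop" ;;
  destroy) ssh "$host" "cd /opt/stacks/${stack} && docker compose down -v" ;; # also removes volumes
esac
```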
The key patterns mentioned earlier:

- Data Persistence: Instead of using Docker named volumes, I use host bind mounts. All persistent data for a service is stored in a dedicated directory on the host. This makes backups and data management more transparent.
- Reverse Proxy Network: The Caddy stack defines a shared Docker network. Other stacks that need to be exposed to the internet are configured to join this network. This allows Caddy to discover and proxy them without exposing their ports on the host machine. I have written about this pattern in detail in a previous post.
- Port Exposure: Services behind the reverse proxy only make their ports available to Caddy within the Docker network. I avoid publishing ports directly on the host unless absolutely necessary.

The three machines:

- floyd-homelab-1 (Primary Server): The core of the homelab, running on the AMD hardware detailed above. It runs data-intensive personal services (e.g., Immich, Paperless-ngx) and is accessible only via the Tailscale network.
- floyd-pub-1 (Public VPS): A small cloud VPS that hosts public-facing services requiring high availability, such as DNS utilities, analytics, and notification relays.
- floyd-monitor-public (Monitoring VPS): A small Hetzner VM running Gatus for health checks. Its independence ensures that I am alerted if the primary homelab or home network goes offline.

The stacks running across them:

- Actual: A local-first personal finance and budgeting tool.
- Caddy: A powerful, enterprise-ready, open source web server with automatic HTTPS.
- Gitea: A Git service for personal projects.
- Glance: A dashboard for viewing all my feeds and data in one place.
- Immich: A photo and video backup solution, directly from my mobile phone.
- Karakeep: An app for bookmarking everything, with AI-based tagging and full-text search.
- Owntracks: A private location tracker for recording my own location data.
- Paperless-ngx: A document management system that transforms physical documents into a searchable online archive.
- Silverbullet: A Markdown-based knowledge management and note-taking tool.
- Caddy: Reverse proxy for the services on this node.
- Beszel-agent: The agent for the Beszel monitoring platform.
- Caddy: Reverse proxy for the services on this node.
- Cloak: A service to securely share sensitive text with others.
- Doggo: A command-line DNS Client for Humans, written in Golang.
- Ntfy: A self-hosted push notification service.
- prom2grafana: A tool to convert Prometheus metrics to Grafana dashboards and alert rules using AI.
- Umami: A simple, fast, privacy-focused alternative to Google Analytics.

And the BTRFS features that made it the choice for primary storage (an illustrative mount entry follows below):

- Checksumming: Protects against silent data corruption.
- Copy-on-Write: Enables instantaneous, low-cost snapshots.
- Transparent Compression: Saves space without significant performance overhead.
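For completeness, the kind of fstab entry this storage setup implies looks roughly like this; the UUID, mount point, and compression level are illustrative rather than copied from the server.

```
# /etc/fstab (illustrative): the LUKS-opened BTRFS RAID 1 array, mounted with compression
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  defaults,noatime,compress=zstd:3  0  0
```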

Karan Sharma 2 months ago

TIL: WireGuard's Misleading "No Route to Host" Error

I recently spent some time debugging a WireGuard tunnel that was acting weird. The handshake was successful and pings worked perfectly, but any TCP connection failed with “No route to host”. A classic misleading error message: the routing was fine.

The setup: a server with a public IP running WireGuard, and a client that connects and gets an address inside the tunnel. I wanted to proxy TCP traffic from the server to a service running on the client. Diagnostics showed contradictory results. Routing worked fine: the server’s routing table correctly directed traffic into the tunnel. Pings were successful. Yet TCP failed immediately.

The key insight: ICMP was being treated differently than TCP. This pointed to a firewall issue, not routing. The “no route to host” error was the kernel interpreting an ICMP “Destination Unreachable” message from the remote peer. But when I captured packets on the client, things got stranger: the packets arrived successfully through the WireGuard interface, but there was no response at all, neither a success nor an explicit rejection. The packets were being silently dropped.

The client was running Arch Linux with firewalld. My mistake was trying to manage firewall rules with ad-hoc commands in the WireGuard interface script. firewalld was the active firewall manager, and when a new interface comes up, it needs to know which “zone” the interface belongs to. If unassigned, the interface gets handled by a restrictive default policy that silently drops unsolicited TCP packets while allowing ICMP (pings).

The fix: don’t add ad-hoc rules. Just assign the WireGuard interface to the right zone; for an internal tunnel, a trusted zone works well. TCP connections worked instantly after this.

TL;DR: If WireGuard pings work but TCP fails with “no route to host”, it’s probably a client firewall issue. On firewalld systems, assign the WireGuard interface to the right zone instead of messing with raw firewall rules.
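In firewalld terms, the fix is a one-liner; the interface and zone names below are assumptions (the usual wg0 and the trusted zone), so adjust them for your setup.

```bash
# Assign the WireGuard interface to a permissive zone and reload firewalld.
sudo firewall-cmd --permanent --zone=trusted --add-interface=wg0
sudo firewall-cmd --reload
```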

Karan Sharma 5 months ago

Announcing Logchef

So, for the last 3-4 months, I’ve been busy building Logchef. This tool basically grew straight out of my day job managing logs at Zerodha, where I’ve been managing logs for almost half a decade. I wanted to share a bit about how Logchef came to be.

Like many, we journeyed through the complexities of ELK (a management nightmare) and found its OSS fork, OpenSearch, didn’t quite hit the mark for us either. We eventually found solid ground with Clickhouse, as detailed on our tech blog: Logging at Zerodha. However, as I noted in that post, while Metabase served us well for analytics, it wasn’t the ideal UI specifically tailored for log analysis against Clickhouse: “While Metabase has served us well so far, there is certainly room for improvement, especially regarding a more tailored UI for Clickhouse… we plan to continue exploring potential solutions.”

Here’s a distilled version of the common pain points we experienced:

- Ad-hoc Querying Was Painful: Writing raw Clickhouse SQL in Metabase for quick log searches felt cumbersome and slow. Even modifying existing complex query templates was error-prone – a tiny syntax mistake could lead to minutes spent debugging the query itself, especially stressful during production incidents.
- Disconnect Between Visualization and Raw Logs: A common workflow is to visualize trends (e.g., errors over time) and then drill down into the specific logs causing those trends. In Metabase, this often meant writing two separate queries – one for aggregation/visualization and another (often rewritten from scratch) just to see the raw log lines. Metabase’s row limits (around 2k) further complicated viewing the full context of raw logs after filtering. The intuitive “slice and drill-down” experience many log tools offer was missing.
- UI/UX Annoyances: Several smaller but cumulative issues added friction: difficulty selecting precise time ranges like “last 6 hours,” viewing logs immediately surrounding a relevant event, columns getting truncated, and limited timestamp precision display in results. Though there are some workarounds, they often felt like band-aids rather than solutions.

TL;DR: The Metabase interface wasn’t optimized for the specific task of log exploration. Debugging sessions that should have taken minutes were stretching significantly longer. Querying and exploring logs felt clunkier than it needed to be. And one fine day, I decided to stop just wishing for a better tool and start building one.

When I first started prototyping, I kept the scope pretty tight: just build a viewer for the standard OTEL schema. OTEL’s flexible enough, but a quick chat with Kailash sparked what turned out to be a game-changing idea: make Logchef schema-agnostic. And that really became the core concept. Basically, Logchef lets you connect it straight to your existing Clickhouse log tables, no matter their structure. All it really needs is a timestamp field. Bring your own custom schemas, stick with the OTEL standard, or even adapt it to your own needs. Logchef doesn’t force you into a specific format. From what I’ve seen, not many tools offer this kind of plug-and-play flexibility with existing tables today.

Logchef is designed as a specialized query and visualization layer sitting on top of Clickhouse. It intentionally excludes log collection and ingestion. Why reinvent the wheel when excellent tools like Vector, Fluentbit, Filebeat, etc., already handle this reliably? Logchef focuses purely on exploring the logs once they’re in Clickhouse. Under the hood:

- Backend: Written in Go for performance and concurrency.
- Metadata Storage: Uses SQLite for lightweight management of users, teams, Clickhouse source connections, and query collections. It’s simple and perfectly suited for this kind of a metadata store.
- Frontend: An interactive log viewer built with Vue.js and styled with shadcn/ui and Tailwind CSS. I also implemented a simple search syntax for common filtering tasks. This involved writing a tokenizer and parser that translates this syntax into efficient ClickHouse SQL conditions optimised for querying logs. Building this parser and validator, and integrating it smoothly with the Monaco editor for syntax highlighting, was a significant effort, but I’m quite happy with the end result.

I wanted a public demo instance so people could try Logchef easily. Setting this up involved a few specific tweaks compared to a standard deployment, all managed within the Docker Compose setup.

Generating Dummy Data: A log viewer isn’t much use without logs! Instead of ingesting real data, I configured a generator source that continuously produces realistic-looking syslog and HTTP access logs and pushes them into the demo Clickhouse instance. It gives users immediate data to query without any setup on their part.

Securing Admin Endpoints (Demo Mode): Since this is a public, shared instance, I wanted to prevent users from making potentially disruptive changes via the API (like deleting sources or teams). I used Caddy as the reverse proxy and configured it to intercept requests to admin-specific API routes and block any method other than GET. If someone tries a write request to these endpoints, Caddy returns an error directly. This keeps the demo environment stable.
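For a sense of what that guard looks like, here’s a rough Caddyfile sketch; the domain, route prefix, upstream address, and response code are placeholders rather than the demo’s actual values.

```
# Illustrative Caddyfile snippet, not the demo's real config.
demo.example.com {
    @adminWrite {
        path /api/v1/admin/*
        not method GET
    }
    respond @adminWrite "read-only demo" 403
    reverse_proxy logchef:8125
}
```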
Improving Demo Login UX: Logchef uses OIDC for authentication, and for the demo I’m running Dex as the OIDC provider. To make it completely frictionless for users, I didn’t want them needing to sign up or guess credentials, so I simply customized Dex’s theme template for the login page to explicitly display the static demo username and password right there. It’s a small UX tweak (again, thanks to Kailash for the idea!), but it means anyone landing on the demo can log in instantly.

Logchef is already being used internally, but the journey towards a full v1.0 release continues this year. The roadmap includes exciting additions like:

- Alerting: Trigger notifications based on query results.
- Live Tail Logs: Stream logs in real-time.
- Enhanced Dashboarding: More powerful visualization capabilities.

Logchef is open source (AGPLv3), and community involvement is welcomed. You can check out the Demo or view the code on GitHub. If you have more ideas or features you’d like to see, please reach out on GitHub or email me! I’m always open to suggestions and feedback.

Honestly, building Logchef has been incredibly rewarding. It started as a way to fix something that bugged me (and others!), and seeing it turn into a tool I’m genuinely excited about feels great. I couldn’t have done it alone, though. I’m really grateful to my friends and colleagues who jumped in with feedback along the way. Huge thanks to Kailash for the constant support and encouragement, and to Vivek, Sarat, and Rohan for testing the early builds and offering great suggestions. Finally, a big thank you to my wife, who patiently endured my late-night coding sessions. Her support means the world to me <3

Karan Sharma 6 months ago

Trying out NixOS

I’ve been introduced to Nix by my colleagues at work. Being a Linux user for over a decade and a serial distro hopper, I was curious to learn more about it. I’d seen Nix mentioned before, but the comments about its steep learning curve made me wonder if the effort was worth it. I decided to give it a try by reading this excellent beginner’s guide, but I got bored very quickly and decided to “learn on the fly”. I spun up a VM in my homelab to install NixOS using their official GUI installer image. The installation was as straightforward as any other Linux distro.

NixOS is a declarative operating system that leverages the Nix functional package manager and a rich ecosystem of Nix packages. The flexibility is mind-blowing: you can configure everything—from user accounts and SSH keys to app configs and plugins—entirely through code. Once installed, the first place you’d want to poke around is the /etc/nixos directory, which contains two essential configuration files:

- hardware-configuration.nix: Generated during installation (or regenerated with commands like nixos-generate-config), it has hardware-specific details such as filesystem mount points, disk configurations, kernel modules etc. See an example file here.
- configuration.nix: This is the most important file, and the one you’ll want to start editing. Here you define system-wide settings like timezone, locale, user accounts, and networking. Everything is declared in one place, making your system’s state reproducible.

When I opened the terminal, I immediately noticed that my usual tools weren’t installed. So, I updated my configuration to include the packages I needed and, after saving, ran nixos-rebuild switch, which rebuilds the system using the new declarative configuration.

Next, I wanted to set up version control for my Nix configurations. The key takeaway is that while the system’s state is revertible in NixOS, your personal data (which includes these configs) isn’t automatically backed up. You must manage your own version history for your Nix configs. Since I was tweaking with no knowledge of Nix, having a version history was crucial, so I moved my configs to a dedicated directory and initialized a Git repository.

Flakes are an experimental (although widely adopted in the community) feature in Nix that bring reproducibility, composability, and a standardized structure to your configurations and package definitions. They allow you to declare all inputs (like nixpkgs, home-manager, or other repositories) and outputs (such as system configurations, packages, or development shells) in a single file. Flakes also create a lock file (flake.lock) that pins your dependencies to specific revisions, ensuring that your builds remain reproducible over time.

I learned the hard way that, even for local configurations, you must commit your files; otherwise you may see confusing errors about missing or untracked files. Even if you’re using local paths and have no intention of pushing to a remote, you still need to git add and git commit your changes for flakes to work. From whatever google-fu I did, it seems this requirement is to ensure that flakes can reliably reference the exact content in your configuration. I am sure there might be good reasons for it (as I said before, I’ve skipped RTFMing altogether ^_^), but at least the errors could be more verbose/helpful. And why I skipped the docs: remember, we’re on a mission to get things up and running with Nix, and then later spend time reading about the internals if it actually proves to be a valuable experiment.

While installing packages, I noticed some packages were quite outdated. That’s when I learned about NixOS channels. Think of the stable channels as analogous to LTS releases. For faster updates, you can switch to the unstable channel.
Although the name sounds intimidating, it simply means you’ll receive more frequent package updates. To do this, you edit your channel configuration and switch the URL to an unstable channel.

After setting up packages, it was time to configure firmware updates using fwupd to keep the hardware up to date. I asked Claude to help me with a quick setup: enable the fwupd service in the configuration and run a rebuild. Once enabled, you can use the fwupdmgr command-line tool to manage firmware updates. I also tweaked some settings for the Nix package manager to optimize builds, caching, and overall performance.

So far, things seemed all rosy. Within a couple of minutes I had a perfectly working machine, and the best part: all of it reproducible with a single command. I was starting to see why people who use NixOS preach about it so much.

However, not everything is smooth when you deviate from the happy path. For instance, I use Aider for LLM-assisted programming, but the version on Nixpkgs was about three minor versions behind. Typically, for any other software, I wouldn’t have cared so much; however, with these LLM tools a lot changes rapidly and I didn’t want to stay behind. Besides, it seemed like a fun exercise in getting my hands dirty by installing a Python package on NixOS, which turned out to be quite tricky because Nix is absurdly obsessive about fully isolated builds. I wrote a flake for installing Aider in a dev shell (which didn’t work, btw), entered the dev shell, and installed Aider with pip. However, I ran into an error indicating that Aider was missing a required shared library, part of the C++ standard library, needed by the tokenizers package. To fix this, I had to add the relevant system libraries to the dev shell explicitly. This is because while pip installs Python packages, it doesn’t handle system-level dependencies, and in a Nix environment every dependency, including system libraries, must be explicitly specified. Frankly, Python’s packaging ecosystem is still a mess. Although modern tooling helps, achieving a completely isolated build, especially when shared libraries are involved, is challenging. I wish the Python community would put more effort into resolving these issues.

While I was able to make Aider work by explicitly adding all the dependencies, I faced another outdated package, this time a full-blown Electron app that I didn’t wish to package myself. After some frustration, I tried using Distrobox as recommended by a colleague. Distrobox lets you run containers that feel almost like a native OS by managing user IDs, host mounts, network interfaces, and more. I used an Arch Linux image, installed the app from the AUR, and everything worked fine. Well, mostly:

- Fonts were missing. So, if I want to use custom fonts in my IDE, I need to have them installed in the container as well.
- My shell config depends on tools from my host setup, so I had to install those in the container too; otherwise I’d get errors in the shell.
- There’s an option to customise the shell in Distrobox, but for whatever reason (that I didn’t want to debug), it didn’t work for me.

Yet, something still felt off. The whole point of using NixOS is to achieve a fully declarative and reproducible setup. Resorting to an escape hatch like Distrobox undermines that goal, so I was very conflicted about this. I’m sure there’s a better way to handle these situations, and I should probably read the docs to find out.

I’m definitely sold on running NixOS, especially when managing multiple systems. With a single declarative file (configuration.nix), duplicating your setup across machines becomes effortless. No more “documenting” (or rather, forgetting to document and keep it updated), as the config is the single source of truth.
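To make the “single declarative file” idea concrete, a minimal configuration.nix can look something like this; the hostname, user, timezone, and package list are illustrative examples, not my actual config.

```nix
# Illustrative minimal configuration.nix; apply changes with `nixos-rebuild switch`.
{ config, pkgs, ... }:

{
  imports = [ ./hardware-configuration.nix ];

  networking.hostName = "nixos-vm";   # example hostname
  time.timeZone = "Asia/Kolkata";     # example timezone

  users.users.me = {                  # example user
    isNormalUser = true;
    extraGroups = [ "wheel" ];
  };

  environment.systemPackages = with pkgs; [ git vim htop ];

  services.openssh.enable = true;

  system.stateVersion = "24.11";
}
```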
Fun fact: I even messed up my NixOS build by misconfiguring the hardware configuration, and my system became unusable; even after a reboot, it couldn’t mount the filesystem on the correct device. In other distros, that would have sent me into panic mode, but with NixOS, all I had to do was revert to the previous working state, and everything was fine. That was so cool!

I’m definitely considering moving my homelab to NixOS in the coming days because I honestly see the value for a server setup. I often set up my personal server, then forget everything I’ve done, and I’m always scared of touching it or creating a new server from scratch. I even created a small shell script installer to help me get a base system ready. But that shell script, or even tools such as Ansible, are at best idempotent. In Nix, however, if I remove a certain piece from the configuration, there isn’t a trace of it left on the system. That makes it truly declarative and reproducible, unlike Ansible, where parts of the old setup can linger. However, for my primary machine at work, I’ll wait on the sidelines until the packages I depend on resolve their dependency issues and I get a chance to read up more on the escape hatches I tried, to see if there’s a more streamlined way of doing things.

I might be missing a lot of fundamental details since I skipped the docs entirely to get my hands dirty. But now that I see the value of a declarative system, and especially how easy it is to roll back the machine to a previously known good state, I’m motivated to read up more on this and might post an update to this blog.
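For reference, the Aider dev-shell experiment mentioned above had roughly this shape. This is an illustrative sketch rather than the exact flake; the relevant bit is exposing the C++ runtime so pip-built wheels like tokenizers can find it.

```nix
{
  description = "Illustrative dev shell for pip-installed tools (not the exact flake from this post)";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
    in {
      devShells.${system}.default = pkgs.mkShell {
        packages = [ pkgs.python3 pkgs.python3Packages.pip ];
        # pip-built wheels (e.g. tokenizers) expect libstdc++ at runtime;
        # expose it explicitly, since Nix won't leak it in implicitly.
        LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [ pkgs.stdenv.cc.cc.lib pkgs.zlib ];
      };
    };
}
```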

Karan Sharma 7 months ago

Automating Badminton Game Alerts

I’ve been playing badminton more regularly since the start of 2025, almost 4-5 days a week. I recently moved to a new part of the city, which meant I couldn’t play with my old friends anymore. PlayO has been super helpful for finding games with new people. On PlayO, a host creates a game and up to 6 people can join one court for a one-hour badminton doubles session. However, on hectic days I would often forget to check for badminton games, only to find them fully booked later. I wanted to automate this process by creating a small script that would send me scheduled alerts about today’s game availability, allowing me to book slots before they filled up. I drew inspiration from Matt’s post where he did something similar.

Thankfully, PlayO has a public API endpoint to retrieve a list of available games. You can send a request to this URL with parameters for filtering, and it returns a list of activities matching those filters. Using the response, I filtered for games that still have open spots (i.e., the activity isn’t marked as full, meaning spots are still available to join) and whose start time falls within 7-8 PM IST. I also wanted to add a feature to send these details to Telegram for convenient notifications.

I then vibe coded with Claude 3.7 to create a Python script to automate this whole process. Impressively, it produced a working script pretty much in a one-shot prompt, though I had to make a few minor tweaks. I quite like Simon Willison’s approach of using uv to build one-shot tools. Managing dependencies, virtual environments, etc. is still a pain point in Python, but uv feels like magic by comparison. The script produces a nicely formatted output.

I wanted this script to run reliably every day and used GitHub Actions for that. GitHub Actions felt like the path of least resistance as I didn’t have to worry about keeping a server running or getting alerts if something crashed. For a small personal script like this, it was the perfect “set it and forget it” solution. I used GitHub Actions inputs to configure the variables for my script, and found this feature quite neat for scheduling different crons for weekdays and weekends.

For small quality-of-life improvements, solving your own specific problems with custom scripts tailored exactly to your needs, gotta love the LLMs man. We’re gonna see more and more of such “personal tooling” in the future as the barrier to entry for coding is lowered with LLMs. The democratization of coding through LLMs means people (even non-technical ones) can focus on “describing” the problem well, rather than struggling with implementation details. Being able to articulate what you want clearly becomes the primary skill. Yes, it’s a skill issue if you can’t prompt well, but it’s far more accessible than learning programming from scratch.
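A rough sketch of what such a script looks like. The PlayO endpoint, query parameters, and response fields below are placeholders (the real ones aren’t shown here), and the Telegram part uses the standard Bot API.

```python
import os
from datetime import datetime, timezone, timedelta

import requests

IST = timezone(timedelta(hours=5, minutes=30))

# Placeholder endpoint and parameters; the actual PlayO API details differ.
PLAYO_URL = "https://example.com/playo/activities"
PARAMS = {"sport": "badminton", "lat": "...", "lng": "..."}


def fetch_activities():
    resp = requests.get(PLAYO_URL, params=PARAMS, timeout=10)
    resp.raise_for_status()
    return resp.json().get("activities", [])


def wanted(activity):
    # Keep games that still have open slots and start between 7 and 8 PM IST.
    start = datetime.fromisoformat(activity["startTime"]).astimezone(IST)
    return not activity.get("isFull", False) and 19 <= start.hour < 20


def notify(text):
    token, chat_id = os.environ["TG_TOKEN"], os.environ["TG_CHAT_ID"]
    requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id, "text": text},
        timeout=10,
    )


if __name__ == "__main__":
    games = [a for a in fetch_activities() if wanted(a)]
    if games:
        notify("\n".join(f"{g['title']} at {g['startTime']}" for g in games))
```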

Karan Sharma 9 months ago

Cleaning up Notes with LLM

My Obsidian vault has gotten quite messy over time. I’ve been dumping notes without proper frontmatter, tags were all over the place, and some notes didn’t even have proper titles! I needed a way to clean this up without spending hours manually organizing everything. I’d been playing around with Claude’s API lately, and thought – hey, why not use an LLM to analyze my notes and add proper frontmatter? After all, that’s what these AI models are good at – understanding context and categorizing stuff. I wrote a small Python script using the llm library (which is pretty neat btw) to do just this. Here’s what it looks like: The script is pretty straightforward – it reads each markdown file, extracts any existing frontmatter (because I don’t want to lose that!), and then asks Claude to analyze the content and generate appropriate frontmatter. It adds stuff like title, category, tags, status, priority. What I love about this approach is that it’s contextual . Unlike regex-based approaches or keyword matching, the LLM actually understands what the note is about and can categorize it properly. A note about “Setting up BTRFS on Arch” automatically gets tagged with “linux”, “filesystem”, “arch” without me having to maintain a predefined list of tags. The categorization is probably better than what I’d have done manually at 2 AM while organizing my notes!
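Since the script itself isn’t reproduced here, this is a minimal sketch of the idea, assuming an Anthropic plugin for llm is installed and configured; the model id, prompt, and vault path are placeholders, and the existing-frontmatter handling is left out for brevity.

```python
import pathlib

import llm
import yaml

model = llm.get_model("claude-3-5-sonnet-latest")  # placeholder model id

PROMPT = (
    "Analyze this note and propose YAML frontmatter with title, category, "
    "tags, status and priority. Respond with YAML only.\n\n{note}"
)

for path in pathlib.Path("vault").rglob("*.md"):  # placeholder vault path
    body = path.read_text(encoding="utf-8")
    # Ask the model for frontmatter based on the note's content.
    frontmatter = yaml.safe_load(model.prompt(PROMPT.format(note=body[:4000])).text())
    path.write_text("---\n" + yaml.safe_dump(frontmatter) + "---\n" + body, encoding="utf-8")
```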

Karan Sharma 9 months ago

2024: A Year In Review

2024 was indeed an important year for me as it marked several significant milestones. Quite happy with how this year was! Here’s my reflection on this memorable year. This year has been truly transformative, bringing together personal joy, professional growth, and exciting adventures. Looking forward to what 2025 has in store!

- Got married to the prettiest and dearest Saumya 💗
- Did my first international trip, exploring Europe
- Bought a fun toy - Maruti Jimny 4x4
- Relocated to Bangalore after working from home for 4+ years since Covid
- Attended several amazing concerts: Indian Ocean, Blackstratblues, Anand Bhaskar Collective
- Travelled to Switzerland, Ranthambore, and Pondicherry
- Released v1.0.0 of Doggo - It hit the frontpage of HN as well!
- Built an expense tracker app - Gullak
- Made a lot of small utility apps:
  - lil - URL shortener
  - silencer - Prometheus alerts <> Mattermost bridge
  - toru - Go modules proxy with caching
  - junbi - Server Setup and Hardening Tool
  - Ovenly Delights - Small bakery shop website
  - nomcfg - Nomad config generator
  - clx - Generate CLI commands using AI for common ops
- Started working on a log analytics app - full focus on that in 2025. Read more

Karan Sharma 11 months ago

How I use LLMs

Just yesterday, GitHub announced integrating Claude 3.5 Sonnet with Copilot. Interesting times ahead. In my experience, Claude has been remarkably better than the GPT-4 family of models for programming tasks. I’ve tried a bunch of tools like Cursor, Continue.dev but finally settled with Aider for most of my tasks. In this post, I want to write about my workflow of using Aider when working on small coding tasks. Aider is an open source Python CLI which supports multiple models, including Claude 3.5 Sonnet. Aider describes itself as “AI pair programming in your terminal”. The tool integrates quite well in its workflow so it can edit files, create new files, and track all changes via git. In case you want to revert, simply reverting the commit or using the shortcut would do the same. The tool has multiple modes that serve different purposes: My typical workflow involves running Aider in a terminal while keeping VSCode open for manual code review. I often use the flag to view the diffs before committing. Despite advances in LLM technology, I believe they haven’t yet reached the stage where they can fully understand your team’s coding style guides, and I prefer not to have a certain style forced upon me. Manually tweaking portions of AI-generated functions still proves helpful and saves considerable time. To begin, would open the interactive window where you can begin writing prompts. To add context, you need to add files using commands like . What makes Aider powerful is its control over the LLM context - you can or source code, or even to drop all files and start with a fresh context. This granular control helps manage the context window effectively. A really cool thing about it is that it gives an approximate idea of the number of tokens (cost) associated with each prompt. I find it useful to remove unnecessary files from the context window, which not only helps in getting sharper, more accurate responses but also helps with the costs. There’s a nice command which will show the cost of sending each file added in context with the prompt. I find the Aider + Claude 3.5 combo works really well when you have a narrow-scoped, well-defined task. For example, this is the prompt I used on a codebase I was working on: Theme preference is not preserved when reloading pages or navigating to new pages. We should store this setting in localStorage. Please implement using standard best practices. Under the hood, Aider uses tree-sitter to improve code generation and provide rich context about your codebase. Tree-sitter parses your code into an Abstract Syntax Tree (AST), which helps Aider understand the structure and relationships in your code. Unlike simpler tools that might just grep through your codebase, tree-sitter understands the actual syntax of your programming language. This means when you’re working on a task, Aider isn’t just blindly sending your entire codebase to the LLM. Instead, it creates an optimized “repository map” that fits within your token budget (default is 1k tokens, adjustable via ). This map focuses on the most relevant pieces of your code, making sure the LLM understands the context without wasting tokens. Aider’s approach to AI pair programming feels natural and productive. Here are some example prompts where it helped me build stuff in less than a minute: Modify fetch method in store/store.go to filter out expired entries Write a k6 load test script to benchmark the endpoint and simulate real-world traffic patterns Create a Makefile, Dockerfile, goreleaser.yml for my Go binary. 
Target platforms: arm64 and amd64

I prefer to invoke Aider with a few extra flags. Make sure to go through the Tips page to effectively try out Aider on your existing projects.

The modes, in short:

- Ask mode: Use it when you simply want to chat with the model about the codebase or explain some pieces of it. This mode won’t touch your files. It’s great for understanding existing code or getting explanations.
- Architect mode: Use it to discuss a broad overall idea. The model will propose some changes to your files. You can further chat and tune it to your preferences.
- Code mode: This will directly edit your files and commit them.

And what the tree-sitter-based repository map gives you:

- It can identify function definitions, class declarations, variable scopes, and their relationships.
- It extracts full function signatures and type information.
- It builds a dependency graph showing how different parts of your code relate to each other.
- It helps rank the importance of different code sections based on how often they’re referenced.
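For what it’s worth, an invocation with a few extra flags looks something like this; these particular flags are examples rather than my exact set.

```bash
# Example flags only; run `aider --help` for the full list.
aider --sonnet --no-auto-commits --map-tokens 2048

# Inside the chat, context is managed with commands such as:
#   /add store/store.go    # add a file to the context
#   /drop store/store.go   # remove it again
#   /tokens                # show the token cost of everything in context
#   /undo                  # revert the last AI-authored commit
```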

Karan Sharma 1 year ago

Self Hosting Outline Wiki

I recently discovered Outline, a collaborative knowledge base. I wanted to self-host it on my server, but the mandatory auth provider requirement was off-putting. My server is on a private encrypted network (Tailscale) that only my approved devices in the tailnet can access, so I don’t really need authentication for my personal single-use apps. I found a few guides using Authelia/Keycloak, but these are heavy-duty applications that would consume a lot of resources (DBs, caches, proxies, and whatnot) just to have an OIDC provider for Outline. There had to be a simpler way, right?

Enter Dex. As recommended by my friend and colleague Chinmay, it turned out to be quite easy. The full setup to get Outline up and running locally boils down to the container setup, a handful of OIDC-related environment variables for Outline, and a small Dex configuration. Voilà! With those in place, you’ll have an Outline server ready to go, and you can log in using the static user defined in Dex.
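Since the snippets aren’t reproduced above, this is roughly the shape the Dex configuration takes for this setup; the issuer URL, secrets, user, and bcrypt hash are placeholders, and the redirect URI follows Outline’s usual /auth/oidc.callback path.

```yaml
# Illustrative Dex config; URLs, secrets, and the hash are placeholders.
issuer: https://dex.example.internal

storage:
  type: sqlite3
  config:
    file: /var/dex/dex.db

staticClients:
  - id: outline
    name: Outline
    secret: change-me
    redirectURIs:
      - https://outline.example.internal/auth/oidc.callback

enablePasswordDB: true
staticPasswords:
  - email: admin@example.com
    username: admin
    userID: 1c9dd5a9-0000-0000-0000-000000000000
    hash: "$2y$10$examplebcrypthashgoeshere0000000000000000000000000000"
```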

Karan Sharma 1 year ago

Building an expense tracker app

A couple of weeks ago, I decided to start logging and tracking my expenses. The goal was not to record every minor purchase but to gain a general insight into where my money was going. In this post, I’ll dive deep into the behind-the-scenes of building Gullak, an expense tracker app with a dash of AI (yes).

My wife and I have a simple system for tracking our expenses during trips: we use Apple Notes to maintain a day-wise record, jotting down a one-liner for each expense under the date. This straightforward method has proven effective in keeping tabs on our spending habits while traveling. For instance, during our last Europe trip, we recorded our daily expenses. After returning home, I was eager to analyze our spending patterns, so I copied all these items into Google Sheets to analyse the top categories that I spent on during the trip.

I decided to develop a simple expense tracker app that automatically categorizes expenses into various groups like food, travel, shopping, etc. I believed this was a practical use case for leveraging an LLM paired with function calling to parse and categorize expenses. The first step involved designing a prompt to capture user input about their spending. I picked up the go-openai library and experimented with it.

Almost a year ago, I had developed a small bot for personal use, which provided a JSON output detailing the macronutrients and calories in specific food items, storing this information in Metabase. However, this was during the early days of API access provided by OpenAI. Due to occasionally unsatisfactory and inconsistent responses (despite instructions like “MUST RETURN JSON OR 1000 CATS WILL D*E SOMEWHERE”), it wasn’t entirely reliable.

Function calling addresses two main limitations of traditional language model responses:

- Inconsistent response format: Without function calling, responses from language models can be unstructured and inconsistent, requiring complex validation and parsing logic on the application side.
- Lack of external data integration: Language models are typically limited to the knowledge they were trained on, making it challenging to provide answers based on real-time or external data.

It’s important to note that the LLM does not actually execute any functions. Rather, we create a structure for the LLM to follow in its responses. The LLM then generates a response with the content as a stringified JSON object following the schema provided in the function definition. I created a function for saving parsed transactions. It takes a list of transactions as parameters, with each transaction carrying a few properties such as the amount and category. The response from this API call can then be unmarshalled into a Go struct.

The next step was to determine exactly how users would provide input. I considered various methods that would make entering expenses as straightforward as my approach with Apple Notes, and decided to create a Telegram bot that would parse the expenses and save them to a SQLite database. I explored tools like evidence.dev, a nice platform for creating frontends using the database as the sole source of truth. However, I encountered an issue where it could not correctly parse date values (see the GitHub issue). Ultimately, I returned to my reliable old friend—Metabase. However, I faced two main challenges with this approach:

- Privacy Concerns: Telegram does not offer the option to create a private bot; all bots generated through BotFather are public. To restrict access, I considered adding session tokens, but this approach was unsatisfactory. If I planned to distribute this bot, implementing a token-based, DIY authentication system on Telegram did not seem appropriate.
- Fixing Bad Entries: To correct erroneous entries, I had to manually update the SQLite table. As I intended to share this bot with my wife, I needed a more user-friendly workflow.
Manually raw dogging SQL queries was not the most user-friendly solution. After a day or two of experimenting, I decided to build a small frontend for now. As a backend developer, my core expertise is NOT JavaScript, and I strongly dislike the JS ecosystem. Obviously there’s no dearth of choices when it comes to frameworks, however for this project I wanted to stay away from the hype and choose a stack that is simple to use and productive (for me) out of the box. Having used Vue.js in production in the past, I feel it ticks those boxes for me as it comes bundled with a router, store, and all the niceties, and it has excellent documentation. After reading a refresher on the new Vue3 composition API syntax, I hit the ground running. I find Tailwind CSS ideal for someone like me who prefers not to write CSS or invent class names. It’s a heavily debated topic online, but it’s important to pick our battles. An issue I encountered while researching UI frameworks was that Vue.js seems to have fewer options compared to React, likely due to its lower popularity. After some google-fu, I discovered a promising project called shadcn-vue , an unofficial community led port of the shadcn/ui React library. The cool thing about this library is that it doesn’t come bundled as a package, meaning there’s no way to install it as a dependency . Instead, it gets added directly to your source code, encouraging you to tweak it the way you like. I believe it’s an excellent starting point for anyone looking to build their own design system from scratch, as it allows for customization of both appearance and behavior. It might have been overkill for my simple UI, but I thought, what the heck, if side projects aren’t for exploring new things, what’s the point of it all? 😄 For the database, I opted for SQLite. It’s perfect for a small project like this since the database is just a single file, making it easier to manage. Initially, I used the popular driver mattn/go-sqlite3 , but I found that the CGO-free alternative modernc/sqlite works just as well. I also experimented with sqlc for the first time. For those unfamiliar, generates type-safe Go code from your raw SQL queries. It handles all the boilerplate database code needed to retrieve results, scan them into a model, manage transactions, and more. sqlc makes it seem like you’re getting the best of both worlds (ORM + raw SQL). Here’s an example query: Using , it generates the following code: Similar to my Apple Notes approach, I wanted to create a shortcut that would allow me to log expenses quickly. I created a simple shortcut that would prompt me to enter the expenses and send an HTTP POST request to Gullak’s API server. I then open the dashboard once in a while to confirm/edit these unconfirmed transactions. You can read more about setting up the Shortcut in your Apple devices here . For every “I could do this in a weekend” comment, yes, this project is straightforward—a “CRUD GPT” wrapper that isn’t complicated to build. Yet, it took me over a month to develop. I spent less than an hour most days on this project, instead of cramming it into an all-nighter weekend project - an approach I want to move away from. Slow and steady efforts compound, outlasting quick, sporadic bursts. I’m pleased to balance this with my full-time job without burning out. Initially, I didn’t set out to build a comprehensive budgeting app, just an expense logger, as that was my primary need. 
However, if usage increases and the tool proves helpful in reducing unnecessary spending, I’m open to adding more features. Some possibilities include a subscription tracker, integration with budgeting tools like YNAB or Actual through their APIs, and monthly reports sent via email. The best part is that you own the complete data: it is stored locally on your device, so you can export it anytime and build other integrations on top of it. Feel free to open a GitHub issue or reach out if you have any suggestions or feedback. I’m excited to see where this project goes!
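As a closing aside, the function (tool) definition described earlier looks roughly like this with go-openai; the function name and the transaction fields here are hypothetical, not Gullak’s actual schema.

```go
// Package expenses is an illustrative sketch of defining a function-calling
// schema with github.com/sashabaranov/go-openai.
package expenses

import (
	openai "github.com/sashabaranov/go-openai"
	"github.com/sashabaranov/go-openai/jsonschema"
)

// saveTransactionsTool describes a hypothetical "save_transactions" function:
// the model is asked to return a list of transactions matching this schema.
func saveTransactionsTool() openai.Tool {
	return openai.Tool{
		Type: openai.ToolTypeFunction,
		Function: &openai.FunctionDefinition{
			Name:        "save_transactions",
			Description: "Parse free-form expense text into structured transactions",
			Parameters: jsonschema.Definition{
				Type: jsonschema.Object,
				Properties: map[string]jsonschema.Definition{
					"transactions": {
						Type: jsonschema.Array,
						Items: &jsonschema.Definition{
							Type: jsonschema.Object,
							Properties: map[string]jsonschema.Definition{
								"amount":      {Type: jsonschema.Number},
								"category":    {Type: jsonschema.String},
								"description": {Type: jsonschema.String},
								"date":        {Type: jsonschema.String},
							},
							Required: []string{"amount", "category"},
						},
					},
				},
				Required: []string{"transactions"},
			},
		},
	}
}
```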

Karan Sharma 1 year ago

A Random Act of Kindness

Last month, I took a wonderful trip through the scenic landscapes of Switzerland. My wife and I were in Lucerne and had scheduled a day trip to Mt. Titlis for the next day, but we were wondering what to do that evening. After strolling along the Chapel Bridge and enjoying an amazing lunch by the waterfront, we decided to book a Lake Lucerne cruise for the evening. It seemed like the perfect setup for a romantic date night, or so I thought!

We got the tickets from the information booth and went back to our hotel to freshen up and relax for a bit. We arrived at exactly 6:45 PM, as mentioned on our tickets, and started to wait. Except, there were only the two of us waiting. We waited for probably half an hour, until 7:15 PM, which was the departure time. Knowing how precise the Swiss transport system usually is, I sensed something fishy. Luckily, I spotted Martin, a staff member of the cruise company, and asked him about it. He looked puzzled by my question and informed me that there was no cruise scheduled for that day. Yep, not today, not tomorrow, and not for the rest of the weekend. It was Easter time, and all the cruise trips were cancelled. In fact, he was just as puzzled as I was as to how the lady at the ticket counter had even sold us the tickets. However, he told me that he couldn't do much and suggested writing an email for a refund. I was a bit sad as this dashed our evening plans, but I thought, fine… shit happens. It's not the end of the world!

And then, out of nowhere, by pure serendipity, my wife spotted another member of the cruise company, in fact, the captain! She began to tell the captain about our ordeal. The captain, a very warm and kind lady, listened to my wife patiently and understood our plight. She apologized on behalf of her company and immediately offered a cash refund for the tickets we had purchased. We were quite happy, as following up over email was not something I was excited about doing during the rest of the trip. So we said yes, except she needed to go to an ATM to withdraw the cash.

We waited for her and chatted with Martin about random stuff! He told us some really fun stories about how Toblerone's iconic packaging no longer features the Matterhorn. He also shared practical advice on how to safeguard oneself against pickpockets in Italy (which we were going to visit soon), and reminisced about his life in Zermatt before relocating to Lucerne.

Anyway, the wait stretched longer than expected, and I found myself wondering about her whereabouts. Then, finally, we saw her approaching us. She told us that she hadn't found a working ATM nearby and had to go a bit far. But she didn't come back empty-handed; she also brought us macaroons as a token of apology, an incredibly thoughtful gesture. She handed us 200 CHF in cash, more than the 180 CHF we had paid for our tickets. When we attempted to return the excess 20 CHF, she firmly refused to take it back despite our insistence, encouraging us to use the extra money to enjoy a few drinks as consolation for our evening plans being spoiled by the cancelled cruise.

I am glad the cruise didn't happen. Life has its own ways of revealing that kindness exists in every corner of the world, and serendipity can lead to the most memorable encounters!
For most people, it may not be a huge thing, but for me, it was a profoundly touching and generous act from a stranger who simply chose to be kind without any ulterior motive. The lesson: be kind to others, do no harm, and always pay it forward. :)

Karan Sharma 1 year ago

Travelling with Tailscale

I have an upcoming trip to Europe, which I am quite excited about. I wanted to set up a Tailscale exit node to ensure that critical apps I depend on, such as banking portals, continue working from outside the country. Tailscale provides a feature called "exit nodes": nodes that can be set up to route all traffic (0.0.0.0/0, ::/0) through them. I deployed a tiny DigitalOcean droplet and set up Tailscale on it as an exit node. The steps are quite simple and can be found here. The node is now advertised as an exit node, and we can confirm that from the CLI output:

On the client side, I started Tailscale and configured it to send all traffic to the exit node with:

We can confirm that the traffic is going via the exit node by checking our public IP from this device:

However, I encountered a minor issue, since I needed to bring my work laptop for on-call duties in case any critical production incidents required my attention during my travels. At my organization, we use Netbird as our VPN, which, like Tailscale, creates a P2P overlay network between different devices. The problem was that all 0.0.0.0/0 traffic was routed to the exit node, meaning the internal traffic meant for Netbird, used to access internal sites on our private AWS VPC network, was no longer routed via the Netbird interface.

Netbird automatically propagates a bunch of IP routes when it connects. These routes point to our internal AWS VPC infrastructure, so any IP falling in those internal ranges goes via the Netbird interface. To verify this:

However, after connecting to the Tailscale exit node, this was no longer the case. Now, even the private IPs meant to be routed via Netbird were being routed through Tailscale:

Although Tailscale allows selectively whitelisting CIDRs to route only the designated traffic through an exit node, my scenario was the opposite: I needed to selectively bypass certain CIDRs and route all other traffic through the exit node. I came across a relevant GitHub issue, but unfortunately, it was closed due to limited demand. This led me to dig deeper into how Tailscale propagates IP routes, to see if there was a way to add custom routes with a higher priority.

Initially, I examined the IP routes for Tailscale. Typically, one can view the routing tables using ip route, which displays the routes in the local and main tables. However, Tailscale uses routing table 52 for its routes, instead of the default or main table. A few notes on this table: its default route sends any traffic that doesn't match a more specific route through the Tailscale interface, ensuring it goes over the Tailscale network. There is also a special "throw" route for 127.0.0.0/8, which makes the lookup fail in this table so that loopback traffic falls back to the normal routing tables instead of being routed over Tailscale.

We can see the order in which these IP rules are evaluated using ip rule. This command lists all the current policy routing rules, including their priority (look for the pref or priority value). Each rule is associated with a priority, with lower numbers having higher priority. By default, Linux uses three main routing tables: local (priority 0), main (priority 32766), and default (priority 32767). Since Netbird already propagates its routes into the main routing table, we only need to add a higher-priority rule that looks up the main table before Tailscale's rules take over.
Now, our ip rule list looks like:

To confirm that packets for an internal destination get routed via the Netbird interface instead of the Tailscale one, we can do a quick route lookup:

Perfect! This setup routes all our public traffic via the Tailscale exit node, while the traffic meant for our internal AWS VPCs still goes via the Netbird VPN. Since these rules are ephemeral and I wanted to add a bunch of similar routes, I created a small shell script to automate adding and deleting the rules; a rough sketch of such a helper is shown below.
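My actual helper is a simple shell script wrapping the ip rule commands above; purely for illustration, here is a rough sketch of the same idea in Go. The CIDR list and the priority value are placeholders, and it needs root plus iproute2 to run.

```go
// Adds (or deletes) a high-priority "lookup main" rule for each internal
// CIDR so that Netbird's routes in the main table win over Tailscale's
// table-52 rules. Illustrative sketch only.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// internalCIDRs lists the private ranges that must keep going via Netbird
// (hypothetical values).
var internalCIDRs = []string{"10.10.0.0/16", "10.20.0.0/16"}

// A low pref value so the rule is evaluated before Tailscale's own rules
// (placeholder; pick anything lower than Tailscale's rule priorities).
const rulePriority = "100"

func run(args ...string) error {
	cmd := exec.Command("ip", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) != 2 || (os.Args[1] != "add" && os.Args[1] != "del") {
		fmt.Println("usage: routes <add|del>")
		os.Exit(1)
	}
	action := os.Args[1]
	for _, cidr := range internalCIDRs {
		// e.g. ip rule add to 10.10.0.0/16 lookup main pref 100
		if err := run("rule", action, "to", cidr, "lookup", "main", "pref", rulePriority); err != nil {
			log.Fatalf("ip rule %s for %s failed: %v", action, cidr, err)
		}
	}
}
```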

Karan Sharma 1 year ago

One Billion Row Challenge in Go

Earlier this week, I stumbled upon 1brc, which presents a fun task: processing a huge text file (1 billion lines) in Java as quickly as possible.

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file. Utilize all your virtual threads, leverage SIMD, optimize your GC, or employ any other technique to create the fastest implementation for this task!

The challenge is mainly about Java, but I thought I'd do the same in my preferred language: Go. This post walks through the several iterations of my Go program and the main techniques used in each one to make it faster. I was able to create a solution that takes ~20s to read, parse, and calculate stats for 1bn lines on my Apple M2 (10 vCPU, 32GB RAM). There are some insane solutions that people have come up with; be sure to check out the GitHub Discussions to go through them!

To generate the text file for these measurements, follow the steps outlined here. After running the commands, I have the measurements file on my file system. Example output after running the commands:

Let's take a look at a basic Go program to read and parse the above file, calculating stats on the fly. On running the above program, we get the following output:

This approach works well for small, simple files. However, there are certain restrictions:

- It reads the file line by line using a scanner, and reading and processing a billion rows this way is time-consuming.
- Each operation, even if small, adds up when repeated a billion times: string splitting, type conversion, error checking, and appending to a slice.
- We also need to consider the possibility of hitting the disk's max IOPS limit if we perform too many file operations per second.

Before we proceed to optimize this further, let's establish a baseline with 100 million lines first. Baseline: it takes approximately 19s to read and calculate stats from 100mn lines. There's a lot of room to optimize further; let's go through the iterations one by one.

First, concurrency. We can establish a worker pool to implement a producer-consumer pattern: a producer reads lines from the file and sends them to a channel, while consumers retrieve lines from the channel, parse the data, and calculate the minimum, mean, and maximum temperatures for each station. The concurrent version, unexpectedly, resulted in almost a 3x decrease in performance. Where did we go wrong? This is a classic case where the overhead of concurrency mechanisms outweighs their benefits. In this implementation, each line is sent to the channel individually, which is likely less efficient than batching lines for processing. For a file with a large number of lines, there will be an equally large number of channel send operations, and each channel operation involves locking and unlocking, which can be costly in a high-frequency context.

The next idea is to read multiple lines at a time in the producer goroutine and dispatch these batches to the worker goroutines. Batching the lines before sending them to the workers significantly reduces the overhead of channel communication:

- Batch processing: each batch contains a fixed number of lines, which reduces the frequency of channel operations (both sending and receiving), as well as the overhead associated with them.
- Efficient worker utilization: with batch processing, each worker goroutine spends more time processing data and less time interacting with channels. This reduces the overhead of context switching and synchronization, making the processing more efficient.

The improvement from iteration 2 to iteration 3 is quite remarkable, thanks to efficiently batching the lines together and reducing the number of channel ops. So far, we've reduced the time to about 6.5s, a great improvement over our baseline of 19s. A rough sketch of this batched producer-consumer shape is shown below.
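For illustration, here is a minimal sketch of the batched producer-consumer shape described above. The batch size and file name are assumptions, and the per-station aggregation is left as a stub; the actual implementation lives in the GitHub repo linked later in this post.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"runtime"
	"sync"
)

const batchSize = 10_000 // assumed; the actual batch size in the repo may differ

func main() {
	f, err := os.Open("measurements.txt") // file name assumed from the 1BRC setup
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	numWorkers := runtime.NumCPU()
	batches := make(chan []string, numWorkers)
	var wg sync.WaitGroup

	// Consumers: receive whole batches, so channel ops happen once per
	// batch instead of once per line. The per-station min/mean/max
	// aggregation is elided here.
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				for _, line := range batch {
					_ = line // parse "station;temperature" and update stats here
				}
			}
		}()
	}

	// Producer: read lines and ship them to the workers in batches.
	scanner := bufio.NewScanner(f)
	batch := make([]string, 0, batchSize)
	for scanner.Scan() {
		batch = append(batch, scanner.Text())
		if len(batch) == batchSize {
			batches <- batch
			batch = make([]string, 0, batchSize) // fresh slice; reuse comes in a later iteration
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if len(batch) > 0 {
		batches <- batch
	}
	close(batches)
	wg.Wait()
}
```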
However, we're still making quite a few extra memory allocations, and the focus of the next iteration is to reduce that. The batch slice is now pre-allocated with the batch-size capacity and reused for every batch of lines: after sending a batch to the channel, the slice is reset to zero length, but the underlying array is retained and reused. Down to 5.3s!

The next change is avoiding strings.Split. Instead of using strings.Split, which allocates a new slice for each line, we can find the delimiter ourselves and manually slice the string; strings.Split creates a new slice for each split part, leading to more memory usage and subsequent GC overhead. The time further decreased from 5.3s to 4.8s with this change.

In the final version, the file is read in chunks, and each chunk is processed to ensure it contains only complete lines; a small helper separates the valid data from the leftover bytes in each chunk. The chunk size can be controlled with command-line args as well. In addition to this, I moved the final aggregation to a separate goroutine. We're down from 4.8s to just 2.1s to read/parse/process 100mn lines!

To summarize the iterations:

- Basic file reading and parsing (baseline): 19s. Sequentially reading and processing each line.
- Producer-consumer pattern: 54.225s. Concurrent line processing with a producer-consumer pattern; -185% (slower than baseline).
- Batch processing of lines: 6.442s. Batched lines before processing, reducing channel communication; +66% vs baseline.
- Reducing memory allocations, iteration 1: 5.346s. Reused batch slices and reduced allocations; +72% vs baseline.
- Reducing memory allocations, iteration 2 (avoiding strings.Split): 4.853s. Replaced strings.Split with manual slicing; +75% vs baseline.
- Reading the file in chunks: 2.190s. Processed the file in chunks and optimized aggregation; +87% vs baseline.

I'm quite satisfied with the final version for now, so we can proceed to test it with 1 billion lines. It's evidently CPU-bound, as we spawn N workers for N CPUs. I experimented with different chunk sizes, and here are the results from each run:

Tweaking the chunk size doesn't significantly impact performance, as processing larger chunks simply takes longer. TL;DR: on average, across multiple runs, the final iteration takes approximately 20s for 1bn lines. Check out the full code on my GitHub.

This project was not only fun but also a great opportunity to revisit and refine many Go concepts. There are several other ideas I haven't explored yet that I believe could further improve the timings: custom line-parsing functions, especially for converting the raw temperature values to numbers, could offer improvements, and employing custom hashing functions (perhaps FNV) might aid in faster map lookups. A small sketch of the manual-slicing trick from iteration 5 follows below.
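To make the manual-slicing idea concrete, here's a minimal sketch of a line parser that avoids strings.Split. The helper name and error handling are my own for illustration; the code in the repo is shaped differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine splits a "station;temperature" line without strings.Split,
// avoiding the per-line slice allocation that Split incurs.
func parseLine(line string) (station string, temp float64, err error) {
	i := strings.IndexByte(line, ';')
	if i < 0 {
		return "", 0, fmt.Errorf("malformed line: %q", line)
	}
	station = line[:i]
	temp, err = strconv.ParseFloat(line[i+1:], 64)
	return station, temp, err
}

func main() {
	station, temp, err := parseLine("Bengaluru;25.3")
	if err != nil {
		panic(err)
	}
	fmt.Println(station, temp)
}
```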

Karan Sharma 1 year ago

Making sad servers happy

Recently, I stumbled upon sadservers, a platform described as "like LeetCode for Linux". The premise: you are given access to a full remote Linux server with a pre-configured problem, and your mission is to diagnose and fix the issue within a fixed time window. With the goal of documenting my journey through these challenges and sharing the knowledge gained, I decided not only to tackle these puzzles but also to record my solutions in video format. The format serves two purposes: it lets me reflect on my problem-solving approach, and it provides a resource for others who may encounter similar problems, whether in real-world scenarios or while preparing for an SRE/DevOps interview.

Each server presented a different issue, from misconfigured network settings to services failing to start, from permission problems to resource overutilization. One server, for instance, had a failing database service because of a full disk partition; the cause was stale backup files. Another had a web server throwing errors because of incorrect file permissions.

The recordings start with an introduction to the problem and my initial thoughts. Viewers can see my screen as I work through the issue, making the troubleshooting process transparent and educational. The commentary explains my thought process, the tools/CLI utilities used, and the solutions applied.

For those looking to enhance their Linux troubleshooting skills, sadservers.com is a gold mine. It's an excellent preparation ground for anyone aiming to step into the SRE/DevOps field or wanting to keep their skills sharp. As I continue to record and share these troubleshooting escapades, I invite you to subscribe, comment with your insights, or even suggest what types of challenges you'd like to see addressed next.

Karan Sharma 2 years ago

Nomad can do everything that K8s can

This blog post was sparked by the following Twitter exchange:

I don't take the accusation of making unsubstantiated arguments, especially on a technical topic, lightly. I firmly believe in substantiated arguments, and hence, here I am, elaborating on my stance. If I'm found mistaken, I am open to corrections and to revising my stance.

In my professional capacity, I have run and managed several K8s clusters (using AWS EKS) for our entire team of devs (been there, done that). The most complex piece of our otherwise simple and clean stack was K8s, and we'd been longing to find a better replacement. None of us knew whether that would be Nomad or anything else. But we took the chance, and we have reached a stage where we can objectively argue that, for our specific workloads, Nomad has proven to be a superior tool compared to K8s.

Nomad presents a fundamental building-block approach to designing your own services. It used to be true that Nomad was primarily a scheduler, and for serious production workloads you had to rely on Consul for service discovery and Vault for secret management. However, this has changed: Nomad now seamlessly integrates these features, making them first-class citizens in its environment. Our team replaced our HashiCorp stack with just Nomad, and we never felt constrained by what we used to accomplish with Consul/Vault. While these tools still hold relevance for larger clusters managed by numerous teams, they are not necessary for our use case.

Kubernetes employs a declarative state for every operation in the cluster, essentially operating as a reconciliation mechanism to keep everything in check. In contrast, Nomad requires dealing with fewer components, which can make it appear lacking compared to K8s's concept of everything being a "resource". However, that is far from the truth.

One of my primary critiques of K8s is its hidden complexity. While its abstractions might simplify things on the surface, debugging becomes a nightmare when issues arise. Even after three years of managing K8s clusters, I've never felt confident dealing with databases or handling complex networking problems involving dropped packets. You might argue that it's about technical chops, which I won't disagree with, but then do you want to add value to the business by getting shit done, or do you want to be the resident K8s whiz at your organization?

Consider this: how many people do you know who run their own K8s clusters? Even the K8s experts themselves preach about running prod clusters on EKS/GKE and the like. How many fully leverage all that K8s has to offer? How many are even aware of all the network routing intricacies managed by kube-proxy? If these queries stir up clouds of uncertainty, it's possible you're sipping the Kubernetes Kool-Aid without truly comprehending the recipe, much like I found myself doing at one point.

Now, if you're under the impression that I'm singing unabashed praises for Nomad, let me clarify: Nomad has its share of challenges, and I've personally encountered and reported several. However, the crucial difference lies in Nomad's lesser degree of abstraction, allowing for a comprehensive understanding of its internals. For instance, we encountered service reconciliation issues with a particular Nomad version. We could query the APIs, identify the problem, and write a bash script to resolve and reconcile it. That wouldn't have been possible in a system with too many moving parts, where we wouldn't even know where to begin debugging.
The YAML hell is all too well known to all of us. In K8s, writing job manifests required a lot of effort (from developers who don't work with K8s all day), and the manifests were very complex to understand. It felt too verbose and involved copy-pasting large blocks from the docs and trying to make things work. Compare that to HCL: it feels much nicer to read and shorter, and things are more straightforward to understand.

I've not even touched upon the niceties of Nomad yet. Better, humanly understandable ACLs? A cleaner and simpler job spec, which defines the entire job in one file? A UI which actually shows everything about your cluster, nodes, and jobs? Not restricting your workloads to Docker containers? A single binary which powers all of this?

The central question this post aims to raise is: what can K8s do that Nomad can't, considering the features people truly need? My perspective is informed not only by my organization but also by interactions with several other organizations at various meetups and conferences. Yet I have rarely encountered a use case that could only be managed by K8s. While Nomad isn't a panacea for all issues, it's certainly worth a try. Reducing the complexity of your tech stack can prove beneficial for your applications and, most importantly, your developers.

At this point, K8s enjoys immense industry-wide support, while Nomad remains the unassuming newcomer. This contrast is not a negative aspect, per se. Large organizations often gravitate towards complexity and the opportunity to engage more engineers. However, if simplicity were the primary goal, the prevailing sense of overwhelming complexity in the infrastructure and operations domain wouldn't be as pervasive. I hope my arguments provide a more comprehensive perspective and address the earlier critique of being unsubstantiated.

Darren has responded to this blog post. You can read the response on Twitter.

For the curious, here is how we cover the usual K8s talking points with Nomad:

- Ingress: We run a set of HAProxy instances on a few nodes which act as L7 LBs. Configured with Nomad services, they can route based on Host headers.
- DNS: To provide external access to a service without using a proxy, we developed a tool that scans all services registered in the cluster and creates a corresponding DNS record on AWS Route53.
- Monitoring: Ah, my favourite. You wanna monitor your K8s cluster? Sure, here's kube-prometheus, prometheus-operator, kube-state-metrics. Choices, choices. Enough to confuse you for days. Anyone who's ever deployed any of these, tell me why this needs such a monstrous setup of CRDs and operators. Monitoring Nomad is such a breeze: 3 lines of HCL config and done.
- Statefulsets: It's 2023 and the irony is rich: the recommended way to run a database inside K8s is… not to run it inside K8s at all. In Nomad, we run a bunch of EC2 instances and tag them as dedicated nodes. The DBs don't float around as containers to random nodes, and there's no CSI plugin reaching for a storage disk in AZ-1 when the node is basking in AZ-2. Running a DB on Nomad feels refreshingly like running it on an unadorned EC2 instance.
- Autoscale: All our client nodes (with a few exceptions) are ephemeral and part of AWS's Auto Scaling Groups (ASGs). We use ASG rules for horizontal scaling of the cluster. While Nomad does have its own autoscaler, our preference is to run large instances dedicated to specific workloads, avoiding a mix of different workloads on the same machine.

Karan Sharma 2 years ago

Storing AWS Pinpoint Logs

At $dayjob, we use AWS Pinpoint to send SMS to our customers. We've also written a detailed blog post on how we use the Clickhouse + Vector stack for our logging needs. We additionally wanted to store the delivery logs generated by the Pinpoint service. But like anything else in AWS, even simple tasks like these usually piggyback on other AWS services; in this case, it happens to be AWS Kinesis. All the delivery logs, which contain metadata about SMS delivery, are streamed to Kinesis.

Our setup involves configuring Pinpoint with an Amazon Kinesis Data Firehose stream. Firehose is an ETL service that helps stream events to other persistent stores. Firehose supports multiple output sinks, and in our case we use the HTTP endpoint sink. This is what the flow looks like:

On the HTTP server side, we used Vector's aws_kinesis_firehose source. Compared to just using the http source, here are the differences I found:

- It has first-class support for an access key. AWS Kinesis can be configured to send an access key, which arrives as a header value in the HTTP request, so a request containing an invalid access key is rejected at the source itself. With the http source, I couldn't find a way to drop events at the source level; it would require a VRL transformer to check whether the key is present in the headers and compare its value with ours.
- It has native support for decoding the payload. This one's pretty useful and saved me a lot of VRL transformer rules that I would otherwise have written with the http source.

So, basically, this is how the server receives the payload: the value of the payload is a base64-encoded JSON object of an SMS event. However, the aws_kinesis_firehose source is smart enough to automagically decode this list of records and their values into individual events. This is how the final event looks when using this source:

This makes things straightforward because now we just have to parse the JSON payload and do transformations on that object. If it were the http source, I'd have to loop over the records structure and figure out how to split them into individual events for the rest of the Vector pipeline… which would have been messy, to say the least. Here's the Vector config so far:

Now that we have a pipeline which sends and receives data, we can process the events and transform them into a more desirable schema. Since we require the events to be queryable in a Clickhouse DB, this is the schema we have:

To achieve the above format, we can use VRL to parse and format our SMS events:

Plugging this in, we have a clean JSON object for each SMS event. The only thing left to add is an output sink to Clickhouse:

Perfect! On running this pipeline, we can see it consuming the records. Hope this short post was useful if you have to do anything similar!

Karan Sharma 2 years ago

Bridge Networking in Nomad

To set the stage, it's crucial to understand what we mean by "bridge networking". In a nutshell, it is a type of network connection in Linux that allows virtual interfaces, like the ones used by virtual machines and containers, to share a physical network interface. With Nomad, when a task is allocated, it creates a network namespace with its own network stack. Within this, a virtual ethernet (veth) pair is established: one end is assigned to the network namespace of the allocation, and the other remains in the host namespace.

To illustrate this practically, let's assume a packet is sent from a task within an allocation. The packet is first received by the local end of the veth pair, then traverses to the other end residing in the host's namespace. From there, it is sent to the bridge on the host (in this case, the "nomad" bridge), which finally sends the packet out to the world via the host's physical network interface (typically "eth0" or equivalent on your machine). The journey of a packet from the outside world to a task inside an allocation is the exact mirror image: the packet reaches "eth0" first, then the nomad bridge, and is then forwarded to the appropriate veth interface in the host's namespace. From there, it crosses over to the other end of the veth pair in the allocation's network namespace and finally gets routed to the destination task.

Let's take a look at the following jobspec, which deploys my tiny side project, Cloak, on Nomad. Our focus should be on the network stanza. To illustrate what happens behind the scenes when an alloc runs with host networking, we can run the above job. On the machine, we can see that port 7000 (static) and port 27042 (dynamic) are allocated on the host network interface (eth0):

We can also see the port and process details on the host:

This config is more suitable for specific workloads, like load balancers or similar deployments where you want to expose the network interface on the host. It's also helpful for applications running outside of Nomad on that host to connect via the host network interface. However, typically in a job where you want to connect multiple different allocs, you'd want to set up a bridge network. This avoids exposing the workload on the host network directly and is the typical setup when you want to put applications behind a reverse proxy (NGINX/Caddy).

Let's change the network mode to bridge in the above job spec and see what changes. Now we don't see the ports forwarded on the host network:

Similarly, there's no process listening on these ports on the host network:

To understand what happened when we switched the networking mode to bridge, we need to take a look at the Nomad magic which comes into play when using a bridge network. I pulled up the firewall rules and saw specific rules under the Nomad-managed chains. These rules, in essence, allow all traffic to and from the allocation's network namespace. Nomad uses a default subnet for the bridge network, the two allocs are assigned IPs from this CIDR, and the rules allow traffic to flow freely on this subnet.

To check the routing, we can inspect the host's route table: packets for the default bridge network are routed via the nomad bridge interface. We can also dig into the network namespace created for an alloc. Looking at the alloc's details, we can see that one end of the veth pair (the container's default gateway side) is connected to a host network interface with a particular index.
For the tunnel to actually work, the veth pair should also exist on the host. So, when we see a veth interface in the host's network namespace with the matching index, and the corresponding interface inside the Redis container, we can infer that these two interfaces form a pair: one end on the host side and the other inside the container. This connection enables the container to communicate with the external network through the host's network stack.

We can capture TCP packets on the bridge interface to see the routing at work. To summarize the output, the log shows a TCP connection between 172.26.64.1 (source) and 172.26.64.6 (destination), specifically on port 7000; 172.26.64.1 happens to be the gateway for the bridge subnet.

Hope this post clarified some networking internals and the behind-the-scenes magic of Nomad bridge networking. Refer to my other post, Nomad networking explained, for a practical breakdown of all the different ways to expose and connect applications in a Nomad cluster.

Karan Sharma 2 years ago

Analyzing credit card transactions with GPT and Python

You know those budget freaks? People who log and categorise every rupee they've spent over the month? The financially sane people? I am definitely not one of them, and I suck at it. I moved cities a couple of months back and have had some big-ticket spends of late, mostly financed by credit card. I wanted an easy way to see where I've spent the most money and to spot recurring expenses, so I can be better prepared for them from next month. I've found that keeping a broad idea of where the money goes works for me (versus the two extremes: staying completely blind or logging every small transaction). Of course, I know people who make budgeting a habit, but I only wish I were consistent enough to do that.

Anyway, I downloaded the statement in CSV format from my bank. Initially I thought I'd use some simple Excel to make sense of it, but I realised how bad my Excel skills really are. I then got the idea to dump the CSV file into ChatGPT (yay privacy) and ask questions. It kinda sucked at it, gave wrong answers to a lot of questions, and also started to hallucinate data which wasn't even present in the CSV. The next obvious step would be to write a simple script and parse it myself, but I wanted to see whether ChatGPT could do this entire exercise of writing the script and the relevant code for the analysis I wanted to perform. Here's the initial prompt I gave:

It returned the following code:

Looking at this, I was a bit impressed, as it figured out that the CSV contains some useless empty columns and removed them (without me giving it any information about that). I also asked it to modify the code to read the file locally from disk, and it swapped in the path to the CSV file:

Next, I prompted it to do some analysis. It returned some one-liners to answer each question:

At this point, I knew this would fail because we hadn't cleaned up the data: the amount column needed cleaning. I prompted ChatGPT to write a function to clean this column, and it responded with:

Perfect! After transforming the amounts, I ran the analysis code:

The next prompt I gave was to analyse the spending across various categories. It did an okayish job at this and ignored a lot of vendors which I think it could have guessed easily. I decided to give it some manual inputs to refine the function, and it modified the Python snippet to add these rules. Mixed reactions looking at the result: happy that I could practically get exactly the output I had in mind in just 10 minutes without writing any code; sad because, damn, I need to limit those empty calories from next month (famous last words). Next, I wanted to see if my spends on weekends are higher or not (I don't expect them to be, but you never know).

It was a fun 10-15 minute exercise to figure out my spending habits based on the last month's statement. I intend to do this for the next couple of months; then it would make sense to write more queries that show trend lines of spends in various categories over time. Honestly, I just loved how ChatGPT made this task seem so simple. It's not that I couldn't write the code for this kind of simple analysis myself; it's the sheer power at your fingertips to go from ideation to answer within seconds. And I think that's why I love it so much. I didn't have to go through the Pandas docs (I don't use it in my day job, so it's quite normal not to know the various syntax/functions I could use) or grok through different StackOverflow questions to achieve what I wanted.
And maybe the thought of all that friction on a Sunday morning would have meant I never wrote the script in the first place.

Karan Sharma 2 years ago

The curious case of missing and duplicate logs

At work, we use a Vector pipeline for processing and shipping logs to Clickhouse. We also self-host our SMTP servers and recently started using Haraka SMTP. While Haraka is excellent in raw performance and throughput, it needed an external logging plugin for audit and compliance purposes. I wrote haraka-plugin-outbound-logger to log basic metadata like timestamps, subject, and SMTP response in a JSON file. The plan was to dump these logs into a file and use Vector's file source for reading them and doing further transformations. However, things went differently than I had planned: two issues cropped up, mainly due to bad Vector configuration.

The Vector configuration to read the file looked like this:

Vector has a handy option for automagically deleting the file if it hasn't received any new writes within a configured time interval; here it specifies that if the file hasn't had any new writes for 24h, Vector can delete it. It made sense to configure this because our workload was a short-lived batch job done once every N days. When the file didn't receive any new writes after 24h, Vector deleted it as expected. However, the plugin continued logging to the same file handle, even for newer batch jobs. As a result, no new logs ever showed up in a file on disk.

I created a minimal POC to reproduce this seemingly strange issue (a rough sketch of such a POC is included at the end of this post). The snippet keeps logging to a file; while it was running, I deleted the file from disk to mimic Vector's delete behaviour. I expected the script to re-create the file and keep logging to it. That didn't happen, and the script didn't complain about a missing file either. I was perplexed, did some google-fu, and found the following via StackOverflow:

The writes actually do not fail. When you delete a file that is open in another program you are deleting a named link to that file's inode. The program that has it open still points to that inode. It will happily keep writing to it, actually writing to disk. Only now you don't have a way to look at it, because you deleted the named reference to it. (If there were other references, e.g. hard links, you would still be able to!)

This is exactly what was happening in production. When Vector deleted the file (as configured), the plugin didn't know about it and kept writing to the same inode. This was a major TIL moment for me.

Fix: the fix was simple enough; I removed the auto-delete option from Vector's config. To keep the file from growing unbounded forever, I created a logrotate config instead. Some notes on it:

- copytruncate is useful in this context: it copies the existing file to a new one, which becomes the stale one, and truncates the current (active) file to zero bytes. So when rotation happens, logrotate copies the file to a new file and then truncates the existing one.
- delaycompress does not compress the logs until the next rotation happens. This is useful if Vector hasn't finished processing the log, as it can continue to do so.

Now, after fixing the case of the missing logs, I found myself with the opposite problem: the logs were duplicated in Clickhouse. My mental state at that moment couldn't be more accurately described than by this meme:

To add more context: before developing my own plugin for logging email delivery in Haraka, we used another plugin (acharkizakaria/haraka-plugin-accounting-files) to get these logs. That plugin records the metadata to CSV files, but there were issues with properly escaping the subject lines (if the subject had a comma, it was parsed incorrectly), so the log file had inconsistent output. To address this, I decided it was better to write another plugin from scratch that outputs a fixed JSON schema. As seen above, Vector's file source was configured like the following for reading the CSV files; the only change is that the auto-delete option is gone after fixing issue #1.

Vector "fingerprints" files in the file source: it keeps a checkpoint of how many bytes it has read from each file in its own disk buffer.
This buffer is helpful if Vector crashes, as it can then resume reading the file from where it last stopped. Vector has two fingerprinting strategies:

- checksum (the default): a CRC check over the first few lines of the file.
- device_and_inode: uses the file's actual inode on disk to identify it uniquely.

As I was using a different plugin which logged to CSV files, the checksum strategy did not work in my context. Since Vector fingerprints only the first few bytes (usually just enough for the header of a CSV), all the CSV files on that disk ended up with the same fingerprint, and Vector would not read all of them. To work around this, I switched the strategy to device_and_inode so Vector uniquely identifies every CSV file by its inode. (In hindsight, I should have just stuck with checksum and a higher fingerprint.lines value.)

The mistake this time was that when I switched to a JSON log file, I continued with the device_and_inode strategy. This isn't a problem if there's no log rotation set up. But since I had configured logrotate to fix issue #1, it created another log file on rotation, and because of the device_and_inode strategy, Vector thought this was a "new" file to be watched and processed. So now I had duplicate entries from this new file, which is technically just an older, rotated log file. I switched back to the default checksum strategy and adjusted the thresholds for lines/header bytes to account for the JSON logs. The same is also documented very clearly in Vector, and it was my RTFM moment:

This strategy avoids the common pitfalls associated with using device and inode names since inode names can be reused across files. This enables Vector to properly tail files across various rotation strategies.

Phew! I'm glad that after these fixes, Vector is durably and reliably processing all the logs, and the plugin is happily working in conjunction with it. I hope documenting my learnings from these production issues helps someone facing the same problems.
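Purely for illustration, here's a minimal Go version of such a POC (the file path and interval are arbitrary, and my original snippet was shaped differently). Run it, then delete the file from another shell: the writes keep succeeding because the process still holds a reference to the inode, even though the directory entry is gone.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Hypothetical path; any writable location works.
	f, err := os.OpenFile("/tmp/outbound.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	for i := 0; ; i++ {
		// This write keeps returning nil even after `rm /tmp/outbound.log`,
		// because we still hold the (now unnamed) inode open.
		if _, err := fmt.Fprintf(f, "log line %d at %s\n", i, time.Now().Format(time.RFC3339)); err != nil {
			log.Fatal(err)
		}
		time.Sleep(1 * time.Second)
	}
}
```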
