Latest Posts (3 found)
allvpv’s space 1 month ago

Environment variables are a legacy mess: Let's dive deep into them

Programming languages have evolved rapidly in recent years. But in software development, the new often meets the old, and the scaffolding the OS gives you for running new processes hasn’t changed much since early Unix. If you need to parametrize your application at runtime by passing a few ad-hoc variables (without special files or a custom solution involving IPC or networking), you’re doomed to a pretty awkward, outdated interface: environment variables. There are no namespaces for them, no types. Just a flat, embarrassingly global dictionary of strings.

But what exactly are these envvars? Are they some kind of special dictionary inside the OS? If not, who owns them and how do they propagate? In a nutshell: they’re passed from parent to child. On Linux, a program must use the execve syscall to execute another program. Whether you type a command in Bash, call subprocess.run in Python, or launch a code editor, it ultimately comes down to execve, usually preceded by a fork / clone. The exec family of C functions also relies on execve. This system call takes three arguments: the pathname of the executable, argv (the array of command-line arguments – the implicit first, “zero” argument is usually the executable name), and envp (the array of envvars, typically much longer).

By default, all envvars are passed from the parent to the child. However, nothing prevents a parent process from passing a completely different, or even empty, environment when calling execve! In practice, most tooling passes the environment down: Bash, Python’s subprocess, the C library’s system(), and so on. And this is what you expect – variables are inherited by child processes. That’s the point – to track the environment down the process tree. Which tools do not pass the parent’s environment? For example, the login executable, used when signing into a system, sets up a fresh environment for its children.

After launching the new program, the kernel dumps the variables onto the stack as a sequence of null-terminated NAME=value strings. This static layout can’t easily be modified or extended; the program must copy those variables into its own data structure. Let’s look at how Bash, C, and Python store envvars internally.
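To see both the default inheritance and the parent’s full control over it, here is a minimal Python sketch; it assumes a Unix-like system with /usr/bin/env available:

```python
import subprocess

# By default (env=None), the child inherits the parent's environment.
inherited = subprocess.run(["/usr/bin/env"], capture_output=True, text=True)

# But the parent may hand the child ANY environment it likes -- even an
# empty one. subprocess passes this dict as the envp array to execve.
empty = subprocess.run(["/usr/bin/env"], env={},
                       capture_output=True, text=True)
custom = subprocess.run(["/usr/bin/env"], env={"ONLY_VAR": "42"},
                        capture_output=True, text=True)

print(len(inherited.stdout.splitlines()))  # as many lines as the parent has vars
print(repr(empty.stdout))                  # '' -- the child got nothing
print(custom.stdout)                       # ONLY_VAR=42
```

Note that the child started with `env={}` sees no variables at all – not even PATH – which is why tools like login that build a fresh environment must populate it explicitly.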
I analyzed their source code and here is a summary.

Bash stores the variables in a hashmap. Or, more precisely, in a stack of hashmaps. When you spawn a new process from Bash, it traverses the stack of hashmaps to find variables marked as exported and copies them into the environment array passed to the child. Side note: why is traversing the stack needed? Each function invocation in Bash creates a new local scope – a new entry on the stack. If you declare your variable with local, it ends up in this locally-scoped hashmap. What’s interesting is that you can export a local variable too! I wouldn’t have learned this without diving into the Bash source. My intuitive (wrong) assumption was that export automatically makes the variable global. Super interesting stuff.

The C library exposes a dynamic array, environ, managed via the setenv, putenv, and getenv library functions. Because it is a flat array, the time complexity of getenv and setenv is linear in the number of envvars. Remember – envvars are not a high-performance dictionary and you should not abuse them.

Python couples its environment to the C library, which can cause surprising inconsistencies. If you’ve programmed some Python, you’ve probably used the os.environ dictionary. On startup, os.environ is built from the C library’s environ array. But those dictionary values are NOT the “ground truth” for child processes. Rather, each change to os.environ invokes the native putenv function, which in turn calls the C library’s putenv. Note that the propagation is one-directional: modifying os.environ will call putenv, but not the other way around. Call os.putenv directly, and os.environ won’t be updated.

The Linux kernel is very liberal about the format of environment variables, and so is execve. For example, your C program can manipulate the environment – the global environ array – such that several variables share the same name but have different values. And when you execute a child process, it will inherit this “broken” setup. You don’t even need an equals sign separating name from value! The usual entry looks like NAME=value, but nothing prevents you from adding a bare string with no ‘=’ at all to the array.
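The one-directional sync is easy to observe. A quick sketch of the asymmetry (assumes a Unix with /usr/bin/env):

```python
import os
import subprocess

# Writing through os.environ updates the Python dict AND, via putenv(),
# the process-level environment -- so child processes see the change.
os.environ["FROM_ENVIRON"] = "visible"

# os.putenv() goes straight to the C library; the os.environ dict
# is never told about it -- the propagation is one-directional.
os.putenv("FROM_PUTENV", "set-behind-pythons-back")

print("FROM_ENVIRON" in os.environ)   # True
print("FROM_PUTENV" in os.environ)    # False

child = subprocess.run(["/usr/bin/env"], capture_output=True, text=True)
print("FROM_ENVIRON=visible" in child.stdout)   # True -- inherited
```

Whether the FROM_PUTENV variable reaches children depends on how the interpreter spawns them (some code paths pass os.environ explicitly), which is exactly the kind of inconsistency the coupling can cause.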
The kernel happily accepts any null-terminated string as an “environment variable” definition. It just imposes size limits. Single variable: 128 KiB on a typical x86-64 machine. This is for the whole definition – name + equals sign + value. It’s computed as 32 * PAGE_SIZE, with the usual 4 KiB pages. No modern hardware uses pages smaller than 4 KiB, so you can treat 128 KiB as a lower bound, unless you need to deal with some legacy embedded systems. Total: 2 MiB on a typical machine. This limit is shared by the envvars and the command-line arguments. The calculation is a bit more complicated (see the execve man page), but on a typical system the limiting factor is a quarter of the stack size limit (RLIMIT_STACK / 4). Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars and arguments.

But the fact that you can do something does not mean that you should. For example, if you start Bash with a “broken” environment – duplicated names and entries without ‘=’ – it deduplicates the variables and drops the nonsense. One interesting edge case is a space inside the variable name. My beloved shell – Nushell – has no problem assigning such a variable, and Python is fine with it, too. Bash, on the other hand, can’t reference it, because whitespace isn’t allowed in Bash variable names. Fortunately, the variable isn’t lost – Bash keeps such entries in a special internal hashmap and still passes them to child processes.

So what name and value can you safely use for your envvar? A popular misconception, repeated on StackOverflow and by ChatGPT, is that POSIX permits only uppercase envvars, and everything else is undefined behavior. But this is seriously NOT what the standard says: These strings have the form name=value; names shall not contain the character ‘=’. For values to be portable across systems conforming to POSIX.1-2017, the value shall be composed of characters from the portable character set (except NUL and as indicated below).
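Here is the space-in-the-name trick in Python; subprocess forwards the dict keys verbatim to execve, so the kernel accepts the “malformed” name without complaint (assumes /usr/bin/env exists):

```python
import subprocess

# A variable name containing a space -- illegal in Bash's syntax,
# but perfectly acceptable to the kernel and to execve.
weird_env = {"MY VAR": "hello", "PATH": "/usr/bin:/bin"}

out = subprocess.run(["/usr/bin/env"], env=weird_env,
                     capture_output=True, text=True).stdout
print(out)   # contains the line: MY VAR=hello
```

The child process receives and prints the variable intact; it’s only the shell’s own parser, not the OS, that objects to such names.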
There is no meaning associated with the order of strings in the environment. If more than one string in an environment of a process has the same name, the consequences are undefined. Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2017 consist solely of uppercase letters, digits, and the <underscore> (‘_’) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities.

Yes, POSIX-specified utilities use uppercase envvars, but that’s not prescriptive for your programs. Quite the contrary: you’re encouraged to use lowercase for your envvars so they don’t collide with the standard tools. The only strict rule is that a variable name cannot contain an equals sign. POSIX requires compliant applications to preserve all variables that conform to this rule. But in reality, not many applications use lowercase; the de facto etiquette in software development is uppercase names with underscores. My pragmatic advice: use portable names (letters, digits, and underscore, not starting with a digit) and UTF-8 for values. You shouldn’t hit problems on Linux. If you want to be super safe: instead of UTF-8, use the POSIX-mandated Portable Character Set (PCS) – essentially ASCII without control characters – for values as well.

That’s all I have on environment variables – I hope it wasn’t a boring read.
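As a parting sketch, the “safe name” rule condensed into code. The regex is my own distillation of the POSIX wording quoted above, not something the standard itself provides:

```python
import re

# A portable envvar name per the POSIX description: letters, digits,
# and underscore, not starting with a digit. (All-uppercase names are
# the turf of the standard utilities; names containing lowercase
# letters are reserved for applications.)
PORTABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_portable_env_name(name: str) -> bool:
    return PORTABLE_NAME.match(name) is not None

print(is_portable_env_name("MY_APP_CONFIG"))  # True
print(is_portable_env_name("my_app_config"))  # True
print(is_portable_env_name("1ST"))            # False: starts with a digit
print(is_portable_env_name("MY VAR"))         # False: contains a space
print(is_portable_env_name("A=B"))            # False: '=' is never allowed
```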

allvpv’s space 2 months ago

Git’s hidden simplicity

Many programmers would admit it: our knowledge of Git tends to be pretty… superficial. “Oops, what happened? Screw that, I’ll cherry-pick my commits and start again on a fresh branch.” I’ve been there. I knew the basic use cases. I even thought I was pretty experienced after a hundred or so resolved merge conflicts. But the confidence, the fluency, somehow wasn’t coming. It was a hunch: learned scenarios, commands from Stack Overflow or ChatGPT, trivia-like knowledge without a solid base. In software engineering, you don’t need to have all the knowledge: you just need to quickly identify and fetch the missing bits. My goal is to give you that low-level grounding to sharpen your intuition. Git isn’t really complicated in its principles! Disclaimer: I am not a Git expert either. Let’s learn together.

Do you know how commit hashes are generated? I have to admit, I thought for a while that those hashes were somehow randomized. After all, I can run git commit --amend, change nothing, and still get the same commit, but with a new hash, right? Likewise, cherry-picking the same commit onto another branch gives me yet another hash. Boy, I couldn’t be more wrong. The commit hash is literally just a SHA-1 checksum of the information that constitutes the commit. So two identical commits have identical hashes.

Let’s look at what a commit consists of. Run the following command: git cat-file commit HEAD. (In case you don’t know: HEAD resolves to the commit you currently have checked out.) Let’s call the output of this command the payload. It is plain text: a few header lines (tree, parent, author, committer), a blank line, and the commit message. That’s it. That’s the full commit. Then prepend the null-terminated string “commit <size>” to the payload, where <size> is the size of the payload in bytes. Compute a SHA-1 over the result and boom: you’ve got a Git commit hash! Try it yourself, then compare the output to the actual commit hash (git rev-parse HEAD). It works. So simple. Now, let’s ponder what the payload contains: we are not hashing the diff a commit introduces.
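The recipe is small enough to sketch in a few lines of Python. This is my own illustration of the rule just described (header “type size” plus a NUL byte, then the payload, SHA-1 over the whole thing); the sanity check uses the well-known blob example from the Pro Git book:

```python
import hashlib

def git_object_hash(kind: str, payload: bytes) -> str:
    # Git hashes every object as sha1(header + payload), where the
    # header is "<kind> <size-in-bytes>" followed by a NUL byte.
    header = f"{kind} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

# For a commit, `payload` is exactly what `git cat-file commit HEAD`
# prints; the same helper also hashes trees and blobs. As a sanity
# check, the classic blob example from the Pro Git book:
print(git_object_hash("blob", b"test content\n"))
# d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

To verify against a real repository, feed the output of `git cat-file commit HEAD` into this function and compare with `git rev-parse HEAD`.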
Rather, the commit header, together with the referenced tree and parent, determines the hash. And now it’s easy to see what happens when you run git commit --amend and change “nothing”. Something still changes: the date in the committer field! (Note that git log doesn’t display the committer; the date you see comes from the author field.) But if you are fast enough to amend within the same second as the original commit, the commit hash remains unchanged! And on a rebase, the parent field changes, and usually, though not always, the tree field as well. If you’re a careful reader, you might wonder what the parent field is for the first commit in a repo, and for a merge commit. What do you think? Grab a repo and verify.

We saw that a commit references a tree. Let’s check what it really is by dumping it raw: git cat-file tree HEAD^{tree}. Oops, the payload isn’t human-readable text; it’s binary data. But just like with commits, if you prepend the null-terminated “tree <size>” to the payload bytes, you can compute the tree’s hash from the result! Fortunately, Git lets you pretty-print a tree’s contents: git cat-file -p HEAD^{tree}. A tree is just like a directory: it references other files (blobs) and directories (trees) nested inside it. The listing looks a bit like ls output; the first column records, of course, the Unix file permissions.

And what is a blob? Nothing more, nothing less than the raw file content – no metadata. And yes, prepend the null-terminated “blob <size>” to the bytes, run sha1sum, and you’ll get the blob’s hash! No extra metadata such as file modification time is stored: that can be inferred from the commit history.

What we end up with is a simple and immutable structure: you can’t change a commit without changing its hash. And if you think about it, you will notice that it is a… directed acyclic graph (DAG)! There are three types of nodes in this graph – commits, trees, and blobs – and four types of edges. Interestingly, the graph fragment reachable from a tree node doesn’t have to form a strict tree. For example, a single blob can be referenced by multiple parents. As you probably know, a branch is just a ref pointing to a commit hash.
If you run ls .git/refs/heads in your repo root, you’ll see all local branches as file names, each file just a few bytes, with the referenced commit’s hash inside. Likewise, the .git/refs/remotes directory contains pointers to the remote-tracking branches. So you can think of branches as labels for commit histories. If you commit on main, the new commit gets the hash currently pointed to by main as its parent field; then the branch label is updated to point to the new commit’s hash. And the .git/HEAD file contains the name of the current branch – or a commit hash, if you’re in a detached state. This special pointer tells Git what is currently checked out.

I hope this clarifies your mental model and clears some of the mystery around Git. The building blocks are simple. To recap: the commit payload consists of tree – the hash of a tree object, a snapshot of all files in the repo; parent – the hash(es) of the parent commit(s); author and committer – self-explanatory, but notice that they include a date (seconds since the Unix epoch) and a time zone, and in several scenarios the author is not the committer; and finally the commit message. The DAG has three node types – commits, trees, and blobs – and four edge types: commit -> commit (the parent relationship; a commit has zero or more parents, usually one), commit -> tree (each commit points to exactly one tree, a snapshot of files and folders), tree -> tree (the subdirectory relationship), and tree -> blob (the files contained in a directory).

Now you shouldn’t have a problem answering questions such as: How are Git commit hashes generated? Why does rebasing produce different commit hashes? Can a remote-tracking branch update without your local branch updating? Which data structure represents the repository? What are the node and edge types in this DAG, and how do they relate?

In the next articles, I plan to cover more advanced concepts, such as Git object storage, garbage collection, and how the default merge strategy works. If you have a little more time and want to keep going, I recommend a few resources:
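Since branches and HEAD are plain files, resolving “what is checked out” needs no Git at all. Here is a toy sketch; it assumes the loose-file layout described above (real repositories may also pack refs into .git/packed-refs, which this toy ignores):

```python
import tempfile
from pathlib import Path

def resolve_head(repo):
    # Returns (branch_name_or_None, commit_hash) for a repo's HEAD.
    head = (repo / ".git" / "HEAD").read_text().strip()
    if head.startswith("ref: "):                # attached: HEAD names a branch
        ref = head[len("ref: "):]               # e.g. "refs/heads/main"
        commit = (repo / ".git" / ref).read_text().strip()
        return ref.rsplit("/", 1)[-1], commit
    return None, head                           # detached: HEAD is a raw hash

# Tiny demo on a hand-made fake repo layout (hash is arbitrary):
repo = Path(tempfile.mkdtemp())
(repo / ".git" / "refs" / "heads").mkdir(parents=True)
(repo / ".git" / "HEAD").write_text("ref: refs/heads/main\n")
(repo / ".git" / "refs" / "heads" / "main").write_text(
    "3c4e9cd789d88d8d89c1073707c3585e41cc1585\n")
print(resolve_head(repo))  # ('main', '3c4e9cd789d88d8d89c1073707c3585e41cc1585')
```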
Pro Git Book – very practical, but it doesn’t lack depth; look at the Git Internals section. Git for Computer Scientists by Tommi Virtanen – short and sweet; this is where I got the DAG analogy.

allvpv’s space 3 months ago

Turn off Cursor, turn on your mind

I believe that AI can make you smarter. It can help you grow faster, especially when it comes to gaining new knowledge or developing new skills. It is the perfect learning companion and a great language tutor, and its ability to browse the web and surface relevant links far exceeds my own. Heck, I’ve even used it – albeit very modestly – to polish this draft so it’s bearable to native speakers. I would call this way of using AI – for the sake of this essay – approach one: using AI to learn faster and understand the whole system better.

That’s not the main way people – especially software engineers – use or are expected to use AI. Let’s call the other emerging paradigm approach two: using AI to tackle problems for you (so, at the end of the day, you neither learn much nor deepen your understanding of the system). I think that approach two is the most straightforward – or let’s say, the laziest – way to use AI tools, especially agentic coding tools like Cursor or Claude Code. And I think it’s a trap. Unless AI systems can completely replace us, we shouldn’t let ourselves get lazy – even for the promised efficiency gains.

There’s a spectrum when it comes to outsourcing tasks to AI. For example, I could:

1. Ask AI to draft this essay based on the title.
2. Ask AI to rephrase my essay to make it ChatGPT-esque.
3. Ask AI to fix the draft’s grammar.
4. Ask AI to list grammatical or stylistic issues in this essay and suggest improvements.

Of these four choices, the first two are detrimental to my essay; option 3 fixes it but doesn’t help me learn; and option 4 not only helps the essay but also makes me a slightly better writer in English. I believe the same logic holds for coding. Because coding is an ongoing learning process, even those in senior positions shouldn’t think they have it all figured out. Let me elucidate this point.

Let’s start with something we can probably all agree on. A new hire generally won’t perform at the same pace as someone who is fully onboarded into the project and has been working in it for some time. (This holds when comparing people of similar skills and drive.) Say you are new to the project: you pick up a task and it takes you time Y1 to complete. Some time passes, and you implement a huge chunk of the project. Then you take a different task of the same complexity as the first, and it takes you Y2. If you’re like me, you’ll agree that Y2 is smaller than Y1. You see, programming a large project is not a repetitive task but an ongoing learning process. When you work in a large codebase, you are essentially learning, building a mind map of how the system fits together. I’m sure you can think of a situation in which you developed a feature very quickly – maybe in a day or two. And another, in which a more trivial problem took the same amount of time, or even longer. For me, the differentiator was knowledge of the project and the related technologies.

The paragraph above isn’t controversial – I’d say it’s something most sensible people would agree on. What might be more controversial is the second premise: agentic coding reduces the learning experience substantially. I won’t justify that with anecdotes (though I could share one in which “vibe coding” led to a loss of control and a lot of debugging). Instead, I’ll appeal to common sense: learning takes time. You’ve probably heard claims that, in the dawn of AI coding, a team of three engineers working for six months can be replaced by a single engineer over a weekend with the help of an AI agent. Many find those claims exaggerated, but let’s assume for the sake of argument that they’re true. My question is this: who, do you think, will learn more about the underlying system? Who will better sharpen their coding and debugging skills in a new project with new technologies – the person who spent three days or the one who spent six months on it?

How many times, in the middle of a coding session, have you deleted your code and reworked it substantially? I bet it took you some time to figure that out. Someone might say that, with AI coding, you can spin up two prototypes in 15 minutes and discard them easily, but I’d argue that you often can’t competently and thoroughly evaluate the AI-generated code. As the system grows denser and more complicated, and as AI compresses the time you spend thinking about it, you slowly lose control. If you’re like me, you’ve probably had the unpleasant gut feeling of losing control when coding with an AI agent.

At the end of the day, I am responsible for the code I ship. I can’t say that an AI agent messed something up unless the agent is also accountable for it. So I think: better to implement this myself, and use AI to suggest improvements or explore alternatives.

Maybe we’ll reach a day when AI agents join our daily standups, take tasks, ship them, and assume full responsibility. Maybe, at some point, they’ll take over coding entirely, laying us all off. But until then, I won’t entrust them with my skills, my learning, or my responsibilities. The moment they break down, I have to take over – so I need to know exactly what’s going on in the system. At the same time, I’m all for using AI to learn, explore alternatives, and gain efficiency by means other than outsourcing coding to AI.
