Why Graceful Shutdown Matters in Kubernetes
Have you ever deployed a new version of your app in Kubernetes and noticed errors briefly spiking during the rollout? Many teams do not even realize this is happening, especially if they are not closely watching their error rates during deployments.

There is a common misconception in the Kubernetes world that bothers me. The official Kubernetes documentation and most guides claim that “if you want zero downtime upgrades, just use rolling update mode on deployments”. I have learned the hard way that this simply is not true - rolling updates alone are NOT enough for true zero-downtime deployments.

And it is not just about deployments. Your pods can be terminated for many other reasons: scaling events, node maintenance, preemption, resource constraints, and more. Without proper graceful shutdown handling, any of these events can lead to dropped requests and frustrated users.

In this post, I will share what I have learned about implementing proper graceful shutdown in Kubernetes. I will show you exactly what happens behind the scenes, provide working code examples, and back everything with real test results that clearly demonstrate the difference.

If you are running services on Kubernetes, you have probably noticed that even with rolling updates (where Kubernetes gradually replaces pods), you might still see errors during a deployment. This is especially annoying when you are trying to maintain “zero-downtime” systems.

When Kubernetes needs to terminate a pod (for any reason), it follows this sequence:

1. It sends a SIGTERM signal to your container.
2. It waits for a grace period (30 seconds by default).
3. If the container has not exited by the end of the grace period, it gets brutal and sends a SIGKILL signal.

The problem? Most applications do not properly handle that SIGTERM signal. They just die immediately, dropping any in-flight requests. In the real world, while most API requests complete in 100-300 ms, there are often long-running operations that take 5-15 seconds or more. Think about processing uploads, generating reports, or running complex database queries. When these longer operations get cut off, that is when users really feel the pain.

Rolling updates are just one scenario where your pods might be terminated. Here are other common situations that can lead to pod terminations:

- Horizontal Pod Autoscaler Events: When HPA scales down during low-traffic periods, some pods get terminated.
- Resource Pressure: If your nodes are under resource pressure, the Kubernetes scheduler might decide to evict certain pods.
- Node Maintenance: During cluster upgrades, node draining causes many pods to be evicted.
- Spot/Preemptible Instances: If you are using cost-saving node types like spot instances, these can be reclaimed with minimal notice.

All these scenarios follow the same termination process, so implementing proper graceful shutdown handling protects you from errors in all of these cases - not just during upgrades.

Instead of just talking about theory, I built a small lab to demonstrate the difference between proper and improper shutdown handling. I created two nearly identical Go services:

- Basic Service: a standard HTTP server with no special shutdown handling.
- Graceful Service: the same service, but with proper SIGTERM handling.

Both services:

- process requests that take about 4 seconds to complete (intentionally configured for easier demonstration),
- run in the same Kubernetes cluster with identical configurations,
- serve the same endpoints.

I specifically chose a 4-second processing time to make the problem obvious. While this might seem long compared to typical 100-300 ms API calls, it perfectly simulates those problematic long-running operations that occur in real-world applications. The only difference between the two services is how they respond to termination signals.
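The actual services live in the repo linked at the end of this post. As a rough sketch (the handler names, port, and durations here are my own illustrative choices, not necessarily what the lab code uses), the basic service is little more than a plain net/http server with a deliberately slow handler:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Simulate a long-running operation (report generation, upload processing, ...).
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(4 * time.Second)
		fmt.Fprintln(w, "done")
	})

	// A single endpoint used for both liveness and readiness probes.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// No signal handling at all: when Kubernetes sends SIGTERM, the Go runtime
	// terminates the process immediately, and any in-flight 4-second request is dropped.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```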
To test them, I wrote a simple k6 script that hammers both services with requests while triggering a rolling restart of each service's deployment. Here is what happened: the basic service dropped 14 requests during the update (that is 2% of all traffic), while the graceful service handled everything perfectly, without a single error. The results speak for themselves.

You might think “2% is not that bad” - but if you are doing several deployments per day and have thousands of users, that adds up to a lot of errors. Plus, in my experience, these errors tend to happen at the worst possible times.

After digging into this problem and testing different solutions, I have put together a simple recipe for proper graceful shutdown. While my examples are in Go, the fundamental principles apply to any language or framework you are using. Here are the key ingredients; minimal sketches of each one follow right after the list.

1. First, your app needs to catch that SIGTERM signal instead of ignoring it. This part is easy - you are just telling your app to wake up when Kubernetes asks it to shut down.

2. You need to know when it is safe to shut down, so keep track of ongoing requests. This counter lets you check if there are still requests being processed before shutting down. It is especially important for those long-running operations that users have already waited several seconds for - the last thing they want is to see an error right before completion!

3. Here is a commonly overlooked trick - you need different health check endpoints for liveness and readiness. This separation is crucial. The readiness probe tells Kubernetes to stop sending new traffic, while the liveness probe says “do not kill me yet, I’m still working!”

4. Now for the most important part - the shutdown sequence. I have found this sequence to be optimal: first, we mark ourselves as “not ready” but keep running. We pause to give Kubernetes time to notice and update its routing. Then we patiently wait until all in-flight requests finish before actually shutting down the server.

5. Do not forget to adjust your Kubernetes configuration. It needs to tell Kubernetes to wait up to 30 seconds for your app to finish processing requests before forcefully terminating it.
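The sketches below are illustrative rather than lifted verbatim from my lab code, so treat the specific names and timings as assumptions. Ingredient 1, catching SIGTERM: the standard library's signal.NotifyContext gives you a context that is cancelled the moment Kubernetes asks the pod to stop.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// NotifyContext returns a context that is cancelled when one of the listed
	// signals arrives: SIGTERM from Kubernetes, or Ctrl+C when running locally.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	// ... start your HTTP server here ...

	log.Println("running; waiting for a termination signal")
	<-ctx.Done() // blocks until SIGTERM/SIGINT
	log.Println("shutdown signal received, starting graceful shutdown")
}
```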
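Ingredient 2, tracking in-flight requests: a small middleware around your handlers and an atomic counter are enough (the name inFlight is my own).

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// inFlight counts requests that have started but not yet completed.
var inFlight atomic.Int64

// trackInFlight wraps any handler so we always know whether work is still in progress.
func trackInFlight(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inFlight.Add(1)
		defer inFlight.Add(-1)
		next.ServeHTTP(w, r)
	})
}
```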
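Ingredient 3, separate liveness and readiness endpoints. The /healthz and /readyz paths are conventional names I am assuming here, not something Kubernetes mandates; the key point is that the two probes answer different questions.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready reports whether this pod is willing to accept new traffic.
// main flips it to true once the server is up, and back to false when shutdown begins.
var ready atomic.Bool

// livenessHandler answers "is the process alive?" - it keeps returning 200
// even while we are draining requests, so Kubernetes does not restart us mid-drain.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readinessHandler answers "may I receive new traffic?". Once it starts returning 503,
// Kubernetes removes the pod from the Service endpoints and stops routing new requests to it.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}
```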
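Ingredient 4, the shutdown sequence itself. This sketch assumes the inFlight counter, trackInFlight middleware, ready flag, and probe handlers from the two previous fragments live in the same package; the pauses and deadlines are illustrative and should stay comfortably inside your termination grace period.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", livenessHandler)
	mux.HandleFunc("/readyz", readinessHandler)
	mux.Handle("/work", trackInFlight(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(4 * time.Second) // the simulated long-running operation
		fmt.Fprintln(w, "done")
	})))

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()
	ready.Store(true)

	<-ctx.Done() // Kubernetes sent SIGTERM.

	// Step 1: fail the readiness probe so Kubernetes stops sending new traffic,
	// while the liveness probe keeps passing and in-flight requests keep running.
	ready.Store(false)

	// Step 2: give the endpoint controllers and kube-proxy a moment to notice.
	time.Sleep(5 * time.Second)

	// Step 3: wait for in-flight requests to drain, bounded so the whole sequence
	// finishes before terminationGracePeriodSeconds runs out.
	deadline := time.Now().Add(15 * time.Second)
	for inFlight.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}

	// Step 4: shut the listener down; Shutdown also waits for active connections
	// to finish, up to the context deadline.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("graceful shutdown did not complete: %v", err)
	}
	log.Println("all requests drained, exiting cleanly")
}
```

Whether you need both the explicit drain loop and srv.Shutdown is a judgment call: Shutdown alone already waits for active connections, but the explicit counter also covers background work that is not tied to an open connection.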
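Ingredient 5, the Kubernetes side. This is a hedged sketch of the relevant parts of the Deployment manifest; the name, image, and probe timings are placeholders, and only the probe paths and the grace period matter for the pattern.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graceful-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: graceful-service
  template:
    metadata:
      labels:
        app: graceful-service
    spec:
      # Total time Kubernetes waits after SIGTERM before sending SIGKILL.
      # It must cover the readiness propagation pause plus your longest request.
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: graceful-service:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 2      # frequent checks so "not ready" is noticed quickly
            failureThreshold: 1
```

The important parts are the two probe paths matching the endpoints above and a terminationGracePeriodSeconds large enough to cover the whole shutdown sequence.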
If you are in a hurry, here are the key takeaways:

- Catch SIGTERM Signals: Do not let your app be surprised when Kubernetes wants it to shut down.
- Track In-Flight Requests: Know when it is safe to exit by counting active requests.
- Split Your Health Checks: Use separate endpoints for liveness (am I running?) and readiness (can I take traffic?).
- Fail Readiness First: As soon as shutdown begins, start returning “not ready” from your readiness endpoint.
- Wait for Requests: Do not just shut down - wait for all active requests to complete first.
- Use Built-In Shutdown: Most modern web frameworks have graceful shutdown options; use them!
- Configure Termination Grace Period: Give your pods enough time to complete the shutdown sequence.
- Test Under Load: You will not catch these issues in simple tests - you need realistic traffic patterns.

You might be wondering if adding all this extra code is really worth it. After all, we are only talking about a 2% error rate during pod termination events. From my experience working with high-traffic services, I would say absolutely yes - for three reasons:

- User Experience: Even small error rates look bad to users. Nobody wants to see “Something went wrong” messages, especially after waiting 10+ seconds for a long-running operation to complete.
- Cascading Failures: Those errors can cascade through your system, especially if services depend on each other. Long-running requests often touch multiple critical systems.
- Deployment Confidence: With proper graceful shutdown, you can deploy more frequently without worrying about causing problems.

The good news is that once you have implemented this pattern, it is easy to reuse across your services. You can even create a small library or template for your organization. In production environments where I have implemented these patterns, we have gone from seeing a spike of errors with every deployment to deploying multiple times per day with zero impact on users. That is a win in my book!

If you want to dive deeper into this topic, I recommend the article Graceful shutdown and zero downtime deployments in Kubernetes from learnk8s.io. It provides additional technical detail about graceful shutdown in Kubernetes, though it does not emphasize the critical role of readiness probes in implementing the pattern properly, as we have discussed here.

For those interested in seeing the actual code I used in my testing lab, I’ve published it on GitHub with instructions for running the demo yourself.

Have you implemented graceful shutdown in your services? Did you encounter any other edge cases I didn’t cover? Let me know in the comments how this pattern has worked for you!