Posts in Microservices (9 found)
Binary Igor 2 weeks ago

Modular Monolith and Microservices: Data ownership, boundaries, consistency and synchronization

Virtually every module (a folder or versioned package in a modular monolith, a separately deployed microservice) must own or at least read some data to provide its functionality. As we shall see, the degree to which module A needs data from module B is often the degree to which it depends on that module; functionality being another important dimension of dependence. This leads us to the following principles...

0 views
iDiallo 2 months ago

How to Lead in a Room Full of Experts

Here is a realization I came to recently. I'm sitting in a room full of smart people. On one side are developers who understand the ins and outs of our microservice architecture. On the other are the front-end developers who can debug React in their sleep. In front of me is the product team that has memorized every possible user path that exists on our website. And then there is me. The lead developer. I don't have the deepest expertise on any single technology. So what exactly is my role when I'm surrounded by experts? Well, that's easy. I have all the answers. OK. Technically, I don't have all the answers. But I know exactly where to find them and how to connect the pieces together. When the backend team explains why a new authentication service would take three weeks to build, I'm not thinking about the OAuth flows or JWT token validation. Instead, I think about how I can communicate it to the product team, who expect it done "sometime this week." When the product team requests a "simple" feature, I'm thinking about the three teams that need to be involved to update the necessary microservices. Leadership in technical environments isn't about being the smartest person in the room. It's about being the most effective translator. I often get eye rolls when I say this to developers: you are not going to convince anyone with facts. In a room full of experts, your technical credibility gets you a seat at the table, but your social skills determine whether anything productive happens once you're there. Ideally, you would provide documentation that everyone can read and understand; in reality, you need to talk to people to get them to understand. People can get animated when it comes to the tools they use. When the database team and the API team are talking past each other about response times, your role isn't to lay down the facts. Instead, it's to read the room and find a way to address technical constraints and unclear requirements.
It means knowing when to let a heated technical debate continue because it's productive, and when to intervene because it's become personal. When you are an expert in your field, you love to dive deep. It's what makes you an expert. But someone needs to keep one eye on the forest while everyone else is examining the trees. I've sat through countless meetings where engineers debated the merits of different caching strategies while the real issue was that we hadn't clearly defined what "fast enough" meant for the user experience. The technical discussion was fascinating, but it wasn't moving us toward shipping. As a leader, your job isn't to have sophisticated technical opinions. It's to ask how this "discussion" can move us closer to solving our actual problem. When you understand a problem, and you have a room full of experts, the solution often emerges from the discussion. But someone needs to clearly articulate what problem we're actually trying to solve. When a product team says customers are reporting that the app is too slow, that's not a clear problem. It's a symptom. It might be that users are not noticing when the shopping cart has loaded, or that an event is not being triggered at the right time. Or maybe the app feels sluggish during peak hours. Each of those problems has different solutions, different priorities, and different trade-offs. Each expert might be looking at the problem through their own lens, and may miss the real underlying problem. Your role as a leader is to make sure the problem is translated in a way the team can clearly understand. By definition, leading is knowing the way forward. But in reality, in a room full of experts, pretending to know everything makes you look like an idiot. Instead, "I don't know, but let's figure it out" becomes a superpower. It gives your experts permission to share uncertainty. It models intellectual humility. And it keeps the focus on moving forward rather than defending ego.
It's also an opportunity to let your experts shine. Nothing is more annoying than a lead who needs to be the smartest person in every conversation. Your database expert spent years learning how to optimize queries; let them be the hero when performance issues arise. Your security specialist knows threat models better than you do; give them the floor when discussing architecture decisions. Make room for productive discussion. When two experts disagree about implementation approaches, your job isn't to pick the "right" answer. It's to help frame the decision in terms of trade-offs, timeline, and user impact. Your value isn't in having all the expertise. It's in recognizing which expertise is needed when, and creating space for the right people to contribute their best work. There was a fun blog post I read recently about how non-developers read tutorials written by developers. What sounds natural to you can be complete gibberish to someone else. As a lead, you constantly need to think about your audience. You need to learn multiple languages to communicate the same thing. Developer language: "The authentication service has a dependency on the user service, and if we don't implement proper circuit breakers, we'll have cascading failures during high load." Product language: "If our login system goes down, it could take the entire app with it. We need to build in some safeguards, which will add about a week to the timeline but prevent potential outages." Executive language: "We're prioritizing system reliability over feature velocity for this sprint. This reduces the risk of user-facing downtime that could impact revenue." All three statements describe the same technical decision, but each is crafted for its audience. Your experts shouldn't have to learn product speak, and your product team shouldn't need to understand circuit breaker patterns. But someone needs to bridge that gap. "I'm the lead, and we are going to do it this way."
That's probably the worst way to make a decision. It might work in the short term, but it erodes trust and kills the collaborative culture that makes expert teams thrive. Instead, treat your teams like adults and communicate the reasoning behind your decisions. The more comfortable you become with not being the expert, the more effective you become as a leader. When you stop trying to out-expert the experts, you can focus on what expert teams actually need. Your role isn't to have all the answers. It's to make sure the right questions get asked, the right people get heard, and the right decisions get made for the right reasons. Technical leadership in expert environments is less about command and control, and more about connection and context. You're not the conductor trying to play every instrument. You're the one helping the orchestra understand what song they're playing together. That's a much more interesting challenge than trying to be the smartest person in the room.

0 views
Jack Vanlightly 5 months ago

Coordinated Progress – Part 4 – A Loose Decision Framework

In this last post, we'll review the mental framework and think about how it can translate to a decision framework. Microservices, functions, stream processors and AI agents represent nodes in our graph. An incoming edge represents a trigger of work in the node, and the node must do the work reliably. I have been using the term reliable progress, but I might have used durable execution if it hadn't already been used to define a specific type of tool. In this light, stream processors like Flink and Kafka Streams are particularly interesting. They've had durable execution built in from the start. The model assumes that processing logic is applied to an ordered, durable log of events. Progress is explicitly tracked through durable checkpoints or changelogs. Failures are expected and routinely recovered from. Many of the promises of durable execution are simply native assumptions in the stream processing world. You can think of stream processors as always-on orchestrators for data-in-motion: the durable trigger is the event log itself, and the progressable work is defined through operators, keyed state, and fault-tolerant state backends. This makes stream processing a powerful, production-tested example of how durable execution/reliable progress can be built into the foundation of an execution model. Microservices and functions have not historically had a durable execution/reliable progress foundation built into their execution models (except maybe for some actor frameworks). Reliable triggers exist in the form of queues and event streams, but progressable work has been limited to idempotency. The Durable Execution category aims to close this reliable progress gap for microservices and functions. It can provide an additional reliable trigger in the form of Reliable RPC, and provides progressable work by durably persisting all progress, with resume functionality. The question then comes down to preferences and constraints: What's your coding style preference?
What's your coupling tolerance? What infrastructure dependencies can you tolerate? Stream processors like Flink and Kafka Streams use a continuous dataflow programming model, with durability "batteries included". You likely already have a dependency on an event streaming system like Kafka. Flink is an additional runtime to manage, while Kafka Streams is just a library, but it still comes with some operational challenges related to state management. Imperative microservices and functions offer familiar procedural programming but typically lack native durability mechanisms. To achieve reliable progress, you can:

- Couple reliable triggers (like events from queues/topics) with manual progressability patterns such as idempotency and spreading logic across response handlers.
- Or add a durable runtime in the form of a Durable Execution Engine for automatic state persistence and resumability (another infrastructure dependency).

Building reliable distributed systems requires thoughtful choices about coordination and progress. Architecture decisions are often about managing complexity in systems that span teams, services, and failure domains. We can simplify the decision-making process with a mental framework based on a graph of nodes, edges, and workflows. The reliability of distributed work in this graph comes from coordinated reliable progress (of reliable triggers, progressable work, choreography and orchestration). To summarize our model, the graph is composed of:

- Nodes: microservice functions, FaaS functions, stream processing jobs and AI agents.
- Edges: RPC, queues, event streams. These vary widely in semantics (some are ephemeral, others durable), which affects reliability. There are direct and indirect edges.
- Workflows: sub-graphs (connected components, in graph theory terms) of the graph.

Coordination strategies shape how workflows are built and maintained:

- Choreography (reactive, event-driven) provides high decoupling and flexibility.
- Orchestration (centralized, procedural) offers greater clarity and observability.

Reliable progress hinges on two core concepts:

- Durable Triggers: Work must be initiated in a way that survives failure (e.g., Kafka, queues, reliable RPC).
- Progressable Work: Once started, work must be able to make progress under adverse conditions via replayability, using patterns such as idempotency, atomicity, or the ability to resume from saved state.

Coupling is an ever-present property that must be balanced against other needs and constraints. Using this graph model, we can apply a loose decision framework:

- Ask whether an edge should be a reliable trigger, and if so, what form serves your reliability and coupling requirements. Is it a direct or indirect edge?
- Ask whether a node needs progressable work capabilities, and whether idempotency, transactions, or durable state persistence best fits your context. What programming style are you comfortable with? What infrastructure dependencies are you willing to take on?
- Consider the coordination trade-offs: does this workflow need choreography's flexibility or orchestration's clarity? Is there a core workflow in here, with only direct edges, that can be spun out? Is the graph highly connected and dynamic?
- Consider the programming model. Do you prefer procedural code, or does a more continuous dataflow programming model suit you, or the problem being solved?
- Also consider the complexity of the code written by developers and the total complexity of the system. Durable execution can make some code simpler to write, but it's also another piece of middleware to support, with failure modes of its own, just like a distributed queue or event stream.

Your needs will be specific and complex, with a number of constraints. But you can use this mental framework to define your own more rigorous decision framework and make more informed, balanced architecture decisions.
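Earlier, one of the two options for reliable progress was to couple a reliable trigger with manual progressability such as idempotency. Here is a minimal sketch of that pattern, with hypothetical names; the queue is assumed to redeliver any unacknowledged message, and the in-memory set stands in for a durable store:

```python
# Sketch of "reliable trigger + idempotency": the handler may see the same
# message more than once (at-least-once delivery) but the side effect runs once.
# All names are hypothetical; seen_ids would be durable storage in production.

processed = []        # stands in for real side effects
seen_ids = set()      # in production: a durable store, not process memory

def process_order(payload):
    processed.append(payload)            # the actual business logic

def handle(message):
    """Safe under at-least-once delivery: duplicates are detected and skipped."""
    if message["id"] in seen_ids:        # redelivered duplicate: already done
        return                           # acknowledging a duplicate is harmless
    process_order(message["payload"])
    seen_ids.add(message["id"])          # record progress only after success

# The queue may redeliver, yet the side effect happens exactly once:
handle({"id": "m-1", "payload": {"sku": "A"}})
handle({"id": "m-1", "payload": {"sku": "A"}})
```

The design choice is where progress is recorded: the idempotency check must commit atomically with (or after) the side effect, otherwise a crash between the two reintroduces duplicates.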
I hope this graph model has been useful for you, no matter your preferences regarding coordination, communication and programming styles. A theme I had not expected when I started writing this analysis was the unifying thread of durability. Durability is behind the idea that distributed work should not vanish or halt when something fails. Whether in communication (reliable triggers), in execution (progressable work), or in system design (trees and logs), durability underpins coordination and recoverability. Durability isn't just about data, but about progress too. It's a foundational property that can be built into functions, microservices, and stream processors alike; the only question is in what form.

Coordinated Progress series links:

- Seeing the system: The Graph
- Making Progress Reliable
- Coupling, Synchrony and Complexity
- A Loose Decision Framework

0 views
Jack Vanlightly 5 months ago

Coordinated Progress – Part 3 – Coupling, Synchrony and Complexity

In part 2, we built a mental framework using a graph of nodes and edges to represent distributed work. Workflows are subgraphs coordinated via choreography or orchestration. Reliability, in this model, means reliable progress: the result of reliable triggers and progressable work. In part 3 we refine this graph model in terms of the different types of coupling between nodes, and how edges can be synchronous or asynchronous. Let's set the scene with an example, then dissect that example with the concepts of coupling and communication styles. Let's say we have these services:

- Inventory Service (checks/reserves stock)
- Payment Service (processes payment)
- Shipping Service (arranges delivery)
- Notification Service (sends confirmations)

First, consider a choreographed version, where each service reacts independently to events, and compensations are also event-driven. Failure handling: if a payment fails, the Payment Service publishes PaymentFailed. The Inventory Service listens for this and releases the reserved stock. The Shipping Service ignores the order since it never sees both required events. Key traits in action:

- Decoupled: Payment Service doesn't know or care about Inventory Service.
- Temporal decoupling: If Shipping Service is down, events wait in the queue.
- Reliable progress: Durable events act as reliable triggers of work.
- Emergent workflow: Easy to add new services (like Fraud Detection) that react to existing events.

Alternatively, a central Order Orchestrator manages the entire flow. Key traits in action:

- Centralized control: Easy to see the entire workflow in one place.
- Clear compensation: If payment fails, the orchestrator explicitly releases stock.
- Reliable progress: A DEE ensures the workflow resumes exactly where it left off after crashes.
- Tight coupling: Orchestrator must know about all participating services.

Now suppose we add a fraud check to the flow:

- Choreography: Add a Fraud Service that listens to OrderPlaced and publishes FraudCheckPassed/Failed. Update the Payment Service to wait for fraud clearance. No changes to other services.
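The choreographed reactions and compensation described above can be sketched with a minimal in-memory event bus. All handler and event names are hypothetical, and the bus stands in for durable topics (e.g. Kafka) in a real system:

```python
# Minimal in-memory sketch of the choreographed flow and its compensation.
# A real system would use durable topics rather than this dict of handlers.

from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, event):
    for handler in subscribers[event_type]:
        handler(event)

log = []

# Inventory Service: reserves stock on OrderPlaced, releases it on PaymentFailed.
def reserve_stock(event):
    log.append(f"reserved {event['order']}")
    publish("StockReserved", event)

def release_stock(event):
    log.append(f"released {event['order']}")

# Payment Service: reacts to StockReserved; knows nothing about Inventory.
def charge(event):
    if event["card_ok"]:
        publish("PaymentProcessed", event)
    else:
        publish("PaymentFailed", event)

subscribe("OrderPlaced", reserve_stock)
subscribe("PaymentFailed", release_stock)
subscribe("StockReserved", charge)

# A failed payment triggers the compensating release, with no central coordinator:
publish("OrderPlaced", {"order": "o-1", "card_ok": False})
```

Note how no service calls another directly: adding a Fraud Service would just mean another `subscribe`, with no changes to existing handlers.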
- Orchestration: Modify the central orchestrator code to add the fraud check step. Deploy the new version carefully to handle in-flight workflows (versioning is one serious challenge for orchestration-based workflows).

The fraud check is probably connected by a direct edge, i.e., it is necessary for the order process to complete. Now let's add some tangential actions, such as updating a CRM, feeding a reporting service and logging the order in an auditing service. Each gets added one at a time over a period of weeks.

- Choreography: No changes to existing services. Make the CRM, reporting service and auditing service listen for the order_placed and order_cancelled events.
- Orchestration: Modify the central orchestrator code to add the call to the CRM and deploy. Follow with the financial reporting service, then the auditing service.

What about debugging a failed workflow?

- Choreography: Check logs across multiple services, correlate by order ID, reconstruct the event flow.
- Orchestration: Look at the orchestrator's execution history.

As you can see from this limited example, there is a wealth of pros and cons to consider. Now let's use this example, along with additional concepts around coupling and communication styles, to refine our mental framework further. Coupling is a key consideration when designing service-to-service communication. It comes in different forms, most notably design-time coupling and runtime coupling. Design-time coupling refers to how much one service (a node in our graph) must know about another in order to interact. RPC introduces strong design-time coupling: the caller must know the callee's interface, expected request structure, response shape, and often its semantic behavior (such as its throttling and latency profile). Even changes to the internal implementation of the callee can break the caller if not carefully abstracted. Code that triggers work via RPC must be changed every time a work dependency changes, is added or removed.
Events also introduce design-time coupling, primarily around shared schemas. However, they also reduce coupling in other forms. For example, in a choreographed architecture, producers don't need to know who consumes the event or how it's processed. This allows new consumers to be added later without changes to the producer. Services evolve more independently, as long as schema evolution practices (e.g. schema versioning, compatibility guarantees) are respected. For example, the service emitting the PaymentProcessed event doesn't need to know whether shipping, analytics, or notification systems will consume it, or even whether those systems exist yet. This contrasts with orchestration, which may have to know about these services and be updated when new dependencies are added (like in our e-commerce example). Runtime coupling is about whether services need to be up and available at the same time. RPC is tightly runtime-coupled: if service A calls service B synchronously, B must be up, fast, and reliable. Chains of RPC calls can create fragile failure modes, where one slow or unavailable service can stall or crash an entire request path. As Leslie Lamport famously said: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." He must have been thinking about RPC call chains. But request/response implemented over queues can suffer from similar runtime coupling, if the timing expectation is the same as RPC. We'll get into timing expectations in the next section. In contrast, publish-subscribe using events is free from runtime coupling between services. A service can emit an event and move on, trusting that any consumers (if they exist) will eventually process it when ready. Both RPC and events come with coupling trade-offs. Events reduce direct entanglement between services, especially at runtime, while still requiring careful attention to schema contracts.
RPC is simpler in some use cases, but creates tighter coupling that can make systems more brittle over time. My colleague Gunnar Morling wrote a nice piece, Synchrony Budget, that is relevant here. Communication between services isn't just synchronous vs asynchronous; it exists on a continuum of timing/response expectations. I will differentiate here between a response and a confirmation. I use the term confirmation for an immediate reply of "I've got this, trust me to get it done". A service that carries out the work asynchronously will still respond with a 200 to confirm to the caller that it received the request successfully. Kafka will send a producer acknowledgement once the batch has been committed. I use the term response for a reply (that might come later and possibly even through a different communication medium) that contains data associated with the requested work, now done. Responses sit on a continuum. At one end is immediate response, where the caller needs an answer now to proceed. At the other end is no response at all, where the sender simply informs the system of something that happened. In between lies relaxed response timing, where the result is needed eventually, but not right away. RPC is typically on the left. It's used when the caller needs to block and wait for the result, with a short latency requirement. One example is "reserve inventory before payment" in a synchronous checkout flow. This tight timing requirement makes the caller innately dependent on the callee's availability and latency, making an RPC a valid choice. Events sit on the right. They're used when a service simply emits a fact (e.g. "order placed"), without expecting a reply. Consumers handle the event independently and asynchronously. Relaxed timing responses live in the middle. Some use asynchronous RPC (like invoking a function and getting a callback later, e.g. via a webhook or polling), and some use queue-based patterns.
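The confirmation/response distinction above can be sketched as an accept-now, respond-later handler. All names here are hypothetical, and the in-process queue stands in for real middleware:

```python
# Sketch of confirmation vs response: the caller immediately receives a
# confirmation ("I've got this"), while the response carrying the result
# arrives later, out of band, once a worker has done the work.

import queue

work_queue = queue.Queue()
callbacks = {}   # request id -> where to deliver the eventual response

def submit(request_id, payload, on_response):
    """Accept the work and confirm immediately, like a 202 or a producer ack."""
    callbacks[request_id] = on_response
    work_queue.put((request_id, payload))
    return "accepted"               # confirmation: received, not yet done

def drain():
    """Worker loop: performs the work and delivers responses afterwards."""
    while not work_queue.empty():
        request_id, payload = work_queue.get()
        result = payload.upper()    # stand-in for the real work
        callbacks.pop(request_id)(result)

responses = []
confirmation = submit("r-1", "order placed", responses.append)
drain()
# confirmation arrived immediately; the response only after the work ran
```

The caller is free to proceed after the confirmation; how long it can wait for the real response is exactly the "relaxed timing" dimension of the continuum.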
Asynchronous RPC is still runtime-coupled, whereas queues act as the fault-tolerant middleware that will eventually deliver the request to the destination. While RPC and events are often presented as opposites, real systems use hybrids. RPC can be made asynchronous. Events can be used for request/response with correlation IDs. But starting with this continuum helps frame the key trade-offs. We defined Reliable RPC previously as being "delivered" by a fault-tolerant middleware, such as a Durable Execution Engine (DEE). This gives RPC some queue-like properties. For one, it reduces runtime coupling between the sender and receiver. If the receiver is down when the sender sends it, no worries: the RPC will eventually get delivered to the receiver, just like an event/command on a queue. Using the graph terminology from parts 1 and 2, it has turned an ephemeral edge into a reliable one (a reliable trigger). In fact, I tend to think of Reliable RPC as another implementation of point-to-point queues, which are either one-way or request/response. Of course, this is most useful in relaxed timing scenarios. But what about resumability? If a Reliable RPC is like a synchronous RPC but with reliable delivery (of both request and response), then what about the sender's context and resuming once the result has been received? The sender is keeping a bunch of variables and context in its stack memory, and if the host fails, then it cannot resume once the Reliable RPC has completed. Let's take an example of a two-step process:

1. Make a call to reserve stock.
2. Make a call to the payment service.

This is a relaxed timing scenario; the user has been told the order is in progress. Let's say that the first call is made and the stock gets reserved, but the calling context dies before the response is received. How does this work get resumed?
Asynchronous RPC and request/response-over-queues address resumability in the above scenario in the following ways:

- The response is not received directly by the calling context, but via an event handler receiving a response from a queue or a webhook RPC. So it doesn't matter that the original context is dead; it could be a completely different instance of the application that receives the response.
- The state necessary for moving on to step 2 is either contained in the response, or a correlation id is used in the request and the response and the necessary state was written to a data store, using the correlation id as the key. So the receiver can retrieve the necessary state.
- The response handler then calls the payment service.

The downside of this approach is that the code is spread across response handlers, making it more complex to write, read and debug. With queues or webhooks, the response arrives out-of-band, and correlation IDs must be managed manually. Some DEEs eliminate this complexity by letting you write code as if it were synchronous. The framework persists intermediate state and retriggers the function so it can make progress. The DEE acts as the reliable delivery middleware for requests and responses (like a queue), but also abstracts away all the response handler plumbing. All the code exists in one function, and if the function fails, the DEE invokes it again; crucially, the code is written using the DEE SDK, and the SDK silently retrieves the state of the function's progress (including prior responses) so the code essentially resumes from where it left off (by using persisted state to skip prior work). Determinism is required for this strategy to work. For example, if the function starts by generating a random UUID as the identifier for the work (such as an order), then a second invocation would generate a different UUID. This is a problem if half the steps have already been carried out using a different UUID.
To cover those needs, the SDK provides deterministic UUIDs (it durably stores the result of the first UUID generation), datetimes, integers and so on. Reliable RPC coupled with resumability simplifies the code (no callbacks, response handlers, or correlation ids). All the code can exist in one function, yet resume on a different application instance after failure. With a reductionist mindset, we could say that Reliable RPC plus resumability is merely a convenience, a way to avoid building complex asynchronous response handling. But this is not a trivial aspect at all. It can make the developer's life easier and make for more readable code. Bringing this back to the graph model, not all edges are created equal. Direct edges represent dependencies where the workflow cannot proceed without some kind of response. In the e-commerce flow, the order cannot be completed without reserving inventory and processing the payment first. These are blocking dependencies, though the timing requirements could be short (ideal for RPC) or long (best suited to reliable forms of communication such as a queue, event stream or Reliable RPC). When a workflow has blocking dependencies, the coupling between steps already exists at the business logic level. The question becomes whether to make this coupling explicit through orchestration or implicit through event choreography. The coupling cost of orchestration may be worth paying because these services already have a runtime dependency; they need each other to function. Making this explicit through orchestration may reduce overall system complexity compared to managing the same coordination through distributed event flows. However, even for a core workflow of direct edges, the decoupled nature of choreography might win out. It may simply not make sense for one piece of a wider workflow to be modeled as an orchestrated workflow, with different technologies, support issues, versioning strategies and deployment procedures.
Also, team autonomy and team boundaries may cross such a workflow, such that the decoupling of choreography is still best. Indirect edges represent actions that are operationally or even strategically important but don't block the immediate workflow from completing successfully. The reliable indirect edge ensures that important work eventually gets carried out. In our e-commerce example, CRM updates, financial reporting, and audit logging might be business-critical or legally required, but the customer's order can be fulfilled even if these systems are temporarily unavailable. Indirect edges will also often cross more granular organizational boundaries, where the cost of inter-team coordination is higher. The order processing logic is likely owned by a small number of teams in the software development org, whereas financial and auditing services/software systems are likely managed by a different set of teams, potentially under different management hierarchies. For indirect edges, events match this natural business decoupling because:

- Separate concerns are decoupled: If CRM updates, financial reporting, and audit logging are separate business concerns, they shouldn't be embedded in order processing logic. Every time a new system needs to know about orders (fraud detection, analytics, customer success tools), the core orchestrator would need modification, deployment, and versioning.
- Reduces coordination overhead: Teams owning peripheral systems can independently subscribe to order events without requiring changes from the order processing team.
- Prevents scope creep: The core workflow stays focused on its essential purpose rather than accumulating tangential responsibilities.

The principle is that orchestration should be limited to direct-edge subgraphs, i.e., the minimal set of services that must coordinate to complete the core business function. Everything else should use choreography to preserve the business-level decoupling that already exists.
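Earlier we saw that DEE-style resumability requires determinism: each step's result (even a random UUID) is durably journaled so that a re-invocation replays recorded results instead of redoing work. A minimal sketch of that replay mechanic, with hypothetical names (`journal` stands in for the engine's storage):

```python
# Sketch of DEE-style deterministic replay: each step's result is persisted,
# so re-invoking the same workflow function after a crash skips completed
# steps and reproduces the same values, including the "random" order id.

import uuid

journal = {}     # step name -> durably persisted result
runs = []        # which steps actually executed (for illustration)

def step(name, fn):
    """Run fn at most once; on replay, return the journaled result instead."""
    if name not in journal:
        journal[name] = fn()
        runs.append(name)
    return journal[name]

def workflow():
    # Even the random id is stable across replays, because it is journaled.
    order_id = step("gen-id", lambda: str(uuid.uuid4()))
    step("reserve-stock", lambda: f"reserved:{order_id}")
    step("charge-payment", lambda: f"charged:{order_id}")
    return order_id

first = workflow()
second = workflow()   # simulated re-invocation after a crash: nothing reruns
```

Because `first == second`, half-completed work is never repeated with a different identifier, which is exactly the failure mode the deterministic SDK primitives exist to prevent.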
In Part 4, we’ll finish the series with some last reflections and a loose decision framework.

Coordinated Progress series links:
- Seeing the system: The Graph
- Making Progress Reliable
- Coupling, Synchrony and Complexity
- A Loose Decision Framework

The e-commerce example services:
- Inventory Service (checks/reserves stock)
- Payment Service (processes payment)
- Shipping Service (arranges delivery)
- Notification Service (sends confirmations)

Choreography characteristics:
- Decoupled: Payment Service doesn't know or care about Inventory Service.
- Temporal decoupling: If Shipping Service is down, events wait in the queue.
- Reliable progress: Durable events act as reliable triggers of work.
- Emergent workflow: Easy to add new services (like Fraud Detection) that react to existing events.

Orchestration characteristics:
- Centralized control: Easy to see the entire workflow in one place.
- Clear compensation: If payment fails, the orchestrator explicitly releases stock.
- Reliable progress: A DEE ensures the workflow resumes exactly where it left off after crashes.
- Tight coupling: The orchestrator must know about all participating services.

Adding a fraud check:
- Choreography: Add the Fraud Service that listens to OrderPlaced and publishes FraudCheckPassed/Failed. Update Payment Service to wait for fraud clearance. No changes to other services.
- Orchestration: Modify the central orchestrator code to add the fraud check step. Deploy the new version carefully to handle in-flight workflows (versioning is one serious challenge for orchestration-based workflows).

Adding CRM, reporting and auditing:
- Choreography: No changes to existing services. Make the CRM, reporting service and auditing service listen for the order_placed and order_cancelled events.
- Orchestration: Modify the central orchestrator code to add the call to the CRM and deploy. Followed by the financial reporting service, then the auditing service.

Debugging a workflow:
- Choreography: Check logs across multiple services, correlate by order ID, reconstruct the event flow.
- Orchestration: Look at the orchestrator's execution history.

As an example of reliable request/response messaging, consider a two-step flow: make a call to reserve stock, then make a call to the payment service.
The response is not directly received by the calling context, but via an event handler receiving a response from a queue or a webhook RPC. So it doesn’t matter that the original context is dead. It could be a completely different instance of the application that receives the response. The state necessary for moving onto step 2 is either contained in the response, or a correlation id is used in the request and the response, and the necessary state was written to a data store, using the correlation id as the key. So the receiver can retrieve the necessary state. The response handler then calls the payment service.
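The correlation-id flow described above can be sketched as follows. In-memory dicts stand in for the durable state store and the messaging layer, and all names are illustrative:

```python
# Durable state store, keyed by correlation id (in-memory stand-in).
state_store: dict[str, dict] = {}

def send_reserve_stock_request(correlation_id: str) -> None:
    pass  # stand-in for publishing a command to a queue

def call_payment_service(order_id: str) -> None:
    state_store[f"corr-{order_id}"]["step"] = "payment_requested"

def start_workflow(order_id: str) -> str:
    """Step 1: reserve stock, recording the state the response handler needs."""
    correlation_id = f"corr-{order_id}"
    state_store[correlation_id] = {"order_id": order_id, "step": "reserving"}
    send_reserve_stock_request(correlation_id)  # async; response arrives later
    return correlation_id

def on_stock_response(message: dict) -> None:
    """Response handler: may run on a completely different app instance.

    It recovers the workflow state via the correlation id, so it does not
    matter that the original calling context is gone.
    """
    state = state_store[message["correlation_id"]]
    state["step"] = "stock_reserved"
    call_payment_service(state["order_id"])  # step 2

# Simulate: start the workflow, then a response arrives "later".
cid = start_workflow("42")
on_stock_response({"correlation_id": cid, "status": "reserved"})
```

Note how the caller's stack is irrelevant after `start_workflow` returns; everything step 2 needs is reachable through the correlation id.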

Jack Vanlightly 5 months ago

Coordinated Progress – Part 2 – Making Progress Reliable

In part 1, we described distributed computation as a graph and constrained the graph for this analysis to microservices, functions, stream processing jobs and AI agents as nodes, and RPC, queues, and topics as the edges. For a workflow to be reliable, it must be able to make progress despite failures and other adverse conditions. Progress typically depends on durability at the node and edge levels. Reliable progress relies on a reliable trigger of the work and the work being “progressable”.

Reliable Progress = Reliable Trigger + Progressable Work

Progressable Work is any unit of work that can make incremental, eventually consistent progress even in the face of failures. Typically, a reliable trigger causes a repeated attempt at performing the work, which may include one or more of the following patterns:
- Implements idempotency (so repeated attempts don’t cause duplication or corruption).
- Durably logs partial progress, so it can resume from where it left off (thereby avoiding duplication or corruption).

Work that is atomic (via a transaction) is also helpful for consistency, though idempotency is also required to avoid duplication. Transactions may not be an option, and so reliable eventual consistency must be implemented instead (via progressable work). Progressable work is work that is safe to retrigger, rewind, or resume. But what re-triggers the work? This is where source state durability is required. The thing that triggers the work, and its associated state, must also be reliable and therefore durable. The classic reliable trigger is the message queue.

So, summarizing, if we look at any given node in the graph, the reliability stems from:
- A Reliable Trigger. This will require a form of durability.
- Progressable Work. Either:
  - No controls, as duplication or inconsistency doesn’t matter.
  - Idempotency of tasks, so a re-triggering avoids duplication/inconsistency.
  - Transactions. Usually only available (or desired) from a single data system, such as a database. Still relies on idempotency to avoid duplication.
  - Durable logging of work progress, with the ability to resume where it left off (by caching intermediate results).
  - Some mix of the above.

A stream processing job, such as Flink or Kafka Streams, ensures reliable progress by durably logging its progress via state persistence/changelogs (Progressable Work) and relying on a durable data source (Reliable Trigger) in the form of a Kafka topic. A reliable function will have a Reliable Trigger and implement Progressable Work. A reliable trigger for a function/microservice will be an RPC or a message (event/command) from a queue or topic. Queues and topics are highly available, fault-tolerant, durable middlewares (and therefore great as reliable triggers). A queue durably stores messages (events or commands) until deleted. A message can trigger a function, and the message is only deleted once the function has successfully run. A message on a topic is basically the same, except that messages can be replayed, adding an additional layer of durability. Note that some queues can also do replay.

RPC is not innately reliable; it depends on the caller being available to maintain state in memory and reissue the RPC if a retry is needed. In typical microservices/functions, the function instance is not fault-tolerant, so it cannot do that. For RPC to become innately reliable, it must also be “delivered” via a highly available, fault-tolerant, durable middleware. This is one of the roles of the Durable Execution Engines (DEE), and we’ll refer to RPCs mediated by these engines as Reliable RPC.

Progressable work, as already described, has two main options: idempotency or durable logging of work progress (and transactions can play a role for consistency). AI agents will become more deeply embedded in distributed systems as the category gains traction, with agents taking actions, calling APIs, and generating decisions. The need for reliable progress applies here too.
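As an aside, the trigger-plus-idempotency combination can be sketched minimally. A list stands in for a durable queue that redelivers until acked, a set of processed ids provides the idempotency, and all names are illustrative:

```python
# Durable queue stand-in: messages stay until explicitly deleted (acked).
# The duplicate entry simulates a redelivery after a crash before the ack.
queue: list[dict] = [{"id": "m1", "order_id": "42"},
                     {"id": "m1", "order_id": "42"}]

processed_ids: set[str] = set()   # durable record of completed work
shipped_orders: list[str] = []

def handle(message: dict) -> None:
    """Idempotent handler: a redelivered message causes no duplicate effect."""
    if message["id"] in processed_ids:
        return  # already done; safe to ack again
    shipped_orders.append(message["order_id"])
    processed_ids.add(message["id"])

# The queue re-triggers the work: a crash before the ack means redelivery,
# and idempotency makes the repeated attempt harmless.
while queue:
    handle(queue[0])
    queue.pop(0)  # ack (delete) only after successful processing

assert shipped_orders == ["42"]  # processed exactly once, despite redelivery
```

The ordering matters: the message is acked only after the handler succeeds, which is exactly what makes the queue a reliable trigger.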
Like any other node in a distributed workflow, an AI agent’s actions must be triggered reliably and either complete successfully or fail in a way that can be handled. Many durable execution start-ups are now focusing on AI agents as a core use case. The Flink AI Agent sub-project (see FLIP-531) has also been spurred by the need for agents to progress reliably through a series of actions. Its approach is to treat AI inference and decision-making as part of the dataflow, with exactly-once semantics, checkpointing, and state management taken care of, in part, by Flink’s existing progressable work infrastructure.

So far, we have described distributed computation as a graph and constrained the graph for this analysis to microservices, functions, stream processing jobs, and AI agents as nodes, and RPC, queues, and topics as the edges. Secondly, we have identified reliable progress as the combination of Reliable Triggers and Progressable Work (where durability plays a key role).

Reliable Progress = Reliable Triggers + Progressable Work

Now, we will examine work that spans multiple nodes of the graph, using the term 'workflow'. A critical aspect of workflow is coordination. A workflow forms a graph of steps, where each node or edge in the graph can fail in unique ways. A failure may be simple, such as a service being unavailable to invoke, or complex, such as a partial failure where only half of the action was taken, leaving the other half undone (and inconsistent). If we imagine a continuum, at each extreme we find a different strategy for coordinating such a workflow:
- Use a decentralized, event-driven, choreography-based architecture.
- Use a centralized, RPC-heavy, procedural orchestrator.

Choreography. Key traits: Decentralized coordination. Reactive autonomy. Coordinated by asynchronous events. Decoupled services.

Event-driven choreography supports workflow in multiple ways:
- Events act as reliable triggers. Only once an event has been fully processed will it become unreadable (by an offset advancing in Kafka or an event being deleted from a queue). If a failure occurs during the processing of an event, then the event remains to be consumed again for a retry.
- Asynchronous: Consuming services do not need to be available at the same time as publishing services, decoupling the two temporally and increasing reliability.
- Events can be replayed. Even processed events can be replayed if necessary (given the event streaming service supports that).
- Events trigger rollbacks and compensations (sagas). If an error occurs, a service can emit a failure event that other services can subscribe to, in order to perform rollbacks and compensation actions. Kafka transactions support advancing the offset on the input topic and publishing an error event as an atomic action.
- Long-running workflows. The workflow can pause and resume implicitly, based on the timing of published events. However, said timing is not typically controlled by the event streaming service.

Reliable event consumers must implement Progressable Work in some way, with idempotency being a common one. While I list decoupling as the main pro, its impact cannot be overstated.

Pros:
- Highly decoupled services (a very big deal):
  - Independent service evolution and deployment (design-time decoupling).
  - Limited blast radius during failures (runtime decoupling).
  - Downstream services that execute subsequent workflow steps do not need to be available when the current step executes (temporal decoupling).
- Highly adaptable architecture:
  - Scales naturally with system growth. New consumers can react to events without requiring changes to existing publishers.
  - Flexible composition. Workflows can evolve organically as new consumers are added, enabling emergent behavior.

Cons:
- Reasoning about complex flows: Hard to reason about for complicated workflows or sprawling event flows. It can be challenging to find where a choreographed workflow starts and where it ends, particularly as it crosses team boundaries. Ownership of the workflow is distributed to the participating teams.
- Monitoring and debugging: It can be difficult to debug workflows that cross many different boundaries. Monitoring workflows is challenging due to their decentralized and evolvable nature.
- Consistency challenges: Compensations can be harder to reason about as logic is spread across consumers, making it sometimes challenging to verify full undo/compensation without strong observability. Non-deterministic execution makes race conditions more likely (such as receiving OrderModified before OrderCreated). No in-built Progressable Work tooling, only reliable triggers.
- Developer training: Developers must learn event modelling. Developers may need to learn how to use Kafka or a queue reliably (or use a high-level framework).

Orchestration. Key traits: Centralized coordination. Procedural control. RPC triggers subordinate microservices/functions. Well-defined scope.

“Orchestration engines” refers to DEEs such as Temporal, Restate, DBOS, Resonate, LittleHorse, etc, which all have slightly different models (which are out of scope for this document). An orchestrated workflow is a centrally defined control flow that coordinates a sequence of steps across services. It behaves like a procedural program:
- Procedural code. The orchestrated workflow is written like regular code (if-then-else).
- Progressable Work via durable logging of work progress. State is persisted so it can survive restarts, crashes, or timeouts, and can resume based on timers. This is key for completing sagas.
- Reliable Triggers via reliable RPC (if the framework supports that) or events (can integrate with event streams).
- Centralized Control Flow. Unlike choreography, where each service reacts to events, orchestration has one logical owner for the process: the orchestrator.
- Explicit Logic for Branching, Looping, and Waiting. This may use regular code constructs as in Function-based Workflow, or may use a framework SDK for these Ifs and Loops in Graph-based Workflow.
- Long-running workflows. An orchestrated workflow can pause, and then resume based on first-class triggers or timers.

An orchestrated workflow is like a function, and therefore it needs its own Reliable Trigger, and the orchestration code must implement Progressable Work. A reliable trigger of the workflow itself could be:
- An event or command delivered via a queue or topic.
- A Reliable RPC mediated via a Durable Execution Engine.

Progressable work could be implemented entirely via idempotent steps (though this may not be practical as a general rule, as idempotency can be hard to implement). Therefore, the durable logging of work progression (by a Durable Execution Engine) can add value.

Pros. Centralized control flow makes it:
- Simpler to reason about.
- Easier to monitor and debug.
- Reliable Triggers and Progressable Work are built into DEE frameworks.
- Reliable RPCs can function without temporal coupling (via DEE).
- Compensation actions are clearly linked to failures and made reliable.

Cons:
- Challenging to version. It can be hard to update workflows while supporting existing workflow executions. Long-running workflows could conceivably have multiple versions running concurrently.
- The orchestration (or DEE) service is another infrastructure dependency.
- Orchestration code belongs to one team, but that team must coordinate with the teams of the subordinate microservices.
- Orchestration can lead to tighter coupling without discipline. This can conflict with microservices autonomy and bounded context independence. Greater design-time coupling leads to more versioning as flows change.
- Developer training: Developers must learn the programming model of the specific DEE (all are a bit different). Developers must learn about deterministic workflow execution, step vs workflow boundaries, and avoiding anti-patterns such as God workflows which control everything.

In Part 1, I described how not all edges in a workflow graph are equal. Some are direct dependencies: essential steps that must succeed for the business goal to be achieved. For example, the edge from the Order Service to the Payment Service during checkout is part of the core execution path. If it fails, the workflow fails. Other edges are indirect. These represent auxiliary actions triggered by the workflow, such as updating a CRM, reporting service or auditing. While important, they are not critical to completing the core task itself. Often these just need to be reliable, but are triggered in a decoupled and one-way fashion.

In orchestration, these distinctions matter. A well-designed orchestrator should focus only on the minimal set of steps required to drive the business outcome (the direct edges). Incorporating indirect actions directly into the orchestrator increases coupling, inflates the workflow definition, and introduces more reasons to redeploy or version the orchestrator when non-essential concerns change. Choreography, by contrast, treats direct and indirect edges the same. Events flow outward, and any number of services can react. There is no centralized control, and thus no enforced boundary around what "belongs" to a given workflow. This can be both a strength (such as encouraging extensibility) and a weakness. The main weakness is that it is harder to reason about what constitutes the workflow's critical path.

Choreography and orchestration are both essential patterns for coordinating distributed workflows, but they offer different properties. What they share is the need for durability in order to provide reliability. Orchestration looks promising for mission-critical core workflows because of its superior understandability, observability, and debuggability.
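To make the orchestration side concrete, orchestrated workflow code with durable logging of work progress can be sketched as ordinary procedural code whose steps are memoized in a durable log, so a replay after a crash skips completed steps. This is an illustrative sketch under simplifying assumptions, not any specific DEE's API:

```python
# In-memory stand-in for the engine's durable progress log.
step_log: dict[str, object] = {}
calls: list[str] = []  # records actual service invocations, for illustration

def durable_step(name: str, action):
    """Run the action once; on replay, return the recorded result instead."""
    if name in step_log:
        return step_log[name]
    result = action()
    step_log[name] = result  # a real engine persists this before moving on
    return result

def order_workflow(order_id: str) -> str:
    # Procedural control flow: plain code, but each step is durable.
    durable_step("reserve", lambda: calls.append(f"reserve:{order_id}"))
    durable_step("pay", lambda: calls.append(f"pay:{order_id}"))
    durable_step("ship", lambda: calls.append(f"ship:{order_id}"))
    return "completed"

order_workflow("42")   # first attempt
order_workflow("42")   # simulated crash-and-replay: no duplicate service calls
assert calls == ["reserve:42", "pay:42", "ship:42"]
```

The replay-safety shown here is also why real engines require deterministic workflow code: on replay, the control flow must take the same path so that recorded step results line up with the steps being requested.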
With a centralized control flow, durable state, and explicit compensation handling, orchestration frameworks make core workflows easier to understand, monitor, and debug. The orchestration engine provides Reliable Trigger and Progressable Work support. But such orchestration should be limited to islands of core workflows, connected by indirect edges in the form of events.

Choreography, on the other hand, is indispensable for decoupling systems, allowing services to react to events without tight coupling or centralized control:
- Design-time decoupling enables teams to build, deploy, and evolve services independently, reducing coordination overhead and supporting faster iteration.
- Runtime decoupling minimizes blast radius by isolating failures: one service can fail or degrade without directly affecting others.
- Temporal decoupling allows producers and consumers to operate on different schedules, enabling long-running workflows, asynchronous retries, and increased resilience to transient outages.

Together, these forms of decoupling promote architectural flexibility and team autonomy. Events act as Reliable Triggers, and the event consumers must decide how to implement Progressable Work themselves.

In practice, many operational estates could benefit from a hybrid coordination model. There are two types of hybrid:
1. Mixing choreography and orchestration across the graph. As I already described, orchestration should not control the entire execution flow of a system. Unlike choreography, which can span an entire system due to its decoupled nature, orchestration should focus on well-defined processes. These orchestrated workflows can still integrate with the broader choreographed system by emitting events, responding to events, and acting as reliable islands of control within a larger event-driven architecture.
2. Using orchestration via events (or event-mediated orchestration).
In this hybrid model, the orchestration code does not use RPC to invoke subordinate microservices, but sends commands via queues or topics. Subordinate microservices use the events as Reliable Triggers, implement their own Progressable Work, and send responses to another queue/topic. These responses trigger the next procedural set of work. In this model, Reliable Triggers are handled by queues/topics, and Progressable Work is either done via idempotency or durable logging of work progress. This can avoid the need for a full DEE, but might require custom durable logging.

Now that we have the mental model in place, Part 3 will refine the model further, with concepts such as coupling and synchrony.

Coordinated Progress series links:
- Seeing the system: The Graph
- Making Progress Reliable
- Coupling, Synchrony and Complexity
- A Loose Decision Framework
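As a closing aside, the event-mediated orchestration hybrid described above might look roughly like this sketch, where the orchestrator sends commands and advances on responses, both carried by queues. In-memory lists stand in for the queues, and all names are illustrative:

```python
commands: list[dict] = []   # stand-in for the command queue/topic
responses: list[dict] = []  # stand-in for the response queue/topic
workflow_state: dict[str, str] = {}  # durable progress, keyed by order id

def orchestrator_start(order_id: str) -> None:
    workflow_state[order_id] = "reserving"
    commands.append({"cmd": "reserve_stock", "order_id": order_id})

def orchestrator_on_response(resp: dict) -> None:
    """Advances the procedural flow when a subordinate service responds."""
    order_id = resp["order_id"]
    if workflow_state[order_id] == "reserving" and resp["ok"]:
        workflow_state[order_id] = "paying"
        commands.append({"cmd": "take_payment", "order_id": order_id})
    elif workflow_state[order_id] == "paying" and resp["ok"]:
        workflow_state[order_id] = "completed"

# Subordinate services: triggered by commands, replying on the response queue.
def inventory_service(cmd: dict) -> None:
    responses.append({"order_id": cmd["order_id"], "ok": True})

def payment_service(cmd: dict) -> None:
    responses.append({"order_id": cmd["order_id"], "ok": True})

handlers = {"reserve_stock": inventory_service, "take_payment": payment_service}

# Simulate the message loop: deliver commands, then feed responses back.
orchestrator_start("42")
while commands or responses:
    if commands:
        cmd = commands.pop(0)
        handlers[cmd["cmd"]](cmd)
    if responses:
        orchestrator_on_response(responses.pop(0))

assert workflow_state["42"] == "completed"
```

Because both legs travel over durable queues, neither the orchestrator instance nor the subordinate services need to be alive at the same moment, which is the temporal decoupling this hybrid buys.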

Jack Vanlightly 5 months ago

Coordinated Progress – Part 1 – Seeing the System: The Graph

At some point, we’ve all sat in an architecture meeting where someone asks, “ Should this be an event? An RPC? A queue? ”, or “ How do we tie this process together across our microservices? Should it be event-driven? Maybe a workflow orchestration? ” Cue a flurry of opinions, whiteboard arrows, and vague references to sagas. Now that I work for a streaming data infra vendor, I get asked: “ How do event-driven architecture , stream processing , orchestration , and the new durable execution category relate to one another? ” These are deceptively broad questions, touching everything from architectural principles to practical trade-offs. To be honest, I had an instinctual understanding of how they fit together but I’d never written it down. Coordinated Progress is a 4-part series describing how I see it, my mental framework , and hopefully it will be useful and understandable to you. I anchor the mental framework in the context of workflow , using the term in a broad sense to mean any distributed work that spans multiple services (such as a checkout flow, a booking process, or a loan application pipeline). Many people use the term “saga” to describe long-running workflows that span multiple services and require coordination and compensation. This analysis uses the more general term workflow to capture a broader class of distributed work. Modern systems are no longer built as monoliths, they are sprawling graphs of computation , stitched together by APIs, queues, and streams; implemented across microservices, functions, stream processors, and AI agents. Complex workflows cross service boundaries, requiring coordination that must be both reliable and understandable, as well as flexible and adaptable. Within these graphs, the concepts of coordination and reliable progress are critically important. Coordination strategies shape how workflows are built and maintained: Choreography (reactive, event-driven) provides high decoupling and flexibility. 
Orchestration (centralized, procedural) offers greater clarity and observability. But not all edges in the graph are equal . Some edges are direct , defining the critical path of a workflow where failure means failure. Others are indirect , triggering auxiliary actions in adjacent or even far away services. A good mental model distinguishes between the two: orchestration should focus on direct edges, while choreography handles both naturally. Reliable progress hinges on two core concepts: Durable Triggers : Work must be initiated in a way that survives failure (e.g., Kafka, queues, reliable RPC). Progressable Work : Once started, work must be able to make progress under adverse conditions via replayability using patterns such as idempotency, atomicity, or the ability to resume from saved state. While stream processors (e.g. Flink , Kafka Streams ) and event-driven systems based on queues and topics (e.g. Kafka , RabbitMQ ) have durability built in, imperative code typically does not. Durable Execution Engines (DEEs), such as Temporal , Restate , DBOS , Resonate , and LittleHorse (among many others), aim to fill that gap in the world of imperative functions. They provide varying tooling and language support for adding durable triggers and progressable work to imperative, procedural code. This analysis constructs a conceptual framework for understanding both coordination and reliable progress in modern distributed architectures composed of microservices, functions, stream processing, and AI agents, including new building blocks made available by DEE frameworks. Neo: The Graph. Morpheus: Do you want to know what it is? Morpheus: The Graph is everywhere. It is all around us. Even now, in this very server room. You can see it when you look at your IDE or when you design your Flink topology. You can feel it when you work on microservices... when you handle incidents... when you deploy. At every level of abstraction, computation reveals itself as a graph . 
A function in a microservice contains a control flow graph (comprising branches, loops, and conditionals) that describes its execution logic. A Flink job is explicitly a directed graph of operators and stateful nodes connected by streams. A workflow that ties together multiple services, whether via orchestration or choreography, is also a graph, one that represents dependencies, event flows, or command sequences. Coordination plays a critical role in this graph of graphs and is also present at every layer . Within a single Flink job, it is Flink itself that coordinates work across multiple task managers. Within a microservice, the executable code acts as a linear or concurrent coordination of multiple steps that may invoke other services or data systems. However, it is the coordination required for workflows across multiple systems that presents the largest challenge . Distributed coordination across heterogeneous systems, programming models and environments is a strategic concern for organizations with far-reaching consequences. This graph, made up of workflows spanning multiple systems, can make discussing topics of programming models and coordination methods confusing. For example, one consumer in an event-driven architecture may execute its code procedurally in an imperative style, but play the role of a node in a reactive architecture. A workflow can be triggered by an event, which acts as a reliable trigger for retries, or it might be triggered by an ephemeral RPC. The types of nodes and edges of the graph matter. While everything is a graph (even code), for this analysis, let’s confine things so that: Nodes are microservice functions, FaaS functions, stream processing jobs and AI agents. Edges are RPC, queues, event streams. These vary widely in semantics, some are ephemeral, others durable, which affects reliability. Workflows are sub-graphs (or connected components as in graph theory) of the Graph. The Graph. Nodes connected by edges. 
Workflows as sub-graphs. What constitutes a workflow is debatable, especially with the broad meaning I'm using in this analysis. But I like to characterize a workflow by the types of its edges, which can be direct or indirect. Edges have other properties too, such as request/response vs one-way, and synchronous vs asynchronous, but for now we'll keep the model simple and think only about whether edges are direct or indirect.

Direct edges trigger work that is central to the goal being performed. Take an Order Placed workflow as an example: there might be a set of microservices to handle the payment, the reservation of stock, and the initiation of shipment preparation, all directly tied to the order. These are connected by direct edges and form the core order placed workflow.

Indirect edges trigger tangential, auxiliary work, such as notifying the CRM for customer management, a finance/reporting system, or an auditing service for compliance in the order workflow. An indirect edge could even trigger a secondary core workflow, such as a just-in-time inventory process. Whether an edge is direct or indirect will influence what kind of communication medium is chosen (more on that in parts 2 and 3).

Workflows require coordination, whether that coordination is just a dance between independent services or something more centralized and controlled. There are two main coordination strategies: choreography and orchestration.

Choreography: Event-driven workflow (reactive). Highly decoupled microservices independently react to their input events as they arrive. There is no blocking or waiting, and all consumers operate independently of any upstream producers or subsequent downstream consumers. Coordination happens via publish-subscribe semantics. The entire impact of an upstream event can spread far, and is dynamic over time. The boundaries of any given workflow within that wider event flow can be soft and hard to define (but with low coupling).
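The choreography style above can be sketched with a toy in-memory broker standing in for Kafka or RabbitMQ (the broker and topic names are assumptions for illustration). Notice that the producer of `order_placed` knows nothing about its consumers; each service just subscribes to the events it cares about and emits its own.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub-sub broker, standing in for Kafka/RabbitMQ."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # every subscriber reacts independently; the producer knows none of them
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
log = []

# direct edge: payments reacts to the order, then emits its own event
broker.subscribe("order_placed", lambda e: (log.append(("payments", e["order_id"])),
                                            broker.publish("payment_taken", e)))
# direct edge: shipping reacts to the payment event
broker.subscribe("payment_taken", lambda e: log.append(("shipping", e["order_id"])))
# indirect edge: CRM reacts to the same order event, tangentially
broker.subscribe("order_placed", lambda e: log.append(("crm", e["order_id"])))

broker.publish("order_placed", {"order_id": 42})
```

The workflow boundary is soft: nothing in the code names "the order workflow", it emerges from which consumers happen to subscribe to which topics.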
Orchestration: Procedural workflow (if-this-then-that). Logic is centralized in some kind of orchestrator (even a microservice or function), issuing commands and awaiting responses from subordinate worker microservices. The orchestrator keeps track of which parts of the workflow have been completed, which are in progress, and which have yet to be started. It keeps track of the commands sent to the subordinate services as well as the responses from those services. Coordination happens via procedural orchestration semantics. The entire impact of an upstream workflow can spread far, as individual nodes can still emit events. The boundaries of a given workflow are clearly encoded in the orchestration code, albeit at the cost of increased coupling.

We'll cover choreography and orchestration in more detail in part 2.

Stream processing frameworks like Apache Flink and Kafka Streams can be thought of as microservices with a configurable blend of the continuous dataflow programming model and reactive event handling, designed to transform and react to streams of events in real time. Like microservices, stream processors form logical graphs of computation, using branching, joining, and aggregation to process data. However, their programming model is more constrained, being optimized for data-centric transformations of event streams rather than complex control flow or handling individual requests on demand.

In the context of workflows and sagas, stream processors fit naturally into event-driven choreography as nodes in the event graph, not only performing transformations or enrichments, but also taking on roles that overlap with those traditionally handled by microservices, including stateful business logic and triggering downstream effects. Just as modern microservices are decomposed into bounded contexts following domain-driven design principles, so too should stream processors be scoped narrowly.
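The orchestration model can be sketched the same way (class, step, and service names are illustrative assumptions). In contrast to the choreographed version, the workflow boundary is explicit: the orchestrator names every step, issues the commands, and records exactly how far it got, which is also where the coupling comes from.

```python
class OrderOrchestrator:
    """Illustrative centralized orchestrator for an Order Placed workflow."""
    STEPS = ["take_payment", "reserve_stock", "prepare_shipment"]

    def __init__(self, services):
        self.services = services   # step name -> command callable
        self.completed = []        # the orchestrator owns the workflow state

    def run(self, order_id):
        for step in self.STEPS:
            try:
                self.services[step](order_id)   # command a subordinate service, await response
            except Exception:
                return f"failed at {step}"      # we know precisely where we stopped
            self.completed.append(step)
        return "completed"
```

A failed step leaves `completed` holding exactly the finished prefix, which is the information a saga would use to run compensations.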
Embedding an entire business workflow (e.g., shopping cart checkout, payment, shipping, fulfillment) into a single Flink or Kafka Streams job is generally discouraged. Instead, stream processors work best as individual nodes in a choreographed system, each independently reacting to events. Beyond participating as choreographed actors, stream processors play two valuable roles in saga and workflow architectures:

- Real-time triggers for workflows: Detecting event patterns (e.g., "user added to cart but didn't check out in 1 hour") and emitting signals to start or branch workflows.
- Aggregated state for decisions: Continuously computing derived state (e.g., fraud scores, user behavior patterns) that orchestrators or services can query to guide workflow logic.

In summary, stream processing can replace traditional microservices in choreographed workflows and enhance orchestrated workflows with real-time insights, triggers, and data transformations. However, one stream processing job would rarely include an entire workflow, just as a microservice would not.

In distributed systems, durability is not just about data but about progress. A workflow that performs critical operations must either complete or fail in a controlled, recoverable way. Durable coordination ensures that steps don't vanish into the void after a crash or network fault. No matter the execution model (procedural, event-driven, or dataflow), durability is the mechanism that transforms ephemeral logic into reliable systems.

Choreography in the form of event-driven architectures (EDA) offers durability by default. Events are stored durably in a queue or log (e.g., Kafka), enabling reactive systems to recover from crashes, replay history, and trigger retries. Each service reacts independently, and progress is tracked implicitly in the event stream. In this model, the event log acts as both coordination medium and source of truth, encoding the causal structure of the system's behavior.
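The "real-time trigger" role above can be illustrated with a toy version of the abandoned-cart pattern. A real system would use Flink CEP or Kafka Streams with event-time windows; this sketch (function and signal names are assumptions) just shows the shape of the logic: hold per-user state, and emit a workflow-trigger signal when the timeout pattern completes.

```python
def abandoned_carts(events, timeout=3600):
    """events: (timestamp, user, action) tuples, sorted by timestamp.
    Emits one trigger signal per user whose add_to_cart was never
    followed by a checkout within `timeout` seconds."""
    pending = {}   # user -> time their cart add was first seen
    signals = []
    for ts, user, action in events:
        # each incoming event advances time: expire any timed-out carts
        for u, added in list(pending.items()):
            if ts - added > timeout:
                signals.append(("start_reminder_workflow", u))
                del pending[u]
        if action == "add_to_cart":
            pending.setdefault(user, ts)
        elif action == "checkout":
            pending.pop(user, None)   # pattern completed normally, no trigger
    return signals
```

The emitted signals would be published as events that start or branch a downstream workflow, exactly the trigger role described above.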
Imperative code, by contrast, lacks built-in durability. A service running procedural logic (e.g., "do A, then B, then C") typically stores its state in memory and relies on external systems for selective persistence of state. When a crash occurs mid-execution, everything in the call stack is lost unless explicitly saved. This gap gave rise to the Durable Execution product category, which brings event-log-like durability to imperative workflows. Durable Execution Engines (such as Temporal, Restate, DBOS) persist the workflow's progress, key variables, intermediate results, responses from other services, and so on, allowing it to be retried and to resume exactly where it left off. Durable Execution Engines are, in effect, the Kafka of imperative coordination (aka coordination via orchestration).

Durability is also foundational in stream processing. Frameworks like Apache Flink and Kafka Streams include native durability through state persistence mechanisms (such as checkpointing, changelogs, and recovery logs), ensuring that event transformations and stateful aggregations survive failures. While the paradigm is data-centric and continuous, the core concept is the same: progress is recorded durably so that computation can continue reliably.

Ultimately, everything is also a log. Whether it's a sequence of domain events, a durable workflow history, or a changelog backing a stream processor, the underlying idea is the same: encode system activity as a durable, append-only record of what has happened (and possibly what should happen next).

Making durability a first-class concern allows systems to be:

- Recoverable after crashes.
- Observable through replayable history.
- Reliable across asynchronous boundaries.
- Composable across execution models.

This perspective is a start, but we need to break it down into more precise terms by creating a simple model for thinking about reliable execution and coordinated progress. Let's do this in part 2.
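The core mechanism of durable execution, journal each step's result before moving on, then replay the journal on restart, can be shown in a few lines. Real engines like Temporal, Restate, and DBOS do this transparently via their SDKs; the `DurableRun` class, its file-based journal, and the `step` API here are assumptions invented for this sketch.

```python
import json
import os

class DurableRun:
    """Each step's result is journaled before moving on, so a re-run
    replays saved results and resumes exactly where it left off."""
    def __init__(self, journal_path):
        self.path = journal_path
        self.journal = {}
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.journal = json.load(f)   # recover prior progress

    def step(self, name, fn):
        if name in self.journal:              # already done: replay recorded result
            return self.journal[name]
        result = fn()                         # execute the side effect once
        self.journal[name] = result
        with open(self.path, "w") as f:       # persist progress durably
            json.dump(self.journal, f)
        return result
```

If the process crashes between steps, a fresh `DurableRun` over the same journal replays the completed steps' results without re-executing their side effects, which is why the workflow can safely be retried from the top.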
Coordinated Progress series links:

- Seeing the system: The Graph
- Making Progress Reliable
- Coupling, Synchrony and Complexity
- A Loose Decision Framework
