
Mitigating Workflow Disruptions: The Role of Durable Execution Frameworks

7 min
5/31/2024

Here at Ansa, we recognize the importance of maintaining uptime across critical applications and workflows. We’ve written about the future of observability platforms and recently led an investment in a network observability company (to be announced). Despite proactive efforts to mitigate downtime, workflow failures are inevitable. Common causes include network or hardware failures, service or API unavailability, critical updates, and resource overconsumption. Without a mechanism to persist state up to the point of failure, a failed workflow typically has to be restarted from the beginning, resulting in lost productivity, additional resource consumption, and poor user experiences. To address this challenge and ensure workflows run to completion despite failures, companies are increasingly implementing durable execution frameworks.

Scenario: Workflow Failure in an E-Commerce Checkout

Consider a simple e-commerce workflow consisting of four microservices:

  1. Order Intake
  2. Payment Authorization
  3. Inventory Update
  4. Order Confirmation

Let’s say a user successfully places an order and authorizes payment, but a network error disrupts communication with the Inventory Update service, so it never receives the message that triggers it. Since the Order Confirmation service relies on inventory being updated, that service is also unable to execute. Without a durable execution framework, the system loses track of the workflow state, and the workflow has to start over from the beginning. The user might receive an error message asking them to try again later, leaving them uncertain whether their payment was processed. The burden then falls on the user to contact support to finalize the order, and if they don’t bother retrying, the transaction is lost entirely.

With a durable execution framework in place, the system is able to:

  • Persist workflow state: This ensures the system remembers the progress made leading up to the point of failure
  • Automatically retry failed tasks: Depending on the retry strategy implemented, the system could automatically retry the Inventory Update service once a specific network condition is met
  • Guarantee workflow completion: Once the Inventory Update step succeeds on retry, it no longer blocks progress, and the entire workflow is able to execute to completion
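
The three capabilities above can be illustrated with a minimal Python sketch. It is not how any particular durable execution framework is implemented. It assumes the four services are plain callables, that progress is checkpointed to a local JSON file, and that a network failure surfaces as a ConnectionError. The step names, the file-based checkpoint, and the run_workflow helper are all hypothetical.

```python
import json
import os
import time

# Hypothetical step names mirroring the four services in the scenario.
STEPS = [
    "order_intake",
    "payment_authorization",
    "inventory_update",
    "order_confirmation",
]


def load_completed(path):
    """Return the list of step names already recorded as completed."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


def save_completed(path, completed):
    """Persist progress: write a temp file, then rename it over the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(completed, f)
    os.replace(tmp, path)


def run_workflow(handlers, path, max_attempts=3, base_delay=0.01):
    """Run steps in order, skipping checkpointed steps and retrying failures."""
    completed = load_completed(path)
    for name in STEPS:
        if name in completed:
            continue  # finished in an earlier run; do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                handlers[name]()
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up for now; the checkpoint survives for a later run
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        completed.append(name)
        save_completed(path, completed)
    return completed
```

If run_workflow raises partway through, calling it again with the same checkpoint path resumes at the failed step instead of re-running Order Intake and Payment Authorization. A step that succeeds but crashes before its checkpoint is written would run again on resume, so in a sketch like this each handler needs to be safe to repeat.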

From the user's perspective, this translates to a smoother experience:

  • Although there was an error on the company’s side, the user might only receive a notification saying that their order will be confirmed shortly
  • This eliminates ambiguity and fosters the impression that the system is functioning as intended

The above example represents a basic function chaining pattern, in which a series of functions is executed sequentially. This is only one of the typical application patterns that benefit from durable execution; Microsoft highlights several others.

Trends Driving the Need for Durable Execution

The need for durable execution has grown considerably over the years as a result of many technological and operational tailwinds.

  • Microservice Proliferation: Software development continues to shift towards microservice architecture. Since microservices typically communicate with each other over a network, every service added to a workflow introduces more points of failure that can cause the workflow to break.
  • API Usage: The surge in API usage is partly driven by the growth in microservices. According to research from Postman, 92% of respondents say their investments in APIs will increase or stay the same over the next 12 months. These APIs are not only consumed internally but also relied on by external developers, who have no control over their availability, so the need for resiliency through durable execution is even more critical.
  • IoT and Edge Computing: IoT and edge devices similarly rely on network connectivity to operate effectively but are more prone to intermittent connectivity, in addition to hardware failures and other challenges. Durable execution addresses these issues by allowing devices to retain workflow state during periods of disconnection and to sync data properly once connectivity is restored.
  • Compliance: Compliance with a variety of regulations also benefits from durable execution. Regulations like GDPR, for example, require data integrity and accuracy, which durable execution can help support even in the event of failures by ensuring data persistence. Similarly, data sovereignty laws necessitate a more distributed data footprint for organizations, which further supports the need for durable execution.
  • Ever-Increasing User Expectations: As technology continues to improve user experiences, the bar for what consumers expect when interacting with a website or application continues to rise. If users start to feel friction in their experiences, they will quickly abandon their session, opt for competitors with more seamless experiences, or cease future usage altogether.
  • AI: Similar to the dynamics discussed in API Usage, the proliferation of AI is further accelerating the need for durable execution. Many organizations are building applications that call the APIs of an LLM provider like OpenAI, which introduces a dependency on external availability beyond the organization’s control. Additionally, companies continue to leverage AI agents for workflow efficiencies, and these agents often rely on third-party APIs to carry out actions on behalf of users. Durable execution can mitigate the disruption caused by failures or outages in external dependencies, allowing agents to keep functioning.
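
The IoT and Edge Computing point above, retaining state while disconnected and syncing once connectivity returns, can be sketched as a simple store-and-forward buffer. This is an illustrative sketch only. The Outbox class, its file format (one JSON object per line), and the use of ConnectionError to signal an offline device are all assumptions, not part of any specific framework.

```python
import json
import os


class Outbox:
    """Local buffer for readings that have not been delivered yet."""

    def __init__(self, path):
        self.path = path

    def record(self, reading):
        # Append one JSON line per reading so earlier entries are not rewritten.
        with open(self.path, "a") as f:
            f.write(json.dumps(reading) + "\n")

    def pending(self):
        """Return all readings still waiting to be delivered, oldest first."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self, send):
        """Deliver pending readings in order; keep whatever could not be sent."""
        items = self.pending()
        sent = 0
        for item in items:
            try:
                send(item)
            except ConnectionError:
                break  # still offline; stop and keep the rest for later
            sent += 1
        # Rewrite the buffer with only the undelivered readings.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for item in items[sent:]:
                f.write(json.dumps(item) + "\n")
        os.replace(tmp, self.path)
        return sent
```

A device would call record for every reading and call flush periodically. While offline, flush delivers nothing and the readings stay on disk. If the device crashes after sending but before the buffer is rewritten, some readings are sent twice, so the receiving side should tolerate duplicates.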

As the criticality of operational resiliency continues to rise, we are extremely excited by the value durable execution provides. If you’re building in the space, please reach out to josh@ansa.co. We’d love to hear from you!
