While building Qube I realised that our system did not give any guarantees of execution, that is: will it complete? Will it recover from failures? Will it recover from errors? Will it recover from system crashes?
The answer to all of these questions was "NO". And when the logic did not execute to completion, our database and other state naturally became inconsistent, requiring manual fixes.

At this point we need a minimum set of guarantees for our program's execution so that failures do not leave our system in an inconsistent state.

Problem
Our application logic is not guaranteed to run to completion: a process crash, a transient error, or a system failure can interrupt it midway, leaving the database and other state inconsistent and requiring manual fixes. We need a durable execution model that guarantees a workflow either completes or recovers to a consistent state.
Concepts
Here is the list of concepts that I'm currently exploring while drafting the requirements for a durable execution system.

Fault Tolerance
While building distributed systems we should build with the expectation that things can go wrong: hardware breaks, software has bugs, and networks hiccup.

Fault tolerance is the ability of a system to keep operating during failures. It keeps disruptions minimal and ensures that services stay up for users, even when the unexpected occurs.
Temporal helps by preserving workflow state and retrying failed tasks so that even complex, long-running processes can recover without extra manual code. It simplifies resilience.
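
As a rough sketch of what this might look like with Temporal's Python SDK (the billing workflow and activity below are invented for illustration):

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def charge_customer(customer_id: str) -> str:
    # The side-effecting call to an external service lives here;
    # it may fail transiently and be retried by Temporal.
    return f"charged-{customer_id}"


@workflow.defn
class BillingWorkflow:
    @workflow.run
    async def run(self, customer_id: str) -> str:
        # Temporal persists the workflow's progress; if the worker
        # crashes here, the workflow resumes and the activity is
        # retried according to the policy below.
        return await workflow.execute_activity(
            charge_customer,
            customer_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
```
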
Workflow Orchestration
TBD
Transactional State Management
TBD
Write Ahead Log
Every intended change is first appended to a durable, append-only log before it is applied, so that after a crash the system can replay the log to recover or complete in-flight operations.

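A minimal sketch of the write-ahead idea in plain Python (the class and file path are mine, not taken from any particular system):

```python
import json
import os


class WriteAheadLog:
    """Minimal append-only log: record the intent before applying it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, entry: dict) -> None:
        # Durably record the operation before any state is mutated.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # After a crash, re-read the log to redo or complete operations.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)


wal = WriteAheadLog("/tmp/qube.wal")
wal.append({"op": "charge", "order": "42"})
print(list(wal.replay()))
```
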
Code-as-Workflow
Temporal terminology for expressing workflows directly as ordinary program code, rather than as a separate DSL or configuration.

Orchestration
The process of coordinating and managing multiple computer systems, applications and services. It involves automating tasks to perform them in a specific order to achieve the desired outcome.

OLTP Database
It stands for Online Transaction Processing. It's a type of database system that processes transactions in real-time. It's used in many operational systems including online banking, shopping and order entry.
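
For a concrete feel, here is a minimal OLTP-style transaction using Python's built-in sqlite3 module; the transfer below either fully commits or fully rolls back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:  # transaction: commits on success, rolls back on error
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])

with conn:  # both updates commit together, or neither does
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")

print(conn.execute("SELECT * FROM accounts").fetchall())
```
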
Ideas
Here are angles that I should explore to build/find a project that caters to my needs:
- Log based architecture for durable execution
- Use of a fault-tolerant task queue to run the jobs
- Workflow centric paradigm - Blocks of code that execute based on events or timers. We do have a few flavours here though:
    - Temporal approach - Record the entire call stack and play it back. This offers stronger guarantees, but makes the application harder to integrate and reason about.
    - Orchestrator approach - The workflow engine serves as an orchestrator of async functions. Events are directed to the workflow engine, which then coordinates the execution of tasks.
- Database centric paradigm - It's not really clear what the authors meant by this (see the A16Z article in the Links section below), but it involves the use of an Application Logic Transactional Platform (ALTP, a term inspired by OLTP).
- Erlang's fault tolerance concepts
- Distributed transactions
- Saga orchestration pattern (a minimal sketch of the idea follows this list)
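
That saga sketch: run the steps in order, and on failure run the compensations of the completed steps in reverse. Every step and compensation below is a stand-in print:

```python
# Run the steps in order; on failure, run the compensations for the
# steps that already succeeded, in reverse order.
def run_saga(steps):
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()  # compensating transaction
        raise


run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge card"), lambda: print("refund card")),
    (lambda: print("create shipment"), lambda: print("cancel shipment")),
])
```
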
Resources
All external resources that I could find on this subject
Companies/Tools
Here is the list of companies working in the durable execution space:
- Temporal
- Restate
- Orkes
- StealthRocket
- LittleHorse
- Flawless
- Rama
- inngest
- defer
- trigger
- AWS Step Functions
- Apache Airflow
- dbos
- Azure Durable Functions
- Upstash - Serverless data platform
- Convex - Reactive database for web app developers
- Grafbase - Grafbase provides secure, self-hosted deployment options, unmatched query speed, and unified data access for reliable, enterprise-grade API management.
- Xata - Fully managed postgres database platform
- Supabase Workflows
- Prefect
- River - Fast and reliable background jobs in Go.
- hatchet - A Distributed, Fault-Tolerant Task Queue. Also supports durable execution and task scheduling.
Challenges
List of some interesting problems in durable execution according to Chris Gillum (creator of Azure Durable Functions):
Determinism
Workflows have to be written in such a manner that, if a workflow is replayed with the same set of inputs, it produces the same output (it follows the same execution path).
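
A toy illustration (the `Context` class is a hypothetical stand-in for a replay-aware framework API):

```python
import random
import time


# Non-deterministic workflow body: on replay, the branch taken and the
# deadline can differ from the first execution, breaking replay.
def bad_workflow():
    if random.random() < 0.5:        # different value on every replay
        print("took branch A")
    return time.time() + 3600        # wall clock differs on replay


# Deterministic version: all non-determinism is pushed into activities
# whose results are recorded once and replayed verbatim.
class Context:
    """Hypothetical stand-in for a replay-aware framework API."""

    def __init__(self):
        self.history = []            # recorded activity results

    def call_activity(self, fn):
        result = fn()                # a real engine replays from history
        self.history.append(result)
        return result


def good_workflow(ctx: Context):
    roll = ctx.call_activity(random.random)   # recorded, replay-safe
    deadline = ctx.call_activity(time.time) + 3600
    return roll, deadline


print(good_workflow(Context()))
```
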
Idempotency
The non-deterministic parts of a workflow (the activities) must be idempotent, because they might be called multiple times (we wouldn't want to charge a customer twice for one transaction). Not every case allows for idempotency: some systems support idempotent operations, others don't. For example, I can check whether a record already exists in the database before writing it again, but I may not be able to know whether an email has already been sent.
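
A small sketch of that record-check approach, with an in-memory set standing in for a database uniqueness constraint (all names are invented):

```python
# `processed` stands in for a database uniqueness constraint keyed on
# an idempotency key, so retries of the activity become no-ops.
processed: set[str] = set()


def charge_customer(idempotency_key: str, amount_cents: int) -> None:
    if idempotency_key in processed:
        return                       # already charged on a prior attempt
    processed.add(idempotency_key)
    print(f"charging {amount_cents} cents")  # real payment call here


charge_customer("order-42", 1999)
charge_customer("order-42", 1999)    # retried call: no second charge
```
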
Versioning
When working with replay-based durable execution frameworks, deploying a change to the code that isn't compatible with an existing history log can cause your app to unexpectedly fail with non-determinism errors.
This problem of code changes not matching existing state has always existed for apps that depend on state in the form of queue messages or rows in a database.
The problem with these workflow-as-code frameworks is that it's hard for developers to know whether a code change is safe to make when writing this kind of implicitly stateful code (workflows or orchestrations), compared with explicitly stateful code that reads queue messages or queries a database.
People have solved this problem in two ways for now:
- Use if and else to check for different versions of the code and handle them accordingly (sketched below). This does not scale well when the number of changes is large.
- Deploy the code change as a separate copy of the app. But this burdens the developer with the task of managing multiple versions of the code.
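
A runnable sketch of the first approach, with the version persisted per workflow instance (the `Instance` record and the steps are invented):

```python
from dataclasses import dataclass


# Each workflow instance records the code version it started under, and
# the handler branches on it, so old in-flight instances keep following
# the execution path their history was recorded against.
@dataclass
class Instance:
    order_id: str
    version: int                     # persisted when the instance began


CURRENT_VERSION = 2


def run(instance: Instance) -> None:
    if instance.version >= 2:
        print(f"{instance.order_id}: compute tax (step added in v2)")
    print(f"{instance.order_id}: charge customer")


run(Instance("old-order", version=1))                # pre-change instance
run(Instance("new-order", version=CURRENT_VERSION))  # new instance
```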
Payload Sizes
This is not an obvious problem: the frameworks hide the fact that activity parameters and return values are being serialized and deserialized. People realise they have hit the limits when they try to send large payloads that violate some size constraint. Azure Durable Functions serializes the payload and uploads it to blob storage, then downloads, uncompresses, and deserializes it when the activity is called. This adds a huge overhead, which is why it's recommended to keep payload sizes small.
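
One common mitigation, sketched with a dict standing in for blob storage: pass a small reference through the workflow history and fetch the heavy payload inside the activity.

```python
import uuid

# `blob_store` is a dict standing in for external blob storage.
blob_store: dict[str, object] = {}


def start_report(rows: list[dict]) -> str:
    key = str(uuid.uuid4())
    blob_store[key] = rows           # upload once, outside the history
    return key                       # only this small key is serialized


def render_report(key: str) -> int:
    rows = blob_store[key]           # activity fetches the real payload
    return len(rows)


key = start_report([{"n": i} for i in range(10_000)])
print(render_report(key))
```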
History Log Sizes
Developers need to keep workflow or orchestration history log sizes in mind when designing a workflow. Writing an infinite loop or fanning out thousands of activities can cause the history log to grow very large, and working around this can require significant code refactoring.
Dead-lettering or Poison message handling
Sometimes workflows or orchestrations will get stuck in a retry loop because of some unrecoverable failure. What makes this especially problematic is that the infinite retries continuously consume resources, preventing other orchestrations or workflows from making progress. At the time of writing, neither Durable Functions nor Temporal has a built-in dead-lettering or poison-message handling feature, so these types of problems require manual mitigation (a sketch of one follows).
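
That manual mitigation, sketched: cap the attempts, then park the job in a dead-letter store for human inspection instead of retrying forever. The queue and the always-failing handler are stand-ins:

```python
from collections import deque

MAX_ATTEMPTS = 5
queue: deque = deque([{"id": "job-1", "attempts": 0}])
dead_letters: list[dict] = []


def handle(job: dict) -> None:
    raise RuntimeError("unrecoverable failure")  # always fails


while queue:
    job = queue.popleft()
    try:
        handle(job)
    except Exception:
        job["attempts"] += 1
        if job["attempts"] >= MAX_ATTEMPTS:
            dead_letters.append(job)  # park it; stop consuming resources
        else:
            queue.append(job)         # retry later


print(dead_letters)
```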
Prompt
I need to use a microservice workflow orchestration tool to implement a system that handles communication with multiple other micro-services. All this is supposed to be written in Python. Here is what I am looking for:
- Should offer durable execution
- Fault tolerant - If the process crashes or faces a transient error, it should be able to handle it.
- Rollbacks - Ability to specify a rollback for individual steps in case the workflow couldn't complete.
- Long running - Should be able to handle long-running workflows without holding system resources while they wait.
- Low development overhead - Should not involve writing too much boilerplate code to get it working.
- Concurrency - Should be able to work across multiple instances of the application to perform parallel execution of workflows and operations.
- Scheduled Jobs - Execute workflows at a certain time or at regular intervals.
- Visibility - We should be able to view the current state of a workflow and take actions.
Please choose a framework/library that can help me achieve this and also implement a demo program.
My Goals
This is what I want:
- Fault-tolerant (Resumability) - In the face of a system failure, process crash, or any other error, the program should resume execution from where it left off.
- Long-running - The workflows should be able to run for a long time without holding system resources while they wait.
- Lowest number of moving pieces - The system should use the least number of libraries, frameworks, or servers needed to achieve the goal.
- Rollbacks - If we couldn't complete a set of operations even after performing multiple retries, we should be able to roll back changes to ensure we are in a consistent state.
- One Person Framework - A single person should be able to understand the entire system and reason about the execution model to write performant and safe code.
- Low Development overhead - Should not involve writing too much boilerplate code to get it working.
- Atomicity - The system should either perform all the operations listed in the workflow or none of them. This would always ensure consistency of the data.
- Concurrency - Should be able to work across multiple instances of the application to perform parallel execution of workflows and operations.
- Scalable - The same architecture should be able to support at least 50,000 concurrent smart meters.
- SAGA Pattern - Ensure safe and consistent state across distributed services so that the failure of one operation within a sequence results in prior operations being reversed using compensating transactions.
- Scheduled Jobs - Execute workflows at a certain time or at regular intervals.
- Application level resilience - Automated retries, configurable timeouts, and stateful recovery to make your workflows recover smoothly.
- Visibility - We should be able to view the current state of a workflow and take actions.
- Ordered Execution - Isolate sequences of jobs so that they run in strict sequential order.
This is what I need to follow in order to get what I want:
- Idempotent tasks - The side effects must not change when the function is run multiple times with the same parameters.
- Deterministic workflows - Wherever we perform I/O, use a random number generator, or fetch the current time using language-native APIs, we get a different result on each call, so these operations must live in tasks rather than in the workflow itself.
- Automatic retries and timeouts - Both at the task and the workflow level (a standard-library sketch follows this list).
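
That sketch of task-level automatic retries with backoff; a real durable-execution framework would also persist attempts across process crashes, which this in-memory version does not:

```python
import time
from functools import wraps


def retry(attempts: int = 3, delay_s: float = 0.5):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise                      # exhausted: surface it
                    time.sleep(delay_s * attempt)  # linear backoff
        return wrapper
    return decorator


@retry(attempts=5)
def send_reading(meter_id: str) -> None:
    ...  # call the flaky downstream service here


send_reading("meter-123")
```
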
Links