A deep dive comparing Airflow, Prefect, Temporal, Inngest, and Windmill — how they work internally, their trade-offs, and real benchmarks. Plus honorable mentions for Restate, DBOS, and Hatchet.
Why Workflow Engines Exist
Every backend eventually grows a function like this:
async function processOrder(order: Order) {
  const validated = await validateInventory(order);
  const payment = await chargePayment(validated);
  const shipment = await createShipment(payment);
  await sendConfirmationEmail(shipment);
}
This works until it doesn't. What happens when the server crashes after chargePayment but before createShipment? The customer was charged, but nothing shipped. Do you retry? You'd charge them twice. Do you skip? They paid but get nothing.
The fundamental problem: a sequence of side-effects spread across time and network boundaries cannot be made atomic. You can wrap two database writes in a transaction, but you can't wrap "call Stripe" + "call FedEx" + "call SendGrid" in one.
Every workflow engine is a different answer to the same question: how do you coordinate multiple fallible side-effects so that the overall process makes progress, even when individual steps fail?
The answers cluster into three generations, each with a different core abstraction.
The Three Generations
Generation 1: DAG Schedulers (Airflow, Prefect)
  "Define a graph of tasks, a scheduler runs them in order"

Generation 2: Durable Execution (Temporal, Inngest, Windmill WAC)
  "Write normal code, the runtime makes it survive crashes"

Hybrid: Visual Flow Builder (Windmill Flows)
  "Drag-and-drop steps, JSON-defined DAG with code steps"
The shift from Gen 1 to Gen 2 mirrors a broader shift in computer science: from declarative (describe the computation) to imperative (write the computation, let the infrastructure handle durability). Neither is universally better — they solve different problems.
Generation 1: DAG Schedulers
The Abstraction
A DAG scheduler separates what to do (your task code) from when and where to do it (the scheduler's job). You declare tasks and their dependencies as a directed acyclic graph. The scheduler inspects the graph, determines which tasks are ready, and dispatches them.
The key property: tasks are independent units of work. They don't share memory. They don't know about each other. They communicate through external storage. The scheduler is the only component that understands the full picture.
┌──────────────────────────────────────────────────────┐
│ DAG Scheduler Model │
│ │
│ You define: Scheduler does: │
│ │
│ [Task A] ──┐ 1. Parse graph │
│ ├──→ 2. Poll: which tasks are ready? │
│ [Task B] ──┘ 3. Dispatch ready tasks │
│ │ 4. Wait for completion │
│ ▼ 5. Repeat from 2 │
│ [Task C] │
│ │
│ Data passes via external storage (DB, S3, XCom) │
│ Tasks are independent processes │
└──────────────────────────────────────────────────────┘
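The scheduler loop in the diagram can be sketched in a few lines of Python. This is an illustration of the model, not any engine's real code: tasks, a dependency graph, and a loop that repeatedly dispatches whichever tasks have all of their dependencies satisfied.

```python
from typing import Callable

def run_dag(tasks: dict[str, Callable[[], None]],
            deps: dict[str, set[str]]) -> list[str]:
    """Run tasks in dependency order; returns the execution order."""
    done: set[str] = set()
    order: list[str] = []
    while len(done) < len(tasks):
        # Poll: which tasks are ready? (all dependencies completed)
        ready = [t for t in tasks
                 if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for t in ready:          # Dispatch ready tasks
            tasks[t]()           # Wait for completion
            done.add(t)          # Record state
            order.append(t)
    return order

# [Task A] ──┐
#            ├──→ [Task C]
# [Task B] ──┘
order = run_dag(
    {"A": lambda: None, "B": lambda: None, "C": lambda: None},
    {"C": {"A", "B"}},
)
# C always runs after A and B
```

Real schedulers add persistence, concurrency limits, and retries on top of this loop, but the core "poll for ready, dispatch, repeat" cycle is the same.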
Airflow: The Incumbent
Airflow (Airbnb, 2014) is the canonical DAG scheduler. You write Python files that define DAGs:
from airflow.decorators import dag, task
from datetime import datetime

@task
def extract():
    return {"data": [1, 2, 3]}

@task
def transform(raw):
    return [x * 2 for x in raw["data"]]

@task
def load(transformed):
    db.insert(transformed)

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1))
def etl_pipeline():
    raw = extract()
    transformed = transform(raw)
    load(transformed)
The fundamental misunderstanding about Airflow: this looks like Python calling functions, but it isn't. At parse time, no task function bodies execute. Airflow builds the dependency graph from how each task's output is passed into other task calls. The actual execution happens later — possibly minutes later, on a different machine.
How the Scheduler Works
The Airflow scheduler is a polling loop over a relational database:
Every ~5 seconds:
1. Parse all DAG Python files (discover tasks, dependencies)
2. Query DB: which DagRuns need new TaskInstances?
3. Query DB: which TaskInstances are ready to run?
4. Enter critical section (SELECT ... FOR UPDATE)
5. Check pool limits, concurrency limits
6. Enqueue ready tasks to the executor
Each task passes through a state machine stored in the database:
none → scheduled → queued → running → success
└──→ failed → up_for_retry → scheduled → ...
Every state transition is a database write. The scheduler owns scheduled → queued. The executor owns queued → running. The worker owns running → success/failed.
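The state machine and its ownership split can be made concrete with a small sketch (states and transitions paraphrased from the text above, not Airflow's actual implementation):

```python
# Which component owns each transition is noted in the comments.
ALLOWED = {
    "none": {"scheduled"},
    "scheduled": {"queued"},           # scheduler
    "queued": {"running"},             # executor
    "running": {"success", "failed"},  # worker
    "failed": {"up_for_retry"},
    "up_for_retry": {"scheduled"},     # back to the scheduler
}

class TaskInstance:
    def __init__(self) -> None:
        self.state = "none"

    def transition(self, new_state: str) -> None:
        # In the real system, every transition is a database write.
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

ti = TaskInstance()
for s in ["scheduled", "queued", "running", "failed", "up_for_retry"]:
    ti.transition(s)
# a failed task loops back: up_for_retry -> scheduled -> queued -> ...
```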
Data Passing Between Steps
Since tasks run in separate processes (possibly different machines), data must be serialized to shared storage. Airflow calls this "XCom" (cross-communication). All engines where steps are separate jobs share this pattern — Temporal stores results in event history, Windmill in Postgres JSONB — but Airflow's XCom has historically had the worst developer experience: tight size limits (48KB default), and in older versions, explicit xcom_push/xcom_pull calls. Newer Airflow versions with the @task decorator make this more transparent, but the size limits remain.
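The XCom pattern reduces to "serialize the return value into shared storage, deserialize it downstream." A minimal sketch, with an in-memory dict standing in for the metadata database and a guard mirroring the default size limit mentioned above (the function names are illustrative, though Airflow's real API uses the same `xcom_push`/`xcom_pull` vocabulary):

```python
import json

XCOM_SIZE_LIMIT = 48 * 1024          # 48KB default limit mentioned above
_store: dict[tuple[str, str], str] = {}  # stand-in for the metadata DB

def xcom_push(task_id: str, key: str, value) -> None:
    blob = json.dumps(value)             # value must be serializable
    if len(blob.encode()) > XCOM_SIZE_LIMIT:
        raise ValueError("too large for XCom; store in S3/DB and pass a reference")
    _store[(task_id, key)] = blob

def xcom_pull(task_id: str, key: str):
    return json.loads(_store[(task_id, key)])

# Upstream task returns a value; downstream task pulls it by task id.
xcom_push("extract", "return_value", {"data": [1, 2, 3]})
raw = xcom_pull("extract", "return_value")
```

The practical consequence: anything larger than the limit (a DataFrame, a file) goes to object storage, and only a reference passes through XCom.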
The Executor Layer
Airflow's executor is pluggable — one of its best design decisions:
- LocalExecutor: forks a subprocess per task. Simple, single-machine.
- CeleryExecutor: sends tasks to a message broker (Redis/RabbitMQ). Celery workers pick them up. Most common production setup.
- KubernetesExecutor: spins up a fresh Kubernetes pod per task. Maximum isolation, ~10-30s cold start per task.
Each executor makes a different trade-off between isolation, latency, and operational complexity. But all share the fundamental constraint: each task is an independent execution unit.
Pros
- Massive ecosystem: hundreds of "operators" (pre-built integrations) for AWS, GCP, databases, Spark, dbt, etc.
- Scheduling: sophisticated time-based scheduling with backfill, catchup, data intervals.
- Monitoring: built-in UI showing DAG runs, task statuses, logs, Gantt charts.
- Battle-tested: runs at Airbnb, Google, PayPal, thousands of companies. You will find answers on StackOverflow.
Cons
Latency and cold start. In our benchmarks, Airflow took 56 seconds to run 40 lightweight tasks (~0.7 tasks/sec). Windmill completed the same workload in 2.4 seconds (~16.5 tasks/sec) — a 23x difference. The overhead comes from architectural differences:
- Three-hop dispatch: in Airflow, a task goes scheduler (polls DB, resolves dependencies, checks pool limits) → DB state update → executor → message broker (Redis/RabbitMQ for Celery) → worker. Three separate components, each with their own polling interval and latency. In Windmill, the worker polls Postgres directly with SELECT ... FOR UPDATE SKIP LOCKED — one component, one hop.
- Scheduler overhead: Airflow's scheduler is a Python process that re-parses DAG files, evaluates dependencies, and checks concurrency limits — all in Python — before a task can even be enqueued. This adds 1-5 seconds per scheduling cycle. Windmill has no separate scheduler; workers self-schedule by pulling from the queue.
- Cold start per task: each Airflow task forks a subprocess that loads the entire DAG file + Airflow framework imports. Even for a trivial task, this can take 1-2 seconds. Windmill's cold start is lighter (~26ms for Python, ~12ms for Bun), and with dedicated workers it's 0ms — the process stays alive across jobs.
With the KubernetesExecutor, cold start grows to 10-30 seconds per task (pod creation). This makes Airflow unsuitable for anything latency-sensitive.
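The single-hop dispatch pattern is worth seeing in miniature. In this sketch an in-memory deque plus a lock stands in for the Postgres queue, and the lock plays the role that SELECT ... FOR UPDATE SKIP LOCKED plays in the real system: each worker claims a job atomically and executes it, with no scheduler or broker in between.

```python
import threading
from collections import deque

queue: deque = deque(f"job-{i}" for i in range(10))
lock = threading.Lock()
completed: list[str] = []

def worker() -> None:
    while True:
        with lock:                  # analogue of FOR UPDATE SKIP LOCKED
            if not queue:
                return              # queue drained, worker exits
            job = queue.popleft()   # claim the job atomically
        completed.append(job)       # "execute" it outside the lock

# Workers self-schedule by pulling from the queue directly.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# all 10 jobs processed exactly once, by whichever worker got there first
```

The real version adds crash recovery (a claimed-but-unfinished job must return to the queue), but the latency story is visible here: dispatch is one lock acquisition, not a chain of polling components.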
Python-only. DAGs are Python files. Tasks are Python functions. If your pipeline needs a TypeScript transform or a Go data processor, you shell out or use a BashOperator — no first-class polyglot support.
No visual editor. Airflow has a monitoring UI (DAG view, Gantt charts, logs), but no visual flow builder. You define workflows in Python code, which is powerful but excludes non-developers from authoring workflows.
Static DAGs. The dependency graph is fixed at parse time. Airflow 2.x added @task.branch and dynamic task mapping, but you're still declaring branches upfront, not writing arbitrary runtime control flow.
No durable execution. If a task crashes mid-execution, all progress within that task is lost. Airflow retries the entire task from the beginning.
Parse overhead. The scheduler re-parses all Python DAG files periodically. With thousands of DAGs, this alone can consume significant CPU and cause scheduling delays.
Prefect: The Pythonic Successor
Prefect (2018) was built explicitly as "Airflow, but for Python developers who want less ceremony." Its core insight: use Python's native execution model instead of fighting it.
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

@task
def load(results):
    db.insert(results)

@flow
def etl_pipeline():
    data = extract()
    transformed = transform(data)
    load(transformed)
This looks almost identical to Airflow, but with a crucial difference: the code actually runs as Python. When etl_pipeline() is called, extract() really executes its function body, right there. There's no graph construction phase — the DAG is implicit from the call order.
The Hybrid Execution Model
Prefect sits between Generation 1 and Generation 2. Tasks execute in the same process as the flow (by default), so there's no XCom problem — data passes through Python variables. But each task run is tracked by the Prefect server via a REST API:
@task runs:
1. POST /task_runs → server creates TaskRun with state Pending
2. PUT /task_runs/{id}/state → Running
3. function body executes (in same Python process)
4. PUT /task_runs/{id}/state → Completed (with result)
Every state transition is an HTTP call to the Prefect API server, which persists it in Postgres.
Concurrency via Futures
Prefect uses Python's native async/futures for parallelism:
@flow
def parallel_pipeline():
    futures = [transform.submit(item) for item in items]  # Submit all
    results = [f.result() for f in futures]               # Collect
.submit() creates a future (using Python's concurrent.futures or a task runner). The function call runs in a thread/process pool. This is simpler than Airflow's DAG-level parallelism but limited by Python's GIL for CPU-bound work.
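The submit/collect pattern Prefect builds on is plain concurrent.futures, which you can run without Prefect at all. A sketch of the underlying mechanism, not Prefect's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(item: int) -> int:
    return item * 2

items = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(transform, item) for item in items]  # submit all
    results = [f.result() for f in futures]  # block until each completes
# results == [2, 4, 6, 8], in submission order regardless of completion order
```

Prefect layers state tracking on top (each submitted task run is reported to the server), which is where the per-task overhead in the Cons below comes from.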
Pros
- Zero new concepts for Python developers. Decorators on regular functions. Python control flow. Python data passing.
- Dynamic workflows. Since the code is real Python, you can use if/else, for loops, try/except — anything. The "DAG" is whatever Python actually executes.
- Lower ceremony than Airflow. No scheduler process. No DAG file parsing. Just run the flow.
Cons
- No durable execution. Like Airflow, if the process crashes mid-task, work is lost. Task-level retries restart the task from the beginning.
- State-tracking overhead. Every task run creates multiple HTTP calls + DB writes for state transitions (Pending → Running → Completed). For workflows with hundreds of short tasks, this overhead dominates.
- Python-only. The server is Python (FastAPI). The workers are Python. The SDK is Python. If your workflow involves non-Python code, Prefect can shell out, but there's no native multi-language support.
- No server-side sleep. time.sleep(60) in a flow holds the worker process for 60 seconds. There's no "schedule me to wake up in 60 seconds" primitive (unlike Temporal or Windmill).
The DAG Scheduler Trade-off
Both Airflow and Prefect share the same fundamental model: tasks are tracked externally, data passes through storage, and the orchestrator drives execution. The workflow code describes what to do, but doesn't directly control how it's executed.
Pro: Simple mental model. Tasks are independent. Easy to monitor.
Pro: Mature ecosystems (especially Airflow).
Pro: Natural fit for scheduled batch processing.
Con: No durable execution within a task.
Con: High per-task overhead (state transitions, data serialization).
Con: Static or weakly dynamic control flow (Airflow worse, Prefect better).
Con: Data passing goes through the database (all engines share this when steps are separate processes, but Airflow's XCom has historically been the most limited in size and ergonomics).
For scheduled ETL pipelines where tasks run for minutes, these trade-offs are excellent. For real-time, latency-sensitive, or long-running workflows, they're not.
Generation 2: Durable Execution
The Abstraction
Durable execution inverts the DAG scheduler model: instead of an external orchestrator driving tasks, the workflow code drives itself, and the runtime makes the code survive crashes.
You write what looks like a normal program:
async function processOrder(order) {
  const payment = await chargePayment(order);
  const shipment = await createShipment(payment);
  await sendConfirmation(shipment);
}
The runtime intercepts each await and ensures that:
- The result is durably persisted before execution continues
- On crash, the function resumes from where it left off — already-completed steps are not re-executed
- Side effects happen at least once (and ideally exactly once)
The key insight: the await keyword is the persistence boundary. Everything between two awaits is either fully completed or fully retried — never partially executed.
But the implementations differ wildly in how they achieve this.
Temporal: Event Sourcing + Deterministic Replay
Temporal (2019, ex-Uber Cadence team) is the most well-known durable execution engine. Its core abstraction: record every state change as an immutable event, then replay events to reconstruct state.
// Workflow — must be deterministic (sandboxed in TS, by convention in Go/Java)
export async function processOrder(orderId: string) {
  const order = await activities.getOrder(orderId);
  const payment = await activities.chargePayment(order);
  await activities.shipOrder(payment);
}

// Activity — runs in normal Node.js, can do anything
export async function chargePayment(order: Order): Promise<Receipt> {
  return stripe.charges.create({ amount: order.total });
}
The Workflow / Activity Split
Temporal enforces a strict separation:
- Workflow code must be deterministic — no I/O, no randomness, no direct clock access. How strictly this is enforced depends on the SDK:
  - TypeScript: the strictest. Workflows run in a V8 isolate with Math.random(), Date(), setTimeout() replaced by deterministic versions. Node.js APIs (fs, http, fetch) are blocked at the bundler level.
  - Python: a sandbox using proxy objects and a custom module importer restricts most non-deterministic access at runtime.
  - Go and Java: no sandbox. Determinism is enforced by convention — developers are told not to use goroutines/threads, system clocks, or randomness. Violations are only caught at replay time (non-determinism error), not at compile time.
- Activity code runs in normal Node.js / Python / Go. It can do anything — call APIs, write to databases, generate random numbers.
This split exists because of Temporal's replay mechanism.
How Replay Works
Every time a workflow makes a decision (schedule an activity, start a timer, send a signal), Temporal records it as an event in an immutable event history stored in the database.
When the workflow needs to resume (after an activity completes, after a crash, after a timer fires), the entire workflow function re-executes from the beginning. But this time, the SDK checks the event history:
Execution 1: run → await getOrder → [no event] → schedule activity → YIELD
Execution 2: run → await getOrder → [event: completed(order)] → return recorded result
→ await chargePayment → [no event] → schedule activity → YIELD
Execution 3: run → await getOrder → [event] → skip
→ await chargePayment → [event] → skip
→ await shipOrder → [no event] → schedule activity → YIELD
Each execution replays all previous steps (returning results from the event history) and then advances one step. This is event sourcing applied to code execution.
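The replay loop above can be simulated in a few dozen lines. This is a sketch of the mechanism, with hypothetical names — Temporal's real SDK does this through the event history and the language's await machinery, not exceptions — but the control flow is the same: re-run from the top, return recorded results for completed steps, yield at the first unrecorded one.

```python
class Yield(Exception):
    """Raised when the workflow hits a step with no recorded result."""

def make_step(history: list, pending: list):
    calls = {"i": 0}
    def step(fn, *args):
        i = calls["i"]
        calls["i"] += 1
        if i < len(history):
            return history[i]        # replay: return the recorded result
        pending.append((fn, args))   # first new step: schedule the activity
        raise Yield()
    return step

def drive(workflow):
    history: list = []
    while True:
        pending: list = []
        try:
            # Full re-execution from the beginning, every time.
            return workflow(make_step(history, pending))
        except Yield:
            fn, args = pending[0]
            history.append(fn(*args))  # run the activity, record, resume

def order_workflow(step):
    order = step(lambda: {"id": 123})
    receipt = step(lambda o: f"charged-{o['id']}", order)
    return step(lambda r: f"shipped-{r}", receipt)

result = drive(order_workflow)
# result == "shipped-charged-123", after 4 executions of order_workflow
```

Note why determinism matters here: if order_workflow made a different sequence of step calls on re-execution, the recorded history would be matched against the wrong steps.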
Concrete Example: Event History
For the 3-step workflow above, here's what Temporal actually stores in its database (Postgres, MySQL, or Cassandra depending on deployment):
Event# EventType Details
────── ──────────────────────────── ─────────────────────────────
1 WorkflowExecutionStarted {input: orderId}
2 WorkflowTaskScheduled {taskQueue: "main"}
3 WorkflowTaskStarted {worker: "w1"}
4 WorkflowTaskCompleted {commands: [ScheduleActivity("getOrder")]}
5 ActivityTaskScheduled {type: "getOrder"}
6 ActivityTaskStarted {worker: "w1"}
7 ActivityTaskCompleted {result: {id: 123, total: 99}}
8 WorkflowTaskScheduled {taskQueue: "main"}
9 WorkflowTaskStarted {worker: "w1"}
10 WorkflowTaskCompleted {commands: [ScheduleActivity("chargePayment")]}
11 ActivityTaskScheduled {type: "chargePayment"}
12 ActivityTaskStarted {worker: "w1"}
13 ActivityTaskCompleted {result: {receipt: "ch_xxx"}}
14 WorkflowTaskScheduled {taskQueue: "main"}
15 WorkflowTaskStarted {worker: "w1"}
16 WorkflowTaskCompleted {commands: [ScheduleActivity("shipOrder")]}
17 ActivityTaskScheduled {type: "shipOrder"}
18 ActivityTaskStarted {worker: "w1"}
19 ActivityTaskCompleted {result: {tracking: "FDX123"}}
20 WorkflowTaskScheduled {taskQueue: "main"}
21 WorkflowTaskStarted {worker: "w1"}
22 WorkflowTaskCompleted {commands: [CompleteWorkflow]}
23 WorkflowExecutionCompleted {result: "ok"}
23 events for 3 steps. Each activity generates ~7 events. This is the write amplification cost of event sourcing. But you get a complete, queryable audit trail of exactly what happened and when.
The Determinism Requirement
Since the workflow function is replayed from the beginning on every resume, it must produce the same sequence of commands on every execution. If you used Math.random() to decide whether to call activity A or B, replay would make a different choice and Temporal would throw a non-determinism error.
This is the most common source of developer pain with Temporal. You must learn to think about which code is "workflow" (deterministic orchestration) and which is "activity" (actual work). In TypeScript, the sandbox catches most violations immediately. In Go or Java, a third-party library that calls time.Now() or Math.random() will silently work until replay fails — potentially in production, weeks after deployment.
// ❌ BROKEN — non-deterministic
export async function myWorkflow() {
  if (Math.random() > 0.5) { // Different on replay!
    await activities.pathA();
  } else {
    await activities.pathB();
  }
}

// ✅ CORRECT — decision based on activity result
export async function myWorkflow() {
  const coin = await activities.flipCoin(); // Recorded in history
  if (coin > 0.5) {
    await activities.pathA();
  }
}
Architecture
Temporal's server is 4 services (Frontend, History, Matching, Worker) backed by PostgreSQL or Cassandra. Workers connect via gRPC and long-poll for tasks. This is the highest operational complexity of any engine in this comparison.
Workflow Worker Temporal Server (4 services) Activity Worker
│ │ │
│◀── gRPC WorkflowTask ───│ │
│ (with event history) │ │
│ │ │
│ replay, hit new await │ │
│ │ │
│── gRPC Command ────────▶│── append events ─▶ DB │
│ ScheduleActivityTask │── enqueue on task queue ──────▶│
│ │ │
│ │ execute fn()
│ │ │
│ │◀── gRPC result ────────────────│
│ │── append events ─▶ DB │
│ │ │
│◀── gRPC WorkflowTask ───│ │
│ (updated history) │ │
│ replay all, advance │ │
Pros
- True durable execution. Workflows can run for months. Crash anywhere, resume exactly where you left off.
- Full audit trail. Every event is recorded. You can inspect and replay any workflow.
- Multi-language SDKs. TypeScript, Go, Java, Python, .NET, PHP.
- Rich primitives. Signals, queries, child workflows, timers, cancellation, search attributes.
Cons
- Operational complexity. 4 server services + database + optionally Elasticsearch. Many moving parts.
- Determinism tax. Developers must constantly think about what's deterministic. Subtle bugs from non-deterministic libraries.
- Write amplification. 7+ events per activity. A 100-step workflow generates 700+ database writes.
- Replay cost. Each workflow task replays from the beginning. Mitigated by sticky execution (caching state on the same worker), but cold replay of long histories is expensive.
Inngest: HTTP Callbacks + Memoization
Inngest (2022) took a radically different approach: what if the execution engine was just an HTTP middleware?
export const processOrder = inngest.createFunction(
  { id: "process-order" },
  { event: "order/created" },
  async ({ event, step }) => {
    const order = await step.run("get-order", () =>
      db.orders.findById(event.data.orderId)
    );
    const payment = await step.run("charge", () =>
      stripe.charges.create({ amount: order.total })
    );
    await step.run("ship", () =>
      shipping.dispatch(order)
    );
  }
);
The HTTP Round-Trip Model
Inngest's execution model is unlike anything else. Your code runs as a stateless HTTP endpoint. The Inngest server orchestrates execution by making HTTP calls to your endpoint:
Request 1 (no steps completed):
Server POST → your endpoint
Code runs: step.run("get-order", fn) → fn executes → returns order
Response: { step_result: "get-order", data: order }
Server stores result.
Request 2 (get-order completed):
Server POST → your endpoint (with memoized results)
Code runs: step.run("get-order", fn) → memoized, returns stored result
Code runs: step.run("charge", fn) → fn executes → returns receipt
Response: { step_result: "charge", data: receipt }
Server stores result.
Request 3 (get-order + charge completed):
Server POST → your endpoint (with memoized results)
Code runs: step.run("get-order") → memoized
Code runs: step.run("charge") → memoized
Code runs: step.run("ship", fn) → fn executes
Response: { step_result: "ship", data: tracking }
Request 4 (all steps completed):
Server POST → your endpoint
All steps memoized, function returns.
Response: { complete: true, result: ... }
Each step.run() = one HTTP round-trip. The function re-executes from the top on every request, but completed steps return instantly from memoized results.
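The round-trip sequence above can be simulated without HTTP. In this sketch (hypothetical names; the real SDK speaks HTTP and JSON), each "request" re-executes the function with the memoized results in hand, runs exactly one new step, and ends the request by reporting that step's result back to the server:

```python
class StepDone(Exception):
    """Ends the 'request' after exactly one new step executes."""
    def __init__(self, step_id, data):
        self.step_id, self.data = step_id, data

def make_step(memo: dict):
    def run(step_id: str, fn):
        if step_id in memo:
            return memo[step_id]       # memoized: skip the function body
        raise StepDone(step_id, fn())  # new step: execute, end the request
    return run

def serve(func):
    """Drive the function request-by-request; returns (result, request_count)."""
    memo: dict = {}
    requests = 0
    while True:
        requests += 1                  # one HTTP round-trip per iteration
        try:
            return func(make_step(memo)), requests
        except StepDone as done:
            memo[done.step_id] = done.data  # server stores the step result

def process_order(step):
    order = step("get-order", lambda: {"total": 99})
    receipt = step("charge", lambda: f"ch_{order['total']}")
    return step("ship", lambda: f"shipped:{receipt}")

result, requests = serve(process_order)
# 3 steps → 4 requests: the final request finds everything memoized
```

This makes the latency trade-off concrete: every step pays a full request, and later requests re-run (but skip) all earlier step sites.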
Why HTTP?
This design choice has profound implications:
Pro: Truly stateless workers. Your code is a regular HTTP endpoint — deploy it on Vercel, AWS Lambda, Cloudflare Workers, a Docker container, anywhere. No persistent worker process, no gRPC connection to maintain, no special runtime. The Inngest server handles all state.
Pro: No new infrastructure for the developer. You add Inngest to your existing Express/Next.js/Flask app. No separate worker binary, no task queue, no Celery/RabbitMQ.
Pro: Language-agnostic by design. Any language that can serve HTTP can be an Inngest worker.
Con: Highest per-step latency. Every step = HTTP request + response + memoized replay of all previous steps. A 10-step workflow makes 10 HTTP requests, and the 10th request re-executes (and skips) all 9 previous steps before running the 10th.
Con: Full re-execution per step. Like Temporal, the function re-runs from the beginning. Unlike Temporal, there's no compiled workflow bundle or V8 isolate — it's a full HTTP request with all the associated overhead (routing, middleware, JSON parsing).
The Memoization Distinction
Inngest and Temporal both re-execute code and skip completed steps, but the mechanism differs:
- Temporal: The SDK intercepts await calls and checks an in-memory event history. If a matching event exists, the call returns instantly. This happens within a single process execution.
- Inngest: The server sends memoized results in the HTTP request body. The SDK checks its local cache. If found, step.run() returns immediately. This happens across HTTP requests.
The practical difference: Temporal's replay is in-process (fast, ~microseconds per replayed step). Inngest's replay is across HTTP (slower, but the memoized steps are essentially free since the function body isn't called).
Pros
- Simplest deployment model. Add it to your existing app. No infrastructure beyond the Inngest server (which can be self-hosted or cloud).
- Serverless-native. Works perfectly with Lambda/Vercel/Cloudflare. No persistent connections to maintain.
- Event-driven. First-class event system with fan-out, debounce, throttle.
- Server-side sleep. step.sleep("1h") doesn't hold a process — the server wakes your function after 1 hour.
Cons
- Latency. Each step = HTTP round-trip. For workflows with many fast steps, the HTTP overhead dominates.
- Re-execution cost. The function code (parsing, importing, middleware) runs on every step, not just the new one.
- Observability. Debugging is harder when execution is spread across multiple HTTP requests.
Windmill WAC: Suspend/Resume + Checkpoint
Windmill (2022) introduced Workflow-as-Code (WAC) with a unique mechanism: exception-based suspend/resume with mutable checkpoints.
import { task, step, workflow } from "windmill-client";

const getOrder = task(async (id: string) => {
  return db.orders.findById(id);
});

export const main = workflow(async () => {
  const order = await getOrder("order-123");

  // step() executes inline — no child job, no dispatch
  const total = await step("calc-total", () =>
    order.items.reduce((sum, i) => sum + i.price, 0)
  );

  const payment = await chargePayment(total);
  return { payment };
});
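The checkpoint idea can be contrasted with replay in a small sketch. Assuming (as the text states) that completed task results live in a mutable checkpoint rather than an append-only event log, a resume after a crash reads earlier results from the checkpoint instead of re-executing them. The names here are illustrative, not Windmill's API:

```python
class WorkerCrash(Exception):
    """Stands in for a process crash mid-workflow."""

checkpoint: dict = {}   # persisted (e.g. as Postgres JSONB) in a real engine
executions: list = []   # tracks which task bodies actually ran

def task(task_id: str, fn):
    if task_id in checkpoint:
        return checkpoint[task_id]  # resume: reuse the checkpointed result
    result = fn()
    executions.append(task_id)      # the body ran this time
    checkpoint[task_id] = result    # checkpoint before moving on
    return result

def order_workflow(crash: bool = False):
    order = task("get-order", lambda: {"items": [40, 60]})
    total = task("calc-total", lambda: sum(order["items"]))
    if crash:
        raise WorkerCrash()         # simulate dying after two tasks
    return task("charge", lambda: f"charged {total}")

try:
    order_workflow(crash=True)      # first attempt: crashes after two tasks
except WorkerCrash:
    pass
result = order_workflow()           # resume: first two tasks are not re-run
# executions == ["get-order", "calc-total", "charge"]
```

The contrast with Temporal's model: there is one current state object updated in place, rather than a growing event history replayed on every resume.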
