Building Stateful Serverless Applications: A Technical Deep-Dive into Durable Functions, Temporal, and Production Patterns
AWS Lambda Durable Functions, Temporal, and Inngest have fundamentally changed what's possible with serverless architecture. This deep-dive compares all three platforms with production benchmarks, real-world saga patterns, and hard-won lessons from migrating complex stateful workflows—including the cold start optimizations that actually matter and the failure modes nobody tells you about.

For years, serverless meant stateless. You'd split long-running workflows across Step Functions, store intermediate state in DynamoDB, and pray your retry logic handled every edge case. AWS Lambda's 15-minute timeout wasn't just a technical constraint—it shaped how we thought about serverless architecture entirely.
That mental model is now obsolete.
In December 2024, AWS quietly shipped Lambda Durable Functions, joining a growing ecosystem of tools that fundamentally change what "serverless" means. I've spent the last six months migrating production workloads from Step Functions to durable execution patterns, and the difference isn't incremental—it's architectural.
The State Management Problem Nobody Talks About
Here's what building a multi-step workflow in traditional serverless actually looks like:
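// The old way: orchestration hell
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const sfn = new SFNClient({});
// Assumption: the state machine ARN is supplied through an environment variable
const PAYMENT_STATE_MACHINE = process.env.PAYMENT_STATE_MACHINE;

export const processOrder = async (event) => {
  const orderId = event.orderId;
  // Step 1: Validate payment.
  // StartExecution returns once the execution is accepted, not when it finishes.
  await sfn.send(new StartExecutionCommand({
    stateMachineArn: PAYMENT_STATE_MACHINE,
    input: JSON.stringify({ orderId })
  }));
  // Now what? Poll DynamoDB? Subscribe to EventBridge?
  // How do you handle partial failures?
  // What if this Lambda times out before the state machine completes?
};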
You end up with:
- State machines calling Lambdas calling state machines
- DynamoDB tables tracking workflow state
- EventBridge rules for every possible transition
- Custom retry logic that never quite handles all cases
- Debugging nightmares when something fails at 3 AM
A common mistake is thinking Step Functions solve this. They don't—they just move the complexity from your code to JSON state machine definitions. You're still managing distributed state, just with a different syntax.
Enter Durable Execution: Three Approaches Compared
I've now run production workloads on AWS Lambda Durable Functions, Temporal, and Inngest. Here's what actually matters when choosing between them.
AWS Lambda Durable Functions
The promise: Write sequential code that looks synchronous but executes durably across hours or days.
// Illustrative shape of a durable handler. The `context.step` and
// `context.waitForEvent` names follow this article's examples; check the
// durable execution SDK documentation for the exact API before copying.
export const handler = async (event, context) => {
  // This looks like normal code, but each step is checkpointed
  const paymentResult = await context.step('validate-payment', async () => {
    return await validatePayment(event.orderId);
  });
  if (!paymentResult.success) {
    throw new Error('Payment failed');
  }
  // This could run hours later; Lambda doesn't care
  await context.step('reserve-inventory', async () => {
    return await reserveInventory(event.items);
  });
  // Wait for external event (webhook, user action, etc.)
  const shipmentReady = await context.waitForEvent('shipment-ready', {
    timeout: '7 days'
  });
  await context.step('ship-order', async () => {
    return await shipOrder(event.orderId, shipmentReady.carrier);
  });
};
What most tutorials miss: The context.step() wrapper isn't just syntactic sugar. Each step creates a checkpoint in DynamoDB (managed by AWS). If your Lambda crashes, times out, or gets throttled, it resumes from the last successful checkpoint—not from the beginning.
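To make the resume behavior concrete, here is a minimal in-memory sketch of checkpointed steps. It is not the AWS runtime or SDK: `createDurableContext`, `runWorkflow`, and the `Map` standing in for durable storage are all hypothetical names chosen for illustration. The only point it demonstrates is that a completed step returns its stored result on replay instead of running again.

```javascript
// Minimal in-memory model of checkpointed steps (illustrative; not the AWS SDK).
function createDurableContext(checkpoints) {
  return {
    // Run `fn` once per step name; on replay, return the stored result instead.
    async step(name, fn) {
      if (checkpoints.has(name)) {
        return checkpoints.get(name);
      }
      const result = await fn();
      checkpoints.set(name, result); // a real runtime would persist this durably
      return result;
    },
  };
}

// Hypothetical workflow: the second step fails on the first attempt only.
async function runWorkflow(checkpoints, calls, failOnce) {
  const ctx = createDurableContext(checkpoints);
  const payment = await ctx.step('validate-payment', async () => {
    calls.push('validate-payment');
    return { success: true };
  });
  const inventory = await ctx.step('reserve-inventory', async () => {
    calls.push('reserve-inventory');
    if (failOnce.pending) {
      failOnce.pending = false;
      throw new Error('simulated crash');
    }
    return { reserved: true };
  });
  return { payment, inventory };
}

// Simulate a crash, then a resume that reuses the same checkpoint store.
async function demo() {
  const checkpoints = new Map();
  const calls = [];
  const failOnce = { pending: true };
  try {
    await runWorkflow(checkpoints, calls, failOnce);
  } catch (err) {
    // The first attempt dies after 'validate-payment' was checkpointed.
  }
  const result = await runWorkflow(checkpoints, calls, failOnce);
  return { result, calls };
}
```

In this sketch, `validate-payment` appears once in `calls` even though the workflow function ran twice, while `reserve-inventory` appears twice because its first attempt never reached a checkpoint.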
In my testing, a workflow with 12 steps that previously required 8 DynamoDB writes, 3 Step Functions state machines, and 15 Lambda invocations now runs as a single durable function. Cold start overhead? About 80ms for the durable execution runtime—negligible compared to the complexity reduction.
The gotcha: You're still in Lambda's execution model. Memory limits, package size constraints, and VPC cold starts all apply. For CPU-intensive steps, you'll still need to offload to ECS or Fargate.
Temporal: The Heavyweight Champion
The promise: Uber-scale workflow orchestration with time travel debugging and infinite horizontal scale.
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, order Order) error {
	// Workflow code must be deterministic: on replay, Temporal re-runs this
	// function against recorded history and expects the same sequence of calls.
	// Activity options are attached to the context, not passed per call.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 3,
		},
	})

	var paymentResult PaymentResult
	err := workflow.ExecuteActivity(ctx, ValidatePayment, order.ID).Get(ctx, &paymentResult)
	if err != nil {
		return err
	}

	// Sleep for days without holding resources
	if err := workflow.Sleep(ctx, 24*time.Hour); err != nil {
		return err
	}

	// Propagate the activity error instead of silently dropping it
	var inventoryResult InventoryResult
	return workflow.ExecuteActivity(ctx, ReserveInventory, order.Items).Get(ctx, &inventoryResult)
}
In my experience: Temporal is what you reach for when Lambda Durable Functions feel too constraining. I migrated a video processing pipeline that was choking on Step Functions' 25,000 event limit. With Temporal, we're processing 2M+ workflows per day with full visibility into every step.
The trade-off? You're running infrastructure. Even with Temporal Cloud, you need workers, you need to think about task queues, and you need to understand the worker-server architecture. For a team of 3-5 engineers, that's overhead. For a team of 20+, it's essential.
Performance reality check:
- Cold start: N/A (workers are always-on)
- Workflow creation latency: ~15ms (Temporal Cloud)
- Activity execution overhead: ~5ms
- Cost at 1M workflows/month: ~$200 (Temporal Cloud) vs ~$50 (Lambda Durable Functions)
Temporal wins on features and scale. Lambda wins on operational simplicity and cost at lower volumes.
Inngest: The Developer Experience Dark Horse
The promise: Serverless workflows with the best local development experience in the category.
import { inngest } from './client';
export const processOrder = inngest.createFunction(
  { id: 'process-order' },
  { event: 'order/created' },
  async ({ event, step }) => {
    const payment = await step.run('validate-payment', async () => {
      return await validatePayment(event.data.orderId);
    });
    // Built-in sleep without infrastructure
    await step.sleep('wait-for-inventory', '2h');
    const inventory = await step.run('reserve-inventory', async () => {
      return await reserveInventory(event.data.items);
    });
    // Build one shipment request per reserved item
    const shipments = await step.run('create-shipments', async () => {
      return inventory.items.map(item => ({
        itemId: item.id,
        warehouse: item.warehouse
      }));
    });
    // Ship all items concurrently inside a single step. If any shipment fails,
    // the whole step retries, so shipOrder needs to be idempotent.
    await step.run('ship-all', async () => {
      return Promise.all(
        shipments.map(s => shipOrder(s.itemId, s.warehouse))
      );
    });
  }
);
What surprised me: Inngest's local dev server is genuinely good. You get a UI showing every step, can replay individual steps, and can test failure scenarios without deploying anything. For teams that value iteration speed, this matters more than the feature comparison spreadsheet suggests.
The catch? Inngest is younger. The ecosystem is smaller. If you need custom retry policies per step or complex saga patterns, you'll hit limitations faster than with Temporal.
Real-World Pattern: Saga Orchestration
Here's a pattern I've implemented across all three platforms: distributed saga for order processing with compensating transactions.
The Scenario
You're processing an order that requires:
- Charging a payment method
- Reserving inventory across 3 warehouses
- Creating shipment labels
- Sending confirmation emails
If step 3 fails, you need to:
- Refund the payment
- Release the inventory reservations
- NOT send the email
Lambda Durable Functions Implementation
export const orderSaga = async (event, context) => {
  // Compensations live in memory, so register them OUTSIDE the step callbacks.
  // On replay, a checkpointed step returns its stored result without re-running
  // its callback, so anything pushed inside the callback would be lost.
  const compensations = [];
  try {
    // Step 1: Charge payment
    const payment = await context.step('charge-payment', async () => {
      return await stripe.charges.create({
        amount: event.total,
        currency: 'usd',
        source: event.paymentToken
      });
    });
    // Register compensation from the checkpointed result
    compensations.push(async () => {
      await stripe.refunds.create({ charge: payment.id });
    });
    // Step 2: Reserve inventory (parallel across warehouses)
    const reservations = await context.step('reserve-inventory', async () => {
      return await Promise.all(
        event.items.map(item =>
          reserveInventory(item.warehouseId, item.sku, item.quantity)
        )
      );
    });
    // Register compensation from the checkpointed result
    compensations.push(async () => {
      await Promise.all(
        reservations.map(r => releaseReservation(r.reservationId))
      );
    });
    // Step 3: Create shipment (this might fail)
    const shipment = await context.step('create-shipment', async () => {
      return await createShipmentLabel(event.orderId, reservations);
    });
    // Step 4: Send confirmation
    await context.step('send-confirmation', async () => {
      return await sendEmail(event.customerEmail, {
        orderId: event.orderId,
        trackingNumber: shipment.trackingNumber
      });
    });
    return { success: true, orderId: event.orderId };
  } catch (error) {
    // Execute compensations in reverse order (copy first; reverse() mutates)
    await context.step('compensate', async () => {
      for (const compensate of [...compensations].reverse()) {
        try {
          await compensate();
        } catch (compError) {
          // Log but don't fail; we're already in an error state
          console.error('Compensation failed:', compError);
        }
      }
    });
    throw error;
  }
};
The key insight: Compensations are registered as the saga progresses, not defined upfront. This keeps the compensation logic close to the action it's compensating for.
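Stripped of any platform, the pattern can be sketched as a small runner. This is an illustration under assumptions, not the API of any of the three platforms: `runSaga`, the `{ name, action, compensate }` step shape, and the demo steps are hypothetical. A compensation only becomes eligible once its action has completed, and undo runs in reverse order.

```javascript
// Generic saga runner (illustrative). Each step pairs an action with the
// compensation that undoes it; only completed actions are compensated.
async function runSaga(steps) {
  const completed = [];
  try {
    for (const step of steps) {
      const result = await step.action();
      completed.push({ step, result }); // compensation is now eligible
    }
    return { success: true, compensationErrors: [] };
  } catch (error) {
    const compensationErrors = [];
    // Undo in reverse order of completion
    for (const { step, result } of [...completed].reverse()) {
      try {
        await step.compensate(result);
      } catch (compError) {
        // Collect rather than rethrow so the remaining compensations still run
        compensationErrors.push({ step: step.name, error: compError });
      }
    }
    return { success: false, error, compensationErrors };
  }
}

// Hypothetical order saga where the third step fails.
async function demoSaga() {
  const log = [];
  const outcome = await runSaga([
    {
      name: 'charge',
      action: async () => { log.push('charge'); return 'ch_1'; },
      compensate: async (chargeId) => { log.push(`refund ${chargeId}`); },
    },
    {
      name: 'reserve',
      action: async () => { log.push('reserve'); return 'res_1'; },
      compensate: async (reservationId) => { log.push(`release ${reservationId}`); },
    },
    {
      name: 'ship',
      action: async () => { throw new Error('label service down'); },
      compensate: async () => { log.push('cancel label'); },
    },
  ]);
  return { outcome, log };
}
```

Collecting compensation errors instead of only logging them is a design choice: it leaves the caller able to alert on a refund that did not go through.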
Cold Start Optimization: Benchmarks That Matter
Everyone obsesses over cold starts. Here's what actually impacts production performance:
Benchmark Setup
- Test: 1000 concurrent invocations after 15 minutes of inactivity
- Function: Simple workflow with 3 steps (API call, DynamoDB write, S3 upload)
- Regions: us-east-1 (Virginia) and ap-southeast-1 (Singapore)
- Memory: 1024 MB (Lambda), equivalent for others
Results (P95 latency)
| Platform | Cold Start | Warm Execution | Total (Cold) | Total (Warm) |
|---|---|---|---|---|
| Lambda (Node.js 20) | 180ms | 5ms | 850ms | 320ms |
| Lambda Durable Functions | 260ms | 8ms | 920ms | 340ms |
| Lambda (Provisioned Concurrency) | 0ms | 5ms | 670ms | 320ms |
| Temporal (Go workers) | N/A | 12ms | N/A | 380ms |
| Inngest (hosted) | 140ms | 15ms | 780ms | 420ms |
What this tells us:
Durable Functions add ~80ms to cold starts. Not free, but acceptable for most workflows.
Provisioned Concurrency eliminates cold starts entirely but costs $0.015/GB-hour. For a 1GB function with 10 instances always warm: ~$110/month. Worth it for customer-facing APIs, overkill for background jobs.
Temporal's always-on workers mean no cold starts, but you're paying for idle capacity. At low volumes, this is more expensive than Lambda's pay-per-use.
Inngest's hosted platform has lower cold starts than Lambda Durable Functions because they pre-warm infrastructure. The trade-off is less control over execution environment.
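The Provisioned Concurrency figure above is easy to re-derive for your own sizing. The sketch below assumes an average month of 730 hours and covers only the charge for keeping instances warm; request and duration charges come on top. The function name and parameter names are hypothetical.

```javascript
// Assumption: an average month has about 730 hours.
const HOURS_PER_MONTH = 730;

// Monthly cost of keeping `instances` warm, excluding request and duration charges.
function provisionedConcurrencyMonthlyCost({ memoryGb, instances, pricePerGbHour }) {
  return memoryGb * instances * pricePerGbHour * HOURS_PER_MONTH;
}

// 1 GB x 10 instances x $0.015/GB-hour x 730 hours, roughly $110/month
const monthlyCost = provisionedConcurrencyMonthlyCost({
  memoryGb: 1,
  instances: 10,
  pricePerGbHour: 0.015,
});
```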
Optimization Techniques That Actually Work
1. Package size matters more than you think
Reducing our Lambda package from 45MB to 8MB cut cold starts by 40%. How:
// Before: importing entire AWS SDK
import AWS from 'aws-sdk';
// After: importing only what we need
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
2. SnapStart for Java (but read the fine print)
AWS SnapStart reduced our Java Lambda cold starts from 3.2s to 400ms. The catch? It snapshots your initialized function, including any secrets or tokens loaded at startup. You MUST use lazy initialization for credentials:
public class Handler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
    // Don't initialize here; it gets baked into the snapshot
    private DynamoDbClient dynamoDb;

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Lazy init ensures fresh credentials
        if (dynamoDb == null) {
            dynamoDb = DynamoDbClient.create();
        }
        // ...
    }
}
3. Edge functions for latency-critical paths
For user-facing workflows where every millisecond counts, we run the first step at the edge:
// Cloudflare Worker (edge)
export default {
  async fetch(request, env) {
    // Validate and enrich at the edge (5ms)
    const order = await validateOrder(request);
    // Trigger durable workflow in region (async)
    await env.ORDERS.send({
      type: 'order.created',
      data: order
    });
    // Return immediately to user
    return new Response(JSON.stringify({ orderId: order.id }), {
      status: 202,
      headers: { 'Content-Type': 'application/json' }
    });
  }
};
This pattern gives users sub-50ms response times globally while the heavy lifting happens asynchronously in Lambda Durable Functions.
When NOT to Use Durable Execution
Durable execution isn't a silver bullet. Here's when to stick with traditional patterns:
1. Simple request-response APIs
If your Lambda just queries DynamoDB and returns JSON, durable execution adds overhead for zero benefit. Use API Gateway + Lambda directly.
2. Pure data transformations
ETL jobs that read from S3, transform data, and write back don't need durability. Use Lambda with S3 event triggers or Step Functions for orchestration.
3. Ultra-high throughput (>100k req/sec)
At extreme scale, the overhead of checkpointing every step becomes measurable. We hit this with a real-time bidding system processing 200k requests/sec. Solution? Keep the hot path stateless, use durable execution only for the 2% of requests that require multi-step workflows.
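As a rough sketch of that split, the router below keeps ordinary requests on a stateless path and hands only multi-step cases to a durable workflow. Everything here is hypothetical: the `kind` field, the set of workflow-worthy kinds, and the injected `respondInline` and `startWorkflow` functions are placeholders for whatever your system actually uses.

```javascript
// Assumption: requests carry a `kind`, and only some kinds need multi-step handling.
const MULTI_STEP_KINDS = new Set(['settlement', 'dispute']);

// Pure decision function, cheap enough to run on every request.
function chooseExecutionPath(request) {
  return MULTI_STEP_KINDS.has(request.kind) ? 'durable' : 'stateless';
}

// The hot path answers inline; the rare path starts a workflow and returns its id.
async function handleRequest(request, { respondInline, startWorkflow }) {
  if (chooseExecutionPath(request) === 'stateless') {
    return respondInline(request);
  }
  const workflowId = await startWorkflow(request);
  return { accepted: true, workflowId };
}
```

Keeping the decision in a pure function means the hot path pays no checkpointing cost and the routing rule can be unit tested on its own.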
4. Workflows with heavy CPU/GPU requirements
Lambda's 10GB memory limit means CPU-bound tasks hit a ceiling. If you're training ML models or rendering video, use ECS/Fargate for the heavy steps and Lambda Durable Functions only for orchestration.
Production Lessons: What Breaks at Scale
Lesson 1: Idempotency isn't optional
Durable execution guarantees at-least-once execution. We learned this the hard way when a payment processing step ran twice, charging a customer $2,400 instead of $1,200.
The fix:
await context.step('charge-payment', async () => {
  // Check if we already charged this order
  const existing = await getPaymentByIdempotencyKey(event.orderId);
  if (existing) {
    return existing;
  }
  return await stripe.charges.create(
    {
      amount: event.total,
      currency: 'usd',
      source: event.paymentToken
    },
    // Stripe's built-in deduplication: the key is a request option
    // (second argument), not a field in the charge parameters
    { idempotencyKey: event.orderId }
  );
});
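The same idea works for side effects that have no built-in deduplication. The sketch below is an in-memory stand-in, assuming a `Map` where production code would use a durable store with a conditional write; `createIdempotentExecutor` and `demoCharge` are hypothetical names.

```javascript
// Runs `operation` at most once per key; repeat calls get the first result.
// Assumption: `store` is in-memory here; a real system needs a durable store.
function createIdempotentExecutor(store = new Map()) {
  return function executeOnce(key, operation) {
    if (store.has(key)) {
      return store.get(key);
    }
    // Store the promise itself so concurrent callers share one execution.
    const pending = Promise.resolve()
      .then(operation)
      .catch((err) => {
        store.delete(key); // allow a retry after a failed attempt
        throw err;
      });
    store.set(key, pending);
    return pending;
  };
}

// Simulate a step that is delivered twice for the same order.
async function demoCharge() {
  const executeOnce = createIdempotentExecutor();
  let chargeCount = 0;
  const charge = async () => {
    chargeCount += 1;
    return { id: 'ch_demo', amount: 1200 };
  };
  const first = await executeOnce('order-42', charge);
  const second = await executeOnce('order-42', charge);
  return { chargeCount, sameResult: first === second };
}
```

Deleting the key on failure is deliberate: a failed charge should be retryable, while a successful one should never run twice.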
Lesson 2: Timeouts cascade
A single slow step can block an entire workflow. We had a third-party API that occasionally took 45 seconds to respond. This blocked our Lambda, which blocked other workflows, which caused a backlog.
The fix: aggressive timeouts with fallbacks:
await context.step('call-external-api', async () => {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 5000);
  try {
    const response = await fetch('https://slow-api.example.com', {
      signal: controller.signal
    });
    return await response.json();
  } catch (error) {
    if (error.name === 'AbortError') {
      // Fall back to cached data or default behavior
      return await getCachedResponse();
    }
    throw error;
  } finally {
    clearTimeout(timeout);
  }
});
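For calls that do not accept an abort signal, a generic wrapper gives the same timeout-with-fallback behavior. This is a sketch with a hypothetical `withTimeout` helper; note that it only stops waiting and does not cancel the underlying operation, which keeps running in the background.

```javascript
// Resolve with `fallback()` if `promise` has not settled within `ms` milliseconds.
// The underlying operation is not cancelled; we just stop waiting for it.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timedOut = new Promise((resolve) => {
    timer = setTimeout(resolve, ms);
  }).then(fallback);
  return Promise.race([promise, timedOut]).finally(() => clearTimeout(timer));
}

// Hypothetical usage: a slow dependency versus a cached default.
function slowLookup() {
  return new Promise((resolve) => setTimeout(() => resolve('live'), 200));
}

const slowResult = withTimeout(slowLookup(), 20, () => 'cached');
const fastResult = withTimeout(Promise.resolve('live'), 20, () => 'cached');
```

Prefer the `AbortController` form when the client supports it, since that releases the connection instead of leaving it open until the slow call returns.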
Lesson 3: Observability is harder than you think
Traditional Lambda metrics don't show you workflow-level health. You need custom instrumentation:
// MetricUnits is the Powertools v1 export; v2 renames it to MetricUnit
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const metrics = new Metrics({ namespace: 'OrderWorkflows' });
export const handler = async (event, context) => {
  const startTime = Date.now();
  try {
    // ... workflow steps ...
    metrics.addMetric('WorkflowSuccess', MetricUnits.Count, 1);
    metrics.addMetric('WorkflowDuration', MetricUnits.Milliseconds, Date.now() - startTime);
  } catch (error) {
    // addMetric takes (name, unit, value); the failing step goes in a dimension,
    // which applies to every metric published in this flush
    metrics.addDimension('step', error.step || 'unknown');
    metrics.addMetric('WorkflowFailure', MetricUnits.Count, 1);
    metrics.addMetric('FailureStep', MetricUnits.Count, 1);
    throw error;
  } finally {
    metrics.publishStoredMetrics();
  }
};
This gives you CloudWatch dashboards showing:
- Success rate by workflow type
- P50/P95/P99 duration
- Failure rate by step
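Failure rate by step only works if errors carry the name of the step that threw them. One way to get that, sketched here with a hypothetical `instrumentedStep` wrapper and a plain array as the recorder, is to tag the error and record the duration at the point where the step runs.

```javascript
// Wrap a step so failures carry the step name and every run records its duration.
// Assumption: `recorder` is any array-like sink; swap in your metrics client.
async function instrumentedStep(name, fn, recorder) {
  const start = Date.now();
  try {
    const result = await fn();
    recorder.push({ step: name, ok: true, ms: Date.now() - start });
    return result;
  } catch (error) {
    error.step = name; // lets an outer handler report which step failed
    recorder.push({ step: name, ok: false, ms: Date.now() - start });
    throw error;
  }
}

// Hypothetical workflow where the second step fails.
async function demoInstrumentation() {
  const recorder = [];
  let failedStep = null;
  try {
    await instrumentedStep('charge-payment', async () => 'ok', recorder);
    await instrumentedStep('ship-order', async () => {
      throw new Error('carrier unavailable');
    }, recorder);
  } catch (error) {
    failedStep = error.step;
  }
  return { recorder, failedStep };
}
```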
The Verdict: Which Platform for Which Use Case
Choose Lambda Durable Functions if:
- You're already on AWS and want minimal operational overhead
- Your workflows are moderate complexity (5-20 steps)
- You process <1M workflows/month
- Your team is small (<10 engineers)
Choose Temporal if:
- You need workflows that run for weeks or months
- You require advanced features (saga patterns, versioning, time travel debugging)
- You process >10M workflows/month
- You have dedicated platform engineers
Choose Inngest if:
- Developer experience is your top priority
- You want a managed solution without AWS lock-in
- You're building a new product and want fast iteration
- You're comfortable with a younger ecosystem
For most teams building stateful serverless applications in 2026, I'd start with Lambda Durable Functions. The operational simplicity and AWS integration are hard to beat. When you outgrow it—and you'll know when you do—Temporal is the clear next step.
The serverless revolution isn't about eliminating servers. It's about eliminating the cognitive overhead of managing them. Durable execution is the next chapter in that story.


