Dead-Letter Topics

Dead-letter topics (DLTs) provide bounded failure handling for subscriptions that repeatedly fail to process certain messages. Instead of retrying indefinitely, Pub/Sub redirects such messages to a separate topic after a configured number of delivery attempts.

In production systems, this mechanism protects healthy traffic from poison messages and creates an explicit queue for operational triage.

Conceptual Model

A message follows this lifecycle:

It is delivered to a subscription.
The handler succeeds and acknowledges the message, or fails and triggers a retry.
Retries continue until max_delivery_attempts is reached.
Once that limit is reached, Pub/Sub forwards the message to dead_letter_topic.

This means dead-letter routing is not a replacement for handler quality. It is a containment mechanism that prevents one pathological message class from degrading the full subscription.

flowchart LR
    A[Message delivered] --> B{Handler success?}
    B -->|Yes| C[Acked and removed]
    B -->|No| D{Attempts < max?}
    D -->|Yes| E[Retry with backoff]
    E --> A
    D -->|No| F[Move to dead-letter topic]
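The lifecycle above can be modeled in a few lines of plain Python. The sketch below is only an illustration: `Subscription`, `deliver`, and `strict_handler` are hypothetical names, the dead-letter topic is a plain list, and backoff waits are skipped. It is not how Pub/Sub or the broker implements redelivery.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Subscription:
    """Hypothetical in-memory stand-in for a subscription with a dead-letter topic."""

    max_delivery_attempts: int
    dead_letter_topic: list[bytes] = field(default_factory=list)

    def deliver(self, data: bytes, handler: Callable[[bytes], None]) -> int:
        """Deliver until the handler succeeds or the attempt ceiling is hit.

        Returns the number of delivery attempts that were made.
        """
        for attempt in range(1, self.max_delivery_attempts + 1):
            try:
                handler(data)
            except Exception:
                continue  # a real service would wait for the backoff delay here
            return attempt  # acked and removed
        # Every attempt failed: reroute instead of retrying forever.
        self.dead_letter_topic.append(data)
        return self.max_delivery_attempts


def strict_handler(data: bytes) -> None:
    """Example handler that rejects one pathological payload."""
    if data == b"poison":
        raise ValueError("cannot process payload")


subscription = Subscription(max_delivery_attempts=5)
healthy_attempts = subscription.deliver(b"ok", strict_handler)
poison_attempts = subscription.deliver(b"poison", strict_handler)
```

The healthy message is acknowledged on its first attempt, while the poison message uses up all five attempts and ends up in `subscription.dead_letter_topic`.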

Baseline Configuration

Configure dead-letter routing directly in the subscriber declaration:

@broker.subscriber(
    alias="order-processor",
    topic_name="orders",
    subscription_name="orders-subscription",
    dead_letter_topic="orders-dlq",
    max_delivery_attempts=5,
    autocreate=True,
)
async def process_order(message: Message):
    await process_payment(message.data)

Parameters and Their Roles

Parameter	Role	Typical Decision Rule
`dead_letter_topic`	Destination for failed messages	Use `{topic}-dlq` or `{topic}-dlt` naming convention.
`max_delivery_attempts`	Retry ceiling before reroute	Start with `5` (the minimum); increase only if transient failures are common.
`autocreate`	Creates resources at startup	Keep `True` in local and dev environments; in production, follow platform policy.

Retry Dynamics and Backoff

Dead-letter topics are most effective when paired with deliberate retry pacing.

@broker.subscriber(
    alias="api-caller",
    topic_name="api-requests",
    subscription_name="api-requests-subscription",
    dead_letter_topic="api-requests-dlq",
    max_delivery_attempts=10,
    min_backoff_delay_secs=10,
    max_backoff_delay_secs=600,
    autocreate=True,
)
async def call_api(message: Message):
    await call_external_api(message.data)

With exponential backoff, transient outages receive time to recover while permanent failures are eventually quarantined.

Attempt	Approximate Wait Before Attempt
1	Immediate
2	~10 seconds
3	~20 seconds
4	~40 seconds
5	~80 seconds
6	~160 seconds
7	~320 seconds
8+	~600 seconds (capped at max_backoff_delay_secs)
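The waits in the table can be reproduced with a small helper, assuming the delay simply doubles after each failed attempt and is capped at the configured maximum. `backoff_schedule` is a hypothetical function written for this page, not part of any library; delays observed against the real service are approximate and may differ.

```python
def backoff_schedule(min_delay: float, max_delay: float, attempts: int) -> list[float]:
    """Return the wait before each delivery attempt under a doubling model.

    The first attempt is immediate; each later wait doubles the previous
    one, starting at min_delay and capped at max_delay.
    """
    waits = [0.0]
    delay = min_delay
    for _ in range(attempts - 1):
        waits.append(min(delay, max_delay))
        delay *= 2
    return waits[:attempts]


# Same settings as the api-caller subscriber above.
schedule = backoff_schedule(min_delay=10, max_delay=600, attempts=10)
```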

Handling Dead-Letter Traffic

A dead-letter topic should always have a dedicated consumer path. Without one, failures become invisible operational debt.

import logging

logger = logging.getLogger(__name__)


@broker.subscriber(
    alias="dlq-handler",
    topic_name="orders-dlq",
    subscription_name="orders-dlq-subscription",
    autocreate=True,
)
async def handle_failed_orders(message: Message):
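    # Log the failure with details; lazy %-formatting avoids building the
    # string when the log level is disabled.
    logger.error(
        "Message %s failed permanently",
        message.id,
        extra={
            # errors="replace" keeps the handler from raising on non-UTF-8 payloads
            "message_data": message.data.decode("utf-8", errors="replace"),
            "attributes": message.attributes,
            "delivery_attempt": message.delivery_attempt,
        },
    )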

    # Alert your operations team
    await send_alert_to_ops_team(message)

    # Store for later analysis
    await store_failed_message(message)

The handler should implement at least one of the following:

Alerting for immediate operator awareness.
Persistence for forensic analysis and replay workflows.
Enrichment with diagnostic context (correlation IDs, tenant, source service).
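As a rough illustration of the enrichment option, the sketch below flattens a dead-lettered message into a diagnostic record. `FailedMessage` is a hypothetical stand-in that only mirrors the fields used by the handler above, and the attribute names (`correlation_id`, `tenant`, `source_service`) are assumptions about what publishers attach.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FailedMessage:
    """Hypothetical stand-in carrying the fields used by the handler above."""

    id: str
    data: bytes
    attributes: dict[str, str] = field(default_factory=dict)
    delivery_attempt: Optional[int] = None


def build_failure_record(message: FailedMessage) -> dict[str, object]:
    """Flatten a dead-lettered message into a record for logs or storage."""
    return {
        "message_id": message.id,
        # errors="replace" keeps the triage path alive on malformed payloads
        "payload": message.data.decode("utf-8", errors="replace"),
        "delivery_attempt": message.delivery_attempt,
        # Attribute names below are assumptions; use whatever publishers set.
        "correlation_id": message.attributes.get("correlation_id", "unknown"),
        "tenant": message.attributes.get("tenant", "unknown"),
        "source_service": message.attributes.get("source_service", "unknown"),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_failure_record(
    FailedMessage(
        id="msg-1",
        data=b'{"order_id": "ord-1"}\xff',  # trailing byte is invalid UTF-8
        attributes={"correlation_id": "req-42", "tenant": "acme"},
        delivery_attempt=5,
    )
)
```

Missing attributes fall back to `"unknown"` rather than raising, since a handler that fails on the dead-letter path would hide the original failure.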

Operational Patterns

Alert + Persist Pattern

from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+

@broker.subscriber(
    alias="dlq-alert-store",
    topic_name="events-dlq",
    subscription_name="events-dlq-subscription",
)
async def handle_dlq_alert_store(message: Message):
    await slack_webhook.send(f"Failed message: {message.id}")
    await database.insert(
        "failed_messages",
        {
            "message_id": message.id,
            "data": message.data,
            "failed_at": datetime.now(UTC),
        },
    )
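Pub/Sub delivery is at-least-once, so a dead-letter handler may see the same message more than once. One way to keep the persistence step idempotent is to key rows on the message ID. The sketch below swaps the `database.insert` call above for the standard-library `sqlite3` module purely so that it runs on its own; the table layout and the `persist_failed_message` helper are assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def persist_failed_message(conn: sqlite3.Connection, message_id: str, data: bytes) -> bool:
    """Insert a failed message once; return False if it was already stored."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO failed_messages (message_id, data, failed_at) "
        "VALUES (?, ?, ?)",
        (message_id, data, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the row was ignored
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE failed_messages ("
    "message_id TEXT PRIMARY KEY, data BLOB NOT NULL, failed_at TEXT NOT NULL)"
)
first = persist_failed_message(conn, "msg-1", b"payload")
duplicate = persist_failed_message(conn, "msg-1", b"payload")
stored_rows = conn.execute("SELECT COUNT(*) FROM failed_messages").fetchone()[0]
```

The boolean return value lets the caller skip the alert on a duplicate delivery, so operators are not paged twice for the same message.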

Fallback Execution Pattern

@broker.subscriber(
    alias="dlq-retry",
    topic_name="payments-dlq",
    subscription_name="payments-dlq-subscription",
)
async def retry_with_fallback(message: Message):
    # Try a fallback payment processor
    await fallback_payment_service.process(message.data)

Manual Review Queue Pattern

@broker.subscriber(
    alias="dlq-review",
    topic_name="tickets-dlq",
    subscription_name="tickets-dlq-subscription",
)
async def queue_for_review(message: Message):
    await admin_dashboard.create_ticket(
        title=f"Failed message: {message.id}",
        data=message.data,
        priority="high",
    )

Validation with `PubSubTestClient`

PubSubTestClient is useful for verifying failure behavior locally (for example, that handler failures are observable in test results) without depending on real infrastructure.

@pytest.mark.asyncio
async def test_failed_message_reaches_error_result_stream() -> None:
    test_broker = PubSubBroker(project_id="test-project")

    @test_broker.subscriber(
        alias="always-fails",
        topic_name="orders",
        subscription_name="orders-subscription",
        dead_letter_topic="orders-dlq",
        max_delivery_attempts=5,
    )
    async def always_fails(_: Message) -> None:
        raise ValueError("invalid payload")

    async with PubSubTestClient(test_broker) as client:
        await client.publish(topic="orders", data={"order_id": "ord-1"})
        results = client.get_results()

    assert len(results) == 1
    assert isinstance(results[0].error, ValueError)

In-memory tests validate application behavior and error surfaces. The managed dead-letter routing itself should still be verified in an integration environment.

Design Recommendations

Choosing `max_delivery_attempts`

Keep values low when failures are deterministic (schema errors, impossible states).
Increase values when downstream dependencies are known to recover quickly.
Prefer explicit tuning over high defaults; large values delay incident visibility.
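To make the visibility cost concrete, the time a poison message spends in retry can be estimated by summing the waits between attempts. This is a sketch under a simplifying assumption (the delay doubles after each failure and is capped at the maximum); `seconds_until_dead_letter` is a hypothetical helper, and handler run time is ignored.

```python
def seconds_until_dead_letter(
    max_delivery_attempts: int, min_backoff: float, max_backoff: float
) -> float:
    """Sum the waits between attempts, assuming the delay doubles up to a cap.

    Handler run time is ignored, so this is a lower bound.
    """
    total = 0.0
    delay = min_backoff
    # There is one wait between each pair of consecutive attempts.
    for _ in range(max_delivery_attempts - 1):
        total += min(delay, max_backoff)
        delay *= 2
    return total


fast = seconds_until_dead_letter(5, min_backoff=10, max_backoff=600)
slow = seconds_until_dead_letter(10, min_backoff=10, max_backoff=600)
```

Under this model, with a 10 second minimum and 600 second maximum backoff, five attempts surface a poison message after about 150 seconds of waiting, while ten attempts take about 2,430 seconds (roughly 40 minutes).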

Naming Strategy

Use a consistent suffix and prefix it with the domain (bounded-context) name:

orders-dlq
payments-dlq
inventory-dlq

Consistent names simplify dashboards, alerts, and runbook lookup.
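A convention is easier to keep when names are derived rather than typed by hand. The helpers below are a minimal sketch; `dead_letter_topic_name` and `naming_violations` are hypothetical functions, and the `routes` mapping is made-up sample data.

```python
def dead_letter_topic_name(topic_name: str, suffix: str = "dlq") -> str:
    """Derive the dead-letter topic name from the source topic name."""
    return f"{topic_name}-{suffix}"


def naming_violations(routes: dict[str, str], suffix: str = "dlq") -> list[str]:
    """Return the source topics whose dead-letter topic breaks the convention."""
    return [
        topic
        for topic, dead_letter_topic in routes.items()
        if dead_letter_topic != dead_letter_topic_name(topic, suffix)
    ]


routes = {
    "orders": "orders-dlq",
    "payments": "payments-dlq",
    "inventory": "inventory-dead",  # deliberately inconsistent
}
violations = naming_violations(routes)
```

A check like `naming_violations` could run in CI against the declared subscribers so that a stray suffix is caught before it reaches dashboards.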

Monitoring Signals

Track:

Dead-letter message ingress rate.
Most frequent failure class.
Time-to-resolution per dead-letter message.
Replay success rate after remediation.
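Most of these signals can be derived from whatever store the dead-letter handler writes to. The sketch below assumes a hypothetical `DeadLetterRecord` row shape and computes three of them over made-up sample data; ingress rate is usually better taken from the platform's own topic metrics.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DeadLetterRecord:
    """Hypothetical row from a failed-message store."""

    failure_class: str
    failed_at: datetime
    resolved_at: Optional[datetime] = None
    replay_succeeded: Optional[bool] = None  # None means not replayed yet


def summarize(records: list[DeadLetterRecord]) -> dict[str, object]:
    """Compute triage metrics from dead-letter records."""
    classes = Counter(r.failure_class for r in records)
    resolved = [r for r in records if r.resolved_at is not None]
    replayed = [r for r in records if r.replay_succeeded is not None]
    resolution_secs = [(r.resolved_at - r.failed_at).total_seconds() for r in resolved]
    return {
        "top_failure_class": classes.most_common(1)[0][0] if classes else None,
        "mean_resolution_secs": (
            sum(resolution_secs) / len(resolution_secs) if resolution_secs else None
        ),
        "replay_success_rate": (
            sum(r.replay_succeeded for r in replayed) / len(replayed) if replayed else None
        ),
    }


start = datetime(2024, 1, 1, 12, 0)  # arbitrary sample timestamp
summary = summarize(
    [
        DeadLetterRecord("SchemaError", start, start + timedelta(minutes=10), True),
        DeadLetterRecord("SchemaError", start, start + timedelta(minutes=30), False),
        DeadLetterRecord("Timeout", start),
    ]
)
```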

Common Failure Modes

Configuring a dead-letter topic but never subscribing to it.
Setting max_delivery_attempts too high and delaying diagnosis.
Ignoring retry backoff, causing rapid failure loops.
Mixing naming conventions and losing traceability.
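Several of the failure modes above can be caught mechanically before deployment. The following is a sketch of such a check: `SubscriptionConfig` is a hypothetical summary of a subscriber declaration, and the ceiling of 20 attempts and the `-dlq` suffix are arbitrary choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubscriptionConfig:
    """Hypothetical summary of one subscriber declaration."""

    topic_name: str
    dead_letter_topic: Optional[str] = None
    max_delivery_attempts: int = 5
    min_backoff_delay_secs: Optional[int] = None


def lint(configs: list[SubscriptionConfig], attempts_ceiling: int = 20) -> list[str]:
    """Flag the failure modes listed above in a set of subscription configs."""
    consumed_topics = {c.topic_name for c in configs}
    problems = []
    for c in configs:
        if c.dead_letter_topic is None:
            continue  # nothing to check for subscribers without dead-lettering
        if c.dead_letter_topic not in consumed_topics:
            problems.append(f"{c.dead_letter_topic}: no subscriber consumes this topic")
        if c.max_delivery_attempts > attempts_ceiling:
            problems.append(f"{c.topic_name}: max_delivery_attempts above {attempts_ceiling}")
        if c.min_backoff_delay_secs is None:
            problems.append(f"{c.topic_name}: no retry backoff configured")
        if not c.dead_letter_topic.endswith("-dlq"):
            problems.append(f"{c.topic_name}: dead-letter topic lacks the -dlq suffix")
    return problems


problems = lint(
    [
        SubscriptionConfig("orders", "orders-dlq", 5, 10),
        SubscriptionConfig("orders-dlq"),  # consumer for the orders dead-letter topic
        SubscriptionConfig("payments", "payments-dead", 50),
    ]
)
```

In this sample, the `orders` subscriber passes every check, while `payments` triggers all four.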

Recap

Dead-letter topics isolate persistent failures from healthy traffic.
Configure dead_letter_topic and max_delivery_attempts per subscriber.
Pair dead-letter routing with retry backoff for controlled failure pacing.
Always consume and monitor the dead-letter topic.
Validate handler failure behavior early with PubSubTestClient, then verify managed routing in integration.