Dead-Letter Topics

Dead-letter topics (DLT) provide bounded failure handling for subscriptions that cannot process some messages successfully. Instead of retrying indefinitely, Pub/Sub redirects repeatedly failing messages to a separate topic after a configured number of delivery attempts.

In production systems, this mechanism protects healthy traffic from poison messages and creates an explicit queue for operational triage.

Conceptual Model

A message follows this lifecycle:

  1. It is delivered to a subscription.
  2. The handler succeeds and acknowledges the message, or fails and triggers a retry.
  3. Retries continue until max_delivery_attempts is reached.
  4. Pub/Sub forwards the message to the topic named in dead_letter_topic and stops redelivering it on the original subscription.

Dead-letter routing is therefore not a replacement for handler quality. It is a containment mechanism that keeps one pathological class of messages from degrading the whole subscription.

flowchart LR
    A[Message delivered] --> B{Handler success?}
    B -->|Yes| C[Acked and removed]
    B -->|No| D{Attempts < max?}
    D -->|Yes| E[Retry with backoff]
    E --> A
    D -->|No| F[Move to dead-letter topic]
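
The lifecycle above can be simulated in a few lines of plain Python. This is a minimal sketch, not the broker's implementation: SimulatedSubscription and reject_malformed are hypothetical names, backoff is omitted, and the dead-letter topic is modeled as a simple list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SimulatedSubscription:
    """Toy in-memory stand-in for a subscription with a dead-letter policy."""

    handler: Callable[[bytes], None]
    max_delivery_attempts: int = 5
    dead_letters: list[bytes] = field(default_factory=list)

    def deliver(self, data: bytes) -> int:
        """Deliver one message and return the number of attempts made."""
        for attempt in range(1, self.max_delivery_attempts + 1):
            try:
                self.handler(data)
            except Exception:
                continue  # handler failed: treated as a nack, so redeliver
            return attempt  # handler succeeded: acked and removed
        # Attempts exhausted: move the message to the dead-letter list.
        self.dead_letters.append(data)
        return self.max_delivery_attempts


def reject_malformed(data: bytes) -> None:
    # Hypothetical handler that fails deterministically on a poison payload.
    if data == b"poison":
        raise ValueError("invalid payload")


subscription = SimulatedSubscription(handler=reject_malformed)
healthy_attempts = subscription.deliver(b"ok")
poison_attempts = subscription.deliver(b"poison")
```

The healthy message is acknowledged on its first attempt, while the poison message uses all five attempts and ends up in dead_letters without blocking later deliveries.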

Baseline Configuration

Configure dead-letter routing directly in the subscriber declaration:

@broker.subscriber(
    alias="order-processor",
    topic_name="orders",
    subscription_name="orders-subscription",
    dead_letter_topic="orders-dlq",
    max_delivery_attempts=5,
    autocreate=True,
)
async def process_order(message: Message):
    await process_payment(message.data)

Parameters and Their Roles

| Parameter | Role | Typical Decision Rule |
| --- | --- | --- |
| dead_letter_topic | Destination for failed messages | Pick one suffix, such as {topic}-dlq or {topic}-dlt, and use it everywhere. |
| max_delivery_attempts | Retry ceiling before reroute | Start with 5 (the minimum); increase only if transient failures are common. |
| autocreate | Creates resources at startup | Keep True in local and dev environments; in prod, follow platform policy. |

Retry Dynamics and Backoff

Dead-letter topics are most effective when paired with deliberate retry pacing.

@broker.subscriber(
    alias="api-caller",
    topic_name="api-requests",
    subscription_name="api-requests-subscription",
    dead_letter_topic="api-requests-dlq",
    max_delivery_attempts=10,
    min_backoff_delay_secs=10,
    max_backoff_delay_secs=600,
    autocreate=True,
)
async def call_api(message: Message):
    await call_external_api(message.data)

With exponential backoff, downstream dependencies get time to recover from transient outages, while permanently failing messages are still quarantined once the attempt limit is reached.

| Attempt | Approximate Wait |
| --- | --- |
| 1 | Immediate |
| 2 | ~10 seconds |
| 3 | ~20 seconds |
| 4 | ~40 seconds |
| 5 | ~80 seconds |
| 6+ | Keeps growing, capped at 600 seconds (max_backoff_delay_secs) |
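
The waits in the table can be reproduced with a small helper. This is a sketch under one explicit assumption: the delay doubles after every failure and is capped at the maximum. The doubling factor is chosen to match the table; the service may pace retries differently within the configured bounds. backoff_schedule is a hypothetical name, not part of the broker API.

```python
def backoff_schedule(
    attempts: int,
    min_backoff_secs: float = 10,
    max_backoff_secs: float = 600,
) -> list[float]:
    """Return the approximate wait before each delivery attempt.

    Assumes plain doubling, capped at max_backoff_secs (an assumption,
    not a documented guarantee of the service).
    """
    waits = []
    for attempt in range(1, attempts + 1):
        if attempt == 1:
            waits.append(0)  # the first delivery is immediate
        else:
            doubled = min_backoff_secs * 2 ** (attempt - 2)
            waits.append(min(max_backoff_secs, doubled))
    return waits


schedule = backoff_schedule(attempts=10)
```

Under this assumption, the 600-second cap is first reached at attempt 8, and every later attempt waits the full 600 seconds.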

Handling Dead-Letter Traffic

A dead-letter topic should always have a dedicated consumer path. Without one, failed messages accumulate unseen and become operational debt.

@broker.subscriber(
    alias="dlq-handler",
    topic_name="orders-dlq",
    subscription_name="orders-dlq-subscription",
    autocreate=True,
)
async def handle_failed_orders(message: Message):
    # Log the failure with details
    logger.error(
        f"Message {message.id} failed permanently",
        extra={
            # errors="replace" keeps this handler from failing on non-UTF-8 payloads
            "message_data": message.data.decode("utf-8", errors="replace"),
            "attributes": message.attributes,
            "delivery_attempt": message.delivery_attempt,
        },
    )

    # Alert your operations team
    await send_alert_to_ops_team(message)

    # Store for later analysis
    await store_failed_message(message)

The handler should implement at least one of the following:

  • Alerting for immediate operator awareness.
  • Persistence for forensic analysis and replay workflows.
  • Enrichment with diagnostic context (correlation IDs, tenant, source service).
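
As an illustration of the persistence and enrichment options, the sketch below builds a storable record from a dead-lettered message. The function name and the correlation_id, tenant, and source_service attributes are hypothetical. The CloudPubSubDeadLetterSource* attribute names are the ones Google Cloud Pub/Sub is documented to attach when forwarding a message; verify them against your environment before relying on them.

```python
import json
from datetime import datetime, timezone


def build_failure_record(message_id: str, data: bytes, attributes: dict[str, str]) -> dict:
    """Combine a dead-lettered message with diagnostic context for storage."""
    return {
        "message_id": message_id,
        # errors="replace" keeps the record writable even for malformed bytes.
        "payload": data.decode("utf-8", errors="replace"),
        # Hypothetical application-level attributes set by the publisher.
        "correlation_id": attributes.get("correlation_id"),
        "tenant": attributes.get("tenant"),
        "source_service": attributes.get("source_service"),
        # Attributes assumed to be added by Pub/Sub on forwarding (verify).
        "source_subscription": attributes.get("CloudPubSubDeadLetterSourceSubscription"),
        "source_delivery_count": attributes.get("CloudPubSubDeadLetterSourceDeliveryCount"),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_failure_record(
    message_id="msg-1",
    data=b'{"order_id": "ord-1"}\xff',  # trailing byte is invalid UTF-8
    attributes={"correlation_id": "req-42", "tenant": "acme"},
)
serialized = json.dumps(record)
```

Missing attributes become None rather than raising, so the dead-letter handler itself does not fail on incomplete metadata.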

Operational Patterns

Alert + Persist Pattern

from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+


@broker.subscriber(
    alias="dlq-alert-store",
    topic_name="events-dlq",
    subscription_name="events-dlq-subscription",
)
async def handle_dlq_alert_store(message: Message):
    await slack_webhook.send(f"Failed message: {message.id}")
    await database.insert(
        "failed_messages",
        {
            "message_id": message.id,
            "data": message.data,
            "failed_at": datetime.now(UTC),
        },
    )

Fallback Execution Pattern

@broker.subscriber(
    alias="dlq-retry",
    topic_name="payments-dlq",
    subscription_name="payments-dlq-subscription",
)
async def retry_with_fallback(message: Message):
    # Try a fallback payment processor
    await fallback_payment_service.process(message.data)

Manual Review Queue Pattern

@broker.subscriber(
    alias="dlq-review",
    topic_name="tickets-dlq",
    subscription_name="tickets-dlq-subscription",
)
async def queue_for_review(message: Message):
    await admin_dashboard.create_ticket(
        title=f"Failed message: {message.id}",
        data=message.data,
        priority="high",
    )

Validation with PubSubTestClient

PubSubTestClient is useful for verifying failure behavior locally (for example, that handler failures are observable in test results) without depending on real infrastructure.

@pytest.mark.asyncio
async def test_failed_message_reaches_error_result_stream() -> None:
    test_broker = PubSubBroker(project_id="test-project")

    @test_broker.subscriber(
        alias="always-fails",
        topic_name="orders",
        subscription_name="orders-subscription",
        dead_letter_topic="orders-dlq",
        max_delivery_attempts=5,
    )
    async def always_fails(_: Message) -> None:
        raise ValueError("invalid payload")

    async with PubSubTestClient(test_broker) as client:
        await client.publish(topic="orders", data={"order_id": "ord-1"})
        results = client.get_results()

    assert len(results) == 1
    assert isinstance(results[0].error, ValueError)

In-memory tests validate only application behavior and error surfaces. The managed dead-letter routing itself should still be verified in an integration environment.

Design Recommendations

Choosing max_delivery_attempts

  • Keep values low when failures are deterministic (schema errors, impossible states).
  • Increase values when downstream dependencies are known to recover quickly.
  • Prefer explicit tuning over high defaults; large values delay incident visibility.
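
To make the visibility cost concrete, the sketch below estimates how long a permanently failing message keeps retrying before it is dead-lettered. It assumes the wait doubles after each failure up to the cap and ignores handler run time and acknowledgement deadlines, so the result is a rough lower bound. seconds_until_dead_letter is a hypothetical helper.

```python
def seconds_until_dead_letter(
    max_delivery_attempts: int,
    min_backoff_secs: int = 10,
    max_backoff_secs: int = 600,
) -> int:
    """Estimate the total backoff time before a message is dead-lettered.

    Assumes the wait doubles after each failure and ignores handler run time.
    """
    total = 0
    wait = min_backoff_secs
    # There is one wait between each pair of consecutive attempts.
    for _ in range(max_delivery_attempts - 1):
        total += min(wait, max_backoff_secs)
        wait *= 2
    return total


estimates = {n: seconds_until_dead_letter(n) for n in (5, 10, 100)}
```

With the 10 to 600 second backoff range used earlier, 5 attempts take about 2.5 minutes, 10 attempts about 40 minutes, and 100 attempts more than 15 hours before the failure becomes visible in the dead-letter topic.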

Naming Strategy

Use one consistent suffix and prefix it with the name of the domain or source topic:

  • orders-dlq
  • payments-dlq
  • inventory-dlq

Consistent names simplify dashboards, alerts, and runbook lookup.
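
One way to keep names consistent is to derive them in code instead of typing them by hand. The helpers below are a hypothetical sketch that assumes the -dlq suffix is the project-wide convention.

```python
DLQ_SUFFIX = "-dlq"  # assumed project-wide convention


def dead_letter_topic_for(topic_name: str) -> str:
    """Derive the dead-letter topic name from the source topic name."""
    if topic_name.endswith(DLQ_SUFFIX):
        raise ValueError(f"{topic_name!r} is already a dead-letter topic")
    return f"{topic_name}{DLQ_SUFFIX}"


def source_topic_for(dead_letter_topic: str) -> str:
    """Recover the source topic name, e.g. for a runbook or dashboard link."""
    if not dead_letter_topic.endswith(DLQ_SUFFIX):
        raise ValueError(f"{dead_letter_topic!r} does not follow the naming convention")
    return dead_letter_topic[: -len(DLQ_SUFFIX)]


names = [dead_letter_topic_for(t) for t in ("orders", "payments", "inventory")]
```

The reverse mapping is what makes alerts traceable: an alert on orders-dlq can link straight to the runbook for orders.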

Monitoring Signals

Track:

  • Dead-letter message ingress rate.
  • Most frequent failure class.
  • Time-to-resolution per dead-letter message.
  • Replay success rate after remediation.
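
If failed messages are persisted, these signals can be computed from the stored rows. The sketch below assumes a hypothetical DeadLetterRecord shape; in practice the fields would come from whatever store the dead-letter handler writes to.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadLetterRecord:
    """Hypothetical row from a failed-messages store."""

    failure_class: str
    failed_at: float  # epoch seconds
    resolved_at: Optional[float] = None  # None while unresolved
    replay_succeeded: Optional[bool] = None  # None if never replayed


def summarize(records: list[DeadLetterRecord], window_secs: float) -> dict:
    """Compute the four signals over one window of dead-letter records."""
    resolved = [r for r in records if r.resolved_at is not None]
    replayed = [r for r in records if r.replay_succeeded is not None]
    classes = Counter(r.failure_class for r in records)
    return {
        "ingress_per_sec": len(records) / window_secs,
        "top_failure_class": classes.most_common(1)[0][0] if classes else None,
        "mean_resolution_secs": (
            sum(r.resolved_at - r.failed_at for r in resolved) / len(resolved)
            if resolved
            else None
        ),
        "replay_success_rate": (
            sum(r.replay_succeeded for r in replayed) / len(replayed)
            if replayed
            else None
        ),
    }


summary = summarize(
    [
        DeadLetterRecord("ValidationError", 0.0, 120.0, True),
        DeadLetterRecord("ValidationError", 10.0, 70.0, False),
        DeadLetterRecord("TimeoutError", 20.0),
    ],
    window_secs=60.0,
)
```

Unresolved and never-replayed records are excluded from the resolution and replay figures instead of being counted as zero, which would otherwise skew both.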

Common Failure Modes

  • Configuring a dead-letter topic but never subscribing to it.
  • Setting max_delivery_attempts too high and delaying diagnosis.
  • Ignoring retry backoff, causing rapid failure loops.
  • Mixing naming conventions and losing traceability.
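
Most of these failure modes can be caught by a simple check over the subscriber configuration. The sketch below works on plain dictionaries that reuse the parameter names from the examples above. lint_subscriptions and the warning threshold of 20 attempts are hypothetical and should be adapted to your own policy.

```python
MAX_ATTEMPTS_WARN = 20  # hypothetical threshold, tune to your own policy


def lint_subscriptions(subscriptions: list[dict]) -> list[str]:
    """Return human-readable warnings for a list of subscription configs."""
    warnings = []
    consumed_topics = {s["topic_name"] for s in subscriptions}
    suffixes = set()
    for sub in subscriptions:
        name = sub["subscription_name"]
        dlt = sub.get("dead_letter_topic")
        if dlt is None:
            continue  # no dead-letter policy on this subscription
        suffixes.add(dlt.rsplit("-", 1)[-1])
        if dlt not in consumed_topics:
            warnings.append(f"{name}: nothing subscribes to {dlt}")
        attempts = sub.get("max_delivery_attempts", 5)
        if attempts > MAX_ATTEMPTS_WARN:
            warnings.append(f"{name}: max_delivery_attempts={attempts} delays diagnosis")
        if "min_backoff_delay_secs" not in sub:
            warnings.append(f"{name}: no explicit retry backoff")
    if len(suffixes) > 1:
        warnings.append(f"mixed dead-letter suffixes: {sorted(suffixes)}")
    return warnings


warnings = lint_subscriptions(
    [
        {
            "subscription_name": "orders-subscription",
            "topic_name": "orders",
            "dead_letter_topic": "orders-dlq",
            "max_delivery_attempts": 5,
            "min_backoff_delay_secs": 10,
        },
        {"subscription_name": "orders-dlq-subscription", "topic_name": "orders-dlq"},
        {
            "subscription_name": "payments-subscription",
            "topic_name": "payments",
            "dead_letter_topic": "payments-dlt",
            "max_delivery_attempts": 50,
        },
    ]
)
```

Here the orders configuration passes cleanly, while the payments subscription triggers all three per-subscription warnings and the mismatched -dlt suffix triggers the naming warning.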

Recap

  • Dead-letter topics isolate persistent failures from healthy traffic.
  • Configure dead_letter_topic and max_delivery_attempts per subscriber.
  • Pair dead-letter routing with retry backoff for controlled failure pacing.
  • Always consume and monitor the dead-letter topic.
  • Validate handler failure behavior early with PubSubTestClient, then verify managed routing in integration.