Production best practices

A short checklist for running ToRouter in production. None of these is exotic — they're the things people wish they'd done before the first incident.

1. One key per environment

Create separate keys for dev, staging, and prod. A leaked dev key with a tight quota cap is a nuisance; a leaked prod key with no cap is a bill.
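
One way to keep the keys apart in code is to resolve the key from the deployment environment at startup and refuse to start without it. The sketch below is illustrative only: the variable names (`TOROUTER_KEY_DEV` and so on) are hypothetical, not names ToRouter defines.

```python
import os

# Hypothetical environment variable names; ToRouter does not define these.
KEY_VARS = {
    "dev": "TOROUTER_KEY_DEV",
    "staging": "TOROUTER_KEY_STAGING",
    "prod": "TOROUTER_KEY_PROD",
}


def api_key_for(env_name, environ=os.environ):
    """Return the API key for one environment, failing loudly if it is missing."""
    if env_name not in KEY_VARS:
        raise ValueError(f"unknown environment: {env_name!r}")
    var = KEY_VARS[env_name]
    key = environ.get(var)
    if not key:
        # Refuse to fall back to another environment's key.
        raise RuntimeError(f"{var} is not set")
    return key
```

Failing at startup is deliberate: a service that quietly borrows the prod key in staging defeats the point of separate keys.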

2. Set IP allowlists on prod keys

In /keys, add the CIDR of your production egress (Fly.io regions, VPC NAT, k8s egress gateway, etc.). A leaked key is then unusable from any address outside the allowlist.
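
Before saving an allowlist, it can help to confirm that the egress addresses you expect are actually covered, so you do not lock out your own production traffic. A minimal sketch using Python's standard `ipaddress` module; the addresses are documentation-range placeholders, not real egress IPs.

```python
import ipaddress


def is_allowed(ip, allowlist):
    """Return True if `ip` falls inside any CIDR block in `allowlist`."""
    addr = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(cidr) for cidr in allowlist)
    return any(addr in net for net in networks)


# Placeholder CIDRs from the ranges reserved for documentation.
allowlist = ["203.0.113.0/24", "198.51.100.16/28"]
covered = is_allowed("203.0.113.7", allowlist)
```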

3. Cap spend per key

Give every key a spending ceiling in /keys (USD or CNY). If a runaway loop starts billing at a rate of $10,000 an hour, the gateway returns 429 once the key reaches its ceiling instead of silently charging through.

Rough starting points: prod around 90% of your monthly budget, staging on the order of tens of dollars, dev around ten dollars — tune in /keys to match your team.
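
As a worked example of those starting points, the sketch below turns a monthly budget into per-key caps. The fixed staging and dev amounts are assumptions standing in for "tens of dollars" and "around ten dollars"; adjust them to your team.

```python
def starting_caps(monthly_budget_usd):
    """Suggest per-key spend caps in USD from a monthly budget."""
    if monthly_budget_usd <= 0:
        raise ValueError("monthly budget must be positive")
    return {
        "prod": round(monthly_budget_usd * 0.90, 2),  # about 90% of the budget
        "staging": 50.0,  # assumed value for "tens of dollars"
        "dev": 10.0,      # "around ten dollars"
    }


caps = starting_caps(2000)
```

With a $2,000 monthly budget this suggests a prod cap of about $1,800, leaving headroom for the staging and dev keys.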

4. Pin model versions

# Good
model="claude-opus-4-7"
model="gpt-5.3-codex"

# Bad — silently changes behavior over time
model="claude"
model="gpt-4"

A pinned version lets you A/B against the next version on your schedule, not the provider's.
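
One way to make pinning stick is to keep the pinned IDs in a single place, so a version bump is a reviewed one-line change rather than a string scattered across call sites. A minimal sketch; the role names ("chat", "code") are hypothetical, and the model IDs are the ones from the example above.

```python
# Pinned model IDs live in one place. Role names are illustrative only.
PINNED_MODELS = {
    "chat": "claude-opus-4-7",
    "code": "gpt-5.3-codex",
}


def model_for(role):
    """Return the pinned model ID for a role, or fail if none is configured."""
    try:
        return PINNED_MODELS[role]
    except KeyError:
        raise ValueError(f"no pinned model for role {role!r}") from None
```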

5. Retry 429 and transient 5xx with exponential backoff

The OpenAI and Anthropic SDKs handle this when you set max_retries. For raw HTTP, see Rate-limited for a minimal implementation. Never retry other 4xx responses (400/401/403/404): the same response will come back.
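
For raw HTTP clients, the retry rule can be sketched as below. This is not ToRouter's reference implementation: `send` is a hypothetical zero-argument callable returning `(status_code, body)`, and the set of 5xx codes treated as transient is an assumption.

```python
import random
import time

# Assumed set of retryable statuses: 429 plus common transient 5xx codes.
RETRYABLE = {429, 500, 502, 503, 504}


def call_with_backoff(send, max_retries=4, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_retries:
            return status, body
        # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay,
        # with full jitter so many clients do not retry in lockstep.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
```

A 400, 401, 403, or 404 is returned to the caller on the first attempt, since it is not in the retryable set.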

6. Always log `x-request-id`

resp = client.chat.completions.with_raw_response.create(...)
print(resp.headers.get("x-request-id"))  # send this to your logger in real code
completion = resp.parse()  # the parsed ChatCompletion

When you need to file a support ticket, this is the one piece of data that lets us trace the request through gateway, scheduler, and upstream in seconds.

7. Pick a fallback model

If your primary is gpt-5 and it's globally down, you want code that tries claude-opus-4-7 (or another model in the same group) before paging an engineer:

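# Assumes `client` is the OpenAI SDK client used in the previous section.
from openai import APIConnectionError, InternalServerError

def chat(messages, primary="gpt-5", fallback="claude-opus-4-7"):
    try:
        return client.chat.completions.create(model=primary, messages=messages)
    except (APIConnectionError, InternalServerError):
        # Only outages trigger the fallback; a 400 or 401 would most likely fail on the fallback too.
        return client.chat.completions.create(model=fallback, messages=messages)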

8. Watch the dashboard and spend

Open /dashboard once a week — look for unusual spend spikes per model or per key.
Before traffic spikes, top up so you don't hit surprise 402s.
When spend looks wrong, drill into Usage details row by row.

Rotate keys quarterly even when there's no incident. A regular rotation makes leak detection (audit log mismatches, "wait, why is this key still in use?") trivial.
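
A quarterly rotation is easier to keep when something reminds you which keys are due. The sketch below assumes you maintain your own mapping of key names to creation dates; it does not call any ToRouter API, and the 90-day period is an approximation of "quarterly".

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)  # roughly one quarter


def keys_due_for_rotation(created_on, today):
    """Return the names of keys at least ROTATION_PERIOD old, sorted by name.

    `created_on` maps a key name to its creation date (your own inventory).
    """
    return sorted(
        name for name, created in created_on.items()
        if today - created >= ROTATION_PERIOD
    )


# Illustrative inventory; names and dates are made up.
inventory = {"prod": date(2025, 1, 1), "dev": date(2025, 3, 20)}
due = keys_due_for_rotation(inventory, today=date(2025, 4, 15))
```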

Next steps

Per-key limits

Configure the levers this page recommends.

Usage dashboard

Spend, quota, and trends.

Top up

Avoid 402 surprises.
