Retries & Error Handling
Defaults
Every schedule and job has two retry-related fields:

| Field | Default | Description |
|---|---|---|
| `max_retries` | 3 | Number of retry attempts after the first failure. `0` disables retries. |
| `timeout` | 30 | Seconds before an execution is considered timed out. |
Set these per schedule or per job at creation time, or update them on an existing schedule via PATCH.
What triggers a retry
| Event | Retries? | Why |
|---|---|---|
| Push: endpoint returns 5xx | Yes | Server error, likely transient |
| Push: endpoint times out | Yes | May be temporary overload |
| Push: network error (DNS, connection refused) | Yes | Infrastructure issue, likely transient |
| Push: endpoint returns 4xx | No | Client error. Won’t fix itself on retry |
| Pull: handler throws an error | Yes | Reported as failure, retry scheduled |
| Pull: lease expires (no result reported) | Yes | Worker may have crashed |
| Retries exhausted | No | Job transitions to failed permanently |
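
The table above can be read as a small decision function. The sketch below is illustrative only: the `outcome` shape, the `kind` values, and the name `shouldRetry` are hypothetical and not part of the Chronos API. It simply encodes the rows of the table.

```javascript
// Hypothetical outcome shape: { kind: string, status?: number }
// kind is one of: 'http_response', 'timeout', 'network_error',
// 'handler_error', 'lease_expired'. These names are assumptions for this sketch.
function shouldRetry(outcome, retriesRemaining) {
  // Once retries are exhausted the job becomes failed, whatever the cause.
  if (retriesRemaining <= 0) return false;

  switch (outcome.kind) {
    case 'http_response':
      // 5xx is treated as transient; 4xx will not fix itself on retry.
      return outcome.status >= 500 && outcome.status <= 599;
    case 'timeout':
    case 'network_error':
    case 'handler_error':
    case 'lease_expired':
      return true;
    default:
      // Anything the table does not cover is not retried in this sketch.
      return false;
  }
}

console.log(shouldRetry({ kind: 'http_response', status: 503 }, 2)); // true
console.log(shouldRetry({ kind: 'http_response', status: 404 }, 2)); // false
console.log(shouldRetry({ kind: 'lease_expired' }, 0));              // false
```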
Backoff formula
Retries use exponential backoff with jitter, capped at 1 hour:

```
baseDelay = min(1000ms × 2^(attempt - 1), 3,600,000ms)
jitter    = random(0, baseDelay)
delay     = min(baseDelay + jitter, 3,600,000ms)
```

| Attempt | Base delay | Actual range |
|---|---|---|
| 1 | 1s | 1-2s |
| 2 | 2s | 2-4s |
| 3 | 4s | 4-8s |
| 4 | 8s | 8-16s |
| 5 | 16s | 16-32s |
| 10 | ~8.5 min | 8.5-17 min |
| 13+ | 1 hour | exactly 1 hour (capped) |
The jitter prevents thundering-herd when many jobs fail simultaneously.
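
For reference, the formula can be written out as a few lines of JavaScript. This is a minimal sketch of the calculation as documented, not the scheduler's actual code; the function names and the injectable `random` parameter are assumptions added so the result can be checked deterministically.

```javascript
const CAP_MS = 3_600_000; // 1 hour

// Base delay for a 1-indexed attempt: 1s, 2s, 4s, ... capped at 1 hour.
function baseDelayMs(attempt) {
  return Math.min(1000 * 2 ** (attempt - 1), CAP_MS);
}

// Full delay = base + jitter, where jitter is in [0, base), capped at 1 hour.
// `random` defaults to Math.random and must return a number in [0, 1).
function retryDelayMs(attempt, random = Math.random) {
  const base = baseDelayMs(attempt);
  const jitter = random() * base;
  return Math.min(base + jitter, CAP_MS);
}

// Print the base delay and one sampled delay for a few attempts.
for (const attempt of [1, 2, 3, 13]) {
  console.log(attempt, baseDelayMs(attempt), Math.round(retryDelayMs(attempt)));
}
```

Passing a fixed `random` (for example `() => 0` or `() => 0.5`) makes the output reproducible, which is handy when unit-testing code that depends on retry timing.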
The retry flow
```
Execution fails
  │
  ├── Retries remaining? ──▶ Yes: schedule new execution after backoff delay
  │                          Job status → retrying
  │                          New execution created with trigger: system_retry
  │
  └── Retries exhausted? ──▶ Job status → failed (terminal)
```

Each retry creates a new execution. `ctx.attempt` tells your handler which attempt this is (1-indexed).
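
The same flow can be expressed as a tiny state function. This is a sketch under one stated assumption: a job gets one initial attempt plus up to `max_retries` retries, so the attempt counter runs from 1 to `max_retries + 1`. The function name and return shape are hypothetical.

```javascript
// attempt is 1-indexed (like ctx.attempt); maxRetries mirrors max_retries.
// Assumption: 1 initial attempt plus up to maxRetries retries.
function onExecutionFailed(attempt, maxRetries) {
  const retriesUsed = attempt - 1;
  if (retriesUsed < maxRetries) {
    return {
      jobStatus: 'retrying',
      nextExecution: { attempt: attempt + 1, trigger: 'system_retry' },
    };
  }
  // No retries left: the job ends in the terminal failed state.
  return { jobStatus: 'failed', nextExecution: null };
}

console.log(onExecutionFailed(1, 3).jobStatus); // retrying
console.log(onExecutionFailed(4, 3).jobStatus); // failed
console.log(onExecutionFailed(1, 0).jobStatus); // failed
```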
Configuring retries
Disable retries entirely

```json
{
  "name": "Fire-and-forget notification",
  "handler": "notify",
  "cron": "0 * * * *",
  "max_retries": 0
}
```

Increase retries for critical work

```json
{
  "name": "Monthly billing",
  "handler": "charge-customer",
  "cron": "0 0 1 * *",
  "max_retries": 8,
  "timeout": 60
}
```

With 8 retries, the final attempt is scheduled roughly 4 to 8.5 minutes after the first failure: the eight backoff delays (1s, 2s, 4s, ..., 128s, each up to doubled by jitter) sum to between 255 and 510 seconds, plus however long each execution runs.
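
To estimate how long a given `max_retries` value keeps a job alive, you can sum the backoff delays from the formula in the "Backoff formula" section. The sketch below is an illustration, not a Chronos API: `retryWindowMs` is a hypothetical helper, and it ignores the time each execution itself takes.

```javascript
const HOUR_MS = 3_600_000;

// Total waiting time across all retries, as a min/max range.
// Each delay lies between base (no jitter) and 2 × base (maximum jitter),
// and is capped at 1 hour.
function retryWindowMs(maxRetries) {
  let minMs = 0;
  let maxMs = 0;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const base = Math.min(1000 * 2 ** (attempt - 1), HOUR_MS);
    minMs += base;
    maxMs += Math.min(2 * base, HOUR_MS);
  }
  return { minMs, maxMs };
}

const windowFor8 = retryWindowMs(8);
console.log(windowFor8.minMs / 1000, windowFor8.maxMs / 1000); // 255 510
```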
Per-job override
One-off jobs can have different retry settings than their schedule pattern:

```bash
curl -X POST https://api.chronos.sh/v1/jobs \
  -H "Authorization: Bearer chrns_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Critical charge",
    "handler": "charge-customer",
    "max_retries": 10,
    "timeout": 120,
    "payload": { "invoiceId": "inv_123" }
  }'
```

Timeout behavior
Push delivery
If your endpoint doesn’t respond within `timeout` seconds, Chronos aborts the request and marks the execution as `timeout`. A retry is scheduled if attempts remain.
Pull delivery
The SDK does not enforce timeouts. Your handler runs as long as it needs. However, Chronos tracks a lease on the server:
- When a job is claimed, a lease is set: `lease_expires_at = now + timeout`
- A background sweep checks for expired leases every 30 seconds
- If the lease expired without a result, the execution is marked `timeout`
- A retry is scheduled if attempts remain
If your handler routinely takes longer than the default 30-second `timeout`, increase `timeout` so the lease doesn’t expire while you’re still working.
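
As a rough aid for choosing a value, the lease arithmetic can be sketched as follows. Both helpers are hypothetical: `leaseExpiresAt` restates the `now + timeout` rule from the list above, while `suggestTimeoutSeconds` and its 1.5 headroom factor are arbitrary illustrative choices, not a Chronos recommendation.

```javascript
// Lease expiry: claim time (ms since epoch) plus the timeout (in seconds).
function leaseExpiresAt(claimedAtMs, timeoutSeconds) {
  return claimedAtMs + timeoutSeconds * 1000;
}

// Hypothetical helper: derive a timeout from the worst handler duration you
// have observed, with some headroom. The default factor is an assumption.
function suggestTimeoutSeconds(worstCaseMs, headroom = 1.5) {
  return Math.ceil((worstCaseMs * headroom) / 1000);
}

const claimedAt = Date.parse('2024-01-01T00:00:00Z');
console.log(new Date(leaseExpiresAt(claimedAt, 30)).toISOString());
// 2024-01-01T00:00:30.000Z
console.log(suggestTimeoutSeconds(45_000)); // 68
```

Because expired leases are only detected by the periodic sweep, an execution may be marked `timeout` some seconds after the lease actually expired.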
Handling errors in SDK handlers
Your handler’s behavior determines the execution outcome:

```js
chronos.worker.handle('process-payment', async (ctx) => {
  // Throw to trigger a retry (if attempts remain)
  const result = await chargeCustomer(ctx.payload.customerId);
  if (!result.success) {
    throw new Error(`Charge failed: ${result.error}`);
  }

  // Return to mark as completed
  return { chargeId: result.id };
});
```

- Return a value (or void) → execution `completed`
- Throw an error → execution `failed`, error message captured (truncated to 4KB), retry scheduled if attempts remain
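
Since captured error messages are truncated to 4KB, it can help to keep thrown messages short and put the most useful detail first. The sketch below trims a message on the client side before throwing. It assumes "4KB" means 4096 bytes of UTF-8 and that the beginning of the message is what survives; neither detail is confirmed here, and `truncateUtf8` is a hypothetical helper, not part of the SDK.

```javascript
const MAX_ERROR_BYTES = 4096; // assumption: 4KB = 4096 bytes of UTF-8

// Trim a string so its UTF-8 encoding fits in maxBytes without leaving a
// broken multi-byte character at the end.
function truncateUtf8(message, maxBytes = MAX_ERROR_BYTES) {
  const bytes = new TextEncoder().encode(message);
  if (bytes.length <= maxBytes) return message;
  // A character cut at the boundary decodes to U+FFFD, which is stripped.
  const decoded = new TextDecoder().decode(bytes.subarray(0, maxBytes));
  return decoded.replace(/\uFFFD+$/, '');
}

const longMessage = 'Charge failed: ' + 'x'.repeat(10_000);
console.log(truncateUtf8(longMessage).length); // 4096
```

In a handler this would be used as `throw new Error(truncateUtf8(detail))`.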
Distinguish retryable vs terminal failures
If you know a failure is permanent (bad data, invalid state), you might want to avoid wasting retries. Because Chronos always retries on handler failure (until retries are exhausted), have your handler return normally for terminal cases instead of throwing:

```js
chronos.worker.handle('send-email', async (ctx) => {
  const user = await db.users.findById(ctx.payload.userId);

  // Terminal: user deleted, no point retrying
  if (!user) {
    console.warn(`User ${ctx.payload.userId} not found, skipping`);
    return { skipped: true, reason: 'user_not_found' };
  }

  // This might throw on transient network issues → retry is appropriate
  await emailService.send(user.email, ctx.payload.template);
  return { sent: true };
});
```

SDK error classes
When building around the SDK, these error types help you handle different failure modes:

| Error | When | Your action |
|---|---|---|
| `ChronosConfigError` | Invalid options at construction | Fix your config. Thrown at startup |
| `ChronosApiError` | API returned an error (non-2xx or `success: false`) | Check `.status` and `.code` |
| `ChronosNetworkError` | Fetch failed (DNS, TCP, TLS) | Transient. SDK retries poll automatically |
| `ChronosHandlerError` | Your handler threw | Logged internally, failure reported to API |
The SDK handles poll-loop errors internally (logs + retries after retryDelayMs). You don’t need to catch these. The worker keeps running.
Monitoring failures
Use the executions list endpoint to find failed jobs:

```bash
curl "https://api.chronos.sh/v1/executions?status=failed" \
  -H "Authorization: Bearer chrns_your_api_key"
```

Each failed execution includes:
- `error`: the error message (from your handler’s thrown error or the HTTP response)
- `response_code`: HTTP status code (push delivery only)
- `duration_ms`: how long the execution ran before failing
- `trigger`: `system` (first attempt) or `system_retry` (retry)
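
Once you have a list of failed executions, a small summary often makes patterns easier to spot. The sketch below works on plain objects carrying the fields listed above; the sample records are made up for illustration, `summarizeFailures` is a hypothetical helper, and the shape of the API response envelope around the list is not assumed here.

```javascript
// Count first attempts vs retries and group push failures by HTTP status.
function summarizeFailures(executions) {
  const summary = { firstAttempts: 0, retries: 0, byResponseCode: {} };
  for (const exec of executions) {
    // Anything that is not a system_retry is counted as a first attempt.
    if (exec.trigger === 'system_retry') summary.retries += 1;
    else summary.firstAttempts += 1;

    // response_code is only present for push delivery.
    if (exec.response_code != null) {
      const code = String(exec.response_code);
      summary.byResponseCode[code] = (summary.byResponseCode[code] || 0) + 1;
    }
  }
  return summary;
}

// Made-up sample data for illustration only.
const sample = [
  { error: 'Bad Gateway', response_code: 502, duration_ms: 120, trigger: 'system' },
  { error: 'Bad Gateway', response_code: 502, duration_ms: 95, trigger: 'system_retry' },
  { error: 'Charge failed: card_declined', duration_ms: 800, trigger: 'system_retry' },
];
console.log(summarizeFailures(sample));
```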