feat(worker): Implement exponential backoff for task retries #206

Closed
opened 2026-01-24 17:14:58 +00:00 by jack · 0 comments
Owner

Problem

The current retry strategy is simplistic: it only increments a counter and applies no delay:

// Current
task.retryCount++;
await this.taskRepo.update(task);
// Immediate retry → hammers failing services

Solution

1. Exponential backoff with jitter

// packages/worker/src/utils/retry.ts
export interface RetryConfig {
  initialDelayMs: number;   // Delay before the first retry
  maxDelayMs: number;       // Upper bound for the delay before jitter
  multiplier: number;       // Growth factor per retry
  jitterFactor: number;     // Random variation (0-1)
}

export const defaultRetryConfig: RetryConfig = {
  initialDelayMs: 1000,     // 1 second
  maxDelayMs: 60000,        // Max 1 minute
  multiplier: 2,            // Double on each retry
  jitterFactor: 0.2,        // ±20% variation
};

export function calculateRetryDelay(
  retryCount: number, 
  config: RetryConfig = defaultRetryConfig
): number {
  // Exponential: initialDelay * multiplier^retryCount
  const exponentialDelay = config.initialDelayMs * Math.pow(config.multiplier, retryCount);
  
  // Cap at max
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
  
  // Add jitter to prevent thundering herd
  const jitter = cappedDelay * config.jitterFactor * (Math.random() * 2 - 1);
  
  return Math.round(cappedDelay + jitter);
}
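Because the function calls Math.random() directly, its output is hard to assert in the tests the acceptance criteria ask for. One possible approach, sketched here, is to pass the random source as a parameter. The name calculateRetryDelayWith and the random parameter are hypothetical and not part of the proposal; the arithmetic is copied from the function above.

```typescript
// Sketch only: `calculateRetryDelayWith` and its `random` parameter are
// hypothetical; the formula itself is the one from calculateRetryDelay.
interface RetryConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitterFactor: number;
}

const defaultRetryConfig: RetryConfig = {
  initialDelayMs: 1000,
  maxDelayMs: 60000,
  multiplier: 2,
  jitterFactor: 0.2,
};

function calculateRetryDelayWith(
  retryCount: number,
  config: RetryConfig = defaultRetryConfig,
  random: () => number = Math.random, // injectable so tests can pin the jitter
): number {
  const exponentialDelay = config.initialDelayMs * Math.pow(config.multiplier, retryCount);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
  // random() in [0, 1) maps to a jitter in [-jitterFactor, +jitterFactor) of the capped delay
  const jitter = cappedDelay * config.jitterFactor * (random() * 2 - 1);
  return Math.round(cappedDelay + jitter);
}

// A fixed random source makes the result deterministic:
const noJitter = calculateRetryDelayWith(3, defaultRetryConfig, () => 0.5); // 8000
const lowerBound = calculateRetryDelayWith(3, defaultRetryConfig, () => 0); // 6400
```

With the default Math.random the behaviour matches the original function, so production callers would not need to change.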

2. Task with retryAfter

// Extend the Task entity
interface Task {
  // ... existing fields
  retryAfter?: number;  // Unix timestamp (ms, as returned by Date.now()) of the next attempt
}

// TaskDispatcher
async handleTaskFailed(taskId: string, error: Error): Promise<void> {
  const task = await this.taskRepo.findById(taskId);
  
  if (task.retryCount >= task.maxRetries) {
    await this.taskRepo.update(taskId, { 
      status: 'failed',
      error: error.message 
    });
    return;
  }
  
  const delay = calculateRetryDelay(task.retryCount);
  const retryAfter = Date.now() + delay;
  
  await this.taskRepo.update(taskId, {
    status: 'pending',
    retryCount: task.retryCount + 1,
    retryAfter,
    error: `Retry ${task.retryCount + 1}/${task.maxRetries}: ${error.message}`
  });
  
  logger.info('Task scheduled for retry', {
    taskId,
    retryCount: task.retryCount + 1,
    retryAfter: new Date(retryAfter).toISOString(),
    delayMs: delay
  });
}
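The branching in the handler above can also be expressed as a pure function that returns the update to persist, which keeps the retry decision testable without a repository or a clock. This is a minimal sketch: planRetry, RetryableTask, and TaskPatch are hypothetical names, and the delay is passed in rather than computed.

```typescript
// Sketch only: `planRetry`, `RetryableTask`, and `TaskPatch` are hypothetical.
interface RetryableTask {
  retryCount: number;
  maxRetries: number;
}

interface TaskPatch {
  status: 'pending' | 'failed';
  error: string;
  retryCount?: number;
  retryAfter?: number; // Unix timestamp in milliseconds
}

function planRetry(
  task: RetryableTask,
  errorMessage: string,
  delayMs: number,
  now: number,
): TaskPatch {
  // Retries exhausted: mark the task as permanently failed
  if (task.retryCount >= task.maxRetries) {
    return { status: 'failed', error: errorMessage };
  }

  // Otherwise re-queue it and record when it may run again
  const attempt = task.retryCount + 1;
  return {
    status: 'pending',
    retryCount: attempt,
    retryAfter: now + delayMs,
    error: `Retry ${attempt}/${task.maxRetries}: ${errorMessage}`,
  };
}

const patch = planRetry({ retryCount: 0, maxRetries: 3 }, 'boom', 1000, 5000);
// patch.retryAfter is 6000 and patch.error is "Retry 1/3: boom"
```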

3. Dispatcher respects retryAfter

// TaskDispatcher.dispatchPendingTasks()
async getPendingTasks(): Promise<Task[]> {
  return this.taskRepo.findMany({
    status: 'pending',
    $or: [
      { retryAfter: null },
      { retryAfter: { $lte: Date.now() } }
    ]
  }, {
    orderBy: { priority: 'DESC', createdAt: 'ASC' },
    limit: 10
  });
}
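The query semantics depend on the repository implementation, which the issue does not show. As an illustration of the intended rule only, the same filter and ordering can be sketched over an in-memory array. QueuedTask, isDue, and selectDueTasks are hypothetical names, and a numeric priority and createdAt are assumptions.

```typescript
// Sketch only: in-memory equivalent of the query above, with assumed field types.
interface QueuedTask {
  id: string;
  status: string;
  priority: number;
  createdAt: number;
  retryAfter?: number | null;
}

// A task is due when it is pending and either was never delayed
// or its retryAfter timestamp has been reached.
function isDue(task: QueuedTask, now: number): boolean {
  return task.status === 'pending' && (task.retryAfter == null || task.retryAfter <= now);
}

function selectDueTasks(tasks: QueuedTask[], now: number, limit = 10): QueuedTask[] {
  return tasks
    .filter((task) => isDue(task, now))
    // priority DESC, then createdAt ASC
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt)
    .slice(0, limit);
}
```

A task whose retryAfter lies in the future stays pending but is skipped until a later dispatch cycle.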

4. Per-Task-Type Retry Config

const retryConfigs: Record<TaskType, RetryConfig> = {
  observation: { initialDelayMs: 500, maxDelayMs: 30000, multiplier: 2, jitterFactor: 0.1 },
  embedding: { initialDelayMs: 2000, maxDelayMs: 120000, multiplier: 2, jitterFactor: 0.2 },
  'qdrant-sync': { initialDelayMs: 5000, maxDelayMs: 300000, multiplier: 2, jitterFactor: 0.3 },
  summarize: { initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.1 },
  'claude-md': { initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.1 },
};
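The handler in step 2 calls calculateRetryDelay with the default config, so the per-type table still needs a lookup. A minimal sketch of one follows; the TaskType union is inferred from the keys above, and getRetryConfig and isTaskType are hypothetical helpers that fall back to the defaults for unknown types.

```typescript
// Sketch only: `TaskType` is inferred from the table keys; `getRetryConfig`
// and `isTaskType` are hypothetical helpers.
interface RetryConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitterFactor: number;
}

type TaskType = 'observation' | 'embedding' | 'qdrant-sync' | 'summarize' | 'claude-md';

const defaultRetryConfig: RetryConfig = {
  initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.2,
};

const retryConfigs: Record<TaskType, RetryConfig> = {
  observation: { initialDelayMs: 500, maxDelayMs: 30000, multiplier: 2, jitterFactor: 0.1 },
  embedding: { initialDelayMs: 2000, maxDelayMs: 120000, multiplier: 2, jitterFactor: 0.2 },
  'qdrant-sync': { initialDelayMs: 5000, maxDelayMs: 300000, multiplier: 2, jitterFactor: 0.3 },
  summarize: { initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.1 },
  'claude-md': { initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.1 },
};

// Narrow an arbitrary string (e.g. a value read from storage) to a known task type
function isTaskType(value: string): value is TaskType {
  return Object.prototype.hasOwnProperty.call(retryConfigs, value);
}

// Unknown types fall back to the defaults instead of throwing
function getRetryConfig(type: string): RetryConfig {
  return isTaskType(type) ? retryConfigs[type] : defaultRetryConfig;
}
```

The failure handler could then call calculateRetryDelay(task.retryCount, getRetryConfig(task.type)), assuming the Task entity exposes its type under that field name.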

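Retry delay example

| Retry # | Delay (without jitter) | With ±20% jitter |
|---------|------------------------|------------------|
| 1       | 1s                     | 0.8s - 1.2s      |
| 2       | 2s                     | 1.6s - 2.4s      |
| 3       | 4s                     | 3.2s - 4.8s      |
| 4       | 8s                     | 6.4s - 9.6s      |
| 5       | 16s                    | 12.8s - 19.2s    |
| 6       | 32s                    | 25.6s - 38.4s    |
| 7+      | 60s (max)              | 48s - 72s        |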
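The rows above follow directly from the formula in step 1. As a sketch, the bounds can be computed like this; delayBoundsMs is a hypothetical helper, and retry number n corresponds to retryCount n - 1 at the time of the failure. Note that jitter is applied after the cap, so the upper bound in the last row exceeds maxDelayMs.

```typescript
// Sketch only: `delayBoundsMs` is a hypothetical helper that reproduces the table.
interface RetryConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitterFactor: number;
}

const defaultRetryConfig: RetryConfig = {
  initialDelayMs: 1000, maxDelayMs: 60000, multiplier: 2, jitterFactor: 0.2,
};

function delayBoundsMs(
  retryNumber: number, // 1-based, as in the table
  config: RetryConfig = defaultRetryConfig,
): { base: number; min: number; max: number } {
  const exponential = config.initialDelayMs * Math.pow(config.multiplier, retryNumber - 1);
  const base = Math.min(exponential, config.maxDelayMs);
  const spread = base * config.jitterFactor;
  return { base, min: Math.round(base - spread), max: Math.round(base + spread) };
}

// Print one line per retry, in seconds
for (let retry = 1; retry <= 7; retry++) {
  const { base, min, max } = delayBoundsMs(retry);
  console.log(`${retry}: ${base / 1000}s (${min / 1000}s - ${max / 1000}s)`);
}
```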

Acceptance criteria

  • calculateRetryDelay function with tests
  • retryAfter field on the Task entity
  • Dispatcher filters by retryAfter
  • Per-task-type configuration
  • Logging for retry scheduling
  • Jitter prevents thundering herd
jack closed this issue 2026-01-25 00:00:10 +00:00
Reference
customable/claude-mem#206