[Worker] Task queue backlog - 41+ tasks queued, workers not processing #298

Closed
opened 2026-01-25 17:02:27 +00:00 by jack · 0 comments
Owner

Problem

Task queue is backing up significantly (41+ tasks queued) and workers are not processing them fast enough. This causes:

  1. CLAUDE.md updates delayed by 18+ minutes
  2. Observations not being extracted in real-time
  3. Session context becoming stale

Symptoms

  • Hook logs show worker-service repeatedly failing to connect:
    [ws-client] WebSocket closed: 1006 - 
    [ws-client] Reconnecting in 5000ms (attempt 1/10)
    
  • Backend logs show TypeError: fetch failed errors for tasks
  • Queue grows faster than workers can process
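The fixed 5000ms reconnect interval in the log can itself add load when many workers drop at once and all retry in lockstep. One common alternative is exponential backoff; a minimal sketch, assuming a base delay of 5000ms (matching the log) and a 60s cap — the function name and constants are hypothetical, not the actual ws-client code:

```typescript
// Hypothetical reconnect schedule: exponential backoff from the observed
// 5000ms base, doubling per attempt, capped at 60s.
function reconnectDelayMs(attempt: number, baseMs = 5000, capMs = 60_000): number {
  // attempt 1 -> 5s, attempt 2 -> 10s, attempt 3 -> 20s, ... capped at 60s
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

Adding random jitter on top of this schedule would further spread out simultaneous reconnects.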

Potential Causes

  1. Worker connection issues: WebSocket connection failing (code 1006 = abnormal closure)
  2. External API failures: Mistral/Anthropic API timeouts causing task retries
  3. Insufficient worker capacity: Not enough workers for the workload
  4. Task retry storms: Failed tasks retrying and clogging the queue
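Cause 4 (retry storms) is commonly mitigated by capping attempts per task and routing repeat failures to a dead-letter path instead of re-queueing them. A minimal sketch under that assumption — the Task shape and routeFailedTask helper are illustrative, not the real queue code:

```typescript
// Hypothetical retry guard: a task that fails maxAttempts times is
// dead-lettered rather than re-queued, so it stops clogging the queue.
interface Task {
  id: string;
  attempts: number; // failures so far
}

function routeFailedTask(task: Task, maxAttempts = 3): "retry" | "dead-letter" {
  task.attempts += 1;
  return task.attempts < maxAttempts ? "retry" : "dead-letter";
}
```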

Investigation Needed

  1. Check why workers are disconnecting (1006 error)
  2. Check external API availability (Mistral)
  3. Review task failure patterns in logs
  4. Consider increasing worker count or implementing better backpressure
Related

  • Issue #205 - Task backpressure mechanism (may need tuning)
  • Backend logs show multiple TypeError: fetch failed errors around 15:57-16:40