[Worker] Task queue backlog - 41+ tasks queued, workers not processing #298

Closed
opened 2026-01-25 17:02:27 +00:00 by jack · 0 comments
Owner

Problem

Task queue is backing up significantly (41+ tasks queued) and workers are not processing them fast enough. This causes:

  1. CLAUDE.md updates delayed by 18+ minutes
  2. Observations not being extracted in real-time
  3. Session context becoming stale

Symptoms

  • Hook logs show worker-service repeatedly failing to connect:
    [ws-client] WebSocket closed: 1006 - 
    [ws-client] Reconnecting in 5000ms (attempt 1/10)
    
  • Backend logs show TypeError: fetch failed errors for tasks
  • Queue grows faster than workers can process
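The fixed 5000ms reconnect interval in the log can itself add load when many workers drop at once and all retry in lockstep. One common alternative is exponential backoff; a minimal sketch, assuming a base delay of 5000ms (matching the log) and a 60s cap — the function name and constants are hypothetical, not the actual ws-client code:

```typescript
// Hypothetical reconnect schedule: exponential backoff from the observed
// 5000ms base, doubling per attempt, capped at 60s.
function reconnectDelayMs(attempt: number, baseMs = 5000, capMs = 60_000): number {
  // attempt 1 -> 5s, attempt 2 -> 10s, attempt 3 -> 20s, ... capped at 60s
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

Adding random jitter on top of this schedule would further spread out simultaneous reconnects.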

Potential Causes

  1. Worker connection issues: WebSocket connection failing (code 1006 = abnormal closure)
  2. External API failures: Mistral/Anthropic API timeouts causing task retries
  3. Insufficient worker capacity: Not enough workers for the workload
  4. Task retry storms: Failed tasks retrying and clogging the queue
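Cause 4 (retry storms) is commonly mitigated by capping attempts per task and routing repeat failures to a dead-letter path instead of re-queueing them. A minimal sketch under that assumption — the Task shape and routeFailedTask helper are illustrative, not the real queue code:

```typescript
// Hypothetical retry guard: a task that fails maxAttempts times is
// dead-lettered rather than re-queued, so it stops clogging the queue.
interface Task {
  id: string;
  attempts: number; // failures so far
}

function routeFailedTask(task: Task, maxAttempts = 3): "retry" | "dead-letter" {
  task.attempts += 1;
  return task.attempts < maxAttempts ? "retry" : "dead-letter";
}
```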

Investigation Needed

  1. Check why workers are disconnecting (1006 error)
  2. Check external API availability (Mistral)
  3. Review task failure patterns in logs
  4. Consider increasing worker count or implementing better backpressure
Related

  • Issue #205 - Task backpressure mechanism (may need tuning)
  • Backend logs show multiple TypeError: fetch failed errors around 15:57-16:40