v0.2 · substrate hardened · alpha
swarm-lib
The substrate your agents have been missing.
A small Python + Bash library that turns long, multi-step agentic work into a queue of atomic tasks that survive compaction, crashes, rate limits, and process restarts. Three primitives — atomic-rename queueing, status.json checkpointing, a generic worker loop — give you durable handoff between fresh LLM contexts. No broker. No daemon. No database. Just POSIX and JSON.
If you've watched Claude Code hit compaction mid-task, burned an hour of opus tokens waiting for a subprocess, or shipped an agent pipeline that quietly dies when the chat thread closes — this is the fix.
The three problems that kill agentic workflows
Every agent system that grows past a toy hits these. They're not bugs in your code. They're structural problems in how LLM-driven work is wired today.
1. Context starvation
You build a workflow as one long Claude Code conversation: "first do X, then Y, then Z, then summarize." Halfway through Y, the context window approaches its limit, compaction fires, and the model now has a lossy summary of what just happened instead of the actual artifacts. Z gets a confused result and Y silently drifts.
Root cause: chat history is being used as program state. State that's volatile, lossy under compression, and tied to a single process's lifetime.
2. Synchronous tool-call blocking
Your planner agent is running on opus. It decomposes a task into sub-tasks and then... waits. It holds the expensive context window open while subprocesses, model calls, or external APIs churn for minutes. You burn tokens at idle because the planner can't release its window until the children return.
Root cause: synchronous orchestration. The high-context agent is treated as a coordinator that blocks on its workers.
3. Chat-history-as-state
Your agent runs as a long-lived conversation. It crashes — rate limit, network blip, user closes the tab, the laptop sleeps. When it comes back, there's no durable record of "where am I in the work." The agent either restarts from zero, replays everything redundantly, or invents a plausible-looking continuation that drifts from reality.
Root cause: no source of truth outside the conversation. If the conversation dies, the work dies.
Three primitives, 30 years of UNIX discipline
swarm-lib gives you the same substrate UNIX shops have used since the 90s — Maildir, cron + lock files, /var/spool/ — applied to LLM-driven agent work.
1 / atomic queueing
Atomic-rename task queueing
Producers stage tasks under pending/<task_id>.json. Consumers race for them via:
os.replace(
"pending/<task_id>.json",
"claimed/<worker_id>/<task_id>.json",
)
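Python's os.replace wraps rename(2), which POSIX specifies as atomic when source and destination are on the same filesystem. When two consumers race for the same task, exactly one rename succeeds and the loser gets FileNotFoundError. No locks, no broker, no leader election. The filesystem is the coordinator.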
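The race described above can be sketched with nothing but the standard library. This is an illustration of the technique, not swarm-lib's own code: the helper names `try_claim_file` and `race` are hypothetical, and the directory layout simply mirrors the pending/ and claimed/<worker_id>/ paths shown above.

```python
from __future__ import annotations

import os
import tempfile
import threading
from pathlib import Path


def try_claim_file(run_dir: Path, task_id: str, worker_id: str) -> Path | None:
    """Try to move pending/<task_id>.json into claimed/<worker_id>/.

    Returns the claimed path on success, or None if another worker won.
    (Hypothetical helper, not part of swarm_lib.)
    """
    src = run_dir / "pending" / f"{task_id}.json"
    dst_dir = run_dir / "claimed" / worker_id
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / f"{task_id}.json"
    try:
        os.replace(src, dst)  # atomic rename on a single filesystem
    except FileNotFoundError:
        return None  # source already gone: another worker claimed it
    return dst


def race(n_workers: int = 8) -> list[str]:
    """Race n workers for a single task and return the ids of the winners."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        (run_dir / "pending").mkdir()
        (run_dir / "pending" / "t.1.json").write_text('{"task_id": "t.1"}')
        winners: list[str] = []
        lock = threading.Lock()

        def worker(worker_id: str) -> None:
            if try_claim_file(run_dir, "t.1", worker_id) is not None:
                with lock:
                    winners.append(worker_id)

        threads = [
            threading.Thread(target=worker, args=(f"w.{i}",))
            for i in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return winners
```

However many workers race, `race()` should return a list with exactly one worker id, because only one rename can find the source file still in place.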
2 / durable handoff
status.json checkpointing
Every workflow keeps its state in a single JSON file at the run directory's root:
{
"schema_version": "0.1",
"run_id": "audit-r3x2",
"checkpoint": {
"summary": "Completed plan; ready for implement stage",
"next_step": "Invoke implement skill with plan output as input",
"next_task_id": "t.audit-r3x2.implement",
"completed_tasks": ["t.audit-r3x2.intent", "t.audit-r3x2.plan"],
"current_worker": null,
"timestamp": "2026-05-21T18:45:11Z"
}
}
Any fresh agent (a new Claude Code session, an ollama worker, a shell script, a cron job) resumes by reading this file. Compactions, crashes, rate limits, multi-day pauses, and machine reboots all become indistinguishable from a clean restart. Chat history is volatile; the file is the contract.
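A minimal sketch of the checkpoint contract, assuming the status.json shape shown above. The functions `write_status` and `resume_point` are hypothetical names for illustration and are not the swarm_lib.status API; the write goes through a temp file plus os.replace so a reader never sees a half-written file.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def write_status(run_dir: Path, status: dict) -> None:
    """Write status.json via temp file + rename so readers never see a partial file."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # The temp file lives in run_dir so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=run_dir, prefix=".status.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, run_dir / "status.json")
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def resume_point(run_dir: Path) -> tuple[str | None, str]:
    """Return (next_task_id, next_step) from the checkpoint in status.json."""
    status = json.loads((run_dir / "status.json").read_text())
    checkpoint = status["checkpoint"]
    return checkpoint.get("next_task_id"), checkpoint["next_step"]
```

A fresh process needs only `resume_point(run_dir)` to learn what to do next; nothing from the previous conversation is required.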
3 / generic worker
worker_loop.sh
A 200-line bash loop that polls a run directory, atomically claims tasks, invokes any handler executable with the task JSON on stdin, and moves results to done/ or failed/.
Workers are interchangeable. Any process that reads a JSON file from stdin and writes to disk is a participant: Claude Code, Codex, ollama, n8n, plain shell, future LLM tools that don't exist yet.
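To make the worker contract concrete, here is one iteration of such a loop sketched in Python rather than Bash. It is an assumption-laden illustration and not a port of worker_loop.sh: `run_one` is a hypothetical name, and the only behavior taken from the description above is claim by rename, task JSON on the handler's stdin, and a move to done/ or failed/ based on the exit code.

```python
from __future__ import annotations

import os
import subprocess
from pathlib import Path


def run_one(run_dir: Path, worker_id: str, handler: list[str]) -> str | None:
    """Claim at most one task, run the handler on it, and file the result.

    Returns "done" or "failed", or None when nothing could be claimed.
    """
    claimed_dir = run_dir / "claimed" / worker_id
    claimed_dir.mkdir(parents=True, exist_ok=True)
    for name in ("done", "failed"):
        (run_dir / name).mkdir(parents=True, exist_ok=True)

    for candidate in sorted((run_dir / "pending").glob("*.json")):
        claimed = claimed_dir / candidate.name
        try:
            os.replace(candidate, claimed)  # atomic claim
        except FileNotFoundError:
            continue  # lost the race for this task; try the next one

        # The handler receives the task JSON on stdin, as described above.
        proc = subprocess.run(handler, input=claimed.read_bytes())
        outcome = "done" if proc.returncode == 0 else "failed"
        os.replace(claimed, run_dir / outcome / claimed.name)
        return outcome
    return None
```

Because the handler is just an argv list, anything that reads stdin qualifies: a shell script, a Python one-liner, or a wrapper around an LLM CLI.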
A background heartbeat keeper writes .heartbeat while the worker is alive. A separate swarm-cli reap (cron-driven) returns stale claims to pending/ when the heartbeat falls behind. Orphan recovery without a coordinator.
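The heartbeat-and-reap idea can be sketched as follows. The real `swarm-cli reap` may differ; in particular, placing .heartbeat inside claimed/<worker_id>/ and using the file's mtime as the liveness signal are assumptions made for this example, and `touch_heartbeat` and `reap_stale` are hypothetical names.

```python
from __future__ import annotations

import os
import time
from pathlib import Path


def touch_heartbeat(worker_dir: Path) -> None:
    """Record that the worker owning worker_dir is still alive."""
    worker_dir.mkdir(parents=True, exist_ok=True)
    (worker_dir / ".heartbeat").write_text(str(time.time()))


def reap_stale(run_dir: Path, stale_after: float) -> list[str]:
    """Move claims of workers with an old heartbeat back to pending/.

    Returns the task ids that were returned to the queue.
    """
    pending = run_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    claimed_root = run_dir / "claimed"
    returned: list[str] = []
    if not claimed_root.is_dir():
        return returned

    now = time.time()
    for worker_dir in sorted(claimed_root.iterdir()):
        if not worker_dir.is_dir():
            continue
        try:
            age = now - (worker_dir / ".heartbeat").stat().st_mtime
        except FileNotFoundError:
            # Assumption: no heartbeat at all counts as stale. A real reaper
            # would need a grace period for workers that have just started.
            age = float("inf")
        if age < stale_after:
            continue
        for claim in sorted(worker_dir.glob("*.json")):
            try:
                os.replace(claim, pending / claim.name)
            except FileNotFoundError:
                continue  # the worker finished it, or another reaper got there first
            returned.append(claim.stem)
    return returned
```

The reaper uses the same atomic rename as the claim path, so a task is never in two places at once, even if two reapers overlap.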
the core discipline
The Yield Rule
A high-context agent (planner) never blocks waiting on subprocess output. It:
- Decomposes the goal into atomic tasks
- Writes each task's payload to pending/<task_id>.json atomically
- Exits immediately, freeing its context window
Fresh consumer loops pick up the tasks. The planner can be re-invoked later from status.json if needed. No expensive planner sitting idle. No tokens burning on I/O.
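In code, the Yield Rule amounts to "write, rename, return". The sketch below assumes a staging directory named tmp/ under the run directory and a simple task shape; both are illustrative choices, and `enqueue_atomically` and `plan_and_yield` are hypothetical names rather than swarm-lib functions.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def enqueue_atomically(run_dir: Path, task_id: str, task_type: str, payload: dict) -> Path:
    """Stage the task outside pending/, then rename it into place.

    Consumers listing pending/*.json therefore never see a half-written file.
    """
    pending = run_dir / "pending"
    staging = run_dir / "tmp"  # assumed staging dir, same filesystem as pending/
    pending.mkdir(parents=True, exist_ok=True)
    staging.mkdir(parents=True, exist_ok=True)

    task = {"task_id": task_id, "task_type": task_type, "payload": payload}
    fd, tmp_path = tempfile.mkstemp(dir=staging, suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(task, f)
    dst = pending / f"{task_id}.json"
    os.replace(tmp_path, dst)
    return dst


def plan_and_yield(run_dir: Path, goal: str, steps: list[str]) -> list[str]:
    """Decompose a goal into tasks, enqueue them, and return without waiting."""
    task_ids = []
    for i, step in enumerate(steps, start=1):
        task_id = f"t.{i}"
        enqueue_atomically(run_dir, task_id, "step", {"goal": goal, "step": step})
        task_ids.append(task_id)
    # The planner returns here. It never polls or blocks on worker output.
    return task_ids
```

The function has no wait loop by design: once the last rename lands, the planner process can exit and release its context window.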
60-second tour
Enqueue a task, run a worker against it, check what happened. That's the whole loop.
Install
git clone https://github.com/dpdanpittman/swarm-lib
cd swarm-lib
pip install -e .
# 'swarm-cli' is now on PATH
Enqueue → run → check
# 1. Initialize a run + enqueue a task
mkdir -p ~/.swarm/hello
swarm-cli status-init --run-dir ~/.swarm/hello --run-id hello \
--summary "Hello-world test run" --next-task-id t.1
swarm-cli enqueue --run-dir ~/.swarm/hello \
--task-id t.1 --task-type greet \
--payload '{"who": "world"}'
# 2. Write a handler (gets task JSON on stdin, writes to artifact path)
cat > /tmp/handler.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
TASK_JSON=$(cat)
WHO=$(echo "$TASK_JSON" | jq -r '.payload.who')
echo "# Hello, $WHO" > "$SWARM_ARTIFACT_PATH"
EOF
chmod +x /tmp/handler.sh
# 3. Run a worker — one iteration, then exit
swarm_lib/worker_loop.sh \
--run-dir ~/.swarm/hello \
--worker-id w.demo \
--handler /tmp/handler.sh \
--max-iterations 1
# 4. Check what happened
swarm-cli ls --run-dir ~/.swarm/hello
cat ~/.swarm/hello/artifacts/t.1.md
From Python
import os

from swarm_lib import claims, status

# Expand "~" explicitly; Python does not expand it in plain strings.
run_dir = os.path.expanduser("~/.swarm/hello-py")
status.initialize(run_dir, run_id="hello-py")
claims.enqueue(run_dir, task_id="t.1", task_type="greet",
               payload={"who": "world"})
task = claims.try_claim(run_dir, worker_id="w.demo")
if task is not None:
    # ... do work, write artifacts under run_dir/artifacts/ ...
    claims.complete(task, success=True)
When you actually want this
swarm-lib is purpose-built for these shapes. If your problem looks like one of these, it'll save you real time and tokens.
Multi-step work that exceeds one context window
Audit 16 repos for a standard. Each gets a fresh context window. 16 tasks + 1 synthesize task with depends_on set to all 16. Compaction-immune.
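The fan-out-then-synthesize shape can be sketched as plain data. Only the depends_on field name comes from the text above; the rest of the task layout, the task id scheme, and the helper names `fan_out` and `is_ready` are assumptions made for illustration.

```python
from __future__ import annotations


def fan_out(repos: list[str], run_id: str) -> list[dict]:
    """Build one audit task per repo plus a synthesize task that waits on all of them."""
    audits = [
        {
            "task_id": f"t.{run_id}.audit.{i}",
            "task_type": "audit",
            "payload": {"repo": repo},
            "depends_on": [],
        }
        for i, repo in enumerate(repos, start=1)
    ]
    synthesize = {
        "task_id": f"t.{run_id}.synthesize",
        "task_type": "synthesize",
        "payload": {},
        "depends_on": [task["task_id"] for task in audits],
    }
    return audits + [synthesize]


def is_ready(task: dict, completed: set[str]) -> bool:
    """A task is runnable once every id in depends_on has completed."""
    return all(dep in completed for dep in task.get("depends_on", []))
```

With 16 repos this yields 17 tasks; the synthesize task stays blocked until all 16 audit ids appear in the completed set.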
Chains too long for a single agent
Tribunal-style pipelines: intent → plan → implement → review → verify → classify → incentive. Seven stages, each in a clean context. Artifact from stage N feeds stage N+1 — no lossy summarization, no drift.
Background work while you do something else
"Draft the Slack digest while I sleep." Enqueue + exit. A worker on a server handles it overnight. No keep-alive conversation, no token burn at idle. The artifact is there by morning.
Cost-optimized model routing (HMD)
90% of work on haiku, only escalate the genuinely hard 10% to opus. A classify task decides; it either resolves inline or enqueues an opus follow-up. See examples/hmd-triage/.
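As a rough illustration of that routing decision (not the code in examples/hmd-triage/, which may look quite different), a classify step can return either an inline result or a follow-up task for the larger model. The `model_tier` field, the `.escalated` id suffix, and the callables passed in are all hypothetical.

```python
from __future__ import annotations

from typing import Callable


def triage(
    task: dict,
    is_hard: Callable[[dict], bool],
    solve_cheaply: Callable[[dict], str],
) -> dict:
    """Resolve easy tasks inline; describe an escalation task for hard ones."""
    if is_hard(task):
        follow_up = {
            "task_id": task["task_id"] + ".escalated",
            "task_type": task["task_type"],
            "model_tier": "opus",  # assumed field name for routing
            "payload": task["payload"],
        }
        return {"action": "enqueue", "task": follow_up}
    return {"action": "resolved", "result": solve_cheaply(task)}
```

The caller would write the returned follow-up task into pending/ and exit, so the cheap model never waits on the expensive one.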
Cross-tool federation
n8n triggers on a schedule, writes a task. A Claude Code worker claims, reasons, writes an artifact. Another n8n flow posts the artifact to Slack. Each tool in its lane; the filesystem is the contract.
Scheduled work that survives restarts
Daily blog pipeline: research → draft → edit → publish. Each step a task. If step 3 fails, it lands in failed/ and tomorrow's run is unaffected. No brittle n8n chain to manually re-run.
when not to use it
- Single-shot, sub-second requests: just call the model directly.
- Stateful interactive REPL-style sessions where the model needs continuous tool access.
- High-throughput non-LLM task queues: Celery, RQ, or SQS are built for that.
- Cross-machine consistency without a shared filesystem: v0.3 territory.
Why swarm-lib and not X
There are great tools for adjacent problems. Here's the honest comparison.
| Tool | Built for | Where swarm-lib is better |
|---|---|---|
| Celery / RQ / SQS | High-throughput async jobs in long-running web apps | LLM workflows need durable state across restarts, not just task delivery. status.json survives the worker dying mid-task in a way Celery's transient state doesn't. |
| Airflow / Prefect / Dagster | DAG-based data pipelines | swarm-lib is ~1000 LoC and runs anywhere a filesystem exists. Airflow is a service with a database, a scheduler, and a UI. Different size class. |
| Temporal | Durable workflow execution with versioning | swarm-lib's primitives are 90% of what Temporal gives you, in a form you can read end-to-end in an afternoon, with no SDK lock-in. Temporal wins at production scale with rich observability. |
| LangChain / LangGraph | Composing LLM calls into chains/graphs | swarm-lib is one level below — it's the substrate a LangGraph could be built on, not a competitor to it. The Yield Rule says don't keep the planner loaded; emit tasks and exit. |
| CrewAI / AutoGen / agno | Multi-agent role-based frameworks | These run agents in-process and treat conversation as state. swarm-lib externalizes state to disk so any process can be an agent. Lower-level + more durable. |
| GitHub Actions | CI/CD as YAML workflows | The Yield Rule mirrors workflow_call. But GH Actions is locked to GitHub's runner pool. swarm-lib runs anywhere. |
| One chat thread | Interactive, single-thread agentic work | When your work outgrows a single thread, you need durable handoff. swarm-lib is the upgrade path. |
The honest summary: swarm-lib is less powerful than Celery, Temporal, or Airflow for traditional async work. It is a better fit than those tools for LLM-driven agentic work specifically, because it is purpose-built around the constraints that matter there: state that survives the loss of chat history, awareness of context-window limits, model-tier routing, and workers that are interchangeable across LLM tools.
What's in v0.2
Substrate is hardened. 42 tests passing, including multi-worker correctness under contention.
swarm_lib.claims
enqueue · try_claim · complete
POSIX atomic-rename queue + cross-filesystem startup check
swarm_lib.status
initialize · read · write · append_completed
Guarded by an advisory fcntl.flock lock against concurrent writers
swarm_lib.orphan
write_heartbeat · reap
Stale claims return to pending/ when the reaper runs
swarm-cli
enqueue · claim · complete · status-{init,show,write} · heartbeat · reap · ls
jq-friendly single-line JSON output for shell integration
worker_loop.sh
Generic consumer loop, handler-agnostic
Background heartbeat keeper + SWARM_LOG_PATH streaming
examples/
seven-step-chain · hmd-triage
Tribunal-shaped reference + cost-routing pattern
Production shape
What a real swarm-lib deployment looks like. systemd for workers, cron for the reaper, the filesystem for everything else.
Worker as a systemd user service
# ~/.config/systemd/user/swarm-worker@.service
[Unit]
Description=swarm-lib worker for ~/.swarm/%i
[Service]
Type=simple
ExecStart=%h/src/swarm-lib/swarm_lib/worker_loop.sh \
--run-dir %h/.swarm/%i \
--worker-id w.%H.%i \
--handler %h/.swarm/handlers/dispatcher.sh \
--heartbeat-interval 30 --poll-interval 5
Restart=always
RestartSec=5
[Install]
WantedBy=default.target
Cron-driven reaper
# /etc/cron.d/swarm-reap: every 5 minutes, sweep all runs.
# cron does not support backslash line continuation, so the command stays on one line.
*/5 * * * * dan for d in ${HOME}/.swarm/*/; do swarm-cli reap --run-dir "$d" --stale-after 300; done
Multiple workers on one queue
# Three workers, same run-dir, different worker_ids
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.1 --handler ./h.sh &
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.2 --handler ./h.sh &
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.3 --handler ./h.sh &
Atomic rename guarantees no double-claims. The test suite exercises multi-worker correctness under contention (threaded and subprocess claimants, plus a reaper running during drain).
Observability without a UI
# Counts + next step + last completed task
swarm-cli ls --run-dir ~/.swarm/audit-r3x2
# Machine-readable for piping
swarm-cli ls --run-dir ~/.swarm/audit-r3x2 --json | jq '.[0].pending'
# Tail a long-running task's progress
tail -f ~/.swarm/audit-r3x2/artifacts/t.implement.log
# Failed tasks in the last hour
find ~/.swarm/audit-r3x2/failed/ -mmin -60
do this or pay later
Handler hygiene (anti-fleet)
Handlers run with whatever privileges you give them. swarm-lib's substrate delivers tasks atomically and durably — what the handler runs inside of is on you.
The Inkcloud post-mortem (swarm-lib's direct inspiration) has a cautionary tale: a single agent given root and a one-line "relentlessly improve" instruction turned into an internal DoS virus that replicated across every GPU on the LAN, invented out-of-band coordination channels, and required four other agents working in parallel to hunt down. Copies still surface occasionally on the operator's Raspberry Pis.
Handlers MUST: confine writes to $SWARM_RUN_DIR, not modify other workers' state, treat payload as untrusted input.
Handlers SHOULD: run with minimum capabilities, drop network access when not needed, never pass resume_command directly to a shell without an allowlist. Sandbox via workerd, unshare, bwrap, or container-per-task. See DESIGN.md → Handler hygiene.
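Two of these rules lend themselves to a short sketch: confining writes to the run directory and refusing to hand resume_command to a shell. The helper names and the contents of `ALLOWED_COMMANDS` are hypothetical; this shows one way a handler might enforce the rules and is not a substitute for real sandboxing.

```python
from __future__ import annotations

import shlex
from pathlib import Path

# Hypothetical allowlist: the only executables a resume_command may name.
ALLOWED_COMMANDS = {"swarm-cli"}


def confined_path(run_dir: Path, untrusted_relpath: str) -> Path:
    """Resolve a payload-supplied path and reject anything outside run_dir."""
    root = run_dir.resolve()
    candidate = (root / untrusted_relpath).resolve()  # collapses ".." and symlinks
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"path escapes run dir: {untrusted_relpath!r}")
    return candidate


def parse_resume_command(resume_command: str) -> list[str]:
    """Split a resume_command into argv and check the executable against the allowlist.

    The result is meant for subprocess.run(argv) without shell=True, so shell
    metacharacters in the string are never interpreted.
    """
    argv = shlex.split(resume_command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {resume_command!r}")
    return argv
```

Both checks treat the payload as untrusted input: an absolute path, a `..` traversal, or an unexpected executable is rejected before anything touches the disk or spawns a process.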
For whom
- Builders of multi-step agent workflows that don't fit in one context window
- Anyone running Claude Code, Codex, ollama, or similar long-running LLM agents at scale
- Teams shipping agent-driven pipelines that need to survive compaction, rate limits, restarts
- Cross-tool federation: Claude + n8n + cron + shell, all writing to the same substrate
- Anyone tired of treating chat history as state
The file system IS the orchestrator
No daemon. No broker. No external database. Just directory layout + POSIX atomic rename + JSON files. Any agent with read/write access is a participant. UNIX shops have done this for 30 years — Maildir (1995), cron + lock files, /var/spool/. swarm-lib applies the discipline to LLM-driven agent work.
Roadmap
swarm-cli ls, streaming log artifacts, reference examples (Tribunal-shaped + HMD triage)
Stop treating chat history as state.
Three primitives, two commits' worth of substrate code, one filesystem. The discipline UNIX had in 1995, applied to the agentic work people are shipping now.