v0.2 · substrate hardened · alpha

swarm-lib

The substrate your agents have been missing.

A small Python + Bash library that turns long, multi-step agentic work into a queue of atomic tasks that survive compaction, crashes, rate limits, and process restarts. Three primitives — atomic-rename queueing, status.json checkpointing, a generic worker loop — give you durable handoff between fresh LLM contexts. No broker. No daemon. No database. Just POSIX and JSON.

If you've watched Claude Code hit compaction mid-task, burned an hour of opus tokens waiting for a subprocess, or shipped an agent pipeline that quietly dies when the chat thread closes — this is the fix.


The three problems that kill agentic workflows

Every agent system that grows past a toy hits these. They're not bugs in your code. They're structural problems in how LLM-driven work is wired today.

1. Context starvation

You build a workflow as one long Claude Code conversation: "first do X, then Y, then Z, then summarize." Halfway through Y, the context window approaches its limit, compaction fires, and the model now has a lossy summary of what just happened instead of the actual artifacts. Y silently drifts, and Z inherits a confused result.

Root cause: chat history is being used as program state. State that's volatile, lossy under compression, and tied to a single process's lifetime.

2. Synchronous tool-call blocking

Your planner agent is running on opus. It decomposes a task into sub-tasks and then... waits. It holds the expensive context window open while subprocesses, model calls, or external APIs churn for minutes. You burn tokens at idle because the planner can't release its window until the children return.

Root cause: synchronous orchestration. The high-context agent is treated as a coordinator that blocks on its workers.

3. Chat-history-as-state

Your agent runs as a long-lived conversation. It crashes — rate limit, network blip, user closes the tab, the laptop sleeps. When it comes back, there's no durable record of "where am I in the work." The agent either restarts from zero, replays everything redundantly, or invents a plausible-looking continuation that drifts from reality.

Root cause: no source of truth outside the conversation. If the conversation dies, the work dies.

Three primitives, 30 years of UNIX discipline

swarm-lib gives you the same substrate UNIX shops have used since the 90s — Maildir, cron + lock files, /var/spool/ — applied to LLM-driven agent work.

1 / atomic queueing

Atomic-rename task queueing

Producers stage tasks under pending/<task_id>.json. Consumers race for them via:

os.replace(
    "pending/<task_id>.json",
    "claimed/<worker_id>/<task_id>.json",
)

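POSIX guarantees that rename(2), which os.replace wraps, is atomic when source and destination are on the same filesystem. When two consumers race for the same task, exactly one wins; the loser's call raises FileNotFoundError because the source file is already gone. No locks, no broker, no leader election. The filesystem is the coordinator.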
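
A minimal sketch of the claim step, using only the standard library. The helper name try_claim_one and the exact directory handling are assumptions for illustration, not swarm-lib's actual implementation; the point is that a failed rename is the whole "lost the race" signal.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def try_claim_one(run_dir: Path, worker_id: str) -> Optional[Path]:
    """Try to claim any pending task; return the claimed path, or None."""
    claimed_dir = run_dir / "claimed" / worker_id
    claimed_dir.mkdir(parents=True, exist_ok=True)
    for candidate in sorted((run_dir / "pending").glob("*.json")):
        target = claimed_dir / candidate.name
        try:
            os.replace(candidate, target)  # atomic within one filesystem
        except FileNotFoundError:
            continue  # another worker renamed it first; try the next task
        return target
    return None


# Demo: one pending task, two workers asking for it in turn.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "pending").mkdir()
(run_dir / "pending" / "t.1.json").write_text(json.dumps({"task_id": "t.1"}))

first = try_claim_one(run_dir, "w.a")
second = try_claim_one(run_dir, "w.b")
print(first.name if first else None, second)  # t.1.json None
```

The second worker sees an empty pending/ and simply gets None back; it never needs to know who won.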

2 / durable handoff

`status.json` checkpointing

Every workflow keeps its state in a single JSON file at the run directory's root:

{
  "schema_version": "0.1",
  "run_id": "audit-r3x2",
  "checkpoint": {
    "summary": "Completed plan; ready for implement stage",
    "next_step": "Invoke implement skill with plan output as input",
    "next_task_id": "t.audit-r3x2.implement",
    "completed_tasks": ["t.audit-r3x2.intent", "t.audit-r3x2.plan"],
    "current_worker": null,
    "timestamp": "2026-05-21T18:45:11Z"
  }
}

Any fresh agent — a new Claude Code session, an ollama worker, a shell script, a cron job — resumes by reading this file. Compactions, crashes, rate limits, multi-day pauses, machine reboots all become indistinguishable from a clean restart. Chat history is volatile; the file is the contract.
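
To make the resume step concrete, here is a hedged sketch of writing and reading a checkpoint with the standard library. The function names write_status and resume_point are hypothetical, and the temp-file-plus-rename write is one reasonable way to keep readers from seeing a half-written file, not necessarily how swarm_lib.status does it.

```python
import json
import os
import tempfile
from pathlib import Path


def write_status(run_dir: Path, status: dict) -> None:
    """Write status.json via temp file + rename so readers never see a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=run_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(status, tmp, indent=2)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_name, run_dir / "status.json")


def resume_point(run_dir: Path) -> tuple:
    """Return (next_task_id, completed_tasks) from the checkpoint."""
    checkpoint = json.loads((run_dir / "status.json").read_text())["checkpoint"]
    return checkpoint["next_task_id"], checkpoint["completed_tasks"]


# Demo: a planner checkpoints, then a completely fresh process resumes.
run_dir = Path(tempfile.mkdtemp())
write_status(run_dir, {
    "schema_version": "0.1",
    "run_id": "audit-r3x2",
    "checkpoint": {
        "summary": "Completed plan; ready for implement stage",
        "next_task_id": "t.audit-r3x2.implement",
        "completed_tasks": ["t.audit-r3x2.intent", "t.audit-r3x2.plan"],
    },
})
next_task_id, completed = resume_point(run_dir)
print(next_task_id, len(completed))  # t.audit-r3x2.implement 2
```

Nothing in resume_point depends on who wrote the file or when, which is what lets a cron job pick up where a chat session left off.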

3 / generic worker

`worker_loop.sh`

A 200-line bash loop that polls a run directory, atomically claims tasks, invokes any handler executable with the task JSON on stdin, and moves results to done/ or failed/.

Workers are interchangeable. Any process that reads task JSON from stdin and writes its result to disk is a participant: Claude Code, Codex, ollama, n8n, plain shell, future LLM tools that don't exist yet.

A background heartbeat keeper writes .heartbeat while the worker is alive. A separate swarm-cli reap (cron-driven) returns stale claims to pending/ when the heartbeat falls behind. Orphan recovery without a coordinator.
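
The reaping idea can be sketched in a few lines. This assumes each worker keeps its .heartbeat file inside its own claimed/<worker_id>/ directory and that staleness is judged by the file's modification time; both are assumptions made for the example, and reap_stale is a hypothetical name rather than the library's function.

```python
import os
import tempfile
import time
from pathlib import Path


def reap_stale(run_dir: Path, stale_after: float, now: float) -> list:
    """Move claims of workers with a stale or missing heartbeat back to pending/."""
    returned = []
    for worker_dir in (run_dir / "claimed").iterdir():
        heartbeat = worker_dir / ".heartbeat"
        last_beat = heartbeat.stat().st_mtime if heartbeat.exists() else 0.0
        if now - last_beat <= stale_after:
            continue  # worker still looks alive; leave its claims alone
        for claim in worker_dir.glob("*.json"):
            os.replace(claim, run_dir / "pending" / claim.name)
            returned.append(claim.name)
    return sorted(returned)


# Demo: one worker beat 5 seconds ago, the other 15 minutes ago.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "pending").mkdir()
for worker, age in (("w.live", 5), ("w.dead", 900)):
    worker_dir = run_dir / "claimed" / worker
    worker_dir.mkdir(parents=True)
    (worker_dir / f"t.{worker}.json").write_text("{}")
    beat_file = worker_dir / ".heartbeat"
    beat_file.touch()
    beat = time.time() - age
    os.utime(beat_file, (beat, beat))  # backdate the heartbeat

reaped = reap_stale(run_dir, stale_after=300, now=time.time())
print(reaped)  # ['t.w.dead.json']
```

Because the return to pending/ is itself a rename, a reaped task re-enters the queue under the same atomic-claim rules as a fresh one.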

the core discipline

The Yield Rule

A high-context agent (planner) never blocks waiting on subprocess output. It:

1. Decomposes the goal into atomic tasks
2. Writes each task's payload to pending/<task_id>.json atomically
3. Exits immediately, freeing its context window

Fresh consumer loops pick up the tasks. The planner can be re-invoked later from status.json if needed. No expensive planner sitting idle. No tokens burning on I/O.
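
The three steps above can be sketched as a planner that stages each task outside pending/, renames it into place, and returns without waiting. The tmp/ staging directory, the task fields, and both function names are assumptions for illustration only.

```python
import json
import os
import tempfile
from pathlib import Path


def enqueue_atomically(run_dir: Path, task_id: str, task_type: str, payload: dict) -> Path:
    """Stage the task in tmp/, then rename into pending/ so consumers never see a partial file."""
    staging = run_dir / "tmp"
    pending = run_dir / "pending"
    staging.mkdir(parents=True, exist_ok=True)
    pending.mkdir(parents=True, exist_ok=True)
    staged = staging / f"{task_id}.json"
    staged.write_text(json.dumps(
        {"task_id": task_id, "task_type": task_type, "payload": payload}
    ))
    final = pending / f"{task_id}.json"
    os.replace(staged, final)
    return final


def plan_and_yield(run_dir: Path, repos: list) -> list:
    """Decompose, enqueue, return. The planner does not wait for any result."""
    return [
        enqueue_atomically(run_dir, f"t.audit.{i}", "audit", {"repo": repo}).name
        for i, repo in enumerate(repos, start=1)
    ]


run_dir = Path(tempfile.mkdtemp())
queued = plan_and_yield(run_dir, ["repo-a", "repo-b", "repo-c"])
print(queued)  # ['t.audit.1.json', 't.audit.2.json', 't.audit.3.json']
```

Once plan_and_yield returns, the planner process can exit; everything a worker needs is already on disk.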

60-second tour

Enqueue a task, run a worker against it, check what happened. That's the whole loop.

Install

git clone https://github.com/dpdanpittman/swarm-lib
cd swarm-lib
pip install -e .
# 'swarm-cli' is now on PATH

Enqueue → run → check

# 1. Initialize a run + enqueue a task
mkdir -p ~/.swarm/hello
swarm-cli status-init --run-dir ~/.swarm/hello --run-id hello \
  --summary "Hello-world test run" --next-task-id t.1

swarm-cli enqueue --run-dir ~/.swarm/hello \
  --task-id t.1 --task-type greet \
  --payload '{"who": "world"}'

# 2. Write a handler (gets task JSON on stdin, writes to artifact path)
cat > /tmp/handler.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
TASK_JSON=$(cat)
WHO=$(echo "$TASK_JSON" | jq -r '.payload.who')
echo "# Hello, $WHO" > "$SWARM_ARTIFACT_PATH"
EOF
chmod +x /tmp/handler.sh

# 3. Run a worker — one iteration, then exit
swarm_lib/worker_loop.sh \
  --run-dir ~/.swarm/hello \
  --worker-id w.demo \
  --handler /tmp/handler.sh \
  --max-iterations 1

# 4. Check what happened
swarm-cli ls --run-dir ~/.swarm/hello
cat ~/.swarm/hello/artifacts/t.1.md
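
The handler in step 2 does not have to be Bash. A rough Python equivalent, assuming the same contract shown above (task JSON on stdin, output path in SWARM_ARTIFACT_PATH), might look like this; the handle function is a hypothetical name chosen so the logic can be exercised without a worker.

```python
import json
import os
import sys
from pathlib import Path


def handle(task_json: str, artifact_path: str) -> str:
    """Parse the task, write the greeting artifact, and return what was written."""
    task = json.loads(task_json)
    who = task["payload"]["who"]
    body = f"# Hello, {who}\n"
    Path(artifact_path).write_text(body)
    return body


# When launched by a worker, the artifact path arrives in the environment
# and the task JSON arrives on stdin. Outside a worker this does nothing.
if __name__ == "__main__" and "SWARM_ARTIFACT_PATH" in os.environ:
    handle(sys.stdin.read(), os.environ["SWARM_ARTIFACT_PATH"])
```

Saved as an executable file with a Python shebang line, it could be passed to --handler in place of /tmp/handler.sh.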

From Python

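import os

from swarm_lib import claims, status

# Python does not expand "~" in plain strings, so expand it explicitly.
run_dir = os.path.expanduser("~/.swarm/hello-py")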

status.initialize(run_dir, run_id="hello-py")
claims.enqueue(run_dir, task_id="t.1", task_type="greet",
               payload={"who": "world"})

task = claims.try_claim(run_dir, worker_id="w.demo")
if task is not None:
    # ... do work, write artifacts under run_dir/artifacts/ ...
    claims.complete(task, success=True)

When you actually want this

swarm-lib is purpose-built for these shapes. If your problem looks like one of these, it'll save you real time and tokens.

Multi-step work that exceeds one context window

Audit 16 repos for a standard. Each gets a fresh context window. 16 tasks + 1 synthesize task with depends_on set to all 16. Compaction-immune.
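
As a rough illustration of the fan-in, the readiness rule for depends_on can be written as a pure function. How swarm-lib itself evaluates depends_on is not shown here; ready_tasks and the task dictionaries are assumptions made for the sketch.

```python
def ready_tasks(tasks: list, completed: set) -> list:
    """Return ids of unfinished tasks whose dependencies are all complete."""
    return [
        t["task_id"]
        for t in tasks
        if t["task_id"] not in completed
        and all(dep in completed for dep in t.get("depends_on", []))
    ]


# 16 independent audits plus one synthesize task that waits on all of them.
audits = [{"task_id": f"t.audit.{i}"} for i in range(1, 17)]
synthesize = {"task_id": "t.synthesize",
              "depends_on": [t["task_id"] for t in audits]}
tasks = audits + [synthesize]

before = ready_tasks(tasks, completed=set())
after = ready_tasks(tasks, completed={t["task_id"] for t in audits})
print(len(before), after)  # 16 ['t.synthesize']
```

The synthesize task only becomes claimable once the completed set covers every audit, so it always starts with all 16 artifacts available.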

Chains too long for a single agent

Tribunal-style pipelines: intent → plan → implement → review → verify → classify → incentive. Seven stages, each in a clean context. Artifact from stage N feeds stage N+1 — no lossy summarization, no drift.

Background work while you do something else

"Draft the Slack digest while I sleep." Enqueue + exit. A worker on a server handles it overnight. No keep-alive conversation, no token burn at idle. The artifact is there by morning.

Cost-optimized model routing (HMD)

90% of work on haiku, only escalate the genuinely hard 10% to opus. A classify task decides; it either resolves inline or enqueues an opus follow-up. See examples/hmd-triage/.
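
One way to picture the routing decision is sketched below. It is not taken from examples/hmd-triage/: the difficulty score, the 0.8 threshold, and the model_tier field are all hypothetical, standing in for whatever the classify task actually produces.

```python
def triage(task: dict, hard_threshold: float = 0.8) -> dict:
    """Resolve easy tasks inline; emit a follow-up task for hard ones."""
    if task["difficulty"] < hard_threshold:
        return {"action": "resolve_inline", "model_tier": "haiku"}
    follow_up = {
        "task_id": f'{task["task_id"]}.escalated',
        "task_type": "solve",
        "model_tier": "opus",  # hypothetical routing hint for the worker
        "payload": task["payload"],
    }
    return {"action": "enqueue", "task": follow_up}


easy = triage({"task_id": "t.3", "difficulty": 0.2, "payload": {"q": "rename a variable"}})
hard = triage({"task_id": "t.7", "difficulty": 0.95, "payload": {"q": "redesign the schema"}})
print(easy["action"], hard["task"]["task_id"])  # resolve_inline t.7.escalated
```

The cheap model never waits on the expensive one: escalation is just another task written to the queue.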

Cross-tool federation

n8n triggers on a schedule, writes a task. A Claude Code worker claims, reasons, writes an artifact. Another n8n flow posts the artifact to Slack. Each tool in its lane; the filesystem is the contract.

Scheduled work that survives restarts

Daily blog pipeline: research → draft → edit → publish. Each step a task. If step 3 fails, it lands in failed/ and tomorrow's run is unaffected. No brittle n8n chain to manually re-run.

when not to use it

→ Single-shot, sub-second requests — just call the model directly.
→ Stateful interactive REPL-style sessions where the model needs continuous tool access.
→ High-throughput non-LLM task queues — Celery, RQ, or SQS are built for that.
→ Cross-machine consistency without a shared filesystem — v0.3 territory.

Why swarm-lib and not X

There are great tools for adjacent problems. Here's the honest comparison.

Tool	Built for	Where swarm-lib is better
Celery / RQ / SQS	High-throughput async jobs in long-running web apps	LLM workflows need durable state across restarts, not just task delivery. `status.json` survives the worker dying mid-task in a way Celery's transient state doesn't.
Airflow / Prefect / Dagster	DAG-based data pipelines	swarm-lib is ~1000 LoC and runs anywhere a filesystem exists. Airflow is a service with a database, a scheduler, and a UI. Different size class.
Temporal	Durable workflow execution with versioning	swarm-lib's primitives are 90% of what Temporal gives you, in a form you can read end-to-end in an afternoon, with no SDK lock-in. Temporal wins at production scale with rich observability.
LangChain / LangGraph	Composing LLM calls into chains/graphs	swarm-lib is one level below — it's the substrate a LangGraph could be built on, not a competitor to it. The Yield Rule says don't keep the planner loaded; emit tasks and exit.
CrewAI / AutoGen / agno	Multi-agent role-based frameworks	These run agents in-process and treat conversation as state. swarm-lib externalizes state to disk so any process can be an agent. Lower-level + more durable.
GitHub Actions	CI/CD as YAML workflows	The Yield Rule mirrors `workflow_call`. But GH Actions is locked to GitHub's runner pool. swarm-lib runs anywhere.
One chat thread	Interactive, single-thread agentic work	When your work outgrows a single thread, you need durable handoff. swarm-lib is the upgrade path.

The honest summary: swarm-lib is less powerful than Celery, Temporal, or Airflow for traditional async work. It's more powerful than those tools for LLM-driven agentic work specifically, because it's purpose-built around the constraints that matter: immunity to chat-history loss, awareness of context-window limits, model-tier routing, and interchangeable workers across LLM tools.

What's in v0.2

Substrate is hardened. 42 tests passing, including multi-worker correctness under contention.

swarm_lib.claims

enqueue · try_claim · complete

POSIX atomic-rename queue + cross-filesystem startup check

swarm_lib.status

initialize · read · write · append_completed

Advisory fcntl.flock locking guards status.json against concurrent writers

swarm_lib.orphan

write_heartbeat · reap

Stale claims return to pending/ when the reaper runs

swarm-cli

enqueue · claim · complete · status-{init,show,write} · heartbeat · reap · ls

jq-friendly single-line JSON output for shell integration

worker_loop.sh

Generic consumer loop, handler-agnostic

Background heartbeat keeper + SWARM_LOG_PATH streaming

examples/

seven-step-chain · hmd-triage

Tribunal-shaped reference + cost-routing pattern
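
For readers unfamiliar with the advisory locking mentioned for swarm_lib.status, here is a sketch of a locked read-modify-write. The separate status.lock file and the name append_completed_locked are assumptions for the example, not the library's layout; fcntl is Unix-only.

```python
import fcntl
import json
import tempfile
import threading
from pathlib import Path


def append_completed_locked(run_dir: Path, task_id: str) -> None:
    """Append task_id to completed_tasks while holding an exclusive advisory lock."""
    with open(run_dir / "status.lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            status_path = run_dir / "status.json"
            status = json.loads(status_path.read_text())
            done = status["checkpoint"]["completed_tasks"]
            if task_id not in done:
                done.append(task_id)
            status_path.write_text(json.dumps(status, indent=2))
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Demo: eight concurrent writers, each recording a different task.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "status.json").write_text(
    json.dumps({"checkpoint": {"completed_tasks": []}})
)
writers = [
    threading.Thread(target=append_completed_locked, args=(run_dir, f"t.{i}"))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

final = json.loads((run_dir / "status.json").read_text())["checkpoint"]["completed_tasks"]
print(len(final))  # 8
```

The lock is advisory: it only protects status.json from writers that also take it, which is why every writer goes through the same function.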

Production shape

What a real swarm-lib deployment looks like. systemd for workers, cron for the reaper, the filesystem for everything else.

Worker as a systemd user service

# ~/.config/systemd/user/swarm-worker@.service
[Unit]
Description=swarm-lib worker for ~/.swarm/%i

[Service]
Type=simple
ExecStart=%h/src/swarm-lib/swarm_lib/worker_loop.sh \
  --run-dir %h/.swarm/%i \
  --worker-id w.%H.%i \
  --handler %h/.swarm/handlers/dispatcher.sh \
  --heartbeat-interval 30 --poll-interval 5
Restart=always
RestartSec=5

[Install]
WantedBy=default.target

Cron-driven reaper

# /etc/cron.d/swarm-reap — every 5 minutes, sweep all runs
# cron has no backslash line continuation, so the whole command stays on one line
*/5 * * * * dan for d in ${HOME}/.swarm/*/; do swarm-cli reap --run-dir "$d" --stale-after 300; done

Multiple workers on one queue

# Three workers, same run-dir, different worker_ids
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.1 --handler ./h.sh &
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.2 --handler ./h.sh &
swarm_lib/worker_loop.sh --run-dir ~/.swarm/big --worker-id w.3 --handler ./h.sh &

Atomic-rename guarantees no double-claims. Multi-worker correctness is proven by the test suite under contention (threaded + subprocess claimants + reaper-during-drain).
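
The no-double-claim property is easy to check for yourself. The sketch below is not the project's test suite; it is a small standalone experiment in which three threads race to rename the same 50 files and the claims are counted afterwards.

```python
import os
import tempfile
import threading
from pathlib import Path

run_dir = Path(tempfile.mkdtemp())
pending = run_dir / "pending"
pending.mkdir()
task_names = [f"t.{i}.json" for i in range(50)]
for name in task_names:
    (pending / name).write_text("{}")

worker_ids = ["w.1", "w.2", "w.3"]
for worker_id in worker_ids:
    (run_dir / "claimed" / worker_id).mkdir(parents=True)

wins = {}  # worker_id -> names of the tasks that worker claimed
start = threading.Barrier(len(worker_ids))


def drain(worker_id: str) -> None:
    """Attempt to claim every task; keep only the renames that succeeded."""
    mine = []
    start.wait()  # release all workers at once to maximize contention
    for name in task_names:
        try:
            os.replace(pending / name, run_dir / "claimed" / worker_id / name)
        except FileNotFoundError:
            continue  # someone else already claimed this one
        mine.append(name)
    wins[worker_id] = mine


threads = [threading.Thread(target=drain, args=(w,)) for w in worker_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claims = [name for mine in wins.values() for name in mine]
print(len(all_claims), len(set(all_claims)))  # 50 50
```

How the 50 tasks split between workers varies from run to run; the totals do not.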

Observability without a UI

# Counts + next step + last completed task
swarm-cli ls --run-dir ~/.swarm/audit-r3x2

# Machine-readable for piping
swarm-cli ls --run-dir ~/.swarm/audit-r3x2 --json | jq '.[0].pending'

# Tail a long-running task's progress
tail -f ~/.swarm/audit-r3x2/artifacts/t.implement.log

# Failed tasks in the last hour
find ~/.swarm/audit-r3x2/failed/ -mmin -60

do this or pay later

Handler hygiene (anti-fleet)

Handlers run with whatever privileges you give them. swarm-lib's substrate delivers tasks atomically and durably — what the handler runs inside of is on you.

The Inkcloud post-mortem (swarm-lib's direct inspiration) has a cautionary tale: a single agent given root and a one-line "relentlessly improve" instruction turned into an internal DoS virus that replicated across every GPU on the LAN, invented out-of-band coordination channels, and required four other agents working in parallel to hunt down. Copies still surface occasionally on the operator's Raspberry Pis.

Handlers MUST: confine writes to $SWARM_RUN_DIR, not modify other workers' state, treat payload as untrusted input.

Handlers SHOULD: run with minimum capabilities, drop network access when not needed, never pass resume_command directly to a shell without an allowlist. Sandbox via workerd, unshare, bwrap, or container-per-task. See DESIGN.md → Handler hygiene.
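
As one small piece of that hygiene, a handler can refuse any output path that resolves outside the run directory. This is a sketch under the assumption that the handler knows its run directory (for example from $SWARM_RUN_DIR); safe_output_path is a hypothetical helper, and path checks alone are not a sandbox.

```python
import tempfile
from pathlib import Path


def safe_output_path(run_dir: str, requested: str) -> Path:
    """Resolve `requested` under run_dir, rejecting anything that escapes it."""
    root = Path(run_dir).resolve()
    candidate = (root / requested).resolve()  # collapses ".." and symlinks
    if root not in candidate.parents:
        raise ValueError(f"refusing to write outside run dir: {requested!r}")
    return candidate


root = tempfile.mkdtemp()
ok = safe_output_path(root, "artifacts/t.1.md")

rejected = []
for untrusted in ("../escape.md", "/etc/passwd", "artifacts/../../escape.md"):
    try:
        safe_output_path(root, untrusted)
    except ValueError:
        rejected.append(untrusted)

print(ok.name, len(rejected))  # t.1.md 3
```

Treat this as a last line of defense for payload-derived paths; the sandboxing options listed above are what actually contain a misbehaving handler.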

For whom

→ Builders of multi-step agent workflows that don't fit in one context window
→ Anyone running Claude Code, Codex, ollama, or similar long-running LLM agents at scale
→ Teams shipping agent-driven pipelines that need to survive compaction, rate limits, restarts
→ Cross-tool federation: Claude + n8n + cron + shell, all writing to the same substrate
→ Anyone tired of treating chat history as state

The filesystem IS the orchestrator

No daemon. No broker. No external database. Just directory layout + POSIX atomic rename + JSON files. Any agent with read/write access is a participant. UNIX shops have done this for 30 years — Maildir (1995), cron + lock files, /var/spool/. swarm-lib applies the discipline to LLM-driven agent work.

Roadmap

v0.2 ✓ Substrate hardened: orphan recovery, status lock, cross-FS check, multi-worker correctness, swarm-cli ls, streaming log artifacts, reference examples (Tribunal-shaped + HMD triage)

v0.3 Multi-host coordination (shared filesystem, NFS-friendly claim) + n8n federation + static Kanban UI

v1.0 PyPI release after the Tribunal port stabilizes the API

Stop treating chat history as state.

Three primitives, two commits' worth of substrate code, one filesystem. The discipline UNIX had in 1995, applied to the agentic work people are shipping now.
