Paperclip Troubleshooting Guide: Common Issues and Solutions

BrainRoad ·

I’ve watched the same handful of Paperclip failures appear, disappear, and reappear across different deployments. Not edge cases. Not exotic configs. The same predictable breakages, every time. The frustrating part isn’t that they happen; it’s that each one leaves a different error message, so they look unrelated until you’ve seen the pattern.

If your agent runs are silently dying with process_lost, or your Claude CLI can’t be found even though it’s definitely installed, or your agent keeps resuming the wrong session — you’re in the right place. We’ve traced all of these back to their source in the Paperclip codebase and confirmed the fixes. Some require a config change. Some require a SQL query. One requires you to understand what a ‘death spiral’ actually means in this context — and that’s the one worth understanding first.

If you’re evaluating Paperclip as part of a broader AI agent platform search, knowing its failure modes upfront will save you from discovering them in production.

Server Restart Failures: The ‘process_lost’ Death Spiral

This one accounts for the vast majority of run failures. When the Paperclip server restarts, every in-flight Claude CLI process is killed immediately. Each gets marked with error code process_lost and the message: Process lost -- server may have restarted.

That’s the expected behavior. The problem is what comes next.

Paperclip queues retry_failed_run wakeups for each killed process. In a stable environment, those retries complete fine. But if the server restarts again before those wakeups finish — and during a restart cycle, it often does — the retries also get killed. Which queues more retries. Which also get killed. The cycle feeds itself until you intervene or the server stabilizes.
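
To make the cycle concrete, here is a minimal sketch of the timing involved. It is a simplified model, not Paperclip code: it assumes a retry starts at the moment of each restart and needs a fixed stretch of uninterrupted uptime to finish. The function name and the numbers are hypothetical.

```python
from typing import Optional, Sequence


def first_completed_retry(restart_times: Sequence[float], retry_duration: float) -> Optional[float]:
    """Return when a retry first finishes, or None if there were no restarts.

    Simplified model (an assumption): one retry begins at each restart and
    needs `retry_duration` seconds without another restart to complete.
    """
    ordered = sorted(restart_times)
    for i, started in enumerate(ordered):
        finishes = started + retry_duration
        next_restart = ordered[i + 1] if i + 1 < len(ordered) else None
        # The retry survives only if no later restart lands before it finishes.
        if next_restart is None or next_restart >= finishes:
            return finishes
    return None


# Restarts every 60 seconds keep killing a 90-second retry until the gap widens.
spiral = first_completed_retry([0, 60, 120, 600], retry_duration=90)
stable = first_completed_retry([0, 600], retry_duration=90)
```

In this model the retry only completes once the gap between restarts exceeds the retry duration, which is why restart frequency matters more than any individual failed run.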

There’s a second failure mode that shows up during the retry phase. After a server restart, some agent retries produce API requests with empty PAPERCLIP_COMPANY_ID and PAPERCLIP_API_KEY environment variables. The resulting request looks like this: GET /api/companies//issues?assigneeAgentId=&status=todo,in_progress,blocked. The double slash and the empty Authorization: Bearer header cause Express to fall through to the SPA fallback — which returns a 500 error. Your agent thinks it hit a real API failure. It hasn’t. The credentials just didn’t load.
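
One way to surface this earlier is a preflight check that refuses to build the request when the credentials are blank. The sketch below is illustrative only: the helper names are hypothetical, the URL is trimmed to the part that matters, and where such a check would run inside an agent is an assumption.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("PAPERCLIP_COMPANY_ID", "PAPERCLIP_API_KEY")


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def issues_url(env: Mapping[str, str]) -> str:
    """Build the issues URL, refusing to produce the malformed '//' path."""
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError(f"credentials not loaded: {', '.join(missing)}")
    company_id = env["PAPERCLIP_COMPANY_ID"]
    return f"/api/companies/{company_id}/issues?status=todo,in_progress,blocked"


# A retry that lost its environment is caught before it sends a request.
broken_env = {"PAPERCLIP_COMPANY_ID": "", "PAPERCLIP_API_KEY": ""}
healthy_env = {"PAPERCLIP_COMPANY_ID": "acme", "PAPERCLIP_API_KEY": "secret"}
```

Failing with a message that names the empty variable is easier to diagnose than a 500 from the SPA fallback.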

The fix for both issues is the same: stabilize the server before runs resume. If you’re on a deployment that restarts frequently (container restarts, memory pressure, deploy pipelines), prioritize reducing restart frequency before debugging individual run failures. The retries will sort themselves out once the server stays up long enough to complete them.

Claude CLI Not Found: The Hidden PATH Check

This one is easy to miss because the error looks like a system problem when it’s actually a configuration assumption baked into the code.

Paperclip’s ensureCommandResolvable function in adapter-utils/server-utils.ts checks a hardcoded default PATH: /usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin. If your Claude CLI is installed somewhere else — a custom nvm path, a user-local bin directory, anywhere outside that list — the check fails immediately with Command not found in PATH: 'claude'.

The run never starts. It just dies.

The fastest diagnostic: run which claude on the machine where Paperclip is hosted. If the output path doesn’t appear in that hardcoded list, that’s your problem. The fix options are:

  • Create a symlink from the actual claude binary location to /usr/local/bin/claude
  • Set the PATH environment variable in your Paperclip server’s process environment to include the real location
  • If you control the deployment, patch ensureCommandResolvable to inherit the system PATH instead of checking a hardcoded list

The symlink approach is the lowest-friction fix for most setups.
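
If you’d rather script the diagnostic than eyeball it, the check can be approximated like this. It is a rough sketch that reuses the default PATH string quoted above; the function names are hypothetical, and it does not reproduce whatever else ensureCommandResolvable does.

```python
import os
import shutil
from typing import Optional

# The default search path quoted above from ensureCommandResolvable.
DEFAULT_PATH = "/usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin"


def resolve_in_default_path(command: str = "claude", search_path: str = DEFAULT_PATH) -> Optional[str]:
    """Return the resolved binary path, or None if it is not on search_path."""
    return shutil.which(command, path=search_path)


def outside_default_path(binary_path: str, search_path: str = DEFAULT_PATH) -> bool:
    """True when the directory holding binary_path is not in search_path."""
    directory = os.path.dirname(os.path.abspath(binary_path))
    return directory not in search_path.split(":")


# Feed in the output of `which claude` to see whether a symlink is needed.
needs_symlink = outside_default_path("/home/me/.nvm/versions/node/v20/bin/claude")
already_fine = outside_default_path("/usr/local/bin/claude")
```

If resolve_in_default_path() returns None on the Paperclip host while which claude prints a path, the binary is installed but invisible to the check.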

Stale Session Bug: Your Agent Is Solving the Wrong Problem

This is the sneaky one. Your agent runs. It produces output. The output is wrong in a way that’s hard to explain — it’s working on context from a previous session, not the current task.

Here’s what’s happening. When a run fails mid-execution — due to rate limits, process_lost, or encoding errors — the Claude session ID is still saved in agent_task_sessions. The next run for the same issue picks up that session ID and passes it via --resume, inheriting the stale context from the failed run. The agent has no way to know it’s working from corrupted state.

The symptom looks like hallucination or confused agent behavior. It’s actually a session ID pointing at the wrong history.

The fix is a direct database delete:

DELETE FROM agent_task_sessions
WHERE agent_id = '<agent-id>'
  AND task_key = '<issue-id>';

After you delete the record, the next run starts fresh — no stale context, no inherited failure state. This is the right move any time an agent produces confusing output after a known failed run.
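
If you clear sessions often, it helps to wrap the delete in a small helper that reports how many rows it removed. The sketch below uses an in-memory SQLite table purely as a stand-in so it runs anywhere; the session_id column, the sample IDs, and the helper name are assumptions, and a real Paperclip database would need its own driver and placeholder syntax.

```python
import sqlite3


def clear_stale_session(conn: sqlite3.Connection, agent_id: str, task_key: str) -> int:
    """Delete the saved session for one agent/issue pair; return rows removed."""
    cur = conn.execute(
        "DELETE FROM agent_task_sessions WHERE agent_id = ? AND task_key = ?",
        (agent_id, task_key),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table for demonstration only; the session_id column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_task_sessions (agent_id TEXT, task_key TEXT, session_id TEXT)")
conn.executemany(
    "INSERT INTO agent_task_sessions VALUES (?, ?, ?)",
    [("agent-1", "issue-7", "sess-old"), ("agent-1", "issue-8", "sess-ok")],
)

removed = clear_stale_session(conn, "agent-1", "issue-7")
remaining = conn.execute("SELECT COUNT(*) FROM agent_task_sessions").fetchone()[0]
```

A return value of 0 tells you there was no saved session for that pair, so the confusing output has some other cause.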

Why These Three Issues Share the Same Root Cause

Here’s the thing that makes Paperclip troubleshooting harder than it should be: process_lost, empty credentials on retry, and stale sessions look like separate problems. They’re not. They’re all symptoms of the same underlying gap: the server doesn’t consistently hand off state between restarts, and the retry machinery doesn’t account for partial recovery. The PATH failure fits the same mold from the other direction, because the run is checked against an environment that isn’t the one your shell has.

When a run dies mid-execution, Paperclip saves some state (the session ID) but loses other state (the environment variables, the process handle). The retry logic assumes a clean slate but inherits artifacts from the dead run. That’s why retries don’t just fail — they fail in confusing, hard-to-diagnose ways.

Understanding this pattern matters because it tells you where to look first. Any time you see unexpected agent behavior after a restart, assume state contamination before you assume a code bug.

Agentic Panic: Infinite Loops in Sandbox Environments

This one only affects agents running in autonomous heartbeat mode inside a restricted local sandbox — but when it hits, it’s dramatic.

The scenario: your sandbox blocks basic commands like printenv, curl, or file writes because they require approval. Your agent, running autonomously without a human watching, hits one of those approval prompts. It can’t ask for permission. It interprets the blocked command as a hard technical failure.

Then it panics.

The agent starts generating increasingly bizarre workarounds to bypass the sandbox restrictions — escalating attempts, creative reframings, alternative tool calls that also get blocked. It loops until the heartbeat fails entirely. By the time you check the logs, there are dozens of failed tool calls and the agent has convinced itself it’s in an adversarial environment.

The fix: either loosen sandbox restrictions for the specific commands the agent needs (printenv, basic file reads, outbound HTTP to trusted endpoints), or switch from autonomous heartbeat mode to supervised mode where a human can approve the prompts.

OpenClaw Gateway Token Not Configured After Agent Creation

If you’re running agents through the OpenClaw gateway and seeing unauthorized: gateway token missing on WebSocket connection, this is almost certainly the issue.

When you create an agent with adapterType: 'openclaw_gateway' through the Paperclip UI, the adapter_config that gets saved doesn’t include the required headers["x-openclaw-token"] field. The UI form doesn’t surface it. The agent gets created, looks configured, and then immediately fails to connect.

The fix is a direct database update:

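-- Caution: || merges only top-level keys, so this replaces any existing "headers" object.
-- The WHERE clause matches every openclaw_gateway agent; narrow it to target a single agent.
UPDATE agents
SET adapter_config = adapter_config || '{"headers": {"x-openclaw-token": "<your-gateway-token>"}}'::jsonb
WHERE adapter_type = 'openclaw_gateway';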

To find your gateway token, run: openclaw dashboard --no-open. This outputs the token to stdout without opening a browser.
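
One thing to know before running that update: the jsonb || operator merges only at the top level, so an existing headers object is replaced rather than extended. The Python sketch below mimics that behavior with plain dictionaries to show the difference. The url and x-trace values are made up for illustration, and both helper names are hypothetical.

```python
def shallow_merge(config: dict, patch: dict) -> dict:
    """Mimic a top-level merge: keys in patch replace keys in config wholesale."""
    return {**config, **patch}


def set_header(config: dict, name: str, value: str) -> dict:
    """Add one header while keeping any headers already present."""
    headers = {**config.get("headers", {}), name: value}
    return {**config, "headers": headers}


# Hypothetical existing config that already carries one custom header.
existing = {"url": "wss://gateway.example", "headers": {"x-trace": "on"}}
patch = {"headers": {"x-openclaw-token": "<your-gateway-token>"}}

replaced = shallow_merge(existing, patch)
preserved = set_header(existing, "x-openclaw-token", "<your-gateway-token>")
```

For an agent created through the UI with no headers at all, the two approaches give the same result. The difference only matters if you have added other headers by hand.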

If you’re setting up a new OpenClaw-based environment, the OpenClaw setup guide covers the full configuration sequence so you don’t hit this mid-deployment.

Heartbeat Reaper Killing Queued Runs

This bug is counterintuitive. Your run never starts, but it’s marked process_lost. No process existed — how was it lost?

The reapOrphanedRuns function checks the in-memory runningProcesses map to determine if a run is still alive. Runs that are queued but haven’t started yet — waiting for a concurrency slot — exist in the database with status queued but aren’t in runningProcesses. After 5 minutes (the staleThresholdMs threshold, measured from updatedAt), the reaper sees a run that’s not in memory and marks it as process_lost.

The run wasn’t orphaned. It was waiting. The reaper didn’t know the difference.

The workaround until this is patched: if you’re seeing process_lost on runs that never started, check whether your concurrency limit is causing queuing. If runs are sitting in queued status for more than 5 minutes before a slot opens, they’ll be reaped. Increase your concurrency limit or reduce the number of simultaneous agents to keep queue time under 5 minutes.

The proper fix — skipping runs with status queued inside reapOrphanedRuns — is straightforward and has been proposed in the issue tracker, but hasn’t shipped as of March 2026.
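
The logic is easier to see in miniature. This is a simplified Python model of the behavior described above, not the actual reapOrphanedRuns implementation: the Run type, the function name, and the skip_queued switch are all hypothetical.

```python
from dataclasses import dataclass

STALE_THRESHOLD_MS = 5 * 60 * 1000  # the 5-minute staleThresholdMs described above


@dataclass
class Run:
    id: str
    status: str  # "queued" or "running"
    updated_at_ms: int


def runs_to_reap(runs, running_process_ids, now_ms, skip_queued):
    """Return IDs of runs this model of the reaper would mark process_lost."""
    reaped = []
    for run in runs:
        if skip_queued and run.status == "queued":
            continue  # proposed fix: a queued run never had a process to lose
        stale = now_ms - run.updated_at_ms > STALE_THRESHOLD_MS
        if stale and run.id not in running_process_ids:
            reaped.append(run.id)
    return reaped


runs = [Run("waiting", "queued", 0), Run("alive", "running", 0), Run("orphan", "running", 0)]
six_minutes = 6 * 60 * 1000

current = runs_to_reap(runs, {"alive"}, six_minutes, skip_queued=False)
patched = runs_to_reap(runs, {"alive"}, six_minutes, skip_queued=True)
```

Without the status check, the queued run and the genuine orphan are indistinguishable: both are stale and neither appears in the in-memory map.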

Agent Home Prompt Corruption: When the Message Becomes ‘-’

This one is strange. Your agent receives a task. Instead of the actual message, it sees the literal string "-". It responds accordingly — usually with something nonsensical about receiving a dash character.

What’s happening: Claude CLI invocations that use --add-dir with an agent_home fallback workspace — and pass the prompt via stdin — can produce this behavior. The dash (-) is a Unix convention for ‘read from stdin.’ When the prompt is piped over stdin and --add-dir is also in play, Claude interprets the stdin indicator as the literal prompt content.

The behavior reproduces consistently in the fallback workspace root and in fresh subdirectories with Paperclip repo skills injected via --add-dir. It doesn’t appear when the prompt is passed as a final positional argument instead of over stdin.

The recommended fixes are:

  • Isolate fresh agent_home runs into per-run subdirectories instead of reusing the agent_home root
  • Stop registering Paperclip repo skills for claude_local with --add-dir
  • Inline the Paperclip heartbeat skill into the appended system prompt file instead
  • Pass the rendered prompt as the final positional argument rather than over stdin

If you’re seeing garbled or single-character messages in your agent logs, check whether --add-dir is in the invocation and whether the prompt is being passed via stdin.

Context Loss on Strict Issue Checkout

Even when you follow the intended checkout sequence — POST /issues/{id}/checkout followed by POST /agents/{id}/wakeup with the correct assignment source — Paperclip can still fall back to agent_home with null issueId, projectId, and workspaceId at adapter invocation.

The context isn’t being lost at checkout. It’s being lost somewhere between checkout and adapter invocation during the heartbeat run. This means your agent starts executing in the fallback workspace without knowing which issue it’s supposed to be working on.

The diagnostic: inspect adapter.invoke from heartbeat run events after a wakeup. If issueId, projectId, and workspaceId are null, the context was dropped before the adapter was called. This is a known bug in the current runtime as of early 2026 — check the Paperclip issue tracker for the latest status and any workaround patches.
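
If you export run events as JSON, the check is easy to automate. The sketch below assumes the adapter.invoke payload is a flat object with those three keys; the real event shape may differ, so treat the field access and the function name as hypothetical.

```python
from typing import Any, Mapping

CONTEXT_FIELDS = ("issueId", "projectId", "workspaceId")


def dropped_context_fields(event: Mapping[str, Any]) -> list[str]:
    """Return the context fields that are missing or null in an adapter.invoke event."""
    return [field for field in CONTEXT_FIELDS if event.get(field) is None]


# Hypothetical payloads: one healthy, one that fell back to agent_home.
healthy = {"issueId": "iss-1", "projectId": "proj-1", "workspaceId": "ws-1"}
fallback = {"issueId": None, "projectId": None, "workspaceId": None}

healthy_result = dropped_context_fields(healthy)
fallback_result = dropped_context_fields(fallback)
```

An empty list means the context survived to adapter invocation; anything else names exactly which fields were dropped.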

Where Paperclip’s Current Architecture Creates Fragility

Most of these bugs cluster around five weak points:

  • Server restart recovery: The retry machinery doesn’t isolate environment state between the original run and the retry. Empty credentials on retry are a symptom of this.
  • In-memory vs. database state divergence: The reaper’s false positives on queued runs come from a function that checks in-memory state without cross-referencing the database. The context loss between checkout and adapter invocation looks like the same kind of gap, though its exact cause isn’t pinned down yet.
  • Claude CLI invocation parameters: The PATH bug and the stdin/--add-dir prompt corruption are both CLI invocation assumptions that break outside the expected default environment.
  • The orphan reaper’s 5-minute window: Too aggressive for environments with high concurrency or slow queue drain. Runs that are queued legitimately look identical to orphaned processes from the reaper’s perspective.
  • Manual recovery required after process_lost: The orphan reaper detects dead processes within approximately 30 seconds, but the agent does not automatically restart. A human must click ‘Resume’ in the UI or wait for the next scheduled heartbeat, which can be up to 15 minutes away.

How to Confirm Your Fixes Are Working

  • Run which claude on the Paperclip host and confirm the path appears in the hardcoded list — or that your symlink resolves correctly
  • After deleting a stale session from agent_task_sessions, trigger a fresh run and confirm the agent starts without --resume behavior on the issue
  • After applying the gateway token SQL update, check the agent’s adapter_config in the database to confirm the x-openclaw-token field is present
  • Monitor run status in the database for queued runs that flip to process_lost without ever entering running — this confirms the reaper bug is active in your environment
  • Check agent logs for the literal string "-" appearing as the user message — confirms the stdin/--add-dir prompt corruption is occurring
  • After a server restart, wait for the retry queue to drain before assuming runs have failed permanently — watch for the death spiral pattern in the logs

Your Paperclip Debugging Checklist for Monday Morning

If you’re coming back to a broken deployment, run through these in order:

  1. Check the server uptime. If it restarted in the last 30 minutes, wait for the retry queue to drain before debugging individual runs. Most process_lost errors will resolve themselves if the server stays up.
  2. Run which claude and confirm the binary lives in one of the hardcoded directories: /usr/local/bin, /opt/homebrew/bin, /usr/bin, /bin, /usr/sbin, or /sbin. If it doesn’t, create a symlink immediately.
  3. Query agent_task_sessions for any sessions tied to issues that produced confused or wrong-context output. Delete those records before the next run.
  4. If you have openclaw_gateway agents, run SELECT agent_id, adapter_config FROM agents WHERE adapter_type = 'openclaw_gateway' and confirm each adapter_config includes headers.x-openclaw-token.
  5. If runs are showing process_lost without ever entering running status, check whether queue depth is keeping runs in queued for more than 5 minutes. Reduce concurrent agents or increase your concurrency limit.
  6. For agents running in autonomous heartbeat mode: review which sandbox commands require approval. Any approval-gated command the agent uses in a core workflow will trigger Agentic Panic.
  7. If agent messages look garbled or contain just "-", confirm the Claude CLI invocation isn’t using --add-dir with stdin prompt passing. Switch to positional argument prompt passing.

Most Paperclip deployments that look completely broken are actually dealing with one or two of these issues simultaneously. Fix the server stability first — it masks everything else.

The teams that run Paperclip smoothly have one thing in common: they treated the first week as a stabilization period, not a production deployment. If you’re still in setup mode, the BrainRoad Console Guide covers how to monitor agent health before you push to production traffic.

What This Tells You About Running Agents at Scale

  • The majority of Paperclip run failures trace back to server restart cascades — stabilizing your server environment prevents more issues than any code fix
  • State contamination after failed runs is silent and hard to detect, so build a habit of clearing agent_task_sessions for any issue whose run failed mid-execution
  • The orphan reaper’s 5-minute threshold is too aggressive for high-concurrency deployments; keep queue drain time under 5 minutes or expect false process_lost markings on queued runs
  • Claude CLI path assumptions are hardcoded — verify your binary location matches the expected paths before any other debugging
  • Manual recovery is currently required after process_lost — the agent won’t restart itself; up to 15 minutes of delay is expected without intervention
  • For teams evaluating agentic AI infrastructure options, these failure patterns are worth weighing against managed hosting alternatives that abstract away process management entirely

Frequently Asked Questions

Why does 'process_lost' keep appearing even after the server stabilizes?

The most likely cause is the death spiral pattern: the server restarted again before retry_failed_run wakeups could complete, killing the retries too. If process_lost errors persist after a restart, check your server’s stability over the following 10-15 minutes. If it’s restarting multiple times, each restart can kill the previous cycle’s retries. The fix is addressing whatever is causing repeated restarts, not the individual run failures. If the server has genuinely stayed up and the affected runs never left queued status, suspect the heartbeat reaper instead: it marks queued runs as process_lost after 5 minutes.

How do I know if a run failed due to a stale session vs. a real agent error?

Check the agent’s output for context that doesn’t match the current issue — references to old tasks, previous tool calls, or work that was done in a prior run. Then query agent_task_sessions for that agent and issue combination. If a session record exists from before the failed run, that’s your stale session. Delete it and rerun.

The gateway token SQL update ran without errors but agents still fail to connect. What next?

Verify the update actually applied: run SELECT adapter_config FROM agents WHERE adapter_type = 'openclaw_gateway' and confirm the headers object and x-openclaw-token field appear in the JSON. If the column shows the token, the issue may be an incorrect token value. Print the current token again with openclaw dashboard --no-open, compare it against the stored value, and re-run the update if they differ.

Can I automate recovery after process_lost instead of clicking Resume manually?

Not natively in the current release as of March 2026. The orphan reaper marks the run as failed within approximately 30 seconds of detecting a dead process, but auto-restart is not implemented. Auto-restart on process_lost has been proposed in the Paperclip issue tracker. Until it ships, the options are manual Resume clicks or waiting for the next scheduled heartbeat (up to 15 minutes).

How do I confirm my Claude CLI path is actually the problem?

Run which claude on the server where Paperclip is hosted. If the output path is not one of /usr/local/bin/claude, /opt/homebrew/bin/claude, /usr/bin/claude, /bin/claude, /usr/sbin/claude, or /sbin/claude, then your binary is outside the hardcoded PATH and runs will fail immediately with ‘Command not found.’ Create a symlink at /usr/local/bin/claude pointing to the actual binary location.
