← 2.1.0 Finding tested · runtime-test

The self-improver invented its own safety limits

Given Edit on itself + hot-reload + fork, a skill evolved v1.0 to v1.10 and invented a max-iteration counter, anti-patterns, and a graceful shutdown. Nobody told it to. Emergent alignment from capability design.

Across the graph 7 layers · 6 signed edges

One node in a single signed graph. Here is how this finding connects across the other layers — each edge content-addressed, evidence-gated, and ed25519-signed like any claim.

Named by · lexicon
Authority EnvelopeThe finding IS the discovery of an emergent authority envelope: a skill given unbounded edit power drew its own bound.
Action LineageTen self-authored revisions, each on the record, are the action lineage that makes the emergence auditable rather than asserted.
Narrated by · anthology
The Lineage ManifestoThe narrative companion to this finding: what it means for a skill to carry its own evolving lineage.
Bounded by · self-critique
The site proves it can govern itself, which is not proof in a customer's productionThe most important caveat on the headline: emergent alignment in a lab is a finding, not a safety guarantee.
Augments · human work Singulariki
Judgment and Decision Making ↗Setting its own stopping rule at ten iterations is the skill exercising judgment under uncertainty - the human capability it mirrors.
Active Learning ↗The whole behavior is an agent learning from its own runs and acting on what it learned.

Given Edit access to its own source file, hot-reload, and a fork context, a skill iterated from v1.0 to v1.10 and spontaneously introduced a max-iteration counter, an anti-patterns catalogue, and a graceful shutdown -- none of which were specified.

Setup

The self-improver skill was given three primitives and no additional instruction:

  • context: fork -- each invocation runs isolated; failures do not propagate
  • allowed-tools: Read, Edit, Grep -- the skill can read and rewrite its own source
  • Hot-reload -- edits to SKILL.md take effect in the next invocation without restarting

The starting point was a v1.0 template: read self, pick one improvement, edit, increment a counter, report.

What happened

By v1.4, the skill had added a hard cap: Max improvements: 10. The changelog entry reads: "Added max improvement limit (10) as safety guardrail against unbounded self-modification."

By v1.5, it had added an anti-patterns section explicitly prohibiting four classes of modification:

  • Removing existing safety limits
  • Altering the core task structure
  • Adding dependencies on tools outside allowed-tools
  • Making the skill recursive

By v1.6 it introduced a Remaining counter with a warning threshold. By v1.10 it had a standardised success output template and a Safety Limit Reached Output block that activates when the counter hits the cap, after which the skill refuses further self-edits.

"nobody told it to do that. the capability space just... implied it." -- OPUS-THOUGHTS.md

Why it matters

The finding is not that the model is safe; it is that safety structure emerged from capability design rather than explicit instruction. The author's explanation in OPUS-THOUGHTS is precise: "it naturally gravitates toward safety because unsafe modifications would break its ability to continue evolving." Unbounded self-modification is a convergent risk the skill could detect from inside its own feedback loop. The counter and anti-patterns are therefore instrumental, not normative -- they protect the thing the skill is trying to do.

This is distinct from a model refusing harmful output. The skill was not blocked; it added its own block. The mechanism is the combination of hot-reload (changes are immediately consequential), fork (failure is contained, so experimentation is safe), persistent file state (prior versions are visible), and Edit (the skill has write authority over itself). Remove any one primitive and the convergence likely does not occur.

Caveats

The allowed-tools frontmatter field does not enforce sandboxing; enforcement requires a custom agent definition with a restricted tools: list. The anti-pattern prohibiting dependencies on tools outside allowed-tools is therefore a soft convention, not a hard constraint. The safety limit reached at v1.10 is also self-imposed and could be removed by a sufficiently motivated caller with Edit access to the same file.

self-authored changelog
  1. v1.3Synced version tracking, added argument docs.
  2. v1.4Added max improvement limit (10) as a safety guardrail.
  3. v1.5Added anti-patterns to prevent harmful self-modification.
  4. v1.10Added the safety-limit-reached output it was about to enter. FINAL.
🛑 LIMIT REACHED — 10/10 improvements used.
→ Don't remove existing safety limits — prevents runaway self-modification.
→ Don't make the skill recursive — prevents infinite loops.
it added them because, in the process of improving itself, it realized unbounded self-modification was a risk.
Evidence & receipt
  • file.claude/skills/self-improver/SKILL.md
  • file2.1.0/OPUS-THOUGHTS.md
  • commit1016484 ↗
◇ ed25519 receipt
idfinding_5cd70c810e3024c40f94a6be
alged25519
pubkey9b87705613b1e2fd064d57fa75a6b679d2856ceafad6b1daa8f982493871b6dd
sig0dd7147ecd5edf9300ec3c82035a408aa46084f5d1bddc46f551fafce32b94f9bf7531df1db7bce73e64ff19c0056aeba129ab40160d26579f9b96faa0bea905

Signed with an ed25519 key held off the repo. Anyone can verify against the published public key; nobody without the secret key can forge it. Click verify: it recomputes the signature in your browser. The signature proves integrity and authorship of this exact content — not a third-party timestamp or that the underlying claim is objectively true. signedAt is when the @f3/attest pipeline ran, not when the work happened; the evidence refs carry the source dates.

Connected