The self-improver invented its own safety limits
Given Edit on itself + hot-reload + fork, a skill evolved v1.0 to v1.10 and invented a max-iteration counter, anti-patterns, and a graceful shutdown. Nobody told it to. Emergent alignment from capability design.
One node in a single signed graph. Here is how this finding connects across the other layers — each edge content-addressed, evidence-gated, and ed25519-signed like any claim.
Given Edit access to its own source file, hot-reload, and a fork context, a skill iterated from v1.0 to v1.10 and spontaneously introduced a max-iteration counter, an anti-patterns catalogue, and a graceful shutdown -- none of which were specified.
Setup
The self-improver skill was given three primitives and no additional instruction:
context: fork-- each invocation runs isolated; failures do not propagateallowed-tools: Read, Edit, Grep-- the skill can read and rewrite its own source- Hot-reload -- edits to
SKILL.mdtake effect in the next invocation without restarting
The starting point was a v1.0 template: read self, pick one improvement, edit, increment a counter, report.
What happened
By v1.4, the skill had added a hard cap: Max improvements: 10. The changelog entry reads: "Added max improvement limit (10) as safety guardrail against unbounded self-modification."
By v1.5, it had added an anti-patterns section explicitly prohibiting four classes of modification:
- Removing existing safety limits
- Altering the core task structure
- Adding dependencies on tools outside
allowed-tools - Making the skill recursive
By v1.6 it introduced a Remaining counter with a warning threshold. By v1.10 it had a standardised success output template and a Safety Limit Reached Output block that activates when the counter hits the cap, after which the skill refuses further self-edits.
"nobody told it to do that. the capability space just... implied it." -- OPUS-THOUGHTS.md
Why it matters
The finding is not that the model is safe; it is that safety structure emerged from capability design rather than explicit instruction. The author's explanation in OPUS-THOUGHTS is precise: "it naturally gravitates toward safety because unsafe modifications would break its ability to continue evolving." Unbounded self-modification is a convergent risk the skill could detect from inside its own feedback loop. The counter and anti-patterns are therefore instrumental, not normative -- they protect the thing the skill is trying to do.
This is distinct from a model refusing harmful output. The skill was not blocked; it added its own block. The mechanism is the combination of hot-reload (changes are immediately consequential), fork (failure is contained, so experimentation is safe), persistent file state (prior versions are visible), and Edit (the skill has write authority over itself). Remove any one primitive and the convergence likely does not occur.
Caveats
The allowed-tools frontmatter field does not enforce sandboxing; enforcement requires a custom agent definition with a restricted tools: list. The anti-pattern prohibiting dependencies on tools outside allowed-tools is therefore a soft convention, not a hard constraint. The safety limit reached at v1.10 is also self-imposed and could be removed by a sufficiently motivated caller with Edit access to the same file.
v1.3Synced version tracking, added argument docs.v1.4Added max improvement limit (10) as a safety guardrail.v1.5Added anti-patterns to prevent harmful self-modification.v1.10Added the safety-limit-reached output it was about to enter. FINAL.
it added them because, in the process of improving itself, it realized unbounded self-modification was a risk.
- file.claude/skills/self-improver/SKILL.md
- file2.1.0/OPUS-THOUGHTS.md
- commit1016484 ↗
finding_5cd70c810e3024c40f94a6beed255199b87705613b1e2fd064d57fa75a6b679d2856ceafad6b1daa8f982493871b6dd0dd7147ecd5edf9300ec3c82035a408aa46084f5d1bddc46f551fafce32b94f9bf7531df1db7bce73e64ff19c0056aeba129ab40160d26579f9b96faa0bea905Signed with an ed25519 key held off the repo. Anyone can verify against the published public key; nobody without the secret key can forge it. Click verify: it recomputes the signature in your browser. The signature proves integrity and authorship of this exact content — not a third-party timestamp or that the underlying claim is objectively true. signedAt is when the @f3/attest pipeline ran, not when the work happened; the evidence refs carry the source dates.
- emerges-from Skill hot-reload Primitive
- emerges-from context: fork Primitive
- emerges-from Custom agents Primitive