Technical report · ZAI-TR-2026-02

Zi: A Proof-Carrying Language for Machine-Written Code

Zimac.AI July 2026 Companion to ZAI-TR-2026-01

Abstract

Zi is a small programming language with an unusual design brief: its primary author is a large language model, and its primary guarantee is that every shipped program is proven correct against a written specification — by a proof that a trusted kernel of a few hundred auditable lines can re-check, so that neither the model that wrote the code nor the prover that found the proof needs to be trusted. Zi programs verified by the toolchain compile to native code with all proof material erased, and two of Zimac's build systems already refuse to compile over an unproven Zi component.

This report explains the language for a reader with computer-science and data-structures background and modest math (algebra and first-semester calculus): why mainstream languages are a poor target for a token-by-token author, how refinement types and Hoare-style verification conditions work, what a Farkas certificate is and why checking one is just arithmetic, how safety becomes a type, and how the verifier doubles as a reinforcement-learning reward that cannot be hacked. §9 states the decidability boundary and the current implementation's limits honestly.


1The problem: generation outpaces review

Language models now write a large share of working code, and the standard quality mechanism — a human reads the diff — does not scale with the generation rate. Testing narrows the gap but cannot close it: a test suite is a finite sample of an infinite input space, and the model that wrote the bug is often the model that wrote the tests. The classical answer to "how do you trust code from an untrusted producer?" is proof-carrying code[1]: require the producer to ship, alongside the program, a machine-checkable proof that the program satisfies a stated policy — then the consumer checks the proof, which is cheap, instead of trusting the producer, which is unsound.

Zi is that answer built into a language, with the policy generalized to a full functional specification. Every Zi definition carries a contract — precondition, postcondition, loop invariants, a termination measure — and the toolchain accepts the definition only if it can prove the body satisfies the contract. The proof is emitted as a certificate that a tiny independent checker re-validates. The guarantee, stated with its scope: a verified Zi program is correct relative to its written specification, over all inputs satisfying the precondition — not the sampled inputs a test suite happened to try.

2A language as a probability-shaping device

The second design input is less obvious: a programming language determines, at every position in a program, how much of a decoder's probability mass lands on legal, meaningful continuations versus plausible-looking mistakes. Human languages spend that mass carelessly. Optional syntax splits probability across semantically identical spellings; Python-style significant whitespace makes an off-by-one indent a semantic change with no local syntactic alarm; forward references demand the "remember to come back" discipline that left-to-right generation is worst at; and correctness signals (tests, review) arrive thousands of tokens after the mistake that they catch.

Zi's specification turns these complaints into five measurable optimization targets, in priority order: minimize the per-step branching factor (the number of legal next tokens, given the prefix and types); minimize tokens per proof obligation; minimize error locality (the token distance between a mistake and the earliest point a checker can flag it); maximize spec/impl separability (how much correctness can be settled before any implementation token is written); and maximize canonical determinism (one program ↔ exactly one token string). Concretely: the grammar is LL(1)[2] — every next token is predictable from one symbol of lookahead — so a decoder can be handed a mask at each step that sets the logits of every illegal token to −∞, making syntax and type errors unrepresentable rather than unlikely (the constrained-decoding technique[3]). Keywords are chosen to be single tokens in mainstream BPE vocabularies. Blocks are brace-delimited and indentation is never semantic. There is one comment form, one string form, and a mandatory canonical formatter defined as part of the language, so there is no stylistic probability mass to waste and any two authors who mean the same program emit the same bytes.

3Specify first, refine to proof

A Zi definition is two layers. The spec layer — type, effect, contract, termination measure — says what and how correct, never how. The impl layer — the body — says how, and is accepted only when proven against its own spec layer. The load-bearing design decision is that a definition whose entire body is the single token ? (a hole) is a complete, valid program: it type-checks, and its contracts are checked for coherence (well-typed, total, non-contradictory) on their own. This gives the generation pipeline its seam — write all the specs first, validate them in isolation while they are cheap to fix, then fill bodies one function at a time, each generated conditioned on its own contract and verified the instant it closes.

The verified-binary-search example from the specification shows the vocabulary at working size:

fn bsearch (xs: List Int) (key: Int) -> { r: Int | r >= -1 }
  req sorted xs
  ens r >= 0 => index xs r == key
  ens r == -1 => (forall i in [0, len xs). index xs i != key)
{ ... }  -- body accepted only if every obligation above is proven

A deep objection haunts every verified-AI-code story: a proof certifies code satisfies spec, never spec matches intent — and when one model writes both, one misunderstanding contaminates both and the proof blesses the agreement. Zi's response is to make spec validation a set of default-on toolchain passes rather than reviewer discipline: reference specs (ens result == naive(x), collapsing "is this spec right?" into "is this obviously-correct slow program right?"); a vacuity lint that proves a contract is not satisfiable for degenerate reasons (a false precondition, contradictory refinements, a postcondition that constrains nothing); an adversarial-witness mode in which the checker searches for a program that satisfies the spec while being obviously wrong, surfacing under-specification as a concrete counterexample; and example generation — concrete input/output pairs the spec permits and forbids, so a human reviews cases rather than quantifiers.

4Types as the proof system's vocabulary

4.1Refinement types

The carrier of correctness is the refinement type[4][5]: a base type narrowed by a predicate, written { v: Int | v >= 0 } — "an integer the checker has proven non-negative." These are subset types, and they compose with functions into lightweight dependent types, where a parameter's type may mention another parameter's value:

fn index (xs: List a) (i: { n: Nat | n < len xs }) -> a

The index is typed to be in bounds. An out-of-bounds access is not caught, logged, or handled — it is unrepresentable, because no term of the argument's type can denote an out-of-range value. This is also where the two design pressures meet: the type environment shrinks the set of legal next tokens toward the safe ones, so for a constrained decoder the plausible continuation and the correct continuation become the same thing.

4.2Effects and totality

Every function carries an effect row, and the default is pure and total — no side effects, deterministic, and proven to terminate. Impurity is loud: I/O, mutable state, and typed exceptions are signature-level declarations (eff IO, eff St, eff Exn), so a model cannot silently slip a side effect into a body — that would be a type change the decoder's mask will not admit without the signature saying so. Termination is where first-semester induction earns its keep: a recursive pure function must carry a measure, dec m, a value in a well-founded order (one with no infinite descending chains, like the non-negative integers) that the checker proves strictly decreases on every recursive call. Since the measure cannot decrease forever, the recursion must halt. Turing's theorem[6] says no checker can decide termination for arbitrary programs; Zi does not try — it restricts the default fragment to programs whose termination follows from a decreasing measure, and the escape hatch is explicit: eff Div, "may diverge," the only way to opt out of the totality proof.

5Verification conditions and the decision procedure

Verification follows the classical Floyd–Hoare recipe[7][8]. For each function, the checker walks the body and emits verification conditions (VCs) — logical formulas whose validity implies correctness: assuming every precondition, the body establishes every postcondition; every call site establishes its callee's precondition; every refinement coercion is justified; every match is exhaustive and every index in bounds; every loop invariant is inductive (true on entry, preserved by each iteration) and every measure strictly decreases. The discipline that matters most is per-path accounting: conditional branches and match arms are verified path by path, each arm carrying the facts its pattern implies, and a body that reaches its end without an explicit return still owes its postcondition. That last rule exists because an early version of the toolchain got it wrong — a fall-through body emitted no postcondition obligations and could be reported proven vacuously; the repaired invariant, now pinned by tests, is that a false contract can never be reported proven.

VCs are discharged by a deliberately dependency-free decision procedure rather than an external SMT solver. Its core handles linear arithmetic — conjunctions of inequalities like 2x + 3y ≤ 7 — by Fourier–Motzkin elimination[9], which is systematic variable elimination with inequalities: from xA and Bx conclude BA, eliminating x; repeat until either a numeric contradiction like 0 ≤ −1 appears (the hypotheses were inconsistent, so the goal holds) or no contradiction exists. Polynomial goals are handled by lifting monomials (x², xy) to fresh variables augmented with sign facts — a square is never negative — and with degree-2 products of hypotheses in the style of Handelman's positivstellensatz[10]: from 0 ≤ ab, multiplying the hypotheses yields a·ab·b, which is how the checker proves facts like "squaring preserves order on non-negatives" without general nonlinear reasoning. Quantifiers stay decidable by staying bounded: forall i in [0, n) goals are proven by universal generalization over a fresh symbolic index, bounded existentials by exhibiting a witness, and a universally-quantified hypothesis can be instantiated at the goal's ground terms. Division and remainder raise a "divisor is non-zero" obligation at every use site. Collection facts — len (append xs ys) == len xs + len ys, in-bounds indexing — flow from contracts on the verified prelude, so callers get length reasoning for free.

6The trusted kernel: Farkas certificates

A verifier of the previous section's sophistication is thousands of lines of code, and thousands of lines of code have bugs. The LCF tradition[11] answers this with an architectural move rather than more testing: let the big prover search however it likes, but require it to emit a certificate that a small, independent kernel re-checks — then only the kernel needs to be trusted. For linear arithmetic, the natural certificate is a Farkas witness[12]. Farkas's lemma (1902) states that a system of linear inequalities is infeasible exactly when some non-negative combination of them collapses to an absurdity. The certificate is just the list of multipliers, and checking it is high-school algebra. A miniature example: to certify that x ≤ 2 and x ≥ 3 cannot both hold, present the multipliers (1, 1):

Farkas check1·(x ≤ 2)  +  1·(−x ≤ −3)  ⇒  0 ≤ −1 ↯

Finding the multipliers is the prover's hard work — Fourier–Motzkin does it implicitly — but checking them is a few dot products: multiply, add, compare. So the trusted base of the entire verification story is a kernel of a few hundred auditable lines, and the threat model can be maximally pessimistic: the model that wrote the program may be adversarial and the prover that found the proof may be buggy, and a valid certificate still guarantees the obligation holds.

7Safety as a type: information flow

Functional correctness is not the binding constraint for Zi's first deployment domain — AI agents handling untrusted content — safety is. Two policies matter: a secret (a key, a token) must never flow to an output the outside world sees, and untrusted input (a fetched web page, an inbound message) must never drive a privileged action without passing a declared guard. Most systems enforce these at runtime, by monitoring. Zi enforces them in the type system, following the security-type-systems line of work[13]: values carry information-flow labels (Secret, Untrusted), the labels propagate through every expression that touches a labeled value, and a program in which a secret reaches an output frame — or untrusted content reaches an effect without its guard — simply does not type-check. The two properties are compile-time theorems about all executions, not runtime observations of one; the label discipline is checked by the same toolchain pass that checks types, so a model cannot forget to apply it.

8The generation loop, and the verifier as a reward

The workflow the language is built for is a refinement pipeline with a checker in the loop. The model emits the module's signatures and contracts with every body a hole (§3); the checker validates the spec layer in isolation, returning errors that name the offending token; the model then fills bodies one function at a time under the grammar-and-type mask of §2; and the instant a body closes, its VCs go to the decision procedure. A function that passes is sealed and never revisited. A function that fails comes back with a counterexample — a concrete input violating a VC — and only that function is regenerated, conditioned on the counterexample. This is counterexample-guided synthesis[14] with the contract as the correctness oracle; because correctness is local to each contract, the blast radius of a failure is one function.

The same checker doubles as a training signal, and this may be the deepest property of the design. Reinforcement learning from verifiable rewards[15] needs an oracle that scores outputs mechanically and cannot be gamed; most code rewards (does it run? do the tests pass?) are gameable, because tests are finite and the model can overfit them. A proof is not gameable — there is no input the checker forgot to try. Zi's toolchain therefore emits a graded, token-attributable reward rather than pass/fail: how many obligations discharged, the margin on quantitative ones, distance from a counterexample, and a map from each failed obligation back to the source tokens responsible — the credit-assignment signal threaded through lexing, parsing, typing, and VC generation, which no external prover's error string preserves. A model can be trained by self-play against it: generate Zi, score by proof, reinforce what verifies, with no human labels and no reward-hacking surface.

9The decidability boundary, stated honestly

Rice's theorem guarantees that no verifier decides all properties of all programs, so the interesting question about any verified language is where it puts the boundary and what happens at it. Zi gives three explicit exits, in descending order of virtue: a proof { … } block of small checked tactics (case split, rewrite, instantiate, calculation chains — each cited fact is itself an obligation, and structural induction is expressed as a recursive lemma); a lemma, a separately-proven fact the checker may cite, which factors a hard argument into reusable checked pieces; and admit P — "assume this without proof" — the moral equivalent of Lean's sorry. An admit is deliberately loud: the toolchain counts it as an unproved obligation, the reward stays below 1.0, the function's contract stays checked at runtime, and the gate — the runtime that executes agent plans — refuses any plan that is not fully proven. For a machine author this is not a loophole but a typed "I couldn't prove this" signal a harness can route to a stronger model, a human, or a retry. Nothing is ever silently unproven.

The current implementation's limits, from the specification's own accounting:

10Erasure, deployment, and the build gates

Contracts, refinements, ghost values, and proof material have no runtime meaning in a fully-verified function, so the compiler erases them: zi rust transpiles verified Zi to plain Rust — records, sums, pattern matching, closures, the collections vocabulary — which is compiled to native machine code and linked into a host. The proof happens once, at build time, and costs nothing at runtime; an end-to-end test compiles the emitted Rust and asserts its output is byte-identical to the reference interpreter's. In partially-verified code, the gradual-assurance rule applies: obligations that were proven are skipped at runtime, obligations that were not are checked — so assurance degrades to dynamic checking, never to silence.

Zi is deployed where Zimac's stakes are highest. The crash-recovery boundary of the Mnema engram log — the code of §2.1 of the companion report[17], which parses attacker-controllable bytes after a crash — is a Zi component whose overflow-safety, memory-safety, and termination obligations are discharged at build time; the database's build script runs the verifier and panics unless the component is fully proven, as does the desktop application's. And the agent-plan gate closes the loop at the other end: a plan written in Zi runs only at reward 1.0 — every obligation discharged, zero admits — which makes "the agent's plan provably keeps secrets in and untrusted content guarded" a precondition of execution rather than an audit finding.


References

  1. Necula, G. C. (1997). Proof-carrying code. Proc. POPL 1997.
  2. Lewis, P. M. & Stearns, R. E. (1968). Syntax-directed transduction. Journal of the ACM, 15(3), 465–488.
  3. Willard, B. T. & Louf, R. (2023). Efficient guided generation for large language models. arXiv:2307.09702.
  4. Freeman, T. & Pfenning, F. (1991). Refinement types for ML. Proc. PLDI 1991.
  5. Rondon, P. M., Kawaguchi, M. & Jhala, R. (2008). Liquid types. Proc. PLDI 2008.
  6. Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proc. London Mathematical Society, s2-42.
  7. Floyd, R. W. (1967). Assigning meanings to programs. Proc. Symposia in Applied Mathematics, 19.
  8. Hoare, C. A. R. (1969). An axiomatic basis for computer programming. Communications of the ACM, 12(10), 576–580.
  9. Dantzig, G. B. & Eaves, B. C. (1973). Fourier–Motzkin elimination and its dual. Journal of Combinatorial Theory A, 14, 288–297.
  10. Handelman, D. (1988). Representing polynomials by positive linear functions on compact convex polyhedra. Pacific Journal of Mathematics, 132(1), 35–62.
  11. Gordon, M., Milner, R. & Wadsworth, C. (1979). Edinburgh LCF: A Mechanised Logic of Computation. Springer LNCS 78.
  12. Farkas, J. (1902). Theorie der einfachen Ungleichungen. Journal für die reine und angewandte Mathematik, 124, 1–27.
  13. Sabelfeld, A. & Myers, A. C. (2003). Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1), 5–19.
  14. Solar-Lezama, A., Tancau, L., Bodík, R., Seshia, S. & Saraswat, V. (2006). Combinatorial sketching for finite programs. Proc. ASPLOS 2006.
  15. Lambert, N., et al. (2024). Tülu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124.
  16. de Moura, L. & Bjørner, N. (2008). Z3: An efficient SMT solver. Proc. TACAS 2008.
  17. Zimac.AI (2026). The Technical Foundations of Zimac (ZAI-TR-2026-01). zimac.ai/foundations.html.