Technical report · ZAI-TR-2026-01

The Technical Foundations of Zimac

Zimac.AI July 2026 zimac.ai/foundations

Abstract

Zimac is a local-first AI system built from three parts: a native macOS assistant, a shared memory engine (Mnema), and a metered model gateway. This report explains how each part actually works, in the terms of the fields it borrows from — cognitive psychology, distributed systems, information retrieval, and formal methods — for a reader comfortable with computer science and data structures but not graduate mathematics.

The through-line is a single engineering stance: every load-bearing claim in this document corresponds to a machine-checked artifact — a proof harness, a model check, a differential test, or a measured benchmark — and the claims that are not fully proven are enumerated explicitly in §7. Nothing here is aspirational; it describes the shipped system.

1Design position

Zimac's architecture follows from one asymmetry: of everything an AI assistant touches, its memory is the component with the longest blast radius. A bad model reply is annoying once; a corrupted or silently-diverging memory store poisons every future reply. So the system is partitioned by stakes. The user interface and agent logic are ordinary, well-tested application code. The memory engine — Mnema, an embedded database that stores engrams (discrete memory records) and answers recall queries over them — is held to a categorically higher standard: its core invariants are written down as a formal specification and discharged by proofs, model checking, and differential testing (§5), and the build refuses to ship if any layer fails.

The second commitment is locality. The app and the store run on the user's machine; the local API binds to 127.0.0.1; secrets live in the operating system keychain. The only traffic that leaves the device is the model call itself — either directly to the user's provider, or through a gateway whose entire job is credential isolation and metering (§8). This is not a privacy slogan but a statement about the trusted computing base: the server side never holds a copy of the user's memory, so it does not have to be trusted with one.

2The engram log: durability as a checked property

At the bottom of Mnema is an append-only segment log, the storage design used by write-ahead logs and log-structured stores generally. Every mutation is serialized as a framed record — a fixed-size header carrying the payload length, then the payload — appended to the active segment file, and flushed with fsync before the write is acknowledged. The log is the source of truth; every in-memory index is a derived structure that can be rebuilt from it.

Integrity is layered. Each record carries a checksum (CRC), and each segment maintains a rolling SHA-256 hash chain: the hash of each record is folded into a running digest, so the file is tamper-evident in the same sense as a Git history — any bit flip in a sealed segment changes every subsequent chain value, and the chain is re-verified when the log is opened. The specification is deliberately precise about what this buys: bit-rot in a sealed segment is reported as corruption, never silently accepted. Records may additionally be encrypted at rest by an injected cipher; the checksum and chain are computed over the on-disk (sealed) bytes, so the integrity properties are unchanged, and a sealed payload read without the key is again reported as corruption rather than misread.

2.1Crash recovery on untrusted bytes

The most dangerous code in any log is the recovery scan — the routine that reads the active segment after a crash and decides where the valid data ends. It is dangerous because its input is untrusted: a torn write or bit-rot means the payload-length field it reads can be any value an adversary (or the laws of physics) chooses. A naïve bounds check, offset + HEADER + payloadLen ≤ fileLen, is wrong in exactly this situation: on fixed-width integers the left-hand sum can overflow and wrap around, making an enormous hostile length pass the check and sending the scan out of bounds.

Mnema's recovery boundary is therefore held to three machine-checked properties: (a) the frame-fits test is overflow-safe — it is proven equivalent to the intended inequality without ever forming the wrappable sum (algebraically, by checking payloadLen ≤ fileLen − offset − HEADER only after establishing the subtractions cannot underflow); (b) whenever a frame is accepted, its payload end provably lands within the buffer, so no read is out of bounds; and (c) every accepted frame advances the scan cursor strictly, so the scan terminates, returning a truncation offset that is exactly the start of the first incomplete frame. Everything before a torn tail survives; the tail is healed by truncation.

These three properties are not tested into confidence — they are proven at build time. The recovery boundary is written in Zi, the repository's proof-carrying language (§6); the build script of the log crate runs the verifier and refuses to compile the database if the proof fails. A conventional Rust mirror of the same logic is kept in lockstep and covered by tests, so the proven algorithm and the shipping marshalling agree.

Compaction — rewriting the log to drop deleted records — follows the standard crash-safe discipline: write the compacted segment to a temporary file, fsync, atomically rename over the target, then unlink the old segments. The checked property is that compaction preserves exactly the live record set, and that a crash at any point leaves either the old log or the new one, never a hybrid.

3Retrieval as cognitive science

Most "AI memory" systems are a vector database with a cosine-similarity search. Mnema's position is that ninety years of human-memory research constitutes a working spec for what a memory should return, and that the classical models — ACT-R's activation calculus, retrieved-context theory, complementary learning systems — are directly implementable as ranking mathematics. Each mechanism below is implemented in closed form (no learned weights, no model calls), each is pinned by falsifiable tests, and the ensemble is benchmarked on a public long-conversation retrieval benchmark (§3.9).

3.1Base-level activation and the power law of forgetting

The backbone is the ACT-R base-level learning equation[1]: a memory that has been used n times, at ages t₁ … t_n (in days), has base-level activation

(1)B = ln ( t₁^−d + t₂^−d + ⋯ + t_n^−d ) , d = 0.5

Two properties of equation (1) do the real work, and both are pinned by dedicated tests. First, each term decays as a power law, not an exponential. The difference is behavioral, not cosmetic: under exponential decay e^−kt, waiting one more week costs the same factor whether the memory is a day or a year old; under t^−d, doubling the age always multiplies strength by the same constant 2^−d ≈ 0.71. Old memories therefore decay ever more slowly in absolute terms — which yields Jost's law (of two memories equally strong today, the older one will win tomorrow), a regularity of human forgetting that exponential models cannot produce. Second, frequency adds logarithmically: k uses at the same age contribute exactly ln k, so activation encodes both recency and frequency in one number.

Storing every access timestamp forever would grow without bound, so Mnema uses Petrov's optimized-learning hybrid[2]: the most recent K = 16 ages are kept exact, and the older tail is integrated. This is first-semester calculus: the sum of t^−d over a dense window is approximated by ∫ t^−d dtwindow length , and since ∫ t^−d dt = t^1−d/(1−d), the tail collapses to a closed form in the window's endpoints. Ages are floored at one minute to avoid the t → 0 singularity of equation (1).

One honesty guard deserves mention because it typifies the project's style: a test named constant_decay_does_not_model_spacing exists to prove that equation (1) cannot capture the spacing effect — so nobody, human or agent, later mistakes it for a spacing model. Spacing is handled by a separate mechanism, next.

3.2Durability and the spacing effect

The spacing effect — the same number of exposures builds more durable memory when spread out than when crammed — is among the most replicated findings in experimental psychology, and it requires state that equation (1) lacks: a per-memory stability S, the time constant of forgetting. Mnema implements the open FSRS-5 scheduler model[3], whose retrievability (probability of successful recall after t days at stability S) is

(2)R(t, S) = ( 1 + 1981 · tS )^−½

Each access grows stability by a factor that increases as retrievability decreases — reviewing a memory just as it fades strengthens it most, which is precisely the spacing effect. The update computes R from the timestamp before the access (a mandatory guard: back-to-back reads see R ≈ 1 and gain almost nothing, so burst access cannot inflate durability). The checked properties: spaced access out-retains massed access at long intervals; a review at lower retrievability grows stability more; and stability never decreases.

3.3Temporal context: remembering by when

Human recall exhibits the contiguity effect: recalling one event brings back its temporal neighbors, even with no semantic overlap — the model of this is retrieved-context theory (TCM/CMR)[4][5]. Mnema maintains a slowly drifting context vector c — implemented as a sparse map from discrete feature ids (tags, people, co-active memory ids) to weights, so it needs no embedding model. When a memory is encoded, the current context is stamped onto it; then the context drifts toward the new memory's features:

(3)c ← ρ c + β c_in , ρ = √ 1 + β²(δ² − 1) − βδ , δ = cos(c, c_in)

The ρ expression looks exotic but is only algebra: it is the unique non-negative solution of the quadratic that keeps ‖c‖ = 1 after the update, so context never grows or shrinks, it only turns. Memories encoded close in time receive similar context stamps; at retrieval, the seed results' stamped contexts are reinstated and candidates are scored by cosine similarity to them. Two subtleties are checked as properties: the effect is forward-asymmetric (the memory encoded just after a seed shares more of its context than the one just before — a signature of human recall data, and it falls out of the update order for free), and the neighbor injection is precision-gated — a temporal neighbor enters results only when its context cosine clears a floor of 0.5, so incidental drift overlap cannot inject noise as the corpus grows.

On a constructed benchmark of token-disjoint episodic queries — where the answer shares no words with the query and is reachable only through time — the lexical/semantic baseline scores recall@5 = 0.0 (structurally unreachable) and the context-aware path scores 1.0. That is the cleanest demonstration in the system that temporal context is a genuinely independent retrieval signal.

3.4Spreading activation and the provenance ladder

Engrams are also nodes in a typed graph (memories, people, tags, entities), and associative recall is a spreading activation — a bounded, best-first diffusion from seed nodes outward along weighted edges, the mechanism ACT-R uses for associative priming. Three disciplines keep it sound at scale. Boundedness: the diffusion has hard budgets on nodes popped and edges relaxed, so it terminates in sub-millisecond time on graphs from 10³ to 10⁵ nodes regardless of topology. Fan-out normalization: a node's outgoing activation is divided down by its degree (the ACT-R fan effect), and person-nodes named in many memories are further damped by inverse document frequency — so a hub like a manager's name, which connects everything to everything, cannot flood the result set. The provenance ladder: edges carry a provenance tier — authored (the user said so) versus learned (the system inferred it) — and a learned plasticity weight w ∈ [0,1] may shorten an authored edge's traversal cost but is provably unable to push any edge's cost below the next stronger authored tier. That un-invertibility is discharged by a symbolic proof over all real-valued weights (§5.3): no accumulation of statistical inference, however confident, can ever outrank what the user explicitly stated.

3.5The lexical ensemble: BM25, expansion, fusion, and source cues

The statistical retrieval layer is deliberately classical. Field-weighted BM25[6] supplies document-length-normalized lexical relevance. Queries are expanded by RM3-style pseudo-relevance feedback[7]: a cheap first pass retrieves seed documents, their statistically salient terms are folded into the query, and the expanded query bridges to evidence connected by an entity the user never named. The signals — lexical, semantic, activation, context, entity-affinity — are combined by reciprocal rank fusion[8], which sums 1/(k+rank) across rankers; RRF is used precisely because it fuses ranks rather than scores, so no signal's arbitrary scale can dominate another's.

One fusion term is motivated directly by the psychology of source monitoring[9]: who said something is a first-class retrieval cue for humans, yet IDF weighting deliberately destroys it — a frequently-mentioned person is a lexical hub and gets damped to nothing (§3.4). So when a query names a person the store knows, a dedicated source-affinity term re-ranks candidates whose provenance names that person. The term is self-gating (it vanishes when no person is named, none match, or all match) and re-rank-only (it can never inject candidates), and its weight was chosen by a measured sweep: 0.8 lifted every category of the benchmark, 1.2 over-rotated and hurt single-hop recall, so the shipped weight is the interior optimum, not the maximum.

3.6Diversity: determinantal point processes and their dual

A ranked list optimized purely for relevance returns five paraphrases of the best hit. The repair is submodular selection: maximal marginal relevance[10] and greedy MAP inference over a determinantal point process[11][12] with the positive-semidefinite kernel K_ij = (1+cos θ_ij)/2 — geometrically, a DPP scores a candidate set by the volume its vectors span, so near-parallel (redundant) vectors add almost nothing and diverse sets win. A measured honesty note: unfloored diversity reranking scattered multi-evidence retrieval on the conversational benchmark (recall@5 fell 0.46 → 0.20), because co-relevant distinct facts about one topic were misread as redundant. The shipped version counts candidates as redundant only above a token-overlap floor — genuine paraphrases collapse, corroborating evidence survives — recovering the loss while preserving the paraphrase-suppression property.

The same idea is applied on the input side, which appears to be novel engineering: when several fused seeds would ignite the spreading activation of §3.4, two paraphrase seeds are two full-energy entry points into the same graph region, crowding the bounded frontier and starving distinct facts. The allocator treats ignition as a fixed-budget problem — greedy admission by residual marginal relevance rel_i · (1 − max_j K_ij), the classical (1−1/e)-style submodular greedy[13], followed by a saturating water-fill of the energy budget — so a paraphrase cluster is ignited once and the freed energy funds the distinct fact. On the constructed go/no-go benchmark, uniform ignition buries the distinct answer (recall@5 = 0.0) and the allocator surfaces it (1.0). The allocation core is a pure function with no clock, store, or I/O access, and its output is proven invariant under any permutation of its input.

3.7Knowing what it knows: calibrated confidence and surprise-gated encoding

Every recall carries a confidence estimate in the spirit of Koriat's cue-based monitoring[14], computed from evidence strength, corroboration, and the margin between the top hit and the runner-up. The calibration property is falsifiable in both directions: a decoy query with only weak, undifferentiated matches must yield low confidence, and a clear winner must yield high. One guard is worth spelling out: a single uncorroborated hit has no runner-up, so a naïve margin computation would report margin = 1.0 and inflate a lone mediocre match into certainty; the implementation refuses to synthesize a margin from an empty comparison. On the encoding side, novelty is measured as Bayesian surprise[15] — how much a new memory shifts the store's expectations — and a surprising memory receives a base-level boost of exactly ln(salience) in equation (1), so novel information out-activates a near-duplicate of what is already known.

3.8Consolidation, schemas, and global structure

Following complementary-learning-systems theory[16] — the hippocampus stores episodes fast; the cortex slowly abstracts regularities — Mnema runs an offline, deterministic consolidation pass: a seeded, prioritized replay (priority monotone in surprise, recency, and access; the highest-priority memory is always replayed; the tail is sampled rather than starved) that reinforces FSRS durability, and an abstraction step that merges clusters of near-duplicate episodes into a single semantic schema engram linked back to its sources. The pass is idempotent — re-running it does not re-abstract — and additive: until it runs, no shipping code path changes. Complementing these local mechanisms, two classical global-structure algorithms give recall a view of the graph's shape: weighted PageRank[17] as a structural prior (proven to remain a probability distribution — scores non-negative, summing to one, with dangling-node mass teleported rather than leaked), and Louvain modularity clustering[18] for community structure (proven to return a total partition, and to recover planted two-clique structure).

3.9Scale, determinism, and measured results

Exact semantic search is a linear scan; its latency grew ~10× from 10³ to 10⁴ memories in the benchmark suite. The accelerator is HNSW[19] — a hierarchy of navigable small-world graphs answering top-k nearest-neighbor queries in sub-linear expected time — with one significant modification: vanilla HNSW assigns node levels from a random number generator and depends on insertion order, which would break a determinism the whole engine treats as load-bearing. Mnema's index derives each node's level from a stable hash of its id and builds in sorted-id order, making the graph — and every query result — a pure function of the (id, vector) set, invariant under insertion order. Measured recall@10 against the exact scan is ≥ 0.90 at the default search width and rises monotonically toward 1.0 as the width parameter grows (the standard recall-versus-throughput trade[19]). The same determinism discipline runs through the engine: engram ids are a deterministic function of an origin namespace and a persisted monotonic sequence, every floating-point comparison uses total ordering with id tiebreaks, and no hash-map iteration ever feeds a reduction — so the same writes against a fresh store reproduce byte-identical ids and identical rankings, run to run, machine to machine. Semantic search works offline out of the box via a zero-dependency feature-hashing embedder[20] (word unigrams and bigrams plus character 3/4-grams hashed into a signed 512-dimensional space, L2-normalized); any external embedding always overrides it.

End to end, the ensemble is evaluated on LoCoMo[21], a public benchmark of very-long conversational memory. Against the engine's own frozen semantic baseline: recall@5 0.45 → 0.54, recall@10 0.53 → 0.62, and on multi-hop questions — whose answers share no tokens with the query — 0.19 → 0.29. These are honest, modest numbers on a hard benchmark; each contributing signal was A/B-measured in isolation, and mechanisms that regressed recall (two candidate-injection schemes, an entity re-ranker with zero discriminative signal) were reverted and are documented as such in the specification.

4Replication: convergence without coordination

Teams share memory, and Zimac refuses the easy architecture (a server that owns the truth). Instead, each member's store is a full replica, and replication is built on conflict-free replicated data types (CRDTs)[22]. The guarantee CRDTs offer is strong eventual consistency: any two replicas that have applied the same set of operations are in identical states — regardless of the order, grouping, or repetition of delivery. No consensus protocol, no coordinator, no conflict dialog.

The property is algebraic, and the algebra is the proof surface. A replicated register converges when its merge is a join-semilattice operation — idempotent (a ⊔ a = a), commutative (a ⊔ b = b ⊔ a), and associative — because those three laws are exactly what make the result independent of delivery order and duplication, the way max() folds a set identically however you traverse it. Mnema's last-writer-wins register merge is machine-proven to satisfy all three laws, plus "largest timestamp wins," over symbolic inputs (§5.3). Timestamps come from a hybrid logical clock[23] — physical time fused with a logical counter — proven strictly monotone (a replica never stamps two operations equally) and never regressed by observing a peer's clock; ties across replicas break deterministically by origin id, so every replica computes the same winner.

Sets use the add-wins observed-remove (OR-Set) semantics: each add mints a unique tag (a "dot"), and a remove deletes only the dots it has observed — so a concurrent add always survives a remove that never saw it, matching the intuition that you cannot delete what you did not know existed. The resolution rule (an element is dead iff all its dots are observed) is proven over symbolic dot sets. Delivery is causal: an operation is buffered until its declared dependencies have been applied, so a remove can never land before the add it cancels, and the buffer is model-checked to always drain. Garbage collection is causally stable: a replica may compact away operations only at or below the frontier that every peer has provably passed, raising a per-origin floor; the checked guarantee is that a fresh peer fed only the retained operations plus the floor converges to the identical state under every delivery order, and that compaction refuses to run unless the retained set reproduces the current state exactly.

Identity is cryptographic. Every operation is signed; an operation verifies only under its claimed origin's public key, so no member can be impersonated and no body or signature survives tampering — and in restricted mode, operations from origins outside a trust allowlist are dropped before they touch state. Pairing uses a short authentication string that is order-independent and distinguishes distinct peer pairs. The model-checked trust property is stated adversarially: with a forger in the model, the forged operation is never applied and the honest replicas' state remains a function of trusted operations alone.

5The verification stack

"Formally verified" is a claim that ranges from marketing to mathematics. This section states exactly what Mnema does, because the honest version is more interesting than the inflated one. The organizing artifact is a specification file listing every property the engine must satisfy — around eighty, each with an identifier (P-SYNC-SEC, P-LOG-TORN, P-RECALL-LADDER, …) — and, for each, what discharges it: a proof harness, a model check, a test, a differential gate. The rule of the file is that nothing may be merely asserted in prose; a property without a machine-checked discharge is a bug in the spec. Five complementary techniques are used, each where its assumptions actually hold.

Table 1 — The verification layers, what each establishes, and its known boundary.
Layer	Technique	What it establishes	Boundary
Property tests	Deterministic unit properties + seeded randomized sweeps	Each behavioral property holds on constructed fixtures and across thousands of generated inputs, so no property passes on one hand-picked example	Sampled, not exhaustive
Exhaustive model check	A Rust checker enumerating delivery orders against an executable reference specification	The running replica code converges identically to the reference fold under every delivery order	All orders up to 7 operations per scenario; a deterministic 3,000-order sample beyond
Protocol model check	TLA⁺ specification checked by TLC[24]	Consistency, causality, add-wins, unforgeability, GC-safety, and liveness over all interleavings of the abstract protocol	A finite model (3 replicas, small op sets); abstracts the wire and the byte log
Bounded proofs	Kani (CBMC-based model checking of Rust)[25]	Pure, total functions — the cost ladder, the hybrid clock, the LWW semilattice laws, OR-Set add-wins — proven over symbolic inputs: every value in the domain at once, not samples	Bounded model checking, not full deduction; scoped to pure functions (CBMC handles strings and cryptographic code poorly, so log helpers are test-discharged instead)
Differential parity	Cross-implementation gate against the original production JavaScript engine	The frozen ranking paths agree with the independent reference implementation to within 5.6×10⁻¹⁷ — machine epsilon for doubles	Covers the frozen paths; new science is additive behind flags the gate pins off
Mutation testing	cargo-mutants fault injection[26]	The meta-check: deliberately corrupt the code and rerun the suite — a surviving mutant exposes a property whose test is too weak to kill it	Run weekly in CI
Build-time proof	Zi proof-carrying components (§6)	The crash-recovery boundary's overflow-safety, memory-safety, and termination proven before the database is allowed to compile	Scoped to the components written in Zi

5.1Why this particular layering

Each layer covers another's blind spot. Tests exercise real code but only at sampled points; the randomized sweep widens the sample by orders of magnitude (and has caught a real defect: a confidence function that panicked when every input value was non-finite, because an emptiness guard ran before the finiteness filter — now a pinned regression). The TLA⁺ model reasons about all interleavings but of an abstraction, not the code; the exhaustive Rust checker is the bridge, checking the same properties against the real implementation, so a divergence between model and code surfaces as a failing test rather than a silent gap. Kani closes the sampling gap entirely for the pure algebraic core, where "for all inputs" is checkable symbolically. The differential gate defends against a subtler failure — a correct-looking rewrite that drifts numerically from the battle-tested original. And mutation testing audits the auditors: it measures whether the property suite would actually notice sabotage.

5.2Verification as a change contract

The stack is wired into the development process rather than left as a ceremony. The specification is change-controlled: modifying observable behavior requires amending the affected property first, and weakening an assertion to make a check pass is categorically forbidden — a red conformance check is treated as a finding, not friction. The fast layers run in the ordinary test suite; the application build runs the database gate and refuses to build on failure; pushes touching the database run it again; continuous integration runs every layer, including the proof and model-checking toolchains. This matters doubly in a codebase where AI agents write much of the code: the verification contract binds the agents too, and is designed so that it cannot be satisfied by weakening itself.

5.3What a symbolic proof buys, concretely

The un-invertibility of the provenance ladder (§3.4) makes a good exhibit. The property — "no learned weight w ∈ [0,1] can bend an authored edge's cost to or below the next stronger authored tier" — quantifies over a continuum; no finite test suite can establish it. The Kani harness instead declares w and the tier costs as symbolic values, constrains them only by their types and ranges, and model-checks the assertion for the entire domain at once, the way completing the square proves a fact about every quadratic rather than the ones you tried. The same technique discharges the semilattice laws of §4 — idempotence, commutativity, associativity over all symbolic register pairs — which is exactly the case where testing is weakest and algebra is strongest.

6Zi: proof-carrying code for machine-written programs

Companion report. This section is a summary; the language has its own full treatment — ZAI-TR-2026-02, Zi: A Proof-Carrying Language for Machine-Written Code — covering the type system, the decision procedure, the certificate kernel, and the generation loop in depth.

Parts of Zimac are written by language models, and the uncomfortable truth about generated code is that review does not scale with generation. Zi is the repository's answer: a small language in which a program carries its specification — refinement-typed pre- and postconditions, loop invariants, termination measures — and the toolchain proves the implementation against it before the program may run or ship. The idea descends from proof-carrying code[27]: don't trust the producer, verify the artifact.

The trust story is deliberately minimal. Zi's verifier discharges verification conditions with a dependency-free decision procedure for linear and polynomial arithmetic, and — because a large prover is itself fallible code — every proof is emitted as a certificate that a tiny independent kernel re-checks, in the LCF tradition[28]. For linear arithmetic the certificate is a Farkas witness[29], and the mathematics is pleasingly small: to certify that a system of linear inequalities is contradictory, you exhibit non-negative multipliers λ_i whose weighted sum of the hypotheses collapses to an absurdity such as 0 ≤ −1. Finding the multipliers is the prover's hard work; checking them is a few dot products — high-school algebra — so the trusted base is a few hundred auditable lines, and neither the model that wrote the code nor the prover that found the proof needs to be trusted.

Three design decisions distinguish Zi from using an off-the-shelf verifier. First, security is a type: values carry information-flow labels, so "a secret value never reaches an output" and "untrusted content cannot drive an effect until it passes a declared guard" are compile-time theorems rather than runtime hopes. Second, the undecidability boundary is named rather than hidden: a program may contain admit P — "assume this without proof" — but an admit is a loud, counted marker, never silent; the toolchain scores it as an unproved obligation and the gate (the runtime that executes agent plans) refuses any plan that is not fully proven. Third, verified Zi compiles to native code via Rust with contracts erased — the proof happens once at build time and costs nothing at runtime. Mnema's crash-recovery boundary (§2.1) is the flagship deployment: the database's build script runs the Zi verifier and will not compile the store over an unproven component. The scope claim is kept honest in the language's own specification: Zi proves the code correct relative to the written specification; no system can prove the specification matches human intent, so the design concentrates intent into a small, reviewable spec instead of an unbounded implementation.

7What is not proven

A verification claim is only as credible as its stated boundary. The following limits are lifted directly from the specification's own honesty section, and they are the fine print that makes the rest of this document meaningful.

Bounded, not unbounded. The exhaustive order-checker enumerates all delivery orders only up to 7 operations per scenario, with a deterministic 3,000-order sample beyond. It does not claim to have checked unbounded operation sets.
Kani proofs are bounded model checking, not full deductive proofs, and they cover pure, total functions only — never stateful I/O paths. String- and cryptography-heavy code (the log's formatting and SHA-256 helpers) is deliberately excluded, because the underlying engine models it poorly; those properties are discharged by tests and the differential gate instead.
The TLA⁺ specification is a model, abstracting the wire transport and the byte log, checked over a small finite configuration (3 replicas, a handful of operations). The Rust checker ties the model to the code; it does not extend the model's bounds.
The differential-parity guarantee covers the frozen paths. The newer cognitive machinery (§3) is additive behind flags the parity gate pins off; its correctness rests on the property-test and randomized-sweep layers, not on cross-implementation parity.
Approximate search is approximate. The HNSW index trades a bounded, measured amount of recall for sub-linear query time; the exact scan remains available, and a cold index falls back to it rather than returning a wrong answer.
Specifications, not intent. Every proof in the system — Zi's included — establishes that code satisfies a written property. Whether the property is the right one is a human judgment; the system's contribution is making that judgment about a small spec rather than a large diff.
Benchmarks are point measurements. The LoCoMo numbers in §3.9 are honest single-benchmark results with deterministic runs, not general performance claims.

8The metered proxy and the network boundary

The one component that must live on a server is the model gateway, because its purpose is to keep the provider credential off client machines entirely. The desktop app speaks the standard messages API to the proxy with a per-seat token; the proxy validates the token, checks entitlements, and swaps in the real provider key server-side, per request. A compiled client binary can be reverse-engineered; a credential that was never present cannot be extracted from it.

Metering is designed around a single-ledger invariant: the row that payment credits is the row that usage debits. A prepaid top-up lands as budget in the same ledger the proxy reads and decrements on every call, per seat and per organization pool, enforced at request time — so the number finance sees and the number the enforcement path uses are the same number, and cannot drift apart by construction. Budgets are therefore hard limits rather than after-the-fact reports. Card handling is delegated to the payment processor; card details never touch Zimac.AI infrastructure.

Team features that cross the network keep the same posture as §1. The discovery service is a rendezvous directory and store-and-forward mailbox for replicas that cannot reach each other directly; payloads it relays are end-to-end encrypted, and the service is deliberately unable to decrypt them — it forwards opaque bytes, scoped by per-tenant tokens. The replication cryptography of §4 (signed operations, trust allowlists) rides inside that opaque channel, so the relay is untrusted by design.

References

Anderson, J. R. & Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2(6), 396–408.
Petrov, A. A. (2006). Computationally efficient approximation of the base-level learning equation in ACT-R. Proc. 7th International Conference on Cognitive Modeling.
Ye, J., Su, J. & Cao, Y. (2022). A stochastic shortest path algorithm for optimizing spaced repetition scheduling. Proc. KDD 2022; and the open FSRS-5 model derived from it.
Howard, M. W. & Kahana, M. J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3), 269–299.
Polyn, S. M., Norman, K. A. & Kahana, M. J. (2009). A context maintenance and retrieval model of organizational processes in free recall. Psychological Review, 116(1), 129–156.
Robertson, S. & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.
Lavrenko, V. & Croft, W. B. (2001). Relevance-based language models. Proc. SIGIR 2001.
Cormack, G. V., Clarke, C. L. A. & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proc. SIGIR 2009.
Johnson, M. K., Hashtroudi, S. & Lindsay, D. S. (1993). Source monitoring. Psychological Bulletin, 114(1), 3–28.
Carbonell, J. & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proc. SIGIR 1998.
Kulesza, A. & Taskar, B. (2012). Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3).
Chen, L., Zhang, G. & Zhou, E. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity. Proc. NeurIPS 2018.
Nemhauser, G. L., Wolsey, L. A. & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14, 265–294.
Koriat, A. (1997). Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126(4), 349–370.
Itti, L. & Baldi, P. (2009). Bayesian surprise attracts human attention. Vision Research, 49(10), 1295–1306.
McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3), 419–457.
Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab Technical Report.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008.
Malkov, Y. A. & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 42(4), 824–836.
Weinberger, K., Dasgupta, A., Langford, J., Smola, A. & Attenberg, J. (2009). Feature hashing for large scale multitask learning. Proc. ICML 2009.
Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F. & Fang, Y. (2024). Evaluating very long-term conversational memory of LLM agents (LoCoMo). Proc. ACL 2024.
Shapiro, M., Preguiça, N., Baquero, C. & Zawirski, M. (2011). Conflict-free replicated data types. Proc. 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS).
Kulkarni, S. S., Demirbas, M., Madappa, D., Avva, B. & Leone, M. (2014). Logical physical clocks. Proc. OPODIS 2014.
Lamport, L. (2002). Specifying Systems: The TLA⁺ Language and Tools for Hardware and Software Engineers. Addison-Wesley.
Kroening, D. & Tautschnig, M. (2014). CBMC — C bounded model checker. Proc. TACAS 2014; and the Kani Rust Verifier built on it.
DeMillo, R. A., Lipton, R. J. & Sayward, F. G. (1978). Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4), 34–41.
Necula, G. C. (1997). Proof-carrying code. Proc. POPL 1997.
Gordon, M., Milner, R. & Wadsworth, C. (1979). Edinburgh LCF: A Mechanised Logic of Computation. Springer LNCS 78.
Farkas, J. (1902). Theorie der einfachen Ungleichungen. Journal für die reine und angewandte Mathematik, 124, 1–27.