Field notes · April 2026 · 9 min read

Verifying a 14-Billion Parameter Neural Network On-Chain

Qwen2.5-14B has 281 transformer layers and roughly fourteen billion learned weights. We proved a full forward pass on Starknet Sepolia using Obelyzk. This is the post-mortem.

Victor Amaya · Founding Engineer, BitSage Network
On-chain verification dashboard

On April 17, 2026, transaction 0x5ce1b41815e29a7b3dd03b77187cf32c8c5f0e2607960303174cbea303edfd3 on Starknet Sepolia accepted a proof that Qwen2.5-14B had run a specific forward pass to completion. It was the first time an open-weights model of that scale had been verified on-chain.

The number that matters: 14 billion parameters across 281 transformer layers. We followed it a week later with GLM-4-9B (0x542960d703a62d4beaacf0d9094ea92dc86bf326cd917c533039f4dd1eb4a30). Neither was a benchmark stunt — they were the first two real-model proofs written into the Obelyzk regression suite.

Why on-chain matters here

A proof on a hard drive is a curiosity. A proof on a public blockchain is a credible commitment: anyone can independently re-verify it, the settlement is irreversible, and the system has no privileged operator who can later say "actually that proof was wrong, sorry."

The argument for verifiable inference is the same argument we made for verifiable computing more broadly a decade ago. If you're going to act on an LLM's output — sign a transaction, gate access to a vault, settle a treasury position — you want a cryptographic receipt that the model actually said what someone is claiming it said. We built Obelyzk because we wanted those receipts.

The numbers we had to fit

Starknet's transaction calldata budget is finite. Cairo step counts are finite. The bare structure of Qwen2.5-14B, 14 billion parameters across 281 transformer layers, fits neither.

Naively, the proof for one forward pass would run to hundreds of megabytes. The calldata budget is closer to a few thousand felts. We needed compression.
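To put rough numbers on the gap, here is a back-of-envelope sketch. The ~31 usable bytes per felt and the 300 MB naive-proof figure are illustrative assumptions, not measurements:

```rust
// Back-of-envelope for the gap, with illustrative constants: a Starknet
// felt carries just under 252 bits, call it ~31 usable bytes of payload.
const FELT_BYTES: u64 = 31;
const CALLDATA_FELTS: u64 = 3_000; // "a few thousand felts"
const NAIVE_PROOF_BYTES: u64 = 300 * 1024 * 1024; // "hundreds of megabytes"

fn main() {
    let budget = CALLDATA_FELTS * FELT_BYTES;
    println!("calldata budget ≈ {} KiB", budget / 1024); // ≈ 90 KiB
    println!("compression needed ≈ {}x", NAIVE_PROOF_BYTES / budget); // ≈ 3,300x
}
```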

Chunked weight binding

The trick is to never put the weights on-chain. Instead, we commit to the weights once (a Poseidon Merkle root over the canonical weight order) and bind every subsequent proof to that root. The prover proves "this forward pass used the weights committed to in root 0xabc...", not "here are 14B weights and a forward pass".

To do that without materializing all 14B weights into a single MLE, we split the weights into ~4,096-element chunks, hash each chunk into the Merkle tree, and prove inclusion of just the chunks that participated in a given matmul. For Qwen2.5-14B, that's roughly 3.4 million chunks. The Merkle root is 32 bytes.
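A minimal sketch of that commitment, with SHA-256 standing in for Poseidon and every name here invented for illustration rather than taken from Obelyzk:

```rust
// Sketch of chunked weight commitment. SHA-256 stands in for Poseidon;
// the real path hashes 4,096-element chunks in canonical weight order.
use sha2::{Digest, Sha256};

const CHUNK: usize = 4096; // elements per chunk, as in the post

fn hash_chunk(chunk: &[u64]) -> [u8; 32] {
    let mut h = Sha256::new();
    for w in chunk {
        h.update(w.to_le_bytes());
    }
    h.finalize().into()
}

fn hash_pair(l: &[u8; 32], r: &[u8; 32]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(l);
    h.update(r);
    h.finalize().into()
}

/// Fold the ~3.4M chunk hashes into a single 32-byte Merkle root.
fn merkle_root(weights: &[u64]) -> [u8; 32] {
    let mut level: Vec<[u8; 32]> = weights.chunks(CHUNK).map(hash_chunk).collect();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|p| if p.len() == 2 { hash_pair(&p[0], &p[1]) } else { p[0] })
            .collect();
    }
    level[0]
}
```

An inclusion proof for any single chunk is then just its path to the root: ⌈log₂(3.4M)⌉ = 22 sibling hashes, regardless of model size.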

Why this works. The model is public. Anyone can recompute the Merkle root from the weights themselves and check it matches the one the contract was deployed with. The prover then only needs to convince the verifier that they used those weights — not any weights.

Recursive STARK aggregation

Even with chunked weights, a 281-layer forward pass produces a proof that won't fit in a single L2 transaction. So we recurse.

Each layer produces a sub-proof. Sub-proofs are aggregated in pairs by a recursive verifier — itself a STARK circuit that verifies two STARKs and emits a single STARK. After ⌈log₂(281)⌉ = 9 levels of aggregation, you have one proof that attests to the entire pass.
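In sketch form, the aggregation loop looks like this. Proof and fold_pair are stand-ins for the real recursive-verifier circuit, not Obelyzk's types:

```rust
// Sketch of pairwise aggregation; `fold_pair` models the recursive
// verifier (verify two STARKs, emit one) as an opaque combination.
#[derive(Clone)]
struct Proof(Vec<u8>); // opaque STARK proof bytes

fn fold_pair(a: &Proof, b: &Proof) -> Proof {
    // In the real system this runs the recursive verifier circuit.
    Proof([a.0.as_slice(), b.0.as_slice()].concat())
}

/// 281 layer proofs -> one proof, in ⌈log₂(281)⌉ = 9 rounds of pairing.
fn aggregate(mut proofs: Vec<Proof>) -> Proof {
    while proofs.len() > 1 {
        proofs = proofs
            .chunks(2)
            .map(|p| if p.len() == 2 { fold_pair(&p[0], &p[1]) } else { p[0].clone() })
            .collect();
    }
    proofs.remove(0)
}
```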

The recursive verifier was the hardest single component. We're now at 160-bit security with a 48-column trace and two-level recursion. The final proof verifies on Starknet Sepolia in 942 felts of calldata, a 49× compression over the unaggregated stream.

The Cairo verifier on L2

The contract on Starknet — class hash 0x0300ff964fe615d094af601074b76b7193b564e0c7215c7b98bc046334c35bcf — re-executes the full STARK verifier from inside Cairo. Every polynomial evaluation, every Merkle path, every FRI fold, the proof-of-work nonce. The contract does not merely record proofs: a malicious prover that submits a wrong proof is rejected by the contract itself, not by an off-chain watcher.
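For a flavor of what one such check looks like, here is a single FRI fold consistency test sketched in Rust over the Goldilocks prime (chosen for u128 convenience; Cairo's native field is much larger, and all names here are illustrative, not the contract's code):

```rust
// One FRI folding consistency check over a toy prime field. The on-chain
// verifier performs checks of this shape, plus the Merkle-path and
// proof-of-work checks, for every query.
const P: u128 = 18_446_744_069_414_584_321; // Goldilocks prime 2^64 - 2^32 + 1

fn add(a: u128, b: u128) -> u128 { (a + b) % P }
fn sub(a: u128, b: u128) -> u128 { (a + P - b) % P }
fn mul(a: u128, b: u128) -> u128 { (a * b) % P } // a, b < 2^64: no overflow
fn pow(mut b: u128, mut e: u128) -> u128 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = mul(acc, b); }
        b = mul(b, b);
        e >>= 1;
    }
    acc
}
fn inv(a: u128) -> u128 { pow(a, P - 2) } // Fermat inverse, a != 0

/// The claimed next-layer value at x^2 must equal f_even + beta * f_odd,
/// where both halves are recovered from f(x) and f(-x); x must be nonzero.
fn fri_fold_ok(f_x: u128, f_neg_x: u128, x: u128, beta: u128, next: u128) -> bool {
    let f_even = mul(add(f_x, f_neg_x), inv(2));
    let f_odd = mul(sub(f_x, f_neg_x), inv(mul(2, x)));
    next == add(f_even, mul(beta, f_odd))
}
```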

The contract is small. The Cairo verifier compiles to roughly 24 KB of bytecode plus a 942-felt calldata budget per verification. Two transactions can verify two independent inferences side by side. The cost on Sepolia is approximately 0.001 STRK per verification, which is small enough that we're already using it as a regression check in CI.

What we'd do differently

Three things in hindsight.

1. Start with the recursive verifier, not the leaf prover. We spent four months making leaf proofs fast, then discovered the recursive verifier was the bottleneck. The right order is: design the recursive verifier first, then design the leaf prover to produce shapes the recursive verifier likes.

2. Streaming, not batch. Our first version generated all 281 layer proofs, then aggregated them. The right approach is to stream-aggregate as proofs come out — the recursion tree consumes proofs incrementally, so peak memory stays bounded and we don't have to serialize gigabytes of intermediate proofs to disk. A sketch follows this list.

3. The model loader is critical. Half of the bugs we shipped came from the model-loading layer, not the proving layer. Different head dimensions, different LayerNorm vs RMSNorm conventions, weights transposed for the wrong matmul. We've now invested in a strict HuggingFace-to-Obelyzk import path that explicitly tags every weight with its role in the forward pass.
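Here is the stream-aggregation idea in sketch form, reusing the Proof and fold_pair stand-ins from the aggregation sketch above. The structure is the same trick as an incremental Merkle tree: keep at most one unpaired proof per recursion level.

```rust
// Stream-aggregation sketch: fold proofs the moment they arrive, holding
// at most one pending proof per level (~9 slots for 281 leaves), so peak
// memory is O(log n) proofs instead of all 281 at once.
struct Aggregator {
    pending: Vec<Option<Proof>>, // slot i holds an unpaired proof at level i
}

impl Aggregator {
    fn new() -> Self { Self { pending: Vec::new() } }

    /// Consume one leaf proof as soon as the leaf prover emits it.
    fn push(&mut self, mut proof: Proof) {
        let mut level = 0;
        loop {
            if self.pending.len() <= level { self.pending.push(None); }
            match self.pending[level].take() {
                Some(waiting) => {
                    proof = fold_pair(&waiting, &proof); // sibling found: fold up
                    level += 1;
                }
                None => {
                    self.pending[level] = Some(proof); // wait for a sibling
                    return;
                }
            }
        }
    }

    /// After the last leaf, fold whatever is still pending into one proof.
    fn finish(mut self) -> Proof {
        let mut acc: Option<Proof> = None;
        for slot in self.pending.drain(..).flatten() {
            acc = Some(match acc {
                Some(a) => fold_pair(&slot, &a),
                None => slot,
            });
        }
        acc.expect("at least one proof was pushed")
    }
}
```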

What this enables

The unlock here isn't that we can now spend the gas to verify a chat response. It's that we can build systems where a smart contract conditions its own behavior on the output of an LLM, and the contract enforces that the LLM ran correctly. Treasury vaults gated by AI agents. Prediction markets that resolve on verifiable LLM outputs. Game logic that integrates real model inference without trusting an oracle.

The path from here is the dull engineering one: more models supported, faster proving, cheaper verification. None of it is conceptually novel after Qwen2.5-14B. All of it is the work of the next year.

Verifiable AI is no longer a bet on a paper. It's a build-out.

Source, regression suite, and the Sepolia transaction trail at crates.io/crates/obelyzk.