A year ago, proving that a neural network of any interesting size had produced a given output on-chain was an academic exercise. Today it's a transaction. This is a tour of how we got there, told from inside the proving system, not from the cryptography-paper side.
The plan: explain why GKR sumcheck beats naive STARK arithmetization for matrix-heavy workloads, why Circle STARKs over M31 are a quiet revolution for the GPU, and what it costs in engineering effort once you commit to that path.
Why GKR, not "just" STARKs
A transformer forward pass is roughly 90% matrix-matrix multiplications. Encoding a matmul as a STARK execution trace gives you a table with one row per scalar multiplication. For a 4096×4096 × 4096×4096 matmul that's 4096³ ≈ 68.7 billion rows. Even with the best polynomial commitment scheme on the market, you're not committing to 68 billion rows.
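The arithmetic, for the record (assuming the naive one-row-per-multiply-accumulate encoding described above):

```rust
fn main() {
    // Naive arithmetization: one trace row per scalar multiply-accumulate,
    // and an n×n · n×n matmul performs n³ of them.
    let n: u64 = 4096;
    let rows = n * n * n;
    assert_eq!(rows, 68_719_476_736); // ≈ 68.7 billion rows
    println!("{rows} rows");
}
```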
GKR sumcheck (Goldwasser–Kalai–Rothblum, 2008) reformulates the same statement: instead of tracing every multiplication, you fold the matmul into a multilinear polynomial identity and prove that identity via a series of log₂(n) sumcheck rounds. Verifier cost drops from O(n³) to O(n · log n). Prover cost stays roughly O(n³) in raw work, but the work is dense, regular, and maps beautifully onto a GPU.
What STWO gives us
STWO is StarkWare's next-generation Circle STARK prover. Three things matter for our story:
- Mersenne-31 (M31). The native field is p = 2³¹ − 1. Field multiplications fit in 64-bit integer registers with no Montgomery dance.
- Circle group. Roots of unity live on the unit circle x² + y² = 1 over the M31 extension. Polynomial evaluation domains are tiny and friendly to FFTs.
- Transparent setup. No trusted ceremony. FRI for the polynomial commitment scheme. Pure cryptographic assumptions.
M31 is the unsung hero. On a modern NVIDIA H100, an M31 multiplication is roughly (a * b) % p on 64-bit integers, then a single fused-multiply-subtract to reduce. We measured 14× throughput against the same kernel running on BN254. For a 14-billion-parameter Qwen forward pass, that is the difference between a proof you can run on a single H100 and one that needs a multi-GPU cluster.
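A scalar sketch of that multiplication, relying on the standard Mersenne fold (2³¹ ≡ 1 mod p, so a 62-bit product reduces as hi + lo). The function name is ours; the real kernel is CUDA.

```rust
const P: u64 = (1 << 31) - 1; // Mersenne-31

// Multiply two M31 elements using only 64-bit integer ops.
// Since 2^31 ≡ 1 (mod p), the product x = hi·2^31 + lo reduces
// to hi + lo; two folds plus one conditional subtraction suffice.
fn m31_mul(a: u64, b: u64) -> u64 {
    debug_assert!(a < P && b < P);
    let x = a * b;                              // < 2^62, fits in u64
    let folded = (x >> 31) + (x & P);           // first fold: < 2^32
    let folded = (folded >> 31) + (folded & P); // second fold: ≤ 2^31
    if folded >= P { folded - P } else { folded }
}

fn main() {
    assert_eq!(m31_mul(2, 3), 6);
    assert_eq!(m31_mul(P - 1, P - 1), 1); // (−1)·(−1) = 1
    println!("ok");
}
```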
The sumcheck round, plainly
Given a multilinear polynomial P(x₁, x₂, …, xₙ) over the boolean hypercube, the prover wants to convince the verifier that ∑ P(x) = H for some claimed value H. The protocol runs n rounds. In round i:
- The prover sends a univariate polynomial gᵢ(Xᵢ) that is supposed to be the sum of P with all earlier variables fixed and all later variables summed.
- The verifier checks that gᵢ(0) + gᵢ(1) equals the value the previous round predicted.
- The verifier picks a random rᵢ and the protocol recurses on the smaller polynomial.
At the end of n rounds, the verifier needs to evaluate P(r₁, r₂, …, rₙ) at a single point. For a transformer matmul, the polynomial P is the dot product of two MLEs (multilinear extensions) — one for the weight matrix, one for the activation vector. Evaluating it at a random point is one polynomial commitment opening per operand, which is exactly what STWO is great at.
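The round structure can be run end-to-end on a toy example. This is a hedged sketch, not STWO code: the polynomial is multilinear (so each gᵢ has degree 1 and two evaluations are the whole prover message), the field arithmetic is naive `% p`, and the challenges are fixed constants standing in for a Fiat-Shamir transcript.

```rust
// Toy sumcheck over a multilinear polynomial given as its evaluation
// table on the boolean hypercube {0,1}^n. Field: integers mod M31.
const P: u64 = (1 << 31) - 1;
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

/// Fix the first remaining variable to r:
/// evals'[i] = evals[i] + r·(evals[i + half] − evals[i]).
fn restrict(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    (0..half)
        .map(|i| add(evals[i], mul(r, sub(evals[i + half], evals[i]))))
        .collect()
}

fn main() {
    // P(x₁, x₂) stored as [P(0,0), P(0,1), P(1,0), P(1,1)]
    let mut evals = vec![1u64, 2, 3, 4];
    let mut claim = evals.iter().fold(0, |a, &b| add(a, b)); // H = 10
    for &r in &[5u64, 7] { // stand-in verifier challenges
        let half = evals.len() / 2;
        let g0 = evals[..half].iter().fold(0, |a, &b| add(a, b)); // gᵢ(0)
        let g1 = evals[half..].iter().fold(0, |a, &b| add(a, b)); // gᵢ(1)
        assert_eq!(add(g0, g1), claim); // the verifier's round check
        claim = add(g0, mul(r, sub(g1, g0))); // next claim: gᵢ(r)
        evals = restrict(&evals, r);
    }
    // End of protocol: one evaluation of P at (r₁, r₂) remains.
    assert_eq!(evals, vec![claim]);
    println!("sumcheck ok, final claim = {claim}");
}
```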
What we actually built
Obelyzk's prover is roughly three layers:
┌──────────────────────────────────────────┐
│ ML layer (Python, glue, model loading) │
├──────────────────────────────────────────┤
│ GKR circuits (Rust) │
│ matmul · attention · LayerNorm · SiLU │
├──────────────────────────────────────────┤
│ STWO + custom CUDA (Rust + .cu kernels) │
│ MLE eval · sumcheck rounds · Poseidon │
└──────────────────────────────────────────┘
The Python layer takes a model and a tokenized input, breaks the forward pass into proof segments, and dispatches each segment to the Rust GKR layer. The GKR layer materializes the MLEs, runs the sumcheck protocol, and calls down to the CUDA kernels for the heavy lifting.
Three kernels that matter
MLE restriction. Once the verifier picks a random challenge r, the prover has to "restrict" the multilinear polynomial at that point. This is a fold across the hypercube. The kernel is straightforward map-reduce, but its runtime is dominated by memory bandwidth. We pin the polynomial in HBM, write the result in-place, and reach 92% of theoretical bandwidth on H100.
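The fold itself, in scalar Rust as a stand-in for the CUDA kernel: fix the first remaining variable to r, writing into the first half of the buffer. The function name is ours.

```rust
const P: u64 = (1 << 31) - 1; // Mersenne-31

// In-place fold of a multilinear evaluation table at challenge r.
// One linear pass: evals[i] <- evals[i] + r·(evals[i + half] − evals[i]),
// then the table shrinks to half its length.
fn restrict_in_place(evals: &mut Vec<u64>, r: u64) {
    let half = evals.len() / 2;
    for i in 0..half {
        let lo = evals[i];
        let hi = evals[i + half];
        evals[i] = (lo + r * ((hi + P - lo) % P)) % P;
    }
    evals.truncate(half);
}

fn main() {
    let mut e = vec![1u64, 2, 3, 4];
    restrict_in_place(&mut e, 5); // fix the first variable to 5
    assert_eq!(e, vec![11, 12]);
    println!("{e:?}");
}
```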
Sumcheck round. Each round computes gᵢ(0), gᵢ(1), and the leading coefficient. With M31, this is three parallel reductions over half the polynomial. Naively, three kernel launches. We fuse them into a single launch with warp-level shuffles, cutting launch overhead by ~3×.
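For the matmul case, P is a product of two MLEs, so gᵢ has degree 2 and three evaluations pin it down. A scalar sketch of the fused traversal, computing gᵢ(0), gᵢ(1), gᵢ(2) in one pass over half of each table (gᵢ(2) carries the same information as the leading coefficient); the names and that choice are ours, not STWO's.

```rust
const P: u64 = (1 << 31) - 1;
fn mulm(a: u64, b: u64) -> u64 { (a * b) % P }
fn addm(a: u64, b: u64) -> u64 { (a + b) % P }

// One fused pass per round for P = f·g: accumulate the three
// evaluations together instead of in three separate traversals.
fn round_evals(f: &[u64], g: &[u64]) -> (u64, u64, u64) {
    let half = f.len() / 2;
    let (mut e0, mut e1, mut e2) = (0u64, 0, 0);
    for i in 0..half {
        let (f0, f1) = (f[i], f[i + half]);
        let (g0, g1) = (g[i], g[i + half]);
        e0 = addm(e0, mulm(f0, g0)); // X = 0
        e1 = addm(e1, mulm(f1, g1)); // X = 1
        // X = 2: each factor is linear, so extrapolate as 2·v1 − v0
        let f2 = (2 * f1 + 2 * P - f0) % P;
        let g2 = (2 * g1 + 2 * P - g0) % P;
        e2 = addm(e2, mulm(f2, g2));
    }
    (e0, e1, e2)
}

fn main() {
    // f = 1 + X, g = 3 + X on one variable: gᵢ(X) = (1+X)(3+X)
    let (e0, e1, e2) = round_evals(&[1, 2], &[3, 4]);
    assert_eq!((e0, e1, e2), (3, 8, 15));
    println!("{e0} {e1} {e2}");
}
```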
Poseidon-Merkle commit. The polynomial commitment scheme needs Merkle trees over Poseidon hashes. Poseidon2-M31 has a tight permutation that fits inside a single warp — 32 threads, one tree layer at a time. The whole commitment for a 2²⁵-coefficient polynomial fits in under 800 ms on H100.
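The layer-at-a-time tree build looks like this in outline. The compression function below is a toy mixer, NOT Poseidon2-M31, and the function names are ours; it only illustrates the structure the warp-per-permutation kernel walks.

```rust
// Stand-in two-to-one compression (placeholder for the real
// Poseidon2-M31 permutation; do not use for anything cryptographic).
fn compress(l: u64, r: u64) -> u64 {
    (l.rotate_left(17) ^ r).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

// Build the Merkle root one layer at a time, halving each step —
// the same shape the GPU kernel processes, one tree layer per launch.
fn merkle_root(mut layer: Vec<u64>) -> u64 {
    assert!(layer.len().is_power_of_two());
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|p| compress(p[0], p[1])).collect();
    }
    layer[0]
}

fn main() {
    let leaves: Vec<u64> = (0..8).collect();
    let root = merkle_root(leaves);
    println!("root = {root:016x}");
}
```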
What it costs you to build this
Honest accounting:
- The GKR layer in Rust is roughly 8,000 lines, plus another 6,000 for circuit-specific code (attention, layer norm, activations).
- The CUDA kernels are roughly 20,000 lines.
- Test coverage is 950+ tests, mostly checking that the prover output round-trips through the verifier and that the Cairo on-chain verifier accepts everything.
- The Cairo verifier — the contract that actually runs on Starknet — is a separate concern, with its own debug cycle. We'll save that for another talk.
Open problems
Three things are still unsolved or under-baked.
1. Recursive verification cost. We recurse proofs to keep them small enough for on-chain verification. The recursive verifier itself is a STARK circuit, and writing one in Cairo that's fast enough and correct is hard. We're at 3.5 s per recursive layer on a single H100, which is acceptable for batch verification but feels like there's another 2× left.
2. Quantization. Real production ML runs in INT8 or BF16. Most ZKML work runs in field arithmetic, which means quantizing the model down to fixed-point. The fixed-point itself is fine; the accumulators are where everything subtly goes wrong if you're not careful.
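A toy illustration of the accumulator problem, nothing Obelyzk-specific: an INT8 dot product whose correct value needs more than 16 bits. An i16 accumulator wraps silently (modelled here with `wrapping_add`, which is what release-mode `+` does); i32 is the honest width.

```rust
fn main() {
    // 300 products of 127·127 = 16_129 each: every product fits in
    // i16, but the running sum blows far past i16::MAX = 32_767.
    let a = [127i8; 300];
    let b = [127i8; 300];

    let acc32: i32 = a.iter().zip(&b).map(|(&x, &y)| x as i32 * y as i32).sum();
    assert_eq!(acc32, 127 * 127 * 300); // 4_838_700

    let acc16: i16 = a.iter().zip(&b)
        .map(|(&x, &y)| x as i16 * y as i16)
        .fold(0i16, |s, v| s.wrapping_add(v));
    assert_ne!(acc16 as i32, acc32); // the i16 accumulator wrapped

    println!("i32: {acc32}, wrapped i16: {acc16}");
}
```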
3. Trust assumptions. Verifiable inference is only as good as the binding between the published model weights and the weights the prover used. We hash the weights into the proof, but the hash has to be the canonical one — and "canonical" for a 14B-parameter model is surprisingly negotiable.
The interesting moment with ZK proving today is not the math. The math is settled. The interesting moment is the engineering that makes the math run in production at 14 billion parameters.
Slides, demos, and a runnable notebook will be linked from the crates.io page once they're cleaned up. Questions welcome at victor.amaya@ciroai.us.