A year ago, proving that a neural network of any interesting size had produced a given output on-chain was an academic exercise. Today it's a transaction. This is a tour of how we got there, told from inside the proving system, not from the cryptography-paper side.
The plan: explain why GKR sumcheck beats naive STARK arithmetization for matrix-heavy workloads, why Circle STARKs over M31 are a quiet revolution for the GPU, and what it costs in engineering effort once you commit to that path.
Why GKR, not "just" STARKs
A transformer forward pass is roughly 90% matrix-matrix multiplications. Encoding a matmul as a STARK execution trace gives you a table with one row per scalar multiplication. For a 4096×4096 × 4096×4096 matmul that's 4096³ ≈ 68.7 billion rows. Even with the best polynomial commitment scheme on the market, you're not committing to 68 billion rows.
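The arithmetic, for the record (assuming the naive one-row-per-multiply-accumulate encoding described above):

```rust
fn main() {
    // Naive arithmetization: one trace row per scalar multiply-accumulate,
    // and an n×n · n×n matmul performs n³ of them.
    let n: u64 = 4096;
    let rows = n * n * n;
    assert_eq!(rows, 68_719_476_736); // ≈ 68.7 billion rows
    println!("{rows} rows");
}
```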
GKR sumcheck (Goldwasser–Kalai–Rothblum, 2008) reformulates the same statement: instead of tracing every multiplication, you fold the matmul into a multilinear polynomial identity and prove that identity via a series of log₂(n) sumcheck rounds. Verifier cost drops from O(n³) to O(n · log n). Prover cost stays roughly O(n³) in raw work, but the work is dense, regular, and maps beautifully onto a GPU.
What STWO gives us
STWO is StarkWare's next-generation Circle STARK prover. Three things matter for our story:
- Mersenne-31 (M31). The native field is p = 2³¹ − 1. Field multiplications fit in 64-bit integer registers with no Montgomery dance.
- Circle group. Roots of unity live on the unit circle x² + y² = 1 over the M31 extension. Polynomial evaluation domains are tiny and friendly to FFTs.
- Transparent setup. No trusted ceremony. FRI for the polynomial commitment scheme. Pure cryptographic assumptions.
M31 is the unsung hero. On a modern NVIDIA H100, an M31 multiplication is roughly (a * b) % p on 64-bit integers, then a single fused-multiply-subtract to reduce. We measured 14× throughput against the same kernel running on BN254. For a 14-billion-parameter Qwen forward pass, that is the difference between a proof you can run on a single H100 and one that needs a multi-GPU cluster.
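A scalar sketch of that multiplication, relying on the standard Mersenne fold (2³¹ ≡ 1 mod p, so a 62-bit product reduces as hi + lo). The function name is ours; the real kernel is CUDA.

```rust
const P: u64 = (1 << 31) - 1; // Mersenne-31

// Multiply two M31 elements using only 64-bit integer ops.
// Since 2^31 ≡ 1 (mod p), the product x = hi·2^31 + lo reduces
// to hi + lo; two folds plus one conditional subtraction suffice.
fn m31_mul(a: u64, b: u64) -> u64 {
    debug_assert!(a < P && b < P);
    let x = a * b;                              // < 2^62, fits in u64
    let folded = (x >> 31) + (x & P);           // first fold: < 2^32
    let folded = (folded >> 31) + (folded & P); // second fold: ≤ 2^31
    if folded >= P { folded - P } else { folded }
}

fn main() {
    assert_eq!(m31_mul(2, 3), 6);
    assert_eq!(m31_mul(P - 1, P - 1), 1); // (−1)·(−1) = 1
    println!("ok");
}
```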
The sumcheck round, plainly
Given a multilinear polynomial P(x₁, x₂, …, xₙ) over the boolean hypercube, the prover wants to convince the verifier that ∑ P(x) = H for some claimed value H. The protocol runs n rounds. In round i:
- The prover sends a univariate polynomial gᵢ(Xᵢ) that is supposed to be the sum of P with all earlier variables fixed and all later variables summed.
- The verifier checks that gᵢ(0) + gᵢ(1) equals the value the previous round predicted.
- The verifier picks a random rᵢ and the protocol recurses on the smaller polynomial.
At the end of n rounds, the verifier needs to evaluate P(r₁, r₂, …, rₙ) at a single point. For a transformer matmul, the polynomial P is the dot product of two MLEs (multilinear extensions) — one for the weight matrix, one for the activation vector. Evaluating it at a random point is one polynomial commitment opening per operand, which is exactly what STWO is great at.
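The round structure can be run end-to-end on a toy example. This is a hedged sketch, not STWO code: the polynomial is multilinear (so each gᵢ has degree 1 and two evaluations are the whole prover message), the field arithmetic is naive `% p`, and the challenges are fixed constants standing in for a Fiat-Shamir transcript.

```rust
// Toy sumcheck over a multilinear polynomial given as its evaluation
// table on the boolean hypercube {0,1}^n. Field: integers mod M31.
const P: u64 = (1 << 31) - 1;
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

/// Fix the first remaining variable to r:
/// evals'[i] = evals[i] + r·(evals[i + half] − evals[i]).
fn restrict(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    (0..half)
        .map(|i| add(evals[i], mul(r, sub(evals[i + half], evals[i]))))
        .collect()
}

fn main() {
    // P(x₁, x₂) stored as [P(0,0), P(0,1), P(1,0), P(1,1)]
    let mut evals = vec![1u64, 2, 3, 4];
    let mut claim = evals.iter().fold(0, |a, &b| add(a, b)); // H = 10
    for &r in &[5u64, 7] { // stand-in verifier challenges
        let half = evals.len() / 2;
        let g0 = evals[..half].iter().fold(0, |a, &b| add(a, b)); // gᵢ(0)
        let g1 = evals[half..].iter().fold(0, |a, &b| add(a, b)); // gᵢ(1)
        assert_eq!(add(g0, g1), claim); // the verifier's round check
        claim = add(g0, mul(r, sub(g1, g0))); // next claim: gᵢ(r)
        evals = restrict(&evals, r);
    }
    // End of protocol: one evaluation of P at (r₁, r₂) remains.
    assert_eq!(evals, vec![claim]);
    println!("sumcheck ok, final claim = {claim}");
}
```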
What we actually built
Obelyzk's prover is roughly three layers:
┌──────────────────────────────────────────┐
│ ML layer (Python, glue, model loading) │
├──────────────────────────────────────────┤
│ GKR circuits (Rust) │
│ matmul · attention · LayerNorm · SiLU │
├──────────────────────────────────────────┤
│ STWO + custom CUDA (Rust + .cu kernels) │
│ MLE eval · sumcheck rounds · Poseidon │
└──────────────────────────────────────────┘
The Python layer takes a model and a tokenized input, breaks the forward pass into proof segments, and dispatches each segment to the Rust GKR layer. The GKR layer materializes the MLEs, runs the sumcheck protocol, and calls down to the CUDA kernels for the heavy lifting.
Three kernels that matter
MLE restriction. Once the verifier picks a random challenge r, the prover has to "restrict" the multilinear polynomial at that point. This is a fold across the hypercube. The kernel is straightforward map-reduce, but its runtime is dominated by memory bandwidth. We pin the polynomial in HBM, write the result in-place, and reach 92% of theoretical bandwidth on H100.
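The fold itself, in scalar Rust as a stand-in for the CUDA kernel: fix the first remaining variable to r, writing into the first half of the buffer. The function name is ours.

```rust
const P: u64 = (1 << 31) - 1; // Mersenne-31

// In-place fold of a multilinear evaluation table at challenge r.
// One linear pass: evals[i] <- evals[i] + r·(evals[i + half] − evals[i]),
// then the table shrinks to half its length.
fn restrict_in_place(evals: &mut Vec<u64>, r: u64) {
    let half = evals.len() / 2;
    for i in 0..half {
        let lo = evals[i];
        let hi = evals[i + half];
        evals[i] = (lo + r * ((hi + P - lo) % P)) % P;
    }
    evals.truncate(half);
}

fn main() {
    let mut e = vec![1u64, 2, 3, 4];
    restrict_in_place(&mut e, 5); // fix the first variable to 5
    assert_eq!(e, vec![11, 12]);
    println!("{e:?}");
}
```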
Sumcheck round. Each round computes gᵢ(0), gᵢ(1), and the leading coefficient. With M31, this is three parallel reductions over half the polynomial. Naively, three kernel launches. We fuse them into a single launch with warp-level shuffles, cutting launch overhead by ~3×.
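For the matmul case, P is a product of two MLEs, so gᵢ has degree 2 and three evaluations pin it down. A scalar sketch of the fused traversal, computing gᵢ(0), gᵢ(1), gᵢ(2) in one pass over half of each table (gᵢ(2) carries the same information as the leading coefficient); the names and that choice are ours, not STWO's.

```rust
const P: u64 = (1 << 31) - 1;
fn mulm(a: u64, b: u64) -> u64 { (a * b) % P }
fn addm(a: u64, b: u64) -> u64 { (a + b) % P }

// One fused pass per round for P = f·g: accumulate the three
// evaluations together instead of in three separate traversals.
fn round_evals(f: &[u64], g: &[u64]) -> (u64, u64, u64) {
    let half = f.len() / 2;
    let (mut e0, mut e1, mut e2) = (0u64, 0, 0);
    for i in 0..half {
        let (f0, f1) = (f[i], f[i + half]);
        let (g0, g1) = (g[i], g[i + half]);
        e0 = addm(e0, mulm(f0, g0)); // X = 0
        e1 = addm(e1, mulm(f1, g1)); // X = 1
        // X = 2: each factor is linear, so extrapolate as 2·v1 − v0
        let f2 = (2 * f1 + 2 * P - f0) % P;
        let g2 = (2 * g1 + 2 * P - g0) % P;
        e2 = addm(e2, mulm(f2, g2));
    }
    (e0, e1, e2)
}

fn main() {
    // f = 1 + X, g = 3 + X on one variable: gᵢ(X) = (1+X)(3+X)
    let (e0, e1, e2) = round_evals(&[1, 2], &[3, 4]);
    assert_eq!((e0, e1, e2), (3, 8, 15));
    println!("{e0} {e1} {e2}");
}
```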
Poseidon-Merkle commit. The polynomial commitment scheme needs Merkle trees over Poseidon hashes. Poseidon2-M31 has a tight permutation that fits inside a single warp — 32 threads, one tree layer at a time. The whole commitment for a 2²⁵-coefficient polynomial fits in under 800 ms on H100.
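The layer-at-a-time tree build looks like this in outline. The compression function below is a toy mixer, NOT Poseidon2-M31, and the function names are ours; it only illustrates the structure the warp-per-permutation kernel walks.

```rust
// Stand-in two-to-one compression (placeholder for the real
// Poseidon2-M31 permutation; do not use for anything cryptographic).
fn compress(l: u64, r: u64) -> u64 {
    (l.rotate_left(17) ^ r).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

// Build the Merkle root one layer at a time, halving each step —
// the same shape the GPU kernel processes, one tree layer per launch.
fn merkle_root(mut layer: Vec<u64>) -> u64 {
    assert!(layer.len().is_power_of_two());
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|p| compress(p[0], p[1])).collect();
    }
    layer[0]
}

fn main() {
    let leaves: Vec<u64> = (0..8).collect();
    let root = merkle_root(leaves);
    println!("root = {root:016x}");
}
```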
What it costs you to build this
Honest accounting:
- The GKR layer in Rust is roughly 8,000 lines, plus another 6,000 for circuit-specific code (attention, layer norm, activations).
- The CUDA kernels are roughly 20,000 lines.
- Test coverage is 950+ tests, mostly checking that the prover output round-trips through the verifier and that the Cairo on-chain verifier accepts everything.
- The Cairo verifier — the contract that actually runs on Starknet — is a separate concern, with its own debug cycle. We'll save that for another talk.
Open problems
Three things are still unsolved or under-baked.
1. Recursive verification cost. We recurse proofs to keep them small enough for on-chain verification. The recursive verifier itself is a STARK circuit, and writing one in Cairo that's fast enough and correct is hard. We're at 3.5 s per recursive layer on a single H100, which is acceptable for batch verification but feels like there's another 2× left.
2. Quantization. Real production ML runs in INT8 or BF16. Most ZKML work runs in field arithmetic, which means quantizing the model down to fixed-point. The fixed-point itself is fine; the accumulators are where everything subtly goes wrong if you're not careful.
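A toy illustration of the accumulator problem, nothing Obelyzk-specific: an INT8 dot product whose correct value needs more than 16 bits. An i16 accumulator wraps silently (modelled here with `wrapping_add`, which is what release-mode `+` does); i32 is the honest width.

```rust
fn main() {
    // 300 products of 127·127 = 16_129 each: every product fits in
    // i16, but the running sum blows far past i16::MAX = 32_767.
    let a = [127i8; 300];
    let b = [127i8; 300];

    let acc32: i32 = a.iter().zip(&b).map(|(&x, &y)| x as i32 * y as i32).sum();
    assert_eq!(acc32, 127 * 127 * 300); // 4_838_700

    let acc16: i16 = a.iter().zip(&b)
        .map(|(&x, &y)| x as i16 * y as i16)
        .fold(0i16, |s, v| s.wrapping_add(v));
    assert_ne!(acc16 as i32, acc32); // the i16 accumulator wrapped

    println!("i32: {acc32}, wrapped i16: {acc16}");
}
```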
3. Trust assumptions. Verifiable inference is only as good as the binding between the published model weights and the weights the prover used. We hash the weights into the proof, but the hash has to be the canonical one — and "canonical" for a 14B-parameter model is surprisingly negotiable.
The interesting moment with ZK proving today is not the math. The math is settled. The interesting moment is the engineering that makes the math run in production at 14 billion parameters.
Slides, demos, and a runnable notebook will be linked from the crates.io page once they're cleaned up. Questions welcome at victor.amaya@ciroai.us.