Engineering · March 2026 · 8 min read

Custom CUDA Kernels for Proving over M31

The Mersenne-31 field reduces modulo 2³¹ − 1. That means every field multiplication fits in a single 64-bit register — no carries, no Montgomery dance. Here's what we did with that on H100s.

Victor Amaya · Founding Engineer, BitSage Network
[Figure: CUDA kernel architecture]

Most ZK proving stacks pay a tax on every field operation: a BN254-friendly field needs Montgomery reduction, which costs extra multiplications and a conditional subtraction on every "multiply." On a GPU, those extra operations break instruction-level parallelism and balloon register pressure.

M31 (Mersenne-31, p = 2³¹ − 1) gives you a different deal. Reduction modulo 2³¹ − 1 is just the fold x = (x & p) + (x >> 31): a mask, a shift, an add, no branches. The product of two M31 elements fits in 62 bits, so the whole multiply lives in a single 64-bit register and two folds finish the reduction. We measured 14× higher field-multiplication throughput on H100 than for a comparable BN254 kernel. That's the deal that makes 14B-parameter on-chain verification tractable.
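For concreteness, here is one plausible spelling of the primitives the kernels below lean on. The struct layout and the two-fold reduction are a sketch of the standard trick, not necessarily the exact code we ship.

// One plausible implementation of the M31 primitives used below.
// A 31-bit × 31-bit product fits in 62 bits; two folds of
// x = (x & p) + (x >> 31) bring it to at most p + 1, and a conditional
// subtract (a predicated select, no branch) canonicalizes.
struct M31 { uint32_t v; };

#define M31_P    0x7FFFFFFFu   // 2^31 - 1
#define M31_ZERO (M31{0u})
#define M31_ONE  (M31{1u})

__device__ __forceinline__ M31 m31_reduce64(uint64_t x) {
    x = (x & M31_P) + (x >> 31);   // 62 bits -> 32 bits
    x = (x & M31_P) + (x >> 31);   // 32 bits -> at most p + 1
    uint32_t r = (uint32_t)x;
    if (r >= M31_P) r -= M31_P;    // canonical representative
    return M31{r};
}

__device__ __forceinline__ M31 m31_mul(M31 a, M31 b) {
    return m31_reduce64((uint64_t)a.v * b.v);
}

__device__ __forceinline__ M31 m31_add(M31 a, M31 b) {
    uint32_t s = a.v + b.v;        // both < 2^31, no 32-bit overflow
    if (s >= M31_P) s -= M31_P;
    return M31{s};
}

__device__ __forceinline__ M31 m31_sub(M31 a, M31 b) {
    return m31_add(a, M31{M31_P - b.v});   // -b ≡ p - b (mod p)
}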

Memory-resident traces

A 14B-parameter forward pass dispatches roughly 2¹² distinct GKR sumcheck instances. Each sumcheck instance holds a multilinear extension of size up to 2²⁵ field elements — 128 MB at four bytes each. That's not a number you want to bounce between host RAM and HBM more than once.

We allocate the trace polynomial once in HBM, keep it resident for the life of the proof, and have every kernel mutate it in place. The MLE-restriction kernel, the sumcheck round kernel, the Poseidon-Merkle commitment kernel: they all read the same buffer, write the same buffer, and never round-trip to host. On H100, that sustains ~3.0 TB/s effective bandwidth against a theoretical 3.35 TB/s, about 90%.

// trace lives in HBM, kernels mutate in-place
__device__ M31 trace[1 << 25];

// MLE restriction at r ∈ M31: fold the two halves of the hypercube
__global__ void restrict_mle(M31* poly, M31 r, uint32_t n) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n / 2) return;
    // Strided halves rather than adjacent pairs: each thread writes only
    // poly[tid], which no other thread reads, and the high half is
    // read-only this pass, so the in-place fold is race-free. The
    // adjacent-pair version (poly[2*tid], poly[2*tid+1]) races in place:
    // thread 2*tid can overwrite poly[2*tid] before this thread reads it.
    M31 lo = poly[tid];
    M31 hi = poly[tid + n / 2];
    // (1 - r) * lo + r * hi   in M31
    poly[tid] = m31_add(m31_mul(m31_sub(M31_ONE, r), lo),
                        m31_mul(r, hi));
}
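The same fold, iterated, restricts the MLE at a full point: one launch per variable over a shrinking prefix of the buffer. A host-side sketch; the helper name and launch shape here are ours:

// Host-side sketch: bind one coordinate of r per launch, halving the
// live prefix of the buffer each time. d_poly points into HBM.
void restrict_all(M31* d_poly, const M31* r, uint32_t log_n) {
    uint32_t n = 1u << log_n;              // e.g. log_n = 25
    for (uint32_t j = 0; j < log_n; ++j) {
        uint32_t threads = 256;
        uint32_t blocks  = (n / 2 + threads - 1) / threads;
        restrict_mle<<<blocks, threads>>>(d_poly, r[j], n);
        n /= 2;                            // folded low half is the new poly
    }
    cudaDeviceSynchronize();               // poly[0] now holds the evaluation
}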

Warp-coalesced Poseidon-Merkle

Poseidon2-M31 has a state width of 16 field elements and an external round structure that maps very naturally onto a CUDA warp. Each warp handles one Merkle layer at a time; 32 threads handle 32 distinct Poseidon permutations in parallel, with shared-memory shuffles for the MDS matrix mixing.

The naïve version stalls on memory: each Merkle node reads from random locations in the previous layer. The trick is to lay out the tree in Morton order (Z-order), so neighbouring leaves end up in neighbouring cache lines. After that, the kernel is bottlenecked on the Poseidon permutation itself rather than on memory.
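The interleave at the heart of a Z-order layout is the classic bit-spread. How tree coordinates map onto (x, y) here is our illustration rather than the shipped layout code:

// Classic Morton encode: spread each 16-bit coordinate's bits apart,
// then interleave. Nearby (x, y) pairs get nearby codes, which is what
// lands neighbouring leaves in neighbouring cache lines.
__host__ __device__ inline uint32_t spread_bits16(uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

__host__ __device__ inline uint32_t morton2(uint32_t x, uint32_t y) {
    return spread_bits16(x) | (spread_bits16(y) << 1);
}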

A 2²⁵-leaf Merkle tree commits in 780 ms on a single H100. The previous best we benchmarked (a public Goldilocks-field implementation) sat at 4.1 s. The M31 advantage compounds: smaller field, cheaper permutation, smaller commitments.

Sumcheck round kernel

Each sumcheck round computes three values: g(0), g(1), and the leading coefficient of g(X). The naïve implementation is three separate reduction kernels — three launches, three memory passes. We fuse them.

__global__ void sumcheck_round(
    const M31* __restrict__ poly,  // input MLE
    uint32_t  n,                   // current size
    M31*      g0_out,              // accumulator for g(0)
    M31*      g1_out,              // accumulator for g(1)
    M31*      gd_out               // accumulator for leading coeff
) {
    // Three accumulators carried in registers; the cross-block combine
    // happens through the atomic adds below, so the output cells must
    // be zeroed before launch.
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    M31 g0 = M31_ZERO, g1 = M31_ZERO, gd = M31_ZERO;

    // Same strided-halves layout as restrict_mle: lo streams from the
    // low half, hi from the high half, both reads fully coalesced.
    for (uint32_t i = tid; i < n / 2; i += blockDim.x * gridDim.x) {
        M31 lo = poly[i];
        M31 hi = poly[i + n / 2];
        g0 = m31_add(g0, lo);
        g1 = m31_add(g1, hi);
        gd = m31_add(gd, m31_sub(hi, lo));   // leading coeff: g(1) - g(0)
    }

    // warp reduce
    g0 = warp_reduce(g0);
    g1 = warp_reduce(g1);
    gd = warp_reduce(gd);

    if ((threadIdx.x & 31) == 0) {
        atomicAdd_m31(g0_out, g0);
        atomicAdd_m31(g1_out, g1);
        atomicAdd_m31(gd_out, gd);
    }
}

Single kernel launch, single memory pass, three accumulators carried in registers, warp-level reductions and atomic adds for the cross-block combine. For a 2²⁵-coefficient round this completes in 14 ms on H100. The full sumcheck protocol (25 rounds) finishes in just under 600 ms.
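Host side, each round is one memset and one launch. A sketch, with the three accumulators packed into one device array; the name and launch shape are ours:

// Host-side sketch of one fused round. The kernel combines blocks with
// atomic adds, so all three accumulator cells must start at zero.
void run_round(const M31* d_poly, uint32_t n, M31* d_g /* 3 cells */) {
    cudaMemset(d_g, 0, 3 * sizeof(M31));   // g0 = g1 = gd = 0
    uint32_t threads = 256;
    uint32_t blocks  = (n / 2 + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;      // grid-stride loop covers the rest
    sumcheck_round<<<blocks, threads>>>(d_poly, n,
                                        &d_g[0], &d_g[1], &d_g[2]);
}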

What didn't work

Two things we tried and abandoned.

Tensor cores for matrix-mul circuits. Tempting, because Hopper's tensor cores are absurdly fast at low-precision FMA. But exact field arithmetic doesn't map onto BF16 or FP16 without losing precision, and an emulation layer eats the entire speedup. We ended up using tensor cores only for the un-quantized matmul itself, not for the prover.

Persistent kernels. The idea was to launch one long-running kernel that hops between sumcheck rounds internally, avoiding launch overhead. In practice, per-round register pressure varied enough that the persistent kernel's occupancy collapsed. We went back to one launch per round; on Hopper, a launch costs roughly 8 μs, noise next to the per-round work.

Open work

Things we still want to fix.

The unlock isn't a clever algorithm. The unlock is a field that doesn't fight the hardware. Pick the right field, and the GPU does the rest.

Source at crates.io/crates/obelyzk. Benchmarks and a runnable notebook in obelyzk/bench.