16 May 2026

Gemma 4 in 2026: the May update rewrote on-prem math

Gemma 4's April launch was a spec sheet. The May multi-token prediction update is what made on-prem inference production-viable for EU CTOs in 2026.


Gemma 4's real 2026 story for enterprise is the May multi-token prediction (MTP) update, not the April launch benchmarks. May is when an open model became fast enough on commodity hardware to displace cloud calls for latency-sensitive agentic workloads inside a regulated perimeter.

For CTOs at regulated enterprises, the only update that matters is the May MTP drafter. Tokens-per-second on owned silicon decides whether on-prem AI is production-viable; MMMLU does not. This post gives you the workload-to-variant matrix, the GPU sizing after MTP, the migration traps from Gemma 3, and the cases where Gemma 4 is still the wrong call.

April was a spec sheet. May was the production unlock.

Google introduced Gemma 4 on April 2, 2026, in four sizes — E2B with 2.3B effective parameters, E4B with 4.5B effective, a 26B A4B MoE that activates 3.8B, and a 31B dense model with 256k context — under Apache 2.0 [1][2][3]. The 31B IT Thinking variant posted 85.2% MMMLU, 80.0% on LiveCodeBench v6, and 86.4% on τ2-bench retail agentic tool use [4]. None of that changed an enterprise deployment decision. Strong open-weight models existed before April.

What changed in May is throughput. The multi-token prediction drafters Google shipped made Gemma 4 up to 3x faster locally: 2.8x for E2B on Pixel hardware, 3.1x for E4B, and 2.5x for the 31B model on Apple M4 silicon, with first-class integration into MLX, vLLM, SGLang, and Ollama [5]. That speedup moves a 31B-class model from interesting to viable as a Gemini API replacement in latency-bound agent loops. An agent doing six tool-call hops at 40ms versus 100ms per generated chunk is a different product.
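
To see why, run the arithmetic on one agent task. A minimal sketch in Python: the roughly 2.5x MTP factor comes from [5], while the hop count, chunks per hop, and tool latency are assumptions chosen for illustration.

```python
# Back-of-the-envelope agent-loop latency. Illustrative numbers:
# only the ~2.5x MTP speedup is from [5]; the hop count, chunks
# per hop, and tool latency are assumptions for this example.

HOPS = 6                 # tool-call round trips in one agent task (assumed)
CHUNKS_PER_HOP = 12      # generated chunks per hop (assumed)
TOOL_LATENCY_MS = 150    # external tool latency per hop (assumed)

def loop_latency_ms(chunk_ms: float) -> float:
    """Total wall-clock time for one agent task."""
    return HOPS * (CHUNKS_PER_HOP * chunk_ms + TOOL_LATENCY_MS)

before = loop_latency_ms(100)   # pre-MTP: ~100 ms per chunk
after = loop_latency_ms(40)     # post-MTP: ~40 ms per chunk

print(f"pre-MTP:  {before / 1000:.1f} s per task")   # 8.1 s
print(f"post-MTP: {after / 1000:.1f} s per task")    # 3.8 s
```

Generation time compounds across every hop; the tool calls do not get faster, but the model-bound share of the loop shrinks enough to change the product.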

Pick variants by load shape, not by parameter count

E2B, E4B, 26B A4B MoE, and 31B dense are not a quality ladder. They are four deployment postures. E2B and E4B carry 128k context and accept image, video, and audio inputs — built for edge devices and offline agents where the audio modality matters [3]. The 26B A4B MoE activates only 3.8B parameters per token while keeping 26B of knowledge in memory: a throughput play for high-concurrency RAG with a wide knowledge surface and no budget for a 31B forward pass per request [4]. The 31B dense model, with 256k context, is the answer for single-tenant high-stakes reasoning, long-document synthesis, and agentic tool-use.

Choosing by Arena score burns a quarter's worth of GPU budget. The right axes are KV cache pressure at your real concurrency target, which modalities you need at the edge versus the data center, and whether your workload is bursty (favor MoE) or steady-state (favor dense). A multilingual extraction pipeline at 200 concurrent users does not want a 31B dense. A legal-synthesis agent serving twelve lawyers does not want a 26B MoE. Match the variant to the load shape; a first-pass decision rule is sketched below.
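
A sketch only: the axes follow the reasoning above, but every threshold is a placeholder to calibrate against your own traffic, not a recommendation.

```python
# Toy variant selector. The axes (concurrency, edge modalities,
# burstiness, stakes) are from the post; every numeric threshold
# is a placeholder, not a tested recommendation.

from dataclasses import dataclass

@dataclass
class LoadShape:
    concurrency: int          # sustained concurrent sessions
    edge_deployment: bool     # must run on-device / offline
    needs_audio: bool         # audio input required
    bursty: bool              # spiky vs steady-state traffic
    high_stakes: bool         # single-tenant, long-document reasoning

def pick_variant(load: LoadShape) -> str:
    if load.edge_deployment or load.needs_audio:
        # E2B/E4B carry the audio modality and edge footprint [3]
        return "E4B" if load.concurrency > 4 else "E2B"
    if load.high_stakes and load.concurrency <= 20:
        return "31B dense"      # 256k context, strongest reasoning
    if load.bursty or load.concurrency > 50:
        return "26B A4B MoE"    # 3.8B active params per token [4]
    return "31B dense"

print(pick_variant(LoadShape(200, False, False, True, False)))  # 26B A4B MoE
print(pick_variant(LoadShape(12, False, False, False, True)))   # 31B dense
```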

What fits in 24GB, 48GB, and 80GB after MTP

Gemma 4 uses alternating local sliding-window and global full-context attention, dual RoPE configurations, Per-Layer Embeddings, and a shared KV cache across attention layers [3]. The shared KV cache is the load-bearing piece for memory sizing. Combined with the May MTP drafters, the practical envelopes shifted. A 24GB consumer card (RTX 4090, L4) now serves E4B with full 128k context and headroom for several concurrent sessions, including multimodal inputs. That was not the production story in April. It is now.

At 48GB (L40S, A6000 Ada), the 26B A4B MoE with INT8 weights runs production RAG at concurrency that previously demanded an H100, because the activated parameter count is only 3.8B per token and MTP cuts generation steps by roughly 2.5x on comparable silicon [5]. At 80GB (H100, H200), a single GPU now covers what was a two-GPU 31B deployment in April: long-context legal review, full-modality document intake, and tool-using agents under two-second response budgets. Parameter counts did not change. Requests per second per card did — which is the only number that matters when you are sizing a WaveNode appliance against a Gemini API line item.
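
Google has not published the exact cache geometry these envelopes rest on, so treat the following as a generic KV-cache estimator with placeholder architecture numbers. Only the structural point comes from the sources: sharing the KV cache across attention layers [3] divides the dominant per-token term.

```python
# Generic KV-cache sizing. The formula is the standard
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# All architecture numbers below are placeholders, not published
# Gemma 4 specs; the point is that KV sharing shrinks the count
# of layers that own a cache.

def kv_cache_gb(context: int, sessions: int,
                kv_layers: int = 8,      # layers owning a cache after sharing (assumed)
                kv_heads: int = 8,       # grouped-query KV heads (assumed)
                head_dim: int = 128,     # per-head dimension (assumed)
                bytes_per_elem: int = 2  # fp16/bf16 cache
                ) -> float:
    per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context * sessions / 1024**3

# E4B-class model, full 128k context, 4 concurrent sessions:
print(f"{kv_cache_gb(context=128_000, sessions=4):.1f} GB of KV cache")
# -> ~15.6 GB with these placeholders; plus quantized E4B weights,
#    that is roughly the 24GB-card envelope described above.
```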

Apache 2.0 solves residency, not governance

Apache 2.0 [1] lets you deploy Gemma 4 inside an air-gapped perimeter without a per-token license meter and without exporting prompts to a hosted endpoint. That solves the data-residency half of the EU AI Act conversation. It does not solve the governance half. Enterprise posture under the AI Act, GDPR, NIS2, and DORA requires a documented evaluation harness, fine-tune drift monitoring, citation-grounded retrieval with traceable sources, and an audit trail that survives a regulator's inspection. The license gives you the right to deploy. It gives you none of those four artifacts.

A Gemma 4 31B deployment without an eval harness is a liability waiting for its first regulator question. The Wavenetic stack ships citation tracking, page-level source binding, revision-aware retrieval, and per-request audit logs because Apache 2.0 weights plus Ollama is not an AI Act answer — see [EU AI Act-compliant AI](/eu-ai-act-compliant-ai) for what the governance layer actually has to do. The model is the cheap part.
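
Concretely, an audit trail that survives inspection means every answer binds to a model version, its retrieved sources, and an eval sign-off. A minimal per-request record might look like the sketch below; field names are illustrative, not the actual Wavenetic schema.

```python
# Minimal per-request audit record. Field names and values are
# illustrative placeholders, not the Wavenetic/NEXUS schema; the
# point is binding each answer to model, sources, and eval state.

import json, hashlib, datetime

record = {
    "request_id": "req-0001",
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "model": {"name": "gemma-4-31b-it", "weights_sha256": "<pinned digest>"},
    "retrieval": [
        # citation-grounded: every chunk carries source, page, revision
        {"doc_id": "grid-code-2025.pdf", "page": 41, "revision": "v3"},
    ],
    "prompt_sha256": hashlib.sha256(b"<redacted prompt>").hexdigest(),
    "output_sha256": hashlib.sha256(b"<redacted answer>").hexdigest(),
    "eval_harness_run": "2026-05-weekly",  # which eval signed off this model
}

print(json.dumps(record, indent=2))
```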

Where Gemma 4 is still the wrong answer

Gemma 4 31B loses to hosted Gemini 3 on ultra-long-horizon planning agents where context runs over 256k and the model has to maintain coherent state across thousands of tool calls. It loses to Qwen3 on specific multilingual extraction tasks, particularly for some non-European scripts where Qwen's training mix is denser. For high-stakes legal synthesis where a single reasoning error has six-figure consequences, a 70B-class frontier model — open or hosted — is the right call until Gemma 4 31B has more production miles behind it.

Repatriation is not all-or-nothing. The right pattern is a policy-based router: Gemma 4 on-prem handles the 80% of workload that is document Q&A, structured extraction, drafting, and well-bounded agent loops; hosted frontier models handle the long tail. The full classification framework is in [On-premise AI vs cloud AI: don't choose a platform, classify the workload](/blog/on-premise-ai-vs-cloud-ai-don-t-choose-a-platform-classify-). Gemma 4's May update widened the on-prem side of that router. It did not eliminate the other side.
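
A sketch of that router, with placeholder policy rules: the 256k boundary matches the 31B variant's context limit, everything else is an assumption rather than a recommended production policy.

```python
# Toy policy-based router. The 256k ceiling matches the 31B
# variant's published context limit; the task buckets and the
# rule order are placeholder policy, not a production config.

def route(task_type: str, context_tokens: int,
          contains_regulated_data: bool) -> str:
    # Regulated data never leaves the perimeter, regardless of task.
    if contains_regulated_data:
        return "on-prem: gemma-4-31b"
    # Beyond the 256k window, on-prem Gemma 4 is the wrong tool.
    if context_tokens > 256_000:
        return "hosted: frontier model"
    # The ~80% bucket: Q&A, extraction, drafting, bounded agent loops.
    if task_type in {"doc_qa", "extraction", "drafting", "bounded_agent"}:
        return "on-prem: gemma-4-31b"
    # Long tail: ultra-long-horizon planning, exotic multilingual work.
    return "hosted: frontier model"

print(route("doc_qa", 40_000, contains_regulated_data=True))  # on-prem
print(route("planning_agent", 500_000, False))                # hosted
```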

Migration from Gemma 3 is a re-eval cycle, not a config swap

Teams running Gemma 3 in production should not treat Gemma 4 as a drop-in. The alternating local sliding-window and global attention pattern is new [3], the tokenizer changed, and the prompt format shifted enough that existing fine-tunes do not port cleanly. Per-Layer Embeddings and dual RoPE configurations change how positional information flows, which means LoRA adapters tuned against Gemma 3 will produce subtly wrong outputs against Gemma 4 weights — wrong in ways that pass smoke tests and fail on edge cases six weeks into production.

The discipline: re-run your full eval harness against Gemma 4 base before any fine-tuning, re-train adapters from scratch on the new tokenizer, and budget a two-to-four-week parallel-run window where Gemma 3 and Gemma 4 serve the same workload and you diff outputs at the citation and structured-output level. Skip that and you will discover in production why the AICore preview explicitly notes that tool calling, structured output, system prompts, and thinking mode are landing on different timelines [5].
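
The diff step is the part teams skip. A skeleton of what it has to cover, where `run_gemma3`, `run_gemma4`, and the two extractors stand in for your own inference clients and output parsers:

```python
# Skeleton for the parallel-run window. run_gemma3/run_gemma4 and
# the extractors are stand-ins for your own clients and parsers.
# The point: diff at the citation and structured-output level,
# not raw text, because that is where adapter drift hides.

from typing import Callable

def diff_outputs(prompts: list[str],
                 run_gemma3: Callable[[str], str],
                 run_gemma4: Callable[[str], str],
                 extract_citations: Callable[[str], set],
                 extract_structured: Callable[[str], dict]) -> list[dict]:
    regressions = []
    for p in prompts:
        old, new = run_gemma3(p), run_gemma4(p)
        cite_delta = extract_citations(old) ^ extract_citations(new)
        struct_match = extract_structured(old) == extract_structured(new)
        if cite_delta or not struct_match:
            regressions.append({"prompt": p,
                                "citation_delta": cite_delta,
                                "structured_match": struct_match})
    return regressions

# usage: regressions = diff_outputs(golden_set, g3_client, g4_client,
#                                   parse_citations, parse_json_output)
```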

The repatriation calculus for European CTOs after May 2026

Run the math at any sustained throughput above a few million tokens per day. Gemma 4 31B on a sealed appliance with an H100-class GPU, accelerated by MTP drafters and integrated through vLLM or SGLang [5], delivers cost-per-million-tokens that crosses under Gemini API pricing inside a year — often inside two quarters once you account for the egress, audit, and data-residency overhead a hosted API forces onto a regulated enterprise. That crossover is why repatriation conversations are accelerating through 2026.
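
The structure of that calculation, with every number a placeholder rather than a quote of actual Gemini API or hardware pricing:

```python
# Crossover month for appliance vs hosted API. Every number is a
# placeholder chosen to show the shape of the calculation, not a
# quote of Gemini API, H100, or appliance pricing.

APPLIANCE_CAPEX_EUR = 60_000        # H100-class sealed appliance (assumed)
APPLIANCE_OPEX_EUR_MONTH = 1_500    # power, rack, support (assumed)
API_EUR_PER_M_TOKENS = 10.0         # hosted blended rate (assumed)
OVERHEAD_MULTIPLIER = 1.4           # egress + audit + residency drag (assumed)

def crossover_month(tokens_per_day_m: float) -> int | None:
    """First month where cumulative appliance cost dips under the API."""
    api_month = tokens_per_day_m * 30 * API_EUR_PER_M_TOKENS * OVERHEAD_MULTIPLIER
    if api_month <= APPLIANCE_OPEX_EUR_MONTH:
        return None  # hosted stays cheaper at this volume
    for month in range(1, 121):
        if APPLIANCE_CAPEX_EUR + month * APPLIANCE_OPEX_EUR_MONTH < month * api_month:
            return month
    return None

print(crossover_month(tokens_per_day_m=20))  # -> 9 months with these placeholders
```

Swap in your actual contract rates and utilization; the shape of the curve, capex amortized against a throughput-proportional API bill, is what drives the sub-year crossover at sustained volume.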

This is the WaveNode thesis: Gemma 4 31B inside the customer perimeter, on Wavenetic hardware, with RAG, citations, audit trail, and drift monitoring shipped as one stack. That stack is in production today at ELES, Slovenia's national TSO, running as NEXUS. The teams that win 2026 stopped reading Gemma 4 as a model release and started reading it as a quarterly-shifting platform whose May update already rewrote the on-prem business case. For the architecture, see [Enterprise AI on-premise](/enterprise-ai-on-premise).


Talk to our team about sizing Gemma 4 for your on-prem workload: https://wavenetic.com

Sources

  1. Gemma 4: Byte for byte, the most capable open models — Google Blog
  2. Gemma 4 — Google DeepMind
  3. Welcome Gemma 4: Frontier multimodal intelligence on device — Hugging Face
  4. Gemma 4 model overview — Google AI for Developers
  5. Google Makes Gemma 4 Up to 3x Faster Locally — Belitsoft