16 May 2026

Local LLM Inference Requirements: Size the Workload Before You Buy the GPU

Enterprise local LLM inference is a concurrency and SLO engineering problem, not a GPU shopping problem. Here's the workload-sizing sequence that drives every downstream decision.


Enterprise local LLM inference is a concurrency and SLO engineering problem, not a GPU shopping problem. Any requirements document that opens with 'which GPU runs a 70B model' has already failed — that question answers a single-user demo and ignores the seven variables that actually determine whether the system survives production: tokens per day, concurrent users, time-to-first-token target, context length, model class, compliance class, and uptime tier.

VRAM tables for FP16 weights tell you nothing about KV cache pressure at 200 concurrent sessions. Single-stream tokens-per-second benchmarks tell you nothing about P99 latency during a Monday morning spike. This post is the sizing sequence we use before any hardware conversation — the one that lets enterprises ship in weeks instead of spending a year discovering what they should have specified on day one.

The GPU shopping list is the wrong artifact

The default enterprise question — 'what hardware do we need to run Llama 70B?' — produces a parts list that survives exactly one phase of the project: the prototype. The reference numbers are seductive because they are concrete. Llama 3.1 70B needs roughly 140 GB of VRAM at FP16, about 35 GB at INT4, and around 38–42 GB in practice once you account for metadata, KV cache, and framework overhead [1]. A 24B model lands near 48 GB at FP16 and 12 GB at INT4; a 7–8B model fits in 16 GB at FP16 or 4 GB at INT4 [6]. Those numbers are correct. They are also useless as a requirements artifact, because they describe a single forward pass for a single user.
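The arithmetic behind those reference figures is nothing more than parameter count times bytes per parameter; a quick sketch in decimal gigabytes, ignoring the metadata and cache overhead that the 38–42 GB practical figure folds in:

```python
# Weight footprint is parameters x bytes per parameter (decimal GB).
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params x bytes, divided by 1e9 bytes per GB

print(weight_gb(70, 2.0))   # FP16 70B -> ~140 GB
print(weight_gb(70, 0.5))   # INT4 70B -> ~35 GB
print(weight_gb(24, 2.0))   # FP16 24B -> ~48 GB
print(weight_gb(8, 0.5))    # INT4 8B  -> ~4 GB
```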

The right starting artifact is a workload profile, not a parts list. Before anyone quotes a GPU, the requirements document needs seven numbers: peak tokens per day, peak concurrent users, target P50 and P99 TTFT, median and maximum context length, model class (chat, RAG, code, multimodal), data classification (public, internal, regulated, air-gapped), and uptime tier (99%, 99.9%, 99.99%). Every downstream decision — quantization, inference engine, GPU count, networking, headcount — falls out of those seven inputs. None of them fall out of a VRAM table.
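If it helps to make the profile concrete, here is a minimal sketch of that artifact as a data structure; the field names are ours, not a standard schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class WorkloadProfile:
    """The seven inputs that drive every downstream sizing decision."""
    peak_tokens_per_day: int
    peak_concurrent_users: int
    ttft_p50_ms: int                  # target, not a metric measured after the fact
    ttft_p99_ms: int
    median_context_tokens: int
    max_context_tokens: int
    model_class: Literal["chat", "rag", "code", "multimodal"]
    data_classification: Literal["public", "internal", "regulated", "air_gapped"]
    uptime_tier: Literal["99", "99.9", "99.99"]
```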

VRAM is the limiting factor for which models run at all on GPU; memory bandwidth is the constraint for CPU inference [8]. That is a hardware truth, not a sizing methodology. Treating it as the methodology is how organizations end up with a beautifully specified box that times out at 40 concurrent users.
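The bandwidth point is worth a one-line roofline estimate: each decoded token streams roughly the full active weight set through memory, so CPU decode speed is approximately bandwidth divided by model size. A hedged first-order sketch that ignores cache effects and batching:

```python
def cpu_decode_tok_s(mem_bandwidth_gb_s: float, model_gb: float) -> float:
    # Each generated token reads (roughly) every weight once from memory.
    return mem_bandwidth_gb_s / model_gb

# Illustrative: a 70B INT4 model (~35 GB) on ~100 GB/s dual-channel DDR5.
print(cpu_decode_tok_s(100, 35))  # ~2.9 tokens/s -- why CPU inference stays single-user
```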

Concurrency math is the requirement competitors skip

Real capacity is tokens per second per user multiplied by concurrent users, and the VRAM to serve it is the weight footprint plus KV cache per active session plus framework overhead. The FP16 weight size is the floor of your VRAM budget, not the ceiling. Every active session carries its own KV cache that scales with context length, and that cache lives in the same VRAM as the weights. A 70B model that 'fits' in 140 GB at FP16 [1] does not fit 100 concurrent users at 8K context — not because the weights grew, but because the per-session cache did.
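A hedged sketch of the per-session cache, using illustrative architecture numbers for a 70B-class model (80 layers, grouped-query attention with 8 KV heads of dimension 128, FP16 cache); exact figures vary by model and engine:

```python
def kv_cache_gb(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # K and V, per layer, per KV head, per head dimension, per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9

per_session = kv_cache_gb(8192)         # ~2.7 GB for one 8K-context session
print(per_session, 100 * per_session)   # ~268 GB of cache for 100 sessions -- more than the weights
```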

A 24B INT8 model at 24 GB [6] looks comfortable on a single 48 GB GPU until you add 80 concurrent RAG sessions with 16K context windows, at which point KV cache pressure forces either eviction, queueing, or a second GPU. Slow inference, high resource consumption, and low concurrency are the dominant enterprise bottlenecks, and responses over five seconds destroy real-time use cases [4]. Dynamic batching and PagedAttention-style memory management exist precisely because naive serving collapses under concurrency.

The sizing sequence is: estimate peak concurrent active sessions, multiply by average context length, derive KV cache budget, add weight footprint, add 2–4 GB framework overhead, then add headroom for the P99 burst. Only then do you know how many GPUs the workload requires. That number is almost never the number you get from dividing model size by GPU VRAM.
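Written out as a sketch, with illustrative overhead and headroom constants rather than vendor guidance:

```python
import math

def gpus_required(concurrent_sessions: int, avg_context_tokens: int,
                  weight_gb: float, kv_gb_per_token: float, gpu_vram_gb: float,
                  framework_overhead_gb: float = 3.0, burst_headroom: float = 1.25) -> int:
    kv_budget = concurrent_sessions * avg_context_tokens * kv_gb_per_token
    total = (weight_gb + kv_budget + framework_overhead_gb) * burst_headroom
    return math.ceil(total / gpu_vram_gb)

# The 24B INT8 example above: 80 RAG sessions at 16K context on 48 GB GPUs,
# assuming ~0.16 MB of FP16 cache per token (illustrative for a 24B-class model).
# This is the no-eviction worst case; engines like vLLM page and preempt, which is
# exactly the eviction/queueing pressure described above.
print(gpus_required(80, 16384, weight_gb=24, kv_gb_per_token=0.16e-3, gpu_vram_gb=48))  # -> 7
```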

TTFT and uptime are inputs, not outcomes

If the business case requires sub-50ms TTFT at P99 and 99.9% uptime, those are sizing constraints — not aspirational metrics to measure after deployment. Appropriately sized local inference delivers roughly 15–30 ms P50 TTFT, while cloud APIs commonly land at 100–300 ms P50 and spike to 1–2 seconds at P99 during provider congestion [1]. The local advantage is real, but only if the cluster is provisioned with the headroom to hold P99 under load: redundant inference nodes, conservative batch policies, and capacity reserved for failover.

Uptime tier changes the cluster topology before it changes anything else. 99.9% allows roughly 8.7 hours of downtime per year, which forces N+1 inference nodes, model rollback workflows, canary deployment paths, and a disaster recovery plan that survives a single-rack failure. None of that appears on a VRAM table. The infrastructure floor — redundant power and cooling, 10 Gbps internal networking, 64–128 GB RAM minimum, internal MLOps capability [7] — is necessary but not sufficient. The actual node count is a function of your SLO, not your model size.
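The downtime budget behind each tier is simple arithmetic, and it is the number that actually drives node redundancy:

```python
for tier in (99.0, 99.9, 99.99):
    hours = (1 - tier / 100) * 365 * 24
    print(f"{tier}% uptime -> {hours:.1f} hours of downtime per year")
# 99.0% -> 87.6 h, 99.9% -> 8.8 h, 99.99% -> 0.9 h
```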

If you cannot state your TTFT and uptime targets, you cannot size the cluster. Buying a GPU first and discovering the SLO later is the most expensive sequence in this entire category.

Inference engine choice is a workload decision, not a preference

Ollama, vLLM, and SGLang are not interchangeable runtimes with cosmetic differences. They are different engines for different workload profiles, and choosing wrong is how prototypes fail to graduate. The enterprise mapping: Ollama for quick local deployment and internal testing, vLLM for high-performance API services with large-scale requests and high concurrency, and SGLang for complex multimodal tasks such as image-text and OCR workflows [4].

The failure mode is predictable. A team prototypes on Ollama because it's frictionless, demos it successfully to leadership, then tries to put it behind an enterprise API gateway with 500 concurrent users and watches throughput collapse. The engine wasn't wrong for the prototype; it was wrong for the production workload profile. vLLM's dynamic batching and paged attention exist because high-concurrency API serving is a fundamentally different engineering problem than single-user inference. Migrating between engines mid-project is expensive — it touches batching strategy, observability, authentication, and SLA instrumentation.
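To make the dependency concrete, here is a hedged sketch of how the workload profile maps onto vLLM's engine knobs; the argument names reflect recent vLLM releases (the same options exist as flags on `vllm serve`) and should be checked against the version you deploy:

```python
from vllm import LLM

# The workload profile drives the engine configuration, not the other way around.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # GPU count from the sizing sequence
    max_model_len=16384,           # maximum context from the workload profile
    max_num_seqs=128,              # peak concurrent sessions per replica
    gpu_memory_utilization=0.85,   # hold back headroom for the P99 burst
)
```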

Engine selection belongs in the requirements document next to the workload profile, not in a later 'implementation detail' phase. The workload profile dictates the engine. The engine dictates the batching policy. The batching policy dictates the GPU count. Reverse that sequence and the project ships late.

TCO that stops at GPU price is fiction

The break-even argument against cloud APIs is real. Organizations sustaining more than roughly 10 million tokens per day can see on-premise deployment pay back within 12–18 months: 10M tokens/day costs $6,000–$9,000 monthly via cloud API versus $3,500–$6,000 on-prem, and 100M tokens/day costs $60,000–$90,000 cloud versus $8,000–$15,000 on-prem [1]. Those numbers are credible. They are also incomplete.

A TCO model that stops at GPU acquisition cost is fiction. The line items that turn an 18-month payback into a 30-month one are the ones competitors omit: power draw at rack density that requires non-standard PDU provisioning, cooling that may force liquid loops or aisle containment, colocation fees if you don't own the data center, 10 Gbps+ internal networking [7], observability and logging infrastructure, MLOps and platform engineering headcount, hardware refresh on a 3–5 year cycle, and utilization risk if traffic doesn't grow into the cluster you sized for peak.
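A minimal payback sketch shows how quickly those line items stretch the timeline; the cloud and on-prem monthly figures are the cited ones [1], while the capex and loaded-opex numbers are hypothetical illustrations:

```python
def payback_months(capex: float, cloud_monthly: float, onprem_monthly: float) -> float:
    return capex / (cloud_monthly - onprem_monthly)

capex = 50_000          # hypothetical GPU + server acquisition
cloud = 7_500           # 10M tokens/day via cloud API, midpoint of the cited range
naive_onprem = 3_500    # GPU-centric model: low end of the cited on-prem range
loaded_onprem = 5_800   # hypothetical: power, colo, networking, observability, staff share added

print(payback_months(capex, cloud, naive_onprem))    # ~12.5 months on paper
print(payback_months(capex, cloud, loaded_onprem))   # ~29 months once the opex is counted
```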

On-premise wins on TCO above a token threshold and below a utilization risk threshold, and only if the operating model is staffed. A single-vendor stack that bundles hardware, runtime, models, applications, and support compresses the line items that destroy naive TCO models — which is why the procurement question is not just 'what does the GPU cost' but 'who owns the stack on day 400'.

Compliance class changes the entire stack, not just the network diagram

Air-gapped and GDPR-bound workloads do not simply add a firewall rule to an otherwise standard deployment. They change the runtime, the update workflow, the audit surface, and the incident response model. An air-gapped cluster needs an offline model artifact pipeline, signed update bundles, reproducible deployments, and a rollback path that doesn't assume internet access. A GDPR-aligned cluster needs prompt and completion retention policies, redaction at ingestion, access control tied to existing identity systems, and audit trails that satisfy a regulator, not just an internal SRE.

RAG workloads add their own compliance layer: citation tracking back to source documents, page numbers, and document revisions, so that any answer can be traced to the artifact that produced it. That requirement is trivial to bolt onto a prototype and painful to retrofit onto a stack chosen purely for throughput. The same applies to model artifact governance — knowing which model version produced which answer on which date, with which retrieval index — which is a first-class runtime concern, not an afterthought.
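As a sketch of what 'first-class' means in practice, an audit record might carry fields like these; the names are illustrative, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class InferenceAuditRecord:
    request_id: str
    timestamp: datetime
    model_name: str
    model_version: str            # exact artifact hash or tag that served the answer
    retrieval_index_version: str  # which index the citations came from
    source_documents: list[str]   # document IDs, page numbers, revisions behind each citation
    prompt_retention_class: str   # maps to the retention/redaction policy applied
```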

Compliance class belongs in the workload profile next to tokens per day, because it determines which inference engines, orchestration layers, and storage architectures are even candidates. A stack chosen for raw tokens-per-second and then asked to produce a GDPR audit trail two quarters later is a rewrite, not a configuration change.

Procurement and operating model are part of the requirement

GPU lead times, vendor lock-in, and the build-versus-colocate-versus-single-vendor-stack decision belong inside the requirements document, not in a separate procurement track that starts after the architecture is signed off. A 30-day path to production and a 12-month path to production are different requirements with different cluster designs, different staffing assumptions, and different risk profiles. Only one of them survives a board review where the AI initiative has a quarterly milestone attached.

The build path requires internal MLOps capability, hardware supply relationships, and the patience to integrate runtime, models, applications, and observability yourself [7]. The colocate path trades capex for opex and adds a vendor surface to your incident response. The single-vendor stack path — hardware, runtime, open-weight models, applications, deployment, and support from one provider — compresses time-to-production at the cost of stack flexibility. None of these is universally right. All three are legitimate answers to different workload profiles and different organizational realities. The wrong answer is to leave the choice undefined until the GPUs arrive.

Enterprises that size the workload first ship in weeks with predictable SLOs. Enterprises that buy the GPU first spend the next year discovering what they should have specified on day one. The seven numbers — tokens per day, concurrent users, TTFT, context, model class, compliance class, uptime — are not a checklist. They are the requirements document. Everything else is implementation.


Send us your workload profile and we'll size the cluster: https://wavenetic.com

Sources

  1. Enterprise Local LLM Deployment: vLLM, GPUs, and Production Sizing
  2. Enhancing Enterprise Inference Efficiency: Choosing the Right LLM Inference Service Solution
  3. Local LLM Deployment: Privacy-First AI Complete Guide
  4. LLM On-Premise: Deploy AI Locally with Full Control
  5. Tech Primer: What hardware do you need to run a local LLM?