29 April 2026 · Wavestorm

On-Premise AI Cost: A CFO-Ready TCO Breakdown, Not Just a GPU Price

A line-item TCO model for on-premise AI: CapEx, OpEx, facility readiness, refresh cycles, and the utilization math that actually drives cost per token.

Ask ten vendors what on-premise AI costs and you will get ten GPU price lists. That is not a TCO. A CFO signing a multi-year capital request needs a model that survives an audit committee: every line item, every assumption, every refresh cycle, and a denominator that ties spend to delivered output.

The honest answer is that on-prem AI is cheaper than cloud for sustained, high-utilization workloads and more expensive for bursty or experimental ones [2][4]. But that framing skips the buyer's actual question. What follows is a line-item cost structure, the math that determines whether the investment pays off, and the facility realities most analyses leave out.

The line items most TCO models quietly omit

Lenovo's 2025 TCO paper is unusually candid about what its own comparison excludes: managed services, data storage, data transfer, OS and application licensing, patching, networking, IT staffing, and software stack maintenance [4]. Read that list again. It is most of the cost of operating a real system. A TCO that ignores them is a hardware quote with a spreadsheet wrapper.

A defensible on-prem AI cost model has three layers. CapEx covers GPUs, servers, NVMe storage, network fabric, racks, PDUs, UPS, and any electrical or cooling upgrades the facility needs to host the cluster. OpEx covers electricity, cooling overhead (PUE), hardware support contracts, software licenses, model and runtime maintenance, security tooling, and the staff hours to operate it. Implementation costs — often the most underestimated — cover integration with identity, document repositories, audit systems, change management, and the 30-to-90-day work to take a cluster from racked to producing answers users trust.
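
A minimal way to keep the three layers separate is a plain data structure with one bucket per layer. The line items below mirror the paragraph above; every value is a placeholder to be filled from sourced quotes, not an estimate:

    # One bucket per layer; replace each 0.0 with a sourced quote.
    capex = {
        "gpus_and_servers": 0.0,
        "nvme_storage": 0.0,
        "network_fabric": 0.0,
        "racks_pdus_ups": 0.0,
        "electrical_cooling_upgrades": 0.0,
    }
    opex_annual = {
        "electricity_and_cooling": 0.0,   # metered power x PUE
        "hardware_support": 0.0,          # typically 8-15% of CapEx
        "software_licenses": 0.0,
        "model_runtime_maintenance": 0.0,
        "security_tooling": 0.0,
        "staff": 0.0,                     # fully loaded salaries
    }
    implementation = {
        "identity_integration": 0.0,
        "document_repository_integration": 0.0,
        "audit_and_change_management": 0.0,
        "rollout_30_to_90_days": 0.0,
    }

    def layer_total(layer: dict) -> float:
        return sum(layer.values())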

Separately, application and model development costs sit on top of infrastructure TCO and should never be blended in [7][8]. A buyer evaluating on-prem AI infrastructure needs the infrastructure number clean. Otherwise the comparison to a cloud bill becomes apples-to-fruit-salad.

CapEx: what you actually buy on day one

Hardware is the headline, but the bill of materials is longer than the GPU SKU. A production-grade on-prem AI node includes accelerators, host CPUs, ECC memory sized for KV cache and embeddings, NVMe storage for vector indexes and document corpora, and a low-latency network for multi-node inference or fine-tuning. Add redundant power supplies, top-of-rack switches, out-of-band management, and spares — typically 10 to 15 percent of the fleet — so a single failure does not take production down.
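
The spares line is the one most quotes omit, so here is the arithmetic, with a hypothetical 16-node fleet and the 10-to-15-percent pool from above split down the middle:

    import math

    fleet_nodes = 16           # hypothetical cluster size
    spares_ratio = 0.125       # 10-15% pool from the text, split down the middle
    spare_nodes = math.ceil(fleet_nodes * spares_ratio)   # -> 2 nodes on the shelf
    order_quantity = fleet_nodes + spare_nodes            # -> 18 nodes on the PO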

Then come the things that live outside the chassis. Rack power density for modern GPU servers frequently exceeds what older enterprise data halls were wired for, which means new PDUs, possibly new circuits, and in some buildings a transformer conversation with the utility. Cooling is the second blocker: air-cooled designs above roughly 30 kW per rack get uncomfortable, and liquid cooling has its own retrofit cost. Uvation pegs initial setup for a dedicated AI data center at $15,000 to $50,000 as a middle path between full ownership and hyperscaler consumption [2], but that figure scales fast with rack count and density.

Depreciation and refresh discipline matter as much as the sticker. GPUs and AI servers should be amortized over a 3-to-5-year useful life, with a refresh plan budgeted from year one. A TCO that assumes a seven-year hold on accelerator hardware is quietly underreporting annualized cost.
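
A quick sketch of why the refresh assumption dominates the annualized figure, using a hypothetical $1.5M all-in acquisition cost:

    capex = 1_500_000  # hypothetical all-in acquisition cost
    for refresh_years in (3, 5, 7):
        print(refresh_years, round(capex / refresh_years))
    # 3 -> 500,000/yr; 5 -> 300,000/yr; 7 -> ~214,000/yr
    # A seven-year hold reports ~29% less annual cost than a five-year refresh plan.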

OpEx: the recurring bill nobody quotes upfront

Electricity and cooling are the obvious recurring costs and the ones existing TCO comparisons model best [4]. PUE assumptions matter: a facility running at 1.5 PUE draws 25 percent more total power per watt of compute than one at 1.2, and spends two and a half times as much on the cooling-and-overhead share. For a cluster pulling 30 to 60 kW continuously, the annual power-and-cooling bill can reach six figures before anyone touches a model.
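
A back-of-the-envelope power model makes the PUE sensitivity concrete. The cluster size and the $0.12/kWh rate below are illustrative assumptions, not quotes:

    HOURS_PER_YEAR = 8760

    def annual_power_cost(it_kw: float, pue: float, usd_per_kwh: float) -> float:
        # Annual electricity spend for a cluster drawing it_kw continuously
        return it_kw * pue * HOURS_PER_YEAR * usd_per_kwh

    # Hypothetical 60 kW cluster at a $0.12/kWh commercial rate:
    print(annual_power_cost(60, 1.5, 0.12))  # ~$94,600/yr
    print(annual_power_cost(60, 1.2, 0.12))  # ~$75,700/yr -> ~$18,900/yr saved at 1.2 PUE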

Software and support are the costs that catch finance teams off guard. Hardware support contracts typically run 8 to 15 percent of acquisition cost annually. Operating system, virtualization or container platform, observability, and security tooling each carry their own license. Open-weight models avoid per-token licensing, but the runtime, orchestration, and RAG layer still need maintenance — patching, model updates, embedding re-indexing, and citation tracking pipelines that keep audit trails intact.
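
Applying the 8-to-15-percent support band to the same hypothetical $1.5M CapEx from earlier:

    capex = 1_500_000                                    # same hypothetical CapEx as above
    support_low, support_high = 0.08 * capex, 0.15 * capex
    # -> $120,000 to $225,000 per year in hardware support alone, before software licenses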

Staffing is where the model often breaks. A serious on-prem AI deployment needs MLOps, platform, and security coverage. Buying a turnkey stack with vendor-provided runtime, models, and support compresses this — single-vendor accountability turns three job descriptions into one operating contract — but the cost does not vanish; it just moves from headcount to a support line.

The formula: cost per useful unit of output

Here is the model worth putting in front of a CFO: Annualized TCO ÷ Useful output. The numerator is (CapEx ÷ refresh years) + annual power and cooling + support and licenses + staff and operations + facility overhead. The denominator is the unit your business actually consumes — GPU-hours delivered, tokens served, documents indexed and queried, or answers produced with citations.
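
Transcribed directly into code, with every input a placeholder for your own sourced numbers:

    def annualized_tco(capex, refresh_years, power_cooling,
                       support_licenses, staff_ops, facility_overhead):
        # Numerator: annualized CapEx plus every recurring line from the OpEx section
        return (capex / refresh_years + power_cooling + support_licenses
                + staff_ops + facility_overhead)

    def cost_per_unit(annual_tco, units_delivered):
        # Denominator: GPU-hours, tokens, documents, or cited answers;
        # pick one unit and keep it for the life of the model
        return annual_tco / units_delivered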

NVIDIA's argument here is the one most on-prem analyses miss: the route to lower cost per token is the denominator, not the numerator [6]. Two clusters with identical hardware and power bills can produce wildly different cost-per-token numbers depending on batching, quantization, concurrency, and sustained GPU utilization. A cluster running at 25 percent utilization costs 3.2 times as much per delivered token as the same cluster does at 80 percent: the fixed costs are identical, and only the token count moves.
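
Using cost_per_unit from the sketch above, a utilization sweep shows the spread. The TCO and token-ceiling figures are hypothetical:

    annual_tco = 1_000_000            # hypothetical output of annualized_tco() above
    peak_tokens = 50e9                # hypothetical ceiling at 100% sustained utilization
    for util in (0.25, 0.50, 0.80):
        usd_per_m = cost_per_unit(annual_tco, peak_tokens * util) * 1e6
        print(f"{util:.0%} utilization -> ${usd_per_m:.2f} per million tokens")
    # 25% -> $80.00, 50% -> $40.00, 80% -> $25.00: a 3.2x spread on identical hardware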

This is why the cloud-versus-on-prem break-even is workload-dependent. Uvation's TCO comparison places hyperscalers ahead in the 1-to-3-year window, roughly comparable in 3-to-5 years, and on-prem ahead beyond five years for consistent, high-demand workloads [2]. ZySec models a typical 500-knowledge-worker enterprise at $1.6M to $2.2M in five-year cloud TCO, rising above $2.5M with heavy egress [3]. The shape of the curve is what matters: cloud is linear in usage, on-prem is mostly fixed. High utilization flattens cost per token; low utilization punishes it.
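
A minimal break-even sketch under those curve shapes: cloud spend is linear in years of usage, on-prem is CapEx plus a smaller annual run rate. The dollar figures are hypothetical but chosen to land inside Uvation's 3-to-5-year window [2]:

    def breakeven_years(onprem_capex, onprem_annual_opex, cloud_annual_cost):
        # Years until cumulative on-prem spend (CapEx + OpEx x years)
        # drops below cumulative cloud spend (usage cost x years)
        if cloud_annual_cost <= onprem_annual_opex:
            return float("inf")       # on-prem never catches up
        return onprem_capex / (cloud_annual_cost - onprem_annual_opex)

    # Hypothetical steady workload: $800k CapEx, $250k/yr to run, vs a $450k/yr cloud bill
    print(breakeven_years(800_000, 250_000, 450_000))   # -> 4.0 years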

Facility readiness: the schedule risk that becomes a cost

Power and cooling lead times are the silent budget killers. A new circuit, a transformer upgrade, or a liquid cooling retrofit can take months. During that time the GPUs are either sitting in crates depreciating or running de-rated, and the project's cost-per-token denominator is artificially small because the cluster is not yet producing.

Facility readiness has a hard checklist: rack power density per cabinet, total available kW at the row and room level, cooling capacity headroom, redundant power paths, fire suppression compatible with high-density compute, structural floor loading, and an electrical maintenance plan with spares. Skip any of these in the planning phase and the cost shows up later as a change order or a missed go-live date.

What a credible on-prem AI cost model looks like

A defensible model is auditable, not aspirational. Every line item is sourced — vendor quotes for hardware, utility rates for power, contracted rates for support, fully loaded salaries for staff, measured PUE for the facility. Every assumption is changeable: refresh cycle, utilization rate, concurrent users, average tokens per query, growth in document volume.

The training cost ceiling is worth keeping in view as a sanity check. Lenovo notes Llama 3.1 was trained on more than 15 trillion tokens across 39.3 million GPU hours, with a hypothetical equivalent AWS P5 H100 cost above $483 million in cloud compute alone, excluding training data storage [4]. Most enterprises are not training frontier models. They are running RAG over their own documents on open-weight models, where the right cluster size is small, utilization is steady, and the cost-per-answer math is favorable — provided the TCO model is built honestly.
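
A one-line sanity check on that figure, using only the numbers Lenovo reports [4]:

    gpu_hours = 39.3e6     # Llama 3.1 training compute, per Lenovo [4]
    cloud_usd = 483e6      # hypothetical AWS P5 H100 equivalent cost, per Lenovo [4]
    print(cloud_usd / gpu_hours)   # ~12.3 -> implies roughly $12.30 per H100 GPU-hour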

On-prem AI is not cheap and it is not magic. It is a capital decision with a known shape: high upfront investment, predictable recurring cost, and a cost-per-output curve that rewards utilization, citation-grade output, and a stack that does not require five vendors to operate. Build the model that way and the answer to 'what will it cost' stops being a guess.


Talk to our team about a line-item TCO model for your workload: https://wavenetic.com

Sources

  1. On-Premise AI Total Cost of Ownership (TCO)
  2. Cost of AI Server: On-Prem, Data Centers & Hyperscalers
  3. Total Cost of Ownership: Cloud AI vs On-Premises AI
  4. On-Premise vs Cloud: Generative AI Total Cost of Ownership (2025 Edition)
  5. On-Premise vs Cloud: Generative AI Total Cost of Ownership (2026 Edition)
  6. Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters
  7. AI App Development Cost + Team Calculator
  8. The Cost of Implementing AI in a Business: A Comprehensive Analysis