Infrafire.ai | Self-hosted inference control plane

Inference is becoming infrastructure.

Dedicated GPU capacity is moving into enterprises, private clouds, and provider racks. The missing layer is operational: endpoints, keys, usage, quotas, and health.

GPU capacity needs a product surface.

Raw Kubernetes and model servers do the runtime work. Teams still need a way to package capacity into endpoints people can use.

Private inference needs local control.

Runtime ownership, data locality, key management, and usage visibility matter when inference becomes part of core infrastructure.

Metering turns GPUs into accountable services.

Enterprises need chargeback. GPU providers need customer usage records. Both need a reliable ledger before billing systems arrive.

Built for the teams buying and operating GPU capacity.

Infrafire speaks to the infrastructure owner first. Application teams get a stable endpoint; platform teams keep control of the runtime.

Enterprises

Give internal teams private LLM endpoints on company-owned infrastructure, with centralized deployment, access, usage, and health controls.

GPU providers

Turn dedicated GPU infrastructure into metered customer-facing inference endpoints with tenant isolation and usage records.

The control plane runs where inference runs.

Install into your cluster

Infrafire is packaged for an existing Kubernetes environment. The installer brings up the local control plane, gateway, worker, controller, and supporting services in the customer namespace.

Detect Kubernetes version, ingress, storage class, and GPU resources.
Verify NVIDIA GPU Operator and DCGM metrics are present or explain what is missing.
Create the first admin account and default tenant.
Deploy a model from a small curated vLLM catalog.

helm upgrade --install infrafire \
  oci://registry.infrafire.ai/charts/infrafire \
  --namespace infrafire-system \
  --create-namespace \
  --set app.url=https://infrafire.company.internal \
  --set gateway.url=https://llm.company.internal

What Infrafire manages.

The product sits between Kubernetes and application teams, turning GPU-backed model servers into controlled internal services.

Model deployments

Pick a supported model, target a GPU cluster, and let the controller create vLLM workloads.

Private endpoints

Serve `/v1/chat/completions` behind a local gateway with API-key auth.

Usage ledger

Record requests, tokens, latency, errors, model, deployment, tenant, project, and key.

Tenant controls

Create tenants, projects, and API keys for internal teams or external customers.

Rate limits

Apply request limits at the API key or endpoint layer before traffic reaches the model server.

GPU health

Track GPU inventory, utilization, memory, endpoint status, and deployment errors.

What changes when the whole stack runs on your cluster.

Each path can work. The difference is ownership: who runs the stack, who sees the usage, and who controls the endpoint.

Criterion

Infrafire.ai

DIY Kubernetes + vLLM

Cloud LLM APIs

Generic AI platforms

Deployment model

Self-hosted. Full stack runs in customer Kubernetes.

Self-hosted. Control surfaces are assembled by the platform team.

Provider-hosted. Fast start with externalized runtime ownership.

SaaS or hybrid deployments with application-layer tooling.

Primary buyer

Platform teams, enterprises, GPU providers, data centers.

Infrastructure engineers with time to build internal tooling.

Application teams that want immediate model access.

Business teams buying workflows, assistants, and app tooling.

OpenAI-compatible API

Local gateway. Private models sit behind one API surface.

vLLM provides the server. Auth, routing, and metering remain custom work.

Built in. Inference runs on provider infrastructure.

Often present as part of a larger proprietary platform.

Governance and metering

Native. Tenants, keys, quotas, usage, and health are first-order objects.

Custom work. Teams build dashboards and ledgers themselves.

Usage reports are provider-centric and disconnected from your GPU fleet.

Governance usually sits above the infrastructure layer.

When it wins

When GPUs, data, and inference operations must stay under local control.

When you need maximum flexibility and accept ongoing platform cost.

When speed matters more than infra ownership or margin control.

When teams want app workflows and assistant tooling.

Try it on a GPU Kubernetes cluster.

Bring an existing NVIDIA Kubernetes cluster. Infrafire installs locally, detects GPUs, deploys a vLLM model, and exposes a private endpoint.

Request a technical trial

Talk about the market shift.

Inference is moving from external APIs to owned infrastructure. Infrafire is the control plane for the teams operating that infrastructure.

Start the conversation

Run private LLM endpoints on your own GPU clusters.