Build an Offline AI Platform with K3s, vLLM, and Argo CD

One Command, Zero Internet: Offline AI on K3s

Deploying Kubernetes in the cloud is straightforward when every node can reach package repositories, container registries, GitHub, and model hubs. The task becomes much more interesting when the target server has no internet connection at all.

Andy Golubev built a proof-of-concept AI platform on a single Ubuntu server with K3s, NVIDIA GPU support, vLLM with a locally stored Qwen2.5-7B-Instruct model, Argo CD, a FastAPI/LangChain chatbot, and an observability stack (Prometheus, Grafana, Loki, Tempo, OpenTelemetry). The target machine is an Ubuntu 26.04 AMD64 server with an NVIDIA A10G GPU (validated on EC2 g5.2xlarge). The key requirement: after copying the bundle to the server, the installation must make zero external network requests.

Two-Environment Design: Prepare and Deploy

The solution separates preparation from installation. A connected machine downloads and packages everything. The isolated machine only consumes local files. This boundary is the main architectural decision.

Preparing the Payload

The repository provides one aggregate command:

cd offline-bundle
./scripts/download-all-artifacts.sh

The script requires Docker and at least 50 GB of free space. On macOS and non-AMD64 systems, it uses an Ubuntu 26.04 AMD64 container so that downloaded packages match the target architecture. Each artifact group is handled by a specialized script. Completed steps have content and environment fingerprints, so an interrupted download can resume without rebuilding everything. A --clean option is available for a completely fresh payload.

The generated payload/ directory contains binaries, .deb packages, OCI image archives, Kubernetes manifests, tools, and model weights. It is generated locally and intentionally ignored by Git because it is large and reproducible. The vLLM image is around 8 GB compressed, and the Qwen2.5 7B model requires roughly 14–15 GB. Disk planning is not optional; the target installer checks for at least 80 GB of free space before it starts.

One Command on the Isolated Server

After preparing the payload, copy the complete offline-bundle/ directory to the target and run:

cd offline-bundle
./install.sh

The installer elevates with sudo, verifies Ubuntu version and CPU architecture, checks free space, and validates the SHA256 checksum of every artifact. It then installs Ansible from local .deb files and runs a localhost playbook with ansible_connection=local. The roles run in a deliberate order:

Install K3s from its binary and air-gap image archive.
Install NVIDIA driver and container toolkit, then expose the GPU to Kubernetes.
Start local registry and Git mirror, then install Argo CD.
Install k9s.
Deploy observability stack.
Load the model and start vLLM.

The installer and roles are idempotent. If a validation fails, you can correct the problem and run the same command again.

What Runs Inside the Server

The result is a complete platform on one machine. Some supporting services (Git daemon) run on the host. Application and platform workloads run inside K3s.

K3s imports its standard air-gap archive directly. Additional images are imported into containerd and pushed to a registry listening on localhost:5000. Workloads use local image references, and the vLLM deployment uses imagePullPolicy: Never as an extra guard against accidental pulls.

The Qwen2.5-7B-Instruct snapshot is copied to /opt/models and mounted into the vLLM pod. vLLM exposes an OpenAI-compatible endpoint on port 8000 and gets the single nvidia.com/gpu resource advertised by the NVIDIA device plugin.

GitOps Without GitHub

Argo CD usually pulls desired state from an external Git provider. In an isolated network, the bundle creates bare repositories on the target and serves them with a read-only Git daemon. An in-cluster Service and Endpoints object exposes the host daemon as git://git-mirror.gitops.svc.cluster.local. Argo CD reads an app-of-apps repository from this address, discovers the agent application, and deploys its Helm chart.

The Local AI Application

To verify the stack works as a platform, a small chatbot is included. It uses FastAPI, LangChain, and ChatOpenAI, but points the client to the internal vLLM service instead of a public API. The application supports a system prompt and conversation history. It also adds an OpenTelemetry span around every model invocation. Optionally, Langfuse integration can be enabled when credentials are provided.

Observing the Model and GPU

The bundle includes a compact observability stack:

Prometheus for metrics
Grafana for dashboards
Loki and Promtail for logs
Tempo for traces
OpenTelemetry Collector for OTLP ingestion
kube-state-metrics and node-exporter for Kubernetes and host metrics
NVIDIA DCGM exporter for GPU metrics

Prometheus scrapes vLLM metrics (e.g., running requests) and GPU utilization from DCGM exporter. Grafana provisions Prometheus, Loki, and Tempo as datasources and loads a bundled vLLM/GPU dashboard automatically.

Validation: Critical in Air-Gapped Environments

In a connected environment, a missing image or package may be downloaded later. In an isolated environment, a missing transitive dependency can stop the entire installation. The preparation scripts verify payload structure and checksums before transfer. During installation, Ansible roles validate each layer:

K3s node reaches Ready
Host reports NVIDIA A10G with nvidia-smi
Kubernetes reports nvidia.com/gpu: 1 as allocatable
vLLM pod starts without external image pull
Model loads from /opt/models/Qwen2.5-7B-Instruct
/v1/models and chat completion APIs respond
Argo CD applications synchronize from local Git mirror
Prometheus targets are reachable
Grafana has provisioned datasources and dashboard
Loki returns logs and Tempo returns chatbot trace

The repository includes exact commands in offline-bundle/VALIDATION.md.

Lessons Learned

The most difficult part of an offline Kubernetes installation is not K3s itself. It is the complete dependency graph: OS packages, container images, GPU kernel modules, tools, model files, manifests, and runtime services that normally assume internet access. Key choices:

Use one explicit network boundary.
Pin and record artifacts.
Verify before transfer.
Keep runtime local.
Validate each layer.

This is a proof-of-concept on a single node. Production would need decisions about high availability, storage redundancy, security hardening, and signed bundle updates. Still, the project demonstrates that cloud-native workflows can operate in an isolated environment when artifact preparation is treated as a first-class part of the architecture.

Get the Code

The code is available at github.com/andygolubev/ansible-k3s-on-prem.

Build an Offline AI Platform with K3s, vLLM, and Argo CD

One Command, Zero Internet: Offline AI on K3s

Two-Environment Design: Prepare and Deploy

Preparing the Payload

One Command on the Isolated Server

What Runs Inside the Server

GitOps Without GitHub

The Local AI Application

Observing the Model and GPU

Validation: Critical in Air-Gapped Environments

Lessons Learned

Get the Code

Editor's Take

Key Takeaways

Why It Matters

Get the weekly digest

You might also like

Deming's 94% Rule: Why Your Dev Team Feels Slow (and How to Fix It)

Proxy Traffic with Netcat: Pipes, FIFOs, and SIGPIPE

TI's MSPM0C1104: 1.38mm² 24MHz Cortex-M0+ MCU with 16KB Flash

Cloudflare Finds Hyper Bug That Truncated Large Image Responses

TypeScript 4.9's satisfies: 5 Patterns You're Missing in 2026

Google's TabFM: Zero-shot tabular classification without training