One Command, Zero Internet: Offline AI on K3s
Deploying Kubernetes in the cloud is straightforward when every node can reach package repositories, container registries, GitHub, and model hubs. The task becomes much more interesting when the target server has no internet connection at all.
Andy Golubev built a proof-of-concept AI platform on a single Ubuntu server with K3s, NVIDIA GPU support, vLLM with a locally stored Qwen2.5-7B-Instruct model, Argo CD, a FastAPI/LangChain chatbot, and an observability stack (Prometheus, Grafana, Loki, Tempo, OpenTelemetry). The target machine is an Ubuntu 26.04 AMD64 server with an NVIDIA A10G GPU (validated on EC2 g5.2xlarge). The key requirement: after copying the bundle to the server, the installation must make zero external network requests.
Two-Environment Design: Prepare and Deploy
The solution separates preparation from installation. A connected machine downloads and packages everything. The isolated machine only consumes local files. This boundary is the main architectural decision.
Preparing the Payload
The repository provides one aggregate command:
cd offline-bundle
./scripts/download-all-artifacts.sh
The script requires Docker and at least 50 GB of free space. On macOS and non-AMD64 systems, it uses an Ubuntu 26.04 AMD64 container so that downloaded packages match the target architecture. Each artifact group is handled by a specialized script. Completed steps have content and environment fingerprints, so an interrupted download can resume without rebuilding everything. A --clean option is available for a completely fresh payload.
The generated payload/ directory contains binaries, .deb packages, OCI image archives, Kubernetes manifests, tools, and model weights. It is generated locally and intentionally ignored by Git because it is large and reproducible. The vLLM image is around 8 GB compressed, and the Qwen2.5 7B model requires roughly 14–15 GB. Disk planning is not optional; the target installer checks for at least 80 GB of free space before it starts.
One Command on the Isolated Server
After preparing the payload, copy the complete offline-bundle/ directory to the target and run:
cd offline-bundle
./install.sh
The installer elevates with sudo, verifies Ubuntu version and CPU architecture, checks free space, and validates the SHA256 checksum of every artifact. It then installs Ansible from local .deb files and runs a localhost playbook with ansible_connection=local. The roles run in a deliberate order:
- Install K3s from its binary and air-gap image archive.
- Install NVIDIA driver and container toolkit, then expose the GPU to Kubernetes.
- Start local registry and Git mirror, then install Argo CD.
- Install k9s.
- Deploy observability stack.
- Load the model and start vLLM.
The installer and roles are idempotent. If a validation fails, you can correct the problem and run the same command again.
What Runs Inside the Server
The result is a complete platform on one machine. Some supporting services (Git daemon) run on the host. Application and platform workloads run inside K3s.
K3s imports its standard air-gap archive directly. Additional images are imported into containerd and pushed to a registry listening on localhost:5000. Workloads use local image references, and the vLLM deployment uses imagePullPolicy: Never as an extra guard against accidental pulls.
The Qwen2.5-7B-Instruct snapshot is copied to /opt/models and mounted into the vLLM pod. vLLM exposes an OpenAI-compatible endpoint on port 8000 and gets the single nvidia.com/gpu resource advertised by the NVIDIA device plugin.
GitOps Without GitHub
Argo CD usually pulls desired state from an external Git provider. In an isolated network, the bundle creates bare repositories on the target and serves them with a read-only Git daemon. An in-cluster Service and Endpoints object exposes the host daemon as git://git-mirror.gitops.svc.cluster.local. Argo CD reads an app-of-apps repository from this address, discovers the agent application, and deploys its Helm chart.
The Local AI Application
To verify the stack works as a platform, a small chatbot is included. It uses FastAPI, LangChain, and ChatOpenAI, but points the client to the internal vLLM service instead of a public API. The application supports a system prompt and conversation history. It also adds an OpenTelemetry span around every model invocation. Optionally, Langfuse integration can be enabled when credentials are provided.
Observing the Model and GPU
The bundle includes a compact observability stack:
- Prometheus for metrics
- Grafana for dashboards
- Loki and Promtail for logs
- Tempo for traces
- OpenTelemetry Collector for OTLP ingestion
- kube-state-metrics and node-exporter for Kubernetes and host metrics
- NVIDIA DCGM exporter for GPU metrics
Prometheus scrapes vLLM metrics (e.g., running requests) and GPU utilization from DCGM exporter. Grafana provisions Prometheus, Loki, and Tempo as datasources and loads a bundled vLLM/GPU dashboard automatically.
Validation: Critical in Air-Gapped Environments
In a connected environment, a missing image or package may be downloaded later. In an isolated environment, a missing transitive dependency can stop the entire installation. The preparation scripts verify payload structure and checksums before transfer. During installation, Ansible roles validate each layer:
- K3s node reaches Ready
- Host reports NVIDIA A10G with
nvidia-smi - Kubernetes reports
nvidia.com/gpu: 1as allocatable - vLLM pod starts without external image pull
- Model loads from
/opt/models/Qwen2.5-7B-Instruct /v1/modelsand chat completion APIs respond- Argo CD applications synchronize from local Git mirror
- Prometheus targets are reachable
- Grafana has provisioned datasources and dashboard
- Loki returns logs and Tempo returns chatbot trace
The repository includes exact commands in offline-bundle/VALIDATION.md.
Lessons Learned
The most difficult part of an offline Kubernetes installation is not K3s itself. It is the complete dependency graph: OS packages, container images, GPU kernel modules, tools, model files, manifests, and runtime services that normally assume internet access. Key choices:
- Use one explicit network boundary.
- Pin and record artifacts.
- Verify before transfer.
- Keep runtime local.
- Validate each layer.
This is a proof-of-concept on a single node. Production would need decisions about high availability, storage redundancy, security hardening, and signed bundle updates. Still, the project demonstrates that cloud-native workflows can operate in an isolated environment when artifact preparation is treated as a first-class part of the architecture.
Get the Code
The code is available at github.com/andygolubev/ansible-k3s-on-prem.




