A developer has documented their fully offline AI-assisted development machine, built around a 2025 ASUS ROG Flow Z13 with 128GB of unified memory, running vanilla Arch Linux and the niri scrolling Wayland compositor. The setup uses a custom llama.cpp build with ROCm/HIP acceleration to serve local LLMs such as Qwen3.6 27B and Gemma 4 31B, accessed through the OpenCode coding agent.

Hardware: 128GB Unified Memory Changes Everything

The ASUS ROG Flow Z13 GZ302EA is a tablet-form-factor machine with an AMD Ryzen AI Max+ 395 (16 cores, 32 threads) and a Radeon 8060S integrated GPU (40 compute units). The standout spec is 128GB of unified memory, configurable in BIOS; the author assigned 64GB to the GPU and 64GB to the CPU. "For local AI work, memory changes everything," they write. "A 27B quantized model, a large context window, Docker, Chrome, and an editor can happily eat memory like there is no tomorrow."
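
A rough way to sanity-check that split from a running system (not shown in the original post; assumes the ROCm utilities are installed) is to ask the GPU and the OS separately:

# dedicated VRAM carve-out as seen by the GPU
rocm-smi --showmeminfo vram
# memory left on the CPU side
free -h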

Operating System: Vanilla Arch Linux for Cutting-Edge Drivers

The author switched from Fedora to Arch Linux to get the latest kernel, Mesa, and ROCm-adjacent bits without waiting for distro releases; new hardware like the Flow Z13 benefits from bleeding-edge packages. The installation uses Btrfs with subvolumes, the GRUB bootloader, paru for AUR packages, Timeshift for snapshots, and PipeWire for audio. They also use Topgrade to update everything (pacman, AUR, brew, cargo, npm, VS Code plugins, Docker images) with a single command in the Kitty terminal.
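
A condensed sketch of that tooling layer might look like the following; the package names are assumptions based on current Arch repositories, not taken from the author's dotfiles:

# core pieces named above: Btrfs tools, GRUB, snapshots, audio, terminal, updater
sudo pacman -S --needed btrfs-progs grub timeshift pipewire wireplumber kitty topgrade
# paru with no arguments syncs both repo and AUR packages
paru
# topgrade then chains pacman/AUR, brew, cargo, npm, VS Code plugins, Docker images, ...
topgrade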

Desktop: niri + DankMaterialShell

Instead of a traditional desktop environment like GNOME or KDE, the author uses niri, a scrollable tiling Wayland compositor. Windows live in columns and you scroll horizontally — a workflow that feels natural on ultrawide monitors. They pair it with DankMaterialShell (DMS), which provides the top bar, application launcher, clipboard manager, notification center, lock screen, and more. The theme is Catppuccin Macchiato everywhere, with Inter Variable and JetBrainsMono Nerd Font.
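
Installing the visible pieces of that desktop on Arch looks roughly like this; the package names, especially the DankMaterialShell one, are guesses rather than something taken from the post:

# compositor and fonts from the official repositories
sudo pacman -S --needed niri inter-font ttf-jetbrains-mono-nerd
# DankMaterialShell from the AUR -- hypothetical package name, check the project's README
paru -S --needed dankmaterialshell-git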

Offline AI Stack: llama.cpp + OpenCode

The core of the AI setup is a custom llama.cpp build compiled with HIP support for AMD GPUs. The build script uses:

cmake -S /mnt/work/Workspace/llms/llama.cpp \
  -B /mnt/work/Workspace/llms/llama.cpp/build-hip \
  -G Ninja \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build /mnt/work/Workspace/llms/llama.cpp/build-hip \
  --config Release \
  -j "$(nproc)" \
  --target llama-server llama-bench
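
Once the build finishes, the binaries land under build-hip/bin. A quick sanity check, not part of the original script, is to run the server binary with --version and confirm it starts without complaining about missing HIP libraries:

/mnt/work/Workspace/llms/llama.cpp/build-hip/bin/llama-server --version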

The server runs with full GPU offload, flash attention, and f16 KV cache:

ROCBLAS_USE_HIPBLASLT=1 llama-server \
  --model "$model" \
  --alias "$alias_name" \
  --host 127.0.0.1 \
  --port 18080 \
  --ctx-size "$ctx" \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-mmap \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --reasoning "$reasoning"
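
Since llama-server exposes an OpenAI-compatible API, a quick smoke test from the shell looks roughly like this (the port comes from the command above, the model alias matches the OpenCode config further down, and the prompt is just an example):

curl -s http://127.0.0.1:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-6-27b-q8-0",
        "messages": [{"role": "user", "content": "Write a one-line hello world in Rust."}],
        "max_tokens": 64
      }'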

They use two primary models: Qwen3.6 27B and Gemma 4 31B, with quantization levels from 4-bit to 8-bit. The default is Qwen3.6-27B-Q8_0 with 256k context. Benchmarks show:

Model           Quant   Size       Prompt t/s  Gen t/s
Qwen3.6 27B     Q4_K_M  15.40 GiB  260.06      10.41
Qwen3.6 27B     Q6_K    20.56 GiB  279.37      8.70
Qwen3.6 27B     Q8_0    26.62 GiB  260.12      7.18
Gemma 4 31B IT  Q4_K_M  17.39 GiB  209.57      9.12
Gemma 4 31B IT  Q8_0    30.38 GiB  202.31      6.19

With the full 256k context, Qwen3.6 27B Q8_0 achieves roughly 64 tokens/s combined across prompt processing and generation, using about 70% of GPU memory.
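
Numbers like these can be produced with the llama-bench binary built earlier; the post does not show the exact invocation, so the flags and the model path below are a sketch:

# hypothetical model path; -ngl 999 offloads all layers to the GPU, as in the server command
ROCBLAS_USE_HIPBLASLT=1 llama-bench \
  -m /mnt/work/Workspace/llms/models/Qwen3.6-27B-Q8_0.gguf \
  -ngl 999 \
  -p 512 -n 128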

OpenCode is configured as the coding agent with a local llama.cpp provider and OpenRouter as a backup for complex tasks. The config snippet shows the provider setup:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp ROCm (local)",
      "options": {
        "baseURL": "http://127.0.0.1:18080/v1"
      },
      "models": {
        "qwen3-6-27b-q8-0": {
          "name": "Qwen3.6 27B Q8_0 (local ROCm)",
          "limit": {
            "context": 262144,
            "output": 16384
          }
        }
      }
    },
    "openrouter": {
      "models": {
        "moonshotai/kimi-k2.6": {
          "name": "Kimi K2.6 (OpenRouter backup)",
          "limit": {
            "context": 262144,
            "output": 16384
          }
        }
      }
    }
  }
}

Why Offline Matters

The author explicitly chose local-first tooling to avoid sending code, prompts, or logs to remote APIs. "Not because every project is secret, but because local-first tooling is a good capability to have especially in a world that's heading towards techno oligarchy." They also use an opencode-telegram-bot to manage OpenCode sessions remotely from Telegram.

Practical Advice

If you want to reproduce this setup, the author is publishing a stripped-down public config at deepu105/archdots. The full personal config is private because of machine-specific quirks like ASUS hotkeys, touchpad behavior, and Wi-Fi fixes. They recommend starting with the public dotfiles and adapting them to your hardware.

For developers considering local AI, the key takeaway is that 128GB of unified memory makes running a 27B model with a large context window practical; with less memory, you would need smaller models or more aggressive quantizations. A llama.cpp build with HIP support is critical for AMD GPUs, as the author found Ollama and LM Studio less performant on this hardware.