NanoEuler: Build GPT-2 from Scratch in C/CUDA

A developer named JustVugg released NanoEuler, a GPT-2-class language model built entirely from scratch in C and CUDA. No PyTorch, no autograd, no ML libraries. The forward and backward passes are hand-written and verified, and the whole training pipeline lives in a single GitHub repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model. RLHF/DPO is planned.

Architecture Details

The model is a decoder-only transformer built from modern components: RMSNorm, RoPE, grouped-query attention (GQA), SwiGLU, and MTP.

Each residual block computes x = x + attn(rmsnorm(x)), followed by x = x + swiglu(rmsnorm(x)). The project name comes from the observation that a residual connection x = x + f(x) is exactly one forward-Euler step, with step size 1, for the ODE dx/dt = f(x).

Configurations

The repo provides two configurations:

The head size is 64 (768/12), which fits the FlashAttention kernel.

Verified Backward Pass

Every analytic gradient is compared against a central finite difference in double precision. Run the check with make check; it prints the maximum relative error for each tensor:

tok      : max rel err 1.02e-04
qkvw     : max rel err 7.20e-07
gatew    : max rel err 6.86e-08
...
max relative error: 1.02e-04
>>> backward OK (error < 1e-2)

Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.

GPU Engine (CUDA)

The CUDA engine in cuda/nanoeuler_cuda.cu is a full from-scratch port: forward, backward, training, and inference all run on the GPU. Every kernel is validated on the device against a CPU reference, and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).

Kernels: matmul (delegated to cuBLAS with TF32 tensor cores), RMSNorm, RoPE, grouped-query attention with a hand-written FlashAttention (tiled, online softmax, no T×T matrix in memory), SwiGLU, softmax/cross-entropy, and AdamW. FlashAttention made the training step about 3× faster.

Build command (RTX 40-series = Ada = sm_89):

cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas

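Modes (as used in the chat pipeline below): t pretrains the base model, s runs supervised fine-tuning, and c starts the chat loop.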

Chat Pipeline

The chat pipeline has two stages. First, pretrain the ~116M-parameter base on the books + web mix (./nanoeuler_cuda t). Then supervised fine-tuning turns it into an assistant: ./nanoeuler_cuda s loads the pretrained base, renders each Alpaca example with the standard instruction template, and trains with the loss masked to the response tokens only. The result is saved to nanoeuler_chat.bin; ./nanoeuler_cuda c then wraps each line you type in the same template and samples a reply.

After fine-tuning, the model answers in the right shape: it follows the instruction→response format, writes complete sentences, and stops on its own. The content, though, is shallow and often wrong. This is a small model trained on a single GPU, so it has little world knowledge to express. SFT teaches the model how to respond, not what it knows.

Data

Pretraining uses a real books + web mix. Download each source with the scripts below, then concatenate the two text files into the pretraining corpus:

sh data/get_gutenberg.sh                       # books  -> data/gutenberg.txt
sh data/get_web.sh                             # web    -> data/web.txt (~1 GB by default)
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
sh data/get_alpaca.sh                          # instruction data for SFT -> data/alpaca.json

Roadmap

Why It Matters

NanoEuler is a complete, understandable training pipeline for a decoder-only transformer, from tokenizer to fine-tuned chat model, with no external ML dependencies. It's a goldmine for developers who want to understand every piece of a modern language model, from gradient computation to CUDA kernel design. If you've ever felt that PyTorch's autograd hides too much, this is the antidote.