7 Kernel Parameters That Slashed Latency 10x – No Code Changes

A production API server had 200ms average latency and 800ms P99. After tuning seven kernel parameters, with no code or infrastructure changes, latency dropped to 20ms average and 85ms P99. Kernel defaults target general-purpose machines. Production servers need settings tuned to their workload.

The CPU Scheduler Isn't Always Fair

Linux has long used the Completely Fair Scheduler (CFS); kernels 6.6 and later replace it with EEVDF, which keeps the same fairness goal. It's fair, not fast. Under contention, nice values matter: -20 (highest priority) to 19 (lowest). Your app runs at nice 0 by default. But nice values only matter when processes compete for cores. If you have 16 cores and 32 runnable processes, the scheduler decides who runs.

CPU pinning reduces cache pollution. When processes bounce between cores, they lose their warm L1/L2 cache lines and have to reload data. Pin latency-sensitive workloads:

taskset -c 0-3 ./my-application

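Or via cgroups (cgroup v1 path shown; under cgroup v2 the cpuset.cpus file sits directly in the group's own directory):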

echo "0-3" > /sys/fs/cgroup/cpuset/my-app/cpuset.cpus

Financial trading systems and game servers live by CPU pinning. For web services, it's rarely worth it unless you've measured cache misses as your bottleneck.
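To confirm that pinning took effect, the kernel reports each process's allowed cores in its /proc status file. A minimal sketch, assuming a Linux /proc layout; the function name cpus_allowed is made up for this example, and the status path is a parameter so it can be tried on a saved copy:

```shell
# Print the cores a process may run on, taken from the Cpus_allowed_list field.
# $1 is the path to a status file, e.g. /proc/1234/status.
cpus_allowed() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "$1"
}
```

Run as cpus_allowed /proc/$(pidof my-application)/status. After the taskset command above it should report 0-3.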

The NUMA trap: On multi-socket servers, Non-Uniform Memory Access means each CPU socket has local memory (fast) and remote memory (2-3x slower). If your app runs on socket 0 but allocates memory on socket 1, every access pays a penalty. Check with:

numactl --hardware
numactl --localalloc ./my-application

Most cloud VMs abstract NUMA away, but bare metal servers? Check your topology.
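On bare metal, the per-node counters under /sys/devices/system/node give a rough read on whether allocations are landing where they were meant to. The sketch below assumes the usual numastat file layout (one "name value" pair per line); numa_miss_pct is a hypothetical helper name:

```shell
# Print the share of allocations on this node that were intended for a
# different node, as a percentage with one decimal place.
# $1 is a per-node numastat file, e.g. /sys/devices/system/node/node0/numastat.
numa_miss_pct() {
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END { if (hit + miss > 0) printf "%.1f\n", miss * 100 / (hit + miss) }' "$1"
}
```

A value that keeps climbing while your app runs suggests it is worth trying numactl --localalloc or pinning the process to one socket.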

Memory: Page Cache Is Your Best Friend

Linux uses all free memory as page cache. When you see "10 GB used, 2 GB free" on a 16 GB server, it doesn't mean you're low on memory. It means 4 GB is page cache, released the moment a process needs it.

free -h lies if you don't read carefully. Look at the "available" column, not "free." Available = free + reclaimable cache.
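The same "available" figure is exposed as MemAvailable in /proc/meminfo, which is easier to script against than free's columns. A small sketch; mem_available_pct is an illustrative name, and the file path is a parameter so the function can be run against a saved copy:

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# $1 is optional and defaults to /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

Alerting on this percentage, rather than on "free", avoids paging someone because the page cache is doing its job.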

Swap: When physical memory is exhausted, Linux moves pages to swap (disk). Disk access is orders of magnitude slower than RAM: roughly 1,000x for a fast SSD, and far worse for a spinning disk. A system actively swapping is dying slowly.

vm.swappiness controls swap aggressiveness. Default is 60. For database servers: set it to 1, not 0. A value of 0 does not turn swap off, but on modern kernels it tells the kernel to avoid swapping until memory is almost gone, so the OOM killer can strike with little warning. For Redis: set it to 1 and monitor closely.

sysctl vm.swappiness=1

OOM Killer: When memory is exhausted and swap is full, the kernel picks a process to kill. Protect critical processes:

echo -1000 > /proc/$(pidof my-critical-app)/oom_score_adj

This tells the OOM killer never to pick this process. The catch: if the protected process is the one eating memory, -1000 won't fix anything. The kernel kills everything else first, and can panic once nothing killable is left.
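Before adjusting anything, it helps to see who the OOM killer would pick today. Each process exposes its current score in /proc/PID/oom_score, and higher scores go first. A rough sketch; top_oom_candidates is a made-up name, and the proc root is a parameter so the function can be pointed at a test directory:

```shell
# List the most likely OOM victims, highest oom_score first.
# $1: proc root (default /proc). $2: how many entries to show (default 5).
# Output columns: score, pid, command name.
top_oom_candidates() {
    root="${1:-/proc}"
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n "${2:-5}"
}
```

If your database sits at the top of that list, that is the process to protect, or better, the memory limit to revisit.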

The team that disabled swap: They read a blog post saying swap hurts performance. They set vm.swappiness=0 and disabled swap. For months, plenty of RAM. Then a memory leak in a sidecar consumed memory over 3 weeks. Without swap as a buffer, the OOM killer fired at 2 AM, killing the primary database process. No graceful shutdown. Transaction log corruption. 4-hour recovery.

Swap isn't the enemy. Uncontrolled swap is. A small swap partition (2-4 GB) buys time: memory pressure shows up as swap activity that you and your monitoring can catch before the OOM killer starts killing processes.

I/O: The Scheduler You Didn't Know Existed

Disk I/O has its own scheduler. It determines the order of read/write requests.

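deadline: Assigns each request a deadline (500ms for reads, 5s for writes), so no request starves. Good for databases. This is the legacy single-queue scheduler, removed in kernel 5.0.
mq-deadline: Multi-queue version of deadline. Usually the default for SATA and SAS devices on current kernels, and a safe choice.
none (noop on legacy kernels): No reordering. Passes requests straight to the device. Use for NVMe SSDs, where the device does its own scheduling; it is typically the NVMe default.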

Check and change:

cat /sys/block/sda/queue/scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler

For SSDs and NVMe: use none or mq-deadline. For spinning disks: use mq-deadline or bfq (deadline on pre-5.0 kernels).
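The scheduler file lists every available scheduler and marks the active one with square brackets, for example "mq-deadline kyber [none] bfq". A tiny helper, named active_scheduler here purely for illustration, pulls out just the active one:

```shell
# Read a scheduler line such as "mq-deadline kyber [none] bfq" on stdin
# and print only the active (bracketed) scheduler.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}
```

Use it as active_scheduler < /sys/block/sda/queue/scheduler, or loop over /sys/block/*/queue/scheduler to audit every device on the box.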

Network Stack: The Parameters That Change Everything

Default Linux network settings are conservative. Designed for a general-purpose machine, not a server handling tens of thousands of connections.

net.core.somaxconn: Maximum queued connections for acceptance. Default: 4096 since kernel 5.4 (128 on older kernels). If your app can't accept fast enough, new connections drop.

sysctl net.core.somaxconn=65535

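Also increase the application's own listen backlog to match. The kernel uses the smaller of somaxconn and the backlog passed to listen(), so raising only one of them changes nothing.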
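To see whether an accept queue is actually filling up, look at ss -ltn: for listening sockets, Recv-Q is the current queue length and Send-Q is the configured backlog. The sketch below flags listeners whose queue has reached its limit; full_listen_queues is a hypothetical name, and it assumes the standard ss column order (State, Recv-Q, Send-Q, Local Address:Port):

```shell
# Read `ss -ltn` output on stdin and print listeners whose accept queue
# is at or over its backlog, as "address:port current/limit".
full_listen_queues() {
    awk 'NR > 1 && $1 == "LISTEN" && ($3 + 0) > 0 && ($2 + 0) >= ($3 + 0) {
             print $4, $2 "/" $3
         }'
}
```

Run it as ss -ltn | full_listen_queues. Any output during peak traffic means the app is not calling accept() fast enough for the backlog it has.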

net.ipv4.tcp_tw_reuse: Reuse sockets in TIME_WAIT for new outgoing connections. It has no effect on incoming connections and relies on TCP timestamps being enabled. On servers making many short-lived outbound connections, TIME_WAIT sockets can exhaust ephemeral ports.

sysctl net.ipv4.tcp_tw_reuse=1
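Whether TIME_WAIT is a real problem on a given box is easy to measure first. This sketch tallies sockets per TCP state from ss -tan output; tcp_state_counts is an illustrative name, and it assumes the state is the first column, as ss prints it:

```shell
# Read `ss -tan` output on stdin and print "count state" pairs,
# most common state first.
tcp_state_counts() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Run it as ss -tan | tcp_state_counts and compare the TIME-WAIT count with the ephemeral port range from sysctl net.ipv4.ip_local_port_range. If the count is a small fraction of the range, this setting is unlikely to be your bottleneck.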

net.core.rmem_max / net.core.wmem_max: Maximum receive/send buffer sizes. Defaults are often too low for high-throughput apps.

sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"
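A common way to sanity-check a buffer ceiling like 16777216 is the bandwidth-delay product: the amount of data that can be in flight on a link at once. The source does not say how its values were chosen, so treat this as a rule-of-thumb sketch; bdp_bytes is a made-up helper:

```shell
# Bandwidth-delay product in bytes.
# $1: link speed in Mbit/s. $2: round-trip time in ms.
# 1 Mbit/s sustained for 1 ms is 125 bytes, hence the constant.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}
```

For a 1 Gbit/s link with a 100ms round trip, bdp_bytes 1000 100 gives 12500000 bytes, which fits under the 16 MB maximum above. Within a single datacenter, where round trips are well under a millisecond, far smaller buffers are enough.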

net.ipv4.tcp_keepalive_time: Idle time before the kernel sends keepalive probes on sockets that have SO_KEEPALIVE enabled. Default: 7200 seconds (2 hours). If a client disappears without closing, the server won't notice for more than 2 hours.

sysctl net.ipv4.tcp_keepalive_time=600
sysctl net.ipv4.tcp_keepalive_intvl=60
sysctl net.ipv4.tcp_keepalive_probes=5
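The three settings combine: a dead peer is declared gone after the idle time plus one interval per failed probe. A quick way to work out the worst case (keepalive_detect_secs is just an illustrative name):

```shell
# Seconds until a silent peer is detected.
# $1: tcp_keepalive_time  $2: tcp_keepalive_intvl  $3: tcp_keepalive_probes
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}
```

With the values above, keepalive_detect_secs 600 60 5 gives 900 seconds (15 minutes). The kernel defaults of 7200, 75, and 9 work out to 7875 seconds, a little over 2 hours and 11 minutes.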

Profiling: perf, strace, eBPF

When metrics don't tell enough, go deeper.

perf – CPU profiling at function level:

perf record -g -p $(pidof my-app) -- sleep 30
perf report

Flame graphs (Brendan Gregg's scripts) make perf output readable. If 40% of CPU time is in malloc, you have a memory allocation problem. If 30% is in pthread_mutex_lock, you have lock contention.

strace – System call tracing:

strace -p $(pidof my-app) -f -e trace=network -T

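-f follows child processes and threads. -e trace=network limits output to socket-related calls. -T shows the time spent in each syscall. If connect() takes 50ms, suspect the network path or a slow remote server; DNS lookups happen before connect() and show up as their own sendto()/recvfrom() calls. If sendto() takes 10ms, the socket send buffer is probably full, so the network or the peer is the bottleneck.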
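On a busy process the -T output scrolls by too fast to read. Since each line ends with the elapsed time in angle brackets, a short filter can keep only the slow calls. A sketch under that assumption; slow_syscalls is a hypothetical name:

```shell
# Read `strace -T` output on stdin and keep only the calls that took
# at least $1 seconds (the trailing <0.050123> field on each line).
slow_syscalls() {
    awk -v min="$1" '
        match($0, /<[0-9.]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            if (secs + 0 >= min + 0) print
        }'
}
```

strace writes to stderr, so pipe it with 2>&1, for example: strace -p $(pidof my-app) -f -e trace=network -T 2>&1 | slow_syscalls 0.01 to show only calls slower than 10ms.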

Warning: strace adds overhead. For production, use eBPF.

eBPF – Modern, low-overhead kernel tracing:

# Using bcc tools
tcplife          # Track TCP connection lifetimes
biolatency       # Disk I/O latency histogram
runqlat          # CPU scheduler queue latency
funccount        # Count function calls

eBPF gives kernel-level visibility without modifying your application, and with far less overhead than strace.

The USE Method: Systematic Performance Analysis

Brendan Gregg's USE method: for every resource (CPU, memory, disk, network), check Utilization, Saturation, and Errors.

CPU:

Utilization: mpstat -P ALL 1
Saturation: vmstat – check r column (run queue). If > core count, CPUs overloaded.
Errors: dmesg | grep -i error

Memory:

Utilization: free -h – check "available"
Saturation: vmstat – check si/so (swap in/out). Any non-zero means swapping.
Errors: dmesg | grep -i oom

Disk:

Utilization: iostat -xz 1 – check %util
Saturation: iostat -x – check aqu-sz (avgqu-sz in older sysstat versions). High values mean requests are waiting.
Errors: smartctl -a /dev/sda

Network:

Utilization: sar -n DEV 1 – bytes/sec vs link capacity
Saturation: netstat -s | grep -i drop – dropped packets
Errors: ifconfig or ip -s link – error counters

Go through this checklist when "the system is slow." Most of the time, one resource will be saturated and everything else looks fine. That's your bottleneck.
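Two of the saturation checks above come from the same vmstat output, so they are easy to automate. The sketch below assumes the standard vmstat column order (r first, si and so in columns 7 and 8); use_saturation is a made-up name:

```shell
# Scan `vmstat` output on stdin. $1 is the core count (for example from nproc).
# Flags samples where the run queue exceeds the core count or swap is active.
use_saturation() {
    awk -v cores="$1" '
        $1 ~ /^[0-9]+$/ {
            if ($1 + 0 > cores + 0) print "cpu saturated: r=" $1
            if ($7 + $8 > 0)        print "swapping: si=" $7 " so=" $8
        }'
}
```

Run it as vmstat 1 5 | use_saturation "$(nproc)". The first vmstat sample is an average since boot, so judge by the later lines.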

The 7 Kernel Parameters Story

Production API server. Latency: 200ms average, 800ms P99. After profiling, all time was in kernel-level network and memory operations.

The 7 parameters that changed everything:

net.core.somaxconn = 65535 (was 128)
net.ipv4.tcp_tw_reuse = 1 (was 0)
net.core.rmem_max = 16777216 (was 212992)
net.core.wmem_max = 16777216 (was 212992)
vm.swappiness = 1 (was 60)
net.ipv4.tcp_keepalive_time = 600 (was 7200)
I/O scheduler to none (was cfq on an NVMe drive)

Result: average latency dropped to 20ms. P99 dropped to 85ms. No code changes. No infrastructure changes. Six sysctl settings and one I/O scheduler change.
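Settings applied with the sysctl command are lost on reboot. One way to persist the six sysctl values from the list above is a drop-in file under /etc/sysctl.d. This is a sketch rather than the original team's setup: write_tuning_conf and the file name 99-latency.conf are hypothetical.

```shell
# Write the six sysctl values from the list above to a drop-in file.
# $1 is the target path, e.g. /etc/sysctl.d/99-latency.conf (hypothetical name).
write_tuning_conf() {
    cat > "$1" <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
vm.swappiness = 1
net.ipv4.tcp_keepalive_time = 600
EOF
}
```

As root, write_tuning_conf /etc/sysctl.d/99-latency.conf followed by sysctl --system loads the file. The I/O scheduler is not a sysctl, so it has to be persisted separately, for example with a udev rule or a boot-time script.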

The defaults are designed for safety and generality. Production servers are specific, high-performance machines with specific workloads. Tune accordingly.

Key Takeaways

The kernel is not a black box. /proc and /sys expose everything. perf, strace, and eBPF let you look inside without guessing.
When "the system is slow," use the USE method. Check utilization, saturation, and errors for every resource.
Default kernel parameters are fine for development machines. They're wrong for production. Every production server should have a tuned sysctl.conf based on its workload.
Never disable swap without understanding what happens when memory runs out. The OOM killer doesn't negotiate.

Now go audit your production servers. Check your sysctl settings. Run vmstat 1 and look for swapping. Change that I/O scheduler. Your latency will thank you.
