A production API server had 200ms average latency and 800ms P99. After tuning seven kernel parameters—with no code or infrastructure changes—latency dropped to 20ms average and 85ms P99. The defaults are for laptops. Production needs different defaults.
The CPU Scheduler Isn't Always Fair
Linux has long used the Completely Fair Scheduler (CFS), which kernel 6.6 replaced with EEVDF. Both are built to be fair, not fast. Under contention, nice values matter: -20 (highest priority) to 19 (lowest). Your app runs at nice 0 by default. But nice values only matter when processes compete for cores. If you have 16 cores and 32 runnable processes, the scheduler decides who runs.
CPU pinning reduces the cache misses caused by migration. When a process bounces between cores, it loses its warm L1/L2 cache lines and has to reload data. Pin latency-sensitive workloads:
taskset -c 0-3 ./my-application
Or via cgroups (cgroup v1 path shown; on cgroup v2 the file is /sys/fs/cgroup/my-app/cpuset.cpus):
echo "0-3" > /sys/fs/cgroup/cpuset/my-app/cpuset.cpus
Financial trading systems and game servers live by CPU pinning. For web services, it's rarely worth it unless you've measured cache misses as your bottleneck.
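To confirm that a pin actually took effect, you can read the allowed CPU list back from /proc. This is a minimal sketch assuming a Linux /proc layout; the function name allowed_cpus is illustrative, not part of any tool.

```shell
# Print the CPU list a process is allowed to run on.
# Usage: allowed_cpus <pid> [proc_root]   (proc_root defaults to /proc)
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Example (assumes a single matching PID):
#   allowed_cpus "$(pidof my-application)"
```

For a process started with taskset -c 0-3, the kernel should report that same range here. taskset -cp <pid> shows the same information.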
The NUMA trap: On multi-socket servers, Non-Uniform Memory Access means each CPU socket has local memory (fast) and remote memory (2-3x slower). If your app runs on socket 0 but allocates memory on socket 1, every access pays a penalty. Inspect the topology, then launch the app with local allocation:
numactl --hardware
numactl --localalloc ./my-application
Most cloud VMs abstract NUMA away, but bare metal servers? Check your topology.
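If numactl isn't installed, sysfs gives a quick answer to "does this machine have more than one NUMA node?" The sketch below assumes the standard /sys/devices/system/node layout; numa_node_count is a made-up helper name.

```shell
# Count NUMA nodes exposed by the kernel in sysfs.
# Usage: numa_node_count [node_dir]   (defaults to /sys/devices/system/node)
numa_node_count() {
    count=0
    for node in "${1:-/sys/devices/system/node}"/node[0-9]*; do
        [ -d "$node" ] && count=$((count + 1))
    done
    echo "$count"
}
```

A result of 1 means memory locality isn't a concern on this machine. Anything higher is a reason to look at numactl --hardware in detail.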
Memory: Page Cache Is Your Best Friend
Linux uses all free memory as page cache. When you see "10 GB used, 2 GB free" on a 16 GB server, it doesn't mean you're low on memory. It means 4 GB is page cache, released the moment a process needs it.
free -h lies if you don't read carefully. Look at the "available" column, not "free." Available = free + reclaimable cache.
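For monitoring, the same "available" figure can be read from /proc/meminfo as MemAvailable. Here is a small sketch that turns it into a percentage; the helper name mem_available_pct is an assumption for illustration.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Usage: mem_available_pct [meminfo_file]   (defaults to /proc/meminfo)
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}
```

Alerting on this number rather than on "free" avoids false alarms caused by a healthy page cache.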
Swap: When physical memory is exhausted, Linux moves pages to swap (disk). Disk access is 1,000x slower than RAM. A system actively swapping is dying slowly.
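vm.swappiness controls swap aggressiveness. Default is 60. For database servers: set it to 1, not 0. Setting 0 does not disable swap, but it tells the kernel to avoid swapping application memory until free memory is almost gone, so the OOM killer can strike with little warning. For Redis: set it to 1 and monitor closely.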
sysctl vm.swappiness=1
OOM Killer: When memory is exhausted and swap is full, the kernel picks a process to kill. Protect critical processes:
echo -1000 > /proc/$(pidof my-critical-app)/oom_score_adj
This tells the OOM killer never to select this process. But if it's the only process eating memory, protecting it doesn't fix anything: the kernel kills everything else first, and can panic if nothing killable is left.
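Before adjusting scores, it helps to see which processes the kernel currently ranks as the most likely victims. This is a rough sketch that reads oom_score for every PID; the function name oom_candidates and the choice of five results are illustrative.

```shell
# List the five processes with the highest oom_score as "score pid name".
# Usage: oom_candidates [proc_root]   (defaults to /proc)
oom_candidates() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # A process may exit while we iterate, so tolerate failed reads.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n 5
}
```

If the critical process sits at the top of this list, that is the one to protect or to give more headroom.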
The team that disabled swap: They read a blog post saying swap hurts performance. They set vm.swappiness=0 and disabled swap. For months, plenty of RAM. Then a memory leak in a sidecar consumed memory over 3 weeks. Without swap as a buffer, the OOM killer fired at 2 AM, killing the primary database process. No graceful shutdown. Transaction log corruption. 4-hour recovery.
Swap isn't the enemy. Uncontrolled swap is. A small swap partition (2-4 GB) acts as a buffer: memory pressure shows up as swap activity you can alert on before the OOM killer starts killing processes.
I/O: The Scheduler You Didn't Know Existed
Disk I/O has its own scheduler. It determines the order of read/write requests.
- deadline: Assigns a deadline (500ms reads, 5s writes). No request starves. Good for databases. This is the legacy single-queue scheduler; kernels 5.0 and later ship only mq-deadline.
- mq-deadline: Multi-queue version of deadline. A common default and a safe choice for most drives.
- none (noop on legacy kernels): No reordering. Passes requests directly. Use for NVMe SSDs where the device has its own scheduler.
Check and change:
cat /sys/block/sda/queue/scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler
For SSDs and NVMe: use none or mq-deadline. For spinning disks: use mq-deadline (deadline on older kernels) or bfq.
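The scheduler file lists every available scheduler and marks the active one with square brackets. To audit all devices at once, something like the following sketch works; both function names are made up for this example.

```shell
# Extract the active scheduler, which the kernel marks with square brackets,
# e.g. "[mq-deadline] kyber bfq none" -> "mq-deadline".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<device> <active scheduler>" for every block device.
# Usage: list_schedulers [block_dir]   (defaults to /sys/block)
list_schedulers() {
    for f in "${1:-/sys/block}"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f%/queue/scheduler}
        printf '%s %s\n' "${dev##*/}" "$(active_scheduler < "$f")"
    done
}
```

Note that a value written with echo does not survive a reboot, so the change needs to be reapplied at boot by whatever mechanism your distribution provides.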
Network Stack: The Parameters That Change Everything
Default Linux network settings are conservative. Designed for a general-purpose machine, not a server handling tens of thousands of connections.
net.core.somaxconn: Maximum queued connections for acceptance. Default: 4096 (was 128 on older kernels). If your app can't accept fast enough, new connections drop.
sysctl net.core.somaxconn=65535
Also increase the application's own listen backlog to match.
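Whether the accept queue is actually overflowing can be checked before changing anything. The kernel keeps a cumulative ListenOverflows counter in /proc/net/netstat. A sketch of reading it follows (listen_overflows is a hypothetical helper name):

```shell
# Print the cumulative ListenOverflows counter (accept queue was full).
# /proc/net/netstat stores each group as a header line of names followed by
# a line of values in the same order.
# Usage: listen_overflows [netstat_file]   (defaults to /proc/net/netstat)
listen_overflows() {
    awk '
        $1 == "TcpExt:" && !col {
            for (i = 2; i <= NF; i++) if ($i == "ListenOverflows") col = i
            next
        }
        $1 == "TcpExt:" && col { print $col; exit }
    ' "${1:-/proc/net/netstat}"
}
```

If the number keeps growing between two samples taken under load, the backlog is too small or the application is accepting too slowly.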
net.ipv4.tcp_tw_reuse: Reuse sockets in TIME_WAIT for new outgoing connections. On servers making many short-lived connections, TIME_WAIT sockets can exhaust ephemeral ports.
sysctl net.ipv4.tcp_tw_reuse=1
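To judge how close a server is to port exhaustion, compare the number of TIME_WAIT sockets with the size of the ephemeral port range. A minimal sketch of the second half (the helper name ephemeral_port_count is illustrative):

```shell
# Number of ports in the ephemeral range. The kernel exposes the range as
# "low high" in /proc/sys/net/ipv4/ip_local_port_range.
# Usage: ephemeral_port_count [range_file]
ephemeral_port_count() {
    awk '{ print $2 - $1 + 1 }' "${1:-/proc/sys/net/ipv4/ip_local_port_range}"
}

# Count sockets currently in TIME_WAIT for comparison (needs iproute2):
#   ss -H -tan state time-wait | wc -l
```

The port limit applies per destination address and port, so a service that opens all its connections to a single backend will hit it soonest.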
net.core.rmem_max / net.core.wmem_max: Maximum receive/send buffer sizes. Defaults are often too low for high-throughput apps.
sysctl net.core.rmem_max=16777216
sysctl net.core.wmem_max=16777216
sysctl net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl net.ipv4.tcp_wmem="4096 65536 16777216"
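A common way to sanity-check a buffer ceiling is the bandwidth-delay product: the number of bytes that can be in flight on a link. The text above doesn't say how 16777216 was chosen, so treat this sketch as one possible rationale rather than the original method; bdp_bytes is a made-up helper.

```shell
# Bandwidth-delay product in bytes.
# Usage: bdp_bytes <bandwidth_mbit_per_s> <rtt_ms>
bdp_bytes() {
    # (Mbit/s * 1,000,000 / 8) bytes/s * (rtt_ms / 1000) s = mbit * rtt_ms * 125
    echo $(( $1 * $2 * 125 ))
}

# Example: a 1 Gbit/s link with 100 ms round-trip time
#   bdp_bytes 1000 100
```

That example works out to 12,500,000 bytes, which fits under the 16 MiB maximum set above. A faster or longer path would need a larger ceiling to keep the pipe full.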
net.ipv4.tcp_keepalive_time: Idle time before sending keepalive probes. Default: 7200 seconds (2 hours). If a client disconnects without closing, the server won't notice for 2 hours.
sysctl net.ipv4.tcp_keepalive_time=600
sysctl net.ipv4.tcp_keepalive_intvl=60
sysctl net.ipv4.tcp_keepalive_probes=5
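The three settings combine into a worst-case detection time: the idle period, plus one interval for each unanswered probe. A small sketch of the arithmetic (the function name is illustrative):

```shell
# Seconds until a silent peer is declared dead:
# idle time before the first probe + (probe interval * number of probes).
# Usage: keepalive_detect_seconds <time> <intvl> <probes>
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Example with the values above:
#   keepalive_detect_seconds 600 60 5
```

With the values above, a dead connection is detected after 900 seconds (15 minutes). These sysctls only affect sockets where the application has enabled SO_KEEPALIVE.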
Profiling: perf, strace, eBPF
When metrics don't tell enough, go deeper.
perf – CPU profiling at function level:
perf record -g -p $(pidof my-app) -- sleep 30
perf report
Flame graphs (Brendan Gregg's scripts) make perf output readable. If 40% of CPU time is in malloc, you have a memory allocation problem. If 30% is in pthread_mutex_lock, contention.
strace – System call tracing:
strace -p $(pidof my-app) -f -e trace=network -T
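-f follows child processes and threads. -e trace=network filters to network calls. -T shows time per syscall. If connect() takes 50ms, the TCP handshake to the remote host is slow (DNS lookups show up as separate sendto()/recvfrom() calls to the resolver). If write() takes 10ms (visible once you drop the network filter), disk or network is the bottleneck.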
Warning: strace adds overhead. For production, use eBPF.
eBPF – Modern, low-overhead kernel tracing:
# Using bcc tools
tcplife # Track TCP connection lifetimes
biolatency # Disk I/O latency histogram
runqlat # CPU scheduler queue latency
funccount # Count function calls
eBPF gives kernel-level visibility without modifying your application, and with far lower overhead than strace.
The USE Method: Systematic Performance Analysis
Brendan Gregg's USE method: for every resource (CPU, memory, disk, network), check Utilization, Saturation, and Errors.
CPU:
- Utilization: mpstat -P ALL 1
- Saturation: vmstat – check r column (run queue). If > core count, CPUs overloaded.
- Errors: dmesg | grep -i error
Memory:
- Utilization: free -h – check "available"
- Saturation: vmstat – check si/so (swap in/out). Any non-zero means swapping.
- Errors: dmesg | grep -i oom
Disk:
- Utilization: iostat -xz 1 – check %util
- Saturation: iostat – check avgqu-sz. High values mean requests waiting.
- Errors: smartctl -a /dev/sda
Network:
- Utilization: sar -n DEV 1 – bytes/sec vs link capacity
- Saturation: netstat -s | grep -i drop – dropped packets
- Errors: ifconfig or ip -s link – error counters
Go through this checklist when "the system is slow." Most of the time, one resource will be saturated and everything else looks fine. That's your bottleneck.
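The CPU and memory saturation checks are easy to automate from a single vmstat sample. Below is a sketch, assuming the default vmstat column order (r b swpd free buff cache si so ...); the function name use_check_vmstat is hypothetical.

```shell
# Classify one vmstat data line against the saturation checks in the list.
# Usage: use_check_vmstat "<vmstat line>" <core_count>
# Assumes the default vmstat column order: r b swpd free buff cache si so ...
use_check_vmstat() {
    echo "$1" | awk -v cores="$2" '{
        print "cpu: "    (($1 > cores)       ? "saturated" : "ok")
        print "memory: " (($7 > 0 || $8 > 0) ? "swapping"  : "ok")
    }'
}

# Example: skip the since-boot averages by taking the second sample.
#   use_check_vmstat "$(vmstat 1 2 | tail -n 1)" "$(nproc)"
```

A single sample is noisy, so treat one "saturated" result as a prompt to watch vmstat 1 for a while, not as a verdict.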
The 7 Kernel Parameters Story
Production API server. Latency: 200ms average, 800ms P99. After profiling, all time was in kernel-level network and memory operations.
The 7 parameters that changed everything:
- net.core.somaxconn = 65535 (was 128)
- net.ipv4.tcp_tw_reuse = 1 (was 0)
- net.core.rmem_max = 16777216 (was 212992)
- net.core.wmem_max = 16777216 (was 212992)
- vm.swappiness = 1 (was 60)
- net.ipv4.tcp_keepalive_time = 600 (was 7200)
- I/O scheduler to none (was cfq on an NVMe drive)
Result: average latency dropped to 20ms. P99 dropped to 85ms. No code changes. No infrastructure changes. Six sysctl commands and one I/O scheduler change.
The defaults are designed for safety and generality. Production servers are specific, high-performance machines with specific workloads. Tune accordingly.
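Values set with the sysctl command are lost on reboot. One way to persist them is a drop-in file under /etc/sysctl.d. The sketch below writes the six sysctl values from the list above (the I/O scheduler is set through sysfs, not sysctl, so it is not included); the function name and the file name 99-tuning.conf are assumptions, and the values are the ones from this story, not universal recommendations.

```shell
# Write the six sysctl values from the story to a drop-in file.
# Usage: write_tuning_conf <path>
write_tuning_conf() {
    cat > "$1" <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
vm.swappiness = 1
net.ipv4.tcp_keepalive_time = 600
EOF
}

# Example (as root; the file name is hypothetical):
#   write_tuning_conf /etc/sysctl.d/99-tuning.conf && sysctl --system
```

sysctl --system reloads every configuration file, so the new values apply immediately and again on each boot.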
Key Takeaways
- The kernel is not a black box. /proc and /sys expose everything. perf, strace, and eBPF let you look inside without guessing.
- When "the system is slow," use the USE method. Check utilization, saturation, and errors for every resource.
- Default kernel parameters are fine for development machines. They're wrong for production. Every production server should have a tuned sysctl.conf based on its workload.
- Never disable swap without understanding what happens when memory runs out. The OOM killer doesn't negotiate.
Now go audit your production servers. Check your sysctl settings. Run vmstat 1 and look for swapping. Change that I/O scheduler. Your latency will thank you.


