Ajitabh Pandey's Soul & Syntax

Exploring systems, souls, and stories – one post at a time

Tag: Troubleshooting

  • When a 2-Core Server Hits Load 45+: A Real-World LAMP Debugging Story


    There’s a particular kind of panic that sets in when you SSH into a production server and see this:

    load average: 45.63, 38.37, 28.93

    On a 2-core machine, that’s not just high — it’s catastrophic.

    I help a friend with LAMP servers hosted on DigitalOcean that run WooCommerce. The site brings in good sales for his business. Recently he reached out to say that some of his customers were reporting slow order placement. When I logged into the server, I found an interesting pattern.

    This post walks through a real debugging session using a symptoms → diagnostics → solution approach. Along the way, we’ll uncover multiple overlapping issues (not just one), fix them step by step, and explain why architectural changes like PHP-FPM and Nginx matter.

    Symptoms: What went wrong

    The server started showing:

    • Extremely high load averages (45+ on a 2-core system)
    • Slow or unresponsive web requests
    • CPU constantly maxed out
    • Intermittent recovery followed by spikes

    Initial snapshot:

    # uptime
    load average: 5.95, 25.07, 25.33
    
    # nproc
    2

    Even after partial recovery, the load remained unstable.
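Load averages only mean something relative to core count. A quick normalization of the peak reading (the awk one-liner is just my calculator, not part of the original session) shows how far over capacity the box was:

```shell
# Normalize load average by core count: anything much above 1.0 per
# core means runnable processes are queueing for CPU time.
awk -v load=45.63 -v cores=2 'BEGIN { printf "load per core: %.1f\n", load / cores }'
```

At roughly 23 runnable processes per core, the machine spends most of its time context-switching rather than doing useful work.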

    Diagnostics: What the system revealed

    1. Top CPU consumers

    # ps aux --sort=-%cpu | head -20

    Output (trimmed):

    root          92 35.8  0.0      0     0 ?        S    12:40  82:49 [kswapd0]
    mysql     198808 18.6 10.9 1821488 439632 ?      Ssl  16:29   0:31 /usr/sbin/mysqld
    www-data  197164  5.6  5.1 504092 205036 ?       S    16:16   0:51 /usr/sbin/apache2

    The key observation: kswapd0, the kernel's swap daemon, is consuming roughly 36% CPU. This is not normal. It means the kernel is struggling with memory pressure.

    2. Apache process explosion

    # ps aux | grep apache | wc -l
    14

    RSS is the actual physical RAM a process is using right now, measured in KB. It does NOT include swapped-out memory, so it represents memory currently resident in RAM. It is the single most important metric for sizing concurrency.

    In the output, I saw that the RSS is approximately 200MB – 260MB for each Apache process.

    So for 14 processes it is:

    14 processes × ~220MB ≈ ~3GB RAM

    On a 4GB system, that’s quite high.
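The per-process RSS figures can be totalled straight from ps. Two sample apache2 lines stand in for the live process table here so the pipeline is self-contained; on the server you would feed it `ps aux` directly:

```shell
# Sum the RSS column (field 6 of `ps aux`, in KB) across Apache workers
# and report the total in MB. Sample lines stand in for live output.
printf '%s\n' \
  'www-data 197164 5.6 5.1 504092 205036 ? S 16:16 0:51 /usr/sbin/apache2' \
  'www-data 197165 5.2 5.0 503800 201120 ? S 16:17 0:47 /usr/sbin/apache2' |
awk '/apache2/ { sum += $6 } END { printf "Total RSS: %.1f MB\n", sum / 1024 }'
```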

    3. MySQL check (surprisingly clean)

    When I checked the full process list on the MySQL server

    mysql> SHOW FULL PROCESSLIST;

    I found it clean, with a few sleeping connections and no long-running queries. I verified it with

    # mysqladmin processlist

    and found a similar output. So MySQL wasn’t the bottleneck.

    4. Network state – hidden problem

    netstat revealed a hidden problem that may have been contributing to the sluggishness.

    # netstat -ant | awk '{print $6}' | sort | uniq -c
    .....
    121 SYN_RECV
    .....

    This indicates:

    • Many half-open TCP connections
    • Likely bot traffic or SYN flood behavior
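The state-counting pipeline above is worth keeping in your toolbox. Run on a few canned netstat-style lines (stand-ins, so its behaviour is visible without a live server), it tallies each TCP state:

```shell
# Count TCP connections per state, exactly as in the diagnostic above.
# The sample lines mimic `netstat -ant` output (state in field 6).
printf '%s\n' \
  'tcp 0 0 10.0.0.5:80 203.0.113.9:50211 SYN_RECV' \
  'tcp 0 0 10.0.0.5:80 203.0.113.7:50212 SYN_RECV' \
  'tcp 0 0 10.0.0.5:443 198.51.100.2:41000 ESTABLISHED' |
awk '{ print $6 }' | sort | uniq -c
```

A large SYN_RECV bucket relative to ESTABLISHED is the signature of half-open connection pressure.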

    5. System pressure via vmstat

    In this case, vmstat was the most revealing tool I ran. In its output,

    • r is the number of runnable processes (waiting for CPU). Ideally it should be less than or equal to the number of CPU cores; a value exceeding the core count indicates CPU contention.
    • id is the percentage of CPU time that is idle. Values around 70-100% indicate a relaxed system, low values (say 0-20%) a busy CPU, and 0% a fully saturated one.
    • si and so are pages swapped in and out. A steady 0 means no swapping and is good; occasional values above 0 indicate mild pressure, but values that stay above 0 continuously point to memory problems.

    So when I ran:

    # vmstat 1 5

    Output (trimmed):

    r  b   swpd   free   si   so us sy id
    14 0      0 399400   0    0 34 29 35
    15 0      0 362864   0    0 87 12  0

    r at 14-15 indicates far too many runnable processes, and id at 0 means the CPU is fully saturated.

    After initial fixes, when I ran vmstat again, I saw the new numbers:

    r  b   swpd   free   si   so us sy id
    1  0  12120 2554084   0    0 34 29 35
    0  0  12120 2554084   0    0  0  1 99

    The numbers now look healthy:

    • r = 0–2 → healthy
    • id = 99% → CPU mostly idle
    • si/so = 0 → no swapping

    Three Root Causes

    This wasn’t a single issue. It was a stacked failure:

    1. Apache (mod_php) memory bloat

    • Each request = full Apache process
    • Each process ≈ 200MB+
    • Too many workers → RAM exhaustion

    2. Swap thrashing (kswapd0)

    • Memory filled up
    • Kernel started reclaiming memory
    • CPU burned by swap management

    3. Connection pressure (SYN_RECV flood)

    • 121 half-open connections
    • Apache workers are tied up waiting

    Solutions Applied

    1. SYN flood mitigation (UFW + kernel)

    I enabled:

    net.ipv4.tcp_syncookies=1

    And:

    ufw limit 80/tcp
    ufw limit 443/tcp
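sysctl -w only changes the running kernel, so to survive a reboot the setting should also live in a drop-in file. The filename and the two companion values below are my additions, not from the original session:

```ini
# /etc/sysctl.d/99-tcp-hardening.conf  (hypothetical filename)
net.ipv4.tcp_syncookies = 1
# Often tuned alongside syncookies under SYN pressure:
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_synack_retries = 3
```

Apply with sysctl --system.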

    2. Apache concurrency control

    Reduced workers:

    MaxRequestWorkers 6

    This helped stabilize the CPU, with no process pile-up.

    3. KeepAlive tuning

    KeepAlive On
    MaxKeepAliveRequests 50
    KeepAliveTimeout 2

    4. OPcache verification and tuning

    When PHP runs a script, it parses PHP code, compiles it into bytecode, and executes it. Without OPcache, this happens on every request.

    With OPcache enabled, compiled bytecode is stored in memory so that future requests can reuse it. Without OPcache, high CPU usage and slower response times are expected. With OPcache, 30-35% less CPU is used, and execution is faster.

    When I checked, I found that OPcache (opcache.enable) was already enabled in the php.ini.

    I improved it with more cache:

    opcache.memory_consumption=192
    opcache.interned_strings_buffer=16
    opcache.max_accelerated_files=20000

    Additional Changes I Would Like to Make

    1. Replace mod_php with PHP-FPM

    I want to replace mod_php with PHP-FPM. With mod_php, each Apache process embeds a full PHP interpreter, leading to high memory usage (~200 MB per worker). This results in poor scalability and a lack of separation of concerns.

    PHP-FPM, on the other hand, runs as a separate service and has lightweight workers (~20-40 MB), providing better process control and supporting pooling and scaling. This will result in lower memory usage, better CPU efficiency, and more predictable performance.
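For reference, the Apache side of that handoff is small. A sketch, assuming Debian/Ubuntu paths and PHP 8.1 (the socket path varies with the installed PHP version):

```apache
# Requires mod_proxy_fcgi; route .php requests to the PHP-FPM pool
# listening on the distro's default Unix socket.
<FilesMatch "\.php$">
    SetHandler "proxy:unix:/run/php/php8.1-fpm.sock|fcgi://localhost"
</FilesMatch>
```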

      2. Prefer Nginx Over Apache

      Now, this is not about nginx hype; it’s about an architectural choice. I have been using Apache for quite some time and love it. The pre-fork model of Apache has a process/thread per connection, is memory-heavy, and struggles under concurrency.

      Nginx, with its event-driven model, can handle thousands of connections with a few processes and non-blocking I/O, making it an ideal choice for modern web workloads.

      Finally

      What looked like a “CPU problem” turned out to be:

      • Memory exhaustion
      • Connection pressure
      • Poor process model

      Fixing it required layered thinking, not just tweaking one parameter.

      And the biggest lesson?

      One can tune one’s way out of trouble temporarily, but the real win comes from choosing the right architecture.

      If you’ve ever seen load averages that made no sense, this pattern might look familiar. And now you know exactly how to break it down.

  • When Pi-hole + Unbound Stop Resolving: A DNSSEC Trust Anchor Fix

      I have my own private DNS setup in my home network, powered by Pi-hole running on my very first Raspberry Pi, a humble Model B Rev 2. It’s been quietly handling ad-blocking and DNS resolution for years. But today, something broke.

      I noticed that none of my devices could resolve domain names. Pi-hole’s dashboard looked fine. The DNS service was running, blocking was active, but every query failed. Even direct dig queries returned SERVFAIL. Here’s how I diagnosed and resolved the issue.

      The Setup

      My Pi-hole forwards DNS queries to Unbound, a recursive DNS resolver running locally on port 5335. This is configured in /etc/pihole/setupVars.conf.

      PIHOLE_DNS_1=127.0.0.1#5335
      PIHOLE_DNS_2=127.0.0.1#5335

      And my system’s /etc/resolv.conf points to Pi-hole itself:

      nameserver 127.0.0.1

      Unbound is installed with the dns-root-data package, which provides root hints and DNSSEC trust anchors:

      $ dpkg -l dns-root-data|grep ^ii
      ii dns-root-data 2024041801~deb11u1 all DNS root hints and DNSSEC trust anchor

      The Symptoms

      Despite everything appearing normal, DNS resolution failed:

      $ dig google.com @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

      Even root-level queries failed:

      $ dig . @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

      Unbound was running and listening:

      $ netstat -tulpn | grep 5335

      tcp 0 0 127.0.0.1:5335 0.0.0.0:* LISTEN 29155/unbound

      And outbound connectivity was fine. I pinged one of the root DNS servers directly to ensure this:

      $ ping -c1 198.41.0.4 
      PING 198.41.0.4 (198.41.0.4) 56(84) bytes of data.
      64 bytes from 198.41.0.4: icmp_seq=1 ttl=51 time=206 ms

      --- 198.41.0.4 ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 205.615/205.615/205.615/0.000 ms

      The Diagnosis

      At this point, I suspected a DNSSEC validation failure. Unbound uses a trust anchor, which is simply a cryptographic key stored in root.key. This cryptographic key is used to verify the authenticity of DNS responses. Think of it like a passport authority: when you travel internationally, border agents trust your passport because it was issued by a recognized authority. Similarly, DNSSEC relies on a trusted key at the root of the DNS hierarchy to validate every response down the chain. If that key is missing, expired, or corrupted, Unbound can’t verify the authenticity of DNS data — and like a border agent rejecting an unverified passport, it simply refuses to answer, returning SERVFAIL.
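A handy discriminator before blaming the trust anchor: repeat the failing query with dig's +cd (checking disabled) flag. If the query succeeds with +cd but SERVFAILs without it, validation itself is broken rather than connectivity. The statuses below are hard-coded to illustrate the decision rule; on the Pi they would come from the two dig runs:

```shell
# Decision rule: SERVFAIL normally + NOERROR with +cd => DNSSEC problem.
plain_status="SERVFAIL"   # e.g. dig google.com @127.0.0.1 -p 5335
cd_status="NOERROR"       # e.g. dig google.com @127.0.0.1 -p 5335 +cd
if [ "$plain_status" = "SERVFAIL" ] && [ "$cd_status" = "NOERROR" ]; then
  echo "DNSSEC validation failure: check the trust anchor (root.key)"
fi
```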

      Even though dns-root-data was installed, the trust anchor wasn’t working.

      The Fix

      I regenerated the trust anchor manually:

      $ sudo rm /usr/share/dns/root.key
      $ sudo unbound-anchor -a /usr/share/dns/root.key
      $ sudo systemctl restart unbound

      After this, Unbound started resolving again:

      $ dig google.com @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR
      ;; ANSWER SECTION:
      google.com. 300 IN A 142.250.195.78

      Why This Happens

      Even with dns-root-data, the trust anchor could become stale — especially if the system missed a rollover event or the file was never initialized. Unbound doesn’t log this clearly, so it’s easy to miss.

      Preventing Future Failures

      To avoid this in the future, I added a weekly cron job to refresh the trust anchor:

      0 3 * * 0 /usr/sbin/unbound-anchor -a /usr/share/dns/root.key

      And a watchdog script to monitor Unbound health and restart it if root queries stop returning NOERROR:

      #!/bin/sh
      # Restart Unbound when a root query no longer returns NOERROR
      dig . @127.0.0.1 -p 5335 | grep -q 'status: NOERROR' || systemctl restart unbound

      This was a good reminder that even quiet systems need occasional maintenance. Pi-hole and Unbound are powerful together, but DNSSEC adds complexity. If you’re running a similar setup, keep an eye on your trust anchors, and don’t trust the dashboard alone.