Linux Unable to Fork Processes

We’ve been busy at work upgrading a couple hundred hosts around the world to Ubuntu 24.04 LTS. While doing so we have had a subset of our hosts freak out and become almost entirely uncommunicative.

Here are some examples of how bad things were (and I'm hoping the SEO gods pick this up and lead you to victory while you're trying to get out of your misery):

maloche ~ % ssh admin@region-2
Last login: Mon Oct 13 09:50:04 2025 from 
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
/usr/bin/lesspipe: 1: Cannot fork
admin@region-2:~$ df -h
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              197M   19M  179M  10% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   15G  7.8G  6.2G  56% /
tmpfs                              982M     0  982M   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/sda2                          2.0G  229M  1.6G  13% /boot
tmpfs                              197M  4.0K  197M   1% /run/user/1000
admin@region-2:~$ htop
admin@region-2:~$ cat /proc/sys/kernel/pid_max
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
4194304
admin@region-2:~$ sudo apt update
FATAL -> Failed to fork.

The FATAL -> Failed to fork. error message finally led me down a path that made sense. This was clearly a resource issue, but both RAM and CPU usage were fine and the disk was not out of space. That leaves file descriptors and process IDs. As it turns out, for some strange reason, if WireGuard was operational on a host when we started the upgrade process it would set off a process bomb that persisted across reboots.
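
If you end up in a similar spot, a few quick checks can confirm whether it really is the task/PID side of the house rather than memory. This is only a rough sketch of the kind of checks that narrowed it down for us; every one of these commands needs a fork of its own, so expect a few retries on a starved host:

# total number of tasks currently alive vs. the kernel's ceilings
ps -e -o pid= | wc -l
cat /proc/sys/kernel/pid_max
cat /proc/sys/kernel/threads-max

# per-user limit on processes/threads for the current shell
ulimit -u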

Before the fix

maloche ~ % ssh admin@region-2
Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-157-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Mon Oct 13 10:43:49 PM UTC 2025

  System load:  0.42               Processes:              30296
  Usage of /:   53.2% of 14.66GB   Users logged in:        0
  Memory usage: 35%                IPv4 address for ens18: 
  Swap usage:   0%

 * Ubuntu 20.04 LTS Focal Fossa has reached its end of standard support on 31 Ma
 
   For more details see:
   https://ubuntu.com/20-04

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

and after

maloche ~ % ssh admin@region-2
Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-157-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Mon Oct 13 10:52:54 PM UTC 2025

  System load:  0.15               Processes:              118
  Usage of /:   53.1% of 14.66GB   Users logged in:        0
  Memory usage: 8%                 IPv4 address for ens18: 
  Swap usage:   0%

 * Ubuntu 20.04 LTS Focal Fossa has reached its end of standard support on 31 Ma
 
   For more details see:
   https://ubuntu.com/20-04

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

6 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm

As you can imagine, 30296 active processes vs. 118 makes a difference!
I initially stumbled over this by checking ps -ef and seeing a wall of duplicate processes, all with their own PID:

root	31120	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31121	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31122	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31123	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31124	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31125	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31126	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31127	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31128	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31129	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31130	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31131	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31132	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31133	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31134	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31135	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31136	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31137	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31138	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31139	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31140	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31141	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31142	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31143	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31144	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31145	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31146	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31147	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31148	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31149	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31150	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31151	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31152	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31153	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31154	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31155	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31156	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31157	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31158	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31160	2 0 15:53 ?		00:00:00 [napi/wg-0]
root	31161	2 0 15:53 ?		00:00:00 [napi/wg-0]
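
A quick way to spot a pile-up like this without scrolling through thousands of lines is to group the ps output by command name and count the duplicates. A minimal sketch, nothing WireGuard-specific about it:

# show the most duplicated command names first
ps -e -o comm= | sort | uniq -c | sort -rn | head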

If you see a wall of the same thing in there that you did not anticipate, start by shutting that down on its own. In our case that meant stopping WireGuard altogether on the host, which took a lot of attempts to complete because the host only had just enough free cycles to let us in, so be patient. Once it went through the host recovered instantly and we could continue our work. After the upgrade finished we re-enabled the WireGuard service without any further issues.
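
For reference, stopping it on our hosts looked roughly like this. The unit name is an assumption on my part: it presumes the interface is managed by wg-quick and is called wg-0 as in the ps output above, so adjust it for your setup:

# stop the WireGuard interface (may take several retries until a fork succeeds)
sudo systemctl stop wg-quick@wg-0

# if the interface lingers, deleting it tears down its kernel threads as well
sudo ip link delete wg-0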

I have yet to find an explanation for what happened here. This might be a bug in WireGuard itself or in the Linux kernel, but I couldn't explain it or figure out where to start digging, so I figured I'd document what we observed and how we got ourselves out of the situation.

15.10.2025