fix: prevent 502 Bad Gateway caused by PHP-FPM worker pool exhaustion and cold-start latency

- Add request_terminate_timeout = PHP_MAX_TIME in start.sh: without this (default 0 = disabled), workers blocked on a slow DB query, stalled Redis connection, or hung syscall are never reaped. Over time they fill pm.max_children and Apache returns 502 Bad Gateway to the reverse proxy.

- Set pm.process_idle_timeout = 300s in Dockerfile: the upstream default of 10 s kills all idle workers after a brief quiet period. The next request burst must then wait for fresh PHP-FPM forks; on a loaded host that spawn latency can push Apache past its FastCGI deadline and produce a 502. 300 s keeps a warm pool through normal desktop-sync polling cycles.

- Add a dedicated 502 troubleshooting subsection to reverse-proxy.md documenting the six most common causes (proxy timeout, worker exhaustion, stuck workers, Redis session lock contention, container cold start, Caddy cert renewal) with actionable diagnostics.

Agent-Logs-Url: https://github.com/nextcloud/all-in-one/sessions/2fd7a6d1-bfdb-4f26-a8d0-cd54a7307999
Co-authored-by: szaimen <42591237+szaimen@users.noreply.github.com>
2026-05-21 10:50:10 +00:00 · 2026-04-27 15:31:14 +00:00
parent 119f68b6ee
commit 46eb2dfc7d
3 changed files with 38 additions and 0 deletions
--- a/Containers/nextcloud/Dockerfile
+++ b/Containers/nextcloud/Dockerfile
@@ -250,6 +250,14 @@ RUN set -ex; \
 # We don't actually expect so many children but don't want to limit it artificially because people will report issues otherwise.
 # Also children will usually be terminated again after the process is done due to the ondemand setting
    sed -i 's/^pm.max_children =.*/pm.max_children = 5000/' /usr/local/etc/php-fpm.d/www.conf; \
+# With pm = ondemand, workers are killed after pm.process_idle_timeout seconds
+# of inactivity.  The upstream default is 10 s, which is aggressive: after a
+# brief quiet period (e.g. the gap between desktop-sync client polls), all
+# workers are reaped and the next request burst must wait for fresh forks.  On
+# a loaded host that spawn latency can push Apache past its FastCGI timeout and
+# produce a 502.  300 s (5 min) keeps a warm pool through normal sync-client
+# polling cycles while still reclaiming memory during genuinely idle periods.
+    sed -i 's/^;*pm.process_idle_timeout.*/pm.process_idle_timeout = 300s/' /usr/local/etc/php-fpm.d/www.conf; \
    sed -i 's|access.log = /proc/self/fd/2|access.log = /proc/self/fd/1|' /usr/local/etc/php-fpm.d/docker.conf; \
    \
    echo "[ -n \"\$TERM\" ] && [ -f /root.motd ] && cat /root.motd" >> /root/.bashrc; \
--- a/Containers/nextcloud/start.sh
+++ b/Containers/nextcloud/start.sh
@@ -156,6 +156,15 @@ while [ "$THIS_IS_AIO" = "true" ] && [ -z "$(dig nextcloud-aio-apache A +short +
    sleep 5
 done

+# Set request_terminate_timeout so that PHP-FPM forcibly kills workers that
+# exceed the wall-clock limit.  Without this (default 0 = disabled) a worker
+# stuck on a slow DB query, a stalled Redis connection, or a hung syscall is
+# never reaped.  Over time these stuck workers fill up pm.max_children, leaving
+# no free slots for legitimate requests and causing Apache to return 502 Bad
+# Gateway to the reverse proxy.  With the limit set to PHP_MAX_TIME, a request
+# holds a worker no longer than a PHP script may run, keeping the pool healthy.
+sed -i "s|^;*request_terminate_timeout = .*|request_terminate_timeout = ${PHP_MAX_TIME}|" /usr/local/etc/php-fpm.d/www.conf
+
 set -x
 # shellcheck disable=SC2235
 if [ "$THIS_IS_AIO" = "true" ] && [ "$APACHE_PORT" = 443 ]; then
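
The two `sed` substitutions in this commit can be checked outside the container with a rough sketch like the one below. It is not part of the commit. The temporary file is a hypothetical stand-in for `/usr/local/etc/php-fpm.d/www.conf`, its commented-out lines are assumed to resemble the stock defaults the commit message describes, and `PHP_MAX_TIME=3600` is an assumed value chosen only for illustration. It also assumes a `sed` that accepts `-i` without a suffix argument (GNU or BusyBox).

```shell
#!/bin/sh
# Hypothetical stand-in for www.conf; the commented-out lines are assumed
# to resemble the upstream defaults (10 s idle timeout, terminate timeout off).
conf="$(mktemp)"
cat > "$conf" <<'EOF'
pm = ondemand
pm.max_children = 5000
;pm.process_idle_timeout = 10s;
;request_terminate_timeout = 0
EOF

# Assumed value for illustration only; the container reads it from its environment.
PHP_MAX_TIME=3600

# The same substitutions as in the Dockerfile and start.sh hunks.
sed -i 's/^;*pm.process_idle_timeout.*/pm.process_idle_timeout = 300s/' "$conf"
sed -i "s|^;*request_terminate_timeout = .*|request_terminate_timeout = ${PHP_MAX_TIME}|" "$conf"

# Collect the two directives as they now appear (uncommented) in the file.
result="$(grep -E '^(pm\.process_idle_timeout|request_terminate_timeout)' "$conf")"
printf '%s\n' "$result"
rm -f "$conf"
```

If both directives are printed without a leading `;`, the patterns matched the commented-out defaults. A line that still starts with `;` would suggest the stock file is formatted differently from what this sketch assumes.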