fix: prevent 502 Bad Gateway via PHP-FPM worker pool exhaustion and cold-start latency

- Add request_terminate_timeout = PHP_MAX_TIME in start.sh: without this
  (default 0 = disabled) workers blocked on a slow DB query, stalled Redis
  connection, or hung syscall are never reaped.  Over time they fill
  pm.max_children and Apache returns 502 Bad Gateway to the reverse proxy.

- Set pm.process_idle_timeout = 300s in Dockerfile: the upstream default of
  10 s kills all idle workers after a brief quiet period.  The next request
  burst must then wait for fresh PHP-FPM forks; on a loaded host that
  spawn latency can push Apache past its FastCGI deadline and produce a 502.
  300 s keeps a warm pool through normal desktop-sync polling cycles.

- Add a dedicated 502 troubleshooting subsection to reverse-proxy.md
  documenting the six most common causes (proxy timeout, worker exhaustion,
  stuck workers, Redis session lock contention, container cold start, Caddy
  cert renewal) with actionable diagnostics.

Agent-Logs-Url: https://github.com/nextcloud/all-in-one/sessions/2fd7a6d1-bfdb-4f26-a8d0-cd54a7307999

Co-authored-by: szaimen <42591237+szaimen@users.noreply.github.com>
copilot-swe-agent[bot]
2026-04-27 15:31:14 +00:00
committed by GitHub
parent 119f68b6ee
commit 46eb2dfc7d
3 changed files with 38 additions and 0 deletions

@@ -250,6 +250,14 @@ RUN set -ex; \
# We don't actually expect so many children but don't want to limit it artificially because people will report issues otherwise.
# Also children will usually be terminated again after the process is done due to the ondemand setting
sed -i 's/^pm.max_children =.*/pm.max_children = 5000/' /usr/local/etc/php-fpm.d/www.conf; \
# With pm = ondemand, workers are killed after pm.process_idle_timeout seconds
# of inactivity. The upstream default is 10 s, which is aggressive: after a
# brief quiet period (e.g. the gap between desktop-sync client polls), all
# workers are reaped and the next request burst must wait for fresh forks. On
# a loaded host that spawn latency can push Apache past its FastCGI timeout and
# produce a 502. 300 s (5 min) keeps a warm pool through normal sync-client
# polling cycles while still reclaiming memory during genuinely idle periods.
sed -i 's/^;*pm.process_idle_timeout.*/pm.process_idle_timeout = 300s/' /usr/local/etc/php-fpm.d/www.conf; \
sed -i 's|access.log = /proc/self/fd/2|access.log = /proc/self/fd/1|' /usr/local/etc/php-fpm.d/docker.conf; \
\
echo "[ -n \"\$TERM\" ] && [ -f /root.motd ] && cat /root.motd" >> /root/.bashrc; \
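
To see what the `pm.process_idle_timeout` substitution does, the sketch below runs the same sed expression against a small sample file. The sample lines are hypothetical stand-ins for the pool config, not copied from the real image, and `-i` is dropped so the result goes to stdout instead of editing in place.

```shell
#!/bin/sh
# Hypothetical sample of the relevant pool settings (assumption: the
# directive ships commented out, with a leading ';').
conf="$(mktemp)"
cat > "$conf" <<'EOF'
pm = ondemand
pm.max_children = 5000
;pm.process_idle_timeout = 10s;
EOF

# Same expression as in the Dockerfile hunk, without -i.
# '^;*' matches the directive whether or not it is commented out.
idle_line="$(sed 's/^;*pm.process_idle_timeout.*/pm.process_idle_timeout = 300s/' "$conf" \
  | grep '^pm.process_idle_timeout')"
rm -f "$conf"

echo "$idle_line"
```

Because the pattern also matches an already uncommented line, rerunning the substitution leaves the 300s value in place.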

@@ -156,6 +156,15 @@ while [ "$THIS_IS_AIO" = "true" ] && [ -z "$(dig nextcloud-aio-apache A +short +
sleep 5
done
# Set request_terminate_timeout so that PHP-FPM forcibly kills workers that
# exceed the wall-clock limit. Without this (default = 0 = disabled) a worker
# stuck on a slow DB query, a stalled Redis connection, or a hung syscall is
# never reaped. Over time these stuck workers fill up pm.max_children, leaving
# no free slots for legitimate requests and causing Apache to return 502 Bad
# Gateway to the reverse proxy. Setting it equal to PHP_MAX_TIME means a worker lives at
# most as long as a PHP script is allowed to run, which keeps the pool healthy.
sed -i "s|^;*request_terminate_timeout = .*|request_terminate_timeout = ${PHP_MAX_TIME}|" /usr/local/etc/php-fpm.d/www.conf
set -x
# shellcheck disable=SC2235
if [ "$THIS_IS_AIO" = "true" ] && [ "$APACHE_PORT" = 443 ]; then
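
The `request_terminate_timeout` substitution can be checked the same way. This is a minimal sketch: `PHP_MAX_TIME=3600` is an assumed example value (the real one comes from the container environment), and the two sample lines are hypothetical. The second sample line shows that the pattern, which requires ` = ` directly after the directive name, leaves the similarly named `request_terminate_timeout_track_finished` untouched.

```shell
#!/bin/sh
# Assumed example value; in the container this comes from the environment.
PHP_MAX_TIME=3600

# Hypothetical sample of the relevant www.conf lines.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
;request_terminate_timeout = 0
;request_terminate_timeout_track_finished = no
EOF

# Same expression as in start.sh, without -i so the result goes to stdout.
result="$(sed "s|^;*request_terminate_timeout = .*|request_terminate_timeout = ${PHP_MAX_TIME}|" "$conf")"
rm -f "$conf"

echo "$result"
```

A bare number is interpreted by PHP-FPM as seconds, so the assumed value above would give each request up to one hour of wall-clock time before its worker is killed.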