From 82a4dd7df69973c8123c73cf7b140685596a73e0 Mon Sep 17 00:00:00 2001
From: William Lallemand
Date: Wed, 12 Jun 2024 14:46:05 +0200
Subject: [PATCH] DOC: internals: add a documentation about the master worker

Add documentation about the history of the master-worker mode, how it was
implemented in its first version, and how it currently works. This is a
global view of the architecture, not an exhaustive explanation of every
mechanism.
---
 doc/internals/mworker.md | 210 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 doc/internals/mworker.md

diff --git a/doc/internals/mworker.md b/doc/internals/mworker.md
new file mode 100644
index 000000000..141e3c3e6
--- /dev/null
+++ b/doc/internals/mworker.md
@@ -0,0 +1,210 @@

# Master Worker

2024-06-12

## History

### haproxy-systemd-wrapper

Back in 2013, distributions were discussing the adoption of systemd as the
default init system; this was controversial, but Fedora and Arch Linux were
already using it. At that time HAProxy still had a multi-process model, and
the way its daemon mode worked was incompatible with systemd supervision.

Systemd can handle traditional forking services, but HAProxy is different. To
work correctly, systemd needs a main PID: the PID of the process it
supervises.

With `nbproc 1` this could almost work, since systemd is able to guess the
main PID and even to read a PID file. But HAProxy does something uncommon on
reload that systemd does not support: the reload is in fact a new haproxy
process which asks the old one to leave. This means the main PID is supposed
to change, which systemd does not allow, so it only sees the previous process
leaving, considers that the service broke, and kills every remaining process,
including the new haproxy.

With `nbproc > 1` it is worse: systemd is confused by all the processes,
because they are independent, so there is no real main process to supervise.

The systemd-wrapper appeared in HAProxy 1.5. It is a separate binary which
starts haproxy, so systemd can use the wrapper as the main PID, and the
wrapper never changes PID. Upon a reload, triggered by a SIGUSR2 signal, the
wrapper launches a `haproxy -sf`. This was non-intrusive work and a first
step towards deploying in systemd environments. Later contributions added
support for upgrading the wrapper binary upon a reload.

However, the wrapper suffered from several problems:

- It needed an intermediate haproxy process. It is basically the daemon mode,
  but instead of the first process leaving to daemonize, it is kept in the
  foreground to waitpid() on all workers. This means you need the wrapper,
  the -Ds process and the haproxy workers, and each reload starts a new -Ds.
- It was difficult to integrate new features, since the wrapper was not part
  of haproxy itself.
- There were multiple issues with handling failures during a reload.

### mworker V1

HAProxy 1.8 got rid of the wrapper, which was replaced by the master-worker
mode. This first version was basically a reintegration of the wrapper
features within HAProxy. HAProxy is launched with the -W flag, reads the
configuration and then forks. In mworker mode, the master is usually launched
as a root process, and the chroot and setuid operations are done in the
workers.

Like the wrapper, the master handles the SIGUSR2 signal to reload; it is also
able to forward the SIGUSR1 signal to the workers to ask for a soft stop.
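To make that process layout concrete, here is a minimal, self-contained
sketch of a master relaying SIGUSR1 to its workers. This is not HAProxy code:
the worker count, the names and the worker body are illustrative, and a real
master also re-execs itself on SIGUSR2, as described below.

```c
/* Minimal sketch of a master relaying SIGUSR1 to its workers (not HAProxy code). */
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NB_WORKERS 2                  /* illustrative; HAProxy derives it from its config */

static pid_t workers[NB_WORKERS];

/* master-side handler: forward the soft-stop request to every worker */
static void relay_usr1(int sig)
{
    for (int i = 0; i < NB_WORKERS; i++)
        if (workers[i] > 0)
            kill(workers[i], sig);
}

int main(void)
{
    for (int i = 0; i < NB_WORKERS; i++) {
        pid_t pid = fork();
        if (pid < 0)
            exit(1);
        if (pid == 0) {               /* worker: would chroot/setuid then run the poll loop */
            pause();                  /* placeholder for the worker event loop */
            _exit(0);
        }
        workers[i] = pid;
    }

    signal(SIGUSR1, relay_usr1);      /* soft stop: relay it to the workers */
    /* a real master would also catch SIGUSR2 here and re-exec itself with -sf */

    /* V1-style supervision: a wait() loop instead of the polling loop */
    while (wait(NULL) > 0 || errno == EINTR)
        ;
    return 0;
}
```

In this sketch, sending SIGUSR1 to the master terminates the workers (the
signal's default action), the wait() loop then runs out of children and the
master exits, mimicking a soft stop.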
The reload uses the same logic as the standard `-sf` method, but instead of
starting a new process, the master exec()s itself with -sf in the same PID.
This means that haproxy can upgrade its binary during the reload.

Once the SIGUSR2 signal is received, the master blocks signals and
unregisters its signal handlers, so that no signal can interrupt the reload:
receiving a USR2 after the exec(), while the handler is not registered yet,
could kill the master.

When doing the exec() upon a reload, a new argv array is constructed by
copying the current argv and adding `-sf` followed by the PIDs found in the
children list as well as the oldpids list.

When the workers are started, the master first deinits the poller and closes
the FDs that are not needed anymore (inherited FDs need to be kept, however),
then the master runs a wait() loop instead of the haproxy polling loop, which
waits for its workers to leave or for a signal.

When reloading haproxy, a non-working configuration could make the master
exit, which would end up killing all previous workers. This is a complex
situation to handle, since the configuration parsing code was not written to
keep a process alive upon a failure. To handle this problem, an atexit()
callback was used, so haproxy re-execs itself upon a configuration loading
failure, without any configuration and without trying to fork new workers.
This is called the master-worker "wait mode".

The master-worker mode also comes with a feature which automates the seamless
reload (-x): it selects a stats socket from the configuration and adds it to
the -x parameter for the next reload, so the FDs of the binds can be
retrieved automatically.

The master supervises the workers: when a current worker (not a previous one
from before a reload) exits without having been asked to by a reload, the
master emits an "exit-on-failure" error, kills every worker with a SIGTERM
and exits with the same error code as the failed worker. This behavior can be
changed by using the "no exit-on-failure" option in the global section.

While the master supervises the workers using the wait() function, the
workers also supervise the master. To achieve this, there is a pipe between
the master and the workers. The FD of the worker side of the pipe is inserted
in the poller so the worker can watch for a close. When the pipe is closed,
it means the master left, which is not supposed to happen (it may have
crashed), and when it happens all the workers leave. To survive the reloads
of the master, the pipe FDs are saved in environment variables
(HAPROXY_MWORKER_PIPE_{RD,WR}).

The master-worker mode can be activated by using either "-W" on the command
line or "master-worker" in the global section of the configuration, but using
"-W" is preferred.

The pidfile is usable in master-worker mode; instead of writing the PIDs of
all the workers, it only contains the PID of the master.

A systemd mode (-Ws) can also be used. It behaves the same way as -W, but
keeps the master in the foreground and sends status messages to systemd using
the sd_notify API.

### mworker V2

HAProxy 1.9 went a little further with the master-worker: instead of using
the mworker_wait() function from V1, the master uses the haproxy polling
loop, so signals are handled directly by the polling loop, removing the
specific code.
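The sketch below only illustrates the general idea of funneling signals into
a polling loop. It is not how HAProxy does it internally (HAProxy has its own
signal queue and poller abstraction); the classic self-pipe trick is used
here just to show why handling signals from the event loop removes the need
for a dedicated wait() loop.

```c
/* Self-pipe sketch: signals consumed from inside a poll loop (not HAProxy internals). */
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int sig_pipe[2];

static void sig_handler(int sig)
{
    unsigned char s = (unsigned char)sig;
    /* only async-signal-safe work here: queue the signal for the event loop */
    (void)write(sig_pipe[1], &s, 1);
}

int main(void)
{
    if (pipe(sig_pipe) < 0)
        return 1;

    signal(SIGUSR1, sig_handler);
    signal(SIGUSR2, sig_handler);

    struct pollfd pfd = { .fd = sig_pipe[0], .events = POLLIN };

    for (;;) {
        if (poll(&pfd, 1, -1) <= 0)
            continue;                 /* interrupted by a signal: poll again */
        unsigned char s;
        if (read(sig_pipe[0], &s, 1) == 1)
            printf("signal %d handled inside the event loop\n", s);
        /* a master would relay SIGUSR1 or trigger a reload on SIGUSR2 here */
    }
}
```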
Instead of using one pipe per haproxy instance, V2 uses a socketpair per
worker, and the polling loop allows real network communication over these
socketpairs. The master needs to keep one FD per worker so they can be reused
after a reload. The master keeps a linked list of processes, mworker_proc,
containing the socketpair FDs, the PID, the relative PID... This list is then
serialized into the HAPROXY_PROCESSES environment variable, to be
unserialized upon a reload so the FDs can be reinserted in the poller.

Since these FDs are in the poller, there is a special flag on the listeners,
LI_O_WORKER, which specifies that an FD must not be used in the worker; such
FDs are unbound once in the worker.

Meanwhile, thread support was implemented in haproxy. Since mworker shares
more code than before by using the polling loop, the nbthread configuration
variable is not used for the master, which always remains single-threaded.

The HAPROXY_PROCESSES entries allow storing a lot more things: the number of
reloads each worker went through, the PID, etc.

The socketpairs are used for bidirectional communication: each socketpair is
connected to a stats applet on the worker side, so the master has access to a
stats socket for each worker.

The master implements a CLI proxy, an analyzer able to parse CLI input, split
it into individual CLI commands and redirect them to the right worker. This
works like HTTP pipelining, with commands being sent and responded to one
after the other. This proxy can be accessed through the master CLI, which is
only bound using the -S option of the haproxy command. A special prefix using
the @ syntax is used to select the right worker.

The master CLI implements its own command set, for example `show proc`, which
shows the content of the HAPROXY_PROCESSES structure.

A `reload` command was implemented so a reload can be requested from the
master CLI without using the SIGUSR2 signal.

### more features in mworker V2

HAProxy 2.0 implements a new configuration section called `program`. This
section allows the master-worker to handle the start and stop of external
executables; one could launch the dataplane API from haproxy, for example.
The programs are shown by the `show proc` command and are added to the
HAPROXY_PROCESSES structure. The 'start-on-reload' option configures the
behavior of a program during an haproxy reload: either start a new instance
of the program or keep the previous one.

A `mworker-max-reloads` keyword was added in the global section; it limits
the number of reloads a worker can survive, which helps limiting the number
of remaining old worker processes. Once a worker reaches this value, the
master sends it a SIGTERM instead of a SIGUSR1, so any stuck worker is
killed.

The version and start time were added to HAPROXY_PROCESSES so they can be
displayed in `show proc`.

HAProxy 2.1 added user/group settings to the program section, so programs can
change their uid after the fork.

HAProxy 2.5 added a re-exec of haproxy in wait mode after a successful
loading, instead of doing it only after a configuration failure. This is
useful to clear the memory of the master, because loading the configuration
in the master can take a lot of RAM, and there is no simple way to free
everything and decrease the memory footprint of the process.
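Both the HAPROXY_PROCESSES list and the wait-mode re-exec rely on the same
trick: keeping state in the environment across an exec() of the same PID.
Here is a minimal sketch of that principle; the variable name and its format
are purely illustrative, not HAProxy's actual serialization, and
/proc/self/exe makes the demo Linux-only.

```c
/* Sketch: state surviving an in-place exec() through the environment (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    (void)argc;
    const char *state = getenv("DEMO_PROCESSES");   /* hypothetical name, not HAPROXY_PROCESSES */

    if (state) {
        /* second pass, after the exec(): same PID, fresh memory, state recovered */
        printf("pid %d restored state: %s\n", (int)getpid(), state);
        return 0;
    }

    /* first pass: pretend two workers were forked and record dummy pid/fd pairs */
    char buf[64];
    snprintf(buf, sizeof(buf), "pid=1001 fd=5;pid=1002 fd=6");
    setenv("DEMO_PROCESSES", buf, 1);

    printf("pid %d re-executing itself\n", (int)getpid());
    execv("/proc/self/exe", argv);                  /* Linux-only shortcut for this demo */
    perror("execv");
    return 1;
}
```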
In HAProxy 2.6, the seamless reload with the master-worker changed: instead
of using a stats socket declared in the configuration, it uses the internal
socketpair of the previous worker. The change is actually simple: instead of
doing `-x /path/to/previous/socket`, it does `-x sockpair@FD` using the FD
number found in HAPROXY_PROCESSES. With this change the stats socket in the
configuration is less useful, and everything can be done from the master CLI.

With 2.7, the reload mechanism of the master CLI evolved. In previous
versions this mechanism was asynchronous: once the `reload` command was
received, the master would reload, the active master CLI connection was
closed, and there was no way to return a status as a response to the `reload`
command. To achieve a synchronous reload, a dedicated socketpair is used: one
side is connected to a master CLI applet and the other side waits to receive
a socket. When the master CLI receives the `reload` command, it takes the FD
of the active master CLI session, sends it over the socketpair and then does
an exec(). The FD is then held in the kernel during the reload, because the
poller is disabled. Once haproxy has reloaded and the poller is active again,
the FD of the master CLI connection is received, so HAProxy can reply with a
success or failure status for the reload. When built with USE_SHM_OPEN=1, a
shm is used to keep the warnings and errors emitted while loading the
configuration in a shared buffer, so they survive the re-exec in wait mode
and can be dumped after the status as part of the response to the `reload`
command.

In 2.9 the master CLI command `hard-reload` was implemented. It works the
same way as the `reload` command, but instead of doing the exec() with -sf
for a soft stop, it starts with -st to achieve a hard stop of the previous
worker.

Version 3.0 got rid of the libsystemd dependency for sd_notify() after the
xz/OpenSSH events; the function is now implemented directly in haproxy, in
src/systemd.c.
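For reference, the sd_notify protocol itself is small: the service sends
datagrams such as "READY=1" to the unix socket named by the NOTIFY_SOCKET
environment variable. The sketch below shows a minimal reimplementation of
that idea; it is deliberately simplified and is not the code from
src/systemd.c.

```c
/* Minimal sd_notify()-style sketch over NOTIFY_SOCKET (simplified, not src/systemd.c). */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

static int mini_sd_notify(const char *state)
{
    const char *path = getenv("NOTIFY_SOCKET");
    if (!path || !*path)
        return 0;                                   /* not launched by systemd: no-op */

    struct sockaddr_un sun;
    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);
    if (sun.sun_path[0] == '@')                     /* abstract namespace socket */
        sun.sun_path[0] = '\0';

    socklen_t len = offsetof(struct sockaddr_un, sun_path) + strlen(path);

    int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -1;

    ssize_t ret = sendto(fd, state, strlen(state), 0,
                         (const struct sockaddr *)&sun, len);
    close(fd);
    return ret < 0 ? -1 : 1;
}

int main(void)
{
    /* roughly what the -Ws master reports once its workers are ready */
    return mini_sd_notify("READY=1\nSTATUS=Ready.\n") < 0;
}
```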