A NEW_TOKEN frame is never emitted by a client, hence its parsing was
never tested on the frontend side.
On the backend side, an issue can occur, as the expected token length is
static, based on the token length used internally by haproxy. This is
not sufficient for most server implementations, which use larger tokens.
This causes a parsing error, which may cause the following frames in the
same packet to be skipped. This issue was detected using ngtcp2 as server.
As tokens are currently unused by haproxy, simply discard the test on
token length during NEW_TOKEN frame parsing. The token itself is merely
skipped without being stored. This is sufficient for now to continue
experimenting with the QUIC backend implementation.
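Below is a minimal, self-contained sketch of the relaxed parsing, assuming
a standard RFC 9000 varint decoder; the function names and buffer handling
are illustrative, not the actual haproxy code:

  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative: decode a QUIC variable-length integer (RFC 9000).
   * Returns the number of bytes consumed, or 0 if the buffer is too short. */
  static size_t dec_varint(const uint8_t *buf, size_t len, uint64_t *val)
  {
      size_t sz, i;

      if (!len)
          return 0;
      sz = (size_t)1 << (buf[0] >> 6);   /* 1, 2, 4 or 8 bytes */
      if (len < sz)
          return 0;
      *val = buf[0] & 0x3f;
      for (i = 1; i < sz; i++)
          *val = (*val << 8) | buf[i];
      return sz;
  }

  /* Parse a NEW_TOKEN frame body: read the token length, then skip the
   * token without checking it against any haproxy-internal token size and
   * without storing it. Returns bytes consumed, 0 on parsing error. */
  static size_t parse_new_token(const uint8_t *buf, size_t len)
  {
      uint64_t tok_len;
      size_t ret = dec_varint(buf, len, &tok_len);

      if (!ret || tok_len > len - ret)
          return 0;                  /* truncated frame */
      return ret + (size_t)tok_len;  /* token merely skipped */
  }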
This does not need to be backported.
Report an error during server configuration if QUIC is used but SSL is
not activated via the 'ssl' keyword. This is done in _srv_parse_finalize(),
which is used both for static and dynamic servers.
Note that contrary to listeners, an error is reported instead of a
warning, and SSL is not automatically activated if missing. This is
mainly due to the complexity of server configuration: _srv_parse_finalize()
is ideal to cover every server, including dynamic entries. However, it
is executed after the server SSL context allocation performed via the
<prepare_srv> XPRT operation. A proper fix would be to move the SSL ctx
allocation into _srv_parse_finalize(), but this may have unknown impacts.
Thus, for now a simpler solution has been chosen.
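A minimal self-contained sketch of the kind of check added here; the
structure and field names below are simplified stand-ins, not the real
haproxy server structure:

  #include <stdio.h>

  /* Toy types: the idea is simply "QUIC address + no 'ssl' keyword =>
   * fatal configuration error", with no automatic SSL activation. */
  struct srv_sketch {
      const char *id;
      int addr_is_quic;   /* server address resolved to a QUIC address */
      int use_ssl;        /* 'ssl' keyword present on the server line */
  };

  static int check_quic_needs_ssl(const struct srv_sketch *srv)
  {
      if (srv->addr_is_quic && !srv->use_ssl) {
          fprintf(stderr, "server '%s': QUIC requires the 'ssl' keyword\n",
                  srv->id);
          return 1;   /* fatal: server creation is aborted */
      }
      return 0;
  }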
QUIC traces in ssl_quic_srv_new_ssl_ctx() are problematic as this
function is called early during startup. If traces are activated via the
-dt command-line argument, a crash occurs because the stderr sink is not
yet available.
Thus, traces from ssl_quic_srv_new_ssl_ctx() are simply removed.
No backport needed.
This commit added an "err" C label reachable only with USE_QUIC_OPENSSL_COMPAT:
MINOR: quic-be: Missing callbacks initializations (USE_QUIC_OPENSSL_COMPAT)
leading Coverity to warn about this:
*** CID 1611481: Control flow issues (UNREACHABLE)
/src/quic_ssl.c: 802 in ssl_quic_srv_new_ssl_ctx()
796 goto err;
797 #endif
798
799 leave:
800 TRACE_LEAVE(QUIC_EV_CONN_NEW);
801 return ctx;
>>> CID 1611481: Control flow issues (UNREACHABLE)
>>> This code cannot be reached: "err:
SSL_CTX_free(ctx);".
802 err:
803 SSL_CTX_free(ctx);
804 ctx = NULL;
805 TRACE_DEVEL("leaving on error", QUIC_EV_CONN_NEW);
806 goto leave;
807 }
The least intrusive way to fix this (without #ifdef) is to add a "goto err"
statement from the code part which is reachable without USE_QUIC_OPENSSL_COMPAT.
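A minimal sketch of the resulting pattern (not the real
ssl_quic_srv_new_ssl_ctx(), just the control flow): the "err" label is now
reachable from a path compiled without USE_QUIC_OPENSSL_COMPAT, so Coverity
no longer sees it as dead code.

  #include <stdlib.h>

  static void *new_ctx_sketch(void)
  {
      void *ctx = malloc(64);

      if (!ctx)
          goto err;          /* reachable without any #ifdef */
  #ifdef USE_QUIC_OPENSSL_COMPAT
      /* compat-only initializations would go here; on failure: goto err; */
  #endif
   leave:
      return ctx;
   err:
      free(ctx);
      ctx = NULL;
      goto leave;
  }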
Thank you to @chipitsine for having reported this issue in GH #3003.
This issue may occur when qc_new_conn() fails after having allocated
and attached <conn_cid> to its tree. This is the case when compiling
haproxy against WolfSSL, for a reason unknown at this time. In this
case <conn_cid> is released to its pool (pool_head_quic_connection_id),
then freed again by quic_conn_release().
This bug arrived with this commit:
MINOR: quic-be: QUIC connection allocation adaptation (qc_new_conn())
So, the aim of this patch is to free <conn_cid> only for QUIC backends
and only if it is not attached to its tree. This is the case when the
<conn_id> local variable, passed as NULL to qc_new_conn(), is then
initialized to the same <conn_cid> value.
This patch should have come with the last commit carrying the latest
qc_new_conn() modifications for QUIC backends:
MINOR: quic-be: get rid of ->li quic_conn member
qc_new_conn() must be passed NULL pointers for several variables, as mentioned
by the comment. Some of these local variables are used to avoid too many
code modifications.
Implement transcoding of an HTX request into HTTP/0.9. This protocol is a
simplified version of HTTP. Requests only support the GET method, without
any header. As such, only a request line is written during the snd_buf
operation.
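For illustration, a self-contained sketch of what such a request line looks
like; this is not the actual MUX snd_buf code, just the HTTP/0.9 wire
format (GET, path, CRLF, nothing else):

  #include <stdio.h>

  /* Build an HTTP/0.9 request: a bare request line, no version, no header. */
  static int h09_request_line(char *out, size_t out_sz, const char *path)
  {
      int ret = snprintf(out, out_sz, "GET %s\r\n", path);

      if (ret < 0 || (size_t)ret >= out_sz)
          return -1;    /* output buffer too small */
      return ret;       /* number of bytes written */
  }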
Implement transcoding of an HTTP/0.9 response into an HTX message.
HTTP/0.9 is a really simple subset of the HTTP spec. The response does
not have any status line and contains only the payload body. The response
is finished when the underlying connection/stream is closed.
A status line is generated to be compliant with HTX. This is performed
on the first invocation of rcv_buf for the current stream. The status code
is set to 200. The payload body, if present, is then copied using
htx_add_data().
This commit is the second and final step to initiate QUIC MUX on the
backend side. On handshake completion, MUX is woken up just after its
creation. This step is necessary to notify the stream layer, via the QCS
instance pre-initialized on MUX init, so that the transfer can be
resumed.
This mode of operation is similar to TCP stack when TLS+ALPN are used,
which forces MUX initialization to be delayed after handshake
completion.
Adjust qmux_init() to handle frontend and backend sides differently.
Most notably, on backend side, the first bidirectional stream is created
preemptively. This step is necessary as MUX layer will be woken up just
after handshake completion.
Stream data layer is notified that data is expected when FIN is
received, which marks the end of the HTTP request. This prepares data
layer to be able to handle the expected HTTP response.
Thus, this step is only relevant on frontend side. On backend side, FIN
marks the end of the HTTP response. No further content is expected, thus
expect data should not be set in this case.
Note that the se_expect_data() invocation via qcs_attach_sc() is not
protected. This is because this function will only be called during
request headers parsing, which is performed on the frontend side.
Mux connection is flagged with new QC_CF_IS_BACK if used on the backend
side. For now the only change is during traces, to be able to
differentiate frontend and backend usage.
Complete the documentation for rcv_buf/snd_buf operations. In particular,
the return value is now explicitly defined. For the H3 layer, the
documentation of the associated functions is also extended.
Use conn_ctrl_init() on the connection when quic_connect_server()
succeeds. This is necessary so that the connection is considered as
completely initialized. Without this, the connect operation would be
called again if the connection is reused.
On the backend side, the multiplexer layer is initialized during
connect_server(). However, this step is not performed if ALPN is used,
as the negotiated protocol may be unknown. Multiplexer initialization is
then delayed until after TLS handshake completion.
There are still exceptions though that force the MUX to be initialized
even if ALPN is used. One of them was if the <mux_proto> server field was
already set at this stage, which is the case when an explicit proto is
selected on the server line configuration. Remove this condition so that
now MUX init is delayed with ALPN even if proto is forced.
The scope of this change should be minimal. In fact, the only impact
concerns server config with both proto and ALPN set, which is pretty
unlikely as it is contradictory.
The main objective of this patch is to prepare QUIC support on the
backend side. Indeed, QUIC proto will be forced on the server if a QUIC
address is used, similarly to bind configuration. However, we still want
to delay MUX initialization after QUIC handshake completion. This is
mandatory to know the selected application protocol, required during
QUIC MUX init.
Change the wake callback behavior for the QUIC MUX. This operation loops
over each QCS and notifies their stream data layer on certain events via the
internal helper qcc_wake_some_streams().
Previously, streams were notified only if an error occurred on the
connection. Change this to notify the streams' data layer every time the
wake callback is used. This behavior is now identical to the H2 MUX.
qcc_wake_some_streams() is also renamed to qcc_wake_streams(), as it
better reflects its true behavior.
This change should not have performance impact as wake mux ops should
not be called frequently. Note that qcc_wake_streams() can also be
called directly via qcc_io_process() to ensure a new error is correctly
propagated. As wake callback first uses qcc_io_process(), it will only
call qcc_wake_streams() if no error is present.
No known issue is associated with this commit. However, it could prevent
freezing transfers under certain conditions. As such, it is considered as
a bug fix worthy of backporting.
This should be backported after a period of observation.
It was still failing on Ubuntu-24.04 with GCC+ASAN. So, instead of
understanding the code path the compiler followed to report uninitialized
variables, let's just initialize them now.
No backport needed.
commit 16eb0fab3 ("MAJOR: counters: dispatch counters over thread groups")
introduced a build regression on some compilers:
src/listener.c: In function 'listener_accept':
src/listener.c:1095:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (int it = 0; it < global.nbtgroups; it++)
^
src/listener.c:1095:3: note: use option -std=c99 or -std=gnu99 to compile your code
src/listener.c:1101:4: error: 'for' loop initial declarations are only allowed in C99 mode
for (int it = 0; it < global.nbtgroups; it++) {
^
make: *** [src/listener.o] Error 1
make: *** Waiting for unfinished jobs....
Let's fix that.
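The fix is simply to move the iterator declaration out of the for loop so
the file still builds with compilers defaulting to C89/C90; a sketch of the
corrected form of the loops quoted above (loop bodies unchanged):

  int it;

  for (it = 0; it < global.nbtgroups; it++) {
      /* ... per thread-group accounting, unchanged ... */
  }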
No backport needed
In hlua_applet_tcp_recv_try() and hlua_applet_tcp_getline_yield(), GCC 14.2
reports warnings about the 'blk2' variable possibly being used uninitialized.
It is a bit strange because the code is pretty similar to what it was
before. But to make the compiler happy and to avoid bugs if the API changes
in the future, 'blk2' is now used only when its length is greater than 0.
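A sketch of the resulting pattern, assuming the usual b_getblk_nc() contract
(it returns the number of blocks filled); the process() consumer is
hypothetical:

  len1 = len2 = 0;
  ret = b_getblk_nc(buf, &blk1, &len1, &blk2, &len2, 0, b_data(buf));
  if (ret >= 1 && len1)
      process(blk1, len1);
  if (ret >= 2 && len2)     /* touch blk2 only when it really holds data */
      process(blk2, len2);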
No need to backport.
In hlua_applet_tcp_getline_yield(), the function may yield if there is no
data available. However we must take care to add a return statement just
after the call to hlua_yieldk(). I don't know the details of the Lua API,
but at least this return statement fixes a build error about uninitialized
variables that may be used.
It is a 3.3-specific issue. No backport needed.
On the backend side, the MUX is instantiated after QUIC handshake
completion. This step is performed via qc_ssl_provide_quic_data(). First,
the connection flags for handshake completion are reset. Then, the MUX is
instantiated via the conn_create_mux() function.
Force QUIC as <mux_proto> for the server if a QUIC address is used. This is
similar to what is already done for bind instances on the frontend
side. This step ensures that conn_create_mux() will select the proper
protocol.
Replace the quic_conn ->li member (a pointer to struct listener) by
->target, which is an object type enum, and adapt the code.
Use __objt_(listener|server)() where the object type is known. Typically
this is where the code is specific to one connection type (frontend/backend).
Remove <server> parameter passed to qc_new_conn(). It is redundant with the
<target> parameter.
GSO is not supported at this time for the QUIC backend. qc_prep_pkts() is
modified to prevent it from building more than an MTU. As a consequence,
this prevents qc_send_ppkts() from using GSO.
ssl_clienthello.c code is run only by listeners. This is why __objt_listener()
is used in place of ->li.
Disable the code around SSL_get_peer_quic_transport_params() as this was done
for USE_QUIC_OPENSSL_COMPAT because SSL_get_peer_quic_transport_params() is not
defined by OpenSSL 3.5 QUIC API.
quic_tls_compat_keylog_callback() is the callback used by the QUIC OpenSSL
compatibility module to derive the TLS secrets from other secrets provided
by keylog. The <write> local variable of this function is initialized to
denote the direction (write to send, read to receive) the secret is supposed
to be used for. That said, as the QUIC cryptographic algorithms are
symmetrical, the direction is inverted between the peers: a secret which is
used to write/send/cipher data from one peer's point of view is also the
secret which is used to read/receive/decipher data on the other side. This
was confirmed by the fact that, without this patch, the TLS stack first
provides the peer with a Handshake secret to send/cipher data, and the
client could not use such a secret to decipher the Handshake packets
received from the server. This patch simply reverses the direction stored in
the <write> variable to make the secrets derivation work for the QUIC client.
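A purely conceptual sketch of that inversion for the client case (the
function name is illustrative, this is not the actual callback code):

  /* A secret the TLS stack labels as a "write" (TX) secret for the peer
   * is, for a QUIC client, the secret used locally to read/decipher. */
  static int client_local_write_dir(int write_dir_from_label)
  {
      return !write_dir_from_label;   /* the inversion described above */
  }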
The quic_tls_compat_init() function is called from the OpenSSL QUIC
compatibility module (USE_QUIC_OPENSSL_COMPAT) to initialize the keylog
callback and the callback which stores the QUIC transport parameters as a
TLS extension into the stack. These callbacks must also be initialized for
QUIC backends.
This is done from the TLS secrets derivation callback at the Application
level (the last encryption level), calling SSL_get_peer_quic_transport_params()
to get access to the TLS transport parameters extension embedded into the
Server Hello TLS message. Then, quic_transport_params_store() is called to
store a decoded version of these transport parameters.
For connections to QUIC servers, this patch modifies the moment where the
I/O handler callback is switched to quic_conn_app_io_cb(). This is no longer
done, as for listeners, just after the handshake has completed, but only
after it has been confirmed.
Discard the Initial packet number space as soon as possible. This is done
during handshakes in quic_conn_io_cb() as soon as a Handshake packet could
be successfully sent.
The initialization of the <ssl_app_data_index> SSL user data index is
required to make all the SSL sessions to QUIC servers work as is done for
TCP servers. The conn object is notably retrieved from the SSL callbacks
which are server specific (e.g. ssl_sess_new_srv_cb()).
Store the peer connection ID (SCID) as the connection DCID as soon as an Initial
packet is received.
Stop comparing the packet type to QUIC_PACKET_TYPE_0RTT when it already
matched QUIC_PACKET_TYPE_INITIAL.
A QUIC server must not send too short datagrams with ack-eliciting packets
inside. This cannot be done from quic_rx_pkt_parse() because one does not
know yet whether there is an ack-eliciting frame in the Initial packets. If
the packet must be dropped, this happens after it has been parsed!
Modify quic_dgram_parse() to stop passing it a listener as third parameter.
Instead, the object type address of the connection socket owner is passed,
to support haproxy servers with QUIC as transport protocol.
qc_owner_obj_type() is implemented to return this address.
qc_counters() is also implemented to return the QUIC specific counters of
the proxy owning the connection.
quic_rx_pkt_parse(), called by quic_dgram_parse(), is also modified to use
the object type address used by the latter as last parameter. It is
also modified to send Retry packets only from listeners. A QUIC client
(connection to haproxy QUIC servers) must drop Initial packets with a
non-null token length. It is also not supposed to receive 0-RTT packets,
which are dropped.
The QUIC datagram redispatch is there to counter the race condition which
exists only for QUIC connections to listeners, where datagrams may arrive
on the wrong socket between the bind() and connect() calls.
Run this code part only for listeners.
Allocate a connection to connect to QUIC servers from qc_conn_init() which is the
->init() QUIC xprt callback.
Also initialize the ->prepare_srv and ->destroy_srv callbacks as is done
for TCP servers.
For haproxy QUIC servers (or QUIC clients), the peer is considered as validated.
This is a property which is more specific to QUIC servers (haproxy QUIC listeners).
No <odcid> is used for the QUIC client connection. It is used only on the QUIC server side.
The <token_odcid> is also not used on the QUIC client side. It must be embedded into
the transport parameters only on the QUIC server side.
The quic_conn is created before the socket allocation. So, the local address is
zeroed.
Initialize the transport parameters with qc_srv_params_init().
Stop hardcoding the value of the <server> parameter passed to qc_new_isecs()
to correctly initialize the Initial secrets.
Modify quic_connect_server(), which is the ->connect() callback for the QUIC
protocol:
- add a BUG_ON() run when entering this function: the <fd> socket must equal -1
- conn->handle is a union. conn->handle.qc is used for QUIC connections,
conn->handle.fd must not be used to store the fd.
- code alignment fix for the setsockopt(fd, SOL_SOCKET, (SO_SNDBUF|SO_RCVBUF))
statements
- remove the section of code which was duplicated from the ->connect() TCP callback
- fd_insert() the new socket file descriptor created to connect to the QUIC
server, with quic_conn_sock_fd_iocb() as callback for read events.
This patch only adds a new <proto_type> proto_type enum parameter and a
<sock_type> socket type parameter to sock_create_server_socket() and adapts
its callers. This is to prepare the use of this function by QUIC
servers/backends.
Modify qc_alloc_ssl_sock_ctx() to pass the connection object as parameter. It
is NULL for a QUIC listener, and not NULL for a QUIC server. This connection
object is set as the value of the ->conn quic_conn struct member. Initialize
the SSL session object from this function for QUIC servers.
qc_ssl_set_quic_transport_params() is also modified to pass the SSL object as
parameter. This is the only parameter this function needs. The <qc> parameter
is used only for the trace.
SSL_do_handshake() must be called as soon as the SSL object is initialized
for the QUIC backend connection. This triggers the TLS CRYPTO data delivery.
tasklet_wakeup() is also called to send these CRYPTO data asap.
Modify the QUIC_EV_CONN_NEW event trace to dump the potential errors returned by
SSL_do_handshake().
Implement ssl_sock_new_ssl_ctx() to allocate an SSL server context, as is
currently done for TCP servers, and also for QUIC servers depending on the
<is_quic> boolean value passed as new parameter. For QUIC servers, this
function calls ssl_quic_srv_new_ssl_ctx(), which is specific to QUIC.
From connect_server(), the QUIC protocol could not be retrieved by
protocol_lookup() because of the PROTO_TYPE_STREAM default passed as
argument. Instead, to support QUIC, srv->addr_type.proto_type may be safely
passed.
The QUIC servers xprts have already been set at server line parsing time.
This patch prevents the QUIC servers' xprts from being reset to the
<ssl_sock> value, which is the value used for SSL/TCP connections.
Add a new ->quic_params member to the server struct.
Also set the ->xprt member of the server being initialized and initialize its
transport parameters asap from _srv_parse_init().
This XPRT callback is called from check_config_validity() after the configuration
has been parsed to initialize all the SSL server contexts.
This patch implements the same thing for the QUIC servers.
Add a little check to verify that the version chosen by the server matches
the client one. Initialize the local transport parameters
->negotiated_version value with this version if this is the case. If not,
return 0.
According to the RFC, a QUIC client must encode the QUIC versions it
supports into the "Available Versions" field of the "Version Information"
transport parameter, ordered by descending preference.
This is done by defining <quic_version_2> and <quic_version_draft_29>, new
variable pointers to the corresponding elements of the <quic_versions> array.
A client announces its available versions as follows: v1, v2, draft29.
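For reference, a self-contained sketch of that preference order with the
version numbers from the specifications (the array name is illustrative,
not the haproxy one):

  #include <stdint.h>

  /* "Available Versions", descending preference: v1 first, then v2,
   * then draft-29. */
  static const uint32_t available_versions_sketch[] = {
      0x00000001,   /* QUIC v1 (RFC 9000) */
      0x6b3343cf,   /* QUIC v2 (RFC 9369) */
      0xff00001d,   /* draft-29 */
  };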
Activate QUIC protocol support for MUX-QUIC on the backend side, in
addition to the current frontend support. This change is mandatory to be
able to implement QUIC on the backend side.
Without this modification, it is impossible to explicitly activate the QUIC
protocol on a server line, and an error is reported:
config : proxy 'xxxx' : MUX protocol 'quic' is not usable for server 'yyyy'
Mark QUIC address support for servers as experimental on the backend
side. Previously, it was allowed but wouldn't function as expected. As
QUIC backend support requires several changes, it is better to declare
it as experimental first.
QUIC is not implemented on the backend side. To prevent any issue, it is
better to reject any server configured which uses it. This is done via
_srv_parse_init() which is used both for static and dynamic servers.
This should be backported up to all stable versions.
Released version 3.3-dev1 with the following main changes :
- BUILD: tools: properly define ha_dump_backtrace() to avoid a build warning
- DOC: config: Fix a typo in 2.7 (Name format for maps and ACLs)
- REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+ (5)
- REGTESTS: Remove REQUIRE_VERSION=2.3 from all tests
- REGTESTS: Remove REQUIRE_VERSION=2.4 from all tests
- REGTESTS: Remove tests with REQUIRE_VERSION_BELOW=2.4
- REGTESTS: Remove support for REQUIRE_VERSION and REQUIRE_VERSION_BELOW
- MINOR: server: group postinit server tasks under _srv_postparse()
- MINOR: stats: add stat_col flags
- MINOR: stats: add ME_NEW_COMMON() helper
- MINOR: proxy: collect per-capability stat in proxy_cond_disable()
- MINOR: proxy: add a true list containing all proxies
- MINOR: log: only run postcheck_log_backend() checks on backend
- MEDIUM: proxy: use global proxy list for REGISTER_POST_PROXY_CHECK() hook
- MEDIUM: server: automatically add server to proxy list in new_server()
- MEDIUM: server: add and use srv_init() function
- BUG/MAJOR: leastconn: Protect tree_elt with the lbprm lock
- BUG/MEDIUM: check: Requeue healthchecks on I/O events to handle check timeout
- CLEANUP: applet: Update comment for applet_put* functions
- DEBUG: check: Add the healthcheck's expiration date in the trace messags
- BUG/MINOR: mux-spop: Fix null-pointer deref on SPOP stream allocation failure
- CLEANUP: sink: remove useless cleanup in sink_new_from_logger()
- MAJOR: counters: add shared counters base infrastructure
- MINOR: counters: add shared counters helpers to get and drop shared pointers
- MINOR: counters: add common struct and flags to {fe,be}_counters_shared
- MEDIUM: counters: manage shared counters using dedicated helpers
- CLEANUP: counters: merge some common counters between {fe,be}_counters_shared
- MINOR: counters: add local-only internal rates to compute some maxes
- MAJOR: counters: dispatch counters over thread groups
- BUG/MEDIUM: cli: Properly parse empty lines and avoid crashed
- BUG/MINOR: config: emit warning for empty args only in discovery mode
- BUG/MINOR: config: fix arg number reported on empty arg warning
- BUG/MINOR: quic: Missing SSL session object freeing
- MINOR: applet: Add API functions to manipulate input and output buffers
- MINOR: applet: Add API functions to get data from the input buffer
- CLEANUP: applet: Simplify a bit comments for applet_put* functions
- MEDIUM: hlua: Update TCP applet functions to use the new applet API
- BUG/MEDIUM: fd: Use the provided tgid in fd_insert() to get tgroup_info
- BUG/MINIR: h1: Fix doc of 'accept-unsafe-...-request' about URI parsing
The description of tests performed on the URI in H1 when
'accept-unsafe-violations-in-http-request' option is wrong. It states that
only characters below 32 and 127 are blocked when this option is set,
suggesting that otherwise, when it is not set, all invalid characters in the
URI, according to the RFC3986, are blocked.
But in fact, this is not true. By default, all characters below 32 and above
127 are blocked. And when the 'accept-unsafe-violations-in-http-request'
option is set, characters above 127 (excluded) are accepted. But characters
in (33..126) are never checked, independently of this option.
This patch should fix the issue #2906. It should be backported as far as
3.0. For older versions, the documentation could also be clarified because
this part is not really clear.
Note the request URI validation is still under discussion because invalid
characters in (33..126) are never checked and some users request a stricter
parsing.
In fd_insert(), use the provided tgid to get the thread group info,
instead of using the one of the current thread, as we may call
fd_insert() from a thread of another thread group; that will happen at
least when binding the listeners. Otherwise we'd end up accessing the
thread mask containing enabled threads of the wrong thread group, which
can lead to crashes if we're binding on threads not present in the
thread group.
This should fix Github issue #2991.
This should be backported up to 2.8.
The functions responsible for extracting data from the applet input buffer
or pushing data into the applet output buffer now rely on the newly added
functions in the applet API. This simplifies the code a bit.
There were already functions to push data from the applet to the stream by
inserting it into the right buffer, depending on whether the applet was
using the legacy API or not. Here, functions to retrieve data pushed to the
applet by the stream were added:
* applet_getchar : Gets one character
* applet_getblk : Copies a full block of data
* applet_getword : Copies one text block representing a word using a
custom separator as delimiter
* applet_getline : Copies one text line
* applet_getblk_nc : Get one or two blocks of data
* applet_getword_nc: Gets one or two blocks of text representing a word
using a custom separator as delimiter
* applet_getline_nc: Gets one or two blocks of text representing a line
In this patch, some functions were added to ease input and output buffer
manipulation, regardless of whether the corresponding applet is using its
own buffers or relying on channel buffers. The following functions were
added:
* applet_get_inbuf : Get the buffer containing data pushed to the applet
by the stream
* applet_get_outbuf : Get the buffer containing data pushed by the applet
to the stream
* applet_input_data : Return the amount of data in the input buffer
* applet_skip_input : Skips <len> bytes from the input buffer
* applet_reset_input: Skips all bytes from the input buffer
* applet_output_room: Returns the amount of space available in the output
buffer
* applet_need_room : Indicates that the applet has more data to deliver
and it needs more room in the output buffer to do
so
qc_alloc_ssl_sock_ctx() allocates an SSL_CTX object for each connection. It also
allocates an SSL object. When this function failed, it freed only the SSL_CTX object.
The correct way to free both of them is to call qc_free_ssl_sock_ctx().
Must be backported as far as 2.6.
If an empty argument is used in configuration, for example due to an
undefined environment variable, the rest of the line is not parsed. As
such, a warning is emitted to report this.
The warning was not totally correct as it reported the wrong argument
index. This patch fixes this. Note that there is still an issue with
the "^" indicator, but this is not as easy to fix yet.
This is related to github issue #2995.
This should be backported up to 3.2.
Hide the warning about an empty argument outside of discovery mode. This is
necessary, else the message will be displayed twice, which hampers
haproxy output readability.
This should fix github issue #2995.
This should be backported up to 3.2.
Empty lines were not properly parsed and could lead to crashes because the
last argument was parsed outside of the cmdline buffer. Indeed, the last
argument is parsed to look for an eventual payload pattern. It is started
one character after the newline at the end of the command line. But this is
only valid for a non-empty command line.
So, now, this case is properly detected and we leave if an empty line is
detected.
This patch must be backported to 3.2.
Most fe and be counters are good candidates for being shared between
processes. They are now grouped inside a "shared" struct sub-member under
be_counters and fe_counters.
Now that they are properly identified, they would greatly benefit from being
shared over thread groups to reduce the cost of atomic operations when
updating them. For this, we take the current tgid into account so each
thread group only updates its own counters. For this to work, it is
mandatory that the "shared" member from {fe,be}_counters is initialized
AFTER global.nbtgroups is known, because each shared counter causes the stat
to be allocated global.nbtgroups times. When updating a counter without
concurrency, the first counter from the array may be updated.
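A self-contained sketch of the dispatch idea (toy types, not the real
counters API): each thread group updates its own slot so the atomic
increment only contends within the group, and reading the total requires
summing all slots.

  #include <stdatomic.h>

  #define MAX_TGROUPS_SKETCH 16

  struct shared_counter_sketch {
      _Atomic unsigned long long slot[MAX_TGROUPS_SKETCH]; /* one per tgroup */
  };

  /* tgid is 1-based in haproxy, hence the -1 */
  static void counter_inc(struct shared_counter_sketch *c, int tgid)
  {
      atomic_fetch_add_explicit(&c->slot[tgid - 1], 1, memory_order_relaxed);
  }

  /* Aggregation is needed when consulting the counter */
  static unsigned long long counter_read(struct shared_counter_sketch *c,
                                         int nbtgroups)
  {
      unsigned long long total = 0;
      int it;

      for (it = 0; it < nbtgroups; it++)
          total += atomic_load_explicit(&c->slot[it], memory_order_relaxed);
      return total;
  }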
To consult the shared counters (which requires aggregation of per-tgid
individual counters), some helper functions were added to counters.h to
ease code maintenance and avoid computing errors.
cps_max (max new connections received per second), sps_max (max new
sessions per second) and http.rps_max (maximum new http requests per
second) all rely on shared counters (namely conn_per_sec, sess_per_sec and
http.req_per_sec). The problem is that shared counters are about to be
distributed over thread groups, and we cannot afford to compute the
total (for all thread groups) each time we update the max counters.
Instead, since such max counters (relying on shared counters) are a very
few exceptions, let's add internal (sess,conn,req) per sec freq counters
that are dedicated to cps_max, sps_max and http.rps_max computing.
Thanks to that, related *_max counters shouldn't be negatively impacted
by the thread-group distribution, yet they will not benefit from it
either. Related internal freq counters are prefixed with "_" to emphasize
the fact that they should not be used for other purpose (the shared ones,
which are about to be distributed over thread groups in upcoming commits
are still available and must be used instead). The internal ones could
eventually be removed at any time if we find another way to compute the
{cps,sps,http.rps}_max counters.
Now that we have a common struct between the fe and be shared counters
structs, let's perform some cleanup to merge duplicate members into the
common struct part. This will ease code maintenance.
proxies, listeners and server shared counters are now managed via helpers
added in one of the previous commits.
When guid is not set (ie: not yet assigned), the shared counters pointer is
allocated using calloc() (local memory) and a flag is set on the shared
counters struct to know how to manipulate it (and free it). Else, if guid is
set, then it means that the counters may be shared so while for now we
don't actually use a shared memory location the API is ready for that.
The way it works, for proxies and servers (for which guid is not known
during creation), we first call counters_{fe,be}_shared_get with guid not
set, which results in local pointer being retrieved (as if we just
manually called calloc() to retrieve a pointer). Later (during postparsing)
if guid is set we try to upgrade the pointer from local to shared.
Lastly, since the memory location for some objects (proxies and servers
counters) may change from creation to postparsing, let's update
counters->last_change member directly under counters_{fe,be}_shared_get()
so we don't miss it.
No change of behavior is expected, this is only preparation work.
fe_counters_shared and be_counters_shared may share some common members
since they are quite similar, so we add a common struct part shared
between the two. struct counters_shared is added for convenience as
a generic pointer to manipulate common members from fe or be shared
counters pointer.
Also, the first common member is added: shared fe and be counters now
have a flags member.
Create the include/haproxy/counters.h and src/counters.c files to anticipate
further helpers, as some counters-specific tasks need to be carried out, and
since counters are shared between multiple object types (ie: listener,
proxy, server..) we need generic helpers.
Add some shared counters helpers which are not yet used but will be updated
in upcoming commits.
Shareable counters are not tagged as shared counters and are dynamically
allocated in a separate memory area as a prerequisite for being stored
in a shared memory area. For now, GUID and thread groups are not taken into
account; this is only a first step.
Also, we ensure all counters are now manipulated using atomic operations;
namely, the "last_change" counter is now read from and written to using
atomic ops.
Despite the numerous changes caused by the counters being moved away from
counters struct, no change of behavior should be expected.
As reported by Ilya in GH #2994, some cleanup parts in
sink_new_from_logger() function are not used.
We can actually simplify the cleanup logic to remove dead code, let's
do that by renaming "error_final" label to "error" and only making use
of the "error" label, because sink_free() already takes care of proper
cleanup for all sink members.
When we try to allocate a new SPOP stream, if an error is encountered,
spop_strm_destroy() is called to release the possibly allocated
stream. But it must only be called if a stream was allocated. If the
reported error is an SPOP stream allocation failure, we must just leave, to
avoid a null-pointer dereference.
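A minimal self-contained sketch of that error-path rule (toy types, not the
real mux-spop code): the destroy helper only runs when a stream was actually
allocated, and the allocation-failure path simply leaves.

  #include <stdlib.h>

  struct strm_sketch { int id; };

  static void strm_sketch_destroy(struct strm_sketch *s)
  {
      free(s);
  }

  static struct strm_sketch *strm_sketch_create(int fail_later)
  {
      struct strm_sketch *s = calloc(1, sizeof(*s));

      if (!s)
          return NULL;              /* allocation failed: nothing to destroy */
      if (fail_later) {
          strm_sketch_destroy(s);   /* safe: s is non-NULL here */
          return NULL;
      }
      return s;
  }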
This patch should fix point 1 of the issue #2993. It must be backported as
far as 3.1.
These functions were copied from the channel API and modified to work with
applets using the new API or the legacy one. However, the comments were not
updated accordingly. That is the purpose of this patch.
When a healthcheck is processed, once the first wakeup has passed to start
the check, and as long as the expiration timer is not reached, only I/O
events are able to wake it up. It is an issue when there is a check timeout
defined, especially if the connect timeout is high and the check timeout is
low. In that case, the healthcheck's task is never requeued to handle any
timeout update. When the connection is established, the check timeout is set
to replace the connect timeout. It is thus possible to report a success
while a timeout should be reported.
So, now, when an I/O event is handled, the healthcheck is requeued, except
if a success or an abort is reported.
Thanks to Thierry Fournier for the report and the reproducer.
This patch must be backported to all stable versions.
In fwlc_srv_reposition(), set the server's tree_elt while we still hold
the lbprm read lock. While it was protected from concurrent
fwlc_srv_reposition() calls by the server's lb_lock, it was not from
dequeuing/requeuing that could occur if the server gets down/up or its
weight is changed, and that would lead to inconsistencies, and the
watchdog killing the process because it is stuck in an infinite loop in
fwlc_get_next_server().
This hopefully fixes github issue #2990.
This should be backported to 3.2.
Rename the _srv_postparse() internal function to srv_init() and group
srv_init_per_thr() plus the idle conns list init inside it. This way we can
perform some simplifications, as srv_init() performs multiple server
init steps after parsing.
The SRV_F_CHECKED flag was added; it is automatically set when srv_init()
runs successfully. If the flag is already set and srv_init() is called
again, nothing is done. This permits manually calling srv_init() earlier
than the default POST_CHECK hook when needed, without the risk of doing
things twice.
While new_server() takes the parent proxy as argument and even assigns
srv->proxy to the parent proxy, it didn't actually insert the server
into the parent proxy's server list on success.
The result is that sometimes we add the server to the list after
new_server() is called, and sometimes we don't.
This is really error-prone, and because of that, hooks such as
REGISTER_POST_SERVER_CHECK(), which is run for all servers listed in
all proxies, may not be relied upon for servers which are not actually
inserted in their parent proxy's server list. Plus it feels very strange
to have a server that points to a proxy while the proxy doesn't know
about it because it cannot find it in its server list.
To prevent errors and make the proxy->srv list reliable, we move the
insertion logic directly under new_server(). This requires knowing whether
we are called during parsing or at runtime, to either insert or append the
server to the parent proxy list. For that we use the PR_FL_CHECKED flag from
the parent proxy (if the flag is set, then the proxy was already checked, so
we are past the init phase and we assume we are called at runtime).
This implies that during startup, if new_server() has to be cancelled on
error paths, we need to call srv_detach() (which is now exposed in server.h)
before srv_drop().
The consequence of this commit is that REGISTER_POST_SERVER_CHECK() should
now run reliably on all servers created using new_server() (without having
to manually loop on the global servers_list).
REGISTER_POST_PROXY_CHECK() used to iterate over "main" proxies to run
registered callbacks. This means hidden proxies (and their servers) did
not get a chance to get post-checked and could cause issues if some post-
checks are expected to be executed on all proxies no matter their type.
Instead we now rely on the global proxies list. Another side effect is that
the REGISTER_POST_SERVER_CHECK() now runs as well for servers from proxies
that are not part of the main proxies list.
postcheck_log_backend() checks are executed no matter if the proxy
actually has the backend capability while the checks actually depend
on this.
Let's fix that by adding an extra condition to ensure that the BE
capability is set.
This issue is not tagged as a bug because for now it remains impossible
to have a syslog proxy without BE capability in the main proxy list, but
this may change in the future.
We have the global proxies_list pointer, which is announced as the list of
"all existing proxies", but in fact it only represents regular proxies
declared in the config file through the "listen", "frontend" or "backend"
keywords.
It is ambiguous, and we currently don't have a straightforward method to
iterate over all proxies (either public or internal ones) within haproxy.
Instead we still have to manually iterate over multiple lists (main
proxies, log-forward proxies, peer proxies..), which is error-prone.
In this patch we add a struct list member (8 bytes) inside struct proxy
in order to store every proxy (except default ones) within a global
"proxies" list which is actually representative of all proxies existing
under the haproxy process, like we already have for servers.
proxy_cond_disable() collects and prints cumulated connections for be and
fe proxies no matter their type. With shared stats it may cause issues
because depending on the proxy capabilities only fe or be counters may
be allocated.
In this patch we add some checks to ensure we only try to read from
valid memory locations, else we rely on default values (0).
init_srv_requeue() and init_srv_slowstart() functions are called after
initial server parsing via REGISTER_POST_SERVER_CHECK() hook, and they
are also manually called for dynamic server after the server is
initialized.
This may conflict with _srv_postparse(), which is also registered via
REGISTER_POST_SERVER_CHECK() and called during dynamic server creation.
To ensure the functions don't conflict with each other, let's ensure they
are executed in the proper order by calling init_srv_requeue() and
init_srv_slowstart() from _srv_postparse(), which now becomes the parent
function for server-related postparsing stuff. No change of behavior is
expected.
This is no longer used since the migration to the native `haproxy -cc
'version_atleast(X)'` functionality.
see 8727614dc4046e91997ecce421bcb6a5537cac93
see 5efc48dcf1b133dd415c759e83b21d52dc303786
Introduced in:
25bcdb1d9 BUG/MAJOR: h1: Be stricter on request target validation during message parsing
see also:
fbbbc33df REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+
In resolve_sym_name() we declare a few symbols that we want to be able
to resolve. ha_dump_backtrace() was declared with a struct buffer instead
of a pointer to such a struct, which has no effect since we only want to
get the function's pointer, but produces a build warning with LTO, so
let's fix it.
This can be backported to 3.0.
Released version 3.2.0 with the following main changes :
- MINOR: promex: Add agent check status/code/duration metrics
- MINOR: ssl: support strict-sni in ssl-default-bind-options
- MINOR: ssl: also provide the "tls-tickets" bind option
- MINOR: server: define CLI I/O handler for "add server"
- MINOR: server: implement "add server help"
- MINOR: server: use stress mode for "add server help"
- BUG/MEDIUM: server: fix crash after duplicate GUID insertion
- BUG/MEDIUM: server: fix potential null-deref after previous fix
- MINOR: config: list recently added sections with -dKcfg
- BUG/MAJOR: cache: Crash because of wrong cache entry deleted
- DOC: configuration: fix the example in crt-store
- DOC: config: clarify the wording around single/double quotes
- DOC: config: clarify the legacy cookie and header captures
- DOC: config: fix alphabetical ordering of layer 7 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 6 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 5 sample fetch functions
- DOC: config: fix alphabetical ordering of layer 4 sample fetch functions
- DOC: config: fix alphabetical ordering of internal sample fetch functions
- BUG/MINOR: h3: Set HTX flags corresponding to the scheme found in the request
- BUG/MEDIUM: h3: Declare absolute URI as normalized when a :authority is found
- DOC: config: mention in bytes_in and bytes_out that they're read on input
- DOC: config: clarify the basics of ACLs (call point, multi-valued etc)
- REGTESTS: Make the script testing conditional set-var compatible with Vtest2
- REGTESTS: Explicitly allow failing shell commands in some scripts
- MINOR: listeners: Add support for a label on bind line
- BUG/MEDIUM: cli/ring: Properly handle shutdown in "show event" I/O handler
- BUG/MEDIUM: hlua: Properly detect shudowns for TCP applets based on the new API
- BUG/MEDIUM: hlua: Fix getline() for TCP applets to work with applet's buffers
- BUG/MEDIUM: hlua: Fix receive API for TCP applets to properly handle shutdowns
- CI: vtest: Rely on VTest2 to run regression tests
- CI: vtest: Fix the build script to properly work on MaOS
- CI: combine AWS-LC and AWS-LC-FIPS by template
- BUG/MEDIUM: httpclient: Throw an error if an lua httpclient instance is reused
- DOC: hlua: Add a note to warn user about httpclient object reuse
- DOC: hlua: fix a few typos in HTTPMessage.set_body_len() documentation
- DEV: patchbot: prepare for new version 3.3-dev
- MINOR: version: mention that it's 3.2 LTS now.
A few typos were noticed while gathering info for the 3.2 announce
messages, this fixes them, and will probably constitute the last
commit of this release. There's no need to backport it unless commit
94055a5e7 ("MEDIUM: hlua: Add function to change the body length of
an HTTP Message") is backported.
It is not supported to reuse an lua httpclient instance to process several
requests. A new object must be created for each request. Thanks to the
previous patch ("BUG/MEDIUM: httpclient: Throw an error if an lua httpclient
instance is reused"), an error is now reported if this happens. But it is
not obvious for users. So the lua-api documentation was updated accordingly.
This patch is related to issue #2986. It should be backported with the
commit above.
It is not expected/supported to reuse an httpclient instance to process
several requests. A new instance must be created for each request. However,
in lua, there is nothing to prevent a user from creating an httpclient
object and using it in a loop to process requests.
That's unfortunate because this will apparently work: the requests will be
sent and a response will be received and processed. However, internally some
resources will be allocated and never released. When the next response is
processed, the resources allocated for the previous one are definitively
lost.
In this patch we take care to check that the httpclient object was never
used when a request is sent from a lua script, by checking the
HTTPCLIENT_FS_STARTED flag. This flag is set when an httpclient applet is
spawned to process a request and never removed after that. In lua, the
httpclient applet is created when the request is sent. So, it is the right
place to do this test.
This patch should fix the issue #2986. It should be backported as far as
2.6.
VTest2 (https://github.com/vtest/VTest2) was released and is a replacement
for VTest. VTest was archived. So let's use the new version now.
If this commit is backported, the 2 following commits must also be
backported:
* 2808e3577 ("REGTESTS: Explicitly allow failing shell commands in some scripts")
* 82c291124 ("REGTESTS: Make the script testing conditional set-var compatible with Vtest2")
An optional timeout was added to AppletTCP.receive() to interrupt calls
after a delay. It was mandatory to be able to implement interactive applets
(like trisdemo). However, this broke the API and made it impossible to
differentiate shutdowns from delay expirations. Indeed, in both cases, an
empty string was returned.
Because historically an empty string was used to notify a connection shutdown,
it should not be changed. So now, 'nil' value is returned when no data was
available before the delay expiration.
The new AppletTCP:try_receive() function was also affected. To fix it, instead
of stating there is no delay when a receive is tried, an expired delay is
set. Concretely TICK_ETERNITY was replaced by now_ms.
Finally, AppletTCP:getline() function is not concerned for now because there
is no way to interrupt it after some delay.
The documentation and trisdemo lua script were updated accordingly.
This patch depends on "BUG/MEDIUM: hlua: Properly detect shudowns for TCP
applets based on the new API". However, it is a 3.2-specific issue, so no
backport is needed.
The commit e5e36ce09 ("BUG/MEDIUM: hlua/cli: Fix lua CLI commands to work
with applet's buffers") fixed the TCP applets API to work with applets using
its own buffers. Howver the getline() function was not updated. It could be
an issue for anyone registering a CLI commands reading lines.
This patch should be backported as far as 3.0.
The internal function responsible for receiving data for TCP applets with
internal buffers is buggy. Indeed, for these applets, the buffer API is used
to get data, so there are no tests on the SE to properly detect connection
shutdowns. This must therefore be performed by hand after the call to
b_getblk_nc().
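A sketch of the idea; the helper and flag names (se_fl_test(), SE_FL_SHR)
are assumed to be the usual stream-endpoint ones and are shown here only as
an assumption, not as the exact patch:

  ret = b_getblk_nc(buf, &blk1, &len1, &blk2, &len2, 0, b_data(buf));
  if (!ret) {
      /* no data: the buffer API says nothing about the connection state,
       * so an explicit check on the stream endpoint is needed */
      if (se_fl_test(appctx->sedesc, SE_FL_SHR))
          return 0;   /* peer shut the connection: report it */
      /* otherwise: simply no data yet */
  }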
This patch must be backported as far as 3.0.
The commit 03dc54d802 ("BUG/MINOR: ring: Fix I/O handler of "show event"
command to not rely on the SC") introduced a regression. By removing
dependencies on the SC, a test to detect client shutdowns was removed. So
now, the CLI applet is no longer released when the client shuts the
connection during a "show event -w".
So of course, we should not use the SC to detect the shutdowns, but the SE
must be used instead.
It is a 3.2-specific issue, so no backport needed.
It is now possible to set a label on a bind line. All sockets attached to
this bind line inherit this label. The idea is to be able to group sockets.
For now, there is no mechanism to create these groups; this must be done by
hand.
Vtest2, which should replace Vtest in a few months, will reject any failing
command in shell blocks. However, some scripts execute commands expecting an
error, in order to parse the error output. So, now use "set +e" in those
scripts to explicitly state that failing commands are expected.
It is only used for non-final commands. At the end, the shell block must
still report a success.
VTest2 will replace VTest in a few months. Not many changes are expected.
One of them is that a User-Agent header is added by default to all requests,
except if a custom one is already set or if the "-nouseragent" option is
used. To remain compatible with VTest, it is not possible to use the option
to avoid the header addition. So, a custom user-agent is added in the last
test of "sample_fetches/cond_set_var.vtc" to be sure it will pass with both
VTest and VTest2. It is mandatory because the request length is tested.
This is essentially in order to address the concerns expressed in
issue #2226 where it is mentioned that the moment they are called is
not clear enough. Admittedly, re-reading the paragraph doesn't make
it obvious on a quick read that they behave like functions. This patch
adds an extra paragraph that makes the parallel with programming
languages' boolean functions and explains the fact that they can be
multi-valued. Hoping this is clearer now.
Issue #2267 suggests that it's unclear what exactly the byte counts mean
(particularly when compression is involved). Let's clarify that the counts
are read on data input and that they also cover headers and a bit of
internal overhead.
Since commit 2c3d656f8 ("MEDIUM: h3: use absolute URI form with
:authority"), the absolute URI form is used when a ':authority'
pseudo-header is found. However, this URI was not declared as normalized
internally. So, when the request is reformatted to be sent to an h1 server,
the absolute-form is used instead of the origin-form. This is unexpected and
may be an issue for some servers that could reject the request.
So, now, we take care to set HTX_SL_F_HAS_AUTHORITY flag on the HTX message
when an authority was found and HTX_SL_F_NORMALIZED_URI flag is set for
"http" or "https" schemes.
No backport needed because the commit above must not be backported. It
should fix a regression reported on the 3.2-dev17 in issue #2977.
This commit depends on "BUG/MINOR: h3: Set HTX flags corresponding to the
scheme found in the request".
When a ":scheme" pseudo-header is found in a h3 request, the
HTX_SL_F_HAS_SCHM flag must be set on the HTX message. And if the scheme is
'http' or 'https', the corresponding HTX flag must also be set. So,
respectively, HTX_SL_F_SCHM_HTTP or HTX_SL_F_SCHM_HTTPS.
It is mainly used to send the right ":scheme" pseudo-header value to H2
server on backend side.
This patch could be backported as far as 2.6.
As reported in issue #2195, cookie captures and header captures are no
longer the recommended way to proceed. Let's mention that this is the
legacy way and provide a few pointers to the recommended functions and
actions to use the modern methods.
As reported in issue #2327, the wording used in the section about quoting
can be read two ways due to the use of the two types of quotes to protect
each other quote. Better only use the quoting without mixing the two when
mentioning them.
Fix a bad example in the crt-store section. site1 does not use the "web"
crt-store but the global one.
Must be backported as far as 3.0; however, the section was 3.12 in the
previous version.
When "vary" is enabled, we can have multiple entries for a given primary
key in the cache tree. There is a limit to how many secondary entries
can be inserted for a given key. When we try to insert a new secondary
entry, if the limit is already reached, we can try to find expired
entries with the same primary key, and if the limit is still reached we
want to abort the current insertion and to remove the node that was just
inserted.
In commit "a29b073: MEDIUM: cache: Add refcount on cache_entry" though,
a regression was introduced. Instead of removing the entry just inserted
as the comments suggested, we removed the second to last entry and
returned NULL. We then reset the eb.key of the cache_entry in the caller
because we assumed that the entry was already removed from the tree.
This means that some entries with an empty key were wrongly kept in the
tree, and the last secondary entry, which keeps the count of secondary
entries for a given key, was removed.
This ended up causing some crashes later on when we tried to iterate
over the elements of this given key. The crash could occur in multiple
places, either when trying to retrieve an entry or to add some new ones.
This crash was raised in GitHub issue #2950.
The fix should be backported up to 3.0.
A valid build warning was reported in the CI with latest commit b40ce97ecc
("BUG/MEDIUM: server: fix crash after duplicate GUID insertion"). Indeed,
if the first test in the function fails, we branch to the err label
with guid==NULL and will crash there. Let's just test guid before
dereferencing it for freeing.
This needs to be backported to 3.0 as well since the commit above was
meant to go there.
On "add server", if a GUID is defined, guid_insert() is used to add the
entry into the global GUID tree. If a similar entry already exists, GUID
insertion fails and the server creation is eventually aborted.
A crash could occur in this case because of an invalid memory access via
guid_remove(). The latter is caused via free_server(), as the server
insertion is rejected. The invalid access occurs on the GUID key.
The issue occurs because of guid_insert(). The function properly
deallocates the GUID key on duplicate insertion, but it failed to reset
<guid.node.key> to NULL. This caused the invalid memory access in
guid_remove(). To fix this, ensure that the key member is properly reset
on the guid_insert() error path.
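A self-contained sketch of the rule (toy types, not the real guid code):
when the insertion is rejected because the key already exists, the key is
freed AND the pointer is reset, so that a later remove on the same object
cannot touch freed memory.

  #include <stdlib.h>
  #include <string.h>

  struct guid_sketch {
      char *key;   /* stands for guid->node.key */
  };

  static int guid_insert_sketch(struct guid_sketch *g, const char *key,
                                int duplicate)
  {
      g->key = strdup(key);
      if (!g->key)
          return -1;
      if (duplicate) {        /* same key already present in the tree */
          free(g->key);
          g->key = NULL;      /* the actual fix: reset the key pointer */
          return -1;
      }
      return 0;
  }

  static void guid_remove_sketch(struct guid_sketch *g)
  {
      if (!g->key)
          return;             /* never inserted: nothing to do */
      free(g->key);
      g->key = NULL;
  }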
This must be backported up to 3.0.
Implement stress mode on "add server help". This ensures that the
command is fully reentrant on full output buffer.
For testing, it requires compilation with USE_STRESS and global setting
"stress-level 1".
Implement "help" as a sub-command for "add server" CLI. The objective is
to list all the keywords that are supported for dynamic servers. CLI IO
handler and add_srv_ctx are used to support reentrancy on full output
buffer.
Now that this command is implemented, the outdated keyword list on "add
server" from management documentation can be removed.
Extend "add server" to support an IO handler function named
cli_io_handler_add_server(). A context object is also defined whose
usage will depend on IO handler capabilities.
IO handler is skipped when "add server" is run in default mode, i.e. on
a dynamic server creation. Thus, currently IO handler is unneeded.
However, it will become useful to support sub-commands for "add server".
Note that return value of "add server" parser has been changed on server
creation success. Previously, it was used incorrectly to report if
server was inserted or not. In fact, parser return value is used by CLI
generic code to detect if command processing has been completed, or
should continue to the IO handler. Now, "add server" always returns 1 to
signal that CLI processing is completed. This is necessary to preserve
CLI output emitted by parser, even now that IO handler is defined for
the command. Previously, output was emitted in every situation because no
IO handler was defined. See the code snippet from cli.c below for a better
overview:
if (kw->parse && kw->parse(args, payload, appctx, kw->private) != 0) {
        ret = 1;
        goto fail;
}
/* kw->parse could set its own io_handler or io_release handler */
if (!appctx->cli_ctx.io_handler) {
        ret = 1;
        goto fail;
}
appctx->st0 = CLI_ST_CALLBACK;
ret = 1;
goto end;
Currently there is "no-tls-tickets" that is also supported in the
ssl-default-bind-options directive, but there's no way to re-enable
them on a specific "bind" line. This patch simply provides the option
to re-enable them. Note that the flag is inverted because tickets are
enabled by default and the no-tls-ticket option sets the flag to
disable them.
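A small sketch of that inverted-flag logic (the option bit name below is
illustrative, not necessarily the real one): the stored bit means "disable
tickets", so the new keyword simply clears it while the historical keyword
sets it.

  if (strcmp(kw, "no-tls-tickets") == 0)
      conf->ssl_options |= BIND_OPT_NO_TLS_TICKETS;   /* illustrative name */
  else if (strcmp(kw, "tls-tickets") == 0)
      conf->ssl_options &= ~BIND_OPT_NO_TLS_TICKETS;  /* back to the default */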
Several users already reported that it would be nice to support
strict-sni in ssl-default-bind-options. However, in order to support
it, we also need an option to disable it.
This patch moves the setting of the option from the strict_sni field
to a flag in the ssl_options field so that it can be inherited from
the default bind options, and adds a new "no-strict-sni" directive to
allow to disable it on a specific "bind" line.
The test file "del_ssl_crt-list.vtc" which already tests both options
was updated to make use of the default option and the no- variant to
confirm everything continues to work.
In the Prometheus exporter, the last health check status is already exposed,
with its code and duration in seconds. The server status is also exposed.
But the information about the agent check is not available. It is not
really handy because when a server status is changed because of the agent,
it is not obvious when looking at the Prometheus metrics. Indeed, the server
may be reported as DOWN for instance, while the health check status still
reports a success. Being able to get the agent status in that case could be
valuable.
So now, the last agent check status is exposed, with its code and duration
in seconds. The following metrics can now be retrieved:
* haproxy_server_agent_status
* haproxy_server_agent_code
* haproxy_server_agent_duration_seconds
Note that unlike the other metrics, no per-backend aggregated metric is
exposed.
This patch is related to issue #2983.
Released version 3.2-dev17 with the following main changes :
- DOC: configuration: explicit multi-choice on bind shards option
- BUG/MINOR: sink: detect and warn when using "send-proxy" options with ring servers
- BUG/MEDIUM: peers: also limit the number of incoming updates
- MEDIUM: hlua: Add function to change the body length of an HTTP Message
- BUG/MEDIUM: stconn: Disable 0-copy forwarding for filters altering the payload
- BUG/MINOR: h3: don't insert more than one Host header
- BUG/MEDIUM: h1/h2/h3: reject forbidden chars in the Host header field
- DOC: config: properly index "table and "stick-table" in their section
- DOC: management: change reference to configuration manual
- BUILD: debug: mark ha_crash_now() as attribute(noreturn)
- IMPORT: slz: avoid multiple shifts on 64-bits
- IMPORT: slz: support crc32c for lookup hash on sse4 but only if requested
- IMPORT: slz: use a better hash for machines with a fast multiply
- IMPORT: slz: fix header used for empty zlib message
- IMPORT: slz: silence a build warning on non-x86 non-arm
- BUG/MAJOR: leastconn: do not loop forever when facing saturated servers
- BUG/MAJOR: queue: properly keep count of the queue length
- BUG/MINOR: quic: fix crash on quic_conn alloc failure
- BUG/MAJOR: leastconn: never reuse the node after dropping the lock
- MINOR: acme: renewal notification over the dpapi sink
- CLEANUP: quic: Useless BIO_METHOD initialization
- MINOR: quic: Add useful error traces about qc_ssl_sess_init() failures
- MINOR: quic: Allow the use of the new OpenSSL 3.5.0 QUIC TLS API (to be completed)
- MINOR: quic: implement all remaining callbacks for OpenSSL 3.5 QUIC API
- MINOR: quic: OpenSSL 3.5 internal QUIC custom extension for transport parameters reset
- MINOR: quic: OpenSSL 3.5 trick to support 0-RTT
- DOC: update INSTALL for QUIC with OpenSSL 3.5 usages
- DOC: management: update 'acme status'
- BUG/MEDIUM: wdt: always ignore the first watchdog wakeup
- CLEANUP: wdt: clarify the comments on the common exit path
- BUILD: ssl: avoid possible printf format warning in traces
- BUILD: acme: fix build issue on 32-bit archs with 64-bit time_t
- DOC: management: precise some of the fields of "show servers conn"
- BUG/MEDIUM: mux-quic: fix BUG_ON() on rxbuf alloc error
- DOC: watchdog: update the doc to reflect the recent changes
- BUG/MEDIUM: acme: check if acme domains are configured
- BUG/MINOR: acme: fix formatting issue in error and logs
- EXAMPLES: lua: avoid screen refresh effect in "trisdemo"
- CLEANUP: quic: remove unused cbuf module
- MINOR: quic: move function to check stream type in utils
- MINOR: quic: refactor handling of streams after MUX release
- MINOR: quic: add some missing includes
- MINOR: quic: adjust quic_conn-t.h include list
- CLEANUP: cfgparse: alphabetically sort the global keywords
- MINOR: glitches: add global setting "tune.glitches.kill.cpu-usage"
It was mentioned during the development of glitches that it would be
nice to support not killing misbehaving connections below a certain
CPU usage so that poor implementations that routinely misbehave without
impact are not killed. This is now possible by setting a CPU usage
threshold under which we don't kill them via this parameter. It defaults
to zero so that we continue to kill them by default.
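For example (the threshold value below is arbitrary), to only kill glitchy
connections once the process exceeds 50% CPU usage:
    global
        tune.glitches.kill.cpu-usage 50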
Adjust the include list in quic_conn-t.h. This file is included in many QUIC
sources, so it is useful to keep it as lightweight as possible. Note that
connection/QUIC MUX types are transformed into forward declarations for better
layer separation.
Insert some missing include statements in QUIC source files. This was
detected after the next commit, which adjusts the include list used in
the quic_conn-t.h file.
The quic-conn layer has to handle STREAM frames itself after MUX release. If
the stream was already seen, it is probably only a retransmitted frame
which can be safely ignored. For other streams, an active closure may be
needed.
Thus it's necessary that quic-conn layer knows the highest stream ID
already handled by the MUX after its release. Previously, this was done
via <nb_streams> member array in quic-conn structure.
Refactor this by replacing <nb_streams> by two members called
<stream_max_uni>/<stream_max_bidi>. Indeed, it is unnecessary for
quic-conn layer to monitor locally opened uni streams, as the peer
cannot by definition emit a STREAM frame on it. Also, bidirectional
streams are always opened by the remote side.
Previously, <nb_streams> were set by quic-stream layer. Now,
<stream_max_uni>/<stream_max_bidi> members are only set one time, just
prior to QUIC MUX release. This is sufficient as the quic-conn does not use
them if the MUX is available.
Note that previously, IDs were used relatively to their type, thus
incremented by 1, after shifting the original value. For simplification,
use the plain stream ID, which is incremented by 4.
Move general function to check if a stream is uni or bidirectional from
QUIC MUX to quic_utils module. This should prevent unnecessary include
of QUIC MUX header file in other sources.
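As a reminder, the stream type is encoded in the two lowest bits of the QUIC
stream ID (RFC 9000), so such a helper boils down to a bit test. A minimal
sketch, assuming helpers of this form in quic_utils (names are illustrative):
    #include <stdint.h>
    /* bit 0x2 of a stream ID marks a unidirectional stream */
    static inline int quic_stream_is_uni(uint64_t id)
    {
            return id & 0x2;
    }
    static inline int quic_stream_is_bidi(uint64_t id)
    {
            return !(id & 0x2);
    }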
In current version of the game, there is a "screen refresh" effect: the
screen is cleared before being re-drawn.
I moved the clear right after the connection is opened and removed it
from rendering time.
Stop emitting \n in errmsg for intermediate error messages, as this was
emitting multiline logs and was breaking to a new line in the middle of
sentences.
We don't need to emit them in acme_start_task() since the errmsg is
output in a send_log, which already contains a \n, or on the CLI, which
also emits it.
When starting the ACME task with a ckch_conf which does not contain the
domains, the ACME task would segfault because it tries to dereference
a NULL pointer in this case.
The patch fixes the issue by emitting a warning when no domains are
configured. It's not done at configuration parsing time because it is not
easy to emit the warning there: there is no callback system giving access
to the whole ckch_conf once a line is parsed.
No backport needed.
RX buffer allocation has been reworked in current dev tree. The
objective is to support multiple buffers per QCS to improve upload
throughput.
RX buffer allocation failure is handled simply : the whole connection is
closed. This is done via qcc_set_error(), with INTERNAL_ERROR as error
code. This function contains a BUG_ON() to ensure it is called only one
time per connection instance.
On RX buffer alloc failure, the aforementioned BUG_ON() crashes due to a
double invocation of qcc_set_error(): first by qcs_get_rxbuf(), and
immediately after by qcc_recv(), which is the caller of the previous
one. This regression was introduced by the following commit.
60f64449fbba7bb6e351e8343741bb3c960a2e6d
MAJOR: mux-quic: support multiple QCS RX buffers
To fix this, simply remove the qcc_set_error() invocation in
qcs_get_rxbuf(). On buffer alloc failure, qcc_recv() is responsible for
setting the error.
This does not need to be backported.
As reported in issue #2970, the output of "show servers conn" is not
clear. It was essentially meant as a debugging tool during some changes
to idle connections management, but if some users want to monitor or
graph them, more info is needed. The doc mentions the currently known
list of fields, and reminds that this output is not meant to be stable
over time, but as long as it does not change, it can provide some useful
metrics to some users.
The build failed on mips32 with a 64-bit time_t here:
https://github.com/haproxy/haproxy/actions/runs/15150389164/job/42595310111
Let's just turn the "remain" variable used to show the remaining time
into a more portable ullong and use %llu for all format specifiers,
since long remains limited to 32-bit on 32-bit archs.
No backport needed.
When building on MIPS-32 with gcc-9.5 and glibc-2.31, I got this:
src/ssl_trace.c: In function 'ssl_trace':
src/ssl_trace.c:118:42: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'ssize_t' {aka 'const int'} [-Wformat=]
118 | chunk_appendf(&trace_buf, " : size=%ld", *size);
| ~~^ ~~~~~
| | |
| | ssize_t {aka const int}
| long int
| %d
Let's just cast the type. No backport needed.
With commit a06c215f08 ("MEDIUM: wdt: always make the faulty thread
report its own warnings"), when the TH_FL_STUCK flag was flipped on,
we'd then go to the panic code instead of giving a second chance like
before the commit. This can trigger rare cases that only happen with
moderate loads like was addressed by commit 24ce001771 ("BUG/MEDIUM:
wdt: fix the stuck detection for warnings"). This is in fact due to
the loss of the common "goto update_and_leave" that used to serve
both the warning code and the flag setting for probation, and it's
apparently what hit Christian in issue #2980.
Let's make sure we exit naturally when turning the bit on for the
first time. Let's also update the confusing comment at the end of
the check that was left over by latest change.
Since the first commit was backported to 3.1, this commit should be
backported there as well.
For an unidentified reason, SSL_do_handshake() succeeds at its first call when 0-RTT
is enabled for the connection. This behavior looks very similar to the one encountered
with the AWS-LC stack; that said, it was documented by AWS-LC. This issue leads the
connection to stop sending handshake packets after having released the handshake
encryption level. In fact, no handshake packets could even be sent, leading
the handshake to always fail.
To fix this, this patch simulates a "handshake in progress" state waiting
for the application level read secret to be established by the TLS stack.
This may happen only after the QUIC listener has completed/confirmed the handshake
upon handshake CRYPTO data receipt from the peer.
A QUIC peer must send its transport parameters using a TLS custom extension. This
extension is reset by SSL_set_SSL_CTX(). It can be restored by calling
quic_ssl_set_tls_cbs() (which calls SSL_set_quic_tls_cbs()).
The quic_conn struct is modified for two reasons. The first one is to store
the encoded version of the local transport parameters, as is done for
USE_QUIC_OPENSSL_COMPAT. Indeed, the local transport parameters "should remain
valid until after the parameters have been sent" as mentioned by the
SSL_set_quic_tls_cbs(3) manual. In our case, the buffer is a static buffer
attached to the quic_conn object. The qc_ssl_set_quic_transport_params()
function, whose role is to call SSL_set_quic_tls_transport_params() (aliased
by SSL_set_quic_transport_params()), sets these local transport parameters
into the TLS stack from the buffer attached to the quic_conn struct.
The second quic_conn struct modification is the addition of the new ->prot_level
(SSL protection level) member to store "the most recent write encryption level
set via the OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn callback (if it has been
called)" as mentioned by the SSL_set_quic_tls_cbs(3) manual.
This patch finally implements the five remaining callbacks to make the haproxy
QUIC implementation work.
OSSL_FUNC_SSL_QUIC_TLS_crypto_send_fn() (ha_quic_ossl_crypto_send) is easy to
implement. It calls ha_quic_add_handshake_data() after having converted
qc->prot_level TLS protection level value to the correct ssl_encryption_level_t
(boringSSL API/quictls) value.
OSSL_FUNC_SSL_QUIC_TLS_crypto_recv_rcd_fn() (ha_quic_ossl_crypto_recv_rcd())
provides the non-contiguous addresses to the TLS stack, without releasing
them.
OSSL_FUNC_SSL_QUIC_TLS_crypto_release_rcd_fn() (ha_quic_ossl_crypto_release_rcd())
releases these non-contiguous buffers, relying on the fact that the list of
encryption levels (qc->qel_list) is correctly ordered by the order in which
the SSL protection level secrets are established (by the TLS stack).
OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn() (ha_quic_ossl_yield_secret())
is a simple wrapping function over ha_quic_set_encryption_secrets() which is used
by the boringSSL/quictls API.
OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() (ha_quic_ossl_got_transport_params())
stores the transport parameters received from the peer. It simply calls
quic_transport_params_store() and sets them into the TLS stack by calling
qc_ssl_set_quic_transport_params().
Also add some comments for all the OpenSSL 3.5 QUIC API callbacks.
This patch has no impact on the other uses of the QUIC API provided by the other
TLS stacks.
This patch allows the use of the new OpenSSL 3.5.0 QUIC TLS API when it is
available and detected at compilation time. The detection relies on the presence of the
OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND macro from openssl-compat.h. Indeed this
macro is defined by OpenSSL since version 3.5.0. It is not defined by quictls.
This helps in distinguishing these two TLS stacks. When the detection succeeds,
HAVE_OPENSSL_QUIC is also defined by openssl-compat.h. This new macro is then
used to detect the availability of the new OpenSSL 3.5.0 QUIC TLS API.
Note that this detection is done only if USE_QUIC_OPENSSL_COMPAT is not asked.
So, USE_QUIC_OPENSSL_COMPAT and HAVE_OPENSSL_QUIC are exclusive.
At the same location, from openssl-compat.h, ssl_encryption_level_t enum is
defined. This enum was defined by quictls and expansively used by the haproxy
QUIC implementation. SSL_set_quic_transport_params() is replaced by
SSL_set_quic_tls_transport_params. SSL_set_quic_early_data_enabled() (quictls) is also replaced
by SSL_set_quic_tls_early_data_enabled() (OpenSSL). SSL_quic_read_level() (quictls)
is not defined by OpenSSL. It is only used by the traces to log the current
TLS stack decryption level (read). A macro makes it return -1, which is an
unused value.
Most of the differences between the quictls and OpenSSL QUIC APIs are in quic_ssl.c
where some callbacks must be defined for these two APIs. This is why this
patch modifies quic_ssl.c to define an array of OSSL_DISPATCH structs: <ha_quic_dispatch>.
Each element of this array defines a callback. So, this patch implements these
six callbacks:
- ha_quic_ossl_crypto_send()
- ha_quic_ossl_crypto_recv_rcd()
- ha_quic_ossl_crypto_release_rcd()
- ha_quic_ossl_yield_secret()
- ha_quic_ossl_got_transport_params() and
- ha_quic_ossl_alert().
But at this time, these implementations, which must return an int, return 0,
interpreted as a failure by the OpenSSL QUIC API, except for ha_quic_ossl_alert()
which is implemented the same way as for quictls. The five remaining functions
above will be implemented by the next patches to come.
ha_quic_set_encryption_secrets() and ha_quic_add_handshake_data() have been moved
to be defined for both quictls and OpenSSL QUIC API.
These callbacks are attached to the SSL objects (sessions) by calling the new
qc_ssl_set_cbs() function. The latter calls the correct function to attach the
correct callbacks to the SSL objects (defined by <ha_quic_method> for quictls,
and by <ha_quic_dispatch> for OpenSSL).
The calls to SSL_provide_quic_data() and SSL_process_quic_post_handshake()
have also been disabled. These functions are not defined by the OpenSSL QUIC API.
At this time, the functions which call them are still defined when HAVE_OPENSSL_QUIC
is defined.
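For illustration, the <ha_quic_dispatch> array described above has roughly the
following shape (only OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND is named in this
message; the other five entries are assumed to follow the same pattern with
their respective OSSL_FUNC_* identifiers, so they are only sketched here):
    static const OSSL_DISPATCH ha_quic_dispatch[] = {
            { OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND,
              (void (*)(void))ha_quic_ossl_crypto_send },
            /* ... the crypto_recv_rcd, crypto_release_rcd, yield_secret,
             * got_transport_params and alert entries follow the same pattern ...
             */
            { 0, NULL } /* end of the dispatch array */
    };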
There were no traces to diagnose qc_ssl_sess_init() failures from QUIC traces.
This patch adds calls to TRACE_DEVEL() into qc_ssl_sess_init() and its caller
(qc_alloc_ssl_sock_ctx()). This was useful at least to diagnose SSL context
initialization failures when porting QUIC to the new OpenSSL 3.5 QUIC API.
Should be easily backported as far as 2.6.
This code has been there since the start of the QUIC implementation. It was supposed to
initialize <ha_quic_meth> as a BIO_METHOD static object. But this
BIO_METHOD is not used at all!
Should be backported as far as 2.6 to help integrate the next patches to come.
Output a sink message when the certificate was renewed by the ACME
client.
The message is emitted on the "dpapi" sink, and ends with \n\0.
Since the message contains this binary character, the right -0 parameter
must be used when consulting the sink over the CLI:
Example:
$ echo "show events dpapi -nw -0" | socat -t9999 /tmp/haproxy.sock -
<0>2025-05-19T15:56:23.059755+02:00 acme newcert foobar.pem.rsa\n\0
When used with the master CLI, @@1 should be used instead of @1 in order
to keep the connection to the worker.
Example:
$ echo "@@1 show events dpapi -nw -0" | socat -t9999 /tmp/master.sock -
<0>2025-05-19T15:56:23.059755+02:00 acme newcert foobar.pem.rsa\n\0
On ARM with 80 cores and a single server, it's sometimes possible to see
a segfault in fwlc_get_next_server() around 600-700k RPS. It seldom
happens as well on x86 with 128 threads with the same config around 1M
rps. It turns out that in fwlc_get_next_server(), before calling
fwlc_srv_reposition(), we have to drop the lock and that one takes it
back again.
The problem is that anything can happen to our node during this time,
and it can be freed. Then when continuing our work, we later iterate
over it and its next to find a node with an acceptable key, and by
doing so we can visit either uninitialized memory or simply nodes that
are no longer in the tree.
A first attempt at fixing this consisted in artificially incrementing
the elements count before dropping the lock, but that turned out to be
even worse because other threads could loop forever on such an element
looking for an entry that does not exist. Maintaining a separate
refcount didn't work well either, and it required to deal with the
memory release while dropping it, which is really not convenient.
Here we're taking a different approach consisting in simply not
trusting this node anymore and going back to the beginning of the
loop, as is done at a few other places as well. This way we can
safely ignore the possibly released node, and the test runs reliably
both on the arm and the x86 platforms mentioned above. No performance
regression was observed either, likely because this operation is quite
rare.
No backport is needed since this appeared with the leastconn rework
in 3.2.
If there is an alloc failure during qc_new_conn(), cleaning is done via
quic_conn_release(). However, since the below commit, an unchecked
dereferencing of <qc.path> is performed in the latter.
e841164a4402118bd7b2e2dc2b5068f21de5d9d2
MINOR: quic: account for global congestion window
To fix this, simply check <qc.path> before dereferencing it in
quic_conn_release(). This is safe as it is properly initialized to NULL
on qc_new_conn() first stage.
This does not need to be backported.
The queue length was moved to its own variable in commit 583303c48
("MINOR: proxies/servers: Calculate queueslength and use it."), however a
few places were missed in pendconn_unlink() and assign_server_and_queue()
resulting in never decreasing counts on aborted streams. This was
reproduced when injecting more connections than the total backend
could stand in TCP mode and letting some of them time out in the
queue. No backport is needed, this is only 3.2.
Since commit 9fe72bba3 ("MAJOR: leastconn; Revamp the way servers are
ordered."), there's no way to escape the loop visiting the mt_list heads
in fwlc_get_next_server if all servers in the list are saturated,
resulting in a watchdog panic. It can be reproduced with this config
and injecting with more than 2 concurrent conns:
balance leastconn
server s1 127.0.0.1:8000 maxconn 1
server s2 127.0.0.1:8000 maxconn 1
Here we count the number of saturated servers that were encountered, and
escape the loop once the number of remaining servers exceeds the number
of saturated ones. No backport is needed since this arrived in 3.2.
Building with clang 16 on MIPS64 yields this warning:
src/slz.c:931:24: warning: unused function 'crc32_uint32' [-Wunused-function]
static inline uint32_t crc32_uint32(uint32_t data)
^
Let's guard it using UNALIGNED_LE_OK which is the only case where it's
used. This saves us from introducing a possibly non-portable attribute.
This is libslz upstream commit f5727531dba8906842cb91a75c1ffa85685a6421.
Calling slz_rfc1950_finish() without emitting any data would result in
incorrectly emitting a gzip header (rfc1952) instead of a zlib header
(rfc1950) due to a copy-paste between the two wrappers. The impact is
almost inexistent since the zlib format is almost never used in this
context, and compressing totally empty messages is quite rare as well.
Let's take this opportunity for fixing another mistake on an RFC number
in a comment.
This is slz upstream commit 7f3fce4f33e8c2f5e1051a32a6bca58e32d4f818.
The current hash involves 3 simple shifts and additions so that it can
be mapped to a multiply on architectures having a fast multiply. This is
indeed what the compiler does on x86_64. A large range of values was
scanned to try to find more optimal factors on machines supporting such
a fast multiply, and it turned out that new factor 0x1af42f resulted in
smoother hashes that provided on average 0.4% better compression on both
the Silesia corpus and an mbox file composed of very compressible emails
and uncompressible attachments. It's even slightly better than CRC32C
while being faster on Skylake. This patch enables this factor on archs
with a fast multiply.
This is slz upstream commit 82ad1e75c13245a835c1c09764c89f2f6e8e2a40.
If building for sse4 and USE_CRC32C_HASH is defined, then we can use
crc32c to calculate the lookup hash. By default we don't do it because
even on skylake it's slower than the current hash, which only involves
a short multiply (~5% slower). But the gains are marginal (0.3%).
This is slz upstream commit 44ae4f3f85eb275adba5844d067d281e727d8850.
Note: this is not used by default and only merged in order to avoid
divergence between the code bases.
On 64-bit platforms, disassembling the code shows that send_huff() performs
a left shift followed by a right one, which are the result of integer
truncation and zero-extension caused solely by using different types at
different levels in the call chain. By making encode24() take a 64-bit
int on input and send_huff() take one optionally, we can remove one shift
in the hot path and gain 1% performance without affecting other platforms.
This is slz upstream commit fd165b36c4621579c5305cf3bb3a7f5410d3720b.
Building on MIPS64 with clang16 incorrectly reports some uninitialized
value warnings in stats-proxy.c due to some calls to ABORT_NOW() where
the compiler didn't know the code wouldn't return. Let's properly mark
the function as noreturn, and take this opportunity for also marking it
unused to avoid possible warnings depending on the build options (if
ABORT_NOW is not used). No backport needed though it will not harm.
Since e24b77e7 ('DOC: config: move the extraneous sections out of the
"global" definition') the ACME section of the configuration manual was
moved from 3.13 to 12.8.
Change the reference to that section in "acme renew".
Tim reported in issue #2953 that "stick-table" and "table" were not
indexed as keywords. The issue was the indent level. Also let's make
sure to put a box around the "store" arguments as well.
In continuation with 9a05c1f574 ("BUG/MEDIUM: h2/h3: reject some
forbidden chars in :authority before reassembly") and the discussion
in issue #2941, @DemiMarie rightfully suggested that Host should also
be sanitized, because it is sometimes used in concatenation, such as
this:
http-request set-url https://%[req.hdr(host)]%[pathq]
which was proposed as a workaround for h2 upstream servers that require
:authority here:
https://www.mail-archive.com/haproxy@formilux.org/msg43261.html
The current patch then adds the same check for forbidden chars in the
Host header, using the same function as for the patch above, since in
both cases we validate the host:port part of the authority. This way
we won't reconstruct ambiguous URIs by concatenating Host and path.
Just like the patch above, this can be backported after a period of
observation.
Let's make sure we drop extraneous Host headers after having compared
them. That also works when :authority was already present. This way,
like for h1 and h2, we only keep one copy of it, while still making
sure that Host matches :authority. This way, if a request has both
:authority and Host, only one Host header will be produced (from
:authority). Note that due to the different organization of the code
and wording along the evolving RFCs, here we also check that all
duplicates are identical, while h2 ignores them as per RFC7540, but
this will be re-unified later.
This should be backported to stable versions, at least 2.8, though
thanks to the existing checks the impact is probably null.
It is especially a problem with Lua filters, but it is important to disable
the 0-copy forwarding if a filter alters the payload, or at least to be able
to disable it. While the filter is registered on the data filtering, it is
not an issue (and it is the common case) because there is no way to
fast-forward data at all. But it may be an issue if a filter decides to
alter the payload and to unregister from data filtering. In that case, the
0-copy forwarding can be re-enabled in a hardly predictable state.
To fix the issue, an SC flag was added to do so. The HTTP compression filter
sets it, and Lua filters do too if the body length is changed (via
HTTPMessage.set_body_len()).
Note that it is an issue because of a bad design around the HTX. A lot of
info about the message is stored in the HTX structure itself. It must be
refactored to move several pieces of info to the stream-endpoint descriptor.
This should ease modifications at the stream level, from filters or TCP/HTTP
rules.
This should be backported as far as 3.0. If necessary, it may be backported
on lower versions, as far as 2.6. In that case, it must be reviewed and
adapted.
There was no function for a Lua filter to change the body length of an HTTP
Message. But it is mandatory to be able to alter the message payload. It is
not possible to directly update the message headers because the internal
state of the message must also be updated accordingly.
This is the purpose of the HTTPMessage.set_body_len() function. The new body
length must be passed as argument. If it is an integer, the right
"Content-Length" header is set. If the "chunked" string is used, it forces
the message to be chunked-encoded and in that case the "Transfer-Encoding"
header is set.
This patch should fix the issue #2837. It could be backported as far as 2.6.
There's a configurable limit to the number of messages sent to a
peer (tune.peers.max-updates-at-once), but this one is not applied to
the receive side. While it can usually be OK with default settings,
setups involving a large tune.bufsize (1MB and above) regularly
experience high latencies and even watchdogs during reloads because
the full learning process sends a lot of data that manages to fill
the entire buffer, and due to the compactness of the protocol, 1MB
of buffer can contain more than 100k updates, meaning taking locks
etc during this time, which is not workable.
Let's make sure the receiving side also respects the max-updates-at-once
setting. For this it counts incoming updates, and refrains from
continuing once the limit is reached. It's a bit tricky to do because
after receiving updates we still have to send ours (and possibly some
ACKs) so we cannot just leave the loop.
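For reference, the limit is configured in the global section, for example
(the value below is only illustrative):
    global
        tune.peers.max-updates-at-once 200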
This issue was reported on 3.1 but it should progressively be backported
to all versions having the max-updates-at-once option available.
using "send-proxy" or "send-proxy-v2" option on a ring server is not
relevant nor supported. Worse, on 2.4 it causes haproxy process to
crash as reported in GH #2965.
Let's be more explicit about the fact that this keyword is not supported
under "ring" context by ignoring the option and emitting a warning message
to inform the user about that.
Ideally, we should do the same for peers and log servers. The proper way
would be to check servers options during postparsing but we currently lack
proper cross-type server postparsing hooks. This will come later and thus
will give us a chance to perform the compatibility checks for server
options depending on the proxy type. But for now let's simply fix the "ring"
case since it is the only one that's known to cause a crash.
It may be backported to all stable versions.
From the documentation, it wasn't clear enough that shards should
be followed by one of the options number / by-thread / by-group.
Align it with existing options in documentation so that it becomes
more explicit.
Released version 3.2-dev16 with the following main changes :
- BUG/MEDIUM: mux-quic: fix crash on invalid fctl frame dereference
- DEBUG: pool: permit per-pool UAF configuration
- MINOR: acme: add the global option 'acme.scheduler'
- DEBUG: pools: add a new integrity mode "backup" to copy the released area
- MEDIUM: sock-inet: re-check IPv6 connectivity every 30s
- BUG/MINOR: ssl: doesn't fill conf->crt with first arg
- BUG/MINOR: ssl: prevent multiple 'crt' on the same ssl-f-use line
- BUG/MINOR: ssl/ckch: always free() the previous entry during parsing
- MINOR: tools: ha_freearray() frees an array of string
- BUG/MINOR: ssl/ckch: always ha_freearray() the previous entry during parsing
- MINOR: ssl/ckch: warn when the same keyword was used twice
- BUG/MINOR: threads: fix soft-stop without multithreading support
- BUG/MINOR: tools: improve parse_line()'s robustness against empty args
- BUG/MINOR: cfgparse: improve the empty arg position report's robustness
- BUG/MINOR: server: dont depend on proxy for server cleanup in srv_drop()
- BUG/MINOR: server: perform lbprm deinit for dynamic servers
- MINOR: http: add a function to validate characters of :authority
- BUG/MEDIUM: h2/h3: reject some forbidden chars in :authority before reassembly
- MINOR: quic: account Tx data per stream
- MINOR: mux-quic: account Rx data per stream
- MINOR: quic: add stream format for "show quic"
- MINOR: quic: display QCS info on "show quic stream"
- MINOR: quic: display stream age
- BUG/MINOR: cpu-topo: fix group-by-cluster policy for disordered clusters
- MINOR: cpu-topo: add a new "group-by-ccx" CPU policy
- MINOR: cpu-topo: provide a function to sort clusters by average capacity
- MEDIUM: cpu-topo: change "performance" to consider per-core capacity
- MEDIUM: cpu-topo: change "efficiency" to consider per-core capacity
- MEDIUM: cpu-topo: prefer grouping by CCX for "performance" and "efficiency"
- MEDIUM: config: change default limits to 1024 threads and 32 groups
- BUG/MINOR: hlua: Fix Channel:data() and Channel:line() to respect documentation
- DOC: config: Fix a typo in the "term_events" definition
- BUG/MINOR: spoe: Don't report error on applet release if filter is in DONE state
- BUG/MINOR: mux-spop: Don't report error for stream if ACK was already received
- BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
- BUG/MINOR: mux-spop: Don't open new streams for SPOP connection on error
- MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
- BUG/MEDIUM: mux-spop: Remove frame parsing states from the SPOP connection state
- BUG/MEDIUM: mux-spop: Properly handle CLOSING state
- BUG/MEDIUM: spop-conn: Report short read for partial frames payload
- BUG/MEDIUM: mux-spop: Properly detect truncated frames on demux to report error
- BUG/MEDIUM: mux-spop; Don't report a read error if there are pending data
- DEBUG: mux-spop: Review some trace messages to adjust the message or the level
- DOC: config: move address formats definition to section 2
- DOC: config: move stick-tables and peers to their own section
- DOC: config: move the extraneous sections out of the "global" definition
- CI: AWS-LC(fips): enable unit tests
- CI: AWS-LC: enable unit tests
- CI: compliance: limit run on forks only to manual + cleanup
- CI: musl: enable unit tests
- CI: QuicTLS (weekly): limit run on forks only to manual dispatch
- CI: WolfSSL: enable unit tests
Due to some historic mistakes that have spread to newly added sections,
a number of recently added small sections found themselves described
under section 3 "global parameters" which is specific to "global" section
keywords. This is highly confusing, especially given that sections 3.1,
3.2, 3.3 and 3.10 directly start with keywords valid in the global section,
while others start with keywords that describe a new section.
Let's just create a new chapter "12. other sections" and move them all
there. 3.10 "HTTPclient tuning" however was moved to 3.4 as it's really
a definition of the global options assigned to the HTTP client. The
"programs" that are going away in 3.3 were moved at the end to avoid a
renumbering later.
Another nice benefit is that it moves a lot of text that was previously
keeping the global and proxies sections apart.
As suggested by Tim in issue #2953, stick-tables really deserve their own
section to explain the configuration. And peers have to move there as well
since they're totally dedicated to stick-tables.
Now we introduce a new section "Stick-tables and Peers", explaining the
concepts, and under which there is one subsection for stick-tables
configuration and one for the peers (which mostly keeps the existing
peers section).
Section 2 describes the config file format, variables naming etc, so
there's no reason why the address format used in this file should be
in a separate section, let's bring it into section 2 as well.
Some trace messages were not really accurate, reporting a CLOSED connection
while only an error was reported on it. In addition, a TRACE_ERROR() was
used to report a short read on HELLO/DISCONNECT frame headers. But it is not
an error; a TRACE_DEVEL() should be used instead.
This patch could be backported to 3.1 to ease future backports.
When a read error is detected, no error must be reported on the SPOP
connection if there are still some data to parse. It is important to be sure
to process all data before reporting the error, and to be sure not to truncate
received frames. However, we must also take care to handle the short read case
so as not to wait for data that will never be received.
This patch must be backported to 3.1.
There was no test in the demux part to detect truncated frames and to report
an error at the connection level. The SPOP streams were properly switched to
half-closed state. But while waiting for the associated SPOE applets to be woken
up and released, the SPOP connection could be woken up several times for nothing.
I never triggered the watchdog in that case, but it is not excluded.
Now, at the end of the demux function, a specific test was added to
detect truncated frames, to report an error and close the connection.
This patch must be backported to 3.1.
When a frame was not fully received, a short read must be reported on the
SPOP connection to help the demux to handle truncated frames. This was
performed for frames truncated on the header part but not on the payload
part. It is now properly detected.
This patch must be backported to 3.1.
The CLOSING state was not handled at all by the SPOP multiplexer while it is
mandatory when a DISCONNECT frame was sent and the mux should wait for the
DISCONNECT frame in reply from the agent. Thanks to this patch, it should be
fixed.
In addition, if an error occurs during the AGENT HELLO frame parsing, the
SPOP connection is no longer switched to CLOSED state and remains in ERROR
state instead. It is important to be able to send the DISCONNECT frame to
the agent instead of closing the TCP connection immediately.
This patch depends on following commits:
* BUG/MEDIUM: mux-spop: Remove frame parsing states from the SPOP connection state
* MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
* BUG/MINOR: mux-spop: Don't open new streams for SPOP connection on error
* BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
All the series must be backported to 3.1.
SPOP_CS_FRAME_H and SPOP_CS_FRAME_P states, that were used to handle frame
parsing, were removed. The demux process now relies on the demux stream ID
to know if it is waiting for the frame header or the frame
payload. Concretely, when the demux stream ID is not set (dsi == -1), the
demuxer is waiting for the next frame header. Otherwise (dsi >= 0), it is
waiting for the frame payload. It is especially important to be able to
properly handle DISCONNECT frames sent by the agents.
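A minimal sketch of that decision (the <spop_conn> variable and the exact
field name are assumptions of this illustration):
    if (spop_conn->dsi == -1) {
            /* no frame in progress: try to parse the next frame header,
             * then record its stream ID into <dsi> */
    } else {
            /* dsi >= 0: keep demuxing the payload of the current frame */
    }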
SPOP_CS_RUNNING state is introduced to know that the hello handshake is finished
and the SPOP connection is able to open SPOP streams and exchange NOTIFY/ACK
frames with the agents.
It depends on the following fixes:
* MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
* BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
This change will be mandatory for the next fix. It must be backported to 3.1
with the commits above.
After the ACK frame was parsed, it is useless to set the SPOP connection
state to SPOP_CS_FRAME_H state because this will be automatically handled by
the demux function. It is not an issue, but this will simplify changes
for the next commit.
Till now, only fully closed SPOP connections or those with a TCP connection in
error were concerned. But available streams could still be reported for SPOP
connections in error or closing state. In these states, no NOTIFY frames
will be sent and no ACK frames will be parsed. So, no new SPOP streams should be
opened.
This patch should be backported to 3.1.
The demux stream ID of a SPOP connection, used when received frames are
parsed, must be a signed integer because it is set to -1 when the SPOP
connection is initialized. It will be important for the next fix.
This patch must be backported to 3.1.
When a SPOP connection was closed or was in error, an error was
systematically reported on all its SPOP streams. However, SPOP streams that
already received their ACK frame must be excluded. Otherwise, if an agent
sends an ACK and closes immediately, the ACK will be ignored because the SPOP
stream will handle the error first.
This patch must be backported to 3.1.
When the SPOE applet was released, if a SPOE filter context was still
attached to it, an error was reported to the filter. However, there is no
reason to report an error if the ACK message was already received. Because
of this bug, if the ACK message is received and the SPOE connection is
immediately closed, this prevents the ACK message from being processed.
This patch should be backported to 3.1.
When the channel API was revisited, both functions above were added. An
offset can be passed as argument. However, this parameter could be reported
as out of range if not enough input data was received yet. It is an issue,
especially with a tcp rule, because more data could be received. If an error
is reported too early, this prevents the rule from being reevaluated later.
In fact, an error should only be reported if the offset is part of the output
data.
Another issue is about the conditions to report 'nil' instead of an empty
string. 'nil' was reported when no data was found. But it is not aligned
with the documentation. 'nil' must only be returned if no more data can
be received and there is no input data at all.
This patch should fix the issue #2716. It should be backported as far as 2.6.
A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll
be facing new limits during the lifetime of 3.2 with our current 16
groups and 256 threads max:
$ cat test.cfg
global
cpu-policy performance
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
Raising the default limit to 1024 threads and 32 groups is sufficient
to buy us enough margin for a long time (hopefully, please don't laugh,
you, reader from the future):
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
9/1-32 257-288 32: 128-143,448-463
10/1-32 289-320 32: 144-159,464-479
11/1-32 321-352 32: 160-175,480-495
12/1-32 353-384 32: 176-191,496-511
13/1-32 385-416 32: 192-207,512-527
14/1-32 417-448 32: 208-223,528-543
15/1-32 449-480 32: 224-239,544-559
16/1-32 481-512 32: 240-255,560-575
17/1-32 513-544 32: 256-271,576-591
18/1-32 545-576 32: 272-287,592-607
19/1-32 577-608 32: 288-303,608-623
20/1-32 609-640 32: 304-319,624-639
We can change this default now because it has no functional effect
without any configured cpu-policy, so this will only be an opt-in
and it's better to do it now than to have an effect during the
maintenance phase. A tiny effect is a doubling of the number of
pool buckets and stick-table shards internally, which means that,
aside from slightly reducing contention in these areas, a dump of tables
can enumerate keys in a different order (hence the adjustment in the
vtc).
The only really visible effect is a slightly higher static memory
consumption (29->35 MB on a small config), but that difference
remains even with 50k servers so that's pretty much acceptable.
Thanks to Erwan Velu for the quick tests and the insights!
Most of the time, machines made of multiple CPU types use the same L3
for them, and grouping CPUs by frequencies to form groups doesn't bring
any value; on the contrary, it can impair the incoming connection balancing.
This choice of grouping by cluster was made in order to constitute a good
choice on homogeneous machines as well, so better rely on the per-CCX
grouping than the per-cluster one in this case. This will create fewer
clusters on machines where it counts without affecting other ones.
It doesn't seem necessary to change anything for the "resource" policy
since it selects a single cluster.
This is similar to the previous change to the "performance" policy but
it applies to the "efficiency" one. Here we're changing the sorting
method to sort CPU clusters by average per-CPU capacity, and we evict
clusters whose per-CPU capacity is above 125% of the previous one.
Per-core capacity allows to detect discrepancies between CPU cores,
and to continue to focus on efficient ones as a priority.
Running the "performance" policy on highly heterogenous systems yields
bad choices when there are sufficiently more small than big cores,
and/or when there are multiple cluster types, because on such setups,
the higher the frequency, the lower the number of cores, despite small
differences in frequencies. In such cases, we quickly end up with
"performance" only choosing the small or the medium cores, which is
contrary to the original intent, which was to select performance cores.
This is what happens on boards like the Orion O6 for example where only
the 4 medium cores and 2 big cores are chosen, evicting the 2 biggest
cores and the 4 smallest ones.
Here we're changing the sorting method to sort CPU clusters by average
per-CPU capacity, and we evict clusters whose per-CPU capacity falls
below 80% of the previous one. Per-core capacity allows to detect
discrepancies between CPU cores, and to continue to focus on high
performance ones as a priority.
The current per-capacity sorting function acts on a whole cluster, but
in some setups having many small cores and few big ones, it becomes
easy to observe an inversion of metrics where the many small cores show
a globally higher total capacity than the few big ones. This does not
necessarily fit all use cases. Let's add a new function to sort clusters
by their per-cpu average capacity to cover more use cases.
This cpu-policy will only consider CCX and not clusters. This makes
a difference on machines with heterogeneous CPUs that generally share
the same L3 cache, where it's not desirable to create multiple groups
based on the CPU types, but instead create one with the different CPU
types. The variants "group-by-2/3/4-ccx" have also been added.
Let's also add some text explaining the difference between cluster
and CCX.
Some (rare) boards have their clusters in an erratic order. This is
the case for the Radxa Orion O6 where one of the big cores appears as
CPU0 due to booting from it, then followed by the small cores, then the
medium cores, then the remaining big cores. This results in clusters
appearing in this order: 0,2,1,0.
The code in cpu_policy_group_by_cluster() expects ordered clusters,
and performs ordered comparisons to decide whether a CPU's cluster has
already been taken care of. On the board above this doesn't work, only
clusters 0 and 2 appear and 1 is skipped.
Let's replace the cluster number comparison with a cpuset to record
which clusters have been taken care of. Now the groups properly appear
like this:
Tgrp/Thr Tid CPU set
1/1-2 1-2 2: 0,11
2/1-4 3-6 4: 1-4
3/1-6 7-12 6: 5-10
No backport is needed, this is purely 3.2.
Complete stream output for "show quic" by displaying information from
its upper QCS. Note that QCS may be NULL if already released, so a
default output is also provided.
Add a new format for "show quic" command labelled as "stream". This is
an equivalent of "show sess", dedicated to the QUIC stack. Each active
QUIC streams are listed on a line with their related infos.
The main objective of this command is to ensure there is no freeze
streams remaining after a transfer.
Add counters to measure Rx buffer usage per QCS. This reuses the newly
defined bdata_ctr type already used for Tx accounting.
Note that for now, the <tot> value of bdata_ctr is not used. This is because
it is not easy to account for data across contiguous buffers.
These values are displayed both on log/traces and "show quic" output.
Add accounting at qc_stream_desc level to be able to report the number
of allocated Tx buffers and the sum of their data. This represents data
ready for emission or already emitted and waiting on ACK.
To simplify this accounting, a new counter type bdata_ctr is defined in
quic_utils.h. This regroups both buffers and data counter, plus a
maximum on the buffer value.
These values are now displayed on QCS info used both on logline and
traces, and also on "show quic" output.
As discussed here:
https://github.com/httpwg/http2-spec/pull/936
https://github.com/haproxy/haproxy/issues/2941
It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/'). This patch does this, both for h2 and h3.
The impact on H2 was measured in the worst case at 0.3% of the request
rate, while the impact on H3 is around 1%, but H3 was about 1% faster
than H2 before and is now on par.
It may be backported after a period of observation, and in this case it
relies on this previous commit:
MINOR: http: add a function to validate characters of :authority
Thanks to @DemiMarie for reviving this topic in issue #2941 and bringing
new potential interesting cases.
As discussed here:
https://github.com/httpwg/http2-spec/pull/936
https://github.com/haproxy/haproxy/issues/2941
It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/').
This patch adds a specific function which checks all such characters
and their ranges on an ist, and benefits from modern compilers
optimizations that arrange the comparisons into an evaluation tree for
faster match. That's the version that gave the most consistent performance
across various compilers, though some hand-crafted versions using bitmaps
stored in register could be slightly faster but super sensitive to code
ordering, suggesting that the results might vary with future compilers.
This one takes on average 1.2ns per character at 3 GHz (3.6 cycles per
char on avg). The resulting impact on H2 request processing time (small
requests) was measured around 0.3%, from 6.60 to 6.618us per request,
which is a bit high but remains acceptable given that the test only
focused on req rate.
The code was made usable both for H2 and H3.
Last commit 7361515 ("BUG/MINOR: server: dont depend on proxy for server
cleanup in srv_drop()") introduced a regression because the lbprm
server_deinit is not evaluated anymore with dynamic servers, possibly
resulting in a memory leak.
To fix the issue, in addition to free_proxy(), the server deinit check
should be manually performed in cli_parse_delete_server() as well.
No backport needed.
In commit b5ee8bebfc ("MINOR: server: always call ssl->destroy_srv when
available"), we made it so srv_drop() doesn't depend on proxy to perform
server cleanup.
It turns out this is now mandatory, because during deinit, free_proxy()
can occur before the final srv_drop(). This is the case when using Lua
scripts for instance.
In 2a9436f96 ("MINOR: lbprm: Add method to deinit server and proxy") we
added a freeing check under srv_drop() that depends on the proxy.
Because of that, UAF may occur during deinit when using a Lua script that
manipulates server objects.
To fix the issue, let's perform the lbprm server deinit logic under
free_proxy() directly, where the DEINIT server hooks are evaluated.
Also, to prevent similar bugs in the future, let's explicitly document
in srv_drop() that server cleanups should assume that the proxy may
already be freed.
No backport needed unless 2a9436f96 is.
OSS Fuzz found that the previous fix ebb19fb367 ("BUG/MINOR: cfgparse:
consider the special case of empty arg caused by \x00") was incomplete,
as the output can sometimes be larger than the input (due to variables
expansion) in which case the work around to try to report a bad arg will
fail. While the parse_line() function has been made more robust now in
order to avoid this condition, let's fix the handling of this special
case anyway by just pointing to the beginning of the line if the supposed
error location is out of the line's buffer.
All details here:
https://oss-fuzz.com/testcase-detail/5202563081502720
No backport is needed unless the fix above is backported.
The fix in 10e6d0bd57 ("BUG/MINOR: tools: only fill first empty arg when
not out of range") was not that good. It focused on protecting against
<arg> becoming out of range to detect we haven't emitted anything, but
it's not the right way to detect this. We're always maintaining arg_start
as a copy of outpos, and that later one is incremented when emitting a
char, so instead of testing args[arg] against out+arg_start, we should
instead check outpos against arg_start, thereby eliminating the <out>
offset and the need to access args[]. This way we now always know if
we've emitted an empty arg without dereferencing args[].
There's no need to backport this unless the fix above is also backported.
When thread support is disabled ("USE_THREAD=" or "USE_THREAD=0" when
building), soft-stop doesn't work as haproxy never ends after stopping
the proxies.
This used to work fine in the past but suddenly stopped working with
ef422ced91 ("MEDIUM: thread: make stopping_threads per-group and add
stopping_tgroups") because the "break;" instruction under the stopping
condition is never executed when support for multithreading is disabled.
To fix the issue, let's add an "else" block to run the "break;"
instruction when USE_THREAD is not defined.
It should be backported up to 2.8
When using a crt-list or a crt-store, keywords mentioned twice on the
same line overwrite the previous value.
This patch emits a warning when the same keyword is found another time
on the same line.
The ckch_conf_parse() function is the generic function which parses
crt-store keywords from the crt-store section, and also from a
crt-list.
When the same keyword appears multiple times, the previous value is
leaked. This patch ensures that the previous value is always freed
before being overwritten.
This is the same problem as the previous "BUG/MINOR: ssl/ckch: always
free() the previous entry during parsing" patch, however this one
applies on PARSE_TYPE_ARRAY_SUBSTR.
No backport needed.
The ckch_conf_parse() function is the generic function which parses
crt-store keywords from the crt-store section, and also from a crt-list.
When the same keyword appears multiple times, the previous value is leaked.
This patch ensures that the previous value is always freed before being
overwritten.
This patch should be backported as far as 3.0.
The 'ssl-f-use' implementation doesn't prevent the 'crt' keyword from being
used multiple times; each occurrence overwrites the previous value, letting
users think that it is possible to use multiple certificates on the same
line, which is not the case.
This patch emits an alert when setting the 'crt' keyword multiple times
on the same ssl-f-use line.
Should fix issue #2966.
No backport needed.
Commit c7f29afc ("MEDIUM: ssl: replace "crt" lines by "ssl-f-use"
lines") forgot to remove an the allocation of the crt field which was
done with the first argument.
Since ssl-f-use takes keywords, this would put the first keyword in
"crt" instead of the certificate name.
IPv6 connectivity might initially be off (e.g. the network is not fully up
when haproxy starts), so for features like resolvers, it would be nice to
periodically recheck it.
With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very cheap (just a connect() without any packet being
emitted).
The variables remain exported so that we could present them in show info
or anywhere else.
This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
This way we can preserve the entire contents of the released area for
later inspection. This automatically enables comparison at reallocation
time as well (like "integrity" does). If used in combination with
integrity, the comparison is disabled but the check of non-corruption
of the area mangled by integrity is still operated.
The automatic scheduler is useful but sometimes you don't want to use it,
or you want to schedule renewals manually.
This patch adds an 'acme.scheduler' option in the global section, which
can be set to either 'auto' or 'off'. (auto is the default value)
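Example:
    global
        acme.scheduler off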
This also changes the output of the 'acme status' command so it does not
show scheduled values. The state will be 'Stopped' instead of
'Scheduled'.
The new MEM_F_UAF flag can be set just after a pool's creation to make
this pool UAF for debugging purposes. This allows to maintain a better
overall performance required to reproduce issues while still having a
chance to catch UAF. It will only be used by developers who will manually
add it to areas worth being inspected, though.
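A minimal sketch of how a developer would use it (the pool name, type and
the MEM_F_SHARED flag are only illustrative, relying on the usual
create_pool() helper):
    pool_head_foobar = create_pool("foobar", sizeof(struct foobar), MEM_F_SHARED);
    if (pool_head_foobar)
            pool_head_foobar->flags |= MEM_F_UAF; /* debugging only */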
Emission of flow-control frames has been recently modified. Now, each
frame is sent one by one, via a single entry list. If a failure occurs,
emission is interrupted and the frame is reinserted into the original
<qcc.lfctl.frms> list.
This code is incorrect as it only checks whether qcc_send_frames() returns an
error code to perform the reinsert operation. However, an error here
does not always mean that the frame was not properly emitted by the lower
quic-conn layer. As such, an extra LIST_ISEMPTY() test must be performed
prior to reinserting the frame.
This bug would cause a heap overflow. Indeed, the reinserted frame would
be a random value. A crash would occur as soon as it would be
dereferenced via the <qcc.lfctl.frms> list.
This was reproduced by issuing a POST with a big file and interrupt it
after just a few seconds. This results in a crash in about a third of
the tests. Here is an example command using ngtcp2 :
$ ngtcp2-client -q --no-quic-dump --no-http-dump \
-m POST -d ~/infra/html/1g 127.0.0.1 20443 "http://127.0.0.1:20443/post"
Heap overflow was detected via a BUG_ON() statement from qc_frm_free()
via qcc_release() caller :
FATAL: bug condition "!((&((*frm)->reflist))->n == (&((*frm)->reflist)))" matched at src/quic_frame.c:1270
This does not need to be backported.
Released version 3.2-dev15 with the following main changes :
- BUG/MEDIUM: stktable: fix sc_*(<ctr>) BUG_ON() regression with ctx > 9
- BUG/MINOR: acme/cli: don't output error on success
- BUG/MINOR: tools: do not create an empty arg from trailing spaces
- MEDIUM: config: warn about the consequences of empty arguments on a config line
- MINOR: tools: make parse_line() provide hints about empty args
- MINOR: cfgparse: visually show the input line on empty args
- BUG/MINOR: tools: always terminate empty lines
- BUG/MINOR: tools: make parseline report the required space for the trailing 0
- DEBUG: threads: don't keep lock label "OTHER" in the per-thread history
- DEBUG: threads: merge successive idempotent lock operations in history
- DEBUG: threads: display held locks in threads dumps
- BUG/MINOR: proxy: only use proxy_inc_fe_cum_sess_ver_ctr() with frontends
- Revert "BUG/MEDIUM: mux-spop: Handle CLOSING state and wait for AGENT DISCONNECT frame"
- MINOR: acme/cli: 'acme status' show the status acme-configured certificates
- MEDIUM: acme/ssl: remove 'acme ps' in favor of 'acme status'
- DOC: configuration: add "acme" section to the keywords list
- DOC: configuration: add the "crt-store" keyword
- BUG/MAJOR: queue: lock around the call to pendconn_process_next_strm()
- MINOR: ssl: add filename and linenum for ssl-f-use errors
- BUG/MINOR: ssl: can't use crt-store some certificates in ssl-f-use
- BUG/MINOR: tools: only fill first empty arg when not out of range
- MINOR: debug: bump the dump buffer to 8kB
- MINOR: stick-tables: add "ipv4" as an alias for the "ip" type
- MINOR: quic: extend return value during TP parsing
- BUG/MINOR: quic: use proper error code on missing CID in TPs
- BUG/MINOR: quic: use proper error code on invalid server TP
- BUG/MINOR: quic: reject retry_source_cid TP on server side
- BUG/MINOR: quic: use proper error code on invalid received TP value
- BUG/MINOR: quic: fix TP reject on invalid max-ack-delay
- BUG/MINOR: quic: reject invalid max_udp_payload size
- BUG/MEDIUM: peers: hold the refcnt until updating ts->seen
- BUG/MEDIUM: stick-tables: close a tiny race in __stksess_kill()
- BUG/MINOR: cli: fix too many args detection for commands
- MINOR: server: ensure server postparse tasks are run for dynamic servers
- BUG/MEDIUM: stick-table: always remove update before adding a new one
- BUG/MEDIUM: quic: free stream_desc on all data acked
- BUG/MINOR: cfgparse: consider the special case of empty arg caused by \x00
- DOC: config: recommend disabling libc-based resolution with resolvers
Using both libc and haproxy resolvers can lead to hard-to-diagnose issues
when their behaviour diverges; recommend using only one type of resolver.
Should be backported to stable versions.
Link: https://www.mail-archive.com/haproxy@formilux.org/msg45663.html
Co-authored-by: Lukas Tribus <lukas@ltri.eu>
The reporting of the empty arg location added with commit 08d3caf30
("MINOR: cfgparse: visually show the input line on empty args") falls
victim of a special case detected by OSS Fuzz:
https://issues.oss-fuzz.com/issues/415850462
In short, making an argument start with "\x00" doesn't make it empty for
the parser, but still emits an empty string which is detected and
displayed. Unfortunately in this case the error pointer is not set so
the sanitization function crashes.
What we're doing in this case is that we fall back to the position of
the output argument as an estimate of where it was located in the input.
It's clearly inexact (quoting etc) but will still help the user locate
the problem.
No backport is needed unless the commit above is backported.
The following patch simplifies qc_stream_desc_ack(). The qc_stream_desc
instance is not freed anymore, even if all data were acknowledged. As
implied by the commit message, the caller is responsible for performing
this cleanup operation.
f4a83fbb14bdd14ed94752a2280a2f40c1b690d2
MINOR: quic: do not remove qc_stream_desc automatically on ACK handling
However, despite the commit instruction, the qc_stream_desc_free()
invocation was not moved to the caller. This commit fixes this by adding
it after stream ACK handling. This is performed only when a transfer is
completed: all data is acknowledged and the qc_stream_desc has been
released by its MUX stream instance counterpart.
This bug may cause a significant increase in memory usage when dealing
with long-running connections. However, there is no memory leak, as every
qc_stream_desc attached to a connection is finally freed when the
quic_conn instance is released.
This must be backported up to 3.1.
Since commit 388539faa ("MEDIUM: stick-tables: defer adding updates to a
tasklet"), between the entry creation and its arrival in the updates tree,
there is time for scheduling, and it now becomes possible for a stksess
entry to be requeued into the list while it's still in the tree as a remote
one. Only local updates were removed prior to being inserted. In this case
we would re-insert the entry, causing it to appear as the parent of two
distinct nodes or leaves, and to be visited from the first leaf during a
delete() after having already been removed and freed, causing a crash,
as Christian reported in issue #2959.
There's no reason to backport this as this appeared with the commit above
in 3.2-dev13.
commit 29b76cae4 ("BUG/MEDIUM: server/log: "mode log" after server
keyword causes crash") introduced some postparsing checks/tasks for
server
Initially they were mainly meant for "mode log" servers postparsing, but
we already have a check dedicated to "tcp/http" servers (ie: only tcp
proto supported)
However when dynamic servers are added they bypass _srv_postparse() since
the REGISTER_POST_SERVER_CHECK() is only executed for servers defined in
the configuration.
To ensure consistency between dynamic and static servers, and ensure no
post-check init routine is missed, let's manually invoke _srv_postparse()
after creating a dynamic server added via the cli.
d3f928944 ("BUG/MINOR: cli: Issue an error when too many args are passed
for a command") added a new check to prevent the command to run when
too many arguments are provided. In this case an error is reported.
However it turns out this check (despite marked for backports) was
ineffective prior to 20ec1de21 ("MAJOR: cli: Refacor parsing and
execution of pipelined commands") as 'p' pointer was reset to the end of
the buffer before the check was executed.
Now since 20ec1de21, the check works, but we have another issue: we may
read past initialized bytes in the buffer because 'p' pointer is always
incremented in a while loop without checking if we increment it past 'end'
(This was detected using valgrind)
To fix the issue introduced by 20ec1de21, let's only increment 'p' pointer
if p < end.
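As a self-contained sketch of the guarded advance (names and tokenizing
rules are simplified here, this is not the CLI parser's actual code): the
pointer is only incremented while it is still below <end>, so
uninitialized bytes are never read.

#include <stdio.h>
#include <string.h>

/* Count space-separated words without ever reading past <end>. */
static int count_args(const char *buf, size_t len)
{
    const char *p = buf, *end = buf + len;
    int argc = 0;

    while (p < end && *p) {
        while (p < end && *p == ' ')
            p++;
        if (p < end && *p) {
            argc++;
            /* only advance while still within the initialized area */
            while (p < end && *p && *p != ' ')
                p++;
        }
    }
    return argc;
}

int main(void)
{
    const char cmd[] = "set server be/srv state ready";
    printf("%d args\n", count_args(cmd, strlen(cmd)));
    return 0;
}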
For 3.2 this is it; now for older versions, since d3f928944 was marked for
backport, a slightly different approach is needed:
- conditional p increment must be done in the loop (as in this patch)
- max arg check must be moved above the "fill unused slots" comment where p
is assigned to the end of the buffer
This patch should be backported with d3f928944.
It might be possible not to see the element in the tree, then not to
see it in the update list, thus not to take the lock before deleting.
But an element in the list could have moved to the tree during the
check, and be removed later without the updt_lock.
Let's delete prior to checking the presence in the tree to avoid
this situation. No backport is needed since this arrived in -dev13
with the update list.
In peer_treat_updatemsg(), we call stktable_touch_remote() after
releasing the write lock on the TS, asking it to decrement the
refcnt, then we update ts->seen. Unfortunately this is racy and
causes the issue that Christian reported in issue #2959.
The sequence of events is very hard to trigger manually, but what happens
is the following:
T1. stktable_touch_remote(table, ts, 1);
-> at this point the entry is in the mt_list, and the refcnt is zero.
T2. stktable_trash_oldest() or process_table_expire()
-> these can run, because the refcnt is now zero.
The entry is cleanly deleted and freed.
T1. HA_ATOMIC_STORE(&ts->seen, 1)
-> we dereference freed memory.
A first attempt at a fix was made by keeping the refcnt held during
all the time the entry is in the mt_list, but this is expensive as
such entries cannot be purged, causing lots of skips during
trash_oldest_data(). This managed to trigger watchdogs, and was only
hiding the real cause of the problem.
The correct approach clearly is to maintain the ref_cnt until we
touch ->seen. That's what this patch does. It does not decrement
the refcnt while calling stktable_touch_remote(), and does it
manually after touching ->seen. With this the problem is gone.
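The ordering can be illustrated with a small sketch (structure and
function names are made up for the example, not haproxy's actual code):
the last access to the entry happens before the reference is dropped.

#include <stdatomic.h>

struct entry {
    atomic_uint ref_cnt;  /* holders preventing purge/free */
    atomic_uint seen;     /* "seen by a remote peer" flag */
};

/* Publish ->seen while still holding our reference, and only then drop
 * it: once ref_cnt hits zero, the expiration/trash paths may free the
 * entry. */
static void finish_remote_touch(struct entry *ts)
{
    atomic_store(&ts->seen, 1);        /* last dereference of ts */
    atomic_fetch_sub(&ts->ref_cnt, 1); /* now purgers may free it */
}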
Note that a reproducer involves the following:
- a config with 10 stick-ctr tracking the same table with a
random key between 10M and 100M depending on the machine.
- the expiration should be between 10 and 20s. http_req_cnt
is stored and shared with the peers.
- 4 total processes with such a config on the local machine,
each corresponding to a different peer. 3 of the peers are
bound to half of the cores (all threads) and share the same
threads; the last process is bound to the other half with
its own threads.
- injecting at full load, ~256 conn, on the shared listening
port. After ~2x the expiration time, up to 1 minute, the lone process
should segfault in pools code due to a corrupted by_lru list.
This problem already exists in earlier versions but the race looks
narrower. Given how difficult it is to trigger on a given machine
in its current form, it's likely that it only happens once in a
while on stable branches. The fix must be backported wherever the
code is similar, and there's no hope to reproduce it to validate
the backport.
Thanks again to Christian for his amazing help!
Add a check on the received max_udp_payload transport parameter. As
defined per RFC 9000, values below 1200 are invalid, and thus the
connection must be closed with a TRANSPORT_PARAMETER_ERROR code.
Prior to this patch, an invalid value was silently ignored.
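For illustration, here is a minimal sketch of such a check. The *_TRUNC
and *_INVAL codes are the ones mentioned in this series; the enum layout
and function name are otherwise illustrative, not haproxy's exact code.

enum quic_tp_dec_err {
    QUIC_TP_DEC_ERR_NONE,  /* assumed "no error" value */
    QUIC_TP_DEC_ERR_TRUNC,
    QUIC_TP_DEC_ERR_INVAL,
};

/* RFC 9000: max_udp_payload_size below 1200 is invalid; the connection
 * must then be closed with TRANSPORT_PARAMETER_ERROR. */
static enum quic_tp_dec_err check_max_udp_payload(unsigned long long val)
{
    if (val < 1200)
        return QUIC_TP_DEC_ERR_INVAL;
    return QUIC_TP_DEC_ERR_NONE;
}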
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Checks are implemented on some received transport parameter values,
to reject invalid ones defined per RFC 9000. This is the case for
max_ack_delay parameter.
The check was not properly implemented as it only rejected values strictly
greater than the limit set to 2^14. Fix this by rejecting values of 2^14
and above. Also, the proper error code TRANSPORT_PARAMETER_ERROR is now
set.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
As per RFC 9000, checks must be implemented to reject invalid values for
received transport parameters. Such values are dependent on the
parameter type.
Checks were already implemented for ack_delay_exponent and
active_connection_id_limit, accordingly with the QUIC specification.
However, the connection was closed with an incorrect error code. Fix
this to ensure that TRANSPORT_PARAMETER_ERROR code is used as expected.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Close the connection on error if retry_source_connection_id transport
parameter is received. This is specified by RFC 9000 as this parameter
must not be emitted by a client. Previously, it was silently ignored.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
This commit is similar to the previous one. It fixes the error code
reported when dealing with invalid received transport parameters. This
time, it handles reception of original_destination_connection_id,
preferred_address and stateless_reset_token which must not be emitted by
the client.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Handle missing received transport parameter value
initial_source_connection_id / original_destination_connection_id.
Previously, such a case would result in an error reported via
quic_transport_params_store(), which triggers a TLS alert converted as
expected as a CONNECTION_CLOSE. The issue is that the error code
reported in the frame was incorrect.
Fix this by returning QUIC_TP_DEC_ERR_INVAL for such conditions. This is
directly handled via quic_transport_params_store() which sets the proper
TRANSPORT_PARAMETER_ERROR code for the CONNECTION_CLOSE. However, no
error is reported so the SSL handshake is properly terminated without a
TLS alert. This is enough to ensure that the CONNECTION_CLOSE frame will
be emitted as expected.
This should be backported up to 2.6. Note that it relies on the previous
patch "MINOR: quic: extend return value on TP parsing".
Extend API used for QUIC transport parameter decoding. This is done via
the introduction of a dedicated enum to report the various error
conditions detected. No functional change should occur with this patch,
as the only returned code is QUIC_TP_DEC_ERR_TRUNC, which results in the
connection closure via a TLS alert.
This patch will be necessary to properly reject transport parameters
with the proper CONNECTION_CLOSE error code. As such, it should be
backported up to 2.6 with the following series.
However the doc purposely says the opposite, to encourage migrating away
from "ip". The goal is that in the future we change "ip" to mean "ipv6",
which seems to be what most users naturally expect. But we cannot break
configurations in the LTS version so for now "ipv4" is the alias.
The reason for not changing it in the table is that the type name is
used at a few places (look for "].kw"):
- dumps
- promex
We'd rather not change that output for 3.2, but only do it in 3.3.
This way, 3.2 can be made future-proof by using "ipv4" in the config
without any other side effect.
Please see github issue #2962 for updates on this transition.
Now with the improved backtraces, the lock history and details in the
mux layers, some dumps appear truncated or with some chars alone at
the beginning of the line. The issue is in fact caused by the limited
dump buffer size (2kB for stderr, 4kB for warning), that cannot hold
a complete trace anymore.
Let's just bump them to 8kB; this will be plenty for a long time.
In commit 3f2c8af313 ("MINOR: tools: make parse_line() provide hints
about empty args") we've added the ability to record the position of
the first empty arg in parse_line(), but that check requires to
access the args[] array for the current arg, which is not valid in
case we stopped on too large an argument count. Let's just check the
arg's validity before doing so.
This was reported by OSS Fuzz:
https://issues.oss-fuzz.com/issues/415850462
No backport is needed since this was in the latest dev branch.
When declaring a certificate via the crt-store section, this certificate
can then be used 2 ways in a crt-list:
- only by using its name, without any crt-store options
- or by using the exact set of crt-list option that was defined in the
crt-store
Since ssl-f-use is generating a crt-list, this is supposed to behave the
same. To achieve this, ckch_conf_parse() will parse the keywords related
to the ckch_conf on the ssl-f-use line and use ckch_conf_cmp() to
compare it to the previous declaration from the crt-store. This
comparison is only done when any ckch_conf keyword is present.
However, ckch_conf_parse() was done for the crt-list, and the crt-list
does not use the "crt" parameter to declare the name of the certificate,
since it's the first element of the line. So when used with ssl-f-use,
ckch_conf_parse() will always see a "crt" keyword which is a ckch_conf
one, and consider that it will always need to have the exact same set of
parameters when using the same crt in a crt-store and an ssl-f-use line.
So a simple configuration like this:
crt-store web
load crt "foo.com.crt" key "foo.com.key" alias "foo"
frontend mysite
bind :443 ssl
ssl-f-use crt "@web/foo" ssl-min-ver TLSv1.2
Would lead to an error like this:
config : '@web/foo' in crt-list '(null)' line 0, is already defined with incompatible parameters:
- different parameter 'key' : previously 'foo.com.key' vs '(null)'
In order to fix the issue, this patch parses the "crt" parameter itself
for ssl-f-use instead of using ckch_conf_parse(), so the keyword would
never be considered as a ckch_conf keyword to compare.
This patch also takes care of setting the CKCH_CONF_SET_CRTLIST flag only
if a ckch_conf keyword was found. This flag is used by ckch_conf_cmp()
to know if it has to compare or not.
No backport needed.
Fill cfg_crt_node with a filename and linenum so the post_section
callback can use it to emit errors.
This way the errors are emitted with the right filename and linenum
where ssl-f-use is used instead of (null):0
The extra call to pendconn_process_next_strm() made in commit cda7275ef5
("MEDIUM: queue: Handle the race condition between queue and dequeue
differently") was performed after releasing the server queue's lock,
which is incompatible with the calling convention for this function.
The result is random corruption of the server's streams list likely
due to picking old or incorrect pendconns from the queue, and in the
end infinitely looping on apparently already locked mt_list objects.
Just adding the lock fixes the problem.
It's very difficult to reproduce, it requires low maxconn values on
servers, stickiness on the servers (cookie), a long enough slowstart
(e.g. 10s), and regularly flipping servers up/down to re-trigger the
slowstart.
No backport is needed as this was only in 3.2.
Add the "crt-store" keyword with its argument in the "3.12" section, so
this could be detected by haproxy-dconv has a keyword and put in the
keywords list.
Must be backported as far as 3.0
Remove the 'acme ps' command which does not seem useful anymore with the
'acme status' command.
The big difference with the 'acme status' command is that it was only
displaying the running tasks instead of the status of all certificates.
The "acme status" command shows the status of every certificate
configured with ACME, not only the running tasks like "acme ps".
The IO handler loops on the ckch_store tree and outputs a line for each
ckch_store which has an acme section set. This is still done under the
ckch_store lock and doesn't support resuming when the buffer is full,
but we need to change that in the future.
This reverts commit 53c3046898633e56f74f7f05fb38cabeea1c87a1.
This patch introduced a regression leading to a loop on the frames
demultiplexing because a frame may be ignored but not consumed.
But outside this regression that can be fixed, there is a design issue that
was not totally fixed by the patch above. The SPOP connection state is mixed
with the status of the frames demultiplexer and this needlessly complexifies
the connection management. Instead of fixing the fix, a better solution is
to revert it and work on a proper solution.
For the record, the idea is to deal with the spop connection state only
using the 'state' field and to introduce a new field to handle the frames
demultiplexer state. This should ease the closing state management.
Another issue must be fixed: we must take care to not abort a SPOP
stream when an error is detected on a SPOP connection or when the connection
is closed, if the ACK frame was already received for this stream. It is not
a common case, but it can be solved by saving the last known stream ID that
received an ACK.
This patch must be backported if the commit above is backported.
proxy_inc_fe_cum_sess_ver_ctr() was implemented in 9969adbc
("MINOR: stats: add by HTTP version cumulated number of sessions and
requests")
As its name suggests, it is meant to be called for frontends, not backends
Also, in 9969adbc, when used under h1_init(), a precaution is taken to
ensure that the function is only called with frontends.
However, this precaution was not applied in h2_init() and qc_init().
Due to this, it remains possible to have proxy_inc_fe_cum_sess_ver_ctr()
being called with a backend proxy as parameter. While it did not cause
known issues so far, it is not expected and could result in bugs in the
future. Better fix this by ensuring the function is only called with
frontends.
It may be backported up to 2.8
Based on the lock history, we can spot some locks that are still held
by checking the last operation that happened on them: if it's not an
unlock, then we know the lock is held. In this case we append the list
after "locked:" with their label and state like below:
U:QUEUE S:IDLE_CONNS U:IDLE_CONNS R:TASK_WQ U:TASK_WQ S:QUEUE S:QUEUE S:QUEUE locked: QUEUE(S)
S:IDLE_CONNS U:IDLE_CONNS S:TASK_RQ U:TASK_RQ S:QUEUE U:QUEUE S:IDLE_CONNS locked: IDLE_CONNS(S)
R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ S:TASK_WQ R:TASK_WQ locked: TASK_WQ(R)
W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT U:STK_TABLE_UPDT W:STK_TABLE W:STK_TABLE_UPDT locked: STK_TABLE(W) STK_TABLE_UPDT(W)
The format is slightly different (label(status)) so as to easily
differentiate them visually from the history.
In order to make the lock history a bit more useful, let's try to merge
adjacent lock/unlock sequences that don't change anything for other
threads. For this we can replace the last unlock with the new operation
on the same label, and even just not store it if it was the same as the
one before the unlock, since in the end it's the same as if the unlock
had not been done.
Now loops that used to be filled with "R:LISTENER U:LISTENER" show more
useful info such as:
S:IDLE_CONNS U:IDLE_CONNS S:PEER U:PEER S:IDLE_CONNS U:IDLE_CONNS R:LISTENER U:LISTENER
U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE
R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS W:STK_TABLE_UPDT U:STK_TABLE_UPDT S:PEER
It's worth noting that it can sometimes induce confusion when recursive
locks of the same label are used (a few exist on peers or stick-tables),
as in such a case the two operations would be needed. However these ones
are already undebuggable, so instead they will just have to be renamed
to make sure they use a distinct label.
Most threads are filled with "R:OTHER U:OTHER" in their history. Since
anything non-important can use OTHER, it brings no useful information but
it pollutes the history. Let's just drop OTHER entirely during the recording.
The fix in commit 09a325a4de ("BUG/MINOR: tools: always terminate empty
lines") is insufficient. While it properly addresses the lack of trailing
zero, it doesn't account for it in the returned outlen that is used to
allocate a larger line. This happens at boot if the very first line of
the test file is exactly a sharp with nothing else. In this case it will
return a length of 0 and the caller (parse_cfg()) will try to re-allocate an
entry of size zero and will fail, bailing out on a lack of memory. This time
it should really be OK.
It doesn't need to be backported, unless the patch above would be.
Since latest commit 7e4a2f39ef ("BUG/MINOR: tools: do not create an empty
arg from trailing spaces"), an empty line will no longer produce an arg
and no longer append a trailing zero to them. This was not visible because
one is already present in the input string, however all the trailing args
are set to out+outpos-1, which now points one char before the buffer since
nothing was emitted, and was noticed by ASAN, and/or when parsing garbage.
Let's make sure to always emit the zero for empty lines as well to address
this issue. No backport is needed unless the patch above gets backported.
Now when an empty arg is found on a line, we emit the sanitized
input line and the position of the first empty arg so as to help
the user figure out the cause (likely an empty environment variable).
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
In order to help parse_line() callers report the position of empty
args to the user, let's decide that if no error is emitted, then
we'll stuff the errptr with the position of the first empty arg
without affecting the return value.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
For historical reasons, the config parser relies on the trailing '\0'
to detect the end of the line being parsed. When the lines started to be
tokenized into arguments, this principle has been preserved, and now all
the parsers rely on *args[arg]='\0' to detect the end of a line. But as
reported in issue #2944, while most of the time it breaks the parsing
like below:
http-request deny if { path_dir '' }
it can also cause some elements to be silently ignored like below:
acl bad_path path_sub '%2E' '' '%2F'
This may also subtly happen with environment variables that don't exist
or which are empty:
acl bad_path path_sub '%2E' "$BAD_PATTERN" '%2F'
Fortunately, parse_line() returns the number of arguments found, so it's
easy from the callers to verify if any was empty. The goal of this commit
is not to perform sensitive changes, it's only to mention when parsing a
line that an empty argument was found and alert about its consequences
using a warning. Most of the time when this happens, the config does not
parse. But for examples as the ACLs above, there could be consequences
that are better detected early.
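A minimal sketch of the caller-side detection (illustrative only, not the
actual cfgparse code): once parse_line() has returned the argument count,
any empty argument can be spotted and reported.

#include <stdio.h>

static void warn_on_empty_args(char * const *args, int argc,
                               const char *file, int line)
{
    for (int i = 0; i < argc; i++) {
        if (!args[i][0])
            fprintf(stderr, "[WARNING] %s:%d: argument %d is empty; "
                    "check quoting and environment variables\n",
                    file, line, i + 1);
    }
}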
This patch depends on this previous fix:
BUG/MINOR: tools: do not create an empty arg from trailing spaces
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Trailing spaces on the lines of the config file create an empty arg
which makes it complicated to detect really empty args. Let's first
address this. Note that it is not user-visible but prevents us from
fixing user-visible issues. No backport is needed.
The initial issue was introduced with this fix that already tried to
address it:
8a6767d266 ("BUG/MINOR: config: don't count trailing spaces as empty arg (v2)")
The current patch properly addresses leading and trailing spaces by
only counting arguments if non-lws chars were found on the line. LWS
do not cause a transition to a new arg anymore but they complete the
current one. The whole new code relies on a state machine to detect
when to create an arg (!in_arg->in_arg), and when to close the current
arg. Special care was taken for word expansion in the form of
"${ARGS[*]}" which still continues to emit individual arguments past
the first LWS. This example works fine:
ARGS="100 check inter 1000"
server name 192.168.1."${ARGS[*]}"
It properly results in 6 args:
"server", "name", "192.168.1.100", "check", "inter", "1000"
This fix should not have any visible user impact and is a bit tricky,
so it's best not to backport it, at least for a while.
Co-authored-by: Valentine Krasnobaeva <vkrasnobaeva@haproxy.com>
Previous patch 7251c13c7 ("MINOR: acme: move the acme task init in a dedicated
function") mistakenly returned the wrong error code when "acme renew" parsing
was successful, and tried to emit an error message.
This patch fixes the issue by returning 0 when the acme task was correctly
scheduled to start.
No backport needed.
As reported in GH #2958, commit 6c9b315 caused a regression with sc_*
fetches and tracked counter id > 9.
As such, the below configuration would cause a BUG_ON() to be triggered:
global
log stdout format raw local0
tune.stick-counters 11
defaults
log global
mode http
frontend www
bind *:8080
acl track_me bool(true)
http-request set-var(txn.track_var) str("a")
http-request track-sc10 var(txn.track_var) table rate_table if track_me
http-request set-var(txn.track_var_rate) sc_gpc_rate(0,10,rate_table)
http-request return status 200
backend rate_table
stick-table type string size 1k expire 5m store gpc_rate(1,1m)
While in 6c9b315 the src_fetch logic was removed from
smp_fetch_sc_stkctr(), num > 9 is indeed not expected anymore as
original num value. But what we didn't consider is that num is effectively
re-assigned for the generic sc_* variant.
Thus the BUG_ON() is misplaced as it should only be evaluated for
non-generic fetches. It explains why it triggers with valid configurations.
Thanks to GH user @tkjaer for his detailed report and bug analysis.
No backport needed, this bug is specific to 3.2.
Released version 3.2-dev14 with the following main changes :
- MINOR: acme: retry label always do a request
- MINOR: acme: does not leave task for next request
- BUG/MINOR: acme: reinit the retries only at next request
- MINOR: acme: change the default max retries to 5
- MINOR: acme: allow a delay after a valid response
- MINOR: acme: wait 5s before checking the challenges results
- MINOR: acme: emit a log when starting
- MINOR: acme: delay of 5s after the finalize
- BUG/MEDIUM: quic: Let it be known if the tasklet has been released.
- BUG/MAJOR: tasks: fix task accounting when killed
- CLEANUP: tasks: use the local state, not t->state, to check for tasklets
- DOC: acme: external account binding is not supported
- MINOR: hlua: ignore "tune.lua.bool-sample-conversion" if set after "lua-load"
- MEDIUM: peers: Give up if we fail to take locks in hot path
- MEDIUM: stick-tables: defer adding updates to a tasklet
- MEDIUM: stick-tables: Limit the number of old entries we remove
- MEDIUM: stick-tables: Limit the number of entries we expire
- MINOR: cfgparse-global: add explicit error messages in cfg_parse_global_env_opts
- MINOR: ssl: add function to extract X509 notBefore date in time_t
- BUILD: acme: need HAVE_ASN1_TIME_TO_TM
- MINOR: acme: move the acme task init in a dedicated function
- MEDIUM: acme: add a basic scheduler
- MINOR: acme: emit a log when the scheduler can't start the task
This patch implements a very basic scheduler for the ACME tasks.
The scheduler is a task which is started from the postparser function
when at least one acme section was configured.
The scheduler will loop over the certificates in the ckchs_tree, and for
each certificate will start an ACME task once curtime is past
notAfter - (notAfter - notBefore) / 12 (i.e. the remaining validity is
below one twelfth of the total), or past notAfter - 7 days when notBefore
is not available.
Once the lookup over all certificates is terminated, the task will sleep
and will wake up after 12 hours.
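A small sketch of that decision, under the assumption that renewal starts
once the remaining validity drops below the margin (field and function
names are illustrative, not the acme code):

#include <time.h>

/* Margin is 1/12 of the validity period, or 7 days when notBefore is
 * unknown; renewal is due once curtime reaches notAfter - margin. */
static int renewal_due(time_t not_before, time_t not_after, time_t now)
{
    time_t margin;

    if (not_before > 0 && not_before < not_after)
        margin = (not_after - not_before) / 12;
    else
        margin = 7 * 24 * 3600;

    return now >= not_after - margin;
}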
acme_start_task() is a dedicated function which starts an acme task
for a specified <store> certificate.
The initialization code was moved from the "acme renew" command parser to
this function, in order to be called from the scheduler.
When the env variable name or value is not provided for setenv/presetenv,
it's not clear from the old error message shown on stderr what exactly is
missing. The user needs to search their configuration.
Let's add more explicit error messages about these inconsistencies.
No need to be backported.
In process_table_expire(), limit the number of entries we remove in one
call, and just reschedule the task if there's more to do. Removing
entries requires using the heavily contended update write lock, and we
don't want to hold it for too long.
This helps getting stick tables perform better under heavy load.
Limit the number of old entries we remove in one call of
stktable_trash_oldest(), as we do so while holding the heavily contended
update write lock, so we'd rather not hold it for too long.
This helps getting stick tables perform better under heavy load.
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batches, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
In peer_send_msgs(), give up in order to retry later if we failed at
getting the update read lock.
Similarly, in __process_running_peer_sync(), give up and just reschedule
the task if we failed to get the peer lock. There is heavy contention
on both those locks, so we could spend a lot of time trying to get them.
This helps getting peers perform better under heavy load.
tune.lua.bool-sample-conversion must be set before any lua-load or
lua-load-per-thread is used for it to be considered. Indeed, lua-load
directives are parsed on the fly and will cause some parts of the scripts
to be executed during init already (script body/init contexts).
As such, we cannot afford to have "tune.lua.bool-sample-conversion" set
after some Lua code was loaded, because it would mean that the setting
would be handled differently for Lua's code executed during or after
config parsing.
To avoid ambiguities, the documentation now states that the setting must
be set before any lua-load(-per-thread) directive, and if the setting
is met after some Lua was already loaded, the directive is ignored and
a warning informs about that.
It should fix GH #2957
It may be backported with 29b6d8af16 ("MINOR: hlua: rename
"tune.lua.preserve-smp-bool" to "tune.lua.bool-sample-conversion"")
There's no point reading t->state to check for a tasklet after we've
atomically read the state into the local "state" variable. Not only is it
more expensive, it's also less clear whether that state is supposed to
be atomic or not. And in any case, tasks and tasklets have their type
forever and the one reflected in state is correct and stable.
After recent commit b81c9390f ("MEDIUM: tasks: Mutualize the TASK_KILLED
code between tasks and tasklets"), the task accounting was no longer
correct for killed tasks due to the decrement of tasks in list that was
no longer done, resulting in infinite loops in process_runnable_tasks().
This just illustrates that this code remains complex and should be further
cleaned up. No backport is needed, as this was in 3.2.
quic_conn_release() may, or may not, free the tasklet associated with
the connection. So make it return 1 if it was, and 0 otherwise, so that
if it was called from the tasklet handler itself, the said handler can
act accordingly and return NULL if the tasklet was destroyed.
This should be backported if 9240cd4a2771245fae4d0d69ef025104b14bfc23
is backported.
The next request was always leaving the task before initializing the
httpclient. This patch optimizes it by jumping to the next step at the
end of the current one. This way, only the httpclient does a
task_wakeup() to handle the response, but transitioning from the response
to the next request does not leave the task.
Released version 3.2-dev13 with the following main changes :
- MEDIUM: checks: Make sure we return the tasklet from srv_chk_io_cb
- MEDIUM: listener: Make sure w ereturn the tasklet from accept_queue_process
- MEDIUM: mux_fcgi: Make sure we return the tasklet from fcgi_deferred_shut
- MEDIUM: quic: Make sure we return the tasklet from qcc_io_cb
- MEDIUM: quic: Make sure we return NULL in quic_conn_app_io_cb if needed
- MEDIUM: quic: Make sure we return the tasklet from quic_accept_run
- BUG/MAJOR: tasklets: Make sure he tasklet can't run twice
- BUG/MAJOR: listeners: transfer connection accounting when switching listeners
- MINOR: ssl/cli: add a '-t' option to 'show ssl sni'
- DOC: config: fix ACME paragraph rendering issue
- DOC: config: clarify log-forward "host" option
- MINOR: promex: expose ST_I_PX_RATE (current_session_rate)
- BUILD: acme: use my_strndup() instead of strndup()
- BUILD: leastconn: fix build warning when building without threads on old machines
- MINOR: threads: prepare DEBUG_THREAD to receive more values
- MINOR: threads: turn the full lock debugging to DEBUG_THREAD=2
- MEDIUM: threads: keep history of taken locks with DEBUG_THREAD > 0
- MINOR: threads/cli: display the lock history on "show threads"
- MEDIUM: thread: set DEBUG_THREAD to 1 by default
- BUG/MINOR: ssl/acme: free EVP_PKEY upon error
- MINOR: acme: separate the code generating private keys
- MINOR: acme: failure when no directory is specified
- MEDIUM: acme: generate the account file when not found
- MEDIUM: acme: use 'crt-base' to load the account key
- MINOR: compiler: add more macros to detect macro definitions
- MINOR: cli: split APPCTX_CLI_ST1_PROMPT into two distinct flags
- MEDIUM: cli: make the prompt mode configurable between n/i/p
- MEDIUM: mcli: make the prompt mode configurable between i/p
- MEDIUM: mcli: replicate the current mode when enterin the worker process
- DOC: configuration: acme account key are auto generated
- CLEANUP: acme: remove old TODO for account key
- DOC: configuration: add quic4 to the ssl-f-use example
- BUG/MINOR: acme: does not try to unlock after a failed trylock
- BUG/MINOR: mux-h2: fix the offset of the pattern for the ping frame
- MINOR: tcp: add support for setting TCP_NOTSENT_LOWAT on both sides
- BUG/MINOR: acme: creating an account should not end the task
- MINOR: quic: rename min/max fields for congestion window algo
- MINOR: quic: refactor BBR API
- BUG/MINOR: quic: ensure cwnd limits are always enforced
- MINOR: thread: define cshared type
- MINOR: quic: account for global congestion window
- MEDIUM: quic: limit global Tx memory
- MEDIUM: acme: use a map to store tokens and thumbprints
- BUG/MINOR: acme: remove references to virt@acme
- MINOR: applet: add appctx_schedule() macro
- BUG/MINOR: dns: add tempo between 2 connection attempts for dns servers
- CLEANUP: dns: remove unused dns_stream_server struct member
- BUG/MINOR: dns: prevent ds accumulation within dss
- CLEANUP: proxy: mention that px->conn_retries isn't relevant in some cases
- DOC: ring: refer to newer RFC5424
- MINOR: tools: make my_strndup() take a size_t len instead of and int
- MINOR: Add "sigalg" to "sigalg name" helper function
- MINOR: ssl: Add traces to ssl init/close functions
- MINOR: ssl: Add traces to recv/send functions
- MINOR: ssl: Add traces to ssl_sock_io_cb function
- MINOR: ssl: Add traces around SSL_do_handshake call
- MINOR: ssl: Add traces to verify callback
- MINOR: ssl: Add ocsp stapling callback traces
- MINOR: ssl: Add traces to the switchctx callback
- MINOR: ssl: Add traces about sigalg extension parsing in clientHello callback
- MINOR: Add 'conn' param to ssl_sock_chose_sni_ctx
- BUG/MEDIUM: mux-spop: Wait end of handshake to declare a spop connection ready
- BUG/MEDIUM: mux-spop: Handle CLOSING state and wait for AGENT DISCONNECT frame
- BUG/MINOR: mux-h1: Don't pretend connection was released for TCP>H1>H2 upgrade
- BUG/MINOR: mux-h1: Fix trace message in h1_detroy() to not relay on connection
- BUILD: ssl: Fix wolfssl build
- BUG/MINOR: mux-spop: Use the right bitwise operator in spop_ctl()
- MEDIUM: mux-quic: increase flow-control on each bufsize
- MINOR: mux-quic: limit emitted MSD frames count per qcs
- MINOR: add hlua_yield_asap() helper
- MINOR: hlua_fcn: enforce yield after *_get_stats() methods
- DOC: config: restore default values for resolvers hold directive
- MINOR: ssl/cli: "acme ps" shows the acme tasks
- MINOR: acme: acme_ctx_destroy() returns upon NULL
- MINOR: acme: use acme_ctx_destroy() upon error
- MEDIUM: tasks: Mutualize code between tasks and tasklets.
- MEDIUM: tasks: More code factorization
- MEDIUM: tasks: Remove TASK_IN_LIST and use TASK_QUEUED instead.
- MINOR: tasks: Remove unused tasklet_remove_from_tasklet_list
- MEDIUM: tasks: Mutualize the TASK_KILLED code between tasks and tasklets
- BUG/MEDIUM: connections: Report connection closing in conn_create_mux()
- BUILD/MEDIUM: quic: Make sure we build with recent changes
Add an extra parameter to conn_create_mux(), "closed_connection".
If a pointer is provided, then let it know if the connection was closed.
Callers have no way to determine that otherwise, and we need to know
that, at least in ssl_sock_io_cb(), as if the connection was closed we
need to return NULL, as the tasklet was free'd, otherwise that can lead
to memory corruption and crashes.
This should be backported if 9240cd4a2771245fae4d0d69ef025104b14bfc23
is backported too.
The code to handle a task/tasklet when it's been killed before it got a
chance to run is mostly identical, so move it out of the task- and
tasklet-specific code, into the common code.
This commit is just cosmetic, and should have no impact.
TASK_QUEUED was used to mean "the task has been scheduled to run",
TASK_IN_LIST was used to mean "the tasklet has been scheduled to run",
remove TASK_IN_LIST and just use TASK_QUEUED for tasklets instead.
This commit is just cosmetic, and should not have any impact.
There is some code that should run no matter if the task was killed or
not, and was needlessly duplicated, so only use one instance.
This also fixes a small bug when a tasklet that got killed before it
could run would still count as a tasklet that ran, when it should not,
which just means that we'd run one less useful task before going back to
the poller.
This commit is mostly cosmetic, and should not have any impact.
The code that checks if we're currently running, and waits if so, was
identical between tasks and tasklets, so move it in code common to tasks
and tasklets.
This commit is just cosmetic, and should not have any impact.
Use acme_ctx_destroy() instead of a simple free() upon error in the
"acme renew" error handling.
It's better to use this function to be sure that everything has been
freed.
Implement a way to display the running acme tasks over the CLI.
It currently only displays a "Running" status with the certificate name
and the acme section from the configuration.
The displayed running tasks are limited to the size of a buffer for now;
it will require a backref list later to be called multiple times to
resume the list.
Default values for hold directive (resolver context) used to be documented
but this was lost when the keyword description was reworked in 24b319b
("Default value is 10s for "valid", 0s for "obsolete" and 30s for
others.")
Restoring the part that describes the default value.
It may be backported to all stable versions with 24b319b
{listener,proxy,server}_get_stats() methods are known to be expensive,
especially if used under an iteration. Indeed, while automatic yield
is performed every X lua instructions (defaults to 10k), computing an
object's stats 10K times in a single cpu loop is not desirable and
could create contention.
In this patch we leverage hlua_yield_asap() at the end of *_get_stats()
methods in order to force the automatic yield to occur ASAP after the
method returns. Hopefully this should help in similar scenarios as the
one described in GH #2903
When called, this function will try to enforce a yield (if available) as
soon as possible. Indeed, automatic yield is already enforced every X
Lua instructions. However, there may be some cases where we know after
running heavy operation that we should yield already to avoid taking too
much CPU at once.
This is what this function offers, instead of asking the user to manually
yield using "core.yield()" from Lua itself after using an expensive
Lua method offered by haproxy, we can directly enforce the yield without
the need to do it in the Lua script.
The previous commit has implemented a new calculation method for
MAX_STREAM_DATA frame emission. Now, a frame may be emitted as soon as a
buffer was consumed by a QCS instance.
This will probably increase the number of MAX_STREAM_DATA frames
emitted. It may even cause a series of frames emitted for the same
stream with increasing values under high load, which is completely
unnecessary.
To improve this, limit the number of MAX_STREAM_DATA frames built to one
per QCS instance. This is implemented by storing a reference to this
frame in QCS structure via a new member <tx.msd_frm>.
Note that to properly reset the QCS msd_frm member, emission of flow-control
frames has been changed. Now, each frame is emitted individually. On
one side, it is better as it prevents emitting frames related to different
streams in a single datagram, which is not desirable in case of packet
loss. However, this can also increase sendto() syscall invocations.
Recently, the QCS Rx buffer allocation method has been improved. It is now
possible to allocate multiple buffers per QCS instance, which was
necessary to improve HTTP/3 POST throughput.
However, a limitation remained related to the emission of
MAX_STREAM_DATA. These frames are only emitted once at least half of the
receive capacity has been consumed by its QCS instance. This may be too
restrictive when a client need to upload a large payload.
Improve this by adjusting MAX_STREAM_DATA emission. If QCS capacity is
still limited to 1 or 2 buffers max, the old calculation is still used. This
is necessary when the user has limited upload throughput via their
configuration. If QCS capacity is more than 2 buffers, a new frame is
emitted as soon as at least one buffer has been consumed.
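The resulting policy can be summarized with a small sketch (illustrative
names and parameters, not the mux-quic code):

/* Decide whether to announce more credit via MAX_STREAM_DATA.
 * <consumed> is the amount freed since the last announce, <capacity> the
 * total receive window and <bufsize> the size of one Rx buffer. */
static int should_emit_msd(unsigned bufs_max,
                           unsigned long long consumed,
                           unsigned long long capacity,
                           unsigned long long bufsize)
{
    if (bufs_max <= 2)
        return consumed * 2 >= capacity; /* old rule: half the window */
    return consumed >= bufsize;          /* new rule: one full buffer */
}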
This patch has reduced the number of STREAM_DATA_BLOCKED frames received in
POST tests with some specific clients.
Because of a typo, '||' was used instead of '|' to test the SPOP connection
flags and decide if the mux is ready or not. The regression was introduced
in the commit fd7ebf117 ("BUG/MEDIUM: mux-spop: Wait end of handshake to
declare a spop connection ready").
This patch must be backported to 3.1 with the commit above.
The newly added SSL traces require an extra 'conn' parameter to
ssl_sock_chose_sni_ctx which was added in the "regular" code but not in
the wolfssl specific one.
Wolfssl also has a different prototype for some getter functions
(SSL_get_servername for instance), which do not expect a const SSL while
the openssl version does.
h1_destroy() may be called to release a H1C after a multiplexer upgrade. In
that case, the connection is no longer attached to the H1C. It must not be
used in the h1 trace message because the connection context is no longer a H1C.
Because of this bug, when a H1>H2 upgrade is performed, a crash may be
experienced if the H1 traces are enabled.
This patch must be backported to all stable versions.
When an applicative upgrade of the H1 multiplexer is performed, we must not
pretend the connection was released. Indeed, in that case, a H1 stream is
still there with a stream connector attached to it. It must be detached
first before releasing the H1 connection and the underlying connection. So
it is important to not pretend the connection was already released.
Concretely, in that case h1_process() must return 0 instead of -1. It is
a minor error because, AFAIK, it is harmless. But it is not correct. So
let's fix it to avoid future bugs.
To be clear, this happens when a TCP connection is upgraded to H1 connection
and a H2 preface is detected, leading to a second upgrade from H1 to H2.
This patch may be backported to all stable versions.
In the SPOE specification, when an error occurs on the SPOP connection,
HAProxy must send a DISCONNECT frame and wait for the agent DISCONNECT frame
in return before truly closing the connection.
However, this part was not properly handled by the SPOP multiplexer. In this
case, the SPOP connection should be in the CLOSING state. But this state was
not used at all. Depending on when the error was encountered, the connection
could be closed immediately, without sending any DISCONNECT frame. It was
the case when an early error was detected during the AGENT-HELLO frame
parsing. Or it could be moved from ERROR to FRAME_H state, as if no error
were detected. This case was less dramatic than it seemed because some flags
were also set to prevent any problem. But it was not obvious.
So now, the SPOP connection is properly switched to the CLOSING state when a
DISCONNECT is sent to the agent, to be able to wait for its DISCONNECT in
reply. spop_process_demux() was updated to parse frames in that state and
some validity checks were added.
This patch must be backported to 3.1.
A SPOP connection must not be considered as ready while the hello handshake
is not finished with success. In addition, no error or shutdown must have
been reported for the underlying connection. Otherwise a freshly opened
SPOP connection may be reused while it is in fact dead, leading to a
connection retry.
This patch must be backported to 3.1.
We had to parse the sigAlg extension by hand in order to properly select
the certificate used by the SSL frontends. These traces allow dumping
the allowed sigAlg list sent by the client in its clientHello.
This callback allows picking the certificate used on an SSL frontend.
The certificate selection is made according to the information sent by
the client in the clientHello. The traces that were added will allow to
better understand what certificate was chosen and why. It will also warn
us if the chosen certificate was the default one.
The actual certificate parsing happens in ssl_sock_chose_sni_ctx. It's
in this function that we actually get the filename of the certificate
used.
If OCSP stapling fails because of a missing or invalid OCSP response we
used to silently disable stapling for the given session. We can now know
a bit more about what happened regarding OCSP stapling.
Those traces dump information about the multiple SSL_do_handshake calls
(renegotiation and regular call). Some errors could also be dumped in
case of rejected early data.
Depending on the chosen verbosity, some information about the current
handshake can be dumped as well (servername, tls version, chosen cipher
for instance).
In case of failed handshake, the error codes and messages will also be
dumped in the log to ease debugging.
Add a dedicated trace for some unlikely allocation failures and async
errors. Those traces will mostly be used to identify the start and end of
a given SSL connection.
This function can be used to convert a TLSv1.3 sigAlg entry (2 bytes)
from the signature_algorithms client hello extension into a string.
In order to ease debugging, some TLSv1.2 combinations can also be
dumped. In TLSv1.2 those signature algorithm pairs were built out of a
one-byte signature identifier combined with a one-byte hash identifier.
In TLSv1.3 those identifiers are two-byte blocks that must be treated as
such.
In relation to issue #2954, it appears that turning some size_t length
calculations into the int used by my_strndup() upsets coverity a bit.
Instead of dealing with such warnings each time, better address it at
the root. An inspection of all call places show that the size passed
there is always positive so we can safely use an unsigned type, and
size_t will always suit it like for strndup() where it's available.
Since 91e785edc ("MINOR: stream: Rely on a per-stream max connection
retries value"), px->conn_retries may be ignored in the following cases:
* proxy not part of a list which gets properly post-initialized (ie: main
proxy list, log-forward list, sink list)
* proxy lacking the CAP_FE capability
Documenting such cases where the px->conn_retries is set but effectively
ignored, so that we either remove ignored statements or fix them in
the future if they are really needed. In fact all cases affected here are
autonomous applets that already handle the retries themselves, so the fact
that 91e785edc made ->conn_retries ineffective should not be a big deal
anyway.
When the dns session callback (dns_session_release()) is called upon error
(ie: when some pending queries were not sent), we try our best to
re-create the applet in order to preserve the pending queries and give
them a chance to be retried. This is done at the end of
dns_session_release().
However, doing so exposes us to an issue: if the error preventing queries
from being sent is still encountered over and over, the dns session could
stay there indefinitely. Meanwhile, other dns sessions may be created on
the same dns_stream_server periodically. If previous failing dns sessions
don't terminate but we also keep creating new ones, we end up accumulating
failing sessions on a given dns_stream_server, which can eventually cause
a resource shortage.
This issue was found when trying to address ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers")
To fix it, we track the number of failed consecutive sessions for a given
dns server. When we reach the threshold (set to 100), we consider that the
link to the dns server is broken (at least temporarily) and we force
dns_session_new() to fail, so that we stop creating new sessions until one
of the existing ones eventually succeeds.
A workaround for this issue consists in setting the "maxconn" parameter on
the nameserver directive (under the resolvers section) to a reasonable value
so that no more than "maxconn" sessions may co-exist on the same server at
a given time.
This may be backported to all stable versions.
("CLEANUP: dns: remove unused dns_stream_server struct member") may be
backported to ease the backport.
As reported by Lukas Tribus on the mailing list [1], trying to connect to
a nameserver with invalid network settings causes haproxy to retry a new
connection attempt immediately, which eventually causes unexpected CPU usage
on the thread responsible for the applet (namely 100% of one CPU will be
observed).
This can be reproduced with the test config below:
resolvers default
nameserver ns1 tcp4@8.8.8.8:53 source 192.168.99.99
listen listen
mode http
bind :8080
server s1 www.google.com resolvers default init-addr none
To fix this issue, we add a one-second delay before a new connection
attempt is retried. We do this in dns_session_create() when we know that
the applet was created in the release callback (when the previous query
attempt was unsuccessful), which means the initial connection is not
affected.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg45665.html
This should fix GH #2909 and may be backported to all stable versions.
This patch depends on ("MINOR: applet: add appctx_schedule() macro")
"virt@acme" was the default map used during development, now this must
be configured in the acme section or it won't try to use any map.
This patch removes the references to virt@acme in the comments and the
code.
The stateless mode which was documented previously in the ACME example
is not convenient for all use cases.
First, when HAProxy generates the account key itself, you wouldn't be
able to put the thumbprint in the configuration, so you will have to get
the thumbprint and then reload.
Second, in the case you are using multiple account keys, there are
multiple thumbprints, and it's not easy to know which one you want to use
when responding to the challenge.
This patch allows configuring a map in the acme section, which will be
filled by the acme task with the token corresponding to the challenge
as the key, and the thumbprint as the value. This way it's easy to reply
with the right thumbprint.
Example:
http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].%[path,field(-1,/),map(virt@acme)]\n" if { path_beg '/.well-known/acme-challenge/' }
Define a new setting, tune.quic.frontend.max-tot-window. It takes a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().
Note that this limitation cannot reduce a congestion window below the
minimal limit, which is set to 2 datagrams.
Use the newly defined cshared type to account for the sum of the congestion
windows of every QUIC connection. This value is stored in the global counter
quic_mem_global defined in the proto_quic module.
Define a new type "struct cshared". This can be used as a tool to
manipulate a global counter with thread-safety ensured. Each thread
would declare its thread-local cshared type, which would point to a
global counter.
Each thread can then add/subtract values to/from its own thread-local
cshared instance via cshared_add(). If the accumulated difference exceeds a
configured limit, either positively or negatively, the global counter is
updated and the thread-local instance is reset to 0. Each thread can safely
read the global counter value using cshared_read().
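The idea can be sketched as follows (a simplified stand-in, not haproxy's
actual cshared implementation or API):

#include <stdatomic.h>

struct cshared {
    atomic_llong *global; /* shared counter, one per accounted resource */
    long long local;      /* thread-local pending delta */
    long long lim;        /* flush threshold */
};

/* Accumulate locally; only touch the shared atomic once the pending
 * delta exceeds the threshold in either direction. */
static void cshared_add(struct cshared *c, long long diff)
{
    c->local += diff;
    if (c->local >= c->lim || c->local <= -c->lim) {
        atomic_fetch_add(c->global, c->local);
        c->local = 0;
    }
}

static long long cshared_read(const struct cshared *c)
{
    return atomic_load(c->global);
}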
The congestion window is limited by minimal and maximum values which can
never be exceeded. The min value is hardcoded to 2 datagrams as recommended
by the specification. The max value is specified via the haproxy
configuration.
These values must be respected each time the congestion window size is
adjusted. However, on some rare occasions, the limits were not always
enforced. Fix this by implementing wrappers to set or increment the
congestion window. These functions ensure limits are always applied
after the operation.
Additionally, the wrappers also ensure that if the window reaches a new
maximum value, it is saved in the <cwnd_last_max> field.
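A sketch of such wrappers (the struct itself is illustrative; the field
names follow the commit text):

struct cc_path_like {
    unsigned long long cwnd;
    unsigned long long limit_min;     /* e.g. 2 datagrams */
    unsigned long long limit_max;     /* from the configuration */
    unsigned long long cwnd_last_max; /* highest value ever reached */
};

/* Every update goes through here so the limits always apply. */
static void path_cwnd_set(struct cc_path_like *p, unsigned long long val)
{
    if (val < p->limit_min)
        val = p->limit_min;
    if (val > p->limit_max)
        val = p->limit_max;
    p->cwnd = val;
    if (p->cwnd > p->cwnd_last_max)
        p->cwnd_last_max = p->cwnd;
}

static void path_cwnd_inc(struct cc_path_like *p, unsigned long long inc)
{
    path_cwnd_set(p, p->cwnd + inc);
}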
This should be backported up to 2.6, after a brief period of
observation.
Write minor adjustments to QUIC BBR functions. The objective is to
centralize every modification of the path cwnd field.
No functional change. This patch will be useful to simplify
implementation of global QUIC Tx memory usage limitation.
There was some possible confusion between fields related to congestion
window size min and max limit which cannot be exceeded, and the maximum
value previously reached by the window.
Fix this by adopting a new naming scheme. Enforced limits are now renamed
<limit_max>/<limit_min>, while the previously reached max value is
renamed <cwnd_last_max>.
This should be backported up to 3.1.
The account creation was mistakenly ending the task instead of being
woken up for the NewOrder state. This was preventing the creation of the
certificate; however, the account was correctly created.
To fix this, only the jump to the end label needs to be removed; the
standard exit codepath of the function allows the task to be woken up.
No backport needed.
TCP_NOTSENT_LOWAT is very convenient as it indicates when to report
EAGAIN on the sending side. It takes a margin on top of the estimated
window, meaning that it's no longer needed to store too much data in
socket buffers. Instead there's just enough to fill the send window
and a little bit of margin to cover the scheduling time to restart
sending. Experiments on a 100ms network have shown a 10-fold reduction
in the memory used by socket buffers by just setting this value to
tune.bufsize, without noticing any performance degradation. Theoretically
the responsiveness on multiplexed protocols such as H2 should also be
improved.
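Out of haproxy's context, the option itself boils down to a plain
setsockopt() call, e.g. this standalone illustration (not haproxy's
integration code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to stop reporting the socket as writable once the
 * unsent backlog exceeds <bytes> (e.g. one tune.bufsize worth of data). */
static int set_notsent_lowat(int fd, int bytes)
{
#ifdef TCP_NOTSENT_LOWAT
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
#else
    (void)fd; (void)bytes;
    return 0; /* not supported on this platform */
#endif
}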
The ping frame's pattern must be written at offset 9 (frame header
length), not 8. This was added in 3.2 with commit 4dcfe098a6 ("MINOR:
mux-h2: prepare to support PING emission"), so no backport is needed.
While humans can find it convenient to enter the worker process in prompt
mode, for external tools it will not be convenient to have to systematically
disable it. A better approach is to replicate the master socket's mode
there, since it has already been configured to suit the user: interactive,
prompt and timed modes are automatically passed to the worker process.
This makes using the worker commands more natural from the master
process, without having to systematically adapt it for each new connection.
Support the same syntax in master mode as in worker mode in order to
configure the prompt. The only thing is that for now the master doesn't
have a non-interactive mode and it doesn't seem necessary to implement
it, so we only support the interactive and prompt modes. However the code
was written in a way that makes it easy to change this later if desired.
Now the prompt mode can be configured more finely between non-interactive
(default), interactive without prompt, and interactive with prompt. This
will ease usage by automated tools which are not necessarily
interested in having to consume '> ' after each command nor in displaying
"+" on payload lines. This can also be convenient when coming from the
master CLI to keep the same output format.
The CLI's "prompt" command toggles two distinct things:
- displaying or hiding the prompt at the beginning of the line
- single-command vs interactive mode
These are two independent concepts and the prompt mode doesn't
always cope well with tools that would like to upload data without
having to read the prompt on return. Also, the master command line
works in interactive mode by default with no prompt, which is not
consistent (and not convenient for tools). So let's start by splitting
the bit in two, and have a new APPCTX_CLI_ST1_INTER flag dedicated
to the interactive mode. For now the "prompt" command alone continues
to toggle the two at once.
We add __equals_0(NAME) which is only true if NAME is defined as zero,
and __def_as_empty(NAME) which is only true if NAME is defined as an
empty string.
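Such macros typically rely on the classic "second argument" preprocessor
trick, similar in spirit to the Linux kernel's IS_ENABLED(); the sketch below
is only illustrative, not the actual haproxy definitions:

    /* Illustrative only: expands to 1 when the macro NAME is defined as 0,
     * and to 0 otherwise (defined as another value, or not defined at all).
     */
    #define ___placeholder_0         X,
    #define ___second_arg(a, b, ...) b
    #define ____equals_0(prefix)     ___second_arg(prefix 1, 0, dummy)
    #define ___equals_0(val)         ____equals_0(___placeholder_ ## val)
    #define __equals_0(NAME)         ___equals_0(NAME)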
Generate the private key for the account file when the file does not
exist. This generates a private key of the type and parameters
configured in the acme section.
Setting DEBUG_THREAD to 1 allows recording the lock history for each
thread. Tests have shown that (as predicted) the cost of updating a
single thread-local variable is not perceptible in the noise, especially
when compared to the cost of obtaining a lock. Since this can provide
useful value when debugging deadlocks, let's enable it by default when
threads are enabled.
This will display the lock labels and modes for each non-empty step
at the end of "show threads" when these are defined. This allows
emitting up to the last 8 locking operations for each thread on 64-bit
machines.
By only storing a word in each thread context, we can keep the history
of all taken/dropped locks by label. This is expected to be very cheap
and permits storing up to 8 consecutive lock operations in 64 bits.
That should significantly help detect recursive locks as well as figure
out which thread was likely to hinder another one waiting for a lock.
For now we only store the final state of the lock, we don't store the
attempt to get it. It's just a matter of space since we already need
4 ops (rd,sk,wr,un) which take 2 bits, leaving max 64 labels. We're
already around 45. We could also multiply by 5 and still keep 8 bits
total per lock, but that would limit us to 51 locks max. It seems that
most of the time, if we get a watchdog panic, the victim thread will
anyway be perfectly located, so we don't need a specific value for
this. Another benefit is that we perform a single memory write per
lock.
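The packing can be illustrated like this (illustrative sketch, not the actual
haproxy code):

    #include <stdint.h>

    /* One lock event fits in 8 bits: 2 bits for the operation and up to 6
     * bits for the label. Shifting the per-thread 64-bit word keeps the last
     * 8 events, most recent one in the low byte, with a single memory write.
     */
    enum lk_op { LK_RD = 0, LK_SK = 1, LK_WR = 2, LK_UN = 3 };

    static inline uint64_t lock_hist_push(uint64_t hist, unsigned int label,
                                          enum lk_op op)
    {
        return (hist << 8) | (label << 2) | op;  /* label < 64, op < 4 */
    }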
We now default the value to zero and make sure all tests properly take
care of values above zero. This is in preparation for supporting several
degrees of debugging.
Machines lacking CAS8B/DWCAS emit a warning in lb_fwlc.c without
threads due to declaration ordering. Let's just move the variable
declaration into the block that uses it as a last variable. No
backport is needed.
Not all systems have strndup(), that's why we have our "my_strndup()",
so let's make use of it here. This fixes the build on Solaris 10. No
backport is needed.
It has been requested to have the current_session_rate exposed at the
frontend level. Until now only the per-process value was exposed
(ST_I_INF_SESS_RATE).
Thanks to the work done lately to merge promex and the stat_cols_px[]
array, let's simply define an .alt_name for the ST_I_PX_RATE metric in
order to have promex exposing it as current_session_rate for relevant
contexts.
log-forward "host" option may be confusing because we often mention the
host field when talking about syslog RFC3164 or RFC5424 messages, but
neither RFC actually defines a "host" field. In fact, everywhere we used
"host field" we actually meant "hostname field" as documented in RFC5424.
This was a language abuse on our side.
In this patch we replace "host" with "hostname" where relevant in the
documentation to prevent confusion.
Thanks to Nick Ramirez for having reported the issue.
Nick Ramirez reported that the ACME paragraph (3.13) caused a rendering
issue where simple text was rendered as a directive. This was caused
by the use of unescaped <name> which confuses dconv.
Let's escape <name> by putting quotes around it to prevent the rendering
issue.
No backport needed.
Add a -t option to 'show ssl sni', allowing an offset to be added to the
current date, so it is possible to check which certificates will be expired
after a certain period of time.
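A hypothetical invocation (the socket path and the offset value are only
examples, following the CLI conventions used elsewhere in this document):

    echo "show ssl sni -t 30d" | socat /var/run/haproxy.sock stdio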
Since we made it possible for a bind_conf to listen to multiple thread
groups with shards in 2.8 with commit 9d360604bd ("MEDIUM: listener:
rework thread assignment to consider all groups"), the per-listener
connection count was not properly transferred to the target listener
with the connection when switching to another thread group. This results
in one listener possibly reaching high values and another one possibly
reaching negative values. Usually it's not visible, unless a maxconn is
set on the bind_conf, in which case comparisons will quickly put an end
to the willingness to accept new connections.
This problem only happens when thread groups are enabled, and it seems
very hard to trigger it normally, it only impacts sockets having a single
shard, hence currently the CLI (or any conf with "bind ... shards 1"),
where it can be reproduced with a config having a very low "maxconn" on
the stats socket directive (here, 4), and issuing a few tens of
socat <<< "show activity" in parallel, or sending HTTP connections to a
single-shard listener. Very quickly, haproxy stops accepting connections
and eats CPU in the poller which tries to get its connections accepted.
A BUG_ON(l->nbconn<0) after HA_ATOMIC_DEC() in listener_release() also
helps spotting them better.
Many thanks to Christian Ruppert who once again provided a very accurate
report in GH #2951 with the required data permitting this analysis.
This fix must be backported to 2.8.
tasklets were originally designed to always run on only one thread, so it
was not possible to have them run on 2 threads concurrently.
The API has been extended so that another thread may wake the tasklet,
but the idea was still that we wanted to have it run on one thread only.
However, the way it's been done meant that unless a tasklet was bound to
a specific tid with tasklet_set_tid(), or we explicitly used
tasklet_wakeup_on() to specify the thread for the target to run on, it
would be scheduled to run on the current thread.
This is in fact a desirable feature. There is however a race condition
in which the tasklet would be scheduled on a thread, while it is running
on another. This could lead to the same tasklet running on multiple
threads, which we do not want.
To fix this, just do what we already do for regular tasks, set the
"TASK_RUNNING" flag, and when it's time to execute the tasklet, wait
until that flag is gone.
Only one case has been found in the current code, where the tasklet
could run on different threads depending on who wakes it up, in the
leastconn load balancer, since commit
627280e15f03755b8f59f0191cd6d6bcad5afeb3.
It should not be a problem in practice, as the function called can be
called concurrently.
If a bug is eventually found in relation to this problem, and this patch
should be backported, the following patches should be backported too :
MEDIUM: quic: Make sure we return the tasklet from quic_accept_run
MEDIUM: quic: Make sure we return NULL in quic_conn_app_io_cb if needed
MEDIUM: quic: Make sure we return the tasklet from qcc_io_cb
MEDIUM: mux_fcgi: Make sure we return the tasklet from fcgi_deferred_shut
MEDIUM: listener: Make sure we return the tasklet from accept_queue_process
MEDIUM: checks: Make sure we return the tasklet from srv_chk_io_cb
In quic_conn_app_io_cb, make sure we return NULL if the tasklet has been
destroyed, so that the scheduler knows. It is not yet needed, but will
be soon.
Released version 3.2-dev12 with the following main changes :
- BUG/MINOR: quic: do not crash on CRYPTO ncbuf alloc failure
- BUG/MINOR: proxy: always detach a proxy from the names tree on free()
- CLEANUP: proxy: detach the name node in proxy_free_common() instead
- CLEANUP: Slightly reorder some proxy option flags to free slots
- MINOR: proxy: Add options to drop HTTP trailers during message forwarding
- MINOR: h1-htx: Skip C-L and T-E headers for 1xx and 204 messages during parsing
- MINOR: mux-h1: Keep custom "Content-Length: 0" header in 1xx and 204 messages
- MINOR: hlua/h1: Use http_parse_cont_len_header() to parse content-length value
- CLEANUP: h1: Remove now useless h1_parse_cont_len_header() function
- BUG/MEDIUM: mux-spop: Respect the negociated max-frame-size value to send frames
- MINOR: http-act: Add 'pause' action to temporarily suspend the message analysis
- MINOR: acme/cli: add the 'acme renew' command to the help message
- MINOR: httpclient: add an "https" log-format
- MEDIUM: acme: use a customized proxy
- MEDIUM: acme: rename "uri" into "directory"
- MEDIUM: acme: rename "account" into "account-key"
- MINOR: stick-table: use a separate lock label for updates
- MINOR: h3: simplify h3_rcv_buf return path
- BUG/MINOR: mux-quic: fix possible infinite loop during decoding
- BUG/MINOR: mux-quic: do not decode if conn in error
- BUG/MINOR: cli: Issue an error when too many args are passed for a command
- MINOR: cli: Use a full prompt command for bidir connections with workers
- MAJOR: cli: Refacor parsing and execution of pipelined commands
- MINOR: cli: Rename some CLI applet states to reflect recent refactoring
- CLEANUP: applet: Update st0/st1 comment in appctx structure
- BUG/MINOR: hlua: Fix I/O handler of lua CLI commands to not rely on the SC
- BUG/MINOR: ring: Fix I/O handler of "show event" command to not rely on the SC
- MINOR: cli/applet: Move appctx fields only used by the CLI in a private context
- MINOR: cache: Add a pointer on the cache in the cache applet context
- MINOR: hlua: Use the applet name in error messages for lua services
- MINOR: applet: Save the "use-service" rule in the stream to init a service applet
- CLEANUP: applet: Remove unsued rule pointer in appctx structure
- BUG/MINOR: master/cli: properly trim the '@@' process name in error messages
- MEDIUM: resolvers: add global "dns-accept-family" directive
- MINOR: resolvers: add command-line argument -4 to force IPv4-only DNS
- MINOR: sock-inet: detect apparent IPv6 connectivity
- MINOR: resolvers: add "dns-accept-family auto" to rely on detected IPv6
- MEDIUM: acme: use Retry-After value for retries
- MEDIUM: acme: reset the remaining retries
- MEDIUM: acme: better error/retry management of the challenge checks
- BUG/MEDIUM: cli: Handle applet shutdown when waiting for a command line
- Revert "BUG/MINOR: master/cli: properly trim the '@@' process name in error messages"
- BUG/MINOR: master/cli: only parse the '@@' prefix on complete lines
- MINOR: resolvers: use the runtime IPv6 status instead of boot time one
On systems where the network is not reachable at boot time (certain HA
systems for example, or dynamically addressed test machines), we'll want
to be able to periodically revalidate the IPv6 reachability status. The
current code makes it complicated because it sets the config bits once
for all at boot time. This commit changes this so that the config bits
are not changed, but instead we rely on a static inline function that
relies on sock_inet6_seems_reachable for every test (really cheap). This
also removes the now unneeded resolvers late init code.
This variable for now is still set at boot time but this will ease the
transition later, as the resolvers code is now ready for this.
The new adhoc parser for the '@@' prefix forgot to require the presence
of the LF character marking the end of the line. This is the reason why
entering incomplete commands would display garbage, because the line was
expected to have its LF character replaced with a zero.
The problem is well illustrated by using socat in raw mode:
socat /tmp/master.sock STDIO,raw,echo=0
then entering "@@1 show info" one character at a time would error just
after the second "@". The command must take care to report an incomplete
line and wait for more data in such a case.
This reverts commit 0e94339eaf1c8423132debb6b1b485d8bb1bb7da.
This patch was in fact fixing the symptom, not the cause. The root cause
of the problem is that the parser was processing an incomplete line when
looking for '@@'. When the LF is present, this problem does not exist
as it's properly replaced with a zero. This can be verified using socat
in raw mode:
socat /tmp/master.sock STDIO,raw,echo=0
Then entering "@@1 show info" one character at a time will immediately
fail on "@@" without going further. A subsequent patch will fix this.
No backport is needed.
When the CLI applet was refactored in the commit 20ec1de21 ("MAJOR: cli:
Refacor parsing and execution of pipelined commands"), a regression was
introduced. The applet shutdown was no longer handled when the applet was
waiting for the next command line. It is especially visible when a client
timeout occurs because the client connection is no longer closed.
To fix the issue, the test on the SE_FL_SHW flag was reintroduced in the
CLI_ST_PARSE_CMDLINE state, but only if there is no pending input data.
It is a 3.2-specific issue. No backport needed.
When the ACME task is checking for the status of the challenge, it would
only succeed or retry upon failure.
However that's not the best way to do it: ACME objects contain a
"status" field which can hold either a final status or an in-progress status,
so we need to be able to retry.
This patch adds an acme_ret enum which contains OK, RETRY and FAIL.
In the case of the CHKCHALLENGE, the ACME server could return a "pending" or a
"processing" status, which basically needs to be rechecked later with
RETRY. However an "invalid" or "valid" status is final and will return
either a FAIL or an OK.
So instead of retrying in any case, the "invalid" status now ends the
task with an error.
Parse the Retry-After header in the response and store it in order to use
the value as the delay for the next retry, falling back to 3s if the
value couldn't be parsed or does not exist.
Instead of always having to force IPv4 or IPv6, let's now also offer
"auto" which will only enable IPv6 if the system has a default gateway
for it. This means that properly configured dual-stack systems will
default to "ipv4,ipv6" while those lacking a gateway will only use
"ipv4". Note that no real connectivity test is performed, so firewalled
systems may still get it wrong and might prefer to rely on a manual
"ipv4" assignment.
In order to ease dual-stack deployments, we could at least try to
check if ipv6 seems to be reachable. For this we're adding a test
based on a UDP connect (no traffic) on port 53 to the base of
public addresses (2001::) and see if the connect() is permitted,
indicating that the routing table knows how to reach it, or fails.
Based on this result we're setting a global variable that other
subsystems might use to preset their defaults.
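A minimal sketch of this kind of probe (not the actual haproxy function)
could be:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Rough illustration of the principle: a UDP connect() sends no packet
     * but fails when no route exists towards the base of public addresses
     * (2001::), hinting at missing IPv6 connectivity.
     */
    static int ipv6_seems_reachable(void)
    {
        struct sockaddr_in6 sa = { .sin6_family = AF_INET6, .sin6_port = htons(53) };
        int fd, ok;

        if (inet_pton(AF_INET6, "2001::", &sa.sin6_addr) != 1)
            return 0;
        fd = socket(AF_INET6, SOCK_DGRAM, 0);
        if (fd < 0)
            return 0;
        ok = connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0;
        close(fd);
        return ok;
    }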
In order to ease troubleshooting and testing, the new "-4" command line
argument enforces queries and processing of "A" DNS records only, i.e.
those representing IPv4 addresses. This can be useful when a host lacks
end-to-end dual-stack connectivity. This overrides the global
"dns-accept-family" directive and is equivalent to value "ipv4".
By default, DNS resolvers accept both IPv4 and IPv6 addresses. This can be
influenced by the "resolve-prefer" keywords on server lines as well as the
family argument to the "do-resolve" action, but that is only a preference,
which does not block the other family from being used when it's alone. In
some environments where dual-stack is not usable, stumbling on an unreachable
IPv6-only DNS record can cause significant trouble as it will replace a
previous IPv4 one which would possibly have continued to work till next
request. The "dns-accept-family" global option permits to enforce usage of
only one (or both) address families. The argument is a comma-delimited list
of the following words:
- "ipv4": query and accept IPv4 addresses ("A" records)
- "ipv6": query and accept IPv6 addresses ("AAAA" records)
When a single family is used, no request will be sent to resolvers for the
other family, and any response for the other family will be ignored. The
default value is "ipv4,ipv6", which effectively enables both families.
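For example, to restrict resolution to IPv4 only (illustrative snippet):

    global
        dns-accept-family ipv4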
When '@@' alone is sent on the master CLI (no trailing LF), we get an
error that displays anything past these two characters in the buffer
since there's no room for a \0. Let's make sure to limit the length of
the process name in this case. No backport is needed since this was added
with 00c967fac4 ("MINOR: master/cli: support bidirectional communications
with workers").
When a service is initialized, the "use-service" rule that was executed is
now saved in the stream, using "current_rule" field, instead of saving it
into the applet context. It is safe to do so because this field is unused at
this stage. To avoid any issue, it is reset after the service
initialization. Doing so, it is no longer necessary to save it in the applet
context. It was the last usage of the rule pointer in the applet context.
The init functions for TCP and HTTP lua services were updated accordingly.
The lua function name was used in error messages of HTTP/TCP lua services
while the applet name could be used instead. Concretely, this will not change
anything, because when a lua service is registered, the lua function name
is used to name the applet. But it is easier, cleaner and more logical
because it is really the applet name that should be displayed in these error
messages.
Thanks to this change, when a response is delivered from the cache, it is no
longer necessary to get the cache filter configuration from the http
"use-cache" rule saved in the appctx to get the currently used cache. It was
a bit complex to get an info that can be directly and naturally stored in
the cache applet context.
There are several fields in the appctx structure only used by the CLI. To
make things cleaner, all these fields are now placed in a dedicated context
inside the appctx structure. The final goal is to move it in the service
context and add an API for cli commands to get a command context inside the
cli context.
Thanks to the CLI refactoring ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), it is possible to fix "show event" I/O handle function
to no longer use the SC.
When the applet API was refactored to no longer manipulate the channels or
the stream-connectors, this part was missed. However, without the patch
above, it could not be fixed. It is now possible so let's do it.
This patch must not be backported because it depends on the refactoring of the
CLI applet.
Thanks to the CLI refactoring ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), it is possible to fix the I/O handler function used by
lua CLI commands to no longer use the SC.
When the applet API was refactored to no longer manipulate the channels or
the stream-connectors, this part was missed. However, without the patch
above, it could not be fixed. It is now possible so let's do it.
This patch must not be backported because it depends on the refactoring of the
CLI applet.
CLI_ST_GETREQ state was renamed into CLI_ST_PARSE_CMDLINE and CLI_ST_PARSEREQ
into CLI_ST_PROCESS_CMDLINE to reflect the real action performed in these
states.
Before this patch, when pipelined commands were received, each command was
parsed and then executed before moving to the next command. Pending commands
were not copied in the input buffer of the applet. The major issue with this
way of handling commands is the impossibility to consume inputs from commands
with an I/O handler, like "show events" for instance. It was working thanks
to a "bug" if such commands were the last one on the command line. But it
was impossible to use them followed by another command. And this prevented us
from implementing any streaming support for CLI commands.
So we decided to refactor the command line parsing to have something similar
to a basic shell. Now an entire line is parsed, including the payload,
before starting commands execution. The command line is copied in a
dedicated buffer. "appctx->chunk" buffer is used for this purpose. It was an
unsed field, so it is safe to use it here. Once the command line copied, the
commands found on this line are executed. Because the applet input buffer
was flushed, any input can be safely consumed by the CLI applet and is
available for the command I/O handler. Thanks to this change, "show event
-w" command can be followed by a command. And in theory, it should be
possible to implement commands supporting input data streaming. For
instance, the Tetris like lua applet can be used on the CLI now.
Note that the payload, if any, is part of the command line and must be fully
received before starting the commands processing. It means there is still
the limitation to a buffer, but not only for the payload but for the whole
command line. The payload is still necessarily at the end of the command
line and is passed as argument to the last command. Internally, the
"appctx->cli_payload" field was introduced to point on the payload in the
command line buffer.
This patch is quite huge but it cannot easily be split. It should not
introduce significant changes.
When a bidirectional connection with no command is established with a worker
(so "@@<pid>" alone), a "prompt" command is automatically added to display
the worker's prompt and enter interactive mode in the worker context.
However, till now, an unfinished command line was sent, with a semicolon
instead of a newline at the end. It is not exactly a bug because this
works. But it is not really expected and could be a problem for future
changes.
So now, a full command line is sent: the "prompt" command followed by a
newline character.
When a command is parsed to split it in an array of arguments, by default,
at most 64 arguments are supported. But no warning was emitted when there
were too many arguments. Instead, the arguments above the limit were
silently ignored. It could be an issue for some commands, like "add server",
because there was no way to know some arguments were ignored.
Now an error is issued when too many arguments are passed and the command is
not executed.
This patch should be backported to all stable versions.
Add an early return to qcc_decode_qcs() if QCC instance is flagged on
error and connection is scheduled for immediate closure.
The main objective is to ensure to not trigger BUG_ON() from
qcc_set_error() : if a stream decoding has set the connection error, do
not try to process decoding on other streams as they may also encounter
an error. Thus, the connection is closed asap with the first encountered
error case.
This should be backported up to 2.6, after a period of observation.
With the support of multiple Rx buffers per QCS instance, stream
decoding in qcc_io_recv() has been reworked for the next haproxy
release. An issue appears in a double while loop : a break statement is
used in the inner loop, which is not sufficient as it should instead
exit from the outer one.
Fix this by replacing break with a goto statement.
No need to backport this.
Remove return statement in h3_rcv_buf() in case of stream/connection
error. Instead, reuse already existing label err. This simplifies the
code path. It also fixes the missing leave trace for these cases.
Use a customized proxy for the ACME client.
The proxy is initialized at the first acme section parsed.
The proxy uses the httpsclient log format as ACME CAs use HTTPS.
Add an experimental "https" log-format for the httpclient, it is not
used by the httpclient by default, but could be defined in a customized
proxy.
The string is basically a httpslog, with some of the fields replaced by
their backend equivalent or - when not available:
"%ci:%cp [%tr] %ft -/- %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
The 'pause' HTTP action can now be used to temporarily suspend the message
analysis. A timeout, expressed in milliseconds using a time-format
parameter, or an expression can be used. If an expression is used, errors
and invalid values are ignored.
Internally, the action will set the analysis expiration date on the
corresponding channel to the configured value and it will yield while it is
not expired.
The 'pause' action is available for 'http-request' and 'http-response'
rules.
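As an illustration (the values are arbitrary and the exact argument forms
follow the time-format/expression rules described above):

    frontend fe
        bind :8080
        # fixed 100ms pause before continuing request analysis
        http-request pause 100ms
        # delay computed by a sample expression; invalid values are ignored
        http-response pause int(50)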
When a SPOP connection is opened, the maximum size for frames is negotiated.
This negotiated size is properly used when a frame is received, and if a too
big frame is detected, an error is triggered. However, the same was not
performed on the sending path. No check was performed on frames sent to the
agent. So it was possible to send frames bigger than the maximum size
supported by the SPOE agent.
Now, the size of NOTIFY and DISCONNECT frames is checked before sending them
to the agent.
Thanks to Miroslav for having reported the issue.
This patch must be backported to 3.1.
Since the commit "MINOR: hlua/h1: Use http_parse_cont_len_header() to parse
content-length value", this function is no longer used. So it can be safely
removed.
Till now, h1_parse_cont_len_header() was used during the H1 message parsing and
by the lua HTTP applets to parse the content-length header value. But a more
generic function was added some years ago doing exactly the same operations. So
let's use it instead.
Thanks to the commit "MINOR: mux-h1: Don't remove custom "Content-Length: 0"
header in 1xx and 204 messages", we are now sure that 1xx and 204 responses
were sanitized during the parsing. So, if one of these headers are found in
such responses when sent to the client, it means it was added by hand, via a
"set-header" action for instance. In this context, we are able to make an
exception for the "Content-Length: 0" header, and only this one with this
value, to not break leagacy applications.
So now, a user can force the "Content-Length: 0" header to appear in 1xx and
204 responses by adding the right action in hist configuration.
"Transfer-Encoding" headers are still dropped as "Content-Length" headers
with another value than 0. Note, that in practice, only 101 and 204 are
concerned because other 1xx message are not subject to HTTP analysis.
This patch should fix the issue #2888. There is no reason to backport
it. But if we do so, the patch above must be backported too.
According to the RFC9110 and RFC9112, a server must not add 'Content-Length'
or 'Transfer-Encoding' headers into 1xx and 204 responses. So till now,
these headers were dropped from the response when it is sent to the client.
However, it seems more logical to remove them during the message parsing. In
addition to sanitize messages as early as possible, this will allow us to
apply some exception in some cases (This will be the subject of another
patch).
In this patch, 'Content-Length' and 'Transfer-Encoding' headers are removed
from 1xx and 204 responses during the parsing but the same is still
performed during the formatting stage.
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with special care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.
On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.
This patch should fix the issue #2930.
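For example (illustrative configuration):

    defaults
        mode http

    frontend fe
        bind :8080
        option http-drop-response-trailers
        default_backend be

    backend be
        option http-drop-request-trailers
        server app 192.0.2.10:8080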
This changes commit d2a9149f0 ("BUG/MINOR: proxy: always detach a proxy
from the names tree on free()") to be cleaner. Aurlien spotted that
the free(p->id) was indeed already done in proxy_free_common(), which is
called before we delete the node. That's still a bit ugly and it only
works because ebpt_delete() does not dereference the key during the
operation. Better play safe and delete the entry before freeing it,
that's more future-proof.
Stephen Farrell reported in issue #2942 that recent haproxy versions
crash if there's no resolv.conf. A quick bisect with his reproducer
showed that it started with commit 4194f75 ("MEDIUM: tree-wide: avoid
manually initializing proxies") which reorders the proxies initialization
sequence a bit. The crash shows a corrupted tree, typically indicating a
use-after-free. With the help of ASAN it was possible to find that a
resolver proxy had been destroyed and freed before the name insertion
that causes the crash, very likely caused by the absence of the needed
resolv.conf:
#0 0x7ffff72a82f7 in free (/usr/local/lib64/libasan.so.5+0x1062f7)
#1 0x94c1fd in free_proxy src/proxy.c:436
#2 0x9355d1 in resolvers_destroy src/resolvers.c:2604
#3 0x93e899 in resolvers_create_default src/resolvers.c:3892
#4 0xc6ed29 in httpclient_resolve_init src/http_client.c:1170
#5 0xc6fbcf in httpclient_create_proxy src/http_client.c:1310
#6 0x4ae9da in ssl_ocsp_update_precheck src/ssl_ocsp.c:1452
#7 0xa1b03f in step_init_2 src/haproxy.c:2050
But free_proxy() doesn't delete the ebpt_node that carries the name,
which perfectly explains the situation. This patch simply deletes the
name node and Stephen confirmed that it fixed the problem for him as
well. Let's also free it since the key points to p->id which is never
freed either in this function!
No backport is needed since the patch above was first merged into
3.2-dev10.
To handle out-of-order received CRYPTO frames, a ncbuf instance is
allocated. This is done via the helper quic_get_ncbuf().
Buffer allocation was improperly checked. In case b_alloc() fails, it
crashes due to a BUG_ON(). Fix this by removing it. The function now
returns NULL on allocation failure, which is already properly handled in
its caller qc_handle_crypto_frm().
This should fix the last reported crash from github issue #2935.
This must be backported up to 2.6.
Released version 3.2-dev11 with the following main changes :
- CI: enable weekly QuicTLS build
- DOC: management: slightly clarify the prefix role of the '@' command
- DOC: management: add a paragraph about the limitations of the '@' prefix
- MINOR: master/cli: support bidirectional communications with workers
- MEDIUM: ssl/ckch: add filename and linenum argument to crt-store parsing
- MINOR: acme: add the acme section in the configuration parser
- MINOR: acme: add configuration for the crt-store
- MINOR: acme: add private key configuration
- MINOR: acme/cli: add the 'acme renew' command
- MINOR: acme: the acme section is experimental
- MINOR: acme: get the ACME directory
- MINOR: acme: handle the nonce
- MINOR: acme: check if the account exist
- MINOR: acme: generate new account
- MINOR: acme: newOrder request retrieve authorizations URLs
- MINOR: acme: allow empty payload in acme_jws_payload()
- MINOR: acme: get the challenges object from the Auth URL
- MINOR: acme: send the request for challenge ready
- MINOR: acme: implement a check on the challenge status
- MINOR: acme: generate the CSR in a X509_REQ
- MINOR: acme: finalize by sending the CSR
- MINOR: acme: verify the order status once finalized
- MINOR: acme: implement retrieval of the certificate
- BUG/MINOR: acme: ckch_conf_acme_init() when no filename
- MINOR: ssl/ckch: handle ckch_conf in ckchs_dup() and ckch_conf_clean()
- MINOR: acme: copy the original ckch_store
- MEDIUM: acme: replace the previous ckch instance with new ones
- MINOR: acme: schedule retries with a timer
- BUILD: acme: enable the ACME feature when JWS is present
- BUG/MINOR: cpu-topo: check the correct variable for NULL after malloc()
- BUG/MINOR: acme: key not restored upon error in acme_res_certificate()
- BUG/MINOR: thread: protect thread_cpus_enabled_at_boot with USE_THREAD
- MINOR: acme: default to 2048bits for RSA
- DOC: acme: explain how to configure and run ACME
- BUG/MINOR: debug: remove the trailing \n from BUG_ON() statements
- DOC: config: add the missing "profiling.memory" to the global kw index
- DOC: config: add the missing "force-cfg-parser-pause" to the global kw index
- DEBUG: init: report invalid characters in debug description strings
- DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default
- DEBUG: counters: make COUNT_IF() only appear at DEBUG_COUNTERS>=1
- DEBUG: counters: add the ability to enable/disable updating the COUNT_IF counters
- MINOR: tools: let dump_addr_and_bytes() support dumping before the offset
- MINOR: debug: in call traces, dump the 8 bytes before the return address, not after
- MINOR: debug: detect call instructions and show the branch target in backtraces
- BUG/MINOR: acme: fix possible NULL deref
- CLEANUP: acme: stored value is overwritten before it can be used
- BUILD: incompatible pointer type suspected with -DDEBUG_UNIT
- BUG/MINOR: http-ana: Properly detect client abort when forwarding the response
- BUG/MEDIUM: http-ana: Report 502 from req analyzer only during rsp forwarding
- CI: fedora rawhide: enable unit tests
- DOC: configuration: fix a typo in ACME documentation
- MEDIUM: sink: add a new dpapi ring buffer
- Revert "BUG/MINOR: acme: key not restored upon error in acme_res_certificate()"
- BUG/MINOR: acme: key not restored upon error in acme_res_certificate() V2
- BUG/MINOR: acme: fix the exponential backoff of retries
- DOC: configuration: specify limitations of ACME for 3.2
- MINOR: acme: emit logs instead of ha_notice
- MINOR: acme: add a success message to the logs
- BUG/MINOR: acme/cli: fix certificate name in error message
- MINOR: acme: register the task in the ckch_store
- MINOR: acme: free acme_ctx once the task is done
- BUG/MEDIUM: h3: trim whitespaces when parsing headers value
- BUG/MEDIUM: h3: trim whitespaces in header value prior to QPACK encoding
- BUG/MINOR: h3: filter upgrade connection header
- BUG/MINOR: h3: reject invalid :path in request
- BUG/MINOR: h3: reject request URI with invalid characters
- MEDIUM: h3: use absolute URI form with :authority
- BUG/MEDIUM: hlua: fix hlua_applet_{http,tcp}_fct() yield regression (lost data)
- BUG/MINOR: mux-h2: prevent past scheduling with idle connections
- BUG/MINOR: rhttp: fix reconnect if timeout connect unset
- BUG/MINOR: rhttp: ensure GOAWAY can be emitted after reversal
- BUG/MINOR: mux-h2: do not apply timer on idle backend connection
- MINOR: mux-h2: refactor idle timeout calculation
- MINOR: mux-h2: prepare to support PING emission
- MEDIUM: server/mux-h2: implement idle-ping on backend side
- MEDIUM: listener/mux-h2: implement idle-ping on frontend side
- MINOR: mux-h2: do not emit GOAWAY on idle ping expiration
- MINOR: mux-h2: handle idle-ping on conn reverse
- BUILD: makefile: enable backtrace by default on musl
- BUG/MINOR: threads: set threads_idle and threads_harmless even with no threads
- BUG/MINOR debug: fix !USE_THREAD_DUMP in ha_thread_dump_fill()
- BUG/MINOR: wdt/debug: avoid signal re-entrance between debugger and watchdog
- BUG/MINOR: debug: detect and prevent re-entrance in ha_thread_dump_fill()
- MINOR: debug: do not statify a few debugging functions often used with wdt/dbg
- MINOR: tools: also protect the library name resolution against concurrent accesses
- MINOR: tools: protect dladdr() against reentrant calls from the debug handler
- MINOR: debug: protect ha_dump_backtrace() against risks of re-entrance
- MINOR: tinfo: keep a copy of the pointer to the thread dump buffer
- MINOR: debug: always reset the dump pointer when done
- MINOR: debug: remove unused case of thr!=tid in ha_thread_dump_one()
- MINOR: pass a valid buffer pointer to ha_thread_dump_one()
- MEDIUM: wdt: always make the faulty thread report its own warnings
- MINOR: debug: make ha_stuck_warning() only work for the current thread
- MINOR: debug: make ha_stuck_warning() print the whole message at once
- CLEANUP: debug: no longer set nor use TH_FL_DUMPING_OTHERS
- MINOR: sched: add a new function is_sched_alive() to report scheduler's health
- MINOR: wdt: use is_sched_alive() instead of keeping a local ctxsw copy
- MINOR: sample: add 4 new sample fetches for clienthello parsing
- REGTEST: add new reg-test for the 4 new clienthello fetches
- MINOR: servers: Move the per-thread server initialization earlier
- MINOR: proxies: Initialize the per-thread structure earlier.
- MINOR: servers: Provide a pointer to the server in srv_per_tgroup.
- MINOR: lb_fwrr: Move the next weight out of fwrr_group.
- MINOR: proxies: Add a per-thread group lbprm struct.
- MEDIUM: lb_fwrr: Use one ebtree per thread group.
- MEDIUM: lb_fwrr: Don't start all thread groups on the same server.
- MINOR: proxies: Do stage2 initialization for sinks too
In check_config_validity(), we initialize the proxy in several stages.
We do so for the sink list for stage1, but not for stage2. It may not be
needed right now, but it may become needed in the future, so do it
anyway.
Now that there is one tree per thread group, all thread groups will
start on the same server. To prevent that, just insert the servers in a
different order for each thread group.
When using the round-robin load balancer, the major source of contention
is the lbprm lock, which has to be held every time we pick a server.
To mitigate that, make it so there is one tree per thread group, and
one lock per thread group. That means we now have a lb_fwrr_per_tgrp
structure that will contain the two lb_fwrr_groups (active and backup) as well
as the lock to protect them in the per-thread lbprm struct, and all
fields in the struct server are now moved to the per-thread structure
too.
Those changes are mostly mechanical, and bring a good performance
improvement: on a 64-core AMD CPU, with 64 servers configured, we could
process about 620000 requests per second before, and we can now process around
1400000 requests per second.
Add a new structure in the per-thread groups proxy structure, that will
contain whatever is per-thread group in lbprm.
It will be accessed as p->per_tgrp[tgid].lbprm.
Move the "next_weight" outside of fwrr_group, and inside struct lb_fwrr
directly, one for the active servers, one for the backup servers.
We will soon have one fwrr_group per thread group, but next_weight will
be global to all of them.
Add a pointer to the server into the struct srv_per_tgroup, so that if
we only have access to that srv_per_tgroup, we can come back to the
corresponding server.
Move the call to initialize the proxy's per-thread structure earlier
than currently done, so that they are usable when we're initializing the
load balancers.
Move the code responsible for calling per-thread server initialization
earlier than it was done, so that per-thread structures are available a
bit later, when we initialize load-balancing.
This patch contains these 4 new fetches and doc changes for them:
- req.ssl_cipherlist
- req.ssl_sigalgs
- req.ssl_keyshare_groups
- req.ssl_supported_groups
Towards:#2532
Now we can simply call is_sched_alive() on the local thread to verify
that the scheduler is still ticking instead of having to keep a copy of
the ctxsw and comparing it. It's cleaner, doesn't require to maintain
a local copy, doesn't rely on activity[] (whose purpose is mainly for
observation and debugging), and shows how this could be extended later
to cover other use cases. Practically speaking this doesn't change
anything however, the algorithm is still the same.
This verifies that the scheduler is still ticking without having to
access the activity[] array nor keeping local copies of the ctxsw
counter. It just tests and sets a flag that is reset after each
return from a ->process() function.
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
It has been noticed quite a few times during troubleshooting and even
testing that warnings can happen in avalanches from multiple threads
at the same time, and that their reporting is interleaved because the
output is produced in small chunks. Originally, this code, inspired by
the panic code, aimed at making sure to log whatever could be emitted
in case it would crash later. But this approach was wrong since writes
are atomic, and performing 5 writes in sequence in each dumping thread
also means that the outputs can be mixed up at 5 different locations
between multiple threads. The output of warnings is never very long,
and the stack-based buffer is 4kB so let's just concatenate everything
in the buffer and emit it at once using a single write(). Now there's
no longer this confusion in the output.
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're now only dumping the
current thread so ha_thread_dump_one() is sufficient.
Warnings remain tricky to deal with, especially for other threads as
they require some inter-thread synchronization that doesn't cope very
well with other parallel activities such as "show threads" for example.
However there is nothing that forces us to handle them this way. The
panic for example is already handled by bouncing the WDT signal to the
faulty thread.
This commit rearranges the WDT handler to make better use of the
existing signal bouncing feature of the WDT handler so that it's no
longer limited to panics but can also deal with warnings. In order not
to bounce on all wakeups, we only bounce when there is a suspicion,
that is, when the warning timer has been crossed. We'll let the target
thread verify the stuck flag and context switch count by itself to
decide whether or not to panic, warn, or just do nothing and update
the counters.
As a bonus, now all warning traces look the same regardless of the
reporting thread:
call trace(16):
| 0x6bc733 <01 00 00 e8 6d e6 de ff]: ha_dump_backtrace+0x73/0x309 > main-0x2570
| 0x6bd37a <00 00 00 e8 d6 fb ff ff]: ha_thread_dump_fill+0xda/0x104 > ha_thread_dump_one
| 0x6bd625 <00 00 00 e8 7b fc ff ff]: ha_stuck_warning+0xc5/0x19e > ha_thread_dump_fill
| 0x7b2b60 <64 8b 3b e8 00 aa f0 ff]: wdt_handler+0x1f0/0x212 > ha_stuck_warning
| 0x7fd7e2cef3a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
| 0x7ffc6af9e634 <85 a6 00 00 00 0f 01 f9]: linux-vdso:__vdso_gettimeofday+0x34/0x2b0
| 0x6bad74 <7c 24 10 e8 9c 01 df ff]: sc_conn_io_cb+0x9fa4 > main-0x2400
| 0x67c457 <89 f2 4c 89 e6 41 ff d0]: main+0x1cf147
| 0x67d401 <48 89 df e8 8f ed ff ff]: cli_io_handler+0x191/0xb38 > main+0x1cee80
| 0x6dd605 <40 48 8b 45 60 ff 50 18]: task_process_applet+0x275/0xce9
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.
It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
This function was initially designed to dump any thread into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.
Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
We don't need to copy the old dump pointer to the thread_dump_pointer
area anymore to indicate a dump is collected. It used to be done as an
artificial way to keep the pointer for the post-mortem analysis but
since we now have this pointer stored separately, that's no longer
needed and it simplifies the mechanism to reset it.
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
If a thread is dumping itself (warning, show thread etc) and another one
wants to dump the state of all threads (e.g. panic), it may interrupt the
first one during backtrace() and re-enter it from the signal handler,
possibly triggering a deadlock in the underlying libc. Let's postpone
the debug signal delivery at this point until the call ends in order to
avoid this.
If a thread is currently resolving a symbol while another thread triggers
a thread dump, the current thread may enter the debug handler and call
resolve_sym_addr() again, possibly deadlocking if the underlying libc
uses locking. Let's postpone the debug signal delivery in this area
during the call. This will slow the resolution a little bit but we don't
care, it's not supposed to happen often and it must remain rock-solid.
This is an extension of eb41d768f ("MINOR: tools: use only opportunistic
symbols resolution"). It also makes sure we're not calling dladddr() in
parallel to dladdr_and_size(), as a preventive measure against some
potential deadlocks in the inner layers of the libc.
A few functions are used when debugging debug signals and watchdog, but
being static, they're not resolved and are hard to spot in dumps, and
they appear as any random other function plus an offset. Let's just not
mark them static anymore, it only hurts:
- cli_io_handler_show_threads()
- debug_run_cli_deadlock()
- debug_parse_cli_loop()
- debug_parse_cli_panic()
In the following trace trying to abuse the watchdog from the CLI's
"debug dev loop" command running in parallel to "show threads" loops,
it's clear that some re-entrance may happen in ha_thread_dump_fill().
A first minimal fix consists in using a test-and-set on the flag
indicating that the function is currently dumping threads, so that
the one from the signal just returns. However the caller should be
made more reliable to serialize all of this, that's for future
work.
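The guard itself is the usual test-and-set pattern, roughly as below
(illustrative sketch, not the actual haproxy code):

    #include <stdatomic.h>

    /* If the flag is already set, a dump is in progress on this path, so the
     * signal-triggered caller simply returns instead of re-entering the dump.
     */
    static atomic_flag dump_in_progress = ATOMIC_FLAG_INIT;

    void try_dump_threads(void)
    {
        if (atomic_flag_test_and_set(&dump_in_progress))
            return;

        /* ... fill the dump buffer ... */

        atomic_flag_clear(&dump_in_progress);
    }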
Here's an example capture of 7 threads stuck waiting for each other:
(gdb) bt
#0 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#1 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#2 0x00000000005ba4f5 in ha_thread_dump_fill (thr=2, buf=0x7ffdd8e08ab0) at src/debug.c:402
#3 ha_thread_dump_fill (buf=0x7ffdd8e08ab0, thr=<optimized out>) at src/debug.c:384
#4 0x00000000005baac4 in ha_stuck_warning (thr=thr@entry=2) at src/debug.c:840
#5 0x00000000006a360d in wdt_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized out>) at src/wdt.c:156
#6 <signal handler called>
#7 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#8 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#9 0x00000000005ba4c2 in ha_thread_dump_fill (thr=2, buf=0x7fe78f2d6420) at src/debug.c:426
#10 ha_thread_dump_fill (buf=0x7fe78f2d6420, thr=2) at src/debug.c:384
#11 0x00000000005ba7c6 in cli_io_handler_show_threads (appctx=0x2a89ab0) at src/debug.c:548
#12 0x000000000057ea43 in cli_io_handler (appctx=0x2a89ab0) at src/cli.c:1176
#13 0x00000000005d7885 in task_process_applet (t=0x2a82730, context=0x2a89ab0, state=<optimized out>) at src/applet.c:920
#14 0x0000000000659002 in run_tasks_from_lists (budgets=budgets@entry=0x7ffdd8e0a5c0) at src/task.c:644
#15 0x0000000000659bd7 in process_runnable_tasks () at src/task.c:886
#16 0x00000000005cdcc9 in run_poll_loop () at src/haproxy.c:2858
#17 0x00000000005ce457 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3075
#18 0x0000000000430628 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3665
As seen in issue #2860, there are some situations where a watchdog could
trigger during the debug signal handler, and where similarly the debug
signal handler may trigger during the wdt handler. This is really bad
because it could trigger some deadlocks inside inner libc code such as
dladdr() or backtrace() since the code will not protect against re-
entrance but only against concurrent accesses.
A first attempt was made using ha_sigmask() but that's not always very
convenient because the second handler is called immediately after
unblocking the signal and before returning, leaving signal cascades in
backtrace. Instead, let's mark which signals to block at registration
time. Here we're blocking wdt/dbg for both signals, and optionally
SIGRTMAX if DEBUG_DEV is used as that one may also be used in this case.
This should be backported at least to 3.1.
The function must make sure to return NULL for foreign threads and
the local buffer for the current thread in this case, otherwise panics
(and sometimes even warnings) will segfault when USE_THREAD_DUMP is
disabled. Let's slightly re-arrange the function to reduce the #if/else
since we have to specifically handle the case of !USE_THREAD_DUMP anyway.
This needs to be backported wherever b8adef065d ("MEDIUM: debug: on
panic, make the target thread automatically allocate its buf") was
backported (at least 2.8).
Some signal handlers rely on these to decide about the level of detail to
provide in dumps, so let's properly fill the info about entering/leaving
idle. Note that for consistency with other tests we're using bitops with
t->ltid_bit, while we could simply assign 0/1 to the fields. But it makes
the code more readable and the whole difference is only 88 bytes on a 3MB
executable.
This bug is not important, and while older versions are likely affected
as well, it's not worth taking the risk to backport this in case it would
wake up an obscure bug.
The reason musl builds were not producing exploitable backtraces is
that the toolchain used appears to automatically omit the frame pointer
at -O2 but leaves it at -O0. This patch just makes sure to always append
-fno-omit-frame-pointer to the BACKTRACE cflags and enables the option
with musl where it now works. This will allow us to finally get
exploitable traces from docker images where core dumps are not always
available.
This commit extends MUX H2 connection reversal step to properly take
into account the new idle-ping feature. It first ensures that h2c task
is properly instantiated/freed depending now on both timers and
idle-ping configuration. Also, h2c_update_timeout() is now called
instead of manually requeuing the task, which ensures the proper timer
value is selected depending on the new connection side.
If idle-ping is activated and h2c task is expired due to missing PING
ACK, consider that the peer is away and the connection can be closed
immediately. GOAWAY emission is thus skipped.
A new test is necessary in h2c_update_timeout() when PING ACK is
currently expected, but the next timer expiration selected is not
idle-ping. This may happen if http-keep-alive/http-request timers are
selected first. In this case, the H2_CF_IDL_PING_SENT flag is reset. This
is necessary to not prevent GOAWAY emission on expiration.
This commit is the counterpart of the previous one, adapted on the
frontend side. "idle-ping" is added as keyword to bind lines, to be able
to refresh client timeout of idle frontend connections.
H2 MUX behavior remains similar to the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now taken
into account also for frontend connections. The calculated value is
compared with the http-request/http-keep-alive timeout value. The shorter
delay is then used as the expiration date. As hr/ka timeouts are based on
idle_start, this allows running them in parallel with an idle-ping timer.
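An illustrative bind line (values are arbitrary):

    frontend fe
        bind :443 ssl crt site.pem alpn h2 idle-ping 30s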
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.
The second part of this commit implements idle-ping support on H2 MUX. A
new inlined function conn_idle_ping() is defined to access connection
idle-ping value. Two new connection flags are defined: H2_CF_IDL_PING and
H2_CF_IDL_PING_SENT. The first one is set for idle connections via
h2c_update_timeout().
On h2_timeout_task() handler, if first flag is set, instead of releasing
the connection as before, the second flag is set and tasklet is
scheduled. As both flags are now set, h2_process_mux() will proceed to
PING emission. The timer has also been rearmed to the idle-ping value.
If a PING ACK is received before next timeout, connection timer is
refreshed. Else, the connection is released, as with timer expiration.
Also of importance, special care is needed when a backend connection is
going to idle. In this case, idle-ping timer must be rearmed. Thus a new
invocation of h2c_update_timeout() is performed on h2_detach().
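An illustrative server line (values are arbitrary):

    backend be
        server srv1 192.0.2.10:443 ssl verify none alpn h2 idle-ping 30s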
Adapt the already existing function h2c_ack_ping(). The objective is to
be able to emit a PING request. First, it is renamed as h2c_send_ping().
A new boolean argument <ack> is used to emit either a PING request or
ack.
Reorganize code for timeout calculation in case the connection is idle.
The objective is to better reflect the relations between each timeouts
as follow :
* if GOAWAY already emitted, use shut-timeout, or if unset fallback to
client/server one. However, an already set timeout is never erased.
* else, for frontend connection, http-request or keep-alive timeout is
applied depending on the current demux state. If the selected value is
unset, fallback to client timeout
* for backend connection, no timeout is set to perform http-reuse
This commit is pure refactoring, so no functional change should occur.
Since the following commit, MUX H2 timeout function has been slightly
extended.
d38d8c6ccb189e7bc813b3693fec3093c9be55f1
BUG/MEDIUM: mux-h2: make sure control frames do not refresh the idle timeout
A side-effect of this patch is that now backend idle connection expire
timer is not reset if already defined. This means that if a timer was
registered prior to the connection transition to idle, the connection
would be destroyed on its timeout. If this happens for enough
connections, this may have an impact on the reuse rate.
In practice, this case should be rare, as the h2c timer is set to
TICK_ETERNITY while there are active streams. The timer is not refreshed
most of the time before the transition to idle, so the connection
won't be deleted on expiration.
The only case where it could occur is if there is still pending data
blocked on emission on stream detach. Here, timeout server is applied on
the connection. When the emission completes, the connection goes to
idle, but the timer will still be armed, and thus will be triggered on the
idle connection.
To prevent this, explicitly reset the h2c timer to TICK_ETERNITY for idle
backend connections via h2c_update_timeout().
This patch is explicitly not scheduled for backport for now, as it is
difficult to estimate the real impact of the previous code state.
GOAWAY should not be emitted before the preface. Thus, the max_id field
from h2c acting as a server is initialized to -1, which prevents its
emission until preface is received from the peer. If acting as a client,
max_id is initialized to a valid value on the first h2s emission.
This causes an issue with reverse HTTP on the active side. First, it
starts as a client, so the peer does not emit a preface but instead a
simple SETTINGS frame. As roles are switched, max_id is initialized much
later when the first h2s response is emitted. Thus, if the connection
must be terminated before any stream transfer, GOAWAY cannot be emitted.
To fix this, ensure max_id is initialized to 0 on h2_conn_reverse() for
active connect side. Thus, a GOAWAY indicating that no stream has been
handled can be generated.
Note that the passive connect side is not impacted, as its max_id is
initialized thanks to preface reception.
This should be backported up to 2.9.
Active connect on reverse http relies on connect timeout to detect
connection failure. Thus, if this timeout was unset, connection failure
may not be properly detected.
Fix this by falling back on a hardcoded value of 1s for connect if the
timeout is unset in the configuration. This is considered a minor bug, as
haproxy advises against running with timeouts unset.
This must be backported up to 2.9.
While reviewing HTTP/2 MUX timeouts, it seems there is a possibility that the
MUX task is requeued via h2c_update_timeout() with an already expired
date. This can happen with idle connections in two cases:
* first with shut timeout, as timer is not refreshed if already set
* second with http-request and keep-alive timers, which are based on
idle_start
Queuing an already expired task is undefined behavior. Fix this by
using task_wakeup() instead of task_queue() at the end of
h2c_update_timeout() if such case occurs.
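The resulting pattern can be sketched as follows (not the literal code):
    /* queuing a task whose expiration date is already in the past is
     * undefined, so wake it up immediately instead
     */
    if (tick_isset(h2c->task->expire) && tick_is_expired(h2c->task->expire, now_ms))
            task_wakeup(h2c->task, TASK_WOKEN_TIMER);
    else
            task_queue(h2c->task);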
This should be backported up to 2.6.
Jacques Heunis from Bloomberg reported on the mailing list [1] that
with haproxy 2.8 up to master, yielding from a Lua tcp service while
data was still buffered inside haproxy would eat some data which was
definitely lost.
He provided the reproducer below which turned out to be really helpful:
global
log stdout format raw local0 info
lua-load haproxy_yieldtest.lua
defaults
log global
timeout connect 10s
timeout client 1m
timeout server 1m
listen echo
bind *:9090
mode tcp
tcp-request content use-service lua.print_input
haproxy_yieldtest.lua:
core.register_service("print_input", "tcp", function(applet)
core.Info("Start printing input...")
while true do
local inputs = applet:getline()
if inputs == nil or string.len(inputs) == 0 then
core.Info("closing input connection")
return
end
core.Info("Received line: "..inputs)
core.yield()
end
end)
And the script below:
#!/usr/bin/bash
for i in $(seq 1 9999); do
for j in $(seq 1 50); do
echo "${i}_foo_${j}"
done
sleep 2
done
Using it like this:
./test_seq.sh | netcat localhost 9090
We can clearly see the missing data for every "foo" burst (every 2
seconds), as there are holes in the numbering.
Thanks to the reproducer, it was quickly found that only versions
>= 2.8 were affected, and that in fact this regression was introduced
by commit 31572229e ("MEDIUM: hlua/applet: Use the sedesc to report and
detect end of processing")
In fact, in 31572229e two mistakes were made during the refactoring.
Indeed, both in hlua_applet_tcp_fct() (which is involved in the reproducer
above) and hlua_applet_http_fct(), the request (buffer) is now
systematically consumed when returning from the function, which wasn't the
case prior to this commit: when HLUA_E_AGAIN is returned, it means a
yield was requested and that the processing is not done yet, thus we
should not consume any data, as was the case prior to the refactoring.
Big thanks to Jacques who did a great job reproducing and reporting this
issue on the mailing list.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg45778.html
It should be backported up to 2.8 with commit 31572229e
Change the representation of the start-line URI when parsing a HTTP/3
request into HTX. Adopt the same conversion as HTTP/2. If :authority
header is used (default case), the URI is encoded using absolute-form,
with scheme, host and path concatenated. If only a plain host header is
used instead, fallback to the origin form.
This commit may cause some configurations to be broken if parsing is
performed on the URI. Indeed, now most of the HTTP/3 requests will be
represented with an absolute-form URI at the stream layer.
Note that prior to this commit a check was performed on the path used as
URI to ensure that it did not contain any invalid characters. Now, this
is directly performed on the URI itself, which may include the path.
This must not be backported.
Ensure that the HTX start-line generated after parsing an HTTP/3 request
does not contain any invalid character, i.e. control or whitespace
characters.
Note that for now path is used directly as URI. Thus, the check is
performed directly over it. A patch will change this to generate an
absolute-form URI in most cases, but it won't be backported to avoid
configuration breaking in stable versions.
This must be backported up to 2.6.
RFC 9114 specifies some requirements for :path pseudo-header when using
http or https scheme. This commit enforces this by rejecting a request
if needed. Thus, path cannot be empty, and it must either start with a
'/' character or contain only '*'.
This must be backported up to 2.6.
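As a rough sketch of the rule above, assuming a hypothetical helper working
on an ist string (this is not the actual H3 parser code):
    static int h3_path_is_valid(const struct ist path)
    {
            if (!path.len)
                    return 0;                 /* empty :path is rejected */
            if (*istptr(path) == '/')
                    return 1;                 /* must start with '/' ... */
            return isteq(path, ist("*"));     /* ... or be exactly "*" */
    }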
As specified in RFC 9114, connection headers require special care in
HTTP/3. When a request is received with connection headers, the stream
is immediately closed. Conversely, when translating the response from
HTX, such headers are not encoded but silently ignored.
However, "upgrade" was not listed in connection headers. This commit
fixes this by adding a check on it both on request parsing and response
encoding.
This must be backported up to 2.6.
This commit does a similar job than the previous one, but it acts now on
the response path. Any leading or trailing whitespaces characters from a
HTX block header value are removed, prior to the header encoding via
QPACK.
This must be backported up to 2.6.
Remove any leading and trailing whitespace from header field values
prior to inserting a new HTX header block. This is done when parsing a
HEADERS frame, both as headers and trailers.
This must be backported up to 2.6.
Free the acme_ctx task context once the task is done.
It frees everything but the config and the httpclient.
The ckch_store is freed in case of error, but when the task is
successful, the ptr is set to NULL to prevent the free once inserted in
the tree.
This patch registers the task in the ckch_store so we don't run 2 tasks
at the same time for a given certificate.
Move the task creation under the lock and check if there was already a
task under the lock.
Exponential backoff values were multiplied by 3000 instead of 3 with a
seconds-to-ms conversion, leading to a 9000000ms value at the 2nd
attempt.
Fix the issue by setting the value in seconds and converting the value
in tick_add().
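A hedged sketch of the intended computation, with hypothetical names:
    /* keep the backoff delay in seconds and convert only once when arming */
    unsigned int delay_s = acme_backoff_seconds(ctx);    /* hypothetical helper */
    task->expire = tick_add(now_ms, delay_s * 1000);     /* s -> ms happens here */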
No backport needed.
When receiving the final certificate, it needs to be loaded by
ssl_sock_load_pem_into_ckch(). However this function will remove any
existing private key in the struct ckch_store.
In order to fix the issue, the ptr to the key is swapped with a NULL
ptr, and restored once the new certificate is committed.
However, there is a discrepancy when ssl_sock_load_pem_into_ckch()
fails: the pointer is lost.
This patch fixes the issue by restoring the pointer in the error path.
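A minimal sketch of the save/restore pattern, with simplified field names and
prototype (the real code differs):
    EVP_PKEY *save = data->key;           /* keep the existing private key */
    data->key = NULL;                     /* so the PEM loader cannot drop it */
    if (ssl_sock_load_pem_into_ckch(path, pem, data, &err) != 0) {
            data->key = save;             /* previously lost on this error path */
            goto error;
    }
    /* on success, the saved key is put back later, once the new cert is committed */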
This must fix issue #2933.
Add a 1MB ring buffer called "dpapi" for communication with the
dataplane API. It would first be used to transmit ACME information to
the dataplane API but could be used for more.
A server abort must be handled by the request analyzers only when the
response forwarding was already started. Otherwise, it is the responsibility
of the response analyzer to detect this event. L7-retries and conditions to
decide to silently close a client connection are handled by this analyzer.
Because a reused server connection closed too early could be detected at
the wrong place, it was possible to get a 502/SH instead of a silent close,
preventing the client from safely retrying its request.
Thanks to this patch, we are able to silently close the client connection in
this case and eventually to perform a L7 retry.
This patch must be backported as far as 2.8.
During the response payload forwarding, if the back SC is closed, we try to
figure out if it is because of a client abort or a server abort. However,
the condition was not accurate, especially when the abortonclose option is
set. Because of this issue, a server abort may be reported (SD-- in logs)
instead of a client abort (CD-- in logs).
The right way to detect a client abort when we try to forward the response
is to test if the back SC was shut down (SC_FL_SHUT_DOWN flag set) AND
aborted (SC_FL_ABRT_DONE flag set). When these both flags are set, it means
the back connection underwent the shutdown, which should be converted to a
client abort at this stage.
This patch should be backported as far as 2.8. It should fix last strange SD
report in the issue #2749.
src/jws.c: In function '__jws_init':
src/jws.c:594:38: error: passing argument 2 of 'hap_register_unittest' from incompatible pointer type [-Wincompatible-pointer-types]
594 | hap_register_unittest("jwk", jwk_debug);
| ^~~~~~~~~
| |
| int (*)(int, char **)
In file included from include/haproxy/api.h:36,
from include/import/ebtree.h:251,
from include/import/ebmbtree.h:25,
from include/haproxy/jwt-t.h:25,
from src/jws.c:5:
include/haproxy/init.h:37:52: note: expected 'int (*)(void)' but argument is of type 'int (*)(int, char **)'
37 | void hap_register_unittest(const char *name, int (*fct)());
| ~~~~~~^~~~~~
GCC 15 warns because the function pointer in the registration prototype
does not declare its arguments, so it no longer matches the callback.
Should fix issue #2929.
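One plausible resolution, assuming the unit-test callbacks are indeed meant
to receive argc/argv as jwk_debug() does, is to declare the registration
prototype accordingly (sketch only):
    /* include/haproxy/init.h (sketch) */
    void hap_register_unittest(const char *name, int (*fct)(int argc, char **argv));

    /* src/jws.c */
    static int jwk_debug(int argc, char **argv);

    hap_register_unittest("jwk", jwk_debug);   /* prototypes now match */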
>>> CID 1609049: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value "NULL" to "new_ckchs" here, but that stored value is overwritten before it can be used.
592 struct ckch_store *old_ckchs, *new_ckchs = NULL;
Coverity reported an issue where a variable is initialized to NULL then
directly overwritten with another value. This doesn't harm, but this patch
removes the useless initialization.
Must fix issue #2932.
In call traces, we're interested in seeing the code that was executed, not
the code that was not yet. The return address is where the CPU will return
to, so we want to see the bytes that precede this location. In the example
below on x86 we can clearly see a number of direct "call" instructions
(0xe8 + 4 bytes). There are also indirect calls (0xffd0) that cannot be
exploited but it gives insights about where the code branched, which will
not always be the function above it if that one used tail branching for
example. Here's an example dump output:
call ------------,
v
0x6bd634 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95
0x7fa4ea2593a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
0x752132 <00 00 00 00 00 90 41 55]: htx_remove_blk+0x2/0x354
0x5b1a2c <4c 89 ef e8 04 07 1a 00]: main+0x10471c
0x5b5f05 <48 89 df e8 8b b8 ff ff]: main+0x108bf5
0x60b6f4 <89 ee 4c 89 e7 41 ff d0]: tcpcheck_eval_send+0x3b4/0x14b2
0x610ded <00 00 00 e8 53 a5 ff ff]: tcpcheck_main+0x7dd/0xd36
0x6c5ab4 <48 89 df e8 5c ab f4 ff]: wake_srv_chk+0xc4/0x3d7
0x6c5ddc <48 89 f7 e8 14 fc ff ff]: srv_chk_io_cb+0xc/0x13
For code dumps, dumping from the return address is pointless, what is
interesting is to dump before the return address to read the machine
code that was executed before branching. Let's just make the function
support negative sizes to indicate that we're dumping this number of
bytes leading up to the address instead of starting from it. In this
case, in order to distinguish them, we're using a '<' instead of '[' to
start the series of bytes, indicating where the bytes expand and where
they stop. For example we can now see this:
0x6bd634 <64 8b 38 e8 ac f7 ff ff]: debug_handler+0x84/0x95
0x7fa4ea2593a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
0x752132 <00 00 00 00 00 90 41 55]: htx_remove_blk+0x2/0x354
0x5b1a2c <4c 89 ef e8 04 07 1a 00]: main+0x10471c
0x5b5f05 <48 89 df e8 8b b8 ff ff]: main+0x108bf5
0x60b6f4 <89 ee 4c 89 e7 41 ff d0]: tcpcheck_eval_send+0x3b4/0x14b2
0x610ded <00 00 00 e8 53 a5 ff ff]: tcpcheck_main+0x7dd/0xd36
0x6c5ab4 <48 89 df e8 5c ab f4 ff]: wake_srv_chk+0xc4/0x3d7
0x6c5ddc <48 89 f7 e8 14 fc ff ff]: srv_chk_io_cb+0xc/0x13
These counters can have a noticeable cost on large machines, though not
dramatic. There's no single good choice to keep them enabled or disabled.
This commit adds multiple choices:
- DEBUG_COUNTERS set to 2 will automatically enable them by default, while
1 will disable them by default
- the global "debug.counters on/off" will allow to change the setting at
boot, regardless of DEBUG_COUNTERS as long as it was at least 1.
- the CLI "debug counters on/off" will also allow to change the value at
run time, allowing to observe a phenomenon while it's happening, or to
disable counters if it's suspected that their cost is too high
Finally, the "debug counters" command will append "(stopped)" at the end
of the CNT lines when these counters are stopped.
Note that the whole mechanism would easily support being extended to all
counter types by specifying the types to apply to, but it doesn't seem
useful at all and would require the user to also type "cnt" on debug
lines. This may easily be changed in the future if it's found relevant.
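For illustration, usage could look like the snippet below; the socket path is
an assumption, the keyword spellings are those quoted above:
    # boot-time default in the configuration
    global
        debug.counters off

    # run-time toggle and dump from the CLI
    $ echo "debug counters on" | socat stdio unix-connect:/var/run/haproxy.sock
    $ echo "debug counters"    | socat stdio unix-connect:/var/run/haproxy.sock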
COUNT_IF() is convenient but can be heavy since some of them were found
to trigger often (roughly 1 counter per request on avg). This might even
have an impact on large setups due to the cost of a shared cache line
bouncing between multiple cores. For now there's no way to disable it,
so let's only enable it when DEBUG_COUNTERS is 1 or above. A future
change will make it configurable.
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_COUNTERS=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
It's easy to leave some trailing \n or even other characters that can
mangle the debug output. Let's verify at boot time that the debug sections
are clean by checking for chars 0x20 to 0x7e inclusive. This is very simple
to do and it managed to find another one in a multi-line message:
[WARNING] (23696) : Invalid character 0x0a at position 96 in description string at src/cli.c:2516 _send_status()
This way new offending code will be spotted before being committed.
These ones were added by mistake during the change of the cfgparse
mechanism in 3.1, but they're corrupting the output of "debug counters"
by leaving stray ']' on their own lines. We could possibly check them
all once at boot but it doesn't seem worth it.
This should be backported to 3.1.
The following error is triggered at linker invocation, when we try to compile with
USE_THREAD=0 and -O0.
make -j 8 TARGET=linux-glibc USE_LUA=1 USE_PCRE2=1 USE_LINUX_CAP=1 \
USE_MEMORY_PROFILING=1 OPT_CFLAGS=-O0 USE_THREAD=0
/usr/bin/ld: src/thread.o: warning: relocation against `thread_cpus_enabled_at_boot' in read-only section `.text'
/usr/bin/ld: src/thread.o: in function `thread_detect_count':
/home/vk/projects/haproxy/src/thread.c:1619: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: /home/vk/projects/haproxy/src/thread.c:1619: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: /home/vk/projects/haproxy/src/thread.c:1620: undefined reference to `thread_cpus_enabled_at_boot'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:1044: haproxy] Error 1
thread_cpus_enabled_at_boot is only available when compiled with
USE_THREAD=1, which is the default for most targets now.
In some cases, we need to recompile in mono-thread mode, thus
thread_cpus_enabled_at_boot should be protected with USE_THREAD in
thread_detect_count().
thread_detect_count() is always called during process initialization,
regardless of multi-thread support. It sets some defaults in global.nbthread
and global.nbtgroups.
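A minimal sketch of the guard described above; the surrounding code and exact
conditions are simplified:
    #if defined(USE_THREAD)
            /* the symbol only exists when built with thread support */
            if (!global.nbthread)
                    global.nbthread = thread_cpus_enabled_at_boot;
    #else
            if (!global.nbthread)
                    global.nbthread = 1;
    #endif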
This patch is related to GitHub issue #2916.
No need to be backported as it was added in 3.2-dev9 version.
This step is the last one required to have a usable ACME certificate in
haproxy. It looks for the previous certificate, locks the "BIG CERTIFICATE
LOCK", copies every instance, deploys the new ones and removes the previous
ones.
This is done in one step in a function which does not yield, so it could
be problematic if you have thousands of instances to handle.
It still lacks the rate limit which is mandatory to be used in
production, and more cleanup and deinit.
Copy the original ckch_store instead of creating a new one. This allows
to inherit the ckch_conf from the previous structure when doing a
ckchs_dup(). The ckch_conf contains the SAN for ACME.
Free the previous PKEY since a new one is generated.
Handle new members of the ckch_conf in ckchs_dup() and
ckch_conf_clean().
This could be automated at some point since we have the description of
all types in ckch_conf_kws.
Once the Order status is "valid", the certificate URL is accessible,
this patch implements the retrieval of the certificate, which is stored
in ctx->store.
Generate the X509_REQ using the generated private key and the SAN from
the configuration. This is only done once before the task is started.
It could probably be done at the beginning of the task with the private
key generation once we have a scheduler instead of a CLI command.
This patch implements a check on the challenge URL: once haproxy has asked
for the challenge to be verified, it must verify the status of the
challenge resolution and check that there weren't any errors.
This patch sends the "{}" message to specify that a challenge is ready.
It iterates on every challenge URL in the authorization list from the
acme_ctx.
This allows the ACME server to proceed with the challenge validation.
https://www.rfc-editor.org/rfc/rfc8555#section-7.5.1
This patch implements the retrieval of the challenges objects on the
authorizations URLs. The challenges object contains a token and a
challenge url that need to be called once the challenge is setup.
Each authorization URL contains multiple challenge objects, usually one
per challenge type (HTTP-01, DNS-01, ALPN-01...). We only need to keep the
one that is relevant to our configuration.
This patch implements the newOrder action in the ACME task: in order to
ask for a new certificate, a list of SANs is sent as a JWS payload.
The ACME server replies with a list of Authorization URLs. One Authorization
is created per SAN on an Order.
The authorization URLs are stored in a linked list of 'struct acme_auth'
in acme_ctx, so we can get the challenge URLs from them later.
The location header is also stored as it is the URL of the order object.
https://datatracker.ietf.org/doc/html/rfc8555#section-7.4
This patch implements the retrieval of the KID (account identifier) using
the pkey.
A request is sent to the newAccount URL using the onlyReturnExisting
option, which allows retrieving the kid of an existing account.
acme_jws_payload() implement a way to generate a JWS payload using the
nonce, pkey and provided URI.
ACME requests are supposed to be sent with a Nonce, the first Nonce
should be retrieved using the newNonce URI provided by the directory.
This nonce is stored and must be replaced by the new one received in
each response.
The first request of the ACME protocol is getting the list of URLs for
the next steps.
This patch implements the first request and the parsing of the response.
The response is a JSON object so mjson is used to parse it.
The "acme renew" command launch the ACME task for a given certificate.
The CLI parser generates a new private key using the parameters from the
acme section.
This commit allows to configure the generated private keys, you can
configure the keytype (RSA/ECDSA), the number of bits or the curves.
Example:
acme LE
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
keytype ECDSA
curves P-384
Add new acme keywords for the ckch_conf parsing, which will be used on a
crt-store, a crt line in a frontend, or even a crt-list.
The cfg_postparser_acme() is called in order to check if a section referenced
elsewhere really exists in the config file.
Add a configuration parser for the new acme section, the section is
configured this way:
acme letsencrypt
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
When unspecified, the challenge defaults to HTTP-01, and the account key
to "<section_name>.account.key".
Section are stored in a linked list containing acme_cfg structures, the
configuration parsing is mostly resolved in the postsection parser
cfg_postsection_acme() which is called after the parsing of an acme section.
Some rare commands in the worker require to keep their input open and
terminate when it's closed ("show events -w", "wait"). Others maintain
a per-session context ("set anon on"). But in its default operation
mode, the master CLI passes commands one at a time to the worker, and
closes the CLI's input channel so that the command can immediately
close upon response. This effectively prevents these two specific cases
from being used.
Here the approach that we take is to introduce a bidirectional mode to
connect to the worker, where everything sent to the master is immediately
forwarded to the worker (including the raw command), allowing to queue
multiple commands at once in the same session, and to continue to watch
the input to detect when the client closes. It must be a client's choice
however, since doing so means that the client cannot batch many commands
at once to the master process, but must wait for these commands to complete
before sending new ones. For this reason we use the prefix "@@<pid>" for
this. It works exactly like "@" except that it maintains the channel
open during the whole execution. Similarly to "@<pid>" with no command,
"@@<pid>" will simply open an interactive CLI session to the worker, that
will be ended by "quit" or by closing the connection. This can be convenient
for the user, and possibly for clients willing to dedicate a connection to
the worker.
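An illustrative session could look like this; the socket path and ring name
are assumptions:
    $ socat readline unix-connect:/var/run/haproxy-master.sock
    @@1
    show events buf0 -w
    ... events keep streaming until "quit" or the connection is closed ...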
The '@' prefix permits executing a single command at once in a worker.
It is very handy but comes with some limitations affecting rare commands,
which is better documented (one command per session, input closed)
since it can seldom have user-visible effects.
While the examples were clear, the text did not fully imply what was
reflected there. Better have the text explicitly mention that the
'@' command may be used as a prefix or wrapper in front of a command
as well as a standalone command.
Released version 3.2-dev10 with the following main changes :
- REORG: ssl: move curves2nid and nid2nist to ssl_utils
- BUG/MEDIUM: stream: Fix a possible freeze during a forced shut on a stream
- MEDIUM: stream: Save SC and channel flags earlier in process_steam()
- BUG/MINOR: peers: fix expire learned from a peer not converted from ms to ticks
- BUG/MEDIUM: peers: prevent learning expiration too far in futur from unsync node
- CI: spell check: allow manual trigger
- CI: codespell: add "pres" to spellcheck whitelist
- CLEANUP: assorted typo fixes in the code, commits and doc
- CLEANUP: atomics: remove support for gcc < 4.7
- CLEANUP: atomics: also replace __sync_synchronize() with __atomic_thread_fence()
- TESTS: Fix build for filltab25.c
- MEDIUM: ssl: replace "crt" lines by "ssl-f-use" lines
- DOC: configuration: replace "crt" by "ssl-f-use" in listeners
- MINOR: backend: mark srv as nonnull in alloc_dst_address()
- BUG/MINOR: server: ensure check-reuse-pool is copied from default-server
- MINOR: server: activate automatically check reuse for rhttp@ protocol
- MINOR: check/backend: support conn reuse with SNI
- MINOR: check: implement check-pool-conn-name srv keyword
- MINOR: task: add thread safe notification_new and notification_wake variants
- BUG/MINOR: hlua_fcn: fix potential UAF with Queue:pop_wait()
- MINOR: hlua_fcn: register queue class using hlua_register_metatable()
- MINOR: hlua: add core.wait()
- MINOR: hlua: core.wait() takes optional delay paramater
- MINOR: hlua: split hlua_applet_tcp_recv_yield() in two functions
- MINOR: hlua: add AppletTCP:try_receive()
- MINOR: hlua_fcn: add Queue:alarm()
- MEDIUM: task: make notification_* API thread safe by default
- CLEANUP: log: adjust _lf_cbor_encode_byte() comment
- MEDIUM: ssl/crt-list: warn on negative wildcard filters
- MEDIUM: ssl/crt-list: warn on negative filters only
- BUILD: atomics: fix build issue on non-x86/non-arm systems
- BUG/MINOR: log: fix CBOR encoding with LOG_VARTEXT_START() + lf_encode_chunk()
- BUG/MEDIUM: sample: fix risk of overflow when replacing multiple regex back-refs
- DOC: configuration: rework the crt-list section
- MINOR: ring: support arbitrary delimiters through ring_dispatch_messages()
- MINOR: ring/cli: support delimiting events with a trailing \0 on "show events"
- DEV: h2: fix h2-tracer.lua nil value index
- BUG/MINOR: backend: do not use the source port when hashing clientip
- BUG/MINOR: hlua: fix invalid errmsg use in hlua_init()
- MINOR: proxy: add setup_new_proxy() function
- MINOR: checks: mark CHECKS-FE dummy frontend as internal
- MINOR: flt_spoe: mark spoe agent frontend as internal
- MEDIUM: tree-wide: avoid manually initializing proxies
- MINOR: proxy: add deinit_proxy() helper func
- MINOR: checks: deinit checks_fe upon deinit
- MINOR: flt_spoe: deinit spoe agent proxy upon agent release
Even though spoe agent proxy is statically allocated, it uses the proxy
API and is initialized like a regular proxy, thus specific cleanup is
required upon release. This is not tagged as a bug because as of now this
would only cause some minor memory leak upon deinit.
We check the presence of proxy->id to know if it was initialized since
we cannot rely on a pointer for that.
This is just to make valgrind and friends happy, leverage deinit_proxy()
for checks_fe proxy upon deinit to ensure proper cleanup.
We check the presence of proxy->id to know if it was initialized because
we cannot rely on a pointer for that.
Same as free_proxy(), but does not free the base proxy pointer (ie: the
proxy itself may not be allocated)
Goal is to be able to cleanup statically allocated dummy proxies.
In this patch we try to use the proxy API init functions as much as
possible to avoid code redundancy and prevent proxy initialization
errors. As such, we prefer using alloc_new_proxy() and setup_new_proxy()
instead of manually allocating the proxy pointer and performing the
base init ourselves.
spoe agent frontend is used by the agent internally, but it is not meant
to be directly exposed like user-facing proxies defined in the config.
As such, better mark it as internal using PR_CAP_INT capability to prevent
any mis-use.
CHECKS-FE frontend is a dummy frontend used to create checks sessions;
as such, it is internal and should not be exposed to the user.
Better mark it as internal using PR_CAP_INT capability to prevent
proxy API from ever exposing it.
Split alloc_new_proxy() in two functions: the preparing part is now
handled by setup_new_proxy() which can be called individually, while
alloc_new_proxy() takes care of allocating a new proxy struct and then
calling setup_new_proxy() with the freshly allocated proxy.
errmsg is used with memprintf and friends, thus it must be NULL
initialized before being passed to memprintf, else an invalid read will
occur.
However in hlua_init() the errmsg value isn't initialized, let's fix that
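The expected pattern is simply (sketch, condition and message are
illustrative):
    char *errmsg = NULL;     /* must start NULL: memprintf() frees the previous value */

    if (something_failed)
            memprintf(&errmsg, "Lua init: %s", cause);
    ...
    ha_free(&errmsg);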
This is really minor because it would only cause issue on error paths,
yet it may be backported to all stable versions, just in case.
The server's "usesrc" keyword supports among other options "client"
and "clientip". The former means we bind to the client's IP and port
to connect to the server, while the latter means we bind to its IP
only. It's done in two steps, first alloc_bind_address() retrieves
the IP address and port, and second, tcp_connect_server() decides
to either bind to the IP only or IP+port.
The problem comes with idle connection pools, which hash all the
parameters: the hash is calculated before (and ideally without) calling
tcp_connect_server(), and it considers the whole struct sockaddr_storage
for the hash, except that both client and clientip entirely fill it with
the client's address. This means that both client and clientip make use
of the source port in the hash calculation, making idle connections
almost not reusable when using "usesrc clientip" while they should for
clients coming from the same source. A work-around is to force the
source port to zero using "tcp-request session set-src-port int(0)" but
it's ugly.
Let's fix this by properly zeroing the port for AF_INET/AF_INET6 addresses.
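A minimal sketch of the idea (not the literal code):
    /* with "usesrc clientip", drop the source port so that the idle-conn
     * hash only depends on the client's IP address
     */
    switch (src->ss_family) {
    case AF_INET:
            ((struct sockaddr_in *)src)->sin_port = 0;
            break;
    case AF_INET6:
            ((struct sockaddr_in6 *)src)->sin6_port = 0;
            break;
    }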
This can be backported to 2.4. Thanks to Sebastien Gross for providing a
reproducer for this problem.
Nick Ramirez reported the following error while testing the h2-tracer.lua
script:
Lua filter 'h2-tracer' : [state-id 0] runtime error: /etc/haproxy/h2-tracer.lua:227: attempt to index a nil value (field '?') from /etc/haproxy/h2-tracer.lua:227: in function line 109.
It is caused by h2ff indexing with an out of bound value. Indeed, h2ff
is indexed with the frame type, which can potentially be > 9 (not common
nor observed during Willy's tests), while h2ff only defines indexes from
0 to 9.
The fix was provided by Willy, it consists in skipping h2ff indexing if
frame type is > 9. It was confirmed that doing so fixes the error.
At the moment it is not supported to produce multi-line events on the
"show events" output, simply because the LF character is used as the
default end-of-event mark. However it could be convenient to produce
well-formatted multi-line events, e.g. in JSON or other formats. UNIX
utilities have already faced similar needs in the past and added
"-print0" to "find" and "-0" to "xargs" to mention that the delimiter
is the NUL character. This makes perfect sense since it's never present
in contents, so let's do exactly the same here.
Thus from now on, "show events <ring> -0" will delimit messages using
a \0 instead of a \n, permitting a better and safer encapsulation.
In order to support delimiting output events with other characters than
just the LF, let's pass the delimiter through the API. The default remains
the LF, used by applet_append_line(), and ignored by the log forwarder.
Aleandro Prudenzano of Doyensec and Edoardo Geraci of Codean Labs
reported a bug in sample_conv_regsub(), which can cause replacements
of multiple back-references to overflow the temporary trash buffer.
The problem happens when doing "regsub(match,replacement,g)": we're
replacing every occurrence of "match" with "replacement" in the input
sample, which requires a length check. For this, a max is applied, so
that a replacement may not use more than the remaining length in the
buffer. However, the length check is made on the replaced pattern and
not on the temporary buffer used to carry the new string. This results
in the remaining size to be usable for each input match, which can go
beyond the temporary buffer size if more than one occurrence has to be
replaced with something that's larger than the remaining room.
The fix proposed by Aleandro and Edoardo is the correct one (check on
"trash" not "output"), and is the one implemented in this patch.
While it is very unlikely that a config will replace multiple short
patterns each with a larger one in a request, this possibility cannot
be entirely ruled out (e.g. mask a known, short IP address using
"XXX.XXX.XXX.XXX"). However when this happens, the replacement pattern
will be static, and not be user-controlled, which is why this patch is
marked as medium.
The bug was introduced in 2.2 with commit 07e1e3c93e ("MINOR: sample:
regsub now supports backreferences"), so it must be backported to all
versions.
Special thanks go to Aleandro and Edoardo for reporting this bug with
a simple reproducer and a fix.
There have been some reports that using %HV logformat alias with CBOR
encoder would produce invalid CBOR payload according to some CBOR
implementations such as "cbor.me". Indeed, with the below log-format:
log-format "%{+cbor}o %(protocol)HV"
And the resulting CBOR payload:
BF6870726F746F636F6C7F48485454502F312E31FFFF
cbor.me would complain with: "bytes/text mismatch (ASCII-8BIT != UTF-8) in
streaming string") error message.
It is due to the version string being first announced as text, while CBOR
encoder actually encodes it as byte string later when lf_encode_chunk()
is used.
In fact it affects all patterns combining LOG_VARTEXT_START() with
lf_encode_chunk() which means %HM, %HU, %HQ, %HPO and %HP are also
affected. To fix the issue, in _lf_encode_bytes() (which is
lf_encode_chunk() helper), we now check if we are inside a VARTEXT (we
can tell it if ctx->in_text is true), in which case we consider that we
already announced the current data as regular text so we keep the same
type to encode the bytes from the chunk to prevent inconsistencies.
It should be backported in 3.0
Commit f435a2e518 ("CLEANUP: atomics: also replace __sync_synchronize()
with __atomic_thread_fence()") replaced the builtins used for barriers,
but the different API required an argument while the macros didn't specify
any, resulting in double parenthesis that were causing obscure build errors
such as "called object type 'void' is not a function or function pointer".
Let's just specify the args for the macro. No backport is needed.
negative SNI filters on crt-list lines only have a meaning when they
match a positive wildcard filter. This patch adds a warning which
is emitted when trying to use negative filters without any wildcard on
the same line.
This was discovered in ticket #2900.
negative wildcard filters were always a noop, and are not useful for
anything unless you want to use !* alone to remove every name from a
certificate.
This is confusing and the documentation never stated it correctly. This
patch adds a warning during the bind initialization if it finds one;
only !* does not emit a warning.
This patch was done during the debugging of issue #2900.
_lf_cbor_encode_byte() comment was not updated in c33b857df ("MINOR: log:
support true cbor binary encoding") to reflect the new behavior.
Indeed, binary form is now supported. Updating the comment that says
otherwise.
Some notification_* functions were not thread safe by default as they
assumed only one producer would emit events for registered tasks.
While this suited well with the Lua sockets use-case, this proved to
be a limitation with some other event sources (ie: lua Queue class).
Instead of having to deal with both the non-thread-safe and thread-safe
variants (_mt suffix), which is error prone, let's make the
entire API thread safe regarding the event list.
Pruning functions still require that only one thread executes them,
with Lua this is always the case because there is one cleanup list
per context.
Queue:alarm() sets a wakeup alarm on the task when new data becomes
available on Queue. It must be re-armed for each event.
Lua documentation was updated
This is the non-blocking variant for AppletTCP:receive(). It doesn't
take any argument, instead it tries to read as much data as available
at once. If no data is available, empty string is returned.
Lua documentation was updated.
core.wait() now accepts an optional delay parameter in ms. Past this delay,
the task is woken up if no event woke the task before.
Lua documentation was updated.
Similar to core.yield(), except that the task is not woken up
automatically, instead it waits for events to trigger the task
wakeup.
Lua documentation was updated.
If Queue:pop_wait() is executed from a stream context and pop_wait() is
aborted due to a Lua or resource error, then the waiting object pointing
to the task will still be registered, so if the task eventually disappears,
Queue:push() may try to wake an invalid task pointer.
To prevent this bug from happening, we now rely on notification_* API to
deliver waiting signals. This way signals are properly garbage collected
when a lua context is destroyed.
It should be backported in 2.8 with 86fb22c55 ("MINOR: hlua_fcn: add Queue
class").
This patch depends on ("MINOR: task: add thread safe notification_new and
notification_wake variants")
notification_new and notification_wake were historically meant to be
called by a single thread doing both the init and the wakeup for other
tasks waiting on the signals.
In this patch, we extend the API so that notification_new and
notification_wake have thread-safe variants that can safely be used with
multiple threads registering on the same list of events and multiple
threads pushing updates on the list.
This commit is a direct follow-up of the previous one. It defines a new
server keyword check-pool-conn-name. It is used as the default value for
the name parameter of idle connection hash generation.
Its behavior is similar to server keyword pool-conn-name, but reserved
for checks reuse. If check-pool-conn-name is set, it is used in priority
to match a connection for reuse. If unset, a fallback is performed on
check-sni.
Support for connection reuse during server checks was implemented
recently. This is activated with the server keyword check-reuse-pool.
Similarly to stream processing via connect_backend(), a connection hash
is calculated when trying to perform reuse for checks. This is necessary
to retrieve a connection which shares the check connect parameters.
However, idle connections can additionally be tagged using a
pool-conn-name or SNI under connect_backend(). Check reuse does not test
these values, which prevents retrieving a matching connection.
Improve this by using "check-sni" value as idle connection hash input
for check reuse. be_calculate_conn_hash() API has been adjusted so that
name value can be passed as input, both when using streams or checks.
Even with the current patch, there are still some scenarios which cannot
be covered for checks connection reuse, most notably when using a
dynamic pool-conn-name/SNI value. It is however at least sufficient to
cover the simpler cases.
Without check-reuse-pool, it is impossible to perform check on server
using @rhttp protocol. This is due to the inherent nature of the
protocol which does not implement an active connect method.
Thus, ensure that check-reuse-pool is always set when a reverse HTTP
server is declared. This reduces server configuration and should prevent
any omission. Note that it is still required to add the "check" server
keyword to activate server checks.
Duplicate server check.reuse_pool boolean value in srv_settings_cpy().
This is necessary to ensure that check-reuse-pool value can be set via
default-server or server-template.
This does not need to be backported.
Server instance can be NULL on connect_server(), either when dispatch or
transparent proxy are active. However, in alloc_dst_address() access to
<srv> is safe thanks to SF_ASSIGNED stream flag. Add an ASSUME_NONNULL()
to reflect this state.
This should fix coverity report from github issue #2922.
Replace the "crt" keyword from the frontend section with a "ssl-f-use"
keyword, "crt" could be ambigous in case we don't want to put a
certificate filename.
The new "crt" lines in frontend and listen sections are confusing:
- a filename is mandatory but we could need a syntax without the
filename in the future, if the filename is generated for example
- there is no clue about the fact that its only used on the frontend
side when reading the line
A new "ssl-f-use" line replaces the "crt" line, but a "crt" keyword
can be used on this line. "f" indicates that this is the frontend
configuration, a "ssl-b-use" keyword could be used in the future.
The "crt" lines only appeared in 3.2-dev so this won't change anything
for people using configurations from previous major versions.
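For illustration, a frontend could now be written as below; apart from the
"crt" keyword mentioned above, the exact line syntax is an assumption:
    frontend www
        bind :443 ssl
        ssl-f-use crt /etc/haproxy/ssl/site.pem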
The old __sync_* API is no longer necessary since we do not support
gcc before 4.7 anymore. Let's just get rid of this code, the file is
still ugly enough without it.
spellcheck was triggered by the following:
* pres : same as "res" but using the parent stream, if any. "pres"
variables are only accessible during response processing of the
parent stream.
This patch sets the expire of the entry to the max value in the
configuration if the value shown in the peer update message
is too far in the future.
This should be backported to all supported branches.
At the begining of process_stream(), the flags of the stream connectors and
channels are saved to be able to handle changes performed in sub-functions
(for instance in analyzers). But, some operations were performed before
saving these flags: Synchronous receives and forced shutdowns. While it
seems to safe for now, it is a bit annoying because some events could be
missed.
So, to avoid bugs in the future, the channels and stream connectors flags
are now really saved before any other processing.
When a forced shutdown is performed on a stream, it is possible to freeze it
infinitely because it is performed in an unexpected way from process_stream()'s
point of view, especially when the stream is waiting for a server
connection. The event sequence is a bit complex but in the end the stream
remains blocked in turn-around state and no events are triggered to unblock
it.
By trying to fix the issue, we considered it was safer to rethink the
feature. The idea is to quickly shutdown a stream to release resources. For
instance to be able to delete a server. So, instead of scheduling a
shutdown, it is more efficient to trigger an error and detach the stream
from the server, if necessary. The same code as the one used to deal with
connection errors in back_handle_st_cer() is used.
This patch must be slowly backported as far as 2.6.
Released version 3.2-dev9 with the following main changes :
- MINOR: quic: move global tune options into quic_tune
- CLEANUP: quic: reorganize TP flow-control initialization
- MINOR: quic: ignore uni-stream for initial max data TP
- MINOR: mux-quic: define config for max-data
- MINOR: quic: define max-stream-data configuration as a ratio
- MEDIUM: lb-chash: add directive hash-preserve-affinity
- MEDIUM: pools: be a bit smarter when merging comparable size pools
- REGTESTS: disable the test balance/balance-hash-maxqueue
- BUG/MINOR: log: fix gcc warn about truncating NUL terminator while init char arrays
- CI: fedora rawhide: allow "on: workflow_dispatch" in forks
- CI: fedora rawhide: install "awk" as a dependency
- CI: spellcheck: allow "on: workflow_dispatch" in forks
- CI: coverity scan: allow "on: workflow_dispatch" in forks
- CI: cross compile: allow "on: workflow_dispatch" in forks
- CI: Illumos: allow "on: workflow_dispatch" in forks
- CI: NetBSD: allow "on: workflow_dispatch" in forks
- CI: QUIC Interop on AWS-LC: allow "on: workflow_dispatch" in forks
- CI: QUIC Interop on LibreSSL: allow "on: workflow_dispatch" in forks
- MINOR: compiler: add __nonstring macro
- MINOR: thread: dump the CPU topology in thread_map_to_groups()
- MINOR: cpu-set: compare two cpu sets with ha_cpuset_isequal()
- MINOR: cpu-set: add a new function to print cpu-sets in human-friendly mode
- MINOR: cpu-topo: add a dump of thread-to-CPU mapping to -dc
- MINOR: cpu-topo: pass an extra argument to ha_cpu_policy
- MINOR: cpu-topo: add new cpu-policies "group-by-2-clusters" and above
- BUG/MINOR: config: silence .notice/.warning/.alert in discovery mode
- EXAMPLES: add "games.cfg" and an example game in Lua
- MINOR: jws: emit the JWK thumbprint
- TESTS: jws: change the jwk format
- MINOR: ssl/ckch: add substring parser for ckch_conf
- MINOR: mt_list: Implement mt_list_try_lock_prev().
- MINOR: lbprm: Add method to deinit server and proxy
- MINOR: threads: Add HA_RWLOCK_TRYRDTOWR()
- MAJOR: leastconn; Revamp the way servers are ordered.
- BUG/MINOR: ssl/ckch: leak in error path
- BUILD: ssl/ckch: potential null pointer dereference
- MINOR: log: support "raw" logformat node typecast
- CLEANUP: assorted typo fixes in the code and comments
- DOC: config: fix two missing "content" in "tcp-request" examples
- MINOR: cpu-topo: cpu_dump_topology() SMT info check little optimisation
- BUILD: compiler: undefine the CONCAT() macro if already defined
- BUG/MEDIUM: leastconn: Don't try to reposition if the server is down
- BUG/MINOR: rhttp: fix incorrect dst/dst_port values
- BUG/MINOR: backend: do not overwrite srv dst address on reuse
- BUG/MEDIUM: backend: fix reuse with set-dst/set-dst-port
- MINOR: sample: define bc_reused fetch
- REGTESTS: extend conn reuse test with transparent proxy
- MINOR: backend: fix comment when killing idle conns
- MINOR: backend: adjust conn_backend_get() API
- MINOR: backend: extract conn hash calculation from connect_server()
- MINOR: backend: extract conn reuse from connect_server()
- MINOR: backend: remove stream usage on connection reuse
- MINOR: check define check-reuse-pool server keyword
- MEDIUM: check: implement check-reuse-pool
- BUILD: backend: silence a build warning when not using ssl
- BUILD: quic_sock: address a strict-aliasing build warning with gcc 5 and 6
- BUILD: ssl_ckch: use my_strndup() instead of strndup()
- DOC: update INSTALL to reflect the minimum compiler version
The mt_list update in 3.1 mandated the support for c11-like atomics that
arrived with gcc-4.7. As such, older versions are no longer supported.
For special cases in single-threaded environments, mt_lists could be
replaced with regular lists but it doesn't seem worth the hassle. It
was verified that gcc 4.7 to 14 and clang 3.0 and 19 do build fine.
That leaves us with 10 years of coverage of compiler versions, which
remains reasonable assuming that users of old ultra-stable systems are
unlikely to upgrade haproxy without touching the rest of the system.
This should be backported to 3.1.
Not all systems have strndup(), that's why we have our "my_strndup()",
so let's make use of it here. This fixes the build on Solaris 10.
No backport is needed, this was just merged with commit fdcb97614c
("MINOR: ssl/ckch: add substring parser for ckch_conf").
The UDP GSO code emits a build warning with older toolchains (gcc 5 and 6):
src/quic_sock.c: In function 'cmsg_set_gso':
src/quic_sock.c:683:2: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
*((uint16_t *)CMSG_DATA(c)) = gso_size;
^
Let's just use the write_u16() function that's made for this purpose.
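The change boils down to the following; the first line is the one quoted in
the warning above:
    /* before: type-punned store, triggers -Wstrict-aliasing on gcc 5/6 */
    *((uint16_t *)CMSG_DATA(c)) = gso_size;

    /* after: byte-wise 16-bit store through the helper */
    write_u16(CMSG_DATA(c), gso_size);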
It was verified that for all versions from 5 to 13, gcc produces the
exact same code with the fix (and without the warning). It arrived in
3.1 with commit 448d3d388a ("MINOR: quic: add GSO parameter on quic_sock
send API") so this can be backported there.
Since recent commit ee94a6cfc1 ("MINOR: backend: extract conn reuse
from connect_server()") a build warning "set but not used" on the
"reuse" variable is emitted, because indeed the variable is now only
checked when SSL is in use. Let's just mark it as such.
Implement the possibility to reuse idle connections when performing
server checks. This is done thanks to the recently introduced functions
be_calculate_conn_hash() and be_reuse_connection().
One side effect of this change is that be_calculate_conn_hash() can now
be called with a NULL stream instance. As such, part of the functions
are adjusted accordingly.
Note that to simplify configuration, connection reuse is not performed
if any specific check connection parameters are defined on the server
line or via the tcp-check connect rule. This is performed via newly
defined tcpcheck_use_nondefault_connect().
Define a new server keyword check-reuse-pool, and its counterpart with a
"no" prefix. For the moment, only parsing is implemented. The real
behavior adjustment will be implemented in the next patch.
Adjust newly defined be_reuse_connection() API. The stream argument is
removed. This will allow checks to invoke it without relying
on a stream instance.
Following the previous patch, the part directly related to connection
reuse is extracted from connect_server(). It is now defined in a new
function be_reuse_connection().
On connection reuse, a hash is first calculated. It is generated from
various connection parameters, to retrieve a matching connection.
Extract hash calculation from connect_server() into a new dedicated
function be_calculate_conn_hash(). The objective is to be able to
perform connection reuse for checks, without a connect_server() invocation
which relies on a stream instance.
The main objective of this patch is to remove the stream instance from
conn_backend_get() parameters. This would allow to perform reuse outside
of stream contexts, for example for checks purpose.
Previously, if a server reached its pool-high-count limit, connection
were killed on connect_server() when reuse was not possible. However,
this is now performed even if reuse is done since the following patch :
b3397367dc7cec9e78c62c54efc24d9db5cde2d2
MEDIUM: connections: Kill connections even if we are reusing one.
Thus, adjust the related comment to reflect this state.
Recently, work on connection reuses reveals an issue when mixed with
transparent proxy and set-dst. This patch rewrites the related regtests
to be able to catch this now fixed bug.
Note that it is the first regtest which relies on the recently introduced
bc_reused sample fetch. This fetch could be reused in other related
connection reuse regtests to simplify them.
On backend connection reuse, a hash is calculated from various
parameters, to ensure the selected connection match the requested
parameters. Notably, destination address is one of these parameters.
However, it is only taken into account if using a transparent server
(server address 0.0.0.0).
This may cause an issue where an incorrect connection is reused, which is
not targeted at the correct destination address. This may be the case
if a set-dst/set-dst-port is used with a transparent proxy (proxy option
transparent).
The fix is simple enough. Destination address is now always used as
input to the connection reuse hash.
This must be backported up to 2.6. Note that for reverse HTTP to work,
it relies on the following patch, which ensures destination address
remains NULL in this case.
commit e94baf6ca71cb2319610baa74dbf17b9bc602b18
BUG/MINOR: rhttp: fix incorrect dst/dst_port values
Previously, destination address of backend connection was systematically
always reassigned. However, this step is unnecessary on connection
reuse. Indeed, reuse should only be conducted with connection using the
same destination address matching the stream requirements.
This patch removes this unnecessary assignment. It is now only performed
when reuse cannot be conducted and a new connection is instantiated.
Functionally speaking, this patch should not change anything in theory,
as reuse is performed in conformance with the destination address.
However, it appears that it was not always properly enforced. The
systematic assignment of the destination address hides these issues, so
it is now removed. The identified bogus cases will then be fixed in the
following patches.
This should be backported up to all stable versions.
With a @rhttp server, connect is not possible, transfer is only possible
via idle connection reuse. The server does not have any network address.
Thus, it is unnecessary to allocate the stream destination address prior
to connection reuse. This patch adjusts this by fixing
alloc_dst_address() to take this into account.
Prior to this patch, alloc_dst_address() would incorrectly assimilate a
@rhttp server with a transparent proxy mode. Thus stream destination
address would be copied from the destination address. The connection address
would then be rewritten with this incorrect value. This did not impact
connect or reuse as destination addr is only used in idle conn hash
calculation for transparent servers. However, it causes incorrect values
for dst/dst_port samples.
This should be backported up to 2.9.
It may happen that the server is going down, and fwlc_srv_reposition()
is still called, because streams still attached to the server are
being terminated.
So in fwlc_srv_reposition(), just do nothing if we've been removed from
the tree.
This should fix github issue #2919.
This should not be backported, unless commit
9fe72bba3cf3484577fa1ef00723de08df757996 is also backported.

As Ilya reported in issue #2911, the CONCAT() macro breaks on NetBSD
which defines its own as __CONCAT() (which is exactly the same). Let's
just undefine it before ours to fix the issue instead of renaming, but
keep ours so that we don't have doubts about what we're running with.
Note that the patch introducing this breaking change was backported
to 3.0.
As reported by Uku Srmus in GitHub issue #2917, two "tcp-request" rules
in an example were mistakenly missing the "content" hook, rendering them
invalid.
This can be backported.
"raw" logformat node typecast is a special value (unlike str,bool,int..)
which tells haproxy to completely ignore logformat options (including
encoding ones) and force binary output for the current node only. It is
mainly intended for use with JSON or CBOR encoders in order to generate
nested CBOR or nested JSON by storing intermediate log-formats within
variables and assembling the final object in the parent log-format.
Example:
http-request set-var-fmt(txn.intermediate) "%{+json}o %(lower)[str(value)]"
log-format "%{+json}o %(upper)[str(value)] %(intermediate:raw)[var(txn.intermediate)]"
Would produce:
{"upper": "value", "intermediate": {"lower": "value"}}
src/ssl_ckch.c: In function ‘ckch_conf_parse’:
src/ssl_ckch.c:4852:40: error: potential null pointer dereference [-Werror=null-dereference]
4852 | while (*r) {
| ^~
Add a test on r before using *r.
No backport needed
fdcb97614cb ("MINOR: ssl/ckch: add substring parser for ckch_conf")
introduced a leak in the error path when the strndup fails.
This patch fixes issue #2920. No backport needed.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbprm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The values used were picked after experimentation, and
seem to be the best compromise between performance and memory
usage.
Doing that gives a good boost in performance when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on a 64-core AMD, using 1200 concurrent connections.
Add two new methods to lbprm, server_deinit() and proxy_deinit(),
in case something should be done at the lbprm level when
removing servers and proxies.
Implement mt_list_try_lock_prev(), which does the same thing
as mt_list_lock_prev(), except that if the list is locked, it
returns { NULL, NULL } instead of waiting.
jwk_thumbprint() is a function which implements
RFC 7638 and emits a JWK thumbprint using an EVP_PKEY.
EVP_PKEY_EC_to_pub_jwk() and EVP_PKEY_RSA_to_pub_jwk() were changed in
order to match what is required to emit a thumbprint (ie, no spaces or
lines and the lexicographic order of the fields)
The purpose is mainly to exhibit certain limitations that come with such
less common programming models, to show users how to program interactive
tools in Lua, and how to connect interactively.
Other use cases that could be envisioned are "top" and various monitoring
utilities, with sliding graphs etc. Lua is particularly attractive for
this usage, easy to program, well known from most AI tools (including its
integration into haproxy), making such programs very quick to obtain in
their basic form, and to improve later.
A very limited example game is provided, following the principle of a
very popular one, where the player must compose lines from falling
pieces. It quickly revealed the need for the ability to enforce a timeout
on applet:receive(). Other identified limitations include the difficulty
from the Lua side to monitor multiple events at once, but it seems that
callbacks and/or event dispatchers would be useful here.
At the moment the CLI is not workable (its interactivity was broken in 2.9
when line buffering was adopted), though it was verified that it works
with older releases.
The command needed to connect to the game is displayed as a notice message
during boot.
When first pre-parsing the config to detect the presence or absence of
the master mode, we must not emit messages because they are not supposed
to be visible at this point, otherwise they appear twice each. The
pre-parsing, also called discovery mode, is only for internal use,
thus it should remain silent.
This should be backported to 3.1 where this mode was introduced.
This adds "group-by-{2,3,4}-clusters", which, as its name implies,
create one thread group per X clusters. This can be useful when CPUs
are split into too small clusters, as well as when the total number
of assigned cores is not even between the clusters, to try to spread
the load between less different ones.
When emitting the CPU topology info with -dc, also emit a list of
thread-to-CPU mapping. The group/thread and thread ID are emitted
with the list of their CPUs on each line. The count of CPUs is shown
to ease comparisons, and as much as possible, we try to pack identical
lines within a group by showing thread ranges.
The new function "print_cpu_set()" will print cpu sets in a human-friendly
way, with commas and dashes for intervals. The goal is to keep them compact
enough.
It was previously done in thread_detect_count() but that's not quite
handy because we still don't know about the groups setting. Better do
it slightly later and have all the relevant info instead.
GCC 15 throws the following warning on fixed-size char arrays if they do not
contain a terminating NUL:
src/tools.c:2041:25: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
2041 | const char hextab[16] = "0123456789ABCDEF";
We are using a couple of such definitions for some constants. Converting them
to flexible arrays, like: hextab[] = "0123456789ABCDEF" may have consequences,
as the enlarged arrays won't fit anymore where they were possibly located,
due to memory alignment constraints.
GCC adds 'nonstring' variable attribute for such char arrays, but clang and
other compilers don't have it. Let's wrap 'nonstring' with our
__nonstring macro, which will test if the compiler supports this attribute.
This fixes the issue #2910.
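A minimal sketch of what such a wrapper can look like (illustrative, not the
exact macro from compiler.h):
    /* expand to the attribute only when the compiler supports it */
    #if defined(__has_attribute)
    # if __has_attribute(nonstring)
    #  define __nonstring __attribute__((nonstring))
    # endif
    #endif
    #ifndef __nonstring
    # define __nonstring
    #endif

    /* usage: the array intentionally carries no NUL terminator */
    const char hextab[16] __nonstring = "0123456789ABCDEF";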
gcc 15 throws such kind of warnings about initialization of some char arrays:
src/log.c:181:33: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
181 | const char sess_term_cond[16] = "-LcCsSPRIDKUIIII"; /* normal, Local, CliTo, CliErr, SrvTo, SrvErr, PxErr, Resource, Internal, Down, Killed, Up, -- */
| ^~~~~~~~~~~~~~~~~~
src/log.c:182:33: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (9 chars into 8 available) [-Werror=unterminated-string-initialization]
182 | const char sess_fin_state[8] = "-RCHDLQT"; /* cliRequest, srvConnect, srvHeader, Data, Last, Queue, Tarpit */
So, let's make it happy by not giving the sizes of these char arrays
explicitly, so that it can accommodate their NUL terminators.
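For example (taken from the warning above):
    /* let the compiler size the array so the NUL terminator fits */
    const char sess_fin_state[] = "-RCHDLQT";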
Reported in GitHub issue #2910.
This should be backported up to 2.6.
This test brought by commit 8ed1e91efd ("MEDIUM: lb-chash: add directive
hash-preserve-affinity") seems to have hit a limitation of what can be
expressed in vtc, as it would be desirable to have one server response
release two clients at once but the various attempts using barriers
have failed so far. The test seems to work fine locally but still fails
almost 100% of the time on the CI, so it remains timing dependent in
some ways. Tests have been done with nbthread 1, pool-idle-shared off,
http-reuse never (since it always fails locally), etc, but to no avail. Let's
just mark it broken in case we later figure out another way to fix it. It's
still usable locally most of the time, though.
By default, pools of comparable sizes are merged together. However, the
current algorithm is dumb: it rounds the requested size to the next
multiple of 16 and compares the sizes like this. This results in many
entries which are already multiples of 16 not being merged, for example
1024 and 1032 are separate, 65536 and 65540 are separate, 48 and 56 are
separate (though 56 merges with 64).
This commit changes this to consider not just the entry size but also the
average entry size, that is, it compares the average size of all objects
sharing the pool with the size of the object looking for a pool. If the
object is not more than 1% bigger or smaller than the current average
size, or if it is neither 16 bytes smaller nor larger, then it can be merged.
Also, it always respects exact matches in order to avoid merging objects
into larger pools or worse, extending existing ones for no reason, and
when there's a tie, it always avoids extending an existing pool.
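The acceptance test can be sketched roughly like this (illustrative only, not
the actual create_pool() code; needs <stddef.h>):
    /* <sz> is the requested size, <avg> the pool's current average object
     * size, both in bytes
     */
    static int pool_may_merge(size_t sz, size_t avg)
    {
            size_t diff = (sz > avg) ? sz - avg : avg - sz;

            if (sz == avg)
                    return 1;        /* exact match is always accepted */
            if (diff < 16)
                    return 1;        /* neither 16 bytes smaller nor larger */
            if (diff * 100 <= avg)
                    return 1;        /* within ~1% of the current average */
            return 0;
    }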
Also, we now visit all existing pools in order to spot the best one; we
no longer stop at the smallest one that is large enough. Theoretically this
could cost a bit of CPU but in practice it's O(N^2) with N quite small
(typically in the order of 100) and the cost at each step is very low
(comparing a few integer values). But as a side effect, pools are no
longer sorted by size; "show pools bysize" is needed for this.
This causes the objects to be much better grouped together, accepting to
use a little bit more sometimes to avoid fragmentation, without causing
everyone to be merged into the same pool. Thanks to this we're now
seeing 36 pools instead of 48 by default, with some very nice examples
of compact grouping:
- Pool qc_stream_r (80 bytes) : 13 users
> qc_stream_r : size=72 flags=0x1 align=0
> quic_cstrea : size=80 flags=0x1 align=0
> qc_stream_a : size=64 flags=0x1 align=0
> hlua_esub : size=64 flags=0x1 align=0
> stconn : size=80 flags=0x1 align=0
> dns_query : size=64 flags=0x1 align=0
> vars : size=80 flags=0x1 align=0
> filter : size=64 flags=0x1 align=0
> session pri : size=64 flags=0x1 align=0
> fcgi_hdr_ru : size=72 flags=0x1 align=0
> fcgi_param_ : size=72 flags=0x1 align=0
> pendconn : size=80 flags=0x1 align=0
> capture : size=64 flags=0x1 align=0
- Pool h3s (56 bytes) : 17 users
> h3s : size=56 flags=0x1 align=0
> qf_crypto : size=48 flags=0x1 align=0
> quic_tls_se : size=48 flags=0x1 align=0
> quic_arng : size=56 flags=0x1 align=0
> hlua_flt_ct : size=56 flags=0x1 align=0
> promex_metr : size=48 flags=0x1 align=0
> conn_hash_n : size=56 flags=0x1 align=0
> resolv_requ : size=48 flags=0x1 align=0
> mux_pt : size=40 flags=0x1 align=0
> comp_state : size=40 flags=0x1 align=0
> notificatio : size=48 flags=0x1 align=0
> tasklet : size=56 flags=0x1 align=0
> bwlim_state : size=48 flags=0x1 align=0
> xprt_handsh : size=48 flags=0x1 align=0
> email_alert : size=56 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
- Pool quic_cids (32 bytes) : 13 users
> quic_cids : size=16 flags=0x1 align=0
> quic_tls_ke : size=32 flags=0x1 align=0
> quic_tls_iv : size=12 flags=0x1 align=0
> cbuf : size=32 flags=0x1 align=0
> hlua_queuew : size=24 flags=0x1 align=0
> hlua_queue : size=24 flags=0x1 align=0
> promex_modu : size=24 flags=0x1 align=0
> cache_st : size=24 flags=0x1 align=0
> spoe_appctx : size=32 flags=0x1 align=0
> ehdl_sub_tc : size=32 flags=0x1 align=0
> fcgi_flt_ct : size=16 flags=0x1 align=0
> sig_handler : size=32 flags=0x1 align=0
> pipe : size=24 flags=0x1 align=0
- Pool quic_crypto (1032 bytes) : 2 users
> quic_crypto : size=1032 flags=0x1 align=0
> requri : size=1024 flags=0x1 align=0
- Pool quic_conn_r (65544 bytes) : 2 users
> quic_conn_r : size=65536 flags=0x1 align=0
> dns_msg_buf : size=65540 flags=0x1 align=0
On a very unscientific test consisting in sending 1 million H1 requests
and 1 million H2 requests to the stats page, we're seeing an ~6% lower
memory usage with the patch:
before the patch:
Total: 48 pools, 4120832 bytes allocated, 4120832 used (~3555680 by thread caches).
after the patch:
Total: 36 pools, 3880648 bytes allocated, 3880648 used (~3299064 by thread caches).
This should be taken with care however since pools allocate and release
in batches.
When using hash-based load balancing, requests are always assigned to
the server corresponding to the hash bucket for the balancing key,
without taking maxconn or maxqueue into account, unlike in other load
balancing methods like 'first'. This adds a new backend directive that
can be used to take maxconn and possibly maxqueue into account in that
context. This
can be used when hashing is desired to achieve cache locality, but
sending requests to a different server is preferable to queuing for a
long time or failing requests when the initial server is saturated.
By default, affinity is preserved as was the case previously. When
'hash-preserve-affinity' is set to 'maxqueue', servers are considered
successively in the order of the hash ring until a server that does not
have a full queue is found.
When 'maxconn' is set on a server, queueing cannot be disabled, as
'maxqueue=0' means unlimited. To support picking a different server
when a server is at 'maxconn' irrespective of the queue,
'hash-preserve-affinity' can be set to 'maxconn'.
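An illustrative configuration (the directive name and values come from the
description above; the rest of the syntax is only an assumption):
    backend app
        balance uri
        hash-preserve-affinity maxconn
        server s1 192.0.2.10:80 maxconn 100
        server s2 192.0.2.11:80 maxconn 100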
Define a new global configuration tune.quic.frontend.max-data. This
allows users to explicitly set the value for the corresponding QUIC TP
initial-max-data, with direct impact on haproxy memory consumption.
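For example (illustrative; the exact value syntax accepted by the keyword is
an assumption):
    global
        tune.quic.frontend.max-data 1048576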
Initial TP value for max-data is automatically calculated to be adjusted
to the maximum number of opened streams over a QUIC connection. This
took into account both max-streams-bidi-remote and uni-streams. By
default, this is equivalent to 100 + 3 = 103 max opened streams.
This patch simplifies the calculation by only using bidirectional
streams. Uni streams are ignored because they are only used for HTTP/3
control exchanges, which should only represent a few bytes. For now,
users can only configure the max number of remote bidi streams, so the
simplified calculation should make more sense to them.
Note that this relies on the assumption that HTTP/3 is used as
application protocol. To support other protocols, it may be necessary to
review this and take into account both local bidi and uni streams.
Adjust initialization of flow-control transport parameters via
quic_transport_params_init().
This is purely cosmetic, with some comments added. It is also a
preparatory step for future patches with addition of new configuration
keywords related to flow-control TP values.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.
This commit moves several QUIC-related tunables from the global structure
to the quic_tune structure. This better centralizes QUIC configuration
options and gives room for future generic options.
Released version 3.2-dev8 with the following main changes :
- MINOR: jws: implement JWS signing
- TESTS: jws: implement a test for JWS signing
- CI: github: add "jose" to apt dependencies
- CLEANUP: log-forward: remove useless options2 init
- CLEANUP: log: add syslog_process_message() helper
- MINOR: proxy: add proxy->options3
- MINOR: log: migrate log-forward options from proxy->options2 to options3
- MINOR: log: provide source address information in syslog_process_message()
- MINOR: tools: only print address in sa2str() when port == -1
- MINOR: log: add "option host" log-forward option
- MINOR: log: handle log-forward "option host"
- MEDIUM: log: change default "host" strategy for log-forward section
- BUG/MEDIUM: thread: use pthread_self() not ha_pthread[tid] in set_affinity
- MINOR: compiler: add a simple macro to concatenate resolved strings
- MINOR: compiler: add a new __decl_thread_var() macro to declare local variables
- BUILD: tools: silence a build warning when USE_THREAD=0
- BUILD: backend: silence a build warning when threads are disabled
- DOC: management: rename some last occurences from domain "dns" to "resolvers"
- BUG/MINOR: stats: fix capabilities and hide settings for some generic metrics
- MINOR: cli: export cli_io_handler() to ease symbol resolution
- MINOR: tools: improve symbol resolution without dl_addr
- MINOR: tools: ease the declaration of known symbols in resolve_sym_name()
- MINOR: tools: teach resolve_sym_name() a few more common symbols
- BUILD: tools: avoid a build warning on gcc-4.8 in resolve_sym_name()
- DEV: ncpu: also emulate sysconf() for _SC_NPROCESSORS_*
- DOC: design-thoughts: commit numa-auto.txt
- MINOR: cpuset: make the API support negative CPU IDs
- MINOR: thread: rely on the cpuset functions to count bound CPUs
- MINOR: cpu-topo: add ha_cpu_topo definition
- MINOR: cpu-topo: allocate and initialize the ha_cpu_topo array.
- MINOR: cpu-topo: rely on _SC_NPROCESSORS_CONF to trim maxcpus
- MINOR: cpu-topo: add a function to dump CPU topology
- MINOR: cpu-topo: update CPU topology from excluded CPUs at boot
- REORG: cpu-topo: move bound cpu detection from cpuset to cpu-topo
- MINOR: cpu-topo: add detection of online CPUs on Linux
- MINOR: cpu-topo: add detection of online CPUs on FreeBSD
- MINOR: cpu-topo: try to detect offline cpus at boot
- MINOR: cpu-topo: add CPU topology detection for linux
- MINOR: cpu-topo: also store the sibling ID with SMT
- MINOR: cpu-topo: add NUMA node identification to CPUs on Linux
- MINOR: cpu-topo: add NUMA node identification to CPUs on FreeBSD
- MINOR: thread: turn thread_cpu_mask_forced() into an init-time variable
- MINOR: cfgparse: move the binding detection into numa_detect_topology()
- MINOR: cfgparse: use already known offline CPU information
- MINOR: global: add a command-line option to enable CPU binding debugging
- MINOR: cpu-topo: add a new "cpu-set" global directive to choose cpus
- MINOR: cpu-topo: add "drop-cpu" and "only-cpu" to cpu-set
- MEDIUM: thread: start to detect thread groups and threads min/max
- MEDIUM: cpu-topo: make sure to properly assign CPUs to threads as a fallback
- MEDIUM: thread: reimplement first numa node detection
- MEDIUM: cfgparse: remove now unused numa & thread-count detection
- MINOR: cpu-topo: refine cpu dump output to better show kept/dropped CPUs
- MINOR: cpu-topo: fall back to nominal_perf and scaling_max_freq for the capacity
- MINOR: cpu-topo: use cpufreq before acpi cppc
- MINOR: cpu-topo: boost the capacity of performance cores with cpufreq
- MINOR: cpu-topo: skip CPU detection when /sys/.../cpu does not exist
- MINOR: cpu-topo: skip identification of non-existing CPUs
- MINOR: cpu-topo: skip CPU properties that we've verified do not exist
- MINOR: cpu-topo: implement a sorting mechanism for CPU index
- MINOR: cpu-topo: implement a sorting mechanism by CPU locality
- MINOR: cpu-topo: implement a CPU sorting mechanism by cluster ID
- MINOR: cpu-topo: ignore single-core clusters
- MINOR: cpu-topo: assign clusters to cores without and renumber them
- MINOR: cpu-topo: make sure we don't leave unassigned IDs in the cpu_topo
- MINOR: cpu-topo: assign an L3 cache if more than 2 L2 instances
- MINOR: cpu-topo: renumber cores to avoid holes and make them contiguous
- MINOR: cpu-topo: add a function to sort by cluster+capacity
- MINOR: cpu-topo: consider capacity when forming clusters
- MINOR: cpu-topo: create an array of the clusters
- MINOR: cpu-topo: ignore excess of too small clusters
- MINOR: cpu-topo: add "only-node" and "drop-node" to cpu-set
- MINOR: cpu-topo: add "only-thread" and "drop-thread" to cpu-set
- MINOR: cpu-topo: add "only-core" and "drop-core" to cpu-set
- MINOR: cpu-topo: add "only-cluster" and "drop-cluster" to cpu-set
- MINOR: cpu-topo: add a CPU policy setting to the global section
- MINOR: cpu-topo: add a 'first-usable-node' cpu policy
- MEDIUM: cpu-topo: use the "first-usable-node" cpu-policy by default
- CLEANUP: thread: now remove the temporary CPU node binding code
- MINOR: cpu-topo: add cpu-policy "group-by-cluster"
- MEDIUM: cpu-topo: let the "group-by-cluster" split groups
- MINOR: cpu-topo: add a new "performance" cpu-policy
- MINOR: cpu-topo: add a new "efficiency" cpu-policy
- MINOR: cpu-topo: add a new "resource" cpu-policy
- MINOR: jws: add new functions in jws.h
- MINOR: cpu-topo: fix unused stack var 'cpu2' reported by coverity
- MINOR: hlua: add an optional timeout to AppletTCP:receive()
- MINOR: jws: use jwt_alg type instead of a char
- BUG/MINOR: log: prevent saddr NULL deref in syslog_io_handler()
- MINOR: stream: decrement srv->served after detaching from the list
- BUG/MINOR: hlua: fix optional timeout argument index for AppletTCP:receive()
- MINOR: server: simplify srv_has_streams()
- CLEANUP: server: make it clear that srv_check_for_deletion() is thread-safe
- MINOR: cli/server: don't take thread isolation to check for srv-removable
- BUG/MINOR: limits: compute_ideal_maxconn: don't cap remain if fd_hard_limit=0
- MINOR: limits: fix check_if_maxsock_permitted description
- BUG/MEDIUM: hlua/cli: fix cli applet UAF in hlua_applet_wakeup()
- MINOR: tools: path_base() concatenates a path with a base path
- MEDIUM: ssl/ckch: make the ckch_conf more generic
- BUG/MINOR: mux-h2: Reset streams with NO_ERROR code if full response was already sent
- MINOR: stats: add .generic explicit field in stat_col struct
- MINOR: stats: STATS_PX_CAP___B_ macro
- MINOR: stats: add .cap for some static metrics
- MINOR: stats: use stat_col storage stat_cols_info
- MEDIUM: promex: switch to using stat_cols_info for global metrics
- MINOR: promex: expose ST_I_INF_WARNINGS (AKA total_warnings) metric
- MEDIUM: promex: switch to using stat_cols_px for front/back/server metrics
- MINOR: stats: explicitly add frontend cap for ST_I_PX_REQ_TOT
- CLEANUP: promex: remove unused PROMEX_FL_{INFO,FRONT,BACK,LI,SRV} flags
- BUG/MEDIUM: mux-quic: fix crash on RS/SS emission if already close local
- BUG/MINOR: mux-quic: remove extra BUG_ON() in _qcc_send_stream()
- MEDIUM: mt_list: Reduce the max number of loops with exponential backoff
- MINOR: stats: add alt_name field to stat_col struct
- MINOR: stats: add alt name info to stat_cols_info where relevant
- MINOR: promex: get rid of promex_global_metric array
- MINOR: stats-proxy: add alt_name field for ME_NEW_{FE,BE,PX} helpers
- MINOR: stats-proxy: add alt name info to stat_cols_px where relevant
- MINOR: promex: get rid of promex_st_metrics array
- MINOR: pools: rename the "by_what" field of the show pools context to "how"
- MINOR: cli/pools: record the list of pool registrations even when merging them
By default, create_pool() tries to merge similar pools into one. But when
dealing with certain bugs, it's hard to say which ones were merged together.
We do have the information at registration time, so let's just create a
list of registrations ("pool_registration") attached to each pool, that
will store that information. It can then be consulted on the CLI using
"show pools detailed", where the names, sizes, alignment and flags are
reported.
The goal will be to support other dump options. We don't need 32 bits to
express sorting criteria, let's reserve only 4 bits for them and leave
the remaining ones unused.
In this patch we pursue the work started in a5aadbd ("MEDIUM: promex:
switch to using stat_cols_px for front/back/server metrics"):
Indeed, while having ".promex_name" info in stat_cols_info generic array
was confusing, Willy suggested that we have ".alt_name" which stays
generic and may be considered by alternative exporters for metric naming.
For now, only promex exporter will make use of it.
Thanks to this, it allows us to completely get rid of the
stat_cols_px array. The other main benefit is that it will be much harder
to overlook promex metric definition now because .alt_name has more
visibility in the main metric array rather than in an addon file.
For now alt_name is systematically set to NULL. Thanks to this change we
may easily add an alt_name to existing metrics. Also, by requiring an explicit
value it offers more visibility for this field.
In this patch we pursue the work started in 1adc796 ("MEDIUM: promex:
switch to using stat_cols_info for global metrics"):
Indeed, while having ".promex_name" info in stat_cols_info generic array
was confusing, Willy suggested that we have ".alt_name" which stays
generic and may be considered by alternative exporters for metric naming.
For now, only promex exporter will make use of it.
Thanks to this, it allows us to completely get rid of the
promex_global_metric array. The other main benefit is that it will be
much harder to overlook promex metric definition now because .alt_name
has more visibility in the main metric array rather than in an addon file.
alt_name will be used by metric exporters to know how the metric should be
presented to the user. If the alt_name is NULL, the metric should be
ignored. For now only promex exporter will make use of this.
Reduce the max number of loops in the mt_list code while waiting for
a lock to be available with exponential backoff. It's been observed that
the current value led to severe performance degradation at least on
some hardware; hopefully this value will be acceptable everywhere.
The following patch fixed a BUG_ON() which could be triggered if RS/SS
emission was scheduled after stream local closure.
7ee1279f4b8416435faba5cb93a9be713f52e4df
BUG/MEDIUM: mux-quic: fix crash on RS/SS emission if already close local
qcc_send_stream() was rewritten as a wrapper around an internal
_qcc_send_stream() used to bypass the faulty BUG_ON(). However, an extra
unnecessary BUG_ON() was added by mistake in _qcc_send_stream().
This should not cause any issue, as the BUG_ON() is only active if <urg>
argument is false, which is not the case for RS/SS emission. However,
this patch is labelled as a bug as this BUG_ON() is unnecessary and may
cause issues in the future.
This should be backported up to 2.8, after the above-mentioned patch.
A BUG_ON() is present in qcc_send_stream() to ensure that emission is
never performed with a stream already closed locally. However, this
function is also used for RESET_STREAM/STOP_SENDING emission. No
protection exists to ensure that RS/SS is not scheduled after stream
local closure, which would result in this BUG_ON() crash.
This crash can be triggered with the following QUIC client sequence :
1. SS is emitted to open a new stream. QUIC-MUX schedules a RS emission
and the stream is locally closed.
2. An invalid HTTP/3 request is sent on the same stream, for example
with duplicated pseudo-headers. The objective is to ensure
qcc_abort_stream_read() is called after stream closure, which results
in the following backtrace.
0x000055555566a620 in qcc_send_stream (qcs=0x7ffff0061420, urg=1, count=0) at src/mux_quic.c:1633
1633 BUG_ON(qcs_is_close_local(qcs));
[ ## gdb ## ] bt
#0 0x000055555566a620 in qcc_send_stream (qcs=0x7ffff0061420, urg=1, count=0) at src/mux_quic.c:1633
#1 0x000055555566a921 in qcc_abort_stream_read (qcs=0x7ffff0061420) at src/mux_quic.c:1658
#2 0x0000555555685426 in h3_rcv_buf (qcs=0x7ffff0061420, b=0x7ffff748d3f0, fin=0) at src/h3.c:1454
#3 0x0000555555668a67 in qcc_decode_qcs (qcc=0x7ffff0049eb0, qcs=0x7ffff0061420) at src/mux_quic.c:1315
#4 0x000055555566c76e in qcc_recv (qcc=0x7ffff0049eb0, id=12, len=0, offset=23, fin=0 '\000',
data=0x7fffe0049c1c "\366\r,\230\205\354\234\301;\2563\335\037k\306\334\037\260", <incomplete sequence \323>) at src/mux_quic.c:1901
#5 0x0000555555692551 in qc_handle_strm_frm (pkt=0x7fffe00484b0, strm_frm=0x7ffff00539e0, qc=0x7fffe0049220, fin=0 '\000') at src/quic_rx.c:635
#6 0x0000555555694530 in qc_parse_pkt_frms (qc=0x7fffe0049220, pkt=0x7fffe00484b0, qel=0x7fffe0075fc0) at src/quic_rx.c:980
#7 0x0000555555696c7a in qc_treat_rx_pkts (qc=0x7fffe0049220) at src/quic_rx.c:1324
#8 0x00005555556b781b in quic_conn_app_io_cb (t=0x7fffe0037f20, context=0x7fffe0049220, state=49232) at src/quic_conn.c:601
#9 0x0000555555d53788 in run_tasks_from_lists (budgets=0x7ffff748e2b0) at src/task.c:603
#10 0x0000555555d541ae in process_runnable_tasks () at src/task.c:886
#11 0x00005555559c39e9 in run_poll_loop () at src/haproxy.c:2858
#12 0x00005555559c41ea in run_thread_poll_loop (data=0x55555629fb40 <ha_thread_info+64>) at src/haproxy.c:3075
The proper solution is to not execute this BUG_ON() for RS/SS emission.
Indeed, it is valid and can be useful to emit these frames, even after
stream local closure.
To implement this, qcc_send_stream() has been rewritten as a mere
wrapper function around the new internal _qcc_send_stream(). The latter
is used only by QMUX for STREAM, RS and SS emission. The application layer
continues to use the original function for STREAM emission, with the
BUG_ON() still in place there.
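The resulting split may be pictured roughly as follows (illustrative only,
bodies and return types simplified):
    /* internal variant, also used for RESET_STREAM/STOP_SENDING emission,
     * where sending after local closure is legitimate, hence no such check
     */
    static void _qcc_send_stream(struct qcs *qcs, int urg, int count)
    {
            /* ... schedule emission without the local-closure check ... */
    }

    /* application-facing entry point: STREAM emission only, check kept */
    void qcc_send_stream(struct qcs *qcs, int urg, int count)
    {
            BUG_ON(qcs_is_close_local(qcs));
            _qcc_send_stream(qcs, urg, count);
    }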
This must be backported up to 2.8.
While being a generic metric, ST_I_PX_REQ_TOT is handled specifically for
the frontend case. But the frontend capability isn't set for that metric.
It is actually quite misleading, because the capability may be checked
to see whether the metric is relevant for a given scope, yet it is
relevant for frontend scope.
In this patch we also add the frontend capability for the metric.
Now that the stat_cols_px array contains all the info that prometheus requires,
stop using the promex_st_metrics array that contains redundant info.
As for ("MEDIUM: promex: switch to using stat_cols_info for global
metrics"), initial goal was to completely get rid of promex_st_metrics
array, but it turns out it is still required but only for the name
mapping part now. So in this commit we change it from complex structure
array (with redundant info) to a simple ist array with the
metric id:promex name mapping. If a metric name is not defined there, then
promex ignores it.
It has been requested to have the ST_I_INF_WARNINGS metric available from
prometheus, let's define it in promex_global_metrics ist array so that
prometheus starts advertising it.
Now that the stat_cols_info array contains all the info that prometheus requires,
stop using the promex_global_metrics array that contains redundant info.
Initial goal was to completely drop the promex_global_metrics array.
However it was deemed no longer relevant as prometheus stats rely on a
custom name that cannot be derived from stat_cols_info[], unless we add
a specific ".promex_name" field or similar to name the stats for
prometheus. This is what was carried over on a first attempt but it proved
to burden stat_cols_info[] array (not only memory wise, it is quite
confusing to see promex in the main codebase, given that prometheus is
shipped as an optional add-on).
The new strategy consists in revamping the promex_global_metrics array
from promex_metric (with all redundant fields for metrics) to a simple
ID<==>IST mapping. If the metric is mapped, then it means promex addon
should advertise it (using the name provided in the mapping). Now for
all the metric retrieval, no longer rely on built-in hardcoded values
but instead leverage the new stat cols API.
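Such a mapping may be sketched as follows (illustrative only: the array bound
and the metric name string are assumptions, and only the entry for
ST_I_INF_WARNINGS is shown):
    /* a metric is advertised by promex only if it has a non-empty entry here */
    static const struct ist promex_global_metrics[ST_I_INF_MAX] = {
            [ST_I_INF_WARNINGS] = IST("haproxy_process_warnings_total"),
            /* ... one entry per exported metric ... */
    };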
The tricky part is the .type association, because the general rule doesn't
apply to all metrics: it seems that we stated that some non-counter
oriented metrics (at least from haproxy's point of view) had to be presented
as counter metrics. So in this patch we add some special treatment for
those metrics to emulate the old behavior. If that's not relevant in the
future, it may be removed. But this requires to ensure that promex users
will properly cope with that change. At least for now, no change of
behavior should be expected.
Use stat_col storage for stat_cols_info[] array instead of name_desc.
As documented in 65624876f ("MINOR: stats: introduce a more expressive
stat definition method"), stat_col supersedes name_desc storage but
it remains backward compatible. Here we migrate to the new API to be
able to further extend stat_cols_info[] in following patches.
Goal is to merge promex metrics definition into the main one.
Promex metrics will use the metric capability to know available scopes,
thus only metrics relevant for prometheus were updated.
Further extend logic implemented in 65624876 ("MINOR: stats: introduce a
more expressive stat definition method") and 4e9e8418 ("MINOR: stats:
prepare stats-file support for values other than FN_COUNTER"): we don't
rely anymore on the presence of the capability to know if the metric is
generic or not. This is because it prevents us from setting a capability
on static statistics. Yet it could be useful to set the capability even
on static metrics, thus we add a dedicated .generic bit to tell haproxy
that the metric is generic and can be handled automatically by the API.
Also, ME_NEW_* helpers are now explicitly associated with generic metric
definitions (as was already the case before) to avoid ambiguities.
It may change in the future as we may need to use the new definition
method to define static metrics (without the generic bit set). But for
now it isn't the case, as this new definition method was implemented for
generic metrics support in the first place. If we want to define static metrics
using the API, we could add a new set of helpers for instance.
On frontend side, when a stream is shut while the response was already fully
sent, it was cancelled by sending a RST_STREAM(CANCEL) frame. However, it is
not accurate. The CANCEL error code must only be used if the response headers
were sent, but not the full response. As stated in RFC 9113, when the
response was fully sent, to stop the request sending, a RST_STREAM with an
error code of NO_ERROR must be sent.
This patch should solve the issue #1219. It must be backported to all stable
versions.
The ckch_store_load_files() function makes specific processing for
PARSE_TYPE_STR as if it was a type only used for paths.
This patch changes the way it's done a little bit:
PARSE_TYPE_STR is only meant to strdup() a string and store the
resulting pointer in the ckch_conf structure.
Any processing regarding the path is now done in the callback.
Since the callbacks were basically doing the same thing, they were
transformed into DECLARE_CKCH_CONF_LOAD() macros, which allow
some templating of these functions.
The resulting ckch_conf_load_* functions will do the same as before,
except they will also do the path processing instead of letting
ckch_store_load_files() do it, which means we don't need the "base"
member anymore in the struct ckch_conf_kws.
With the SSL configuration, crt-base and key-base are often used; these
keywords concatenate the base path with the path when the path does not
start with '/'.
This is done at several places in the code, so a dedicated function to do
this helps standardize the code.
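The intended behaviour can be sketched like this (illustrative only; the real
path_base() lives in src/tools.c and may differ; needs <stdio.h>):
    /* write the resolved path into <dst>: <path> as-is when absolute or when
     * no base is given, otherwise "<base>/<path>"
     */
    static int path_base_sketch(char *dst, size_t dstsz,
                                const char *path, const char *base)
    {
            if (!base || !*base || *path == '/')
                    return snprintf(dst, dstsz, "%s", path);
            return snprintf(dst, dstsz, "%s/%s", base, path);
    }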
Recent commit e5e36ce09 ("BUG/MEDIUM: hlua/cli: Fix lua CLI commands
to work with applet's buffers") revealed a bug in hlua cli applet handling.
Indeed, playing with Willy's lua tetris script on the cli, a segfault
would be encountered when forcefully closing the session by sending a
CTRL+C on the terminal.
In fact the crash was caused by a UAF: while the cli applet was already
freed, the lua task responsible for waking it up would still point to it.
Thus hlua_applet_wakeup() could be called even if the applet didn't exist
anymore.
To fix the issue, in hlua_cli_io_release_fct() we must also free the hlua
task linked to the applet, like we already do for
hlua_applet_tcp_release() and hlua_applet_http_release().
While this bug exists on stable versions (where it should also be backported
as a precaution), it only seems to be triggered starting with 3.0.
'global.fd_hard_limit' stays uninitialized if haproxy is started with -m
(global.rlimit_memmax). 'remain' is the MAX between the soft and hard process fd
limits. It will always be bigger than 'global.fd_hard_limit' (0) in this case.
So, if we unconditionally reassign 'remain' to 'global.fd_hard_limit', the
calculated 'maxconn' will even be negative and the DEFAULT_MAXCONN (100)
will be set as the 'ideal_maxconn'.
During the 'global.maxconn' calculations in set_global_maxconn(), if the
provided 'global.rlimit_memmax' is quite big, the system will refuse the
'global.maxconn' calculated based on it and we will fall back to the
'ideal_maxconn', which is 100.
Same problem for the configs with SSL frontends and backends.
This fixes the issue #2899.
This should be backported to v3.1.0.
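The shape of the fix can be sketched as follows (illustrative, variable names
taken from the description above):
    /* only cap <remain> when an fd_hard_limit was actually configured */
    if (global.fd_hard_limit && remain > global.fd_hard_limit)
            remain = global.fd_hard_limit;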
Thanks to the previous commits, we now know that "wait srv-removable"
does not require thread isolation, as long as 3372a2ea00 ("BUG/MEDIUM:
queues: Stricly respect maxconn for outgoing connections") and c880c32b16
("MINOR: stream: decrement srv->served after detaching from the list")
are present. Let's just get rid of thread_isolate() here, which can
consume a lot of CPU on highly threaded machines when removing many
servers at once.
This function was marked as requiring thread isolation because its code
was extracted from cli_parse_delete_server() and was running under
isolation. But upon closer inspection, and using atomic loads to check
a few counters, it is actually safe to run without isolation, so let's
reflect that in its description.
However, it remains true that cli_parse_delete_server() continues to call
it under isolation.
Now that thanks to commit c880c32b16 ("MINOR: stream: decrement
srv->served after detaching from the list") we can trust srv->served,
let's use it and no longer loop on threads when checking if a server
still has streams attached to it. This will be much cheaper and will
result in keeping isolation for a shorter time in the "wait" command.
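With srv->served now trustworthy, the check can be sketched as a single atomic
load (illustrative, not necessarily the exact code):
    static int srv_has_streams(struct server *srv)
    {
            return HA_ATOMIC_LOAD(&srv->served) != 0;
    }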
Baptiste reported that using the new optional timeout argument introduced
in 19e48f2 ("MINOR: hlua: add an optional timeout to AppletTCP:receive()")
the following error would occur at some point:
runtime error: file.lua:lineno: bad argument #-2 to 'receive' (number
expected, got light userdata) from [C]: in method 'receive...
In fact this is caused by exp_date being retrieved using relative index -1
instead of absolute index 3. Indeed, while using a relative index is fine
most of the time when we trust the stack, when combined with yielding, the
top of the stack when resuming is not necessarily the same
as when the function was first called (i.e. if some data was pushed to the
stack in the yieldable function itself). As such, it is safer to use an
explicit index to access exp_date variable at position 3 on the stack.
It was confirmed that doing so addresses the issue.
No backport needed unless 19e48f2 is.
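In Lua C API terms the change boils down to something like this (illustrative,
not the actual haproxy code):
    /* before: reads whatever currently sits at the top of the stack, which
     * may have changed across a yield/resume cycle
     */
    exp_date = lua_tointeger(L, -1);

    /* after: always read the 3rd argument of receive(), wherever the top is */
    exp_date = lua_tointeger(L, 3);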
In commit 3372a2ea00 ("BUG/MEDIUM: queues: Stricly respect maxconn for
outgoing connections"), it has been ensured that srv->served is held
as long as possible around the periods where a stream is attached to a
server. However, it's decremented early when entering sess_change_server,
and actually just before detaching from that server's list. While there
is theoretically nothing wrong with this, it prevents us from looking at
this counter to know if streams are still using a server or not.
We could imagine decrementing it much later but that wouldn't work with
leastconn, since that algo needs ->served to be final before calling
lbprm.server_drop_conn(). Thus what we're doing here is to detach from
the server, then decrement ->served, and only then call the LB callback
to update the server's position in the tree. At this moment the stream
doesn't know the server anymore anyway (except via this function's
local variable) so it's safe to consider that no stream knows the server
once the variable reaches zero.
In ad0133cc ("MINOR: log: handle log-forward "option host""), we
de-reference saddr without first checking if saddr is NULL. In practice
saddr shouldn't be NULL, but it may be the case if a memory error happens
in the tcp syslog handler, so we must assume that it can be NULL at some
point.
To fix the bug, we simply check for NULL before de-referencing it
under syslog_io_handler(), as the function comment suggests.
No backport needed unless ad0133cc is.
This patch implements the function EVP_PKEY_to_jws_algo() which returns
a jwt_alg compatible with the private key.
This value can then be passed to jws_b64_protected() and
jws_b64_signature(), which were modified to take a jwt_alg instead of a char.
TCP services might want to be interactive, and without a timeout on
receive(), the possibilities are a bit limited. Let's add an optional
timeout in the 3rd argument to possibly limit the wait time. In this
case if the timeout strikes before the requested size is complete,
a possibly incomplete block will be returned.
Coverity has reported that cpu2 seems sometimes unused in
cpu_fixup_topology():
*** CID 1593776: Code maintainability issues (UNUSED_VALUE)
/src/cpu_topo.c: 690 in cpu_fixup_topology()
684 continue;
685
686 if (ha_cpu_topo[cpu].cl_gid != curr_id) {
687 if (curr_id >= 0 && cl_cpu <= 2)
688 small_cl++;
689 cl_cpu = 0;
>>> CID 1593776: Code maintainability issues (UNUSED_VALUE)
>>> Assigning value from "cpu" to "cpu2" here, but that stored value is overwritten before it can be used.
690 cpu2 = cpu;
691 curr_id = ha_cpu_topo[cpu].cl_gid;
692 }
693 cl_cpu++;
694 }
695
That's it. The 'cpu2' automatic/stack variable is used only in for() loop scopes
to save the CPU IDs we are interested in. In the loop pointed out by coverity,
this variable is not used for further processing within the loop's scope.
It is then always reinitialized to 0 in the following loops.
This fixes GitHub issue #2895.
This cpu policy keeps the smallest CPU cluster. This can
be used to limit the resource usage to the strict minimum
that still delivers decent performance, for example to
try to further reduce power consumption or minimize the
number of cores needed on some rented systems for a
sidecar setup, in order to scale the system down more
easily. Note that if a single cluster is present, it
will still be fully used.
When started on a 64-core EPYC gen3, it uses only one CCX
with 8 cores and 16 threads, all in the same group.
This cpu policy tries to evict performant core clusters and only
focuses on efficiency-oriented ones. On an intel i9-14900k, we can
get 525k rps using 8 performance cores, versus 405k when using all
24 efficiency cores. In some cases the power savings might be more
desirable (e.g. scalability tests on a developer's laptop), or the
performance cores might be better suited for another component
(application or security component).
This cpu policy tries to evict efficient core clusters and only
focuses on performance-oriented ones. On an intel i9-14900k, we can
get 525k rps using only 8 cores this way, versus 594k when using all
24 cores. The gains from using all these cores are not significant
enough to waste them on this. Also these cores can be much slower
at doing SSL handshakes so it can make sense to evict them. Better
keep the efficiency cores for network interrupts for example.
Also, on a developer's machine it can be convenient to keep all these
cores for the local tasks and extra tools (load generators etc).
When a cluster is too large to fit into a single group, let's split it
into two equal groups, which will still be allowed to use all the CPUs
of the cluster. This allows haproxy to start all the threads with a
minimum number of groups (e.g. 2x40 for 80 cores).
This policy forms thread groups from the CPU clusters, and binds all the
threads in them to all the CPUs of the cluster. This is recommended on
systems with bad inter-CCX latencies. It was shown to simply triple the
performance with queuing on a 64-core EPYC without having to manually
assign the cores with cpu-map.
This is now superseded by the default "safe" cpu-policy, and every time
it's used, that code was bypassed anyway since global.nbthread was set.
We can now safely remove it. Note that for other policies which do not
set a thread count nor further restrict CPUs (such as "none", or even
"safe" when finding a single node), we continue to go through the fallback
code that automatically assigns CPUs to threads and counts them.
This now turns the cpu-policy to "first-usable-node" by default, so that
we preserve the current default behavior consisting in binding to the
first node if nothing was forced. If a second node is found,
global.nbthread is set and the previous code will be skipped.
This is a reimplementation of the current default policy. It binds to
the first node having usable CPUs if found, and drops CPUs from the
second and next nodes.
We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and to set the number of threads and
thread-groups; it will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).
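The structure may be pictured roughly like this (illustrative; the callback's
prototype in particular is only an assumption):
    struct ha_cpu_policy {
            const char *name;   /* e.g. "none" */
            const char *desc;   /* short human-readable description */
            /* hypothetical callback: picks CPUs, sets threads/groups */
            int (*fct)(int policy, char **err);
    };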
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
cluster number(s). It can be used to bind to only some clusters, such
as CCX or different energy efficiency cores. For this reason, here we
use the cluster's local ID (local to the node).
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
core number(s). It can be used to bind to only some clusters as well
as to evict efficient cores whose number is known.
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated hardware
thread number(s). It can be used to reserve even threads for HW IRQs
and odd threads for haproxy for example, or to evict efficient cores
that only have thread #0.
On some Arm systems (typically A76/N1) where CPUs can be associated in
pairs, clusters are reported while they have no incidence on I/O etc.
Yet it's possible to have tens of clusters of 2 CPUs each, which is
counterproductive since it does not even allow starting enough threads.
Let's detect this situation as soon as there are at least 4 clusters
having 2 CPUs or less each, which is already very suspicious. In this
case, all these clusters will be reset as meaningless. In the worst
case if needed they'll be re-assigned based on L2/L3.
The goal here is to keep an array of the known CPU clusters, because
we'll use that often to decide of the performance of a cluster and
its relevance compared to other ones. We'll store the number of CPUs
in it, the total capacity etc. For the capacity, we count one unit
per core, and 1/3 of it per extra SMT thread, since this is roughly
what has been measured on modern CPUs.
In order to ease debugging, they're also dumped with -dc.
The purpose here is to detect heterogeneous clusters which are not
properly reported, based on the exposed information about the cores
capacity. The algorithm here consists in sorting CPUs by capacity
within a cluster, and considering as equal all those which have 5%
or less difference in capacity with the previous one. This allows
large clusters of more than 5% total between extremities, while
keeping apart those where the limit is more pronounced. This is
quite common in embedded environments with big.little systems, as
well as on some laptops.
Due to the way core numbers are assigned and the presence of SMT on
some of them, some holes may remain in the array. Let's renumber them
to plug holes once they're known, following pkg/node/die/llc etc, so
that they're local to a (pkg,node) set. Now an i7-14700 shows cores
0 to 19, not 0 to 27.
On some machines, L3 is not always reported (e.g. on some lx2 or some
armada8040). But some also don't have L3 (core 2 quad). However, no L3
when there are more than 2 L2 is quite unheard of, and while we don't
really care about firing 2 thread groups for 2 L2, we'd rather avoid
doing this if there are 8! In this case we'll declare an L3 instance
to fix the situation. This allows small machines to continue to start
with two groups while not derivating on large ones.
It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.
Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.
Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example a 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.
The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.
We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.
Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.
Some platforms (several armv7, intel 14900 etc) report one distinct
cluster per core. This is problematic as it cannot let clusters be
used to distinguish real groups of cores, and cannot be used to build
thread groups.
Let's just compare the cluster cpus to the siblings, and ignore it if
they exactly match. We must also take care of not falling back to
core_cpus_list, which can enumerate cores that already have their
cluster assigned (e.g. intel 14900 has 4 4-Ecore clusters in addition
to the 8 Pcores).
This will be used to detect and fix incorrect setups which report
the same cluster ID for multiple L3 instances.
The arrangement of functions in this file is becoming a real problem.
Maybe we should move all this to cpu_topo for example, and better
distinguish OS-specific and generic code.
Once we've kept only the CPUs we want, the next step will be to form
groups and these ones are based on locality. Thus we'll have to sort by
locality. For now the locality is only inferred by the index. No grouping
is made at this point. For this we add the "cpu_reorder_by_locality"
function with a locality-based comparison function.
CPU selection will be performed by sorting CPUs according to
various criteria. For dumps however, that's really not convenient
and we'll need to reorder the CPUs according to their index only.
This is what the new function cpu_reorder_by_index() does. It's
called in thread_detect_count() before dumping the CPU topology.
A number of entries under /cpu/cpu%d only exist on certain kernel
versions, certain archs and/or with certain modules loaded. It's
pointless to insist on trying to read them all for all CPUs when
we've already verified they do not exist. Thus let's use stat()
the first time prior to checking some of them, and only try to
access them when they really exist. This almost completely
eliminates the large number of ENOENT that was visible in strace
during startup.
Cpufreq alone isn't a good metric on heterogeneous CPUs because efficient
cores can reach almost as high frequencies as performant ones. Tests have
shown that majoring performance cores by 50% gives a pretty accurate
estimate of the performance to expect on modern CPUs, and that counting
+33% per extra SMT thread is reasonable as well. We don't have the info
about the core's quality, but using the presence of SMT is a reasonable
approach in this case, given that efficiency cores will not use it.
As an example, using one thread of each of the 8 P-cores of an intel
i9-14900k gives 395k rps for a corrected total capacity of 69.3k, using
the 16 E-cores gives 40.5k for a total capacity of 70.4k, and using both
threads of 6 P-cores gives 41.1k for a total capacity of 69.6k. Thus the
3 same scores deliver the same performance in various combinations.
The acpi_cppc method was found to take about 5ms per CPU on a 64-core
EPYC system, which is plain unacceptable as it delays the boot by half
a second. Let's use the less accurate cpufreq first, which should be
sufficient anyway since many systems do not have acpi_cppc. We'll only
fall back to acpi_cppc for systems without cpufreq. If it were to be
an issue over time, we could also automatically consider that all
threads of the same core or even of the same cluster run at the same
speed (when a cluster is known to be accurate).
When cpu_capacity is not present, let's try to check acpi_cppc's
nominal_perf which is similar and commonly found on servers, then
scaling_max_freq (though that last one may vary a bit between CPUs
depending on die quality). That variation is not a problem since
we can absorb a ~5% variation without issue.
It was verified on an i9-14900 featuring 5.7-P, 6.0-P and 4.4-E GHz
that P-cores were not reordered and that E cores were placed last.
It was also OK on a W3-2345 with 4.3 to 4.5GHz.
It's becoming difficult to see which CPUs are going to be kept/dropped.
Let's just skip all offline CPUs, and indicate "keep" in front of those
that are going to be used, and "----" in front of the excluded ones. It
is way more readable this way.
Also let's just drop the array entry number, since it's always the same
as the CPU number and is only an internal representation anyway.
Let's reimplement automatic binding to the first NUMA node when thread
count is not forced. It's the same thing as is already done in
check_config_validity() except that this time it's based on the
collected CPU information. The threads are automatically counted
and CPUs from non-first node(s) are evicted.
If no cpu-map is done and no cpu-policy could be enforced, we still need
to count the number of usable CPUs, assign them to all threads and set
the nbthread value accordingly.
This already handles the part that was done in check_config_validity()
via thread_cpus_enabled_at_boot.
By mutually refining the thread count and group count, we can try
to detect the most suitable setup for the current machine. Taskset
is implicitly handled correctly. tgroups automatically adapt to the
configured number of threads. cpu-map manages to limit tgroups to
the smallest supported value.
The thread-limit is enforced. Just like in cfgparse, if the thread
count was forced to a higher value, it's reduced and a warning is
emitted. But if it was not set, the thr_max value is bound to this
limit so that further calculations respect it.
We continue to default to the max number of available threads and 1
tgroup by default, with the limit. This normally allows to get rid
of that test in check_config_validity().
For now it's limited, it only supports "reset" to ask that any previous
"taskset" be ignored. The goal will be to later add more actions that
allow to symbolically define sets of cpus to bind to or to drop. This
also clears the cpu_mask_forced variable that is used to detect
that a taskset had been used.
During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.
For now the function refrains from detecting the CPU topology when a
restrictive taskset or cpu-map was already performed on the process,
and it's documented as such, the reason being that until we're able
to automatically create groups, better not change user settings. But
we'll need to be able to detect bound CPUs and to process them as
desired by the user, so we now need to move that detection into the
function itself. It changes nothing to the logic, just gives more
freedom to the function.
The function is not convenient because it doesn't allow us to undo the
startup changes, and depending on where it's being used, we don't know
whether the values read have already been altered (this is not the case
right now but it's going to evolve).
Let's just compute the status during cpu_detect_usable() and set a
variable accordingly. This way we'll always read the init value, and
if needed we can even afford to reset it. Also, placing it in cpu_topo.c
limits cross-file dependencies (e.g. threads without affinity etc).
With this patch we're also assigning NUMA node IDs to each CPU when the info is
found. The code is highly inspired from the one in commit f5d48f8b3
("MEDIUM: cfgparse: numa detect topology on FreeBSD."), the difference
being that we're just setting the value in ha_cpu_topo[].
With this patch we're also assigning NUMA node IDs to each CPU when one
is found. The code is highly inspired from the one in commit b56a7c89a
("MEDIUM: cfgparse: detect numa and set affinity if needed") that already
did the job, except that it could be simplified since we're just collecting
info to fill the ha_cpu_topo[] array.
The sibling ID was not reported because it's not directly accessible
but we don't care, what matters is that we assign numbers to all the
threads we find using the same CPU so that some strategies permit to
allocate one thread at a time if we want to use few threads with max
performance.
This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.
When possible, the offline CPUs are detected at boot and their OFFLINE
flag is set in the ha_cpu_topo[] array. When the detection is not
possible (e.g. not linux, /sys not mounted etc), we just mark none of
them as being offline, as we don't want to infer wrong info that could
hinder automatic CPU placement detection. When valid, we take this
opportunity for refining cpu_topo_lastcpu so that we don't need to
manipulate CPUs beyond this value.
On FreeBSD we can detect online CPUs at least by doing the bitwise-OR of
the CPUs of all domains, so we're using this and adding this detection
to ha_cpuset_detect_online(). If we find simpler later, we can always
rework it, but it's reasonably inexpensive since we only check existing
domains.
This adds a generic function ha_cpuset_detect_online() which for now
only supports linux via /sys. It fills a cpuset with the list of online
CPUs that were detected (or returns a failure).
The cpuset files are normally used only for cpu manipulations. It happens
that the initial CPU binding detection was initially placed there since
there was no better place, but in practice, being OS-specific, it should
really be in cpu-topo. This simplifies cpuset which doesn't need to know
about the OS anymore.
Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.
It's important to proceed this way so that when we have no info we
assume they're all available.
The new function cpu_dump_topology() will centralize most debugging
calls, and it can make efforts of not dumping some possibly irrelevant
fields (e.g. non-existing cache levels).
We don't want to constantly deal with as many CPUs as a cpuset can hold,
so let's first try to trim the value to what the system claims to support
via _SC_NPROCESSORS_CONF. It is obviously still subject to the limit of
the cpuset size though. The value is stored globally so that we can
reuse it elsewhere after initialization.
This structure will be used to store information about each CPU's
topology (package ID, L3 cache ID, NUMA node ID etc). This will be used
in conjunction with CPU affinity setting to try to perform a mostly
optimal binding between threads and CPU numbers by default. Since it
was noticed during tests that absolutely none of the many machines
tested reports different die numbers, the die_id is not stored.
Also, it was found along experiments that the cluster ID will be used
a lot, half of the time as a node-local identifier, and half of the
time as a global identifier. So let's store the two versions at once
(cl_gid, cl_lid).
Some flags are added to indicate causes of exclusion (offline, excluded
at boot, excluded by rules, ignored by policy).
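For illustration, a per-CPU topology record along these lines can hold
the information discussed above (a sketch only; the exact field names
and layout of ha_cpu_topo are defined in the header and may differ):

  struct cpu_topo_sketch {
      unsigned int flags;  /* offline, excluded at boot/by rules/by policy */
      int idx;             /* OS-assigned CPU number */
      int pkg_id;          /* package (socket) ID */
      int node_id;         /* NUMA node ID */
      int l3_id;           /* L3 cache ID */
      int cl_gid;          /* cluster ID, global across packages */
      int cl_lid;          /* cluster ID, local to the package */
      int th_cnt;          /* number of SMT siblings on this core */
      int capa;            /* relative capacity when the CPU exposes it */
  };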
Let's just clean up the thread_cpus_enabled() code a little bit by
removing the OS-specific code and relying on ha_cpuset_detect_bound()
instead. On macOS we continue to use sysconf() for now.
Negative IDs are very convenient to mean "not set", so let's just make
the cpuset API robust against this, especially with ha_cpuset_isset()
so that we don't have to manually add this check everywhere when a
value is not known.
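For example, the guard can be as simple as the sketch below, assuming a
Linux cpu_set_t backing (the real hap_cpuset wrapper and API differ):

  #define _GNU_SOURCE
  #include <sched.h>

  struct cpuset_sketch {
      cpu_set_t set;
  };

  /* return non-zero if <cpu> is present in <c>; negative IDs mean
   * "not set" and are simply reported as absent.
   */
  static int cpuset_isset_sketch(struct cpuset_sketch *c, int cpu)
  {
      if (cpu < 0)
          return 0;
      return CPU_ISSET(cpu, &c->set);
  }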
Lots of collected data and observations aggregated into a single commit
so as not to lose them. Some parts below come from several commit
messages and are incremental.
Add captures and analysis of intel 14900 where it's not easy to draw
the line between the desired P and E cores.
The 14900 raises some questions (imagine a dual-die variant in multi-socket).
That's the start of an algorithmic distribution of performance cores into
thread groups.
cpu-map currently conflicts a lot with the choices after auto-detection
but it doesn't have to. The problem is the inability to configure the
threads for the whole process like taskset does. By offering this ability
we can also start to designate groups of CPUs symbolically (package, die,
ccx, cores, smt).
It can also be useful to exploit the info from cpuinfo that is not
available in /sys, such as the model number. At least on arm, higher
numbers indicate bigger cores and can be useful to distinguish cores
inside a cluster. It will not indicate big vs medium ones of the same
type (e.g. a78 3.0 vs 2.4 GHz) but can still be effective at identifying
the efficient ones.
In short, info such as the cluster ID is not always reliable, and is
local to the package; die_id as well. The die number is not reported
here but should definitely be used, with a higher priority than L3.
We're still missing a discriminant between the L3 and cluster number
in order to address heterogeneous CPUs (e.g. intel 14900), though in
terms of locality that's currently done correctly.
CPU selection is also a full topic, and some thoughts were noted
regarding sorting by perf vs locality so as never to mix inter-
socket CPUs due to sorting.
The proposed cpu-selection cannot work as-is, because it acts both on
restriction and preference, and these two are not actions but a sequence.
First restrictions must be enforced, and second the remaining CPUs are
sorted according to the preferred criterion, and a number of threads are
selected.
Currently we refine the OS-exposed cluster number but it's not correct
as we can end up with something poorly numbered. We need to respect the
LLC in any case so let's explain the approach.
This is also needed in order to make the requested number of CPUs
appear. For now we don't reroute to the original sysconf() call so
we return -1,EINVAL for all other info.
A build warning is emitted with gcc-4.8 in tools.c since commit
e920d73f59 ("MINOR: tools: improve symbol resolution without dl_addr")
because the compiler doesn't see that <size> is necessarily initialized.
Let's just preset it.
This adds run_poll_loop, run_tasks_from_lists, process_runnable_tasks,
ha_dump_backtrace and cli_io_handler which are fairly common in
backtraces. This will leave fewer relative symbols when dladdr is not
usable.
Let's have a macro that declares both the symbol and its name, it will
avoid the risk of introducing typos, and encourages adding more when
needed. The macro also takes an optional second argument to permit an
inline declaration of an extern symbol.
When dl_addr is not usable or fails, better fall back to the closest
symbol among the known ones instead of providing everything relative
to main. Most often, the location of the function will give some hints
about what it can be. Thus now we can emit fct+0xXXX in addition to
main+0xXXX or main-0xXXX. We keep a margin of +256kB maximum after a
function for a match, which is around the maximum size met in an object
file, otherwise it becomes pointless again.
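A rough sketch of the idea, assuming a small static table of known
symbols (names and structure below are illustrative, not the actual
haproxy code):

  #include <stdint.h>
  #include <stddef.h>

  struct known_sym {
      const void *addr;
      const char *name;
  };

  /* return the name of the known symbol closest below <addr> within a
   * 256kB margin and store the offset in <off>, otherwise fall back to
   * "main" with a null offset (the caller keeps its main-relative dump).
   */
  static const char *closest_sym(const struct known_sym *tab, int nb,
                                 const void *addr, size_t *off)
  {
      uintptr_t a = (uintptr_t)addr, best = 0;
      const char *name = "main";
      int i;

      for (i = 0; i < nb; i++) {
          uintptr_t s = (uintptr_t)tab[i].addr;

          if (s <= a && a - s <= 256 * 1024 && s > best) {
              best = s;
              name = tab[i].name;
          }
      }
      *off = best ? a - best : 0;
      return name;
  }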
Performing a diff on stats output before vs after commit 66152526
("MEDIUM: stats: convert counters to new column definition") revealed
that some metrics were not properly ported to the new API. Namely,
"lbtot", "cli_abrt" and "srv_abrt" are now exposed on frontends and
listeners while it was not the case before.
Also, "hrsp_other" is exposed even when "mode http" wasn't set on the
proxy.
In this patch we restore original behavior by fixing the capabilities
and hide settings.
As this could be considered as a minor regression (looking at the commit
message it doesn't seem intended), better tag this as a bug. It should be
backported in 3.0 with 66152526.
This is a complementary patch to cf913c2f9 ("DOC: management: rename show
stats domain cli "dns" to "resolvers"). The doc still referred to the
legacy "dns" domain filter for the stat command. Let's rename those
occurrences to "resolvers".
It may be backported to all stable versions.
Since commit 8de8ed4f48 ("MEDIUM: connections: Allow taking over
connections from other tgroups.") we got this partially absurd
build warning when disabling threads:
src/backend.c: In function 'conn_backend_get':
src/backend.c:1371:27: warning: array subscript [0, 0] is outside array bounds of 'struct tgroup_info[1]' [-Warray-bounds]
The reason is that gcc sees that curtgid is not equal to tgid which is
defined as 1 in this case, thus it figures that tgroup_info[curtgid-1]
will be anything but zero and that doesn't fit. It is ridiculous as it
is a perfect case of dead code elimination which should not warrant a
warning. Nevertheless we know we don't need to do this when threads are
disabled and in this case there will not be more than 1 thread group, so
we can happily use that preliminary test to help the compiler eliminate
the dead condition and avoid spitting this warning.
No backport is needed.
The dladdr_lock that was added to avoid re-entering into dladdr is
conditioned by threads, but the way it's declared causes a build
warning if threads are disabled, due to the insertion of a lone
semicolon in the variables block. Let's switch to __decl_thread_var()
for this.
This can be backported wherever commit eb41d768f9 ("MINOR: tools:
use only opportunistic symbols resolution") is backported. It relies
on these previous two commits:
bb4addabb7 ("MINOR: compiler: add a simple macro to concatenate resolved strings")
69ac4cd315 ("MINOR: compiler: add a new __decl_thread_var() macro to declare local variables")
__decl_thread() already exists but is more suited for struct members.
When using it in a variables block, it appends the final trailing
semi-colon which is a statement that ends the variable block. Better
clean this up and have one precisely for variable blocks. In this
case we can simply define an unused enum value that will consume the
semi-colon. That's what the new macro __decl_thread_var() does.
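A plausible shape for such a macro, assuming the usual USE_THREAD
conditional (a sketch, not the verbatim definition):

  #ifdef USE_THREAD
  #  define __decl_thread_var(decl) decl
  #else
  /* declare nothing, but consume the trailing semi-colon with a
   * uniquely named, unused enum value so that the variables block is
   * not terminated.
   */
  #  define __decl_thread_var(decl) \
          enum { CONCAT(_dummy_thread_var_, __LINE__) }
  #endif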
It's often useful to be able to concatenate strings after resolving
them (e.g. __FILE__, __LINE__ etc). Let's just have a CONCAT() macro
to do that, which calls _CONCAT() with the same arguments to make
sure the contents are resolved before being concatenated.
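This is the classic two-level expansion pattern; a minimal sketch:

  #define _CONCAT(a, b) a ## b
  #define CONCAT(a, b)  _CONCAT(a, b)

  /* e.g. CONCAT(sym_, __LINE__) first resolves __LINE__, then pastes,
   * yielding sym_42 instead of sym___LINE__.
   */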
A bug was uncovered by the work on NUMA. It only triggers in the CI
with libmusl due to a race condition. What happens is that the call
to set_thread_cpu_affinity() is done very early in the polling loop,
and that it relies on ha_pthread[tid] instead of pthread_self(). The
problem is that ha_pthread[tid] is only set by the return from
pthread_create(), which might happen later depending on the number of
CPUs available to run the starting thread.
Let's just use pthread_self() here. ha_pthread[] is only used to send
signals between threads, there's no point in using it here.
This can be backported to 2.6.
Historically, log-forward proxy used to preserve host field from input
message as much as possible, and if syslog host wasn't provided
(rfc5424 '-' or bad rfc3164 or rfc5424 message) then "localhost" or "-"
would be used as host when outputting message using rfc3164 or rfc5424.
We change that behavior (which corresponds to "keep" host option), so that
log-forward now uses "fill" strategy as default: if the host is provided
in input message, it is preserved. However if it is missing and IP address
from sender is available, we use it.
Following the previous patch, we now implement the logic for the host
option under the log-forward section. Possible strategies are:
  replace  If input message already contains a value for the host
           field, we replace it by the source IP address from the
           sender.
           If input message doesn't contain a value for the host field
           (ie: '-' as input rfc5424 message or non compliant rfc3164
           or rfc5424 message), we use the source IP address from the
           sender as host field.

  fill     If input message already contains a value for the host field,
           we keep it.
           If input message doesn't contain a value for the host field
           (ie: '-' as input rfc5424 message or non compliant rfc3164
           or rfc5424 message), we use the source IP address from the
           sender as host field.

  keep     If input message already contains a value for the host field,
           we keep it.
           If input message doesn't contain a value for the host field,
           we set it to localhost (rfc3164) or '-' (rfc5424).
           (This is the default)

  append   If input message already contains a value for the host field,
           we append a comma followed by the IP address from the sender.
           If input message doesn't contain a value for the host field,
           we use the source IP address from the sender.
The default value (unchanged) is the "keep" strategy. The host option is
only relevant with the rfc3164 or rfc5424 format on log targets. Also,
if the source address is not available (ie: UNIX socket), the default
behavior prevails.
Documentation was updated.
Migrate recently added log-forward section options, currently stored
under proxy->options2, to proxy->options3, since proxy->options2 is
running out of space and we plan on adding more log-forward options.
proxy->options2 is almost full, yet we will add new log-forward options
in upcoming patches, so we anticipate this by adding a new {no_}options3
and cfg_opts3[] to further extend proxy options.
Prevent code duplication between syslog_fd_handler() and
syslog_io_handler() by merging the common code path into a single
syslog_process_message() helper that processes a single message stored
in <buf> according to <frontend> settings.
This test returns a JWS payload signed with a specified private key in
the PEM format, and uses the "jose" command line tool to check if the
signature is correct against the JWK public key.
The test could be improved later by using the code from jwt.c to check
the signature.
This commit implements JWS signing, which is divided into 3 parts:
- jws_b64_protected() creates a JWS "protected" header, which takes the
  algorithm, kid or jwk, nonce and url as input, and fills a destination
  buffer with the base64url version of the header
- jws_b64_payload() just encodes a payload in base64url
- jws_b64_signature() generates a signature using the protected header
  and the payload as input; it supports ES256, ES384 and ES512 for ECDSA
  keys, and RS256 for RSA ones. The RSA signature just uses the
  EVP_DigestSign() API with its result encoded in base64url. For ECDSA
  it's a little bit more complicated, and should follow section 3.4 of
  RFC 7518: R and S must be padded to byte size.
Then the JWS can be output with jws_flattened() which just formats the 3
base64url outputs in a JSON representation with the 3 fields: protected,
payload and signature.
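For illustration, converting a DER-encoded ECDSA signature into the
fixed-size R || S form required by RFC 7518 section 3.4 can be done
along these lines (a sketch using plain OpenSSL calls; the helper name
and its integration are assumptions, not the actual jws.c code):

  #include <openssl/ecdsa.h>
  #include <openssl/bn.h>

  /* <flen> is the field size in bytes: 32 for ES256, 48 for ES384,
   * 66 for ES512. <out> must hold 2 * flen bytes. Returns 1 on
   * success, 0 on error.
   */
  static int ecdsa_der_to_raw(const unsigned char *der, long der_len,
                              int flen, unsigned char *out)
  {
      ECDSA_SIG *sig = d2i_ECDSA_SIG(NULL, &der, der_len);
      int ret = 0;

      if (!sig)
          return 0;
      /* BN_bn2binpad() left-pads R and S with zeroes up to <flen> */
      if (BN_bn2binpad(ECDSA_SIG_get0_r(sig), out, flen) == flen &&
          BN_bn2binpad(ECDSA_SIG_get0_s(sig), out + flen, flen) == flen)
          ret = 1;
      ECDSA_SIG_free(sig);
      return ret;
  }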
Released version 3.2-dev7 with the following main changes :
- BUG/MEDIUM: applet: Don't handle EOI/EOS/ERROR is applet is waiting for room
- BUG/MEDIUM: spoe/mux-spop: Introduce an NOOP action to deal with empty ACK
- BUG/MINOR: cfgparse: fix NULL ptr dereference in cfg_parse_peers
- BUG/MEDIUM: uxst: fix outgoing abns address family in connect()
- REGTESTS: fix reg-tests/server/abnsz.vtc
- BUG/MINOR: log: fix outgoing abns address family
- BUG/MINOR: sink: add tempo between 2 connection attempts for sft servers
- MINOR: clock: always use atomic ops for global_now_ms
- CI: QUIC Interop: clean old docker images
- BUG/MINOR: stream: do not call co_data() from __strm_dump_to_buffer()
- BUG/MINOR: mux-h1: always make sure h1s->sd exists in h1_dump_h1s_info()
- MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
- BUG/MEDIUM: stream: never allocate connection addresses from signal handler
- MINOR: freq_ctr: provide non-blocking read functions
- BUG/MEDIUM: stream: use non-blocking freq_ctr calls from the stream dumper
- MINOR: tools: use only opportunistic symbols resolution
- CLEANUP: task: move the barrier after clearing th_ctx->current
- MINOR: compression: Introduce minimum size
- BUG/MINOR: h2: always trim leading and trailing LWS in header values
- MINOR: tinfo: split the signal handler report flags into 3
- BUG/MEDIUM: stream: don't use localtime in dumps from a signal handler
- OPTIM: connection: don't try to kill other threads' connection when !shared
- BUILD: add possibility to use different QuicTLS variants
- MEDIUM: fd: Wait if locked in fd_grab_tgid() and fd_take_tgid().
- MINOR: fd: Add fd_lock_tgid_cur().
- MEDIUM: epoll: Make sure we can add a new event
- MINOR: pollers: Add a fixup_tgid_takeover() method.
- MEDIUM: pollers: Drop fd events after a takeover to another tgid.
- MEDIUM: connections: Allow taking over connections from other tgroups.
- MEDIUM: servers: Add strict-maxconn.
- BUG/MEDIUM: server: properly initialize PROXY v2 TLVs
- BUG/MINOR: server: fix the "server-template" prefix memory leak
- BUG/MINOR: h3: do not report transfer as aborted on preemptive response
- CLEANUP: h3: fix documentation of h3_rcv_buf()
- MINOR: hq-interop: properly handle incomplete request
- BUG/MEDIUM: mux-fcgi: Try to fully fill demux buffer on receive if not empty
- MINOR: h1: permit to relax the websocket checks for missing mandatory headers
- BUG/MINOR: hq-interop: fix leak in case of rcv_buf early return
- BUG/MINOR: server: check for either proxy-protocol v1 or v2 to send header
- MINOR: jws: implement a JWK public key converter
- DEBUG: init: add a way to register functions for unit tests
- TESTS: add a unit test runner in the Makefile
- TESTS: jws: register a unittest for jwk
- CI: github: run make unit-tests on the CI
- TESTS: add config smoke checks in the unit tests
- MINOR: jws: conversion to NIST curves name
- CI: github: remove smoke tests from vtest.yml
- TESTS: ist: fix wrong array size
- TESTS: ist: use the exit code to return a verdict
- TESTS: ist: add a ist.sh to launch in make unit-tests
- CI: github: fix h2spec.config proxy names
- DEBUG: init: Add a macro to register unit tests
- MINOR: sample: allow custom date format in error-log-format
- CLEANUP: log: removing "log-balance" references
- BUG/MINOR: log: set proper smp size for balance log-hash
- MINOR: log: use __send_log() with exact payload length
- MEDIUM: log: postpone the decision to send or not log with empty messages
- MINOR: proxy: make pr_mode enum bitfield compatible
- MINOR: cfgparse-listen: add and use cfg_parse_listen_match_option() helper
- MINOR: log: add options eval for log-forward
- MINOR: log: detach prepare from parse message
- MINOR: log: add dont-parse-log and assume-rfc6587-ntf options
- BUG/MEDIUM: startup: return to initial cwd only after check_config_validity()
- TESTS: change the output of run-unittests.sh
- TESTS: unit-tests: store sh -x in a result file
- CI: github: show results of the Unit tests
- BUG/MINOR: cfgparse/peers: fix inconsistent check for missing peer server
- BUG/MINOR: cfgparse/peers: properly handle ignored local peer case
- BUG/MINOR: server: dont return immediately from parse_server() when skipping checks
- MINOR: cfgparse/peers: provide more info when ignoring invalid "peer" or "server" lines
- BUG/MINOR: stream: fix age calculation in "show sess" output
- MINOR: stream/cli: rework "show sess" to better consider optional arguments
- MINOR: stream/cli: make "show sess" support filtering on front/back/server
- TESTS: quic: create first quic unittest
- MINOR: h3/hq-interop: restore function for standalone FIN receive
- MINOR/OPTIM: mux-quic: do not allocate rxbuf on standalone FIN
- MINOR: mux-quic: refine reception of standalone STREAM FIN
- MINOR: mux-quic: define globally stream rxbuf size
- MINOR: mux-quic: define rxbuf wrapper
- MINOR: mux-quic: store QCS Rx buf in a single-entry tree
- MINOR: mux-quic: adjust Rx data consumption API
- MINOR: mux-quic: adapt return value of qcc_decode_qcs()
- MAJOR: mux-quic: support multiple QCS RX buffers
- MEDIUM: mux-quic: handle too short data splitted on multiple rxbuf
- MAJOR: mux-quic: increase stream flow-control for multi-buffer alloc
- BUG/MINOR: cfgparse-tcp: relax namespace bind check
- MINOR: startup: adjust alert messages, when capabilities are missed
CAP_SYS_ADMIN support was added in order to access sockets in
namespaces. So let's adjust the alert at startup, where we check
preserved capabilities from global.last_checks, and mention
cap_sys_admin there as well.
Commit 5cbb278 introduced cap_sys_admin support, and enforced checks for
both binds and servers. However, when binding into a namespace, the bind
is done before dropping privileges. Hence, checking that we have
cap_sys_admin capability set in this case is not needed (and it would
decrease security to add it).
For users starting haproxy with a user other than root and without
cap_sys_admin, the bind should have already failed.
As a consequence, relax runtime check for binds into a namespace.
Support for multiple Rx buffers per QCS instance has been introduced by
previous patches. However, due to flow-control initial values, clients
were still unable to fully use this to increase their upload
throughput.
This patch increases max-stream-data-bidi-remote flow-control initial
values. A new define QMUX_STREAM_RX_BUF_FACTOR fixes the number of
concurrent buffers allocatable per QCS. It is set to 90.
Note that the connection flow-control initial value did not change. It
is still configured to be equivalent to bufsize multiplied by the
maximum concurrent streams. This ensures that Rx buffer allocation is
still constrained per connection, so that it won't be possible to have
all active QCS instances using their maximum Rx buffer count in
parallel.
Previous commit introduces support for multiple Rx buffers per QCS
instance. Contiguous data may be split across multiple buffers
depending on their offset.
A particular issue could arise with this new model. Indeed, the app_ops
rcv_buf callback can still deal with a single buffer at a time. This may
cause a deadlock in decoding if the app_ops layer cannot proceed due to
partial data, but such data are precisely split across two buffers. This
can for example happen during HTTP/3 frame header parsing.
To deal with this, a new function is implemented to force data
realignment between two contiguous buffers. It is called only when
app_ops rcv_buf returned 0 but data is available in the next buffer
after the current one. In this case, data are transferred from the next
into the current buffer via qcs_transfer_rx_data(). Decoding is then
restarted, which should ensure that the app_ops layer has enough data to
advance.
During this operation, special care is taken to remove both
qc_stream_rxbuf entries, as their offsets are adjusted. The next buffer
is only reinserted if there is remaining data in it, else it can be
freed.
This case is not easily reproducible as it depends on the HTTP/3 framing
used by the client. It seems to be easily reproduced though with quiche.
$ quiche-client --http-version HTTP/3 --method POST --body /tmp/100m \
"https://127.0.0.1:20443/post"
Implement support for multiple Rx buffers per QCS instance. This
requires several changes, mostly in qcc_recv() / qcc_decode_qcs() which
deal with STREAM frames reception and decoding. These multiple buffers
can be stored in the QCS rx.bufs tree which was introduced in an earlier
patch.
On STREAM frame reception, a buffer is retrieved from the QCS bufs tree,
or allocated if necessary, based on the data starting offset. Each
buffer is aligned on bufsize for convenience. This ensures there is no
overlap between two contiguous buffers. Special care is taken when
dealing with a STREAM frame which must be split and stored in two
contiguous buffers.
When decoding input data, qcc_decode_qcs() is still invoked with a
single buffer as input. This requires a new while loop to ensure
decoding is performed across multiple contiguous buffers until all data
are decoded or the app stream buffer is full.
Also, after qcs_consume() has been performed, the stream Rx channel is
immediately closed if FIN was already received and QCS now contains only
a single buffer with all remaining data. This is necessary as qcc_recv()
is unable to close the Rx channel if FIN is received for a buffer
different from the current readable offset.
Note that for now the stream flow-control value is still too low to
fully utilize this new infrastructure and improve clients' upload
throughput. Indeed, flow-control max-stream-data initial values are set
to match bufsize. This ensures that each QCS will use 1 buffer, or at
most 2 if data are split. A future patch will increase this value to
lift this limitation.
Change return value of qcc_decode_qcs(). It now directly returns the
value from app_ops rcv_buf callback. Function documentation is updated
to reflect this.
For now, qcc_decode_qcs() return value is ignored by callers, so this
patch should not have any functional change. However, it will become
necessary when implementing multiple Rx buffers per QCS, as a loop will
be implemented to invoke qcc_decode_qcs() on several contiguous buffers.
Decoding must however be stopped as soon as an error is returned by the
rcv_buf callback. This is also the case for a null value, which
indicates there is not enough data to continue decoding.
HTTP/3 data are converted into HTX via qcc_decode_qcs() function. On
completion, these data are removed from QCS Rx buffer via qcs_consume().
This patch adjusts the qcs_consume() API with several changes. Firstly,
the Rx buffer instance to operate on must now be specified as a new
argument to the function. Secondly, the buffer release performed when
all data were removed is extracted from qcs_consume() up to the
qcc_decode_qcs() caller.
No functional change with this patch. The objective is to have an API
which can be better adapted to multiple Rx buffers per QCS instance.
Convert the QCS rx buffer pointer to a tree container. Additionally,
the offset field of qc_stream_rxbuf is thus transformed into a tree
node.
For now, only a single Rx buffer is stored at most in QCS tree. Multiple
Rx buffers will be implemented in a future patch to improve QUIC clients
upload throughput.
Define a new type qc_stream_rxbuf. This is used as a wrapper around QCS
Rx buffer with encapsulation of the ncbuf storage. It is allocated via a
new pool. Several functions are adapted to be able to deal with
qc_stream_rxbuf as a wrapper instead of the previous plain ncbuf
instance.
No functional change should happen with this patch. For now, only a
single qc_stream_rxbuf can be instantiated per QCS. However, this new
type will be useful to implement multiple Rx buffer storage in a future
commit.
QCS uses ncbuf for STREAM data storage. This serves as a limit for
maximum STREAM buffering capacity, advertised via QUIC transport
parameters for initial flow-control values.
Define a new function qmux_stream_rx_bufsz() which can be used to
retrieve this Rx buffer size. This can be used both in MUX/H3 layers and
in QUIC transport parameters.
Reception of a standalone STREAM FIN is a corner case which may be
difficult to handle. In particular, care must be taken to ensure app_ops
rcv_buf() is always called to be notified about the FIN, even if the Rx
buffer is empty or the full demux flag is set. Otherwise, this could
prevent the closure of the QCS Rx channel.
To ensure this, rcv_buf() was systematically called if FIN was
received, with or without data payload. This could cause unnecessary
invocations when FIN is transmitted with data and the full demux flag is
set, or when data are received out of order.
This patch improves qcc_recv() by explicitly detecting the standalone
FIN case. Thus, rcv_buf() is only forcefully called in this case and if
all data were already previously received.
A STREAM FIN may be received without any payload. However, qcc_recv()
always called qcs_get_ncbuf() indiscriminately, which may allocate a QCS
Rx buffer. This is unneeded as there is no payload to store.
Improve this by skipping the qcs_get_ncbuf() invocation when dealing
with a standalone FIN signal. This should prevent superfluous buffer
allocation.
Previously, a function qcs_http_handle_standalone_fin() was implemented
to handle a received standalone FIN, bypassing app_ops layer decoding.
However, this was removed as app_ops layer interaction is necessary. For
example, HTTP/3 checks that FIN is never sent on the control uni stream.
This patch reintroduces qcs_http_handle_standalone_fin(), albeit in a
slightly diminished version. Most importantly, it is now the
responsibility of the app_ops layer itself to use it, to avoid the
shortcoming described above.
The main objective of this patch is to be able to support standalone FIN
in HTTP/0.9 layer. This is easily done via the reintroduction of
qcs_http_handle_standalone_fin() usage. This will be useful to perform
testing, as standalone FIN is a corner case which can easily be broken.
Define a first unit test dedicated to QUIC. A single test for now
ensures that variable-length decoding is compliant. This should be
extended in the future with new sets of tests.
With "show sess", particularly "show sess all", we're often missing the
ability to inspect only streams attached to a frontend, backend or server.
Let's just add these filters to the command. Only one at a time may be set.
One typical use case could be to dump streams attached to a server
after issuing "shutdown sessions server XXX" to figure out why some
wouldn't stop, for example.
The "show sess" CLI command parser is getting really annoying because
several options were added in an exclusive mode as the single possible
argument. Recently some cumulable options were added ("show-uri") but
the older ones were not yet adapted. Let's just make sure that the
various filters such as "older" and "age" now belong to the options
and leave only <id>, "all", and "help" for the first ones. The doc was
updated and it's now easier to find these options.
The "show sess" output reports an age that's based on the last byte of
the HTTP request instead of the stream creation date, due to a confusion
between logs->request_ts and the request_date sample fetch function. Most
of the time these are equal except when the request is not yet full for
any reason (e.g. wait-body). This explains why a few "show sess" could
report a few new streams aged by 99 days for example.
Let's perform the correct request timestamp calculation like the sample
fetch function does, by adding t_idle and t_handshake to the accept_ts.
Now the stream's age is correct and can be correctly used with the
"show sess older <age>" variant.
This issue was introduced in 2.9 and the fix can be backported to 3.0.
Invalid (incomplete) "server" or "peer" lines under peers section are now
properly ignored. For completeness, in this patch we add some reports so
that the user knows that incomplete lines were ignored.
For an incomplete server line, since it is tolerated (see GH #565), we
only emit a diag warning.
For an incomplete peer line, we report a real warning, as it is not
expected to have a peer line without an address:port specified.
Also, 'newpeer == curpeers->local' check could be simplified since
we already have the 'local_peer' variable which tells us that the
parsed line refers to a local peer.
If parse_server() is called under the peers section parser, and the
address needs to be parsed but it is missing, we directly return from
the function. However since 0fc136ce5b ("REORG: server: use parsing ctx
for server parsing"), parse_server() uses the parsing ctx to emit
warnings/errors, and the ctx must be reset before returning from the
function, yet this early return was overlooked. Because of that, any
ha_{warning,alert..} message reported after the early return from
parse_server() could cause messages to have an extra
"parsing [file:line]" info.
We fix that by ensuring parse_server() doesn't return without resetting
the parsing context.
It should be backported up to 2.6
In 8ba10fea6 ("BUG/MINOR: peers: Incomplete peers sections should be
validated."), some checks were relaxed in parse_server(), and extra logic
was added in the peers section parser in an attempt to properly ignore
incomplete "server" or "peer" statement under peers section.
This was done in response to GH #565, the main intent was that haproxy
should already complain about incomplete peers section (ie: missing
localpeer).
However, 8ba10fea69 explicitly skipped the peer cleanup upon missing
srv association for local peers. This is wrong because later haproxy
code always assumes that peer->srv is valid. Indeed, we got reports
that the (invalid) config below would cause segmentation fault on
all stable versions:
global
localpeer 01JM0TEPAREK01FQQ439DDZXD8
peers my-table
peer 01JM0TEPAREK01FQQ439DDZXD8
listen dummy
bind localhost:8080
To fix the issue, instead of by-passing some cleanup for the local
peer, handle this case specifically by doing the regular peer cleanup
and reset some fields set on the curpeers and curpeers proxy because
of the invalid local peer (do as if the peer was not declared).
It should still comply with requirements from #565.
This patch should be backported to all stable versions.
In the "peers" section parser, right after parse_server() is called, we
used to check whether the curpeers->peers_fe->srv pointer was set or not
to know if parse_server() successfuly added a server to the peers proxy,
server that we can then associate to the new peer.
However the check is wrong, as curpeers->peers_fe->srv points to the
last added server, if a server was successfully added before the
failing one, we cannot detect that the last parse_server() didn't
add a server. This is known to cause bug with bad "peer"/"server"
statements.
To fix the issue, we save a pointer on the last known
curpeers->peers_fe->srv before parse_server() is called, and we then
compare the save with the pointer after parse_server(), if the value
didn't change, then parse_server() didn't add a server. This makes
the check consistent in all situations.
It should be backported to all stable versions.
In check_config_validity() we evaluate some sample fetch expressions
(log-format, server rules, etc). These expressions may use external
files like maps.
If some particular 'default-path' was set in the global section before,
it was no longer applied to resolve file paths in
check_config_validity(), because parse_cfg() switches back to the
initial cwd at the end of config parsing.
This fixes the issue #2886.
This patch should be backported to all stable versions since 2.4.0
included.
This commit introduces the dont-parse-log option to disable log message
parsing, allowing raw log data to be forwarded without modification.
Also, it adds the assume-rfc6587-ntf option to frame log messages
using only non-transparent framing as per RFC 6587. This avoids
misparsing in certain cases (mainly with non-RFC-compliant messages).
The documentation is updated to include details on the new options and
their intended use cases.
This feature was discussed in GH #2856
This commit adds a new function `prepare_log_message` to initialize log
message buffers and metadata. This function sets default values for log
level and facility, ensuring a consistent starting state for log
processing. It also prepares the buffer and metadata fields, simplifying
subsequent log parsing and construction.
This commit adds parsing of options in log-forward config sections and
prepares the scenario to implement actual changes of behaviour. So far
we only take into account proxy->options2, which is the bit container
with more available positions.
cfg_parse_listen_match_option() takes a cfg_opt array as parameter, as
well as the current args, expected mode and cap bitfields.
It is expected to be used under the cfg_parse_listen() function or
similar.
Its goal is to remove code duplication around proxy->options and
proxy->options2 handling, since the same checks are performed for the
two. Also, this function could help to evaluate proxy options for
mode-specific proxies such as the log-forward section for instance:
by giving the expected mode and capability as input, the function
would only match compatible options.
Current pr_mode enum is a regular enum because a proxy only supports one
mode at a time. However it can be handy for a function to be given a
list of compatible modes for a proxy, and we can't do that using a
bitfield because pr_mode is not bitfield compatible (values share
the same bits).
In this patch we manually define pr_mode values so that they all use
separate bits, allowing a function to take a bitfield of compatible
modes as parameter.
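A minimal sketch of what such a bitfield-compatible enum looks like
(the values and exact set of modes are illustrative; the real
definition lives in the proxy header):

  enum pr_mode_sketch {
      PR_MODE_TCP    = 1 << 0,
      PR_MODE_HTTP   = 1 << 1,
      PR_MODE_CLI    = 1 << 2,
      PR_MODE_SYSLOG = 1 << 3,
      PR_MODE_PEERS  = 1 << 4,
  };

  /* a parser helper may now accept e.g. (PR_MODE_TCP | PR_MODE_SYSLOG)
   * as the set of modes an option is compatible with.
   */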
As reported by Nick Ramirez in GH #2891, it is currently not possible to
use log-profile without a log-format set on the proxy.
This is due to historical reason, because all log sending functions avoid
trying to send a log with empty message. But now with log-profile which
can override log-format, it is possible that some loggers may actually
end up generating a valid log message that should be sent! Yet from the
upper logging functions we don't know about that because loggers are
evaluated in lower API functions.
Thus, to avoid skipping potentially valid messages (thanks to log-profile
overrides), in this patch we postpone the decision to send or not empty
log messages in lower log API layer, ie: _process_send_log_final(), once
the log-profile settings were evaluated for a given logger.
A known side-effect of this change is that fe->log_count statistic may
be increased even if no log message is sent because the message was empty
and even the log-profile didn't help to produce a non empty log message.
But since configurations lacking proxy log-format are not supposed to be
used without log-profile (+ log steps combination) anyway it shouldn't be
an issue.
Historically, __send_log() was called with a terminating NULL byte
after the message payload. But now that __send_log() supports being
called without a terminating NULL byte (thanks to the size hint), and
that __send_log() actually strips any \n or NULL byte, we don't need to
bother with that anymore. So let's remove the extra logic around
__send_log() users where we added 1 extra byte for the terminating NULL
byte.
No change of behavior should be expected.
result.data.u.str.size was set to size+1 to take into account the
terminating NULL byte as per the comment. But this is wrong because the
caller is free to set size to just the right amount of bytes (without
the terminating NULL byte). In fact all smp API functions will not read
past str.data so there is no risk of uninitialized reads, but this
leaves an ambiguity for converters that may use the whole smp size to
perform transformations, and since we don't know about the "message"
memory origin, we cannot assume that its usable size is greater than
size. So we cap it to size just to be safe.
This bug was not known to cause any issue, it was spotted during code
review. It should be backported in 2.9 with b30bd7a ("MEDIUM: log/balance:
support for the "hash" lb algorithm")
This is a complementary patch to 0e1f389fe9 ("DOC: config: removing
"log-balance" references"): we properly removed all log-balance
references in the doc but there remained some in the code, let's fix
that.
It could be backported in 2.9 with 0e1f389fe9
Sample fetches %[accept_date] and %[request_date] with converters can
be used in the error-log-format string. But in most error cases they
fetch nothing, as error logs are produced on SSL handshake issues or
when an invalid PROXY protocol header is used. The stream object is
never allocated in such cases and smp_fetch_accept_date() just simply
returns 0.
There is a need to have a custom date format (ISO8601) also in the
error logs, along with normal logs. When sess_build_logline_orig()
builds the log line it always copies the accept date to the strm_logs
structure. When the stream is absent, the accept date is copied from
the session object.
So, if the stream object wasn't allocated, let's use the session date
info in smp_fetch_accept_date(). This then allows, in sample_process(),
to apply different converters and formats to the fetched date.
This fixes the issue #2884.
Add a new macro, REGISTER_UNITTEST(), that will automatically make sure
we call hap_register_unittest(), instead of having to create a function
that will do so.
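Usage could look like the following (a sketch; the callback is assumed
to be an argc/argv-style function returning the process exit code):

  /* hypothetical example test, run with "haproxy -U mytest" when the
   * binary is built with DEBUG_UNIT.
   */
  static int mytest_run(int argc, char **argv)
  {
      return 0; /* 0 = success, used as the exit code */
  }
  REGISTER_UNITTEST("mytest", mytest_run);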
OpenSSL versions greater than 3.0 do not use the same API when
manipulating EVP_PKEY structures: the EC_KEY API is deprecated and it's
not possible anymore to get an EC_GROUP and simply call
EC_GROUP_get_curve_name().
Instead, one must call EVP_PKEY_get_utf8_string_param with the
OSSL_PKEY_PARAM_GROUP_NAME parameter, but this would result in a SECG
curve name instead of the NIST curve name returned in previous versions
(ex: secp384r1 vs P-384).
This patch adds 2 functions:
- the first one looks for a curve name and converts it to an OpenSSL
  NID.
- the second one converts a NID to a NIST curve name
The list only contains P-256, P-384 and P-521 for now; it could be
extended in the future with more curves.
`make unit-tests` runs shell scripts from tests/unit/.
The run-unittests.sh script will look for any .sh in tests/unit/ and
will call it twice:
- first with the 'check' argument in order to decide if we should skip
  the test or not
- second with the 'run' argument to actually run the test
A simple test could be written this way:
#!/bin/sh

check() {
    ${HAPROXY_PROGRAM} -cc 'feature(OPENSSL)'
    command -v socat
}

run() {
    ${HAPROXY_PROGRAM} -dI -f ${ROOTDIR}/examples/quick-test.cfg -c
}

case "$1" in
    "check")
        check
        ;;
    "run")
        run
        ;;
esac
The tests *MUST* be written in POSIX shell in order to be portable, and
any special commands should be tested with `command -v` before using it.
Tests are run with `sh -e` so everything must be tested.
Doing unit tests with haproxy was always a bit difficult, as some of
the functions you want to test depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.
This patch adds a way to register a function that can be called with
the "-U" parameter on the command line; it will be executed just after
step_init_1() and will exit the process with its return value as the
exit code.
When using the -U option, every keyword after this option is passed to
the callback and can be used as a parameter, giving the ability to
handle complex arguments if required by the test.
HAProxy needs to be built with DEBUG_UNIT to activate this feature.
Implement a converter which takes an EVP_PKEY and converts it to a
public JWK key. This is the first step of the JWS implementation.
It supports both EC and RSA keys.
Known to work with:
- LibreSSL
- AWS-LC
- OpenSSL > 1.1.1
As reported in issue #2882, using "no-send-proxy-v2" on a server line does
not properly disable the use of proxy-protocol if it was enabled in a
default-server directive in combination with other PP options. The reason
for this is that the sending of a proxy header is determined by a test on
srv->pp_opts without any distinction, so disabling PPv2 while leaving other
options results in a PPv1 header being sent.
Let's fix this by explicitly testing for the presence of either send-proxy
or send-proxy-v2 when deciding to send a proxy header.
This can be backported to all versions. Thanks to Andre Sencioles (@asenci)
for reporting the issue and testing the fix.
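Conceptually, the decision becomes something like the following (a
sketch; the exact flag names and call site are taken as assumptions
here):

  /* only send a PROXY protocol header if one of the send-proxy options
   * is still enabled once "no-send-proxy-v2" has cleared its bit.
   */
  if (srv->pp_opts & (SRV_PP_V1 | SRV_PP_V2)) {
      /* ... build and send the PROXY protocol header ... */
  }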
HTTP/0.9 parser was recently updated to support truncated requests in
rcv_buf operation. However, this caused a leak as input buffer is
allocated early.
In fact, the leak was already present in case of fatal errors. Fix this
by first delaying the buffer allocation, so that the initial checks are
performed before it. Then, ensure that the buffer is released in case of
a later error.
This is considered as minor, as HTTP/0.9 is reserved for experiment and
QUIC interop usages.
This should be backported up to 2.6.
At least one user would like to allow a standards-violating client to
set up WebSocket connections through haproxy to a standards-violating
server that accepts them. While this should of course never be done
over the internet,
it can make sense in the datacenter between application components which do
not need to mask the data, so this typically falls into the situation of
what the "accept-unsafe-violations-in-http-request" option and the
"accept-unsafe-violations-in-http-response" option are made for.
See GH #2876 for more context.
This patch relaxes the test on the "Sec-Websocket-Key" header field in
the request, and on the "Sec-Websocket-Accept" header in the response,
when these respective options are set.
The doc was updated to reference this addition. This may be backported
to 3.1 but preferably not further.
Don't reserve space for the HTX overhead on receive if the demux buffer is
not empty. Otherwise, the demux buffer may be erroneously reported as full
and this may block records processing. Because of this bug, a ping-pong loop
till timeout between data reception and demux process can be observed.
This bug was introduced by the commit 5f927f603 ("BUG/MEDIUM: mux-fcgi:
Properly handle read0 on partial records"). To fix the issue, if the demux
buffer is not empty when we try to receive more data, all free space in the
buffer can now be used. However, if the demux buffer is empty, we still try
to keep it aligned with the HTX.
This patch must be backported to 3.1.
Extends the HTTP/0.9 layer to be able to deal with incomplete requests.
Instead of an error, 0 is returned, thus avoiding a stream closure.
QUIC-MUX may then retry the rcv_buf operation later if more data is
received, similarly to the HTTP/3 layer.
Note that HTTP/0.9 is only used for testing and interop purposes. As
such, this limitation is not considered as a bug. It is probably not
worth backporting.
Return value of h3_rcv_buf() is incorrectly documented. Indeed, it may
return a positive value to indicate that input bytes were converted into
HTX. This is especially important, as caller uses this value to consume
the reported data amount in QCS Rx buffer.
This should be backported up to 2.6. Note that on 2.8, h3_rcv_buf() was
named h3_decode_qcs().
HTTP/3 specification allows a server to emit the entire response even if
only a partial request was received. In particular, this happens when
request STREAM FIN is delayed and transmitted in an empty payload frame.
In this case, qcc_abort_stream_read() was used by the HTTP/3 layer to
emit a STOP_SENDING. Remaining received data were not transmitted to the
stream layer as they were simply discarded. However, this prevents FIN
transmission to the stream layer. This causes the transfer to be
considered as prematurely closed, resulting in a cL-- log line status.
This is misleading to users who could interpret it as if the response
was not sent.
To fix this, disable STOP_SENDING emission on full preemptive response
emission. The Rx channel is kept open until the client closes it with
either a FIN or a RESET_STREAM. This ensures that the FIN signal can be
relayed to the stream layer, which allows the transfer to be reported as
completed.
This should be backported up to 2.9.
The PROXY v2 TLVs were not properly initialized when defined with
"set-proxy-v2-tlv-fmt" keyword, which could have caused a crash when
validating the configuration or malfunction (e.g. when used in
combination with "server-template" and/or "default-server").
The issue was introduced with commit 6f4bfed3a ("MINOR: server: Add
parser support for set-proxy-v2-tlv-fmt").
This should be backported up to 2.9.
Maxconn is a bit of a misnomer when it comes to servers, as it doesn't
control the maximum number of connections we establish to a server, but
the maximum number of simultaneous requests. So add "strict-maxconn",
which will make it so we will never establish more connections than
maxconn.
It extends the meaning of the "restricted" setting of
tune.takeover-other-tg-connections, as it will also attempt to get idle
connections from other thread groups if strict-maxconn is set.
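For example, in a configuration this could look like the following
(illustrative snippet, addresses and names are made up):

  backend app
      # never open more than 100 connections to this server, even if
      # more than 100 requests are waiting
      server s1 192.0.2.10:8080 maxconn 100 strict-maxconn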
Allow haproxy to take over idle connections from other thread groups
than our own. To control that, add a new tunable,
tune.takeover-other-tg-connections. It can have 3 values, "none", where
we won't attempt to get connections from the other thread group (the
default), "restricted", where we only will try to get idle connections
from other thread groups when we're using reverse HTTP, and "full",
where we always try to get connections from other thread groups.
Unless there is a special need, it is advised to use "none" (or
restricted if we're using reverse HTTP) as using connections from other
thread groups may have a performance impact.
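For example (illustrative snippet):

  global
      # only pull idle connections from other thread groups when using
      # reverse HTTP; use "none" (default) or "full" otherwise
      tune.takeover-other-tg-connections restricted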
In pollers that support it, provide the generation number in addition
to the fd, and, when an event happened, if the generation number is the
same but the tgid changed, then assume the fd was taken over by a thread
from another thread group, and just delete the event from the current
thread's poller, as we no longer want to hear about it.
Add a fixup_tgid_takeover() method to pollers for which it makes sense
(epoll, kqueue and evport). That method can be called after a takeover
of a fd from a different thread group, to make sure the poller's
internal structure reflects the new state.
Check that the call to epoll_ctl() succeeds, and if it does not, if
we're adding a new event and it fails with EEXIST, then delete and
re-add the event. There are a few cases where we may already have events
for a fd. If epoll_ctl() fails for any reason, use BUG_ON to make sure
we immediately crash, as this should not happen.
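Schematically, the epoll part behaves along these lines (a simplified
sketch, not the actual poller code):

  if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == -1) {
      /* the fd may already be registered, e.g. after a takeover from
       * another thread group: delete and re-add the event.
       */
      if (errno == EEXIST) {
          if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, &ev) == -1 ||
              epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == -1)
              abort();   /* stands in for BUG_ON(): must never happen */
      }
      else
          abort();       /* stands in for BUG_ON(): must never happen */
  }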
Users may have good reasons for using "tune.idle-pool.shared off", one of
them being the cost of moving cache lines between cores, or the kernel-
side locking associated with moving FDs. For this reason, when getting
close to the file descriptors limits, we must not try to kill adjacent
threads' FDs when the sharing of pools is disabled. This is extremely
expensive and kills the performance. We must limit ourselves to our local
FDs only. In such cases, it's up to the users to configure a large enough
maxconn for their usages.
Before this patch, perf top reported 9% CPU usage in connect_server()
on the trylock used to kill connections when running at 4800 conns for
a global maxconn of 6400 on a 128-thread server. Now it doesn't spend
its time there anymore, and performance has increased by 12%. Note,
it was verified that disabling the locks in such a case has no effect
at all, so better keep them and stay safe.
In issue #2861, Jarosław Rzeszótko reported another issue with
"show threads", this time in relation with the conversion of a stream's
accept date to local time. Indeed, if the libc was interrupted in this
same function, it could have been interrupted with a lock held, then
it's no longer possible to dump the date, and we face a deadlock.
This is easy to reproduce with logging enabled.
Let's detect we come from a signal handler and do not try to resolve
the time to localtime in this case.
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
Annika Wickert reported some occasional disconnections between haproxy
and varnish when communicating over HTTP/2, with varnish complaining
about protocol errors while captures looked apparently normal. Nils
Goroll managed to reproduce this on varnish by injecting the capture of
the outgoing haproxy traffic and noticed that haproxy was forwarding a
header value containing a trailing space, which is now explicitly
forbidden since RFC9113.
It turns out that the only way for such a header to pass through haproxy
is to arrive in h2 and not be edited, in which case it will arrive in
HTX with its undesired spaces. Since the code dealing with HTX headers
always trims spaces around them, these are not observable in dumps, but
only when started in debug mode (-d). Conversions to/from h1 also drop
the spaces.
With this patch we trim LWS both on input and on output. This way we
always present clean headers in the whole stack, and even if some are
manually crafted by the configuration or Lua, they will be trimmed on
the output.
This must be backported to all stable versions.
Thanks to Annika for the helpful capture and Nils for the help with
the analysis on the varnish side!
This is the introduction of "minsize-req" and "minsize-res".
These two options allow you to set the minimum payload size required for
compression to be applied.
This helps save CPU on both server and client sides when the payload does
not need to be compressed.
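For example, assuming the new keywords follow the usual
"compression <keyword>" syntax (illustrative snippet):

  backend app
      compression algo gzip
      compression type text/html text/plain
      # don't bother compressing responses smaller than 1kB
      compression minsize-res 1024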
There's a barrier after releasing the current task in the scheduler.
However it's improperly placed, it's done after pool_free() while in
fact it must be done immediately after resetting the current pointer.
Indeed, the purpose is to make sure that nobody sees the task as valid
when it's in the process of being released. This is something that
could theoretically happen if interrupted by a signal in the inlined
code of pool_free() if the compiler decided to postpone the write to
->current. In practice since nothing fancy is done in the inlined part
of the function, there's currently no risk of reordering. But it could
happen if the underlying __pool_free() were to be inlined for example,
and in this case we could possibly observe th_ctx->current pointing
to something currently being destroyed.
With the barrier between the two, there's no risk anymore.
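In other words, the ordering now conceptually looks like this (a
sketch of the scheduler epilogue, not the literal code):

  th_ctx->current = NULL;
  __ha_barrier_store();   /* make the NULL visible before any free */
  pool_free(pool_head_task, t);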
As seen in issue #2861, dladdr_and_size() can be quite expensive and
will often hold a mutex in the underlying library. It becomes a real
problem when issuing lots of "show threads" or wdt warnings in parallel
because threads will queue up waiting for each other to finish, adding
to their existing latency that possibly caused the warning in the first
place.
Here we're taking a different approach. If the thread is not isolated
and not panicking, it's doing unimportant stuff like showing threads
or warnings. In this case we try to grab a lock, and if we fail because
another thread is already there, we just pretend we cannot resolve the
symbol. This is not critical because then we fall back to the already
used case which consists in writing "main+<offset>". In practice this
will almost never happen except in bad situations which could have
otherwise degenerated.
The stream dump function is called from signal handlers (warning, show
threads, panic). It makes use of read_freq_ctr() which might possibly
block if it tries to access a locked freq_ctr in the process of being
updated, e.g. by the current thread.
Here we're relying on the non-blocking API instead. It may return incorrect
values (typically smaller ones after resetting the curr counter) but at
least it will not block.
This needs to be backported to stable versions along with the previous
commit below:
MINOR: freq_ctr: provide non-blocking read functions
At least 3.1 is concerned as the warnings tend to increase the risk of
this situation appearing.
Some code called by the debug handlers in the context of a signal handler
accesses to some freq_ctr and occasionally ends up on a locked one from
the same thread that is dumping it. Let's introduce a non-blocking version
that at least allows to return even if the value is in the process of being
updated, it's less problematic than hanging.
In __strm_dump_to_buffer(), we call conn_get_src()/conn_get_dst() to
try to retrieve the connection's IP addresses. But this function may be
called from a signal handler to dump a currently running stream, and if
the addresses were not allocated yet, a pool_alloc() will be performed
while we might possibly already be running pools code, resulting in
pool list corruption.
Let's just make sure we don't call these sensitive functions there when
called from a signal handler.
This must be backported at least to 3.1 and ideally all other versions,
along with this previous commit:
MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.
Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
This function may be called from a signal handler during a warning,
a panic or a show thread. We need to be more cautious about what may
or may not be dereferenced since an h1s is not necessarily fully
initialized. Loops of "show threads" sometimes manage to crash when
dereferencing a null h1s->sd, so let's guard it and add a comment
reminding about the unusual call place.
This can be backported to the relevant versions.
co_data() was instrumented to detect cases where c->output > data and
emits a warning if that's not correct. The problem is that it happens
quite a bit during "show threads" if it interrupts traffic anywhere,
and that in some environments building with -DDEBUG_STRICT_ACTION=3,
it will kill the process.
Let's just open-code the channel functions that make access to co_data(),
there are not that many and the operations remain very simple.
This can be backported to 3.1. It didn't trigger in earlier versions
because they didn't have this CHECK_IF_HOT() test.
global_now_ms is shared between threads so we must give hint to the
compiler that read/writes operations should be performed atomically.
Everywhere global_now_ms was used, atomic ops were used, except in
clock_update_global_date() where a read was performed without using
atomic op. In practise it is not an issue because on most systems
such reads should be atomic already, but to prevent any confusion or
potential bug on exotic systems, let's use an explicit _HA_ATOMIC_LOAD
there.
This may be backported up to 2.8
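i.e. the read now conceptually looks like this (sketch):

  /* explicit atomic load, matching the atomic stores used elsewhere */
  unsigned int old_now_ms = _HA_ATOMIC_LOAD(&global_now_ms);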
When the connection for sink_forward_{oc}_applet fails or a previous one
is destroyed, the sft->appctx is instantly released.
However process_sink_forward_task(), which may run at any time, iterates
over all known sfts and tries to create sessions for orphan ones.
It means that instantly after sft->appctx is destroyed, a new one will
be created, thus a new connection attempt will be made.
It can be an issue with tcp log-servers or sink servers, because if the
server is unavailable, process_sink_forward() will keep looping without
any delay until the applet survives (ie: the connection succeeds),
which results in unexpected CPU usage on the threads responsible for
that task.
Instead, we add a tempo logic so that a delay of 1 second is applied
between two retries. Of course the initial attempt is not delayed.
This could be backported to all stable versions.
While reviewing the code in an attempt to fix GH #2875, I stumbled
on another case similar to aac570c ("BUG/MEDIUM: uxst: fix outgoing
abns address family in connect()") that caused abns(z) addresses to
fail when used as log targets.
The underlying cause is the same as aac570c, which is the rework of the
unix socket families in order to support custom addresses for different
addressing schemes, where a real_family() call was overlooked before
passing a haproxy-internal address struct to a socket-oriented syscall.
To fix the issue, we first copy the target's addr, and then leverage
real_family() to set the proper low-level address family that is passed
to sendmsg() syscall.
It should be backported to 3.1.
It was proved in GH #2875 that the regtest was broken, at least for the
server-side abnsz, as the connect() was not performed using the proper
family, which resulted in the kernel refusing to perform the call while
the reg-test still reported success.
Indeed, in the test we used vtest client to connect to haproxy, which
then routed the request to another haproxy instance listening on an
abnsz socket, and this last haproxy was the one to answer the http
request.
As we only used "rxresp" in the vtest client, the test succeeded with empty
responses, which happened because the server connection failed on the
first haproxy process.
Since we reworked the unix socket families in order to support custom
addresses for different addressing schemes, we've been using extra
values for the ss_family field in sockaddr_storage. These ones have
to be adjusted before calling bind() or connect(). It turns out that
after the abns/abnsz updates in 3.1, the connect() code was not adjusted
to take care of the change, resulting in AF_CUST_ABNS or AF_CUST_ABNSZ
being placed in the address that was passed to connect().
The right approach is to locally copy the address, get its length,
fix up the family and use the fixed value and length for connect().
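A simplified sketch of that approach (error handling omitted):

    /* work on a local copy so that the haproxy-internal AF_CUST_*
     * family is never passed to the kernel
     */
    struct sockaddr_storage addr_cpy = *addr;
    socklen_t addr_len = get_addr_len(&addr_cpy);

    addr_cpy.ss_family = real_family(addr_cpy.ss_family);
    if (connect(fd, (struct sockaddr *)&addr_cpy, addr_len) == -1) {
        /* EINPROGRESS and friends handled as usual */
    }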
This must be backported to 3.1. Many thanks to @Mewp for reporting
this issue in github issue #2875.
When the "peers" keyword is followed by more than one argument and it's the first
"peers" section in the config, cfg_parse_peers() detects it and exits with an
"ERR_ALERT|ERR_FATAL" err_code.
So, the upper layer parser, parse_cfg(), continues and parses the next keyword
"peer", and then it tries to check the global cfg_peers, which should contain
"my_cluster". The global cfg_peers is still NULL, because after alerting the user
in alertif_too_many_args(), cfg_parse_peers() exited:
  peers my_cluster __some_wrong_data__
      peer haproxy1 1.1.1.1 1000
In order to fix this, let's add ERR_ABORT when the "peers" keyword is followed by
more than one argument. This way parse_cfg() stops immediately and
terminates haproxy with the "too many args for peers my_cluster..." alert message.
It's more reliable than adding "if (cfg_peers != NULL)" checks in the "peer"
subparser, as we may have many "peers" sections:
  peers my_another_cluster
      peer haproxy1 1.1.1.2 1000

  peers my_cluster __some_wrong_data__
      peer haproxy1 1.1.1.1 1000
In addition, without ERR_ABORT, for the example above parse_cfg() would parse
the whole configuration until the end and only then terminate haproxy with the
"too many args..." alert. Peer haproxy1 would be wrongly associated with
my_another_cluster.
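A sketch of the intended control flow in cfg_parse_peers() (simplified):

    /* "peers <name>" accepts a single argument: abort the whole parsing
     * instead of only raising a fatal alert, so that parse_cfg() does
     * not go on with a NULL cfg_peers
     */
    if (alertif_too_many_args(1, file, linenum, args, &err_code)) {
        err_code |= ERR_ABORT;
        goto out;
    }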
This fixes the issue #2872.
This should be backported to all stable versions.
In the SPOP protocol, ACK frames with an empty payload are allowed. However, in
that case, because only the payload is transferred, there is no data to
return to the SPOE applet. Only the end of input is reported. Thus the
applet is never woken up. It means that the SPOE filter will be blocked
during the processing timeout and will finally return an error.
To work around this issue, a NOOP action is introduced with the value 0. It
is only an internal action for now; it does not exist in the SPOP
protocol. When an ACK frame with an empty payload is received, this noop
action is transferred to the SPOE applet instead of nothing. Thanks to this
trick, the applet is properly notified. This works because unknown actions
are ignored by the SPOE filter.
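Purely illustrative sketch (variable names are hypothetical, not the actual
spoe code):

    /* the ACK payload is empty: forward a single byte holding the
     * internal NOOP action (value 0) so that the applet still receives
     * something and gets woken up; unknown actions are ignored by the
     * SPOE filter
     */
    if (!payload_len)
        b_putchr(buf, 0 /* internal NOOP action */);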
This patch must be backported to 3.1.
The commit 7214dcd52 ("BUG/MEDIUM: applet: Don't pretend to have more data
to handle EOI/EOS/ERROR") introduced a regression. Because of this patch, it
was possible to handle EOI/EOS/ERROR applet flags too early while the applet
was waiting for more room to transfer the last output data.
This bug can be encountered with any applet using its own buffers (cache and
stats for instance). And depending on the configuration and the timing, the
data may be truncated or the stream may be blocked, infinitely or not.
Streams blocked infinitely were observed with the cache applet and the HTTP
compression enabled.
For the record, it is important to detect EOI/EOS/ERROR applet flags to be
able to report the corresponding event on the SE and by transitivity on the
SC. Most of the time, this happens when some data should be transferred to the
stream. The .rcv_buf callback function is called and these flags are
properly handled. However, some applets may also report them spontaneously,
outside of any data transfer. In that case, the .rcv_buf callback is not
called.
It is the purpose of this patch (and the one above): being able to detect
pending EOI/EOS/ERROR applet flags. However, we must be sure not to handle
them too early at this place. When these flags are set, it means no more
data will be produced by the applet. So we must only wait until everything
has been transferred to the stream, which happens when the applet is no
longer waiting for more room.
This patch must be backported to 3.1 with the one above.
Released version 3.2-dev6 with the following main changes :
- BUG/MEDIUM: debug: close a possible race between thread dump and panic()
- DEBUG: thread: report the spin lock counters as seek locks
- DEBUG: thread: make lock time computation more consistent
- DEBUG: thread: report the wait time buckets for lock classes
- DEBUG: thread: don't keep the redundant _locked counter
- DEBUG: thread: make lock_stat per operation instead of for all operations
- DEBUG: thread: reduce the struct lock_stat to store only 30 buckets
- MINOR: lbprm: add a new callback ->server_requeue to the lbprm
- MEDIUM: server: allocate a tasklet for asyncronous requeuing
- MAJOR: leastconn: postpone the server's repositioning under contention
- BUG/MINOR: quic: reserve length field for long header encoding
- BUG/MINOR: quic: fix CRYPTO payload size calcul for encoding
- MINOR: quic: simplify length calculation for STREAM/CRYPTO frames
- BUG/MINOR: mworker: section ignored in discovery after a post_section_parser
- BUG/MINOR: mworker: post_section_parser for the last section in discovery
- CLEANUP: mworker: "program" section does not have a post_section_parser anymore
- MEDIUM: initcall: allow to register mutiple post_section_parser per section
- CI: cirrus-ci: bump FreeBSD image to 14-2
- DOC: initcall: name correctly REGISTER_CONFIG_POST_SECTION()
- REGTESTS: stop using truncated.vtc on freebsd
- MINOR: quic: refactor STREAM encoding and splitting
- MINOR: quic: refactor CRYPTO encoding and splitting
- BUG/MEDIUM: fd: mark FD transferred to another process as FD_CLONED
- BUG/MINOR: ssl/cli: "show ssl crt-list" lacks client-sigals
- BUG/MINOR: ssl/cli: "show ssl crt-list" lacks sigals
- MINOR: ssl/cli: display more filenames in 'show ssl cert'
- DOC: watchdog: document the sequence of the watchdog and panic
- MINOR: ssl: store the filenames resulting from a lookup in ckch_conf
- MINOR: startup: allow hap_register_feature() to enable a feature in the list
- MINOR: quic: support frame type as a varint
- BUG/MINOR: startup: leave at first post_section_parser which fails
- BUG/MINOR: startup: hap_register_feature() fix for partial feature name
- BUG/MEDIUM: cli: Be sure to drop all input data in END state
- BUG/MINOR: cli: Wait for the last ACK when FDs are xferred from the old worker
- BUG/MEDIUM: filters: Handle filters registered on data with no payload callback
- BUG/MINOR: fcgi: Don't set the status to 302 if it is already set
- MINOR: ssl/crtlist: split the ckch_conf loading from the crtlist line parsing
- MINOR: ssl/crtlist: handle crt_path == cc->crt in crtlist_load_crt()
- MINOR: ssl/ckch: return from ckch_conf_clean() when conf is NULL
- MEDIUM: ssl/crtlist: "crt" keyword in frontend
- DOC: configuration: document the "crt" frontend keyword
- DEV: h2: add a Lua-based HTTP/2 connection tracer
- BUG/MINOR: quic: prevent crash on conn access after MUX init failure
- BUG/MINOR: mux-quic: prevent crash after MUX init failure
- DEV: h2: fix flags for the continuation frame
- REGTESTS: Fix truncated.vtc to send 0-CRLF
- BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut
- Revert "REGTESTS: stop using truncated.vtc on freebsd"
- MINOR: mux-quic: define a QCC application state member
- MINOR: mux-quic/h3: emit SETTINGS via MUX tasklet handler
- MINOR: mux-quic/h3: support temporary blocking on control stream sending
When the HTTP/3 layer is initialized via QUIC MUX, it first emits a SETTINGS
frame on a unidirectional control stream. However, this could be
prevented if the client did not provide initial flow control.
Previously, QUIC MUX was unable to deal with such a situation. Thus, the
connection was closed immediately and no transfer could occur. Improve
this by extending the QUIC MUX application layer API: initialization may
now return a transient error. This allows the MUX to continue to use the
connection normally. Initialization will be retried periodically
until it can succeed.
This new API makes it possible to deal with the flow control issue described
above. Note that this patch is not considered as a bug fix. Indeed, clients are
strongly advised to provide enough flow control for a SETTINGS frame
exchange.
Previously, the QUIC MUX application layer was installed and initialized via
MUX init. However, the latter stage involves I/O operations, for example
when using HTTP/3 with the emission of a SETTINGS frame.
Change this to prevent any I/O operations during MUX init. As such, the
finalize app_ops callback is now called during the first invocation of
qcc_io_send(), in the context of the MUX tasklet. To implement this, a new
application state value is added, to detect the transition from NULL to
INIT stage.
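A rough sketch of how such a deferred, retryable initialization may look
(state names simplified and the return-value convention of finalize is only
assumed here):

    /* called from the MUX tasklet (qcc_io_send()): try to finalize the
     * application layer; a transient failure (e.g. no flow control yet
     * for the H3 control stream) is not fatal and will be retried on a
     * later invocation
     */
    if (qcc->app_st < QCC_APP_ST_INIT) {
        int ret = qcc->app_ops->finalize(qcc->ctx);

        if (ret < 0)
            return -1;   /* definitive error: close the connection */
        if (ret > 0)
            return 0;    /* transient error: retried on a later tasklet run */
        qcc->app_st = QCC_APP_ST_INIT;
    }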
Introduce a new QCC field to track the current application layer state.
For the moment, only the INIT and SHUT states are defined. This allows
replacing the older flag QC_CF_APP_SHUT.
This commit does not bring major changes. It is only necessary to permit
future evolutions on QUIC MUX. The only noticeable change is that QMUX
traces can now display this new field.
This reverts commit 0b9a75e8781593c250f6366a64a019018ade688e.
Thanks to the previous fixes ("REGTESTS: Fix truncated.vtc to send 0-CRLF" and
"BUG/MINOR: mux-h2: Properly handle full or truncated HTX messages on shut"),
this script can be reenabled for FreeBSD.
On shut, truncated HTX messages were not properly handled by the H2
multiplexer. Depending on how data were emitted, a chunked HTX message
without the 0-CRLF could be considered as full and an empty DATA frame with
the ES flag set could be emitted instead of a RST_STREAM(CANCEL) frame.
In the H2 multiplexer, when a shut is performed, an HTX message is
considered as truncated if more HTX data are still expected. It is based on
the presence or not of the H2_SF_MORE_HTX_DATA flag on the H2 stream.
However, this flag is set or unset depending on the HTX extra field
value. This field is used to state how much data must still be
transferred, based on the announced data length. For a message with a
content-length, this assumption is valid. But for a chunked message, it is
not true. Only the length of the current chunk is announced. So we cannot
rely on this field in that case to know if a message is full or not.
Instead, we must rely on the HTX start-line flags to know if more HTX data
are expected or not. If the xfer length is known (the HTX_SL_F_XFER_LEN flag
is set on the HTX start-line), it means that more data are always expected,
until the end of message is reached (the HTX_FL_EOM flag is set on the HTX
message). This is true for bodyless messages because the end of message is
reported with the end of headers. This is also true for tunneled messages
because the end of message is received before switching the H2 stream into
tunnel mode.
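A simplified sketch of the resulting check (not the exact mux-h2 code):

    /* on shut, consider the message truncated if the transfer length is
     * known but the end of message was not reached yet; in that case an
     * RST_STREAM(CANCEL) must be sent instead of an empty DATA+ES frame
     */
    const struct htx_sl *sl = http_get_stline(htx);

    if (sl && (sl->flags & HTX_SL_F_XFER_LEN) && !(htx->flags & HTX_FL_EOM)) {
        /* truncated message */
    }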
This patch must be backported as far as 2.8.
qmux_init() may fail for several reasons. In this case, connection
resources are freed and a CONNECTION_CLOSE will be emitted via the
underlying quic_conn instance.
In case of qmux_init() failure, qcc_release() is used to clean up
resources, but the QCC <conn> member is first reset to NULL, as the
connection release must be delayed. Some cleanup operations are thus
skipped; one of them is the reset of the <ctx> connection member to
NULL. This may cause a crash as <ctx> is a dangling pointer after the QCC
release. One possible reproducer is to activate QMUX traces,
which will cause a segfault on the qmux_init() error leave trace.
To fix this, simply reset <ctx> to NULL manually on qmux_init() failure.
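A simplified sketch of the error path after this fix:

    /* the connection will be released later via its session, so
     * qcc->conn is cleared before qcc_release(); because of that,
     * qcc_release() cannot detach the connection context itself and it
     * must be done here to avoid leaving a dangling pointer
     */
    qcc->conn = NULL;
    qcc_release(qcc);
    conn->ctx = NULL;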
This must be backported up to 3.0.
Initially, QUIC-MUX was responsible to reset quic_conn <conn> member to
NULL when MUX was released. This was performed via qcc_release().
However, qcc_release() is also used on qmux_init() failure. In this
case, the connection must be freed via its session, so the QCC <conn> member is
reset to NULL prior to qcc_release(), which prevents the quic_conn <conn>
member from also being reset. As the connection is freed soon after,
quic_conn <conn> is a dangling pointer, which may cause crashes.
This bug should be very rare as it first implies that QUIC-MUX
initialization has failed (for example due to a memory allocation error).
Also, the <conn> member is rarely used by the quic_conn instance. In fact, the
only reproducible crash was obtained with QUIC traces activated, as in this
case the connection is accessed via the quic_conn in the __trace_enabled()
function.
To fix this, detach connection from quic_conn via the XPRT layer instead
of the MUX. More precisely, this is performed via quic_close(). This
should ensure that it will always be conducted, either on normal
connection closure or after special conditions such as MUX init
failure.
This should be backported up to 2.6.
The following config is sufficient to trace H2 exchanges between a client
and a server:
  global
      lua-load "dev/h2/h2-tracer.lua"

  listen h2_sniffer
      mode tcp
      bind :8002
      filter lua.h2-tracer #hex
      server s1 127.0.0.1:8003
The commented "hex" argument, when enabled, will also display full frames in
hex (not recommended). The connections are prefixed with a 3-hex digit number in
order to also support a bit of multiplexing without impacting the reading
too much. The screen is split in two, with the request on the left and
the response on the right. Here's an example of what it shows between an
haproxy backend and an haproxy frontend, both in H2, when submitting a
curl request for /?s=30k handled by httpterm:
[001] ### req start
[001] [PREFACE len=24]
[001] [SETTINGS sid=0 len=24 (bytes=24)]
[001] | ### res start
[001] | [SETTINGS sid=0 len=18 (bytes=27)]
[001] | [SETTINGS ACK sid=0 len=0 (bytes=0)]
[001] [SETTINGS ACK sid=0 len=0 (bytes=56)]
[001] [HEADERS EH+ES sid=1 len=47 (bytes=47)]
[001] | [HEADERS EH sid=1 len=101 (bytes=15351)]
[001] | [DATA sid=1 len=15126 (bytes=15241)]
[001] | [DATA sid=1 len=1258 (bytes=106)]
[001] | ... -106 = 1152
[001] | ... -1152 = 0
[001] [WINDOW_UPDATE sid=1 len=4 (bytes=43)]
[001] [WINDOW_UPDATE sid=0 len=4 (bytes=30)]
[001] [WINDOW_UPDATE sid=1 len=4 (bytes=17)]
[001] [WINDOW_UPDATE sid=0 len=4 (bytes=4)]
[001] | [DATA ES sid=1 len=14336 (bytes=14336)]
[001] [WINDOW_UPDATE sid=0 len=4 (bytes=4)]
[001] ### req end: 31080 bytes total
[001] | [GOAWAY sid=0 len=8 (bytes=8)]
[001] | ### res end: 31097 bytes total
It deserves some improvements. For instance, at the moment it does not
verify the preface; any 24 bytes will work. It does not perform any
protocol validation either. Detecting some issues such as out-of-sequence
frames could be helpful. But it already helps as-is.
This patch implements the "crt" keyword in frontends, declaring an
implicit crt-list named after the frontend.
The patch is split into two steps:
The first step is the crt keyword parser, which parses crt lines and
fills a "cfg_crt_node" struct containing an ssl_bind_conf and a
ckch_conf, which are put in a list to be used later.
After parsing the frontend section, as a 2nd step, a
post_section_parser is called; it creates a crt-list named after
the frontend and fills it with the certificates from the list of
cfg_crt_node. Once created, this crt-list is loaded in every "ssl"
bind line that didn't declare any crt or crt-list.
Example:
  listen https
      bind :443 ssl
      crt foobar.pem
      crt test1.net.crt key test1.net.key
Implements part of #2854
ckch_conf loading is not that simple, as it requires checking:
- if the cert already exists in the ckchs_tree
- if the ckch_conf is compatible with an existing cert in ckchs_tree
- if the cert is a bundle which needs to load multiple ckch_store
This logic could be reused elsewhere, so this commit introduces the new
crtlist_load_crt() function which does that.
When a "Location" header was found in an FCGI response, the status code was
forced to 302. But this should only be done if no status code was set
first.
So now, we take care not to override an already defined status code when the
"Location" header is found.
This patch should fix the issue #2865. It must be backported to all stable
versions.
An HTTP filter with no http_payload callback function may be registered on
data. In that case, this filter is obviously not called when some data are
received but it remains important to update its internal state to be sure to
keep it synchronized on the stream, especially its offset value. Otherwise,
a wrong calculation of the global offset may be performed in
flt_http_end(), leading to an integer overflow when data are moved from
input to output. This overflow triggers a BUG_ON() in c_adv().
The same is true for TCP filters with no tcp_payload callback function.
This patch must be backported to all stable versions.
On reload, the new worker requests bound FDs from the old one. The old worker
sends them in messages of at most 252 FDs. Each message is acknowledged by
the new worker. All messages sent or received by the old worker are handled
manually via sendmsg/recv syscalls. So the old worker must be sure to consume
all the ACK replies. However, the last one was never consumed. So it was
considered as a command by the CLI applet. This issue remained hidden until
recently, but it was the root cause of issue #2862.
Note this last ack is also the first one when there are less than 252 FDs to
transfer.
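A simplified sketch of the expected exchange on the old worker side (helper
names are illustrative):

    /* send bound FDs by batches of at most 252 entries and consume one
     * ACK per batch, including the last (or only) one, so that no
     * leftover byte is later mistaken for a CLI command
     */
    while (sent < total) {
        int nb = MIN(total - sent, 252);

        send_fd_batch(fd, fds + sent, nb);   /* sendmsg() with SCM_RIGHTS */
        wait_for_ack(fd);                    /* recv() the acknowledgement */
        sent += nb;
    }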
This patch must be backported to all stable versions.
Commit 7214dcd ("BUG/MEDIUM: applet: Don't pretend to have more data to
handle EOI/EOS/ERROR") revealed a bug with the CLI applet. Pending input
data when the applet is in CLI_ST_END state were never consumed or dropped,
leading to a wakeup loop.
The CLI applet implements its own snd_buf callback function. It is important
that it consumes all pending input data. Otherwise, the applet is woken up in
a loop until it empties the request buffer. Another way to fix the issue would
be to report an error. But in that case, it seems reasonable to drop these
data.
The issue can be observed on reload, in master/worker mode, because of the issue
about the last ACK message which was never consumed by the _getsocks()
command.
This patch should fix the issue #2862. It must be backported to 3.1 with the
commit above.