BUG/MEDIUM: connection: check the control layer before stopping polling

The bug described in commit 568743a ("BUG/MEDIUM: stream-int: completely detach connection on connect error") was not a stream-interface layer bug but a connection layer bug. There was exactly one place in the code where we could change a file descriptor's status without first checking whether it is valid or not: conn_stop_polling(). It is called when the polling status is changed after an update, and calls fd_stop_both() even if we had already closed the file descriptor :

    1479388298.484240 ->->->->-> conn_fd_handler > conn_cond_update_polling
    1479388298.484240 ->->->->->-> conn_cond_update_polling > conn_stop_polling
    1479388298.484241 ->->->->->->-> conn_stop_polling > conn_ctrl_ready
    1479388298.484241 conn_stop_polling < conn_ctrl_ready
    1479388298.484241 ->->->->->->-> conn_stop_polling > fd_stop_both
    1479388298.484242 ->->->->->->->-> fd_stop_both > fd_update_cache
    1479388298.484242 ->->->->->->->->-> fd_update_cache > fd_release_cache_entry
    1479388298.484242 fd_update_cache < fd_release_cache_entry
    1479388298.484243 fd_stop_both < fd_update_cache
    1479388298.484243 conn_stop_polling < fd_stop_both
    1479388298.484243 conn_cond_update_polling < conn_stop_polling
    1479388298.484243 conn_fd_handler < conn_cond_update_polling

The problem with the previous fix above is that it breaks the http_proxy mode and possibly even some Lua parts and peers to a certain extent : all outgoing connections where the target address is initially copied into the outgoing connection, and which experience a retry, would use a random outgoing address after the retry because closing and detaching the connection causes the target address to be lost. Commit 0857d7a ("BUG/MAJOR: stream: properly mark the server address as unset on connect retry") attempted to address this, but it only solved the most visible effect and not the root cause.

Prior to this fix, it was possible to cause this config to keep CLOSE_WAIT for as long as it takes to expire a client or server timeout (note the missing client timeout) :

    listen test
        mode http
        bind :8002
        server s1 127.0.0.1:8001

    $ tcploop 8001 L0 W N20 A R P100 S:"HTTP/1.1 200 OK\r\nContent-length: 0\r\n\r\n" &
    $ tcploop 8002 N200 C T W S:"GET / HTTP/1.0\r\n\r\n" O P10000 K

With this patch, these CLOSE_WAIT properly vanish when both processes leave.

This commit reverts the two fixes above and replaces them with the proper fix in connection.h. It must be backported to 1.6 and 1.5.

Thanks to Robson Roberto Souza Peixoto for providing very detailed traces showing some obvious inconsistencies leading to finding this bug.
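
The guard described above can be illustrated in isolation. What follows is only a minimal sketch, not the HAProxy code: the struct, the flag values, the call counter and the stop_polling_calls() helper are simplified stand-ins invented for the illustration, and the real conn_ctrl_ready() may check more than a single flag.

```c
/* Simplified stand-ins, NOT the real HAProxy definitions. The flag
 * values below are hypothetical and chosen only for this sketch.
 */
#define CO_FL_CTRL_READY   0x01u  /* control layer installed, fd usable */
#define CO_FL_CURR_RD_ENA  0x02u  /* polling for reads currently enabled */
#define CO_FL_CURR_WR_ENA  0x04u  /* polling for writes currently enabled */

struct connection {
	unsigned int flags;
	int fd;
};

/* Counts how many times the fd layer was touched. */
static int fd_stop_calls;

/* Stand-in for the fd layer: only records that it was called. */
static void fd_stop_both(int fd)
{
	(void)fd;
	fd_stop_calls++;
}

/* Returns non-zero if the fd attached to <c> may still be used. */
static int conn_ctrl_ready(const struct connection *c)
{
	return (c->flags & CO_FL_CTRL_READY) != 0;
}

/* Clears the polling flags, and touches the fd only if it is still valid. */
static void conn_stop_polling(struct connection *c)
{
	c->flags &= ~(CO_FL_CURR_RD_ENA | CO_FL_CURR_WR_ENA);
	if (conn_ctrl_ready(c))
		fd_stop_both(c->fd);
}

/* Hypothetical helper: returns the number of fd_stop_both() calls made
 * when polling is stopped on a connection carrying <flags>.
 */
static int stop_polling_calls(unsigned int flags)
{
	struct connection c = { flags, 7 };

	fd_stop_calls = 0;
	conn_stop_polling(&c);
	return fd_stop_calls;
}
```

In this sketch, stop_polling_calls(CO_FL_CTRL_READY | CO_FL_CURR_RD_ENA) reaches the fd layer once, while stop_polling_calls(CO_FL_CURR_RD_ENA) leaves it alone, which mirrors the case of a connection whose file descriptor was already closed.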
2016-11-17 12:05:13 +01:00
commit 350135cf49
parent a44fdd95f9
2 changed files with 4 additions and 3 deletions
--- a/include/proto/connection.h
+++ b/include/proto/connection.h
@@ -261,7 +261,8 @@ static inline void conn_stop_polling(struct connection *c)
 	c->flags &= ~(CO_FL_CURR_RD_ENA | CO_FL_CURR_WR_ENA |
 		      CO_FL_SOCK_RD_ENA | CO_FL_SOCK_WR_ENA |
 		      CO_FL_DATA_RD_ENA | CO_FL_DATA_WR_ENA);
-	fd_stop_both(c->t.sock.fd);
+	if (conn_ctrl_ready(c))
+		fd_stop_both(c->t.sock.fd);
 }

 /* Automatically update polling on connection <c> depending on the DATA and
--- a/src/stream.c
+++ b/src/stream.c
@@ -560,6 +560,7 @@ static int sess_update_st_con_tcp(struct stream *s)
 	struct stream_interface *si = &s->si[1];
 	struct channel *req = &s->req;
 	struct channel *rep = &s->res;
+	struct connection *srv_conn = __objt_conn(si->end);

 	/* If we got an error, or if nothing happened and the connection timed
 	 * out, we must give up. The CER state handler will take care of retry
@@ -579,8 +580,7 @@ static int sess_update_st_con_tcp(struct stream *s)
 		si->exp   = TICK_ETERNITY;
 		si->state = SI_ST_CER;

-		si_release_endpoint(si);
-		s->flags &= ~SF_ADDR_SET;
+		conn_force_close(srv_conn);

 		if (si->err_type)
 			return 0;