Discussion: [Sqlgrey-users] Load balancing question, part 2
Gary Smith
2010-04-26 16:56:56 UTC
I have set up two sqlgrey servers, load balanced with ipvsadm. Load balancing is operating, but I end up with a lot of orphaned ESTABLISHED connections on the real servers. In a period of 48 hours, I received ~500 requests (per real server), and there were about ~250 established connections per server.

When I bypass ipvsadm and just go direct to a single server, I see only a few connections established (and there is a corresponding connection on the postfix side).

Does anyone else on the list run sqlgrey in an ipvsadm load balanced scenario? If so, any pointers? Postfix seems to have no complaint on this, but I think by design it reconnects when the connection is gone.
Gary Smith
2010-04-26 18:38:34 UTC
Post by Gary Smith
I have set up two sqlgrey servers, load balanced with ipvsadm. Load balancing
is operating, but I end up with a lot of orphaned ESTABLISHED connections on
the real servers. In a period of 48 hours, I received ~500 requests (per real
server), and there were about ~250 established connections per server.
When I bypass ipvsadm and just go direct to a single server, I see only a few
connections established (and there is a corresponding connection on the
postfix side).
Does anyone else on the list run sqlgrey in an ipvsadm load balanced scenario?
If so, any pointers? Postfix seems to have no complaint on this, but I think
by design it reconnects when the connection is gone.
This might be helpful for people on the list.

Okay, isolating it to a single real server node in the load balanced cluster still causes the same result. It appears that after N seconds, postfix hangs up the connection, but sqlgrey never notices, probably because of the load balancer. So it is then up to the OS-level TCP timeout settings to kill the connection. I have put a dirty hack in place just to test: I set $mux->set_timeout in mux_input, and in the mux_timeout callback I close the current filehandle ($fh). I do this after the processing has taken place (immediately after the while loop). It's probably the wrong thing to do, but the connection is closed after the timeout, and the closure is seen by postfix and the load balancer.
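For reference, the hack looks roughly like this (a sketch only, not the actual diff I'm running; the handler signatures follow the Net::Server::Multiplex / IO::Multiplex callbacks sqlgrey already uses, and the 60-second value is just an example):

sub mux_input {
    my ($self, $mux, $fh, $in_ref) = @_;

    while ($$in_ref =~ s/^([^\r\n]*)\r?\n//) {
        # ... normal policy-request handling on $1, as sqlgrey does today ...
    }

    # dirty hack: (re)arm a per-connection timer once processing is done
    $mux->set_timeout($fh, 60);
}

sub mux_timeout {
    my ($self, $mux, $fh) = @_;

    # no traffic and no FIN seen within the timeout: assume the peer is
    # gone and close our side so the descriptor isn't held forever
    $mux->close($fh);
}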

The concern here is that sqlgrey isn't reacting gracefully when connections are abandoned (that is, closed on the other end without sqlgrey ever receiving notification). It only stands out when something like load balancing is put in place (my observation; I could still be wrong here).

It might be useful to put some type of sanity timeout check in place for a case like this. If you have a reasonably configured default TTL for TCP at the OS level then the impact is probably minimal.

I have been using sqlgrey for some years now, and I am migrating it to a separate cluster (currently it lives on each MTA, but we are trying to break that out, as we have a resource need to move this to its own cluster).

Thoughts?
Kenneth Marshall
2010-04-26 18:52:30 UTC
Post by Gary Smith
Post by Gary Smith
I have set up two sqlgrey servers, load balanced with ipvsadm. Load balancing
is operating, but I end up with a lot of orphaned ESTABLISHED connections on
the real servers. In a period of 48 hours, I received ~500 requests (per real
server), and there were about ~250 established connections per server.
When I bypass ipvsadm and just go direct to a single server, I see only a few
connections established (and there is a corresponding connection on the
postfix side).
Does anyone else on the list run sqlgrey in an ipvsadm load balanced scenario?
If so, any pointers? Postfix seems to have no complaint on this, but I think
by design it reconnects when the connection is gone.
This might be helpful for people on the list.
Okay, isolating it to a single real server node in the load balanced cluster still causes the same result. It appears that after N seconds, postfix hangs up the connection, but sqlgrey never notices, probably because of the load balancer. So it is then up to the OS-level TCP timeout settings to kill the connection. I have put a dirty hack in place just to test: I set $mux->set_timeout in mux_input, and in the mux_timeout callback I close the current filehandle ($fh). I do this after the processing has taken place (immediately after the while loop). It's probably the wrong thing to do, but the connection is closed after the timeout, and the closure is seen by postfix and the load balancer.
The concern here is that sqlgrey isn't reacting gracefully when connections are abandoned (that is closed, but never receiving notification). It stands out when something like load balancing is put in place (my observation, I could still be wrong here).
It might be useful to put some type of sanity timeout check in place for a case like this. If you have a reasonably configured default TTL for TCP at the OS level then the impact is probably minimal.
I have been using sqlgrey for some years now, and I am migrating it to a separate cluster (currently it lives on each MTA, but we are trying to break that out, as we have a resource need to move this to its own cluster).
Thoughts?
SQLgrey is doing the correct thing in this case. It does not know why
the connection is gone, or even that it is gone, for a while. The load balancer
should close the connection to the remote SQLgrey when the frontends go
away or, depending on how it works, when all connections from the frontend
are closed. This will keep SQLgrey from holding old connections around until
they are reclaimed. It is useful to have timeouts such as you mention to handle
other bits of poorly designed software.

Cheers,
Ken
Gary Smith
2010-04-26 19:43:26 UTC
Post by Kenneth Marshall
SQLgrey is doing the correct thing in this case. It does not know why
the connection is gone, or even that it is gone, for a while. The load balancer
should close the connection to the remote SQLgrey when the frontends go
away or, depending on how it works, when all connections from the frontend
are closed. This will keep SQLgrey from holding old connections around until
they are reclaimed. It is useful to have timeouts such as you mention to handle
other bits of poorly designed software.
I will probably implement some level of optional timeouts in the codebase that I have, and I will provide it as a patch for those who might be interested. I think adding it as an optional config variable has merit: default it to 0, which would just bypass the timeout for normal operation (something along the lines of the sketch below). Postfix->sqlgrey has always worked flawlessly for the most part, and it's this load balancing that is disrupting the normal flow.
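Roughly something like this (a sketch only; the option name connection_timeout is made up for illustration and is not an existing sqlgrey setting, and how the value gets read from the config is just a placeholder):

# sqlgrey.conf -- hypothetical new option; 0 (the default) keeps today's behaviour
connection_timeout = 0

# in mux_input, after the request has been processed:
my $timeout = $self->{sqlgrey}{connection_timeout} || 0;   # however the option ends up exposed
$mux->set_timeout($fh, $timeout) if $timeout > 0;

# and a mux_timeout callback that simply calls $mux->close($fh)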

I will also try some of the options on the load balancer, but if persistent connections didn't resolve it, then it's doubtful that other options will.

If we are seeing this issue in test (~500 msg/day), then in production it might fail entirely (~250 msg/minute). That's where my concern really lies.
Karl O. Pinc
2010-04-26 20:03:24 UTC
Post by Gary Smith
I will also try some of the options on the load balancer, but if
persistent connections didn't resolve it, then it's doubtful that other
options will.
You could try talking with the load balancing folk.


Karl <***@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein
Gary Smith
2010-04-26 22:36:15 UTC
Post by Karl O. Pinc
You could try talking with the load balancing folk.
I'm working with them on this as well. Right now, sqlgrey is the only service that I am having problems with. I had issues with mysql as well, but fixing the arp issue seemed to resolve it for that server. It did not, however, resolve it for sqlgrey. I'm pretty sure that it has something to do with the return close from postfix to the load balancer. I don't think that the close is actually making it back. At the same time, postfix enters a FIN_WAIT for a minute or so, then it falls off.

Anyway, I will also check with the postfix group as well, as there could be something in the closure logic for policy maps that only shows up in this type of scenario.

Gary Smith
Karl O. Pinc
2010-04-27 00:06:24 UTC
Post by Gary Smith
Post by Karl O. Pinc
You could try talking with the load balancing folk.
I'm working with them on this as well. As for right now, sqlgrey is
the only service that I am having problems with. I had issues with
mysql as well, but fixing the arp issue seemed to resolve it for that
server. It did not however resolve it for sqlgrey. I'm pretty sure
that it has something to do with the return close from postfix to the
load balancer. I don't think that the close is actually making it
back. At the same time, postfix enters a FIN_WAIT for a minute or so,
then it falls off.
Anyway, I will also check with the postfix group as well as there
could be something in the closure logic for policy maps that's only
brought forward during this type of scenario.
If postfix is in FIN_WAIT then it has sent its FIN and thinks the tcp
connection is closed on its side. (IIRC the state that lingers to avoid
accidentally injecting old packets that are still on the wire into a new
TCP session is TIME_WAIT, not FIN_WAIT. See the TCP RFC.)




Karl <***@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein
Gary Smith
2010-04-27 00:42:27 UTC
Post by Gary Smith
Post by Karl O. Pinc
You could try talking with the load balancing folk.
I'm working with them on this as well. As for right now, sqlgrey is the only
service that I am having problems with. I had issues with mysql as well, but
fixing the arp issue seemed to resolve it for that server. It did not however
resolve it for sqlgrey. I'm pretty sure that it has something to do with the
return close from postfix to the load balancer. I don't think that the close
is actually making it back. At the same time, postfix enters a FIN_WAIT for a
minute or so, then it falls off.
Anyway, I will also check with the postfix group as well as there could be
something in the closure logic for policy maps that's only brought forward
during this type of scenario.
Things work much better now. The lost connections were because of iptables. I have this rule early on, on the server that hosts the director, and I guess the ACK FIN is technically an invalid state as far as conntrack is concerned, so the close gets rejected there and never makes it through...

-A INPUT -p tcp -m conntrack --ctstate INVALID -j LOG --log-prefix "FW-I BF: "
-A INPUT -p tcp -m conntrack --ctstate INVALID -j REJECT --reject-with icmp-port-unreachable

Apr 26 04:36:02 wall1 kernel: FW-I BF: IN=br0 OUT= PHYSIN=eth1 MAC=00:50:56:b1:63:bc:00:0c:29:92:be:b7:08:00 SRC=10.80.66.24 DST=10.80.55.11 LEN=52 TOS=0x08 PREC=0x00 TTL=64 ID=40835 DF PROTO=TCP SPT=52114 DPT=3917 WINDOW=363 RES=0x00 ACK FIN URGP=0
Dan Faerch
2010-04-26 19:33:54 UTC
Post by Gary Smith
I have set up two sqlgrey servers, load balanced with ipvsadm. Load balancing
is operating, but I end up with a lot of orphaned ESTABLISHED connections on
the real servers. In a period of 48 hours, I received ~500 requests (per real
server), and there were about ~250 established connections per server.
I've not tried LVS'ing sqlgrey, but I do use LVS (ipvsadm) a lot.
I run sqlgrey on the MTAs, and then LVS to the MTAs. The MTAs then
communicate with sqlgrey on localhost. Sqlgrey reads from MySQL on
localhost and writes to a master for replication.


When you say ESTABLISHED, I assume you mean from looking at the output
of netstat or something like it. If so, I can't imagine that it can be
an application-level problem. If postfix deliberately closes the
connection (e.g. due to a timeout), it should/would transmit either a TCP
RST or FIN. The receiving party (your cluster node) will handle RST and
FIN in the kernel's IP stack, not in the application.
Upon receiving a FIN, "ESTABLISHED" should have changed to "CLOSE-WAIT" in
netstat.
That suggests that your cluster node does not actually receive a FIN,
which again suggests that the connection is dropped before it reaches your
cluster node, i.e. at the load balancer.
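A quick way to check on a real server (assuming sqlgrey is listening on its default port 2501) is something like:

netstat -tan | grep 2501

Entries stuck in ESTABLISHED would mean the FIN never arrived; CLOSE-WAIT would mean it arrived but the application never closed its side.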

Of course, not knowing your LVS setup and config, I can only guess. But a
typical misconfiguration would be using "Direct Routing" (the "gatewaying"
option in ipvsadm, which is the default) without taking precautions against
the cluster nodes ARP'ing the virtual IP. If that is the case, all
cluster nodes would "battle" for the virtual IP, making it "hop"
around the nodes.
That would, I imagine, leave ESTABLISHED connections behind.
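For reference, the usual precaution on the real servers when doing Direct Routing is something along these lines (a sketch only; adjust to your interfaces and VIP, and it only matters if you are in fact using DR):

# on each real server: don't answer ARP for the VIP, don't use it as ARP source
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
# and bind the VIP to loopback so the node still accepts traffic for it, e.g.
#   ip addr add <VIP>/32 dev lo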


Regards
- Dan Faerch
Gary Smith
2010-04-26 20:04:27 UTC
Post by Dan Faerch
I've not tried LVS'ing sqlgrey, but I do use LVS (ipvsadm) a lot.
I run sqlgrey on the MTAs, and then LVS to the MTAs. The MTAs then
communicate with sqlgrey on localhost. Sqlgrey reads from MySQL on
localhost and writes to a master for replication.
When you say ESTABLISHED, I assume you mean from looking at the output
of netstat or something like it. If so, I can't imagine that it can be
an application-level problem. If postfix deliberately closes the
connection (e.g. due to a timeout), it should/would transmit either a TCP
RST or FIN. The receiving party (your cluster node) will handle RST and
FIN in the kernel's IP stack, not in the application.
Upon receiving a FIN, "ESTABLISHED" should have changed to "CLOSE-WAIT" in
netstat.
That suggests that your cluster node does not actually receive a FIN,
which again suggests that the connection is dropped before it reaches your
cluster node, i.e. at the load balancer.
Of course, not knowing your LVS setup and config, I can only guess. But a
typical misconfiguration would be using "Direct Routing" (the "gatewaying"
option in ipvsadm, which is the default) without taking precautions against
the cluster nodes ARP'ing the virtual IP. If that is the case, all
cluster nodes would "battle" for the virtual IP, making it "hop"
around the nodes.
That would, I imagine, leave ESTABLISHED connections behind.
Dan,

I'll look into this. The arp problem makes sense. We have been doing some work with the arp configurations on these. This is a new cluster environment. As for the routing, we are currently using NAT (-m) for all nodes.

I'm pretty sure it's the load balancer that has introduced the problems. We do have other IPVS nodes running (and they have been running for years), and this is the first set where I have run into these lingering connection problems.

Thanks for the pointers.