For a couple of months we had issues with one of our servers. The issue can be summarized as follows:
Dev: I can’t connect to the server right now
Ops: What do you mean?
Dev: Connections time out
Ops: I can connect, we can connect, everyone can connect. Stop messing with your OS settings.
Half an hour passes
Ops: Oh shit I can’t connect now.
Now, this incident happened a few times over the course of a few months, but since it only happened from our offices and the server was located at the colocation site, we thought it was something between the office and the datacenter. We could, however, connect to the server by jumping from any other server at the datacenter. This, of course, was not optimal, but we managed to push the issue onto the backlog since we had several other critical incidents to deal with.
After checking everything on the network we noticed something was off about the server. We couldn’t reproduce the issue on demand on any server, including this one: it was totally intermittent, and nothing pointed anywhere other than the server itself.
Getting down to it
One day, I had some time to kill so I decided to take a crack at it. Note that this isn’t the best way to "kill time": you’ll end up trying to kill the guys next to you after a few hours of nothing making sense.
So, I connected to the server via SSH, via a remote MySQL connection (port 3306), HTTP, HTTPS, and a few others, and couldn’t reproduce the issue. I checked it from a few other PCs (Windows, Linux, Mac, a Solaris VM I carry around) and there was nothing under the sun or above it that could not connect. Waiting it is.
After a few hours one of the computers failed to connect: mine (praise the luck gods). So I started checking everything within that machine, connected via another machine on the datacenter’s network.
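The manual round of connection attempts can be automated so it keeps running until a failure shows up. Here is a minimal sketch in Python, assuming plain TCP handshakes are enough to expose the timeouts; the function names and the 3-second timeout are made up for illustration, and only the ports come from the list above.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and DNS failures
        return False

def probe(host: str, ports=(22, 3306, 80, 443), timeout: float = 3.0) -> dict[int, bool]:
    """Try each port once and report which ones answered."""
    return {port: can_connect(host, port, timeout) for port in ports}

# Usage (hypothetical host name, not the real server):
#   for port, ok in probe("server.example.com").items():
#       print(port, "ok" if ok else "FAILED")
```

Run from cron or a loop with a timestamp on every line, this at least tells you when each machine stopped getting through.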
Now, this machine had a few kernel parameters set up, since it needed to use a few thousand connections every few hours due to some Hadoop processes that were supposed to run on it. For more information, check this article we wrote about it: here
Searching, searching, searching…
Then our almighty friend Google came in, then its less mighty friend Bing, then DuckDuckGo, and then there I was, looking at my screen with a few dozen tabs open, nothing to show for it, and a light headache.
So I started looking at the networking stack within Linux and checking the network statistics recorded by the kernel itself. You can check these with netstat -s.
The output of the netstat -s command left me with more questions than answers, but something stood out of the terminal, staring me straight in the face:
35544811584 SYNS to LISTEN sockets dropped
This can happen because the Linux network stack is designed to protect against SYN floods by default, and this counter is almost never zero because of false positives. However, that doesn’t hinder the performance or the usability of the system. In case you want to learn more about SYN flood attacks, read here
Other than the SYNS to LISTEN sockets dropped, I noticed a couple of other entries:
7008728 invalid SYN cookies received
10703 packets collapsed in receive queue due to low socket buffer
These entries looked strange to me. I know this is a server, but those numbers looked way too high for a server that wasn’t exposed to external networks.
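Picking those three lines out of a long netstat -s dump by eye gets old fast. Here is a minimal sketch in Python that extracts them from the text; the function name and the list of watched descriptions are my own choices, and the sample is just the lines quoted above.

```python
import re

# Counter descriptions quoted in this post; matching is case-insensitive
# because capitalisation can differ between netstat versions.
WATCHED = (
    "SYNs to LISTEN sockets dropped",
    "invalid SYN cookies received",
    "packets collapsed in receive queue",
)

def watched_counters(netstat_output: str) -> dict[str, int]:
    """Return {description: value} for the watched lines of `netstat -s` text."""
    found = {}
    for line in netstat_output.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if not match:
            continue  # section headers and lines without a leading count
        value, description = int(match.group(1)), match.group(2).strip()
        for key in WATCHED:
            if description.lower().startswith(key.lower()):
                found[key] = value
    return found

# Sample lines copied from the output shown above.
SAMPLE = """\
    35544811584 SYNS to LISTEN sockets dropped
    7008728 invalid SYN cookies received
    10703 packets collapsed in receive queue due to low socket buffer
"""

if __name__ == "__main__":
    for name, value in watched_counters(SAMPLE).items():
        print(f"{value:>12}  {name}")
```

To feed it live data, something like `subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout` should do. Comparing two snapshots taken a few minutes apart says more than the absolute numbers, since these counters only ever grow.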
After checking a lot more documentation than I should have about those entries (which, by the way, is like reading a legal book with its own technical language that no one but kernel developers seem to understand), I decided to go back to the kernel config at:
And noticed this (this was copied verbatim straight from the server):
vm.overcommit_memory = 1
net.ipv4.tcp_fin_timeout=30
vm.swappiness=1
#net.core.somaxconn = 1024
#fs.file-max = 1311072
net.ipv4.tcp_max_syn_backlog = 16192
net.core.somaxconn = 4096
vm.nr_hugepages=512
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.netfilter.nf_conntrack_tcp_timeout_established=600
The entry for net.ipv4.tcp_tw_reuse = 1 was added by us in the article about reusing TCP ports, but the other entries don’t match what I expected this server to have configured. However, they sounded familiar, and I decided to scale some of them up, including tcp_max_syn_backlog and somaxconn, adapting them to the amount of memory this server has rather than outdated numbers meant for 4GB systems (this node has 512GB of RAM and averages 200GB free, so it can afford to give a few megs to the kernel).
But one other entry jumped right out at me, after hours and hours of searching and watching logs and command outputs:
net.ipv4.tcp_tw_recycle = 1
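Since this one line hid in plain sight for months, a check for it is cheap insurance on other machines. Here is a minimal sketch in Python, assuming the usual sysctl config syntax of key = value lines with # or ; comments; the function names are invented for illustration and the sample is the config copied above.

```python
def parse_sysctl(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

def recycle_enabled(text: str) -> bool:
    """True if the config actively sets net.ipv4.tcp_tw_recycle to 1."""
    return parse_sysctl(text).get("net.ipv4.tcp_tw_recycle") == "1"

# The config shown above, one entry per line.
CONFIG = """\
vm.overcommit_memory = 1
net.ipv4.tcp_fin_timeout=30
vm.swappiness=1
#net.core.somaxconn = 1024
#fs.file-max = 1311072
net.ipv4.tcp_max_syn_backlog = 16192
net.core.somaxconn = 4096
vm.nr_hugepages=512
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.netfilter.nf_conntrack_tcp_timeout_established=600
"""

if __name__ == "__main__":
    if recycle_enabled(CONFIG):
        print("tcp_tw_recycle is on: expect trouble")
```

Note that this only inspects the file contents; a value changed at runtime would not show up here.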
Fixing the issue
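Let me explain. tcp_tw_reuse and tcp_tw_recycle only take effect when TCP timestamps are enabled on both the server and the client, which they are by default. tcp_tw_reuse only affects outgoing connections (the client side); tcp_tw_recycle affects both the client and the server side, and it reclaims a socket 3.5*RTO after it enters TIME_WAIT instead of waiting the usual full period (RTO = retransmission timeout; if you want to read more about RTO, check this out). On a LAN the RTO is really small since there’s basically no delay on the network, so the server recycles connections very aggressively. To do that safely it remembers the last timestamp it saw from each remote address and silently drops any SYN that looks older. Machines that share one address, as they typically do behind an office NAT, each have their own timestamp clock, so every now and then one of them looks "old" to the server, hence the no connection after a while.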
Basically the server says "Nope, you already connected, I’m recycling your connection for something else". But what else? …
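To put a number on the 3.5*RTO figure, here is a small worked example in Python. The 200 ms RTO and the 60 s TIME_WAIT length are assumptions based on the Linux defaults I know of (minimum RTO and fixed TIME_WAIT duration), not values measured on this server.

```python
TIME_WAIT_MS = 60_000  # assumption: Linux's fixed TIME_WAIT length (60 s)
MIN_RTO_MS = 200       # assumption: Linux's minimum RTO, typical on a fast LAN

def recycle_delay_ms(rto_ms: float) -> float:
    """Time before a socket is reclaimed with tcp_tw_recycle on: 3.5 * RTO."""
    return 3.5 * rto_ms

if __name__ == "__main__":
    delay = recycle_delay_ms(MIN_RTO_MS)
    print(f"recycled after {delay:.0f} ms instead of {TIME_WAIT_MS} ms")
    print(f"roughly {TIME_WAIT_MS / delay:.0f} times sooner")
```

Under those assumptions a socket is reclaimed after 700 ms rather than a full minute, which is why the setting looks so tempting on a busy box.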
The solution, after months, turned out to be disabling the entry. Just comment it out of the kernel config altogether, or set it to 0:
net.ipv4.tcp_tw_recycle = 0
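If the same change has to go out to more than one machine, it can be scripted. Here is a minimal sketch in Python that rewrites the config text only; the function name is made up, and reading or writing the actual file is deliberately left out. On many systems the file would be /etc/sysctl.conf and the reload command sysctl -p, but treat both as assumptions to verify on your distribution.

```python
def disable_tw_recycle(config_text: str) -> str:
    """Return config_text with any active tcp_tw_recycle line set to 0.

    Commented-out lines are left alone because their key starts with '#'.
    """
    out = []
    for line in config_text.splitlines():
        key, sep, _ = line.partition("=")
        if sep and key.strip() == "net.ipv4.tcp_tw_recycle":
            line = "net.ipv4.tcp_tw_recycle = 0"
        out.append(line)
    trailing = "\n" if config_text.endswith("\n") else ""
    return "\n".join(out) + trailing

if __name__ == "__main__":
    before = "net.ipv4.tcp_tw_recycle = 1\nnet.ipv4.tcp_tw_reuse = 1\n"
    print(disable_tw_recycle(before), end="")
```

Keeping the function pure (text in, text out) makes it easy to diff the result against the original before anything touches the server.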
And reload it with a:
Yep, a few hours wasted, buuuuuuuuut now I know how the Linux network stack works. I’ll have a future article about it soon.