Last week I was checking out some issues in the ticketing system that had gone unsolved since before my time. While browsing them I noticed an open ticket with a lot of updates from both the sysadmins who came before me and the developers of an application that ran a few times per day but always had trouble running.
The application (which was programmed in Java, of all things) kept throwing errors like this:
java.net.NoRouteToHostException: Can't assign requested address (Address not available)
All of the updates regarding this issue "pointed" towards the database cluster the application was calling to store its output (the quotes serve a purpose, by the way; just wait and read). I could see a year's worth of interactions from both teams discussing the error and possible solutions, yet somehow they all managed to "point" the error at the database cluster.
Please take into account that this is a Percona cluster that could handle tens of thousands of connections, not counting the concurrency. Real beefy stuff. It had been configured by a co-worker who I know does his job competently, but since the ticket had my attention I decided to take a crack at it.
The first few steps of the troubleshooting were to eliminate all the obvious conclusions (a year's worth of work, remember?) my co-workers had come to, which pointed me towards the Percona cluster. Since this is a production system I couldn't really stress it: I knew my testing would get unwanted attention from developers and managers who start crying wolf as soon as they see something unexpected in their metrics. So I just threw around 30k concurrent sessions at the cluster and it didn't break a sweat.
So, on to the next steps. After checking out the whole error log from Java, I decided to SSH into the server where the app was deployed, and as soon as I connected I received a Slack message asking me not to play with it since it was in production!!!
Naturally I did what every sysadmin does at least once (a day): I ignored the developer, started checking every log I could get my hands on, and noticed something specific to those machines: they weren't reusing TCP connections. As you can see, it didn't "point" to a database issue at all. These kinds of database clusters are designed to run with HUGE workloads.
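A quick way to see this symptom for yourself (a sketch; the exact commands used aren't in the ticket) is to count sockets stuck in TIME_WAIT, which is what piles up on a box that opens a fresh connection for every request instead of reusing them:

```shell
# TIME_WAIT is state 06 in /proc/net/tcp; counting those rows shows
# how many local ports are currently tied up waiting to expire.
awk 'NR > 1 && $4 == "06"' /proc/net/tcp | wc -l

# Or, where iproute2 is installed:
# ss -tan state time-wait | wc -l
```

A number in the tens of thousands here is the classic sign of local port exhaustion.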
At this point it was painfully obvious that the machine was running out of local ports for the application deployed on it. To check this, I reviewed the kernel parameters in effect with the following command:
sysctl -a | grep range
And the reusability of the ports:
sysctl -a | grep reuse
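The reuse flag can also be read straight out of procfs (shown here as an illustration; on these machines it came back disabled):

```shell
# 0 = disabled, 1 = enabled, 2 = enabled for loopback only (newer kernels)
cat /proc/sys/net/ipv4/tcp_tw_reuse
```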
Both of those commands gave me what I needed. The port range was OK:
net.ipv4.ip_local_port_range = 15000 65000
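That range is roomy, but still finite. A back-of-the-envelope calculation (assuming the usual ~60-second TIME_WAIT, not a figure from the ticket) shows why an app that opens a fresh connection per query can burn through it:

```shell
low=15000 high=65000
echo $((high - low))            # 50000 usable ephemeral ports
# Without reuse, each port sits in TIME_WAIT for ~60s after close,
# capping sustained new connections at roughly:
echo $(((high - low) / 60))     # ~833 new connections per second
```

Above that rate, connect() starts failing with "Can't assign requested address", exactly the error in the Java log.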
Which is expected, since it's the default behaviour. But the reuse flag wasn't enabled, so we just changed it in the kernel configuration file, adding the entry:
net.ipv4.tcp_tw_reuse = 1
And applied it live.
As soon as I had finished the changes, I went to the developer's desk and described my solution, and the horrified face he made when I told him I'd changed a kernel parameter was priceless. He kept rambling about how it needed to be scheduled since it required a reboot, and this is a production system that had to be approved by the higher-ups, and bla bla bla bla bla (good thing those types of changes don't require a reboot). I just told him it was already done and added a simple "Please run the application again, it won't fail now".
Hope it doesn’t :p