Unix: Checking for open sockets on nginx
Tim and I were investigating a weird problem we were having with nginx where it was getting in a state where it had exceeded the number of open files allowed on the system and started rejecting requests.
We can find out the maximum number of open files that we’re allowed on a system with the following command:
$ ulimit -n
1024
Our hypothesis was that some socket connections were never being closed and therefore the number of open files was climbing slowly upwards until it exceeded the limit.
We wanted to check how many sockets nginx had open so to start with we needed to know the process IDs it was running under:
$ ps aux | grep nginx | grep -v grep
root 1089 0.0 0.7 105152 2736 ? Ss 17:34 0:00 nginx: master process /usr/sbin/nginx
www-data 17474 0.0 0.6 105300 2296 ? S 21:49 0:04 nginx: worker process
www-data 17475 0.0 0.7 105300 2856 ? S 21:49 0:04 nginx: worker process
www-data 17476 0.0 0.7 105300 2792 ? S 21:49 0:03 nginx: worker process
www-data 17477 0.0 0.7 105300 2668 ? S 21:49 0:04 nginx: worker process
So the process IDs we’re interested in are 1089, 17474, 17475, 17476 and 17477.
We can check which file descriptors they have open with the following command:
$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd
/proc/17476/fd:
total 0
dr-x------ 2 www-data www-data 0 Apr 23 23:40 .
...
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]
/proc/17477/fd:
total 0
...
lrwx------ 1 www-data www-data 64 Apr 23 23:40 56 -> socket:[52213]
lrwx------ 1 www-data www-data 64 Apr 23 23:40 57 -> anon_inode:[eventpoll]
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]
We can narrow that down to just show us how many sockets are open:
$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd | grep socket | wc -l
189
We could also use http://linux.die.net/man/8/lsof although for some reason that returns a slightly different number:
$ sudo lsof -p 1089,17474,17475,17476,17477 | grep socket | wc -l
184
If we want to use brace expansion to do that it becomes a bit more tricky:
$ sudo lsof -p `echo {1089,174{74,75,76,77}} | sed 's/ /,/g'` | grep socket | wc -l
184
Annoyingly we couldn’t actually replicate the error but think that it’s been solved in nginx 1.2.0 (we were using 1.1.19) by this change:
Bugfix: a segmentation fault might occur in a worker process if the
"try_files" directive was used; the bug had appeared in 1.1.19.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.