I noticed that there were only a few ESTABLISHED connections but a lot of non-ESTABLISHED ones. Our service only handles a few requests per second, so the ESTABLISHED count was as expected, but the large number of non-ESTABLISHED connections was abnormal.
So I logged into the server and took a look with netstat:
```
# netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
CLOSE_WAIT 5
TIME_WAIT 1081
ESTABLISHED 7

# netstat -n | awk '/^tcp/'
tcp6       0      0 127.0.0.1:8887      127.0.0.1:48673     TIME_WAIT
tcp6       0      0 127.0.0.1:8887      127.0.0.1:48737     TIME_WAIT
tcp6       0      0 127.0.0.1:8887      127.0.0.1:48704     TIME_WAIT
tcp6       0      0 127.0.0.1:8887      127.0.0.1:48731     TIME_WAIT
tcp6       0      0 127.0.0.1:8887      127.0.0.1:48743     TIME_WAIT
```
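As an aside, on newer systems the same per-state summary can be had from ss, which scales better than netstat when there are many sockets. A minimal equivalent, assuming iproute2 is installed:

```
# Count sockets per TCP state (same idea as the netstat/awk one-liner above)
ss -tan | awk 'NR > 1 {++S[$1]} END {for (a in S) print a, S[a]}'

# List only the TIME_WAIT sockets on the suspicious port
ss -tn state time-wait '( sport = :8887 )'
```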
Most of them involved port 8887, which belongs to an HTTP service running inside Docker. In front of it, nginx forwards requests to the Docker HTTP service based on the domain name. The configuration looks roughly like this:
```
upstream svr {
    server 127.0.0.1:8887;
}

server {
    listen 80;
    server_name my.com;
    location / {
        proxy_pass http://svr;
    }
}
```
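For context (my note, not from the post cited below): without upstream keep-alive, nginx sets up a fresh TCP connection to 127.0.0.1:8887 for every proxied request, the connection is torn down after the response, and the closed socket then lingers in TIME_WAIT for 60 seconds on Linux. This is easy to observe from the server itself:

```
# One request through nginx -> one short-lived upstream connection,
# which then sits in TIME_WAIT for ~60s after it is closed
curl -s -o /dev/null -H 'Host: my.com' http://127.0.0.1/
netstat -n | grep ':8887' | grep TIME_WAIT | tail -n 3
```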
A quick search for this problem showed that someone had already written up a solution:
https://www.cnblogs.com/QLeelulu/p/3601499.html
The upstream module in nginx 1.1 and later supports keep-alive, so we can enable keep-alive on the nginx proxy to cut down the number of TCP connections:
```
upstream http_backend {
    server 127.0.0.1:8080;

    keepalive 16;
}

server {
    ...

    location /http/ {
        proxy_pass http://http_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        ...
    }
}
```
Nginx keepalive documentation: http://nginx.org/cn/docs/http/ngx_http_upstream_module.html#keepalive
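A quick way to tell whether the reuse actually kicks in (my own check, not from the documentation): watch the TIME_WAIT count on the upstream port. With working upstream keep-alive it should stay near zero instead of climbing into the hundreds.

```
# Refresh the TIME_WAIT count on the upstream port once a second
watch -n 1 'netstat -n | grep ":8887" | grep -c TIME_WAIT'
```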
However, after configuring keep-alive, the TIME_WAIT count did not drop at all. The nginx version is 1.4.6, which should support the keepalive feature:
```
/usr/sbin/nginx -v
nginx version: nginx/1.4.6 (Ubuntu)
```
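Another thing worth ruling out at this point (my own debugging step, not part of the original investigation): connection reuse only works if the service behind 127.0.0.1:8887 keeps the connection open after responding. A quick check against the Docker service directly:

```
# Talk to the backend over HTTP/1.1; if it honors keep-alive, curl's
# verbose output ends with "Connection #0 to host ... left intact"
curl -sv -o /dev/null http://127.0.0.1:8887/ 2>&1 | grep -i connection
```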
I then checked the server's access log and saw a steady stream of requests from addresses starting with 100.121.1**. These requests access the service directly by IP, without a domain name. My nginx configuration defines several domains, and when no default server is configured, nginx serves such requests with the first server block defined:
```
100.121.139.53 - - [15/May/2019:11:16:41 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.203 - - [15/May/2019:11:16:41 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.139.108 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.139.112 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.221 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.224 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.139.86 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.119.4 - - [15/May/2019:11:16:42 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.254 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.110.1 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.179 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.139.70 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.139.39 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.109.167 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.110.29 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
100.121.119.23 - - [15/May/2019:11:16:43 +0800] "HEAD / HTTP/1.0" 200 0 "-" "-"
```

And here is the first server block in my configuration, which did not yet have the keepalive settings:

```
upstream svr {
    server 127.0.0.1:8887;
}

server {
    listen 80;
    server_name my2.com;
    location / {
        proxy_pass http://svr;
    }
}
```
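Relying on "whichever server block comes first" is fragile. A more explicit variant (a sketch; default_server and the catch-all server_name are my additions, not from the original config) is to declare a dedicated default server so that host-less requests like these always land somewhere you control, with the keep-alive settings applied:

```
upstream svr {
    server 127.0.0.1:8887;
    keepalive 16;                        # pool of idle connections to the backend
}

# Explicit catch-all for requests that match no server_name,
# e.g. probes sent by IP with no meaningful Host header
server {
    listen 80 default_server;
    server_name _;
    location / {
        proxy_pass http://svr;
        proxy_http_version 1.1;          # upstream keep-alive requires HTTP/1.1
        proxy_set_header Connection "";  # drop the default "Connection: close"
    }
}
```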
Sure enough, after applying the same nginx keepalive configuration to the my2.com server block as well, the problem was solved and the server quieted down:
```
# netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
CLOSE_WAIT 5
TIME_WAIT 2
ESTABLISHED 4
```
Other notes:
1. Why were there requests from addresses starting with 100.121.1**?
This server sits behind an Alibaba Cloud load balancer. To probe whether the backend servers are healthy, each agent server in the load balancer's agent cluster hits the backend once every 2-3 seconds.
2. Many articles recommend changing the tcp_tw_reuse and tcp_tw_recycle kernel parameters to deal with too many TIME_WAIT connections, but I strongly advise against doing so. The blog posts referenced below explain why in detail; a way to merely inspect those flags is sketched after the references.
- 记一次TIME_WAIT网络故障 (a TIME_WAIT network incident postmortem)
- 再叙TIME_WAIT (more on TIME_WAIT)
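For completeness, here is how to inspect (not change) the current values; note that net.ipv4.tcp_tw_recycle was removed from the kernel entirely in Linux 4.12:

```
# Read-only look at the relevant kernel flags
sysctl net.ipv4.tcp_tw_reuse
sysctl net.ipv4.tcp_tw_recycle   # absent on kernels >= 4.12
```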