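Analyzing Excessive TIME_WAIT Connections on Linux
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://www.jianshu.com/p/46162f49be8a
Symptoms:
The application team reported that connecting to HiveServer2 through beeline failed with: 'Error:java.io.IOException(java.net.UnknownHostException:hostname)'
On the HiveServer2 node, ping some of the compute nodes: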
[root@host HIST]# ping A
PING A.up (IP) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
64 bytes from spbdhbdh507.up (IP): icmp_seq=3 ttl=62 time=0.470 ms
64 bytes from spbdhbdh507.up (IP): icmp_seq=4 ttl=62 time=0.301 ms
--- A.up ping statistics ---
4 packets transmitted, 2 received, 50% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.301/0.385/0.470/0.086 ms
The ping fails intermittently with "ping: sendmsg: Operation not permitted".
Checking dmesg on the HiveServer2 node shows that the conntrack table is full:
dmesg -T
[Wed Feb 20 16:38:36 2019] nf_conntrack: table full, dropping packet
[Wed Feb 20 16:38:36 2019] nf_conntrack: table full, dropping packet
[Wed Feb 20 16:38:41 2019] net_ratelimit: 173 callbacks suppressed
[Wed Feb 20 16:38:41 2019] nf_conntrack: table full, dropping packet
[Wed Feb 20 16:38:41 2019] nf_conntrack: table full, dropping packet
Check the connections on the HiveServer2 node:
[root@A HIST]# netstat -an|wc -l
235460
Look at how the connections are distributed; the TIME_WAIT connections all involve port 7187 on host A:
[root@A HIST]# netstat -anp|grep -i TIME_WAIT|awk '{print $4}'|awk -F ':' '{print $2}'|sort|uniq -c
2044 7187
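Grouping by TCP state as well as by port can show at a glance whether TIME_WAIT really dominates. The following is a minimal sketch, assuming the netstat column layout shown in this article (state in the sixth field of each tcp line); the function name count_tcp_states is made up for illustration.

```shell
#!/usr/bin/env bash
# Summarize TCP connections by state from `netstat -ant` style output on stdin.
# Assumption: the state is field 6 on lines whose first field starts with "tcp".
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2   # highest count first, ties ordered by state name
}

# Typical use on a live host (requires net-tools):
#   netstat -ant | count_tcp_states
```

The header lines that netstat prints are skipped because their first field does not start with "tcp".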
View the connection details:
[root@A HIST]# netstat -anp|grep -i 7187
tcp 0 0 IP:7187 IPX0:52866 TIME_WAIT -
tcp 0 0 IP:7187 IPX:15158 TIME_WAIT -
tcp 0 0 IP:7187 IPX:48446 TIME_WAIT -
tcp 0 0 IP:7187 IPX:47154 TIME_WAIT -
tcp 0 0 IP:7187 IPX:21540 TIME_WAIT -
tcp 0 0 IP:7187 IPX7:58386 TIME_WAIT -
tcp 0 0 IP:7187 IPX6:29124 TIME_WAIT -
tcp 0 0 IP:7187 IPX9:51854 TIME_WAIT -
…………
[root@A HIST]# lsof -i:7187
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 28212 cloudera-scm 495u IPv4 1634756051 0t0 TCP localhost:47586->localhost:7187 (CLOSE_WAIT)
java 28212 cloudera-scm 499u IPv4 1625816374 0t0 TCP *:7187 (LISTEN)
[root@A HIST]# ps -ef|grep -i 28212
root 24236 25114 0 16:32 pts/4 00:00:00 grep --color=auto -i 28212
clouder+ 28212 3559 99 15:34 ? 01:35:54 /usr/java/jdk1.8.0_131/bin/java -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Dmgmt.log.file=mgmt-cmf-mgmt-NAVIGATORMETASERVER-host.up.log.out -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Xms8589934592 -Xmx8589934592 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:PermSize=256m -XX:MaxPermSize=1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/mgmt_mgmt-NAVIGATORMETASERVER-70499d2a6c3b56720e7b795db8af62ec_pid28212.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -cp /usr/share/java/mysql-connector-java.jar:/usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar:/usr/share/java/oracle-connector-java.jar:/usr/share/cmf/cloudera-navigator-server/nav-server-2.13.4.jar:/usr/share/cmf/cloudera-navigator-server/jars/*::/usr/share/cmf/lib/plugins/event-publish-5.14.4-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-5.14.4.jar -Dlog4j.configuration=file:///run/cloudera-scm-agent/process/52624-cloudera-mgmt-NAVIGATORMETASERVER/log4j.properties -Dnavigator.auditModels.dir=/usr/share/cmf/cloudera-navigator-audit-server/auditModels com.cloudera.nav.server.NavServer /run/cloudera-scm-agent/process/52624-cloudera-mgmt-NAVIGATORMETASERVER/cloudera-navigator.properties /run/cloudera-scm-agent/process/52624-cloudera-mgmt-NAVIGATORMETASERVER/cloudera-navigator-cm-auth.properties /run/cloudera-scm-agent/process/52624-cloudera-mgmt-NAVIGATORMETASERVER/db.navms.properties /run/cloudera-scm-agent/process/52624-cloudera-mgmt-NAVIGATORMETASERVER/cm-ext-accounts.properties
From the process command line, the service listening on port 7187 is the Navigator Metadata Server (NAVIGATORMETASERVER).
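【Why TIME_WAIT occurs】
After the two sides establish a TCP connection, the side that actively closes it enters the TIME_WAIT state after sending the final ACK and stays there for 2MSL. MSL (maximum segment lifetime) is the longest time a segment can survive in the network.
This design is an essential part of TCP/IP. Because the network is assumed to be unreliable, there is no guarantee that the final ACK reaches the peer. The peer's socket, sitting in LAST_ACK, may time out waiting for that ACK and retransmit its FIN. TIME_WAIT exists for exactly this case: it keeps a time window open in which the closing side can still answer a retransmitted FIN by resending the ACK that may have been lost.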
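Because each closed connection lingers for a fixed period, the number of TIME_WAIT sockets gives a rough idea of how fast connections are being closed. The sketch below works through that arithmetic for the 2044 TIME_WAIT sockets seen on port 7187. It assumes a steady close rate and a 60-second TIME_WAIT period (the fixed value in mainline Linux kernels); the function name estimate_close_rate is hypothetical.

```shell
#!/usr/bin/env bash
# Rough estimate: TIME_WAIT sockets ~= (connections closed per second) * (TIME_WAIT seconds),
# so closes per second ~= TIME_WAIT sockets / TIME_WAIT seconds.
# Assumption: TIME_WAIT lasts 60 s unless a second argument says otherwise.
estimate_close_rate() {
    local time_wait_count=$1
    local time_wait_seconds=${2:-60}
    echo $(( time_wait_count / time_wait_seconds ))   # integer division, rounds down
}

# Example with the count observed on port 7187:
#   estimate_close_rate 2044
```

With the numbers from this incident, the estimate is about 34 connections closed per second on port 7187.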
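【Fixing excessive TIME_WAIT】
# Enables fast recycling of TIME-WAIT sockets for TCP connections; the default is 0 (disabled).
# Note: this option is known to break connections from clients behind NAT and was removed in Linux 4.12.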
net.ipv4.tcp_tw_recycle=1
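This resolves the excessive TIME_WAIT problem. Next comes the analysis of the "table full, dropping packet" problem.
【Why "table full, dropping packet" occurs】
This error appears when the number of connections tracked in netfilter's conntrack table exceeds the configured maximum (nf_conntrack_max); packets belonging to new connections are then dropped.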
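How close the table is to its limit can be read from /proc. The following is a minimal sketch, assuming the nf_conntrack module is loaded so that the two /proc files exist; the function name conntrack_usage_pct is made up for illustration.

```shell
#!/usr/bin/env bash
# Print conntrack table usage as an integer percentage.
# Arguments: current entry count, table maximum.
conntrack_usage_pct() {
    local count=$1 max=$2
    echo $(( count * 100 / max ))
}

# On a live host the two values can be read from /proc:
#   conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A value at or near 100 matches the "table full" messages in dmesg.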
【Fixing "table full, dropping packet"】
Raise the maximum number of connections conntrack can track:
net.netfilter.nf_conntrack_max = 5242880
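The article does not say how the setting was applied. One possible approach, sketched here as an assumption, is to write it to a sysctl.d fragment so that it survives a reboot. The file name 90-conntrack.conf and the helper render_conntrack_conf are hypothetical.

```shell
#!/usr/bin/env bash
# Render a sysctl configuration line for the conntrack limit.
render_conntrack_conf() {
    printf 'net.netfilter.nf_conntrack_max = %s\n' "$1"
}

# As root (the file name is a hypothetical choice):
#   render_conntrack_conf 5242880 > /etc/sysctl.d/90-conntrack.conf
#   sysctl --system                          # reload all sysctl configuration files
#   sysctl net.netfilter.nf_conntrack_max    # confirm the running value
```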
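【Fixing the excessive connections on port 7187】
The Navigator Metadata Server log contains the following exception: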
2019-02-20 13:28:32,748 WARN com.cloudera.nav.extract.ExtractorScheduler [ExtractorServicePoller-0]: Exception while processing extractor factory com.cloudera.nav.extract.CmExtractorService
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Cursor functionality requires a sort containing a uniqueKey field tie breaker
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310)
at com.cloudera.nav.utils.solr.SolrResultSetIterator.getNextBatch(SolrResultSetIterator.java:96)
at com.cloudera.nav.utils.solr.SolrResultSetIterator.hasNext(SolrResultSetIterator.java:76)
at com.google.common.collect.Iterables.isEmpty(Iterables.java:1038)
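The log shows that the Solr database backing the Navigator Metadata Server is faulty.
The agent on every node sends lineage information to the Navigator Metadata Server. If the Navigator Metadata Server cannot write that lineage to Solr, a large number of TCP connections can pile up in a waiting state.
Start by cleaning up the Navigator Metadata Server's internal database.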
1. Stop the Navigator Metadata Server.
2. Update the Navigator Metadata Server's backend database (MySQL):
DELETE FROM NAV_UPGRADE_ORDINAL;
INSERT INTO NAV_UPGRADE_ORDINAL VALUES (-1, -1);
DELETE FROM NAV_ALTUS_FLAG;
3. Delete Navigator's embedded Solr database (the default location is /var/lib/cloudera-scm-navigator):
cd /var/lib/cloudera-scm-navigator && rm -rf ./*
4. Start the Navigator Metadata Server.
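Note that deleting the embedded Solr database loses the existing lineage and any tags you applied manually.
It does not affect the underlying data itself.
After the deletion, the Navigator Metadata Server rebuilds its index automatically, and no audit information is affected.
After these steps, the number of connections on port 7187 held steady at around 29.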
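To confirm that the count on port 7187 stays stable, the connections can be counted periodically. This is a minimal sketch that reads netstat-style output from stdin, again assuming the column layout shown earlier (local address in field 4); the function name count_local_port is hypothetical.

```shell
#!/usr/bin/env bash
# Count TCP sockets whose local port equals $1, reading `netstat -ant` output on stdin.
# The port is taken as the text after the last ":" so IPv6 addresses also work.
count_local_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ { n = split($4, a, ":"); if (a[n] == port) c++ }
        END { print c + 0 }'
}

# Sample once a minute on a live host:
#   while sleep 60; do netstat -ant | count_local_port 7187; done
```

The listening socket is included in the count; connections where 7187 is only the remote port are not.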