记录一下
一次Elasticsearch集群事故分析、排查、处理
背景介绍
事故发生的ElasticSearch集群共有7台机器:
- 10.163.204.193
- 10.163.204.194
- 10.163.204.195
- 10.163.220.73
- 10.163.220.74
- 10.163.220.220
- 10.163.220.221
其中193、194、195的机器配置一样,具体如下:
- CPU:32核
- 内存:128G
- 磁盘:4T*3
系统盘单独挂载:40G
73、74、220、221的机器配置一样,具体如下:
- CPU:32核
- 内存:128G
- 磁盘:10T
系统盘单独挂载:50G
以上7台机器用的都是阿里云的高效云盘,https://help.aliyun.com/zh/ecs/user-guide/disks-2
也就是说最大吞吐量(读+写 上限)为140MB/s
问题
由于比较穷,所以集群的部署情况比较复杂,这7台机器上共有一个Kafka集群、一个ZooKeeper集群、两个ElasticSearch集群。
其中Kafka、ZooKeeper集群部署在193、194、195上;
两个ElasticSearch集群各有7个节点(也就是这个7台机器每个都有两个ElasticSearch集群的实例);
193、194、195上的3块盘,Kafka、ZooKeeper、ElasticSearch(2个)集群都在用。
在这次事故之前发生过一次Kafka集群写入延迟,经分析是两个ElasticSearch集群与Kafka集群公用一块磁盘导致磁盘io满,造成Kafka集群集群延迟。
所以将193、194、195上的3块盘的使用调整为:
- 一块4T的盘单独给Kafka、ZooKeeper使用
- 另外两块4T的盘两个ElasticSearch集群混用(其中一个ElasticSearch集群的数据量,已经很小了,不到10G,所以可以忽略)
好了,前情回顾已经写完了,下面来说下这次出现的问题:
- 23号下午5点,ElasticSearch集群挂了
- 集群没有master
- 25号上午9点,ElasticSearch集群又开始rebalance shards
分析
因为23号的问题很紧急,所以当时采取的方案就是重启整个集群,先恢复。
193上当时没有master的异常日志
{"type": "server", "timestamp": "2023-10-23T09:22:08,955Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "business-log", "node.name": "es-b-193", "message": "master not discovered or elected yet, an election requires at least 4 nodes with ids from [hYrrmhLHTx-QHoDmZ2wATg, jaNLhd1eT6SUYcLJkHpE1Q, 1VQFmt9jQ-6d7fCMjP-vnQ, uzGdH3VeRbOlcDIWeKdgIw, PhjlCea6TNKh4rdZPyPDkA, ZFQdNP4HSgCtPCnMfDdopw, 1WRaRyU-SuCPn8qbz3e4hg], have discovered [{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{10.163.204.194}{10.163.204.194:9301}{cdfhilmrstw}, {es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{10.163.220.220}{10.163.220.220:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{10.163.204.195}{10.163.204.195:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{10.163.220.74}{10.163.220.74:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}] which is a quorum; discovery will continue using [10.163.204.194:9301, 10.163.204.195:9301] from hosts providers and [{es-b-220}{hYrrmhLHTx-QHoDmZ2wATg}{Qyz9MF2bRmKQq9uPWgTBuw}{10.163.220.220}{10.163.220.220:9301}{cdfhilmrstw}, {es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}, {es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}, {es-b-74}{uzGdH3VeRbOlcDIWeKdgIw}{fhNTddq7Syi7zPDVCejDWQ}{10.163.220.74}{10.163.220.74:9301}{cdfhilmrstw}, {es-b-194}{PhjlCea6TNKh4rdZPyPDkA}{6pQJDRevScad3hr_hJcnTg}{10.163.204.194}{10.163.204.194:9301}{cdfhilmrstw}, {es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}, {es-b-195}{1VQFmt9jQ-6d7fCMjP-vnQ}{e_P2NNXnTIWGzg7vE8EWkA}{10.163.204.195}{10.163.204.195:9301}{cdfhilmrstw}] from last-known cluster state; node term 43, last-accepted version 397615 in term 43", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-23T09:22:11,119Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "business-log", "node.name": "es-b-193", "message": "path: /_cat/nodes, params: {h=ip,name,heap.percent,heap.current,heap.max,ram.percent,ram.current,ram.max,node.role,master,cpu,load_1m,load_5m,load_15m,disk.used_percent,disk.used,disk.total}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" ,org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];\",\n","stream":"stdout","time":"2023-10-23T09:21:34.490903538Z"}
{"log":"\"at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179
恢复后,排查日志,主要看到的现象就是193节点频繁的added、removed
added {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397620, reason: ApplyCommitRequest{term=43, version=397620, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{k0x5Rog-S0CFhAmorJ3V0Q}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}, term: 43, version: 397621, reason: ApplyCommitRequest{term=43, version=397621, sourceNode={es-b-73}{1WRaRyU-SuCPn8qbz3e4hg}{1iRrVMWcSNGGxsuQYxePSg}{10.163.220.73}{10.163.220.73:9301}{cdfhilmrstw}{ml.machine_memory=133070966784, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=33285996544, transform.node=true}}
导致193上的shard一直rebalance。但193上日志,不知道什么原因丢失了,能找到的最早的日志也是23号下午5点21分之后的日志了。经过日志排查问题到这里被卡主了。
回看发生问题前的一段时间各个节点的监控,CPU、内存、磁盘、写入量等都没有异常的波动,唯一发现的一个异常:
那么generic thread pool是做啥的呢?
=> 用于通用操作(例如,后台节点发现)它的线程池类型是动态缩放的
到这里也只是能印证了,节点间相互发现有问题。那么到底什么原因造成的集群宕机呢?
时间来到了今天上午也就是25号9点左右的时候,发现集群又开始rebalance shards,通过
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason
可以看到shard重新迁移的原因是node-left,到master节点的机器看下日志:
root@jiankunking-es-02:~# docker logs -f 1f741600dbcc |grep node-left
{"type": "server", "timestamp": "2023-10-23T09:33:18,042Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 399161, delta: removed {{es-b-221}{jaNLhd1eT6SUYcLJkHpE1Q}{XkD-b14XQzq1-o3EiDSuKA}{10.163.220.221}{10.163.220.221:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:52:35,140Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: followers check retry count exceeded], term: 52, version: 404805, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:24,564Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404807, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:53:30,879Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "business-log", "node.name": "es-b-194", "message": "node-left[{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw} reason: disconnected], term: 52, version: 404810, delta: removed {{es-b-193}{ZFQdNP4HSgCtPCnMfDdopw}{PPAowFAWQRiI9s5FoaDxWQ}{10.163.204.193}{10.163.204.193:9301}{cdfhilmrstw}}", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
按理说 193 194 195这些机器与另外4台机器相比,区别点在于挂了多块盘,那性能应该更好啊,为啥连不上的会是他们呢?
又回看了下节点日志,发现节点间互连存在大量的超时
master(194)节点日志:
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758574}{false}{false}{false}}] of size [21075] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758608}{false}{false}{false}}] of size [1992] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{70758614}{false}{false}{false}}] of size [5563] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [29403ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758686}{false}{false}{false}}] of size [9103] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
{"type": "server", "timestamp": "2023-10-25T01:35:18,510Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-194", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{70758737}{false}{false}{false}}] of size [12666] on [Netty4TcpChannel{localAddress=/10.163.204.194:57580, remoteAddress=10.163.204.193/10.163.204.193:9301, profile=default}] took [28603ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "PhjlCea6TNKh4rdZPyPDkA" }
普通节点(193)
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855747}{false}{false}{false}}] of size [3828] on [Netty4TcpChannel{localAddress=/10.163..204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855763}{false}{false}{false}}] of size [14563] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{73855795}{false}{false}{false}}] of size [12940] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
{"type": "server", "timestamp": "2023-10-25T01:34:46,866Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "business-log", "node.name": "es-b-193", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s][r]}{73855813}{false}{false}{false}}] of size [10721] on [Netty4TcpChannel{localAddress=/10.163.204.193:38616, remoteAddress=10.163.220.221/10.163.220.221:9301, profile=default}] took [35027ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "ArYy-qmCTbCQTDUI8ogsBg", "node.id": "ZFQdNP4HSgCtPCnMfDdopw" }
看到到这里难道是网络或者网卡打满了?
看下带宽Bandwidth:
root@jiankunking-es-02:~# ethtool eth0 | grep SpeedSpeed: Unknown!
root@jiankunking-es-02:~#
再看下网卡实时流量
iftop -n
找负责机器的同事核对,得到以下信息
- 瓶颈是按照网卡上限看,不看两个服务器之间链路 =>也就是服务器间的网络带宽应该是远大于机器网卡的,所以服务器的网卡更容易成为瓶颈
- 服务器网卡上限是:内网带宽(bit/s) 10.00Gpbs=>也就是服务器网卡总流量的上限是 10/8 大约是1G的流量
同时要了下阿里云的监控图
从阿里云的监控可以判断出这时候网卡并没有满的(进+出,也就能到阈值10%的样子),我们再看下23号傍晚的阿里云监控
这里的要注意下,对于阿里云来说,比如网卡有阈值,但短时间阿里云是允许突破限制的(虽然这里并没有满)
既然网络没有问题,那问题出在哪?
算了,找一个节点频繁added remove的时间在master(193)上抓包一下
从抓包中可以看出以下信息:
- 网络连接正常,建立连接–发送数据–连接关闭
- 是master节点主动断开的连接
但是master节点为什么断开的连接,就不清楚了。追查到这里,感觉问题应该不是在网络还是es自身哪里出了问题。
继续一顿Google,找到了这么一个issue:https://github.com/elastic/elasticsearch/issues/67873
虽然各种异常提示都不太一样,但现象是一样的
在issue中发现shard数太多会对于集群造成一些不利的影响。
我们回看了下集群shard监控
发现shard数一直在增加,整个集群已经到达6w+。要不我们也删除部分历史索引试试?
这时候74节点一直在added removed,我手动重启了74节点,并将尝试将7、8月份的索引删除,集群逐步恢复到70%+的时候,又有节点下线了;
我们再次尝试删除9月份加10月份1号-6号的索引,这时候集群恢复成功了(这时候集群shard数已经从6w降到了4w)。
问题暂时得到了处理,观察一段时间再没有发现node-left的情况,但根本原因并没有找到。(比如23号重启集群后,shard数也是接近6w,但也重启成功了,但25号却一直没有成功)。
如果大家有这方面的资料或者最佳实践,欢迎留言。
小结
经过这件事之后,我最大的一个感触就是官方的推荐值还是要参考一下:
Elasticsearch 集群内应该设置多少个分片(shard)?
主要是因为要控制成本所以只要集群不出问题,官方的建议,我们一般是这么个态度:
其实对于shard大小跟每个节点推荐的数据量一样,我们都没有接受他们的建议。比如这次排查为什么没有怀疑过30T数据量太多呢?因为之前都是40T+。
其实经过这件事,还有以下几个点想分享一下
- 各种集群间应该资源隔离,也就是机器不要共用
- 不要在各种边缘上运行
- 比如我们es的磁盘其实一直就是瓶颈,只要触发rebalance shards,磁盘io一定是满的,就会导致写入速率下降等各种问题
- 集群内各个相同功能的节点配置要一样,比如我们的es集群,磁盘大小不一样,当到达磁盘阈值的时候,集群负载就会不均衡