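A Study of Elasticsearch Index Data Being Deleted
Background
A while ago I was helping a customer troubleshoot Elasticsearch (ES) problems. The environment was ingesting far more data than originally planned, so based on machine resource usage we decided to scale the cluster out from 2 data nodes to 4 data nodes plus a dedicated master. The cluster already held terabytes of data, and to keep the expansion work lightweight we first moved the stored index data (the indices directory under each data node's storage path) to a temporary location, Dest.
After the expansion, as a test, we moved a small portion of the index data from Dest back onto the data nodes of the cluster and restarted the cluster. A misconfigured container volume mapping broke the analysis plugin on the data nodes, so none of the loaded indices were assigned. I moved the index data back out again and fixed the volume mapping. Then I happened to notice through the _cat/indices API that all of those unassigned indices were still listed. Since they were unassigned and the data had already been copied out, I casually ran DELETE * against them. My mental model at the time was that an index's data and metaData all live in the index files, that a data node reads them when it loads the data and reports them to the master, and that the master then updates the global cluster state. So I did not expect DELETE * to cause any harm, especially for indices that had never been allocated properly.
I then copied the same portion of index data back to the data nodes and restarted the whole cluster. This is where things went badly wrong: _cluster/health showed no index recovery percentage, which was odd; _cat/indices returned no indices at all; and the index directories I had just copied into each data node's storage path no longer existed. At that point I started to panic, because part of the data had been lost and what had happened was beyond my understanding of ES. After carefully cleaning up the customer environment, I decided the problem deserved a proper investigation. This document records what I learned about the ES internals behind it.
Hands-on Analysis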
- ES 5.6.16
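- 1 master + 1 data node (each ES instance run from source in the IntelliJ IDE)
At first I had no clear idea which ES module or class to start from, so I decided to set up an ES environment, reproduce the problem, and look for an entry point there. I built a two-node cluster (1 master + 1 data node), set the log level to debug on both, replayed the whole sequence of operations that had led to the data loss, and dug through the debug logs for useful information: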
[2020-10-09T13:48:48,538][DEBUG][o.e.i.c.IndicesClusterStateService] [master] [[twitter/4fHvcKLSRBuXK4mGTVI9Bg]] cleaning index, no longer part of the metadata
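As shown above, the debug logs of both the master node and the data node contain this record about deleting index data, so the IndicesClusterStateService class and its deleteIndices(...) method became the focus and entry point of the investigation. The complete body of deleteIndices(...) is: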
/**
 * Deletes indices (with shard data).
 *
 * @param event cluster change event
 */
private void deleteIndices(final ClusterChangedEvent event) {
    final ClusterState previousState = event.previousState();
    final ClusterState state = event.state();
    final String localNodeId = state.nodes().getLocalNodeId();
    assert localNodeId != null;
    for (Index index : event.indicesDeleted()) {
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] cleaning index, no longer part of the metadata", index);
        }
        AllocatedIndex<? extends Shard> indexService = indicesService.indexService(index);
        final IndexSettings indexSettings;
        if (indexService != null) {
            indexSettings = indexService.getIndexSettings();
            indicesService.removeIndex(index, DELETED, "index no longer part of the metadata");
        } else if (previousState.metaData().hasIndex(index.getName())) {
            // The deleted index was part of the previous cluster state, but not loaded on the local node
            final IndexMetaData metaData = previousState.metaData().index(index);
            indexSettings = new IndexSettings(metaData, settings);
            indicesService.deleteUnassignedIndex("deleted index was not assigned to local node", metaData, state);
        } else {
            // The previous cluster state's metadata also does not contain the index,
            // which is what happens on node startup when an index was deleted while the
            // node was not part of the cluster. In this case, try reading the index
            // metadata from disk. If its not there, there is nothing to delete.
            // First, though, verify the precondition for applying this case by
            // asserting that the previous cluster state is not initialized/recovered.
            assert previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK);
            final IndexMetaData metaData = indicesService.verifyIndexIsDeleted(index, event.state());
            if (metaData != null) {
                indexSettings = new IndexSettings(metaData, settings);
            } else {
                indexSettings = null;
            }
        }
        if (indexSettings != null) {
            threadPool.generic().execute(new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    logger.warn(
                        (Supplier<?>) () -> new ParameterizedMessage("[{}] failed to complete pending deletion for index", index), e);
                }
                @Override
                protected void doRun() throws Exception {
                    try {
                        // we are waiting until we can lock the index / all shards on the node and then we ack the delete of the store
                        // to the master. If we can't acquire the locks here immediately there might be a shard of this index still
                        // holding on to the lock due to a "currently canceled recovery" or so. The shard will delete itself BEFORE the
                        // lock is released so it's guaranteed to be deleted by the time we get the lock
                        indicesService.processPendingDeletes(index, indexSettings, new TimeValue(30, TimeUnit.MINUTES));
                    } catch (LockObtainFailedException exc) {
                        logger.warn("[{}] failed to lock all shards for index - timed out after 30 seconds", index);
                    } catch (InterruptedException e) {
                        logger.warn("[{}] failed to lock all shards for index - interrupted", index);
                    }
                }
            });
        }
    }
}
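The method takes a ClusterChangedEvent, which describes a change in the ES cluster state; such state changes are mainly synchronized from the master node to the other nodes. The for loop iterates over the result of event.indicesDeleted(). That method and its two helpers look like this: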
/**
 * Returns the indices deleted in this event
 */
public List<Index> indicesDeleted() {
    if (previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK)) {
        // working off of a non-initialized previous state, so use the tombstones for index deletions
        return indicesDeletedFromTombstones();
    } else {
        // examine the diffs in index metadata between the previous and new cluster states to get the deleted indices
        return indicesDeletedFromClusterState();
    }
}

private List<Index> indicesDeletedFromTombstones() {
    // We look at the full tombstones list to see which indices need to be deleted. In the case of
    // a valid previous cluster state, indicesDeletedFromClusterState() will be used to get the deleted
    // list, so a diff doesn't make sense here. When a node (re)joins the cluster, its possible for it
    // to re-process the same deletes or process deletes about indices it never knew about. This is not
    // an issue because there are safeguards in place in the delete store operation in case the index
    // folder doesn't exist on the file system.
    List<IndexGraveyard.Tombstone> tombstones = state.metaData().indexGraveyard().getTombstones();
    return tombstones.stream().map(IndexGraveyard.Tombstone::getIndex).collect(Collectors.toList());
}

private List<Index> indicesDeletedFromClusterState() {
    // If the new cluster state has a new cluster UUID, the likely scenario is that a node was elected
    // master that has had its data directory wiped out, in which case we don't want to delete the indices and lose data;
    // rather we want to import them as dangling indices instead. So we check here if the cluster UUID differs from the previous
    // cluster UUID, in which case, we don't want to delete indices that the master erroneously believes shouldn't exist.
    // See test DiscoveryWithServiceDisruptionsIT.testIndicesDeleted()
    // See discussion on https://github.com/elastic/elasticsearch/pull/9952 and
    // https://github.com/elastic/elasticsearch/issues/11665
    if (metaDataChanged() == false || isNewCluster()) {
        return Collections.emptyList();
    }
    List<Index> deleted = null;
    for (ObjectCursor<IndexMetaData> cursor : previousState.metaData().indices().values()) {
        IndexMetaData index = cursor.value;
        IndexMetaData current = state.metaData().index(index.getIndex());
        if (current == null) {
            if (deleted == null) {
                deleted = new ArrayList<>();
            }
            deleted.add(index.getIndex());
        }
    }
    return deleted == null ? Collections.<Index>emptyList() : deleted;
}
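In practice, in our restart scenario the deleted indices returned by the event come from the tombstones in the cluster metaData. This makes sense: the node has just restarted, its previous state still carries STATE_NOT_RECOVERED_BLOCK, and so the result cannot come from comparing the previous metaData with the current one. The investigation therefore moves on to tombstones: what are they, and why can they be read from metaData.indexGraveyard? It turns out that every time a DELETE index operation runs, the deleted index is recorded in the cluster metaData, in the following form (taken from the _cluster/state API):
"metadata": {
    "cluster_uuid": "kURWiZwNQ0-jmDqNIQOa9g",
    "templates": {},
    "indices": {},
    "index-graveyard": {
        "tombstones": [
            {
                "index": {
                    "index_name": "twitter",
                    "index_uuid": "IR5DYQLLTJKKBGxgal63nQ"
                },
                "delete_date_in_millis": 1602208073269
            }
        ]
    }
}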
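ES also has a dedicated IndexGraveyard class for deleted indices; the name literally means a graveyard for indices. A few terms, explained together:
- IndexGraveyard: the class that holds the record of deleted indices
- tombstone: the record of one deleted index
- tombstones: the collection of deleted-index records; its maximum size is set with cluster.indices.tombstones.size and defaults to 500
- dangling indices: indices whose state still exists on disk but which are not present in the cluster metaData (the index directories copied back in the scenario above are in this situation)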
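To make these terms concrete, here is a minimal Java sketch of a bounded graveyard. It is a toy model, not the ES IndexGraveyard class: the names GraveyardSketch, addTombstone and containsIndex are made up for illustration, and dropping the oldest entry first once the cap is exceeded is my assumption about the purge order. The point it illustrates is that a tombstone identifies an index by name plus UUID, as the index_name and index_uuid fields in the _cluster/state output suggest.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a bounded index graveyard. All names here are hypothetical,
// chosen for illustration; this is not the Elasticsearch implementation.
final class GraveyardSketch {

    /** Record of one deleted index: name, UUID and deletion time. */
    static final class Tombstone {
        final String indexName;
        final String indexUuid;
        final long deleteDateInMillis;

        Tombstone(String indexName, String indexUuid, long deleteDateInMillis) {
            this.indexName = indexName;
            this.indexUuid = indexUuid;
            this.deleteDateInMillis = deleteDateInMillis;
        }
    }

    private final int maxTombstones;
    private final List<Tombstone> tombstones = new ArrayList<>();

    GraveyardSketch(int maxTombstones) {
        if (maxTombstones < 0) {
            throw new IllegalArgumentException("maxTombstones must be >= 0");
        }
        this.maxTombstones = maxTombstones;
    }

    /** Records a deletion; assumption: the oldest entries are dropped once the cap is exceeded. */
    void addTombstone(String indexName, String indexUuid, long deleteDateInMillis) {
        tombstones.add(new Tombstone(indexName, indexUuid, deleteDateInMillis));
        while (tombstones.size() > maxTombstones) {
            tombstones.remove(0);
        }
    }

    /** A tombstone matches only when both the name and the UUID are equal. */
    boolean containsIndex(String indexName, String indexUuid) {
        for (Tombstone t : tombstones) {
            if (t.indexName.equals(indexName) && t.indexUuid.equals(indexUuid)) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return tombstones.size();
    }

    public static void main(String[] args) {
        GraveyardSketch graveyard = new GraveyardSketch(500);
        graveyard.addTombstone("twitter", "IR5DYQLLTJKKBGxgal63nQ", 1602208073269L);
        // Same name and UUID as the deleted index: matches the tombstone.
        System.out.println(graveyard.containsIndex("twitter", "IR5DYQLLTJKKBGxgal63nQ"));
        // Same name but a different (hypothetical) UUID: no match.
        System.out.println(graveyard.containsIndex("twitter", "someOtherUuid"));
    }
}
```

In this model an index that is re-created under the same name gets a new UUID and is not affected by the old tombstone, whereas a directory copied back from a backup keeps its original UUID and still matches.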
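With this groundwork in place, I looked at the persistent storage of the ES master node. Under the master's storage path there are two important files: one records the cluster metaData (global-x.st) and the other records information about the master node itself (node-x.st). Opening both files in vim in hexadecimal mode gives: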
# global-1.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000 ?.l..state......
00000010: 0001 3a29 0a05 fa88 6d65 7461 2d64 6174 ..:)....meta-dat
00000020: 61fa 8676 6572 7369 6f6e d08b 636c 7573 a..version..clus
00000030: 7465 725f 7575 6964 5542 7563 7a51 3365 ter_uuidUBuczQ3e
00000040: 6353 6757 6e61 7378 7465 476b 7636 6788 cSgWnasxteGkv6g.
00000050: 7465 6d70 6c61 7465 73fa fb8e 696e 6465 templates...inde
00000060: 782d 6772 6176 6579 6172 64fa 8974 6f6d x-graveyard..tom
00000070: 6273 746f 6e65 73f8 fa84 696e 6465 78fa bstones...index.
00000080: 8969 6e64 6578 5f6e 616d 6544 6e61 6d65 .index_nameDname
00000090: 7389 696e 6465 785f 7575 6964 5564 5857 s.index_uuidUdXW
000000a0: 4957 4878 7352 6d57 3575 5441 6274 3969 IWHxsRmW5uTAbt9i
000000b0: 6b65 77fb 9464 656c 6574 655f 6461 7465 kew..delete_date
000000c0: 5f69 6e5f 6d69 6c6c 6973 2501 3a43 0e1b _in_millis%.:C..
000000d0: 1dae fbf9 fbfb fbc0 2893 e800 0000 0000 ........(.......
000000e0: 0000 0028 e8a7 b60a ...(....
# node-0.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000 ?.l..state......
00000010: 0001 3a29 0a05 fa86 6e6f 6465 5f69 6455 ..:)....node_idU
00000020: 6857 5147 786b 3637 5342 2d4d 5575 3874 hWQGxk67SB-MUu8t
00000030: 6548 7173 4c51 fbc0 2893 e800 0000 0000 eHqsLQ..(.......
00000040: 0000 001e e8fb f70a
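The dump shows the main contents of the stored metaData. Every time the master node starts, it reads the local global-x.st file through the read method of the MetaDataStateFormat class and populates the metaData data structure from it. When the cluster metaData changes, the master also promptly writes the update to the local file. So ES relies mainly on local files to persist cluster metadata. The x in global-x and node-x is a generation number that goes up each time a new state file is written, normally counting up from 0.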
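The first bytes of both dumps can be decoded by hand. The sketch below is my own reading of the dump, not ES code: it assumes the file starts with a Lucene codec header (4-byte magic 0x3fd76c17, a length-prefixed codec name, a 4-byte version) followed by a 4-byte content type marker, and that the body after it is SMILE-encoded, which matches the ":)" bytes visible at offset 0x12. The class and field names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Decodes the header of a *.st state file, based on my reading of the hex dump.
// The layout is an assumption; names are hypothetical, not Elasticsearch APIs.
final class StateFileHeaderSketch {

    static final int CODEC_MAGIC = 0x3fd76c17;

    // The first 22 bytes of global-1.st, copied from the dump.
    static final byte[] DUMP_PREFIX = {
        0x3f, (byte) 0xd7, 0x6c, 0x17,       // codec magic
        0x05, 0x73, 0x74, 0x61, 0x74, 0x65,  // length 5 + "state"
        0x00, 0x00, 0x00, 0x01,              // format version
        0x00, 0x00, 0x00, 0x01,              // assumed: content type marker
        0x3a, 0x29, 0x0a, 0x05               // ":)\n" + flags, start of the body
    };

    static final class Header {
        final String codec;
        final int version;
        final int xContentTypeOrdinal;
        final int bodyOffset;

        Header(String codec, int version, int xContentTypeOrdinal, int bodyOffset) {
            this.codec = codec;
            this.version = version;
            this.xContentTypeOrdinal = xContentTypeOrdinal;
            this.bodyOffset = bodyOffset;
        }
    }

    static Header parse(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
        int magic = buf.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalArgumentException("unexpected magic: " + Integer.toHexString(magic));
        }
        // Simplification: the name length is read as a single byte (names shorter than 128 bytes).
        int nameLength = buf.get() & 0xFF;
        byte[] name = new byte[nameLength];
        buf.get(name);
        int version = buf.getInt();
        int xContentTypeOrdinal = buf.getInt();
        return new Header(new String(name, StandardCharsets.UTF_8), version, xContentTypeOrdinal, buf.position());
    }

    public static void main(String[] args) {
        Header h = parse(DUMP_PREFIX);
        System.out.println(h.codec + " v" + h.version
            + ", content type marker " + h.xContentTypeOrdinal
            + ", body starts at offset " + h.bodyOffset);
    }
}
```

Run against the bytes from the dump, this reports the codec name state, version 1, and a body starting at offset 18. The readable field names further down in the dump (meta-data, cluster_uuid, index-graveyard, tombstones) are the keys of that body.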
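That settles what event.indicesDeleted() returns inside the deleteIndices(...) method of IndicesClusterStateService. Continuing with the main logic of deleteIndices(...): because our scenario is an ES node restart, indexService is null and the metaData of previousState does not contain the indices in the tombstones collection, so execution reaches indicesService.verifyIndexIsDeleted(...), shown below: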
/**
 * Verify that the contents on disk for the given index is deleted; if not, delete the contents.
 * This method assumes that an index is already deleted in the cluster state and/or explicitly
 * through index tombstones.
 * @param index {@code Index} to make sure its deleted from disk
 * @param clusterState {@code ClusterState} to ensure the index is not part of it
 * @return IndexMetaData for the index loaded from disk
 */
@Override
@Nullable
public IndexMetaData verifyIndexIsDeleted(final Index index, final ClusterState clusterState) {
    // this method should only be called when we know the index (name + uuid) is not part of the cluster state
    if (clusterState.metaData().index(index) != null) {
        throw new IllegalStateException("Cannot delete index [" + index + "], it is still part of the cluster state.");
    }
    if (nodeEnv.hasNodeFile() && FileSystemUtils.exists(nodeEnv.indexPaths(index))) {
        final IndexMetaData metaData;
        try {
            metaData = metaStateService.loadIndexState(index);
        } catch (Exception e) {
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to load state file from a stale deleted index, folders will be left on disk", index), e);
            return null;
        }
        final IndexSettings indexSettings = buildIndexSettings(metaData);
        try {
            deleteIndexStoreIfDeletionAllowed("stale deleted index", index, indexSettings, ALWAYS_TRUE);
        } catch (Exception e) {
            // we just warn about the exception here because if deleteIndexStoreIfDeletionAllowed
            // throws an exception, it gets added to the list of pending deletes to be tried again
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to delete index on disk", metaData.getIndex()), e);
        }
        return metaData;
    }
    return null;
}
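As its name suggests, the method verifies that an index has really been deleted, meaning that the index directory under the node's local storage path is gone. It first calls nodeEnv.hasNodeFile(), which checks that the nodePaths and locks of the nodeEnv object are both non-null. If they are, and the full path of the index exists locally (in my setup, for example, /Users/tony/myelasticsearch/2-5.6.16/home/data/nodes/0/indices/YzNUXHEZQrqe2CVCVu0thg), execution enters the if body. There, metaStateService.loadIndexState(...) loads the metaData of the index; just as the master reads the cluster metaData from its local global-x.st file, this also goes through the read(...) method of MetaDataStateFormat, here reading the state file inside the index directory on disk. Next, buildIndexSettings(metaData) builds the indexSettings object, which simply calls the IndexSettings constructor. Execution then enters deleteIndexStoreIfDeletionAllowed(...), and the deletion is finally carried out by the deleteIndexDirectoryUnderLock(...) method of NodeEnvironment, where IOUtils.rm(indexPaths) removes the directory of an index that is listed in the tombstones.
This is the root cause: the copied index directories were all gone after the cluster restart because this code path really deleted them. With that, the internal mechanism that deleted the index files is clear. The flow is:
- Running DELETE on an index writes it into the tombstones collection in the cluster metaData, and the metaData is persisted in a local file on the master node (global-x.st)
- When the master node starts, it reads that file from its local path and loads the cluster information into metaData
- While the master synchronizes the cluster state to the nodes, ES checks whether each index in tombstones has really been deleted, that is, whether its local index directory is gone
- If the directory of a tombstoned index still exists, it is deleted during this step
- The data loss above happened because DELETE was run first, which recorded the deleted indices in metaData; the index files copied to the data nodes afterwards belonged to those same tombstoned indices, so ES deleted them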
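The flow above can be reproduced in miniature without ES. The following is a minimal simulation under stated assumptions: index directories are named after the index UUID (as the example path earlier suggests), a set of strings stands in for the tombstones, and the state-file loading and shard locking that ES performs are left out. All class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Miniature simulation of "directories of tombstoned indices are removed on startup".
// Hypothetical names; this is not the Elasticsearch implementation.
final class TombstoneCleanupSketch {

    /** Deletes every directory under indicesDir whose name is a tombstoned UUID. */
    static List<String> applyTombstones(Path indicesDir, Set<String> tombstonedUuids) {
        List<String> deleted = new ArrayList<>();
        for (String uuid : tombstonedUuids) {
            Path indexDir = indicesDir.resolve(uuid);
            if (Files.isDirectory(indexDir)) {
                deleteRecursively(indexDir);
                deleted.add(uuid);
            }
        }
        return deleted;
    }

    /** Removes a directory tree, children before parents. */
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : paths) {
                Files.delete(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replays the incident in a temp directory and returns the index directories that survive. */
    static List<String> runDemo() {
        try {
            Path indicesDir = Files.createTempDirectory("indices");
            // A directory copied back after its index was deleted (UUID from the tombstone example).
            Files.createDirectories(indicesDir.resolve("IR5DYQLLTJKKBGxgal63nQ").resolve("0"));
            // A directory whose (hypothetical) UUID has no tombstone.
            Files.createDirectories(indicesDir.resolve("hypotheticalLiveUuid").resolve("0"));

            applyTombstones(indicesDir, Collections.singleton("IR5DYQLLTJKKBGxgal63nQ"));

            List<String> survivors = new ArrayList<>();
            try (Stream<Path> children = Files.list(indicesDir)) {
                children.forEach(p -> survivors.add(p.getFileName().toString()));
            }
            deleteRecursively(indicesDir); // clean up the temp directory
            return survivors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [hypotheticalLiveUuid]
    }
}
```

Only the directory without a tombstone survives, which is what happened to the copied index data after the restart.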
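Summary
This completes the analysis, from both practice and code, of why the index data was deleted. Because my grasp of this part of ES was imprecise, part of the data was lost during the operation. Two lessons stand out:
- Make a complete backup before operating on data (use cp rather than mv)
- Understand the mechanism behind a feature well enough before operating further on it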
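One practical way to apply the second lesson is to compare the directories about to be copied back against the tombstones reported by _cluster/state. The sketch below is an illustration under assumptions rather than a vetted tool: it pulls index_uuid values out of the JSON with a regular expression instead of a real JSON parser, it assumes index directory names are index UUIDs, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pre-flight check: which directories to be restored belong to tombstoned indices?
// Hypothetical helper; regex extraction is a shortcut, not robust JSON parsing.
final class TombstonePreflightSketch {

    private static final Pattern INDEX_UUID =
        Pattern.compile("\"index_uuid\"\\s*:\\s*\"([^\"]+)\"");

    /** Collects every index_uuid value found in the given _cluster/state JSON text. */
    static Set<String> tombstonedUuids(String clusterStateJson) {
        Set<String> uuids = new LinkedHashSet<>();
        Matcher m = INDEX_UUID.matcher(clusterStateJson);
        while (m.find()) {
            uuids.add(m.group(1));
        }
        return uuids;
    }

    /** Returns the directory names that match a tombstoned UUID, in input order. */
    static List<String> unsafeToRestore(Collection<String> indexDirNames, String clusterStateJson) {
        Set<String> tombstoned = tombstonedUuids(clusterStateJson);
        List<String> flagged = new ArrayList<>();
        for (String dirName : indexDirNames) {
            if (tombstoned.contains(dirName)) {
                flagged.add(dirName);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String json = "{\"index-graveyard\":{\"tombstones\":[{\"index\":{"
            + "\"index_name\":\"twitter\",\"index_uuid\":\"IR5DYQLLTJKKBGxgal63nQ\"},"
            + "\"delete_date_in_millis\":1602208073269}]}}";
        List<String> dirs = new ArrayList<>();
        dirs.add("IR5DYQLLTJKKBGxgal63nQ");
        dirs.add("hypotheticalOtherUuid");
        System.out.println(unsafeToRestore(dirs, json)); // prints [IR5DYQLLTJKKBGxgal63nQ]
    }
}
```

Any directory this flags would be removed again on the next restart, so it is worth knowing about before the copy rather than after.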
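This document does not explore ES cluster state publishing or the details of ES's local storage files any further. I found two good blog posts on those topics online and recommend them to everyone. As always, let's learn ES together and improve together.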