Distributed Storage: Ceph

CephFS: Introduction and Lessons from Production Use

2019-01-14  lihanglucien

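Contents

  1. Ceph architecture overview
  2. NFS overview
  3. Comparison of distributed file systems
  4. CephFS overview
  5. MDS overview
    • 5.1 Single active MDS
    • 5.2 High availability with a single active MDS
  6. Problems encountered with CephFS
    • 6.1 Client cache problems
    • 6.2 Server-side cache not released
    • 6.3 Clients hanging or slow requests
    • 6.4 Clients losing their connection
    • 6.5 Active/standby failover problems
  7. Solutions to the CephFS problems
    • 7.1 Server-side cache warnings
    • 7.2 Clients hanging
      • 7.2.1 MDS locking
    • 7.3 MDS failover problems
      • 7.3.1 Why does MDS failover take so long?
      • 7.3.2 MDS failover loops
    • 7.4 Clients losing their connection
  8. Summary and recommended optimizations
  9. Multiple active MDS
    • 9.1 Introduction
    • 9.2 Advantages of multiple active MDS
    • 9.3 Characteristics of multiple active MDS
    • 9.4 CephFS Subtree Partitioning
      • 9.4.1 Introduction
    • 9.5 Subtree Pinning (static subtree partitioning)
    • 9.6 Dynamic load balancing
      • 9.6.1 Introduction
      • 9.6.2 Configurable load balancing
      • 9.6.3 Load balancing policies
      • 9.6.4 Flexible control of load balancing with Lua
      • 9.6.5 Internal structure diagram
  10. Multi-active load balancing: hands-on walkthrough
    • 10.1 Cluster architecture
    • 10.2 Adding active MDS daemons
    • 10.3 Stress-testing multiple active MDS
    • 10.4 Multiple active MDS: dynamic load balancing
    • 10.5 Multiple active MDS: static partitioning (multi-tenant isolation)
    • 10.6 Multiple active MDS: active/standby mode
  11. Multi-active load balancing: summary
    • 11.1 Test report
    • 11.2 Conclusions
  12. MDS states
    • 12.1 MDS failover flow chart
    • 12.2 MDS states
    • 12.3 State Diagram
  13. Deep dive
    • 13.1 MDS startup phases
    • 13.2 Core MDS components
    • 13.3 MDSDaemon class diagram
    • 13.4 MDSDaemon source code analysis
    • 13.5 MDSRank class diagram
    • 13.6 MDSRank source code analysis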

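1. Ceph architecture overview

[Figure: Ceph architecture]

Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.

Features:

Use cases:

System architecture:

The Ceph ecosystem can be divided into four parts:

  1. Clients: the users of the data
  2. mds: Metadata server cluster (caches and synchronizes the distributed metadata)
  3. osd: Object storage cluster (stores data and metadata as objects and performs other key duties)
  4. mon: Cluster monitors (perform the monitoring functions)

[Figure: Ceph ecosystem architecture]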

2. NFS overview

1. NAS (Network Attached Storage)

2. NFS (Network File System)

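3. Comparison of distributed file systems

MFS
  Features:
    1. Single MDS
    2. FUSE support
    3. Data is sharded across nodes
    4. Multiple replicas
    5. Manual failure recovery
  Suitable for: heavy reads and writes of small files
  Pros and cons:
    1. Simple to deploy and operate
    2. Has a single point of failure

Ceph
  Features:
    1. Multiple MDSs, scalable
    2. FUSE support
    3. Data is sharded and placed with CRUSH
    4. Multiple replicas or erasure coding
    5. Automatic failure recovery
  Suitable for: unified storage of small files
  Pros and cons:
    1. Simple to deploy and operate
    2. Self-healing, recovers on its own
    3. MDS locking problems
    4. The J (Jewel) release has many pitfalls; the L (Luminous) release is fit for production

GlusterFS
  Features:
    1. No metadata node
    2. FUSE support
    3. Data is sharded across nodes
    4. Mirroring
    5. Automatic failure recovery
  Suitable for: large files
  Pros and cons:
    1. Simple to deploy and operate
    2. No metadata management
    3. Adds computing load on the client

Lustre
  Features:
    1. Two MDSs backing each other up; cannot be scaled out
    2. FUSE support
    3. Data is sharded across nodes
    4. No redundancy
    5. Automatic failure recovery
  Suitable for: reads and writes of large files
  Pros and cons:
    1. Complex to deploy and operate
    2. Very large code base
    3. Relatively mature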

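4. CephFS overview

[Figure: CephFS overview]

Notes:

5. MDS overview

5.1 Single active MDS

[Figure: single active MDS]

Notes:

5.2 High availability with a single active MDS

[Figure: high availability with a single active MDS]

Notes: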

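6. Problems encountered with CephFS

6.1 Client cache problems

Message: Client name failing to respond to cache pressure

Explanation: Each client keeps its own metadata cache, and the entries in a client cache (inodes, for example) are also held in the MDS cache.
So when the MDS needs to shrink its cache (to stay below mds_cache_size), it also sends messages asking clients to shrink theirs. This message appears when a client takes longer than mds_recall_state_timeout (60 s by default) to respond.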
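The timeout rule described above can be modeled in a few lines. This is a simplified sketch and not MDS code: the function name, the input dictionary, and the client names are hypothetical, and only the 60 s default for mds_recall_state_timeout comes from the text.

```python
# Simplified model of the "failing to respond to cache pressure" check.
# Not Ceph code: names and data layout here are illustrative assumptions.

MDS_RECALL_STATE_TIMEOUT = 60.0  # seconds, default quoted in the text


def clients_failing_recall(pending_recalls, now, timeout=MDS_RECALL_STATE_TIMEOUT):
    """Return clients whose cache-recall request has gone unanswered too long.

    pending_recalls maps a client name to the time (in seconds) at which the
    MDS asked that client to trim its cache and has not yet seen a response.
    """
    return sorted(
        client
        for client, asked_at in pending_recalls.items()
        if now - asked_at > timeout
    )


pending = {"client.a": 0.0, "client.b": 50.0}
print(clients_failing_recall(pending, now=70.0))
```

Here client.a has been silent for 70 s and is reported, while client.b (20 s) is still within the timeout.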

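6.2 Server-side cache not released

If a client is unresponsive or buggy, it prevents the MDS from keeping its cache below mds_cache_size, and the MDS may eventually run out of memory and crash.

6.3 Clients hanging or slow requests

6.4 Clients losing their connection

A client becomes unavailable because of network problems or other faults.

6.5 Active/standby failover problems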

7. Solutions to the CephFS problems

7.1 Server-side cache warnings

Fixed in v12 (Luminous):
https://github.com/ceph/ceph/commit/51c926a74e5ef478c11ccbcf11c351aa520dde2a
mds: fix false "failing to respond to cache pressure" warning

7.2 Clients hanging

7.2.1 MDS locking

7.2.1.1 Reproducing the scenario

  1. Reader and writer code
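//read.c
// Reader: opens test.log read-only every 5 seconds.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(void)
{
    const char *filename = "test.log";

    for (;;)
    {
        /* The descriptor is never closed, exactly as in the original test. */
        int fd = open(filename, O_RDONLY);
        printf("fd=[%d]", fd);
        fflush(stdout);
        sleep(5);
    }
}

//write.c
// Writer: appends one line to test.log every 5 seconds.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* write(), close(), sleep() */

int main(void)
{
    const char *filename = "test.log";
    static const char line[] = "aaaa\n";

    for (;;)
    {
        int fd = open(filename, O_CREAT | O_WRONLY | O_APPEND, S_IRUSR | S_IWUSR);
        if (fd < 0)
        {
            perror("open");
        }
        else
        {
            /* sizeof(line) - 1 writes the 5 visible bytes without the trailing NUL. */
            if (write(fd, line, sizeof(line) - 1) < 0)
                perror("write");
            close(fd);
        }
        printf("fd=[%d] buffer=[%s]", fd, "aaaa");
        fflush(stdout);
        sleep(5);
    }
}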
  2. User A runs read and user B runs write. The MDS then logs the following:
2018-12-13 19:56:11.222816 7fffee6d0700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.670943 secs
2018-12-13 19:56:11.222826 7fffee6d0700  0 log_channel(cluster) log [WRN] : slow request 30.670943 seconds old, received at 2018-12-13 19:55:40.551820: client_request(client.22614489:538 lookup #0x1/test.log 2018-12-13 19:55:40.551681 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting
2018-12-13 19:56:13.782378 7ffff0ed5700  1 mds.ceph-xxx-osd02.ys Updating MDS map to version 229049 from mon.0
2018-12-13 19:56:33.782572 7ffff0ed5700  1 mds.ceph-xxx-osd02.ys Updating MDS map to version 229050 from mon.0
2018-12-13 20:00:26.226405 7fffee6d0700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph-xxx-osd01.ys (22532339), after 303.489228 seconds
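When many such warnings accumulate, it helps to pull out the request age, the client, and the lock state automatically. The following is a minimal sketch that assumes the log format is exactly the one shown above; the regular expression and the function name are illustrative and not part of any Ceph tool.

```python
import re

# Matches the "slow request" warning format shown in the log above.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"client_request\((?P<client>client\.\d+):\d+ (?P<op>\w+) (?P<target>\S+)"
    r".*\) currently (?P<status>.+)$"
)


def parse_slow_request(line):
    """Return a dict describing a slow-request log line, or None if no match."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["age"] = float(info["age"])
    return info


sample = (
    "2018-12-13 19:56:11.222826 7fffee6d0700  0 log_channel(cluster) log [WRN] : "
    "slow request 30.670943 seconds old, received at 2018-12-13 19:55:40.551820: "
    "client_request(client.22614489:538 lookup #0x1/test.log "
    "2018-12-13 19:55:40.551681 caller_uid=0, caller_gid=0{0,}) "
    "currently failed to rdlock, waiting"
)
print(parse_slow_request(sample))
```

For the sample line this yields the client client.22614489, the operation lookup on #0x1/test.log, and the status "failed to rdlock, waiting", which is the read lock the request is blocked on.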

Summary:

7.3 MDS failover problems

7.3.1 Why does MDS failover take so long?

  1. Analyze the logs. The rejoin_start and rejoin_joint_start steps take a long time.
2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active
  2. Trace the code. The time is spent in process_imported_caps, which mainly opens inodes and loads them into the cache.

[Figure: code trace of process_imported_caps]
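The time spent in each rejoin step can be read directly off the timestamps in the log excerpt. A minimal sketch, assuming every line starts with a "date time" pair in the format shown and ends with the event name:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def phase_durations(lines):
    """Return (event, seconds since previous event) for each line after the first."""
    events = []
    for line in lines:
        fields = line.split()
        stamp = datetime.strptime(" ".join(fields[:2]), LOG_TIME_FORMAT)
        events.append((fields[-1], stamp))
    return [
        (name, (stamp - prev_stamp).total_seconds())
        for (_, prev_stamp), (name, stamp) in zip(events, events[1:])
    ]


log = [
    "2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start",
    "2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start",
    "2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done",
]
for name, seconds in phase_durations(log):
    print(f"{name}: {seconds:.1f} s after the previous event")
```

For this excerpt, rejoin_joint_start comes about 60 s after rejoin_start, and rejoin_done about 144 s after that, so the rejoin phase alone accounts for roughly 204 s of the failover.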

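7.3.2 MDS failover loops

Each MDS daemon is supposed to send a beacon to the monitors every mds_beacon_interval. If a daemon fails to send one for at least mds_beacon_grace, the Ceph monitors automatically replace it with a standby MDS. If the MDS is busy because it holds too many session inodes, it may fail to send beacons in time during a failover, and the failover can then repeat in a loop. The usual recommendation is to increase mds_beacon_grace.

mds beacon grace
Description: The interval without beacons after which an MDS is considered laggy (and possibly replaced).
Type: Float
Default: 15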
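The effect of raising mds_beacon_grace can be illustrated with a small model of the check. This is a sketch of the rule as described above, not the monitor's actual code; the function name and the timestamps are hypothetical.

```python
# Simplified model of the beacon check described above. Not Ceph code.

MDS_BEACON_GRACE = 15.0  # seconds, default from the text


def is_laggy(last_beacon, now, grace=MDS_BEACON_GRACE):
    """True if no beacon has arrived within the grace period."""
    return now - last_beacon > grace


# An MDS that stays silent for 20 s while busy in rejoin is replaced under
# the default grace, but survives if the grace is raised to 60 s.
print(is_laggy(last_beacon=100.0, now=120.0))              # True
print(is_laggy(last_beacon=100.0, now=120.0, grace=60.0))  # False
```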

7.4 Clients losing their connection

client: fix fuse client hang because its pipe to mds is not ok
There is a risk that the client will hang if the fuse client session has been killed by the mds
and an mds daemon restart or hot-standby switch happens right away, but the client
does not receive any message from the monitor, due to network or other reasons,
until the mds becomes active again. The client then never runs closed_mds_session,
so the session stays in STATE_OPEN, but the client cannot send any message to
the mds because its pipe is not ok.

So we should create a pipe to the mds to guarantee that any meta request can be sent to
the server. When the mds receives the message, it sends a CLOSE_SESSION to the client
because its session for this client is STATE_CLOSED. After these
steps the old session of the client can be closed, a new session and pipe
can be established, and the mountpoint will be ok.

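8. Summary and recommended optimizations

9. Multiple active MDS

9.1 Introduction

Also known as: multi-mds, active-active MDS
By default, each CephFS file system is configured with a single active MDS daemon. To scale metadata performance on large systems, you can configure multiple active MDS daemons, which share the metadata workload among themselves.

As of the Luminous release, CephFS declares multiple metadata servers (Multi-MDS) and directory fragmentation (dirfragment) ready for production use.

[Figure: multiple active MDS]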

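9.2 Advantages of multiple active MDS

9.3 Characteristics of multiple active MDS

[Figure: characteristics of multiple active MDS]

9.4 CephFS Subtree Partitioning

9.4.1 Introduction

[Figure: CephFS subtree partitioning]

Notes:
To balance the load of file system data and metadata, the industry generally uses one of several partitioning methods:

9.5 Subtree Pinning (static subtree partitioning)

[Figure: subtree pinning]

Notes: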

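9.6 Dynamic load balancing

9.6.1 Introduction

Multiple active MDSs can migrate directories to balance the metadata load. The policies for when, where, and how much to migrate are hard-coded into the metadata balancing module.

Mantle is a programmable metadata balancer built into the MDS. The idea is to protect the mechanisms for balancing load (migration, replication, fragmentation) while letting the balancing policies be customized in Lua.

Most of the implementation is in MDBalancer. Metrics are passed to the balancer policy through the Lua stack, and a list of loads is returned to MDBalancer. These loads are the "amounts to send to each MDS" and are inserted directly into the MDBalancer "my_targets" vector.

The metrics exposed to the Lua policy are the same ones already stored in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_length, cpu_load_avg.

Mantle sits alongside the current balancer implementation and is enabled through a string in "ceph.conf". If the Lua policy fails (for whatever reason), the MDS falls back to the original metadata load balancer.
The balancer is stored in the RADOS metadata pool, and a string in the MDSMap tells the MDSs which balancer to use.

This PR does not have the following features from the Supercomputing paper:

  1. Balancing API: all we require is that a balancer written in Lua returns a targets table, where each index is the amount of load to send to each MDS
  2. "How much" hook: this lets the user define meta_load()
  3. Instantaneous CPU utilization as metric

Supercomputing '15 Paper: http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html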
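To make the "targets table" idea from 9.6.1 concrete, here is a small Python model of a balancing policy. Real Mantle policies are written in Lua and run inside the MDS; the rule used here (shed half of the load to the next rank when that rank is idle), the 0.01 threshold, and the list-of-loads input are assumptions chosen for illustration only.

```python
# Python model of a "greedy spill" style policy, for illustration only.
# Real Mantle policies are written in Lua; the data layout here is assumed.

def greedy_spill_targets(loads, whoami, threshold=0.01):
    """Return {rank: load to send} for the MDS with rank `whoami`.

    loads is a list of metadata loads indexed by rank. The MDS sheds half of
    its load to the next rank when it is loaded and that neighbor is idle.
    """
    targets = {}
    neighbor = whoami + 1
    if neighbor >= len(loads):
        return targets  # last rank: nobody to spill to
    if loads[whoami] > threshold and loads[neighbor] < threshold:
        targets[neighbor] = loads[whoami] / 2
    return targets


print(greedy_spill_targets([10.0, 0.0], whoami=0))  # {1: 5.0}
print(greedy_spill_targets([10.0, 4.0], whoami=0))  # {}
```

The returned mapping plays the role of the targets table: each key is a rank and each value is the amount of load to send to it. An empty result means no migration is requested.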

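9.6.2 Configurable load balancing

[Figure: configurable load balancing]

References:

9.6.3 Load balancing policies

[Figure: load balancing policies]

9.6.4 Flexible control of load balancing with Lua

[Figure: controlling load balancing with Lua]
References:

9.6.5 Internal structure diagram

[Figure: internal structure diagram]
References: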

10. Multi-active load balancing: hands-on walkthrough

10.1 Cluster architecture

10.2 Adding active MDS daemons

10.2.1 Set max_mds to 2

$ ceph fs set test1_fs max_mds 2

10.2.2 Check the fs status


$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd02.ys | Reqs:    0 /s | 3760  |   14  |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s |   11  |   13  |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  194M | 88.7T |
|   cephfs_data   |   data   |    0  | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd03.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
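When comparing several such snapshots, it can be convenient to extract the rank rows programmatically. The sketch below assumes the plain-text table layout shown above (six pipe-separated cells per rank row); the function name is hypothetical, and the parser only handles this text format.

```python
def parse_rank_rows(status_text):
    """Extract rank rows from the first table of `ceph fs status` text output."""
    rows = []
    for line in status_text.splitlines():
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Rank rows have six cells and a numeric first cell.
        if len(cells) == 6 and cells[0].isdigit():
            rows.append(
                {
                    "rank": int(cells[0]),
                    "state": cells[1],
                    "mds": cells[2],
                    "activity": cells[3],
                    "dns": cells[4],
                    "inos": cells[5],
                }
            )
    return rows


sample = """\
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
|  0   | active | ceph-xxx-osd02.ys | Reqs:    0 /s | 3760  |   14  |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s |   11  |   13  |
|       Pool      |   type   |  used | avail |
| cephfs_metadata | metadata |  194M | 88.7T |
"""
for row in parse_rank_rows(sample):
    print(row["rank"], row["state"], row["mds"])
```

Header rows and the pool table are skipped because their first cell is not a number or they have fewer than six cells.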

10.2.3 Summary

10.3 Stress-testing multiple active MDS

10.3.1 Mount the directory on the client

$ ceph-fuse /mnt/
$ df
ceph-fuse      95330861056     40960 95330820096   1% /mnt

10.3.2 Stress test with filebench

[Figure: filebench stress test]

10.3.3 Check the MDS load of the fs


$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 5624 /s |  139k |  133k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  238M | 88.7T |
|   cephfs_data   |   data   | 2240M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
| ceph-xxx-osd02.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.3.4 Summary

10.4 Multiple active MDS: dynamic load balancing

10.4.1 Put the balancer into RADOS

rados put --pool=cephfs_metadata greedyspill.lua ../src/mds/balancers/greedyspill.lua

10.4.2 Activate Mantle

ceph fs set test1_fs max_mds 2
ceph fs set test1_fs balancer greedyspill.lua

10.4.3 Mount and stress test

$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 2132 /s | 4522  | 1783  |
|  1   | active | ceph-xxx-osd02.ys | Reqs:  173 /s |  306  |  251  |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  223M | 88.7T |
|   cephfs_data   |   data   | 27.1M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

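10.4.4 Summary

10.5 Multiple active MDS: static partitioning (multi-tenant isolation)

10.5.1 Pin directories to different MDSs

# Pin /mnt/test0 to mds00 (rank 0)
# Pin /mnt/test1 to mds01 (rank 1)
# Syntax: setfattr -n ceph.dir.pin -v <rank> <path>

setfattr -n ceph.dir.pin -v 0 /mnt/test0
setfattr -n ceph.dir.pin -v 1 /mnt/test1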
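The same pinning can be scripted, for example when one directory per tenant has to be pinned. The sketch below sets the ceph.dir.pin extended attribute through Python's os.setxattr (available on Linux); the helper names are hypothetical, and treating -1 as "remove the pin" follows the Ceph documentation rather than this article.

```python
import os


def pin_value(rank):
    """Encode an MDS rank as the byte string stored in ceph.dir.pin."""
    if rank < -1:
        raise ValueError("rank must be -1 (unpin) or a non-negative rank")
    return str(rank).encode("ascii")


def pin_directory(path, rank):
    """Pin `path` to `rank`; same effect as: setfattr -n ceph.dir.pin -v <rank> <path>."""
    os.setxattr(path, "ceph.dir.pin", pin_value(rank))


# Example (requires a mounted CephFS, so it is not executed here):
# pin_directory("/mnt/test0", 0)
# pin_directory("/mnt/test1", 1)
```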

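10.5.2 Stress test from two clients

[Figure: stress test from two clients]

10.5.3 Observe the fs status (two stress-test clients)

# Check the request load on each MDS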
$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 3035 /s |  202k |  196k |
|  1   | active | ceph-xxx-osd02.ys | Reqs: 3039 /s | 70.8k | 66.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  374M | 88.7T |
|   cephfs_data   |   data   | 4401M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.5.4 Conclusions

10.6 Multiple active MDS: active/standby mode

10.6.1 Check the MDS status

$ ceph fs status
test1_fs - 4 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd02.ys | Reqs:    0 /s | 75.7k | 72.6k |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  311M | 88.7T |
|   cephfs_data   |   data   | 3322M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd03.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.6.2 Stop mds2

$ systemctl stop ceph-mds.target

10.6.3 Check the MDS status

$ ceph fs status
test1_fs - 2 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | replay | ceph-xxx-osd03.ys |               |    0  |    0  |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  311M | 88.7T |
|   cephfs_data   |   data   | 3322M | 88.7T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.6.4 Observe under load

# Stress-test rank 0; the requests land on mds3 as expected
$ ceph fs status
test1_fs - 4 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 2372 /s | 72.7k | 15.0k |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  367M | 88.7T |
|   cephfs_data   |   data   | 2364M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd02.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.6.5 Summary

11. Multi-active load balancing: summary

11.1 Test report

Tool       Cluster mode  Stress-test clients  Performance
filebench  1 MDS         2 clients            5624 ops/s
filebench  2 MDS         2 clients            Client 1: 3035 ops/s
                                              Client 2: 3039 ops/s
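The two rows of the report can be compared with a short calculation. This sketch only does arithmetic on the figures reported above; adding the two per-client figures into one aggregate is an assumption about how they should be combined.

```python
single_mds_ops = 5624            # 1 MDS, from the report above
pinned_ops = [3035, 3039]        # 2 MDS, one figure per client

total = sum(pinned_ops)
gain = (total - single_mds_ops) / single_mds_ops

print(f"aggregate with 2 MDS: {total} ops/s")
print(f"change vs 1 MDS: {gain:+.1%}")
```

Under that assumption the two-MDS run totals 6074 ops/s, about 8% more than the single-MDS run.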

11.2 Conclusions

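12. MDS states

12.1 MDS failover flow chart

[Figure: MDS failover flow chart]

Notes:

  1. The user manually triggers a failover with fail.
  2. The active MDS receives the signal and restarts itself via respawn.
  3. The standby MDS receives the signal and is elected as the new active MDS through a distributed algorithm.
  4. The new active MDS moves from up:boot to up:replay. In this journal recovery phase it reads the journal into memory and replays it there.
  5. The new active MDS moves from up:replay to up:reconnect. The recovering MDS must re-establish connections with its previous clients and query the file handles they held, so that it can recreate the capabilities and lock state in its cache.
  6. The new active MDS moves from up:reconnect to up:rejoin. The clients' inodes are loaded into the MDS cache. (This is the most time-consuming step.)
  7. The new active MDS moves from up:rejoin to up:active. The MDS is now in its normal, available state.
  8. recovery_done: the migration is complete.
  9. active_start: the MDS starts in its normal, available state and the mdcache loads the corresponding information.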

12.2 MDS states

up:active
  This is the normal operating state of the MDS. It indicates that the MDS and its rank in the file system are available.

up:standby
  The MDS is available to take over for a failed rank. The monitor will automatically assign an MDS in this state to a failed rank once available.

up:standby_replay
  The MDS is following the journal of another up:active MDS. Should the active MDS fail, having a standby MDS in replay mode is desirable as the MDS is replaying the live journal and will take over more quickly. A downside to having standby replay MDSs is that they are not available to take over for any other MDS that fails, only the MDS they follow.

  The standby daemon continuously reads the metadata journal of a rank that is up. This gives it a warm metadata cache and speeds up failover when the daemon serving that rank fails.

  A running rank can have only one standby replay daemon. If two daemons are both configured for standby replay, one of them wins arbitrarily and the other becomes an ordinary, non-replay standby.

  Once a daemon enters standby replay, it only serves as a standby for the rank it follows. If another rank fails, this daemon will not replace it, even if no other standby is available.

up:boot
  This state is broadcast to the Ceph monitors during startup. This state is never visible, as the monitor immediately assigns the MDS to an available rank or commands the MDS to operate as a standby. The state is documented here for completeness.

up:creating
  The MDS is creating a new rank (perhaps rank 0) by constructing some per-rank metadata (like the journal) and entering the MDS cluster.

up:starting
  The MDS is restarting a stopped rank. It opens associated per-rank metadata and enters the MDS cluster.

up:stopping
  When a rank is stopped, the monitors command an active MDS to enter the up:stopping state. In this state, the MDS accepts no new client connections, migrates all subtrees to other ranks in the file system, flushes its metadata journal, and, if it is the last rank (0), evicts all clients and shuts down.

up:replay
  The MDS is taking over a failed rank. This state represents that the MDS is recovering its journal and other metadata.

  In this journal recovery phase, the MDS reads the journal into memory and replays it there.

up:resolve
  The MDS enters this state from up:replay if the Ceph file system has multiple ranks (including this one), i.e. it's not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All ranks in the file system must be in this state or later for progress to be made, i.e. no rank can be failed/damaged or up:replay.

  This state resolves cases where the authoritative metadata has diverged across multiple MDSs. On the server side this covers subtree distribution and Anchor table updates; on the client side it covers operations such as rename and unlink.

up:reconnect
  An MDS enters this state from up:replay or up:resolve. This state is to solicit reconnections from clients. Any client which had a session with this rank must reconnect during this time, configurable via mds_reconnect_timeout.

  The recovering MDS must re-establish connections with its previous clients and query the file handles they held, so that it can recreate the capabilities and lock state in its cache. The MDS does not synchronously record which files are open, because that would add latency to MDS access and most files are opened read-only.

up:rejoin
  The MDS enters this state from up:reconnect. In this state, the MDS is rejoining the MDS cluster cache. In particular, all inter-MDS locks on metadata are reestablished.
  If there are no known client requests to be replayed, the MDS directly becomes up:active from this state.

  The clients' inodes are loaded into the MDS cache.

up:clientreplay
  The MDS may enter this state from up:rejoin. The MDS is replaying any client requests which were replied to but not yet durable (not journaled). Clients resend these requests during up:reconnect and the requests are replayed once again. The MDS enters up:active after completing replay.

down:failed
  No MDS actually holds this state. Instead, it is applied to the rank in the file system.

down:damaged
  No MDS actually holds this state. Instead, it is applied to the rank in the file system.

down:stopped
  No MDS actually holds this state. Instead, it is applied to the rank in the file system.
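The recovery states in the table form a fixed sequence with two optional steps. The sketch below encodes only what the descriptions above say (up:resolve applies when the file system has multiple ranks, up:clientreplay when client requests must be replayed); the function name and its parameters are hypothetical.

```python
def recovery_path(multiple_ranks, has_client_replay):
    """Return the ordered states an MDS passes through when taking over a failed rank."""
    path = ["up:replay"]
    if multiple_ranks:
        path.append("up:resolve")       # only when the file system has several ranks
    path += ["up:reconnect", "up:rejoin"]
    if has_client_replay:
        path.append("up:clientreplay")  # only if client requests must be replayed
    path.append("up:active")
    return path


print(" -> ".join(recovery_path(multiple_ranks=False, has_client_replay=False)))
print(" -> ".join(recovery_path(multiple_ranks=True, has_client_replay=True)))
```

With a single active MDS and nothing to replay, the path is up:replay, up:reconnect, up:rejoin, up:active, which matches the failover steps listed in 12.1.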

12.3 State Diagram

This state diagram shows the possible state transitions for the MDS/rank. The legend is as follows:

Color

Shape

Lines

[Figure: MDS state diagram]

13. Deep dive
