
Deploying a containerized Ceph cluster with kolla

2019-04-23  wangwDavid

Introduction

kolla is OpenStack's containerized deployment project. Its goal is a production-grade deployment of a containerized OpenStack platform that works out of the box. kolla uses ansible to orchestrate the deployment of the containers.

Ceph, the open-source distributed storage system, is closely tied to OpenStack. kolla implements a simple Ceph cluster deployment as well as the integration between Ceph and OpenStack components such as cinder, manila, nova and glance.

The kolla project consists of two repositories: kolla, which builds the container images, and kolla-ansible, which deploys them.

Here is a brief rundown of the pros and cons of deploying Ceph with kolla.

Pros and cons of deploying Ceph with kolla

Pros:

1. A kolla-ansible deployment diffs against the current state of the environment: only what has changed, such as an image tag or a configuration value, is applied, while everything that has not changed is left alone. For Ceph, this makes both deploying and upgrading a cluster very convenient.

2. kolla's OSD initialization is clever: it is driven by the partition names on each disk, and every ceph-osd container is bound to its own disk, so adding a new OSD or repairing a broken one is easy.

3. The general strengths of ansible itself (easy custom development).

Cons:

1. Any change is applied to every service in the Ceph cluster; upgrading the Ceph image, for example, restarts all OSDs. (In principle the scope can be narrowed to one node's components with --limit or ANSIBLE_SERIAL=1, but because of problems in the deployment flow these features do not work well here; a node-by-node sketch appears in the upgrade section below.)

2. Support for newer Ceph features is weak. kolla ceph can currently deploy luminous, but newer features such as device classes, and creating pools on top of a device class, are not supported.

3. For BlueStore deployments, bcache disks are not supported (their partition names are not recognized) and neither are multipath disks (the initialization flow cannot handle them).

In short, kolla's current Ceph support is sufficient for a test cluster, but a production Ceph cluster still needs improvements in many areas. Perhaps the community only ever intended it for test clusters, which would explain why the cinder/manila deployments gained support for attaching an external Ceph cluster.

I have worked with kolla for over two years, mainly deploying OpenStack components and Ceph in production, and I have done some work on deploying production-grade Ceph clusters with kolla; the articles that follow describe those improvements. I have also submitted some of them upstream, but probably because the changes are large, and the community has few people who know Ceph well, the bigger commits have been left sitting.

Kolla Ceph deployment

Let's first deploy a cluster with the community's stable/rocky release. kolla-ansible can deploy Ceph together with the other OpenStack components, but that approach makes the Ceph cluster harder to maintain, so the recommended way is to deploy Ceph on its own and connect the OpenStack components to it as an external Ceph cluster.

Node initialization

kolla needs a deployment node, preferably separate from the Ceph nodes.

Node requirements: at least one network interface per node.

PS: start with two or three mons; more can be added later. Deploying more than three at once hangs. I submitted a commit upstream to fix this.

commit url : https://review.openstack.org/652606

role    hostname     ip            disks        services
deploy  deploy-node  192.168.0.10  -            deploy, docker_registry
ceph    ceph-node1   192.168.0.11  sdb,sdc,sdd  mon, mgr, osd, mds, rgw
ceph    ceph-node2   192.168.0.12  sdb,sdc,sdd  osd
ceph    ceph-node3   192.168.0.13  sdb,sdc,sdd  osd
# Install the yum repos and required packages
yum install epel-release -y
yum install python-pip -y
yum install -y python-devel libffi-devel gcc openssl-devel git

# Install the docker python library (used by kolla-ansible's docker modules)
pip install docker

# Install docker
sudo yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-18.03.0.ce-1.el7.centos.x86_64.rpm

# Docker configuration
Configure docker's storage driver and the registry it should trust (if you use a public registry, the registry entry is unnecessary).
For example:
tee /etc/docker/daemon.json <<-'EOF'
{
  "storage-driver": "devicemapper",
  "insecure-registries":["192.168.0.10:4000"]
}
EOF

# Create the deployment user kollasu
useradd -d /home/kollasu -m kollasu
passwd kollasu # set a password
echo "kollasu ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/kollasu
chmod 0440 /etc/sudoers.d/kollasu

# nscd (kolla defines its own users such as manila and ceph to keep them distinct from the host's users; configure nscd as follows)
yum install -y nscd
sed -i 's/\(^[[:space:]]*enable-cache[[:space:]]*passwd[[:space:]]*\)yes/\1no/g' /etc/nscd.conf
sed -i 's/\(^[[:space:]]*enable-cache[[:space:]]*group[[:space:]]*\)yes/\1no/g' /etc/nscd.conf
systemctl restart nscd
# Install ansible (the rocky release requires ansible >= 2.4)
pip install -U ansible==2.4.1

# Set up passwordless SSH from the deployment node to every other node; the SSH user must have root privileges on those nodes (e.g. the kollasu user)
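# A minimal sketch (assuming the kollasu user exists on every node):
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for node in 192.168.0.11 192.168.0.12 192.168.0.13; do ssh-copy-id kollasu@${node}; done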

# Set up your own registry
docker run -d -p 4000:5000 --restart=always --name registry registry:2
systemctl daemon-reload

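# Zap any existing partition tables on the OSD disks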
sudo sgdisk --zap-all -- /dev/sdb
sudo sgdisk --zap-all -- /dev/sdc
sudo sgdisk --zap-all -- /dev/sdd

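# Label each disk with one KOLLA_CEPH_OSD_BOOTSTRAP_BS_* partition; kolla will initialize these as bluestore OSDs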
sudo /sbin/parted  /dev/sdb  -s  -- mklabel  gpt  mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1  1 -1
sudo /sbin/parted  /dev/sdc  -s  -- mklabel  gpt  mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO2  1 -1
sudo /sbin/parted  /dev/sdd  -s  -- mklabel  gpt  mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO3  1 -1

PS: during initialization, kolla uses the partition name to decide how to turn a disk into an OSD. The filestore prefix is KOLLA_CEPH_OSD_BOOTSTRAP and the bluestore prefix is KOLLA_CEPH_OSD_BOOTSTRAP_BS; the name after the prefix distinguishes the OSDs, so KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 and KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO2 correspond to two different OSDs. The names themselves are arbitrary; kolla later renames the partitions according to the OSD id it obtains.

For bluestore there are four kinds of partitions: the osd data partition, the block partition, the wal partition and the db partition:

A small partition is formatted with XFS and contains basic metadata for the OSD: its identifier, which cluster it belongs to, and its private keyring.
The rest of the device is normally one large partition, managed directly by BlueStore, that contains all of the actual data. This primary device is normally identified by a block symlink in the data directory.

It is also possible to deploy BlueStore across two additional devices:

A WAL device can be used for BlueStore’s internal journal or write-ahead log. It is identified by the block.wal symlink in the data directory. It is only useful to use a WAL device if the device is faster than the primary device (e.g., when it is on an SSD and the primary device is an HDD).
A DB device can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. If the DB device fills up, metadata will spill back onto the primary device (where it would have been otherwise). Again, it is only helpful to provision a DB device if it is faster than the primary device.

kolla decides which OSD partition a disk partition becomes from the suffix of its name: for filestore, "J" marks the journal; for bluestore, "B" marks block, "W" marks wal, and "D" marks db. For example:

sudo /sbin/parted  /dev/sdb  -s  -- mklabel  gpt  mkpart  KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 200
sudo /sbin/parted /dev/sdb -s mkpart  KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_W  201  2249
sudo /sbin/parted /dev/sdb -s mkpart  KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_D  2250  4298
sudo /sbin/parted /dev/sdb -s mkpart  KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_B  4299  100%

You can choose each partition's size and disk yourself; kolla matches them automatically by partition name. If no dedicated block partition is specified, the disk holding KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 is automatically formatted into two partitions, an osd data partition and a block partition, and every other partition on that disk is wiped.

Building the Ceph images

We build the Ceph images on their own, using a build config file.

[root@deploy-node rocky-ceph]# tree -L 1
.
├── build-test
├── ceph-test
├── kolla
└── kolla-ansible

First, a look at the deployment layout. kolla and kolla-ansible are checked out from source, both on the stable/rocky branch; deploying from source makes custom development easy.

build-test holds the image-build configuration:

.
├── build-test
│   └── kolla-build.conf

Its contents:

[DEFAULT]
base = centos
profile = image_ceph

namespace = kolla
install_type = source

retries = 1
push_threads = 4
maintainer = kolla Project

[profiles]
image_ceph = cron,kolla-toolbox,fluentd,ceph

cron, kolla-toolbox and fluentd are shared images.

The build command:

python kolla/kolla/cmd/build.py --config-file build-test/kolla-build.conf --push --registry 192.168.0.10:4000 --tag cephRocky-7.0.2.0001 --type source

Deploying Ceph

kolla-ansible can restrict a deployment to specific projects with --tags; in addition, since we only want Ceph, we disable every other project in globals.yml and keep just the Ceph-related ones.

├── ceph-test
│   ├── custom
│   ├── globals.yml
│   ├── multinode-inventory
│   └── passwords.yml
globals.yml:

---
# The directory to merge custom config files the kolla's config files
node_custom_config: "{{ CONFIG_DIR }}/custom"

# The project to generate configuration files for
project: ""

# The directory to store the config files on the destination node
node_config_directory: "/home/kollasu/kolla/{{ project }}"

# The group which own node_config_directory, you can use a non-root
# user to deploy kolla
config_owner_user: "kollasu"
config_owner_group: "kollasu"

###################
# Kolla options
###################
# Valid options are ['centos', 'debian', 'oraclelinux', 'rhel', 'ubuntu']
kolla_base_distro: "centos"

# Valid options are [ binary, source ]
kolla_install_type: "source"

kolla_internal_vip_address: ""

####################
# Docker options
####################
### Example: Private repository with authentication

docker_registry: "192.168.0.10:4000"
docker_namespace: "kolla"

docker_registry_username: ""

####################
# OpenStack options
####################
openstack_release: "auto"
openstack_logging_debug: "False"

enable_glance: "no"
enable_haproxy: "no"
enable_keystone: "no"
enable_mariadb: "no"
enable_memcached: "no"
enable_neutron: "no"
enable_nova: "no"
enable_rabbitmq: "no"

enable_ceph: "yes"
enable_ceph_mds: "yes"
enable_ceph_rgw: "yes"
enable_ceph_nfs: "no"
enable_ceph_dashboard: "{{ enable_ceph | bool }}"
enable_chrony: "no"
enable_cinder: "no"
enable_fluentd: "yes"
enable_heat: "no"
enable_horizon: "no"
enable_manila: "no"

###################
# Ceph options
###################
# Valid options are [ erasure, replicated ]
ceph_pool_type: "replicated"

# Integrate Ceph Rados Object Gateway with OpenStack keystone
enable_ceph_rgw_keystone: "no"

ceph_erasure_profile: "k=2 m=1 ruleset-failure-domain=osd"

ceph_pool_pg_num: 32
ceph_pool_pgp_num: 32

osd_initial_weight: "auto"

# Set the store type for ceph OSD
# Valid options are [ filestore, bluestore]
ceph_osd_store_type: "bluestore"
multinode-inventory:

[storage-mon]
ceph-node1 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0
ceph-node2 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0

[storage-osd]
ceph-node1 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0
ceph-node2 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0
ceph-node3 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0

[storage-rgw]
ceph-node1 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0

[storage-mgr]
ceph-node1 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0

[storage-mds]
ceph-node1 ansible_user=kollasu network_interface=eth0 api_interface=eth0 storage_interface=eth0 cluster_interface=eth0

[storage-nfs]

[ceph-mon:children]
storage-mon

[ceph-rgw:children]
storage-rgw

[ceph-osd:children]
storage-osd

[ceph-mgr:children]
storage-mgr

[ceph-mds:children]
storage-mds

[ceph-nfs:children]
storage-nfs

Defining the groups this way means you can freely move cluster services between nodes, and setting the user and interfaces per node handles messier situations, such as a cluster whose nodes have differently named NICs; if you relied on one default interface name for all of them, the installation would fail.

passwords.yml (for a Ceph-only deployment, these entries suffice):

ceph_cluster_fsid: 804effd3-1013-4e57-93ca-983a13cfa133
docker_registry_password:
keystone_admin_password:
custom/ceph.conf (overrides merged into the generated ceph.conf):

[global]
rbd_default_features = 1
public_network = 192.168.0.0/24
cluster_network = 192.168.0.0/24
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_crush_update_on_start = false
osd_class_update_on_start = false
mon_max_pg_per_osd = 500
mon_allow_pool_delete = true

...
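# Make the kolla-ansible wrapper script executable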
chmod +x kolla-ansible/tools/kolla-ansible

# Pull the images onto the target nodes
kolla-ansible/tools/kolla-ansible pull --configdir ceph-test -i ceph-test/multinode-inventory --passwords ceph-test/passwords.yml --tags ceph -e openstack_release=cephRocky-7.0.2.0001

# Deploy the Ceph cluster
kolla-ansible/tools/kolla-ansible deploy --configdir ceph-test -i ceph-test/multinode-inventory --passwords ceph-test/passwords.yml --tags ceph -e openstack_release=cephRocky-7.0.2.0001

Upgrading the Ceph cluster

kolla-ansible makes Ceph upgrades convenient in one respect: it upgrades all components in order, mon-->mgr-->osd-->rgw-->mds-->nfs, and swaps out every container's image automatically.

The downside is that an upgrade covers all services; you cannot single one out. The OSD upgrade also lacks checks on the cluster's state: kolla-ansible upgrades OSDs on multiple nodes at the same time (up to ANSIBLE_FORKS nodes per batch), with the OSDs on each node upgraded sequentially. In theory, as long as the image is sound, the OSD restarts during an upgrade are quick and do not affect the cluster's state. But better safe than sorry: the more nodes whose OSDs restart at once, the higher the odds of trouble. Ideally we could choose which services to upgrade, upgrade OSDs node by node, and check Ceph's health as we go.

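Build images with a new tag, then run the upgrade: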
python kolla/kolla/cmd/build.py --config-file build-test/kolla-build.conf --push --registry 192.168.0.10:4000 --tag cephRocky-7.0.2.0002 --type source
kolla-ansible/tools/kolla-ansible upgrade --configdir ceph-test -i ceph-test/multinode-inventory --passwords ceph-test/passwords.yml --tags ceph -e openstack_release=cephRocky-7.0.2.0002
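
A hedged sketch of the ideal node-by-node flow described above (assumptions: --limit and ANSIBLE_SERIAL behave for the ceph roles, which as noted they currently may not, and ceph_mon is the name of kolla's mon container):

# Upgrade one node at a time and wait for HEALTH_OK before moving on
for node in ceph-node1 ceph-node2 ceph-node3; do
  ANSIBLE_SERIAL=1 kolla-ansible/tools/kolla-ansible upgrade --configdir ceph-test \
      -i ceph-test/multinode-inventory --passwords ceph-test/passwords.yml \
      --tags ceph --limit ${node} -e openstack_release=cephRocky-7.0.2.0002
  until ssh kollasu@ceph-node1 "sudo docker exec ceph_mon ceph health" | grep -q HEALTH_OK; do sleep 10; done
done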

OSD repair

After an OSD's partitions are initialized, the osd data partition is mounted at /var/lib/ceph/osd/${uuid}, keyed by the partition's uuid; when the container starts, that directory is handed to it as a docker volume: /var/lib/ceph/osd/${uuid}:/var/lib/ceph/osd/ceph-${osd_id}.

Mounting by uuid suits cache-backed disks and multipath disks well: their device names change easily, but the uuid never does.
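
To check the mapping on a node, you can inspect the container's binds (illustrative; osd 7 and the uuid match the repair example below):

docker inspect -f '{{ .HostConfig.Binds }}' ceph_osd_7
# [/var/lib/ceph/osd/0ffdd2fc-41cd-429c-84ee-8150467c06ed:/var/lib/ceph/osd/ceph-7 ...]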

An osd container started by kolla-ansible sometimes fails because of its disk. The repair is simple: reformat the disk and redeploy. Naturally, the osd must be removed from the cluster first, and after the repair its weight should be set to 0 and raised gradually.

An example:

(ceph-mon)[root@ceph-node1 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       0.44989 root default                                   
-4       0.14996     host 192.168.0.11                         
11       0.04999         osd.11             up  1.00000 1.00000
12       0.04999         osd.12             up  1.00000 1.00000
13       0.04999         osd.13             up  1.00000 1.00000
-2       0.14996     host 192.168.0.12                        
 0       0.04999         osd.0              up  1.00000 1.00000
 3       0.04999         osd.3              up  1.00000 1.00000
 6       0.04999         osd.6              up  1.00000 1.00000
-3       0.14996     host 192.168.0.13                          
 1       0.04999         osd.1              up  1.00000 1.00000
 4       0.04999         osd.4              up  1.00000 1.00000
 7       0.04999         osd.7            down  1.00000 1.00000
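
On ceph-node3, osd.7's disk (/dev/sdb) looks like this: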
Disk /dev/sdb: 53.7 GB, 53687091200 bytes, 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: FA31FD88-190E-4CA4-AF0D-E31AB1FCADDC


#         Start          End    Size  Type            Name
 1         2048       206847    100M  unknown         KOLLA_CEPH_DATA_BS_7
 2       206848    104857566   49.9G  unknown         KOLLA_CEPH_DATA_BS_7_B

# As shown by df -h

/dev/sdb1                 97M  5.3M   92M   6% /var/lib/ceph/osd/0ffdd2fc-41cd-429c-84ee-8150467c06ed

# Unmount the data directory
umount /var/lib/ceph/osd/0ffdd2fc-41cd-429c-84ee-8150467c06ed

# Clean up /etc/fstab:
# remove the mount entry for sdb1
UUID=0ffdd2fc-41cd-429c-84ee-8150467c06ed /var/lib/ceph/osd/0ffdd2fc-41cd-429c-84ee-8150467c06ed xfs defaults,noatime 0 0
# Remove the old osd from the cluster

osd_number=7
ceph osd out ${osd_number}
ceph osd crush remove osd.${osd_number}
ceph auth del osd.${osd_number}
ceph osd rm ${osd_number}

docker stop ceph_osd_7
docker rm ceph_osd_7
systemctl daemon-reload

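# Wipe the disk, recreate the bootstrap partition, and redeploy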
sudo sgdisk --zap-all -- /dev/sdb

sudo /sbin/parted  /dev/sdb  -s  -- mklabel  gpt  mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1  1 -1
kolla-ansible/tools/kolla-ansible deploy --configdir ceph-test -i ceph-test/multinode-inventory --passwords ceph-test/passwords.yml --tags ceph -e openstack_release=cephRocky-7.0.2.0001

As long as the image tag has not changed, this deploy run only adds the new osd and does not touch the existing ones.
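
Because our custom ceph.conf sets osd_crush_update_on_start = false, the rebuilt osd will not place itself into the crush map on start. A sketch of the "weight 0, then raise gradually" approach mentioned above (the host bucket matches this cluster's tree; the step values are illustrative):

ceph osd crush add osd.7 0 host=192.168.0.13
ceph osd crush reweight osd.7 0.02
# wait for backfill to settle, then keep raising toward the target weight
ceph osd crush reweight osd.7 0.04999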

Summary

This article covered deploying Ceph as the community ships it; the next one will look at the problems in deploying and maintaining such a cluster and how to improve on them.
