Small-Scale Architecture in Practice -- MySQL Dual-Master + corosync + NFS
IP plan:
Master01:192.168.40.100
Master02:192.168.40.101
NFS:192.168.40.110
VIP:192.168.40.150
Prerequisites:
1. NFS is already set up
2. Firewall configured (ports 5404, 5405 and 5406 open)
iptables -I INPUT -i eth0 -p udp -m multiport --dports 5404,5405,5406 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
3. SELinux disabled
setenforce 0
sed -i 's/enforcing/disabled/g' /etc/sysconfig/selinux
Reference article: Small-Scale Architecture in Practice -- Setting Up the NFS Environment
4. Be sure to read the CRM configuration article
The CRM configuration itself can wait until later, but the important note at the beginning of that article concerns an error caused by installation order, so read that part first.
#############################################
A quick digression on MySQL distribution formats
MySQL ships in three forms:
source, binary tarball, and RPM package
The source build needs cmake to compile and reportedly gives the tightest fit and the best performance.
The binary tarball needs only unpacking and a little configuration before it is usable -- the Linux equivalent of a portable app, and the most portable of the three.
The RPM package installs via yum/rpm and is the least hassle.
Judging from the install guides currently available online, an installation on shared storage must use the binary tarball!!!
Two MySQL download addresses are provided.
Source 1 carries newer versions than source 2.
#############################################
(Since the two nodes will need to exchange quite a few files later, configuring SSH mutual trust first is a good idea.)
1. Configure SSH mutual trust
On Master01:
echo '192.168.40.100 Master01' >> /etc/hosts
echo '192.168.40.101 Master02' >> /etc/hosts
On Master02:
echo '192.168.40.100 Master01' >> /etc/hosts
echo '192.168.40.101 Master02' >> /etc/hosts
Back on Master01:
ssh-keygen -P '' -f /root/.ssh/id_rsa -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub Master02
On Master02:
ssh-keygen -P '' -f /root/.ssh/id_rsa -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub Master01
SSH mutual trust is now configured.
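A quick sanity check (my own habit, not part of the original steps) that the trust works in both directions:
ssh Master02 hostname #run from Master01; should print Master02 without asking for a password
ssh Master01 hostname #run from Master02; likewise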
2. MySQL deployment
On Master01:
One necessary condition for sharing the NFS storage between the MySQL instances is that the output of id mysql is identical on every host -- both the UID and the GID must match -- so both are fixed at 502 here; make the same settings on the NFS server and on Master02 (not repeated below).
groupadd -g 502 mysql
useradd -u 502 mysql -g mysql -G mysql
mkdir /data
chown -R mysql.mysql /data
mount -t nfs 192.168.40.110:/data /data
(add 192.168.40.110:/data /data nfs rw 0 0 to /etc/fstab)
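One way to append that entry without opening an editor (a small sketch of my own; assumes the line is not already in /etc/fstab):
echo '192.168.40.110:/data /data nfs rw 0 0' >> /etc/fstab
mount | grep /data #confirm the NFS share is actually mounted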
mkdir -p /root/qimo/tools/
cd /root/qimo/tools/
wget http://ftp.ntu.edu.tw/MySQL/Downloads/MySQL-5.6/mysql-5.6.38-linux-glibc2.12-x86_64.tar.gz
tar -xf mysql-5.6.38-linux-glibc2.12-x86_64.tar.gz
mv mysql-5.6.38-linux-glibc2.12-x86_64 mysql
cp -r mysql /usr/local/
cd /usr/local/mysql
Before the next step, run rpm -qa|grep mysql to check: some systems ship with a MySQL 5.1 package, and out of habit I yum remove it.
cp support-files/my-default.cnf /etc/my.cnf
cp support-files/mysql.server /etc/rc.d/init.d/mysqld
vi /etc/my.cnf
Add the following under [mysqld]:
log_bin=/usr/local/mysql/log/binlog
binlog_format= mixed
log-error=/usr/local/mysql/log/mysql.err
basedir =/usr/local/mysql
datadir = /data
Save.
vi /etc/rc.d/init.d/mysqld
datadir=/data
Save.
yum -y install numactl (without this package, initialization may fail with: error while loading shared libraries: libnuma.so.1)
mkdir -p /usr/local/mysql/log
touch /usr/local/mysql/log/mysql.err
chown -R mysql.mysql /usr/local/mysql
Run the initialization:
scripts/mysql_install_db --user=mysql --basedir=/usr/local/mysql --datadir=/data
A word on initialization: my.cnf defines basedir and datadir, so it is best to spell them out here as well; otherwise the configuration and the actual layout can easily diverge and the service will fail to start.
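A simple way to double-check (an extra verification of my own, not in the original steps) what mysqld will actually read from /etc/my.cnf, so the values passed to mysql_install_db really do match:
/usr/local/mysql/bin/my_print_defaults mysqld #should list --basedir=/usr/local/mysql and --datadir=/data among the options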
service mysqld start
Starting MySQL. SUCCESS!
service mysqld stop
umount /data
echo 'export MySQL_HOME=/usr/local/mysql' >> /etc/profile
echo 'export PATH=$PATH:$MySQL_HOME/bin' >> /etc/profile
source /etc/profile
Add it to chkconfig (do not perform this step in this lab):
chkconfig --add /etc/init.d/mysqld
In the pacemaker configuration, the virtual IP (VIP), the shared storage (NFS) and the database (MySQL) are defined as three resources. Their start order matters and is controlled entirely by pacemaker: MySQL is only allowed to start after NFS is up, so MySQL must never be set to start at boot.
chkconfig mysqld off #must not start at boot
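To double-check (an extra verification of my own; assumes the service has been registered with chkconfig, as in the boot-time listing at the end of this article) that no runlevel will start mysqld at boot:
chkconfig --list mysqld #every runlevel should read off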
Master01 is done; copy the reusable pieces over to Master02:
scp /etc/my.cnf 192.168.40.101:/etc/
scp /etc/rc.d/init.d/mysqld 192.168.40.101:/etc/rc.d/init.d/
scp -r /root/qimo/tools/mysql 192.168.40.101:/usr/local/
On Master02:
cd /usr/local/mysql
mkdir -p log
touch log/mysql.err
chown -R mysql.mysql /usr/local/mysql
(make sure /etc/fstab already contains: 192.168.40.110:/data /data nfs rw 0 0)
mount -a
Master02 does not need to be initialized; it simply uses the /data that Master01 just generated.
service mysqld start
Starting MySQL. SUCCESS!
service mysqld stop
umount /data
echo 'export MySQL_HOME=/usr/local/mysql' >> /etc/profile
echo 'export PATH=$PATH:$MySQL_HOME/bin' >> /etc/profile
source /etc/profile
chkconfig mysqld off #must not start at boot
At this point, the MySQL configuration on both nodes is complete.
###################
A few words on the problems hit while installing MySQL (the highlighted parts above are exactly where I went wrong):
①. The parameters defined in my.cnf and in the initialization command must match; otherwise service mysqld start will report errors.
②. MySQL must be installed from the binary tarball. I originally used a source build: Master01 installed fine, but Master02 must not be initialized, and with a source build there is simply no clean way to handle that.
....
3. corosync configuration
Normally the next step would be NTP synchronization, but since corosync needs to generate a rather peculiar authkey, I moved this step forward.
On Master01:
yum install corosync -y
vi /etc/corosync/corosync.conf
compatibility: whitetank
aisexec {
# Run as root - this is necessary to be able to manage resources with Pacemaker
user: root
group: root
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
use_mgmtd: yes
use_logd: yes
}
totem {
version: 2
crypto_cipher: none
crypto_hash: none
interface {
ringnumber: 0
bindnetaddr: 192.168.40.0
mcastaddr: 239.255.1.1
mcastport: 5405
ttl: 1
}
}
nodelist {
node {
ring0_addr: 192.168.40.100
nodeid: 1
}
node {
ring0_addr: 192.168.40.101
nodeid: 2
}
}
logging {
fileline: off
to_stderr: no
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: no
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
A note on the highlighted parts of the configuration above: 192.168.40.0 is the subnet that Master01 and Master02 are on.
239.255.1.1 is the multicast address; copy it as-is, no need to change it.
Now the key part: generating the authkey.
corosync-keygen
You will then see output like this:
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 104).
This means the key needs 1024 bits of random data and so far only 104 bits are available. Interestingly, the randomness does not come from what you type but from the system entropy pool.
The workaround is to open another session on Master01 and do some other work -- for example, carry on with step 4 of this article (NTP sync) and step 5 (yum install pacemaker -y). You will see the bit count keep climbing, and once it reaches 1024 an authkey appears under /etc/corosync.
By the time steps 4 and 5 are finished, the authkey should have been generated ^_^
Here is the progress output from a previous install, updating as entropy accumulated:
Press keys on your keyboard to generate entropy (bits = 224).
Press keys on your keyboard to generate entropy (bits = 272).
Press keys on your keyboard to generate entropy (bits = 320).
....
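If the entropy pool fills too slowly even with other work running, an alternative workaround (my own addition, assuming the rng-tools package is available in your yum repositories) is to feed the pool from /dev/urandom in a second session:
yum install -y rng-tools
rngd -r /dev/urandom #run while corosync-keygen is waiting; kill rngd once the authkey has been generated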
(Jump down to steps 4 and 5, work through them, then come back here.)
On Master01:
ll /etc/corosync/
...
-r--------. 1 root root 128 Jan 30 22:20 authkey
The default permissions are 400.
Install corosync on node 2 and copy the two configuration files from node 1 over to node 2:
ssh Master02 yum install -y corosync
scp -r /etc/corosync/authkey 192.168.40.101:/etc/corosync/
scp -r /etc/corosync/corosync.conf 192.168.40.101:/etc/corosync/
4. Configure NTP synchronization (all in the service of generating that authkey~)
On Master01:
yum install -y ntp ntpdate
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime #set the time zone to UTC+8
cp: overwrite `/etc/localtime'? y
[root@Master01 ~]# service ntpdate start
ntpdate: Synchronizing with time server: [ OK ]
[root@Master01 ~]# date -R
Tue, 30 Jan 2018 14:45:34 +0800 #+0800 means UTC+8
On Master02:
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
cp: overwrite `/etc/localtime'? y
[root@Master02 ~]# service ntpdate start
ntpdate: Synchronizing with time server: [ OK ]
[root@Master02 Asia]# date -R
Tue, 30 Jan 2018 14:43:41 +0800
(PS: to sync manually: [root@Master02 corosync]# ntpdate pool.ntp.org)
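If the clocks keep drifting, a hedged option (not in the original steps; it assumes the CentOS 6 ntpdate init script reads its server list from /etc/ntp/step-tickers) is to pin the server and resync from cron:
echo 'pool.ntp.org' > /etc/ntp/step-tickers
(crontab -l 2>/dev/null; echo '*/30 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1') | crontab -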
5. Installing pacemaker, pssh and crmsh
yum install -y pacemaker
Copy the pssh and crmsh packages to Master01.
Download address:
crmsh depends on pssh and python, so pssh must be installed first.
The install may complain about missing python packages; all of them can be resolved with yum.
yum install -y pssh-2.3.1-5.el6.noarch.rpm
yum install -y crmsh-1.2.6-4.el6.x86_64.rpm
Note that besides the two packages above, the linked resources also include a crmsh 3.0.0 package. It requires python 2.7, which is not the default python on CentOS 6.x, so that source package is of no use for this installation.
Once these installs are done, the authkey from earlier should have been generated -- go back to step 3 and finish the remaining work!
######################
Side notes:
1. Installing from the source RPM
After unpacking the earlier resource bundle, two files relate to crmsh:
crmsh-1.2.6-4.el6.x86_64.rpm (CentOS 6)
crmsh-3.0.0-6.1.src.rpm (source RPM, usable on any system)
yum install -y rpm-build
rpmbuild --rebuild --clean crmsh-3.0.0-6.1.src.rpm
(you may be prompted to install certain python dependencies along the way)
...
After the rebuild finishes, an rpmbuild directory appears in the current directory.
cd rpmbuild/RPMS/noarch;ll
...
-rw-r--r-- 1 root root 795920 Feb 1 09:11 crmsh-3.0.0-6.1.noarch.rpm
-rw-r--r-- 1 root root 92928 Feb 1 09:11 crmsh-scripts-3.0.0-6.1.noarch.rpm
-rw-r--r-- 1 root root 220784 Feb 1 09:11 crmsh-test-3.0.0-6.1.noarch.rpm
The 3.0 packages need python 2.7 or newer, yet the existing yum does not work with python 2.7..... which is rather awkward.
2. A strange missing-python-package error during installation
Error: Package: pssh-1.4.3-1.noarch (/pssh-1.4.3-1.noarch)
Requires: python(abi) = 2.5
The fix is: yum install python25 -y
Do not naively run yum install python(abi) -y. There are many python versions, and their packages do not appear to be backward or forward compatible, so you have to install the specific version that is required.
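If you are not sure which package actually provides a capability such as python(abi), yum can look it up (an extra lookup of my own, not part of the original steps):
yum provides 'python(abi)' #lists every package that provides the capability, together with its version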
######################
6. Start corosync
pacemaker has already been declared in the corosync.conf above, so there is no need to start it separately here.
On Master01:
service corosync start
cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>' #explained further below; running it on node 1 is enough, node 2 does not need it
ssh Master02 service corosync start #start node 2
#######################################
Errors encountered during setup
1. Starting corosync without having generated the authkey
Starting Corosync Cluster Engine (corosync): [FAILED]
cat /var/log/cluster/corosync.log
...
Jan 30 15:35:02 corosync [MAIN ] Could not open /etc/corosync/authkey: No such file or directory
2. If the pacemaker-related stanzas are missing from corosync.conf, service pacemaker start still works on its own, but if corosync is started first, pacemaker then fails to start. I did not dig into the exact cause -- just configure it as in my corosync.conf above.
3. Commands to check whether the services started correctly
Check that the corosync engine started:
grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Check that the initial membership notifications were sent out:
grep TOTEM /var/log/cluster/corosync.log
Check that pacemaker started normally:
grep pcmk_startup /var/log/cluster/corosync.log
Check for errors:
grep ERROR: /var/log/cluster/corosync.log
Jan 30 17:26:37 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Jan 30 17:26:37 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Jan 30 17:26:38 corosync [pcmk ] ERROR: pcmk_wait_dispatch: Child process mgmtd exited (pid=6015, rc=100)
These errors can be ignored.
#######################################
Later on, when running crm commands, you will also hit a schema-version check error:
ERROR: CIB not supported: validator 'pacemaker-2.5', release '3.0.10'
ERROR: You may try the upgrade command
So fix this error before starting the crm configuration:
cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
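To confirm the downgrade took effect (an extra check of my own):
cibadmin --query | grep validate-with #should now show validate-with="pacemaker-1.2"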
7. crm configuration
This is the really tricky part of the heartbeat configuration; just follow the pattern below.
[root@Master01 RPMS]# crm
crm(live)# configure
crm(live)configure# property stonith-enabled=false
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# rsc_defaults resource-stickiness=100
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show
node Master01
node Master02
property $id="cib-bootstrap-options" \
have-watchdog="false" \
dc-version="1.1.15-5.el6-e174ec8" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="1" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
Resource configuration
[root@Master01 ~]# crm
crm(live)# configure
Define the NFS resource
crm(live)configure# primitive mynfs ocf:heartbeat:Filesystem params device="192.168.40.110:/data" directory="/data" fstype="nfs" op start timeout=60s op stop timeout=60s
crm(live)configure# verify
crm(live)configure# commit
Define the VIP
crm(live)configure# primitive myvip ocf:heartbeat:IPaddr params ip="192.168.40.150" op monitor interval=20 timeout=20 on-fail=restart
crm(live)configure# verify
crm(live)configure# commit
Define MySQL
crm(live)configure# primitive myserver lsb:mysqld op monitor interval=20 timeout=20 on-fail=restart
crm(live)configure# verify
crm(live)configure# commit
Add a colocation constraint
crm(live)configure# colocation myserver_with_mynfs inf: myserver mynfs
Set the start order
crm(live)configure# order mynfs_before_myserver mandatory: mynfs:start myserver:start
Another colocation constraint
crm(live)configure# colocation myvip_with_myserver inf: myvip myserver
And its start order
crm(live)configure# order myvip_before_myserver mandatory: myvip myserver
crm(live)configure# verify
crm(live)configure# commit
If all goes well, finishing this sequence means the job is done. Check with crm_mon and make sure all three resources are started on the same node.
If some resource has started on node 2, run the following on node 2:
crm node standby
crm node online
After that, all the resources should converge onto one node.
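An alternative to standby/online (a sketch based on the crmsh 1.2.x resource commands; it works by adding a temporary location constraint) is to move one resource directly and let the inf colocation constraints pull the others along:
crm resource migrate myserver Master01 #move myserver, and with it whatever is colocated, to Master01
crm resource unmigrate myserver #remove the temporary location constraint afterwards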
The ideal state looks like this:
[root@Master01 ~]# crm status
Stack: classic openais (with plugin)
Current DC: Master01 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Fri Feb 2 01:25:05 2018 Last change: Fri Feb 2 01:23:35 2018 by root via cibadmin on Master01, 2 expected votes
2 nodes and 3 resources configured
Online: [ Master01 Master02 ]
Active resources:
myvip (ocf::heartbeat:IPaddr): Started Master01
mynfs (ocf::heartbeat:Filesystem): Started Master01
myserver (lsb:mysqld): Started Master01
At this point you will notice that MySQL has started up on its own. If you now run crm node standby,
the session on node 1 gets cut off; open node 2 and run crm_mon, and you will see all three resources have floated over to Master02, with MySQL running there automatically as well.
This is exactly why MySQL must never be set to start at boot: from now on, which node MySQL runs on is left entirely to corosync to arrange!
###################################
Problems encountered and some observations
1. With the firewall enabled but not configured accordingly, the error below is easy to hit (at this point only one resource had been configured):
[root@Master01 ~]# crm status
Stack: classic openais (with plugin)
Current DC: Master01 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu Feb 1 11:25:52 2018 Last change: Thu Feb 1 11:13:23 2018 by root via cibadmin on Master01, 1 expected votes
2 nodes and 1 resource configured
Online: [ Master01 ]
OFFLINE: [ Master02 ]
Active resources:
myip (ocf::heartbeat:IPaddr): Started Master01
2. Checking the status: the IP is on Master01 and NFS is on Master02, and I want them consolidated onto one node.
[root@Master01 ~]# crm status
Stack: classic openais (with plugin)
Current DC: Master02 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu Feb 1 14:50:50 2018 Last change: Thu Feb 1 14:50:21 2018 by root via cibadmin on Master01, 2 expected votes
2 nodes and 2 resources configured
Online: [ Master01 Master02 ]
Active resources:
myip (ocf::heartbeat:IPaddr): Started Master01
mynfs (ocf::heartbeat:Filesystem): Started Master02
Switch over to Master02:
[root@Master02 ~]# crm node standby
Back on Master01:
[root@Master01 ~]# crm status
Stack: classic openais (with plugin)
Current DC: Master02 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Thu Feb 1 14:51:34 2018 Last change: Thu Feb 1 14:51:29 2018 by root via crm_attribute on Master02, 2 expected votes
2 nodes and 2 resources configured
Node Master02: standby
Online: [ Master01 ]
Active resources:
myip (ocf::heartbeat:IPaddr): Started Master01
mynfs (ocf::heartbeat:Filesystem): Started Master01
3. When a service dies
[root@Master01 ~]# crm status
Stack: classic openais (with plugin)
Current DC: Master01 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Fri Feb 2 01:25:05 2018 Last change: Fri Feb 2 01:23:35 2018 by root via cibadmin on Master01, 2 expected votes
2 nodes and 3 resources configured
Online: [ Master01 Master02 ]
Active resources:
myvip (ocf::heartbeat:IPaddr): Started Master01
mynfs (ocf::heartbeat:Filesystem): Started Master01
myserver (lsb:mysqld): Started Master01
Failed Actions:
* myserver_start_0 on Master02 'unknown error' (1): call=28, status=complete, exitreason='none',
last-rc-change='Fri Feb 2 01:23:35 2018', queued=0ms, exec=3182ms
This error appeared because corosync on Master02 had died; restarting it brought things back to normal. Done!
4. About the VIP
Back when I built RAC, there were private IPs and VIPs in addition to the public IP, which left me wondering during this build whether I needed an extra NIC.
The VIP is a virtual (floating) IP; it needs no dedicated NIC and is conjured up entirely in software. Do not confuse it with a private IP.
Also, some documents say to check it with ifconfig, but in my tests, once the VIP resource was up, ifconfig did not show the VIP at all.
The way to view the VIP is: ip addr show
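A quick way (my own shortcut) to see which node currently holds the VIP:
ip addr show | grep 192.168.40.150
ssh Master02 'ip addr show | grep 192.168.40.150'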
5. About the MySQL two-node setup
I used to assume a MySQL two-node setup ran both nodes concurrently, like Oracle RAC. It turns out corosync only does the arbitration: in fact only one node is ever active, and when it goes standby the other one takes over and carries on.
Watching from the NFS server, roughly 3 seconds after one node goes standby, the other node takes over the shared storage and writes its own pid file.
###################################
Follow-up:
6. Boot-time configuration
iptables, corosync and ntpdate are enabled at boot; mysqld is disabled.
Because pacemaker is launched from within my corosync configuration, I leave it disabled here as well.
[root@Master02 ~]# chkconfig --list|egrep "iptables|mysqld|corosync|pacemaker|ntpdate"
corosync 0:off 1:off 2:on 3:on 4:on 5:on 6:off
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
mysqld 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ntpdate 0:off 1:off 2:on 3:on 4:on 5:on 6:off
pacemaker 0:off 1:off 2:off 3:off 4:off 5:off 6:off
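For reference, the chkconfig commands that produce the layout above (mysqld and pacemaker deliberately left off):
chkconfig iptables on
chkconfig corosync on
chkconfig ntpdate on
chkconfig mysqld off
chkconfig pacemaker off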