
Building and Monitoring a Redis Cluster

2020-06-13  老瓦在霸都

Redis Cluster

Redis, the Remote Dictionary Service, has become enormously popular in recent years. It is no longer limited to memcached-style caching: its rich data structures support many other uses, such as small-scale data storage, distributed locks, and simple pub/sub services.

Of course, a single Redis instance cannot satisfy high-availability requirements. Redis offers two main HA schemes:

  1. Redis Sentinel
  2. Redis Cluster

Redis Sentinel does not scale well, so we skip it here and focus on Redis Cluster. It divides the whole keyspace into 16384 slots, and each master node is responsible for a portion of them; with three masters, each master/replica pair is assigned one third of the slots.

Of course, a client needs the slot-to-node mapping before it knows which node to read from or write to. Since nodes may be added, removed, or crash, the client's cached mapping can go stale; when a request lands on the wrong node, Redis returns a MOVED response that redirects the client to the right one.
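
Cluster-aware client libraries follow this redirect automatically. As an illustrative sketch (the `MOVED <slot> <host>:<port>` reply format comes from the Redis Cluster spec; the function name is my own), a client could parse the redirect like this:

```python
def parse_moved(error: str):
    """Parse a Redis Cluster MOVED redirect, e.g. 'MOVED 866 127.0.0.1:9001'.

    Returns (slot, host, port) so the caller can retry the command on the
    node that actually owns the slot and refresh its cached slot map.
    """
    kind, slot, addr = error.split()
    assert kind == "MOVED"
    host, port = addr.rsplit(":", 1)
    return int(slot), host, int(port)
```

For example, `parse_moved("MOVED 866 127.0.0.1:9001")` returns `(866, "127.0.0.1", 9001)`.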

Building the Redis Cluster

Redis 5.0 itself ships with the commands needed to build a cluster. I wrote a simple Fabric script that wraps the tedious command lines, so a Redis Cluster can be built quickly.

For simplicity, I spawn 6 Redis instances on one Ubuntu Server to simulate 6 Redis server nodes:

  1. Generate the configuration files for the 6 redis instances
fab generate_config
  2. Start the 6 redis instances
fab start_redis
  3. Create a Redis cluster containing these 6 instances
fab create_redis_cluster

Preparation

The source code of this fabfile.py script is as follows:

from fabric.api import task, local, settings, lcd
import subprocess

redis_path = '/home/walter/package/redis-5.0.8/src'
redis_config = '''daemonize yes
bind 0.0.0.0
port 9001
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes

'''
@task
def clean_config():
    for port in range(9001,9007):
        local("rm -rf {}".format(port))

@task
def write_config(file_path, port):
    config_content = redis_config.replace('9001', str(port))
    with open(file_path, "w") as fp:
        fp.write(config_content)
@task
def generate_config():
    for port in range(9001,9007):
        local("mkdir -p {}".format(port))
        config_file = '{}/redis.conf'.format(port)
        print("write {}".format(config_file))
        write_config(config_file, port)
@task
def start_redis():
    for folder in range(9001,9007):
        with lcd(str(folder)):
            local("{}/redis-server ./redis.conf".format(redis_path))

@task
def stop_redis():
    cmd = redis_path + "/redis-cli -p {} shutdown nosave"
    for port in range(9001,9007):
        local(cmd.format(port))
@task
def kill_redis():
    cmd = "ps -efw --width 1024 | grep redis-server | grep -v grep | awk '{print $2}'"
    pids = subprocess.check_output(cmd, shell=True)
    print(pids)
    with settings(warn_only=True):
        for pid in pids.decode("utf-8").split('\n'):
            if pid:
                local("kill -9 {}".format(pid))
@task
def check_ports_mac():
    with settings(warn_only=True):
        for port in range(9001,9007):
            local("lsof -nP -iTCP:{} | grep LISTEN".format(port))
@task
def check_redis():
    cmd = "ps -ef|grep redis-server |grep -v grep"
    with settings(warn_only=True):
        local(cmd)

@task
def create_redis_cluster():
    cmd = redis_path + "/redis-cli --cluster create {} {}"
    host_and_ports = ""
    for port in range(9001,9007):
        host_and_ports = host_and_ports + "0.0.0.0:{} ".format(port)
    option = "--cluster-replicas 1"
    local(cmd.format(host_and_ports, option))

@task
def redis_cli(command=''):
    if command:
        local(redis_path + "/redis-cli -c -p 9001 %s" % command)
    else:
        local(redis_path + "/redis-cli -p 9001")

Note: change redis_path in the script above to your own Redis source path.
After downloading the Redis source, running make is enough to build the executables.
Put the fabfile in any directory and run

fab generate_config
fab start_redis
fab create_redis_cluster

and the Redis cluster is up:

$ ps -ef|grep redis|grep -v grep
root     19702     1  0 Jun06 ?        00:12:42 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9001 [cluster]
root     19708     1  0 Jun06 ?        00:12:38 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9002 [cluster]
root     19714     1  0 Jun06 ?        00:12:41 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9003 [cluster]
root     19720     1  0 Jun06 ?        00:14:27 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9004 [cluster]
root     19726     1  0 Jun06 ?        00:14:20 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9005 [cluster]
root     19732     1  0 Jun06 ?        00:14:25 /home/walter/package/redis-5.0.8/src/redis-server 0.0.0.0:9006 [cluster]

Through the Redis command line redis-cli we can inspect the cluster's internal state:

fab redis_cli:"cluster nodes"
[localhost] local: /home/walter/package/redis-5.0.8/src/redis-cli -c -p 9001 cluster nodes
be6213517632bdc1dc21ecdc6db99718ad996227 127.0.0.1:9006@19006 slave 5999be7b420c0b5efadb3adaca8d1fc96b6a2494 0 1592020538000 6 connected
907f0344f54a5df28ef1cf548da32e64cfab8d16 127.0.0.1:9001@19001 myself,master - 0 1592020538000 1 connected 0-5460
50f321d0f3007da039f9350930b176f2b22ec1e0 127.0.0.1:9004@19004 slave 5f4f1fe7432dff359ea5b05f8997432ae726f7bd 0 1592020537000 4 connected
ba918ea80644efca8405b8480c2ccb941ed831b0 127.0.0.1:9005@19005 slave 907f0344f54a5df28ef1cf548da32e64cfab8d16 0 1592020537378 5 connected
5999be7b420c0b5efadb3adaca8d1fc96b6a2494 127.0.0.1:9002@19002 master - 0 1592020537578 2 connected 5461-10922
5f4f1fe7432dff359ea5b05f8997432ae726f7bd 127.0.0.1:9003@19003 master - 0 1592020538380 3 connected 10923-16383

fab redis_cli:"cluster info"
[localhost] local: /home/walter/package/redis-5.0.8/src/redis-cli -c -p 9001 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:1053876
cluster_stats_messages_pong_sent:1054292
cluster_stats_messages_sent:2108168
cluster_stats_messages_ping_received:1054292
cluster_stats_messages_pong_received:1053871
cluster_stats_messages_received:2108163
 fab redis_cli:"cluster slots"
[localhost] local: /home/walter/package/redis-5.0.8/src/redis-cli -c -p 9001 cluster slots
1) 1) (integer) 0
   2) (integer) 5460
   3) 1) "127.0.0.1"
      2) (integer) 9001
      3) "907f0344f54a5df28ef1cf548da32e64cfab8d16"
   4) 1) "127.0.0.1"
      2) (integer) 9005
      3) "ba918ea80644efca8405b8480c2ccb941ed831b0"
2) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "127.0.0.1"
      2) (integer) 9002
      3) "5999be7b420c0b5efadb3adaca8d1fc96b6a2494"
   4) 1) "127.0.0.1"
      2) (integer) 9006
      3) "be6213517632bdc1dc21ecdc6db99718ad996227"
3) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "127.0.0.1"
      2) (integer) 9003
      3) "5f4f1fe7432dff359ea5b05f8997432ae726f7bd"
   4) 1) "127.0.0.1"
      2) (integer) 9004
      3) "50f321d0f3007da039f9350930b176f2b22ec1e0"

If we want to know which slot a key belongs to, we can query it with the command CLUSTER KEYSLOT $key. For example, the commands below show that "hello" hashes to slot 866, so it should land on the instance at 127.0.0.1:9001.

fab redis_cli:"cluster KEYSLOT hello"
[localhost] local: /home/walter/package/redis-5.0.8/src/redis-cli -c -p 9001 cluster KEYSLOT hello
(integer) 866
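
Slot assignment can also be reproduced client-side: Redis Cluster hashes the key with CRC16 (the XMODEM variant) and takes the result modulo 16384. A minimal sketch (ignoring the `{hash tag}` rule for brevity; `key_slot` is my own name):

```python
def crc16(data: bytes) -> int:
    # Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0,
    # no reflection -- the variant Redis Cluster uses for key hashing.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Each key maps to one of the 16384 slots.
    return crc16(key.encode()) % 16384

print(key_slot("hello"))  # 866, matching CLUSTER KEYSLOT hello above
```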

Accessing the Redis Cluster

Redis's RESP protocol is simple and clear, and almost every language has a client library for Redis.
Taking Python as an example, create the following two files.

requirements.txt:

loguru
redis
hiredis
redis-py-cluster

redis-test.py:
import sys
from rediscluster import RedisCluster
import redis
from redis.client import Redis
from loguru import logger

logger.add(sys.stderr,
           format="{time} {message}",
           filter="client",
           level="INFO")
logger.add('logs/redis_client_{time:YYYY-MM-DD}.log',
           format="{time} {level} {message}",
           filter="client",
           level="ERROR")

class RedisClient:
    def __init__(self, connection_string, password=None):
        self.startup_nodes = []
        nodes = connection_string.split(',')
        for node in nodes:
            host_port = node.split(':')
            self.startup_nodes.append({'host': host_port[0], 'port': host_port[1]})

        self.password = password
        logger.info(self.startup_nodes)
        self.redis_pool = None
        self.redis_instance = None
        self.redis_cluster = None

    def connect(self):
        if len(self.startup_nodes) < 2:
            host = self.startup_nodes[0].get('host')
            port = self.startup_nodes[0].get('port')
            if self.password:
                self.redis_pool = redis.ConnectionPool(host=host, port=port, password=self.password, db=0)
            else:
                self.redis_pool = redis.ConnectionPool(host=host, port=port, db=0)

            self.redis_instance = Redis(connection_pool=self.redis_pool, decode_responses=True)
            return self.redis_instance
        #, skip_full_coverage_check=True
        self.redis_cluster = RedisCluster(startup_nodes=self.startup_nodes, password=self.password)
        return self.redis_cluster

def quick_test():
    client = RedisClient("10.224.112.73:9001")
    conn = client.connect()
    key = "hello"
    value = "world"
    conn.set(key, value)
    conn.expire(key, 300)
    ret = conn.get(key)
    logger.info("value={}", ret)

    conn.hsetnx("walter", "age", 30)
    conn.hsetnx("walter", "gender", 'male')
    conn.expire("walter", 300)

    values = conn.hgetall("walter")
    for key, value in values.items():
        logger.info("{}={}".format(key, value))


if __name__ == "__main__":
    quick_test()

Run the commands

pip install -r requirements.txt
python redis-test.py

The output is as follows:

2020-06-13 13:05:52.071 | INFO     | __main__:__init__:27 - [{'host': '10.224.112.73', 'port': '9001'}]
2020-06-13 13:05:52.229 | INFO     | __main__:test_cluster:110 - value=b'world'
2020-06-13 13:05:52.354 | INFO     | __main__:test_cluster:118 - b'age'=b'30'
2020-06-13 13:05:52.354 | INFO     | __main__:test_cluster:118 - b'gender'=b'male'

Monitoring

The redis command-line tool can retrieve all kinds of Redis metrics. Besides the cluster commands mentioned above there are many more; the most commonly used is the info command:

fab redis_cli:info
[localhost] local: /home/walter/package/redis-5.0.8/src/redis-cli -c -p 9001 info 
#-------------------------------
# Server
#-------------------------------
redis_version:5.0.8
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:391103cbae5277b6
redis_mode:cluster
os:Linux 4.4.0-135-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:5.4.0
process_id:19702
run_id:0c3d2c132eefbf68a23dab136cdc79999b6c8ca3
tcp_port:9001
uptime_in_seconds:1444
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:14386466
executable:/home/walter/package/redis-5.0.8/src/redis-server
config_file:/home/walter/mdd/oss/redis/9001/./redis.conf

#-------------------------------
# Clients
#-------------------------------
connected_clients:1
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0

#-------------------------------
# Memory
#-------------------------------
used_memory:2652664
used_memory_human:2.53M
used_memory_rss:5267456
used_memory_rss_human:5.02M
used_memory_peak:2693616
used_memory_peak_human:2.57M
used_memory_peak_perc:98.48%
used_memory_overhead:2578384
used_memory_startup:1463192
used_memory_dataset:74280
used_memory_dataset_perc:6.24%
allocator_allocated:2638792
allocator_active:2822144
allocator_resident:5382144
total_system_memory:8370958336
total_system_memory_human:7.80G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.07

allocator_frag_bytes:183352
allocator_rss_ratio:1.91
allocator_rss_bytes:2560000
rss_overhead_ratio:0.98
rss_overhead_bytes:-114688
mem_fragmentation_ratio:2.03
mem_fragmentation_bytes:2677952
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:16922
mem_clients_normal:49694
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

#----------------------------------
# Persistence
#----------------------------------
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1591443327
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:274432
aof_enabled:1
aof_rewrite_in_progress:0


aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
aof_current_size:0
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

#----------------------------------
# Stats
#----------------------------------
total_connections_received:4
total_commands_processed:1448
instantaneous_ops_per_sec:0
total_net_input_bytes:52741
total_net_output_bytes:14030
instantaneous_input_kbps:0.02
instantaneous_output_kbps:0.01
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:1
expired_keys:0

expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:178
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

#----------------------------------
# Replication
#----------------------------------
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=9005,state=online,offset=2016,lag=1
master_replid:d633853458a2973c12ab79442bc807d35e387f5d
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:2016
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:2016

#----------------------------------
# CPU
#----------------------------------
used_cpu_sys:0.912000
used_cpu_user:0.844000
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

#----------------------------------
# Cluster
#----------------------------------
cluster_enabled:1

But an ordinary user cannot keep logging into servers to run these commands. In practice, the common approach is a server-side agent that periodically samples and collects metrics from both the host machine and Redis itself.

I have used two stacks:

  1. MetricBeat --> ElasticSearch --> Kibana

  2. Telegraf --> InfluxDB --> Grafana

There are other stacks based on Collectd or Prometheus, but we set those aside here.

In practice, Kafka is often inserted as a buffer to absorb throughput mismatches between senders and receivers. Here we start with the Telegraf --> InfluxDB --> Grafana stack.

wget -qO- https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list

sudo apt-get update
sudo apt-get install telegraf
sudo systemctl start telegraf

After it starts, the following default configuration is in effect:

2020-06-13T10:28:56Z I! Starting Telegraf 1.14.4
2020-06-13T10:28:56Z I! Using config file: /etc/telegraf/telegraf.conf
2020-06-13T10:28:56Z I! Loaded inputs: diskio kernel mem processes swap system cpu disk
2020-06-13T10:28:56Z I! Loaded aggregators:
2020-06-13T10:28:56Z I! Loaded processors:
2020-06-13T10:28:56Z I! Loaded outputs: influxdb

The config file lives at /etc/telegraf/telegraf.conf. It has a great many options, but the main ones are just a few:

- input: where the metrics come from
- output: where the metrics go
- aggregators: how metrics are aggregated
- processors: how metrics are processed

Let's quickly install InfluxDB first:

sudo apt-get install influxdb
sudo systemctl unmask influxdb.service
sudo systemctl start influxdb

Use the influx command to look at the measurements that telegraf creates by default:

 influx
Connected to http://localhost:8086 version 1.8.0
InfluxDB shell version: 1.8.0
> show databases
name: databases
name
----
telegraf
_internal
> use telegraf
Using database telegraf
> show measurements
name: measurements
name
----
cpu
disk
diskio
kernel
mem
processes
swap
system

Since what we want to monitor is a Redis cluster, we need to modify telegraf's configuration so that its Redis input plugin collects Redis metrics.

See https://github.com/influxdata/telegraf/blob/release-1.14/plugins/inputs/redis/README.md

The Redis input plugin gathers the results of the INFO Redis command. There are two separate measurements: redis and redis_keyspace, the latter is used for gathering database-related statistics.
Additionally the plugin also calculates the hit/miss ratio (keyspace_hitrate) and the elapsed time since the last RDB save (rdb_last_save_time_elapsed).
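
The hit-rate figure the plugin derives can be reproduced from the INFO fields we saw earlier. A small sketch (`parse_info`, `keyspace_hitrate`, and the sample text are illustrative, not telegraf's actual code):

```python
def parse_info(info_text: str) -> dict:
    # Turn redis-cli INFO output ("key:value" lines) into a dict,
    # skipping section headers ("# Stats") and blank lines.
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value
    return fields

def keyspace_hitrate(fields: dict) -> float:
    # hit rate = hits / (hits + misses); 0.0 when there is no traffic yet.
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0

sample = """# Stats
keyspace_hits:75
keyspace_misses:25
"""
print(keyspace_hitrate(parse_info(sample)))  # 0.75
```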

You can edit /etc/telegraf/telegraf.conf directly, or generate a fresh config file:

telegraf --input-filter redis:cpu:mem:net:swap --output-filter influxdb config > telegraf.conf

The key change is adding these two lines:

[[inputs.redis]]
servers = ["tcp://127.0.0.1:9001"] 
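
The snippet above only samples the node on port 9001. To cover all six nodes of this demo cluster, the servers list can presumably be extended to every instance (each node reports its own INFO):

```toml
[[inputs.redis]]
  servers = ["tcp://127.0.0.1:9001", "tcp://127.0.0.1:9002", "tcp://127.0.0.1:9003",
             "tcp://127.0.0.1:9004", "tcp://127.0.0.1:9005", "tcp://127.0.0.1:9006"]
```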

Restart telegraf

sudo systemctl stop telegraf
sudo systemctl start telegraf

Now running the influx CLI again shows four new measurements:

influx -database='telegraf' -execute='show measurements'
name: measurements
name
----
cpu
disk
diskio
kernel
mem
processes
redis
redis_cmdstat
redis_keyspace
redis_replication
swap
system

Look at the schema of the redis measurement: every field we saw earlier in the redis-cli info output also appears in InfluxDB's redis measurement.

influx -database='telegraf' -execute='SHOW TAG KEYS FROM "redis"'
name: redis
tagKey
------
host
port
replication_role
server
root@node5:~# influx -database='telegraf' -execute='SHOW FIELD KEYS FROM "redis"'
name: redis
fieldKey                        fieldType
--------                        ---------
active_defrag_hits              integer
active_defrag_key_hits          integer
active_defrag_key_misses        integer
active_defrag_misses            integer
active_defrag_running           integer
allocator_active                integer
allocator_allocated             integer
allocator_frag_bytes            integer
allocator_frag_ratio            float
allocator_resident              integer
allocator_rss_bytes             integer
allocator_rss_ratio             float
aof_base_size                   integer
aof_buffer_length               integer
aof_current_rewrite_time_sec    integer
aof_current_size                integer
aof_delayed_fsync               integer
aof_enabled                     integer
aof_last_bgrewrite_status       string
aof_last_cow_size               integer
aof_last_rewrite_time_sec       integer
aof_last_write_status           string
aof_pending_bio_fsync           integer
aof_pending_rewrite             integer
aof_rewrite_buffer_length       integer
aof_rewrite_in_progress         integer
aof_rewrite_scheduled           integer
blocked_clients                 integer
client_recent_max_input_buffer  integer
client_recent_max_output_buffer integer
clients                         integer
cluster_enabled                 integer
connected_slaves                integer
evicted_keys                    integer
expired_keys                    integer
expired_stale_perc              float
expired_time_cap_reached_count  integer
instantaneous_input_kbps        float
instantaneous_ops_per_sec       integer
instantaneous_output_kbps       float
keyspace_hitrate                float
keyspace_hits                   integer
keyspace_misses                 integer
latest_fork_usec                integer
lazyfree_pending_objects        integer
loading                         integer
lru_clock                       integer
master_repl_offset              integer
maxmemory                       integer
maxmemory_policy                string
mem_aof_buffer                  integer
mem_clients_normal              integer
mem_clients_slaves              integer
mem_fragmentation_bytes         integer
mem_fragmentation_ratio         float
mem_not_counted_for_evict       integer
mem_replication_backlog         integer
migrate_cached_sockets          integer
number_of_cached_scripts        integer
pubsub_channels                 integer
pubsub_patterns                 integer
rdb_bgsave_in_progress          integer
rdb_changes_since_last_save     integer
rdb_current_bgsave_time_sec     integer
rdb_last_bgsave_status          string
rdb_last_bgsave_time_sec        integer
rdb_last_cow_size               integer
rdb_last_save_time              integer
rdb_last_save_time_elapsed      integer
redis_version                   string
rejected_connections            integer
repl_backlog_active             integer
repl_backlog_first_byte_offset  integer
repl_backlog_histlen            integer
repl_backlog_size               integer
rss_overhead_bytes              integer
rss_overhead_ratio              float
second_repl_offset              integer
slave_expires_tracked_keys      integer
sync_full                       integer
sync_partial_err                integer
sync_partial_ok                 integer
total_commands_processed        integer
total_connections_received      integer
total_net_input_bytes           integer
total_net_output_bytes          integer
total_system_memory             integer
uptime                          integer
used_cpu_sys                    float
used_cpu_sys_children           float
used_cpu_user                   float
used_cpu_user_children          float
used_memory                     integer
used_memory_dataset             integer
used_memory_dataset_perc        float
used_memory_lua                 integer
used_memory_overhead            integer
used_memory_peak                integer
used_memory_peak_perc           float
used_memory_rss                 integer
used_memory_scripts             integer
used_memory_startup             integer

Next, you can install Grafana with InfluxDB as the data source to visualize and analyze these time series, and add the corresponding alerting rules.
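
A Grafana panel ultimately issues InfluxQL queries against these measurements. As a sketch of what such a query looks like, the snippet below builds a request URL for InfluxDB 1.x's HTTP /query endpoint (the measurement and field names come from the listings above; the helper function is my own):

```python
from urllib.parse import urlencode

def influx_query_url(base: str, db: str, q: str) -> str:
    # InfluxDB 1.x accepts queries via GET /query?db=<database>&q=<InfluxQL>.
    return "{}/query?{}".format(base, urlencode({"db": db, "q": q}))

url = influx_query_url(
    "http://localhost:8086",
    "telegraf",
    'SELECT mean("used_memory") FROM "redis" WHERE time > now() - 1h GROUP BY time(5m)',
)
print(url)
```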
