架构师训练营第五周作业 一致性Hash算法
2020-07-08 本文已影响0人
浩哥有料
用你熟悉的编程语言实现一致性 hash 算法。
编写测试用例测试这个算法,测试 100 万 KV 数据,10 个服务器节点的情况下,计算这些 KV 数据在服务器上分布数量的标准差,以评估算法的存储负载不均衡性。
# coding:utf-8
# Import built-in modules
import hashlib
# Import third-party modules
from numpy import std
def calculate_hash(key):
"""Function to calculate hash code from a string.
Different algorithm will affect much on final result.
Args:
key (str): String to be used to calculate hash code from.
"""
# Calculate SHA-256 code from the string and convert it to int.
return int(hashlib.sha256(key).hexdigest(), 16)
class CacheSystem(object):
"""Class that simulates a distributed cache system."""
def __init__(self, key_count):
"""Initialize the cache system."""
# List to store all the physical cache nodes.
self.nodes = []
# Hash circle.
self.hash_circle = {}
# Default virtual nodes count for each cache node is 200.
self.virtual_nodes_count = 200
# Total count of the data keys
self._key_count = key_count
def add_node(self, node):
"""Add a physical cache node into this system.
Use Name of the node as argument.
Args:
node (str): Name of the physical node.
"""
for i in range(self.virtual_nodes_count):
hashcode = calculate_hash('{}->vn{}'.format(node, i))
self.hash_circle[hashcode] = node
if node not in self.nodes:
self.nodes.append(node)
def _fill_data(self):
"""Fill the hash circle with data keys."""
for i in range(self._key_count):
key = 'key{}'.format(i)
hashcode = calculate_hash(key)
self.hash_circle[hashcode] = key
def setup(self, count):
"""Setup/reset the cache system.
Args:
count (int): Specific virtual nodes count for
each physical node.
"""
self.virtual_nodes_count = count
self.hash_circle = {}
self.nodes = []
self._fill_data()
class TestCacheSystem(object):
"""Class to test the cache system and hash algorithm."""
def __init__(self, key_count=1000000):
"""Initialize the test handler.
Args:
key_count (int): Number of the data keys.
"""
self.cache_system = CacheSystem(key_count)
def _add_nodes(self, nodes_count):
"""Add specified amount of physical nodes into cache system.
Args:
nodes_count (int): Number of the physical nodes.
"""
for i in range(nodes_count):
self.cache_system.add_node('Node-{}'.format(i))
def calculate(self, nodes_count, virtual_nodes_count):
"""Calculate the standard deviation of the keys each node stores.
Args:
nodes_count (int): Number of the physical nodes.
virtual_nodes_count (int): Number of virtual nodes for each
physical node.
"""
self.cache_system.setup(virtual_nodes_count)
self._add_nodes(nodes_count)
hashcodes = sorted(self.cache_system.hash_circle.keys())
nodes = {node:0 for node in self.cache_system.nodes}
i = 0
first_node = None
for hc in hashcodes:
if self.cache_system.hash_circle[hc].startswith('key'):
i += 1
continue
node = self.cache_system.hash_circle[hc].split('->')[0]
if not first_node:
first_node = node
nodes[node] = nodes[node] + i
i = 0
if i != 0:
nodes[first_node] = nodes[first_node] + i
print "物理节点数:", nodes_count
print "每个物理节点分配的虚拟节点数量:", self.cache_system.virtual_nodes_count
print "每个物理节点存储的KV数据数量:\n"
node_keys = []
for node in sorted(nodes.keys()):
print node, nodes[node]
node_keys.append(nodes[node])
print "\n当前配置下1000000万KV数据的分布标准差:", std(node_keys, ddof=1)
test = TestCacheSystem()
test.calculate(10, 500)
运行结果:
物理节点数: 10
每个物理节点分配的虚拟节点数量: 500
每个物理节点存储的KV数据数量:
Node-0 101833
Node-1 99428
Node-2 95766
Node-3 97259
Node-4 100356
Node-5 99638
Node-6 102234
Node-7 104631
Node-8 102526
Node-9 96329
当前配置下1000000万KV数据的分布标准差: 2899.7801449228677
物理节点数: 10
每个物理节点分配的虚拟节点数量: 200
每个物理节点存储的KV数据数量:
Node-0 104533
Node-1 94639
Node-2 103166
Node-3 102795
Node-4 89517
Node-5 99433
Node-6 104803
Node-7 110528
Node-8 95187
Node-9 95399
当前配置下1000000万KV数据的分布标准差: 6285.614369335745
结论
不同的Hash算法、不同的虚拟节点数量、不同的虚拟节点命名方式都会影响到标准差结果。