架构师训练营第五周作业 一致性Hash算法

2020-07-08  本文已影响0人  浩哥有料

用你熟悉的编程语言实现一致性 hash 算法。
编写测试用例测试这个算法,测试 100 万 KV 数据,10 个服务器节点的情况下,计算这些 KV 数据在服务器上分布数量的标准差,以评估算法的存储负载不均衡性。

# coding:utf-8

# Import built-in modules
import hashlib

# Import third-party modules
from numpy import std


def calculate_hash(key):
    """Function to calculate hash code from a string.

    Different algorithm will affect much on final result.

    Args:
        key (str): String to be used to calculate hash code from.

    """
    # Calculate SHA-256 code from the string and convert it to int.
    return int(hashlib.sha256(key).hexdigest(), 16)


class CacheSystem(object):
    """Class that simulates a distributed cache system."""

    def __init__(self, key_count):
        """Initialize the cache system."""

        # List to store all the physical cache nodes.
        self.nodes = []

        # Hash circle.
        self.hash_circle = {}

        # Default virtual nodes count for each cache node is 200.
        self.virtual_nodes_count = 200

        # Total count of the data keys
        self._key_count = key_count

    def add_node(self, node):
        """Add a physical cache node into this system.

        Use Name of the node as argument.

        Args:
            node (str): Name of the physical node.
        """
        for i in range(self.virtual_nodes_count):
            hashcode = calculate_hash('{}->vn{}'.format(node, i))
            self.hash_circle[hashcode] = node
        if node not in self.nodes:
            self.nodes.append(node)

    def _fill_data(self):
        """Fill the hash circle with data keys."""
        for i in range(self._key_count):
            key = 'key{}'.format(i)
            hashcode = calculate_hash(key)
            self.hash_circle[hashcode] = key

    def setup(self, count):
        """Setup/reset the cache system.

        Args:
            count (int): Specific virtual nodes count for 
                each physical node.
        """
        self.virtual_nodes_count = count
        self.hash_circle = {}
        self.nodes = []
        self._fill_data()


class TestCacheSystem(object):
    """Class to test the cache system and hash algorithm."""

    def __init__(self, key_count=1000000):
        """Initialize the test handler.

        Args:
            key_count (int): Number of the data keys.
        """
        self.cache_system = CacheSystem(key_count)

    def _add_nodes(self, nodes_count):
        """Add specified amount of physical nodes into cache system.

        Args:
            nodes_count (int): Number of the physical nodes.
        """
        for i in range(nodes_count):
            self.cache_system.add_node('Node-{}'.format(i))

    def calculate(self, nodes_count, virtual_nodes_count):
        """Calculate the standard deviation of the keys each node stores.
        
        Args:
            nodes_count (int): Number of the physical nodes.
            virtual_nodes_count (int): Number of virtual nodes for each 
                physical node.

        """
        self.cache_system.setup(virtual_nodes_count)
        self._add_nodes(nodes_count)

        hashcodes = sorted(self.cache_system.hash_circle.keys())
        nodes = {node:0 for node in self.cache_system.nodes}
        i = 0
        first_node = None
        for hc in hashcodes:
            if self.cache_system.hash_circle[hc].startswith('key'):
                i += 1
                continue
            node = self.cache_system.hash_circle[hc].split('->')[0]
            if not first_node:
                first_node = node
            nodes[node] = nodes[node] + i
            i = 0

        if i != 0:
            nodes[first_node] = nodes[first_node] + i

        print "物理节点数:", nodes_count
        print "每个物理节点分配的虚拟节点数量:", self.cache_system.virtual_nodes_count
        print "每个物理节点存储的KV数据数量:\n"

        node_keys = []
        for node in sorted(nodes.keys()):
            print node, nodes[node]
            node_keys.append(nodes[node])

        print "\n当前配置下1000000万KV数据的分布标准差:", std(node_keys, ddof=1)


test = TestCacheSystem()
test.calculate(10, 500)

运行结果:

物理节点数: 10
每个物理节点分配的虚拟节点数量: 500
每个物理节点存储的KV数据数量:

Node-0 101833
Node-1 99428
Node-2 95766
Node-3 97259
Node-4 100356
Node-5 99638
Node-6 102234
Node-7 104631
Node-8 102526
Node-9 96329

当前配置下1000000万KV数据的分布标准差: 2899.7801449228677
物理节点数: 10
每个物理节点分配的虚拟节点数量: 200
每个物理节点存储的KV数据数量:

Node-0 104533
Node-1 94639
Node-2 103166
Node-3 102795
Node-4 89517
Node-5 99433
Node-6 104803
Node-7 110528
Node-8 95187
Node-9 95399

当前配置下1000000万KV数据的分布标准差: 6285.614369335745

结论

不同的Hash算法、不同的虚拟节点数量、不同的虚拟节点命名方式都会影响到标准差结果。

上一篇下一篇

猜你喜欢

热点阅读