字典、集合，你真的了解吗？

2019-05-25 本文已影响0人倔强的潇洒小姐

一、字典和集合基础

字典：一系列无序元素的组合，其长度大小可变，元素可以任意地删减和改变

优点：特别是对于查找、添加和删除操作，字典都能在常数时间复杂度内完成

注：这里的元素，是一对键（key）和值（value）的配对

集合：没有键和值的配对，是一系列无序的、唯一的元素组合

注：不支持索引操作，因为集合本质上是一个哈希表，和列表不一样

集合的 pop() 操作是删除集合中最后一个元素，可是集合本身是无序的，你无法知道会删除哪个元素，因此这个操作得谨慎使用

>>> s = {1, 2, 4, 5, 6, 7, 8}
>>> s.pop()
1
>>> s
{2, 4, 5, 6, 7, 8}
>>> s.pop()
2
>>> s.add(2)
>>> s
{2, 4, 5, 6, 7, 8}
>>> s.pop()
4

补充：字典和集合，无论是键还是值，都可以是混合类型

s = {1, 'hello', 5.0}

二、字典与集合的性能

集合是高度优化的哈希表，里面元素不能重复，并且其添加和查找操作只需 O(1) 的复杂度，那么，总的时间复杂度就只有 O(n)

import time


id = [x for x in range(0, 10000)]
price = [x for x in range(10000, 20000)]
product_test = list(zip(id, price))


def find_unique_price_using_list(products):
    unique_price_list = []
    for _, price in products:
        if price not in unique_price_list:
            unique_price_list.append(price)
    return len(unique_price_list)


def find_unique_price_using_set(products):
    unique_price_set = set()
    for _, price in products:
        if price not in unique_price_set:
            unique_price_set.add(price)
    return len(unique_price_set)

    
start_using_list = time.perf_counter()
find_unique_price_using_list(product_test)
end_using_list = time.perf_counter()

print('time using A: {}'.format(end_using_list-start_using_list))

start_using_set = time.perf_counter()
find_unique_price_using_set(product_test)
end_using_set = time.perf_counter()

print('time using B : {}'.format(end_using_set - start_using_set))

时间输出：

time using A: 0.538301896
time using B : 0.0012807300000000632

总结：仅仅十万的数据量，两者的速度差异就如此之大。事实上，大型企业的后台数据往往有上亿乃至十亿数量级，如果使用了不合适的数据结构，就很容易造成服务器的崩溃，不但影响用户体验，并且会给公司带来巨大的财产损失

补充知识：
zip() 函数用于将可迭代的对象作为参数，将对象中对应的元素打包成一个个元组，然后返回由这些元组组成的列表。

如果各个迭代器的元素个数不一致，则返回列表长度与最短的对象相同，利用 * 号操作符，可以将元组解压为列表。

>>>a = [1,2,3]
>>> b = [4,5,6]
>>> c = [4,5,6,7,8]
>>> zipped = zip(a,b)     # 打包为元组的列表
[(1, 4), (2, 5), (3, 6)]
>>> zip(a,c)              # 元素个数与最短的列表一致
[(1, 4), (2, 5), (3, 6)]
>>> zip(*zipped)          # 与 zip 相反，*zipped 可理解为解压，返回二维矩阵式
[(1, 2, 3), (4, 5, 6)]

三、字典和集合的工作原理

字典和集合的内部结构都是一张哈希表。

字典：这张表存储了哈希值（hash）、键和值这 3个元素

集合：哈希表内没有键和值的配对，只有单一的元素

四、思考题

1、下面初始化字典的方式，哪一种更高效？

# Option A
d = {'name': 'jason', 'age': 20, 'gender': 'male'}

# Option B
d = dict({'name': 'jason', 'age': 20, 'gender': 'male'})

解答：OptionA 效率更高，因为OptionB调用dict函数等于运行时又增加了一层额外的操作，会去直接调用底层C写好的代码

2、字典的键可以是一个列表吗？下面这段代码中，字典的初始化是否正确呢？原因是什么

d = {'name': 'jason', ['education']: ['Tsinghua University', 'Stanford University']}

解答：列表是一个动态变化的数据结构，可变类型不可hash，字典当中的 key 要求是不可变的，原因也很好理解，key 首先是不重复的，如果 Key 是可以变化的话，那么随着 Key 的变化，这里就有可能就会有重复的 Key，那么这就和字典的定义相违背；如果把这里的列表换成之前我们讲过的元组是可以的，因为元组不可变