HashMap深入剖析

2020-01-04 本文已影响0人 Jade_4c8c

该篇文章主要对HashMap进行深入探索，主要探索内容为
一、HashMap介绍
二、HashMap构造方法（初始容量，负载因子）；
三、HashMap数据结构；
四、HashMap put方法(hash算法，index算法，扩容，重哈希)；
五、HashMap get方法；
六、简单举例hash算法和indexFor算法

一、HashMap介绍

HashMap是Map族中最为常用的一种，也是 Java Collection Framework 的重要成员。
HashMap是基于哈希表实现的，每一个元素都是一个key-value对，其内部通过单链表解决冲突问题，容量不足（超过了阈值）时，同样会自动增长。
HashMap是非线程安全的，只是用于单线程环境下，多线程环境下可以采用concurrent并发包下的concurrentHashMap。
HashMap实现了Serializable接口，因此它支持序列化，实现了Cloneable接口，能被克隆。
HashMap最多只允许一条Entry的键为Null(多条会覆盖)，但允许多条Entry的值为Null。

二、HashMap构造方法

HashMap 一共提供了四个构造函数，其中默认无参的构造函数和参数为Map的构造函数为 Java Collection Framework 规范的推荐实现，其余两个构造函数则是 HashMap 专门提供的。
1、HashMap()
该构造函数意在构造一个具有默认初始容量 (16) 和默认负载因子(0.75) 的空 HashMap，是 Java Collection Framework 规范推荐提供的，其源码如下：

/**
* Constructs an empty HashMap with the default initial capacity
* (16) and the default load factor (0.75).
*/
public HashMap(){
    //负载因子:用于衡量的是一个散列表的空间的使用程度
    this.loadFactor=DEFAULT_LOAD_FACTOR;
    //HashMap进行扩容的阈值，它的值等于 HashMap 的容量乘以负载因子
    threshold=(int)(DEFAULT_INITIAL_CAPACITY*DEFAULT_LOAD_FACTOR);
    // HashMap的底层实现仍是数组，只是数组的每一项都是一条链
    table=newEntry[DEFAULT_INITIAL_CAPACITY];
    init();
}

2、HashMap(int initialCapacity, float loadFactor)
该构造函数意在构造一个指定初始容量和指定负载因子的空 HashMap，其源码如下：

 /**
 * Constructs an empty HashMap with the specified initial capacity and load factor.
 */
 public HashMap(int initialCapacity, float loadFactor) {
     //初始容量不能小于 0
     if (initialCapacity < 0)
         throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
     //初始容量不能超过 2^30
     if (initialCapacity > MAXIMUM_CAPACITY)
         initialCapacity = MAXIMUM_CAPACITY;
     //负载因子不能小于 0 
     if (loadFactor <= 0 || Float.isNaN(loadFactor))
         throw new IllegalArgumentException("Illegal load factor: " +loadFactor);
     // HashMap 的容量必须是2的幂次方，超过 initialCapacity 的最小 2^n 
     int capacity = 1;
     while (capacity < initialCapacity)
     capacity <<= 1; 
     //负载因子
     this.loadFactor = loadFactor;
     //设置HashMap的容量极限，当HashMap的容量达到该极限时就会进行自动扩容操作
     threshold = (int)(capacity * loadFactor);
     // HashMap的底层实现仍是数组，只是数组的每一项都是一条链
     table = new Entry[capacity];
     init();
 }

3、HashMap(int initialCapacity)
该构造函数意在构造一个指定初始容量和默认负载因子 (0.75)的空 HashMap，其源码如下：

// Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75)
 public HashMap(int initialCapacity) {
     this(initialCapacity, DEFAULT_LOAD_FACTOR); // 直接调用上述构造函数
 }

4、HashMap(Map<? extends K, ? extends V> m)
该构造函数意在构造一个与指定 Map 具有相同映射的 HashMap，其初始容量不小于 16 (具体依赖于指定Map的大小)，负载因子是 0.75，是 Java Collection Framework 规范推荐提供的，其源码如下：

 // Constructs a new HashMap with the same mappings as the specified Map. 
 // The HashMap is created with default load factor (0.75) and an initial capacity
 // sufficient to hold the mappings in the specified Map.
 public HashMap(Map<? extends K, ? extends V> m) {
      // 初始容量不小于 16 
     this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
     DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
     putAllForCreate(m);
 }

在这里，我们提到了两个非常重要的参数：初始容量和负载因子，这两个参数是影响HashMap性能的重要参数。其中，容量表示哈希表中桶的数量 (table 数组的大小)，初始容量是创建哈希表时桶的数量；负载因子是哈希表在其容量自动增加之前可以达到多满的一种尺度，它衡量的是一个散列表的空间的使用程度，负载因子越大表示散列表的装填程度越高，反之愈小。
在这里先留下一个疑问：
为什么达到阀值扩容呢，而不是容量满了再扩容呢

三、HashMap的数据结构

哈希的相关概念
Hash 就是把任意长度的输入(又叫做预映射， pre-image)，通过哈希算法，变换成固定长度的输出(通常是整型)，该输出就是哈希值。这种转换是一种压缩映射，也就是说，散列值的空间通常远小于输入的空间。不同的输入可能会散列成相同的输出，从而不可能从散列值来唯一的确定输入值。简单的说，就是一种将任意长度的消息压缩到某一固定长度的息摘要函数。

哈希的应用：数据结构
我们知道，数组的特点是：寻址容易，插入和删除困难；而链表的特点是：寻址困难，插入和删除容易。那么我们能不能综合两者的特性，做出一种寻址容易，插入和删除也容易的数据结构呢？答案是肯定的，这就是我们要提起的哈希表。事实上，哈希表有多种不同的实现方法，我们接下来解释的是最经典的一种方法 —— 拉链法，我们可以将其理解为链表的数组，如下图所示：

image

我们可以从上图看到，左边很明显是个数组，数组的每个成员是一个链表。该数据结构所容纳的所有元素均包含一个指针，用于元素间的链接。我们根据元素的自身特征把元素分配到不同的链表中去，反过来我们也正是通过这些特征找到正确的链表，再从链表中找出正确的元素。其中，根据元素特征计算元素数组下标的方法就是哈希算法。

总的来说，哈希表适合用作快速查找、删除的基本数据结构，通常需要总数据量可以放入内存。在使用哈希表时，有以下几个关键点：

hash 函数（哈希算法）的选择：针对不同的对象(字符串、整数等)具体的哈希方法；
碰撞处理：常用的有两种方式，一种是open hashing，即 >拉链法；
另一种就是 closed hashing，即开地址法(opened addressing)。

HashMap 的数据结构
我们知道，在Java中最常用的两种结构是数组和链表，几乎所有的数据结构都可以利用这两种来组合实现，HashMap 就是这种应用的一个典型。实际上，HashMap 就是一个链表数组，如下是它数据结构：

image

从上图中，我们可以形象地看出HashMap底层实现还是数组，只是数组的每一项都是一条链。其中参数initialCapacity 就代表了该数组的长度，也就是桶的个数。在第三节我们已经了解了HashMap 的默认构造函数的源码：

 /**
 * Constructs an empty HashMap with the default initial capacity
 * (16) and the default load factor (0.75).
 */
 public HashMap() {
     //负载因子:用于衡量的是一个散列表的空间的使用程度
     this.loadFactor = DEFAULT_LOAD_FACTOR; 
     //HashMap进行扩容的阈值，它的值等于 HashMap 的容量乘以负载因子
     threshold = (int)(DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
     // HashMap的底层实现仍是数组，只是数组的每一项都是一条链
     table = new Entry[DEFAULT_INITIAL_CAPACITY];
     init();
 }

从上述源码中我们可以看出，每次新建一个HashMap时，都会初始化一个Entry类型的table数组，其中 Entry类型的定义如下：

static class Entry<K,V> implements Map.Entry<K,V> {
     final K key; // 键值对的键
     V value; // 键值对的值
     Entry<K,V> next; // 下一个节点
     final int hash; // hash(key.hashCode())方法的返回值
     /**
     * Creates new entry.
     */
     Entry(int h, K k, V v, Entry<K,V> n) { // Entry 的构造函数
     value = v;
     next = n;
     key = k;
     hash = h;
 }
 ......
}

其中，Entry为HashMap的内部类，实现了 Map.Entry 接口，其包含了键key、值value、下一个节点next，以及hash值四个属性。事实上，Entry 是构成哈希表的基石，是哈希表所存储的元素的具体形式。

四、HashMap put方法

下面我们结合JDK源码看HashMap 的存取实现。
HashMap 的存储实现
在 HashMap 中，键值对的存储是通过 put(key,vlaue) 方法来实现的，其源码如下：

/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with key, or null if there was no mapping for key.
 * Note that a null return can also indicate that the map previously associated null with key.
 */
 public V put(K key, V value) {
     //当key为null时，调用putForNullKey方法，并将该键值对保存到table的第一个位置 
     if (key == null)
         return putForNullKey(value); 
     //根据key的hashCode计算hash值
     int hash = hash(key.hashCode()); // ------- (1)
     //计算该键值对在数组中的存储位置（哪个桶）
     int i = indexFor(hash, table.length); // ------- (2)
     //在table的第i个桶上进行迭代，寻找 key 保存的位置
     for (Entry<K,V> e = table[i]; e != null; e = e.next) { // ------- (3)
         Object k;
         //判断该条链上是否存在hash值相同且key值相等的映射，若存在，则直接覆盖 value，并返回旧value
         if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
             V oldValue = e.value;
             e.value = value;
             e.recordAccess(this);
             return oldValue; // 返回旧值
         }
     }
     modCount++; //修改次数增加1，快速失败机制
     //原HashMap中无该映射，将该添加至该链的链头
     addEntry(hash, key, value, i); 
     return null;
  }

对NULL键的特别处理：putForNullKey()
我们直接看其源码：

/**
 * Offloaded version of put for null keys
 */
 private V putForNullKey(V value) {
     // 若key==null，则将其放入table的第一个桶，即 table[0]
     for (Entry<K,V> e = table[0]; e != null; e = e.next) { 
         if (e.key == null) { // 若已经存在key为null的键，则替换其值，并返回旧值
             V oldValue = e.value;
             e.value = value;
             e.recordAccess(this);
             return oldValue;
         }
     }
     modCount++; // 快速失败
     addEntry(0, null, value, 0); // 否则，将其添加到 table[0] 的桶中
     return null;
 }

HashMap 中的哈希策略（算法）

/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits.
*
* Note: Null keys always map to hash 0, thus index 0.
*/

static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

正如JDK官方对该方法的描述那样，使用hash()方法对一个对象的hashCode进行重新计算是为了防止质量低下的hashCode()函数实现。由于hashMap的支撑数组长度总是 2 的幂次，通过右移可以使低位的数据尽量的不同，从而使hash值的分布尽量均匀。

通过上述hash()方法计算得到 Key 的 hash值后，怎么才能保证元素均匀分布到table的每个桶中呢？我们会想到取模，但是由于取模的效率较低，HashMap 是通过调用上面的indexFor()方法处理的，其不但简单而且效率很高，对应源码如下所示：

/**
 * Returns index for hash code h.
 */
static int indexFor( int h, int length )
{
    return(h & (length - 1) ); /* 作用等价于取模运算，但这种方式效率更高 */
}

HashMap 中键值对的添加：addEntry()

我们直接看其源码：

/**
 * Adds a new entry with the specified key, value and hash code to
 * the specified bucket. It is the responsibility of this
 * method to resize the table if appropriate.
 *
 * Subclass overrides this to alter the behavior of put method.
 * 
 * 永远都是在链表的表头添加新元素
 */
 void addEntry(int hash, K key, V value, int bucketIndex) {
     //获取bucketIndex处的链表
     Entry<K,V> e = table[bucketIndex];
     //将新创建的 Entry 链入 bucketIndex处的链表的表头 
     table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
     //若HashMap中元素的个数超过极限值 threshold，则容量扩大两倍
     if (size++ >= threshold)
     resize(2 * table.length);
 }

HashMap 的扩容：resize()
随着HashMap中元素的数量越来越多，发生碰撞的概率将越来越大，所产生的子链长度就会越来越长，这样势必会影响HashMap的存取速度。为了保证HashMap的效率，系统必须要在某个临界点进行扩容处理，该临界点就是HashMap中元素的数量在数值上等于threshold（table数组长度乘以加载因子）。但是，不得不说，扩容是一个非常耗时的过程，因为它需要重新计算这些元素在新table数组中的位置并进行复制处理。所以，如果我们能够提前预知HashMap 中元素的个数，那么在构造HashMap时预设元素的个数能够有效的提高HashMap的性能（已解答之前的疑问：为什么阀值呢，而不是容量满了再扩容呢）。我们直接看其源码：

/**
 * Rehashes the contents of this map into a new array with a
 * larger capacity. This method is called automatically when the
 * number of keys in this map reaches its threshold.
 * If current capacity is MAXIMUM_CAPACITY, this method does not
 * resize the map, but sets threshold to Integer.MAX_VALUE.
 * This has the effect of preventing future calls.
 *
 * @param newCapacity the new capacity, MUST be a power of two;
 * must be greater than current capacity unless current
 * capacity is MAXIMUM_CAPACITY (in which case value is irrelevant).
 *
 */
void resize( int newCapacity )
{
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    /* 若 oldCapacity 已达到最大值，直接将 threshold 设为 Integer.MAX_VALUE */
    if ( oldCapacity == MAXIMUM_CAPACITY )
    {
        threshold = Integer.MAX_VALUE;
        return; /* 直接返回 */
    }
    /* 否则，创建一个更大的数组 */
    Entry[] newTable = new Entry[newCapacity];
    /* 将每条Entry重新哈希到新的数组中 */
    transfer( newTable );
    table = newTable;
    threshold = (int) (newCapacity * loadFactor); /* 重新设定 threshold */
}

HashMap 的重哈希：transfer()
重哈希的主要是一个重新计算原HashMap中的元素在新table数组中的位置并进行复制处理的过程，我们直接看其源码：

/**
*
* Transfers all entries from current table to newTable.
*
*/
void transfer( Entry[] newTable ){
/* 将原数组 table 赋给数组 src */
Entry[] src = table;
int newCapacity = newTable.length;
/* 将数组 src 中的每条链重新添加到 newTable 中 */
for ( int j = 0; j < src.length; j++ ){
    Entry<K, V> e = src[j];
    if ( e != null ){
        src[j] = null; /* src 回收 */
        /* 将每条链的每个元素依次添加到 newTable 中相应的桶中 */
         do {
            Entry<K, V> next = e.next;
            /* e.hash指的是 hash(key.hashCode())的返回值; */
            /* 计算在newTable中的位置，注意原来在同一条子链上的元素可能被分配到不同的子链 */
            int i = indexFor( e.hash, newCapacity );
            e.next = newTable[i];
            newTable[i] = e;
            e = next;
         }
        while ( e != null );
     }
}
}

特别需要注意的是，在重哈希的过程中，原属于一个桶中的Entry对象可能被分到不同的桶，因为HashMap 的容量发生了变化，那么 h&(length - 1) 的值也会发生相应的变化。极端地说，如果重哈希后，原属于一个桶中的Entry对象仍属于同一桶，那么重哈希也就失去了意义。

HashMap get方法

相对于HashMap的存储而言，读取就显得比较简单了。因为，HashMap只需通过key的hash值定位到table数组的某个特定的桶，然后查找并返回该key对应的value即可，源码如下：

/**
* Returns the value to which the specified key is mapped,
* or {@code null} if this map contains no mapping for the key.
* <p>More formally, if this map contains a mapping from a key
* {@code k} to a value {@code v} such that {@code (key==null ? k==null :
* key.equals(k))}, then this method returns {@code v}; otherwise
* it returns {@code null}. (There can be at most one such mapping.)
* <p>A return value of {@code null} does not <i>necessarily</i>
* indicate that the map contains no mapping for the key; it's also
* possible that the map explicitly maps the key to {@code null}.
* The {@link #containsKey containsKey} operation may be used to
* distinguish these two cases.
* @see #put(Object, Object)
*/
public V get( Object key ){
    /* 若为null，调用getForNullKey方法返回相对应的value */
    if ( key == null )
    /* 从table的第一个桶中寻找 key 为 null 的映射；若不存在，直接返回null */
        return(getForNullKey() );
    /* 根据该 key 的 hashCode 值计算它的 hash 码 */
    int hash = hash( key.hashCode() );
    /* 找出 table 数组中对应的桶 */
    for ( Entry<K, V> e = table[indexFor( hash, table.length )];e != null;e = e.next ){
        Object k;
        /* 若搜索的key与查找的key相同，则返回相对应的value */
        if ( e.hash == hash && ( (k = e.key) == key || key.equals( k ) ) )
            return(e.value);
     }
    return(null);
}

针对键为NULL的键值对，HashMap 提供了专门的处理：getForNullKey()，其源码如下：

/**
 * Offloaded version of get() to look up null keys. Null keys map
 * to index 0\. This null case is split out into separate methods
 *
 * for the sake of performance in the two most commonly used
 *
 * operations (get and put), but incorporated with conditionals in
 *
 * others.
 *
 */
private V getForNullKey()
{
    /* 键为NULL的键值对若存在，则必定在第一个桶中 */
    for ( Entry<K, V> e = table[0]; e != null; e = e.next )
    {
        if ( e.key == null )
            return(e.value);
    }
    /* 键为NULL的键值对若不存在，则直接返回 null */
    return(null);
}

因此，调用HashMap的get(Object key)方法后，若返回值是 NULL，则存在如下两种可能：
1.该 key 对应的值就是 null;
2.HashMap 中不存在该 key

六、简单举例hash算法和indexFor算法

哈希函数是通过把key的hash值映射到数组中的一个位置来进行访问。比如：

存在一组哈希值 10，13，7，5，4，20
存在一个长度为10的数组 arrays
定义一个hash函数 int index = h % arrays.length; 

10 % 10 = 0 那么 哈希值为10的对象放在数组索引为0的位置上；
13 % 10 = 3 那么 哈希值为13的对象放在数组索引为3的位置上；
......
20 % 10 = 0 那么 哈希值为13的对象放在数组索引为0的位置上；

这时候大家看出了一个问题，哈希值为10的对象和哈希值为20的对象，放在了一个索引上。发生了碰撞，那么怎么解决这样碰撞呢，有很多种方式，这里不展开叙述。HashMap中维护了一个链表组成的数组。如果冲突的话就添加到链表中，下面来看下hashmap中的hash算法，以Java8源码为例（各Java版本的hash算法可能存在差异）。

static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

其中，key.hashCode（）是Key自带的hashCode()方法，返回一个int类型的散列值。我们大家知道，32位带符号的int表值范围从-2147483648到2147483648。这样只要hash函数松散的话，一般是很难发生碰撞的，因为HashMap的初始容量只有16。但是这样的散列值我们是不能直接拿来用的。用之前需要对数组的长度取模运算。得到余数才是索引值。我们来看下HashMap中怎么实现的。

int index = hash & (arrays.length-1);

那么这也就明白了为什么HashMap的数组长度是2的整数幂。比如以初始长度为16为例，16-1 = 15，15的二进制数位00000000 00000000 00001111。可以看出一个基数二进制最后一位必然位1，当与一个hash值进行与运算时，最后一位可能是0也可能是1。但偶数与一个hash值进行与运算最后一位必然为0，造成有些位置永远映射不上值。
但是这时，又出现了一个问题，即使散列函数很松散，但只取最后几位碰撞也会很严重。这时候hash算法的价值就体现出来了，

image

hashCode右移16位，正好是32bit的一半。与自己本身做异或操作（相同为0，不同为1）。就是为了混合哈希值的高位和地位，增加低位的随机性。并且混合后的值也变相保持了高位的特征。

如果同一个格子里的key不超过8个，使用链表结构存储。如果超过了8个，那么会调用treeifyBin函数，将链表转换为红黑树，可以自己查资料了解

HashMap深入剖析

一、HashMap介绍

二、HashMap构造方法

三、HashMap的数据结构

四、HashMap put方法

HashMap get方法

六、简单举例hash算法和indexFor算法

猜你喜欢

热点阅读