HashMap源码分析（主要基于JDK1.8）

2019-12-09 本文已影响0人言淺

什么是HashMap

HashMap是一个用于存储Key-Value键值对的集合，也是Java中使用频率最高的用于映射(键值对)处理的数据类型。

众所周知，HashMap访问速度快，遍历顺序不确定，线程不安全。这些特点都是由底层实现决定的，将在本文中进行分析。

在JDK1.8中，对HashMap的底层实现进行了优化，接下来本文将结合JDK1.7及JDK1.8的区别，对HashMap的底层实现原理做简单的分析（主要基于JDK1.8）。

JDK1.7、JDK1.8中，HashMap的区别简述

JDK1.7
1、hash计算，4次扰动
2、实现方式：链表 + 数组
3、链表：头插法（并发扩容时容易出现环形链表）
4、扩容时重新计算桶下标

JDK1.8
1、hash计算，1次扰动
2、实现方式：链表 + 数组 / 红黑树
3、链表：尾插法
4、扩容时只计算hash值新增的bit

存储结构

都知道JDK1.7中，HashMap的实现方式是：数组 + 链表，在JDK1.8中，引入了红黑树做优化。那么，数组 + 链表 / 红黑树，这个存储结构，是怎么实现的呢？让我们把目光放到源码中，

/* ---------------- Fields -------------- */

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

字段定义的一开始，就是一个Node<K,V>类型的table数组。那么再看看Node<K,V>的定义：

    /**
     * Basic hash bin node, used for most entries.  (See below for
     * TreeNode subclass, and in LinkedHashMap for its Entry subclass.)
     */
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

        ......

    }

Node<K,V>是HashMap的一个内部类，它实现了Map.Entry接口，本质上就是一个映射（键值对）。Node<K,V>中，除了key和value字段之外，还定义了hash和next字段。next字段也是Node<K,V>类型的，这意味着，Node<K,V>这个类有能力组成链表。事实也正是如此，下图中，我画出了HashMap的实现图示，包含了链表和红黑树，红黑树的定义TreeNode<K,V>同样也是HashMap中的一个内部类。

存储结构图示

左侧的是table数组，每一个圆点都代表一个Node / TreeNode。

有个小的细节，在这里提及一下。当链表达到树化阀值（TREEIFY_THRESHOLD = 8）的时候，并不会直接树化这条链表，而是先判断table数组有没有达到最小的树化容量（MIN_TREEIFY_CAPACITY = 64），如果没达到的话，此时HashMap会选择扩容，而非树化。因此，实际使用过程中，要出现图示的结构，table数组的长度至少要有64。

HashMap初始化

简单的了解下初始化方法。在实际使用过程当中，最常用的就是不带参数的默认构造函数，和指定容量的构造函数。例：

        Map<String, String> map1 = new HashMap<>();

        Map<String, String> map2 = new HashMap<>(20);

        Map<String, String> map3 = new HashMap<>(65);

不带参数的默认构造函数，实际上只做了一件事情，就是设置负载因子loadFactor的值。在此不赘述。

    /**
     * Constructs an empty <tt>HashMap</tt> with the default initial capacity
     * (16) and the default load factor (0.75).
     */
    public HashMap() {
        this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
    }

带指定容量的构造函数，通过代码，可以看到实际上是调用了这个方法：

/* ---------------- Public operations -------------- */

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

可以看出，除了设置负载因子loadFactor的值外，还调用了tableSizeFor方法，将其返回值赋给threshold字段。

tableSizeFor这个方法，可以重点了解下。方法入参是容量的期望值，返回则是大于等于期望值的，距离这个值最近的2的次方值。

     /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

通过不断的与自身无符号右移的结果做按位或，得到一个低位全部都是1的数据。然后再加1，就可以得到2的次方值。

以期望容量65为例，经过一系列无符号右移及按位或操作后，得到了实际容量128。
（这个值是容量，也是table的长度。初始化的时候暂时把这个值赋值给了threshold，一般threshold的计算方式其实是capacity * load factor，由此可以得知，当table == null时，threshold的值可能为0，也可能为当前实际容量）

tableSizeFor方法计算图示

按位或( | )：按位运算，有一个是1，结果为1，否则为0；
无符号右移( >>> )：向右移动N位，高位用0补齐；

HashMap的put方法解析

要再深入一点了解HashMap的原理，那么可以先从put方法入手，put方法大致可以分为三步：

· 检查是否需要resize()
· 确定哈希桶数组索引位置
· 存放键值至对应链表/红黑树中

先来一个整体流程图：

put方法流程图

可以看到，流程的第一步是判断是否需要扩容，这一步的目的很明确。就是保证后边的存放步骤可以存放到一个正确的数组当中。像刚才我们提到的初始化方法，就肯定要进行扩容，因为构造函数中，并没有对table字段进行赋值，这样显然是无法使用的。这个时候扩容，就会将table默认初始化成一个长度为16的Node数组，当然，如果填了初始化容量，那么table就会被初始化成长度为tableSizeFor返回值的Node数组。

计算哈希桶数组索引

此时我们已经拥有了一个长度合适的table数组，那么接下来将要计算Key在哈希桶数组中的索引，那索引的计算公式是什么呢，看源码：

     /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

    /**
     * Implements Map.put and related methods
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            ......
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

    ......

    /**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

关键代码在putVal方法的第5行，p = tab[i = (n - 1) & hash] 这里。JDK1.8中并没有将哈希桶数组的索引抽成方法来计算。但从代码中我们还是可以提取出，其计算公式为：

(n - 1) & hash

话不多说，上图：
key = "hello world"
key.hashCode() = 1794106052
（1794106052的二进制表示：0110 1010 1110 1111 1110 0010 1100 0100）

哈希桶数组索引计算图示

按位与( & )：按位运算，都是1，结果为1，否则为0；
位异或( ^ )：按位运算，相反为1，相同为0；

图示中是JDK1.8中，哈希桶数组索引的计算步骤拆解。其中h ^ (h >>> 16)这一步，是为了让hashCode的高位也参与到运算中来，增加hash值的随机性。JDK1.7和JDK1.8中，hash值的计算方式是不同的，因此，在JDK1.7和JDK1.8中，同样的数据所对应的桶下标是可能不一致的。（并没有什么关系，因为除了你我，别人可能都不太关心HashMap中的数据到底是怎么存放的😑）

从图示中可以看到，key在哈希桶数组的位置，取决与它自己的hash值和当前HashMap的实际容量，也就是table的数组长度。（图示中HashMap的当前容量是16，key的桶索引其实就是其hash值的低4位）

存放逻辑

哈希桶数组索引已经有了，那么接下来就是将键值对存放到对应的数组 / 红黑树中。从流程图中可以看到，接下来就是判断是否有产生了hash冲突。本文开头提及过，HashMap的访问速度非常快。在Hash算法合理的前提下，最理想的情况是所有的键值对都均匀的散落到table数组中，在不发生哈希冲突的情况下，只要能定位到哈希桶索引，就能定位到想获取的数据。所以这里判断了两次，第一次，判断tab[i]是否为null，如果为null，那么直接存放即可；如果不是，则进行第二次判断，判断tab[i]中已存放的key是否就是我们将要处理的key，如果是的话，说明也没有发生哈希冲突，此时直接覆盖即可。如果也不是，说明发生了哈希冲突，这时候，就要区分：

链表：遍历链表，查找是否已经存在该key，已存在则覆盖原有值；若不存在，则将key、value插入链表末尾（JDK1.8：尾插法）；如果插入该数据将使得链表长度达到树化阀值8，则进行树化；

红黑树：遍历树节点，若该key已存在则覆盖原有值，若不存在则插入树节点；（红黑树这块后边单独开个文章细讲）

扩容逻辑

将数据按逻辑存放好后，在putVal方法的末尾，进行了一次检查（++size > threshold），如果判断数据已经达到了threshold，则做一次扩容resize()。

已经了解到HashMap的核心字段table之后，扩容这个操作就很好理解了。在Java中，数组是不可以自动扩容的。因此，扩容的思路就是用一个容量大的数组，替代原先容量小的数组，并将数据迁移到新数组当中。

实际上HashMap也是这么做的，用一个新的数组替代原先容量小的数组，但是在数据迁移的过程中，并不是直接一个个数据重新计算hash值及桶下标，而是只计算了hash值新增的bit。设计得非常巧妙。

还记得上边说到的哈希桶数组索引的计算方式吗？

(n - 1) & hash

还是以key = "hello world"为例，请看：

扩容机制-1

同一个key，它的hash值是不变的，那么变量在哪里呢？自然就是n了，n是数组的长度。图中HashMap的容量从16扩容到32，唯一变化的就是(n - 1)，从低4位都是1，变成了低5位都是1。
而变化的这一位，很关键的，就等于原有的容量。

扩容机制-2

由上可以得知，扩容前后，数据所在的桶下标只有2种情况：
1、新增bit为0，桶下标不变；
2、新增bit为1，桶下标 = 原有桶下标 + 原容量；

这样，就可以将原先某个哈希桶中冲突的节点，分散到2个哈希桶中了。

同时，新增bit为0 / 1，这件事情可以认为是随机的，因此扩容过程，可以把原先hash冲突的节点均匀的分散到新的桶中。

至此，HashMap的put方法就简单的解析完了。源码写得很赞，可以学到挺多的。

非线程安全

HashMap并非线程安全类，在刚才的源码解析中，可以发现在put操作中，并没有加锁，也就是说任一时刻可以多个线程同时写HashMap。那么就会有一定的几率会出现数据丢失的情况。先简单的列举以下几个场景，如：

1、两个线程同时判断完 tab[i] == null, 准备newNode赋值给 tab[i] 时；
2、两个线程同时判断到链表末尾，(e = p.next) == null，准备newNode赋值给 p.next 时；
3、一个线程扩容中，已经将新的Node数组赋值给table，但未迁移完数据。此时另一个线程执行了put / remove操作，则 put / remove 操作可能无效；

……

还有其他非线程安全的可能，这里没有全部列举出来，只是提供思路供参考。

/**
     * Implements Map.put and related methods
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

当2个线程同时已经判断完了if ((p = tab[i = (n - 1) & hash]) == null)，准备执行写入数据时，先写入的数据将被后写入的数据覆盖。同理，如果2个线程同时已经判断到了链表的末尾 if ((e = p.next) == null)，准备执行 p.next = newNode(hash, key, value, null)写入数据时，后写入的数据将覆盖先写入的数据。（大前提都是，同时写入的数据，桶下标刚好相同，也就是产生了Hash冲突）

除此之外，当线程1正在扩容，已经将新数组赋值给table（已执行到table = newTab），但还未执行下边的数据迁移操作，此时线程2写入（或删除）数据，线程1再执行完对应的数据迁移操作，都会导致线程2写入的数据被覆盖（删除数据将不生效）

    /**
     * Initializes or doubles table size.  If null, allocates in
     * accord with initial capacity target held in field threshold.
     * Otherwise, because we are using power-of-two expansion, the
     * elements from each bin must either stay at same index, or move
     * with a power of two offset in the new table.
     *
     * @return the table
     */
    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        
        ......

        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {

           ......

        }
        return newTab;
    }

为了加深印象，这里准备了2个例子，供参考。

下边是我本地复现线程不安全的代码，感兴趣的话可以试试。这是基于JDK1.8的。

示例1

import java.util.HashMap;

public class HashMapDemoResize1 {

    private static HashMap<String, String> map = new HashMap<String, String>();
    public static void main(String[] args) {
        map.put("5", "C");

        new Thread("Thread1") {
            public void run() {
                map.put("16", "B");
                System.out.println("put record. key: 16, value: B. result: " + map);
            };
        }.start();
        new Thread("Thread2") {
            public void run() {
                map.put("27", "A");
                System.out.println("put record. key: 27, value: A. result: " + map);
            };
        }.start();

        System.out.println("HashMapDemoResize1");
    }

}

本地我是先让两个线程都同时跑到if ((e = p.next) == null)，然后让Thread2先把key = "27"的数据put到链表末尾，再执行Thread1的代码。可以看到，打印出来的结果中，key = "27"的数据被覆盖了。

HashMapDemoResize1

示例2

import java.util.HashMap;

public class HashMapDemoResize2 {

    private static HashMap<String, String> map = new HashMap<String, String>(4);

    public static void main(String[] args) {

        map.put("12", "A");
        map.put("23", "B");
        map.put("34", "C");
        System.out.println("");

        new Thread("Thread1") {
            public void run() {
                map.put("45", "D");
                System.out.println("put record. key: 36, value: D. result: " + map);
            };
        }.start();
        new Thread("Thread2") {
            public void run() {
                map.put("56", "E");
                System.out.println("put record. key: 48, value: E. result: " + map);
            };
        }.start();

        System.out.println("HashMapDemoResize2");


    }
    
}

这里初始容量设置为4的目的是，threshold = capacity * load factor，threshold为3，也就是当put到第4个数据的时候，需要扩容resize()。
我本地让Thread1先跑到resize()方法中table = newTab这一步，然后让Thread2将key = "56"的键值对put进去，再让Thread1执行数据迁移操作。
从结果中可以看到，key = "56"的键值对被覆盖了。

HashMapDemoResize2

看到这里，相信你已经知道为什么说HashMap是线程不安全的了吧。如果要求线程安全的话，建议使用ConcurrentHashMap；如果不要求线程安全，直接使用HashMap即可。

如果这篇文章对你有帮助，不妨点个赞再走吧~ ^_