JVM Code Optimization: Method Inlining

2019-05-27 WillMiao

What is method inlining

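Method inlining means that, at run time, the JVM replaces a call to a method that has been invoked often enough with the body of that method. This removes the cost of the call and lays the groundwork for further optimizations, which makes it one of the JVM's most important optimizations.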
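To make the idea concrete, here is a hand-written sketch of what the transformation amounts to. The class and method names (InlineSketch, add, sumWithCalls, sumInlined) are hypothetical and not from this post; the JIT performs this on compiled code, not on Java source.

```java
// Hypothetical illustration of inlining, written by hand at the source level.
class InlineSketch {
    // A small method that is cheap to inline.
    static int add(int a, int b) {
        return a + b;
    }

    // Before inlining: every loop iteration performs a call to add().
    static int sumWithCalls(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total = add(total, i);
        }
        return total;
    }

    // After inlining: the call is replaced by the body of add().
    static int sumInlined(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total = total + i;
        }
        return total;
    }

    public static void main(String[] args) {
        // Both versions compute the same value; only the call overhead differs.
        System.out.println(sumWithCalls(1000) + " " + sumInlined(1000));
    }
}
```

Once the body sits inside the loop, the compiler can also optimize across what used to be a call boundary, which is where much of the benefit comes from.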

How method inlining is done

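Method inlining is performed by the JIT compiler at run time. Because compilation is involved, inlining has its own cost in CPU time and memory, so this is once again the familiar trade-off. The JIT decides whether to inline based on the following: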

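Whether the callee is hot enough. This depends on how many times the method has been called. The default threshold is 10,000, which matches the default of -XX:CompileThreshold for the server compiler when tiered compilation is off. A method called more than 10,000 times at run time can therefore be considered hot.
Whether the callee has a suitable size. The JIT considers overly large methods unsuitable for inlining. The size threshold, measured in bytes of bytecode, is set by -XX:FreqInlineSize, and changing it is not recommended. Methods larger than this threshold are not considered for inlining.
Whether the callee's implementation can be uniquely determined at run time. For static methods and private methods (invokestatic and invokespecial in the bytecode), the JIT can obviously identify the exact code to run. The same holds for final methods, which cannot be overridden. For calls to overridable instance methods (invokevirtual and invokeinterface in the bytecode), the target may be the implementation in the class itself, a superclass, or a subclass (polymorphism), and inlining is possible only when the JIT can pin the call down to a single implementation.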
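The thresholds mentioned above can be read from a running JVM. The sketch below assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name InlineFlags is hypothetical. MaxInlineSize is not discussed in this post; it is included only because it is the companion size limit for call sites that are not hot.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the current values of a few inlining-related flags.
class InlineFlags {
    // Returns the flag's current value, or "unavailable" if this JVM does not expose it.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown flag or non-HotSpot JVM: report it instead of failing.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        String[] names = {"CompileThreshold", "FreqInlineSize", "MaxInlineSize"};
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

The printed values depend on the JVM version, platform, and command-line options, so treat them as a description of the current process and not as universal defaults.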

A concrete example

Let's look at a concrete example.

Suppose we have the following interface and implementations:

    interface Host {
        int compute(int x, int y);
    }

    static class HostA implements Host {
        @Override
        public int compute(int x, int y) {
            return x + y;
        }
    }

    static class HostB implements Host {
        @Override
        public int compute(int x, int y) {
            return x + x + y;
        }
    }

    static class HostC implements Host {
        @Override
        public int compute(int x, int y) {
            return x + y + y;
        }
    }

    static class HostD implements Host {
        @Override
        public int compute(int x, int y) {
            return x - y;
        }
    }

The following code is used to test the calls:

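import java.util.Random;

// The Host interface and the HostA..HostD classes above are nested inside TestInline.
public class TestInline {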
    private static final int COUNT = 2000000000;

    public static void main(String[] args) {
        System.out.println(arrayCompute() + " " + virtualCompute() + " " + interfaceCompute());
    }

    static long arrayCompute() {
        Host[] hosts = new Host[4];

        hosts[0] = new HostA();
        hosts[1] = new HostB();
        hosts[2] = new HostC();
        hosts[3] = new HostD();

        long start = System.currentTimeMillis();
        Random r = new Random(start);

        int x = r.nextInt(10);
        int y = r.nextInt(10);

        for (int i = 0; i < COUNT; i++) {
            for (Host host : hosts) {
                x = host.compute(x, y);
            }
        }

        return System.currentTimeMillis() - start;
    }

    static long virtualCompute() {
        HostA hostA = new HostA();
        HostB hostB = new HostB();
        HostC hostC = new HostC();
        HostD hostD = new HostD();

        long start = System.currentTimeMillis();
        Random r = new Random(start);

        int x = r.nextInt(10);
        int y = r.nextInt(10);

        for (int i = 0; i < COUNT; i++) {
            x = hostA.compute(x, y);
            x = hostB.compute(x, y);
            x = hostC.compute(x, y);
            x = hostD.compute(x, y);
        }

        return System.currentTimeMillis() - start;
    }

    static long interfaceCompute() {
        Host[] hosts = new Host[4];

        hosts[0] = new HostA();
        hosts[1] = new HostB();
        hosts[2] = new HostC();
        hosts[3] = new HostD();

        long start = System.currentTimeMillis();
        Random r = new Random(start);

        int x = r.nextInt(10);
        int y = r.nextInt(10);

        for (int i = 0; i < COUNT; i++) {
            x = hosts[0].compute(x, y);
            x = hosts[1].compute(x, y);
            x = hosts[2].compute(x, y);
            x = hosts[3].compute(x, y);
        }

        return System.currentTimeMillis() - start;
    }
}

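About the test methods:

arrayCompute creates an array Host[4] of the interface type Host, with each element pointing to an instance of a different implementation class, and calls the interface method many times in a for-each loop.
virtualCompute creates four references, one per implementation class, each pointing to an instance of that class, and calls the methods many times.
interfaceCompute is like arrayCompute, except that the inner loop of arrayCompute is unrolled into four explicit calls, which are likewise executed many times.

Add the following JVM startup options: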

-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining

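About the options:

PrintCompilation prints the JIT compilation log.
UnlockDiagnosticVMOptions and PrintInlining together print inlining information.

The three test methods (arrayCompute, virtualCompute, and interfaceCompute) do exactly the same work. Which one performs best in practice? Which Host::compute calls get inlined, and how much does that inlining affect performance?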

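Results and analysis

Run times (in milliseconds, in the order arrayCompute, virtualCompute, interfaceCompute):

43393 1477 1471

virtualCompute and interfaceCompute take about the same time, while arrayCompute takes far longer than either.

In the JIT compilation log and inlining output, arrayCompute shows the following entry: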

TestInline$Host::compute (0 bytes)   not inlineable

For virtualCompute and interfaceCompute, the log shows:

TestInline$HostA::compute (4 bytes)   inline (hot)
TestInline$HostB::compute (4 bytes)   inline (hot)
TestInline$HostC::compute (4 bytes)   inline (hot)
TestInline$HostD::compute (4 bytes)   inline (hot)

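The log shows that the JIT treated the Host::compute call in arrayCompute as "not inlineable" and did not inline it. For the Host::compute calls in virtualCompute and interfaceCompute, the JIT determined the concrete implementation behind each call and inlined it (note: inline (hot) means the call has been inlined). Together with the run times, this shows how large the gain from method inlining can be: in this example, about 29x (43393 ms versus roughly 1470 ms).

One interesting point: compared with arrayCompute, interfaceCompute merely unrolls the for-each loop, yet that is enough for the JIT to determine the concrete implementation behind each interface call and inline it correctly. I have not yet looked into the JIT logic behind this, so I am noting it here for later study.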
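One plausible explanation, which this post does not confirm, is that HotSpot profiles receiver types per call site and not per method. The single call site inside arrayCompute's inner loop sees four receiver classes, while each of the four call sites in interfaceCompute only ever sees one. The sketch below models that bookkeeping by hand. CallSiteProfile, the site labels, and the stand-in classes A to D are all hypothetical, and this is not how HotSpot implements profiling.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hand-written model of per-call-site receiver profiling; all names are hypothetical.
class CallSiteProfile {
    // Stand-ins for the four Host implementations.
    static class A {}
    static class B {}
    static class C {}
    static class D {}

    // Receiver classes observed so far, keyed by a label for the call site.
    private final Map<String, Set<Class<?>>> seen = new HashMap<>();

    void record(String site, Object receiver) {
        seen.computeIfAbsent(site, k -> new HashSet<>()).add(receiver.getClass());
    }

    int distinctTypes(String site) {
        Set<Class<?>> types = seen.get(site);
        return types == null ? 0 : types.size();
    }

    // Returns {types seen at the loop site, types seen at the first unrolled site}.
    static int[] simulate() {
        Object[] hosts = {new A(), new B(), new C(), new D()};
        CallSiteProfile profile = new CallSiteProfile();

        // arrayCompute style: one call site inside the loop receives every element.
        for (Object host : hosts) {
            profile.record("loop", host);
        }

        // interfaceCompute style: each unrolled call is its own call site.
        for (int i = 0; i < hosts.length; i++) {
            profile.record("unrolled" + i, hosts[i]);
        }

        return new int[] {profile.distinctTypes("loop"), profile.distinctTypes("unrolled0")};
    }

    public static void main(String[] args) {
        int[] counts = simulate();
        System.out.println("loop site types: " + counts[0]
                + ", first unrolled site types: " + counts[1]);
    }
}
```

Under this reading, a call site that has only seen one receiver class can be inlined behind a type check, whereas a site that has seen four different classes is left as an ordinary interface call.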

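Conclusions

For frequently called methods, it is well worth considering whether JIT method inlining can be used to optimize them.
To do that, we need a clear picture of the JIT's inlining logic and its preconditions.
When necessary, print the JIT compilation log and inlining output to confirm what the JIT actually does.

Keep in mind that the gain from method inlining comes not only from removing call overhead but, more importantly, from the further optimizations the JVM can apply once the code has been inlined. That part will be explored in a later blog post.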

Some additional observations

If we call and compare only virtualCompute and interfaceCompute, the run times (in milliseconds) are:

1984 1699

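The first method takes longer than the second. This remains true even when the call order of the two methods is swapped: whichever method runs first takes longer. My explanation is that the JIT needs to profile for a while before it can tell which calls are hot and inline them. Until then, the call overhead cannot be eliminated, which leads to the difference in run time above.

For interfaceCompute, the inlining output shows that the JIT initially also reports the Host::compute calls as "not inlineable". After a number of invocations, however, it resolves each call to the method implemented by the concrete instance and reports "inline (hot)". The JIT logic here needs further study.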
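If that explanation is right, running the workload a few times untimed before measuring should shrink the gap. The sketch below shows one simple way to do this. WarmupSketch, work, and measure are hypothetical names, the workload is a stand-in for the test methods in this post, and a dedicated benchmark harness would be more reliable than this.

```java
// Hypothetical warm-up harness: runs the workload untimed before the timed run.
class WarmupSketch {
    // Stand-in workload; replace with the method under test.
    static long work(int iterations) {
        long x = 1;
        for (int i = 0; i < iterations; i++) {
            x = x * 31 + i;
        }
        return x;
    }

    // Returns {elapsed nanoseconds of the timed run, accumulated result}.
    // The accumulated result is returned so the work cannot be discarded as dead code.
    static long[] measure(int warmupRuns, int iterations) {
        long sink = 0;
        for (int i = 0; i < warmupRuns; i++) {
            sink += work(iterations); // untimed: gives the JIT time to profile and compile
        }
        long start = System.nanoTime();
        sink += work(iterations);     // timed run
        long elapsedNanos = System.nanoTime() - start;
        return new long[] {elapsedNanos, sink};
    }

    public static void main(String[] args) {
        long[] cold = measure(0, 50_000_000);
        long[] warm = measure(5, 50_000_000);
        System.out.println("no warm-up: " + cold[0] / 1_000_000 + " ms, "
                + "after warm-up: " + warm[0] / 1_000_000 + " ms");
    }
}
```

The absolute numbers vary by machine and JVM version; what matters is whether the timed run becomes stable once warm-up runs are added.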

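TODO

Learn more about how the JIT determines the single concrete implementation behind a call to an overridable method.
Learn more about the further JVM optimizations that build on inlining.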