问题：

简单地添加一个方法参数（更苗条的jit代码）带来的10%+性能提升

燕烨

2023-03-14

代码在文章的末尾。剧透：我还发现jit代码胖了224字节，即使它实际上应该更简单（就像字节代码大小显示的那样；请参见下面的最新更新）。

即使在尝试使用一些微基准测试工具(JMH)来消除这种影响之后，性能差异仍然存在。

我在问：为什么生成的本机代码会有这样的差异，它在做什么？

通过给一个方法添加一个参数，它使它更快...！我知道GC/JIT/Warmup/etc的影响。您可以按原样运行代码，也可以使用较大/较小的迭代次数运行代码。实际上，您甚至应该先注释掉一个然后注释另一个perf测试，并在不同的jvm实例中运行每个测试，以证明它们之间没有干扰。

字节码没有显示出太大的区别，除了sleft/sright的明显的getstatic之外，还有一个奇怪的'iload4'而不是“iload_3”（和istore4/istore_3)

到底是怎么回事？iLoad3/iStore3真的比iLoad4/iStore4慢吗？甚至添加的getstatic调用也没有让它变慢吗？我可以猜测静态字段没有使用，所以jit可能会跳过它。

无论如何，在我这方面没有任何歧义，因为它总是可复制的，我正在寻找为什么JavaC/JIT会做它们所做的事情，以及为什么性能会受到如此大的影响的解释。这些是相同的递归algo具有相同的数据，相同的内存搅动，等等...如果我愿意，我无法进行更孤立的更改，以显示显著的可复制的运行时差异。

环境：

java version "1.8.0_161" 
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
(also tried and reproduced on java9)
on a 4 core i5 laptop 8GB ram.
windows 10 with the meltdown/specter patch.

使用-verbose:gc-xx:+printcompilation，没有gc，jit编译在C2（第4层）中已经稳定下来。

n=20000:

main]: qs1: 1561.3336199999999 ms (res=null)
main]: qs2: 1749.748416 ms (res=null)

main]: qs1: 1422.0767509999998 ms (res=null)
main]: qs2: 1700.4858689999999 ms (res=null)

main]: qs1: 1519.5280269999998 ms (res=null)
main]: qs2: 1786.2206899999999 ms (res=null)

main]: qs1: 1450.0802979999999 ms (res=null)
main]: qs2: 1675.223256 ms (res=null)

main]: qs1: 1452.373318 ms (res=null)
main]: qs2: 1634.581156 ms (res=null)

顺便说一句，漂亮的java9似乎让两者都更快，但仍然比对方低10-15%

[0.039s][info][gc] Using G1
main]: qs1: 1287.062819 ms (res=null)
main]: qs2: 1451.041391 ms (res=null)

main]: qs1: 1240.800305 ms (res=null)
main]: qs2: 1391.2404299999998 ms (res=null)

main]: qs1: 1257.1707159999999 ms (res=null)
main]: qs2: 1433.84716 ms (res=null)

main]: qs1: 1233.7582109999998 ms (res=null)
main]: qs2: 1394.7195849999998 ms (res=null)

main]: qs1: 1250.885867 ms (res=null)
main]: qs2: 1413.88437 ms (res=null)

main]: qs1: 1261.5805739999998 ms (res=null)
main]: qs2: 1458.974334 ms (res=null)

main]: qs1: 1237.039902 ms (res=null)
main]: qs2: 1394.823751 ms (res=null)

main]: qs1: 1255.14672 ms (res=null)
main]: qs2: 1400.693295 ms (res=null)

main]: qs1: 1293.009808 ms (res=null)
main]: qs2: 1432.430952 ms (res=null)

main]: qs1: 1262.839628 ms (res=null)
main]: qs2: 1421.376644 ms (res=null)

代码（包括测试）：

（请不要注意这个快速排序有多糟糕；它与问题无关）。

import java.util.Random;
import java.util.concurrent.Callable;

public class QuicksortTrimmed {

    static void p(Object msg) {
        System.out.println(Thread.currentThread().getName()+"]: "+msg);
    }

    static void perf(int N, String msg, Callable c) throws Exception {
        Object res = null;
        long t = System.nanoTime();
        for(int i=0; i<N; i++) {
            res = c.call();
        }
        Double d = 1e-6*(System.nanoTime() - t);
        p(msg+": "+d+" ms (res="+res+")");
    }

    static String und = "__________";//10
    static {
        und += und;//20
        und += und;//40
        und += und;//80
        und += und;//160
    }

    static String sleft = "//////////";//10
    static {
        sleft += sleft;//20
        sleft += sleft;//40
        sleft += sleft;//80
        sleft += sleft;//160
    }

    static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
    static {
        sright += sright;//20
        sright += sright;//40
        sright += sright;//80
        sright += sright;//160
    }

    //============

    public static void main(String[] args) throws Exception {
        int N = 20000;
        int n = 1000;
        int bound = 10000;
        Random r = new Random(1);
        for(int i=0; i<5; i++) {
            testperf(N, r, n, bound);
            //System.gc();
        }
    }

    static void testperf(int N, Random r, int n, int bound) throws Exception {
        final int[] orig = r.ints(n, 0, bound).toArray();
        final int[] a = orig.clone();

        perf(N, "qs1", () -> {
            System.arraycopy(orig, 0, a, 0, orig.length);
            quicksort1(a, 0, a.length-1, und);
            return null;
        });

        perf(N, "qs2", () -> {
            System.arraycopy(orig, 0, a, 0, orig.length);
            quicksort2(a, 0, a.length-1);
            return null;
        });
        System.out.println();
    }


    private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort1(a, _from, j-1, sleft);
            if(j+1 < _to)
                quicksort1(a, j+1, _to, sright);
        }
    }

    private static void quicksort2(int[] a, final int _from, final int _to) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort2(a, _from, j-1);
            if(j+1 < _to)
                quicksort2(a, j+1, _to);
        }
    }

}

更新：

我做了JMH测试，它仍然证明quicksort1比QuickSort2快。

# Run complete. Total time: 00:13:38

Benchmark                    Mode  Cnt      Score    Error  Units
MyBenchmark.testQuickSort1  thrpt  200  14861.437 ± 86.707  ops/s
MyBenchmark.testQuickSort2  thrpt  200  13097.209 ± 46.178  ops/s

以下是JMH基准：

package org.sample;

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {
    static String und = "__________";//10
    static {
        und += und;//20
        und += und;//40
        und += und;//80
        und += und;//160
    }

    static String sleft = "//////////";//10
    static {
        sleft += sleft;//20
        sleft += sleft;//40
        sleft += sleft;//80
        sleft += sleft;//160
    }

    static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
    static {
        sright += sright;//20
        sright += sright;//40
        sright += sright;//80
        sright += sright;//160
    }

    static final int n = 1000;
    static final int bound = 10000;
    static Random r = new Random(1);
    static final int[] orig = r.ints(n, 0, bound).toArray();

    @State(Scope.Thread)
    public static class ThrState {
        int[] a;

        @Setup(Level.Invocation)
        public void setup() {
            a = orig.clone();
        }
    }

    //============

    @Benchmark
    public void testQuickSort1(Blackhole bh, ThrState state) {
        int[] a = state.a;
        quicksort1(a, 0, a.length-1, und);
        bh.consume(a);
    }

    @Benchmark
    public void testQuickSort2(Blackhole bh, ThrState state) {
        int[] a = state.a;
        quicksort2(a, 0, a.length-1);
        bh.consume(a);
    }


    private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort1(a, _from, j-1, sleft);
            if(j+1 < _to)
                quicksort1(a, j+1, _to, sright);
        }
    }

    private static void quicksort2(int[] a, final int _from, final int _to) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort2(a, _from, j-1);
            if(j+1 < _to)
                quicksort2(a, j+1, _to);
        }
    }

}

更新：

此时，我运行并捕获了jitwatch的jit日志（我使用了1.3.0标记，并从https://github.com/adoptopenjdk/jitwatch/tree/1.3.0)构建

-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:"./gc.log"
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintCompilation
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+UnlockDiagnosticVMOptions  -XX:+LogCompilation -XX:+PrintInlining

使用额外的方法参数(quicksort1):字节代码=187字节，本机代码=1872字节

没有额外的方法参数(quicksort2):字节代码=178字节（小9字节）本机代码=2096字节（大224字节！！！）

这有力地证明了QuickSort2中的jit代码更胖、更慢。

我终于掌握了汇编代码，但正如我所料，几乎不可能区分和理解发生了什么。我遵循了从https://stackoverflow.com/a/24524285/2023577找到的最新指令。我有7MB的xml日志文件（压缩到675KB)，您可以在https://wetransfer.com/downloads/65fe0e94ab409d57cba1b95459064dd420180427150905/612dc9获得它并查看7天（过期至2018年5月4日），以防您理解它（当然是在jitwatch！）。

添加的字符串参数使程序集代码更加紧凑。问题（仍未回答）是为什么？在汇编代码中有什么不同？在较慢的代码中没有使用的规则或优化是什么？

共有1个答案

孔星宇

2023-03-14

再现与分析

我复制了你的结果。机器数据：

Linux #143-Ubuntu x86_64 GNU/Linux
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

我重写了您的代码一点，我做了一些额外的测试。您的测试时间包括system.arraycopy()调用。我创建了一个400MB的数组结构并保存了它：

int[][][] data = new int[iterations][testCases][];
for (int iteration = 0; iteration < data.length; iteration++) {
    for (int testcase = 0; testcase < data[iteration].length; testcase++) {
        data[iteration][testcase] = random.ints(numberCount, 0, bound).toArray();
    }
}

FileOutputStream fos = new FileOutputStream("test_array.dat");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(data);

在此之后，我运行了以下测试（热身、拆分以及运行）：

{
    FileInputStream fis = new FileInputStream(fileName);
    ObjectInputStream iis = new ObjectInputStream(fis);
    int[][][] data = (int[][][]) iis.readObject();


    perf("qs2", () -> {
        for (int iteration = 0; iteration < data.length; iteration++) {
            for (int testCase = 0; testCase < data[iteration].length; testCase++) {
                quicksort2(data[iteration][testCase], 0, data[iteration][testCase].length - 1);
            }
        }
        return null;
    });
}
{
    FileInputStream fis = new FileInputStream(fileName);
    ObjectInputStream iis = new ObjectInputStream(fis);
    int[][][] data = (int[][][]) iis.readObject();


    perf("qs1", () -> {
        for (int iteration = 0; iteration < data.length; iteration++) {
            for (int testCase = 0; testCase < data[iteration].length; testCase++) {
                quicksort1(data[iteration][testCase], 0, data[iteration][testCase].length - 1, und);
            }
        }
        return null;
    });
}

如果我同时运行qs1和qs2：

main]: qs1: 6646.219874 ms (res=null)
main]: qs2: 7418.376646 ms (res=null)

结果与执行顺序无关：

main]: qs2: 7526.215395 ms (res=null)
main]: qs1: 6624.261529 ms (res=null)

实例一：

main]: qs1: 6592.699738 ms (res=null)

实例二：

main]: qs2: 7456.326028 ms (res=null)

如果您在没有JIT的情况下尝试：

-Djava.compiler=NONE

结果如“预期”（较小的字节码速度更快）：

main]: qs1: 56547.589942 ms (res=null)
main]: qs2: 53585.909246 ms (res=null)

为了更好地分析，我将代码提取到两个不同的类中。

我使用jclasslib进行字节码检查。方法的长度为：

Q1: 505
Q2: 480

这对于没有JIT的执行是有意义的：

53585.909246×505÷480 = 56376.842019229

接近56547.589942。

对于我来说，在编译输出中（使用-xx:+printcompilation)，我有以下几行

1940  257       2       QS1::sort (185 bytes)
1953  258 %     4       QS1::sort @ 73 (185 bytes)
1980  259       4       QS1::sort (185 bytes)
1991  257       2       QS1::sort (185 bytes)   made not entrant
9640  271       3       QS2::sort (178 bytes)
9641  272       4       QS2::sort (178 bytes)
9654  271       3       QS2::sort (178 bytes)   made not entrant

其中%表示堆栈替换（编译后的代码正在运行）。根据这个日志，使用额外字符串参数的调用会得到优化，而第二个调用不会。我在考虑更好的分支预测，但这里不应该是这样（试图添加randomli生成的字符串作为参数）。样本大小(400MB)基本上排除了缓存。我想优化treshold，但是当我使用这些选项-xcomp-xx:+printcompilization-xbatch时，输出如下：

 6408 3254    b  3       QS1::sort (185 bytes)
 6409 3255    b  4       QS1::sort (185 bytes)
 6413 3254       3       QS1::sort (185 bytes)   made not entrant
14580 3269    b  3       QS2::sort (178 bytes)
14580 3270    b  4       QS2::sort (178 bytes)
14584 3269       3       QS2::sort (178 bytes)   made not entrant

这意味着metods在调用之前被阻塞编译，但时间保持不变：

main]: qs1: 6982.721328 ms (res=null)
main]: qs2: 7606.077812 ms (res=null)

我认为关键是字符串。如果我将额外的（未使用的）参数更改为int，那么它的处理速度会稍微慢一点（使用前面的优化参数运行）：

main]: qs1: 7925.472909 ms (res=null)
main]: qs2: 7727.628422 ms (res=null)

我的结论是，优化可能会受到额外参数对象类型的影响。可能在原语的情况下没有那么急切的优化，这对我来说是有意义的，但我找不到这一说法的确切来源。

类似资料：

带传单的苗条

我正在努力找到我的方式，以苗条结合传单。我遇到的问题是如何正确地将传单组件拆分为文件。为了学习，我正在尝试用Svelte构建官方的官方传单quickstart。这就是我的App.Svelte的样子：和“我的圈子”组件：虽然这起作用，但我认为考虑每个组件并使用将其添加到映射中并不有效。如何将map对象传递给circle组件，或者是否有更好的模式来构建包含多个组件的map？注意：我知道Svel
jQuery代码性能优化的10种方法

本文向大家介绍jQuery代码性能优化的10种方法，包括了jQuery代码性能优化的10种方法的使用技巧和注意事项，需要的朋友参考一下 1、总是使用#id去寻找element. 在jQuery中最快的选择器是ID选择器 ($('#someid')). 这是因为它直接映射为JavaScript的getElementById()方法。选择单个元素选择button的性能不好的一种方式：取而代之的是
在Java 8中有没有一种简单的方法来迭代一个带有索引的流？

有没有一种简单的方法可以在访问流中的索引的同时对流进行迭代？这与上面给出的LINQ示例相比似乎相当令人失望有没有更简洁的方式？而且，拉链似乎不是动了就是被拆了...
Gson：有没有更简单的方法来序列化地图

问题内容：来自Gson项目的此链接似乎表明，将类型化的Map序列化为JSON时，我将必须执行以下操作：很酷，这行得通，但是似乎有很多开销（整个Type Adapter类？）。我使用了其他JSONLib之类的JSON库，它们使您可以通过以下方式构建地图：或者，如果我有一个自定义类，则如下所示：该过程更加手动，但是所需的代码更少，并且没有开销来为Number创建自定义类型适配器，或者在大多
使用FriendFeed来提升MySQL性能的方法

本文向大家介绍使用FriendFeed来提升MySQL性能的方法，包括了使用FriendFeed来提升MySQL性能的方法的使用技巧和注意事项，需要的朋友参考一下背景我们使用MySQL存储了FriendFeed的所有数据。数据库随着用户基数的增长而增长了很多。现在已经存储了超过2.5亿条记录与一堆涵盖了从评论和“喜欢”到好友列表的其他数据。随着数据的增长，我们也曾迭代地解决了随着如此迅猛的
利用jQuery来动态为属性添加或者删除属性的简单方法

本文向大家介绍利用jQuery来动态为属性添加或者删除属性的简单方法，包括了利用jQuery来动态为属性添加或者删除属性的简单方法的使用技巧和注意事项，需要的朋友参考一下现在做的项目有这样一个需要：先看图吧^^ 要求： 1、当点击导出Excel方式的时候，如果是“勾选导出”或“不分页导出”时，下面的文本框不能修改 2、当点击“分页导出”时，第一个文本框中的值可以被修改，但第二个文本框中的值

简单地添加一个方法参数（更苗条的jit代码）带来的10%+性能提升

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档