Question:

Is stream-based matrix multiplication 10x slower than a for loop?

孙翰墨

2023-03-14

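I have created a module that performs matrix multiplication using streams. It can be found here: https://github.com/firefly-math/firefly-math-lineal-real/

When I run benchmarks on matrices of size 100x100 and 1000x1000, Apache Commons Math (which uses for loops) turns out to be roughly 10 times faster than the corresponding stream implementation.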

# Run complete. Total time: 00:14:10

Benchmark                              Mode  Cnt      Score     Error      Units
MultiplyBenchmark.multiplyCM1000_1000  avgt   30   1040.804 ±  11.796  ms/op
MultiplyBenchmark.multiplyCM100_100    avgt   30      0.790 ±   0.010  ms/op
MultiplyBenchmark.multiplyFM1000_1000  avgt   30  11981.228 ± 405.812  ms/op
MultiplyBenchmark.multiplyFM100_100    avgt   30      7.224 ±   0.685  ms/op

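Am I doing something wrong in the benchmark (hopefully :) )?

I have added the methods under test so that everyone can see what is being compared. This is the Apache Commons Math Array2DRowRealMatrix.multiply() method: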

/**
 * Returns the result of postmultiplying {@code this} by {@code m}.
 *
 * @param m matrix to postmultiply by
 * @return {@code this * m}
 * @throws DimensionMismatchException if
 * {@code columnDimension(this) != rowDimension(m)}
 */
public Array2DRowRealMatrix multiply(final Array2DRowRealMatrix m)
    throws DimensionMismatchException {
    MatrixUtils.checkMultiplicationCompatible(this, m);

    final int nRows = this.getRowDimension();
    final int nCols = m.getColumnDimension();
    final int nSum = this.getColumnDimension();

    final double[][] outData = new double[nRows][nCols];
    // Will hold a column of "m".
    final double[] mCol = new double[nSum];
    final double[][] mData = m.data;

    // Multiply.
    for (int col = 0; col < nCols; col++) {
        // Copy all elements of column "col" of "m" so that
        // will be in contiguous memory.
        for (int mRow = 0; mRow < nSum; mRow++) {
            mCol[mRow] = mData[mRow][col];
        }

        for (int row = 0; row < nRows; row++) {
            final double[] dataRow = data[row];
            double sum = 0;
            for (int i = 0; i < nSum; i++) {
                sum += dataRow[i] * mCol[i];
            }
            outData[row][col] = sum;
        }
    }

    return new Array2DRowRealMatrix(outData, false);
}

And this is the corresponding stream implementation:

/**
 * Returns a {@link BinaryOperator} that multiplies {@link SimpleMatrix}
 * {@code m1} times {@link SimpleMatrix} {@code m2} (m1 X m2).
 * 
 * Example {@code multiply(true).apply(m1, m2);}
 * 
 * @param parallel
 *            Whether to perform the operation concurrently.
 * 
 * @throws MathException
 *             Of type {@code MATRIX_DIMENSION_MISMATCH__MULTIPLICATION} if
 *             {@code m} is not the same size as {@code this}.
 * 
 * @return the {@link BinaryOperator} that performs the operation.
 */
public static BinaryOperator<SimpleMatrix> multiply(boolean parallel) {

    return (m1, m2) -> {
        checkMultiplicationCompatible(m1, m2);

        double[][] a1 = m1.toArray();
        double[][] a2 = m2.toArray();

        Stream<double[]> stream = Arrays.stream(a1);
        stream = parallel ? stream.parallel() : stream;

        final double[][] result =
                stream.map(r -> range(0, a2[0].length)
                        .mapToDouble(i -> range(0, a2.length).mapToDouble(j -> r[j]
                                * a2[j][i]).sum())
                        .toArray()).toArray(double[][]::new);

        return new SimpleMatrix(result);
    };
}
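
Before comparing timings, it can help to confirm that both approaches compute the same product. The following is only a minimal sketch under stated assumptions: it uses plain double[][] arrays in place of Array2DRowRealMatrix and SimpleMatrix, and the class name MultiplyCheck and its method names are hypothetical, not part of either library.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper class; plain arrays stand in for the matrix types above.
final class MultiplyCheck {

    // Loop version, mirroring the column-copy approach of the first method.
    static double[][] multiplyLoop(double[][] a, double[][] b) {
        final int nRows = a.length;
        final int nCols = b[0].length;
        final int nSum = b.length;
        final double[][] out = new double[nRows][nCols];
        final double[] bCol = new double[nSum];
        for (int col = 0; col < nCols; col++) {
            // Copy column "col" of b into contiguous memory.
            for (int k = 0; k < nSum; k++) {
                bCol[k] = b[k][col];
            }
            for (int row = 0; row < nRows; row++) {
                double sum = 0;
                for (int k = 0; k < nSum; k++) {
                    sum += a[row][k] * bCol[k];
                }
                out[row][col] = sum;
            }
        }
        return out;
    }

    // Stream version, mirroring the nested range(...) pipelines of the second method.
    static double[][] multiplyStream(double[][] a, double[][] b) {
        return Arrays.stream(a)
                .map(r -> IntStream.range(0, b[0].length)
                        .mapToDouble(i -> IntStream.range(0, b.length)
                                .mapToDouble(j -> r[j] * b[j][i])
                                .sum())
                        .toArray())
                .toArray(double[][]::new);
    }
}
```

On real-valued data the two results may differ in the last bits, because DoubleStream.sum() is allowed to use compensated summation, so a comparison with a small tolerance is safer than exact equality.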

1 answer

艾子石

2023-03-14

Take a look at DoublePipeline.toArray:

public final double[] toArray() {
  return Nodes.flattenDouble((Node.OfDouble) evaluateToArrayNode(Double[]::new))
                    .asPrimitiveArray();
}

It appears that a boxed array is created first and then converted to a primitive array.
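
If the per-element stream pipelines and their toArray() calls are indeed the costly part, one option is to keep the stream only at the row level and compute each output row with plain loops. The sketch below is an untimed illustration of that idea, not a measured fix: the class name RowStreamMultiply is hypothetical, and plain double[][] arrays are assumed in place of SimpleMatrix.

```java
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical class: one stream element per row of "a", plain loops inside.
final class RowStreamMultiply {

    static double[][] multiply(double[][] a, double[][] b, boolean parallel) {
        final int nCols = b[0].length;
        final int nSum = b.length;

        Stream<double[]> rows = Arrays.stream(a);
        if (parallel) {
            rows = rows.parallel();
        }

        return rows.map(r -> {
            // Accumulate r[k] * b[k][j] into out[j], walking each row of b
            // in order so that memory access stays contiguous.
            final double[] out = new double[nCols];
            for (int k = 0; k < nSum; k++) {
                final double rk = r[k];
                final double[] bRow = b[k];
                for (int j = 0; j < nCols; j++) {
                    out[j] += rk * bRow[j];
                }
            }
            return out;
        }).toArray(double[][]::new);
    }
}
```

This creates one stream pipeline per multiplication instead of one per output element, while still allowing rows to be processed in parallel. Whether it narrows the gap to the for-loop version would have to be checked with the same benchmark.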
