问题：

为什么pow(int，int)这么慢？

端木高卓

2023-03-14

我一直在做一些欧拉项目练习，以提高我的C语言知识。

我编写了以下函数：

int a = 0,b = 0,c = 0;

for (a = 1; a <= SUMTOTAL; a++)
{
    for (b = a+1; b <= SUMTOTAL-a; b++)
    {
        c = SUMTOTAL-(a+b);

        if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
        {
            std::cout << "a: " << a << " b: " << b << " c: "<< c << std::endl;
            std::cout << a * b * c << std::endl;
        }
    }
}

这将在17毫秒内计算。

但是，如果我改变路线

if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)

自

if (c == sqrt((a*a)+(b*b)) && b < c)

计算在2毫秒内完成。我是否遗漏了< code>pow(int，int)的一些明显的实现细节，导致第一个表达式的计算速度如此之慢？

共有2个答案

步嘉德

2023-03-14

你选择了一种最慢的检查方式

c*c == a*a + b*b   // assuming c is non-negative

这将编译为三个整数乘法（其中一个可以从循环中提升出来）。即使没有 pow（），您仍然会转换为双精度并取一个平方根，这对吞吐量来说是很糟糕的。（还有延迟，但分支预测在现代CPU上的推测执行意味着延迟不是这里的一个因素）。

Intel Haswell的SQRTSD指令的吞吐量是每8-14个周期一个(来源:Agner Fog的指令表)，所以即使你的< code>sqrt()版本保持FP sqrt执行单元饱和，它仍然比我让gcc发出的慢4倍左右(如下)。

您还可以优化循环条件，以在＜code＞b

void foo_optimized()
{ 
  for (int a = 1; a <= SUMTOTAL; a++) {
    for (int b = a+1; b < SUMTOTAL-a-b; b++) {
        // int c = SUMTOTAL-(a+b);   // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(
        int c = (SUMTOTAL-a) - b;
        // if (b >= c) break;  // just changed the loop condition instead

        // the compiler can hoist a*a out of the loop for us
        if (/* b < c && */ c*c == a*a + b*b) {
            // Just print a newline.  std::endl also flushes, which bloats the asm
            std::cout << "a: " << a << " b: " << b << " c: "<< c << '\n';
            std::cout << a * b * c << '\n';
        }
    }
  }
}

这将编译(使用gcc 6.2 < code >-O3-mtune = has well )以使用该内部循环进行编码。查看Godbolt编译器资源管理器上的完整代码。

# a*a is hoisted out of the loop.  It's in r15d
.L6:
    add     ebp, 1    # b++
    sub     ebx, 1    # c--
    add     r12d, r14d        # ivtmp.36, ivtmp.43  # not sure what this is or why it's in the loop, would have to look again at the asm outside
    cmp     ebp, ebx  # b, _39
    jg      .L13    ## This is the loop-exit branch, not-taken until the end
                    ## .L13 is the rest of the outer loop.
                    ##  It sets up for the next entry to this inner loop.
.L8:
    mov     eax, ebp        # multiply a copy of the counters
    mov     edx, ebx
    imul    eax, ebp        # b*b
    imul    edx, ebx        # c*c
    add     eax, r15d       # a*a + b*b
    cmp     edx, eax  # tmp137, tmp139
    jne     .L6
 ## Fall-through into the cout print code when we find a match
 ## extremely rare, so should predict near-perfectly

在Intel Haswell上，所有这些指令均为1 uop。（cmp/jcc将宏熔合配对为比较和分支UOP。）因此，有10个熔合域UOP，每2.5个周期可以在一次迭代中发出。

Haswell以每时钟一次迭代的吞吐量运行＜code＞imul r32，r32＜/code＞，因此内部环路内的两个乘法不会使端口1在每2.5c两个乘法器时饱和。这为吸收来自添加和子窃取端口1的不可避免的资源冲突留下了空间。

我们甚至没有接近任何其他执行端口瓶颈，因此前端瓶颈是唯一的问题，这应该在英特尔 Haswell 及以后的每 2.5 个周期迭代一次。

循环展开可以帮助减少每次检查的uops数量。例如，使用lea ecx，[rbx 1]为下一次迭代计算b 1，因此我们可以imul ebx， ebx而无需使用MOV使其非破坏性。

强度降低也是可能的：给定 b*b，我们可以尝试在没有 IMUL 的情况下计算 （b-1） * （b-1）。（b-1） * （b-1） = b*b - 2*b 1，所以也许我们可以做一个 lea ecx， [rbx*2 - 1] ，然后从 b*b 中减去它。（嗯，也许我们可以将 -b 保留在寄存器中，并计数为零，因此我们可以使用 lea ecx，[rcx rbx*2 - 1] 在 ECX 中更新 b*b，在 EBX 中给定 -b）。

除非您实际上遇到了IMUL吞吐量的瓶颈，否则这可能会花费更多的UOP，并且不会成功。看看编译器如何处理C源代码中的这种强度降低可能会很有趣。

您可能还可以使用 SSE 或 AVX 对此进行矢量化，并行检查 4 或 8 个连续的 b 值。由于命中非常罕见，您只需检查8个命中中的任何一个是否有命中，然后找出在极少数情况下匹配的是哪一个。

另请参阅 x86 标签 wiki 以获取更多优化内容。

巫马自明

2023-03-14

pow（）适用于真正的浮点数，并在引擎盖下使用公式

pow(x,y) = e^(y log(x))

计算x^y。int在调用pow之前转换为双。（log是自然对数，基于e）

因此，使用pow（）比x*x。

根据相关评论进行编辑

即使对整数指数使用pow也可能会产生不正确的结果（PaulMcKenzie）
除了使用具有双类型的数学函数外，pow是一个函数调用（而x*x不是）（jtband es）
许多现代编译器实际上会使用常数整数参数优化pow，但这不应该被依赖。

为什么pow(int，int)这么慢？

共有2个答案

相关问答

相关文章

相关阅读

相关工具

相关文档