Question:

x86-64: which is faster, imm64 or m64?

乐正德华

2023-03-14

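After testing about 10 billion times, the difference between imm64 and m64 on AMD64 is about 0.1 nanoseconds, and m64 seems to be the faster one, but I don't really understand why. Isn't the address of val_ptr in the code below an immediate value itself?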

# Text section
.section __TEXT,__text,regular,pure_instructions
# 64-bit code
.code64
# Intel syntax
.intel_syntax noprefix
# Target macOS High Sierra
.macosx_version_min 10,13,0

# Make those two test functions global for the C measurer
.globl _test1
.globl _test2

# Test 1, imm64
_test1:
  # Move the immediate value 0xDEADBEEFFEEDFACE to RAX (return value)
  movabs rax, 0xDEADBEEFFEEDFACE
  ret
# Test 2, m64
_test2:
  # Move from the RAM (val_ptr) to RAX (return value)
  mov rax, qword ptr [rip + val_ptr]
  ret
# Data section
.section __DATA,__data
val_ptr:
  .quad 0xDEADBEEFFEEDFACE

The measuring code is:

#include <stdio.h>            // For printf
#include <stdlib.h>           // For EXIT_SUCCESS
#include <math.h>             // For fabs
#include <stdint.h>           // For uint64_t
#include <stddef.h>           // For size_t
#include <string.h>           // For memset
#include <mach/mach_time.h>   // For time stuff

#define FUNCTION_COUNT  2     // Number of functions to test
#define TEST_COUNT      0x10000000  // Number of times to test each function

// Type aliases
typedef uint64_t rettype_t;
typedef rettype_t(*function_t)();

// External test functions (defined in Assembly)
rettype_t test1();
rettype_t test2();

// Program entry point
int main() {

  // Time measurement stuff
  mach_timebase_info_data_t info;
  mach_timebase_info(&info);

  // Sums to divide by the test count to get average
  double sums[FUNCTION_COUNT];

  // Initialize sums to 0
  memset(&sums, 0, FUNCTION_COUNT * sizeof (double));

  // Functions to test
  function_t functions[FUNCTION_COUNT] = {test1, test2};

  // Useless results (should be 0xDEADBEEFFEEDFACE), but good to have
  rettype_t results[FUNCTION_COUNT];

  // Function loop, may get unrolled based on optimization level
  for (size_t test_fn = 0; test_fn < FUNCTION_COUNT; test_fn++) {
    // Test this MANY times
    for (size_t test_num = 0; test_num < TEST_COUNT; test_num++) {
      // Get the nanoseconds before the action
      double nanoseconds = mach_absolute_time();
      // Do the action
      results[test_fn] = functions[test_fn]();
      // Measure the time it took
      nanoseconds = mach_absolute_time() - nanoseconds;

      // Convert it to nanoseconds
      nanoseconds *= info.numer;
      nanoseconds /= info.denom;

      // Add the nanosecond count to the sum
      sums[test_fn] += nanoseconds;
    }
  }
  // Compute the average
  for (size_t i = 0; i < FUNCTION_COUNT; i++) {
    sums[i] /= TEST_COUNT;
  }

  if (FUNCTION_COUNT == 2) {
    // Print some fancy information
    printf("Test 1 took %f nanoseconds average.\n", sums[0]);
    printf("Test 2 took %f nanoseconds average.\n", sums[1]);
    printf("Test %d was faster, with %f nanoseconds difference\n", sums[0] < sums[1] ? 1 : 2, fabs(sums[0] - sums[1]));
  } else {
    // Else, just print something
    for (size_t fn_i = 0; fn_i < FUNCTION_COUNT; fn_i++) {
      printf("Test %zu took %f nanoseconds average.\n", fn_i + 1, sums[fn_i]);
    }
  }

  // Everything went fine!
  return EXIT_SUCCESS;
}

1 answer

狄新立

2023-03-14

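You don't show the actual loop you tested with, or say how you measured time. Apparently you measured wall-clock time, not core clock cycles (with performance counters). So your sources of measurement noise include turbo / power-saving frequency changes, as well as sharing a physical core with another logical thread (on an i7).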
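One way to reduce the timer overhead that a per-call measurement adds is to read the clock once before and once after the whole loop. The following is only a minimal sketch under assumptions: `bench_fn`, `return_constant`, and `average_ns_per_call` are hypothetical names, `return_constant` stands in for the external asm functions, and POSIX `clock_gettime` is used in place of `mach_absolute_time`. It still measures wall-clock time, so frequency scaling and a busy sibling hyperthread remain sources of noise.

```c
#define _POSIX_C_SOURCE 199309L  /* Ask for clock_gettime on strict C builds */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical function-pointer type for the code under test. */
typedef uint64_t (*bench_fn)(void);

/* Hypothetical stand-in for test1/test2. A C function in the same file
   may be inlined; the real asm functions cannot be. */
static uint64_t return_constant(void) {
    return 0xDEADBEEFFEEDFACEULL;
}

/* Time `iters` back-to-back calls with a single pair of clock reads and
   return the average wall-clock nanoseconds per call. The XOR of all
   return values is stored in *sink so the calls are not dead code. */
static double average_ns_per_call(bench_fn fn, size_t iters, uint64_t *sink) {
    struct timespec start, end;
    uint64_t acc = 0;

    assert(iters > 0);  /* Avoid dividing by zero below */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iters; i++) {
        acc ^= fn();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    *sink = acc;
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

Calling `average_ns_per_call(test1, TEST_COUNT, &sink)` and then the same for `test2` would be the analogous use with the functions from the question. Counting core clock cycles with hardware performance counters would still be the more reliable measurement.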

On Intel IvyBridge:

movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction.


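Compilers often choose to generate 64-bit constants with mov r64, imm64. (Related: What are the best instruction sequences to generate vector constants on the fly? In practice, though, those sequences never come up for scalar integers, because there is no short single-instruction way to get a 64-bit -1.)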

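That is generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache, loading it from .rodata can be a win. That is especially true if it lets you do and rax, [constant] instead of movabs r8, imm64 / and rax, r8.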
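To look at the two forms from C, something like the sketch below can be compiled with optimization and the generated asm inspected. The names `mask_with_immediate`, `mask_with_load`, and `mask_in_memory` are hypothetical, and the `volatile` qualifier is only an illustration trick to stop the compiler from folding the memory constant back into an immediate. Which instructions a given compiler actually emits is not guaranteed.

```c
#include <stdint.h>

/* Constant written as a literal: it does not fit in a sign-extended
   32-bit immediate, so a compiler typically materializes it in a
   register with mov r64, imm64 and then ANDs register with register. */
uint64_t mask_with_immediate(uint64_t x) {
    return x & 0xDEADBEEFFEEDFACEULL;
}

/* Hypothetical constant kept in memory. volatile forces a real load
   here; it is not something production code would normally want. */
static const volatile uint64_t mask_in_memory = 0xDEADBEEFFEEDFACEULL;

/* Constant read from memory: this gives the compiler the option of an
   AND with a memory source operand. */
uint64_t mask_with_load(uint64_t x) {
    return x & mask_in_memory;
}
```

Both functions return the same value for any input; only the way the constant reaches the AND differs.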

If your 64-bit constant is an address, use a RIP-relative lea instead, if possible: lea rax, [rel my_symbol] in NASM syntax, or lea my_symbol(%rip), %rax in AT&T syntax.
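The difference between an address and the value stored at that address can also be seen from C. This is a small sketch with hypothetical names (`constant_in_memory`, `address_of_constant`, `load_constant`); it assumes a position-independent build, where returning the address of a global is typically compiled to a RIP-relative lea and reading it to a RIP-relative load.

```c
#include <stdint.h>

/* Hypothetical global playing the role of val_ptr in the question. */
static const uint64_t constant_in_memory = 0xDEADBEEFFEEDFACEULL;

/* Returns the address only. No data memory is read here; the address
   is computed relative to the instruction pointer. */
const uint64_t *address_of_constant(void) {
    return &constant_in_memory;
}

/* Returns the 64-bit value stored at that address, which needs a load
   (unless the compiler folds the known constant at compile time). */
uint64_t load_constant(void) {
    return *address_of_constant();
}
```

In the question's `_test2`, the address `rip + val_ptr` is encoded in the instruction as a 32-bit displacement, but the 64-bit value still has to be loaded from memory.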

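Surrounding code matters a lot when considering tiny sequences of asm, especially when the alternatives compete for different throughput resources.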
