Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

10/14 13:29


Background:

While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction and removed it.

To my surprise, removing the unnecessary instruction caused my program to slow down.

I found that adding arbitrary, useless MOV instructions increased performance even further.

The effect is erratic and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

I understand that the CPU does all kinds of optimizations and streamlining, but this seems more like black magic.

The data:

A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes.)

The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without:        1836.44 ms
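
For scale, the gap between the two averages can be worked out directly. The sketch below is plain arithmetic on the figures above. The per-iteration and per-cycle numbers additionally assume that the whole gap comes from the 2**20-iteration loop, that each timed run executes that loop once, and that the CPU runs at its nominal 2.13 GHz; none of these assumptions has been verified, and the function name gap_summary is made up for illustration.

```python
# Figures taken from the measurements above.
WITH_JUNKOPS_MS = 1822.84
WITHOUT_MS = 1836.44
ITERATIONS = 2 ** 20

# Assumption: nominal clock of the Core 2 6400, ignoring frequency scaling.
CLOCK_HZ = 2.13e9


def gap_summary(slow_ms, fast_ms, iterations, clock_hz):
    """Return the absolute, relative, per-iteration and per-cycle gap."""
    gap_ms = slow_ms - fast_ms
    relative = gap_ms / slow_ms
    # Assumption: the whole gap is spent inside the loop, once per run.
    ns_per_iteration = gap_ms * 1e6 / iterations
    cycles_per_iteration = ns_per_iteration * 1e-9 * clock_hz
    return gap_ms, relative, ns_per_iteration, cycles_per_iteration


if __name__ == "__main__":
    gap_ms, relative, ns_iter, cycles_iter = gap_summary(
        WITHOUT_MS, WITH_JUNKOPS_MS, ITERATIONS, CLOCK_HZ
    )
    print(f"gap: {gap_ms:.2f} ms ({relative:.2%})")
    print(f"per iteration: {ns_iter:.2f} ns, about {cycles_iter:.1f} cycles")
```

The absolute gap is 13.6 ms, or roughly 0.74% of the slower average, so the effect is small but measurable.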

The programs were run 25 times in a loop, with the run order changing randomly each time.
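
A measurement loop of that shape could look roughly like the following Python sketch. It is not the harness that produced the numbers above: the function name benchmark, its parameters, and the placeholder workloads are all assumptions made for illustration. It times every variant once per round, reshuffles the order before each round, and reports the mean per variant.

```python
import random
import statistics
import time


def benchmark(variants, rounds=25, seed=None):
    """Time each variant once per round, in a freshly shuffled order.

    variants: dict mapping a label to a zero-argument callable.
    Returns a dict mapping each label to its mean wall time in milliseconds.
    """
    rng = random.Random(seed)
    samples = {name: [] for name in variants}
    names = list(variants)
    for _ in range(rounds):
        rng.shuffle(names)  # new run order every round
        for name in names:
            start = time.perf_counter()
            variants[name]()
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            samples[name].append(elapsed_ms)
    return {name: statistics.mean(times) for name, times in samples.items()}


if __name__ == "__main__":
    # Placeholder workloads standing in for the two compiled programs.
    demo = {
        "with_junkops": lambda: sum(range(100_000)),
        "without": lambda: sum(range(100_000)),
    }
    for name, mean_ms in benchmark(demo, rounds=25, seed=1).items():
        print(f"avg time (ms) {name}: {mean_ms:.3f}")
```

To time two real executables, each callable could instead wrap a call to subprocess.run on the corresponding binary. Shuffling the order each round keeps effects such as cache warm-up or frequency changes from consistently favoring one variant.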

Excerpt:

{$asmmode intel}
procedure example_junkop_in_sha256;
  var s1, t2 : uint32;
begin
    // Here are parts of the SHA-256 algorithm, in Pascal:
    // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
    // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
    // Here is how I translated them (side by side to show symmetry):
  asm
    MOV r8d, a                 ; MOV r9d, e
    ROR r8d, 2                 ; ROR r9d, 6
    MOV r10d, r8d              ; MOV r11d, r9d
    ROR r8d, 11    {13 total}  ; ROR r9d, 5     {11 total}
    XOR r10d, r8d              ; XOR r11d, r9d
    ROR r8d, 9     {22 total}  ; ROR r9d, 14    {25 total}
    XOR r10d, r8d              ; XOR r11d, r9d

    // Here is the extraneous operation; removing it caused a slowdown.
    // s1 is the uint32 variable declared at the start of the Pascal code.
    // I had cleaned up the code, so I no longer needed this variable, and
    // could just leave the value sitting in the r11d register until I needed
    // it again later.
    // Since copying to RAM seemed like a waste, I removed the instruction,
    // only to discover that the code ran slower without it.
    MOV s1,  r11d

    // The next part of the code just moves on to another part of SHA-256,
    // maj { r12d } := (a and b) xor (a and c) xor (b and c)
    mov r8d,  a
    mov r9d,  b
    mov r13d, r9d // Set aside a copy of b
    and r9d,  r8d

    mov r12d, c
    and r8d, r12d  { a and c }
    xor r9d, r8d

    and r12d, r13d { c and b }
    xor r12d, r9d

    // Copying the calculated value to the same s1 variable is another speedup.
    // As far as I can tell, it doesn't actually matter what register is copied,
    // but moving this line up or down makes a huge difference.
    MOV s1,  r9d // after mov r12d, c

    // And here is where the two calculated values above are actually used:
    // T2 {r12d} := S0 {r10d} + Maj {r12d};
    ADD r12d, r10d
    MOV T2, r12d
  end;
end;
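
The register sequences in the excerpt can be checked against the Pascal formulas in its comments. The Python sketch below is not part of the program: it models each 32-bit register as an integer masked to 32 bits, and every function name in it is invented for illustration. It mirrors the cumulative rotations (2, then 11 more, then 9 more for s0; 6, then 5 more, then 14 more for s1) and the three-AND, two-XOR sequence used for maj, then compares them with the textbook definitions on random inputs.

```python
import random

MASK32 = 0xFFFFFFFF


def ror(x, n):
    """Rotate the 32-bit value x right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def s0_reference(a):
    return ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)


def s1_reference(e):
    return ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)


def maj_reference(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)


def sigma_chained(x, first, second, third):
    """Mirror the excerpt: keep rotating one working copy, XOR into another."""
    work = ror(x, first)       # MOV r8d, a   / ROR r8d, first
    acc = work                 # MOV r10d, r8d
    work = ror(work, second)   # ROR r8d, second (cumulative rotation)
    acc ^= work                # XOR r10d, r8d
    work = ror(work, third)    # ROR r8d, third (cumulative rotation)
    acc ^= work                # XOR r10d, r8d
    return acc


def maj_as_in_excerpt(a, b, c):
    """Mirror the register moves used for maj in the excerpt."""
    r8, r9 = a, b
    r13 = r9       # set aside a copy of b
    r9 &= r8       # a and b
    r12 = c
    r8 &= r12      # a and c
    r9 ^= r8       # (a and b) xor (a and c)
    r12 &= r13     # c and b
    r12 ^= r9      # full majority value
    return r12


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        a, b, c, e = (rng.getrandbits(32) for _ in range(4))
        assert sigma_chained(a, 2, 11, 9) == s0_reference(a)
        assert sigma_chained(e, 6, 5, 14) == s1_reference(e)
        assert maj_as_in_excerpt(a, b, c) == maj_reference(a, b, c)
    print("register sequences match the reference formulas")
```

The stores to s1 do not appear in this model at all, because they do not feed into s0, maj, or T2. That is what makes them junk operations from the algorithm's point of view.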


Try it yourself:

The code is online at GitHub if you want to try it out yourself.

My questions:

  • Why would uselessly copying a register's contents to RAM ever increase performance?
  • Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
  • Is this behavior something that could be exploited predictably by a compiler?

