CMOV Shower Thoughts

A friend asked about the point of the CMOV instruction, so i wrote this while in the shower. I do plan to write better quality stuff about various things (with all the fancy graphs, and benchmarking and all that silly visual stimuli), unsure exactly what i will tackle yet, we will see. Anyway here is the post:

So mfs be asking quite OFTEN how CMOV helps optimize branch prediction. Which is a silly question bc it's a leading question that leads the wrong way (misprediction, ironic). CMOV doesn't help the predictor at all, it removes the branch. The move either happens or doesn't depending on the condition flag(s) that the specific CMOVcc variant checks.

Live CMOV Cam:


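cmp ecx, 10      ; sets ZF = 1 if ecx == 10
mov eax, FALSE   ; default result (MOV doesn't touch the flags)
mov ebx, TRUE    ; CMOV can't take an immediate, so stage TRUE in a register
cmove eax, ebx   ; eax = ebx only if ZF = 1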

A couple things of note. Firstly, notice how we can't pass immediate values into CMOV: the source has to be a register (or memory) operand, so we must load the values into registers first and operate on those.

Secondly, see the "cmovE". This is just the mnemonic for CMOV-if-Equal. So if the Zero Flag (ZF) is set, it will load "TRUE" into eax (from ebx).

If the cmp doesn't set ZF to 1, the CMOV leaves eax alone and we just move on. (Remember that CMP A, B just computes A - B, sets the flags, and throws the result away. If the result is 0, they are equal in value. This is why the "equal" flag is the Zero Flag, ZF.)
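If the flag business is easier to read as C, here is a rough model of what the four instructions above do. This is only a sketch: `flags_t`, `cmp_model`, `cmove_model` and `is_ten` are names i made up for illustration, only ZF is modelled, and it assumes TRUE = 1 and FALSE = 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the flags register: only ZF is tracked here. */
typedef struct {
    bool zf;
} flags_t;

/* Models `cmp a, b`: computes a - b, keeps only the flags, drops the result. */
static flags_t cmp_model(uint32_t a, uint32_t b) {
    flags_t f;
    f.zf = ((uint32_t)(a - b) == 0); /* ZF = 1 when the operands are equal */
    return f;
}

/* Models `cmove dst, src`: returns the new value of dst. */
static uint32_t cmove_model(flags_t f, uint32_t dst, uint32_t src) {
    return f.zf ? src : dst; /* ZF clear: dst keeps its old value */
}

/* The snippet from above, with ecx as the input.
   Assumption: TRUE = 1 and FALSE = 0. */
static uint32_t is_ten(uint32_t ecx) {
    flags_t f = cmp_model(ecx, 10);  /* cmp ecx, 10    */
    uint32_t eax = 0;                /* mov eax, FALSE */
    uint32_t ebx = 1;                /* mov ebx, TRUE  */
    return cmove_model(f, eax, ebx); /* cmove eax, ebx */
}
```

The thing to notice is that `cmove_model` always runs. There is no control flow around it, only a choice of which value comes out.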

The alternative is generally a CMP + JMP + MOV looking something like:

Live CMP+JMP+MOV cam:


mov eax, FALSE   ; default result, same as the CMOV version
cmp ecx, 10
je .settrue      ; taken when ZF = 1 (ecx == 10)
jmp .skip
.settrue:
mov eax, TRUE
.skip:
; whatever here...

Note that JE and JZ are the same instruction, just two mnemonics for the same opcode.

Given i am on a Haswell CPU (Thinkpad t440p ftw), let's say i am in a scenario where the branch is highly predictable. Here CMP -> JE -> MOV is ideal. Thanks to the magic of macro-fusion, CMP + JE can be fused together into 1 uOp, and if the branch is guessed properly the jump costs basically nothing. The whole thing only needs 2 uOps, and a correctly predicted branch also means the MOV doesn't have to wait on the compare, so overall latency goes down.

Another consideration is that CMOV is a 3 input instruction (flags, source reg, dest reg). On Haswell this means it can't go through the simple decoders and gets sent to the complex decoder instead, where it's split into 2 separate uOps.

This is done bc some bullshit with the reservation station and the execution units being designed for 2 input uOps.

So when is a CMOV ideal? 2 scenarios! Firstly, when a branch is unpredictable. A misprediction causes the pipeline to be flushed, with a penalty of roughly 14-18 cycles (fuck), so a CMOV with its consistent 1-2 cycles of latency is ideal for avoiding that penalty in cases where mispredictions are common.
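To make the unpredictable case a bit more concrete, here is a small C sketch of the same loop written two ways. The function names and the threshold example are mine, not from any real codebase. The second version phrases the condition as a select, which is the shape compilers tend to be able to lower to a CMOV, but that is an assumption and not a guarantee: the compiler is free to emit a branch (or a CMOV) for either one, so check the disassembly if you actually care.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: whether the add happens depends on the data, so on
   random input the predictor has nothing to learn from. */
static int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > t) {
            sum += v[i];
        }
    }
    return sum;
}

/* Select form: the add always happens, the condition only picks
   WHAT gets added (the element or 0). */
static int64_t sum_above_select(const int32_t *v, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t add = (v[i] > t) ? v[i] : 0; /* select, no control flow needed */
        sum += add;
    }
    return sum;
}
```

Both functions return the same result for the same input. The only difference is how much the cost depends on the data being predictable.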

Secondly, there is the topic of avoiding timing side channel attacks, since CMOV takes the same time whichever way the condition goes. i personally don't know how much this matters in practice, it seemed more hypothetical to me, though i could be wrong. Where it does come up is crypto code, where branching on a secret value can leak it through timing.
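For reference, the usual C-level way of writing a select that doesn't branch on the condition is the mask trick below. `ct_select` is a hypothetical helper name, and this is a sketch only: the C standard promises nothing about timing, and a compiler is allowed to turn this back into a branch, so code that really needs the guarantee has to check the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper: returns a if cond is nonzero, else b,
   without branching on cond at the source level. */
static uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    /* mask is all ones when cond != 0, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Same idea as CMOV: both inputs are always read and the condition only decides which one survives.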

In practice, using CMOV isn't something you should worry about, i'd say. Unless you are hyper autist, the compiler's heuristics are likely going to be fine at deciding what to use. Nonetheless it's interesting.

Anyways, if this is interesting, yay! I'll make more stuff. Stuff that's significantly better, i just need to get SOMETHING out there. If there are any inconsistencies/errors you see, please direct them to sparky@hackerbf.com. Thanks.