More recently I have contributed a patch (https://github.com/llvm/llvm-project/commit/796b218edd358ab112e876e9c22490987dae7f06) related to div instruction optimization for wider 128 integer types with some constant divisors.
Before the fix llvm was generating libcalls to compiler-rt for those instructions which has a huge cost of function calls to external libraries, also they can not be optimized by the compiler passes.
Background -
128 bit type integers have mainly been used in cryptography, HPC and other financial applications which require high precision math.
Since we do not have 128 bit registers for most hardwares, we can not have native division instruction. Also division have a high cost so even on the target that supports div, division is usually performed by mul, add and shift instructions.
LLVM previous Implementation -
So If we talk about 128 support in LLVM. It was generating a libcall of __umodti3 for rem and __udivdi3 for div which performs a serialized shift-subtract loop over the bit width.
--- compiler-rt/lib/builtins/udivmodti4.c
for (; shift >= 0; --shift) {
quotient <<= 1;
if (dividend >= divisor) {
dividend -= divisor;
quotient |= 1;
}
divisor >>= 1;
}
Earlier codegen for this look like this -
div_by_7(unsigned __int128):
push rax
mov edx, 7
xor ecx, ecx
call __udivti3@PLT
pop rcx
ret
GCC x86 Codegen -
However GCC was not using libcalls, It has a trick of Hacker delight’s division by constant with the variable chunk size. Following is the output for that routine -
div_by_7(unsigned __int128):
movabs rcx, 1152921504606846975
mov r8, rdi
mov r9, rsi
mov rax, r8
and rdi, rcx
shrd rax, rsi, 60
shr rsi, 56
and rax, rcx
add rdi, rax
movabs rax, 2635249153387078803
add rsi, rdi
mul rsi
mov rax, rsi
sub rax, rdx
shr rax
add rdx, rax
shr rdx, 2
lea rax, [0+rdx*8]
sub rax, rdx
movabs rdx, -5270498306774157605
sub rsi, rax
sub r8, rsi
movabs rsi, 7905747460161236407
sbb r9, 0
imul rdx, r8
mov rax, r8
mov rcx, r9
imul rcx, rsi
add rcx, rdx
mul rsi
add rcx, rdx
mov rdx, rcx
ret
Hacker Delight remainder trick -
The original idea was mentioned in the Hacker delights book, 10-19 Remainder by Summing Digits -
static char table[62] = {0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1};
int remu7(unsigned n) {
static char table[75] = {0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0,
1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3,
4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4};
n = (n >> 15) + (n & 0x7FFF); // Max 0x27FFE.
n = (n >> 9) + (n & 0x001FF); // Max 0x33D.
n = (n >> 6) + (n & 0x0003F); // Max 0x4A.
return table[n];
}
To better understand this with example, let’s remember the trick we used to study in schools -
A number is divisible by 9 if the sum of its digits is divisible by 9, in binary so divisor 9 have some special property,
Similarly a number is divisible by d if the sum of its W-bit “digits” is divisible by d, provided (2^W (mod) D) = 1.
So what does this constraint mean? Why is it there? -
Let’s write it as
2^W ≡ 1 (modD)
Which means multiplying by 2^W is the same as multiplying by 1(mod D).
Now think about a number split into chunks of size W bits (for example 60):
N = a0 + a1*2^W + a2*2^2W + ...
Now apply modulo D:
Since:
2^W≡1
Then:
2^2W ≡ 1, 2^3W ≡ 1
So the number becomes:
N ≡ a0 + a1 + a2 + ... (mod D)
All powers are collapsed to 1
Every chunk contributes equally modulo D, so instead of doing full division, we can just add the chunks together before division. And this is what we will be doing in our implementation shown below in the next section. We will partition the dividend into smaller chunks.
Real World example -
Lets takes this number as dividend -
N = 101011001
1*2^8 + 0*2^7 + 1*2^6 + 0*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
256+64+16+8+1 = 345
Now lets div by 7 We get 345/7 = 49 as quotient and 2 is remainder.
For this 7 divisor we have 3 which satisfies the property (2^W (mod) D) = 1.
(2^3) % 7 = 1
So let’s divide this into the chunk size of three. We will have three chunks and we sum them.
N1 = 101 -> 5
N2 = 011 -> 3
N3 = 001 -> 1
5 + 3 + 1 = 9
And take the modulo
9 mod 7 = 2
So by this method we also get the same remainder but we have reduced the input N into smaller chunks.
To get quotient when we already have remainder, we usually do -
345 = Q⋅7 + 2
Q = 343 / 7 = 49
Which means this is the formula -
Q = (N − R)/ D
Now you are wondering if we need to divide again. For this case we use multiplicative inverse formula -
Which is
Q = (N−R) * D ^ −1(mod 2^n)
Here R is the remainder we calculated earlier.
Performance Number -
We studied the theory, before we get into LLVM implementation lets read the numbers on how much performance it improves -
“Before vs. After” comparison on google micro-benchmark of division on Macbook Air m4 which is arm64 chip.
Before
Running ./div
Run on (10 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x10)
Load Average: 1.68, 1.86, 2.80
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_Uint128DivBy7 3.21 ns 3.21 ns 217350129
---------------------------------------------------------------
After
Running ./div
Run on (10 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x10)
Load Average: 2.33, 2.02, 2.77
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_Uint128DivBy7 2.10 ns 2.10 ns 332630058
We saw 35% performance gain with Google benchmark.
LLVM implementation -
GlobalISel -
Before discussing the summing digits trick with variable chunks, let’s see what the current approach for GlobalISel following. Instead of generating libcalls it is using the popular magic multiplication algorithm (https://github.com/llvm/llvm-project/blob/96db944036a09e773d1e3fff4fa3805a4f75e59e/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp#L5578). It generates code like this on aarch64 -
ui128_7:
%bb.0: // %entry
mov x8, #18725
mov x10, #9362
movk x8, #9362, lsl #16
movk x10, #37449, lsl #16
movk x8, #37449, lsl #32
movk x10, #18724, lsl #32
movk x8, #18724, lsl #48
movk x10, #9362, lsl #48
mul x9, x1, x8
mul x11, x0, x10
umulh x12, x0, x8
mul x13, x1, x10
adds x9, x9, x11
umulh x14, x1, x8
cset w11, hs
cmn x9, x12
and x11, x11, #0x1
and x12, xzr, #0x1
umulh x15, x0, x10
cset w9, hs
and x9, x9, #0x1
umulh x8, xzr, x8
add x9, x11, x9
and x11, xzr, #0x1
adds x13, x13, x14
add x11, x11, x12
umulh x10, x1, x10
cset w12, hs
adds x13, x13, x15
and x12, x12, #0x1
umulh x14, x0, xzr
cset w15, hs
adds x9, x13, x9
add x11, x11, x12
and x12, x15, #0x1
cset w13, hs
add x11, x11, x12
and x12, x13, #0x1
add x8, x8, x10
add x10, x11, x12
add x8, x8, x14
add x8, x8, x10
subs x10, x0, x9
sbc x11, x1, x8
lsr x11, x11, #1
adds x9, x10, x9
mov w10, #7 // =0x7
adc x8, x11, x8
extr x9, x8, x9, #2
lsr x8, x8, #2
umulh x10, x9, x10
lsl x11, x9, #3
lsl x12, x8, #3
sub x9, x11, x9
sub x8, x12, x8
subs x0, x0, x9
add x8, x8, x10
sbc x1, x1, x8
ret
Compared to that chunk summation algorithms produce
ui128_7:
%bb.0:
extr x8, x1, x0, #60
and x9, x0, #0xfffffffffffffff
and x8, x8, #0xfffffffffffffff
add x8, x9, x8
mov x9, #18725
movk x9, #9362, lsl #16
add x8, x8, x1, lsr #56
mov x1, xzr
movk x9, #37449, lsl #32
movk x9, #18724, lsl #48
umulh x9, x8, x9
lsr x9, x9, #1
sub x9, x9, x9, lsl #3
add x0, x8, x9
ret
This is much more efficient and simpler code than the magic multiplicative inverse algorithm.
Chunk summation is more optimized but also relies on the special property of the divisor to satisfy - 2^I % Divisor == 1. So it can only apply to divisors like 7, 34, 100 etc.
SelectionDAG implementation -
Let’s see how the Summing Digits algorithm is implemented in LLVM (https://github.com/xgupta/llvm-project/blob/16e06589e963aadced8a5afeeaa2ec2a95d0a483/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp#L8167).
Before the my commit the Codegen had summing digits technique but only for
Divisor like 12 where (1 << (Bitwidth / 2)) % Divisor == 1 for which chunk size is fixed at 2, here we can add the high and low halves together and use a urem. This lefts many common divisors.
To generalise it to support more integers we need to use variable chunks not just two. As you saw gcc was using three chunks like 60 + 60 + 8 to carry out div by 7.
The current implementation is implemented in an targeting independent manner in TargetLowering which uses DAG structure. This works on only constant divisors since for variable divisors we can not do the summing of their digits. Since this trick only works for odd numbers, we need to right shift zeros from the even number to make it odd. We need to keep the zero digits count so that we can shift them back for even divisors.
Next we look for chunk size that can be used to satisfy the Summing Digits property (1 << W) % Divisor == 1. We search backward to find the largest chunk size possible so we do have fewer chunks and fewer additions.
Next we have a lambda function that checks whether the target supports funnel shift(FSHR) instruction. If it supports it, like in X86/Aarch64 we use that instruction otherwise we will need to explicitly use OR(SHL, SRL) instruction sequence.
We need to shift the input also if we have shifted the divisor.
Next there are two cases. 1. When best chunk width is half of the size of chunkwidth, we add the low and high half, if any carry left add it to the final sum. Then we check if the target supports the uaddo_carry instruction if it then uses it otherwise needs special handling to check the overflow and add the carry to sum.
In second case or in the else part when chunk size is variable but smaller than half of VT, we iterate through the dividend, masking and shifting to extract chunks of BestChunkWidth, summing them into a single register.
for (unsigned I = 0; I < BitWidth - TrailingZeros; I += BestChunkWidth) {
// If there were trailing zeros in the divisor, increase the shift amount.
unsigned Shift = I + TrailingZeros;
SDValue Chunk;
if (Shift == 0)
Chunk = LL;
else if (Shift >= HBitWidth)
Chunk = DAG.getNode(
ISD::SRL, dl, HiLoVT, LH,
DAG.getShiftAmountConstant(Shift - HBitWidth, HiLoVT, dl));
else
Chunk = GetFSHR(LL, LH, Shift);
// If we're on the last chunk, we don't need an AND.
if (I + BestChunkWidth < BitWidth - TrailingZeros)
Chunk = DAG.getNode(ISD::AND, dl, HiLoVT, Chunk, Mask);
if (!Sum)
Sum = Chunk;
else
Sum = DAG.getNode(ISD::ADD, dl, HiLoVT, Sum, Chunk);
}
For udiv instruction we need the Quotient. And for urem instruction we need a remainder.
We calculate the remainder using truncated divisor, rem = sum % divisor.
Calculation for urem is as easy as we can already calculate the remainder but since we shifted the divisor for even, we need to shift again.
For division, we first shift the input dividend (both halves) for training zero if the even divisor also shifted earlier.
Next we subtract the remainder from the shifted dividend, making it exactly divisible by the divisor.
To get the Quotient we use the mul instruction by using inverse multiplication. Lets see what happens here in this call
APInt MulFactor = Divisor.multiplicativeInverse();
Multiplicative Inverse is to get the final quotient(Quotient = Dividend * MulFactor). We have the remainder already and this is a trick to get it without the div instruction.
Then we split quotient into two halves to store in the res vector and return that.
X86 SelectionDAG codegen -
With that the new code which is now emitted for x86 for div by 7 is -
div_by_7(unsigned __int128):
; Mask creation
movabs rax, 1152921504606846975 ; 0x0FFFFFFFFFFFFFFF mask
; Extract low 60 bits
mov rdx, rdi ; rdx = low64(n)
and rdx, rax ; rdx = a = n & ((1<<60)-1)
; Extract middle 60 bits
mov r8, rdi ; r8 = low64(n)
shrd r8, rsi, 60 ; r8 = ( (rsi << 64) | rdi ) >> 60
and r8, rax ; rax = b = (n >> 60) & ((1<<60)-1)
; Extract top bits
mov rcx, rsi ; rcx = high64(n)
shr rcx, 56 ; rcx = c = n >> 120 (64+56) // top 8 bits
; Fold modulo 7
add rcx, rdx ; rcx = c + a
add rcx, r8 ; rcx = x = a + b + c
; Approximate division
movabs rdx, 5270498306774157605 ; rdx = M = floor(2^64 / 7)
mov rax, rcx ; rax = x
mul rdx ; rdx = high64(x * M)
shr rdx ; rdx = q = approx(x / 7)
; Compute remainder
lea rax, [8*rdx] ; rax = 8*q
sub rdx, rax ; rdx = q - 8*q = -7*q
add rdx, rcx ; rdx = r = x - 7*q
; Subtract remainder from original
sub rdi, rdx ; low64(n) -= r
sbb rsi, 0 ; high64(n) -= borrow
; Final exact division
movabs rcx, -5270498306774157605 ; rcx = -M
movabs r8, 7905747460161236407 ; r8 = magic constant for exact division
mov rax, rdi ; rax = low64(n)
mul r8 ; (rax * r8) → rdx:rax
imul rcx, rdi ; rcx = (-M) * low64(n)
add rdx, rcx ; rdx += rcx
imul r8, rsi ; r8 = magic * high64(n)
add rdx, r8 ; rdx += r8
ret ; return rdx
rdi = low 64 bits of n
rsi = high 64 bits of n
If the question you might have why Quotient calculation is happening two time? This is because x(a+b+c) is not exactly equal to n, it is approximate, the first stage is only to recover the remainder (via a reduced value x), and the final division operates on the corrected original number n − r, where multiplicative division becomes exact.
Future work -
GCC still has better codegen than LLVM for this routine with more refinements.
This is only a small part of what could be possible, The Hacker Delight book is filled with so many similar tricks.
I could see this week Craig Topper just put up a patch (https://github.com/llvm/llvm-project/pull/189286) to extend this algorithm to cover more constant divisors like 67 that satisfies 2^k mod divisor == -1. This is similar to the trick we studies in school that a number is dvisible by 11 if we add the even digits and subtract the odd digits and if the resulting sum is divisible by 11 the original number is also divisible by 11. Though covering this type of integer requires extra subtraction of chunks, but it is still better than magic multiplication or generating libcalls and GCC still generates libcall for this.
I have also ported the algorithm to GlobalIsel instruction selector (https://github.com/llvm/llvm-project/pull/189231/) which was lacking behind from SelectionDAG.
Note: I would like to thanks Craig Topper who reviewed the pull request and provide a lot of suggestion to improve.