Optimizing rotations for quad

netmaestro · Post by **netmaestro** » Tue Aug 23, 2011 11:15 pm

Last week wilbert wrote some code for me for rotating quads using just the general purpose registers. This turned out to be faster than any other tries using SSE. Today I tried to do the same, without looking at his procedures and then see how similar the code was. The result was interesting in that for rotations of 31 bits or less, my code turned out faster (here, anyway) and for rotations of 32 or more bits, wilbert's code is faster. So a good question might be, is there a way to merge the two codes somehow to get the best features of both? (assuming the speed differences are happening on others' machines too)

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro
  !mov eax, [esp+4]
  !mov edx, [esp+8]
  !mov ecx, [esp+12]
  !cmp ecx, 31
  !jg large1  
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !jmp exit1
  !large1:
  !push ebx  
  !sub ecx, 32
  !mov ebx, edx
  !shrd edx, eax, 31
  !shrd eax, ebx, 31
  !mov ebx, edx
  !shrd edx, eax, 1
  !shrd eax, ebx, 1  
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !pop ebx
  !exit1:
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure


tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotR64(a, 31)
Next

tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotr64_(a, 31) 
Next
tm3 = ElapsedMilliseconds()

msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)

MessageRequester("Speed", msg)

Btw (offtopic): Thanks for the new subforum, Fred & Freak

Demivec · Post by **Demivec** » Wed Aug 24, 2011 1:46 am

@netmaestro: I gave a cursory look at the routines. It would seem wilbert's gives a consistent speed for most of the rotations from 0 - 63 bits while yours is faster for rotations of 0 - 31 bits and slower for rotations of 32 - 63 bits. The speed gain for the cases when yours was faster is basically equivalent to the speed loss for the cases when it was slower. This means if the rotation amounts are equal it should come out the same.

Ignoring those results for a minute though, your routine doesn't preserve the ebx register for rotations that are less than 32 bits. By correcting this you may find the results are slower for your routine for every case. I made the correction to your code and it seems you still have a slight gain for rotating by 0 - 31 bits.

My results for the length of time to rotate a quad once by each possible bit combination from 0 - 63 is as follows: 82035ms (wilbert) and 84528ms (netmaestro). After the modification it was 81703ms (wilbert) and 87532ms (netmaestro). Just for curiosity here are the results for just rotations 0 - 31 both before and after the correction: 34624ms / 37093ms (netmaestro) and 40658ms / 39923 (wilbert).

With a little more time I see what I can do to answer your original question about combing the best of both.

Here's the corrected code I used for your routine:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with minor corrections by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 31
  !SHRD Eax, Ebx, 31
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 1
  !SHRD Eax, Ebx, 1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

Demivec · Post by **Demivec** » Wed Aug 24, 2011 2:31 am

Here's a version with an improvement:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

My timing show it beating wilbert's (on my single core AMD CPU) for almost all rotations of 32 - 63 bits and tying for rotations of 0 - 31 bits. wilbert still tromps you on 1 or 2 special cases like rotations of 0 or 32 bits.

Demivec · Post by **Demivec** » Wed Aug 24, 2011 4:41 am

@Edit: incorrect code removed.

netmaestro · Post by **netmaestro** » Wed Aug 24, 2011 4:53 am

Thanks for working on it! For now though, your latest fails QC due diligence on the 32 bits rotation only:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !AND cl,31
  !JZ exit1
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure

a.q = 4123456789

For i=0 To 63
  b.q = rotr64(a, i)
  c.q = rotr64_(a, i)
  If c<>b
    Debug "oops: " + Str(i)
  EndIf
Next

wilbert · Post by **wilbert** » Wed Aug 24, 2011 6:07 am

With SSE2 register

Code: Select all

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !movd xmm0, ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !movd ebx, xmm0
  ProcedureReturn
EndProcedure

Without

Code: Select all

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  ProcedureReturn
EndProcedure

netmaestro · Post by **netmaestro** » Wed Aug 24, 2011 6:13 am

Did you test those for accuracy? Run against your earlier plain asm results, the sse fails on 49-63 and the next one fails on 2-63. Something isn't right.

wilbert · Post by **wilbert** » Wed Aug 24, 2011 6:29 am

Try again. I updated the code. I forgot one instruction.
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.

netmaestro · Post by **netmaestro** » Wed Aug 24, 2011 6:37 am

Perfect, they both work now. The second one is faster, I'm doing more tests but I think it's turning out faster than your original (which hasn't been beat by anything yet)

netmaestro · Post by **netmaestro** » Wed Aug 24, 2011 6:57 am

I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.

That's going to take a while to sink in

but it'll come!

wilbert · Post by **wilbert** » Wed Aug 24, 2011 7:25 am

netmaestro wrote:That's going to take a while to sink in but it'll come!

I'm sure you will figure it out

As for speed, if it is really important, you should do it by simulating a C style function

Code: Select all

Procedure rotr64_addr__() ; By wilbert
  !mov eax, rotr64_addr_start
  ProcedureReturn
  !rotr64_addr_start:
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  !ret
EndProcedure

PrototypeC.q proto_rotr64(val.q, n)
Global Rotr64.proto_rotr64 = rotr64_addr__()

When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
You also avoid some pushing and popping PureBasic does for the functions it generates.

netmaestro · Post by **netmaestro** » Wed Aug 24, 2011 7:31 am

When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).

On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

wilbert · Post by **wilbert** » Wed Aug 24, 2011 7:52 am

netmaestro wrote:On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

It might have to do with the stack. On OS X the stack (esp) has to be aligned on 16 bytes at the moment you make a call otherwise the app probably will crash.

Helle · Post by **Helle** » Wed Aug 24, 2011 10:29 am

@wilbert:
SHRD works intern with an and-bit-mask for CL; in 32-bit-mode are only bits 0-4 used. You can remove "and ecx, 63".
Not:

Code: Select all

  !and ecx, 63
  !btr ecx, 5
  !jc rotr64_1

But:

Code: Select all

  !test ecx,100000b     ;or cmp ecx,31 / jg ...
  !jnz rotr64_1

Helle

wilbert · Post by **wilbert** » Wed Aug 24, 2011 11:01 am

You are right Helle.
I removed the and instruction; should make it a little bit faster.

PureBasic Forums - English

Optimizing rotations for quad

Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad

Re: Optimizing rotations for quad