Optimizing rotations for quad

Bare metal programming in PureBasic, for experienced users
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Optimizing rotations for quad

Post by netmaestro »

Last week wilbert wrote some code for me for rotating quads using just the general purpose registers. This turned out to be faster than any other tries using SSE. Today I tried to do the same, without looking at his procedures and then see how similar the code was. The result was interesting in that for rotations of 31 bits or less, my code turned out faster (here, anyway) and for rotations of 32 or more bits, wilbert's code is faster. So a good question might be, is there a way to merge the two codes somehow to get the best features of both? (assuming the speed differences are happening on others' machines too)

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro
  !mov eax, [esp+4]
  !mov edx, [esp+8]
  !mov ecx, [esp+12]
  !cmp ecx, 31
  !jg large1  
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !jmp exit1
  !large1:
  !push ebx  
  !sub ecx, 32
  !mov ebx, edx
  !shrd edx, eax, 31
  !shrd eax, ebx, 31
  !mov ebx, edx
  !shrd edx, eax, 1
  !shrd eax, ebx, 1  
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !pop ebx
  !exit1:
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure


tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotR64(a, 31)
Next

tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotr64_(a, 31) 
Next
tm3 = ElapsedMilliseconds()

msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)

MessageRequester("Speed", msg)
Btw (offtopic): Thanks for the new subforum, Fred & Freak :mrgreen:
BERESHEIT
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Optimizing rotations for quad

Post by Demivec »

@netmaestro: I gave a cursory look at the routines. It would seem wilbert's gives a consistent speed for most of the rotations from 0 - 63 bits while yours is faster for rotations of 0 - 31 bits and slower for rotations of 32 - 63 bits. The speed gain for the cases when yours was faster is basically equivalent to the speed loss for the cases when it was slower. This means if the rotation amounts are equal it should come out the same.

Ignoring those results for a minute though, your routine doesn't preserve the ebx register for rotations that are less than 32 bits. By correcting this you may find the results are slower for your routine for every case. I made the correction to your code and it seems you still have a slight gain for rotating by 0 - 31 bits.

My results for the length of time to rotate a quad once by each possible bit combination from 0 - 63 is as follows: 82035ms (wilbert) and 84528ms (netmaestro). After the modification it was 81703ms (wilbert) and 87532ms (netmaestro). Just for curiosity here are the results for just rotations 0 - 31 both before and after the correction: 34624ms / 37093ms (netmaestro) and 40658ms / 39923 (wilbert).

With a little more time I see what I can do to answer your original question about combing the best of both. :wink:

Here's the corrected code I used for your routine:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with minor corrections by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 31
  !SHRD Eax, Ebx, 31
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 1
  !SHRD Eax, Ebx, 1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Optimizing rotations for quad

Post by Demivec »

Here's a version with an improvement:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure
My timing show it beating wilbert's (on my single core AMD CPU) for almost all rotations of 32 - 63 bits and tying for rotations of 0 - 31 bits. wilbert still tromps you on 1 or 2 special cases like rotations of 0 or 32 bits.
Last edited by Demivec on Wed Aug 24, 2011 4:42 am, edited 1 time in total.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Optimizing rotations for quad

Post by Demivec »

@Edit: incorrect code removed.
Last edited by Demivec on Wed Aug 24, 2011 11:58 pm, edited 1 time in total.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Optimizing rotations for quad

Post by netmaestro »

Thanks for working on it! For now though, your latest fails QC due diligence on the 32 bits rotation only:

Code: Select all

Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1  
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !AND cl,31
  !JZ exit1
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure

a.q = 4123456789

For i=0 To 63
  b.q = rotr64(a, i)
  c.q = rotr64_(a, i)
  If c<>b
    Debug "oops: " + Str(i)
  EndIf
Next

BERESHEIT
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Optimizing rotations for quad

Post by wilbert »

With SSE2 register

Code: Select all

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !movd xmm0, ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !movd ebx, xmm0
  ProcedureReturn
EndProcedure
Without

Code: Select all

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  ProcedureReturn
EndProcedure
Last edited by wilbert on Wed Aug 24, 2011 10:59 am, edited 2 times in total.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Optimizing rotations for quad

Post by netmaestro »

Did you test those for accuracy? Run against your earlier plain asm results, the sse fails on 49-63 and the next one fails on 2-63. Something isn't right.
BERESHEIT
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Optimizing rotations for quad

Post by wilbert »

Try again. I updated the code. I forgot one instruction.
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Optimizing rotations for quad

Post by netmaestro »

Perfect, they both work now. The second one is faster, I'm doing more tests but I think it's turning out faster than your original (which hasn't been beat by anything yet)
BERESHEIT
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Optimizing rotations for quad

Post by netmaestro »

I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.
That's going to take a while to sink in :D but it'll come!
BERESHEIT
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Optimizing rotations for quad

Post by wilbert »

netmaestro wrote:That's going to take a while to sink in :D but it'll come!
I'm sure you will figure it out :wink:

As for speed, if it is really important, you should do it by simulating a C style function

Code: Select all

Procedure rotr64_addr__() ; By wilbert
  !mov eax, rotr64_addr_start
  ProcedureReturn
  !rotr64_addr_start:
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  !ret
EndProcedure

PrototypeC.q proto_rotr64(val.q, n)
Global Rotr64.proto_rotr64 = rotr64_addr__()
When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
You also avoid some pushing and popping PureBasic does for the functions it generates.
Last edited by wilbert on Wed Aug 24, 2011 10:58 am, edited 1 time in total.
User avatar
netmaestro
PureBasic Bullfrog
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Optimizing rotations for quad

Post by netmaestro »

When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.
BERESHEIT
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Optimizing rotations for quad

Post by wilbert »

netmaestro wrote:On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.
It might have to do with the stack. On OS X the stack (esp) has to be aligned on 16 bytes at the moment you make a call otherwise the app probably will crash.
Helle
Enthusiast
Enthusiast
Posts: 178
Joined: Wed Apr 12, 2006 7:59 pm
Location: Germany
Contact:

Re: Optimizing rotations for quad

Post by Helle »

@wilbert:
SHRD works intern with an and-bit-mask for CL; in 32-bit-mode are only bits 0-4 used. You can remove "and ecx, 63".
Not:

Code: Select all

  !and ecx, 63
  !btr ecx, 5
  !jc rotr64_1
But:

Code: Select all

  !test ecx,100000b     ;or cmp ecx,31 / jg ...
  !jnz rotr64_1
Helle
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Optimizing rotations for quad

Post by wilbert »

You are right Helle.
I removed the and instruction; should make it a little bit faster.
Last edited by wilbert on Thu Aug 25, 2011 6:16 am, edited 1 time in total.
Post Reply