It is currently Sat Mar 06, 2021 3:42 am

All times are UTC + 1 hour




Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: Optimizing rotations for quad
PostPosted: Tue Aug 23, 2011 11:15 pm 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Last week wilbert wrote some code for me for rotating quads using just the general purpose registers. This turned out to be faster than any other tries using SSE. Today I tried to do the same, without looking at his procedures and then see how similar the code was. The result was interesting in that for rotations of 31 bits or less, my code turned out faster (here, anyway) and for rotations of 32 or more bits, wilbert's code is faster. So a good question might be, is there a way to merge the two codes somehow to get the best features of both? (assuming the speed differences are happening on others' machines too)
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro
  !mov eax, [esp+4]
  !mov edx, [esp+8]
  !mov ecx, [esp+12]
  !cmp ecx, 31
  !jg large1 
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !jmp exit1
  !large1:
  !push ebx 
  !sub ecx, 32
  !mov ebx, edx
  !shrd edx, eax, 31
  !shrd eax, ebx, 31
  !mov ebx, edx
  !shrd edx, eax, 1
  !shrd eax, ebx, 1 
  !mov ebx, edx
  !shrd edx, eax, cl
  !shrd eax, ebx, cl
  !pop ebx
  !exit1:
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure


tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotR64(a, 31)
Next

tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
  b.q = rotr64_(a, 31)
Next
tm3 = ElapsedMilliseconds()

msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)

MessageRequester("Speed", msg)


Btw (offtopic): Thanks for the new subforum, Fred & Freak :mrgreen:

_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 1:46 am 
Offline
Addict
Addict
User avatar

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3778
Location: Utah, USA
@netmaestro: I gave a cursory look at the routines. It would seem wilbert's gives a consistent speed for most of the rotations from 0 - 63 bits while yours is faster for rotations of 0 - 31 bits and slower for rotations of 32 - 63 bits. The speed gain for the cases when yours was faster is basically equivalent to the speed loss for the cases when it was slower. This means if the rotation amounts are equal it should come out the same.

Ignoring those results for a minute though, your routine doesn't preserve the ebx register for rotations that are less than 32 bits. By correcting this you may find the results are slower for your routine for every case. I made the correction to your code and it seems you still have a slight gain for rotating by 0 - 31 bits.

My results for the length of time to rotate a quad once by each possible bit combination from 0 - 63 is as follows: 82035ms (wilbert) and 84528ms (netmaestro). After the modification it was 81703ms (wilbert) and 87532ms (netmaestro). Just for curiosity here are the results for just rotations 0 - 31 both before and after the correction: 34624ms / 37093ms (netmaestro) and 40658ms / 39923 (wilbert).

With a little more time I see what I can do to answer your original question about combing the best of both. :wink:

Here's the corrected code I used for your routine:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with minor corrections by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1 
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 31
  !SHRD Eax, Ebx, 31
  !MOV Ebx, Edx
  !SHRD Edx, Eax, 1
  !SHRD Eax, Ebx, 1 
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

_________________
Image


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 2:31 am 
Offline
Addict
Addict
User avatar

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3778
Location: Utah, USA
Here's a version with an improvement:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1 
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !SUB Ecx, 32
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure


My timing show it beating wilbert's (on my single core AMD CPU) for almost all rotations of 32 - 63 bits and tying for rotations of 0 - 31 bits. wilbert still tromps you on 1 or 2 special cases like rotations of 0 or 32 bits.

_________________
Image


Last edited by Demivec on Wed Aug 24, 2011 4:42 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 4:41 am 
Offline
Addict
Addict
User avatar

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3778
Location: Utah, USA
@Edit: incorrect code removed.

_________________
Image


Last edited by Demivec on Wed Aug 24, 2011 11:58 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 4:53 am 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Thanks for working on it! For now though, your latest fails QC due diligence on the 32 bits rotation only:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
  !MOV Eax, [Esp+4]
  !MOV Edx, [Esp+8]
  !MOV Ecx, [Esp+12]
  !PUSH Ebx
  !CMP Ecx, 31
  !JG large1 
  !MOV Ebx, Edx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !JMP exit1
  !large1:
  !AND cl,31
  !JZ exit1
  !MOV Ebx, Eax
  !MOV Eax, Edx
  !MOV Edx, Ebx
  !SHRD Edx, Eax, cl
  !SHRD Eax, Ebx, cl
  !exit1:
  !POP Ebx
  ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !and cl,63
  !cmp cl,32
  !jb rotr64_1
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !rotr64_2:
  !and cl,31
  !jz rotr64_3
  !push eax
  !shrd eax, edx, cl
  !push eax
  !mov eax, [esp + 4]
  !shrd edx, eax, cl
  !pop eax
  !add esp,4
  !rotr64_3:
  ProcedureReturn
EndProcedure

a.q = 4123456789

For i=0 To 63
  b.q = rotr64(a, i)
  c.q = rotr64_(a, i)
  If c<>b
    Debug "oops: " + Str(i)
  EndIf
Next


_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 6:07 am 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3712
Location: Netherlands
With SSE2 register
Code:
Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !movd xmm0, ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !movd ebx, xmm0
  ProcedureReturn
EndProcedure


Without
Code:
Procedure.q Rotr64_(val.q, n) ; By wilbert
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  ProcedureReturn
EndProcedure


Last edited by wilbert on Wed Aug 24, 2011 10:59 am, edited 2 times in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 6:13 am 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Did you test those for accuracy? Run against your earlier plain asm results, the sse fails on 49-63 and the next one fails on 2-63. Something isn't right.

_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 6:29 am 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3712
Location: Netherlands
Try again. I updated the code. I forgot one instruction.
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 6:37 am 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Perfect, they both work now. The second one is faster, I'm doing more tests but I think it's turning out faster than your original (which hasn't been beat by anything yet)

_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 6:57 am 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Quote:
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.

That's going to take a while to sink in :D but it'll come!

_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 7:25 am 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3712
Location: Netherlands
netmaestro wrote:
That's going to take a while to sink in :D but it'll come!

I'm sure you will figure it out :wink:

As for speed, if it is really important, you should do it by simulating a C style function
Code:
Procedure rotr64_addr__() ; By wilbert
  !mov eax, rotr64_addr_start
  ProcedureReturn
  !rotr64_addr_start:
  !mov ecx,[esp + 12]
  !btr ecx, 5
  !jc rotr64_1
  !mov eax,[esp + 4]
  !mov edx,[esp + 8]
  !jmp rotr64_2
  !rotr64_1:
  !mov edx,[esp + 4]
  !mov eax,[esp + 8]
  !rotr64_2:
  !push ebx
  !mov ebx, eax
  !shrd eax, edx, cl
  !shrd edx, ebx, cl
  !pop ebx
  !ret
EndProcedure

PrototypeC.q proto_rotr64(val.q, n)
Global Rotr64.proto_rotr64 = rotr64_addr__()

When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
You also avoid some pushing and popping PureBasic does for the functions it generates.


Last edited by wilbert on Wed Aug 24, 2011 10:58 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 7:31 am 
Offline
PureBasic Bullfrog
PureBasic Bullfrog
User avatar

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Location: Fort Nelson, BC, Canada
Quote:
When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).

On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

_________________
BERISHEET


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 7:52 am 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3712
Location: Netherlands
netmaestro wrote:
On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

It might have to do with the stack. On OS X the stack (esp) has to be aligned on 16 bytes at the moment you make a call otherwise the app probably will crash.


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 10:29 am 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Wed Apr 12, 2006 7:59 pm
Posts: 177
Location: Germany
@wilbert:
SHRD works intern with an and-bit-mask for CL; in 32-bit-mode are only bits 0-4 used. You can remove "and ecx, 63".
Not:
Code:
  !and ecx, 63
  !btr ecx, 5
  !jc rotr64_1

But:
Code:
  !test ecx,100000b     ;or cmp ecx,31 / jg ...
  !jnz rotr64_1

Helle


Top
 Profile  
Reply with quote  
 Post subject: Re: Optimizing rotations for quad
PostPosted: Wed Aug 24, 2011 11:01 am 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3712
Location: Netherlands
You are right Helle.
I removed the and instruction; should make it a little bit faster.


Last edited by wilbert on Thu Aug 25, 2011 6:16 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next

All times are UTC + 1 hour


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  

 


Powered by phpBB © 2008 phpBB Group
subSilver+ theme by Canver Software, sponsor Sanal Modifiye