PureBasic Forum

 It is currently Sun Jan 24, 2021 7:43 pm

 All times are UTC + 1 hour

 Page 1 of 2 [ 20 posts ] Go to page 1, 2  Next
 Print view Previous topic | Next topic
Author Message
 Post subject: Optimizing rotations for quadPosted: Tue Aug 23, 2011 11:15 pm
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Last week wilbert wrote some code for me for rotating quads using just the general purpose registers. This turned out to be faster than any other tries using SSE. Today I tried to do the same, without looking at his procedures and then see how similar the code was. The result was interesting in that for rotations of 31 bits or less, my code turned out faster (here, anyway) and for rotations of 32 or more bits, wilbert's code is faster. So a good question might be, is there a way to merge the two codes somehow to get the best features of both? (assuming the speed differences are happening on others' machines too)
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro
!mov eax, [esp+4]
!mov edx, [esp+8]
!mov ecx, [esp+12]
!cmp ecx, 31
!jg large1
!mov ebx, edx
!shrd edx, eax, cl
!shrd eax, ebx, cl
!jmp exit1
!large1:
!push ebx
!sub ecx, 32
!mov ebx, edx
!shrd edx, eax, 31
!shrd eax, ebx, 31
!mov ebx, edx
!shrd edx, eax, 1
!shrd eax, ebx, 1
!mov ebx, edx
!shrd edx, eax, cl
!shrd eax, ebx, cl
!pop ebx
!exit1:
ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!and cl,63
!cmp cl,32
!jb rotr64_1
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!rotr64_2:
!and cl,31
!jz rotr64_3
!push eax
!shrd eax, edx, cl
!push eax
!mov eax, [esp + 4]
!shrd edx, eax, cl
!pop eax
!rotr64_3:
ProcedureReturn
EndProcedure

tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = rotR64(a, 31)
Next

tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = rotr64_(a, 31)
Next
tm3 = ElapsedMilliseconds()

msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)

MessageRequester("Speed", msg)

Btw (offtopic): Thanks for the new subforum, Fred & Freak

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 1:46 am

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3767
Location: Utah, USA
@netmaestro: I gave a cursory look at the routines. It would seem wilbert's gives a consistent speed for most of the rotations from 0 - 63 bits while yours is faster for rotations of 0 - 31 bits and slower for rotations of 32 - 63 bits. The speed gain for the cases when yours was faster is basically equivalent to the speed loss for the cases when it was slower. This means if the rotation amounts are equal it should come out the same.

Ignoring those results for a minute though, your routine doesn't preserve the ebx register for rotations that are less than 32 bits. By correcting this you may find the results are slower for your routine for every case. I made the correction to your code and it seems you still have a slight gain for rotating by 0 - 31 bits.

My results for the length of time to rotate a quad once by each possible bit combination from 0 - 63 is as follows: 82035ms (wilbert) and 84528ms (netmaestro). After the modification it was 81703ms (wilbert) and 87532ms (netmaestro). Just for curiosity here are the results for just rotations 0 - 31 both before and after the correction: 34624ms / 37093ms (netmaestro) and 40658ms / 39923 (wilbert).

With a little more time I see what I can do to answer your original question about combing the best of both.

Here's the corrected code I used for your routine:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with minor corrections by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!SUB Ecx, 32
!MOV Ebx, Edx
!SHRD Edx, Eax, 31
!SHRD Eax, Ebx, 31
!MOV Ebx, Edx
!SHRD Edx, Eax, 1
!SHRD Eax, Ebx, 1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure

_________________

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 2:31 am

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3767
Location: Utah, USA
Here's a version with an improvement:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!SUB Ecx, 32
!MOV Ebx, Eax
!MOV Eax, Edx
!MOV Edx, Ebx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure

My timing show it beating wilbert's (on my single core AMD CPU) for almost all rotations of 32 - 63 bits and tying for rotations of 0 - 31 bits. wilbert still tromps you on 1 or 2 special cases like rotations of 0 or 32 bits.

_________________

Last edited by Demivec on Wed Aug 24, 2011 4:42 am, edited 1 time in total.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 4:41 am

Joined: Mon Jul 25, 2005 3:51 pm
Posts: 3767
Location: Utah, USA
@Edit: incorrect code removed.

_________________

Last edited by Demivec on Wed Aug 24, 2011 11:58 pm, edited 1 time in total.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 4:53 am
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Thanks for working on it! For now though, your latest fails QC due diligence on the 32 bits rotation only:
Code:
Procedure.q rotr64(val.q, n) ; By netmaestro with modifications by Demivec
!MOV Eax, [Esp+4]
!MOV Edx, [Esp+8]
!MOV Ecx, [Esp+12]
!PUSH Ebx
!CMP Ecx, 31
!JG large1
!MOV Ebx, Edx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!JMP exit1
!large1:
!AND cl,31
!JZ exit1
!MOV Ebx, Eax
!MOV Eax, Edx
!MOV Edx, Ebx
!SHRD Edx, Eax, cl
!SHRD Eax, Ebx, cl
!exit1:
!POP Ebx
ProcedureReturn
EndProcedure

Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!and cl,63
!cmp cl,32
!jb rotr64_1
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!rotr64_2:
!and cl,31
!jz rotr64_3
!push eax
!shrd eax, edx, cl
!push eax
!mov eax, [esp + 4]
!shrd edx, eax, cl
!pop eax
!rotr64_3:
ProcedureReturn
EndProcedure

a.q = 4123456789

For i=0 To 63
b.q = rotr64(a, i)
c.q = rotr64_(a, i)
If c<>b
Debug "oops: " + Str(i)
EndIf
Next

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 6:07 am
 PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3710
Location: Netherlands
With SSE2 register
Code:
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!movd xmm0, ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!movd ebx, xmm0
ProcedureReturn
EndProcedure

Without
Code:
Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!push ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!pop ebx
ProcedureReturn
EndProcedure

Last edited by wilbert on Wed Aug 24, 2011 10:59 am, edited 2 times in total.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 6:13 am
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Did you test those for accuracy? Run against your earlier plain asm results, the sse fails on 49-63 and the next one fails on 2-63. Something isn't right.

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 6:29 am
 PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3710
Location: Netherlands
Try again. I updated the code. I forgot one instruction.
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 6:37 am
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Perfect, they both work now. The second one is faster, I'm doing more tests but I think it's turning out faster than your original (which hasn't been beat by anything yet)

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 6:57 am
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Quote:
I also used the and mask with value 63 for a reason.
When you rotate a quad 66 times to the right, the result should be the same as when you rotate it twice.

That's going to take a while to sink in but it'll come!

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 7:25 am
 PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3710
Location: Netherlands
netmaestro wrote:
That's going to take a while to sink in but it'll come!

I'm sure you will figure it out

As for speed, if it is really important, you should do it by simulating a C style function
Code:
ProcedureReturn
!mov ecx,[esp + 12]
!btr ecx, 5
!jc rotr64_1
!mov eax,[esp + 4]
!mov edx,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov edx,[esp + 4]
!mov eax,[esp + 8]
!rotr64_2:
!push ebx
!mov ebx, eax
!shrd eax, edx, cl
!shrd edx, ebx, cl
!pop ebx
!ret
EndProcedure

PrototypeC.q proto_rotr64(val.q, n)

When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).
You also avoid some pushing and popping PureBasic does for the functions it generates.

Last edited by wilbert on Wed Aug 24, 2011 10:58 am, edited 1 time in total.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 7:31 am
 PureBasic Bullfrog

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8149
Quote:
When I tested things, a cdecl call seems to be a little bit faster compared to stdcall (please correct me if I'm wrong).

On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

_________________
BERISHEET

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 7:52 am
 PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3710
Location: Netherlands
netmaestro wrote:
On Mac, it's probably faster but all the tests I've done on Windows show the same performance. In my sha256/224 project I've left the prototype in because even though I can't realize a speed difference here on Win7, it's no slower for sure and on Mac it's probably faster. So it stays.

It might have to do with the stack. On OS X the stack (esp) has to be aligned on 16 bytes at the moment you make a call otherwise the app probably will crash.

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 10:29 am
 Enthusiast

Joined: Wed Apr 12, 2006 7:59 pm
Posts: 174
Location: Germany
@wilbert:
SHRD works intern with an and-bit-mask for CL; in 32-bit-mode are only bits 0-4 used. You can remove "and ecx, 63".
Not:
Code:
!and ecx, 63
!btr ecx, 5
!jc rotr64_1

But:
Code:
!test ecx,100000b     ;or cmp ecx,31 / jg ...
!jnz rotr64_1

Helle

Top

 Post subject: Re: Optimizing rotations for quadPosted: Wed Aug 24, 2011 11:01 am
 PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3710
Location: Netherlands
You are right Helle.
I removed the and instruction; should make it a little bit faster.

Last edited by wilbert on Thu Aug 25, 2011 6:16 am, edited 1 time in total.

Top

 Display posts from previous: All posts1 day7 days2 weeks1 month3 months6 months1 year Sort by AuthorPost timeSubject AscendingDescending
 Page 1 of 2 [ 20 posts ] Go to page 1, 2  Next

 All times are UTC + 1 hour

Who is online

Users browsing this forum: No registered users and 1 guest

 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forum

Search for:
 Jump to:  Select a forum ------------------ PureBasic    Coding Questions    Game Programming    3D Programming    Assembly Programming    The PureBasic Editor    The PureBasic Form Designer    General Discussion    Feature Requests and Wishlists    Tricks 'n' Tips Bug Reports    Bugs - Windows    Bugs - Linux    Bugs - Mac OSX    Bugs - IDE    Bugs - Documentation OS Specific    AmigaOS    Linux    Windows    Mac OSX Miscellaneous    Announcement    Off Topic Showcase    Applications - Feedback and Discussion    PureFORM & JaPBe    TailBite