Last week wilbert wrote some code for me that rotates quads using just the general-purpose registers. It turned out to be faster than any of the SSE attempts. Today I tried writing my own version without looking at his procedures, to see how similar the code would be. The result was interesting: for rotations of 31 bits or less, my code is faster (here, anyway), and for rotations of 32 bits or more, wilbert's code is faster. So a good question might be: is there a way to merge the two somehow to get the best features of both? (Assuming the speed differences show up on other people's machines too.)
Code:
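Procedure.q rotr64(val.q, n) ; By netmaestro
!mov eax, [esp+4] ; low dword of val
!mov edx, [esp+8] ; high dword of val
!mov ecx, [esp+12] ; n
!push ebx ; ebx is not a volatile register, so save it on both paths
!cmp ecx, 31
!jg large1
!mov ebx, edx ; keep the old high dword
!shrd edx, eax, cl ; new high dword
!shrd eax, ebx, cl ; new low dword
!jmp exit1
!large1:
!sub ecx, 32 ; count left over after the fixed rotate by 32
!mov ebx, edx ; rotate by 31, then by 1, which adds up to 32
!shrd edx, eax, 31
!shrd eax, ebx, 31
!mov ebx, edx
!shrd edx, eax, 1
!shrd eax, ebx, 1
!mov ebx, edx ; then rotate by the remaining count
!shrd edx, eax, cl
!shrd eax, ebx, cl
!exit1:
!pop ebx
ProcedureReturn
EndProcedure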
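Procedure.q Rotr64_(val.q, n) ; By wilbert
!mov ecx,[esp + 12] ; n
!and cl,63 ; rotation count modulo 64
!cmp cl,32
!jb rotr64_1
!mov edx,[esp + 4] ; n >= 32: load the halves swapped, which rotates by 32
!mov eax,[esp + 8]
!jmp rotr64_2
!rotr64_1:
!mov eax,[esp + 4] ; n < 32: load low and high dwords normally
!mov edx,[esp + 8]
!rotr64_2:
!and cl,31 ; remaining count, 0 to 31
!jz rotr64_3 ; nothing left to rotate
!push eax ; save the old low dword
!shrd eax, edx, cl ; new low dword
!push eax
!mov eax, [esp + 4] ; reload the old low dword
!shrd edx, eax, cl ; new high dword
!pop eax ; new low dword back into eax
!add esp,4 ; drop the saved old low dword
!rotr64_3:
ProcedureReturn
EndProcedure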
a.q = $0123456789ABCDEF ; test value, any quad will do
tm1 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = rotr64(a, 31)
Next
tm2 = ElapsedMilliseconds()
For i = 1 To 100000000
b.q = Rotr64_(a, 31)
Next
tm3 = ElapsedMilliseconds()
msg.s = "test1: " + Str(tm2 - tm1) + Chr(13) + Chr(10)
msg + "test2: " + Str(tm3 - tm2) + Chr(13) + Chr(10)
MessageRequester("Speed", msg)
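One possible way to merge the two, as a rough sketch only: take the register swap idea from wilbert's code for counts of 32 or more (swapping the halves is a rotate by 32), then finish with the short mov/shrd/shrd sequence from my code. The procedure name Rotr64_merged and the label rotr64m_1 are made up for this example, and it assumes the same x86 stack layout as the two procedures above. I have not timed it, so whether it actually beats either version is an open question.

```purebasic
Procedure.q Rotr64_merged(val.q, n) ; hypothetical merge, not benchmarked
  !mov eax, [esp + 4]   ; low dword of val
  !mov edx, [esp + 8]   ; high dword of val
  !mov ecx, [esp + 12]  ; n
  !test cl, 32          ; is bit 5 of the count set (32..63 after wrapping)?
  !jz rotr64m_1
  !xchg eax, edx        ; swapping the halves rotates by 32
  !rotr64m_1:
  !push ebx             ; ebx must be preserved
  !mov ebx, edx         ; keep the old high dword
  !shrd edx, eax, cl    ; shrd only uses the low 5 bits of cl (count mod 32)
  !shrd eax, ebx, cl    ; a masked count of 0 leaves the registers unchanged
  !pop ebx
  ProcedureReturn
EndProcedure
```

Because shrd masks the count itself, this sketch needs no explicit `and cl,31` or zero check, and the count should simply wrap modulo 64. It can be dropped into the timing loop above as a third test, for example `b.q = Rotr64_merged(a, 31)`.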
Btw (offtopic): Thanks for the new subforum, Fred & Freak
