Moving Data: ASM vs PB

Bare metal programming in PureBasic, for experienced users

Post by chris319 »

Here is some PB code:

Code:

For pos.l = subChunk2Size - 2 To 0  Step -2
  PokeL(*FB + (pos << 1), PeekW(*FB + pos))
Next
I have tried to replicate it in ASM but it is slower and I don't think it works the same. Note that we are doing a 16- to 32-bit conversion. Below is my attempt. How would I replicate the PB in ASM?

Code:

FB.l = *FB
EnableASM
MOV esi_temp,ESI ;SAVE NON-VOLATILE REGISTER
 
MOV ESI,FB ;LOAD ESI REGISTER WITH BUFFER ADDRESS
MOV ECX,subChunk2Size ;OFFSET
SUB ECX,2

loop: MOV EDX,[ESI+ECX] ;ADDRESS IS ALREADY IN ESI REGISTER, ECX HOLDS OFFSET
MOV [ESI+ECX*2],EDX
SUB ECX,2
JGE l_loop

MOV ESI,esi_temp ;RESTORE NON-VOLATILE REGISTER
DisableASM 

Post by IdeasVacuum »

Here is how: http://www.purebasic.fr/english/viewtop ... 35&t=48298

Edit: Follow sRod's instructions, near the end of the page

Post by wilbert »

If the goal of converting to ASM is to speed things up, the best approach depends on the value of subChunk2Size.
If you would always convert 2 or 4 words into 2 or 4 dwords, probably the fastest way is to use the PUNPCKLWD instruction.

Post by Thorium »

It can't be slower. Are you sure you have disabled the debugger on your performance test?
And what exactly do you want to do?

Post by chris319 »

Here's what I'm up to:

http://www.purebasic.fr/english/viewtop ... 12&t=39830

FLAC is a lossless compression scheme for audio. Unfortunately, this 16- to 32-bit conversion is a necessary step for encoding to FLAC.

Post by Tenaja »

wilbert wrote:If the goal of converting to ASM is to speed things up, the best approach depends on the value of subChunk2Size.
If you would always convert 2 or 4 words into 2 or 4 dwords, probably the fastest way is to use the PUNPCKLWD instruction.
I am with Wilbert here. If you are regularly going to move more than about a dozen variables' worth of data (and in audio you would be), then move it in the largest "native" size possible, which is always an integer. Run your loop for the count divided by the size ratio (i.e. 2 on 32-bit and 4 on 64-bit if you are working with 16-bit words), and then handle the remainder afterwards.

Post by chris319 »

Thorium wrote:Are you sure you have disabled the debugger on your performance test?
Ah yes, disabling the debugger made all the difference WRT speed. Now there are issues regarding the 16- to 32-bit conversion because the FLAC encoder isn't encoding properly.

Post by Helle »

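@chris319: Your MOV EDX,[ESI+ECX] reads 32 bits, not 16, so the high word of each stored value is wrong. And why not use this scheme, without ReadData:

Code:

Global *FB = AllocateMemory(Lof(File) * 2) ; twice the file size: each 16-bit sample becomes 32 bits
j = 0
For i = 0 To Lof(File) - 2 Step 2
  PokeL(*FB + j, ReadWord(File)) ; PokeL rather than PokeW, so the signed word from ReadWord() is stored sign-extended
  j + 4
Next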
Code from me without ASM :D ! Off Topic :lol: !
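
To illustrate the high-word point, here is a minimal PureBasic sketch (the buffer name and the two sample values are made up for the demonstration). It stores two adjacent 16-bit samples and reads the first one back both ways: PeekW returns just that sample, while PeekL also pulls the neighbouring sample into the upper 16 bits, which is what a 32-bit MOV from the same address does. Debug output only appears when the debugger is enabled.

Code:

*buf = AllocateMemory(8)
If *buf
  PokeW(*buf, -2)      ; sample 0, stored as bytes FE FF
  PokeW(*buf + 2, 5)   ; sample 1, stored as bytes 05 00
  Debug PeekW(*buf)    ; 16-bit signed read: -2
  Debug PeekL(*buf)    ; 32-bit read: $0005FFFE = 393214
  FreeMemory(*buf)
EndIf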

Post by chris319 »

Now for some stats:

PB For ... Next with debugger: 1560 ms

PB For ... Next without debugger: 94 ms

ASM with debugger: 1996 ms

ASM without debugger: 31 ms

Without the debugger, ASM is three times faster than PB.
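
For anyone wanting to reproduce figures like these, a minimal timing sketch follows. The buffer size is an arbitrary assumption, chosen only so the loop runs long enough to measure, and the buffer holds zeros rather than real audio. ElapsedMilliseconds() and the #PB_Compiler_Debugger constant are standard PureBasic; the variable names t0 and elapsed are just for illustration.

Code:

subChunk2Size.l = 32 * 1024 * 1024        ; assumed size in bytes of the 16-bit data
*FB = AllocateMemory(subChunk2Size * 2)   ; room for the 32-bit result

If *FB
  CompilerIf #PB_Compiler_Debugger
    Debug "Debugger is on: timings will be inflated"
  CompilerEndIf

  t0.q = ElapsedMilliseconds()
  For pos.l = subChunk2Size - 2 To 0 Step -2
    PokeL(*FB + (pos << 1), PeekW(*FB + pos))
  Next
  elapsed.q = ElapsedMilliseconds() - t0

  MessageRequester("Timing", "PB loop: " + Str(elapsed) + " ms")
  FreeMemory(*FB)
EndIf

The same wrapper can be placed around the ASM loop, so that both versions are timed on an identical buffer with identical compiler settings.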

Post by chris319 »

Helle wrote:@chris319: You read 32-Bit, not 16-Bit (High-Word!). And: Why not this scheme without ReadData:

I presume you're talking about the FLAC encoder example program? You should bring that to the attention of the original author, oryaaaaa. All I did was make his program usable (as posted, it won't compile) and clean it up a little, as noted in my post in that thread. Feel free to enhance it as you see fit. You can download the DLL from http://sourceforge.net/projects/flac/fi ... 1.2.1-win/ You want the zip file named flac-1.2.1-devel-win.zip.

Post by Tenaja »

chris319 wrote:
Thorium wrote:Are you sure you have disabled the debugger on your performance test?
Ah yes, disabling the debugger made all the difference WRT speed. Now there are issues regarding the 16- to 32-bit conversion because the FLAC encoder isn't encoding properly.
This is because debugger code is executed with each command, whether it is an asm command or a PB command.
Since hand-written asm requires more individual commands to get the work done than pb commands, that debugger code is executed more per given task.

Post by chris319 »

Well, it's faster, but the 16- to 32-bit conversion isn't working the same as peek and poke. Examining the ASM code for the PB loop reveals calls to the external routines PB_PeekW and PB_PokeL.
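
One way to pin down a mismatch like this is to run the PB loop on a few known samples and note what it produces, then compare the ASM version against the same values. The sketch below uses four made-up test samples, including the extremes of the signed 16-bit range. Debug output requires the debugger to be enabled.

Code:

subChunk2Size.l = 8                       ; four 16-bit samples = 8 bytes
*FB = AllocateMemory(subChunk2Size * 2)   ; room for four 32-bit samples

If *FB
  PokeW(*FB + 0, 1)
  PokeW(*FB + 2, -1)
  PokeW(*FB + 4, 32767)
  PokeW(*FB + 6, -32768)

  ; Reference conversion: walk backwards so no unread sample is overwritten
  For pos.l = subChunk2Size - 2 To 0 Step -2
    PokeL(*FB + (pos << 1), PeekW(*FB + pos))
  Next

  For i.l = 0 To 3
    Debug PeekL(*FB + i * 4)              ; prints 1, -1, 32767, -32768
  Next
  FreeMemory(*FB)
EndIf

If an ASM replacement prints anything else for the negative samples, the sign extension is the likely culprit.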

Post by chris319 »

EUREKA!

The solution to my dilemma lies in CWDE, which converts the signed word in AX to a doubleword in EAX. Works great now!

Code:

FB = *FB
esi_temp.l ;STORAGE FOR NON-VOLATILE REGISTER
EnableASM
MOV esi_temp,ESI ;SAVE NON-VOLATILE REGISTER (EAX, ECX AND EDX ARE VOLATILE AND NEED NO SAVING)

MOV ESI,FB ;LOAD ESI REGISTER WITH BUFFER ADDRESS
MOV ECX,subChunk2Size ;OFFSET
SUB ECX,2

loop: MOV AX,word[ESI+ECX] ;LOAD ONE 16-BIT SAMPLE; ESI HOLDS ADDRESS, ECX HOLDS OFFSET
CWDE ;SIGN-EXTEND AX INTO EAX (16 TO 32 BITS)
MOV [ESI+ECX*2],EAX ;STORE IN MEMORY
SUB ECX,2
JGE l_loop

MOV ESI,esi_temp ;RESTORE NON-VOLATILE REGISTER
DisableASM

Post by wilbert »

This is how you could do it using SSE2, but I don't know if it is much faster.

Code:

bytes_to_process = subChunk2Size

num_bytes = (bytes_to_process + 7) & -8 ; Make sure we always process a multiple of 8 bytes
*mem = AllocateMemory(num_bytes * 2 + 15) ; Allocate 15 bytes extra so we have room to use aligned memory
*FB = (*mem + 15) & -16 ; Aligned memory pointer

EnableASM
MOV edx, *FB
MOV ecx, num_bytes
DisableASM
!jmp c16_32entry
!c16_32loop:
!movq xmm0, [edx + ecx] ; load four 16-bit samples
!punpcklwd xmm0, xmm0 ; duplicate each word into both halves of a dword
!psrad xmm0, 16 ; arithmetic shift right leaves the sign-extended sample
!movdqa [edx + ecx * 2], xmm0 ; aligned store of four 32-bit samples
!c16_32entry:
!sub ecx, 8 ; step back 8 bytes (four samples)
!jnc c16_32loop ; loop until ecx would go below zero