Do you have rules to optimise x86 x64 conversion ?

Fig · Post by **Fig** » Sun Dec 16, 2018 9:16 pm

How do you optimise a x86 code to x64 ?

I converted a x86 to x64 code but it's actually 2 times slowler in x64.
I was expecting the x64 code to be as fast as x86 with these simple macros.

Code: Select all

CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
    Macro eax : rax : EndMacro
    Macro ebx : rbx : EndMacro
    Macro ecx : rcx : EndMacro
    Macro edx : rdx : EndMacro
    Macro esi : rsi : EndMacro
    Macro edi : rdi : EndMacro
    Macro ebp : rbp : EndMacro
    Macro esp : rsp : EndMacro
    Macro dword : qword :EndMacro
    Macro octets : 8 :EndMacro
    Debug "Heapx64"
CompilerElse
    Macro octets : 4 :EndMacro
    Debug "x86"
CompilerEndIf

Olliv · Post by **Olliv** » Mon Dec 17, 2018 1:34 am

Hello fig,

could you show an little example to test this gap?

Post by **Fred** » Mon Dec 17, 2018 10:49 am

Without code, we can't tell but x64 has way more registers available, so you should use them instead of stack

Fig · Post by **Fig** » Mon Dec 17, 2018 5:39 pm

I didn't want to bother you with code, sorry.

1/ I would like to understand why I am wrong thinking x86 is as fast as x64 if you just replace x86 register by x64.
My logic was:
Fact (?) :
My cpu's memory bus is 64 bits large. My Cpu as well and also my Os.
Hypothesis :
If I ask for a 32bit dword she (yes my computer is a girl) has to move 64 bits anyway.

So if i replace 32bits register by 64 ones, it should be as fast without any further optimisation.
My result shows me wrong ...

2/ I was looking for general rules or strategy to optimise x86 in x64.
As suggested, I can use more register to prevent access to memory in loop.
I guess I can also keep a 32 bits data format, transfer 2 in one qword and compute double data (if my code fits this logic)
Anything else ?

3/ Some code... But It's probably pretty indigest and clumsy as it is. Nobody should force himeself into this.
Warning, it took arround 80 sec on my computer to get a result to my speed test (there are two of them), so be patient if you try this.
It's a hash table home made for my specific need. It uses chained clusters of 15 lines (I record 7 values per line and use X and Y coordonates as key)
I compare its speed to Pb's in my use (ie: 8 adds, one search and find result, one search and don't find result)
I compare also x64 vs x86 speed.
My point is not to compare Pb general Map to a specific Hashtable, of course.

; Add 8 000 000 elements and SEARCH/FIND 500 000 elements et Search and don't find 500 000 elements
; ___x86______ x64
; 26224ms___40305ms :My Hash add elements
; 80436ms___87073ms ;Pb Hash add elements
;___144ms_____585ms ;My Hash Searching for elements
;___589ms_____890ms ;Pb Hash Searching for elements

[

wilbert · Post by **wilbert** » Mon Dec 17, 2018 5:59 pm

Fig wrote:I was looking for general rules or strategy to optimize x86 in x64.

I think it's hard to give a general rule.
If you look at this document
https://www.agner.org/optimize/instruction_tables.pdf
you will see that on most processors a 64 bit division for example is much slower compared to doing a 32 bit division.
So it's not wise to simply replace every dword with a qword when you convert your code.
Memory alignment can also make a difference when it comes to performance.

Fig · Post by **Fig** » Mon Dec 17, 2018 6:11 pm

I will get rid of Div by shifting and substracting to obtain the rest... Thank you for pointing this.
Concerning memory alignment, it shouldn't be a problem as all my data are 64 bits when I'am in x64 mode, i don't mix different sizes.

Nethertheless, what's wrong with my logic ? I have a doubt... Do computers with x64 cpu have 64bits memory bus width ? (not inside the cpu, i mean bus between CPU/Ram)

djes · Post by **djes** » Mon Dec 17, 2018 6:34 pm

Maybe you could try some registers only tests, then register to memory, and so on. For what I know, it's never as simple and logic as it seems. For example, having a 48 cores CPU versus 2 cores, when program is not made for this. Or program logic not adapted to specific cpu/cache sizes. Somehow, if a JIT compiler can adjust its optimization scheme to a specific platform, a fixed compiled program can also have specific paths.

Olliv · Post by **Olliv** » Mon Dec 17, 2018 9:46 pm

Not easy...

First, you begin your procedures by a transit like :

Code: Select all

Procedure ProcName(*Ptr)

MyHighLevelVar = *Ptr + x ; transit
...

EnableASM
...

Reserve RBP to base procedure argument pointor.

Code: Select all

Procedure ProcName(*Ptr)

; no transit

EnableASM
mov RBP, [p.p_ptr]

; no ASM transit either : use directly in operations statements
; i.e. :
add Rbx, [RBP+8]
; etc...

No there is no 64 bits addressing between CPU and RAM, 64 bits is the data RAM way size. But you have no access to this. For each threaded code, you have a 64 bits memory partial access : internal hardware converts your virtual addresses to physical addresses.

Olliv · Post by **Olliv** » Mon Dec 17, 2018 10:01 pm

Code: Select all

INC R14

+

Code: Select all

Mov rax,r14
Mul Rbp

=

Code: Select all

Add rax,rbp

no mul.

Olliv · Post by **Olliv** » Mon Dec 17, 2018 10:13 pm

Code: Select all

MOV edi,edx ;sauvegarde l'adresse du cluster précédent
MOV ebp,dword [edx+octets] ;charge l'adresse du cluster suivant
CMP ebp,0
Jcc Label

Use pipeline

Code: Select all

Xor ebp,ebp
Mov edi,edx
Add ebp,dword
Jcc Label

Same on X64

Olliv · Post by **Olliv** » Mon Dec 17, 2018 10:40 pm

Code: Select all

ProcedureDLL SuperiorPower2(number.i)
EnableASM
MOV ebx,[p.v_number]
BSR ecx,ebx
MOV eax,1
SAL eax,cl
CMP eax,ebx
JE _suite
SAL eax,1
!_suite:
ProcedureReturn
DisableASM
EndProcedure

I do not know how is [p.v_number] behavior after your initial Exx -> Rxx macro.

For the rest, it seems that a useless high 32 bits running occurs but I did not find where... No computer...

Olliv · Post by **Olliv** » Mon Dec 17, 2018 10:48 pm

fig wrote:I will get rid of Div by shifting and substracting to obtain the rest... Thank you for pointing this

Just a AND mask to get modulo.

Code: Select all

! AND RAX, 1 ; rest of modulo 2
! AND Rbx, 3 ; rest of modulo 4
! AND RCX, 7 ; rest of modulo 8
; etc...

NicTheQuick · Post by **NicTheQuick** » Tue Dec 18, 2018 11:37 am

Because I am a noob with ASM, I just want to point out some architecture dependent ideas, but maybe you know them already.

If your CPU wants a specific value from your memory it reads a fixed sized memory block into one cache line, so it can have faster access to that specific address and the memory around it.
If you need fast access to a bunch of data try to align it properly, so your CPU does not have to read from two memory locations and merge the data together before it can use it.
Try to store data in continuous memory blocks instead of linking small bunches of memory together like LinkedLists do. This way cache misses are way more infrequent.
When traversing through a multi dimensional array always traverse the way the elements are stored in the physical memory. For example when iterating over each pixel of an image, your outer for loop should iterate over Y and your inner loop over X.

Olliv · Post by **Olliv** » Tue Dec 18, 2018 3:10 pm

NicTheQuick wrote:Because I am a noob with ASM,

Hello,

I forgot introducing my suggestings by such a message. I am noob too. CPU is very complex. I just took many hours, any years ago, to measure several ASM statements groups with RDTSC.

Old forum users published too, many very good tests any years ago, I do not remember the links in this forum.

I think fig is searching to manage discontinuous datas, what it seems to be solved by using special technical ways.

X86/X64 treat discontinuous datas, obsviously by storing code addresses in cache. I can say this, after several performance tests when several jumping calls do not respect a linear benchmark. The problem here is that the informations are not code addresses but datas addresses, what it grows the coding difficulty level.

It would be like if slots datas were subroutine codes, what it would require to modify ASM code on the fly.

Fig · Post by **Fig** » Fri Dec 21, 2018 8:50 pm

I'll have some times to test your suggestions, Olliver, thank you very much !

I'll also try small tests as suggested by Djes, to better understand how it works.

Nicthequick wrote:Try to store data in continuous memory blocks instead of linking small bunches of memory together like LinkedLists do. This way cache misses are way more infrequent.

I checked every condition, except this one but I did my best to solve it.
Hastable need either to allocate linkedlist elements or to copy and reallocate the whole table if it gets full (open adressing).
So I just use linked chunks of memory. As clusters, they are a whole contigous memory area and as linked, I just allocate a new cluster when one is full.(or load factor is up to a value...)

PureBasic Forums - English

Do you have rules to optimise x86 x64 conversion ?

Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?

Re: Do you have rules to optimise x86 x64 conversion ?