There was once a stupid trick using local lookup tables for direct conversion, via xlat, or modern equivalent.
You can build a 256 entry table mapping 8 bits to # of bits, and then do an xlat. It's only useful in stupid cases...and nothing else.
With a swap and an add, you can do 16. Duplicate for 32, etc. I don't know modern latencies for memory access, so it will probably be horribly slow. Also, setup is very slow, so it will only look better if you count performance after the table is built.
Even if it works, I will be surprised. But it was the fast way when processors were slower and fewer instructions.
Population count
- Michael Vogel
- Addict
- Posts: 2677
- Joined: Thu Feb 09, 2006 11:27 pm
- Contact:
Re: Population count
I don't have a better assembler code which is clear because wilbert & co are doing the best job...
...but I just want to add two ideas how to it also possible to achieve a fine speed:
1) create tables (PopCount9a is as fast as the best routines above)
2) use macros (PopCount9b only needs 1/3 of the time)
...but I just want to add two ideas how to it also possible to achieve a fine speed:
1) create tables (PopCount9a is as fast as the best routines above)
2) use macros (PopCount9b only needs 1/3 of the time)
Code: Select all
; Initialization
Global Dim pc9table.i($FFFF)
Global pc9
For pc9=0 To $FFFF
pc9table(pc9)=PopCount0(pc9); any procedure you like here
Next pc9
Procedure.l PopCount9a (v.l)
ProcedureReturn pc9table(v&$FFFF)+pc9table(v>>16)
EndProcedure
Macro PopCount9b (v)
pc9table(v&$FFFF)+pc9table(v>>16)
EndMacro
-
- User
- Posts: 61
- Joined: Sun Apr 03, 2016 12:03 am
Re: Population count
Such tables also work well for trig, curves or "analog" functions.
The xlat/table thing was very handy for weighting in neural nets, fuzzy logic and weighted thresholds with multiple outputs.
The xlat/table thing was very handy for weighting in neural nets, fuzzy logic and weighted thresholds with multiple outputs.
Re: Population count
I was interested in this stupid trick!Bo Marchais wrote:There was once a stupid trick using local lookup tables for direct conversion, via xlat, or modern equivalent. You can build a 256 entry table mapping 8 bits to # of bits, and then do an xlat. It's only useful in stupid cases...and nothing else.
Code: Select all
Procedure.i GetBitCount(byte.a)
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
! mov rax, [p.v_byte]
! mov rdx, rbx
! mov rbx, BitCounts
! xlatb
! mov rbx, rdx
CompilerElseIf #PB_Compiler_Processor = #PB_Processor_x86
! mov eax, [p.v_byte]
! mov edx, ebx
! mov ebx, BitCounts
! xlatb
! mov ebx, edx
CompilerEndIf
ProcedureReturn
DataSection
!BitCounts:
Data.a 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
Data.a 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
Data.a 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
Data.a 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
EndDataSection
EndProcedure
Code: Select all
t0 = 36 ms (10%) <-- popcnt instruction
t1 = 350 ms (100%)
t2 = 420 ms (120%)
t3 = 1032 ms (294%)
t4 = 1906 ms (544%)
t5 = 1837 ms (524%)
t6 = 1242 ms (354%)
t7 = 57 ms (16%)
t8 = 44 ms (12%) <-- this demo using xlatb
Re: Population count
@Keya, your x64 code didn't work on MacOS.
I had to change it into this using lea
Lookups are fast but even faster without xlatb
I had to change it into this using lea
Code: Select all
! mov rax, [p.v_byte]
! mov rdx, rbx
! lea rbx, [BitCounts]
! xlatb
! mov rbx, rdx
Code: Select all
Procedure.i GetBitCount(byte.a)
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
!movzx eax, byte [p.v_byte]
!lea rdx, [BitCounts]
!movzx eax, byte [rdx + rax]
CompilerElseIf #PB_Compiler_Processor = #PB_Processor_x86
!movzx eax, byte [p.v_byte]
!movzx eax, byte [BitCounts + eax]
CompilerEndIf
ProcedureReturn
DataSection
!BitCounts:
Data.a 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6
Data.a 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
Data.a 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7
Data.a 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
EndDataSection
EndProcedure
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)