AVX-Example

Bare metal programming in PureBasic, for experienced users
Helle
Enthusiast
Enthusiast
Posts: 178
Joined: Wed Apr 12, 2006 7:59 pm
Location: Germany
Contact:

AVX-Example

Post by Helle »

Hey, what´s that? An assembler-forum and no new codes!?
A great new field for assembly-programming are the AVX-Instructions of the new Intel-(Sandy Bridge) and AMD-(Bulldozer, next month?) CPUs. Great: 256-bit-register (YMM)!
This a simple example for this, only for fun:

Code: Select all

;- Computes the sine (Taylor/Maclaurin series, double-precision) with AVX
;- "Helle" Klaus Helbing, 16.09.2011, PB v4.51 (x64)
;- Sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + (x^13)/13!... 
;- Any results (CPU: Intel i7-2600, 3.4GHz):
;-- Test for Radiant = 9.87654321 (12 terms, with range-reduce):
;-- AVX : -0.436554369141429 = -25.0127228798°  in 171 ms
;-- PB  : -0.436554369141429 = -25.0127228798°  in 343 ms
;-- Test for Radiant = 9.87654321 (only first 4 terms, with range-reduce):
;-- AVX : -0.436554369145432 = -25.0127228800°  in 156 ms
;-- PB  : -0.436554369141429 = -25.0127228798°  in 343 ms
;-- Test for Radiant = 1.23456789 (only first 4 terms, without range-reduce): 
;-- AVX : 0.944005976932239 = 54.0875583133°  in 63 ms
;-- PB  : 0.944005725004535 = 54.0875438789°  in 343 ms
;-- Test for Radiant = 1.23456789 (12 terms, without range-reduce):
;-- AVX : 0.944005725004535 = 54.0875438789°  in 93 ms
;-- PB  : 0.944005725004535 = 54.0875438789°  in 343 ms

;- need a CPU with AVX-support (Sept.2011: Intel "Sandy Bridge") and the actual FAsm.exe from http://flatassembler.net 
Procedure.d SinAVX(x.d)                ;x=radiant, parameter in XMM0/YMM0
  ;this part is for reduce to range 0-Pi/2 and set signum.  If x is in this range (or a little bigger, in your application), you can skip this (faster)
  !vmovsd xmm2,qword[Pi_Half]          ;load XMM2 with Pi_Half (1.570796326794896619) 
  !vdivsd xmm0,xmm0,xmm2               ;divide XMM0 (x=radiant) by XMM2 (Pi_Half), result in XMM0 
  !vcvttsd2si rax,xmm0                 ;convert the result (float double precision) to integer with truncation (like Int(x) in PB) 
  !vcvtsi2sd xmm1,xmm1,rax             ;convert the integer in RAX to float double precision in XMM1
  !vsubsd xmm0,xmm0,xmm1               ;subtract XMM1 from XMM0, result in XMM0 
  !vmulsd xmm0,xmm0,xmm2               ;multiply XMM0 with XMM2 (Pi_Half), result in XMM0
  !test rax,1                          ;test for quadrant 2 and 4 (6,8...)
  !jz @f                               ;no
  !vaddsd xmm0,xmm0,xmm2               ;add Pi_Half (XMM2) to XMM0, result in XMM0  
!@@:
  !test rax,2                          ;test for quadrant 3 and 4 (and all negatives)
  !jz @f                               ;no
  !vxorpd xmm0,xmm0,[Minus]            ;change (set) bit 63 (signum) 
!@@:
  ;calculate the first 4 terms (without start-value x), 1=0-63, 2=64-127, 3=128-191, 4=192-255
  !vmovddup xmm0,xmm0                  ;XMM0: 1=x^1 2=x^1  duplicate bits 0-63 in bits 64-127
  !vinsertf128 ymm1,ymm0,xmm0,1b       ;YMM1: 1=x^1 2=x^1 3=x^1 4=x^1  YMM1(0-127)=XMM0, YMM1(128-255)=XMM0
  !vmulpd ymm2,ymm1,ymm1               ;YMM2: 1=x^2 2=x^2 3=x^2 4=x^2
  !vmulsd xmm3,xmm2,xmm2               ;YMM3: 1=x^4 2=x^2 3=x^0 4=x^0
  !vmulpd ymm4,ymm2,ymm2               ;YMM4: 1=x^4 2=x^4 3=x^4 4=x^4
  !vmulpd ymm2,ymm4,ymm4               ;YMM2: 1=x^8 2=x^8 3=x^8 4=x^8  for the next 4 terms
  !vmulpd ymm3,ymm3,ymm1               ;YMM3: 1=x^5 2=x^3 3=x^0 4=x^0
  !vmulpd ymm1,ymm3,ymm4               ;YMM1: 1=x^9 2=x^7 3=x^0 4=x^0
  !vperm2f128 ymm5,ymm1,ymm3,100000b   ;YMM5: 1=x^9 2=x^7 3=x^5 4=x^3  YMM5(0-127)=YMM1(0-127), YMM5(128-255)=YMM3(0-127)  

  !vmovupd ymm4,yword[RezFak]          ;load the first 4 reciprocals factorials (-1/3! ... 1/9!) in YMM4
  !vmulpd ymm1,ymm5,ymm4               ;multiply the 4 values in YMM4 with YMM5, result in YMM1
  ;next 4 terms, I think, too short for a loop. For lower precision (faster) you can reduce the steps 
  !vmulpd ymm5,ymm5,ymm2               ;YMM5: 1=x^17 2=x^15 3=x^13 4=x^11
  !vmulpd ymm3,ymm5,yword[RezFak+32]
  !vaddpd ymm1,ymm3,ymm1
  ;next 4 terms
  !vmulpd ymm5,ymm5,ymm2               ;YMM5: 1=x^25 2=x^23 3=x^21 4=x^19
  !vmulpd ymm3,ymm5,yword[RezFak+64]
  !vaddpd ymm1,ymm3,ymm1

  !vhaddpd ymm2,ymm1,ymm1              ;YMM2: 1=1+2 of YMM1, 3=3+4 of YMM1 
  !vextractf128 xmm1,ymm2,1b           ;XMM1: 3+4 of YMM2
  !vaddsd xmm3,xmm2,xmm1               ;XMM3: 1=sum of iterations 

  !vaddsd xmm0,xmm0,xmm3               ;XMM0: 1=sum of iterations plus start-value (1.term=x)

  !vzeroupper                          ;set YMM0H-YMM15H to zero

!vmovsd qword[rsp+48],xmm0           ;because PB is not ABI (Application Binary Interface)-conform (standard: return-value in XMM0/YMM0)
!fld qword[rsp+48] 

 ProcedureReturn

!Minus:
  !dq 8000000000000000h           ;for change (set) bit 63 (signum)
!Pi_Half:
  !dq  1.570796326794896619
!RezFak:
  !dq  2.755731922398589065e-6    ; 1/9!   4.Iteration
  !dq -1.984126984126984127e-4    ;-1/7!   3.Iteration
  !dq  8.333333333333333333e-3    ; 1/5!   2.Iteration
  !dq -1.666666666666666667e-1    ;-1/3!   1.Iteration

  !dq  2.811457254345520763e-15   ; 1/17!  8.Iteration
  !dq -7.647163731819816476e-13   ;-1/15!  7.Iteration
  !dq  1.605904383682161460e-10   ; 1/13!  6.Iteration
  !dq -2.505210838544171878e-8    ;-1/11!  5.Iteration

  !dq  6.446950284384473396e-26   ; 1/25!  12.Iteration
  !dq -3.868170170630684038e-23   ;-1/23!  11.Iteration
  !dq  1.957294106339126123e-20   ; 1/21!  10.Iteration
  !dq -8.220635246624329717e-18   ;-1/19!  9.Iteration
EndProcedure 

Rad.d = 9.87654321

;- Test AVX
T1= ElapsedMilliseconds() 
For i = 0 To 9999999 
  SinAVX.d = SinAVX(Rad)
Next 
T1 = ElapsedMilliseconds() - T1 

;- Test PB
T2= ElapsedMilliseconds() 
For i = 0 To 9999999 
  SinPB.d = Sin(Rad) 
Next 
T2 = ElapsedMilliseconds() - T2 

Sin$ = "Test for Radiant = " + StrD(Rad) + #CRLF$ + #CRLF$
Sin$ + "AVX : " + StrD(SinAVX, 15) + " = " + StrD(SinAVX * 180 / #PI) + "°  in " + Str(T1) + " ms" + #CRLF$ 
Sin$ + "PB   : " + StrD(SinPB, 15) + " = " + StrD(SinPB * 180 / #PI) + "°  in " + Str(T2) + " ms" + #CRLF$ 
MessageRequester("Sinus Double-Precision with AVX", Sin$) 

End
Have fun!
Helle