Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 3:47 pm
by IdeasVacuum
OK, got a bit closer, but not correct:

Code: Select all

;=========== TEST file start ====================================================

If CreateFile(0, "c:\test.txt")

  sLine.s = "abcdefghijklmnopqrstuvwxyz and that is all the letters I can think of"
  ilen.i = Len(sLine)

  For i = 1 To 31234
    WriteStringN(0, sLine)
  Next

  FlushFileBuffers(0)
  CloseFile(0)

EndIf

;=========== TEST file End =====================================================

Procedure.i CountFileLines(sFile.s)
;----------------------------------
  Protected filesize.i = 0
  EnableASM
  If ReadFile(0, sFile)
    *loc = AllocateMemory(Lof(0))
    ReadData(0, *loc, Lof(0))
    filesize = Lof(0)
    CloseFile(0)
  EndIf

  cnt = 0
  !xor ecx, ecx            ; linecount = 0
  !mov edx, [p.p_loc]      ; readpointer = *loc
  !mov eax, [p.v_filesize] ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [edx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc ecx               ;   Else, increment the linecount
    !skip:                 ;
    !inc edx               ;   readpointer + 1
    !dec eax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [p.v_cnt], eax

  DisableASM

  FreeMemory(*loc)
  ProcedureReturn

EndProcedure

LineCnt.i = CountFileLines("c:\test.txt")
MessageRequester("", Str(LineCnt) + " lines reported")
Edit: Ah, you got there before me netmaestro, so I should now be able to see where I went wrong. :?

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 4:00 pm
by IdeasVacuum
Well, I was very close. The bit I got wrong came from the Help: moving the count result to eax and using ProcedureReturn without an expression to return the content of eax. I'm not sure why that would not work, but then I have never used FASM and have not done any assembler work for about 30 years (that was on CP/M Z80).

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 4:04 pm
by wilbert
It's better to count only #LF.
That gives the correct result on both Windows and OS X.

Here's an SSE2 approach:

Code: Select all

Procedure.i CountFileLines(filename.s)
  Protected result.i
  Protected f.i = ReadFile(#PB_Any, filename)
  Protected num_bytes.i = (Lof(f) + 31) & -32
  Protected *mem = AllocateMemory(num_bytes + 16)
  Protected *pos = (*mem + 15) & -16
  ReadData(f, *pos, num_bytes)
  CloseFile(f)
  !pxor xmm5, xmm5
  !mov edx, 0x0a0a0a0a
  !movd xmm4, edx
  !pshufd xmm4, xmm4, 0
  !mov edx, 0x01010101
  !movd xmm3, edx
  !pshufd xmm3, xmm3, 0
  !pxor xmm0, xmm0
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !mov edx, [p.p_pos]
    !mov ecx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [edx]
    !movdqa xmm2, [edx + 16]
  CompilerElse
    !mov rdx, [p.p_pos]
    !mov rcx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [rdx]
    !movdqa xmm2, [rdx + 16]
  CompilerEndIf
  !pcmpeqb xmm1, xmm4
  !pcmpeqb xmm2, xmm4
  !pand xmm1, xmm3
  !psubb xmm1, xmm2
  !pshufd xmm2, xmm1, 00001110b
  !paddb xmm1, xmm2
  !pshufd xmm2, xmm1, 00000001b
  !paddb xmm1, xmm2
  !punpcklbw xmm1, xmm5
  !punpcklwd xmm1, xmm5
  !paddd xmm0, xmm1
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !add edx, 32
    !sub ecx, 32
  CompilerElse
    !add rdx, 32
    !sub rcx, 32
  CompilerEndIf
  !jnz loop32b
  !pshufd xmm1, xmm0, 00001110b
  !paddd xmm0, xmm1
  !pshufd xmm1, xmm0, 00000001b
  !paddd xmm0, xmm1
  !movd [p.v_result], xmm0
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 4:26 pm
by IdeasVacuum
Wow wilbert - even faster! SSE2 goes a long way back (it arrived with the Pentium 4, around 2000-2001), so there can't be many PCs that this would not work on.

Edit: That's a good point about line terminations by the way, though I thought they were like this:

Windows: CR + LF
Unix: LF
Mac: CR

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 4:48 pm
by ts-soft
IdeasVacuum wrote:
Windows: CR + LF
Unix: LF
Mac: CR

Mac OS after version 9 (i.e. Mac OS X) uses LF!
http://en.wikipedia.org/wiki/Newline
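
To illustrate the point about line endings, here is a minimal sketch in C (not PureBasic; the function name `count_lf` is made up for this example). It counts only LF bytes, so a CR + LF file and an LF-only file with the same lines give the same count, while a classic Mac CR-only file would report zero lines.

```c
#include <stddef.h>

/* Count LF (0x0A) bytes in a buffer. CR bytes are ignored entirely,
   so "a\r\nb\r\n" and "a\nb\n" both yield 2. */
size_t count_lf(const char *buf, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            lines++;
    }
    return lines;
}
```

Note that a last line without a trailing LF is not counted by this approach, which matches what counting terminators (rather than lines) implies.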

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 4:57 pm
by netmaestro
Well, I didn't keep first place for long! Wilbert's code executes here in 4-5 ms compared to my 11. And it's cross-platform too, which mine is not. Nice work, wilbert!

Re: Numbers of lines in a text file

Posted: Sun May 27, 2012 5:03 pm
by wilbert
netmaestro wrote:Nice work, wilbert!
Thanks :)

I used aligned memory, and each SSE2 instruction handles 16 bytes at a time (the loop processes two 16-byte blocks, so 32 bytes per iteration).
That probably explains why it's fast.
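
For readers who find the inline assembly hard to follow, the same 16-bytes-at-a-time idea can be sketched with SSE2 intrinsics in C. This is only an illustration under a few assumptions: it targets x86/x86-64 compilers that provide `<emmintrin.h>`, the name `count_lf_sse2` is hypothetical, and it uses unaligned loads plus a scalar tail loop instead of the aligned, padded buffer from the post above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */
#include <stddef.h>
#include <stdint.h>

/* Count LF bytes, comparing 16 bytes per step. */
size_t count_lf_sse2(const unsigned char *buf, size_t len)
{
    const __m128i lf   = _mm_set1_epi8('\n');
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;               /* two 64-bit partial sums */
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        /* Unaligned load, so any buffer address is acceptable. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, lf);  /* 0xFF where byte == LF */
        __m128i ones  = _mm_sub_epi8(zero, eq);     /* 0xFF becomes 0x01 */
        /* psadbw against zero adds up the bytes of each 8-byte half. */
        total = _mm_add_epi64(total, _mm_sad_epu8(ones, zero));
    }

    uint64_t parts[2];
    _mm_storeu_si128((__m128i *)parts, total);
    size_t count = (size_t)(parts[0] + parts[1]);

    /* Handle the remaining 0..15 bytes one at a time. */
    for (; i < len; i++)
        count += (buf[i] == '\n');
    return count;
}
```

The scalar tail avoids having to pad the buffer to a multiple of the block size, at the cost of a few extra byte comparisons at the end.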

Re: Numbers of lines in a text file

Posted: Mon May 28, 2012 6:24 am
by wilbert
Using a smaller fixed buffer size seems to work faster for larger files.

Code: Select all

Procedure CountFileLines(filename.s)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, 0) ; pad the tail with zero bytes so the padding never matches #LF
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !mov edx, 0x0a0a0a0a
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfl_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure

For a more generic approach (counting any specific byte value):

Code: Select all

Procedure CountFileByte(filename.s, byte)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, ~byte)
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !movzx edx, byte [p.v_byte]
    !imul edx, 0x01010101
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfb_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure

For example, CountFileByte("test.txt", 32) will count the number of spaces.
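
The fixed-buffer idea does not depend on SSE2. As a rough sketch in plain C (the names `count_byte_in_stream` and `CHUNK_SIZE` are invented for this example, and the 64 KiB size simply mirrors the buffer_size used above), the file is read in chunks and each chunk is scanned for the target byte:

```c
#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 65536  /* same size as buffer_size in the post above */

/* Count occurrences of `target` in an already opened stream,
   reading it in fixed-size chunks instead of loading the whole file. */
size_t count_byte_in_stream(FILE *f, unsigned char target)
{
    /* Static buffer keeps the stack small; this makes the function
       non-reentrant, which is acceptable for a sketch. */
    static unsigned char buf[CHUNK_SIZE];
    size_t count = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            count += (buf[i] == target);
    }
    return count;
}
```

Calling it with '\n' counts lines, and calling it with ' ' counts spaces, just like the two PureBasic procedures.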

Re: Numbers of lines in a text file

Posted: Mon May 28, 2012 3:38 pm
by Tomi
thanks wilbert :D

Re: Numbers of lines in a text file

Posted: Thu Jun 27, 2019 2:34 pm
by bbanelli
Sorry for bringing up such an old thread, but I was wondering: how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The file tested is ~500 MB and has exactly 2.5M lines of unique data.

Code: Select all

EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  *loc = AllocateMemory(Lof(0))
  ReadData(0, *loc, Lof(0))
  filesize = Lof(0)
  CloseFile(0)
  count3=0
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End
Result wrote:PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 433
Ticks : 4338520
TotalDays : 5,02143518518519E-06
TotalHours : 0,000120514444444444
TotalMinutes : 0,00723086666666667
TotalSeconds : 0,433852
TotalMilliseconds : 433,852



PS C:\> Measure-Command {.\PureBasicWordCount.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 68
Ticks : 10682770
TotalDays : 1,23643171296296E-05
TotalHours : 0,000296743611111111
TotalMinutes : 0,0178046166666667
TotalSeconds : 1,068277
TotalMilliseconds : 1068,277
Funny thing is:
PS C:\> .\wc.exe --version
wc (GNU textutils) 2.0
Written by Paul Rubin and David MacKenzie.

Copyright (C) 1999 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Soooo... is it possible that a wc from 1999, which doesn't load the whole file into memory the way this PB code does, performs so much better?

wc was downloaded from here: https://sourceforge.net/projects/unxuti ... s/current/ but I can't find the source code. Supposedly it should be more or less the same as the *nix version? If so (http://agentzh.org/misc/code/coreutils/wc.c.html), there is no inline ASM or anything like that in it...

Asking just out of curiosity. :)

Re: Numbers of lines in a text file

Posted: Thu Jun 27, 2019 3:01 pm
by infratec
The optimized PB version is not fully optimized :wink:

Code: Select all

EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  filesize = Lof(0)
  *loc = AllocateMemory(filesize, #PB_Memory_NoClear)
  ReadData(0, *loc, filesize)
  CloseFile(0)
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End

Re: Numbers of lines in a text file

Posted: Thu Jun 27, 2019 3:25 pm
by bbanelli
infratec wrote:The optimized PB version is not fully optimized :wink:
This one seems to be only negligibly faster than the former...
Result wrote: PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 489
Ticks : 4898609
TotalDays : 5,66968634259259E-06
TotalHours : 0,000136072472222222
TotalMinutes : 0,00816434833333333
TotalSeconds : 0,4898609
TotalMilliseconds : 489,8609



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 51
Ticks : 10514816
TotalDays : 1,21699259259259E-05
TotalHours : 0,000292078222222222
TotalMinutes : 0,0175246933333333
TotalSeconds : 1,0514816
TotalMilliseconds : 1051,4816
I tried running it on a larger set (10M rows, ~2 GB file), and the difference is even greater.
Results wrote:PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 628
Ticks : 16283410
TotalDays : 1,88465393518519E-05
TotalHours : 0,000452316944444444
TotalMinutes : 0,0271390166666667
TotalSeconds : 1,628341
TotalMilliseconds : 1628,341



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}

Days : 0
Hours : 0
Minutes : 0
Seconds : 4
Milliseconds : 154
Ticks : 41548180
TotalDays : 4,80881712962963E-05
TotalHours : 0,00115411611111111
TotalMinutes : 0,0692469666666667
TotalSeconds : 4,154818
TotalMilliseconds : 4154,818

Re: Numbers of lines in a text file

Posted: Fri Jun 28, 2019 5:23 am
by wilbert
bbanelli wrote: Sorry for bringing up such an old thread, but I was wondering: how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The file tested is ~500 MB and has exactly 2.5M lines of unique data.

wc probably checks only for the LF character, while the PB code you chose to compare it against checks for CR+LF, which is slower.
There is also no need to clear the allocated memory; PB does that by default, and it takes time.

Re: Numbers of lines in a text file

Posted: Fri Jun 28, 2019 8:48 am
by NicTheQuick
Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils

Re: Numbers of lines in a text file

Posted: Fri Jun 28, 2019 9:06 am
by wilbert
NicTheQuick wrote:Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils
Nice code :)

That code processes the file in blocks of 16 KiB and counts only LF characters (not CR LF).
It starts with a long_lines flag set to false. As soon as it detects a block where the average line length is >= 15, it sets the flag to true and from there on uses a different counting method based on memchr.
It's possible that the C function memchr uses speed-optimized code for different processors.
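
The strategy described above can be sketched roughly as follows. This is not the coreutils source, only an illustration of the idea: the names `count_lines_wc_style` and `BLOCK_SIZE` are invented here, and re-evaluating the flag after every block (rather than setting it once) is an assumption of this sketch.

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16384  /* 16 KiB blocks, as described in the post */

/* Count LF bytes block by block. Short lines are scanned byte by byte;
   once a block averages >= 15 bytes per line, the next block uses memchr. */
size_t count_lines_wc_style(FILE *f)
{
    static char buf[BLOCK_SIZE];
    size_t lines = 0;
    size_t n;
    int long_lines = 0;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        const char *p = buf;
        const char *end = buf + n;
        size_t before = lines;

        if (long_lines) {
            /* memchr jumps straight to the next LF, if any. */
            while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
                p++;
                lines++;
            }
        } else {
            /* Plain byte loop: cheaper when LFs are very frequent. */
            while (p != end)
                lines += (*p++ == '\n');
        }

        /* Average line length in this block >= 15 bytes? */
        long_lines = 15 * (lines - before) <= n;
    }
    return lines;
}
```

The reasoning behind the switch is that memchr has a per-call overhead, so it only pays off when each call skips over a reasonably long run of non-LF bytes.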