Numbers of lines in a text file

IdeasVacuum
Always Here
Posts: 6425
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK

Re: Numbers of lines in a text file

Post by IdeasVacuum »

OK, got a bit closer, but not correct:

Code: Select all

;=========== TEST file start ====================================================

If CreateFile(0, "c:\test.txt")

  sLine.s = "abcdefghijklmnopqrstuvwxyz and that is all the letters I can think of"
  ilen.i = Len(sLine)

  For i = 1 To 31234
    WriteStringN(0, sLine)
  Next

  FlushFileBuffers(0)
  CloseFile(0)

EndIf

;=========== TEST file End =====================================================

Procedure.i CountFileLines(sFile.s)
;----------------------------------
  Protected filesize.i = 0
  EnableASM
  If ReadFile(0, sFile)
    *loc = AllocateMemory(Lof(0))
    ReadData(0, *loc, Lof(0))
    filesize = Lof(0)
    CloseFile(0)
  EndIf

  cnt = 0
  !xor ecx, ecx            ; linecount = 0
  !mov edx, [p.p_loc]      ; readpointer = *loc
  !mov eax, [p.v_filesize] ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [edx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc ecx               ;   Else, increment the linecount
    !skip:                 ;
    !inc edx               ;   readpointer + 1
    !dec eax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [p.v_cnt], eax      ; stores eax (0 once the loop ends), not the count held in ecx
  DisableASM

  FreeMemory(*loc)
  ProcedureReturn          ; no expression given, so cnt is not returned
EndProcedure

LineCnt.i = CountFileLines("c:\test.txt")
MessageRequester("", Str(LineCnt) + " lines reported")
Edit: Ah, you got there before me netmaestro, so I should now be able to see where I went wrong. :?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.

Post by IdeasVacuum »

Well, I was very close. The bit I got wrong came from the Help: moving the count result to eax and using ProcedureReturn without an expression to return the content of eax. I'm not sure why that would not work, but then I have never used FASM and have not done any assembler work for about 30 years (and that was on a CP/M Z80).

wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Post by wilbert »

It's better to count only #LF.
That gives correct results on both Windows and OS X.
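
To make the point concrete, here is a minimal sketch in plain C rather than PureBasic (the function name count_lf is hypothetical, not something from this thread). It only illustrates that counting LF bytes gives the same total for CRLF and LF files, since every CRLF pair ends in an LF.

```c
#include <stddef.h>

/* Hypothetical helper: count LF (0x0A) bytes in a buffer.
   A CRLF pair ends in LF, so Windows and Unix style text
   with the same lines produce the same total. */
size_t count_lf(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0A) {
            count++;
        }
    }
    return count;
}
```

Note that text using a bare CR as its line terminator would report zero lines with this approach.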

Here's an SSE2 approach:

Code: Select all

Procedure.i CountFileLines(filename.s)
  Protected result.i
  Protected f.i = ReadFile(#PB_Any, filename)
  Protected num_bytes.i = (Lof(f) + 31) & -32
  Protected *mem = AllocateMemory(num_bytes + 16)
  Protected *pos = (*mem + 15) & -16
  ReadData(f, *pos, num_bytes)
  CloseFile(f)
  !pxor xmm5, xmm5
  !mov edx, 0x0a0a0a0a
  !movd xmm4, edx
  !pshufd xmm4, xmm4, 0
  !mov edx, 0x01010101
  !movd xmm3, edx
  !pshufd xmm3, xmm3, 0
  !pxor xmm0, xmm0
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !mov edx, [p.p_pos]
    !mov ecx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [edx]
    !movdqa xmm2, [edx + 16]
  CompilerElse
    !mov rdx, [p.p_pos]
    !mov rcx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [rdx]
    !movdqa xmm2, [rdx + 16]
  CompilerEndIf
  !pcmpeqb xmm1, xmm4
  !pcmpeqb xmm2, xmm4
  !pand xmm1, xmm3
  !psubb xmm1, xmm2
  !pshufd xmm2, xmm1, 00001110b
  !paddb xmm1, xmm2
  !pshufd xmm2, xmm1, 00000001b
  !paddb xmm1, xmm2
  !punpcklbw xmm1, xmm5
  !punpcklwd xmm1, xmm5
  !paddd xmm0, xmm1
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !add edx, 32
    !sub ecx, 32
  CompilerElse
    !add rdx, 32
    !sub rcx, 32
  CompilerEndIf
  !jnz loop32b
  !pshufd xmm1, xmm0, 00001110b
  !paddd xmm0, xmm1
  !pshufd xmm1, xmm0, 00000001b
  !paddd xmm0, xmm1
  !movd [p.v_result], xmm0
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure
Last edited by wilbert on Mon May 28, 2012 6:05 am, edited 1 time in total.

Post by IdeasVacuum »

Wow wilbert - even faster! SSE2 goes a long way back (it arrived with the Pentium 4, around 2000), so there can't be many PCs that this would not work on.

Edit: That's a good point about line terminations by the way, though I thought they were like this:

Windows: CR + LF
Unix: LF
Mac: CR

ts-soft
Always Here
Posts: 5756
Joined: Thu Jun 24, 2004 2:44 pm
Location: Berlin - Germany

Post by ts-soft »

IdeasVacuum wrote:
Windows: CR + LF
Unix: LF
Mac: CR

Mac OS after version 9 (that is, Mac OS X onward) uses LF!
http://en.wikipedia.org/wiki/Newline
PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

netmaestro
PureBasic Bullfrog
Posts: 8425
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Post by netmaestro »

Well, I didn't keep first place for long! Wilbert's code executes here in 4-5 ms compared to my 11. And it's cross-platform too, which mine is not. Nice work, wilbert!
BERESHEIT

Post by wilbert »

netmaestro wrote:Nice work, wilbert!
Thanks :)

I used aligned memory, and the loop handles 32 bytes per iteration as two aligned 16-byte loads.
That probably explains why it's fast.
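
The alignment arithmetic in that code, (Lof(f) + 31) & -32 and (*mem + 15) & -16, can be sketched in C as follows. This is only an illustration under my own naming (round_up and align16 are hypothetical helpers); the point is that movdqa needs a 16-byte aligned address, so the buffer is over-allocated and the start pointer is rounded up.

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of align (align must be a power of two).
   With align = 32 this matches (n + 31) & -32 from the PB code. */
size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: first 16-byte aligned address at or after p.
   The caller must have allocated at least 15 spare bytes. */
unsigned char *align16(unsigned char *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t aligned = (addr + 15) & ~(uintptr_t)15;
    return p + (aligned - addr);
}
```

Rounding the length up to a multiple of 32 lets the loop always consume whole 32-byte steps; the padding bytes just must not contain the value being counted.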

Post by wilbert »

Using a smaller fixed buffer size seems to work faster for larger files.

Code: Select all

Procedure CountFileLines(filename.s)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, 0) ; pad the tail with 0, which is never counted as LF
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !mov edx, 0x0a0a0a0a
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfl_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure
For a more generic approach (counting any specific byte value):

Code: Select all

Procedure CountFileByte(filename.s, byte)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, ~byte)
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !movzx edx, byte [p.v_byte]
    !imul edx, 0x01010101
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfb_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure
For example, CountFileByte("test.txt", 32) will count the number of spaces.
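
For readers who want the fixed-buffer idea without the SSE2 details, here is a rough scalar sketch in C. The name count_byte_in_stream is hypothetical, and it assumes the caller has already opened the file in binary mode. It reads 64 KiB at a time, as the PB code above does, so memory use stays constant regardless of file size.

```c
#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 65536  /* 64 KiB, the buffer size used in the PB code */

/* Hypothetical helper: count occurrences of target in an open stream,
   one fixed-size chunk at a time instead of loading the whole file.
   The static buffer keeps the sketch short but is not thread-safe. */
unsigned long long count_byte_in_stream(FILE *fp, unsigned char target)
{
    static unsigned char buf[CHUNK_SIZE];
    unsigned long long count = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == target) {
                count++;
            }
        }
    }
    return count;
}
```

Passing '\n' as the target gives a line count, and passing ' ' counts spaces. No padding is needed here because the inner loop stops at the number of bytes actually read.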

Tomi
Enthusiast
Posts: 270
Joined: Wed Sep 03, 2008 9:29 am

Post by Tomi »

thanks wilbert :D

bbanelli
Enthusiast
Posts: 543
Joined: Tue May 28, 2013 10:51 pm
Location: Europe

Post by bbanelli »

Sorry for bringing such an old thread up, but I was wondering - how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The file tested is ~500 MB and has exactly 2.5M lines of unique data.

Code: Select all

EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  *loc = AllocateMemory(Lof(0))
  ReadData(0, *loc, Lof(0))
  filesize = Lof(0)
  CloseFile(0)
  count3=0
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End
Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 433
Ticks : 4338520
TotalDays : 5,02143518518519E-06
TotalHours : 0,000120514444444444
TotalMinutes : 0,00723086666666667
TotalSeconds : 0,433852
TotalMilliseconds : 433,852



PS C:\> Measure-Command {.\PureBasicWordCount.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 68
Ticks : 10682770
TotalDays : 1,23643171296296E-05
TotalHours : 0,000296743611111111
TotalMinutes : 0,0178046166666667
TotalSeconds : 1,068277
TotalMilliseconds : 1068,277
Funny thing is:
PS C:\> .\wc.exe --version
wc (GNU textutils) 2.0
Written by Paul Rubin and David MacKenzie.

Copyright (C) 1999 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Soooo... Is it possible that a wc from 1999, which doesn't load the whole file the way this PB code does, performs so much better?

wc was downloaded from here: https://sourceforge.net/projects/unxuti ... s/current/ but I can't find the code; presumably it should be much the same as the *nix version? If so (http://agentzh.org/misc/code/coreutils/wc.c.html), there is no inline ASM or anything like that in it...

Asking just out of curiosity. :)
"If you lie to the compiler, it will get its revenge."
Henry Spencer
https://www.pci-z.com/

infratec
Always Here
Posts: 6817
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Post by infratec »

The optimized PB version is not fully optimized :wink:

Code: Select all

EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  filesize = Lof(0)
  *loc = AllocateMemory(filesize, #PB_Memory_NoClear)
  ReadData(0, *loc, filesize)
  CloseFile(0)
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End

Post by bbanelli »

infratec wrote:The optimized PB version is not fully optimized :wink:
This one seems to be only negligibly faster than the former...
Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 489
Ticks : 4898609
TotalDays : 5,66968634259259E-06
TotalHours : 0,000136072472222222
TotalMinutes : 0,00816434833333333
TotalSeconds : 0,4898609
TotalMilliseconds : 489,8609



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 51
Ticks : 10514816
TotalDays : 1,21699259259259E-05
TotalHours : 0,000292078222222222
TotalMinutes : 0,0175246933333333
TotalSeconds : 1,0514816
TotalMilliseconds : 1051,4816
I tried running it on a larger set, 10M rows in a ~2 GB file, and the difference is even greater.
Results wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 628
Ticks : 16283410
TotalDays : 1,88465393518519E-05
TotalHours : 0,000452316944444444
TotalMinutes : 0,0271390166666667
TotalSeconds : 1,628341
TotalMilliseconds : 1628,341



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}

Days : 0
Hours : 0
Minutes : 0
Seconds : 4
Milliseconds : 154
Ticks : 41548180
TotalDays : 4,80881712962963E-05
TotalHours : 0,00115411611111111
TotalMinutes : 0,0692469666666667
TotalSeconds : 4,154818
TotalMilliseconds : 4154,818

Post by wilbert »

bbanelli wrote:Sorry for bringing such an old thread up, but I was wondering - how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The file tested is ~500 MB and has exactly 2.5M lines of unique data.
wc probably checks only for the LF character, while the PB code you chose to compare it against checks for CR+LF, which is slower.
There is also no need to clear the allocated memory, which PB does by default and which also takes time.
Windows (x64)
Raspberry Pi OS (Arm64)

NicTheQuick
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Post by NicTheQuick »

Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.

Post by wilbert »

NicTheQuick wrote:Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils
Nice code :)

That code processes the file in blocks of 16 KiB and counts only LF characters (not CR LF).
It starts with a long_lines flag set to false. As soon as it detects a block where the average line length is >= 15, it sets the flag to true and from then on uses a different counting method based on memchr.
It's possible that the C function memchr uses speed-optimized code for different processors.
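
The memchr idea can be sketched roughly like this in C. This is my own minimal illustration, not a copy of wc.c, and the name count_lf_memchr is hypothetical. Instead of testing every byte, it lets memchr jump from one LF to the next, which tends to pay off when lines are long.

```c
#include <stddef.h>
#include <string.h>

/* Count LF bytes by letting memchr find each newline in turn.
   Each call scans only the part of the buffer not yet visited. */
size_t count_lf_memchr(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *hit = (const char *)memchr(p, '\n', (size_t)(end - p));
        if (hit == NULL) {
            break;          /* no more newlines in the remainder */
        }
        count++;
        p = hit + 1;        /* continue just past the newline found */
    }
    return count;
}
```

With very short lines the per-call overhead of memchr can outweigh its benefit, which would fit with wc switching methods based on the average line length.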