Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 3:47 pm
by IdeasVacuum
OK, got a bit closer, but not correct:
Code: Select all
;=========== TEST file start ====================================================
If CreateFile(0, "c:\test.txt")
  sLine.s = "abcdefghijklmnopqrstuvwxyz and that is all the letters I can think of"
  ilen.i = Len(sLine)
  For i = 1 To 31234
    WriteStringN(0,sLine)
  Next
  FlushFileBuffers(0)
  CloseFile(0)
EndIf
;=========== TEST file End =====================================================
Procedure.i CountFileLines(sFile.s)
  ;----------------------------------
  Protected filesize.i = 0
  EnableASM
  If ReadFile(0,sFile)
    *loc = AllocateMemory(Lof(0))
    ReadData(0, *loc, Lof(0))
    filesize = Lof(0)
    CloseFile(0)
  EndIf
  cnt = 0
  !xor ecx, ecx               ; linecount = 0
  !mov edx, [p.p_loc]         ; readpointer = *loc
  !mov eax, [p.v_filesize]    ; remainingbytes = filesize
  !loopstart:                 ; While remainingbytes > 0
  !cmp word [edx], 0xA0D      ; If word at readpointer <> #CRLF$
  !jnz skip                   ; GOTO skip
  !inc ecx                    ; Else, increment the linecount
  !skip:                      ;
  !inc edx                    ; readpointer + 1
  !dec eax                    ; remainingbytes - 1
  !jnz loopstart              ; Wend
  !mov [p.v_cnt], eax
  DisableASM
  FreeMemory(*loc)
  ProcedureReturn
EndProcedure
LineCnt.i = CountFileLines("c:\test.txt")
MessageRequester("", Str(LineCnt) + " lines reported")
Edit: Ah, you got there before me netmaestro, so I should now be able to see where I went wrong.
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 4:00 pm
by IdeasVacuum
Well, I was very close - the bit I got wrong was from the Help, moving the count result to eax and using ProcedureReturn without an expression to return the content of eax. Not sure why that would not work, but then I have never used FASM and have not done any assembler stuff for about 30 years (that was on CP/M Z80).
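For anyone following along, the pattern that does work can be sketched like this: keep the count in a Protected variable, store the counter register into it from the asm lines, and return it with an explicit ProcedureReturn expression. This is only a minimal sketch, not code from the thread; the procedure name StoreAndReturn and the value 3 are made up for illustration, and it assumes the ASM backend on x86/x64, where the `!` lines are passed straight to FASM.

```purebasic
; Hypothetical example: StoreAndReturn is not from the thread.
Procedure.i StoreAndReturn()
  Protected cnt.i            ; Protected variables start at 0
  !mov ecx, 3                ; pretend the counting loop left 3 in ecx
  !mov [p.v_cnt], ecx        ; store the counter register (ecx, not eax)
  ProcedureReturn cnt        ; return the variable explicitly
EndProcedure

Debug StoreAndReturn()
```

Returning the variable explicitly avoids depending on whatever happens to be left in eax when the procedure ends.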
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 4:04 pm
by wilbert
It's better to only count #LF.
It gives the correct results on both Windows and OS X.
Here's an SSE2 approach:
Code: Select all
Procedure.i CountFileLines(filename.s)
Protected result.i
Protected f.i = ReadFile(#PB_Any, filename)
Protected num_bytes.i = (Lof(f) + 31) & -32
Protected *mem = AllocateMemory(num_bytes + 16)
Protected *pos = (*mem + 15) & -16
ReadData(f, *pos, num_bytes)
CloseFile(f)
!pxor xmm5, xmm5
!mov edx, 0x0a0a0a0a
!movd xmm4, edx
!pshufd xmm4, xmm4, 0
!mov edx, 0x01010101
!movd xmm3, edx
!pshufd xmm3, xmm3, 0
!pxor xmm0, xmm0
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!mov edx, [p.p_pos]
!mov ecx, [p.v_num_bytes]
!loop32b:
!movdqa xmm1, [edx]
!movdqa xmm2, [edx + 16]
CompilerElse
!mov rdx, [p.p_pos]
!mov rcx, [p.v_num_bytes]
!loop32b:
!movdqa xmm1, [rdx]
!movdqa xmm2, [rdx + 16]
CompilerEndIf
!pcmpeqb xmm1, xmm4
!pcmpeqb xmm2, xmm4
!pand xmm1, xmm3
!psubb xmm1, xmm2
!pshufd xmm2, xmm1, 00001110b
!paddb xmm1, xmm2
!pshufd xmm2, xmm1, 00000001b
!paddb xmm1, xmm2
!punpcklbw xmm1, xmm5
!punpcklwd xmm1, xmm5
!paddd xmm0, xmm1
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!add edx, 32
!sub ecx, 32
CompilerElse
!add rdx, 32
!sub rcx, 32
CompilerEndIf
!jnz loop32b
!pshufd xmm1, xmm0, 00001110b
!paddd xmm0, xmm1
!pshufd xmm1, xmm0, 00000001b
!paddd xmm0, xmm1
!movd [p.v_result], xmm0
FreeMemory(*mem)
ProcedureReturn result
EndProcedure
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 4:26 pm
by IdeasVacuum
Wow wilbert - even faster! SSE2 goes a long way back (Pentium 4, around 2000?), so there can't be many PCs that this would not work on.
Edit: That's a good point about line terminations by the way, though I thought they were like this:
Windows: CR + LF
Unix: LF
Mac: CR
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 4:48 pm
by ts-soft
IdeasVacuum wrote:
Windows: CR + LF
Unix: LF
Mac: CR
MacOS > 9 uses LF!
http://en.wikipedia.org/wiki/Newline
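To make the difference concrete, here is a small sketch (not from the thread) that counts LF characters in three in-memory strings with PureBasic's CountString(). The sample strings are made up; only the classic Mac OS (CR-only) text is missed by an LF count.

```purebasic
; Illustrative strings, one per newline convention (two lines each)
Define windowsText.s    = "one" + #CRLF$ + "two" + #CRLF$
Define unixText.s       = "one" + #LF$ + "two" + #LF$    ; also Mac OS X
Define classicMacText.s = "one" + #CR$ + "two" + #CR$    ; Mac OS 9 and earlier

Debug CountString(windowsText, #LF$)     ; 2
Debug CountString(unixText, #LF$)        ; 2
Debug CountString(classicMacText, #LF$)  ; 0, there is no LF to find
Debug CountString(classicMacText, #CR$)  ; 2
```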
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 4:57 pm
by netmaestro
Well, I didn't keep first place for long! Wilbert's code executes here in 4-5 ms compared to my 11. And it's cross-platform too, which mine is not. Nice work, wilbert!
Re: Numbers of lines in a text file
Posted: Sun May 27, 2012 5:03 pm
by wilbert
netmaestro wrote:Nice work, wilbert!
Thanks
I used aligned memory, and the loop handles 16 bytes per SSE2 register (32 bytes per iteration).
That probably explains why it's fast.
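The alignment arithmetic can be checked in isolation. A minimal sketch, assuming a 1000-byte payload (the size and the variable names are arbitrary, not from the thread): over-allocate by 16 bytes, then round the pointer up to the next multiple of 16 so that movdqa can be used. The same rounding idea with 31 and -32 pads the length to whole 32-byte loop iterations.

```purebasic
Define payload.i = 1000
Define padded.i = (payload + 31) & -32        ; round length up to a multiple of 32
Define *mem = AllocateMemory(padded + 16)     ; 16 spare bytes for alignment
Define *pos
If *mem
  *pos = (*mem + 15) & -16                    ; round pointer up to a multiple of 16
  Debug padded                                ; 1024
  Debug *pos & 15                             ; 0, so *pos is 16-byte aligned
  Debug *pos - *mem                           ; between 0 and 15
  FreeMemory(*mem)                            ; always free the original pointer
EndIf
```

Note that FreeMemory() has to be given the pointer returned by AllocateMemory(), not the aligned one.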
Re: Numbers of lines in a text file
Posted: Mon May 28, 2012 6:24 am
by wilbert
Using a smaller fixed buffer size seems to work faster for larger files.
Code: Select all
Procedure CountFileLines(filename.s)
Protected result, bytes_read
Protected buffer_size = 65536
Protected f = ReadFile(#PB_Any, filename)
Protected *mem = AllocateMemory(buffer_size + 48)
Protected *pos = (*mem + 15) & -16
While Not Eof(f)
bytes_read = ReadData(f, *pos, buffer_size)
If bytes_read <> buffer_size
FillMemory(*pos + bytes_read, 32, 0) ; pad the tail with a byte that is not LF
bytes_read = (bytes_read + 31) & -32
EndIf
!pxor xmm5, xmm5
!mov edx, 0x0a0a0a0a
!movd xmm4, edx
!pshufd xmm4, xmm4, 0
!mov edx, 0x01010101
!movd xmm3, edx
!pshufd xmm3, xmm3, 0
!movd xmm0, [p.v_result]
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!mov edx, [p.p_pos]
!mov ecx, [p.v_bytes_read]
!cfl_loop32b:
!movdqa xmm1, [edx]
!movdqa xmm2, [edx + 16]
CompilerElse
!mov rdx, [p.p_pos]
!mov rcx, [p.v_bytes_read]
!cfl_loop32b:
!movdqa xmm1, [rdx]
!movdqa xmm2, [rdx + 16]
CompilerEndIf
!pcmpeqb xmm1, xmm4
!pcmpeqb xmm2, xmm4
!pand xmm1, xmm3
!psubb xmm1, xmm2
!psadbw xmm1, xmm5
!paddq xmm0, xmm1
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!add edx, 32
!sub ecx, 32
CompilerElse
!add rdx, 32
!sub rcx, 32
CompilerEndIf
!jnz cfl_loop32b
!pshufd xmm1, xmm0, 00001110b
!paddq xmm0, xmm1
!movd [p.v_result], xmm0
Wend
CloseFile(f)
FreeMemory(*mem)
ProcedureReturn result
EndProcedure
For a more generic approach (count any specific byte value)
Code: Select all
Procedure CountFileByte(filename.s, byte)
Protected result, bytes_read
Protected buffer_size = 65536
Protected f = ReadFile(#PB_Any, filename)
Protected *mem = AllocateMemory(buffer_size + 48)
Protected *pos = (*mem + 15) & -16
While Not Eof(f)
bytes_read = ReadData(f, *pos, buffer_size)
If bytes_read <> buffer_size
FillMemory(*pos + bytes_read, 32, ~byte)
bytes_read = (bytes_read + 31) & -32
EndIf
!pxor xmm5, xmm5
!movzx edx, byte [p.v_byte]
!imul edx, 0x01010101
!movd xmm4, edx
!pshufd xmm4, xmm4, 0
!mov edx, 0x01010101
!movd xmm3, edx
!pshufd xmm3, xmm3, 0
!movd xmm0, [p.v_result]
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!mov edx, [p.p_pos]
!mov ecx, [p.v_bytes_read]
!cfb_loop32b:
!movdqa xmm1, [edx]
!movdqa xmm2, [edx + 16]
CompilerElse
!mov rdx, [p.p_pos]
!mov rcx, [p.v_bytes_read]
!cfb_loop32b:
!movdqa xmm1, [rdx]
!movdqa xmm2, [rdx + 16]
CompilerEndIf
!pcmpeqb xmm1, xmm4
!pcmpeqb xmm2, xmm4
!pand xmm1, xmm3
!psubb xmm1, xmm2
!psadbw xmm1, xmm5
!paddq xmm0, xmm1
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
!add edx, 32
!sub ecx, 32
CompilerElse
!add rdx, 32
!sub rcx, 32
CompilerEndIf
!jnz cfb_loop32b
!pshufd xmm1, xmm0, 00001110b
!paddq xmm0, xmm1
!movd [p.v_result], xmm0
Wend
CloseFile(f)
FreeMemory(*mem)
ProcedureReturn result
EndProcedure
For example, CountFileByte("test.txt", 32) counts the number of spaces.
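When experimenting with the SSE2 versions, it can help to have a plain PureBasic version to compare the results against. The following is only a sketch under my own assumptions, not a replacement: the name CountFileByteRef and the -1 return value for a file that cannot be opened are made up here. It reads the file in the same 64 KiB chunks but checks one byte at a time, so expect it to be much slower.

```purebasic
; Hypothetical reference version (no inline asm) for cross-checking results.
Procedure.q CountFileByteRef(filename.s, byte.i)
  Protected result.q, bytes_read.i, i.i
  Protected target.i = byte & $FF              ; compare as an unsigned byte
  Protected buffer_size.i = 65536
  Protected *buf
  Protected f.i = ReadFile(#PB_Any, filename)
  If f = 0
    ProcedureReturn -1                         ; file could not be opened
  EndIf
  *buf = AllocateMemory(buffer_size)
  If *buf
    While Not Eof(f)
      bytes_read = ReadData(f, *buf, buffer_size)
      If bytes_read <= 0 : Break : EndIf       ; stop on a read error
      i = 0
      While i < bytes_read
        If PeekA(*buf + i) = target
          result + 1
        EndIf
        i + 1
      Wend
    Wend
    FreeMemory(*buf)
  EndIf
  CloseFile(f)
  ProcedureReturn result
EndProcedure

Debug CountFileByteRef("test.txt", 32)   ; spaces
Debug CountFileByteRef("test.txt", 10)   ; LF characters, i.e. lines
```

If both versions report the same number for the same file and byte value, the padding and alignment logic in the fast version is probably doing its job.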
Re: Numbers of lines in a text file
Posted: Mon May 28, 2012 3:38 pm
by Tomi
thanks wilbert
Re: Numbers of lines in a text file
Posted: Thu Jun 27, 2019 2:34 pm
by bbanelli
Sorry for bringing such an old thread up, but I was wondering - how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The test file is ~500 MB and has exactly 2.5M lines of unique data.
Code: Select all
EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  *loc = AllocateMemory(Lof(0))
  ReadData(0, *loc, Lof(0))
  filesize = Lof(0)
  CloseFile(0)
  count3=0
  EnableASM
  !xor rcx, rcx               ; linecount = 0
  !mov rdx, [p_loc]           ; readpointer = *loc
  !mov rax, [v_filesize]      ; remainingbytes = filesize
  !loopstart:                 ; While remainingbytes > 0
  !cmp word [rdx], 0xA0D      ; If word at readpointer <> #CRLF$
  !jnz skip                   ; GOTO skip
  !inc rcx                    ; Else, increment the linecount
  !skip:                      ;
  !inc rdx                    ; readpointer + 1
  !dec rax                    ; remainingbytes - 1
  !jnz loopstart              ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End
Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 433
Ticks : 4338520
TotalDays : 5,02143518518519E-06
TotalHours : 0,000120514444444444
TotalMinutes : 0,00723086666666667
TotalSeconds : 0,433852
TotalMilliseconds : 433,852
PS C:\> Measure-Command {.\PureBasicWordCount.exe}
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 68
Ticks : 10682770
TotalDays : 1,23643171296296E-05
TotalHours : 0,000296743611111111
TotalMinutes : 0,0178046166666667
TotalSeconds : 1,068277
TotalMilliseconds : 1068,277
Funny thing is:
PS C:\> .\wc.exe --version
wc (GNU textutils) 2.0
Written by Paul Rubin and David MacKenzie.
Copyright (C) 1999 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Soooo... is it possible that a wc from 1999, which doesn't load the whole file the way this PB code does, performs so much better?
wc is downloaded from here:
https://sourceforge.net/projects/unxuti ... s/current/ but I can't find the source code; presumably it is much the same as the *nix version? If so:
http://agentzh.org/misc/code/coreutils/wc.c.html there is no inline ASM or anything else in it...
Asking just out of curiosity.
Re: Numbers of lines in a text file
Posted: Thu Jun 27, 2019 3:01 pm
by infratec
The optimized PB version is not fully optimized
Code: Select all
EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  filesize = Lof(0)
  *loc = AllocateMemory(filesize, #PB_Memory_NoClear)
  ReadData(0, *loc, filesize)
  CloseFile(0)
  EnableASM
  !xor rcx, rcx               ; linecount = 0
  !mov rdx, [p_loc]           ; readpointer = *loc
  !mov rax, [v_filesize]      ; remainingbytes = filesize
  !loopstart:                 ; While remainingbytes > 0
  !cmp word [rdx], 0xA0D      ; If word at readpointer <> #CRLF$
  !jnz skip                   ; GOTO skip
  !inc rcx                    ; Else, increment the linecount
  !skip:                      ;
  !inc rdx                    ; readpointer + 1
  !dec rax                    ; remainingbytes - 1
  !jnz loopstart              ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End
Re: Numbers of lines in a text file
Posted: Thu Jun 27, 2019 3:25 pm
by bbanelli
infratec wrote:The optimized PB version is not fully optimized
This one seems to be only negligibly faster than the former...
Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 489
Ticks : 4898609
TotalDays : 5,66968634259259E-06
TotalHours : 0,000136072472222222
TotalMinutes : 0,00816434833333333
TotalSeconds : 0,4898609
TotalMilliseconds : 489,8609
PS C:\> Measure-Command {.\PB_WC_Optimized.exe}
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 51
Ticks : 10514816
TotalDays : 1,21699259259259E-05
TotalHours : 0,000292078222222222
TotalMinutes : 0,0175246933333333
TotalSeconds : 1,0514816
TotalMilliseconds : 1051,4816
I tried running it on a larger set (10M rows, ~2 GB file), and the difference is even greater.
Results wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 628
Ticks : 16283410
TotalDays : 1,88465393518519E-05
TotalHours : 0,000452316944444444
TotalMinutes : 0,0271390166666667
TotalSeconds : 1,628341
TotalMilliseconds : 1628,341
PS C:\> Measure-Command {.\PB_WC_Optimized.exe}
Days : 0
Hours : 0
Minutes : 0
Seconds : 4
Milliseconds : 154
Ticks : 41548180
TotalDays : 4,80881712962963E-05
TotalHours : 0,00115411611111111
TotalMinutes : 0,0692469666666667
TotalSeconds : 4,154818
TotalMilliseconds : 4154,818
Re: Numbers of lines in a text file
Posted: Fri Jun 28, 2019 5:23 am
by wilbert
bbanelli wrote:Sorry for bringing such an old thread up, but I was wondering - how come "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The test file is ~500 MB and has exactly 2.5M lines of unique data.
wc probably only checks for the LF character while the PB code you chose to compare it against checks for CR+LF which is slower.
There also is no need to clear the allocated memory which PB does by default and also takes time.
Re: Numbers of lines in a text file
Posted: Fri Jun 28, 2019 8:48 am
by NicTheQuick
Re: Numbers of lines in a text file
Posted: Fri Jun 28, 2019 9:06 am
by wilbert
Nice code
That code processes the file in blocks of 16 KiB and only counts LF characters (not CR LF).
It starts with a long_lines flag set to false and as soon as it detects a block where the average line length >= 15, it sets the flag to true and starts using a different counting method from there on using memchr.
It's possible that the C function memchr uses speed-optimized code for different processors.