All times are UTC + 1 hour




 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 3:47 pm 

Joined: Fri Oct 23, 2009 2:33 am
Posts: 5830
Location: Wales, UK
OK, got a bit closer, but not correct:
Code:
;=========== TEST file start ====================================================

If CreateFile(0, "c:\test.txt")

  sLine.s = "abcdefghijklmnopqrstuvwxyz and that is all the letters I can think of"
  ilen.i = Len(sLine)

  For i = 1 To 31234
    WriteStringN(0, sLine)
  Next

  FlushFileBuffers(0)
  CloseFile(0)

EndIf

;=========== TEST file End =====================================================

Procedure.i CountFileLines(sFile.s)
;----------------------------------
  Protected filesize.i = 0

  EnableASM
  If ReadFile(0, sFile)
    *loc = AllocateMemory(Lof(0))
    ReadData(0, *loc, Lof(0))
    filesize = Lof(0)
    CloseFile(0)
  EndIf

  cnt = 0
  !xor ecx, ecx            ; linecount = 0
  !mov edx, [p.p_loc]      ; readpointer = *loc (32-bit registers, so x86 only)
  !mov eax, [p.v_filesize] ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [edx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc ecx               ;   Else, increment the linecount
    !skip:                 ;
    !inc edx               ;   readpointer + 1
    !dec eax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [p.v_cnt], eax      ; stores eax (0 after the loop), not the count held in ecx
  DisableASM

  FreeMemory(*loc)
  ProcedureReturn          ; no expression, so cnt is never returned

EndProcedure

LineCnt.i = CountFileLines("c:\test.txt")
MessageRequester("", Str(LineCnt) + " lines reported")


Edit: Ah, you got there before me netmaestro, so I should now be able to see where I went wrong. :?

_________________
IdeasVacuum
If it sounds simple, you have not grasped the complexity.


 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 4:00 pm 

Joined: Fri Oct 23, 2009 2:33 am
Posts: 5830
Location: Wales, UK
Well, I was very close. The bit I got wrong came from the Help: moving the count result to eax and using ProcedureReturn without an expression to return the content of eax. Not sure why that would not work, but then I have never used FASM and have not done any assembler for about 30 years (that was on CP/M Z80).
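
For comparison, here is a minimal sketch of the same CR+LF scan in plain PureBasic, without inline ASM. CountCRLF is a hypothetical helper name, not something posted in this thread. It returns the count through an explicit ProcedureReturn, and it stops one byte before the end of the buffer so the 2-byte read never leaves the allocated memory.

```purebasic
; Plain-PureBasic equivalent of the CR+LF scan (no inline ASM).
; CountCRLF is a hypothetical name used only for this sketch.
Procedure.i CountCRLF(*buf, size.i)
  Protected cnt.i = 0
  Protected *p = *buf
  Protected *last = *buf + size - 2   ; last address where a 2-byte read stays in bounds

  While *p <= *last
    If PeekU(*p) = $0A0D              ; little-endian word: CR ($0D) followed by LF ($0A)
      cnt + 1
    EndIf
    *p + 1
  Wend

  ProcedureReturn cnt                 ; explicit return value, no reliance on eax
EndProcedure

Define text.s = "one" + #CRLF$ + "two" + #CRLF$ + "three"
Define size.i = Len(text)
Define *buf = AllocateMemory(size + 1)   ; +1 for the terminating null written by PokeS
If *buf
  PokeS(*buf, text, -1, #PB_Ascii)
  Debug CountCRLF(*buf, size)            ; 2
  FreeMemory(*buf)
EndIf
```

It will be much slower than the ASM loop, but it is handy for checking that a faster version reports the same number.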

_________________
IdeasVacuum
If it sounds simple, you have not grasped the complexity.


 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 4:04 pm 

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3372
Location: Netherlands
It's better to count only #LF.
That gives correct results on both Windows and OS X.

Here's an SSE2 approach:
Code:
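Procedure.i CountFileLines(filename.s)
  Protected result.i, num_bytes.i, *mem, *pos
  Protected f.i = ReadFile(#PB_Any, filename)
  If f = 0
    ProcedureReturn 0                     ; file could not be opened
  EndIf
  num_bytes = (Lof(f) + 31) & -32         ; file size rounded up to a multiple of 32
  If num_bytes = 0
    CloseFile(f)
    ProcedureReturn 0                     ; empty file: the loop below assumes at least one block
  EndIf
  *mem = AllocateMemory(num_bytes + 16)   ; zero-filled, so the padding holds no LF
  *pos = (*mem + 15) & -16                ; 16-byte aligned start, required by movdqa
  ReadData(f, *pos, num_bytes)
  CloseFile(f)
  !pxor xmm5, xmm5                        ; xmm5 = 0, used to widen bytes to dwords
  !mov edx, 0x0a0a0a0a
  !movd xmm4, edx
  !pshufd xmm4, xmm4, 0                   ; xmm4 = 16 x LF (0x0A)
  !mov edx, 0x01010101
  !movd xmm3, edx
  !pshufd xmm3, xmm3, 0                   ; xmm3 = 16 x 1
  !pxor xmm0, xmm0                        ; xmm0 = running total (4 dwords)
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !mov edx, [p.p_pos]
    !mov ecx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [edx]                   ; load 2 x 16 aligned bytes
    !movdqa xmm2, [edx + 16]
  CompilerElse
    !mov rdx, [p.p_pos]
    !mov rcx, [p.v_num_bytes]
    !loop32b:
    !movdqa xmm1, [rdx]                   ; load 2 x 16 aligned bytes
    !movdqa xmm2, [rdx + 16]
  CompilerEndIf
  !pcmpeqb xmm1, xmm4                     ; 0xFF in every byte that equals LF
  !pcmpeqb xmm2, xmm4
  !pand xmm1, xmm3                        ; xmm1: 0xFF -> 1
  !psubb xmm1, xmm2                       ; subtracting 0xFF (-1) adds 1: each byte is now 0..2
  !pshufd xmm2, xmm1, 00001110b
  !paddb xmm1, xmm2                       ; fold the upper 8 bytes onto the lower 8
  !pshufd xmm2, xmm1, 00000001b
  !paddb xmm1, xmm2                       ; fold again: 4 bytes, each 0..8
  !punpcklbw xmm1, xmm5
  !punpcklwd xmm1, xmm5                   ; widen those 4 bytes to 4 dwords
  !paddd xmm0, xmm1                       ; add to the running total
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
    !add edx, 32
    !sub ecx, 32
  CompilerElse
    !add rdx, 32
    !sub rcx, 32
  CompilerEndIf
  !jnz loop32b                            ; until all num_bytes are processed
  !pshufd xmm1, xmm0, 00001110b
  !paddd xmm0, xmm1
  !pshufd xmm1, xmm0, 00000001b
  !paddd xmm0, xmm1                       ; horizontal sum of the 4 dwords
  !movd [p.v_result], xmm0
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure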


Last edited by wilbert on Mon May 28, 2012 6:05 am, edited 1 time in total.

 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 4:26 pm 

Joined: Fri Oct 23, 2009 2:33 am
Posts: 5830
Location: Wales, UK
Wow wilbert - even faster! SSE2 goes a long way back (the Pentium 4, introduced in 2000), so there can't be many PCs that this would not work on.

Edit: That's a good point about line terminations by the way, though I thought they were like this:

Windows: CR + LF
Unix: LF
Mac: CR

_________________
IdeasVacuum
If it sounds simple, you have not grasped the complexity.


 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 4:48 pm 

Joined: Thu Jun 24, 2004 2:44 pm
Posts: 5754
Location: Berlin - Germany
IdeasVacuum wrote:
Windows: CR + LF
Unix: LF
Mac: CR

MacOS > 9 uses LF!
http://en.wikipedia.org/wiki/Newline
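
If you are unsure which convention a given file uses, you can simply count all three. Below is a rough sketch that works on a memory buffer; the EolCounts structure and the CountLineEndings name are made up for this example. A CR directly followed by LF is counted once as CR+LF, not as one of each.

```purebasic
; Sketch: tally CR+LF, lone LF and lone CR in a buffer.
; EolCounts and CountLineEndings are hypothetical names for this example.
Structure EolCounts
  crlf.i
  lf.i
  cr.i
EndStructure

Procedure CountLineEndings(*buf, size.i, *out.EolCounts)
  Protected i.i = 0, b.i
  *out\crlf = 0 : *out\lf = 0 : *out\cr = 0

  While i < size
    b = PeekA(*buf + i)
    If b = $0D                               ; CR
      If i + 1 < size
        If PeekA(*buf + i + 1) = $0A         ; CR followed by LF: Windows style
          *out\crlf + 1
          i + 1                              ; skip the LF that belongs to this CR
        Else
          *out\cr + 1                        ; lone CR: classic Mac OS style
        EndIf
      Else
        *out\cr + 1                          ; CR is the very last byte
      EndIf
    ElseIf b = $0A                           ; lone LF: Unix / OS X style
      *out\lf + 1
    EndIf
    i + 1
  Wend
EndProcedure

Define text.s = "win" + Chr(13) + Chr(10) + "unix" + Chr(10) + "oldmac" + Chr(13) + "end"
Define size.i = Len(text)
Define *buf = AllocateMemory(size + 1)       ; +1 for the terminating null written by PokeS
Define counts.EolCounts
If *buf
  PokeS(*buf, text, -1, #PB_Ascii)
  CountLineEndings(*buf, size, @counts)
  Debug "CRLF: " + Str(counts\crlf) + ", LF: " + Str(counts\lf) + ", CR: " + Str(counts\cr)
  FreeMemory(*buf)
EndIf
```

Whichever of the three counters dominates tells you the convention, and counting only LF covers both the Windows and the Unix / OS X case.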

_________________
PureBasic 5.70 | SpiderBasic 2.21 | Windows 10 Pro (x64) | Linux Mint 19.2 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.
 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 4:57 pm 

Joined: Wed Jul 06, 2005 5:42 am
Posts: 8004
Location: Fort Nelson, BC, Canada
Well, I didn't keep first place for long! Wilbert's code executes here in 4-5 ms compared to 11 ms for mine. And it's cross-platform too, which mine is not. Nice work, wilbert!

_________________
Veni, vidi, vici.


 Post subject: Re: Numbers of lines in a text file
PostPosted: Sun May 27, 2012 5:03 pm 

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3372
Location: Netherlands
netmaestro wrote:
Nice work, wilbert!

Thanks :)

I used aligned memory, and each SSE2 instruction handles 16 bytes at a time (32 bytes per loop iteration).
That probably explains why it's fast.
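
The two bit tricks behind the aligned buffer can be tried on their own. This is just an illustration; RoundUp32 and AlignUp16 are hypothetical helper names wrapping the expressions used in the procedure. Adding (N - 1) and then masking with -N rounds up to the next multiple of N, as long as N is a power of two.

```purebasic
; Sketch of the rounding used for the SSE2 buffer.
; RoundUp32 and AlignUp16 are hypothetical helper names.
Procedure.i RoundUp32(n.i)
  ProcedureReturn (n + 31) & -32      ; -32 has its low 5 bits clear
EndProcedure

Procedure.i AlignUp16(address.i)
  ProcedureReturn (address + 15) & -16
EndProcedure

Debug RoundUp32(1)      ; 32
Debug RoundUp32(32)     ; 32
Debug RoundUp32(100)    ; 128
Debug AlignUp16(4097)   ; 4112

Define *mem = AllocateMemory(64 + 16) ; 16 spare bytes leave room to move the start forward
If *mem
  Define *pos = AlignUp16(*mem)
  Debug *pos & 15                     ; 0, so movdqa can be used on *pos
  FreeMemory(*mem)                    ; always free the original pointer, not the aligned one
EndIf
```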


 Post subject: Re: Numbers of lines in a text file
PostPosted: Mon May 28, 2012 6:24 am 

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3372
Location: Netherlands
Using a smaller fixed buffer size seems to work faster for larger files.
Code:
Procedure CountFileLines(filename.s)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, 0)   ; pad the partial block with zeros, which never match LF
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !mov edx, 0x0a0a0a0a
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfl_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfl_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure


For a more generic approach (counting any specific byte value):
Code:
Procedure CountFileByte(filename.s, byte)
  Protected result, bytes_read
  Protected buffer_size = 65536
  Protected f = ReadFile(#PB_Any, filename)
  Protected *mem = AllocateMemory(buffer_size + 48)
  Protected *pos = (*mem + 15) & -16
  While Not Eof(f)
    bytes_read = ReadData(f, *pos, buffer_size)
    If bytes_read <> buffer_size
      FillMemory(*pos + bytes_read, 32, ~byte)
      bytes_read = (bytes_read + 31) & -32
    EndIf
    !pxor xmm5, xmm5
    !movzx edx, byte [p.v_byte]
    !imul edx, 0x01010101
    !movd xmm4, edx
    !pshufd xmm4, xmm4, 0
    !mov edx, 0x01010101
    !movd xmm3, edx
    !pshufd xmm3, xmm3, 0
    !movd xmm0, [p.v_result]
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !mov edx, [p.p_pos]
      !mov ecx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [edx]
      !movdqa xmm2, [edx + 16]
    CompilerElse
      !mov rdx, [p.p_pos]
      !mov rcx, [p.v_bytes_read]
      !cfb_loop32b:
      !movdqa xmm1, [rdx]
      !movdqa xmm2, [rdx + 16]
    CompilerEndIf
    !pcmpeqb xmm1, xmm4
    !pcmpeqb xmm2, xmm4
    !pand xmm1, xmm3
    !psubb xmm1, xmm2
    !psadbw xmm1, xmm5
    !paddq xmm0, xmm1
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      !add edx, 32
      !sub ecx, 32
    CompilerElse
      !add rdx, 32
      !sub rcx, 32
    CompilerEndIf
    !jnz cfb_loop32b
    !pshufd xmm1, xmm0, 00001110b
    !paddq xmm0, xmm1
    !movd [p.v_result], xmm0
  Wend
  CloseFile(f)
  FreeMemory(*mem)
  ProcedureReturn result
EndProcedure
For example, CountFileByte("test.txt", 32) counts the number of spaces.
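
To check the result of the SSE2 version, a slow but portable reference can help. The sketch below uses the same 64 KiB buffer idea with a plain byte loop; CountFileByteRef is a hypothetical name, and returning -1 on failure is a choice made for this example only.

```purebasic
; Portable reference: count one byte value in a file, 64 KiB at a time.
; CountFileByteRef is a hypothetical name; -1 signals "could not open or allocate".
Procedure.q CountFileByteRef(filename.s, value.i)
  Protected result.q = 0, bytes_read.i, i.i
  Protected buffer_size.i = 65536
  Protected *buf
  Protected f.i = ReadFile(#PB_Any, filename)

  If f = 0
    ProcedureReturn -1
  EndIf

  *buf = AllocateMemory(buffer_size, #PB_Memory_NoClear)
  If *buf
    While Not Eof(f)
      bytes_read = ReadData(f, *buf, buffer_size)
      If bytes_read <= 0
        Break                              ; read error: stop instead of looping forever
      EndIf
      For i = 0 To bytes_read - 1          ; only the bytes actually read, so no padding is needed
        If PeekA(*buf + i) = value
          result + 1
        EndIf
      Next
    Wend
    FreeMemory(*buf)
  Else
    result = -1
  EndIf

  CloseFile(f)
  ProcedureReturn result
EndProcedure

Define name.s = GetTemporaryDirectory() + "cfb_ref_test.txt"
Define out.i = CreateFile(#PB_Any, name)
If out
  WriteString(out, "a b c" + Chr(10) + "d e" + Chr(10), #PB_Ascii)
  CloseFile(out)
  Debug CountFileByteRef(name, 10)         ; 2 line feeds
  Debug CountFileByteRef(name, 32)         ; 3 spaces
  DeleteFile(name)
EndIf
```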


 Post subject: Re: Numbers of lines in a text file
PostPosted: Mon May 28, 2012 3:38 pm 

Joined: Wed Sep 03, 2008 9:29 am
Posts: 270
thanks wilbert :D


 Post subject: Re: Numbers of lines in a text file
PostPosted: Thu Jun 27, 2019 2:34 pm 

Joined: Tue May 28, 2013 10:51 pm
Posts: 536
Location: Europe
Sorry for bringing up such an old thread, but I was wondering: how come the "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The test file is ~500 MB and has exactly 2.5M lines of unique data.

Code:
EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  *loc = AllocateMemory(Lof(0))
  ReadData(0, *loc, Lof(0))
  filesize = Lof(0)
  CloseFile(0)
  count3=0
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End


Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 433
Ticks : 4338520
TotalDays : 5,02143518518519E-06
TotalHours : 0,000120514444444444
TotalMinutes : 0,00723086666666667
TotalSeconds : 0,433852
TotalMilliseconds : 433,852



PS C:\> Measure-Command {.\PureBasicWordCount.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 68
Ticks : 10682770
TotalDays : 1,23643171296296E-05
TotalHours : 0,000296743611111111
TotalMinutes : 0,0178046166666667
TotalSeconds : 1,068277
TotalMilliseconds : 1068,277


Funny thing is:

Quote:
PS C:\> .\wc.exe --version
wc (GNU textutils) 2.0
Written by Paul Rubin and David MacKenzie.

Copyright (C) 1999 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


Soooo... is it possible that a wc from 1999, which doesn't load the whole file like this PB code does, performs so much better?

wc was downloaded from here: https://sourceforge.net/projects/unxuti ... s/current/ but I can't find the source code; presumably it is much the same as the *nix version? If so: http://agentzh.org/misc/code/coreutils/wc.c.html there is no inline ASM or anything like that in it...

Asking just out of curiosity. :)

_________________
"If you lie to the compiler, it will get its revenge."
Henry Spencer
https://www.pci-z.com/


 Post subject: Re: Numbers of lines in a text file
PostPosted: Thu Jun 27, 2019 3:01 pm 

Joined: Sun Sep 07, 2008 12:45 pm
Posts: 4278
Location: Germany
The optimized PB version is not fully optimized :wink:

Code:
EnableExplicit
Define *loc
Define.i count3, filesize
If ReadFile(0,"text.txt")
  filesize = Lof(0)
  *loc = AllocateMemory(filesize, #PB_Memory_NoClear)
  ReadData(0, *loc, filesize)
  CloseFile(0)
  EnableASM
  !xor rcx, rcx            ; linecount = 0
  !mov rdx, [p_loc]        ; readpointer = *loc
  !mov rax, [v_filesize]   ; remainingbytes = filesize
  !loopstart:              ; While remainingbytes > 0
    !cmp word [rdx], 0xA0D ;   If word at readpointer <> #CRLF$
    !jnz skip              ;     GOTO skip
    !inc rcx               ;   Else, increment the linecount
    !skip:                 ;
    !inc rdx               ;   readpointer + 1
    !dec rax               ;   remainingbytes - 1
  !jnz loopstart           ; Wend
  !mov [v_count3], rcx
  DisableASM
  FreeMemory(*loc)
EndIf
OpenConsole()
PrintN(Str(count3))
End


 Post subject: Re: Numbers of lines in a text file
PostPosted: Thu Jun 27, 2019 3:25 pm 

Joined: Tue May 28, 2013 10:51 pm
Posts: 536
Location: Europe
infratec wrote:
The optimized PB version is not fully optimized :wink:
This one seems to be only negligibly faster than the former...

Result wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 489
Ticks : 4898609
TotalDays : 5,66968634259259E-06
TotalHours : 0,000136072472222222
TotalMinutes : 0,00816434833333333
TotalSeconds : 0,4898609
TotalMilliseconds : 489,8609



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 51
Ticks : 10514816
TotalDays : 1,21699259259259E-05
TotalHours : 0,000292078222222222
TotalMinutes : 0,0175246933333333
TotalSeconds : 1,0514816
TotalMilliseconds : 1051,4816


I tried running it on a larger set (10M rows, ~2 GB file), but the difference is even greater.

Results wrote:
PS C:\> Measure-Command {.\wc.exe -l .\text.txt}


Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 628
Ticks : 16283410
TotalDays : 1,88465393518519E-05
TotalHours : 0,000452316944444444
TotalMinutes : 0,0271390166666667
TotalSeconds : 1,628341
TotalMilliseconds : 1628,341



PS C:\> Measure-Command {.\PB_WC_Optimized.exe}

Days : 0
Hours : 0
Minutes : 0
Seconds : 4
Milliseconds : 154
Ticks : 41548180
TotalDays : 4,80881712962963E-05
TotalHours : 0,00115411611111111
TotalMinutes : 0,0692469666666667
TotalSeconds : 4,154818
TotalMilliseconds : 4154,818

_________________
"If you lie to the compiler, it will get its revenge."
Henry Spencer
https://www.pci-z.com/


 Post subject: Re: Numbers of lines in a text file
PostPosted: Fri Jun 28, 2019 5:23 am 

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3372
Location: Netherlands
bbanelli wrote:
Sorry for bringing up such an old thread, but I was wondering: how come the "standard" wc (compiled for Windows!) is twice as fast as the most optimized version here? The test file is ~500 MB and has exactly 2.5M lines of unique data.

wc probably checks only for the LF character, while the PB code you chose to compare it against checks for CR+LF, which is slower.
There is also no need to clear the allocated memory, which PB does by default and which also takes time.
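
Measure-Command times the whole process, including startup and reading the file from disk, so it can be useful to time the read and the scan separately inside the program. Here is a rough sketch of that; TimeLineCount is a hypothetical helper name, and the scan is a plain LF byte loop, not one of the ASM versions from this thread.

```purebasic
; Sketch only: time the file read and the LF scan separately.
; TimeLineCount is a hypothetical helper; it returns -1 if the file cannot be read.
Procedure.i TimeLineCount(filename.s)
  Protected *buf, size.i, i.i, lines.i = -1
  Protected t0.q, t_read.q, t_count.q
  Protected f.i = ReadFile(#PB_Any, filename)

  If f
    size = Lof(f)
    If size > 0
      *buf = AllocateMemory(size, #PB_Memory_NoClear)   ; no zero-fill, ReadData overwrites it anyway
      If *buf
        t0 = ElapsedMilliseconds()
        size = ReadData(f, *buf, size)
        t_read = ElapsedMilliseconds() - t0

        lines = 0
        t0 = ElapsedMilliseconds()
        For i = 0 To size - 1
          If PeekA(*buf + i) = $0A                      ; count LF only
            lines + 1
          EndIf
        Next
        t_count = ElapsedMilliseconds() - t0

        FreeMemory(*buf)
        PrintN("read:  " + Str(t_read) + " ms")
        PrintN("count: " + Str(t_count) + " ms")
      EndIf
    Else
      lines = 0                                         ; empty file
    EndIf
    CloseFile(f)
  EndIf

  ProcedureReturn lines
EndProcedure

OpenConsole()
PrintN(Str(TimeLineCount("text.txt")) + " lines")
```

If the read time dominates, the counting loop is not the bottleneck and a faster scan will not change the total much.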

_________________
macOS 10.14 Mojave, PB 5.62 x64


 Post subject: Re: Numbers of lines in a text file
PostPosted: Fri Jun 28, 2019 8:48 am 

Joined: Sun Jun 22, 2003 7:43 pm
Posts: 415
Location: Germany, Saarbrücken
Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils

_________________
Electronics, Crazy & Interesting Stuff, all that with text, image and sound? Click here!

The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.


 Post subject: Re: Numbers of lines in a text file
PostPosted: Fri Jun 28, 2019 9:06 am 

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3372
Location: Netherlands
NicTheQuick wrote:
Here you can find the source code of wc: coreutils/wc.c at master · coreutils/coreutils

Nice code :)

That code processes the file in blocks of 16 KiB and counts only LF characters (not CR LF).
It starts with a long_lines flag set to false. As soon as it detects a block where the average line length is >= 15, it sets the flag to true and switches to a different counting method based on memchr.
It's possible that the C function memchr uses speed-optimized code for different processors.

_________________
macOS 10.14 Mojave, PB 5.62 x64

