Word Count

Bare metal programming in PureBasic, for experienced users
spacebuddy
Enthusiast
Posts: 347
Joined: Thu Jul 02, 2009 5:42 am

Post by spacebuddy »

Procedure.i CountWords(a$) ;-Count Words


While FindString(a$,Chr(10),0)
a$=ReplaceString(a$,Chr(10)," ")
Wend

While FindString(a$,Chr(13),0)
a$=ReplaceString(a$,Chr(13)," ")
Wend

While FindString(a$,"  ",0)
a$=ReplaceString(a$,"  "," ")
Wend


If (Len(a$)>0)
numwords=CountString(Trim(a$)," ")+1
Else
numwords=CountString(Trim(a$)," ")
EndIf


ProcedureReturn numwords
EndProcedure


I have a routine that counts the number of words in a document. The problem is that it is a little slow on bigger documents. Can anyone convert this code to asm (x64) to see if it would be faster? :D
Demivec
Addict
Posts: 4091
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Word Count

Post by Demivec »

spacebuddy wrote: The problem is that it is a little slow on bigger documents. Can anyone convert this code to asm (x64) to see if it would be faster? :D
I can't help you with an asm version. It was slow for many reasons. Some I fixed in the code below. See if it works for you.

Code:

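Procedure.i CountWords(a$) ;-Count Words
  Protected numwords
  
  ; Turn line breaks into spaces; same-length replacements can be done in place
  ReplaceString(a$,Chr(10)," ", #PB_String_InPlace)
  ReplaceString(a$,Chr(13)," ", #PB_String_InPlace)
  
  ; Collapse runs of spaces until only single spaces remain
  While FindString(a$,"  ",0)
    a$=ReplaceString(a$,"  "," ")
  Wend 
  
  ; Trim() returns the trimmed string, so the result has to be assigned
  a$=Trim(a$)
  If (Len(a$)>0)
    numwords=CountString(a$," ")+1
  Else
    numwords=0
  EndIf
  
  ProcedureReturn numwords 
EndProcedure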
Last edited by Demivec on Tue Jul 29, 2014 10:12 pm, edited 1 time in total.

Re: Word Count

Post by spacebuddy »

Thanks Demivec :D

I have an old and very slow computer; I will test to see if it helps.

Re: Word Count

Post by Demivec »

Here's the simple test code I used:

Code:

Procedure.i CountWords(a$) ;-Count Words
  Protected numwords
  
  ; Turn line breaks into spaces; same-length replacements can be done in place
  ReplaceString(a$,Chr(10)," ", #PB_String_InPlace)
  ReplaceString(a$,Chr(13)," ", #PB_String_InPlace)
  
  ; Collapse runs of spaces until only single spaces remain
  While FindString(a$,"  ",0)
    a$=ReplaceString(a$,"  "," ")
  Wend 
  
  ; Trim() returns the trimmed string, so the result has to be assigned
  a$=Trim(a$)
  If (Len(a$)>0)
    numwords=CountString(a$," ")+1
  Else
    numwords=0
  EndIf
  
  ProcedureReturn numwords 
EndProcedure

filename$ = OpenFileRequester("", "", "Text (*.txt)|*.txt;", 0)
If filename$
  If ReadFile(1, filename$)              ; only read if the file could be opened
    a$ = ReadString(1, #PB_File_IgnoreEOL) ; read the whole file into one string
    CloseFile(1)
  EndIf
EndIf

If a$
  t1 = ElapsedMilliseconds()
  c = CountWords(a$)
  t2 = ElapsedMilliseconds() - t1
  
  MessageRequester("Results", "For file: '" + GetFilePart(filename$) +"', found " + c + " words in " + t2 + " ms.")
EndIf
I have a faster computer and I tested it with a 1418 KB file. It found 237209 words in 74 ms.

I tested the same file with your procedure and I aborted the program after 4 minutes of waiting. :)
IdeasVacuum
Always Here
Posts: 6425
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK

Re: Word Count

Post by IdeasVacuum »

Huh? PB's CountString() will find a partial string or a whole word, so there should be no need to worry about other chars.

For speed, assuming you are working with files, load the file into a memory buffer and then use CountString() directly on the buffer.
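A small illustration of why the other characters still matter when counting with CountString(): it counts occurrences of the search string, not words, so repeated spaces and line breaks change the result. The sample strings are invented for this sketch.

```purebasic
; Illustration only; the sample strings are invented.
; CountString() counts occurrences of the search string, not words.
Debug CountString("one two three", " ") + 1        ; 3: matches the real word count
Debug CountString("one  two three", " ") + 1       ; 4: the double space is counted twice
Debug CountString("one" + #LF$ + "two", " ") + 1   ; 1: a line break is not a space
```

This is why the earlier versions normalise CR, LF and repeated spaces before counting.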
IdeasVacuum
If it sounds simple, you have not grasped the complexity.

Re: Word Count

Post by spacebuddy »

My system is a Q6600 with 1 GB of RAM. Everything runs slow :oops:

Re: Word Count

Post by IdeasVacuum »

.... the bottleneck would be how you load the file; once loaded, everything should be fast. How big are the files that need to be searched?

Re: Word Count

Post by spacebuddy »

IdeasVacuum wrote: .... the bottleneck would be how you load the file; once loaded, everything should be fast. How big are the files that need to be searched?
Files are around 100-200MB; this includes pictures and text. Loading is not a problem :)
Danilo
Addict
Posts: 3037
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Word Count

Post by Danilo »

Pictures are binary data, and using strings of 100MB to 200MB does not make sense with
functions like 'a$=ReplaceString(a$,"  "," ")', because that creates/allocates a new string
of the full size before it releases the old string. The same goes for 'CountString(Trim(a$)," ")', which
would first create a new trimmed string of 100MB to 200MB and then count the
words within this big string. But counting spaces within binary data doesn't make much sense anyway?

Re: Word Count

Post by spacebuddy »

Danilo, this could be a big problem for me, not sure how to fix :cry:

Re: Word Count

Post by Danilo »

What about loading it as binary data into a memory buffer? Then, search the buffer for spaces (Byte value 32).
Depends on the type of data. If it's text files, it depends on how the files are saved (ASCII or Unicode). For pictures,
or other binary data, I don't understand why you want to count space characters in it (.jpg, .png, .bmp)?
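The buffer idea could look roughly like the sketch below. It assumes single-byte text (ASCII or UTF-8) and treats TAB, LF, CR and SPACE as separators; CountWordsInBuffer and CountWordsInFile are hypothetical names made up for this example, not something posted in this thread.

```purebasic
; Sketch only, assuming single-byte text (ASCII or UTF-8).
; CountWordsInBuffer and CountWordsInFile are hypothetical helper names.
Procedure.i CountWordsInBuffer(*Buffer, Size.i)
  Protected *p.Ascii = *Buffer
  Protected *end = *Buffer + Size
  Protected inWord.i = #False
  Protected count.i = 0
  
  While *p < *end
    Select *p\a
      Case 9, 10, 13, 32        ; TAB, LF, CR, SPACE end the current word
        inWord = #False
      Default
        If Not inWord           ; first byte of a new word
          count + 1
          inWord = #True
        EndIf
    EndSelect
    *p + 1                      ; pointer arithmetic is in bytes
  Wend
  
  ProcedureReturn count
EndProcedure

; Loads the whole file into memory and counts its words; returns -1 on failure
Procedure.i CountWordsInFile(filename$)
  Protected file.i, size.q, *mem, bytesRead.i, result.i = -1
  
  file = ReadFile(#PB_Any, filename$)
  If file
    size = Lof(file)
    If size > 0
      *mem = AllocateMemory(size)
      If *mem
        bytesRead = ReadData(file, *mem, size)
        result = CountWordsInBuffer(*mem, bytesRead)
        FreeMemory(*mem)
      EndIf
    Else
      result = 0                ; empty file, no words
    EndIf
    CloseFile(file)
  EndIf
  
  ProcedureReturn result
EndProcedure
```

It can be called as `Debug CountWordsInFile(filename$)` with a path from OpenFileRequester(). No string is copied or reallocated, so memory use stays at one buffer the size of the file. For Unicode (UTF-16) files the scan would have to step over words instead of bytes.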
wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Word Count

Post by wilbert »

It was a bit of a puzzle to create but I hope this works for you.
It should work on x64 and x86, both ascii and unicode.

Code:

Procedure.l CountWords(*Text.Character); Requires SSE
  
  ; init some mmx registers
  !mov eax, 1
  !movd mm4, eax        ; mm4 = previous comparison result
  !pxor mm3, mm3        ; mm3 = 0
  !movq mm2, mm4        ; mm2 = counter
  !mov eax, 0x200d0a09
  !movd mm1, eax
  !punpcklbw mm1, mm3   ; mm1 = separation characters (tab, lf, cr, space)
  !movq mm0, mm4        ; mm0 = working register
  
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    !mov rdx, [p.p_Text]
  CompilerElse
    !mov edx, [p.p_Text]
  CompilerEndIf
  !jmp countwords_entry
  
  ; main loop
  !countwords_loop:
  ; compare character with separation chars
  !pshufw mm0, mm5, 0
  !pcmpeqw mm0, mm1
  !psrlw mm0, 15
  !psadbw mm0, mm3
  ; at this time mm0 = 1 if a separation char is found otherwise 0
  !pandn mm4, mm0
  !paddd mm2, mm4
  ; make a copy of the comparison result
  !movq mm4, mm0
  
  ; entry point for first character
  !countwords_entry:
  CompilerIf #PB_Compiler_Unicode
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
      !movzx eax, word [rdx]
      !add rdx, 2
    CompilerElse
      !movzx eax, word [edx]
      !add edx, 2
    CompilerEndIf
  CompilerElse
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
      !movzx eax, byte [rdx]
      !add rdx, 1
    CompilerElse
      !movzx eax, byte [edx]
      !add edx, 1
    CompilerEndIf
  CompilerEndIf
  !movd mm5, eax
  
  ; loop if not end of string
  !and ax, ax
  !jnz countwords_loop
  
  ; correct counter if last character was a separation character
  !psubd mm2, mm0
  ; set result and empty mmx state
  !movd eax, mm2
  !emms
  ProcedureReturn
  
EndProcedure
Example

Code:

S.s = "This is a test string"
Debug CountWords(@S)
Last edited by wilbert on Wed Jul 30, 2014 1:16 pm, edited 8 times in total.
Windows (x64)
Raspberry Pi OS (Arm64)

Re: Word Count

Post by Danilo »

wilbert's code translated to PB syntax:

Code:

Procedure.l CountWords(*Text.Character)
    Protected wordCount
    If *Text
        While *text\c
            c.c = *text\c                                                    ; get current character
            If c = #TAB Or c = 32 Or c = #CR Or c = #LF                      ; If current char is TAB, SPACE, CR, LF
                *text + SizeOf(Character)                                    ;     ignore it
                Continue                                                     ;     Continue
            Else                                                             ; Else
                wordCount + 1                                                ;     wordCount + 1
                While c And c <> #TAB And c <> 32 And c <> #CR And c <> #LF  ;     take all characters, except: TAB, SPACE, CR, LF, 0
                    *text + SizeOf(Character)                                ;
                    c.c = *text\c                                            ;
                Wend                                                         ;
            EndIf                                                            ; EndIf
        Wend
    EndIf
    ProcedureReturn wordCount
EndProcedure

S.s = "This is a test string"
S.s + #TAB$+Space(10)+#TAB$+#CRLF$+#TAB$+"a bcd"+#LF$+#LFCR$+#CRLF$+Space(10)
Debug CountWords(@S)
davido
Addict
Posts: 1890
Joined: Fri Nov 09, 2012 11:04 pm
Location: Uttoxeter, UK

Re: Word Count

Post by davido »

@wilbert
@Danilo

Thank you for sharing. :D
Both are neater and faster than the one I have been using.

Both are fast enough, but wilbert's is about 4x faster on my machine.
DE AA EB

Re: Word Count

Post by spacebuddy »

Wilbert, I tested this on my machine and it is smoking fast :D