[SOLVED] FindString() but specify a string occurrence

Just starting out? Need help? Post your questions and find answers here.
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

[SOLVED] FindString() but specify a string occurrence

Post by Oso »

I noticed that FindString() provides an optional start character position, but it doesn't appear to provide an occurrence function — for instance, return the position of the 3rd occurrence of "/" in a string. Is there a function available that I've missed, or even an easier way than mine below?

I have written this procedure to do it, but it would.be good if there is a shorter alternative. I've tried to reduce the code to a minimum. Am I missing an easier way? Thanks.

Code: Select all

EnableExplicit

; **
; ** Find position of n-th occurrence of a sub-string in a string - for example the 3rd "/" in 1/2/3/4/5 would be 6
; **
Procedure.q Find_String_Index(string.s, find.s, index.q)
  
  Define findpos.q                                                      ; Running FindString() result for each occurrence
  Define foundpos.q                                                     ; Found position of n-th occurrence
  Define findcnt.q                                                      ; Running counter of found sub-string
  Define stpos.q                                                        ; Running start position for FindString()
  Define done.b                                                         ; Loop exit indicator
  
  If index.q > 0                                                        ; Index.q value must be at least 1
    stpos.q = 1                                                         ; Start at position 1 of string to be searched
    done.b = #False
    
    While Not(done.b)                                                   ; Loop until exit flag set
      findpos.q = FindString(string.s, find.s, stpos.q)                 ; Find sub-string with rolling start position stpos
      If findpos.q                                                      ; Found sub-string with position findpos.q
        findcnt.q + 1                                                   ; Increment our occurrence counter
        stpos.q = findpos.q + 1                                         ; Set next start position one char on from found pos.
        If findcnt.q = index.q
          foundpos.q = findpos.q                                        ; Set the position of the final 'index' occurrence
          done.b = #True                                                ; Nothing more to do
        EndIf
      Else
        done.b = #True                                                  ; FindString() didn't find, so nothing more to do
      EndIf
    Wend
  EndIf
  ProcedureReturn foundpos.q                                            ; Return character position where this count found
  
EndProcedure

Define string.s = "words()are()separated()by()bracketed()delimiters()like()this."

Debug Find_String_Index(string.s, "()", 2)  ; <---- Will return 11
Debug Find_String_Index(string.s, "()", 5)  ; <---- Will return 37
Debug Find_String_Index(string.s, "()", 7)  ; <---- Will return 55
Debug Find_String_Index(string.s, "()", 8)  ; <---- Will return 0 since there are only 7 sub-strings
Debug Find_String_Index(string.s, "()", 0)  ; <---- Test nonsense value, will return 0
Debug ""

string.s = "words/are/separated/by/bracketed/delimiters/like/this."
Debug Find_String_Index(string.s, "/", 2)  ; <---- Will return 10
Debug Find_String_Index(string.s, "/", 5)  ; <---- Will return 33
Debug Find_String_Index(string.s, "/", 7)  ; <---- Will return 49
Debug Find_String_Index(string.s, "/", 8)  ; <---- Will return 0 since there are only 7 sub-strings
Debug Find_String_Index(string.s, "/", 0)  ; <---- Test nonsense value, will return 0
Last edited by Oso on Sun Jan 22, 2023 10:52 am, edited 1 time in total.
User avatar
skywalk
Addict
Addict
Posts: 3994
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: FindString() but specify a string occurrence

Post by skywalk »

stringfield()
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
User avatar
Lord
Addict
Addict
Posts: 849
Joined: Tue May 26, 2009 2:11 pm

Re: FindString() but specify a string occurrence

Post by Lord »

Another way:

Code: Select all

Procedure FindStringIndex(string.s, searched4.s, Index.i)
  found=0
  Repeat
    Position=FindString(string, searched4, Position)
    If Position
      found+1
      If found=Index
        ProcedureReturn Position
      EndIf
      Position+1
    EndIf
  Until Position=0
EndProcedure

Test.s="words/are/separated/by/bracketed/delimiters/like/this."

Debug FindStringIndex(Test, "/", 3)
Image
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

Lord wrote: Wed Nov 23, 2022 4:53 pm Another way:

Code: Select all

Procedure FindStringIndex(string.s, searched4.s, Index.i)
  found=0
  Repeat
    Position=FindString(string, searched4, Position)
    If Position
      found+1
      If found=Index
        ProcedureReturn Position
      EndIf
      Position+1
    EndIf
  Until Position=0
EndProcedure
Yes, that is quite nice tight code, thanks. I see you're feeding Position + 1 back into the FindString(), so you're saving an extra variable there. Actually I wondered about using ProcedureReturn in the middle of loop, as you have done, but I get a bit concerned about garbage collection with PB — perhaps unfounded on my part, but I worry that breaking out of something might leak memory :D

I think your example code is probably likely to be quicker than mine. Ideally, I'm trying to think about performance, because this is something that I'm calling often in large amounts of data.
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

skywalk wrote: Wed Nov 23, 2022 4:50 pmstringfield()
Thanks skywalk but I'm looking for the character position, rather than the actual data. I had also been thinking about StringField() since it's perfect for extracting values between delimiters. The dialect of Basic that I've been using for years, has an equivalent to PB's StringField() but also provides two extra functions called Col1() and Col2() that return the character position before and the position after the string. I wish PB had this function.

Code: Select all

string.s = "abc/def/ghi/jkl"
debug StringField(string.s, 2, "/")                      ; <---- Gives def
debug Col1()                                             ; <----- Hypothetical function, gives 4
debug Col2()                                             ; <----- Hypothetical function, gives 8
User avatar
skywalk
Addict
Addict
Posts: 3994
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: FindString() but specify a string occurrence

Post by skywalk »

Yes, you must roll your own. If speed is issue, use CompareMemoryString() while walking through characters.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
User avatar
ChrisR
Addict
Addict
Posts: 1150
Joined: Sun Jan 08, 2017 10:27 pm
Location: France

Re: FindString() but specify a string occurrence

Post by ChrisR »

skywalk wrote: Wed Nov 23, 2022 5:53 pm If speed is issue, use CompareMemoryString() while walking through characters
Something like this

Code: Select all

EnableExplicit

Procedure FindStringIndex(String.s, StringFind.s, Index.i)
  Protected *String.character, *StringFind.character
  Protected LenStringFind.i, Find.i, FindIndex.i, Position.i, Exit.i
  
  If Index < 1 : ProcedureReturn 0 : EndIf 
  *String = @String
  *StringFind = @StringFind
  If *String\c And *StringFind\c
    LenStringFind = MemoryStringLength(*StringFind)
    Position = 1
    Repeat
      If *String\c = 0
        Exit = #True
      Else
        If *String\c = *StringFind\c
          If LenStringFind = 1 Or (MemoryStringLength(*String) >= LenStringFind And CompareMemoryString(*String, *StringFind, #PB_String_CaseSensitive, LenStringFind) = #PB_String_Equal)
            Find = #True
          EndIf
        EndIf
      EndIf
      If Find
        FindIndex + 1
        If FindIndex = Index
          Break
        Else
          *String + LenStringFind * SizeOf(character)
          Position + LenStringFind
          Find = #False
        EndIf
      Else
        *String + SizeOf(character)
        Position + 1
      EndIf
    Until Exit
    If Not FindIndex Or FindIndex <> Index
      Position = 0
    EndIf
  EndIf
  ProcedureReturn Position
EndProcedure

;-Example
CompilerIf #PB_Compiler_IsMainFile
  
  Define String.s = "abc/def/ghi/jkl"
  Define StringFind.s = "\"
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 2) = " + FindStringIndex(String, StringFind, 2)
  
  StringFind.s = "/"
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 0) = " + FindStringIndex(String, StringFind, 0)
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 1) = " + FindStringIndex(String, StringFind, 1)
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 2) = " + FindStringIndex(String, StringFind, 2)
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 3) = " + FindStringIndex(String, StringFind, 3)
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 4) = " + FindStringIndex(String, StringFind, 4)
  
  String = "abc-/-def-\-ghi-/-jkl"
  StringFind = "-/-"
  Debug "FindStringIndex(" + String + ", " + StringFind + ", 2) = " + FindStringIndex(String, StringFind, 2) 
  
CompilerEndIf
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

ChrisR wrote: Wed Nov 23, 2022 7:04 pm Something like this
Thanks very much for writing this code ChrisR, it's interesting stuff. This presumably starts to be more efficient when looking for longer strings rather than a few characters, because for single-character searches, it's still stepping through the string one character (or 1 unicode) at a time.

There is some code I don't follow. In the few short months that I've been using PB and have seen syntax similar to the below, I still haven't managed to find it in the documentation...

Code: Select all

  Protected *String.character, *StringFind.character
  
  If *String\c And *StringFind\c
1. What is the .character appended to the *String pointers?
2. Similarly, what is the \c that follows *String and *StringFind?
User avatar
Lord
Addict
Addict
Posts: 849
Joined: Tue May 26, 2009 2:11 pm

Re: FindString() but specify a string occurrence

Post by Lord »

Oso wrote: Wed Nov 23, 2022 5:15 pm...
I think your example code is probably likely to be quicker than mine. Ideally, I'm trying to think about performance, because this is something that I'm calling often in large amounts of data.
I didn't look for speed, just another approach.
But I think that both (and ChrisR's code) are at almost the same speed.
I tried all three with

Code: Select all

Test.s="words/are/separated/by/bracketed/delimiters/like/this."

t1=ElapsedMilliseconds()
For i=1 To 10000000
  FindStringIndex(Test, "/", 3)
Next
MessageRequester("", Str(ElapsedMilliseconds()-t1))
Debugg off!
Image
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

Lord wrote: Wed Nov 23, 2022 7:59 pm I didn't look for speed, just another approach.
But I think that both (and ChrisR's code) are at almost the same speed.
I've been doing some performance testing with surprising results. I think the real performance is best gauged by dramatically increasing the size of the string to be searched because it then tests the robustness of the procedure, rather than only multiplying the result on light loads, in which case I find that PB's own FindString is by far the fastest — like this 20 MB string...

Code: Select all

Test.s = Space(10000000) + "words/are/separated/by/bracketed/delimiters/like/this."

t1=ElapsedMilliseconds()
;     For i=1 To 10000000
  FindStringIndex(Test, "/", 3)
;     Next
MessageRequester("", Str(ElapsedMilliseconds()-t1))
I suspect that ChrisR's memory version becomes the fastest for longer search strings, rather than single character searches, because his approach benefits by using CompareMemoryString.
User avatar
ChrisR
Addict
Addict
Posts: 1150
Joined: Sun Jan 08, 2017 10:27 pm
Location: France

Re: FindString() but specify a string occurrence

Post by ChrisR »

Oso wrote: Wed Nov 23, 2022 8:28 pm I suspect that ChrisR's memory version becomes the fastest for longer search strings, rather than single character searches, because his approach benefits by using CompareMemoryString.
I thought so too but no, Lord's version is faster and easier

Code: Select all

EnableExplicit

Procedure FindStringIndexLord(string.s, StringFind.s, Index.i)
  Protected found.i, Position.i
  If Index > 0 And Not String = "" And Not StringFind = ""
    Repeat
      Position = FindString(string, StringFind, Position)
      If Position
        found + 1
        If found = Index
          ProcedureReturn Position
        EndIf
        Position + 1
      EndIf
    Until Position = 0
  EndIf
  ProcedureReturn 0
EndProcedure

Procedure FindStringIndexChris(String.s, StringFind.s, Index.i)
  Protected *String.character, *StringFind.character
  Protected LenStringFind.i, Find.i, FindIndex.i, Position.i, Exit.i
  
  If Index < 1 : ProcedureReturn 0 : EndIf 
  *String = @String
  *StringFind = @StringFind
  If *String\c And *StringFind\c
    LenStringFind = MemoryStringLength(*StringFind)
    Position = 1
    Repeat
      If *String\c = 0
        Exit = #True
      Else
        If *String\c = *StringFind\c
          If LenStringFind = 1 Or (MemoryStringLength(*String) >= LenStringFind And CompareMemoryString(*String, *StringFind, #PB_String_CaseSensitive, LenStringFind) = #PB_String_Equal)
            Find = #True
          EndIf
        EndIf
      EndIf
      If Find
        FindIndex + 1
        If FindIndex = Index
          Break
        Else
          *String + LenStringFind * SizeOf(character)
          Position + LenStringFind
          Find = #False
        EndIf
      Else
        *String + SizeOf(character)
        Position + 1
      EndIf
    Until Exit
    If Not FindIndex Or FindIndex <> Index
      Position = 0
    EndIf
  EndIf
  ProcedureReturn Position
EndProcedure

;-Example
CompilerIf #PB_Compiler_IsMainFile

   Define String.s = "abcdef-/-ghijkl-\-mnopqr-/-stuvwx"
   Define StringFind.s = "-/-"
   Define pos1, pos2, t1, t2, t3, I
   DisableDebugger
  
  t1 = ElapsedMilliseconds()
  For I = 1 To 1000000
    pos1 = FindStringIndexLord(String, StringFind, 2)
  Next
  t2 = ElapsedMilliseconds()
  
  For I = 1 To 1000000
    pos2 = FindStringIndexChris(String, StringFind, 2)
  Next
  t3 = ElapsedMilliseconds()
  MessageRequester("FindStringIndex vs", "FindStringIndexLord position " + Str(pos1) + " (" + Str(t2 - t1) + " ms)" + #CRLF$ + "FindStringIndexChris position " + Str(pos2) + " (" + Str(t3 - t2) + " ms)") 
  
CompilerEndIf
FindStringIndexLord position 25 (154 ms)
FindStringIndexChris position 25 (203 ms)
User avatar
Demivec
Addict
Addict
Posts: 4089
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: FindString() but specify a string occurrence

Post by Demivec »

Oso wrote: Wed Nov 23, 2022 7:58 pm There is some code I don't follow. In the few short months that I've been using PB and have seen syntax similar to the below, I still haven't managed to find it in the documentation...

Code: Select all

  Protected *String.character, *StringFind.character
  
  If *String\c And *StringFind\c
1. What is the .character appended to the *String pointers?
2. Similarly, what is the \c that follows *String and *StringFind?
'Character' is a predefined PureBasic structure that contains only one element 'c.c' that is of type 'character'.. You can find its definition with the IDE tool called 'Structure Viewer' if my memory serves me well. There are similar definitions for each of the basic variable types. I think the exception is there is no definition for a 'fixed String'. The names of the ones defined along with their only element in '{}' are: quad {q}, integer {i}, long {l}, double {d}, float {f}, unicode {u}, ascii {a}, character {c}, word {w}, byte {b} and string {s}.

@Edit: Added the dynamic string structure 'string' to the list.
Last edited by Demivec on Sun Nov 27, 2022 2:57 am, edited 2 times in total.
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

ChrisR wrote: Wed Nov 23, 2022 8:39 pm I thought so too but no, Lord's version is faster and easier
Yes, I didn't realise quite how quick PB's FindString is. I remembered reading a post on the forum about one of the string functions (probably FindString) being inefficient if it is called multiple times, because the poor old function has to start right at the beginning each time to search. :( That's why I was keen on your *memory pointer version, but in fact it seems that FindString() is about as quick as we can get and even better if we set the start position.

One of the tests I did was on my earlier version from today, with MID(), before I remembered to use FindString(). It just dies on the 20MB string, it can't cope with it at all...

Code: Select all

  While chrpos.q < lenstr.q                                             ; Continue through string to be searched until end
    chrpos.q + 1                                                        ; Advance find character position
    
    If Mid(string.s,chrpos.q,lenfind.q) = find.s                        ; Found at this position?
      chrcnt.q + 1                                                      ; Advance the count of number found so far
      If chrcnt.q = index.q                                             ; Reached the required count?
        ProcedureReturn chrpos.q                                        ; Return the character position where this count found
      EndIf
    EndIf
  Wend
With FindString(), it takes only milliseconds to find a single character in the 20MB data, which is largely what I need this for, just single delimiters.
Last edited by Oso on Wed Nov 23, 2022 9:26 pm, edited 1 time in total.
Oso
Enthusiast
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am

Re: FindString() but specify a string occurrence

Post by Oso »

Demivec wrote: Wed Nov 23, 2022 8:51 pm 'Character' is a predefined PureBasic structure that contains only one element 'c.c' that is of type 'character'.. You can find its definition with the IDE tool called 'Structure Viewer' if my memory serves me well. There are similar definitions for each of the basic variable types. I think the exception is there is no definition for a fixed String. The names of the ones defined along with their only element in '{}' are: quad {q}, integer {i}, long {l}, double {d}, float {f}, unicode {u}, ascii {a}, character {c}, word {w}, and byte {b}.
Thanks for this Demivec, I appreciate the explanation. To be honest, it was one of those things I saw months ago but my brain has been overloaded. I'm fascinated with PB but also overwhelmed with new concepts. When I normally use pointers myself, I don't append anything to the pointer name, so presumably I'm just defining an integer I guess.
User avatar
idle
Always Here
Always Here
Posts: 5089
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: FindString() but specify a string occurrence

Post by idle »

If performance matters pass strings by reference. Use while loops and terminate on null.

If you need to do multiple searches of the same data use either a map with duplicates add a linked list of positions or use squint3 with a list of positions
you also get prefix search and sorting for free.
Squint3 can do ~10,000,00 64 byte key lookups per second vs map ~7,000,000 with same random data but that's for keys that exist if keys don't exist squint rates increase as it bails out early it doesn't need to process the whole key.

If you only need to look up a couple of keys use wilberts finddata module for long keys > 16 bytes use a boyer moore for short keys use SSE search. They are both reentrant. They are ~10x faster than a normal string search. As they don't evaluate every byte a d step through by key length so longer search keys leads to huge gains.
Post Reply