[Solved] Fastest way to find text in a folder of many files?

Post by BarryG »

I have 2100 plain text files in a folder and I need to search for "foo" in all of them, listing only the files that contain that text. I'm currently looping through all the files (no sub-folders), and below is the code I use to find the matches (ok=1 means found), but the loop takes around 20-30 seconds to parse them all, which is not really acceptable. I'm aiming for a parse time of 2-3 seconds at most. Is there a quicker way? Thanks.

Note: I have tried increasing the file buffer size, but it didn't help.

PS. Maybe a native (not third-party) DOS command could do it and return an output of filenames faster? But I can't find one; "findstr" doesn't open some of the files due to their names being Unicode (it shows "?" in the filename, instead of the Unicode character - see screenshot).

Image

Code:

f=ReadFile(#PB_Any,file$)
If f
  Repeat
    If FindString(ReadString(f,#PB_File_IgnoreEOL),text$)
      ok=1
      Break
    EndIf
  Until Eof(f)
  CloseFile(f)
EndIf

I worked it out! I shouldn't have used the Repeat/Until loop, because with #PB_File_IgnoreEOL a single ReadString() call already reads the entire text file. So the amended code is as follows, and now my 2100 files are parsed in 1-2 seconds. Yay!

Code:

f=ReadFile(#PB_Any,file$)
If f
  If FindString(ReadString(f,#PB_File_IgnoreEOL),text$)
    ok=1
  EndIf
  CloseFile(f)
EndIf
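
For anyone wanting the whole loop, the snippet above could sit inside a directory scan roughly like this. It is only a sketch under assumptions: dir$ and the "foo" search text are placeholder values, and the files are assumed to be in a text encoding that ReadString() handles with its default format.

```purebasic
; Sketch: list every file in one folder (no sub-folders) whose contents contain text$.
; dir$ and text$ are placeholder values, not taken from the original post.
EnableExplicit

Define dir$ = "C:\Temp\"   ; hypothetical folder; must end with a path separator
Define text$ = "foo"
Define d, f, content$

d = ExamineDirectory(#PB_Any, dir$, "*.*")
If d
  While NextDirectoryEntry(d)
    If DirectoryEntryType(d) = #PB_DirectoryEntry_File
      f = ReadFile(#PB_Any, dir$ + DirectoryEntryName(d))
      If f
        ; One ReadString() call with #PB_File_IgnoreEOL returns the rest of the file
        content$ = ReadString(f, #PB_File_IgnoreEOL)
        If FindString(content$, text$)
          Debug DirectoryEntryName(d)
        EndIf
        CloseFile(f)
      EndIf
    EndIf
  Wend
  FinishDirectory(d)
EndIf
```

Each file is opened once, read in a single call and searched in memory.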
Last edited by BarryG on Thu Aug 26, 2021 12:11 pm, edited 1 time in total.

Post by idle »

You could try modifying this example, with or without the trie.
The trie is good if you're looking for multiple needles/items. You could also use regular expressions, which would probably work out just as fast or maybe faster, as they do much the same as what the trie does.
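
The regular-expression route mentioned above could look something like this minimal sketch, which uses PureBasic's built-in RegularExpression library. The needle list "foo|bar|xyz" and the content$ value are made-up stand-ins for illustration. Needles containing regex metacharacters would need escaping first.

```purebasic
; Sketch: test a block of text against several needles at once with one regular expression.
EnableExplicit

Define content$ = "alpha foo beta"   ; stand-in for text read from a file
Define re = CreateRegularExpression(#PB_Any, "foo|bar|xyz")   ; hypothetical needle list

If re
  If MatchRegularExpression(re, content$)
    Debug "match"
  EndIf
  FreeRegularExpression(re)
Else
  Debug RegularExpressionError()
EndIf
```

Creating the expression once and reusing it for every file avoids recompiling the pattern each time. Whether it actually beats FindString() would have to be measured.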

squint.pbi here
viewtopic.php?f=12&t=75783

Code:

;Example of a FindStrings to find multiple occurrences of tokens and return their positions

IncludeFile "Squint2.pbi"

UseModule SQUINT 
EnableExplicit 

Structure Item 
  key.s 
  count.l
  len.l
  List positions.i()
EndStructure 

Structure FindStrings 
  List *item.item() 
EndStructure

Global FindStringsItems.FindStrings 
  
Procedure FindStrings(*squint.squint,*source,*keys,*items.FindStrings=0) 
  Protected *inp.Character,*node.squint_node,*sp    
  Protected count,key.s,*item.item    
  
  If *source 
    *inp = *source 
    *sp = *source 
    While *inp\c <> 0
      While *inp\c > 32 
        *inp+2 
      Wend 
      key = PeekS(*source,(*inp-*source)>>1)
      *node = Squintset(*squint,@key,0) 
      If Not *node\value 
        *item =  AllocateStructure(item)
        AddElement(*item\positions()) 
        *item\positions() = *source - *sp
        *item\count = 1 
        *item\key = key 
        *item\len = (*inp-*source)>>1
        *node\value = *item
      Else   
        *item = *node\value 
        AddElement(*item\positions())
        *item\positions() = *source - *sp
        *item\len = (*inp-*source)>>1
        *item\count + 1 
      EndIf   
      If *inp\c <> 0
        *inp+2
        *source = *inp 
      Else 
        Break 
      EndIf   
    Wend 
  EndIf 
  
  *inp = *keys 
  While *inp\c <> 0
    While *inp\c > 32 
      *inp+2 
    Wend 
    key = PeekS(*keys,(*inp-*keys)>>1)
    *item = Squintget(*squint,@key)
    If *item
      count + *item\count  
      If *items
        If *item\count >= 1 
          AddElement(*items\item()) 
          *items\item() = *item 
        EndIf
      EndIf   
    EndIf   
    If *inp\c <> 0
      *inp+2
      *keys = *inp 
    Else 
      Break 
    EndIf   
  Wend 
  
  ProcedureReturn count 
  
EndProcedure 

Procedure cbFindStringsFree(*key,*value,*Data) 
  FreeStructure(*value) 
EndProcedure   

Procedure cbFindStringsEnum(*key,*value,*items.FindStrings) 
  AddElement(*items\item()) 
  *items\item() = *value
  Debug PeekS(*key,-1,#PB_UTF8)
EndProcedure   

Procedure FindStringsFree(*squint.squint) 
  SquintWalk(*squint,@cbFindStringsFree()) 
  SquintFree(*squint) 
EndProcedure   

Procedure FindStringsEnum(*mt.squint,key.s,*items.FindStrings) 
  ClearList(*items\item())
  SquintEnum(*mt,@key,@cbFindStringsEnum(),*items) 
EndProcedure   

Global String1.s = "373 ac3 b9d45 b iPdC ks23 al97 373 ac5 al99 346 vs42159ssbpx roro ask ePOC foo bar xyz 12dk tifer erer e"
Global String2.s = "346 373 iPdC roro ePOC ac3 375"
Global out.s
Global FindStringsItems.FindStrings 
Global *squint.squint = SquintNew() 
Global Replace.s = "-----------------------------------------------------------"

Debug Str(FindStrings(*squint,@String1,@String2,@FindStringsItems)) + " tokens found"

Debug "item, count and it position and the item " 

ForEach FindStringsItems\item() 
  out=""
  Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
  ForEach FindStringsItems\item()\positions() 
    out + Str(FindStringsItems\item()\positions()) + ": " + PeekS(@string1 + FindStringsItems\item()\positions(),FindStringsItems\item()\len) + " " 
    CopyMemory(@Replace,@string1+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character)) ;replace found items
  Next 
  Debug out 
Next

Debug "replaced found strings with ---"
Debug string1 

Debug "Enum from a" 
 
FindStringsEnum(*squint,"a",@FindStringsItems) 

Debug "for each of a" 

ForEach FindStringsItems\item() 
  Debug FindStringsItems\item()\key 
Next

FindStringsFree(*squint) 

Post by BarryG »

Hi idle, took a look but I can't work out where to specify a directory of files to parse for "foo". Are you able to show me? The code looks really complicated and un-Basic to my eyes. Thanks.

Post by idle »

Sorry if it doesn't look very BASIC-like. I can do a demo tomorrow; I'm off for the evening.

You can use this to scan for a list of candidate files for instance.

Code:

CompilerIf #PB_Compiler_OS = #PB_OS_Windows 
  #Cdir = "\" 
CompilerElse 
  #Cdir = "/"
CompilerEndIf 

Structure FileDate 
  Created.i
  Modified.i
  Accessed.i
EndStructure   

Structure File 
  Name.s
  Attributes.i
  Date.FileDate
  Size.q
EndStructure   

Procedure GetFileList(StartDirectory.s,List Lfiles.file(),Pattern.s="*.*",Recursive=1)
  Protected PatternCount,Depth,a,CurrentDirectoryID,Directory.s,TempDirectory.s
  Protected FileAttributes.i,FileSize.i,FileDate.FileDate,FileName.s,FullFileName.s 
  
  Static NewList Lpattern.s()
  Static PatternSet,FileCount 
  
  If Not PatternSet
    Pattern = RemoveString(Pattern,"*.")
    PatternCount = CountString(Pattern,"|") + 1
    ClearList(lpattern())
    For a = 1 To PatternCount 
      AddElement(Lpattern())
      Lpattern() = UCase(StringField(Pattern,a,"|"))
    Next
    PatternSet=1
  ElseIf depth = 0 
    PatternSet = 0
  EndIf 
  
  CurrentDirectoryID = ExamineDirectory(#PB_Any, StartDirectory, "*.*") 
  If CurrentDirectoryID 
    While NextDirectoryEntry(CurrentDirectoryID)
      If DirectoryEntryType(CurrentDirectoryID) = #PB_DirectoryEntry_File
        Directory = StartDirectory
        FileName = DirectoryEntryName(CurrentDirectoryID)
        FileDate\Created = DirectoryEntryDate(CurrentDirectoryID,#PB_Date_Created)
        FileDate\Modified = DirectoryEntryDate(CurrentDirectoryID,#PB_Date_Modified)
        FileDate\Accessed = DirectoryEntryDate(CurrentDirectoryID,#PB_Date_Accessed)
        FileAttributes = DirectoryEntryAttributes(CurrentDirectoryID)
        FileSize = DirectoryEntrySize(CurrentDirectoryID) 
        
        ForEach Lpattern()
          If lpattern() = "*" Or GetExtensionPart(UCase(FileName)) = lpattern()
            FullFileName.s = StartDirectory + FileName 
            AddElement(LFiles()) 
            Lfiles()\Name = FullFileName
            Lfiles()\Date = FileDate 
            Lfiles()\Size = FileSize 
            Lfiles()\Attributes = FileAttributes 
            FileCount+1
          EndIf
        Next  
        
      Else
        TempDirectory = DirectoryEntryName(CurrentDirectoryID)
        If TempDirectory <> "." And TempDirectory <> ".."
          If Recursive = 1
            Depth + 1
            GetFileList(StartDirectory + TempDirectory + #Cdir,LFiles(),Pattern,Recursive) 
          EndIf
        EndIf
      EndIf
    Wend
    FinishDirectory(CurrentDirectoryID)
  EndIf
  
  ProcedureReturn FileCount
  
EndProcedure

Procedure SortFileListByDate(List InputFiles.file(),List OutPutFiles.file(),Order=#PB_Sort_Ascending,DateOption=#PB_Date_Modified,StartDate=0,EndDate=$7FFFFFFF)
  
  Select DateOption 
    Case #PB_Date_Modified 
      SortStructuredList(InputFiles(),Order,(OffsetOf(File\Date)+OffsetOf(FileDate\Modified)),#PB_Integer) 
    Case #PB_Date_Accessed 
      SortStructuredList(InputFiles(),Order,(OffsetOf(File\Date)+OffsetOf(FileDate\Accessed)),#PB_Integer)
    Case #PB_Date_Created
      SortStructuredList(InputFiles(),Order,(OffsetOf(File\Date)+OffsetOf(FileDate\Created)),#PB_Integer)
  EndSelect 
  
  If StartDate 
    ForEach InputFiles() 
      Select DateOption 
        Case #PB_Date_Modified 
          If (InputFiles()\Date\Modified >= StartDate And InputFiles()\Date\Modified <= EndDate) 
            AddElement(OutPutFiles()) 
            CopyStructure(@InputFiles(),@OutPutFiles(),File)
          EndIf
        Case #PB_Date_Accessed 
          If (InputFiles()\Date\Accessed >= StartDate And InputFiles()\Date\Accessed <= EndDate) ; compare the accessed date on both bounds
            AddElement(OutPutFiles()) 
            CopyStructure(@InputFiles(),@OutPutFiles(),File)
          EndIf
        Case #PB_Date_Created 
          If (InputFiles()\Date\Created >= StartDate And InputFiles()\Date\Created <= EndDate) ; compare the created date on both bounds
            AddElement(OutPutFiles()) 
            CopyStructure(@InputFiles(),@OutPutFiles(),File)
          EndIf
      EndSelect  
    Next  
  EndIf
  
  ProcedureReturn ListSize(OutPutFiles()) 
  
EndProcedure   

Procedure SortFileListBySize(List InputFiles.file(),List OutPutFiles.file(),Order=#PB_Sort_Ascending,MinimumSize=0,MaximumSize.q=$7FFFFFFFFFFFFFFF) 
  
  SortStructuredList(InputFiles(),Order,OffsetOf(File\Size),#PB_Quad) ; Size is a .q field, so sort it as a quad
  If MinimumSize 
    ForEach InputFiles() 
      If (InputFiles()\Size >= MinimumSize And InputFiles()\Size <= MaximumSize)
        AddElement(OutPutFiles()) 
        CopyStructure(@InputFiles(),@OutPutFiles(),File)
      EndIf
    Next    
  EndIf 
  
  ProcedureReturn ListSize(OutPutFiles())  
  
EndProcedure  



Global NewList AllFiles.File() 
Global NewList FilteredFiles.File() 
Global NewList ReFilteredFiles.File()
Global StartDate = Date(2014,1,1,0,0,0) 
Global EndDate = Date(2016,12,31,23,59,59)
Global Path.s = #PB_Compiler_Home  


If GetFileList(Path,AllFiles(),"*.pb|*.pbi")  ;get all pb files in the directory recursively  
  
  If SortFileListByDate(Allfiles(),FilteredFiles(),#PB_Sort_Ascending,#PB_Date_Modified,StartDate,EndDate) ;sort and filter by date between dates  
    ForEach FilteredFiles() 
      Debug FilteredFiles()\Name 
      Debug FormatDate("%dd/%mm/%yyyy",FilteredFiles()\Date\Modified) 
    Next
  EndIf 
  
  Debug "++++++++++++++++++++++++++++++++++++++++++++++++++"
  
  If SortFileListBySize(FilteredFiles(),ReFilteredFiles(),#PB_Sort_Descending,10000,100000) ;re-sort and filter by size between sizes 
     ForEach ReFilteredFiles() 
      Debug ReFilteredFiles()\Name 
      Debug Str(ReFilteredFiles()\Size / 1024) + " KB" 
    Next
  EndIf 
  
EndIf 

Post by Marc56us »

BarryG wrote: ..."findstr" doesn't open some of the files due to their names being Unicode
Hi Barry,

Two other options, if you like:
  • Try FIND instead of FINDSTR (FIND supports UTF-16; FINDSTR does not).
  • You said you don't want external software, but just in case: use the excellent search tool Everything. Many power users have it, and a lot of software (e.g. TC) now has an option to use it to speed up searches. It comes with a command-line tool and an SDK.

Post by BarryG »

Thanks Idle, I'll wait to see if you can come up with an example. Getting the list of files is done; it's just parsing them for matching text that is slow.

As for using "find" instead, it doesn't always work (doesn't show a file with the target text). Plus it also outputs Unicode filenames with a question mark in a box for the Unicode characters (see screenshot). So, not reliable. Not going to use a third-party solution (like "Everything") because I don't want to rely on anything that I can't code or that isn't part of Windows.

Image

Post by Marc56us »

it's just parsing them for matching text that is slow.
If the files are plain text, don't forget to use #PB_File_IgnoreEOL when loading, it speeds up the reading considerably.

... Plus it also outputs Unicode filenames with a question mark in a box for the Unicode characters (see screenshot)
Have you tried changing the code page in cmd?

Code:

chcp 65001
(This is how I correctly display the french accents of Utf-8 files in the Cmd console)


Post by NicTheQuick »

If you are using Windows you can also enable the file indexer to scan all your files in advance. Then finding a file with a certain content is a matter of seconds: https://www.groovypost.com/howto/search ... indows-10/

Post by BarryG »

Marc56us wrote: Thu Aug 26, 2021 9:12 am: don't forget to use #PB_File_IgnoreEOL when loading, it speeds up the reading considerably
Yes, my first post shows that. All good there.

Can't use indexing as that's messing with the OS settings (and requires admin rights), which is a no-go.

I tried "find" again after changing the code page in the command prompt, and the output was slower than my procedure. So maybe my way is the fastest I can get anyway (without resorting to assembly that I don't understand).

Probably just one of those things I have to live with.

Post by NicTheQuick »

You could also implement the Knuth-Morris-Pratt algorithm which should speed up finding your word a lot. I don't think that FindString implements that algorithm.
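
As a rough illustration of the idea, here is a minimal Knuth-Morris-Pratt sketch in PureBasic. The names KMPFind and CharArray are made up for this example, and the return value copies FindString()'s convention (1-based position, 0 if not found). Whether it is actually faster than FindString() for a short word like "foo" is an open question that would need timing.

```purebasic
; Sketch of Knuth-Morris-Pratt substring search.
; KMPFind and CharArray are hypothetical names used only for this example.
EnableExplicit

Structure CharArray
  c.c[0]   ; zero-size static array: indexed access to characters without a bounds check
EndStructure

Procedure KMPFind(haystack$, needle$)
  Protected n = Len(haystack$), m = Len(needle$)
  Protected i, k
  Protected *h.CharArray = @haystack$
  Protected *p.CharArray = @needle$

  If m = 0 Or m > n
    ProcedureReturn 0
  EndIf

  ; fail(i) = length of the longest proper prefix of the needle's first
  ; i+1 characters that is also a suffix of them
  Protected Dim fail(m)
  k = 0
  For i = 1 To m - 1
    While k > 0 And *p\c[i] <> *p\c[k]
      k = fail(k - 1)
    Wend
    If *p\c[i] = *p\c[k]
      k + 1
    EndIf
    fail(i) = k
  Next

  ; Scan the haystack once; on a mismatch, fall back through the table
  ; instead of re-reading haystack characters
  k = 0
  For i = 0 To n - 1
    While k > 0 And *h\c[i] <> *p\c[k]
      k = fail(k - 1)
    Wend
    If *h\c[i] = *p\c[k]
      k + 1
    EndIf
    If k = m
      ProcedureReturn i - m + 2   ; convert the 0-based end index to a 1-based start position
    EndIf
  Next

  ProcedureReturn 0
EndProcedure

Debug KMPFind("xx foo bar", "foo")   ; same position FindString() reports for this input
```

The failure table is built once per needle, so when many files are searched it could be built a single time and reused.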

Post by BarryG »

Worked it out! See first post.

Post by AZJIO »

You can compare it with my program (screenshot).

Post by helpy »

I use grepWin to search ...
-> https://tools.stefankueng.com/grepWin.html