Load huge file directly to MAP

Just starting out? Need help? Post your questions and find answers here.
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Load huge file directly to MAP

Post by Kwai chang caine »

Hello everyone,

I'm looking for the fastest possible way to load a big text file (200 MB) into a MAP.
I tried reading it line by line, but that is very slow.

Code: Select all

Global NewMap MapElements()

Fichier$ =  "C:\MyHugeTxtFile.txt"
Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
      
 Repeat
  Donnee$ = ReadString(Canal, #PB_UTF8)
  MapElements(Donnee$)
 Until Eof(Canal) <> #False

 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
      
EndIf
Then I tried loading everything into memory at once, but this time it's the loop that cuts the variable into lines that takes too long :|

Code: Select all

Global NewMap MapElements()

Fichier$ = "C:\MyHugeTxtFile.txt"

; Load the whole file
Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
  
 TailleFichier = Lof(Canal)
 *Ptr = AllocateMemory(TailleFichier)
 ReadData(Canal, *Ptr, TailleFichier)
 Donnee$ = PeekS(*Ptr, TailleFichier, #PB_UTF8 | #PB_ByteLength)
 FreeMemory(*Ptr)
 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
 
EndIf

MaxElements = CountString(Donnee$, #CRLF$) + 1

For i = 1 To MaxElements
 
 Ligne$ = StringField(Donnee$, i, #CRLF$)
 MapElements(Ligne$)

Next

CallDebugger
Does anyone have a solution to do that?

Have a good day
The happiness is a road...
Not a destination
firace
Addict
Addict
Posts: 899
Joined: Wed Nov 09, 2011 8:58 am

Re: Load huge file directly to MAP

Post by firace »

Hi KCC,

The code below should be faster.
I've tested it with a 200 MB CSV file, and it finishes in 10-11 seconds on my system (no SSD here, just an old HDD).

Can you try it and compare the speed?

Code: Select all


Procedure LoadFileIntoMapFAST(filename$, Map X.s())
  If ReadFile(35, filename$)
    While Not Eof(35)
      andro$ = ReadString(35)
      aa + 1
      X(Str(aa)) = andro$
    Wend
    CloseFile(35)
  EndIf
EndProcedure


Start = ElapsedMilliseconds()
NewMap a.s()

LoadFileIntoMapFAST("hugefile.csv", a())

MessageRequester("", Str(ElapsedMilliseconds() - Start))

Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Load huge file directly to MAP

Post by Kwai chang caine »

Hello FIRACE, glad to talk to you :D

I don't understand, because here, also without an SSD, your code loads the txt file (138 MB) into the MAP in 2 minutes 46 seconds :shock:
And that's without the debugger, with v5.73 x86

Code: Select all

Procedure LoadFileIntoMapFAST(filename$, Map X.s()) 
  If ReadFile(35,  filename$)   : 
    While Not Eof(35) :
      andro$ =  ReadString(35)  
      aa + 1
      X(Str(aa)) = andro$
    Wend
    CloseFile(35)
  EndIf 
EndProcedure


TempDepart = ElapsedMilliseconds()
NewMap a.s()

LoadFileIntoMapFAST("C:\MyHugeTxtFile.txt" , a())

SecondesPasser = (ElapsedMilliseconds() - TempDepart) / 1000
JourHeureMnSec$ = Trim(RemoveString(RemoveString(RemoveString(RemoveString(Str(SecondesPasser / 86400) + FormatDate(" day(s) %hh hour(s) %ii minute(s) %ss second(s)", SecondesPasser % 86400), "0 day(s)"), "00 hour(s)"), "00 minute(s)"), "00 second(s)"))
MessageRequester("", JourHeureMnSec$)
The happiness is a road...
Not a destination
Marc56us
Addict
Addict
Posts: 1477
Joined: Sat Feb 08, 2014 3:26 pm

Re: Load huge file directly to MAP

Post by Marc56us »

Hi KCC,

Try splitting it into two operations (to find out which one is slower: reading the file, or inserting the data into the map).

For the reading, read the whole file in one go with #PB_File_IgnoreEOL:

Code: Select all

While Not Eof(0)
     Txt$ = ReadString(0, #PB_UTF8 | #PB_File_IgnoreEOL)
Wend
CloseFile(0)
Then use StringField() or anything else to split the lines.

:wink:
nsstudios
Enthusiast
Enthusiast
Posts: 274
Joined: Wed Aug 28, 2019 1:01 pm
Location: Serbia
Contact:

Re: Load huge file directly to MAP

Post by nsstudios »

Got 19 seconds with a 200 MB file.

Code: Select all

CreateRegularExpression(0, "^.*$", #PB_RegularExpression_AnyNewLine|#PB_RegularExpression_MultiLine)
Global NewMap MapElements()

Fichier$ = "c:\MyHugeTxtFile.txt"

t1=ElapsedMilliseconds()
; Load the whole file
Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
  
 TailleFichier = Lof(Canal)
 *Ptr = AllocateMemory(TailleFichier)
 ReadData(Canal, *Ptr, TailleFichier)
 Donnee$ = PeekS(*Ptr, TailleFichier, #PB_UTF8|#PB_ByteLength)
 FreeMemory(*Ptr)
 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
 
EndIf

;MaxElements = CountString(Donnee$, #CRLF$) + 1
ExamineRegularExpression(0, Donnee$)
While NextRegularExpressionMatch(0)
;For i = 1 To MaxElements
 Ligne$ = RegularExpressionMatchString(0);StringField(Donnee$, i, #CRLF$)
 MapElements(Ligne$)
;Next
Wend

MessageRequester("done", Str(ElapsedMilliseconds()-t1), 64)
;CallDebugger
If you don't specifically need it to be a map, you can replace the loop with

Code: Select all

Dim lines.s(0)
ExtractRegularExpression(0, Donnee$, lines())
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Load huge file directly to MAP

Post by Kwai chang caine »

Hello Marc56US :D
With your tip, it's even worse..... I wait and wait and nothing happens :|

Code: Select all

Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
   
  While Not Eof(Canal)
   Txt$ = ReadString(Canal, #PB_UTF8 | #PB_File_IgnoreEOL)
   Debug Txt$
  Wend

 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
      
EndIf
@nsstudios
Hello nsstudios, thanks for your code and your interest 8)
The file is loaded in a few seconds, but the While/Wend takes a long time, and the MessageRequester shows 271641

I don't understand, this whole story is incredible :shock:
The happiness is a road...
Not a destination
Marc56us
Addict
Addict
Posts: 1477
Joined: Sat Feb 08, 2014 3:26 pm

Re: Load huge file directly to MAP

Post by Marc56us »

Even with the debugger disabled?
(I'll check tomorrow)
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Load huge file directly to MAP

Post by Kwai chang caine »

nsstudios wrote:If you don't specifically need it to be a map, you can replace the loop with
Waouuuh !!! Your tip is really fast compared to the other ways :shock:

Code: Select all

CreateRegularExpression(0, "^.*$", #PB_RegularExpression_AnyNewLine|#PB_RegularExpression_MultiLine)

Fichier$ = "c:\MyHugeTxtFile.txt"

Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
  
 TailleFichier = Lof(Canal)
 *Ptr = AllocateMemory(TailleFichier)
 ReadData(Canal, *Ptr, TailleFichier)
 Donnee$ = PeekS(*Ptr, TailleFichier, #PB_UTF8|#PB_ByteLength)
 FreeMemory(*Ptr)
 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
 
EndIf

Dim lines.s(0)
ExtractRegularExpression(0, Donnee$, lines())
MaxLines = ArraySize(lines())

For i = 0 To MaxLines
 Debug lines(i)
Next
It's already a good step in the right direction; I'll keep your tip preciously in case I use an array :wink:
It's a pity we can't do the same thing with a MAP :|

Because afterwards, I want to search quickly for a line in the MAP.

In fact, I want to compare two huge txt files.
If a line is not present, or is different, in one file or the other, I want to know which line it is.

Perhaps it's better to do it all in memory, but I have searched for code to do that and not really found any :oops:
Marc56US wrote:Even with the debugger disabled?
Unfortunately yes :|
The happiness is a road...
Not a destination
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Load huge file directly to MAP

Post by Kwai chang caine »

@nsstudios
Your REGEX solution is a really quick way, thanks a lot :shock: 8)
Unfortunately, when I try to copy the array created by the REGEX into a MAP, it again takes a very, very long time :|

Code: Select all

Global NewMap MapElements()

HeureDebut = ElapsedMilliseconds() 
CreateRegularExpression(0, "^.*$", #PB_RegularExpression_AnyNewLine|#PB_RegularExpression_MultiLine)

Fichier$ = "c:\MyHugeTxtFile.txt"

Canal = ReadFile(#PB_Any, Fichier$, #PB_UTF8)
     
If Canal
  
 TailleFichier = Lof(Canal)
 *Ptr = AllocateMemory(TailleFichier)
 ReadData(Canal, *Ptr, TailleFichier)
 Donnee$ = PeekS(*Ptr, TailleFichier, #PB_UTF8|#PB_ByteLength)
 FreeMemory(*Ptr)
 CloseFile(Canal)
 
Else

 MessageRequester("File error", "The file" + #CRLF$ + Fichier$ + #CRLF$ + "could not be opened.")
 
EndIf

Dim lines.s(0)
ExtractRegularExpression(0, Donnee$, lines())
MaxLines = ArraySize(lines())

For i = 0 To MaxLines
 MapElements(lines(i))
Next

SecondesPasser = (ElapsedMilliseconds() - HeureDebut) / 1000
JourHeureMnSec$ = Trim(RemoveString(RemoveString(RemoveString(RemoveString(Str(SecondesPasser / 86400) + FormatDate(" day(s) %hh hour(s) %ii minute(s) %ss second(s)", SecondesPasser % 86400), "0 day(s)"), "00 hour(s)"), "00 minute(s)"), "00 second(s)"))
MessageRequester("", JourHeureMnSec$)
The happiness is a road...
Not a destination
jassing
Addict
Addict
Posts: 1745
Joined: Wed Feb 17, 2010 12:00 am

Re: Load huge file directly to MAP

Post by jassing »

If you need to search a large text file and you want speed, you should use SQLite and FTS (the built-in SQLite should have FTS4 enabled).
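For illustration only, here is a sketch of what that could look like with PureBasic's database commands. The ":memory:" database, the table name `lines`, the file path, and the search word are all my own placeholders, and it assumes the bundled SQLite really has FTS4 compiled in, as jassing says:

```purebasic
; Sketch: put every line of the file into an SQLite FTS4 virtual
; table, then use MATCH for fast full-text lookups.
UseSQLiteDatabase()

db = OpenDatabase(#PB_Any, ":memory:", "", "")
If db
  DatabaseUpdate(db, "CREATE VIRTUAL TABLE lines USING fts4(content)")
  DatabaseUpdate(db, "BEGIN")                ; one transaction = fast inserts
  If ReadFile(0, "C:\MyHugeTxtFile.txt", #PB_UTF8)
    While Not Eof(0)
      SetDatabaseString(db, 0, ReadString(0)) ; bind the line as parameter 0
      DatabaseUpdate(db, "INSERT INTO lines(content) VALUES (?)")
    Wend
    CloseFile(0)
  EndIf
  DatabaseUpdate(db, "COMMIT")

  ; Look a word up; MATCH uses the full-text index
  If DatabaseQuery(db, "SELECT content FROM lines WHERE content MATCH 'NameFile'")
    While NextDatabaseRow(db)
      Debug GetDatabaseString(db, 0)
    Wend
    FinishDatabaseQuery(db)
  EndIf
  CloseDatabase(db)
EndIf
```

The single BEGIN/COMMIT transaction matters here: without it, SQLite syncs after every insert and loading millions of lines becomes very slow.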
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Load huge file directly to MAP

Post by Little John »

Marc56us wrote: Then, uses StringField() or anything else to split lines.

:wink:
Do NOT use StringField() in cases like this!
freak wrote: Using StringField() is [...] not the fastest way, as StringField always has to look at the string from the start with each call.
If you examine the string via pointers, you can get each field without ever looking at already processed parts.
So StringField() uses the Shlemiel the painter’s algorithm. :)
Joel on software wrote:Who is Shlemiel? He’s the guy in this joke:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”
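To make freak's pointer idea concrete, here is a sketch of a splitter that walks the string once with a pointer and extracts each line without ever rescanning from the start. The procedure name and the CR/LF handling are my own choices, not from the thread:

```purebasic
; Sketch: split Text$ into map keys by scanning once with a pointer,
; instead of calling StringField(), which restarts from the front.
Procedure SplitLinesToMap(Text$, Map Out())
  Protected *c.Character = @Text$
  Protected *start = *c
  While *c\c
    If *c\c = 13 Or *c\c = 10          ; CR or LF ends a line
      If *c > *start
        Out(PeekS(*start, (*c - *start) / SizeOf(Character)))
      EndIf
      If *c\c = 13 And PeekC(*c + SizeOf(Character)) = 10
        *c + SizeOf(Character)         ; skip the LF of a CRLF pair
      EndIf
      *start = *c + SizeOf(Character)
    EndIf
    *c + SizeOf(Character)
  Wend
  If *c > *start                       ; last line without a trailing newline
    Out(PeekS(*start, (*c - *start) / SizeOf(Character)))
  EndIf
EndProcedure
```

Each character is visited exactly once, so the run time is linear in the file size rather than quadratic like the StringField() loop.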
Demivec
Addict
Addict
Posts: 4085
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Load huge file directly to MAP

Post by Demivec »

Kwai chang caine wrote: Sun May 28, 2023 9:10 pm In fact, i want compare two huge txt file
If a line is not present, or different in one or the other file, i want know what is this line

Perhaps it's better to all do in the memory, but i have searched code for do that and not really found :oops:
How many lines are in your file? How many slots are in your Map?
The optional 'Slots' parameter defines how many slots the map will have to store its elements. The more slots it has, the faster it is to access an element, but the more memory it uses. It's a tradeoff depending on how many elements the map will ultimately contain and how fast the random access should be. The default value is 512. This parameter has no impact on the number of elements a map can contain.
The fewer slots you have, the more time it takes to assign something to the map.

Perhaps something like this will produce some improvement for you:

Code: Select all

Global NewMap MapElements(nbLines)
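For example, reusing the Donnee$ buffer from the earlier snippets, the line count is already available before the map is created. This is a sketch and assumes CRLF line endings, as in the thread's other code:

```purebasic
; Sketch: count the lines first, then create the map with that
; many slots so insertions hash into mostly-empty buckets.
nbLines = CountString(Donnee$, #CRLF$) + 1
Global NewMap MapElements(nbLines)
```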
mk-soft
Always Here
Always Here
Posts: 5333
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Load huge file directly to MAP

Post by mk-soft »

I do not know the structure of the text file.
But why a map and not a list?
What do you want to achieve with the map?

Maybe there is another, better way.
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Marc56us
Addict
Addict
Posts: 1477
Joined: Sat Feb 08, 2014 3:26 pm

Re: Load huge file directly to MAP

Post by Marc56us »

Hi KCC,
In fact, i want compare two huge txt file
If a line is not present, or different in one or the other file, i want know what is this line
If that's all you need, use the internal Windows command (FC.EXE) (or a special program like WinMerge).

Code: Select all

; compare 2 text files using FC.exe (Windows standard command)
; For linux: uses 'diff'

EnableExplicit

DisableDebugger

SetCurrentDirectory(GetTemporaryDirectory())

Define FileA$ = "A.txt"
Define FileB$ = "B.txt"

Define Start = ElapsedMilliseconds()
If Not RunProgram("cmd", "/c fc " + FileA$ + " " + FileB$ + " > diff.txt", "",
                  #PB_Program_Wait) ; the > redirection needs a shell, hence cmd /c
    MessageRequester("Error", "Error running FC.exe")
    End
EndIf

MessageRequester("", "Create diff: " + Str(ElapsedMilliseconds() - Start) + " ms")

RunProgram("Diff.txt")

End
1.6 sec on my old PC

:wink:
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Load huge file directly to MAP

Post by Kwai chang caine »

Waouuuh !!! :shock:
First... thanks to everyone for your answers, that's the PB family 8)

@JASSING
I didn't know about this "Fast Text Search" added to SQLite :shock:
If I have the choice, I'd prefer not to use SQLITE for this project :wink:
It's really a very simple function I need: is the line in the other TXT or not?
And I don't feel really comfortable with SQL :oops:
But thanks for your tip, it's still a good solution to remember 8)

@Little John
I had learned from an old project that StringField() was not fast, which is why I expected it to drag.
But the worst part is that I tried without it, in a simple loop, and it still drags anyway :lol:

@DEMIVEC
I currently have 1823952 lines, but that increases every day.
The two TXT files are the lists of the files on a 4 TB external HD, but one day it could be more; I have up to 16 TB on other HDs.
For the moment, the PB recursive enumeration takes 41 minutes 53 seconds to scan a whole disk.
That time is fine for me 8) , but the problem is analysing the two files afterwards :lol:
Global NewMap MapElements(nbLines)
I didn't know it was possible to set the size of the MAP :shock:
Thanks for this tip 8)

@MK-SOFT
Mk-Soft wrote:I do not know the structure of the text file.
The structure of the text file is simple:
NumberLine|FullPath\NameFile.Extension|SizeFile|DateCreated|DateAccessed|DateModified
NumberLine|NameFile.Extension|SizeFile|DateCreated|DateAccessed|DateModified
.....
NumberLine|FullPath\NameFile.Extension|SizeFile|DateCreated|DateAccessed|DateModified
NumberLine|NameFile.Extension|SizeFile|DateCreated|DateAccessed|DateModified

I write the FullPath only when it changes, to reduce the size of the file :idea:
Mk-Soft wrote:But why a map and not a list?
What do you want to achieve with the map?
At the beginning, I believed it was simple :oops:
1/ I create my two huge files (Master.txt/Slave.txt), the enumeration of the 4 TB hard drive: ~41 minutes 53 seconds
2/ I load the "Slave" file into a MAP
3/ I read the Master file line by line and ask whether the line exists in the slave MAP
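That last step could be sketched like this. The paths are placeholders, and it assumes the Slave() map has already been filled from Slave.txt, for example with one of the loaders earlier in the thread:

```purebasic
; Sketch of step 3: report every line of Master.txt that is
; missing from the slave map.
Global NewMap Slave()
; ... fill Slave() from Slave.txt here ...

If ReadFile(0, "C:\Master.txt", #PB_UTF8)
  While Not Eof(0)
    Ligne$ = ReadString(0)
    If Not FindMapElement(Slave(), Ligne$)   ; hash lookup, no rescans
      Debug "Not on the slave disk: " + Ligne$
    EndIf
  Wend
  CloseFile(0)
EndIf
```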
Mk-Soft wrote:Maybe there is another better way
Surely :oops:
And I'm even more sure of it after the first try, and the whole day I spent circling my machine waiting for the end of the loop :lol:

@Marc56US
Where do you find all your tips????? :shock:
I didn't know this internal DOS tool existed, after more than 30 years of using WINDOWS :oops:
At the beginning, I said to myself: "This is the solution !!" when I saw the speed of the answer :shock:
But I was brought back down to earth when I read the "Diff.txt" :cry:
FC wrote:Comparing files C:\MyHgeFileMaster.txt and C:\MyHugeFileSlave.TXT
Resynchronization failed. Files are too different.
Thanks anyway, that could have been the right solution :wink:

I have tried several specialized programs for doing that (BeyondCompare, etc.)
But there is always a day when the software has problems with the huge size of the 4 TB, and that often crashes my machine because an enormous amount of memory is used.
There are also several other problems, and I can't do anything other than wait, and wait again, behind my screen.
And worse... sometimes, after waiting a full day, the program is locked up and I don't get a real explanation.

Then I said to myself... PB can surely do something for me; after all, it's just a line-by-line comparison, no? :twisted:

Then I said to myself: first, I create the 2 files, so that I always have the 2 enumeration lists of the files on my HDs, even if something crashes.
And I "just" need to compare these two monstrous files, with all the methods I can find with PB :idea:

It looked simple "on paper" :mrgreen:
The happiness is a road...
Not a destination