Reading huge files

olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Reading huge files

Post by olmak »

Hi all! I have the following task: read very large files (up to 10 GB) for further processing.
One array element is created for each line of the file.
After reading the forums I put together this fragment:

Code:

Procedure ReadFileIntoArray(file$, Array StringArray.s(1), Separator.s = " ")
  Protected String.s ; the string into which the whole file is copied from memory
  Protected S.String, *S.Integer = @S
  Protected.i countFileString, i, pos_separ, slen
  Protected file_handler, pointMemForReadFile, numberBytesReadingFromFile
  Protected lengthFile.q

  file_handler = ReadFile(#PB_Any, file$)
  If file_handler
    ReadStringFormat(file_handler)                          ; skip the BOM, if any
    lengthFile = Lof(file_handler) - Loc(file_handler) - 2  ; remaining bytes, minus the trailing CRLF
    If lengthFile > 0
      pointMemForReadFile = AllocateMemory(lengthFile)
      If pointMemForReadFile
        numberBytesReadingFromFile = ReadData(file_handler, pointMemForReadFile, lengthFile)
        String = PeekS(pointMemForReadFile, MemorySize(pointMemForReadFile), #PB_UTF8) ; whole buffer as one string
        countFileString = CountString(String, Separator)
        slen = Len(Separator)
        ReDim StringArray(countFileString)
        *S\i = @String                           ; let S\s point into the big string
        While i < countFileString
          pos_separ = FindString(S\s, Separator) ; find the next separator
          StringArray(i) = PeekS(*S\i, pos_separ - 1)
          *S\i + (pos_separ + slen - 1) << #PB_Compiler_Unicode ; advance past the separator
          i + 1
        Wend
        StringArray(i) = S\s                     ; last line, no trailing separator
        *S\i = 0                                 ; detach the pointer before S goes out of scope
        FreeMemory(pointMemForReadFile)
        String = ""
      Else
        Debug "Memory allocation error"
      EndIf
    EndIf
    CloseFile(file_handler)
  EndIf
EndProcedure

Dim LogString$(0) ; array in which each line of the file is stored
ReadFileIntoArray("e:\0YP\Purebasic\LogAnalyzer\Log\test.log", LogString$(), Chr(10))
CountLogString = ArraySize(LogString$()) ; number of lines in the read file
Debug LogString$(0)              ; print the first line of the file
Debug LogString$(CountLogString) ; print the last line of the file
The problem is that once the file size reaches somewhere around 242 MB or more, the program stops with an error:

[14:42:52] [ERROR] Invalid memory access. (write error at address 0)

The error mostly occurs on the line: String = PeekS(pointMemForReadFile, MemorySize(pointMemForReadFile), #PB_UTF8)
Sometimes on the line: StringArray(i) = S\s
The amount of available memory, measured at run time with MemoryStatus(#PB_System_FreePhysical), is about 10 GB.
I have no experience or deep knowledge of working with memory, and plain line-by-line reading of files is far too slow. I need constructive advice on how best to solve this.
skywalk
Addict
Posts: 4003
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Reading huge files

Post by skywalk »

pos_separ = FindString(S\s, Separator) ;separator "Separator"
You should post working code.
As you found, this approach is not feasible for very large files.
You could loop through "block sizes" of your choosing.
Search the forum, this has been done sooooo many times.
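
A minimal sketch of that block-by-block idea, assuming UTF-8 input, Chr(10) as the separator and a 64 MB block size (all placeholder choices, not skywalk's code); the key detail is carrying the incomplete tail line over into the next block:

Code:

; Hedged sketch: count lines block by block instead of loading the whole file.
; Caveat: a multi-byte UTF-8 character split across two blocks would need
; extra care; this sketch ignores that case.
#BlockSize = 67108864 ; 64 MB per ReadData() call (placeholder)

Procedure.q CountLinesInBlocks(file$)
  Protected file_handler, *buffer, bytesRead
  Protected carry.s, chunk.s, pos, start
  Protected lines.q
  file_handler = ReadFile(#PB_Any, file$)
  If file_handler = 0
    ProcedureReturn -1
  EndIf
  *buffer = AllocateMemory(#BlockSize)
  While Not Eof(file_handler)
    bytesRead = ReadData(file_handler, *buffer, #BlockSize)
    chunk = carry + PeekS(*buffer, bytesRead, #PB_UTF8) ; prepend last block's leftover
    start = 1
    Repeat
      pos = FindString(chunk, Chr(10), start)
      If pos
        ; Mid(chunk, start, pos - start) would be one complete line here
        lines + 1
        start = pos + 1
      EndIf
    Until pos = 0
    carry = Mid(chunk, start) ; incomplete tail, completed by the next block
  Wend
  If carry <> ""
    lines + 1 ; final line without a trailing separator
  EndIf
  FreeMemory(*buffer)
  CloseFile(file_handler)
  ProcedureReturn lines
EndProcedure

Debug CountLinesInBlocks("test.log") ; placeholder path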
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Sorry for the typo; I removed the comments and some leftover text remained. As I understand it, you advise processing the data in batches. I thought about that too, but I hoped there might be some more elegant solution. In my case, the size of the String variable into which I copy the entire file from memory is the critical factor. This is strange, because the PureBasic manual says that the String type has no size restrictions. Anyway, thanks for the answer.
Kiffi
Addict
Posts: 1362
Joined: Tue Mar 02, 2004 1:20 pm
Location: Amphibios 9

Re: Reading huge files

Post by Kiffi »

@olmak: Let me get this straight. First you load the large file (up to 10 GB) into memory, then you copy the memory contents into a string, and then from the string into a string array? How much memory does your computer have?
Hygge
olmak
User
Posts: 14
Joined: Thu Aug 11, 2016 4:00 am

Re: Reading huge files

Post by olmak »

Kiffi wrote:@olmak: Let me get this straight. First you load the large file (up to 10 GB) into memory, then you copy the memory contents into a string, and then from the string into a string array? How much memory does your computer have?
Yes, exactly. The computer has 16 GB of RAM. And right now I'm talking about files of at least 5 GB; 10 GB is for the future.
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

That's strange code.
It will also take far too much time!
I think the approach itself is wrong.
It's not going to work with strings like that.

Best Regards Saki
地球上の平和
Marc56us
Addict
Posts: 1479
Joined: Sat Feb 08, 2014 3:26 pm

Re: Reading huge files

Post by Marc56us »

Yes, strings in PB have no size limit; they are null-terminated like in C (see "null-terminated string").

If the file is a text file, try a normal ReadString with #PB_File_IgnoreEOL:

Code:

If Not OpenFile(0, GetTemporaryDirectory() + "File_10_MB.txt") ; 10 MB test file
  Debug "File not found"
  End
EndIf

Start = ElapsedMilliseconds()
Debug "Reading..."
While Not Eof(0)
  Txt$ = ReadString(0, #PB_Ascii | #PB_File_IgnoreEOL)
Wend
CloseFile(0)
Debug "Done."
Debug FormatNumber((ElapsedMilliseconds() - Start) / 1000, 2) + " secs"
Debug "Len string Txt$ : " + FormatNumber(Len(Txt$), 0)
On my i7 @3.2 GHz - Windows 10x64 - SSD

Code:

Reading...
Done.
5.09 secs
Len string Txt$ : 10,577,903
:wink:
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

FindString has to walk the string to find its end, and it does this again and again on every call, so the search gets slower as the string grows.

And 5 seconds for 10 MB is much too slow.
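
For what it's worth, FindString takes an optional start position, so a scan can resume where the last match ended instead of restarting at position 1; a tiny illustration (the sample string is made up):

Code:

; Advance a start offset through one string instead of re-searching
; (or re-cutting) it from the beginning on every iteration.
Define Haystack$ = "one" + Chr(10) + "two" + Chr(10) + "three"
Define pos, start = 1
Repeat
  pos = FindString(Haystack$, Chr(10), start) ; resume where we left off
  If pos
    Debug Mid(Haystack$, start, pos - start)  ; one complete line
    start = pos + 1
  EndIf
Until pos = 0
Debug Mid(Haystack$, start) ; the last line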
Last edited by Saki on Tue Jun 23, 2020 9:05 pm, edited 1 time in total.
地球上の平和
mk-soft
Always Here
Posts: 5409
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Reading huge files

Post by mk-soft »

10 GB text file -> 20 GB of Unicode RAM (PB strings use two bytes per character).
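
The doubling comes from PB's internal two-byte-per-character Unicode strings; a quick way to see it (trivial ASCII sample assumed):

Code:

Define s$ = "hello"
Debug StringByteLength(s$, #PB_UTF8)    ; 5 bytes as UTF-8 on disk
Debug StringByteLength(s$, #PB_Unicode) ; 10 bytes as UTF-16 in RAM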
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

You must read the file as binary, not as a string!
Then you must also search for your separators in the binary data, not with string functions!
Put your data sets into a list on demand :wink:
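
A hedged sketch of that binary scan, assuming the raw file data is already in a buffer and byte 10 (line feed) is the separator; each record is only converted to a string when it is stored:

Code:

; Scan the raw bytes for the separator with a pointer; no giant string needed.
Procedure ScanBufferIntoList(*buffer, size.q, List Records.s())
  Protected *byte.Ascii = *buffer  ; walks the buffer byte by byte
  Protected *start = *buffer       ; start of the current record
  Protected *bufferEnd = *buffer + size
  While *byte < *bufferEnd
    If *byte\a = 10 ; separator byte found in the raw data
      AddElement(Records())
      Records() = PeekS(*start, *byte - *start, #PB_UTF8)
      *start = *byte + 1
    EndIf
    *byte + 1
  Wend
  If *start < *bufferEnd ; trailing record without a final separator
    AddElement(Records())
    Records() = PeekS(*start, *bufferEnd - *start, #PB_UTF8)
  EndIf
EndProcedure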

But honestly, a 10 GB text file seems a bit strange to me.
地球上の平和
Marc56us
Addict
Posts: 1479
Joined: Sat Feb 08, 2014 3:26 pm

Re: Reading huge files

Post by Marc56us »

Saki wrote:But honestly, a 10gb text file seems a bit strange to me.
Log files and database dumps are often much larger than that. Server logging systems rotate files daily or by size, but the files are still large.

Big text files are the basic "food" of system administrators.
Most often this type of file is processed as a stream, with specialized tools (Perl, grep, AWK, etc.; yes, these tools also exist under Windows).
But sometimes it is necessary to edit one in its entirety, even if tools such as grep make it possible to display any number of lines before and after the searched text.

We've been doing this for years, even on machines with less RAM than the file size (some text editors, or the system, swap blocks of lines in and out of RAM).

In PB, one can use Scintilla, which has a much greater capacity and is faster than the EditorGadget.

:wink:
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

Hi,
OK, in principle it works, but PB's string handling has to be taken into account.
PB's string handling also does not automatically release strings, so a lot of RAM is quickly lost.
The way it is done now, with the strings, is absolutely not workable.

The EditorGadget should not be a problem; 20 to 30 MB can be handled easily.

Scintilla, yes, but I don't think it's that fast.

There is code in the forum that reads large CSV database files quickly.
If you search a little you will find a lot, and you don't have to write it yourself.

It seems to be a special module that was needed for Andre's GeoWorld V2.

Since Andre writes that it works fine, this should solve the import problem.

http://forums.purebasic.com/english/vie ... 12&t=70684
地球上の平和
BarryG
Addict
Posts: 3330
Joined: Thu Apr 18, 2019 8:17 am

Re: Reading huge files

Post by BarryG »

Saki wrote:PB string handling does not automatically release strings, so a lot of Ram is quickly lost.
This was fixed with the 5.72 release -> viewtopic.php?p=518399#p518399
Saki
Addict
Posts: 830
Joined: Sun Apr 05, 2020 11:28 am
Location: Pandora

Re: Reading huge files

Post by Saki »

That seems to work better now.
But it also seems that about 2 GB are still being eaten.
That is not exactly little.

Just a little heads-up:
The maximum possible string length is about 1e9 characters * 2 bytes :wink:
So you cannot load a string much larger than about 1 GB.
You can hardly work with one that big anyway; from 50 MB upwards it is no fun anymore.
地球上の平和
NicTheQuick
Addict
Posts: 1227
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken
Contact:

Re: Reading huge files

Post by NicTheQuick »

What exactly do you want to achieve?
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.