ReadString() performance best practice

Oso
Enthusiast
Posts: 595
Joined: Wed Jul 20, 2022 10:09 am


Post by Oso »

I'm reading an input file that could be any size. There is no 'expected' size, as the input file is indeterminate, so I read the file serially.

Would it be accepted practice to ReadString() a single byte and then process it? I used single-byte to test it first, but after adding a buffered version, I found that there is no variation in speed between the two. The performance is great, either way, so long as I don't display anything with PrintN(), otherwise it's very slow to display into the console. Is there a better practice I should use, or is this fine?

Code: Select all

FileSeek(0, 0)
While Not(Eof(0))  
  inpchr.s = ReadString(0, #PB_Ascii ,1) ; Read a single character
  Select inpchr.s ; Process the single character
    Case "#"
      ; Do something
    Default
      ; Do something
  EndSelect
Wend
or...

Code: Select all

FileSeek(0, 0)
While Not(Eof(0))  
  inbuffer.s = ReadString(0, #PB_Ascii, 100) ; Read 100 character buffer
  inbuflen.i = Len(inbuffer.s)
  For inbufpos.i = 1 To inbuflen.i ; Go through each character in the buffer
    inpchr.s = Mid(inbuffer.s, inbufpos.i ,1) ; Read a single character from the buffer
    Select inpchr.s ; Process the single character
      Case "#"
        ; Do something
      Default
        ; Do something
    EndSelect
  Next inbufpos.i
Wend
NicTheQuick
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Post by NicTheQuick »

You could also use ReadAscii() instead. Unlike ReadString(), it also returns line-ending characters.

What exactly are you reading in? Is it binary data or human-readable text?
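
A minimal sketch of the ReadAscii() suggestion, assuming the file is opened with ReadFile() as file number 0. The file name "input.dat" is a placeholder and not something from the original post.

```purebasic
; Sketch: read a file one unsigned byte at a time with ReadAscii().
; "input.dat" is a placeholder file name (assumption).
If ReadFile(0, "input.dat")
  While Not Eof(0)
    byte.a = ReadAscii(0)   ; 0..255, CR (13) and LF (10) are returned like any other byte
    Select byte
      Case '#'
        ; handle the marker byte
      Default
        ; handle any other byte
    EndSelect
  Wend
  CloseFile(0)
EndIf
```

Because the value is an unsigned byte, it can be compared directly against character literals such as '#' without any string handling.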
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.

Post by Oso »

NicTheQuick wrote: Mon Aug 15, 2022 2:10 pm You could also use ReadAscii() instead. Unlike ReadString(), it also returns line-ending characters.
What exactly are you reading in? Is it binary data or human-readable text?
Thanks for the reply. It's 8-bit binary data. Initially I omitted the #PB_Ascii flag, but I found that when my other routine wrote any value above decimal 127, each one consumed two bytes in the file. Adding #PB_Ascii resolved that perfectly. I don't know if I should be using something else. I just noticed there are also ReadByte() and WriteByte(), which I hadn't seen at the time.
freak
PureBasic Team
Posts: 5929
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany


Post by freak »

Oso wrote: Mon Aug 15, 2022 1:28 pm Would it be accepted practice to ReadString() a single byte and then process it? I used single-byte to test it first, but after adding a buffered version, I found that there is no variation in speed between the two.
That is because the file commands have buffering built in already. So you can use whatever way you like and not have to worry about this. You could experiment with the FileBuffersSize() command to see if a larger buffer makes a speed difference for you (default is 4kb).

https://www.purebasic.com/documentation ... ssize.html
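
For anyone who wants to try that experiment, here is a rough sketch. The 64 KB buffer size and the file name "input.dat" are arbitrary values picked for illustration, and the timing is only a coarse indication (the debugger itself slows the loop down).

```purebasic
; Sketch: time a byte-wise read after enlarging the file buffer.
; "input.dat" and the 64 KB size are illustrative assumptions.
If ReadFile(0, "input.dat")
  FileBuffersSize(0, 65536)          ; enlarge the buffer for file 0
  start.q = ElapsedMilliseconds()
  count.q = 0
  While Not Eof(0)
    ReadAscii(0)                     ; read and discard one byte
    count + 1
  Wend
  Debug Str(count) + " bytes in " + Str(ElapsedMilliseconds() - start) + " ms"
  CloseFile(0)
EndIf
```

Running it once with and once without the FileBuffersSize() line gives a simple before/after comparison on the same file.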
quidquid Latine dictum sit altum videtur

Post by Oso »

freak wrote: Mon Aug 15, 2022 3:11 pm That is because the file commands have buffering built in already. So you can use whatever way you like and not have to worry about this. You could experiment with the FileBuffersSize() command to see if a larger buffer makes a speed difference for you (default is 4kb).
https://www.purebasic.com/documentation ... ssize.html
Thanks for that, I understand what you mean and yes, that explains why I get the same performance.

To be honest, the speed of writing test data and then reading it back is so fast, it's difficult to make a comparison. I increased my number of test records from 1,200 to 120,000 and even then, it completed in about 1 second.

Post by NicTheQuick »

Oso wrote: Mon Aug 15, 2022 2:24 pm
NicTheQuick wrote: Mon Aug 15, 2022 2:10 pm You could also use ReadAscii() instead. Unlike ReadString(), it also returns line-ending characters.
What exactly are you reading in? Is it binary data or human-readable text?
Thanks for the reply. It's 8-bit binary data. Initially I omitted the #PB_Ascii flag, but I found that when my other routine wrote any value above decimal 127, each one consumed two bytes in the file. Adding #PB_Ascii resolved that perfectly. I don't know if I should be using something else. I just noticed there are also ReadByte() and WriteByte(), which I hadn't seen at the time.
If you are working with binary data, you shouldn't use anything that involves strings. ReadString() normally tries to read UTF-8 data, which means that a single character can take up to 4 bytes, each with a value above 127. Also, if there is a null byte, ReadString() will return an empty string.
If you want to read raw data, use the other commands such as ReadAscii(), ReadByte(), ReadWord(), ... and ReadData() if you want to read a block of bytes into memory at once.
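
A minimal sketch of the ReadData() approach, reading the file in fixed-size blocks. The chunk size and the file name "input.dat" are assumptions made for the example.

```purebasic
; Sketch: read a file in fixed-size chunks with ReadData().
; #ChunkSize and "input.dat" are illustrative assumptions.
#ChunkSize = 4096

*buffer = AllocateMemory(#ChunkSize)
If *buffer
  If ReadFile(0, "input.dat")
    While Not Eof(0)
      bytesRead = ReadData(0, *buffer, #ChunkSize)   ; may be less than #ChunkSize at the end
      If bytesRead <= 0 : Break : EndIf              ; stop on a read error
      For i = 0 To bytesRead - 1
        value.a = PeekA(*buffer + i)                 ; unsigned byte, 0..255
        ; process value here
      Next
    Wend
    CloseFile(0)
  EndIf
  FreeMemory(*buffer)
EndIf
```

No string functions are involved, so null bytes and values above 127 pass through unchanged.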

Post by Oso »

NicTheQuick wrote: Mon Aug 15, 2022 6:33 pm If you are working with binary data, you shouldn't use anything that involves strings. ReadString() normally tries to read UTF-8 data, which means that a single character can take up to 4 bytes, each with a value above 127. Also, if there is a null byte, ReadString() will return an empty string.
If you want to read raw data, use the other commands such as ReadAscii(), ReadByte(), ReadWord(), ... and ReadData() if you want to read a block of bytes into memory at once.
I've just tried reading my own data back using ReadByte(). The documentation shows a variable of type .b for ReadByte(). For the conventional ASCII characters up to 127, I get their ASCII value, but for anything above 127, I can't understand the values I'm seeing.

For example, my data uses byte value 254, which has a special meaning. It comes back as -2 in variable.b = ReadByte(). Similarly, 255 comes back as -1. I realise the .b datatype ranges from -128 to +127, but I don't understand the values returned. I'm guessing this is because the sign bit is set for anything above 127. Is that correct?

Strangely, I changed the variable type to .a and it works (code as below), but I can't understand why the example shows a .b

Code: Select all

Define inputchr.a
FileSeek(0, 0)
While Not(Eof(0)) 
  inputchr.a = ReadByte(0)
  output.s = Chr(inputchr.a)
  PrintN(output.s + " : " + Str(inputchr.a))
Wend
Incidentally, I probably ought to clarify further here, exactly what data I'm storing. It isn't binary in the same sense as say a JPEG or such type, it's mostly ASCII data consisting of user-entered records, but it *can* potentially include anything from 0x00 to 0xFF. It doesn't contain terminated text lines in the usual ASCII sense.

Post by NicTheQuick »

I usually simply use ReadAscii() if I don't want to handle the negative value ReadByte() returns.
With this short code you can simulate what ReadByte() does to your values.

Code: Select all

value.b = 254
Debug value
Debug value & $FF ;use this to make it positive
If you want to learn more about this, the following should help: Wikipedia - Two's complement

Post by Oso »

NicTheQuick wrote: Mon Aug 15, 2022 7:17 pm With this short code you can simulate what ReadByte() does to your values.

Code: Select all

value.b = 254
Debug value
Debug value & $FF ;use this to make it positive
Thanks, I've got it. I guessed it was the sign bit. Based on your example, I can do the following to recover the unsigned value...

Code: Select all

value.b = -2
Debug value & $FF ; 254
value.b = -1
Debug value & $FF ; 255
infratec
Always Here
Posts: 6818
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany


Post by infratec »

If the file is no larger than 10 MB, read it completely into memory, then process the bytes in the RAM buffer via a pointer.

Post by infratec »

Code: Select all

EnableExplicit

Define.i File
Define *File, *Ptr.Ascii, *FileEnd
Define Filename$


Filename$ = OpenFileRequester("Choose a file", "", "All|*.*", 0)
If Filename$
  File = ReadFile(#PB_Any, Filename$)
  If File
    *File = AllocateMemory(Lof(File), #PB_Memory_NoClear)
    If *File
      If ReadData(File, *File, MemorySize(*File)) = MemorySize(*File)
        *Ptr = *File
        *FileEnd = *File + MemorySize(*File)
        While *Ptr <= *FileEnd
          ;Debug RSet(Hex(*Ptr\a), 2, "0")
          If *Ptr\a = '#'
            Debug "Hashtag found at pos: " + Str(*Ptr - *File)
          EndIf
          *Ptr + 1
        Wend
      EndIf
      FreeMemory(*File)
    EndIf
    CloseFile(File)
  EndIf
EndIf

Post by Oso »

infratec wrote: Mon Aug 15, 2022 7:38 pm If the file is no larger than 10 MB, read it completely into memory, then process the bytes in the RAM buffer via a pointer.
Thanks for the example. It makes the use of pointers very clear. The data I have to deal with is outside my control, so in this case I need to work in smaller chunks.

By the way, should the FileEnd below be adjusted by -1 (or change the 'While' condition to be less than)?

Code: Select all

        *FileEnd = *File + MemorySize(*File)
        While *Ptr <= *FileEnd
... in other words...

Code: Select all

*FileEnd = *File + MemorySize(*File) - 1

Post by infratec »

You are right:

Code: Select all

*FileEnd = *File + MemorySize(*File) - 1
If you have enough RAM, that's no problem.
I have already read 1 GB files into RAM without problems.

Otherwise you have to do chunk management, which is not trivial if leftover bytes need to be copied to the beginning of the buffer.
If I remember correctly, I have already posted such an example.
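
This is not the earlier example mentioned above, just a rough sketch of what such chunk management could look like when searching for a multi-byte key that may straddle two chunks. The key "KEY#", the chunk size and the file name "input.dat" are all made up for illustration.

```purebasic
; Sketch: chunked search for a multi-byte key that may straddle two chunks.
; #ChunkSize, "input.dat" and the key "KEY#" are illustrative assumptions.
EnableExplicit

#ChunkSize = 65536

Define key$ = "KEY#"
Define keyLen = Len(key$)
Define *key = Ascii(key$)                          ; key as single-byte characters
Define *buf = AllocateMemory(#ChunkSize + keyLen)  ; room for one chunk plus the carried bytes
Define carry = 0                                   ; bytes kept from the previous chunk
Define base.q = 0                                  ; file offset of the first byte in *buf
Define got, total, i
Define found.q = -1

If *key And *buf
  If ReadFile(0, "input.dat")
    While (found = -1) And (Not Eof(0))
      got = ReadData(0, *buf + carry, #ChunkSize)
      If got <= 0 : Break : EndIf
      total = carry + got
      For i = 0 To total - keyLen
        If CompareMemory(*buf + i, *key, keyLen)
          found = base + i                         ; absolute file offset of the match
          Break
        EndIf
      Next
      ; Keep the last keyLen-1 bytes: a match could start there and end in the next chunk
      carry = keyLen - 1
      If carry > total : carry = total : EndIf
      If carry > 0
        MoveMemory(*buf + total - carry, *buf, carry)
      EndIf
      base + total - carry
    Wend
    CloseFile(0)
  EndIf
EndIf

If *key : FreeMemory(*key) : EndIf
If *buf : FreeMemory(*buf) : EndIf
Debug found                                        ; -1 if the key was not found
```

The carried tail is always shorter than the key, so a match can never be reported twice.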
jacdelad
Addict
Posts: 1436
Joined: Wed Feb 03, 2021 12:46 pm
Location: Planet Riesa

Post by jacdelad »

Keep in mind that if you don't need to read the full file, you probably shouldn't. It's not clear to me whether you need to process the whole file.
PureBasic 6.04/XProfan X4a/Embarcadero RAD Studio 11/Perl 5.2/Python 3.10
Windows 11/Ryzen 5800X/32GB RAM/Radeon 7770 OC/3TB SSD/11TB HDD
Synology DS1821+/36GB RAM/130TB
Synology DS920+/20GB RAM/54TB
Synology DS916+ii/8GB RAM/12TB

Post by Oso »

jacdelad wrote: Mon Aug 15, 2022 11:34 pm Keep in mind that if you don't need to read the full file, you probably shouldn't. It's not clear to me whether you need to process the whole file.
Yes, understood. The code that I've written looks for an identifying key in the file, but once it has found it, then the process is complete and it doesn't need to look any further.
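
As a simplified sketch of that early exit, assuming the key is a single byte: the value 254 is borrowed from the special byte mentioned earlier in the thread and "input.dat" is a placeholder, so neither is the actual key or file.

```purebasic
; Sketch: stop reading as soon as the key byte is found.
; The key value 254 and "input.dat" are illustrative assumptions.
If ReadFile(0, "input.dat")
  position.q = -1
  While Not Eof(0)
    If ReadAscii(0) = 254
      position = Loc(0) - 1     ; offset of the byte that was just read
      Break                     ; nothing after the key needs to be read
    EndIf
  Wend
  CloseFile(0)
  Debug position                ; -1 if the key was not found
EndIf
```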