ToUTF8/ToASCII

Post by Lunasole »

Just another small thing. I think I already posted it somewhere, but here is the "modern version" :)
(It is not very optimal and could probably be improved, but it's not meant for performance.)
Also, THIS IS TRICKY STUFF AND REALLY "NOT A GOOD PRACTICE", so there is no guarantee: use it at your own risk (and fix it if needed).

Code: Select all


; ASCII <> Unicode converter, without using additional buffers
; The ASCII/UTF8 string is packed inside PB UTF16 string here
; (alternative of using PB Ascii()/UTF8() functions)
; v 1.0.0.2
;	2016-2023				Luna Sole

; str$	:	PB unicode string
; RETURN:	UTF8 string packed into PB unicode string
Procedure$ ToUTF8(str$)
    Protected res$ = Space(StringByteLength(str$, #PB_UTF8))
    PokeS(@res$, str$, -1, #PB_UTF8)
    ProcedureReturn res$
EndProcedure
; str$	:	PB unicode string
; RETURN:	ASCII string packed into PB unicode string
Procedure$ ToASCII(str$)
    Protected nE = PokeS(@str$+1, str$, -1, #PB_Ascii)
    MoveMemory(@str$+1, @str$, nE)
    PokeU(@str$ + nE, 0)
    ProcedureReturn str$
EndProcedure
; ;; +variant
; Procedure$ ToASCII(str$)
;     PokeC(@str$+PokeS(@str$+1, str$, -1, #PB_Ascii)+1, 0)
;     ProcedureReturn PeekS(@str$+1, -1, #PB_Unicode)
; EndProcedure

; back to unicode
Procedure$ FromUTF8(str$)
    ProcedureReturn PeekS(@str$, -1, #PB_UTF8)
EndProcedure
; back to unicode
Procedure$ FromASCII(str$)
    ProcedureReturn PeekS(@str$, -1, #PB_Ascii)
EndProcedure

CompilerIf #PB_Compiler_IsMainFile
    Define T$ = "123а'ї"
    Define T2$ = ToASCII(T$)
    Debug T$
    Debug T2$
    Debug FromASCII(T2$)
CompilerEndIf
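
For comparison, here is a minimal sketch of the conventional route that the header comment mentions, using the built-in Ascii() and UTF8() functions. Each call allocates a separate null-terminated buffer, which the caller has to release with FreeMemory(). The variable names are made up for this example, and the text is built with Chr() only so that it does not depend on the source file encoding.

```purebasic
; Sketch only: convert into dedicated buffers instead of packing into the string itself
Define src$ = "123" + Chr($430) + "'" + Chr($457)   ; same text as T$ above

Define *ascii = Ascii(src$)   ; null-terminated ASCII copy
Define *utf8  = UTF8(src$)    ; null-terminated UTF-8 copy

If *ascii And *utf8
  Debug PeekS(*ascii, -1, #PB_Ascii)   ; back to a normal PB string
  Debug PeekS(*utf8, -1, #PB_UTF8)
EndIf

; Buffers returned by Ascii()/UTF8() must be freed by the caller
If *ascii : FreeMemory(*ascii) : EndIf
If *utf8  : FreeMemory(*utf8)  : EndIf
```

The trade-off against the packed variant is one extra allocation per conversion, in exchange for the PB string itself staying valid UTF-16.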
Last edited by Lunasole on Mon Mar 20, 2023 3:05 pm, edited 1 time in total.
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"

Post by Lunasole »

:mrgreen:

Code: Select all

;; +variant
Procedure$ ToASCII(str$)
    PokeC(@str$+PokeS(@str$+1, str$, -1, #PB_Ascii)+1, 0)
    ProcedureReturn PeekS(@str$+1, -1, #PB_Unicode)
EndProcedure

Post by STARGÅTE »

I don't understand how you got the idea of writing to and reading from the memory of a given string.

First of all, you cause a memory overflow with PokeS(..., #PB_UTF8) if the required length in UTF-8 is longer than the internal string representation (2 bytes per character). This happens for character codes larger than 2047: UTF-8 then needs 3 bytes per character, and PokeS writes past the end of the memory at @str$. --> IMA (invalid memory access)
Secondly, why do you write at @str$+1 and then move the memory back by 1 byte?
And finally, FromUTF8() and FromASCII() can't work the way you wrote them.
Strings are always in UTF-16. If you read this memory with PeekS() as UTF-8 or ASCII, you just get nonsense.
I know you wrote "The ASCII/UTF8 string is packed inside PB UTF16 string here"; however, this is very bad practice.

Here is my result of your code, which doesn't work:

Code: Select all

123а'ї
㈱㼳㼧
123?'?
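
The size argument above can be checked with a small sketch like the following. It only uses StringByteLength(), which reports the byte count without the terminating null; the procedure name ShowSizes is hypothetical and exists only for this illustration. By the definition of UTF-8, code points from U+0800 upwards take three bytes, while the same character takes two bytes in the internal UTF-16 form.

```purebasic
; Sketch only: byte counts (without the terminating null) per encoding
Procedure ShowSizes(code)
  Protected c$ = Chr(code)
  Debug "U+" + RSet(Hex(code), 4, "0") + ": UTF-16 = " + Str(StringByteLength(c$, #PB_Unicode)) + ", UTF-8 = " + Str(StringByteLength(c$, #PB_UTF8))
EndProcedure

ShowSizes($0041)   ; Latin "A"
ShowSizes($0457)   ; Cyrillic "ї", below 2048
ShowSizes($20AC)   ; Euro sign, above 2047
```

Any in-place packing therefore has to size its target by the UTF-8 byte length rather than by the original character count.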
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de

Post by AZJIO »

It is just like with pictures: compressed formats are designed for compact storage of files on a hard drive. When you draw a JPG in a program window, it uses the same memory as a bitmap, because each pixel must be drawn with a full RGB color. UTF-8 follows the same principle: when saved it is a compact notation, but when read it is converted to UTF-16. String functions work easily with this format, since the width of each character becomes the same. Once you decide to save the file, you need UTF-8 again for compactness, so you use the UTF8() function and save the result to a file. Scintilla also requires a pointer to memory holding UTF-8 data, so that is another case where the data needs to be converted to UTF-8.
Therefore, it is not clear to me what this kind of conversion to ASCII or UTF-8 is for. It does not make sense, since string functions will not work with the result.
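
The save step described above could look roughly like this. It is a sketch under assumptions: the file name is invented for the example, file number 0 is arbitrary, and error handling is reduced to the minimum.

```purebasic
; Sketch only: the file name is a made-up example
Define file$ = GetTemporaryDirectory() + "utf8_demo.txt"
Define text$ = "h" + Chr($20AC) + "llo_world"

Define *buf = UTF8(text$)   ; temporary UTF-8 copy of the UTF-16 string
If *buf
  If CreateFile(0, file$)
    ; write only the payload, not the terminating null byte
    WriteData(0, *buf, StringByteLength(text$, #PB_UTF8))
    CloseFile(0)
  EndIf
  FreeMemory(*buf)          ; the buffer from UTF8() must be freed
EndIf
```

The same *buf pointer is what one would hand to an API that expects UTF-8 data, as long as the buffer is freed only after that call has finished with it.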

Post by idle »

It doesn't make too much sense; even here there's a problem, as each character from Space() is 2 bytes wide:
Procedure.s ToUTF8(in.s)
  Protected *utf8, out.s, len
  *utf8 = UTF8(in)
  len = MemorySize(*utf8) >> 1               ; bytes -> UTF-16 characters, rounded down
  out.s = Space(len)
  CopyMemory(*utf8, @out, MemorySize(*utf8))
  FreeMemory(*utf8)
  ProcedureReturn out
EndProcedure

Global out.s
Global in.s = "h€llo_world"
out = ToUTF8(in) + "ok"
Debug PeekS(@out,-1,#PB_UTF8)
Debug out

in.s = "h€llo_world-"
out = ToUTF8(in) + "not ok"
Debug PeekS(@out,-1,#PB_UTF8)
Debug out
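
A rough way to see why one input survives the concatenation and the other does not is to look at the byte counts. The sketch below assumes that the buffer returned by UTF8() is exactly the UTF-8 payload plus one null byte, which is what the procedure above relies on; the helper name Report is hypothetical.

```purebasic
; Sketch only: reproduce the length arithmetic of the procedure above
Procedure Report(s$)
  Protected bytes = StringByteLength(s$, #PB_UTF8) + 1   ; UTF-8 payload plus its null byte
  Protected chars = bytes >> 1                            ; same rounding as in the procedure
  Debug Str(bytes) + " bytes packed into " + Str(chars) + " characters (" + Str(chars * 2) + " bytes)"
EndProcedure

Report("h" + Chr($20AC) + "llo_world")    ; even total: the UTF-8 null stays inside the character data
Report("h" + Chr($20AC) + "llo_world-")   ; odd total: the UTF-8 null lands on the string's own terminator
```

With an odd total, the appended text starts exactly where the UTF-8 null byte was, so a later PeekS(..., #PB_UTF8) appears to run on into the concatenated part.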

Post by Lunasole »

STARGÅTE wrote: Sun Mar 19, 2023 7:29 pm I don't understand how you got the idea of writing to and reading from the memory of a given string.
:mrgreen:
In fact it is a good idea: the memory is already allocated and is larger than the result needs.
STARGÅTE wrote: Secondly, why do you write at @str$+1 and then move the memory back by 1 byte?
This is a kind of hack, because PokeS won't write to the same address that it reads from, while a shifted address works fine.
There are other variants without such a move.
STARGÅTE wrote: And finally, FromUTF8() and FromASCII() can't work the way you wrote them.
Strings are always in UTF-16. If you read this memory with PeekS() as UTF-8 or ASCII, you just get nonsense.
I know you wrote "The ASCII/UTF8 string is packed inside PB UTF16 string here"; however, this is very bad practice.
Here you are wrong: look at the memory representation of those packed strings. They are no longer UTF-16 after the To* function.
But of course PB internals still treat them as UTF-16. That is not a problem as long as you don't pass such a packed string to any function which expects UTF-16. Freeing the memory of such packed strings also works fine (though until it is freed, leftover data may remain in the area occupied by the original UTF-16 string; wipe it if needed).
STARGÅTE wrote: First of all, you cause a memory overflow with PokeS(..., #PB_UTF8) if the required length in UTF-8 is longer than the internal string representation (2 bytes per character). This happens for character codes larger than 2047: UTF-8 then needs 3 bytes per character, and PokeS writes past the end of the memory at @str$. --> IMA
Hah. This seems to be the only correct part of your criticism; I missed this point (that UTF-8 characters may be larger than the original UTF-16 ones).
I should fix it somehow (or maybe I already did in some other version of this code; I just haven't looked at those for a long time).
STARGÅTE wrote: Sun Mar 19, 2023 7:29 pm Here is my result of your code, which doesn't work:

Code: Select all

123а'ї
㈱㼳㼧
123?'?
I don't know why you get such results in your tests; I just checked, and both directions work for ASCII and UTF-8 (in both variants of the packing function).
(strikethrough)Probably you should check the memory rather than the displayed characters, as there may be something locale-related.
Something else.

Code: Select all

123а'ї
㈱뼧
123а'ї

Code: Select all

123а'ї
㈱퀳➰韑
123а'ї

Post by Lunasole »

AZJIO wrote: Sun Mar 19, 2023 9:18 pm It is just like with pictures: compressed formats are designed for compact storage of files on a hard drive. When you draw a JPG in a program window, it uses the same memory as a bitmap, because each pixel must be drawn with a full RGB color. UTF-8 follows the same principle: when saved it is a compact notation, but when read it is converted to UTF-16. String functions work easily with this format, since the width of each character becomes the same. Once you decide to save the file, you need UTF-8 again for compactness, so you use the UTF8() function and save the result to a file. Scintilla also requires a pointer to memory holding UTF-8 data, so that is another case where the data needs to be converted to UTF-8.
Therefore, it is not clear to me what this kind of conversion to ASCII or UTF-8 is for. It does not make sense, since string functions will not work with the result.
Yes. Such things only make sense when sending UTF-8/ASCII to an external program or function; all the modern PB libraries etc. work with UTF-16.
And even in such cases there are other options such as pseudotypes, as well as PB's built-in conversion functions.
Anyway, it's fun stuff as far as I'm concerned (if you know what you're doing).

Post by Lunasole »

Oh, I remember now: some of the old versions used StringByteLength and memory allocation, so they probably handled UTF-8 packing fine too.
What loops in my own memory in recent years! Of course I know the nature of such glitches (information overload, emotional shocks and, finally, smoking weed for some time), but it's nothing funny ^^
I will update the first post later to fix the UTF-8 case.

Post by mk-soft »

I don't see the point of that. A string in PB is UTF-16. You can't do anything with a corrupted string. Most functions for data exchange (files, etc.) already support ASCII or UTF-8. When using the Win-API, the wide functions are also called automatically. For the rest, the PB Ascii() and UTF8() functions are sufficient.
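
As an illustration of the point about data exchange functions, here is a minimal sketch that lets the file commands do the conversion through their format flag. The file name is invented for the example and file number 0 is arbitrary.

```purebasic
; Sketch only: the file name is a made-up example
Define file$ = GetTemporaryDirectory() + "format_demo.txt"
Define text$ = "123" + Chr($430) + "'" + Chr($457)

If CreateFile(0, file$)
  WriteStringN(0, text$, #PB_UTF8)   ; converted to UTF-8 on the way out
  CloseFile(0)
EndIf

If ReadFile(0, file$)
  Debug ReadString(0, #PB_UTF8)      ; converted back to a normal PB string
  CloseFile(0)
EndIf
```

The string variable stays ordinary UTF-16 the whole time; only the bytes in the file are UTF-8.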

Post by Lunasole »

Lunasole wrote: Mon Mar 20, 2023 2:22 pm What loops in my own memory in recent years! Of course I know the nature of such glitches (information overload, emotional shocks and, finally, smoking weed for some time), but it's nothing funny ^^
Yeah, now I'm surely the most adequate one here, not like before.. :mrgreen:
// at least I also added some contacts to that site of mine.
I'll remove this topic a bit later; I just don't want to post in another one (well, I forgot that a whole topic with replies cannot be removed, so let it remain).

Post by Kwai chang caine »

Really useful code :wink:
Thanks for sharing 8)

Post by Rinzwind »

Kwai chang caine wrote: Mon Apr 24, 2023 10:34 am Really useful code :wink:
Thanks for sharing 8)
Why useful?