Characters Unicode numbers ??

applePi · Post by **applePi** » Sat Aug 17, 2019 7:55 am

Suppose we want to investigate the Russian character "Я": https://www.compart.com/en/unicode/U+042F
In PureBasic:

unicode$="Я"
Debug Asc(unicode$)

The result is 1071 in decimal.

On the page above:
UTF-16 Encoding: 0x042F
which is the same as 1071:

Code: Select all

Debug $042F  ;= 1071

But shouldn't this be UTF-8 and not UTF-16?
After all, we chose UTF-8 as the encoding in PureBasic.

That page also says:
UTF-8 Encoding: 0xD0 0xAF

Could that be equal to:

Code: Select all

Debug $D0AF ; 53423  ???

Isn't that number too big?

Any additional explanation of these numbers would be appreciated.

Thanks in advance.

Mijikai · Post by **Mijikai** » Sat Aug 17, 2019 8:14 am

Is there another UTF-8 setting in PureBasic besides the source file encoding?

Internally, current versions of PureBasic work in Unicode, so the returned value is correct.

For UTF-8 you would need to convert it first:

Code: Select all

EnableExplicit

Global str.s
Global *utf8

str = "Я"
*utf8 = UTF8(str)

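; PeekW reads both bytes as one little-endian word, so on x86/x64
; this prints AFD0: the bytes D0 AF in swapped order
Debug Hex(PeekW(*utf8),#PB_Word)

FreeMemory(*utf8) ; the buffer returned by UTF8() has to be freed manually

End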
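
To see the bytes in the same order as the compart page lists them, they can be read one at a time instead of as a word. This is only a sketch, assuming the source file is saved as UTF-8; the variable names `i` and `out` are made up for the illustration:

```purebasic
EnableExplicit

Define *utf8 = UTF8("Я")
Define i = 0
Define out.s = ""

; Walk the buffer one byte at a time until the terminating null byte
While PeekA(*utf8 + i) <> 0
  out = out + RSet(Hex(PeekA(*utf8 + i)), 2, "0") + " "
  i = i + 1
Wend

Debug out          ; D0 AF
FreeMemory(*utf8)  ; the buffer returned by UTF8() has to be freed manually
```

Reading byte by byte avoids the byte-order question entirely, since UTF-8 is defined as a sequence of single bytes.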

#NULL · Post by **#NULL** » Sat Aug 17, 2019 8:21 am

applePi wrote: But shouldn't this be UTF-8 and not UTF-16?
After all, we chose UTF-8 as the encoding in PureBasic.

You probably mean UTF-8 as the source file encoding. The source file on disk will contain the character as the byte sequence 0xD0 0xAF, but in the executable and at runtime it will be encoded as UTF-16 (0x042F), like all PureBasic strings by default. You can create a UTF-8 buffer explicitly:

Code: Select all


*s = UTF8("Я")
ShowMemoryViewer(*s, 3) ; d0 af 00
FreeMemory(*s)

Code: Select all

*s = AllocateMemory(3)
PokeB(*s, $d0)
PokeB(*s+1, $af)
PokeB(*s+2, 0)
ShowMemoryViewer(*s, 3) ; d0 af 00
Debug PeekS(*s, 2, #PB_UTF8)
FreeMemory(*s)
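
As for whether 0xD0 0xAF is "equal to" $D0AF: the two bytes are not one big number, they are the code point $042F split across a lead byte and a continuation byte. A rough sketch of that bit arithmetic, valid only for code points in the two-byte range $80 to $7FF (the names `cp`, `b1` and `b2` are just for illustration):

```purebasic
EnableExplicit

Define cp = Asc("Я")            ; code point, 1071 = $042F
Define b1 = $C0 | (cp >> 6)     ; lead byte: 110xxxxx carrying the top 5 bits
Define b2 = $80 | (cp & $3F)    ; continuation byte: 10xxxxxx carrying the low 6 bits

Debug Hex(b1) + " " + Hex(b2)   ; D0 AF

; Decoding reverses the steps and recovers the code point
Debug ((b1 & $1F) << 6) | (b2 & $3F) ; 1071
```

So 53423 ($D0AF) is just what you get when the two UTF-8 bytes are glued together and read as a single integer; it has no meaning as a character number.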

Demivec · Post by **Demivec** » Sat Aug 17, 2019 8:22 am

applePi wrote: But shouldn't this be UTF-8 and not UTF-16?
After all, we chose UTF-8 as the encoding in PureBasic.

As everyone who posted before I completed this message said, the character has the same Unicode value (its code point) whether it is encoded as UTF-8 or UTF-16.

Looking at the example code you provided:

Code: Select all

unicode$="Я"
Debug Asc(unicode$)

If the source code is encoded in UTF-8, the strings are converted to UTF-16 when the program is compiled, so the value displayed (i.e. 1071) is the character's UTF-16 value, which for this character is the same as its Unicode code point $042F.
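
One way to check this at runtime is to look at the string's memory directly and compare the byte lengths in both encodings. A small sketch, assuming a Unicode build of PureBasic (the default in current versions) and a source file saved as UTF-8:

```purebasic
EnableExplicit

Define s.s = "Я"

; @s is the address of the string's character data in memory
Debug Hex(PeekU(@s), #PB_Unicode)       ; 42F, the UTF-16 code unit
Debug StringByteLength(s, #PB_Unicode)  ; 2 bytes as UTF-16
Debug StringByteLength(s, #PB_UTF8)     ; 2 bytes as UTF-8 (D0 AF)
Debug StringByteLength("A", #PB_UTF8)   ; 1 byte: ASCII characters stay single-byte in UTF-8
```

For "Я" both encodings happen to need two bytes, but the bytes themselves differ, which is why the page lists two different hex values.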

applePi wrote: Isn't that number too big?

Any additional explanation of these numbers would be appreciated.

Unicode code points actually extend up to $10FFFF in hex. Complications in the UTF-16 encoding only arise for values above $FFFF, because those code points take 4 bytes to encode instead of 2. I won't go into those details unless you are really interested in knowing.
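
For anyone who is interested: code points above $FFFF are stored in UTF-16 as two 16-bit units, a so-called surrogate pair. A minimal sketch of the standard calculation, using U+1F600 as an arbitrary example (the variable names are made up for the illustration):

```purebasic
EnableExplicit

Define cp   = $1F600               ; example code point above $FFFF
Define v    = cp - $10000          ; 20-bit value left after removing the offset
Define high = $D800 + (v >> 10)    ; high surrogate carries the top 10 bits
Define low  = $DC00 + (v & $3FF)   ; low surrogate carries the bottom 10 bits

Debug Hex(high) + " " + Hex(low)   ; D83D DE00
```

Two units of 2 bytes each give the 4 bytes mentioned above.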

PureBasic Forums - English
