Characters Unicode numbers ??

applePi
Addict
Posts: 1404
Joined: Sun Jun 25, 2006 7:28 pm

Characters Unicode numbers ??

Post by applePi »

Suppose we want to investigate the Russian character "Я": https://www.compart.com/en/unicode/U+042F
In PureBasic:

Code: Select all

unicode$="Я"
Debug Asc(unicode$) 
The result is 1071 in decimal.

The page above lists:
UTF-16 Encoding: 0x042F
which is the same as 1071:

Code: Select all

Debug $042F  ;= 1071
But shouldn't this be UTF-8 and not UTF-16?
We chose UTF-8 as the encoding in PureBasic.

That page also lists:
UTF-8 Encoding: 0xD0 0xAF

Could that be equal to:

Code: Select all

Debug $D0AF ; 53423  ???
Isn't that number too big?

Any additional explanation of these numbers would be appreciated.

Thanks in advance
Mijikai
Addict
Posts: 1360
Joined: Sun Sep 11, 2016 2:17 pm

Re: Characters Unicode numbers ??

Post by Mijikai »

Is there another UTF-8 setting in PureBasic besides the source file encoding?

Internally, current PureBasic versions work in Unicode (UTF-16), so the returned value is correct.

For UTF-8 you would need to convert the string first:

Code: Select all

EnableExplicit

Global str.s
Global *utf8

str = "Я"
*utf8 = UTF8(str)

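; Print the two UTF-8 bytes in memory order. PeekW would read them
; as one little-endian word and show them swapped (AFD0).
Debug Hex(PeekA(*utf8)) + " " + Hex(PeekA(*utf8 + 1)) ; D0 AF

FreeMemory(*utf8) ; UTF8() allocates a buffer that must be freed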

End
#NULL
Addict
Posts: 1440
Joined: Thu Aug 30, 2007 11:54 pm
Location: right here

Re: Characters Unicode numbers ??

Post by #NULL »

applePi wrote: But shouldn't this be UTF-8 and not UTF-16?
We chose UTF-8 as the encoding in PureBasic.
You probably mean UTF-8 as the source file encoding. The source file on disk contains the character as the sequence 0xD0 0xAF, but in the executable, at runtime, it is encoded as UTF-16 (0x042F), like all PureBasic strings by default. You can create a UTF-8 buffer explicitly:

Code: Select all


*s = UTF8("Я")
ShowMemoryViewer(*s, 3) ; d0 af 00
FreeMemory(*s)

Code: Select all

*s = AllocateMemory(3)
PokeA(*s, $d0)   ; PokeA writes an unsigned byte (0..255)
PokeA(*s+1, $af)
PokeA(*s+2, 0)   ; terminating null byte
ShowMemoryViewer(*s, 3) ; d0 af 00
Debug PeekS(*s, 2, #PB_UTF8)
FreeMemory(*s)
Last edited by #NULL on Sat Aug 17, 2019 8:22 am, edited 1 time in total.
Demivec
Addict
Posts: 4091
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Characters Unicode numbers ??

Post by Demivec »

applePi wrote: But shouldn't this be UTF-8 and not UTF-16?
We chose UTF-8 as the encoding in PureBasic.
As everyone who posted before I finished this message has said :), the value is the same Unicode character (or code point) whether it is encoded as UTF-8 or UTF-16.

Looking at the example code you provided:

Code: Select all

unicode$="Я"
Debug Asc(unicode$)
If the source code is encoded in UTF-8, the strings are converted to UTF-16 when the program is compiled, so the value displayed (1071) is the character's value as encoded in UTF-16.
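To connect this with the UTF-8 bytes 0xD0 0xAF from the page quoted in the question, here is a minimal sketch that decodes the two bytes by hand. The variable names are only illustrative, and it assumes a Unicode build of PureBasic. It shows why $D0AF read as one number (53423) is not the character value: the bytes carry marker bits that have to be stripped first.

```purebasic
; Decode the two-byte UTF-8 sequence D0 AF by hand.
; A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy,
; and the code point is the 11 bits xxxxxyyyyyy.
Define lead.i  = $D0   ; first byte  (110 10000)
Define trail.i = $AF   ; second byte (10 101111)

; Keep the low 5 bits of the lead byte and the low 6 bits of the trail byte.
Define codepoint.i = ((lead & $1F) << 6) | (trail & $3F)

Debug codepoint        ; 1071
Debug Hex(codepoint)   ; 42F
Debug Chr(codepoint)   ; Я
```

So UTF-8 (D0 AF) and UTF-16 (042F) are two different byte layouts for the same code point, 1071.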
applePi wrote: Isn't that number too big?

Any additional explanation of these numbers would be appreciated.
Unicode code points can actually extend up to $10FFFF in hex. The only complication in the UTF-16 encoding arises for values above $FFFF, because those code points take 4 bytes (a surrogate pair of two 16-bit units) to encode instead of 2 bytes. I won't go into those details unless you are really interested in knowing.
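For readers who do want those details, the following is a rough sketch of how a code point above $FFFF is split into two 16-bit units in UTF-16. The example code point U+1F600 and the variable names are chosen here purely for illustration.

```purebasic
; Split a code point above $FFFF into a UTF-16 surrogate pair.
Define codepoint.i = $1F600             ; illustrative example, above $FFFF
Define offset.i    = codepoint - $10000 ; leaves a 20-bit value

; Upper 10 bits go into the high surrogate, lower 10 bits into the low one.
Define high.i = $D800 + (offset >> 10)
Define low.i  = $DC00 + (offset & $3FF)

Debug Hex(high)   ; D83D
Debug Hex(low)    ; DE00
```

Two units of 2 bytes each give the 4 bytes mentioned above. Characters such as "Я" (U+042F) are below $FFFF and need only a single unit.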