PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Tue Mar 03, 2015 2:15 pm
by ElementE
Can someone explain to me how PureBASIC will store a UTF-8 encoding of a character that is not in the UCS-2 set of characters?

Or does PureBASIC really use UTF-16 instead of UCS-2 internally for unicode?

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Tue Mar 03, 2015 8:52 pm
by Thorium
UTF-8 is a variable-length encoding and can store any Unicode character. The 8 refers to the 8-bit code unit, which is the smallest size a character can take. Characters outside the ASCII range take two, three, or four bytes.
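
A minimal sketch of the variable-length point, assuming the program is compiled in Unicode mode so that Chr() accepts values above 255. The three sample characters are illustrative choices; the byte counts in the comments are what the UTF-8 definition gives for those code points.

```purebasic
; Byte counts of single characters when encoded as UTF-8.
; Assumes Unicode compile mode; the sample characters are illustrative.
Debug StringByteLength("A", #PB_UTF8)        ; U+0041, ASCII range: 1 byte
Debug StringByteLength(Chr($E9), #PB_UTF8)   ; U+00E9: 2 bytes
Debug StringByteLength(Chr($3042), #PB_UTF8) ; U+3042: 3 bytes
```

StringByteLength() reports the size without the terminating null, so a buffer sized from it needs extra room for the terminator.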

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Tue Mar 03, 2015 9:05 pm
by skywalk
Small snippet to consider text and formats...

Code:

CompilerIf #PB_Compiler_Unicode = 0  ; #PB_Unicode is a string-format constant and is never 0
  CompilerError "Compile with Unicode and 'IDE-Preference-Sourcefile Text encoding' = UTF-8 only."
CompilerElse
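  Procedure.i SF_ToMem(Unicode$, Enc.i=#PB_Ascii)
    ; Size the buffer from the encoded byte length: UTF-8 can need up to
    ; 3 bytes per UCS-2 character, so Len() alone may be too small.
    ; The +128 slack from the original also covers the null terminator.
    Protected *b = AllocateMemory(StringByteLength(Unicode$, Enc) + 128)
    If *b
      PokeS(*b, Unicode$, -1, Enc)
    EndIf
    ProcedureReturn *b
  EndProcedure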
  Define$ uni$ = "ŠTEPÁNEK ŽIGIC lives."
  Debug Asc(uni$)
  Debug uni$ ; Without UTF-8 BOM, prints " TEPÁNEK  IGIC lives."  
  Define *b = SF_ToMem(uni$, #PB_UTF8)
  ShowMemoryViewer(*b, 32)  ;<- after program exits, view memory as hex/utf-8/etc.
  FreeMemory(*b)
CompilerEndIf

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Wed Mar 04, 2015 8:51 am
by ElementE
Thank you for your answers, and the example code.

The test character I am using is the Unicode 'HIRAGANA LETTER A' (U+3042) character あ.
This is the Japanese hiragana character for the vowel "a".

According to the fileformat page,

http://www.fileformat.info/info/unicode ... /index.htm

あ has the following unicode encodings:

UTF-8 (hex) 0xE3 0x81 0x82 (e38182)
UTF-16 (hex) 0x3042 (3042)

So it takes three bytes in UTF-8 and two bytes in UTF-16 (also UCS-2?) to encode this character.
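
The byte counts quoted from fileformat.info can be checked from within PureBasic. This is only a sketch, assuming Unicode compile mode; the variable names are hypothetical.

```purebasic
; Dump the UTF-8 bytes of U+3042 and compare its size in both encodings.
; Assumes Unicode compile mode; variable names are illustrative.
s$ = Chr($3042)
utf8Len = StringByteLength(s$, #PB_UTF8)   ; byte count without the terminating null
*buf = AllocateMemory(utf8Len + 1)         ; +1 for the terminating null
If *buf
  PokeS(*buf, s$, -1, #PB_UTF8)
  hex$ = ""
  For i = 0 To utf8Len - 1
    hex$ + RSet(Hex(PeekA(*buf + i)), 2, "0") + " "
  Next
  Debug hex$                               ; fileformat.info lists E3 81 82 for this character
  Debug StringByteLength(s$, #PB_Unicode)  ; size of the same string in 16-bit units, in bytes
  FreeMemory(*buf)
EndIf
```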

But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Wed Mar 04, 2015 9:06 am
by Danilo
ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
It works here:

Code:

Character = $3042
MessageRequester("Character",Chr(Character))

If you use Debug output, you need to set a font with good Unicode coverage for it in the PB preferences.

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Wed Mar 04, 2015 9:55 am
by Little John
ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
Works fine on my system.
See here how to do it.

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Thu Mar 05, 2015 3:54 pm
by Fred
For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Posted: Wed Mar 25, 2015 12:56 pm
by Rescator
Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
Does this mean that if Windows passes back a UTF-16 string in which one character occupies two UTF-16 code units (a surrogate pair), PureBasic will treat them as two characters rather than a single extended one?

(Similar to how an ASCII routine would treat a UTF-8 string containing multi-byte UTF-8 characters.)
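
A small sketch of the situation described in this question, assuming Unicode compile mode and assuming Chr() accepts the individual surrogate values. U+1F600 is an illustrative code point, not one taken from the thread. The surrogate arithmetic follows the UTF-16 definition; the Len() result is what Fred's statement would lead one to expect, not a verified output.

```purebasic
; Split a code point above U+FFFF into a UTF-16 surrogate pair and
; see how a string holding that pair is counted.
; Assumes Unicode compile mode; U+1F600 is an illustrative choice.
cp   = $1F600
v    = cp - $10000
high = $D800 + (v >> 10)    ; high (lead) surrogate: $D83D
low  = $DC00 + (v & $3FF)   ; low (trail) surrogate: $DE00
s$ = Chr(high) + Chr(low)
Debug Hex(high) + " " + Hex(low)
Debug Len(s$)               ; counts 16-bit units, so 2 would be expected here
```

If Len() does report 2, then functions such as Left() or Mid() could cut such a pair in half, much as a byte-oriented routine can split a multi-byte UTF-8 sequence.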