PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by **ElementE** » Tue Mar 03, 2015 2:15 pm

Can someone explain to me how PureBASIC will store a UTF-8 encoding of a character that is not in the UCS-2 set of characters?

Or does PureBASIC really use UTF-16 instead of UCS-2 internally for unicode?

Post by **Thorium** » Tue Mar 03, 2015 8:52 pm

UTF-8 is a variable-length encoding. It can store any Unicode character. The 8 refers to the size of the code unit, which is 8 bits. A single character takes between one and four of these bytes, so UTF-8 can also store characters that need 16 bits or more in other encodings.
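
To make the variable length concrete, here is a minimal sketch that asks PureBasic how many bytes a few characters need in UTF-8. It assumes a Unicode build, and the variable names are only illustrative. The characters are built with `Chr()` so the result does not depend on the source file encoding.

```purebasic
; Sketch: count the bytes UTF-8 needs for characters from different ranges.
; Assumes the program is compiled in Unicode mode.
Define latin$ = Chr($41)    ; U+0041 LATIN CAPITAL LETTER A
Define caron$ = Chr($160)   ; U+0160 LATIN CAPITAL LETTER S WITH CARON
Define hira$  = Chr($3042)  ; U+3042 HIRAGANA LETTER A

Debug StringByteLength(latin$, #PB_UTF8)  ; 1 byte
Debug StringByteLength(caron$, #PB_UTF8)  ; 2 bytes
Debug StringByteLength(hira$, #PB_UTF8)   ; 3 bytes
```

`StringByteLength()` does not count the terminating null, so a buffer for the encoded string needs at least one extra byte.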

Post by **skywalk** » Tue Mar 03, 2015 9:05 pm

Small snippet to consider text and formats...

```purebasic
CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Compile with Unicode and 'IDE-Preference-Sourcefile Text encoding' = UTF-8 only."
CompilerElse
  ; Copy a string into a newly allocated buffer in the requested encoding.
  ; The caller must release the buffer with FreeMemory().
  Procedure.i SF_ToMem(Unicode$, Enc.i=#PB_Ascii)
    ; Size the buffer from the encoded length; +2 leaves room for the terminating null.
    Protected *b = AllocateMemory(StringByteLength(Unicode$, Enc) + 2)
    If *b
      PokeS(*b, Unicode$, -1, Enc)
    EndIf
    ProcedureReturn *b
  EndProcedure
  Define uni$ = "ŠTEPÁNEK ŽIGIC lives."
  Debug Asc(uni$)
  Debug uni$ ; Without UTF-8 BOM, prints " TEPÁNEK  IGIC lives."
  Define *b = SF_ToMem(uni$, #PB_UTF8)
  If *b
    ShowMemoryViewer(*b, MemorySize(*b))  ;<- after program exits, view memory as hex/utf-8/etc.
    FreeMemory(*b)
  EndIf
CompilerEndIf
```

Post by **ElementE** » Wed Mar 04, 2015 8:51 am

Thank you for your answers, and the example code.

The test character I am using is the Unicode 'HIRAGANA LETTER A' (U+3042) character あ.
This is the Japanese hiragana character for the vowel "a".

According to the fileformat page,

http://www.fileformat.info/info/unicode ... /index.htm

あ has the following unicode encodings:

UTF-8 (hex) 0xE3 0x81 0x82 (e38182)
UTF-16 (hex) 0x3042 (3042)

So it takes three bytes in UTF-8 and two bytes in UTF-16 (also UCS-2?) to encode this character.
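
The byte values listed above can be checked from PureBasic itself. The following is a rough sketch, assuming a Unicode build. The names `c$`, `*buf` and `out$` are made up for this example.

```purebasic
; Sketch: dump the UTF-8 bytes and the 16-bit value of U+3042.
; Assumes the program is compiled in Unicode mode.
Define c$ = Chr($3042)
Define n = StringByteLength(c$, #PB_UTF8)   ; encoded length without the null
Define i
Define out$ = ""
Define *buf = AllocateMemory(n + 1)         ; +1 for the terminating null

If *buf
  PokeS(*buf, c$, -1, #PB_UTF8)
  For i = 0 To n - 1
    out$ = out$ + RSet(Hex(PeekA(*buf + i)), 2, "0") + " "
  Next
  Debug out$                                ; E3 81 82
  FreeMemory(*buf)
EndIf

Debug Hex(Asc(c$))                          ; 3042
```

U+3042 lies in the Basic Multilingual Plane, so UCS-2 and UTF-16 store it as the same single 16-bit unit.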

But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.

Post by **Danilo** » Wed Mar 04, 2015 9:06 am

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.

It works here:

```purebasic
Character = $3042
MessageRequester("Character",Chr(Character))
```

If you use Debug output, you need to set a good Unicode font for it in PB preferences.

Post by **Little John** » Wed Mar 04, 2015 9:55 am

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.

Works fine on my system.
See here how to do it.

Post by **Fred** » Thu Mar 05, 2015 3:54 pm

For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).

Post by **Rescator** » Wed Mar 25, 2015 12:56 pm

Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).

Does this mean that if Windows passes back a UTF-16 string in which one character actually uses two UTF-16 "characters" (a surrogate pair), PureBasic will treat them as if they were two characters rather than one extended character?

(Similar to how an ASCII routine would treat a UTF-8 string containing multi-byte UTF-8 characters.)
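
One way to probe this is to build a character outside the Basic Multilingual Plane from its surrogate pair and see how the string functions count it. This is only a sketch, assuming a Unicode build. U+1F600 is an arbitrary test character, and its UTF-16 encoding is the pair $D83D $DE00. If strings are plain sequences of 16-bit units, as Fred describes, `Len()` would be expected to report 2 here, with each half readable on its own.

```purebasic
; Sketch: see how the string functions count a UTF-16 surrogate pair.
; Assumes the program is compiled in Unicode mode.
Define pair$ = Chr($D83D) + Chr($DE00)   ; high + low surrogate of U+1F600

Debug Len(pair$)                         ; number of 16-bit units in the string
Debug Hex(Asc(pair$))                    ; value of the first unit
Debug Hex(Asc(Mid(pair$, 2, 1)))         ; value of the second unit
```

If that is what you see, functions such as `Mid()` or `Left()` can split a pair in the middle, so a string that may contain such characters should only be cut at positions known not to fall between the two halves.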
