PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

ElementE
Enthusiast
Posts: 139
Joined: Sun Feb 22, 2015 2:33 am

Post by ElementE »

Can someone explain to me how PureBASIC will store a UTF-8 encoding of a character that is not in the UCS-2 set of characters?

Or does PureBASIC really use UTF-16 instead of UCS-2 internally for unicode?
Think Unicode!
Thorium
Addict
Posts: 1271
Joined: Sat Aug 15, 2009 6:59 pm

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Thorium »

UTF-8 is a variable-length encoding that can represent any Unicode character. The 8 refers to the size of its code unit, which is 8 bits. A single character takes between one and four of those bytes, so UTF-8 also covers characters whose code points need 16 bits or more.
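
To make the variable length concrete, here is a minimal sketch that asks PureBasic how many bytes a few characters need in UTF-8. It assumes a Unicode-mode build (so that Chr() accepts values above 255), and Utf8Bytes is a hypothetical helper name, not a built-in.

```purebasic
; Hypothetical helper: number of bytes Text$ occupies in UTF-8,
; not counting the terminating null.
Procedure.i Utf8Bytes(Text$)
  ProcedureReturn StringByteLength(Text$, #PB_UTF8)
EndProcedure

Debug Utf8Bytes("A")        ; U+0041, fits in one byte
Debug Utf8Bytes(Chr($E9))   ; U+00E9, needs two bytes
Debug Utf8Bytes(Chr($3042)) ; U+3042, needs three bytes
```

Characters outside the Basic Multilingual Plane would need four bytes in UTF-8, but they cannot be produced with a single Chr() call here.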
skywalk
Addict
Posts: 3994
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by skywalk »

Small snippet to consider text and formats...

Code:

CompilerIf #PB_Compiler_Unicode = 0
  CompilerError "Compile with Unicode and 'IDE-Preference-Sourcefile Text encoding' = UTF-8 only."
CompilerElse
  ; Copy a string into a newly allocated buffer using the given encoding.
  ; The caller must release the buffer with FreeMemory().
  Procedure.i SF_ToMem(Unicode$, Enc.i=#PB_Ascii)
    ; Size the buffer from the encoded byte length, plus room for a null terminator.
    Protected *b = AllocateMemory(StringByteLength(Unicode$, Enc) + 2)
    If *b
      PokeS(*b, Unicode$, -1, Enc)
    EndIf
    ProcedureReturn *b
  EndProcedure
  Define uni$ = "ŠTEPÁNEK ŽIGIC lives."
  Debug Asc(uni$) ; code of the first character
  Debug uni$ ; Without UTF-8 BOM, prints " TEPÁNEK  IGIC lives."
  Define *b = SF_ToMem(uni$, #PB_UTF8)
  ShowMemoryViewer(*b, 32)  ;<- after program exits, view memory as hex/utf-8/etc.
  FreeMemory(*b)
CompilerEndIf
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
ElementE
Enthusiast
Posts: 139
Joined: Sun Feb 22, 2015 2:33 am

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by ElementE »

Thank you for your answers, and the example code.

The test character I am using is the Unicode 'HIRAGANA LETTER A' (U+3042) character あ.
This is the Japanese hiragana for the vowel "a".

According to the fileformat page,

http://www.fileformat.info/info/unicode ... /index.htm

あ has the following unicode encodings:

UTF-8 (hex) 0xE3 0x81 0x82 (e38182)
UTF-16 (hex) 0x3042 (3042)

So it takes three bytes in UTF-8 and two bytes in UTF-16 (also UCS-2?) to encode this character.
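
These byte values can be checked from inside PureBasic. The following is only a sketch, assuming a Unicode-mode build; HexDump is a hypothetical helper written for this post, not part of PureBasic.

```purebasic
; Hypothetical helper: returns the bytes of Text$ in the given format as hex text.
Procedure.s HexDump(Text$, Format.i)
  Protected Size.i = StringByteLength(Text$, Format)
  Protected *Buffer = AllocateMemory(Size + 2) ; room for a 1- or 2-byte null
  Protected Result$, i.i
  If *Buffer
    PokeS(*Buffer, Text$, -1, Format)
    For i = 0 To Size - 1
      Result$ + RSet(Hex(PeekA(*Buffer + i)), 2, "0") + " "
    Next
    FreeMemory(*Buffer)
  EndIf
  ProcedureReturn Trim(Result$)
EndProcedure

Debug HexDump(Chr($3042), #PB_UTF8)    ; the three UTF-8 bytes
Debug HexDump(Chr($3042), #PB_Unicode) ; the single 16-bit unit, byte by byte
```

On a little-endian machine the second line should list the low byte first, so 0x3042 shows up as 42 30 rather than 30 42.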

But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
Think Unicode!
Danilo
Addict
Posts: 3037
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Danilo »

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
It works here:

Code:

Character = $3042
MessageRequester("Character",Chr(Character))
If you use Debug output, you need to set a good Unicode font for it in PB preferences.
Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Little John »

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
Works fine on my system.
See here how to do it.
Fred
Administrator
Posts: 16664
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Fred »

For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
Rescator
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Rescator »

Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
Does this mean that if Windows passes back a UTF-16 string where one character actually uses two UTF-16 "characters", PureBasic will treat them as if they were two characters rather than one extended character?

(Similar to how an ASCII routine would treat a UTF-8 string with extended UTF-8 characters.)
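
One way to try this out is to place a surrogate pair in memory by hand, as an API returning UTF-16 might, and read it back as a PureBasic string. This is a sketch only, assuming a Unicode-mode build; the pair D83D DE00 (U+1F600) is just an example, and DecodeSurrogatePair is a hypothetical helper that applies the standard UTF-16 decoding rule.

```purebasic
; Hypothetical helper: combine a high and a low surrogate into one code point.
Procedure.i DecodeSurrogatePair(High.i, Low.i)
  ProcedureReturn $10000 + ((High - $D800) << 10) + (Low - $DC00)
EndProcedure

Define Text$
Define *Buffer = AllocateMemory(6) ; two 16-bit units plus a 16-bit null
If *Buffer
  PokeU(*Buffer, $D83D)     ; high surrogate
  PokeU(*Buffer + 2, $DE00) ; low surrogate
  PokeU(*Buffer + 4, 0)     ; terminator
  Text$ = PeekS(*Buffer, -1, #PB_Unicode)
  FreeMemory(*Buffer)

  Debug Len(Text$) ; counts 16-bit units, not code points
  Debug Hex(DecodeSurrogatePair(Asc(Left(Text$, 1)), Asc(Mid(Text$, 2, 1))))
EndIf
```

If strings are plain sequences of 16-bit units, as Fred describes, Len() should report 2 here, and putting the pair back together into a single code point is left to your own code.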