PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

ElementE
Enthusiast
Posts: 139
Joined: Sun Feb 22, 2015 2:33 am

Post by ElementE »

Can someone explain to me how PureBASIC will store a UTF-8 encoding of a character that is not in the UCS-2 set of characters?

Or does PureBASIC really use UTF-16 instead of UCS-2 internally for unicode?
Think Unicode!
Thorium
Addict
Posts: 1268
Joined: Sat Aug 15, 2009 6:59 pm

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Thorium »

UTF-8 is a variable-length encoding and can store any Unicode character. The 8 refers to the size of its code unit, which is 8 bits, so the shortest characters take a single byte. Other characters take two, three, or four bytes.
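
To make the variable length concrete, here is a minimal sketch that asks PureBasic how many bytes a few characters need in UTF-8. It assumes a Unicode-mode build; the three sample characters are picked only for illustration and are built with Chr() so the result does not depend on the source file encoding.

```purebasic
; Sketch only: assumes the program is compiled in Unicode mode.
; StringByteLength() reports the size without the terminating null.
Debug StringByteLength(Chr($41),   #PB_UTF8)  ; 'A' (U+0041): 1 byte
Debug StringByteLength(Chr($E9),   #PB_UTF8)  ; 'é' (U+00E9): 2 bytes
Debug StringByteLength(Chr($3042), #PB_UTF8)  ; 'あ' (U+3042): 3 bytes
```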
skywalk
Addict
Posts: 3555
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by skywalk »

Small snippet to consider text and formats...

Code:

CompilerIf #PB_Compiler_Unicode = 0 ; #PB_Unicode is a string-format constant, not the compile-mode flag
  CompilerError "Compile with Unicode and 'IDE-Preference-Sourcefile Text encoding' = UTF-8 only."
CompilerElse
  Procedure.i SF_ToMem(Unicode$, Enc.i=#PB_Ascii)
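    ; Size the buffer by the target encoding; +2 leaves room for the null terminator in any format.
    Protected *b = AllocateMemory(StringByteLength(Unicode$, Enc) + 2)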
    PokeS(*b, Unicode$, -1, Enc)
    ProcedureReturn *b
  EndProcedure
  Define$ uni$ = "ŠTEPÁNEK ŽIGIC lives."
  Debug Asc(uni$)
  Debug uni$ ; Without UTF-8 BOM, prints " TEPÁNEK  IGIC lives."  
  Define *b = SF_ToMem(uni$, #PB_UTF8)
  ShowMemoryViewer(*b, 32)  ;<- after program exits, view memory as hex/utf-8/etc.
  FreeMemory(*b)
CompilerEndIf

The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
ElementE
Enthusiast
Posts: 139
Joined: Sun Feb 22, 2015 2:33 am

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by ElementE »

Thank you for your answers, and the example code.

The test character I am using is the Unicode 'HIRAGANA LETTER A' (U+3042) character あ.
This is the Japanese hiragana character for the vowel "a".

According to the fileformat page,

http://www.fileformat.info/info/unicode ... /index.htm

あ has the following unicode encodings:

UTF-8 (hex) 0xE3 0x81 0x82 (e38182)
UTF-16 (hex) 0x3042 (3042)

So it takes three bytes in UTF-8 and two bytes in UTF-16 (also UCS-2?) to encode this character.
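
Those byte values can be checked from inside PureBasic. The following is a minimal sketch, assuming a Unicode-mode build; the variable names are made up for illustration. It writes the character into a buffer as UTF-8 and prints each byte in hex.

```purebasic
; Sketch only: assumes the program is compiled in Unicode mode.
Define s$ = Chr($3042)                        ; HIRAGANA LETTER A
Define.i n = StringByteLength(s$, #PB_UTF8)   ; UTF-8 size without the null
Define.i i
Define out$
Define *buf = AllocateMemory(n + 1)           ; +1 for the terminating null

If *buf
  PokeS(*buf, s$, -1, #PB_UTF8)               ; convert to UTF-8 in memory
  For i = 0 To n - 1
    out$ = out$ + RSet(Hex(PeekA(*buf + i)), 2, "0") + " "
  Next
  Debug out$                                  ; E3 81 82
  Debug StringByteLength(s$, #PB_Unicode)     ; 2 bytes in the internal format
  FreeMemory(*buf)
EndIf
```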

But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
Think Unicode!
Danilo
Addict
Posts: 3010
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Danilo »

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
It works here:

Code:

Character = $3042
MessageRequester("Character",Chr(Character))

If you use Debug output, you need to set a good Unicode font for it in PB preferences.
Little John
Addict
Posts: 4007
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Little John »

ElementE wrote: But, I have not been able to get PureBasic v5.31 to display あ correctly, even when both the IDE and compiler are in Unicode Mode.
Works fine on my system.
See here how to do it.
Please excuse my flawed English. My native language is PureBasic.
Fred
Administrator
Posts: 14413
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Fred »

For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
Rescator
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: PureBASIC internal encoding of unicode, UCS-2 or UTF-16?

Post by Rescator »

Fred wrote: For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
Does this mean that if Windows passes back a UTF-16 string in which one character actually uses two UTF-16 "characters" (a surrogate pair), PureBasic will treat them as if they were two characters rather than one extended character?

(Similar to how an ASCII routine would treat a UTF-8 string containing multibyte UTF-8 characters.)
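
A minimal sketch of the situation described in this question, assuming a Unicode-mode build and assuming the string functions simply count 16-bit units, as the UCS-2 statement above implies. The surrogate pair for U+1F600 is chosen only as an example, and the variable names are made up. Recombining the pair into a code point is plain arithmetic done by hand here, not a built-in PureBasic feature.

```purebasic
; Sketch only: assumes the program is compiled in Unicode mode.
; U+1F600 lies outside the BMP, so UTF-16 encodes it as the pair $D83D $DE00.
Define s$ = Chr($D83D) + Chr($DE00)

Debug Len(s$)   ; 2: each surrogate is counted as its own character

Define.i high = Asc(Mid(s$, 1, 1))
Define.i low  = Asc(Mid(s$, 2, 1))
Define.i codePoint

; Manual recombination of a high/low surrogate pair into one code point.
If high >= $D800 And high <= $DBFF And low >= $DC00 And low <= $DFFF
  codePoint = $10000 + ((high - $D800) << 10) + (low - $DC00)
  Debug "U+" + Hex(codePoint)   ; U+1F600
EndIf
```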