Re: Unicode and PureBasic

Posted: Wed Mar 25, 2015 4:40 am
by Rescator
From wikipedia: http://en.wikipedia.org/wiki/UTF-16
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.
So do note that on Windows a Unicode "character" may take 2, 4, 6 or 8 bytes: a single code point is 2 or 4 bytes in UTF-16, and a character built from a base letter plus combining marks takes more.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. parsing two bytes at a time would technically be wrong).
The first 65 thousand or so Unicode code points fit in two bytes; the higher ones need 4 bytes.
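
To make the byte counts concrete, here is a minimal sketch in Python (used purely for illustration, it is not PureBasic). It encodes a few strings as UTF-16 little-endian without a BOM and reports their size; the helper name `utf16_bytes` is made up for this example.

```python
def utf16_bytes(text):
    """Return the number of bytes `text` occupies in UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-16-le"))

# U+0041 and U+20AC are inside the BMP: one 16-bit code unit each.
print(utf16_bytes("A"))           # 2
print(utf16_bytes("\u20ac"))      # 2

# U+1F600 is outside the BMP: a surrogate pair, i.e. two 16-bit code units.
print(utf16_bytes("\U0001F600"))  # 4

# A base letter plus a combining mark is two code points, so 4 bytes,
# even though it is displayed as a single "character".
print(utf16_bytes("a\u030a"))     # 4
```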

This is why UTF-8 is better suited for transmitting/sharing Unicode text: you avoid the endian issue, and text in most European or European-derived languages stays readable even if it is mistakenly displayed as ASCII. XML defaults to UTF-8, even very old web browsers handle UTF-8 HTML just fine if you specify the encoding, and ASCII (7-bit) fits "as is" into UTF-8.
My rule of thumb is... If in doubt use UTF-8.
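
A short Python sketch of the two points above, offered as an illustration only: ASCII text is byte-for-byte identical in UTF-8, while UTF-16 produces two different byte sequences depending on byte order. The variable names are arbitrary.

```python
text = "Hello"

# Plain ASCII text encodes to exactly the same bytes in UTF-8.
same_as_ascii = text.encode("utf-8") == text.encode("ascii")
print(same_as_ascii)  # True

# UTF-16 comes in two byte orders, so the receiver must know (or detect) which one.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
print(utf16_le.hex())  # 480065006c006c006f00
print(utf16_be.hex())  # 00480065006c006c006f

# A non-ASCII letter becomes a multi-byte sequence in UTF-8; the ASCII part is untouched.
print("p\u00e5".encode("utf-8"))  # b'p\xc3\xa5'
```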


As to Mac and Linux. I can at least say that I have read that with Linux it varies, it's either UTF-8 or it's UTF-32 depending on the desktop environment you use, Gnome vs something else for example. I'm sure some Linux geek could dig that info up and post it here.

Currently UTF-32 fits every Unicode code point in 4 bytes. The Unicode code space is capped at U+10FFFF (the limit of what UTF-16 can address), so this is not expected to change.

Also note that no normal font exists that has glyphs for all Unicode code points. You may need to compromise on looks to ensure you use a font with wide Unicode coverage. As examples of where you may encounter UTF-8: ID3 v2.4 supports it, and Vorbis comments (used by Ogg, FLAC, Opus) are UTF-8.

Re: Unicode and PureBasic

Posted: Wed Mar 25, 2015 10:57 am
by Danilo
Rescator wrote:So do note that on Windows a Unicode "character" may take 2, 4, 6 or 8 bytes: a single code point is 2 or 4 bytes in UTF-16, and a character built from a base letter plus combining marks takes more.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. parsing two bytes at a time would technically be wrong).
The first 65 thousand or so Unicode code points fit in two bytes; the higher ones need 4 bytes.
- PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).

Re: Unicode and PureBasic

Posted: Thu Mar 26, 2015 7:37 am
by chris319
This thread is supposed to UNconfuse us?

Re: Unicode and PureBasic

Posted: Fri Mar 27, 2015 12:50 pm
by Roger Hågensen
Danilo wrote:PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).
That could be an issue, as Windows will return UTF-16 with surrogate pairs when converting UTF-8 text that contains characters outside the BMP. In that case a single UTF-16 character actually takes up 4 bytes rather than 2.
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated the way one treats a UTF-8 string, only with UTF-16 you have to keep in mind its endianness.
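
As a rough sketch of what "treating UTF-16 like UTF-8" means in practice, the following Python function walks 16-bit code units and combines surrogate pairs into single code points. It assumes little-endian input; the function name `code_points_from_utf16le` is hypothetical and this is not how PureBasic or Windows implement it.

```python
import struct

def code_points_from_utf16le(data):
    """Decode UTF-16LE bytes into a list of code points, combining surrogate pairs."""
    count = len(data) // 2
    # "<" selects little-endian; "H" is an unsigned 16-bit code unit.
    units = struct.unpack("<%dH" % count, data[:count * 2])
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High + low surrogate: one code point above U+FFFF, stored in 4 bytes.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit (an unpaired surrogate is passed through unchanged).
            result.append(unit)
            i += 1
    return result

data = "A\U0001F600".encode("utf-16-le")
print(len(data))                                         # 6
print([hex(cp) for cp in code_points_from_utf16le(data)])  # ['0x41', '0x1f600']
```

Code that simply steps two bytes at a time would see three "characters" in this input instead of two.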

Re: Unicode and PureBasic

Posted: Fri Mar 27, 2015 1:32 pm
by Danilo
Roger Hågensen wrote: So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated [...]
Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.

I can't change that, and it's not my fault. :D

Re: Unicode and PureBasic

Posted: Fri Mar 27, 2015 2:46 pm
by Roger Hågensen
CharNext_() is an interesting WinAPI function. https://msdn.microsoft.com/en-us/librar ... 47469.aspx
This function works with default "user" expectations of characters when dealing with diacritics. For example: a string that contains U+0061 U+030A ("LATIN SMALL LETTER A" + "COMBINING RING ABOVE"), which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304, which looks like "a´^~¯", will advance five code points, not one, and so on.
There is also a CharPrev_()
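
The idea of advancing over a base letter together with its combining marks can be approximated in a few lines of Python. This is only a sketch of the concept, not CharNext's actual algorithm: it treats any code point with a non-zero canonical combining class as part of the preceding character, which misses some cases. The name `next_char_length` is made up.

```python
import unicodedata

def next_char_length(text, index):
    """Return how many code points the character starting at `index` spans:
    one base code point plus any combining marks that directly follow it."""
    if index >= len(text):
        return 0
    end = index + 1
    # unicodedata.combining() returns 0 for non-combining code points.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return end - index

print(next_char_length("a\u030ab", 0))                  # 2
print(next_char_length("a\u0301\u0302\u0303\u0304", 0))  # 5
print(next_char_length("ab", 0))                        # 1
```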


Now, using Windows API calls and dealing with local text is mostly OK. The issue is when you get text from a different locale than the user's (a Spanish guy with an Arabic name, for example): how would a program on an American system display that properly, let alone apply upper and lower case properly?

Then there are Unicode normalization and case folding; full case folding, for instance, treats ß and ss the same to simplify comparisons (like the list order of filenames).
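
Two mechanisms are involved in comparisons like this, and a small Python sketch (illustration only) shows both: case folding, which maps "ß" to "ss", and normalization, which unifies precomposed and decomposed spellings of the same character.

```python
import unicodedata

# Full case folding maps "ß" to "ss", so the two spellings compare equal afterwards.
folded_equal = "Stra\u00dfe".casefold() == "strasse".casefold()
print(folded_equal)  # True

# Normalization unifies precomposed and decomposed forms of the same character.
composed = "\u00e5"      # å as a single code point
decomposed = "a\u030a"   # a + COMBINING RING ABOVE
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```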

Re: Unicode and PureBasic

Posted: Fri Mar 27, 2015 2:50 pm
by Roger Hågensen
Danilo wrote:Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.
I can't change that, and it's not my fault. :D
I know. But it's troublesome as Windows uses UTF-16 and if PB treats it as UCS-2 (like Windows NT 4.0 and older did) then text may be handled wrong.

Now if text is simply stored as UCS-2 in PureBasic but WinAPI functions (on Windows) are used for the string handling, then there are probably no issues, as PureBasic is not doing any UCS-2 text processing at all.

This is similar to how UTF-8 can be stored as if it were an ASCII (8-bit) string; you just can't process it as if it were ASCII, that's all.

Re: Unicode and PureBasic

Posted: Fri Mar 27, 2015 3:25 pm
by Roger Hågensen
Here is an interesting read: http://utf8everywhere.org/

Some of these things are worth considering when dealing with PureBasic and Unicode as well.

Re: Unicode and PureBasic

Posted: Wed Aug 12, 2015 3:56 pm
by Little John
mariosk8s has posted interesting information about Converting from UTF-8 NFD to NFC & vice versa.

Re: Unicode and PureBasic

Posted: Sat Sep 12, 2015 8:56 am
by Little John
For another discussion about UCS-2 vs. UTF-16 see here.
In that thread, freak wrote:PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).
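
What "a surrogate pair counts as two characters" means for string functions can be sketched in Python by working on 16-bit code units explicitly. This only mimics the behaviour freak describes; the helper name `utf16_units` is made up and nothing here is PureBasic code.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

text = "a\U0001F600b"
units = utf16_units(text)

print(len(text))   # 3 code points
print(len(units))  # 4 code units: the length a code-unit based string function reports

# Taking the "left 2 characters" by code units cuts the pair in half
# and leaves a lone high surrogate at the end.
left2 = units[:2]
print([hex(u) for u in left2])  # ['0x61', '0xd83d']
```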

Re: Unicode and PureBasic

Posted: Fri Feb 19, 2016 11:33 am
by Little John

Re: Unicode and PureBasic

Posted: Sun Jul 28, 2019 12:02 pm
by Little John
Regular Expressions

The PCRE library which ships with PureBasic (tested with 5.71 beta 2 on Windows) does not properly support Unicode: For instance, the anchor \b, as well as the shorthand character classes \w and \W do not work as expected. So I made a related feature request.

Until PureBasic comes with a PCRE library that completely supports Unicode, we can use tricks for working around some of the limitations. For examples click on the first link in this message. For more information see this tutorial about Unicode Regular Expressions.

Re: Unicode and PureBasic

Posted: Mon Apr 06, 2020 9:58 pm
by Sooraa
Hi Little John,

Your feature request regarding real Unicode support for \b, \w, \d, \s has led to the integration of PCRE library 8.44 in PB 5.72, but this alone didn't help.

We have to turn on the UCP support of the PCRE compiler in the "CreateRegularExpression" statement by prepending (*UCP) to the regex. For the example "\bglich\b" it is "(*UCP)\bglich\b".

\b, \w, \d, \s work fine with it.
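
The effect of switching Unicode properties on can be illustrated by analogy in Python. Python's `re` module does not accept `(*UCP)`; its `re.ASCII` flag is used here only as a stand-in for a regex engine without Unicode-aware classes. The test word "möglich" is an assumption about where the "glich" example comes from.

```python
import re

word = "m\u00f6glich"  # "möglich"
pattern = r"\bglich\b"

# ASCII-only word characters: "ö" is not a word character, so a word boundary
# appears between "ö" and "g" and the pattern wrongly matches inside the word.
ascii_match = re.search(pattern, word, re.ASCII) is not None
print(ascii_match)    # True

# Unicode-aware word characters (Python's default for str patterns):
# "ö" is a word character, so there is no boundary and no match.
unicode_match = re.search(pattern, word) is not None
print(unicode_match)  # False
```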

Re: Unicode and PureBasic

Posted: Tue Apr 07, 2020 6:53 am
by Little John
Cool, thank you!

Re: Unicode and PureBasic

Posted: Fri Feb 02, 2024 2:54 pm
by Little John
idle wrote a UTF-16 module.