PureBasic Forum
http://forums.purebasic.com/english/

Unicode and PureBasic
http://forums.purebasic.com/english/viewtopic.php?f=7&t=61789
Page 2 of 2

Author:  Rescator [ Wed Mar 25, 2015 4:40 am ]
Post subject:  Re: Unicode and PureBasic

From wikipedia: http://en.wikipedia.org/wiki/UTF-16

Quote:
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.


Quote:
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.


So do note that on Windows a Unicode "character" may take 2, 4, 6 or 8 bytes: a single code point is 2 or 4 bytes in UTF-16, and a character built from a base letter plus combining marks takes more.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. parsing two bytes at a time would technically be wrong).
Roughly the first 65 thousand Unicode code points (the Basic Multilingual Plane, BMP) fit in two bytes; the higher ones need 4 bytes.
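
A minimal sketch of those sizes, written in Python only because its built-in codecs make the byte counts easy to check (the helper name utf16_bytes is made up for this illustration):

```python
def utf16_bytes(text: str) -> int:
    """Return how many bytes `text` occupies as UTF-16 little-endian, without a BOM."""
    return len(text.encode("utf-16-le"))

bmp_char = "\u00e5"         # å, a single code point inside the BMP
astral_char = "\U0001F600"  # a code point above U+FFFF, stored as a surrogate pair
combined = "a\u030a"        # 'a' + COMBINING RING ABOVE: two code points, one visible letter

print(utf16_bytes(bmp_char))     # 2
print(utf16_bytes(astral_char))  # 4
print(utf16_bytes(combined))     # 4
```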

This is why UTF-8 is better suited for transmitting/sharing Unicode text: you avoid the endianness issue, and text in most European (or European-derived) languages stays largely readable even if mistakenly displayed as ASCII. XML defaults to UTF-8, even very old web browsers handle UTF-8 HTML just fine if you specify the encoding, and 7-bit ASCII fits "as is" into UTF-8.
My rule of thumb is: if in doubt, use UTF-8.
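
A small sketch of the two properties mentioned above (ASCII passes through unchanged, and there is no byte-order variant), using Python's built-in codecs purely as a neutral way to look at the bytes:

```python
ascii_text = "PureBasic"
accented = "\u00e5"  # å

# Pure 7-bit ASCII text is byte-for-byte identical in UTF-8.
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# UTF-16 has two byte orders, so the same character yields two byte sequences.
utf16_le = accented.encode("utf-16-le")  # b'\xe5\x00'
utf16_be = accented.encode("utf-16-be")  # b'\x00\xe5'

# UTF-8 has exactly one byte sequence for it.
utf8 = accented.encode("utf-8")          # b'\xc3\xa5'

print(same_as_ascii, utf16_le, utf16_be, utf8)
```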


As to Mac and Linux: I can at least say that I have read that on Linux it varies; it's either UTF-8 or UTF-32 depending on the desktop environment you use (Gnome vs. something else, for example). I'm sure some Linux geek could dig that info up and post it here.

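Currently UTF-32 fits every Unicode code point in 4 bytes, and since the code space is capped at U+10FFFF that is not expected to change. A user-perceived character built from several code points (a base letter plus combining marks) still takes 8 bytes or more in UTF-32, though.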

Also note that no normal font has glyphs for all Unicode code points. You may need to compromise on looks to make sure you use a font with wide Unicode coverage. As an example of where you may encounter UTF-8: ID3 v2.4 supports it, and Vorbis comments (used by Ogg, FLAC, Opus) are UTF-8.

Author:  Danilo [ Wed Mar 25, 2015 10:57 am ]
Post subject:  Re: Unicode and PureBasic

Rescator wrote:
So do note that on Windows a Unicode "character" may take 2, 4, 6 or 8 bytes: a single code point is 2 or 4 bytes in UTF-16, and a character built from a base letter plus combining marks takes more.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. parsing two bytes at a time would technically be wrong).
Roughly the first 65 thousand Unicode code points (the Basic Multilingual Plane, BMP) fit in two bytes; the higher ones need 4 bytes.

- PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:
For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).

Author:  chris319 [ Thu Mar 26, 2015 7:37 am ]
Post subject:  Re: Unicode and PureBasic

This thread is supposed to UNconfuse us?

Author:  Roger Hågensen [ Fri Mar 27, 2015 12:50 pm ]
Post subject:  Re: Unicode and PureBasic

Danilo wrote:
PureBASIC internal encoding of unicode, UCS-2 or UTF-16?
Fred wrote:
For the record, PB uses UCS2 string encoding internally when unicode mode is ON (it doesn't support multibyte UTF16 encoding).


That could be an issue: when Windows converts UTF-8 text that contains code points outside the BMP, it returns UTF-16 with surrogate pairs. In that case a single UTF-16 character actually takes up 4 bytes rather than 2.
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated the way one treats a UTF-8 string, only with UTF-16 you have to keep in mind its endianness.
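
To make the "not always 2 bytes" point concrete, here is a rough sketch of walking raw UTF-16 bytes and joining surrogate pairs into code points. It is written in Python for illustration and is not how PureBasic or Windows do it internally; the function name iter_code_points is hypothetical, and the input length is assumed to be a multiple of 2.

```python
import struct

def iter_code_points(data: bytes, little_endian: bool = True):
    """Yield code points from raw UTF-16 bytes, joining surrogate pairs."""
    fmt = "<H" if little_endian else ">H"  # the byte order must be known up front
    units = [unit for (unit,) in struct.iter_unpack(fmt, data)]
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point in 4 bytes.
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            # A BMP code point (an unpaired surrogate is passed through unchanged).
            yield unit
            i += 1

raw = "a\U0001F600".encode("utf-16-le")  # 2 bytes + 4 bytes
print([hex(cp) for cp in iter_code_points(raw)])  # ['0x61', '0x1f600']
```

Code that steps through the same bytes two at a time would instead see three separate 16-bit values.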

Author:  Danilo [ Fri Mar 27, 2015 1:32 pm ]
Post subject:  Re: Unicode and PureBasic

Roger Hågensen wrote:
So treating UTF-16 (as I mentioned in the post above) as always 2 bytes (or 16 bits) is a possible problem. A UTF-16 string should be treated [...]

Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.

I can't change that, and it's not my fault. :D

Author:  Roger Hågensen [ Fri Mar 27, 2015 2:46 pm ]
Post subject:  Re: Unicode and PureBasic

CharNext_() is an interesting WinAPI function. https://msdn.microsoft.com/en-us/librar ... 47469.aspx
Quote:
This function works with default "user" expectations of characters when dealing with diacritics. For example: A string that contains U+0061 U+030a "LATIN SMALL LETTER A" + COMBINING RING ABOVE" — which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304 — which looks like "a´^~¯", will advance five code points, not one, and so on.


There is also a CharPrev_()


Now, using Windows API calls and dealing with local text is mostly OK. The issue is when you get text from a different locale than the user's (a Spanish guy with an Arabic name, for example): how would a program on an American system display that properly, let alone apply upper and lower case properly?

Then there is Unicode normalization, along with case folding and collation rules that treat ß and ss the same to simplify comparisons (like ordering a list of filenames, for example).
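
As a hedged illustration using Python's standard library (not the Windows API): the normalization forms make composed and decomposed spellings of "å" compare equal, while the ß/ss equivalence comes from case folding rather than from normalization itself.

```python
import unicodedata

decomposed = "a\u030a"  # 'a' + COMBINING RING ABOVE
composed = "\u00e5"     # precomposed å

# Without normalization the two spellings are different strings.
raw_equal = decomposed == composed

# NFC composes, NFD decomposes; after either one the spellings match.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Normalization leaves ß alone; full case folding maps it to "ss".
nfkc_eszett = unicodedata.normalize("NFKC", "\u00df")
folded_eszett = "\u00df".casefold()

print(raw_equal, nfc_equal, nfd_equal, nfkc_eszett, folded_eszett)
```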

Author:  Roger Hågensen [ Fri Mar 27, 2015 2:50 pm ]
Post subject:  Re: Unicode and PureBasic

Danilo wrote:
Fred said PB does not support UTF16. It supports only UCS2, which is always 2 bytes for each character.
I can't change that, and it's not my fault. :D

I know. But it's troublesome as Windows uses UTF-16 and if PB treats it as UCS-2 (like Windows NT 4.0 and older did) then text may be handled wrong.

Now if text is simply stored as UCS-2 in PureBasic but WinAPI functions (on Windows) are used for the string handling, then there are probably no issues, as PureBasic is not doing any UCS-2 text processing at all.

This is similar to how UTF-8 can be stored as if it were an ASCII (8-bit) string; you just can't process it as if it were ASCII, that's all.
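
A small Python sketch of that storage-versus-processing distinction, with a made-up sample word: the bytes can be stored and copied freely, but counting or cutting them as if each byte were one character goes wrong.

```python
word = "m\u00f6glich"      # "möglich", 7 characters
raw = word.encode("utf-8")  # stored as plain bytes, like an 8-bit string

char_count = len(word)  # 7
byte_count = len(raw)   # 8, because ö takes two bytes in UTF-8

# Cutting after 2 bytes splits ö in the middle, so the result is no longer valid UTF-8.
truncated = raw[:2]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```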

Author:  Roger Hågensen [ Fri Mar 27, 2015 3:25 pm ]
Post subject:  Re: Unicode and PureBasic

Here is an interesting read: http://utf8everywhere.org/

Some of these things are worth considering when dealing with PureBasic and Unicode as well.

Author:  Little John [ Wed Aug 12, 2015 3:56 pm ]
Post subject:  Re: Unicode and PureBasic

mariosk8s has posted interesting information about Converting from UTF-8 NFD to NFC & vice versa.

Author:  Little John [ Sat Sep 12, 2015 8:56 am ]
Post subject:  Re: Unicode and PureBasic

For another discussion about UCS-2 vs. UTF-16 see here.

In that thread, freak wrote:
PB supports UTF-16 in the same way that Java does: A surrogate pair simply counts as two characters in string functions. In my experience, this is close enough for almost all cases because situations in which they need to be treated as a single character are pretty rare (the use of code-points outside of the BMP is pretty exotic in itself).
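
A quick sketch of what "a surrogate pair counts as two characters" means in numbers. Python is used only for illustration (its own len() counts code points), and utf16_length is a hypothetical helper that mimics counting 16-bit units:

```python
def utf16_length(text: str) -> int:
    """Count 16-bit code units, the way a UTF-16 string function would count 'characters'."""
    return len(text.encode("utf-16-le")) // 2

sample = "a\U0001F600"  # one BMP character plus one character outside the BMP

code_points = len(sample)          # 2
code_units = utf16_length(sample)  # 3: the surrogate pair counts twice

print(code_points, code_units)
```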

Author:  Little John [ Fri Feb 19, 2016 11:33 am ]
Post subject:  Re: Unicode and PureBasic

Demivec wrote a module for Detecting Text File Encoding without BOM,
and he also implemented Revised Chr() & Asc() for UTF-16 surrogate pairs.
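
For readers who want the arithmetic behind surrogate pairs, here is a minimal sketch in Python. It is not Demivec's code, and the function name to_utf16_units is made up for this example; it only shows the standard split of a code point above U+FFFF into a high and a low surrogate.

```python
from typing import List

def to_utf16_units(code_point: int) -> List[int]:
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid code points")
    if code_point < 0x10000:
        return [code_point]  # BMP: a single 16-bit unit
    offset = code_point - 0x10000  # 20 bits left to distribute
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]

print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```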

Author:  Little John [ Sun Jul 28, 2019 12:02 pm ]
Post subject:  Re: Unicode and PureBasic

Regular Expressions

The PCRE library which ships with PureBasic (tested with 5.71 beta 2 on Windows) does not properly support Unicode: For instance, the anchor \b, as well as the shorthand character classes \w and \W do not work as expected. So I made a related feature request.

Until PureBasic comes with a PCRE library that completely supports Unicode, we can use tricks to work around some of the limitations. For examples, click on the first link in this message. For more information, see this tutorial about Unicode Regular Expressions.

Author:  Sooraa [ Mon Apr 06, 2020 9:58 pm ]
Post subject:  Re: Unicode and PureBasic

Hi Little John,

your feature request regarding real Unicode support for \b, \w, \d, \s has led to the integration of PCRE library 8.44 in PB 5.72.
But that alone didn't help.

We have to turn on the UCP support of the PCRE compiler in the "CreateRegularExpression" statement by prepending (*UCP) to the regex. For the example "\bglich\b" it becomes "(*UCP)\bglich\b".

\b, \w, \d, \s work fine with it.
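
The effect can be reproduced outside PureBasic. The sketch below is only an analogy using Python's re module, not PCRE: re does not use the (*UCP) prefix, but its re.ASCII flag gives ASCII-only \w and \b, while its default mode for text patterns is Unicode-aware. The test word "möglich" is an assumption chosen to fit the "\bglich\b" example.

```python
import re

pattern = r"\bglich\b"
word = "m\u00f6glich"  # "möglich"

# ASCII-only word characters: ö is not a word character, so a boundary
# appears between ö and g, and "glich" is wrongly found inside the word.
ascii_match = re.search(pattern, word, re.ASCII)

# Unicode-aware word characters: ö counts as a letter, so there is no
# boundary inside the word and nothing matches.
unicode_match = re.search(pattern, word)

print(ascii_match is not None, unicode_match is not None)  # True False
```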

Author:  Little John [ Tue Apr 07, 2020 6:53 am ]
Post subject:  Re: Unicode and PureBasic

Cool, thank you!

All times are UTC + 1 hour