Re: Unicode and PureBasic
Posted: Wed Mar 25, 2015 4:40 am
From Wikipedia: http://en.wikipedia.org/wiki/UTF-16
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that a fixed-width 2-byte encoding could not encode enough characters to be truly universal.
So do note that on Windows a single Unicode code point takes 2 or 4 bytes in UTF-16, and what the user sees as one "character" may take 6, 8 or more bytes when it is built from several code points (a base letter plus combining marks, for example).
UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.
Never assume a Unicode character is 2 bytes when doing string processing (i.e. parsing two bytes at a time would technically be wrong).
The first 65,536 code points (U+0000 to U+FFFF, the Basic Multilingual Plane) fit in two bytes in UTF-16; the higher code points need 4 bytes (a surrogate pair).
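To make the two-versus-four-byte point concrete, here is a rough sketch in Python (not PureBasic, just because it is easy to run anywhere). The function name utf16_code_points is made up for this post, and it assumes little-endian UTF-16 without a BOM, which is what Windows uses in memory.

```python
def utf16_code_points(data: bytes) -> list:
    """Walk little-endian UTF-16 bytes and return the code points (hand-rolled sketch)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    code_points = []
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "little")
        i += 2
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next 16-bit unit must be a low surrogate.
            if i + 2 > len(data):
                raise ValueError("truncated surrogate pair")
            low = int.from_bytes(data[i:i + 2], "little")
            i += 2
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate not followed by a low surrogate")
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unexpected low surrogate")
        code_points.append(unit)
    return code_points


text = "A\u20ac\U0001F600"  # 'A', the euro sign, and an emoji above U+FFFF
encoded = text.encode("utf-16-le")

print(len(encoded))                                    # 8 bytes: 2 + 2 + 4
print([hex(cp) for cp in utf16_code_points(encoded)])  # ['0x41', '0x20ac', '0x1f600']
```

Stepping through the same bytes strictly two at a time would report four "characters" here instead of three, because the emoji is stored as two 16-bit units.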
This is why UTF-8 is better suited for transmitting and sharing Unicode text: you avoid the endian issue, and text in most European (or European-derived) languages stays largely readable even if it is mistakenly displayed as ASCII or a legacy 8-bit code page. XML defaults to UTF-8, even very old web browsers handle UTF-8 HTML just fine if you specify the encoding, and 7-bit ASCII fits "as is" into UTF-8.
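A small Python sketch of those three properties (ASCII passes through unchanged, no byte-order variants, and misread text stays mostly legible). Latin-1 is used here only as a stand-in for "some legacy single-byte encoding"; that choice is my assumption, not something specific to any viewer.

```python
ascii_text = "Hello, world"
# Pure 7-bit ASCII produces identical bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

word = "caf\u00e9"  # "café"
utf8 = word.encode("utf-8")
utf16_le = word.encode("utf-16-le")
utf16_be = word.encode("utf-16-be")

print(utf8.hex(" "))      # 63 61 66 c3 a9
print(utf16_le.hex(" "))  # 63 00 61 00 66 00 e9 00
print(utf16_be.hex(" "))  # 00 63 00 61 00 66 00 e9

# Reading the UTF-8 bytes with the wrong single-byte encoding garbles only
# the accented letter; the ASCII letters survive.
mojibake = utf8.decode("latin-1")
print(mojibake)  # cafÃ©
```

UTF-8 has only one byte order, while the two UTF-16 outputs differ, which is the endian issue you need a BOM or an out-of-band agreement to resolve.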
My rule of thumb is... If in doubt use UTF-8.
As for Mac and Linux: I can at least say that I have read that on Linux it varies. Most things use UTF-8, while the C wchar_t type is 4 bytes (UTF-32) with glibc, and which one you end up dealing with depends on the toolkit or desktop environment you use (GTK/Gnome vs. something else, for example). I'm sure some Linux geek could dig up the details and post them here.
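One part of this that can be checked directly is how wide the C wide-character type is on a given system. A minimal Python sketch using the standard ctypes module; it only reports the wchar_t size of the platform Python was built for and says nothing about what a particular desktop toolkit uses internally.

```python
import ctypes
import sys

# Size in bytes of the platform's C wchar_t type.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# 2 bytes suggests UTF-16 code units (typical on Windows),
# 4 bytes suggests UTF-32 (typical on Linux and macOS).
print(sys.platform, wchar_size)
```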
UTF-32 fits every Unicode code point in exactly 4 bytes. The Unicode code space is capped at U+10FFFF (the highest value UTF-16 can represent), so a UTF-32 code unit is not expected to ever grow to 8 bytes.
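For comparison, a quick Python sketch that prints the encoded size of a few code points in UTF-32, UTF-16 and UTF-8. The sample characters are arbitrary picks of mine; the last one is U+10FFFF, the highest code point in the Unicode code space as it stands.

```python
import sys

samples = ["A", "\u20ac", "\U0001F600", "\U0010FFFF"]

for ch in samples:
    sizes = (
        len(ch.encode("utf-32-le")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-8")),
    )
    # Columns: code point, then bytes in UTF-32 / UTF-16 / UTF-8.
    print(f"U+{ord(ch):06X}", *sizes)

# Python itself refuses code points above this value.
print(hex(sys.maxunicode))  # 0x10ffff
```

The UTF-32 column is 4 for every row, while UTF-16 gives 2 or 4 and UTF-8 gives 1 to 4.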
Also note that no normal font has glyphs for all Unicode code points. You may need to compromise on looks to get a font with wide Unicode coverage. As examples of where you may encounter UTF-8: ID3 v2.4 supports it, and Vorbis comments (used by Ogg, FLAC, Opus) are UTF-8.