Sticky post needed that discusses Unicode vs Ascii

PB Fanatic
User
Posts: 49
Joined: Wed Dec 17, 2014 11:54 am

Post by PB Fanatic »

As a user who is migrating my apps from Ascii to Unicode, I've found I'm running into string problems. An example is the MD5FingerPrint() command, which returns different results depending on whether the compiler is set to Ascii or Unicode.

I would like to see a sticky post on these forums that explains what to look out for when migrating our code, so we don't inadvertently introduce bugs when migrating. In my case, I was getting wrong MD5 results without realizing, until I happened to compare my MD5 hash with a website. :(
bbanelli
Enthusiast
Posts: 543
Joined: Tue May 28, 2013 10:51 pm
Location: Europe

Re: Sticky post needed that discusses Unicode vs Ascii

Post by bbanelli »

PB Fanatic wrote:As a user who is migrating my apps from Ascii to Unicode, I've found I'm running into string problems. An example is the MD5FingerPrint() command, which returns different results depending on whether the compiler is set to Ascii or Unicode.
Maybe this thread will help (including the discovered bug) -> http://www.purebasic.fr/english/viewtop ... 13&t=61052
"If you lie to the compiler, it will get its revenge."
Henry Spencer
https://www.pci-z.com/
netmaestro
PureBasic Bullfrog
Posts: 8433
Joined: Wed Jul 06, 2005 5:42 am
Location: Fort Nelson, BC, Canada

Re: Sticky post needed that discusses Unicode vs Ascii

Post by netmaestro »

A rose, by any other name...
BERESHEIT
Rescator
Addict
Posts: 1769
Joined: Sat Feb 19, 2005 5:05 pm
Location: Norway

Re: Sticky post needed that discusses Unicode vs Ascii

Post by Rescator »

This is not a migration issue.
The hash functions have always operated on memory; none of them operate on strings, they are binary only.
The fact that MD5FingerPrint returns a hex string might make people assume it's for getting the hash of a string, but it's not.
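
To illustrate the point about hashing memory rather than strings, here is a minimal sketch in Python (not PureBasic; Python's `hashlib` just stands in for a binary hash function). It assumes an Ascii build stores one byte per character and a Unicode build stores two bytes per character in UTF-16 little endian. The helper name `md5_hex` is made up for this example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a raw byte buffer as a hex string."""
    return hashlib.md5(data).hexdigest()

text = "abc"

# One byte per character, like the buffer of an Ascii-compiled program.
ascii_bytes = text.encode("ascii")
# Two bytes per character (assumed UTF-16 little endian), like a Unicode build.
utf16_bytes = text.encode("utf-16-le")

ascii_hash = md5_hex(ascii_bytes)
utf16_hash = md5_hex(utf16_bytes)

print(len(ascii_bytes), "bytes:", ascii_hash)
print(len(utf16_bytes), "bytes:", utf16_hash)
print("same hash:", ascii_hash == utf16_hash)  # False: different bytes in memory
```

The text is identical in both cases; only the bytes handed to the hash function differ, and that is all the hash ever sees.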

This "issue" dates back to Windows 2000 (possibly earlier), when W2K (the Windows NT 5.0 kernel and later) went full Unicode, so all Ascii string handling (when using the A version of OS APIs) imposes a conversion overhead: if you used a string function, the string would be converted from whatever Windows ASCII-8 codepage it is in to Unicode, then back to the codepage again. I first noticed this issue when PureBasic began supporting Unicode (this is way back) and I was using hashing etc.

Also note that ASCII on one user's system need not match that of another system. This means that a hash on your system may not match the hash of the same string on my system, due to differences in the Windows code pages. So even with an Ascii-compiled program you should still convert the strings to UTF-8 before hashing them.
Only ASCII-7 (the first 128 characters, codes 0 to 127) is the same on all Windows systems; the remaining characters, 128 to 255, vary depending on the Windows codepage/language.
ASCII-7 is also sometimes known as US-ASCII. UTF-8 is ASCII-7 compatible, which means that the first 128 characters of UTF-8 and ASCII-7 are the same. Any extra characters like umlauts or Æ are in the ASCII-8 range and, depending on the code page, may or may not have the same character value; when they don't, the hashes will mismatch across systems. (Unicode and UTF-8 are partly designed as a solution to this problem, BTW! Work on Unicode began in 1987, so it's not exactly "new" either.)
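
A small Python sketch of the code page problem described above. The two code pages used here (cp1252 and cp437) are just examples picked for illustration; any pair that places Æ at different byte values would show the same thing.

```python
import hashlib

char = "Æ"

# The same character stored under two legacy code pages and under UTF-8.
cp1252_bytes = char.encode("cp1252")  # Windows Western European
cp437_bytes = char.encode("cp437")    # original IBM PC / DOS code page
utf8_bytes = char.encode("utf-8")

for name, raw in [("cp1252", cp1252_bytes),
                  ("cp437", cp437_bytes),
                  ("utf-8", utf8_bytes)]:
    # Different bytes in, different MD5 out.
    print(name, raw.hex(), hashlib.md5(raw).hexdigest())

# Pure 7-bit text has the same bytes in all three encodings.
plain = "hello"
same_everywhere = (plain.encode("cp1252")
                   == plain.encode("cp437")
                   == plain.encode("utf-8"))
print("7-bit text identical:", same_everywhere)
```

So a hash of 7-bit text is portable by accident, while anything in the 128 to 255 range only matches if both sides agree on the encoding first.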

The issue with getting a hash from a string is that you have a few choices: do you hash an ASCII-7 string, an ASCII-8 string, a UTF-8 string, or a Unicode UTF-16 string in big endian or little endian byte order? Luckily the majority of systems are little endian now, so that's less of an issue (AFAIK).
My advice is to take any ASCII-7 or ASCII-8 (aka Latin-1 or ISO 8859-1 or the multitude of ISO variants) or Unicode string, turn it into a UTF-8 string and then hash that. The likelihood that it will match HTML/JavaScript hashes is high (there are UTF-8 variants for that out there).
Also, UTF-8 uses 8-bit units and terminates with a binary 0 just like ASCII-8 (what most people here tend to call just ASCII); it's just a byte stream, with no endian issues.
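
The endian point can be shown with a few lines of Python (again only as an illustration, not PureBasic code): the same text has two different UTF-16 byte layouts, but only one UTF-8 layout.

```python
import hashlib

text = "abc"

utf16_le = text.encode("utf-16-le")  # low byte first
utf16_be = text.encode("utf-16-be")  # high byte first
utf8 = text.encode("utf-8")          # single bytes, no byte order to choose

print("UTF-16 LE:", utf16_le.hex(), hashlib.md5(utf16_le).hexdigest())
print("UTF-16 BE:", utf16_be.hex(), hashlib.md5(utf16_be).hexdigest())
print("UTF-8:    ", utf8.hex(), hashlib.md5(utf8).hexdigest())

# The two UTF-16 layouts hash differently even though the text is identical.
utf16_hashes_match = (hashlib.md5(utf16_le).digest()
                      == hashlib.md5(utf16_be).digest())
print("UTF-16 LE/BE hashes match:", utf16_hashes_match)
```

Hashing the UTF-8 bytes sidesteps the question of byte order entirely, which is one reason it makes a good common format for hashing.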

I'm hoping future PureBasic releases will have more/improved UTF-8 support.
This may include, for example, string hashing functions that turn a Unicode string into UTF-8 and then hash the binary of that, an MD5FingerPrintString() for example.
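
As a rough idea of what such a function could do, here is a sketch in Python. It is not PureBasic and not an existing API: `md5_fingerprint_string` and `md5_fingerprint_legacy` are hypothetical names invented for this example, and the code pages in the usage part are assumptions chosen for illustration.

```python
import hashlib

def md5_fingerprint_string(text: str) -> str:
    """Hypothetical helper: convert the string to UTF-8, then hash those bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def md5_fingerprint_legacy(raw: bytes, codepage: str) -> str:
    """Hypothetical helper for code page bytes: decode first, then hash as UTF-8."""
    return md5_fingerprint_string(raw.decode(codepage))

# The same character as stored on two differently configured systems.
from_cp1252 = md5_fingerprint_legacy(b"\xc6", "cp1252")
from_cp437 = md5_fingerprint_legacy(b"\x92", "cp437")

print(from_cp1252)
print(from_cp437)
print("hashes agree:", from_cp1252 == from_cp437 == md5_fingerprint_string("Æ"))
```

Because both inputs are normalized to UTF-8 before hashing, the results agree even though the original bytes were different.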

The way some people are reacting makes it seem as if Fred and Co broke PureBasic, when in fact Fred is just making sure PureBasic is keeping up with the times.
Also note that handling Ascii-8 strings will still be possible as there will be string functions added to convert from/to Ascii-8 and Unicode.