Unicode and PureBasic

Little John
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Unicode and PureBasic

Post by Little John »

Recently there have been repeated questions about this topic, and certainly more questions will come in the future, at the latest when support for ASCII compilation ends.
The purpose of this thread is collecting information about Unicode and PureBasic for reference.
Please post questions and discussions in separate threads. Thank you!


If you are new to Unicode, read this first:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
Also read the
Unicode section in the PureBasic Reference Manual

I'll start with posting a solution for a problem that has caused some confusion in the past.
What I'm writing here is essentially a summary of this thread.

Displaying Unicode characters
  1. In the IDE, choose
    Compiler > Compiler Options... > [v] Create unicode executable
  2. For your gadgets (or wherever you want to show the text), choose a font that actually contains the glyphs for the characters you want to display.
  • If your code uses Chr() with Unicode code points not given as variables but as constants, e.g.

    Code: Select all

    MessageRequester("Unicode test", Chr($3042))
    then in the IDE you have to set
    File > File format > Encoding:UTF-8
    ( This is the best source file format anyway for Unicode source files and for ASCII source files.
    For how to convert existing "plain text" source files to UTF-8, see Josh's tip below. )
  • Displaying Unicode characters with Debug only works with PB 5.30+.
    Also when using Debug you must choose an appropriate font, as mentioned above.
    This is done in the IDE via
    File > Preferences... > Debugger > Individual settings > [v] Use a custom font
  • Some information about displaying Unicode characters on the console is in this thread.
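
The two steps above can be sketched as follows. This is only a minimal example, assuming the program is compiled as a Unicode executable and the source file is saved as UTF-8. The font name "Arial Unicode MS" is an assumption; substitute any installed font that contains the glyphs you need.

```purebasic
; Minimal sketch: show a Unicode character in a gadget.
; Assumption: "Arial Unicode MS" is installed; any font with the glyph will do.
If OpenWindow(0, 0, 0, 300, 80, "Unicode test", #PB_Window_SystemMenu | #PB_Window_ScreenCentered)
  TextGadget(0, 10, 10, 280, 40, "Hiragana A: " + Chr($3042))

  ; If the font cannot be loaded, the gadget keeps its default font
  If LoadFont(0, "Arial Unicode MS", 16)
    SetGadgetFont(0, FontID(0))
  EndIf

  Repeat
  Until WaitWindowEvent() = #PB_Event_CloseWindow
EndIf
```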
Last edited by Little John on Wed Aug 12, 2015 6:02 pm, edited 5 times in total.

Re: Unicode in PureBasic

Post by Little John »

The text in this message is partly based on some information from Freak on the German forum.
Any mistakes are very probably introduced by me. :-)

Tips for converting ASCII programs to Unicode
  • INI files:
    When using PB's Preference library, there will be no problem. In Unicode mode, the functions of this library read and write files in UTF-8 format. "Old" preference files in ASCII format can be read by PB Unicode programs without a problem.
  • Reading and writing other files:
    In Unicode mode, ReadString() and WriteString() use UTF-8 format by default, too. There will also be no problem reading "old" files in ASCII format that don't contain any special characters.

    Since text files may start with a BOM, when reading unknown text files it's best always to use ReadStringFormat() directly after ReadFile(), and then use the retrieved format as flag for ReadString():

    Code: Select all

    If ReadFile(0, "test.txt")
       format = ReadStringFormat(0)
       While Not Eof(0)
          Debug ReadString(0, format)
       Wend
       CloseFile(0)
    EndIf
    But writing a BOM to a file is not always a good idea.
    For details, see Unicode FAQ, How I should deal with BOMs?
  • Cipher library:
    Functions in the Cipher library such as Base64Encoder() and MD5Fingerprint() are not string functions, but operate on memory areas. That's why the functions work exactly the same way in ASCII mode and in Unicode mode. However, they are often used with a string address, e.g.

    Code: Select all

    Hash$ = MD5Fingerprint(@Password$, Len(Password$))
    In Unicode mode, the resulting Hash$ will be different, because in this mode Password$ uses two bytes per character in memory, in contrast to one byte per character in ASCII mode. For getting the same result as in ASCII mode, the string has to be converted to ASCII first, see e.g. MD5FingerPrint in Unicode.
  • When your PB Unicode program calls a function in a DLL that expects an ASCII string as input parameter, then the string from your Unicode program must be converted to ASCII. When you use Pseudotypes, then PureBasic will do this automatically for you!
    When the output of a DLL function is an ASCII string, there is no automatic conversion. You'll have to reserve some memory (e.g. with AllocateMemory()), call the DLL function, convert the result with

    Code: Select all

    s$ = PeekS(..., ..., #PB_Ascii)
    and release the memory.
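
The last point (reserve memory, read the ASCII result, release the memory) might look like the sketch below. No real DLL is called here: PokeS() with #PB_Ascii only stands in for a DLL function that writes an ASCII string into the buffer, and the buffer size of 64 bytes is an arbitrary assumption.

```purebasic
; Sketch: convert an ASCII string in memory to a PB string.
; Assumption: 64 bytes are enough for the (simulated) DLL output.
Define *buffer = AllocateMemory(64)
If *buffer
  ; Stand-in for the DLL call: writes "Hello" plus a trailing zero byte as ASCII
  PokeS(*buffer, "Hello", -1, #PB_Ascii)

  ; Works the same in ASCII mode and in Unicode mode
  Define s$ = PeekS(*buffer, -1, #PB_Ascii)
  Debug s$

  FreeMemory(*buffer)
EndIf
```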
Last edited by Little John on Wed Mar 11, 2015 8:25 am, edited 4 times in total.
Josh
Addict
Posts: 1183
Joined: Sat Feb 13, 2010 3:45 pm

Re: Unicode in PureBasic

Post by Josh »

Another problem arises when you change the file format of your code (File > File format) from plain text to UTF-8: characters like ä, ö, ü will be displayed incorrectly.

To avoid this, copy your code to the clipboard, change the file format to UTF-8, and paste the code again.
sorry for my bad english

Re: Unicode in PureBasic

Post by Little John »

<edit 2016-01-01>
Josh wrote: Another problem arises when you change the file format of your code (File > File format) from plain text to UTF-8: characters like ä, ö, ü will be displayed incorrectly.

To avoid this, copy your code to the clipboard, change the file format to UTF-8, and paste the code again.
That was a valuable tip, which still works.
However, the corresponding bug in the editor is fixed in PB 5.41 final.
</edit 2016-01-01>


Another point to consider results from the fact that in ASCII mode one character of a string takes 1 byte in memory, while in Unicode mode one character of a string internally takes 2 bytes in memory.

In ASCII mode, it doesn't matter whether we think of the number of characters or the number of bytes. In Unicode mode, it makes a difference:

Code: Select all

s$ = "Hello"
Debug Len(s$)               ; -> 5 in both modes
Debug StringByteLength(s$)  ; -> 5 in ASCII mode, and 10 in Unicode mode

Code: Select all

foo.s{5} = "12345678"
Debug Len(foo)               ; -> 5 in both modes
Debug StringByteLength(foo)  ; -> 5 in ASCII mode, and 10 in Unicode mode

AllocateMemory()

When we want to poke a string into memory, then how much memory is to be reserved?
In ASCII mode, we are used to doing this:

Code: Select all

*buffer = AllocateMemory(Len(s$) + 1)
+ 1 is needed for the trailing zero.

In Unicode mode, we can do it like this:

Code: Select all

*buffer = AllocateMemory(2*Len(s$) + 2)
... since the argument of AllocateMemory() is not the number of characters of the string but its number of bytes.
Also note that the trailing zero takes 2 bytes in Unicode mode.

The best bet is to write code that works in ASCII mode and in Unicode mode:

Code: Select all

*buffer = AllocateMemory(StringByteLength(s$) + SizeOf(Character))
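
As a complete round trip, the mode-independent allocation can be sketched like this. The UTF-8 variant at the end is an addition for illustration, not something discussed above; it relies on the optional format parameter of StringByteLength() and on UTF-8 using a single zero byte as terminator.

```purebasic
; Sketch: poke a string into memory and read it back, in either compile mode.
Define s$ = "Hello"

Define *buffer = AllocateMemory(StringByteLength(s$) + SizeOf(Character))
If *buffer
  PokeS(*buffer, s$)         ; writes the string plus its trailing zero
  Debug PeekS(*buffer)
  Debug MemorySize(*buffer)  ; 6 in ASCII mode, 12 in Unicode mode
  FreeMemory(*buffer)
EndIf

; Variant: store the string as UTF-8, whatever the compile mode is
Define *utf8 = AllocateMemory(StringByteLength(s$, #PB_UTF8) + 1)
If *utf8
  PokeS(*utf8, s$, -1, #PB_UTF8)
  Debug PeekS(*utf8, -1, #PB_UTF8)
  FreeMemory(*utf8)
EndIf
```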
Last edited by Little John on Fri Jan 01, 2016 9:52 pm, edited 2 times in total.
Fred
Administrator
Posts: 16618
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: Unicode in PureBasic

Post by Fred »

Nice tips, I made it sticky. Feel free to edit your first post to add more info, so that readers don't have to dig through the thread.

Re: Unicode in PureBasic

Post by Little John »

Thank you. :-)

Re: Unicode in PureBasic

Post by Little John »

Unfortunately, I don't know much about Linux and Mac API functions. Hopefully someone else will post some information about them here.
In the following, I'll only write about Windows.

Windows API
Many Windows API functions can be used very comfortably with PB, as if they were PB functions. We just have to add a trailing underscore to the original function name, e.g.

Code: Select all

MessageBox_(0, "Hi there!", "Title", #MB_ICONINFORMATION)
We don't have to care here whether or not our program is compiled in Unicode mode. PB will handle that for us automatically.

But some Windows API functions cannot be accessed in PB with the "underscore trick".
If we want to use those functions, then we have to do a little more work ourselves. And then we have to take care of ASCII and Unicode. Windows API functions that deal with characters or strings exist in two versions:
  • <Name>A
  • <Name>W
and we have to choose the proper one in our program. The following code shows an example of how to do so.

Code: Select all

; tested with PB 5.31 in ASCII and Unicode mode;
; slightly modified after code by ts-soft
; in RSBasic's WinAPI library
; - online  : http://www.rsbasic.de/winapi-library/
; - download: http://www.rsbasic.de/download/

EnableExplicit

Prototype.i ProtoGetDefaultPrinter(Printer$, *BufferSize)

Procedure.s GetDefaultPrinter()
   Protected GetDefaultPrinter_.ProtoGetDefaultPrinter
   Protected BufferSize.i, Result$ = ""
   Protected DLL.i = OpenLibrary(#PB_Any, "winspool.drv")
   
   If DLL
      CompilerIf #PB_Compiler_Unicode
         GetDefaultPrinter_ = GetFunction(DLL, "GetDefaultPrinterW")
      CompilerElse
         GetDefaultPrinter_ = GetFunction(DLL, "GetDefaultPrinterA")
      CompilerEndIf
      
      If GetDefaultPrinter_
         GetDefaultPrinter_(#NUL$, @ BufferSize)
         If BufferSize
            Result$ = Space(BufferSize)
            If GetDefaultPrinter_(Result$, @ BufferSize) = #False
               Result$ = ""
            EndIf  
         EndIf
      EndIf
      CloseLibrary(DLL)
   EndIf
   
   ProcedureReturn Result$
EndProcedure


Debug "Default printer: " + GetDefaultPrinter()
See also instructive code and information by netmaestro concerning ASCII vs. Unicode versions of Win API functions.
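
An alternative to switching between the A and W versions with CompilerIf is to always call the W version and let a pseudotype do the conversion for input strings. The following is only a sketch for Windows; the names ProtoMessageBoxW and MsgBoxW are made up for this example.

```purebasic
; Sketch (Windows only): always call the "W" function, in either compile mode.
; The p-unicode pseudotype makes PB pass the strings as Unicode.
; ProtoMessageBoxW and MsgBoxW are hypothetical names chosen for this example.
Prototype.i ProtoMessageBoxW(hWnd.i, Text.p-unicode, Caption.p-unicode, Flags.l)

Define user32 = OpenLibrary(#PB_Any, "user32.dll")
If user32
  Define MsgBoxW.ProtoMessageBoxW = GetFunction(user32, "MessageBoxW")
  If MsgBoxW
    MsgBoxW(0, "Hi there!", "Title", #MB_ICONINFORMATION)
  EndIf
  CloseLibrary(user32)
EndIf
```

Note that pseudotypes only help with strings passed to a function. Strings that a function writes into a buffer still have to be converted by hand, e.g. with PeekS().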
Last edited by Little John on Wed Oct 14, 2015 1:54 pm, edited 1 time in total.
infratec
Always Here
Posts: 6817
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Unicode in PureBasic

Post by infratec »

Hi,

Not directly PB-related, but a hint about Unicode and databases:

If you need to handle Unicode strings in combination with SQL databases, you need to prefix the string literals with a capital N.

Example:

Code: Select all

INSERT INTO mytable VALUES (100, N'A unicode string')
This is defined in the SQL-92 standard: (search for NATIONAL CHARACTER)
http://www.contrib.andrew.cmu.edu/~shad ... ql1992.txt

It is not clearly written (at least for me), but it works for Unicode strings.

You can find a good explanation here:
http://databases.aspfaq.com/general/why ... refix.html
(if the link is not down :cry: )

Bernd
Num3
PureBasic Expert
Posts: 2810
Joined: Fri Apr 25, 2003 4:51 pm
Location: Portugal, Lisbon

Re: Unicode and PureBasic

Post by Num3 »

Here are my tips:

Base64-encode an image (or whatever) to save it in a database as ASCII Base64 (the standard for most web languages, PHP for instance):

Code: Select all

    Define  blob.s = "", out_blob.s = ""
    
    If ReadFile(0,temp_dir + "img.png")
      blob = Space(Lof(0))
      ReadData(0,@blob, Lof(0))
      out_blob = Space(Lof(0)*3)
      Base64Encoder(@blob,Lof(0),@out_blob,Lof(0)*3)
      CloseFile(0)
      blob = PeekS(@out_blob, StringByteLength(out_blob), #PB_UTF8)
      out_blob = ""
    EndIf      
    
 ; Ascii base64 is now on the blob variable
 
MySQL query to log in with an MD5 hash
(In Unicode mode, PureBasic's MD5 produces a different hash than in ASCII mode, so I used this trick to work around the problem.)

Code: Select all

query.s = "select id, name from users where users.user='"+name+"' and users.password=md5('"+password+"') limit 1" ; This way Mysql makes the MD5 Calculation and not Purebasic
    If DatabaseQuery(0, query)    ; -QUERY   
      NextDatabaseRow(0)
      ; Your code here
    Endif

Re: Unicode and PureBasic

Post by Little John »

Num3 wrote: Base64-encode an image (or whatever) to save it in a database as ASCII Base64 (the standard for most web languages, PHP for instance):

Code: Select all

    Define  blob.s = "", out_blob.s = ""
    
    If ReadFile(0,temp_dir + "img.png")
      blob = Space(Lof(0))
      ReadData(0,@blob, Lof(0))
      out_blob = Space(Lof(0)*3)
      Base64Encoder(@blob,Lof(0),@out_blob,Lof(0)*3)
      CloseFile(0)
      blob = PeekS(@out_blob, StringByteLength(out_blob), #PB_UTF8)
      out_blob = ""
    EndIf      
    
 ; Ascii base64 is now on the blob variable
 
For encoding an image/whatever, it's not necessary to use a string as buffer, and Base64Encoder() is not a string function (see http://www.purebasic.fr/english/viewtop ... 56#p462056 #3). So better do completely without strings here, and use buffers that are created with AllocateMemory() instead. Then without any strings, there is nothing such as ASCII or Unicode at all. :)
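
A string-free version could look like the sketch below. It assumes the buffer-based Base64Encoder(*Input, InputSize, *Output, OutputSize) that was current when this thread was written; as far as I know, later PureBasic versions provide this functionality as Base64EncoderBuffer(). The file name "img.png" and the generous output size are assumptions.

```purebasic
; Sketch: Base64-encode a file using memory buffers only, no string buffers.
; Assumption: the buffer-based Base64Encoder() of the PB version in use.
If ReadFile(0, "img.png")              ; hypothetical file name
  Define size = Lof(0)
  Define outSize = size * 2 + 64       ; deliberately larger than strictly needed
  Define *in = AllocateMemory(size + 1)
  Define *out = AllocateMemory(outSize)

  If *in And *out
    ReadData(0, *in, size)
    ; The encoder writes plain ASCII bytes, in ASCII mode and in Unicode mode
    Define encodedLen = Base64Encoder(*in, size, *out, outSize)
    ; Only now turn the ASCII bytes into a PB string
    Define base64$ = PeekS(*out, encodedLen, #PB_Ascii)
    Debug base64$
  EndIf

  If *in : FreeMemory(*in) : EndIf
  If *out : FreeMemory(*out) : EndIf
  CloseFile(0)
EndIf
```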
VB6_to_PBx
Enthusiast
Posts: 617
Joined: Mon May 09, 2011 9:36 am

Re: Unicode and PureBasic

Post by VB6_to_PBx »

Little John ,

many thanks for all your helpful Code and clarifications on UniCode -vs- ASCII
 

Re: Unicode and PureBasic

Post by Num3 »

Little John wrote: For encoding an image/whatever, it's not necessary to use a string as buffer, and Base64Encoder() is not a string function (see http://www.purebasic.fr/english/viewtop ... 56#p462056 #3). So better do completely without strings here, and use buffers that are created with AllocateMemory() instead. Then without any strings, there is nothing such as ASCII or Unicode at all. :)
Not true, here is the difference in output with and without that conversion:

PB Base64 in unicode output:

Code: Select all

噩佂睒䬰杇䅯䅁乁啓䕨杕䅁䉁䅁䅁允䅃䅙䅁晁⼸根䅁䅁塇䙒䡗呒㉢ず㉤祆党䉂䝚椹博䩂坢湆噚汊坙㕒捣汬䅐䅁煁䩬䕒商乥歱さ側ㅅ奅扐呦㍸㙑扢杅乴䙧歰硧䥓潂䑩煆䍊先䕸呙⽅䄫ㅗ䥮栱术浺䥂偘䑫䍪晳攰䝩剨乡剉橇婴塇歯䙁晥䙘䑐㥱⭍㥏扌䕔兪湳婏㝬⭶㜸㔵瘷潐瑨⼲晩㍪䉁⭚稹湎䅉潄䡇楷歁兹䑡乄㥈橵䝗摇㉧㝋㍑䩙獮㑚浇㍫⭄婊兩䵶戶⽚橱愸夵㕓獶⼸㑩扷㤲橳⽌敢卶伷獦噖牗䜳ㅤ睲䤴睨单慯㑂ㄫ祱㡤祁噌塬剄祒敊㙩湺䉏畲桏䭑橷䡆㉮桍䵐嘵捕䑩煨䑘楳捩奌挱䩔坧桋䉊䱥ㅥ楖乴䅮㜴硴䈳牮䑑㝡啁刱娵桇杺癇穸䝈兏䈰べ㉺奮䕲㡩南䍲照䕎爲汧慊ㅓ坳䵄䅍㕺ㅕ敔䬶敨捋㉕䈯畎婮爹捄獖塧⼶眱ㅂ䱐捏䝒瑍㕗㥧坔灅䍷獧橅婢䑺杰焹㉥剄佱䅃䩚浯㥗䕖执兗䍇㑓桐䝹⬲䜳㑭愸樷㠳坘坵噯桋畅捣潡卅灚潤奚瑲䥶祙塊祧晔畅换橄攸剄䅖佬ㅴ䡓噒䕹硣湱乙ㅋ噮丱兖䉸瑆⭷丸睴䥴杬䘹具㉲噤䠵⽕䝘橌䍵㙱塯佌兙穣祱㝳㙭啉䙧ぶ穖啈が啁慒⭓䱃硨扰敕㠵癪䱬㑙祰杆焰牥㡃䱵䩺楶杅睎晥㑺栫噎砫橐吷い橕䝘卑兴汴塦橆湩夰瑊割慵灏佬奰㥉㑆㝤晶洴㐳煯⽙坔㈹慰潡䅲塋桱湆䭌倯㍵慡牱⭫慺㥲晦楁䕅乸䨰䍵挵此摮⬴搳卦慤㍏慵坰䥄㑹啶煷㈶婵婔含㍘扡楇歺朲湫桰ぇ潵癣䙣䩵济桨㑙獏睊䑮倴煣䡹奇啺潴甲㡷洵䡙夲癨橖睫㉗䱇婙硬䤴䰸儲甹㡈睸浴硔䭅㡎㑡栱䈸杂䭁樵砯噗坤睬䅁䅁䩂啒䔵歲杊杧㴽
The same Base64 output after my conversion, or when compiling without Unicode:

Code: Select all

iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAqlJREFUeNqkU0tPE1EYPbfTx3Q6bbEgtNgFpkgxSIBoiDFqJCHQxEYTE/+AW1nI1h/gzmBIXPkDjCsf0eiGhRaNIRGjtZGXokAFefXFPDq9M+O9LbTEjQsnOZl7v+87557vPoht2/ifj3ABZ+9zNnIADoGHwiAkyQaDDNH9ujWGGdg2K7Q3YJnsZ4Gmk3D+JZiQvM6bZ/qj8a5YS5vs8/i4wb29sjL/bevS7OfsVVWr3Gd1rw4IhwUSoaB4+1qyd8AyLVlXDRRyJei6znOBruOhQKwjFHn2MhPM5VUciDhqXDsiicLY1cTJgWKhJBeLe1VitNnA47tx3BnrQDa7AU1R5ZGhzgGvxzHGOQ0By0z2nYrEi8WSrCgqNE2rglJaS1sWDMMAz5U1Te6KheKcU2/BNunZ9rDcVsgX6/1wB1PLOcRGMtW5g9TWEpwCgsEjbZzDpg9qe2DRqOCAZJomW9VEgbWQGCS4PhyG2+3Gm48a7j38XWuWoVKhEuccaoESZpdoZYrtvIYyJXgyTfEubcDj8eDRVAlOt1SHRVyEcxqnYNK1nV1NVQxBFtw+8NtwtIlg9FwQr2dV5HU/XGLjuCq6oXLOYQczqys7m6IUgFv0VzHUL0AURaS+CLhxpbUe58jvlLY4pyFg0qerC8uLzJviEgNwefz4+hNV+xPj7TD0UjXGQStQtlfXFjin0YJtrRuaOplOpYI9F4d7vf4m34oqY/TW92paaorAKXqhFnLK/Pu3aaqrk+zar9ffAiEExN0JuC5cdknd4+3dfSdaO3uapWDIy4vUwq62uZTZ+TX3abGizk2gknphG0uocvcFuJNmhhY4OsJwnD4PcqyHGYzUto2uw85mYH2YhvVjkwW2GLYZlx4I8L2Q9uH8xwtmTxEKN8a41h8BBgAK5j/xWVdWlwAAAABJRU5ErkJggg==
Base64Encode output buffer in Unicode does not comply with RFC4648 / Page 5, so it has to be converted to ASCII.

Re: Unicode and PureBasic

Post by Num3 »

MD5 and unicode.

I've been trawling through the MD5 RFCs, but I haven't found a direct reference saying that the input buffer has to be ASCII.
But from experience and testing, I am almost sure the standard specifies somewhere that it has to be.

Again, this code produces different results in ASCII mode and in Unicode mode.

So here is a piece of code to produce MD5 hashes that seem to be correct in Unicode mode:

Code: Select all

Debug "e946adb45d4299def2071880d30136d4 -> is the standard / expected result"
text.s = "Mary had a little lamb" ; Text to get hash from

out.s = MD5Fingerprint(@text,StringByteLength(text))
Debug out + " -> direct result from MD5Fingerprint"

buffer.s = Space(Len(text)) ; A little buffer for the conversion
PokeS(@buffer, text, -1, #PB_Ascii) ; Poke the text as ASCII so that hashing works as expected
out.s = MD5Fingerprint(@buffer, Len(text)) ; ASCII takes one byte per character, so hash exactly Len(text) bytes
Debug out + " -> converted input "

Re: Unicode and PureBasic

Post by Fred »

MD5 is a multipurpose data hash; it has no link with ASCII or Unicode. It just gets a buffer and returns the numeric hash of it. The content of the buffer doesn't matter.

Re: Unicode and PureBasic

Post by Little John »

Num3 wrote:
Little John wrote: For encoding an image/whatever, it's not necessary to use a string as buffer, and Base64Encoder() is not a string function (see http://www.purebasic.fr/english/viewtop ... 56#p462056 #3). So better do completely without strings here, and use buffers that are created with AllocateMemory() instead. Then without any strings, there is nothing such as ASCII or Unicode at all. :)
Not true, here is the difference in output with and without that conversion:
You did not understand my reply to your previous post.
Num3 wrote:Base64Encode output buffer in Unicode does not comply with RFC4648 / Page 5, so it has to be converted to ASCII.
Not quite correct.

For encoding a picture, the output of Base64Encoder() is exactly the same in ASCII mode and in Unicode mode, if you do not use strings as buffers.
If you are using strings as buffers for encoding pictures, then you are artificially creating this ASCII/Unicode problem with Base64Encoder() yourself.
I already wrote this repeatedly in this thread.

And about using Base64Encoder() or MD5Fingerprint() with strings, I've already written at the beginning of this thread ...

This thread is intended for collecting tips for PB users who want to get information about Unicode and PureBasic. For the sake of easy reading, this thread is not intended for discussion.
Please post any discussion in a new separate thread. Thank you!
Last edited by Little John on Sat Sep 12, 2015 9:18 am, edited 2 times in total.