Wish to the list: PBEditor Character table also in unicode

Got an idea for enhancing PureBasic? New command(s) you'd like to see?

Post by Psychophanta »

Hello.
In the PB editor, the "Tools" menu has a useful option called "Character table", but it covers extended ASCII only.
Since newer versions of the compiler are Unicode only, it would be interesting to implement a Unicode character table.
:)

Post by Saki »

The idea is good, but which characters should it show?
All of them?

Post by Demivec »

The Unicode codepoints are very numerous and also still in a state of change.

Perhaps a link to the symbols would be better.

http://www.unicode.org/charts/

Post by Sicro »

PureBasic uses UCS-2 (2 bytes per character) and is therefore limited to the character codes from 0 to 65,535 (see PB help).

But even with this limitation, filling the list takes a few seconds (I tested it with the source code of the PureBasic IDE).

Maybe it would be better not to display all characters at once. At the top of the window we could place several buttons for different character ranges, or use a ComboBoxGadget to switch the character range displayed in the list.
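
The range-switching idea could look roughly like the sketch below. It is only an illustration, not code from the PureBasic IDE: the procedure name FillRange(), the CharRange structure and the three example ranges are assumptions, and the variable selected merely stands in for the index a ComboBoxGadget would report through GetGadgetState().

```purebasic
; Hypothetical structure describing one selectable character range
Structure CharRange
  name.s
  first.i
  last.i
EndStructure

; Hypothetical helper: fills entries() with one line per code in first..last
Procedure FillRange(first.i, last.i, List entries.s())
  Protected code
  ClearList(entries())
  For code = first To last
    AddElement(entries())
    entries() = "U+" + RSet(Hex(code), 4, "0") + "  " + Chr(code)
  Next
EndProcedure

; Example ranges only; a real table would need many more
Dim ranges.CharRange(2)
ranges(0)\name = "Basic Latin"      : ranges(0)\first = $0020 : ranges(0)\last = $007E
ranges(1)\name = "Greek and Coptic" : ranges(1)\first = $0370 : ranges(1)\last = $03FF
ranges(2)\name = "Cyrillic"         : ranges(2)\first = $0400 : ranges(2)\last = $04FF

NewList entries.s()
selected = 2 ; stands in for the index returned by the range selector
FillRange(ranges(selected)\first, ranges(selected)\last, entries())

ForEach entries()
  Debug entries()
Next
```

Because only one range is built at a time, the list never has to hold all 65,536 entries at once.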

Post by Saki »

Demivec wrote: Sun Apr 25, 2021 4:33 pm The unicode codepoints are quite extensive and also still in a state of change.

Perhaps a link to the symbols would be better.

http://www.unicode.org/charts/

It just doesn't work, there are way too many.
The many foreign PB users must then also be supported.
This website from @Demivec is so far the best I have seen.

http://www.columbia.edu/kermit/ucs2.html

Post by Demivec »

Sicro wrote: Sun Apr 25, 2021 5:59 pm PureBasic uses UCS-2 (2 bytes per character) and is therefore limited to the character codes from 0 to 65,535 (see PB help).

But even with this limitation the filling of the list takes some seconds (I tested it with the source code of the PureBasic IDE).

Maybe it would be better if not all characters are displayed at once. At the top of the window we could place several buttons with different character ranges or take a ComboBoxGadget for it, with which we could switch the displayed character ranges in the list.

@Sicro: PureBasic says it uses UCS-2 internally, but I think that is not the whole story. I think the truth is that all of its string functions, like Mid(), LSet() and so on, simply operate on 16-bit units as if every character were two bytes long. Many functions that utilize strings actually make use of UTF-16. UTF-16 allows all of the Unicode codepoints to be written using either two or four bytes, with a surrogate mechanism for the four-byte case.

Here is a demonstration:

Code:

Procedure handleError(value, text.s)
  If Not value
    MessageRequester("Error", text)
    End
  EndIf
EndProcedure

Procedure.s _Chr(v.i) ;return a proper surrogate pair for unicode values outside the BMP (Basic Multilingual Plane)
  Protected high, low
  If v < $10000
    ProcedureReturn Chr(v)
  Else
    ;calculate surrogate pair of unicode codepoints to represent value in UTF-16
    v - $10000
    high = v / $400 + $D800 ;high/lead surrogate value
    low = v % $400 + $DC00  ;low/tail surrogate value
    ProcedureReturn Chr(high) + Chr(low)
  EndIf
EndProcedure

#imageWidth = 310
#imageHeight = 310
handleError(LoadFont(0, "Courier", 200), "Can't load font.")
handleError(CreateImage(0, #imageWidth, #imageHeight), "Can't create image.")

a$ = _Chr($1F600) ;U+1F600, stored as a surrogate pair
Debug a$          ;lets the debug window be checked as well

If StartDrawing(ImageOutput(0))
  DrawingFont(FontID(0))
  DrawText(0, 0, a$)
  StopDrawing()
EndIf

handleError(OpenWindow(0, 0, 0, #imageWidth, #imageHeight + 20, a$ + "Unicode Test" + a$), "Can't open window.")
ImageGadget(0, 0, 0, 0, 0, ImageID(0))
TextGadget(1, 5, #imageHeight, #imageWidth, 20, ReplaceString(Space(25), " ", a$))

Repeat: Until WaitWindowEvent() = #PB_Event_CloseWindow
  • If you see a smiling emoji "😀" in the image after running the above code, then the DrawText() function is using UTF-16 and not UCS-2.
  • If you see a line of smiling emoji in the TextGadget, then TextGadget() is using UTF-16 and not UCS-2.
  • If you see a smiling emoji at the beginning and end of the window's title, then the OpenWindow() function is using UTF-16 and not UCS-2.
  • If you see a smiling emoji in the debug window while debugging, then the Debug command is using UTF-16, and the font you are using in the debug window also has a character for that codepoint.
When I run the code in Windows 10 with PureBasic v5.73 LTS x64, I see smileys in all of the above areas.
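
The surrogate arithmetic in _Chr() can be checked by hand and reversed. The sketch below works through U+1F600 in comments and rebuilds the codepoint from the pair. SurrogatesToCodepoint() is a name made up for this example, not a PureBasic command.

```purebasic
; Hypothetical helper (not a PureBasic built-in): rebuilds the codepoint
; from a high/lead and a low/tail surrogate, the inverse of the split in _Chr()
Procedure.i SurrogatesToCodepoint(high.i, low.i)
  ProcedureReturn (high - $D800) * $400 + (low - $DC00) + $10000
EndProcedure

; Worked example for U+1F600:
;   $1F600 - $10000 = $F600
;   $F600 / $400 = $3D   -> high = $D800 + $3D  = $D83D
;   $F600 % $400 = $200  -> low  = $DC00 + $200 = $DE00
Debug Hex(SurrogatesToCodepoint($D83D, $DE00))
```

For the pair $D83D, $DE00 the helper returns $1F600 again, so the split and the merge agree.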

As for a chart of Unicode, or even only UCS-2, codepoints and characters: the number is very large, and it wouldn't make much sense to put that much information into the Help file in picture form. Also, as stated earlier, the codepoint definitions are still changing. UCS-2 is updated to keep it synchronized with changes in the BMP (Basic Multilingual Plane) of Unicode. You'll notice that the chart Saki linked to lists many visible characters with a description of '(unknown)', which shows that the chart is not up to date; my guess is that the website hosting it was last updated in 2011. One example is codepoint 0220 ('Ƞ'), which is described as 'LATIN CAPITAL LETTER N WITH LONG RIGHT LEG' in the Unicode charts available from the link I posted.

I don't think buttons would work very well to select portions of the codepoint range to display simply because it is such a large range.


Note: I verified that the forum update now allows Unicode characters outside the BMP to be posted in messages. The Smiley emoticon in this message is the test case. Here are a few more 🀁🀂🀃🀄🀢🀣🀤🀥🀦🀧🀨🀩(mahjong tiles).

Post by Sicro »

Demivec wrote: Thu Apr 29, 2021 2:44 am @Sicro: PureBasic says it uses UCS-2 internally but I think that is a bit fiddly. I think the truth is that all of its string functions like Mid() , LSet() and so on simply operate on codepoints as if they were all two bytes long. Many functions that utilize strings actually make use of UTF-16. UTF-16 allows all of the Unicode codepoints to be written using either two or four bytes with a surrogate mechanism.

Here is a demonstration:
[...]
Yes, the functions that display or draw strings interpret the UCS-2 string as UTF-16 (which is an extension of UCS-2). But it is actually the OS API functions that do that, not the PB functions.

But OK, since UTF-16 text can be displayed and drawn, and the PB string functions do not destroy the other UTF-16 characters in the UCS-2 string, you could say that PB supports UTF-16, as long as you keep in mind that

Code:

Len(one character string)
does not always result in "1".
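
A small sketch of the difference, assuming strings are stored as 16-bit units as discussed above. CodepointCount() is a made-up helper, not a PureBasic command, and it assumes well-formed input in which every low surrogate follows a high surrogate.

```purebasic
; Hypothetical helper (not a PureBasic built-in): counts codepoints by
; skipping low/tail surrogates, so that a surrogate pair counts once
Procedure.i CodepointCount(text.s)
  Protected i, unit, count
  For i = 1 To Len(text)
    unit = Asc(Mid(text, i, 1))
    If unit < $DC00 Or unit > $DFFF
      count + 1
    EndIf
  Next
  ProcedureReturn count
EndProcedure

smiley$ = Chr($D83D) + Chr($DE00) ; U+1F600 written as its surrogate pair

Debug Len(smiley$)            ; counts 16-bit units
Debug CodepointCount(smiley$) ; counts codepoints
```

Anything that cuts strings by position, such as Mid() or Left(), would need the same kind of care to avoid splitting a pair.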

I didn't know about the surrogate mechanism thing, thanks. I don't deal much with the different Unicode encodings.
Demivec wrote: Thu Apr 29, 2021 2:44 am Also, as stated earlier the codepoint definitions are still in a process of change. UCS-2 is updated to keep it synchronized to changes in the BMP (Basic Multilingual Plane) of unicode.
OK, then I also think it would be better to add a link to an always-up-to-date web page at the bottom of the character table window.
Demivec wrote: Thu Apr 29, 2021 2:44 am Note: I verified that the forum update now allows Unicode characters outside the BMP to be posted in messages. The Smiley emoticon in this message is the test case. Here are a few more 🀁🀂🀃🀄🀢🀣🀤🀥🀦🀧🀨🀩(mahjong tiles).
That's cool. Probably the old forum did not use

Code:

<meta charset="utf-8">
in the HTML code.