utf16 string module StrCmp full case folding

idle
Always Here
Posts: 5097
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: utf16 string module StrCmp full case folding

Post by idle »

Updated: added left_, right_, mid_, len_, isutf16, pUpCase and plowCase.

See 1st post
Sicro
Enthusiast
Posts: 538
Joined: Wed Jun 25, 2014 5:25 pm
Location: Germany

Post by Sicro »

The CaseFolding.txt file is not suitable for converting lowercase letters to uppercase or vice versa. This file is for normalizing two strings (reducing character variants; called case-folding) so that they can then be compared.

For converting lowercase to uppercase or vice versa, the CaseMapping.txt file must be used.

Here is an example of a problem when you use the CaseFolding.txt file for conversion from upper case to lower case or vice versa:
CaseFolding.txt wrote: 01C4; C; 01C6; # LATIN CAPITAL LETTER DZ WITH CARON
01C5; C; 01C6; # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
https://www.compart.com/en/unicode/U+01C6 wrote: 01C6 - Latin Small Letter Dz with Caron
Now suppose you want to do toUpperCase($01C6) with the CaseFolding.txt file. Which uppercase letter should the function now return from these two? You don't have this problem with the CaseMapping.txt file, because the Unicode Standard has defined only one target letter in it for mapping.

Edit: I mean UnicodeData.txt not CaseMapping.txt (does not exist), sorry, too tired.
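
To make the ambiguity concrete, here is a minimal sketch with a few hand-copied entries. The map names Fold and Upper are made up for illustration and are not part of idle's module. Two code points fold to $01C6, so a reversed fold table has two candidates, while the simple uppercase mapping in UnicodeData.txt names exactly one target.

```purebasic
; Hypothetical sketch: simple case folding is many-to-one, so it cannot be inverted.
EnableExplicit

NewMap Fold.i()    ; code point (hex key) -> folded code point, subset of CaseFolding.txt
Fold("01C4") = $01C6
Fold("01C5") = $01C6

NewMap Upper.i()   ; code point (hex key) -> simple uppercase mapping, subset of UnicodeData.txt
Upper("01C5") = $01C4
Upper("01C6") = $01C4

; Walking the fold table backwards finds more than one source for U+01C6
ForEach Fold()
  If Fold() = $01C6
    Debug "U+" + MapKey(Fold()) + " folds to U+01C6"
  EndIf
Next

; The uppercase table has a single, unambiguous answer
Debug "Upper(U+01C6) = U+" + RSet(Hex(Upper("01C6")), 4, "0")
```

Run it with the debugger enabled to see the Debug output. The order in which ForEach visits the map entries is not guaranteed.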
Last edited by Sicro on Mon May 08, 2023 10:01 pm, edited 1 time in total.
Why OpenSource should have a license :: PB-CodeArchiv-Rebirth :: Pleasant-Dark (syntax color scheme) :: RegEx-Engine (compiles RegExes to NFA/DFA)
Manjaro Xfce x64 (Main system) :: Windows 10 Home (VirtualBox) :: Newest PureBasic version

Post by idle »

It uses the 1st instance so if it sees a repeat key it ignores it. I have yet to test it against the case mapping file; that's another 2 hours' work.
The mapping in CaseFolding.txt should result in:
01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4
Is it correct? I'm not sure; checking that is today's task.

Post by Sicro »

I mistakenly wrote CaseMapping.txt in the previous post (does not exist), correct is UnicodeData.txt.
idle wrote: Mon May 08, 2023 9:25 pm It uses the 1st instance so if it sees a repeat key it ignores it.
Yes, that's what I thought, that's how your code works. But you can't be sure that the 1st instance is always the right one.
idle wrote: Mon May 08, 2023 9:25 pm 01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4.
The value I marked in red can be different when mapping with UnicodeData.txt:

01c5 => 01c4 (UpperCase) or 01c6 (LowerCase)

Post by idle »

Sicro wrote: Mon May 08, 2023 10:19 pm I mistakenly wrote CaseMapping.txt in the previous post (does not exist), correct is UnicodeData.txt.
idle wrote: Mon May 08, 2023 9:25 pm It uses the 1st instance so if it sees a repeat key it ignores it.
Yes, that's what I thought, that's how your code works. But you can't be sure that the 1st instance is always the right one.
idle wrote: Mon May 08, 2023 9:25 pm 01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4.
The value I marked in red can be different when mapping with UnicodeData.txt:

01c5 => 01c4 (UpperCase) or 01c6 (LowerCase)
Yes, you're right. The issue appears if a character is encoded as titlecase (01C5).
The respective mappings according to UnicodeData.txt, as upper | lower | titlecase, are:

01C4 = 01C4 | 01C6 | 01C5
01C5 = 01C4 | 01C6 | 01C5
01C6 = 01C4 | 01C6 | 01C5

whereas my current tables erroneously leave 01C5 in titlecase when converting to upper case:
01C4 = 01C4 | 01C6 | x
01C5 = 01C5 | 01C6 | x
01C6 = 01C4 | 01C6 | x
I will have to reassess how I do the data section; I just wanted to minimize it.
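
One possible layout, as a rough sketch only: keep all three simple mappings per code point so that no direction has to be guessed from another. The structure CaseTriple, the map Cases and the procedure AddCase are invented names for illustration, not what the module actually uses, and the three rows are the UnicodeData.txt values listed above.

```purebasic
; Hypothetical sketch (not the UTF16 module's tables): one record per code point
; holding the three simple mappings from UnicodeData.txt.
EnableExplicit

Structure CaseTriple
  upper.i   ; simple uppercase mapping
  lower.i   ; simple lowercase mapping
  title.i   ; simple titlecase mapping
EndStructure

Global NewMap Cases.CaseTriple()

Procedure AddCase(cp.i, upper.i, lower.i, title.i)
  AddMapElement(Cases(), Hex(cp))   ; key is the code point as a hex string
  Cases()\upper = upper
  Cases()\lower = lower
  Cases()\title = title
EndProcedure

AddCase($01C4, $01C4, $01C6, $01C5)
AddCase($01C5, $01C4, $01C6, $01C5)
AddCase($01C6, $01C4, $01C6, $01C5)

; Look up the titlecase letter and print each mapping separately
If FindMapElement(Cases(), Hex($01C5))
  Debug "upper(01C5) = " + Hex(Cases()\upper)
  Debug "lower(01C5) = " + Hex(Cases()\lower)
  Debug "title(01C5) = " + Hex(Cases()\title)
EndIf
```

This costs more data than a single shared table, which is the trade-off against keeping the data section small.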

Post by idle »

I have redone it. The test output is now:
surrogate pairs equality a𐐀abd = A𐐨ABD
case mapping equality aῼabd = aῳABD
full case folding equality aßEaİdssf = aSSEai̇dßf
simple case folding equality SomeMixedCaseStringWithNothingSpecialOtherThanBeingLong = sOMEmIXEDcASEsTRINGwithnOTHINGsPECIALoTHERtHANbEINGlONG
Normal case equality Normal cmp = Normal cmp
Tolower somemixedcasestringwithnothingspecialotherthanbeinglong
To upper SOMEMIXEDCASESTRINGWITHNOTHINGSPECIALOTHERTHANBEINGLONG
ꭰ AB70



to upper ABCDEF 0123456789, ÄÖÜ, ÄÖÜ, ÁÓÚ FEDCBA DŽ
to lower abcdef 0123456789, äöü, äöü, áóú fedcba dž
to Title Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
🅐A🅚🅐K🅝
left 2 🅐A
right 2 K🅝
mid 1,4 A🅚🅐K
chr_Asc_((Left_example,1))) 🅐
and the result for the previous posts' example:
abcdef 0123456789, äöü, äöü, áóú FEDCBA Dž
1C4
1C6
1C4
Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
$1C4 = 1C4 | 1C6 | 1C5
$1C5 = 1C4 | 1C6 | 1C5
$1C6 = 1C4 | 1C6 | 1C5
Appears to check out now.

Post by idle »

Updated to v2.0.0 and renamed.

UTF16 Utility Module
Provides UTF-16 support to PB, with:
full case folding compare, similar to CompareMemoryString
in-place string case mappings for uppercase, lowercase and titlecase
string replacements for left, mid, right, len, ucase, lcase, asc, chr
additional tcase (title case)

https://github.com/idle-PB/UTF16

Post by idle »

Redid strLCase / strUCase, added speed tests

UTF16 Strcmp(s3,s4) 68 ms for 1,000,000
CompareMemoryString(@s3,@s4) 62 ms for 1,000,000

UTF16 strLCase / strUcase 41 ms for 1,000,000
LCase / UCase 486 ms for 1,000,000

https://github.com/idle-PB/UTF16
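
For anyone who wants to reproduce numbers of this kind, a timing loop along these lines should do. This is only a guess at the shape of the test, not the code from the repository: the test strings, the #PB_String_NoCase flag and the constant #Iterations are assumptions. Only the built-in CompareMemoryString is timed here; the module's function would go in the same loop.

```purebasic
; Hypothetical timing harness; compile with the debugger disabled for meaningful numbers.
EnableExplicit

#Iterations = 1000000

Define s3.s = "SomeMixedCaseString"
Define s4.s = "sOMEmIXEDcASEsTRING"
Define i.i, result.i
Define t.q

t = ElapsedMilliseconds()
For i = 1 To #Iterations
  ; case-insensitive compare of the two string buffers
  result = CompareMemoryString(@s3, @s4, #PB_String_NoCase)
Next
t = ElapsedMilliseconds() - t

MessageRequester("Timing", "CompareMemoryString: " + Str(t) + " ms for " + Str(#Iterations) + " calls")
```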
StarBootics
Addict
Posts: 984
Joined: Sun Jul 07, 2013 11:35 am
Location: Canada

Post by StarBootics »

Hello Idle,

Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics
The Stone Age did not end due to a shortage of stones !
STARGÅTE
Addict
Posts: 2090
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Post by STARGÅTE »

This should work:

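Procedure.s StrChr(v.i) ; return a proper surrogate pair for Unicode values outside the BMP (Basic Multilingual Plane)
	Protected buffer.q
	If v < $10000
		ProcedureReturn Chr(v) ; BMP code point: a single UTF-16 unit
	Else
		; low 16 bits  = high surrogate: $D800 | top 10 bits of (v - $10000)
		; next 16 bits = low surrogate:  $DC00 | bottom 10 bits of v
		buffer = (v&$3FF)<<16 | (v-$10000)>>10 | $DC00D800
		ProcedureReturn PeekS(@buffer, 2, #PB_Unicode) ; read both UTF-16 units (little-endian memory layout)
	EndIf
EndProcedure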
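
As a worked example of the arithmetic in that procedure, the sketch below splits one code point into its two surrogates and joins them again. U+1F150 (🅐, one of the characters used in the thread's test output) is picked as the sample value; the variable names are only illustrative.

```purebasic
; Worked example of the UTF-16 surrogate arithmetic for a single code point.
EnableExplicit

Define v.i    = $1F150                          ; a code point outside the BMP
Define high.i = $D800 | ((v - $10000) >> 10)    ; high surrogate from the top 10 bits
Define low.i  = $DC00 | (v & $3FF)              ; low surrogate from the bottom 10 bits

Debug Hex(high)   ; D83C
Debug Hex(low)    ; DD50

; Joining the pair again gives back the original code point
Define back.i = $10000 + ((high & $3FF) << 10) + (low & $3FF)
Debug Hex(back)   ; 1F150
```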
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module

Post by idle »

STARGÅTE wrote: Mon Oct 16, 2023 7:51 pm This should work:

Thanks STARGÅTE. A range check was added in the IDE; it was still compiling from the command line.

Post by idle »

StarBootics wrote: Mon Oct 16, 2023 5:37 pm Hello Idle,

Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics
The issue was caused by a range check that was added to the IDE. I've asked for it to be removed, as it really doesn't make much sense.

See STARGÅTE's fix above :D

Post by idle »

Fixed the examples; it's back up and should contain the :D emoji on line 5077.

Post by idle »

Added a function to strip accents in UTF16a.pb
https://github.com/idle-PB/UTF16
"@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬­®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙"

becomes

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „.†‡ ‰S‹STZZ ‘’“”•–— Ts›stzz ˇ Ł¤A¦§ ©S«¬­®Z°± ł μ¶· as»L lzRAAAALCCCEEEEIIDĐNNOOOO×RUUUUYTßraaaalccceeeeiidđnnoooo÷ruuuuyt
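
To illustrate the idea (not the implementation in UTF16a.pb, which may work quite differently), accent stripping can be sketched as a lookup from an accented character to its base letter. The map Base, the procedure StripAccents and the four table entries are assumptions for this example, and the loop only handles characters in the BMP.

```purebasic
; Hypothetical lookup-table sketch of accent stripping; BMP characters only.
EnableExplicit

Global NewMap Base.s()      ; accented character -> base letter
Base(Chr($00C1)) = "A"      ; A with acute
Base(Chr($010C)) = "C"      ; C with caron
Base(Chr($00E9)) = "e"      ; e with acute
Base(Chr($017E)) = "z"      ; z with caron

Procedure.s StripAccents(text.s)
  Protected result.s, c.s, i.i
  For i = 1 To Len(text)
    c = Mid(text, i, 1)
    If FindMapElement(Base(), c)
      result + Base()       ; known accented letter: append its base letter
    Else
      result + c            ; anything else is copied unchanged
    EndIf
  Next
  ProcedureReturn result
EndProcedure

Debug StripAccents(Chr($010C) + "esk" + Chr($00E9))   ; Ceske
```

A full version would need an entry for every accented letter, or a table generated from the decomposition data in UnicodeData.txt.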
performance for 1,000,000
UTF16::Strcmp(s3,s4) 81 ms
CompareMemoryString(@s3,@s4) 782 ms
UTF16::strLCase / UTF16::strUcase 48 ms
LCase / UCase 461 ms