utf16 string module StrCmp full case folding
Re: utf16 string module StrCmp full case folding
Updated
Added left_ right_ mid_ len_ isutf16 pUpCase plowCase
See 1st post
Re: utf16 string module StrCmp full case folding
The CaseFolding.txt file is not suitable for converting lowercase letters to uppercase or vice versa. This file is for normalizing two strings (reducing character variants; called case-folding) so that they can then be compared.
For converting lowercase to uppercase or vice versa, the CaseMapping.txt file must be used.
Here is an example of a problem when you use the CaseFolding.txt file for conversion from upper case to lower case or vice versa:
CaseFolding.txt wrote:
01C4; C; 01C6; # LATIN CAPITAL LETTER DZ WITH CARON
01C5; C; 01C6; # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
https://www.compart.com/en/unicode/U+01C6 wrote:
01C6 - Latin Small Letter Dz with Caron
Now suppose you want to do toUpperCase($01C6) with the CaseFolding.txt file. Which uppercase letter should the function now return from these two? You don't have this problem with the CaseMapping.txt file, because the Unicode Standard has defined only one target letter in it for mapping.
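To make the ambiguity concrete, here is a minimal sketch that inverts only the two CaseFolding.txt lines quoted above. The map name Fold and the hex-text keys are illustrative assumptions, not how the module stores its data.

```purebasic
; Hypothetical two-entry excerpt of CaseFolding.txt (the status C lines quoted above).
; Key: source code point as hex text, value: folded code point.
NewMap Fold.i()
Fold("01C4") = $01C6
Fold("01C5") = $01C6

; Invert the table for $01C6: collect every source that folds to it.
Define candidates.s = ""
Define count.i = 0
ForEach Fold()
  If Fold() = $01C6
    candidates = candidates + MapKey(Fold()) + " "
    count = count + 1
  EndIf
Next

Debug "Sources folding to 01C6: " + candidates
Debug "Candidate count: " + Str(count)   ; 2, so the inverse is not unique
```

Both 01C4 and 01C5 come back as candidates, so an inverted folding table cannot say which one is "the" uppercase form.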
Edit: I mean UnicodeData.txt not CaseMapping.txt (does not exist), sorry, too tired.
Last edited by Sicro on Mon May 08, 2023 10:01 pm, edited 1 time in total.
Why OpenSource should have a license :: PB-CodeArchiv-Rebirth :: Pleasant-Dark (syntax color scheme) :: RegEx-Engine (compiles RegExes to NFA/DFA)
Manjaro Xfce x64 (Main system) :: Windows 10 Home (VirtualBox) :: Newest PureBasic version
Re: utf16 string module StrCmp full case folding
It uses the 1st instance, so if it sees a repeated key it ignores it. I have yet to test it against the case mapping file. That's another 2 hours of work.
The mapping in the case folding should result in
01c4 01c6 - 01c6 01c4
01c5 01c6 - 01c6 01c4.
Is it correct? I'm not sure; that's today's task.
Re: utf16 string module StrCmp full case folding
I mistakenly wrote CaseMapping.txt in the previous post (does not exist), correct is UnicodeData.txt.
Yes, that's what I thought, that's how your code works. But you can't be sure that the 1st instance is always the right one.
The value I marked in red can be different when mapping with UnicodeData.txt:
01c5 => 01c4 (UpperCase) or 01c6 (LowerCase)
Re: utf16 string module StrCmp full case folding
Yes, you're right. The issue will appear if a character is encoded as titlecase 01C5.
The respective mappings according to UnicodeData.txt, as upper | lower | titlecase, are:
01C4 = 01C4 | 01C6 | 01C5
01C5 = 01C4 | 01C6 | 01C5
01C6 = 01C4 | 01C6 | 01C5
and this would erroneously result in returning the character to titlecase:
01C4 = 01C4 | 01C6 | x
01C5 = 01C5 | 01C6 | x
01C6 = 01C4 | 01C6 | x
I will have to reassess how I do the data section; I just wanted to minimize it.
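As a rough sketch of what a three-column table could look like, the snippet below stores the upper | lower | titlecase values from UnicodeData.txt for the three DZ code points. The structure, the map and the procedure names (CaseTriple, CaseMap, AddMapping, UpperOf, TitleOf) are made up for illustration and are not taken from the module.

```purebasic
; Simple case mappings per code point, in the order upper | lower | title.
Structure CaseTriple
  upper.i
  lower.i
  title.i
EndStructure

Global NewMap CaseMap.CaseTriple()

Procedure AddMapping(cp.i, upper.i, lower.i, title.i)
  AddMapElement(CaseMap(), Hex(cp))   ; key is the code point as hex text
  CaseMap()\upper = upper
  CaseMap()\lower = lower
  CaseMap()\title = title
EndProcedure

; Returns the uppercase mapping, or cp itself when the table has no entry.
Procedure.i UpperOf(cp.i)
  If FindMapElement(CaseMap(), Hex(cp))
    ProcedureReturn CaseMap()\upper
  EndIf
  ProcedureReturn cp
EndProcedure

; Returns the titlecase mapping, or cp itself when the table has no entry.
Procedure.i TitleOf(cp.i)
  If FindMapElement(CaseMap(), Hex(cp))
    ProcedureReturn CaseMap()\title
  EndIf
  ProcedureReturn cp
EndProcedure

AddMapping($01C4, $01C4, $01C6, $01C5)
AddMapping($01C5, $01C4, $01C6, $01C5)
AddMapping($01C6, $01C4, $01C6, $01C5)

Debug Hex(UpperOf($01C5))   ; 1C4, not the titlecase form 1C5
Debug Hex(UpperOf($01C6))   ; 1C4
Debug Hex(TitleOf($01C6))   ; 1C5
```

Keeping the titlecase column separate costs one extra value per entry, but the upper and lower lookups no longer depend on which duplicate key was seen first.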
Re: utf16 string module StrCmp full case folding
have redone it
and previous posts result
surrogate pairs equality a𐐀abd = A𐐨ABD
case mapping equality aῼabd = aῳABD
full case folding equaility aßEaİdssf = aSSEai̇dßf
simple case folding equality SomeMixedCaseStringWithNothingSpecialOtherThanBeingLong = sOMEmIXEDcASEsTRINGwithnOTHINGsPECIALoTHERtHANbEINGlONG
Nomal case equality Normal cmp = Normal cmp
Tolower somemixedcasestringwithnothingspecialotherthanbeinglong
To upper SOMEMIXEDCASESTRINGWITHNOTHINGSPECIALOTHERTHANBEINGLONG
ꭰ AB70
Ꭰ
Ꭰ
ꭰ
to upper ABCDEF 0123456789, ÄÖÜ, ÄÖÜ, ÁÓÚ FEDCBA DŽ
to lower abcdef 0123456789, äöü, äöü, áóú fedcba dž
to Title Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
🅐A🅚🅐K🅝
left 2 🅐A
right 2 K🅝
mid 1,4 A🅚🅐K
chr_Asc_((Left_example,1))) 🅐
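For readers wondering why the built-in Left() is not enough for the emoji line, here is a small sketch of a surrogate-aware left. LeftCP is a hypothetical name and this is not the module's implementation. It assumes the source file is saved as UTF-8 so the literal survives.

```purebasic
; Returns the first count code points of s, keeping surrogate pairs intact.
Procedure.s LeftCP(s.s, count.i)
  Protected units.i = 0, total.i = Len(s), c.i
  While count > 0 And units < total
    c = PeekU(@s + units * SizeOf(Character))   ; read one UTF-16 code unit
    If c >= $D800 And c <= $DBFF And units + 1 < total
      units = units + 2                         ; high surrogate: take the whole pair
    Else
      units = units + 1
    EndIf
    count = count - 1
  Wend
  ProcedureReturn Left(s, units)
EndProcedure

Define sample.s = "🅐A🅚🅐K🅝"
Debug Len(sample)          ; 10 code units for 6 code points
Debug LeftCP(sample, 2)    ; 🅐A
```

A plain Left(sample, 2) would return only the two code units of the first emoji, which is why the count has to be translated from code points to code units first.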
Appears to check out now.
abcdef 0123456789, äöü, äöü, áóú FEDCBA Dž
1C4
1C6
1C4
Abcdef 0123456789, Äöü, Äöü, Áóú Fedcba Dž
$1C4 = 1C4 | 1C6 | 1C5
$1C5 = 1C4 | 1C6 | 1C5
$1C6 = 1C4 | 1C6 | 1C5
Re: utf16 string module StrCmp full case folding
Updated to v2.0.0 and renamed to
UTF16 Utility Module
Provides UTF-16 support to PB, with:
full case folding compare, similar to CompareMemoryString
in-place string case mappings for uppercase, lowercase and titlecase
string replacements for left, mid, right, len, ucase, lcase, asc, chr
an additional tcase (title case)
https://github.com/idle-PB/UTF16
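To illustrate what "full" case folding means in the compare, here is a deliberately tiny sketch. Only two status F foldings from CaseFolding.txt are hard-coded (U+00DF and U+0130), and LCase() stands in for everything else. FoldSketch is a made-up name and the module's real table-driven code will differ.

```purebasic
; Folds a string for caseless comparison (illustration only, not a complete table).
Procedure.s FoldSketch(s.s)
  Protected result.s = "", i.i, ch.s
  For i = 1 To Len(s)
    ch = Mid(s, i, 1)
    Select Asc(ch)
      Case $DF                                ; U+00DF folds to "ss"
        result = result + "ss"
      Case $130                               ; U+0130 folds to "i" + U+0307
        result = result + "i" + Chr($307)
      Default
        result = result + LCase(ch)           ; stand-in for the simple foldings
    EndSelect
  Next
  ProcedureReturn result
EndProcedure

Define a.s = "a" + Chr($DF) + "E"   ; 3 characters
Define b.s = "ASSe"                 ; 4 characters
If FoldSketch(a) = FoldSketch(b)
  Debug "equal under full case folding"
EndIf
```

The two inputs have different lengths, which is the point: a full folding can expand one character into several, so the comparison cannot simply walk both strings unit by unit.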
Re: utf16 string module StrCmp full case folding
Redid strLCase / strUCase, added speed tests
UTF16 Strcmp(s3,s4) 68 ms for 1,000,000
CompareMemoryString(@s3,@s4) 62 ms for 1,000,000
UTF16 strLCase / strUcase 41 ms for 1,000,000
LCase / UCase 486 ms for 1,000,000
https://github.com/idle-PB/UTF16
Re: utf16 string module StrCmp full case folding
Hello Idle,
Apparently your code is no longer working with PB 6.03 LTS. I got an error on line 4749 (Function : StrChr(v.i))
Chr(): Invalid value for Chr(), should be between 0 and $D7FF or between $E000 and $FFFF.
Best regards
StarBootics
The Stone Age did not end due to a shortage of stones !
Re: utf16 string module StrCmp full case folding
This should work:
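Procedure.s StrChr(v.i) ; return a proper surrogate pair for Unicode values outside the BMP (Basic Multilingual Plane)
  Protected buffer.q
  If v < $10000
    ProcedureReturn Chr(v) ; BMP value: Chr() handles it directly
  Else
    ; low word = high surrogate, high word = low surrogate (little-endian order in memory)
    buffer = (v&$3FF)<<16 | (v-$10000)>>10 | $DC00D800
    ProcedureReturn PeekS(@buffer, 2, #PB_Unicode) ; read the two UTF-16 code units as a string
  EndIf
EndProcedure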
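For anyone who wants to check the bit arithmetic by hand, here is a worked example for U+1F150 (NEGATIVE CIRCLED LATIN CAPITAL LETTER A). The variable names are only for illustration. The packed value is what the buffer in StrChr should end up holding for this input.

```purebasic
; Split a code point above $FFFF into its UTF-16 surrogate pair.
Define v.q    = $1F150
Define high.q = $D800 + ((v - $10000) >> 10)    ; first code unit
Define low.q  = $DC00 + ((v - $10000) & $3FF)   ; second code unit
Debug Hex(high)    ; D83C
Debug Hex(low)     ; DD50

; Packed little-endian: the low word holds the high surrogate.
Define packed.q = (low << 16) | high
Debug Hex(packed)  ; DD50D83C

; Reverse direction: rebuild the code point from the pair.
Define back.q = $10000 + ((high - $D800) << 10) + (low - $DC00)
Debug Hex(back)    ; 1F150
```

The OR in StrChr gives the same result as the additions here because $D800 and $DC00 both have their lowest 10 bits clear, so the surrogate base and the 10-bit payload never overlap.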
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more ― Typeface - Sprite-based font include/module
Re: utf16 string module StrCmp full case folding
Thanks STARGÅTE. There was a range check added in the IDE; it was still compiling from the command line.
Re: utf16 string module StrCmp full case folding
The issue was caused by a range check that was added to the IDE. I've asked for it to be removed, as it really doesn't make much sense.
See STARGÅTE's fix above.
Re: utf16 string module StrCmp full case folding
Fixed. The examples are back up and should contain the emoji (line 5077).
Re: utf16 string module StrCmp full case folding
Added function to strip accents in UTF16a.pb
https://github.com/idle-PB/UTF16
"@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙"
becomes
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „.†‡ ‰S‹STZZ ‘’“”•–— Ts›stzz ˇ Ł¤A¦§ ©S«¬®Z°± ł μ¶· as»L lzRAAAALCCCEEEEIIDĐNNOOOO×RUUUUYTßraaaalccceeeeiidđnnoooo÷ruuuuyt
performance for 1,000,000
UTF16::Strcmp(s3,s4) 81 ms
CompareMemoryString(@s3,@s4) 782 ms
UTF16::strLCase / UTF16::strUcase 48 ms
LCase / UCase 461 ms