utf16 string module StrCmp full case folding


Post by idle »

The UTF16 utility module adds UTF-16 support to PB:
full case folding compare, similar to CompareMemoryString
in-place string case mappings for uppercase, lowercase and titlecase
string replacements for Left, Mid, Right, Len, UCase, LCase, Asc, Chr
an additional tcase (title case)
strip accents

The data supports both implementations that require simple case foldings
(where string lengths don't change), and implementations that allow full case folding
(where string lengths may grow). Note that where they can be supported, the
full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
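
To make the difference concrete, here is a minimal sketch that uses only PureBasic's built-in CompareMemoryString, not this module. It assumes the built-in case-insensitive mode compares character by character, so it can match "MASSE" with "masse" but should not match "MASSE" with "Maße", because the two strings have different lengths.

```purebasic
; Sketch only: shows what a simple (length-preserving) compare can and cannot match.
Define a.s = "MASSE"
Define b.s = "masse"
Define c.s = "Ma" + Chr($DF) + "e"   ; "Maße": 4 characters, $DF is sharp s

; Same length, differs only by case: the built-in compare can match this.
If CompareMemoryString(@a, @b, #PB_String_NoCase) = #PB_String_Equal
  Debug "simple compare: MASSE = masse"
EndIf

; Sharp s would have to expand to "ss" for these to match,
; which needs full case folding.
If CompareMemoryString(@a, @c, #PB_String_NoCase) <> #PB_String_Equal
  Debug "simple compare: MASSE <> Ma" + Chr($DF) + "e"
EndIf
```

A full case folding compare, such as the module's StrCmp, is meant to treat the second pair as equal as well.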
;UTF16 v 2.0.0
;Authors idle mk-soft 15/11/22 - 10/5/23
;license MIT
;
;full case folding is required when strings differ in length but are deemed equivalent
;see https://www.unicode.org/Public/UCD/late ... olding.txt
;for example "mASSE" and "Maße" are equal.
;this provides a fast scalable string compare
;casemappings
;see "http://www.unicode.org/Public/UCD/lates ... deData.txt"
;

;#CaseNormal s <> S
;#CaseSimple s = S
;#CaseFull ss = ß

;History
;v1.2.1
;redone in table, needs improvement
;added normal strcmp for completeness
;v1.2.2
;fixed stride bug if size of mapped char > $FFFFFF + 3 to other string
;v1.2.3
;swapped around mapping was in reverse order
;v1.2.4
;returns 1 if strings are equal
;v1.2.5 19/12/22
;changed flag #CASEWITHCASE to #CASENORMAL
;v1.2.6 Changed to support surrogate pairs for UTF16 support
; added chr_() asc_() functions for surrogate pairs
;v1.2.7 fixed bug in _asc function
;v1.2.8 fixed short string bug
;v1.2.9 fixed bug in same case mapping 1st char
;v1.2.10 fixed start of table
;v1.2.11 Added Left_, Right_, Len_, Is_UTF16 : mk-soft
;v1.2.12a Added Mid_. pUpCase, pLowCase : idle
;v1.2.13a Redid Casemapping data added pTitleCase : idle

;v2.0.0 Renamed module and its functions as it's grown beyond casefolding
;v2.0.1 Redid strLcase and strUcase, removed redundant ifs, redid arrays for better cache locality, added speed test for strLcase and strUcase : idle


Implementations v2
https://github.com/idle-PB/UTF16

Performance with c backend


UTF16 Strcmp(s3,s4) 68 ms for 1,000,000
PB CompareMemoryString(@s3,@s4) 62 ms for 1,000,000

UTF16 strLCase / strUcase 41 ms for 1,000,000
PB LCase / UCase 486 ms for 1,000,000


Note: If you need to support Turkish with full case folding use the StrcmpTK function.
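
Turkish needs its own function because CaseFolding.txt lists separate T mappings for the dotted and dotless I, and these conflict with the default mapping of I to i. The sketch below illustrates only that conflict. FoldI is a hypothetical toy helper, not part of the module, and it ignores every other character as well as the default full mapping of U+0130.

```purebasic
; Hypothetical helper: folds only the letter I and its Turkish variant.
; turkish = #True applies the T mappings from CaseFolding.txt.
Procedure.i FoldI(c.i, turkish.i = #False)
  If turkish
    Select c
      Case $0049
        ProcedureReturn $0131   ; I -> dotless i
      Case $0130
        ProcedureReturn $0069   ; I with dot above -> i
    EndSelect
  ElseIf c = $0049
    ProcedureReturn $0069       ; default mapping: I -> i
  EndIf
  ProcedureReturn c             ; everything else is left unchanged
EndProcedure

Debug Hex(FoldI('I'))           ; 69
Debug Hex(FoldI('I', #True))    ; 131
```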
Re: strcmp string compare for simple case and full case folding

Post by idle »

Updated: added support for surrogate pairs, so the following now compares as equal:

Code: Select all

sa = "a" + _Chr($10400) + "abd"
sb = "A" + _Chr($10428) + "ABD"
If StrCmp(sa,sb)
  Debug "surrogate pairs " + sa + " = " + sb
EndIf
surrogate pairs a𐐀abd = A𐐨ABD
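
For readers unfamiliar with surrogate pairs, the arithmetic behind them can be sketched as follows. SurrogatePair is a hypothetical helper written for illustration and is not the module's _Chr; it assumes the built-in Chr() accepts any 16-bit value, including surrogate code units.

```purebasic
; Hypothetical helper: encode a code point in the range
; U+10000..U+10FFFF as two UTF-16 code units.
Procedure.s SurrogatePair(codePoint.l)
  Protected v.l = codePoint - $10000
  Protected high.l = $D800 + (v >> 10)    ; upper 10 bits
  Protected low.l  = $DC00 + (v & $3FF)   ; lower 10 bits
  ProcedureReturn Chr(high) + Chr(low)
EndProcedure

Define s.s = SurrogatePair($10400)
Debug Hex(PeekU(@s))       ; D801
Debug Hex(PeekU(@s + 2))   ; DC00
Debug Len(s)               ; 2, the built-in Len() counts code units
```

The built-in Len() reports 2 for a single character here, which is why functions that work on PB strings need to be aware of surrogate pairs.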

Post by idle »

v1.2.7: fixed a bug in the asc function.

Post by Sicro »

I've now had a bit of time to test it again. Sorry, but something is still wrong:

Code: Select all

Debug CaseFolding::StrCmp("ß", "ss") ; returns `0`
Debug CaseFolding::StrCmp("ßz", "ssz") ; returns `1`
Debug CaseFolding::StrCmp("zß", "zss") ; returns `0`

Post by idle »

I broke it when adding the surrogate pairs; it's fixed now.
The problem was a boundary check: the loop has to test that it doesn't run past the end of the string.
It's a complicated bit of code; I hope it's all correct now.

Code: Select all

 While (((aa & $ffff) = Casemapping(mode,*b\a[cb])) And *b\a[cb] <> 0) 
Debug StrCmp("ß", "ss") ; returns `1`
Debug StrCmp("ßz", "ssz") ; returns `1`
Debug StrCmp("zß", "zss") ; returns `1`
I've also added it to GitHub; see the OP for links.

Post by Sicro »

Now everything works correctly. I have checked it with all characters of the `CaseFolding.txt` file. Well done :)
idle wrote: Sat Jan 21, 2023 8:11 pm I've also added it to GitHub; see the OP for links.
Nice, I gave it a star.

Post by Sicro »

Unfortunately, I have now found something after all:

Code: Select all

; 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
; 1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
; 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
CaseFolding::StrCmp(Chr($00DF), Chr($1E9E))
Produces an infinite loop.

Post by idle »

I think I've caught it. When both characters mapped to the same expanded sequence, the compare stalled on the same character. Testing with sharp S isn't really ideal, as both forms evaluate to the same expanded sequence 0073 0073.
I've replaced the tests with GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI, which folds to 03C9 03B9:

Code: Select all

    sa = "a" + _Chr($1FFC) + "abd" 
    sb = "a" + _Chr($1FF3) + "ABD" 
    
    ;1FFC; F; 03C9 03B9; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI   
    ;1FF3; F; 03C9 03B9; # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
  
    If StrCmp(sa,sb) 
        Debug "casemapping " + sa + " = " + sb  
    EndIf   

aῼabd = aῳABD
Fingers crossed it's working properly now, and after compiling with the C backend in 6.01b the performance is smoking hot 8)

Most languages would make three passes over the strings:

1) copy the strings
2) convert the copies to their full case mapped form
3) compare the converted strings

You only need one pass, though, and that's why the code is butt ugly!
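
The three steps above can be sketched as follows. This is the naive approach, not the module's single-pass code, and FoldToy and NaiveFullCompare are hypothetical names. The folding is a toy that only knows A-Z and the two sharp s characters; a real version would use the full CaseFolding.txt table.

```purebasic
; Hypothetical toy folding: lowercases A-Z and expands sharp s to "ss".
Procedure.s FoldToy(s.s)              ; step 1: s is passed by value (a copy)
  Protected result.s, i, c
  For i = 1 To Len(s)
    c = Asc(Mid(s, i, 1))
    Select c
      Case 'A' To 'Z'
        result + Chr(c + 32)          ; simple mapping, length unchanged
      Case $DF, $1E9E
        result + "ss"                 ; full mapping, the string grows
      Default
        result + Chr(c)
    EndSelect
  Next
  ProcedureReturn result              ; step 2: the folded string
EndProcedure

Procedure NaiveFullCompare(a.s, b.s)
  ProcedureReturn Bool(FoldToy(a) = FoldToy(b))   ; step 3: compare
EndProcedure

Debug NaiveFullCompare("MASSE", "Ma" + Chr($DF) + "e")   ; 1
```

Every call allocates and builds two new strings before comparing anything, which is the overhead a single-pass compare avoids.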
Smokin! wrote: Strcmp(s3,s4) 67 ms for 1,000,000
CompareMemoryString(@s3,@s4) 64 ms for 1,000,000

Post by Fred »

Glad to see the improvements in C optimization are working as expected! What was the timing for the ASM backend?

Post by idle »

Fred wrote: Mon Jan 23, 2023 11:10 am Glad to see the improvements in C optimization are working as expected! What was the timing for the ASM backend?
Around 390 ms; it was 160 before my bug fixes. I will take a look at the assembly when I get time tomorrow.
Really cool result and it was a good bit of code for the optimization. I will try it on the elliptic curve module too.

Post by RichAlgeni »

Image
Google is a pain!

Post by idle »

Thanks for the heads up. Google's embargo is specifically about Dnscope.exe and its installer. It's laughable that I'm considered to be an existential threat by Google. You can still get casefold.pb from GitHub.

Post by Rinzwind »

It probably helps if you put your executables in zip files instead of offering the exes directly. Google rules the internet...

Post by RichAlgeni »

idle wrote: Fri Feb 24, 2023 11:22 pm It's laughable that I'm considered to be an existential threat by Google.
Always have to watch out for those New Zealanders!!! Funny accents, and all!

Post by idle »

RichAlgeni wrote: Sat Feb 25, 2023 5:01 pm
idle wrote: Fri Feb 24, 2023 11:22 pm It's laughable that I'm considered to be an existential threat by Google.
Always have to watch out for those New Zealanders!!! Funny accents, and all!
The Google block's been removed for now; maybe it was something I said.