## autocomplete Trie

Working on new editor enhancements?
idle
Posts: 3585
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

### autocomplete Trie

Working on a replacement for Autocomplete with a Trie structure bit of a balancing act trying to constrain memory use, though I think I've got a workable compromise between memory and speed by interpreting the strings in nibbles, this reduces the memory by 8 times at the cost of 2 times the lookup, it should be more efficient than current implementation but before I put in place it should get a look over for any bugs. These kind of algorithms do my head in.

Test code loads mozilla GB dictionary from marco's git, not sure if my cursory pass of the file is 100% but it should show it working

Update 0.95
changed structure to array based to facilitate load, save and catch and to keep the memory size the same as x86 on x64
while it will dynamically grow it will incur excessive page faults, so you need to allocate enough slots

Update 0.96
Mashed it into scintiila autocomplete
*note it's still a little buggy
Return enter selection space bar to cancel
Tested on xp with debugger off works better, linux x64 with debugger seems fine

Update 0.96.1

Code: Select all

:note if your rebuilding this between versions please make sure you deleted the file en-GB_trie.tri
DeclareModule TRIE

; Nibbled TRIE, for unicode strings; Access time O(m*2)
; Array based can be saved loaded and caught from memory with compression
; Note node value -1 is reserved for traversal checks
; Author: Idle
; Version: 0.96.1
; Pb 5.71 lts
; Created: 24/01/2020
; Licence MIT

#TRIE_MAX = 1024 ;Max key length in bytes for enumeration of Trie

Structure trie_node
index.l[16]
value.l
EndStructure

Structure Trie
*vt
size.l
avail.l
position.l
datasize.l
nodecount.l
count.l
sb.a[#TRIE_MAX]
*root.Trie_node
EndStructure

Prototype Trie_CB(*Input,len)
Declare Trie_New(size=512)
Declare Trie_Free(*this.Trie)
Declare Trie_Set(*this.Trie,*key,len,value.l=1,mode=#PB_UTF8)
Declare Trie_Get(*this.Trie,*key,len,mode=#PB_UTF8)
Declare Trie_iEnum(*this.Trie,*node.Trie_Node,depth,*pfn.Trie_CB)
Declare Trie_Enum(*this.Trie,*key,len,*pfn.Trie_CB,ReturnMatch=0,mode=#PB_UTF8)
Declare Trie_Clear(*this.Trie)

Interface ITrie
Free()
Set(*key,StringByteLength,value.l=1,mode=#PB_UTF8)
Get(*key,StringBytelength,mode=#PB_UTF8)
Enum(*key,StringBytelength,*pfn.Trie_CB,ReturnMatch=0,mode=#PB_UTF8)
Clear()
Save(File.s,CompressLevel=4)
Stats(item)
EndInterface

EndDeclareModule

Module TRIE

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
ImportC "zlib.lib"
compress2(*dest,*destlen,*source,sourcelen,level)
uncompress(*dest,*destlen,*source,sourcelen)
EndImport
CompilerElse
ImportC "-lz"
compress2(*dest,*destlen,*source,sourcelen,level)
uncompress(*dest,*destlen,*source,sourcelen)
EndImport
CompilerEndIf

EnableExplicit
;Thanks Wilbert for a sane version of PeekN and PokeN
;Peek Nibble in order High Low
Macro PeekNHL(in,Index,out)
*Mem.Ascii = in
*Mem + Index >> 1
If Index & 1
out = *Mem\a & $f Else out = *Mem\a >> 4 EndIf EndMacro Macro PokeNHL(in, Number, Index) *Mem.Ascii = in *Mem + Index >> 1 If Index & 1 *Mem\a = (*Mem\a &$f0) | (Number & $f) Else *Mem\a = (*Mem\a &$0f) | (Number << 4)
EndIf
EndMacro

Procedure Trie_Stats(*this.TRIE,item=1)
Select item
Case 1
ProcedureReturn *this\datasize
Case 2
ProcedureReturn MemorySize(*this\root)
Case 3
ProcedureReturn MemorySize(*this\root) / *this\datasize
Case 4
ProcedureReturn *this\nodecount
Case 5
ProcedureReturn *this\count
EndSelect
EndProcedure

Procedure Trie_Clear(*this.Trie)
Protected idx,*node.Trie_node,size
size = MemorySize(*this\root)
While idx < size
*node = *this\root+idx
If *node\value
*node\value = -1
EndIf
idx + SizeOf(Trie_node)
Wend
EndProcedure

Procedure Trie_New(size=512)
Protected *obj.Trie = AllocateMemory(SizeOf(Trie))
If *obj
*obj\vt=?vt_Trie
*obj\size = size * SizeOf(Trie_node)
*obj\avail= *obj\size  - SizeOf(Trie_node)
*obj\position = SizeOf(Trie_node)
*obj\root = AllocateMemory(*obj\size)
EndIf
ProcedureReturn *obj
EndProcedure

Procedure Trie_Free(*this.Trie)
FreeMemory(*this\root)
FreeMemory(*this)
EndProcedure

;Set or Unset a value in the Trie, len must be StringByteLength
Procedure Trie_Set(*this.Trie,*key,len,value.l=1,mode=#PB_UTF8)
Protected *node.Trie_node,*ikey,ilen,a,idx,*mem.Ascii,offset
*node = *this\root

If mode <> #PB_UTF8
*ikey = UTF8(PeekS(*key,len))
len = MemorySize(*ikey)
Else
*ikey = *key
EndIf

*this\datasize + len

ilen = (len*2)-1

For a = 0 To ilen
PeekNHL(*ikey,a,idx)
If *node\index[idx]
offset = *node\index[idx]
*node = *this\root + offset
ElseIf *this\avail  > 0
offset = *this\Position
*node\index[idx] = offset
*node = *this\root + offset
*node\value = -1
*this\nodecount + 1
*this\avail - SizeOf(Trie_node)
*this\Position+SizeOf(Trie_node)
Else
*this\avail = 512*SizeOf(Trie_node)
*this\size + *this\avail
*this\root = ReAllocateMemory(*this\root,*this\size)
*node = *this\root + offset
offset = *this\Position
*node\index[idx] = offset ;
*node = *this\root + offset
*node\value = -1
*this\nodecount + 1
*this\avail - SizeOf(Trie_node)
*this\Position+SizeOf(Trie_node)
EndIf
Next

If *node\value
*this\count+1
*node\value = value
EndIf
If mode <> #PB_UTF8
FreeMemory(*ikey)
EndIf
EndProcedure

Procedure Trie_Get(*this.Trie,*key,len,mode=#PB_UTF8)
;Find the value of the key returns the value or 0
Protected *ikey,ilen,*node.Trie_Node,a,idx.i,*mem.Ascii

If mode <> #PB_UTF8
*ikey = UTF8(PeekS(*key,len))
len = MemorySize(*ikey)
Else
*ikey = *key
EndIf

ilen = (len*2)-1
*node.Trie_node = *this\root

For a = 0 To ilen
PeekNHL(*ikey,a,idx)
If *node
If *node\index[idx]
*node = *this\root + *node\index[idx]
Else
If mode <> #PB_UTF8
FreeMemory(*ikey)
EndIf
ProcedureReturn 0
EndIf
Else
FreeMemory(*ikey)
ProcedureReturn 0
EndIf
Next
If mode <> #PB_UTF8
FreeMemory(*ikey)
EndIf
If *node\value > -1
ProcedureReturn *node\value
EndIf

EndProcedure

Procedure Trie_iEnum(*this.Trie,*node.Trie_Node,depth,*pfn.Trie_CB)
Protected a.i,tkey.s,*mem.Ascii

If *node\value = 0
ProcedureReturn 0
EndIf
;build the key
For a=0 To 15
If *node\index[a]
PokeNHL(@*this\sb,a,depth)
Trie_iEnum(*this,*this\root + *node\index[a],depth+1,*pfn)
EndIf
Next
;lookup the key value
If *node\value > 0
If *pfn
*pfn(@*this\sb,depth/2-1)
EndIf
EndIf
ProcedureReturn *node

EndProcedure

Procedure Trie_Enum(*this.Trie,*key,len,*pfn.Trie_CB,ReturnMatch=0,mode=#PB_UTF8)
;Enumerate all the entries below the given root key
Protected *node.Trie_Node,*tnode.Trie_Node,tkey.i,ilen.i,pos.i,*mem.Ascii,a,*ikey,lkey
If mode <> #PB_UTF8
*ikey = UTF8(PeekS(*key,len))
len = MemorySize(*ikey)
ilen =((len-1)*2)-1
Else
*ikey = *key
ilen =(len*2)-1
EndIf

FillMemory(@*this\sb[0],#TRIE_MAX,0)

If ReturnMatch
If Trie_Get(*this,*ikey,len)
If *pfn
*pfn(*ikey,len)
EndIf
ProcedureReturn
EndIf
EndIf

*node = *this\root

For a = 0 To ilen
PeekNHL(*ikey,a,tkey)
*node = *this\root + *node\index[tkey]
If *node\value = 0
If mode <> #PB_UTF8
FreeMemory(*ikey)
EndIf
ProcedureReturn 0
EndIf
Next

CopyMemory(*ikey,@*this\sb,MemorySize(*ikey))
Trie_iEnum(*this,*node,a,*pfn)

If mode <> #PB_UTF8
FreeMemory(*ikey)
EndIf
EndProcedure

Procedure Trie_SaveToFile(*this.trie,file.s,CompressLevel=4)
Protected fn,len,*dst,destlen,srclen
fn = OpenFile(#PB_Any,file)
If fn
srclen = MemorySize(*this\root)
destlen = (srclen * 1.5) + SizeOf(trie) + 8
*dst = AllocateMemory(destlen)
If *dst
If Compress2(*dst+(SizeOf(trie)+8),@destlen,*this\root,srclen,CompressLevel) = 0
PokeL(*dst,srclen)
PokeL(*dst+4,destlen)
CopyMemory(*this,*dst+8,SizeOf(Trie))
len = WriteData(fn,*dst,destlen+SizeOf(trie)+8)
CloseFile(fn)
FreeMemory(*dst)
ProcedureReturn len
EndIf
EndIf
EndIf
EndProcedure

Protected fn,destlen,srclen,*src,*dst
fn = OpenFile(#PB_Any,file)
If fn
*src = AllocateMemory(destlen)
*src = AllocateMemory(srclen)
FreeMemory(*this\root)
*this\vt=?vt_Trie
*this\root = AllocateMemory(destlen)
If uncompress(*this\root,@destlen,*src,srclen) = 0
FreeMemory(*src)
CloseFile(fn)
ProcedureReturn destlen
EndIf
EndIf
EndProcedure

Procedure Trie_Catch(*this.Trie,*mem,len)
Protected destlen,srclen,*src
If *mem
FreeMemory(*this\root)
destlen = PeekL(*mem)
srclen = PeekL(*mem+4)
CopyMemory(*mem+8,*this,SizeOf(Trie))
*this\vt=?vt_Trie
*this\root = AllocateMemory(destlen)
If uncompress(*this\root,@destlen,*mem+SizeOf(Trie)+8,srclen) = 0
ProcedureReturn destlen
EndIf
EndIf
EndProcedure

DataSection: vt_Trie:
Data.i @Trie_Free()
Data.i @Trie_Set()
Data.i @Trie_Get()
Data.i @Trie_Enum()
Data.i @Trie_Clear()
Data.i @Trie_SaveToFile()
Data.i @Trie_Catch()
Data.i @Trie_Stats()
EndDataSection

EndModule

CompilerIf #PB_Compiler_IsMainFile
;set compiler to create exe in current directory
;On 1st run it will downloads mozilla GB english dictionary and build the Trie
;when you close the window it will compress and save the Trie to file
;On 2nd run it will Load the saved Trie
;if you uncoment Include binary down the bottom it will catch the Trie from memory
UseModule TRIE
EnableExplicit
#Auto_Start =0
#Auto_Get =1
#Auto_Enum =2

Global TmpPath.s = GetTemporaryDirectory() +"en-GB.dic"
Global TriePath.s = GetPathPart(ProgramFilename()) + "en-GB_trie.tri"
Global *mt.ITrie = Trie_New(700384) ;specify number of nodes required, realloc can be very slow

Global index,result
Global *buf = AllocateMemory(1024)
Global itemsize=1024*1024
Global itempos
Global *items = AllocateMemory(itemsize)  ;utf8 list of words for lookup
Global *text

Procedure ClearItems()
FillMemory(*items,itemsize,0)
EndProcedure

Procedure cbTrie(*key,len)
Protected out.s
If (itempos + len + 1) < MemorySize(*items)
CopyMemory(*key,*items+itempos,len)
out = PeekS(*items+itempos,len,#PB_UTF8)
Debug out
Debug Str(Len(out)) + " " + Str(len)
itempos + len
PokeA(*items+itempos,32)
itempos + 1
Else
itemsize + len + 1
*items = ReAllocateMemory(*items,itemsize)
CopyMemory(*key,*items+itempos,len)
itempos + len
PokeA(*items+itempos,32)
itempos + 1
EndIf
EndProcedure

Protected fn,fm,pos,word.s,count
If fn
While Not Eof(fn)
pos = FindString(word,"/")
If pos > 0
word = Left(word,pos-1)
EndIf
word = LCase(word)
*trie\Set(@word,StringByteLength(word),1,#PB_Unicode)
count+1
Wend
CloseFile(fn)
EndIf
ProcedureReturn count
EndProcedure

Static Auto_state.i
Static pos.i,space.i

Protected mods = (#SC_MOD_DELETETEXT | #SC_PERFORMED_USER)
Select *scinotify\nmhdr\code

Case #SCN_AUTOCSELECTION
Debug "selected " + PeekS(*scinotify\text,-1,#PB_UTF8)
Auto_state = #Auto_Start
;ScintillaSendMessage(0,#SCI_AUTOCCANCEL)
pos = 0
Case #SCN_MODIFIED
If *scinotify\modificationType = mods Or *scinotify\modificationType =  (mods | #SC_STARTACTION)
pos - *scinotify\length
Debug "back spaced " + *scinotify\length
If pos < 1
Debug "cancel from back space"
Auto_state = #Auto_Start
ScintillaSendMessage(0,#SCI_AUTOCCANCEL)
pos = 0
EndIf
EndIf
Select *scinotify\ch
Case $A,$20
If Auto_state < #Auto_Enum
Debug "by space " + PeekS(*buf,pos,#PB_UTF8)
If *mt\Get(*buf,pos+1) = 0
*mt\Set(*buf,pos+1)
EndIf
Auto_state = #Auto_Start
pos=0
EndIf

Default
PokeA(*buf+pos,*scinotify\ch)
If pos > 1
Auto_state = #Auto_Get
ClearItems()
*mt\enum(*buf,pos+1,@cbTrie())
itempos=0
If ScintillaSendMessage(0,#SCI_AUTOCACTIVE) = 0
ScintillaSendMessage(0,#SCI_AUTOCSHOW,pos+1,*items)
Else
ScintillaSendMessage(0,#SCI_AUTOCCANCEL)
ScintillaSendMessage(0,#SCI_AUTOCSHOW,pos+1,*items)
EndIf
EndIf
pos+1
EndSelect
Default

EndSelect

EndProcedure

If ?TRIE_DICTEND - ?TRIE_DICT > 0
result = *mt\Catch(?TRIE_DICT)
Debug "Caught Trie :" + Str(result) + " bytes"
ElseIf FileSize(TriePath) > 0
Debug "Loaded Trie :" + Str(result) + " bytes"
Else
InitNetwork()
Debug "raw bytes loaded =" + Str(*mt\Stats(1))
Debug "size of data stored = " + Str(*mt\Stats(2))
Debug "data / raw = " + Str(*mt\Stats(3))
Debug "nodes " + Str(*mt\Stats(4))
Debug "items " + Str(*mt\Stats(5))
Else
End
EndIf
EndIf

If InitScintilla()
ScintillaSendMessage(0,#SCI_AUTOCSETIGNORECASE,#True)
ScintillaSendMessage(0,#SCI_AUTOCSETAUTOHIDE,0)
ScintillaSendMessage(0,#SCI_AUTOCSETMAXHEIGHT,10)
ScintillaSendMessage(0,#SCI_AUTOCSETORDER,#SC_ORDER_PRESORTED)
ScintillaSendMessage(0, #SCI_STYLESETFORE, 0, RGB(0, 0, 0))
*Text=UTF8(" ")
ScintillaSendMessage(0, #SCI_AUTOCSTOPS,0,*text)
FreeMemory(*Text)
ScintillaSendMessage(0, #SCI_SETTEXT, 0,*Text)
FreeMemory(*Text)
EndIf

Repeat : Until WaitWindowEvent() = #PB_Event_CloseWindow
EndIf
CloseWindow(0)

If FileSize(TriePath) < 0
result = *mt\Save(TriePath)
Debug "Saved Trie :" + Str(result) + " bytes"
EndIf

*mt\clear()
*mt\free()

DataSection : TRIE_DICT:
;IncludeBinary #PB_Compiler_FilePath + "en-GB_trie.tri"
TRIE_DICTEND:
EndDataSection

CompilerEndIf

box_80
User
Posts: 89
Joined: Mon Sep 03, 2012 8:52 pm

### Re: autocomplete Trie

Interesting, might try it out one of these days. Have not been programming much lately. But still enjoy see stuff like this. Thanks for sharing.
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

### Re: autocomplete Trie

Is this for the IDE?

What is the current data structured used in the IDE, and does it have any practical problems that justify the need for a more complicated data structure (assuming this is)?

idle
Posts: 3585
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

### Re: autocomplete Trie

Hello Trond,

Nice to see you back!

Yes that's the plan, current autocomplete is built with a map and lists, it rebuilds each time and sorts the list, it's not ideal, plus it doesn't work half the time on linux but that's a separate issue which needs resolving. While it works a Trie is much better fit for the task, load it once, add to and remove on the fly.
An alternative would be a Fuzzy Search rather than an exact match, that could be better from the onset, though until someones written one a Trie will be an improvement.
I'm still trying to work out what the best strategy is to handle structures and modules, once I've got a handle on how the IDE does it, I should be able to add it using the nodes value to say that the item is a root, like TRIE:: then if UseModule is active you don't want the root in the results, but if you didn't usemodule and typed in TRIE:: it would still enumerate it with the root TRIE::Get() instead of Get() if usemodule was active.
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

### Re: autocomplete Trie

idle wrote:Nice to see you back!
Thank you.

Remember to benchmark size and time. Sometimes performance isn't intuitive.

I did try to make a trie a long time ago, but couldn't make it work.
idle
Posts: 3585
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

### Re: autocomplete Trie

It'll be interesting to see how it compares, it's going to take some time to do anyhow and I really want to get the display working on linux before I start butchering it, the enumeration of the Trie could well be slower than the current implementation even once it's optimised.

I'm still not happy with the memory consumption, changing it to utf8 will save ~3rd and it shouldn't impact the enumeration, I could also make it array based rather than heap based which would at least reduce the overhead on x64 by ~1/2
chi
Enthusiast
Posts: 799
Joined: Sat May 05, 2007 5:31 pm
Location: Linz, Austria

### Re: autocomplete Trie

Works great! (if the debugger is disabled )
Et cetera is my worst enemy
kenmo
Posts: 1890
Joined: Tue Dec 23, 2003 3:54 am

### Re: autocomplete Trie

Great start! I'm very excited for IDE contributions. I have a few in progress.

For you I have 1 comment and 1 "bug report"

Comment: If you type a complete word, can you still show other suggestions? Or maybe make that a toggle setting?
If I type "end" I would want to see "EndIf" "EndProcedure" etc.

Bug: I get some wrong results...
type "mur" top result is "mubarak"
"murt" shows "murder" and other similar words with "d" instead of "t"
"ag" shows a lot of words starting with "aw"
freak
PureBasic Team
Posts: 5823
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

### Re: autocomplete Trie

idle wrote:Yes that's the plan, current autocomplete is built with a map and lists, it rebuilds each time and sorts the list, it's not ideal, plus it doesn't work half the time on linux but that's a separate issue which needs resolving. While it works a Trie is much better fit for the task, load it once, add to and remove on the fly.
The list is rebuilt every time because AutoComplete is context-sensitive (procedure and module scope, With-Statements for structured items etc.). The data is not static.

You could probably get rid of the sorting with a new data structure, but you should do some measurements first to see if this is actually problematic enough to be worth the increased complexity.
quidquid Latine dictum sit altum videtur
Josh
Posts: 1184
Joined: Sat Feb 13, 2010 3:45 pm

### Re: autocomplete Trie

@idle
There are many wrong results. If I type 'bb', there are only 4 entries that really start with 'bb', but I get several hundred results.

@freak
freak wrote:... AutoComplete is context-sensitive (procedure and module scope ...
Really? If I type the following example, then I immediately get a suggestion for 'MyVar'.

Code: Select all

Define MyVar

Procedure MyProc()
MyV

kenmo
Posts: 1890
Joined: Tue Dec 23, 2003 3:54 am

### Re: autocomplete Trie

Josh wrote:Really? If I type the following example, then I immediately get a suggestion for 'MyVar'.
It's definitely procedure-aware. Try below.

But you're right, inside a procedure it suggests a non-global defined outside. That seems like a bug, but a pretty harmless one.

Code: Select all

Global This_Is_Global
Define This_Is_Define

Procedure Test1(This_Is_1.i)
; type "this" here

EndProcedure

Procedure Test2(This_Is_2.i)
; type "this" here

EndProcedure

EDIT: Also, hello freak!
idle
Posts: 3585
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

### Re: autocomplete Trie

kenmo wrote:Great start! I'm very excited for IDE contributions. I have a few in progress.

For you I have 1 comment and 1 "bug report"

Comment: If you type a complete word, can you still show other suggestions? Or maybe make that a toggle setting?
If I type "end" I would want to see "EndIf" "EndProcedure" etc.

Bug: I get some wrong results...
type "mur" top result is "mubarak"
"murt" shows "murder" and other similar words with "d" instead of "t"
"ag" shows a lot of words starting with "aw"
Thanks and yes I had noticed it was populating the list out of order since I changed it to interpret the key by nibbles, I'm in the process of making it array based to reduce the memory and make it cache friendly, so will sort it out then.

@Josh
Thanks,

@Freak
Yes I understand why it's doing it but it's really not necessary to rebuild it every time to make it context sensitive, each token can have a type identifier and to handle things like modules it can be done a couple of ways. If you have Usemodule TRIE you'd first get to the rootnode of TRIE:: and then enumerate from there, if you don't have usemodule you would just enumerate it from the root as TRIE::Get ..., same will work for structures.
So the Trie can be statically built and added to when needed and the context is dynamic by the node values.

My main concern is to actually fix the display issues on Linux, which has nothing to do with the structure of the code
freak
PureBasic Team
Posts: 5823
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

### Re: autocomplete Trie

If you don't want to rebuild the structure at display time you have to do more work to keep it up to date when edits are made which is an even more performance critical code path. If you spend too much time there the editing will feel sluggish very quickly. The IDE had this problem a lot with the early implementations of AutoComplete when used with larger source files. This is why the current implementation is a compromise: The parsing work during editing is done only with minimal context which also means that reacting to an edit in the code means that only very little parsing is required to update the data (only the edited line in the best case) but there is work needed to build the list when AutoComplete is displayed. I think the resulting performance is well balanced on both sides.

This problem is a lot bigger than just a question of data structures. Trust me, I worked on this part a lot over the years
quidquid Latine dictum sit altum videtur
freak
PureBasic Team
Posts: 5823
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

### Re: autocomplete Trie

@Josh:

You are right, stuff in the main scope is actually treated as "global" even if it isn't. There is still room for improvement there.

It does work well the other way around though:

Code: Select all

Procedure MyProc(ParameterVar)
Protected ProtectedVar
Global GlobalVar
Shared SharedVar
Static StaticVar
EndProcedure
The variables inside the procedure are suggested in AutoComplete only when the cursor is actually inside the procedure, except for the Global/Shared one which are visible outside as it should be.
quidquid Latine dictum sit altum videtur
idle