Slow regexp replacement on long strings? Alternatives?
Hi,
I'm using PB regex to replace sensitive information in logfile strings (up to 5 MB in length). These can be any sort of value, since the system handles a lot of credentials. For this I use a regexp like this:
(search1|search2|search3|...|searchN)
to replace searchN with "---".
The problem is that there are up to 500 "searchN" words in the regex to clean out and replace with "---".
This takes around 12 seconds on a string of only 217 KB (Windows, Xeon 2.6 GHz).
I don't insist on regex and am open to any other solution. Any ideas for a faster replacement of that many words in strings?
Re: Slow regexp replacement on long strings? Alternatives?
Kukulkan wrote: (search1|search2|search3|...|searchN)
to replace searchN with "---".
If they are fixed strings, ReplaceString() will be much faster.
(There is no point in using a regex if you are not matching a pattern mask.)
If the log file is a CSV, it is also possible to load it into an SQL database.
(Disable auto-commit to load faster)
Kukulkan wrote: The problem is, that there are up to 500 "searchN" words in the regex to clean out and replace by "---".
No matter how you do it, searching for 500 different words and replacing them will be slow.
Considering that log files often have a constant number of fields, it is simpler and faster to remove the fields to be hidden, using StringField().
Sometimes, to quickly create a test log, we just replace part (the beginning or the end) of the fields to hide, whatever their content; that is very fast.
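The field-based idea above can be sketched roughly as follows. This is only an illustration in C (chosen because the original poster mentions later in the thread that the final code will be in C), not PureBasic code. The function name `mask_field`, the `;` delimiter and the sample line are assumptions made up for the example; the field index is 1-based, like StringField().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite field number `index` (1-based) of a delimiter-separated line
   with `fill`, in place. The line keeps its length.
   Returns 1 if the field exists, 0 otherwise.
   Hypothetical helper, not part of any library. */
int mask_field(char *line, int index, char delim, char fill)
{
    char set[2] = { delim, '\0' };   /* delimiter as a string for strcspn */
    char *start = line;
    int field;

    assert(index >= 1 && delim != '\0');

    /* Skip the fields that come before the requested one. */
    for (field = 1; field < index; field++) {
        start = strchr(start, delim);
        if (start == NULL)
            return 0;                /* line has fewer fields than `index` */
        start++;                     /* step over the delimiter */
    }

    /* The field ends at the next delimiter or at the end of the line. */
    memset(start, fill, strcspn(start, set));
    return 1;
}
```

Because only one field is touched per line, the cost does not depend on how many different secret values exist, but it only helps when the sensitive data always sits in known columns.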
Re: Slow regexp replacement on long strings? Alternatives?
Regex is 600x slower than PB MemoryString code.
So 12sec would drop to ~20msec.
I never use regex for this reason.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Re: Slow regexp replacement on long strings? Alternatives?
If you're looking for exact or prefix matches, you could try the FindStrings example in Squint. The example isn't particularly optimised, but you could easily try it and overwrite the data in the source string as below; it should be faster than ReplaceString() as well.
https://www.purebasic.fr/english/viewto ... 12&t=74786
Code:
Global String1.s = "373 ac3 b9d45 b iPdC ks23 al97 373 ac5 al99 346 vs42159ssbpx roro ask ePOC foo bar xyz 12dk tifer erer e"
Global String2.s = "346 373 iPdC roro ePOC ac3" ;<-strings you're interested in finding
Global Replace.s = "-----------------------------------------------------------"
Global FindStringsItems.FindStrings
Global *squint.squint = Squint_New()
FindStrings(*squint,@String1,@String2,@FindStringsItems) ;builds trie and returns the count
ForEach FindStringsItems\item()
  Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
  ForEach FindStringsItems\item()\positions()
    ;overwrite each hit in the source string with dashes of the same length
    CopyMemory(@Replace,@String1+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character))
  Next
Next
results in
--- --- b9d45 b ---- ks23 al97 --- ac5 al99 --- vs42159ssbpx ---- ask ---- foo bar xyz 12dk tifer erer e
Windows 11, Manjaro, Raspberry Pi OS
Re: Slow regexp replacement on long strings? Alternatives?
Based on your answers, I'm considering ReplaceString() with #PB_String_InPlace. I only need to replace each match with a placeholder string of the same byte-length. I will give it a try.
Thanks all of you!
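For illustration, the same-length, in-place idea looks roughly like this when written with direct memory access. It is a minimal sketch in C rather than PureBasic, and it assumes a NUL-terminated single-byte buffer; `mask_in_place` and `mask_all` are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite every occurrence of `word` in `text` with `fill`, in place.
   The text never changes length, so nothing is moved or reallocated.
   Returns the number of occurrences that were masked. */
size_t mask_in_place(char *text, const char *word, char fill)
{
    size_t len = strlen(word);
    size_t count = 0;
    char *hit = text;

    assert(len > 0);                 /* an empty keyword would never advance */

    while ((hit = strstr(hit, word)) != NULL) {
        memset(hit, fill, len);      /* same length: a pure overwrite */
        hit += len;                  /* continue after the masked span */
        count++;
    }
    return count;
}

/* Apply a whole keyword list. Note that this still scans the text
   once per keyword, so the cost grows with the number of keywords. */
size_t mask_all(char *text, const char *const *words, size_t n_words, char fill)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < n_words; i++)
        total += mask_in_place(text, words[i], fill);
    return total;
}
```

Keeping the length constant avoids rebuilding the string on every hit, but with 500 keywords the text is still walked 500 times.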
Re: Slow regexp replacement on long strings? Alternatives?
Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?
sorry for my bad english
Re: Slow regexp replacement on long strings? Alternatives?
Josh wrote: Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?
Hi Josh. It's mostly passwords or hex sequences (keys). Some are plain, some are in quotes and some are in square brackets.
Re: Slow regexp replacement on long strings? Alternatives?
Have a look at xombie's post in this thread: viewtopic.php?t=26689
Re: Slow regexp replacement on long strings? Alternatives?
Just in case you didn't find a viable solution, I made a test with PB's internal functions: lists and ReplaceString() with #PB_String_InPlace.
To use #PB_String_InPlace, I pad the replacement to the keyword length with RSet():
Code:
ReplaceString(Txt$, All_KeyWords$(), RSet("", Len(All_KeyWords$()), "X"), #PB_String_InPlace)
I load all the keywords into a list, then loop over it (ForEach), applying each replacement to the whole file, which was previously loaded into a single variable.
Log test file: 6.8 MB (35,526 lines)
Search: 500 keywords (all different, so no regex)
Result: 264,000 keywords replaced
Time: 27 sec (14 sec without debug output)
Computer: i7-8700 @ 3.2 GHz, file on SSD
And again, it's not very optimized; I think we can do better with Peek and Poke.
Re: Slow regexp replacement on long strings? Alternatives?
@Marc56us: Thanks for the tests. I also found replacing faster than the RegEx, but not fast enough.
We are now trying a B-Tree implementation for the keywords, so that only one pass through the logfile content is needed. We are doing it in C, as we will need it in other places, too. No results yet, as it is low priority...
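A rough sketch of what such a single-pass approach could look like in C. It uses a plain byte trie rather than a B-Tree, purely as an assumption for illustration, so it is not necessarily the structure the poster has in mind. All names (`TrieNode`, `trie_add`, `trie_mask`, ...) are hypothetical. At every text position the trie is followed to find the longest keyword starting there, so the running time depends on the text length and keyword length, not on the number of keywords.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One trie node: a child slot per byte value plus an end-of-keyword flag. */
typedef struct TrieNode {
    struct TrieNode *child[256];
    int is_end;
} TrieNode;

TrieNode *trie_new(void)
{
    /* calloc zero-fills; on mainstream platforms that means NULL children. */
    return (TrieNode *)calloc(1, sizeof(TrieNode));
}

/* Insert one keyword. Returns 0 on success, -1 if allocation fails. */
int trie_add(TrieNode *root, const char *word)
{
    const unsigned char *p = (const unsigned char *)word;
    TrieNode *node = root;

    assert(root != NULL && word[0] != '\0');
    for (; *p != '\0'; p++) {
        if (node->child[*p] == NULL) {
            node->child[*p] = trie_new();
            if (node->child[*p] == NULL)
                return -1;
        }
        node = node->child[*p];
    }
    node->is_end = 1;
    return 0;
}

void trie_free(TrieNode *node)
{
    int i;
    if (node == NULL)
        return;
    for (i = 0; i < 256; i++)
        trie_free(node->child[i]);
    free(node);
}

/* Walk `text` once. At each position, follow the trie to find the longest
   keyword starting there and overwrite it with `fill` (same length).
   Returns the number of masked occurrences. */
size_t trie_mask(const TrieNode *root, char *text, char fill)
{
    size_t count = 0;
    size_t i = 0;

    while (text[i] != '\0') {
        const TrieNode *node = root;
        size_t best = 0;             /* length of the longest match at i */
        size_t j;

        for (j = i; text[j] != '\0'; j++) {
            node = node->child[(unsigned char)text[j]];
            if (node == NULL)
                break;
            if (node->is_end)
                best = j - i + 1;
        }
        if (best > 0) {
            memset(text + i, fill, best);
            i += best;               /* skip the masked span */
            count++;
        } else {
            i++;
        }
    }
    return count;
}
```

The 256-pointer nodes trade memory for simplicity (about 2 KB per node on a 64-bit system); a more compact node layout or an Aho-Corasick automaton would be natural refinements if this turns out to be the bottleneck.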