Slow regexp replacement on long strings? Alternatives?

Post by Kukulkan »

Hi,

I'm using PB's regex functions to replace sensitive information in logfile strings (up to 5 MB long). The values can be of any kind, since the system handles a lot of credentials. For this I use a regexp like this:

(search1|search2|search3|...|searchN)

to replace searchN with "---".

The problem is that there are up to 500 "searchN" words in the regex to clean out and replace with "---".

This takes around 12 seconds on a string of only 217 KB (Windows, Xeon 2.6 GHz).

I don't insist on regex and am open to any other solution. Any ideas for a faster way to replace that many words in strings?
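
For anyone who wants to reproduce the timing, the setup described above might look roughly like this. It is only a minimal sketch: the two keywords and the sample text are made up, and the real pattern would contain up to 500 alternatives.

```purebasic
; Sketch of the alternation-regex approach with a simple timing.
; Keywords and sample text are illustrative assumptions.
EnableExplicit

Define Txt$ = "user=bob pass=secret123 key=deadbeef"
Define Start, Result$

If CreateRegularExpression(0, "(secret123|deadbeef)")
  Start = ElapsedMilliseconds()
  Result$ = ReplaceRegularExpression(0, Txt$, "---")
  Debug Result$                                   ; user=bob pass=--- key=---
  Debug Str(ElapsedMilliseconds() - Start) + " ms"
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```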

Post by Marc56us »

Kukulkan wrote: (search1|search2|search3|...|searchN)
to replace searchN with "---".
If they are fixed strings, ReplaceString() will be much faster.
(There is no point in using a regex if you are not matching a pattern.)

If the log file is a CSV, it is also possible to load it into an SQL database.
(Disable auto-commit to load faster)
Kukulkan wrote: The problem is that there are up to 500 "searchN" words in the regex to clean out and replace with "---".
No matter how you do it, searching for 500 different words and replacing them will be slow.

Considering that log files often have a constant number of fields, it is simpler and faster to remove the fields to be hidden, using StringField().

Sometimes, to quickly create a test log, we just replace part (the beginning or the end) of the fields to hide, whatever their content; that is very fast.

:wink:
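
A rough sketch of the field-based idea, assuming a log line with a fixed number of ";"-separated fields where the third field is the one to hide. The procedure name MaskField, the delimiter and the sample line are illustrative assumptions, not taken from a real log format.

```purebasic
; Sketch: rebuild a delimited log line, masking one field by position.
; Delimiter, field index and sample line are assumptions for illustration.
EnableExplicit

Procedure.s MaskField(Line$, FieldIndex, Delimiter$)
  Protected Result$, i
  Protected n = CountString(Line$, Delimiter$) + 1   ; number of fields
  For i = 1 To n
    If i > 1 : Result$ + Delimiter$ : EndIf
    If i = FieldIndex
      Result$ + "---"                                ; hide this field, whatever it contains
    Else
      Result$ + StringField(Line$, i, Delimiter$)
    EndIf
  Next
  ProcedureReturn Result$
EndProcedure

Debug MaskField("2024-01-01;bob;secret123;login ok", 3, ";")  ; 2024-01-01;bob;---;login ok
```

This only works when the sensitive value always sits in the same field; it does not help when credentials can appear anywhere in free text.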

Post by skywalk »

Regex is 600x slower than PB MemoryString code.
So 12 sec would drop to ~20 ms.
I never use regex for this reason.

Post by idle »

If you're looking for exact or prefix matches, you could try the FindStrings example in Squint. The example isn't particularly optimised, but you could easily try it and overwrite the data in the source string as below. It should be faster than ReplaceString() as well.
https://www.purebasic.fr/english/viewto ... 12&t=74786

Code:

; Requires the Squint include from the thread linked above.
Global String1.s = "373 ac3 b9d45 b iPdC ks23 al97 373 ac5 al99 346 vs42159ssbpx roro ask ePOC foo bar xyz 12dk tifer erer e"
Global String2.s = "346 373 iPdC roro ePOC ac3"  ; <- strings you're interested in finding
Global Replace.s = "-----------------------------------------------------------"
Global FindStringsItems.FindStrings
Global *squint.squint = Squint_New()

FindStrings(*squint,@String1,@String2,@FindStringsItems) ; builds trie and returns the count

ForEach FindStringsItems\item()
  Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
  ForEach FindStringsItems\item()\positions()
    ; overwrite each match in place with the same number of "-" characters
    CopyMemory(@Replace,@String1+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character))
  Next
Next

Results in:
--- --- b9d45 b ---- ks23 al97 --- ac5 al99 --- vs42159ssbpx ---- ask ---- foo bar xyz 12dk tifer erer e

Post by Kukulkan »

Based on your answers, I am considering ReplaceString() with #PB_String_InPlace. I only need to replace with a placeholder string of the same byte length. I will give it a try.

Thanks all of you! :)

Post by Josh »

Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?

Post by Kukulkan »

Josh wrote: Are the searched sequences in the string always whole words, which are delimited by spaces, dots, commas or similar?
Hi Josh. It's mostly passwords or hex sequences (keys). Some are plain, some are in quotes and some are in square brackets.

Post by mchael »

Have a look at xombie's post in this thread: viewtopic.php?t=26689

Post by Marc56us »

Just in case you didn't find a viable solution, I made a test with PB's internal functions: lists and ReplaceString() with #PB_String_InPlace.
To use #PB_String_InPlace, I build a replacement of the same length as the keyword with RSet().

Code:

ReplaceString(Txt$, All_KeyWords$(), RSet("", Len(All_KeyWords$()), "X"), #PB_String_InPlace)

I load all the keywords into a list, then loop over it (ForEach), running each keyword against the whole file, which was previously loaded into a single variable.

Log test file: 6.8 MB (35,526 lines)

Search: 500 keywords (all different, so no regex)
Result: 264,000 keywords replaced
Time: 27 sec (14 sec without debug output)
Computer: i7-8700 @ 3.2 GHz, file on SSD

And again, it's not very optimised; I think we can do better with Peek and Poke.
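
For completeness, a minimal self-contained sketch of the loop described in this post, built around the ReplaceString() line shown above. The two keywords and the sample text are made up; in the real test the list would hold 500 keywords and Txt$ the whole log file.

```purebasic
; Sketch: mask each keyword in place with a same-length run of "X".
; Sample text and keywords are illustrative assumptions.
EnableExplicit

NewList All_KeyWords$()
AddElement(All_KeyWords$()) : All_KeyWords$() = "secret123"
AddElement(All_KeyWords$()) : All_KeyWords$() = "deadbeef"

Define Txt$ = "user=bob pass=secret123 key=[deadbeef] pass=secret123"

ForEach All_KeyWords$()
  ; With #PB_String_InPlace the replacement must be exactly as long as the
  ; search string; Txt$ is modified directly and the return value is ignored.
  ReplaceString(Txt$, All_KeyWords$(), RSet("", Len(All_KeyWords$()), "X"), #PB_String_InPlace)
Next

Debug Txt$  ; user=bob pass=XXXXXXXXX key=[XXXXXXXX] pass=XXXXXXXXX
```

Note that this still scans the whole text once per keyword, so the cost grows with the number of keywords times the text length.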

Post by Kukulkan »

@Marc56us: Thanks for the tests. I also found ReplaceString() faster than the regex, but not fast enough.

We are now trying a B-tree implementation for the keywords, so that only one pass through the logfile content is needed. We are doing it in C, as we will need it in other places too. No results yet, as it is low priority...
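
To illustrate the single-pass idea in PureBasic rather than C, here is a rough sketch. It is not the B-tree implementation mentioned above: a plain Map lookup per distinct keyword length stands in for the tree. The procedure names AddKeyword and MaskKeywords and the sample data are hypothetical, the code assumes a Unicode build (PeekS default format), and it has not been benchmarked.

```purebasic
; Sketch: one pass over the text; at each position try every distinct
; keyword length and look the candidate up in a map (stand-in for a tree).
EnableExplicit

Procedure AddKeyword(Map Keywords(), Map Lengths(), Key$)
  Keywords(Key$) = #True
  Lengths(Str(Len(Key$))) = Len(Key$)      ; remember each distinct length once
EndProcedure

Procedure MaskKeywords(*Text, TextLen, Map Keywords(), Map Lengths())
  Protected pos, n, i, matched, count
  While pos < TextLen
    matched = 0
    ForEach Lengths()
      n = Lengths()
      If n > matched And pos + n <= TextLen
        If FindMapElement(Keywords(), PeekS(*Text + pos * SizeOf(Character), n))
          matched = n                      ; keep the longest match at this position
        EndIf
      EndIf
    Next
    If matched
      For i = 0 To matched - 1             ; overwrite in place, same length
        PokeC(*Text + (pos + i) * SizeOf(Character), '-')
      Next
      pos + matched
      count + 1
    Else
      pos + 1
    EndIf
  Wend
  ProcedureReturn count
EndProcedure

NewMap Keywords()
NewMap Lengths()
AddKeyword(Keywords(), Lengths(), "secret123")
AddKeyword(Keywords(), Lengths(), "deadbeef")

Define Txt$ = "pass=secret123 key=[deadbeef]"
Define Hits = MaskKeywords(@Txt$, Len(Txt$), Keywords(), Lengths())
Debug Txt$   ; pass=--------- key=[--------]
Debug Hits   ; 2
```

The work per text position depends on the number of distinct keyword lengths instead of the number of keywords; a real trie or B-tree would avoid the temporary string that PeekS() creates for every lookup.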