List vs Map when searching for duplicate files

AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

List vs Map when searching for duplicate files

Post by AZJIO »

It seemed to me that a map should work more slowly than a list, because the map requires converting the number to a string key. The timings show otherwise:

4700 ms - List
845 ms - Map

List

Code: Select all

EnableExplicit

Structure Files
	Size.q
	List Path.s()
EndStructure

NewList FilesPS.Files()

Procedure FileSearch(List FilesPS.Files(), sPath.s, Mask$ = "*", depth=130, level = 0)
	
	Protected sName.s, c = 0, LenSPath, tmp$, Size.q
	Protected Dim aExaDir(depth)
	Protected Dim aSePath.s(depth)
	Protected fNotFind
	
	If  Right(sPath, 1) <> #PS$
		sPath + #PS$
	EndIf
	LenSPath = Len(sPath)
	
	aSePath(c) = sPath
	aExaDir(c) = ExamineDirectory(#PB_Any, sPath, Mask$)
	If Not aExaDir(c)
		ProcedureReturn
	EndIf
	
	Repeat
		While NextDirectoryEntry(aExaDir(c))
			sName=DirectoryEntryName(aExaDir(c))
			If sName = "." Or sName = ".."
				Continue
			EndIf
			If DirectoryEntryType(aExaDir(c)) = #PB_DirectoryEntry_Directory
				If c >= depth
					Continue
				EndIf
				sPath = aSePath(c)
				c + 1
				aSePath(c) = sPath + sName + #PS$
				aExaDir(c) = ExamineDirectory(#PB_Any, aSePath(c), Mask$)
				If Not aExaDir(c)
					c - 1
				EndIf
			Else
				tmp$ = aSePath(c) + sName
				Size = FileSize(tmp$)
				
				
				fNotFind = 1
				ForEach FilesPS()
					If FilesPS()\Size = Size
						If AddElement(FilesPS()\Path())
							FilesPS()\Path() = tmp$
						EndIf
						fNotFind = 0
						Break
					EndIf
				Next
				If fNotFind
					If AddElement(FilesPS())
						FilesPS()\Size = Size
						If AddElement(FilesPS()\Path())
							FilesPS()\Path() = tmp$
						EndIf
					EndIf
				EndIf
				
			EndIf
		Wend
		FinishDirectory(aExaDir(c))
		c - 1
	Until c < 0
EndProcedure

Define StartTime.q, time$
StartTime = ElapsedMilliseconds()
FileSearch(FilesPS(), "/home/user")
time$ = Str(ElapsedMilliseconds() - StartTime)
MessageRequester("", time$) ; output without a debugger
Debug time$ + #CRLF$
Debug "List size: " + Str(ListSize(FilesPS())) + #CRLF$

ForEach FilesPS()
	Debug #CRLF$ + Str(FilesPS()\Size)
	ForEach FilesPS()\Path()
		Debug FilesPS()\Path()
	Next
Next
Map

Code: Select all

EnableExplicit

Structure Files
	List Path.s()
EndStructure

NewMap FilesPS.Files()

Procedure FileSearch(Map FilesPS.Files(), sPath.s, Mask$ = "*", depth=130, level = 0)

	Protected sName.s, c = 0, LenSPath, tmp$, Size$
	Protected Dim aExaDir(depth)
	Protected Dim aSePath.s(depth)

	If  Right(sPath, 1) <> #PS$
		sPath + #PS$
	EndIf
	LenSPath = Len(sPath)

	aSePath(c) = sPath
	aExaDir(c) = ExamineDirectory(#PB_Any, sPath, Mask$)
	If Not aExaDir(c)
		ProcedureReturn
	EndIf

	Repeat
		While NextDirectoryEntry(aExaDir(c))
			sName=DirectoryEntryName(aExaDir(c))
			If sName = "." Or sName = ".."
				Continue
			EndIf
			If DirectoryEntryType(aExaDir(c)) = #PB_DirectoryEntry_Directory
				If c >= depth
					Continue
				EndIf
				sPath = aSePath(c)
				c + 1
				aSePath(c) = sPath + sName + #PS$
				aExaDir(c) = ExamineDirectory(#PB_Any, aSePath(c), Mask$)
				If Not aExaDir(c)
					c - 1
				EndIf
			Else
				tmp$ = aSePath(c) + sName
				Size$ = Str(FileSize(tmp$))
				If FindMapElement(FilesPS(), Size$)
					If AddElement(FilesPS()\Path())
						FilesPS()\Path() = tmp$
					EndIf
				Else ; otherwise, the element does not exist yet
					If AddMapElement(FilesPS(), Size$, #PB_Map_NoElementCheck)
						If AddElement(FilesPS()\Path())
							FilesPS()\Path() = tmp$
						EndIf
					EndIf
				EndIf
			EndIf
		Wend
		FinishDirectory(aExaDir(c))
		c - 1
	Until c < 0
EndProcedure


Define StartTime.q, time$
StartTime = ElapsedMilliseconds()
FileSearch(FilesPS(), "/home/user")
time$ = Str(ElapsedMilliseconds() - StartTime)
MessageRequester("", time$) ; output without a debugger
Debug time$ + #CRLF$
Debug "Map size: " + Str(MapSize(FilesPS())) + #CRLF$

ForEach FilesPS()
	Debug #CRLF$ + MapKey(FilesPS())
	ForEach FilesPS()\Path()
		Debug FilesPS()\Path()
	Next
Next
Last edited by AZJIO on Tue Aug 10, 2021 12:25 pm, edited 3 times in total.
NicTheQuick
Addict
Posts: 817
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: List vs Map when searching for duplicate files

Post by NicTheQuick »

Maps can locate an element with a given key in O(log(n)) time, whereas with lists you always need O(n) time. That's exactly the reason why you use maps. The function Str() is very fast, and you can simply assume it runs in constant time.
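The difference can be illustrated outside PureBasic as well; here is a minimal Python sketch (illustrative only, not the forum code) that groups file sizes once with a linear list scan and once with a hash-map lookup:

```python
# Group file sizes two ways: a linear list scan (O(n) per lookup)
# versus a dict lookup (amortized O(1) per lookup).

def group_with_list(sizes):
    groups = []  # list of (size, [indices]) pairs
    for i, size in enumerate(sizes):
        for entry in groups:          # O(n) scan for every file
            if entry[0] == size:
                entry[1].append(i)
                break
        else:
            groups.append((size, [i]))
    return groups

def group_with_map(sizes):
    groups = {}  # size -> [indices], one hash lookup per file
    for i, size in enumerate(sizes):
        groups.setdefault(size, []).append(i)
    return groups

sizes = [100, 42, 100, 7, 42, 100]
print(group_with_map(sizes))  # {100: [0, 2, 5], 42: [1, 4], 7: [3]}
```

Both produce the same groups; only the cost per file differs, which is why the map version of the search scales so much better.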

Btw.: Why are you looking for the element with a given Size instead of just sorting the whole list by the Size member at the end of the procedure?

Edit: Corrected my sentence.
Last edited by NicTheQuick on Mon Aug 09, 2021 3:58 pm, edited 1 time in total.
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

Next, I need to group the files by hash sums. If I sort the list instead of searching, then I will need to check that the sizes match before generating lists of identical hash sums. I'm just looking for the fastest algorithm.

I have already written the main functionality of the program (yandex upload.ee). I'm experimenting to see whether I can increase the speed before writing the next piece of functionality.
Last edited by AZJIO on Wed Aug 11, 2021 6:04 pm, edited 2 times in total.
skywalk
Addict
Posts: 3554
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: List vs Map when searching for duplicate files

Post by skywalk »

NicTheQuick wrote: Mon Aug 09, 2021 12:48 pm Maps can locate an element with a given key in O(log(n)) time whereas you always need O(n) time with lists. That's exactly the reason why you use lists. The function Str() is very fast and you simply can assume it runs in a constant time.

Btw.: Why are you looking for the element with a given Size instead of just sorting the whole list by the Size member a the end of the procedure?
Can you correct your post? It's confusing: O(log(n)) is faster than O(n), so Maps() should be faster.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
NicTheQuick
Addict
Posts: 817
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: List vs Map when searching for duplicate files

Post by NicTheQuick »

skywalk wrote: Mon Aug 09, 2021 3:32 pm
NicTheQuick wrote: Mon Aug 09, 2021 12:48 pm Maps can locate an element with a given key in O(log(n)) time whereas you always need O(n) time with lists. That's exactly the reason why you use lists. The function Str() is very fast and you simply can assume it runs in a constant time.

Btw.: Why are you looking for the element with a given Size instead of just sorting the whole list by the Size member a the end of the procedure?
Can you correct your post? Confusing. O(log(n)) is faster than O(n) so Maps() should be faster.
Thx. I did correct that one word.
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

I redid it with sorting, and now it works quickly (725 ms). An array would have been better, since getting the previous element would be as simple as arr(i) and arr(i-1); with a list I have to use the SelectElement() function.

Code: Select all

EnableExplicit

Structure Files
	Size.q
	Path.s
EndStructure

Structure DataLst
	Pos.i
	Count.i
EndStructure

NewList FilesPS.Files()
NewList DataLst.DataLst()

Procedure FileSearch(List FilesPS.Files(), sPath.s, Mask$ = "*", depth=130, level = 0)
	
	Protected sName.s, c = 0, LenSPath
	Protected Dim aExaDir(depth)
	Protected Dim aSePath.s(depth)
	
	If  Right(sPath, 1) <> #PS$
		sPath + #PS$
	EndIf
	LenSPath = Len(sPath)
	
	aSePath(c) = sPath
	aExaDir(c) = ExamineDirectory(#PB_Any, sPath, Mask$)
	If Not aExaDir(c)
		ProcedureReturn
	EndIf
	
	Repeat
		While NextDirectoryEntry(aExaDir(c))
			sName=DirectoryEntryName(aExaDir(c))
			If sName = "." Or sName = ".."
				Continue
			EndIf
			If DirectoryEntryType(aExaDir(c)) = #PB_DirectoryEntry_Directory
				If c >= depth
					Continue
				EndIf
				sPath = aSePath(c)
				c + 1
				aSePath(c) = sPath + sName + #PS$
				aExaDir(c) = ExamineDirectory(#PB_Any, aSePath(c), Mask$)
				If Not aExaDir(c)
					c - 1
				EndIf
			Else
				; 				tmp$ = aSePath(c) + sName
				If AddElement(FilesPS())
					FilesPS()\Path = aSePath(c) + sName
					FilesPS()\Size = FileSize(FilesPS()\Path)
				EndIf
			EndIf
		Wend
		FinishDirectory(aExaDir(c))
		c - 1
	Until c < 0
EndProcedure


Define StartTime.q, time$, i, tmp
StartTime = ElapsedMilliseconds()
FileSearch(FilesPS(), "/home/user")
; SortList(FilesPS(), #PB_Sort_Ascending)
SortStructuredList(FilesPS(), #PB_Sort_Ascending , OffsetOf(Files\Size) , TypeOf(Files\Size))
Debug "List size: " + Str(ListSize(FilesPS())) + #CRLF$

Define Count, open
i = 0
open = 0
tmp = -3
ForEach FilesPS()
	If tmp = FilesPS()\Size
		open = 1
		Count + 1
		If Count = 1
			If AddElement(DataLst())
				DataLst()\Pos = i
			EndIf
		EndIf
	Else
		If open = 1
			DataLst()\Count = Count + 1
		EndIf
		open = 0
		Count = 0
	EndIf
	tmp = FilesPS()\Size
	i + 1
Next
If open = 1
	DataLst()\Count = Count + 1
EndIf

Define x
i = 1
If ListSize(DataLst()) = 0 Or ListSize(FilesPS()) = 0
	End
EndIf
ResetList(DataLst())
ResetList(FilesPS())
NextElement(DataLst())
NextElement(FilesPS())
Repeat
	; 	Debug "i = " + Str(i) + " Pos = " + Str(DataLst()\Pos)
	If i = DataLst()\Pos
		For x = 1 To DataLst()\Count
			Debug Str(i) + " " + Str(FilesPS()\Size) + " " + FilesPS()\Path
			If Not NextElement(FilesPS())
				Break 2
			EndIf
			i + 1
		Next
		Debug "_________________________"
		If Not NextElement(DataLst())
			Break
		EndIf
	Else
		i + 1
		If Not NextElement(FilesPS())
			Break
		EndIf
	EndIf
	; 	Debug Str(i) + " " + Str(FilesPS()\Size) + " " + FilesPS()\Path
ForEver
Debug "_____END________"
Debug "_________________________"
Debug "_________________________"
time$ = Str(ElapsedMilliseconds() - StartTime)
MessageRequester("", time$) ; output without a debugger

i = 0
ForEach FilesPS()
	i + 1
	Debug Str(i) + " " + Str(FilesPS()\Size) + " " + FilesPS()\Path
Next

ForEach DataLst()
	Debug Str(DataLst()\Pos) + " " + Str(DataLst()\Count)
Next
Simplified the solution

Code: Select all

ForEach DataLst()
	SelectElement(FilesPS(), DataLst()\Pos - 1)
	For x = 1 To DataLst()\Count
		Debug Str(DataLst()\Pos - 1) + " " + Str(FilesPS()\Size) + " " + FilesPS()\Path
		NextElement(FilesPS())
	Next
	Debug "_________________________"
Next
Debug "_____END________"
Debug "_________________________"
Debug "_________________________"
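The sort-then-scan grouping above can be sketched compactly in Python (an illustrative translation of the idea, not the forum code): sort by size, then adjacent entries with equal sizes form one duplicate-candidate group.

```python
# Sort-then-group duplicate detection, the same idea as the PureBasic
# code above: after sorting by size, equal sizes are adjacent.
from itertools import groupby

def size_groups(files):
    """files: list of (size, path); returns groups with 2+ members."""
    files = sorted(files, key=lambda f: f[0])
    result = []
    for size, grp in groupby(files, key=lambda f: f[0]):
        grp = list(grp)
        if len(grp) > 1:              # keep only sizes that occur twice or more
            result.append((size, [path for _, path in grp]))
    return result

files = [(10, "a"), (20, "b"), (10, "c"), (30, "d"), (20, "e")]
print(size_groups(files))  # [(10, ['a', 'c']), (20, ['b', 'e'])]
```

This avoids the separate position/count bookkeeping list entirely, at the cost of one O(n log n) sort.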
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

In the third post, I updated the links to the program.
I found that when I run the function in a separate thread, I get an inexplicable failure. The test was performed on Linux. The old version also gave an error. The separate thread is needed so the program can report execution progress to the status bar.
kinglestat
Enthusiast
Posts: 732
Joined: Fri Jul 14, 2006 8:53 pm
Location: Malta

Re: List vs Map when searching for duplicate files

Post by kinglestat »

I am not entirely sure why you make it so complicated; used properly, a map is always faster. I made some minor modifications so you only use one map. It is also possible to store a reference to the original match (rather than use dup). There is also my index for structured elements on the forum, which would also work.

Code: Select all

DisableDebugger

Structure stPath
   name.s
   dup.i
EndStructure

;Structure Files
;	List llPath.stPath()
;EndStructure

NewMap FilesPS.stPath(65537) ;Always use primes

Procedure FileSearch(Map FilesPS.stPath(), sPath.s, Mask$ = "*", depth=130, level = 0)

	Protected sName.s, c = 0, LenSPath, tmp$, Size$
	Protected Dim aExaDir(depth)
	Protected Dim aSePath.s(depth)

	If  Right(sPath, 1) <> #PS$
		sPath + #PS$
	EndIf
	LenSPath = Len(sPath)

	aSePath(c) = sPath
	aExaDir(c) = ExamineDirectory(#PB_Any, sPath, Mask$)
	If Not aExaDir(c)
		ProcedureReturn
	EndIf

	Repeat
		While NextDirectoryEntry(aExaDir(c))
			sName=DirectoryEntryName(aExaDir(c))
			If sName = "." Or sName = ".."
				Continue
			EndIf
			If DirectoryEntryType(aExaDir(c)) = #PB_DirectoryEntry_Directory
				If c >= depth
					Continue
				EndIf
				sPath = aSePath(c)
				c + 1
				aSePath(c) = sPath + sName + #PS$
				aExaDir(c) = ExamineDirectory(#PB_Any, aSePath(c), Mask$)
				If Not aExaDir(c)
					c - 1
				EndIf
			Else
				tmp$ = aSePath(c) + sName
				Size$ = Str(FileSize(tmp$))+":"+sName
				If FindMapElement(FilesPS(), Size$)
					FilesPS()\name = tmp$
					FilesPS()\dup = 1
				Else
					If AddMapElement(FilesPS(), Size$, #PB_Map_NoElementCheck)
						FilesPS()\name = tmp$
					EndIf
				EndIf
			EndIf
		Wend
		FinishDirectory(aExaDir(c))
		c - 1
	Until c < 0
EndProcedure


Define StartTime.q, time$
StartTime = ElapsedMilliseconds()
;FileSearch(FilesPS(), "/home/user")
FileSearch(FilesPS(), "D:\WIP\Sources")
time$ = Str(ElapsedMilliseconds() - StartTime)
MessageRequester("", time$) ; output without a debugger
Debug time$ + #CRLF$
Debug "Map size: " + Str(MapSize(FilesPS())) + #CRLF$

ForEach FilesPS()
	If FilesPS()\dup
		Debug FilesPS()\name
	EndIf
Next
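The one-map idea above boils down to: key on size plus name, and the second time a key is seen, flag it as a duplicate. A minimal Python sketch of the same pattern (illustrative, with hypothetical input tuples):

```python
# One-pass duplicate flagging with a single hash map, mirroring the
# one-map approach above: a repeated (size, name) key marks a dup.
def find_dups(entries):
    """entries: list of (size, name, path) tuples; returns the paths
    whose (size, name) key was already seen earlier in the pass."""
    seen = {}   # (size, name) -> first path with that key
    dups = []
    for size, name, path in entries:
        key = (size, name)
        if key in seen:
            dups.append(path)
        else:
            seen[key] = path
    return dups

entries = [(5, "a.txt", "/x/a.txt"), (5, "a.txt", "/y/a.txt"), (7, "b.txt", "/x/b.txt")]
print(find_dups(entries))  # ['/y/a.txt']
```

Note that, as the next reply points out, matching size (even size plus name) only identifies candidates, not confirmed duplicates.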
I may not help with your coding
Just ask about mental issues!

http://www.lulu.com/spotlight/kingwolf
http://www.sen3.net
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

kinglestat
Matching size alone does not prove a duplicate.

A deletion level will be set for files and folders, including a prohibition on deletion. It is impossible to determine in advance which copy is the unnecessary duplicate, so you can't simply mark files for deletion during the search.

It is necessary to create an array of arrays, twice:
files -> size - > md5 (Direction of calculation)
md5 -> size - > files (hierarchy)

Only files of the same size can be candidates for being identical. To avoid calculating the MD5 of every file, you must first identify files of equal size; you may not even have to calculate MD5 at all. This will save a lot of time.
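The two-stage strategy described above can be sketched in Python (illustrative only; real code would read files from disk, here a dict of byte contents stands in for them):

```python
# Two-stage duplicate detection: hash only files whose size occurs
# at least twice, which skips the hashing step for most files.
import hashlib

def duplicates(files):
    """files: dict path -> bytes content (stand-in for files on disk).
    Returns groups of paths with identical size AND identical MD5."""
    by_size = {}
    for path, data in files.items():
        by_size.setdefault(len(data), []).append(path)

    dup_groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue                  # unique size: no hash needed at all
        by_hash = {}
        for path in paths:
            digest = hashlib.md5(files[path]).hexdigest()
            by_hash.setdefault(digest, []).append(path)
        dup_groups += [g for g in by_hash.values() if len(g) > 1]
    return dup_groups

files = {"a": b"hello", "b": b"hello", "c": b"world", "d": b"hi"}
print(duplicates(files))  # [['a', 'b']]
```

Here "c" shares a size with "a" and "b" so it gets hashed, but "d" has a unique size and is never hashed at all; that is the saving described above.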
kinglestat
Enthusiast
Posts: 732
Joined: Fri Jul 14, 2006 8:53 pm
Location: Malta

Re: List vs Map when searching for duplicate files

Post by kinglestat »

I see.
I did that 8 or 9 years ago in a program I wrote called FFX, which I still use; but for the life of me I can't remember how I did it, though I did use a custom CRC32 and an additive CRC32 (for large files) for comparison.

You seem to know what you are doing; good luck, and thanks for the code snippets. It always helps to see other people's perspectives.
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

kinglestat
I already wrote this program in another language a few years ago. Now I'm interested in doing it on Linux.
collectordave
Addict
Posts: 1272
Joined: Fri Aug 28, 2015 6:10 pm
Location: Portugal

Re: List vs Map when searching for duplicate files

Post by collectordave »

There is a program written for this at

viewtopic.php?f=12&t=73849&p=543492&hil ... es#p543492
Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage to move in the opposite direction.
AZJIO
Enthusiast
Posts: 203
Joined: Sun May 14, 2017 1:48 am

Re: List vs Map when searching for duplicate files

Post by AZJIO »

collectordave
Your mistakes have been explained to you.