Terrible slowdown of AddElement used in Threads - Solution: slab allocator

Just starting out? Need help? Post your questions and find answers here.
User avatar
NicTheQuick
Addict
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken
Contact:

Re: Terrible slowdown of AddElement used in Threads

Post by NicTheQuick »

mk-soft wrote: Fri Aug 19, 2022 11:05 am The problem here is that the list functions are not internally thread-safe and are therefore protected internally with mutex using the thread-safe option.
If you switch off the thread-safe option, this leads to a crash.

So this is not a bug, but an internal optimisation of Fred is necessary.
Oh wow :o
I really thought "Threadsafe" is just a string thing.

Btw: Shouldn't there be better thread safety out of the box with the C compiler backend when using the std libraries?
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
User avatar
mk-soft
Always Here
Always Here
Posts: 5335
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Terrible slowdown of AddElement used in Threads

Post by mk-soft »

This has nothing to do with the backend comiler, but how the function is programmed internally.

This function is also not ThreadSafe

Code: Select all

Procedure DoAny()
  Static LastValue.i
EndProcedure
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
User avatar
NicTheQuick
Addict
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken
Contact:

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by NicTheQuick »

That's correct and that's why I usually avoid 'Static' and use 'Threaded' for that kind of thing:

Code: Select all

Threaded DoAny_LastValue.i
Procedure DoAny()
	DoAny_LastValue + 1
EndProcedure
But I still do not understand the issue with LinkedLists. Why should there be one mutex for all LinkedLists if there should be only one per LinkedList?
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
User avatar
mk-soft
Always Here
Always Here
Posts: 5335
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by mk-soft »

I took a look at the header file for the LinkedList in the SDK folder.

Access to a LinkedList is already optimised for threads, but the creation of new elements is protected. This is probably part of the object management.

Code: Select all

; removed
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
Rinzwind
Enthusiast
Enthusiast
Posts: 636
Joined: Wed Mar 11, 2009 4:06 pm
Location: NL

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by Rinzwind »

mk-soft wrote: Fri Aug 19, 2022 12:04 pm Access to a LinkedList is already optimised for threads, but...
Is it? Last time I checked, read access with ForEach is not thread safe, since the internal current element pointer is not "threaded". Feature request: viewtopic.php?t=78077
freak
PureBasic Team
PureBasic Team
Posts: 5929
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by freak »

The reason is that linked lists have shared memory management to support commands like MergeLists() and SplitList(). So commands like AddElement(), DeleteElement() or ClearList() need to acquire a lock in thread mode, even if each thread has its own list. List commands that do not create/free memory do not have that.

The code shows the absolute worst case though because all the threads do is call AddElement() repeatedly and nothing else. In a real world program, each thread would also need to do some kind of work between AddElement() calls to fill the list with useful data which decreases the contention on the lock and lessens the effect.

The LinkedList library implementation is a compromise to support many different use cases. Unfortunately this means that it is not the best possible choice for some specific cases like this one. If this is an issue for you, you may be better off creating your own data structure for this specific case.
quidquid Latine dictum sit altum videtur
User avatar
STARGÅTE
Addict
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by STARGÅTE »

freak wrote: Fri Aug 19, 2022 7:49 pm The reason is that linked lists have shared memory management to support commands like MergeLists() and SplitList(). So commands like AddElement(), DeleteElement() or ClearList() need to acquire a lock in thread mode, even if each thread has its own list. List commands that do not create/free memory do not have that.
Thank you Freak for this comment. However, I do not fully understand, why you need here an internal lock?
At the moment, we have to lock list by our own, when one list is used by multiple Threads, because the current element is not "Threaded".
In which situation MergeLists() needs to lock (both) lists (internally), when adding/freeing elements is allready not possible from multiple threads without using a mutex? There is also no way of merging list, whose origin is in different Threads (initiallized by Threaded). Can you again comment on that?

Of cause, the example code is a absolute worst case.
In a "real world program" the total time was indeed reduced when the calculation was spreaded into threads.
However, the time reduction was not as much as I expected. So I searched the origin of this bottle neck, and it was AddElement().
freak wrote: Fri Aug 19, 2022 7:49 pm The LinkedList library implementation is a compromise to support many different use cases. Unfortunately this means that it is not the best possible choice for some specific cases like this one. If this is an issue for you, you may be better off creating your own data structure for this specific case.
I understand. I will search for a solution.
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
freak
PureBasic Team
PureBasic Team
Posts: 5929
Joined: Fri Apr 25, 2003 5:21 pm
Location: Germany

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by freak »

STARGÅTE wrote: Fri Aug 19, 2022 8:18 pm Thank you Freak for this comment. However, I do not fully understand, why you need here an internal lock?
At the moment, we have to lock list by our own, when one list is used by multiple Threads, because the current element is not "Threaded".
In which situation MergeLists() needs to lock (both) lists (internally), when adding/freeing elements is allready not possible from multiple threads without using a mutex? There is also no way of merging list, whose origin is in different Threads (initiallized by Threaded). Can you again comment on that?
LinkedLists do not allocate memory for each element separately but use a "block allocation" mechanism to reduce memory fragmentation from the many small elements (see https://www.purebasic.fr/blog/?p=8)

In order for MergeLists() or SplitList() to be able to move elements between the lists in a fast way and without needing to re-allocate each element, lists of the same type/size have to share the allocation data structures (so an element can be created in one list and freed in the other after a possible merge/split). Sharing the allocation data structures means having to use a lock for multithreaded access when allocating/freeing elements.

You cannot directly merge/split between the threaded variants of a list initialized with "Threaded". However, you could move elements to/from an unthreaded other list. Because it is not possible to know how a list will be used at creation time, the only way is to protect all lists the same way.
quidquid Latine dictum sit altum videtur
User avatar
jacdelad
Addict
Addict
Posts: 1431
Joined: Wed Feb 03, 2021 12:46 pm
Location: Planet Riesa
Contact:

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by jacdelad »

Interesting.
If AddElement() is the bottleneck, how about introducing a command to add a bunch of elements at once, like

Code: Select all

AddElements(MyList(),30)
? Would this speed things up (in general)? Like reserving space ahead if I already know how many item I will add. this could help in many situations.
PureBasic 6.04/XProfan X4a/Embarcadero RAD Studio 11/Perl 5.2/Python 3.10
Windows 11/Ryzen 5800X/32GB RAM/Radeon 7770 OC/3TB SSD/11TB HDD
Synology DS1821+/36GB RAM/130TB
Synology DS920+/20GB RAM/54TB
Synology DS916+ii/8GB RAM/12TB
User avatar
STARGÅTE
Addict
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Terrible slowdown of AddElement used in Threads - Comment from Fred is needed

Post by STARGÅTE »

freak wrote: Fri Aug 19, 2022 8:51 pm In order for MergeLists() or SplitList() to be able to move elements between the lists in a fast way and without needing to re-allocate each element, lists of the same type/size have to share the allocation data structures (so an element can be created in one list and freed in the other after a possible merge/split). Sharing the allocation data structures means having to use a lock for multithreaded access when allocating/freeing elements.
Oh, that makes sense. Thank you again for this deep insight.
Now, I implemented my own "Linked List" similar to your "Slab Allocator".
First tests show nearly equal performance in single core mode, but no slowdown when using multiple threads.
I will perform some more test, clean up my module, and probably post my code and the example here.
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
User avatar
STARGÅTE
Addict
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Terrible slowdown of AddElement used in Threads

Post by STARGÅTE »

Dear all,

as a reminder, the result of the original code was:
LinkedList wrote:Single Main Thread: 32 ms
1 Thread(s): 32 ms
2 Thread(s): 148 ms
3 Thread(s): 406 ms
4 Thread(s): 982 ms
The result with just allocating and freeing memory:
AllocateMemory wrote:Single Main Thread: 38 ms
1 Thread(s): 39 ms
2 Thread(s): 45 ms
3 Thread(s): 42 ms
4 Thread(s): 68 ms
And now the result with my Slab-Module:
Slab-Allocator wrote:Single Main Thread: 19 ms
1 Thread(s): 19 ms
2 Thread(s): 25 ms
3 Thread(s): 29 ms
4 Thread(s): 36 ms
I'm very happy about the performance.

At the moment I do some more performance test with usual cases and extreme cases when allocating and releasing random elements.
However, here are the code for you and an example:

Code: Select all

CompilerIf Not #PB_Compiler_Thread
	CompilerError "Enable Thread-Safe"
CompilerEndIf

XIncludeFile "Slab.pbm"

Procedure Test(Void.i=0)
	Protected I.i
	Protected *Slab = Slab::NewAllocator(SizeOf(Integer))
	For I = 1 To 1000000
		Slab::AllocateElement(*Slab)
	Next
	Slab::FreeAllocator(*Slab)
EndProcedure


Define Time.i, Threads.i
Define NewList Thread.i()


OpenConsole()

Time = ElapsedMilliseconds()
Test()
Time = ElapsedMilliseconds() - Time
PrintN("Single Main Thread: "+Str(Time)+" ms")

For Threads = 1 To 4
	
	Time = ElapsedMilliseconds()
	For I = 1 To Threads
		AddElement(Thread()) : Thread() = CreateThread(@Test(), 0)
	Next
	ForEach Thread()
		WaitThread(Thread())
	Next
	ClearList(Thread())
	Time = ElapsedMilliseconds() - Time
	PrintN(Str(Threads)+" Thread(s): "+Str(Time)+" ms")
	
Next


Input()
________

Slab Module:

Code: Select all

DeclareModule Slab
	
	Prototype Callback(*Element)
	
	Declare.i NewAllocator(ElementSize.i, SlabLength.i=128)                   ; Creates a new slab-allocator with the specified element size and slab length.
	Declare.i AllocateElement(*Allocator, AllocationCallback.Callback=#Null)  ; Allocates a new element within the slab-allocator with an optional callback for i.e. InitializeStructure()
	Declare.i ReleaseElement(*Element, ReleaseCallback.Callback=#Null)        ; Releases an element from the slab-allocator with an optional callback for i.e. ClearStructure()
	Declare.i FreeAllocator(*Allocator, ReleaseCallback.Callback=#Null)       ; Frees the whole slab-allocator and all its elements with an optional callback for i.e. ClearStructure()
	Declare.i ExamineElements(*Allocator, ExaminationCallback.Callback)       ; Examines all elements within the slab-allocator and calls the callback on each element.
	Declare.i Report(*Allocator)                                              ; Gives a debug report about the slab-allocator properties.
	
EndDeclareModule


Module Slab
	
	EnableExplicit
	
	Structure Allocator
		SlabLength.i
		SlabSize.i
		Slabs.i
		*LatestSlab.Slab
		ElementSize.i
		Elements.i
		*NextUsableElement.Element
	EndStructure
	
	Structure Slab
		*Allocator.Allocator
		*PreviousSlab.Slab
		*NextSlab.Slab
		Elements.i
		Voids.i
		DataArea.i[0]
	EndStructure
	
	Structure Element
		*Slab.Slab
		*NextUsableElement.Element
		DataArea.i[0]
	EndStructure
	
	
	Procedure   ReleaseSlab(*Slab.Slab)
		
		Protected *Allocator.Allocator = *Slab\Allocator
		Protected *NextUsableElement.Element
		
		If *Slab\PreviousSlab
			*Slab\PreviousSlab\NextSlab = *Slab\NextSlab
		EndIf
		If *Slab\NextSlab
			*Slab\NextSlab\PreviousSlab = *Slab\PreviousSlab
		EndIf
		
		While *Allocator\NextUsableElement And *Allocator\NextUsableElement > *Slab And *Allocator\NextUsableElement < *Slab + *Allocator\SlabSize
			*Allocator\NextUsableElement = *Allocator\NextUsableElement\NextUsableElement
			*Slab\Voids - 1
		Wend
		*NextUsableElement = *Allocator\NextUsableElement
		While *NextUsableElement\NextUsableElement And *Slab\Voids > 0
			If *NextUsableElement\NextUsableElement > *Slab And *NextUsableElement\NextUsableElement < *Slab + *Allocator\SlabSize
				*NextUsableElement\NextUsableElement = *NextUsableElement\NextUsableElement\NextUsableElement
				*Slab\Voids - 1
			Else
				*NextUsableElement = *NextUsableElement\NextUsableElement
			EndIf
		Wend
		
		*Allocator\Slabs - 1
		FreeMemory(*Slab)
		
	EndProcedure
	
	
	Procedure.i AllocateSlab(*Allocator.Allocator)
		
		Protected *Slab.Slab = AllocateMemory(*Allocator\SlabSize)
		
		With *Slab
			\Allocator   = *Allocator
			\NextSlab    = *Allocator\LatestSlab
			If *Allocator\LatestSlab
				*Allocator\LatestSlab\PreviousSlab = *Slab
			EndIf
		EndWith
		
		*Allocator\Slabs      + 1
		*Allocator\LatestSlab = *Slab
		
		ProcedureReturn *Slab
		
	EndProcedure
	
	
	Procedure.i NewAllocator(ElementSize.i, SlabLength.i=128)
		
		Protected *Allocator.Allocator = AllocateMemory( SizeOf(Allocator) )
		
		With *Allocator
			\SlabLength        = SlabLength
			\SlabSize          = SizeOf(Slab) + ( SizeOf(Element) + ElementSize ) * SlabLength
			\LatestSlab        = AllocateSlab(*Allocator)
			\ElementSize       = ElementSize
			\NextUsableElement = \LatestSlab + SizeOf(Slab)
		EndWith
		
		ProcedureReturn *Allocator
		
	EndProcedure
	
	
	Procedure.i AllocateElement(*Allocator.Allocator, AllocationCallback.Callback=#Null)
		
		Protected *Element.Element
		
		*Allocator\Elements + 1
		*Element = *Allocator\NextUsableElement
		If *Element\NextUsableElement
			; Allocate a void
			*Element\Slab\Elements + 1
			*Element\Slab\Voids    - 1
			*Allocator\NextUsableElement = *Element\NextUsableElement
			*Element\NextUsableElement   = #Null
		Else
			; Allocate from latest slab 
			*Element\Slab = *Allocator\LatestSlab
			*Element\Slab\Elements + 1
			If *Element\Slab\Elements < *Allocator\SlabLength
				*Allocator\NextUsableElement = *Allocator\NextUsableElement + ( SizeOf(Element) + *Allocator\ElementSize )
			Else
				*Allocator\LatestSlab = AllocateSlab(*Allocator)
				*Allocator\NextUsableElement = *Allocator\LatestSlab + SizeOf(Slab)
			EndIf
		EndIf
		
		If AllocationCallback
			ProcedureReturn AllocationCallback( *Element + SizeOf(Element) ) ; Shift the Pointer to the element's data.
		Else
			ProcedureReturn *Element + SizeOf(Element) ; Shift the Pointer to the element's data.
		EndIf
		
	EndProcedure
	
	
	Procedure   ReleaseElement(*Element.Element, ReleaseCallback.Callback=#Null)
		
		Protected *Slab.Slab, *Allocator.Allocator
		
		*Element - SizeOf(Element) ; Shift the Pointer to the element's head.
		
		*Slab      = *Element\Slab
		*Allocator = *Slab\Allocator
		
		If ReleaseCallback
			ReleaseCallback(*Element + SizeOf(Element))
		Else
			FillMemory(*Element + SizeOf(Element), *Allocator\ElementSize)
		EndIf
		
		*Slab\Elements      - 1
		*Slab\Voids         + 1
		*Allocator\Elements - 1
		*Element\NextUsableElement   = *Allocator\NextUsableElement
		*Allocator\NextUsableElement = *Element
		
		If *Slab\Voids = *Allocator\SlabLength And *Allocator\Elements <= (*Allocator\Slabs-2) * *Allocator\SlabLength
			ReleaseSlab(*Slab)
		EndIf
		
	EndProcedure
	
	
	Procedure   FreeAllocator(*Allocator.Allocator, ReleaseCallback.Callback=#Null)
		
		Protected *Slab.Slab = *Allocator\LatestSlab
		Protected *NextSlab.Slab
		Protected Index.i, *Element.Element
		
		If ReleaseCallback
			While *Slab
				Index = 0
				While Index < *Slab\Elements + *Slab\Voids
					*Element = *Slab + SizeOf(Slab) + Index * (SizeOf(Element) + *Allocator\ElementSize)
					If *Element\NextUsableElement = #Null
						ReleaseCallback(*Element + SizeOf(Element))
					EndIf
					Index + 1
				Wend
				*NextSlab = *Slab\NextSlab
				FreeMemory(*Slab)
				*Slab = *NextSlab
			Wend
		Else
			While *Slab
				*NextSlab = *Slab\NextSlab
				FreeMemory(*Slab)
				*Slab = *NextSlab
			Wend
		EndIf
		FreeMemory(*Allocator)
		
	EndProcedure
	
	
	Procedure.i ExamineElements(*Allocator.Allocator, ExaminationCallback.Callback)
			
		Protected *Slab.Slab = *Allocator\LatestSlab
		Protected *NextSlab.Slab
		Protected Index.i, *Element.Element
		
		While *Slab
			Index = 0
			While Index < *Slab\Elements + *Slab\Voids
				*Element = *Slab + SizeOf(Slab) + Index * (SizeOf(Element) + *Allocator\ElementSize)
				If *Element\NextUsableElement = #Null
					ExaminationCallback(*Element + SizeOf(Element))
				EndIf
				Index + 1
			Wend
			*Slab = *Slab\NextSlab
		Wend
		
		ProcedureReturn *Allocator\Elements
		
	EndProcedure
	
	
	Procedure   Report(*Allocator.Allocator)
		
		Protected *Slab.Slab = *Allocator\LatestSlab
		Protected Index.i, String.s
		Protected *Element.Element
		
		Debug "SlabAllocator {"
		Debug "  Element size = " + *Allocator\ElementSize + " Byte"
		Debug "  Slab length  = " + *Allocator\SlabLength
		Debug "  Slabs        = " + *Allocator\Slabs
		Debug "  Elements     = " + *Allocator\Elements
		Debug "  Total size   = " + Str( SizeOf(Allocator) + *Allocator\Slabs * *Allocator\SlabSize ) + " Byte"
		Debug "  Fill ratio   = " + Str(100 * *Allocator\Elements / (*Allocator\Slabs * *Allocator\SlabLength))+" %"
		Debug "  Slabs {"
		While *Slab
			Index = 0
			String = ""
			While Index < *Slab\Elements + *Slab\Voids
				*Element = *Slab + SizeOf(Slab) + Index * (SizeOf(Element) + *Allocator\ElementSize)
				If *Element\NextUsableElement = #Null
					String + "#"
				Else
					String + "o"
				EndIf
				Index + 1
			Wend
			Debug "    "+LSet(String, *Allocator\SlabLength, ".")
			*Slab = *Slab\NextSlab
		Wend
		Debug "}"
		
	EndProcedure
	
	
EndModule



;- Example

CompilerIf #PB_Compiler_IsMainFile

Procedure ExaminationCallback(*Integer.Integer)  ; A callback for the Slab::ExamineElements() procedure
	Debug "Element: " + Str(*Integer\i)
EndProcedure


Define *Slab = Slab::NewAllocator(8, 3)  ; New slab-allocator with 8 byte for each element and just 3 elements per slab for demonstration.
Define Dim *Element.Integer(100)

*Element(1) = Slab::AllocateElement(*Slab) : *Element(1)\i = 1  ; Allocate some elements and fill them. The array ist just for easier access.
*Element(2) = Slab::AllocateElement(*Slab) : *Element(2)\i = 2
*Element(3) = Slab::AllocateElement(*Slab) : *Element(3)\i = 3
*Element(4) = Slab::AllocateElement(*Slab) : *Element(4)\i = 4
*Element(5) = Slab::AllocateElement(*Slab) : *Element(5)\i = 5
*Element(6) = Slab::AllocateElement(*Slab) : *Element(6)\i = 6
*Element(7) = Slab::AllocateElement(*Slab) : *Element(7)\i = 7
*Element(8) = Slab::AllocateElement(*Slab) : *Element(8)\i = 8

Slab::ReleaseElement(*Element(3)) ; Release some of this elements.
Slab::ReleaseElement(*Element(1))
Slab::ReleaseElement(*Element(7))

Slab::Report(*Slab) ; Give a short report about the slab-allocator

*Element(9) = Slab::AllocateElement(*Slab ) : *Element(9)\i = 9    ; Allocate some more elements, using the free places.
*Element(10) = Slab::AllocateElement(*Slab) : *Element(10)\i = 10

Slab::Report(*Slab)

Slab::ReleaseElement(*Element(9))  ; Release some more elements, resulting in a release of a slab.
Slab::ReleaseElement(*Element(8))
Slab::ReleaseElement(*Element(10))
Slab::ReleaseElement(*Element(2))

Slab::Report(*Slab)

Slab::ExamineElements(*Slab, @ExaminationCallback())

CompilerEndIf
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
User avatar
idle
Always Here
Always Here
Posts: 5042
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Terrible slowdown of AddElement used in Threads

Post by idle »

Good result. Did you just write that in a day. Takes me weeks to write stuff like that.
User avatar
skywalk
Addict
Addict
Posts: 3972
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Terrible slowdown of AddElement used in Threads

Post by skywalk »

Haha, some people are savants!
Great topic by the way. 8)

So, I am trying to understand the high level reasoning for the threads?
If a single thread is faster, why break up the computations?
Could this be faster with individual exe's and pipelines to a shared memory database?
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
User avatar
idle
Always Here
Always Here
Posts: 5042
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Terrible slowdown of AddElement used in Threads

Post by idle »

skywalk wrote: Sat Aug 20, 2022 10:35 pm Haha, some people are savants!
Great topic by the way. 8)

So, I am trying to understand the high level reasoning for the threads?
If a single thread is faster, why break up the computations?
Could this be faster with individual exe's and pipelines to a shared memory database?
He's wanting to get the same running time on all the threads, each thread will work on a chunk of the problem then he'll merge the results. The reason for the slab allocator generally is faster than allocating all the time, he should ideally by doing it in chunks like 4096 bytes or even larger up to 48k or whatever the L1 cache size.

I'm actually doing a Database at the moment, the filters in theory can handle 150,000,000 look ups p/s it's crazy fast but that's maxing out the cores and running on 12 threads.
User avatar
STARGÅTE
Addict
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke
Contact:

Re: Terrible slowdown of AddElement used in Threads

Post by STARGÅTE »

idle wrote: Sat Aug 20, 2022 8:56 pm Good result. Did you just write that in a day. Takes me weeks to write stuff like that.
Yes. 2 hours for the concept in the train with pen and paper and afterwards some hours to write the code in a first draft.
idle wrote: Sun Aug 21, 2022 1:03 am The reason for the slab allocator generally is faster than allocating all the time, he should ideally by doing it in chunks like 4096 bytes or even larger up to 48k or whatever the L1 cache size.
Actually, I don't really take care of the CPU caches L1, L2 and L3 at the moment. Currently, related elements (for the same calculation or something) could also be spread far away from each other, if the slabs have many voids which have to be filled.
skywalk wrote: Sat Aug 20, 2022 10:35 pm So, I am trying to understand the high level reasoning for the threads?
If a single thread is faster, why break up the computations?
You are right if you say, a single thread takes e.g. 10s for a calculation, and four threads running in parallel take for the same calculation each 12s. But, those four threads also have performed four times the calculation. So finally, the original calculation takes only 3s if you spread them equally to the threads.

To give to an example:
You want to check the prime status of huges numbers (larger than a quad) and you want to count the primes between 10^100 and 10^100 + 10000. This test can be easily parallized on multiple threads, each thread checking his one range and then add the count of primes in each range (thread) together. However, this test needs a lot of small memory pieces for all the hughes numbers which are not fitting into a quad. So far, I have used a Linked List for these memory pieces, because Linked Lists are faster than using AllocateMemory. However, I now observed that they are slower in parallel running threads (see above). Therefore, a "new" memory allocation technique had to be constructed.

Spoiler: This stuff is needed for a new coming feature of my script language Lizard - the easy access to parallel computing - be excited.
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and moreTypeface - Sprite-based font include/module
Post Reply