Simulation filling up memory until it gets killed

Hello,
I am simulating neutron measurements using a converter and then a standard 300um and 500 um timepix silicon detectors. The simulation was working fine until I switched from the [GenericPropagation] module to the [TransientPropagation] module. I left some commented lines in there, which are settings I played around with to see if there is a difference (there is not). What happens is that the memory fills up and when it reaches the maximum capacity the program is killed without any error message (only that it is killed). I played around with “workers” and “buffer_per_worker”, putting it higher or lower. It only changes the time it takes to fill up the memory. It is difficult to find a minimal working code to reproduce the error as it takes several hours before the memory is filled up. It is necessary to have several thousand events for the memory to fill up. My suspicion is that somewhere some allocated memory is not freed up until the program is finished or killed. I watched the memory fill up and getting freed up. The only work around I have right now is to simulate a small amount of events per file.

I use the latest stable Allpix Squared code (3.1.1), Geant4 v11.2.2, and ROOT v6.32.02 on Ubuntu 22.04.5 LTS.

I hope someone who understands the code better than I do can fix it. I am not sure what other information I can provide; let me know if there is anything.

Kind regards
Stefan

I cannot upload files (it says new users are not allowed to upload files), so I just dump the config and related file content here:
//main config start//

[Allpix]
number_of_events = 100000
detectors_file = "Moon_Detector_PE.conf"
#workers = 100

[GeometryBuilderGeant4]

[DepositionGeant4]
physics_list = QGSP_INCLXX_HP_EMZ
source_type = "macro"
file_name = "GPS.mac"
record_all_tracks = 1

number_of_particles = 1

#cutoff_time = 882s

[ElectricFieldReader]
model = "linear"
bias_voltage = 150V
depletion_voltage = 40V

[WeightingPotentialReader]
model = "pad"

[TransientPropagation]
temperature = 323K
charge_per_step = 200
timestep = 0.5ns
integration_time = 30ns
distance = 4

[PulseTransfer]
timestep = 0.5ns

[ROOTObjectWriter]
file_name = "PEconv.root"
exclude = PropagatedCharge,DepositedCharge

//main config end//

//detector file start//

[detector]
type = "timepix"
sensor_thickness = 500um
position = 0 0 0mm
orientation = 0 0 0

[DeadLayer]
type = "box"
size = 14.4mm 14.4mm 800nm
position = 0mm 0mm -250.4um
orientation = 0 0 0deg 
material = "silicon"
role = "passive"

[Electrode]
type = "box"
size = 14.4mm 14.4mm 600nm
position = 0mm 0mm -251.1um
orientation = 0 0 0deg 
material = "aluminum"
role = "passive"

[Converter]
type = "box"
size = 14.4mm 14.4mm 2mm
position = 0mm 0mm -1251.4um
orientation = 0 0 0deg 
material = "G4_POLYETHYLENE"
role = "passive"

//detector file end//

//GPS.mac//

/gps/particle neutron

/gps/ene/type Lin
/gps/ene/intercept 1
/gps/ene/min 1e8 eV
/gps/ene/max 1e9 eV

/gps/pos/type Plane
/gps/pos/shape Rectangle
/gps/pos/halfx 7.2 mm
/gps/pos/halfy 7.2 mm
/gps/pos/centre 0 0 -12 mm
/gps/direction 0 0 1

//GPS.mac//

Hi @StefanGohl ,

This sounds interesting! My first thought was to ask whether you noticed that it gets killed when you hit a specific event number if you fix the random seed. This would point to a large deposit in one event, leading to very many charge carriers being propagated and eating up all the RAM.
However, you mention that the memory seems to fill up as time goes on, if I understand correctly.

Does this persist even with a small buffer or if you run single-threaded? I take it that it does from your statement, just double-checking.

The first thing I would do is reduce the distance in [TransientPropagation]. A distance of 4 means that for every charge carrier movement the induced charge is calculated on the pixel the carrier is in plus four layers of surrounding pixels, i.e. a 9x9 matrix of 81 pixels, which consumes a lot of resources while providing little extra information. A distance of 1 calculates the induced charge on a 3x3 matrix, which is normally plenty.
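For reference, this suggestion amounts to a one-line change in the [TransientPropagation] section of the main config above; a sketch with only the distance changed:

```ini
[TransientPropagation]
temperature = 323K
charge_per_step = 200
timestep = 0.5ns
integration_time = 30ns
# induce charge only on the hit pixel and its direct neighbours (3x3 matrix)
distance = 1
```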

You could also try running without [ROOTObjectWriter]. I don't think there is a memory leak; rather, the open ROOT file may become excessively large as events go on, and removing the writer would be a way to show that. You could also try including fewer objects in the output file.
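If disabling the writer entirely is too drastic, restricting the output further could look roughly like this, assuming an `include` key works as the counterpart of the `exclude` key already used in the config above (and that PixelCharge is the final object in this pipeline, since [PulseTransfer] is the last processing module):

```ini
[ROOTObjectWriter]
file_name = "PEconv.root"
# keep only the final pixel charges; drop all intermediate objects
include = PixelCharge
```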

Kind regards,
Håkan

Hello Hakan,
I was playing around with the buffer size and the number of threads (I didn't run single-threaded), but this only changed the time it took for the memory to fill up. The number of events processed before the crash was random, though I didn't run with a fixed seed. The memory fills up slowly over time as events are processed. I did reduce the distance to 3 before, without a noticeable change.

I will try your suggestions for more tests and come back with the results.

Kind regards,
Stefan

Hi @StefanGohl

there are three things I would look at in your setup after having played a bit around:

  • I would recommend increasing the group size significantly. Whenever a particle hits your detector, you get more than 500 ke of charge carriers, so a group size of 200 is very fine-grained, and I would expect the results not to be drastically affected by a size of 2000 or even larger.
  • I would recommend using a pre-calculated weighting potential. In my trials this has drastically reduced the run time. You can set model = mesh and pre-calculate the same weighting potential using the generate_potential command while supplying your detector model.
  • The memory issue might really be too many events in flight, with ~1M objects each. I would suggest running with multithreading = false to see if the RAM of your machine is then sufficient.
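The last point can be tested with a minimal change in the global section; a sketch based on the original config, with only the multithreading flag added:

```ini
[Allpix]
number_of_events = 100000
detectors_file = "Moon_Detector_PE.conf"
# process events strictly one at a time to bound the resident memory
multithreading = false
```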

Best,
Simon

Hello @hwennlof and @simonspa,
Thank you for your suggestions. I tried some of them and here are my findings so far:

  • Setting the distance in [TransientPropagation] to 1 is not viable for me. I am detecting alpha particles, which usually create large blobs across several adjacent pixels, and my analysis is based on the cluster shape of these particle signals. So I do need the distance to be that high. However, I set it to 1 just for testing purposes, and the memory build-up is still there, just slower.
  • Running without [ROOTObjectWriter] did not change anything for the memory build-up.
  • Including fewer objects in the output file is possible, but I have to go down to only 1000 events per file. Given that I simulate a low-efficiency process, I need millions of events to get reasonable statistics. So simulating 1000 events per file would be possible, but it does not solve the underlying issue.
  • I changed the group size to 2000, and I was able to simulate a much larger number of events per file (100k, maybe more). However, the memory build-up was still there.
  • Maybe I should have mentioned this before: I am running the simulation in a virtual machine with 32 GB RAM on a server, which I would think should be plenty of memory. I am currently running a test with multithreading off, but I don't expect it to change anything other than making the simulation last much longer.

The memory build-up happens over the course of many events: as the number of simulated events rises, the memory slowly fills up. It always grows; I never see memory being freed. According to some colleagues this happens only when simulating neutrons and not with other source particles, though I have not verified that. Everything I tried so far only changes how fast the memory fills up. With any setting I tried, if I ran the simulation long enough, I would reach a point where the 32 GB of memory was completely full and the simulation would crash.

I have not yet looked into the pre-calculation of the weighting potential; I still have to figure out how exactly it works.

Best
Stefan

Hi @StefanGohl

thanks for your response!

Setting the distance in [TransientPropagation] to 1 is not viable for me

I think you are misinterpreting the distance parameter here. It applies to every single charge carrier and only affects how many surrounding pixels are used to calculate the induced current. So this is entirely unrelated to your actual deposition: if the charge carrier moves 3 pixels (or 15 because it's a delta ray), this does not mean you need a large distance. Induction will always be calculated starting from the current position of the carrier.

Running without [ROOTObjectWriter] did not change anything for the memory build-up.

This is a good confirmation and expected. I really think it is related to the initial number of objects generated, not to the ones stored on disk.

However, the memory build-up was still there.

I believe that this is not really a "build-up" as in "allocated memory is not freed", but rather a Geant4 event with a particularly large deposited energy. If you want to investigate this, switch to single-threaded, observe how far the simulation runs, then note the initial seed and the event number and rerun with the same seed, jumping directly to that event using skip_events.
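A sketch of such a rerun in the global section; the seed and event number here are placeholders to be replaced with the values noted from the crashing run:

```ini
[Allpix]
# placeholder: reuse the seed noted from the earlier (crashing) run
random_seed = 123456789
# placeholder: jump directly to the suspicious event, e.g. the one the run died on
skip_events = 48211
```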

I would think, that this should be plenty of memory.

That’s always relative :smiley: But indeed 32GB is quite a bit.

I am currently running a test with multithreading off, but I don’t think it will change anything other than make the simulation last much longer.

Of course it will change things, because you don’t have N events in parallel in memory but just one.

The memory build-up happens over the course of many events

Events which are not fully finished (e.g. because they could not be written to file because one earlier event is still missing) are buffered and occupy memory until they are fully finished and freed. I am quite sure that we do not have memory leaks, because we a) sanitize very strictly and b) regularly run simulations with millions of events ourselves without seeing a leak of even a couple of MB.

If you don’t care about the order of events in your output file, I would try this MR I prepared, ROOTObjectWriter: allow waiving sequence requirement (!1154) · Merge requests · Allpix Squared / Allpix Squared · GitLab, using the setting

[ROOTObjectWriter]
require_sequence = false

because this drops the need for buffering and just writes events in the order they finish, not in their original seeding order. This might help to further work around the issue.

According to some colleagues this is happening only when simulating neutrons. It does not happen with other source particles.

This would point to the things we have discussed above: when they interact, you have an enormous amount of energy deposited, as opposed to a MIP for example.

I still have to figure out how this works exactly. Currently, I do not know, but I will look into that.

You take your detector model, timepix in your case, and adjust the sensor thickness. Then you just run

generate_potential --model timepix.conf

and use the resulting file. The code is the same as the one run by [WeightingPotentialReader], but it runs only once instead of for every event.
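The pre-calculated file would then replace the analytic model in the main config, roughly like this (the file name here is an assumption; adjust it to whatever generate_potential actually writes for your model):

```ini
[WeightingPotentialReader]
model = "mesh"
# file produced by generate_potential; adjust the name to the actual output
file_name = "timepix_weightingpotential.apf"
```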

Best,
Simon

Hi @simonspa,
thank you for your very fast answer!

I understand that distance is not the distance that a particle travels in the detector. In my simulation, the neutrons create secondary alphas with an energy of around 2.7 MeV. Those alphas do not travel far and can be stopped within a single pixel. However, given the large amount of energy deposited, the induction effect can reach quite far. This induction corresponds to the halo seen in real measurements, if I understand correctly. With the halo, alpha blobs can have a radius of 4 to 5 pixels. If I simulate with distance = 1, I only get blobs within a 3x3 matrix, but that is not a realistic representation of a typical alpha signature at this energy in a Timepix detector, which can span a 5x5 matrix. So with distance = 1 I get clusters that are too small for my alphas.

First, “build-up” == “allocated memory is not freed”; forgive my sloppy language ^^. Second, if I understand you correctly, you mean that if there are enough unfinished high-energy-deposition events in the buffer, this might fill up the memory? Given the setup of my machine, I have close to 5000 buffer slots under the settings I normally use, and the buffer is usually full.

I will test your suggestions over the weekend and report next week whether they work. Thank you!
Stefan

Yes, but neither is it the distance charge carriers travel in the detector. It depends solely on the weighting potential, which - as you will agree - is independent of your impinging particle type and energy.

/Simon

@simonspa
Using a fixed seed and multithreading = false did not result in the simulation being killed. I guess you are right, and I simply have too many events with a high number of objects filling up my RAM. So it's a matter of finding the right settings. Increasing the group size from 200 by a factor of 10 already reduced the memory usage significantly. The simulation also runs much faster with the pre-calculated weighting potential. I am still running the last tests, as I haven't looked at the results with the new settings yet, but hopefully there are no bad surprises.

Thank you very much! Your help is greatly appreciated.
Kind regards
Stefan

Hi @StefanGohl

very happy that it seems to work out for you! Finding the right parameters can indeed be difficult, but you now seem to be on the right track!

Increasing the group size from 200 by a factor of 10 already reduced the memory usage significantly

I would expect a reduction essentially by a factor close to 10. Most of the settings enter the system requirements (RAM, simulation time, …) linearly, and so does this one. We allocate objects for each of these groups, so cutting the number of groups by a factor of 10 cuts your memory requirements accordingly.

Using a fixed seed

Just to make sure: this was only meant for playing with reproducibility. Please don't run all your simulations with the same seed, or you will always get the same output…

If you fancy playing some more, the above-linked merge request (require_sequence = false) together with multithreading = true and maybe buffer_per_worker = 0 should unleash the power of using multiple cores while not using any buffer.
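Combined, that experiment would look roughly like this (the require_sequence key assumes the MR !1154 branch is in use):

```ini
[Allpix]
multithreading = true
# do not reserve any buffer slots for out-of-order events
buffer_per_worker = 0

[ROOTObjectWriter]
file_name = "PEconv.root"
# write events as they finish instead of in seeding order (needs MR !1154)
require_sequence = false
```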

Cheers,
Simon

P.S. I’d be very interested to hear about your simulations at our next user workshop: 6th Allpix Squared User Workshop (7-9 May 2025): Overview · Indico :wink:

I am aware. :smiley: Thank you!

I went through the merge request and I am using it. I can't really tell if it makes a significant difference. I was thinking of at least reducing the number of buffers if my issue persists; right now, it seems that is not necessary.

I’ll think about it.

Regards
Stefan