ROOTObjectWriter very slow when defining an array of objects in the include parameter

Hi!
New user to Allpix Squared here! I am having some speed-related issues when including multiple objects for output writing to a ROOT file in the ROOTObjectWriter module. When I include only a single object, the runtime is reasonably linear in the number of events, so every event takes approximately the same amount of time. However, when I include an array of objects, the runtime per event is not linear; the later events take much longer than the earlier ones. This really becomes a problem for larger simulations with millions of events.

To verify this behaviour I ran the simulation with four different configurations of the ROOTObjectWriter module; all other settings were kept unchanged. Each simulation consists of 50,000 events. The following include parameters were passed to the ROOTObjectWriter module for the four simulations:

  1. include = "DepositedCharge"
  2. include = "PropagatedCharge"
  3. include = "PixelCharge"
  4. include = "PixelCharge" "PropagatedCharge" "DepositedCharge"

(as a new user I can only add a single photo to the post, so I wrote out the results for the first three)
1:
Total time: 4 minutes 25 seconds
Time spent in ROOTObjectWriter: 5.09s
2:
Total time: 4 minutes 23 seconds
Time spent in ROOTObjectWriter: 32.91s
3:
Total time: 4 minutes 17 seconds
Time spent in ROOTObjectWriter: 3.038s
4:
(attached as a screenshot; the time spent in ROOTObjectWriter was roughly 1664 s, as discussed below)

Notice that in the first three cases, only a couple of seconds is spent in the ROOTObjectWriter module. In the last case, almost half the time is spent in this module. This means it would be faster for me to run the same simulation three times, each writing a single object to a ROOT file, than to run a single simulation writing all three objects. I was wondering if this is a known issue?
Thank you so much!

I wanted to upload my conf files here, but it seems I am not allowed to upload files as a new user, so I have added them below:

[Allpix]
log_level = "INFO"
log_format = "DEFAULT"
model_paths = "./models"
detectors_file = "./detectors/bare_detector_gold_contacts.conf"
output_directory = "./simulation_results/bare_detector_gold_contacts"
root_file = "bare_detector_gold_contacts_test"
number_of_events = 50000

[GeometryBuilderGeant4]
world_material = "vacuum"
log_level = "WARNING"

[DepositionGeant4]
log_level = "WARNING"
physics_list = FTFP_BERT_LIV
particle_type = "gamma" 
source_energy = 4keV 
source_type = "beam"
source_position = 0um 0um -10mm
beam_size = 0mm
flat_beam = true
beam_direction = 0 0 1
number_of_particles = 1

[ElectricFieldReader]
log_level = "WARNING"
model = "linear"
bias_voltage = -1000V
depletion_voltage = 500V

[GenericPropagation]
log_level = "WARNING"
temperature = 223K #-50C
integration_time = 200ns
propagate_electrons = true
output_plots = 1
charge_per_step = 10 # Max number of charges propagated together


[SimpleTransfer]
log_level = "WARNING"
max_depth_distance = 5um
output_plots = 1

[ROOTObjectWriter] 
log_level = "WARNING"
file_name = "bare_detector_gold_contacts_test_data"
include = "PixelCharge" "PropagatedCharge" "DepositedCharge" # only this line is changed

detector conf:

[detector1]
type = "pin"
position = 0 0 5mm
orientation = 0 0 0

[goldelectrodefront]
type = "box"
size = 10mm 10mm 18nm
position = 0mm 0mm 3.499991mm #3.5mm - 9nm
orientation = 0 0 0 
material = "Au"
role = "passive"
color = 1 0.8 0 

Model:

type = "monolithic"
geometry = "pixel"

number_of_pixels = 1 1
pixel_size = 10mm 10mm
sensor_material = "SILICON"

sensor_thickness = 3mm
sensor_excess = 0um

chip_thickness = 5um

Hi @sblokhuizen and welcome! :slight_smile:

the slowdown you are seeing has nothing to do with the fact that you defined an array of objects to be written, but with the objects you selected. The object types DepositedCharge and PropagatedCharge contain an individual object for every single charge carrier in your sensor, so easily thousands of objects per event. Storing them all will slow down your simulation significantly - and in most cases it is not required for any analysis afterwards.

To cross-check that you are not seeing another issue, try with the following array:

include = MCTrack, MCParticle, PixelCharge, PixelHit

this would mean you are storing the MC truth information of the primary particle as well as the total collected charge on each pixel before and after the front-end simulation.

Please let me know if this clarifies things for you and solves your issue!

All the best,
Simon

Hi Simon,
Thank you for replying so quickly :slight_smile: !

I am interested in doing analysis on the charge distributions per event for every step of the simulation. To this end, I extract the charges from DepositedCharge, PropagatedCharge and PixelCharge and sum them up for each event.

Perhaps I am thinking about this too simplistically, but shouldn't the time spent in the ROOTObjectWriter in simulation 4 be approximately the sum of the times spent in the ROOTObjectWriter in simulations 1 through 3, since simulation 4 writes essentially the same amount of data as simulations 1-3 combined? So, on the order of (5+32+3=) ~40 s instead of 1664 s? The ROOTObjectWriter does not seem to scale linearly with the number of events, causing it to completely dominate the runtime of larger simulations. Furthermore, when writing out only the DepositedCharge or PropagatedCharge object, I do not see this nonlinear runtime where earlier events are processed much faster (~500 events per second) than later events (~10 events per second); in those cases, the rate stays at a constant ~500 events per second throughout.

I have tried running with your include array:

include = MCTrack, MCParticle, PixelCharge, PixelHit

This works fine and I obtain the following result:

|10:28:02.113|  (STATUS) Finalization completed                                                         
|10:28:02.113|  (STATUS) Executed 6 instantiations in 2 minutes 17 seconds, spending 96% of time in slowest instantiation GenericPropagation:detector1                                                          
|10:28:02.113|    (INFO)  Module GeometryBuilderGeant4 took 316.127ms      
|10:28:02.113|    (INFO)  Module DepositionGeant4 took 28.441s              
|10:28:02.114|    (INFO)  Module ElectricFieldReader:detector1 took 186.252ms    
|10:28:02.114|    (INFO)  Module GenericPropagation:detector1 took 1430.28s                
|10:28:02.114|    (INFO)  Module SimpleTransfer:detector1 took 2.32468s                                 
|10:28:02.114|    (INFO)  Module ROOTObjectWriter took 21.1003s                                         
|10:28:02.114|  (STATUS) Average processing time is 2.71669ms/event, event generation at 368 Hz  
|10:28:02.114|  (STATUS) This corresponds to a processing time of 29.8836ms/event per worker

Thank you again for your help!
Kind regards,
Sebbe

Hi @sblokhuizen

aha, thanks for pointing me at the numbers again. I’ll look into what might be causing this - but my initial guess is that it is the relations between the individual objects.

If you just store one type, the objects will not have any cross-references, but when storing several, there is a link allowing you to access the history, e.g. PixelCharge::getPropagatedCharges() et cetera. These links are raw pointers during the run and therefore very lightweight. In order to make them persistent when storing to and loading from a file, we need to convert them to ROOT TRef objects - which is what might create the issue here.

(if you’re into reading C++ code, it’s these petrifyHistory calls along with the PointerWrapper::store method :slightly_smiling_face: )
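
In rough pseudo-code, the conversion idea looks something like this - a minimal sketch only, with made-up class and member names, not the actual Allpix Squared code:

#include <TObject.h>
#include <TRef.h>

// Illustrative sketch of the raw-pointer -> TRef conversion described above.
class HistoryLinkSketch {
    const TObject* raw_{nullptr}; // cheap raw pointer used during the run
    TRef ref_;                    // persistent reference, filled only right before writing

public:
    explicit HistoryLinkSketch(const TObject* obj) : raw_(obj) {}

    // Called before the owning object is written to file ("petrifying" the history).
    // Assigning a TObject* to a TRef registers the object with ROOT's TProcessID and
    // assigns it a unique ID - this bookkeeping is the potentially expensive part.
    void petrify() {
        if(raw_ != nullptr) {
            ref_ = const_cast<TObject*>(raw_);
        }
    }

    // After reading the file back, the TRef resolves the referenced object again.
    const TObject* get() const { return ref_.GetObject(); }
};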

Cheers,
Simon

Hi @sblokhuizen

I did some investigation of your issue, using your configuration files. I have been using the same random_seed for all of the runs to make sure the simulation itself is reproduced correctly.

To me it looks like there are several issues that play a role here:

  • There likely is an issue in our accounting of module running times, especially when events are buffered in between.
  • The overall run time does not change significantly for 10k events - only around a 10 s difference.
  • If the buffers are exhausted, the simulation is slowed down significantly, as explained below.

As a short explanation: the ROOTObjectWriter expects events to arrive in sequential order for writing them to file (i.e. event 1001 after event 1000). If one event is not ready yet, all subsequent events are buffered until the hanging event is done and the sequence can continue. If the buffer slots are full, no new events can be processed until that one long-running event is done - essentially slowing the simulation down to single-threaded speed.
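
In rough pseudo-code, the described behaviour looks something like this - a simplified sketch with made-up names, not the actual Allpix Squared scheduler:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>

// Illustrative sketch of sequential event writing with a bounded reorder buffer.
class SequentialWriterSketch {
    uint64_t next_event_{1};                           // next event number the writer expects
    std::map<uint64_t, std::function<void()>> buffer_; // finished but out-of-order events
    std::size_t max_slots_;

public:
    explicit SequentialWriterSketch(std::size_t slots) : max_slots_(slots) {}

    // Returns false when all buffer slots are occupied, i.e. the calling worker has to
    // wait for the long-running event - this is what serializes the simulation.
    bool submit(uint64_t event, std::function<void()> write) {
        if(event != next_event_) {
            if(buffer_.size() >= max_slots_) {
                return false; // buffer exhausted, worker stalls
            }
            buffer_.emplace(event, std::move(write));
            return true;
        }
        write();
        ++next_event_;
        // Flush buffered events that are now in sequence:
        for(auto it = buffer_.find(next_event_); it != buffer_.end(); it = buffer_.find(next_event_)) {
            it->second();
            buffer_.erase(it);
            ++next_event_;
        }
        return true;
    }
};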

Could you provide me with the details on how many threads you were running and how many buffer slots were available? This should be printed at STATUS level at the beginning of the run.

Also, I’d be interested in the output of allpix --version for your system.

The details are below.

All the best,
Simon

Investigation

Unless otherwise noted, all runs were executed with the configuration files posted above, 10k events and a fixed random seed. Profiling was done via Intel oneAPI VTune.

All three types, Sequential Event Writing

Buffer starts filling slowly at around 6k events

|11:42:43.881|  (STATUS) Executed 6 instantiations in 1 minutes 31 seconds, spending 82% of time in slowest instantiation GenericPropagation:detector1
|11:42:43.881|  (STATUS)  Module GeometryBuilderGeant4 took 255.965ms
|11:42:43.881|  (STATUS)  Module DepositionGeant4 took 21.3169s
|11:42:43.881|  (STATUS)  Module ElectricFieldReader:detector1 took 120.97ms
|11:42:43.881|  (STATUS)  Module GenericPropagation:detector1 took 445.733s
|11:42:43.881|  (STATUS)  Module SimpleTransfer:detector1 took 1.49942s
|11:42:43.881|  (STATUS)  Module ROOTObjectWriter took 71.7542s

|11:42:43.881|  (STATUS) Average processing time is 8.72411ms/event, event generation at 115 Hz
|11:42:43.881|  (STATUS) This corresponds to a processing time of 61.0688ms/event per worker

There is a 3% penalty for generating TRefs for the object history chain.

Only PropagatedCharge, Sequential Event Writing

|11:47:36.058|  (STATUS) Finalization completed
|11:47:36.058|  (STATUS) Executed 6 instantiations in 1 minutes 18 seconds, spending 94% of time in slowest instantiation GenericPropagation:detector1
|11:47:36.058|  (STATUS)  Module GeometryBuilderGeant4 took 252.934ms
|11:47:36.059|  (STATUS)  Module DepositionGeant4 took 21.6912s
|11:47:36.059|  (STATUS)  Module ElectricFieldReader:detector1 took 111.603ms
|11:47:36.059|  (STATUS)  Module GenericPropagation:detector1 took 472.376s
|11:47:36.059|  (STATUS)  Module SimpleTransfer:detector1 took 1.40575s
|11:47:36.059|  (STATUS)  Module ROOTObjectWriter took 4.09551s
|11:47:36.059|  (STATUS) Average processing time is 7.14635ms/event, event generation at 140 Hz
|11:47:36.059|  (STATUS) This corresponds to a processing time of 50.0245ms/event per worker

No call to TRef generation is visible for the single-type run.

All three types, No Sequence Requirements

|11:55:52.529|  (STATUS) Executed 6 instantiations in 1 minutes 39 seconds, spending 61% of time in slowest instantiation GenericPropagation:detector1
|11:55:52.529|  (STATUS)  Module GeometryBuilderGeant4 took 381.243ms
|11:55:52.530|  (STATUS)  Module DepositionGeant4 took 23.354s
|11:55:52.530|  (STATUS)  Module ElectricFieldReader:detector1 took 156.042ms
|11:55:52.530|  (STATUS)  Module GenericPropagation:detector1 took 403.678s
|11:55:52.530|  (STATUS)  Module SimpleTransfer:detector1 took 1.66258s
|11:55:52.530|  (STATUS)  Module ROOTObjectWriter took 228.496s
|11:55:52.530|  (STATUS) Average processing time is 9.39507ms/event, event generation at 106 Hz
|11:55:52.531|  (STATUS) This corresponds to a processing time of 65.7655ms/event per worker

Generating TRefs accounts for around 10% of the time.

Only PropagatedCharge, No Sequence Requirement

|12:01:05.024|  (STATUS) Executed 6 instantiations in 1 minutes 30 seconds, spending 90% of time in slowest instantiation GenericPropagation:detector1
|12:01:05.024|  (STATUS)  Module GeometryBuilderGeant4 took 404.379ms
|12:01:05.025|  (STATUS)  Module DepositionGeant4 took 27.4709s
|12:01:05.025|  (STATUS)  Module ElectricFieldReader:detector1 took 149.413ms
|12:01:05.025|  (STATUS)  Module GenericPropagation:detector1 took 525.819s
|12:01:05.025|  (STATUS)  Module SimpleTransfer:detector1 took 1.73658s
|12:01:05.026|  (STATUS)  Module ROOTObjectWriter took 23.4508s
|12:01:05.029|  (STATUS) Average processing time is 8.26715ms/event, event generation at 121 Hz
|12:01:05.030|  (STATUS) This corresponds to a processing time of 57.8701ms/event per worker

No call to TRef generation is visible for the single-type run.

38k events run with buffers exhausted, Sequential Event Writing

Buffers full at 11k events

|12:22:27.169|  (STATUS) Executed 6 instantiations in 16 minutes 51 seconds, spending 59% of time in slowest instantiation GenericPropagation:detector1
|12:22:27.170|  (STATUS)  Module GeometryBuilderGeant4 took 743.316ms
|12:22:27.170|  (STATUS)  Module DepositionGeant4 took 54.7667s
|12:22:27.170|  (STATUS)  Module ElectricFieldReader:detector1 took 311.116ms
|12:22:27.170|  (STATUS)  Module GenericPropagation:detector1 took 1562.67s
|12:22:27.170|  (STATUS)  Module SimpleTransfer:detector1 took 4.26565s
|12:22:27.170|  (STATUS)  Module ROOTObjectWriter took 985.628s
|12:22:27.170|  (STATUS) Average processing time is 27.1147ms/event, event generation at 37 Hz
|12:22:27.170|  (STATUS) This corresponds to a processing time of 189.803ms/event per worker

Fallback clearly visible when buffers fill, threads are locking; the run time is split between initial event processing (direct processing) and buffer work.

Hi @simonspa

Thanks for looking into it!
I am running on 12 threads. Running my previous configuration file with the STATUS log level, I obtain the following:

|13:20:40.069|  (STATUS) Welcome to Allpix^2 v2.0.0-1789-g71c695ca2
|13:20:40.069|  (STATUS) Initialized PRNG with system entropy seed 9048716555984905851
|13:20:40.283|  (STATUS) Loaded 6 modules                                                                                                                                                                          
|13:20:40.284|  (STATUS) Multithreading enabled, processing events in parallel on 11 worker threads                                                                                                                
|13:20:40.284|  (STATUS) Allocating a total of 2816 event slots for buffered modules 

When outputting all 3 objects, I notice the buffered events start piling up at around 7k events and reach the maximum (2816) at around 17.5k events.

Output for allpix --version:

Allpix Squared version v2.0.0-1789-g71c695ca2
built on 2023-01-11, 09:10:44 UTC
using Boost.Random 1.81.0                                                                                        
ROOT 6.26/10
Geant4 11.1.0
running on 12x Intel(R) Xeon(R) W-10855M CPU @ 2.80GHz 

For the fun of it - and if you have enough RAM - could you try adding

[Allpix]
buffer_per_worker = 1024

or similar? You should then see that the buffers fill up much later.
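
Assuming the 11 worker threads from your log above, that would amount to 11 × 1024 = 11264 buffered event slots instead of the 2816 reported at startup.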

/Simon

To add: I also do not observe the buffers filling up (tested up to 50k events) when using

include = "DepositedCharge" "PropagatedCharge"

Also not with:

include = "DepositedCharge" "PixelCharge"

But I do see it using the following:

include = "PropagatedCharge" "PixelCharge"

Perhaps that can nudge you in the right direction?

Setting buffer_per_worker = 1024 quadruples the number of allowed buffered events, which delays the problem by about 10k events; the buffers now fill at around 25k events. That is nice in the short term, but given that I'd need millions of events for accurate data, it does not help me in the long run. I still have quite a lot of memory headroom, so I will be bumping that value up in the meantime :slight_smile: ! Thanks!

The config

include = "PropagatedCharge" "PixelCharge"

is the only one for which a one-to-many relation of TRefs has to be created. The other combinations either have no relation (the intermediate object for the full history is missing) or have a one-to-one correspondence (one DepositedCharge → one PropagatedCharge), so the culprit must be somewhere in there.
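
To illustrate the two relation types in a made-up sketch (not the actual object definitions):

#include <vector>
#include <TRef.h>

// One-to-one relation: a single TRef back to the originating DepositedCharge.
struct PropagatedChargeSketch {
    TRef deposited_charge;
};

// One-to-many relation: one TRef per PropagatedCharge contributing to the pixel,
// which can easily be thousands of references per event.
struct PixelChargeSketch {
    std::vector<TRef> propagated_charges;
};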

I am able to create a much more exaggerated version of the problem, and at least part of the issue is really the TRef allocation.

I’ll dig some more and let you know if I find a solution.


Hi @sblokhuizen

just to let you know that this hasn't been sitting idle: there is a long discussion with the ROOT developers as well as a draft merge request working around this issue in Allpix Squared.

If you have time to test this branch, that would be great!

All the best,
Simon

Hi @sblokhuizen

this has now been merged into the master branch and will be available in the latest version on CVMFS as of tomorrow. We are also considering a backport to v2.4.1.

Cheers,
Simon