WHIZARD: Hunting a random number (non-)bug
GDB for the win, somehow!

<2021-03-05 Fri 17:46>

WHIZARD is a general Monte Carlo event generator for high-energy particle physics.¹ It's written in (mostly) modern Fortran, some less modern Fortran and OCaml. Beside some cumulated and ancient parts, which are shipped with WHIZARD, but are not really backed by any developer today, WHIZARD has a well designed program architecture and includes a fair amount of documentation. However, WHIZARD know has a rather larger code base for a scientific software with a rather smallish core team and some volatile on-and-off developers - or more slangish: tenure position against PhD students and Postdocs.

Today, a particular design trait hit me hard: How we handle the creation of independent random number sequences in WHIZARD. Thanks to Pascal Stienemeier for pointing it out to me and his patience.

The problem: We run a WHIZARD program with identical instructions, twice. But we skip the Monte Carlo integration in the second run and reuse the integration result and grid from the previous invocation. We let WHIZARD produce a single event and compare it. And, …, the events are not identical.

The question: Would we expect the events to be numerically identical?

The following is an (hell of) abbreviation of what I tried until I understood the issue:

We need a debug build of WHIZARD:
- <path-to-repo>/configure --prefix"$(pwd)/install" FCFLAGS="-O0 -g" –disable-static=
- make -j
- make install
We need to run WHIZARD twice, firstly, redoing everything and, secondly, using the previous results and doing only the event generation, respectively with GDB. Thanks, again, to Pascal for setting up a Sindarin and shell reproducer.
We need to find out where all the RNG modules, procedures and types are located: info modules rng.
WHIZARD uses Fortran's polymorphic type system to apply different type extension on a common base type interface - in WHIZARD "dispatch" an typed-object. In our case, info module functions -m dispatch_rng reveals that we make a call to rng_tao_factory_make in order to initialize a new RNG state based on the previous states (the first states starts with the original seed set by the user / system time).

We have four place in WHIZARD where we create a new starting point for a random number sequence:

Process Setup of the multi-channel integration
Event Setup of the event generation
- random number sequence for the process
- random number sequence for the event transformations
Random number sequence for the event generation itself

Next step, we need to find out when we call which random number sequence. And, it gets a little bit complicated know: The integration method VAMP provides a random number generator (RNG) - T. Ohl coined it "TAO" as it is a RNG design by Donald Knuth -, which WHIZARD does only wrap into a extended RNG base type. When we call into the VAMP integration sample procedure (in the first WHIZARD run), we only pass the wrapped plain RNG state to VAMP and do not use the wrapper routines. Thus, we have to find out which routines are then called directly from the integration sampler: With info module functions -m tao_random_numbers we find that all procedures call at their core generate, thus, we once hit generate to find which routine the sample procedure uses; tao_random_numbers::real_array_state.

With that knowledge, we can know finger the wrapper state of the TAO RNG object each time we hit tao_random_numbers::real_array_state - the wrapper RNG object is located a little up the stack:

break tao_random_numbers::real_array_state
command
up 3
print ((rng_tao_t)* <address of rng%_data>)%seed
continue
end

Given the break command, we see that the integration calls 1776 (plus one, but I used the first break to get the address of rng%_data) into the TAO random number sequence from the process setup of the multi-channel integration, and, to my personal surprise, twice during the event generation.

I repeated the previous debug setup with the second, consecutive WHIZARD run and, lo and behold, we call the TAO RNG from the process setup of the multi-channel integration exactly only twice, namely during event generation.

So, we found the culprit's behavior: WHIZARD imports the process setup from the multi-channel integration into the event generation, including the random number state as is - this conclusion required another debug run, but I skip its explanation as it does not illuminate the issue further. Therefore, depending on whether we run WHIZARD with an actual integration or not - we load the integration result and grid of a previous invocation - we find a different random number sequence for the event generation.

For an end user (and, also me), this behavior is strange as it is neither documented nor intuitive.

Last, but not least, GDB is the natural choice. In principle, I could have automated a lot of the debug process, but GDB has some annoying bugs with regard to modern Fortran and the correct determination of object types. Thus, I had to do a lot of the debug process manually. Sigh.

Footnotes:

https://whizard.hepforge.org/

WHIZARD: Hunting a random number (non-)bug GDB for the win, somehow!

Footnotes:

WHIZARD: Hunting a random number (non-)bug
GDB for the win, somehow!