WHIZARD: Hunting a random number (non-)bug
GDB for the win, somehow!
WHIZARD is a general Monte Carlo event generator for high-energy particle physics.1
It's written in (mostly) modern Fortran, some less modern Fortran and OCaml.
Beside some cumulated and ancient parts, which are shipped with
WHIZARD, but are not really backed by any developer today,
WHIZARD has a well designed program architecture and includes a fair amount of documentation.
WHIZARD know has a rather larger code base for a scientific software with a rather smallish core team and some volatile on-and-off developers - or more slangish: tenure position against PhD students and Postdocs.
Today, a particular design trait hit me hard: How we handle the creation of independent random number sequences in
Thanks to Pascal Stienemeier for pointing it out to me and his patience.
The problem: We run a
WHIZARD program with identical instructions, twice. But we skip the Monte Carlo integration in the second run and reuse the integration result and grid from the previous invocation.
WHIZARD produce a single event and compare it. And, …, the events are not identical.
The question: Would we expect the events to be numerically identical?
The following is an (hell of) abbreviation of what I tried until I understood the issue:
- We need a debug build of
<path-to-repo>/configure --prefix"$(pwd)/install" FCFLAGS="-O0 -g" –disable-static=
- We need to run
WHIZARDtwice, firstly, redoing everything and, secondly, using the previous results and doing only the event generation, respectively with
GDB. Thanks, again, to Pascal for setting up a Sindarin and shell reproducer.
- We need to find out where all the RNG modules, procedures and types are located:
info modules rng.
WHIZARDuses Fortran's polymorphic type system to apply different type extension on a common base type interface - in
WHIZARD"dispatch" an typed-object. In our case,
info module functions -m dispatch_rngreveals that we make a call to
rng_tao_factory_makein order to initialize a new RNG state based on the previous states (the first states starts with the original seed set by the user / system time).
We have four place in
WHIZARD where we create a new starting point for a random number sequence:
- Process Setup of the multi-channel integration
- Event Setup of the event generation
- random number sequence for the process
- random number sequence for the event transformations
- Random number sequence for the event generation itself
Next step, we need to find out when we call which random number sequence.
And, it gets a little bit complicated know:
The integration method
VAMP provides a random number generator (RNG) - T. Ohl coined it "TAO" as it is a RNG design by Donald Knuth -, which
WHIZARD does only wrap into a extended RNG base type.
When we call into the
VAMP integration sample procedure (in the first
WHIZARD run), we only pass the wrapped plain RNG state to
VAMP and do not use the wrapper routines.
Thus, we have to find out which routines are then called directly from the integration sampler:
info module functions -m tao_random_numbers we find that all procedures call at their core
generate, thus, we once hit
generate to find which routine the sample procedure uses;
With that knowledge, we can know finger the wrapper state of the TAO RNG object each time we hit
tao_random_numbers::real_array_state - the wrapper RNG object is located a little up the stack:
break tao_random_numbers::real_array_state command up 3 print ((rng_tao_t)* <address of rng%_data>)%seed continue end
Given the break command, we see that the integration calls
1776 (plus one, but I used the first break to get the address of
rng%_data) into the TAO random number sequence from the process setup of the multi-channel integration, and, to my personal surprise, twice during the event generation.
I repeated the previous debug setup with the second, consecutive
WHIZARD run and, lo and behold, we call the TAO RNG from the process setup of the multi-channel integration exactly only twice, namely during event generation.
So, we found the culprit's behavior:
WHIZARD imports the process setup from the multi-channel integration into the event generation, including the random number state as is - this conclusion required another debug run, but I skip its explanation as it does not illuminate the issue further.
Therefore, depending on whether we run
WHIZARD with an actual integration or not - we load the integration result and grid of a previous invocation - we find a different random number sequence for the event generation.
For an end user (and, also me), this behavior is strange as it is neither documented nor intuitive.
Last, but not least,
GDB is the natural choice.
In principle, I could have automated a lot of the debug process, but
GDB has some annoying bugs with regard to modern Fortran and the correct determination of object types.
Thus, I had to do a lot of the debug process manually. Sigh.