TL;DR: Using structural plasticity in MPI-based simulations leads to spontaneous crashes in NESTv3.6 and later. A minimal script reproducing the crash is provided.
This is a followup to my previous mailing list post (https://www.nest-simulator.org/mailinglist/hyperkitty/list/users@nest-simula...) where I encountered segmentation faults while executing structural-plasticity-based simulations in NESTv3.8.
Previously, I suspected that the crashes occurred due to presumed buggy installations of NEST on HPCs. The helpful comments on that post did not solve the issue, so I performed some tests and gathered feedback from other users testing structural plasticity (SP) in NESTv3.6+.
My conclusion is that MPI-based simulations spontaneously crash when using structural plasticity, and that the probability of a crash increases with the number of MPI processes. Below, I provide more details and a link to a minimal script that reproduces the segmentation fault.
Background: I use SP to perform large-scale network simulations on HPC. Hitherto, I've been using NESTv2.20.1 for various reasons. When I found that SP functionality in NESTv3.6 was finally equivalent to that of NESTv2.20.1, I decided to port my code to the latest release (v3.8). I installed v3.8 on my local machine with MPI support, and tested a scaled-down experiment as a sanity check---everything worked as expected.
However, when I ran the full-scale network on HPCs (JUSUF at Jülich and NEMO at uni-freiburg), I got segmentation faults. HPC support couldn't help in resolving the issues. Reinstalling didn't help. At this point, I reached out with the aforementioned post on the mailing list ("MPI-based error on v3.8").
Current Status: I created a minimal PyNEST script that simulates an E-I network in which the E-->E connections are organised by structural plasticity. I believe this minimal example includes the necessary steps involved in any SP-based experiment. The script can be run with the following Bash call (use `srun` instead of `mpirun` where required): `mpirun python minimal.py $RANDOM_SEED $TOTAL_NUM_VIRTUAL_PROCS`
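For readers who want to adapt the call above to their own script, here is a minimal sketch of how the two positional arguments might be read. The function name `kernel_params_from_argv` is hypothetical, and this is not necessarily how `minimal.py` does it. The dictionary keys are the NEST 3.x kernel parameter names; NEST 2.20.x seeds its random number generators through different parameters.

```python
def kernel_params_from_argv(argv):
    """Turn [script, seed, total_num_virtual_procs] into a kernel-parameter dict.

    Hypothetical helper for illustration; not taken from minimal.py.
    """
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} RANDOM_SEED TOTAL_NUM_VIRTUAL_PROCS")
    seed, n_vp = int(argv[1]), int(argv[2])
    if seed < 1 or n_vp < 1:
        raise SystemExit("both arguments must be positive integers")
    # Assumption: NEST 3.x kernel parameter names.
    return {"rng_seed": seed, "total_num_virtual_procs": n_vp}
```

In a NEST 3.x script, the result of `kernel_params_from_argv(sys.argv)` could then be handed to `nest.SetKernelStatus(...)` right after `nest.ResetKernel()`.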
The following table gives a summary of the segmentation-fault crashes for `minimal.py`:

NEST version   Machine (MPI process count)      Crash
v2.20.1        local (2)                        No
v2.20.2        local (2)                        No
v2.20.1        HPC-2 (up to 1024)               No
v2.20.2        HPC-1 (up to 1024)               No
v3.8           local (2)                        Yes (rarely)
v3.8           HPC-1 (8)                        No
v3.8           HPC-1 (16, 24, 32, 64, 128)      Yes
v3.8           HPC-2 (8)                        No
v3.8           HPC-2 (16, 24, 32, 64, 128)      Yes
v3.6           HPC-1 (32, 64, 128, 256)         Yes
- Machines:
  - local: Linux 5.4.0-204-generic x86_64; Intel Core i5-5300U; 16GB
  - HPC-2: NEMO (https://www.nemo.uni-freiburg.de/)
  - HPC-1: JUSUF (https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/jusuf)
- NESTv2.20.1, v2.20.2, and v3.8 were installed manually in all cases. v3.6 was available as a preinstalled module on HPC-1.
- The crashes do not occur if MPI is not used for SP simulations.
- The crashes do not occur in MPI-based simulations that do not involve SP.
Conclusion:
- MPI-based crashes occur in SP simulations.
- Perhaps the way MPI processes handle data in NESTv3.6+ leads to spontaneous segmentation faults.
- The MPI process count appears to correlate with segmentation-fault frequency. Faults may also occur, although rarely, with a small number of MPI processes.
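To put numbers on how crash frequency depends on the MPI process count, anyone trying to reproduce this could use a small driver along the following lines. It is only a sketch: the helper names (`classify`, `tally`, `mpi_command`) are hypothetical, the `mpirun -np` launcher syntax is an assumption, and the exit code a launcher reports after a segmentation fault can differ between MPI implementations and schedulers.

```python
import signal
import subprocess
import sys
from collections import Counter


def classify(returncode):
    """Map a process return code to 'ok', 'segfault', or 'other'."""
    if returncode == 0:
        return "ok"
    # POSIX: a child killed by a signal reports -signum to subprocess;
    # a shell or launcher in between usually reports 128 + signum.
    if returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        return "segfault"
    return "other"


def tally(make_command, proc_counts, seeds):
    """Run make_command(n_procs, seed) for each combination; count outcomes."""
    results = {}
    for n_procs in proc_counts:
        counts = Counter()
        for seed in seeds:
            completed = subprocess.run(
                make_command(n_procs, seed),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            counts[classify(completed.returncode)] += 1
        results[n_procs] = counts
    return results


def mpi_command(n_procs, seed):
    """Build the launch command for one run of minimal.py.

    Assumptions: an mpirun that accepts -np, and one virtual process
    per MPI process.
    """
    return ["mpirun", "-np", str(n_procs), sys.executable,
            "minimal.py", str(seed), str(n_procs)]
```

Calling `tally(mpi_command, [2, 8, 16], range(1, 11))` would run ten seeds per process count and return one `Counter` per count, from which a crash rate can be read off directly. On a cluster, `mpi_command` would need to be adapted to the local launcher (e.g. `srun`).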
Expectations from this post: A large number of MPI processes provides a very significant speed-up for SP-based simulations, so it is important that MPI is fully functional in NEST.
- Kindly let me know if you can reproduce the same crash. If what I observe is indeed true, then this merits creating an issue on GitHub.
- If you cannot reproduce the crashes, I would greatly appreciate any help in fixing the issue.
Code:
- You will find a formatted version of this post with accompanying code and crash dumps on this GitLab page: https://gitlab.rz.uni-freiburg.de/as2013/minimal_share.git
- minimal.py: https://gitlab.rz.uni-freiburg.de/as2013/minimal_share/-/blob/119e52d3552a7e...
Best, Ady
PS. Is it possible to add some sort of text formatting (HTML/markdown) to these posts?