This documentation is not really complete (yet).
Throughout this documentation we assume that you are familiar with the theoretical background behind the scanning transmission electron microscope (STEM) to some degree. Also, we assume that you have some knowledge about the UNIX/Linux command line and parallelized computation. STEMsalabim is currently not intended to be run on a desktop computer. While that is possible and works, the main purpose of the program is to be used in a highly parallelized multi-computer environment.
We took great care to make STEMsalabim easy to install. You can find instructions at Installing STEMsalabim. However, if you run into technical problems, you should seek help from an administrator of your computer cluster first.
Structure of a simulation
The essence of STEMsalabim is to model the interaction of a focused electron beam with a collection of atoms, typically in the form of a crystalline sample. Given the necessary input files, the simulation crunches numbers for some time, after which all of the calculated results can be found in the output file. Please refer to Running STEMsalabim for notes on how to start a simulation.
All information about the specimen is listed in the Crystal file format, which is one of the two required input files for STEMsalabim. It contains each atom’s species (element), coordinates, and mean square displacement as it appears in the Debye-Waller factors.
In addition, you need to supply a parameter file for each simulation, containing information about the microscope, detector, and all required simulation parameters. These parameters are given in a specific syntax, described in Parameter files. A parameter file is always required for starting a STEMsalabim simulation.
The complete output of a STEMsalabim simulation is written to a NetCDF file. NetCDF is a binary, hierarchical file format for scientific data, based on HDF5. NetCDF/HDF5 allows us to compress the output data and store it in a machine-readable, organized format while still only having to deal with a single output file.
You can read more about the output file structure at Output file format.
Hybrid Parallelization model
STEMsalabim simulations are parallelized both via POSIX threads and via the message passing interface (MPI). A typical simulation will use both schemes at the same time: MPI is used for communication between the computing nodes, and threads are used for intra-node parallelization, i.e., the usual multi-CPU/multi-core structure.
A high-performance computing cluster is an array of many (identical) computing nodes. Typical highly parallelized software uses more than one of the nodes for parallel computations. There is usually no memory that is shared between the nodes, so all information required for the management of parallel computing needs to be explicitly communicated between the processes on the different machines. The de facto standard for that is the message passing interface (MPI).
Let us assume a simulation that runs on \(M\) computers and each of them spawns \(N\) threads.
Depending on the simulation parameters chosen, STEMsalabim may need to loop through multiple frozen phonon configurations and values of the probe defocus. The same simulation (with differently displaced atoms and different probe defocus) is therefore typically run multiple times. There are three parallelization schemes implemented in STEMsalabim:
When \(M == 1\), i.e., no MPI parallelization is used, all pixels (probe positions) are distributed among the \(N\) threads and calculated in parallel.
Each MPI processor calculates all pixels (probe positions) of its own frozen phonon / defocus configuration, i.e., \(M\) configurations are calculated in parallel. Each of the \(M\) calculations splits its pixels between \(N\) threads (each thread calculates one pixel at a time).
This scheme makes sense when the total number of configurations (probe.num_defoci \(\times\) frozen_phonon.number_configurations) is much larger than or divisible by \(M\).
A single configuration is calculated at a time, and all the pixels are split between all \(M \times N\) threads. To reduce the required MPI communication, only the main thread of each of the \(M\) MPI processors communicates with the master thread. The master thread sends a work package containing some number of probe pixels to an MPI process, which then carries out all the calculations in parallel on its \(N\) threads. When a work package is finished, the process requests another work package from the master MPI process until there is no work left. Meanwhile, the worker threads of the MPI process with rank 0 also work on emptying the work queue.
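The work-package idea of the last scheme can be illustrated with a small, self-contained Python sketch. This is not STEMsalabim's implementation: it uses plain Python threads to stand in for MPI processes, a shared queue to stand in for the master that hands out work packages, and a hypothetical `simulate_pixel` placeholder instead of a real multi-slice calculation. All function names and the package size are assumptions made for illustration only.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def make_work_packages(num_pixels, package_size):
    """Split pixel indices 0..num_pixels-1 into consecutive work packages."""
    return [
        list(range(start, min(start + package_size, num_pixels)))
        for start in range(0, num_pixels, package_size)
    ]


def simulate_pixel(pixel):
    """Hypothetical placeholder for the calculation of one probe position."""
    return pixel * pixel


def run_process(num_threads, packages, results, lock):
    """One simulated MPI process: fetch packages until the queue is empty."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while True:
            try:
                # Stands in for "request a work package from the master".
                package = packages.get_nowait()
            except queue.Empty:
                return
            # The N threads of this process each handle one pixel at a time.
            values = list(pool.map(simulate_pixel, package))
            # Only the main thread of the process reports results back.
            with lock:
                results.update(zip(package, values))


def run_simulation(num_pixels, num_processes, num_threads, package_size):
    """Distribute all pixels of one configuration over M x N threads."""
    packages = queue.Queue()
    for package in make_work_packages(num_pixels, package_size):
        packages.put(package)

    results = {}
    lock = threading.Lock()
    processes = [
        threading.Thread(
            target=run_process, args=(num_threads, packages, results, lock)
        )
        for _ in range(num_processes)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return results


if __name__ == "__main__":
    out = run_simulation(
        num_pixels=10, num_processes=2, num_threads=3, package_size=4
    )
    print(len(out), "pixels calculated")
```

Because each simulated process only asks for new work when it has finished its current package, faster processes automatically end up calculating more pixels, which is the load-balancing benefit of handing out packages on request rather than assigning a fixed share up front.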
In MPI mode, each MPI process writes results to its own temporary file, and after each frozen lattice configuration the results are merged. Merging is carried out sequentially by each individual MPI processor, to avoid race conditions.
output.tmp_dir (see Parameter files) should be set to a directory that is local
to each MPI processor (e.g.,
Within one MPI processor, the threads can share their memory. The main memory consumption comes from storing the weak phase objects of the slices in the multi-slice simulation, which don’t change during the actual simulation. Sharing them between threads therefore greatly reduces memory usage compared to MPI-only parallelization. You should therefore always aim for hybrid parallelization!
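To get a feeling for the size of this effect, the following Python sketch gives a rough back-of-the-envelope estimate. It assumes that each slice is stored as one complex value per grid point at 16 bytes per value (double-precision complex) and that every MPI process holds its own full copy of all slices. The grid size, slice count, and core count are hypothetical example numbers, and all other memory consumers are ignored.

```python
def slice_memory_bytes(num_slices, grid_x, grid_y, bytes_per_value=16):
    """Memory for one full copy of all slices (assumed: one complex value per grid point)."""
    return num_slices * grid_x * grid_y * bytes_per_value


def memory_per_node_bytes(processes_per_node, num_slices, grid_x, grid_y):
    """Each MPI process on a node is assumed to hold its own copy of the slices."""
    return processes_per_node * slice_memory_bytes(num_slices, grid_x, grid_y)


if __name__ == "__main__":
    cores_per_node = 16  # hypothetical node with 16 cores
    example = dict(num_slices=100, grid_x=1024, grid_y=1024)  # hypothetical sample

    # Hybrid: one MPI process per node, its threads share a single copy.
    hybrid = memory_per_node_bytes(1, **example)
    # MPI only: one single-threaded MPI process per core, one copy each.
    mpi_only = memory_per_node_bytes(cores_per_node, **example)

    gib = 2 ** 30
    print(f"hybrid:   {hybrid / gib:.2f} GiB per node")
    print(f"MPI only: {mpi_only / gib:.2f} GiB per node")
```

Under these assumptions, the per-node memory for the slices grows linearly with the number of MPI processes placed on a node, while adding threads to a single process costs no additional slice memory.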