diff --git a/source/running/main.rst b/source/running/main.rst
index 4a955ab32341ff5b5f7a45c4570d7d4b9aba5477..744cc39f904f946da78d0d33aa3421dca58d1318 100644
--- a/source/running/main.rst
+++ b/source/running/main.rst
@@ -169,10 +169,10 @@ Running batch jobs from ScriptEngine
 ------------------------------------
 
 ScriptEngine can send jobs to the SLURM batch system when the
-``scriptengine-tasks-hpc package`` is installed, which is done automatically if
+``scriptengine-tasks-hpc`` package is installed, which is done automatically if
 the ``environment.yml`` file has been used to create the Python virtual
-environment, as described in :ref:`creating_virtual_environment`.
-Here is an example of the ``hpc.slurm.sbatch`` task in ``example.yml``:
+environment, as described in :ref:`creating_virtual_environment`. Here is an
+example of using the ``hpc.slurm.sbatch`` task:
 
 .. code-block:: yaml+jinja
 
@@ -206,6 +206,11 @@ again, but do nothing because it already runs in a batch job. Then, the next
 task (``base.echo``) would be executed, writing the message to standard output
 in the batch job.
 
+Note that in the default runscript examples, submitting the job to SLURM is
+done behind the scenes in ``scriptlib/submit.yml``. The actual configuration of
+the batch job, such as account, allocated resources, etc., depends on the
+chosen launch option, as described below.
+
 Launch options
 --------------
 
@@ -216,6 +221,7 @@ model run once the jobs is executed by the batch system:
 * SLURM heterogeneous jobs (``slurm-hetjob``)
 * SLURM multiple program configuration and ``taskset`` process/thread pinning
   (``slurm-mp-taskset``)
+* SLURM wrapper with taskset and node groups (``slurm-wrapper-taskset``)
 * SLURM job with generic shell script template (``slurm-shell``)
 
 Each option has advantages and disadvantages and they come also with different
@@ -303,6 +309,169 @@
 in many cases, because the remaining nodes are used exclusively for one
 component each.
 
+
+SLURM wrapper and taskset
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This launch option uses the SLURM ``srun`` command together with
+
+* a HOSTFILE created on-the-fly
+* a wrapper script created on-the-fly, which uses
+* the ``taskset`` command to set the CPU affinity for MPI processes, OpenMP
+  threads and hyperthreads
+
+The ``slurm-wrapper-taskset`` option is configured per node. Instead of
+choosing the total number of tasks or nodes dedicated to each component, you
+specify the number of MPI processes for each component that will execute on
+each compute node. To avoid repeating the same node configuration over and
+over again, the configuration is structured in groups, each representing a set
+of nodes with the same configuration.
+
+The following simple example assumes a computer platform that has 128 cores per
+compute node, such as, for example, the ECMWF HPC2020 system. Three nodes are
+allocated to run a model configuration with four components: XIOS (1 process),
+OpenIFS (250 processes), NEMO (132 processes) and the Runoff-mapper
+(1 process):
+
+.. code-block:: yaml
+
+   platform:
+     cpus_per_node: 128
+   job:
+     launch:
+       method: slurm-wrapper-taskset
+     groups:
+     - {nodes: 1, xios: 1, oifs: 126, rnfm: 1}
+     - {nodes: 2, oifs: 62, nemo: 66}
+
+Two groups are defined in this example: the first comprising **one** node
+(running XIOS, OpenIFS and the Runoff-mapper), and the second group with
+**two** nodes running OpenIFS and NEMO.
+
+.. note:: The ``platform.cpus_per_node`` parameter and the ``job.*`` parameters
+   do not have to be defined in the same file, as suggested in the simple
+   example. In fact, the ``platform.*`` parameters are usually defined in the
+   platform configuration file, while ``job.*`` is usually found in the
+   experiment configuration.
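+
+As an illustration of this split, the simple example above might be spread over
+the two configuration files as follows (a sketch; the actual file names and
+further content depend on your setup):
+
+.. code-block:: yaml
+
+   # in the platform configuration file:
+   platform:
+     cpus_per_node: 128
+
+   # in the experiment configuration file:
+   job:
+     launch:
+       method: slurm-wrapper-taskset
+     groups:
+     - {nodes: 1, xios: 1, oifs: 126, rnfm: 1}
+     - {nodes: 2, oifs: 62, nemo: 66}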
+
+A second example illustrates the use of hybrid parallelization (MPI+OpenMP) for
+OpenIFS. The number of MPI tasks per node reflects that each process will be
+using more than one core:
+
+.. code-block:: yaml
+
+   platform:
+     cpus_per_node: 128
+   job:
+     launch:
+       method: slurm-wrapper-taskset
+     oifs:
+       omp_num_threads: 2
+       omp_stacksize: "64M"
+     groups:
+     - {nodes: 1, xios: 1, oifs: 63, rnfm: 1}
+     - {nodes: 2, oifs: 64}
+     - {nodes: 2, oifs: 31, nemo: 66}
+
+Note the configuration of ``job.oifs.omp_num_threads`` and
+``job.oifs.omp_stacksize``, which set the OpenMP environment for OpenIFS. The
+example utilises the same number of MPI ranks for XIOS, NEMO and the
+Runoff-mapper as before, and 253 MPI ranks for OpenIFS. However, each OpenIFS
+MPI rank now has two OpenMP threads, which results in 506 cores being used for
+the atmosphere.
+
+.. caution:: The ``omp_stacksize`` parameter is needed on some platforms in
+   order to avoid errors when there is too little stack memory for OpenMP
+   threads (see the `OpenMP documentation
+   <https://www.openmp.org/spec-html/5.0/openmpse54.html>`_). However, the
+   example (and in particular the value of 64MB) should not be seen as a
+   general recommendation for all platforms.
+
+Overall, the ``slurm-wrapper-taskset`` launch method allows compute nodes to be
+shared flexibly and in a controlled way between |ece4| components, which is
+useful to avoid idle cores. It can also help to decrease the computational
+costs of configurations involving components with high memory requirements, by
+allowing them to share nodes with components that need less memory.
+
+Optional configuration
+......................
+
+Some special configuration parameters may be required for the
+``slurm-wrapper-taskset`` launcher on some machines.
+
+.. hint:: Do not use these special parameters unless you need to!
+
+The first special parameter is ``platform.mpi_rank_env_var``:
+
+.. code-block:: yaml
+
+   platform:
+     mpi_rank_env_var: SLURM_PROCID
+
+This is the name of an environment variable that must contain the MPI rank of
+each task at runtime. The default value is ``SLURM_PROCID``, which should work
+for SLURM when using the ``srun`` command. Other possible choices that work on
+some platforms are ``PMI_RANK`` or ``PMIX_RANK``.
+
+Another special parameter is ``platform.shell``:
+
+.. code-block:: yaml
+
+   platform:
+     shell: "/usr/bin/env bash"
+
+This setting determines the shell that runs the generated wrapper script. It
+must be configured if the default value is not valid for your platform.
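+
+For example, a platform that starts MPI tasks via PMIx and needs an explicit
+shell might override both defaults in its platform configuration file (a
+sketch; the values shown are illustrative):
+
+.. code-block:: yaml
+
+   platform:
+     # illustrative: rank variable provided by the MPI launcher
+     mpi_rank_env_var: PMIX_RANK
+     # illustrative: shell used to run the generated wrapper script
+     shell: "/bin/bash"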
+
+
+Implementation of Hyper-threading
+.................................
+
+The implementation of Hyper-threading in this launch method is restricted to
+OpenMP programs (only available for OpenIFS for now). It assumes that CPUs
+``i`` and ``i + platform.cpus_per_node`` correspond to the same physical core.
+By enabling the ``job.oifs.use_hyperthreads`` option, both CPUs ``i`` and
+``i + platform.cpus_per_node`` are bound for the execution of that component.
+In this case, the number of OpenMP threads executing that component is twice
+the value given in ``job.oifs.omp_num_threads``. The following example would
+configure OpenIFS to execute using 4 threads in the [0..127] range:
+
+.. code-block:: yaml
+
+   platform:
+     cpus_per_node: 128
+   job:
+     oifs:
+       omp_num_threads: 4
+       omp_stacksize: "64M"
+       use_hyperthreads: false
+
+while the following example would result in 8 OpenIFS threads, with 4 of them
+in the [0..127] range, and the others in [128..255]:
+
+.. code-block:: yaml
+
+   platform:
+     cpus_per_node: 128
+   job:
+     oifs:
+       omp_num_threads: 4
+       omp_stacksize: "64M"
+       use_hyperthreads: true
+
+There is also the possibility of using all 256 logical CPUs in the node to run
+more MPI tasks, as in the following example. In this case, the
+``use_hyperthreads`` option must be disabled for every component (it is
+disabled by default):
+
+.. code-block:: yaml
+
+   platform:
+     cpus_per_node: 256
+   job:
+     oifs:
+       use_hyperthreads: false
+
 
 SLURM shell template
 ~~~~~~~~~~~~~~~~~~~~
@@ -339,12 +508,18 @@ shared between OpenIFS and the Runoff-mapper.
        ntasks: 127
        ntasks_per_node: 127
        omp_num_threads: 1
+       omp_stacksize: "64M"
+
      nemo:
        ntasks: 127
        ntasks_per_node: 127
      xios:
        ntasks: 1
        ntasks_per_node: 1
+     slurm:
+       sbatch:
+         opts:
+           hint: nomultithread
 
    # remaining configuration same as for slurm-hetjob
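+
+   # Note (a sketch, not part of the template): the slurm.sbatch.opts entries
+   # are assumed to be passed through to sbatch, i.e. hint: nomultithread
+   # becomes --hint=nomultithread, which keeps SLURM from placing tasks on
+   # hyperthreads.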