Commit 0cef7bd6
authored Jul 15, 2021 by Klaus Zimmermann

Improve slurm integration (closes #236)

parent 5aa5301e
2 changed files
climix/dask_setup.py
@@ -152,7 +152,7 @@ SCHEDULERS = OrderedDict([
 def setup_scheduler(args):
-    scheduler_spec = args.dask_scheduler.split(':')
+    scheduler_spec = args.dask_scheduler.split('@')
     scheduler_name = scheduler_spec[0]
     scheduler_kwargs = {k: v for k, v in (e.split('=') for e in scheduler_spec[1:])}
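For reference, a minimal sketch (not part of this commit) of how the new '@'-separated scheduler spec is parsed by the expressions above; the spec string is illustrative only, modelled on the climix invocation in the jobscript below:

    spec = "external@scheduler_file=/tmp/scheduler-1234.json"
    scheduler_spec = spec.split('@')
    scheduler_name = scheduler_spec[0]
    scheduler_kwargs = {k: v for k, v in (e.split('=') for e in scheduler_spec[1:])}
    # scheduler_name   -> 'external'
    # scheduler_kwargs -> {'scheduler_file': '/tmp/scheduler-1234.json'}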
jobscripts/jobscript-hetjob.sh
new file mode 100644
#!/bin/bash
#
#SBATCH -J climix-test
#SBATCH -t 10:00:00
#SBATCH -N 1 --exclusive
#SBATCH hetjob
#SBATCH -N 16 --exclusive --cpus-per-task=4
# General approach
# ----------------
# We use slurm's heterogeneous job support for midas, using two components. The
# first component contains dask scheduler and client, the second component runs
# the workers. Since neither scheduler nor client are naturally parallel, we
# run both together on a single node. The workers, however, bring the scaling
# parallelism and can be run on an arbitrary number of nodes, depending on the
# size of the data and time and memory constraints. Note that we often want to
# use several nodes purely to gain access to sufficient memory.
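# (In the header above, the "#SBATCH -N 1 --exclusive" block before "#SBATCH
# hetjob" is the first component, holding scheduler and client, and the block
# after it is the second component, holding the workers; the srun commands
# below select them with --het-group=0 and --het-group=1.)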
#
# Bi-specific notes
# -----------------
# Cores
# ~~~~~
# As of this writing, bi nodes are set up with hyperthreading active, with every
# node having 32 virtual cores provided by 16 physical cores. We want the
# workers to use 16 threads total per node, thus using all physical cores but
# avoiding conflicts due to hyperthreading. We achieve this by running 8 worker
# processes with 2 threads each on every node. This is implemented via slurm's
# `--cpus-per-task=4` option, which instructs slurm to start one task for every
# 4 (virtual) cpus. That means that the number of nodes can be freely chosen
# using the `-N` option in the header at the top of this file.
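# (Worked example of the numbers above: 32 virtual cpus per node / 4 cpus per
# task = 8 worker processes per node, and 8 processes * 2 threads = 16 threads,
# one per physical core.)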
#
# Memory
# ~~~~~~
# Every normal (`thin`) node has 64GB of memory and there is a small number of
# `fat` nodes with 256GB of memory.
# We use a single fat node for the first component of the heterogeneous job,
# giving scheduler and client a bit of headroom for transfer and handling of
# larger chunks of memory.
# The workers are run on normal nodes. To allow for a little bit of breathing
# room for the system and other programs, we use 90% of the available memory,
# equally distributed among the worker processes (or equivalently slurm tasks)
# on each node for a total of 7.2GB per worker.
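# (Worked example of the numbers above: 0.9 * 64GB per thin node / 8 workers
# per node = 7.2GB, i.e. the MEM_PER_WORKER=7200 (in MB) set below.)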
NO_SCHEDULERS=1
NO_PROGRAM=1
# NO_WORKERS=$((($SLURM_NTASKS - $NO_SCHEDULERS - $NO_PROGRAM) / 2))
NO_WORKERS=$((($SLURM_NTASKS - $NO_SCHEDULERS - $NO_PROGRAM) / 2))
# MEM_PER_WORKER=$(echo "2 * $SLURM_CPUS_PER_TASK * $SLURM_MEM_PER_CPU * .9" |bc -l)
MEM_PER_WORKER=7200

echo "Number of workers: $NO_WORKERS, memory: $MEM_PER_WORKER"
# >>> conda initialize >>>
__conda_setup="$('/nobackup/rossby20/rossby/software/conda/bi/miniconda3-20201119/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/nobackup/rossby20/rossby/software/conda/bi/miniconda3-20201119/etc/profile.d/conda.sh" ]; then
        . "/nobackup/rossby20/rossby/software/conda/bi/miniconda3-20201119/etc/profile.d/conda.sh"
    else
        export PATH="/nobackup/rossby20/rossby/software/conda/bi/miniconda3-20201119/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate climix-devel-2
COORDINATE_DIR=/nobackup/rossby26/users/sm_klazi/test_climix/20210416
cd $COORDINATE_DIR

SCHEDULER_FILE=$COORDINATE_DIR/scheduler-$SLURM_JOB_ID.json
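# The scheduler file is how the pieces find each other: dask-scheduler writes
# its connection details to this JSON file, and both the dask workers and the
# climix client (via -d external@scheduler_file=...) read it.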
# Start scheduler
srun --het-group=0 --ntasks 1 \
     dask-scheduler \
     --interface ib0 \
     --scheduler-file $SCHEDULER_FILE &

srun --het-group=1 \
     dask-worker \
     --interface ib0 \
     --scheduler-file $SCHEDULER_FILE \
     --memory-limit "${MEM_PER_WORKER}MB" \
     --nthreads 2 &
BASE_DIR=/nobackup/smhid17/users/sm_thobo/projects/klimatfabriken/shype/v1/midasjobs
EXP_DIR=$BASE_DIR/CLMcom-CCLM4-8-17/v1/ICHEC-EC-EARTH/r12i1p1/rcp26/input_data
REFERENCE_DIR=$EXP_DIR/predictand/netcdf
MODEL_DIR=$EXP_DIR/predictor/netcdf
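# Run climix itself in het-group 0, alongside the scheduler. The -d option
# carries the '@'-separated scheduler spec parsed in climix/dask_setup.py (see
# the diff above): scheduler name 'external', with scheduler_file pointing at
# the file written by dask-scheduler.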
srun --het-group=0 --ntasks 1 \
     climix -e -s -d external@scheduler_file=$SCHEDULER_FILE -x tn90p -l debug /home/rossby/imports/cordex/EUR-11/CLMcom-CCLM4-8-17/v1/ICHEC-EC-EARTH/r12i1p1/rcp85/bc/links-hist-scn/day/tasmin_EUR-11_ICHEC-EC-EARTH_rcp85_r12i1p1_CLMcom-CCLM4-8-17_v1_day_*.nc
# wait
# Script ends here