Here we collect some tips on how to run Dask with Distributed on the NSC supercomputers Bi and Tetralith. Both machines use SLURM, so if that is your workload manager, you may still glean something useful. Be advised, though, that certain things, such as paths to scratch and other storage, or the way Conda is set up, are likely to differ.
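For starting Distributed workers through SLURM, the dask-jobqueue package is a convenient entry point. Below is a minimal sketch; the queue name, account string, and resource figures are placeholders to adapt to your allocation, not Bi or Tetralith defaults.

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="main",              # hypothetical partition name
    account="snic-xxxx-y-zz",  # hypothetical project allocation
    cores=32,                  # cores per SLURM job
    memory="90GB",             # memory per SLURM job
    walltime="01:00:00",
)
cluster.scale(jobs=4)          # submit four worker jobs
client = Client(cluster)       # connect a client to the scheduler
```

Each call to `cluster.scale` submits (or cancels) SLURM jobs, so workers appear as ordinary batch jobs in the queue.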
Both machines are typical HPC clusters, which means we mainly have to manage two resources: memory and storage. Raw CPU hours, even though they are what is actually billed, are usually of less concern: for large problems that need a lot of CPU, the memory demands will often dictate the number of nodes we require, and that node count implies ample FLOPS.
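To make that concrete, here is a back-of-the-envelope sizing sketch; the working-set size and per-node RAM are made-up illustrative figures, not the specs of Bi or Tetralith.

```python
import math

working_set_gb = 2_000  # total data that must fit in distributed memory
node_ram_gb = 96        # RAM per compute node
headroom = 0.6          # leave ~40% slack so workers stay below spill thresholds

nodes_needed = math.ceil(working_set_gb / (node_ram_gb * headroom))
print(nodes_needed)     # -> 35 nodes, likely far more than the FLOPS alone require
```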
Managing memory is critical because memory-starved workers (in the Distributed sense) will start spilling data to disk, which degrades performance a great deal, and are ultimately prone to being killed.
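Distributed's spill, pause, and terminate behaviour is governed by per-worker memory thresholds that can be tuned through the Dask configuration. The sketch below tightens them so workers spill early rather than die; the fractions are illustrative, not recommended NSC settings.

```python
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.55,     # start spilling to disk
    "distributed.worker.memory.spill": 0.65,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # last resort: restart the worker
})
```

These must be set before the workers start (or placed in a `~/.config/dask/` YAML file) so that the worker processes pick them up.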
Using the right storage is critical because the available kinds of storage differ in speed, size, and reliability, and using the wrong one for the job can not only hurt your own performance but also affect other jobs and users of the cluster by degrading the performance of entire filesystems.
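One concrete consequence: spilled data should go to node-local scratch rather than a shared filesystem, which dask-jobqueue supports via the `local_directory` argument. The path below is a placeholder; check your site's documentation for the actual node-local storage location.

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=32,
    memory="90GB",
    local_directory="/scratch/local",  # hypothetical node-local path; verify for your site
)
```

With this in place, worker spill and temporary files hit fast local disk instead of hammering the shared filesystem that everyone else is using.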