Occasional 'Heartbeat to scheduler failed' error near end of execution
I occasionally get an error when running climix locally (I have not tried to reproduce it with MPI/srun on multiple nodes). It seems to happen after the output file has been written. The output ends with:
distributed.worker - WARNING - Heartbeat to scheduler failed
...
tornado.iostream.StreamClosedError: Stream is closed
Full log (info):
$ climix -v -e -x dtr -o /nobackup/rossby27/users/sm_joalo/data/test_climix/mpi_221019/output/\{var_name\}_\{frequency\}.nc /nobackup/rossby27/users/sm_joalo/data/test_climix/mpi_221019/tasmin_EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_SMHI-RCA4_v1a_day_20010101-20051231.nc /nobackup/rossby27/users/sm_joalo/data/test_climix/mpi_221019/tasmax_EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_SMHI-RCA4_v1a_day_20010101-20051231.nc
2470ms:main.py:main() INFO:root:Loading metadata
4771ms:main.py:main() INFO:root:Scheduler ready; starting main program.
4771ms:main.py:do_main() INFO:root:Preparing indices
4814ms:metadata.py:build_index_function() INFO:root:Found implementation for index_function <diurnal_temperature_range> from distribution <<importlib.metadata.PathDistribution object at 0x2aecf40cdf90>>
4823ms:main.py:do_main() INFO:root:Starting calculations for index <climix.index.Index object at 0x2aeceb0a5ab0>
4823ms:main.py:do_main() INFO:root:Building output filename
4823ms:main.py:do_main() INFO:root:Preparing input data
/home/sm_joalo/dev/repos/climix/climix/datahandling.py:59: FutureWarning: Ignoring a datum in netCDF load for consistency with existing behaviour. In a future version of Iris, this datum will be applied. To apply the datum when loading, use the iris.FUTURE.datum_support flag.
datacubes = iris.load_raw(datafiles, callback=ignore_cb)
/home/sm_joalo/dev/repos/climix/climix/datahandling.py:59: FutureWarning: Ignoring a datum in netCDF load for consistency with existing behaviour. In a future version of Iris, this datum will be applied. To apply the datum when loading, use the iris.FUTURE.datum_support flag.
datacubes = iris.load_raw(datafiles, callback=ignore_cb)
17320ms:main.py:do_main() INFO:root:Calculating index
17320ms:index.py:__call__() INFO:root:Starting preprocess
17321ms:index.py:__call__() INFO:root:Finished preprocess
17321ms:index.py:__call__() INFO:root:Data found for input low_data
17321ms:index.py:__call__() INFO:root:Data found for input high_data
17321ms:index.py:__call__() INFO:root:Adding coord categorisation.
18909ms:index.py:__call__() INFO:root:Preparing cubes
18909ms:index.py:__call__() INFO:root:Setting up aggregation
19178ms:aggregators.py:compute_pre_result() INFO:root:Setting up pre-result in aggregate mode
19198ms:aggregators.py:compute_pre_result() INFO:root:Setup completed in 0
19198ms:main.py:do_main() INFO:root:Saving result
19199ms:datahandling.py:save() INFO:climix.datahandling:Storing non-iteratively
19199ms:datahandling.py:save() INFO:climix.datahandling:Computing result
[########################################] | 100% Completed | 4.3s
23548ms:datahandling.py:save() INFO:climix.datahandling:Storing result
23624ms:datahandling.py:save() INFO:climix.datahandling:Calculation complete
23627ms:main.py:main() INFO:root:Calculation took 18.8558 seconds.
2022-10-24 08:59:20,501 - distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/worker.py", line 1158, in heartbeat
    response = await retry_operation(
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
    return await retry(
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
    return await coro()
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/core.py", line 1154, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/core.py", line 919, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/home/sm_joalo/.conda/envs/omni/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:45440 remote=tcp://127.0.0.1:45998>: Stream is closed
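
Since the warning seems to show up only after the output file has been written, one possible stopgap might be to filter that single message out of the `distributed.worker` logger (the logger name is taken from the log line above). This is only a rough sketch and not a fix: the class and function names are made up for illustration, it hides the symptom rather than addressing whatever closes the stream, and it would only have an effect if the worker runs in the same process as the code that installs the filter. Workers started as separate processes would need the filter installed on their side.

```python
import logging


class HeartbeatShutdownFilter(logging.Filter):
    """Hypothetical helper: drop the heartbeat warning, keep everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; returning True lets it through.
        return "Heartbeat to scheduler failed" not in record.getMessage()


def silence_heartbeat_warning() -> logging.Filter:
    """Attach the filter to the 'distributed.worker' logger in this process."""
    flt = HeartbeatShutdownFilter()
    logging.getLogger("distributed.worker").addFilter(flt)
    # Return the filter so the caller can remove it again with removeFilter().
    return flt
```

Calling `silence_heartbeat_warning()` once before the computation starts would be enough in this sketch; any other warnings from `distributed.worker` still get through.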