Decouple output block size and chunking

In iterative storage mode, the blocks that are calculated and stored are currently determined by the chunking of the result array. This guarantees that every chunk is needed for exactly one output block, but depending on the overall shape of the data it can make the output blocks too fine-grained, which results in poor performance. We should decouple the two: that would let us choose the best chunking for the calculations and the best output block size independently of each other.
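To illustrate the idea, here is a minimal sketch in plain Python of how output blocks could be defined independently of the compute chunks. The helper names `block_slices` and `chunks_for_block`, as well as the example shapes, are hypothetical and not taken from the existing code base; the sketch only shows the index bookkeeping, not the actual computation or storage.

```python
from itertools import product


def block_slices(shape, blockshape):
    """Return one tuple of slices per output block covering `shape`.

    Hypothetical helper: blocks at the upper edge are truncated to the array.
    """
    ranges = [
        [slice(start, min(start + size, extent))
         for start in range(0, extent, size)]
        for extent, size in zip(shape, blockshape)
    ]
    return list(product(*ranges))


def chunks_for_block(block, chunkshape):
    """Return the grid indices of all compute chunks overlapping `block`.

    Hypothetical helper: assumes a regular chunk grid of size `chunkshape`.
    """
    index_ranges = [
        range(s.start // c, (s.stop - 1) // c + 1)
        for s, c in zip(block, chunkshape)
    ]
    return list(product(*index_ranges))


shape = (100, 100)
chunkshape = (10, 10)    # fine chunking, chosen for the calculation
blockshape = (50, 100)   # coarse output blocks, chosen for storage

blocks = block_slices(shape, blockshape)
chunks_per_block = [len(chunks_for_block(b, chunkshape)) for b in blocks]
print(len(blocks), chunks_per_block)
```

In this sketch the output block shape is a whole multiple of the chunk shape along every axis, so each chunk still belongs to exactly one output block and the current guarantee is preserved. If the block boundaries were not aligned with the chunk boundaries, a chunk could overlap several blocks and would have to be either recomputed or cached.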