Stream: dask

Topic: groupby operations rechunk data


Kevin Paul (Mar 02 2021 at 16:18):

So, the example we benchmarked that proved to scale poorly involved computing monthly means. We applied a groupby over month, but the dataset was chunked over lat and lon. After the groupby operation, the intermediate state was chunked over lat, lon, and month, which increased the number of chunks by a factor of 12. The scheduler got bogged down by the number of chunks, and the anomaly calculation stopped scaling.

Without knowing that Xarray re-chunks over groups, this was surprising to us, because if you think about it, the groupby+mean by month operation should be trivially parallelizable over lat and lon. So you might imagine doing the groupby-month operation "in place" (i.e., inside a single dask chunk), but that requires a hand implementation of the groupby+mean operation applied to every dask chunk independently...which is not what Xarray does.
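
For concreteness, here is a minimal sketch of the pattern described above. All dimension sizes, coordinate values, and chunk sizes are illustrative, not those of the actual benchmark:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Daily data for one year, chunked over lat/lon only (time is a single chunk),
# mirroring the layout described above; sizes here are made up.
time = pd.date_range("2000-01-01", periods=365, freq="D")
da = xr.DataArray(
    np.random.rand(365, 180, 360),
    dims=("time", "lat", "lon"),
    coords={
        "time": time,
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(0.5, 359.5, 360),
    },
).chunk({"time": -1, "lat": 45, "lon": 90})

# Grouping over a dimension that is not chunked: each of the 12 month groups
# indexes into every lat/lon chunk, so the intermediate graph holds roughly
# 12x as many chunks as the input.
monthly_clim = da.groupby("time.month").mean("time")
print(monthly_clim.chunks)
```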

Kevin Paul (Mar 02 2021 at 16:22):

The takeaway from this was that you should chunk over the same dimension you apply the groupby operation over.
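
A hedged sketch of that takeaway, reusing the `da` from the sketch above: chunk along the grouped dimension (time) and leave lat and lon whole, so the per-month selections line up with chunk boundaries instead of splitting every spatial chunk twelve ways. The chunk sizes are illustrative, not tuned values:

```python
# Rechunk so the groupby dimension (time) carries the chunking; lat/lon are
# left as single chunks. The 31-day chunk size is only illustrative.
da_time_chunked = da.chunk({"time": 31, "lat": -1, "lon": -1})
monthly_clim = da_time_chunked.groupby("time.month").mean("time")
```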

Deepak Cherian (Mar 02 2021 at 17:31):

This looks like a missed opportunity for optimization. See https://github.com/pydata/xarray/issues/2237

Kevin Paul (Mar 02 2021 at 17:38):

I wouldn't say missed. Just hasn't begun. I don't have time and nobody else seems to complain.

Kevin Paul (Mar 02 2021 at 17:43):

And, actually, I believe my collaborator pointed me to that Xarray issue during our benchmarking effort and we just decided to make note of it and avoid mixing groupby and chunk dimensions. So, you could say that we learned to optimize our workflows from that issue.

Kevin Paul (Mar 02 2021 at 17:48):

I have not heard of other people having this problem, and since we discovered it during a benchmarking exercise, our conclusion was that our benchmarking example was not representative of the problems that other people experience...and we felt the issue was well documented in the issue you already referenced. So, if it's not actively being worked on...is it really a big issue?

Kevin Paul (Mar 02 2021 at 17:49):

...However, this is probably something that should be documented as a "best practice."

Deepak Cherian (Mar 02 2021 at 17:54):

I meant a missed opportunity for dask to optimize that graph. Yes, it is an issue! This kind of inefficiency builds up and breaks at scale.

Ultimately, solving this kind of thing is bottlenecked on the single-digit number of people who can "speak science", "speak xarray", and "speak dask", and then bottlenecked on getting a dask maintainer to think about and fix it. The second thing is solvable by contracting. The first thing... I don't know.

Kevin Paul (Mar 02 2021 at 18:05):

Yeah. Ok. I can't follow things closely enough to know if other people are experiencing these problems. (Was @Isla Simpson's problem related to this?) So, if it's a routinely encountered problem, then we should figure out how to fix it.

However, not being a core developer, it's not clear to me if it is an Xarray decision or a Dask decision to make? Is there a heuristic that can be developed that will tell you when to do the reduction "in-place" (i.e., on the existing chunks) or "out-of-place" (i.e., by creating new chunks)? If a heuristic like that can be devised, then it seems to me that Xarray could make that decision for Dask...?
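
One way to picture such a heuristic, purely hypothetically (neither the rule nor the helper below exists in xarray or dask): if the grouped dimension is whole within each chunk, apply the groupby+mean per chunk with xr.map_blocks so no new chunks are created; otherwise fall back to the default path. This reuses the `da` from the earlier sketch and assumes it is dask-backed:

```python
import dask.array
import numpy as np
import xarray as xr

def grouped_monthly_mean(da):
    """Hypothetical heuristic: reduce 'in place' when time is unchunked,
    otherwise fall back to xarray's default ('out-of-place') groupby."""
    chunks = dict(zip(da.dims, da.chunks))
    if len(chunks["time"]) == 1:
        # time is whole within every block, so each lat/lon chunk can compute
        # its own monthly means independently; no new chunks are created.
        def per_block(block):
            return block.groupby("time.month").mean("time")

        # map_blocks needs a template because the output swaps time for month.
        template = xr.DataArray(
            dask.array.zeros(
                (12, da.sizes["lat"], da.sizes["lon"]),
                chunks=((12,), chunks["lat"], chunks["lon"]),
            ),
            dims=("month", "lat", "lon"),
            coords={"month": np.arange(1, 13), "lat": da["lat"], "lon": da["lon"]},
        )
        return xr.map_blocks(per_block, da, template=template)
    # time is split across chunks: use the default path, which builds new
    # chunks per group.
    return da.groupby("time.month").mean("time")
```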

...Maybe we should move this discussion to GitHub. Is this specific enough to warrant a new issue?

Deepak Cherian (Mar 02 2021 at 18:13):

I commented here: https://github.com/pydata/xarray/issues/2237#issuecomment-789078512 Maybe that will kickstart more conversation.

However, not being a core developer, it's not clear to me if it is an Xarray decision or a Dask decision to make?

xarray always avoids making chunking decisions. Those decisions are always dask's. The challenge is to determine what xarray is doing under the hood with standard dask operations (in this case, indexing followed by mean followed by concat) and to figure out why those result in inefficient computations (the first bottleneck).
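
Written out by hand, that default path looks roughly like the following sketch (for illustration only, reusing the `da` from the earlier sketch; the real logic lives inside xarray's groupby machinery): index out each group, reduce it, then concatenate the results.

```python
import pandas as pd
import xarray as xr

# Roughly the shape of the default path: select each month's timesteps out of
# every chunk, reduce over time, then concatenate along a new "month" dimension.
months = range(1, 13)
by_hand = xr.concat(
    [da.sel(time=da["time.month"] == m).mean("time") for m in months],
    dim=pd.Index(months, name="month"),
)
```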

