Here's a quick test that suggests it's better to hash the dask variable instead of the function + args + kwargs. If you run this cell multiple times, func is redefined and its token changes. But the dask graph isn't changing, so the token for the output of func(a,b) is invariant (and accessible with the name attribute of dask arrays).
def func(a, b):
    return a * b + 10

import dask
a = dask.array.ones((10, 1))
b = dask.array.ones((10, 1))
print(dask.base.tokenize(func, a, b))
print(dask.base.tokenize(func(a,b)))
I ran the above cell multiple times. Here are the outputs.
First run:
aef9da9c2d23c57b78705c645bd30087
b241f225ffe8f6d84e1b97981dfe96e2
Second run:
457d0d1b307cfcae61947f108d97a8e5
b241f225ffe8f6d84e1b97981dfe96e2
After restarting the kernel:
8934369653f9ff6cfc2fe14f038433fe
b241f225ffe8f6d84e1b97981dfe96e2
dask.base.tokenize(func(a,b)) is redundant actually. func(a,b).name works just as well (add-933890fbb1484cc1eb3fbd568e967cfd). This is what xarray uses to avoid expensive equality checks.
The problem with this approach is that copies have different names: func(a,b).copy().name = copy-1e2375ec589dad4ad50d1abfe5d00bdd
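A minimal sketch of the name-based check, assuming the same func as above and a hypothetical helper build() that recreates the inputs from scratch each time. It only compares graph names and tokens; nothing is computed.

```python
import dask.array as da
from dask.base import tokenize


def func(a, b):
    # dask-aware: returns a lazy dask array, no computation happens here
    return a * b + 10


def build():
    # hypothetical helper: recreate the inputs and the graph from scratch
    a = da.ones((10, 1))
    b = da.ones((10, 1))
    return func(a, b)


x1 = build()
x2 = build()

# Two independently built but identical graphs share the same name,
# and tokenizing a dask array gives the same token for both.
same_name = x1.name == x2.name
same_token = tokenize(x1) == tokenize(x2)

# Changing an input changes the graph, and therefore the name.
other = func(da.ones((10, 1)), da.zeros((10, 1)))
different_name = other.name != x1.name

print(x1.name, same_name, same_token, different_name)
```

The printed hash may differ from the one quoted above depending on the dask version; the point is only that it is stable across rebuilds of the same graph.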
Thanks @Deepak Cherian! Great suggestion.
cc @xdev
@Deepak Cherian, it looks like the code you wrote ends up evaluating func(a,b) to get the token. This seems to defeat the purpose of avoiding the function evaluation via a cache. Maybe I'm missing something that you're implying.
That said, in an attempt to explore this, I tried to run the code you wrote. I get an error message:
AttributeError: module 'dask' has no attribute 'array'
@Keith Lindsay
You need an explicit import:
import dask.array
Oops yeah, I was assuming that it was a dask-aware function that was cheap. If not, you can do dask.base.tokenize((func, a, b)) or something like that, since the output is a unique function of the inputs.
EDIT: this only works if func is not redefined (see my first post in this thread).
dask.base.tokenize(func(a, b)) hashes the output of the function so it's more robust.
Last updated: May 16 2025 at 17:14 UTC