Stream: python-dev
Topic: xpersist hashing
Deepak Cherian (May 13 2020 at 18:21):
Here's a quick test that suggests it's better to hash the dask variable instead of the function + args + kwargs. If you run this cell multiple times, func is redefined and its token changes. But the dask graph isn't changing, so the token for the output of func(a,b) is invariant (and accessible with the name attribute of dask arrays).
def func(a, b):
    return a * b + 10

import dask
a = dask.array.ones((10, 1))
b = dask.array.ones((10, 1))
print(dask.base.tokenize(func, a, b))
print(dask.base.tokenize(func(a,b)))
I ran the above cell multiple times. Here are the outputs (first token: func + args, second token: output of func(a,b)).
first run:
aef9da9c2d23c57b78705c645bd30087 b241f225ffe8f6d84e1b97981dfe96e2
second run:
457d0d1b307cfcae61947f108d97a8e5 b241f225ffe8f6d84e1b97981dfe96e2
after restarting the kernel:
8934369653f9ff6cfc2fe14f038433fe b241f225ffe8f6d84e1b97981dfe96e2
Deepak Cherian (May 13 2020 at 18:33):
dask.base.tokenize(func(a,b)) is redundant actually. func(a,b).name works just as well (add-933890fbb1484cc1eb3fbd568e967cfd). This is what xarray uses to avoid expensive equality checks.
The problem with this approach is that copies have different names: func(a,b).copy().name = copy-1e2375ec589dad4ad50d1abfe5d00bdd
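A minimal sketch of the name comparison described above, assuming a reasonably recent dask and the same func, a, b as in the first post. It builds the output twice from identical inputs and compares names and tokens without calling compute. The variables x, y, same_name and same_token are only illustrative. Copies are left out on purpose, since the name of a copy may depend on the dask version.

```python
import dask.array as da
import dask.base


def func(a, b):
    return a * b + 10


a = da.ones((10, 1))
b = da.ones((10, 1))

# Both calls are lazy: they build task graphs and compute nothing.
x = func(a, b)
y = func(a, b)

# The name is derived from the operations and inputs in the graph,
# so two independently built but identical graphs share it.
same_name = x.name == y.name
same_token = dask.base.tokenize(x) == dask.base.tokenize(y)

print(x.name)
print(same_name, same_token)
```

Comparing names is a string comparison, so it stays cheap no matter how large the underlying arrays are.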
Matt Long (May 13 2020 at 21:19):
Thanks @Deepak Cherian! Great suggestion.
cc @xdev
Keith Lindsay (Jan 22 2021 at 15:55):
@Deepak Cherian, it looks like the code you wrote ends up evaluating func(a,b) to get the token. This seems to defeat the purpose of avoiding the function evaluation via a cache. Maybe I'm missing something that you're implying.
That said, in an attempt to explore this, I tried to run the code you wrote. I get an error message:
AttributeError: module 'dask' has no attribute 'array'
Anderson Banihirwe (Jan 22 2021 at 15:56):
@Keith Lindsay
You need an explicit import:
import dask.array
Deepak Cherian (Jan 22 2021 at 17:04):
Oops yeah, I was assuming that it was a dask-aware function that was cheap to call. If not, you can do dask.base.tokenize((func, a, b)) or something like that, since the output is a unique function of the inputs.
EDIT: this only works if func is not redefined (looking at my first post in this thread). dask.base.tokenize(func(a, b)) hashes the output of the function, so it's more robust.
Last updated: Jan 30 2022 at 12:01 UTC