Stream: python-dev

Topic: xpersist hashing


Deepak Cherian (May 13 2020 at 18:21):

Here's a quick test that suggests it's better to hash the dask variable instead of the function + args + kwargs. If you run this cell multiple times, func is redefined and its token changes. But the dask graph isn't changing, so the token for the output of func(a, b) is invariant (and accessible with the name attribute of dask arrays).

def func(a, b):
    return a * b + 10

import dask
a = dask.array.ones((10, 1))
b = dask.array.ones((10, 1))

print(dask.base.tokenize(func, a, b))
print(dask.base.tokenize(func(a,b)))

I ran the above cell multiple times. Here are the outputs.

First run:

aef9da9c2d23c57b78705c645bd30087
b241f225ffe8f6d84e1b97981dfe96e2

Second run:

457d0d1b307cfcae61947f108d97a8e5
b241f225ffe8f6d84e1b97981dfe96e2

After restarting the kernel:

8934369653f9ff6cfc2fe14f038433fe
b241f225ffe8f6d84e1b97981dfe96e2

Deepak Cherian (May 13 2020 at 18:33):

dask.base.tokenize(func(a,b)) is redundant actually. func(a,b).name works just as well (add-933890fbb1484cc1eb3fbd568e967cfd). This is what xarray uses to avoid expensive equality checks.

The problem with this approach is that copies have different names: func(a,b).copy().name = copy-1e2375ec589dad4ad50d1abfe5d00bdd
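A minimal sketch of the point above, assuming func is a cheap dask-aware function, so that calling it only builds a task graph and computes nothing. The variable names first, second, names_match and tokens_match are made up for illustration. Whether copy() produces a new name may depend on the dask version, so that case is left out of the check here.

```python
import dask.array
from dask.base import tokenize


def func(a, b):
    return a * b + 10


a = dask.array.ones((10, 1))
b = dask.array.ones((10, 1))

# Calling func only builds a lazy dask array; no values are computed here.
first = func(a, b)
second = func(a, b)

# Two identical graphs get the same name, so the name can act as a token.
names_match = first.name == second.name

# tokenize() on the lazy output is equally stable within this session.
tokens_match = tokenize(first) == tokenize(second)

print(first.name, names_match, tokens_match)
```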

Matt Long (May 13 2020 at 21:19):

Thanks @Deepak Cherian! Great suggestion.

cc @xdev

Keith Lindsay (Jan 22 2021 at 15:55):

@Deepak Cherian, it looks like the code you wrote ends up evaluating func(a,b) to get the token. This seems to defeat the purpose of avoiding the function evaluation via a cache. Maybe I'm missing something that you're implying.

That said, in an attempt to explore this, I tried to run the code you wrote. I get an error message:

AttributeError: module 'dask' has no attribute 'array'

Anderson Banihirwe (Jan 22 2021 at 15:56):

@Keith Lindsay

You need an explicit import:

import dask.array

Deepak Cherian (Jan 22 2021 at 17:04):

Oops yeah, I was assuming that it was a dask-aware function that was cheap. If not, you can do dask.base.tokenize((func, a, b)) or something like that, since the output is a unique function of the inputs.

EDIT: this only works if func is not redefined (looking at my first post in this thread)

dask.base.tokenize(func(a, b)) hashes the output of the function so it's more robust.
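To make that concrete, here is a rough sketch of a cache keyed on the name of the lazy output. It assumes func is dask-aware, so building the graph is cheap and only compute() is expensive. The names cache, cached_compute and make_func are hypothetical helpers for this example, not part of dask or xpersist.

```python
import dask.array

# Hypothetical in-memory store; a real cache would likely persist to disk.
cache = {}


def cached_compute(func, *args):
    """Compute func(*args) once per distinct dask graph."""
    lazy = func(*args)  # builds the graph only, assuming func is dask-aware
    key = lazy.name     # stable as long as the graph is unchanged
    if key not in cache:
        cache[key] = lazy.compute()
    return cache[key]


def make_func():
    # Returns a fresh function object each time, like re-running a cell.
    def func(a, b):
        return a * b + 10
    return func


a = dask.array.ones((10, 1))
b = dask.array.ones((10, 1))

result_1 = cached_compute(make_func(), a, b)
result_2 = cached_compute(make_func(), a, b)  # redefined func, same graph

print(len(cache), result_1 is result_2)
```

Because the key comes from the graph rather than from the function object, redefining func with the same body still hits the cache.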


Last updated: Jan 30 2022 at 12:01 UTC