Stream: python-questions

Topic: Reproducible environments


view this post on Zulip Matt Long (Oct 15 2020 at 20:25):

What is the best way to ensure reproducible environments?

Many of us have been curating environment.yaml files in our repos; these are handy for defining environments, but the concrete environment they resolve to depends on when they are invoked. In some cases, it is important to ensure that environments can be reproduced exactly.
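To illustrate (a sketch; the environment name and packages here are placeholders), a loosely-pinned file like this resolves to whatever compatible versions are newest at solve time:

# write a loosely-pinned environment.yml; the solver picks the
# newest compatible versions at creation time, so the result drifts
cat > environment.yml <<'EOF'
name: my-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - xarray
EOF
conda env create -f environment.yml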

For example, suppose I am working on a calculation: it has taken months to develop my code, and during that time I've only updated my conda environment occasionally. If a collaborator clones my repo and builds an environment from the environment.yaml file therein, they will likely get newer package versions, which could break my code.

One thing I've come across is conda-lock. This seems like a very nice utility for generating platform-specific lockfiles that capture exact replicas of an environment:
https://github.com/brl0/conda-lock

The workflow seems to be something like

# generate the lockfiles
conda-lock -f environment.yml -p osx-64 -p linux-64

# create an environment from the lockfile
conda create -n my-locked-env --file conda-linux-64.lock

I think I should be curating these conda lock files in my repos.

Is this the best way to solve this problem? Are there other tools/approaches to consider?

cc @xdev, @geocat, @Michael Levy, @Keith Lindsay

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:36):

conda-lock is a great tool for this... It's worth noting that if you have packages installed from PyPI via pip or from GitHub via pip, locking the environment fails.

E.g.: Locking this environment results in failure:

name: my-env
channels:
  - conda-forge
dependencies:
  - ....
  - pip:
      - dummy-pkg
      - git+https://github.com/foo/bar.git@some-hash

If all packages in your environment are conda installable, everything should work just fine...

view this post on Zulip Matt Long (Oct 15 2020 at 20:38):

Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?

view this post on Zulip Kevin Paul (Oct 15 2020 at 20:49):

docker is a solution, but it's a bit more of a lift to get working. While docker itself doesn't run on HPC, singularity does, and you can pretty easily convert a docker image to a singularity image. I think the process goes like this:

  1. On your local machine (with docker), save your image to a tarball:
     $ docker save IMAGEID -o IMAGENAME.tar

  2. Copy the tarball to Cheyenne or Casper.

  3. module load singularity ...I think

  4. Use Singularity to convert the image:
     $ singularity build --sandbox IMAGENAME docker-archive:///path/to/IMAGENAME.tar

Or you can pull directly from DockerHub with just one command:

singularity build --sandbox IMAGENAME docker://OWNER/IMAGENAME
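Once built, running something inside the image should be as simple as this (a sketch; the python command is just a placeholder workload):

# run a command inside the converted image
singularity exec IMAGENAME python my_analysis.py

# or open an interactive shell in it
singularity shell IMAGENAME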

Never tried it myself, but it's worth trying.

view this post on Zulip Kevin Paul (Oct 15 2020 at 20:51):

Also, note that there is a Visual Studio (VS) Code extension that lets you run and dev inside a docker container, so you can actually do your development in the exact same environment.
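Outside of VS Code, the rough manual equivalent (a sketch; IMAGENAME is a placeholder) is to mount your source tree into the container and work there:

# open a shell in the container with the current directory
# mounted read-write at /workspace
docker run -it --rm -v "$PWD":/workspace -w /workspace IMAGENAME bash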

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

> Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?

It turns out that docker images are just tarballs of files. So, if one could emit a tarball of the environment, you could likely get an environment that is as reproducible as Docker's or Singularity's (without the operating-system bits, though).

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

There's an existing tool for this: https://conda.github.io/conda-pack/
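From the docs, the workflow looks roughly like this (my-env and the filenames are placeholders):

# on the source machine: pack the environment into a tarball
conda pack -n my-env -o my-env.tar.gz

# on the target machine: unpack it and activate
mkdir -p my-env
tar -xzf my-env.tar.gz -C my-env
source my-env/bin/activate

# fix up prefix paths baked into the packed environment
conda-unpack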

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

I haven't used it though

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 21:07):

One piece that is missing is something equivalent to Docker Hub or similar to serve as an environment registry... With an environment registry, one would be able to publish a packed environment and pull it down on another machine, the same way one pushes and pulls images with Docker Hub.

One caveat is that this workflow is going to be operating-system dependent (because we are not shipping the operating system)..

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 21:10):

This actually sounds like a fun prototype project worth exploring if time permits.... :smiley:

view this post on Zulip Matt Long (Oct 15 2020 at 21:46):

We are actively dealing with these issues in our work related to the #0.1° JRA BGC Run. Perhaps we can prototype a solution there?

view this post on Zulip Matt Long (Oct 28 2020 at 12:07):

cc @Brian Dobbins

view this post on Zulip Brian Dobbins (Oct 28 2020 at 14:26):

Thanks for pinging me, Matt - I hadn't known about this discussion, but I'm definitely interested in it.

Two quick questions/comments:

1) As Anderson said, Docker (and other container runtimes) follow the 'OCI' [Open Container Initiative] specifications, so they all have compatible/convertible filesystem formats. So on Cheyenne, I've converted Docker containers to Charliecloud ones and run them fine. In terms of HPC use, it really depends on what you're trying to do: running a single-node image on a (single) HPC node is a piece of cake, and something we do often (and applicable to any system). Running a multi-node HPC code, and taking advantage of the high-speed interconnect, is harder; doable on Cheyenne, but not exactly portable. If you want to run a Docker container for some analysis on a single node on Cheyenne, though, I'm happy to help.

2) My admittedly limited understanding of environment.yml files is that they can track the actual build of a package, ensuring reproducibility across similar systems (e.g., Linux on x86-64). For example, if I have a dependency of 'liblapack', then I'm going to get different versions depending on when I build the environment. If I have a dependency of 'liblapack=3.8.0=17_openblas', I'm ensuring I get THAT build, but can't (necessarily) build the same environment across OS and architecture. The OS issue can be handled by containers, and the architecture issue probably isn't portable anyway. E.g., I'm guessing a NumPy matrix multiply of significant size on x86 will give different results than on Arm; just a guess, but an educated one, given how floating-point is handled on these systems.
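For reference, conda env export is the standard way to capture those exact builds from an existing environment (a sketch; filenames are placeholders):

# export the active environment with exact version=build pins
# (e.g. liblapack=3.8.0=17_openblas); reproducible on the same
# platform, but generally not across OS/architecture
conda env export > environment.lock.yml

# export without build strings for a more portable, less exact spec
conda env export --no-builds > environment.yml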

Conda-pack looks great and has some extra features, but I'm not sure it's needed when staying on the same platform? Regardless, I'm interested in this topic, and in helping.

