Stream: python-questions

Topic: Reproducible environments


view this post on Zulip Matt Long (Oct 15 2020 at 20:25):

What is the best way to ensure reproducible environments?

Many of us have been curating environment.yaml files in our repos; these are handy for defining environments, but the concrete environment they resolve to depends on when they are invoked. In some cases, it is important to ensure that environments can be reproduced exactly.
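To illustrate (a sketch; the environment name and packages here are placeholders), a loosely-pinned file like this resolves to whatever compatible versions are newest at solve time:

# write a loosely-pinned environment.yml; the solver picks the
# newest compatible versions at creation time, so the result drifts
cat > environment.yml <<'EOF'
name: my-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - xarray
EOF
conda env create -f environment.yml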

For example, suppose I am working on a calculation: it has taken months to develop my code, and during that time I've only updated my conda environment occasionally. If a collaborator clones my repo and builds an environment from the environment.yaml file therein, they will likely get newer package versions, which could break my code.

One thing I've come across is conda-lock. This seems like a very nice utility for generating platform-specific lockfiles that capture exact replicas of an environment:
https://github.com/brl0/conda-lock

The workflow seems to be something like

# generate the lockfiles
conda-lock -f environment.yml -p osx-64 -p linux-64

# create an environment from the lockfile
conda create -n my-locked-env --file conda-linux-64.lock

I think I should be curating these conda lock files in my repos.

Is this the best way to solve this problem? Are there other tools/approaches to consider?

cc @xdev, @geocat, @Michael Levy, @Keith Lindsay

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:36):

conda-lock is a great tool for this... It's worth noting that if you have packages installed from PyPI via pip or from GitHub via pip, locking the environment fails.

E.g.: Locking this environment results in failure:

name: my-env
channels:
  - conda-forge
dependencies:
  - ....
  - pip:
      - dummy-pkg
      - git+https://github.com/foo/bar.git@some-hash

If all packages in your environment are conda installable, everything should work just fine...

view this post on Zulip Matt Long (Oct 15 2020 at 20:38):

Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?

view this post on Zulip Kevin Paul (Oct 15 2020 at 20:49):

docker is a solution, but it's a bit more of a lift to get working. While docker itself doesn't run on HPC, singularity does, and you can pretty easily convert a docker image to a singularity image. I think the process goes like this:

  1. On your local machine (with docker), save your image to a tarball:
     $ docker save IMAGEID -o IMAGENAME.tar

  2. Copy the tarball to Cheyenne or Casper.

  3. module load singularity ...I think

  4. Use Singularity to convert the image:
     $ singularity build --sandbox IMAGENAME docker-archive:///path/to/IMAGENAME.tar

Or you can pull directly from DockerHub with just one command:

singularity build --sandbox IMAGENAME docker://OWNER/IMAGENAME
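Once built, running something inside the image should be as simple as this (a sketch; the python command is just a placeholder workload):

# run a command inside the converted image
singularity exec IMAGENAME python my_analysis.py

# or open an interactive shell in it
singularity shell IMAGENAME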

Never tried it myself, but it's worth trying.

view this post on Zulip Kevin Paul (Oct 15 2020 at 20:51):

Also, note that there is a Visual Studio (VS) Code extension that lets you run and dev inside a docker container, so you can actually do your development in the exact same environment.
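Outside of VS Code, the rough manual equivalent (a sketch; IMAGENAME is a placeholder) is to mount your source tree into the container and work there:

# open a shell in the container with the current directory
# mounted read-write at /workspace
docker run -it --rm -v "$PWD":/workspace -w /workspace IMAGENAME bash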

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

> Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?

It turns out that docker images are just tarballs of files. So, if one could emit a tarball of the environment, you could likely get an environment that is as reproducible as Docker's or Singularity's (without the operating-system bits, though).

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

There's an existing tool for this: https://conda.github.io/conda-pack/
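From the docs, the workflow looks roughly like this (my-env and the filenames are placeholders):

# on the source machine: pack the environment into a tarball
conda pack -n my-env -o my-env.tar.gz

# on the target machine: unpack it and activate
mkdir -p my-env
tar -xzf my-env.tar.gz -C my-env
source my-env/bin/activate

# fix up prefix paths baked into the packed environment
conda-unpack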

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 20:57):

I haven't used it though

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 21:07):

One piece that is missing is something equivalent to Docker Hub or similar to serve as an environment registry... With an environment registry, one would be able to publish a packed environment and pull it down on another machine, the same way one pushes and pulls images with Docker Hub.

One caveat is that this workflow is going to be operating-system dependent (because we are not shipping the operating system)..

view this post on Zulip Anderson Banihirwe (Oct 15 2020 at 21:10):

This actually sounds like a fun prototype project worth exploring if time permits.... :smiley:

view this post on Zulip Matt Long (Oct 15 2020 at 21:46):

We are actively dealing with these issues in our work related to the #0.1° JRA BGC Run. Perhaps we can prototype a solution there?

view this post on Zulip Matt Long (Oct 28 2020 at 12:07):

cc @Brian Dobbins

view this post on Zulip Brian Dobbins (Oct 28 2020 at 14:26):

Thanks for pinging me, Matt - I hadn't known about this discussion, but I'm definitely interested in it.

Two quick questions/comments:

1) As Anderson said, Docker (and other container runtimes) follow the 'OCI' [Open Container Initiative] specifications, so they all have compatible/convertible filesystem formats. So on Cheyenne, I've converted Docker containers to Charliecloud ones and run them fine. In terms of HPC use, it really depends on what you're trying to do: running a single-node image on a (single) HPC node is a piece of cake, and something we do often (and applicable to any system). Running a multi-node HPC code, and taking advantage of the high-speed interconnect, is harder; doable on Cheyenne, but not exactly portable. If you want to run a Docker container for some analysis on a single node on Cheyenne, though, I'm happy to help.

2) My admittedly limited understanding of environment.yml files is that they can track the actual build of a package, ensuring reproducibility across similar systems (e.g., Linux on x86-64). For example, if I have a dependency of 'liblapack', then I'm going to get different versions depending on when I build the environment. If I have a dependency of 'liblapack=3.8.0=17_openblas', I'm ensuring I get THAT build, but can't (necessarily) build the same environment across OS and architecture. The OS issue can be handled by containers, and the architecture issue probably isn't portable anyway. E.g., I'm guessing a NumPy matrix multiply of significant size on x86 will give different results than on Arm; just a guess, but an educated one, given how floating-point is handled on these systems.
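For reference, conda env export is the standard way to capture those exact builds from an existing environment (a sketch; filenames are placeholders):

# export the active environment with exact version=build pins
# (e.g. liblapack=3.8.0=17_openblas); reproducible on the same
# platform, but generally not across OS/architecture
conda env export > environment.lock.yml

# export without build strings for a more portable, less exact spec
conda env export --no-builds > environment.yml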

Conda-pack looks great and has some extra features, but I'm not sure it's needed when staying on the same platform? Regardless, I'm interested in this topic, and in helping.

