Stream: python-questions
Topic: Reproducible environments
Matt Long (Oct 15 2020 at 20:25):
What is the best way to ensure reproducible environments?
Many of us have been curating environment.yaml files in our repos; these are handy for defining environments, but resolve to a particular environment depending on when they are invoked. In some cases, it is important to ensure that environments can be reproduced exactly.
For example, I am working on a calculation; it's taken months to develop my code, and during that time I've only updated my conda environment occasionally. If a collaborator clones my repo and builds an environment from the environment.yaml file therein, they will likely get newer package versions, which could break my code.
One thing I've come across is conda-lock. This seems like a very nice utility for generating platform-specific lockfiles that recreate an exact replica of an environment:
https://github.com/brl0/conda-lock
The workflow seems to be something like
# generate the lockfiles
conda-lock -f environment.yml -p osx-64 -p linux-64
# create an environment from the lockfile
conda create -n my-locked-env --file conda-linux-64.lock
I think I should be curating these conda lock files in my repos.
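A minimal sketch of how the two commands above could be assembled per platform from Python. The `-f`/`-p` flags and the `conda-<platform>.lock` file name are taken from the example in this thread; the helper names are hypothetical, and nothing is executed here, the commands are only printed.

```python
import shlex


def lock_command(env_file, platforms):
    """Build the conda-lock invocation quoted in the thread for several platforms."""
    cmd = ["conda-lock", "-f", env_file]
    for platform in platforms:
        cmd += ["-p", platform]
    return cmd


def create_command(env_name, platform):
    """Build the `conda create` call that consumes one platform's lockfile.

    Assumes the lockfile is named conda-<platform>.lock, as in the thread.
    """
    return ["conda", "create", "-n", env_name, "--file", f"conda-{platform}.lock"]


if __name__ == "__main__":
    print(shlex.join(lock_command("environment.yml", ["osx-64", "linux-64"])))
    print(shlex.join(create_command("my-locked-env", "linux-64")))
```

Passing the resulting lists to `subprocess.run(..., check=True)` would run them, assuming conda-lock and conda are on the PATH.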
Is this the best way to solve this problem? Are there other tools/approaches to consider?
cc @xdev, @geocat, @Michael Levy, @Keith Lindsay
Anderson Banihirwe (Oct 15 2020 at 20:36):
conda-lock is a great tool for this... It's worth noting that if you have packages that are installed from PyPI via pip or from GitHub via pip, locking the environment fails.
E.g.: Locking this environment results in failure:
name: my-env
channels:
  - conda-forge
dependencies:
  - ....
  - pip:
    - dummy-pkg
    - git+https://github.com/foo/bar.git@some-hash
If all packages in your environment are conda installable, everything should work just fine...
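One way to spot this situation before attempting to lock is to scan the environment spec for a `pip:` sub-section. The sketch below works on an already-parsed environment (a plain dict, as a YAML loader would return); the function name and the `numpy` entry are hypothetical, and whether a given conda-lock version rejects such environments is taken from the message above rather than verified here.

```python
def pip_dependencies(env):
    """Return every entry listed under a `pip:` sub-section of `dependencies`."""
    found = []
    for dep in env.get("dependencies") or []:
        # conda entries are plain strings; the pip section is a one-key mapping
        if isinstance(dep, dict):
            found.extend(dep.get("pip") or [])
    return found


# Parsed form of the example environment; "numpy" is a hypothetical stand-in
# for the conda dependencies elided in the original message.
env = {
    "name": "my-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "numpy",
        {"pip": ["dummy-pkg", "git+https://github.com/foo/bar.git@some-hash"]},
    ],
}

if __name__ == "__main__":
    for entry in pip_dependencies(env):
        print("pip-installed, may block locking:", entry)
```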
Matt Long (Oct 15 2020 at 20:38):
Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?
Kevin Paul (Oct 15 2020 at 20:49):
docker is a solution, but it's a bit more of a lift to get it working. While docker doesn't work on HPC, singularity does, and you can pretty easily convert a docker image to a singularity image. I think the process goes like this.
- On your local machine (with docker) save your image to a tarball:
  $ docker save IMAGEID -o IMAGENAME.tar
- Copy the tarball to Cheyenne or Casper.
- module load singularity ...I think
- Use Singularity to convert the image:
  $ singularity build --sandbox IMAGENAME docker-archive:///path/to/IMAGENAME.tar
Or you can pull directly from DockerHub with just one command:
singularity build --sandbox IMAGENAME docker://OWNER/IMAGENAME
Never tried it myself, but it's worth trying.
Kevin Paul (Oct 15 2020 at 20:51):
Also, note that there is a Visual Studio (VS) Code extension that lets you run and dev inside a docker container, so you can actually do your development in the exact same environment.
Anderson Banihirwe (Oct 15 2020 at 20:57):
"Good point. Some packages are only available via pip. Does docker provide a solution? Not viable on HPC, though?"
It turns out that docker images are just a tarball collection of files. So, if one could emit a tarball for the environment, it's likely that you can get an environment that is as reproducible as Docker's or Singularity's (without the operating system bits though).
Anderson Banihirwe (Oct 15 2020 at 20:57):
There's an existing tool for this: https://conda.github.io/conda-pack/
Anderson Banihirwe (Oct 15 2020 at 20:57):
I haven't used it though
Anderson Banihirwe (Oct 15 2020 at 21:07):
One piece that is missing is something equivalent to Docker Hub or similar to serve as an environment registry... With an environment registry, one would be able to
- (from source system): pack an environment into a tarball. --> equivalent to docker build
- push the tarball to some remote registry --> equivalent to docker push
- (from the target system) pull the tarball and you are good to go.... --> equivalent to docker pull && docker run
One caveat is that this workflow is going to be operating-system dependent (because we are not shipping the operating system).
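The pack / push / pull steps above can be sketched with the standard library alone. This is only an illustration of the workflow's shape: a local directory stands in for the (hypothetical) remote registry, plain `tarfile` stands in for whatever conda-pack actually does when it archives an environment, and all function names are made up for this example.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def pack(env_dir, tarball):
    """Archive an environment directory into a gzipped tarball."""
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(env_dir, arcname=".")
    return Path(tarball)


def push(tarball, registry_dir):
    """Copy the tarball into the stand-in registry directory."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(tarball, registry / Path(tarball).name))


def pull(registry_dir, name, target_dir):
    """Fetch a tarball from the stand-in registry and unpack it on the target."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(Path(registry_dir) / name) as tar:
        tar.extractall(target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        env = root / "my-env"
        (env / "bin").mkdir(parents=True)
        (env / "bin" / "tool").write_text("placeholder executable\n")

        tarball = pack(env, root / "my-env.tar.gz")
        push(tarball, root / "registry")
        restored = pull(root / "registry", "my-env.tar.gz", root / "restored")
        print(sorted(p.name for p in (restored / "bin").iterdir()))
```

A real environment contains absolute paths and compiled binaries, so simply untarring it elsewhere is not enough in general; that is the part a dedicated tool would have to handle.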
Anderson Banihirwe (Oct 15 2020 at 21:10):
This actually sounds like a fun, prototype project worth exploring if time permits.... :smiley:
Matt Long (Oct 15 2020 at 21:46):
We are actively dealing with these issues in our work related to the #0.1° JRA BGC Run. Perhaps we can prototype a solution there?
Matt Long (Oct 28 2020 at 12:07):
cc @Brian Dobbins
Brian Dobbins (Oct 28 2020 at 14:26):
Thanks for pinging me, Matt - I hadn't known about this discussion, but I'm definitely interested in it.
Two quick questions/comments:
1) As Anderson said, Docker (and other containers) use the 'OCI' [Open Container Initiative] API, so all have compatible / convertible formats of their file systems. So on Cheyenne, I've converted Docker containers to Charliecloud ones and run them fine. In terms of HPC use, it really depends on what you're trying to do - running a single-node image on a (single) HPC node is a piece of cake, and something we do often (and applicable to any system). Running a multi-node HPC code, and taking advantage of the high-speed interconnect, is harder. Doable on Cheyenne, but not exactly portable. If you want to run a Docker container for some analysis on a single node on Cheyenne, though, I'm happy to help.
2) My admittedly limited understanding of environment.yml files is that they can track the actual build of a package, ensuring reproducibility across similar systems (e.g., Linux on x86-64). For example, if I have a dependency of 'liblapack', then I'm going to get different versions depending on when I build the environment. If I have a dependency of 'liblapack=3.8.0=17_openblas', I'm ensuring I get THAT build, but can't (necessarily) build the same environment across OS and architecture. The OS issue can be handled by containers, and the architecture issue... probably isn't portable anyway. E.g., I'm guessing a NumPy matrix-multiply of significant size on x86 will give different results than on Arm. Just a guess, but an educated one, given how floating-point is handled on these systems.
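To make the pinning levels in (2) concrete, here is a rough sketch that classifies dependency strings as bare names, `name=version`, or `name=version=build`. It only recognises those three forms from this thread; anything else (ranges such as `>=`, `==`, wildcards) is reported as "other". The function names and the sample list are hypothetical.

```python
import re

# name, optionally followed by =version, optionally followed by =build
_SPEC = re.compile(
    r"^(?P<name>[A-Za-z0-9_.\-]+)"
    r"(?:=(?P<version>[^=<>!~*,|\s]+)"
    r"(?:=(?P<build>[^=<>!~*,|\s]+))?)?$"
)


def pin_level(spec):
    """Return 'unpinned', 'version', 'build', or 'other' for one dependency string."""
    match = _SPEC.match(spec.strip())
    if match is None:
        return "other"
    if match["build"]:
        return "build"
    if match["version"]:
        # Note: in conda a single '=' is a prefix match, not an exact version
        return "version"
    return "unpinned"


def not_build_pinned(dependencies):
    """List the string dependencies that do not name an exact build."""
    return [d for d in dependencies if isinstance(d, str) and pin_level(d) != "build"]


if __name__ == "__main__":
    deps = ["liblapack", "liblapack=3.8.0", "liblapack=3.8.0=17_openblas"]
    for dep in deps:
        print(f"{dep}: {pin_level(dep)}")
```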
Conda-pack looks great and has some extra features, but I'm not sure it's needed on the same platform? Regardless, I'm interested in this topic, and in helping.
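On the floating-point guess in (2): the sketch below does not compare x86 with Arm. It only shows one mechanism that could produce such differences, namely that floating-point addition is not associative, so a library that sums the same terms in a different order can return a slightly different result.

```python
import math

# Same three terms, two different groupings
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

if __name__ == "__main__":
    print("bitwise equal:", left == right)
    print("equal within tolerance:", math.isclose(left, right, rel_tol=1e-12))
```

This is why comparisons of numerical output across builds or platforms are usually done with a tolerance rather than exact equality.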
Last updated: Jan 30 2022 at 12:01 UTC