Stream: python-questions
Topic: github advice
Isla Simpson (Nov 28 2020 at 15:44):
Hello again, I'm trying to start using github for my project and I'm wondering if there's a way (other than just cloning the repository all over the place) to deal with the fact that on the CGD machines we have a tiny home directory. Normally the way I work is to have the following:
code in /home/islas
data in /project/cas/islas/savs
plots in /project/cas/islas/plots
I like to keep things separate like this. But I can't see how to deal with this in github. Links don't work because /home and /project are on different devices. I want to include some small data files in the github repository that are needed for making the plots, but I don't have space for them in my home directory. How do others deal with this? Do you just do everything in the project space? Or is there a way to incorporate other locations into your git repository? Thanks in advance!
Anderson Banihirwe (Nov 30 2020 at 15:43):
@Isla Simpson,
Have you considered putting everything (code, data, plots) in a parent directory under /project/cas/islas and creating a soft link in your home directory pointing to the parent directory in /project/cas/islas?
Michael Levy (Nov 30 2020 at 15:49):
@Isla Simpson do you need the data and plots to be under version control? My typical workflow would be to keep the code in github but find a better way to back up the datasets (I typically work on glade and don't recall the CGD backup policy for /project/... if ISG doesn't keep backups, could you copy the data to campaign or /glade/work/ or something?) and then plan on regenerating any plots that are lost if the disk happens to fail.
Michael Levy (Nov 30 2020 at 15:51):
I'd probably make softlinks to the data and plot directories so I could use relative paths in the code, but then include those softlinks in .gitignore so they aren't kept in the repository. Then if you move to a separate machine you'll need to create similar softlinks after copying the data over, but you'll be free to keep the data on a separate volume from the code again.
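A minimal sketch of that softlink-plus-.gitignore setup, in case it helps. The function name `link_and_ignore` and the link names `data` and `plots` are hypothetical, picked just for illustration; the target directories would be whatever lives on the other volume.

```python
from pathlib import Path
from typing import Mapping


def link_and_ignore(repo: Path, targets: Mapping[str, Path]) -> None:
    """Create softlinks inside `repo` and list them in its .gitignore.

    `targets` maps a link name (hypothetical, e.g. "data") to the real
    directory that lives on another volume.
    """
    gitignore = repo / ".gitignore"
    entries = gitignore.read_text().splitlines() if gitignore.exists() else []

    for name, target in targets.items():
        link = repo / name
        # Only create the link if nothing is already sitting at that path.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(target, target_is_directory=True)
        # No trailing slash: git treats a symlink as a file, so a
        # directory-only pattern such as "data/" would not match it.
        entry = f"/{name}"
        if entry not in entries:
            entries.append(entry)

    gitignore.write_text("\n".join(entries) + "\n")


# Example call (the repository path is hypothetical):
# link_and_ignore(
#     Path("/home/islas/myproject"),
#     {"data": Path("/project/cas/islas/savs"),
#      "plots": Path("/project/cas/islas/plots")},
# )
```

On a new machine the same call, with different targets, would recreate the links without touching the code that uses the relative paths.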
Isla Simpson (Nov 30 2020 at 16:47):
Hi Anderson, Mike,
Thanks for the suggestions. I think not having the data and plots under version control is probably the way to go. The reason I wanted to put the data up there is that I wanted to provide the basic data for making the plots for when it comes to publication and making the data available. But, perhaps what I should do is use some other platform for sharing the data and not have it be part of my github repository. The suggestion of putting everything in /project/cas would be an idea, but I'd like to take advantage of the backups in /home. So, I think continuing to keep them separate and not provide the data on github is what I'll do and then put the necessary data somewhere else. Thanks for the advice!
Kevin Paul (Nov 30 2020 at 16:47):
I think @Michael Levy is right, here. The best strategy for using version control on a project or package is to only apply version control to the source code and not the data or products (i.e., plots). You might even include all plot and data file types in your .gitignore file, so that new data files and/or plot files are automatically ignored by git. (Sometimes, not including data in your git repository is necessary because of file size limitations on hosting platforms like GitHub. GitHub does not allow files larger than 100 MB.)
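As a rough sketch, something like the following could flag files that would hit that size limit before they are ever committed. The helper name `oversized_files` is hypothetical, and reading the 100 MB limit as 100 * 1024 * 1024 bytes is an assumption.

```python
import os
from pathlib import Path
from typing import List

# Assumption: the 100 MB limit is interpreted as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root: Path, limit: int = LIMIT_BYTES) -> List[Path]:
    """Return regular files under `root` that are larger than `limit` bytes."""
    hits = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own storage directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            # Symlinks are skipped: git stores the link itself, not its target.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_size > limit:
                hits.append(path)
    return sorted(hits)
```

Anything this returns is a candidate for .gitignore or for hosting somewhere other than the repository.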
Another useful design suggestion might be to try to generalize what you keep in your git repository so that it doesn't actually depend on specific data files (and specific data file paths). That makes the code more easily reusable. Even if nobody else uses it, you might for another project (and other data files).
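One possible way to do that, sketched under assumptions: take the data and plot locations from command-line options, falling back to environment variables and then to relative paths. The option names and the `PROJECT_DATA_DIR` / `PROJECT_PLOT_DIR` variables are hypothetical, not an existing convention.

```python
import argparse
import os
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose path defaults come from the environment."""
    parser = argparse.ArgumentParser(description="Make plots from saved data.")
    parser.add_argument(
        "--data-dir",
        type=Path,
        # Hypothetical variable name; falls back to a relative "data" path.
        default=Path(os.environ.get("PROJECT_DATA_DIR", "data")),
        help="directory holding the input data files",
    )
    parser.add_argument(
        "--plot-dir",
        type=Path,
        # Hypothetical variable name; falls back to a relative "plots" path.
        default=Path(os.environ.get("PROJECT_PLOT_DIR", "plots")),
        help="directory where plots are written",
    )
    return parser


# Example: the code itself never hard-codes a machine-specific path.
# args = build_parser().parse_args(["--data-dir", "/project/cas/islas/savs"])
# args.data_dir / "some_file.nc"   # file name is a placeholder
```

With this layout the repository holds only code, and each machine (or each project) supplies its own locations at run time.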
Isla Simpson (Nov 30 2020 at 16:48):
ok, thanks a lot!
Kevin Paul (Nov 30 2020 at 16:49):
@Isla Simpson: If you want to include data for example/demonstration purposes, that's usually okay (and appreciated). But @Michael Levy is correct that usually keeping all of the data is unnecessary.
Last updated: Jan 30 2022 at 12:01 UTC