---
author: Anderson Banihirwe, Matthew Long
date: 2022-03-17
tags: python, jupyter, notebooks, papermill, automation
---

# Batch Processing Jupyter Notebooks with Papermill

## Background and Motivation

[Jupyter notebooks](https://jupyter.org/) are well suited to the heavy lifting of data analysis. They:

- Allow you to showcase your work in a single place. You can see the complete "paper trail" of what was done, including the code, results, visuals, and the narrative accompanying your analysis.
- Allow others to easily use your work as a starting point for their own analysis. A new user can run through the notebook cell by cell to better understand the code.

However, Jupyter notebooks have a few drawbacks:

- Jupyter notebooks are difficult to maintain and reuse. Unlike a regular Python module that you can import and use in any Python project, notebook code tends to be shared by copying and pasting snippets between notebooks, and the copies easily get out of sync.
- Jupyter notebooks are hard to parameterize. This makes it difficult to maintain one version of the truth that can be used as a notebook template for exploring different parameters. Parameters in this context are the variables/arguments that you want to feed to your notebook whenever you run it.
- Unlike regular Python scripts, which can be run from the command line, running a Jupyter notebook in batch mode requires additional setup and configuration.

Some of these drawbacks can be addressed with the help of Papermill.

## What is Papermill?

[**Papermill**](https://papermill.readthedocs.io/en/latest/) is a Python library for parametrizing and running Jupyter notebooks in a way that is easy to maintain and reuse. In this post, we will walk through how to use Papermill to parametrize a notebook that loads an Xarray dataset and plots a map of the data.
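To make "parameters" concrete, here is a minimal sketch of how a parameters cell is represented inside a notebook file. A notebook is a JSON document, and cell tags live in each cell's `metadata.tags` list. Papermill looks for a cell tagged `parameters`, treats its assignments as defaults, and injects a new cell with the overriding values right after it. The sketch uses plain Python dictionaries rather than a notebook library, and the notebook is simplified. The default values and the `make_code_cell` and `find_tagged_cells` helpers are hypothetical names chosen for illustration. The parameter names `dataset` and `variable` are the ones used when running the notebook later in this post.

```python
import json

# Hypothetical defaults for the two parameters used later in this post.
PARAMETERS_SOURCE = 'dataset = "air_temperature"\nvariable = "air"\n'


def make_code_cell(source, tags=()):
    """Build a simplified notebook code cell as a plain dict."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": list(tags)},
        "outputs": [],
        "source": source,
    }


# A simplified two-cell notebook: a tagged parameters cell and a cell using them.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_code_cell(PARAMETERS_SOURCE, tags=["parameters"]),
        make_code_cell("print(dataset, variable)"),
    ],
}


def find_tagged_cells(nb, tag="parameters"):
    """Return the indices of cells whose metadata carries the given tag."""
    return [
        index
        for index, cell in enumerate(nb["cells"])
        if tag in cell.get("metadata", {}).get("tags", [])
    ]


# Show which cell is tagged and how the tag appears in the cell metadata.
print(find_tagged_cells(notebook))
print(json.dumps(notebook["cells"][0]["metadata"]))
```

Running a similar check against your own `.ipynb` file (loaded with `json.load`) is a quick way to confirm that the tag was saved before you hand the notebook to Papermill.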
## How to use Papermill

### Step 1: Prepare the notebook

To make your notebook Papermill-enabled (that is, runnable with Papermill), add the `parameters` tag to the cell that contains the variables you intend to set when running the notebook with Papermill:

1. Select the cell to parameterize
2. Click the property inspector in the right sidebar (double gear icon)
3. Type `parameters` in the `Add Tag +` box and hit `Enter`.

![](../../images/papermill-parameters.png)

### Step 2: Prepare the execution environment

Before running your notebook with Papermill, make sure the following packages are installed in the environment from which you plan to invoke Papermill:

- [**papermill**](https://papermill.readthedocs.io/en/latest/)
- [**jupyterlab**](https://jupyterlab.readthedocs.io/en/stable/)

For demonstration purposes, we will use `mamba`/`conda` to install these packages in a new execution environment:

```bash
$ mamba create -n myenv -c conda-forge papermill ipykernel
# or
$ conda create -n myenv -c conda-forge papermill ipykernel
```

```{note}
There are no restrictions on the environment in which you can run Papermill. You can use any environment that you like (e.g. the same environment used by your notebook).
```

Once your environment is ready, make sure the Jupyter kernel used by your notebook is registered by running the following commands:

```bash
$ conda activate my-analysis-env
$ python3 -m ipykernel install --user --name my-analysis-env
```

### Step 3: Running the notebook

Now that the environment is ready and the notebook is parametrized, you can run the notebook.
There are two ways to run the notebook:

#### Option 1: from the command line

To run the notebook from the command line, you need to run the following command:

```bash
$ conda activate myenv
$ papermill sample-notebook.ipynb output-notebook.ipynb -p dataset air_temperature -p variable air -k my-analysis-env
```

which returns the following output:

```bash
Input Notebook:  sample-notebook.ipynb
Output Notebook: output-notebook.ipynb
Executing:   0%|          | 0/6 [00:00