With all the posts about learning Python, hopefully you’ve seen how Python and Jupyter Notebooks are powerful tools for exploring data about your products and services. In this post, I’d like to share a tool I’ve created to help make your life a little easier - a tool to run Jupyter Notebooks automatically on an hourly, daily, weekly or even monthly schedule.
I was inspired by this post from the Netflix Technology Blog, where they outline how they’ve embraced notebooks as one of their main tools for exploring data and reporting on experiments. The post points to a brilliant library called Papermill which can execute a Jupyter Notebook and save the results to a file.
As a product manager, I have many stats and metrics which I track for the various products and services I manage. Each of these is stored as a notebook, which allows me to interactively explore the data in real time. The game-changer here is that I can now use the same notebooks to track the stats and update my dashboard automatically.
NotebookScheduler is a simple Python script which uses Papermill to execute a directory of Jupyter Notebooks. Notebooks are arranged into subfolders for hourly, daily, weekly or monthly execution. Each time a notebook is run, a snapshot is saved to a timestamped folder (along with any other outputs your notebook saves) giving you the ability to look back at past executions and to have a full audit of the analysis that has been done.
Once I’ve set up the notebook to provide whatever stats I want, scheduling its execution on a weekly basis is now as simple as a drag-and-drop into the weekly subfolder.
The code is available in this GitHub repository - clone or download it to a folder on your PC. The first time you run the script, it will create a skeleton directory structure, with subdirectories for hourly, daily, weekly and monthly notebooks.
Simply move your notebook (*.ipynb) files into the relevant subdirectory and when the script is run they will be executed.
The directory structure is shown below:
<script_folder>/
├── NotebookScheduler.py
├── hourly/
│   ├── notebook1.ipynb
│   ├── notebook2.ipynb
│   └── snapshots/
│       ├── notebook1/
│       │   └── <timestamp>/
│       │       └── notebook1.ipynb
│       └── notebook2/
│           └── <timestamp>/
│               └── notebook2.ipynb
├── daily/
│   ├── notebook3.ipynb
│   ├── notebook4.ipynb
│   └── snapshots/
│       ├── notebook3/
│       │   └── <timestamp>/
│       │       └── notebook3.ipynb
│       └── notebook4/
│           └── <timestamp>/
│               └── notebook4.ipynb
└── weekly/
    ├── notebook5.ipynb
    ├── notebook6.ipynb
    └── snapshots/
        ├── notebook5/
        │   └── <timestamp>/
        │       └── notebook5.ipynb
        └── notebook6/
            └── <timestamp>/
                └── notebook6.ipynb
Install the dependencies
The script has a few dependencies.
Papermill is the module that runs the Jupyter Notebooks. You’ll need to install Papermill and its dependencies first.
pip install papermill
If you want to use the built-in scheduler, then you’ll need to install Schedule.
pip install schedule
If you’re going to use Windows Task Scheduler or Cron jobs to schedule the execution, then you don’t need this.
Running the script without an external scheduler
The simplest way to get started is to use the built-in scheduler. In this mode, you run the Python script in a terminal and leave it running. The script itself will loop and run each notebook on the schedule determined by which subdirectory it sits in (e.g. daily, weekly).
To do this, once you have some notebooks in your folders, simply run the script from its root folder:

python NotebookScheduler.py
Running the script with an external scheduler
An alternative way of running is to use an external scheduler, like the built-in Windows Task Scheduler or a Cron job, to execute the script. In this mode, the external scheduler determines the frequency of execution. You just need to set the -d command line option to tell the script which directory to execute. So, if you wanted to run your hourly and daily scripts, you’d set up two tasks:
One job set to run hourly, with the script executed as follows:
python NotebookScheduler.py -d hourly
And another one set to run daily, with the script executed as follows:
python NotebookScheduler.py -d daily
When the directory is specified using the -d option, the notebooks in the specified directory are executed immediately.
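On Linux or macOS, the two tasks above might look like this in your crontab (the script path here is hypothetical; adjust it to wherever you cloned the repository, and pick whatever daily time suits you):

```
0 * * * * cd /path/to/NotebookScheduler && python NotebookScheduler.py -d hourly
0 6 * * * cd /path/to/NotebookScheduler && python NotebookScheduler.py -d daily
```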
About the snapshots
Within each of the daily/hourly/weekly/monthly directories, a “snapshots” directory will be created. This will have a sub-folder for each notebook that is executed, and each execution will be stored in a timestamped folder. Whilst this is a lot of nesting, it makes it quick and easy to view the output of a particular notebook on a particular day. Once the notebook is executed, Papermill will save the output notebook to the snapshot directory.
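As an illustration, a snapshot path is assembled from the schedule folder, the notebook’s name, and a timestamp, something like this (a hypothetical reconstruction; the real script’s timestamp format may differ):

```python
import os
from datetime import datetime

# Hypothetical reconstruction of how a snapshot path is assembled;
# the actual timestamp format used by the script may differ.
notebook_file = "notebook1.ipynb"
name = os.path.splitext(notebook_file)[0]
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
snapshot_path = os.path.join("hourly", "snapshots", name, timestamp)
print(snapshot_path)
```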
Saving other artifacts
Papermill can pass parameters to the notebooks it is executing. NotebookScheduler will set a snapshotDir parameter so that you can use this within your notebooks for saving files within the snapshot directory. For example, the following notebook generates a random dataframe and then saves a .csv file into the snapshot directory. This means that each execution of the notebook has its .csv output right next to the output notebook in the timestamped folder.
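A cell doing that might look like this (a sketch rather than the exact notebook: the parameters-cell default and the filename are my own):

```python
import os
import numpy as np
import pandas as pd

# In the real notebook this cell would be tagged "parameters" so Papermill
# can inject snapshotDir; the default below is only for standalone runs.
snapshotDir = "."

# Generate a random dataframe (illustrative data only).
df = pd.DataFrame(np.random.rand(10, 3), columns=["a", "b", "c"])

# Save the .csv next to the output notebook in the timestamped folder.
df.to_csv(os.path.join(snapshotDir, "random_data.csv"), index=False)
```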
Hopefully that helps keep everything neat and tidy!
Logging is set up out of the box - once the script is run, you’ll see notebook.log appear in the folder. All executions are logged to this single file, so you only have one place to check to see whether scripts have run. If anything goes wrong with an execution (e.g. something’s broken in your notebook), a stack trace will be included in the log so you can find out why it broke.
Testing and feedback
I’ve only tested the script using Python 3.6 so far. If you encounter any bugs or strange behaviour, then please raise an issue via the repository.