github-dl: CLI for Downloading Gists and Repos

Problem

As part of my internship at Mozilla, I have been tasked with analyzing the structure of Jupyter notebooks in the wild. On the pipeline/platform team, it is our responsibility to ensure that the tooling we provide to analysts and employees functions well. Among other tools, analysts use Jupyter notebooks and Spark to analyze large datasets and perform machine learning. By analyzing notebook usage across Mozilla and outside the organization, our goal is to create better templates for analysts and to understand how we might improve Jupyter Notebook.

Solution

To scrape notebooks from the web, I wrote github-dl. The tool suits my use case because I needed to filter and query for Mozilla notebooks and also search for notebooks in a particular field of interest (e.g., “spark” or “big data”).

github-dl is a command-line wrapper on top of PyGithub, GitPython, and requests. I used click to parse command-line arguments and pass them to PyGithub, which validates credentials and implements the search for gists and repositories. GitHub API v3 does not allow querying for gists, so the only way a user can query gist.github.com is by passing usernames. Below is a simple command-line call that downloads my gists containing Jupyter notebooks.

$ gist-dl cameres gist-notebooks --extension=ipynb
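The gist workflow above can be sketched in Python roughly as follows. This is a minimal illustration, not github-dl's actual source: the function names select_files and download_user_gists, the token parameter, and the choice to prefix saved files with the gist id are all assumptions made for the example. It assumes Python 3 with PyGithub and requests installed.

```python
import os


def select_files(files, extension=None):
    """Return sorted (filename, raw_url) pairs matching the extension.

    `files` maps filename -> raw URL. `extension` is given with or
    without a leading dot; None keeps every file.
    """
    if extension is None:
        return sorted(files.items())
    suffix = "." + extension.lstrip(".").lower()
    return sorted(
        (name, url) for name, url in files.items()
        if name.lower().endswith(suffix)
    )


def download_user_gists(username, dest, extension=None, token=None):
    """Save matching files from each gist of `username` into `dest`."""
    # Third-party imports stay local so select_files works without them.
    from github import Github  # PyGithub
    import requests

    # Anonymous access works but is rate limited; a token is optional here.
    client = Github(token) if token else Github()
    os.makedirs(dest, exist_ok=True)

    for gist in client.get_user(username).get_gists():
        files = {name: f.raw_url for name, f in gist.files.items()}
        for name, url in select_files(files, extension):
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            # Hypothetical naming scheme: the gist id prefix keeps
            # same-named files from overwriting each other.
            path = os.path.join(dest, gist.id + "_" + name)
            with open(path, "wb") as handle:
                handle.write(response.content)
```

Under these assumptions, download_user_gists("cameres", "gist-notebooks", extension="ipynb") would mirror the command above. The filtering step is kept separate from the network code so it can be checked on its own.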

Querying GitHub repositories is much easier, because GitHub API v3 accepts the same search syntax a user would type on github.com. Below is a sample query that I wrote to download Jupyter notebook repositories returned by a search for machine learning. I also include the size qualifier, which GitHub measures in kilobytes, to filter out repositories that could be massive. The repositories are downloaded to github-notebooks.

$ github-dl 'machine learning language:jupyter-notebook size:<1000' github-notebooks
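A repository search like the one above could be reproduced directly with PyGithub and GitPython along these lines. Again, this is only a sketch under stated assumptions rather than the tool's real implementation: build_query, clone_search_results, the limit parameter, and the directory naming are hypothetical and are not github-dl options. It assumes Python 3.

```python
import os


def build_query(terms, language=None, max_size_kb=None):
    """Assemble a GitHub search string from free text and qualifiers."""
    parts = [terms]
    if language:
        parts.append("language:" + language)
    if max_size_kb is not None:
        # GitHub's size qualifier is expressed in kilobytes.
        parts.append("size:<%d" % max_size_kb)
    return " ".join(parts)


def clone_search_results(query, dest, token=None, limit=10):
    """Clone up to `limit` repositories matching `query` into `dest`."""
    # Third-party imports stay local so build_query works without them.
    from github import Github  # PyGithub
    from git import Repo       # GitPython

    client = Github(token) if token else Github()
    os.makedirs(dest, exist_ok=True)

    cloned = []
    for index, repo in enumerate(client.search_repositories(query)):
        if index >= limit:  # hypothetical safety cap, not a github-dl flag
            break
        # "owner/name" becomes "owner_name" so each clone gets its own folder.
        target = os.path.join(dest, repo.full_name.replace("/", "_"))
        if not os.path.exists(target):
            Repo.clone_from(repo.clone_url, target)
        cloned.append(target)
    return cloned
```

Calling clone_search_results(build_query("machine learning", "jupyter-notebook", 1000), "github-notebooks") would approximate the command above. The cap on results is there because a broad search can match far more repositories than anyone wants to clone in one run.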

This tool can be used to download repositories and gists of any type. github-dl is available on the Python Package Index (pip install github-dl) and has been tested on both Python 2.7.12 and Python 3.5.2. See the repository for more information.
