Avoiding data leaks on GitHub through Jupyter notebooks


A. Preventing a data leak

Clear all notebooks automatically on commit

Rationale: run a filter over notebook files before they are added to git. This leaves the original file on disk as-is, but commits the “cleaned” version.

  1. Create a .gitattributes file in your repo


*.ipynb filter=jupyternotebook
  2. Create a .gitconfig file in your repo


[filter "jupyternotebook"]
        clean = jupyter nbconvert --to=notebook --ClearOutputPreprocessor.enabled=True --stdout %f
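  3. Add the custom .gitconfig to the local git config: git config --local include.path ../.gitconfig
    • N.B.: git resolves a relative include path against .git/config, hence the ../ prefix.
    • N.B.: this step has to be repeated every time the repo is cloned.
  4. Verify that the custom config was added to the local git config; .git/config should now contain:


[include]
        path = ../.gitconfig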
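
The four steps above can be sketched as one shell function. This is a minimal sketch, not a definitive setup script: setup_notebook_filter is a hypothetical helper name, and it assumes the repo has no .gitattributes or .gitconfig yet (both files are overwritten). It registers the include as ../.gitconfig because git resolves a relative include path against .git/config, not against the repo root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git or Jupyter): wire up the
# notebook-clearing filter in the repository given as first argument.
# Assumption: .gitattributes and .gitconfig do not exist yet; they are overwritten.
setup_notebook_filter() {
    local repo="${1:-.}"

    # Step 1: route every notebook through the "jupyternotebook" filter.
    printf '%s\n' '*.ipynb filter=jupyternotebook' > "$repo/.gitattributes"

    # Step 2: define what the filter does when a notebook is staged.
    printf '%s\n' \
        '[filter "jupyternotebook"]' \
        '        clean = jupyter nbconvert --to=notebook --ClearOutputPreprocessor.enabled=True --stdout %f' \
        > "$repo/.gitconfig"

    # Step 3: include the versioned .gitconfig from the local git config.
    # The path is relative to .git/config, hence "../".
    git -C "$repo" config --local include.path ../.gitconfig

    # Step 4: print what git actually resolved, so a typo shows up immediately.
    git -C "$repo" config --get filter.jupyternotebook.clean
    git -C "$repo" check-attr filter -- example.ipynb
}
```

Calling setup_notebook_filter . from the repo root prints the clean command and the filter that git would apply to a notebook named example.ipynb (the file does not need to exist). If the first line is missing, the include path is not being picked up.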

B. In case of a data leak already on github

Clear all notebooks

Whilst in the repo:

jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace */*/*/*/*/*.ipynb (or nbconvert > 6.0: jupyter nbconvert --clear-output --inplace */*/*/*/*/*.ipynb)

Each glob only matches notebooks at exactly one depth, so repeat the command for every level, from *.ipynb down to as deep as your repo structure goes (*/*/*/*/*/*.ipynb in this case).
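
As an alternative to spelling out one glob per depth, find can collect notebooks at any depth. The following is a sketch under assumptions: list_notebooks and clear_notebooks are hypothetical helper names, and skipping .ipynb_checkpoints and .git is a choice made here, not something the procedure above requires. The nbconvert flags are the ones from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every notebook below a directory, at any depth,
# skipping Jupyter's autosave copies and anything inside .git.
list_notebooks() {
    find "${1:-.}" -type f -name '*.ipynb' \
        ! -path '*/.ipynb_checkpoints/*' \
        ! -path '*/.git/*'
}

# Hypothetical helper: clear the outputs of each listed notebook in place.
# Assumes file names do not contain newlines.
clear_notebooks() {
    list_notebooks "${1:-.}" | while IFS= read -r nb; do
        jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace "$nb"
    done
}
```

Running list_notebooks . first is a cheap way to review which files clear_notebooks . would rewrite.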

Clean all jupyter notebooks already on github

Requirements: BFG (can be installed via brew)

Procedure as explained in this blog post.

  1. Move your old local repo folder out of the way (keep it temporarily as a backup): mv example-repo example-repo-old
  2. Clone a mirror of the repo: git clone --mirror git@github.com:example/example-repo.git
  3. Target all jupyter notebooks of repo: bfg --delete-files "{*.ipynb}" example-repo.git/
  4. Rewrite history: cd example-repo.git && git reflog expire --expire=now --all && git gc --prune=now --aggressive
  5. git push
  6. Delete mirror repo (as well as your temporary copy of the old version): cd ../ && rm -rf example-repo.git
  7. Re-download the clean version of the repo: git clone git@github.com:example/example-repo.git
  8. Repeat steps 3 and 4 of part A (reinstate the custom filter after cloning).

This essentially removes all history of these files.
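
One way to check the result is to count how many distinct versions of each notebook are still reachable from any branch or tag. This is a minimal sketch: notebook_blob_counts is a hypothetical helper name, and it assumes file names without newlines. By default BFG leaves the contents of the latest commit untouched, so a notebook that still exists there is expected to show up with a count of 1; higher counts mean older versions are still in the history.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every .ipynb path reachable from any ref,
# print how many distinct blobs (file versions) git still stores for it.
notebook_blob_counts() {
    git -C "${1:-.}" rev-list --all --objects \
        | sed -n 's/^[0-9a-f]* //p' \
        | grep '\.ipynb$' \
        | sort | uniq -c
}
```

It can be run inside the mirror clone after step 4, or inside the fresh clone from step 7, for example with notebook_blob_counts example-repo.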