Sharing Jupyter notebooks within a team

I would like to set up a server that could support a data science team by acting as a central point for storing, versioning, sharing, and possibly also executing Jupyter notebooks.

Some desired properties:

  1. Different users can access the server and open and execute notebooks that were stored by them or by other team members. The interesting question here is what the behavior should be if user X executes cells in a notebook authored by user Y. I guess the notebook should NOT be changed.
  2. Solution should be self-hosted.
  3. Notebooks should be stored either on the server, on Google Drive, or on a self-hosted instance of ownCloud.
  4. (Bonus) Notebooks should be under Git version control (Git may be self-hosted; the solution cannot be bound to GitHub or anything of that sort).

I looked into JupyterHub and Binder. With the former, I didn't understand how to allow cross-user access. The latter seems to support only GitHub as the storage for the notebooks.

Do you have experience with either of the solutions?

Topic software-recommendation

Category Data Science


Options:

  1. Jupyter notebooks are files, so if your IT infrastructure supports it, you can make a file share available to whatever hosts your users are running Jupyter on, and ask them to configure their Jupyter to use that share for their files.
  2. Set up a JupyterHub server (we use the DockerSpawner) and make a shared volume available to your users. This assumes you have the resources to allow your users to all work on that server. If they can put real load on the server, you may want to make it scale by using Kubernetes.

Or do both.
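
As a rough sketch of option 2, the volume layout could look something like the following. The host path /srv/notebooks/shared, the container paths, and the helper name build_volumes are placeholders I made up for illustration. Mounting the shared directory read-only is just one possible answer to the "user X runs user Y's notebook" question; it is an assumption, not a requirement of JupyterHub.

```python
# Sketch of a volume layout for JupyterHub + DockerSpawner.
# All paths and the helper name are hypothetical placeholders.

def build_volumes(shared_host_dir, container_home="/home/jovyan"):
    """Return (read_write, read_only) host-to-container volume mappings."""
    # DockerSpawner substitutes "{username}" per user, so each user gets
    # a private named volume for their own work.
    read_write = {"jupyterhub-user-{username}": f"{container_home}/work"}
    # The team share is mounted read-only, so running someone else's
    # notebook cannot modify the stored copy.
    read_only = {shared_host_dir: f"{container_home}/shared"}
    return read_write, read_only


read_write, read_only = build_volumes("/srv/notebooks/shared")

# In jupyterhub_config.py these would be applied roughly as:
#   c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
#   c.DockerSpawner.volumes = read_write
#   c.DockerSpawner.read_only_volumes = read_only
```

If everyone should be able to write to the share, put the shared mapping into volumes instead of read_only_volumes. For option 1, the equivalent knob is the notebook server's root directory (c.NotebookApp.notebook_dir in the classic notebook server, c.ServerApp.root_dir in Jupyter Server).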

This doesn't provide version control, which would be a good idea, but you didn't ask for that.


Domino Data Lab offers on-premises, SaaS, and VPC-based notebook hosting (Jupyter, Zeppelin, RStudio), Git integration, scalable compute, environment templates, and a bunch of other useful things. The on-premises/VPC offerings may be overkill and too pricey if you're a small team, but the SaaS plans are pretty reasonably priced.

[ Full disclosure: I'm a former Domino employee ]


What I found is that, for data scientists, sharing notebooks is not a desirable format for communication. Many of them prefer an IDE like Spyder or RStudio, or just a text editor (I know a few data scientists who use vi).

You might just share code through your source control and data through cloud storage. That will increase flexibility.

I've recently open-sourced a tool which combines code, data, and the dependencies between data and code into a single environment and makes your data science project reproducible: DVC or dataversioncontrol.com (there is a tutorial).

With DVC you can share your project through Git and sync data to S3 with a single DVC command. If some of your data scientists decide to change the code at any stage of your project, the final result can easily be reproduced with a single command: dvc repro data/target_metrics.txt.
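
To give a feel for what "dependencies between data and code" buys you, here is a toy sketch of the general idea: a stage is re-run only when one of its inputs has changed. This is not DVC's implementation or API; the function names (file_hash, is_stale, reproduce) and the in-memory record of hashes are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_stale(deps, recorded):
    """A stage is stale if any dependency's hash differs from the recorded one."""
    return any(recorded.get(str(p)) != file_hash(p) for p in deps)


def reproduce(deps, recorded, run_stage):
    """Re-run the stage only if a dependency changed; return True if it ran."""
    if not is_stale(deps, recorded):
        return False
    run_stage()
    # Remember the dependency hashes that produced the current output.
    recorded.update({str(p): file_hash(p) for p in deps})
    return True
```

Here deps would be the code and data files a stage reads, recorded a dict of previously seen hashes, and run_stage any callable that regenerates the output. A real tool persists the recorded hashes alongside the project in Git rather than in memory.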


Isn't this solution good enough?

You can protect access with SSH, and the hosted files could be the Git repository you want, with access controlled per Linux (or whatever) user. You'll need your own server.


JupyterHub neither provides a version control system nor facilitates sharing of notebooks. You mentioned the limitation of Binder yourself.

Try Zeppelin. Version 0.7 should be released within the next few days.

  • As you can see from the roadmap, this version delivers "enterprise" features which are exactly about collaboration.
  • Version control system (git) is integrated.
  • It's self-hosted.

In essence, I think it meets all the requirements you posted. On top of that, it delivers richer visualisation capabilities and a plethora of other features (works with Shiro, Knox, Kerberos - secure Spark anyone?).


Airbnb recently open sourced their internal data science knowledge repository: https://github.com/airbnb/knowledge-repo

From its readme, it seems it could loosely fit your use case:

The Knowledge Repository project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using data formats and tools that make sense in these professions. It provides various data stores (and utilities to manage them) for "knowledge posts", with a particular focus on notebooks (R Markdown and Jupyter / iPython Notebook) to better promote reproducible research.

There's also a blog post commenting on its motivation.


The only self-hosted solution I know is the paid Anaconda Enterprise cloud setup, https://anaconda.org/about. The other solutions I am aware of are not self-hostable!
