Python Libraries, and Virtual Environments

Python has lots of packages available, only a core subset of which will be installed and available by default. It is common to have to install extra packages if you have any sort of specialised work to do. But how should you go about that? What should you do and not do?

The problem

The core problem here is that not all Python packages work well with all other packages, and it's quite possible to get your Python system in an inconsistent or excessively customised state, especially if you're using bleeding-edge versions of packages. The problem with an excessively customised Python install is that your code might work, or work correctly, only on one ‘magic’ machine, and that you can get into a state where neither your collaborators nor you are confident how to reproduce that magic. That's bad science.

That's why the short version of this advice is:

Never use pip to install a Python package system-wide

This is despite the observation that lots of online ‘How do I...?’ Python advice starts ‘Just do sudo pip install ...’. There are a few exceptions to this advice, and when you know enough about Python packages to work out what those exceptions are, you know enough to make the exceptions safely.

Possible options

Anaconda: If you're using an environment like Anaconda, there are instructions to add packages to your local Python environment. With Anaconda, if you mess things up and all else fails, at least you can uninstall Anaconda and start again.

JupyterHub: If we're thinking about JupyterHub, then the different JupyterHub ‘kernels’ are distinguished by having different sets of packages available.

If you're using a particular unusual package, then it might not be installed in the kernel you need to use. Talk to your supervisor about this, or to IT support staff, since it might be reasonable simply to add the package. However, the JupyterHub support staff will be unwilling to install a package if it's at all unusual, or might destabilise the system in any way.

Command line: If you're using Python at the command line, either exclusively or by being able to ssh to the JupyterHub machine, then you can add packages for yourself alone, with the command pip:

% pip install --user <packagename>
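To see where such a per-user install ends up, you can ask Python itself. The following is a minimal sketch using the standard-library site module; the function name user_site_info is my own, for illustration only. The point to notice is that there is one such directory per user and Python version, so every project you run with that Python shares it.

```python
import site
import sys


def user_site_info():
    """Report where 'pip install --user' packages live for this interpreter."""
    user_site = site.getusersitepackages()
    return {
        # Directory that a --user install writes into
        "user_site": user_site,
        # False or None when the per-user directory is disabled
        "enabled": bool(site.ENABLE_USER_SITE),
        # Whether this interpreter is actually searching that directory
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_info().items():
        print(f"{key}: {value}")
```

The directory may not exist, and so may not be on the search path, until you have installed something with --user.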

You'll see similar advice spread around the web, but while it works, I think this is not the best way of managing Python libraries. A better approach is to use Python's ‘virtual environments’.

Virtual Environments – setting up

The best way of doing this, not just on brutha but generally, is to use Python 'virtual environments'.

Go to your project and do

% python3 -m venv myvenv       # or whatever you want to call it
% source ./myvenv/bin/activate # changes the shell and Python paths
(myvenv) % which python
.../myvenv/bin/python
(myvenv) % which pip
.../myvenv/bin/pip
(myvenv) % python --version
Python 3.6.8
(myvenv) %

that is, you are now using a sort-of private copy of Python, which you can access by sourcing the script myvenv/bin/activate.

Do a bit more housekeeping by updating the local version of pip:

(myvenv) % pip install --upgrade pip
...
(myvenv) %

At this point you can install things using pip (note, without the --user option), such as numpy:

(myvenv) % pip install numpy

Note that I'm using numpy as an example, here – it's already installed in the standard JupyterHub kernels.

This is essentially a one-time setup, for a particular project.

Using virtual environments – command line

If you log off and log on again, or open up a different terminal window, then the Python you see (which python) will be the usual system Python; that is, you have made no system-wide changes to your Python. This is a good thing. You can activate your private version in any terminal window by going back to your project directory, and reissuing the command

% source ./myvenv/bin/activate
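If you are ever unsure which Python a terminal or script is using, you can check from inside Python as well as with which python. This is a small sketch, assuming an environment created with python3 -m venv as above; the function name in_virtual_environment is my own invention.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the Python it was built from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("in a virtual environment:", in_virtual_environment())
```

Run it once before and once after sourcing the activate script, and the two reports should differ.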

Using virtual environments – JupyterHub

You can also use a virtual environment in JupyterHub. You have to do the setup at the command line, as above. Then

% ls myvenv/lib
python3.8
% ls myvenv/lib/python3.8/site-packages
...
numpy
...
%

you can see that numpy is installed in this local-to-you virtual environment. But how do you use that within JupyterHub?

In a Jupyter notebook, insert and evaluate the following

import os.path
import sys
sys.path.insert(0, os.path.expanduser("~/myvenv/lib/python3.8/site-packages"))
sys.path

The site-packages path should point to the appropriate directory under the myvenv virtual environment that you created in the previous step. Although this example starts from the top of your home directory (that is, from the location called ‘~’), the virtual environment need not be in your top-level directory.

This adds this particular site-packages directory to the list of directories that Python will check, in this notebook only, when you import a package. Now, when you evaluate

import numpy

it will use this local-to-you virtual environment version.
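If you would rather not type the Python version into the path by hand, something like the following could go in the first cell of a notebook instead. It is only a sketch: the helper names find_site_packages and add_venv_to_path are hypothetical, and it assumes the lib/pythonX.Y/site-packages layout shown in the ls output above.

```python
import sys
from pathlib import Path


def find_site_packages(venv_dir):
    """Return the site-packages directories found under a venv."""
    venv = Path(venv_dir).expanduser()
    # Assumes the Unix-style layout: <venv>/lib/pythonX.Y/site-packages
    return sorted(venv.glob("lib/python*/site-packages"))


def add_venv_to_path(venv_dir):
    """Put a venv's site-packages at the front of sys.path, once only."""
    matches = find_site_packages(venv_dir)
    if not matches:
        raise FileNotFoundError(f"no site-packages directory under {venv_dir}")
    site_packages = str(matches[-1])
    # Re-running the notebook cell should not add a duplicate entry
    if site_packages not in sys.path:
        sys.path.insert(0, site_packages)
    return site_packages
```

In a notebook you would then call add_venv_to_path("~/myvenv") before importing anything from the environment. As far as I know, this is only reliable when the virtual environment was created with the same Python version as the kernel, since compiled packages are built for one particular version.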

Recording/freezing the configuration

Once you've got a collection of libraries which does actually work together (possibly a non-trivial exercise), then you can 'freeze' this with

(myvenv) % python -m pip freeze >requirements.txt

Save this file requirements.txt (for example, if you have a source repository, then save it in that repository). At this point you can use this file to replicate the environment on this or another machine, or delete your myvenv directory and reconstruct it, with

% python3 -m venv myvenv
% source ./myvenv/bin/activate
(myvenv) % pip install --upgrade pip
(myvenv) % python -m pip install -r requirements.txt

...and this will reinstall all of the libraries you had before, in exactly the same versions.
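If you want to confirm that an environment really does match a requirements.txt, a check along the following lines is one possibility. It is a rough sketch rather than a replacement for pip: it assumes Python 3.8 or later (for importlib.metadata), it only understands plain name==version lines of the sort pip freeze writes, and the function names are my own.

```python
from importlib import metadata


def parse_pins(lines):
    """Collect name -> version from 'name==version' lines; skip the rest."""
    pins = {}
    for raw in lines:
        # Drop trailing comments and surrounding whitespace
        line = raw.split("#", 1)[0].strip()
        # Ignore blank lines, options such as '-e ...', and unpinned entries
        if "==" not in line or line.startswith("-"):
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

With the environment activated, passing the lines of requirements.txt through parse_pins and then find_mismatches should give an empty dictionary if everything matches; anything else lists the wanted and found versions.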

This technique is slightly expensive in disk space, but it makes your work much more replicable, since you can reconstruct your exact Python environment using the requirements.txt file. Crucially, if you're collaborating with others, this means that your collaborators can set up a set of libraries identical to yours, using an identical process, and starting with only this requirements.txt file. If you are using a version control system such as Git (and you should be) then this requirements.txt should be checked in as part of your project, and its use noted in the README.

It also means that if you mess up your Python library directory (easier to do than you might think), you can simply blow away myvenv and start again from the python3 -m venv myvenv step.

This isn't a matter of installing one's own Python, simply a personal package library.

The things installed in the system Python library on brutha are a very conservative set of essential libraries: essentially only those for which there is an OS package (as opposed to a Python package). Even on my own laptop, I never install libraries in the system Python.