Python Libraries and Virtual Environments
Python has lots of packages available, only a core subset of which will be installed and available by default. It is common to have to install extra packages if you have any sort of specialised work to do. But how should you go about that? What should you do and not do?
The problem §
The core problem here is that not all Python packages work well with all other packages, and it's quite possible to get your Python system in an inconsistent or excessively customised state, especially if you're using bleeding-edge versions of packages. The problem with an excessively customised Python install is that your code might work, or work correctly, only on one ‘magic’ machine, and that you can get into a state where neither your collaborators nor you are confident how to reproduce that magic. That's bad science.
That's why the short version of this advice is:
Never use pip to install a Python package system-wide.
This is despite the fact that lots of online ‘How do I...?’ Python advice starts ‘Just do sudo pip install ...’. There are a few exceptions to this advice, and when you know enough about Python packages to work out what those exceptions are, you know enough to make the exceptions safely.
Possible options
Anaconda: If you're using an environment like Anaconda, there are instructions to add packages to your local Python environment. With Anaconda, if you mess things up and all else fails, at least you can uninstall Anaconda and start again.
JupyterHub: If we're thinking about JupyterHub, then the different JupyterHub ‘kernels’ are distinguished by having different sets of packages available.
If you're using a particular unusual package, then it might not be installed in the kernel you need to use. Talk to your supervisor about this, or IT support staff, since it might be reasonable to simply add the package. However the JupyterHub support staff will be unwilling to install a package if it's at all unusual, or might destabilise the system in any way.
Command line: If you're using Python at the command line, either exclusively or by being able to ssh to the JupyterHub machine, then you can add packages for yourself alone, with the command pip:
% pip install --user <packagename>
You'll see similar advice spread around the web, but while it works, I think this is not the best way of managing Python libraries. A better approach is to use Python's ‘virtual environments’.
Virtual Environments – setting up §
The best way of doing this, not just on brutha but generally, is to use Python 'virtual environments'.
Go to your project and do
% python3 -m venv myvenv # or whatever you want to call it
% source ./myvenv/bin/activate # changes the shell and Python paths
(myvenv) % which python
.../myvenv/bin/python
(myvenv) % which pip
.../myvenv/bin/pip
(myvenv) % python --version
Python 3.6.8
(myvenv) %
That is, you are now using a sort-of private copy of Python, which you can access by sourcing the script myvenv/bin/activate.
Do a bit more housekeeping by updating the local version of pip:
(myvenv) % pip install --upgrade pip
...
(myvenv) %
At this point you can install things using pip (note, without the --user option), such as numpy:
(myvenv) % pip install numpy
Note that I'm using numpy as an example here; it's already installed in the standard JupyterHub kernels.
This is essentially a one-time setup, for a particular project.
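If you want to confirm from inside Python (rather than with which python) that the virtual environment is the one in use, a minimal sketch along the following lines should work. It relies only on the standard sys module; the function name in_virtualenv is just an illustrative choice, and the check assumes the environment was made with python3 -m venv as above.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("in a virtual environment:", in_virtualenv())
```

Run with the environment activated, the interpreter path should end in myvenv/bin/python; run in a fresh terminal, it should be the system Python.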
Using virtual environments – command line §
If you log off and log on again, or open up a different terminal window, then the Python you see (which python) will be the usual system Python. That is, you have made no system-wide changes to your Python. This is a good thing. You can activate your private version in any terminal window by going back to your project directory, and reissuing the command
% source ./myvenv/bin/activate
Using virtual environments – JupyterHub §
You can also use a virtual environment in JupyterHub. You have to do the setup at the command line, as above. Then
% ls myvenv/lib
python3.8
% ls myvenv/lib/python3.8/site-packages
...
numpy
...
%
you can see that numpy is installed in this local-to-you virtual environment. But how do you use that within JupyterHub?
In a Jupyter notebook, insert and evaluate the following:
import os.path
import sys
sys.path.insert(0, os.path.expanduser("~/myvenv/lib/python3.8/site-packages"))
sys.path
The site-packages location is the appropriate directory under the myvenv virtual environment that you created in the previous step (use whichever pythonX.Y directory ls myvenv/lib showed you). Although this example starts from the top of your home directory (that is, from the location called ‘~’), the virtual environment need not be in your top-level directory.
This adds this particular site-packages directory to the front of the list of directories that Python will check, in this notebook only, when you import a package. Now, when you evaluate
import numpy
it will use this local-to-you virtual environment version.
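The sys.path adjustment above hard-codes the python3.8 directory name. As a hedged alternative, the sketch below works the directory name out from the running interpreter instead. The helper add_venv_site_packages is a made-up name, not part of any library, and it assumes a Unix-style venv layout (lib/pythonX.Y/site-packages) created with the same major.minor Python version as the notebook kernel.

```python
import os.path
import sys


def add_venv_site_packages(venv_dir):
    """Put a venv's site-packages directory at the front of sys.path.

    Assumes a Unix-style venv created with the same major.minor
    Python version as the interpreter running this code.
    """
    version = "python{}.{}".format(sys.version_info.major, sys.version_info.minor)
    site_packages = os.path.join(
        os.path.expanduser(venv_dir), "lib", version, "site-packages"
    )
    # Fail loudly rather than silently adding a path that does not exist.
    if not os.path.isdir(site_packages):
        raise FileNotFoundError("no such directory: " + site_packages)
    if site_packages not in sys.path:
        sys.path.insert(0, site_packages)
    return site_packages


# Typical use in a notebook cell (the path is an example):
#   add_venv_site_packages("~/myvenv")
#   import numpy
#   print(numpy.__file__)   # shows which copy of numpy was imported
```

Printing numpy.__file__ afterwards is a quick way to check that the import really came from the virtual environment and not from the kernel's own libraries.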
Recording/freezing the configuration §
Once you've got a collection of libraries which does actually work together (possibly a non-trivial exercise), then you can 'freeze' this with
(myvenv) % python -m pip freeze >requirements.txt
Save this file requirements.txt (for example, if you have a source repository, then save it in that repository). At this point you can use this file to replicate the environment on this or another machine, or delete your myvenv directory and reconstruct it, with
% python3 -m venv myvenv
% source ./myvenv/bin/activate
(myvenv) % pip install --upgrade pip
(myvenv) % python -m pip install -r requirements.txt
...and this will reinstall all of the libraries you had before, in exactly the same versions.
This technique is slightly expensive in disk space, but it makes your work much more replicable, since you can reconstruct your exact Python environment using the requirements.txt file. Crucially, if you're collaborating with others, this means that your collaborators can set up a set of libraries identical to yours, using an identical process, and starting with only this requirements.txt file. If you are using a version control system such as Git (and you should be) then this requirements.txt should be checked in as part of your project, and its use noted in the README.
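One way to check that an environment really does match a requirements.txt is to compare the pinned versions against what is installed. The following is only a sketch under stated assumptions: it needs Python 3.8 or later (for importlib.metadata), it understands only the simple name==version lines that pip freeze normally writes, and the function names parse_pins and mismatches are invented for this example.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {name: version} for each 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def mismatches(pins):
    """Return {name: (wanted, installed)} for every pin that is not satisfied.

    installed is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

With the virtual environment activated, reading requirements.txt, passing its text to parse_pins, and passing the result to mismatches should give an empty dictionary if the environment matches the file.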
It also means that if you mess up your Python library directory (easier to do than you might think), you can simply blow away myvenv and start again from the python3 -m venv myvenv step.
This isn't a matter of installing one's own Python, simply a personal package library.
The things installed in the system Python library on brutha are a very conservative set of essential libraries: essentially only those for which there is an OS package (as opposed to a Python package). Even on my own laptop, I never install libraries in the system Python.