This post shows how a non-root user can set up a deep learning environment on a remote CUDA server, using PyTorch as the example.

CUDA driver

Check the version of CUDA already installed on the server:

nvcc -V

It shows version 11.4 here, but the exact value doesn't matter much. Next, check the driver version, which determines the newest CUDA version that we can install in the conda virtual environment.

nvidia-smi

### Output
### Driver Version: 535.129.03   CUDA Version: 12.2

Tips:

The CUDA Version shown here, 12.2, is the highest CUDA version that the driver installed on the server supports. So the newest CUDA version we can install next is 12.2, while the CUDA toolkit already installed on this machine is 11.4.

Because we are a non-root user, we can't change the installed driver version, but we can install a newer CUDA toolkit inside a conda virtual environment.
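
If you want to read those two numbers from a script rather than by eye, here is a minimal sketch. The function name `parse_nvidia_smi_header` is made up for this post, and the regular expression assumes the header layout shown in the output above; other driver versions may format it differently.

```python
import re


def parse_nvidia_smi_header(text):
    """Return (driver_version, max_cuda_version) as strings, or None if not found.

    Assumes the "Driver Version: X   CUDA Version: Y" layout shown above.
    """
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# The header line from the nvidia-smi output above.
sample = "Driver Version: 535.129.03   CUDA Version: 12.2"
print(parse_nvidia_smi_header(sample))  # ('535.129.03', '12.2')
```

On the server itself, the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the hard-coded sample.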

Conda

When configuring Python-related environments as a non-root user, I strongly recommend using the conda package manager for dependency management (it is also a good choice for R users).

I'm installing Miniconda here (the minimal distribution), but you can also install Anaconda (the full distribution).

This blog has covered related content in an earlier post.

# I installed the environment with python 3.10
conda create -n torch python=3.10
conda activate torch

PyTorch

You can browse past releases at Pytorch History Release and pick a version that fits your own constraints (e.g., [Colossal-AI](https://colossalai.org/zh-Hans/docs/get_started/installation/) requires PyTorch >= 1.11 and PyTorch <= 2.1). I ended up with pytorch==2.1.0 built for CUDA 11.8.

# CUDA 11.8
# Lower than the maximum version the driver supports (12.2) but higher than the installed CUDA version (11.4)
# Try not to jump too far ahead of the installed version
# For example, with 11.4 installed on my machine, I chose cuda=11.8; installing 12.1 looked like it might cause problems.
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
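
The rule of thumb in the comments above can be written down as a small check. This is only a sketch of my own heuristic, not a requirement stated by NVIDIA or PyTorch; the names `parse_version` and `check_cuda_choice` are made up for illustration.

```python
def parse_version(text):
    """Turn a string such as "11.8" into a comparable tuple (11, 8)."""
    return tuple(int(part) for part in text.split("."))


def check_cuda_choice(candidate, installed, driver_max):
    """Return a list of warnings; an empty list means the choice fits the rule of thumb."""
    cand, inst, top = (parse_version(v) for v in (candidate, installed, driver_max))
    warnings = []
    if cand > top:
        warnings.append(f"{candidate} is newer than the driver supports ({driver_max})")
    if cand < inst:
        warnings.append(f"{candidate} is older than the installed toolkit ({installed})")
    if cand[0] != inst[0]:
        warnings.append(f"{candidate} crosses a major version from the installed {installed}")
    return warnings


print(check_cuda_choice("11.8", "11.4", "12.2"))  # []
print(check_cuda_choice("12.1", "11.4", "12.2"))  # one warning about the major version
```

With the numbers from this server, 11.8 passes all three checks, while 12.1 is only flagged for crossing from 11.x to 12.x.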

Thinking:

As for why PyTorch does not publish builds for every CUDA version: in this issue, one of PyTorch's developers explains that pytorch-cuda113 can be used on cuda114. In other words, the 11.3 build works with CUDA 11.4, so there is no need for a separate release.

pytorch-cuda

Check availability

(torch) [username/pwd]$ python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> exit()
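
For a slightly fuller check than `is_available()` alone, the following sketch also reports which CUDA version the PyTorch build targets and the name of the first GPU. The helper name `describe_torch_cuda` is my own. It returns early if `torch` cannot be imported, so it is safe to run in any environment.

```python
def describe_torch_cuda():
    """Collect basic facts about the installed PyTorch build and its view of CUDA."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_0"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch_cuda().items():
    print(f"{key}: {value}")
```

In the environment built above, I would expect `built_for_cuda` to read 11.8, matching the `pytorch-cuda=11.8` pin.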

CUDA

On Nvidia-Cuda, look for the CUDA version that matches the installed PyTorch build, in my case 11.8.

# When downloading packages from the nvidia channel, the progress bar does not update.
# It only shows 0% or 100%, but the download speed is fine and the channel is not blocked, so just wait patiently.
conda install nvidia/label/cuda-11.8.0::cuda

At this point, running nvcc -V shows that the installed version is now 11.8.

nvcc -V

Cuda installed using conda
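
To confirm from a script that the toolkit matches the PyTorch build, one option is to pull the release number out of the `nvcc -V` text and compare it with the expected version. This is a sketch: `parse_nvcc_release` and `nvcc_matches` are hypothetical helpers, and the sample line only illustrates the usual "release X.Y" wording, so check it against your own output.

```python
import re


def parse_nvcc_release(text):
    """Return the "X.Y" release found in nvcc -V output, or None if absent."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def nvcc_matches(nvcc_output, expected):
    """True when the toolkit release equals the CUDA version PyTorch was built for."""
    return parse_nvcc_release(nvcc_output) == expected


# Illustrative line in the style nvcc prints; not copied from this server.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))    # 11.8
print(nvcc_matches(sample, "11.8"))  # True
```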

# Without this, programs that cannot find the CUDA (11.8) installed by conda may fall back to the system CUDA (11.4).
export CUDA_HOME=$CONDA_PREFIX
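
Since the export only takes effect in the shell where it was run, it can be worth checking that the variable is really set as intended before building anything. Here is a minimal sketch; `check_cuda_home` is a name I made up, and it only compares the two environment variables without inspecting the files underneath.

```python
import os


def check_cuda_home(env):
    """Return a list of problems with CUDA_HOME relative to the active conda environment."""
    problems = []
    conda_prefix = env.get("CONDA_PREFIX")
    cuda_home = env.get("CUDA_HOME")
    if not conda_prefix:
        problems.append("CONDA_PREFIX is not set; is a conda environment active?")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif conda_prefix and os.path.normpath(cuda_home) != os.path.normpath(conda_prefix):
        problems.append(f"CUDA_HOME points to {cuda_home}, not {conda_prefix}")
    return problems


# An empty list means CUDA_HOME matches the active environment.
print(check_cuda_home(dict(os.environ)))
```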

Extra

gcc clang g++ clang++

# The gcc that ships with the server is too old and doesn't even come with g++. Installing the compilers with conda works right away.
conda install -c conda-forge cxx-compiler
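
To see which compilers a build would actually pick up, the sketch below reads the `CC` and `CXX` environment variables and resolves them on `PATH`. My assumption is that the conda-forge compiler packages set these variables when the environment is activated, so re-activate the environment after installing. `compiler_report` is a hypothetical helper, and the `gcc`/`g++` fallbacks are my own choice.

```python
import os
import shutil


def compiler_report(env, which=shutil.which):
    """Map CC/CXX to the command in use and the path it resolves to (None if not found)."""
    report = {}
    for var, fallback in (("CC", "gcc"), ("CXX", "g++")):
        command = env.get(var) or fallback
        report[var] = {"command": command, "path": which(command)}
    return report


for var, details in compiler_report(dict(os.environ)).items():
    print(var, details["command"], "->", details["path"])
```

If the resolved paths sit under `$CONDA_PREFIX`, the conda compilers are the ones in use; a path such as `/usr/bin/g++` means the system compiler is still being picked up.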

xformers

# Command suggested by the official repository
conda install xformers -c xformers

# Be sure to double-check the list of package changes before confirming the installation.
# It may try to replace the installed version of pytorch; if it does, do not proceed.
# Adding -c nvidia -c pytorch should avoid this.
conda install xformers -c xformers -c nvidia -c pytorch
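
After an install like this, a quick way to confirm that the pinned packages were not replaced is to compare their installed versions against the pins. This is a sketch: `installed_version` and `changed_pins` are names invented here, the pins are the ones used earlier in this post, and I am assuming the conda `pytorch` package registers itself under the distribution name `torch`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def changed_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that no longer matches."""
    changed = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Accept local build suffixes such as "2.1.0+cu118".
        matches = found is not None and (found == expected or found.startswith(expected + "+"))
        if not matches:
            changed[package] = (expected, found)
    return changed


pins = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}
print(changed_pins(pins))  # an empty dict means nothing was replaced
```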

transformers & datasets & accelerate

# The versions available through conda may not be the latest, so install these with pip
pip install transformers datasets accelerate -U

Reference

Nvidia-Cuda

Pytorch History Release

Pytorch Github Issue