Test code
$ python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.backends.cudnn.version()); print(torch.cuda.is_available()); print(torch.randn(1).cuda())"
2025-01-03
Hit this error: lib/python3.11/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent
The Google results are all from the last few months, so this looks like a problem that only appeared recently.
Currently following 🔗 [Installing PyTorch (CUDA 11.8) and PyTorch3D on Python 3.11. | by Gaurav Yadav | Medium] https://pro2017001.medium.com/installing-pytorch-cuda-11-8-and-pytorch-on-python-3-11-1fe872f29368
(This tutorial turned out to work. Its core content is already reflected in the updated commands later in this note.)
In a Python 3.12 environment I hit AssertionError: Torch not compiled with CUDA enabled
Not worth chasing: it is either a Python 3.12 issue or the result of earlier dependencies getting tangled together, so I started a clean environment to check.
In a new Python 3.11 environment I hit RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
It turned out nvidia-smi was broken as well: Failed to initialize NVML: Driver/library version mismatch
A reboot fixed it.
1. PyTorch hammers the CPU and does not touch the GPU at all
Calling torch.cuda.is_available() returns False, which means CUDA is not set up correctly.
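A slightly more readable version of the check above, as a rough sketch. The function name cuda_report and the dictionary keys are made up for this note; the torch calls are the same ones used in the test one-liner, plus torch.cuda.get_device_name. The import is wrapped in a try block on the assumption that failures such as the undefined symbol error show up when torch is imported.

```python
import importlib


def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup into a dict.

    The function name and the keys are hypothetical, chosen for this note.
    """
    report = {"import_ok": False}
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError) as exc:
        # Assumption: a broken shared library (e.g. an undefined symbol in
        # libtorch_cpu.so) surfaces as an exception at import time.
        report["import_error"] = str(exc)
        return report

    report["import_ok"] = True
    report["torch_version"] = torch.__version__
    # None here means the installed build is CPU-only.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["cudnn_version"] = torch.backends.cudnn.version()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda is None, the install is a CPU-only build; if it is set but cuda_available is False, the driver side (nvidia-smi) is the next thing to look at.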
2. Install a correct environment with conda
$ nvcc --version
This reports version 11.5.
Create a new Python 3.11 environment.
Then use the command from 🔗 [PyTorch installation with GPU support on Ubuntu - PyTorch Forums] https://discuss.pytorch.org/t/pytorch-installation-with-gpu-support-on-ubuntu/196350
# 2026-01-03: use this one
$ conda install -y "mkl<2024.1" "intel-openmp<2024.1" pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# The previous one
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Once the install finished, torch.cuda.is_available() returned True.
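3. Make the model use the GPU
device = torch.device("cuda")
model = model.to(device)  # same effect as model.cuda()
4. During training, move each batch's X and y onto the GPU as well (otherwise you get the error Expected all tensors to be on the same device, but found at least two devices...)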
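# original code
for X, y in train_data:
    ...  # training step
# new code
device = torch.device("cuda")
for X, y in train_data:
    X, y = X.to(device), y.to(device)
    ...  # training step, unchanged
5. If the GPU-trained model is then used to run test_data, the original CPU code has to change too, and the changes are fairly large. Feeding the whole of test_data through the GPU model in one go can run out of GPU memory; in that case split it by batch_size and append the per-batch results together afterwards. In practice, for most cases a CPU model is enough for test_data: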
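# Note: nn.Module.to() moves the module in place and returns it, so gpu_model itself is on the CPU after this line.
cpu_model = gpu_model.to('cpu')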
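For the case where test_data does need to go through the GPU, the batch_size splitting described in step 5 could look roughly like this. It is only a sketch: predict_in_batches is a name made up here, and it assumes the test inputs are a single tensor whose first dimension is the sample index (a DataLoader would need a different loop). It falls back to the CPU when CUDA is not available.

```python
import torch
from torch import nn


def predict_in_batches(model, inputs, batch_size=256, device=None):
    """Run `model` over `inputs` in chunks and return all outputs on the CPU.

    Hypothetical helper: assumes `inputs` is one tensor indexed by sample.
    """
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()  # switch off dropout / batch-norm updates for inference
    outputs = []
    with torch.no_grad():  # no gradients needed, which also saves memory
        for start in range(0, inputs.shape[0], batch_size):
            batch = inputs[start:start + batch_size].to(device)
            # Move each result back to the CPU so GPU memory is freed per batch.
            outputs.append(model(batch).cpu())
    return torch.cat(outputs, dim=0)


if __name__ == "__main__":
    toy_model = nn.Linear(4, 2)        # stand-in for the trained model
    test_inputs = torch.randn(1000, 4)  # stand-in for test_data
    predictions = predict_in_batches(toy_model, test_inputs, batch_size=128)
    print(predictions.shape)
```

Only one batch lives on the GPU at a time, so batch_size can be lowered until the out-of-memory error goes away.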