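Introduction
There are currently three ways to use a GPU under virtualization:
- Passthrough: the benefit is obvious, since the GPU is assigned directly to a virtual machine. However, a card can be passed through to only one machine at a time (exclusive use), and moving the GPU between virtual machines is cumbersome.
- vGPU: one GPU can be split into several virtual GPUs shared by multiple virtual machines, but only high-end cards support vGPU, which is beyond my budget.
- docker-gpu: the more convenient option. Docker can assign GPUs to containers, containers start much faster than virtual machines, and building a deep learning environment with Docker is comparatively easy.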
Environment setup
Base environment
- Ubuntu 20.04.4 desktop
- GeForce GTX 860M
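Because of driver issues, the desktop edition causes more trouble than the server edition. Since graphics cards are expensive, I am using the 860M for some initial tests. All commands below are run as root.
Setup steps
Update the kernel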
sed -i 's;://\([^/]*\)/;://mirrors.ustc.edu.cn/;' /etc/apt/sources.list;
apt-get update;
apt-get upgrade -y;
The purpose of this step is to update the kernel version and avoid some unknown errors. It can be skipped.
Remove the existing driver
sudo apt-get purge 'nvidia*'
Blacklist the nouveau driver
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
Add the following lines:
blacklist nouveau
options nouveau modeset=0
Then run:
sudo update-initramfs -u
reboot
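After the reboot it is worth confirming that nouveau is really gone before installing the NVIDIA driver. The sketch below is one way to check, assuming a Linux host where /proc/modules lists the loaded kernel modules. The helper names `loaded_modules` and `nouveau_is_loaded` and the sample text are made up for illustration.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text."""
    # Each non-empty line starts with the module name, followed by
    # size, use count, dependencies, state and address.
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def nouveau_is_loaded(proc_modules_path="/proc/modules"):
    """True if nouveau is currently loaded; False if not, or if the file is absent."""
    path = Path(proc_modules_path)
    if not path.exists():
        return False  # not a Linux host, nothing to check
    return "nouveau" in loaded_modules(path.read_text())


# Made-up sample of what /proc/modules looks like while nouveau is still loaded
sample = (
    "nouveau 1949696 1 - Live 0x0000000000000000\n"
    "video 49152 1 nouveau, Live 0x0000000000000000\n"
)
print("nouveau" in loaded_modules(sample))  # True
```

On the real machine, `nouveau_is_loaded()` should return False once the blacklist has taken effect.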
Install the driver dependencies
apt-get install -y vim gcc make
Download the driver from https://www.nvidia.cn/ and install it.
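Once the driver is installed, `nvidia-smi` should be on the PATH and able to report the card. The following is a minimal sketch of a check, assuming the driver installed cleanly. The function names are hypothetical, and the driver version in the sample string is invented purely to show the parsing.

```python
import shutil
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, driver_version' rows into (name, driver_version) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if "," in line:
            name, driver = [field.strip() for field in line.split(",", 1)]
            rows.append((name, driver))
    return rows


def query_gpus():
    """Ask nvidia-smi for GPU name and driver version; [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    return parse_gpu_query(result.stdout) if result.returncode == 0 else []


# Invented sample row (the version number is not from a real run)
sample = "NVIDIA GeForce GTX 860M, 470.129.06\n"
print(parse_gpu_query(sample))
```

Calling `query_gpus()` on the host returns an empty list when the driver is not working, which is a quick signal to fix the driver before moving on to Docker.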
Install docker-ce
export want_os=ubuntu
sudo apt-get update;
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common gnupg2;
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/$want_os/gpg | sudo apt-key add - ;
sudo add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/$want_os $(lsb_release -cs) stable";
sudo apt-get -y update;
sudo apt-get -y install docker-ce;
sudo systemctl start docker;
sudo systemctl enable docker;
Switch to another mirror here if needed; the Aliyun mirror can be slow.
Install nvidia-container-toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
apt-get upgrade
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
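Before running any real workload, a quick smoke test can confirm that containers actually see the GPU. The sketch below only builds the command line. It assumes the chosen image exposes `nvidia-smi` inside the container, which is typical for CUDA-based images but is not verified here. The function name `gpu_smoke_test_command` is hypothetical.

```python
import shlex


def gpu_smoke_test_command(image, gpus="all"):
    """Build a throwaway `docker run` that prints the GPUs visible in the container."""
    # --rm removes the container afterwards; --gpus selects the devices to expose
    return ["docker", "run", "--rm", f"--gpus={gpus}", image, "nvidia-smi"]


cmd = gpu_smoke_test_command("pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)`. If the toolkit is set up correctly, the output should list the host's card.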
Testing
Test code
import numpy as np
import torch
import torch.nn as nn

# Report the GPU that PyTorch sees inside the container
print(torch.cuda.current_device())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.cuda.is_available())

# Training data for y = 2x + 1, each array has shape (11, 1)
x_values = [i for i in range(11)]
x_train = np.array(x_values, dtype=np.float32).reshape(-1, 1)
y_values = [2 * i + 1 for i in x_values]
y_train = np.array(y_values, dtype=np.float32).reshape(-1, 1)


class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        out = self.linear(x)
        return out


input_dim = 1
output_dim = 1
model = LinearRegressionModel(input_dim, output_dim)

# Use the GPU when it is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.MSELoss()
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 1000
for epoch in range(1, epochs + 1):
    # Move the data to the same device as the model
    inputs = torch.from_numpy(x_train).to(device)
    labels = torch.from_numpy(y_train).to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print('epoch {}, loss {}'.format(epoch, loss.item()))
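To make clear what the training loop computes, here is a dependency-free sketch of the same full-batch gradient descent on y = 2x + 1 with an MSE loss. It is not the PyTorch code: the gradients are written out by hand, and the weights start at zero as an assumption, whereas `nn.Linear` initializes them randomly.

```python
# Same data as the test script: x = 0..10, y = 2x + 1
xs = [float(i) for i in range(11)]
ys = [2 * x + 1 for x in xs]


def train(xs, ys, lr=0.01, epochs=1000):
    """Full-batch gradient descent on MSE for the model y = w * x + b."""
    w, b = 0.0, 0.0  # assumption: zero init instead of nn.Linear's random init
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w*x + b - y)^2) with respect to w and b
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


w, b, loss = train(xs, ys)
print(w, b, loss)
```

With the same learning rate and epoch count as the test script, the fitted values end up close to w = 2 and b = 1, which is what the GPU run is expected to learn as well.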
Verify with two different images. The containers need distinct names to run side by side:
docker run -d -it --name="cuda113" --gpus=all pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
docker run -d -it --name="cuda100" --gpus=all pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel
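The containers are started detached, so the test script still has to be copied in and executed. A rough sketch of those two steps follows. The container name `my-container`, the script name `test.py` and the destination `/tmp/test.py` are all placeholders, and it assumes `python` is on the PATH inside the image.

```python
import shlex


def run_script_commands(container, script="test.py", dest="/tmp/test.py"):
    """Build the docker commands that copy a script into a container and run it."""
    return [
        ["docker", "cp", script, f"{container}:{dest}"],
        ["docker", "exec", container, "python", dest],
    ]


# Placeholder container name; substitute the name passed to --name
for command in run_script_commands("my-container"):
    print(shlex.join(command))
```

Running the printed commands once per container gives one set of output for each image.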
Result 1
0
1
NVIDIA GeForce GTX 860M
True
epoch 50, loss 0.024971697479486465
epoch 100, loss 0.014242921955883503
epoch 150, loss 0.008123651146888733
epoch 200, loss 0.004633431322872639
epoch 250, loss 0.002642718842253089
epoch 300, loss 0.0015073101967573166
epoch 350, loss 0.0008597124251537025
epoch 400, loss 0.0004903504741378129
epoch 450, loss 0.00027967619826085865
epoch 500, loss 0.00015952263493090868
epoch 550, loss 9.098430018639192e-05
epoch 600, loss 5.1895622164011e-05
epoch 650, loss 2.959909033961594e-05
epoch 700, loss 1.688160773483105e-05
epoch 750, loss 9.62934791459702e-06
epoch 800, loss 5.491684987646295e-06
epoch 850, loss 3.13294117404439e-06
epoch 900, loss 1.7874630202641129e-06
epoch 950, loss 1.0191258752456633e-06
epoch 1000, loss 5.815898020955501e-07
Result 2
0
1
NVIDIA GeForce GTX 860M
True
epoch 50, loss 0.12959100306034088
epoch 100, loss 0.07391376048326492
epoch 150, loss 0.04215765744447708
epoch 200, loss 0.024045169353485107
epoch 250, loss 0.013714421540498734
epoch 300, loss 0.00782221183180809
epoch 350, loss 0.00446149380877614
epoch 400, loss 0.0025446717627346516
epoch 450, loss 0.0014513888163492084
epoch 500, loss 0.0008278193417936563
epoch 550, loss 0.0004721507430076599
epoch 600, loss 0.00026930044987238944
epoch 650, loss 0.0001535998162580654
epoch 700, loss 8.76076373970136e-05
epoch 750, loss 4.996786447009072e-05
epoch 800, loss 2.8499058316810988e-05
epoch 850, loss 1.625478034839034e-05
epoch 900, loss 9.272713214159012e-06
epoch 950, loss 5.288254669721937e-06
epoch 1000, loss 3.0164499094098574e-06
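The two runs start from different random weights, so the absolute loss values differ, but the rate of convergence can still be compared. The sketch below parses the `epoch N, loss X` lines and computes the ratio between consecutive reported losses. The helper names are made up, and the sample is the first three lines of Result 1.

```python
import re

LOG_LINE = re.compile(r"epoch (\d+), loss ([0-9.eE+-]+)")


def parse_losses(log_text):
    """Extract (epoch, loss) pairs from the training output."""
    return [(int(m.group(1)), float(m.group(2))) for m in LOG_LINE.finditer(log_text)]


def decay_ratios(losses):
    """Ratio of each reported loss to the one reported before it."""
    values = [loss for _, loss in losses]
    return [later / earlier for earlier, later in zip(values, values[1:])]


sample = """epoch 50, loss 0.024971697479486465
epoch 100, loss 0.014242921955883503
epoch 150, loss 0.008123651146888733"""

ratios = decay_ratios(parse_losses(sample))
print([round(r, 2) for r in ratios])  # [0.57, 0.57]
```

A near-constant ratio means the loss shrinks geometrically, as expected for gradient descent on a quadratic loss. Feeding both result listings through the same functions shows whether the two images converge at the same rate.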
Both images work, which makes it very convenient to switch between deep learning environments.