Setting Up a Neural Network Environment from Scratch

2023-04-01 cnwinds

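This document records the process of setting up the environment from scratch, focusing on the problems encountered along the way and how they were solved.
Version numbers are given for the installed software wherever possible, because many problems are caused by mismatched versions.
The operating system used is Windows 10.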

Install Python 3.7 (omitted)

Install pip (omitted)

Install Jupyter Notebook

pip install jupyter==1.0.0

Start it with:

jupyter notebook

Install TensorFlow

pip install tensorflow-gpu==2.10.1 pandas

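GPU support

This requires a CUDA-capable NVIDIA graphics card. The main work is installing CUDA and cuDNN. You can download whichever versions are current when you visit the sites, as long as the two versions match each other. The versions I installed here worked with tensorflow 2.10.1 on my machine. Note that TensorFlow's published tested build configurations list CUDA 11.2 and cuDNN 8.1 for TensorFlow 2.10, so if a newer pairing fails to load its libraries, those versions are the safer fallback.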
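Because mismatched versions are the usual culprit, it can help to ask the installed wheel which CUDA and cuDNN versions it was built against. The sketch below is not part of the original setup: `tf_build_info` is a hypothetical helper name, and it relies on `tf.sysconfig.get_build_info()`, whose CUDA-related keys may be missing on CPU-only builds (hence the `.get` calls).

```python
def tf_build_info():
    """Return the TensorFlow version and the CUDA/cuDNN versions it was built for.

    Returns None if TensorFlow cannot be imported.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "is_cuda_build": info.get("is_cuda_build"),
        "cuda": info.get("cuda_version"),
        "cudnn": info.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(tf_build_info())
```

Comparing the printed values with the CUDA and cuDNN versions you actually installed is a quick first check before digging into DLL errors.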

Install CUDA 12.1

https://developer.nvidia.com/cuda-downloads
The default installation directory is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1

Download cuDNN 8.8.1 for CUDA 12.x

https://developer.nvidia.com/rdp/cudnn-download
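The cuDNN version must match the CUDA version. Copy the contents of the extracted bin directory into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin. In practice this just means placing all of the cuDNN dll files in CUDA's bin directory.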
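The copy step above can also be scripted. The following is a minimal sketch rather than the author's procedure: `copy_dlls` is a hypothetical helper, the cuDNN source path is a placeholder you must replace, and writing under "Program Files" needs an administrator prompt.

```python
import shutil
from pathlib import Path


def copy_dlls(src_dir, dst_dir):
    """Copy every *.dll file from src_dir into dst_dir and return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for dll in sorted(src.glob("*.dll")):
        shutil.copy2(str(dll), str(dst / dll.name))
        copied.append(dll.name)
    return copied


if __name__ == "__main__":
    # Hypothetical location: point this at the bin directory of your
    # extracted cuDNN archive.
    cudnn_bin = Path(r"C:\path\to\extracted\cudnn\bin")
    cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin")
    if cudnn_bin.is_dir() and cuda_bin.is_dir():
        print(copy_dlls(cudnn_bin, cuda_bin))
```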

Troubleshooting

After starting jupyter notebook, this error appears: ImportError: cannot import name 'soft_unicode' from 'markupsafe'

Fix:

pip install MarkupSafe==2.0.1
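To confirm that the pin took effect, you can retry the import named in the error message. `has_soft_unicode` is a hypothetical helper name; the background is that `soft_unicode` was removed from MarkupSafe in 2.1.0, which is why pinning 2.0.1 restores it.

```python
def has_soft_unicode():
    """Return True if the installed MarkupSafe still exports soft_unicode.

    Also returns False when MarkupSafe is not installed at all.
    """
    try:
        from markupsafe import soft_unicode  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("soft_unicode available:", has_soft_unicode())
```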

Various DLLs fail to load.

Fix:
1. Copy all of the dll files in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin into C:\Windows\System32
2. Copy all of the dll files in the bin directory of the cuDNN archive into C:\Windows\System32
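Before copying files around, it may be worth checking whether the DLL named in the error message is already reachable through PATH. The sketch below is an illustration only: `find_on_path` is a hypothetical helper, and `cudnn64_8.dll` is just an example name (it is what cuDNN 8.x ships); use the names from your own error output. If a DLL is missing, adding the CUDA bin directory to PATH is a possible alternative to copying into System32.

```python
import os
from pathlib import Path


def find_on_path(dll_names, search_path=None):
    """Map each DLL name to the first PATH directory containing it, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    dirs = [Path(p) for p in search_path.split(os.pathsep) if p]
    found = {}
    for name in dll_names:
        found[name] = next((str(d) for d in dirs if (d / name).is_file()), None)
    return found


if __name__ == "__main__":
    # Replace this list with the DLL names from your own error messages.
    for name, where in find_on_path(["cudnn64_8.dll"]).items():
        print(name, "->", where or "NOT FOUND on PATH")
```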

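This error appears: error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice

Fix:
Set the system environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1"
Note that the quotation marks must not be omitted, because the path contains spaces.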
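If you would rather not change system-wide settings, the same variable can be set from Python for a single session. This is a sketch under assumptions: `xla_flags_for` and `libdevice_dir` are hypothetical helpers, and it assumes the variable is set before TensorFlow is imported in that process.

```python
import os
from pathlib import Path


def xla_flags_for(cuda_dir):
    """Build the XLA_FLAGS value, quoting the path because it may contain spaces."""
    return '--xla_gpu_cuda_data_dir="{}"'.format(Path(cuda_dir).as_posix())


def libdevice_dir(cuda_dir):
    """Return the directory that the error message reports as missing."""
    return Path(cuda_dir) / "nvvm" / "libdevice"


if __name__ == "__main__":
    cuda_dir = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1"
    if not libdevice_dir(cuda_dir).is_dir():
        print("Warning: no nvvm/libdevice under", cuda_dir)
    # Assumption: this runs before `import tensorflow` in the same process.
    os.environ["XLA_FLAGS"] = xla_flags_for(cuda_dir)
    print(os.environ["XLA_FLAGS"])
```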

Verifying that the GPU works

Detecting the GPU

import tensorflow as tf
device_gpu = tf.config.list_physical_devices('GPU')
print(device_gpu)

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If the printed list is not empty, the GPU was detected successfully.

Running a computation on the GPU

import time
from contextlib import contextmanager

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(gpus)


@contextmanager
def timer():
    # Print the wall-clock time spent inside the `with` block.
    start_time = time.perf_counter()
    yield
    end_time = time.perf_counter()
    print(f"Code block took {end_time - start_time:.6f} s to run")


# Matrix multiplication on the GPU.
with timer():
    with tf.device("/gpu:0"):
        tf.random.set_seed(0)
        a = tf.random.uniform((10000, 100), minval=0, maxval=3.0)
        b = tf.random.uniform((100, 100000), minval=0, maxval=3.0)
        c = a @ b
        # Printing the sum forces the product to be computed.
        tf.print(tf.reduce_sum(tf.reduce_sum(c, axis=0), axis=0))


# The same computation on the CPU, for comparison.
with timer():
    with tf.device("/cpu:0"):
        tf.random.set_seed(0)
        a = tf.random.uniform((10000, 100), minval=0, maxval=3.0)
        b = tf.random.uniform((100, 100000), minval=0, maxval=3.0)
        c = a @ b
        tf.print(tf.reduce_sum(tf.reduce_sum(c, axis=0), axis=0))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2.24953778e+11
Code block took 0.443041 s to run
2.24953778e+11
Code block took 5.125858 s to run

If this code runs without errors, your environment is ready.
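One caveat when reading the timings above: the first operation on a GPU also pays one-time initialization costs, so a single run per device gives only a rough comparison. A minimal stdlib-only sketch of a steadier measurement follows. `benchmark` and its `warmup` and `repeats` parameters are hypothetical names, and the callable you pass should force its result (for example by printing or converting it) so the timer covers the whole computation.

```python
import time


def benchmark(fn, warmup=1, repeats=5):
    """Call fn `warmup` times untimed, then return the best of `repeats` timed runs (seconds)."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in workload; swap in a function that runs the matrix product
    # under tf.device("/gpu:0") or tf.device("/cpu:0").
    print(f"best of 5: {benchmark(lambda: sum(range(100000))):.6f} s")
```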

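GPU-related command-line tools

Command	Description
nvidia-smi	Monitors GPU usage and can change GPU state. See the article "GPU之nvidia-smi命令详解" (a detailed guide to the nvidia-smi command).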
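For monitoring from a script or notebook, nvidia-smi can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below is illustrative: `parse_gpu_csv`, `query_gpus`, and the chosen field list are assumptions of this example, and it returns an empty list when nvidia-smi is not on PATH.

```python
import shutil
import subprocess

FIELDS = ["name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows (no header, no units) into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(FIELDS):
            rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Running this while the matrix multiplication test executes should show memory use and utilization rising on the GPU that TensorFlow selected.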