Cambricon PyTorch User Guide (Based on MLU270)

2022-03-08  Mr_Michael

I. Introduction

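Cambricon PyTorch uses the device extension interface that PyTorch itself provides to dynamically register the operators contained in the MLU backend library into PyTorch. The MLU backend library handles tensor and neural-network operator computation on the MLU. Cambricon PyTorch implements a number of common neural-network operators on the MLU backend on top of the CNML library, and also handles some data-copy operations.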

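Cambricon PyTorch is compatible with the native PyTorch Python API and with native PyTorch network models. It supports three ways of running inference (online layer-by-layer, online fusion, and offline) and also supports online layer-by-layer training. Network weights can be read from pth files; the supported classification and detection network structures are managed by torchvision and can be loaded from it. For inference, Cambricon PyTorch supports float16 and float32 network models, and also runs int8 and int16 network models efficiently on Cambricon machine-learning processors. For training, it supports float32 and adaptively quantized network models.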

To make MLU devices easy to use from the Torch modules, Cambricon PyTorch extends the PyTorch backend as follows:

1. MLU basics

Cambricon PyTorch is mainly used for model inference and model training.

Cambricon provides two operator libraries, CNML and CNNL.


2. MLU inference

1) Networks supported for inference

Category: Networks
Classification networks: PreActResNet50, PreActResNet101, inception_v3, vgg16, mobilenet, mobilenet_v2, mobilenet_v3, googlenet, densenet121, squeezenet1_1, resnet18, resnet34, resnet50, resnet101, resnet152, efficientnet, resnext50_32x4d, resnext101_32x8d
Detection networks: SSD, SSD MobileNet v1, SSD MobileNet v2, YOLOv2, YOLOv3, YOLOv4, YOLOv5, EAST, MTCNN, Faster-RCNN (fpn), centernet
Segmentation and super-resolution networks: FCN8s, SegNet, VDSR, FSRCNN
Other networks: BERT

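2) Inference modes

3) Operator support

Cambricon PyTorch keeps the online layer-by-layer mode (meaning layers with MLU support added), the online fusion mode, and the offline mode in sync: any operator that has been implemented can be used in any of the three modes.

Cambricon PyTorch currently supports the following operators: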

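II. Using the Neuware SDK

The Neuware SDK can be downloaded as needed after registering an account on the Cambricon developer community, or from the designated path of the dedicated FTP account that Cambricon provides.

Runtime environment

Hardware environment preparation

1. Downloading the Neuware SDK

Download the required packages using the FTP link, account, and password provided by Cambricon. This article uses an MLU270-S4 accelerator card, Ubuntu 18.04, and Neuware SDK 1.7.604 as the example.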

wget -r -nH -P./ ftp://download.cambricon.com:8821/product/GJD/MLU270/1.7.604/Ubuntu18.04/* --ftp-user="username" --ftp-password="password"
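The same download can be scripted when several FTP paths have to be fetched. The following is a minimal sketch, not an official tool: the helper name build_wget_command is hypothetical, and the host, port, and flags are simply copied from the command above.

```python
from typing import List

# Host and port copied from the wget command in this article.
FTP_HOST = "download.cambricon.com"
FTP_PORT = 8821


def build_wget_command(remote_dir: str, user: str, password: str,
                       dest: str = "./") -> List[str]:
    """Return the argv list for a recursive FTP download of remote_dir (hypothetical helper)."""
    url = f"ftp://{FTP_HOST}:{FTP_PORT}/{remote_dir.strip('/')}/*"
    return [
        "wget", "-r", "-nH", f"-P{dest}", url,
        f"--ftp-user={user}", f"--ftp-password={password}",
    ]


# Placeholder credentials; replace them with the account issued by Cambricon.
cmd = build_wget_command("product/GJD/MLU270/1.7.604/Ubuntu18.04", "username", "password")
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems, since the trailing * reaches wget unchanged instead of being expanded by the shell.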

1) Neuware SDK directory layout

$ tree -L 3 
├── CNCL
│   ├── cncl_0.8.0-1.ubuntu18.04_amd64.deb
├── CNCV
│   ├── cncv_0.4.602-1.ubuntu18.04_amd64.deb
│   ├── cncv_0.4.702-1.ubuntu18.04_arm64.deb
├── CNML
│   ├── cnml_7.10.3-1.ubuntu18.04_amd64.deb
├── CNNL
│   ├── cnnl_1.3.0-1.ubuntu18.04_amd64.deb
├── CNPlugin
│   ├── Cambricon-CNPlugin-MLU270.tar.gz
│   ├── cnplugin_1.12.4-1.ubuntu18.04_amd64.deb
├── CNToolkit
│   ├── cntoolkit_1.7.5-1.ubuntu18.04_amd64.deb
│   ├── cntoolkit_1.7.5-1.ubuntu18.04_arm64.deb
├── Driver  # MLU270 driver
│   ├── neuware-mlu270-driver-aarch64-4.9.8.tar.gz
│   ├── neuware-mlu270-driver-dkms_4.9.8_all.deb
└── PyTorch
    ├── docker
    │   ├── pytorch-0.15.604-ubuntu18.04.tar.gz # Docker image with PyTorch and Catch installed
    ├── src
    │   ├── pytorch-v0.15.604.tar.gz    # source code for pytorch, pytorch_mlu, and torchvision !!!
    └── wheel   # built from pytorch-v0.15.604.tar.gz
        ├── torch-1.3.0a0+b8d5360-cp27-cp27mu-linux_x86_64.whl
        ├── torch-1.3.0a0+b8d5360-cp36-cp36m-linux_x86_64.whl   # built from pytorch in the Cambricon PyTorch SDK
        ├── torch_mlu-0.15.0.post1-cp27-cp27mu-linux_x86_64.whl
        ├── torch_mlu-0.15.0.post1-cp36-cp36m-linux_x86_64.whl  # built from catch in the Cambricon PyTorch SDK
        ├── torchvision-0.2.1-cp27-cp27mu-linux_x86_64.whl
        ├── torchvision-0.2.1-cp36-cp36m-linux_x86_64.whl   # built from vision in the Cambricon PyTorch SDK
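The wheel directory ships each package for both cp27 and cp36, so the right file has to be picked for the interpreter in use. As a rough sketch (the helper names are hypothetical, and the filenames are assumed to carry no optional build tag), the wheel filename fields can be split like this:

```python
from typing import Dict, List, Optional

WHEELS = [
    "torch-1.3.0a0+b8d5360-cp27-cp27mu-linux_x86_64.whl",
    "torch-1.3.0a0+b8d5360-cp36-cp36m-linux_x86_64.whl",
    "torch_mlu-0.15.0.post1-cp27-cp27mu-linux_x86_64.whl",
    "torch_mlu-0.15.0.post1-cp36-cp36m-linux_x86_64.whl",
]


def parse_wheel_name(filename: str) -> Dict[str, str]:
    """Split name-version-python-abi-platform.whl into its five fields."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {"name": name, "version": version, "python": python_tag,
            "abi": abi_tag, "platform": platform_tag}


def pick_wheel(filenames: List[str], name: str, python_tag: str) -> Optional[str]:
    """Return the first wheel matching the package name and Python tag, or None."""
    for filename in filenames:
        fields = parse_wheel_name(filename)
        if fields["name"] == name and fields["python"] == python_tag:
            return filename
    return None


print(pick_wheel(WHEELS, "torch_mlu", "cp36"))
```

The tag for the running interpreter can be derived with f"cp{sys.version_info.major}{sys.version_info.minor}" if the selection should follow the active environment.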

2. About the pytorch-v0.15.604.tar.gz source package

PyTorch/src/pytorch-v0.15.604.tar.gz contains the source code used to build pytorch, pytorch_mlu, and torchvision. Extracting it yields the cambricon_pytorch folder.

/opt/work/cambricon_pytorch $ tree -L 4
├── configure_pytorch.sh    
├── env_pytorch.sh  # environment variable declarations
├── pytorch
│   ├── examples
│   │   ├── offline
│   │   │   └── c++
│   │   │       ├── classification
│   │   │       │   ├── run_all_offline_mc_int16.sh
│   │   │       │   ├── run_all_offline_mc_int16_220.sh
│   │   │       │   ├── run_all_offline_mc_int8.sh
│   │   │       │   └── run_all_offline_mc_int8_220.sh
│   │   │       ├── east
│   │   │       │   ├── run_all_offline_mc_int16.sh
│   │   │       │   ├── run_all_offline_mc_int16_220.sh
│   │   │       │   ├── run_all_offline_mc_int8.sh
│   │   │       │   └── run_all_offline_mc_int8_220.sh
│   │   │       ├── mtcnn
│   │   │       │   ├── run_offline_int16.sh
│   │   │       │   ├── run_offline_int16_220.sh
│   │   │       │   ├── run_offline_int8.sh
│   │   │       │   └── run_offline_int8_220.sh
│   │   │       ├── ssd
│   │   │       │   ├── run_all_offline_mc_int16.sh
│   │   │       │   ├── run_all_offline_mc_int16_220.sh
│   │   │       │   ├── run_all_offline_mc_int8.sh
│   │   │       │   └── run_all_offline_mc_int8_220.sh
│   │   │       ├── ssd_mobilenet_v1
│   │   │       │   ├── run_all_offline_mc_int16.sh
│   │   │       │   ├── run_all_offline_mc_int16_220.sh
│   │   │       │   ├── run_all_offline_mc_int8.sh
│   │   │       │   └── run_all_offline_mc_int8_220.sh
│   │   │       ├── yolov2
│   │   │       │   ├── run_all_offline_mc_int16.sh
│   │   │       │   ├── run_all_offline_mc_int16_220.sh
│   │   │       │   ├── run_all_offline_mc_int8.sh
│   │   │       │   └── run_all_offline_mc_int8_220.sh
│   │   │       └── yolov3
│   │   │           ├── run_all_offline_mc_int16.sh
│   │   │           ├── run_all_offline_mc_int16_220.sh
│   │   │           ├── run_all_offline_mc_int8.sh
│   │   │           └── run_all_offline_mc_int8_220.sh
│   │   └── online
│   │       └── python
│   │           ├── classification
│   │           │   └── test_clas_online.sh
│   │           ├── east
│   │           │   └── run_test_east.sh
│   │           ├── mtcnn
│   │           │   └── run_mtcnn.sh
│   │           ├── ssd
│   │           │   └── run_test_ssd.sh
│   │           ├── ssd_mobilenet_v1
│   │           │   └── run_ssd_mobilenet_v1.sh
│   │           ├── yolov2
│   │           │   └── run_test_yolov2.sh
│   │           └── yolov3
│   │               └── run_test_yolov3.sh
│   ├── include
│   ├── lib
│   ├── models
│   │   └── pytorch_models
│   │       └── int8
│   │           └── checkpoints
│   ├── src     # source code
│   │   ├── catch   # catch package
│   │   │   ├── CPPLINT.cfg
│   │   │   ├── PYLINT.cfg
│   │   │   ├── README.md
│   │   │   ├── cmake
│   │   │   ├── examples
│   │   │   ├── pytorch_patches # Cambricon patches for this PyTorch version
│   │   │   ├── requirements.txt
│   │   │   ├── script
│   │   │   ├── setup.py
│   │   │   ├── test
│   │   │   ├── third_party
│   │   │   ├── torch_mlu   # core code
│   │   ├── pytorch # pytorch package
│   │   │   ├── CITATION
│   │   │   ├── CMakeLists.txt
│   │   │   ├── CODEOWNERS
│   │   │   ├── CONTRIBUTING.md
│   │   │   ├── LICENSE
│   │   │   ├── Makefile
│   │   │   ├── NOTICE
│   │   │   ├── README.md
│   │   │   ├── android
│   │   │   ├── aten
│   │   │   ├── azure-pipelines.yml
│   │   │   ├── benchmarks
│   │   │   ├── binaries
│   │   │   ├── c10
│   │   │   ├── caffe2
│   │   │   ├── cmake
│   │   │   ├── compile_commands.json
│   │   │   ├── docker
│   │   │   ├── docs
│   │   │   ├── ios
│   │   │   ├── modules
│   │   │   ├── mypy-README.md
│   │   │   ├── mypy-files.txt
│   │   │   ├── mypy.ini
│   │   │   ├── requirements.txt
│   │   │   ├── scripts
│   │   │   ├── setup.py
│   │   │   ├── submodules
│   │   │   ├── test
│   │   │   ├── third_party
│   │   │   ├── tools
│   │   │   ├── torch
│   │   │   └── ubsan.supp
│   │   └── vision  # vision package
│   │       ├── CMakeLists.txt
│   │       ├── LICENSE
│   │       ├── MANIFEST.in
│   │       ├── README.rst
│   │       ├── docs
│   │       ├── setup.cfg
│   │       ├── setup.py
│   │       ├── test
│   │       ├── torchvision
│   │       └── tox.ini
│   └── tools
│       └── genoff.py -> /opt/work/cambricon_pytorch/pytorch/src/catch/examples/offline/genoff/genoff.py
└── run_pytorch.sh

1) Cambricon Catch directory layout

/opt/work/cambricon_pytorch/pytorch/src/catch$ tree -L 2
.
├── CPPLINT.cfg
├── PYLINT.cfg
├── README.md
├── cmake
│   └── modules
├── examples
│   ├── data
│   │   ├── ICDAR_2015
│   │   ├── IWSLT
│   │   ├── coco
│   │   │   ├── coco.data
│   │   │   ├── coco.names
│   │   │   ├── file_list_for_release
│   │   │   ├── get_coco_dataset.sh
│   │   │   ├── samples
│   │   │   │   ├── ......
│   │   │   │   └── person.jpg
│   │   │   └── val2017
│   │   │       ├── 5k.txt
│   │   │       ├── coco.data
│   │   │       ├── coco.names
│   │   │       ├── file_list_for_release
│   │   │       ├── get_coco_dataset.sh
│   │   │       ├── label_map_coco.txt
│   │   │       └── val_file_info
│   │   ├── coco2017
│   │   ├── fddb
│   │   ├── imagenet
│   │   ├── voc2007
│   │   └── voc2012
│   ├── offline # offline inference examples
│   │   ├── CMakeLists.txt
│   │   ├── README.md
│   │   ├── bert
│   │   ├── build
│   │   ├── centernet
│   │   ├── clas_offline_multicore
│   │   ├── cmake
│   │   ├── common
│   │   ├── east
│   │   ├── faster-rcnn
│   │   ├── fcn8s
│   │   ├── fsrcnn
│   │   ├── genoff
│   │   ├── mtcnn
│   │   ├── scripts
│   │   ├── segnet
│   │   ├── ssd
│   │   ├── ssd_mobilenet_v1
│   │   ├── ssd_mobilenet_v2
│   │   ├── test
│   │   ├── test_forward_offline
│   │   ├── vdsr
│   │   ├── yolov2
│   │   ├── yolov3
│   │   ├── yolov4
│   │   └── yolov5
│   │       ├── CMakeLists.txt
│   │       ├── README.md
│   │       ├── label_map_coco.txt
│   │       ├── post_process
│   │       │   ├── yolov5_off_post.cpp
│   │       │   ├── yolov5_off_post.hpp
│   │       │   ├── yolov5_processor.cpp
│   │       │   └── yolov5_processor.hpp
│   │       ├── run_all_offline_mc.sh
│   │       └── yolov5_offline_multicore.cpp
│   ├── onetest
│   │   ├── onetest.conf
│   │   ├── onetest.py
│   │   └── onetest_mlu220.conf
│   ├── online
│   │   ├── README.md
│   │   ├── bert
│   │   ├── centernet
│   │   ├── common_utils.py
│   │   ├── east
│   │   ├── efficientnet
│   │   ├── faster-rcnn
│   │   ├── fcn8s
│   │   ├── fsrcnn
│   │   ├── mask-rcnn
│   │   ├── mtcnn
│   │   ├── segnet
│   │   ├── ssd
│   │   │   ├── README.md
│   │   │   ├── data
│   │   │   │   ├── __init__.py
│   │   │   │   ├── coco.py
│   │   │   │   ├── coco_labels.txt
│   │   │   │   ├── config.py
│   │   │   │   ├── example.jpg
│   │   │   │   ├── scripts
│   │   │   │   │   ├── COCO2014.sh
│   │   │   │   │   ├── VOC2007.sh
│   │   │   │   │   └── VOC2012.sh
│   │   │   │   └── voc0712.py
│   │   │   ├── eval.py
│   │   │   ├── run_test_ssd.sh
│   │   │   └── utils
│   │   │       ├── __init__.py
│   │   │       └── augmentations.py
│   │   ├── ssd_mobilenet_v1
│   │   ├── ssd_mobilenet_v2
│   │   ├── test_clas_online.py
│   │   ├── vdsr
│   │   ├── yolov2
│   │   ├── yolov3
│   │   ├── yolov4
│   │   └── yolov5
│   │       ├── README.md
│   │       ├── models
│   │       │   ├── __init__.py
│   │       │   ├── common.py
│   │       │   ├── experimental.py
│   │       │   ├── export.py
│   │       │   ├── yaml
│   │       │   │   └── yolov5s.yaml
│   │       │   └── yolo.py
│   │       ├── requirements.txt
│   │       ├── run_test_yolov5.sh
│   │       ├── test.py
│   │       └── utils
│   │           ├── __init__.py
│   │           ├── activations.py
│   │           ├── datasets.py
│   │           ├── general.py
│   │           ├── google_utils.py
│   │           └── torch_utils.py
│   ├── tools
│   │   ├── convert_weight
│   │   ├── loss_check
│   │   └── operator_statistic
│   └── training
│       ├── multi_card_demo.py
│       └── single_card_demo.py
├── pytorch_patches
│   ├── commit_id
│   ├── fix_cnnl_not_support_asstride.diff
│   ├── fix_no_access_permission_for_pth.diff
│   ├── fix_setup_clean.diff
│   ├── improve_performance_changes.diff
│   ├── max_out_impl_support_mlu_dispatch.diff
│   ├── profiler_mlu_support.diff
│   ├── register_mlu_device.diff
│   ├── support_mlu_dataloader.diff
│   ├── support_mlu_fusion.diff
│   ├── support_mlu_quanz_dequanz.diff
│   ├── support_mlu_segmentation.diff
│   └── support_mlu_serialization.diff
├── requirements.txt
├── script
│   ├── apply_patches_to_pytorch.sh
│   ├── build_catch.sh
│   ├── build_catch_lib.sh
│   ├── build_docview.sh
│   ├── build_mlu_libs.sh
│   ├── build_pytorch_src_test.sh
│   ├── catch_coverage_test.sh
│   ├── config_for_release.sh
│   ├── hooks
│   │   ├── README
│   │   ├── commit-msg
│   │   └── pre-commit
│   ├── release
│   │   ├── Dockerfiles
│   │   ├── build.property
│   │   ├── build_docker.sh
│   │   ├── config
│   │   ├── configure_pytorch.sh
│   │   ├── env_pytorch.sh
│   │   ├── independent_build.sh
│   │   ├── json_parser.py
│   │   ├── run_pytorch.sh
│   │   └── tools
│   └── yapf_format.py
├── setup.py
├── test
│   ├── cnml
│   │   ├── data
│   │   ├── op_test
│   │   ├── test_acquire_hardware_time.py
│   │   ├── test_bind_tensor.py
│   │   ├── test_cnml_op_exception.py
│   │   ├── test_dump.py
│   │   ├── test_eqnm_quantization.py
│   │   ├── test_forward_offline.py
│   │   ├── test_jit_inplace.py
│   │   ├── test_logging_cnml.py
│   │   ├── test_mfus_exception_case1.py
│   │   ├── test_mfus_exception_case2.py
│   │   ├── test_mfus_exception_case3.py
│   │   ├── test_mfus_exception_modules.py
│   │   ├── test_mixed_quantized_mods.py
│   │   ├── test_nan_quantization.py
│   │   ├── test_op_methods_cnml.py
│   │   ├── test_perchannel_use_avg.py
│   │   ├── test_quantization_exception.py
│   │   ├── test_quantize_generate.py
│   │   ├── test_quantized_mods.py
│   │   ├── test_save_cambricon.py
│   │   ├── test_segment_graph_exception.py
│   │   └── test_set_mem_channel.py
│   ├── cnnl
│   │   ├── distributed_env_prepare.sh
│   │   ├── op_test
│   │   ├── test_adaptive_quantize.py
│   │   ├── test_cnnl_op_exception.py
│   │   ├── test_distributed.py
│   │   ├── test_logging_cnnl.py
│   │   ├── test_op_methods_cnnl.py
│   │   ├── test_pin_memory.py
│   │   ├── test_profiler.py
│   │   ├── test_queue.py
│   │   └── test_save_and_load.py
│   ├── common_utils.py
│   ├── data
│   ├── run_test.py
│   ├── test_caching_allocator.py
│   ├── test_clas_onnx.py
│   ├── test_device.py
│   ├── test_jit_exception.py
│   ├── test_notifier.py
│   └── test_queue.py
├── third_party
│   └── neuware # empty directory
├── torch_mlu   # core code
│   ├── __init__.py
│   ├── core
│   ├── csrc
│   ├── distributed
│   └── tools
├── torch_mlu.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   └── top_level.txt

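3. Building pytorch-v0.15.604.tar.gz and installing PyTorch

1) Install the dependencies

Before building and installing the Caffe/PyTorch frameworks, install the CNToolkit package and the CNML, CNNL, CNPlugin, and CNCL components.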

# Install the MLU270 driver
cd [Neuware_SDK_Path]/Driver/
dpkg -i neuware-mlu270-driver-dkms_4.9.8_all.deb
# Check that the driver is installed and that its version meets the dependency requirements.
$ cat /proc/driver/cambricon/mlus/0000\:b3\:00.0/information 
Device name: MLU270-S4
Device inode path: /dev/cambricon_dev0
Device Major: 508
Device Minor: 0
Driver Version: v4.9.8
MCU Version: v1.1.3
Board Serial Number: SN/122011101324
MLU Firmware Version: 4.9.8
Board CV: 0
IPU Freq: 1000MHz
Linux Version: 5.4.0-100-generic (buildd@lcy02-amd64-060)
Interrupt Mode: MSI
Bus Location: b3_0_0
Bus Type: PCIE
LnkCap: Speed 8.0GT/s, Width x16
Region 0: Memory at 38ffc0000000 [size=256M]
Region 2: Memory at 38fff4000000 [size=64M]
Region 4: Memory at 38fff0000000 [size=64M]

# CNToolkit
cd [Neuware_SDK_Path]/CNToolkit/
dpkg -i cntoolkit_1.7.5-1.ubuntu18.04_amd64.deb
cd /var/cntoolkit-1.7.5/
dpkg -i *.deb
rm *.deb
# CNML
cd [Neuware_SDK_Path]/CNML/
dpkg -i cnml_7.10.3-1.ubuntu18.04_amd64.deb
# CNPlugin
cd [Neuware_SDK_Path]/CNPlugin
dpkg -i cnplugin_1.12.4-1.ubuntu18.04_amd64.deb
# CNNL
cd [Neuware_SDK_Path]/CNNL/
dpkg -i cnnl_1.3.0-1.ubuntu18.04_amd64.deb
# CNCL
cd [Neuware_SDK_Path]/CNCL
dpkg -i cncl_0.8.0-1.ubuntu18.04_amd64.deb
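The driver check above can also be scripted. This is a small sketch under the assumption that the information file keeps the "key: value" layout shown in the listing; the function names are hypothetical and the sample text is an excerpt of that listing.

```python
from typing import Dict


def parse_driver_info(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines such as those in the driver information file."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def driver_matches(info: Dict[str, str], expected: str) -> bool:
    """Compare the reported driver version with the expected one, ignoring a leading 'v'."""
    return info.get("Driver Version", "").lstrip("v") == expected


SAMPLE = """Device name: MLU270-S4
Driver Version: v4.9.8
MLU Firmware Version: 4.9.8
LnkCap: Speed 8.0GT/s, Width x16
"""

info = parse_driver_info(SAMPLE)
print(info["Device name"], driver_matches(info, "4.9.8"))
```

On a real machine the text would come from reading the information file under /proc/driver/cambricon/mlus/, whose PCI address component differs per host.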

2) Extract pytorch-v0.15.604.tar.gz

cd  [Neuware_SDK_Path]/PyTorch/src/
tar zxvf pytorch-v0.15.604.tar.gz -C /opt/work/

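3) Build and install Cambricon PyTorch

# Set the root directory of the extracted archive
export ROOT_HOME=/opt/work/cambricon_pytorch
cd $ROOT_HOME
# Create symlinks for the dataset and model directories (adjust to your actual paths): DATASET_HOME, CAFFE_MODELS_DIR
ln -s /data/datasets datasets
ln -s /data/models models
# Set the environment variables
source env_pytorch.sh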

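# Install virtualenv and activate a virtual environment
pip install virtualenv
pushd ${CATCH_HOME}
virtualenv -p $(which python3) venv/pytorch # create the virtual environment; swap python3 for a specific version if needed
source venv/pytorch/bin/activate # activate the virtual environment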
popd

# Apply the Cambricon PyTorch patches shipped in Cambricon Catch to the Cambricon PyTorch source.
pushd ${CATCH_HOME}/script
bash apply_patches_to_pytorch.sh
popd

# Build Cambricon PyTorch
pushd ${PYTORCH_HOME}
pip install -r requirements.txt # install third-party packages
rm -rf build # clean up previous build output
rm -rf dist
python setup.py install # build and install
python setup.py bdist_wheel # build the .whl package
popd

# Build Cambricon Catch
pushd ${CATCH_HOME}
pip install -r requirements.txt # install third-party packages
rm -rf build
rm -rf dist
python setup.py install # build and install
python setup.py bdist_wheel # build the .whl package
popd

# Build and install Cambricon Vision
pushd ${VISION_HOME}
rm -rf dist
python setup.py bdist_wheel
pip install dist/torchvision-*.whl
popd
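The build steps rely on CATCH_HOME, PYTORCH_HOME, and VISION_HOME, which are presumably exported by env_pytorch.sh (an assumption based on how the commands above use them). A quick pre-flight check, sketched here with a hypothetical helper, can catch a forgotten source step before a long build starts:

```python
import os
from typing import List, Mapping

# Variable names taken from the build commands in this section.
REQUIRED_VARS = ("CATCH_HOME", "PYTORCH_HOME", "VISION_HOME")


def missing_build_vars(env: Mapping[str, str]) -> List[str]:
    """Return the variables that are unset or do not point at an existing directory."""
    return [name for name in REQUIRED_VARS
            if not os.path.isdir(env.get(name, ""))]


missing = missing_build_vars(os.environ)
if missing:
    print("Run 'source env_pytorch.sh' first; missing:", ", ".join(missing))
else:
    print("Build environment looks complete.")
```

The check only confirms that the directories exist; it does not verify that the patches were applied or that the virtual environment is active.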

4) Check that the build succeeded

# Python:
>>> import torch
>>> import torch_mlu
CNML: 7.10.3 85350b141
CNRT: 4.10.4 41e356b

# To leave the virtual environment:
deactivate
# To re-enter the virtual environment:
source $ROOT_HOME/pytorch/src/catch/venv/pytorch/bin/activate
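Before opening an interpreter, it can be handy to confirm that the three packages are at least discoverable in the active environment. A minimal sketch using only the standard library (the function name is hypothetical):

```python
import importlib.util
from typing import Dict, Iterable


def find_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level modules can be located, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = find_modules(["torch", "torch_mlu", "torchvision"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Because find_spec does not execute the package, this does not print the CNML/CNRT version banner; the real import shown above is still the test that the native libraries load.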

5) Current pip list

Package            Version
------------------ ------------
astroid            2.9.3
attrs              21.4.0
boto3              1.5.22
botocore           1.8.50
certifi            2021.10.8
chardet            3.0.4
charset-normalizer 2.0.12
cloudpickle        2.0.0
cpplint            1.6.0
cycler             0.11.0
Cython             0.29.16
dask               2021.3.0
decorator          4.4.2
docopt             0.6.2
docutils           0.18.1
future             0.18.2
idna               2.6
isort              5.10.1
jmespath           0.10.0
jsonpickle         0.9.6
kiwisolver         1.3.1
lanms              1.0.2
lazy-object-proxy  1.7.1
matplotlib         2.2.2
mccabe             0.6.1
munch              2.2.0
networkx           2.5.1
nltk               3.2.5
numpy              1.16.0
onnx               1.6.0
opencv-python      3.4.2.17
pandas             0.23.2
Pillow             5.2.0
pip                21.3.1
platformdirs       2.4.0
pluggy             0.6.0
protobuf           3.19.4
py                 1.11.0
pycocotools        2.0.0
pylint             2.12.2
pyparsing          3.0.7
pytest             3.4.0
python-dateutil    2.8.2
pytz               2021.3
PyWavelets         1.1.1
PyYAML             6.0
regex              2018.2.3
requests           2.18.4
s3transfer         0.1.13
sacred             0.7.2
scikit-image       0.14.2
scikit-learn       0.19.2
scipy              1.1.0
setuptools         59.6.0
Shapely            1.7.0
six                1.16.0
tensorboardX       1.0
toml               0.10.2
toolz              0.11.2
torch              1.3.0a0      # note: Cambricon PyTorch is based on version 1.3
torch-mlu          0.15.0.post1
torchvision        0.2.1
tqdm               4.19.5
typed-ast          1.5.2
typing             3.7.4.3
typing_extensions  4.1.1
urllib3            1.22
wheel              0.37.1
wrapt              1.13.3
yacs               0.1.6
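To compare another machine against this list, the three framework versions can be checked automatically. The sketch below parses text in the pip list layout shown above; the expected versions are copied from that list, and the helper names are hypothetical.

```python
from typing import Dict, Optional

# Versions copied from the pip list in this article.
EXPECTED = {"torch": "1.3.0a0", "torch-mlu": "0.15.0.post1", "torchvision": "0.2.1"}


def parse_pip_list(text: str) -> Dict[str, str]:
    """Map lower-cased package names to versions, skipping the header and separator rows."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        versions[parts[0].lower()] = parts[1]
    return versions


def mismatches(installed: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the packages whose installed version differs from EXPECTED (None if absent)."""
    return {name: installed.get(name)
            for name, wanted in EXPECTED.items() if installed.get(name) != wanted}


SAMPLE = """Package            Version
------------------ ------------
torch              1.3.0a0
torch-mlu          0.15.0.post1
torchvision        0.2.1
"""

print(mismatches(parse_pip_list(SAMPLE)))
```

An empty result means the three packages match; feeding it the captured output of pip list from another environment would list whatever deviates.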

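4. Starting PyTorch in a container [recommended]

The Docker image [neuware_sdk]/PyTorch/docker/pytorch-0.15.604-ubuntu18.04.tar.gz contains the Catch examples, PyTorch and Catch prebuilt as wheel packages, and a Python 3 virtual environment, with cntoolkit and the other dependencies already installed.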

tree /torch/ -L 1
├── examples          -- online and offline demos
├── requirements.txt  -- Python dependencies
├── src               -- PyTorch/Catch/Vision source code
├── venv2             -- Python 2 virtual environment
├── venv3             -- Python 3 virtual environment
├── wheel_py2         -- Python 2 wheel packages
└── wheel_py3         -- Python 3 wheel packages

Load the image

sudo docker load < [neuware_sdk]/PyTorch/docker/pytorch-0.15.604-ubuntu18.04.tar.gz

Start the container

sudo docker run -itd --privileged --net=host -v /home/ubuntu/Downloads/neuware-ftp:/home/ftp yellow.hub.cambricon.com/pytorch/pytorch:0.15.604-ubuntu18.04 /bin/bash

Activate the Python virtual environment inside the container

cd /torch
source venv3/pytorch/bin/activate

# Test torch_mlu
python
>>> import torch
>>> import torch_mlu
CNML: 7.10.3 85350b141
CNRT: 4.10.4 41e356b

III. Other Materials on the Cambricon FTP

1. download/demo

Fetching the files

wget -r -nH -P./ ftp://download.cambricon.com:8821/download/demo/* --ftp-user="username" --ftp-password="password"

Directory layout

$ tree 
├── cnrtexec
│   ├── cnrtexec.tar.gz
│   └── README.MD
├── lprnet
│   ├── cnnl_auto_log
│   ├── cnrtexec
│   │   ├── clean.sh
│   │   ├── cnrtexec
│   │   ├── cnrtexec.cpp
│   │   ├── cnrtexec.h
│   │   ├── cnrtexec.o
│   │   ├── main.cpp
│   │   ├── main.o
│   │   └── Makefile
│   ├── cpu.py
│   ├── env.sh
│   ├── labels_mlu.txt
│   ├── labels.txt
│   ├── lpr.cambricon
│   ├── lpr.cambricon_twins
│   ├── lprini.pth
│   ├── lpr_intx.pth
│   ├── mlu.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── LPRNet.py
│   │   ├── LPRNet.py.bak
│   │   └── __pycache__
│   │       ├── __init__.cpython-35.pyc
│   │       ├── __init__.cpython-37.pyc
│   │       ├── LPRNet.cpython-35.pyc
│   │       └── LPRNet.cpython-37.pyc
│   ├── prebs_mlu.txt
│   ├── prebs.txt
│   ├── quant.py
│   ├── readme
│   ├── test1.jpg
│   └── test.jpg
├── mlu_caffe_fcn
│   ├── mlu_caffe_fcn_8s.tar
│   └── README.MD
├── mlu_caffe_yolov3
│   ├── mlu_caffe_yolov3.tar.gz
│   └── README.md
├── retinaface
│   ├── cnrt_simple_demo_retinaface_final.zip
│   └── mlu-pytorch_retinaface_native.tar.gz
└── yolov5
    ├── 20211229
    │   ├── gen.py
    │   ├── model
    │   │   ├── yolov5s.pt  # original yolov5s PyTorch model file (pytorch>=1.6)
    │   │   └── yolov5s-state-31.pth
    │   └── readme_yolov5.md
    ├── 20220112
    │   └── mlu-pytorch-yolov5-v5.0.tar.gz
    ├── cnstream_patch
    │   └── cnstream_yolov5_patch.tar.gz    # yolov5 patch package for CNStream
    ├── pytorch-yolov5-image-test
    │   ├── pytorch-yolov5-image-1.6.602.tar    # Docker image with Cambricon PyTorch and the yolov5 environment; note: not Neuware SDK 1.7.604
    │   ├── run-pytorch-yolov5-docker-ubuntu16.04-1.6.602.sh    # script for running the Docker image
    │   ├── run-pytorch-yolov5-docker-ubuntu16.04-1.6.602.sh.tar.gz
    │   └── yolov5_Readme.md
    ├── torch-yolov5模型转换
    │   ├── torch.ubuntu18.04.yolov5.v1.7.tar.gz    # Docker image with official PyTorch and the yolov5 environment, used to convert network models into model files for Torch 1.3
    │   └── yolov5-3.1.tar.gz   # official yolov5 SDK
    └── yolov5m
        └── quantize_online_v5.0_20211223.tar.gz    # yolov5 model quantization and offline-model generation code

2. product/datasets

Fetching the files

wget -r -nH -P./ ftp://download.cambricon.com:8821/product/datasets/* --ftp-user="username" --ftp-password="password"

File listing

$ tree
├── COCO-2014.tar.gz
├── COCO2017_datasets.tar.gz
├── MLU270_datasets_COCO.tar.gz
├── MLU270_datasets_en_core_web_sm.tar.gz
├── MLU270_datasets_FDDB.tar.gz
├── MLU270_datasets_imagenet.tar.gz
├── MLU270_datasets_tensorflow_models.tar.gz
├── MLU270_datasets_VOC2007.tar.gz
├── MLU270_datasets_VOC2012.tar.gz

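IV. Porting YOLOv5 to the MLU

The whole porting process can be roughly divided into seven steps: environment preparation, model quantization, online inference, offline model generation, offline inference, performance testing, and accuracy testing.

1. Model conversion (Torch alignment)

1) Prepare the network model

Download the configuration files and model weights from the official site. The following uses yolov5 (416*416) as the example.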

Name URL Note
YOLOv5 https://github.com/ultralytics/yolov5.git Official Ultralytics GitHub repository
yolov5s.yaml https://github.com/ultralytics/yolov5/blob/master/models/yolov5s.yaml v5.0
yolov5s.pt https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5s.pt v5.0
yolov5m.yaml https://github.com/ultralytics/yolov5/blob/master/models/yolov5m.yaml v5.0
yolov5m.pt https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt v5.0
yolov5l.yaml https://github.com/ultralytics/yolov5/blob/master/models/yolov5l.yaml v5.0
yolov5l.pt https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5l.pt v5.0
yolov5x.yaml https://github.com/ultralytics/yolov5/blob/master/models/yolov5x.yaml v5.0
yolov5x.pt https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5x.pt v5.0
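The weight URLs in the table follow one pattern, so they can be generated instead of copied. A small sketch (the function name is hypothetical; the URL pattern and the v5.0 tag come from the table above):

```python
RELEASE_TAG = "v5.0"
VARIANTS = ("s", "m", "l", "x")


def weight_url(variant: str, tag: str = RELEASE_TAG) -> str:
    """Return the release download URL for one yolov5 variant (s, m, l, or x)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown yolov5 variant: {variant!r}")
    return (f"https://github.com/ultralytics/yolov5/releases/download/"
            f"{tag}/yolov5{variant}.pt")


for variant in VARIANTS:
    print(weight_url(variant))
```

Note that the yaml links in the table point at the master branch, so it may be worth checking that the downloaded configuration really corresponds to the v5.0 weights.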

2) Convert the model

Conversion script: aligntorch.py

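import torch
from models.yolo import Model  # must be importable so torch.load can unpickle the checkpoint

weight = 'yolov5s.pt'
# The official checkpoint stores the full model object under the 'model' key.
model = torch.load(weight, map_location='cpu')['model']
# print(model)
# Save only the state_dict in the legacy (non-zip) format so that Torch 1.3 can read it.
torch.save(model.state_dict(), "./yolov5s-nozip.pth", _use_new_zipfile_serialization=False)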

Run the conversion

python3 aligntorch.py
# Produces the uncompressed model file yolov5s-nozip.pth
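The conversion matters because torch.save has written a zip-based container by default since PyTorch 1.6, which the Torch 1.3 used by Cambricon PyTorch cannot read. One way to tell the two formats apart is to test whether the file is a zip archive. The sketch below uses stand-in files rather than real checkpoints, and the function name is hypothetical:

```python
import os
import tempfile
import zipfile


def uses_zip_serialization(path: str) -> bool:
    """True if the file is a zip archive, i.e. it looks like the newer torch.save format."""
    return zipfile.is_zipfile(path)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a checkpoint written with the default (zip) serialization.
    zipped = os.path.join(tmp, "new_format.pth")
    with zipfile.ZipFile(zipped, "w") as archive:
        archive.writestr("archive/data.pkl", b"placeholder")

    # Stand-in for a legacy checkpoint: plain bytes, not a zip archive.
    legacy = os.path.join(tmp, "legacy_format.pth")
    with open(legacy, "wb") as handle:
        handle.write(b"\x80\x02placeholder")

    demo = (uses_zip_serialization(zipped), uses_zip_serialization(legacy))

print(demo)
```

Running the check on yolov5s-nozip.pth before copying it to the MLU environment should report False if the conversion worked as intended.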

2. Model quantization

1) Get the yolov5 quantization code

Get the yolov5 quantization and inference code from the official FTP: quantize_online_v5.0_20211223.tar.gz

wget -r -nH -P./ ftp://download.cambricon.com:8821/download/demo/yolov5/yolov5m/* --ftp-user="username" --ftp-password="password"

Extract quantize_online_v5.0_20211223.tar.gz to a path on a device, or in its container, where pytorch_mlu is installed.

$ tree -L 1
├── FocusWeight.txt
├── README.md
├── clean.sh
├── config.ini
├── data
│   ├── argoverse_hd.yaml
│   ├── coco.yaml
│   ├── coco128.yaml
│   ├── hyp.finetune.yaml
│   ├── hyp.scratch.yaml
│   ├── images
│   │   ├── bus.jpg
│   │   └── zidane.jpg
│   ├── scripts
│   │   ├── get_argoverse_hd.sh
│   │   ├── get_coco.sh
│   │   └── get_voc.sh
│   └── voc.yaml
├── detect.py
├── models
│   ├── common.py
│   ├── experimental.py
│   ├── export.py
│   ├── hub
│   ├── yolo.py
│   ├── yolov5l.yaml
│   ├── yolov5m.yaml
│   ├── yolov5s.yaml
│   └── yolov5x.yaml
├── post_process.py
├── quant.py    # quantization script
├── results
├── run.sh
├── tools
├── utils
├── v5.0    # Torch-aligned model files
│   ├── yolov5m-nozip.pth
│   ├── yolov5s-nozip.pth
│   └── yolov5x-nozip.pth

2) Model quantization and online CPU inference

Run in an environment with pytorch_mlu installed.

quant.py arguments

$ python quant.py --help
usage: quant.py [-h] [--cfg CFG] [--device DEVICE] [--weights WEIGHTS]
                [--qua_weight QUA_WEIGHT] [--source SOURCE]
                [--imgsz IMGSZ [IMGSZ ...]] [--conf-thres CONF_THRES]
                [--iou-thres IOU_THRES] [--classes CLASSES [CLASSES ...]]
                [--agnostic-nms] [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --cfg CFG             model.yaml
  --device DEVICE       device, i.e. 0 or 0,1,2,3 or cpu    # defaults to cpu
  --weights WEIGHTS     model.pt path(s)
  --qua_weight QUA_WEIGHT
                        model.pt path(s)
  --source SOURCE       source
  --imgsz IMGSZ [IMGSZ ...], --img IMGSZ [IMGSZ ...], --img-size IMGSZ [IMGSZ ...]
                        inference size h,w
  --conf-thres CONF_THRES
                        object confidence threshold
  --iou-thres IOU_THRES
                        IOU threshold for NMS
  --classes CLASSES [CLASSES ...]
                        filter by class: --class 0, or --class 0 2 3
  --agnostic-nms        class-agnostic NMS
  --output OUTPUT       output folder

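Copy the Torch-aligned yolov5s-nozip.pth file into the root of the quantize_online folder. The command below reads the copy shipped under v5.0/; pass your own path to --weights if it differs.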

$ python quant.py --cfg models/yolov5s.yaml --weights v5.0/yolov5s-nozip.pth
CNML: 7.10.3 85350b141
CNRT: 4.10.4 41e356b
Namespace(agnostic_nms=False, cfg='models/yolov5s.yaml', classes=None, conf_thres=0.3, device='cpu', imgsz=[640], iou_thres=0.45, output='./results/cpu', qua_weight='yolov5_quan.pt', source='data/images', weights='v5.0/yolov5s-nozip.pth')
image 1/2 /home/share/pytorch/yolov5/quantize_online_v5.0/data/images/bus.jpg: r 0.5925925925925926
torch.Size([1, 3, 640, 640])
[tensor([[1.11905e+02, 2.35018e+02, 2.15107e+02, 5.23988e+02, 8.86216e-01, 0.00000e+00],
        [2.11545e+02, 2.41887e+02, 2.85995e+02, 5.09628e+02, 8.55155e-01, 0.00000e+00],
        [4.76838e+02, 2.42796e+02, 5.61337e+02, 5.18281e+02, 8.49822e-01, 0.00000e+00],
        [7.89480e+01, 1.29193e+02, 5.60420e+02, 4.37727e+02, 7.12828e-01, 5.00000e+00],
        [8.04472e+01, 3.21336e+02, 1.24692e+02, 5.24205e+02, 3.92542e-01, 0.00000e+00]], grad_fn=<IndexBackward>)]
results/cpu/bus.jpg
640x640 4 persons, 1 buss, 
image 2/2 /home/share/pytorch/yolov5/quantize_online_v5.0/data/images/zidane.jpg: r 0.5
torch.Size([1, 3, 640, 640])
[tensor([[374.74130, 161.78339, 573.84186, 492.30505,   0.87496,   0.00000],
        [216.89871, 356.90222, 258.78680, 497.36224,   0.69026,  27.00000],
        [ 57.89629, 237.86765, 548.31989, 494.05441,   0.62544,   0.00000]], grad_fn=<IndexBackward>)]
results/cpu/zidane.jpg
640x640 2 persons, 1 ties, 

SAVE quantize model: yolov5_quan.pt
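Each row of the printed tensors appears to follow the usual YOLOv5 layout [x1, y1, x2, y2, confidence, class], with class indices from the 80-class COCO list (0 is person, 5 is bus, 27 is tie). Under that assumption, the per-image summary line can be reproduced from the rows; the helper name and the partial class map are illustrative only:

```python
from collections import Counter
from typing import Dict, Sequence

# Partial COCO index-to-name map, limited to the classes seen in the log above.
COCO_NAMES: Dict[int, str] = {0: "person", 5: "bus", 27: "tie"}

# Rows copied from the bus.jpg CPU output: x1, y1, x2, y2, confidence, class.
BUS_ROWS = [
    (1.11905e+02, 2.35018e+02, 2.15107e+02, 5.23988e+02, 8.86216e-01, 0.0),
    (2.11545e+02, 2.41887e+02, 2.85995e+02, 5.09628e+02, 8.55155e-01, 0.0),
    (4.76838e+02, 2.42796e+02, 5.61337e+02, 5.18281e+02, 8.49822e-01, 0.0),
    (7.89480e+01, 1.29193e+02, 5.60420e+02, 4.37727e+02, 7.12828e-01, 5.0),
    (8.04472e+01, 3.21336e+02, 1.24692e+02, 5.24205e+02, 3.92542e-01, 0.0),
]


def summarize(rows: Sequence[Sequence[float]]) -> Dict[str, int]:
    """Count detections per class name, falling back to class_<id> for unmapped ids."""
    counts: Counter = Counter()
    for _x1, _y1, _x2, _y2, _conf, cls in rows:
        counts[COCO_NAMES.get(int(cls), f"class_{int(cls)}")] += 1
    return dict(counts)


print(summarize(BUS_ROWS))
```

The result agrees with the "4 persons, 1 buss" line in the log, which suggests the assumed row layout is the right reading.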

3. Online inference on the MLU

Run in an environment with pytorch_mlu installed and an MLU accelerator card present.

detect.py arguments

$ python detect.py --help
usage: detect.py [-h] [--cfg CFG] [--device DEVICE] [--qua_weight QUA_WEIGHT]
                 [--jit JIT] [--half_input HALF_INPUT] [--save SAVE]
                 [--mcore MCORE] [--core_number CORE_NUMBER]
                 [--batch_size BATCH_SIZE] [--fuse]
                 [--fake_device FAKE_DEVICE] [--mname MNAME] [--output OUTPUT]
                 [--source SOURCE] [--imgsz IMGSZ [IMGSZ ...]]

optional arguments:
  -h, --help            show this help message and exit
  --cfg CFG             model.yaml
  --device DEVICE       device, i.e. mlu or cpu # mlu is always used internally
  --qua_weight QUA_WEIGHT   # quantized weight file, defaults to yolov5_quan.pt
                        model.pt path(s)
  --jit JIT             fusion  # false: layer-by-layer mode; true: fusion mode
  --half_input HALF_INPUT
                        he input data type, 0-float32, 1-float16/Half, default
                        1.
  --save SAVE           selection of save *.cambrcion
  --mcore MCORE         Set MLU Architecture
  --core_number CORE_NUMBER
                        Core number of mfus and offline model with simple
                        compilation.
  --batch_size BATCH_SIZE
                        size of each image batch
  --fuse                Use Model fuse
  --fake_device FAKE_DEVICE
                        genoff offline cambricon without mlu device if fake
                        device is true. 1-fake_device, 0-mlu_device
  --mname MNAME         Name of the .pt offline file
  --output OUTPUT       output folder   # defaults to ./results/mlu
  --source SOURCE       source  # defaults to data/images
  --imgsz IMGSZ [IMGSZ ...], --img IMGSZ [IMGSZ ...], --img-size IMGSZ [IMGSZ ...]
                        inference size h,w
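With this many flags it is easy to let the model name drift out of sync with the core count and batch size. The sketch below builds a detect.py command from the flags listed above; the helper names are hypothetical, and the naming scheme (for example mlu270_yolov5_4c1b for 4 cores and batch 1) is an assumption inferred from the names used in this article.

```python
from typing import List


def offline_model_name(mcore: str, core_number: int, batch_size: int,
                       model: str = "yolov5") -> str:
    """Build a name such as mlu270_yolov5_4c1b (assumed naming convention)."""
    return f"{mcore.lower()}_{model}_{core_number}c{batch_size}b"


def build_detect_command(cfg: str, mcore: str = "MLU270", core_number: int = 4,
                         batch_size: int = 1, fake_device: bool = False) -> List[str]:
    """Return the argv list for fusion-mode inference that also saves the offline model."""
    return [
        "python", "detect.py",
        "--save", "true", "--jit", "true",
        "--mcore", mcore,
        "--fake_device", "1" if fake_device else "0",
        "--mname", offline_model_name(mcore, core_number, batch_size),
        "--batch_size", str(batch_size),
        "--core_number", str(core_number),
        "--cfg", cfg,
    ]


print(" ".join(build_detect_command("models/yolov5s.yaml", fake_device=True)))
```

Setting fake_device=True corresponds to generating the offline model without a card, as the --fake_device help text describes.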

Run inference in fusion mode and generate the offline model.

# Enable the MLU optimization config with export CNML_OPTIMIZE=USE_CONFIG:config.ini; disable it with unset CNML_OPTIMIZE
$ export CNML_OPTIMIZE=USE_CONFIG:config.ini

# Example: mlu270, yolov5s, 4 cores, batch 1, without using the card
$ python detect.py --save true --jit true --mcore MLU270 --fake_device 1 --mname mlu270_yolov5_4c1b --batch_size 1 --core_number 4 --cfg models/yolov5s.yaml
CNML: 7.10.3 85350b141
CNRT: 4.10.4 41e356b
Namespace(batch_size=1, cfg='models/yolov5s.yaml', core_number=4, device='cpu', fake_device=1, fuse=False, half_input=1, imgsz=[640, 640], jit=True, mcore='MLU270', mname='mlu270_yolov5_4c1b', output='./results/mlu', qua_weight='yolov5_quan.pt', save=True, source='data/images')
weight: yolov5_quan.pt
half_input 
fake_device mode:save offline model mname:  mlu270_yolov5_4c1b
batchNum: 1

# Example: mlu270, yolov5s, 4 cores, batch 1, running on the card
$ python detect.py --save true --jit true --mcore MLU270 --fake_device 0 --mname mlu270_yolov5_4c1b --batch_size 1 --core_number 4 --cfg models/yolov5s.yaml
CNML: 7.10.3 85350b141
CNRT: 4.10.4 41e356b
Namespace(batch_size=1, cfg='models/yolov5s.yaml', core_number=4, device='cpu', fake_device=0, fuse=False, half_input=1, imgsz=[640, 640], jit=True, mcore='MLU270', mname='mlu270_yolov5_4c1b', output='./results/mlu', qua_weight='yolov5_quan.pt', save=True, source='data/images')
weight: yolov5_quan.pt
half_input 
batchNum: 1
mlu
image 1/2 /home/share/pytorch/yolov5/quantize_online_v5.0/data/images/bus.jpg: r 0.5925925925925926
torch.Size([1, 3, 640, 640])
batchNum: 1
num_boxes_final:  5.0
[tensor([[1.11000e+02, 2.36250e+02, 2.15125e+02, 5.24000e+02, 8.77930e-01, 0.00000e+00],
        [2.11875e+02, 2.44000e+02, 2.86750e+02, 5.07000e+02, 8.56445e-01, 0.00000e+00],
        [4.74250e+02, 2.46375e+02, 5.59000e+02, 5.16000e+02, 8.42773e-01, 0.00000e+00],
        [8.01250e+01, 3.25000e+02, 1.25500e+02, 5.22000e+02, 3.70850e-01, 0.00000e+00],
        [7.46875e+01, 1.22188e+02, 5.66000e+02, 4.44000e+02, 7.13867e-01, 5.00000e+00]])]
results/mlu/bus.jpg
640x640 4 persons, 1 buss, 
image 2/2 /home/share/pytorch/yolov5/quantize_online_v5.0/data/images/zidane.jpg: r 0.5
torch.Size([1, 3, 640, 640])
num_boxes_final:  3.0
[tensor([[374.00000, 162.00000, 573.00000, 492.50000,   0.89404,   0.00000],
        [ 57.03125, 238.87500, 557.50000, 493.75000,   0.59912,   0.00000],
        [216.87500, 357.00000, 259.50000, 496.25000,   0.59033,  27.00000]])]
results/mlu/zidane.jpg
640x640 2 persons, 1 ties, 
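The MLU boxes differ slightly from the CPU boxes in the quantization step, which is expected after int quantization and half-precision input. One way to put a number on the drift is the intersection over union (IoU) of matching boxes. The sketch below uses the first three bus.jpg rows from both logs, which appear in the same order; the remaining two rows are swapped between the logs and would need matching by class and overlap first.

```python
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# First three "person" boxes for bus.jpg, copied from the CPU and MLU logs.
CPU_BUS = [(111.905, 235.018, 215.107, 523.988),
           (211.545, 241.887, 285.995, 509.628),
           (476.838, 242.796, 561.337, 518.281)]
MLU_BUS = [(111.000, 236.250, 215.125, 524.000),
           (211.875, 244.000, 286.750, 507.000),
           (474.250, 246.375, 559.000, 516.000)]

overlaps: List[float] = [iou(c, m) for c, m in zip(CPU_BUS, MLU_BUS)]
print([round(value, 3) for value in overlaps])
```

All three pairs overlap by more than 0.9, so for these detections the quantized model stays close to the CPU reference; a full accuracy test would still require a labeled dataset.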

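4. Offline inference on the MLU

Run in an environment with CNRT, OpenCV, and an MLU accelerator card.

The offline program includes the header "cnrt.h" (/usr/local/neuware/include/cnrt.h).

Use a C++ program to run the offline model mlu270_yolov5_4c1b.cambricon on the MLU device.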

# Go to the yolov5 offline inference program
$ cd yolov5-offline_test/
# Copy the offline model into this directory
$ cp [path]/mlu270_yolov5_4c1b.cambricon ./
# Build the program
$ make clean; make
# Run offline inference
$ ./yolov5_offline_simple_demo ./mlu270_yolov5_4c1b.cambricon ./image.jpg ./output/offline_result.jpg
CNRT: 4.10.4 41e356b
---------
427 640
154 44 287 610
71 247 200 602
289 375 424 604