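DRRG Code Pitfalls
1. Training hangs:
While training the first epoch, the model stops making progress and the process can only be killed with Ctrl+C. It is completely stuck, and watch -n 1 nvidia-smi shows no GPU activity. After more than two hours of searching and trial and error, the culprit turned out to be num_workers!
Before training, always set num_workers to 0 explicitly: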
CUDA_LAUNCH_BLOCKING=1 python train_TextGraph.py --exp_name Ctw1500 --max_epoch 600 --batch_size 6 --gpu 0 --input_size 640 --optim SGD --lr 0.001 --start_epoch 0 --viz --net vgg --num_workers 0
num_workers can also be set in config.py, but if you do not pass --num_workers on the command line, it silently becomes 8.
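The override behaviour described above can be sketched with a small argparse example. This is only an illustration, not the repository's actual option code: build_parser and resolve_num_workers are hypothetical names, and the default of 8 is taken from the behaviour observed in these notes.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the project's option parser.
    # The default of 8 mirrors the value observed when the flag is omitted.
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_workers", type=int, default=8)
    parser.add_argument("--batch_size", type=int, default=6)
    return parser


def resolve_num_workers(argv):
    # Return the worker count that a run with these CLI arguments would use.
    args = build_parser().parse_args(argv)
    return args.num_workers


if __name__ == "__main__":
    print(resolve_num_workers([]))                      # flag omitted
    print(resolve_num_workers(["--num_workers", "0"]))  # flag passed explicitly
```

If the command-line default wins over config.py in this way, passing --num_workers 0 on every run is the safe option.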
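2. "FileNotFoundError: [Errno 2] No such file or directory: './vis/Ctw1500_train'":
This is the path where training results are saved. The code does not create it, so you have to create the empty folder manually before training.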
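Instead of creating the folder by hand, it could also be created from Python before training starts. A minimal sketch, assuming the path from the error message above; ensure_vis_dir is a hypothetical helper, not part of the DRRG code.

```python
import os


def ensure_vis_dir(path="./vis/Ctw1500_train"):
    # Create the output folder (and any missing parents).
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


if __name__ == "__main__":
    ensure_vis_dir()
```

The shell equivalent is mkdir -p ./vis/Ctw1500_train.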
3. Makefile build error:
Running make in the lanms directory fails with the following error:
g++ -o adaptor.so -I include -std=c++11 -O3 -I/home/zhangmingzhou1/anaconda3/envs/pytorch/include/python3.7m -I/home/zhangmingzhou1/anaconda3/envs/pytorch/include/python3.7m -Wno-unused-result -Wsign-compare -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -fdebug-prefix-map==/usr/local/src/conda/- -fdebug-prefix-map==/usr/local/src/conda-prefix -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -flto -flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -DNDEBUG -fwrapv -O3 -Wall -L/home/zhangmingzhou1/anaconda3/envs/pytorch/lib/python3.7/config-3.7m-x86_64-linux-gnu -L/home/zhangmingzhou1/anaconda3/envs/pytorch/lib -lpython3.7m -lpthread -ldl -lutil -lrt -lm -Xlinker -export-dynamic adaptor.cpp include/clipper/clipper.cpp --shared -fPIC
g++: error: unrecognized command line option ‘-fno-plt’
Makefile:10: recipe for target 'adaptor.so' failed
make: *** [adaptor.so] Error 1
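The -fno-plt flag comes from the output of python3-config --cflags, and the g++ on this machine does not recognize it. The workaround is to stop calling python3-config and hard-code the Python include path instead. Change the original Makefile:
CXXFLAGS = -I include -std=c++11 -O3 $(shell python3-config --cflags)
LDFLAGS = $(shell python3-config --ldflags)
DEPS = lanms.h $(shell find include -xtype f)
CXX_SOURCES = adaptor.cpp include/clipper/clipper.cpp
LIB_SO = adaptor.so
$(LIB_SO): $(CXX_SOURCES) $(DEPS)
	$(CXX) -o $@ $(CXXFLAGS) $(LDFLAGS) $(CXX_SOURCES) --shared -fPIC
clean:
	rm -rf $(LIB_SO)
to: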
CXXFLAGS = -I include -std=c++11 -O3 -I/home/zhangmingzhou1/anaconda3/envs/pytorch/include/python3.7m/
LDFLAGS = -I/home/zhangmingzhou1/anaconda3/envs/pytorch/include/python3.7m/
DEPS = lanms.h $(shell find include -xtype f)
CXX_SOURCES = adaptor.cpp include/clipper/clipper.cpp
LIB_SO = adaptor.so
$(LIB_SO): $(CXX_SOURCES) $(DEPS)
	$(CXX) -o $@ $(CXXFLAGS) $(LDFLAGS) $(CXX_SOURCES) --shared -fPIC
clean:
	rm -rf $(LIB_SO)
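The include path above is specific to one machine. As a rough sketch, the equivalent path for the active environment can be printed with the standard sysconfig module, and the offending flag can be filtered out of a flag string. python_include_flag and strip_unsupported are hypothetical helpers, and only -fno-plt is confirmed as rejected by the log above; other flags in the python3-config output may also need removing on an old compiler.

```python
import sysconfig

# Only -fno-plt is confirmed as unsupported by the build log above.
UNSUPPORTED = {"-fno-plt"}


def python_include_flag():
    # -I flag pointing at the Python headers of the running interpreter.
    return "-I" + sysconfig.get_paths()["include"]


def strip_unsupported(cflags, unsupported=UNSUPPORTED):
    # Drop flags the local compiler rejects; keep the rest in order.
    return " ".join(f for f in cflags.split() if f not in unsupported)


if __name__ == "__main__":
    print(python_include_flag())
    print(strip_unsupported("-O3 -fno-plt -pipe"))
```

Run it with the same Python used for training, then paste the printed -I value into CXXFLAGS.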