Implementing a custom CUDA loss layer in Caffe
Last time we had a first look at CUDA acceleration; this time we show how to actually implement it. I recommend modeling your layer on the loss layers that ship with Caffe. If your loss layer needs extra parameters, take the Crop layer as an example:
layer { bottom: "side5_up" bottom: "data" top: "side5_crop" type: "Crop" name: "side5_crop" crop_param { axis: 2 offset: 4 offset: 4 } }
The crop_param here is exactly such an extra parameter, and supporting it requires editing the caffe.proto file. Beyond that, a loss layer cannot do without three files: the header loss_yourname.hpp, the CPU implementation loss_yourname.cpp, and the GPU implementation loss_yourname.cu. To summarize, the core files are:
- loss_yourname.hpp goes under code_root/include/caffe/layers
- loss_yourname.cpp goes under code_root/src/caffe/layers
- loss_yourname.cu goes under code_root/src/caffe/layers
- caffe.proto (optional): edit the existing file under code_root/src/caffe/proto when your layer needs extra parameters
The walkthrough below follows Caffe's own sigmoid_cross_entropy_loss_layer and only points out the places that need attention.
Implementing loss_yourname.hpp
#ifndef CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
#define CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
#include "caffe/layers/loss_layer.hpp"
#include "caffe/layers/sigmoid_layer.hpp"
namespace caffe {
template <typename Dtype>
class SigmoidCrossEntropyLossLayer : public LossLayer<Dtype> {
public:
explicit SigmoidCrossEntropyLossLayer(const LayerParameter& param)
: LossLayer<Dtype>(param),
sigmoid_layer_(new SigmoidLayer<Dtype>(param)),
sigmoid_output_(new Blob<Dtype>()) {}
virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "SigmoidCrossEntropyLoss"; }
protected:
/// @copydoc SigmoidCrossEntropyLossLayer
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual Dtype get_normalizer(
LossParameter_NormalizationMode normalization_mode, int valid_count);
/// The internal SigmoidLayer used to map predictions to probabilities.
shared_ptr<SigmoidLayer<Dtype> > sigmoid_layer_;
/// sigmoid_output stores the output of the SigmoidLayer.
shared_ptr<Blob<Dtype> > sigmoid_output_;
/// bottom vector holder to call the underlying SigmoidLayer::Forward
vector<Blob<Dtype>*> sigmoid_bottom_vec_;
/// top vector holder to call the underlying SigmoidLayer::Forward
vector<Blob<Dtype>*> sigmoid_top_vec_;
/// Whether to ignore instances with a certain label.
bool has_ignore_label_;
/// The label indicating that an instance should be ignored.
int ignore_label_;
/// How to normalize the loss.
LossParameter_NormalizationMode normalization_;
Dtype normalizer_;
int outer_num_, inner_num_;
};
} // namespace caffe
#endif // CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
The key places to modify are:
class SigmoidCrossEntropyLossLayer : public LossLayer<Dtype>
Change this to your own class name.
virtual inline const char* type() const { return "SigmoidCrossEntropyLoss"; }
Replace SigmoidCrossEntropyLoss here with the type of your own layer. In the prototxt example at the top, Crop is that layer's type.
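For example, a hypothetical layer class (all names invented here purely for illustration) would declare:
// Hypothetical names, for illustration only.
template <typename Dtype>
class MyLossLayer : public LossLayer<Dtype> {
 public:
  explicit MyLossLayer(const LayerParameter& param) : LossLayer<Dtype>(param) {}
  // "MyLoss" is the string you would write after type: in the prototxt.
  virtual inline const char* type() const { return "MyLoss"; }
};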
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
Forward_cpu and Backward_cpu are the forward and backward passes on the CPU; implementing them in loss_yourname.cpp is the bulk of your work there. Forward_gpu and Backward_gpu are the forward and backward passes on the GPU; implementing them in loss_yourname.cu is the bulk of your work there. If you do not need a CUDA version of your loss, simply remove the two _gpu declarations.
#ifndef CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
#define CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
#endif  // CAFFE_SIGMOID_CROSS_ENTROPY_LOSS_LAYER_HPP_
Remember to change these to your own include-guard macro name.
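For the hypothetical MyLossLayer above, the guard might become (sketch):
#ifndef CAFFE_MY_LOSS_LAYER_HPP_
#define CAFFE_MY_LOSS_LAYER_HPP_
// ... declarations ...
#endif  // CAFFE_MY_LOSS_LAYER_HPP_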
Note: the other member variables and helper functions can be added or removed to suit your own needs.
loss_yourname.cpp
#include <algorithm>
#include <vector>
#include "caffe/layers/sigmoid_cross_entropy_loss_layer.hpp"
#include "caffe/util/math_functions.hpp"
namespace caffe {
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::LayerSetUp(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  /* omitted */
}
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Reshape(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  /* omitted */
}
// TODO(shelhamer) loss normalization should be pulled up into LossLayer,
// instead of duplicated here and in SoftMaxWithLossLayer
template <typename Dtype>
Dtype SigmoidCrossEntropyLossLayer<Dtype>::get_normalizer(
LossParameter_NormalizationMode normalization_mode, int valid_count) {
  /* omitted */
}
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_cpu(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
// The forward pass computes the sigmoid outputs.
sigmoid_bottom_vec_[0] = bottom[0];
sigmoid_layer_->Forward(sigmoid_bottom_vec_, sigmoid_top_vec_);
// Compute the loss (negative log likelihood)
// Stable version of loss computation from input data
const Dtype* input_data = bottom[0]->cpu_data();
const Dtype* target = bottom[1]->cpu_data();
int valid_count = 0;
Dtype loss = 0;
for (int i = 0; i < bottom[0]->count(); ++i) {
const int target_value = static_cast<int>(target[i]);
if (has_ignore_label_ && target_value == ignore_label_) {
continue;
}
loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
++valid_count;
}
normalizer_ = get_normalizer(normalization_, valid_count);
top[0]->mutable_cpu_data()[0] = loss / normalizer_;
}
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_cpu(
const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
if (propagate_down[1]) {
LOG(FATAL) << this->type()
<< " Layer cannot backpropagate to label inputs.";
}
if (propagate_down[0]) {
// First, compute the diff
const int count = bottom[0]->count();
const Dtype* sigmoid_output_data = sigmoid_output_->cpu_data();
const Dtype* target = bottom[1]->cpu_data();
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
caffe_sub(count, sigmoid_output_data, target, bottom_diff);
// Zero out gradient of ignored targets.
if (has_ignore_label_) {
for (int i = 0; i < count; ++i) {
const int target_value = static_cast<int>(target[i]);
if (target_value == ignore_label_) {
bottom_diff[i] = 0;
}
}
}
// Scale down gradient
Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer_;
caffe_scal(count, loss_weight, bottom_diff);
}
}
#ifdef CPU_ONLY
STUB_GPU(SigmoidCrossEntropyLossLayer);
#endif
INSTANTIATE_CLASS(SigmoidCrossEntropyLossLayer);
REGISTER_LAYER_CLASS(SigmoidCrossEntropyLoss);
} // namespace caffe
The key places to modify are:
#include "caffe/layers/sigmoid_cross_entropy_loss_layer.hpp"
Include your own header file.
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_cpu(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
Rename the CPU forward function to match your class and implement it.
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_cpu(
const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom)
Rename the CPU backward function to match your class and implement it.
#ifdef CPU_ONLY
STUB_GPU(SigmoidCrossEntropyLossLayer);
#endif
INSTANTIATE_CLASS(SigmoidCrossEntropyLossLayer);
REGISTER_LAYER_CLASS(SigmoidCrossEntropyLoss);
This is a part you must not overlook! Change SigmoidCrossEntropyLossLayer and SigmoidCrossEntropyLoss to your own layer's names. For example, if REGISTER_LAYER_CLASS(SigmoidCrossEntropyLoss); is missing, Caffe will complain that it cannot find a layer with the requested type when your custom loss layer is used.
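As a concrete sketch, for the hypothetical MyLossLayer the end of loss_yourname.cpp would look like this (names invented for illustration):
#ifdef CPU_ONLY
STUB_GPU(MyLossLayer);
#endif
INSTANTIATE_CLASS(MyLossLayer);
REGISTER_LAYER_CLASS(MyLoss);  // makes type: "MyLoss" resolvable from a prototxt
A prototxt would then reference the layer with type: "MyLoss".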
loss_yourname.cu
#include <vector>
#include "caffe/layers/sigmoid_cross_entropy_loss_layer.hpp"
#include "caffe/util/math_functions.hpp"
namespace caffe {
template <typename Dtype>
__global__ void SigmoidCrossEntropyLossForwardGPU(const int nthreads,
const Dtype* input_data, const Dtype* target, Dtype* loss,
const bool has_ignore_label_, const int ignore_label_,
Dtype* counts) {
CUDA_KERNEL_LOOP(i, nthreads) {
const int target_value = static_cast<int>(target[i]);
if (has_ignore_label_ && target_value == ignore_label_) {
loss[i] = 0;
counts[i] = 0;
} else {
loss[i] = input_data[i] * (target[i] - (input_data[i] >= 0)) -
log(1 + exp(input_data[i] - 2 * input_data[i] *
(input_data[i] >= 0)));
counts[i] = 1;
}
}
}
template <typename Dtype>
__global__ void SigmoidCrossEntropyLossIgnoreDiffGPU(const int count,
const int ignore_label, const Dtype* target, Dtype* diff) {
CUDA_KERNEL_LOOP(i, count) {
const int target_value = static_cast<int>(target[i]);
if (target_value == ignore_label) {
diff[i] = 0;
}
}
}
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_gpu(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
// The forward pass computes the sigmoid outputs.
sigmoid_bottom_vec_[0] = bottom[0];
sigmoid_layer_->Forward(sigmoid_bottom_vec_, sigmoid_top_vec_);
// Compute the loss (negative log likelihood)
const int count = bottom[0]->count();
// Stable version of loss computation from input data
const Dtype* input_data = bottom[0]->gpu_data();
const Dtype* target = bottom[1]->gpu_data();
// Since this memory is not used for anything until it is overwritten
// on the backward pass, we use it here to avoid having to allocate new GPU
// memory to accumulate intermediate results in the kernel.
Dtype* loss_data = bottom[0]->mutable_gpu_diff();
Dtype* count_data = bottom[1]->mutable_gpu_diff();
Dtype valid_count;
// NOLINT_NEXT_LINE(whitespace/operators)
SigmoidCrossEntropyLossForwardGPU<Dtype><<<CAFFE_GET_BLOCKS(count),
CAFFE_CUDA_NUM_THREADS>>>(count, input_data, target, loss_data,
has_ignore_label_, ignore_label_, count_data);
// Only launch another CUDA kernel if we actually need the valid count.
if (normalization_ == LossParameter_NormalizationMode_VALID &&
has_ignore_label_) {
caffe_gpu_asum(count, count_data, &valid_count);
} else {
valid_count = count;
}
Dtype loss;
caffe_gpu_asum(count, loss_data, &loss);
normalizer_ = get_normalizer(normalization_, valid_count);
top[0]->mutable_cpu_data()[0] = loss / normalizer_;
}
template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_gpu(
const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
if (propagate_down[1]) {
LOG(FATAL) << this->type()
<< " Layer cannot backpropagate to label inputs.";
}
if (propagate_down[0]) {
// First, compute the diff
const int count = bottom[0]->count();
const Dtype* sigmoid_output_data = sigmoid_output_->gpu_data();
const Dtype* target = bottom[1]->gpu_data();
Dtype* bottom_diff = bottom[0]->mutable_gpu_diff();
caffe_copy(count, sigmoid_output_data, bottom_diff);
caffe_gpu_axpy(count, Dtype(-1), target, bottom_diff);
// Zero out gradient of ignored targets.
if (has_ignore_label_) {
// NOLINT_NEXT_LINE(whitespace/operators)
SigmoidCrossEntropyLossIgnoreDiffGPU<Dtype><<<CAFFE_GET_BLOCKS(count),
CAFFE_CUDA_NUM_THREADS>>>(count, ignore_label_, target, bottom_diff);
}
// Scale down gradient
Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer_;
caffe_gpu_scal(count, loss_weight, bottom_diff);
}
}
INSTANTIATE_LAYER_GPU_FUNCS(SigmoidCrossEntropyLossLayer);
} // namespace caffe
The key places to modify are:
#include "caffe/layers/sigmoid_cross_entropy_loss_layer.hpp"
Include your own header file.
template <typename Dtype>
__global__ void SigmoidCrossEntropyLossForwardGPU
template <typename Dtype>
__global__ void SigmoidCrossEntropyLossIgnoreDiffGPU
These two kernel functions are the key to the CUDA speed-up! Here nthreads is the number of elements to process and i is the thread index; every pixel is given its own thread, so the computation runs in parallel. That is why it is so fast, unlike the CPU version above, which walks over every pixel with a for loop. Borrow this idea: use kernel functions to parallelize the computation-heavy parts of your own loss.
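As a minimal sketch of that idea (all names invented for illustration), a kernel that computes a per-element weighted squared error in parallel, one thread per element, following the same CUDA_KERNEL_LOOP pattern as above, could look like this:
// Hypothetical kernel: each thread writes its own partial loss into loss[i].
template <typename Dtype>
__global__ void MyWeightedL2ForwardGPU(const int nthreads,
    const Dtype* pred, const Dtype* target, const Dtype pos_weight,
    Dtype* loss) {
  CUDA_KERNEL_LOOP(i, nthreads) {
    const Dtype diff = pred[i] - target[i];
    const Dtype w = (target[i] > 0) ? pos_weight : Dtype(1);
    loss[i] = Dtype(0.5) * w * diff * diff;
  }
}
The per-element results can then be reduced with caffe_gpu_asum, exactly as the forward pass above does.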
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_gpu
Rename the GPU forward function to match your class and implement it.
SigmoidCrossEntropyLossForwardGPU<Dtype><<<CAFFE_GET_BLOCKS(count),
CAFFE_CUDA_NUM_THREADS>>>(count, input_data, target, loss_data,
has_ignore_label_, ignore_label_, count_data);
This is where the forward kernel is actually launched.
Dtype* loss_data = bottom[0]->mutable_gpu_diff();
Dtype* count_data = bottom[1]->mutable_gpu_diff();
These two buffers hold the results produced by the parallel computation; they reuse the memory that will later store the gradients, which avoids allocating new GPU memory. This is an important idea: avoid allocating extra space yourself whenever possible.
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_gpu
Rename the GPU backward function to match your class and implement it.
SigmoidCrossEntropyLossIgnoreDiffGPU<Dtype><<<CAFFE_GET_BLOCKS(count),
CAFFE_CUDA_NUM_THREADS>>>(count, ignore_label_, target, bottom_diff);
Likewise, this launches the kernel used during the backward pass.
INSTANTIATE_LAYER_GPU_FUNCS(SigmoidCrossEntropyLossLayer);
This instantiates the GPU functions for the class SigmoidCrossEntropyLossLayer; change it to your own class name.
- One more thing to note: compared with the CPU version, every cpu in the data accessors is replaced with gpu here, because the data being processed must live on the GPU. The two exceptions are:
top[0]->mutable_cpu_data()[0]
top[0]->cpu_diff()[0]
Hard-won lessons from the pitfalls!!!
- If, inside the forward or backward functions of the .cu file, you allocate memory the CPU way and pass that pointer straight into a kernel function, the program will crash. Allocating on the CPU first and then copying to the GPU is not recommended either, since it costs a lot of performance. Just use cudaMalloc to allocate a one-dimensional buffer (look up the corresponding functions for 2-D or 3-D arrays), and remember to free it, free it, free it!!! A minimal sketch follows these notes.
- Defining a constant inside the forward or backward functions of the .cu file and passing it straight into a kernel function works fine, for example:
Dtype w_neg = 1.0;
Why mention this? Because I will extend the topic a little further below; read to the end if you have this need. The need is: dynamically obtain the current iteration inside the loss and adjust the loss weights according to it.
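Here is a minimal sketch of the cudaMalloc approach mentioned above, assuming you need a temporary buffer of count elements inside Forward_gpu (CUDA_CHECK is Caffe's usual error-checking macro):
// Sketch only: temporary GPU buffer allocated with cudaMalloc and freed before returning.
Dtype* tmp_data = NULL;
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&tmp_data), count * sizeof(Dtype)));
// ... launch the kernel(s) that read/write tmp_data ...
CUDA_CHECK(cudaFree(tmp_data));  // do not forget to release it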
Setting extra parameters in caffe.proto
Suppose you need to add a parameter like this to your layer:
auto_weight_param { init_pos: 0.9 }
1. Add a message, as follows:
message AutoWeightParameter {
optional float init_pos = 1 [default=0.9];
}
2. Inside message LayerParameter {} add:
optional AutoWeightParameter auto_weight_param = 147;
The ID 147 here comes from the comment near the top of LayerParameter; 147 happened to be the next available ID in my copy. After you claim an ID for your own parameter, remember to update that comment accordingly, like so:
// LayerParameter next available layer-specific ID: 148 (last added: auto_weight_param)
3. In loss_yourname.cpp or loss_yourname.cu, read the extra parameter:
const Dtype init_pos = this->layer_param_.auto_weight_param().init_pos();//0.9
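For context, a common pattern (a sketch, not taken from the original layer; init_pos_ would be a member you add to your header) is to read the parameter once in LayerSetUp and cache it:
template <typename Dtype>
void MyLossLayer<Dtype>::LayerSetUp(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);
  init_pos_ = this->layer_param_.auto_weight_param().init_pos();  // defaults to 0.9
}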
Dynamically obtaining the current iteration
Following an online reference, only the places that need changing are given below; the rest of the code stays unchanged.
- Declare the setter/getter methods and the variables in common.hpp
class Caffe {
 public:
  static Caffe& Get();
  // Return the current iteration
  inline static int cur_iter() { return Get().cur_iter_; }
  // Set the current iteration
  inline static void set_cur_iter(int iter) { Get().cur_iter_ = iter; }
  // Return the max iterations
  inline static int max_iter() { return Get().max_iter_; }
  // Set the max iterations
  inline static void set_max_iter(int iter) { Get().max_iter_ = iter; }
 protected:
  static int cur_iter_;
  static int max_iter_;
  // ... rest of the original class ...
};
- Initialize the variables in common.cpp
namespace caffe {
int Caffe::cur_iter_ = 0;
int Caffe::max_iter_ = 0;
}
- Assign the variables dynamically in solver.cpp
template <typename Dtype>
void Solver<Dtype>::Step(int iters) {
  const int start_iter = iter_;
  const int stop_iter = iter_ + iters;
  Caffe::set_max_iter(stop_iter);
  while (iter_ < stop_iter) {
    Caffe::set_cur_iter(iter_);
    // ... original loop body (forward/backward, parameter update, ++iter_) ...
  }
}
- In loss_yourname.cpp or loss_yourname.cu, read the current and maximum iteration dynamically:
const int cur_iter = Caffe::cur_iter();
const int max_iter = Caffe::max_iter();
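Putting the pieces together, the stated need (a loss weight that changes with training progress) could be met with a sketch like the following inside Forward_cpu or Forward_gpu; init_pos_ is the hypothetical cached parameter from above and the annealing schedule is only an example:
// Sketch: linearly move a positive-class weight from init_pos_ towards 0.5 over training.
const int cur_iter = Caffe::cur_iter();
const int max_iter = Caffe::max_iter();
const Dtype progress = (max_iter > 0) ? Dtype(cur_iter) / Dtype(max_iter) : Dtype(0);
const Dtype w_pos = init_pos_ + (Dtype(0.5) - init_pos_) * progress;
// w_pos can now be used in the CPU loop or passed to a CUDA kernel as a plain constant.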