Accelerating OpenCV resize
In image processing we use resize on images all the time, but OpenCV's cv::resize is fairly slow. Below, the Simd library is used to speed it up.
You can expect roughly a 4x to 5x speedup; the exact gain depends on which vector instruction sets your CPU supports.
A typical resize with plain OpenCV looks like this:
cv::Mat src, dest;
cv::Size size(xx, xx);
src = cv::imread(path.c_str());
cv::resize(src, dest, size);
With the Simd library instead (View here is Simd's image view type; converting directly between cv::Mat and View like this requires building with SIMD_OPENCV_ENABLE defined):
View viewsrc = src;
cv::Mat tmp = cv::Mat::zeros(size, src.type()); // first create an empty image of the target size
View viewdest = tmp;
Simd::ResizeBilinear(viewsrc, viewdest);
dest = tmp;
The result is quite impressive. Why is it so fast? Let's look at the Simd implementation:
SIMD_API void SimdResizeBilinear(const uint8_t *src, size_t srcWidth, size_t srcHeight, size_t srcStride,
    uint8_t *dst, size_t dstWidth, size_t dstHeight, size_t dstStride, size_t channelCount)
{
#ifdef SIMD_AVX512BW_ENABLE
    if (Avx512bw::Enable && dstWidth >= Avx512bw::A)
        Avx512bw::ResizeBilinear(src, srcWidth, srcHeight, srcStride, dst, dstWidth, dstHeight, dstStride, channelCount);
    else
#endif
    ...
}
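The dispatch above follows a common pattern: the library checks at compile time (via the ifdef) and at run time (via the Enable flag and the width check) which instruction set can be used, and otherwise falls through to a narrower implementation, ending in plain scalar code. A minimal sketch of that pattern; the function names and the trivial copy body here are illustrative, not Simd's real internals:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar fallback: always available on any CPU.
// (The copy body is a stand-in for the real per-row work.)
static void ResizeRowScalar(const uint8_t* src, uint8_t* dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

#ifdef SIMD_AVX512BW_ENABLE
// Only declared when the build enables AVX-512BW.
static void ResizeRowAvx512(const uint8_t* src, uint8_t* dst, size_t n);
#endif

// Dispatcher: the widest available instruction set wins, but only when the
// row is wide enough to fill at least one vector register -- this mirrors
// Simd's `dstWidth >= Avx512bw::A` check (A = 64 bytes for 512-bit registers).
static void ResizeRow(const uint8_t* src, uint8_t* dst, size_t n) {
#ifdef SIMD_AVX512BW_ENABLE
    if (n >= 64) { ResizeRowAvx512(src, dst, n); return; }
#endif
    ResizeRowScalar(src, dst, n);
}
```

The payoff of this structure is that one binary can run on any CPU yet still use the fastest code path the hardware offers.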
SIMD_AVX512BW_ENABLE selects the build for the corresponding CPU instruction set (here AVX-512BW); each supported instruction set has its own version of the code. Here is the AVX-512BW one:
template <size_t channelCount> void ResizeBilinear(
    const uint8_t *src, size_t srcWidth, size_t srcHeight, size_t srcStride,
    uint8_t *dst, size_t dstWidth, size_t dstHeight, size_t dstStride)
{
    assert(dstWidth >= A);

    size_t size = 2 * dstWidth * channelCount;
    size_t bufferSize = AlignHi(dstWidth, A) * channelCount * 2;
    size_t alignedSize = AlignHi(size, DA) - DA;
    const size_t step = A * channelCount;

    Buffer buffer(bufferSize, dstWidth, dstHeight);

    Base::EstimateAlphaIndex(srcHeight, dstHeight, buffer.iy, buffer.ay, 1);
    EstimateAlphaIndexX<channelCount>(srcWidth, dstWidth, buffer.ix, buffer.ax);

    ptrdiff_t previous = -2;
    __m512i a[2];
    for (size_t yDst = 0; yDst < dstHeight; yDst++, dst += dstStride)
    {
        a[0] = _mm512_set1_epi16(int16_t(Base::FRACTION_RANGE - buffer.ay[yDst]));
        a[1] = _mm512_set1_epi16(int16_t(buffer.ay[yDst]));

        ptrdiff_t sy = buffer.iy[yDst];
        int k = 0;
        if (sy == previous)
            k = 2;
        else if (sy == previous + 1)
        {
            Swap(buffer.bx[0], buffer.bx[1]);
            k = 1;
        }
        previous = sy;

        for (; k < 2; k++)
        {
            Gather<channelCount>(src + (sy + k) * srcStride, buffer.ix, dstWidth, buffer.bx[k]);
            uint8_t * pbx = buffer.bx[k];
            for (size_t i = 0; i < bufferSize; i += step)
                InterpolateX<channelCount>(buffer.ax + i, pbx + i);
        }
        for (size_t ib = 0, id = 0; ib < alignedSize; ib += DA, id += A)
            InterpolateY<true>(buffer.bx[0] + ib, buffer.bx[1] + ib, a, dst + id);
        size_t i = size - DA;
        InterpolateY<false>(buffer.bx[0] + i, buffer.bx[1] + i, a, dst + i / 2);
    }
}
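Stripped of the vector instructions and the clever row-buffer reuse, each output pixel is just a fixed-point bilinear blend: EstimateAlphaIndex maps every destination row/column to a source index (iy/ix) plus an integer blend weight (ay/ax) scaled by FRACTION_RANGE. A scalar sketch of the same computation for a grayscale image; the FRACTION_RANGE value and the clamping details here are illustrative, the library's actual constants and index layout differ:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

static const int FRACTION_SHIFT = 8;
static const int FRACTION_RANGE = 1 << FRACTION_SHIFT; // illustrative: 256

// Map each dst coordinate to a src index plus a fixed-point fraction,
// in the spirit of Base::EstimateAlphaIndex.
static void EstimateIndex(size_t srcSize, size_t dstSize,
                          std::vector<ptrdiff_t>& idx, std::vector<int>& alpha) {
    idx.resize(dstSize);
    alpha.resize(dstSize);
    double scale = double(srcSize) / dstSize;
    for (size_t i = 0; i < dstSize; ++i) {
        double pos = (i + 0.5) * scale - 0.5;       // center-aligned sampling
        ptrdiff_t s = ptrdiff_t(std::floor(pos));
        double frac = pos - s;
        if (s < 0) { s = 0; frac = 0; }             // clamp at the borders
        if (s >= ptrdiff_t(srcSize) - 1) { s = srcSize - 2; frac = 1; }
        idx[i] = s;
        alpha[i] = int(frac * FRACTION_RANGE + 0.5);
    }
}

// Scalar grayscale bilinear resize: pure integer arithmetic per pixel.
static void ResizeBilinearScalar(const uint8_t* src, size_t sw, size_t sh,
                                 uint8_t* dst, size_t dw, size_t dh) {
    std::vector<ptrdiff_t> ix, iy;
    std::vector<int> ax, ay;
    EstimateIndex(sw, dw, ix, ax);
    EstimateIndex(sh, dh, iy, ay);
    for (size_t y = 0; y < dh; ++y)
        for (size_t x = 0; x < dw; ++x) {
            const uint8_t* r0 = src + iy[y] * sw;   // upper source row
            const uint8_t* r1 = r0 + sw;            // lower source row
            int fx = ax[x], fy = ay[y];
            // Horizontal blend on both rows, then vertical blend.
            int t0 = r0[ix[x]] * (FRACTION_RANGE - fx) + r0[ix[x] + 1] * fx;
            int t1 = r1[ix[x]] * (FRACTION_RANGE - fx) + r1[ix[x] + 1] * fx;
            dst[y * dw + x] = uint8_t(
                (t0 * (FRACTION_RANGE - fy) + t1 * fy + (1 << (2 * FRACTION_SHIFT - 1)))
                >> (2 * FRACTION_SHIFT));
        }
}
```

The AVX-512 version performs exactly these multiplies and shifts, but on 32 16-bit values per instruction, and it reuses the horizontally interpolated row buffers (bx[0]/bx[1]) across destination rows that fall between the same pair of source rows.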
(Why is channelCount a template parameter? Most likely so the compiler can generate a separately optimized, fully specialized loop body for each channel count, instead of branching on it at run time inside the hot loop.)
The essence (and the part that is hardest to read at first) is:
__m512i a[2];
a[0] = _mm512_set1_epi16(int16_t(Base::FRACTION_RANGE - buffer.ay[yDst]));
a[1] = _mm512_set1_epi16(int16_t(buffer.ay[yDst]));
So what are __m512i and _mm512_set1_epi16? __m512i is the compiler's type for a 512-bit integer vector (a ZMM register under AVX-512), and _mm512_set1_epi16(v) broadcasts a single 16-bit value into all 32 lanes of such a register. That lets the vertical blend weights ay and FRACTION_RANGE - ay be multiplied against 32 pixel values at a time.
Reference: http://caidongrong.blog.163.com/blog/static/21424025220133282132973/
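To make the "broadcast then operate on all lanes" idea concrete, here is a plain-C++ emulation of what a set1 broadcast plus one lane-wise vector multiply does. This is illustrative only; it mimics the semantics of the intrinsics, not Intel's actual API:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

const int LANES = 32; // a 512-bit register holds 32 x 16-bit lanes
using Vec16 = std::array<int16_t, LANES>;

// Emulates _mm512_set1_epi16: one scalar copied into every lane.
Vec16 set1_epi16(int16_t v) {
    Vec16 r;
    r.fill(v);
    return r;
}

// Emulates a lane-wise 16-bit multiply (low half of each product):
// conceptually, 32 multiplications performed by a single instruction.
Vec16 mullo_epi16(const Vec16& a, const Vec16& b) {
    Vec16 r;
    for (int i = 0; i < LANES; ++i)
        r[i] = int16_t(a[i] * b[i]);
    return r;
}
```

On real hardware the loop inside mullo_epi16 disappears: the CPU executes all 32 lane multiplications in one instruction, which is exactly where the resize speedup comes from.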
To summarize: SIMD stands for single instruction, multiple data. One instruction operates on multiple data elements at once, which can substantially speed up many kinds of computation.
SIMD is the key mechanism by which CPUs implement DLP (Data-Level Parallelism); DLP is computation carried out in this SIMD style.
My own understanding: a modern CPU has wide vector registers, so a single instruction can process an entire register's worth of data (with AVX-512, 32 16-bit values at once). That is what makes the resize fly, even on a single thread.
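You do not even need to write intrinsics by hand to benefit: for simple loops, compilers can apply the same idea automatically. A sketch of a loop that GCC/Clang will typically auto-vectorize at -O2/-O3 (the flag names are the common ones; check your compiler's vectorization report to confirm):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// With AVX-512 available, the compiler can turn this into roughly n/32
// vector additions instead of n scalar additions.
void add_arrays(const int16_t* a, const int16_t* b, int16_t* out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = int16_t(a[i] + b[i]);
}
```

Hand-written kernels like Simd's still win when the access pattern is irregular (the Gather step) or when buffers are reused across iterations, which is why the library's resize beats both cv::resize and naive auto-vectorized code.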