cuda全局内存中的“分区冲突”

2020-10-08 本文已影响0人 bazinga_dmc

CUDA将GPU的内存模型暴露给开发人员，包括全局内存、常量/纹理内存、共享内存、本地内存、寄存器，不同类型内存的读取和访问的模式有所差别。在不合理的访问模式下，全局内存访问可能发生“分区冲突”（partion camping），其类似于共享内存中的bank conflict，只不过粒度较大（资料[1]中介绍的架构下分区宽度为256字节，而bank宽度通常为4或8字节）。全局内存按照256字节划分为多个分区，所有针对全局内存的访问操作由不同的分区完成，如果多个内存访问操作地址落在同一个分区中，这些访问操作将被串行处理，对性能有较大的影响（全局内存访问本身就是高延迟的操作）。下图是一个分区总数为8的全局内存的分区情况。

以全局内存分区数量为8为例，下面图片给出了发生和不发生”分区冲突“的全局内存访问情况。

在发生”分区冲突“时，SM-1到SM-30的全局内存访问操作完全变成串行访问（全部由分区1处理）。我们下面分别给出发生和不发生”分区冲突“的核函数示例，通过执行该核函数可以对”分区冲突”对性能的影响有大致了解。

不发生“分区冲突”

//TYPEcanbea2-,4-oran8-byteword
__global__ void readBenchmark(TYPE *d_arr){
    //assignuniquepartitionstoblocks,
    int numOfPartitions=8;
    int curPartition=blockIdx.x%numOfPartitions;
    int partitionSize=256;//256bytes
    int elemsInPartition=partitionSize/sizeof(int);
    //jumptouniquepartition
    int startIndex=elemsInPartition*curPartition;
    TYPE readVal=0;
    //Loopcounter’x’ensurescoalescing
    for(int x=0;x<ITERATIONS;x+=16){
    /*offsetguaranteestorestrictthe
    indextothesamepartition*/
        int offset=((threadIdx.x+x)%elemsInPartition);
        int index=startIndex+offset;
        //Readfromglobalmemorylocation
        readVal=d_arr[index];
    }
    /*Writeoncetomemorytopreventtheabove
    codefrombeingoptimizedout*/
    d_arr[0]=readVal;
}

发生“分区冲突”

//TYPEcanbea2-,4-oran8-byteword
__global__ void readBenchmarkPC(TYPE *d_arr){
    int partitionSize=256;//256bytes
    int elemsInPartition=partitionSize/sizeof(TYPE);
    TYPE readVal=0;
    //Loopcounter’x’ensurescoalescing.
    for(int x=0;x<ITERATIONS;x+=16){
        /*allblocksreadfromasinglepartition
        tosimulatePartitionCamping*/
        int index=((threadIdx.x+x)%elemsInPartition);
        //Readfromglobalmemorylocation
        readVal=d_arr[index];
    }
    /*Writeoncetomemorytopreventtheabove
    codefrombeingoptimizedout*/
    d_arr[0]=readVal;
}

具体执行配置：网格配置为256x1，线程块配置为32x32，数据类型（TYPE）为整型，数据个数为256x8，迭代次数为4096x4096，设备为RTX2080ti。下面是执行结果。

从图中我们可以看到，即使在第一个核函数执行更多指令的情况下，“分区冲突”还是使核函数的性能下降了4倍左右。

注意：

在编译时需要禁用一级缓存，否则读操作可能由缓存完成而不访问全局内存，从而无法观察到“分区冲突”现象。

参考资料

书籍《cuda C权威编程指南》
论文《Bounding the Effect of Partition Camping in GPU Kernels》

cuda全局内存中的“分区冲突”

猜你喜欢

热点阅读