- Output format: [x, y, z, w, h, l, Θ, Φ, Ψ] (3D position, box dimensions, and three rotation angles)
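
As a minimal sketch of how these nine values could be turned into the eight corners of a 3D box, the Python below makes several assumptions that the text does not state: (x, y, z) is the box center, l, w, h extend along the box's local x, y, z axes, and Θ, Φ, Ψ are rotations about z, y, x composed as R = Rz(Θ)·Ry(Φ)·Rx(Ψ). The helper names `rotation_matrix` and `box_corners` are hypothetical.

```python
import math
from itertools import product


def rotation_matrix(theta, phi, psi):
    """R = Rz(theta) * Ry(phi) * Rx(psi). The angle order is an assumption."""
    cz, sz = math.cos(theta), math.sin(theta)
    cy, sy = math.cos(phi), math.sin(phi)
    cx, sx = math.cos(psi), math.sin(psi)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def box_corners(box):
    """Return the 8 corners of a box given as [x, y, z, w, h, l, theta, phi, psi]."""
    x, y, z, w, h, l, theta, phi, psi = box
    rot = rotation_matrix(theta, phi, psi)
    corners = []
    for fx, fy, fz in product((-0.5, 0.5), repeat=3):
        # Corner in the box's own frame (assumed: l along x, w along y, h along z).
        local = (fx * l, fy * w, fz * h)
        corners.append(tuple(
            center + sum(rot[i][j] * local[j] for j in range(3))
            for i, center in enumerate((x, y, z))
        ))
    return corners
```

With all three angles set to zero the corners are simply the center plus or minus half of each dimension; many driving datasets keep only the yaw angle and fix the other two at zero.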

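- Monocular object detection from RGB
- Monocular object detection from RGB-D
- 3D object detection from LiDAR point clouds
- Stereo object detection from RGB
- Stereo object detection from RGB-D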
II. RGB-based Monocular/Stereo Object Detection

- Templates Matching based Methods: built on region proposals.
- Geometric Properties based Methods: run 2D detection first, then convert the 2D boxes into 3D detection boxes.
- Pseudo LiDAR based Methods.
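
The Pseudo LiDAR category in the list above is usually described as estimating a depth map from the image(s) and back-projecting it into a point cloud that a LiDAR-style detector can consume. The sketch below shows only that back-projection step under a pinhole camera model; the function name `depth_to_points` and the intrinsics `fx, fy, cx, cy` are illustrative assumptions, not taken from any specific paper's code.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows, metres) into camera-frame 3D points.

    Pinhole model (assumed): X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The resulting points live in the camera frame; a real pipeline would also need the camera-to-LiDAR extrinsics, which are omitted here.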
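1. Templates Matching based Methods.
3DOP was, at the time of writing, the best-performing method for 3D bounding-box detection with a stereo camera; it extends Fast R-CNN into the 3D domain. The original paper was published at NIPS 2015, and Fast R-CNN is neither as accurate as Faster R-CNN and regression-based methods nor anywhere near real-time, so processing a single image takes 4.0 s.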
(3) Deep MANTA
2. Geometric Properties based Methods.
(3) Stereo R-CNN
(4) FCOS3D: Fully Convolutional One-stage Monocular 3D Object Detection (1st place in the vision-only track of the NIPS 2020 nuScenes 3D detection challenge)
3. Pseudo LiDAR based Methods.
(3) Stereo R-CNN
III. Point-Cloud-based 3D Object Detection

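- bbox: accuracy of the 2D detection boxes
- 3d: accuracy of the 3D detection boxes
- bev: accuracy of the detection boxes in the bird's-eye view (BEV)
- aos: accuracy of the detected object's orientation angle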
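
For the aos entry above, KITTI's average orientation similarity is built on a per-detection cosine term, (1 + cos Δθ) / 2, where Δθ is the difference between the predicted and ground-truth orientation. The sketch below shows just that term; the function name `orientation_similarity` is hypothetical, and the averaging over recall levels is left out.

```python
import math


def orientation_similarity(pred_angle, gt_angle):
    """Return (1 + cos(delta)) / 2 in [0, 1]; 1 means identical heading."""
    return (1.0 + math.cos(pred_angle - gt_angle)) / 2.0
```

A detection pointing the opposite way (Δθ = π) scores 0, and one that is off by a quarter turn scores 0.5.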
1、Rotated Intersection over Union (IoU3D)

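Based on IoU3D, true positives (TP) and false positives (FP) can be defined:
- A detection is a TP if its IoU with a ground-truth box is ≥ the threshold.
- Otherwise it is an FP.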
- an undetected ground-truth bounding box is regarded as False Negative (FN).
- Note that true negative (TN) does not apply since there exist infinite possible candidates.
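- On KITTI, the threshold is set to 0.7 for cars and 0.5 for pedestrians and cyclists.
- IoU (Intersection over Union): IoU measures how much two regions overlap. It is the area of their intersection divided by the area of their union. In object detection, a model output is counted as correct when its IoU with the ground truth (gt) exceeds a threshold (0.5 or 0.7).
- Precision: the fraction of retrieved items that are actually relevant.
- Recall: the fraction of relevant items that were retrieved.
- AP (Average Precision): the mean of the precision values along the Precision-Recall curve.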
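
To make the rotated IoU3D concrete, here is a rough pure-Python sketch. It assumes boxes are given as (x, y, z, w, h, l, yaw) with (x, y, z) the box center, z the vertical axis, and rotation only about that axis; these conventions and all function names are assumptions for illustration. The bird's-eye-view overlap is computed with Sutherland-Hodgman polygon clipping and then multiplied by the overlap in height.

```python
import math


def bev_corners(x, y, w, l, yaw):
    """Counter-clockwise footprint corners of a box rotated by yaw."""
    c, s = math.cos(yaw), math.sin(yaw)
    local = [(l / 2, w / 2), (-l / 2, w / 2), (-l / 2, -w / 2), (l / 2, -w / 2)]
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in local]


def _side(a, b, p):
    """Positive if p lies left of the directed edge a->b, negative if right."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip `subject` by a convex counter-clockwise `clip`."""
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i - 1], clip[i]
        inputs, output = output, []
        for j in range(len(inputs)):
            p, q = inputs[j - 1], inputs[j]
            sp, sq = _side(a, b, p), _side(a, b, q)
            if (sp >= 0) != (sq >= 0):
                # The segment p->q crosses the clip edge: keep the crossing point.
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if sq >= 0:
                output.append(q)
    return output


def polygon_area(poly):
    """Shoelace formula; returns 0 for an empty polygon."""
    return abs(sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
                   for i in range(len(poly)))) / 2.0


def iou3d(box_a, box_b):
    """Rotated 3D IoU of two boxes (x, y, z, w, h, l, yaw)."""
    xa, ya, za, wa, ha, la, ta = box_a
    xb, yb, zb, wb, hb, lb, tb = box_b
    overlap_bev = polygon_area(clip_polygon(bev_corners(xa, ya, wa, la, ta),
                                            bev_corners(xb, yb, wb, lb, tb)))
    overlap_h = max(0.0, min(za + ha / 2, zb + hb / 2) - max(za - ha / 2, zb - hb / 2))
    inter = overlap_bev * overlap_h
    union = wa * ha * la + wb * hb * lb - inter
    return inter / union if union > 0 else 0.0
```

Dropping the height term and dividing the footprint overlap by the footprint union gives the corresponding BEV IoU. Production code typically uses vectorised or GPU implementations instead of this loop-based version.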

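When the Precision × Recall curve (PRC) is plotted, the area under the curve is commonly taken as a measure of a detector's performance. In practice, however, the zigzag shape of the PRC makes its area hard to compute accurately. KITTI adopts the AP@S_N metric as an alternative, which sidesteps the area computation altogether.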
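
The chain from IoU values to TP/FP labels, to precision and recall, and finally to a sampled AP can be sketched as follows. This is a simplified illustration, not the official KITTI evaluation code: it assumes detections are already sorted by descending confidence, matches each detection greedily to the best unmatched ground truth, and averages interpolated precision over N evenly spaced recall levels from 0 to 1 (N = 11 by default here; the exact sampling set is an assumption). All function names are hypothetical.

```python
def match_detections(ious, threshold):
    """ious[i][j] = IoU of detection i with ground truth j.

    Returns one flag per detection: True for TP, False for FP.
    Each ground-truth box can be matched at most once.
    """
    matched = set()
    flags = []
    for row in ious:
        best_j, best_iou = None, threshold
        for j, iou in enumerate(row):
            if j not in matched and iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
        flags.append(best_j is not None)
    return flags


def precision_recall(flags, num_gt):
    """Cumulative precision and recall after each detection."""
    tp = fp = 0
    precisions, recalls = [], []
    for is_tp in flags:
        tp += is_tp
        fp += not is_tp
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return precisions, recalls


def average_precision(precisions, recalls, num_samples=11):
    """Mean of interpolated precision at `num_samples` recall levels."""
    total = 0.0
    for k in range(num_samples):
        level = k / (num_samples - 1)
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for p, r in zip(precisions, recalls) if r >= level),
                     default=0.0)
    return total / num_samples
```

The number of false negatives is the number of ground-truth boxes minus the number of TPs. Because only a fixed set of recall levels is sampled, the zigzag shape of the curve no longer matters.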
- The KITTI 3D object detection benchmark [16] is divided into 7,481 training samples and 7,518 testing samples. The training samples are commonly divided into a train set (3,712 samples) and a val set (3,769 samples) following [10], which is also adopted here.
- 80 epochs on the KITTI dataset
- an NVIDIA Tesla V100 (32 GB) GPU.
- The Waymo Open Dataset consists of 798 training sequences and 202 validation sequences. The dataset also includes 150 test sequences without ground truth data. The dataset provides object labels in the full 360° field of view with a multi-camera rig. We only use the front camera and only consider object labels in the front camera's field of view (50.4°) for the task of monocular object detection, and provide results on the validation sequences. We sample every 3rd frame from the training sequences to form our training set (51,564 samples) due to the large dataset size and high frame rate.
- 10 epochs on the Waymo Open Dataset
- an NVIDIA Tesla V100 (32 GB) GPU.
3. nuScenes: monocular, stereo, LiDAR/radar point clouds
NuScenes consists of multi-modal data collected from 1000 scenes, including RGB images from 6 cameras, points from 5 Radars, and 1 LiDAR. It is split into 700/150/150 scenes for training/validation/testing. There are overall 1.4M annotated 3D bounding boxes from 10 categories. In addition, nuScenes uses different metrics, distance-based mAP and NDS, which can help evaluate our method from another perspective.
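
For reference, the nuScenes detection score (NDS) combines mAP with five mean true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE) as NDS = (5·mAP + Σ (1 − min(1, mTP))) / 10, and its mAP matches detections to ground truth by center distance on the ground plane rather than by IoU. The sketch below only evaluates that formula; the function name `nuscenes_detection_score` and the input values in the usage note are made up for illustration.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS from mAP and the five mean TP errors.

    tp_errors: mapping such as {"mATE": ..., "mASE": ..., "mAOE": ...,
    "mAVE": ..., "mAAE": ...}. Each error is clipped to 1 and turned
    into a score 1 - error, so lower errors give a higher NDS.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0
```

For example, with a hypothetical mAP of 0.5 and all five errors equal to 0.5, the score is (2.5 + 2.5) / 10 = 0.5. mAP carries half of the total weight, and the five error terms share the other half.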