Pix2Vox Paper Reading Notes

2019-11-15 FantDing

title: Pix2Vox Paper Reading Notes
date: 2019-11-06 21:12:22
tags: paper, todo, 3D

Paper: "Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images"

Abstract

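Prior work:
- Method: an RNN fuses the features extracted from sequential input images^[1]
Drawbacks:
- RNN-based methods are order-sensitive: the same set of images fed in a different order gives inconsistent reconstructions
- long-term memory loss^[2]
Contributions:
- a novel framework for single-view and multi-view reconstruction
- a context-aware fusion module
Results:
- SOTA on ShapeNet and Pix3D
- 24 times faster than 3D-R2N2 in terms of backward inference time [probably because 3D-R2N2 is RNN-based]
- superior generalization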

Introduction

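Prior work
- Traditional methods: require feature matching across views, but differences in viewpoint, appearance and similar factors sometimes make the features impossible to match
- RNN-based models: order-dependent; long-term memory loss; time-consuming
This paper
- encoder-decoder: each view is processed independently, which eliminates the effect of the input order
- context-aware fusion module: selects high-quality reconstructions and fuses them^[3]
- refiner: refines the fused volume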

Method

Overview

The network accepts single or multiple RGB images as input.

Network architecture

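Pix2Vox-F has fewer parameters, while Pix2Vox-A is more accurate. The main difference is that Pix2Vox-F has no refiner, and therefore no refiner loss (RLoss); the convolution kernel sizes also differ slightly.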

Pix2Vox-F

Pix2Vox-A

3.2.1 Encoder

VGG16 with batch normalization, followed by three appended layers; the appended layers differ between the two variants.

3.2.2 Decoder

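Five 3D (transposed) convolutional layers.
The output is a $32 \times 32 \times 32$ voxel grid.
The last layer is followed by a sigmoid, so that the output at every voxel is a probability.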

3.2.3 Context-aware Fusion

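The authors assume that the volumes reconstructed from different views are all expressed in the object's canonical view; each volume simply recovers the parts visible from its own view better. The context-aware fusion module is responsible for fusing these most reliable parts.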

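How the fused volume $v_f$ is generated from the coarse volumes $v_i^c$:
- For every view: fuse $c_i^1$ and $c_i^2$ to obtain the context $c_i$
- For every view: feed $c_i$ into the context scoring network (a few 3D convolutions) to obtain a per-voxel score $m_i$
- Across all $m_i$: normalize at each voxel position (the paper uses softmax) to obtain the normalized per-voxel scores $s_i$, which act as per-voxel weights [this is the step where the views become coupled]
- Take the weighted sum of all $v_i^c$ using $s_i$ to obtain $v_f$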

Context-aware Fusion
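The fusion steps above can be sketched in NumPy. This is only an illustration under stated assumptions: the scoring network is not implemented, so the raw scores $m_i$ are random placeholders, and the function name `context_aware_fusion` and the array layout `(n_views, D, H, W)` are choices made here, not taken from the paper's code.

```python
import numpy as np

def context_aware_fusion(coarse_volumes, scores):
    """Fuse per-view coarse volumes with per-voxel softmax weights.

    coarse_volumes: (n_views, D, H, W) occupancy probabilities, one v_i^c per view
    scores:         (n_views, D, H, W) raw scores m_i (assumed to come from
                    the context scoring network, which is not modeled here)
    """
    # Softmax across the view axis; subtracting the max keeps exp() stable.
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=0, keepdims=True)   # s_i, sums to 1 per voxel
    fused = (weights * coarse_volumes).sum(axis=0)   # v_f
    return fused, weights

# Placeholder inputs: 3 views of a 32^3 volume.
rng = np.random.default_rng(0)
volumes = rng.random((3, 32, 32, 32))
scores = rng.normal(size=(3, 32, 32, 32))
fused, weights = context_aware_fusion(volumes, scores)

# With a single view the softmax weight is 1 everywhere,
# so the fused volume equals the coarse volume.
fused_single, weights_single = context_aware_fusion(volumes[:1], scores[:1])
```

Because the softmax runs over the view axis rather than over space, each voxel picks its own mixture of views, and the result does not depend on the order in which the views are stacked.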

3.2.4 Refiner

Structure: a 3D encoder-decoder with U-Net connections

3.2.5 Loss Function

loss function

Here $N$ is the total number of voxels.
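As a rough illustration of a loss averaged over $N$ voxels, here is a minimal sketch that assumes the loss is the mean voxel-wise binary cross-entropy between the predicted occupancy and the ground truth. The function name `voxel_bce_loss`, the argument names and the clipping constant `eps` are assumptions introduced for this example.

```python
import numpy as np

def voxel_bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy over all N voxels (assumed form of the loss).

    pred: predicted occupancy probabilities in [0, 1]
    gt:   ground-truth occupancy, 0 or 1, same shape as pred
    """
    # Clip to avoid log(0); eps is a numerical-safety choice, not from the paper.
    pred = np.clip(pred, eps, 1.0 - eps)
    per_voxel = -(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))
    return per_voxel.mean()  # divides by N, the number of voxels

gt = np.zeros((32, 32, 32))
uninformed = np.full((32, 32, 32), 0.5)
loss_uninformed = voxel_bce_loss(uninformed, gt)  # every voxel contributes ln 2
loss_perfect = voxel_bce_loss(gt, gt)             # close to zero
```

A prediction of 0.5 everywhere gives a loss of $\ln 2$, which is a handy reference point when reading training curves.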

4 Experiments

4.1 Datasets and Metrics

Dataset

ShapeNet
- synthetic images
- use a subset: 13 major categories and 43,783 3D models
Pix3D
- real images
- use the 2,894 untruncated and unoccluded chair images^[5]

Evaluation Metrics

The metric can be viewed as a 3D IoU.

Metrics
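A minimal sketch of a voxel-wise 3D IoU, assuming the predicted probabilities are binarized with a threshold before being compared with the ground truth. The function name `voxel_iou` and the default threshold of 0.3 are assumptions for illustration only.

```python
import numpy as np

def voxel_iou(pred, gt, threshold=0.3):
    """Intersection over union between a thresholded prediction and ground truth.

    pred:      predicted occupancy probabilities
    gt:        ground-truth occupancy (0 or 1), same shape as pred
    threshold: binarization threshold (hypothetical default)
    """
    occupied = pred > threshold
    gt = gt.astype(bool)
    intersection = np.logical_and(occupied, gt).sum()
    union = np.logical_or(occupied, gt).sum()
    # Two empty volumes are treated as a perfect match.
    return intersection / union if union > 0 else 1.0

# Toy 4x4x4 example: two 32-voxel slabs that overlap in 16 voxels.
gt = np.zeros((4, 4, 4))
gt[:2] = 1
pred = np.zeros((4, 4, 4))
pred[1:3] = 0.9
iou = voxel_iou(pred, gt)  # 16 / 48
```

In the toy example the intersection is 16 voxels and the union is 48, so the IoU is 1/3.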

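4.2 Implementation Details

$224 \times 224$ RGB image -> $32 \times 32 \times 32$ voxel grid
First 250 epochs:
- Only single-view images are fed in, so the context-aware fusion module does not need to be trained. The module exists to compute the weights of the volumes from different views; with a single input view the weight is necessarily 1, which amounts to outputting the result without any weighting.
Last 100 epochs:
- random numbers of input images
- train the whole network, i.e. with the context-aware fusion module included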

4.3 Reconstruction of Synthetic Images

Single-view reconstruction results
Multi-view reconstruction results

4.4 Reconstruction of Real-world Images

We test our methods for single-view reconstruction on the Pix3D dataset.^[6]
We use the pipeline of RenderForCNN to generate 60 images for each 3D CAD model in the ShapeNet dataset. (I could not make sense of this.)^[7]

4.5 Reconstruction of Unseen Objects

All models are trained on the 13 major categories of ShapeNet.
Unseen objects:
- prediction on the remaining 44 categories of ShapeNetCore, using 24 random views

4.6 Ablation Study

Context-aware fusion

Replacing the context-weighted fusion with a plain average
- performs worse
Replacing context-aware fusion with a 3D convolutional LSTM^[8] to fuse the views
- performs even worse than the average

Refiner

The more input views there are, the less the refiner helps.

4.7 Complexity

image

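4.8 Discussion

Visualizing the score maps shows that the scoring scheme is effective.
On the multi-view stereo (MVS) task: in the LSM model^[9], replacing the RNN with the context-aware fusion module gives better reconstructions, which further supports the advantage of the module.
The authors plan to improve the reconstruction resolution in future work by introducing GANs.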

Conclusion and Future Works

The authors plan to extend Pix2Vox to reconstruct 3D objects from RGB-D images.

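[1] 3D-R2N2
[2] The figures show only three images, so where does the long-term memory come from?
[3] The authors say "To the best of our knowledge, it is the first time to exploit context across multiple views for 3D reconstruction.", but I doubt it is really the first time. How could earlier architectures not have considered fusing multiple views?
[4] Presumably no RNN is used, so how is it done?
[5] Why emphasize these two conditions? Are all the images of chairs?
[6] Why is only single-view used?
[7] What does this mean? Don't the real-world images come from the Pix3D dataset?
[8] What is this, and how does it work?
[9] What is this?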