【CVPR】Person Re-ID Paper Abstracts【work in progress…】
-
1. (Oral)
Group Consistent Similarity Learning via Deep CRF for Person Re-Identification
Dapeng Chen, Dan Xu, Hongsheng Li, Nicu Sebe, Xiaogang Wang
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8649-8658
Summary
Person re-identification benefits from deep neural networks (DNNs) that learn accurate similarity metrics and robust feature embeddings. However, most current methods impose only local constraints during similarity learning. This paper incorporates constraints over large image groups by combining a conditional random field (CRF) with a deep neural network. The proposed method learns "local similarity" metrics for image pairs while taking into account the dependencies among all images in a group, forming "group similarities". During training, multiple images are used jointly to model the relationship between local and group similarities in a unified CRF; at test time, multi-scale local similarities are combined to produce the predicted similarity. An approximate inference scheme is adopted to estimate the group similarity, enabling end-to-end training. Extensive experiments show that combining the DNN with the CRF makes the learned multi-scale local similarities more robust, and the overall results surpass the previous state of the art by large margins on three widely used benchmark datasets.
key insights in abstract
- CRF combined with a deep neural network; the CRF models group-level similarity
- Uses both P2G (probe-gallery) and G2G (gallery-gallery) similarities
- Multi-scale local similarities
Abstract
Person re-identification benefits greatly from deep neural networks (DNN) to learn accurate similarity metrics and robust feature embeddings. However, most of the current methods impose only local constraints for similarity learning. In this paper, we incorporate constraints on large image groups by combining the CRF with deep neural networks. The proposed method aims to learn the "local similarity" metrics for image pairs while taking into account the dependencies from all the images in a group, forming "group similarities". Our method involves multiple images to model the relationships among the local and global similarities in a unified CRF during training, while combines multi-scale local similarities as the predicted similarity in testing. We adopt an approximate inference scheme for estimating the group similarity, enabling end-to-end training. Extensive experiments demonstrate the effectiveness of our model that combines DNN and CRF for learning robust multi-scale local similarities. The overall results outperform those by state-of-the-arts with considerable margins on three widely-used benchmarks.
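As a rough illustration of the group-similarity idea, the sketch below refines P2G similarities with G2G similarities through a mean-field-style update. This is only a minimal PyTorch sketch of the intuition, not the paper's exact formulation: the function name, the blending weight alpha and the fixed number of iterations are assumptions made for illustration.

```python
import torch

def refine_group_similarity(s_p2g, s_g2g, num_iters=3, alpha=0.5):
    """Refine probe-to-gallery (P2G) similarities using gallery-to-gallery (G2G)
    similarities, in the spirit of mean-field inference over a fully connected CRF.

    s_p2g: (G,) local P2G similarity scores for one probe against G gallery images
    s_g2g: (G, G) local G2G similarity scores among the gallery images
    """
    # Turn G2G similarities into normalized affinities (pairwise message weights).
    w = torch.softmax(s_g2g, dim=1)
    s = s_p2g.clone()
    for _ in range(num_iters):
        # Blend each image's own (unary) score with its neighbors' current estimates.
        s = alpha * s_p2g + (1 - alpha) * (w @ s)
    return s

# Toy usage: 4 gallery images with random G2G similarities.
s_p2g = torch.tensor([0.2, 0.9, 0.3, 0.1])
s_g2g = torch.rand(4, 4)
print(refine_group_similarity(s_p2g, s_g2g))
```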
-
2. (Spotlight)
Person Transfer GAN to Bridge Domain Gap for Person Re-Identification
Longhui Wei, Shiliang Zhang, Wen Gao, Qi Tian;
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 79-88
Summary
Although person re-identification (ReID) performance has been significantly boosted, many challenging issues in real scenarios remain unsolved, such as complex scenes, lighting variations, viewpoint and pose changes, and the large number of identities in a camera network. To push research towards solving these issues, this paper contributes a new dataset named MSMT17 with several important features: (1) the raw videos are taken by a 15-camera network deployed in both indoor and outdoor scenes; (2) the videos cover a long period of time and therefore contain complex lighting variations; (3) it contains a large number of annotated pedestrians, i.e., 4,101 identities and 126,441 bounding boxes. The authors also observe that a domain gap commonly exists between datasets and is the main cause of the severe performance drop when training and testing on different datasets: a model trained on one dataset cannot be applied effectively to a new test set. To reduce the expensive cost of annotating new training samples, a Person Transfer Generative Adversarial Network (PTGAN) is proposed to bridge the domain gap. Extensive experiments show that PTGAN can substantially narrow the domain gap.
key insights in abstract
- New large-scale dataset (MSMT17): 15 cameras, 4,101 identities, 126,441 bounding boxes
- PTGAN transfers persons across datasets to bridge the domain gap
Abstract
Although the performance of person Re-Identification (ReID) has been significantly boosted, many challenging issues in real scenarios have not been fully investigated, e.g., the complex scenes and lighting variations, viewpoint and pose changes, and the large number of identities in a camera network. To facilitate the research towards conquering those issues, this paper contributes a new dataset called MSMT17 with many important features, e.g., 1) the raw videos are taken by an 15-camera network deployed in both indoor and outdoor scenes, 2) the videos cover a long period of time and present complex lighting variations, and 3) it contains currently the largest number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes. We also observe that, domain gap commonly exists between datasets, which essentially causes severe performance drop when training and testing on different datasets. This results in that available training data cannot be effectively leveraged for new testing domains. To relieve the expensive costs of annotating new training samples, we propose a Person Transfer Generative Adversarial Network (PTGAN) to bridge the domain gap. Comprehensive experiments show that the domain gap could be substantially narrowed-down by the PTGAN.
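The core intuition of person transfer can be sketched as an identity-preservation term: the foreground person should stay unchanged while the style and background adapt to the target domain. The minimal PyTorch snippet below is only an illustrative sketch (the helper name, the precomputed foreground mask and the L1 form are assumptions); the full PTGAN objective also includes an adversarial style-transfer loss.

```python
import torch
import torch.nn.functional as F

def identity_preservation_loss(x_source, x_transferred, fg_mask):
    """Hypothetical identity term: after transferring a person image to the target
    domain's style, the foreground person (selected by fg_mask) should stay close
    to the source image, preserving identity while the background/style changes."""
    return F.l1_loss(x_transferred * fg_mask, x_source * fg_mask)

# Toy usage: batch of 2 RGB images (64x64) with a binary foreground mask.
x_src = torch.rand(2, 3, 64, 64)
x_tr = torch.rand(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(identity_preservation_loss(x_src, x_tr, mask))
```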
3. (Spotlight)
Disentangled Person Image Generation
Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, Mario Fritz
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 99-108
Summary
key insights in abstract
Abstract
Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor, respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on the Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.
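The first stage of the two-stage pipeline can be sketched as a multi-branch autoencoder. The minimal PyTorch module below only illustrates that structure: the layer sizes, the 18-keypoint pose input and the 64x64 output are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DisentangledReconstructor(nn.Module):
    """Sketch of stage one: separate encoders for foreground, background and pose
    produce embeddings that are concatenated and decoded back into the image.
    Layer sizes and input/output shapes are illustrative placeholders."""

    def __init__(self, dim=128):
        super().__init__()
        self.enc_fg = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(32, dim))
        self.enc_bg = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(32, dim))
        self.enc_pose = nn.Sequential(nn.Linear(18 * 2, dim), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(3 * dim, 64 * 64 * 3), nn.Tanh())

    def forward(self, fg_img, bg_img, pose_keypoints):
        # Encode each factor separately, then re-compose the full image.
        z = torch.cat([self.enc_fg(fg_img),
                       self.enc_bg(bg_img),
                       self.enc_pose(pose_keypoints.flatten(1))], dim=1)
        return self.dec(z).view(-1, 3, 64, 64)

# Toy usage: one sample with foreground crop, background, and 18 2-D keypoints.
model = DisentangledReconstructor()
out = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 18, 2))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```

In the second stage, three mapping functions are trained adversarially to map Gaussian noise into each learned embedding space, so new foregrounds, backgrounds and poses can be sampled.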
4. (Spotlight) Multi-Shot Pedestrian Re-Identification via Sequential Decision Making
-
Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-Identification
Shuang Li, Slawomir Bak, Peter Carr, Xiaogang Wang;
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 369-378
Summary
Video-based person re-identification matches video clips of people across non-overlapping cameras. Most existing methods tackle the problem by encoding each video frame of a person in its entirety and computing an aggregate representation across all frames. In practice, however, people are often partially occluded, which corrupts the extracted features. The paper therefore proposes a novel spatiotemporal attention model that automatically discovers a diverse set of distinctive body parts, so that useful information can be extracted from all frames without suffering from occlusion and misalignment. The network learns multiple spatial attention models and applies a diversity regularization term so that different models do not attend to the same body part. Features extracted from local image regions are organized by spatial attention model and combined using temporal attention. As a result, the network learns latent representations of the face, torso and other body parts from the most informative image patches of the entire video sequence. Experiments on three datasets show that the framework outperforms state-of-the-art approaches by large margins on multiple metrics.
Abstract
Video-based person re-identification matches video clips of people across non-overlapping cameras. Most existing methods tackle this problem by encoding each video frame in its entirety and computing an aggregate representation across all frames. In practice, people are often partially occluded, which can corrupt the extracted features. Instead, we propose a new spatiotemporal attention model that automatically discovers a diverse set of distinctive body parts. This allows useful information to be extracted from all frames without succumbing to occlusions and misalignments. The network learns multiple spatial attention models and employs a diversity regularization term to ensure multiple models do not discover the same body part. Features extracted from local image regions are organized by spatial attention model and are combined using temporal attention. As a result, the network learns latent representations of the face, torso and other body parts using the best available image patches from the entire video sequence. Extensive evaluations on three datasets show that our framework outperforms the state-of-the-art approaches by large margins on multiple metrics.
key insights in abstract
- Extracts features from local body parts rather than whole-frame encodings
- Combines spatial attention (with diversity regularization) and temporal attention (see the sketch below)
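A diversity regularizer of the kind described above can be sketched as a penalty on the pairwise overlap of the spatial attention maps. The snippet below is a minimal PyTorch illustration using a cosine-similarity Gram matrix; the paper's exact regularization term may differ.

```python
import torch

def diversity_regularizer(attn):
    """attn: (K, N) tensor holding K spatial attention maps over N image locations.
    Penalizes pairwise overlap between different attention models so that each one
    learns to focus on a different body part."""
    a = attn / (attn.norm(dim=1, keepdim=True) + 1e-8)   # L2-normalize each map
    gram = a @ a.t()                                      # pairwise cosine similarities
    off_diag = gram - torch.eye(attn.size(0), device=attn.device)
    return off_diag.pow(2).sum()

# Toy usage: 6 attention models over a 16x8 feature map (128 locations).
attn = torch.softmax(torch.rand(6, 128), dim=1)
print(diversity_regularizer(attn))
```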