We propose a new method for depth denoising. Our model, trained in a self-supervised way, takes color (a) and depth (b) data coming from the sensor of an iPhone X as input and produces a denoised and refined depth map (c). For reference, a Kinect v2 depth map capture is included (d).

Abstract

Consumer-level depth cameras and depth sensors embedded in mobile devices enable numerous applications, such as AR games and face identification. However, the quality of the captured depth is sometimes insufficient for 3D reconstruction, tracking, and other computer vision tasks. In this paper, we propose a self-supervised approach to denoise and refine the depth coming from a low-quality sensor. We record simultaneous RGB-D sequences with unsynchronized lower- and higher-quality cameras and solve the challenging problem of aligning the sequences both temporally and spatially. We then train a deep neural network to denoise the lower-quality depth using the matched higher-quality data as a source of supervision. We experimentally validate our method against state-of-the-art filtering-based and deep denoising techniques and demonstrate its application to 3D object reconstruction tasks, where our approach leads to more detailed fused surfaces and better tracking.

Main Idea

We propose a learning-based method for denoising the output of a lower-quality (LQ) depth sensor using the supervision of a higher-quality (HQ) depth sensor. We recorded simultaneous RGB-D sequences with unsynchronized lower- and higher-quality cameras. We consider an in-the-wild scenario where hardware clock synchronization and prior extrinsic calibration of the sensors are not possible. The resulting dataset consists of people captured with the rig in a variety of poses and lighting conditions.

We solve the challenging problem of aligning the sequences both temporally and spatially:
Temporal Alignment: for each pair of sequences we search for a time shift that aligns the timestamps of the two sensors, so that a simple nearest-neighbour search between the timestamps gives us the best frame mapping. The correctness of an alignment is measured by the spatial alignment score (see the first sketch below).
Spatial Alignment: for each pair of matched frames we use the SuperPoint detector to extract a set of 2D correspondences. We then optimize an extrinsic matrix that transforms the HQ sensor's coordinate system into the LQ one (see the second sketch below).
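
A minimal sketch of the temporal search in Python, assuming sorted per-frame timestamp arrays; spatial_score is a hypothetical callable standing in for the spatial alignment score described above, and the exact search procedure in the paper may differ.

import numpy as np

def best_time_shift(lq_ts, hq_ts, candidate_shifts, spatial_score):
    # lq_ts, hq_ts: sorted 1-D arrays of frame timestamps (seconds).
    # candidate_shifts: iterable of time shifts (seconds) to evaluate.
    # spatial_score: hypothetical callable scoring a frame pairing (higher is better).
    best_shift, best_score = None, -np.inf
    for shift in candidate_shifts:
        shifted = hq_ts + shift
        # Nearest-neighbour match: for each shifted HQ timestamp,
        # pick the closest LQ timestamp.
        idx = np.clip(np.searchsorted(lq_ts, shifted), 1, len(lq_ts) - 1)
        take_left = shifted - lq_ts[idx - 1] < lq_ts[idx] - shifted
        matches = np.where(take_left, idx - 1, idx)
        score = spatial_score(matches)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift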
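
For the spatial step, a sketch under similar caveats: assuming the 2D SuperPoint matches have already been back-projected to 3D points using each sensor's depth map and intrinsics, a closed-form Kabsch solve recovers the extrinsics. The paper's actual optimization may differ, and in practice one would wrap this in RANSAC to reject outlier matches.

import numpy as np

def fit_extrinsics(hq_pts, lq_pts):
    # hq_pts, lq_pts: (N, 3) arrays of matched 3D keypoints.
    # Returns R, t such that lq ~= R @ hq + t (least-squares rigid fit).
    hq_c, lq_c = hq_pts.mean(0), lq_pts.mean(0)
    H = (hq_pts - hq_c).T @ (lq_pts - lq_c)   # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = lq_c - R @ hq_c
    return R, t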

To exploit the temporal information available in consecutive frames, we use a recurrent model. We employ a two-level training approach based on the out-of-fold predictions method. First, we train a first-level model to denoise depth on a per-frame basis. Then, as a second-level model, we train a convolutional recurrent model to account for temporal correlations in the data (see the sketch below).
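
A hypothetical sketch of the out-of-fold scheme; train_denoiser and train_recurrent are assumed placeholders for the per-frame and convolutional recurrent models, and in practice the folds would be split by recorded sequence so that frames from one sequence never leak between folds.

import numpy as np
from sklearn.model_selection import KFold

def two_level_training(frames, targets, n_folds=5):
    # Level 1: train a per-frame denoiser on K-1 folds and predict on the
    # held-out fold, so the second-level model only ever sees predictions
    # the first-level model did not make on its own training data.
    oof_preds = np.empty_like(targets)
    for train_idx, val_idx in KFold(n_folds).split(frames):
        model = train_denoiser(frames[train_idx], targets[train_idx])  # hypothetical
        oof_preds[val_idx] = model.predict(frames[val_idx])
    # Level 2: the convolutional recurrent model consumes sequences of
    # out-of-fold first-level predictions to exploit temporal correlations.
    return train_recurrent(oof_preds, targets)                         # hypothetical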

Paper


arXiv


BibTeX

@inproceedings{Ahan-LQ2HQ,
    author    = {Akhmedkhan Shabanov and Ilya Krotov and
                 Nikolay Chinaev and Vsevolod Poletaev and
                 Sergei Kozlukov and Igor Pasechnik and
                 Bulat Yakupov and Artsiom Sanakoyeu and
                 Vadim Lebedev and Dmitry Ulyanov},
    title     = {Self-supervised Depth Denoising Using Lower- and Higher-quality RGB-D sensors},
    booktitle = {8th International Conference on 3D Vision},
    url       = {https://arxiv.org/abs/2009.04776},
    year      = {2020},
}