Perception Finished Project

3D-RECON III

Novel methods for 3D models from camera shoots

Runtime

01.04.2021 - 31.03.2022

The focus of this research initiative is to advance the field of multi-view stereo (MVS) by addressing key challenges and introducing innovative solutions. One core goal is the generation of more complete 3D reconstructions for datasets characterized by a scarcity of captured camera poses. To achieve this, the approach involves generating additional virtual camera poses alongside existing ones. Depth maps corresponding to these new poses are then refined using a depth completion algorithm. This refinement process significantly enhances the density of the resultant 3D reconstructions, enabling a more accurate representation of the scene's geometry.
One of the novel directions within this research area is the exploration of MVS techniques without relying on traditional cost-volume approaches. The aim is to devise a more adaptable architecture that can generalize effectively across a wide range of data types and scenarios. Building on recent advancements in binary mask-based depth estimation from classical stereo, the project will implement a similar strategy within the MVS context, thereby innovating over the state of the art. Furthermore, the research delves into the incorporation of semantic information, particularly for planar surfaces, to enhance depth map completion. By leveraging the inherent structure of certain scene elements like walls and floors, incomplete regions of the depth map can be intelligently filled, resulting in more coherent and realistic 3D reconstructions.

The optimization of correspondence search, a computationally intensive step in the MVS pipeline, is another vital aspect of this project. A hierarchical approach is being developed to reduce the computational burden of this process. This involves the exploration of hierarchical descriptor vectors and the formulation of an appropriate cost function for the neural network responsible for descriptor computation.
Lastly, the project addresses the fusion of RGB color and depth (RGBD) data for improved feature extraction. By tailoring neural network architectures originally designed for color data to accommodate the additional depth dimension, the goal is to leverage the synergies between color and depth information to extract more informative and discriminative features.

In essence, this research initiative aims to revolutionize the field of multi-view stereo by tackling critical challenges in depth estimation, reconstruction completeness, efficiency, and feature extraction. The culmination of these efforts promises to significantly enhance the quality and accuracy of 3D reconstructions, opening up new possibilities for applications in computer vision, robotics, and augmented reality.

Goals

This project's objective is to push the boundaries of 3D reconstruction by investigating, developing, and evaluating methods that exceed current technological capabilities. The primary focus lies in incorporating semantic insights and contemporary machine learning techniques.
Two key domains are under scrutiny:
• The deployment of machine learning, driven by data, to enhance the precision and dependability of camera pose determination.
• The integration of machine learning, powered by data, into the realm of 3D data creation via multi-view stereo methodologies.

The project's achievement will be evident in various ways:
• The successful validation of these novel techniques on pertinent benchmarks, including the notable ETH 3D dataset, as well as Sony's provided datasets.
• Tangible improvements that become apparent through subjective and qualitative assessments.
• The dissemination of successful methodologies through scientific publications and potential consideration for patenting.

In essence, this initiative is driven by the ambition to not only achieve advanced 3D reconstructions but also to rigorously validate their effectiveness. By harnessing machine learning and semantic insights, this project strives to leave a notable mark on the landscape of 3D reconstruction methodologies. The project will not initially focus on reconstructing static environments such as rooms and places and will put less emphasis on capturing dynamic scenes or objects like humans.

Approach

The project approaches the given tasks by utilizing artificial environments and scenes derived from Unreal Engine 4, to increase the availability of data for training. In this way, dataset generation was streamlined using the large number of available scenes in the associated online store. Using the EasySynth plugin, it is possible to export (i) rgb images, (ii) camera poses (in COLMAP format), (iii) depth maps (in floating point format, (iv) normal vector maps (in floating point format), and (v) GT point cloud (reconstructed from the extracted depth maps and poses). Using this synthetic data, existing MVS algorithms will be refined and a multi-detector for local feature extraction, based on a retrained deep-matcher on top of the MD-net structure, developed.

Expected and Achieved Results

The project expects the following results: (i) enhancing the accuracy and robustness of estimating camera positions within a structure-frommotion (SFM) system; (ii) improving reconstruction of indoor spaces, specifically flat surfaces such as walls, ceilings, and floors, surpassing the capabilities of current leading methods; (iii) achieving both qualitative and quantitative assessments of the newly developed techniques; (iv) creating software implementations of these novel methods, and (v) finally showcasing the extended practicality of 3D reconstruction methods, highlighting their efficacy in generating higher quality results. The project has been concluded and achieved its expected results.