Some Research Notes on 3D Reconstruction

Published by 钢之炼丹术师 on 2024-07-07

GPT's answer about 3D-GS

If I have a 3D Gaussian splatting representation, can I generate a 2D image given arbitrary camera intrinsics and extrinsics?

Yes, given a 3D Gaussian splatting representation and the intrinsic and extrinsic camera parameters, you can generate a 2D image from any desired viewpoint. Here’s a step-by-step explanation of how this process works:

1. 3D Gaussian Representation

3D Gaussians are defined by their mean (center), covariance matrix (defining their shape and orientation in 3D space), and color. These parameters collectively represent the scene.
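For concreteness, a single Gaussian could be stored as a (mean, covariance, color) tuple, with the covariance composed from a rotation and per-axis scales so that it stays symmetric positive semi-definite, as in the 3DGS parameterization. The make_gaussian helper and the plain RGB color below are illustrative simplifications (the original method additionally stores opacity and spherical-harmonic color coefficients), not the paper's implementation:

import numpy as np

def make_gaussian(mean, rotation, scale, color):
    # Covariance composed as R_g S S^T R_g^T, which keeps it symmetric PSD.
    S = np.diag(scale)                          # per-axis standard deviations
    covariance = rotation @ S @ S.T @ rotation.T
    return (np.asarray(mean, dtype=float), covariance, color)

# A red Gaussian 5 units in front of the origin, stretched along the x-axis
g = make_gaussian(mean=[0.0, 0.0, 5.0],
                  rotation=np.eye(3),
                  scale=[0.5, 0.1, 0.1],
                  color=(1.0, 0.0, 0.0))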

2. Camera Parameters

To render a 2D image, you need:

  • Intrinsic parameters: These define the camera's internal characteristics, such as focal length, principal point, and distortion coefficients.
  • Extrinsic parameters: These define the camera's position and orientation in the world space, typically represented by a rotation matrix and a translation vector.
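As a concrete (hypothetical) example of these two parameter sets, a pinhole intrinsic matrix K and a world-to-camera pose (R, T) might look like the following; the numeric values are arbitrary placeholders:

import numpy as np

# Intrinsics: pinhole camera matrix K (lens distortion ignored here)
fx, fy = 800.0, 800.0            # focal lengths in pixels
cx, cy = 320.0, 240.0            # principal point
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics: world-to-camera rotation R (3x3) and translation T (3,)
R = np.eye(3)                    # camera looking down the world +z axis
T = np.array([0.0, 0.0, 2.0])    # world origin sits 2 units in front of the camera

intrinsic_params = K
extrinsic_params = (R, T)        # the pair expected by the rendering outline below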

3. View Transformation

The 3D Gaussians need to be transformed from the world coordinate system to the camera coordinate system using the extrinsic parameters. This involves:

  • Rotation: Aligning the scene with the camera's viewpoint.
  • Translation: Positioning the scene relative to the camera.

The transformation can be represented as:
\[ \mathbf{x}_{\text{camera}} = R \cdot \mathbf{x}_{\text{world}} + T \]
where \( R \) is the rotation matrix and \( T \) is the translation vector.

4. Projection Transformation

The transformed 3D Gaussians are then projected onto the 2D image plane using the intrinsic parameters. This projection involves:

  • Mapping the 3D coordinates to 2D coordinates on the image plane.
  • Applying perspective projection using the camera’s focal length and principal point.

The projection uses the intrinsic matrix \( K \):
\[ \mathbf{x}_{\text{image}} \propto K \cdot \mathbf{x}_{\text{camera}} \]
where \( K \) is constructed from the intrinsic parameters, and the result is divided by its depth (third) component to obtain pixel coordinates.

5. Gaussian Splatting

The 3D Gaussians are then "splatted" onto the 2D image plane. This involves:

  • Projecting each Gaussian's mean to the 2D plane.
  • Transforming the 3D covariance matrix to 2D to define the shape and orientation of the projected Gaussian.
  • Rendering the 2D Gaussians as ellipses based on the transformed covariance matrix.
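The covariance transformation in the last two bullets is commonly written (following the EWA-splatting approximation used by 3DGS) as:

\[ \Sigma' = J \, R \, \Sigma \, R^{\top} J^{\top} \]

where \( \Sigma \) is the 3D covariance, \( R \) is the rotation part of the view transform, and \( J \) is the Jacobian of the (locally affine approximation of the) perspective projection evaluated at the Gaussian's camera-space mean; the resulting \( 2 \times 2 \) matrix \( \Sigma' \) defines the projected ellipse.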

6. Visibility and Occlusion Handling

To render the scene correctly, visibility and occlusion need to be managed. In Gaussian splatting this is usually done by sorting the splats by depth and alpha-blending them front to back (the 3DGS rasterizer does this per image tile), so that closer Gaussians correctly attenuate those farther from the camera, rather than by a hard z-buffer test.
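For reference, here is a minimal per-pixel sketch of that depth-ordered, front-to-back blending; the (depth, alpha, color) tuple format is an assumption made for this illustration, not the interface of any particular renderer:

import numpy as np

def composite_pixel(splats):
    # `splats` is a list of (depth, alpha, color) tuples, where alpha is the
    # Gaussian's opacity already weighted by its 2D density at this pixel.
    splats = sorted(splats, key=lambda s: s[0])    # nearest first
    color = np.zeros(3)
    transmittance = 1.0                            # light still passing through
    for _, alpha, c in splats:
        color += transmittance * alpha * np.asarray(c)
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:                   # early stop once effectively opaque
            break
    return color

# Example: a near, half-transparent red splat over a far, opaque green one
print(composite_pixel([(2.0, 0.5, (1, 0, 0)), (5.0, 1.0, (0, 1, 0))]))
# -> [0.5, 0.5, 0.0]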

Example Code Outline

Here is an outline of what the process might look like in Python using a hypothetical rendering function:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def plot_ellipse(mean, covariance, color, n_std=2.0):
    # Build an ellipse patch from a 2D mean and a 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(covariance)    # ascending eigenvalues
    # Major-axis direction = eigenvector of the largest eigenvalue
    angle = np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1]))
    width, height = 2.0 * n_std * np.sqrt(np.maximum(eigvals[::-1], 0.0))
    return Ellipse(xy=mean, width=width, height=height, angle=angle,
                   facecolor=color, alpha=0.5)

def render_image(gaussians, intrinsic_params, extrinsic_params):
    R, T = extrinsic_params   # world-to-camera rotation (3x3) and translation (3,)
    K = intrinsic_params      # intrinsic matrix (3x3)

    projected_gaussians = []
    for mean, covariance, color in gaussians:
        # Transform mean from world space to camera space
        camera_mean = R @ mean + T

        # Project mean to 2D (perspective division by depth)
        image_mean = K @ camera_mean
        image_mean = image_mean[:2] / image_mean[2]

        # Transform covariance to camera space
        camera_covariance = R @ covariance @ R.T

        # Jacobian of the perspective projection at the camera-space mean,
        # used to push the 3D covariance down to a 2x2 image-plane covariance
        fx, fy = K[0, 0], K[1, 1]
        x, y, z = camera_mean
        J = np.array([[fx / z, 0.0, -fx * x / z**2],
                      [0.0, fy / z, -fy * y / z**2]])
        projected_covariance = J @ camera_covariance @ J.T

        # Store the 2D Gaussian together with its depth for ordered drawing
        projected_gaussians.append((z, image_mean, projected_covariance, color))

    # Render the projected Gaussians, farthest first (painter's algorithm)
    fig, ax = plt.subplots()
    for _, mean, covariance, color in sorted(projected_gaussians, reverse=True,
                                             key=lambda g: g[0]):
        ax.add_patch(plot_ellipse(mean, covariance, color))

    plt.xlim(0, K[0][2] * 2)
    plt.ylim(0, K[1][2] * 2)
    plt.show()

# Example usage with hypothetical data
gaussians = generate_gaussians()  # Your 3D Gaussians as (mean, covariance, color)
intrinsic_params = get_intrinsic_params()  # Your camera's intrinsic matrix
extrinsic_params = get_extrinsic_params()  # Your camera's (R, T) extrinsics

render_image(gaussians, intrinsic_params, extrinsic_params)

In this example, generate_gaussians, get_intrinsic_params, and get_extrinsic_params are placeholders for functions that would provide the necessary data. The plot_ellipse helper draws each projected Gaussian as an ellipse on the 2D image plane.

Conclusion

By following these steps, you can render a 2D image from any viewpoint using a 3D Gaussian splatting representation and given camera parameters. This approach is powerful for creating realistic images from complex 3D data efficiently.

The MVSplat paper


Data structure: contains url, timestamp, camera parameters (haven't fully figured these out), consecutive frame images, and a key.

The camera field contains H, T, and data; probably need to check the pixelSplat paper to understand it.

3D Reconstruction: 3D Gaussian Splatting

3D Gaussian Splatting is an innovative technique published at SIGGRAPH 2023. The basic idea is to use Structure-from-Motion (SfM) to estimate an explicit sparse point cloud from a set of multi-view images. For each point in that cloud, a radiance-field-like Gaussian ellipsoid probability model is constructed; the parameters of each ellipsoid are obtained through optimization, yielding a voxel-like discrete representation that supports multi-view volume rendering and rasterization.
The problem with neural radiance fields is that the probability prediction cannot be confined to a controllable region. Gaussian ellipsoids offer an effective solution: the probability prediction is compressed into a set of Gaussian distributions anchored on the sparse point cloud. Each prediction is computed with one point of the sparse cloud as its anchor and a specific support range as its boundary, so volume rendering no longer deals with the whole scene but with a set of small regions bounded by ellipsoids. The global optimization is thereby decomposed into a set of local optimizations, which naturally improves computational efficiency. Gaussian splatting was proposed along these lines to balance rendering efficiency and accuracy.
https://blog.csdn.net/aliexken/article/details/136072393
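As a toy illustration of this local-support idea (a conceptual sketch only, not the paper's actual implementation), each Gaussian's contribution can simply be truncated beyond a few standard deviations from its anchor point:

import numpy as np

def gaussian_weight(x, mean, covariance, cutoff_sigma=3.0):
    # Unnormalized Gaussian density at x, truncated to a local support region:
    # points farther than `cutoff_sigma` standard deviations (in Mahalanobis
    # distance) from the anchor contribute nothing, so each Gaussian only
    # affects a small neighborhood instead of the whole scene.
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    m2 = d @ np.linalg.inv(covariance) @ d      # squared Mahalanobis distance
    if m2 > cutoff_sigma ** 2:
        return 0.0
    return float(np.exp(-0.5 * m2))

# A query point one unit away from an isotropic Gaussian with sigma = 0.5
print(gaussian_weight([1.0, 0.0, 0.0], mean=[0.0, 0.0, 0.0],
                      covariance=0.25 * np.eye(3)))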

https://zhuanlan.zhihu.com/p/661569671

According to the Zhihu write-up above, the main strength of 3D Gaussians is quickly generating 2D rendered images by integrating along a viewing direction; it is not a great fit for the LEGO assembly task.

3D Gaussians may not really be a 3D model

https://github.com/graphdeco-inria/gaussian-splatting/issues/418

The 2D image splatted from the Gaussians may look very much like what the underlying 3D object would look like when projected to that viewpoint, but it "only looks like it, and may be quite far from the physical reality". When you need a point cloud or surface mesh that is closer to the physical object, the challenge is not only in fully exploiting all the information carried by the Gaussian points (not just the mean and covariance, i.e. the center and shape, but also the SH features, and possibly neighboring points, since rendering involves a blending step), but also in the fact that the Gaussian point cloud output by GS is itself not that accurate relative to the physical object.

3D Reconstruction: occupancy-network methods

3D Reconstruction: voxel-model methods

A Voxel Generator Based on Autoencoder

Generates voxel models from simple shapes

3D Shape Generation and Completion through Point-Voxel Diffusion

Generates voxel models from depth maps

https://arxiv.org/pdf/2104.03670

Image-to-Voxel Model Translation for 3D Scene Reconstruction and Segmentation

https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123520103.pdf
Partially open source: https://github.com/vlkniaz/SSZ

Looks fairly close to what we want to do.
Does not report runtime/speed.

TMVNet : Using Transformers for Multi-view Voxel-based 3D Reconstruction

https://openaccess.thecvf.com/content/CVPR2022W/PBVS/papers/Peng_TMVNet_Using_Transformers_for_Multi-View_Voxel-Based_3D_Reconstruction_CVPRW_2022_paper.pdf


Generates voxel models from multi-view images; the inputs are fairly clean.
Not open source.

VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

https://openaccess.thecvf.com/content/CVPR2023/papers/Li_VoxFormer_Sparse_Voxel_Transformer_for_Camera-Based_3D_Semantic_Scene_Completion_CVPR_2023_paper.pdf

From an NVIDIA group.
Open source: https://github.com/NVlabs/VoxFormer

All of the above are monocular.


Multi-view Stereo

Scan2CAD: Learning CAD Model Alignment in RGB-D Scans

https://arxiv.org/pdf/1811.11187v1

Robust Attentional Aggregation of Deep Feature Sets for Multi-view 3D Reconstruction

https://arxiv.org/pdf/1808.00758v2


This paper's inputs and outputs look fairly similar to ours.

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction
https://arxiv.org/pdf/1604.00449v1
This one is similar too.

Generates 3D models from images with clean backgrounds

Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from Single and Multiple Images

https://arxiv.org/pdf/2006.12250 2020
https://gitlab.com/hzxie/Pix2Vox

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

https://arxiv.org/pdf/2308.08724 2023

Mentioned among this paper's baselines.

LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction

https://arxiv.org/pdf/2106.12102
https://github.com/faridyagubbayli/LegoFormer

Works with both single-view and multi-view input.

VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction

https://arxiv.org/pdf/2203.07553 2022

DiffPoint: Single and Multi-view Point Cloud Reconstruction with ViT Based Diffusion Model

https://arxiv.org/pdf/2402.11241v1 2024

Covers the baselines in this area.
Not open source.

Datasets for voxel generation


Seemingly just ShapeNet and Data3D-R2N2.

https://paperswithcode.com/sota/3d-object-reconstruction-on-data3dr2n2

Interim summary (06/24)

DTU dataset
Methods like LRGT train on rendered images from the ShapeNet dataset
Methods like MVSFormer produce a depth map from the reference view; slow, and only a single viewpoint
3DGS methods like MVSplat and pixelSplat produce rendered images; point clouds can also be extracted from the Gaussians, but reportedly not accurately enough
Monocular occupancy methods such as VoxFormer can be extended to multiple cameras, but their datasets are all large-scale ones like KITTI
New dataset from the Shanghai AI Laboratory:
https://omniobject3d.github.io/
