Towards Temporal Fusion Beyond the Field of View
for Camera-based Semantic Scene Completion

Jongseong Bae*, Junwoo Ha*, Jinnyeong Heo*, Yeongin Lee*, Ha Young Kim†
Yonsei University
AAAI 2026

*Indicates Equal Contribution, †Indicates Corresponding Author

Overview

Given a sequence of \( n \) RGB images \( \{I_i\}_{i=t-n+1}^{t} \), our objective is to predict voxel-wise semantic class probabilities at time \( t \) within a voxelized 3D space \( \mathbf{V} \in \mathbb{R}^{X \times Y \times Z \times P} \), where \( P \) is the number of semantic classes, including the empty class. Our model follows the standard architectural paradigm of modern camera-based SSC (Jiang et al. 2024; Yu et al. 2024; Li et al. 2024a; Wang et al. 2025b), consisting of three main components: viewing transformation, voxel processing, and semantic prediction. The proposed C3DFusion module targets the viewing transformation stage, which is pivotal to overall SSC performance because it performs both 2D-to-3D lifting and temporal geometry fusion.


Figure 2: An overview of our model, highlighting the proposed C3DFusion. The symbol ‘⊕’ denotes feature concatenation.
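
For concreteness, the sketch below shows how the three stages described above might be wired together in PyTorch. The module names, tensor shapes, and the placeholder where C3DFusion would sit are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SSCPipeline(nn.Module):
    """Three-stage camera-based SSC skeleton: viewing transformation,
    voxel processing, and semantic prediction (cf. Figure 2).
    A hedged sketch; all internals are illustrative assumptions."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        # Stage 1: viewing transformation. In the paper's model this is
        # where C3DFusion lifts 2D features to 3D points and fuses current
        # and historical frames; here it is a stand-in identity map.
        self.view_transform = nn.Identity()
        # Stage 2: voxel processing, e.g. a small 3D convolutional backbone.
        self.voxel_net = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Stage 3: semantic prediction over the P classes (incl. empty).
        self.head = nn.Conv3d(feat_dim, num_classes, kernel_size=1)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, C, X, Y, Z) features already lifted from the
        # n RGB frames {I_i}; the lifting itself is model-specific.
        v = self.view_transform(voxel_feats)
        v = self.voxel_net(v)
        logits = self.head(v)             # (B, P, X, Y, Z)
        return logits.softmax(dim=1)      # voxel-wise class probabilities
```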

Abstract

Recent camera-based 3D semantic scene completion (SSC) methods have increasingly explored leveraging temporal cues to enrich the features of the current frame. However, while these approaches primarily focus on enhancing in-frame regions, they often struggle to reconstruct critical out-of-frame areas near the sides of the ego-vehicle, even though previous frames commonly contain valuable contextual information about these unseen regions. To address this limitation, we propose the Current-Centric Contextual 3D Fusion (C3DFusion) module, which generates hidden-region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from both current and historical frames. C3DFusion performs enhanced temporal fusion through two complementary techniques, historical context blurring and current-centric feature densification: the former suppresses noise from inaccurately warped historical point features by attenuating their scale, while the latter enhances current point features by increasing their volumetric contribution. Simply integrated into standard SSC architectures, C3DFusion demonstrates strong effectiveness, significantly outperforming state-of-the-art methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Furthermore, it exhibits robust generalization, achieving notable performance gains when applied to other baseline models.
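
To make the two operations concrete, here is a minimal sketch of what historical context blurring and current-centric feature densification could look like on point-feature tensors. The function signatures, the scalar attenuation factor, and the jitter-based replication are our assumptions for illustration; the paper's actual formulation may differ.

```python
import torch

def historical_context_blurring(hist_feats: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Attenuate the scale of warped historical point features (B, N, C)
    so that inaccurately warped points contribute less to the fused
    geometry. `alpha` is an illustrative attenuation factor."""
    return alpha * hist_feats

def current_centric_densification(curr_points: torch.Tensor,
                                  curr_feats: torch.Tensor,
                                  jitter: float = 0.05,
                                  copies: int = 2):
    """Increase the volumetric contribution of current-frame points
    (B, N, 3) with features (B, N, C) by replicating them with small
    positional offsets before voxelization. `jitter` and `copies` are
    hypothetical hyperparameters chosen for this sketch."""
    reps_p = [curr_points] + [
        curr_points + jitter * torch.randn_like(curr_points)
        for _ in range(copies)
    ]
    reps_f = [curr_feats] * (copies + 1)
    return torch.cat(reps_p, dim=1), torch.cat(reps_f, dim=1)
```

In this sketch, both operations act before the point features are splatted into the voxel grid, so attenuated historical features and densified current features trade off naturally during fusion, consistent with the current-centric design described in the abstract.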

Qualitative Results

The visualization results present a qualitative comparison between our method and existing open-source camera-based approaches on the SemanticKITTI validation set. In scenarios where critical semantic objects, such as cars and persons, are not visible in the current frame yet lie very close to the ego-vehicle, temporal-frame-based baselines such as VoxFormer-T and HTCL-S fail to recover these out-of-view (OOV) regions, performing on par with CGFormer, which relies on a single frame only. In contrast, our approach successfully reconstructs the missing structures, demonstrating effective use of temporal cues to capture their spatial context even without direct visual observations in the current frame.


Figure A.3: Visual comparison of our model against other recent camera-based methods on the SemanticKITTI validation set.

Quantitative Results

SemanticKITTI hidden test set


Table 1: Quantitative results on the SemanticKITTI hidden test set. ‘OOV’ on the right indicates performance on out-of-view regions, evaluated on the validation set using our implementation. Bold and underline denote the best and second-best results, respectively.

SSCBench-KITTI-360 hidden test set


Table 2: Quantitative results on the SSCBench-KITTI-360 test set. ‘OOV’ on the right indicates performance on out-of-view regions, evaluated on the test set using our implementation. Bold and underline denote the best and second-best results, respectively.

Poster

BibTeX

@misc{bae2025temporalfusionfieldview,
      title={Towards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion},
      author={Jongseong Bae and Junwoo Ha and Jinnyeong Heo and Yeongin Lee and Ha Young Kim},
      year={2025},
      eprint={2511.12498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.12498},
}