Implicit 4D Gaussian Splatting for Fast Motion with Large Inter-Frame Displacements

Hanyang University
ICLR 2026
SPIN-4DGS Teaser

SPIN-4DGS enables faithful 4D Gaussian splatting reconstruction under fast motions with large inter-frame displacements.

Overview

Abstract

Recent 4D Gaussian Splatting (4DGS) methods often fail under fast motion with large inter-frame displacements, where Gaussian attributes are poorly learned during training and fast-moving objects are lost from the reconstruction. In this work, we introduce the Spatiotemporal Position Implicit Network for 4DGS, coined SPIN-4DGS, which learns Gaussian attributes from explicitly collected spatiotemporal positions rather than modeling temporal displacements, thereby enabling more faithful splatting under fast motions with large inter-frame displacements. To avoid the heavy memory overhead of explicitly optimizing attributes across all spatiotemporal positions, we instead predict them with a lightweight feed-forward network trained under a rasterization-based reconstruction loss. Consequently, SPIN-4DGS learns shared representations across Gaussians, effectively capturing spatiotemporal consistency and enabling stable, high-quality Gaussian splatting even under challenging motions. Across extensive experiments, SPIN-4DGS consistently achieves higher fidelity under large displacements, with clear improvements in PSNR and SSIM on challenging sports scenes from the CMU Panoptic dataset. For example, SPIN-4DGS notably outperforms the strongest baseline, D3DGS, by +1.83 dB PSNR on the Basketball scene.
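To make the memory argument concrete, here is a back-of-the-envelope sketch in Python. The Gaussian count, frame count, attribute layout, and network size are illustrative assumptions, not values reported in the paper; the point is only the scaling of explicit per-frame storage versus a shared implicit network.

```python
# Rough memory comparison; all numbers below are illustrative assumptions.
NUM_GAUSSIANS = 200_000      # assumed number of Gaussians in a scene
NUM_FRAMES = 150             # assumed number of time steps
ATTR_DIMS = 3 + 4 + 3 + 1    # scale, rotation (quaternion), RGB color, opacity
BYTES_PER_FLOAT = 4

# Explicit route: a separate attribute vector per Gaussian per frame.
explicit_bytes = NUM_GAUSSIANS * NUM_FRAMES * ATTR_DIMS * BYTES_PER_FLOAT

# Implicit route: one shared feed-forward network predicts attributes from
# spatiotemporal positions; its footprint is roughly its parameter count.
NETWORK_PARAMS = 2_000_000   # assumed hash table + decoder parameters
implicit_bytes = NETWORK_PARAMS * BYTES_PER_FLOAT

print(f"explicit: {explicit_bytes / 1e9:.2f} GB")   # ~1.32 GB
print(f"implicit: {implicit_bytes / 1e6:.2f} MB")   # ~8 MB
```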

Interactive Qualitative Results

Visual Comparisons

Visual comparisons between our SPIN-4DGS method and strong baselines on challenging sports scenes.

Choose a scene, then drag the divider to inspect each reconstruction pair.

Failure Modes of Prior 4DGS

The Challenge in Fast Motion

Large inter-frame displacement exposes two recurring failure modes in existing 4DGS pipelines: unstable attributes in explicit formulations and broken canonical assumptions in deformable ones.

Motion Regime: Large inter-frame displacement
Explicit 4DGS: Cross-frame attribute interference
Deformable 4DGS: Static canonical mismatch
Explicit 4DGS

Attribute Collapse

Shared Gaussian attributes become inconsistent across fast-moving frames.

Attribute Collapse in Explicit 4DGS

Failure Mode

Cross-frame interference accumulates during training, so one Gaussian is pushed to represent incompatible temporal states.

Observed Effect

The reconstruction becomes blurry and temporally unstable as attributes collapse over time.

Deformable 4DGS

Canonical Miss

A static canonical space cannot keep up with rapid non-rigid motion.

Canonical Miss in Deformable 4DGS

Failure Mode

The canonical representation no longer matches the true object position once the inter-frame displacement becomes too large.

Observed Effect

Fast-moving objects disappear or fragment because the model cannot recover them in canonical space.

Key Insight
Decouple position modeling from attribute learning

This observation motivates our core idea: collect reliable spatiotemporal positions first, then predict Gaussian attributes from them implicitly.

Approach

Method Overview


Illustration of the overall framework. SPIN-4DGS consists of two stages: (a) Spatiotemporal Position Estimation and (b) Implicit Network for 4DGS. Specifically, (a) we slice Gaussians along the temporal axis to obtain spatiotemporal position sets and refine them with a rasterization loss. Then, (b) the refined positions are normalized and passed through a 4D hash encoder and multibranch decoders to predict Gaussian attributes (scale, rotation, color, and opacity).
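The following is a minimal PyTorch-style sketch of stage (b), the implicit attribute network. The Fourier-feature encoder below is only a simple stand-in for the paper's 4D hash encoder, and the layer widths, activations, and head parameterizations are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn


class FourierEncoder4D(nn.Module):
    """Sin/cos features for normalized (x, y, z, t) positions.

    A lightweight stand-in for the 4D hash encoder used in the paper.
    """

    def __init__(self, num_freqs: int = 6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs).float())

    def forward(self, pos):                        # pos: (N, 4) in [0, 1]
        ang = pos[..., None] * self.freqs          # (N, 4, F)
        feat = torch.cat([ang.sin(), ang.cos()], dim=-1)
        return feat.flatten(start_dim=-2)          # (N, 8 * F)


class SpinAttributeNet(nn.Module):
    """Predicts per-Gaussian attributes from spatiotemporal positions."""

    def __init__(self, num_freqs: int = 6, hidden: int = 128):
        super().__init__()
        self.encoder = FourierEncoder4D(num_freqs)
        in_dim = 8 * num_freqs
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Multibranch decoders: one small head per attribute group.
        self.scale_head = nn.Linear(hidden, 3)
        self.rotation_head = nn.Linear(hidden, 4)   # quaternion
        self.color_head = nn.Linear(hidden, 3)      # RGB; SH coeffs would be larger
        self.opacity_head = nn.Linear(hidden, 1)

    def forward(self, pos):                         # pos: (N, 4) = (x, y, z, t)
        h = self.trunk(self.encoder(pos))
        return {
            "scale": torch.exp(self.scale_head(h)),                        # > 0
            "rotation": nn.functional.normalize(self.rotation_head(h), dim=-1),
            "color": torch.sigmoid(self.color_head(h)),                    # [0, 1]
            "opacity": torch.sigmoid(self.opacity_head(h)),                # [0, 1]
        }
```

In this sketch, `pos` would be the refined, normalized spatiotemporal positions from stage (a); the predicted attributes are then passed to the rasterizer and supervised with the reconstruction loss.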

Benchmarks

Experimental Results

SPIN-4DGS consistently improves reconstruction fidelity across sports, free-view, and indoor dynamic scenes while remaining stable under challenging motion.

CMU Panoptic Sports

CMU Panoptic Sports Results

CMU Panoptic Sports. SPIN-4DGS achieves the best PSNR on all six scenes with an average of 30.11 dB, outperforming Realtime-4DGS (28.38 dB) by +1.73 dB and D3DGS (28.70 dB) by +1.41 dB. SSIM remains consistently high at 0.93, demonstrating strong robustness under extremely fast motion and large inter-frame displacements.

Neu3DV Benchmark

Neu3DV Benchmark Results

Neu3DV Benchmark. Using the default SPIN-4DGS training setup without refinement, our method achieves the highest average PSNR of 32.19 dB, surpassing Realtime-4DGS (32.01 dB) by +0.18 dB and clearly outperforming deformable baselines such as Grid4D (31.49 dB) and 4DGaussian (31.01 dB). SSIM remains consistently high at 0.95, confirming stable and sharp reconstructions even in complex scenes.

MeetRoom Benchmark

MeetRoom Benchmark Results

MeetRoom Benchmark. Evaluated on three indoor scenes (Discussion, Trimming, VR Headset), SPIN-4DGS consistently achieves the highest PSNR across all baselines. It outperforms the strongest explicit method, Realtime-4DGS (30.47 dB), by an average margin of +1.6 dB. These results demonstrate strong robustness in cluttered indoor environments with relatively small motions.

Ablation

Effect of Spatiotemporal Position Slicing

Without slicing, Gaussians are optimized jointly across time, causing cross-frame interference and blurred fast-motion reconstructions. Our spatiotemporal slicing decouples Gaussians per frame, improving fidelity while reducing training cost.

Qualitative comparison of spatiotemporal slicing. Without slicing, Gaussians interfere across frames, causing blur and artifacts. Our slicing decouples Gaussians per time step, yielding sharper and temporally consistent reconstructions.
Slicing        PSNR ↑   SSIM ↑   Train Time ↓   Train Memory ↓
Without        27.48    0.89     1h 20m         18 GB
With (ours)    28.96    0.92     25m            9 GB

Evaluated on the Football sequence (batch size fixed to 1). Slicing improves fidelity and reduces training cost.
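For intuition, the sketch below shows one way the per-frame decoupling could be expressed in code. The `(N, T, 3)` trajectory layout and the function name are hypothetical and only illustrate the idea of slicing positions along the temporal axis; they are not taken from the released implementation.

```python
import torch


def slice_positions(trajectories: torch.Tensor, timestamps: torch.Tensor):
    """Split Gaussian trajectories into per-frame (x, y, z, t) position sets.

    trajectories: (N, T, 3) Gaussian centers at each of T time steps.
    timestamps:   (T,) normalized times in [0, 1].
    Returns a list of T tensors of shape (N, 4). Each set is refined
    independently against that frame's rasterization loss, so Gaussians
    at different time steps no longer interfere with one another.
    """
    sliced = []
    for t_idx, t in enumerate(timestamps):
        xyz = trajectories[:, t_idx, :]                    # (N, 3)
        t_col = torch.full((xyz.shape[0], 1), float(t))    # (N, 1) time column
        sliced.append(torch.cat([xyz, t_col], dim=-1))     # (N, 4)
    return sliced
```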

Ablation

Effect of Spatiotemporal Position Refinement

The videos below visualize the refinement ablation in Table 2. Even with a small refinement budget, SPIN-4DGS already preserves stable object structures, while additional iterations mainly sharpen fine details.

Refinement Iterations

0.5K Iterations

Stable object structures are already recovered without collapse, showing that SPIN-4DGS remains effective even under a minimal refinement budget.


1K Iterations

Additional refinement improves local geometry and appearance, making fine structures such as ball edges and racket nets more consistent.


2K Iterations

More iterations further sharpen fine details, especially around motion-sensitive regions, while preserving the same stable overall structure.

Qualitative results on spatiotemporal position refinement. Figure 6 and the videos above show that SPIN-4DGS already captures stable object structures with only 0.5K refinement iterations. Increasing the budget to 1K and 2K mainly improves fine details such as ball edges and racket nets.

Reference

BibTeX

@inproceedings{kim2026implicit,
  title={Implicit 4D Gaussian Splatting for Fast Motion with Large Inter-Frame Displacements},
  author={Seung-gyeom Kim and Areum Kim and Yongjae Yoo and Sukmin Yun},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=MWtXs60n38}
}