ACM MM 2026

HyperGaussian: Referring 4D Gaussian Splatting

HyperGaussian Authors
Open Source Release
HyperGaussian teaser
HyperGaussian reconstructs dynamic scenes as 4D Gaussians and performs training-free referring inference over complex spatiotemporal queries — including temporally varying, multi-target, reasoning-intensive, and zero-target conditions.

Abstract

Dynamic 4D Gaussian representations have shown strong capability in reconstruction and novel-view synthesis, yet they remain insufficient for complex natural-language referring understanding in dynamic scenes. Existing 4D Gaussian methods are primarily designed for rendering-oriented dynamic modeling, while current semantic extensions often rely on scene-specific optimization or are evaluated on relatively simple query settings.


We study Referring 4D Gaussian Splatting (R4DGS), a task that targets realistic 4D referring understanding under temporally varying queries, multi-target and reasoning-intensive queries, and zero-target or distractor queries. To support this task, we introduce R4D-Bench-QA, a benchmark comprising 12 dynamic scenes, 266 structured query annotations, and four query categories. We further present HyperGaussian, a unified framework that couples generalized dynamic Gaussian reconstruction, entity-centric scene structuring via an EntityBank, and training-free referring inference via a Qwen-based Hyper-Planner. HyperGaussian decomposes complex queries into trackable object phrases and spatiotemporal constraints, and performs query-conditioned grounding without retraining the underlying scene representation.

R4DGS task illustration
Illustration of the Referring 4D Gaussian Splatting (R4DGS) task. Given a dynamic scene and a natural-language query, the goal is to identify and segment the referred Gaussian entities across time — covering temporally varying, multi-target, reasoning-intensive, and zero-target query types.

R4D-Bench-QA

R4D-Bench-QA is the first benchmark for Referring 4D Gaussian Splatting. It covers 12 dynamic scenes with 266 structured query annotations across four categories: snapshot, temporal-state, exclusion, and reasoning queries. Each query is paired with per-frame Gaussian-level ground-truth segmentation masks.
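To make the notion of a "structured query annotation" concrete, here is a hypothetical sketch of what one entry could look like. All field names and values below are illustrative assumptions, not the released benchmark schema:

```python
# Hypothetical illustration of one R4D-Bench-QA annotation.
# Field names and values are assumptions for exposition only.
annotation = {
    "scene": "scene_03",
    "category": "temporal-state",   # snapshot | temporal-state | exclusion | reasoning
    "query": "the cup only while it is being lifted",
    "target_entity_ids": [4],       # empty list for zero-target queries
    "gt_masks": "masks/scene_03/query_017/",  # per-frame Gaussian-level masks
}

valid_categories = {"snapshot", "temporal-state", "exclusion", "reasoning"}
print(annotation["category"] in valid_categories)
```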

R4D-Bench-QA statistics
R4D-Bench-QA statistics: query word cloud (left), query composition by category (center), and token-length distribution (right). The benchmark is designed to stress-test spatiotemporal reasoning beyond simple object lookup.

Method Overview

HyperGaussian framework overview
Overview of the HyperGaussian framework. Dynamic Reconstruction builds the 4D Gaussian scene representation. The Qwen-based Hyper-Planner then drives three sequential stages: static entity segmentation via Grounded-SAM2, semantic assignment to populate the EntityBank, and training-free spatiotemporal grounding conditioned on the decomposed query.
Dynamic Reconstruction

4D Gaussian Scene Representation

The scene is reconstructed as a set of 4D Gaussians with explicit temporal extent. A learned contextual time warp models deformable motion, producing a compact and differentiable dynamic scene representation that serves as the shared substrate for all downstream semantic stages.
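A minimal sketch of the two ingredients named above: per-Gaussian temporal extent (opacity modulated by a 1D Gaussian in time) and a time-conditioned warp of the mean position. The polynomial warp below is a toy stand-in for the learned contextual time warp; all function names and parameters are assumptions:

```python
import numpy as np

def temporal_opacity(base_opacity, t, t_center, t_extent):
    """Modulate a Gaussian's opacity with a 1D Gaussian over time,
    giving each primitive an explicit temporal extent."""
    return base_opacity * np.exp(-0.5 * ((t - t_center) / t_extent) ** 2)

def warp_position(mu, t, weights):
    """Toy time warp: displace the 3D mean by a small polynomial in t.
    In the actual method a learned deformation would replace this."""
    basis = np.array([t, t ** 2, np.sin(2 * np.pi * t)])
    return mu + weights @ basis

# One Gaussian that is most visible around t = 0.5 and drifts along x
mu = np.array([0.0, 1.0, 2.0])
weights = np.zeros((3, 3))
weights[0, 0] = 0.2  # linear drift along x
print(temporal_opacity(0.9, 0.5, 0.5, 0.1))  # peak visibility at t_center
print(warp_position(mu, 0.5, weights))
```

Because both pieces are differentiable in the Gaussian parameters, the representation can be optimized end-to-end from multi-view video, as in standard 4D Gaussian pipelines.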

Hyper-Planner (Qwen-based MLLM) — 3-stage referring pipeline

Stage 1

Static Entity Segmentation

Grounded-SAM2 is applied to reference frames to obtain per-entity masks. The masks are then propagated across frames with temporally consistent tracking, so that each Gaussian is associated with a persistent entity identity.
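The identity-propagation step can be sketched as greedy IoU matching between consecutive frames: a mask inherits the ID of the best-overlapping mask from the previous frame, otherwise it mints a new ID. This is a simplified stand-in for the tracking used in the paper; function names are assumptions:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def propagate_ids(prev_masks, curr_masks, prev_ids, next_id, thresh=0.5):
    """Greedy IoU matching: carry an entity ID forward when a current
    mask sufficiently overlaps a previous one; otherwise mint a new ID."""
    curr_ids = []
    for cm in curr_masks:
        ious = [mask_iou(cm, pm) for pm in prev_masks]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= thresh:
            curr_ids.append(prev_ids[best])
        else:
            curr_ids.append(next_id)
            next_id += 1
    return curr_ids, next_id

# Toy 4x4 masks: the entity shifts one pixel to the right between frames
f0 = np.zeros((4, 4), bool); f0[1:3, 0:3] = True
f1 = np.zeros((4, 4), bool); f1[1:3, 1:4] = True
ids, nxt = propagate_ids([f0], [f1], prev_ids=[0], next_id=1)
print(ids)  # the shifted mask keeps entity ID 0
```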

Stage 2

Semantic Assignment

Each entity is assigned visual and semantic embeddings from representative crops and geometric priors. The results are stored in the EntityBank — a shared entity-centric scene memory reused across all queries.
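A minimal sketch of what an entity-centric scene memory of this kind could look like: one record per entity with a pooled embedding, retrieved by cosine similarity against a query embedding. Class and field names are assumptions, not the paper's implementation:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: int
    label: str               # open-vocabulary phrase from segmentation
    embedding: np.ndarray    # pooled visual/semantic feature of the entity
    frames: list = field(default_factory=list)  # frames where it is visible

class EntityBank:
    """Entity-centric scene memory: built once per scene, reused across queries."""
    def __init__(self):
        self.entities = {}

    def add(self, entity):
        self.entities[entity.entity_id] = entity

    def retrieve(self, query_emb, top_k=3):
        # Rank stored entities by cosine similarity to the query embedding.
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        ranked = sorted(self.entities.values(),
                        key=lambda e: cos(query_emb, e.embedding), reverse=True)
        return ranked[:top_k]

bank = EntityBank()
bank.add(Entity(0, "brown dog", np.array([1.0, 0.0]), frames=[0, 1, 2]))
bank.add(Entity(1, "red ball", np.array([0.0, 1.0]), frames=[2, 3]))
hits = bank.retrieve(np.array([0.9, 0.1]), top_k=1)
print(hits[0].label)
```

Because the bank is query-independent, the cost of segmentation and embedding is paid once per scene rather than once per query.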

Stage 3

Spatio-Temporal Grounding

The Hyper-Planner decomposes the natural-language query into object phrases, attributes, relations, temporal cues, and cardinality constraints. Candidate entities from the EntityBank are scored and selected without any scene-specific retraining.
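The decomposition-then-scoring step can be sketched as filtering candidate entities against the decomposed constraints. The dictionary schema and constraint logic below are illustrative assumptions; in the actual system the Qwen-based planner produces the decomposition:

```python
# Assumed output schema of the query decomposition (illustrative only)
decomposed = {
    "object_phrases": ["the dog"],
    "attributes": ["brown"],
    "relations": [],
    "temporal_cue": ("after", 0.5),  # normalized time constraint
    "cardinality": 1,                # expected number of targets
}

def ground(candidates, decomposed, sim_thresh=0.6):
    """Keep candidates that satisfy both the semantic-similarity and
    temporal constraints; zero-target queries fall out naturally when
    nothing passes the filters."""
    op, t0 = decomposed["temporal_cue"]
    keep = []
    for c in candidates:
        if c["sim"] < sim_thresh:
            continue  # weak semantic match
        if op == "after" and c["t_visible"] < t0:
            continue  # violates the temporal cue
        keep.append(c)
    keep.sort(key=lambda c: c["sim"], reverse=True)
    k = decomposed.get("cardinality")
    return keep[:k] if k else keep

candidates = [
    {"entity_id": 0, "sim": 0.82, "t_visible": 0.7},  # strong match, appears late
    {"entity_id": 1, "sim": 0.75, "t_visible": 0.1},  # strong match, appears early
    {"entity_id": 2, "sim": 0.30, "t_visible": 0.8},  # weak match
]
result = ground(candidates, decomposed)
print([c["entity_id"] for c in result])
```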

Experimental Results

Joint evaluation on R4D-Bench-QA (referring + reconstruction)

Method                 Acc ↑   vIoU ↑   PSNR ↑    SSIM ↑   LPIPS ↓
Segment then Splat     55.6    28.4     20.3208   0.7027   0.3971
4D LangSplat           58.4    32.1     20.3208   0.7027   0.3971
HyperGaussian (Ours)   76.5    34.4     20.4159   0.7069   0.4082

Generalization on 4D LangSplat HyperNeRF split (americano, split-cookie, espresso)

Method                 Acc ↑   vIoU ↑
LangSplat              54.27   24.13
Deformable CLIP        65.01   45.37
Non-Status Field       84.58   62.00
4D LangSplat           88.86   66.14
HyperGaussian (Ours)   91.62   66.48

Module ablation on R4D-Bench-QA

Variant                                  Acc ↑   vIoU ↑
4DGS reconstruction (no HyperGS)         62.9    31.5
w/o Stage 1 static segmentation          48.6    17.2
w/o Stage 2 semantic assignment          62.9    29.8
w/o Stage 3 spatio-temporal reasoning    36.0    26.1
HyperGaussian (full)                     76.5    34.4

Qualitative Comparison

Qualitative comparison on R4D-Bench-QA
Qualitative comparison on R4D-Bench-QA. Each block shows a temporal-state query (top) and an exclusion-style query (bottom). Rows: RGB input, ground truth, HyperGaussian (ours), Segment then Splat, 4D LangSplat. HyperGaussian correctly handles temporally conditioned and negation-based queries where baselines fail.

Additional Results

Additional qualitative results
Additional qualitative results across diverse scenes and query types, demonstrating HyperGaussian's generalization to multi-target, reasoning-intensive, and zero-target queries.

Citation

@inproceedings{hypergaussian2026,
  title     = {HyperGaussian: Referring 4D Gaussian Splatting},
  author    = {HyperGaussian Authors},
  booktitle = {Proceedings of the 34th ACM International Conference on Multimedia},
  year      = {2026}
}

Acknowledgements

This project builds on 4DGaussians, Grounded-SAM2, and Qwen. The project page template is adapted from GeoThinker and Nerfies.