
Modeling the inherent hierarchical structure of 3D objects and 3D scenes is highly desirable, as it enables a more holistic understanding of environments for autonomous agents. Accomplishing this with implicit representations, such as Neural Radiance Fields, remains an unexplored challenge.
Existing methods that explicitly model hierarchical structures face one of two limitations: they either require multiple rendering passes to capture embeddings at different levels of granularity, which increases inference time, or rely on predefined, closed-set discrete hierarchies that generalize poorly to the diverse and nuanced structures encountered by agents in the real world.
To address these challenges, we propose OpenHype, a novel approach that represents scene hierarchies using a continuous hyperbolic latent space. By leveraging the properties of hyperbolic geometry, OpenHype naturally encodes multi-scale relationships and enables smooth traversal of hierarchies through geodesic paths in latent space. Our method outperforms state-of-the-art approaches on standard benchmarks, demonstrating superior efficiency and adaptability in 3D scene understanding.
Mask hierarchies are extracted with Semantic-SAM from input images and encoded with CLIP to obtain features. An auto-encoder is trained on all training images of a scene to learn a hyperbolic latent space that encodes the hierarchical relationships between the masks. Leaf nodes represent the smallest parts and are furthest away from the origin $O$.
The hierarchical loss operates in hyperbolic space and is the sum of two contrastive losses: for one, we use the geodesic distance $d$ between two nodes in the hierarchy as a measure of similarity, and for the other, we use the exterior angle $\alpha$.
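To make the distance-based term concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the Poincaré ball model with curvature $-1$ (the text does not state which hyperbolic model is used) and an InfoNCE-style contrastive form with the negative geodesic distance $d$ as similarity. The names `poincare_distance`, `distance_contrastive_loss`, and `temperature` are hypothetical, and the exterior-angle term is omitted.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: curvature -1; u and v are sequences of floats with norm < 1.
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    # Clamp to guard against rounding slightly below 1.
    return math.acosh(max(arg, 1.0))


def distance_contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss using negative geodesic distance as similarity.

    Hypothetical form for illustration: the positive is a node related to the
    anchor in the mask hierarchy, the negatives are unrelated nodes.
    """
    logits = [-poincare_distance(anchor, positive) / temperature]
    logits += [-poincare_distance(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]
```

Under these assumptions the loss is small when the positive lies closer to the anchor (in geodesic distance) than the negatives do, and grows as that ordering is violated.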
Together with a color and density field, a vision-language (VL) feature field is trained. In contrast to existing approaches, this VL field encodes not just a single feature but a full curve (the geodesic path from leaf node to origin) that represents the hierarchy of objects and parts a pixel belongs to.
During inference, hierarchical traversal along geodesics in hyperbolic space produces multi-scale semantic responses, aggregated through a softmax-weighted scheme for robust open-vocabulary segmentation.
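The traversal and aggregation step can be illustrated with a short sketch under stated assumptions; it is not the authors' code. It assumes the Poincaré ball model with curvature $-1$, where the geodesic from a point to the origin $O$ is a straight radial segment and a point of Euclidean norm $r$ lies at hyperbolic distance $2\,\mathrm{artanh}(r)$ from $O$. The per-level relevance scores are taken as given, and the exact weighting scheme is an assumption. The names `geodesic_to_origin`, `softmax_weighted_score`, and `temperature` are hypothetical.

```python
import math


def geodesic_to_origin(x, n_steps):
    """Sample n_steps points from x to the origin, evenly spaced in hyperbolic distance.

    Assumption: x lies inside the unit Poincare ball (curvature -1).
    """
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    r = math.sqrt(sum(a * a for a in x))
    if r == 0.0:
        return [list(x) for _ in range(n_steps)]
    half_dist = math.atanh(r)  # half of the hyperbolic distance to the origin
    points = []
    for i in range(n_steps):
        t = 1.0 - i / (n_steps - 1)  # t = 1 at the leaf, t = 0 at the origin
        scale = math.tanh(t * half_dist) / r
        points.append([a * scale for a in x])
    return points


def softmax_weighted_score(scores, temperature=1.0):
    """Aggregate per-level scores, weighting each by its softmax probability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return sum(e / z * s for e, s in zip(exps, scores))
```

In this sketch, each sampled point would be decoded and compared against the text query to obtain one score per hierarchy level. A low `temperature` makes the aggregate approach the best-matching level, while a high one approaches a plain average.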
We evaluate OpenHype on the Search3D benchmark and the LERF dataset. On Search3D, OpenHype improves mean IoU by +8.9 and accuracy by +12.1 on part-level queries compared to state-of-the-art baselines. On LERF, our method surpasses recent discrete-hierarchy approaches, including N2F2, with an overall IoU of 54.6.
We visualize our hyperbolic latent space using CO-SNE, showing how mask features are organized along geodesic paths. The embeddings capture the mask hierarchy: higher-level masks lie near the origin, lower-level masks lie near the boundary, and object features remain consistent across views.
OpenHype segments both large objects and small parts. For example, it can distinguish between a normal keyboard and a laptop keyboard as well as correctly segment the printer control panel, showing that it mitigates the bag-of-words issues of CLIP features.
@article{weijler2025openhype,
author = {Weijler, Lisa and Koch, Sebastian and Poiesi, Fabio and Ropinski, Timo and Hermosilla, Pedro},
title = {OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields},
journal = {NeurIPS},
year = {2025},
}