Extending the translation equivariance property of convolutional neural networks to larger symmetry groups has been shown to reduce sample complexity and enable more discriminative feature learning. Further, exploiting additional symmetries facilitates greater weight sharing than standard convolutions, leading to an enhanced network expressivity without an increase in parameter count. However, extending the equivariant properties of a convolution layer comes at a computational cost. In particular, for 3D data, expanding equivariance to the SE(3) group (rotation and translation) results in a 6D convolution operation, which is not tractable for larger data samples such as 3D scene scans. While efforts have been made to develop efficient SE(3) equivariant networks, existing approaches rely on discretization or only introduce global rotation equivariance. This limits their applicability to point clouds representing a scene composed of multiple objects.
This work presents an efficient, continuous, and local SE(3) equivariant convolution layer for point cloud processing based on general group convolution and local reference frames. Our experiments show that our approach achieves competitive or superior performance across a range of datasets and tasks, including object classification and semantic segmentation, with negligible computational overhead.
Instead of 3D positions, the inputs to the convolution operator are group elements, which allows the layer to detect patterns transformed by group actions (e.g. rotations).
Given
\[\text{SE(3)} = \mathbb{R}^3 \rtimes \text{SO(3)},\] the \(\text{SE(3)}\) group convolution for 3D point clouds, evaluated at the group element \((\text{x, R})\), can be written as
\[ \int_{\mathbb{R}^3} \int_{\text{SO(3)}} f(\text{t, R'})k(\text{R}^{-1}(\text{t} - \text{x}), \text{R}^{-1}\text{R'}) d\text{t} d\mu(\text{R'}). \] In addition to relative 3D positions, the kernel also takes relative rotations as input, which results in a 6D convolution.
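As a small numerical illustration of the two kernel arguments, the following NumPy sketch computes the relative position \(\text{R}^{-1}(\text{t} - \text{x})\) and the relative rotation \(\text{R}^{-1}\text{R'}\) for a pair of group elements, and checks that both stay unchanged when the same rigid motion is applied to both elements. The helper names (`random_rotation`, `kernel_inputs`, `act`) are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix (illustrative helper)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:         # flip one axis to land in SO(3)
        q[:, 0] = -q[:, 0]
    return q

def kernel_inputs(x, R, t, R_prime):
    """Arguments passed to the kernel k for output (x, R) and input (t, R')."""
    rel_pos = R.T @ (t - x)          # R^{-1}(t - x); R^{-1} = R^T for rotations
    rel_rot = R.T @ R_prime          # R^{-1} R'
    return rel_pos, rel_rot

def act(g_t, g_R, t, R):
    """Apply the rigid motion g = (g_t, g_R) to the group element (t, R)."""
    return g_R @ t + g_t, g_R @ R

rng = np.random.default_rng(0)
x, t, g_t = rng.normal(size=(3, 3))                      # three random 3D vectors
R, R_prime, g_R = (random_rotation(rng) for _ in range(3))

before = kernel_inputs(x, R, t, R_prime)
x2, R2 = act(g_t, g_R, x, R)                             # transformed output element
t2, Rp2 = act(g_t, g_R, t, R_prime)                      # transformed input element
after = kernel_inputs(x2, R2, t2, Rp2)

same_pos = np.allclose(before[0], after[0])
same_rot = np.allclose(before[1], after[1])
```

Because the kernel only sees these relative quantities, a pattern and its rigidly transformed copy produce the same kernel arguments.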
Evaluating the group convolution numerically requires defining a grid on \(\text{SO(3)}\), which is not straightforward. Previous work has addressed this by discretizing the \(\text{SO(3)}\) group, for example, using platonic solids. To stay in the continuous space, a random grid can be constructed instead, for example through Monte Carlo sampling, \[\sum_{j} \frac{1}{\lvert H'_j \rvert}\sum_{(\text{t, R'})\in H'_j} f(\text{t, R'})k(\text{R}^{-1}(\text{t} - \text{x}), \text{R}^{-1}\text{R'}).\]
Yet, the approximation quality of the integral over \(\text{SO(3)}\) depends on the number of samples, i.e. the number of \(\text{SO(3)}\) group elements sampled per point, \(|H'_j|\). The memory footprint increases linearly with \(|H'_j|\), while the number of computations increases quadratically.
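The dependence on the sample count can be illustrated with a toy example. The sketch below estimates a normalized integral over \(\text{SO(3)}\) by Monte Carlo sampling, using the trace of the rotation matrix as a stand-in integrand (its exact mean under the uniform measure is 0), and counts the per-point cost. The cost model assumes that every sampled output rotation is paired with every sampled input rotation of every neighbor; this pairing is an assumption of the sketch, and all function names are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Draw a rotation uniformly from SO(3) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # sign fix so the distribution is uniform
    if np.linalg.det(q) < 0:         # map the reflected component into SO(3)
        q[:, 0] = -q[:, 0]
    return q

def mc_mean(fn, num_samples, rng):
    """Monte Carlo estimate of the normalized integral of fn over SO(3)."""
    return float(np.mean([fn(random_rotation(rng)) for _ in range(num_samples)]))

def per_point_cost(num_neighbors, num_rotations):
    """Stored feature vectors and kernel evaluations for one point (assumed model)."""
    stored = num_rotations                                   # grows linearly
    kernel_evals = num_neighbors * num_rotations ** 2        # grows quadratically
    return stored, kernel_evals

rng = np.random.default_rng(0)
# The estimate of the mean trace approaches 0 only as the sample count grows.
estimates = {s: mc_mean(np.trace, s, rng) for s in (4, 64, 1024)}
costs = {s: per_point_cost(num_neighbors=16, num_rotations=s) for s in (1, 4, 16)}
```

Under these assumptions, going from 1 to 4 sampled rotations per point multiplies the stored features by 4 and the kernel evaluations by 16.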
Using a random grid therefore results in a trade-off between computational efficiency and how precisely the equivariance property holds. An efficient grid on \(\text{SE(3)}\) that allows for exact equivariance with a finite number of rotation elements is thus crucial to make continuous group convolutions practical for point-based networks.
To achieve exact equivariance with tractable computational load, we propose a carefully constructed grid \(\mathcal{F}(x_j) \subset \text{SE(3)}\) specific to each point \(x_j \in \mathbb{R}^3\),
\[\sum_{j} \frac{1}{\lvert \mathcal{F}(x_j) \rvert}\sum_{(\text{t, R'})\in \mathcal{F}(x_j)} f(\text{t, R'})k(\text{R}^{-1}(\text{t} - \text{x}), \text{R}^{-1}\text{R'}).\] We show that if \(\mathcal{F}(x_j)\) is equivariant to \(\text{SE(3)}\), so is our 3D convolution as defined above. \(\mathcal{F}(x_j)\) is called a Frame and consists of only 4 elements for the \(\text{SE(3)}\) group; it can be constructed with local PCA. Further, we propose to perform a stochastic approximation during training by sampling only a subset of the elements of \(\mathcal{F}(x_j)\) for the input and output domains of the feature maps. Randomly sampling only one element keeps memory consumption and computation equal to those of a model with standard convolutional layers.
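A minimal sketch of how a four-element rotation frame could be built from local PCA is given below. It assumes the covariance is computed around the neighborhood mean and that its eigenvalues are distinct; the four elements come from the sign ambiguity of the two leading principal axes, with the third axis fixed by right-handedness. The function names `pca_frame` and `sample_frame_element` are hypothetical, and the details may differ from the authors' construction.

```python
import numpy as np

def pca_frame(neighbors):
    """Return 4 rotation matrices, shape (4, 3, 3), from PCA of an (N, 3) neighborhood."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)
    _, vecs = np.linalg.eigh(cov)            # eigenvalues ascending, eigenvectors in columns
    v1, v2 = vecs[:, 2], vecs[:, 1]          # two leading principal directions
    frame = []
    for s1 in (1.0, -1.0):                   # each principal axis is defined only up to sign
        for s2 in (1.0, -1.0):
            a, b = s1 * v1, s2 * v2
            c = np.cross(a, b)               # third axis chosen so that det = +1
            frame.append(np.stack([a, b, c], axis=1))
    return np.stack(frame)

def sample_frame_element(frame, rng):
    """Stochastic approximation: pick a single frame element at random."""
    return frame[rng.integers(len(frame))]

rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3)) * np.array([3.0, 2.0, 1.0])   # anisotropic toy neighborhood
frame = pca_frame(points)
one_rotation = sample_frame_element(frame, rng)
```

Rotating the neighborhood by any rotation maps this set of four matrices onto the frame of the rotated neighborhood (up to ordering), which is the equivariance property the convolution relies on.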
Our method achieves competitive or superior performance across a range of datasets and tasks, including object classification and semantic segmentation, with negligible computational overhead.
Using only one sample to approximate the integral over \(\text{SO(3)}\) yields approximately the same memory consumption and frames per second (FPS) as the non-\(\text{SO(3)}\)-equivariant version of our model. This shows that our method introduces the equivariance property at no extra cost, demonstrating the efficiency of our proposed model.
Since our surroundings have a notion of an up orientation, we fix the z-axis and conduct our experiments for \(\text{SO(2)}\). We sample only one orientation from the frame for all experiments, which poses no additional memory or computational burden on the model. This is a crucial property for processing such large point clouds, for which it is intractable for the other methods to run reasonably sized networks on this task.
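One possible way to build a z-aligned frame is sketched below, assuming a 2D PCA of the horizontal coordinates of the neighborhood. The source does not spell out this construction, so the function name `pca_frame_so2` and the resulting two-element frame are assumptions of the sketch rather than the authors' implementation.

```python
import numpy as np

def pca_frame_so2(neighbors):
    """Return 2 rotations about the z-axis, shape (2, 3, 3), from an (N, 3) neighborhood."""
    xy = neighbors[:, :2] - neighbors[:, :2].mean(axis=0)
    cov = xy.T @ xy / len(xy)
    _, vecs = np.linalg.eigh(cov)            # eigenvalues ascending, eigenvectors in columns
    v = vecs[:, 1]                           # dominant horizontal direction
    z = np.array([0.0, 0.0, 1.0])            # up axis stays fixed
    frame = []
    for s in (1.0, -1.0):                    # sign ambiguity of the principal direction
        a = np.array([s * v[0], s * v[1], 0.0])
        b = np.cross(z, a)                   # completes a right-handed basis (det = +1)
        frame.append(np.stack([a, b, z], axis=1))
    return np.stack(frame)

rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3)) * np.array([3.0, 1.0, 0.5])   # toy neighborhood
frame = pca_frame_so2(points)
orientation = frame[rng.integers(len(frame))]   # sample a single orientation
```

Sampling a single orientation per point, as in the last line, leaves one rotation per point, so the feature maps keep the same size as in a non-equivariant layer.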
@article{weijler2025roteq,
title = {Efficient Continuous Group Convolutions for Local SE(3) Equivariance in 3D Point Clouds},
author = {Weijler, L. and Hermosilla, P.},
journal = {International Conference on 3D Vision (3DV)},
year = {2025},
}