Commit 1bb42ad

fix: eliminate RuntimeWarnings in von Mises-Fisher loss backward pass (#824)
* fix: eliminate RuntimeWarnings in von Mises-Fisher loss backward pass

  - Replace np.where with boolean masking to avoid double evaluation
  - Add comprehensive unit tests for zero handling in LogCMK.backward
  - Enhanced docstrings with mathematical background and implementation details
  - Added mathematical background documentation in docs/source/models/

  Fixes division by zero warnings while maintaining numerical accuracy.
  Error bound analysis shows <1e-21 accuracy for |κ| < 1e-6.

* Fix mathematical expressions in documentation

  Corrected mathematical expressions in the von Mises-Fisher documentation so they are properly rendered in GitHub markdown.
1 parent 631b1a6 commit 1bb42ad

File tree

4 files changed: +350 −5 lines changed


docs/source/models/models.rst

Lines changed: 7 additions & 0 deletions

```diff
@@ -513,3 +513,10 @@ Below is a minimal example for training a GNN in GraphNeT for energy reconstruct
 Because :code:`ModelConfig` summarises a :code:`Model` completely, including its :code:`Task`\ (s),
 the only modifications required to change the example to reconstruct (or classify) a different attribute than energy, is to pass a :code:`ModelConfig` that defines a model with the corresponding :code:`Task`.
 Similarly, if you wanted to train on a different :code:`Dataset`, you would just have to pass a :code:`DatasetConfig` that defines *that* :code:`Dataset` instead.
+
+Mathematical Background
+-----------------------
+
+For detailed mathematical derivations and implementation details of specific loss functions:
+
+* :doc:`von_mises_fisher_mathematical_background` - Mathematical background for the von Mises-Fisher loss function implementation, including Taylor series analysis and numerical stability considerations.
```
docs/source/models/von_mises_fisher_mathematical_background.md

Lines changed: 197 additions & 0 deletions

@@ -0,0 +1,197 @@

# Mathematical Background: von Mises-Fisher Loss Implementation
## Calculation for 3D

To show that for $m=3$,

$$- \frac{I_{m/2}(\kappa)}{I_{m/2-1}(\kappa)} = \frac{1}{\kappa}-\frac{1}{\tanh(\kappa)}$$

we first substitute $m=3$ into the equation. This gives us:

$$- \frac{I_{3/2}(\kappa)}{I_{1/2}(\kappa)} = \frac{1}{\kappa}-\frac{1}{\tanh(\kappa)}$$

We need to evaluate the left side of this equation, which involves **modified Bessel functions of the first kind** of half-integer order, specifically $I_{3/2}(\kappa)$ and $I_{1/2}(\kappa)$.
### Expressing Bessel Functions

The modified Bessel functions of the first kind of half-integer order can be expressed in terms of elementary hyperbolic functions.

For the order $1/2$, the function is:

$$I_{1/2}(\kappa) = \sqrt{\frac{2}{\pi\kappa}} \sinh(\kappa)$$

For the order $3/2$, the function is:

$$I_{3/2}(\kappa) = \sqrt{\frac{2}{\pi\kappa}} \left( \cosh(\kappa) - \frac{\sinh(\kappa)}{\kappa} \right)$$
### Calculating the Ratio

Now, we can compute the ratio on the left side of the equation:

$$- \frac{I_{3/2}(\kappa)}{I_{1/2}(\kappa)} = - \frac{\sqrt{\frac{2}{\pi\kappa}} \left( \cosh(\kappa) - \frac{\sinh(\kappa)}{\kappa} \right)}{\sqrt{\frac{2}{\pi\kappa}} \sinh(\kappa)}$$

The term $\sqrt{\frac{2}{\pi\kappa}}$ cancels out, leaving:

$$- \frac{\cosh(\kappa) - \frac{\sinh(\kappa)}{\kappa}}{\sinh(\kappa)} = - \left( \frac{\cosh(\kappa)}{\sinh(\kappa)} - \frac{1}{\kappa} \right)$$

Using the definition of the **hyperbolic cotangent**, $\coth(\kappa) = \frac{\cosh(\kappa)}{\sinh(\kappa)}$, we have:

$$- \left(\coth(\kappa) - \frac{1}{\kappa}\right) = \frac{1}{\kappa} - \coth(\kappa)$$

### Final Result

Finally, since $\coth(\kappa) = \frac{1}{\tanh(\kappa)}$, we can write:

$$\frac{1}{\kappa} - \frac{1}{\tanh(\kappa)}$$

This is exactly the right side of the initial equation. Thus, we have shown that for $m=3$:

$$- \frac{I_{3/2}(\kappa)}{I_{1/2}(\kappa)} = \frac{1}{\kappa}-\frac{1}{\tanh(\kappa)} \quad \blacksquare$$

---
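As a quick numerical sanity check (illustrative only, not part of the committed change), the identity can be confirmed with SciPy's `iv` at a few sample points:

```python
import numpy as np
from scipy import special

# Verify -I_{3/2}(k)/I_{1/2}(k) == 1/k - 1/tanh(k) at a few sample points
kappa = np.array([0.25, 1.0, 2.0, 5.0])
lhs = -special.iv(1.5, kappa) / special.iv(0.5, kappa)
rhs = 1.0 / kappa - 1.0 / np.tanh(kappa)
assert np.allclose(lhs, rhs)
```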
## Taylor Series Approximation

To prove that the Taylor series for $f(\kappa) = \frac{1}{\kappa}-\frac{1}{\tanh(\kappa)}$ is well-defined and alternating at $\kappa = 0$, we first need to analyze the behavior of the function at the origin and then examine the structure of its series expansion.

### Well-Defined at $\kappa = 0$

The function $f(\kappa)$ appears to be singular at $\kappa=0$ because both $\frac{1}{\kappa}$ and $\frac{1}{\tanh(\kappa)}$ diverge. To determine if the singularity is removable, we can evaluate the limit as $\kappa \to 0$.

First, rewrite the function as a single fraction:

$$f(\kappa) = \frac{1}{\kappa} - \frac{\cosh(\kappa)}{\sinh(\kappa)} = \frac{\sinh(\kappa) - \kappa\cosh(\kappa)}{\kappa\sinh(\kappa)}$$

We can find the limit by expanding the hyperbolic functions into their Taylor series around $\kappa = 0$:

$$\sinh(\kappa) = \kappa + \frac{\kappa^3}{3!} + \frac{\kappa^5}{5!} + O(\kappa^7)$$

$$\cosh(\kappa) = 1 + \frac{\kappa^2}{2!} + \frac{\kappa^4}{4!} + O(\kappa^6)$$

Substituting these into the numerator and denominator:

**Numerator:**

$$\sinh(\kappa) - \kappa\cosh(\kappa) = \left(\kappa + \frac{\kappa^3}{6} + \dots\right) - \kappa\left(1 + \frac{\kappa^2}{2} + \dots\right)$$

$$= \left(\kappa + \frac{\kappa^3}{6}\right) - \left(\kappa + \frac{\kappa^3}{2}\right) + \dots = \left(\frac{1}{6} - \frac{1}{2}\right)\kappa^3 + \dots = -\frac{1}{3}\kappa^3 + O(\kappa^5)$$

**Denominator:**

$$\kappa\sinh(\kappa) = \kappa\left(\kappa + \frac{\kappa^3}{6} + \dots\right) = \kappa^2 + \frac{\kappa^4}{6} + O(\kappa^6)$$

Now, the limit of the function is:

$$\lim_{\kappa \to 0} f(\kappa) = \lim_{\kappa \to 0} \frac{-\frac{1}{3}\kappa^3 + O(\kappa^5)}{\kappa^2 + O(\kappa^4)} = \lim_{\kappa \to 0} \frac{\kappa^3(-\frac{1}{3} + O(\kappa^2))}{\kappa^2(1 + O(\kappa^2))} = \lim_{\kappa \to 0} \kappa \frac{-\frac{1}{3} + O(\kappa^2)}{1 + O(\kappa^2)} = 0$$

Since the limit exists and is finite, the singularity at $\kappa=0$ is removable. We can define $f(0) = 0$, and thus the Taylor series for $f(\kappa)$ around $\kappa=0$ is **well-defined**.
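This limit behavior is easy to check numerically (a small illustrative sketch, not part of the commit). The single-fraction form avoids the worst of the cancellation between the two individually diverging terms:

```python
import numpy as np

def f(kappa: float) -> float:
    """f(k) = 1/k - coth(k), written as a single fraction."""
    return (np.sinh(kappa) - kappa * np.cosh(kappa)) / (kappa * np.sinh(kappa))

# As kappa -> 0, f(kappa) tracks its leading Taylor term -kappa/3 and tends to 0
for k in [1e-1, 1e-2, 1e-3, 1e-4]:
    assert np.isclose(f(k), -k / 3, rtol=1e-2)
assert abs(f(1e-4)) < 1e-4  # consistent with the limit f(0) = 0
```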
### Alternating Series

To show the series is alternating, we derive its form using the known series expansion for $\kappa \coth(\kappa)$, which involves the **Bernoulli numbers** $B_{2n}$. The expansion is:

$$\kappa \coth(\kappa) = \sum_{n=0}^{\infty} \frac{B_{2n} (2\kappa)^{2n}}{(2n)!} = 1 + \frac{1}{3}\kappa^2 - \frac{1}{45}\kappa^4 + \frac{2}{945}\kappa^6 - \dots$$

Our function can be written as $f(\kappa) = \frac{1}{\kappa} - \coth(\kappa) = \frac{1 - \kappa \coth(\kappa)}{\kappa}$. Substituting the series:

$$f(\kappa) = \frac{1}{\kappa} \left( 1 - \sum_{n=0}^{\infty} \frac{B_{2n} (2\kappa)^{2n}}{(2n)!} \right)$$

The first term of the sum (for $n=0$) is $\frac{B_0 (2\kappa)^0}{0!} = 1$, since $B_0 = 1$.

$$f(\kappa) = \frac{1}{\kappa} \left( 1 - \left(1 + \sum_{n=1}^{\infty} \frac{B_{2n} (2\kappa)^{2n}}{(2n)!} \right) \right) = \frac{1}{\kappa} \left( - \sum_{n=1}^{\infty} \frac{B_{2n} 2^{2n} \kappa^{2n}}{(2n)!} \right)$$

$$f(\kappa) = - \sum_{n=1}^{\infty} \frac{B_{2n} 2^{2n} \kappa^{2n-1}}{(2n)!}$$

The sign of the Bernoulli numbers $B_{2n}$ for $n \ge 1$ alternates according to the formula $\operatorname{sgn}(B_{2n}) = (-1)^{n-1}$, so we can write $B_{2n} = (-1)^{n-1} |B_{2n}|$. Substituting this into the series for $f(\kappa)$:

$$f(\kappa) = - \sum_{n=1}^{\infty} \frac{(-1)^{n-1} |B_{2n}| 2^{2n} \kappa^{2n-1}}{(2n)!} = \sum_{n=1}^{\infty} (-1)^n \frac{|B_{2n}| 2^{2n} \kappa^{2n-1}}{(2n)!}$$

Writing out the first few terms of the series:

$$f(\kappa) = -\frac{|B_2| 2^2}{2!}\kappa + \frac{|B_4| 2^4}{4!}\kappa^3 - \frac{|B_6| 2^6}{6!}\kappa^5 + \dots$$

With $B_2 = 1/6$, $B_4 = -1/30$, $B_6 = 1/42$:

$$f(\kappa) = -\frac{1/6 \cdot 4}{2}\kappa + \frac{1/30 \cdot 16}{24}\kappa^3 - \frac{1/42 \cdot 64}{720}\kappa^5 + \dots = -\frac{1}{3}\kappa + \frac{1}{45}\kappa^3 - \frac{2}{945}\kappa^5 + \dots$$

The series for $f(\kappa)$ contains only odd powers of $\kappa$. The coefficient of the term $\kappa^{2n-1}$ is:

$$c_{2n-1} = (-1)^n \frac{|B_{2n}| 2^{2n}}{(2n)!}$$

The coefficient of the next non-zero term, $\kappa^{2(n+1)-1} = \kappa^{2n+1}$, is:

$$c_{2n+1} = (-1)^{n+1} \frac{|B_{2(n+1)}| 2^{2(n+1)}}{(2(n+1))!}$$

Since $|B_{2k}|$ is positive for all $k \ge 1$, the sign of $c_{2n-1}$ is determined by $(-1)^n$, and the sign of $c_{2n+1}$ by $(-1)^{n+1}$; these are opposite. Therefore, the coefficients of successive non-zero terms have alternating signs. This proves that the Taylor series for $f(\kappa)$ at $\kappa=0$ is an **alternating series**.
### Error Bound

The Taylor series derived for $f(\kappa)$ is alternating: the signs of successive terms alternate. For such series, the truncation error can be estimated using the **Alternating Series Estimation Theorem**: if a convergent alternating series is approximated by its $N$-th partial sum, the absolute value of the error (the remainder) is at most the absolute value of the first neglected term, provided the absolute values of the terms decrease monotonically to zero.

These conditions are met by the series for $f(\kappa)$ within its radius of convergence ($|\kappa| < \pi$). The magnitude of the general term, $\frac{|B_{2n}| 2^{2n} |\kappa|^{2n-1}}{(2n)!}$, tends to zero as $n \to \infty$, ensuring convergence, and for any given $\kappa$ in this interval the term magnitudes are eventually monotonically decreasing; for sufficiently small $\kappa$, the decrease holds from the very first term. The error in approximating the function by the sum of its first $N$ terms is therefore bounded by the magnitude of the $(N+1)$-th term. In particular, the error in the approximation $f(\kappa) \approx -\frac{1}{3}\kappa$ is at most the next term's magnitude, $\frac{1}{45}|\kappa|^3$. Thus:

$$\left\lvert \kappa \right\rvert < 10^{-6} \implies \varepsilon \leq \frac{|\kappa|^3}{45} \lesssim \mathcal{O}(10^{-20})$$

---
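The tightness of the alternating-series bound can be spot-checked numerically (an illustrative sketch, not part of the commit). At $\kappa = 0.1$ the error of the first-order approximation should sit just below $\kappa^3/45$:

```python
import numpy as np

kappa = 0.1
exact = 1.0 / kappa - 1.0 / np.tanh(kappa)  # exact gradient formula for m=3
approx = -kappa / 3                          # first-order Taylor term
err = abs(exact - approx)

# Alternating Series Estimation Theorem: |error| <= first neglected term
bound = kappa**3 / 45
assert err <= bound
assert err >= 0.99 * bound  # the bound is nearly tight at this kappa
```

The same bound evaluated at the $|\kappa| < 10^{-6}$ threshold gives the $\sim 10^{-20}$ figure quoted above.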
## Implementation in GraphNeT

### Problem Statement

The mathematical derivation above provides exact formulas for computing gradients in the von Mises-Fisher loss function. However, a naive implementation would encounter **division by zero errors** when $\kappa = 0$, even though the mathematical limit is well-defined. This creates RuntimeWarnings and potential numerical instability.

### Numerical Challenge

The core issue arises in the backward pass when computing:

$$\frac{\partial}{\partial \kappa} \log C_3(\kappa) = \frac{1}{\kappa} - \frac{1}{\tanh(\kappa)}$$

For $\kappa = 0$:

- Both $\frac{1}{\kappa}$ and $\frac{1}{\tanh(\kappa)}$ diverge individually
- Their difference converges to the finite limit $\lim_{\kappa \to 0} f(\kappa) = 0$
- Standard floating-point evaluation triggers division by zero warnings

### Solution Strategy

Our implementation uses **boolean masking** to conditionally apply different computational approaches:
```python
# Initialize gradient array
grads = np.zeros_like(kappa)

# Handle small kappa values (including zero)
small_mask = np.abs(kappa) < 1e-6
grads[small_mask] = -kappa[small_mask] / 3

# Handle large kappa values
large_mask = ~small_mask
if np.any(large_mask):
    kappa_large = kappa[large_mask]
    grads[large_mask] = 1 / kappa_large - 1 / np.tanh(kappa_large)
```
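The snippet above can be wrapped into a standalone function to see the behavior end to end (a sketch; the name `grad_log_cmk3` is ours for illustration, not a GraphNeT API):

```python
import numpy as np

def grad_log_cmk3(kappa: np.ndarray) -> np.ndarray:
    """Gradient of log C_3(kappa), numerically stable at kappa = 0."""
    grads = np.zeros_like(kappa)
    small_mask = np.abs(kappa) < 1e-6
    grads[small_mask] = -kappa[small_mask] / 3  # Taylor branch
    large_mask = ~small_mask
    if np.any(large_mask):
        kappa_large = kappa[large_mask]
        grads[large_mask] = 1 / kappa_large - 1 / np.tanh(kappa_large)
    return grads

grads = grad_log_cmk3(np.array([0.0, 1e-7, 0.5, 10.0]))
assert np.all(np.isfinite(grads))
assert grads[0] == 0.0  # exact zero at kappa = 0, no warning raised
```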
### Key Implementation Features

1. **Threshold Selection**: $|\kappa| < 10^{-6}$
   - Based on error analysis: truncation error $\leq \frac{|\kappa|^3}{45} \lesssim 10^{-20}$
   - Well below the precision of typical double-precision arithmetic
   - Provides excellent numerical accuracy

2. **Boolean Masking vs. `np.where()`**
   - **Avoided**: `np.where(condition, small_branch, large_branch)`
     - Evaluates both branches, so it still triggers warnings
   - **Used**: Boolean indexing with separate computations
     - Only evaluates the necessary expressions
     - Eliminates all RuntimeWarnings

3. **Mathematical Consistency**
   - Small $\kappa$: uses the first-order Taylor approximation $-\kappa/3$
   - Large $\kappa$: uses the exact formula $\frac{1}{\kappa} - \frac{1}{\tanh(\kappa)}$
   - The seamless transition preserves continuity and differentiability

4. **Edge Case Handling**
   - $\kappa = 0$: returns exactly $0$ (no approximation needed)
   - Multiple zeros in a batch: each handled independently
   - Mixed arrays: efficient vectorized computation
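The difference between the two approaches is easy to demonstrate (an illustrative sketch): `np.where` evaluates both branches eagerly, so the division by zero happens even though its result is discarded, while boolean masking only touches the safe elements:

```python
import warnings
import numpy as np

kappa = np.array([0.0, 0.5, 2.0])

# np.where evaluates BOTH branches before selecting -> divides by zero anyway
with warnings.catch_warnings(record=True) as caught_where:
    warnings.simplefilter("always")
    _ = np.where(np.abs(kappa) < 1e-6, -kappa / 3, 1 / kappa - 1 / np.tanh(kappa))
assert any(issubclass(w.category, RuntimeWarning) for w in caught_where)

# Boolean masking only evaluates the exact formula on nonzero elements
with warnings.catch_warnings(record=True) as caught_masked:
    warnings.simplefilter("always")
    grads = np.zeros_like(kappa)
    small = np.abs(kappa) < 1e-6
    grads[small] = -kappa[small] / 3
    grads[~small] = 1 / kappa[~small] - 1 / np.tanh(kappa[~small])
assert not any(issubclass(w.category, RuntimeWarning) for w in caught_masked)
```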
### Verification

The implementation is validated through comprehensive unit tests that verify:

- ✅ No RuntimeWarnings generated during the backward pass
- ✅ Gradients remain finite for all input values
- ✅ Mathematical accuracy: $f(0) = 0$ exactly
- ✅ Correct Taylor approximation for small $|\kappa|$
- ✅ Proper handling of arrays containing multiple zeros

This approach ensures both mathematical correctness and numerical stability, making the von Mises-Fisher loss function robust for practical deep learning applications where $\kappa$ values may include zeros or near-zero elements.

src/graphnet/training/loss_functions.py

Lines changed: 55 additions & 5 deletions

```diff
@@ -270,15 +270,65 @@ def forward(
     def backward(
         ctx: Any, grad_output: Tensor
     ) -> Tensor:  # pylint: disable=invalid-name,arguments-differ
-        """Backward pass."""
+        """Backward pass for LogCMK computation.
+
+        Mathematical Background:
+        -----------------------
+        For the von Mises-Fisher distribution, the gradient of log C_m(κ)
+        with respect to κ is given by the negative ratio of modified Bessel
+        functions:
+
+            ∂/∂κ log C_m(κ) = -I_{m/2}(κ) / I_{m/2-1}(κ)
+
+        For m=3, this simplifies to the exact formula:
+
+            ∂/∂κ log C_3(κ) = 1/κ - 1/tanh(κ)
+
+        For small κ values, we use the Taylor series approximation:
+
+            f(κ) = -κ/3 + κ³/45 - 2κ⁵/945 + O(κ⁷)
+
+        The first-order approximation -κ/3 provides sufficient accuracy for
+        |κ| < 1e-6, with truncation error bounded by |κ|³/45 ≲ O(10⁻²⁰).
+
+        Implementation Details:
+        ----------------------
+        Uses boolean masking to avoid double evaluation and RuntimeWarnings:
+        - Small κ: |κ| < 1e-6 → gradient = -κ/3 (Taylor approximation)
+        - Large κ: |κ| ≥ 1e-6 → gradient = 1/κ - 1/tanh(κ) (exact formula)
+
+        References:
+        ----------
+        [1] von Mises-Fisher distribution: Wikipedia
+        [2] arXiv:1812.04616, Section 8.2
+        [3] MIT License (c) 2019 Max Ryabinin - Modified for GraphNeT
+
+        Args:
+            ctx: Autograd context containing saved tensors and metadata.
+            grad_output: Gradient with respect to the output tensor.
+
+        Returns:
+            Tuple of gradients: (None for m, gradient w.r.t. κ).
+        """
         kappa = ctx.saved_tensors[0]
         m = ctx.m
         dtype = ctx.dtype
         kappa = kappa.double().cpu().numpy()
-        grads = -(
-            (scipy.special.iv(m / 2.0, kappa))
-            / (scipy.special.iv(m / 2.0 - 1, kappa))
-        )
+        if np.isclose(m, 3, atol=1e-6):
+            # Initialize gradient array
+            grads = np.zeros_like(kappa)
+
+            # Handle small kappa values (including zero) to avoid
+            # division by zero
+            small_mask = np.abs(kappa) < 1e-6
+            grads[small_mask] = -kappa[small_mask] / 3
+
+            # Handle large kappa values
+            large_mask = ~small_mask
+            if np.any(large_mask):
+                kappa_large = kappa[large_mask]
+                grads[large_mask] = 1 / kappa_large - 1 / np.tanh(kappa_large)
+        else:
+            grads = -(
+                (scipy.special.iv(m / 2.0, kappa))
+                / (scipy.special.iv(m / 2.0 - 1, kappa))
+            )
         return (
             None,
             grad_output
```
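As a sanity check on the branch hand-off (illustrative only, not part of the commit), the two formulas agree at the $|\kappa| = 10^{-6}$ threshold to well within ordinary floating-point noise:

```python
import numpy as np

kappa = 1e-6  # the threshold between the two branches
taylor = -kappa / 3
exact = 1 / kappa - 1 / np.tanh(kappa)

# The mathematical gap is ~kappa**3/45 ≈ 2e-20; float64 cancellation noise
# in the exact formula dominates at roughly 1e-10, still negligible here
assert abs(taylor - exact) < 1e-9
```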

tests/training/test_loss_functions.py

Lines changed: 91 additions & 0 deletions

```diff
@@ -3,6 +3,7 @@
 import numpy as np
 import pytest
 import torch
+import warnings
 from torch import Tensor
 from torch.autograd import grad

@@ -182,3 +183,93 @@ def test_von_mises_fisher_approximation_large_kappa(
     assert torch.allclose(
         grads_approx[exact_is_valid], grads_exact[exact_is_valid], rtol=1e-2
     )
+
+
+def test_logcmk_backward_zero_handling(
+    dtype: torch.dtype = torch.float64,
+) -> None:
+    """Test LogCMK backward pass handles arrays with zero values correctly.
+
+    This test ensures that the LogCMK.backward method correctly handles cases
+    where the kappa tensor contains zero values without raising division by
+    zero errors or warnings. The implementation uses boolean masking to
+    conditionally apply different formulas for small (including zero) and
+    large kappa values, avoiding the double evaluation that would cause
+    RuntimeWarnings.
+
+    Args:
+        dtype: PyTorch data type for the test tensors.
+    """
+    # Test parameters
+    m = 3  # Dimension for which we have the exact formula
+
+    # Create kappa tensor with zeros and other values, including edge cases
+    kappa_values = [0.0, 1e-7, 1e-6, 1e-5, 0.1, 1.0, 10.0]
+    kappa = torch.tensor(kappa_values, dtype=dtype, requires_grad=True)
+
+    # Forward pass using VonMisesFisherLoss.log_cmk_exact, which internally
+    # uses LogCMK
+    result = VonMisesFisherLoss.log_cmk_exact(m, kappa)
+
+    # Test that backward pass doesn't raise any errors or warnings.
+    # Capture warnings to ensure no RuntimeWarnings are generated.
+    with warnings.catch_warnings(record=True) as caught_warnings:
+        warnings.simplefilter("always")
+        try:
+            grads = torch.autograd.grad(
+                outputs=result.sum(),
+                inputs=kappa,
+                grad_outputs=None,
+                retain_graph=False,
+                create_graph=False,
+            )[0]
+            backward_success = True
+            error_msg = ""
+        except (ZeroDivisionError, RuntimeWarning) as e:
+            backward_success = False
+            error_msg = str(e)
+
+    # Verify no errors occurred
+    assert backward_success, f"Backward pass failed with error: {error_msg}"
+
+    # Verify no RuntimeWarnings were generated
+    runtime_warnings = [
+        w for w in caught_warnings if issubclass(w.category, RuntimeWarning)
+    ]
+    assert len(runtime_warnings) == 0, (
+        "RuntimeWarnings were generated: "
+        f"{[str(w.message) for w in runtime_warnings]}"
+    )
+
+    # Verify gradients are finite
+    assert torch.all(
+        torch.isfinite(grads)
+    ), "Gradients should be finite for all kappa values"
+
+    # Test specific values for correctness.
+    # For kappa=0, the gradient should be -kappa/3 = 0.
+    zero_idx = 0  # Index where kappa=0
+    assert torch.isclose(
+        grads[zero_idx], torch.tensor(0.0, dtype=dtype)
+    ), f"Gradient at kappa=0 should be 0, got {grads[zero_idx]}"
+
+    # For very small kappa (1e-7), should use the -kappa/3 approximation
+    small_kappa_idx = 1  # Index where kappa=1e-7
+    expected_small_grad = -kappa_values[small_kappa_idx] / 3
+    assert torch.isclose(
+        grads[small_kappa_idx],
+        torch.tensor(expected_small_grad, dtype=dtype),
+        atol=1e-10,
+    ), "Gradient for small kappa should use -kappa/3 approximation"
+
+    # Test with an array containing multiple zeros
+    kappa_multi_zero = torch.tensor(
+        [0.0, 0.0, 1.0, 0.0, 10.0], dtype=dtype, requires_grad=True
+    )
+    result_multi = VonMisesFisherLoss.log_cmk_exact(m, kappa_multi_zero)
+
+    with warnings.catch_warnings(record=True) as caught_warnings_multi:
+        warnings.simplefilter("always")
+        try:
+            grads_multi = torch.autograd.grad(
+                outputs=result_multi.sum(),
+                inputs=kappa_multi_zero,
+                grad_outputs=None,
+            )[0]
+            multi_zero_success = True
+        except (ZeroDivisionError, RuntimeWarning):
+            multi_zero_success = False
+
+    assert multi_zero_success, "Should handle arrays with multiple zero values"
+    assert torch.all(
+        torch.isfinite(grads_multi)
+    ), "All gradients should be finite with multiple zeros"
+
+    # Verify no RuntimeWarnings for the multiple-zeros case
+    runtime_warnings_multi = [
+        w
+        for w in caught_warnings_multi
+        if issubclass(w.category, RuntimeWarning)
+    ]
+    assert len(runtime_warnings_multi) == 0, (
+        "RuntimeWarnings were generated with multiple zeros: "
+        f"{[str(w.message) for w in runtime_warnings_multi]}"
+    )
+
+    # Verify that zero elements have zero gradients
+    zero_mask = kappa_multi_zero == 0.0
+    assert torch.all(
+        grads_multi[zero_mask] == 0.0
+    ), "Zero kappa values should have zero gradients"
```
