23  Method of Moments Estimation

Introduction

This file formalizes the method of moments (MoM) estimator for phase-type distribution parameters. Given observed data and a parameterized phase-type model, MoM finds parameter estimates by minimizing the distance between sample moments (computed from data) and model moments (computed from the phase-type distribution as a function of the parameter vector). The resulting estimates serve as informed starting points or prior centers for subsequent Bayesian inference via SVGD ([18]) or MCMC ([19]).

In the phasic pipeline, MoM operates after model construction ([08]) and moment computation ([15]). The model moments are evaluated via reward-transformed moment formulas, and the estimation is cast as a nonlinear least-squares problem solved by the Trust Region Reflective algorithm. An optional two-step efficient Generalized Method of Moments (GMM) procedure uses the inverse of the estimated moment covariance as an optimal weighting matrix, down-weighting high-variance moments. Standard errors are computed via the delta method (van der Vaart 1998) for constructing Gaussian priors.

Prerequisites: [01], [15]

Source files:

  • src/phasic/method_of_moments.py (functions: method_of_moments, _compute_weight_transform, _select_nr_moments, _initial_guess_grid, _estimate_moment_covariance, _reconstruct_theta)

Definitions

Definition 23.1 (Sample Moments and Model Moments) Let x_1, \ldots, x_n be i.i.d. observations from a phase-type distribution \operatorname{PH}(\boldsymbol{\alpha}, \mathbf{S}(\boldsymbol{\theta})). For multivariate observations with F features, let x_{i,f} denote the f-th feature of observation i. The sample moments are

\hat{m}_{f,k} = \frac{1}{n_f} \sum_{i=1}^{n_f} x_{i,f}^k, \quad k = 1, \ldots, K, \;\; f = 1, \ldots, F, \tag{23.1}

where n_f is the number of valid (non-missing) observations for feature f and K is the number of moments matched. The model moments \mu_{f,k}(\boldsymbol{\theta}) are the k-th moments of the phase-type distribution under parameter \boldsymbol{\theta}, computed via the reward-transformed moment formula ([15]):

\mu_{f,k}(\boldsymbol{\theta}) = k! \, \boldsymbol{\alpha} \left( (-\mathbf{S}(\boldsymbol{\theta}))^{-1} \triangle(\mathbf{r}_f) \right)^k \mathbf{e}, \tag{23.2}

where \mathbf{r}_f is the reward vector for feature f. The sample and model moments are arranged as vectors \hat{\mathbf{m}}, \boldsymbol{\mu}(\boldsymbol{\theta}) \in \mathbb{R}^{FK} with the ordering (f=1,k=1), (f=1,k=2), \ldots, (f=F,k=K).

Intuition Sample moments are empirical averages of powers of the data. Model moments are the corresponding theoretical quantities predicted by the phase-type model for a given parameter vector. When the model is correct, there exists a \boldsymbol{\theta}^* such that \boldsymbol{\mu}(\boldsymbol{\theta}^*) = \mathbb{E}[\hat{\mathbf{m}}], and the method of moments seeks this parameter.
Example For a univariate model with K = 2 moments and n = 100 observations: \hat{m}_1 = \bar{x} (sample mean) and \hat{m}_2 = n^{-1}\sum x_i^2 (sample second moment). If the model is \operatorname{Exp}(\theta), then \mu_1(\theta) = 1/\theta and \mu_2(\theta) = 2/\theta^2.
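
As a quick numerical illustration of this example, the following minimal NumPy sketch (illustrative only, not part of the phasic source) computes the sample moments and the exponential model moments:

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                         # true rate of Exp(theta)
x = rng.exponential(1.0 / theta, size=100)

# Sample moments (Eq. 23.1): empirical averages of powers of the data
m1_hat = x.mean()                   # approx 1/theta = 0.5
m2_hat = np.mean(x**2)              # approx 2/theta^2 = 0.5

# Model moments of Exp(theta): mu_k(theta) = k!/theta^k
mu1 = lambda t: 1.0 / t
mu2 = lambda t: 2.0 / t**2

# Matching the first moment alone has the closed form theta_hat = 1/m1_hat
theta_hat = 1.0 / m1_hat
print(theta_hat, mu2(theta_hat), m2_hat)   # second moments roughly agree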

Definition 23.2 (Moment Matching Objective) The moment matching objective is the weighted nonlinear least-squares problem

\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} > 0} \left\lVert \mathbf{T}\left(\boldsymbol{\mu}(\boldsymbol{\theta}) - \hat{\mathbf{m}}\right) \right\rVert_2^2, \tag{23.3}

where \mathbf{T} \in \mathbb{R}^{FK \times FK} is a weight transform matrix. This is equivalent to minimizing the generalized sum of squared residuals (\boldsymbol{\mu}(\boldsymbol{\theta}) - \hat{\mathbf{m}})^\top \mathbf{W} (\boldsymbol{\mu}(\boldsymbol{\theta}) - \hat{\mathbf{m}}) with weight matrix \mathbf{W} = \mathbf{T}^\top \mathbf{T}. The constraint \boldsymbol{\theta} > 0 ensures positive rates.

Intuition The weight matrix \mathbf{W} accounts for the different scales and variances of the moments. Without weighting (\mathbf{T} = \mathbf{I}), high-order moments (which are typically much larger) would dominate the objective. Optimal weighting uses \mathbf{W} = \boldsymbol{\Sigma}^{-1}, the inverse of the moment covariance, which normalizes each moment by its sampling variability.
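
A small sketch (plain NumPy, synthetic data) illustrates the scale disparity: raw moments of increasing order differ by orders of magnitude, while residuals standardized by their sampling error are comparable:

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(5.0, size=1000)   # mean 5, so raw moments grow as k! * 5^k

raw = np.array([np.mean(x**k) for k in range(1, 5)])
se = np.array([np.std(x**k, ddof=1) / np.sqrt(len(x)) for k in range(1, 5)])

print(raw)        # spans several orders of magnitude: m_4 dominates unweighted fits
print(raw / se)   # standardized scale: all moment orders are comparable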

Definition 23.3 (Optimal Weighting via Moment Covariance) The moment covariance matrix \boldsymbol{\Sigma} \in \mathbb{R}^{FK \times FK} has a block-diagonal structure (features are independent):

\boldsymbol{\Sigma} = \operatorname{diag}(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_F), \tag{23.4}

where \boldsymbol{\Sigma}_f \in \mathbb{R}^{K \times K} has entries

(\boldsymbol{\Sigma}_f)_{jk} = \frac{1}{n_f} \operatorname{Cov}(X_f^j, X_f^k) = \frac{1}{n_f} \left( \frac{1}{n_f - 1} \sum_{i=1}^{n_f} (x_{i,f}^j - \hat{m}_{f,j})(x_{i,f}^k - \hat{m}_{f,k}) \right). \tag{23.5}

The optimal weight transform \mathbf{T} is the inverse of the Cholesky factor of \boldsymbol{\Sigma}: if \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top, then \mathbf{T} = \mathbf{L}^{-1}, giving \mathbf{W} = \mathbf{T}^\top \mathbf{T} = \boldsymbol{\Sigma}^{-1}. A fallback chain is employed: if \boldsymbol{\Sigma} is ill-conditioned (condition number \geq 10^{10}), diagonal weighting T_{ii} = 1/\sqrt{\Sigma_{ii}} is used; if the diagonal also contains non-positive entries, the identity \mathbf{T} = \mathbf{I} (unweighted) is used.

Intuition The Cholesky factorization of \boldsymbol{\Sigma} transforms the correlated moment residuals into uncorrelated, unit-variance residuals. This is the standard efficient GMM weighting (Hansen 1982): it produces the minimum-variance estimator among all GMM estimators using the same moment conditions. The fallback chain ensures robustness when the sample size is small relative to the number of moments.
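
The whitening property is easy to check numerically. The following sketch (plain NumPy, with a synthetic covariance in place of a real moment covariance) verifies that \mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^\top = \mathbf{I} and \mathbf{W} = \mathbf{T}^\top\mathbf{T} = \boldsymbol{\Sigma}^{-1}:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)     # a synthetic positive-definite "moment covariance"

L = np.linalg.cholesky(Sigma)          # Sigma = L L^T
T = np.linalg.inv(L)                   # weight transform of Definition 23.3

# T Sigma T^T = I: transformed residuals are uncorrelated with unit variance
print(np.allclose(T @ Sigma @ T.T, np.eye(4)))     # True
# and W = T^T T equals Sigma^{-1}
print(np.allclose(T.T @ T, np.linalg.inv(Sigma)))  # True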

Theorems and Proofs

Theorem 23.1 (Consistency of the Method of Moments Estimator) Suppose (i) the moment function \boldsymbol{\mu}(\boldsymbol{\theta}) is continuous on \Theta = \mathbb{R}^d_{>0}, (ii) the true parameter \boldsymbol{\theta}^* is the unique solution of \boldsymbol{\mu}(\boldsymbol{\theta}) = \mathbb{E}[\hat{\mathbf{m}}] (identification), and (iii) the observations x_1, \ldots, x_n are i.i.d. with finite 2K-th moment. Then the MoM estimator \hat{\boldsymbol{\theta}}_n satisfies

\hat{\boldsymbol{\theta}}_n \xrightarrow{n \to \infty} \boldsymbol{\theta}^* \quad \text{in probability.}

Furthermore, \hat{\boldsymbol{\theta}}_n is asymptotically normal:

\sqrt{n}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}^*) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0}, \; \mathbf{G} \boldsymbol{\Sigma} \mathbf{G}^\top \right), \tag{23.6}

where \mathbf{G} = (\mathbf{J}^\top \mathbf{W} \mathbf{J})^{-1} \mathbf{J}^\top \mathbf{W} is the generalized inverse mapping, \mathbf{J} = \partial \boldsymbol{\mu} / \partial \boldsymbol{\theta}^\top \big|_{\boldsymbol{\theta}^*} is the Jacobian of the moment function, and \boldsymbol{\Sigma} is the moment covariance. For the efficient GMM weighting \mathbf{W} = \boldsymbol{\Sigma}^{-1}, this simplifies to

\operatorname{Cov}(\hat{\boldsymbol{\theta}}_n) \approx \frac{1}{n} (\mathbf{J}_w^\top \mathbf{J}_w)^{-1}, \tag{23.7}

where \mathbf{J}_w = \mathbf{T} \mathbf{J} is the weighted Jacobian.

Proof. By the strong law of large numbers and condition (iii), \hat{\mathbf{m}} \xrightarrow{\text{a.s.}} \mathbb{E}[\hat{\mathbf{m}}] = \boldsymbol{\mu}(\boldsymbol{\theta}^*). The objective function Q_n(\boldsymbol{\theta}) = \lVert \mathbf{T}(\boldsymbol{\mu}(\boldsymbol{\theta}) - \hat{\mathbf{m}}) \rVert^2 converges uniformly on compact subsets to Q(\boldsymbol{\theta}) = \lVert \mathbf{T}(\boldsymbol{\mu}(\boldsymbol{\theta}) - \boldsymbol{\mu}(\boldsymbol{\theta}^*)) \rVert^2 by continuity of \boldsymbol{\mu}. By identification (ii), Q(\boldsymbol{\theta}) has a unique minimizer at \boldsymbol{\theta}^*, so \hat{\boldsymbol{\theta}}_n \to \boldsymbol{\theta}^* in probability [standard extremum estimator consistency; see Newey and McFadden (1994), Theorem 2.1].

For the asymptotic distribution, linearize \boldsymbol{\mu}(\hat{\boldsymbol{\theta}}_n) \approx \boldsymbol{\mu}(\boldsymbol{\theta}^*) + \mathbf{J}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}^*). The first-order condition of the weighted least-squares problem gives \mathbf{J}_w^\top \mathbf{T}(\boldsymbol{\mu}(\hat{\boldsymbol{\theta}}_n) - \hat{\mathbf{m}}) = \mathbf{0}, which after linearization yields

\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}^* \approx -(\mathbf{J}_w^\top \mathbf{J}_w)^{-1} \mathbf{J}_w^\top \mathbf{T}(\boldsymbol{\mu}(\boldsymbol{\theta}^*) - \hat{\mathbf{m}}).

Since \sqrt{n}(\hat{\mathbf{m}} - \boldsymbol{\mu}(\boldsymbol{\theta}^*)) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}) by the central limit theorem, applying the linear transformation gives Equation 23.6. When \mathbf{W} = \boldsymbol{\Sigma}^{-1} and \mathbf{T} = \mathbf{L}^{-1}, the covariance simplifies: \mathbf{G} \boldsymbol{\Sigma} \mathbf{G}^\top = (\mathbf{J}_w^\top \mathbf{J}_w)^{-1} \mathbf{J}_w^\top (\mathbf{T} \boldsymbol{\Sigma} \mathbf{T}^\top) \mathbf{J}_w (\mathbf{J}_w^\top \mathbf{J}_w)^{-1} = (\mathbf{J}_w^\top \mathbf{J}_w)^{-1}, since \mathbf{T} \boldsymbol{\Sigma} \mathbf{T}^\top = \mathbf{L}^{-1} \mathbf{L} \mathbf{L}^\top \mathbf{L}^{-\top} = \mathbf{I}. Dividing by n gives Equation 23.7. \square
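
Remark. In Equation 23.6, \boldsymbol{\Sigma} denotes the asymptotic covariance of \sqrt{n}(\hat{\mathbf{m}} - \boldsymbol{\mu}(\boldsymbol{\theta}^*)), i.e. n times the finite-sample covariance of Definition 23.3. When \mathbf{T} is instead built from the finite-sample covariance, as in Algorithm 23.1, the factor 1/n is absorbed into \mathbf{J}_w and \operatorname{Cov}(\hat{\boldsymbol{\theta}}_n) \approx (\mathbf{J}_w^\top \mathbf{J}_w)^{-1} is computed directly. A minimal NumPy sketch of this computation (hypothetical Jacobian values):

import numpy as np

rng = np.random.default_rng(3)
J_w = rng.standard_normal((6, 2))   # hypothetical weighted Jacobian T J, shape FK x d_free

# With T from the finite-sample covariance of Definition 23.3, the 1/n factor
# is already absorbed, so Cov(theta_hat) ~ pinv(J_w^T J_w)
cov_theta = np.linalg.pinv(J_w.T @ J_w)
se = np.sqrt(np.diag(cov_theta))    # delta-method standard errors
print(se)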

Algorithms

Algorithm 23.1 (Method of Moments Estimation)

Description. Estimates the parameter vector \boldsymbol{\theta} of a parameterized phase-type model by matching model moments to sample moments. The algorithm proceeds in several stages: (1) compute sample moments from the data, (2) optionally select the number of moments K by condition number analysis, (3) find an initial guess via coordinate-wise grid search on the first moment, (4) solve the unweighted least-squares problem, and (5) optionally re-solve with optimal GMM weighting. Standard errors are computed via the delta method (Theorem 23.1) and used to construct Gaussian priors.

Method of Moments Estimation
1: Let G = (V, E) be a parameterized phase-type graph with d parameters
2: Let x_1, ..., x_n be observed data (possibly multivariate, F features)
3: Let K be the number of moments (auto-selected if not specified)
4: Let fixed be a set of (index, value) pairs for fixed parameters
5: Let d_free = d - |fixed| be the number of free parameters
6:
7: function MethodOfMoments(G, x, K, fixed, weighted)
8:   m_hat <- SampleMoments(x, K, F)              ▷ Eq. (23.1)
9:   if K = nil then
10:    if weighted then
11:      K <- SelectNrMoments(x, F, d_free)       ▷ Condition number pruning
12:    else
13:      K <- max(2 * d_free, 4) / F              ▷ Heuristic overdetermination
14:    end if
15:  end if
16:  Sigma <- EstimateMomentCovariance(x, K, F)   ▷ Eq. (23.5)
17:  mu(theta) <- ModelMoments(G, theta, K, F)    ▷ Eq. (23.2) via [15]
18:
19:  x0 <- InitialGuessGrid(mu, m_hat[1], d_free, fixed)
20:                                               ▷ Grid search on first moment
21:  r(theta_free) <- mu(Reconstruct(theta_free, fixed)) - m_hat
22:                                               ▷ Residual function
23:
24:  theta_1 <- LeastSquares(r, x0, bounds=(epsilon, inf))
25:                                               ▷ Step 1: unweighted
26:
27:  if weighted then
28:    T, method <- ComputeWeightTransform(Sigma) ▷ Definition 23.3
29:    if T != nil then
30:      r_w(theta_free) <- T * r(theta_free)     ▷ Weighted residual
31:      theta_hat <- LeastSquares(r_w, theta_1, bounds=(epsilon, inf))
32:                                               ▷ Step 2: GMM
33:    else
34:      theta_hat <- theta_1
35:    end if
36:  else
37:    theta_hat <- theta_1
38:  end if
39:
40:  if weighted and T != nil then
41:    J_w <- Jacobian(r_w, theta_hat)            ▷ Finite differences; J_w = T J
42:    Cov_theta <- pinv(J_w^T J_w)               ▷ Eq. (23.7)
43:  else
44:    J <- Jacobian(r, theta_hat)                ▷ Finite differences
45:    G_mat <- pinv(J); Cov_theta <- G_mat * Sigma * G_mat^T  ▷ Eq. (23.6), W = I
46:  end if
47:  se <- sqrt(diag(Cov_theta))
48:  prior_i <- GaussPrior(theta_hat_i, c * se_i) for free i
49:                                               ▷ c = std_multiplier
50:  return (theta_hat, se, prior)
51: end function

Correspondence table:

Pseudocode variable | Math symbol | Code variable (file:function)
G | G = (V, E) | graph (method_of_moments.py:method_of_moments)
x | x_1, \ldots, x_n | observed_data (method_of_moments.py:method_of_moments)
K | K (number of moments) | nr_moments (method_of_moments.py:method_of_moments)
F | F (number of features) | n_features (method_of_moments.py:method_of_moments)
d | d (parameter dimension) | theta_dim (method_of_moments.py:method_of_moments)
d_free | d_{\text{free}} | n_free (method_of_moments.py:method_of_moments)
m_hat | \hat{\mathbf{m}} | target_moments (method_of_moments.py:method_of_moments)
Sigma | \boldsymbol{\Sigma} | moment_cov (method_of_moments.py:method_of_moments)
mu | \boldsymbol{\mu}(\boldsymbol{\theta}) | moments_fn (method_of_moments.py:method_of_moments)
T | \mathbf{T} (weight transform) | T (method_of_moments.py:_compute_weight_transform)
x0 | initial guess | x0 (method_of_moments.py:method_of_moments)
r | residual function | residual_fn (method_of_moments.py:method_of_moments)
r_w | weighted residual | weighted_residual_fn (method_of_moments.py:method_of_moments)
theta_1 | step-1 estimate | result.x (first call) (method_of_moments.py:method_of_moments)
theta_hat | \hat{\boldsymbol{\theta}} | theta_full_opt (method_of_moments.py:method_of_moments)
J | \mathbf{J} (Jacobian) | result.jac (method_of_moments.py:method_of_moments)
J_w | \mathbf{J}_w = \mathbf{T}\mathbf{J} | result.jac (weighted call) (method_of_moments.py:method_of_moments)
Cov_theta | \operatorname{Cov}(\hat{\boldsymbol{\theta}}) | cov_theta (method_of_moments.py:method_of_moments)
se | standard errors | std_free / std_full (method_of_moments.py:method_of_moments)
c | std multiplier | std_multiplier (method_of_moments.py:method_of_moments)
prior | Gaussian priors | priors (method_of_moments.py:method_of_moments)

Complexity. Time: O(N_{\text{iter}} \cdot (C_{\text{mom}} + FK)), where N_{\text{iter}} is the number of least-squares iterations (at most 200 \cdot d_{\text{free}}), C_{\text{mom}} is the cost of one model-moment evaluation via [15], and FK is the cost of computing the residual. The initial grid search costs O(d_{\text{free}} \cdot |G_{\text{grid}}| \cdot C_{\text{mom}}) where |G_{\text{grid}}| = 20 is the grid size. The Cholesky factorization and pseudoinverse cost O((FK)^3) and O((FK)^2 d_{\text{free}}) respectively. Space: O((FK)^2) for the moment covariance matrix and O(d_{\text{free}}) for the parameter vector.

Correctness. By Theorem 23.1, the estimator is consistent under identification and finite moments. The two-step GMM procedure achieves efficiency: the first step provides a consistent (but possibly inefficient) estimate, and the second step uses the estimated optimal weighting \mathbf{W} = \hat{\boldsymbol{\Sigma}}^{-1} to minimize asymptotic variance. The delta method standard errors (Equation 23.7) provide valid confidence intervals asymptotically.
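
For concreteness, the two-step procedure can be sketched end to end for the exponential example of Definition 23.1, using scipy.optimize.least_squares (method "trf", the Trust Region Reflective solver) in place of the phasic internals. All names below are illustrative, and the model moments are hard-coded rather than computed via [15]:

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
x = rng.exponential(0.5, size=500)             # Exp(theta) data with theta = 2
m_hat = np.array([np.mean(x), np.mean(x**2)])  # sample moments, K = 2

def resid(theta):                              # mu(theta) - m_hat for Exp(theta)
    t = theta[0]
    return np.array([1.0 / t, 2.0 / t**2]) - m_hat

# Step 1: unweighted least squares (Trust Region Reflective, positive bounds)
step1 = least_squares(resid, x0=[1.0], bounds=(1e-10, np.inf), method="trf")

# Step 2: efficient GMM re-solve with T = inv(chol(Sigma)), Sigma from Eq. (23.5)
P = np.column_stack([x, x**2])
Sigma = np.cov(P, rowvar=False) / len(x)       # finite-sample moment covariance
T = np.linalg.inv(np.linalg.cholesky(Sigma))
step2 = least_squares(lambda th: T @ resid(th), step1.x,
                      bounds=(1e-10, np.inf), method="trf")

# Delta-method standard errors, Eq. (23.7) (1/n already absorbed into T)
J_w = step2.jac
se = np.sqrt(np.diag(np.linalg.pinv(J_w.T @ J_w)))
print(step2.x, se)                             # theta_hat near 2, with std error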

Numerical Considerations

Condition number pruning. When weighted=True and nr_moments is not specified, the function _select_nr_moments starts from a maximum number of moments and decreases until the moment covariance matrix has condition number below 10^3. This prevents ill-conditioning from high-order moments that have extreme variance, which would make the Cholesky factorization unstable and the weighted residuals unreliable.
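
A hypothetical sketch of this pruning loop (the function name and signature are illustrative, not the actual _select_nr_moments interface):

import numpy as np

def select_nr_moments(x, k_max, cond_threshold=1e3):
    # Decrease K until the moment covariance of [x, x^2, ..., x^K]
    # is well-conditioned (condition number below the threshold).
    for K in range(k_max, 1, -1):
        P = np.column_stack([x**k for k in range(1, K + 1)])
        Sigma = np.cov(P, rowvar=False) / len(x)
        if np.linalg.cond(Sigma) < cond_threshold:
            return K
    return 1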

Fallback chain for weighting. The weight transform computation (_compute_weight_transform) implements a three-level fallback: (1) full Cholesky of \boldsymbol{\Sigma} (if condition number < 10^{10} and all diagonal entries positive), (2) diagonal weighting T_{ii} = 1/\sqrt{\Sigma_{ii}} (if all variances are positive and finite), (3) identity (unweighted). This ensures the algorithm never fails due to a degenerate covariance matrix.
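
A sketch of this fallback chain (illustrative; the actual _compute_weight_transform interface may differ):

import numpy as np

def compute_weight_transform(Sigma, cond_limit=1e10):
    # Three-level fallback of Definition 23.3: Cholesky, then diagonal, then identity.
    diag = np.diag(Sigma)
    try:
        if np.linalg.cond(Sigma) < cond_limit and np.all(diag > 0):
            L = np.linalg.cholesky(Sigma)           # Sigma = L L^T
            return np.linalg.inv(L), "cholesky"     # full GMM weighting
    except np.linalg.LinAlgError:
        pass                                        # fall through to diagonal
    if np.all(np.isfinite(diag)) and np.all(diag > 0):
        return np.diag(1.0 / np.sqrt(diag)), "diagonal"
    return np.eye(len(Sigma)), "identity"           # unweighted fallback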

Positivity constraint. The Trust Region Reflective solver enforces \theta_j > \varepsilon with \varepsilon = 10^{-10} via bound constraints. This avoids numerical issues with zero or negative rates while being effectively unconstrained for practical parameter values.

Standard error floor. After the delta method, any standard error that is zero, negative, or non-finite for a free parameter is replaced by 0.5 \cdot \max(|\hat{\theta}_j|, 10^{-4}). This ensures that priors always have a reasonable spread.
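
A sketch of the flooring rule (illustrative helper, not the actual implementation):

import numpy as np

def floor_standard_errors(theta_hat, se):
    # Replace zero, negative, or non-finite standard errors by
    # 0.5 * max(|theta_j|, 1e-4) so priors keep a usable spread.
    theta_hat = np.asarray(theta_hat, dtype=float)
    se = np.asarray(se, dtype=float).copy()
    bad = ~np.isfinite(se) | (se <= 0)
    se[bad] = 0.5 * np.maximum(np.abs(theta_hat[bad]), 1e-4)
    return se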

Multivariate data. The implementation accepts SparseObservations for multivariate data with potentially different numbers of observations per feature. The moment covariance is block-diagonal (features are independent), with each block estimated from the valid observations of that feature.
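
A sketch of the block-diagonal assembly (illustrative; plain per-feature arrays stand in for the actual SparseObservations container):

import numpy as np
from scipy.linalg import block_diag

def estimate_moment_covariance(features, K):
    # One K x K covariance block per feature (Eq. 23.5), assembled
    # block-diagonally. `features` is a list of 1-D arrays of possibly
    # different lengths; NaN entries mark missing observations.
    blocks = []
    for xf in features:
        xf = xf[np.isfinite(xf)]                  # keep valid observations only
        P = np.column_stack([xf**k for k in range(1, K + 1)])
        blocks.append(np.cov(P, rowvar=False) / len(xf))
    return block_diag(*blocks)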

Implementation Notes

Source code mapping:

Algorithm | File | Function | Lines
Algorithm 23.1 (entry) | src/phasic/method_of_moments.py | method_of_moments | L205–L590
Weight transform | src/phasic/method_of_moments.py | _compute_weight_transform | L56–L88
Nr moments selection | src/phasic/method_of_moments.py | _select_nr_moments | L91–L119
Moment covariance | src/phasic/method_of_moments.py | _estimate_moment_covariance | L122–L149
Initial guess grid | src/phasic/method_of_moments.py | _initial_guess_grid | L168–L202
Theta reconstruction | src/phasic/method_of_moments.py | _reconstruct_theta | L152–L165

Deviations from pseudocode:

  • The pseudocode shows the Jacobian \mathbf{J} computed as a separate step. The implementation obtains it as result.jac from scipy.optimize.least_squares, which computes the Jacobian by finite differences as part of the Trust Region Reflective algorithm.
  • The model moments function moments_fn is constructed differently depending on whether rewards are 1D, 2D, or absent. The pseudocode abstracts this as a single ModelMoments call.
  • The MoMResult dataclass also stores sample_moments, model_moments, and weighted flag for diagnostic purposes, not shown in the pseudocode.
  • Fixed parameters are handled via _reconstruct_theta, which inserts fixed values into the full parameter vector at each residual evaluation. The solver only optimizes over free parameters; a sketch of the reconstruction follows this list.
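
A sketch of the reconstruction step (illustrative, not the actual _reconstruct_theta code):

import numpy as np

def reconstruct_theta(theta_free, fixed, d):
    # Scatter the optimized free entries around the fixed (index, value)
    # pairs to form the full length-d parameter vector.
    theta = np.empty(d)
    fixed_map = dict(fixed)
    free_vals = iter(theta_free)
    for j in range(d):
        theta[j] = fixed_map[j] if j in fixed_map else next(free_vals)
    return theta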

Symbol Index

Symbol | Name | First appearance
\hat{\boldsymbol{\theta}}_n | MoM estimator | Theorem 23.1
d_{\text{free}} | Number of free parameters | Algorithm 23.1
F | Number of features | Definition 23.1
\mathbf{G} | Generalized inverse mapping | Theorem 23.1
\hat{\mathbf{m}} | Sample moment vector | Definition 23.1
\hat{m}_{f,k} | Sample k-th moment for feature f | Definition 23.1
\mathbf{J} | Jacobian of moment function | Theorem 23.1
\mathbf{J}_w | Weighted Jacobian \mathbf{T}\mathbf{J} | Theorem 23.1
K | Number of moments matched | Definition 23.1
\mathbf{L} | Cholesky factor of \boldsymbol{\Sigma} | Definition 23.3
\boldsymbol{\mu}(\boldsymbol{\theta}) | Model moment vector | Definition 23.1
\mu_{f,k}(\boldsymbol{\theta}) | Model k-th moment for feature f | Definition 23.1
n_f | Number of observations for feature f | Definition 23.1
\boldsymbol{\Sigma} | Moment covariance matrix | Definition 23.3
\boldsymbol{\Sigma}_f | Moment covariance block for feature f | Definition 23.3
\mathbf{T} | Weight transform matrix | Definition 23.2
\mathbf{W} | Weight matrix \mathbf{T}^\top \mathbf{T} | Definition 23.2