24 Probability Matching Estimation
Introduction
This file formalizes probability matching estimation for phase-type models with discrete joint probability structure. Unlike moment matching ([21]), which operates on continuous observations and their moments, probability matching targets models where observations are categorical (vertex indices in a joint probability graph) and the model produces a probability table over terminal vertices. The estimator minimizes the distance between the empirical probability distribution (from observed frequency counts) and the model probability distribution (from normalized expected sojourn times), providing parameter estimates and priors for subsequent Bayesian inference.
In the phasic pipeline, probability matching applies to joint probability graphs constructed via Graph.joint_prob_graph(), where each observation corresponds to a terminal vertex index (e.g., a mutation count tuple in population genetics). The model probabilities are computed via the forward algorithm ([14]) and normalized over all terminal vertices. The estimation is cast as a nonlinear least-squares problem with multinomial covariance for standard error estimation via the delta method (van der Vaart 1998).
Prerequisites: [01], [14]
Source files:
src/phasic/probability_matching.py (functions: probability_matching, _estimate_multinomial_covariance, _initial_guess_grid_probs)
Definitions
Definition 24.1 (Empirical and Model PMF) Let y_1, \ldots, y_n be i.i.d. observations, each taking values in a finite set of terminal vertex indices \mathcal{V}_{\text{term}} = \{v_{t_1}, \ldots, v_{t_J}\} \subset V. Let \mathcal{V}_{\text{obs}} = \{v_{t_{i_1}}, \ldots, v_{t_{i_K}}\} \subseteq \mathcal{V}_{\text{term}} be the set of K distinct observed vertex indices, with counts n_k = \#\{i : y_i = v_{t_{i_k}}\}. The empirical PMF is
\hat{p}_k = \frac{n_k}{n}, \quad k = 1, \ldots, K. \tag{24.1}
The model PMF at parameter \boldsymbol{\theta} is computed from the normalized expected sojourn times at terminal vertices:
p_k(\boldsymbol{\theta}) = \frac{x_{v_{t_{i_k}}}(\boldsymbol{\theta})}{\sum_{j=1}^{J} x_{v_{t_j}}(\boldsymbol{\theta})}, \quad k = 1, \ldots, K, \tag{24.2}
where x_v(\boldsymbol{\theta}) is the expected sojourn time at vertex v under parameter \boldsymbol{\theta}, computed via the forward algorithm ([14]).
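Equations (24.1) and (24.2) can be sketched in a few lines of NumPy. The `sojourn` callable below is a hypothetical stand-in for the per-vertex forward-algorithm evaluation of [14]; the observation values are toy data.

```python
import numpy as np

# Empirical PMF (24.1): relative frequencies of observed terminal-vertex indices.
y = np.array([3, 1, 3, 2, 3, 1, 1, 2, 3, 3])     # toy observed vertex indices
vertices, counts = np.unique(y, return_counts=True)
p_hat = counts / counts.sum()                     # \hat{p}_k = n_k / n

# Model PMF (24.2): sojourn times at observed vertices, normalized over ALL
# terminal vertices (not just the observed ones).
def model_pmf(theta, observed, all_terminal, sojourn):
    x_obs = np.array([sojourn(v, theta) for v in observed])
    x_all = np.array([sojourn(v, theta) for v in all_terminal])
    return x_obs / x_all.sum()
```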
Intuition
The empirical PMF is simply the observed relative frequency of each vertex index in the data. The model PMF is the theoretical probability of each vertex under the phase-type model, obtained by normalizing the expected sojourn times (which are proportional to absorption probabilities in the joint probability graph construction). Probability matching finds the parameter that makes these two distributions as close as possible.
Example
In a coalescent model with mutations, terminal vertices represent site frequency spectrum (SFS) categories. If n = 1000 loci are observed with counts (n_1, n_2, n_3) = (500, 350, 150) for singletons, doubletons, and tripletons, then \hat{\mathbf{p}} = (0.5, 0.35, 0.15). The model predicts \mathbf{p}(\theta) = (0.52, 0.30, 0.18) for some \theta, and probability matching adjusts \theta to reduce this discrepancy.
Definition 24.2 (Probability Matching Objective) The probability matching objective is the nonlinear least-squares problem
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} > 0} \sum_{k=1}^{K} \left( p_k(\boldsymbol{\theta}) - \hat{p}_k \right)^2. \tag{24.3}
The constraint \boldsymbol{\theta} > 0 ensures positive rates. This is equivalent to minimizing \lVert \mathbf{p}(\boldsymbol{\theta}) - \hat{\mathbf{p}} \rVert_2^2, where \mathbf{p}(\boldsymbol{\theta}) = (p_1(\boldsymbol{\theta}), \ldots, p_K(\boldsymbol{\theta}))^\top and \hat{\mathbf{p}} = (\hat{p}_1, \ldots, \hat{p}_K)^\top.
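A minimal sketch of solving (24.3) with SciPy's Trust Region Reflective solver, using a toy normalized-weight model in place of the forward-algorithm-based p(theta):

```python
import numpy as np
from scipy.optimize import least_squares

p_hat = np.array([0.5, 0.35, 0.15])    # empirical PMF from Definition 24.1

# Toy model PMF: theta-dependent weights normalized to sum to one.
# Stands in for the sojourn-time-based p(theta) of Equation (24.2).
def model_pmf(theta):
    w = np.array([theta[0], theta[1], 1.0])
    return w / w.sum()

def residual(theta):
    return model_pmf(theta) - p_hat     # r(theta), Equation (24.3)

# Positivity bounds mirror the theta > 0 constraint of the objective.
fit = least_squares(residual, x0=np.ones(2), bounds=(1e-10, np.inf), method="trf")
theta_hat = fit.x
```

`fit.jac` exposes the Jacobian at the optimum, which the delta-method step of the algorithm reuses for standard errors.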
Intuition
The objective measures the squared Euclidean distance between the model and empirical probability vectors. Unlike maximum likelihood (which uses the multinomial log-likelihood), least-squares probability matching is computationally simpler and provides a natural residual structure for standard error estimation. For well-specified models both estimators are consistent, and the suitably weighted (minimum chi-square) variant is asymptotically equivalent to maximum likelihood.
Definition 24.3 (Multinomial Covariance) Under the multinomial model (n_1, \ldots, n_K) \sim \operatorname{Multinomial}(n, \hat{p}_1, \ldots, \hat{p}_K), the covariance of the empirical proportions is
\operatorname{Cov}(\hat{p}_i, \hat{p}_j) = \frac{1}{n} \begin{cases} \hat{p}_i(1 - \hat{p}_i) & \text{if } i = j, \\ -\hat{p}_i \hat{p}_j & \text{if } i \neq j. \end{cases} \tag{24.4}
In matrix form, \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} = n^{-1}(\operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}} \hat{\mathbf{p}}^\top).
Intuition
The multinomial covariance captures the sampling variability of the empirical proportions. Diagonal entries are the variance of each proportion (largest when \hat{p}_k \approx 0.5, smallest when \hat{p}_k is near 0 or 1). Off-diagonal entries are negative: if one proportion is above its expectation, others must be below (since proportions sum to 1). The 1/n factor reflects that variability decreases with sample size. This covariance is used in the delta method to propagate uncertainty from \hat{\mathbf{p}} to \hat{\boldsymbol{\theta}}.
Example
For K = 3 categories with \hat{\mathbf{p}} = (0.5, 0.35, 0.15) and n = 100: \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} = \frac{1}{100}\begin{pmatrix} 0.25 & -0.175 & -0.075 \\ -0.175 & 0.2275 & -0.0525 \\ -0.075 & -0.0525 & 0.1275 \end{pmatrix}.
Theorems and Proofs
Theorem 24.1 (Multinomial Covariance Is Exact for Independent Observations) Let y_1, \ldots, y_n be i.i.d. draws from a categorical distribution with probabilities \mathbf{p}^* = (p_1^*, \ldots, p_K^*), and let \hat{p}_k = n_k / n where n_k = \sum_{i=1}^{n} \mathbb{1}_{\{y_i = k\}}. Then:
(i) \mathbb{E}[\hat{p}_k] = p_k^* (unbiasedness).
(ii) \operatorname{Cov}(\hat{p}_i, \hat{p}_j) = n^{-1}(\delta_{ij} p_i^* - p_i^* p_j^*), where \delta_{ij} is the Kronecker delta.
(iii) The plug-in estimator \hat{\boldsymbol{\Sigma}}_{\hat{\mathbf{p}}} from Definition 24.3 (replacing p_k^* with \hat{p}_k) is a consistent estimator of \operatorname{Cov}(\hat{\mathbf{p}}).
Proof.
Since each y_i is an independent draw, n_k = \sum_{i=1}^{n} \mathbb{1}_{\{y_i = k\}} is a sum of i.i.d. Bernoulli(p_k^*) random variables. Thus \mathbb{E}[n_k] = n p_k^* and \mathbb{E}[\hat{p}_k] = p_k^*.
For the diagonal, \operatorname{Var}(n_k) = n p_k^*(1 - p_k^*) (binomial variance), so \operatorname{Var}(\hat{p}_k) = n^{-2} \operatorname{Var}(n_k) = n^{-1} p_k^*(1 - p_k^*). For the off-diagonal (i \neq j), consider a single observation: \operatorname{Cov}(\mathbb{1}_{\{y = i\}}, \mathbb{1}_{\{y = j\}}) = \mathbb{E}[\mathbb{1}_{\{y = i\}} \mathbb{1}_{\{y = j\}}] - p_i^* p_j^* = 0 - p_i^* p_j^* (since y cannot equal both i and j). Summing over n independent observations: \operatorname{Cov}(n_i, n_j) = -n p_i^* p_j^*, so \operatorname{Cov}(\hat{p}_i, \hat{p}_j) = -n^{-1} p_i^* p_j^*. Combining both cases gives Equation 24.4 with p_k^* in place of \hat{p}_k.
By the strong law of large numbers, \hat{p}_k \xrightarrow{\text{a.s.}} p_k^* for each k. Since the covariance formula (Equation 24.4) is a continuous function of \hat{p}_k, the continuous mapping theorem gives \hat{\boldsymbol{\Sigma}}_{\hat{\mathbf{p}}} \xrightarrow{\text{a.s.}} \boldsymbol{\Sigma}_{\hat{\mathbf{p}}}, establishing consistency. \square
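The closed form in Theorem 24.1(ii) is easy to check by Monte Carlo; a small sketch comparing the sample covariance of \hat{\mathbf{p}} across repeated multinomial draws with n^{-1}(\operatorname{diag}(\mathbf{p}^*) - \mathbf{p}^* \mathbf{p}^{*\top}):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = np.array([0.5, 0.35, 0.15])
n, reps = 200, 20000

# Repeated multinomial draws of size n; each row of `samples` is one p_hat.
samples = rng.multinomial(n, p_star, size=reps) / n
emp_cov = np.cov(samples, rowvar=False)

# Closed form from Theorem 24.1(ii).
theory = (np.diag(p_star) - np.outer(p_star, p_star)) / n
```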
Algorithms
Algorithm 24.1: Probability Matching Estimation
Description. Estimates the parameter vector \boldsymbol{\theta} of a joint probability graph by matching model probabilities to empirical probabilities. The algorithm: (1) tabulates observed vertex indices and computes empirical probabilities, (2) constructs a model probability function that evaluates normalized sojourn times via the forward algorithm, (3) finds an initial guess by coordinate-wise grid search matching the most common observation’s probability, (4) solves the least-squares problem (Equation 24.3), and (5) computes standard errors via the delta method using the multinomial covariance from Theorem 24.1.
Probability Matching Estimation
1: Let G = (V, E) be a parameterized joint probability graph with d parameters
2: Let y_1, ..., y_n be observed vertex indices (integers)
3: Let fixed be a set of (index, value) pairs for fixed parameters
4: Let d_free = d - |fixed| be the number of free parameters
5:
6: function ProbabilityMatching(G, y, fixed)
6: function ProbabilityMatching(G, y, fixed)
7:     (v_1, ..., v_K), (n_1, ..., n_K) <- Unique(y)
8:     ▷ K distinct observations
9:     p_hat_k <- n_k / n for k = 1, ..., K    ▷ Eq. (24.1)
10:    V_term <- AllTerminalVertices(G)    ▷ For normalization
11:
12:    function p(theta)    ▷ Model PMF
13:        x_obs <- ForwardAlgorithm(G, theta, (v_1, ..., v_K))
14:        ▷ Sojourn times via [14]
15:        x_all <- ForwardAlgorithm(G, theta, V_term)
16:        return x_obs / sum(x_all)    ▷ Eq. (24.2)
17:    end function
18:
19:    x0 <- InitialGuessGridProbs(p, max(p_hat), d_free, fixed)
20:    ▷ Match most common observation
21:
22:    r(theta_free) <- p(Reconstruct(theta_free, fixed)) - p_hat
23:    ▷ Residual function
24:
25:    theta_hat <- LeastSquares(r, x0, bounds=(epsilon, inf))
26:    ▷ Eq. (24.3), TRF solver
27:
28:    J <- Jacobian(r, theta_hat)    ▷ From solver
29:    Sigma_p <- MultinomialCovariance(p_hat, n)    ▷ Eq. (24.4)
30:    G_mat <- pinv(J)
31:    Cov_theta <- G_mat * Sigma_p * G_mat^T    ▷ Delta method
32:    se <- sqrt(diag(Cov_theta))
33:    prior_i <- GaussPrior(theta_hat_i, c * se_i) for free i
34:    ▷ c = std_multiplier
35:    return (theta_hat, se, prior, p_hat, p(theta_hat))
36: end function
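The delta-method tail of the algorithm (pseudocode lines 28–32) can be sketched directly in NumPy. The Jacobian `J` below is a made-up 3×2 example standing in for the solver's output; its columns sum to zero, as they must when the model probabilities sum to one.

```python
import numpy as np

# Delta method: propagate the multinomial covariance of p_hat through the
# pseudoinverse of the residual Jacobian to get Cov(theta_hat).
def delta_method_se(J, p_hat, n):
    sigma_p = (np.diag(p_hat) - np.outer(p_hat, p_hat)) / n   # Eq. (24.4)
    G = np.linalg.pinv(J)                                      # G = J^+
    cov_theta = G @ sigma_p @ G.T
    return np.sqrt(np.diag(cov_theta))

# Illustrative 3x2 Jacobian (K = 3 categories, d_free = 2 parameters).
J = np.array([[0.3, -0.1],
              [-0.2, 0.4],
              [-0.1, -0.3]])
se = delta_method_se(J, np.array([0.5, 0.35, 0.15]), n=1000)
```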
Correspondence table:
| Pseudocode variable | Math symbol | Code variable (file:function) |
|---|---|---|
| G | G = (V, E) | `graph` (probability_matching.py:probability_matching) |
| y | y_1, \ldots, y_n | `observed_indices` (probability_matching.py:probability_matching) |
| d | d (parameter dimension) | `theta_dim` (probability_matching.py:probability_matching) |
| d_free | d_{\text{free}} | `n_free` (probability_matching.py:probability_matching) |
| K | K (distinct observations) | `n_unique` (probability_matching.py:probability_matching) |
| n | n (total observations) | `n_obs` (probability_matching.py:probability_matching) |
| p_hat | \hat{\mathbf{p}} | `empirical_probs` (probability_matching.py:probability_matching) |
| V_term | \mathcal{V}_{\text{term}} | `all_terminal_indices_arr` (probability_matching.py:probability_matching) |
| p | \mathbf{p}(\boldsymbol{\theta}) | `probs_fn` (probability_matching.py:probability_matching) |
| x_obs | sojourn times at observed vertices | `sojourn_obs` (probability_matching.py:probs_fn) |
| x_all | sojourn times at all terminals | `sojourn_all` (probability_matching.py:probs_fn) |
| x0 | initial guess | `x0` (probability_matching.py:probability_matching) |
| r | residual function | `residual_fn` (probability_matching.py:probability_matching) |
| theta_hat | \hat{\boldsymbol{\theta}} | `theta_full_opt` (probability_matching.py:probability_matching) |
| J | \mathbf{J} (Jacobian) | `result.jac` (probability_matching.py:probability_matching) |
| Sigma_p | \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} | `multinomial_cov` (probability_matching.py:probability_matching) |
| G_mat | \mathbf{G} = \mathbf{J}^+ | `G` (probability_matching.py:probability_matching) |
| Cov_theta | \operatorname{Cov}(\hat{\boldsymbol{\theta}}) | `cov_theta` (probability_matching.py:probability_matching) |
| se | standard errors | `std_free` / `std_full` (probability_matching.py:probability_matching) |
| c | std multiplier | `std_multiplier` (probability_matching.py:probability_matching) |
Complexity. Time: O(N_{\text{iter}} \cdot (C_{\text{fwd}} \cdot (K + J) + K)), where N_{\text{iter}} is the number of least-squares iterations (at most 200 \cdot d_{\text{free}}), C_{\text{fwd}} is the cost of one forward algorithm evaluation per vertex ([14]), K is the number of distinct observations, and J is the total number of terminal vertices (for normalization). The initial grid search costs O(d_{\text{free}} \cdot 30 \cdot C_{\text{fwd}} \cdot (K + J)). The multinomial covariance computation and delta method cost O(K^2) and O(K^2 \cdot d_{\text{free}}) respectively. Space: O(K^2) for the multinomial covariance matrix plus O(J) for the terminal vertex array.
Numerical Considerations
Normalization. The model probabilities (Equation 24.2) require normalization over all terminal vertices, not just the observed ones. Unobserved terminal vertices may have non-zero probability under the model, and excluding them would bias the probabilities upward. The implementation identifies all terminal vertices (vertices with no outgoing edges that are reachable from vertices that do have outgoing edges) and computes their sojourn times for normalization.
Positivity constraint. As in [21], the Trust Region Reflective solver enforces \theta_j > 10^{-10} via bound constraints, preventing numerical issues with zero rates.
Singular Jacobian. When K > d_{\text{free}} (more probability equations than parameters), the Jacobian \mathbf{J} is tall and the pseudoinverse \mathbf{J}^+ is used. When K \leq d_{\text{free}} (underdetermined), the pseudoinverse provides the minimum-norm solution. If the pseudoinverse computation fails (due to numerically singular \mathbf{J}), the fallback heuristic \text{se}_j = 0.5 |\hat{\theta}_j| is used.
Standard error floor. As in [21], standard errors for free parameters are floored at 0.5 \cdot \max(|\hat{\theta}_j|, 10^{-4}) to ensure priors always have non-degenerate spread.
Multinomial covariance singularity. The multinomial covariance matrix \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} is always singular (its rank is K - 1 since the probabilities sum to 1). The pseudoinverse in the delta method handles this automatically: \mathbf{G} = \mathbf{J}^+ operates on the column space of \mathbf{J}, which typically has rank \min(K, d_{\text{free}}).
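The rank deficiency is easy to verify numerically: every row of \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} sums to zero because the proportions sum to one, so the all-ones vector spans the null space.

```python
import numpy as np

# Multinomial covariance for K = 3 categories (Equation 24.4).
p_hat = np.array([0.5, 0.35, 0.15])
sigma = (np.diag(p_hat) - np.outer(p_hat, p_hat)) / 100

rank = np.linalg.matrix_rank(sigma)   # K - 1, not K
row_sums = sigma.sum(axis=1)          # numerically zero: sigma @ 1 = 0
```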
Implementation Notes
Source code mapping:
| Algorithm | File | Function | Lines |
|---|---|---|---|
| Algorithm 24.1 (entry) | src/phasic/probability_matching.py | `probability_matching` | L142–L357 |
| Multinomial covariance | src/phasic/probability_matching.py | `_estimate_multinomial_covariance` | L59–L85 |
| Initial guess grid | src/phasic/probability_matching.py | `_initial_guess_grid_probs` | L88–L139 |
| Theta reconstruction | src/phasic/method_of_moments.py | `_reconstruct_theta` | L152–L165 |
Deviations from pseudocode:
- The model probability function `probs_fn` uses `compute_sojourn_times_ffi` from the FFI wrappers rather than calling the forward algorithm directly. This leverages the trace cache for efficient repeated evaluation with different parameter vectors.
- The terminal vertex identification traverses the graph structure once to find all vertices that have at least one edge leading to a vertex with no outgoing edges. The pseudocode abstracts this as `AllTerminalVertices(G)`.
- The `ProbMatchResult` dataclass stores `empirical_probs`, `model_probs`, and `unique_indices` for diagnostic comparison, not shown in the pseudocode return.
- The initial guess grid (`_initial_guess_grid_probs`) matches the maximum model probability to the maximum empirical probability, rather than matching specific vertex probabilities. This is a robust heuristic that works regardless of which vertex is most common.
- The `_reconstruct_theta` function is imported from `method_of_moments.py`, sharing the fixed-parameter insertion logic between both estimation methods.
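A hedged sketch of the fixed-parameter insertion that `_reconstruct_theta` performs; the actual signature in `method_of_moments.py` may differ, and `fixed` is assumed here to be a dict mapping parameter index to fixed value.

```python
import numpy as np

# Interleave free parameter values with fixed (index, value) pairs to
# rebuild the full length-d parameter vector.
def reconstruct_theta(theta_free, fixed, d):
    theta = np.empty(d)
    free_iter = iter(theta_free)
    for j in range(d):
        theta[j] = fixed[j] if j in fixed else next(free_iter)
    return theta

# Two free values, parameter 1 fixed at 0.5 -> full vector [2.0, 0.5, 3.0].
full = reconstruct_theta(np.array([2.0, 3.0]), {1: 0.5}, d=3)
```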
Symbol Index
| Symbol | Name | First appearance |
|---|---|---|
| d_{\text{free}} | Number of free parameters | Algorithm 24.1 |
| \hat{\boldsymbol{\theta}} | Probability matching estimator | Definition 24.2 |
| \mathbf{G} | Pseudoinverse of Jacobian \mathbf{J}^+ | Algorithm 24.1 |
| \hat{\mathbf{p}} | Empirical PMF vector | Definition 24.1 |
| \hat{p}_k | Empirical probability of category k | Definition 24.1 |
| \mathbf{J} | Jacobian of model PMF | Algorithm 24.1 |
| K | Number of distinct observed categories | Definition 24.1 |
| J | Total number of terminal vertices | Definition 24.1 |
| n | Number of observations | Definition 24.1 |
| n_k | Count of category k | Definition 24.1 |
| \mathbf{p}(\boldsymbol{\theta}) | Model PMF vector | Definition 24.1 |
| p_k(\boldsymbol{\theta}) | Model probability of category k | Definition 24.1 |
| \boldsymbol{\Sigma}_{\hat{\mathbf{p}}} | Multinomial covariance matrix | Definition 24.3 |
| \mathcal{V}_{\text{obs}} | Set of observed vertex indices | Definition 24.1 |
| \mathcal{V}_{\text{term}} | Set of all terminal vertex indices | Definition 24.1 |
| x_v(\boldsymbol{\theta}) | Expected sojourn time at vertex v | Definition 24.1 |